SIMD-based inverted index compression algorithms

Hongfei Yan; Xudong Zhang; Dongdong Shan; Xianling Mao; Xin Zhao

doi:10.7544/issn1000-1239.2015.20131548

SIMD-based inverted index compression algorithms

Hongfei Yan, Xudong Zhang, Dongdong Shan, Xianling Mao, Xin Zhao

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

Original language	English
Pages (from-to)	995-1004
Number of pages	10
Journal	Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume	52
Issue number	5
DOIs	https://doi.org/10.7544/issn1000-1239.2015.20131548
Publication status	Published - 1 May 2015

Keywords

Compression
Information retrieval
Integer encoding
Inverted index
Single instruction multiple data(SIMD)

Access to Document

10.7544/issn1000-1239.2015.20131548

Cite this

@article{f1fcf10b0f704bbaadafc3fcd514da13,

title = "SIMD-based inverted index compression algorithms",

abstract = "The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.",

keywords = "Compression, Information retrieval, Integer encoding, Inverted index, Single instruction multiple data(SIMD)",

author = "Hongfei Yan and Xudong Zhang and Dongdong Shan and Xianling Mao and Xin Zhao",

year = "2015",

month = may,

day = "1",

doi = "10.7544/issn1000-1239.2015.20131548",

language = "English",

volume = "52",

pages = "995--1004",

journal = "Jisuanji Yanjiu yu Fazhan/Computer Research and Development",

issn = "1000-1239",

publisher = "Science China Press",

number = "5",

}

TY - JOUR

T1 - SIMD-based inverted index compression algorithms

AU - Yan, Hongfei

AU - Zhang, Xudong

AU - Shan, Dongdong

AU - Mao, Xianling

AU - Zhao, Xin

PY - 2015/5/1

Y1 - 2015/5/1

N2 - The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

AB - The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

KW - Compression

KW - Information retrieval

KW - Integer encoding

KW - Inverted index

KW - Single instruction multiple data(SIMD)

UR - http://www.scopus.com/inward/record.url?scp=84931091606&partnerID=8YFLogxK

U2 - 10.7544/issn1000-1239.2015.20131548

DO - 10.7544/issn1000-1239.2015.20131548

M3 - Article

AN - SCOPUS:84931091606

SN - 1000-1239

VL - 52

SP - 995

EP - 1004

JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development

JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development

IS - 5

ER -

SIMD-based inverted index compression algorithms

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this