Indexing compressed text

Published:2005-07 Issue:4 Volume:52 Page:552-581
ISSN:0004-5411
Container-title:Journal of the ACM
language:en
Short-container-title:J. ACM

Author:

Ferragina Paolo¹, Manzini Giovanni²

Affiliation:

1. Università di Pisa, Pisa, Italy

2. Università del Piemonte Orientale, Alessandria, Italy

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.

Our first compressed data structure retrieves the $occ$ occurrences of a pattern $P[1,p]$ within a text $T[1,n]$ in $O(p + occ \log^{1+\varepsilon} n)$ time for any chosen $\varepsilon$, $0<\varepsilon<1$. This data structure uses at most $5nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$th order empirical entropy of $T$. The space usage is $\Theta(n)$ bits in the worst case and $o(n)$ bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array.

Our second compressed data structure achieves $O(p + occ)$ query time using $O(nH_k(T)\log^{\varepsilon} n) + o(n)$ bits of storage for any chosen $\varepsilon$, $0<\varepsilon<1$. Therefore, it provides optimal output-sensitive query time using $o(n \log n)$ bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
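The relationship between suffix arrays and the Burrows-Wheeler Transform mentioned in the abstract can be illustrated with a small backward-search sketch. This is a naive, uncompressed illustration and not the paper's data structure: the suffix array is built by plain sorting, the rank table is stored in full, and nothing here attains the stated space or time bounds. The function names (`build_index`, `backward_search`, `find_occurrences`) and the `$` sentinel are assumptions chosen for this example.

```python
def build_index(text, sentinel="$"):
    """Build a suffix array, a C table and a rank table for text + sentinel.

    Assumption for this sketch: the sentinel does not occur in the text.
    """
    assert sentinel not in text
    t = text + sentinel
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # BWT[i] is the character just before the i-th smallest suffix
    # (t[-1] wraps around to the sentinel when the suffix starts at 0).
    bwt = "".join(t[i - 1] for i in sa)

    # C[c] = number of characters in t that are strictly smaller than c.
    counts = {}
    for ch in t:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # rank[c][i] = number of occurrences of c in bwt[:i] (stored uncompressed).
    rank = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in rank:
            rank[ch][i + 1] = rank[ch][i] + (1 if b == ch else 0)
    return sa, C, rank


def backward_search(pattern, C, rank, n):
    """Return the half-open suffix array range [lo, hi) of suffixes
    that start with pattern, scanning the pattern right to left."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + rank[ch][lo]
        hi = C[ch] + rank[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_occurrences(text, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    sa, C, rank = build_index(text)
    lo, hi = backward_search(pattern, C, rank, len(sa))
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    print(find_occurrences("abracadabra", "abra"))
```

The search loop performs one step per pattern character, which corresponds to the $p$ term in the query bounds. In this sketch the positions are read directly from the uncompressed suffix array; a space-efficient index cannot afford to store that array in full, which is where the per-occurrence cost in the first bound comes from.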

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software

Link

https://dl.acm.org/doi/pdf/10.1145/1082036.1082039

Reference: 44 articles.

1. A locally adaptive data compression scheme

2. Membership in Constant Time and Almost-Minimum Space

Cited by 453 articles.

1. Tackling Challenges in Implementing Large-Scale Graph Databases;Communications of the ACM;2024-08

2. Executing Ad-Hoc Queries on Large Geospatial Data Sets Without Acceleration Structures;SN Computer Science;2024-06-13

3. An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length $k$ Between Genome Pairs;2024 IEEE 12th International Conference on Healthcare Informatics (ICHI);2024-06-03

4. r-indexing the eBWT;Information and Computation;2024-06

5. Space-Efficient Indexes for Uncertain Strings;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13