Al fine di migliorare la tua esperienza di navigazione, questo sito utilizza i cookie di profilazione di terze parti. Chiudendo questo banner o accedendo ad un qualunque elemento sottostante acconsenti all’uso dei cookie.

At the Roots of Dictionary Compression: String Attractors

14 novembre 2019 ore 11:30 - 13:00

Aula 310, Sede di Viale Romania, 32

Speaker: Nicola Prezza, Università di Pisa

Abstract:

A well-known fact in the field of lossless text compression is thathigh-order entropy is a weak model when the input contains longrepetitions. Motivated by this fact, decades of research have generatedmyriads of so-called dictionary compressors: algorithms able toreduce the text’s size by exploiting its repetitiveness. Lempel-Ziv77 is one of the most successful and well-known tools of this kind,followed by straight-line programs, run-length Burrows-Wheelertransform, macro schemes, collage systems, and the compact directedacyclic word graph. In this paper, we show that these techniquesare different solutions to the same, elegant, combinatorialproblem: to find a small set of positions capturing all distinct text’ssubstrings.We call such a set a string attractor.We first show reductionsbetween dictionary compressors and string attractors. Thisgives the approximation ratios of dictionary compressors with respectto the smallest string attractor and allows us to uncover newasymptotic relations between the output sizes of different dictionarycompressors. We then show that the k-attractor problem —deciding whether a text has a size-t set of positions capturing allsubstrings of length at most k — is NP-complete for k ≥ 3. This,in particular, includes the full string attractor problem. We provideseveral approximation techniques for the smallest k-attractor,show that the problem is APX-complete for constant k, and givestrong inapproximability results. To conclude, we provide matchinglower and upper bounds for the random access problem on stringattractors. The upper bound is proved by showing a data structuresupporting queries in optimal time. Our data structure is universal:by our reductions to string attractors, it supports random access onany dictionary-compression scheme. In particular, it matches thelower bound also on LZ77, straight-line programs, collage systems,and macro schemes, and therefore essentially closes (at once) therandom access problem for all these compressors.