Speaker: Nicola Prezza, Università di Pisa
The field of data structures for text processing changed fundamentally at the beginning of the millennium, when several ground-breaking works independently showed that computation can be performed directly over compressed text representations (that is, without first decompressing the file). Those findings have had profound implications in many fields, most notably biology and medicine: nowadays, if someone sends their genome to be analyzed, the analysis will almost certainly be performed using software based on compressed data structures. Those early results, however, were based on rather weak compressors; in particular, they were not able to effectively compress very repetitive texts. These limitations have become apparent only recently: thanks to dramatic advances in sequencing technology, we now have datasets of tens of thousands of genomes, which cannot be effectively compressed using those methods. Recent research has therefore focused on more powerful compressors such as Lempel-Ziv 77 (the algorithm at the basis of WinZip). This, however, had an unwanted side effect: separate solutions were developed for each of those compressors, leading to dozens of different compressed data structures. In this talk, I will introduce a successful new theory, string attractors, which shows that all known dictionary compressors are deeply related to each other. In particular, it makes it possible to build compressed data structures that are universal, in the sense that they can be built on top of any known compressor. The results at the core of this new theory were presented at STOC 2018 and have opened a promising new research area which already counts six further publications on the topic.
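To make the central definition concrete (this is only the definition from the paper, not its algorithms or data structures): a set of positions Γ is a string attractor of a string S if every substring of S has at least one occurrence that spans a position of Γ. A brute-force Python sketch of this check, with hypothetical function and variable names:

```python
def is_string_attractor(s, gamma):
    """Brute-force check of the string-attractor definition:
    gamma (a set of 0-based positions in s) is a string attractor of s
    if every distinct substring of s has at least one occurrence
    s[k : k+m] that contains some position of gamma."""
    n = len(s)
    seen = set()
    for i in range(n):
        for j in range(i, n):
            sub = s[i:j + 1]
            if sub in seen:          # each distinct substring checked once
                continue
            seen.add(sub)
            m = len(sub)
            # search for any occurrence of sub crossing an attractor position
            crossed = any(
                s[k:k + m] == sub and any(k <= g < k + m for g in gamma)
                for k in range(n - m + 1)
            )
            if not crossed:
                return False
    return True

# For s = "aba", positions {0, 1} ('a' and 'b') attract every substring,
# but position {0} alone misses the substring "b".
print(is_string_attractor("aba", {0, 1}))  # True
print(is_string_attractor("aba", {0}))     # False
```

This exhaustive check runs in polynomial time and is meant only as an illustration of the definition; deciding whether a *smallest* string attractor has size at most k is NP-complete, as shown in the paper.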
Based on the paper: "At the roots of dictionary compression: string attractors", https://dl.acm.org/citation.cfm?doid=3188745.3188814