This repository contains the implementation of LCG, a scalable text compressor that relies on the concept of locally-consistent grammars. The main features of LCG are:
- Efficient compression of terabytes of text

The prerequisites for building LCG are:
- C++ >= 17
- CMake >= 3.7

The xxHash and CLI11 libraries are already included in the source files of this repository.
This implementation is still under development. Not all the features described in the help message have been tested, and some are only partially implemented.
For the moment, use LCG only to measure compression ratios.
Clone the repository, enter the project folder, and execute the following commands:
mkdir build
cd build
cmake ..
make
Once the build finishes, compress a file (here sample_file.txt) with:
./lcg comp sample_file.txt
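
Since LCG is currently meant for measuring compression ratios, a small helper like the one below can compute the ratio from two file sizes. This is a minimal sketch, not part of LCG: the function name `compression_ratio` is hypothetical, and the path of the compressed output is an assumption you must supply yourself, because this README does not state where the tool writes its output.

```python
import os


def compression_ratio(original_path: str, compressed_path: str) -> float:
    """Return original size divided by compressed size (higher means better compression).

    Both paths are supplied by the caller; this helper does not know where
    LCG writes its output (that location is an assumption of the caller).
    """
    original_size = os.path.getsize(original_path)
    compressed_size = os.path.getsize(compressed_path)
    if compressed_size == 0:
        raise ValueError("compressed file is empty; cannot compute a ratio")
    return original_size / compressed_size
```

Call it with the input file and whatever output path the tool reports after it finishes.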
Our tool currently assumes the input is a concatenated collection of one or more strings, where every string ends with the same separator symbol. The tool assumes the last symbol in the file is the separator.
For collections of ASCII characters (e.g., regular text, DNA, or protein sequences), inputs in one-string-per-line format should work just fine.
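
Before compressing, it can help to confirm that a file matches the format described above. The following is a minimal sketch, not part of LCG: the function name `count_strings` is hypothetical, and it assumes a one-byte separator (a newline by default, as in one-string-per-line input). It counts the separator-terminated strings and rejects a file whose last byte is not the separator.

```python
def count_strings(path: str, separator: bytes = b"\n", chunk_size: int = 1 << 20) -> int:
    """Count separator-terminated strings in a file.

    Raises ValueError if the file is empty or its last byte is not the separator.
    The one-byte separator is an assumption of this sketch.
    """
    if len(separator) != 1:
        raise ValueError("separator must be exactly one byte")
    count = 0
    last_byte = b""
    with open(path, "rb") as f:
        # Read in chunks so very large collections do not need to fit in memory.
        while chunk := f.read(chunk_size):
            count += chunk.count(separator)
            last_byte = chunk[-1:]
    if last_byte != separator:
        raise ValueError("file does not end with the separator symbol")
    return count
```

If the check fails for a one-string-per-line file, appending a final newline is usually enough to fix it.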
The compression algorithm of LCG is optimized to work with collections of strings that do not exceed 4 GB in length.
This cap is enough for most practical applications. However, if your collection contains strings longer than that value, you can pass the flag -l/--long-strings. Note that the 4 GB cap is on the string length, not the collection size. For instance, you can have a 1 TB collection, but if all the strings are shorter than 4 GB, then the -l flag is not necessary.
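
To decide whether the -l flag is needed, you can scan the collection for its longest string. The sketch below is not part of LCG, and its names (`longest_string_length`, `needs_long_strings_flag`, `FOUR_GIB`) are hypothetical. It assumes a one-byte newline separator, that "4 GB" means 2^32 bytes, and that the separator itself does not count toward the string length; none of these details are confirmed by this README.

```python
FOUR_GIB = 1 << 32  # assumption: the 4 GB cap corresponds to 2**32 bytes


def longest_string_length(path: str, separator: bytes = b"\n", chunk_size: int = 1 << 20) -> int:
    """Return the length in bytes of the longest string, excluding the separator."""
    longest = 0
    current = 0  # length of the string currently being scanned
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            start = 0
            while True:
                pos = chunk.find(separator, start)
                if pos == -1:
                    # The string continues into the next chunk.
                    current += len(chunk) - start
                    break
                current += pos - start
                longest = max(longest, current)
                current = 0
                start = pos + 1
    # Account for a trailing string that lacks a final separator.
    return max(longest, current)


def needs_long_strings_flag(path: str, limit: int = FOUR_GIB) -> bool:
    """True if some string is longer than the limit, i.e. -l would be required."""
    return longest_string_length(path) > limit
```

The scan reads the file in chunks, so it works on collections much larger than main memory; only the length of the current string is kept between chunks.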
This tool still contains experimental code, so you will probably find bugs. Please report them in this repository.
This implementation was written by Diego Díaz.