tudocomp's Introduction

tudocomp

The Technical University of DOrtmund COMPression Framework (tudocomp) is a lossless compression framework with the aim to support and facilitate the implementation of novel compression algorithms. It already comprises a range of standard data compression and encoding algorithms. These can be mixed and parameterized with the following uses in mind:

  • Baseline implementations of well-known compression schemes.
  • Detailed benchmarking and comparison of compression and encoding algorithms.
  • Easy integration of new algorithm implementations.

The framework offers a solid and extensible base for new implementations. Its design is focused on modularity and interchangeability. This way, the user can combine algorithms to find the optimal compression strategy for a given input. The framework gives this opportunity while creating as little performance overhead as possible.

Dependencies

tudocomp's CMake build process will either find external dependencies on the system if they have been properly installed, or automatically download and build them from their official repositories in case they cannot be found. In that regard, a proper installation of the dependencies is not required.

Said external dependencies are the following:

Additionally, the tests require Google Test (1.7.0 or later).

Documentation Build Requirements

For building the documentation, the following tools need to be installed:

  • LaTeX (specifically the pdflatex component)
  • Doxygen (1.8 or later)
  • Pandoc (1.19 or later)

Windows Support

While tudocomp has no explicit support for Windows / Microsoft Visual C++, it can be used with Bash on Ubuntu on Windows with next to no feature limitations. However, note that the comparison tool relies on valgrind, which does not work in this environment before the Windows 10 Creators Update.

License

The framework is published under the Apache License 2.0.

tudocomp's People

Contributors

jonas-ellert, jzentgraf, kimundi, koeppl, linno60, pdinklag, uewiebelitz

tudocomp's Issues

Test and document new build system behavior for c++ library dependencies

The SDSL, glog and boost dependencies now behave the same way:

  • cmake searches for them in the installed system, and if they are not found they are instead downloaded via an external project. (Though the sdsl one does not actually search the system yet)
  • each of them has its own cmake target name, rather than relying on using the ${FOO_include_dirs} and similar cmake variables that find_package() generates.

make LZ78 trie LZ78/LZW aware

Some implementations of LZ78 tries could exploit knowing whether they build an LZ78 or an LZW trie.
However, this is not possible in the current API. Can we somehow propagate the type LZ78 / LZW to the LZ78 tries?

LZ77 Rules

The implemented LZ77 rules are actually rules for the ESA-comp compressor.
ESA-comp uses rules of the form (src, target, num), but the LZSS variant of LZ77 uses just (src, num).
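
To illustrate why (src, num) is enough for LZSS, here is a minimal Python sketch of decoding a left-to-right factorization. It is not tudocomp code: the function name lzss_decode and the representation of the parse as a list of literals and (src, num) pairs are assumptions made for this example. Because factors are consumed in text order, the target position is always the current output length and never needs to be stored.

```python
def lzss_decode(factors):
    """Decode a parse whose items are either a literal (1-char str)
    or a (src, num) pair referring to already decoded text."""
    out = []
    for item in factors:
        if isinstance(item, str):
            out.append(item)
        else:
            src, num = item
            # The target is implicit: it is always len(out).
            # Copy character by character so that overlapping
            # references (src + num > target) work as well.
            for k in range(num):
                out.append(out[src + k])
    return "".join(out)
```

An ESA-comp-style rule (src, target, num) needs the explicit target because its rules are not necessarily applied in text order.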

check_input program

I would like to have an extra program that checks my input text for bad characters that cause my selection of compression algorithms to abort. For instance, some algorithms require that the text ends with the \0 byte. Others additionally require that no \0 byte appears inside the text, so such occurrences have to be escaped.
Currently, tdc just aborts with a segmentation fault in the latter case.
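
A minimal sketch of what such a check could look like, written in Python rather than as part of tdc. The function name check_input and the shape of the returned report are hypothetical; only the two conditions from the description above (trailing \0 byte, no \0 byte inside the text) are tested.

```python
def check_input(data):
    """Report the two \\0-related properties of a byte string.

    Returns a dict with
      ends_with_null: True if the last byte is \\0
      interior_nulls: positions of \\0 bytes before the last byte
                      (or anywhere, if the text has no \\0 terminator)
    """
    ends_with_null = data.endswith(b"\0")
    body = data[:-1] if ends_with_null else data
    interior_nulls = [i for i, byte in enumerate(body) if byte == 0]
    return {"ends_with_null": ends_with_null,
            "interior_nulls": interior_nulls}
```

Such a function could be applied to the contents of a file opened in binary mode before the file is handed to a compressor.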

simplify JSON stats

Is it possible to simplify the "stats" array, which currently consists of {"key": "key", "value": "value"} dictionaries as elements,
into a simple dictionary of the form {"key": "value"}?
Is there then a limitation on what characters can be used for "key"?

Motivation: It would make it easy to extract measured stats with the command line tool jq.
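
The requested transformation can be sketched in a few lines of Python. This assumes each element of "stats" has exactly the fields "key" and "value" as described above; the function name flatten_stats is made up for this example. Duplicate keys are rejected, since a plain dictionary cannot represent them.

```python
def flatten_stats(stats):
    """Turn [{"key": k, "value": v}, ...] into {k: v, ...}."""
    flat = {}
    for entry in stats:
        key = entry["key"]
        if key in flat:
            raise ValueError("duplicate stat key: %r" % (key,))
        flat[key] = entry["value"]
    return flat
```

On the question of key characters: JSON itself allows any string as an object key. In jq, keys that look like identifiers can be read as .name, while other keys need the quoted form .["some key"].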

unknown error in compare-tool

When using ./compare.py --suite ../etc/compare-suites/default.suite [filename], I get the following error message:

ERROR: Failed to load suite '../etc/compare-suites/default.suite'
__new__() got an unexpected keyword argument 'args'

This message does not help me in finding the error at all.
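
The second line has the shape of the message Python produces when a constructor is called with a keyword it does not declare. As an assumption only (the loader's source is not shown here), a suite entry that passes args= to a record type without such a field would trigger it. The sketch below reproduces that situation with a hypothetical namedtuple Exec and shows how a loader could name the offending entry and the allowed fields.

```python
from collections import namedtuple

# Hypothetical record type; the real suite format may differ.
Exec = namedtuple("Exec", ["name", "cmd"])

def try_build(fields):
    """Build an Exec from a dict of fields.

    Returns (entry, None) on success or (None, message) on failure,
    where the message says which fields were given and which are allowed.
    """
    try:
        return Exec(**fields), None
    except TypeError as err:
        message = "bad suite entry with fields %s: %s; allowed fields are %s" % (
            sorted(fields), err, list(Exec._fields))
        return None, message
```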

lexparse/plcp pipelines

I would like to have the following selectable parameters for the precomputed data structures
for all lexicographic parses (including lexparse and plcpcomp):

  • fastest possible way (currently implemented)
  • use plcp-sada instead of integer-array-based plcp
  • use suffix array + inverse suffix array sampling instead of Φ

factor_count vs num_factors

The LZ77 and LZ78 compressors define different stat keys for the same thing ("factor_count" vs "num_factors"). Can we unify this?

tudostats test fails

... and after a brief look at the test source, it appears that the assertion expects a wrong value and the actual value is actually correct. This needs investigation.

lz78: root nodes are unnecessary

In LZ78Compressor.hpp and LZWCompressor.hpp, the root nodes defined via the API call add_rootnode seem unnecessary. Calling add_rootnode(c) just adds the character c to the LZ78 trie and gives it the LZ node ID c. Both compressors query these added nodes with get_rootnode(c), but the call always returns c. So there is no need to pollute the LZ78 tries with these 'root nodes', as the compressors can take care of them by themselves.
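
The idea can be sketched with a small LZW compressor in Python. This is an illustration, not the tudocomp API: the name lzw_compress and the dictionary-based trie are assumptions. The trie only stores nodes of depth two or more; the node for a single byte c is never inserted, because its ID is defined to be c itself.

```python
def lzw_compress(data):
    """LZW over bytes, returning the list of emitted node IDs.

    IDs 0..255 are the implicit single-byte nodes; new nodes get
    IDs starting at 256.
    """
    trie = {}        # (parent_id, byte) -> node_id, explicit nodes only
    next_id = 256
    out = []
    if not data:
        return out
    node = data[0]   # implicit root-level node: ID equals the byte value
    for c in data[1:]:
        child = trie.get((node, c))
        if child is not None:
            node = child            # extend the current phrase
        else:
            out.append(node)        # emit the longest known phrase
            trie[(node, c)] = next_id
            next_id += 1
            node = c                # restart at the implicit node for c
    out.append(node)
    return out
```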

Cryptic ]] found in CMakeLists.txt

Is there a reason behind the ']]' in line 20, /src/tudocomp_driver/CMakeLists.txt ?
The closing brackets match the opening brackets in line 1, but these are commented out.
Interestingly, cmake only compiles the CMakeLists.txt if either both brackets are deleted, or both are present.
Is this a cmake bug?

make master the default branch

The unmaintained public branch is currently the default branch.
However, this branch does not offer the 0-escaping needed for compressing binary files.
I would therefore suggest migrating everything to the master branch, keeping master as the current stable branch, and having a develop branch for more experimental stuff.

strange help listings

Issuing ./tdc --list=bi gives

There is more than algorithm named 'bi'. Please specify using one of the following canonical IDs:
bi:lzss_bidirectional_coder
bi:lzss_bidirectional_coder

It would also be nice for lcpcomp to list all parameters for compression/decompression at the same time instead of outputting

There is more than algorithm named 'lcpcomp'. Please specify using one of the following canonical IDs:
lcpcomp:decompressor
lcpcomp:compressor

Can we push default parameters into the coder `bi` so that we do not have to write `bi(binary,binary,binary)`,
and use the best strategy, i.e., plcpcomp, as the default?

Switch to memory-efficient plcp implementation

The current default implementation for requesting the PLCP array
uses the code in include/tudocomp/ds/providers/PhiAlgorithm.hpp, which builds the PLCP as a plain array,
although a succinct $2n + o(n)$-bit implementation is already available at
include/tudocomp/ds/LCPSada.hpp.
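
For readers unfamiliar with the two representations, the following Python sketch contrasts them. It does not reflect the code in either header; all function names are made up for this example. plcp_phi computes the PLCP as a plain array via the Φ array. encode_succinct relies on the property PLCP[i+1] >= PLCP[i] - 1, which makes the positions 2i + PLCP[i] strictly increasing, so that one set bit per text position in a bit vector of length 2n suffices. The o(n) term of a real implementation would come from a select structure, which the sketch replaces with a linear scan.

```python
def suffix_array(text):
    # Naive construction; only meant for small demonstration inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])

def plcp_phi(text, sa):
    """PLCP as a plain array: plcp[i] is the LCP of suffix i and the
    suffix preceding it in suffix array order (0 for the first one)."""
    n = len(text)
    phi = [-1] * n
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    l = 0
    for i in range(n):
        j = phi[i]
        if j < 0:
            l = 0
            continue
        while i + l < n and j + l < n and text[i + l] == text[j + l]:
            l += 1
        plcp[i] = l
        l = max(l - 1, 0)   # PLCP[i+1] >= PLCP[i] - 1
    return plcp

def encode_succinct(plcp):
    """Bit vector of length 2n with the i-th set bit at 2i + plcp[i]."""
    bits = [0] * (2 * len(plcp))
    for i, value in enumerate(plcp):
        bits[2 * i + value] = 1
    return bits

def decode_succinct(bits):
    """Recover plcp[i] = select1(i) - 2i by scanning for set bits."""
    ones = [pos for pos, bit in enumerate(bits) if bit]
    return [pos - 2 * i for i, pos in enumerate(ones)]
```

The plain array stores n integers, while the bit vector stores 2n bits regardless of how large the individual values are.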
