Giter Club home page Giter Club logo

debian-dedup's Introduction

Required packages
-----------------

    aptitude install python python-debian python-lzma curl python-jinja2 python-werkzeug sqlite3 python-imaging python-yaml python-concurrent.futures python-pkg-resources

Create a database
-----------------
The database name is currently hardcoded as `test.sqlite3`. So copy the SQL
statements from `schema.sql` into `sqlite3 test.sqlite3`. In addition it is
highly recommended to put the database into WAL mode. Otherwise all your
reading queries will block forever when doing an import. This setting is
permanent.

    PRAGMA journal_mode = WAL;

Import packages
---------------
Import individual packages by feeding them to importpkg.py and readyaml.py:

    ./importpkg.py < somepkg.deb | ./readyaml.py

You can import your local apt cache:

    ./autoimport.py /var/cache/apt/archives

Import a full mirror (only http supported):

    ./autoimport.py -n -p http://your.mirror.example/debian

After changing the database, a few tables caching expensive computations need
to be (re)generated. Execute `./update_sharing.py`. Without this step the web
interface will report wrong results.

Viewing the results
-------------------
Run `./webapp.py` and enjoy a webinterface at `0.0.0.0:8800` or inspect the
SQL database by hand. Here are some example queries.

Finding the 100 largest files shared with multiple packages.

    SELECT pa.name, a.filename, pb.name, b.filename, a.size FROM content AS a JOIN hash AS ha ON a.id = ha.cid JOIN hash AS hb ON ha.hash = hb.hash JOIN content AS b ON b.id = hb.cid JOIN package AS pa ON a.pid = pa.id JOIN package AS pb ON b.pid = pb.id WHERE (a.pid != b.pid OR a.filename != b.filename) ORDER BY a.size DESC LIMIT 100;

Finding those top 100 files that save most space when being reduced to only
one copy in the archive.

    SELECT hash, sum(size)-min(size), count(*), count(distinct pid) FROM content JOIN hash ON content.id = hash.cid JOIN function ON hash.fid = function.id WHERE function.name = "sha512" GROUP BY hash ORDER BY sum(size)-min(size) DESC LIMIT 100;

Finding PNG images that do not carry a .png file extension.

    SELECT package.name, content.filename, content.size FROM content JOIN hash ON content.id = hash.cid JOIN package ON content.pid = package.id JOIN function ON hash.fid = function.id WHERE function.name = "png_sha512" AND lower(filename) NOT LIKE "%.png";

Finding .gz files which either are not gziped or contain errors.

    SELECT package.name, content.filename FROM content JOIN package ON content.pid = package.id WHERE filename LIKE "%.gz" AND (SELECT count(*) FROM hash JOIN function ON hash.fid = function.id WHERE hash.cid = content.id AND function.name = "gzip_sha512") = 0;

debian-dedup's People

Contributors

helmutg avatar paulproteus avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.