
Comments (5)

jaimergp commented on August 23, 2024

My findings so far:

  • The latest state of libcfgraph (last updated on Dec 4th 2023) contains:

    • 1,602,023 artifacts
    • 18,390,176 unique paths
    • 618,908,726 path-to-artifact relationships
  • We can naively put all the JSONs in an artifact -> JSON blob table. Using the new JSONB format, this takes 46GB uncompressed and 2.5GB compressed with zstd. However, finding files in this table is very slow because it needs to scan the JSON blob for each artifact. Adding a virtual JSON index doesn't help much and increases storage significantly. On the upside, this can store ALL metadata in a very simple way and takes <10 min to populate.

  • To optimize for file querying, I also created a database with a file_path -> conda_artifacts table, indexed by file_path. The conda_artifacts field is a text field where each line is a conda artifact "route" (channel/subdir/filename). This has a lot of duplication, but exact-match queries are blazingly fast. It takes 37GB uncompressed, but compresses nicely to 850MB with zstd! We can also add an FTS5 index for the paths, which allows for fast partial searches at a relatively small cost.

    • I also experimented with some forms of string interning to avoid the artifact duplication, but it's VERY slow to populate (estimates of 30-60h, compared to the <20min mark we have with the non-interned version) and would also involve slower retrievals, so I think the non-interned version is a good compromise.
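For illustration, the file_path -> conda_artifacts layout described above could look roughly like the sketch below. This is not the code from the repository: the table name, the helper functions and the sample artifacts are made up for the example, and the in-memory inversion would not scale to the real dataset.

```python
import sqlite3


def build_path_table(conn, artifact_files):
    """Populate a file_path -> conda_artifacts table.

    artifact_files maps an artifact route (channel/subdir/filename)
    to the list of file paths it contains.
    """
    # TEXT PRIMARY KEY gives us an index on file_path automatically.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS path_to_artifacts ("
        " file_path TEXT PRIMARY KEY,"
        " conda_artifacts TEXT NOT NULL)"
    )
    # Invert artifact -> paths into path -> artifacts (toy-sized, in memory).
    inverted = {}
    for route, paths in artifact_files.items():
        for path in paths:
            inverted.setdefault(path, []).append(route)
    # One row per path; the routes are stored one per line in a text field.
    conn.executemany(
        "INSERT INTO path_to_artifacts (file_path, conda_artifacts) VALUES (?, ?)",
        ((path, "\n".join(sorted(routes))) for path, routes in inverted.items()),
    )
    conn.commit()


def find_artifacts(conn, file_path):
    """Exact lookup: return the artifact routes that ship file_path."""
    row = conn.execute(
        "SELECT conda_artifacts FROM path_to_artifacts WHERE file_path = ?",
        (file_path,),
    ).fetchone()
    return row[0].split("\n") if row else []


# Hypothetical sample data, only to exercise the functions.
sample = {
    "conda-forge/linux-64/foo-1.0-h0_0.tar.bz2": ["bin/foo", "lib/libfoo.so"],
    "conda-forge/linux-64/bar-2.0-h0_0.tar.bz2": ["bin/bar", "lib/libfoo.so"],
}
conn = sqlite3.connect(":memory:")
build_path_table(conn, sample)
print(find_artifacts(conn, "lib/libfoo.so"))
```

The duplication is visible even in this toy case: every path row repeats the full route string of each artifact that contains it.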

The code is available in this repository: https://github.com/jaimergp/conda-forge-paths. I added a GHA workflow, but the runner dies trying to clone libcfgraph 🚀 😂 My plan is to upload a couple of database.zst files to GH releases and use that as a starting point.

from czi-conda-forge-mgmt.

jaimergp commented on August 23, 2024

Hm, I learnt about RETURNING and realized we can store the artifacts in their own table on the go at no extra cost, and store their IDs in the paths table instead of the full routes, which should have little cost at query time. I also added full-text search to enable partial searches, and it didn't change the size significantly. This all means that with this new approach the uncompressed database is only 8.8GB! Compressed size doesn't change much: 634MB.

We also get a new table for free: all the artifacts, and I also stored the timestamps, which will be useful at update time.
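A rough sketch of the RETURNING idea, under several assumptions: the schema, the comma-separated ID column and the function names are invented for this example and may not match the repository, and the timestamps are placeholder integers. RETURNING needs SQLite 3.35 or newer, so the sketch falls back to `lastrowid` on older builds.

```python
import sqlite3

# Hypothetical schema: artifacts get an integer ID, paths reference those IDs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS artifacts (
    id INTEGER PRIMARY KEY,
    route TEXT NOT NULL UNIQUE,
    timestamp INTEGER
);
CREATE TABLE IF NOT EXISTS path_to_artifacts (
    file_path TEXT PRIMARY KEY,
    artifact_ids TEXT NOT NULL
);
"""

HAS_RETURNING = sqlite3.sqlite_version_info >= (3, 35, 0)


def add_artifact(conn, route, timestamp, paths):
    """Insert one artifact, get its ID back, and attach the ID to each path."""
    if HAS_RETURNING:
        # The INSERT itself hands back the new ID, no extra SELECT needed.
        (artifact_id,) = conn.execute(
            "INSERT INTO artifacts (route, timestamp) VALUES (?, ?) RETURNING id",
            (route, timestamp),
        ).fetchone()
    else:
        artifact_id = conn.execute(
            "INSERT INTO artifacts (route, timestamp) VALUES (?, ?)",
            (route, timestamp),
        ).lastrowid
    for path in paths:
        # New path: create the row. Known path: append the ID to its list.
        conn.execute(
            "INSERT INTO path_to_artifacts (file_path, artifact_ids) VALUES (?, ?) "
            "ON CONFLICT(file_path) DO UPDATE SET "
            "artifact_ids = artifact_ids || ',' || excluded.artifact_ids",
            (path, str(artifact_id)),
        )
    return artifact_id


def find_artifacts(conn, file_path):
    """Resolve a path to artifact routes via the stored IDs."""
    row = conn.execute(
        "SELECT artifact_ids FROM path_to_artifacts WHERE file_path = ?",
        (file_path,),
    ).fetchone()
    if row is None:
        return []
    ids = [int(i) for i in row[0].split(",")]
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT route FROM artifacts WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return [r[0] for r in rows]


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = add_artifact(
    conn, "conda-forge/linux-64/foo-1.0-h0_0.tar.bz2", 1700000000,
    ["bin/foo", "lib/libfoo.so"],
)
second = add_artifact(
    conn, "conda-forge/linux-64/bar-2.0-h0_0.tar.bz2", 1700000100,
    ["bin/bar", "lib/libfoo.so"],
)
conn.commit()
print(find_artifacts(conn, "lib/libfoo.so"))
```

Each route string is now stored once in `artifacts`, and the paths table only carries short integer IDs, which is in line with the drop in uncompressed size reported above. The lookup costs one extra query to turn IDs back into routes.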

The https://github.com/jaimergp/conda-forge-paths repo is now up-to-date, and includes a datasette example.

$ ll path_to_artifacts.*
-rw-r--r--  1 jrodriguez  staff   8.8G Mar  8 16:07 path_to_artifacts.db
-rw-r--r--  1 jrodriguez  staff   634M Mar  8 16:30 path_to_artifacts.tar.zst


jaimergp commented on August 23, 2024

Demo search is now available at https://conda-metadata-app.streamlit.app/Search_by_file_path


jaimergp commented on August 23, 2024

@zklaus mentioned conda-forge/staged-recipes#25862, which could be used to reduce storage on the server.


jaimergp commented on August 23, 2024

Progress in https://github.com/jaimergp/conda-forge-paths: the repo now has self-updating releases (assuming it works). A systemd config has also been added on the server, so it updates itself every week. I'll close here once I see a working deployment/release :)

