nixpkgs-graph

Build a graph database of nixpkgs.

About

This project aims at building a graph database of nixpkgs. Read more on our blog post: "Construction and analysis of the build and runtime dependency graph of Nixpkgs".

Usage

Requirements

Using build.sh

./build.sh 481f9b246d200205d8bafab48f3bd1aeb62d775b 0n6a4a439md42dqzzbk49rfxfrf3lx3438i2w262pnwbi3dws72g 

where

  • the first argument is the revision (the 40-character SHA-1 hash) of a commit
  • the second is the SHA256 hash of its content (same as nix-prefetch-url --unpack).

After running this script you will find in the ./rawdata/ folder:

  • nodes.json: raw data extracted with the Nix evaluation
  • nodes.csv: structured data which can be loaded by most tools
  • first_graph.png: image drawn with networkx
  • first_graph.gexf: data which can be loaded by Gephi
  • first_graph.graphml: data which can be loaded by Neo4j
  • general_info.json: some basic information (number of nodes, number of edges)

If you want to query the graph with Neo4j using Cypher Shell, a shell.nix is provided:

$ nix-shell
[nix-shell]$ cypher-shell -a bolt://localhost:7687 "MATCH (n) RETURN COUNT(n) as number_of_nodes;"

Manual steps

  1. The provided Nix shell also creates a Python virtual environment:

    nix-shell --command "exit"
    source .venv/bin/activate
  2. Run nixpkgs_graph in the command line:

    python3 -m nixpkgs_graph --help

    To get the nixpkgs database in JSON format, you can use the following command:

    python3 -m nixpkgs_graph build --rev 481f9b246d200205d8bafab48f3bd1aeb62d775b --sha256 0n6a4a439md42dqzzbk49rfxfrf3lx3438i2w262pnwbi3dws72g

    The --rev flag is the revision, i.e. the 40-character SHA-1 hash of a commit, and --sha256 is the SHA256 hash of its content.

  3. Generate the graph and do some basic analysis:

    python3 -m nixpkgs_graph generate-graph --input-file INPUT_FILE --output-folder OUTPUT_FOLDER

    The input file should be the path to the data extracted in the previous step.

  4. To use Neo4j to query the graph:

    • Find the .graphml format file in the output folder.

    • Copy it to the import folder of Neo4j $NEO4J_HOME/share/neo4j/import/.

    • Clear the original graph to avoid duplication:

      cypher-shell -a bolt://localhost:7687 "MATCH (n) DETACH DELETE n;"
    • Use APOC to import it:

      cypher-shell -a bolt://localhost:7687 "call apoc.import.graphml('<filename>.graphml', {})"

      Or in Neo4j browser if you use desktop version:

      call apoc.import.graphml('<filename>.graphml', {})
    • Use some simple commands to test whether the graph was successfully imported:

      cypher-shell -a bolt://localhost:7687 "MATCH (n) RETURN n LIMIT 10;"

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contacts

Eloi Xuan WANG - @GearlessJohn

Guillaume Desforges - @GuillaumeDesforges

Project Link: https://github.com/tweag/nixpkgs-graph

Appendix

The following are details about the methods used.

How data is extracted

Each name/value pair in the JSON file represents a package in nixpkgs and contains the following information:

  • id: full name of the package in nixpkgs, including its version
  • pname
  • version
  • package: path to which the package belongs (like [ nixpkgs python3Packages ])
  • buildInputs: build inputs of the package; each entry has the /nix/store/hash-name(-dev) structure, so we can identify the node by name
  • propagatedBuildInputs: propagated build inputs of the package; each entry has the same /nix/store/hash-name(-dev) structure, so we can again identify the node by name
  • type = "node": identification marker used by lib.collect

Example:

{
  "buildInputs": "/nix/store/c1pzk30ksbff1x3krxnqzrzzfjazsy3l-gsettings-desktop-schemas-42.0 /nix/store/mmwc0xqwxz2s4j35w7wd329hajzfy2f1-glib-2.72.3-dev /nix/store/64mp60apx1klb14l0205562qsk1nlk39-gtk+3-3.24.34-dev /nix/store/6hdwxlycxjgh8y55gb77i8yqglmfaxkp-adwaita-icon-theme-42.0 ",
  "id": "chromium-103.0.5060.134",
  "package": [
    "nixpkgs",
    "chromium"
  ],
  "pname": "chromium",
  "propagatedBuildInputs": "",
  "type": "node",
  "version": "103.0.5060.134"
}

Another example, at depth 1 under python3Packages:

{
  "buildInputs": "/nix/store/vakcc74vp08y1rb1rb1cla6885ayklk3-zstd-1.5.2-dev ",
  "id": "python3.9-zstd-1.5.1.0",
  "package": [
    "nixpkgs",
    "python3Packages",
    "zstd"
  ],
  "pname": "zstd",
  "propagatedBuildInputs": "/nix/store/xpwwghl72bb7f48m51amvqiv1l25pa01-python3-3.9.13 ",
  "type": "node",
  "version": "1.5.1.0"
}
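As an illustration of how a node can be identified by name from these fields, here is a minimal sketch that turns a buildInputs string into a list of names. It is not the project's actual code: the function names are hypothetical, and it assumes that the hash part of a store path contains no dash and that "-dev" is the only output suffix to strip (other outputs would have to be added to OUTPUT_SUFFIXES).

```python
# Assumption: only the "-dev" output suffix mentioned above needs stripping.
OUTPUT_SUFFIXES = ("dev",)


def store_path_to_name(path):
    """Return the name part of /nix/store/<hash>-<name>[-<output>]."""
    base = path.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: the hash contains no dash, so the name starts after the first one.
    _hash, _, name = base.partition("-")
    for output in OUTPUT_SUFFIXES:
        suffix = "-" + output
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def parse_inputs(field):
    """Split a space-separated buildInputs string into a list of names."""
    return [store_path_to_name(p) for p in field.split()]


example = (
    "/nix/store/c1pzk30ksbff1x3krxnqzrzzfjazsy3l-gsettings-desktop-schemas-42.0 "
    "/nix/store/mmwc0xqwxz2s4j35w7wd329hajzfy2f1-glib-2.72.3-dev "
    "/nix/store/64mp60apx1klb14l0205562qsk1nlk39-gtk+3-3.24.34-dev "
    "/nix/store/6hdwxlycxjgh8y55gb77i8yqglmfaxkp-adwaita-icon-theme-42.0 "
)
names = parse_inputs(example)
print(names)
```

Note that a package whose real name ends in one of the listed suffixes would be shortened by mistake, which is one reason to keep that list small.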

To get this data, we evaluate a Nix expression designed to yield all the data we want. Note that we use --json --strict when calling nix-instantiate.

The Nix expression iterates over the key/value pairs of the root attribute set of nixpkgs (and some other selected attribute sets) using mapAttrs. Afterwards, we retrieve the desired sets via lib.collect.
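To make the collection step more concrete, here is a Python analogue of what lib.collect does, applied to a nested dictionary instead of a Nix attribute set. This is only a sketch for illustration, not the Nix code used by the project; the functions collect and is_node and the sample tree are hypothetical.

```python
def collect(pred, value):
    """Recursively gather every value for which pred is true.

    Like lib.collect, recursion stops at a matching value and only
    descends into attribute sets (here: dicts).
    """
    if pred(value):
        return [value]
    if isinstance(value, dict):
        found = []
        for child in value.values():
            found.extend(collect(pred, child))
        return found
    return []


def is_node(value):
    """The type = "node" marker identifies the sets we want to keep."""
    return isinstance(value, dict) and value.get("type") == "node"


# Hypothetical, heavily trimmed stand-in for the evaluated attribute set.
tree = {
    "chromium": {"type": "node", "id": "chromium-103.0.5060.134"},
    "python3Packages": {
        "zstd": {"type": "node", "id": "python3.9-zstd-1.5.1.0"},
    },
    "notAPackage": "some string",
}

ids = [node["id"] for node in collect(is_node, tree)]
print(ids)
```

Because matching sets are returned as a flat list, packages nested under collections such as python3Packages end up at the same level as top-level ones.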

How raw data is processed

For the first version of the graph, we used pandas to process the raw JSON data and networkx to process the graph data.

See nixpkgs_graph.py

Analyzing data

Use the networkx.read_gexf() function to read the .gexf file.

This project provides some basic information:

  • number of nodes
  • number of edges
  • top 10 nodes which have the largest number of dependencies
  • top 10 most cited nodes
  • average number of dependencies of a derivation
  • cycles in the nixpkgs graph
  • length of the longest path in the graph
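The metrics listed above can be sketched with networkx as follows. This is an illustrative sketch rather than the project's implementation: the function basic_info and the toy graph are hypothetical, an edge A -> B is assumed to mean "A depends on B", and the longest path is computed on the condensation of the graph (each strongly connected component collapsed into one node) so that cycles do not make it undefined.

```python
import networkx as nx


def basic_info(graph, top=10):
    """Compute basic statistics on a directed dependency graph."""
    by_degree = lambda pair: pair[1]
    n_nodes = graph.number_of_nodes()
    return {
        "number_of_nodes": n_nodes,
        "number_of_edges": graph.number_of_edges(),
        # Out-degree: how many dependencies a node has (assumed edge direction).
        "top_dependencies": sorted(graph.out_degree(), key=by_degree, reverse=True)[:top],
        # In-degree: how many nodes cite this node.
        "most_cited": sorted(graph.in_degree(), key=by_degree, reverse=True)[:top],
        "average_dependencies": graph.number_of_edges() / n_nodes if n_nodes else 0.0,
        "cycles": list(nx.simple_cycles(graph)),
        # The condensation is always acyclic, so the longest path is well defined.
        "longest_path_length": nx.dag_longest_path_length(nx.condensation(graph)),
    }


# Toy graph; in practice one would load the generated file instead, e.g.
# graph = nx.read_gexf("rawdata/first_graph.gexf")
graph = nx.DiGraph()
graph.add_edges_from([
    ("app-1.0", "libfoo-2.0"),
    ("app-1.0", "libbar-0.3"),
    ("libfoo-2.0", "libbar-0.3"),
    ("selfdep-1.0", "selfdep-1.0"),  # a cycle of length 1
])

info = basic_info(graph)
print(info)
```

On large graphs, enumerating all simple cycles can be expensive, so it may be worth restricting that step to the non-trivial strongly connected components.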

Visualizing the graph

Use Gephi to read and process the generated .gexf for visualization.

nixpkgs-graph's Issues

Cycles in the nixpkgs graph

Describe the bug

In the analysis of the nixpkgs graph, we found some simple cycles. Some cycles are of length 1, which means that some derivations have buildInputs or propagatedBuildInputs that contain themselves. There are also some cycles of length 2 or 3. There are six in total, as follows:

['chicken-5.3.0']
['chicken-4.13.0']
['mlton-20180207']
['gvfs-1.50.2', 'libgdata-0.18.1', 'gnome-online-accounts-3.44.0']
['gvfs-1.50.2', 'gnome-online-accounts-3.44.0']
['pipewire-0.3.51', 'ffmpeg-4.4.2', 'SDL2-2.0.20']

To check whether the error was introduced when fetching the nixpkgs data, I looked at the raw nixpkgs data:

nix show-derivation nixpkgs#chicken

In the results given by nix there is the following information:

"/nix/store/1qlyycams6q39ll5r4p1sq57gcvhvgmn-chicken-5.3.0.drv": {
    ...
    "env": {
      ...
      "buildInputs": "/nix/store/c4ha2dqj3a1jp2dn962wdfq5wqy0gikv-chicken-5.3.0",
      ...
    }
    ...
}

This means that cycles do exist in the raw data of nixpkgs.
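The same check can be automated. The sketch below looks for derivations whose buildInputs or propagatedBuildInputs contain a store path with the derivation's own name. It assumes the input is a dictionary shaped like the excerpt above (derivation path as key, inputs as space-separated strings under "env"); the function names are hypothetical, and only names are compared, not output suffixes such as -dev.

```python
def strip_hash(store_path):
    """Drop the /nix/store/<hash>- prefix, assuming the hash has no dash."""
    base = store_path.rsplit("/", 1)[-1]
    return base.partition("-")[2]


def self_referencing(derivations):
    """Return the names of derivations that list themselves as an input."""
    found = []
    for drv_path, drv in derivations.items():
        own_name = strip_hash(drv_path)
        if own_name.endswith(".drv"):
            own_name = own_name[: -len(".drv")]
        env = drv.get("env", {})
        inputs = (
            env.get("buildInputs", "").split()
            + env.get("propagatedBuildInputs", "").split()
        )
        if any(strip_hash(path) == own_name for path in inputs):
            found.append(own_name)
    return found


# Only the fields visible in the excerpt above; everything else is omitted.
sample = {
    "/nix/store/1qlyycams6q39ll5r4p1sq57gcvhvgmn-chicken-5.3.0.drv": {
        "env": {
            "buildInputs": "/nix/store/c4ha2dqj3a1jp2dn962wdfq5wqy0gikv-chicken-5.3.0",
        },
    },
}

loops = self_referencing(sample)
print(loops)
```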

Expected behavior

Under normal circumstances, the nixpkgs graph should not contain cycles, and the nodes involved in such cycles are even less likely to be evaluable.

Add tests

The Python code should be tested.

Build fails

Following instructions in README.md, then:

$ bash build.sh
build.sh: line 1: rawdata/nodes.json: No such file or directory
Traceback (most recent call last):
  File "/home/me/nixpkgs-graph/./node_format_trans.py", line 5, in <module>
    data = pd.read_json('rawdata/nodes.json').T
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/io/json/_json.py", line 614, in read_json
    return json_reader.read()
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/io/json/_json.py", line 748, in read
    obj = self._get_object_parser(self.data)
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/io/json/_json.py", line 770, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/io/json/_json.py", line 885, in parse
    self._parse_no_numpy()
  File "/nix/store/x2w09z3x6f4bbw2j093s3y3ckl8msqzz-python3-3.9.6-env/lib/python3.9/site-packages/pandas/io/json/_json.py", line 1140, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

Add edges for build inputs

For now we only get nodes.

We now need to get edges so that we can build a graph.

Success

  • List of edges connecting nodes which were previously retrieved in #2 
  • Focus on build inputs dependencies

Document steps to get nixpkgs data to neo4j

For now, we generate a JSON file with the data, but using a graph database allows for easier analysis and interoperability with other tools.

As a starting place, we would like to see how it plays with Neo4j.

Inconsistent json file structure

Describe the bug
The structure of the JSON file automatically generated by Nix does not always look like this:

{
    "pname": "pytorch",
    "version": "11.0.0",
    ...
}

Collections like python3Packages add an extra layer to the JSON file, which eventually prevents pandas from reading it properly.

Possible solution
Modify the code in default.nix or additionally use another tool to convert the edges.json file to a consistent format.

Expected behavior
All packages should be at the same level of the JSON file, each as an object with the same format.
