
hdt's Introduction

HDT


A Rust library for the Header Dictionary Triples compressed RDF format, including:

  • loading the HDT default format as created by hdt-cpp
  • efficient querying by triple patterns
  • serializing into other formats like RDF Turtle and N-Triples using the Sophia adapter

However it cannot:

  • load other RDF formats
  • load other HDT variants

For this functionality, and for an acknowledgement of all the original authors, please refer to the reference implementations in C++ and Java by the https://github.com/rdfhdt organisation.

It also cannot:

  • swap data to disk
  • modify the RDF graph in memory
  • run SPARQL queries

If you need any of those features, consider using a SPARQL endpoint instead.

Examples

use hdt::Hdt;

let file = std::fs::File::open("example.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
// query all triples with subject Leipzig and predicate major, any object
let majors = hdt.triples_with_pattern(Some("http://dbpedia.org/resource/Leipzig"), Some("http://dbpedia.org/ontology/major"), None);
println!("{:?}", majors.collect::<Vec<_>>());
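
Any combination of bound and unbound positions works the same way; a minimal sketch (the object IRI is a hypothetical example):

// object-only pattern: all triples with the given object
let cities = hdt.triples_with_pattern(None, None, Some("http://dbpedia.org/resource/Saxony"));
println!("{} triples with that object", cities.count());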

You can also use the Sophia adapter to load HDT files and reduce the memory consumption of an existing Sophia-based application; Sophia is re-exported as hdt::sophia:

use hdt::{Hdt,HdtGraph};
use hdt::sophia::api::graph::Graph;
use hdt::sophia::api::term::{IriRef, SimpleTerm, matcher::Any};

let file = std::fs::File::open("dbpedia.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
let graph = HdtGraph::new(hdt);
let s = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/resource/Leipzig".into()));
let p = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/ontology/major".into()));
let majors = graph.triples_matching(Some(s), Some(p), Any);
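
The returned triple source yields Result items; a minimal sketch for printing the matches (Debug output for the triple type is assumed):

for triple in majors {
    println!("{:?}", triple.expect("error fetching triple"));
}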

If you don't want to pull in the Sophia dependency, you can exclude the adapter:

[dependencies]
hdt = { version = "...", default-features = false }

There is also a runnable example in the examples folder, which you can run with cargo run --example query.

API Documentation

See docs.rs/latest/hdt or generate it yourself with cargo doc --no-deps without disabling default features.

Performance

The performance of a query depends on the size of the graph, the type of triple pattern and the size of the result set. When using large HDT files, make sure to enable the release profile, such as through cargo build --release, as this can be much faster than using the dev profile.
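
Beyond passing --release, the release profile can be tuned further in Cargo.toml; a hedged sketch with generic Cargo settings, not specific recommendations of this crate:

[profile.release]
lto = true         # enable link-time optimization
codegen-units = 1  # more thorough optimization at the cost of compile time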

Profiling

If you want to optimize the code, you can use a profiler. The provided test data is very small in order to keep the size of the crate down; locally modifying the tests to use a large HDT file returns more meaningful results.
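
Such a local modification could look like the following sketch, with a hypothetical path to a large HDT file:

#[test]
fn triples_large() {
    // hypothetical large file, not included in the crate
    let file = std::fs::File::open("/data/large.hdt").expect("error opening file");
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    // match everything so the profiler captures the full iteration cost
    assert!(hdt.triples_with_pattern(None, None, None).count() > 0);
}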

Example with perf and Firefox Profiler

$ cargo test --release
[...]
Running unittests src/lib.rs (target/release/deps/hdt-2b2f139dafe69681)
[...]
$ perf record --call-graph=dwarf target/release/deps/hdt-2b2f139dafe69681 hdt::tests::triples
$ perf script > /tmp/test.perf

Then go to https://profiler.firefox.com/ and open /tmp/test.perf.

Criterion benchmark

cargo bench --bench criterion

iai benchmark

cargo bench --bench iai

Comparative benchmark suite

The separate benchmark suite compares the performance of this and some other RDF libraries.

Community Guidelines

Issues and Support

If you have a problem with the software, want to report a bug or have a feature request, please use the issue tracker. If you have a different type of request, feel free to send an email to Konrad.

Citation

DOI

If you use this library in your research, please cite our paper in the Journal of Open Source Software. We also provide a CITATION.cff file.

BibTeX entry

@article{hdtrs,
  doi = {10.21105/joss.05114},
  year = {2023},
  publisher = {The Open Journal},
  volume = {8},
  number = {84},
  pages = {5114},
  author = {Konrad Höffner and Tim Baccaert},
  title = {hdt-rs: {A} {R}ust library for the {H}eader {D}ictionary {T}riples binary {RDF} compression format},
  journal = {Journal of Open Source Software}
}

Citation string

Höffner et al., (2023). hdt-rs: A Rust library for the Header Dictionary Triples binary RDF compression format. Journal of Open Source Software, 8(84), 5114, https://doi.org/10.21105/joss.05114

Contribute

We are happy to receive pull requests. Please run cargo fmt before committing, make sure that cargo test succeeds, and check that the code compiles on both the stable and nightly toolchains, with and without the "sophia" feature active. cargo clippy should not report any warnings.

hdt's People

Contributors

dependabot[bot], konradhoeffner, pchampin, remram44, timplication


hdt's Issues

rsdict simd feature fails to build with Rust nightly 1.78

The rsdict dependency "simd" feature depends on packed_simd, which is no longer available in nightly 1.78.
See sujayakar/rsdict#9 and rust-lang/packed_simd#359.
I tried to refactor rsdict but wasn't successful yet, see sujayakar/rsdict#10.

Options

  1. Disable the "simd" feature for now, at the cost of some speed.
  2. Warn users to stay on nightly 1.77 or lower until this is fixed in rsdict; this is inconvenient and can cause errors for users who don't notice the warning.
  3. Add a "simd" feature to hdt and pass it through to rsdict (see the sketch below).
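
Option 3 would look roughly like this in hdt's Cargo.toml (a hedged sketch; it assumes rsdict exposes the feature under this name):

[features]
simd = ["rsdict/simd"]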

Update: Option 1 was chosen, but this issue is kept open in case a future rsdict update allows reenabling the feature.

general simplification and optimization

As I'm not very experienced with Rust, there are surely many horribly inefficient parts in the code that could be simplified and optimized by someone more experienced.

further optimize triples_with_sp, triples_with_so and triples_with_po

Since 0.0.4 they are filtered before translation, which is already better than the Sophia default implementation, but they could be further optimized at the iterator level.
Either create new iterators or add parameters to the existing ones; see the sketch after the following list.

  • triples_with_sp
  • triples_with_so
  • triples_with_po
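
A conceptual sketch of the filter-before-translation idea on hypothetical integer ID triples (not the crate's actual internal types): resolve the subject and predicate to IDs once, filter at the ID level, and translate only the survivors to strings.

// keep only the ID triples matching a fixed subject and predicate ID,
// so that dictionary translation only happens for actual matches
fn filter_sp(
    ids: impl Iterator<Item = (usize, usize, usize)>,
    sid: usize,
    pid: usize,
) -> impl Iterator<Item = (usize, usize, usize)> {
    ids.filter(move |&(s, p, _)| s == sid && p == pid)
}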

extend HdtGraph

  • triples_with_sp
  • triples_with_so
  • triples_with_po
  • triples_with_spo not needed right now

Use new WaveletMatrix 0.0.6 construction method to reduce memory usage

See kampersanda/sucds#44.
Using a modified hdt::tests which loads lscomplete20143.hdt and then returns.

Before

Command being timed: "cargo test --release hdt::tests"
User time (seconds): 89.53
System time (seconds): 7.13
Percent of CPU this job got: 225%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:42.95
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3370252
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 770
Minor (reclaiming a frame) page faults: 3897847
Voluntary context switches: 11918
Involuntary context switches: 12515
Swaps: 0
File system inputs: 2541256
File system outputs: 435912
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

After

Command being timed: "cargo test --release hdt::tests"
User time (seconds): 18.03
System time (seconds): 1.48
Percent of CPU this job got: 115%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3380940
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 773459
Voluntary context switches: 2180
Involuntary context switches: 654
Swaps: 0
File system inputs: 648120
File system outputs: 18624
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

There is not much change in the maximum resident set size; however, it is also much lower than expected. Is /usr/bin/time not accurate? Try with heaptrack instead.

Document use within a SPARQL pipeline

It would be useful to document a few examples where the Rust hdt library is used within a full pipeline, starting from an HDT file (as generated by hdt-cpp) to SPARQL query results.

For example, the Python rdflib-hdt library wraps hdt-cpp, and the linked function is the point where triple pattern queries over the HDT file are consumed by the rdflib SPARQL query processor: https://github.com/RDFLib/rdflib-hdt/blob/master/rdflib_hdt/hdt_document.py#L114

Documenting how Rust hdt might provide triple pattern query results to a few separate SPARQL query engines would show users how the Rust hdt library can fit into a broader pipeline from data to SPARQL query results.

remove unnecessary enums

While it theoretically allows more freedom of implementation in the future, it is questionable whether that will ever be used and all the match statements are too verbose.

Editorial comments on JOSS paper

I'm reading through the JOSS submission now, and will post comments here.

On line 40, I cannot see that HDT-FoQ has been defined yet, so please introduce it here. I notice that it's introduced in the caption to Figure 2, but please do it in the main text also.


Create HDT files from RDF

It would be a valuable enhancement to create HDT files from RDF text serialization formats. A pure Rust implementation would be ideal, but a Rust wrapper over hdt-cpp would be a convenient alternative.

crates.io and test resources

Due to the test HDT file tests/resources/swdf.hdt (5.9 MB uncompressed), the hdt package uses 2.42 MB on crates.io, which is much more than other crates typically use.
While this is not much, the eternal nature of crates.io motivates saving space.
Thus, the filesize of the package should be significantly reduced before publishing a new version.

Options

  1. keep it like it is now
  2. use a smaller HDT test file
  3. download the test file when required so that users who don't need the tests don't download it
  4. automatically generate the test file

subject IDs off by one

[TripleId { subject_id: 0, predicate_id: 90, object_id: 13304 }, TripleId { subject_id: 0, predicate_id: 101, object_id: 19384 }, TripleId { subject_id: 0, predicate_id: 111, object_id: 75817 }, TripleId { subject_id: 1, predicate_id: 90, object_id: 19470 }, TripleId { subject_id: 1, predicate_id: 101, object_id: 13049 }, TripleId { subject_id: 1, predicate_id: 104, object_id: 13831 }, TripleId { subject_id: 1, predicate_id: 111, object_id: 75817 }, TripleId { subject_id: 2, predicate_id: 90, object_id: 19313 }]
sample [
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/barry-norton",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/reto-krummenacher",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/robert-isele",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/anja-jentzsch",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_3",
        "http://data.semanticweb.org/person/christian-bizer",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b10",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/raphael-troncy",
    ),
]

IDs should start at 1, but subject IDs start at 0, which offsets all subjects by one, except the first one, because IDs 0 and 1 both map to the first entry.
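
For illustration: HDT dictionary IDs are 1-based, with 0 reserved for unbound positions in triple patterns, so a correct translation indexes the section at id - 1. A minimal sketch with a hypothetical in-memory section:

// translate a 1-based dictionary ID to its string; ID 0 yields None
fn id_to_string<'a>(entries: &[&'a str], id: usize) -> Option<&'a str> {
    entries.get(id.checked_sub(1)?).copied()
}

Starting subject IDs at 0 instead shifts every lookup by one, which matches the duplicated _:b1 subjects in the sample output above.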

in-memory representation

The HDT reader hdt_reader.rs is optimized for reading all triples from a buffered reader, but there should be an in-memory representation that can be queried for triple patterns. This could also lead to simpler lifetimes than when using the reader.

add hdt2rdf command line application

Using Sophia, it should be quite easy to serialize a loaded HDT file into another format like RDF Turtle or N-Triples, which would allow a simple command line application like hdt2rdf in the C++ version.
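
A hedged sketch of such a tool, assuming the re-exported Sophia includes the N-Triples stringifier (exact paths and features may differ):

use hdt::{Hdt, HdtGraph};
use hdt::sophia::api::serializer::{Stringifier, TripleSerializer};
use hdt::sophia::turtle::serializer::nt::NtSerializer;

fn main() {
    let file = std::fs::File::open("example.hdt").expect("error opening file");
    let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    let graph = HdtGraph::new(hdt);
    // serialize the whole graph to N-Triples and print it
    let mut serializer = NtSerializer::new_stringifier();
    println!("{}", serializer.serialize_graph(&graph).expect("serialization error").as_str());
}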

add Python bindings?

It seems that with https://crates.io/crates/pyo3 one can use Rust code from Python.
rdflib, the fastest Python RDF library tested in the benchmark, had very high RAM usage and loading times.
It could thus be useful for Python users to have access to this library.
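
A hedged sketch of what a minimal pyo3 binding could look like (module and function names are hypothetical, and the triple item type is assumed to be convertible to strings):

use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::*;

// query an HDT file by triple pattern and return owned string triples
#[pyfunction]
fn query(path: &str, s: Option<&str>, p: Option<&str>, o: Option<&str>) -> PyResult<Vec<(String, String, String)>> {
    let file = std::fs::File::open(path)?;
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
    Ok(hdt.triples_with_pattern(s, p, o).map(|(s, p, o)| (s.to_string(), p.to_string(), o.to_string())).collect())
}

#[pymodule]
fn hdt_py(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(query, m)?)
}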

Cyrillic URIs not found?

When using the Sophia HDT adapter in RickView, a URI with suffix хобби-N-0 is not found:

[WARN ] No triples found for entry/хобби-N-0. Did you configure the namespace correctly?

However when using Turtle, it works:

[DEBUG] ruthes:entry/хобби-N-0 HTML 357.8µs

This could be an issue with Sophia itself, but that is unlikely because it works with the Sophia FastGraph when the Turtle file is loaded; it could also be an error in the Sophia adapter, which is part of hdt.

As a first step, a URI with a suffix like хобби-N-0 should be included in the test HDT file and the test suite.
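
A hedged sketch of such a test (the IRI is hypothetical and would first have to be added to the test file):

#[test]
fn cyrillic_iri() {
    let file = std::fs::File::open("tests/resources/swdf.hdt").expect("error opening file");
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    // hypothetical IRI containing Cyrillic characters
    let iri = "http://example.org/entry/хобби-N-0";
    assert!(hdt.triples_with_pattern(Some(iri), None, None).count() > 0);
}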

triple pattern queries

Right now the only option is to iterate over all triples, which is inefficient for large graphs.
Implement triple pattern queries and add tests.

  • triples_with_s
  • triples_with_o
  • triples_with_p

JOSS Review

Hi Konrad,

How do you feel about adding some examples to the paper? I think even just the code snippet (with some examples of triples that would be matched?) from the README would be useful to readers to quickly see how to use the library. This would also beef up the text section of the paper a little bit.

If possible, it would be nice to include this (or similar) as a cargo example (with dev dependencies) printing out some matched triples.

use hdt::{Hdt,HdtGraph};
use hdt::sophia::api::graph::Graph;
use hdt::sophia::api::term::{IriRef, SimpleTerm, matcher::Any};

let file = std::fs::File::open("dbpedia.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
let graph = HdtGraph::new(hdt);
let s = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/resource/Leipzig".into()));
let p = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/ontology/major".into()));
let majors = graph.triples_matching(Some(s),Some(p),Any);

// Maybe add a couple lines of expected output triples?

I think it might also be useful to summarize a few key benchmarking results/statistics in a table - it can be hard to interpret the graphs given the number of items being reproduced.

Last thing - I tried to clone and run your fork of the benchmarking repo, but ran into some compilation errors:

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:154:16
    |
154 |         1 => g.triples_matching(Any, Some(rdf::type_), Some(dbo_person)),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:155:16
    |
155 |         2 => g.triples_matching(Some(dbr_vincent), Any, Any),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:156:16
    |
156 |         3 => g.triples_matching(Some(dbr_vincent), Some(rdf::type_), Any),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |
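
As the compiler output suggests, the fix is to bring the Graph trait into scope, matching the import used in the README examples above:

use hdt::sophia::api::graph::Graph;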

triples_with_p

Part of #3.
The test is failing; check whether the test itself or the results are broken.
cargo test triples::tests

fix bug with rickview and qbench2

  Compiling hdt v0.0.3 (/home/konrad/projekte/rust/hdt)
   Compiling rickview v0.0.5 (/home/konrad/projekte/rust/rickview)
    Finished dev [unoptimized + debuginfo] target(s) in 2.59s
     Running `target/debug/rickview`
[INFO ] Serving http://linkedspending.aksw.org/ at http://localhost:8080/
Start constructing wavelet matrix...finished constructing wavelet matrix with length 16144457
Constructing OPS index...finished constructing OPS index
HDT size in memory 248.2 MB, details:
Hdt {
    dict: FourSectDict {
        shared: total size 2.6 MB, sequence 26.8 KB with 9745 entries, 22 bits per entry, packed data 2.6 MB,
        subjects: total size 41.5 MB, sequence 193.0 KB with 59393 entries, 26 bits per entry, packed data 41.3 MB,
        predicates: total size 6.8 KB, sequence 48 B with 29 entries, 13 bits per entry, packed data 6.7 KB,
        objects: total size 59.5 MB, sequence 661.1 KB with 203407 entries, 26 bits per entry, packed data 58.8 MB,
    },
    triple_sect: Bitmap(
        total size 144.6 MB
        adjlist_y AdjList {
            sequence: 18.2 MB with 16144457 entries, 9 bits per entry,
            bitmap: 2.6 MB,
        }
        adjlist_z AdjList {
            sequence: 48.6 MB with 16213801 entries, 24 bits per entry,
            bitmap: 1.6 MB,
        }
        op_index total size 49.8 MB {
            sequence: 48.6 MB with 24 bits,
            bitmap: 1.1 MB
        }
        wavelet_y 23.8 MB,
    ),
}
thread 'actix-rt|system:0|arbiter:0' panicked at 'dictionary error in the shared section: index out of bounds: id 155895 > dictionary section len 437', /home/konrad/projekte/rust/hdt/src/hdt.rs:42:107
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'actix-rt|system:0|arbiter:1' panicked at 'dictionary error in the shared section: index out of bounds: id 155895 > dictionary section len 437', /home/konrad/projekte/rust/hdt/src/hdt.rs:42:107

DictSectPFC bug with handling UTF8

DictSectPFC truncates at byte indexes, but this can fail when those are not character boundaries, because UTF-8 may use multiple bytes per character.

[...]
'byte index 29 is not a char boundary; it is inside 'Ö' (bytes 28..30) of `http://dbpedia.org/resource/Östergötland_County`', /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:175:49
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/panicking.rs:65:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/mod.rs:86:9
   4: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeFrom<usize>>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/traits.rs:370:21
   5: core::str::traits::<impl core::ops::index::Index<I> for str>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/traits.rs:65:9
   6: <alloc::string::String as core::ops::index::Index<core::ops::range::RangeFrom<usize>>>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/alloc/src/string.rs:2380:10
   7: hdt::dict_sect_pfc::DictSectPFC::locate_in_block
             at /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:175:49
   8: hdt::dict_sect_pfc::DictSectPFC::string_to_id
             at /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:113:23
   9: hdt::four_sect_dict::FourSectDict::string_to_id
             at /home/konrad/projekte/rust/hdt/src/four_sect_dict.rs:100:26
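
A minimal illustration of this bug class and a safe alternative that backs up to the nearest character boundary (str::is_char_boundary is stable Rust):

// slicing &iri[..29] would panic because byte 29 falls inside the two-byte
// 'Ö'; backing up to a character boundary makes the truncation safe
fn truncate_safe(s: &str, mut index: usize) -> &str {
    if index >= s.len() {
        return s;
    }
    while !s.is_char_boundary(index) {
        index -= 1;
    }
    &s[..index]
}

fn main() {
    let iri = "http://dbpedia.org/resource/Östergötland_County";
    assert_eq!(truncate_safe(iri, 29), "http://dbpedia.org/resource/");
}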

support stable release channel

Is it a dealbreaker for anyone that we only support nightly Rust?
Now that the rsdict dependency is gone, there are only two features that still require nightly Rust:

  • round_char_boundary
  • int_roundings

As these were used for an old Unicode handling method, we may be able to remove them.
