
hdt's Introduction

HDT


A Rust library for the Header Dictionary Triples compressed RDF format, including:

  • loading the HDT default format as created by hdt-cpp
  • efficient querying by triple patterns
  • serializing into other formats like RDF Turtle and N-Triples using the Sophia adapter

However it cannot:

  • load other RDF formats
  • load other HDT variants

For this functionality, and for an acknowledgement of all the original authors, please refer to the reference implementations in C++ and Java by the https://github.com/rdfhdt organisation.

It also cannot:

  • swap data to disk
  • modify the RDF graph in memory
  • run SPARQL queries

If you need any of those features, consider using a SPARQL endpoint instead.

Examples

use hdt::Hdt;

let file = std::fs::File::open("example.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
// query all triples with subject Leipzig and predicate major, any object
let majors = hdt.triples_with_pattern(Some("http://dbpedia.org/resource/Leipzig"), Some("http://dbpedia.org/ontology/major"), None);
println!("{:?}", majors.collect::<Vec<_>>());
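
Any combination of bound and unbound positions works the same way; a minimal sketch (the object IRI is a hypothetical example):

// object-only pattern: all triples with the given object
let cities = hdt.triples_with_pattern(None, None, Some("http://dbpedia.org/resource/Saxony"));
println!("{} triples with that object", cities.count());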

You can also use the Sophia adapter to load HDT files and reduce the memory consumption of an existing Sophia-based application; Sophia is re-exported as hdt::sophia:

use hdt::{Hdt,HdtGraph};
use hdt::sophia::api::graph::Graph;
use hdt::sophia::api::term::{IriRef, SimpleTerm, matcher::Any};

let file = std::fs::File::open("dbpedia.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
let graph = HdtGraph::new(hdt);
let s = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/resource/Leipzig".into()));
let p = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/ontology/major".into()));
let majors = graph.triples_matching(Some(s), Some(p), Any);
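
The returned triple source yields Result items; a minimal sketch for printing the matches (Debug output for the triple type is assumed):

for triple in majors {
    println!("{:?}", triple.expect("error fetching triple"));
}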

If you don't want to pull in the Sophia dependency, you can exclude the adapter:

[dependencies]
hdt = { version = "...", default-features = false }

There is also a runnable example in the examples folder, which you can run with cargo run --example query.

API Documentation

See docs.rs/latest/hdt or generate it yourself with cargo doc --no-deps without disabling default features.

Performance

The performance of a query depends on the size of the graph, the type of triple pattern and the size of the result set. When using large HDT files, make sure to enable the release profile, such as through cargo build --release, as this can be much faster than using the dev profile.
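
Beyond passing --release, the release profile can be tuned further in Cargo.toml; a hedged sketch with generic Cargo settings, not specific recommendations of this crate:

[profile.release]
lto = true         # enable link-time optimization
codegen-units = 1  # more thorough optimization at the cost of compile time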

Profiling

If you want to optimize the code, you can use a profiler. The provided test data is very small in order to keep the size of the crate down; locally modifying the tests to use a large HDT file returns more meaningful results.
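
Such a local modification could look like the following sketch, with a hypothetical path to a large HDT file:

#[test]
fn triples_large() {
    // hypothetical large file, not included in the crate
    let file = std::fs::File::open("/data/large.hdt").expect("error opening file");
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    // match everything so the profiler captures the full iteration cost
    assert!(hdt.triples_with_pattern(None, None, None).count() > 0);
}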

Example with perf and Firefox Profiler

$ cargo test --release
[...]
Running unittests src/lib.rs (target/release/deps/hdt-2b2f139dafe69681)
[...]
$ perf record --call-graph=dwarf target/release/deps/hdt-2b2f139dafe69681 hdt::tests::triples
$ perf script > /tmp/test.perf

Then go to https://profiler.firefox.com/ and open /tmp/test.perf.

Criterion benchmark

cargo bench --bench criterion

iai benchmark

cargo bench --bench iai

Comparative benchmark suite

The separate benchmark suite compares the performance of this and some other RDF libraries.

Community Guidelines

Issues and Support

If you have a problem with the software, want to report a bug or have a feature request, please use the issue tracker. If you have a different type of request, feel free to send an email to Konrad.

Citation

DOI

If you use this library in your research, please cite our paper in the Journal of Open Source Software. We also provide a CITATION.cff file.

BibTeX entry

@article{hdtrs,
  doi = {10.21105/joss.05114},
  year = {2023},
  publisher = {The Open Journal},
  volume = {8},
  number = {84},
  pages = {5114},
  author = {Konrad Höffner and Tim Baccaert},
  title = {hdt-rs: {A} {R}ust library for the {H}eader {D}ictionary {T}riples binary {RDF} compression format},
  journal = {Journal of Open Source Software}
}

Citation string

Höffner et al., (2023). hdt-rs: A Rust library for the Header Dictionary Triples binary RDF compression format. Journal of Open Source Software, 8(84), 5114, https://doi.org/10.21105/joss.05114

Contribute

We are happy to receive pull requests. Please run cargo fmt before committing, make sure that cargo test succeeds, and check that the code compiles on both the stable and nightly toolchains, with and without the "sophia" feature active. cargo clippy should not report any warnings.

hdt's People

Contributors

dependabot[bot], konradhoeffner, pchampin, remram44, timplication


hdt's Issues

rsdict simd feature fails to build with Rust nightly 1.78

The rsdict dependency "simd" feature depends on packed_simd, which is no longer available in nightly 1.78.
See sujayakar/rsdict#9 and rust-lang/packed_simd#359.
I tried to refactor rsdict but wasn't successful yet, see sujayakar/rsdict#10.

Options

  1. Disable the "simd" feature for now, at the cost of some speed.
  2. Warn users to stay on nightly 1.77 or lower until this is fixed in rsdict; this is inconvenient and can cause errors for users who don't notice the warning.
  3. Add a "simd" feature to hdt and pass it through to rsdict (see the sketch below).
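
Option 3 would look roughly like this in hdt's Cargo.toml (a hedged sketch; it assumes rsdict exposes the feature under this name):

[features]
simd = ["rsdict/simd"]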

Update: Option 1 was chosen, but this issue is kept open in case a future rsdict update allows reenabling the feature.

general simplification and optimization

As I'm not very experienced with Rust, there are surely many horribly inefficient parts in the code that could be simplified and optimized by someone more experienced.

further optimize triples_with_sp, triples_with_so and triples_with_po

Since 0.0.4 they are filtered before translation, which is already better than the Sophia default implementation, but they could be further optimized at the iterator level.
Either create new iterators or add parameters to the existing ones; see the sketch after the following list.

  • triples_with_sp
  • triples_with_so
  • triples_with_po
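
A conceptual sketch of the filter-before-translation idea on hypothetical integer ID triples (not the crate's actual internal types): resolve the subject and predicate to IDs once, filter at the ID level, and translate only the survivors to strings.

// keep only the ID triples matching a fixed subject and predicate ID,
// so that dictionary translation only happens for actual matches
fn filter_sp(
    ids: impl Iterator<Item = (usize, usize, usize)>,
    sid: usize,
    pid: usize,
) -> impl Iterator<Item = (usize, usize, usize)> {
    ids.filter(move |&(s, p, _)| s == sid && p == pid)
}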

extend HdtGraph

  • triples_with_sp
  • triples_with_so
  • triples_with_po
  • triples_with_spo not needed right now

Use new WaveletMatrix 0.0.6 construction method to reduce memory usage

See kampersanda/sucds#44.
Using a modified hdt::tests which loads lscomplete20143.hdt and then returns.

Before

Command being timed: "cargo test --release hdt::tests"
User time (seconds): 89.53
System time (seconds): 7.13
Percent of CPU this job got: 225%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:42.95
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3370252
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 770
Minor (reclaiming a frame) page faults: 3897847
Voluntary context switches: 11918
Involuntary context switches: 12515
Swaps: 0
File system inputs: 2541256
File system outputs: 435912
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

After

Command being timed: "cargo test --release hdt::tests"
User time (seconds): 18.03
System time (seconds): 1.48
Percent of CPU this job got: 115%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3380940
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 773459
Voluntary context switches: 2180
Involuntary context switches: 654
Swaps: 0
File system inputs: 648120
File system outputs: 18624
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

There is not much change in the maximum resident set size; however, it is also much lower than expected. Is /usr/bin/time not accurate? Try with heaptrack instead.

Document use within a SPARQL pipeline

It would be useful to document a few examples where the Rust hdt library is used within a full pipeline, starting from an HDT file (as generated by hdt-cpp) to SPARQL query results.

For example, the Python rdflib-hdt library wraps hdt-cpp, and the linked function is the point where triple pattern queries over the HDT file are consumed by the rdflib SPARQL query processor: https://github.com/RDFLib/rdflib-hdt/blob/master/rdflib_hdt/hdt_document.py#L114

Documenting how Rust hdt might provide triple pattern query results to a few separate SPARQL query engines would show users how the Rust hdt library can fit into a broader pipeline from data to SPARQL query results.

remove unnecessary enums

While it theoretically allows more freedom of implementation in the future, it is questionable whether that will ever be used and all the match statements are too verbose.

Editorial comments on JOSS paper

I'm reading through the JOSS submission now, and will post comments here.

On line 40, I cannot see that HDT-FoQ has been defined yet, so please introduce it here. I notice that it's introduced in the caption to Figure 2, but please do it in the main text also.


Create HDT files from RDF

It would be a valuable enhancement to create HDT files from RDF text serialization formats. A pure Rust implementation would be ideal, but a Rust wrapper over hdt-cpp would be a convenient alternative.

crates.io and test resources

Due to the test HDT file tests/resources/swdf.hdt (5.9 MB uncompressed), the hdt package uses 2.42 MB on crates.io, which is much more than other crates typically use.
While this is not much, the eternal nature of crates.io motivates saving space.
Thus, the filesize of the package should be significantly reduced before publishing a new version.

Options

  1. keep it like it is now
  2. use a smaller HDT test file
  3. download the test file when required so that users who don't need the tests don't download it
  4. automatically generate the test file

subject IDs off by one

[TripleId { subject_id: 0, predicate_id: 90, object_id: 13304 }, TripleId { subject_id: 0, predicate_id: 101, object_id: 19384 }, TripleId { subject_id: 0, predicate_id: 111, object_id: 75817 }, TripleId { subject_id: 1, predicate_id: 90, object_id: 19470 }, TripleId { subject_id: 1, predicate_id: 101, object_id: 13049 }, TripleId { subject_id: 1, predicate_id: 104, object_id: 13831 }, TripleId { subject_id: 1, predicate_id: 111, object_id: 75817 }, TripleId { subject_id: 2, predicate_id: 90, object_id: 19313 }]
sample [
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/barry-norton",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/reto-krummenacher",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/robert-isele",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://data.semanticweb.org/person/anja-jentzsch",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_3",
        "http://data.semanticweb.org/person/christian-bizer",
    ),
    (
        "_:b1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq",
    ),
    (
        "_:b10",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://data.semanticweb.org/person/raphael-troncy",
    ),
]

IDs should start at 1, but subject IDs start at 0, which offsets all subjects by one, except the first one, because IDs 0 and 1 both map to the first entry.
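
For illustration: HDT dictionary IDs are 1-based, with 0 reserved for unbound positions in triple patterns, so a correct translation indexes the section at id - 1. A minimal sketch with a hypothetical in-memory section:

// translate a 1-based dictionary ID to its string; ID 0 yields None
fn id_to_string<'a>(entries: &[&'a str], id: usize) -> Option<&'a str> {
    entries.get(id.checked_sub(1)?).copied()
}

Starting subject IDs at 0 instead shifts every lookup by one, which matches the duplicated _:b1 subjects in the sample output above.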

in-memory representation

The HDT reader hdt_reader.rs is optimized for reading all triples from a buffered reader, but there should be an in-memory representation that can be queried for triple patterns. This could also lead to simpler lifetimes than when using the reader.

add hdt2rdf command line application

Using Sophia, it should be quite easy to serialize a loaded HDT file into another format like RDF Turtle or N-Triples, which would allow a simple command line application like hdt2rdf in the C++ version.
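
A hedged sketch of such a tool, assuming the re-exported Sophia includes the N-Triples stringifier (exact paths and features may differ):

use hdt::{Hdt, HdtGraph};
use hdt::sophia::api::serializer::{Stringifier, TripleSerializer};
use hdt::sophia::turtle::serializer::nt::NtSerializer;

fn main() {
    let file = std::fs::File::open("example.hdt").expect("error opening file");
    let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    let graph = HdtGraph::new(hdt);
    // serialize the whole graph to N-Triples and print it
    let mut serializer = NtSerializer::new_stringifier();
    println!("{}", serializer.serialize_graph(&graph).expect("serialization error").as_str());
}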

add Python bindings?

It seems that with https://crates.io/crates/pyo3 one can use Rust code from Python.
rdflib, the fastest Python RDF library tested in the benchmark, had very high RAM usage and loading times.
It could thus be useful for Python users to have access to this library.
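
A hedged sketch of what a minimal pyo3 binding could look like (module and function names are hypothetical, and the triple item type is assumed to be convertible to strings):

use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::*;

// query an HDT file by triple pattern and return owned string triples
#[pyfunction]
fn query(path: &str, s: Option<&str>, p: Option<&str>, o: Option<&str>) -> PyResult<Vec<(String, String, String)>> {
    let file = std::fs::File::open(path)?;
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
    Ok(hdt.triples_with_pattern(s, p, o).map(|(s, p, o)| (s.to_string(), p.to_string(), o.to_string())).collect())
}

#[pymodule]
fn hdt_py(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(query, m)?)
}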

Cyrillic URIs not found?

When using the Sophia HDT adapter in RickView, a URI with suffix хобби-N-0 is not found:

[WARN ] No triples found for entry/хобби-N-0. Did you configure the namespace correctly?

However when using Turtle, it works:

[DEBUG] ruthes:entry/хобби-N-0 HTML 357.8µs

This could be an issue with Sophia itself, but that is unlikely because it works with the Sophia FastGraph when the Turtle file is loaded; it could also be an error in the Sophia adapter, which is part of hdt.

As a first step, a URI with a suffix like хобби-N-0 should be included in the test HDT file and the test suite.
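
A hedged sketch of such a test (the IRI is hypothetical and would first have to be added to the test file):

#[test]
fn cyrillic_iri() {
    let file = std::fs::File::open("tests/resources/swdf.hdt").expect("error opening file");
    let hdt = hdt::Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
    // hypothetical IRI containing Cyrillic characters
    let iri = "http://example.org/entry/хобби-N-0";
    assert!(hdt.triples_with_pattern(Some(iri), None, None).count() > 0);
}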

triple pattern queries

Right now the only option is to iterate over all triples, which is inefficient for large graphs.
Implement triple pattern queries and add tests.

  • triples_with_s
  • triples_with_o
  • triples_with_p

JOSS Review

Hi Konrad,

How do you feel about adding some examples to the paper? I think even just the code snippet (with some examples of triples that would be matched?) from the README would be useful to readers to quickly see how to use the library. This would also beef up the text section of the paper a little bit.

If possible, it would be nice to include this (or similar) as a cargo example (with dev dependencies) printing out some matched triples.

use hdt::{Hdt,HdtGraph};
use hdt::sophia::api::graph::Graph;
use hdt::sophia::api::term::{IriRef, SimpleTerm, matcher::Any};

let file = std::fs::File::open("dbpedia.hdt").expect("error opening file");
let hdt = Hdt::new(std::io::BufReader::new(file)).expect("error loading HDT");
let graph = HdtGraph::new(hdt);
let s = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/resource/Leipzig".into()));
let p = SimpleTerm::Iri(IriRef::new_unchecked("http://dbpedia.org/ontology/major".into()));
let majors = graph.triples_matching(Some(s),Some(p),Any);

// Maybe add a couple lines of expected output triples?

I think it might also be useful to summarize a few key benchmarking results/statistics in a table - it can be hard to interpret the graphs given the number of items being reproduced.

Last thing - I tried to clone and run your fork of the benchmarking repo, but ran into some compilation errors:

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:154:16
    |
154 |         1 => g.triples_matching(Any, Some(rdf::type_), Some(dbo_person)),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:155:16
    |
155 |         2 => g.triples_matching(Some(dbr_vincent), Any, Any),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |

error[E0599]: no method named `triples_matching` found for struct `HdtGraph` in the current scope
   --> src/main.rs:156:16
    |
156 |         3 => g.triples_matching(Some(dbr_vincent), Some(rdf::type_), Any),
    |                ^^^^^^^^^^^^^^^^ method not found in `HdtGraph`
    |
   ::: /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/sophia_api-0.8.0-alpha.0/src/graph.rs:153:8
    |
153 |     fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    |        ---------------- the method is available for `HdtGraph` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
1   | use hdt::sophia::sophia_api::graph::Graph;
    |
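
As the compiler output suggests, the fix is to bring the Graph trait into scope, matching the import used in the README examples above:

use hdt::sophia::api::graph::Graph;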

triples_with_p

Part of #3.
The test is failing; check whether the test itself or the results are broken.
cargo test triples::tests

fix bug with rickview and qbench2

  Compiling hdt v0.0.3 (/home/konrad/projekte/rust/hdt)
   Compiling rickview v0.0.5 (/home/konrad/projekte/rust/rickview)
    Finished dev [unoptimized + debuginfo] target(s) in 2.59s
     Running `target/debug/rickview`
[INFO ] Serving http://linkedspending.aksw.org/ at http://localhost:8080/
Start constructing wavelet matrix...finished constructing wavelet matrix with length 16144457
Constructing OPS index...finished constructing OPS index
HDT size in memory 248.2 MB, details:
Hdt {
    dict: FourSectDict {
        shared: total size 2.6 MB, sequence 26.8 KB with 9745 entries, 22 bits per entry, packed data 2.6 MB,
        subjects: total size 41.5 MB, sequence 193.0 KB with 59393 entries, 26 bits per entry, packed data 41.3 MB,
        predicates: total size 6.8 KB, sequence 48 B with 29 entries, 13 bits per entry, packed data 6.7 KB,
        objects: total size 59.5 MB, sequence 661.1 KB with 203407 entries, 26 bits per entry, packed data 58.8 MB,
    },
    triple_sect: Bitmap(
        total size 144.6 MB
        adjlist_y AdjList {
            sequence: 18.2 MB with 16144457 entries, 9 bits per entry,
            bitmap: 2.6 MB,
        }
        adjlist_z AdjList {
            sequence: 48.6 MB with 16213801 entries, 24 bits per entry,
            bitmap: 1.6 MB,
        }
        op_index total size 49.8 MB {
            sequence: 48.6 MB with 24 bits,
            bitmap: 1.1 MB
        }
        wavelet_y 23.8 MB,
    ),
}
thread 'actix-rt|system:0|arbiter:0' panicked at 'dictionary error in the shared section: index out of bounds: id 155895 > dictionary section len 437', /home/konrad/projekte/rust/hdt/src/hdt.rs:42:107
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'actix-rt|system:0|arbiter:1' panicked at 'dictionary error in the shared section: index out of bounds: id 155895 > dictionary section len 437', /home/konrad/projekte/rust/hdt/src/hdt.rs:42:107

DictSectPFC bug with handling UTF8

DictSectPFC truncates at byte indexes, but this can fail when those are not character boundaries, because UTF-8 may use multiple bytes per character.

[...]
'byte index 29 is not a char boundary; it is inside 'Ö' (bytes 28..30) of `http://dbpedia.org/resource/Östergötland_County`', /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:175:49
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/panicking.rs:65:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/mod.rs:86:9
   4: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeFrom<usize>>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/traits.rs:370:21
   5: core::str::traits::<impl core::ops::index::Index<I> for str>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/core/src/str/traits.rs:65:9
   6: <alloc::string::String as core::ops::index::Index<core::ops::range::RangeFrom<usize>>>::index
             at /rustc/b7bc90fea3b441234a84b49fdafeb75815eebbab/library/alloc/src/string.rs:2380:10
   7: hdt::dict_sect_pfc::DictSectPFC::locate_in_block
             at /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:175:49
   8: hdt::dict_sect_pfc::DictSectPFC::string_to_id
             at /home/konrad/projekte/rust/hdt/src/dict_sect_pfc.rs:113:23
   9: hdt::four_sect_dict::FourSectDict::string_to_id
             at /home/konrad/projekte/rust/hdt/src/four_sect_dict.rs:100:26
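
A minimal illustration of this bug class and a safe alternative that backs up to the nearest character boundary (str::is_char_boundary is stable Rust):

// slicing &iri[..29] would panic because byte 29 falls inside the two-byte
// 'Ö'; backing up to a character boundary makes the truncation safe
fn truncate_safe(s: &str, mut index: usize) -> &str {
    if index >= s.len() {
        return s;
    }
    while !s.is_char_boundary(index) {
        index -= 1;
    }
    &s[..index]
}

fn main() {
    let iri = "http://dbpedia.org/resource/Östergötland_County";
    assert_eq!(truncate_safe(iri, 29), "http://dbpedia.org/resource/");
}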

support stable release channel

Is it a dealbreaker for anyone that we only support nightly Rust?
Now that the rsdict dependency is gone, there are only two features that still require nightly Rust:

  • round_char_boundary
  • int_roundings

As these were used for an old Unicode handling method, we may be able to remove them.
