pchampin / sophia_rs Goto Github PK

View Code? Open in Web Editor NEW

214.0 13.0 23.0 3.02 MB

Sophia: a Rust toolkit for RDF and Linked Data

License: Other

Rust 100.00%

sophia_rs's Introduction

Sophia

A Rust toolkit for RDF and Linked Data.

It comprises the following crates:

sophia_api defines a generic API for RDF and linked data, as a set of core traits and types; more precisely, it provides traits for describing
- terms, triples and quads,
- graphs and datasets,
- parsers and serializers
sophia_iri provides functions, types and traits for validating and resolving IRIs.
sophia_inmem defines in-memory implementations of the Graph and Dataset traits from sophia_api.
sophia_term defines various implementations of the Term trait from sophia_api.
sophia_turtle provides parsers and serializers for the Turtle-family of concrete syntaxes.
sophia_xml provides parsers and serializers for RDF/XML.
sophia_jsonld provides preliminary support for JSON-LD.
sophia_c14n implements RDF canonicalization.
sophia_resource provides a resource-centric API.
sophia_rio is a lower-level crate, used by the ones above.

and finally:

sophia is the “all-inclusive” crate, re-exporting symbols from all the crates above. (actually, sophia_xml is only available if the xml feature is enabled)

In addition to the API documentation, a high-level user documentation is available (although not quite complete yet).

Licence

CECILL-B (compatible with BSD)

Citation

When using Sophia, please use the following citation:

Champin, P.-A. (2020) ‘Sophia: A Linked Data and Semantic Web toolkit for Rust’, in Wilde, E. and Amundsen, M. (eds). The Web Conference 2020: Developers Track, Taipei, TW. Available at: https://www2020devtrack.github.io/site/schedule.

Bibtex:

@misc{champin_sophia_2020,
        title = {{Sophia: A Linked Data and Semantic Web toolkit for Rust},
        author = {Champin, Pierre-Antoine},
        howpublished = {{The Web Conference 2020: Developers Track}},
        address = {Taipei, TW},
        editor = {Wilde, Erik and Amundsen, Mike},
        month = apr,
        year = {2020},
        language = {en},
        url = {https://www2020devtrack.github.io/site/schedule}
}

Third-party crates

The following third-party crates are using or extending Sophia

hdt provides an implementation of Sophia's traits based on the HDT format.
manas is a modular framework for implementing Solid compatible servers
nanopub is a toolkit for managing [nanopublications(https://nanopub.net/)

History

An outdated comparison of Sophia with other RDF libraries is still available here.

sophia_rs's People

Contributors

Stargazers

Watchers

sophia_rs's Issues

BNode scope when loading triples/quads into graph/dataset

Problem

Currently, when inserting bnode identifies in a graph, the bnode identifier is kept as is.

For example, loading this file into a graph:

_:b1 <tag:p> "foo".

then, loading this file into the same graph:

_:b1 <tag:q> "bar".

will result in the following graph (in Turtle):

  [] <tag:p> "foo"; <tag:q> "bar".

while it should be

  [] <tag:p> "foo".
  [] <tag:q> "bar".

i.e. two different subjects, because the bnode identifiers in the two different files have two different scopes.

NB: it is important for the developer to be able to handle bnodes consistently, so at the lowest level (e.g. Graph::insert), the API should consider bnode identifiers as stable. But on the other hand, the default behaviour when loading a file should be the correct one.

Proposed solution

The methods TripleSource.in_graph and QuadSource.in_dataset are the preferred way of loading a stream of triples/quads (such as the one coming from a parser) into a graph/dataset.

The proposed solution is to change the semantics of these methods, and make them rename the bnodes they receive to avoid name-clashes with existing bnodes in the graph/dataset. Whether this should be done by generating UUIDs or inspecting the target graph/dataset for existing name, I'm not sure yet...

New methods in_graph_raw and in_dataset_raw (better name?) should probably be added, which would have the current semantics of in_graph and in_dataset.

Add HDT serializer

A serializer implementation for the HDT would be nice.

Introduce trait `TermData`

In #5 @MattesWhite suggested to introduce the super-trait TermData for convienence:

/// Supertrait for all properties data of a `Term<T>` must provide.
pub trait TermData: AsRef<str> + Clone + Eq + Hash {}
impl<T> TermData for T where T: AsRef<str> + Clone + Eq + Hash {}

That would indeed simplify a lot of code.

Roadmap?

Does Sophia have a “road map” document covering what has been done so far and what is planned for future versions?

Add SPARQL support

add a (set of) type(s) for representing a SPARQL abstract query
add a parser for building a SPARQL abstract query (see above) from a string in the SPARQL syntax
incrementally implement the SPARQL algebra (this item should probably be split into sub-items)

NB: since sophia uses a generalized RDF model (including variables), a Graph can also be used as a basic graph pattern. The query module contains a preliminary implementation of this idea.

Add JSON-LD Support

This is a feature request. I'll gladly contribute the work when I get the time, but I figured I should go ahead and open the issue in case others want to contribute it. Or maybe there are existing solutions.

Background

This crate seems to be the most mature and most recently maintained RDF related crates. So, I'm hoping to use this in an LDP server I'm creating. However, JSON-LD is a required format in LDP servers.

The Request

Add JSON-LD as a supported format for parsing and serializing graphs

Collection parsing fails with XML

I am having a problem with parsing of collections.

Consider the following RDF:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:ex="http://example.org/stuff/1.0/">

  <rdf:Description rdf:about="http://example.org/basket">
    <ex:hasFruit rdf:parseType="Collection">
      <rdf:Description rdf:about="http://example.org/banana"/>
      <rdf:Description rdf:about="http://example.org/apple"/>
      <rdf:Description rdf:about="http://example.org/pear"/>
      <!--rdf:Description rdf:about="http://example.org/conference_pear">
      </rdf:Description-->
    </ex:hasFruit>
  </rdf:Description>
</rdf:RDF>

When I parse this using horned-triples (my own code):

https://github.com/phillord/horned-owl/blob/master/src/bin/horned-triples.rs

I get:

http://example.org/basket
	http://example.org/stuff/1.0/hasFruit
	n0
n0
	http://www.w3.org/1999/02/22-rdf-syntax-ns#first
	http://example.org/banana
n0
	http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
	n1
n1
	http://www.w3.org/1999/02/22-rdf-syntax-ns#first
	http://example.org/apple
n1
	http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
	n2
n2
	http://www.w3.org/1999/02/22-rdf-syntax-ns#first
	http://example.org/pear
n2
	http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
	http://www.w3.org/1999/02/22-rdf-syntax-ns#nil

which seems correct. But remove the commented XML which should add
another fruit and I get:

http://example.org/basket
	http://example.org/stuff/1.0/hasFruit
	http://example.org/conference_pear

Alas my basket is empty! I was expecting it to be one longer.

W3C tests don't pass

failures:

---- parser::nt::test::w3c_test_suite stdout ----
thread 'parser::nt::test::w3c_test_suite' panicked at 'rdf-tests/ntriples not found, can not check W3C test-suite', sophia/src/parser/nt.rs:597:17
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- parser::nt::test::w3c_test_suite_generalized stdout ----
thread 'parser::nt::test::w3c_test_suite_generalized' panicked at 'rdf-tests/ntriples not found, can not check W3C test-suite', sophia/src/parser/nt.rs:632:17

Who to get w3c_test_suite ?

Have IriData implement PartialOrd

I need to match a number of triples against patterns. But the semantics of RDF make this painful as triples are not naturally ordered. It would make life easier if I could sort the triples first so that they would come out in a predictable order.

This is easiest enough to achieve with IriData by simply calling to_string and comparing these, but there is no real reason to make that String when we could compare the underlying data.

The same would apply for Literal and Variable also.

How to handle read-only Terms

During #55 the question arose how should read-only Term references be handled.

Note: This only affects situation where a Term is passed in to check and compare the contents of it. Not when the Term (reference) is passed in to be copy/clone in some way.

Currently, there are two approaches in sophia:

Take a reference to a monomorphized Term, e.g. Graph::triples_with_spo<'s, T, U, V>(&'s self, s: &'s Term<T>, p: &'s Term<U>, o: &'s Term<V>) -> GTripleSource<'s, Self>.
This version allows to directly pass references without the need to create an intermediate representation. However, this can increase compile times and code size significantly.
Take a RefTerm.
This prevents monomorphization. However, it requires to build an intermediate, 76 Byte big RefTerm.

An alternative approach would be having a TermTrait, so we could pass &dyn TermTrait in such situations. This would prevent monomorphization and does not require an intermediate RefTerm. However, a definition of such a trait is still to be figured out.

Move from `coercible-error` to a dedicated `StreamError`

This issue was raised in the discussion about #8 . The argument was that

coercible-error is unidiomatic, and
it advocates a unique error type, which is not approrpiate with the plan to split Sophia into semi-independant crates (#26).

I'll copy the relevant parts of the discussion below, for the sake of clarity.

XML Parser fails on large file

I have been trying out the XML parser on a large file. Even after an elongated period, it fails to parse, where the turtle parser succeeds.

As my large file I have been using the Gene Ontology available at:

http://purl.obolibrary.org/obo/go.owl

(The ttl version I have had to convert from this using the OWL API; I can put it somewhere if it is helpful).

The ttl version runs in 7 seconds, the XML version, I do not know whether it is stalling or just slow, because I have not had it complete yet.

fn main() -> Result<(),Error> {
    let input = "/home/phillord/scratch/go.ttl";
    //let input = "/home/phillord/scratch/go.owl";

    let file = File::open(input)?;
    let bufreader = BufReader::new(file);
    let triple_source = sophia::parser::turtle::parse_bufread(bufreader);
    //let triple_source = sophia::parser::xml::parse_bufread(bufreader);
    println!("collecting");
    let start = Instant::now();
    let graph: LightGraph = triple_source.collect_triples().unwrap();
    println!("{}: {:?}", graph.len(), start.elapsed());

    Ok(())
}

      Finished release [optimized] target(s) in 3.77s
     Running `target/release/horned-temp`
collecting
1431737: 7.743499503s

Real Examples

Looking at these examples:
https://docs.rs/sophia_api/0.6.2/sophia_api/graph/trait.Graph.html#examples

I ask myself.. how to actual do something there?
Though I am a seasoned C/C++/Java/... dev, I am new to rust. I got some stuff running that I wrote myself, but trying to use this library knocked me out. everything I try in this for loop, fails. I do not think it is meaningful to ask for specific things, as I would just run into the next one directly. It would be nice to have an example that actually does something with the data, like store into a vec or map.
Even better, a full tutorial!

.. or at least pointers to code - anywhere - that uses the library.

Non-deterministic test failure

At commit b113fc4, one of the tests (see below) is passing most of the times, but failing every now and then. It seems to be always the same test, and always at the same point.

Since it is non-deterministic, it is likely due to an unsafe block code. I'm also considering a wrong ref-counting in TermIndexMapU. The fact that it happens in the default_graph test, and not in any other test involving TermIndexMapU, might indicate that it has to do with the special treatment GraphId::Default has in TermIndexMapU.

Test id: graph::adapter::test::dataset::default_graph::test_retain
Stack trace:

   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:208
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::panicking::panic
             at src/libcore/panicking.rs:49
   9: core::option::Option<T>::unwrap
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/macros.rs:12
  10: <sophia::graph::inmem::_term_index_map_u::TermIndexMapU<T,F> as sophia::term::index_map::TermIndexMap>::dec_ref
             at sophia/src/graph/inmem/_term_index_map_u.rs:144
  11: <sophia::dataset::inmem::_hash_dataset::HashDataset<I> as sophia::dataset::indexed::IndexedDataset>::remove_indexed
             at sophia/src/dataset/inmem/_hash_dataset.rs:158
  12: <sophia::dataset::inmem::_hash_dataset::HashDataset<I> as sophia::dataset::_traits::MutableDataset>::remove
             at sophia/src/dataset/indexed.rs:102
  13: <sophia::dataset::adapter::DatasetGraph<D,E,sophia::term::graph_id::GraphId<F>> as sophia::graph::_traits::MutableGraph>::remove
             at sophia/src/dataset/adapter.rs:151
  14: <sophia::graph::_sinks::Remover<G> as sophia::triple::stream::TripleSink>::feed
             at sophia/src/graph/_sinks.rs:60
  15: sophia::triple::stream::TripleSource::in_sink
             at sophia/src/triple/stream.rs:55
  16: sophia::graph::_traits::MutableGraph::remove_all
             at sophia/src/graph/_traits.rs:450
  17: sophia::graph::_traits::MutableGraph::retain
             at sophia/src/graph/_traits.rs:524
  18: sophia::graph::adapter::test::dataset::default_graph::test_retain
             at sophia/src/graph/test.rs:261
  19: sophia::graph::adapter::test::dataset::default_graph::test_retain::{{closure}}
             at sophia/src/graph/test.rs:256
  20: core::ops::function::FnOnce::call_once
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/ops/function.rs:231
  21: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/liballoc/boxed.rs:704
  22: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:85
  23: test::run_test::run_test_inner::{{closure}}
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/panicking.rs:272
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/panic.rs:394
             at src/libtest/lib.rs:1468

Change `Term::batch_join` into something more usable

Currently, there are two ways to resolve (possibly relative) terms agains a base IRI:

base.join for one-shot operations,
base.batch_join for multiple operations.

The rationale for this design is that joining requires the structure of the base IRI to be parsed, and we do not store that structure in IRI Terms (to save space -- not all IRIs in a graph are expected to be used as base IRIs). batch_join allows to perform the parsing only once, and resolve multiple terms in a row.

A possibly better design would be to have a dedicated type, storing the parsed structure of an IRI, and providing methods to resove terms (and why not, whold triples, quads, or even graphs and datasets?). Term::batch_join would then be replaced by a Term::to_base method, returning such an object. This would allow for this kind of pattern:

let base = some_iri_term.to_base();
let abs_triples = some_triple_source.map_triples(|t| base.join_triple(&t));

Ensure doc coverage using #![deny(missing_docs)]

#![deny(missing_docs)] is a macro that can be placed in any module. Using it results in compiler errors if a public element is not documented (///).

While this might lead to a number of 'obvious' documentations it is still a good measure to ensure a thorough documentation of the whole crate. In addition, it forces the developer to think about the users point of few on an element.

Solving this issue consists in adding #![deny(missing_docs)] incrementally to each and every module, until all are documented. Then, we can keep only one directive at the top-level (lib.rs).

Get rid of the lifetime parameters on Triple, Graph and Dataset

They are causing much complexity, especially the requirement for using higher-ranked type bound in some situations.

It might be better to give up some genericity in order to make the API easier to use.

Provide a simpler parser API

This issue contributes to #23

For me the parser API of sophia is a bit confusing with its macros and Config.

I suggest the introduction of a Parser-trait:

pub trait Parser {
    fn parse_str<'src>(&'src mut self, input: &'src str) -> 
        Result<Box<dyn Iterator<Item=Result<[Term<Cow<'src, str>>; 3]>> + 'src>>;
    fn parse_read<'src, R: 'src + Read>(&'src mut self, input: &mut R) -> 
        Result<Box<dyn Iterator<Item=Result<[Term<Cow<'src, str>>; 3]>> + 'src>>;
}

For Quads a similar trait respectively

Allows Parsers to store state so they can lazily parse.
General enough for all serialization formats in question.
Easy for beginners to understand.
Result implements TripleSource

An implementation for the N-Triples parser, e.g.:

mod nt {
  #derive[default]
  struct Parser {
    strict: bool;
  }

  impl Parser {
    fn new_strict() -> Self { Self { strict: true, } }
  }
 
  impl Parser for Parser {
    ...
  }
}

Do we still need `TripleSink` and `QuadSink`?

In #37 @MattesWhite points out that sinks are not very idiomatic.

I introduced them to be able to consume sources, and I introduced sources because iterators had limitation (see the stream module documentation).

But we could reuse the notion of streaming mode, introduced in the Graph and Dataset traits, to make sources consumable in a more idiomatic way, e.g. with a for_each_triple method (resp. for_each_quad)...

Let's investigate that...

Generated tests of namespace! macro

Hi,

I've observed some strange behaviour executing generated tests created by the namespace! macro. I have the following code to create a skos namespace:

#[macro_use]
extern crate sophia_api;

pub mod skos {
    namespace!(
        "http://www.w3.org/2004/02/skos/core#",
        Concept,
        prefLabel,
        altLabel,
        hiddenLabel
    );
}

The code works fine in production, but the tests are failing completely randomly:

cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running target/debug/deps/skos_test-55fef88c12ab2b53

running 5 tests
test skos::test_valid_iri::Concept ... ok
test skos::test_valid_iri::hiddenLabel ... ok
test skos::test_valid_iri::prefLabel ... ok
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', library/test/src/lib.rs:356:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'skos::test_valid_iri::altLabel' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', library/test/src/lib.rs:592:30
error: test failed, to rerun pass '--lib'

Same code, no changes:

cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running target/debug/deps/skos_test-55fef88c12ab2b53

running 5 tests
test skos::test_valid_iri::altLabel ... ok
test skos::test_valid_iri::prefLabel ... ok
test skos::test_valid_iri::Concept ... ok
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', library/test/src/lib.rs:356:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: test failed, to rerun pass '--lib'

The namespace! macro was imported via sophia_api.

Any ideas why the tests are failing?

[discussion] Help make sophia a common RDF API for Rust

The design of Sophia emphasizes genericity. The goal was, from the start, to allow multiple implementations of the provided traits to coexist, not even necessarily inside the sophia crate itself.

The goal of this issue is to foster discussion on what is required to achieve this. Are there any design choices in Sophia's traits or underlying types which you find too opinionated or constraining? Are they too complex to be widely adopted?

Make (at least) 2 separate crates

At the very least

sophia_api containing the basic types and generic traits
sophia depending on the former, and containing everything else
change the licence of sophia_api to a more permissive one (Apache or BSD?)
Later, I might split the remaining sophia crate into more fine-grained crates...

NB: this issue was created following the discussion at #23 (comment)

Use sophia to connect to a SparQL db (in my case AnzoGraph)

Hi!

Thank you for this library!

I am working on a graphQL -> sparQL translator in Rust (https://github.com/007vasy/warpgrapher?organization=007vasy&organization=007vasy) and I having a hard time figuring out how to use sophia to connect to a database and execute the queries againts it. Could you give some pointers?

Make sophia ready to support external serializers

For now sophia only provides serializers for N-Triples and N-Quads. In order to support further serializers, we must develop an API for them, i.e. traits and some utilities. This contributes to #23.

When ever triples and graphs are mentioned those can be replaced by quad and dataset respectively. If not mentioned otherwise.

Steps towards an improved serializer API

Make the internals of Term::Literal an own type like IriData or BNodeId. This way it is easier to implement specific serializations for literals.

Remove the n3()-method from Term. It can be replaced by introducing a trait, e.g.:

trait NtSerializableTerm {
    fn nt(&self) -> String;
}

impl NtSerializableTerm for Term {
    fn nt(&self) -> String {
        crate::serializer::nt::stringify_term(self)
    }
}

Or at least rename the method properly.

Provide a trait Serializer or several specific one, e.g. TripleSerializer and GraphSerializer. The first for streams the second for more sophisticated actions (see #17). Some symmetry to the Parser trait would be nice.
Introduce the changes to the NT-serializer.

All points are open to discussion.

`percent_encoding` becomes private in `url`>=2.0.0

...and the XML parser uses it, which prevents us from upgrading the dependency to url beyond 1.7.2. @althanos, could you please have look at it and see if we can get rid of this? Thanks in advance...

Promote sophia in the Rust community

Today sophia is at an incredible state and the foundations for a powerful RDF framework for Rust are mostly done. However, while there is an increasing amount of people opening issues. The downloads of sophia on crates.io are still under a thousand. I assume we could improve the popularity and attract contributors to sophia or even getting people to add crates to the sophia-ecosystem if we promote sophia more in the Rust community.

For me a start would be:

Announcement post for sophia v0.6 at URLO.
Suggestion for Crate of the Week.
Post at Call for Participation.

The later two have a chance to be included into This Week in Rust which is at least for me the mandatory weekly update concerning Rust.

If you wish I could write the posts I don't think this could cause any harm, just wanted to ask for permission first.

How to get more information from parsers?

Beyond the parsed triples/quads, parsers may collect additional useful information, for example prefix declarations or base IRI. What would be the best API to get this information?

My initial idea was to add methods to the triple/quad source returned by the parse methods, to access this information. For example:

    let parsed_triples = sophia::parser::turtle::parse_bufread(ttl_file)?;
    my_graph.insert_all(&mut parsed_triples);
    let prefix_map = triples.get_prefix_map();

The drawback of this approach is that it forces us to keep the triple source, even when it is exhausted. Method such as Graph::insert_all can not consume it, they have to borrow it mutably (which is rather counter intuitive).

Another approach would be to use a kind of callback:

    let parsed_triples = sophia::parser::turtle::parse_bufread(ttl_file)?;
    let mut prefix_map = HashMap::new();
    parsed_triples.on_prefix(|prefix, iri| prefix_map.insert(prefix, iri));
    my_graph.insert_all(parsed_triples);

This approach might be slightly harder to implement, but offers more flexibility. And it makes it possible to consume sources while still getting the additional information.

@Tpt @MattesWhite any thought?

Add 'help wanted' issues

sophia is the most mature and active developed crate for RDF and linked data at the moment.
However, there is still a long way to go.
Saddly, it seems you don't have much time to work on it.

I realy want this crate to succeed and would like to contribute to it.
But I don't want to mess up your plans.
Therefore, I suggest that you add add some 'help wanted' or 'todo' issues or maybe some milestones to define the long-term goals (like adding SPARQL support?).
This will make it easier for me and others to contribute to this crate.

Change license?

I'm interested in contributing to Sophia, either

Working on implementing some of the JSON-LD processing algorithms (#16)
Working on carving out the Oxigraph SPARQL parser maybe into another crate (if @Tpt is ok with that) so it is compatible with Sophia types, and trying to get basic queries working over Sophia graphs (#19)
Adding persistent backends (for me, FoundationDB is the most interesting and immediately useful) (#22)

Although I totally understand and support the aims of copyleft licenses, I think that Sophia and the rest of the RDF ecosystem in Rust could gain much more adoption more quickly if it adopted a permissive license.

See #23 (comment).

Although it's not a blocker, it would personally make me more eager to contribute if the entire project used a permissive license (e.g. Apache 2.0 like Oxigraph does)

Also as a side note, I recently did this: https://github.com/alexkreidler/rust-iri-benchmarks, which may be silly, but I thought interesting to see how many different groups in the Rust RDF ecosystem were re-implementing the same thing differently.

That's why I hope Sophia and Oxigraph can achieve the goals of #23. I think that changing the license is a first step that could really make a difference.

How to use the Graph.iter_* interface?

Hi,
I'm using this crate a lot for an IoT application and realy like it. However, I run into problems when iterating a Graph.

The core idea I want to accomplish is to build structs from the contents of a Graph. Therefore I defined following trait:

pub trait FromGraph<'a, T, G>: Sized
where
    T: Borrow<str>,
    G: Graph<'a>,
    MyError: From<<G as Graph<'a>>::Error>,
{
    fn from_graph(s: &'a Term<T>, graph: &'a G) -> MyResult<Self>;
}

This works fine until I want to parse nested Structs, e.g. I have the following graph:

@prefix rdf: <http://www.w3.org/...#> .
@base <http://example.org/> .

EntityA a A ;
  rdf:value 42 .
EntityB a B ;
  hasA EntityA ;
  rdf:value 24 .

To use those in my code I've written the following in-Rust-representation:

lazy_static! {
    static ref HAS_A: Term<&'static str> = unsafe{ Term::new_iri_unchecked("http://example.org/hasA", Some(true)) };
}

#[derive(Debug, Clone, Copy)]
struct A {
    value: i32,
}

impl<'a, T, G> FromGraph<'a, T, G> for A
where
    T: Borrow<str>,
    G: Graph<'a>,
    MyError: From<<G as Graph<'a>>::Error>,
{
    fn from_graph(s: &'a Term<T>, graph: &'a G) -> MyResult<Self> {
        let t_value = graph.iter_for_sp(s, &rdf::value).last().ok_or(MyError)??;
        let t_value = t_value.o();
        let value = t_value.value().parse::<i32>()?;
        Ok(A { value })
    }
}

#[derive(Debug, Clone, Copy)]
struct B {
    a: A,
    value: i32,
}

impl<'a, T, G> FromGraph<'a, T, G> for B
where
    T: Borrow<str>,
    G: Graph<'a>,
    MyError: From<<G as Graph<'a>>::Error>,
{
    fn from_graph(s: &'a Term<T>, graph: &'a G) -> MyResult<Self> {
        let t_a = graph.iter_for_sp(s, &HAS_A).last().ok_or(MyError)??;
        let t_a = t_a.o(); // here is where the error occures
        let a = A::from_graph(t_a, graph)?;

        let t_value = graph.iter_for_sp(s, &rdf::value).last().ok_or(MyError)??;
        let t_value = t_value.o();
        let value = t_value.value().parse::<i32>()?;
        Ok(B { a, value })
    }
}

The error I get is t_a does not live long enough

   |
64 | impl<'a, T, G> FromGraph<'a, T, G> for B
   |      -- lifetime `'a` defined here
...
72 |         let t_a = t_a.o(); // here is where the error occures
   |                   ^^^----
   |                   |
   |                   borrowed value does not live long enough
   |                   argument requires that `t_a` is borrowed for `'a`
...
79 |     }
   |     - `t_a` dropped here while still borrowed

Do you have any idea how I could get this working? Maybe an API change could solve this?

In the end I want to accomplish something serde-flavored like:

#[derive(FromGraph)]
struct A {
  #[predicate(rdf:value)]
  value: i32,
}

#[derive(FromGraph)]
struct B {
  #[predicate(HAS_A)]
  a: A,
  #[predicate(rdf:value)]
  value: i32,
}

Any plans to use core stable instead of nightly?

Hi folks,

this is a really nice project!

I just found out, that I need to build 'sophia_iri' with Rust nightly.
Do you have any plans to use Rust stable instead of nightly?

Best wishes,
huhn

Provide a common abstraction of TripleSource and QuadSource

This PR originates from the discussion in #49 starting with this comment.

Summary

It would be beneficial to have convenience adapters for general tasks on *Sources, e.g.:
- rename bnodes into fresh ones
- replace bnodes with variables / variables with bnodes
- change absolute IRIs to relative ones (given a base IRI) (opposite of resolve)
- replace variables by their bound values (given a mapping)
- resolve IRIs given a base IRI
The current crate structure required to duplicate much of the code for adapters (and streaming_mode).

Problems

Implementing traits with associated traits does not consider trait bounds. Meaning is there is an adapter Normalizer<S> one can only implement either Iterator<Item = Triple> for Normalizer<S: TripleSource> or Iterator<Item = Quad> for Normalizer<S: QuadSource>. Both together would result in conflicting implementation.
Splitted implementations violate privacy bounds. Declaring a struct in crate::triple::stream and implementing QuadSource in crate::quad::stream requires the fields to be pub(crate) at least.

Solutions

For 1.: Maybe some overall abstraction for statements?

For 2.: Merge the stream submodules into a top-level module stream equal to triple and quad.

Provide a generic implementation of the RDF test suite

sophia provides traits for parsers and graphs. Accordingly, it should be possible to write a function that automatically runs the RDF test suite for the given format.

Following steps are required to do so:

implement a function to check if two graphs are isomorphic.
- I think we can borrow the algorithm from rio @Tpt ?
write a framework that executes the different test cases.
execute tests based on a manifest
collect test results in a graph.
write a function to start evaluation, e.g.
```
fn evaluate_parser<P: Parser>(parser_factory: Fn() -> P, directory: Path) -> HashGraph
```
- Of course their should also be a evaluate_quad_parser() function.

Except for the isomorphism everything can be feature flagged as it's normally not used.

Having the test suite available would make it easy for implementors to check their crates. In addition, we could analyse crates that have a sophia-wrapper and/or implement sophia::parser::TripleParser/QuadParser, e.g. rio.

Provide some "Getting Started" documentation

This create looks really promising for my Linked Data project. However, I'm really unsure about how to get started using it.

It would be really useful to have some examples of doing the most common things in the rust docs. For example: How to create a graph, add some triples, and write it as turtle to a file; How to read a turtle file in, parse it as a graph, then modify it's contents.

I'd be happy to contribute the documentation, if I can get some pointers on how to do it :)

XML Serialization!

As far as I can see there isn't an XML serializer available at the moment. This would be a nice to have.

[Discussion] Keep IRI as ns+suffix?

Topic

This issue's purpose is to discuss if it is beneficial to keep the current implementation of IRI. This discussion arose from #55.

Explanation

sophia's implementation of IRIs contains two elements namespace ns and an optional suffix. This means that an IRI is either represented as a whole in the namespace field or as namespace and suffix like a CURIE. This is different to other RDF libraries like rio where IRIs are always represented by a single string. The question is:

Is it beneficial to keep the current implementation of IRIs with separated namespace and suffix?

Discussion

Pro - less memory consumption in `Graph`s

The current implementation is beneficial when storing terms in, for example, a Graph with reference(-counted) TermData where namespaces must only kept once in memory while the whole string solution would require to copy namepaces over and over again. This reduces the overall memory consumption of sophia.

Pro - cheap `Namespace::get()`

With the current implementation it is easy and cheap to create a Namespace and get() suffixed IRIs from it.

Con - consume more stack-memory

On the other hand keeping space for an optional suffix means that sophia's IRIs take more stack-memory (nearly twice as much) as when the IRI would be represented in a single place. This makes it costly to create short lived references to terms like RefTerm.

Con - prefixes are part of syntax

The usage of prefixes/CURIEs while used in nearly all RDF serialization is part of this formats and not of actual RDF itself.

Con - resolving kills suffixes

When a relative IRI is resolved it's suffix gets lost in the process anyway.

Contribution

I invite you to join the discussion. Best tag your answers with either ### Pro - ..., ### Con - ... or ### Conclusion - ... and answer to particular points with ### Regarding - Pro/Con - ..., so it's easier to keep track of pros, cons and opinions.

Handling intermediate Strings

In some situations intermediate Strings are allocated and then transformed into TermData, e.g. normalization and resolving of IRIs.

Issue

I'm currently working on metis for an example for #55. I'd like to use CowTerm for my parser. This means that if an absolute IRI is parsed I have Cow::Borrowed and I would like it to remain Cow::Borrowed after resolving against the base IRI. On the other hand when a relative IRI is parsed the Cow::Borrowed should be turned into Cow::Owned after resolution. This means that I have to know if a &str comes from the original to track lifetimes or if I get a newly allocated String.

Discussion of Solutions

Change signature of Resolve<Iri> (and of normalization) to:

impl<'a, 'td, TD, TD2> Resolve<Iri<TD>, Iri<TD2>> for IriParsed<'a>
where
    TD: 'td + TermData,
    TD2: TermData + From<&'td str> + From<String>,
{ ... }

This, however, prevents passing in a TermFactory. So maybe it would be nice to have another Resolve.

Add resolve_with():

trait ResolveWith<S, STD, T = S, TTD = STD> {
    fn resolve_with<'td, B, O>(&self, other: &'td S, borrowed: B, owned: O) -> T 
    where
        STD: 'td + TermData,
        B: FnMut(&'td str) -> TTD,
        O: FnMut(String) -> TTD;

This adds complexity and is maybe to much for the single use case of Cow as this is not that important for other TermData. Besides a default implementation:

/// pseudo Rust
impl<'td, S, STD, T, TTD> Resolve<S, T> for ResolveWith<S, STD, T, TTD>
where
    STD: 'td,
    TTD: From<&'td str> + From<String>,
{
    fn resolve(&self, other: &S) -> T {
        self.resolve_with(other, Into::into, Into::into)
    }
}

Should be possible.

What do you think? Is this case important enough to add such complexity?

Percent-escaped IRI are considered invalid

Hi!

While running tests using the rdf-tests I stumbled upon this issue, where valid percent-encoded IRI would be considered invalid: for instance, parsing rdf-charmod-uris/test002.nt will fail because http://example.org/#Andr%C3%A9 is considered invalid.

Is my use case a good fit for sophia?

In the scope of the Rust ecosystem, is Sophia a good choice for serializing Hydra into JSON-LD?

My end goal is to create a wrapper around an existing Rust web framework (maybe Warp?) that makes it easy to generate Hydra JSON-LD documentation automatically from Rust source code with minimal effort. I know that there are many projects for serializing to JSON, but getting the JSON-LD semantics correct would be extremely helpful.

From the documentation I've looked at so far, I'm not quite sure if this project is a good choice to build on top of or not. Do you recommend looking for a different crate? Are there any hydra crates I've overlooked?

Make in-memory indexed graphs more generic

Following the discussion in #55, I realize that some people might be interested in reusing the implementation of graph::inmem or dataset::inmem, even for types that go
beyond the RDF model (even the generalized on that Sophia supports). For example:

Notation 3 (which @MattesWhite is working on)
RDF-*
Other flavors of property graphs

Actually, these implementations are rather agnostic to the kind of terms they handle, as internally they are just represented by integer indexes. Possibly, this could be factored out in a separate crate (sophia_igraph?). And of course, if (some of) the terms they contain can be seen as RDF terms, they could in turn implement the Graph/Dataset traits...

@BruJu what do you think?

Move from `error-chain` to `thiserror` for error management

Hi!

error-chain is not maintained anymore, so it would be a good idea to move to failure, which has all the features needed already, as well as backtrace support (i.e. the underlying pest error could be accessed from a ParserError using the cause() method).

Add persistent implementation of `dataset::MutableDataset`

In addition to the dataset::inmem module, it would be nice to have a disk-based persistent implementation of dataset::MutableDataset.

JSON-LD

Hi,

It's great to get json-ld 1.1 support, at least for serialization and parsing. Even if it's early stage, it could be interesting to list it on https://json-ld.org
What's the long term objective in terms of compliance with the rest of the spec? (flattening, compaction, framing)

Fabien

Make 'Term' a trait

Proposed first in #35 I suggest to introduce a trait Term.

Objective

sophia's aim is to become a central API for RDF in Rust (#23). As such it would be nice if third party terms could smoothly interoperate with sophia's ecosystem, e.g. rio's terms could be stored in sophia's HashGraph, N3 terms from metis could use the Serializer interface or another crate could provide compatible, ASCII-based terms that are based on &[u8] (maybe even going towards no_std-RDF 😊).

Required changes

With the latest effort to turn each kind of term (IRI, literal, blank node and variable) into an own type with an aligned API the step to create a trait is no longer that much work. A problem could be that this could turn advanced traits, such as Graph, in a mess of trait bounds. I'd like to commit a PoC-PR to see how it goes.

Suggestion

Here a suggestion how a trait may look like:

trait Term: Clone + Hash + Debug + Display 
    + PartialEq + PartialEq<Iri> + PartialEq<Literal> + PartialEq<BlankNode> + Eq
where
    Iri: TryFrom<Self>,
    Literal: TryFrom<Self>,
    BlankNode: TryFrom<Self>,
{
    type TermData: TermData;

    fn value(&self) -> String;
}

The focus should be to keep the Term trait as small as possible while other behaviour is kept separate in other traits, e.g. Resolve.

Bad turtle serialization (... or I am doing something wrong)

(related to #17)

I see that there is now code to serialize to turtle, but I get strange results.
Taking the sample code provided in the docu - modified to export turtle instead of n-triples, pretty with prefixes - I am expecting this:

(code: https://gist.github.com/hoijui/72aa8e5bf8dd7381425dfb6676e7fe62)

@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

bla:alice
  foaf:name "Alice" ;
  foaf:mbox <mailto:[email protected]> .

bla:bob
  foaf:name "Bob" ;
  foaf:knows bla:alice .

but I get this:
(NOTE: PREFIX instead of @prefix, and duplicated : (this RDF looks quite rusty... ;-) ))

PREFIX bla:: <http://example.org/>
PREFIX ex:: <http://example.org/>
PREFIX foaf:: <http://xmlns.com/foaf/0.1/>

bla::alice
  foaf::name "Alice";
  foaf::mbox <mailto:[email protected]>.

bla::bob
  foaf::name "Bob";
  foaf::knows bla::alice.

Validating split IRI without allocating a new String

Currently, validating a split IRI (i.e. represented as a pair namespace + suffix) allocates a new string, concatenates both parts, and validates it agains a regexp.

It should be possible to validate this pair without the extra allocation, and this might be faster (although this would need to be measured).

RDFXML parser fails on xsd entities

I am trying to parse this file

http://www.drugtargetontology.org/dto/dto_vocabulary_gpcr_protein.owl

And getting an error!

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SourceError(RdfXmlError { kind: Xml(EscapeError(UnrecognizedSymbol(1..4, Ok("xsd")))) })', src/io/rdf/reader.rs:323:56

The problem appears to be the use of an xsd entity.

<rdfs:label rdf:datatype="&xsd;string">HTR1A gene</rdfs:label>

The entity appears to be defined correctly. Is this expected?

Background

This crate seems to be the most mature and most recently maintained RDF related crates. So, I'm hoping to use this in an LDP server I'm creating. However, turtle is a required format in LDP servers.

The Request

Add turtle as a supported format for parsing and serializing graphs

pchampin / sophia_rs Goto Github PK

sophia_rs's Introduction

Sophia

Licence

Citation

Third-party crates

History

sophia_rs's People

Contributors

Stargazers

Watchers

Forkers

sophia_rs's Issues

Problem

Proposed solution

Background

The Request

Steps towards an improved serializer API

Summary

Problems

Solutions

Topic

Explanation

Discussion

Pro - less memory consumption in Graphs

Pro - cheap Namespace::get()

Con - consume more stack-memory

Con - prefixes are part of syntax

Con - resolving kills suffixes

Contribution

Issue

Discussion of Solutions

Objective

Required changes

Suggestion

Background

The Request

Recommend Projects

Recommend Topics

Recommend Org

Pro - less memory consumption in `Graph`s

Pro - cheap `Namespace::get()`