Problem querying large hdt dataset in fuseki about hdt-java HOT 16 CLOSED

rdfhdt commented on May 20, 2024

Problem querying large hdt dataset in fuseki

from hdt-java.

Comments (16)

mn120110d commented on May 20, 2024

Hi,
I encountered the same problem when generating hdt files with hdt-cpp. Only a query that combines an explicit cast with a filter returns results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .

Also, it is worth noting that if you generate your hdt files with hdt-java, you won’t have this problem.

Best,
Nevena


osma commented on May 20, 2024

I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.


larsgsvensson commented on May 20, 2024

Thanks for your hints, @mn120110d !

Only a query that combines an explicit cast with a filter returns results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .

Oh, that's a useful workaround. It does increase query execution time, though, since the endpoint first has to extract all triples and then remove most of them through the filter instead of directly looking in the index for relevant ones.
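
To illustrate the cost difference, here is a rough sketch in plain Java. It is not Fuseki or hdt-jena code: it assumes the dictionary is a sorted array of serialized terms, and `indexLookup`, `scanAndFilter` and `lexicalOf` are names made up for this example. An exact-match lookup needs only a binary search but misses when the spelling differs, whereas the filter has to visit every term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LookupVersusFilter {
    /** Index-style access: binary search over a sorted term array (negative result = not found). */
    static int indexLookup(String[] sortedTerms, String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    /** Extracts the lexical form of a serialized literal; other terms are returned unchanged. */
    static String lexicalOf(String term) {
        int close = term.lastIndexOf('"');
        if (!term.startsWith("\"") || close < 1) {
            return term;
        }
        return term.substring(1, close);
    }

    /** Filter-style access: visit every term and compare lexical forms. */
    static List<String> scanAndFilter(String[] terms, String lexicalForm) {
        return Arrays.stream(terms)
                .filter(t -> lexicalOf(t).equals(lexicalForm))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String xsd = "^^<http://www.w3.org/2001/XMLSchema#string>";
        // Sample data; the second and third entries are invented for the example.
        String[] terms = {
            "\"A. F. W. Sommer\"" + xsd, "\"B. Example\"" + xsd, "\"C. Example\"" + xsd
        };
        Arrays.sort(terms); // binary search requires sorted input

        // Exact lookup with the plain spelling: logarithmic cost, but no hit.
        System.out.println(indexLookup(terms, "\"A. F. W. Sommer\"") >= 0);
        // Filter on the lexical form: linear cost, finds the typed literal.
        System.out.println(scanAndFilter(terms, "A. F. W. Sommer"));
    }
}
```

A real SPARQL engine evaluates FILTER on bound values rather than on serialized strings, so this only conveys the shape of the trade-off.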

Also, it is worth noting that if you generate your hdt files with hdt-java, you won’t have this problem.

Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to hdt? I can see four ways that could happen:

The spec is unambiguous, the C++ implementation gets it right and the Java implementation gets it wrong.
The spec is unambiguous, the Java implementation gets it right and the C++ implementation gets it wrong.
The spec is unambiguous and neither the Java nor the C++ implementation gets it right.
The spec is ambiguous and both the Java and the C++ implementations get it right while still producing different results.

Given that the C++ implementation seems to be better maintained, my guess would be the first option.


larsgsvensson commented on May 20, 2024

@osma

I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.

Interesting thought. Can you expand a bit on that?


osma commented on May 20, 2024

In RDF 1.0, the plain literal "foo" is different from "foo"^^xsd:string. In RDF 1.1 they are the same since all plain literals (i.e. no language tag and no datatype) have an implicit datatype of xsd:string.

Now let's say the HDT file encodes such literals without a data type, but the hdt-jena layer expects them to have the data type xsd:string (or vice versa). They would have a different encoding and thus wouldn't match.
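
As a small illustration of the two semantics, the sketch below compares serialized literals in plain Java. It does not use Jena or HDT classes; `sameTerm`, `lexicalForm` and `suffix` are hypothetical helpers, and the parsing is simplified (it assumes well-formed N-Triples-style literals and compares language tags case-sensitively).

```java
class PlainLiteralSemantics {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    /** Text between the opening quote and the last quote of a serialized literal. */
    static String lexicalForm(String literal) {
        return literal.substring(1, literal.lastIndexOf('"'));
    }

    /**
     * Everything after the closing quote: "", "@lang" or "^^<iri>".
     * With rdf11 set, a missing suffix is read as the implicit xsd:string datatype.
     */
    static String suffix(String literal, boolean rdf11) {
        String s = literal.substring(literal.lastIndexOf('"') + 1);
        return (rdf11 && s.isEmpty()) ? "^^<" + XSD_STRING + ">" : s;
    }

    /** Term equality under RDF 1.0 (rdf11 = false) or RDF 1.1 (rdf11 = true). */
    static boolean sameTerm(String a, String b, boolean rdf11) {
        return lexicalForm(a).equals(lexicalForm(b))
                && suffix(a, rdf11).equals(suffix(b, rdf11));
    }

    public static void main(String[] args) {
        String simple = "\"foo\"";
        String typed = "\"foo\"^^<" + XSD_STRING + ">";
        System.out.println(sameTerm(simple, typed, false)); // false: distinct terms in RDF 1.0
        System.out.println(sameTerm(simple, typed, true));  // true: same term in RDF 1.1
    }
}
```

If one layer compares terms the first way and another layer the second way, a literal stored in one spelling will not be found when queried in the other.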


larsgsvensson commented on May 20, 2024

OK, I see.

When I inspected the hdt file with HDT-it!, I found the string "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> (with datatype), but adding the datatype in the SPARQL query didn't help. So the only remaining explanation would be that the hdt-jena implementation strips the datatype from the literal in the SPARQL query, while the lookup further down the line expects it.


mn120110d commented on May 20, 2024

You're welcome, @larsgsvensson .

I agree that the solution with filter increases execution time, but I haven't found a better way to do it.

Regarding your question:

Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to hdt?

Here is an issue that mentions both versions and their differences:
#58

I believe the answer to your question would be this part of the conversation:

regardless if a .hdt was created with C++ or Java, shouldn't it be the exact same format?

in theory yes, in practice not that easy without dedicated development teams :/

So my guess would be that no matter what tool you use, the generated hdt should be correct, but for now you'll have to adjust your queries to support different versions. I hope that someone more competent could give a better and more detailed explanation. :)


osma commented on May 20, 2024

I'm curious about two questions here:

Is this really dependent on the input file size? @larsgsvensson mentioned that the problem was with a large input file, but not with a very small subset. If that is true, where is the limit? There's a rather large gap between 266M triples and 13 triples...
If the file generated with hdt-java works but the hdt-cpp version doesn't, what's the difference? Is the datatype of plain literals encoded differently in hdt files generated by the two tools?

Trying to answer these might lead to more clues about where the problem is.

I think that the hdt-java version was originally created using a Jena version earlier than 3.0, the release that changed the semantics of plain literals to follow RDF 1.1. In fact it must have been, since the project started around 2012 and Jena 3.0 was released in July 2015. So maybe when the dependency was eventually updated to Jena 3.0+, some code paths were left that expected the old RDF 1.0 style plain literals. Or maybe the hdt-cpp tools assume RDF 1.0 style literals while hdt-java is all RDF 1.1.


TRnonodename commented on May 20, 2024

I'm not sure this is related to file size or which library generates the file. It seems like an incompatibility in the fuseki implementation.

I was able to recreate the issue with the bulk instrument file from permid.org (14Mb gzipped).

Using HDT-it! with the raw triples (which include typed string literals) and default settings, I get an HDT file for which fuseki returns 0 rows for both of these queries:

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }

and

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR"^^<http://www.w3.org/2001/XMLSchema#string> . }

If I remove the types from the file using sed

gzcat OpenPermID-bulk-instrument-20171106_072520.ntriples.gz | sed 's/\^\^<http:\/\/www.w3.org\/2001\/XMLSchema#string>//' > stripped.nt

the following query works in fuseki, returning one row

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }

Importantly, I get the same results whether using HDT-it (which relies on the CPP library) or the java-cli. I'm using the current build of HDT-it from the website on a mac and the current trunk from github for the CLI.


larsgsvensson commented on May 20, 2024

After some time I managed to get a closer look at this (or rather at the Server.java implementation that uses part of hdt-jena).

Having set up a TPF server using Server.java, I tried to query an HDT file in which all literals have datatypes; in particular, the default datatype is xsd:string. It didn't work. I dug through the source code and found that access to the HDT dictionary is handled by org.rdfhdt.hdtjena.NodeDictionary. There, the method #nodeToStr strips the datatype from the literal if the datatype is xsd:string, which explains why the query doesn't work:

public static String nodeToStr(Node node) {
  if (node == null || node.isVariable()) {
    return "";
  } else if (node.isURI()) {
    return node.getURI();
  } else if (node.isLiteral()) {
    RDFDatatype t = node.getLiteralDatatype();
    if (t == null || XSDDatatype.XSDstring.getURI().equals(t.getURI())) {
      // String: an xsd:string datatype is dropped here
      return "\"" + node.getLiteralLexicalForm() + "\"";
    } else if (RDFLangString.rdfLangString.equals(t)) {
      // Lang
      return "\"" + node.getLiteralLexicalForm() + "\"@" + node.getLiteralLanguage();
    } else {
      // Typed
      return "\"" + node.getLiteralLexicalForm() + "\"^^<" + t.getURI() + ">";
    }
  } else {
    return node.toString();
  }
}

I think this doesn't conform to RDF 1.1 §3.3:

Please note that concrete syntaxes MAY support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string.

The implementation in NodeDictionary turns the syntactic sugar into a norm.


wouterbeek commented on May 20, 2024

@larsgsvensson It is indeed possible to store simple literals in HDT. This means that the same RDF 1.1 term can be represented by two distinct terms in an HDT.


larsgsvensson commented on May 20, 2024

@wouterbeek Yes, sure it is. The point I'm aiming at is that if I convert the RDF triple
:subject :predicate "object"^^xsd:string .
to hdt using the cpp implementation, I cannot query it using the Java implementation since the cpp converter will retain the datatype xsd:string on all literals whereas the Java implementation strips the xsd:string from the literal before querying the NodeDictionary. That way the literal will never be found.
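
The mismatch can be reproduced without any HDT tooling. The following is only an illustration in plain Java: the dictionary is modelled as a Set of serialized terms, and both key-building methods are hypothetical stand-ins, the first one imitating the xsd:string stripping described for nodeToStr.

```java
import java.util.Set;

class DatatypeStrippingDemo {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    /** Builds the lookup key the way a stripping serializer would: xsd:string is dropped. */
    static String keyStrippingXsdString(String lexical, String datatypeIri) {
        if (datatypeIri == null || XSD_STRING.equals(datatypeIri)) {
            return "\"" + lexical + "\"";
        }
        return "\"" + lexical + "\"^^<" + datatypeIri + ">";
    }

    /** Builds the lookup key while keeping whatever datatype the query literal carries. */
    static String keyKeepingDatatype(String lexical, String datatypeIri) {
        if (datatypeIri == null) {
            return "\"" + lexical + "\"";
        }
        return "\"" + lexical + "\"^^<" + datatypeIri + ">";
    }

    public static void main(String[] args) {
        // Assumed dictionary content: the converter retained the explicit datatype.
        Set<String> dictionary = Set.of("\"object\"^^<" + XSD_STRING + ">");

        System.out.println(dictionary.contains(keyStrippingXsdString("object", XSD_STRING))); // false
        System.out.println(dictionary.contains(keyKeepingDatatype("object", XSD_STRING)));    // true
    }
}
```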

If there are two ways to store the literal, then I must be able to query them exactly as I stored them, or the search for "rdf-hdt" in the object position must return the same result as the search for "rdf-hdt"^^xsd:string (i.e. the implementation must consider
:subject :predicate "object"^^xsd:string .
and
:subject :predicate "object" .
equivalent in every respect). If the implementation lets me store it with an xsd:string datatype but doesn't let me query it that way, it means I shouldn't be allowed to store it that way in the first place.

As I see it, the only way to accomplish this is to mandate that in HDT files the datatype xsd:string is always added if not already present, or it's always removed if present. Then the implementations accessing the HDT file would need to be adjusted accordingly.
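
A minimal sketch of what such a mandate could look like, assuming terms are handled as serialized strings. The `Policy` enum and the `normalize`, `buildDictionary` and `lookup` helpers are invented for this illustration and are not part of hdt-java or hdt-cpp. The point is that writer and reader apply the same policy, so either spelling of the query literal finds the stored term.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LiteralNormalizer {
    static final String XSD_STRING_SUFFIX = "^^<http://www.w3.org/2001/XMLSchema#string>";

    /** The two possible conventions: always add xsd:string, or always remove it. */
    enum Policy { ALWAYS_TYPED, ALWAYS_SIMPLE }

    /** Rewrites a serialized literal to the chosen convention; other terms pass through. */
    static String normalize(String term, Policy policy) {
        if (!term.startsWith("\"")) {
            return term; // not a literal
        }
        boolean simple = term.length() >= 2 && term.endsWith("\"");
        boolean typedString = term.endsWith(XSD_STRING_SUFFIX);
        if (policy == Policy.ALWAYS_TYPED && simple) {
            return term + XSD_STRING_SUFFIX;
        }
        if (policy == Policy.ALWAYS_SIMPLE && typedString) {
            return term.substring(0, term.length() - XSD_STRING_SUFFIX.length());
        }
        return term;
    }

    /** Writer side: every term is normalized before it is stored. */
    static Set<String> buildDictionary(List<String> terms, Policy policy) {
        Set<String> dictionary = new HashSet<>();
        for (String term : terms) {
            dictionary.add(normalize(term, policy));
        }
        return dictionary;
    }

    /** Reader side: the query term is normalized with the same policy before lookup. */
    static boolean lookup(Set<String> dictionary, String queryTerm, Policy policy) {
        return dictionary.contains(normalize(queryTerm, policy));
    }

    public static void main(String[] args) {
        String simple = "\"object\"";
        String typed = simple + XSD_STRING_SUFFIX;
        for (Policy policy : Policy.values()) {
            Set<String> dictionary = buildDictionary(List.of(typed), policy);
            // Both spellings are found under either policy.
            System.out.println(policy + ": " + lookup(dictionary, simple, policy)
                    + " " + lookup(dictionary, typed, policy));
        }
    }
}
```

Which of the two policies to pick is a separate question; the sketch only shows that consistency between writing and reading is what makes the lookup succeed.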


wouterbeek commented on May 20, 2024

@larsgsvensson I agree with you that the best solution would be to always store "..."^^<http://www.w3.org/2001/XMLSchema#string> in HDT, and never store "...".

Unfortunately, "..." is a legal RDF term in Turtle, TriG, N-Triples, and N-Quads. This means that any compliant parser is allowed to emit "...", and that this should not be fixed in the parser (i.e., Serd).

Instead, what is needed is HDT-specific code that transforms "..." into "..."^^<http://www.w3.org/2001/XMLSchema#string> upon HDT file creation. If somebody could implement this in a pull request, that would be very welcome.

(Notice that this would not invalidate existing HDTs that use "...". It would just guarantee that newly created HDTs are not ambiguous.)


wouterbeek commented on May 20, 2024

@larsgsvensson I've created an issue for this in the proper place: rdfhdt/hdt-cpp#173

You can close this current issue if there are no other hdt-java specific components to it.


larsgsvensson commented on May 20, 2024

Thanks @wouterbeek. In rdfhdt/hdt-cpp#173 I suggested doing it the other way round, since it seems that most implementations use the "..." form when reading.
I don't think there are any other hdt-java issues here so I'll close.


larsgsvensson commented on May 20, 2024

Just for documentation purposes, this is my current workaround:

Since hdt-java depends on hdt-jena, update org.rdfhdt.hdtjena.NodeDictionary so that the methods NodeDictionary#nodeToStr(Node) and NodeDictionary#nodeToStr(Node, PrefixMapping) are non-static. Update the test cases accordingly.
In HdtBasedRequestProcessorForTPFs, rewrite the constructor as follows:

public HdtBasedRequestProcessorForTPFs(final String hdtFile) throws IOException {
  this.datasource = HDTManager.mapIndexedHDT(hdtFile, null); // listener=null
  this.dictionary = new NodeDictionary(this.datasource.getDictionary()) {
    @Override
    public String nodeToStr(final Node node) {
      if (node == null || node.isVariable()) {
        return "";
      } else if (node.isURI()) {
        return node.getURI();
      } else if (node.isLiteral()) {
        final RDFDatatype t = node.getLiteralDatatype();
        if (t == null) {
          // String
          return "\"" + node.getLiteralLexicalForm() + "\"";
        } else if (RDFLangString.rdfLangString.equals(t)) {
          // Lang
          return "\"" + node.getLiteralLexicalForm() + "\"@"
              + node.getLiteralLanguage();
        } else {
          // Typed
          return "\"" + node.getLiteralLexicalForm() + "\"^^<"
              + t.getURI() + ">";
        }
      } else {
        return node.toString();
      }
    }
  };
}

I.e., I replace nodeToStr(Node) with an implementation that keeps the datatype even when it is xsd:string.

