
Comments (5)

ozekik avatar ozekik commented on June 12, 2024

Thank you for reporting!

The problem is that #Crew?oldid=2476206#Command_crew in <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> is, strictly speaking, an invalid IRI fragment: the fragment starts at the first # and then contains a second, unescaped # (so the document is, in the strict sense, invalid RDF).
Some libraries, such as rdflib, just ignore this, but Rio (the Rust RDF library behind lightrdf) is strict and raises an exception.

Since the resume-after-exception feature is still a work in progress in Rio, I think a possible workaround for now is to fix the invalid IRIs before parsing, for example:

sed -r 's/([^#]*)#/\1%23/2g' latest-all.ttl

(Use -i to replace in place; on macOS, use gsed.)
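The sed expression above counts # characters per line, so on a line that holds several IRIs it may also escape the legitimate first # of a later IRI. A minimal Python sketch of the same idea applied per IRI is shown here. It assumes every `<...>` span on a line is an IRI (which is not true if a string literal happens to contain angle brackets), and the names `IRI_RE` and `escape_extra_hashes` are hypothetical helpers, not part of lightrdf or Rio.

```python
import re

# Naive IRI matcher: anything between '<' and '>' without whitespace or
# nested angle brackets. Assumption: literals do not contain such spans.
IRI_RE = re.compile(r"<([^<>\s]*)>")


def escape_extra_hashes(line: str) -> str:
    """Keep the first '#' of each IRI and percent-encode any later '#'."""

    def fix(match: re.Match) -> str:
        iri = match.group(1)
        # Split at the first '#': everything after it is the fragment.
        head, sep, fragment = iri.partition("#")
        return "<" + head + sep + fragment.replace("#", "%23") + ">"

    return IRI_RE.sub(fix, line)
```

To process a whole dump, the function could be applied line by line while streaming from the input file to a new output file, which leaves the original dump untouched for comparison.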

from lightrdf.

plasticfist avatar plasticfist commented on June 12, 2024

Thank you for the quick response, this is very helpful. I'm usually hesitant to patch source files manually, but I agree this might be the best fix for the moment (thank you for the sed command as well). I'm still looking at the DBpedia .ttl files; lightrdf throws an error with that dataset as well, which I can't make sense of. At first I thought the problem was that their .ttl files weren't actually in Turtle format, but now that I'm reviewing the spec, maybe they are Turtle (just a bare, lazy dump with no prefixes?). I'm still looking, and trying to convert back and forth to other formats (e.g. with rapper).


ozekik avatar ozekik commented on June 12, 2024

I understand that huge datasets in RDF tend to be more or less malformed.
In my opinion, if an N-Triples file is available, it is easier than with Turtle to find and "patch" problems and to track the changes, because each triple sits on its own line.


plasticfist avatar plasticfist commented on June 12, 2024

Here is the (first) DBpedia issue (a .ttl file, but is it Turtle?), for reference:

../dbpedia/ttl/revisions_lang=en_uris.ttl
lightrdf.Error: error while parsing IRI 'http://dbpedia.org/resource/󠄀': Invalid IRI code point '󠄀' on line 19841225 at position 35

$ sed -n '19841223,19841227p;19841228q' revisions_lang=en_uris.ttl
<http://dbpedia.org/resource/𨳒> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𨳒?oldid=786024110&ns=0> .
<http://dbpedia.org/resource/𩧢> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𩧢?oldid=951071761&ns=0> .
<http://dbpedia.org/resource/󠄀> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄀?oldid=949255578&ns=0> .
<http://dbpedia.org/resource/󠄁> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄁?oldid=949255580&ns=0> .
<http://dbpedia.org/resource/󠄂> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄂?oldid=949255609&ns=0> .

I'm including a screen capture, because the terminal seems to give more information about the characters in these 5 lines:
image
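One way to see which characters a strict parser might be objecting to is to list the non-ASCII code points in a line that fall outside the `ucschar` ranges of RFC 3987. The sketch below does only that; it ignores the separate `iprivate` ranges that the RFC allows in query components, so it is a rough diagnostic under that assumption rather than a validator, and the names `UCSCHAR_RANGES`, `is_ucschar`, and `suspicious_chars` are hypothetical.

```python
import unicodedata

# ucschar ranges from RFC 3987: non-ASCII code points permitted in IRIs.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: 0x10000-0x1FFFD, ..., 0xD0000-0xDFFFD.
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    # Plane 14 is only partly allowed: 0xE0000-0xE0FFF is excluded.
    + [(0xE1000, 0xEFFFD)]
)


def is_ucschar(ch: str) -> bool:
    """True if the character lies in one of the RFC 3987 ucschar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def suspicious_chars(text: str) -> list:
    """Return (offset, 'U+XXXX', name) for non-ASCII chars outside ucschar."""
    found = []
    for offset, ch in enumerate(text):
        if ord(ch) > 0x7F and not is_ucschar(ch):
            name = unicodedata.name(ch, "<unnamed>")
            found.append((offset, f"U+{ord(ch):04X}", name))
    return found
```

Running `suspicious_chars` on the lines printed by the `sed -n` command above would show the code point and Unicode name of each flagged character, which is usually easier to read than the raw glyphs in a terminal.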


djstrong avatar djstrong commented on June 12, 2024

I tried this sed workaround while parsing Wikidata, but got:
lightrdf.Error: error while parsing IRI 'http://archive.is/EKEWo#34.7%': Invalid IRI percent encoding '%' on line 49533684 at position 41
Another:
lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 59030363 at position 69
:(
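Both errors above could in principle be handled with the same "patch before parsing" approach. The following is a minimal sketch, not something lightrdf or Rio provides: `escape_bare_percent` percent-encodes any % that is not followed by two hex digits, and `long_subtags` reports language subtags longer than eight characters so they can be reviewed by hand. Both function names are hypothetical, and whether rewriting the data this way is acceptable depends on how the IRIs are used downstream.

```python
import re

# A '%' that does not start a valid percent-encoded triplet (%XX).
BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_bare_percent(iri: str) -> str:
    """Replace stray '%' characters with '%25', leaving valid %XX intact."""
    return BARE_PERCENT.sub("%25", iri)


def long_subtags(tag: str) -> list:
    """Return the subtags of a language tag that exceed eight characters."""
    return [subtag for subtag in tag.split("-") if len(subtag) > 8]
```

For the first error, `escape_bare_percent("http://archive.is/EKEWo#34.7%")` yields an IRI ending in `%25`. For the second, `long_subtags("zh-classical")` flags `classical`; what to replace such a tag with is a data decision that this sketch deliberately leaves open.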
