
Comments (7)

TBoonX commented on July 17, 2024

Correction:

java -Dspark.kryoserializer.buffer.max=2048 -jar ...

The buffer size must be less than 2048 MB, and the -D parameter has to be passed differently.
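
As a rough illustration of the two points above, here is a minimal Python sketch that builds the JVM option and rejects values at or above the 2048 limit before Spark does. It assumes the value is written in MiB with an explicit `m` suffix, which Spark's configuration documentation describes for this property. The helper name `kryo_buffer_opt` is hypothetical, and the jar path stays elided as in the command above.

```python
KRYO_MAX_EXCLUSIVE_MIB = 2048  # the value must stay below 2048 MiB


def kryo_buffer_opt(mib: int) -> str:
    """Return the -D JVM option for spark.kryoserializer.buffer.max (hypothetical helper)."""
    if not 0 < mib < KRYO_MAX_EXCLUSIVE_MIB:
        raise ValueError(
            f"buffer size must be between 1 and {KRYO_MAX_EXCLUSIVE_MIB - 1} MiB, got {mib}"
        )
    return f"-Dspark.kryoserializer.buffer.max={mib}m"


# The option is a single argument placed directly after `java`, before `-jar`.
print("java", kryo_buffer_opt(1024), "-jar", "...")
# prints: java -Dspark.kryoserializer.buffer.max=1024m -jar ...
```

The explicit suffix is only a design choice in this sketch: it avoids any ambiguity about the unit of a bare number.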

from rdfprocessingtoolkit.

Aklakan commented on July 17, 2024

Can you run

rpt sansa analyze csv your-data.csv --out-file report.ttl

(See also https://sansa-stack.github.io/SANSA-Stack/cli/tarql.html#inspecting-csv-files)

and check whether it can parse the CSV file correctly?
It should output an RDF document with parsing information about each split of the CSV file.


TBoonX commented on July 17, 2024

Output:

@prefix eg: <http://www.example.org/> .
@prefix xds: <http://www.w3.org/2001/XMLSchema#> .

_:b0    eg:totalDuration  "0.012"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b1    eg:regionEndProbeResult    _:b0 ;
        eg:totalElementCount       "860636"^^xds:long ;
        eg:totalBytesRead          "33554513"^^xds:long ;
        eg:totalTime               "0.7905820520000001"^^xds:double ;
        eg:splitStart              "0"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:regionStartProbeResult  _:b2 ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:totalRecordCount        "860637"^^xds:long ;
        eg:regionStartSearchReadOverRegionEnd  false .

_:b2    eg:totalDuration  "0.0"^^xds:double ;
        eg:probeCount     "0"^^xds:long ;
        eg:candidatePos   "0"^^xds:long .

_:b3    eg:regionEndProbeResult    _:b4 ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:totalBytesRead          "33554469"^^xds:long ;
        eg:splitStart              "33554432"^^xds:long ;
        eg:totalRecordCount        "816606"^^xds:long ;
        eg:totalElementCount       "816606"^^xds:long ;
        eg:totalTime               "0.41145543700000003"^^xds:double ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:regionStartProbeResult  _:b5 .

_:b4    eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "37"^^xds:long .

_:b5    eg:totalDuration  "0.007"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b6    eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "45"^^xds:long .

_:b7    eg:totalDuration  "0.008"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "37"^^xds:long .

_:b8    eg:totalTime               "0.423793693"^^xds:double ;
        eg:totalRecordCount        "799251"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:totalElementCount       "799251"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:splitStart              "67108864"^^xds:long ;
        eg:totalBytesRead          "33554477"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:regionStartProbeResult  _:b7 ;
        eg:regionEndProbeResult    _:b6 .

_:b9    eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "45"^^xds:long .

_:b10   eg:regionStartProbeResult  _:b9 ;
        eg:totalTime               "0.34524198"^^xds:double ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:totalRecordCount        "810578"^^xds:long ;
        eg:totalElementCount       "810578"^^xds:long ;
        eg:splitStart              "100663296"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalBytesRead          "33554513"^^xds:long ;
        eg:regionEndProbeResult    _:b11 ;
        eg:regionStartSearchReadOverRegionEnd  false .

_:b11   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b12   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b13   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b14   eg:splitStart              "134217728"^^xds:long ;
        eg:totalBytesRead          "33554513"^^xds:long ;
        eg:regionEndProbeResult    _:b12 ;
        eg:totalTime               "0.33095979400000003"^^xds:double ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:totalRecordCount        "803622"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalElementCount       "803622"^^xds:long ;
        eg:regionStartProbeResult  _:b13 ;
        eg:regionStartSearchReadOverRegionEnd  false .

_:b15   eg:splitStart              "167772160"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionStartProbeResult  _:b16 ;
        eg:regionEndProbeResult    _:b17 ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalTime               "0.35778299"^^xds:double ;
        eg:totalRecordCount        "804767"^^xds:long ;
        eg:totalBytesRead          "33554497"^^xds:long ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalElementCount       "804767"^^xds:long .

_:b16   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "81"^^xds:long .

_:b17   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "65"^^xds:long .

_:b18   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "65"^^xds:long .

_:b19   eg:regionStartSearchReadOverSplitEnd  false ;
        eg:splitStart              "201326592"^^xds:long ;
        eg:totalBytesRead          "33554507"^^xds:long ;
        eg:regionEndProbeResult    _:b20 ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalRecordCount        "812414"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalElementCount       "812414"^^xds:long ;
        eg:totalTime               "0.334013382"^^xds:double ;
        eg:regionStartProbeResult  _:b18 .

_:b20   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "75"^^xds:long .

_:b21   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "82"^^xds:long .

_:b22   eg:totalRecordCount        "809813"^^xds:long ;
        eg:splitStart              "234881024"^^xds:long ;
        eg:regionStartProbeResult  _:b23 ;
        eg:totalBytesRead          "33554514"^^xds:long ;
        eg:totalTime               "0.361883591"^^xds:double ;
        eg:regionEndProbeResult    _:b21 ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalElementCount       "809813"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int .

_:b23   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "75"^^xds:long .

_:b24   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "82"^^xds:long .

_:b25   eg:totalElementCount       "814884"^^xds:long ;
        eg:splitStart              "268435456"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:totalTime               "0.34096873"^^xds:double ;
        eg:totalBytesRead          "33554494"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionStartProbeResult  _:b24 ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalRecordCount        "814884"^^xds:long ;
        eg:regionEndProbeResult    _:b26 .

_:b26   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "62"^^xds:long .

_:b27   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "70"^^xds:long .

_:b28   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "62"^^xds:long .

_:b29   eg:regionStartProbeResult  _:b28 ;
        eg:splitStart              "301989888"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalElementCount       "812127"^^xds:long ;
        eg:totalRecordCount        "812127"^^xds:long ;
        eg:totalTime               "0.34634130300000004"^^xds:double ;
        eg:totalBytesRead          "33554502"^^xds:long ;
        eg:regionEndProbeResult    _:b27 .

_:b30   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "70"^^xds:long .

_:b31   eg:regionStartSearchReadOverRegionEnd  false ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:totalTime               "0.33287019900000003"^^xds:double ;
        eg:regionStartProbeResult  _:b30 ;
        eg:splitStart              "335544320"^^xds:long ;
        eg:totalRecordCount        "809327"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalBytesRead          "33554491"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:totalElementCount       "809327"^^xds:long ;
        eg:regionEndProbeResult    _:b32 .

_:b32   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "59"^^xds:long .

_:b33   eg:totalTime               "0.345964153"^^xds:double ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalElementCount       "811930"^^xds:long ;
        eg:regionEndProbeResult    _:b34 ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:splitStart              "369098752"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:totalBytesRead          "33554494"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:totalRecordCount        "811930"^^xds:long ;
        eg:regionStartProbeResult  _:b35 .

_:b35   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "59"^^xds:long .

_:b34   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "62"^^xds:long .

_:b36   eg:totalTime               "0.32924473800000004"^^xds:double ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionEndProbeResult    _:b37 ;
        eg:tailElementCount        "1"^^xds:int ;
        eg:splitSize               "33554432"^^xds:long ;
        eg:regionStartProbeResult  _:b38 ;
        eg:totalElementCount       "797831"^^xds:long ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalRecordCount        "797831"^^xds:long ;
        eg:totalBytesRead          "33554493"^^xds:long ;
        eg:splitStart              "402653184"^^xds:long .

_:b37   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "61"^^xds:long .

_:b38   eg:totalDuration  "0.004"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "62"^^xds:long .

_:b39   eg:totalDuration  "0.061000000000000006"^^xds:double ;
        eg:probeCount     "0"^^xds:long ;
        eg:candidatePos   "-1"^^xds:long .

_:b40   eg:splitStart              "436207616"^^xds:long ;
        eg:totalElementCount       "570104"^^xds:long ;
        eg:regionStartSearchReadOverRegionEnd  false ;
        eg:totalRecordCount        "570103"^^xds:long ;
        eg:regionStartSearchReadOverSplitEnd  false ;
        eg:regionStartProbeResult  _:b41 ;
        eg:splitSize               "24421635"^^xds:long ;
        eg:totalBytesRead          "24421635"^^xds:long ;
        eg:regionEndProbeResult    _:b39 ;
        eg:tailElementCount        "0"^^xds:int ;
        eg:totalTime               "0.30285072300000004"^^xds:double .

_:b41   eg:totalDuration  "0.005"^^xds:double ;
        eg:probeCount     "1"^^xds:long ;
        eg:candidatePos   "61"^^xds:long .
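
A report like the one above is easier to judge when it is condensed to a few numbers. The following is a minimal Python sketch, not part of rpt or SANSA, that totals the record counts and checks that the splits are contiguous. It assumes the property names `eg:splitStart`, `eg:splitSize` and `eg:totalRecordCount` and the blank-line-separated layout exactly as printed here, and it uses a regular expression instead of a real Turtle parser. The function name `summarize` is hypothetical.

```python
import re

# Matches lines such as:  eg:splitSize  "33554432"^^xds:long ;
FIELD = re.compile(r'eg:(splitStart|splitSize|totalRecordCount)\s+"(-?\d+)"')


def summarize(report_text: str):
    """Return (split_count, total_records, contiguous) for a report in the layout above."""
    splits = []
    for block in re.split(r"\n\s*\n", report_text):
        fields = {name: int(value) for name, value in FIELD.findall(block)}
        if "splitStart" in fields:  # probe-result blocks carry no split fields
            splits.append(fields)
    splits.sort(key=lambda s: s["splitStart"])
    # Each split should begin exactly where the previous one ends.
    contiguous = all(
        a["splitStart"] + a["splitSize"] == b["splitStart"]
        for a, b in zip(splits, splits[1:])
    )
    return len(splits), sum(s["totalRecordCount"] for s in splits), contiguous


# Two split blocks and one probe block, abbreviated from the output above.
sample = '''
_:b1    eg:totalRecordCount        "860637"^^xds:long ;
        eg:splitStart              "0"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long .

_:b2    eg:totalDuration  "0.0"^^xds:double ;
        eg:probeCount     "0"^^xds:long .

_:b3    eg:splitStart              "33554432"^^xds:long ;
        eg:totalRecordCount        "816606"^^xds:long ;
        eg:splitSize               "33554432"^^xds:long .
'''
print(summarize(sample))  # prints (2, 1677243, True)
```

Comparing the total against the row count of the CSV file gives a quick indication of whether any split lost or duplicated records.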


Aklakan commented on July 17, 2024

Hm, so the CSV parsing looks OK. What happens when you increase the Kryo buffer size?

EXTRA_OPTS="-Dspark.kryoserializer.buffer.max=2000000000" rpt
java -D "-Dspark.kryoserializer.buffer.max=2000000000" -jar


Aklakan commented on July 17, 2024

Does increasing the Kryo buffer size have any effect? Since the CSV parsing seems to work, this would indicate that a single partition of CSV data maps to a very large amount of RDF data (maybe the mapping produces many duplicates?).

A mapping that attaches several thousand triples to the same subject (e.g. due to an incorrect mapping) might also cause this issue. It was somehow related to very large Turtle blocks being formed that exceed internal thresholds.

Maybe switching to N-Triples serialization makes the issue go away?
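
If the output is available as N-Triples, both suspicions (many duplicates, or thousands of triples on one subject) can be checked with a few lines of Python. This is a minimal sketch and not part of rpt. The function name `subject_stats` is hypothetical, and it is meant for a sample of the output, since it keeps every distinct line in memory.

```python
from collections import Counter


def subject_stats(ntriples_lines):
    """Count distinct triples per subject and exact duplicate lines in N-Triples input."""
    per_subject = Counter()
    seen = set()
    duplicates = 0
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line in seen:
            duplicates += 1
            continue
        seen.add(line)
        # In N-Triples the subject is the first whitespace-delimited term.
        per_subject[line.split(None, 1)[0]] += 1
    return per_subject, duplicates


lines = [
    '<http://example.org/s1> <http://example.org/p> "a" .',
    '<http://example.org/s1> <http://example.org/p> "b" .',
    '<http://example.org/s1> <http://example.org/p> "a" .',
    '<http://example.org/s2> <http://example.org/p> "c" .',
]
counts, dupes = subject_stats(lines)
print(counts.most_common(1), dupes)  # prints [('<http://example.org/s1>', 2)] 1
```

A subject with an unusually high count in `most_common` would be a candidate for the oversized Turtle blocks mentioned above.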


TBoonX commented on July 17, 2024

It works with the buffer parameter, thanks!


Aklakan commented on July 17, 2024

Updated the SANSA CLI to use Kryo's max buffer size of 2048 by default. It is possible to override it with a smaller value, but I'm not sure there is a good reason to do so.

SANSA-Stack/SANSA-Stack@4948399
(I realize I should have created a separate issue at sansa but oh well)

