Comments (7)
Correction:
java -Dspark.kryoserializer.buffer.max=2048 -jar ...
The maximum buffer size must be below 2048 MB, and the -D parameter has to be passed differently.
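The two points in this correction can be sketched in Python: the property is attached directly to `-D` (no space, no nested quoting) and placed before `-jar`, and the value stays below 2048 MB. The helper name `kryo_java_command`, the jar name `rpt.jar`, and the explicit `m` unit suffix are illustrative assumptions and are not taken from the tool's documentation.

```python
KRYO_MAX_PROPERTY = "spark.kryoserializer.buffer.max"
LIMIT_MB = 2048  # per the comment above, the value must stay below this


def kryo_java_command(jar_path, buffer_max_mb, app_args=()):
    """Build a `java` argument list with the Kryo buffer limit set via -D.

    Hypothetical helper: jar_path and app_args are placeholders.
    """
    if not 0 < buffer_max_mb < LIMIT_MB:
        raise ValueError(
            f"{KRYO_MAX_PROPERTY} must be between 1 and {LIMIT_MB - 1} MB, "
            f"got {buffer_max_mb}"
        )
    # JVM options such as -Dkey=value must precede -jar; anything after the
    # jar path is handed to the application instead of the JVM.
    return [
        "java",
        f"-D{KRYO_MAX_PROPERTY}={buffer_max_mb}m",
        "-jar",
        jar_path,
        *app_args,
    ]
```

Calling `kryo_java_command("rpt.jar", 2047)` yields an argument list that can be passed to `subprocess.run`; a value of 2048 or more is rejected before the JVM is started.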
from rdfprocessingtoolkit.
Can you run
rpt sansa analyze csv your-data.csv --out-file report.ttl
(See also https://sansa-stack.github.io/SANSA-Stack/cli/tarql.html#inspecting-csv-files)
and check whether it can parse the CSV file correctly?
It should output an RDF document with parsing information about each split of the CSV file.
from rdfprocessingtoolkit.
Output:
@prefix eg: <http://www.example.org/> .
@prefix xds: <http://www.w3.org/2001/XMLSchema#> .
_:b0 eg:totalDuration "0.012"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b1 eg:regionEndProbeResult _:b0 ;
eg:totalElementCount "860636"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:totalTime "0.7905820520000001"^^xds:double ;
eg:splitStart "0"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:regionStartProbeResult _:b2 ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalRecordCount "860637"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false .
_:b2 eg:totalDuration "0.0"^^xds:double ;
eg:probeCount "0"^^xds:long ;
eg:candidatePos "0"^^xds:long .
_:b3 eg:regionEndProbeResult _:b4 ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:totalBytesRead "33554469"^^xds:long ;
eg:splitStart "33554432"^^xds:long ;
eg:totalRecordCount "816606"^^xds:long ;
eg:totalElementCount "816606"^^xds:long ;
eg:totalTime "0.41145543700000003"^^xds:double ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartProbeResult _:b5 .
_:b4 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "37"^^xds:long .
_:b5 eg:totalDuration "0.007"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b6 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "45"^^xds:long .
_:b7 eg:totalDuration "0.008"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "37"^^xds:long .
_:b8 eg:totalTime "0.423793693"^^xds:double ;
eg:totalRecordCount "799251"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:totalElementCount "799251"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:splitStart "67108864"^^xds:long ;
eg:totalBytesRead "33554477"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:regionStartProbeResult _:b7 ;
eg:regionEndProbeResult _:b6 .
_:b9 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "45"^^xds:long .
_:b10 eg:regionStartProbeResult _:b9 ;
eg:totalTime "0.34524198"^^xds:double ;
eg:tailElementCount "1"^^xds:int ;
eg:totalRecordCount "810578"^^xds:long ;
eg:totalElementCount "810578"^^xds:long ;
eg:splitStart "100663296"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:splitSize "33554432"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:regionEndProbeResult _:b11 ;
eg:regionStartSearchReadOverRegionEnd false .
_:b11 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b12 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b13 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b14 eg:splitStart "134217728"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:regionEndProbeResult _:b12 ;
eg:totalTime "0.33095979400000003"^^xds:double ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalRecordCount "803622"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:totalElementCount "803622"^^xds:long ;
eg:regionStartProbeResult _:b13 ;
eg:regionStartSearchReadOverRegionEnd false .
_:b15 eg:splitStart "167772160"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b16 ;
eg:regionEndProbeResult _:b17 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:totalTime "0.35778299"^^xds:double ;
eg:totalRecordCount "804767"^^xds:long ;
eg:totalBytesRead "33554497"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "804767"^^xds:long .
_:b16 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b17 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "65"^^xds:long .
_:b18 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "65"^^xds:long .
_:b19 eg:regionStartSearchReadOverSplitEnd false ;
eg:splitStart "201326592"^^xds:long ;
eg:totalBytesRead "33554507"^^xds:long ;
eg:regionEndProbeResult _:b20 ;
eg:splitSize "33554432"^^xds:long ;
eg:totalRecordCount "812414"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "812414"^^xds:long ;
eg:totalTime "0.334013382"^^xds:double ;
eg:regionStartProbeResult _:b18 .
_:b20 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "75"^^xds:long .
_:b21 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "82"^^xds:long .
_:b22 eg:totalRecordCount "809813"^^xds:long ;
eg:splitStart "234881024"^^xds:long ;
eg:regionStartProbeResult _:b23 ;
eg:totalBytesRead "33554514"^^xds:long ;
eg:totalTime "0.361883591"^^xds:double ;
eg:regionEndProbeResult _:b21 ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "809813"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:splitSize "33554432"^^xds:long ;
eg:tailElementCount "1"^^xds:int .
_:b23 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "75"^^xds:long .
_:b24 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "82"^^xds:long .
_:b25 eg:totalElementCount "814884"^^xds:long ;
eg:splitStart "268435456"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:totalTime "0.34096873"^^xds:double ;
eg:totalBytesRead "33554494"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b24 ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "814884"^^xds:long ;
eg:regionEndProbeResult _:b26 .
_:b26 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b27 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "70"^^xds:long .
_:b28 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b29 eg:regionStartProbeResult _:b28 ;
eg:splitStart "301989888"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "812127"^^xds:long ;
eg:totalRecordCount "812127"^^xds:long ;
eg:totalTime "0.34634130300000004"^^xds:double ;
eg:totalBytesRead "33554502"^^xds:long ;
eg:regionEndProbeResult _:b27 .
_:b30 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "70"^^xds:long .
_:b31 eg:regionStartSearchReadOverRegionEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:totalTime "0.33287019900000003"^^xds:double ;
eg:regionStartProbeResult _:b30 ;
eg:splitStart "335544320"^^xds:long ;
eg:totalRecordCount "809327"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:totalBytesRead "33554491"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalElementCount "809327"^^xds:long ;
eg:regionEndProbeResult _:b32 .
_:b32 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "59"^^xds:long .
_:b33 eg:totalTime "0.345964153"^^xds:double ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "811930"^^xds:long ;
eg:regionEndProbeResult _:b34 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitStart "369098752"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalBytesRead "33554494"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:totalRecordCount "811930"^^xds:long ;
eg:regionStartProbeResult _:b35 .
_:b35 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "59"^^xds:long .
_:b34 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b36 eg:totalTime "0.32924473800000004"^^xds:double ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionEndProbeResult _:b37 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartProbeResult _:b38 ;
eg:totalElementCount "797831"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "797831"^^xds:long ;
eg:totalBytesRead "33554493"^^xds:long ;
eg:splitStart "402653184"^^xds:long .
_:b37 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "61"^^xds:long .
_:b38 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b39 eg:totalDuration "0.061000000000000006"^^xds:double ;
eg:probeCount "0"^^xds:long ;
eg:candidatePos "-1"^^xds:long .
_:b40 eg:splitStart "436207616"^^xds:long ;
eg:totalElementCount "570104"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "570103"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b41 ;
eg:splitSize "24421635"^^xds:long ;
eg:totalBytesRead "24421635"^^xds:long ;
eg:regionEndProbeResult _:b39 ;
eg:tailElementCount "0"^^xds:int ;
eg:totalTime "0.30285072300000004"^^xds:double .
_:b41 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "61"^^xds:long .
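To read a report like this at a glance, the per-split counters can be totalled and compared with the row count and file size expected for the CSV file. A minimal sketch in Python, assuming the report is Turtle text in the shape shown above and that `eg:totalRecordCount` is the number of records parsed in each split (inferred from the property name, not from documentation). The regular expression is a shortcut standing in for a proper RDF parser, and `summarize_report` is a hypothetical helper name.

```python
import re

# Matches lines such as: eg:totalRecordCount "860637"^^xds:long ;
_COUNTER = re.compile(r'eg:(totalRecordCount|splitSize)\s+"(\d+)"')


def summarize_report(turtle_text):
    """Return (number_of_splits, total_records, total_split_bytes)."""
    records, sizes = [], []
    for name, value in _COUNTER.findall(turtle_text):
        # Each split node carries exactly one of each counter.
        (records if name == "totalRecordCount" else sizes).append(int(value))
    return len(records), sum(records), sum(sizes)
```

Passing the contents of `report.ttl` to `summarize_report` gives three numbers to check by hand: the summed `splitSize` should equal the size of the CSV file in bytes, and the summed record count should be close to the number of rows you expect.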
from rdfprocessingtoolkit.
Hm, so the CSV parsing looks OK. What happens when you increase the Kryo buffer size?
EXTRA_OPTS="-Dspark.kryoserializer.buffer.max=2000000000" rpt
java -D "-Dspark.kryoserializer.buffer.max=2000000000" -jar
from rdfprocessingtoolkit.
Does increasing the Kryo buffer size have any effect? Since the CSV parsing seems to work, this would indicate that a single partition of CSV data maps to a very large amount of RDF data (maybe the mapping produces many duplicates?).
A mapping where several thousand triples are attached to the same subject (e.g. due to an incorrect mapping) might also cause this issue; it was somehow related to very large Turtle blocks being formed that exceed internal thresholds.
Maybe switching to N-Triples serialization makes the issue go away?
from rdfprocessingtoolkit.
It works with the buffer parameter, thanks!
from rdfprocessingtoolkit.
Updated the SANSA CLI to use Kryo's max buffer size of 2048 by default. It is possible to override it to make it smaller, but I'm not sure there is a good reason to do so.
SANSA-Stack/SANSA-Stack@4948399
(I realize I should have created a separate issue at SANSA, but oh well.)
from rdfprocessingtoolkit.