graphipedia's Introduction

Graphipedia

A tool for creating a Neo4j graph database of Wikipedia pages and the links between them.

Building

This is a Java project built with Maven.

Check the neo4j.version property in the top-level pom.xml file and make sure it matches the Neo4j version you intend to use to open the database. Then build with:

mvn package

This will generate a package including all dependencies in graphipedia-dataimport/target/graphipedia-dataimport.jar.

Importing Data

The graphipedia-dataimport module creates a Neo4j database from a Wikipedia database dump.

See Wikipedia:Database_download for instructions on getting a Wikipedia database dump.

Assuming you downloaded pages-articles.xml.bz2, follow these steps:

  1. Run ExtractLinks to create a smaller intermediate XML file containing page titles and links only. The best way to do this is to decompress the bzip2 file and pipe the output directly to ExtractLinks:

    bzip2 -dc pages-articles.xml.bz2 | java -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks - enwiki-links.xml

  2. Run ImportGraph to create a Neo4j database with nodes and relationships in a graphdb directory:

    java -Xmx3G -classpath graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph enwiki-links.xml graphdb

Just to give an idea, enwiki-20130204-pages-articles.xml.bz2 is 9.1G and contains almost 10M pages, resulting in over 92M links to be extracted.

On my laptop with an SSD, decompressing and running ExtractLinks takes about 30 minutes (pretty much the same time as decompressing alone), and ImportGraph takes an additional 10 minutes.

(Note that disk I/O is the critical factor here: the same import will easily take several hours with an old 5400RPM drive.)

Querying

The Neo4j browser can be used to query and visualise the imported graph. Here are some sample Cypher queries.

Show all pages linked to a given starting page - e.g. "Neo4j":

MATCH (p0:Page {title:'Neo4j'}) -[:Link]- (p:Page)
RETURN p0, p

Find how two pages - e.g. "Neo4j" and "Kevin Bacon" - are connected:

MATCH (p0:Page {title:'Neo4j'}), (p1:Page {title:'Kevin Bacon'}),
  p = shortestPath((p0)-[*..6]-(p1))
RETURN p

graphipedia's People

Contributors

mirkonasato, pesua

graphipedia's Issues

Fewer nodes created than expected for the 18 GB Wikipedia data dump

When I use the dump at http://dumps.wikimedia.org/enwiki/20130805/ (enwiki-20130805-pages-meta-current.xml.bz2), which is 18.4 GB in size, with the importer, I get the same number of nodes and relationships as with the enwiki-20130805-pages-articles-multistream.xml.bz2 dump, which is 10.1 GB in size, i.e. 10,310,502 nodes and 96,711,488 relationships. The importer tool reports a lot of broken links. But according to the Wikipedia statistics at http://en.wikipedia.org/wiki/Wikipedia:Statistics there should be 31,728,583 pages, i.e. 31,728,583 nodes. As far as I know, the 10 GB dataset is basically just articles without talk or user pages, while the 18.4 GB dataset is all pages. Is there something in the parsing logic that ignores the data items in the 18 GB dataset?

Performance notes are wrong

Hello,

If it took 30 minutes to process the 9.1 GB file, the throughput was about 5.06 MB/s (9.1 GB ≈ 9100 MB; 9100 MB / (30 × 60 s) ≈ 5.06 MB/s). 5400 RPM disks have 40 MB/s read/write throughput, so they are not the bottleneck. To speed things up you can use lbzip2, which is multi-threaded (it helped me a lot).
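
The estimate above can be reproduced with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this example, and the two calls show that the result is roughly 5 MB/s whether 9.1G is read as decimal (1000 MB per GB) or binary (1024 MB per GB).

```java
import java.util.Locale;

final class ThroughputEstimate {

    /** Average throughput in MB/s for sizeMb megabytes processed in the given number of minutes. */
    static double megabytesPerSecond(double sizeMb, double minutes) {
        if (minutes <= 0) {
            throw new IllegalArgumentException("minutes must be positive");
        }
        return sizeMb / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // 9.1G and 30 minutes are the figures quoted in the README.
        double decimal = megabytesPerSecond(9.1 * 1000, 30); // 1 GB = 1000 MB
        double binary = megabytesPerSecond(9.1 * 1024, 30);  // 1 GB = 1024 MB
        System.out.println(String.format(Locale.ROOT, "decimal: %.2f MB/s", decimal));
        System.out.println(String.format(Locale.ROOT, "binary:  %.2f MB/s", binary));
    }
}
```

Either way the figure is well below the 40 MB/s quoted for a 5400 RPM disk, which is the point being made here.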

Best regards

Problem using Neo4j 3.3.1

Hi:

Thanks for sharing the code :)

Following what is stated in README, I have changed the version of neo4j to 3.3.1 in the pom.xml file. However, when I execute the command mvn install, there was an error indicating that in ImportGraph.java [37,44], java.lang.String cannot be converted to java.io.File. Now I am reading the code and trying to fix this error, but considering that it has been 2 years since you last updated the code, I am worried that there might be other potential problems. Could you please check the code again? It would be of much help!
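
A "java.lang.String cannot be converted to java.io.File" error usually means a method that used to accept a path string now expects a File. The sketch below shows that conversion using only the standard library. The helper name toStoreDir is hypothetical, and the assumption that the failing call in ImportGraph.java is the batch inserter factory (which, as far as I know, takes a File in Neo4j 3.x) has not been checked against the project source.

```java
import java.io.File;

final class StoreDirArgument {

    /** Converts a directory name from the command line into the File that newer APIs expect. */
    static File toStoreDir(String dataDir) {
        if (dataDir == null || dataDir.isEmpty()) {
            throw new IllegalArgumentException("database directory must not be empty");
        }
        return new File(dataDir).getAbsoluteFile();
    }

    public static void main(String[] args) {
        // "graphdb" is the directory name used in the README's ImportGraph command.
        File storeDir = toStoreDir(args.length > 0 ? args[0] : "graphdb");
        // Assumption: the call that fails to compile would then receive storeDir
        // instead of the original String, e.g. BatchInserters.inserter(storeDir).
        System.out.println("Store directory: " + storeDir);
    }
}
```

This only addresses the one compile error quoted above; other API changes between Neo4j versions may still need separate fixes.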

Many thanks for your time :)

Update to 2.2 by using docker

Dear Mirko,
thanks for your great work.

I am trying to import your graphdb inside the latest release of neo4j:
http://neo4j.com/blog/neo4j-2-2-0-scalability-performance

The first time, I was asked by Neo4j to upgrade the database by setting a special property, and after upgrading, the server crashed with no log and no clear error.

Since I am using Docker, a very easy and convenient system for isolating applications inside containers, may I ask you to rebuild the db using the updated Neo4j version?

The docker image is available here:
https://registry.hub.docker.com/u/kbastani/docker-neo4j/

It also contains clear instructions on how to use it if you have Docker installed.
If you have problems installing Docker or using the image, I can help you.
I could also build a dev environment for this.

Thanks,
Paolo

Difficulties importing the graphdb into Neo4j

Hi,

I succeeded in creating the graphdb directory, but when trying to import it into Neo4j 2.1.7 I got this message:

"Starting Neo4j Server failed: Error starting org.neo4j.kernel.EmbeddedGraphDatabase, C:\Users\Joubert\Documents\Neo4j\graphdb"

Can you help me? Maybe it is not the right version of Neo4j?

Sincerely yours.

Upgrading to latest Neo4j?

I'm interested in trying out this library, super cool that you put this together! However Neo4j 3.2.9 is hard to come by at this point (doesn't seem like it's distributed on the official page anymore). Would be nice to be able to use with the latest version (looks like 4.2.5).

This is something I might be able to work on and submit a pull request for, but first wanted to get some sense of how much work it would be. Based on your last commit it looks like you've done this in the past from a previous version to 3.2.9; any sense of what the changes are that would have to be made?

On a side note, this isn't the first time that Neo4j's lack of backwards compatibility has been a thorn in my side.

IllegalStateException

Exception in thread "main" java.lang.IllegalStateException: Misaligned file size 68 for DynamicArrayStore[fileName:neostore.nodestore.db.labels, blockSize:60], expected version length 25

Hi, I am getting this error, am I doing something wrong?

Thanks,
Pedro

Compilation error due to missing packages

Hi, I'm not sure this project is still maintained, but I'll add this issue here as well. I updated the neo4j version in the pom.xml file to 4.2.1, which is installed on my Debian Linux system. I extracted the zip of the code downloaded from here into a random folder and ran mvn package. The build of graphipedia-parent is successful, but graphipedia-dataimport fails with missing package errors:

xv@bunsen:~/bin/graphipedia-master$ mvn -e package
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] Graphipedia Parent                                                 [pom]
[INFO] Graphipedia DataImport                                             [jar]
[INFO] 
[INFO] -----------------< org.graphipedia:graphipedia-parent >-----------------
[INFO] Building Graphipedia Parent 0.1.0-SNAPSHOT                         [1/2]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] 
[INFO] ---------------< org.graphipedia:graphipedia-dataimport >---------------
[INFO] Building Graphipedia DataImport 0.1.0-SNAPSHOT                     [2/2]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ graphipedia-dataimport ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ graphipedia-dataimport ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 9 source files to /home/xv/bin/graphipedia-master/graphipedia-dataimport/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[29,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[33,19] cannot find symbol
  symbol:   class BatchInserter
  location: class org.graphipedia.dataimport.neo4j.RelationshipCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[41,32] cannot find symbol
  symbol:   class BatchInserter
  location: class org.graphipedia.dataimport.neo4j.RelationshipCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[29,36] package org.neo4j.helpers.collection does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[30,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[34,19] cannot find symbol
  symbol:   class BatchInserter
  location: class org.graphipedia.dataimport.neo4j.NodeCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[39,24] cannot find symbol
  symbol:   class BatchInserter
  location: class org.graphipedia.dataimport.neo4j.NodeCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[29,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[30,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[34,19] cannot find symbol
  symbol:   class BatchInserter
  location: class org.graphipedia.dataimport.neo4j.ImportGraph
[INFO] 10 errors 
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Graphipedia Parent 0.1.0-SNAPSHOT:
[INFO] 
[INFO] Graphipedia Parent ................................. SUCCESS [  0.005 s]
[INFO] Graphipedia DataImport ............................. FAILURE [  1.756 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.890 s
[INFO] Finished at: 2021-01-03T23:38:38+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project graphipedia-dataimport: Compilation failure: Compilation failure: 
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[29,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[33,19] cannot find symbol
[ERROR]   symbol:   class BatchInserter
[ERROR]   location: class org.graphipedia.dataimport.neo4j.RelationshipCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/RelationshipCreator.java:[41,32] cannot find symbol
[ERROR]   symbol:   class BatchInserter
[ERROR]   location: class org.graphipedia.dataimport.neo4j.RelationshipCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[29,36] package org.neo4j.helpers.collection does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[30,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[34,19] cannot find symbol
[ERROR]   symbol:   class BatchInserter
[ERROR]   location: class org.graphipedia.dataimport.neo4j.NodeCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/NodeCreator.java:[39,24] cannot find symbol
[ERROR]   symbol:   class BatchInserter
[ERROR]   location: class org.graphipedia.dataimport.neo4j.NodeCreator
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[29,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[30,36] package org.neo4j.unsafe.batchinsert does not exist
[ERROR] /home/xv/bin/graphipedia-master/graphipedia-dataimport/src/main/java/org/graphipedia/dataimport/neo4j/ImportGraph.java:[34,19] cannot find symbol
[ERROR]   symbol:   class BatchInserter
[ERROR]   location: class org.graphipedia.dataimport.neo4j.ImportGraph
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project graphipedia-dataimport: Compilation failure
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
    at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute (AbstractCompilerMojo.java:858)
    at org.apache.maven.plugin.compiler.CompilerMojo.execute (CompilerMojo.java:129)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :graphipedia-dataimport

Could not find or load main class

Good day, has anyone encountered the problem below when trying to run ImportGraph to import the graph into Neo4j?
Error: Could not find or load main class org.graphipedia.dataimport.neo4j.ImportGraph
Caused by: java.lang.ClassNotFoundException: org.graphipedia.dataimport.neo4j.ImportGraph
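
One possible cause of this error is that the path given to -classpath does not point at the built jar (the README says the build places it in graphipedia-dataimport/target/). The sketch below, using only java.util.jar, checks whether a jar actually contains a given class. The class name JarClassCheck and its method are hypothetical helpers written for this example, not part of Graphipedia.

```java
import java.io.IOException;
import java.util.jar.JarFile;

final class JarClassCheck {

    /** Returns true if the jar at jarPath has an entry for the fully qualified class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path is the build output location named in the README.
        String jarPath = args.length > 0
                ? args[0]
                : "graphipedia-dataimport/target/graphipedia-dataimport.jar";
        String className = "org.graphipedia.dataimport.neo4j.ImportGraph";
        System.out.println(className + " present in " + jarPath + ": "
                + containsClass(jarPath, className));
    }
}
```

If the jar path itself is wrong, JarFile throws an IOException, which is also a useful hint that the classpath argument is the problem.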

Thanks!

Database files require upgrading

I get the following message when attempting to use the database created by graphipedia:

ERROR Neo4j cannot be started, because the database files require upgrading and upgrades are disabled in configuration. Please set 'allow_store_upgrade' to 'true' in your configuration file and try again.

Setting allow_store_upgrade=true results in Neo4j not responding to HTTP requests. I don't see any logs generated either. I'm using the official Docker image like this:

docker run \
    --name neo4j-wiki \
    --rm \
    --publish=7474:7474 \
    --volume=$HOME/neo4j/data:/data \
    --volume=$HOME/neo4j/conf:/conf \
    --env=NEO4J_CACHE_MEMORY=2G \
    --env=NEO4J_HEAP_MEMORY=4G \
    neo4j:2.3.2

In the neo4j-server.properties I've set allow_store_upgrade=true and org.neo4j.server.database.location=data/wikipedia.db.
