neo4j-webgraph's Introduction

Project Info

Neo4j-Webgraph is a Java application to crawl the configured websites and import their pages and links as nodes and relationships into a Neo4j graph database.

You configure the application by placing a config.properties file at the root of the classpath. A template config file is provided with the source code in the src/main/resources directory.
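As a rough illustration of what "at the root of the classpath" means, the sketch below loads a properties file the way a Java application commonly does. The class name ConfigLoader and the key example.key are hypothetical and not taken from the project; the real keys are the ones listed in the template config file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration; not part of neo4j-webgraph.
final class ConfigLoader {

    // Looks up a resource at the classpath root; returns empty Properties if it is absent.
    static Properties loadFromClasspath(String resourceName) {
        Properties props = new Properties();
        try (InputStream in = ConfigLoader.class.getClassLoader().getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + resourceName, e);
        }
        return props;
    }

    // Parses properties from text; handy for trying out the file format without a file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Could not parse properties", e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = loadFromClasspath("config.properties");
        System.out.println("Loaded " + props.size() + " settings");
    }
}
```

If the file is not on the classpath, this sketch silently returns an empty set of properties; the real application may well behave differently.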

The application takes some command line parameters. Pass the -h argument to see all possible command line options, e.g.:

java -cp release/neo4j-webgraph-1.8.3.0-jar-with-dependencies.jar org.neo4japps.webgraph.importer.Main -h

Note that, during the import process, you can annotate your Neo4j graph nodes with additional properties by providing custom event handlers that get invoked when a node gets created.

Sample Twitter and Facebook event handlers are provided with the source code. They use these services' remote APIs to determine how often a page has been "liked" (Facebook) or tweeted about (Twitter). See the source code for details.
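The handler mechanism can be pictured roughly as follows. This is only a sketch of the general callback pattern: every name in it (HandlerSketch, NodeCreatedHandler, SchemeHandler, the "secure" property) is hypothetical, and the project's actual handler interface should be taken from the source code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names below are hypothetical; the project's real handler interface may look different.
final class HandlerSketch {

    // Callback invoked once for each newly created page node.
    interface NodeCreatedHandler {
        void onNodeCreated(String url, Map<String, Object> nodeProperties);
    }

    // Example handler: records whether the page URL uses HTTPS.
    static final class SchemeHandler implements NodeCreatedHandler {
        @Override
        public void onNodeCreated(String url, Map<String, Object> nodeProperties) {
            nodeProperties.put("secure", url.startsWith("https://"));
        }
    }

    // Stand-in for the importer: builds the node's properties and lets each handler annotate them.
    static Map<String, Object> createNode(String url, List<NodeCreatedHandler> handlers) {
        Map<String, Object> properties = new HashMap<String, Object>();
        properties.put("url", url);
        for (NodeCreatedHandler handler : handlers) {
            handler.onNodeCreated(url, properties);
        }
        return properties;
    }

    public static void main(String[] args) {
        List<NodeCreatedHandler> handlers = new ArrayList<NodeCreatedHandler>();
        handlers.add(new SchemeHandler());
        System.out.println(createNode("https://mydomain.com/", handlers));
    }
}
```

A handler that calls a remote API, as the Twitter and Facebook samples do, would follow the same shape but write the fetched counts into the node's properties.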

Project set-up and build

  1. Download and install the Java JDK 1.7 or newer

  2. Download and install Maven 3.1 or newer

  3. Configure the import options in src/main/resources/config.properties

  4. Build the project using build.[cmd|sh]

    If the build is successful a jar file called neo4j-webgraph--jar-with-dependencies.jar is generated in the target directory.

  5. Optional: Generate Eclipse project files (only needed for development in Eclipse): run updateEclipseProject.[cmd|sh]

  6. Optional: Configure your local Maven repo in Eclipse (only needed for development in Eclipse)

    mvn -Declipse.workspace= eclipse:add-maven-repo

  7. Run the importer using import.[cmd|sh]

    The importer accepts a number of command line options. To see them, run 'import -h' or 'import -?'. If you are not sure, just accept the defaults.

  8. Use the graph

    The import process creates a Neo4J graph database in the graph.db subdirectory.

    To use the imported graph you must download and install Neo4J from the Neo4J website. It is recommended that you use the same Neo4J version that was used to build the graph. For example, if the neo4j-webgraph version is 1.8.3.0 then you need Neo4J 1.8.3. Note that Neo4J can be installed on a different machine from the one used for building and running the import. To run Neo4J all you need is a JRE. All other tools mentioned above are needed only for building neo4j-webgraph.

    The easiest way is to copy the entire graph.db directory into the 'data' subdirectory of your Neo4J installation, for example neo4j-community-1.8.3\data\graph.db, and then (re)start Neo4J.

    Once Neo4J has been started you can point your browser at http://HOSTNAME:7474/webadmin/ to browse and query the graph. See the Neo4J documentation for details.

    By default, for security reasons, the webadmin interface is only accessible from the machine running Neo4J. So if your Neo4J server is running on a remote machine (i.e. not localhost) you need to un-comment the line #org.neo4j.server.webserver.address=0.0.0.0 in the conf/neo4j-server.properties file on the server and restart Neo4J.
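The version pairing mentioned in step 8 (neo4j-webgraph 1.8.3.0 goes with Neo4J 1.8.3) can be sketched as a small helper. It assumes that the first three components of the neo4j-webgraph version always name the Neo4J version, which is inferred from that single example; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of neo4j-webgraph.
final class VersionMatch {

    // Returns the first three dot-separated components, e.g. "1.8.3.0" -> "1.8.3".
    static String neo4jVersionFor(String webgraphVersion) {
        String[] parts = webgraphVersion.split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException(
                    "Expected at least three version components: " + webgraphVersion);
        }
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    public static void main(String[] args) {
        System.out.println(neo4jVersionFor("1.8.3.0")); // prints 1.8.3
    }
}
```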

Exploring the graph using the Neo4J Cypher query language

Once you have imported the website(s) into a Neo4J graph you can browse the graph using Cypher queries. Cypher is the Neo4J graph query language (SQL-like).

Note: there are other ways of browsing the graph. Cypher is only one of them. See Neo4J documentation for details.

Using the webadmin app you can use the Neo4J remote shell page to run your Cypher queries, e.g. http://localhost:7474/webadmin/#/console/

You can also use the neo4j-shell utility in the bin directory.
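Queries can also be sent from code. The sketch below assumes the server exposes the Cypher REST endpoint at /db/data/cypher, as Neo4J 1.x servers normally do; check the Neo4J documentation for your version. The class CypherRestSketch and its methods are hypothetical names, and the JSON escaping is deliberately minimal (it handles backslashes and double quotes only, not control characters).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

// Hypothetical client for illustration; not part of neo4j-webgraph.
final class CypherRestSketch {

    // Wraps a Cypher query in a JSON body, escaping backslashes and double quotes.
    static String toJsonPayload(String cypher) {
        String escaped = cypher.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\": \"" + escaped + "\", \"params\": {}}";
    }

    // POSTs the query to the given endpoint and returns the raw JSON response.
    static String run(String endpoint, String cypher) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(toJsonPayload(cypher).getBytes("UTF-8"));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\\A"); // read the whole stream as one token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        String query = "start p=node:pages(type=\"home\") return count(p)";
        // Prints the request body only; no server is contacted here.
        System.out.println(toJsonPayload(query));
    }
}
```

To actually send the query, call run("http://localhost:7474/db/data/cypher", query) against a running server.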

Here are some sample queries. Note that in many cases it is convenient to find the start node(s) for the query via index lookup, like this:

start p1=node:index-name(key="value"), p2=node:index-name(key="value")
   where ...

Queries:

  • Find a node with a specific id, e.g. 152. This node may or may not be a web page node; if it is not, the 'url' property will be undefined.

    start p=node(152) return p.url

  • Select a group of nodes by id

    start p=node(152,153,154) return p.url

  • Find a page node with a specific URL

    start p=node:pages(url="http://mydomain.com/") return p.url, p.incomingLinks, p.outgoingLinks, p.facebookTotalCount, p.twitterCount

  • Find all pages that link to a particular page

    start p=node:pages(url="http://subdomain.mydomain.com/") match linkingpage-[:LINKS_TO]->p return count(linkingpage)

  • Find all page nodes with at least 1000 incoming links

    start p=node:pages("url:*") where p.incomingLinks >= 1000 return p.incomingLinks, p.url order by p.incomingLinks desc

    // attempt to do the same without using p.incomingLinks property
    // unfortunately "group by/having" does not work (yet) in Cypher :(
    start p=node:pages("url:*") match linkingpage-[:LINKS_TO]->p group by p.url having count(linkingpage) >= 1000 return count (linkingpage), p.url

    // an alternative is to return the x top pages with the most number of links, e.g.
    start p=node:pages("url:*") match linkingpage-[:LINKS_TO]->p return count (linkingpage) as nrOflinks, p.url order by nrOflinks desc limit 50

  • Count all the home pages

    start p=node:pages(type="home") return count (p)

  • Find all the home pages, sorted by number of incoming links

    start p=node:pages(type="home") return p.url, p.incomingLinks, p.outgoingLinks order by p.incomingLinks desc

  • Find home page nodes with at least 50 incoming links

    start p=node:pages(type="home") where p.incomingLinks >= 50 return p.url, p.incomingLinks order by p.incomingLinks desc

  • Find number of pages in the 'somedomain' domain

    start p=node:pages(domain="somedomain") return count (p)

  • Find pages from the 'somedomain' domain with at least 50 incoming links

    start p=node:pages(domain="somedomain") where p.incomingLinks >= 50 return p.url, p.incomingLinks order by p.incomingLinks desc

  • Find pages outside the 'somedomain' domain with at least 100 incoming links, being linked to from pages from the somedomain domain

    start p=node:pages(domain="somedomain") match p-[:LINKS_TO]->linkedpage where linkedpage.incomingLinks >= 100 and linkedpage.domain <> 'somedomain' return p.url, linkedpage.url, linkedpage.incomingLinks order by linkedpage.incomingLinks desc

  • Find all page nodes containing a certain text (i.e. where content matches a certain regexp)

    start p=node:pages("url:*") where p.content =~ /Page not yet fetched.*/ return count(p)

  • Find page nodes with un-initialised Facebook/Twitter properties

    start p=node:pages("url:*") where not has(p.facebookTotalCount) return p.url

    start p=node:pages("url:*") where not has(p.twitterCount) return p.url
