pygotham2018_graphmining's Introduction

Large Scale Graph Mining with Spark

PyGotham 2018 talk. See also my tutorial on Medium.

Getting started

This repo includes a Dockerfile for running a Jupyter notebook with pyspark.

Running the notebook

  1. Make sure you have Docker installed.
  2. Run make build to create your Docker image. This may take a while.
  3. Run make run_notebook_volume. This starts a Docker container with a volume containing the notebooks and the sample dataset.
  4. Go to 127.0.0.1:8888 to see the notebook server. You may need to enter an authentication token, which is printed in your terminal output.
  5. Open work/notebooks/Graphframes_demo.

Stopping Jupyter notebook

  1. Find the running container with docker ps.
  2. Kill the container with docker kill <container_id>.

About the sample dataset

I also included a small sample dataset that I created from the Common Crawl September 2017 dataset. The data, stored in a parquet file under notebooks/data/outlinks_pq, has the following format:

  • parent: full URL of the parent node, the HTML page I pulled links from.
  • parentTLD: top-level domain of the parent.
  • childTLD: top-level domain of the child.
  • child: full URL of the child node, the link found on the parent web page.
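To make the four columns concrete, here is a minimal sketch in plain Python (not Spark) of how rows in this format could be collapsed into a domain-level graph. The sample rows and the build_domain_graph helper are hypothetical and made up for illustration; the real values in the dataset may look different, and dropping links that stay within one domain is a choice made here, not necessarily what the notebook does.

```python
from collections import Counter

# Hypothetical rows mimicking the four columns described above.
# These URLs are invented for illustration; they are not from the dataset.
rows = [
    {"parent": "http://www.example.com/index.html", "parentTLD": "example.com",
     "child": "http://blog.example.org/post/1", "childTLD": "example.org"},
    {"parent": "http://www.example.com/about.html", "parentTLD": "example.com",
     "child": "http://blog.example.org/post/2", "childTLD": "example.org"},
    {"parent": "http://blog.example.org/post/1", "parentTLD": "example.org",
     "child": "http://www.example.net/", "childTLD": "example.net"},
    {"parent": "http://www.example.com/index.html", "parentTLD": "example.com",
     "child": "http://www.example.com/contact.html", "childTLD": "example.com"},
]


def build_domain_graph(rows):
    """Collapse page-level links into a domain-level directed graph."""
    edges = set()
    for row in rows:
        src, dst = row["parentTLD"], row["childTLD"]
        if src != dst:  # assumption: ignore links that stay within one domain
            edges.add((src, dst))  # the set removes duplicate domain pairs
    vertices = {v for edge in edges for v in edge}
    return sorted(vertices), sorted(edges)


vertices, edges = build_domain_graph(rows)

# Number of distinct domains each domain links out to.
out_degree = Counter(src for src, _ in edges)

print(vertices)
print(edges)
print(dict(out_degree))
```

The same idea carries over to Spark: select the two domain columns, drop duplicates, and treat them as the source and destination of each edge.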

Hopefully this will jumpstart your exploration of web graphs, LPA, PageRank, and other cool features!

References

Adamic, Lada A., and Natalie Glance. "The political blogosphere and the 2004 US election: divided they blog." Proceedings of the 3rd international workshop on Link discovery. ACM, 2005.

Common Crawl dataset (September 2017).

Farine, Damien R., et al. "Both nearest neighbours and long-term affiliates predict individual locations during collective movement in wild baboons." Scientific reports 6 (2016): 27704.

Fortunato, Santo. "Community detection in graphs." Physics reports 486.3-5 (2010): 75-174.

Girvan, Michelle, and Mark EJ Newman. "Community structure in social and biological networks." Proceedings of the national academy of sciences 99.12 (2002): 7821-7826.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Raghavan, Usha Nandini, Réka Albert, and Soundar Kumara. "Near linear time algorithm to detect community structures in large-scale networks." Physical review E 76.3 (2007): 036106.

Zachary karate club network dataset -- KONECT, April 2017.

Additional Resources

Spark

  • I like Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
  • Also High Performance Spark by Holden Karau and Rachel Warren.

GraphFrames
