
googleplay-web-crawler's Introduction

GooglePlay Web Crawler

What is Hadoop Ecosystem?


  • The core components of Hadoop are HDFS and YARN, plus the engines and applications that run on them, such as MapReduce, Tez, Nutch, Pig, Hive, and Spark.
  • HDFS is composed of a NameNode and DataNodes and handles data storage.
  • YARN is composed of a ResourceManager and NodeManagers and handles resource allocation.
  • Applications like Pig and Hive are higher-level language processors. They make it much easier to run MapReduce jobs.

How does web crawler work?

  • Use a customized Nutch to crawl app metadata from Google Play
  • Inject the seed URLs into nutchDB
  • Generate the URLs to crawl from nutchDB
  • Fetch app metadata from the HTML pages
  • Parse the extracted metadata and outlinks
  • Update nutchDB with the new outlinks
  • A Pig LoadFunc transforms nutchDB into a readable text-file form
  • Create tables and manage the data with Hive
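
The inject, generate, fetch, parse, and update steps above form a loop that repeats until no unfetched URLs remain. The sketch below is a toy, in-memory illustration of that cycle only. It does not use Nutch; `FAKE_WEB`, the `app/...` URLs, and all function names are hypothetical stand-ins for real pages and for the real nutchDB.

```python
# Hypothetical in-memory "web": URL -> (metadata, outlinks).
# It stands in for the HTML pages a real fetcher would download.
FAKE_WEB = {
    "app/a": ({"title": "App A"}, ["app/b", "app/c"]),
    "app/b": ({"title": "App B"}, ["app/a"]),
    "app/c": ({"title": "App C"}, []),
}


def inject(db, seeds):
    """Add seed URLs to the crawl database as unfetched entries."""
    for url in seeds:
        db.setdefault(url, "unfetched")


def generate(db, top_n):
    """Select up to top_n unfetched URLs for the next round."""
    return [url for url, status in db.items() if status == "unfetched"][:top_n]


def fetch_and_parse(url):
    """Return (metadata, outlinks) for a URL; unknown URLs yield nothing."""
    return FAKE_WEB.get(url, ({}, []))


def update(db, url, outlinks):
    """Mark the URL as fetched and register newly discovered outlinks."""
    db[url] = "fetched"
    for link in outlinks:
        db.setdefault(link, "unfetched")  # never resets an already-known URL


def crawl(seeds, rounds=3, top_n=10):
    """Run the generate/fetch/parse/update cycle for a fixed number of rounds."""
    db, parsed = {}, {}
    inject(db, seeds)
    for _ in range(rounds):
        batch = generate(db, top_n)
        if not batch:
            break  # nothing left to fetch
        for url in batch:
            metadata, outlinks = fetch_and_parse(url)
            parsed[url] = metadata
            update(db, url, outlinks)
    return db, parsed


if __name__ == "__main__":
    db, parsed = crawl(["app/a"])
    for url in sorted(parsed):
        print(url, db[url], parsed[url])
```

Starting from a single seed, each round discovers new outlinks, which is how one app page can lead the crawler to the rest of the store.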

Command Line

  • git clone
git clone https://github.com/apache/nutch
cd nutch
git checkout release-1.12
  • customize nutch
patch -p1 < /googleplaycrawler/googleplaycrawler.patch
  • run googleplaycrawler on a single Nutch cluster
echo "https://play.google.com/store/apps/details?id=com.facebook.orca" > seed

hadoop fs -put seed 

hadoop jar build/apache-nutch-1.12.job org.apache.nutch.googleplay.GooglePlayCrawler seed -numFetchers 10
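
To start the crawl from more than one app, the seed file can hold several detail-page URLs. The sketch below builds such a file, assuming the crawler accepts the usual Nutch seed format of one URL per line. The helper names and the `com.example.app` package ID are hypothetical. Only `com.facebook.orca` and the URL pattern come from the command above.

```python
from urllib.parse import urlencode

# Detail-page URL pattern taken from the seed command above.
DETAILS_URL = "https://play.google.com/store/apps/details"


def seed_lines(package_ids):
    """Build one Google Play detail-page URL per package ID."""
    return [f"{DETAILS_URL}?{urlencode({'id': pid})}" for pid in package_ids]


def write_seed_file(path, package_ids):
    """Write the seed URLs to a text file, one per line (assumed seed format)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(seed_lines(package_ids)) + "\n")


if __name__ == "__main__":
    # "com.example.app" is a placeholder package ID, not a real app.
    write_seed_file("seed", ["com.facebook.orca", "com.example.app"])
```

The resulting `seed` file can then be uploaded with the same `hadoop fs -put seed` command.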

  • check output

hadoop fs -text file:///xxxxxx/nutchdb/segments/xxxxx/parse_data/part-00000/data

  • fix the data-skew job
patch -p1 < fixskew.patch
  • upload the seeds file and run the web crawler on AWS EMR

  • AWS S3

register target/nutchdbloader-0.0.1-SNAPSHOT.jar;
register /home/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar;
register nutch-1.12.jar;
loaded = load 's3n://test/nutchdb/segments/*/parse_data/part-*/data' using com.example.NutchParsedDataLoader();
filtered = filter loaded by $0 is not null;
store filtered into 'output';
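
The `filter` statement in the Pig script keeps only records whose first field (`$0`) is not null. A rough Python equivalent, using made-up records in place of the loader's real output, might look like this. The field layout shown is an assumption, since the schema produced by `NutchParsedDataLoader` is not documented here.

```python
def keep_non_null_first_field(records):
    """Mirror `filter loaded by $0 is not null`: keep tuples whose first field is set."""
    # Empty tuples have no first field, so they are dropped as well.
    return [rec for rec in records if rec and rec[0] is not None]


if __name__ == "__main__":
    # Hypothetical parsed records: (package ID, title).
    loaded = [
        ("com.example.app", "Example App"),
        (None, "Record with missing key"),
        ("com.example.other", None),
    ]
    for rec in keep_non_null_first_field(loaded):
        print(rec)
```

Only the first field is checked, so a record with a null in a later field still passes the filter.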

Text results
