Final Project for Real-time and Big Data Analytics
Designed an application that let users enter query words and returned, for each word, a popularity analysis and a summary of public attitudes, based on real-time data crawled from Reddit, Twitter and Yelp.
Implemented a spider program in Python with the Scrapy framework to crawl the websites; used Java, MapReduce and Pig to clean the crawled data. End results included graphs of popularity trends, pie charts of opinion distribution, a list of the top words associated with the searched word, and a prediction of future popularity, among others.
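As a rough illustration of the "top words associated with the searched word" output, here is a minimal sketch that counts words co-occurring with a query word. It is not the project's code: the function name top_associated_words, the STOPWORDS set and whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical stop-word list; the project's real filtering rules are not described here.
STOPWORDS = {"a", "an", "the", "is", "with", "and", "of", "to", "i"}

def top_associated_words(query, posts, n=5):
    """Return the n words that most often appear in posts mentioning the query."""
    query = query.lower()
    counts = Counter()
    for post in posts:
        words = post.lower().split()  # naive whitespace tokenization (assumption)
        if query in words:
            # Count every other non-stop-word in a post that mentions the query.
            counts.update(w for w in words if w != query and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]
```

A call such as top_associated_words("pizza", posts, 3) returns up to three words, most frequent first.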
Please see the following report for more details:
https://drive.google.com/open?id=0B_bKdJl2aPq_Qnd6ZFRRY2NWMlE
Run Configuration:
spark_twitter.py parses the Twitter data using Spark.
yelp_merge.py, yelp_parse_business_list.py and yelp_parse_tip_list.py parse the Yelp data.
parse_reddit.cpp parses the Reddit data.
Sentiment.java, SentimentMapper.java and SentimentReducer.java, in the folder named MapReduce, run the main sentiment analysis on the parsed data as a MapReduce job.
PostProcess.java processes the MapReduce output and generates the report.
run.sh is the shell script that runs the whole project and prints the report to standard output.
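To make the map and reduce roles in the sentiment step more concrete, here is a minimal single-process sketch of the pattern. It is written in Python for brevity and is not a translation of SentimentMapper.java or SentimentReducer.java. The word lists, the labels and the function names are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical lexicons; the project's actual sentiment word lists are not shown in this README.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def sentiment_map(query, line):
    """Map step: emit (label, 1) for a line that mentions the query word."""
    words = line.lower().split()
    if query.lower() not in words:
        return []
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return [(label, 1)]

def sentiment_reduce(pairs):
    """Reduce step: sum the counts emitted for each label."""
    totals = defaultdict(int)
    for label, count in pairs:
        totals[label] += count
    return dict(totals)

def run_job(query, lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(sentiment_map(query, line))
    return sentiment_reduce(pairs)
```

The per-label totals returned by run_job are the kind of figures an opinion-distribution pie chart could be drawn from.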
To run the shell script, put Sentiment.jar (built from Sentiment.java, SentimentMapper.java and SentimentReducer.java) in the same folder as run.sh.
Put PostProcess.java in the same folder as run.sh as well.
To search Reddit data, put the data file named input_r into HDFS.
To search Yelp data, put the data file named input_y into HDFS.
To search Twitter data, put the data file named input_t into HDFS.
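The upload step can be scripted. The sketch below only builds the `hdfs dfs -put` command for a given source and does not run it. The file names input_r, input_y and input_t come from this README. The helper name hdfs_put_command, the mapping keys, and the choice of a relative destination (which resolves to the user's HDFS home directory) are assumptions; adjust the destination to wherever run.sh expects the input.

```python
# HDFS file names as listed in this README; the dictionary keys are illustrative.
HDFS_INPUT_NAMES = {"reddit": "input_r", "yelp": "input_y", "twitter": "input_t"}

def hdfs_put_command(source, local_path):
    """Build the argument list that uploads local_path under the expected HDFS name."""
    name = HDFS_INPUT_NAMES[source.lower()]  # raises KeyError for an unknown source
    # Relative destination: an assumption, since the README does not name a directory.
    return ["hdfs", "dfs", "-put", local_path, name]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine with the Hadoop client installed.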