
Streaming workshop demo

This demo was part of the technical webinar workshops below:

  • "Real Time Monitoring with Hadoop" - Slides and webinar recording are available here
  • "Search Workshop" - Slides and webinar recording are available here

Author: Ali Bajwa

With special thanks to:

  • Guilherme Braccialli for helping to maintain the code and adding sentiment analysis component
  • Tim Veil for developing the original banana dashboard

Purpose: Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume

  • Ingest: Listen for Twitter streams related to S&P 500 companies

  • Processing:

    • Monitor tweets for unexpected volume
    • Volume thresholds managed in HBase
  • Persistence:

    • HDFS (for future batch processing)
    • Hive (for interactive query)
    • HBase (for realtime alerts)
    • Solr/Banana (for search and reports/dashboards)
    • Audits in Ranger/Solr/Banana
    • Authorization policies in Ranger
  • Refine:

    • Update threshold values based on historical analysis of tweet volumes
  • Demo setup:

    • Either download and start the prebuilt VM, or
    • Start the HDP 2.4 sandbox and run the provided scripts to set up the demo
  • Previous versions

    • For HDP 2.3.2 instructions see here
    • For HDP 2.3.0 instructions see here
    • For HDP 2.2 instructions see here
    • For HDP 2.1 version of this demo see here

Contents

  1. Option 1: Setup demo using prebuilt VM based on HDP 2.3 sandbox
  2. Option 2: Setup demo via scripts on vanilla HDP 2.4 sandbox
  3. Kafka basics - optional
  4. Setup Eclipse
  5. Run demo to monitor Tweets about S&P 500 securities in realtime
  6. Stop demo
  7. Troubleshooting
  8. Observe results in HDFS, Hive, Solr/Banana, HBase
  9. Use Zeppelin to create charts to analyze tweets - optional
  10. Import data into BI tools - optional
  11. Other things to try - optional
  12. Reset demo
  13. Run demo on cluster

Option 1: Setup demo using prebuilt VM based on HDP 2.3 sandbox

  • Download VM from here. Import it into VMWare Fusion and start it up.
  • Find the IP address of the VM and add an entry into your machine's hosts file e.g.
192.168.191.241 sandbox.hortonworks.com sandbox    
  • Connect to the VM via SSH (password hadoop)
  • Start the demo:
cd /root/hdp22-twitter-demo
./start-demo.sh
#once storm topology is submitted, press control-C

#start kafka twitter producer
./kafkaproducer/runkafkaproducer.sh
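
  • To confirm the topology was submitted, you can list the running topologies (the storm CLI is on the sandbox's PATH); Twittertopology should appear in the output:

#verify the topology is running
storm list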

  • Observe results in HDFS, Hive, Solr/Banana, HBase

  • Troubleshooting: check the Storm web UI for any errors and try resetting using the script below:

./reset-demo.sh

Option 2: Setup demo via scripts on vanilla HDP 2.4 sandbox

These setup steps are only needed the first time and may take up to 30 minutes to execute (depending on your internet connection).

  • While waiting on any step, if you don't already have Twitter credentials, follow steps here to get them

  • Download the HDP 2.4 sandbox VM image file (Hortonworks_sanbox_with_hdp_2_4_vmware.ova) from the Hortonworks website


  • Import the ova into VMWare Fusion, allocate at least 4 CPUs and 8 GB RAM (preferably 10 GB+), and start the VM

  • Find the IP address of the VM and add an entry into your machine's hosts file e.g.

192.168.191.241 sandbox.hortonworks.com sandbox    
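  • On a macOS or Linux host, you can append the entry from a terminal instead of editing the file by hand (substitute your VM's actual IP address):

echo "192.168.191.241 sandbox.hortonworks.com sandbox" | sudo tee -a /etc/hosts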
  • Download the code as the root user
cd
git clone https://github.com/hortonworks-gallery/hdp22-twitter-demo.git
  • (Optional) To set up a development environment on the sandbox to browse/modify the code, download the Ambari service for VNC (details below). This is not required if you just want to set up the demo.
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-vnc-service.git   /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/VNCSERVER   
service ambari restart
  • Set up the demo: run the script below (one time). It will start Ambari/HBase/Kafka/Storm and install Maven, Solr, and Banana.
cd /root/hdp22-twitter-demo
./setup-demo.sh
  • While it runs, proceed with installing the VNC service per the steps below

Optional - Setup VNC/Eclipse on your sandbox
  • Once the status of HDFS/YARN has changed from a yellow question mark to a green check mark...

  • Set up Eclipse on the sandbox VM and remote desktop into it using the Ambari service for VNC

    • In Ambari, open the Admin > Stacks and Services tab.
    • Deploy the service by selecting:
      • VNC Server -> Add service -> Next -> Next -> Enter password (e.g. hadoop) -> Next -> Proceed Anyway -> Deploy
      • Make sure the password is at least 6 characters or install will fail
    • Connect to VNC from your local laptop using VNC viewer software (e.g. TightVNC viewer, Chicken of the VNC, or just your browser). Detailed steps here
    • Import code into Eclipse using "Getting started with Storm and Maven in Eclipse environment" steps here
    • Review Storm code in Eclipse under /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm:
      • GNstorm.java: Main class; also where the topology, KafkaSpout, and HDFS bolts are instantiated
      • TwitterScheme.java: defines structure of a Tweet
      • SolrBolt.java: writes to Solr
      • TwitterRuleBolt.java: defines the business logic of when a tweet should result in an alert

Setup Twitter credentials
  • Twitter4J requires you to have a Twitter account and obtain developer keys by registering an "app". Create a Twitter account and app, and get your consumer key/secret and access token/secret: https://apps.twitter.com > sign in > create new app > fill in the required fields > create access tokens
  • Then enter the four values into the file below on the sandbox
vi /root/hdp22-twitter-demo/kafkaproducer/twitter4j.properties
oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
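
  • As a quick sanity check, the command below prints any keys that were left empty; no output means all four values are set:

grep '=$' /root/hdp22-twitter-demo/kafkaproducer/twitter4j.properties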

Kafka basics
  • Once Kafka is started, run through the commands below to understand the basics
#check if kafka already started
ps -ef | grep kafka

#if not, start kafka
nohup /usr/hdp/current/kafka-broker/bin/kafka-server-start.sh /usr/hdp/current/kafka-broker/config/server.properties &

#create topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper $(hostname -f):2181 --replication-factor 1 --partitions 1 --topic test

#list topic
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $(hostname -f):2181 --list | grep test

#start a producer and enter text on few lines
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $(hostname -f):6667 --topic test

#start a consumer in a new terminal; the text you entered appears in the consumer
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --zookeeper $(hostname -f):2181 --topic test --from-beginning

#hit Control-C on both terminals to quit the consumer/producer

#delete topic (only works if delete.topic.enable is set to true in Ambari > Kafka > Config)
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --delete --zookeeper $(hostname -f):2181 --topic test
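
  • You can also inspect a topic's partition and replica assignment with the --describe flag:

/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --describe --zookeeper $(hostname -f):2181 --topic test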

Run Twitter demo

Most of the steps below are optional, as they were already executed by the setup script above, but they are useful for understanding the components of the demo:

/root/hdp22-twitter-demo/fetchSecuritiesList/rungeneratecsv.sh
cat /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
  • (Optional) For future runs: you can add other stocks/hashtags to monitor to the CSV (make sure there are no trailing spaces or newlines at the end of the file). Find trending topics at http://mobile.twitter.com/trends
sed -i '1i$HDP,Hortonworks,Technology,Technology,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
sed -i '1i#hadoopsummit,Hadoop Summit,Hadoop,Hadoop,Santa Clara CA,0000000001,5' /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv
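  • Confirm the new rows were prepended to the top of the file:

head -n 2 /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv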
  • Open a connection to HBase via Phoenix and check that you can list tables. Notice the securities data was imported and the alerts table is empty
/usr/hdp/current/phoenix-client/bin/sqlline.py  localhost:2181:/hbase-unsecure
!tables
select * from securities;
select * from alerts;
select * from dictionary;
!q
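  • If you prefer a non-interactive check, sqlline.py also accepts a SQL file as a second argument; a minimal sketch (the /tmp/check.sql path is just an example):

echo 'select count(*) from securities;' > /tmp/check.sql
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure /tmp/check.sql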
  • Check the schema of the Hive table where the tweets will be stored for later analysis
hive -e 'desc tweets_text_partition'
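  • Once tweets are flowing, you can spot-check how many have landed in the same table:

hive -e 'select count(*) from tweets_text_partition'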
  • Start the Storm Twitter topology to generate alerts into an HBase table for stocks whose tweet volume is higher than their threshold. This will also write tweets to Hive/HDFS/local disk/Solr/Banana. The first time you run the command below, Maven will take about 15 minutes to download dependent jars.
cd /root/hdp22-twitter-demo
./start-demo.sh
#once storm topology is submitted, press control-C
  • (Optional) Other modes in which the topology can be started on future runs, if you want to clean the setup or run locally (not on the Storm instance running on the sandbox):
cd /root/hdp22-twitter-demo/stormtwitter-mvn
./runtopology.sh runOnCluster clean
./runtopology.sh runLocally skipclean
  • If you see errors like the ones below, double-check in Ambari that Storm is still up
Caused by: java.net.ConnectException: Connection refused
Could not find leader nimbus from seed hosts [sandbox.hortonworks.com]. Did you specify a valid list of nimbus hosts for config nimbus.seeds
  • Start the Kafka producer to begin streaming tweets into the topology
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh

To stop collecting tweets:

  • To stop producing tweets, press Control-C in the terminal you ran runkafkaproducer.sh

  • Kill the Storm topology to stop processing tweets

storm kill Twittertopology

Troubleshooting

  • If the Storm web UI shows topology errors, try resetting using the script below:

./reset-demo.sh

Observe results


  • Open the Banana UI to view/search tweet summaries and alerts: http://sandbox.hortonworks.com:8983/solr/banana/index.html

    • You can also access the UI via an Ambari view by following the steps here and replacing the url with http://sandbox.hortonworks.com:8983/solr/banana/index.html
    • For more details on how the Banana dashboard panels are built, refer to the underlying json file that defines all the panels
    • In case you don't see any tweets, try changing to a different timeframe on the timeline (e.g. by clicking 24 hours, 7 days, etc). If there is a time mismatch between the VM and your machine, the tweets may appear at a different place on the timeline than expected.
  • Run a query in Solr to look at tweets/hashtags/alerts. Click on 'Query' and enter a query under 'q'. Examples are doctype_s:tweet and text_t:AAPL. You can choose an output format under 'wt' and click 'Execute Query'. http://sandbox.hortonworks.com:8983/solr/#/tweets
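
  • The same queries can be run from the command line via Solr's REST API, e.g. (using the tweets core from the URL above):

curl "http://sandbox.hortonworks.com:8983/solr/tweets/select?q=doctype_s:tweet&wt=json&rows=5"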


  • Query the alerts table via Phoenix
/usr/hdp/current/phoenix-client/bin/sqlline.py  localhost:2181:/hbase-unsecure
select * from alerts;


  • Notice the tweets written to the sandbox filesystem via the FileSystem bolt
vi /tmp/Tweets.xls
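  • Since the bolt appends continuously, you can watch new tweets arrive in real time:

tail -f /tmp/Tweets.xls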
  • Open the Hive view and query the tweets table.

Use Zeppelin to create charts to analyze tweets

  • Apache Zeppelin can also be installed on the cluster/sandbox to generate charts for analysis using:

    • Spark
    • SparkSQL
    • Hive
    • Flink
  • The Zeppelin Ambari service can be used to easily install/manage Zeppelin on an HDP cluster

  • Sample queries and charts


Import data to a BI tool via ODBC for analysis - optional

  • Create an ORC table and copy the tweets over:
hive -f /root/hdp22-twitter-demo/stormtwitter-mvn/createORC.sql
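
  • Verify the ORC table exists before wiring up ODBC:

hive -e "show tables like 'tweets_orc*'"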


  • Grant SELECT access on the ORC table to the hive user
hive -e 'grant SELECT on table tweets_orc_partition_single to user hive'
  • On a Windows VM, create an ODBC DSN called sandbox with the settings below:
	Host=<IP address of sandbox VM>
	port=10000
	database=default
	Hive Server type=Hive Server 2
	Mechanism=User Name
	UserName=hive


  • Import the data from the tweets_orc_partition_single table into Excel over ODBC: Data > From Other Sources > From Data Connection Wizard > ODBC DSN > sandbox > tweets_orc_partition_single > Finish > Yes > OK


  • Create some visualizations using PowerCharts (e.g. plotting tweets with coordinates on a map)

Other things to try: Analyze any kind of tweet - optional

  • Instead of filtering on tweets from certain stocks/hashtags, you can also consume all tweets returned by the TwitterStream API: move the securities csv aside (below) and re-run runkafkaproducer.sh. Note that in this mode a large volume of tweets is generated, so you should stop the Kafka producer after 20-30s to avoid overloading the system. It may also take a few minutes after stopping the producer before all the tweets show up in Banana/Hive.
mv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv.bak
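  • To return to filtered mode later, restore the file:

mv /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv.bak /root/hdp22-twitter-demo/fetchSecuritiesList/securities.csv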
  • To filter tweets based on geography, open the file below, uncomment the relevant line, and re-run runkafkaproducer.sh
vi /root/hdp22-twitter-demo/kafkaproducer/TestProducer.java
/root/hdp22-twitter-demo/kafkaproducer/runkafkaproducer.sh

Reset demo

  • This empties out the demo-related HDFS folders, Hive table, Solr core, and Banana webapp, and stops the Storm topology
/root/hdp22-twitter-demo/reset-demo.sh
  • If Kafka keeps sending your topology old tweets, you can also clear the Kafka queue
zookeeper-client
rmr /group1
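  • If you are unsure of the consumer group path, list the root znodes from the same zookeeper-client shell first; group1 should appear in the output:

ls /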

Run demo on cluster

  • To run on an actual cluster instead of the sandbox, there are a few things that need to be changed before compiling/starting the demo:
    • when running the Kafka/HBase shells, change the zookeeper connect strings to use localhost instead of sandbox

    • /root/hdp22-twitter-demo/solrconfig.xml: change the sandbox reference to the HDFS location of the solr user e.g. hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020/user/solr

    • /root/hdp22-twitter-demo/default.json: change sandbox to the Solr server hostname (e.g. summit-twitterdemo01.cloud.hortonworks.com)

    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/GNstorm.java: change the zookeeper host, NameNode url, and Hive metastore

      • BrokerHosts hosts = new ZkHosts("localhost:2181");
      • String fsUrl = "hdfs://summit-twitterdemo01.cloud.hortonworks.com:8020";
      • String sourceMetastoreUrl = "thrift://summit-twitterdemo02.cloud.hortonworks.com:9083";
    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/SolrBolt.java: change Solr collection url reference and zookeeper host reference

      • server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");
      • conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
    • /root/hdp22-twitter-demo/stormtwitter-mvn/src/main/java/hellostorm/TwitterRuleBolt.java: change Solr collection url reference and zookeeper host reference

      • conn = phoenixDriver.connect("jdbc:phoenix:localhost:2181:/hbase-unsecure",new Properties());
      • SolrServer server = new HttpSolrServer("http://summit-twitterdemo01.cloud.hortonworks.com:8983/solr/tweets");
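
  • A quick way to locate every hardcoded sandbox reference that needs changing (a helper suggestion, not part of the original scripts):

grep -rln "sandbox" /root/hdp22-twitter-demo --include='*.java' --include='*.xml' --include='*.json'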
