The project builds a data pipeline that streams real-time trading data from Finnhub.io over websockets. The data is processed with Spark for analysis and powers a real-time dashboard. All applications are containerized with Docker.
- **Data source layer** - Finnhub.io offers real-time trades for US stocks, forex, and crypto over websockets.
- **Data ingestion layer** - A containerized Python application named `producer` connects to the Finnhub.io websocket. It encodes retrieved messages into Avro format, as specified in the `schemas/trades.avsc` file, and ingests them into the Kafka broker.
- **Message broker layer** - Messages from `producer` are consumed by the Kafka broker running in the `broker` container.
- **Stream processing layer** - A Spark cluster consisting of one master and two worker nodes is set up. A PySpark application named `processor` is submitted to the Spark cluster manager, which assigns it to a worker. This application connects to the Kafka broker to retrieve messages, transforms them using Spark Structured Streaming, and loads them into Cassandra tables. The first query, which transforms trades into a usable format, runs continuously, whereas the second query, involving aggregations, fires on a 5-second trigger.
- **Data serving layer** - A Cassandra database stores and persists the data from the Spark jobs. Upon launch, the `cassandra-setup.cql` script runs to create the keyspace and tables.
- **Visualization layer** - Grafana connects to the Cassandra database using the HadesArchitect-Cassandra-Plugin and serves visualized data to users, as exemplified in the Finnhub Sample BTC Dashboard. The dashboard refreshes every 500 ms.
The result is a live dashboard that refreshes displayed trade prices every 500 milliseconds.
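To make the ingestion layer concrete, here is a minimal plain-Python sketch of the message flow. The subscribe/trade JSON shapes follow Finnhub's documented websocket API; the actual websocket, Avro, and Kafka calls are elided, and the sample payload values are made up:

```python
import json

def subscribe_message(ticker: str) -> str:
    """JSON frame the producer sends to subscribe to one ticker."""
    return json.dumps({"type": "subscribe", "symbol": ticker})

def extract_trades(raw_message: str) -> list:
    """Pull individual trade records out of a Finnhub websocket message.

    Trade messages look like {"type": "trade", "data": [...]}, where each
    record carries s (symbol), p (price), v (volume), t (epoch millis).
    Non-trade frames (e.g. pings) carry no data and are ignored.
    """
    message = json.loads(raw_message)
    if message.get("type") != "trade":
        return []
    # In the real producer, each record would be Avro-encoded against
    # schemas/trades.avsc and published to the Kafka topic at this point.
    return message["data"]

# Made-up example payload in Finnhub's trade-message shape.
sample = json.dumps({
    "type": "trade",
    "data": [{"s": "BINANCE:BTCUSDT", "p": 42000.5, "v": 0.01, "t": 1700000000000}],
})
print(extract_trades(sample)[0]["s"])  # BINANCE:BTCUSDT
```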
- **Obtain an API Key**: Register at Finnhub to generate your API key.
- **Configure Environment Variables**:
  - Locate the [sample.env](sample.env) file and rename it to `.env`
  - Update the `.env` file with the generated API key and any other necessary configuration:
```
FINNHUB_API_TOKEN=abc123xyz
FINNHUB_STOCKS_TICKERS=["AAPL", "AMZN", "MSFT", "GOOG", "NVDA", "META", "TSLA", "NFLX", "BINANCE:BTCUSDT", "IC MARKETS:1"]
FINNHUB_VALIDATE_TICKERS=1
KAFKA_SERVER=broker
KAFKA_PORT=29092
KAFKA_TOPIC_NAME=market
CASSANDRA_HOST=cassandra
CASSANDRA_PORT=9042
CASSANDRA_USERNAME=cassandra
CASSANDRA_PASSWORD=cassandra
```
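As an illustrative sketch (not the project's actual config loader), this is how such values might be consumed once Docker Compose exports them into the container environment; the stand-in values below mirror the sample above:

```python
import json
import os

# Stand-in values, as if loaded from .env into the process environment.
os.environ["FINNHUB_STOCKS_TICKERS"] = '["AAPL", "AMZN", "BINANCE:BTCUSDT"]'
os.environ["KAFKA_SERVER"] = "broker"
os.environ["KAFKA_PORT"] = "29092"

# FINNHUB_STOCKS_TICKERS is written as a JSON array, so it parses directly.
tickers = json.loads(os.environ["FINNHUB_STOCKS_TICKERS"])
bootstrap = f'{os.environ["KAFKA_SERVER"]}:{os.environ["KAFKA_PORT"]}'

print(tickers[0], bootstrap)  # AAPL broker:29092
```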
- **Start the Services**: Launch the services

```
make run
```
- **View Producer Logs**: Check the logs of the `producer` container to monitor activity.
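One way to follow those logs, assuming the container keeps the `producer` name used in the architecture overview:

```shell
docker logs -f producer
```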
- **Execute the Spark Job**: Submit the Spark job

```
make spark-job
```
- **Access the Visualization**: Open your browser and go to `localhost:3003` to view the visualization. Enter `admin` for both the username and the password.
Shutdown Containers: Stop all running containers
make down