
ngods stock market demo

This repository contains a stock market analysis demo of the ngods data stack. The demo performs the following steps:

  1. Download selected stock symbols data from Yahoo Finance API.
  2. Store the stock data in ngods data warehouse (using Iceberg format).
  3. Transform the data (e.g. normalize stock prices) using dbt.
  4. Expose analytics data model using cube.dev.
  5. Visualize data as reports and dashboards using Metabase.
  6. Predict stock prices using ARIMA in Apache Spark.

The demo is packaged as a docker-compose script that downloads, installs, and runs all components of the data stack.
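For a sense of what step 1 involves, here is a minimal Python sketch using the yfinance package; the demo's actual download logic lives in its Dagster project, so treat this as an illustrative stand-in rather than the code it ships:

import yfinance as yf

# Download daily OHLCV data for a couple of symbols (symbols and date
# range are illustrative).
df = yf.download(["AAPL", "MSFT"], start="2020-01-01", end="2022-06-30")

# Persist as CSV so it could be staged for the bronze Iceberg tables.
df.to_csv("stocks.csv")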

UPDATES

  • 2023-02-03:
    • Upgrade to Apache Iceberg 1.1.0
    • Upgrade to Trino 406
    • Migrated to the new JDBC catalog (removed the heavyweight Hive Metastore)

ngods

ngods stands for New Generation Opensource Data Stack. It includes the following components:

ngods components

ngods is open-sourced under the BSD license and is distributed as a docker-compose script that supports Intel and ARM architectures.

Running the demo

ngods requires a machine with at least 16GB RAM and an Intel or ARM64 CPU running Docker, with docker-compose installed.

  1. Clone the ngods repo:

git clone https://github.com/zsvoboda/ngods-stocks.git

  2. Start the data stack with the docker-compose up command:

cd ngods-stocks

docker-compose up -d

NOTE: This can take quite a while depending on your network speed.

  3. Stop the data stack with the docker-compose down command:

docker-compose down

  4. Execute the data pipeline from the Dagster console at http://localhost:3070/ with this yaml config file.

Dagster e2e

Cut and paste the content of the e2e.yaml file into this Dagster UI console page and start the data pipeline by clicking the Launch Run button.

NOTE: You can customize the list of stock symbols that will be downloaded.
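If you prefer to see the shape of that configuration, the sketch below shows a hypothetical run config as a Python dict; the real op and key names are defined in e2e.yaml and the Dagster project, so the identifiers here are placeholders only:

# Hypothetical Dagster run config -- the op name and config key are
# invented for illustration; check e2e.yaml for the actual schema.
run_config = {
    "ops": {
        "download_stock_data": {
            "config": {"symbols": ["AAPL", "MSFT", "GOOG", "IBM"]},
        }
    }
}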

  5. Review and customize the cube.dev metrics and dimensions. Test these metrics in the cube.dev playground.

cube.dev playground

See the cube.dev documentation for more information.

  6. Check out the Metabase data visualizations connected to the cube.dev analytical model. You can also run SQL queries on top of the cube.dev schema.

Use username [email protected] and password metabase1.

Metabase

You can create your own data visualizations and dashboards. See the Metabase documentation for more information.

  7. Predict the stock close price. Run the ARIMA time-series prediction model notebook, which is trained on 29 months of Apple (AAPL) stock data and predicts the next month.

Jupyter ARIMA
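A minimal sketch of this kind of forecast, assuming pandas and statsmodels (the notebook itself runs in the Spark-backed Jupyter environment, and its column names and ARIMA order may differ):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assume a daily close-price series indexed by date; the file and column
# names are illustrative.
prices = pd.read_csv("aapl.csv", index_col="Date", parse_dates=True)

# Fit an ARIMA(p, d, q) model; this order is a common starting point,
# not necessarily what the demo notebook uses.
fitted = ARIMA(prices["Close"], order=(5, 1, 0)).fit()

# Forecast roughly one month of trading days ahead.
print(fitted.forecast(steps=21))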

  8. Download the DBeaver SQL tool.

  9. Connect to the Postgres database that contains the gold stage data. Use the jdbc:postgresql://localhost:5432/ngods JDBC URL with username ngods and password ngods.

Postgres JDBC connection
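The same connection can be checked from Python, assuming the psycopg2 package is installed; the endpoint and credentials are the ones listed above:

import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="ngods",
    user="ngods", password="ngods",
)
with conn.cursor() as cur:
    # List the available schemas; you should see the gold schema here.
    cur.execute("SELECT schema_name FROM information_schema.schemata")
    print(cur.fetchall())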

  10. Connect to the Trino database that has access to all data stages (the bronze, silver, and gold schemas of the warehouse database). Use the jdbc:trino://localhost:8060 JDBC URL with username trino and password trino.

Trino JDBC connection

Trino schemas
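From Python, the equivalent connection can be made with the trino client package (pip install trino); the catalog and schema names below follow the warehouse.bronze naming used in this README:

import trino

conn = trino.dbapi.connect(
    host="localhost", port=8060, user="trino",
    catalog="warehouse", schema="bronze",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())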

  11. Connect to the Spark database that is used for data transformations. Use the jdbc:hive2://localhost:10009 JDBC URL with no username or password.

Spark JDBC connection
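Because this endpoint speaks the HiveServer2 protocol (hence the jdbc:hive2 URL), a Python client such as PyHive can connect to it as well; this is a connectivity sketch, not part of the demo itself:

from pyhive import hive

# No username or password, per the README.
conn = hive.Connection(host="localhost", port=10009)
cur = conn.cursor()
cur.execute("SHOW SCHEMAS")
print(cur.fetchall())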

Customizing the demo

This section contains useful information for customizing the demo.

ngods directories

Here are a few of the distribution's directories that you may need to customize:

  • conf configuration of all data stack components
    • cube cube.dev schema (semantic model definition)
  • data main data directory
    • minio root data directory (contains buckets and file data)
    • spark Jupyter notebooks
    • stage file stage data. Spark can access this directory via the /var/lib/ngods/stage path (see the pyspark sketch after this list).
  • projects dbt, Dagster, and DataHub projects
    • dagster Dagster orchestration project
    • dbt dbt transformations (one project per medallion stage: bronze, silver, and gold)
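As mentioned above, Spark sees the file stage under /var/lib/ngods/stage. Here is a minimal pyspark sketch of reading a staged file (the file name is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ngods-stage").getOrCreate()

# Read a CSV file dropped into ./data/stage on the host.
df = spark.read.csv("/var/lib/ngods/stage/stocks.csv",
                    header=True, inferSchema=True)
df.show(5)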

ngods endpoints

The data stack has the following endpoints:

ngods databases: Spark, Trino, and Postgres

The ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to the Iceberg tables in the warehouse.bronze and warehouse.silver schemas. The Trino engine can also access the analytics.gold schema in Postgres, so Trino can federate queries between the Postgres and Iceberg tables.
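For example, a federated query can join an Iceberg table with a Postgres table in a single statement. In the sketch below the table and column names are hypothetical, while the catalog and schema names follow this README:

import trino

conn = trino.dbapi.connect(host="localhost", port=8060, user="trino")
cur = conn.cursor()
cur.execute("""
    SELECT s.symbol, s.close, g.sector
    FROM warehouse.silver.stock_prices s      -- Iceberg (hypothetical table)
    JOIN analytics.gold.symbol_metadata g     -- Postgres (hypothetical table)
      ON s.symbol = g.symbol
""")
print(cur.fetchmany(5))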

The Spark engine is configured for ELT and pyspark data transformations.

Spark
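A minimal pyspark ELT sketch in the spirit of the bronze-stage load, assuming the session is already configured with the Iceberg catalog (as the demo's Spark containers are); the target table name is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ngods-elt").getOrCreate()

# Read the staged CSV and land it in a bronze Iceberg table.
raw = spark.read.csv("/var/lib/ngods/stage/stocks.csv",
                     header=True, inferSchema=True)
raw.writeTo("warehouse.bronze.stocks").createOrReplace()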

The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be configured as needed.

Trino

The Postgres database has access only to the analytics.gold schema and is used for executing analytical queries over the gold data.

Demo data pipeline

The demo data pipeline utilizes the medallion architecture with bronze, silver, and gold data stages

data pipeline

and consists of the following phases:

  1. Data are downloaded from the Yahoo Finance REST API to the local Minio bucket (./data/stage) using this Dagster operation.
  2. The downloaded CSV file is loaded into the bronze stage Iceberg tables (warehouse.bronze Spark schema) using dbt models that are executed in Spark (./projects/dbt/bronze).
  3. Silver stage Iceberg tables (warehouse.silver Spark schema) are created using dbt models that are executed in Spark (./projects/dbt/silver).
  4. Gold stage Postgres tables (analytics.gold Trino schema) are created using dbt models that are executed in Trino (./projects/dbt/gold).

DBT models

All data pipeline phases are orchestrated by the Dagster framework. Dagster operations, resources, and jobs are defined in the Dagster project.

Dagster console

The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using this yaml config file.
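To show how such a pipeline hangs together, here is a hedged structural sketch of a Dagster job; the real op names, resources, and dbt invocations live in ./projects/dagster, so everything below is a placeholder for the structure rather than the demo's actual code:

from dagster import job, op

@op
def download_stocks() -> str:
    # Phase 1: pull CSVs from Yahoo Finance into the stage directory.
    return "/var/lib/ngods/stage/stocks.csv"

@op
def run_dbt_stages(path: str) -> None:
    # Phases 2-4: trigger the bronze/silver/gold dbt models.
    ...

@job
def e2e():
    run_dbt_stages(download_stocks())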

ngods analytics layer

ngods includes cube.dev for the semantic data model and Metabase for self-service analytics (dashboards, reports, and visualizations).

Analytics

The analytical (semantic) model is defined in cube.dev and is used for executing analytical queries over the gold data.

cube.dev

Metabase is connected to cube.dev via its SQL API. End users can use it for the self-service creation of dashboards, reports, and data visualizations. Metabase is also directly connected to the gold schema in the Postgres database.

Metabase
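Any Postgres-protocol client can talk to the same SQL API that Metabase uses. The port and cube/cube credentials below come from a user report in the issues section of this page, so treat them as an assumption about this demo's configuration:

import psycopg2

# Port and credentials are an assumption (see the cube-via-JDBC issue below).
conn = psycopg2.connect(
    host="localhost", port=3245, dbname="cube",
    user="cube", password="cube",
)
with conn.cursor() as cur:
    # Relation name is illustrative; check the cube.dev playground for the
    # cubes actually exposed.
    cur.execute('SELECT * FROM "stocks" LIMIT 5')
    print(cur.fetchall())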

ngods machine learning

Jupyter Notebooks with Scala, Java, and Python backends can be used for machine learning.

Jupyter

Support

Create a GitHub issue if you have any questions.

ngods-stocks's People

Contributors

mspronk, zsvoboda


ngods-stocks's Issues

exec /usr/bin/entrypoint.sh: no such file or directory

Hi,

It might be rookie me, but I can't seem to get rid of the error:

exec /usr/bin/entrypoint.sh: no such file or directory

coming from the aio container. I am running on Windows; I don't know if this is what causes an error in the mapping, or if it is a general issue.

requirements.txt - Python dependency resolution stops with an error

366.0 WARNING: jsonschema 4.5.1 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.5.0 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.4.0 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.3.3 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.3.2 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.3.1 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.3.0 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.2.1 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.2.0 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.1.2 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.1.1 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.1.0 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 4.0.1 does not provide the extra 'format-nongpl'
366.0 WARNING: jsonschema 3.2.0 does not provide the extra 'format-nongpl'
366.4 INFO: pip is looking at multiple versions of pickleshare to determine which version is compatible with other requirements. This could take a while.
366.4 Downloading pickleshare-0.4.tar.gz (11 kB)
366.4 Preparing metadata (setup.py): started
366.6 Preparing metadata (setup.py): finished with status 'error'
366.6 error: subprocess-exited-with-error
366.6
366.6 × python setup.py egg_info did not run successfully.
366.6 │ exit code: 1
366.6 ╰─> [11 lines of output]
366.6       Traceback (most recent call last):
366.6         File "<string>", line 2, in <module>
366.6         File "<pip-setuptools-caller>", line 34, in <module>
366.6         File "/tmp/pip-install-40xfnxrc/pickleshare_118f0a7d8a57465088b38d568a6c39b0/setup.py", line 3, in <module>
366.6           import pickleshare
366.6         File "/tmp/pip-install-40xfnxrc/pickleshare_118f0a7d8a57465088b38d568a6c39b0/pickleshare.py", line 41, in <module>
366.6           from path import path as Path
366.6         File "/tmp/pip-install-40xfnxrc/pickleshare_118f0a7d8a57465088b38d568a6c39b0/path.py", line 724
366.6           def mkdir(self, mode=0777):
366.6                                   ^
366.6       SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
366.6       [end of output]
366.6
366.6 note: This error originates from a subprocess, and is likely not a problem with pip.
366.6 error: metadata-generation-failed
366.6
366.6 × Encountered error while generating package metadata.
366.6 ╰─> See above for output.
366.6
366.6 note: This is an issue with the package mentioned above, not pip.
366.6 hint: See above for details.
366.7
366.7 [notice] A new release of pip is available: 23.0.1 -> 23.2.1
366.7 [notice] To update, run: pip install --upgrade pip

failed to solve: process "/bin/sh -c pip3 install --no-cache-dir -r requirements.txt && rm requirements.txt" did not complete successfully: exit code: 1

Custom CSV

Hi! I was trying to add a custom CSV file, but I wasn't able to do so.

Is there a way to add custom ones using the base you have?

Thank you so much in advance; I'll appreciate anything you could tell me!

Trino Error

First, awesome example.

Second, hopefully this is a silly issue which goes away easily.

Whenever I try to start the Trino container (e.g. docker-compose up), it immediately shuts down with the following error:
'ERROR: Trino requires at least 4096 file descriptors (found 1024)'

Have you dealt with this before?

Error starting containers

Cloned the repo, and while building:

#0 173.5 INFO: pip is looking at multiple versions of pickleshare to determine which version is compatible with other requirements. This could take a while.
#0 174.0   Downloading pickleshare-0.4.tar.gz (11 kB)
#0 174.1   Preparing metadata (setup.py): started
#0 174.2   Preparing metadata (setup.py): finished with status 'error'
#0 174.2   error: subprocess-exited-with-error
#0 174.2   
#0 174.2   × python setup.py egg_info did not run successfully.
#0 174.2   │ exit code: 1
#0 174.2   ╰─> [11 lines of output]
#0 174.2       Traceback (most recent call last):
#0 174.2         File "<string>", line 2, in <module>
#0 174.2         File "<pip-setuptools-caller>", line 34, in <module>
#0 174.2         File "/tmp/pip-install-9s9g_51u/pickleshare_7951e09ae637468caa6efddd2108a27f/setup.py", line 3, in <module>
#0 174.2           import pickleshare
#0 174.2         File "/tmp/pip-install-9s9g_51u/pickleshare_7951e09ae637468caa6efddd2108a27f/pickleshare.py", line 41, in <module>
#0 174.2           from path import path as Path
#0 174.2         File "/tmp/pip-install-9s9g_51u/pickleshare_7951e09ae637468caa6efddd2108a27f/path.py", line 724
#0 174.2           def mkdir(self, mode=0777):
#0 174.2                                   ^
#0 174.2       SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
#0 174.2       [end of output]

open source questions

Hi,

is there an alternative for cube.dev?

Also, what would you recommend in terms of upgrades if this ever needs to work with real-time use cases?

Appreciate this, btw! For reference, I'm trying to build learning material for data engineers, and an open-source modern data stack is what's lacking!
Myk

Kyuubi-1.6.0 not supported

./aio/Dockerfile on line 89: Kyuubi 1.6.0 should be changed to Kyuubi 1.6.1 since the old version is not downloadable anymore.

Datahub won't start in the docker compose setup

Thank you very much for your hard work!
I know it is challenging to tame all the different software packages.

The standard docker-compose.yml works like a charm, but with the .x86 one I have the problem that I cannot get the datahub-gms container, as well as elasticsearch, to stay in a healthy state.

I use an 8 vCPU, 64 GiB machine on DigitalOcean to tinker with the setup for myself.
Is there maybe a trick to get Datahub working? Next to Iceberg, this is the thing I would really like to test :)

Thank you very much!

Cannot run SQL in cube via JDBC

Hi, this project works fine, but when I try to connect to cube via DBeaver (with the jdbc:postgresql://localhost:3245/cube JDBC URL, username cube / password cube), the connection succeeds, but we can only see the database cube and the schema public. No tables or views can be used.

Metabase stuck in a loop

Hello!

I'm trying to start the system via docker compose up. All pieces come up nicely, except for Metabase, which seems stuck in a loop with this error message:

metabase  | 2023-04-19 07:14:05,812 INFO db.update-h2 :: H2 v1 database detected, updating...
metabase  | 2023-04-19 07:14:05,812 INFO db.update-h2 :: Creating v1 database backup at /tmp/metabase-migrate-h2-db-v1-v2.sql
metabase  | 2023-04-19 07:14:06,340 INFO db.update-h2 :: Moving old app database to /conf/metabase.db.v1-backup.mv.db
metabase  | 2023-04-19 07:14:06,341 ERROR db.update-h2 :: Failed to update H2 database: #error {
metabase  |  :cause /conf/metabase.db.mv.db -> /conf/metabase.db.v1-backup.mv.db
metabase  |  :via
metabase  |  [{:type java.nio.file.AccessDeniedException
metabase  |    :message /conf/metabase.db.mv.db -> /conf/metabase.db.v1-backup.mv.db
metabase  |    :at [sun.nio.fs.UnixException translateToIOException UnixException.java 90]}]
metabase  |  :trace
metabase  |  [[sun.nio.fs.UnixException translateToIOException UnixException.java 90]
metabase  |   [sun.nio.fs.UnixException rethrowAsIOException UnixException.java 106]
metabase  |   [sun.nio.fs.UnixCopyFile move UnixCopyFile.java 481]
metabase  |   [sun.nio.fs.UnixFileSystemProvider move UnixFileSystemProvider.java 266]
metabase  |   [java.nio.file.Files move Files.java 1430]
metabase  |   [metabase.db.update_h2$update_BANG_ invokeStatic update_h2.clj 81]
metabase  |   [metabase.db.update_h2$update_BANG_ invoke update_h2.clj 68]
metabase  |   [metabase.db.update_h2$update_if_needed invokeStatic update_h2.clj 98]
metabase  |   [metabase.db.update_h2$update_if_needed invoke update_h2.clj 90]
metabase  |   [metabase.db.data_source.DataSource getConnection data_source.clj 29]
metabase  |   [com.mchange.v2.c3p0.WrapperConnectionPoolDataSource getPooledConnection WrapperConnectionPoolDataSource.java 161]
metabase  |   [com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager acquireResource C3P0PooledConnectionPool.java 213]
metabase  |   [com.mchange.v2.resourcepool.BasicResourcePool doAcquire BasicResourcePool.java 1176]
metabase  |   [com.mchange.v2.resourcepool.BasicResourcePool doAcquireAndDecrementPendingAcquiresWithinLockOnSuccess BasicResourcePool.java 1163]
metabase  |   [com.mchange.v2.resourcepool.BasicResourcePool access$700 BasicResourcePool.java 44]
metabase  |   [com.mchange.v2.resourcepool.BasicResourcePool$ScatteredAcquireTask run BasicResourcePool.java 1908]
metabase  |   [com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread run ThreadPoolAsynchronousRunner.java 696]]}

I assume it's a matter of permissions, because the H2 DB file resides on the local drive, although I tried chmod 777 on data and it's the same. Could you please share the version of the Metabase container that you used as the base image?

Thank you!

new

My name is Luis. I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators):
all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DEMARK, Japanese candlesticks, Ichimoku, Fibonacci, WilliamsR, balance of power, Murrey math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy signal and sell signal; every stock is well traded in time and direction).
For this I have used big-data tools like pandas, stock market libraries such as tablib, TAcharts, and pandas_ta for data collection and calculation,
and powerful machine-learning libraries such as Sklearn.RandomForest, Sklearn.GradientBoosting, XGBoost, Google TensorFlow, and Google TensorFlow LSTM.

With the models trained on the selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or Mail. The points are calculated based on learning the correct trading points of the last 2 years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you, and if you are interested in improving and collaborating I am also willing; if not, just file it away.

If you want, please read the readme, and in case of any problem you can contact me.
If you are convinced, try to install it with the documentation:
https://github.com/Leci37/LecTrade/tree/develop I appreciate the feedback
