Create the database
docker-compose up
Start the spider by navigating to imdb_initial_etl/scraper and running
scrapy crawl imdb_spider
The crawler will take some time to finish (no proxies are configured). Afterwards, move the resulting .csv.gz file to ./transformations and run cleaning.py
python3 cleaning.py
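The manual steps above could eventually be chained together in one script. Below is a minimal sketch; the directory layout, the `-d` flag on docker-compose, and the *.csv.gz glob are assumptions based on the steps above, not the project's actual automation:

```python
import shutil
import subprocess
from pathlib import Path

SCRAPER_DIR = Path("imdb_initial_etl/scraper")   # assumed project layout
TRANSFORM_DIR = Path("transformations")          # assumed project layout

def run_pipeline(execute=False):
    """Build (and optionally run) the scrape-then-clean pipeline.

    With execute=False, only returns the planned (command, cwd) steps,
    which is handy for inspection or dry runs.
    """
    steps = [
        (["docker-compose", "up", "-d"], Path(".")),        # create the database
        (["scrapy", "crawl", "imdb_spider"], SCRAPER_DIR),  # run the spider
        (["python3", "cleaning.py"], TRANSFORM_DIR),        # clean the output
    ]
    if execute:
        for cmd, cwd in steps[:2]:
            subprocess.run(cmd, cwd=cwd, check=True)
        # move the scraper output into ./transformations before cleaning
        for dump in SCRAPER_DIR.glob("*.csv.gz"):
            shutil.move(str(dump), TRANSFORM_DIR / dump.name)
        subprocess.run(steps[2][0], cwd=steps[2][1], check=True)
    return steps
```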
General
- Better star schema adherence:
  1a. Separate plot and title out so movie_info becomes an imdb_fact table
  1b. Separate basegenre out of rankings into its own dimension table
- Bash script to run the scraper and move the output file to the cleaner
- Schedule a cron job to run this at a regular interval
- Migrate to AWS
- Introduce a workflow manager such as Airflow
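Item 1b (pulling basegenre out into a dimension table) could look like the following sketch. The column names (basegenre, genre_id) and the row format are assumptions about the rankings data:

```python
def split_genre_dimension(rankings):
    """Normalize rankings rows: move basegenre into a genre dimension
    table and replace it in each fact row with a foreign key (genre_id)."""
    genre_dim = {}   # basegenre -> genre_id
    fact_rows = []
    for row in rankings:
        genre = row["basegenre"]
        # assign a new surrogate key the first time a genre is seen
        genre_id = genre_dim.setdefault(genre, len(genre_dim) + 1)
        fact = {k: v for k, v in row.items() if k != "basegenre"}
        fact["genre_id"] = genre_id
        fact_rows.append(fact)
    dim_rows = [{"genre_id": gid, "basegenre": g} for g, gid in genre_dim.items()]
    return fact_rows, dim_rows
```

The same pattern would apply to splitting plot and title out of movie_info.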
Scraper
- Fix list output
- Add S3 bucket file dumping
- More informative logging and dump it to a database
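If "fix list output" refers to list-valued item fields ending up in the CSV as Python list literals, one option is a small Scrapy item pipeline that flattens them before export. The delimiter and the assumption that items behave like dicts are illustrative, not taken from the project:

```python
class FlattenListsPipeline:
    """Join list-valued item fields into pipe-delimited strings so the
    exported CSV contains scalars instead of list literals."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, list):
                item[key] = "|".join(str(v).strip() for v in value)
        return item
```

The pipeline would be enabled via ITEM_PIPELINES in the scraper's settings.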
Cleaner
- Handle foreign companies in company name extraction
- Optimize for speed
- Add unit tests and more rigorous validation
- More informative logging and dump to a database
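For the company-name item above, one approach is to split off an IMDb-style trailing country tag (e.g. "[jp]") while leaving non-ASCII names intact. The raw input format here is an assumption about what the scraper captures:

```python
import re

# matches a trailing two-letter country tag such as " [us]" or " [jp]"
COUNTRY_TAG = re.compile(r"\s*\[([a-z]{2})\]\s*$")

def extract_company(raw):
    """Split a raw company credit like 'Studio Ghibli [jp]' into
    (name, country_code); country_code is None when no tag is present."""
    match = COUNTRY_TAG.search(raw)
    if match:
        return raw[:match.start()].strip(), match.group(1)
    return raw.strip(), None
```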