This is a screenplay parser that extracts dialogues between characters. However, it only extracts a dialogue if the second character has a parenthetical. The scripts are crawled from http://www.imsdb.com/.
-
Run Scrapy: go to the brickset-scraper folder and run this in your terminal:
scrapy runspider scraper.py --output=names_links.json
This generates the file "names_links.json" in the same folder.
-
Run "json_parser.py" from the terminal with "python json_parser.py names_links.json". It reads "names_links.json" and creates "all_name_script.txt", which contains a movie name and a link to its script for each movie in the JSON file. Note that each script takes 1-2 seconds to process.
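As a rough illustration of this step, the sketch below turns a JSON list of records into a text file with one movie name and link per line. The field names `name` and `link`, the tab separator, and the helper `write_name_links` are assumptions made for illustration; the actual "json_parser.py" may use a different schema and output layout.

```python
import json


def write_name_links(json_path, txt_path):
    """Write one "name<TAB>link" line per record; return the number written."""
    with open(json_path, encoding="utf-8") as f:
        # Assumed layout: a list of {"name": ..., "link": ...} objects.
        records = json.load(f)

    count = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        for record in records:
            name = record.get("name")
            link = record.get("link")
            if not name or not link:
                continue  # skip records that are missing a name or a link
            out.write(f"{name.strip()}\t{link.strip()}\n")
            count += 1
    return count
```

For example, `write_name_links("names_links.json", "all_name_script.txt")` would write the text file and return how many movies it contains.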
-
Run "html_list_parser.py". It reads "all_name_script.txt" and generates "all_dialogues.txt", which contains all the relevant dialogues from the movie scripts.
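To give an idea of what a rule-based extraction could look like, here is a minimal sketch that keeps an exchange between two different characters only when the second speaker's block has a parenthetical. It is not the code in "html_list_parser.py". It assumes that character cues are all-caps lines, that a parenthetical sits on its own line right after the cue, and that a blank line ends a speech. The names `parse_blocks` and `extract_exchanges` are hypothetical, and the sketch does not check whether action lines separate two speeches.

```python
import re

CUE = re.compile(r"^[A-Z][A-Z .'\-]*$")  # assumed: cue lines are all caps
PAREN = re.compile(r"^\(.*\)$")          # e.g. "(whispering)"


def parse_blocks(script_text):
    """Split a script into (character, parenthetical, speech) tuples."""
    blocks = []
    lines = [line.strip() for line in script_text.splitlines()]
    i = 0
    while i < len(lines):
        if not CUE.match(lines[i]):
            i += 1
            continue
        name = lines[i]
        i += 1
        paren = None
        if i < len(lines) and PAREN.match(lines[i]):
            paren = lines[i][1:-1]  # drop the surrounding brackets
            i += 1
        speech = []
        # Collect speech lines until a blank line or the next cue.
        while i < len(lines) and lines[i] and not CUE.match(lines[i]):
            speech.append(lines[i])
            i += 1
        if speech:  # all-caps lines with no speech (scene headings) are dropped
            blocks.append((name, paren, " ".join(speech)))
    return blocks


def extract_exchanges(script_text):
    """Return (first, second) block pairs where second has a parenthetical."""
    blocks = parse_blocks(script_text)
    pairs = []
    for first, second in zip(blocks, blocks[1:]):
        if first[0] != second[0] and second[1] is not None:
            pairs.append((first, second))
    return pairs
```

Real scripts are messier than this (all-caps words inside speeches, cues with "(V.O.)" suffixes, inconsistent indentation), so the patterns above would need tuning against actual IMSDb pages.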
You need to have:
- BeautifulSoup
- Scrapy
- Python 3 or above
- Jupyter Notebook
Kamil Veli Toraman: kvtoraman
There is no licence for now. You can use it as you please. This code takes a rule-based approach to parsing movie scripts. If you have a better way, please let me know :)
- This is the result of a 2-month internship at the Data Science Lab, KAIST.