
text_analysis_technobabble's Introduction

Text_Analysis_Technobabble

NLP (Natural Language Processing) Using Star Trek scripts as training data.

Using the website http://chakoteya.net/StarTrek/index.html, which contains formatted scripts from all five Star Trek series, this program downloads the webpages into text files, sanitizes and preprocesses those scripts to extract character names and dialogue, and models the dialogue of the top 100 characters (as ranked by lines spoken) as one word cloud per speaker. Word clouds graphically represent the most frequently spoken words in both size and colour, with larger font sizes indicating higher frequencies and darker colours representing denser allocations of words within the text.

Python file descriptions

htm-process.py

  • Run this script first to scrape the website for scripts.
  • Uses list comprehensions to generate URLs for each series from episode number ranges.
  • Uses BeautifulSoup to get the contents of each webpage and writes them to plain text files.
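
The scraping steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of htm-process.py: the series folder names, the episode number ranges, and the helper names (`build_urls`, `page_to_text`, `download_script`) are all assumptions made for the example.

```python
from urllib.request import urlopen

BASE_URL = "http://chakoteya.net"

# Hypothetical layout: series folder -> range of episode page numbers.
# The real folder names and ranges must be checked against the website.
SERIES_EPISODES = {
    "StarTrek": range(1, 4),
    "NextGen": range(101, 104),
}


def build_urls(series_episodes=SERIES_EPISODES):
    """Return one URL per episode page, built with a list comprehension."""
    return [
        f"{BASE_URL}/{series}/{number}.htm"
        for series, numbers in series_episodes.items()
        for number in numbers
    ]


def page_to_text(html):
    """Strip the markup from an HTML page and return its text."""
    # Imported here so build_urls() works without the third-party package.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    return BeautifulSoup(html, "html.parser").get_text()


def download_script(url, out_path):
    """Fetch one page and write its text content to a plain text file."""
    with urlopen(url) as response:
        html = response.read()
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(page_to_text(html))
```

Keeping URL generation separate from downloading makes it possible to inspect the list of URLs before any network request is made.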

ProcessAllScripts-2.py

  • Extracts lines of dialogue from full script, ignoring erroneous text
  • Concatenates multiline character dialogue into single lines
  • Saves each character's spoken lines into a dictionary of the form {character_name: lines_of_dialogue}.
  • Generates the content of data_char_lines: one folder per series, containing one text file of dialogue per character.
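
A rough sketch of the extraction described above is shown below. It assumes that a line of dialogue starts with an upper-case speaker name followed by a colon (optionally with a bracketed note such as `[OC]`), and that stage directions start with `(` or `[`. The regular expression and the function names are illustrative assumptions, not the logic of ProcessAllScripts-2.py.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed format: "NAME: text" or "NAME [note]: text", name in upper case.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z0-9 '.\-]*?)(?:\s*\[[^\]]*\])?:\s*(.*)$")


def extract_dialogue(script_text):
    """Return {character_name: [lines_of_dialogue]} for one script."""
    lines_by_character = defaultdict(list)
    speaker = None
    for raw in script_text.splitlines():
        line = raw.strip()
        match = SPEAKER_RE.match(line)
        if match:
            # A new speaker line starts a new entry.
            speaker = match.group(1).strip()
            lines_by_character[speaker].append(match.group(2))
        elif not line or line.startswith(("(", "[")):
            # Blank lines and stage directions end the current speech.
            speaker = None
        elif speaker is not None:
            # Continuation: join multiline dialogue into a single line.
            joined = lines_by_character[speaker][-1] + " " + line
            lines_by_character[speaker][-1] = joined.strip()
    return dict(lines_by_character)


def save_character_files(lines_by_character, series_dir):
    """Write one text file per character inside the series folder."""
    out = Path(series_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, lines in lines_by_character.items():
        (out / f"{name}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Text that does not follow a speaker line (titles, credits, scene headings) is dropped, which is one simple way to ignore erroneous text.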

FilterFiles.py

  • Uses file size to calculate a cutoff point for character files to keep
  • Natural cutoff occurs at top ~100 characters
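
One way to picture the size-based filter is to rank the character files by size and read the cutoff from the smallest file that is kept. The sketch below makes that assumption explicit; FilterFiles.py may calculate its cutoff differently, and the function names and the `keep` parameter are hypothetical.

```python
from pathlib import Path


def largest_files(folder, keep=100):
    """Return the `keep` largest .txt files under `folder`, biggest first."""
    files = [p for p in Path(folder).rglob("*.txt") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    return files[:keep]


def size_cutoff(folder, keep=100):
    """Return the size in bytes of the smallest file that is still kept."""
    kept = largest_files(folder, keep)
    return kept[-1].stat().st_size if kept else 0
```

File size works as a cheap proxy for lines spoken because each file holds only one character's dialogue.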

processWords.py

  • Uses files in the resources/ folder to create a stopwords list (words to be removed from analysis) from the nltk standard package and a personal, curated list of stopwords collected through NLP projects in school.
  • Removes stopwords and punctuation from the words dictionary.
  • Creates a word frequency dictionary per character.
  • Generates a word cloud image of the top words for each file listed in char_lines_top_100.
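
The word-processing steps above could look roughly like this. The stopword set here is a tiny placeholder standing in for the nltk list and the curated list in resources/, and the function names are assumptions for the example rather than the code in processWords.py.

```python
import string
from collections import Counter

# Placeholder only: the project combines nltk's stopwords with a curated list.
STOPWORDS = {"the", "a", "to", "is", "and", "of", "it"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count words after lower-casing, stripping punctuation and stopwords."""
    table = str.maketrans("", "", string.punctuation)
    words = (word.translate(table) for word in text.lower().split())
    return Counter(word for word in words if word and word not in stopwords)


def save_word_cloud(frequencies, out_path):
    """Render a frequency dictionary as a word cloud image file."""
    # Imported here so word_frequencies() works without the third-party package.
    from wordcloud import WordCloud  # pip install wordcloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies).to_file(out_path)
```

Passing a precomputed frequency dictionary to the word cloud keeps the stopword handling under the project's control instead of relying on the library's defaults.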

text_analysis_technobabble's People

Contributors

k10forthewin, lucifer-linux, techno156


text_analysis_technobabble's Issues

Sisk vs Sisco

Found in data_char_lines and carried over to data_char_lines_100. The issue is likely in the data -> data_char_lines translation.
