###COURSE### DATASCI 350 DS: Methods for Data Analysis
###AUTHOR### Iswarya Murali
###PROJECT OBJECTIVE### Text Analysis of the Harry Potter books.
###DATA SOURCES###
- .\Books*.txt
- .\Lexicon\positive-words.txt and .\Lexicon\negative-words.txt (taken from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon)
- http://harrypotter.answers.wikia.com/wiki/Top_200_most_named_harry_potter_characters_s
- https://en.wikipedia.org/wiki/List_of_spells_in_Harry_Potter
###DATA PREPARATION AND CLEANING###
- Read the books in text file format
- Normalization: Remove non-ASCII and non-alphanumeric characters, quotes, and other extraneous punctuation. Periods are kept.
- Tokenization: Replace alternate nicknames for characters with a single name per character.
- Remove stop words
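
The cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the code in hp_helper.R: the function name `normalize_text`, the nickname map, and the stop word list are all hypothetical, and aliases are assumed to contain no regex metacharacters.

```r
# Hypothetical helper illustrating the cleaning steps; not taken from hp_helper.R.
normalize_text <- function(text, nicknames, stop_words) {
  # Replace any non-ASCII bytes with a space
  text <- iconv(text, from = "UTF-8", to = "ASCII", sub = " ")
  # Keep only letters, digits, periods, and spaces
  text <- gsub("[^A-Za-z0-9. ]", " ", text)
  # Map each alternate name (vector names) to the chosen name (vector values)
  for (alias in names(nicknames)) {
    text <- gsub(paste0("\\b", alias, "\\b"), nicknames[[alias]], text, perl = TRUE)
  }
  # Split on whitespace and drop empty tokens
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Drop stop words, ignoring case and any trailing period
  tokens[!(tolower(gsub("\\.", "", tokens)) %in% stop_words)]
}

# Toy input; the nickname map and stop words are made up for illustration
normalize_text("\"Ronald,\" said the boy. It was late!",
               nicknames = c(Ronald = "Ron"),
               stop_words = c("the", "it", "was"))
```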
###HYPOTHESIS TESTING AND ANALYSIS### For each book:
- Social Network Analysis:
  - Generate a graph of the top 25 characters, per book. If two characters occur within a set threshold (15 non-stop-words) of each other, count one edge between them.
  - Analyze the graph object:
    - Node with the highest degree centrality, eigenvector centrality, and betweenness centrality (https://en.wikipedia.org/wiki/Centrality)
    - Mean degree
    - Pair of nodes with the highest weighted edge (strongest relationship)
  - Hypothesis Testing: Do the edge weights of the graph follow a power-law distribution?
- House Popularity:
  - Plot mentions of the 4 houses, per book
  - Hypothesis Testing: Are the houses equally represented, according to a chi-squared goodness-of-fit test?
- Word Cloud: Generate a word cloud per book
- Spell Popularity: Plot the number of mentions of each spell (spell list scraped from Wikipedia), per book
- Sentiment Analysis: Count occurrences of negative and positive words, and track their progression from book 1 to book 7
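
The co-occurrence rule described under Social Network Analysis can be sketched in base R as follows. This is an illustration under stated assumptions rather than the project's implementation: `cooccurrence_edges` is a hypothetical name, distance is assumed to be measured in token positions after stop word removal, and each character is assumed to appear as a single token.

```r
# Hypothetical sketch: build a symmetric matrix of co-occurrence edge weights.
# `tokens` is a character vector with stop words already removed.
cooccurrence_edges <- function(tokens, characters, window = 15) {
  pos <- which(tokens %in% characters)  # positions where any character is named
  weights <- matrix(0, length(characters), length(characters),
                    dimnames = list(characters, characters))
  for (a in seq_along(pos)) {
    b <- a + 1
    # Look ahead only while the next mention is still within the window
    while (b <= length(pos) && pos[b] - pos[a] <= window) {
      x <- tokens[pos[a]]
      y <- tokens[pos[b]]
      if (x != y) {  # ignore repeated mentions of the same character
        weights[x, y] <- weights[x, y] + 1
        weights[y, x] <- weights[y, x] + 1
      }
      b <- b + 1
    }
  }
  weights
}

# Toy tokens, made up for illustration
toy <- c("Harry", "saw", "Ron", "near", "Hermione")
cooccurrence_edges(toy, c("Harry", "Ron", "Hermione"), window = 2)
```

A weight matrix of this shape can be handed to a graph library (for example, igraph's `graph_from_adjacency_matrix` with `weighted = TRUE`) to compute the centrality measures listed above.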
All results (plots and hypothesis tests) are saved as PDF or TXT files in the Results folder.
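
The House Popularity hypothesis test can be sketched with base R's `chisq.test`, which tests against equal proportions by default when given a vector of counts. The helper name `count_house_mentions` and the toy tokens are assumptions made for illustration; they are not the project's data or results.

```r
# Hypothetical helper: count how often each house name appears in the tokens.
count_house_mentions <- function(tokens, houses) {
  counts <- table(factor(tokens[tokens %in% houses], levels = houses))
  setNames(as.integer(counts), houses)
}

houses <- c("Gryffindor", "Slytherin", "Hufflepuff", "Ravenclaw")

# Made-up tokens, for illustration only
toy_tokens <- c(rep("Gryffindor", 40), rep("Slytherin", 20),
                rep("Hufflepuff", 10), rep("Ravenclaw", 10),
                rep("wand", 5))

house_counts <- count_house_mentions(toy_tokens, houses)

# Null hypothesis: all four houses are mentioned equally often
result <- chisq.test(house_counts)
result$p.value
```

A small p-value would reject the hypothesis that the houses are equally represented in that book.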
###R FILES###
- HarryPotter-Analysis.R : Main Entry Point File. Reads the text of the 7 books and loops through all functions
- hp_helper.R : Helper file with all functions required to perform analysis.
- hp_unit_tests.R : Unit Tests for helper functions.
- hp_global_vars.R : File that defines some global variables that are used for the analysis.
###RESULTS### All Results are stored in the "Results" folder.
- HarryPotter_Plots.pdf
- HarryPotter_GraphAnalysis_HypothesisTesting.txt
- Log File: harrypotterlog.log