This repository contains the work for the Peer-graded Assignment: Milestone Report. It presents the data exploration and text mining done so far in preparation for the final project of the Data Science Capstone. The work has three parts. First, the English text data from blogs, news and Twitter is read and a 10% sample is selected. Second, we build the corpus, clean the data, and analyze the most frequent words and the most frequent combinations of two and three words (n-grams). Finally, we plot the results of the exploratory analysis. The report closes with the planned approach for the next step of the project.
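
The report itself is written in R Markdown, so the following Python snippet is not the code used in this project. It is only a minimal sketch of the pipeline described above (sample about 10% of the lines, clean the text, count n-grams). The function names, the random seed, and the tokenization rule are assumptions made for illustration.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.10, seed=1234):
    """Keep each line independently with probability `fraction`.

    The seed value is an arbitrary choice that makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text, bad_words=frozenset()):
    """Lowercase the text, keep alphabetic tokens, and drop bad words.

    Assumption: a token is a run of letters, optionally with inner apostrophes.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return [t for t in tokens if t not in bad_words]


def ngram_counts(lines, n, bad_words=frozenset()):
    """Count every run of `n` consecutive tokens within each line."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line, bad_words)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy input standing in for the blogs, news and Twitter files.
    toy = ["I love data science", "I love text mining", "Data science is fun"]
    for n in (1, 2, 3):
        print(n, ngram_counts(toy, n).most_common(3))
```

On the real data, `sample_lines` would be applied to the lines of each input file before counting, and `bad_words` would be loaded from the bad-word list in this repository.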
The final HTML report is available at https://mcastrol.github.io/dataScienceCaptoneProject/DCCaptoneProject_FirstPart.html
- DCCaptoneProject_FirstPart.Rmd: R Markdown source of the milestone report.
- DCCaptoneProject_FirstPart.html: rendered HTML, viewable directly from Git.
- full-list-of-bad-words-text-file_2018_03_26.txt: list of bad words excluded from the analysis.
The input data can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip