Natural Language Processing with Python and Spark
Takes a raw-text RDD and transforms it into an RDD containing only sentences.
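
A minimal sketch of the sentence-splitting step, using plain Python lists in place of RDDs so it runs without a Spark installation. The helper name `split_sentences` and the punctuation-based splitting rule are assumptions for illustration, not necessarily what this project does.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

raw_lines = ["Spark is fast. It scales well!", "Is it easy to use?"]

# With a real RDD this would be roughly: raw_rdd.flatMap(split_sentences)
sentences = [s for line in raw_lines for s in split_sentences(line)]
print(sentences)
```

On an RDD, `flatMap` would play the role of the nested comprehension, since each input line may yield several sentences.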
Takes a sentences-only RDD and transforms it into an RDD of key-value pairs,
the key being the sentence and the value being the list of its tokens.
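
The sentence-to-tokens mapping could look roughly like the sketch below. The `tokenize` helper and its regex (lowercased words plus individual punctuation marks) are hypothetical; the project may tokenize differently. Lists stand in for RDDs.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and split it into word and punctuation tokens (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

sentences = ["Spark is fast.", "It scales well!"]

# With a real RDD this would be roughly: sentences_rdd.map(lambda s: (s, tokenize(s)))
pairs = [(s, tokenize(s)) for s in sentences]
print(pairs)
```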
Takes a sentences-only RDD and transforms it into a GIN (inverted index) RDD,
where the key is a word and the value is an iterable of the sentences containing it.
Takes the GIN RDD and returns an RDD of the sentences that contain a given token.
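
As an illustration of the index and the lookup on it, here is a small sketch with a dictionary standing in for the keyed RDD. The names `build_index` and `lookup`, and the whitespace-plus-punctuation-stripping word rule, are assumptions.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased word to the set of sentences that contain it."""
    index = defaultdict(set)
    for sentence in sentences:
        for raw in sentence.lower().split():
            word = raw.strip(".,!?;:")
            if word:  # skip tokens that were punctuation only
                index[word].add(sentence)
    return index

def lookup(index, token):
    """Return the sentences containing the token, sorted for stable output."""
    return sorted(index.get(token.lower(), ()))

sentences = ["Spark is fast.", "Python is fun.", "Spark runs Python."]
index = build_index(sentences)
print(lookup(index, "spark"))
```

With Spark, the index would typically come from emitting (word, sentence) pairs with `flatMap` and then grouping by key; the lookup is then a filter on the key.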
Takes the sentence-token list RDD and transforms it into the vocabulary RDD,
containing only the alphabetical tokens from the text.
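
A sketch of the vocabulary step under the assumption that "alphabetical" means `str.isalpha()` is true and that the vocabulary holds each token once. The sample data is made up.

```python
pairs = [
    ("Spark 2 is fast.", ["spark", "2", "is", "fast", "."]),
    ("Is it fun?", ["is", "it", "fun", "?"]),
]

# With a real RDD this would be roughly:
#   pairs_rdd.flatMap(lambda kv: kv[1]).filter(str.isalpha).distinct()
vocabulary = sorted({t for _, tokens in pairs for t in tokens if t.isalpha()})
print(vocabulary)
```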
Builds the term-frequency (tf) breakdown of the sentence-token list RDD.
The resulting key is a (sentence, token) pair.
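
One common definition of tf is the token's count divided by the number of tokens in the sentence. The sketch below assumes that definition (the project may use raw counts or another variant) and uses a dictionary keyed by (sentence, token) in place of an RDD.

```python
from collections import Counter

def term_frequencies(pairs):
    """Return {(sentence, token): count / sentence_length} (assumed tf definition)."""
    tf = {}
    for sentence, tokens in pairs:
        for token, count in Counter(tokens).items():
            tf[(sentence, token)] = count / len(tokens)
    return tf

pairs = [("a b a", ["a", "b", "a"]), ("b c", ["b", "c"])]
tf = term_frequencies(pairs)
print(tf[("a b a", "a")])  # 2 of the 3 tokens are "a"
```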
Builds the inverse-document-frequency (idf) breakdown of the sentence-token list RDD.
The resulting key is the sentence.
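
The sketch below treats each sentence as a document and assumes idf(token) = log(N / df(token)), where N is the number of sentences and df is the number of sentences containing the token. Both the formula and the value shape (a per-sentence dictionary of token to idf) are assumptions; only the sentence key comes from the description above.

```python
import math
from collections import Counter

def inverse_document_frequencies(pairs):
    """Return {sentence: {token: log(N / df(token))}} (assumed idf definition)."""
    n = len(pairs)
    # Count each token once per sentence to get its document frequency.
    df = Counter(t for _, tokens in pairs for t in set(tokens))
    return {
        sentence: {t: math.log(n / df[t]) for t in set(tokens)}
        for sentence, tokens in pairs
    }

pairs = [("a b a", ["a", "b", "a"]), ("b c", ["b", "c"])]
idf = inverse_document_frequencies(pairs)
print(idf["a b a"])
```

A token that appears in every sentence, such as "b" here, gets an idf of zero under this definition.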
Builds the tf-idf breakdown of the sentence-token list RDD.
The resulting key is a (sentence, token) pair.
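
Putting the two together, a tf-idf score is the product of tf and idf for each (sentence, token) pair. This sketch assumes tf = count / sentence length and idf = log(N / df); the function name `tf_idf` and the sample data are illustrative only.

```python
import math
from collections import Counter

def tf_idf(pairs):
    """Return {(sentence, token): tf * idf} under the assumed tf and idf definitions."""
    n = len(pairs)
    df = Counter(t for _, tokens in pairs for t in set(tokens))
    scores = {}
    for sentence, tokens in pairs:
        for token, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n / df[token])
            scores[(sentence, token)] = tf * idf
    return scores

pairs = [("a b a", ["a", "b", "a"]), ("b c", ["b", "c"])]
scores = tf_idf(pairs)
print(scores[("a b a", "a")])
```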
You will need nose to run tests:
nosetests2 test.py