The last few projects have been fairly simple: a Naive Bayes classifier and then a Logistic Regression classifier. My goal was to show that, while not as flashy as building a neural network, these models can solve business problems quickly and, in some cases, very accurately.
However, both of the previous projects represent the text as sparse vectors built with the TF-IDF method. My goal with this next project is to use word embeddings instead. In practice, word embedding methods such as fastText or word2vec are better at capturing the meaning of words in the text data than TF-IDF.
I hope to show that, by using a word embedding method like word2vec in conjunction with a simple algorithm, we can accurately predict which language a piece of text is written in.
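The pipeline can be sketched in a few lines: embed each word, average the word vectors into a sentence vector, and feed those vectors to a simple classifier. This is a minimal sketch only; the word vectors here are random placeholders standing in for a real word2vec or fastText model, and the toy corpus and labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder embedding table: in the real project these vectors would
# come from a trained word2vec / fastText model, not random numbers.
rng = np.random.default_rng(42)
vocab = {}

def word_vector(word, dim=8):
    # Look up (or lazily create) a vector for `word`.
    if word not in vocab:
        vocab[word] = rng.normal(size=dim)
    return vocab[word]

def sentence_vector(sentence):
    # A common simple baseline: the sentence embedding is the
    # mean of its word vectors.
    return np.mean([word_vector(w) for w in sentence.lower().split()], axis=0)

# Tiny made-up labeled corpus: 0 = English, 1 = Spanish.
sentences = [
    "the cat sat on the mat",
    "i like to read books",
    "el gato se sienta en la alfombra",
    "me gusta leer libros",
]
labels = [0, 0, 1, 1]

X = np.vstack([sentence_vector(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```

With real pretrained embeddings in place of the random lookup table, the same averaging-plus-classifier recipe is what lets a simple model do the language prediction.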