This repo contains companion Jupyter notebooks for part 1 and part 2 of a tour through the features of spaCy, a versatile and fast natural language processing library for Python.
In the first part we go through:
- preprocessing tasks: tokenisation, sentence segmentation, lemmatisation, stop words,
- linguistic features: POS tags, dependency parsing, named entity recognition,
- visualisers for dependency trees and NER
- word vectors
- extensions
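As a taste of the preprocessing steps covered in part 1, here is a minimal sketch using a blank English pipeline (the example text is illustrative; POS tags, lemmas, dependency parses and named entities additionally require a trained pipeline such as `en_core_web_sm`):

```python
import spacy

# a blank English pipeline: tokenisation and stop-word flags work out of the box
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence segmentation

doc = nlp("spaCy is fast. It is also easy to use.")
print([token.text for token in doc])            # tokenisation
print([sent.text for sent in doc.sents])        # sentence segmentation
print([token.text for token in doc if token.is_stop])  # stop words
```

For the linguistic features, you would instead load a trained pipeline with `spacy.load("en_core_web_sm")` and inspect attributes like `token.pos_`, `token.dep_` and `doc.ents`.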
In the second part we discuss:
- pretraining,
- training of a TextCategorizer.
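The training loop for a `TextCategorizer` can be sketched as follows (the labels and toy examples below are illustrative, not the Medium tags data used in the notebooks):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")  # exclusive single-label text classifier
textcat.add_label("DATA_SCIENCE")
textcat.add_label("OTHER")

# toy training data: (text, annotations) pairs with category scores
train_data = [
    ("Gradient boosting for tabular data", {"cats": {"DATA_SCIENCE": 1.0, "OTHER": 0.0}}),
    ("My trip to the mountains last summer", {"cats": {"DATA_SCIENCE": 0.0, "OTHER": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

nlp.initialize(lambda: examples)  # initialise weights from the examples
for _ in range(20):
    losses = {}
    nlp.update(examples, losses=losses)

doc = nlp("A tutorial on neural networks")
print(doc.cats)  # score per label
```

In practice you would train on the full dataset over shuffled minibatches and evaluate on held-out data, as shown in the notebook.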
The training and testing data used here is derived from the Medium tags dataset published on Kaggle.
Questions, comments and suggestions are welcome!