As per here though with different sample data and a few tweaks.
- Add computer science/programming controlled vocabularies. #DONE
- Parse software engineering sample_corpus with gensim utilities. #DONE
- Corpus Streaming: Build dictionary by streaming one document at a time (data/build_corpus_dictionary.py) #DONE
- Explore Corpus Formats #DONE
- Topics and Transformations #FOCUS
- Create bespoke MyCorpus class in .classes/. #TODO
- Design ontology model for software engineering domain. #TODO
- Include marker for single/few-letter domain words (e.g. "c", "R")
- Identify relevance and utility of tags present in stackexchange_tags.tag_description (surrounded by square brackets).
  - Can a reliable relationship be established between them and the parent tag?
- Build modelled controlled vocabulary for software engineering domain. #TODO
- Design mechanism to disambiguate single/few-letter domain words from non-domain instances. #TODO
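
One possible starting point for the marker and disambiguation items above, as a rough sketch only (not the planned design). It tags a short token as a domain word only when a programming-related cue word appears within a few tokens of it. The term sets, the `__dom__` suffix, the function name and the window size are all hypothetical placeholders; real values would come from the controlled vocabulary.

```python
from typing import List, Sequence, Set

# Hypothetical sample data; the real sets would come from the controlled vocabulary.
SHORT_DOMAIN_TERMS: Set[str] = {"c", "r", "d", "go"}
CONTEXT_CUES: Set[str] = {"language", "compiler", "programming", "code", "package"}
MARKER = "__dom__"  # hypothetical suffix appended to tokens judged to be domain words


def mark_short_domain_terms(
    tokens: Sequence[str],
    short_terms: Set[str] = SHORT_DOMAIN_TERMS,
    cues: Set[str] = CONTEXT_CUES,
    window: int = 3,
) -> List[str]:
    """Lowercase the tokens and mark short domain terms that have a cue word nearby."""
    lowered = [t.lower() for t in tokens]
    marked: List[str] = []
    for i, tok in enumerate(lowered):
        if tok in short_terms:
            # Look at up to `window` tokens on each side of the candidate.
            start = max(0, i - window)
            neighbours = lowered[start:i] + lowered[i + 1:i + 1 + window]
            if any(n in cues for n in neighbours):
                marked.append(tok + MARKER)
                continue
        marked.append(tok)
    return marked


if __name__ == "__main__":
    print(mark_short_domain_terms("I wrote the parser in C code".split()))
    print(mark_short_domain_terms("see section c of the report".split()))
```

This is a crude heuristic: a sentence such as "go and read the code" would still be marked incorrectly, so the cue list and window would need tuning against the sample corpus.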