
asifkhan2017's Projects

bert

TensorFlow code and pre-trained models for BERT

biolink-model

Schema and generated objects for biolink data model and upper ontology

capstone-project

Builds user communities on Twitter from users' posted content, using clustering and topic-detection methods.

charm

Parsing and testing Juju charms

dgp

Rethinking Knowledge Graph Propagation for Zero-Shot Learning, in CVPR 2019

elecbert

Improving Sentiment Analysis in Election-Based Conversations on Twitter with ElecBERT Language Model

generativessl

A deep generative model of labels for semi-supervised learning

jwescoder-corpusoptima_operationalcode

Contains fully operational code for creating large, biomedically pertinent semantic spaces from "clean corpora" scraped from public databases, in particular abstracts from the National Library of Medicine's PubMed biomedical literature database. The resulting semantic space has multiple uses, especially training Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) models in text-mining packages such as Gensim and scikit-learn.

The CorpusOptima system allows comprehensive, systematic, and efficient scraping of abstract text exclusively (without XML headers, author/location data, or other ancillary text) from all abstracts in a given chronological stretch (month by month in the primary version), organized so that all scraped text from a given month or year of PubMed abstracts is stored in a local file or database entry.

For incorporation into Gensim, scikit-learn, or other modules implementing LSA and related document-comparison tools, the corpora (semantic spaces) are available in two forms, each corresponding to a saved simplejson text file in the primary version of the code:

1. A large list of strings, each string (representing a separate document) holding the text of a single scraped abstract, akin to the "documents" variable in Radim Rehurek's first Gensim tutorial (https://radimrehurek.com/gensim/tut1.html).

2. A nested list in which each inner list holds the stemmed, lowercased, depunctuated, stopworded tokens of one abstract, so that each inner list again corresponds to a distinct document, akin to the "texts" variable in the same tutorial (https://radimrehurek.com/gensim/tut1.html).
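The two corpus forms described above can be sketched in a few lines. This is a minimal illustration, not the CorpusOptima code itself: the stand-in abstracts and the tiny stopword list are hypothetical, and stemming is omitted for brevity.

```python
import string

# Hypothetical mini-corpus standing in for scraped PubMed abstracts.
# This list of strings is form 1: one string per document ("documents").
documents = [
    "The BRCA1 gene is implicated in breast cancer.",
    "Breast cancer risk increases with BRCA1 mutations.",
]

# Illustrative stopword list (a real pipeline would use a fuller one).
STOPWORDS = {"the", "is", "in", "with", "a", "an", "of"}

def tokenize(doc):
    """Lowercase, strip punctuation, drop stopwords (stemming omitted)."""
    table = str.maketrans("", "", string.punctuation)
    return [w for w in doc.lower().translate(table).split()
            if w not in STOPWORDS]

# Form 2: a nested list with one inner token list per abstract ("texts").
texts = [tokenize(d) for d in documents]
```

From here, building a Gensim dictionary and bag-of-words corpus from `texts` proceeds exactly as in the cited tutorial.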
The basic code module first uploaded maxes out at 100,000 abstracts per scrape (the maximum allowed under the NLM E-utilities API); a looped variant allows scraping more than 100,000 abstracts, up to the total catalogued for a given month or year. I have used this code to scrape NLM PubMed abstracts month by month and year by year dating back to 1911, one of the first years in which abstracts were systematically catalogued for biomedical publications. The full semantic space (complete corpus) of biomedical abstracts, housing tens of millions of documents in total, is stored in public Dropbox and Google Drive directories (each file corresponding to one month's or one year's worth of saved corpora) as well as in a database under construction.
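The looped variant's paging past the per-request cap can be illustrated with the E-utilities `retstart`/`retmax` parameters. A hypothetical sketch (the function name and page size are illustrative, not from the repository), assuming a prior `esearch` call with `usehistory=y` has returned a `query_key` and `WebEnv` for the month's abstracts:

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_page_urls(total_abstracts, page_size, query_key, webenv):
    """Yield one efetch URL per page, stepping retstart forward so an
    arbitrarily large month's or year's worth of abstracts can be
    pulled in successive requests of at most page_size each."""
    for retstart in range(0, total_abstracts, page_size):
        params = {
            "db": "pubmed",
            "rettype": "abstract",
            "retmode": "text",
            "query_key": query_key,
            "WebEnv": webenv,
            "retstart": retstart,
            "retmax": page_size,
        }
        yield f"{EFETCH}?{urlencode(params)}"

# e.g. 250,000 catalogued abstracts at 100,000 per request -> 3 pages
urls = list(efetch_page_urls(250000, 100000, 1, "NCID_demo"))
```

Each yielded URL would then be fetched (with the rate limiting NCBI requires) and the returned abstract text appended to that month's local file or database entry.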

ncbi_bert

NCBI BERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).

online-social-network-analysis

Empirical analysis of predictive algorithms for collaborative filtering; constructing a social network from Twitter data; community detection and link prediction using Facebook 'Like' data; categorizing movie reviews by sentiment; and a content-based recommendation algorithm, using Python, Pandas, NumPy, and scikit-learn.

pytorch-biggraph

Software used for generating embeddings from large-scale graph-structured data.

scrapper

Facebook, blog, Twitter, and Instagram scraper

socialnetworkanalysis

Empirical analysis of predictive algorithms for collaborative filtering; constructing a social network from Twitter data; community detection and link prediction using Facebook 'Like' data; categorizing movie reviews by sentiment; and a content-based recommendation algorithm, using Python, Pandas, NumPy, and scikit-learn.

tridnr

Tri-Party Deep Network Representation
