text_mining_tools

Tools to facilitate mining the scientific literature.

These tools contain two major classes:

  1. A Query class, which collects a list of DOIs from search queries. Queries can be limited to specific journals.
  2. An Article class, which performs operations on a given DOI, from downloading the article to further manipulation.

In addition to the classes, some tools reside in full_text_mine.py, which supports VADER sentiment analysis.

Wiley, ACS, RSC, Nature, and Science publications are currently supported by default. Mining is supported only for HTML/XML. Science articles can be downloaded, but they are not in HTML format, so they cannot be mined unless you convert them to HTML first. Be careful here: turning PDFs into mineable data is not a simple task.

To install, follow the instructions below:

# install dependencies with the following
pip install -r requirements.txt 

Next, install the package.

# install the package with the following
python setup.py develop

Next, replace two files in your articledownloader installation with the adjusted versions stored in this repo (under adjusted_article_downloader/). First, find where articledownloader is installed. It is most likely at the following location:

<anaconda-path>/envs/<conda-env-name>/lib/python3.6/site-packages/articledownloader/

The two files to replace are articledownloader.py and scrapers.py, both located in that directory. Copy the versions from text_mining_tools/adjusted_article_downloader/ over them. This step is needed only if you plan to use the Article class to mass-download articles. It is not needed if you prepare your own corpus separately and build your Article objects from that corpus.

cp text_mining_tools/adjusted_article_downloader/* <anaconda-path>/envs/<conda-env-name>/lib/python3.6/site-packages/articledownloader/

If you installed inside a conda environment (recommended!), <conda-env-name> is the name of that environment and <anaconda-path> is the location of your Anaconda installation.
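If you are unsure where articledownloader lives, the location can be looked up from Python instead of guessed. The following is a minimal sketch, not part of this repo: the helper names find_package_dir and copy_patched_files are hypothetical, and the sketch assumes the package is importable as articledownloader in the active environment. It also keeps a .bak copy of each file it overwrites, which the cp command above does not do.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package_name):
    """Return the directory of an installed package, or None if it cannot be found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at its __init__.py
    return Path(spec.origin).parent


def copy_patched_files(source_dir, target_dir, filenames):
    """Copy the named files into target_dir, backing up any file that gets overwritten."""
    copied = []
    for name in filenames:
        src = Path(source_dir) / name
        dst = Path(target_dir) / name
        if dst.exists():
            # Keep the original next to the replacement, e.g. scrapers.py.bak
            shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

With these helpers, find_package_dir("articledownloader") gives the target directory, and copy_patched_files("text_mining_tools/adjusted_article_downloader", target, ["articledownloader.py", "scrapers.py"]) performs the replacement.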

Note: The first time you install NLTK, you will need to download the required data packages manually. Open a Python interpreter and run the following:

import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

You can check that your NLTK installation is ready by running the following imports in a Python interpreter.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
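These imports succeed even when the data packages are missing, so it can help to check for the downloaded data as well. The following is a minimal sketch, assuming the resource paths sentiment/vader_lexicon.zip and tokenizers/punkt that NLTK uses for the two downloads above. Newer NLTK releases may look for different tokenizer data, so treat the paths as assumptions. The function name missing_nltk_resources is hypothetical.

```python
import importlib.util

# Assumed locations of the two data packages inside NLTK's data directories
RESOURCES = {
    "vader_lexicon": "sentiment/vader_lexicon.zip",
    "punkt": "tokenizers/punkt",
}


def missing_nltk_resources(resources=RESOURCES):
    """Return the sorted names of data packages that NLTK cannot locate."""
    if importlib.util.find_spec("nltk") is None:
        # NLTK itself is not installed, so nothing can be found
        return sorted(resources)
    import nltk

    missing = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return sorted(missing)
```

An empty list means both packages were found. Any name that is returned can be passed to nltk.download.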

Python 3.6 is currently recommended for this package.

*** Note: Please set aside a hard disk with plenty of space if you are planning automated downloads. ***

We also recommend installing stanza for dependency parsing.

pip install stanza

By default, installing stanza does not install its models, which our pipelines require. Download them by running the following:

import stanza
stanza.download('en')

If you choose to use pybliometrics for manuscript abstract analysis, you can install it via pip.

pip install pybliometrics

You can then set up your Elsevier API key by following the instructions at https://pybliometrics.readthedocs.io/en/stable/configuration.html, which makes abstract mining possible. Your key is stored in a config.ini file inside a hidden folder (either .pybliometrics/ or .scopus/), which pybliometrics uses to automate abstract downloads.
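To confirm that the configuration file was written, you can look for it in the two hidden folders named above. The following is a minimal sketch using only the standard library. It assumes the folders sit in your home directory, and the function names are hypothetical. It lists only the section names in the file and does not read or print the key itself.

```python
import configparser
from pathlib import Path

# Hidden folders named in this README, checked in order
CANDIDATE_DIRS = (".pybliometrics", ".scopus")


def find_pybliometrics_config(home=None):
    """Return the first config.ini found under the candidate folders, or None."""
    home = Path(home) if home is not None else Path.home()
    for dirname in CANDIDATE_DIRS:
        candidate = home / dirname / "config.ini"
        if candidate.is_file():
            return candidate
    return None


def config_sections(path):
    """Return the section names defined in the given config file."""
    parser = configparser.ConfigParser()
    parser.read(str(path))
    return parser.sections()
```

If find_pybliometrics_config() returns None, the configuration step has likely not been completed yet, or your pybliometrics version stores its configuration elsewhere.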

Contributors

adityanandy
