ppaxe's Introduction


Tool to retrieve protein-protein interactions and calculate protein/gene symbol occurrence in the scientific literature (PubMed & PubMedCentral). Contains two Python modules (core and report), and a Python script (ppaxe).

Available for Python 2.7 and Python 3.x, and also as a standalone Docker image.

Visit the PPaxe web application to use PPaxe on the web.

Citation

S. Castillo-Lara, J.F. Abril
PPaxe: easy extraction of protein occurrence and interactions from the scientific literature
Bioinformatics, AOP November 2018, bty988.

Quick Installation

To download and use the ppaxe Docker image:

docker pull compgenlabub/ppaxe:latest
docker run -v /local/path/to/output:/ppaxe/output:rw \
              compgenlabub/ppaxe -v -p ./papers.pmids -o ./output.tbl -r ./report

If you want to install PPaxe manually, go to the Install ppaxe manually section.

Usage

usage: ppaxe [-h] -p PMIDS [-d DATABASE] [-o OUTPUT] [-r REPORT] [-i IP] [-v]
             [-e]

Command-line tool to retrieve protein-protein interactions from the scientific
literature.

optional arguments:
  -h, --help            show this help message and exit
  -p PMIDS, --pmids PMIDS
                        Text file with a list of PMids or PMCids
  -d DATABASE, --database DATABASE
                        Download whole articles from database "PMC", or only
                        abstracts from "PUBMED".
  -o OUTPUT, --output OUTPUT
                        Output file to print the retrieved interactions in
                        tabular format.
  -r REPORT, --report REPORT
                        Print html report with the specified name.
  -i IP, --ip IP        Change the IP address of the StanfordCoreNLP server.
                        Default: http://localhost:9000
  -v, --verbose         Increase output verbosity.
  -e, --exclude         Exclude protein symbols not annotated in dictionary.
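
When ppaxe is launched from another Python program, the options listed above can be assembled programmatically. The following is a minimal sketch, not part of ppaxe itself: `build_ppaxe_command` is a hypothetical helper that only uses the flags shown in the usage message and returns the argument list without running anything.

```python
def build_ppaxe_command(pmids_file, database=None, output=None, report=None,
                        ip=None, verbose=False, exclude=False):
    """Return the ppaxe command line as a list of arguments.

    Hypothetical helper: it only mirrors the flags documented in the
    usage message (-p, -d, -o, -r, -i, -v, -e).
    """
    cmd = ["ppaxe", "-p", pmids_file]
    if database is not None:
        # The usage message names these two databases
        if database not in ("PMC", "PUBMED"):
            raise ValueError('database must be "PMC" or "PUBMED"')
        cmd += ["-d", database]
    if output is not None:
        cmd += ["-o", output]
    if report is not None:
        cmd += ["-r", report]
    if ip is not None:
        cmd += ["-i", ip]
    if verbose:
        cmd.append("-v")
    if exclude:
        cmd.append("-e")
    return cmd


cmd = build_ppaxe_command("pmids.txt", database="PMC", output="output.tbl",
                          report="report", verbose=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once ppaxe and the StanfordCoreNLP server are installed.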

ppaxe classes

from ppaxe import core as ppcore
from ppaxe import report

# Perform query to PubMedCentral
pmids = ["28615517","28839427","28831451","28824332","28819371","28819357"]
query = ppcore.PMQuery(ids=pmids, database="PMC")
query.get_articles()

# Retrieve interactions from text and print the predictions
for article in query:
    article.extract_interactions()
    # Predictions are stored per article, so read them inside the loop
    for prediction in article.predictions:
        print(prediction.to_html())

# Print html report
# Will create 'report_file.html'
summary = report.ReportSummary(query)
summary.make_report("report_file")
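
The identifier list passed to PMQuery does not have to be typed in by hand. As a sketch, the hypothetical helper `read_ids` below loads identifiers from a text file, assuming one identifier per line (the file layout is an assumption, it is not specified here), and drops blank lines and repeats.

```python
import os
import tempfile


def read_ids(path):
    """Read identifiers from a text file, assuming one identifier per line."""
    ids = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            identifier = line.strip()
            # Skip blank lines and identifiers that were already read
            if not identifier or identifier in seen:
                continue
            seen.add(identifier)
            ids.append(identifier)
    return ids


# Demonstration with a temporary file containing a blank line and a repeat
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("28615517\n28839427\n\n28615517\n")
ids = read_ids(tmp.name)
os.remove(tmp.name)
print(ids)
```

The returned list could then be used as `ppcore.PMQuery(ids=read_ids("pmids.txt"), database="PMC")`.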

ppaxe script

# Will read PubMed ids in pmids.txt, predict the interactions
# in their fulltext from PubMedCentral, and print a tabular output
# and an html report
ppaxe -p pmids.txt -d PMC -v -o output.tbl -r report

# Or with docker image
docker run -v /local/path/to/output:/ppaxe/output:rw compgenlabub/ppaxe -v -p pmids.txt -o output.tbl -r report

Report

The report output (option -r) will contain a simple summary of the analysis, the interactions retrieved (including the sentences from which they were retrieved), a table with the protein/gene counts and a graph visualization made using cytoscape.js.

Install ppaxe manually

  • Prerequisites
xml.dom
numpy
pycorenlp
cPickle
scipy

You can install this package manually using pip. However, before doing so, you have to download the Random Forest predictor and place it in ppaxe/data.

# Clone the repository
git clone https://github.com/scastlara/ppaxe.git

# Download pickle with RF
wget "https://www.dropbox.com/s/t6qcl19g536c0zu/RF_scikit.pkl?dl=0" -O ppaxe/ppaxe/data/RF_scikit.pkl

# Install
pip install ppaxe
  • Download StanfordCoreNLP

In order to use the package you will need a StanfordCoreNLP server set up with the Protein/gene Tagger.

 # Download StanfordCoreNLP
 wget http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
 unzip stanford-corenlp-full-2017-06-09.zip

 # Download the Protein tagger
 wget "https://www.dropbox.com/s/ec3a4ey7s0k6qgy/FINAL-ner-model.AImed%2BMedTag%2BBioInfer.ser.gz?dl=0" -O FINAL-ner-model.AImed+MedTag+BioInfer.ser.gz

 # Download English tagger models
 wget http://nlp.stanford.edu/software/stanford-english-corenlp-2017-06-09-models.jar -O stanford-corenlp-full-2017-06-09/stanford-english-corenlp-2017-06-09-models.jar

 # Change the location of the tagger in ppaxe/data/server.properties if necessary
 # ...

 # Start the StanfordCoreNLP server
 cd stanford-corenlp-full-2017-06-09/
 java -mx1000m -cp ./stanford-corenlp-3.8.0.jar:stanford-english-corenlp-2017-06-09-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -serverProperties ~/ppaxe/ppaxe/data/server.properties

Once the server is up and running and ppaxe has been installed, you are good to go.

By default, ppaxe will assume the server is available at localhost:9000. If you want to change the address, set up the server with the appropriate port and change the address in ppaxe by assigning the new address to the variable ppaxe.core.NLP (the core module is imported as ppcore in the example):

  • Start the server
# Change the location of the ner tagger in server.properties manually
java -mx10000m -cp ./stanford-corenlp-3.8.0.jar:stanford-english-corenlp-2017-06-09-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port your_port -serverProperties ppaxe/data/server.properties
  • Use the ppaxe package
from ppaxe import core as ppcore
from pycorenlp import StanfordCoreNLP

# your_new_address is the URL of your server, e.g. "http://localhost:your_port"
ppcore.NLP = StanfordCoreNLP(your_new_address)

# Do whatever you want
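
Before pointing ppaxe at a new address, it can help to confirm that something is listening there. The sketch below is not part of ppaxe: `server_is_reachable` is a hypothetical helper that only uses the Python 3 standard library. It checks that a TCP connection can be opened, and does not verify that the listener is actually a StanfordCoreNLP server.

```python
import socket
from urllib.parse import urlparse


def server_is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to the host and port in `address` succeeds."""
    parsed = urlparse(address)
    host = parsed.hostname
    if host is None:
        raise ValueError("address must look like http://host:port")
    # Fall back to the scheme's usual port when none is given
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable
        return False


# Default address documented for the -i option
print(server_is_reachable("http://localhost:9000"))
```

If the check returns False, start the server (or fix the port) before assigning the address to the NLP variable.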

Using the Gene dictionary

By default, PPaxe uses the HGNC dictionary of gene symbols to normalize the protein/gene symbols found in the article. The ppaxe command-line tool has the option -e that restricts all the results to only those proteins that match against the HGNC database. Users can change this file (located at ppaxe/data/HGNC_gene_dictionary.txt) in order to restrict their searches to only specific genes or proteins, or to normalize gene names using a different dictionary.
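
One way to restrict the dictionary to a set of genes is to filter it before replacing the file. The layout of HGNC_gene_dictionary.txt is not described here, so the sketch below rests on an assumption: each line is taken to hold tab-separated fields, and a line is kept when any field equals one of the wanted symbols (ignoring case). `filter_dictionary` and the example lines are hypothetical and only illustrate the idea; check the real file before adapting it.

```python
def filter_dictionary(lines, wanted_symbols):
    """Keep the lines in which any tab-separated field is a wanted symbol.

    Assumes tab-separated fields per line; this layout is an assumption,
    not the documented format of the ppaxe dictionary.
    """
    wanted = {symbol.upper() for symbol in wanted_symbols}
    kept = []
    for line in lines:
        fields = [field.strip().upper() for field in line.rstrip("\n").split("\t")]
        if wanted.intersection(fields):
            kept.append(line)
    return kept


# Made-up lines in the assumed layout (symbol, then an alias)
example = ["TP53\tP53\n", "BRCA1\tRNF53\n", "EGFR\tERBB1\n"]
filtered = filter_dictionary(example, ["tp53", "EGFR"])
print(filtered)
```

Keeping a backup of the original dictionary makes it easy to return to the default behaviour.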

Documentation

Refer to the wiki of the package.

Running the tests

To run the tests:

python -m pytest -v tests

Authors

License

This project is licensed under the GNU GPL3 license - see the LICENSE file for details

ppaxe's People

Contributors

josepfabril, scastlara

ppaxe's Issues

Not well-formed (invalid token)

Following errors:
  File "/ppaxe/bin/ppaxe", line 173, in <module>
    main()
  File "/ppaxe/bin/ppaxe", line 163, in main
    stats = get_ppi(options, start_time, pmids)
  File "/ppaxe/bin/ppaxe", line 89, in get_ppi
    query.get_articles()
  File "/usr/local/lib/python2.7/dist-packages/ppaxe/core.py", line 227, in get_articles
    self.__get_pubmed(req)
  File "/usr/local/lib/python2.7/dist-packages/ppaxe/core.py", line 182, in __get_pubmed
    article_text = minidom.parseString(req.content)
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 34403, column 55

Error occurring on docker ppaxe.
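
The traceback shows minidom.parseString raising ExpatError on a downloaded response. As an illustration only (this is not the fix adopted by the project, and `parse_article_xml` is a hypothetical name), a caller could guard the parse so that a single malformed response is skipped instead of aborting the whole run:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_article_xml(content):
    """Parse XML and return the DOM, or None if the input is not well-formed."""
    try:
        return minidom.parseString(content)
    except ExpatError as error:
        # Report the problem and let the caller move on to the next response
        print("Skipping response: %s" % error)
        return None


good = parse_article_xml(b"<article><title>ok</title></article>")
# An unescaped ampersand makes the document not well-formed
bad = parse_article_xml(b"<article><title>broken & raw</title></article>")
print(good is not None, bad is None)
```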

Create docker

Add docker container to distribute ppaxe together with the StanfordCoreNLP and all the dependencies.

Specialising the module to non protein-protein interactions

I've had a play with this and it's really nice.

I'm somewhat unfamiliar with the NLP that goes into this (and NLP in general), but I wanted to know if this should be straightforward to extend to looking for interactions other than protein-protein.

For example, I'm particularly interested in looking for protein-disease interactions. If I have a list of proteins and diseases whose interactions I want to find, do you have any advice on how I can modify the ppaxe module to do this?

Thanks!
