cokelaer / bioservices

275.0 18.0 61.0 5.72 MB

Access to Biological Web Services from Python.

Home Page: http://bioservices.readthedocs.io

License: Other

Python 90.39% Jupyter Notebook 9.61%

kegg unichem eutils chembl chebi biomodels quickgo uniprot wikipathways muscle python

bioservices's Introduction

BIOSERVICES: access to biological web services programmatically

Python_version_available:	BioServices is tested for Python 3.7, 3.8, 3.9, 3.10
Contributions:	Please join https://github.com/cokelaer/bioservices
Issues:	Please use https://github.com/cokelaer/bioservices/issues
How to cite:	Cokelaer et al. BioServices: a common Python package to access biological Web Services programmatically Bioinformatics (2013) 29 (24): 3241-3242
Documentation:	RTD documentation.

BioServices is a Python package that provides access to many bioinformatics Web Services (e.g., UniProt) and a framework to easily implement Web Service wrappers (based on WSDL/SOAP or REST protocols).

The primary goal of BioServices is to use Python as a glue language to provide programmatic access to several bioinformatics Web Services. This makes it easier to build new applications that combine several of the wrapped Web Services.

One of the main design principles of BioServices is to make use of the existing biological databases (rather than re-inventing new ones) and to reduce the Web Services expertise required of developers and users.

BioServices provides access to about 40 Web Services.

Contributors

Maintaining BioServices would not have been possible without users and contributors. Each contribution has been an encouragement to pursue this project. Thanks to all contributors.

Quick example

Here is a small example using the UniProt Web Service to search for ZAP70 entries in human (taxonomy ID 9606):

>>> from bioservices import UniProt
>>> u = UniProt(verbose=False)
>>> data = u.search("zap70+and+taxonomy_id:9606", frmt="tsv", limit=3,
...                 columns="id,length,accession, gene_names")
>>> print(data)
Entry name   Length  Entry   Gene names
ZAP70_HUMAN  619     P43403  ZAP70 SRK
B4E0E2_HUMAN 185     B4E0E2
RHOH_HUMAN   191     Q15669  RHOH ARHH TTF

Note

A major change to the UniProt API in June 2022 renamed all column names. The code above is valid for bioservices versions >1.10. Earlier versions used:

>>> data = u.search("zap70+and+taxonomy:9606", frmt="tab", limit=3,
...                 columns="entry name,length,id, genes")

Note that the column names have changed, frmt changed from tab to tsv, and taxonomy is now taxonomy_id. The correspondence between old and new names can be found in:

u._legacy_names
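
As a rough illustration of the renaming, the sketch below translates a legacy columns string into the new names. The mapping is hand-written from the two calls shown above only (it is not the content of u._legacy_names), and the helper name translate_columns is hypothetical.

```python
# Hand-written mapping inferred from the two example calls in this README;
# it is an assumption, the full correspondences live in u._legacy_names.
LEGACY_TO_NEW = {
    "entry name": "id",
    "length": "length",
    "id": "accession",
    "genes": "gene_names",
}


def translate_columns(legacy_columns):
    """Translate a comma-separated legacy column string (hypothetical helper)."""
    names = [name.strip() for name in legacy_columns.split(",")]
    # Unknown names are passed through unchanged rather than guessed.
    return ",".join(LEGACY_TO_NEW.get(name, name) for name in names)


new_columns = translate_columns("entry name,length,id, genes")
print(new_columns)
```

Names that are not in the mapping are left untouched, so the result should still be checked against the UniProt documentation before use.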

More examples and tutorials are available in the online documentation.

Current services

Here is the list of services available and their testing status.

Service	CI testing
arrayexpress
bigg
biocontainers
biodbnet
biogrid
biomart
biomodels
chebi
chembl
cog
dbfetch
ena
ensembl
eutils
eva
hgnc
intact_complex
kegg
muscle
mygeneinfo
ncbiblast
omicsdi
omnipath
panther
pathwaycommons
pdb
pdbe
pfam
pride
psicquic
pubchem
quickgo
reactome
rhea
seqret
unichem
uniprot
wikipathway

Note

Contributions to implement new wrappers are more than welcome. See the BioServices GitHub page to join the development, and the Developer guide on how to implement new wrappers.

Bioservices command

In version 1.8.2, we included a bioservices command. For now it has only one subcommand, which downloads the record for an NCBI accession number and, optionally, its GenBank or GFF file (if available):

bioservices download-accession --accession K01711.1 --with-gbk

Changelog

Version	Description
1.12.0	migrating to pyproject
1.11.2	Update COG service to be more user-friendly and return all pages by default; uniprot: set progress to False in the search method; merged user PRs #250 and #249 (compress option in uniprot module and logging issue in biodbnet)
1.11.1	Fix regression in uniprot.mapping (#245)
1.11.0	Fix uniprot limitation of 25 results only. For developers: all services are now refactored to use services as an attribute rather than a parent class; remove ReactomeOld and ReactomeAnalysis (deprecated); move rnaseq_ebi (deprecated) to attic for bookkeeping
1.10.4	Fix v1.10.3 adding missing requirements.txt
1.10.3	Update pdb service to use v2 API; remove biocarta (website not accessible anymore); update Chembl (no API changes)
1.10.2	Fix #226 and apply PR #232 from @GianArauz about UniProt error; update MANIFEST to fix #232
1.10.1	Allow command line to download GenBank and GFF; update pride module to use new PRIDE API (July 2022); fix KEGG bug #225
1.10.0	Update uniprot to use the new API (June 2022)
1.9.0	Update unichem to reflect new API
1.8.4	biomodels. Fix #208. KEGG: fixed #204, #202 and #203
1.8.3	Eutils: remove warning due to unreachable URL; set REST as attribute rather than inheritance. NEW biocontainers module. KEGG: add save_pathway method; fix parsing of structure/pdb entry. Remove deprecated function from Reactome
1.8.2	Fix suds package in code and requirements
1.8.1	Integrated a change made in KEGG service (DEFINITON was changed to ORG_CODE). For developers: applied black on all modules; switched from suds-jurko to the new suds community package
1.8.0	Add main standalone application; moved chemspider and clinvitae to the attic; removed picr service (not active anymore)
1.4.X	NEW RNAseq from EBI in rnaseq_ebi module; replaced deprecated HGNC with the official web service from genenames.org; fully updated EUtils since WSDL is now down (implementation uses REST now); removed the apps/taxonomy module, now part of http://github.com/biokit
1.3.X	CACHE files are now stored in a general directory in the home; new REST class to use requests package instead of urllib2; creation of a global configuration file in .config/bioservice/bioservices.cfg; NEW services: Reactome, Readseq, Ensembl, EUtils
1.2.X	NEW services: BioDBnet, BioDBNet, MUSCLE, PathwayCommons, GeneProf
1.1.X	NEW services: biocarta, pfam, ChEBI, UniChem
1.0.0	first stable release
0.9.X	NEW services: BioModels, Kegg, Reactome, Chembl, PICR, QuickGO, Rhea, UniProt, WSDbfetch, NCBIblast, PSICQUIC, Wikipath

bioservices's Issues

pdb service

This service is not finalised. There seem to be more functionalities that are not yet available

Taxonomy missing

I found another example in the documentation:

from bioservices.apps.taxonomy import Taxon

Result:

ImportError: No module named taxonomy

Using version 1.4.0

Compound DBLINKS KEGGParser error!

from bioservices import *
kegg = KEGG()
c = kegg.parse(kegg.get('cpd:C00087'))

c['DBLINKS']
{u'CAS': u'7704-34-9 10544-50-0 PubChem: 3387 ChEBI: 17909 26833 3DMET: B04617 NIKKAJI: J3.750H'}
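
For comparison, here is a minimal sketch of the result one would expect for DBLINKS: one key per database, each with a list of identifiers. It assumes the raw entry lists one "Database: ids" pair per line; parse_dblinks is a hypothetical helper, not the KEGGParser code, and the input lines are rebuilt from the output shown above.

```python
def parse_dblinks(lines):
    """Parse 'Database: id1 id2' lines into {database: [ids]} (sketch)."""
    links = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; identifiers follow it.
        name, _, ids = line.partition(":")
        links[name.strip()] = ids.split()
    return links


# Lines rebuilt from the output above, assuming one database per line.
raw = [
    "CAS: 7704-34-9 10544-50-0",
    "PubChem: 3387",
    "ChEBI: 17909 26833",
    "3DMET: B04617",
    "NIKKAJI: J3.750H",
]
dblinks = parse_dblinks(raw)
print(dblinks["ChEBI"])
```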

eutils does not use wsdl anymore

Need to convert the code that relied on WSDL (EFetch)

update setup to use wrapt package

ensembl Ontologies and Taxonomy section

Inconsistent behaviour of KeggParser

Hi,

KeggParser behaves differently for the same attribute in different entries. For instance, if a compound has only one name KeggParser returns a string, whereas if it has more than one it returns a list. This makes it hard to automatically parse all the compound names.

I would suggest that attributes that can be lists should always be returned as lists, regardless of the number of elements found. This also happens with other attributes, e.g. reactions.

Thanks.

Cheers,
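
Until the parser is made consistent, a caller can normalise such attributes on its side. This is only a sketch of the workaround; as_list is a hypothetical helper and the example values are illustrative.

```python
def as_list(value):
    """Return value as a list, wrapping single items (hypothetical helper)."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    # A lone string (or any other scalar) becomes a one-element list.
    return [value]


single = as_list("Sulfur")
several = as_list(["Sulfur", "S"])
print(single, several)
```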

Bad arguments in xmltools.readXML class constructor

Here is the current constructor of readXML class:

class readXML(easyXML):
    def __init__(self, filename, fixing_unicode=False, encoding="utf-8"):
        url = urlopen(filename, "r")  # the bad function call...
        self.data = url.read()
        super(readXML, self).__init__(self.data, fixing_unicode, encoding)  # ...and the outdated constructor call

The second parameter of urlopen is data, but in this case no data is meant to be sent with the request: the "r" argument is hardcoded. I don't know where it came from, but it doesn't make sense. Also, in Python 3 this parameter must be of type bytes, so this call raises an exception.

Even if this call succeeded, the fixing_unicode parameter of the parent's constructor was removed some time ago, so the super().__init__() call would fail too.

I will make a pull request fixing it soon.
Thanks.
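
A minimal sketch of what the corrected constructor could look like, under the two points raised above. EasyXMLStub is a hypothetical stand-in for the parent class (assumed to take only data and encoding), and the opener argument is added here purely so the example runs offline; neither is the library's actual code.

```python
import io
from urllib.request import urlopen


class EasyXMLStub:
    """Hypothetical stand-in for the parent class, taking (data, encoding)."""

    def __init__(self, data, encoding="utf-8"):
        self.data = data
        self.encoding = encoding


class ReadXMLSketch(EasyXMLStub):
    def __init__(self, filename, encoding="utf-8", opener=urlopen):
        # No second positional argument: urlopen's data parameter stays None,
        # so no request body is sent.
        with opener(filename) as response:
            data = response.read()
        # fixing_unicode is no longer forwarded to the parent constructor.
        super().__init__(data, encoding)


# Offline demonstration with a fake opener instead of a real URL.
doc = ReadXMLSketch("ignored", opener=lambda url: io.BytesIO(b"<root/>"))
print(doc.data)
```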

Kegg parser module with non-supported attribute

Hi,

When parsing some KEGG modules, a NotImplementedError is raised; see the example below:

m = 'M00144'
s.parse(kegg_srv.get(m))

Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    dict([(p, s.parse(kegg_srv.get(p))) for p in kegg_pathways])
  File "/Library/Python/2.7/site-packages/bioservices/kegg.py", line 1254, in parse
    raise NotImplementedError("Entry %s not yet implemented" % dbentry)
NotImplementedError: Entry Complex Module not yet implemented

Thanks.

Cheers,

Possible to get GO term definitions?

Is it possible to get the definitions of GO terms with bioservices? I checked out the docs, but could find no GO examples.

BioMart does not have them afaicr.

Currently outsourcing this to the R GO.db package.

biomart bug

Hi!

I try to run the following code:

from bioservices import *
s = BioMart()
datasets = s.databases("ensembl")

I can create a Biomart object with BioMart(), but whenever I try to call a function I get the following error message (in this case databases()):

In python 2.7:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-9687ac9ee893> in <module>()
      1 s = BioMart()
----> 2 datasets = s.databases("ensembl")

/home/nic/anaconda3/envs/py27/lib/python2.7/site-packages/bioservices/biomart.pyc in _get_databases(self)
    394     def _get_databases(self):
    395         if self._databases is None:
--> 396             ret = self.registry()
    397             names = sorted([x.get("database", "?") for x in ret])
    398             self._databases = names[:]

/home/nic/anaconda3/envs/py27/lib/python2.7/site-packages/bioservices/biomart.pyc in registry(self)
    214         """
    215         ret = self.http_get("?type=registry", frmt="xml")
--> 216         ret = self.easyXML(ret)
    217         # the XML contains list of children called MartURLLocation made
    218         # of attributes. We parse the xml to return a list of dictionary.

/home/nic/anaconda3/envs/py27/lib/python2.7/site-packages/bioservices/services.pyc in easyXML(self, res)
    183         """
    184         from bioservices import xmltools
--> 185         return xmltools.easyXML(res)
    186 
    187 

/home/nic/anaconda3/envs/py27/lib/python2.7/site-packages/bioservices/xmltools.pyc in __init__(self, data, encoding)
     77         #    self.data = x.fixed_string.encode("utf-8")
     78         #else:
---> 79         self.data = data[:]
     80 
     81         try:

TypeError: 'int' object has no attribute '__getitem__'

in python 3.5:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-9687ac9ee893> in <module>()
      1 s = BioMart()
----> 2 datasets = s.databases("ensembl")

/home/nic/anaconda3/lib/python3.5/site-packages/bioservices/biomart.py in _get_databases(self)
    394     def _get_databases(self):
    395         if self._databases is None:
--> 396             ret = self.registry()
    397             names = sorted([x.get("database", "?") for x in ret])
    398             self._databases = names[:]

/home/nic/anaconda3/lib/python3.5/site-packages/bioservices/biomart.py in registry(self)
    214         """
    215         ret = self.http_get("?type=registry", frmt="xml")
--> 216         ret = self.easyXML(ret)
    217         # the XML contains list of children called MartURLLocation made
    218         # of attributes. We parse the xml to return a list of dictionary.

/home/nic/anaconda3/lib/python3.5/site-packages/bioservices/services.py in easyXML(self, res)
    183         """
    184         from bioservices import xmltools
--> 185         return xmltools.easyXML(res)
    186 
    187 

/home/nic/anaconda3/lib/python3.5/site-packages/bioservices/xmltools.py in __init__(self, data, encoding)
     77         #    self.data = x.fixed_string.encode("utf-8")
     78         #else:
---> 79         self.data = data[:]
     80 
     81         try:

TypeError: 'int' object is not subscriptable

I couldn't find anything online, is this a bug or am I doing something wrong?

Best,

Nico

Typo in Kegg Tutorial

On the KEGG tutorial page,

k.organism = "hsa"
k.pathwaysIds

should be changed to

k.organism = "hsa"
k.pathwayIds

cache files should be saved in config/bioservices, not locally

Missing SEQUENCE in KEGGParser

When parsing multiple KGML files I get a warning saying SEQUENCE is missing. Everything still works. It happens for KEGG compounds that are peptide sequences. SEQUENCE seems to be missing from the list on line 1374 of kegg.py. The warning suggested reporting it to the repo.

add test and doc for clinvitae

test
doc

BioCyc Interface

It's great to see a multi-service package like this being developed! I found you via this post here

I noticed that there isn't an API for BioCyc yet.

I have an implementation of one here that uses the Web API to allow browsing of BioCyc objects. See the example notebook for an example. Improving the API to seamlessly support a local PathwayTools server is on the someday todo list.

Would you be interested in merging this into the bioservices package? It seems a natural fit.

Version 2

XML is an issue in many places. This is not a bioservices issue but rather a natural consequence of the different choices made by various organizations. The easyXML class is very simple and just parses the XML using beautifulsoup4. The xmltodict package may be very useful for that purpose.
Output as a dictionary or JSON is good, but most of the time people want to do something with it, such as plotting or statistics. Right now, bioservices has hardly any dependencies, but Pandas would be great to have. This means that matplotlib and numpy will be required, but this would be a great add-on.
Finalise the remaining missing packages such as PDB and Ensembl (almost there)
use pandas officially
Wikipathway
- refactoring from WSDL to REST
- use Pandas
- missing functionalities to be implemented (not those with login though)

ensembl

Need to finalise the wrapping

kegg pathway2sif does not work anymore

k.pathway2sif('path:map04010')

TypeError: 'int' object has no attribute '__getitem__'

adding function to fetch information about a reaction.

biomodels contains ascii characters unicode error

>>> from bioservices import BioModels
>>> b = BioModels()
>>> b.getModelSBMLById('MODEL1006230101')
 <repr(<suds.sax.text.Text at 0x5767e90>) failed: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 5171: ordinal not in range(128)>

There seems to be an error, but this message is only shown when we type the variable name, i.e. when its repr is displayed.

For instance, this works without any message:

>>> result = b.getModelSBMLById('MODEL1006230101')

The print function works as well, so this is a repr issue.

This is not important, but one way to fix it is to call encode('utf8') on the output.

add pride service

doc
test
api itself

bioservices.RNASEQ_EBI._get_organism chokes EVERY TIME

Any method (including all of the examples in the documentation) that wants to use "ORGANISM" as a piece of information will choke after a good 15 seconds with an error similar to the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-d2f890b5dff6> in <module>()
      1 r = RNASEQ_EBI()
----> 2 r.organisms

/home/gus/.anaconda/envs/jupyter/lib/python3.5/site-packages/bioservices/rnaseq_ebi.py in _get_organism(self)
    289             res = sorted(list(set(res)))
    290 
--> 291             res.remove("ORGANISM")
    292 
    293             self._organisms = res

ValueError: list.remove(x): x not in list

This is due to the assert organism in self.organisms statement in those methods.
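
The traceback shows list.remove() raising because "ORGANISM" is not in the list. A tolerant variant could look like the sketch below; clean_organisms is a hypothetical helper written for illustration, not the rnaseq_ebi code, and it does not address why the header token is missing in the first place.

```python
def clean_organisms(names, header="ORGANISM"):
    """Return sorted unique names without the header token, if present."""
    unique = set(names)
    # set.discard() does not raise when the header is absent,
    # unlike list.remove().
    unique.discard(header)
    return sorted(unique)


print(clean_organisms(["mus_musculus", "ORGANISM", "homo_sapiens"]))
print(clean_organisms(["mus_musculus", "homo_sapiens"]))
```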

Using UniProt Service in C#

Hi All,

I am unable to import the UniProt service in C# using IronPython.
Please let me know how I can get it done.

Thanks,

ensembl archive section

Version 1.5

Before 2.0, we should finalise pending issues on bioservices to have a final stable version.

Finalise EUtils
Finalise other modules if needed
Get taxon in ENA
lineage using ENA

Example fails

I just installed bioservices and copy pasted the example code in ipython. It didn't work.

 from bioservices import UniProt
 u = UniProt()
 data = u.search("zap70+and+taxonomy:9606", format="tab", limit=3, columns="entry name,length,id, genes")

Result:

TypeError: search() got an unexpected keyword argument 'format'

installation error of bioservices

Hi,
I have Python 2.6 installed on my server (I cannot upgrade it). I am getting the following error message during installation:

Extracting bioservices-1.3.5-py2.6.egg to /home/satya/python/lib/python2.6/site-packages
SyntaxError: ('invalid syntax', ('/home/satya/python/lib/python2.6/site-packages/bioservices-1.3.5-py2.6.egg/bioservices/kegg.py', 1569, 50, " output['dblinks'] = {k[0:-1]:v for k,v in output['dblinks'].items()}\n"))

SyntaxError: ('invalid syntax', ('/home/satya/python/lib/python2.6/site-packages/bioservices-1.3.5-py2.6.egg/bioservices/mapping/mappers.py', 149, 37, " data = {k1:{k2:v2['xkey'] for k2,v2 in self.alldata[k1].iteritems()} for k1 in self.alldata.keys()}\n"))

Please let me know how to fix this.
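
Both errors point at dict comprehensions, a syntax that only exists from Python 2.7 onwards. The same result can be obtained with dict() and a generator expression, which Python 2.6 accepts. The data below is illustrative only, and this is a sketch of the rewrite rather than a patch to the package.

```python
# Illustrative data; keys carry a trailing colon as in the failing line.
output = {"dblinks": {"CAS:": "7704-34-9", "PubChem:": "3387"}}

# Equivalent of {k[0:-1]: v for k, v in ...} without a dict comprehension.
output["dblinks"] = dict((k[0:-1], v) for k, v in output["dblinks"].items())
print(output["dblinks"])
```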

Possible to access a KEGG entry without specifying the associated organism?

I added a detailed question on stackoverflow.

In short:

Let's say I am interested in a certain gene (e.g. b3640) and I do not know to which organism it belongs, is it then possible to get all the information for this gene without specifying the organism?

For example

from bioservices import *
kegg_con = KEGG()
res = kegg_con.get('b3640', parse=True)['NAME']

does not work since the organism is not specified. When it is specified, it all works as intended

kegg_con.get('eco:b3640', parse=True)['NAME']

returns

[u'dut']

When I try to determine the associated organism by using

kegg_con.find('genes', 'b3640')

the desired entry is not found, unfortunately.

So my two questions are therefore:

Is there a way so that I can access the information about a gene just based on its gene ID without specifying the organism it belongs to?
What would be the best way to retrieve the information to which organism the gene belongs? And why does find fail when I search for the E. coli gene?

Uniprot return integer

I'm using bioservices after an automatic BLAST: it searches for the accession number in UniProt and retrieves the available data. I check whether there is a result using if len(res) != 0. This works well, but sometimes I get the following error message: if len(res) != 0: TypeError: object of type 'int' has no len().
But after printing the variable concerned (res), it turns out to be blank.
I hope someone knows why this is happening.
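
A defensive check on the caller's side can avoid the TypeError. The sketch below assumes that an integer result is an HTTP status code returned instead of data when the request fails; that interpretation is an assumption, and has_results is a hypothetical helper.

```python
def has_results(res):
    """Return True only for a non-empty, non-blank text result (sketch)."""
    if res is None:
        return False
    if isinstance(res, int):
        # Assumption: an integer here is an HTTP status code, not data.
        return False
    return bool(str(res).strip())


print(has_results(400))
print(has_results("Entry\tLength"))
```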

1 test in picr test fails

test_getUPIForBLAST

 self.data = data[:]
TypeError: 'int' object has no attribute '__getitem__'

QuickGO Term oboxml format returns HTML format

I recently reopened scripts from a project for my colleagues that I started a few months ago.
I was surprised that the way the format is passed in method calls had changed, but I updated all the affected scripts.
Now I am facing a new problem. When I send a request to get a GO term in OBOXML format, bioservices returns HTML. I didn't see anything about this issue in the changelog.

Example:
Get the OBOXML format for the term GO:0000016

from bioservices import QuickGO as qgo
qgo = qgo()
term = qgo.Term("GO:0000016", frmt="oboxml")
print term

Out of curiosity I checked the source, and I wonder whether the problem could be the parameters (script quickgO.py, line 116, 'frmt' in place of 'format'?)

"Warning. Found keyword SYSNAME, which has not special parsing for now. please report this issue..."

I received the following warnings which included the request to report them:

Warning. Found keyword SYSNAME, which has not special
parsing for now. please report this issue with the KEGG
identifier ( EC 4.2.1.3 Enzyme) into github.com/bioservices. Thanks T.C.

Warning. Found keyword SUBSTRATE, which has not special
parsing for now. please report this issue with the KEGG
identifier ( EC 4.2.1.3 Enzyme) into github.com/bioservices. Thanks T.C.

Warning. Found keyword ALL_REAC, which has not special
parsing for now. please report this issue with the KEGG
identifier ( EC 4.2.1.3 Enzyme) into github.com/bioservices. Thanks T.C.

Warning. Found keyword HISTORY, which has not special
parsing for now. please report this issue with the KEGG
identifier (EC 4.2.1.3 Enzyme) into github.com/bioservices. Thanks T.C.

Here is the link to the entry: kegg.

Found keyword BRACKET, which has not special parsing for now

Found keyword BRACKET, which has not special parsing for
now. please report this issue with the identifier ( C06042 Compound) into github.com/bioservices

use devtools from easydev 0.8.0

ensembl information section

uniprot module cleaning

In the mapping method, we are currently limited in the number of requests, but this is handled by multi_mapping. In fact, we can just merge the two methods and use an HTTP POST request.

test suite takes too long

The test suite takes 15 minutes. We should reduce this by having a subset of tests that runs after each commit and another that runs before a release.

kegg parser improvements

DBLinks entry is not parsed as a dictionary.
set a parse parameter in the get method so that entries can be parsed automatically by default.
add missing INTERACTION/STR_MAP
strip NTSEQ and AASEQ

to be committed soon

caching with suds

A caching with suds is possible so let us try to implement it.

KEGG parse for keyword "SEQUENCE"

Hi,
I get the following warning for a few compounds (eg. C15682, C12045, ...) with parser:

from bioservices import KEGG
k=KEGG()
c = k.get('C15682')
cp = k.parse(c)

Warning. Found keyword SEQUENCE, which has not special
parsing for now. please report this issue with the KEGG
identifier ( C15682 Compound) into github.com/bioservices. Thanks T.C.

add a count on requests to limit number of requests

If one launches too many requests at the same time, one may be blacklisted or have future requests limited.
A counter to limit requests per second would be nice.
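
One possible shape for such a limiter, as a sketch only: it spaces calls evenly rather than counting them, and RequestThrottle is a hypothetical class that is not part of bioservices. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class RequestThrottle:
    """Space successive calls at least 1/max_per_second apart (sketch)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = RequestThrottle(max_per_second=3)
# Call throttle.wait() immediately before each web service request.
```

Calling wait() before every request keeps the rate at or below the chosen limit for a single process; it does not coordinate several processes.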

biomart fails if service down without useful message

Kegg Parser pathway reaction list treated as a dict

When getting and parsing a KEGG pathway, the reaction list is split by newline, which makes it a dictionary instead of a list.

ensembl sequence + variation + overlap + regulation sections

bioservices fails to import in Python 3.3.3 due to gevent

Looks like a useful project for xref genes/proteins!

I can successfully install in Python 2.7.8, but in Python 3.3.3, after pip install bioservices (with some complaining about gevent, but supposed success) on import:

In [1]: import bioservices
File "/home/richard/venv3.3/lib/python3.3/site-packages/gevent/hub.py", line 282
except Exception, ex:
^
SyntaxError: invalid syntax

From google it seems gevent is not Py3k compatible. Is there a forked version I am missing?

add intact complex web services

uniprot tab names to be updated

After discussions with Klemens, it appears that we can add these columns:

[1] comment(x) where x can be any of the comment types, like comment(FUNCTION)
[2] database(y) where y can be any of the cross references, like database(CCDS) or database(InterPro)
[3] lineage-id(z) where z can be any of a number of taxonomic ranks, like lineage-id(PHYLUM) or lineage-id(GENUS); returns taxid, e.g. 2759
[3a] lineage(z), see [3] but returns the name, e.g. Eukaryota
[4] feature(a) where a can be any of the feature keys, like feature(DISULFIDE BOND)
[5] version(entry)
[6] sequence-modified
[7] proteome, returns proteome identifiers
[...]

hgnc tests fail

UniProt API has changed and needs to be updated in the uniprot module

"Interacts with" has been renamed 'interactor'
