biocommons / bioutils

provides common tools and lookup tables used primarily by the hgvs and uta packages

License: Apache License 2.0

Makefile 4.78% Python 94.24% Perl 0.99%

bioinformatics genome-analysis genomics sequencing variant-analysis variation

bioutils's Introduction

bioutils -- bioinformatics utilities and lookup tables

bioutils provides some common utilities and lookup tables for bioinformatics.

bioutils.accessions -- parse accessions, infer namespaces
bioutils.assemblies -- Human assembly information (from NCBI/GRCh)
bioutils.cytobands -- map cytobands to coordinates (from UCSC cytoband tables)
bioutils.digests -- implementations of various digests
bioutils.normalize -- allele normalization (left shuffle, right shuffle, expanded, vcf)

To use an E-Utilities API key, store it in an environment variable called ncbi_api_key; it will then be sent with each E-Utilities request.

bioutils's People

Contributors

Stargazers

Watchers

Forkers

andreasprlic invitae naegelyd afrubin deena-b ispashayev dslituiev agopez ecalifornica trentwatt kyuhas pjcoenen gomoto arpitjain799 theferrit32 mihaitodor nickzoic

bioutils's Issues

revert `vmc_digest` back to `truncated_digest`

Consider:

def truncated_digest(blob, len) → Digest

And shorthands:
def td24(blob) → Digest
def td24x → hex string
def td24u → b64u string
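The proposal above could be sketched as follows. This is a minimal illustration, not the bioutils implementation: the choice of SHA-512 as the underlying hash, the default length of 24 bytes, and returning raw bytes as the "Digest" are all assumptions made for the example.

```python
import base64
import hashlib


def truncated_digest(blob: bytes, length: int = 24) -> bytes:
    """Return the first `length` bytes of a digest of `blob`.

    Assumption: SHA-512 is the underlying hash; the issue does not say.
    """
    return hashlib.sha512(blob).digest()[:length]


def td24(blob: bytes) -> bytes:
    """24-byte truncated digest as raw bytes."""
    return truncated_digest(blob, 24)


def td24x(blob: bytes) -> str:
    """24-byte truncated digest as a hex string (48 characters)."""
    return td24(blob).hex()


def td24u(blob: bytes) -> str:
    """24-byte truncated digest as a base64url string (32 characters)."""
    return base64.urlsafe_b64encode(td24(blob)).decode("ascii")
```

Because 24 is divisible by 3, the base64url form needs no "=" padding.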

Tests do not run when package is freshly installed

Can replicate with docker file:

FROM python:3.7

COPY . /app
WORKDIR /app
RUN pip install '.[dev,test]'
RUN make test

Output:

[...]
 > [5/5] RUN make test:
#7 0.343 pytest
#7 1.021 ImportError while loading conftest '/app/tests/conftest.py'.
#7 1.021 tests/conftest.py:5: in <module>
#7 1.021     import vcr
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/__init__.py:2: in <module>
#7 1.021     from .config import VCR
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/config.py:11: in <module>
#7 1.021     from .cassette import Cassette
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/cassette.py:12: in <module>
#7 1.021     from .patch import CassettePatcherBuilder
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/patch.py:41: in <module>
#7 1.021     _VerifiedHTTPSConnection = cpool.VerifiedHTTPSConnection
#7 1.021 E   AttributeError: module 'urllib3.connectionpool' has no attribute 'VerifiedHTTPSConnection'
#7 1.063 make: *** [Makefile:71: test] Error 4
------
executor failed running [/bin/sh -c make test]: exit code: 2

vcrpy version is not specified in setup.cfg

bioutils.normalize.normalize performance degrades for large sequences

Current implementations of left_trim and right_trim are O(n²) due to the array copy operations. This is fine performance-wise (though wasting CPU cycles) even up to a couple thousand characters, but past that it drops off rapidly and eventually reaches a point where the code will not return in a practical amount of time.

Array copy:

bioutils/src/bioutils/normalize.py

Line 243 in 468dbd7

alleles = [a[1:] for a in alleles]
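One way to avoid the repeated copies is to count the shared prefix by index first and slice each allele only once. The sketch below is hypothetical (the name trim_left_linear is not part of bioutils) and assumes the function should return the number of trimmed characters together with the trimmed alleles.

```python
def trim_left_linear(alleles):
    """Strip the prefix shared by all alleles, slicing each allele once.

    Returns (number_of_characters_trimmed, trimmed_alleles).
    Hypothetical helper for illustration; not the bioutils function.
    """
    if not alleles:
        return 0, []
    limit = min(len(a) for a in alleles)  # cannot trim past the shortest allele
    first = alleles[0]
    n = 0
    # Advance an index instead of rebuilding the allele list on every step.
    while n < limit and all(a[n] == first[n] for a in alleles):
        n += 1
    return n, [a[n:] for a in alleles]
```

The same idea applies to the right side by scanning from the end with negative indexes.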

synchronize biocommons.example with bioutils

bioutils administrative code is pretty old and not up-to-date with other biocommons tools. Reconcile with setup.cfg, pyproject.toml, etc from biocommons.example.

Support degenerate codons in translate table

biocommons/hgvs#595 identified that degenerate codons are not supported by the translate function.

For hgvs, this is a regression relative to prior versions that used biopython.

While this is being fixed, consider moving the translation tables to _data (i.e., here) to be accessed the same way that assemblies and cytobands are.

Remove support for Python 2.7

seqhash disappeared from bioutils.digests

... was supposed to have been refactored to use vmc_digest

(PYL-W0102) Dangerous default argument

Description

Do not use a mutable like list or dictionary as a default value to an argument. Python’s default arguments are evaluated once when the function is defined. Using a mutable default argument and mutating it will mutate that object for all future calls to the function as well.
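The pitfall and the usual fix can be shown with two small functions. Both names are made up for this illustration and do not come from bioutils.

```python
def append_shared(item, bucket=[]):
    """Problematic: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    """Safer: use None as a sentinel and create a new list on each call."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Calling append_shared twice without a second argument returns the same list both times, so the second result also contains the first item; append_fresh returns a new one-element list on each call.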

Occurrences

There are 2 occurrences of this issue in the repository.

See all occurrences on DeepSource → deepsource.io/gh/biocommons/bioutils/issue/PYL-W0102/occurrences/

Add T2T-CHM13v2.0

Bioutils only supports up to CHM1_1.1

https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4

seqrepo tests fail due to removal of unicode coercion

I removed coercion of sequences to unicode in digests.py. Unsurprisingly, code that doesn't pass unicode now fails. Unfortunately, that includes seqrepo tests.

So, bring coercion back until this can be done more thoughtfully. (It would be good to indicate when sequences are being coerced and warn callers.)

Improve support for degenerate codons

PR #30 added basic support for codons so that any codon with an ambiguity code translated as X (the wildcard AA). However, it's often possible to translate codons with ambiguity codes where the ambiguity is irrelevant to the outcome. For example, in a standard translation table, CUN ⇒ Leu, GCN ⇒ Ala, GGN ⇒ Gly, AAY ⇒ Asn, etc.

This issue is to provide fuller support for ambiguity codes. Ideally, the solution will work for any translation table.
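A possible approach, sketched here under stated assumptions, is to expand each ambiguity code into its concrete bases, translate every resulting codon, and report the amino acid only when all expansions agree. The table below is the standard genetic code written in the usual TCAG ordering; the function name translate_codon and the use of "X" for unresolved codons are choices made for this example, not the bioutils API.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, table=STANDARD_TABLE):
    """Translate one codon, resolving ambiguity codes where the outcome is unique.

    Returns "X" when the possible codons encode more than one amino acid.
    Any table mapping the 64 DNA codons to one-letter codes can be passed in.
    """
    codon = codon.upper().replace("U", "T")  # accept RNA as well as DNA
    expansions = product(*(IUPAC[base] for base in codon))
    amino_acids = {table["".join(c)] for c in expansions}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"
```

With the standard table, CUN, GCN, GGN and AAY resolve to L, A, G and N respectively, while AAN stays X because it covers both Asn and Lys.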

Make seqfetcher use an appropriate tool and email, and register them

See #8 for details

Add GRCh38p13 and GRCh38p14

Hi, could you please add the last two GRCh38 patches? Thanks.

GRCh38.p13

make_ac_name_map("GRCh38.p13")
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/bioutils/_data/assemblies/GRCh38.p13.json.gz'

GRCh38.p14

make_ac_name_map("GRCh38.p14")
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/bioutils/_data/assemblies/GRCh38.p14.json.gz'

Implement flexible sequence normalization

Implement normalization with the following arguments:

alleles[]: array of sequence strings
interval: location of alleles
bounds: maximal extent of normalization left and right (for intron or other barriers)
sequence_fetcher: callback to fetch sequence context
mode: shuffle left (vcf), shuffle right (hgvs), extend (voca)
consider: anchor: 0 (# of bases left and right)

Returns:

new interval
normalized alleles
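As a rough sketch of the "shuffle left (vcf)" mode only, the function below takes the reference as a plain string instead of a sequence_fetcher callback, uses 0-based interbase coordinates, and expects the reference allele first in the list. These are all assumptions for illustration; the function name shuffle_left and its signature are hypothetical.

```python
def shuffle_left(seq, start, end, alleles, left_bound=0):
    """Left-shuffle alleles VCF-style and return (start, end, alleles).

    seq        -- reference sequence as a string (stand-in for a fetcher)
    start, end -- 0-based interbase interval; alleles[0] must equal seq[start:end]
    left_bound -- leftmost position the interval may be extended to
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Drop the last base when every allele is non-empty and ends the same way.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            end -= 1
            changed = True
        # If an allele became empty, pull in one more base of left context.
        if any(not a for a in alleles) and start > left_bound:
            start -= 1
            alleles = [seq[start] + a for a in alleles]
            changed = True
    # Remove shared leading bases, keeping at least one base per allele.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        start += 1
    return start, end, alleles
```

For example, deleting the last CAG repeat of TCAGCAGCAGT (interval 7 to 10) shuffles to interval 0 to 4 with alleles TCAG and T; with left_bound=4 the shuffle stops at interval 4 to 7.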

See ga4gh/vrs-python#16 and ga4gh/vrs-python#17.

migrate from recordtype to attrs

recordtype is 6 years old and unmaintained.
It's started throwing false alarm errors like this:

Searching for recordtype
Reading https://pypi.python.org/simple/recordtype/
Downloading https://pypi.python.org/packages/cc/1c/7ff90f4379110d6ef92a7f44ce487f235dbb3243f17c5294a73e0156b6f4/recordtype-1.1.tar.gz#md5=8133256b9c62baa2019ec16db3b14115
Best match: recordtype 1.1
Processing recordtype-1.1.tar.gz
Writing /tmp/easy_install-_oii6rik/recordtype-1.1/setup.cfg
Running recordtype-1.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-_oii6rik/recordtype-1.1/egg-dist-tmp-f50gmf3h
  File "build/bdist.linux-x86_64/egg/recordtype.py", line 250
    exec template in namespace
                ^
SyntaxError: Missing parentheses in call to 'exec'

zip_safe flag not set; analyzing archive contents...
Moving recordtype-1.1-py3.5.egg to /home/reece/projects/biocommons/bioutils/.eggs

Installed /home/reece/projects/biocommons/bioutils/.eggs/recordtype-1.1-py3.5.egg

I have no idea why it fails and then succeeds, but I don't care to work it out. Migrate to attrs, which is better anyway.

Write/improve documentation

bioutils now has a proper docs directory. Nearly all of the docs are actually pulled from the docstrings in the source files. Results are automatically built at https://bioutils.readthedocs.io.

To build locally type make -C docs html, then open docs/build/html/index.html.

To write/improve documentation, do the following:

Set up a linux VM
Fork and clone this repo
Set up environment. make devready ought to do it.
Demonstrate that you can build with make -C docs html
Add a comment to this issue saying that you're working on docs for a specific file
Make changes, using make -C docs html to rebuild as necessary.
Commit with a message like #22: Added docs for normalize.py (The #22 refers to this issue and GitHub will automatically create a link to it.)
git push
Submit a PR at github

N.B. bioutils.readthedocs.io won't be rebuilt until your PR is merged, so don't expect that to update immediately.

Thanks!

test issue

(PTC-W0019) Consider using literal syntax to create the data structure

Description

Using the literal syntax can give minor performance bumps compared to using function calls to create dict, list and tuple.

In [1]: timeit.timeit(stmt="dict()", number=100000000)
Out[1]: 9.560388602000103
In [2]: timeit.timeit(stmt="{}", number=100000000)
Out[2]: 1.685333584000091
In [3]: timeit.timeit(stmt="tuple()", number=100000000)
Out[3]: 4.509182139000131
In [4]: timeit.timeit(stmt="()", number=100000000)
Out[4]: 0.5455615430000762
In [5]: timeit.timeit(stmt="list()", number=100000000)
…

Occurrences

There is 1 occurrence of this issue in the repository.

See all occurrences on DeepSource → deepsource.io/gh/biocommons/bioutils/issue/PTC-W0019/occurrences/

seqfetcher doesn't support Ensembl transcript versions

The Ensembl sequence API accepts only unversioned transcript IDs, not transcript.version, and returns the sequence for the latest transcript version.

Example:

from bioutils.seqfetcher import fetch_seq
fetch_seq("ENST00000543872.6")

throws exception:

RuntimeError: Failed to fetch ENST00000543872.6 (400 Client Error: Bad Request for url: http://rest.ensembl.org/sequence/id/ENST00000543872.6)

I will link a pull request that fixes this by stripping the version before calling the API, then checking whether the version in the response matches.
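The strip-then-verify idea might look like the sketch below. It is not the linked pull request: the helper names are hypothetical, and it assumes the parsed JSON response is a dict with "version" and "seq" keys.

```python
import re


def split_version(accession):
    """Split 'ENST00000543872.6' into ('ENST00000543872', 6).

    The version is None when the accession has no '.N' suffix.
    """
    match = re.fullmatch(r"([^.]+)(?:\.(\d+))?", accession)
    if match is None:
        raise ValueError(f"Unrecognized accession: {accession!r}")
    base, version = match.groups()
    return base, int(version) if version else None


def sequence_if_version_matches(accession, response):
    """Return the sequence from `response` if its version matches the request.

    Assumption: `response` is the decoded JSON dict and exposes
    'version' and 'seq' keys.
    """
    _, wanted = split_version(accession)
    got = response.get("version")
    if wanted is not None and got is not None and int(got) != wanted:
        raise RuntimeError(
            f"Requested {accession} but the service returned version {got}"
        )
    return response["seq"]
```

The request itself would then be made with the unversioned ID returned by split_version.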

Email address and tool name missing from seqfetcher eutils call

On behalf of @andreasprlic:

The bioutils library makes requests to NCBI that are not valid according to their API spec.

https://www.ncbi.nlm.nih.gov/books/NBK25497/

NCBI requires email and tool parameters as part of URLs. These need to get registered with them for accessing eutils.
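A request URL that carries these parameters could be assembled as in the sketch below. The function name, the default database, and the example tool and email values are assumptions for illustration; the api_key lookup reuses the ncbi_api_key environment variable mentioned in the introduction.

```python
import os
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, tool, email, db="nuccore"):
    """Build an efetch URL that identifies the calling tool and its contact email.

    `tool` and `email` should be the values registered with NCBI.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "tool": tool,
        "email": email,
    }
    # Add the API key only when the environment variable is set.
    api_key = os.environ.get("ncbi_api_key")
    if api_key:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)
```

A call such as build_efetch_url("NM_000551.3", tool="mytool", email="dev@example.org") yields a URL whose query string includes the tool and the percent-encoded email.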

Add support for selenoproteins

bioutils currently does not support selenoproteins. It would be great to add selenocysteine to the codon translation table (technically it is already there, but maps to ""). We probably want a different translation table that maps the stop codon to SEC / U. Also add support for the alternate translation table as an option to translate_cds.

Finally, I don't think that sequences.py has any test coverage. Add some bare-bones unit tests for the newly added features.

Add function to infer origin from accession

e.g., where_from("NM_01234.5") → "RefSeq"

To be used when inferring namespace on unqualified accessions.

Release last 2.7 version with pinned dependencies

Implement origin inference using identifiers.org information

Identifiers.org contains regexps associated with identifier syntax. For example, http://identifiers.org/insdc shows:

field	value
Recommended name	Nucleotide Sequence Database
Alternative name(s)	International Nucleotide Sequence Database Collaboration; INSDC; NCBI nucleotide; GenBank
Description	The International Nucleotide Sequence Database Collaboration (INSDC) consists of a joint effort to collect and disseminate databases containing DNA and RNA sequences.
Identifier pattern	^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(.\d+)?$
Registry identifier	MIR:00000029
Namespace	insdc
URI	http://identifiers.org/insdc/

Goal: implement functions to infer namespace from a given accession based on regexp matches.

Records obtained from identifiers.org should be general enough to enable implementing CURIEs (using the namespace) and resolvers (using the uri).

The registry is available at https://identifiers.org/service/registryxml
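Regexp-based inference over such records could be sketched as follows. The record class and function names are hypothetical, the records are hard-coded instead of being loaded from the registry, and the refseq pattern is a simplified stand-in written for this example; only the insdc pattern and URI come from the table above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRecord:
    """Minimal registry entry: enough for namespace inference, CURIEs and URIs."""
    namespace: str
    pattern: str
    uri: str


REGISTRY = [
    # Pattern and URI as shown for insdc above.
    RegistryRecord(
        "insdc",
        r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(.\d+)?$",
        "http://identifiers.org/insdc/",
    ),
    # Assumption: simplified illustrative pattern, not the registry's own.
    RegistryRecord(
        "refseq",
        r"^(NC|NG|NM|NP|NR|XM|XP|XR)_\d+(\.\d+)?$",
        "http://identifiers.org/refseq/",
    ),
]


def infer_namespaces(accession, registry=REGISTRY):
    """Return the namespaces whose identifier pattern matches the accession."""
    return [r.namespace for r in registry if re.match(r.pattern, accession)]


def to_curie(accession, registry=REGISTRY):
    """Return 'namespace:accession' when exactly one namespace matches, else None."""
    matches = infer_namespaces(accession, registry)
    return f"{matches[0]}:{accession}" if len(matches) == 1 else None
```

Returning a list from infer_namespaces leaves room for accessions that match more than one pattern, which the caller can then treat as ambiguous.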

release latest main branch

Hi Reece,

I haven't seen a release of bioutils in about a year. What is needed to get a new release out? Let me know if I can help with that.

Thanks,
Andreas

Fix testing warnings

We've accumulated test warnings, mostly related to deprecations in pytest. Fix these.

seqfetcher fails with protein accessions

NCBI apparently made a subtle change to eutilities that breaks sequence fetching by seqfetcher. Previously, fetching NP sequences from nucleotide worked fine; it stopped working around Feb 7, 2017.

This change will break hgvs validation for folks not using seqrepo, so the fix is urgent.

build broken

same problem as several of our projects. To fix, need to do the same as here (mark optional dependencies, move dependencies into pyproject.toml)

wheel build fails

A very small error in the setup.cfg file, introduced in version 0.5.0, is causing wheel builds to fail.

Here is the setup.cfg change to make:
-license-file = LICENSE
+license-file = LICENSE.txt

pip works around this issue for installations from the command-line, but it causes a problem for build chains relying on the wheel build (it breaks in AWS SAM using a docker container).

The problem can be reproduced with this command:
pip wheel --wheel-dir wheels/bioutils/ bioutils==0.5.0

While the wheel build for the previous version works fine:
pip wheel --wheel-dir wheels/bioutils/ bioutils==0.4.4

See also biocommons/eutils#131, which would use eutils timing and caching.

Add pypi project description

description is empty at https://pypi.org/project/bioutils/ :-(
Update README if necessary and use that (via long-description = file: README.rst, IIRC)
