biocommons / bioutils
Provides common tools and lookup tables used primarily by the hgvs and uta packages
License: Apache License 2.0
Do not use a mutable object such as a list or dictionary as the default value of an argument. Python’s default arguments are evaluated once, when the function is defined. Mutating a mutable default argument mutates that same object for all future calls to the function as well.
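A minimal sketch of the pitfall and the usual fix. The function names are illustrative and do not come from bioutils.

```python
def append_shared(item, bucket=[]):
    # Anti-pattern: the default list is created once, at definition time,
    # and the same object is reused on every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Common fix: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)   # same list object as `first`
```

Here `first` and `second` refer to one list, whereas each call to `append_fresh` without a `bucket` argument starts from an empty list.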
There are 2 occurrences of this issue in the repository.
See all occurrences on DeepSource → deepsource.io/gh/biocommons/bioutils/issue/PYL-W0102/occurrences/
Current implementations of left_trim and right_trim are O(n^2) due to the array copy operations. This is fine performance-wise (though it wastes CPU cycles) even up to a couple thousand characters, but past that performance drops off rapidly and eventually reaches a point where the code will not return in a practical amount of time.
Array copy:
bioutils/src/bioutils/normalize.py
Line 243 in 468dbd7
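One way to avoid repeated copies is to count the shared prefix and suffix by index and slice each sequence once at the end. The sketch below illustrates that idea only; the helper names and signatures are hypothetical and are not the bioutils left_trim / right_trim API.

```python
def common_prefix_len(seqs):
    """Length of the prefix shared by all sequences, found by index scan."""
    if not seqs:
        return 0
    limit = min(len(s) for s in seqs)
    first = seqs[0]
    n = 0
    while n < limit and all(s[n] == first[n] for s in seqs):
        n += 1
    return n


def common_suffix_len(seqs, skip=0):
    """Length of the shared suffix, never reaching into the first `skip` chars."""
    if not seqs:
        return 0
    limit = min(len(s) for s in seqs) - skip
    first = seqs[0]
    n = 0
    while n < limit and all(s[-1 - n] == first[-1 - n] for s in seqs):
        n += 1
    return n


def trim_both(seqs):
    """Return (left, right, trimmed) with one slice per sequence."""
    left = common_prefix_len(seqs)
    right = common_suffix_len(seqs, skip=left)
    return left, right, [s[left:len(s) - right] for s in seqs]
```

Because no intermediate copies are made while scanning, the work grows linearly with the total sequence length.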
Bioutils only supports up to CHM1_1.1
Hi, could you please add the last two GRCh38 patches? Thanks.
make_ac_name_map("GRCh38.p13")
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/bioutils/_data/assemblies/GRCh38.p13.json.gz'
make_ac_name_map("GRCh38.p14")
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/bioutils/_data/assemblies/GRCh38.p14.json.gz'
On behalf of @andreasprlic:
The bioutils library makes requests to NCBI that are not valid according to their API spec.
https://www.ncbi.nlm.nih.gov/books/NBK25497/
NCBI requires email and tool parameters as part of URLs. These need to get registered with them for accessing eutils.
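As a rough sketch of what a compliant request could look like, the helper below adds tool and email to an efetch URL. The function name and the example values are assumptions for illustration; it is not how bioutils currently builds its URLs.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, ac, tool, email):
    """Build an efetch URL that identifies the calling tool and a contact email."""
    params = {
        "db": db,
        "id": ac,
        "rettype": "fasta",
        "retmode": "text",
        "tool": tool,      # name of the calling software (value is an example)
        "email": email,    # contact address for the tool's maintainer
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("nucleotide", "NM_000551.3", "bioutils", "dev@example.org")
```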
biocommons/hgvs#595 identified that degenerate codons are not supported by the translate function.
For hgvs, this is a regression relative to prior versions that used biopython.
While this is being fixed, consider moving the translation tables to _data (i.e., here) to be accessed the same way that assemblies and cytobands are.
NCBI apparently made a subtle change to eutilities that breaks sequence fetching by seqfetcher. Previously, fetching NP sequences from nucleotide worked fine; it stopped working around Feb 7, 2017.
This change will break hgvs validation for folks not using seqrepo, so the fix is urgent.
Using the literal syntax can give minor performance bumps compared to using function calls to create dict, list and tuple.

```bash
In [1]: timeit.timeit(stmt="dict()", number=100000000)
Out[1]: 9.560388602000103
In [2]: timeit.timeit(stmt="{}", number=100000000)
Out[2]: 1.685333584000091
In [3]: timeit.timeit(stmt="tuple()", number=100000000)
Out[3]: 4.509182139000131
In [4]: timeit.timeit(stmt="()", number=100000000)
Out[4]: 0.5455615430000762
In [5]: timeit.timeit(stmt="list()", number=100000000)
…
```
There is 1 occurrence of this issue in the repository.
See all occurrences on DeepSource → deepsource.io/gh/biocommons/bioutils/issue/PTC-W0019/occurrences/
Same problem as in several of our projects. To fix, we need to do the same as here (mark optional dependencies, move dependencies into pyproject.toml).
Can replicate with docker file:
FROM python:3.7
COPY . /app
WORKDIR /app
RUN pip install '.[dev,test]'
RUN make test
Output:
[...]
> [5/5] RUN make test:
#7 0.343 pytest
#7 1.021 ImportError while loading conftest '/app/tests/conftest.py'.
#7 1.021 tests/conftest.py:5: in <module>
#7 1.021 import vcr
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/__init__.py:2: in <module>
#7 1.021 from .config import VCR
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/config.py:11: in <module>
#7 1.021 from .cassette import Cassette
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/cassette.py:12: in <module>
#7 1.021 from .patch import CassettePatcherBuilder
#7 1.021 /usr/local/lib/python3.7/site-packages/vcr/patch.py:41: in <module>
#7 1.021 _VerifiedHTTPSConnection = cpool.VerifiedHTTPSConnection
#7 1.021 E AttributeError: module 'urllib3.connectionpool' has no attribute 'VerifiedHTTPSConnection'
#7 1.063 make: *** [Makefile:71: test] Error 4
------
executor failed running [/bin/sh -c make test]: exit code: 2
The vcrpy version is not specified in setup.cfg.
PR #30 added basic support for codons so that any codon with an ambiguity code translated as X (the wildcard AA). However, it's often possible to translate codons with ambiguity codes where the ambiguity is irrelevant to the outcome. For example, in a standard translation table, CUN ⇒ Leu, GCN ⇒ Ala, GGN ⇒ Gly, AAY ⇒ Asn, etc.
This issue is to provide fuller support for ambiguity codes. Ideally, the solution will work for any translation table.
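One possible approach, sketched here under assumptions, is to expand each IUPAC ambiguity code into its concrete bases, translate every resulting codon, and return the amino acid only when all expansions agree. The names below are illustrative, not the bioutils API; the table is the standard genetic code, and any other table mapping the 64 codons could be passed in.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, table=STANDARD_TABLE):
    """Translate one codon; return 'X' only if the ambiguity changes the result."""
    codon = codon.upper().replace("U", "T")
    if len(codon) != 3 or any(b not in IUPAC for b in codon):
        raise ValueError("not a valid codon: %r" % codon)
    options = [IUPAC[b] for b in codon]
    aas = {table[a + b + c] for a, b, c in product(*options)}
    return aas.pop() if len(aas) == 1 else "X"
```

With the standard table this resolves the examples above (CUN, GCN, GGN, AAY) to a single amino acid, while a codon such as AAN, which could be Asn or Lys, still translates as X.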
... was supposed to have been refactored to use vmc_digest
We've accumulated test warnings, mostly related to deprecations in pytest. Fix these.
Consider:
def truncated_digest(blob, len) → Digest
And shorthands:
def td24(blob) → Digest
def td24x → hex string
def td24u → b64u string
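A minimal sketch of how these signatures might be filled in. It assumes SHA-512 as the underlying hash and returns plain bytes rather than a dedicated Digest type; neither choice is stated in the proposal above.

```python
import base64
import hashlib


def truncated_digest(blob, length):
    """Return the first `length` bytes of the SHA-512 digest of `blob` (bytes)."""
    # Assumption: SHA-512 is the hash being truncated.
    return hashlib.sha512(blob).digest()[:length]


def td24(blob):
    """24-byte truncated digest."""
    return truncated_digest(blob, 24)


def td24x(blob):
    """24-byte truncated digest as a hex string (48 characters)."""
    return td24(blob).hex()


def td24u(blob):
    """24-byte truncated digest as a URL-safe base64 string (32 characters)."""
    return base64.urlsafe_b64encode(td24(blob)).decode("ascii")
```

Truncating to 24 bytes is convenient for the base64 form because 24 is a multiple of 3, so the encoded string needs no padding.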
Utils will require an api_key starting May 1, 2018.
We should support adding an api_key as an environment variable.
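A small sketch of reading the key from the environment and attaching it to the request parameters. The environment variable name NCBI_API_KEY and the helper name are assumptions made for illustration.

```python
import os


def with_api_key(params, env=os.environ, var="NCBI_API_KEY"):
    """Return a copy of `params` with api_key added when the variable is set."""
    out = dict(params)
    key = env.get(var)      # variable name is an assumption, not a bioutils setting
    if key:
        out["api_key"] = key
    return out
```

Passing `env` explicitly keeps the helper easy to test; callers would normally rely on the default, `os.environ`.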
bioutils now has a proper docs directory. Nearly all of the docs are actually pulled from the docstrings in the source files. Results are automatically built at https://bioutils.readthedocs.io.
To build locally, type make -C docs html, then open docs/build/html/index.html.
To write/improve documentation, do the following:
1. Set up a development environment; make devready ought to do it.
2. Build the docs with make -C docs html.
3. Edit the docstrings, running make -C html to rebuild as necessary.
4. Commit with a message like "#22: Added docs for normalize.py". (The #22 refers to this issue and GitHub will automatically create a link to it.)
5. git push
N.B. bioutils.readthedocs.io won't be rebuilt until your PR is merged, so don't expect that to update immediately.
Thanks!
Hi Reece,
I haven't seen a release of bioutils since about a year ago. What is needed to get a new release out? Let me know if I can help with that.
Thanks,
Andreas
e.g., where_from("NM_01234.5") → "RefSeq"
To be used when inferring namespace on unqualified accessions.
See #8 for details
The Ensembl sequence API only supports transcripts, not transcript.version, and returns the latest transcript version sequence
Example:
from bioutils.seqfetcher import fetch_seq
fetch_seq("ENST00000543872.6")
throws exception:
RuntimeError: Failed to fetch ENST00000543872.6 (400 Client Error: Bad Request for url: http://rest.ensembl.org/sequence/id/ENST00000543872.6)
I will link a pull request that fixes this by stripping the version before calling the API, then checking whether the version in the response matches.
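The strip-then-verify idea can be sketched as below. This is not the code from the pull request; the helper names are hypothetical, and it assumes the JSON response carries "version" and "seq" fields.

```python
def split_version(ac):
    """Split 'ENST00000543872.6' into ('ENST00000543872', 6); version may be None."""
    base, sep, version = ac.partition(".")
    return base, (int(version) if sep else None)


def check_version(ac, response):
    """Return the sequence if the response version matches the requested one."""
    base, wanted = split_version(ac)
    got = response.get("version")   # assumed field name in the JSON response
    if wanted is not None and got is not None and int(got) != wanted:
        raise RuntimeError(
            "Requested %s but the latest version of %s is %s" % (ac, base, got)
        )
    return response["seq"]          # assumed field name in the JSON response
```

The caller would query the API with the unversioned accession from `split_version`, then pass the parsed JSON to `check_version`.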
Implement normalization with the following arguments:
Returns:
See ga4gh/vrs-python#16 and ga4gh/vrs-python#17.
A very small error in the setup.cfg file, introduced in version 0.5.0, is causing wheel builds to fail.
Here is the setup.cfg change to make:
-license-file = LICENSE
+license-file = LICENSE.txt
pip works around this issue for installations from the command-line, but it causes a problem for build chains relying on the wheel build (it breaks in AWS SAM using a docker container).
The problem can be reproduced with this command:
pip wheel --wheel-dir wheels/bioutils/ bioutils==0.5.0
While the wheel build for the previous version works fine:
pip wheel --wheel-dir wheels/bioutils/ bioutils==0.4.4
description is empty at https://pypi.org/project/bioutils/ :-(
Update README if necessary and use that (via long-description = file: README.rst, IIRC)
recordtype is 6 years old and unmaintained.
It's started throwing false alarm errors like this:
Searching for recordtype
Reading https://pypi.python.org/simple/recordtype/
Downloading https://pypi.python.org/packages/cc/1c/7ff90f4379110d6ef92a7f44ce487f235dbb3243f17c5294a73e0156b6f4/recordtype-1.1.tar.gz#md5=8133256b9c62baa2019ec16db3b14115
Best match: recordtype 1.1
Processing recordtype-1.1.tar.gz
Writing /tmp/easy_install-_oii6rik/recordtype-1.1/setup.cfg
Running recordtype-1.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-_oii6rik/recordtype-1.1/egg-dist-tmp-f50gmf3h
File "build/bdist.linux-x86_64/egg/recordtype.py", line 250
exec template in namespace
^
SyntaxError: Missing parentheses in call to 'exec'
zip_safe flag not set; analyzing archive contents...
Moving recordtype-1.1-py3.5.egg to /home/reece/projects/biocommons/bioutils/.eggs
Installed /home/reece/projects/biocommons/bioutils/.eggs/recordtype-1.1-py3.5.egg
I have no idea why it fails and then succeeds, but I don't care to work it out. Migrate to attrs, which is better anyway.
bioutils administrative code is pretty old and not up-to-date with other biocommons tools. Reconcile with setup.cfg, pyproject.toml, etc from biocommons.example.
NCBI returns HTTP status 429 and {"error":"API rate limit exceeded","api-key":"157.131.198.215","count":"5","limit":"3"} when the rate limit is exceeded.
Implement retries when this error is received.
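A hedged sketch of one retry strategy: call again after an exponentially growing delay whenever the status is 429. The function name, the `fetch` callable returning a (status, body) pair, and the default delays are all assumptions for illustration.

```python
import time


def fetch_with_retries(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a status other than 429 or tries run out."""
    for attempt in range(max_tries):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_tries - 1:
            # Back off: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("rate limit still exceeded after %d tries" % max_tries)
```

Injecting `sleep` makes the backoff schedule testable without waiting; a real caller would wrap its HTTP request in `fetch`.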
See also biocommons/eutils#131, which would use eutils timing and caching.
Identifiers.org contains regexps associated with identifier syntax. For example, http://identifiers.org/insdc shows:
field | value |
---|---|
Recommended name | Nucleotide Sequence Database |
Alternative name(s) | International Nucleotide Sequence Database Collaboration; INSDC; NCBI nucleotide; GenBank |
Description | The International Nucleotide Sequence Database Collaboration (INSDC) consists of a joint effort to collect and disseminate databases containing DNA and RNA sequences. |
Identifier pattern | ^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(.\d+)?$ |
Registry identifier | MIR:00000029 |
Namespace | insdc |
URI | http://identifiers.org/insdc/ |
Goal: implement functions to infer namespace from a given accession based on regexp matches.
Records obtained from identifiers.org should be general enough to enable implementing CURIEs (using the namespace) and resolvers (using the uri).
The registry is available at https://identifiers.org/service/registryxml
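The goal can be sketched as a lookup over compiled patterns. The insdc pattern is copied from the registry entry shown above; the refseq pattern is a simplified, illustrative assumption and is not taken from identifiers.org. Function names are hypothetical.

```python
import re

PATTERNS = {
    # Pattern as shown in the identifiers.org record for insdc.
    "insdc": re.compile(
        r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8}|[A-J][A-Z]{2}\d{5})(.\d+)?$"
    ),
    # Illustrative assumption only: a loose RefSeq-like pattern.
    "refseq": re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$"),
}


def infer_namespaces(ac):
    """Return every namespace whose pattern matches the accession."""
    return [ns for ns, rx in PATTERNS.items() if rx.match(ac)]


def to_curie(ac):
    """Return 'namespace:accession' when exactly one namespace matches, else None."""
    matches = infer_namespaces(ac)
    return "%s:%s" % (matches[0], ac) if len(matches) == 1 else None
```

Returning a list rather than a single namespace leaves room for accessions that match more than one registry pattern.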
bioutils currently does not support selenoproteins. It would be great to add selenocysteine to the codon translation table (technically it is already there, but maps to ""). We probably want a different translation table that maps the stop codon to SEC / U. Also add support for the alternate translation table as an option to translate_cds.
Finally I don't think that sequences.py has any test coverage. Add some bare bones unit tests for the newly added features.
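A rough sketch of the alternate-table idea, assuming the stop codon in question is TGA (UGA), the codon that encodes selenocysteine in selenoproteins. The function below is a hypothetical stand-in, not the existing translate_cds, and it recodes every TGA, which is a simplification.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}

# Alternate table: identical except that TGA maps to U (selenocysteine).
SEC_TABLE = dict(STANDARD_TABLE, TGA="U")


def translate_cds_sketch(seq, table=STANDARD_TABLE):
    """Translate complete codons of `seq` with the given table (trailing bases ignored)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(table[seq[i:i + 3]] for i in range(0, usable, 3))
```

Simple unit tests for such a function could compare the same coding sequence translated with each table.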
I removed coercion of sequences to unicode in digests.py. Unsurprisingly, code that doesn't pass unicode now fails. Unfortunately, that includes seqrepo tests.
So, bring coercion back until this can be done more thoughtfully. (It would be good to indicate when sequences are being coerced and warn callers.)