ebivariation / contig-alias
Service to provide synonyms of chromosome/contig identifiers
License: Apache License 2.0
It's been OK until now to have both APIs together, but if we plan to start asking users to test the contig alias, it would be clearer for them if we hide the admin endpoints.
In case connecting directly to FTPs is not allowed, but there's an available proxy, we could change our FTP browser to use the proxy.
from https://commons.apache.org/proper/commons-net/apidocs/org/apache/commons/net/SocketClient.html :
setSocketFactory method, which allows you to control the type of Socket the SocketClient creates for initiating network connections. This is especially useful for adding SSL or proxy support
Note that the class we use from apache commons FTPClient inherits from SocketClient.
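A minimal sketch of the proxy idea, using only JDK classes: a SocketFactory whose sockets are all routed through a java.net.Proxy. The class name ProxiedSocketFactory and the proxy host and port are hypothetical, not taken from the project. An instance could then be passed to the setSocketFactory method quoted above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical factory: every socket it creates is routed through the given proxy. */
class ProxiedSocketFactory extends SocketFactory {

    private final Proxy proxy;

    ProxiedSocketFactory(Proxy proxy) {
        this.proxy = proxy;
    }

    @Override
    public Socket createSocket() {
        // Unconnected socket that will tunnel through the proxy once connected.
        return new Socket(proxy);
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = createSocket();
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        Socket socket = createSocket();
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        Socket socket = createSocket();
        socket.bind(new InetSocketAddress(localHost, localPort));
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(InetAddress host, int port, InetAddress localHost, int localPort)
            throws IOException {
        Socket socket = createSocket();
        socket.bind(new InetSocketAddress(localHost, localPort));
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }
}
```

Usage would look roughly like ftpClient.setSocketFactory(new ProxiedSocketFactory(new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved("proxy.example.org", 1080)))), where the proxy address is a placeholder. Recent commons-net versions also offer SocketClient.setProxy(Proxy), which may be enough on its own.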
Tasks:
Using H2 has allowed us to quickstart the project but we are starting to face issues by not having a real DB.
I think the easiest way is to put in the application properties some fields with maven properties that will be replaced by the selected maven profile. In other projects we define it like:
spring.datasource.url=@contig-alias-dbUrl@
spring.datasource.username=@contig-alias-dbUsername@
spring.datasource.password=@contig-alias-dbPassword@
spring.jpa.hibernate.ddl-auto=@contig-alias-ddlBehaviour@
spring.jpa.database-platform=org.hibernate.dialect.PostgreSQLDialect
spring.jpa.generate-ddl=true
where the url is a JDBC URL like jdbc:postgresql://localhost:5432/postgres and ddl-auto is usually create or validate
(https://docs.spring.io/autorepo/docs/spring-boot/1.1.0.M1/reference/html/howto-database-initialization.html, https://stackoverflow.com/questions/438146/what-are-the-possible-values-of-the-hibernate-hbm2ddl-auto-configuration-and-wha)
Having those properties and a running postgres should be the only changes needed.
Perform all alias resolution queries that only require access to a single table. This includes:
Fix this by always assuming a list of chromosomes will be returned by all queries
It returns a 500 right now, because an uncaught exception is thrown, but it's not a server error, really.
We allow enabling or disabling scaffold ingestion at compile time, but the classification of a sequence as "chromosome" or "scaffold" may not be completely right:
if (!line.startsWith("#")) {
    // parseChromosomeLine(line);  (older call, apparently superseded by the lines below)
    String[] columns = line.split("\t", -1);
    if (columns.length >= 6) {
        if (columns[3].equals("Chromosome")) {
            parseChromosomeLine(columns);
        } else if (isScaffoldsEnabled && columns[1].equals("unplaced-scaffold")) {
            parseScaffoldLine(columns);
        }
    }
}
For instance, the MT (in ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.14_GRCh37.p13/) won't be considered either a chromosome or a scaffold, because its columns[3] is Mitochondrion, not Chromosome, and its columns[1] is assembled-molecule, not unplaced-scaffold.
Likewise, there are other contigs whose role is not unplaced-scaffold. GCA_000001405.14 has some of them, like alt-scaffold, fix-patch and others, and we should decide whether to include them and, if so, in which category. (The Chromosome category is meant for a small set of widely used sequences; scaffolds are for anything else.)
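One possible policy, sketched under assumptions: treat every assembled-molecule row as a chromosome (which would cover MT) and every other role as a scaffold when scaffold ingestion is enabled. The class SequenceClassifier and the Category enum are hypothetical, and the issue above leaves the actual decision open.

```java
class SequenceClassifier {

    enum Category { CHROMOSOME, SCAFFOLD, SKIPPED }

    /**
     * Classifies one assembly report line by its role column (columns[1]).
     * Assumed policy, not the project's: "assembled-molecule" means chromosome,
     * any other role (unplaced-scaffold, alt-scaffold, fix-patch, ...) means scaffold.
     */
    static Category classify(String line, boolean scaffoldsEnabled) {
        if (line.startsWith("#")) {
            return Category.SKIPPED; // header or comment line
        }
        String[] columns = line.split("\t", -1);
        if (columns.length < 6) {
            return Category.SKIPPED; // malformed row
        }
        if (columns[1].equals("assembled-molecule")) {
            return Category.CHROMOSOME;
        }
        return scaffoldsEnabled ? Category.SCAFFOLD : Category.SKIPPED;
    }
}
```

With this rule the MT row is ingested as a chromosome because only its role is inspected, not columns[3]. Whether alt-scaffold and fix-patch rows belong with the scaffolds is exactly the open question, so the final branch is the part most likely to change.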
Currently downloadAssemblyReport only accesses and downloads the assembly report from the FTP. We should add a function to retrieve the actual assembly that lives alongside the report.
GitHub Actions still runs the tests on Java 11 even though the project's JDK version has been downgraded to Java 8.
Please use the @ApiParam annotation to add descriptions and examples for the fields so it's clearer to users. You can see implementations of this here:
@ApiParam("assembly accession in GCA format, e.g.: GCA_000003055.3") @PathVariable String accession) throws IOException {
Originally posted by @andresfsilva in #26
Queries that provide a part (sub-string) of a field. Specifically:
The problem with this pagination approach is that when users run a query like "/assemblies/taxid/9606" and receive a list with 10 results, they might not know the results are paginated and think that's the complete result.
We have to provide something in the result to signal that it's only a page. I'm not a huge fan of HATEOAS but it seems to be the closest to a standard that we currently have. Take a look at one of our endpoints:
The good thing about it is that it tells you how many pages there are, and also gives you the links to get the next/previous/first/last pages (when it makes sense).
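To make the idea concrete, here is a plain-Java sketch of the metadata such a response carries: the total page count plus first/prev/self/next/last links. The class PageLinks and the page and size query parameter names are assumptions for illustration. In the service itself this would more likely come from Spring's HATEOAS support than from hand-written code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PageLinks {

    /** Number of pages needed to hold totalElements items, size items per page. */
    static int totalPages(long totalElements, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return (int) ((totalElements + size - 1) / size);
    }

    /** Builds navigation links for a zero-based page; prev/next appear only when they exist. */
    static Map<String, String> links(String baseUrl, int page, int size, long totalElements) {
        int pages = totalPages(totalElements, size);
        Map<String, String> links = new LinkedHashMap<>();
        links.put("self", url(baseUrl, page, size));
        if (pages == 0) {
            return links; // empty result: nothing to navigate to
        }
        links.put("first", url(baseUrl, 0, size));
        if (page > 0) {
            links.put("prev", url(baseUrl, page - 1, size));
        }
        if (page < pages - 1) {
            links.put("next", url(baseUrl, page + 1, size));
        }
        links.put("last", url(baseUrl, pages - 1, size));
        return links;
    }

    private static String url(String baseUrl, int page, int size) {
        return baseUrl + "?page=" + page + "&size=" + size;
    }
}
```

A client that receives 10 results together with a "next" link and a page count of 3 can tell immediately that the list is incomplete.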
Adding assemblies ["GCA_000001405.10","GCA_000001405.11","GCA_000001405.12"] using /v1/admin/assemblies results in an IOException.
Error:
2020-08-16 21:48:53.439 ERROR 5055 --- [pool-1-thread-1] u.a.e.e.c.dus.PassiveAnonymousFTPClient : Could not connect to FTP server 'ftp.ncbi.nlm.nih.gov'. FTP status was: null. Reply code: 221. Reply string: 221 Goodbye.
.
2020-08-16 21:48:53.439 ERROR 5055 --- [pool-1-thread-1] u.a.e.e.c.service.AssemblyService : IOException while fetching and inserting GCA_000001405.10
org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:324) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:300) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:523) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:648) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:622) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTP.quit(FTP.java:904) ~[commons-net-3.6.jar:3.6]
at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:1148) ~[commons-net-3.6.jar:3.6]
at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.disconnect(PassiveAnonymousFTPClient.java:80) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.connect(PassiveAnonymousFTPClient.java:63) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.connect(PassiveAnonymousFTPClient.java:33) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.dus.NCBIBrowser.connect(NCBIBrowser.java:49) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource.getAssemblyByAccession(NCBIAssemblyDataSource.java:43) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource$$FastClassBySpringCGLIB$$534e408e.invoke(<generated>) ~[classes/:na]
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139) ~[spring-tx-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource$$EnhancerBySpringCGLIB$$4235151a.getAssemblyByAccession(<generated>) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.service.AssemblyService.fetchAndInsertAssembly(AssemblyService.java:97) ~[classes/:na]
at uk.ac.ebi.eva.contigalias.service.AssemblyService.lambda$fetchAndInsertAssembly$2(AssemblyService.java:169) ~[classes/:na]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:832) ~[na:na]
I have noticed that some assemblies in the NCBI database do not have a corresponding fna/fasta file alongside their assembly_report file, for example GCF_000001405.10.
Others do, such as GCF_000001405.26.
This can be confusing when retrieving the data, because it means that we cannot build a seqCol object for every assembly accession.
There will be some limitations.
The root path, https://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/, should redirect to the Swagger page, without having to manually type https://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/swagger-ui.html.
Instead it shows a JSON document that allows making a full paginated scan of the DB (http://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/assemblyEntities). I didn't know about this feature but I guess it can be useful, so ideally it would still be available under another path. If both things are not possible, having the Swagger page in the root path is more important.
Wait, I missed this. It looks like we have a different understanding of what our paging strategy should be. My point is to provide a uniform interface, so that you can use every endpoint in the same way. However, if you query by an assembly accession and we know it can only possibly return 1 element, we can just ignore the pagination parameters (or actually remove them, if Spring doesn't complain when they are provided). I don't think we should change the service interface from Optional<> to Page<> either. Only return createAppropriateResponseEntity(optionalAssembly, assemblyAssembler); should stay as it is in the current form, and you can add a method overload that takes an Optional and creates a page so that the assembler can build the response.
When given an accession there currently exists functionality in the project to:
This functionality is presently being harnessed in various test cases. The goal is to expose this functionality to the user using an API. Doing this will require building these components:
from the refget paper:
Refget defines three supported identifier algorithms: MD5, TRUNC512 and GA4GH Identifier. All three algorithms normalise sequence input by stripping all whitespace characters and restricting to characters in the range A-Z. We chose this character range as a compromise between the methods and requirements employed by CRAM, ENA and the Variation Representation Specification (VRS). MD5 is the default checksum algorithm used by the CRAM format’s M5 tag and hence the CRR. It is provided for backwards compatibility with existing CRAM files. However, there are limitations to md5’s algorithm; the occurrence of a checksum collision between non-identical sequences would be catastrophic. To mitigate this concern, we co-developed two schemes with the Genomic Knowledge Standards’ Variation Representation Specification (VRS) based on the SHA-512 checksum algorithm called TRUNC512 and GA4GH identifier. Both schemes use the first 24 bytes of a SHA-512 digest. TRUNC512 chooses to represent this as a hex encoded string. GA4GH identifier converts these bytes into a base64 URL encoded string formatted as “ga4gh:SQ.XXXX”. Both algorithms are interchangeable since both represent the same underlying SHA-512 digest, however the GA4GH identifier is preferred to maintain VRS compatibility.
I thought that refget only used TRUNC512 and MD5, but it seems we should support the GA4GH identifiers too. Luckily, I think we can store just TRUNC512 and MD5 as we are doing at the moment and allow searches by GA4GH id by transforming it on the fly to the TRUNC512 id.
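A sketch of that on-the-fly transformation, assuming only what the quoted passage states: both identifiers encode the same 24 bytes, one as hex and the other as unpadded base64url behind the "ga4gh:SQ." prefix. The class and method names are hypothetical.

```java
import java.util.Base64;

class RefgetIds {

    private static final String GA4GH_PREFIX = "ga4gh:SQ.";
    private static final int DIGEST_BYTES = 24; // first 24 bytes of the SHA-512 digest

    /** Converts "ga4gh:SQ.<base64url>" into the 48-character TRUNC512 hex string. */
    static String ga4ghToTrunc512(String ga4ghId) {
        if (!ga4ghId.startsWith(GA4GH_PREFIX)) {
            throw new IllegalArgumentException("Not a GA4GH sequence identifier: " + ga4ghId);
        }
        byte[] digest = Base64.getUrlDecoder().decode(ga4ghId.substring(GA4GH_PREFIX.length()));
        if (digest.length != DIGEST_BYTES) {
            throw new IllegalArgumentException("Expected " + DIGEST_BYTES + " bytes, got " + digest.length);
        }
        StringBuilder hex = new StringBuilder(2 * DIGEST_BYTES);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Converts a 48-character TRUNC512 hex string into "ga4gh:SQ.<base64url>". */
    static String trunc512ToGa4gh(String trunc512) {
        if (trunc512.length() != 2 * DIGEST_BYTES) {
            throw new IllegalArgumentException("Expected " + (2 * DIGEST_BYTES) + " hex characters");
        }
        byte[] digest = new byte[DIGEST_BYTES];
        for (int i = 0; i < DIGEST_BYTES; i++) {
            digest[i] = (byte) Integer.parseInt(trunc512.substring(2 * i, 2 * i + 2), 16);
        }
        return GA4GH_PREFIX + Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }
}
```

For example, ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2 and 68a178f7c740c5c240aa67ba41843b119d3bf9f8b0f0ac36 encode the same 24 bytes, so a search endpoint could accept the former and query the stored TRUNC512 column with the latter.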
What data/Java model would be appropriate for storing sequence collections?
We need to be able to represent all 3 levels of sequence collections:
The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
The compact level
{
"sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
"lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
"names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}
The canonical level
{
"lengths": [
"1216",
"970",
"1788"
],
"names": [
"A",
"B",
"C"
],
"sequences": [
"76f9f3315fa4b831e93c36cd88196480",
"d5171e863a3d8f832f0559235987b1e5",
"b9b1baaa7abf206f6b70cf31654172db"
]
}
Additional question: Can we add another property to a sequence collection in this data model?
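One way to hold the three levels together, as a rough sketch rather than a proposal for the final schema: keep the top level digest as a field and store each attribute's digest and values in maps keyed by attribute name. Adding another property is then just another map entry. The class name and methods are hypothetical, and computing the digests is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SequenceCollection {

    private final String digest;                                                      // top level
    private final Map<String, String> attributeDigests = new LinkedHashMap<>();       // compact level
    private final Map<String, List<String>> attributeValues = new LinkedHashMap<>();  // canonical level

    SequenceCollection(String digest) {
        this.digest = digest;
    }

    /** Registers one attribute (e.g. "names") with its digest and its list of values. */
    SequenceCollection addAttribute(String name, String attributeDigest, List<String> values) {
        if (attributeDigests.containsKey(name)) {
            throw new IllegalArgumentException("Attribute already present: " + name);
        }
        attributeDigests.put(name, attributeDigest);
        attributeValues.put(name, Collections.unmodifiableList(new ArrayList<>(values)));
        return this;
    }

    String topLevel() {
        return digest;
    }

    Map<String, String> compactLevel() {
        return Collections.unmodifiableMap(attributeDigests);
    }

    Map<String, List<String>> canonicalLevel() {
        return Collections.unmodifiableMap(attributeValues);
    }

    /** The example from this issue, with digests copied verbatim rather than computed. */
    static SequenceCollection example() {
        return new SequenceCollection("S3LCyI788LE6vq89Tc_LojEcsMZRixzP")
                .addAttribute("sequences", "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
                        Arrays.asList("76f9f3315fa4b831e93c36cd88196480",
                                "d5171e863a3d8f832f0559235987b1e5",
                                "b9b1baaa7abf206f6b70cf31654172db"))
                .addAttribute("lengths", "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
                        Arrays.asList("1216", "970", "1788"))
                .addAttribute("names", "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp",
                        Arrays.asList("A", "B", "C"));
    }
}
```

The map-based layout trades compile-time field names for flexibility. A relational mapping would probably need one row per attribute rather than one column per attribute to keep that flexibility.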
I have been reviewing our codebase and noticed that constants are defined in individual classes. After further analysis, I believe that consolidating these constants in a separate class could make our code easier to read and maintain.
We are indeed ingesting GenBank accessions, but GenBank is just one of the INSDC sources, along with ENA and DDBJ. These accessions are in the same namespace, so they are compatible. Right now, someone searching for an ENA accession would need to put it under "genbank", and ENA is not GenBank, but both are INSDC.
There's no need to rename all the internal variable names, only the ones that show in the Swagger page and in the JSON responses. Also, while documenting INSDC parameters, make a quick mention of GenBank, ENA and DDBJ.
Write API docs for all endpoints using Swagger or any other similar tool.
Add pagination support to all endpoints for
This error shows up in the logs when running the service.
This probably comes from having a null int instead of an Integer for a default value in some endpoint.
2020-08-11 10:34:37.371 WARN 21295 --- [nio-8080-exec-2] i.s.m.p.AbstractSerializableParameter : Illegal DefaultValue null for parameter type integer
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[na:1.8.0_211]
...
When a search is performed using the API, the service should first check whether the result is present in the database. If not, the "dus" is used to fetch and parse the assembly report from NCBI's server. If a result is found online, it is returned to the user and also added to the database as a caching mechanism.
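The lookup order described above can be sketched in plain Java. Here a Map stands in for the database and a Function stands in for the "dus" fetch; both are placeholders and the class name CachingLookup is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class CachingLookup<K, V> {

    private final Map<K, V> store;                  // stand-in for the database
    private final Function<K, Optional<V>> remote;  // stand-in for the remote fetch and parse

    CachingLookup(Map<K, V> store, Function<K, Optional<V>> remote) {
        this.store = store;
        this.remote = remote;
    }

    /** Returns the stored value if present; otherwise fetches remotely and stores any hit. */
    Optional<V> find(K key) {
        V cached = store.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<V> fetched = remote.apply(key);
        fetched.ifPresent(value -> store.put(key, value));
        return fetched;
    }

    /** Convenience constructor backed by an in-memory map. */
    static <K, V> CachingLookup<K, V> inMemory(Function<K, Optional<V>> remote) {
        return new CachingLookup<>(new HashMap<>(), remote);
    }
}
```

Misses are not cached in this sketch, so an accession that does not exist remotely triggers a remote call on every search. Whether that is acceptable depends on how expensive the FTP round trip is.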
Remove fields such as *.id, chromosomes.id, chromosomes.assembly from the JSON response.
I just realised this: "uk.ac.ebi.eva" is the base package that we use for all of our projects.
The current java model in contig alias has two main entities:
This issue will investigate how this model can be modified to support storing and providing sequence collections for the assemblies already represented.
We need to be able to represent all 3 levels of sequence collections:
S3LCyI788LE6vq89Tc_LojEcsMZRixzP
{
"sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
"lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
"names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}
{
"lengths": [
"1216",
"970",
"1788"
],
"names": [
"A",
"B",
"C"
],
"sequences": [
"76f9f3315fa4b831e93c36cd88196480",
"d5171e863a3d8f832f0559235987b1e5",
"b9b1baaa7abf206f6b70cf31654172db"
]
}
Additional question: Can we add another property to a sequence collection in this data model?