legumeinfo / azulejo
Tiling phylogenetic space with subtrees
License: BSD 3-Clause "New" or "Revised" License
I hacked together a method based on the minimap2 alignment strategy. The basic idea right now is that after producing whole-genome alignments via minimap2, one can use the paftools liftover command to project the coordinates of genes in the query genome into coordinates on the reference genome. Then, using an intervaltree-based script, one can produce a correspondence between the original query gene and whichever reference gene best overlaps the projected coordinates (currently using a simple heuristic to sort out cases where multiple genes are overlapped). So far, the method is producing results on cowpea that are fairly consistent both with the results of the current DAG-chainer-based method and with the Phytozome assignments (the exact method they use is unknown, but it seems to be based on correspondences to a single reference genome).
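The "best overlap" step described above can be sketched in plain Python without the intervaltree dependency; the function name and tie-breaking heuristic here are illustrative, not the actual script:

```python
def best_overlapping_gene(proj_start, proj_end, ref_genes):
    # ref_genes: list of (gene_id, start, end) on the same reference chromosome.
    # Hypothetical heuristic: pick the reference gene with the largest overlap
    # with the lifted-over query interval; break ties by preferring the gene
    # with the smaller total span (tighter fit).
    best = None
    for gene_id, start, end in ref_genes:
        overlap = min(proj_end, end) - max(proj_start, start)
        if overlap <= 0:
            continue  # no overlap with the projected interval
        key = (overlap, -(end - start))
        if best is None or key > best[0]:
            best = (key, gene_id)
    return best[1] if best else None
```

For a projected interval 100..200, a gene at 120..210 (overlap 80) wins over one at 50..150 (overlap 50).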
A couple of things seem to me to be possible advantages of the whole-genome-based method over the current implementation:
My current implementation is pretty simple and has so far only been evaluated on cowpea and glycine, but it seems to be doing a competitive job. Worth further discussion with @joelb123 and @cann0010 (who may or may not get this message, since the repo still hasn't been moved to the legumeinfo organization per #108).
The azulejo_tool code might be spiffy for bash, but it's still bash. Move it to Python, which will make it easier to update, use, and test.
Dependabot couldn't parse the Pipfile found at /Pipfile.
The error Dependabot encountered was:
Dependabot::DependencyFileNotParseable
Implement a simple tool that compares two sets of adjacencies (e.g., synteny anchors, but could also be homology clusters). The comparison should use simple set logic to classify each adjacency into one of (identity, superset, subset, incongruency) and produce both summary stats and a detailed file of differences.
This functionality is in intersect-anchors, but some of it was lost in a crash. Time to implement is estimated at one hour.
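The set-logic classification is simple enough to sketch directly; function and column names here are illustrative, not taken from intersect-anchors:

```python
from collections import Counter

def classify(set_a, set_b):
    # Classify the relation of one adjacency set to its counterpart
    # using plain set comparisons.
    if set_a == set_b:
        return "identity"
    if set_a > set_b:
        return "superset"
    if set_a < set_b:
        return "subset"
    return "incongruency"

def compare_adjacencies(pairs):
    # pairs: iterable of (set_a, set_b). Returns per-pair details
    # (for the detailed differences file) plus summary stats.
    details = [(a, b, classify(a, b)) for a, b in pairs]
    stats = Counter(label for _, _, label in details)
    return details, stats
```

The `details` list can be written out as the file of differences, and `stats` gives the summary counts.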
Calculate alignments and phylogenetic distance info per homology cluster and make that info available to proxy gene calculation. The emphasis is on speed rather than accuracy at this point.
Percent sequence identity could also be passed along, though that is less informative than distance.
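As a minimal sketch of the per-pair stats in question (assuming the two sequences are already aligned, with "-" for gaps; the function name is hypothetical):

```python
def pairwise_stats(seq_a, seq_b):
    # seq_a, seq_b: aligned peptide sequences of equal length, '-' for gaps.
    # Returns (percent identity, p-distance) over ungapped columns.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    ungapped = len(pairs)
    pct_identity = 100.0 * matches / ungapped
    p_distance = 1.0 - matches / ungapped  # fraction of differing ungapped sites
    return pct_identity, p_distance
```

The p-distance here is the crudest distance measure; a corrected distance (e.g., from a substitution model) would be more accurate but slower, and the issue explicitly prioritizes speed.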
Calculate stats of phylogenetic distances among same-genome-adjacent, syntenic, same-genome-same-chromosome, and all proteins in a homology cluster.
We need consistent data structures for comparing synteny calculation methods. So far the only data structures we have are the histograms of assignments, and those are pretty much a mess for the DAGchainer method.
Create data structures for synteny blocks that can be populated by various methods and compared. The beginning of that is parsing the GFF files, which was addressed by a commit on Thursday.
There are parameters that the user can choose for the algorithm:
Like most scientific algorithms, the defaults should be a defensible choice obtained by comparison against a standard. There is an expected tradeoff between completeness (the fraction of genes in a match) and collinearity (the parameter optimized by existing aligners such as DAGchainer). We are going to document how large this tradeoff is and justify a value, emphasizing collinearity where possible.
Proposed standard for matching is DAGchainer rather than minimap2, unless there is a reason otherwise.
Proposed data set is glycine7 rather than glycine33. If there is a choice for a second set to compare, it might be a good plan to choose a set that includes a genome with a large number of small scaffolds where the choices are likely sharper.
Now that homology and genome views are broken out separately (never to be put together in one giant table, for reasons of scaling), it's necessary to do external joins on info passed from one view to another and also on the global view (e.g., synteny blocks).
I investigated Apache Flink for this purpose, and although I like the name it seems to be heavier weight than what we need.
I propose to use named pipes and file locking to accomplish external merges and joins. This approach should scale up to the limit on file descriptors (shown by cat /proc/sys/fs/file-max), which is about 3M on my desktop system, roughly 100x the expected number of homology clusters even for largish eukaryotic genomes.
Performance will be disk-limited, but likely quite acceptable on systems with SSD storage available.
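The named-pipe merge idea can be sketched with the standard library alone; the row format and function names here are illustrative (a real implementation would add file locking via fcntl.flock for concurrent writers):

```python
import os
import tempfile
import threading

def write_rows(path, rows):
    # Writer end: opening the FIFO blocks until a reader connects,
    # then rows are streamed through without ever landing on disk.
    with open(path, "w") as fifo:
        for row in rows:
            fifo.write(row + "\n")

def merge_by_key(fifo_path, table):
    # Reader end: join streamed tab-separated rows against an
    # in-memory table keyed on the first column.
    joined = []
    with open(fifo_path) as fifo:
        for line in fifo:
            key, value = line.rstrip("\n").split("\t")
            if key in table:
                joined.append((key, value, table[key]))
    return joined

tmpdir = tempfile.mkdtemp()
fifo_path = os.path.join(tmpdir, "synteny.fifo")
os.mkfifo(fifo_path)  # POSIX-only

# Simulate the two views with a thread standing in for a second process.
writer = threading.Thread(
    target=write_rows, args=(fifo_path, ["geneA\tblock1", "geneB\tblock2"])
)
writer.start()
result = merge_by_key(fifo_path, {"geneA": "cluster7"})
writer.join()
```

Each open FIFO consumes one file descriptor per end, which is where the file-max limit mentioned above comes in.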
As of now, the homology family info is all in one giant table. This data structure will not scale.
Construct per-homology-family tables that span genomes.
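A minimal sketch of the split (the 'family' and 'genome' column names are illustrative, not the actual schema):

```python
from collections import defaultdict

def split_by_family(rows):
    # rows: iterable of dicts, each with at least a 'family' key and
    # per-genome columns. Returns one small table (list of rows) per
    # homology family instead of a single giant table.
    tables = defaultdict(list)
    for row in rows:
        tables[row["family"]].append(row)
    return dict(tables)
```

Each per-family table can then be written to its own file, so downstream steps only load the families they need.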
Dependabot can't evaluate your Python dependency files.
As a result, Dependabot couldn't check whether any of your dependencies are out-of-date.
The error Dependabot encountered was:
Illformed requirement ["flynt"]
azulejo has many binary dependencies at the moment, and installing them is a pain. Implement installation as a series of commands in azulejo. Consider commonality with the "install" command in bionorm and with the installation tool in lorax, and whether this should be abstracted into a separate package.
Dependabot couldn't fetch one or more of your project's path-based Python dependencies. The affected dependencies were setup.py.
To use path-based dependencies with Dependabot the paths must be relative and resolve to a directory in this project's source code.
I believe I've fixed the issue you were having with IDs in the GFF not matching FASTA headers in the file:
/erdos/legumeinfo/data/public/Medicago_truncatula/jemalong_A17.gnm5.ann1_6.L2RX/medtr.jemalong_A17.gnm5.ann1_6.L2RX.protein.faa
If it checks out, we can make this the official Data Store version, but for now this is simply to enable testing/evaluation.
Some of the steps in building syntenic blocks depend on Perl code that is difficult to install and maintain. Replace it with Python equivalents.
Using glycine7 and the external homology from dagchainer_tool, verify that the current version produces the same results, or justify any differences.
Create a directory of examples that have been used in testing that users can clone. Add a command to select which example users might want to use as a template.
There is no test code for anything in azulejo. Implement pytest-based testing.
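A minimal example of the intended shape, with a toy stand-in function (the names here are hypothetical, not actual azulejo API):

```python
# test_clusters.py -- pytest discovers test_* functions automatically
# and treats plain asserts as test assertions.

def cluster_sizes(clusters):
    # Toy stand-in for a function under test: sorted cluster sizes.
    return sorted(len(c) for c in clusters)

def test_cluster_sizes_sorted():
    assert cluster_sizes([["a", "b"], ["c"]]) == [1, 2]

def test_empty_input():
    assert cluster_sizes([]) == []
```

Running `pytest` in the repo root would collect and run these; fixtures and small example data files (see the examples-directory issue below) can supply realistic inputs.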
Part of this functionality already exists in analysis.py; update it to plot both cumulative and regular histograms of homology cluster and synteny cluster sizes for the different methods.
Using the histograms in #191, scan identity values in usearch clustering to identify the value that most closely matches the results produced by dagchainer_tool with default parameters. Present the results in a way that a go/no-go decision can be made about using usearch for homology clustering for now. Document the results.
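One simple way to score "closest match" between two clusterings (assumed here, not specified in the issue) is a total-variation distance between their normalized cluster-size histograms:

```python
from collections import Counter

def size_histogram(clusters):
    # {cluster size: number of clusters of that size}
    return Counter(len(c) for c in clusters)

def histogram_distance(hist_a, hist_b):
    # Total-variation distance between normalized size histograms,
    # in [0, 1]; 0 means identical size distributions. The identity
    # value whose usearch histogram minimizes this against the
    # dagchainer_tool histogram would be the closest match.
    total_a, total_b = sum(hist_a.values()), sum(hist_b.values())
    sizes = set(hist_a) | set(hist_b)
    return 0.5 * sum(
        abs(hist_a[s] / total_a - hist_b[s] / total_b) for s in sizes
    )
```

Scanning identity thresholds then reduces to picking the threshold with the minimum distance; other scores (e.g., adjusted Rand index on the actual memberships) would be stricter but need matched gene sets.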
Pass along whether start=M and stop=* from peptide files.
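The check itself is trivial; a sketch of the flags to pass along (function and key names are illustrative):

```python
def peptide_flags(seq):
    # Flags whether a peptide record starts with methionine and ends
    # with an explicit stop character ('*'), as written by many gene
    # callers; both signal a plausibly complete gene model.
    return {
        "starts_with_M": seq.startswith("M"),
        "has_stop": seq.endswith("*"),
    }
```

These booleans could travel alongside the sequence ID through the homology and synteny tables.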
Add command to populate file table from patterns of files or URLs. Ensure that it works with the Data Store.
Proxy gene assignment was broken at some point while introducing synteny calculations. Step one in this task is to get it going again by writing synteny assignments to homology groups (this was working but was broken by the 0.9.19 tagged commit), a limited job that should take well under a day.
Step two is to use phylogenetic distance, taxonomy, or both to aid in selecting proxy genes; that step depends on issues #162 and #163. It is especially important when more than one proxy gene is to be selected, e.g., genes in syntenic anchors on a chromosome plus one that is part of a clade with an organelle gene.
There's a bug in gffpandas that makes it unable to read many common files. Fork and fix.
This is a science issue that depends on the code improvements in #160, #163, and #164. It also depends on phylogenetic thinking more than coding.
Using a few agreed-upon data sets, do a critical evaluation of how best to choose proxy genes. The current logic uses identical length across syntenic sets as the first of several possible selection heuristics. There are a number of assumptions built into this choice that should be tested against the data. One is whether the restriction of being in a syntenic set is helpful relative to other possible choices, such as simply being from different genomes. Another is what to fall back on when length identity fails.
We will need to test the algorithm on various data sets in the data store.
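The heuristic and fallback under discussion can be sketched as follows (names and the median fallback are illustrative assumptions, not the current implementation):

```python
from collections import Counter

def choose_proxy(genes):
    # genes: list of (gene_id, genome, length) in one syntenic set.
    # Heuristic echoing the current logic: prefer a gene whose length
    # is shared (modal length) across the set; as an assumed fallback
    # when no length repeats, take the median-length gene.
    lengths = Counter(length for _, _, length in genes)
    modal_length, count = lengths.most_common(1)[0]
    if count > 1:
        candidates = [g for g in genes if g[2] == modal_length]
    else:
        ordered = sorted(genes, key=lambda g: g[2])
        candidates = [ordered[len(ordered) // 2]]  # median-length fallback
    return candidates[0][0]
```

Evaluating alternatives (different-genome restriction, distance-based picks from #162/#163) would mean swapping the candidate-filtering step and comparing the resulting proxies against the data sets.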
Allow operation on compressed, symlinked files in a way that is compatible with (but not dependent on) the naming convention used in the Data Store.
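A sketch of a transparent opener (the helper name is hypothetical; detection by the .gz suffix of the resolved target is an assumption):

```python
import gzip
import os

def open_maybe_compressed(path):
    # Resolve symlinks first (Data-Store-style names are often
    # symlinks), then open for text reading whether or not the
    # target is gzip-compressed.
    real = os.path.realpath(path)
    if real.endswith(".gz"):
        return gzip.open(real, "rt")
    return open(real, "rt")
```

Sniffing the gzip magic bytes (0x1f 0x8b) instead of the suffix would make this robust to naming conventions entirely.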
Dependencies are out of date and builds are failing due to multiple problems. This is a perfect time to move to poetry to make dependency updates easier.
This is a science issue in which software plays only a minor role, by providing workable speed.
Inspection of a sample of homology-singleton genes shows that many of them look like crap and do not align to anything in various non-redundant protein sets (except perhaps themselves). Some singletons/orphans are better left behind on the "bone pile" since they will not be phylogenetically connected with anything else in biology, much less a set of gene families.
At the same time, knowing which genes have long-distance relationships makes the calculated families a bit better. It may also highlight some called genes as possible contaminants (or possible horizontal transfers) from other species.
Consider whether the clustering step can feasibly accommodate a non-redundant set across all of protein space. Uniprot50 comes to mind.
If clustering with a large NR set is feasible, then there are other interesting gene sets (e.g., the PDB) to consider including from the start.