azulejo's Issues

minimap2-based prototype implementation evaluation

I hacked together a method based on the minimap2 alignment strategy. The basic idea: after producing whole-genome alignments via minimap2, use the paftools liftover command to project the coordinates of genes in the query genome onto coordinates on the reference genome. Then an intervaltree-based script produces a correspondence between each original query gene and whichever reference gene best overlaps the projected coordinates (currently using a simple heuristic to resolve cases where multiple genes are overlapped). So far, the method produces results on cowpea that are fairly consistent both with the current DAGchainer-based method and with the Phytozome assignments (the exact method they use is unknown, but it appears to be based on correspondences to a single reference genome).
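The overlap-assignment step can be sketched as follows. This is an illustrative reimplementation of the heuristic using a plain linear scan rather than an interval tree; all names are hypothetical, not the actual script:

```python
# Hypothetical sketch of the overlap heuristic described above: given a
# query gene's lifted-over coordinates, choose the reference gene whose
# annotated interval overlaps it most.
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def best_reference_gene(lifted, ref_genes):
    """lifted: (start, end) projected coordinates on the reference.
    ref_genes: {gene_name: (start, end)} annotated reference intervals.
    Returns the name of the best-overlapping gene, or None if nothing overlaps."""
    best, best_ov = None, 0
    for name, (start, end) in ref_genes.items():
        ov = overlap(lifted[0], lifted[1], start, end)
        if ov > best_ov:
            best, best_ov = name, ov
    return best
```

A real implementation would use an interval tree to avoid scanning every reference gene per query, but the selection rule is the same.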

A couple of things that seem to me to be possible advantages of the whole genome-based method over the current implementation:

  • not limited in applicability to protein-coding genes; in fact, any annotated elements of a genome can in theory be put into correspondence using this approach
    - the liftover produces coordinate mappings regardless of whether the corresponding regions carry annotations, so in theory missed annotations could be recovered more directly this way
  • the whole-genome alignments should be more sensitive than gene-based alignment, since differences in introns and intergenic context help inform alignment decisions
  • the whole-genome alignments contain information that will be useful for other purposes, such as variant calling between aligned genomes (also supported by paftools); a corollary is that the alignments are easy to examine in IGV. Any of the methods can be evaluated reasonably easily using GCV, and they would probably also translate fairly naturally into the multi-genome viewer being developed by MGI (I believe that viewer is also element-centric, though not as tied to protein-coding genes). Perhaps JBrowse2 will also be a context in which such comparisons can be displayed.

My current implementation is pretty simple and has so far only been evaluated on cowpea and glycine, but it seems to be doing a competitive job. Worth further discussion with @joelb123 and @cann0010 (who may or may not receive this message, since the repo still hasn't been moved to the legumeinfo organization per #108).

get rid of bash scripts

The azulejo_tool code might be spiffy for bash, but it's still bash. Move it to Python, which will make it easier to update, use, and test.

improve intersect-anchors command

Implement a simple tool that compares two sets of adjacencies (e.g., synteny anchors, but homology clusters would also work). The comparison should use simple set logic to classify each adjacency set as one of (identity, superset, subset, incongruency) and produce both summary stats and a detailed file of differences.

This functionality exists in intersect-anchors, but some of it was lost in a crash. Time to reimplement is estimated at one hour.
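A minimal sketch of the set-logic classification, assuming each adjacency set is represented as a Python set of gene identifiers (the representation and function name are assumptions, not the actual intersect-anchors code):

```python
# Classify adjacency set `a` relative to adjacency set `b` using simple
# set logic, per the four categories named above.
def classify_adjacency_sets(a, b):
    if a == b:
        return "identity"
    if a > b:          # proper superset
        return "superset"
    if a < b:          # proper subset
        return "subset"
    return "incongruency"  # partial overlap or disjoint
```

Running the classifier over every pair of corresponding adjacency sets yields both the per-pair detail file and, by counting categories, the summary stats.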

phylogenetic distance info into homology tables

Calculate alignments and phylogenetic-distance info per homology cluster and make that info available to the proxy-gene calculation. The emphasis is on speed rather than accuracy at this point.

Percent sequence identity could also be passed along, though it is less informative than distance.
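As a reminder of why distance is more informative, a fractional identity can be converted to a crude evolutionary distance with the Poisson correction d = -ln(identity), which grows nonlinearly as identity drops; this is purely illustrative, and a real pipeline would use a proper substitution model:

```python
import math

# Poisson-corrected distance from fractional sequence identity.
# With p = fraction of differing sites, d = -ln(1 - p) = -ln(identity).
def poisson_distance(fraction_identity):
    return -math.log(fraction_identity)
```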

Add data structures for synteny blocks

We have to have consistent data structures for comparing synteny calculation methods. So far the only data structure we have is the histograms of assignments, and those are pretty much a mess for the DAGchainer method.

Create data structures for synteny blocks that can be populated by various methods and compared. The beginning of that is parsing the GFF files, which was addressed by a commit on Thursday.
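One possible shape for such a method-agnostic structure, sketched as a Python dataclass (the field names are illustrative assumptions, not azulejo's actual schema):

```python
from dataclasses import dataclass, field

# A method-agnostic record for one synteny block, to be populated by any
# of the calculation methods (DAGchainer, minimap2-based, k-mer-based).
@dataclass
class SyntenyBlock:
    query_genome: str
    ref_genome: str
    query_genes: list = field(default_factory=list)  # gene IDs in query order
    ref_genes: list = field(default_factory=list)    # matched reference gene IDs
    method: str = "unknown"                          # which method produced the block

    def size(self) -> int:
        """Number of gene pairs anchoring this block."""
        return len(self.query_genes)
```

Because every method writes the same record, blocks from different methods can be compared directly (e.g., by the intersect-anchors set logic).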

finalize algorithm defaults

There are parameters that the user can choose for the algorithm:

  1. thorny or non-thorny
  2. peatmer vs. simple k-mer
  3. k-mer length
  4. strictly adjacent vs extended adjacency (i.e., 1st, 2nd, 3rd match of peatmer)

Like most scientific algorithms, the defaults should be a defensible choice obtained by comparison against a standard. There is an expected tradeoff between completeness (the fraction of genes in a match) and colinearity (the quantity optimized by existing aligners such as DAGchainer). We will document how large this tradeoff is, justify a chosen value, and emphasize colinearity where possible.
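The two quantities in the tradeoff could be measured, for example, as follows; both metrics here are illustrative assumptions (completeness as the matched fraction, colinearity as a longest-increasing-subsequence proxy), not azulejo's actual definitions:

```python
import bisect

def completeness(matched_genes, n_genes):
    """Fraction of genes that participate in some match."""
    return len(matched_genes) / n_genes

def colinearity(ref_positions):
    """Fraction of matched genes lying on the longest strictly increasing
    run of reference positions (patience-sorting LIS in O(n log n))."""
    tails = []
    for p in ref_positions:
        i = bisect.bisect_left(tails, p)
        if i == len(tails):
            tails.append(p)
        else:
            tails[i] = p
    return len(tails) / len(ref_positions)
```

Sweeping the algorithm's parameters and plotting these two numbers against each other would make the size of the tradeoff explicit.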

Proposed standard for matching is DAGchainer rather than minimap2, unless there is a reason otherwise.

Proposed data set is glycine7 rather than glycine33. If a second comparison set is chosen, it might be a good plan to pick one that includes a genome with a large number of small scaffolds, where the choices are likely to be sharper.

do external merges/joins on homology and genome views

Now that homology and genome views are broken out separately (never to be recombined into one giant table, for reasons of scaling), it's necessary to do external joins on info passed from one view to another, and also on the global view (e.g., synteny blocks).

I investigated Apache Flink for this purpose, and although I like the name it seems to be heavier weight than what we need.

I propose to use named pipes and file locking to accomplish external merges and joins. This approach should scale up to the limit on file descriptors (shown by cat /proc/sys/fs/file-max), which is about 3M on my desktop system, roughly 100x the expected number of homology clusters even for largish eukaryotic genomes.

Performance will be disk-limited, but likely quite acceptable on systems with SSD storage available.
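A minimal POSIX sketch of the named-pipe idea, passing rows of one view to another through a FIFO (all names are hypothetical; file locking and error handling are omitted for brevity):

```python
import os
import tempfile
import threading

# Hypothetical demo: one "view" writes rows (one homology cluster per line)
# into a POSIX FIFO while another reads them back. In the real design each
# view would run in its own process and the FIFO would be locked.
def external_merge_demo():
    fifo = os.path.join(tempfile.mkdtemp(), "homology.fifo")
    os.mkfifo(fifo)
    rows = ["cluster1\tgeneA", "cluster2\tgeneB"]

    def writer():
        # open() on a FIFO blocks until the reader side is opened
        with open(fifo, "w") as out:
            for row in rows:
                out.write(row + "\n")

    t = threading.Thread(target=writer)
    t.start()
    with open(fifo) as src:
        received = [line.rstrip("\n") for line in src]
    t.join()
    return received
```

Because the kernel buffers the pipe, the two sides never need the whole table in memory at once, which is the point of keeping the views external.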

per-homology-family info

As of now, the homology family info is all in one giant table. This data structure will not scale.

Construct per-homology-family tables that span genomes.

Install binary dependencies

azulejo has many binary dependencies at the moment, and installing them is a pain. Implement installation as a series of commands within azulejo. Consider commonality with the "install" command in bionorm and with the installation tool in lorax, and whether this should be abstracted into a separate package.

add jemalong_A17.gnm5.ann1_6.L2RX to medicago clusters to enable comparison

I believe I've fixed the issue you were having with IDs in the GFF not matching the FASTA headers in the file:
/erdos/legumeinfo/data/public/Medicago_truncatula/jemalong_A17.gnm5.ann1_6.L2RX/medtr.jemalong_A17.gnm5.ann1_6.L2RX.protein.faa

If it checks out, we can make this the official data store version, but for now this is simply to enable testing/evaluation.
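The kind of consistency check involved could look like this; it verifies that every mRNA ID in a GFF3 file has a matching FASTA header (a hypothetical standalone check written for illustration, not the actual fix):

```python
# Check GFF3 mRNA IDs against protein-FASTA headers.
def gff_ids(gff_lines):
    """Collect ID= attribute values from mRNA features (GFF3 column 9)."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "mRNA":
            for attr in cols[8].split(";"):
                if attr.startswith("ID="):
                    ids.add(attr[3:])
    return ids

def fasta_ids(fasta_lines):
    """Collect the first token of each FASTA header line."""
    return {l[1:].split()[0] for l in fasta_lines if l.startswith(">")}

def unmatched(gff_lines, fasta_lines):
    """GFF mRNA IDs with no corresponding FASTA header."""
    return gff_ids(gff_lines) - fasta_ids(fasta_lines)
```

An empty result means every annotated mRNA has a protein sequence under the same identifier.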

eliminate perl dependencies

Some of the steps in building syntenic blocks depend on Perl code that is difficult to install and maintain. Replace it with Python equivalents.

reproduce 0.9.18 results

Using glycine7 and the external homology from dagchainer_tool, verify that the current version produces the same results or justify differences.

create examples directory

Create a directory of examples that have been used in testing that users can clone. Add a command to select which example users might want to use as a template.

Implement testing

There are no tests for anything in azulejo. Implement pytest-based testing.

scan usearch identity for match with BLAST clustering

Using the histograms in #191, scan identity values in usearch clustering to find the value that most closely matches the clusters produced by dagchainer_tool with default parameters. Present the results so that a go/no-go decision can be made on using usearch for homology clustering for now. Document the results.
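The scan could be scored, for instance, by L1 distance between cluster-size histograms; this metric and the function names are assumptions for illustration, not necessarily what #191 intends:

```python
# Compare cluster-size histograms ({cluster_size: count} dicts) and pick
# the usearch identity threshold whose histogram is closest to the
# dagchainer_tool reference.
def histogram_distance(h1, h2):
    """L1 distance between two sparse histograms."""
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

def best_identity(ref_hist, hist_by_identity):
    """hist_by_identity: {identity_threshold: histogram}."""
    return min(hist_by_identity,
               key=lambda t: histogram_distance(ref_hist, hist_by_identity[t]))
```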

phylogeny-aware proxy gene assignments

Proxy-gene assignment was broken at some point while introducing synteny calculations. Step one of this task is to get it going again by writing synteny assignments to homology groups (this was working but was broken by the 0.9.19 tagged commit); that is a limited job that should take well under a day.

Step two is to use phylogenetic distance, taxonomy, or both to aid in selecting proxy genes; it depends on issues #162 and #163. This step is especially important when more than one proxy gene is to be selected, e.g. genes in syntenic anchors on a chromosome plus one that is part of a clade with an organelle gene.
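One of several possible distance-based heuristics would be to pick the candidate with the smallest mean phylogenetic distance to the rest of its cluster; this is purely illustrative and not azulejo's actual selection rule:

```python
# Pick a proxy gene as the "medoid" of its homology cluster: the member
# with the smallest mean distance to all other members.
def pick_proxy(dist):
    """dist: {gene: {other_gene: distance}} pairwise-distance mapping."""
    def mean_dist(g):
        others = [d for other, d in dist[g].items() if other != g]
        return sum(others) / len(others)
    return min(dist, key=mean_dist)
```

The same scaffolding could weight taxonomy instead of (or in addition to) raw distance once #162 and #163 land.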

evaluate stats of synteny vs order

This is a science issue that depends on the code improvements in #160, #163, and #164. It also depends more on phylogenetic thinking than on coding.

Using a few agreed-upon data sets, do a critical evaluation of how best to choose proxy genes. The current logic uses identical length across syntenic sets as the first of several possible selection heuristics. There are a number of assumptions built into this choice that should be tested against the data. One is whether the restriction to a syntenic set is helpful compared with other possible choices, such as simply requiring genes from different genomes. Another is what to fall back on when length identity fails.

allow input from unmodified data store files

We will need to test the algorithm on various data sets in the data store.

Allow operation on compressed, symlinked files in a way that is compatible with (but not dependent on) the naming convention used in the data store.

long-distance relationships and orphans

This is a science issue in which software plays only a minor role, namely by being fast enough to be workable.

Inspection of a sample of homology-singleton genes shows that many of them look like spurious calls and do not align to anything in various non-redundant protein sets (except perhaps themselves). Some singletons/orphans are better left behind on the "bone pile," since they will never be phylogenetically connected with anything else in biology, much less with a set of gene families.

At the same time, knowing the genes that do have long-distance relationships makes the calculated families a bit better. It may also flag some called genes as possible contaminants (or possible horizontal transfers) from other species.

Consider whether the clustering step can feasibly accommodate a non-redundant set across all of protein space. Uniprot50 comes to mind.

If clustering with a large NR set is feasible, then there are other interesting gene sets (e.g., the PDB) to consider including from the start.
