
proxigenomics

Elements of Hi-C analysis, primarily intended for metagenomic samples.

### Simulation Pipeline

For work detailed in the manuscript:

"Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C)"

git clone https://github.com/koadman/proxigenomics.git

### Future Work

We are working on a far more user-friendly and system-agnostic replacement and hope to have a release candidate ready soon. Though it is presently a work in progress, please visit meta-sweeper if you are interested.


### Issues

The end-to-end workflow needs to be split into parts.

Nestly carries all variations from swept parameters to the very end, no work is done at higher levels in the workflow. This means that read-sets and their assemblies are generated at the deepest point, resulting in a large amount of duplication. Duplication means CPU and storage demands are much higher than necessary.

We should split off the generation of these intermediate products into separate workflows and then combine their outputs in a final workflow.

Eg.

  • Workflow 1: WGS generation -> assembly -> mapping
  • Workflow 2: Hi-C generation -> mapping
  • Workflow 3: incorporation of WGS/Hi-C variants -> clustering -> scoring (SNPs, etc.)

Permutation control split over both file existence and parameter settings

Currently, a sweep is primarily defined in the config.yaml file; however, the codebase also checks for the existence of files (trees and profiles) to include in the sweep.

It would be better if you could control all combinatorics in a single place/domain, or at least filter files down to smaller ranges by pattern matching.

Improve exit status propagation and rollback

Running jobs need to propagate errors more thoroughly. When execution does not complete, it should not leave side-effects in the form of target files. Incomplete target files cannot be distinguished from complete ones and will not be recreated on subsequent runs; currently they must be identified and deleted manually.
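A rough sketch of the sort of wrapper that could enforce this, assuming each task produces a single target file (run_with_rollback is a hypothetical helper, not part of the current codebase):

```python
import os
import subprocess
import sys

def run_with_rollback(cmd, target):
    """Run a task command; on failure, remove any partial target file so a
    subsequent run will recreate it, and propagate the exit status."""
    rc = subprocess.call(cmd, shell=True)
    if rc != 0:
        if os.path.exists(target):
            os.remove(target)
        sys.exit(rc)

# e.g. run_with_rollback('bwa mem ref.fa reads.fq > out.sam', 'out.sam')
```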

Many scripts use "dumb" relative paths.

Many of the queue submission scripts use relative paths which assume invocation from the pipeline folder.

Eg. sgerun_MKTRUTH.sh assumes it can call bin/alignmentToTruth.py regardless of the working directory.

Some alignment work is redundant.

The pipeline currently does some redundant work, a by-product of the hierarchical design. In particular, the alignment of WGS contigs to references is done for each combination of WGS_fold and HIC_fold, while in reality it is invariant across HIC_fold.

For small sweeps this is not too significant; however, if we were to explore many levels of HIC_fold, or insert an additional parameter sweep in the wrong location, this could cost significant time.

Move the alignment to a higher place within the hierarchy.

Edge weighting

Edge weights have a significant effect on clustering solutions with MCL. We should explore how various weighting relationships affect performance.

Trialed

  • raw weight: the number of degenerate Hi-C associations between two contigs
  • normalized by contig length: the raw weight scaled by 1/sqrt(Ni Nj), the inverse geometric mean of the two contig lengths (sketched after the lists below)

Possible:

  • other parameters in normalization: no. reads, read depth.
  • other means if appropriate
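A minimal sketch of the length normalization listed above (function and variable names are placeholders):

```python
import math

def normalized_weight(raw_weight, len_i, len_j):
    """Scale the raw Hi-C association count between contigs i and j by the
    inverse geometric mean of their lengths: w_ij / sqrt(N_i * N_j)."""
    return raw_weight / math.sqrt(len_i * len_j)

# e.g. 40 raw associations between a 25 kbp and a 100 kbp contig
# normalized_weight(40, 25000, 100000) -> 0.0008
```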

Glob on reference folder requires filter for files

The glob on the reference folder currently does not apply a filter to the returned results. This can produce elements which are files and not folders.

We should have a simple test which filters non-folder entries.
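A minimal sketch of such a filter, assuming the standard library glob and the references folder layout used elsewhere in the pipeline:

```python
import glob
import os

# keep only directory entries when enumerating communities under references/
communities = [p for p in glob.glob(os.path.join('references', '*'))
               if os.path.isdir(p)]
```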

Externalize binaries path

Work has been done to remove the dependency on hard-coded paths; however, the interface of some applications (sgEvolver) is most simply configured if their dependent binaries are on PATH.

If we do this for one, we might as well assume it as a requirement for all pipeline executables.

Currently we have Bash and Python scripts in a binary folder, and use relative references.

bin/
    pbsrun_XXX.sh
    makeTruth.py

Instead, we should probably put all dependent executables (samtools, bwa, sgEvolver) in or beneath this path, then either add it to PATH or reference it via a new environment variable. A new variable is probably the better choice, as it could then also be appended to PATH if desired.
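A minimal sketch of this approach (PROXI_BIN is a hypothetical variable name, not one the pipeline currently defines):

```python
import os

# locate the pipeline's bin/ folder via a dedicated environment variable,
# falling back to the folder next to this script, and prepend it to PATH so
# child processes (samtools, bwa, sgEvolver, ...) resolve without hard-coded paths
here = os.path.dirname(os.path.abspath(__file__))
bin_dir = os.environ.get('PROXI_BIN', os.path.join(here, 'bin'))
os.environ['PATH'] = bin_dir + os.pathsep + os.environ.get('PATH', '')
```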

Reduce redundancy in job scripts

Currently there are a large number of SGE and PBS job scripts. These scripts contain a lot of code redundancy, as they also support local pipeline execution.

Now that we are no longer stuck with the clunky UTS HPC cluster, this can all be streamlined to use DRMAA and the job scripts eliminated. Third-party users wishing to run the pipeline could simply start a GVL instance on Amazon EC2 or NeCTAR.

Update WGS and HiC stages to source reference genome data from evolution step

We need to update the SCons files for WGS and Hi-C so that they take reference data from evo_data rather than the original static set of communities in references.

Additionally, the current method looks one folder beneath references to find each community, with subsequent paths assumed to be references/communityN/genomes.fasta. For the evo_data folder there is greater depth, with naming as evo_data/{tree}/{sample_profile}/{tree_scale}. Here tree_scale is a float (which makes for an awkward name) and sample_profile is currently always set to uniform; that is, the pipeline does not at present model variable abundance.

Place HPC logs into a log folder

During runtime, the system creates many logs. Rather than disable their creation, we should direct them all to a single log folder.

This will require that each "pbsrun_X.sh" script specify an output path via the -o/-e options. This is probably easiest done with a global setting.

Weighted extended Bcubed precision and recall

It can be argued that objects in a set to be classified may not be of equal value in terms of correctly classifying them. In this case, a weighting could be included in the calculation of precision and recall, rather than simply using set cardinality.

That is, where ∩ denotes set intersection, the cardinality

| A ∩ B |

becomes the weighted count

sum(w_i, {o_i : o_i ∈ A ∩ B})

where w_i is the weight of object o_i.
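A minimal sketch of this weighted overlap, assuming per-object weights (contig length, for example) are supplied as a dict:

```python
def weighted_overlap(a, b, weights):
    """Weighted replacement for |A ∩ B|: sum the weights of the objects
    common to both sets rather than simply counting them."""
    return sum(weights[o] for o in set(a) & set(b))

# weighted_overlap({'c1', 'c2'}, {'c2', 'c3'}, {'c1': 3.0, 'c2': 1.5, 'c3': 0.5}) -> 1.5
```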

Support both Hi-C and meta3C read simulation

Currently, simForward.py handles the ligation point in a simplistic fashion, where overhangs are simply reattached, which by chance mimics meta3C. In the case of Hi-C, the end-filling used in biotinylation results in a duplication at the ligation junction.

We need to add this behaviour as a selectable option.
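A simplified sketch of the distinction, assuming a 5' overhang enzyme such as HindIII (A^AGCTT); the site and overhang values are illustrative and this function is not part of simForward.py:

```python
def ligation_junction(site, overhang, hic=False):
    """Sequence inserted at a ligation junction.

    meta3C: simple re-ligation regenerates the restriction site once.
    Hi-C:   the 5' overhang is filled in before blunt-end ligation, so the
            overhang is duplicated at the junction.
    """
    if not hic:
        return site                                # meta3C, e.g. AAGCTT
    cut = site.index(overhang)                     # position of the overhang after cutting
    return site[:cut] + overhang + site[cut:]      # Hi-C, e.g. AAGCTAGCTT

# ligation_junction('AAGCTT', 'AGCT', hic=True) -> 'AAGCTAGCTT'
```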

Ideally, our generated internal files should have annotation fields.

There are situations when it would be nice to know what parameters were used to generate a given file. This can be determined implicitly by the sweep system, but when individual files are taken out of context, having that knowledge attached would be useful.

Also, we can then use this to make smarter programs.

For instance, if a truth table has been generated using thresholds, then subsequent scoring algorithms should be smart enough to check the thresholds and either report an error or first pre-filter input clusterings when there is a mismatch.

Eg. for a truth table built from contigs > 1000 bp, a clustering generated with contigs > 500 bp should first be pre-filtered. In the reverse case, where the comparison would be missing data, an error should occur.
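One lightweight option, sketched here as an assumption rather than a decided format, is a JSON sidecar written next to each generated file:

```python
import json

def write_with_annotation(path, lines, params):
    """Write a generated file plus a JSON sidecar recording the sweep
    parameters that produced it (the .meta.json suffix is an assumption)."""
    with open(path, 'w') as out:
        out.writelines(lines)
    with open(path + '.meta.json', 'w') as meta:
        json.dump(params, meta, indent=2)

# e.g. write_with_annotation('truth.txt', rows, {'min_contig_len': 1000, 'seed': 1234})
```

Downstream tools could then read the sidecar and decide whether to pre-filter or raise an error.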

Scons/Nestly error

This may be something simple but needs the eyes of someone more familiar with Nestly than myself:

ubuntu@server-cc59c734-68a1-4e33-bf4c-a5ee254c340b:/mnt/transient_nfs/proxigenomics/simulation/pipeline$ scons -j 1 -f SConstruct_deconvolve.py
scons: Reading SConscript files ...
TypeError: zip argument #2 must support iteration:
  File "/mnt/transient_nfs/proxigenomics/simulation/pipeline/SConstruct_deconvolve.py", line 34:
    @name_targets
  File "/usr/local/lib/python2.7/dist-packages/nestly/scons.py", line 153:
    self.nest.add(key, nestfunc, create_dir=False)
  File "/usr/local/lib/python2.7/dist-packages/nestly/core.py", line 174:
    for r in nestable(control):
  File "/usr/local/lib/python2.7/dist-packages/nestly/scons.py", line 151:
    return [func(destdir, control)]
  File "/usr/local/lib/python2.7/dist-packages/nestly/scons.py", line 46:
    return dict(zip(ret[:-1], ret[-1]))

Make pipeline runtime environment agnostic.

Currently, the pipeline is written explicitly to run on the UTS cluster. This runtime environment is PBS Pro, which is of limited deployment. At the least, there is no reason the system should impose this requirement on an end user.

We should insert an abstraction between jobs and the runtime system. We might simply leave it to run multi-threaded locally, or investigate a library which could abstract this part. In Java, we could use DRMAA.
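DRMAA also has Python bindings (the drmaa-python package), which would fit the existing Python codebase; a minimal sketch of submitting one task this way, with a placeholder command, might look like:

```python
import drmaa

# submit a single pipeline task through DRMAA rather than a scheduler-specific
# qsub wrapper; the same code runs against SGE, PBS Pro, Slurm, etc.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = 'bin/alignmentToTruth.py'   # placeholder task
    jt.args = ['--help']                           # placeholder arguments
    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    session.deleteJobTemplate(jt)
    print('job %s exited with status %s' % (job_id, info.exitStatus))
```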

Uniform community flag

Allow users to request a uniform community profile, which would then be handled in code rather than require the explicit definition for a given community.

Ground truth is only a guess

Our current pipeline requires mapping contigs back to the reference genomes. For hard clustering we choose a winner based on alignment extent, which is ultimately a guess.

For communities with low phylogenetic distance, this guess is poor and therefore metrics which compare solutions against the ground truth are unreliable.

Soft-clustering is obviously the approach required.

Filesystem latency can cause rollbacks

Due to the latency of our HPC's parallel FS, output files do not necessarily appear immediately for all hosts. Tasks which depend on the previous result can fail due to the non-existence of the required input file, which a few moments later will appear across the system.

We need to add at least a simple "if does not exist -- wait" in each task script.

This could be factored into a standard function which waits until the file exists and then fails with an error on timeout.

What a pain!
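A minimal sketch of such a wait-until function (the timeout and polling interval are arbitrary choices):

```python
import os
import time

def wait_for_file(path, timeout=300, poll=5):
    """Block until a file appears on the shared filesystem, failing with an
    error if it has not appeared within the timeout."""
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout:
            raise IOError('timed out waiting for %s' % path)
        time.sleep(poll)
        waited += poll
```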

Place basis sequences and trees somewhere sensible.

Currently the basis files of the evolution step (tree, input seq) are expected to be in the working directory. We could consider putting these in a subdir, at least to avoid deleting them when tidying up.

Figure showing simulation workflow

We will eventually need something like this for a manuscript, but more urgently the ARC grant I'm writing could use something like this!

binaries required by mktree.sh are unavailable

The mktree.sh script in the simulation pipeline reconstructs a phylogeny of the simulated genomes via progressiveMauve genome alignment. The various programs used by this script are not present or documented. I believe this is an optional step, simply a sanity check on sgEvolver. If so, is it OK to remove this target from SConstruct_evo.py?

Add assembly QA to pipeline.

It would probably be of some value if assembly QA were performed on the simulated assemblies. For sweeps of coverage, assemblers or library protocols, this could serve as a validation point.

Generated Fastq don't follow a naming convention

The fastq files generated by sim_forward.py do not follow any accepted naming convention. Programs such as BWA will not treat the mate-pairs correctly in this case.

We should use a simple, accepted convention. I would prefer not to add lots of fields with dummy values, such as would happen if we adopted Illumina names.
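One simple option, assumed here rather than decided, is the traditional /1 and /2 mate suffix on a shared read identifier, which BWA and most aligners recognise:

```python
def mate_names(prefix, index):
    """Name a read pair with the simple /1 and /2 suffix convention on a
    shared identifier (the prefix format is an assumption)."""
    base = '{0}_{1}'.format(prefix, index)
    return base + '/1', base + '/2'

# mate_names('frg', 17) -> ('frg_17/1', 'frg_17/2')
```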

External dependencies and overall structure a mess.

As it has come to be, the sweep codebase is making use of PROXIHOME to find some external executables/dependencies, but not all. We also have a strange and unnecessary hierarchy; most other folders within the proxigenomics repo are irrelevant to the sweep application.

We should consider moving this to its own home or create a submodule with explicit dependencies.

Simulation pipeline requires an overarching point of entry

Since parts of the sweep had to be split into separate SCons scripts, there is only weak coupling of inter-script dependencies. At present this dependency hierarchy is managed manually -- running each stage in the correct order once the last one finishes. It highlights the need for an overarching point of entry.

An overlord script is required, which performs the stages and would naturally take on the "whole simulation" output path and other variables such as the seed. This script could itself use Nestly -- but I would be tentative to do so, as the goal isn't Application Inception. :-) It would also be a good time to yank out all the bash scripting.

Some code is running on the head node

A few short-running Python scripts are not being submitted to PBS and instead are running on the head node.

  • makeMCLinput.py
  • mclJoinWithTruth.py
  • parseSamCigar.py
  • samToEdges.py
  • f1score.py
  • vmeasre.py

Pruning of input graph

For communities with relatively clear phylogenetic separation, clustering is effective even on the complete graph. As communities get progressively closer, the performance of hard clustering drops off dramatically.

We should explore how pruning at the extrema of these attributes affects clustering performance (a sketch follows the list below).

Possibilities

  • scaffold length
  • edge weight
  • read depth
  • read count
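As a sketch of one such pruning step, here filtering on edge weight and assuming the graph is held in networkx (the attribute name 'weight' is an assumption):

```python
import networkx as nx

def prune_light_edges(g, min_weight):
    """Return a copy of the contact graph with all edges below a weight
    threshold removed."""
    pruned = g.copy()
    drop = [(u, v) for u, v, d in pruned.edges(data=True)
            if d.get('weight', 0) < min_weight]
    pruned.remove_edges_from(drop)
    return pruned
```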

contact_map.py memory usage

When creating a contact map with 35 Mbp of reference genome over 11 reference sequences (the Burton et al. 2014 three-domain data), the script requires ~7 GB of RAM, which makes my 8 GB MacBook Air beg for mercy. Is it possible to reduce the memory footprint somehow, e.g. by freeing unused data structures? How does it scale with reference genome length and the number of reads?
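One possible mitigation, assuming the map is dominated by empty cells, would be to accumulate counts in a sparse matrix rather than a dense array; a rough sketch with scipy:

```python
import numpy as np
from scipy.sparse import coo_matrix

# accumulate bin index pairs as read pairs are observed, then build the matrix
# once; memory then scales with the number of non-zero cells rather than with
# (number of bins)^2
rows, cols = [], []
# ... append one (row, col) pair per informative read pair ...
n_bins = 1000  # placeholder
contact = coo_matrix((np.ones(len(rows)), (rows, cols)),
                     shape=(n_bins, n_bins)).tocsr()
```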

add support for running pipeline outside $HOME

Some of the scripts, e.g. {pbs,sge}run_METAART.sh assume that binaries can be found in paths such as:
$HOME/git/proxigenomics/simulation/hic_simulator/src/metaART.py
This is a problem when using the pipeline on a GVL CloudMan cluster because $HOME is not shared storage.

Share related parameters between make files

There are parameters defining the simulation sweep which are specified separately in make files. This was due to splitting the single make into three -- a solution necessary to avoid a much more complex script.

There should be an easy means of sharing this information.

Example -- coverage.

Soft-clustering descriptive values

The result of soft clustering needs some simple descriptive values.

  • avg degeneracy per node
  • total number of clusters
  • number of nodes unassigned (SR-MCL appears to not assign some nodes -- my interpretation of blank lines in metis output)

User set random seed

Users should be able to set the seed for the random number generator. The seed used should be visible in the simulation output, either on STDOUT or in a file.
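A minimal sketch of the behaviour, assuming argparse; the flag name --seed and the fallback of drawing a seed so it can still be reported are assumptions:

```python
import argparse
import random
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=None,
                    help='seed for the random number generator')
args = parser.parse_args()

# if no seed was given, draw one so that it can still be reported and reused
seed = args.seed if args.seed is not None else random.randint(0, 2**32 - 1)
random.seed(seed)
sys.stderr.write('random seed: %d\n' % seed)
```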

Write out community profile for simulation

For functionally defined community profiles, write out the profile used in a given simulation. We will use the same format as used for input "by table" community profiles.

Eg. space delimited text.

seq_id taxon_id relative_probability

bcubed.py does not notice when supplied the wrong file format.

We need better typing of files. If bcubed.py is supplied a GraphML format file from a clustering program such as OclustR, it will happily read it as if it were an MCL format file.

Since there is no error, it goes on to score this garbage against the truth table.

Soft-clustering

We must test soft clustering algorithm performance for at least phylogenetically close communities.

Options

  • SR-MCL
  • MaxMax
  • ?

Large files and method of parsing BAM

BAM files based on real experimental data have shown that large files are very inefficiently parsed using the method currently employed.

As of now, sequences are attended to in a specific order, making it trivially easy to determine the offset within the contact matrix (summed per outer loop). Unfortunately, the pysam method fetch(seq_name) appears to require much up-front IO that scales with BAM file size, resulting in a significant delay for each invocation. In the case of many reference sequences (possibly due to a fragmented WGS assembly) this becomes a huge penalty.

Therefore, we will require that a method be implemented which determines contact matrix offset from the predetermined sequence order and the fields available in the BAM file. This should not pose a problem.
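A minimal sketch of the alternative, assuming pysam: the header fixes the reference order, so matrix offsets can be precomputed and the BAM streamed once instead of calling fetch() per reference (the file name is a placeholder):

```python
import pysam

bam = pysam.AlignmentFile('hic_to_contigs.bam', 'rb')   # placeholder path

# the header already defines the reference order, so each sequence's offset
# in the contact matrix is known up front
offset = {name: i for i, name in enumerate(bam.references)}

for read in bam.fetch(until_eof=True):
    if read.is_unmapped or read.mate_is_unmapped:
        continue
    i = read.reference_id        # row index; equal to offset[read.reference_name]
    j = read.next_reference_id   # column index from the mate's reference
    # ... accumulate contact counts at (i, j) ...
```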

Pipeline path references require refactoring

Path referencing within the pipeline uses absolute paths above the pipeline project and probably some relative paths.

We need to decide on a single referencing method.

Two points that need to be decided (either/or)

  1. All executable paths relative OR assumed on $PATH
  2. Data references relative OR using an ENV setting.

Should performance metrics consider only classification?

Currently, we are using F1 with pair counting to determine the quality of clustering solutions. This metric considers only the classification and not the nature of the contigs under classification. It could be said that incorrectly assigning a large contig is worse than making a mistake on a very small contig.

Specify an overall output path

There should be the facility to specify an overall output path, under which all sweep directories are created. A copy of the parameter sweep details should also be referenced, possibly as an appended log or some versioning process if repeated calls are to be preserved.

Symmetric measures can be used as consensus tests

Measures such as v-measure, adjusted Rand index and mutual information are symmetric with respect to prediction and ground truth. These could be used to compare consensus between clusterings, rather than only comparing against the truth.
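A minimal illustration using scikit-learn, which provides all three measures (the label vectors are toy data):

```python
from sklearn.metrics import adjusted_rand_score, mutual_info_score, v_measure_score

# two clustering solutions over the same set of contigs, as integer labels
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]

# each measure is symmetric in its arguments, so either input may be a
# prediction, a second prediction, or the ground truth
print(adjusted_rand_score(labels_a, labels_b))   # 1.0 (identical partitions)
print(v_measure_score(labels_a, labels_b))       # 1.0
print(mutual_info_score(labels_a, labels_b))     # ln(3), about 1.0986
```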
