tolkit / gfatk

A plant organellar Graphical Fragment Assembly toolkit

License: MIT License

Rust 99.80% Shell 0.20%

gfatk's People

asleonard lavafroth

gfatk's Issues

`gfatk linear` should linearise within subgraphs

Iterate over the subgraphs of a GFA
Apply linear
Will need modification of the fasta headers
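The first step above, iterating over subgraphs, could look roughly like the sketch below. It is dependency-free and not taken from gfatk: segment IDs are plain `usize` values, links are undirected `(from, to)` pairs, and both `subgraphs` and the `fasta_header` naming scheme are hypothetical names chosen for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Group segment IDs into connected components (subgraphs),
/// treating every link as undirected.
pub fn subgraphs(segments: &[usize], links: &[(usize, usize)]) -> Vec<Vec<usize>> {
    // Build an undirected adjacency list that also contains isolated segments.
    let mut adj: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for &s in segments {
        adj.entry(s).or_default();
    }
    for &(a, b) in links {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }

    let mut seen = BTreeSet::new();
    let mut components = Vec::new();
    // BTreeMap keys iterate in ascending order, so the output is deterministic.
    for &start in adj.keys() {
        if !seen.insert(start) {
            continue; // already assigned to an earlier component
        }
        let mut stack = vec![start];
        let mut component = Vec::new();
        while let Some(node) = stack.pop() {
            component.push(node);
            for &next in &adj[&node] {
                if seen.insert(next) {
                    stack.push(next);
                }
            }
        }
        component.sort_unstable();
        components.push(component);
    }
    components
}

/// Hypothetical header scheme: tag each linearised record with its subgraph number.
pub fn fasta_header(subgraph: usize, segment_ids: &[usize]) -> String {
    let ids: Vec<String> = segment_ids.iter().map(|id| id.to_string()).collect();
    format!(">subgraph_{}_segments_{}", subgraph, ids.join(","))
}
```

Linearisation would then run once per returned component, with the header carrying the subgraph number so records stay distinguishable.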

Make `gfatk extract` be able to take more than one seg ID

Not sure how easy this is with the current implementation, but would be useful for the future.

Allow any name for GFA segments, lines, paths etc.

Currently, the names of GFA elements must be integers (they are parsed to usize). A gfatk rename function is provided which converts any GFA to usize-parsable names. However, this is inconvenient; gfatk should accept arbitrary names, e.g. for segments. These could be parsed as Vec<u8>.
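As a rough illustration of reading names as raw bytes, the sketch below pulls the name field out of a segment (S) line without assuming it is an integer. `segment_name` is a hypothetical helper, not part of gfatk.

```rust
/// Extract the name field of a GFA segment (S) line as raw bytes.
/// Returns None if the line is not an S line or the name is missing/empty.
pub fn segment_name(line: &[u8]) -> Option<Vec<u8>> {
    // GFA fields are tab-separated; the first field is the record type.
    let mut fields = line.split(|&b| b == b'\t');
    if fields.next()? != &b"S"[..] {
        return None;
    }
    let name = fields.next()?;
    if name.is_empty() {
        None
    } else {
        Some(name.to_vec())
    }
}
```

Names kept as `Vec<u8>` can still be compared, hashed, and printed, so nothing downstream needs them to be numeric.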

Extract multiple mito contigs

The time has come.

gfatk extract-chloro should only extract one subgraph as we expect a consistent structure.
gfatk extract-mito should allow multiple contigs to be extracted. These contigs are usually single, circular segments, so we can restrict our extraction criteria to these for now, unless there is good evidence otherwise.
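One way to express the "single segment, circular" criterion is sketched below: a segment qualifies if it has at least one self-link and no links to any other segment. The function name and the plain `(from, to)` link representation are assumptions for illustration only.

```rust
use std::collections::BTreeSet;

/// Return the segments that form a single-segment circular subgraph:
/// at least one self-link, and no link to any other segment.
pub fn circular_singletons(segments: &[usize], links: &[(usize, usize)]) -> Vec<usize> {
    let mut has_self_link = BTreeSet::new();
    let mut has_other_link = BTreeSet::new();
    for &(from, to) in links {
        if from == to {
            has_self_link.insert(from);
        } else {
            has_other_link.insert(from);
            has_other_link.insert(to);
        }
    }
    segments
        .iter()
        .copied()
        .filter(|s| has_self_link.contains(s) && !has_other_link.contains(s))
        .collect()
}
```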

Add upper node number limit to `gfatk linear`

`gfatk linear -ei` bug

If both the -e and -i flags are given and one of the subgraphs overflows, should we fall back on normal linearisation?

Allow user supplied path in `gfatk linear`

Parse an argument or file which specifies the path. Must include orientations. E.g.

1-,2+,3+

Where each of the integers is a segment in the graph.
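Parsing such a string might look like the sketch below. The `Orientation` enum and `parse_path` function are hypothetical names, and segment IDs are assumed to be `usize`, matching the integer naming used elsewhere in these issues.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Orientation {
    Forward,
    Reverse,
}

/// Parse a comma-separated path such as "1-,2+,3+" into (segment, orientation) steps.
pub fn parse_path(input: &str) -> Result<Vec<(usize, Orientation)>, String> {
    input
        .split(',')
        .map(|step| {
            let step = step.trim();
            // The orientation is the final character of each step.
            let (id, orientation) = if let Some(id) = step.strip_suffix('+') {
                (id, Orientation::Forward)
            } else if let Some(id) = step.strip_suffix('-') {
                (id, Orientation::Reverse)
            } else {
                return Err(format!("missing orientation in '{}'", step));
            };
            let id = id
                .parse::<usize>()
                .map_err(|e| format!("bad segment ID in '{}': {}", step, e))?;
            Ok((id, orientation))
        })
        .collect()
}
```

Rejecting steps without an orientation enforces the "must include orientations" requirement at parse time.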

Format GFA to be compatible with gfatk

Might be nice to rename the nodes of an input GFA to be 1-indexed, so it can work with the rest of the toolkit. MBG does this by default. I never got around to implementing arbitrary index names, but oh well.

Restrict linear based on node number

The linear algorithm is brute force, so the number of possible paths grows very rapidly as the number of nodes increases. At the moment, gfatk hangs if there are too many nodes in the graph (it still tries to do the calculation). We should terminate the calculation when the number of nodes is > 60? Probably should do some tests.
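To give a feel for the growth, the sketch below counts orderings of n nodes (n!), which is only a rough upper bound for a fully connected graph and not the exact number of paths gfatk enumerates. It is paired with a simple guard; both function names are hypothetical, and the limit is left as a parameter because 60 is only a suggestion here.

```rust
/// Number of orderings of `n` nodes (n!), or None if the value overflows u64.
pub fn orderings(n: u64) -> Option<u64> {
    (1..=n).try_fold(1u64, |acc, k| acc.checked_mul(k))
}

/// Refuse to start a brute-force search when the graph exceeds `limit` nodes.
pub fn check_node_limit(node_count: usize, limit: usize) -> Result<(), String> {
    if node_count > limit {
        Err(format!(
            "graph has {} nodes, which exceeds the limit of {}",
            node_count, limit
        ))
    } else {
        Ok(())
    }
}
```

By this bound the count no longer fits in a `u64` beyond 20 nodes, which is why an early exit is cheaper than attempting the search.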

Obtain path sequences as specified in P-lines

Hi, thanks for providing this great toolkit!

I found this toolkit to be the only one I could find that is able to provide the fasta sequence for a path. The only thing is, I have to manually specify this. Would it be possible to supply the PathName of a path (or by default obtain the sequence for all paths)? My current workaround for one path is:

pathname="G1S1" # name of the path
# obtain the order of segments for the path
pathsegments=$(awk -v pathname="${pathname}" 'BEGIN{FS = OFS = "\t";} $1=="P" && $2==pathname {print $3;}' test.gfa)
# use gfatk to obtain the sequence, then change the fasta header back to the name of the path
gfatk path test.gfa "${pathsegments}" | awk -v pathname="${pathname}" '/^>/{$1=">"pathname} {print;}'

However, ideally I think it should be possible to say something like gfatk path test.gfa --all > all_paths.fa, if P-lines are specified in the GFA format. (I have not found any tool that can do this but yours seems closest.) What are your thoughts on this?
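A minimal sketch of what resolving every P-line could involve is given below. It is not how gfatk works internally: it assumes GFA1-style `S` and `P` lines, simply concatenates segment sequences (reverse-complementing `-` steps), and ignores overlaps entirely. `paths_to_fasta` and `reverse_complement` are hypothetical names.

```rust
use std::collections::HashMap;

/// Reverse-complement a DNA string; non-ACGT characters are kept as-is.
fn reverse_complement(seq: &str) -> String {
    seq.chars()
        .rev()
        .map(|c| match c {
            'A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A',
            'a' => 't', 'c' => 'g', 'g' => 'c', 't' => 'a',
            other => other,
        })
        .collect()
}

/// Emit one FASTA record per P-line, named after the path.
/// Assumption: overlaps between consecutive segments are NOT trimmed.
pub fn paths_to_fasta(gfa: &str) -> Result<String, String> {
    // First pass: collect segment name -> sequence from S-lines.
    let mut segments: HashMap<&str, &str> = HashMap::new();
    for line in gfa.lines() {
        let f: Vec<&str> = line.split('\t').collect();
        if f.len() >= 3 && f[0] == "S" {
            segments.insert(f[1], f[2]);
        }
    }
    // Second pass: walk each P-line's oriented segment list.
    let mut fasta = String::new();
    for line in gfa.lines() {
        let f: Vec<&str> = line.split('\t').collect();
        if f.len() < 3 || f[0] != "P" {
            continue;
        }
        let mut seq = String::new();
        for step in f[2].split(',') {
            let (name, reverse) = match (step.strip_suffix('+'), step.strip_suffix('-')) {
                (Some(n), _) => (n, false),
                (_, Some(n)) => (n, true),
                _ => return Err(format!("step '{}' in path '{}' has no orientation", step, f[1])),
            };
            let s = segments
                .get(name)
                .ok_or_else(|| format!("path '{}' uses unknown segment '{}'", f[1], name))?;
            if reverse {
                seq.push_str(&reverse_complement(s));
            } else {
                seq.push_str(s);
            }
        }
        fasta.push_str(&format!(">{}\n{}\n", f[1], seq));
    }
    Ok(fasta)
}
```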

gfatk linear usage questions

Hi,
Thanks for your user-friendly software!
I am looking for a way to extract the longest path in a GFA file.
But when I try
gfatk linear xxx.gfa
I get an error:
Error: Edge coverage not found.

Could I achieve my goal without the coverage?

Best wishes!
Lan

Extract a BED file of the joined coordinates in `gfatk linear`

Add an option to extract a BED file of the coordinates of the joins between segments in a linearised fasta.
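One possible shape for this output is sketched below, assuming we already know how many bases each segment contributes to the linearised sequence after any overlap trimming. Each segment becomes one BED interval (0-based, half-open), so the joins are the shared interval boundaries. The function name and the four-column layout are assumptions.

```rust
/// Write one BED line per segment of a linearised record.
/// `pieces` holds (segment name, number of bases contributed) in path order.
pub fn segments_to_bed(record: &str, pieces: &[(&str, usize)]) -> String {
    let mut bed = String::new();
    let mut start = 0usize;
    for &(name, len) in pieces {
        // BED intervals are 0-based and half-open: [start, end).
        let end = start + len;
        bed.push_str(&format!("{}\t{}\t{}\t{}\n", record, start, end, name));
        start = end;
    }
    bed
}
```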

Remove redundancy from end of last joined contig

In gfatk linear remove the last overlap from the last contig. This will save redundant checks later down the pipeline.

Add `gfatk extract-chloro` command

Add a convenience subcommand to extract the putative chloroplast assembly.

This should be straightforward, as it should be of a rather consistent size and GC%.
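A size and GC% filter could be sketched as follows. The ranges are deliberately left as parameters, since no thresholds are given in this issue; any values passed in are placeholders to be tuned, and both function names are hypothetical.

```rust
/// GC fraction over A/C/G/T bases only; None if the sequence has none.
pub fn gc_fraction(seq: &str) -> Option<f64> {
    let mut gc = 0usize;
    let mut acgt = 0usize;
    for b in seq.bytes() {
        match b.to_ascii_uppercase() {
            b'G' | b'C' => {
                gc += 1;
                acgt += 1;
            }
            b'A' | b'T' => acgt += 1,
            _ => {} // ambiguity codes such as N are ignored
        }
    }
    if acgt == 0 {
        None
    } else {
        Some(gc as f64 / acgt as f64)
    }
}

/// Check a subgraph's total length and GC fraction against caller-supplied ranges.
pub fn looks_like_chloroplast(
    total_len: usize,
    gc: f64,
    len_range: (usize, usize),
    gc_range: (f64, f64),
) -> bool {
    (len_range.0..=len_range.1).contains(&total_len) && gc >= gc_range.0 && gc <= gc_range.1
}
```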

Hashmap for NodeIndex <-> segment ID

Hi,

I've been hacking some stuff together for a different project since this has a nice implementation of gfa->petgraph, but it was slowing down way too much for large graphs (10+ million nodes).

I didn't need as much fancy handling, so I just implemented a hashmap for NodeIndex to segment ID (and can later invert it), and the graph went from probably taking hours to load to 10 seconds or so, so that is a big improvement (I was inspired by your hint!).

gfatk/src/utils.rs

Line 165 in 2e295ae

/// This should 100% have been a map-like structure...
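The idea can be sketched without any dependencies as below. Plain `usize` values stand in for petgraph's `NodeIndex` to keep the example self-contained, and the function names are hypothetical, not the code from either project.

```rust
use std::collections::HashMap;

/// Build a node index -> segment ID lookup from (node index, segment ID) pairs.
pub fn index_to_segment(pairs: &[(usize, usize)]) -> HashMap<usize, usize> {
    pairs.iter().copied().collect()
}

/// Invert the lookup to get segment ID -> node index.
/// Assumes segment IDs are unique; duplicates would overwrite each other.
pub fn invert(map: &HashMap<usize, usize>) -> HashMap<usize, usize> {
    map.iter().map(|(&index, &segment)| (segment, index)).collect()
}
```

Each lookup is then a single hash probe rather than a scan over all nodes, which is where the load-time difference on very large graphs would come from.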

Thanks,
Alex