tolkit / gfatk
A plant organellar Graphical Fragment Assembly toolkit
License: MIT License
linear
Not sure how easy this is with the current implementation, but would be useful for the future.
Currently the user is only allowed integer numbers for names of the GFA elements (which get parsed to usize). A gfatk rename function is provided which converts any GFA to usize-parsable names. However, this is annoying, and gfatk should allow any arbitrary name for e.g. a segment name. These can be parsed as Vec<u8>'s.
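One way to support arbitrary names while keeping integer indices internally is a small lookup table. The sketch below is an illustration only: `NameTable`, `intern`, and `name` are hypothetical names, not part of gfatk, and it assumes names are stored as raw bytes and given dense usize IDs in order of first appearance.

```rust
use std::collections::HashMap;

/// Maps arbitrary byte-string segment names to dense usize IDs and back.
/// Hypothetical helper for illustration; not part of gfatk.
#[derive(Default)]
pub struct NameTable {
    ids: HashMap<Vec<u8>, usize>,
    names: Vec<Vec<u8>>,
}

impl NameTable {
    /// Return the ID for `name`, assigning the next free ID on first sight.
    pub fn intern(&mut self, name: &[u8]) -> usize {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len();
        self.ids.insert(name.to_vec(), id);
        self.names.push(name.to_vec());
        id
    }

    /// Look up the original name for an ID, if one was assigned.
    pub fn name(&self, id: usize) -> Option<&[u8]> {
        self.names.get(id).map(|n| n.as_slice())
    }
}
```

With a table like this, the graph code could keep working on usize IDs and translate back to the original names only when writing output.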
The time has come.
gfatk extract-chloro should only extract one subgraph, as we expect a consistent structure.
gfatk extract-mito should allow for multiple contigs being extracted. These contigs are usually single segments, and circular, so we can restrict our extraction criteria to these at the moment, unless there is good evidence otherwise.
If we have both the -e flag and the -i flag, and one of the subgraphs overflows, should we fall back on normal linearisation?
Parse an argument or file which specifies the path. Must include orientations. E.g.
1-,2+,3+
Where each of the integers is a segment in the graph.
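A minimal sketch of parsing such a path string follows. It assumes integer segment names and a trailing `+` or `-` on every step, as in the example above; `Orientation` and `parse_path` are hypothetical names and not necessarily how gfatk does it.

```rust
/// Orientation of a segment within a path.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Orientation {
    Forward,
    Reverse,
}

/// Parse a path such as "1-,2+,3+" into (segment ID, orientation) steps.
/// Hypothetical helper; every step must end in '+' or '-'.
pub fn parse_path(s: &str) -> Result<Vec<(usize, Orientation)>, String> {
    s.split(',')
        .map(|step| {
            let step = step.trim();
            // Split the trailing orientation character from the segment ID.
            let (id, orient) = if let Some(id) = step.strip_suffix('+') {
                (id, Orientation::Forward)
            } else if let Some(id) = step.strip_suffix('-') {
                (id, Orientation::Reverse)
            } else {
                return Err(format!("missing orientation in step '{}'", step));
            };
            let id = id
                .parse::<usize>()
                .map_err(|e| format!("bad segment ID '{}': {}", id, e))?;
            Ok((id, orient))
        })
        .collect()
}
```

Returning an error for a step without an orientation keeps the "must include orientations" rule explicit rather than silently assuming a default strand.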
Might be nice to rename the nodes of an input GFA to be 1-indexed, so it can work with the rest of the toolkit. MBG does this by default. I never got around to implementing arbitrary index names, but oh well.
As the linear algorithm is brute force, the number of possible paths grows very rapidly as the number of nodes increases. At the moment, gfatk hangs if there are too many nodes in the graph (it tries to do the calculation). We should terminate calculations for number of nodes > 60? Probably should do some tests.
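To give a feel for the growth, and for what an early exit might look like, here is a rough sketch. It uses n! (the number of orderings of n nodes) as an assumed worst-case bound; the real number of paths depends on the graph's topology. Both function names are hypothetical, and the limit is left as a parameter because the value of 60 above is only tentative.

```rust
/// Upper bound on the orderings of `n` nodes (n!), or None if the value
/// overflows u128. An assumed worst case, used only for illustration.
pub fn max_orderings(n: u32) -> Option<u128> {
    (1..=n as u128).try_fold(1u128, |acc, k| acc.checked_mul(k))
}

/// Refuse to start a brute-force search when the graph has more nodes
/// than `limit`. Hypothetical guard; the right limit needs testing.
pub fn check_node_limit(n_nodes: usize, limit: usize) -> Result<(), String> {
    if n_nodes > limit {
        Err(format!(
            "graph has {} nodes, above the limit of {}; skipping linearisation",
            n_nodes, limit
        ))
    } else {
        Ok(())
    }
}
```

Under this bound, even 20 nodes give 2,432,902,008,176,640,000 orderings, and 60 nodes no longer fit in a u128, which is why failing fast seems preferable to attempting the calculation.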
Hi, thanks for providing this great toolkit!
This toolkit is the only one I could find that can provide the FASTA sequence for a path. The only thing is, I have to specify the path manually. Would it be possible to supply the PathName of a path (or by default obtain the sequence for all paths)? My current workaround for one path is:
pathname="G1S1" #name of the path
pathsegments=$(awk -vpathname=${pathname} 'BEGIN{FS = OFS = "\t";} $1=="P" && $2==pathname {print $3;}' test.gfa) #obtaining the order of segments for the path
gfatk path test.gfa "${pathsegments}" | awk -vpathname=${pathname} '/^>/{$1=">"pathname} {print;}' #use gfatk for obtaining the sequence and change fasta header back to name of the path
However, ideally I think it should be possible to say something like gfatk path test.gfa --all > all_paths.fa, if P-lines are specified in the GFA format. (I have not found any tool that can do this but yours seems closest.) What are your thoughts on this?
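For illustration, collecting every named path from a GFA could look roughly like the sketch below. It assumes GFA 1 P-lines, which are tab-separated with the path name in the second field and the oriented segment list in the third; `path_lines` is a hypothetical function, not an existing gfatk API, and a `--all` flag does not exist yet.

```rust
/// Collect (path name, segment list) pairs from the P-lines of GFA text.
/// Hypothetical helper; lines that are not well-formed P-lines are skipped.
pub fn path_lines(gfa: &str) -> Vec<(String, String)> {
    gfa.lines()
        .filter_map(|line| {
            let mut fields = line.split('\t');
            // Field 1 is the record type; only "P" records describe paths.
            if fields.next()? != "P" {
                return None;
            }
            let name = fields.next()?;
            let segments = fields.next()?;
            Some((name.to_string(), segments.to_string()))
        })
        .collect()
}
```

Each returned segment list (for example `1+,2-`) is already in the form that gfatk path takes on the command line, so the name could be reused directly as the FASTA header.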
Hi,
Thanks for your user-friendly software!
I am looking for a way to extract the longest path in a GFA file.
But when I try
gfatk linear xxx.gfa
I get an error:
Error: Edge coverage not found.
Could I achieve my goal without the coverage?
Best wishes!
Lan
Add an option to extract a BED file of the coordinates of the joins between segments in a linearised fasta.
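A rough sketch of how the join coordinates might be derived is shown below. It assumes the segments are given in linearised order, that each length is the number of bases the segment contributes after overlaps are trimmed, and that a join is reported as a zero-length BED feature at the boundary. `join_bed` and the `left_right` naming of features are hypothetical choices.

```rust
/// Build BED lines marking the joins between consecutive segments.
/// `segments` holds (name, contributed length) in linearised order.
/// Hypothetical helper; each join is a zero-length feature at the boundary.
pub fn join_bed(contig: &str, segments: &[(&str, usize)]) -> Vec<String> {
    let mut offset = 0usize;
    let mut out = Vec::new();
    for pair in segments.windows(2) {
        // The join sits where the left-hand segment ends (0-based).
        offset += pair[0].1;
        out.push(format!(
            "{}\t{}\t{}\t{}_{}",
            contig, offset, offset, pair[0].0, pair[1].0
        ));
    }
    out
}
```

If zero-length features prove awkward for downstream tools, the same offsets could instead be written as one record per segment, with the joins implied by the record boundaries.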
In gfatk linear, remove the last overlap from the last contig. This will save redundant checks later down the pipeline.
Add a convenience subcommand to extract the putative chloroplast assembly.
This should be straightforward, as it should be of a rather consistent size and GC%.
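The size and GC% criteria could be checked along these lines. This is a sketch under stated assumptions: `gc_fraction` and `looks_like_target` are hypothetical names, and the acceptable length and GC ranges are left as parameters because the source does not give concrete values.

```rust
/// Fraction of G/C bases in a sequence (case-insensitive), or None if empty.
pub fn gc_fraction(seq: &[u8]) -> Option<f64> {
    if seq.is_empty() {
        return None;
    }
    let gc = seq
        .iter()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count();
    Some(gc as f64 / seq.len() as f64)
}

/// True if a subgraph's total length and GC fraction both fall inside the
/// given inclusive ranges. Hypothetical filter; ranges are caller-supplied.
pub fn looks_like_target(
    total_len: usize,
    gc: f64,
    len_range: (usize, usize),
    gc_range: (f64, f64),
) -> bool {
    (len_range.0..=len_range.1).contains(&total_len)
        && gc >= gc_range.0
        && gc <= gc_range.1
}
```

Keeping the ranges as arguments would let the same check serve other organelle subcommands with different expectations.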
Hi,
I've been hacking some stuff together for a different project since this has a nice implementation of gfa->petgraph, but it was slowing down way too much for large graphs (10+ million nodes).
I didn't need as much fancy handling, so I just implemented a hashmap for NodeIndex to segment ID (and can later invert it), and the graph went from probably taking hours to load to 10 seconds or so, so that is a big improvement (I was inspired by your hint!).
Line 165 in 2e295ae
Thanks,
Alex