Comments (4)
Thanks for the suggestion. Unfortunately I think for my personal workflow it would be too messy to have to selectively process some pod5s like this. My current process creates folders like this:
├── data
└── pod5_links
    ├── block_1  # full of symlinks to pod5s
    ├── block_2  # full of symlinks to pod5s
    └── ...
That way I can just use each block_# folder as the argument to a dorado call. Selectively processing pod5s would add too much complexity for my liking.
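The folder-of-symlinks batching described above can be sketched roughly like this (a self-contained demo with made-up file names and block count, not the poster's actual script):

```shell
# Round-robin pod5 files into block folders of symlinks.
# Demo setup: a few empty stand-in "pod5" files.
mkdir -p demo_data
for n in 1 2 3 4 5 6 7 8; do : > "demo_data/file$n.pod5"; done

BLOCKS=4
i=0
for F in demo_data/*.pod5; do
    BLOCK="pod5_links/block_$(( i % BLOCKS + 1 ))"
    mkdir -p "$BLOCK"
    # link by absolute path so the symlink resolves from any working dir
    ln -sf "$(readlink -f "$F")" "$BLOCK/"
    i=$(( i + 1 ))
done
```

Each block folder can then be passed to dorado as its input directory, with no data duplicated.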
from pod5-file-format.
Hi @Shians, yes absolutely.
If approximately similar batch sizes are acceptable, rather than exactly N reads per batch, you can get a more performant workflow.
Depending on your pipeline (assuming it involves basecalling), I'd recommend batches of more than 4_000 records, as dorado has a non-zero setup time: it needs to load the model, reference, etc.
Tip
If you're just basecalling these records with dorado, then instead of subsetting files and cloning the data (which is very IO-intensive and can be slow), use the -l, --read-ids argument ("A file with a newline-delimited list of reads to basecall"). dorado will search for the read ids in the whole dataset, so you can distribute jobs by lists of ids instead of by providing completely separate inputs.
Approximate sizes suggestion
This will be much quicker as merging is simple and requires no searching for specific records.
Please edit for your specific needs. This is an untested example but should be sufficient to show what to do.
# count records per file (-H drops the header row from the output)
pod5 view data/ --include "filename" -H | sort | uniq -c > records_per_file.txt
head records_per_file.txt
100 file1.pod5
1000 file2.pod5
1234 file3.pod5
...
# Writes the filenames for each batch to data/output_X.txt
awk -v N=<VALUE> '
BEGIN { file_count = 0 }
{
    sum += $1
    printf "%s\n", $2 > ("data/output_" file_count ".txt")
    if (sum >= N) {
        sum = 0
        file_count++
    }
}' records_per_file.txt
for OUT in $(find data/ -iname "output_*.txt"); do
    NEW_POD5="${OUT%.txt}.pod5"
    pod5 merge $(cat "$OUT") -o "$NEW_POD5"
done
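The binning step can be sanity-checked on a synthetic records_per_file.txt; N=2000 here is an arbitrary demo value, not a recommendation:

```shell
# Synthetic input in the "count filename" format produced by uniq -c
printf '100 file1.pod5\n1000 file2.pod5\n1234 file3.pod5\n900 file4.pod5\n' > records_per_file.txt
mkdir -p data

# Greedy binning: accumulate counts until the batch reaches N, then start
# a new output file
awk -v N=2000 'BEGIN { file_count = 0 }
{
    sum += $1
    printf "%s\n", $2 > ("data/output_" file_count ".txt")
    if (sum >= N) { sum = 0; file_count++ }
}' records_per_file.txt

wc -l data/output_*.txt
```

With this input, output_0.txt collects file1 through file3 (100+1000+1234 crosses the 2000 threshold) and output_1.txt gets file4.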
Exact subsetting into N equally sized batches
pod5 view data/ -IH > all_read_ids.txt
split -n l/${BATCHES} -a 4 -d --additional-suffix .txt all_read_ids.txt batch.
# start a fresh mapping file (">" rather than ">>" so reruns don't append)
echo "read_id,dest" > mapping.csv
for BATCH in $(find . -iname "batch.*txt"); do
    NEW_POD5="${BATCH%.txt}.pod5"
    awk -v dest="$NEW_POD5" '{print $1 "," dest}' "$BATCH" >> mapping.csv
done
pod5 subset data/ --table mapping.csv --columns dest
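Before running the subset it is worth checking that mapping.csv covers every read id exactly once. A minimal sketch with stand-in files (a real run would use the all_read_ids.txt and mapping.csv produced above):

```shell
# Demo stand-ins for the intermediate files (ids and destinations are made up)
printf 'id1\nid2\nid3\nid4\n' > all_read_ids.txt
printf 'read_id,dest\nid1,a.pod5\nid2,a.pod5\nid3,b.pod5\nid4,b.pod5\n' > mapping.csv

# every read id should appear exactly once in the mapping (plus the header)
IDS=$(wc -l < all_read_ids.txt)
MAPPED=$(tail -n +2 mapping.csv | cut -d, -f1 | sort -u | wc -l)
[ "$IDS" -eq "$MAPPED" ] && echo "mapping covers all $IDS read ids"
```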
Thanks for the fast reply! My current workflow batches jobs up by folder: I take all the existing pod5 files and create symbolic links into folders, each containing an equal number of pod5 files, then run dorado on the folders of links so I don't duplicate the data.
Your solutions are helpful, and I think I will adapt them into a strategy where I identify all files larger than 1GB and break only those up into ~1GB pod5s. Perhaps I will also aggregate files smaller than 100MB with pod5 merge.
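That triage could be sketched like this. The demo uses sparse stand-in files created with truncate (not real pod5 data), and the pod5 follow-up commands are indicated but not run:

```shell
# Sort pod5 files into "too big" and "too small" buckets by size.
mkdir -p demo_sizes && cd demo_sizes
truncate -s 2G   big.pod5       # sparse stand-in files for the demo
truncate -s 50M  small_a.pod5
truncate -s 50M  small_b.pod5
truncate -s 500M medium.pod5

find . -name '*.pod5' -size +1G   > to_split.txt   # break these into ~1G parts
find . -name '*.pod5' -size -100M > to_merge.txt   # aggregate these

# next steps (not run here): subset each file in to_split.txt as in the
# examples above, then: pod5 merge $(cat to_merge.txt) -o merged_small.pod5
```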
@Shians,
Please try using the -l, --read-ids dorado argument.
You can pass a symbolic link to the same pod5 file but instruct dorado to basecall only the first half of read ids and then have another worker basecall the other half. This way you don't need to duplicate any input data.
Something like this, but please make sure you're not missing or duplicating read_ids when splitting them.
# get all read_ids from the pod5 quickly (-I: ids only, -H: no header)
pod5 view my.pod5 -IH > all_read_ids.txt
# calculate the total number of read_ids
NUM_READS=$(wc -l < all_read_ids.txt)
# split into two parts; tail starts just after head's half,
# so no read_id is lost when NUM_READS is odd
head -n $((NUM_READS / 2)) all_read_ids.txt > first_half.txt
tail -n +$((NUM_READS / 2 + 1)) all_read_ids.txt > second_half.txt
# first worker
dorado basecaller <model> my.pod5 --read-ids first_half.txt ..... > first_half.bam
# second worker
dorado basecaller <model> my.pod5 --read-ids second_half.txt ..... > second_half.bam
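To guard against off-by-one mistakes when halving the list, GNU split (already used above) can do the partitioning, and the result is easy to verify. A sketch on a synthetic id list standing in for a real all_read_ids.txt:

```shell
# Synthetic id list with an odd line count, to show the halves
# recombine exactly
seq 1 101 > all_read_ids.txt

# l/2 splits by lines, never breaking a line across output files
split -n l/2 -d --additional-suffix .txt all_read_ids.txt half_

# the two halves together must be exactly the original list
cat half_00.txt half_01.txt > recombined.txt
cmp all_read_ids.txt recombined.txt && echo "no ids lost or duplicated"
```

Each half file can then be passed to a separate dorado worker via --read-ids.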