A demo of how to use the oceania-query-fasta Python package to run queries on the OcéanIA FASTA Query Service.
oceania-query-fasta is a pip-installable Python client for the OcéanIA FASTA Query Service, an online service for querying large FASTA files stored in the OcéanIA data storage. It currently supports the Ocean Microbial Reference Gene Catalog v2, with 100 GB of gzipped FASTA, CSV, and TSV files.
With oceania-query-fasta you do not need to move large files around. Instead, you run queries on our online service right from your Python code and get the results as a Pandas DataFrame.
Requirements:
- Python 3.7.3 or newer
- Virtual env
- Git
Clone the repository, create and activate a Python virtual environment, and install the dependencies:
git clone https://github.com/Inria-Chile/oceania-query-demo.git
cd oceania-query-demo/
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
The library may be used directly as a command line tool:
oceania query-fasta -h
Usage: oceania query-fasta [OPTIONS] <key> <query_file> <output_format>
<output_file>
Extract secuences from a fasta file in the OceanIA Storage.
<key> object key in the OceanIA storage
<query_file> CSV file containing the values to query.
Each line represents a sequence to extract in the format "sequence_id,start,end,type"
"sequence_id" sequence ID
"start" start index position of the sequence to be extracted
"end" end index position of the sequence to extract
"type" type of the sequence to extract
options are ["raw", "complement", "reverse_complement"]
type value is optional, if not provided default is "raw"
<output_format> results format
options are ["csv", "fasta"]
<output_file> name of the file to write the results
Options:
-h, --help Show this message and exit.
For general information about the command line tool, run:
oceania -h
Usage: oceania [OPTIONS] COMMAND [ARGS]...
A simple OceanIA command line tool.
Options:
-h, --help Show this message and exit.
Commands:
query-fasta Extract secuences from a fasta file in the OceanIA Storage.
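When the same query has to be run for several samples, it can help to assemble the command line from Python. The following is a minimal sketch based only on the usage string shown above: `build_query_command` and `OUTPUT_FORMATS` are hypothetical names introduced here for illustration, not part of oceania-query-fasta, and the helper only builds the argument list without running anything.

```python
# Output formats listed in the `oceania query-fasta -h` help text.
OUTPUT_FORMATS = ("csv", "fasta")


def build_query_command(key, query_file, output_format, output_file):
    """Return the argument list for one `oceania query-fasta` call.

    Hypothetical helper: it only assembles the arguments in the order given
    by the usage string and does not execute the command.
    """
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")
    return ["oceania", "query-fasta", key, query_file, output_format, output_file]


command = build_query_command(
    "TARA_A100000171",
    "query_tara_a100000171.csv",
    "csv",
    "example_tara_a100000171.output.csv",
)
print(" ".join(command))
```

With the virtual environment activated, the resulting list could be passed to `subprocess.run(command, check=True)` from the standard library to run the query.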
The sample-data/ folder contains the query file query_tara_a100000171.csv:
TARA_A100000171_G_scaffold48_1,10,50,complement
TARA_A100000171_G_scaffold48_1,10,50
TARA_A100000171_G_scaffold48_1,10,50,reverse_complement
TARA_A100000171_G_scaffold181_1,0,50
TARA_A100000171_G_scaffold181_1,100,200
TARA_A100000171_G_scaffold181_1,200,230
TARA_A100000171_G_scaffold493_2,54,76
TARA_A100000171_G_scaffold50396_2,87,105
TARA_A100000171_G_C2001995_1,20,635
TARA_A100000171_G_C2026460_1,0,100
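A query file in this layout can also be generated from Python instead of being written by hand. The sketch below uses only the standard library and follows the "sequence_id,start,end,type" format from the help text. `write_query_rows` is a hypothetical helper, and the range check (non-negative start, end not smaller than start) is an assumption made here, not a rule documented by the service.

```python
import csv
import io

# Sequence types listed in the `oceania query-fasta -h` help text.
SEQUENCE_TYPES = ("raw", "complement", "reverse_complement")


def write_query_rows(rows, handle):
    """Write (sequence_id, start, end[, type]) rows in the query-file layout.

    Hypothetical helper, not part of oceania-query-fasta.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for row in rows:
        if len(row) not in (3, 4):
            raise ValueError(f"expected 3 or 4 fields, got {row!r}")
        sequence_id, start, end = row[0], int(row[1]), int(row[2])
        # Assumption: positions are non-negative and end is not before start.
        if start < 0 or end < start:
            raise ValueError(f"invalid range in {row!r}")
        # The type field is optional; the help text says it defaults to "raw".
        if len(row) == 4 and row[3] not in SEQUENCE_TYPES:
            raise ValueError(f"unknown type in {row!r}")
        writer.writerow([sequence_id, start, end, *row[3:]])


rows = [
    ("TARA_A100000171_G_scaffold48_1", 10, 50, "complement"),
    ("TARA_A100000171_G_scaffold48_1", 10, 50),
    ("TARA_A100000171_G_scaffold181_1", 0, 50),
]
buffer = io.StringIO()
write_query_rows(rows, buffer)
query_text = buffer.getvalue()
print(query_text, end="")
```

To write to disk rather than to memory, pass a file opened with `open(path, "w", newline="")` as `handle`.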
Run the query:
oceania query-fasta TARA_A100000171 query_tara_a100000171.csv csv example_tara_a100000171.output.csv
[08-06-2021 21:48:52] Sending request for fasta sequences
[08-06-2021 21:48:54] Request accepted
[08-06-2021 21:48:54] Waiting for results...
Then check the output file example_tara_a100000171.output.csv, which should look like:
TARA_A100000171_G_scaffold48_1,10,50,complement,ACCGTAACGTAGGCCATATTATTTTCATGGTCTTCCACAA
TARA_A100000171_G_scaffold48_1,10,50,raw,TGGCATTGCATCCGGTATAATAAAAGTACCAGAAGGTGTT
TARA_A100000171_G_scaffold48_1,10,50,reverse_complement,AACACCTTCTGGTACTTTTATTATACCGGATGCAATGCCA
TARA_A100000171_G_scaffold181_1,0,50,raw,CCAAGACCAAGCAATTTTAACACCACACTTAGATACTGCGCAAACAGCGT
TARA_A100000171_G_scaffold181_1,100,200,raw,ATTATGTTACCAGCACTTGATAACCAAAAAGTTTGGGcaggattaaaattaactaaTGATCAATTAATTGCAACTGACGATGATCAAGCATACTTTAAGT
TARA_A100000171_G_scaffold181_1,200,230,raw,ATCAAACTGATGCTACTAACTCAGAAGCAT
TARA_A100000171_G_scaffold493_2,54,76,raw,TAAGTTTTTATTATTATATTTT
TARA_A100000171_G_scaffold50396_2,87,105,raw,AGCTGTTCGGAAAACTAG
TARA_A100000171_G_C2001995_1,20,635,raw,ACAGCACACCAAGCAGGTCGTCGACCGAAACGATATTGAGAAGAATAAGAACGGAAACCGCGATGGCTGCACTCACCTCCGGCGAGCGCCATTCGCGGGCAAACGCTATAAAGAGACCGATAATGACGACGCCAACGATCAGCGCGCCATAGGGCTCAATCAGGCTAGCGAACAAATGCACCCTCCGCTCGGTCCACGGCGCACTCTATGCGATGCCGGCCTGTATTGGAAAGCAGTCAGAATCAATTCGGACTTCTTTTTTAAGCAAACGGGCTTGGGCATTACCGCCCGGATAATGTACGGCTGACTGCATCCCGCCAACCGGCCAGCTTTTCCTTGCGCGCCGCTCCGTCCATTTCGGGAACGAACTGACGTTCGAGCGCCCAGCTTCTTGAAAACGCTTCTTGATCCGGCCAAAGCCCTGTCGCTTGCCCTGCGAGCCAGGCGGCCCCCAGAACCGTTGTTTCGAGCATATTTGGCCGGTCGACCGGTGCGTCGAGAA
TARA_A100000171_G_C2026460_1,0,100,raw,AATTTGAAACAACCCTAAAGTGTTTACCATAATAGGTTCTTAAATCAAAACCAACATTCCAAGTTAGGTTGTCGCCTAGCTTTTTCTCAAGGTTTGAAAT
The previous steps can also be run with the Bash script query_tara_a100000171.sh:
bash query_tara_a100000171.sh
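The three output rows for TARA_A100000171_G_scaffold48_1 show how the "raw", "complement", and "reverse_complement" types relate to each other. The sketch below reproduces that relation locally with standard base pairing and checks it against the sequences from the sample output. It illustrates the relation only and is not the service's implementation; in particular, how the service treats ambiguous bases such as N is not covered here.

```python
# Standard Watson-Crick base pairing. Case is preserved because the sample
# output contains lowercase bases.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence):
    """Return the base-wise complement, keeping the original orientation."""
    return sequence.translate(COMPLEMENT_TABLE)


def reverse_complement(sequence):
    """Return the complement read in the opposite direction."""
    return complement(sequence)[::-1]


# Sequences copied from the sample output for scaffold48_1, positions 10 to 50.
raw = "TGGCATTGCATCCGGTATAATAAAAGTACCAGAAGGTGTT"
expected_complement = "ACCGTAACGTAGGCCATATTATTTTCATGGTCTTCCACAA"
expected_reverse_complement = "AACACCTTCTGGTACTTTTATTATACCGGATGCAATGCCA"

print(complement(raw) == expected_complement)
print(reverse_complement(raw) == expected_reverse_complement)
```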
For this example we use the file sample-queries/query_tara_r110002003.csv:
TARA_R110002003_G_scaffold3_1,3290,6293
TARA_R110002003_G_scaffold3_3,0,327
TARA_R110002003_G_scaffold3_3,944,2742
TARA_R110002003_G_scaffold3_4,379,379
TARA_R110002003_G_scaffold3_4,1530,1669
Execute the query:
oceania query-fasta TARA_R110002003 query_tara_r110002003.csv csv example_tara_r110002003.output.csv
[08-06-2021 21:48:52] Sending request for fasta sequences
[08-06-2021 21:48:54] Request accepted
[08-06-2021 21:48:54] Waiting for results...
Then check the output file example_tara_r110002003.output.csv:
id,start,end,type,sequence
TARA_R110002003_G_scaffold3_1,3290,6293,raw,TGATCGGGAGTCCTCCAGGCTTTGGATCGTTTGGGATAGATTTGTTCGAAGGAATACGGTGTCAGGAAAAGAGGATGAGGGATCGATAGTTGTGAGCTGGCATGAGCCATCAACGGTTCTGGAGTCTCGGGTACAAGTCTCACGCAGGTCTGACTGCTGGGCCACGTGCTGAAATGTATTGCTTGTAAAAGCAAATGCTTCACCGAGTAGGGTACAACAGATTGCGAATCGCATGATTTTGGATTGTTCGAGAGGTTGAATGTCTGAGAAGACGAACTTACTACTACAGCCTGCAAAGATTCATTGGGGTTGATATACTGTTGACGGTGGAGTTGGTGCGCCGAGTTATGAAACGCGGGATCGCAGTGAAGCGAAGAGCTGAAACATTTACTGCGAAACATGCCGTCTGTGTTCGAAACTGTACAGCTACCTCGTTGCTACAGCTTGAGTCTACGGGCACCGACTTCAGGCAGCACAATAGGCGCTCCTGACCTCTGCAGGAGGTACTATGAGCTTGCTGTTGAAGGCCTTATGCCACTAATTTGACGAGACCTGAGTTGCTACCCGCACATTTAAACATGCAAGACATACATCATGACAGCTTCGTTAATTGGGTCCGTCGATACAAGATCGAGCGGCGGAAATATCGATGAGCGCTGTTTTCAATAGTGTACTGTGATTTGCGATTTGCGGGGGAAGCAAGAGCGAGACGCGGATGACGGGGGAAGGTTGTCGCATTTGTTGTTCGAGGCTGAAACGAAGCGCTGCTCGGCAAAGCCTGCCATTCCGCGCTGGGAGGCTCGCCATTTCTTTCTTCCAATTGGACGAGGGAGCGTCTTGAGAATTTTCGAAATGACATGAAAGTCCAATAAGTCGATAGGCATGTTGACCGAGTCCGTAGGCACGAAAGACTGCAGCATTGATTTGAATTGCCCTCAATTTTCTTTGGTGACTACTCGATCGATCTCTGCCCACAATGTTGTTGCTCAGCACGACAACGGACTGCCATCCCTGGGACTATGAGTCAAGGGTGGAATGGTGATGACCGGTCATGATGTGCCAGAAGAGACAGCCATTGACTTGCCAGACGCATCTCCTGGTGCGATTGGCGCGATCAGACCCTTCGGCAGTGCGATACTGTATGCTCATTGATGTCATTCGTTATGGCTGAACGAGATGTCACACCCCTCGCGCCATGCAATGATCGCGGATCTCTGAAGACGCTGATGTTGTCGTCTAGCTCACTGTTATTTTCTCAAGAACTTCGCGACGGTAATCTCGTCCGAGGTGTCGGGCGTACTGATTGCACGTGCATGGGCATATGGGTGCGGTTGTTGACCAGGATGGGCCGTTTGGATGACTGCGCTCCGGGTAGGAGTCGCCAAGAGCTTACATGGTGCAAGAACAGAGGGCGGTATGTCGACATCTGTGAGGCGGCAGGCTGAAGTCAAGCTATTTCTGTCCTGATCAGCGCGAGGGCAGACAGGCAATTGTGCGACGGTAAATTCGGTGCCAATGGCTCTCGATATCAACGACAGGCCGGCGTCGTGCCCAGCACTACCGCCGGGCCGCCTAGTGCGAGCCCTTGCTGACGGGTTGCGCGTTATGCACCTGTTCAGTAGATTCATGAGCTCATGGTAGTGGTGGTAGCTGGCGTGGTGCTGAACGGGATGGTGAATGGTTCCGATGCTCCGATCCGAACCCCAAAGCGCCCCTCGCAACCCTACCCGGTAGCAGATGCCCCGGGATTCAGGCCACGTCAACGTAAGCTCAGCAAAGTGCGCAAAGACCTGCCTCAACTAGTCGACACTGTGGCCAAGCCTTTGCATATTCACCAGAGAGAGGAGAGCTCTCTTGTACGCCCCGTACCGTGTAGCACAGTCAGCGCGTCGAGAGTCCTGGAGTCGTCTCGTCGTAAGTCACGGTAATGGCAGTACAACGGGCGCGAAAGTCGAC
ATAAGACCAGCTCCTCGAGCGAACAAGCCCGTCATCTTGATCGATGGAGCAAACATCTCAGGCTCTCTAGCTTTGTTCATCGGAGTTGGAACTGAAGACATTGTTTTTGGTGTTTGCGACCACAAGTTGAACGTCAACCGGCGCAAGAACGCCCCGAGCCCGATTGACTTCCGCTCCGCCCCCTTCTCGTCCGACGTGCACCGCCTTTCCACCATGTGGCATGACCCACCAAGGCAGCCAATCTGGCGCCCAATGATCTTCGTTGCTTCCAGCCTCATGTCCCAGTTTGGTTCGACCTAATCCGTTCTGGGGCCATTCTGCGATGTCCTGGAAACGCACGCTACACGCCACACTGCCTAGGATTCCGACCGCTGGACGGGAGTTATTATCTGACATGCTAGACTCGCGTCTTCGACCGCTTTGGGAAGGGCTTCGTCTCCACTGCGCGCTCGTAATCTACGCGTAGTCGCTGCCTCAGGAGGATTACTGCAGCGCCATGGAGAGGCGAATGCGAAACAACATTGTGTCCGGTCACCACTCGCAGCATATAAGGCATCAGGTTTCGCCCTCGTAAGCGATTGTTCCAACCCAAGTCAGCATTACTACACTCGCAACGAGAGACTATTCGCCTCGGCCTCCCCTTCAGAGTCATAGCTAGAAGTTTCTCATTGGCTTCCTTTCGACACAACTTCACCTCGCAATCTGCAAAATGACTGTTCCCCTACCAAACCCCGATCTCTGTCAGTCGAAATTGTCCAGCTTCACACCTTCAACTCACTAATCTGTTTCTCAAAGGTACCGCAGTCGGTCAGGTCGTTGCTGGCCGGCCATGTACAGTCGAAGGGACACTCTACGGCTACTACCCTAGCCTTGGCGCCAACGCTTTCTTCGCTGCTTTCTTCGCGGTCTGCTTTGCCTGGCAACTATATTGCGGCATCAGATACAAGACATGGACCTACATGGTGTGTATTGGAGTACGAGTCCTCATTGTCCATGACGCTGACCACTAGCAGATCGCCCTTTGTCTTGGGTGTGTCGGTGAAGCCG
TARA_R110002003_G_scaffold3_3,0,327,raw,TCCCTCTACACAGAGCAAACCTCCCAGGTAAGATCAGCCCGGGCTAGTCCCCTACCTGGGGTCGATGGATAAATAACCTTGAGTCCAGCTTACTTGCCCAGGATTCTACAGGCACTTCCGGGAGTGGTGTGAGCACTGATTCGACAGCCGAATACAGCGATGGCATGGCCTCATCGTACCGACACAGCTCCCGGGCTTCATACTCACCGTCTGTTGGACACCATCCTAGCTGGCCAGGCTCTAGCACGATTGCCTCATCATCCCAATCGATATCAACGAAAGGGAAGCAGCCAGCGCCCACGGCGGATGCTCTCGGGCGGCCATTTT
TARA_R110002003_G_scaffold3_3,944,2742,raw,CAACATCTCCCTCTTCTTTACTTTGAATCTCTCGTCCTTATTTCGTATCTATGAAATGAGTGCTAAAAATCTCAGGGAACGACTTCACAGCCTTTGCATCTATGTTTCACTTCTGGTACCTCATGCGATGGATGATATCACCGTCACCCGAAACTTACGAAGCGATCCCAGAATGGCTGAGACCAACGTAAGATACCGGTAGCAGCAGTTTGGTCTTTGCGCTCACGTTGTTCATTCTAGACCAAACCAGTTATTCATGCCTCACATCAACATGCTCGATTTCATCGCTTGGCCTGCGTTCCGCGAATTCGCTGTACAGGTTCCACGCATGCAAGAGCGGATGGACTGGATGATGGACATGAGTCTTACAATCCAGTGCGACTGGTCATTTGCCAACGATGAGGCTTTTCGAAGAGATGATGAGACAGGTTTGCTAGACCTATGTTTGGTGGCAAAGGTATGCTCACTTCGCTACATTAAGACTCCTCGAAAGAACCATGCAGTCAAAGGCTCAGGGACACACGCTATGAACCCTTGCTGACTAGGTCCAGACGGCTATGCGTGATCTCTCCTGTTGGTCTGTAGGGCCAACATTTAGAGCCTACGTAAGCAATGCGGATTCGTACGTGCGAATCAGGACAGAAGAATCATCCGGGTGAAGAGATTATTCGGACACCTTGGATATAATAGCCGAACGACAACATCAAATACAGTCTGTTGTGCAGCAACAACAAGAGTTTTATTTACGAATCTTTCCCAGATAAGTTATTATAATTGCCTCTAACTTACCACTTACTTAAGACTATAGAGCTGTAGAGGTTGTAGTGCTAACTATCATGCAAAAGGAAACCTTTGGTGGGGTGTCGAAATGTGACCGATTTTCTTTTACCCGGGTGGAACATTGACCGAGCTTGGTAACGACCTCCGCTTGGAAGGCGGAGTAAAGAAAGTGTAAGTTGCCCATACATACGTACTAGTAATCTCAGTCGGAAGCACGGAAAACCAGCATGCACACCAAGCCACTAAATAACACACCGATACCAAATGAAAACACCGCCAGGCATCTTTACGTCCGTCATCAGTACTACAACCTTCGCGCCATATACCGTTGGTACGTATGACGGCTTTTCGTACGGCCTTTTCACTGGATGTAATACCCATATGACTCGATATAAATATGCGAAACATCGTACGATGCGCCTCCAGAAATTCGATGACCACGTTAACTACGATGCACGTCATAAGTCGATGCTCATCGCGACAATGAGGGGCACGGAGGGGCAGACCCCCTGGTCAAGTCTTCCGACCCAATCATATTGTTCCTTTCCCTAGGGAAACTCGATCTCTTCATATAGAATCGATTCCGATCTTGTGATTCAACCACGGAAGTACCTCAGCTTGTCTGCTTGGGAGATGAGGCCGATTCACGACGGATTACGACGATTGCAGCGTGGGAGGACGTCTGGGCCAGTGGCGCTGCGGTAGTGGCGTTGTTCTAGTGTCGCAAACGGTCGTGATGGAAGCCGGATAGCTTCACACATTTGGGGGAGGGTCGAACGGAATATTACAAACAGATGGTGTTAAGTGCATGCGATCTTAGTGATGAGAGATGCTACTAACGAAGCTAGTCTTGCCGCTGCTGTGCCTTGTGAGGGATACCGGTAGGAGACCGATACCGTTAACTCAATCTCTCCAACCCGGAGACATAGCGCGGATCGGAATATGCATAGAACTTTTAGTCCAAGAGAGAAGCCAGTCGTAAGGAGAGTAGCAGGCAATGCCGAGTAGGTGACCAACT
TARA_R110002003_G_scaffold3_4,379,379,raw,
TARA_R110002003_G_scaffold3_4,1530,1669,raw,GAGCAATTTGCAGATGGTGGTGTAGTCCTCGAAGTTGGAACAGATGCTCGCGAGACTCCACGGTGTCAGGAGTGTCGGGAACCAACGATAGCTAGGAAAGTTAGTCCAGGCTCAGGGAACCAAAGGCCAAAAAAAAacc
The previous steps can also be run with the Bash script query_tara_r110002003.sh:
bash query_tara_r110002003.sh
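The sample outputs suggest that positions are zero-based with an inclusive start and an exclusive end: the query 10,50 returns 40 bases, 0,50 is accepted, and 379,379 returns an empty sequence. This is an inference from the samples, not documented behaviour. Under that assumption, the sketch below computes the expected length of each query as end minus start and flags queries that would return nothing. `expected_lengths` is a hypothetical helper written for this illustration.

```python
import csv
import io

# Query rows copied from query_tara_r110002003.csv.
QUERY_TEXT = """\
TARA_R110002003_G_scaffold3_1,3290,6293
TARA_R110002003_G_scaffold3_3,0,327
TARA_R110002003_G_scaffold3_3,944,2742
TARA_R110002003_G_scaffold3_4,379,379
TARA_R110002003_G_scaffold3_4,1530,1669
"""


def expected_lengths(handle):
    """Yield (sequence_id, start, end, length) for each query row.

    Assumes a half-open range, so the length is end - start.
    """
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        sequence_id, start, end = row[0], int(row[1]), int(row[2])
        yield sequence_id, start, end, end - start


lengths = list(expected_lengths(io.StringIO(QUERY_TEXT)))
empty_queries = [item for item in lengths if item[3] == 0]

for sequence_id, start, end, length in lengths:
    note = "  <- would return an empty sequence" if length == 0 else ""
    print(f"{sequence_id} [{start}, {end}): {length} bases{note}")
```

Running such a check before submitting a query file makes rows like 379,379 visible early, when they are more likely a typo than an intended request.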
The library may also be used directly as a Python package.
Run query_tara_a100000171.py with:
python3 examples/query_tara_a100000171.py
Run query_tara_r110002003.py with:
python3 examples/query_tara_r110002003.py
Example Jupyter notebooks are available in the notebooks/ folder. To use them, you can start a local Jupyter Notebook instance with
jupyter notebook
or, alternatively, use the Google Colab links that are provided below.
Navigate to notebooks/query_tara_a100000171.ipynb to find the code used in this example, then execute all the cells.
Navigate to notebooks/query_tara_r110002003.ipynb to find the code used in this example, then execute all the cells.
Note: more Jupyter notebooks are available in the notebooks/ folder.