mdshw5 / pyfaidx Goto Github PK

View Code? Open in Web Editor NEW

449.0 9.0 75.0 12.01 MB

Efficient pythonic random access to fasta subsequences

Home Page: https://pypi.python.org/pypi/pyfaidx

License: Other

Python 100.00%

python bioinformatics fasta indexing dna protein genomics bgzf samtools

pyfaidx's Introduction

Description

Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed FASTA file, while loading a minimal amount of the file in to memory. This python module implements pure Python classes for indexing, retrieval, and in-place modification of FASTA files using a samtools compatible index. The pyfaidx module is API compatible with the pygr seqdb module. A command-line script "faidx" is installed alongside the pyfaidx module, and facilitates complex manipulation of FASTA files without any programming knowledge.

If you use pyfaidx in your publication, please cite:

Shirley MD, Ma Z, Pedersen B, Wheelan S. Efficient "pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196. 2015.

Installation

This package is tested under Linux and macOS using Python 3.7+, and and is available from the PyPI:

pip install pyfaidx  # add --user if you don't have root

or download a release and:

pip install .

If using pip install --user make sure to add /home/$USER/.local/bin to your $PATH (on linux) or /Users/$USER/Library/Python/{python version}/bin (on macOS) if you want to run the faidx script.

Python 2.6 and 2.7 users may choose to use a package version from v0.7.2 or earier.

Usage

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")  # set strict_bounds=True for bounds checking

Acts like a dictionary.

>>> genes.keys()
('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')

>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1'

# Start attributes are 1-based
>>> genes['NM_001282543.1'][200:230].start
201

# End attributes are 0-based
>>> genes['NM_001282543.1'][200:230].end
230

>>> genes['NM_001282543.1'][200:230].fancy_name
'NM_001282543.1:201-230'

>>> len(genes['NM_001282543.1'])
5466

Note that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing one_based_attributes=False to Fasta or Faidx. This argument only affects the Sequence .start/.end attributes, and has no effect on slicing coordinates.

Indexes like a list:

>>> genes[0][:50]
>AB821309.1:1-50
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

Slices just like a string:

>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA

>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........

Slicing start and end coordinates are 0-based, just like Python sequences.

Complements and reverse complements just like DNA

>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG

Fasta objects can also be accessed using method calls:

>>> genes.get_seq('NM_001282543.1', 201, 210)
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
>NM_001282543.1 (complement):210-201
GCGGAACGAG

Spliced sequences can be retrieved from a list of [start, end] coordinates: TODO update this section

# new in v0.5.1
segments = [[1, 10], [50, 70]]
>>> genes.get_spliced_seq('NM_001282543.1', segments)
>gi|543583786|ref|NM_001282543.1|:1-70
CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC

Custom key functions provide cleaner access:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT

You can specify a character to split names on, which will generate additional entries:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop"
>>> genes.keys()
dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])

If your key_function or split_char generates duplicate entries, you can choose what action to take:

# new in v0.4.9
>>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
>>> genes.keys()
dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])

Filter functions (returning True) limit the index:

# new in v0.3.8
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])
>>> genes['XM_005249644']
KeyError: XM_005249644 not in tests/data/genes.fasta.

Or just get a Python string:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
>>> genes
Fasta("tests/data/genes.fasta", as_raw=True)

>>> genes['NM_001282543.1'][200:230]
CTCGTTCCGCGCCCGCCATGGAACCGGATG

You can make sure that you always receive an uppercase sequence, even if your fasta file has lower case

>>> from pyfaidx import Fasta
>>> reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)
>>> reference['gi|557361099|gb|KF435150.1|'][1:70]

>gi|557361099|gb|KF435150.1|:2-70
TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA

You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

Sequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:

# new in v0.3.7
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for record in genes:
...   print(record.name)
...   print(record.long_name)
...
gi|563317589|dbj|AB821309.1|
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1|
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1|
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
...

# new in v0.4.9
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
>>> for record in genes:
...   print(record.name)
...
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds

Records can be accessed efficiently as numpy arrays:

# new in v0.5.4
>>> from pyfaidx import Fasta
>>> import numpy as np
>>> genes = Fasta('tests/data/genes.fasta')
>>> np.asarray(genes['NM_001282543.1'])
array(['C', 'C', 'C', ..., 'A', 'A', 'A'], dtype='|S1')

Sequence can be buffered in memory using a read-ahead buffer for fast sequential access:

>>> from timeit import timeit
>>> fetch = "genes['NM_001282543.1'][200:230]"
>>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
>>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
>>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"

>>> timeit(fetch, no_read_ahead, number=10000)
0.2204863309962093
>>> timeit(fetch, read_ahead, number=10000)
0.1121859749982832
>>> timeit(fetch, string_slicing, number=10000)
0.0033553699977346696

Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.

If you want to modify the contents of your FASTA file in-place, you can use the mutable argument. Any portion of the FastaRecord can be replaced with an equivalent-length string. Warning: This will change the contents of your file immediately and permanently:

>>> genes = Fasta('tests/data/genes.fasta', mutable=True)
>>> type(genes['NM_001282543.1'])
<class 'pyfaidx.MutableFastaRecord'>

>>> genes['NM_001282543.1'][:10]
>NM_001282543.1:1-10
CCCCGCCCCT
>>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'
>>> genes['NM_001282543.1'][:15]
>NM_001282543.1:1-15
NNNNNNNNNNCTGGC

The FastaVariant class provides a way to integrate single nucleotide variant calls to generate a consensus sequence.

# new in v0.4.0
>>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', het=True, hom=True)
RuntimeWarning: Using sample NA06984 genotypes.

>>> consensus['22'].variant_sites
(16042793, 21833121, 29153196, 29187373, 29187448, 29194610, 29821295, 29821332, 29993842, 32330460, 32352284)

>>> consensus['22'][16042790:16042800]
>22:16042791-16042800
TCGTAGGACA

>>> Fasta('tests/data/chr22.fasta')['22'][16042790:16042800]
>22:16042791-16042800
TCATAGGACA

>>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', sample='NA06984', het=True, hom=True, call_filter='GT == "0/1"')
>>> consensus['22'].variant_sites
(16042793, 29187373, 29187448, 29194610, 29821332)

You can also specify paths using pathlib.Path objects.

#new in v0.7.1
>>> from pyfaidx import Fasta
>>> from pathlib import Path
>>> genes = Fasta(Path('tests/data/genes.fasta'))
>>> genes
Fasta("tests/data/genes.fasta")

Accessing fasta files from filesystem_spec filesystems:

# new in v0.7.0
# pip install fsspec s3fs
>>> import fsspec
>>> from pyfaidx import Fasta
>>> of = fsspec.open("s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta", anon=True)
>>> genes = Fasta(of)

It also provides a command-line script:

cli script: faidx

Fetch sequences from FASTA. If no regions are specified, all entries in the
input file are returned. Input FASTA file must be consistently line-wrapped,
and line wrapping of output is based on input line lengths.

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  -b BED, --bed BED     bed file of regions (zero-based start coordinate)
  -o OUT, --out OUT     output file name (default: stdout)
  -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed} transform the requested regions into another format. default: None
  -c, --complement      complement the sequence. default: False
  -r, --reverse         reverse the sequence. default: False
  -a SIZE_RANGE, --size-range SIZE_RANGE
                        selected sequences are in the size range [low, high]. example: 1,1000 default: None
  -n, --no-names        omit sequence names from output. default: False
  -f, --full-names      output full names including description. default: False
  -x, --split-files     write each region to a separate file (names are derived from regions)
  -l, --lazy            fill in --default-seq for missing ranges. default: False
  -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
                        default base for missing positions and masking. default: None
  -d DELIMITER, --delimiter DELIMITER
                        delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
  -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
                        python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
  -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
                        entry to take when duplicate sequence names are encountered. default: stop
  -g REGEX, --regex REGEX
                        selected sequences are those matching regular expression. default: .*
  -v, --invert-match    selected sequences are those not matching 'regions' argument. default: False
  -m, --mask-with-default-seq
                        mask the FASTA file using --default-seq default: False
  -M, --mask-by-case    mask the FASTA file by changing to lowercase. default: False
  -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
                        python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: None
  --no-rebuild          do not rebuild the .fai index even if it is out of date. default: False
  --version             print pyfaidx version number

Examples:

$ faidx -v tests/data/genes.fasta
### Creates an .fai index, but supresses sequence output using --invert-match ###

$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1:201-210
CTCGTTCCGC
>NM_001282543.1:300-320
GTAATTGTGTAAGTGACTGCA

$ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA
CTCGTTCCGC

$ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
CTCGTTCCGC
GTAATTGTGTAAGTGACTGCA

$ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:201-210 (complement)
GAGCAAGGCG

$ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:210-201
CGCCTTGCTC

$ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1:210-201 (complement)
GCGGAACGAG

$ faidx tests/data/genes.fasta NM_001282543.1
>NM_001282543.1:1-5466
CCCCGCCCCT........
..................
..................
..................

$ faidx --regex "^NM_00128254[35]" genes.fasta
>NM_001282543.1
..................
..................
..................
>NM_001282545.1
..................
..................
..................

$ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480
>NM_001282543.1:5460-5480
AAAAAAANNNNNNNNNNNNNN

$ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480
>NM_001282543.1:5460-5480
AAAAAAAQQQQQQQQQQQQQQ

$ faidx tests/data/genes.fasta --bed regions.bed
...

$ faidx --transform chromsizes tests/data/genes.fasta
AB821309.1  3510
KF435150.1  481
KF435149.1  642
NR_104216.1 4573
NR_104215.1 5317
NR_104212.1 5374
...

$ faidx --transform bed tests/data/genes.fasta
AB821309.1  1    3510
KF435150.1  1    481
KF435149.1  1    642
NR_104216.1 1   4573
NR_104215.1 1   5317
NR_104212.1 1   5374
...

$ faidx --transform nucleotide tests/data/genes.fasta
name        start   end     A       T       C       G       N
AB821309.1  1       3510    955     774     837     944     0
KF435150.1  1       481     149     120     103     109     0
KF435149.1  1       642     201     163     129     149     0
NR_104216.1 1       4573    1294    1552    828     899     0
NR_104215.1 1       5317    1567    1738    968     1044    0
NR_104212.1 1       5374    1581    1756    977     1060    0
...

faidx --transform transposed tests/data/genes.fasta
AB821309.1  1       3510    ATGGTCAGCTGGGGTCGTTTCATC...
KF435150.1  1       481     ATGACATCATTTTCCACCTCTGCT...
KF435149.1  1       642     ATGACATCATTTTCCACCTCTGCT...
NR_104216.1 1       4573    CCCCGCCCCTCTGGCGGCCCGCCG...
NR_104215.1 1       5317    CCCCGCCCCTCTGGCGGCCCGCCG...
NR_104212.1 1       5374    CCCCGCCCCTCTGGCGGCCCGCCG...
...

$ faidx --split-files tests/data/genes.fasta
$ ls
AB821309.1.fasta    NM_001282549.1.fasta    XM_005249645.1.fasta
KF435149.1.fasta    NR_104212.1.fasta       XM_005265507.1.fasta
KF435150.1.fasta    NR_104215.1.fasta       XM_005265508.1.fasta
NM_000465.3.fasta   NR_104216.1.fasta       XR_241079.1.fasta
NM_001282543.1.fasta        XM_005249642.1.fasta    XR_241080.1.fasta
NM_001282545.1.fasta        XM_005249643.1.fasta    XR_241081.1.fasta
NM_001282548.1.fasta        XM_005249644.1.fasta

$ faidx --delimiter='_' tests/data/genes.fasta 000465.3
>000465.3
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
.......

$ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
NM_000465.3 5523

$ faidx -m --bed regions.bed tests/data/genes.fasta
### Modifies tests/data/genes.fasta by masking regions using --default-seq character ###

$ faidx -M --bed regions.bed tests/data/genes.fasta
### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###

$ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
AB821309    1       3510
KF435150    1       481
KF435149    1       642
NR_104216   1       4573
NR_104215   1       5317
.......

Similar syntax as samtools faidx

A lower-level Faidx class is also available:

>>> from pyfaidx import Faidx
>>> fa = Faidx('genes.fa')  # can return str with as_raw=True
>>> fa.index
OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])

>>> fa.index['AB821309.1'].rlen
3510

fa.fetch('AB821309.1', 1, 10)  # these are 1-based genomic coordinates
>AB821309.1:1-10
ATGGTCAGCT

If the FASTA file is not indexed, when Faidx is initialized the build_index method will automatically run, and the index will be written to "filename.fa.fai" with write_fai(). where "filename.fa" is the original FASTA file.
Start and end coordinates are 1-based.

Support for compressed FASTA

pyfaidx can create and read .fai indices for FASTA files that have been compressed using the bgzip tool from samtools. bgzip writes compressed data in a BGZF format. BGZF is gzip compatible, consisting of multiple concatenated gzip blocks, each with an additional gzip header making it possible to build an index for rapid random access. I.e., files compressed with bgzip are valid gzip and so can be read by gunzip. See this description for more details on bgzip.

Changelog

Please see the releases for a comprehensive list of version changes.

Known issues

I try to fix as many bugs as possible, but most of this work is supported by a single developer. Please check the known issues for bugs relevant to your work. Pull requests are welcome.

Contributing

Create a new Pull Request with one feature. If you add a new feature, please create also the relevant test.

To get test running on your machine:

Create a new virtualenv and install the dev-requirements.txt.

pip install -r dev-requirements.txt
Download the test data running:

python tests/data/download_gene_fasta.py
Run the tests with

pytests

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.

pyfaidx's People

Contributors

Stargazers

Watchers

pyfaidx's Issues

Storing fasta indexes between runs?

Is this possible? Seems like it isn't based on the cl args shown in the README.md

Handle semicolon comments during indexing and retrieval

The spec for htslib faidx mentions optional comments (;) being handled properly, so even though this is an uncommon case we should handle it as well.

Is pyfaidx suited to do full genome lookups?

I am using fasta files representing the whole genome, with the only header lines being >chr1, >chr2 etc. and my indexes are huge, i.e. 37093012.

Will pyfaidx still perform well? If yes, this info should probably be added to the README.md.

Add nose test for Genome.neg

Subclass Fasta to ingest VCF and return consensus sequence

This should be straightforward using PyVCF and pysam. The subclass could be called FastaVariant and return a consensus sequence with either homozygous variants or heterozygous variants or both included as substitutions. Skip over indels and complex variants. MNPs can be handled but the length of the return sequence should be checked.

Reindex file if time stamp on fasta is newer than on index?

It would be great if the index were automatically recreated if the fasta file has a new time stamp than the .fai file. I just did this test as we had to rebuild a fasta we are using and reindexing did not occur.

thanks for considering!

Skip biopython tests on pypy3 and cpython3.2

When we can't install BioPython we should just skip those tests.

Pass tests on Windows

read-ahead caching is not effective at mitigating reads from disk

It looks like the read_ahead property is implemented with a bug that prevents proper caching.

Format `faidx --stats` using presets e.g. "bed" "chromsizes" "nucleotide"...

"stats" right now just consists of the sequence names and lengths. It might be more useful to allow simple bed files, UCSC-style "chrom.sizes" files, and more advanced nucleotide (gc%, a/c/g/t, top kmer, etc...) formats to be specified like: faidx --stats bed and so on.

'Sequence' object breaks string methods upper(), split(), etc

See example below:

>>> from pyfaidx import Fasta
>>> ref = Fasta('my.fasta')
>>> ref.keys()
['myseq']
>>> ref['myseq'][:5].upper()
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'upper'
>>> ref['myseq'][:5].split('T')
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'Sequence' object has no attribute 'split'

To fix this:

The easy way is to change this line in Faidx.fetch():

return Sequence(name=rname, start=int(start + 1),
                     end=int(end), seq=seq.replace('\n', ''))

return seq.replace('\n', '')

i.e. return a pure string, rather than a Sequence object.

The hard way is to reimplement string methods in Sequence object, like Bio.Seq.Seq.

Add option `--stats` to print basic stats about the fasta file

FastaVariant needs tests

Check that Faidx.fetch does not return header characters when passing a negative start coordinate

Sequence contains non-DNA characters

This error is coming up in 0.3.4 and does not come up in 0.3.1

Traceback (most recent call last):
File "Step1.py", line 151, in
main()
File "Step1.py", line 123, in main
desired_sequence = -chromosome_infile[chromosome][start:stop] #if negative strand, give reverse complement in fasta file
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 105, in neg
return self[::-1].complement
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 140, in complement
comp = self.class(self.name, complement(self.seq),
File "/data/zappadata_p2/pschaugh/RepTags/venv/lib/python2.7/site-packages/pyfaidx/init.py", line 591, in complement
raise ValueError("Sequence contains non-DNA characters: \n {0}".format(''.join(wrap_sequence(70, seq))))
ValueError: Sequence contains non-DNA characters:
ttgtatggttttgaatggtctacgtagttttagacacgatcctcccatcaaatatcacatttacgaatac
tatttgttcttctttctagaatttaattgttagattgtaatatggaatttcttcatctttttccctcttg
tgtgatttgggtttaggtcatcttcctttctttattatttctgatctcatgtctgttgactttatttctt
atccttgtaccatctcattcaatcgttttggttcaagtcaagaaacttttttagtggttttaattacttg
ttaatcaagctgactcattctattttctcccttcagagttattgattttagtctttacattcacgtctgt
aatggtgactaaattgtctttattttactaatataattattctgttatcatacttataaacatacggttt
tttaacctgttggatctgttttacctatttaaggatctgtatatggtggatggttctgattnggtgtttc
tttatcttttatgcttgtctgtatattgatcgtttctctaacttagttactaggttttagacggttgttt
cttttagggtcctcgactaccgaagtaagcacttaagatggtttgtaaatttcttctcaattgtgattag
gaagagttttcaaaggttttttcaaattccattcccctgtgaaagtttgaataagaagttccggttctaa
tgagaaagtggtttcgttctgtttccatgatgttattttaattcgatgtctagttatagagactatttat
gactacatttttaggagttggtttataatcacttgtcttaattcactgtgtagtttttttaatatgtgta
ctggagcaccctaaataaagactttaagttcctatcaagttgtatacttttacttaattacattattgat
gtaattgtcttacttctttgtcaagggtacactagtagagttaactacctctttttataaactgttttag
gttgtggaaaggtactattttttgtangtggtttgatccttgtcttcctttgatggagttgtattaattc
cggtatatacctttcgggtaacgattgtagtatgaattactactttctaacttttgaaaagaagattcga
ttctgtgttcccttcttacggatgaaaatttaagtcgtatcattacctttagaactggtctatttaatcc
tttatttttcgtaattttaaccctttatttttcattttgctagggaaagtatctactacattaaagtata
catcttttggggtttctaaggtattctttgtactatcttgattattgttttaatcgtttgaatgtcttat
gagtcgtatgtctttaatcgacacaaagatatgttattgttacttgttggactttttctttaattctttc
gttaaagtaaatttcaccctggtttgttttattttttatctttgtttgacttggttcctccactttctac
acatgtgacttttgatactttataacgagtttctttaatttctgtgatgtttttatctgtaggaaacaag
tatctactcttctggattttaactttctacaccagagatctaatttacgttaaatagaattttaaggttg
tagttaaaaaaaaatgttttgatctttttagataggattgtaagtatatcttagagttccctgggattta
tcgtttttgttagaactttttcttgtgtcaacctccagagtgtgaagaatcaaagttttaaagattttta
gatgtcataatttttgtctcaccataactgcatatctgtctgtgtatctagtcactctatcttatctctc
gggtctttatttgagaacatatataccggaacactaaaaaatttttcaacgatcctgataggttatccct
tttcctgtcagaaaagttgtttacaatgacccttttgacctatcggtgtacgttttcttacttcaacatg
ggaatggaatgtggtatacgtctttaatttttatcagtttctgaatctatactctcaatcttatgagtgt
tcttttacatccccgtgtcgaagtaccgtaacgtaaaccgttcctaaagaaccgatgctgtggctttcgt
atccgttgtttttttagtctattaaccttgtacatttataatttttaaaaacacttagtttactgtgaca
gttgtctcattttccgttgtgtgacttaccgtcctttataaacatttaatatatagactattgcctaatg
ataggtatatatttcttgagaacgttaagttgttggttttttgttgtgttaattttttatctgtttcctg
aacttatatataaagaggttactatatgtttactggttattcatgtacttttctacgagttgttgtactt
attaatccttttacgtttagttttggtgtcactctatgatgtgttgatatctttgttataccaccaaaga
gtctaaagtgtactaggtcgttaatgtggagacccatatatggagtttcttaattttcgtctcggagttt
ctctacaaagatgtgggtacaggtgttgtcgtaataagtgttatcagttttctaccttcgttgggttcac
agacaactgtctacttacctatttgttttacaccgtataggtgttttaccttataataaattggaatttt
tcctttttatta

Add read ahead cache option for sequential accession

long_name missing leading chr number

I think I may have found a bug. I can work around it, but thought you would want to fix it. It is really weird.

fasta file: mouse NCBI 38: GRCm38_68.fa

Symptom: on one record (chr 4), the long_name is missing the leading chr number. All the others are OK.

Hope the listings below help. (and thanks VERY much for pyfaidx…)
-Al

Confirm that the record is correct in the file:

$ awk '/^>4/{print $0; exit;}' GRCm38_68.fa

4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

Open the file and see the top level:

from pyfaidx import Fasta
fa = Fasta('GRCm38_68.fa')
fa.keys()
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'JH584295.1', 'JH584292.1', 'GL456368.1', 'GL456396.1', 'GL456359.1', 'GL456382.1', 'GL456392.1', 'GL456394.1', 'GL456390.1', 'GL456387.1', 'GL456381.1', 'GL456370.1', 'GL456372.1', 'GL456389.1', 'GL456378.1', 'GL456360.1', 'GL456385.1', 'GL456383.1', 'GL456213.1', 'GL456239.1', 'GL456367.1', 'GL456366.1', 'GL456393.1', 'GL456216.1', 'GL456379.1', 'JH584304.1', 'GL456212.1', 'JH584302.1', 'JH584303.1', 'GL456210.1', 'GL456219.1', 'JH584300.1', 'JH584298.1', 'JH584294.1', 'GL456354.1', 'JH584296.1', 'JH584297.1', 'GL456221.1', 'JH584293.1', 'GL456350.1', 'GL456211.1', 'JH584301.1', 'GL456233.1', 'JH584299.1']

See that the long name is OK for chr 3:

print fa['3'].long_name
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF

See that chr 4 is not OK:

print fa['4'].long_name
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF

The next two listings are longish, so I describe them here: first a listing of all the names, then all the long names. The names are OK for all chrs. The long names are all OK except for 4.

for record in fa:
... print record.name
...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
MT
X
Y
JH584295.1
JH584292.1
GL456368.1
GL456396.1
GL456359.1
GL456382.1
GL456392.1
GL456394.1
GL456390.1
GL456387.1
GL456381.1
GL456370.1
GL456372.1
GL456389.1
GL456378.1
GL456360.1
GL456385.1
GL456383.1
GL456213.1
GL456239.1
GL456367.1
GL456366.1
GL456393.1
GL456216.1
GL456379.1
JH584304.1
GL456212.1
JH584302.1
JH584303.1
GL456210.1
GL456219.1
JH584300.1
JH584298.1
JH584294.1
GL456354.1
JH584296.1
JH584297.1
GL456221.1
JH584293.1
GL456350.1
GL456211.1
JH584301.1
GL456233.1
JH584299.1

Now the long names

for record in fa:
... print record.long_name
...
1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
JH584295.1 dna:scaffold scaffold:GRCm38:JH584295.1:1:1976:1 REF
JH584292.1 dna:scaffold scaffold:GRCm38:JH584292.1:1:14945:1 REF
GL456368.1 dna:scaffold scaffold:GRCm38:GL456368.1:1:20208:1 REF
GL456396.1 dna:scaffold scaffold:GRCm38:GL456396.1:1:21240:1 REF
L456359.1 dna:scaffold scaffold:GRCm38:GL456359.1:1:22974:1 REF
GL456382.1 dna:scaffold scaffold:GRCm38:GL456382.1:1:23158:1 REF
GL456392.1 dna:scaffold scaffold:GRCm38:GL456392.1:1:23629:1 REF
GL456394.1 dna:scaffold scaffold:GRCm38:GL456394.1:1:24323:1 REF
GL456390.1 dna:scaffold scaffold:GRCm38:GL456390.1:1:24668:1 REF
GL456387.1 dna:scaffold scaffold:GRCm38:GL456387.1:1:24685:1 REF
GL456381.1 dna:scaffold scaffold:GRCm38:GL456381.1:1:25871:1 REF
GL456370.1 dna:scaffold scaffold:GRCm38:GL456370.1:1:26764:1 REF
GL456372.1 dna:scaffold scaffold:GRCm38:GL456372.1:1:28664:1 REF
GL456389.1 dna:scaffold scaffold:GRCm38:GL456389.1:1:28772:1 REF
GL456378.1 dna:scaffold scaffold:GRCm38:GL456378.1:1:31602:1 REF
GL456360.1 dna:scaffold scaffold:GRCm38:GL456360.1:1:31704:1 REF
GL456385.1 dna:scaffold scaffold:GRCm38:GL456385.1:1:35240:1 REF
GL456383.1 dna:scaffold scaffold:GRCm38:GL456383.1:1:38659:1 REF
GL456213.1 dna:scaffold scaffold:GRCm38:GL456213.1:1:39340:1 REF
GL456239.1 dna:scaffold scaffold:GRCm38:GL456239.1:1:40056:1 REF
GL456367.1 dna:scaffold scaffold:GRCm38:GL456367.1:1:42057:1 REF
GL456366.1 dna:scaffold scaffold:GRCm38:GL456366.1:1:47073:1 REF
GL456393.1 dna:scaffold scaffold:GRCm38:GL456393.1:1:55711:1 REF
GL456216.1 dna:scaffold scaffold:GRCm38:GL456216.1:1:66673:1 REF
GL456379.1 dna:scaffold scaffold:GRCm38:GL456379.1:1:72385:1 REF
JH584304.1 dna:scaffold scaffold:GRCm38:JH584304.1:1:114452:1 REF
GL456212.1 dna:scaffold scaffold:GRCm38:GL456212.1:1:153618:1 REF
JH584302.1 dna:scaffold scaffold:GRCm38:JH584302.1:1:155838:1 REF
JH584303.1 dna:scaffold scaffold:GRCm38:JH584303.1:1:158099:1 REF
GL456210.1 dna:scaffold scaffold:GRCm38:GL456210.1:1:169725:1 REF
GL456219.1 dna:scaffold scaffold:GRCm38:GL456219.1:1:175968:1 REF
JH584300.1 dna:scaffold scaffold:GRCm38:JH584300.1:1:182347:1 REF
JH584298.1 dna:scaffold scaffold:GRCm38:JH584298.1:1:184189:1 REF
JH584294.1 dna:scaffold scaffold:GRCm38:JH584294.1:1:191905:1 REF
GL456354.1 dna:scaffold scaffold:GRCm38:GL456354.1:1:195993:1 REF
JH584296.1 dna:scaffold scaffold:GRCm38:JH584296.1:1:199368:1 REF
JH584297.1 dna:scaffold scaffold:GRCm38:JH584297.1:1:205776:1 REF
GL456221.1 dna:scaffold scaffold:GRCm38:GL456221.1:1:206961:1 REF
JH584293.1 dna:scaffold scaffold:GRCm38:JH584293.1:1:207968:1 REF
GL456350.1 dna:scaffold scaffold:GRCm38:GL456350.1:1:227966:1 REF
GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF
JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF
GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF
JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF

Why does Bio.SeqIO and Pyfaidx give different results?

I'm sure it is me mucking up somehow, but when I use Pyfaidx and Bio.SeqIO I get different results...

reference_genome_fasta = "/home/endrebak/genomes/hg38/hg38.fa"
position = slice(68344571,68348572)
hg38 = Fasta(reference_genome_fasta)
gene_pyfaidx = hg38["chr15"][position].seq
gene_pyfaidx[:20] #"aaaaaaaaaaaattccctgt"
#but with Bio.SeqIO...
with open(reference_genome_fasta, "rU") as genome_file_handle:
    chromosome_dictionary = SeqIO.to_dict(SeqIO.parse(genome_file_handle, "fasta"))
gene_bio_seqio = chromosome_dictionary["chr15"][position]
str(gene_bio_seqio[:20].seq) #'AGGGGGACTAGACTCGAAGA'

And when I try to look up the same sequence in UCSC, I get yet another sequence: https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr15:68346496-68346499&hgsid=198858039_mEdzdwARqPcSygOiGTzyEzcF87Ed

(only four first nucleotides shown ("ATGA") to get nuclotide resolution and added one to index since UCSC one-indexed).

Faidx and Fasta option to re-build the index file

Single short lines in the middle of a file are not detected

>AB821309.1
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAACCTTGTCCCTGGCCCGGCCCT
CCTTCAGTTTAGTTGAGGATACCACATTAGAGCCAGAAGATGCCATCTCATCCGGAGATGATGAGA
CACCGATGGTGCGGAAGATTTTGTCAGTGAGAACAGTAACAACAAGAGAGCACCATACTGGACCAACACA

does not raise an error

Add --version flag

Split header lines on a value

pyfaidx correctly installed but not working. error: unrecognized arguments

I installed pyfaidx as indicated.
Typing

faidx 
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
.....

hower, when I try to run something it always tells me that there is an unrecognized arguments in my comand. The same is true if I use the example files.

faidx --stats tests/data/genes.fasta
usage: faidx [-h] [-b BED] [--stats] [-c] [-r] [-n] [--split-files] [--lazy]
             [--default-seq DEFAULT_SEQ] [-d DELIMITER]
             [--mask-with-default-seq | --mask-by-case] [--version]
             fasta [regions [regions ...]]
faidx: error: unrecognized arguments: tests/data/genes.fasta

I do not understand what I am missing here.

`FastaRecord.long_name` breaks when `Faidx.split_char` yields multiple values

Floating point values are allowed as file offsets

Using python 3.4.0 and whatever version of pyfaidx is currently grabbed by pip (you should add __version__ to the module)...

Simple access (and slicing) both end up trying to use floats where ints are required.

Traceback (most recent call last):
  File "./Ngaps.py", line 207, in <module>
    sys.exit(Main(argv=None))
  File "./Ngaps.py", line 100, in Main
    x = ref[0][100]
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 460, in __getitem__
    return self._fa.get_seq(self.name, n + 1, n + 1)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 561, in get_seq
    return self.faidx.fetch(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 336, in fetch
    self.fill_buffer(name, start, end + self.read_ahead)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 326, in fill_buffer
    seq = self.from_file(name, start, end)
  File "/usr/local/lib/python3.4/dist-packages/pyfaidx/__init__.py", line 380, in from_file
    seq = self.file.read(seq_blen).decode()
TypeError: 'float' object cannot be interpreted as an integer

A quick fix (perhaps a hack) is to cast start and end to ints in from_file:

        start0 = int(start) - 1  # make coordinates [0,1)

and

        seq_len = int(end) - start0

Sorry, no time at the moment to thoroughly test, much less fork your most current code and proved you with a proper diff.

PS: The title of this issue isn't quite correct... It is a call to read where the error gets tripped, not an index.

PPS: Useful module! Thanks. Not super complicated, but thanks for doing it right and saving the rest of us from having to reinvent the wheel (or use SeqIO where it really isn't appropriate IMO).

Read actual defline from FASTA file index "gap"

One current design limitation of pyfaidx is that it mirrors the samtools indexing behavior of truncating headers after whitespace. There is a good reason for this - any whitespace in the identifier would break the index file. A side effect of this is that frequently the "description" in a header will be lost when reading into the file using the index. It seems like an option to recover the full header line would be useful, and pretty cheap to implement.

To determine the byte offset and length of the header from the index file, we can determine the byte end of the preceding sequence by adding unprintable characters, and this should be the byte start of the real header line. We can then read from header byte start to sequence offset and save this as something like Sequence.long_name.

Better error message for non IUPAC nucleotide characters

I should just print the invalid sequence and line number to assist users in correcting invalid input sequence raised in #47.

Benchmark against other comparable tools

htslib
pyfasta
fastahack

Take a look at the htslib faidx.c for closer implementation

https://github.com/samtools/htslib/blob/develop/faidx.c

Check sequence is composed of allowed characters

http://en.wikipedia.org/wiki/FASTA_format

Rewrite tests to function more independently, download test data instead of storing in version control

Option for mutable Fasta file

Sometimes (masking regions) it is convenient to modify a fasta file in place if possible. We can read the number of line breaks to be inserted, and for entries of the same size as the original fetched region we can modify the fasta file in place.

Generalized `Sequence` object to work with protein and RNA seqs

The original module was developed for the use case of accessing DNA strings from a reference genome file. It should be easy to generalize this to protein and RNA strings as well.

add checks for .complement() method

fasta indexing is broken

Faidx.build_index gives different results than samtools

faidx regex matching argument

In certain cases (http://www.biostars.org/p/133245/) matching headers by regular expressions would be very helpful. This could be added as an argument to the Faidx class, or as a feature only in the faidx script.

CLI script: read a list of regions to fetch from a BED file

Expose FastaRecord `iter` method for hybrid line/object-based iteration

Sometimes it would be useful to do something like:

for line in FastaRecord:
  do something with the line

Currently FastaRecord's __getitem__ returns the entire contents of a FASTA entry as a string. Maybe a __iter__ method could return smaller chunks to limit memory consumption. Might be able to use the read-ahead buffer from #26.

[bug] GATK-bundle - duplicate key?

I would like to replace pygr with pyfaidx.
Previously I have been using the human_g1k_v37_decoy.fasta file, which is part of the GATK bundle (available through FTP) for getting sequences. I use the same file for alignments.

>>> from pyfaidx import Fasta
>>> genome = Fasta('human_g1k_v37_decoy.fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 552, in __init__
    filt_function=filt_function)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 209, in __init__
    self.read_fai(split_char)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 244, in read_fai
    raise ValueError('Duplicate key "%s"' % rname)
ValueError: Duplicate key "['2', 'dna:chromosome', 'chromosome:GRCh37:2:1:243199373:1']"

The human_g1k_v37_decoy.fasta.fai file does not seem to contain that key twice.

Catch KeyError in faidx and warn user on stderr

It might be too harsh to exit faidx when sequence names don't exist. Instead, warning the user that the sequence name doesn't exist and then moving on would be more useful. See: https://www.biostars.org/p/141241/#141306

When strict_bounds=False and seq_blen calculated < 0, as_raw parameter is not respected.

Take a look at the manuscript?

The manuscript is something... maybe @brentp, @azalea, and @swheelan can give me some comments? Feel free to edit as you see fit - the document is version controlled.

Indexing [0:0] is different than standard Python string

>>> gfasta = pyfaidx.Fasta('....fa', as_raw=True)
>>> gfasta['chr1'][0:0]

Returns the whole chromosome

>>> s='ABCD'
>>> s[0:0]

Returns an empty string

IMHO, pyfaidx should return an empty string.

Properly handle lowercase characters in FASTA

Specifically need to worry any time I do string replacement.

Decrease memory footprint with large indices

@jsilter Has a use case for opening the entire NCBI non redundant nucleotide database, which is currently problematic. The index for this database is 24,312,914 entries and 1.2G on disk. The in-memory index is far larger than this, and so it's possible that we might not be able to store an index this large in memory with the current data structures. However, there are some simple ways to reduce memory footprint already:

remove Faidx raw_index attribute. I'm not sure anyone was using this, so it should not be an issue.
nest named tuples instead of another dict in Faidx.index's OrderedDict
- 31% space savings just by using named tuples (40% total savings w/ dropping raw_index list)
- additional 4% savings by using OrderedBytesDict (not sure it's worth it.)

ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> chrom = fasta.keys()[0]
ipdb> fasta[chrom][1:2]
>chrII:236683-237683(+):2-2
T
ipdb> fasta = Fasta(fasta_filename, as_raw = True)
ipdb> fasta[chrom][1:2]
u'T'

Note that the index for fasta_filename doesn't exist before the first call.

Is this expected behaviour? Does as_raw only work on an existing index? i.e. the index is created at the first call, and then as_raw is used?