samtools / htsjdk
A Java API for high-throughput sequencing data (HTS) formats.
Home Page: http://samtools.github.io/htsjdk/
There are a number of classes that are lacking documentation. In general, any public API method should be documented (unless it is self-explanatory), and any complicated control logic should be properly commented.
Before we release, we would like to support interval range querying of CRAM files. @vadimzalunin could this be supported?
CloseableIterator<SAMRecord> query(final QueryInterval[] intervals,
final boolean contained)
We never use INVALID_PLATFORM_VALUE, but the spec defines platform values. The platform values are uppercase, but in production there are BAMs with mixed-case or lowercase (:/) platform values. Let's add a check in SamFileValidator for the correct set of values, case-insensitively, along with unit tests. This is motivated by a new @RG-PL
value in the SAM Spec: samtools/hts-specs#39
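A case-insensitive check of the kind proposed could be sketched as follows. The platform list below is an illustrative subset; the current SAM spec is the authoritative source, and this is not the actual SamFileValidator code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PlatformCheck {
    // Platform values taken from the SAM spec (illustrative subset; consult
    // the current spec for the authoritative list).
    private static final Set<String> VALID_PLATFORMS = new HashSet<>(Arrays.asList(
            "CAPILLARY", "LS454", "ILLUMINA", "SOLID", "HELICOS", "IONTORRENT", "PACBIO"));

    // Case-insensitive membership test, as the issue proposes for SamFileValidator.
    public static boolean isValidPlatform(String pl) {
        return pl != null && VALID_PLATFORMS.contains(pl.toUpperCase(Locale.ROOT));
    }
}
```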
I just noticed that there are changes to many core classes (SAMRecord, VariantContext, Interval) to implement a new Locatable interface. The main result seems to be deprecation of getChr(), getChrom() getChromosome(), getReference() etc. methods in preference to a new getContig().
I get the desire to standardize, but it's confusing to long-time users of the library to suddenly change and deprecate all the existing method names. More to the point, "contig" is not really an accurate term for most uses: for human, and most larger reference genomes, the chromosomes are scaffolds of multiple contigs (contiguous sequences) assembled with Ns in the middle. How about using something more generic like getReferenceSequence()?
Extending the htsjdk to parse symbolic alt alleles and return meaningful Alleles based on the parsed result would be useful to structural variation users of the API. In particular the following features would be useful:
One possible design is to subclass Allele:
abstract Allele
Allele <- BaseCallAllele
Allele <- SymbolicAllele
Allele <- NoCallAllele
Allele <- MissingAllele
SymbolicAllele <- StructuralVariationAllele (with an enum, subclass hierarchy, or getters for SV type and subtypes (sub-subtypes?) such as DUP:TANDEM)
SymbolicAllele <- BreakendAllele
SymbolicAllele <- BreakpointAllele
SymbolicAllele <- NamedContigAllele
The low-level design and implementation depend on resolution of ambiguities in the VCF specifications, as there are multiple points at which the VCF specifications are under-defined and ambiguous. For example, it is unclear as to whether "AA", "", "AA" are valid alleles.
I'm not sure if this has been fixed yet or not, as I was only able to test cramtools-2.1.jar from EBI web site. Is there a way to run picard tools with CRAM (I tried ViewSam and it didn't work).
Anyway, that aside, cramtools is unable to read CRAM files that have no reference. My file entirely uses CRAM features to store the base calls and never needs to reference an external sequence. It does now have @SQ M5 auxiliary tags (but does not require them); nor is it using an internally embedded reference.
I think this is still a legal file (apart from the oddity of officially needing @SQ M5, but I think we already decided to relax that for the case of embedded reference sequences), albeit a fairly inefficient one.
Hello, I want to use the samtools part of htsjdk, but the current Ant file doesn't have a task for building separate JAR files for each of the subprojects. Is there any particular reason for this?
I'm not going to say I think it's CRAM and the NM tag[1], but I think it's CRAM and the NM tag[2,3].
This command line works, using the test crams in the current cramtools master branch[4]:
$ java \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Sun Dec 21 10:05:07 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:07 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:08 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=163053568
$
The same command with the addition of -ea crashes on an assert[5]:
$ java -ea \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:19 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=128974848
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.AssertionError
at htsjdk.samtools.CRAMFileWriter.flushContainer(CRAMFileWriter.java:275)
at htsjdk.samtools.CRAMFileWriter.finish(CRAMFileWriter.java:337)
at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:205)
at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:88)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:89)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:99)
$
Hi, what about the MIT licence in htsjdk?
If I use some classes to develop a new tool, must I include the licence text in all my classes?
Or am I not allowed to use these packages in my new tool?
I tried to use the SamReader:
reader.query(intervals, false);
where intervals contain a single QueryInterval which is:
QueryInterval interval = new QueryInterval(0,0,0);
It will throw an exception as follows:
htsjdk.samtools.util.RuntimeIOException: Expected 8 bytes, got 4
at htsjdk.samtools.AbstractBAMFileIndex$IndexStreamBuffer.readLong(AbstractBAMFileIndex.java:684) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.AbstractBAMFileIndex.readLong(AbstractBAMFileIndex.java:438) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:332) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:60) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:729) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:412) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:468) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
The BAM file size is about 15 GB downloaded from 1000 Genomes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00595/alignment/HG00595.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522.bam
Any help is appreciated.
Hi there,
Is there a tool to check whether a BAM/BAI file is valid?
Thank you!
Jerry
There is a bug in the TabixReader constructor.
You can specify the path to the index file ("idxFn", second argument of the last constructor), but when this value is not null, it is never assigned to the class variable "mIdxFn". Because of that you will get a NullPointerException in the readIndex() method.
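A minimal sketch of the kind of fix described. Field names follow the issue's description and the fallback follows the usual ".tbi" convention; this is a hypothetical simplification, not the actual TabixReader source:

```java
class TabixPathSketch {
    private final String mFn;
    private final String mIdxFn; // the field the issue says is never assigned

    // Fix: honor an explicit index path when given, otherwise fall back to
    // the ".tbi" naming convention, so mIdxFn is never left null.
    public TabixPathSketch(String fn, String idxFn) {
        this.mFn = fn;
        this.mIdxFn = (idxFn != null) ? idxFn : fn + ".tbi";
    }

    public String indexPath() { return mIdxFn; }
    public String dataPath() { return mFn; }
}
```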
There is currently no way to get information about the data source backing a SamReader. This makes it difficult to return useful error messages when there is any sort of exception. It should return a String which describes the resource well enough that a human could identify which file, URL, database, etc. was the problem.
JEXLMap already has JEXL context variables for isHomRef, isHet and isHomVar (circa line 97 in variant/variantcontext/JEXLMap.java), but it would be great if variables could be added for the other genotype types, specifically noCall.
Of course if GSA had commit rights we could take care of some of these issues on our own.... ;)
Other features return 1-based locations. This should be fixed. Or we should delete OldDbSNPFeature.
Does anyone use this?
Hi, I'm currently moving my code from picard1.100 to htsjdk. So far I cannot see how I should now map those 'old' classes:
Any hint?
Thanks.
I've used biobambam to merge some files, and it outputs
@HD VN:1.5 SO:coordinate
using https://github.com/gt1/libmaus/blob/master/src/libmaus/bambam/BamCatHeader.hpp#L116
According to the spec ( http://samtools.github.io/hts-specs/SAMv1.pdf ), 1.5 is the current version, but picard ValidateSamFile gives an error, which the diff below eliminates.
diff --git a/src/java/htsjdk/samtools/SAMFileHeader.java b/src/java/htsjdk/samtools/SAMFileHeader.java
index 2e237c4..ac361b1 100644
--- a/src/java/htsjdk/samtools/SAMFileHeader.java
+++ b/src/java/htsjdk/samtools/SAMFileHeader.java
@@ -46,9 +46,9 @@ public class SAMFileHeader extends AbstractSAMHeaderRecord
public static final String VERSION_TAG = "VN";
public static final String SORT_ORDER_TAG = "SO";
public static final String GROUP_ORDER_TAG = "GO";
- public static final String CURRENT_VERSION = "1.4";
+ public static final String CURRENT_VERSION = "1.5";
public static final Set<String> ACCEPTABLE_VERSIONS =
- new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4"));
+ new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4", "1.5"));
/**
* These tags are of known type, so don't need a type field in the text representation.
I suspect this suggestion will be controversial.
It would be helpful for a lot of reasons to allow Java 8 code in htsjdk. There are many useful features included in 8 which were not in 6 or 7. See [here](http://www.oracle.com/technetwork/java/javase/8-whats-new-2157071.html) for a complete list.
In my view, the biggest and most desirable changes are the introduction of lambdas, default methods in interfaces, and the stream APIs.
htsjdk 1.129
None of the validation methods on VariantContext is tested:
extraStrictValidation()
validateReferenceBases()
validateRSIDs(rsIDs)
validateAlternateAlleles()
validateChromosomeCounts()
Overriding and using clone() is discouraged; SAMRecord and SAMFileHeader should provide copy constructors or static factory methods instead.
Hi,
Are you planning on distributing htsjdk on Maven Central? I can only see 1.118 there, from user "seqdoop", which I expect isn't you guys.
(The scenario is that I'm using the GATK-framework to develop custom walker but need to update htsjdk from the supplied version 1.120).
thanks
Daniel
Let's add support for the @RG-PM tag in the SAM header. This relates to the recent SAM specification change: samtools/hts-specs#16
A first implementation hint is to take a look at SAMReadGroupRecord and its STANDARD_TAGS. We will want to add the PM tag and expose it there. Add unit tests too.
See: #115
Test cases need to be converted to TestNG and additional coverage added. A fair number of the tests are currently run by main()s. These tests should be converted to a full TestNG test suite so that our automated builds can run them. This is also a good time to look at code coverage. Once the tests are converted to TestNG we can run some code coverage plugins to get a good sense of how well this is currently covered.
Hi developers,
I need your help again! I got an issue when I try to query bam files using SamReader/SAMFileReader.
Basically, there are two reads in the BAM:
ERR019497.24419495 73 1 11966 0 108M = 11966 0 TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9 X0:i:3 X1:i:4 MD:Z:105T2 RG:Z:ERR019497 AM:i:0 NM:i:1 SM:i:0 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB XT:A:R
ERR019497.24419495 133 1 11966 0 104M4S = 11966 0 ACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGAGGGCCTGCCTGATCTTCCACCCCCTCTCCCAGGGCCAAACCTAAACCCCCCTACCCCCCCCCCCCCGCCGGCTG 9@@EAC8C6DF?EEFDEFFFCEEFFDFGFE?:&8),4.51.+59*88'/9//62-$%,.-;%+420#&.2C4B'%2-,'&/$%%$$'%)%.4(*'0/5+*/25#%-%& XC:i:104 RG:Z:ERR019497
Note that they both have the same QNAME (ERR019497.24419495), RNAME (1) and POS (11966).
When I use the reader query API as follows against this BAM file:
SAMRecordIterator itr = reader.queryOverlapping(new QueryInterval[] { query });
where
QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11966, 11966);
I got 2 reads back which is perfect!
However when I use:
QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11967, 11967);
Only one read is back:
ERR019497.24419495 73 1 11966 0 108M = 11966 0 TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9 X0:i:3 X1:i:4 MD:Z:105T2 RG:Z:ERR019497 AM:i:0 NM:i:1 SM:i:0 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB XT:A:R
Note that both reads are about 100 bases long. So I would expect to get 2 reads back from the second query as well, but that didn't happen.
Is it a bug in the library? If it is not, can you provide me a rationale?
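For reference, the expectation in the question can be written as plain interval arithmetic (QueryInterval coordinates are 1-based and inclusive). One possible rationale for the observed result: the second read's flag (133) has 0x4 set, i.e. it is unmapped, so its placement at 11966 is nominal and it contributes no alignment span reaching 11967. A sketch of the arithmetic:

```java
class OverlapSketch {
    // 1-based, inclusive overlap test, the convention QueryInterval documents.
    public static boolean overlaps(int readStart, int readEnd, int qStart, int qEnd) {
        return readStart <= qEnd && readEnd >= qStart;
    }

    // Reference span of a CIGAR like "108M": alignment end = start + refLength - 1.
    public static int alignmentEnd(int start, int refLength) {
        return start + refLength - 1;
    }
}
```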
You can download the BAM files to reproduce the issue at http://s000.tinyupload.com/index.php?file_id=00308273047953999316
Thank you!
Jerry
Please note that @jkbonfield has provided us with some test cases.
The README doesn't state a license, and there is also no LICENSE file. It seems from at least some of the source code that the code is under the MIT license. I'd recommend making that explicit by putting a LICENSE file in the root folder of the repo, and/or declaring the license under which the code is released explicitly in the README.
Consider a VCF file with the following two lines in the header:
##GATKCommandLine=<ID=ExtractConsensusSites, ....
##GATKCommandLine=<ID=FilterLiftedVariants,....
When this is parsed by the VCFHeader class, both lines are fed into the metadata object as follows (VCFHeader.java:279):
mOtherMetaData.put(line.getKey(), line);
And the second entry overwrites the first, so the first line is lost from the meta-data.
Is there supposed to be a restriction on the number of elements with the same key allowed in the VCF format?
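The overwrite described above is inherent to Map.put; a multimap-style container preserves repeated keys such as GATKCommandLine. A sketch of the alternative (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetaDataSketch {
    // Multimap-style storage: repeated keys keep all their values
    // instead of each put() overwriting the previous one.
    private final Map<String, List<String>> meta = new LinkedHashMap<>();

    public void put(String key, String line) {
        meta.computeIfAbsent(key, k -> new ArrayList<>()).add(line);
    }

    public List<String> get(String key) {
        return meta.getOrDefault(key, new ArrayList<>());
    }
}
```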
I'm running ant 1.9.4 :
$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
$ echo $ANT_HOME
/commun/data/packages/apache-ant/apache-ant-1.9.4
in build.xml the file htsjdk.version.property.xml should be saved as XML but it is not:
$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant write-version-property
(...)
$ cat /commun/data/users/lindenb/src/jvarkit-git/htsjdk/htsjdk.version.property.xml
#htsjdk version
#Mon, 08 Sep 2014 10:53:51 +0200
htsjdk-version=1.119
It breaks ant when reading the xml to get the htsjdk.version( lindenb/jvarkit#11 )
Maybe it should be named 'htsjdk.version.properties'?
I would like to see code coverage here: https://coveralls.io/r/samtools/htsjdk
There appears to be a Slice.setRefMD5(), but it appears that it's not wired in. By the time SliceIO writes the Slice.refMD5, it is still null.
Creating the cram "works", but crashes in CRAMTools 2.1 unless passed the hidden --resilient argument.
$ java \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Wed Jan 14 15:58:07 ART 2015] picard.sam.SamFormatConverter INPUT=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.cram OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jan 14 15:58:07 ART 2015] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01; Picard version: 1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159) JdkDeflater
[Wed Jan 14 15:58:09 ART 2015] picard.sam.SamFormatConverter done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=163577856
$ java \
-jar cramtools-2.1.jar \
bam \
-R ~/src/cramtools/src/test/resources/data/set1/small.fa \
-I test.cram
ERROR 2015-01-14 15:55:38 Slice Reference MD5 mismatch for slice 0:2-22180, CCTAAGCCTA...TCGGTACTCC
ERROR 2015-01-14 15:55:38 Cram2Bam Reference sequence MD5 mismatch for slice: seq id 0, start 2, span 22179, expected MD5 00000000000000000000000000000000
$
Feature is a very useful interface when dealing with different input file types. However, the Feature interface explicitly states: "The coordinate conventions for start and end are implementation dependent, no specific contract is specified here." As a feature request, I would propose that the interface provide some method to obtain the start/end coordinates with a known coordinate scheme, such that every type of feature will give you the same value. This way you know whether you're getting 0- or 1-based numbers. This could be either:
create get1BasedStart()/get1BasedEnd() methods, or
create some sort of CoordinateType enum, with 1- and 0-based values, and add a getCoordinateType() method;
or there are plenty of other options. Doesn't really matter what, so long as it always returns the same value from any implementation of Feature.
Apologies if there's already a way to get at this problem...
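The second proposal could be sketched like this (names are hypothetical, taken from the suggestion above):

```java
class CoordinateSketch {
    // Hypothetical enum from the second proposal in the request.
    public enum CoordinateType { ZERO_BASED_HALF_OPEN, ONE_BASED_INCLUSIVE }

    // Normalize any implementation's start to 1-based, given its declared convention.
    public static int oneBasedStart(int start, CoordinateType type) {
        return type == CoordinateType.ZERO_BASED_HALF_OPEN ? start + 1 : start;
    }
}
```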
CramIO.isCRAM() should not use DataInputStream.readFully() on unknown input streams. SamFileReader.isBAMFile() instead uses SamFileReader.readBytes(), and hence does not throw an exception when the stream is not as expected.
Test case:
Assert.assertFalse(CramIO.isCRAM(new ByteArrayInputStream(new byte[0])));
Also, like SamFileReader.isBAMFile() (via isValidFile()), isCRAM() should probably check markSupported() before using mark().
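The suggested probe pattern can be sketched with only the standard library (this is not htsjdk's actual code): peek at the magic bytes via mark()/reset(), and return false rather than throwing on short input:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MagicProbe {
    // Peek at the first bytes of a stream without consuming them and without
    // throwing on short input -- the behavior the issue asks of isCRAM().
    public static boolean startsWith(InputStream in, byte[] magic) throws IOException {
        if (!in.markSupported()) {
            in = new BufferedInputStream(in); // guarantee mark/reset support
        }
        in.mark(magic.length);
        byte[] buf = new byte[magic.length];
        int n = 0, r;
        while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
            n += r;
        }
        in.reset();
        return n == magic.length && Arrays.equals(buf, magic);
    }
}
```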
Update CRAM code base to our code standards. This mostly involves code
style, variable names and standard conventions.
Currently, if you pass a block-compressed file into IndexFactory.createTabixIndex(), the file never gets decompressed internally, causing the underlying codec to blow up with errors like the following:
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:115)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:413)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:401)
at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:327)
I see two possible solutions, and wanted to solicit feedback on which would be preferable:
1. Add an overload of createTabixIndex() that accepts a stream rather than a File (this places the burden of setting up the stream correctly on the client).
2. Have FeatureIterator.initStream() check for block-compressed input via AbstractFeatureReader.hasBlockCompressedExtension(), and wrap the stream returned within a BlockCompressedInputStream as needed.
Thoughts?
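Option 2 amounts to an extension check plus stream wrapping. A stdlib sketch, in which GZIPInputStream stands in for htsjdk's BlockCompressedInputStream (BGZF data is gzip-compatible for plain reading):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DecompressSketch {
    // Extension-based wrapping, as option 2 suggests. Real code would test
    // the extensions hasBlockCompressedExtension() knows about and wrap with
    // BlockCompressedInputStream instead of GZIPInputStream.
    public static InputStream maybeWrap(String fileName, InputStream in) throws IOException {
        if (fileName.endsWith(".gz") || fileName.endsWith(".bgz")) {
            return new GZIPInputStream(in);
        }
        return in;
    }
}
```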
htsjdk.samtools.util.IOUtil depends on CBZip2{Input,Output}Stream in order to implement openBzipFileForReading and openBzipFileForWriting, and the calls in openFileForReading and openFileForWriting where file types are autodetected by extension.
CBZip2{Input,Output}Stream are provided in the source in lib/apache-ant-1.8.2-bzip2.jar, but this is effectively a binary blob. There is no expressed provenance for how this jar was compiled (although presumably the relevant files from apache ant were carved out). Also, since this portion of ant is not distributed by apache as an individual component, this causes problems for the possible move towards maven for compilation and deployment (the pull request for #55 currently declares a dependency on all of apache-ant-1.8.2, which is a bit excessive just for bzip2 support).
Possible solutions would be to:
1. Copy the relevant source into an htsjdk.samtools.util.bzip2 package. This should be fine from a license standpoint (ant is under Apache License, version 2.0).
2. Depend on a broader compression library that provides these stream classes, used from IOUtil. On the up side, this would allow support for several more compression types (in addition to bzip2, also pack200, .Z, xz and lzma). A slight downside is that this dependency is about 357k (compared to the 30k or so that the apache-ant-1.8.2-bzip2.jar is).
I would be happy to submit a pull request for either (probably would take less time than it did to type up this message), but wanted to know if there was a consensus as to which is the better approach?
Hello, I need to show completion percentages for an import operation that uses the VCFFileReader.
At the moment the constructor asks for a java.io.File object which, being only a fancy way to represent a path, doesn't allow me to tell() how much data has been consumed (and this is the only way to get an idea of how much time is left to read the entire VCF file).
Is there a way to access the underlying InputStream?
Otherwise, can someone add this feature?
Being able to know (even approximately) how much time is left seems kind of a must-have feature, IMO.
Thanks!
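The progress-tracking piece itself is simple to build: a byte-counting stream wrapper (as the request notes, the reader would still need a stream-accepting entry point for this to help). A self-contained sketch:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class CountingInputStream extends FilterInputStream {
    private long count = 0;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    // Bytes consumed so far; divide by the file length for a rough progress figure.
    public long getCount() {
        return count;
    }
}
```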
I used htsjdk.samtools.fastq.FastqReader to read FASTQ files processed with cutadapt: some reads were empty, and the FastqReader throws an exception from FastqReader.checkLine.
Is there a way to fix this? The field 'skipBlankLines' makes it difficult to fix.
Another idea: in FastqReader, would it be advantageous to use htsjdk.samtools.util.LineReader instead of a java.io BufferedReader?
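A tolerant reader of the kind needed can be sketched without htsjdk: a hypothetical 4-line parser that accepts empty sequence and quality lines (as cutadapt can emit) rather than skipping blank lines:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TolerantFastqSketch {
    // Parses strict 4-line FASTQ records as {header, sequence, quality},
    // permitting empty sequence/quality lines instead of treating them as blanks.
    public static List<String[]> readAll(BufferedReader reader) throws IOException {
        List<String[]> records = new ArrayList<>();
        String header;
        while ((header = reader.readLine()) != null) {
            String seq = reader.readLine();
            String plus = reader.readLine();
            String qual = reader.readLine();
            if (seq == null || plus == null || qual == null || !plus.startsWith("+")) {
                throw new IOException("Truncated or malformed FASTQ record: " + header);
            }
            records.add(new String[] { header, seq, qual });
        }
        return records;
    }
}
```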
Currently filters implement 2 different functions: a filter on a single SAMRecord and a filter on 2 mated SAMRecords.
This should be reduced to just a single method that takes 1 SAMRecord. Classes that need a mate-pair filter can build their own, or filters that are needed for mated pairs could implement their own MateFilter interface. In almost all cases the result on mates is just && or || of the primary filter; it's easy to compose these. It would also eliminate many filters throwing UnimplementedOperation exceptions.
This would make #177 easier since filters would then fit into the standard notion of what a filter or predicate is.
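The composition described maps directly onto java.util.function.Predicate (Java 8). A sketch with a stand-in Read type (SAMRecord's real accessors are not shown here):

```java
import java.util.function.Predicate;

class FilterComposition {
    // Stand-in for SAMRecord; only the fields this sketch needs.
    static class Read {
        final boolean unmapped;
        final boolean mateUnmapped;
        Read(boolean unmapped, boolean mateUnmapped) {
            this.unmapped = unmapped;
            this.mateUnmapped = mateUnmapped;
        }
    }

    // Single-record filters...
    static final Predicate<Read> READ_MAPPED = r -> !r.unmapped;
    static final Predicate<Read> MATE_MAPPED = r -> !r.mateUnmapped;

    // ...composed with and()/or() instead of a second mate-pair filter method.
    static final Predicate<Read> BOTH_MAPPED = READ_MAPPED.and(MATE_MAPPED);
}
```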
Hello,
I'm unsure if this is the right way to ask a developing question. But I searched for examples how to use the htsjdk and for the SamReader/SamReaderFactory there are nice examples.
Are there also some examples available of how to read BED or VCF files?
Best Air
It would be nice to be able to specify the index-file in the VCFFileReader.
The underlying FeatureReader has the ability to specify the index resource, but there is no way in the VCFFileReader to set it (e.g. via an additional constructor). At the moment the VCFFileReader only works if the index file is located in the same folder as the VCF file.
(related to #115)
There's been some interest in having an "unstable" jar published to Maven on each push to master, in addition to the "stable" release every ~2 weeks.
This would be useful for larger projects that depend on HTSJDK where a few members of the project want to contribute features to HTSJDK, without everyone having to worry about syncing their local versions of HTSJDK before building the other project.
I'm not sure on the best way to do this. We'd also need to make it very clear that it's for experimental use only. In the short term, a fork might be an easier solution, but let's keep this on the radar.
@lbergelson @akiezun @droazen are interested in this.
The ability to compare two *AM files seems generally useful, especially for unit tests.
There is currently a Picard CLP named CompareSAMs that tests the equality of two such files, and collects more detailed statistics as well. Many unit tests in Picard instantiate this class in a non-CLP context, making dangerous assumptions about mutated fields and such.
It could easily be refactored into a standalone class that takes two SamReaders and returns an object with various collected metrics, with the Picard tool being a thin wrapper.
Alternatively, if such functionality already exists in HTSJDK, the Picard tool should be using it...
(see also the SamFileValidator / ValidateSamFile duality)
If encode, decode and isDigit were made static, one could use them without first creating an instance of the class.
Users like Monkol are finding the HWP annotation to be very valuable, but unfortunately we stopped emitting it in GenotypeGVCFs because there is an integer overflow problem in the code with 91K samples. Hopefully this stack trace from Tim will be enough to reproduce the issue.
Hi All,
First I thought you’d all like to know that the 91k got kicked off on Saturday. However, all of the GenotypeGVCF jobs are failing with the following message:
java.lang.ArrayIndexOutOfBoundsException: -4408
at htsjdk.tribble.util.popgen.HardyWeinbergCalculation.hwCalculate(HardyWeinbergCalculation.java:94)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.computeHardyWeinbergPvalue(GATKVariantContextUtils.java:1878)
at org.broadinstitute.gatk.tools.walkers.annotator.GenotypeSummaries.annotate(GenotypeSummaries.java:94)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:177)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:227)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:186)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Ryan: would you be able to take a look at this ASAP today? The new joint calling is blocked on this. A few things you might need:
logs: /seq/dax/macarthur_joint_calling/v2/logs/genotypeGvcfs.scattered.macarthur_joint_calling.*.log
GATK: /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar (built from gsa-unstable on Friday afternoon)
Example command:
java -Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/local/scratch/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp4/macarthur_joint_calling
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx16000m
-jar /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar
-T GenotypeGVCFs
--disable_auto_index_creation_and_locking_when_reading_rods
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta
-o /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/genotypes.unfiltered.vcf.gz
-D /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf
-L /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/scattered.intervals
-V /seq/dax/macarthur_joint_calling/v2/combined_gvcfs/all_combined_gvcfs.list
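The internals of hwCalculate aren't shown here, but the general failure mode is easy to reproduce: at ~91K samples, an int intermediate over all chromosomes (2N) overflows and can yield a negative index. An illustrative sketch, not the actual HardyWeinbergCalculation code:

```java
class OverflowDemo {
    // Triangular-number-style intermediate, as might appear when sizing or
    // indexing arrays over all genotype pairs. Overflows int for large cohorts.
    public static int intCount(int chromosomes) {
        return chromosomes * (chromosomes + 1) / 2;
    }

    // Widening to long before the multiply avoids the overflow.
    public static long longCount(int chromosomes) {
        return (long) chromosomes * (chromosomes + 1) / 2;
    }
}
```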
Recently, I got an error:
Cannot merge sequence dictionaries because sequence chrM and chrY are in different orders in two input sequence dictionaries.
Can someone explain this error further?
Does this need to be an error, or is this case just not implemented yet? Is there a downstream problem if sequence headers that are in different orders are still added?
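The condition the merge code appears to enforce is that sequences common to both dictionaries occur in the same relative order (otherwise no merged order can be consistent with both inputs). A sketch of that check (illustrative, not htsjdk's implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DictOrderCheck {
    // True when the sequences common to both dictionaries appear in the same
    // relative order -- the condition the merge error message refers to.
    public static boolean sameRelativeOrder(List<String> a, List<String> b) {
        List<String> commonInA = a.stream().filter(b::contains).collect(Collectors.toList());
        List<String> commonInB = b.stream().filter(a::contains).collect(Collectors.toList());
        return commonInA.equals(commonInB);
    }
}
```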
Not sure of the performance implications, but perhaps ReadTag.buf could still be "risky", maybe just with a static ThreadLocal<ByteBuffer> instead of just a static ByteBuffer?
I've had a bug report against our software which appears to be an underlying bug in the htsjdk parser for BAM files. The read which fails to parse is a spliced read with a huge intron in it (226Mbp). When parsing the cigar string htsjdk corrupts this value (it looks like it overflows so I get a negative value), but if I use samtools to convert the bam file to sam, then htsjdk reads it just fine.
I've put the example files I used (only 1 read in them) up at:
http://ftp2.babraham.ac.uk/ftpusr68/
Some code which illustrates the problem is below. For the sam file I get:
Cigar is 109M226643263N25M17S Start=10002 End=226653398
..but for the equivalent BAM file I get:
Ignoring SAM validation error: ERROR: Record 1, Read name GR_10257_idx_5/2, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
Cigar is 109M-41792193N25M17S Start=10002 End=-41782058
This is using htsjdk-1.129.jar
import java.io.File;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;
public class PicardCigarBug {
public static void main (String [] args) {
File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.bam");
// File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.sam");
SamReaderFactory samFactory = SamReaderFactory.makeDefault().validationStringency(ValidationStringency.LENIENT);
SamReader inputSam = samFactory.open(smallBugBam);
for (SAMRecord samRecord : inputSam) {
System.err.println("Cigar is "+samRecord.getCigarString()+" Start="+samRecord.getAlignmentStart()+" End="+samRecord.getAlignmentEnd());
}
}
}
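Note that the corrupted length differs from the true one by exactly 2^28 (226643263 - 2^28 = -41792193), which is consistent with the 28-bit BAM CIGAR length field being decoded with a sign-extending shift somewhere. The arithmetic can be demonstrated in isolation (this is not htsjdk's actual decoding code):

```java
class CigarLengthDemo {
    // BAM packs each CIGAR element as (op_len << 4) | op into a uint32;
    // op_len occupies the upper 28 bits.
    public static int pack(int opLen, int op) {
        return (opLen << 4) | op;
    }

    // Correct decode: logical shift treats the 28-bit length as unsigned.
    public static int decodeLen(int packed) {
        return packed >>> 4;
    }

    // Buggy decode: arithmetic shift sign-extends bit 27 of the length,
    // turning lengths >= 2^27 (such as huge introns) negative.
    public static int decodeLenSigned(int packed) {
        return packed >> 4;
    }
}
```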
Hello,
I have previously been doing some work on Picard and htsjdk, but now when I try to update to the latest master branch in htsjdk there is a problem with the aux.bam and aux.sam files in the testdata/htsjdk/samtools directory. Apparently files named 'aux' are forbidden in Windows due to restrictions dating back to MS-DOS!!! (https://social.technet.microsoft.com/Forums/windows/en-US/e22c021d-d188-4ff2-a4dd-b5d58d979c58/the-specified-device-name-is-invalid)
Would it be possible for you to change the file names of those two files?