samtools / htsjdk
A Java API for high-throughput sequencing data (HTS) formats.
Home Page: http://samtools.github.io/htsjdk/
There are a number of classes that are lacking documentation. In general, any public API method should be documented (unless it is self-explanatory), and any complicated control logic should be properly commented.
Before we release, we would like to support interval range querying of CRAM files. @vadimzalunin could this be supported?
CloseableIterator<SAMRecord> query(final QueryInterval[] intervals,
final boolean contained)
We never use INVALID_PLATFORM_VALUE, but the spec defines platform values. The platform values are uppercase, but in production there are BAMs with mixed-case or lowercase (:/) platform values. Let's add a check in SamFileValidator for the correct set of values, case-insensitively, along with unit tests. This is motivated by a new @RG-PL
value in the SAM Spec: samtools/hts-specs#39
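A case-insensitive check of the kind proposed could be sketched as follows. The platform list below is an illustrative subset; the current SAM spec is the authoritative source, and this is not the actual SamFileValidator code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PlatformCheck {
    // Platform values taken from the SAM spec (illustrative subset; consult
    // the current spec for the authoritative list).
    private static final Set<String> VALID_PLATFORMS = new HashSet<>(Arrays.asList(
            "CAPILLARY", "LS454", "ILLUMINA", "SOLID", "HELICOS", "IONTORRENT", "PACBIO"));

    // Case-insensitive membership test, as the issue proposes for SamFileValidator.
    public static boolean isValidPlatform(String pl) {
        return pl != null && VALID_PLATFORMS.contains(pl.toUpperCase(Locale.ROOT));
    }
}
```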
I just noticed that there are changes to many core classes (SAMRecord, VariantContext, Interval) to implement a new Locatable interface. The main result seems to be deprecation of getChr(), getChrom() getChromosome(), getReference() etc. methods in preference to a new getContig().
I get the desire to standardize, but it's confusing to long-time users of the library to suddenly change and deprecate all the existing method names. More to the point, "contig" is not really an accurate term for most uses: for human, and most larger reference genomes, the chromosomes are scaffolds of multiple contigs (contiguous sequences) assembled with Ns in the middle. How about using something more generic like getReferenceSequence()?
Extending the htsjdk to parse symbolic alt alleles and return meaningful Alleles based on the parsed result would be useful to structural variation users of the API. In particular the following features would be useful:
One possible design is to subclass Allele:
abstract Allele
Allele <- BaseCallAllele
Allele <- SymbolicAllele
Allele <- NoCallAllele
Allele <- MissingAllele
SymbolicAllele <- StructuralVariationAllele (with an enum, subclass hierarchy, or getters for SV type and subtypes (sub-subtypes?) such as DUP:TANDEM)
SymbolicAllele <- BreakendAllele
SymbolicAllele <- BreakpointAllele
SymbolicAllele <- NamedContigAllele
The low-level design and implementation depend on resolution of ambiguities in the VCF specifications, as there are multiple points at which the VCF specifications are under-defined and ambiguous. For example, it is unclear as to whether "AA", "", "AA" are valid alleles.
I'm not sure if this has been fixed yet or not, as I was only able to test cramtools-2.1.jar from EBI web site. Is there a way to run picard tools with CRAM (I tried ViewSam and it didn't work).
Anyway, that aside, cramtools is unable to read CRAM files that have no reference. My file entirely uses CRAM features to store the base calls and never needs to reference an external sequence. It does now have @SQ M5 auxiliary tags (but does not require them); nor is it using an internally embedded reference.
I think this is still a legal file (apart from the oddity of officially needing @SQ M5, but I think we already decided to relax that for the case of embedded reference sequences), albeit a fairly inefficient one.
Hello, I want to use the samtools part of htsjdk, but the current Ant file doesn't have a task for building separate JAR files for each of the subprojects. Is there any particular reason for this?
I'm not going to say I think it's CRAM and the NM tag[1], but I think it's CRAM and the NM tag[2,3].
This command line works, using the test crams in the current cramtools master branch[4]:
$ java \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Sun Dec 21 10:05:07 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:07 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:08 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=163053568
$
The same command with the addition of -ea crashes on an assert[5]:
$ java -ea \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter INPUT=test.bam OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Dec 21 10:05:19 EST 2014] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Picard version: 1.126(5b984f92664175debe3826040fdae1496d1d8fa7_1418347776) JdkDeflater
[Sun Dec 21 10:05:19 EST 2014] picard.sam.SamFormatConverter done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=128974848
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.AssertionError
at htsjdk.samtools.CRAMFileWriter.flushContainer(CRAMFileWriter.java:275)
at htsjdk.samtools.CRAMFileWriter.finish(CRAMFileWriter.java:337)
at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:205)
at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:88)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:89)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:99)
$
Hi, what about the MIT licence in htsjdk?
If I use some classes to develop a new tool, must I include the licence text in all my classes?
Or am I not allowed to use these packages in my new tool?
I tried to use the SamReader:
reader.query(intervals, false);
where intervals contain a single QueryInterval which is:
QueryInterval interval = new QueryInterval(0,0,0);
It will throw an exception as follows:
htsjdk.samtools.util.RuntimeIOException: Expected 8 bytes, got 4
at htsjdk.samtools.AbstractBAMFileIndex$IndexStreamBuffer.readLong(AbstractBAMFileIndex.java:684) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.AbstractBAMFileIndex.readLong(AbstractBAMFileIndex.java:438) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:332) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:60) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:729) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:412) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:468) ~[dcc-htsjdk-1.117.jar:${hts-version}(d65f6b3af647b0e4610f17760b17c3323df64aab_1405347869)]
The BAM file size is about 15 GB downloaded from 1000 Genomes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00595/alignment/HG00595.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522.bam
Any help is appreciated.
Hi there,
Is there a tool to check whether a BAM/BAI file is valid?
Thank you!
Jerry
There is a bug in the TabixReader constructor.
You can specify the path to the index file ("idxFn", second argument of the last constructor), but when this value is not null, it is never assigned to the class variable "mIdxFn". Because of that you will get a NullPointerException in the readIndex() method.
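A minimal sketch of the kind of fix described. Field names follow the issue's description and the fallback follows the usual ".tbi" convention; this is a hypothetical simplification, not the actual TabixReader source:

```java
class TabixPathSketch {
    private final String mFn;
    private final String mIdxFn; // the field the issue says is never assigned

    // Fix: honor an explicit index path when given, otherwise fall back to
    // the ".tbi" naming convention, so mIdxFn is never left null.
    public TabixPathSketch(String fn, String idxFn) {
        this.mFn = fn;
        this.mIdxFn = (idxFn != null) ? idxFn : fn + ".tbi";
    }

    public String indexPath() { return mIdxFn; }
    public String dataPath() { return mFn; }
}
```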
There is currently no way to get information about the data source backing a SamReader. This makes it difficult to return useful error messages when there is any sort of exception. It should return a String which describes the resource well enough that a human could identify which file, URL, database, etc. was the problem.
JEXLMap already has JEXL context variables for isHomRef, isHet and isHomVar (circa line 97 in variant/variantcontext/JEXLMap.java), but it would be great if variables could be added for the other genotype types, specifically noCall.
Of course if GSA had commit rights we could take care of some of these issues on our own.... ;)
Other features return 1-based locations. This should be fixed. Or we should delete OldDbSNPFeature.
Does anyone use this?
Hi, I'm currently moving my code from picard1.100 to htsjdk. So far I cannot see how I should now map those 'old' classes:
Any hint?
Thanks.
I've used biobambam to merge some files, and it outputs
@HD VN:1.5 SO:coordinate
using https://github.com/gt1/libmaus/blob/master/src/libmaus/bambam/BamCatHeader.hpp#L116
According to the spec ( http://samtools.github.io/hts-specs/SAMv1.pdf ), 1.5 is the current version, but picard ValidateSamFile gives an error, which the diff below eliminates.
diff --git a/src/java/htsjdk/samtools/SAMFileHeader.java b/src/java/htsjdk/samtools/SAMFileHeader.java
index 2e237c4..ac361b1 100644
--- a/src/java/htsjdk/samtools/SAMFileHeader.java
+++ b/src/java/htsjdk/samtools/SAMFileHeader.java
@@ -46,9 +46,9 @@ public class SAMFileHeader extends AbstractSAMHeaderRecord
public static final String VERSION_TAG = "VN";
public static final String SORT_ORDER_TAG = "SO";
public static final String GROUP_ORDER_TAG = "GO";
- public static final String CURRENT_VERSION = "1.4";
+ public static final String CURRENT_VERSION = "1.5";
public static final Set<String> ACCEPTABLE_VERSIONS =
- new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4"));
+ new HashSet<String>(Arrays.asList("1.0", "1.3", "1.4", "1.5"));
/**
* These tags are of known type, so don't need a type field in the text representation.
I suspect this suggestion will be controversial.
It would be helpful for a lot of reasons to allow Java 8 code in htsjdk. There are many useful features included in 8 which were not in 6 or 7. See [here](http://www.oracle.com/technetwork/java/javase/8-whats-new-2157071.html) for a complete list.
In my view, the biggest and most desirable changes are the introduction of lambdas, default methods in interfaces, and the stream APIs.
htsjdk 1.129
None of the validation methods on VariantContext is tested:
extraStrictValidation()
validateReferenceBases()
validateRSIDs(rsIDs)
validateAlternateAlleles()
validateChromosomeCounts()
Overriding and using clone() is discouraged; SAMRecord and SAMFileHeader should provide copy constructors or static factory methods instead.
Hi,
Are you planning on distributing htsjdk on Maven Central? I can only see 1.118 there, from user "seqdoop", which I expect isn't you guys.
(The scenario is that I'm using the GATK-framework to develop custom walker but need to update htsjdk from the supplied version 1.120).
thanks
Daniel
Let's add support for the @RG-PM tag in the SAM header. This relates to the recent SAM specification change: samtools/hts-specs#16
A first implementation hint is to take a look at SAMReadGroupRecord and its STANDARD_TAGS. We will want to add the PM tag and expose it there. Add unit tests too.
See: #115
Test cases need to be converted to TestNG and additional coverage added. A fair number of the tests are currently run by main()s. These tests should be converted to a full TestNG test suite so that our automated builds can run them. This is also a good time to look at code coverage. Once the tests are converted to TestNG we can run some code coverage plugins to get a good sense of how well this is currently covered.
Hi developers,
I need your help again! I got an issue when I try to query bam files using SamReader/SAMFileReader.
Basically, there are two reads in the BAM:
ERR019497.24419495 73 1 11966 0 108M = 11966 0 TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9 X0:i:3 X1:i:4 MD:Z:105T2 RG:Z:ERR019497 AM:i:0 NM:i:1 SM:i:0 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB XT:A:R
ERR019497.24419495 133 1 11966 0 104M4S = 11966 0 ACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGAGGGCCTGCCTGATCTTCCACCCCCTCTCCCAGGGCCAAACCTAAACCCCCCTACCCCCCCCCCCCCGCCGGCTG 9@@EAC8C6DF?EEFDEFFFCEEFFDFGFE?:&8),4.51.+59*88'/9//62-$%,.-;%+420#&.2C4B'%2-,'&/$%%$$'%)%.4(*'0/5+*/25#%-%& XC:i:104 RG:Z:ERR019497
Note that they both have the same QNAME (ERR019497.24419495), RNAME (1) and POS (11966).
When I use the reader query API as follows against this BAM file:
SAMRecordIterator itr = reader.queryOverlapping(new QueryInterval[] { query });
where
QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11966, 11966);
I got 2 reads back which is perfect!
However when I use:
QueryInterval query = new QueryInterval(reader.getFileHeader().getSequenceIndex("1"), 11967, 11967);
Only one read is back:
ERR019497.24419495 73 1 11966 0 108M = 11966 0 TTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGGTT =@BDCFFGEEGGFFBGGDFHDEEDFBIHHBF3GIDDHF;IAHHIHFAGGID9CHDEEFIHJHGJIHJ?1CCG5?EFFB.AAEHEEFGFBFGFIIBHGGFD35==C%:9 X0:i:3 X1:i:4 MD:Z:105T2 RG:Z:ERR019497 AM:i:0 NM:i:1 SM:i:0 BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@D@BB XT:A:R
Note that both reads are about 100 bases long. So I would expect to get 2 reads back from the second query as well, but that didn't happen.
Is it a bug in the library? If it is not, can you provide me a rationale?
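For reference, the expectation in the question can be written as plain interval arithmetic (QueryInterval coordinates are 1-based and inclusive). One possible rationale for the observed result: the second read's flag (133) has 0x4 set, i.e. it is unmapped, so its placement at 11966 is nominal and it contributes no alignment span reaching 11967. A sketch of the arithmetic:

```java
class OverlapSketch {
    // 1-based, inclusive overlap test, the convention QueryInterval documents.
    public static boolean overlaps(int readStart, int readEnd, int qStart, int qEnd) {
        return readStart <= qEnd && readEnd >= qStart;
    }

    // Reference span of a CIGAR like "108M": alignment end = start + refLength - 1.
    public static int alignmentEnd(int start, int refLength) {
        return start + refLength - 1;
    }
}
```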
You can download the BAM files to reproduce the issue at http://s000.tinyupload.com/index.php?file_id=00308273047953999316
Thank you!
Jerry
Please note that @jkbonfield has provided us with some test cases.
The README doesn't state a license, and there is also no LICENSE file. It seems from at least some of the source code that the code is under the MIT license. I'd recommend making that explicit by putting a LICENSE file in the root folder of the repo, and/or declaring the license under which the code is released explicitly in the README.
Consider a VCF file with the following two lines in the header:
##GATKCommandLine=<ID=ExtractConsensusSites, ....
##GATKCommandLine=<ID=FilterLiftedVariants,....
When this is parsed by the VCFHeader class, both lines are fed into the metadata object as follows (VCFHeader.java:279):
mOtherMetaData.put(line.getKey(), line);
And the second entry overwrites the first, so the first line is lost from the meta-data.
Is there supposed to be a restriction on the number of elements with the same key allowed in the VCF format?
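The overwrite described above is inherent to Map.put; a multimap-style container preserves repeated keys such as GATKCommandLine. A sketch of the alternative (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetaDataSketch {
    // Multimap-style storage: repeated keys keep all their values
    // instead of each put() overwriting the previous one.
    private final Map<String, List<String>> meta = new LinkedHashMap<>();

    public void put(String key, String line) {
        meta.computeIfAbsent(key, k -> new ArrayList<>()).add(line);
    }

    public List<String> get(String key) {
        return meta.getOrDefault(key, new ArrayList<>());
    }
}
```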
I'm running ant 1.9.4 :
$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
$ echo $ANT_HOME
/commun/data/packages/apache-ant/apache-ant-1.9.4
in build.xml the file htsjdk.version.property.xml should be saved as XML but it is not:
$ /commun/data/packages/apache-ant/apache-ant-1.9.4/bin/ant write-version-property
(...)
$ cat /commun/data/users/lindenb/src/jvarkit-git/htsjdk/htsjdk.version.property.xml
#htsjdk version
#Mon, 08 Sep 2014 10:53:51 +0200
htsjdk-version=1.119
It breaks ant when reading the xml to get the htsjdk.version( lindenb/jvarkit#11 )
Maybe it should be named 'htsjdk.version.properties'?
I would like to see code coverage here: https://coveralls.io/r/samtools/htsjdk
There appears to be a Slice.setRefMD5(), but it appears that it's not wired in. By the time SliceIO writes the Slice.refMD5, it is still null.
Creating the cram "works", but crashes in CRAMTools 2.1 unless passed the hidden --resilient argument.
$ java \
-jar ~/src/picard/dist/picard.jar SamFormatConverter \
R= ~/src/cramtools/src/test/resources/data/set1/small.fa \
I= ~/src/cramtools/src/test/resources/data/set1/small.cram \
O= test.cram
[Wed Jan 14 15:58:07 ART 2015] picard.sam.SamFormatConverter INPUT=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.cram OUTPUT=test.cram REFERENCE_SEQUENCE=/Users/kshakir/src/cramtools/src/test/resources/data/set1/small.fa VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Jan 14 15:58:07 ART 2015] Executing as [email protected] on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01; Picard version: 1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159) JdkDeflater
[Wed Jan 14 15:58:09 ART 2015] picard.sam.SamFormatConverter done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=163577856
$ java \
-jar cramtools-2.1.jar \
bam \
-R ~/src/cramtools/src/test/resources/data/set1/small.fa \
-I test.cram
ERROR 2015-01-14 15:55:38 Slice Reference MD5 mismatch for slice 0:2-22180, CCTAAGCCTA...TCGGTACTCC
ERROR 2015-01-14 15:55:38 Cram2Bam Reference sequence MD5 mismatch for slice: seq id 0, start 2, span 22179, expected MD5 00000000000000000000000000000000
$
Feature is a very useful interface when dealing with different input file types. However, the Feature interface explicitly states: "The coordinate conventions for start and end are implementation dependent, no specific contract is specified here." As a feature request, I would propose that the interface provide some method to obtain the start/end coordinates with a known coordinate scheme, such that every type of feature will give you the same value. This way you know whether you're getting 0- or 1-based numbers. This could be either:
create get1BasedStart()/get1BasedEnd() methods, or
create some sort of CoordinateType enum, with 1- and 0-based values, and add a getCoordinateType() method;
or there are plenty of other options. Doesn't really matter what, so long as it always returns the same value from any implementation of Feature.
Apologies if there's already a way to get at this problem...
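The second proposal could be sketched like this (names are hypothetical, taken from the suggestion above):

```java
class CoordinateSketch {
    // Hypothetical enum from the second proposal in the request.
    public enum CoordinateType { ZERO_BASED_HALF_OPEN, ONE_BASED_INCLUSIVE }

    // Normalize any implementation's start to 1-based, given its declared convention.
    public static int oneBasedStart(int start, CoordinateType type) {
        return type == CoordinateType.ZERO_BASED_HALF_OPEN ? start + 1 : start;
    }
}
```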
CramIO.isCRAM() should not use DataInputStream.readFully() on unknown input streams. SamFileReader.isBAMFile() instead uses SamFileReader.readBytes(), and hence does not throw an exception when the stream is not as expected.
Test case:
Assert.assertFalse(CramIO.isCRAM(new ByteArrayInputStream(new byte[0])));
Also, like SamFileReader.isBAMFile() (via isValidFile()), isCRAM() should probably check markSupported() before using mark().
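The suggested probe pattern can be sketched with only the standard library (this is not htsjdk's actual code): peek at the magic bytes via mark()/reset(), and return false rather than throwing on short input:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MagicProbe {
    // Peek at the first bytes of a stream without consuming them and without
    // throwing on short input -- the behavior the issue asks of isCRAM().
    public static boolean startsWith(InputStream in, byte[] magic) throws IOException {
        if (!in.markSupported()) {
            in = new BufferedInputStream(in); // guarantee mark/reset support
        }
        in.mark(magic.length);
        byte[] buf = new byte[magic.length];
        int n = 0, r;
        while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
            n += r;
        }
        in.reset();
        return n == magic.length && Arrays.equals(buf, magic);
    }
}
```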
Update CRAM code base to our code standards. This mostly involves code
style, variable names and standard conventions.
Currently, if you pass a block-compressed file into IndexFactory.createTabixIndex(), the file never gets decompressed internally, causing the underlying codec to blow up with errors like the following:
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:115)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:88)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:41)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:413)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:401)
at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:327)
I see two possible solutions, and wanted to solicit feedback on which would be preferable:
1. Add an overload of createTabixIndex() that accepts a stream rather than a File (this places the burden of setting up the stream correctly on the client).
2. Have FeatureIterator.initStream() check for block-compressed input via AbstractFeatureReader.hasBlockCompressedExtension(), and wrap the stream returned within a BlockCompressedInputStream as needed.
Thoughts?
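Option 2 amounts to an extension check plus stream wrapping. A stdlib sketch, in which GZIPInputStream stands in for htsjdk's BlockCompressedInputStream (BGZF data is gzip-compatible for plain reading):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DecompressSketch {
    // Extension-based wrapping, as option 2 suggests. Real code would test
    // the extensions hasBlockCompressedExtension() knows about and wrap with
    // BlockCompressedInputStream instead of GZIPInputStream.
    public static InputStream maybeWrap(String fileName, InputStream in) throws IOException {
        if (fileName.endsWith(".gz") || fileName.endsWith(".bgz")) {
            return new GZIPInputStream(in);
        }
        return in;
    }
}
```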
htsjdk.samtools.util.IOUtil depends on CBZip2{Input,Output}Stream in order to implement openBzipFileForReading and openBzipFileForWriting, and the calls in openFileForReading and openFileForWriting where file types are autodetected by extension.
CBZip2{Input,Output}Stream are provided in the source in lib/apache-ant-1.8.2-bzip2.jar, but this is effectively a binary blob. There is no expressed provenance for how this jar was compiled (although presumably the relevant files from apache ant were carved out). Also, since this portion of ant is not distributed by apache as an individual component, this causes problems for the possible move towards maven for compilation and deployment (the pull request for #55 currently declares a dependency on all of apache-ant-1.8.2, which is a bit excessive just for bzip2 support).
Possible solutions would be to:
1. Copy the relevant source into an htsjdk.samtools.util.bzip2 package. This should be fine from a license standpoint (ant is under Apache License, version 2.0).
2. Depend on a broader compression library that provides these stream classes, used from IOUtil. On the up side, this would allow support for several more compression types (in addition to bzip2, also pack200, .Z, xz and lzma). A slight downside is that this dependency is about 357k (compared to the 30k or so that the apache-ant-1.8.2-bzip2.jar is).
I would be happy to submit a pull request for either (probably would take less time than it did to type up this message), but wanted to know if there was a consensus as to which is the better approach?
Hello, I need to show completion percentages for an import operation that uses the VCFFileReader.
At the moment the constructor asks for a java.io.File object which, being only a fancy way to represent a path, doesn't allow me to tell() how much data has been consumed (and this is the only way to get an idea of how much time is left to read the entire VCF file).
Is there a way to access the underlying InputStream?
Otherwise, can someone add this feature?
Being able to know (even approximately) how much time is left seems kind of a must-have feature, IMO.
Thanks!
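The progress-tracking piece itself is simple to build: a byte-counting stream wrapper (as the request notes, the reader would still need a stream-accepting entry point for this to help). A self-contained sketch:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class CountingInputStream extends FilterInputStream {
    private long count = 0;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    // Bytes consumed so far; divide by the file length for a rough progress figure.
    public long getCount() {
        return count;
    }
}
```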
I used htsjdk.samtools.fastq.FastqReader to read FASTQ files processed with cutadapt: some reads were empty, and the FastqReader throws an exception from FastqReader.checkLine.
Is there a way to fix this? The field 'skipBlankLines' makes it difficult to fix.
Another idea: in FastqReader, would it be advantageous to use htsjdk.samtools.util.LineReader instead of a java.io BufferedReader?
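A tolerant reader of the kind needed can be sketched without htsjdk: a hypothetical 4-line parser that accepts empty sequence and quality lines (as cutadapt can emit) rather than skipping blank lines:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TolerantFastqSketch {
    // Parses strict 4-line FASTQ records as {header, sequence, quality},
    // permitting empty sequence/quality lines instead of treating them as blanks.
    public static List<String[]> readAll(BufferedReader reader) throws IOException {
        List<String[]> records = new ArrayList<>();
        String header;
        while ((header = reader.readLine()) != null) {
            String seq = reader.readLine();
            String plus = reader.readLine();
            String qual = reader.readLine();
            if (seq == null || plus == null || qual == null || !plus.startsWith("+")) {
                throw new IOException("Truncated or malformed FASTQ record: " + header);
            }
            records.add(new String[] { header, seq, qual });
        }
        return records;
    }
}
```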
Currently filters implement 2 different functions: a filter on a single SAMRecord and a filter on 2 mated SAMRecords.
This should be reduced to just a single method that takes 1 SAMRecord. Classes that need a mate-pair filter can build their own, or filters that are needed for mated pairs could implement their own MateFilter interface. In almost all cases the result on mates is just && or || of the primary filter; it's easy to compose these. It would also eliminate many filters throwing UnimplementedOperation exceptions.
This would make #177 easier since filters would then fit into the standard notion of what a filter or predicate is.
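The composition described maps directly onto java.util.function.Predicate (Java 8). A sketch with a stand-in Read type (SAMRecord's real accessors are not shown here):

```java
import java.util.function.Predicate;

class FilterComposition {
    // Stand-in for SAMRecord; only the fields this sketch needs.
    static class Read {
        final boolean unmapped;
        final boolean mateUnmapped;
        Read(boolean unmapped, boolean mateUnmapped) {
            this.unmapped = unmapped;
            this.mateUnmapped = mateUnmapped;
        }
    }

    // Single-record filters...
    static final Predicate<Read> READ_MAPPED = r -> !r.unmapped;
    static final Predicate<Read> MATE_MAPPED = r -> !r.mateUnmapped;

    // ...composed with and()/or() instead of a second mate-pair filter method.
    static final Predicate<Read> BOTH_MAPPED = READ_MAPPED.and(MATE_MAPPED);
}
```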
Hello,
I'm unsure if this is the right way to ask a developing question. But I searched for examples how to use the htsjdk and for the SamReader/SamReaderFactory there are nice examples.
Are there also some examples available of how to read BED or VCF files?
Best Air
It would be nice to be able to specify the index-file in the VCFFileReader.
The underlying FeatureReader has the ability to specify the index resource, but there is no way in the VCFFileReader to set it (e.g. via an additional constructor). At the moment the VCFFileReader only works if the index file is located in the same folder as the VCF file.
(related to #115)
There's been some interest in having an "unstable" jar published to Maven on each push to master, in addition to the "stable" release every ~2 weeks.
This would be useful for larger projects that depend on HTSJDK where a few members of the project want to contribute features to HTSJDK, without everyone having to worry about syncing their local versions of HTSJDK before building the other project.
I'm not sure on the best way to do this. We'd also need to make it very clear that it's for experimental use only. In the short term, a fork might be an easier solution, but let's keep this on the radar.
@lbergelson @akiezun @droazen are interested in this.
The ability to compare two *AM files seems generally useful, especially for unit tests.
There is currently a Picard CLP named CompareSAMs that tests the equality of two such files, and collects more detailed statistics as well. Many unit tests in Picard instantiate this class in a non-CLP context, making dangerous assumptions about mutated fields and such.
It could easily be refactored into a standalone class that takes two SamReaders and returns an object with various collected metrics, with the Picard tool being a thin wrapper.
Alternatively, if such functionality already exists in HTSJDK, the Picard tool should be using it...
(see also the SamFileValidator / ValidateSamFile duality)
If encode, decode and isDigit were made static, one could use them without first creating an instance of the class.
Users like Monkol are finding the HWP annotation to be very valuable, but unfortunately we stopped emitting it in GenotypeGVCFs because there is an integer overflow problem in the code with 91K samples. Hopefully this stack trace from Tim will be enough to reproduce the issue.
Hi All,
First I thought you’d all like to know that the 91k got kicked off on Saturday. However, all of the GenotypeGVCF jobs are failing with the following message:
java.lang.ArrayIndexOutOfBoundsException: -4408
at htsjdk.tribble.util.popgen.HardyWeinbergCalculation.hwCalculate(HardyWeinbergCalculation.java:94)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.computeHardyWeinbergPvalue(GATKVariantContextUtils.java:1878)
at org.broadinstitute.gatk.tools.walkers.annotator.GenotypeSummaries.annotate(GenotypeSummaries.java:94)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:177)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:227)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:186)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Ryan: would you be able to take a look at this ASAP today? The new joint calling is blocked on this. A few things you might need:
logs: /seq/dax/macarthur_joint_calling/v2/logs/genotypeGvcfs.scattered.macarthur_joint_calling.*.log
GATK: /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar (built from gsa-unstable on Friday afternoon)
Example command:
java -Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/local/scratch/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp3/macarthur_joint_calling
-Djava.io.tmpdir=/seq/picardtemp4/macarthur_joint_calling
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx16000m
-jar /seq/tng/tfenne/gatkbin/GenomeAnalysisTK-v3.1-163-g4284d7a.jar
-T GenotypeGVCFs
--disable_auto_index_creation_and_locking_when_reading_rods
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta
-o /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/genotypes.unfiltered.vcf.gz
-D /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf
-L /seq/dax/macarthur_joint_calling/v2/scattered/temp_0102_of_1000/scattered.intervals
-V /seq/dax/macarthur_joint_calling/v2/combined_gvcfs/all_combined_gvcfs.list
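The internals of hwCalculate aren't shown here, but the general failure mode is easy to reproduce: at ~91K samples, an int intermediate over all chromosomes (2N) overflows and can yield a negative index. An illustrative sketch, not the actual HardyWeinbergCalculation code:

```java
class OverflowDemo {
    // Triangular-number-style intermediate, as might appear when sizing or
    // indexing arrays over all genotype pairs. Overflows int for large cohorts.
    public static int intCount(int chromosomes) {
        return chromosomes * (chromosomes + 1) / 2;
    }

    // Widening to long before the multiply avoids the overflow.
    public static long longCount(int chromosomes) {
        return (long) chromosomes * (chromosomes + 1) / 2;
    }
}
```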
Recently, I got an error:
Cannot merge sequence dictionaries because sequence chrM and chrY are in different orders in two input sequence dictionaries.
Can someone explain this error further?
Does this need to be an error, or is this case just not implemented yet? Is there a downstream problem if sequence headers that are in different orders are still added?
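The condition the merge code appears to enforce is that sequences common to both dictionaries occur in the same relative order (otherwise no merged order can be consistent with both inputs). A sketch of that check (illustrative, not htsjdk's implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DictOrderCheck {
    // True when the sequences common to both dictionaries appear in the same
    // relative order -- the condition the merge error message refers to.
    public static boolean sameRelativeOrder(List<String> a, List<String> b) {
        List<String> commonInA = a.stream().filter(b::contains).collect(Collectors.toList());
        List<String> commonInB = b.stream().filter(a::contains).collect(Collectors.toList());
        return commonInA.equals(commonInB);
    }
}
```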
Not sure of the performance implications, but perhaps ReadTag.buf could still be "risky", maybe just with a static ThreadLocal<ByteBuffer> instead of just a static ByteBuffer?
I've had a bug report against our software which appears to be an underlying bug in the htsjdk parser for BAM files. The read which fails to parse is a spliced read with a huge intron in it (226Mbp). When parsing the cigar string htsjdk corrupts this value (it looks like it overflows so I get a negative value), but if I use samtools to convert the bam file to sam, then htsjdk reads it just fine.
I've put the example files I used (only 1 read in them) up at:
http://ftp2.babraham.ac.uk/ftpusr68/
Some code which illustrates the problem is below. For the sam file I get:
Cigar is 109M226643263N25M17S Start=10002 End=226653398
..but for the equivalent BAM file I get:
Ignoring SAM validation error: ERROR: Record 1, Read name GR_10257_idx_5/2, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
Cigar is 109M-41792193N25M17S Start=10002 End=-41782058
This is using htsjdk-1.129.jar
import java.io.File;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;
public class PicardCigarBug {
public static void main (String [] args) {
File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.bam");
// File smallBugBam = new File("C:/Users/andrewss/Desktop/Bug/small.sam");
SamReaderFactory samFactory = SamReaderFactory.makeDefault().validationStringency(ValidationStringency.LENIENT);
SamReader inputSam = samFactory.open(smallBugBam);
for (SAMRecord samRecord : inputSam) {
System.err.println("Cigar is "+samRecord.getCigarString()+" Start="+samRecord.getAlignmentStart()+" End="+samRecord.getAlignmentEnd());
}
}
}
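Note that the corrupted length differs from the true one by exactly 2^28 (226643263 - 2^28 = -41792193), which is consistent with the 28-bit BAM CIGAR length field being decoded with a sign-extending shift somewhere. The arithmetic can be demonstrated in isolation (this is not htsjdk's actual decoding code):

```java
class CigarLengthDemo {
    // BAM packs each CIGAR element as (op_len << 4) | op into a uint32;
    // op_len occupies the upper 28 bits.
    public static int pack(int opLen, int op) {
        return (opLen << 4) | op;
    }

    // Correct decode: logical shift treats the 28-bit length as unsigned.
    public static int decodeLen(int packed) {
        return packed >>> 4;
    }

    // Buggy decode: arithmetic shift sign-extends bit 27 of the length,
    // turning lengths >= 2^27 (such as huge introns) negative.
    public static int decodeLenSigned(int packed) {
        return packed >> 4;
    }
}
```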
Hello,
I have previously been doing some work on Picard and htsjdk, but now when I try to update to the latest master branch in htsjdk there is a problem with the aux.bam and aux.sam files in the testdata/htsjdk/samtools directory. Apparently files named 'aux' are forbidden in Windows due to restrictions dating back to MS-DOS!!! (https://social.technet.microsoft.com/Forums/windows/en-US/e22c021d-d188-4ff2-a4dd-b5d58d979c58/the-specified-device-name-is-invalid)
Would it be possible for you to change the file names of those two files?