bigdatagenomics / bdg-formats
Open source formats for scalable genomic processing systems using Avro. Apache 2 licensed.
License: Apache License 2.0
The documentation for VariantAnnotation fields states that Number=R VCF INFO attribute values are split for multi-allelic sites:
/**
Total read depth, VCF INFO reserved key AD, Number=R, split for multi-allelic
sites.
*/
union { null, int } readDepth = null;
/**
Forward strand read depth, VCF INFO reserved key ADF, Number=R, split for
multi-allelic sites.
*/
union { null, int } forwardReadDepth = null;
/**
Reverse strand read depth, VCF INFO reserved key ADR, Number=R, split for
multi-allelic sites.
*/
union { null, int } reverseReadDepth = null;
...
/**
Additional variant attributes that do not fit into the standard fields above.
The values are stored as strings, even for flag, integer, and float types. VCF
INFO key values with Number=., Number=0, Number=1, and Number=[n] are shared across
all alternate alleles in the same VCF record. VCF INFO key values with Number=A and
Number=R are split for multi-allelic sites.
*/
map<string> attributes = {};
When converting VCF records to VariantAnnotation records, we assume that the first index of the array is the reference allele value, and use the alternate allele index to extract the value for the alternate allele.
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
When converting from VariantAnnotation back to VCF records, we no longer have access to the reference allele value. I can think of at least four ways to handle this:
Use a sentinel value (e.g. -1) for the lost reference allele value:
1 1024 . G A . PASS MY=-1,20
1 1024 . G T . PASS MY=-1,30
Use the VCF missing value:
1 1024 . G A . PASS MY=.,20
1 1024 . G T . PASS MY=.,30
Drop the reference allele value, changing the effective cardinality:
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
Keep Variant and VariantAnnotation records for the reference allele when splitting multi-allelic sites:
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "<*>", "annotation":{"attributes": "MY=10"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
1 1024 . G <*> . PASS MY=10
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
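Three of these ways (a sentinel, the VCF missing value, and merging a kept reference-allele value back in) can be sketched as one hypothetical helper; the names here are mine, not from the schema:

```python
def rebuild_number_r(alt_value, reference_value=None, sentinel=None):
    """Rebuild a Number=R INFO value for a split record whose reference
    allele value may have been lost: use reference_value when a
    reference-allele record was kept, else a sentinel (e.g. -1) if given,
    else the VCF missing value '.'."""
    if reference_value is not None:
        ref = reference_value
    elif sentinel is not None:
        ref = sentinel
    else:
        ref = "."
    return f"MY={ref},{alt_value}"

print(rebuild_number_r(20, reference_value=10))  # MY=10,20
print(rebuild_number_r(20, sentinel=-1))         # MY=-1,20
print(rebuild_number_r(20))                      # MY=.,20
```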
For the Number=R VCF INFO attribute values that map to fields (currently AD, ADF, ADR) we have a couple more options:
1 1024 . G A,T . PASS AD=10,20,30
Use array<int> for the field type, with cardinality 2 (reference allele value first, then alternate allele value):
array<int> readDepth = [];
array<int> forwardReadDepth = [];
array<int> reverseReadDepth = [];
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 10,20}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": 10,30}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
Use separate fields for the reference allele values:
union { null, int } readDepth = null;
union { null, int } forwardReadDepth = null;
union { null, int } reverseReadDepth = null;
union { null, int } referenceReadDepth = null;
union { null, int } referenceForwardReadDepth = null;
union { null, int } referenceReverseReadDepth = null;
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 20, "referenceReadDepth": 10}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 30, "referenceReadDepth": 10}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
A new option, combined with option 5 above, as proposed below:
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS AD=5,15,25;MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": [5,15], "attributes": "MY=10,20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": [5,25], "attributes": "MY=10,30"}}
1 1024 . G A . PASS AD=5,15;MY=10,20
1 1024 . G T . PASS AD=5,25;MY=10,30
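Under this proposal, the split of the AD field could look like the following sketch (the helper name is hypothetical):

```python
def split_allelic_depth(ad_values, alt_index):
    """Split a Number=R AD value into the proposed two-element array:
    [reference allele value, alternate allele value]."""
    return [ad_values[0], ad_values[1 + alt_index]]

# AD=5,15,25 with ALT alleles A,T:
print(split_allelic_depth([5, 15, 25], 0))  # [5, 15]
print(split_allelic_depth([5, 15, 25], 1))  # [5, 25]
```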
Similar to the VariantCallingAnnotations
record (see #51), I think the DatabaseVariantAnnotation
record needs some TLC.
Am I missing this somewhere, or is the sample name not stored in AlignmentRecord? I'm looking for something equivalent to SAMRecord.getReadGroup().getSample() from https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMReadGroupRecord.java#L70
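For reference, the sample name lives in the SM tag of the @RG line in a SAM/BAM header; a minimal, library-free sketch of pulling it out (this helper is hypothetical, not part of bdg-formats):

```python
def sample_from_rg_line(rg_line):
    """Extract the SM (sample) value from a SAM @RG header line."""
    if not rg_line.startswith("@RG"):
        return None
    for field in rg_line.split("\t")[1:]:
        tag, _, value = field.partition(":")
        if tag == "SM":
            return value
    return None

line = "@RG\tID:group1\tSM:NA12878\tPL:ILLUMINA"
print(sample_from_rg_line(line))  # NA12878
```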
Having a diagram of all the data structures of bdg-formats would help a newcomer get started in the project.
Many tools allow us to generate such a diagram automatically from Java sources, like UMLGraph, which is open source.
I'll submit a pull request shortly that achieves this feature.
As raised on bigdatagenomics/adam#815, the readNum field of AlignmentRecord is a bit vaguely named. E.g., readNum could be read as a UUID, instead of the number of the read in the fragment.
I just noticed this field getting lost when converting into and out of BAMs with ADAM. Should we/I add it? Should we just infer it in ADAM? Or continue not supporting it?
See TLEN in the SAM spec. htsjdk puts it in SAMRecords as "inferred insert size".
Per discussion at bigdatagenomics/adam#1290: currently we just have fragmentStartPosition and fragmentLength. To perform predicate filtering by a ReferenceRegion, we need a fragmentEndPosition field.
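Assuming fragmentStartPosition follows the 0-based, half-open convention, the end position could be derived from the two existing fields (a sketch of the relationship, not the actual schema change):

```python
def fragment_end_position(fragment_start_position, fragment_length):
    """Derive an exclusive fragment end coordinate, assuming 0-based,
    half-open coordinates."""
    return fragment_start_position + fragment_length

# A fragment starting at 1024 with length 300:
print(fragment_end_position(1024, 300))  # 1324
```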
Publish the C/C++ artifacts, generated from the latest JSON (via bigdatagenomics.github.io?).
In bigdatagenomics/adam#815 we decided that the normalization provided by the Sequences in the Fragment record wasn't that useful and was somewhat hard to reason about.
I remember there was some discussion of converting the name of 'ADAMRecord' to 'Read' at the same time as we made the move to remove the 'ADAM' prefix from a lot of these schemas -- although I can't find the discussion of it, at the time.
However, can I vote that we change the name of 'Read' back to 'Record'? If 'Record' is too generic, maybe something more specific like 'AlignmentRecord'.
The point is, the 'Read' schema actually bears a many-to-one relationship with the reads themselves (which will become clear, if we start parsing the FASTQ files directly in any context where we're parsing the raw data), and using a '[something]Record' name will (continue to) evoke the association with SAMRecord which is so clearly implied by the actual presence of fields with similar names and semantics.
Thoughts?
Related to bigdatagenomics/adam#194, and #108. Specifically, this is a subset of #108 that I'd like to get into 0.10.0.
Variant has:
/**
True if filters were applied for this variant. VCF column 7 "FILTER" any value other
than the missing value.
*/
union { null, boolean } filtersApplied = null;
/**
True if all filters for this variant passed. VCF column 7 "FILTER" value PASS.
*/
union { null, boolean } filtersPassed = null;
/**
Zero or more filters that failed for this variant. VCF column 7 "FILTER" shared across
all alleles in the same VCF record.
*/
array<string> filtersFailed = [];
While VariantCallingAnnotations has:
// FILTER: True or false implies that filters were applied and this variant PASSed or not,
// while 'null' implies that no filters were applied.
union { null, boolean } variantIsPassing = null;
array<string> variantFilters = [];
I'm going to make VariantCallingAnnotations match Variant.
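The Variant semantics above suggest a mapping from the VCF FILTER column like the following (a hypothetical helper illustrating the field semantics, not ADAM's converter):

```python
def parse_filter_column(filter_value):
    """Map a VCF FILTER column value onto the Variant filter fields:
    '.' -> filters not applied; 'PASS' -> applied and passed;
    anything else -> a semicolon-separated list of failed filters."""
    if filter_value == ".":
        return {"filtersApplied": False, "filtersPassed": None, "filtersFailed": []}
    if filter_value == "PASS":
        return {"filtersApplied": True, "filtersPassed": True, "filtersFailed": []}
    return {
        "filtersApplied": True,
        "filtersPassed": False,
        "filtersFailed": filter_value.split(";"),
    }

print(parse_filter_column("q10;s50"))
# {'filtersApplied': True, 'filtersPassed': False, 'filtersFailed': ['q10', 's50']}
```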
Releasing bdg-formats fails under Java 8 because certain Javadoc warnings have changed to errors in Java 8. We should look more closely to see whether the issues are from the comments we've written inline in our avdl, or are caused by the Javadoc generated by Avro. To work around, we can just cut the releases using Java 7.
The last revision to the Pileup schema removed too many fields; we need the sampleId field (or its equivalent) added back in.
This is a ticket to capture our collective thinking around the re-organization of the Feature schema. The Feature schema needs to be edited to satisfy the following (additional) requirements:
See discussion on bigdatagenomics/adam#1103
I can't find the Pileup class file in the org.bdgenomics.formats.avro package, and then there is this error:
val pileups = reads.adamRecords2Pileup().cache()
:32: error: value adamRecords2Pileup is not a member of org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord]
val pileups = reads.adamRecords2Pileup().cache()
How can I get the file and resolve the error?
Thanks!
jack xu
I have trouble accessing those "end" fields (e.g. AlignmentRecord.end, Variant.end) with Spark SQL, because end is a reserved keyword there and it conflicts with the field names.
I was wondering: Is it possible to assign different names to those fields?
To support things like random barcodes, drop-seq, etc.
Followup to #44.
Specifically, better schema level support for:
END field
ADAM is on Avro 1.7.7, while bdg-formats is on 1.7.4. I think this is causing some weird behavior with the Spark 1.5 stream of releases, which pull in Avro 1.7.7.
#103 removed the StructuralVariant and StructuralVariantType records; this issue suggests we might want to reconsider that decision.
The reference between Variant and VariantAnnotation should be forward so that it can be projected away.
We should leave some top-level documentation in bdg-formats, as part of the README.
This should explain, among other things, that
@heuermh will break out the variant/genotype changes into smaller chunks, so that we can roll them into ADAM downstream incrementally.
I wanted to revisit VariantCallingAnnotations. It seems a lot of these fields are GATK specific, and it seems hard to extend to add new annotations or variant calling output.
Might it make more sense to move most of these to attributes and improve VCF output from attributes (which it doesn't seem to do currently)?
We should add fields for:
We should also add a tag array.
DatabaseVariantAnnotation to VariantAnnotation
VariantCallingAnnotations to GenotypeAnnotation
TranscriptEffect under VariantAnnotation
Seems like both of these are still hanging around; was there any resolution on this?
I can't remember the part of the discussion around #90 where alternateAllele in TranscriptEffect was dropped. As far as I can tell, this is still necessary when reading in variants to associate an effect with a specific alternate allele.
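For context, the ANN INFO value defined by the VCF annotation standard (used by SnpEff) carries the allele as the first '|'-separated subfield of each comma-separated effect, which is what makes this association possible. A sketch of filtering effects down to one alternate allele (the helper name is mine):

```python
def effects_for_allele(ann_value, alternate_allele):
    """Keep only the comma-separated ANN effects whose first subfield
    (the allele) matches the given alternate allele."""
    return [
        effect
        for effect in ann_value.split(",")
        if effect.split("|")[0] == alternate_allele
    ]

ann = "A|missense_variant|MODERATE|GENE1,T|synonymous_variant|LOW|GENE1"
print(effects_for_allele(ann, "T"))  # ['T|synonymous_variant|LOW|GENE1']
```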
Minor code style and doc fixes:
Base enum
I was writing code that munged NucleotideContigFragments and realized that Contig should be changed to Reference, and NucleotideContigFragment should be renamed to Contig or ContigFragment.
Along with #9
Small nit: I feel like bigdatagenomics/bdg-formats is redundant. Would anyone be opposed to renaming to bigdatagenomics/formats? I will leave this open for a week and make the changes if no one is opposed.
The original position and original cigar flags are useful for describing the alignment of a read prior to realignment.
May contain incompatible API changes.
Source diff
apache/avro@release-1.7.7...release-1.8.0
This stems from the -onlyvariants flag we added to vcf2adam, which only writes out the variant information. IMO, annotations associated with a variant should be packaged with a variant, not a genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective, because at the moment this ends up storing minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?
Similar to 95f4b5b, we want to move the Contig record out of Variant and replace it with a string contigName field.
@arahuja implied there had been some discussion around this in the past.
AFAICT there is no good way right now to capture there being two non-reference alleles at one locus, having one Variant per Genotype.
In the immediate term I will work around this by emitting two lines / Variants in my VCFs, but I'm curious whether we should support the other way here. Thanks!
In the 0.1.1 implementation, the ADAMFeature contains a single field 'trackName', which the comments imply (for GFF/GTF files) contains both the 'feature' and 'source' values from the original record.
However, for parsing out GFF and GTF files into hierarchical or structured gene models, we're going to need to represent those two fields as separate values.