
imgthla's Introduction


IPD-IMGT/HLA Database

This directory contains data for the IPD-IMGT/HLA database. The IPD-IMGT/HLA database is a specialist sequence database for sequences of the human histocompatibility complex. This directory contains the IPD-IMGT/HLA flat files and documentation.

Cloning the Repository

From April 2024, Release 3.56.0

As of Release 3.56.0 (April 2024), all large files (>100MB) are provided as compressed files rather than through Git LFS, which was previously required. This includes hla.dat, xml/hla.xml and xml/hla_ambigs.xml. The change simplifies the cloning process and avoids the escalating and unpredictable costs of providing the files using Git LFS from a public repository. All compressed files use the ZIP format. This formatting change applies to all branches.
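
As an illustration only, a compressed file can be read without unpacking it to disk. The sketch below uses Python's standard zipfile module; the member name inside the archive (for example "hla.dat" inside hla.dat.zip) is an assumption and should be checked against the actual archive.

```python
import io
import zipfile


def iter_zipped_lines(zip_source, member_name):
    """Yield text lines from one member of a ZIP archive.

    zip_source may be a path or a binary file-like object.
    member_name is assumed, not confirmed, to match the file name
    stored inside the archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as raw:
            # errors="replace" keeps reading even if a byte is not valid UTF-8
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
            for line in text:
                yield line.rstrip("\n")
```

For example, `for line in iter_zipped_lines("hla.dat.zip", "hla.dat"): ...` would stream the flat file line by line, if the member is named that way.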

Up to April 2024

Previously the repository required the Git LFS tools (https://git-lfs.github.com) to handle files over 100MB in size. Whilst all hla.dat files are now provided as zipped files, any pulls from previous commits for Release 3.55.0 and earlier will still require Git LFS. Please use Git LFS when cloning those commits to ensure the larger files are downloaded correctly. If Git LFS is not used, large files will contain pointers to the Git LFS location rather than the data required.
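
A quick way to tell whether a checked-out file is real data or a Git LFS pointer is to look at its first bytes. The sketch below relies on the Git LFS convention that pointer files begin with a "version https://git-lfs.github.com/spec/v1" line; the function names are illustrative.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(first_bytes):
    """Return True if the given leading bytes match a Git LFS pointer file."""
    return first_bytes.startswith(LFS_POINTER_PREFIX)


def file_is_lfs_pointer(path):
    """Read just enough of the file at path to apply the pointer check."""
    with open(path, "rb") as handle:
        return looks_like_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))
```

If `file_is_lfs_pointer("hla.dat")` returns True for an older commit, the file was fetched without Git LFS and does not contain the sequence data.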


File Formats

The directory also contains the HLA sequences in a number of formats. The format types found in each of the following folders are explained briefly here:

Alignments folder

Files designated “X_prot.txt”, where X is a locus or gene, contain protein sequences. Please note that alleles that contain non-coding variations may be identical at the protein level.

Files designated “X_nuc.txt”, where X is a locus or gene, contain the nucleotide coding sequences (CDS). Please note that alleles that contain non-coding variations may be identical at the CDS level.

Files designated “X_gen.txt”, where X is a locus or gene, contain genomic DNA sequences. Please note that alleles that do not possess genomic sequences have no entry in the file, and where there is only a single genomic sequence at the locus, no file is produced.

For further information on the construction of these text files, please refer to the description available here: https://www.ebi.ac.uk/ipd/imgt/hla/alignment/help/. To provide consistency in formatting and to record versioning information, as of version 3.32.0 the header is designated by hash characters (#) at the start of the line.

A zip compressed archive of all the text-format alignment files is available from the top-level directory.

FASTA folder

All files in this folder are provided in the FASTA sequence format. Please note the FASTA format contains no alignment information.

Files designated “X_prot.fasta”, where X is a locus or gene, contain protein sequences. Please note that alleles that contain non-coding variations may be identical at the protein level.

Files designated “X_nuc.fasta”, where X is a locus or gene, contain the nucleotide coding sequences (CDS). Please note that alleles that contain non-coding variations may be identical at the CDS level.

Files designated “X_gen.fasta”, where X is a locus or gene, contain genomic DNA sequences. Please note for alleles that do not possess genomic sequences, there will be no entry in the file.
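
Because the FASTA files carry no alignment information, they can be read with a very small parser. The following is a minimal sketch of a generic FASTA reader; it makes no assumption about the layout of the header line beyond the leading ">" and simply uses the whole header as the key.

```python
def parse_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Typical use would be `with open("fasta/A_nuc.fasta") as handle: records = parse_fasta(handle)`, where the path is only an example.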

MSF Folder

All files in this folder are provided in the MSF sequence format.

Files designated “X_prot.msf”, where X is a locus or gene, contain protein sequences. Please note that alleles that contain non-coding variations may be identical at the protein level.

Files designated “X_nuc.msf”, where X is a locus or gene, contain the nucleotide coding sequences (CDS). Please note that alleles that contain non-coding variations may be identical at the CDS level.

Files designated “X_gen.msf”, where X is a locus or gene, contain genomic DNA sequences. Please note for alleles that do not possess genomic sequences, there will be no entry in the file.

OID Folder

Further information on the OID files can be found in the dedicated README file in the oid directory. As of version 3.32.0, all list files have been converted to csv format and contain a header. The header is denoted by hash characters (#) at the start of the line.
https://github.com/ANHIG/IMGTHLA/blob/Latest/oid/README.md

PIR Folder

All files in this folder are provided in the PIR sequence format.

Files designated “X_prot.pir”, where X is a locus or gene, contain protein sequences. Please note that alleles that contain non-coding variations may be identical at the protein level.

Files designated “X_nuc.pir”, where X is a locus or gene, contain the nucleotide coding sequences (CDS). Please note that alleles that contain non-coding variations may be identical at the CDS level.

Files designated “X_gen.pir”, where X is a locus or gene, contain genomic DNA sequences. Please note for alleles that do not possess genomic sequences, there will be no entry in the file.

TCE Folder

The files in this folder provide a listing of the T-Cell Epitope Group Assignments for DPB1 proteins. The assignments are taken from the algorithms used for the online tools at https://www.ebi.ac.uk/ipd/imgt/hla/matching/. The file format is as follows:

  • DPB1 allele, DPB1 protein, Version 1 Assignment, Version 2 Assignment, Comments

Alleles which have yet to be assigned a TCE group using either version are left blank.
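
The five-field layout above maps naturally onto a small record type. This sketch assumes the fields are comma-separated and that header lines start with a hash character, as for the other list files; neither is confirmed here for the TCE files, and the type and field names are illustrative.

```python
import csv
from typing import NamedTuple


class TceRecord(NamedTuple):
    allele: str
    protein: str
    v1_assignment: str  # empty string when no Version 1 group is assigned
    v2_assignment: str  # empty string when no Version 2 group is assigned
    comments: str


def parse_tce(lines):
    """Yield one TceRecord per data line, skipping blank and '#' header lines."""
    data = (line for line in lines if line.strip() and not line.startswith("#"))
    for row in csv.reader(data):
        # Pad short rows so missing trailing fields become empty strings
        padded = (row + [""] * 5)[:5]
        yield TceRecord(*(field.strip() for field in padded))
```

Unassigned alleles then show up as records whose `v1_assignment` or `v2_assignment` is an empty string.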

WMDA Folder

Further information on the WMDA files can be found in the dedicated README file in the wmda directory. https://github.com/ANHIG/IMGTHLA/blob/Latest/wmda/README.md

XML Folder

Please refer to the relevant XSD file for information regarding the XML files, which can be found here: https://github.com/ANHIG/IMGTHLA/blob/Latest/xml/hla_ambigs.xsd

Please note that in release 3.43.0 there are three XML files for the release: hla.xml, hla_ciwd.xml and hla_ambigs.xml. The hla_ciwd.xml file is an updated version of the hla.xml file and adds new information from the Common, intermediate and well‐documented HLA alleles in world populations: CIWD version 3.0.0 (https://doi.org/10.1111/tan.13811). New elements were required to incorporate this data, and the CWD version 2.0.0 data has been recoded to the same structure. From release 3.44.0 onwards, hla_ciwd.xml replaces hla.xml, and the older format is archived.

Please note that in release 3.53.0 a change was made to hla.xml. The releasestatus attribute of the releaseversions tag is now a binary flag containing either "Public" or "Deleted", to allow for easier filtering of deleted alleles. In addition, a releasecomments attribute has been added. It contains information about changes to the allele in this version of the database, which was previously stored in the releasestatus attribute.
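
The binary flag makes filtering straightforward. The sketch below assumes that each allele element carries a name attribute and has a releaseversions child, and it matches tags by local name so that any XML namespace prefix is ignored; the exact schema should be checked against the XSD.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def public_allele_names(xml_text):
    """Return names of alleles whose releaseversions status is 'Public'."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        if local_name(element.tag) != "allele":
            continue
        for child in element:
            if (local_name(child.tag) == "releaseversions"
                    and child.get("releasestatus") == "Public"):
                names.append(element.get("name"))
                break
    return names
```

For the full hla.xml, which is large, a streaming approach such as `ET.iterparse` would probably be a better fit than loading the whole document as a string.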

Please note in release 3.55.0, there are three XML files for the release, hla.xml, hla_new.xml and hla_ambigs.xml. The hla_new.xml is an updated version of the hla.xml and includes a new release tag containing version and date information for the release. In release 3.56.0 and onwards, hla_new.xml will replace hla.xml, and the older format archived.

Allele List Folder

Lists of alleles for different versions of the database are now included in this single folder due to the large number of files.

These filenames take the format Allelelist.XXXX.txt, where XXXX denotes a particular release. These files are in csv format, detailing for each allele the official name used in each release of the database.

Other Files

The top-level directory contains the following files;

  • Alignments_Rel_XXXX.zip - a compressed archive of the alignments folder, where the XXXX in the file denotes a particular release.
  • LICENSE.md - a file detailing the licensing of data included in the IPD-IMGT/HLA Database.
  • Nomenclature_2009.txt - a file detailing pre-2010 allele nomenclature
  • README.md - This README file
  • hla.dat.zip - An EMBL-ENA style format file containing data from the IPD-IMGT/HLA Database, see (https://github.com/ANHIG/IMGTHLA/blob/Latest/Manual.md) for further details.
  • hla_gen.fasta - a copy of the file in the fasta directory, includes the DNA sequence for all alleles, which have genomic sequences available.
  • hla_nuc.fasta - a copy of the file in the fasta directory, includes the DNA sequence for the CDS sequence of all alleles.
  • hla_prot.fasta - a copy of the file in the fasta directory, includes the amino acid sequence for all alleles.
  • md5checksum.txt - a file detailing md5 checksums for all files in the top-level directory
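
A downloaded file can be compared against the published checksum by computing its MD5 digest locally. This is a minimal sketch using Python's hashlib; the layout of md5checksum.txt is not described here, so looking up the expected value is left to the reader.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files are not loaded whole."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```

For instance, `md5_of_file("hla.dat.zip")` gives a value that can be compared by eye with the corresponding entry in md5checksum.txt.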

The top-level directory also contains the following lists. To provide consistency in formatting and to record versioning information, as of version 3.32.0 all list files have been converted to csv format and contain a header. The header is designated by hash characters (#) at the start of the line.

  • Allele_status.txt - a csv file detailing for each allele how many times it has been submitted, from how many cells, the unconfirmed/confirmed status of the allele, whether the CDS is fully sequenced and whether the allele is a cDNA or gDNA sequence.
  • Allelelist.txt - a csv file listing all alleles named at the time of the latest release.
  • Allelelist_history.txt - a csv file detailing for each allele the official name used in each release of the database.
  • Deleted_alleles.txt - a csv file detailing all deleted allele names, with reasons for the deletion. This list also includes details of any suffix changes.
  • release_version.txt - a plain text file which denotes the current release version.
  • sversion_history.txt - a csv file detailing for each allele the Sequence Version used in each release of the database.
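
Since the list files combine a hash-prefixed header with csv data, a reader needs to separate the two before parsing. The sketch below does only that; whether the first data row is a row of column names is not stated here, so rows are returned as plain lists.

```python
import csv


def read_hash_header_csv(lines):
    """Split a list file into its '#' header comments and its csv rows."""
    header_comments = []
    data_lines = []
    for line in lines:
        if line.startswith("#"):
            header_comments.append(line.lstrip("#").strip())
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.reader(data_lines))
    return header_comments, rows
```

An example call would be `with open("Allelelist.txt", newline="") as handle: header, rows = read_hash_header_csv(handle)`.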

Versioning

The database version number, IPD-IMGT/HLA 3.44.0 2021-04-20 b9d9ef7, can be interpreted as;

  • Database Name
  • Major release number (nomenclature version, quarterly release, sequence version)
  • Date
  • Latest commit for ANHIG/IMGTHLA/Latest branch

The major release number contains three key fields. The first is the nomenclature version, which is currently 3. The second is the quarterly release number, which is incremented by 1 with each release every January, April, July and October. The third represents the sequence version. A '0' is used for the primary quarterly release, and it is only incremented if a subsequent interim patch or update contains a change to a valid base (not a * or a .) in either the nucleotide (both cDNA and gDNA) or protein sequence. Changes to the positioning of indels or unsequenced bases are not included if the raw sequence remains unchanged.
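
The version string can be split into these parts with a regular expression. This is a sketch based only on the example string given above; the type and field names are illustrative, and it assumes the four parts are separated by single spaces.

```python
import re
from typing import NamedTuple


class DatabaseVersion(NamedTuple):
    name: str
    nomenclature: int
    quarterly_release: int
    sequence_version: int
    date: str
    commit: str


VERSION_PATTERN = re.compile(
    r"^(?P<name>\S+) (?P<nom>\d+)\.(?P<rel>\d+)\.(?P<seq>\d+) "
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<commit>[0-9a-f]+)$"
)


def parse_database_version(text):
    """Parse a string such as 'IPD-IMGT/HLA 3.44.0 2021-04-20 b9d9ef7'."""
    match = VERSION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return DatabaseVersion(
        match["name"], int(match["nom"]), int(match["rel"]),
        int(match["seq"]), match["date"], match["commit"],
    )
```

Storing the three numeric fields as integers means versions compare correctly, so that 3.9.0 sorts before 3.44.0.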


CONTACTS

For information on the IPD-IMGT/HLA Database please see the website at: http://www.ebi.ac.uk/ipd/imgt/hla

Additional information on sequence file formats is available from: http://www.ebi.ac.uk/ipd/imgt/hla/download/

For any other information please contact hla [at] alleles [dot] org.


COPYRIGHT NOTICE

We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases, which includes the sequence alignments. This means that you are free to copy, distribute, display and make commercial use of the databases in all legislations, provided you give us credit by citing the following;

Barker DJ, Maccari G, Georgiou X, Cooper MA, Flicek P, Robinson J, Marsh SGE: The IPD-IMGT/HLA Database. Nucleic Acids Research (2023), 51:D1053-60

Robinson J, Malik A, Parham P, Bodmer JG, Marsh SGE: IMGT/HLA - a sequence database for the human major histocompatibility complex Tissue Antigens (2000), 55:280-287

We are strongly opposed to the mirroring of the data contained on our sites, both hla.alleles.org and the IPD-IMGT/HLA Database, and would ask that rather than mirror the information, appropriate links are provided where applicable.

If you intend to distribute a modified version of our data, you must ask us for permission first, please contact hla [at] alleles [dot] org for further details of how modified data can be reproduced.


FUNDING

The development of the IPD-IMGT/HLA Database was funded by an EU BIOTECH grant. The work of maintaining and updating the database has been supported in the past by the Imperial Cancer Research Fund, the National Institutes of Health, the National Marrow Donor Program (NMDP) and more recently by the Anthony Nolan Trust. The continued maintenance and any further development of the database relies on alternate sources of financial support, which are actively being sought. The Sequence.org initiative at the NMDP has solicited funds from institutions and companies who produce HLA typing reagents, typing systems, and instrumentation or that otherwise utilise these databases in critical components of their business. To learn more about how your business can support the IPD-IMGT/HLA Database, please contact: Anna Bedard, (Email: abedard [at] nmdp [dot] org), Be The Match Foundation.

If you intend to use any of the data found on our sites for commercial use, we would ask you to consider funding the database and the work we do. Without continued funding the database cannot be maintained.


DISCLAIMER

Where discrepancies have arisen between reported sequences and those stored in the database, the original authors have been contacted where possible, and necessary amendments to published sequences have been incorporated. Future sequencing may identify errors and the WHO Nomenclature Committee would welcome any evidence that helps to maintain the accuracy of the database. We therefore make no warranties regarding the correctness of the data, and disclaim liability for damages resulting from its use. We cannot provide unrestricted permission regarding the use of the data, as some data may be covered by patents or other rights. Any medical or genetic information is provided for research, educational and informational purposes only. It is not in any way intended to be used as a substitute for professional medical advice, diagnosis, treatment or care.

We reserve the right to use information about visitors (IP addresses), date/time visited, page visited, referring website, etc. for site usage statistics and to improve our services.

imgthla's People

Contributors

dominicbarkeran, ipd-deploy, jrob119, michaelcooperan, xeniageorgiouan

imgthla's Issues

DQA1*05:01:04 is not in P or G group in hla.xml.

Good morning again,

We noticed an inconsistency between the files. Will you correct whichever needs to be corrected, please?

<allele id="HLA18836" name="HLA-DQA1*05:01:04" dateassigned="2018-04-30">
<hla_g_group status="None"/>
<hla_p_group status="None"/>

hla_nom_g.txt
DQA1*;05:01:01:01/05:01:01:02/05:01:01:03/05:01:04/05:03:01:01/05:03:01:02/05:05:01:01/05:05:01:02/05:05:01:03/05:05:01:04/05:05:01:05/05:05:01:06/05:05:01:07/05:05:01:08/05:05:01:09/05:05:01:10/05:06:01:01/05:06:01:02/05:07/05:08/05:09/05:11;05:01:01G

DQA1*;05:01:01:01/05:01:01:02/05:01:01:03/05:01:02/05:01:04/05:03:01:01/05:03:01:02/05:05:01:01/05:05:01:02/05:05:01:03/05:05:01:04/05:05:01:05/05:05:01:06/05:05:01:07/05:05:01:08/05:05:01:09/05:05:01:10/05:06:01:01/05:06:01:02/05:07/05:08/05:09/05:11;05:01P

Thank you!
May the force be with you,
Marney

Assembly version

Hello all,

I am working on a neoantigen pipeline and using OptiType for HLA detection. OptiType has an older FASTA version (2013), in which some of the same alleles differ.
What is the assembly version of the most recent FASTA files here (2018)? I am looking at hla_nuc.fasta and hla_prot.fasta. GRCH39/HG39?
I was unable to find the info in readme/version report/change log, nor is it at
https://www.ebi.ac.uk/ipd/imgt/hla/ .
I think it would be useful to have it somewhere clearly visible.

Thank you

Invalid character � in dat file for 3.21.0, 3.22.0, 3.23.0 and 3.24.0

The following line is found in the hla.dat file for 3.21.0, 3.22.0, 3.23.0 and 3.24.0.

RA   Balas A, S�nchez-Gordo F, Garcia-S�nchez F, Gomez-Zumaquero JM, Vicario JL;

This prevents these files from being properly parsed.

Here are the specific alleles that have this issue:

Release = 3210, line # = 121045, Allele = HLA-A*11:210N
Release = 3210, line # = 177260, Allele = HLA-A*26:107N
Release = 3220, line # = 125142, Allele = HLA-A*11:210N
Release = 3220, line # = 183644, Allele = HLA-A*26:107N
Release = 3230, line # = 127802, Allele = HLA-A*11:210N
Release = 3230, line # = 187727, Allele = HLA-A*26:107N
Release = 3240, line # = 129967, Allele = HLA-A*11:210N
Release = 3240, line # = 191426, Allele = HLA-A*26:107N
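
A scan like the one that produced this list can be sketched as follows. It assumes the parser expects UTF-8 and that the offending bytes come from a different single-byte encoding; the actual cause in these releases is not confirmed here.

```python
def find_undecodable_lines(raw_lines, encoding="utf-8"):
    """Return 1-based numbers of the byte lines that fail to decode."""
    bad_line_numbers = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad_line_numbers.append(number)
    return bad_line_numbers
```

Opening the file in binary mode, for example `with open("hla.dat", "rb") as handle: print(find_undecodable_lines(handle))`, lists the line numbers to inspect.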

What does '|' mean in the multiple sequence alignment?

In the A_gen.txt file in the 'alignments' folder, several lines contain a " | " symbol, for example:
A*01:01:01:01 G | ATGGCCGTC ATGGCGCCCC GAACCCTCCT CCTGCTACTC TCGGGGGCCC TGGCCCTGAC CCAGACCTGG GCGG | GTGAGT GCGGGGTCGG GAGGGAAACC
A*01:01:01:02N - | --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- | ------ ---------- ----------
A*01:01:01:03 * | --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- | ------ ---------- ----------

May I ask what these " | " symbols mean?

Many thanks,

Mengyao

Errors in assigning intron numbers to DRB4*03:01N intron sequences?

In the hla.xml for release 3.33.0, the names of the DRB4*03:01N intron features do not match the feature order numbers for other DRB intron features.

Here are the intron elements for DRB4*03:01N:

     <feature id="914.5" order="5" featuretype="Intron" name="Intron 1">
        <SequenceCoordinates start="1" end="2684" />
     </feature>
      <feature id="914.7" order="7" featuretype="Intron" name="Intron 2">
        <SequenceCoordinates start="2967" end="3670" />
     </feature>
      <feature id="914.9" order="9" featuretype="Intron" name="Intron 3">
        <SequenceCoordinates start="3782" end="4255" />
     </feature>
      <feature id="914.11" order="11" featuretype="Intron" name="Intron 4">
        <SequenceCoordinates start="4280" end="4581" />
    </feature>

Here are the corresponding intron elements for other DRB alleles (e.g., DRB4*01:03:01:03):

      <feature id="6603.3" order="3" featuretype="Intron" name="Intron 1">
        <SequenceCoordinates start="414" end="9976" />
     </feature>
      <feature id="6603.5" order="5" featuretype="Intron" name="Intron 2">
        <SequenceCoordinates start="10247" end="12983" />
     </feature>
      <feature id="6603.7" order="7" featuretype="Intron" name="Intron 3">
        <SequenceCoordinates start="13266" end="13969" />
     </feature>
      <feature id="6603.9" order="9" featuretype="Intron" name="Intron 4">
        <SequenceCoordinates start="14081" end="14554" />
     </feature>
      <feature id="6603.11" order="11" featuretype="Intron" name="Intron 5">
        <SequenceCoordinates start="14579" end="14880" />
     </feature>

Shouldn't all DRB Intron 1 sequences be intron order 3, and all intron sequences of intron order 5 be intron 2?

C*17:01:01:02

Hi James,

During the processing of a bunch of new alleles, we ran into an issue with C*17:01:01:02
The hla.dat file we pulled from the git repository has Exon 5 marked as "pseudo", while on the IPD-IMGT/HLA website it is not marked as such. A cursory look at the history of the sequence does not indicate any recent changes. We were wondering if this was intentional and something we should take into account in our workflow?

Cheers,
Vineeth

HLA.Dat user manual not matching hla.dat file

The user manual and the HLA.Dat file appear to be out of sync. The user manual states that there will be three DT lines per entry. When I look at the 3.30.0 HLA.Dat file, there are only two per entry.

DRB5*01:01:01 not listed under alignments directory.

I find alignment flat file format useful as it already has intron exon boundaries embedded.

DRB5*01:01:01 allele is not listed under "alignments" directory whereas it is listed under "msf" directory.

Is this because there is only one full-length allele of DRB5? But in the README file, gen.txt description says:

"Please note for alleles that do not possess genomic sequences, there will be no entry in the file"

So for DRB5 even with one allele, there should be DRB5_gen.txt file containing the DRB5*01:01:01 allele.

Under msf directory, it is listed under DRB5_gen.msf but there is no corresponding alignment file DRB5_gen.txt under alignments directory.

Problems with the 3.34.0 nuc.txt and prot.txt alignments for HLA-B and -C

In the 3.34.0 HLA-B protein alignment, the HLA-B*13:120Q peptide sequence is 11 amino acids longer than the reference, but these positions are not accounted for in the reference with . symbols. As a result, even though the last sequence block for all other alleles only include 69 amino-acid positions, the last 11 amino acids of the HLA-B*13:120Q sequence appear in a separate block, as below.
[Screenshot: HLA-B protein alignment, 2018-10-17]

This also occurs for the B_nuc.txt alignment, as below.
[Screenshot: B_nuc.txt alignment, 2018-10-17]

The same thing is also true for the C*04:09N allele in the C_prot.txt and C_nuc.txt alignments.

It seems like these extra peptide positions should be included in the reference sequences as sequence indels.

Comma in Description field of Deleted_alleles.txt file

Line 106 of the Deleted_alleles.txt file (HLA00615,DQA1*05013,To take account of coding polymorphism in the leader peptide, sequence renamed DQA1*05:05 (April 1998)) includes a comma in the Description field.

This results in an extra column being added for this line when parsing the file as a .csv document.

Could this comma be removed? It doesn't change the meaning of the entry.
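
Until the file is changed, a parser can work around the problem by limiting the number of splits. This sketch assumes each data line has exactly three fields (allele ID, allele name, description), as the quoted line suggests; the function name is illustrative.

```python
def split_deleted_allele_line(line, field_count=3):
    """Split on commas, but keep any extra commas inside the last field."""
    return [field.strip() for field in line.rstrip("\n").split(",", field_count - 1)]
```

Applied to the line quoted above, this returns three fields, with the full description (including its comma) kept intact in the last one.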

typos in README.md

It seems the COPYRIGHT NOTICE section of the README.md file here contains 1-2 typos.

The section indicates 2015 as the publication date for the Nucleic Acids Research article, but Google Scholar indicates 2014. I think 2015 is a typo.

Another typo: the word "stongly".

C*02:10:01GG

Hi all,

This extra G is causing us some issues.

allele id="HLA18583" name="HLA-C*02:02:37" dateassigned="2018-03-29"
hla_g_group status="C*02:10:01GG"
hla_p_group status="C*02:02P"

Thank you!
Marney

3.29.0 - Expected sequence length 687, found 549 (HLA00845.2)

The hla.dat file for 3.29.0 has the incorrect sequence length for HLA00845.2. The sequence tag should have 549 instead of 687.

SQ Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;

ID   HLA00845; SV 2; standard; DNA; HUM; 549 BP.
XX
AC   HLA00845;
XX
SV   HLA00845.2
XX
DT   06-AUG-1993 (Rel. 1.0.0, Created, Version 1)
DT   16-AUG-2017 (Rel. 3.29.0.1, Last Updated, Version 2)
XX
DE   HLA-DRB1*14:13, Human MHC Class II sequence (partial)
XX
KW   Human MHC; HLA; Class II; HLA-DRB1; Allele; HLA-DRB1*14:13;
XX
OS   Homo Sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates;
OC   Catarrhini; Hominidae; Homo.
XX
CC   --------------------------------------------------------------------------
CC   IPD-IMGT/HLA Release Version 3.29.0.1
CC   --------------------------------------------------------------------------
CC   Copyrighted by the IPD-IMGT/HLA Database, Distributed under the Creative
CC   Commons Attribution-NoDerivs License, see;
CC   http://www.ebi.ac.uk/ipd/imgt/hla/licence.html for further details.
CC   --------------------------------------------------------------------------
XX
RN   [1]
RP   1-549
RX   PUBMED; 8168862.
RA   Pando M, Theiler G, Melano R, Petzl-Erler ML, Satz ML;
RT   "A new HLA-DR6 allele (DRB1*1413) found in a tribe of Brazilian Indians";
RL   Immunogenetics 39:377-377(1994).
XX
CC   --------------------------------------------------------------------------
CC   The sequence below is the official allele sequence as approved by the
CC   WHO Nomenclature Committee for Factors of the HLA System.
CC   Any cross references may differ from the sequence shown below.
CC   --------------------------------------------------------------------------
XX
DR   EMBL; AM110001; AM110001.0.
DR   EMBL; L21755; L21755.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..549
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="American Indian"
FT                   /cell_line="GRC-138"
FT   CDS             <1..549>
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DRB1"
FT                   /allele="HLA-DRB1*14:13"
FT                   /product="MHC Class II HLA-DRB1*14:13 sequence"
FT                   /translation="RFLEYSTSECHFFNGTERVRFLERYFHNQEENVRFDSDVGEYRAV
FT                   TELGRPSAEYWNSQKDLLEQRRAAVDTYCRHNYGVGESFTVQRRVHPKVTVYPSKTQPL
FT                   QHHNLLVCSVSGFYPGSIEVRWFRNGQEEKTGVVSTGLIHNGDWTFQTLVMLETVPRSG
FT                   EVYTCQVEHPSVTSPLTVE"
FT   exon            1..270
FT                   /number="2"
FT   exon            271..549
FT                   /number="3"
SQ   Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;
     cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc        60
     ggttcctgga gagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg       120
     gggagtaccg ggcggtgacg gagctggggc ggcctagcgc cgagtactgg aacagccaga       180
     aggacctcct ggagcagagg cgggccgcgg tggacaccta ctgcagacac aactacgggg       240
     ttggtgagag cttcacagtg cagcggcgag tccatcctaa ggtgactgtg tatccttcaa       300
     agacccagcc cctgcagcac cacaacctcc tggtctgttc tgtgagtggt ttctatccag       360
     gcagcattga agtcaggtgg ttccggaatg gccaggaaga gaagactggg gtggtgtcca       420
     caggcctgat ccacaatgga gactggacct tccagaccct ggtgatgctg gaaacagttc       480
     ctcggagtgg agaggtttac acctgccaag tggagcaccc aagcgtgaca agccctctca       540
     cagtggaat                                                               549
//
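
A consistency check for this kind of mismatch could look like the sketch below. It compares the length on the ID line, the length on the SQ line, and the number of bases actually present, using only the line layouts visible in the entry above; it is not taken from any official parser.

```python
import re


def check_entry_lengths(entry_lines):
    """Return (ID length, SQ length, counted bases) for one flat-file entry."""
    id_length = None
    sq_length = None
    counted_bases = 0
    in_sequence = False
    for line in entry_lines:
        if line.startswith("ID "):
            match = re.search(r"(\d+) BP\.", line)
            id_length = int(match.group(1)) if match else None
        elif line.startswith("SQ "):
            match = re.search(r"Sequence (\d+) BP;", line)
            sq_length = int(match.group(1)) if match else None
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            # Sequence lines hold letters plus a trailing position counter
            counted_bases += sum(ch.isalpha() for ch in line)
    return id_length, sq_length, counted_bases
```

An entry is internally consistent when all three returned values are equal. For the entry above, the ID line gives 549 and the SQ line gives 687, so the check would flag it.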

no newline following XML declaration in hla_ambigs.xml

On line 1 of hla_ambigs.xml, the XML declaration is not followed by a newline character, so the tns:ambiguityData start-tag appears on the same line.

A newline character is not required by the XML spec, but could be a helpful aesthetic enhancement.

Identical sequences with different feature annotations - 174 alleles

Feature annotations should not differ between database releases if the sequence is the same. If an annotation is changed in a later database release, then it should also be updated in all previous database releases that contain that sequence. The feature annotations for 174 alleles change between database releases even though the sequences do not. These differences mainly impact intron-4, exon-5, and intron-5 for HLA-DQB1. Below is a table of all the observed instances of this issue.

DB Allele # Features Removed # Features Added # Features Differ Features Removed Features Added Features that Differ
3160 HLA-B*15:302N 0 0 3 exon_5 exon_2 exon_3
3160 HLA-C*08:89N 0 0 1 exon_2
3170 HLA-B*15:302N 0 0 1 exon_5
3180 HLA-B*39:97N 0 0 1 exon_3
3180 HLA-C*08:89N 0 0 1 exon_2
3190 HLA-C*08:89N 0 0 1 exon_2
3220 HLA-B*07:251N 0 0 1 exon_3
3280 HLA-B*15:149N 0 0 1 exon_4
3280 HLA-B*15:246N 0 0 1 exon_4
3280 HLA-C*08:89N 0 0 1 exon_2
3290 HLA-B*15:149N 0 0 1 exon_4
3290 HLA-B*15:246N 0 0 1 exon_4
3300 HLA-A*24:155N 1 0 0 exon_5
3300 HLA-A*26:01:01:03N 0 0 2 intron_4 exon_4
3300 HLA-B*07:44N 0 0 2 intron_4 exon_4
3300 HLA-B*15:01:01:02N 0 1 1 exon_1 intron_1
3300 HLA-B*15:149N 0 0 1 exon_4
3300 HLA-B*15:246N 0 0 2 exon_5 exon_4
3300 HLA-B*44:02:01:02S 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:02:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:02:04 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:53Q 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:62 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:79 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:80 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:81 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:82 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:83 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:84 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*02:96N 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:04 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:05 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:06 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:07 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:08 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:09 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:10 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:11 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:12 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:14 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:15 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:16 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:17 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:01:18 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:17 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:22 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:35 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:36 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:01:37 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:01:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:09 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:12 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:21 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:22 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:23 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:02:24 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:03:02:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:03:02:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:03:02:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:03:04 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:04:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:04:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:05:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:150 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:191 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:195 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:196 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:197Q 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:19:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:211 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:239 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:243 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:245 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:246 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:247 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:248 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:249 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:250 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:251 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:252 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:253 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:254 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*03:263 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:01:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:02:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:02:11 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:02:12 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:11 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*04:32 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:01:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:01:04 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:01:05 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:23 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:01:24 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:02:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:02:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:02:01:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:02:07 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:02:11 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:102 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:103 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:104 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:106 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:108 0 1 1 exon_5 exon_6
3300 HLA-DQB1*05:133 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:134 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:135 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:136 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:137 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:148 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:149 1 1 0 exon_6 exon_5
3300 HLA-DQB1*05:31 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:43:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:52 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:57 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*05:96 0 1 1 exon_5 exon_6
3300 HLA-DQB1*05:97 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:01:08 0 1 1 exon_5 exon_6
3300 HLA-DQB1*06:01:10 0 1 1 exon_5 exon_6
3300 HLA-DQB1*06:01:11 0 1 1 exon_5 exon_6
3300 HLA-DQB1*06:02:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:01:03 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:17 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:22 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:23 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:25 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:26 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:27 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:02:28 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:12 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:14 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:20 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:21 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:23 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:24 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:25 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:03:26 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:04:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:09:01:01 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:09:01:02 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:103 0 1 1 exon_5 exon_6
3300 HLA-DQB1*06:111 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:117 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:125 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:187 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:188 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:217 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:218 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:219 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:221 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:222 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:223 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:224 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:225 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:226 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:227 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:228 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:37 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:44 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:84 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:90 0 2 1 exon_5 intron_5 intron_4
3300 HLA-DQB1*06:99:02 0 1 1 exon_5 exon_6
3320 HLA-C*07:02:01:17N 0 0 2 intron_3 exon_3

DPA1_gen.fasta renamed to DPA_gen.fasta

But the alignment file was not renamed? The pir and msf files were also renamed.
Are sequences for the DPA2 pseudogene forthcoming?
This isn't a technical issue, just a consistency issue.

File format

Is there documentation for the txt alignment format (for example: A_gen)?

Thank you for hosting this on github!

gGroup and gGroupAllele names in hla_ambigs.xml don't use full gene names

The gGroup and gGroupAllele names in hla_ambigs.xml don't use the full gene names. For example, in place of "HLA-A", they use "A". This makes them inconsistent with the allele names in hla.xml.

Below are file excerpts to further illustrate the issue.

From hla.xml:
<allele id="HLA00001" name="HLA-A*01:01:01:01" dateassigned="1989-08-01">

From hla_ambigs.xml:
<tns:gGroup name="A*01:01:01G" gid="HGG00001">
<tns:gGroupAllele name="A*01:01:01:01" alleleid="HLA00001" />

Please consider revising the gGroup and gGroupAllele names in hla_ambigs.xml to use the full gene names.
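
Until the names are aligned, a reader-side workaround is to add the prefix when comparing names across the two files. This is only a minimal sketch: the helper name `with_gene_prefix` is hypothetical, and it assumes that every short name in hla_ambigs.xml lacks only the leading "HLA-".

```python
def with_gene_prefix(name, prefix="HLA-"):
    """Return the allele or G group name with the gene prefix, adding it only if absent."""
    return name if name.startswith(prefix) else prefix + name


# Names taken from the excerpts above.
print(with_gene_prefix("A*01:01:01G"))
print(with_gene_prefix("HLA-A*01:01:01:01"))
```

With this, a gGroupAllele name such as "A*01:01:01:01" can be matched against the allele name in hla.xml without editing either file.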

Difference between fasta and alignments for A*01:11N

One base pair before point mutation 968G>T, the sequences seem to diverge. The mutation (T) is highlighted:

From alignment file (that I think is correct):
GGAGAACGGTAA...
vs the fasta section:
GGAGAACGACCC...

Incorrectly using join for DRB5 sequences in 3.20.0 and 3.21.0

In the hla.dat files for 3.20.0 and 3.21.0 a join is being used for the CDS sequence when it shouldn't be which causes parsers to fail. Here's an example:

DR   EMBL; AJ427352; AJ427352.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..270
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="Barpay"
FT   CDS             join(1..270)
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DRB5"
FT                   /allele="HLA-DRB5*01:12"
FT                   /product="MHC Class II HLA-DRB5*01:12 sequence"
FT                   /translation="RFLQQDKYECHFFNGTERVRFLHRDIYNQEEDLRFDSDVGEYRAV
FT                   TELGRPDAESWNSQKDFLERRRAEVDTVCRHNYGVGESFTVQRR"

Should be FT CDS 1..270 or FT CDS <1..270> instead.

Here's a list of all the alleles that have this:

HLA01638.1 HLA-DRB5*01:11
HLA01634.1 HLA-DRB5*01:12
HLA01871.1 HLA-DRB5*01:13
HLA00927.1 HLA-DRB5*02:03
HLA00928.1 HLA-DRB5*02:04
HLA01280.1 HLA-DRB5*02:05
HLA00916.1 HLA-DRB5*01:01:02
HLA00918.2 HLA-DRB5*01:03
HLA00920.1 HLA-DRB5*01:05
HLA00921.1 HLA-DRB5*01:06
HLA00922.1 HLA-DRB5*01:07
HLA00924.1 HLA-DRB5*01:09
HLA01012.3 HLA-DRB5*01:10N
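
For anyone who has to parse the affected releases as they are, a pre-processing step along these lines may help. It is a sketch of a reader-side workaround, not a fix to the data itself: the function name `fix_single_part_join` is hypothetical, and it only touches FT CDS lines whose join contains a single range.

```python
import re

# A CDS location written as join(...) with no comma inside, i.e. a single range.
SINGLE_JOIN = re.compile(r"^(FT\s+CDS\s+)join\(([^,()]+)\)\s*$")


def fix_single_part_join(line):
    """Rewrite 'join(X)' as 'X' when the join holds exactly one range."""
    return SINGLE_JOIN.sub(r"\1\2", line)


print(fix_single_part_join("FT   CDS             join(1..270)"))
```

Lines with a genuine multi-part join, and all non-CDS lines, are returned unchanged.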

Error in Sequence tag

Hi James,

In the new release 3.33.0 of hla.dat, some DRB1 sequences are invalid. See for example DRB1*13:09: the substring "y/alignment_libraries/libs/drb1345genomiclib:drb1_13:09" should not be there, I think.

FH Key Location/Qualifiers
FH
FT source 1..325
FT /organism="Homo sapiens"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:9606"
FT /ethnic="Hispanic"
FT /cell_line="MJD"
FT /cell_line="NT01111"
FT CDS <1..270
FT /codon_start=1
FT /partial
FT /gene="HLA-DRB1"
FT /allele="HLA-DRB1*13:09"
FT /product="MHC Class II HLA-DRB1*13:09 sequence"
FT /translation="RFLEYSTSECHFFNGTERVRFLDRYFHNQEENVRFDSDVGEFRAV
FT TELGRPDAEYWNSQKDILEQARAAVDTYCRHNYGVVESFTVQRR"
FT exon 1..270
FT /number="2"
FT UTR 271..328
SQ Sequence 325 BP; 58 A; 67 C; 100 G; 51 T; 49 other;
cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc 60
ggttcctgga cagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg 120
gggagttccg ggcggtgacg gagctggggc ggcctgatgc cgagtactgg aacagccaga 180
aggacatcct ggagcaggcg cgggccgcgg tggacaccta ctgcagacac aactacgggg 240
ttgtggagag cttcacagtg cagcggcgag y/alignmen t_librarie s/libs/drb 300
1345genomi clib:drb1_ 13:09 325
//

Cheers,
Markus
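
A check along the following lines could flag this kind of corruption before a record reaches a parser. It is a sketch under assumptions: the names are hypothetical, and it treats the last token of each sequence line as the running base count when that token is numeric.

```python
IUPAC_NUCLEOTIDES = set("acgturykmswbdhvn")


def invalid_residues(sequence_line):
    """Return the characters in one sequence data line that are not IUPAC nucleotide codes."""
    tokens = sequence_line.split()
    # The last token on each line is the running base count; drop it if numeric.
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]
    residues = "".join(tokens).lower()
    return sorted({ch for ch in residues if ch not in IUPAC_NUCLEOTIDES})


# The corrupted line from the record above.
print(invalid_residues("ttgtggagag cttcacagtg cagcggcgag y/alignmen t_librarie s/libs/drb 300"))
```

An empty list means the line contains only nucleotide codes; anything else points at stray text such as the file path seen here.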

nucleotide CDS alignment (MSA) file of release 3.9.0

I want to download the multiple sequence alignment files of release 3.9.0 because we want to finish the remaining portion of an old project. However, I am unable to find those files in this repository. Specifically, I need the file DQA_nuc.txt or DQA1_nuc.txt for release 3.9.0, as I already have the files for the other genes I am interested in.

Let me know if there is any way I can find that file.

Thank you

Some alleles are missing from hla.xml

Hi

During my recent investigation, I found that some alleles that are in hla.dat are missing from hla.xml. For example, HLA-H*02:06. There are ~300 alleles in this situation.

Is this intended?

Thank you,
Marcell
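
A comparison like this can be reproduced with a small script. The sketch below assumes that hla.dat carries allele names in /allele="..." qualifiers and that hla.xml carries them in the name attribute of allele elements, as in the excerpts quoted elsewhere on this page; the function name is hypothetical, and both files are passed in as text for simplicity even though the real files are large.

```python
import re

DAT_ALLELE = re.compile(r'/allele="([^"]+)"')
XML_ALLELE = re.compile(r'<allele\b[^>]*\bname="([^"]+)"')


def missing_from_xml(dat_text, xml_text):
    """Return allele names found in the hla.dat text but not in the hla.xml text."""
    in_dat = set(DAT_ALLELE.findall(dat_text))
    in_xml = set(XML_ALLELE.findall(xml_text))
    return sorted(in_dat - in_xml)


dat_text = 'FT   /allele="HLA-A*01:01:01:01"\nFT   /allele="HLA-H*02:06"\n'
xml_text = '<allele id="HLA00001" name="HLA-A*01:01:01:01" dateassigned="1989-08-01">'
print(missing_from_xml(dat_text, xml_text))
```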

incorrect/missing alignmentreference elements in hla.xml

For the DPB1 alleles, the alignmentreference element has an empty alleleid attribute, and its allelename attribute contains "DPB1*01:01:01", but the allele element in the file has the extended name "DPB1*01:01:01:01", so the reference cannot be resolved.

DRBx alleles also have an empty alleleid attribute on alignmentreference, but in these cases the DRB1*01:01:01 allele is named consistently.

john

HLA-DMB*01:02 - Invalid join

HLA00490 - 3.30.0

The join(<1..284) is invalid because a join should have at least two parts.

DR   EMBL; Z24750; Z24750.1.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..284
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="YAR"
FT   CDS             join(<1..284)
FT                   /codon_start=1
FT                   /partial
FT                   /gene="HLA-DMB"
FT                   /allele="HLA-DMB*01:02"
FT                   /product="MHC Class II HLA-DMB*01:02 sequence"
FT                   /translation="PPSVQVAKTTPFNTREPVMLACYVWGFYPAEVTITWRKNGKLVMP
FT                   HSSEHKTAQPNGDWTYQTLSHLALTPSYGDTYTCVVEHIGAPEPILRDW"
FT   exon            1..284
FT                   /number="3"
FT                   /partial
SQ   Sequence 284 BP; 67 A; 83 C; 74 G; 60 T; 0 other;
     ggccaccatc tgtgcaagta gccaaaacca ctccttttaa cacgagggag cctgtgatgc        60
     tggcctgcta tgtgtggggc ttctatccag cagaagtgac tatcacgtgg aggaagaacg       120
     ggaagcttgt catgcctcac agcagtgagc acaagactgc ccagcccaat ggagactgga       180
     cataccagac cctctcccat ttagccttaa ccccctctta cggggacact tacacctgtg       240
     tggtagagca cattggggct cctgagccca tccttcggga ctgg                        284
//

Having this error in the hla.dat file causes bio parsers to fail.

Missing archive zip file

Hello,

In the README, you note that a "zip compressed archive of all the text-format alignment files is available from the top-level directory". However, I am unable to find such a zip file. The only zip file appears to be the Alignment_Rel_3350.zip that contains the alignments from the current release.

In particular, I would like to find archive versions of the alignment files and the archive versions of the fasta files.

Can you point me in the right direction?

Thanks,
Rachel

Sequence length error found for DRB1*14:13 (HLA00845)

For DRB1*14:13 (HLA00845), we noticed that the exon regions do not match the overall sequence length. As you can see from this snippet, the declared sequence length is 687 but the sequence actually listed is only 549 bases long.
FT exon 1..270
FT /number="2"
FT exon 271..549
FT /number="3"
FT exon 553..663
FT /number="4"
FT exon 664..687
FT /number="5"
SQ Sequence 687 BP; 152 A; 173 C; 223 G; 139 T; 0 other;
cacgtttctt ggagtactct acgtctgagt gtcatttctt caatgggacg gagcgggtgc 60
ggttcctgga gagatacttc cataaccagg aggagaacgt gcgcttcgac agcgacgtgg 120
gggagtaccg ggcggtgacg gagctggggc ggcctagcgc cgagtactgg aacagccaga 180
aggacctcct ggagcagagg cgggccgcgg tggacaccta ctgcagacac aactacgggg 240
ttggtgagag cttcacagtg cagcggcgag tccatcctaa ggtgactgtg tatccttcaa 300
agacccagcc cctgcagcac cacaacctcc tggtctgttc tgtgagtggt ttctatccag 360
gcagcattga agtcaggtgg ttccggaatg gccaggaaga gaagactggg gtggtgtcca 420
caggcctgat ccacaatgga gactggacct tccagaccct ggtgatgctg gaaacagttc 480
ctcggagtgg agaggtttac acctgccaag tggagcaccc aagcgtgaca agccctctca 540
cagtggaat 549
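
This kind of mismatch can be detected mechanically by comparing the length declared on the SQ line with the number of residues that follow it. The sketch below assumes the record is available as a list of lines and that the sequence block ends at a "//" line or at the end of the list; the function name is hypothetical.

```python
import re


def declared_and_actual_length(record_lines):
    """Return (length declared on the SQ line, number of residues actually listed)."""
    declared = None
    actual = 0
    in_sequence = False
    for line in record_lines:
        if line.startswith("SQ"):
            match = re.search(r"Sequence\s+(\d+)\s+BP", line)
            declared = int(match.group(1)) if match else None
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Count letters only, which skips spaces and the running count at line end.
            actual += sum(ch.isalpha() for ch in line)
    return declared, actual


example = [
    "SQ Sequence 30 BP; 8 A; 7 C; 8 G; 7 T; 0 other;",
    "acgtacgtac gtacgtacgt 20",
    "//",
]
print(declared_and_actual_length(example))
```

A record is suspect whenever the two numbers differ, as they do for the entry quoted above.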

Current Release and Date Stamp in hla_ambigs.xml

The current release and date stamps in hla_ambigs.xml for the current release (3.30.0) are empty.

<?xml version="1.0" encoding="UTF-8"?>
	<tns:ambiguityData xmlns:tns="http://www.example.org/ambig-aw"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.example.org/ambig-aw ambig-aw.xsd ">
	<tns:releaseVersion currentRelease="" date="" />
	<tns:geneList>
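
A consumer of the file can guard against an empty stamp with a check like the one below. It is a sketch only: the function names are hypothetical, the namespace URI is copied from the excerpt above, and the XML is passed in as text for brevity.

```python
import xml.etree.ElementTree as ET

TNS = "{http://www.example.org/ambig-aw}"  # namespace URI as shown in the excerpt


def release_stamp(xml_text):
    """Return (currentRelease, date) from the releaseVersion element, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(TNS + "releaseVersion")
    if node is None:
        return None
    return node.get("currentRelease", ""), node.get("date", "")


def stamp_is_missing(xml_text):
    """True when the element is absent or either attribute is empty."""
    stamp = release_stamp(xml_text)
    return stamp is None or not all(value.strip() for value in stamp)


doc = (
    '<tns:ambiguityData xmlns:tns="http://www.example.org/ambig-aw">'
    '<tns:releaseVersion currentRelease="" date="" />'
    "<tns:geneList />"
    "</tns:ambiguityData>"
)
print(stamp_is_missing(doc))
```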

Genomic alignment of DPA1*04:01 and DPA1*04:02 in the DPA1_gen.txt file

The alignment in the DPA1_gen.txt file for DPA1 *04:01 and *04:02 makes it appear that these alleles differ significantly in their sequence for positions 1061 to 1093, as below.

[Image: dpa1_gen_0401-0402_intron1]

However, the sequences of these alleles are identical through these positions, and it seems like the sequence for *04:02 should only include a 3 nucleotide deletion, relative to the reference, for positions 1061 - 1063, as below.

[Image: dpa1_gen_0401-0402_intron1_fixed]

Release 3.36.0 - file inconsistencies

  1. There are two new alleles where "dateassigned" is blank in the hla.xml file, DQA1 05:05:01:20 (HLA22679) and DRB4 01:03:01:10 (HLA22663). The dates are listed appropriately in the hla_nom file.

  2. There is an inconsistency between hla_nom and hla.xml for HLA00886, where the xml file has the allele name as v2 DRB3 010101 while the nom file has v3 DRB3 01:01:01. Could you explain this for us?

  3. The hla.xml file has a G group listed as C*07:726N:01G while nom_g lists it as 07:726:01G. Could you please look into this one too?

  4. There is an inconsistency between nom_p and hla.xml regarding DQA1 05:05:01:20. This allele is listed as part of DQA1*05:01P in nom_p but has no p group status in the xml file.

Any help on the above is greatly appreciated. Thanks!

ClassI_nuc.txt alignment issue (extra insertion placeholders in B,C alleles cause misalignment)

Extra insertion placeholders are found in the B and C alleles (not A) starting at line 122460, causing the exon boundary to not align around codon 182.

It looks like A, B, and C got out of alignment due to an insertion placeholder that is present in the B alleles, but not in A or C, starting on line 98736 in B*07:02:01:01 (due to the '-' symbol in B*40:345N, line 101665).

I can't attach the file, too big.

Strange deletion at A*01:18N peptide position 341

In the A_prot.txt alignment, the sequence for the final peptide position for A*01:18N is a deletion (.), but the sequence for the preceding 158 peptide positions is unknown (*).

This does not correspond to the A_nuc.txt alignment, where exon 8 nucleotide sequence is *****.

This terminal deletion does not show up in the .fasta, .msf or .pir alignments (but honestly, it isn't clear how it could).

incomplete fasta file

Hi,
The hla_gen.fasta from the latest version contains sequences for only 5773 alleles.
Where are the other alleles? I can't find DPA1*03:02, for instance.

thanks,
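
A count like the one quoted above can be reproduced, and a specific allele looked up, with a few lines of Python. The sketch assumes that each record starts with a ">" header line and that the allele name appears as a whitespace-separated field in that header; the header layout and the function names are assumptions, not taken from the repository documentation.

```python
def fasta_headers(lines):
    """Return the header text (without '>') of every record in a FASTA file."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def has_allele(headers, allele):
    """True if any header field equals the allele name or extends it with further fields."""
    for header in headers:
        for field in header.split():
            if field == allele or field.startswith(allele + ":"):
                return True
    return False


# Hypothetical header lines used only to illustrate the lookup.
sample = [">HLA:HLA00001 A*01:01:01:01 3503 bp", "ACGT", ">HLA:HLA00002 A*01:02 1098 bp", "ACGT"]
headers = fasta_headers(sample)
print(len(headers), has_allele(headers, "DPA1*03:02"))
```

For a real file, `fasta_headers(open(path))` gives the same list, since iterating over a text file yields its lines.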
