cnr-ibba / smarter-database

Smarter database repository

Home Page: https://smarter-database.readthedocs.io/en/latest/

License: MIT License

Makefile 3.46% JavaScript 0.05% Jupyter Notebook 70.45% Python 26.04%

affymetrix breeding efficiency genotype plink resilience ruminant smarter database

smarter-database's Introduction

SMARTER Database

SMARTER-database aims to collect data produced by the WP4 group in the context of the SMARTER project and to merge them with already available data.

Project Organization

├── data
│   ├── external        <- Data from third party sources.
│   ├── interim         <- Intermediate data that has been transformed.
│   ├── processed       <- The final, canonical data sets for modeling.
│   └── raw             <- The original, immutable data dump.
│
├── database            <- MongoDB smarter database docker-composed image
│
├── docs                <- A default Sphinx project; see sphinx-doc.org for details
│
├── models              <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                          the creator's initials, and a short `-` delimited description, e.g.
│                          `1.0-jqp-initial-data-exploration`.
│
├── references          <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports             <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures         <- Generated graphics and figures to be used in reporting
│
├── src                 <- Source code for use in this project.
│   ├── __init__.py     <- Makes src a Python module
│   │
│   ├── data            <- Scripts to download or generate data
│   │
│   ├── features        <- Scripts to turn raw data into features for modeling
│   │
│   ├── models          <- Scripts to train models and then use trained models to make
│   │   │                  predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization   <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── tests               <- Folder to test python modules / scripts
│
├── HISTORY.rst         <- Project change log
├── LICENSE             <- Project LICENSE
├── Makefile            <- Makefile with commands like `make data` or `make train`
├── README.md           <- The top-level README for developers using this project.
├── conda-linux-64.lock <- Conda environment lock file for linux 64 bit
├── environment.yml     <- Conda environment file
├── poetry.lock         <- Poetry lock file
├── pyproject.toml      <- Poetry project file
├── pytest.ini          <- Configuration file for `pytest` testing environment
├── setup.py            <- makes project pip installable (pip install -e .) so src can be imported
└── test_environment.py <- Helper script to test if environment is properly set

Project based on the cookiecutter data science project template. #cookiecutterdatascience

smarter-database's Issues

:card_file_box: track breeds and countries in countries and breeds collection respectively

Is your feature request related to a problem? Please describe.
When filtering samples in SMARTER-frontend, the suggested breed and country values do not take into account which breeds are available in each country and vice versa, so it is possible to pick a breed and country combination that returns 0 results. If the breed and country collections track countries and breeds respectively, it will be possible to query a breed endpoint and filter by country

Describe the solution you'd like
Track breeds in the countries collection and countries in the breeds collection. The script src/data/update_db_status.py could track that information in the collections

Describe alternatives you've considered
An alternative could be executing an aggregation pipeline within sample collections using SMARTER-backend

Additional context

Model countries and breed as ListField
Update collections with update_db_status.py
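
The cross-references described above could be computed along these lines. This is a minimal sketch in plain Python, assuming samples are available as dictionaries with `breed` and `country` keys; the function name `build_cross_references` and the sample values are hypothetical and are not taken from `update_db_status.py`.

```python
from collections import defaultdict


def build_cross_references(samples):
    """Return (breed -> countries, country -> breeds) as sorted lists."""
    breed_countries = defaultdict(set)
    country_breeds = defaultdict(set)

    for sample in samples:
        breed_countries[sample["breed"]].add(sample["country"])
        country_breeds[sample["country"]].add(sample["breed"])

    return (
        {breed: sorted(values) for breed, values in breed_countries.items()},
        {country: sorted(values) for country, values in country_breeds.items()},
    )


# illustrative records only, not real SMARTER samples
samples = [
    {"breed": "Texel", "country": "Italy"},
    {"breed": "Texel", "country": "France"},
    {"breed": "Merino", "country": "Italy"},
]
breeds, countries = build_cross_references(samples)
```

The resulting lists are the kind of values that could be stored in a ListField on each breed or country document.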

:recycle: refactor tests which require database objects

Is your feature request related to a problem? Please describe.
Tests should be refactored to initialize data from fixtures. This will make testing easier, even when the required setup is complex

Describe the solution you'd like
Database-related tests should start by loading fixtures, and requirements must be managed by uploading JSON objects

Describe alternatives you've considered
Tests could remain as they are; however, it is difficult to set up certain tests properly

Additional context
See the tests in SMARTER-backend and take inspiration from Django tests

:white_check_mark: check goat coordinates

Is your feature request related to a problem? Please describe.
Check goat coordinates as described by email

Describe the solution you'd like
Check and fix chromosome names accordingly http://www.goatgenome.org/data/capri4dbsnp-base-CHI-ARS-OAR-UMD.csv and http://www.goatgenome.org/data/Goat_IGGC_65K_v2_A2_chr_mapinfo_Aux_file.txt

Describe alternatives you've considered
The data coming from the manifest seem to be almost correct; however, there are SNPs annotated on the X chromosome that are actually located on a scaffold

Additional context

The probes have been mapped on ARS1 assembly with the following naming of chromosomes/scaffolds:

    1 to 29
    MT
    X.1 X.2 for scaffold assigned to X in the auxiliary and csv files, X in the bpm file
    Y.1 to Y.11 for scaffolds assigned to Y in the auxiliary and csv files, Y in the bpm file
    0.scaffold_name…for unplaced scaffold (example: 0.NW_017191871.1) in the auxiliary and csv files, 0.NW_scaffold in the bpm file.

For probes either unlocalized or with multiple localizations or not confirmed as belonging to X or Y scaffolds (after studying the genotypes), localization was set to null.

check goat chromosomes with aux file
check chromosomes in Variants locations: pay attention to scaffold, null, and non-autosomal values
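
A check of chromosome names against the naming scheme listed above could look like the following. This is only a sketch: the function name `classify_chromosome` and the returned labels are hypothetical, and the patterns cover the naming used in the auxiliary and csv files (not the bpm file).

```python
import re

# Patterns follow the naming listed above for the auxiliary and csv files
PATTERNS = [
    ("autosome", re.compile(r"[1-9]|[12][0-9]")),
    ("mitochondrial", re.compile(r"MT")),
    ("X scaffold", re.compile(r"X\.[12]")),
    ("Y scaffold", re.compile(r"Y\.([1-9]|1[01])")),
    ("unplaced scaffold", re.compile(r"0\..+")),
]


def classify_chromosome(name):
    """Return a label for a chromosome name, or 'unexpected' if no rule fits."""
    if name is None:
        # probes without a confirmed localization
        return "null"
    for label, pattern in PATTERNS:
        if pattern.fullmatch(name):
            return label
    return "unexpected"
```

With these rules a plain `X` is reported as `unexpected`, which is the situation described above for SNPs annotated on X while they are in a scaffold.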

:sparkles: import data from Spain datasets

Is your feature request related to a problem? Please describe.
Foreground data coming from Spain need to be imported into the SMARTER database

Describe the solution you'd like
The data come from an Affymetrix chip which is not modeled by the database. These data need to be uploaded into the database before merging them into the SMARTER dataset

Describe alternatives you've considered
I could insert only the data I can model; however, I would miss about 20% of the SNPs, which I can't find in the database. Moreover, I need to check genotypes and work out what they should be in TOP coding. This will be difficult to handle for reference genomes that differ from the ones used to generate the chip data

Additional context
solve issues about the 10K chip: are these genotypes a subset of the data I have? Do I need to import them into the database?
import the new manifest data
~~move the src.features.plinkio.AffyPlinkIO class to a generic AffyCellIO class (to handle cell files).~~
support for comments or tabs in src.features.plinkio.TextPlinkIO class
import spain data (and metadata) into SMARTER database

:sparkles: track `ALT` and `REF` alleles in locations

Is your feature request related to a problem? Please describe.
Tracking ALT and REF alleles for each location is required to convert data into VCF format

Describe the solution you'd like
ALT and REF alleles should be determined from genome alignments and then tracked for each Location object

Describe alternatives you've considered
The sheep genome consortium tracks genotypes with the REF/ALT convention, so data can be imported from there. However, SNPs not managed by the consortium (e.g. Affymetrix ones) cannot have these attributes

Additional context
Please see comments in #18 for import_consortium.py

Add ALT and REF class attributes to src.features.smarterdb.Location
Track latest data from Sheep genome consortium

:construction_worker: Set Up Continuous Integration

Is your feature request related to a problem? Please describe.
This repo lacks CI

Describe the solution you'd like
Set up Travis CI and Coverage (maybe scrutinizer)

Describe alternatives you've considered
Tests could be run before each commit; however, it's better to configure CI for this

Additional context
Mind to Partner Queue Solution

:sparkles: add multiple phenotypes to samples

Is your feature request related to a problem? Please describe.
There are samples with multiple phenotypes (more rows to be added to the same sample)

Describe the solution you'd like
Try to upload all data in Phenotypes using lists

Describe alternatives you've considered
We could create another collection in which every record is stored as an object. However, this means dealing with another endpoint in the backend, r-smarter-api and the web interface, and requires users to merge data themselves

Additional context

import multiple phenotypes for the same sample
test stuff within SMARTER-database project
try to load data through SMARTER-backend
try to collect data using r-smarter-api
try to visualize data with SMARTER-frontend

:boom: model database migrations

Is your feature request related to a problem? Please describe.
Dropping and re-creating data every time a change is needed is slow and error-prone: if intermediate and processed data files are not removed and the collections are not restored to their initial state, inconsistencies can arise

Describe the solution you'd like
Database changes need to be modeled with migrations when possible, so that a change can be applied without running the entire import process. Test database migrations to understand whether they can be applied in this context

Describe alternatives you've considered
src.features.smarterdb.get_or_create_sample could be modified to check whether a sample was changed. The same concept could be applied to dataset imports.

Additional context
See:

Documents migration in mongoengine documentation
Mongoengine-migrate project

:card_file_box: check variants data before update

Is your feature request related to a problem? Please describe.
There are Affymetrix probes like Affx-291664366 and Affx-281280004 which map to the same cust_id: "oar3_OAR8_62459373". However, they are A/G and A/C respectively: I can't track both of them on the same SNP, so I need to check alleles (illumina_top coding) to understand whether the SNP is the same; otherwise I could skip the import

Describe the solution you'd like
Track the illumina_top attribute at variant level and update only if these records match. The check needs to be done at base level, since a wrong location can't be trusted during updates. This attribute should be generated when creating a variant and should never change

Describe alternatives you've considered
I could skip the import or overwrite with new data, but I can't merge datasets with different alleles

Additional context

track illumina_top at variant level
skip variant update if illumina_top alleles between variant and new location don't match
use dates to understand if I need to update location or not
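
The three checks above could be combined as in the sketch below. The function name `should_update_location` and its arguments are hypothetical. Comparing alleles as unordered sets (so that `A/G` equals `G/A`) is an assumption made here for illustration, not something stated in the issue.

```python
from datetime import datetime


def should_update_location(variant_top, new_top, current_date, new_date):
    """Return True only if alleles match and the new record is more recent."""
    # assumption: "A/G" and "G/A" describe the same pair of alleles
    if set(variant_top.split("/")) != set(new_top.split("/")):
        return False
    # without a date on the stored location, accept the new record
    if current_date is None:
        return True
    return new_date is not None and new_date > current_date
```

For the two probes mentioned above, `should_update_location("A/G", "A/C", ...)` returns `False`, so the second probe would be skipped instead of overwriting the first one.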

:zap: enhance genotype conversion

Is your feature request related to a problem? Please describe.
SNP genotype conversion is very slow. For example, if the original file is already in plink binary format, converting it through temporary plink text files takes too much time. Converting the whole dataset into FORWARD (see #111) generates huge temporary files. Converting from binary to text is time-consuming because the information for the same sample is column-based in the binary format and row-based in the text format.

Describe the solution you'd like
Ideally, data conversion should be done on binary formats (not text). If it is possible to change only the alleles in the .bim file of a plink binary fileset, that approach should be used.

Describe alternatives you've considered
We have worked with text files until now and it works. However, it is very inefficient.

Additional context

understand how plinkio deals with writing files
evaluate the plink transposed binary format
evaluate updating only the alleles in the *.bim file for the binary format
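
The last option could be sketched as follows. A .bim file has six whitespace-separated columns (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), so only the last two need rewriting. The function name `update_bim_alleles`, the mapping argument and the SNP name in the example are hypothetical. The sketch assumes the new alleles are given in the same order as the old ones, since the .bed file encodes genotypes relative to that order.

```python
def update_bim_alleles(bim_lines, new_alleles):
    """Yield .bim lines with alleles replaced for the SNPs in new_alleles.

    new_alleles maps a variant ID to an (allele1, allele2) tuple, in the
    same order as the alleles they replace.
    """
    for line in bim_lines:
        chrom, snp_id, cm, position, allele1, allele2 = line.split()
        if snp_id in new_alleles:
            allele1, allele2 = new_alleles[snp_id]
        yield "\t".join([chrom, snp_id, cm, position, allele1, allele2])


# illustrative input: one SNP converted, one left untouched
lines = ["16\tsnp_a\t0\t62646307\tA\tG", "1\tsnp_b\t0\t1000\tC\tT"]
converted = list(update_bim_alleles(lines, {"snp_a": ("T", "C")}))
```

Because the .bed and .fam files are not touched, no temporary text files are generated.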

:sparkles: add Croatian sheep data

Is your feature request related to a problem? Please describe.
Add data coming from Drzaic et al 2022. The related dataset can be found in Dryad

Describe the solution you'd like
Check data and import HD SNPs. Supplementary materials have information about GPS coordinates

Describe alternatives you've considered
An alternative could be not importing dataset, freezing database content in this state

Additional context

check assembly
ensure these samples are not already in the database
fix and import GPS coordinates
load data into database

:sparkles: generate final genotype files for OAR4 and CHI1 assemblies

Is your feature request related to a problem? Please describe.
Genotype files for the remaining coordinate systems (OAR4 and CHI1) need to be generated

Describe the solution you'd like
Genotype data need to be generated for other assemblies relying on configuration files or parameters: this way a new assembly can be supported in the future with few changes.

Describe alternatives you've considered
If we want to support other genome versions, we need to find a way to handle them: there could be a version-specific Makefile, a Makefile dependency or a workflow managed with software like Nextflow

Additional context

Provide additional assembly version as parameters
Convert genotypes in CHI1 and OAR4 genomic coordinates
Ensure that operations on SampleSpecies collections occur exactly once
Publish data on FTP site
Update FTP README
Check all the documentation where assembly versions are mentioned

:sparkles: import data from isheep database

Is your feature request related to a problem? Please describe.
SMARTER database can be integrated with the isheep database, which contains data from CAS partner

Describe the solution you'd like
All genotypes from isheep database can be downloaded in VCF format (used both for chips and WGS). Convert data into illumina top then add genotypes into database

Describe alternatives you've considered
There are no alternatives if we need to add CAS data into SMARTER dataset

Additional context
Track species in SampleSpecies species field
Support custom species during sample creation
Fix species for ADAPTMAP dataset
Fix metadata table
Create isheep samples
update genomic coordinates for OAR4 assembly
update rs_id if possible
convert VCF into plink format
import data from chips and WGS

:bug: indel SNPs should be discarded from final dataset

Describe the bug
Variants like oar3_OAR1_80093512_dup are indels: they can't be converted into illumina_top nor included in the final dataset file

To Reproduce
Steps to reproduce the behavior:

Try to create samples from foreground french dataset (binary plink HD)

Expected behavior
Such SNPs should be tracked, for example with an is_indel flag, so that they are ignored while fetching coordinates. Alternatively, such SNPs could be ignored when created from the manifest (as Affymetrix indel SNPs are)

Screenshots

{
  "version" : "Oar_v3.1",
  "chrom" : "X",
  "position" : 50971660,
  "illumina" : "I/D",
  "illumina_strand" : "MINUS",
  "strand" : "PLUS",
  "imported_from" : "manifest",
  "date" : ISODate("2013-04-23T00:00:00Z"),
  "consequences" : [ ]
}

Additional context
Choose one of these two distinct strategies:

Detect indels while importing dataset from manifest
Add is_indel attribute to such SNPs in VariantSpecies
Ignore such indels variant in plinkio.SmarterMixin.fetch_coordinates

skip SNPs during manifest upload (like import_affymetrix.py does)
check no-errors while uploading SNPchimp coordinates

Search and add this flag manually with mongodb:

db.variantSheep.updateMany({$or: [{locations: {$elemMatch: {imported_from: "manifest", illumina: "D/I"}}}, {locations: {$elemMatch: {imported_from: "manifest", illumina: "I/D"}}}]}, {$set: {is_indel: true}})
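
The same rule could be applied in Python while importing from the manifest. The sketch below works on plain dictionaries shaped like the location document shown above; the helper names `is_indel` and `has_manifest_indel` are hypothetical and do not refer to existing functions in the project.

```python
INDEL_CODES = {"I/D", "D/I"}


def is_indel(location):
    """True if the location describes an insertion/deletion."""
    return location.get("illumina") in INDEL_CODES


def has_manifest_indel(variant):
    """True if any location imported from manifest is an indel."""
    return any(
        loc.get("imported_from") == "manifest" and is_indel(loc)
        for loc in variant.get("locations", [])
    )


# mirrors the relevant fields of the document shown above
variant = {
    "locations": [
        {"version": "Oar_v3.1", "illumina": "I/D", "imported_from": "manifest"}
    ]
}
```

A variant for which `has_manifest_indel` is true could either be skipped during import or stored with `is_indel` set, matching the two strategies above.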

:sparkles: add WGS background data for Goat

Is your feature request related to a problem? Please describe.
Add WGS background data for goat animals coming from Berihulay et al. 2019

Describe the solution you'd like
Import WGS background data like we did for iSheep data

Describe alternatives you've considered
Alternative could be different datasets or ignoring WGS data for Goats

Additional context

Check for ARS1 assembly (search SNPs relying on information in database)
Extract SNPs relying on probe position
Create metadata accordingly to publication and Supplementary Information
Convert into illumina top and import into database

:bug: check country for french dataset with GPS coordinates

Describe the bug
Some sheep are assigned to the wrong country; this can be verified by inspecting the GPS coordinates and doing reverse geocoding

To Reproduce
An example of this are animals coming from this dataset:

# collect geo data from Romanov sheep
romanov_sheep <- get_smarter_geojson(
  "Sheep",
  query = list(breed = "Romanov")) %>%
    sf::st_cast("POINT", do_split = FALSE)

# display data with leaflet
leaflet::leaflet(
  data = romanov_sheep
  ) %>%
    leaflet::addTiles() %>%
    leaflet::addMarkers(
      clusterOptions = leaflet::markerClusterOptions(), label = ~smarter_id
    )

Expected behavior
Those animals need to be assigned to the proper country (i.e. RUSSIA* for Romanov sheep)

Additional context

check samples with reverse geocoding
fix countries and their respective dataset
import and re-generate IDs (however, this raises again the problem spotted in #73)
new database release

:bug: imported illumina report multi breeds file have the same FID in output

Describe the bug
When importing an Illumina report file for a multi-breed dataset, the ped output has the same FID for every sample. This seems related to the fact that the fid variable is read once and then never re-assigned when the sample changes:

# determine fid from sample, if not received as argument
if not fid:
    sample = self.SampleSpecies.objects.get(
        original_id=row.sample_id,
        dataset=dataset
    )

    fid = sample.breed_code
    logger.debug(f"Found breed {fid} from {row.sample_id}")

Once fid is set, it is never updated again

To Reproduce
Steps to reproduce the behavior:
Import an Illumina report multi-breed dataset and see the same FID in the output file

Expected behavior
FID needs to change according to the sample breed code

Additional context

define a test with two samples with different FID
check that FID changes in output
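
One possible shape for the fix and for the test described above is sketched here. The dictionary `breed_codes` stands in for the `SampleSpecies.objects.get(...)` lookup, and the function name `resolve_fid`, the sample IDs and the breed codes are all hypothetical. The point is that the FID is resolved per row and the caller's argument is never overwritten.

```python
def resolve_fid(sample_id, breed_codes, fid=None):
    """Return the FID for one row without caching it across rows."""
    if fid:
        # an explicit FID received as argument applies to every sample
        return fid
    return breed_codes[sample_id]


# two samples belonging to different breeds (illustrative values)
breed_codes = {"sample_1": "TEX", "sample_2": "MER"}
fids = [resolve_fid(sample_id, breed_codes) for sample_id in breed_codes]
```

Here `fids` contains one breed code per sample, which is what the ped output should show for a multi-breed report.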

:sparkles: add data for Guisandesa goats

Is your feature request related to a problem? Please describe.
Add data for Guisandesa goats

Describe the solution you'd like
Add affymetrix goat genotypes and then import the new dataset

Describe alternatives you've considered
We cannot compare these data without importing them into the database, since the database technologies are different

Additional context

Add information for affymetrix goat chip
Import goat dataset

:boom: define a SMARTER snp coordinate version

Is your feature request related to a problem? Please describe.
Variant information comes from different sources, each one describing only a subset of all smarter variants. The SMARTER coordinate version should merge the most updated and trusted information and provide a final set of coordinates used to generate the most updated genotype file

Describe the solution you'd like
A solution could be trying to align probes directly to the genome of interest. This could solve ambiguities, determine new relations between chips and get information on the latest assembly

Describe alternatives you've considered
All the evidence for a variation should be compared and ordered by update time and by authority. The most up-to-date and trusted coordinates should be used to define a SMARTER coordinate source which can be displayed in the API and web interface.

Additional context

track dates when importing snps from snpchimp
illumina_top attribute should be checked during insertion and written into the database
merge and rank evidence
define a comprehensive coordinate version relying on different evidences
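
Ranking evidence by authority and then by date could be sketched as below. The `AUTHORITY` ranking is purely hypothetical (the issue does not say which source is more trusted), and so is the function name `best_location`. Locations are plain dictionaries with the `imported_from` and `date` keys used in the location documents; the dates in the example are illustrative.

```python
from datetime import datetime

# hypothetical ranking: a lower number means a more trusted source
AUTHORITY = {"SNPchimp": 0, "manifest": 1}


def best_location(locations):
    """Pick the most trusted location, preferring the most recent on ties."""
    return min(
        locations,
        key=lambda loc: (
            # unknown sources rank after all the known ones
            AUTHORITY.get(loc["imported_from"], len(AUTHORITY)),
            # negate the date so that newer records sort first
            -loc["date"].toordinal(),
        ),
    )


locations = [
    {"imported_from": "manifest", "date": datetime(2009, 1, 7)},
    {"imported_from": "SNPchimp", "date": datetime(2015, 6, 1)},
    {"imported_from": "manifest", "date": datetime(2013, 4, 23)},
]
chosen = best_location(locations)
```

Swapping the two elements of the sort key would give priority to the most recent record regardless of its source, which is the other reading of "ordered by update time and by authority".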

:sparkles: import data from Hungary

Is your feature request related to a problem? Please describe.
Data from Hungary need to be imported

Describe the solution you'd like
Add the Hungarian dataset to the SMARTER database. Model breed names like White Dorper, Rusty Tsigai and Ile de France (check for breeds already in the database).

Describe alternatives you've considered
There are no alternatives to Hungarian data load

Additional context

Model Hungarian breed names
Import from plink (seems in forward OAR3)
Add GPS coordinates (if any)

:sparkles: add new background datasets for goat

Is your feature request related to a problem? Please describe.
Add two more background datasets for goat

Describe the solution you'd like
Test and update Goat background samples with samples from Burren et al. 2016 and Cortellari et al. 2021

Describe alternatives you've considered

Additional context

Add Burren et al. 2016 samples
Add Cortellari et al. 2021 samples
Check no Bezoar duplicated (from Cortellari)

:sparkles: track constants and status in database

Is your feature request related to a problem? Please describe.
Default attributes should be tracked in database, with information on releases and status

Describe the solution you'd like
Database should be coupled with library version. Tracking defaults in database could be useful for SMARTER-backend

Describe alternatives you've considered
Defaults and magic numbers could be tracked in scripts; however, that information would need to be duplicated in SMARTER-backend

Additional context

track database version into a collection
track defaults in a collection (e.g. assembly versions)
~~illumina_top attribute should be checked during insertion and written in to database~~ (moved in #37)
track dates when importing snps from manifest
~~track dates when importing snps from snpchimp~~ (moved in #37)
track type (foreground, background) within samples

:sparkles: import latest Uruguayan data

Is your feature request related to a problem? Please describe.
A new batch of data has been uploaded from Uruguay

Describe the solution you'd like
Import latest Uruguay data as foreground data

Describe alternatives you've considered
There are no alternatives, data should be imported

Additional context

Check data type
Check user IDs
Check SNPs are in database
Import data into SMARTER-database

:sparkles: track affymetrix data

Is your feature request related to a problem? Please describe.
Refactor the variants collection to track information from Affymetrix

Describe the solution you'd like
The Affymetrix ID should be a key in the variant collections. Moreover, we need to track dates in locations (when the chip was updated)

Describe alternatives you've considered
Affymetrix data could be stored in other collections; however, it would be difficult to translate or search for SNPs between chips

Additional context

model affymetrix data in variant collections
load data from affymetrix manifest
calculate illumina_top from affymetrix sequence
check genotypes between illumina and affymetrix

:sparkles: add a collection for country objects

Is your feature request related to a problem? Please describe.
Having a collection of supported countries could help in suggesting items through the portal interface

Describe the solution you'd like
Define a collection of countries and rely on that while managing samples

Describe alternatives you've considered
Countries could be derived with aggregations on Samples collections. However this could take more time than modelling distinct countries in a new collection

Additional context

~~check country collections before inserting samples or dataset~~
use pycountry module to collect info in countries

:boom: Manage python packages with poetry

Is your feature request related to a problem? Please describe.
It's hard to maintain dependencies with this mix of conda and requirements.txt file

Describe the solution you'd like
Try to follow this discussion to integrate poetry with conda environment

Describe alternatives you've considered
an alternative could be using pixi, but it seems quite new

Additional context

install poetry within conda environment
manage dependencies using poetry
update documentation

:sparkles: import data from dbSNP

Is your feature request related to a problem? Please describe.
Import data from dbSNP

Describe the solution you'd like
Need to insert the latest dbSNP information into the database. Should check whether info can be found for the chips that are currently supported

Describe alternatives you've considered
Those data are not very recent, since dbSNP has not been updated since 2017. Reference positions need to be more recent

Additional context

Fix linter issues
Refactor src.features.dbsnp module
Load data from dbSNP152

:wrench: configure database connection

Is your feature request related to a problem? Please describe.
Connect to a database on a remote host, in order to run the import process on a computing resource

Describe the solution you'd like
Track configuration parameters in env file and set default values for development

Describe alternatives you've considered
Data could be imported on the same host as the database (as is currently done)

Additional context

Track configuration parameters in env file
Test for default values for local development
Test import process
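
Reading connection parameters from the environment with development defaults could look like this. The variable names (`MONGODB_HOST`, `MONGODB_PORT`, `MONGODB_DATABASE`, `MONGODB_USER`, `MONGODB_PASSWORD`), the default values and the function name `get_mongodb_uri` are assumptions made for this sketch and may differ from the ones used by the project.

```python
import os
from urllib.parse import quote_plus


def get_mongodb_uri(env=None):
    """Build a MongoDB URI from environment variables, with local defaults."""
    env = os.environ if env is None else env

    host = env.get("MONGODB_HOST", "localhost")
    port = env.get("MONGODB_PORT", "27017")
    database = env.get("MONGODB_DATABASE", "smarter")
    user = env.get("MONGODB_USER")
    password = env.get("MONGODB_PASSWORD")

    credentials = ""
    if user and password:
        # escape characters such as "@" or ":" in the credentials
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"

    return f"mongodb://{credentials}{host}:{port}/{database}"
```

Passing a dictionary instead of `os.environ` makes the defaults easy to test without touching the real environment.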

:sparkles: import data from affymetrix file

Is your feature request related to a problem? Please describe.
Data coming from affymetrix need to be imported into SMARTER database.

Describe the solution you'd like
Files are similar to plink format; however, the first plink columns are missing (only sample name and genotype are present). Sample names need to be converted using metadata files (this is different from the usual alias used for breed conversion). Since the files contain more data than what needs to be shared within SMARTER, we don't want to create a record while loading data

Describe alternatives you've considered
Fixing data before import may not work since the format is slightly different from plink format. Subsetting samples before uploading could also be useful with other datasets

Additional context

Add a function in src.features.plinkio.SmarterMixin to ignore a ped line if the sample is not present in the database
Import samples from file by providing country and breeds as parameters
Define alias while importing samples
Search in database relying on original_id or alias
Parse affy genotype data and discard INDELs
Rely on original coordinate system to determine illumina top alleles
Apply the desired coordinate system
Add a custom script for affymetrix data import

:sparkles: Import coordinates from Sheep and Goat projects

Is your feature request related to a problem? Please describe.
Sheep and Goat projects have SNPs coordinates for the four major assemblies I need (CHI1, ARS, OAR3, OAR4)

Describe the solution you'd like
Check genomic coordinates, make a report and then fill those data into database

Describe alternatives you've considered
Data could be fetched directly from EBI or EnsEMBL; however, it's interesting to merge multiple data sources to understand where problems could be. Data from the consortium will not be used to provide the final data files; evaluate whether this information needs to be provided with the final version

Additional context
Those data could be useful to cover missing information (like rs IDs in goat). These data lack alignment information, so deriving illumina_top from them makes no sense.

Manage the two different data files provided by consortium
Force data update when importing from consortium
Track date when importing from consortium
Determine illumina_top data directly from variant for Sheep
Import data for goat

:sparkles: determine illumina_forward genotype from manifest data

Is your feature request related to a problem? Please describe.
It's not possible to convert from forward to top using data imported from manifest, since the Location.illumina_forward attribute cannot be set or derived from manifest data. Consider, for example, SNP 250506CS3900386000001_696.1:

{
        "version" : "Oar_v3.1",
        "chrom" : "16",
        "position" : 62646307,
        "illumina" : "A/G",
        "illumina_strand" : "TOP",
        "strand" : "TOP",
        "imported_from" : "manifest",
        "date" : ISODate("2009-01-07T00:00:00Z"),
        "consequences" : [ ]
}

Even if strand is TOP and illumina_strand is TOP, the wanted illumina_forward allele is T/C, since it depends on genome mapping. The illumina_strand tells how to cut the SNP from the probe (and since it is TOP, I can take the illumina record as it is to have an illumina top). The strand is the direction of the alignment; however, I don't know whether the probe was reversed in this alignment (probably yes, since there is an Affymetrix probe in this record which seems to be reversed: it has a T/C SNP in its sequence)

Describe the solution you'd like
illumina_forward should be determined using the genomic sequence and manifest information. In this way, we will be able to convert from forward to top using manifest data (i.e. goat forward data input)

Describe alternatives you've considered
An alternative could be getting data from other sources, for example illumina row files, since that conversion doesn't rely on the illumina_forward attribute; however, the script import_from_illumina.py should change in order to determine fid from the database and not from the command line (which is necessary for multi-breed illumina row files).

Additional context

read the proper genome file while reading from manifest
determine the reference allele from sequence relying on SNP location
check if the reference allele belongs to the manifest illumina feature. If yes, illumina_forward could be equal to the illumina feature; if not, the illumina feature should be reverse complemented
cross check illumina_forward feature with other source (ie. SNPchimp)
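
The reference-allele check in the steps above could be sketched as follows. The function name `illumina_forward` is hypothetical, and the reference allele is passed in directly rather than read from a genome file. Alleles are complemented one by one while keeping their order, which turns `A/G` into the `T/C` expected in the example above. A/T and C/G SNPs are rejected because the reference allele matches both orientations, so this rule alone cannot resolve them.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
AMBIGUOUS = ({"A", "T"}, {"C", "G"})


def illumina_forward(illumina, reference_allele):
    """Return the alleles in forward orientation given the reference base."""
    alleles = illumina.split("/")

    if set(alleles) in AMBIGUOUS:
        raise ValueError(f"can't orient ambiguous SNP {illumina}")

    # the reference allele is one of the manifest alleles: same orientation
    if reference_allele in alleles:
        return illumina

    # otherwise the probe is on the opposite strand: complement each allele
    complemented = [COMPLEMENT[allele] for allele in alleles]
    if reference_allele not in complemented:
        raise ValueError(
            f"reference {reference_allele} doesn't match {illumina}"
        )
    return "/".join(complemented)
```

For instance, assuming the reference base at the position is `C`, `illumina_forward("A/G", "C")` gives `T/C`. The result can then be cross-checked against another source as listed in the last step.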

:boom: should smarter_id change its format?

Is your feature request related to a problem? Please describe.
The smarter_id attribute is not a good primary key since it incorporates additional information in its name, like country and breed_code. This means that modifying a country or a breed_code for a sample causes inconsistencies between this primary id and the fields, leaving a stable id which contains wrong information. On the other hand, fixing smarter_id relying on new data means that these ids will change with the database version.

Describe the solution you'd like
smarter_ids should be anonymized, maybe tracking the project name with the number (like SMARTER-OA-000002310); species needs to be kept to avoid name collisions between goat and sheep
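
An anonymized identifier in the proposed shape could be generated as in the sketch below. The `OA` code comes from the example above; the `CH` code for goat, the mapping `SPECIES_CODES` and the function name `make_smarter_id` are assumptions made for illustration.

```python
# species codes: "OA" is taken from the example above, "CH" is assumed
SPECIES_CODES = {"Sheep": "OA", "Goat": "CH"}


def make_smarter_id(species, number):
    """Format an ID with project name, species code and a 9-digit counter."""
    return f"SMARTER-{SPECIES_CODES[species]}-{number:09d}"
```

An ID built this way carries no country or breed information, so fixing those fields on a sample would not require changing its identifier.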

Describe alternatives you've considered
An alternative could be to fix things after data import, using a mongoengine query to correct wrong values while keeping smarter_id stable. However, this means no longer trusting the information in ids.

Additional context
The Galway sheep breed was tracked as Germany instead of Ireland since the country was derived from a wrong set of coordinates. Code and country need to be fixed

:sparkles: add SNPconvert.py to main src/data scripts

Is your feature request related to a problem? Please describe.
SNPconvert.py is a script based on SMARTER-database libraries which can convert genomic coordinates of datasets without importing them into SMARTER-database. It should be managed and maintained with SMARTER-database main scripts

Describe the solution you'd like
Support and maintain SNPconvert.py in main branch

Describe alternatives you've considered
Even if this script can be used as it is, it will be easier to maintain all the software within SMARTER-database development.

Additional context

Add integration test for SNPconvert.py
Support SNPconvert.py in main documentation

:sparkles: convert genotypes from top to forward

Is your feature request related to a problem? Please describe.
It would be nice to convert genotypes to the forward coding format. This would be useful to transform data into other formats like VCF.

Describe the solution you'd like
The ideal solution would be to work with datasets separately, converting genotypes in temporary files (and folders).

Describe alternatives you've considered
An alternative would be to work on the final dataset and do the conversion using the SNPconvert.py tool. This would be easier, since the data are in the same format and fewer changes are required to maintain this project.

Additional context
Ideally, this task should:

Convert from illumina to forward in src.features.smarterdb.Location
Add a _to_forward function in src.features.plinkio.SmarterMixin
Support src and dst coding in src.features.plinkio.SmarterMixin._process_genotypes , _process_pedline and update_pedfile
Convert into forward using SNPconvert.py
#115
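
The recoding step itself could look like the sketch below. It is not the actual `SmarterMixin` API: the function name and signature are hypothetical, and it assumes that a location stores its TOP and forward alleles as `A/G`-style strings listing the same two alleles in the same order, with `0` as the missing allele as in PLINK ped files.

```python
from typing import List, Sequence


def top_to_forward(
    genotype: Sequence[str],
    illumina_top: str,
    illumina_forward: str,
    missing: str = "0",
) -> List[str]:
    """Recode the alleles of one genotype from TOP to forward coding."""
    # Alleles are assumed to be listed in the same order in both codings,
    # so the conversion is a positional lookup
    mapping = dict(zip(illumina_top.split("/"), illumina_forward.split("/")))

    # Missing alleles stay missing
    mapping[missing] = missing

    try:
        return [mapping[allele] for allele in genotype]
    except KeyError as exc:
        raise ValueError(
            f"allele {exc.args[0]!r} is not a TOP allele of {illumina_top}"
        ) from exc
```

For a SNP with TOP alleles `A/G` and forward alleles `T/C`, the genotype `["A", "G"]` becomes `["T", "C"]`, while a missing genotype `["0", "0"]` is left untouched.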

:bug: check that breed exists while importing phenotypes

Describe the bug
The Malagueña purpose metadata was not correctly inserted into the database. Also, landrace was recorded as landrance in the breed table.

To Reproduce
Search for Malagueña phenotype

Expected behavior
The Malagueña breed should have the purpose phenotype

Additional context

Check that breed exists while inserting phenotype data
Fix Malagueña and Landrace (goat) breeds
Check traditional Arran goat -> should be Aran
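
The existence check could be sketched in plain Python as below. The function name and the record layout (a dict with a `breed` key) are assumptions made for illustration; the real import script would look breeds up in the database rather than in a list.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_known_breed(
    records: Iterable[Dict[str, str]],
    known_breeds: Iterable[str],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Separate phenotype records with a known breed from the others."""
    # Case-insensitive comparison; misspelled names are still rejected
    known = {name.casefold() for name in known_breeds}

    accepted: List[Dict[str, str]] = []
    rejected: List[Dict[str, str]] = []

    for record in records:
        if record["breed"].casefold() in known:
            accepted.append(record)
        else:
            rejected.append(record)

    return accepted, rejected


records = [
    {"breed": "Malagueña", "purpose": "milk"},
    {"breed": "Landrance", "purpose": "meat"},  # misspelled on purpose
]
accepted, rejected = split_by_known_breed(records, ["Malagueña", "Landrace"])
```

Reporting the rejected records, instead of silently skipping them, would have surfaced both the missing Malagueña metadata and the misspelled breed name at import time.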

:memo: update documentation

Is your feature request related to a problem? Please describe.
Describe how this project works

Describe the solution you'd like
Documentation should be updated: the importing process and data generation process need to be described

Describe alternatives you've considered
We could write the documentation at a later time; however, documentation would help others to better understand the whole process.

Additional context

Improve documentation
Set up documentation with readthedocs

:sparkles: integrate background data with published datasets

Is your feature request related to a problem? Please describe.
Add additional published datasets to fill gaps and provide more data to SMARTER database

Describe the solution you'd like
Some published datasets could be added to SMARTER-database

Describe alternatives you've considered
Background data could remain as it is; however, adding new data could provide some useful information.

Additional context

Front Genet 2021 Jun 18;12:612492. doi: 10.3389/fgene.2021.612492. (Welsh sheep breeds)
Barbato M, Hailer F, Orozco-Terwengel P, Kijas J, Mereu P, Cabras P, et al. Genomic signatures of adaptive introgression from European mouflon into domestic sheep. Sci Rep. 2017;7:7623
Ciani et al. Genet Sel Evol (2020) 52:25 https://doi.org/10.1186/s12711-020-00545-7
https://www.nature.com/articles/s41598-019-44137-y/ (Northwest Africa)
Heredity (Edinb) 2017 Mar;118(3):293-301. doi: 10.1038/hdy.2016.86. Epub 2016 Sep 14 (Algerian sheep)

:sparkles: add a script to transform PLINK into VCF and fix allele references using `bcftools`

:rotating_light: fix issues with linter

Describe the bug
Linter tests should pass without errors

To Reproduce
Steps to reproduce the behavior:

flake8 src
# or 
tox
# or see linter github workflow

Expected behavior
Linter process should exit with success

Additional context

Refactor the dbsnp feature module

:recycle: model location with MultiPointField

Is your feature request related to a problem? Please describe.
Modelling locations like:

# GPS location
# NOTE: X, Y where X is longitude, Y latitude
locations = mongoengine.ListField(mongoengine.PointField(), default=None)

does not seem to work properly when querying.

Describe the solution you'd like
Try to model those attributes with MultiPointField

Describe alternatives you've considered
An alternative could be a custom method like is_top to evaluate locations. However, this might not perform as well as a MultiPointField. Another alternative is to write locations in separate attributes (location1, location2, ...).

Additional context

change model type
test import_metadata.py works properly
try to make a query with geo_within_polygon (see http://docs.mongoengine.org/guide/querying.html?highlight=PointField#geo-queries)
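
As a starting point for the model change, the sketch below builds the GeoJSON MultiPoint representation that MongoEngine documents for `MultiPointField` (a dict with `type` and `coordinates`, or just the list of pairs). The helper name is hypothetical and the coordinates are made-up values; the function only validates and normalizes the data and does not touch the database.

```python
from typing import Dict, Iterable, List, Tuple, Union


def to_multipoint(
    points: Iterable[Tuple[float, float]],
) -> Dict[str, Union[str, List[List[float]]]]:
    """Build a GeoJSON MultiPoint from (longitude, latitude) pairs."""
    coordinates: List[List[float]] = []

    for longitude, latitude in points:
        # X is longitude and Y is latitude: swapped values are a common mistake
        if not -180 <= longitude <= 180:
            raise ValueError(f"longitude out of range: {longitude}")
        if not -90 <= latitude <= 90:
            raise ValueError(f"latitude out of range: {latitude}")

        coordinates.append([float(longitude), float(latitude)])

    return {"type": "MultiPoint", "coordinates": coordinates}


# Made-up sampling sites, given as (longitude, latitude)
locations = to_multipoint([(9.19, 45.46), (-8.0, 53.0)])
```

Storing all the sampling sites of a sample in one geometry, rather than in a list of separate points, is what should make operators such as `geo_within_polygon` applicable to the whole attribute; this still needs to be verified against a real database.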

:sparkles: check that illumina A/B alleles are tracked properly

Is your feature request related to a problem? Please describe.
The Illumina technical notes clearly state which alleles need to be A/B. Check variant import after the Illumina updates.

Describe the solution you'd like
Illumina import should take account of A/B when determining illumina top

Describe alternatives you've considered
This step does not seem to be important: Illumina TOP is currently determined and enforced during data import. Correctly detecting the A/B alleles should affect only a few SNPs, without consequences for TOP detection.

Additional context

Update src.features.illumina module
Update tests accordingly
Test variants after variants import
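
For reference, a sketch of the A/B designation for unambiguous SNPs, following the rule as understood here from Illumina's TOP/BOT technical note: in A/C and A/G SNPs the A nucleotide is allele A and the strand is TOP, in T/C and T/G SNPs the T nucleotide is allele A and the strand is BOT, and the remaining nucleotide is allele B. The function name is hypothetical and this is not the implementation in `src.features.illumina`; ambiguous A/T and C/G SNPs require the flanking sequence and are deliberately rejected.

```python
from typing import Tuple


def illumina_ab(snp: str) -> Tuple[str, str, str]:
    """Return (allele A, allele B, strand) for an unambiguous SNP like 'A/G'."""
    alleles = set(snp.upper().split("/"))

    if len(alleles) != 2 or not alleles <= set("ACGT"):
        raise ValueError(f"not a biallelic SNP: {snp!r}")

    # A/T and C/G SNPs cannot be resolved without the flanking sequence
    if alleles in ({"A", "T"}, {"C", "G"}):
        raise ValueError(f"ambiguous SNP, flanking sequence required: {snp!r}")

    # Unambiguous SNPs hold exactly one of A or T, which is allele A
    if "A" in alleles:
        (allele_b,) = alleles - {"A"}
        return "A", allele_b, "TOP"

    (allele_b,) = alleles - {"T"}
    return "T", allele_b, "BOT"
```

A test on imported variants could compare this designation with the stored alleles for the unambiguous SNPs, leaving the ambiguous ones to a sequence-based check.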
