Giter Club home page Giter Club logo

etl's People

Contributors

bcli4d avatar fedorov avatar wlongabaugh avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

fedorov

etl's Issues

[clinical] collection_id should not be an array

The field collection_id in the column-level metadata includes array for CPTAC collections. I think I can guess why this is the case (there is a single CPTAC table coming from ISB-CGC), but I am not sure this is the right way to encode this information. This will be quite confusing for the user. Schema of this column should be consistent with its schema in other places this column is encountered.

To deal with this, we might include a column Program along with collection_id.

Alternatively, we could unnest the multivalued rows of collection_id column and replicate column-level metadata for each collection in the program.

[clinical] Clinical data "inventory" table

Current organization of tables has 2 components:

  • single table defining the dictionaries for per-collection clinical tables
  • per-collection clinical tables

However, in the general case, 1) we will have more than 1 clinical table per collection (with different schemas); 2) we will at least sometime have the need to communicate description of the specific table (table level metadata).

Examples are the NLST collection and ACRIN clinical tables.

I suggest we introduce another level for organization that has the following columns (we could call it clinical_data_inventory or something like that?):

  • collection_id
  • clinical_table_id
  • description

This follows the approach implemented for ACRIN in https://github.com/fedorov/idc-clinical-cleanup, with the result in https://console.cloud.google.com/bigquery?p=idc-tcia&d=af_clinical_sandbox&page=dataset. There, those table-level metadata attributes are organized in tables per-collection (<collection_id>_dict), but we might as well put it all into a single table.

Investigate ingestion of HTAN related data

It is currently impossible (from what I know) to figure out what data is accompanying HTAN images stored in IDC. We have image-related data for CPTAC, TCGA and other collections. I do not see reasons not to follow the same approach for HTAN. We should investigate this.

Consider adding instance size to the BQ tables

It will probably be quite valuable to the users to know how much space the cohort will take when downloaded. In particular, this will be important for the user in order to allocate sufficient disk space when downloading the data to the VM.

I think we should add a column with the size for each instance we are indexing. We discussed this with @wlongabaugh and he also thought it makes sense. What do you think @bcli4d?

[clinical] Add column to point to the dataset/table from the dictionary

Currently, we rely on naming conventions to dereference collection-specific dictionary elements to the specific tables, which works as long as we have everything in the same dataset.

As discussed yesterday, it would be more robust and easier to understand if we directly had a reference to the dataset/table, and it will also help us address use cases where we point to the external tables (if we need to), such as if we decide to point to CPTAC/TCGA tables in ISB-CGC.

Rows column populated incorrectly

Note that the table populated using the output of those scripts does not have the Rows column initialized correctly. Rows needed to be quoted in tick-marks, since it is a protected word for BQ. This was not considered, and Rows in idc-dev-etl:idc.dicom_all is all nulls.

I made a Colab notebook automating all of the steps in this repository, so I don't think we need to fix it or regenerate the output. I just wanted to make note of this.

You can see the notebook here: https://colab.research.google.com/drive/1PDwmjKNcqoqR0boHqiAW8vZcH9kUqQSl

The resulting table is idc-dev-etl:tcga.dicom_metadata.

I wonder if we should deprecate the content of this repo and replace it with the notebook above, to avoid confusion?

[clinical] Capture provenance of the collection-specific columns

Per discussion yesterday, it will be very useful to capture the provenance of the items in the per-collection metadata dictionary:

  • what file they are coming from (+hash of that file)
  • if the above is a zip file - which file from the zip file they come from
  • what sheet, if applicable
  • if there is a hierarchy of column names - include that hierarchy

Integrate 'Legacy' Clinical data into new clinical dataset.

The idc_v* datasets have had clinical data for TCGA collections and the NLST collection. for some time. They have not been integrated into the new idc_v*_clinical. It seems that we would want to represent all clinical data the same way. One complication is that the portal has clinical data for only TCGA data. AFAIK this is pulled from the tcga_clinical_rel9 in the idc_v* dataset.

variable_label should not be blank

Currently, it appears to be blank for IDC-assigned columns. I suggest variable_label should indicate that those columns are not part of the source data. Maybe we can use a prefix like "[IDC provenance attribute]" or something like that.

image

[clinical] Fix out of order columns in the per-collection tables

It is indeed unfortunate that the columns of the tables are all shuffled.

I noticed you are creating the schema manually, but then are importing from JSON.

Did you consider just importing from JSON, and let BQ auto-detect the schema? (e.g., I do this here for ACRIN tables: https://github.com/fedorov/idc-clinical-cleanup/blob/main/bq_exporter.py#L13-L36).

If that auto-detection of the column type does not produce results needed, we might do some casting later. But as is, it is very challenging to compare original data and ingested data side by side.

[clinical] Add PatientID to the per-collection tables (whenever it is not already available)

We need to check this, but I am pretty sure at least for some collections there is no PatientID column that can be used to join with the dicom_all column.

I suggest we add that PatientID column whenever it is not available and whenever it is possible to derive that column. This will probably need to be handled on a per-collection basis. In the field(s) describing provenance this (related to #24) should be captured by labeling the attribute as "derived" or "introduced by IDC" to clearly differentiate original fields from those we introduce.

Add regression testing to confirm consistency of clinical table schemas with column_metadata

Follow up on #35

A (relatively!) easy regression test should be, for each table_name in column_metadata, take the list of variable_name, and confirm that the list of columns from the schema in the corresponding table is exactly that. We should have a regression check and run this test on every update of the clinical metadata tables. I will submit a follow up separate ticket on that.

Clinical data per-table metadata tracking

To facilitate tracking of changes for the clinical data tables, we should probably introduce a table that will have one row per clinical table and would include:

  • initial version of IDC data where that table was introduced
  • version of IDC data where this table was modified
  • collection_id it belongs to
  • path to the file that was used to extract it and the hash of that file

We should discuss what else should be there, but I wanted to document this issue right away without coming up with a comprehensive list.

Make column names capitalization consistent

As much as possible, it would be helpful if the names of the columns in the auxiliary table were capitalized consistently.

Currently, all DICOM attributes in dicom_metadata follow CamelCase convention, while most of the columns in the aux table are all lower caps with underscore separator.

Since for most users the primary table will be dicom_all, which is a mix of DICOM and non-DICOM items, it would be very convenient to be able to easily differentiate between those two groups.

The following are exceptions to the capitalization convention in the aux table are the following attributes:

  • LicenseURL
  • LicenseLongName
  • LicenseShortName
  • capitalization of source_DOI is changed in the view query: aux.source_doi as Source_DOI,

Collection-level tables also use CameCase, but since neither of those attributes is coming from DICOM, and they are not included in dicom_all, this is less consequential, although maybe we should consider making it lower case to be consistent with the aux table.

[clinical] Improve harmonization of ACRIN and non-ACRIN tables schema

I suggest it makes sense to do the following:

  • remove columns that are probably not of interest to the users (those would still be available in the clinical_notes.json): sources, column_number, num_values, rng
  • use values.option_code instead of introducing values.option_value
  • table_name should match the suffix of the table in BQ - for most of the tables, it would be "clinical"

Inconsistencies identified for hnscc_3dct_rt_clinical table

column_metadata lists a long list of variables for the hnscc_3dct_rt_clinical collection:

image

However, the referenced table is very short:

image

Developing this thought, an easy regression test should be, for each table_name in column_metadata, take the list of variable_name, and confirm that the list of columns from the schema in the corresponding table is exactly that. We should have a regression check and run this test on every update of the clinical metadata tables. I will submit a follow up separate ticket on that.

dicom_patient_id appears to be missing in several clinical tables

SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.spie_aapm_lung_ct_challenge_clinical

head_neck_pet_ct_clinical

ispy2_clinical

Full UNIION query that led to this discovery (no, I did not type it manually):

SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_6698_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_A0
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_A1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_A3
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_AA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_AB
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_AC
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_AF
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_AG
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_BA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_BQ
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_BR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_CA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_CJ
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_CK
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_E3
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_FS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_FT
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_HE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_I1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_IC
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_IE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_IM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_IS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_JS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_JT
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_LP
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_PF
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_PH
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_PM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_PO
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_PR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_RT
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_SB
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_TS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_VT
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_VY
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_A0
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_A1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_AE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_AI
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_AT
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_BS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_DE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_DS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_EX
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_F1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_MC
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_MH
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_MI
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_MR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_MS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_PQ
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_PR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_SA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_SC
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_T4
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_TA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_V1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_V2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_V3
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_V4
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_VA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_BL
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_BX
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_C2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_DS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_EX
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_F1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_I1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_IM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_M4
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_P3
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PL
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PV
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_S1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_S2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_T0
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_TA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_TD
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_UP
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_A0
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_A1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_AE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_DS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_F1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_F2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_I1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_IM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_O1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_PR
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_Q2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_QA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_QZ
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_SS
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_T1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_TA
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_TF
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.b_mode_and_ceus_liver_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.breast_diagnosis_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.breast_mri_nact_pilot_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.c4kc_kits_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.covid_19_ar_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.covid_19_ar_imaging
#UNION ALL
#SELECT
#  dicom_patient_id
#FROM
#  idc-dev.idc_v11_clinical.head_neck_pet_ct_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.head_neck_radiomics_hn1_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_radiomics
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_3dct_rt_outcome
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_3dct_rt_weight
#UNION ALL
#SELECT
#  dicom_patient_id
#FROM
#  idc-dev.idc_v11_clinical.ispy2_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.lidc_idri_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.lung_pet_ct_dx_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.midrc_ricord_1a_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.midrc_ricord_1b_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.nsclc_radiomics_genomics_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.nsclc_radiomics_interobserver1_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.opc_radiomics_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.qin_headneck_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.rembrandt_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.soft_tissue_sarcoma_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_brca_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_ccrcc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_coad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_gbm_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_hnscc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_lscc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_luad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_pda_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_ucec_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_acc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_blca_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_brca_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_cesc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_chol_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_coad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_dlbc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_esca_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_hnsc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_kich_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_kirc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_kirp_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_lgg_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_lihc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_luad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_lusc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_meso_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_ov_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_paad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_pcpg_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_prad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_read_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_sarc_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_skcm_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_stad_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_tgct_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_thca_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_thym_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_ucec_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_ucs_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_uvm_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cmmd_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.covid_19_ny_sbu_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hcc_tace_seg_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_3dct_rt_demographics
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.hnscc_3dct_rt_dose
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.ispy1_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.ispy1_outcome
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.midrc_ricord_1c_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.nsclc_radiomics_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.nsclc_radiogenomics_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.prostate_diagnosis_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.qin_breast_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.soft_tissue_sarcoma_outcome
#UNION ALL
#SELECT
#  dicom_patient_id
#FROM
#  idc-dev.idc_v11_clinical.spie_aapm_lung_ct_challenge_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.cptac_ov_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.tcga_gbm_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.lung_fused_ct_pathology_clinical
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_TX
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_LE
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_V1
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_V2
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_flt_breast_V3
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_CO
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_fmiso_brain_FM
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_SF
UNION ALL
SELECT
  dicom_patient_id
FROM
  idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_ST

Missing mergeschemas.py

mergeschemas.py then merges all these schemas into one schema that defines the dicom_all table.

@G-White-ISB I just noticed this file is missing from the repo. Can you add it for the sake of completeness?

acrin_6698 sbrgrade NAs are replaced with nulls

Occurrences of NA in the input data should be parsed into NA string. Currently, at least for the acrin_6698 collection, they become nulls for the sbrgrade, which is wrong and confusing, given that the dictionary parsed lists "NA" as a named value for that column.

image

image

[clinical] CPTAC clinical metadata

I understand there is CPTAC clinical metadata in ISB-CGC that matches our images. @G-White-ISB can you please investigate how it is organized, how it is versioned, how it can be linked with images, so we can discuss how to make it available to the users?

Define BQ layout of clinical data tables

Per discussion, we can define a new dataset, which would be versioned, and would contain:

  • per-collection clinical data tables (e.g., bigquery-public-data.idc_clinical_v9.*); we agreed that it makes sense to keep prefix _clinical for the cases where there is a single clinical data table per collection, since there will be cases where there will be multiple such tables, and we would need to differentiate based on the suffix)
  • "dictionary" defining data elements for all clinical data tables (1 table) (something that will be based on what currently is in idc-dev-etl.clinical4.clinical_meta)
  • an equivalent of "auxiliary" table that will define per-table metadata attributes (such as source of the table, hash of the source file, IDC data version where it was introduced and modified) (we do not yet have an example or defined structure of this, see #29 )

Use fully resolved versioned table names in all places

  • in column_metadata, table column should point to the future location of the table once it is in Google PDP: bigquery-public-data.idc_v11_clinical.<table name>.
  • in CPTAC clinical tables the source table should be the specific version of the table, not current in ISB-CGC
  • source for TCGA should be bigquery-public-data.idc_v11.tcga_clinical_rel9

Values for `dicom_patient_id` are invalid for the `acrin_6698` collection

It appears that the clinical table is referring to the patients by suffixes of DICOM PatientID. To get the actual patient ID we need to prepend that suffix with ACRIN-6698-.

Going forward, when populating dicom_patient_id column, there should be a check that all IDs included in that column for the specific clinical data table are all present among the values of PatientID selected from dicom_all for that specific table.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.