imagingdatacommons / etl Goto Github PK
View Code? Open in Web Editor NEW(CORE REPO)
License: Apache License 2.0
(CORE REPO)
License: Apache License 2.0
It is currently impossible (from what I know) to figure out what data is accompanying HTAN images stored in IDC. We have image-related data for CPTAC, TCGA and other collections. I do not see reasons not to follow the same approach for HTAN. We should investigate this.
Clinical data from this collection is missing, but per @G-White-ISB was supposed to be ingested: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70226903
By "derived" I refer to the situation where column_metadata
values
column was populated by examining the actual values encountered for the specific column names.
I suggest it makes sense to do the following:
As discussed in #26 the patient identifier in the clinical collection may not be identical to the patientID in the DICOM file. There is sometimes a difference in format (ie handling of letter case etc) between the two. Each row in the clinical data must be mapped to the corresponding DICOM patientID attribute
It is indeed unfortunate that the columns of the tables are all shuffled.
I noticed you are creating the schema manually, but then are importing from JSON.
Did you consider just importing from JSON, and let BQ auto-detect the schema? (e.g., I do this here for ACRIN tables: https://github.com/fedorov/idc-clinical-cleanup/blob/main/bq_exporter.py#L13-L36).
If that auto-detection of the column type does not produce results needed, we might do some casting later. But as is, it is very challenging to compare original data and ingested data side by side.
Note that the table populated using the output of those scripts does not have the Rows column initialized correctly. Rows needed to be quoted in tick-marks, since it is a protected word for BQ. This was not considered, and Rows in idc-dev-etl:idc.dicom_all
is all nulls.
I made a Colab notebook automating all of the steps in this repository, so I don't think we need to fix it or regenerate the output. I just wanted to make note of this.
You can see the notebook here: https://colab.research.google.com/drive/1PDwmjKNcqoqR0boHqiAW8vZcH9kUqQSl
The resulting table is idc-dev-etl:tcga.dicom_metadata
.
I wonder if we should deprecate the content of this repo and replace it with the notebook above, to avoid confusion?
It will probably be quite valuable to the users to know how much space the cohort will take when downloaded. In particular, this will be important for the user in order to allocate sufficient disk space when downloading the data to the VM.
I think we should add a column with the size for each instance we are indexing. We discussed this with @wlongabaugh and he also thought it makes sense. What do you think @bcli4d?
Follow up on #35
A (relatively!) easy regression test should be, for each table_name
in column_metadata
, take the list of variable_name
, and confirm that the list of columns from the schema in the corresponding table is exactly that. We should have a regression check and run this test on every update of the clinical metadata tables. I will submit a follow up separate ticket on that.
It appears that the clinical table is referring to the patients by suffixes of DICOM PatientID
. To get the actual patient ID we need to prepend that suffix with ACRIN-6698-
.
Going forward, when populating dicom_patient_id
column, there should be a check that all IDs included in that column for the specific clinical data table are all present among the values of PatientID
selected from dicom_all
for that specific table.
I understand there is CPTAC clinical metadata in ISB-CGC that matches our images. @G-White-ISB can you please investigate how it is organized, how it is versioned, how it can be linked with images, so we can discuss how to make it available to the users?
Currently, it is captured in the nlst_clinical_*
tables, with the dictionaries in RTF format linked here: https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#nlst. I don't know if they are already handled, but I did not find the dictionaries/tables in idc-dev-etl:clinical
.
As much as possible, it would be helpful if the names of the columns in the auxiliary table were capitalized consistently.
Currently, all DICOM attributes in dicom_metadata
follow CamelCase convention, while most of the columns in the aux table are all lower caps with underscore separator.
Since for most users the primary table will be dicom_all
, which is a mix of DICOM and non-DICOM items, it would be very convenient to be able to easily differentiate between those two groups.
The following are exceptions to the capitalization convention in the aux table are the following attributes:
source_DOI
is changed in the view query: aux.source_doi as Source_DOI,
Collection-level tables also use CameCase, but since neither of those attributes is coming from DICOM, and they are not included in dicom_all
, this is less consequential, although maybe we should consider making it lower case to be consistent with the aux table.
column_metadata
lists a long list of variables for the hnscc_3dct_rt_clinical
collection:
However, the referenced table is very short:
Developing this thought, an easy regression test should be, for each table_name
in column_metadata
, take the list of variable_name
, and confirm that the list of columns from the schema in the corresponding table is exactly that. We should have a regression check and run this test on every update of the clinical metadata tables. I will submit a follow up separate ticket on that.
column_metadata
, table
column should point to the future location of the table once it is in Google PDP: bigquery-public-data.idc_v11_clinical.<table name>
.current
in ISB-CGCbigquery-public-data.idc_v11.tcga_clinical_rel9
Per discussion yesterday, it will be very useful to capture the provenance of the items in the per-collection metadata dictionary:
We need to check this, but I am pretty sure at least for some collections there is no PatientID
column that can be used to join with the dicom_all
column.
I suggest we add that PatientID
column whenever it is not available and whenever it is possible to derive that column. This will probably need to be handled on a per-collection basis. In the field(s) describing provenance this (related to #24) should be captured by labeling the attribute as "derived" or "introduced by IDC" to clearly differentiate original fields from those we introduce.
To facilitate tracking of changes for the clinical data tables, we should probably introduce a table that will have one row per clinical table and would include:
We should discuss what else should be there, but I wanted to document this issue right away without coming up with a comprehensive list.
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.spie_aapm_lung_ct_challenge_clinical
head_neck_pet_ct_clinical
ispy2_clinical
Full UNIION
query that led to this discovery (no, I did not type it manually):
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_6698_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_A0
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_A1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_A3
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_AA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_AB
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_AC
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_AF
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_AG
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_BA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_BQ
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_BR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_CA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_CJ
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_CK
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_E3
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_FS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_FT
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_HE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_I1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_IC
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_IE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_IM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_IS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_JS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_JT
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_LP
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_PF
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_PH
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_PM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_PO
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_PR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_RT
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_SB
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_TS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_VT
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_VY
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_A0
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_A1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_AE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_AI
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_AT
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_BS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_DE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_DS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_EX
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_F1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_MC
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_MH
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_MI
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_MR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_MS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_PQ
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_PR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_SA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_SC
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_T4
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_TA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_V1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_V2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_V3
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_V4
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_VA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_BL
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_BX
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_C2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_DS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_EX
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_F1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_I1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_IM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_M4
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_P3
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PL
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_PV
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_S1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_S2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_T0
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_TA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_TD
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_hnscc_fdg_pet_ct_UP
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_A0
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_A1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_AE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_DS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_F1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_F2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_I1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_IM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_O1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_PR
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_Q2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_QA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_QZ
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_SS
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_T1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_TA
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_TF
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.b_mode_and_ceus_liver_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.breast_diagnosis_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.breast_mri_nact_pilot_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.c4kc_kits_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.covid_19_ar_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.covid_19_ar_imaging
#UNION ALL
#SELECT
# dicom_patient_id
#FROM
# idc-dev.idc_v11_clinical.head_neck_pet_ct_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.head_neck_radiomics_hn1_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_radiomics
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_3dct_rt_outcome
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_3dct_rt_weight
#UNION ALL
#SELECT
# dicom_patient_id
#FROM
# idc-dev.idc_v11_clinical.ispy2_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.lidc_idri_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.lung_pet_ct_dx_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.midrc_ricord_1a_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.midrc_ricord_1b_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.nsclc_radiomics_genomics_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.nsclc_radiomics_interobserver1_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.opc_radiomics_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.qin_headneck_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.rembrandt_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.soft_tissue_sarcoma_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_brca_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_ccrcc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_coad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_gbm_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_hnscc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_lscc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_luad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_pda_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_ucec_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_acc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_blca_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_brca_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_cesc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_chol_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_coad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_dlbc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_esca_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_hnsc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_kich_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_kirc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_kirp_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_lgg_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_lihc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_luad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_lusc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_meso_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_ov_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_paad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_pcpg_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_prad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_read_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_sarc_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_skcm_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_stad_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_tgct_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_thca_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_thym_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_ucec_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_ucs_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_uvm_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cmmd_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.covid_19_ny_sbu_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hcc_tace_seg_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_3dct_rt_demographics
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.hnscc_3dct_rt_dose
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.ispy1_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.ispy1_outcome
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.midrc_ricord_1c_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.nsclc_radiomics_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.nsclc_radiogenomics_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.prostate_diagnosis_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.qin_breast_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.soft_tissue_sarcoma_outcome
#UNION ALL
#SELECT
# dicom_patient_id
#FROM
# idc-dev.idc_v11_clinical.spie_aapm_lung_ct_challenge_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.cptac_ov_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.tcga_gbm_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.lung_fused_ct_pathology_clinical
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_TX
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_LE
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_V1
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_V2
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_flt_breast_V3
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_CO
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_fmiso_brain_FM
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_SF
UNION ALL
SELECT
dicom_patient_id
FROM
idc-dev.idc_v11_clinical.acrin_nsclc_fdg_pet_ST
I am coming up with those ideas as I work on a notebook demonstrating the use of this content. No action right now, but as I think about it, referring to those as "variables" may be confusing.
The idc_v* datasets have had clinical data for TCGA collections and the NLST collection. for some time. They have not been integrated into the new idc_v*_clinical. It seems that we would want to represent all clinical data the same way. One complication is that the portal has clinical data for only TCGA data. AFAIK this is pulled from the tcga_clinical_rel9 in the idc_v* dataset.
The field collection_id
in the column-level metadata includes array for CPTAC collections. I think I can guess why this is the case (there is a single CPTAC table coming from ISB-CGC), but I am not sure this is the right way to encode this information. This will be quite confusing for the user. Schema of this column should be consistent with its schema in other places this column is encountered.
To deal with this, we might include a column Program
along with collection_id
.
Alternatively, we could unnest the multivalued rows of collection_id
column and replicate column-level metadata for each collection in the program.
Current organization of tables has 2 components:
However, in the general case, 1) we will have more than 1 clinical table per collection (with different schemas); 2) we will at least sometime have the need to communicate description of the specific table (table level metadata).
Examples are the NLST collection and ACRIN clinical tables.
I suggest we introduce another level for organization that has the following columns (we could call it clinical_data_inventory
or something like that?):
This follows the approach implemented for ACRIN in https://github.com/fedorov/idc-clinical-cleanup, with the result in https://console.cloud.google.com/bigquery?p=idc-tcia&d=af_clinical_sandbox&page=dataset. There, those table-level metadata attributes are organized in tables per-collection (<collection_id>_dict
), but we might as well put it all into a single table.
Currently, we rely on naming conventions to dereference collection-specific dictionary elements to the specific tables, which works as long as we have everything in the same dataset.
As discussed yesterday, it would be more robust and easier to understand if we directly had a reference to the dataset/table, and it will also help us address use cases where we point to the external tables (if we need to), such as if we decide to point to CPTAC/TCGA tables in ISB-CGC.
Per discussion, we can define a new dataset, which would be versioned, and would contain:
bigquery-public-data.idc_clinical_v9.*
); we agreed that it makes sense to keep prefix _clinical
for the cases where there is a single clinical data table per collection, since there will be cases where there will be multiple such tables, and we would need to differentiate based on the suffix)idc-dev-etl.clinical4.clinical_meta
)mergeschemas.py then merges all these schemas into one schema that defines the dicom_all table.
@G-White-ISB I just noticed this file is missing from the repo. Can you add it for the sake of completeness?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.