Comments (18)
Can someone confirm the correctness of this file? There are a few sheets that don't have any data.
https://github.com/ImagingDataCommons/IDC-ProjectManagement/files/12196403/CCDI_Submission_Template_v1.0.1_DM_v2.xlsx
from etl.
@G-White-ISB Mentioning it in the documentation works for me.
from etl.
@G-White-ISB There were different versions of the spreadsheet. Attached is the one I believe to be the latest version, after having reviewed issue #1394.
CCDI_Submission_Template_v1.0.1_DM_v2 2023-02-17.xlsx
For your information, David C. and I met today with the Childhood Cancer Data Initiative (CCDI) folks. They may send us another version of the spreadsheet but the timing is undetermined. Therefore, I propose that you go forward with the attached spreadsheet and we update the clinical data if and when we get a richer set of data from the CCDI folks.
from etl.
FYI. We found out, and the Khan lab (who supplied the imaging data) confirmed, that the sample_id on the sample worksheet is the COG USI (Children's Oncology Group Universal Identifier). Is there a way we can label it as such, or put a remark somewhere? This will help the CCDI folks link to the images.
from etl.
@ulrikew I just discussed this with @dclunie, and he suggests that somewhere in the documentation describing the collection (i.e., in our Zenodo record!) we mention that the sample_id attribute in the clinical spreadsheet data is the COG USI. Does this sound OK to you?
We do not want to change the spreadsheet, since it was used by David for conversion. And I do not want those transformations/relabeling to be done in George's scripts.
from etl.
Ok, looking at the latest Excel file provided by @ulrikew, there are 13 different sheets. The clinical_measure_file, methylation_array_file, publication, sequencing_file, study, study_admin, study_arm, and study_funding sheets all appear to contain headers but no data.
I assume the imaging file sheet is not useful. We are not providing the tiffs anyways.
@fedorov should I note that sample_id is the COG USI in the column description in the column_metadata table? We don't have much documentation outside the BQ tables themselves.
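The "headers but no data" check described above can be sketched in a few lines. This is only an illustration, assuming each sheet has already been loaded into a list of rows (for example with a library such as openpyxl); the function name, sheet contents, and column names are made up and are not taken from the actual spreadsheet.

```python
def sheets_without_data(workbook: dict[str, list[list[object]]]) -> list[str]:
    """Return the names of sheets whose rows after the header are all blank."""

    def is_blank(row: list[object]) -> bool:
        return all(cell is None or str(cell).strip() == "" for cell in row)

    empty = []
    for name, rows in workbook.items():
        data_rows = rows[1:]  # assumption: the first row is the header
        if all(is_blank(row) for row in data_rows):
            empty.append(name)
    return sorted(empty)


# Made-up workbook content, for illustration only.
workbook = {
    "participant": [["participant_id", "gender"], ["P1", "Female"]],
    "study_arm": [["study_arm_id", "description"]],
    "publication": [["pubmed_id"], [None], [""]],
}
empty_sheets = sheets_without_data(workbook)
print(empty_sheets)
```

A sheet counts as empty here if it has only a header row, or if every later row consists of blank cells.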
from etl.
A column on the participant sheet is clearly mislabeled. The header is ethnicity, but the values are 'Metastatic' or 'Non-metastatic'.
from etl.
The spreadsheet was supplied by David Milewski [email protected], so you might want to ask him about this (and cc myself and @ulrikew).
I wrote to him on 2023/02/27:
"The ethnicity column in the Participant tag now contains the Metastasis_at_diagnosis value 😄
Also, I came across some cells in which there are new lines embedded in quoted text, which screws up simple line-oriented scripts that operate on extracted CSV files.
If you get a chance, could you remove the new lines in the Samples tab for RMS2452, RMS2472 and others (with a second line that start with a parenthesis) in the sample_anatomic_site column cells please?"
but did not get a reply or updated form, so just worked around it.
He can also confirm for you that as he wrote by email 2023/08/11:
"The COG USI is the same as the "sample.sample_id" in the table (example: "PAIZZZ"). This should enable us to link the sample to our Oncogenomics db."
If he is going to update the spreadsheet, he could add that information to the Dictionary tab entry for "sample_id" - not sure if you were planning on including the information in the dictionary in the BigQuery tables anyway
You might also want to ask him what the value of "study.phs_accession" should be, since I was told on a recent call that the "phs_accession" is what defines the scope of uniqueness of the "sample_id" (which is not globally unique).
If you don't hear back promptly, I would just go ahead and leave out the ethnicity column since it is clearly wrong (as well as being useless and peculiarly US-ethnocentric).
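For the embedded newlines in quoted cells mentioned above, one possible workaround is to parse the extracted CSV with a real CSV parser and flatten the cells before handing the file to line-oriented scripts. The sketch below is an assumption about how this could be done, not the workaround that was actually used; the sample rows and the function name are made up.

```python
import csv
import io

# Made-up extract imitating a Samples tab cell with an embedded newline.
raw = (
    "sample_id,sample_anatomic_site\n"
    'S1,"Thigh\n(left)"\n'
    "S2,Orbit\n"
)


def flatten_newlines(text: str) -> str:
    """Re-emit CSV text with whitespace runs inside cells (including
    embedded newlines) collapsed to single spaces."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([" ".join(cell.split()) for cell in row])
    return out.getvalue()


cleaned = flatten_newlines(raw)
print(cleaned)
```

After this step every record occupies exactly one physical line, so simple line-oriented tools see one row per line.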
from etl.
RMS clinical data was added for the v16 release
from etl.
@G-White-ISB this query returns nothing - am I missing something?
SELECT * FROM `idc-dev-etl.idc_v16_clinical.column_metadata` where collection_id like "rms*"
from etl.
The syntax is wrong: LIKE uses '%' as its wildcard, not '*'. Change the '*' to a '%':
SELECT * FROM `idc-dev-etl.idc_v16_clinical.column_metadata`
where collection_id like "rms%"
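The difference between the two patterns can be reproduced locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made only so the example is self-contained); the table contents are made up, and only the behaviour of '%' versus '*' in a LIKE pattern is being illustrated.

```python
import sqlite3

# Hypothetical stand-in for the column_metadata table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE column_metadata (collection_id TEXT, column_name TEXT)")
conn.executemany(
    "INSERT INTO column_metadata VALUES (?, ?)",
    [
        ("rms_mutation_prediction", "sample_id"),
        ("rms_mutation_prediction", "participant_id"),
        ("nlst", "pid"),
    ],
)


def count_matches(pattern: str) -> int:
    """Count rows whose collection_id matches the given LIKE pattern."""
    row = conn.execute(
        "SELECT COUNT(*) FROM column_metadata WHERE collection_id LIKE ?",
        (pattern,),
    ).fetchone()
    return row[0]


star_count = count_matches("rms*")     # '*' is matched as a literal asterisk
percent_count = count_matches("rms%")  # '%' matches any run of characters
print(star_count, percent_count)
```

With the asterisk the pattern only matches a collection_id that is literally "rms*", which is why the original query returned nothing.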
from etl.
Ah, of course!.. I was doing this while talking to Ulli, and didn't pay enough attention! Thank you!
from etl.
Issues identified
from etl.
@G-White-ISB it turns out the most valuable clinical data accompanying RMS-Mutation-Prediction is in Supplement 1 of this paper: https://aacrjournals.org/clincancerres/article/29/2/364/713962/Predicting-Molecular-Subtype-and-Survival-of. Let's plan to ingest it in a future IDC release.
from etl.
@ulrikew I am preparing an answer for Brian Furner, who asked how to get the COG USI identifiers, and I see that none of the columns in the clinical data accompanying the RMS collection has a description of what the column means.
I think going forward it would be highly desirable to require submitters to describe the semantics of the columns, which would then go into column_label or a similar column. I understand we don't have those for legacy collections, but I think we should strive to improve for the collections we curate.
from etl.
https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1394#issuecomment-1674909405
Does this help?
And yes, we need to require data dictionaries and an explanation of the used values.
from etl.
Ulli, yes, I dug that out because I knew we had talked about it. But as far as I know, there is absolutely no chance that anyone else who finds that clinical data table will be able to figure this out.
from etl.
Absolutely true!
from etl.