CDA Python

This library sits on top of the machine generated CDA Python Client and offers some syntactic sugar to make it more pleasant to query the CDA.

Launch in Binder

To try out the example notebook in MyBinder.org without having to install anything, just click on the logo below. This will launch a Jupyter Notebook instance with our example notebook ready to run.

Binder

For Testers use this Binder

Click on the logo below. This will launch a Jupyter Notebook instance with our example notebook ready to run.

MyBinder.org

Install the CDA Python library locally:

  1. Download and install Docker: click this link or copy the URL https://www.docker.com/products/docker-desktop into your browser.

  2. Open Terminal or PowerShell, navigate to the cda-python folder, and run the following docker command:

    • docker compose up --build
  3. Open a browser to http://localhost:8888 and you are up and running.

    • To change the port, edit NOTEBOOK_PORT in the .env file.
  4. To stop the container, return to the terminal where the cda-python project is running and press Control-C.

To delete the container use this command in the cdapython project directory.

  • docker compose down

Pip install

Alternatively, CDA Python can be installed using pip. This requires Python >= 3.6 on your system. To check your version, run python -V at the command line. To update your version, download a newer release from https://www.python.org/downloads/; additional Python installation help can be found here. Once you have the proper Python version, you can install CDA Python using:

pip install git+https://github.com/CancerDataAggregator/cda-python.git

NOTE: We recommend the docker method because pip installation can be a bit more cumbersome, and will not be as closely monitored as the docker installation.
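
The version requirement for the pip route can also be checked from inside Python instead of with python -V. The following is a minimal sketch using only the standard library; the helper name meets_minimum is hypothetical and is not part of CDA Python.

```python
import sys

MINIMUM = (3, 6)  # minimum Python version stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for the pip install"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Comparing tuples rather than strings avoids the trap where "3.10" sorts before "3.6" alphabetically.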

Basics

We will now show you the basic structure of CDA Python through the most common commands:

  • columns(): shows all available columns in the table,
  • unique_terms(): shows all unique terms for a given column,
  • Q: builds a query that can be run on the public CDA server,
  • Q.sql: allows you to enter SQL-style queries, and
  • query: allows you to write long-form Q statements without chaining.

(Also see example IPython notebook)

from cdapython import Q, columns, unique_terms


columns() # List column names eg:
# ['days_to_birth',
#  'race',
#  'sex',
#  'ethnicity',
#  'id',
#  'ResearchSubject',
#  'ResearchSubject.Diagnosis',
#  'ResearchSubject.Diagnosis.morphology',
#  'ResearchSubject.Diagnosis.tumor_stage',
#  'ResearchSubject.Diagnosis.tumor_grade',
#  'ResearchSubject.Diagnosis.Treatment',
#  'ResearchSubject.Diagnosis.Treatment.type',
#  'ResearchSubject.Diagnosis.Treatment.outcome',


unique_terms("ResearchSubject.primary_disease_type") # List unique terms for this column eg:
# [None,
#  'Acinar Cell Neoplasms',
#  'Adenomas and Adenocarcinomas',
#  'Adnexal and Skin Appendage Neoplasms',
#  'Basal Cell Neoplasms',
#  'Blood Vessel Tumors',
#  'Breast Invasive Carcinoma',
#  'Chromophobe Renal Cell Carcinoma',
#  'Chronic Myeloproliferative Disorders',
#  'Clear Cell Renal Cell Carcinoma',
#  'Colon Adenocarcinoma',
# ...

q1 = Q('ResearchSubject.primary_disease_type = "Adenomas and Adenocarcinomas"')
r = q1.run()                                 # Executes this query on the public CDA server

# r = q1.run(host="http://localhost:8080")   # Executes on local instance of CDA server
# r = q1.run(limit=2)                        # Limit to two results per page


r.sql   # Return SQL string used to generate the query e.g.
# "SELECT * FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_disease_type = 'Adenomas and Adenocarcinomas')"

print(r) # Prints some brief information about the result page eg:
#
# Query: SELECT * FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_disease_type = 'Adenomas and Adenocarcinomas')
# Offset: 0
# Limit: 2
# Count: 2
# More pages: Yes


r[0] # Returns nth result of this page as a Python dict e.g.
#
# {'days_to_birth': None,
#  'race': None,
#  'sex': None,
#  'ethnicity': None,
#  'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#  'ResearchSubject': [{'Diagnosis': [],
#    'Specimen': [],
#    'associated_project': 'CGCI-HTMCP-CC',
#    'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#    'primary_disease_type': 'Adenomas and Adenocarcinomas',
#    'identifier': [{'system': 'GDC',
#      'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#    'primary_disease_site': 'Cervix uteri'}],
#  'Diagnosis': [],
#  'Specimen': [],
#  'associated_project': 'CGCI-HTMCP-CC',
#  'primary_disease_type': 'Adenomas and Adenocarcinomas',
#  'identifier': [{'system': 'GDC',
#    'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#  'primary_disease_site': 'Cervix uteri'}


r.pretty_print(0) # Prints the nth result nicely
#
# { 'Diagnosis': [],
#   'ResearchSubject': [ { 'Diagnosis': [],
#                          'Specimen': [],
#                          'associated_project': 'CGCI-HTMCP-CC',
#                          'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#                          'identifier': [ { 'system': 'GDC',
#                                            'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#                          'primary_disease_site': 'Cervix uteri',
#                          'primary_disease_type': 'Adenomas and '
#                                                  'Adenocarcinomas'}],
#   'Specimen': [],
#   'associated_project': 'CGCI-HTMCP-CC',
#   'days_to_birth': None,
#   'ethnicity': None,
#   'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#   'identifier': [ { 'system': 'GDC',
#                     'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#   'primary_disease_site': 'Cervix uteri',
#   'primary_disease_type': 'Adenomas and Adenocarcinomas',
#   'race': None,
#   'sex': None}


r2 = r.next_page()  # Fetches the next page of results

print(r2)

# Query: SELECT * FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.primary_disease_type = 'Adenomas and Adenocarcinomas')
# Offset: 2
# Limit: 2
# Count: 2
# More pages: Yes

r1 = Q.sql("""
SELECT
*
FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject
WHERE (_ResearchSubject.primary_disease_type = 'Adenomas and Adenocarcinomas')
""")

r1.pretty_print(0)
#
#{ 'Diagnosis': [],
#  'ResearchSubject': [ { 'Diagnosis': [],
#                         'Specimen': [],
#                         'associated_project': 'CGCI-HTMCP-CC',
#                         'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#                         'identifier': [ { 'system': 'GDC',
#                                           'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#                         'primary_disease_site': 'Cervix uteri',
#                         'primary_disease_type': 'Adenomas and '
#                                                 'Adenocarcinomas'}],
#  'Specimen': [],
#  'associated_project': 'CGCI-HTMCP-CC',
#  'days_to_birth': None,
#  'ethnicity': None,
#  'id': 'HTMCP-03-06-02177',
#  'id_1': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
#  'identifier': [ { 'system': 'GDC',
#                    'value': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3'}],
#  'primary_disease_site': 'Cervix uteri',
#  'primary_disease_type': 'Adenomas and Adenocarcinomas',
#  'race': None,
#  'sex': None}


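from cdapython import query  # query is listed among the basic commands above

# Long-form Q statement written as a single string instead of chained Q calls
q1 = query('ResearchSubject.identifier.system = "GDC" FROM ResearchSubject.primary_disease_type = "Ovarian Serous Cystadenocarcinoma" AND ResearchSubject.identifier.system = "PDC"')
result = q1.run()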

Comparison operators

The following comparison operators can be used with the Q command:

| operator | Description | Q.sql required? |
|----------|-------------|-----------------|
| = | condition equals | no |
| != | condition is not equal | no |
| < | condition is less than | no |
| > | condition is greater than | no |
| <= | condition is less than or equal to | no |
| >= | condition is greater than or equal to | no |
| like | similar to = but allows wildcards ('%', '_', etc.) | yes |
| in | compares to a set | yes |

Additionally, more complex SQL can be used with the Q.sql command.

A simple query

Select data from TCGA-OV project, with donors over age 50

Quick form

from cdapython import Q

q1 = Q('ResearchSubject.Diagnosis.age_at_diagnosis > 50*365')
q2 = Q('ResearchSubject.associated_project = "TCGA-OV"')

q = q1.And(q2)
r = q.run()

print(r)

# Query: SELECT * FROM gdc-bq-sample.cda_mvp.v1, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE ((_Diagnosis.age_at_diagnosis > 50*365) AND (_ResearchSubject.associated_project = 'TCGA-OV'))
# Offset: 0
# Limit: 1000
# Count: 461
# More pages: No

r.pretty_print(2)
# { 'Diagnosis': [ { 'Treatment': [ { 'outcome': None,
#                                     'type': 'Radiation Therapy, NOS'},
#                                   { 'outcome': None,
#                                     'type': 'Pharmaceutical Therapy, NOS'}],
#                    'age_at_diagnosis': 28779,
#                    'id': 'dc8af98b-03cb-5817-84fa-d86a7f2df8c6',
#                    'morphology': '8441/3',
#                    'primary_diagnosis': 'Serous cystadenocarcinoma, NOS',
#                    'tumor_grade': 'not reported',
#                    'tumor_stage': 'not reported'}],
#   'ResearchSubject': [ { 'Diagnosis': [ { 'Treatment': [ { 'outcome': None,
#                                                            'type': 'Radiation '
#                                                                    'Therapy, '
#                                                                    'NOS'},
#                                                          { 'outcome': None,
#                                                            'type': 'Pharmaceutical '
#                                                                    'Therapy, '
#                                                                    'NOS'}],
#                                           'age_at_diagnosis': 28779,
#                                           'id': 'dc8af98b-03cb-5817-84fa-d86a7f2df8c6',
#                                           'morphology': '8441/3',
#                                           'primary_diagnosis': 'Serous '
#                                                                'cystadenocarcinoma, '
#                                                                'NOS',
#                                           'tumor_grade': 'not reported',
#                                           'tumor_stage': 'not reported'}],
# ...
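
The query above compares age_at_diagnosis against 50*365, and the sample result shows a value of 28779, which suggests the field is stored in days. The sketch below shows the conversion in both directions. The helper names are hypothetical and not part of CDA Python, and a 365-day year is assumed (as in the query), so leap days are ignored.

```python
DAYS_PER_YEAR = 365  # same approximation as the 50*365 used in the query


def years_to_days(years):
    """Convert an age in years to the day count used in the query."""
    return years * DAYS_PER_YEAR


def days_to_years(days):
    """Convert a day count such as age_at_diagnosis back to years."""
    return days / DAYS_PER_YEAR


print(years_to_days(50))               # 18250
print(round(days_to_years(28779), 1))  # 78.8
```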

Any given part of a query is expressed as a string of three parts separated by spaces:

Q('ResearchSubject.associated_project = "TCGA-OV"')

The first part is interpreted as a column name, the second as a comparator and the third part as a value. If the value is a string, it needs to be put in quotes.
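
To make the three-part structure concrete, here is a small standard-library sketch that splits such a string into column, comparator, and value, and checks whether the value is quoted. It only illustrates the convention described above; it is not the parser that CDA Python uses, and the function names are hypothetical.

```python
def split_condition(condition):
    """Split 'column comparator value' into its three parts."""
    # maxsplit=2 keeps values that contain spaces in one piece
    parts = condition.split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'column comparator value', got {condition!r}")
    column, comparator, value = parts
    return column, comparator, value


def is_quoted(value):
    """Return True if the value is wrapped in matching quotes (a string)."""
    return len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'")


print(split_condition('ResearchSubject.associated_project = "TCGA-OV"'))
# ('ResearchSubject.associated_project', '=', '"TCGA-OV"')
```

A numeric right-hand side such as 50*365 stays unquoted, so is_quoted returns False for it.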

Detailed form

For cases where there may be ambiguity in the quoting, or the right side of the comparison is another column, the detailed form should be used. Here the three parts of a query are explicitly split apart.

from cdapython import Q, Col, Quoted

q1 = Q(Col('ResearchSubject.Diagnosis.age_at_diagnosis'), '>=', 50 * 365)
q2 = Q(Col('ResearchSubject.associated_project'), '=', Quoted('TCGA-OV'))

Pointing to a custom CDA instance

.run() will execute the query on the public CDA API (https://cda.cda-dev.broadinstitute.org/api/cda/v1/).

.run("http://localhost:8080") will execute the query on a CDA server running at http://localhost:8080.

Quick Explanation on UNNEST usage in BigQuery

Using Q in the CDA client echoes the generated SQL statement, which may contain multiple UNNEST clauses when a column name uses dot (.) notation; this may need a quick explanation. UNNEST is similar to unwind: embedded data structures must be flattened to appear in a table or Excel file. Note: calling the SQL endpoint directly, as in the statements below, is not the preferred way to query a nested attribute in BigQuery, because the Q language DSL abstracts away the unnesting that a Record requires. In BigQuery, nested structures must be expressed with UNNEST syntax, so that A.B.C.D is unwound to SELECT (_C.D) in the following fashion:

SELECT (_C.D)
from TABLE, UNNEST(A) AS _A, UNNEST(_A.B) as _B, UNNEST(_B.C) as _C
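
To build intuition for what the UNNEST chain does, the same flattening can be mimicked in plain Python with nested loops over lists of dicts. This is only an analogy using made-up sample data; it is not how BigQuery or CDA Python execute the query.

```python
# Made-up sample rows shaped like the A.B.C.D structure described above.
table = [
    {"A": [{"B": [{"C": [{"D": 1}, {"D": 2}]}]},
           {"B": [{"C": [{"D": 3}]}]}]},
    {"A": []},  # a row whose array is empty contributes no output rows
]


def select_c_d(rows):
    """Python analogue of:
    SELECT (_C.D) FROM TABLE, UNNEST(A) AS _A, UNNEST(_A.B) AS _B, UNNEST(_B.C) AS _C
    """
    return [
        _c["D"]
        for row in rows        # FROM TABLE
        for _a in row["A"]     # UNNEST(A) AS _A
        for _b in _a["B"]      # UNNEST(_A.B) AS _B
        for _c in _b["C"]      # UNNEST(_B.C) AS _C
    ]


print(select_c_d(table))  # [1, 2, 3]
```

Each UNNEST corresponds to one extra loop level, which is why deeper dotted paths produce longer FROM clauses.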

ResearchSubject.Specimen.source_material_type represents a complex record that needs to be unwound in SQL syntax before it can be queried properly with SQL.

SELECT DISTINCT(_Specimen.source_material_type)
FROM gdc-bq-sample.cda_mvp.v3,
UNNEST(ResearchSubject) AS _ResearchSubject,
UNNEST(_ResearchSubject.Specimen) AS _Specimen
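
The pattern in the two statements above is mechanical, so it can be sketched as a small string-building helper. The function below is hypothetical and is not the code CDA Python uses to generate SQL; it assumes every level of the dotted path except the last is a repeated record that needs its own UNNEST.

```python
def unnest_query(path, table, distinct=False):
    """Build a SELECT over a dotted path, unnesting every level but the last."""
    *records, field = path.split(".")
    if not records:
        raise ValueError("path needs at least one nested record, e.g. 'A.B'")
    clauses = []
    parent = None
    for name in records:
        # The first level is read from the table; deeper levels from the parent alias
        source = name if parent is None else f"{parent}.{name}"
        alias = f"_{name}"
        clauses.append(f"UNNEST({source}) AS {alias}")
        parent = alias
    column = f"{parent}.{field}"
    select = f"DISTINCT({column})" if distinct else f"({column})"
    return f"SELECT {select} FROM {table}, " + ", ".join(clauses)


print(unnest_query("A.B.C.D", "TABLE"))
# SELECT (_C.D) FROM TABLE, UNNEST(A) AS _A, UNNEST(_A.B) AS _B, UNNEST(_B.C) AS _C
```

Calling unnest_query("ResearchSubject.Specimen.source_material_type", "gdc-bq-sample.cda_mvp.v3", distinct=True) yields the same statement as the Specimen example, on a single line.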

Note

This is the spiritual successor of the Query Translator Prototype

Contributors

briandoconnor, dionboles-asym, elijahlowe, fkaufman-asym, jackdigi, kghose, pshapiro4broad, yeastcell
