openbioml / chemnlp
ChemNLP project
License: MIT License
The data in TDC hERG Central gives a drug ID and SMILES instead of a drug name:
https://tdcommons.ai/single_pred_tasks/tox/#herg-central
A possible solution is to use PubChemPy or PUG REST to resolve the names, but it is slow: roughly 1 second per data point.
Can we rely on SMILES only?
For members of the community that want to contribute to outreach:
I investigated the ToxCast data. It involves 615 target columns, which I can convert into separate datasets, one per target. I tried dropping NaNs, but nothing remained, so each target needs to be curated separately. I would appreciate help saving time here, or help investigating each target's information, such as names and URIs. I will start on it after the TDC ADME data and CRISPR Repair data. Thanks.
I will finish the remaining single-instance data from TDC:
https://tdcommons.ai/single_pred_tasks/overview/
and generation data:
https://tdcommons.ai/generation_tasks/overview/
OK, I will update the contribution guide to say that notebooks are also fine.
Originally posted by @kjappelbaum in #4 (comment)
Add the lipophilicity dataset from https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv (from https://github.com/kjappelbaum/awesome-chemistry-datasets#ml-structure-property-benchmark-datasets)
PR draft is here: #23
@kjappelbaum Is there more information on the dataset, e.g., license, etc.?
Data from https://www.science.org/doi/10.1126/science.aar5169 for reaction yield prediction task.
I will take care of that.
Tox21 etc. have multiple endpoints. What should we do in this case: merge everything and use multiple target names, or keep each endpoint separately inside a subfolder?
is there a limit on the size?
I would like to work on this.
I took this dataset from the awesome list.
What is the priority for this? Is there any more information I should keep in mind before starting to add this dataset?
Add the flashpoint dataset from https://github.com/kjappelbaum/awesome-chemistry-datasets, which contains flashpoints of 10575 molecules, collected by Sun et al., Assessing Graph-based Deep Learning Models for Predicting Flash Point.
I will add all ADME properties from TDC:
https://tdcommons.ai/single_pred_tasks/adme/
We might want to have some additional issue labels:
dataset
tokenizer
docs
Again, not urgent and not important for our main goal (but useful for making the dataset more impactful):
Add some semantics to the dataset description (e.g., using LinkML to link the keys to a controlled vocabulary)
Add Drug-Target Interaction data from https://tdcommons.ai/multi_pred_tasks/dti/. The dataset contains the target amino acid sequence/compound SMILES string and the goal is to predict their binding affinity.
I will work on all toxicity datasets here:
https://tdcommons.ai/single_pred_tasks/tox/
The supporting information of American Chemical Society and Royal Society of Chemistry journals are available without a subscription for research use. I'm thinking of writing a scraper that will download all the PDFs.
Two questions:
The “small” training data set is available in the supporting information:
https://pubs.acs.org/doi/10.1021/ci034243x
I created an issue in the ESOL repo to ask for the full data:
hossainlab/ESOL#1
Worst case we can only use the small subset.
Add papyrus protein targets #336
Adding data from the Human Metabolome Database (HMDB) #136 [adamoyoung]
Adding Data from MassBank of North America (MoNA) #137 [adamoyoung]
Add Open Targets datasets for drug information #138 #139 #140 #141 #142 [jackapbutler]
Adding the europepmc dataset #162 [hssn-20]
Adding Uniprot, X-linking to reaction DBs for enzymes #191 [hypnopump]
Add DrugChat data #293 [alxfgh]
Adding Suzuki Miyaura yield prediction dataset #212 [pschwllr]
Add QMOF dataset #235 [kjappelbaum]
Add SuperCon dataset #236 [kjappelbaum]
Add QMUG dataset #237 [kjappelbaum]
Add Enamine dataset #238 [kjappelbaum]
Add ORD dataset #239 [kjappelbaum]
Refactor rhea_db into csv files #242 [kjappelbaum]
Add Drug-Target Interaction data #68 [strubeyj]
Add EuroPMC Dataset #32 [abhinav-kashyap-asus]
Add Buchwald-Hartwig dataset [pschwllr] #81
Add Drug-Drug Interaction Data from nSIDES [apoorvasrinivasan26] #89
Add uspto data from drfp #95
Add NLMChem #114 [apoorvasrinivasan26]
Add ThermoML Archive dataset #118
Adding the Chemistry textbooks from LibreTexts library #134
Add Therapeutic Data Commons dataset #27 [priority-high]
- [ ] Single-instance [phalem] #90
- [ ] Multi-instance
- [ ] Generation data [phalem] #90
Shall we leave them like this?
Originally I thought of adding some more context with "description" etc.
I'm ok with dropping this (but there should be one link we highlight as the one where the data comes from).
I also realized that this part is currently not validated.
Originally posted by @kjappelbaum in #23 (comment)
Here I will collect the issues I face with tabular data; I will also mention them in the TODO.
I found that the Quantum Mechanics datasets have a built-in structure identifier that is itself a complex structure:
https://tdcommons.ai/single_pred_tasks/qm/
['C', 'H', 'H', 'H', 'H'],
array([[ 0.99813803, -0.00263872, -0.00464602],
       [ 2.0944175 , -0.00242373,  0.00417336],
       [ 0.63238996,  1.03082951,  0.00417296],
       [ 0.62561232, -0.52974905,  0.88151021],
       [ 0.64010219, -0.50924801, -0.90858051]])
Each element maps to one coordinate row, e.g. "C" -> [0.99813803, -0.00263872, -0.00464602].
Again, Antibody Developability needs two sequences as input:
https://tdcommons.ai/single_pred_tasks/develop/
I can split them, but then the user will need to enter two inputs together.
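One way to keep a single model input is to join the two chains with a separator token; this is a hedged sketch, and the "[SEP]" delimiter and the short example sequences are assumptions, not a project convention:

```python
SEPARATOR = "[SEP]"  # illustrative choice of delimiter

def join_chains(heavy, light, sep=SEPARATOR):
    """Combine two chain sequences into one delimited string."""
    return f"{heavy}{sep}{light}"

def split_chains(joined, sep=SEPARATOR):
    """Recover the two chains from the delimited string."""
    heavy, light = joined.split(sep)
    return heavy, light

joined = join_chains("EVQLVESGG", "DIQMTQSPS")
```

The separator only needs to be a token that cannot occur inside an amino-acid sequence, so the two chains remain recoverable.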
===============
I found the same thing in Reaction Yields, but here we can make a column for each role:
https://tdcommons.ai/single_pred_tasks/yields/
{'product': 'Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1',
'catalyst': '',
'reactant': 'FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.c1ccc(-c2ccno2)cc1'}
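A hedged sketch of turning such role-keyed records into separate columns; the reactant field concatenates species with '.', so it can also be split into individual SMILES (the field names come from the example above, everything else is illustrative, and the reactant string is truncated for brevity):

```python
def to_columns(records):
    """Collect one list per role ('product', 'catalyst', 'reactant')."""
    columns = {}
    for rec in records:
        for role, smiles in rec.items():
            columns.setdefault(role, []).append(smiles)
    return columns

record = {
    "product": "Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1",
    "catalyst": "",
    "reactant": "FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1",  # truncated example
}
cols = to_columns([record])
species = record["reactant"].split(".")  # individual reactant SMILES
```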
In the epitope data there is a single input, but the target is a complex output, like:
https://tdcommons.ai/single_pred_tasks/epitope/
X:
'MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPKRGSGKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLSLSRFSWGAEGQRPGFGYGGRASDYKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARR'
Y:
[109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122]
These are indices into the input sequence, so should we map them back to the input and provide some visualization, or return them as they are?
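Mapping the indices back onto the input is straightforward; a minimal sketch, assuming the labels are 0-based indices into the sequence (the short sequence below is illustrative):

```python
def epitope_from_indices(sequence, indices):
    """Return the residues selected by `indices` as a string."""
    return "".join(sequence[i] for i in indices)

seq = "MASQKRPSQR"
print(epitope_from_indices(seq, [2, 3, 4]))  # -> "SQK"
```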
I can't find the data mentioned in the TDC LD50 reference:
https://tdcommons.ai/single_pred_tasks/tox/#acute-toxicity-ld50
https://doi.org/10.1021/tx900189p
or in the TDC Skin Reaction reference:
https://tdcommons.ai/single_pred_tasks/tox/#skin-reaction
https://doi.org/10.1016/j.taap.2014.12.014
The author of the first paper kindly provided the data, but the second did not.
(related to #72)
The idea is that these URIs can at least i) resolve whether two definitions are the same across datasets and could potentially ii) be used to augment the dataset with canonical descriptions and semantic links, either during prep or by the model on-the-fly.
This currently assumes an "exact match" style mapping between target and property -- we could build in additional semantic context in the schema here to enable things like related identities/subclasses/parthood and all that jazz. I struggled with the Butkiewicz sets as it is really outside my field and definitions are available for e.g., cav3, t-type, calcium channel and activity but not activity_cav3_t_type_calcium_channels.
As discussed, this is quite a niche task that may not be suitable to ask others to perform. Even in my own case, it is not clear exactly how good these particular definitions are -- I just went via BioPortal for fields that have good matches: https://bioportal.bioontology.org/
originally posted in #72 (comment)
transform.py
meta.yaml
I will work on the TWOSIDES drug-drug interaction database from here.
Let me quickly create a GitHub action that validates the `yaml` files.
Originally posted by @kjappelbaum in #4 (comment)
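A minimal sketch of such a validation action, assuming GitHub Actions; the file path, trigger, and the PyYAML-based check are illustrative, not the repo's actual setup:

```yaml
# Hypothetical workflow sketch — paths and commands are assumptions.
name: validate-yaml
on:
  pull_request:
    paths:
      - "**/meta.yaml"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Check that every meta.yaml parses
        run: |
          pip install pyyaml
          python -c "
          import glob, yaml
          for path in glob.glob('**/meta.yaml', recursive=True):
              yaml.safe_load(open(path))
          "
```

A schema check (e.g. against a JSON Schema for the expected keys) could later replace the plain parse check.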
Might be nice to have a repo of interesting papers. Can have different forms:
There is so much info in PubChem, we should
I will add the ChEBI-20 dataset from this paper, which provides rows mapping a CID and a SMILES string to a natural-language description of the molecule.
A basic template could be: "The molecule <CID> with SMILES <SMILES> can be described as follows: ____".
This dataset is also already mentioned in the awesome-chemistry-datasets repository.
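A hedged sketch of filling such a prompt template; the CID/SMILES/description values below are illustrative, not rows taken from ChEBI-20:

```python
TEMPLATE = (
    "The molecule {cid} with SMILES {smiles} "
    "can be described as follows: {description}"
)

def make_prompt(cid, smiles, description):
    """Fill the template for one dataset row."""
    return TEMPLATE.format(cid=cid, smiles=smiles, description=description)

prompt = make_prompt("702", "CCO", "a primary alcohol")
```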
Add Therapeutic Data Commons datasets from the https://github.com/kjappelbaum/awesome-chemistry-datasets list. The datasets included span therapeutic modalities and stages of discovery.
might be too annoying for contributors with the BibTeX entries
Based on the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning, we'd like to add Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)³).
We want to understand how we can interact with GPT-NeoX effectively and how we can use it to perform initial prompt tuning experiments on our datasets.
I will add data from:
https://tdcommons.ai/single_pred_tasks/hts/
Hi, I can prep this dataset from https://github.com/kjappelbaum/awesome-chemistry-datasets
Perhaps it would be cleaner if we also simply run pre-commit here. I can take care of that as well.
Originally posted by @kjappelbaum in #71 (comment)
See PR #36
Develop a GitHub page/dashboard that takes the meta.yaml files and builds a simple static page giving an overview of our implemented datasets.
Not a high priority, but a nice to have.
To increase the dataset size, we can compute many different properties for all SMILES; ChemBERTa also did this.
It might already be interesting to have simple things such as SMILES -> composition, SMILES -> SELFIES, SMILES -> number of rings, SMILES -> molecular weight, ...
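As a toy illustration of the SMILES -> number-of-rings idea using only the string itself: in simple SMILES (single-digit ring closures, no '%' notation, no isotope labels), each ring bond contributes a matching pair of digits. This is a hedged sketch; a real pipeline would use a cheminformatics toolkit such as RDKit.

```python
def naive_ring_count(smiles):
    """Count rings by pairing single-digit ring-closure labels.

    Only valid for simple SMILES without '%nn' closures or
    bracket-atom isotope digits.
    """
    digits = [ch for ch in smiles if ch.isdigit()]
    return len(digits) // 2

print(naive_ring_count("c1ccccc1"))        # benzene -> 1
print(naive_ring_count("c1ccc2ccccc2c1"))  # naphthalene -> 2
```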
We need a mechanism that describes which columns of a dataset a tokenizer applies to (e.g., I think we could use the identifier in meta.yaml for this).
Then, collect implementations for SMILES, SELFIES, InChI (?), and IUPAC name (?) tokenizers, and describe in some way (registry pattern, decorator, ...) which data types each applies to.
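The registry/decorator idea above could look like this minimal sketch: tokenizers declare which data-type identifier they handle, and the pipeline dispatches by that identifier. All names and the placeholder tokenizers are illustrative assumptions, not project code.

```python
TOKENIZER_REGISTRY = {}

def register_tokenizer(data_type):
    """Decorator mapping a data-type identifier to a tokenizer function."""
    def wrap(fn):
        TOKENIZER_REGISTRY[data_type] = fn
        return fn
    return wrap

@register_tokenizer("SMILES")
def tokenize_smiles(text):
    # Placeholder: character-level tokenization.
    return list(text)

@register_tokenizer("SELFIES")
def tokenize_selfies(text):
    # Placeholder: split on the closing brackets of SELFIES tokens.
    return [tok + "]" for tok in text.split("]") if tok]

def tokenize(data_type, text):
    """Dispatch to the tokenizer registered for this column's type."""
    return TOKENIZER_REGISTRY[data_type](text)
```

Dispatching on the meta.yaml identifier then reduces to `tokenize(column_type, value)` per column.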
Can we extract some info (semantic classes and named entities) from the text datasets? Are we maybe even able to extract info from the images in the papers?
This might be useful for better train/test splits or to create relevant subsets of data (e.g., certain compound classes) and LIFT prompting.