openbioml / chemnlp
ChemNLP project
License: MIT License
The data in TDC hERG Central gives a drug ID and SMILES instead of a drug name:
https://tdcommons.ai/single_pred_tasks/tox/#herg-central
A possible solution is to use PubChemPy or PUG REST to resolve the names, but it is slow: roughly 1 second per data point.
Can we rely on SMILES only?
For members of the community that want to contribute to outreach:
I investigated the ToxCast data. It involves 615 target columns, which I can convert into separate datasets, one per target. I tried dropping NaNs, but nothing remained, so each target needs to be curated separately. I would appreciate help saving time here, or help investigating each target's information, such as names and URIs. I will start on it after the TDC ADME data and CRISPR Repair data. Thanks.
I will finish the remaining single-instance data from TDC:
https://tdcommons.ai/single_pred_tasks/overview/
and generation data:
https://tdcommons.ai/generation_tasks/overview/
OK, I will update the contribution guide to say that notebooks are also fine.
Originally posted by @kjappelbaum in #4 (comment)
Add the lipophilicity dataset from https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv (from https://github.com/kjappelbaum/awesome-chemistry-datasets#ml-structure-property-benchmark-datasets)
PR draft is here: #23
@kjappelbaum Is there more information on the dataset, e.g., license, etc.?
Data from https://www.science.org/doi/10.1126/science.aar5169 for reaction yield prediction task.
I will take care of that.
Tox21 etc. have multiple endpoints. What should we do in this case: merge everything and use multiple target names, or keep each endpoint separately inside a subfolder?
is there a limit on the size?
I would like to work on this.
I took this dataset from the awesome list.
What is the priority for this? Is there any more information I should keep in mind before starting to add this dataset?
Add the flashpoint dataset from https://github.com/kjappelbaum/awesome-chemistry-datasets, which contains flashpoints of 10575 molecules, collected by Sun et al., Assessing Graph-based Deep Learning Models for Predicting Flash Point.
I will add all ADME properties from TDC:
https://tdcommons.ai/single_pred_tasks/adme/
We might want to have some additional issue labels:
dataset
tokenizer
docs
Again, not urgent and not important for our main goal (but useful for making the dataset more impactful):
Add some semantics to the dataset description (e.g., using LinkML to link the keys to a controlled vocabulary)
Add Drug-Target Interaction data from https://tdcommons.ai/multi_pred_tasks/dti/. The dataset contains the target amino acid sequence/compound SMILES string and the goal is to predict their binding affinity.
I will work on all toxicity datasets here:
https://tdcommons.ai/single_pred_tasks/tox/
The supporting information of American Chemical Society and Royal Society of Chemistry journals are available without a subscription for research use. I'm thinking of writing a scraper that will download all the PDFs.
Two questions:
The “small” training data set is available in the supporting information:
https://pubs.acs.org/doi/10.1021/ci034243x
I created an issue in the ESOL repo to ask for the full data:
hossainlab/ESOL#1
Worst case we can only use the small subset.
Add papyrus protein targets #336
Adding data from the Human Metabolome Database (HMDB) #136 [adamoyoung]
Adding Data from MassBank of North America (MoNA) #137 [adamoyoung]
Add Open Targets datasets for drug information #138 #139 #140 #141 #142 [jackapbutler]
Adding the europepmc dataset #162 [hssn-20]
Adding Uniprot, X-linking to reaction DBs for enzymes #191 [hypnopump]
Add DrugChat data #293 [alxfgh]
Adding Suzuki Miyaura yield prediction dataset #212 [pschwllr]
Add QMOF dataset #235 [kjappelbaum]
Add SuperCon dataset #236 [kjappelbaum]
Add QMUG dataset #237 [kjappelbaum]
Add Enamine dataset #238 [kjappelbaum]
Add ORD dataset #239 [kjappelbaum]
Refactor rhea_db into csv files #242 [kjappelbaum]
Add Drug-Target Interaction data #68 [strubeyj]
Add EuroPMC Dataset #32 [abhinav-kashyap-asus]
Add Buchwald-Hartwig dataset [pschwllr] #81
Add Drug-Drug Interaction Data from nSIDES [apoorvasrinivasan26] #89
Add uspto data from drfp #95
Add NLMChem #114 [apoorvasrinivasan26]
Add ThermoML Archive dataset #118
Adding the Chemistry textbooks from LibreTexts library #134
Add Therapeutic Data Commons dataset #27 [priority-high]
- [ ] Single-instance [phalem] #90
- [ ] Multi-instance
- [ ] Generation data [phalem] #90
Shall we leave them like this?
Originally I thought of adding some more context with "description" etc.
I'm ok with dropping this (but there should be one link we highlight as the one where the data comes from).
I also realized that this part is currently not validated.
Originally posted by @kjappelbaum in #23 (comment)
Here I will collect the issues I face with tabular data; I will also mention them in the TODO.
I found that the Quantum Mechanics datasets have a built-in structure identifier that is itself a complex structure:
https://tdcommons.ai/single_pred_tasks/qm/
['C', 'H', 'H', 'H', 'H'],
array([[ 0.99813803, -0.00263872, -0.00464602],
       [ 2.0944175 , -0.00242373,  0.00417336],
       [ 0.63238996,  1.03082951,  0.00417296],
       [ 0.62561232, -0.52974905,  0.88151021],
       [ 0.64010219, -0.50924801, -0.90858051]])
Each element maps to one coordinate row, e.g. "C" -> [0.99813803, -0.00263872, -0.00464602].
Again, Antibody Developability needs two sequences as input:
https://tdcommons.ai/single_pred_tasks/develop/
I can split them, but then the user will need to enter two inputs together.
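One way to keep a single model input is to join the two chains with a separator token; this is a hedged sketch, and the "[SEP]" delimiter and the short example sequences are assumptions, not a project convention:

```python
SEPARATOR = "[SEP]"  # illustrative choice of delimiter

def join_chains(heavy, light, sep=SEPARATOR):
    """Combine two chain sequences into one delimited string."""
    return f"{heavy}{sep}{light}"

def split_chains(joined, sep=SEPARATOR):
    """Recover the two chains from the delimited string."""
    heavy, light = joined.split(sep)
    return heavy, light

joined = join_chains("EVQLVESGG", "DIQMTQSPS")
```

The separator only needs to be a token that cannot occur inside an amino-acid sequence, so the two chains remain recoverable.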
===============
I found the same thing in Reaction Yields, but here we can make a column for each role:
https://tdcommons.ai/single_pred_tasks/yields/
{'product': 'Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1',
'catalyst': '',
'reactant': 'FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.c1ccc(-c2ccno2)cc1'}
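A hedged sketch of turning such role-keyed records into separate columns; the reactant field concatenates species with '.', so it can also be split into individual SMILES (the field names come from the example above, everything else is illustrative, and the reactant string is truncated for brevity):

```python
def to_columns(records):
    """Collect one list per role ('product', 'catalyst', 'reactant')."""
    columns = {}
    for rec in records:
        for role, smiles in rec.items():
            columns.setdefault(role, []).append(smiles)
    return columns

record = {
    "product": "Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1",
    "catalyst": "",
    "reactant": "FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1",  # truncated example
}
cols = to_columns([record])
species = record["reactant"].split(".")  # individual reactant SMILES
```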
In the epitope data there is a single input, but the target is a complex output, like:
https://tdcommons.ai/single_pred_tasks/epitope/
X:
'MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPKRGSGKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLSLSRFSWGAEGQRPGFGYGGRASDYKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARR'
Y:
[109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122]
These are indices into the input sequence, so should we map them back to the input and provide some visualization, or return them as they are?
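Mapping the indices back onto the input is straightforward; a minimal sketch, assuming the labels are 0-based indices into the sequence (the short sequence below is illustrative):

```python
def epitope_from_indices(sequence, indices):
    """Return the residues selected by `indices` as a string."""
    return "".join(sequence[i] for i in indices)

seq = "MASQKRPSQR"
print(epitope_from_indices(seq, [2, 3, 4]))  # -> "SQK"
```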
I can't find the data mentioned in the TDC LD50 reference:
https://tdcommons.ai/single_pred_tasks/tox/#acute-toxicity-ld50
https://doi.org/10.1021/tx900189p
or in the TDC Skin Reaction reference:
https://tdcommons.ai/single_pred_tasks/tox/#skin-reaction
https://doi.org/10.1016/j.taap.2014.12.014
The author of the first paper kindly provided the data, but the second did not.
(related to #72)
The idea is that these URIs can at least i) resolve whether two definitions are the same across datasets and could potentially ii) be used to augment the dataset with canonical descriptions and semantic links, either during prep or by the model on-the-fly.
This currently assumes an "exact match" style mapping between target and property -- we could build in additional semantic context in the schema here to enable things like related identities/subclasses/parthood and all that jazz. I struggled with the Butkiewicz sets as it is really outside my field and definitions are available for e.g., cav3, t-type, calcium channel and activity but not activity_cav3_t_type_calcium_channels.
As discussed, this is quite a niche task that may not be suitable to ask others to perform. Even in my own case, it is not clear exactly how good these particular definitions are -- I just went via BioPortal for fields that have good matches: https://bioportal.bioontology.org/
originally posted in #72 (comment)
transform.py
meta.yaml
I will work on the TWOSIDES drug-drug interaction database from here.
Let me quickly create a GitHub action that validates the `yaml` files.
Originally posted by @kjappelbaum in #4 (comment)
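A minimal sketch of such a validation action, assuming GitHub Actions; the file path, trigger, and the PyYAML-based check are illustrative, not the repo's actual setup:

```yaml
# Hypothetical workflow sketch — paths and commands are assumptions.
name: validate-yaml
on:
  pull_request:
    paths:
      - "**/meta.yaml"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Check that every meta.yaml parses
        run: |
          pip install pyyaml
          python -c "
          import glob, yaml
          for path in glob.glob('**/meta.yaml', recursive=True):
              yaml.safe_load(open(path))
          "
```

A schema check (e.g. against a JSON Schema for the expected keys) could later replace the plain parse check.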
Might be nice to have a repo of interesting papers. Can have different forms:
There is so much info in PubChem, we should
I will add the ChEBI-20 dataset from this paper, which provides rows mapping a CID and a SMILES string to a natural-language description of the molecule.
A basic template could be: "The molecule <CID> with SMILES <SMILES> can be described as follows: ____".
This dataset is also already mentioned in the awesome-chemistry-datasets repository.
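A hedged sketch of filling such a prompt template; the CID/SMILES/description values below are illustrative, not rows taken from ChEBI-20:

```python
TEMPLATE = (
    "The molecule {cid} with SMILES {smiles} "
    "can be described as follows: {description}"
)

def make_prompt(cid, smiles, description):
    """Fill the template for one dataset row."""
    return TEMPLATE.format(cid=cid, smiles=smiles, description=description)

prompt = make_prompt("702", "CCO", "a primary alcohol")
```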
Add Therapeutic Data Commons datasets from the https://github.com/kjappelbaum/awesome-chemistry-datasets list. The datasets included span therapeutic modalities and stages of discovery.
might be too annoying for contributors with the BibTeX entries
Based on the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning, we'd like to add Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)³).
We want to understand how we can interact with GPT-NeoX effectively and how we can use it to perform initial prompt tuning experiments on our datasets.
I will add data from:
https://tdcommons.ai/single_pred_tasks/hts/
Hi, I can prep this dataset from https://github.com/kjappelbaum/awesome-chemistry-datasets
Perhaps it would be cleaner if we also simply run pre-commit here. I can take care of that as well.
Originally posted by @kjappelbaum in #71 (comment)
See PR #36
Develop a GitHub page/dashboard that takes the meta.yaml files and builds a simple static page giving an overview of our implemented datasets.
Not a high priority, but a nice to have.
To increase the dataset size, we can compute many different properties for all SMILES; ChemBERTa also did this.
It might already be interesting to have simple things such as SMILES -> composition, SMILES -> SELFIES, SMILES -> number of rings, SMILES -> molecular weight, ...
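As a toy illustration of the SMILES -> number-of-rings idea using only the string itself: in simple SMILES (single-digit ring closures, no '%' notation, no isotope labels), each ring bond contributes a matching pair of digits. This is a hedged sketch; a real pipeline would use a cheminformatics toolkit such as RDKit.

```python
def naive_ring_count(smiles):
    """Count rings by pairing single-digit ring-closure labels.

    Only valid for simple SMILES without '%nn' closures or
    bracket-atom isotope digits.
    """
    digits = [ch for ch in smiles if ch.isdigit()]
    return len(digits) // 2

print(naive_ring_count("c1ccccc1"))        # benzene -> 1
print(naive_ring_count("c1ccc2ccccc2c1"))  # naphthalene -> 2
```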
We need a mechanism that describes which columns of a dataset a tokenizer applies to (e.g., I think we could use the identifier in meta.yaml for this).
Then, collect implementations for SMILES, SELFIES, InChI (?), and IUPAC name (?) tokenizers, and describe in some way (registry pattern, decorator, ...) which data types each applies to.
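The registry/decorator idea above could look like this minimal sketch: tokenizers declare which data-type identifier they handle, and the pipeline dispatches by that identifier. All names and the placeholder tokenizers are illustrative assumptions, not project code.

```python
TOKENIZER_REGISTRY = {}

def register_tokenizer(data_type):
    """Decorator mapping a data-type identifier to a tokenizer function."""
    def wrap(fn):
        TOKENIZER_REGISTRY[data_type] = fn
        return fn
    return wrap

@register_tokenizer("SMILES")
def tokenize_smiles(text):
    # Placeholder: character-level tokenization.
    return list(text)

@register_tokenizer("SELFIES")
def tokenize_selfies(text):
    # Placeholder: split on the closing brackets of SELFIES tokens.
    return [tok + "]" for tok in text.split("]") if tok]

def tokenize(data_type, text):
    """Dispatch to the tokenizer registered for this column's type."""
    return TOKENIZER_REGISTRY[data_type](text)
```

Dispatching on the meta.yaml identifier then reduces to `tokenize(column_type, value)` per column.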
Can we extract some info (semantic classes and named entities) from the text datasets? Are we maybe even able to extract info from the images in the papers?
This might be useful for better train/test splits or to create relevant subsets of data (e.g., certain compound classes) and LIFT prompting.