
chemnlp's People

Contributors

adamoyoung, adrianm0, apoorvasrinivasan26, arkadiusz-czerwinski, bethanyconnolly, hypnopump, jackapbutler, kjappelbaum, maw501, mehradans92, micpie, ml-evs, n0w0f, othertea, phalem, pixelatory, pre-commit-ci[bot]


chemnlp's Issues

New role: Outreach officer(s)

For members of the community who want to contribute to outreach:

  • make regular blog posts with progress reports/newsletters
  • fill website with content
  • explain work in appealing visuals
  • add documentation

Tabular data issues | ToxCast consists of 615 columns, i.e. 615 separate datasets

I am investigating the ToxCast data. It contains 615 target columns, which I can convert into 615 separate per-target datasets. I tried to remove the NaN values, but nothing was returned, so each target needs to be curated separately. I would appreciate help reducing the workload, or with collecting the information for each target (names, URIs, etc.). I will start on it after the TCR ADME data and the CRISPR Repair data. Thanks.
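A minimal pandas sketch of the per-target split, assuming the raw ToxCast table is a wide CSV with a SMILES column plus one column per assay target (the file name and column names are placeholders):

```python
# Sketch only: split the wide ToxCast table into one dataset per assay target,
# dropping rows that have no label for that target. Paths/columns are placeholders.
import pandas as pd

df = pd.read_csv("toxcast_raw.csv")  # placeholder path
target_columns = [c for c in df.columns if c != "smiles"]

per_target = {}
for target in target_columns:
    subset = df[["smiles", target]].dropna(subset=[target])
    if not subset.empty:
        per_target[target] = subset.rename(columns={target: "label"})

print(f"{len(per_target)} non-empty per-target datasets")
```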

New Task: Add EuroPMC Dataset

I would like to work on this.
I took this dataset from the awesome list.
What is the priority for this? Is there anything else I should keep in mind before starting to add this dataset?

Issue labels

We might want to have some additional issue labels:

  • dataset
  • tokenizer
  • docs

New Task: Make schemas semantic

Again, not urgent and not important for our main goal (but useful for making the dataset more impactful):

Add some semantics to the dataset description (e.g., using LinkML to link the keys to a controlled vocabulary).
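A tiny illustration of what this could look like, with placeholder URIs only (the actual vocabulary terms would need to be chosen, e.g. via LinkML and BioPortal):

```python
# Illustrative only: map meta.yaml keys to controlled-vocabulary URIs.
# The URIs below are placeholders, not real ontology terms.
SEMANTIC_KEYS = {
    "SMILES": "https://example.org/vocab/smiles",
    "InChI": "https://example.org/vocab/inchi",
    "target": "https://example.org/vocab/assay_endpoint",
}
```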

In the contribution guide, specify the following

  • link the issue
  • when listing names, keep in mind that they will be used for prompting the model. Is there enough context?
  • do we need to sample multiple columns at the same time (e.g., protein and drug SMILES)? (see the sketch below)
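A minimal sketch of what sampling two columns into one prompt template could look like (the column names, template, and values are hypothetical):

```python
# Sketch only: fill a single prompt template from several columns of one row,
# e.g. a protein sequence and a drug SMILES. Names/values are toy examples.
template = (
    "Does the drug with SMILES {smiles} bind the protein with sequence {sequence}? {label}"
)

row = {"smiles": "CCO", "sequence": "MKTAYIAK", "label": "No"}
print(template.format(**row))
```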

New task: Scrape supporting information files

The supporting information files of American Chemical Society and Royal Society of Chemistry journals are available without a subscription for research use. I'm thinking of writing a scraper that downloads all of these PDFs (a rough sketch of the download step follows the questions below).

Two questions:

  • Is there a place where we can host possibly hundreds of GBs of PDFs?
  • How do we extract data from the PDFs? I think this will probably be addressed in #18
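A rough sketch of the download step, with placeholder URLs only; any real run would need rate limiting and a check of the publishers' terms of use first:

```python
# Sketch only: download supporting-information PDFs to a local folder.
# The URL list is a placeholder; add rate limiting before using this at scale.
import pathlib
import requests

si_urls = [
    "https://example.org/si/paper1_si.pdf",  # placeholder, not a real SI link
]

out_dir = pathlib.Path("si_pdfs")
out_dir.mkdir(exist_ok=True)

for url in si_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    (out_dir / url.rsplit("/", 1)[-1]).write_bytes(response.content)
```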

Dataset TODO list

Dataset Todo

  • Add "synthetic" data #13 [nice-to-have]
  • Run ChemDataExtractor on Free Text #18 [needs-discussion]
  • Prepare PubChem dataset #19 [priority-high]
  • Add CheMBL dataset #24 [priority-high]
  • Add ESOL dataset #33 [priority-high]

Dataset In Progress

  • Add Papyrus dataset #335 #340

  • Add papyrus protein targets #336

  • Adding data from the Human Metabolome Database (HMDB) #136 [adamoyoung]

  • Adding Data from MassBank of North America (MoNA) #137 [adamoyoung]

  • Add Open Targets datasets for drug information #138 #139 #140 #141 #142 [jackapbutler]

  • Adding the europepmc dataset #162 [hssn-20]

  • Adding Uniprot, X-linking to reaction DBs for enzymes #191 [hypnopump]

  • Add DrugChat data #293 [alxfgh]

  • Adding Suzuki Miyaura yield prediction dataset #212 [pschwllr]

  • Add QMOF dataset #235 [kjappelbaum]

  • Add SuperCon dataset #236 [kjappelbaum]

  • Add QMUG dataset #237 [kjappelbaum]

  • Add Enamine dataset #238 [kjappelbaum]

  • Add ORD dataset #239 [kjappelbaum]

  • Refactor rhea_db into csv files #242 [kjappelbaum]

  • Add Drug-Target Interaction data #68 [strubeyj]

  • H2_storage_materials_database #64 [bethanyconnolly] #76

  • Add EuroPMC Dataset #32 [abhinav-kashyap-asus]

  • Add Buchwald-Hartwig dataset [pschwllr] #81

  • Add Drug-Drug Interaction Data from nSIDES [apoorvasrinivasan26] #89

  • Add uspto data from drfp #95

  • Add NLMChem #114 [apoorvasrinivasan26]

  • Add ThermoML Archive dataset #118

  • Adding the Chemistry textbooks from LibreTexts library #134

  • Add Therapeutic Data Commons dataset #27 [priority-high]

    - [ ] Single-instance [phalem] #90

    • Add ADME Property [phalem] #84
      • Absorption #85
        • Caco-2 (Cell Effective Permeability), Wang et al.[MicPie] #37
        • PAMPA Permeability, NCATS [MicPie] #41
        • HIA (Human Intestinal Absorption), Hou et al. #85
        • Pgp (P-glycoprotein) Inhibition, Broccatelli et al. #85
        • Bioavailability, Ma et al. #85
        • Lipophilicity, AstraZeneca [MicPie] #22
        • Solubility, AqSolDB #85
        • Hydration Free Energy, FreeSolv #85
      • Distribution #86
        • BBB (Blood-Brain Barrier), Martins et al. #86
        • PPBR (Plasma Protein Binding Rate), AstraZeneca #86
        • VDss (Volume of Distribution at steady state), Lombardo et al. #86
      • Metabolism #88
        • CYP P450 2C19 Inhibition, Veith et al. #88
        • CYP P450 2D6 Inhibition, Veith et al. #88
        • CYP P450 3A4 Inhibition, Veith et al. #88
        • CYP P450 1A2 Inhibition, Veith et al. #88
        • CYP P450 2C9 Inhibition, Veith et al. #88
        • CYP2C9 Substrate, Carbon-Mangels et al. #88
        • CYP2D6 Substrate, Carbon-Mangels et al. #88
        • CYP3A4 Substrate, Carbon-Mangels et al. #88
      • Excretion #87
        • Half Life, Obach et al. #87
        • Clearance, AstraZeneca #87
    • Add Toxicity [phalem]
    • Add High-throughput Screening [phalem]
      • SARS-CoV-2 In Vitro, Touret et al. #59
      • SARS-CoV-2 3CL Protease, Diamond. #94
      • HIV #60
      • Butkiewicz et al. #62
    • Add Quantum Mechanics Modeling #78
      • QM7b
      • QM8
      • QM9
    • Add Reaction Yields #78
      • Buchwald-Hartwig #81
      • USPTO
    • Add Epitope (Immunotherapy under Target Discovery) #97
      • IEDB, Jespersen et al. #96
      • PDB, Jespersen et al. #96
    • Add Antibody Developability #78
      • TAP #99
      • SAbDab, Chen et al. #99
    • Add CRISPR Repair Outcome [apoorvasrinivasan26]
      • Leenay et al.

    - [ ] Multi-instance

    • Add Drug-Target Interaction data #68 [strubeyj]
      • BindingDB
      • DAVIS
      • KIBA
    • Add Drug-Drug Interaction
      • DrugBank Multi-Typed DDI
      • TWOSIDES Polypharmacy Side Effects
    • Add Gene-Disease Association
      • DisGeNET
    • Add Drug Response
      • GDSC1
      • GDSC2
    • Add Peptide-MHC Binding
      • MHC Class I, IEDB-IMGT, Nielsen et al.
      • MHC Class II, IEDB, Jensen et al.
    • Add Antibody-antigen Affinity
      • SAbDab
    • Add MicroRNA-Target Interaction
      • miRTarBase
    • Add Catalyst
      • USPTO
    • Add TCR-Epitope Binding Affinity [strubeyj] #67
      • Weber et al.

    - [ ] Generation data [phalem] #90

    • Add Molecule Generation #178 [arkadiusz-czerwinski]
      • MOSES #178 [arkadiusz-czerwinski]
      • ZINC #178 [arkadiusz-czerwinski]
      • ChEMBL #178 [arkadiusz-czerwinski]
    • Add Retrosynthesis
      • USPTO-50K
      • USPTO
    • Add Reaction Outcome
      • USPTO
    • Add Structure-based Drug Design
      • PDBBind
      • DUD-E
      • scPDB

Done ✓

  • Add flashpoint dataset #43 [othertea]
  • Add initial model pipeline [maw501] [bethanyconnolly] [kjappelbaum] [MicPie] #71
  • Add IUPAC Gold Book #187 #188 [MicPie]
  • Add RXN-SMILES as identifier type #113 [kjappelbaum]
  • Add benchmark field #116
  • Add entos protonation energy #244 #233 [kjappelbaum]
  • Add chebi-20 dataset #63 #108 [jackapbutler]
  • Add FDA Adverse reactions datasets #139 #143 [jackapbutler]
  • Add Natural text dataset elsevier_oa_cc-by_corpus #216

Validate the links in `meta.yaml`

Shall we leave them like this?

Originally I thought of adding some more context with "description" etc.
I'm ok with dropping this (but there should be one link we highlight as the one where the data comes from).

I also realized that this part is currently not validated.

Originally posted by @kjappelbaum in #23 (comment)
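A minimal validation sketch, assuming a layout of data/*/meta.yaml with a links list whose entries carry a url field (both of these are assumptions about the schema):

```python
# Sketch only: check that every URL linked from a meta.yaml is reachable.
# Assumes data/*/meta.yaml files with a "links" list of {"url": ...} entries.
import glob

import requests
import yaml

for path in glob.glob("data/*/meta.yaml"):
    with open(path) as handle:
        meta = yaml.safe_load(handle)
    for link in meta.get("links", []):
        url = link.get("url") if isinstance(link, dict) else link
        try:
            ok = requests.head(url, allow_redirects=True, timeout=10).status_code < 400
        except requests.RequestException:
            ok = False
        if not ok:
            print(f"{path}: unreachable link {url}")
```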

Tabular data issues | Complex structure identifiers

Here I will collect the issues I face with the tabular data; I will also mention them in the TODO.

I found that the Quantum Mechanics datasets have a built-in structure identifier that is itself a complex data structure:
https://tdcommons.ai/single_pred_tasks/qm/

(['C', 'H', 'H', 'H', 'H'],
 array([[ 0.99813803, -0.00263872, -0.00464602],
        [ 2.0944175 , -0.00242373,  0.00417336],
        [ 0.63238996,  1.03082951,  0.00417296],
        [ 0.62561232, -0.52974905,  0.88151021],
        [ 0.64010219, -0.50924801, -0.90858051]]))
"C" -> [ 0.99813803, -0.00263872, -0.00464602]

I tried to put it into a dict, but a dictionary cannot have duplicate keys. Another option would be to put it into a different structure, but that still requires multiple identifiers to be entered together: we cannot input a "C" without its x, y, z coordinates, and we cannot input one atom without the other atoms.
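One possible workaround: keep the element symbols and coordinates paired as a list of tuples (so repeated elements are fine), or serialize them into an XYZ-style text block for prompting. A minimal sketch using the methane example above:

```python
# Sketch only: represent the structure as (element, coordinates) pairs instead
# of a dict, then flatten to an XYZ-style string for use in a prompt.
atoms = ["C", "H", "H", "H", "H"]
coords = [
    [0.99813803, -0.00263872, -0.00464602],
    [2.0944175, -0.00242373, 0.00417336],
    [0.63238996, 1.03082951, 0.00417296],
    [0.62561232, -0.52974905, 0.88151021],
    [0.64010219, -0.50924801, -0.90858051],
]

structure = list(zip(atoms, coords))  # [("C", [x, y, z]), ("H", ...), ...]
xyz_block = "\n".join(f"{el} {x:.6f} {y:.6f} {z:.6f}" for el, (x, y, z) in structure)
print(xyz_block)
```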

Similarly, Antibody Developability needs two sequences as input:
https://tdcommons.ai/single_pred_tasks/develop/

I can split them, but the user would then need to enter both inputs together.

===============
I found the same thing in Reaction Yields, but here we can make a column for each field:

https://tdcommons.ai/single_pred_tasks/yields/

{'product': 'Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1',
'catalyst': '',
'reactant': 'FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.c1ccc(-c2ccno2)cc1'}

Again, the user must enter all three identifiers together, since the same product can appear more than once.

In the epitope data the input is a single sequence, but the target is a complex output like:
https://tdcommons.ai/single_pred_tasks/epitope/
X:
'MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPKRGSGKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLSLSRFSWGAEGQRPGFGYGGRASDYKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARR'
Y:
[109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122]
which are indices into the input sequence. Should we map them back onto the input (and perhaps provide some visualization), or return them as they are?
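A minimal sketch of mapping the indices back onto the input sequence, assuming they are 0-based (if the dataset uses 1-based positions, subtract 1 first):

```python
# Sketch only: turn the index list into the corresponding subsequence of the
# input, so the label can be shown as residues rather than raw positions.
sequence = "MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPKRGSGKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLSLSRFSWGAEGQRPGFGYGGRASDYKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARR"
indices = [109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122]

epitope = "".join(sequence[i] for i in indices)  # assumes 0-based indices
print(epitope)
```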

Consider adding URIs for dataset targets and properties

(related to #72)

The idea is that these URIs can at least i) resolve whether two definitions are the same across datasets and could potentially ii) be used to augment the dataset with canonical descriptions and semantic links, either during prep or by the model on-the-fly.

This currently assumes an "exact match" style mapping between target and property -- we could build in additional semantic context in the schema here to enable things like related identities/subclasses/parthood and all that jazz. I struggled with the Butkiewicz sets as it is really outside my field and definitions are available for e.g., cav3, t-type, calcium channel and activity but not activity_cav3_t_type_calcium_channels.

As discussed, this is quite a niche task that may not be suitable to ask others to perform. Even in my own case, it is not clear exactly how good these particular definitions are -- I just went via BioPortal for fields that have good matches: https://bioportal.bioontology.org/

Originally posted in #72 (comment)

New task: Common reading list

It might be nice to have a repository of interesting papers. It could take different forms:

  • shared Zotero collection
  • a GitHub page
  • a Discord channel


New Task: Add chebi-20 dataset

Overview

I will add the chebi-20 dataset from this paper, which provides rows that map a "CID" and "SMILES" to a natural language description of the particular molecule.

A basic template could be: The molecule <CID> with SMILES <SMILES> can be described as follows: ____. This dataset is also already mentioned in the awesome-chemistry-datasets repository.
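A minimal sketch of that template being filled from one row (the column names and example values are hypothetical, not the actual schema):

```python
# Sketch only: fill the template above from one chebi-20 row.
# Column names and values here are toy examples.
template = "The molecule {cid} with SMILES {smiles} can be described as follows: {description}"

row = {"cid": "702", "smiles": "CCO", "description": "It is a simple primary alcohol."}
print(template.format(**row))
```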

Groundwork for GPT-NeoX codebase integration

We want to understand how we can interact with GPT-NeoX effectively and how we can use it to perform initial prompt tuning experiments on our datasets.

  • Understand how GPT-NeoX handles (at least the following high level points)
    • training data
    • tokenisation
    • model architecture configuration
    • model training configuration
    • evaluation / checkpointing
  • #109
  • #120
  • #111
  • #124

New Task: Build data dashboard

Develop a GitHub page/dashboard that takes the meta.yaml files and builds a simple static overview page of our implemented datasets.
Not a high priority, but nice to have.
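A minimal sketch of the data-collection step, assuming data/*/meta.yaml files with name and num_points fields (both field names are assumptions); the resulting table could then be rendered as a static page, e.g. via GitHub Pages:

```python
# Sketch only: collect one summary row per dataset from its meta.yaml and
# write a markdown table that a static page can display.
# Assumes data/*/meta.yaml files with "name" and "num_points" fields.
import glob

import pandas as pd
import yaml

rows = []
for path in glob.glob("data/*/meta.yaml"):
    with open(path) as handle:
        meta = yaml.safe_load(handle)
    rows.append({"dataset": meta.get("name", path), "num_points": meta.get("num_points")})

summary = pd.DataFrame(rows)
summary.to_markdown("dashboard.md", index=False)  # needs the tabulate package
```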

New Task: Add "synthetic" data

To increase the dataset size, we can compute many different properties for all SMILES; ChemBERTa also did this.

It might already be interesting to have simple things such as SMILES -> composition, SMILES -> SELFIES, SMILES -> number of rings, SMILES -> molecular weight, ...
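A minimal sketch of such derived labels for a single SMILES, using RDKit and the selfies package:

```python
# Sketch only: derive "synthetic" labels (composition, SELFIES, ring count,
# molecular weight) from a single SMILES string.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
import selfies

smiles = "CCO"
mol = Chem.MolFromSmiles(smiles)

record = {
    "smiles": smiles,
    "selfies": selfies.encoder(smiles),
    "composition": rdMolDescriptors.CalcMolFormula(mol),
    "num_rings": rdMolDescriptors.CalcNumRings(mol),
    "molecular_weight": round(Descriptors.MolWt(mol), 2),
}
print(record)
```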

New task: Compile tokenizers

We need a mechanism that describes to which columns of a dataset a tokenizer applies (e.g., I think that we could use the identifier in meta.yaml for this).

Then, collect implementations for SMILES, SELFIES, InChI (?), and IUPAC name (?) tokenizers, and describe in some way (registry pattern, decorator, ...) which data types each applies to.
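A minimal sketch of what such a registry could look like (the names and the toy SELFIES tokenizer are illustrative only):

```python
# Sketch only: a registry that maps identifier types (as they might appear in
# meta.yaml) to tokenizer functions via a decorator.
TOKENIZER_REGISTRY = {}


def register_tokenizer(identifier_type):
    """Associate a tokenizer function with an identifier type."""
    def decorator(func):
        TOKENIZER_REGISTRY[identifier_type] = func
        return func
    return decorator


@register_tokenizer("SELFIES")
def tokenize_selfies(text):
    # SELFIES tokens are bracket-delimited, so split on the token boundaries.
    return text.replace("][", "] [").split()


def tokenize_column(identifier_type, values):
    tokenizer = TOKENIZER_REGISTRY[identifier_type]
    return [tokenizer(value) for value in values]


print(tokenize_column("SELFIES", ["[C][C][O]"]))
```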

New Task: Run ChemDataExtractor on Free Text

Can we extract some info (semantic classes and named entities) from the text datasets? Are we maybe even able to extract info from the images in the papers?

This might be useful for better train/test splits or to create relevant subsets of data (e.g., certain compound classes) and LIFT prompting.
