Giter Club home page Giter Club logo

series4_predictivemodel's People

Contributors

edwintse avatar gcincilla avatar holeung avatar iamdavyg avatar luiraym avatar mattodd avatar sladem-tox avatar spadavec avatar wvanhoorn avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

series4_predictivemodel's Issues

MAIP: a web service for predicting blood‐stage malaria inhibitors

An interesting website @mattodd

We describe the development of an open-source software platform for creating such models, a comprehensive evaluation of methods to create a single consensus model and a web platform called MAIP available at https://www.ebi.ac.uk/chembl/maip/. MAIP is freely available for the wider community to make large-scale predictions of potential malaria inhibiting compounds. This project also highlights some of the practical challenges in reproducing published computational methods and the opportunities that open-source software can offer to the community.

Bosc, N., Felix, E., Arcila, R. et al. MAIP: a web service for predicting blood‐stage malaria inhibitors. J Cheminform 13, 13 (2021). https://doi.org/10.1186/s13321-021-00487-2

Series 4 Predictive Model - AI with Small Data from DeepMirror

Hello @mattodd @edwintse !
A few months ago we were contacted by DeepMirror who were developing an AI platform to work with small data and were interested in testing it with some experimental datasets.
We suggested to give it a go with the Series 4 molecules, as there is a good benchmark already and it's a difficult problem for AI due to the low-data scenario! I think the model is ready and they will be for sure happy to run predictions to help make a decision before synthesis if you need it.
I am tagging them here to follow up on this: Ryan, @rg314 and Andrea @a19s

Evariste Technologies compounds

Hi all,

We’ve already caught up with @mattodd about this but will just give a quick intro for everybody else. Evariste Technologies is a start-up focusing on a probabilistic approach to drug discovery. Briefly, the platform we’ve built, Frobenius, takes an existing dataset and identifies the most promising starting point/s, then designs a bunch of new compounds and scores them according to the likelihood of achieving a set of pre-specified endpoints.

We were really interested in the recent publication detailing the open competition run by the OSM team and thought we’d have a crack at the problem ourselves. The compounds attached are the output generated by Frobenius when it’s presented with the series 4 data. More specifically, we’ve taken the two most promising starting points, applied the various compound designers and selected a subset of the highest scoring compounds (filtered by a medicinal chemist for synthetic feasibility etc). The number associated with each compound is the probability of it achieving a pIC50 of 8.

As we mentioned to Mat, we’re keen to get some of these synthesised and are able to contribute towards cost of synthesis. We’re also more than happy for anyone interested in these compounds to use them as inspiration for similar structures, if this is the case, we’d really appreciate being kept in the loop as we can very readily score the idea in Frobenius.

If anyone’s interested in knowing more about the modelling than the (very) brief overview I’ve given here we’re happy to discuss in detail.

Best wishes,
Alfie

https://www.evaristetechnologies.com
https://www.linkedin.com/in/alfie-brennan-746ba6b1

Evariste Suggestions

Series4_EVT_Suggestions.xlsx

Malaria suggestions pIC50 8 .pdf

Suggestions for New Active Compounds

With the competition (#1) concluded and the winners announced (#18), they have been tasked with coming up with suggestions for new active compounds by using their developed models. These suggestions will be synthesised and sent for testing (hopefully) before Christmas. We can then add this all to the paper (#3) before submission.

In order to give us time to get started:

  • The deadline for first-pass suggestions will be close of business Friday 15th Nov.

  • The final deadline will be close of business Friday 22nd Nov to allow for an additional suggestion.


The structures below have been suggested by the winners to be synthesised. Currently no routes have been found to access the additional suggested compounds (with the exception of the leftmost one)

Untitled Wiley Document-3

Post Mortem Meeting Fri Jan 31st?

At the end of the project, after all predicted molecules have been synthesised and evaluated, it would be very good to get together to:

  1. Dissect the results and work out what we learned (the Post Mortem)
  2. Hear from all the submitters about the nature of their day jobs (what AI/ML people do)
  3. Hear from others in the field who didn't enter (what other AI/ML people do)
  4. Finish off the paper for submission (closure).

I'm about to book the Library room in the Royal Society of Chemistry in London for Friday January 31st. Would that work for everyone? Something like 10-4 so that people don't have to stay over.

We're going to try to keep the meeting free. We're also going to try to help students out with travel costs, but apologies to long-distance entrants, I'm not sure we can help with those kinds of costs...

@mmgalushka @jonjoncardoso @holeung @spadavec @luiraym @sladem-tox @gcincilla @wvanhoorn @BenedictIrwin @IamDavyG @murrayfold @drc007 @cdsouthan @bendndi @danaklug @fidiris

New Series 4 candidates based on generative model - EOSI

Hello @mattodd @edwintse,

At @ersilia-os we have tried to generate new Series 4 candidates. In short, we provide two tables:

  • A list of >100k molecules obtained with a generative model: download 100k
  • A relatively diverse selection of 1k molecules: download 1k

For a first assessment of the results, you can check this dynamic visualization of the selected 1k candidates. If a cluster is of particular interest, please refer to the full results to discover other similar molecules. You can also check a tree map of all molecules.

Our generative model approach is based on Reinvent 2.0. We have implemented several reinforcement-learning agents, aimed at optimizing activity and other desirable properties. This GitHub Repository contains more detailed information and source code.

This is the first time we run a generative model, so please bear with us. We will be more than happy to optimize further runs based on your feedback.

Thanks!
@GemmaTuron @miquelduranfrigola

COVID-19 Update/Finalisation of the Paper

Hi all,

In light of the COVID-19 situation, most places are going into closedown and people are starting to work from home (including UCL from this Friday). Even though I had finished making Exscientia 1b (#22), Dundee has also closed down today so we won't be able to get that tested any time soon.

The paper is in good shape (#3) and our current plan is to submit as is (i.e. with the 4 original suggested compounds) as soon as possible and go from there, with the possibility of sneaking in the remaining two compounds later on. This means we can begin to finalise everything for the paper. Considering the topical nature of this work, we are thinking of shooting for Nature Comms as a first go.

Here's what remains:

  • @jonjoncardoso: Could you 1) add in the experimental details of your model into the Experimental section at the end of the paper, and 2) add in a one sentence summary of the model into Table 2 in the R&D section.
  • @spadavec: Could you 1) add in the experimental details of your model into the Experimental section at the end of the paper, 2) add in a one sentence summary of the model into Table 2 in the R&D section, and 3) add your name and affiliation details to the first page.
  • I have gotten all the characterisation data for the compounds and will sort that out for the SI.
  • @wvanhoorn, @btatsis: Laksh left a comment in your experimental section about the references that need to be added. Would you know which references need to be added and can you add them in if so?

Once all these things are done, I can finalise all the formatting things for submission.

I hope that everyone stays safe over the next few months and everything will get back to normal in the not too distant future.

Final Preparations for Paper Submission

Hi all. I hope everyone's well. An update on the paper (this issue closes #26).

The current draft is here, along with all the other files. We've incorporated some email comments from people who are not regularly on Github, which has been useful.

We're just waiting on a couple of things that lockdown interrupted:

  1. Validation of the potency of the final four active compounds. They've been measured in duplicate in Dundee, but Dundee's (perfectly reasonable) policy is that published potencies should be measured in triplicate. Completing this should be quick after lockdown.

  2. Confirmation that the more recent compounds in the paper share the same phenotype, i.e. that they are all still active in the ion regulation (PfATP4) assay. I'm 95% sure we've not scaffold hopped, but it's a loose end. This means shipping compounds to Adele and Kiaran in Canberra, and the temporary snafu is that the compounds are currently locked away in our shutdown UCL lab. We'll get them out as soon as we can.

For both these final measurements we have stocks of the compounds, which is a positive, and both relevant assays are up and running.

In the meantime we can get the paper ready to submit. This current draft is looking very nice (don't worry about the reference list - @edwintse is going to format that last thing). I need people to have a read and suggest any changes they'd like. Ultimately everyone will need to give the green light to submission, and that can be given now if you want. We're going to target Nature Comms since there's a lot of unusual work here.

The SI is also here, and it would help if people could also quickly verify their contribution to this document.

I also need people to check (at end of paper):

  1. Competing interests (we currently have none)
  2. Manuscript contributions (we've made a decent guess)
  3. Any additional funding that needs to be declared (specific to the project, rather than e.g. core company funding)
  4. Any co-authors we may have missed (this is super important, and it's easy to make a minor mistake when there are lots of authors)

Suggestions for referees for the paper would be very useful too.

@jonjoncardoso @spadavec @wvanhoorn @btatsis @BenedictIrwin @gcincilla @holeung @alintheopen @IamDavyG @sladem-tox @luiraym @murrayfold

PfATP4 structure predicted with AlphaFold2

Hi @edwintse @mattodd

Last week the code of AlphaFold2 was published. Today EBI and DeepMind are releasing full structural proteomes predicted with this AI tool: https://alphafold.ebi.ac.uk/. Also InterPro and PFam now provide AlphaFold2 predictions. All very exciting.

I don't think PfATP4 (Q9U445) structure prediction is available yet, so we have run AlphaFold2 ourselves. Perhaps this is useful to revisit the docking approaches :)

pfatp4_alphafold_model

Note that we truncated the first 115 residues because they failed to align and eventually fold. HHblits was used for the MSA alignment.

Post Mortem Meeting on Jan 31st 2020

As mentioned in #21, we will be having a short one-day meeting on this competition, partly in order to see what went wrong/right, and partly to discuss this area of science in more depth.

This Issue is a place where we can consult on the day's schedule. Please suggest either things or people to be added in.

We'll look into a way to broadcast, e.g. via Zoom for remote attendees like @holeung and @IamDavyG

We'll be financially supporting the meeting (lunch, coffees, no attendance fee) via AI3SD money. If there are any potential sponsors of the meeting, I'd love to know. We can offer some travel reimbursement for student attendees (will start separate issue on this).

Predicting the Activity of Drug Candidates when There is no Target
Friday January 31st 2020
Royal Society of Chemistry, Burlington House, Piccadilly, London, W1J 0BD

10:00 Introduction about OSM and this competition (Mat Todd and Ed Tse)
10:30 Talks from competition entrants @BenedictIrwin (Optibrium)
11:00 Talks from competition entrants @wvanhoorn (Exscientia)
11:30 Talks from competition entrants @gcincilla (Molomics)
12:00 Talks from competition entrants AN Other
12:30 Lunch
13:30 Talks from invited guests Who?
14:00 Talks from invited guests Who?
14:30 Talks from invited guests Who?
15:00 Talks from invited guests Who?
15:30 Closing remarks
16:00 End

Genetic algorithm optimisation of Series 4 potency

The goal is to use my graph-based genetic algorithm (GA) to maximise pIC50 values predicted using machine learning (ML). Here are some preliminary results.

ML model 1 (ML1)

  • Random Forest (RF) model using 1024-bit ECFP4 fingerprints. Training was done with PyCaret.
  • Trained on all Series 4 pIC50 values in the training set (except 3 B-containing compounds for which the SMILES strings didn’t resolve).
  • The model has a precision of 91% on the holdout minus the B-compound but plus the 4 new experimentally validated molecules from the competition.
  • The model is available here
  • The ML models cannot predict pIC50 values that are significantly larger than those predicted for the training set, i.e. around 7. For new molecules, predicted pIC50 values close to 7 are thus potentially lower bound
  • I also made a ML model (ML1_65) where I removed molecules with pIC50 > 6.5 from the training set.

GA searches for molecules with large pIC50 values

  • The starting population is drawn from the training set for ML1_65. The population size is 50 and the search is run for 150 generations.
  • A GA search using ML1_65 to predict pIC50 values finds the top molecules shown below together with their predicted pIC50 values.
  • Note that two of the molecules are already known, i.e. rediscovered by the search, and have significantly larger pIC50 values than those predicted by the ML1_65 model.
  • Thus, the GA/ML approach can be used to identify potentially high scoring molecules.

Screenshot 2021-03-19 at 15 29 51

  • A GA search using ML1 to predict pIC50 values finds the top 10 molecules shown below together with their predicted pIC50 values.

Screenshot 2021-03-19 at 15 31 51

How to select the best molecules, if any, for further study?

  • Molecules with pIC50 values greater than ca 6.7 should not be eliminated based on the pIC50s
  • The GA search is stochastic so each new search tends to contribute more new molecules, so there are potentially hundreds of such molecules.
  • Should I try to filter them further? For example, by predicted solubility, synthetic accessibility, and/or RML. All these models have errors associated with them, so it may weed out potentially useful candidates.
  • Or is it better to post them all and have experts decide? Also, each molecule could potentially be tweaked a bit based on expert knowledge without impacting the predicted pCI50s significantly.

COMPETITION ROUND 2: A Predictive Model for Series 4

UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.

OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.

This aim of the competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.

The target of these molecules is strongly suspected to be PfATP4, since there has so far been essentially a perfect correlation between activity of molecules in this series vs the parasite and in an assay that measured ion regulation, used as a proxy for activity vs PfATP4. PfATP4 is an important target for the development of new drugs for malaria.

We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything need to adhere to the Six Laws.

This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.

Competition Timeline

  • Competition launch: The competition will run from 01/08/19 to 11/09/19.
  • Paper write-up: This will happen as the competition is being run and will be submitted to the forthcoming special issue of the Beilstein Journal of Organic Chemistry.
  • Judging and results: A panel (to be announced) will evaluate the models against an undisclosed test set to determine the model(s) best able to predict activity of knowns.
  • Synthesis of top compounds: With the best performing model(s) as judged above, the relevant submitters will be asked to suggest new potent Series 4 compounds. These will be synthesised and biologically evaluated to determine the predictive capabilities of the models.

The Competition
OSM will provide:

  • A dataset containing actives and inactive compounds against PfATP4 along with their in vitro potencies (here). This list has been updated to include the more recent Pathogen Box results from the Kirk lab that was used as the test set in the last competition.
  • The Master Chemical List which contains activity data for all OSM compounds from Series 1-4.
  • Jeremy Horst's Homology Model built from crystal structures of the closest mammalian homolog (SERCA)
    PfATP4-PNAS2014.pdb.txt
  • Details of the relevant mutations known to be associated with resistance.

Submission Rules:

  • Entries may either be submitted to directly to GitHub (uploaded in the Submitted Models folder in the Code tab above) or be uploaded onto an ELN and a link posted in this repository.
  • Entrants can work individually or in teams (no limit to team size).
  • Entrants must work openly during the competition. This doesn't necessarily mean that inputs have to be logged in real time (although that is strongly encouraged), but entries that have not openly deposited working data on a regular basis prior to the deadline will not be accepted.
    Open Electronic Notebooks (ELN) such as Labtrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry so GitHub is a useful place to tag others.
  • Entrants must agree to their work's incorporation into a future OSM journal publication(s).
  • Competition winner(s) will be authors on any relevant future paper(s).
  • Any valid* entries will at least be acknowledged on any relevant future paper(s) and if the contribution is significant may lead to authorship.

How will entries be assessed?
There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.

  • For the final submission, entrants will predict the potencies of an undisclosed set of Series 4 compounds (to be provided at a later date)
  • A judging panel (to be announced) will evaluate these predictions in comparison with experimental data to determine the winner(s)

What's the prize
Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium

*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.

Comments and questions can go below. The above rules/guidance will be periodically updated.

COMPETITION ROUND 2: RESULTS!

Round 2 (#1) is complete! Thank you again to all who participated.

We are pleased to announce the winner of the competition is Giovani Cincilla (@gcincilla; Molomics). A very close second place goes jointly to Willem van Hoorn (@wvanhoorn; Ex Scientia) and Ben Irwin/Mario Öeren/Tom Whitehead (@BenedictIrwin; Obtibrium and Intelligens). We’re also adding Davy Guan (@IamDavyG; USyd) to the list of winners as the best non-company entrant.

Congratulations! We will be awarding prizes to the best model(s) from a company as well as one from the wider community. £100 each will go to Giovanni and Davy for the company and non-company winners. £50 will also go to Willem and Ben/Mario/Tom (combined) as the runners-up. The prizes will be presented at the upcoming meeting (see below).

Huge thanks go to the other entrants (@mmgalushka, @jonjoncardoso, @holeung, @spadavec, @luiraym, @sladem-tox) – though you didn’t win, you’re still involved – see below!


What just happened?

The results from Round 1 may be found here. The submitted models for Round 2 were all here. These were evaluated vs. the actual potencies observed for these compounds, which are shown here and here. The analysis of the entries may be found here. This analysis has been conducted by Ed Tse (USyd & OSM), Murray Robertson (Strathclyde), Robert Glen (Cambridge) and Mat Todd (UCL & OSM).

Screen Shot 2019-11-04 at 10 05 32 am

Analysis of the models was done as follows:

  • All entries were collated into a table along with the experimental values
  • For easier comparison between entries, the values were normalised between 0 and 1
  • The predictions for each compound were plotted in separate column graphs to visually compare with the experimental results
  • The predictions were then classified as ‘correct’ or ‘incorrect’ (and further separated into false positives and false negatives)
  • Precision and recall were also calculated for each model as a further comparison
  • Winners were determined based on the highest precision calculated for each model from the “Raw (>25)” analysis

n.b. The initial analysis was done with relatively tight classifications limiting the results to >2.5 uM. The following analysis was relaxed by increasing the limit to >25 uM. Both can be found in the respective tabs at the bottom of the spreadsheet.

It’s quite interesting how similar some of the training structures were to the test structures. Murray has done an analysis of this here. This is to some extent expected, given that this is a lead optimisation project, but worth noticing in passing.


What’s next?

We are now moving on to the next phase of this project, which will involve:

  1. The prediction of new active compounds to be synthesised here in the lab, i.e. the Validation Phase. We will be reaching out to the winners to use their models to predict 2 new compounds each. This will give us a total of 8 compounds to make (by yours truly) which will be tested for parasite killing before Christmas (fingers crossed). The idea is that these are predictions of molecules that are ideally structurally distinct from the known molecules (as much as possible), to try to identify a highly potent new active series. However, the ideal case would be for each of the 4 teams to predict one molecule that can be as similar as they like to a known, and one “moonshot” molecule that is as structurally distinct from the model as possible, while being predicted to be active according to the model.

  2. Finishing off the paper describing this competition (here). All of the entrants from this round are asked to provide a brief summary of their methods (if possible) in a similar manner to what is currently in the paper for Round 1. Everyone please add your author details to the paper. This is a joint effort.

  3. Running a one-day meeting on AI/ML in drug discovery, using this competition as the focus of the meeting. This will be in London in mid-January (e.g. week of Jan 20th), and details will come as soon as we’ve booked the space. Hopefully everyone can attend, and we have some money available to cover some travel expenses. More on this ASAP. If anyone would like to come but cannot make mid-January, please say.

So well done everyone, and we’re excited to synthesise the new predictions of novel actives!

Writeup of the paper

With Round 2 of the competition officially underway (#1), I have begun to draft up the paper/fill out the wiki. This will include all the work done in both Round 1 and Round 2.

The paper is currently being written up in this google doc using the Beilstein JOC template:

https://docs.google.com/document/d/1aD29GjC8RjqrSDcWcEUptS04Z2v10deReRp0eB3kcp4/edit?usp=sharing

I have begun by writing some of the introduction and Round 1 of the competition. If the participants from Round 1 (@holeung, @kellerberrin, @spadavec, @gcincilla, @IamDavyG, @jonjoncardoso) were able to summarise your models and methods that would be super helpful. Any improvements made to the original models during this round can be written up under Round 2.

I am using temporary reference labels for now (surname of the first author and year of publication) so it will be easier to number them at the end.

Please add your author details to the top of the paper as well.

At the same time, all of the information that will be written in this paper will need to go into the relevant wiki pages too.

Changes in SMILES code in the Master Chemical List?

Hi everyone,

Our group at the Department of Informatics at King's College London - under Dr. Sophia Tsoka @sophiatsoka - have been revisiting this modelling challenge and we have some questions about changes in SMILES codes in the Master Chemical List.

Ruby (@yutongLi1997) has downloaded the newest version of the master list and compared it with the previous version I had from when I participated in Round #2 of the Competition.

She notice that the structures listed below were a bit different this time. My guess is that these compounds had the wrong SMILES and had been revised more recently but I couldn't locate the changes in the spreadsheet. Can anyone confirm this?

# OSM Codes:
['OSM-S-82', 'OSM-S-88', 'OSM-S-89', 'OSM-S-351', 'OSM-S-546', 'OSM-S-631']

#old_smiles:
 ['CNC(=O)COC(=O)c1cc(C)n(c1C)c2ccc(F)cc2',
 'CNC(=O)CN1CCC(CC1)NCc2cc(C)n(c2C)c3ccc(Cl)cc3',
 'CCNC(=O)[C@@H]1C[C@@H](N)CN1Cc2cc(C)n(c2C)c3ccccc3Cl',
 'Clc1cccc(c1Cl)c2nnc3cncc(OCCc4ccccc4)n23',
 'Fc1ccc(CCOc2cncc3nnc(c4ccc5c[nH]nc5c4)n23)cc1F',
 'COc1ccc(cc1)c2n[nH]c(n2)c3nccn3CCc4ccccc4']

#new_smiles:
['CC1=CC(=C(C)[N]1C2=CC=C(C=C2)F)C(=O)OCC(=NC)O',
 'CC1=CC(=C(C)[N]1C2=CC=C(C=C2)Cl)CNC3CCN(CC3)CC(=NC)O',
 'CCN=C([C@@H]1C[C@H](CN1CC2=C(C)[N](C(=C2)C)C3=CC=CC=C3Cl)N)O',
 'ClC1=CC(Cl)=C(C2=NN=C3C=NC=C(N32)OCCC4=CC=CC=C4)C=C1',
 'FC1=CC(CCOC2=CN=CC3=NN=C(C4=CC=C5C(NN=C5)=C4)N32)=CC=C1F',
 'COC(C=C1)=CC=C1C2=NN=C(N2)C3=NC=CN3CCC4=CC=CC=C4']

PS: What we have been up to

  • We are working on improving the accuracy of our algorithm (modSAR), assessing its weaknesses and limitations, while modelling OSM data.

  • The model I trained on Round 2 did not predict activity of the external test set that well even though the algorithm had performed well on previous datasets we've worked on. Changing from CDK molecular descriptors to more widely used RDKit circular fingerprints have already improved the fit and accuracy of the model in general, but we are still working on validating these results.

  • We are also planning to apply shapley values to help explain activity and to debug models results

  • I uploaded a Jupyter notebook to our repository with exploration on the earlier version of the dataset. Here is the link if anyone is interested.

Details of the AI3SD Proposal

The below text was used in an application submitted by @mattodd in Feb 2019 to the AI3SD network for funding. The title was "Predicting the Activity of Drug Candidates when there is No Target". The aim was to use an open approach to provide a real-world example of how new methods in AI/machine learning can actually impact drug discovery, and to do this by tackling a common and difficult problem: predicting actives in a phenotypic drug discovery project.

The Problem
We aim to use diverse AI approaches to develop new ways to solve one of the biggest challenges in drug discovery: the prediction of activity of drug candidates in the absence of a biological target. We aim to do this using a public competition and open data.

In modern drug discovery it is frequently the case that optimisation of drug candidates is undertaken in the absence of a known biological target – so called phenotypic drug discovery.[Nat. Rev. Drug Disc. 2017, 16, 531] In many therapeutic areas such an approach is seen as superior since it focuses efforts on those compounds known to be effective vs. whole cells or organisms. A common situation is that we have structure-activity relationship data (i.e. a collection of molecules and their associated biological activities) but we do not have information about the binding interactions of those molecules with a biological target. Yet we must be predictive of which molecules to make next, in order to allocate resources wisely. To date, the vast majority of such prediction has been based on the intuition of the medicinal chemists involved. This highly valuable resource has limitations of bias, or of imagination, or in some cases of resources: many small-scale drug discovery projects (particularly in academia, or in start-up companies) may have few people examining the data, meaning good hypotheses may be missed, or key insights overlooked. Manual organic synthesis of individual compounds designed in response to hypotheses – during the so-called Lead Optimisation phase – is among the most expensive areas of drug discovery. It is not unusual for the synthesis of one molecule to require two weeks of a postdoctoral researcher’s time, equating to ca. £2K per compound. If we are to identify the medicines society most needs, we must become significantly more efficient at the prediction of phenotypic potency.

Drug discovery is a complicated, multi-faceted process involving a range of expertise that varies according to the stage of a project. The design of a compound intended to achieve a biological end involves disciplines across organic chemistry, medicinal chemistry, pharmacokinetics, computational chemistry and, usually, biology of the relevant organism (be that a pathogen, or human biology). There is a requirement of strategic planning through project management and a delicate balance of resources vs. potential gain - i.e. when to “kill” a project based on perceived likely return on investment. All of these roles are required in the specific drug discovery project at the heart of this research proposal. Open Source Malaria (OSM) involves scientists at UCL and The University of Sydney, but also scientists from elsewhere in the world such as those from the 20 other institutions that contributed to OSM’s first research paper. The preliminary attempts at solving the present research problem (described below under preliminary data) have come from the US, UK and Australia, from both the public and private sectors but also citizen scientists. This project involves people from the broadest range of professional backgrounds.

The project concerns predicting biological activity for Open Source Malaria Series 4, the most current pressing research problem for OSM. Over 200 molecules are known in this series, with potencies against the malaria parasite ranging from inactive to sub-10 nM. Yet it is still the case that weeks of laboratory-based effort may beexpended in making reasonable-looking molecules that are found to have zero potency. Series 4 is highly promising: several members have cured malaria in the mouse model of the disease. This isthe closest an open source series has ever been to the clinical phase of investigation.

The biological target is thought to be the ion pump PfATP4, an essential part of the parasite’s machinery in maintaining ion balance when inside a red blood cell. Despite extensive effort, the structure of this large, complex membrane-bound protein remains unsolved. The target is implicated by genetic changes found in resistant mutants. Understanding PfATP4 is crucial because it is the supposed target of the newest antimalarial to reach Phase III clinical trials, KAE609 (Cipargamin), developed by Novartis. Mysteriously, PfATP4 is also inhibited by a bewildering array of unrelated chemotypes.[Int. J. Parasitol. 2015, 5, 149] It is unclear how this is possible.

A competition was announced by OSM in 2016, run and concluded. All available data were curated for the community and submissions of models, using any methodology, were encouraged from OSM’s contributor network. Six diverse, fully-fledged entries were accompanied by full details. These models were evaluated against a test dataset (the MMV Pathogen Box) that had not been disclosed. Evaluation of the entries by a scientific advisory panel led to the award of the prize to two equally well-performing models. These models are not yet highly predictive, despite the quality of the input data and the relevant expertise of the entrants. This proposal now aims to build on this significant preliminary work. All original submitters are willing to improve the models and wish to publish the work, providing an excellent community starting point for this proposal.

Our aim is to become more predictive of potencyby building new models by whatever means possible. This research project will apply new AI methods, as part of an open competition, to generate a high quality predictive model for this important antimalarial series. There have been clear and exciting advances in recent years in applying computational approaches (e.g. matched pairs analysis) to the prediction of biological activity. However, the time is right for a broader exploration of approaches to compound prediction, using newer methods of machine learning that are being trialled by some of the leading companies in the field of AI who have joined this application. It is time that there were available to the scientific community clear examples of the potential impact of machine learning methods in real drug discovery projects, in order that we might more clearly understand the impact of such new technologies, and to clearly distinguish current state of the art from hype. To achieve this, and to discover the best ideas from any quarter, we propose to mix the application of new methods from the private sector with a public competition to which anyone may submit solutions. What has been missing to date from the OSM team, and what is missing in the vast majority of drug discovery projects around the world, is AI. Essentially no drug discovery project other than those taking place in the largest pharma companies, or those taking place at new AI-centric companies, involve any significant element of AI at all. This is an astonishing situation, one that we hope to reverse through the outcomes of this public-facing project.


Project Aims
To develop a general AI-enabled approach to solving the prediction of biological activity in phenotypic drug discovery, through the use of a public competition. Objectives are as follows:

  • Data curation and sharing of everything that is known of the activity of molecules in OSM Series 4.
  • Invitation to our co-applicant partners to submit models, as well as an invitation for the public to contribute.
  • Evaluation of the predictive models, including via synthesis of several of the predicted compounds. Winners are determined and modest prizes awarded.
  • Post-mortem collaborative discussion of the outcomes at a physical workshop.
  • Publication of a summary article. Description of how the project can be continued

Project Method

  • Data on many OSM Series 4 compounds, an other compounds also targeting PfATP4, are already available. The dataset will be updated with new compounds from OSM, the general literature and the MMV Pathogen Box. A small subset will be held from public view. A Github post will be written to announce the project, along with the rules for participation. We will disseminate widely, including through the AI3SD network.
  • Details of working and methods can be uploaded to lab notebooks such as Labarchives, Zenodo. Commercial partners may need to keep back proprietary details (what they term “magic sauce”) but they have committed to keep this to a minimum.
  • An expert scientific advisory group will assess the intrinsic performance of the models and vs. the small unpublished dataset.
  • Results are announced, and predictions of new compounds are solicited from the best-performing models.
  • Synthesis is carried out of five new molecules predicted to be potent. This will use existing in-kind resources available in Todd’s lab (postdoctoral researcher as part of OSM 2019–20, plus chemicals and consumables). The molecules will be assessed for potency at the University of Dundee, and tested in an ion regulation model of PfATP4 at the Australian National University, Canberra (Prof. Kiaran Kirk). All activities funded in-kind.
  • A workshop is held in London to which competition participants will be invited (depending on the number of entries, costs will be covered), and the wider AI3SD community will be invited. The workshop will carry out a post-mortem of the competition. Two small prizes are awarded, one for a private sector entry and one for a public sector entry. Interested parties will collaborate on ways forward and funding.
  • A full write-up of this project will be submitted to a special issue of the Beilstein Journal of Organic Chemistry, in the thematic issue on Medicinal Chemistry, for which Prof. Todd is the invited Editor (to be completed 2019). The paper will be a new style of “Challenge Paper” the journal seeks to pioneer, in which the paper is framed as a “living project” to which others are invited to contribute the next stages

Series 4 optimization through Collective Intelligence

Series 4 optimization through Collective Intelligence

Human Collective Intelligence (HCI) is the knowledge, skills & intuition of all the members of a team empowered by their real-time collaboration. Here we propose to use HCI to optimize OSM Series 4 compounds through the predictive model that won the competition and a Molomics web-based technology working with the most popular web browsers (e.g. Firefox, Chrome).

The technology allows to collaboratively explore & exploit the chemical space by designing & evaluating molecules. For any designed molecule, the system will currently provide prediction of Pfal activity, solubility (logS) and Caco-2 cells permeability. Other molecular properties (e.g. synthetic accessibility) can be evaluated by team members.
The technology is rather intuitive so, if you prefer, you can stop reading and directly access it at:

https://osm.molomics.com/w/signup/signup.html

Otherwise, please get a better understanding about the technology following the instructions below:

Step 1: learn the technology basics through the Sandbox

The first time you access the technology you'll be able to access only 1 toy project called “Sandbox”. There you will learn the technology basics in about 10 minutes. This includes how to sketch an organic molecule, how to browse all the team molecules and how to access molecule information.

Step 2: browse the team molecules

Upon completion of the Sandbox, you'll be able to access the real project called “Plasmodium falciparum OSM-S4”. Once inside, you can browse all the molecules generated by the team for this project and you can filter them by tags, molecular properties and team members. Through the molecule information you can see its structural alerts, human evaluations and also how the molecule was generated.

Step 3: design new molecules

After you get inspired by for example exploring already tested molecules, you can design new molecules by modifying any existing compound or by starting drawing from scratch. The objective is to design new Series 4 molecules with good predicted properties. Please consider that any molecule prediction has a prediction confidence (based on the model applicability domain) associated to it (i.e. traffic light colors on the right side of prediction represent high, medium and low confidence). An ideal molecule has favorable predictions and high prediction confidences. Molecules with low confidence predictions should be avoided.

Step 4: evaluate molecules

Molecules can be evaluated by the team members by accessing the evaluation tab in the molecule information.

Questions

Any question about the technology can be asked here. Questions about molecules are addressed directly in the technology.

Progress on the Synthesis of Suggested Compounds

Thank you again to the winners for all the interesting suggestions for new active compounds (#19)! With so many different steps, I've been multitasking as much as possible to make these compounds as quickly as I can so in case you have been wondering where we are currently at with this, I've put together the scheme below which outlines that progress that I have made thus far. The coloured arrows represent the status of the reactions (green = done, blue = in progress, red = not yet started). I'll try to update this scheme as I knock off more of the intermediates. Final compounds and their suggesters are in the shadowed boxes. All the experimental information can be found in my ELN.

Progress

LabTrove author ID issues

I have lost access to my old LabTrove ELN in the years between Round 1 and 2 of this competition so I have made a replacement LabTrove ELN for Round 2. The issue is that I am appearing as a blank author because I messed up my registration while trying to reclaim my old LabTrove ELN. I have not found any way to change my name so is it possible for an administrator to just delete my account for me to reregister? Here is the LabTrove ELN I am referring to: https://malaria.ourexperiment.org/osm_s4_r2_2019

Predicting blood stage activity with the Malaria Inhibitor Prediction (MAIP) Platform

The European Bioinformatics Institute (EBI), with support from MMV have launched a website allowing people to upload sets of compounds for blood stage activity prediction.

https://www.ebi.ac.uk/chembl/maip/

The model was trained on over 5 million data points on screening collections from Novartis, GSK, St Jude Children's Research Hospital, AstraZeneca and MMV.


I've put all the Series 4 compounds through the platform and the results are below:

Screen Shot 2020-05-21 at 4 10 16 pm

vuechart-example

n.b. 5 SMILES couldn't be parsed correctly.

Top 50% contains 267 compounds with the highest scores, which are all over 25.8
Top 10% contains 53 compounds with the highest scores, which are all over 36.27
Top 1% contains 5 compounds with the highest scores, which are all over 49.52

I've added in the mean EC50 values into the output predictions.csv file for compounds that we've tested. You can then filter the columns by descending model_score.

predictions.csv.zip

Abstract of Project

We could use a simple abstract of the project, suitable for the paper, but also suitable for the base Readme on this repo.
@edwintse could you please take the text that is currently on the wiki landing page (https://github.com/OpenSourceMalaria/Series4_PredictiveModel/wiki) and put that on the landing page for the repo (the "code" page) and then tweak it suitable for interested scientists who may not be familiar with what's going on? Two paras - what has happened up to this point and then what round 2 is hoping to achieve. We can copy and paste it into the paper when done.

Scheme Describing Round 1

The paper needs a succinct scheme describing the outcome of Round 1. I think this needs to be a landscape figure with columns detailing: 1) Entrant name, 2) Several-word description of method, 3) Structures of the predicted actives, 4) winner or not.

@edwintse or anyone else, can you please think whether this would work and paste possibilities in this Issue?

Any Relevant Literature for the Paper

This is a placeholder issue where we can insert in the comments any papers that we think might be relevant to the competition paper. Important that we gather papers relevant to AI/ML in drug prediction, but also anything else. If clearly relevant, can be put in paper directly. If not, can discuss here.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.