isayev / release

Deep Reinforcement Learning for de-novo Drug Design

License: MIT License

Jupyter Notebook 99.10% Python 0.90%

cheminformatics deeplearning drug-discovery molecular-modeling qsar reinforcement-learning

release's Issues

LogP Example: "TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not list"

Hi,

I'm re-running the LogP example using the current version of PyTorch, and execution stops in the reinforcement loop with a TypeError, shown below. Are you aware of any changes in PyTorch that could be responsible for this? Is there a solution?

Thanks!

for i in range(n_iterations):
    for j in trange(n_policy, desc='Policy gradient...'):
        cur_reward, cur_loss = RL_logp.policy_gradient(gen_data)
        rewards.append(simple_moving_average(rewards, cur_reward)) 
        rl_losses.append(simple_moving_average(rl_losses, cur_loss))
    
    plt.plot(rewards)
    plt.xlabel('Training iteration')
    plt.ylabel('Average reward')
    plt.show()
    plt.plot(rl_losses)
    plt.xlabel('Training iteration')
    plt.ylabel('Loss')
    plt.show()
        
    smiles_cur, prediction_cur = estimate_and_update(RL_logp.generator, 
                                                     my_predictor, 
                                                     n_to_generate)
    print('Sample trajectories:')
    for sm in smiles_cur[:5]:
        print(sm)

with the error below:

Policy gradient...:   0%|          | 0/15 [00:00<?, ?it/s]./release/data.py:98: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(tensor).cuda()
Policy gradient...:   0%|          | 0/15 [00:00<?, ?it/s]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-7a3a9698cf0c> in <module>
      1 for i in range(n_iterations):
      2     for j in trange(n_policy, desc='Policy gradient...'):
----> 3         cur_reward, cur_loss = RL_logp.policy_gradient(gen_data)
      4         rewards.append(simple_moving_average(rewards, cur_reward))
      5         rl_losses.append(simple_moving_average(rl_losses, cur_loss))

~/work/li/leadopt/generator/ReLeaSE/release/reinforcement.py in policy_gradient(self, data, n_batch, gamma, std_smiles, grad_clipping, **kwargs)
    117                     reward = self.get_reward(trajectory[1:-1],
    118                                              self.predictor,
--> 119                                              **kwargs)
    120 
    121             # Converting string of characters into tensor

<ipython-input-33-a8c049e9e937> in get_reward_logp(smiles, predictor, invalid_reward)
      1 def get_reward_logp(smiles, predictor, invalid_reward=0.0):
----> 2     mol, prop, nan_smiles = predictor.predict([smiles])
      3     if len(nan_smiles) == 1:
      4         return invalid_reward
      5     if (prop[0] >= 1.0) and (prop[0] <= 4.0):

~/work/li/leadopt/generator/ReLeaSE/release/rnn_predictor.py in predict(self, smiles, use_tqdm)
     62                 self.model[i]([torch.LongTensor(smiles_tensor).cuda(),
     63                                torch.LongTensor(length).cuda()],
---> 64                               eval=True).detach().cpu().numpy())
     65         prediction = np.array(prediction).reshape(len(self.model), -1)
     66         prediction = np.min(prediction, axis=0)

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/work/source/repos/OpenChem/openchem/models/Smiles2Label.py in forward(self, inp, eval)
     41         else:
     42             self.train()
---> 43         embedded = self.Embedding(inp)
     44         output, _ = self.Encoder(embedded)
     45         output = self.MLP(output)

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/work/source/repos/OpenChem/openchem/modules/embeddings/basic_embedding.py in forward(self, inp)
      7 
      8     def forward(self, inp):
----> 9         embedded = self.embedding(inp)
     10         return embedded

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    124         return F.embedding(
    125             input, self.weight, self.padding_idx, self.max_norm,
--> 126             self.norm_type, self.scale_grad_by_freq, self.sparse)
    127 
    128     def extra_repr(self) -> str:

/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815 
   1816 

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not list

The predictor seems to be a traditional classification model, not a neural network. Why does your paper say the predictor uses an LSTM?

Hi, dear authors,
The predictor seems to be just a traditional model, not a neural network. Why does your paper (Deep reinforcement learning for de novo drug design) say that the predictor model uses an LSTM?
I think only the generator is based on a neural network, right? I got confused when comparing your code with your paper; please give your readers some help. If you have updated the code, please let me know.

Weights are not updating

Hi! My weights don't seem to be changing.

I printed out the grad for each of the layers and it comes back as None; the grad of the hidden layer is None as well.

The only thing I changed in the reinforcement file was rl_loss -= (log_probs[0, top_i].cpu().detach().numpy()*discounted_reward), and then I tried rl_loss -= (log_probs[0, top_i].item()*discounted_reward), because of an issue with cuda device = 0.

Do you know of any reason why grad would come back as None for all of the layers? I believe this is why optimizer.step() has no effect and the weights are not updating.

RecurringQSAR-Example: "No module named 'data_preprocessing" and missing "

Hi there,

Running the RecurringQSAR-example.ipynb, there is an import statement:
import data_preprocessing as dp
that fails because this module is not available.

In fact, this is not really a big deal, because the module is apparently not used anywhere else in the notebook, so I could just comment it out. However, the notebook then fails when looking for the jak2_data.csv file.

I understand this issue has been raised before, and that you used proprietary data. However, would it be possible just to upload some sample data, so we can run tests and know the expected format?

Thanks!

LogP files missing

Hi, I'm trying to work on the example notebooks included in the package. Working on the LogP notebook, I noticed that:

File paths are pointing to wrong directories:
/data/masha/generative_model/chembl_22_clean_1576904_sorted_std_final.smi
/home/mariewelt/Notebooks/gan_oracle/oracle_data/logP_labels.csv
These I could easily fix by changing the prefix to ./data/
There is a file still missing: /home/mariewelt/Notebooks/PyTorch/Model_checkpoints/generator/checkpoint_lstm
I looked in the ReLeaSE/checkpoints/generator/ folder, but this file is not there. Is it possible to generate this file before running this notebook? How? Or is it just missing from the tree?

Thanks!

modulenotfound error

missing jak2_data.csv

I can't find jak2_data.csv in the data folder.

Possible bug in reinforcement.py, line 114

Hi all,
I'm a little confused about the endless while loop. Do you want to skip invalid SMILES?
If so, adding "break" right after "reward = 0" would work. However, I think you want to penalize invalid SMILES, so should the reward simply be replaced with some negative value?
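The second option could be sketched as follows: cap the number of sampling attempts and return a negative reward when every attempt is invalid, so the loop always terminates. This is only an illustration, not the code in reinforcement.py; `sample_fn`, `is_valid`, `reward_fn`, `invalid_reward`, `max_attempts` and the toy validity check are all hypothetical names.

```python
def sample_with_penalty(sample_fn, is_valid, reward_fn,
                        invalid_reward=-1.0, max_attempts=10):
    """Sample until a valid string appears or the attempt budget runs out."""
    trajectory = None
    for _ in range(max_attempts):
        trajectory = sample_fn()
        if is_valid(trajectory):
            return trajectory, reward_fn(trajectory)
    # Every attempt was invalid: penalize the last sample instead of looping forever.
    return trajectory, invalid_reward


def balanced_parentheses(smiles):
    """Toy validity check (hypothetical); a real check would parse the SMILES."""
    return smiles.count("(") == smiles.count(")")


# Toy usage: one sampler that always fails the check, one that always passes.
invalid_case = sample_with_penalty(lambda: "C(C", balanced_parentheses, len)
valid_case = sample_with_penalty(lambda: "CCO", balanced_parentheses, len)
print(invalid_case, valid_case)
```

Whether a negative reward or simply skipping the sample is preferable depends on how the policy-gradient loss uses the reward, which this sketch does not address.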

RAM requirements

How much RAM do I need to run logP prediction?
I cannot create the generator data
gen_data = GeneratorData(training_data_path=gen_data_path, delimiter='\t',
                         cols_to_read=[0], keep_header=True, tokens=tokens)
because it uses up all available memory.

Missing a LICENSE file

Shall I put together a pull request with the MIT license? ;-)

missing jak2 train

I tried to run your JAK2-demo using my own JAK2 data, but when I reach the model-training step I get an unexpected error.

Executing the code in the following section produces the error:

rewards = []
n_to_generate = 1000
n_policy_replay = 10
n_policy = 5
n_transfer = 500
n_iterations = 5
prediction_log = []

for _ in range(n_iterations):
     
    ## Transfer learning 
    RL.transfer_learning(transfer_data, n_epochs=n_transfer)
    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

    ### Policy gradient with experience replay 
    for _ in range(n_policy_replay):
        rewards.append(RL.policy_gradient_replay(gen_data, replay, threshold=threshold, n_batch=10))
        print(rewards[-1])
    
    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)
    
    ### Policy gradient without experience replay 
    for _ in range(n_policy):
        rewards.append(RL.policy_gradient(gen_data, threshold=threshold, n_batch=10))
        print(rewards[-1]) 

    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

I get the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-51-d78b3f91f2cc> in <module>()
     11 
     12     ## Transfer learning
---> 13     RL.transfer_learning(transfer_data, n_epochs=n_transfer)
     14 #     _, prediction = estimate_and_update(n_to_generate)
     15 #     prediction_log.append(prediction)

/project/ReLeaSE/reinforcement.py in transfer_learning(self, data, n_epochs, augment)
    131 
    132     def transfer_learning(self, data, n_epochs, augment=False):
--> 133         _ = self.generator.fit(data, n_epochs, augment=augment)

/project/ReLeaSE/stackRNN.py in fit(self, data, n_epochs, all_losses, print_every, plot_every, augment)
    332         for epoch in range(1, n_epochs + 1):
    333             inp, target = data.random_training_set(smiles_augmentation)
--> 334             loss = self.train_step(inp, target)
    335             loss_avg += loss
    336 

/project/ReLeaSE/stackRNN.py in train_step(self, inp, target)
    287         for c in range(len(inp)):
    288             output, hidden, stack = self(inp[c], hidden, stack)
--> 289             loss += self.criterion(output, target[c])
    290 
    291         loss.backward()

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    860     def forward(self, input, target):
    861         return F.cross_entropy(input, target, weight=self.weight,
--> 862                                ignore_index=self.ignore_index, reduction=self.reduction)
    863 
    864 

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   1548     if size_average is not None or reduce is not None:
   1549         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1550     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   1551 
   1552 

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1401         raise ValueError('Expected 2 or more dimensions (got {})'.format(dim))
   1402 
-> 1403     if input.size(0) != target.size(0):
   1404         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
   1405                          .format(input.size(0), target.size(0)))

RuntimeError: dimension specified as 0 but tensor has no dimensions

I didn't encounter any errors before this step, but this step fails unexpectedly. Can you give me some guidance?

small correction in installation instructions

In the installation documented in the README, I am pretty sure the conda command for creating a new environment should be:

conda create -n release python=3.6

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands

Also the specific version of ?? in pip_requirements.txt doesn't seem to be available anymore?

(release) [jclin@longleaf-login4 ReLeaSE]$ pip install tensorflow==1.11.0rc1
ERROR: Could not find a version that satisfies the requirement tensorflow==1.11.0rc1 (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.12.2, 1.12.3, 1.13.1, 1.13.2, 1.14.0, 1.15.0, 1.15.2, 1.15.3, 1.15.4, 1.15.5, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.2.0rc0, 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0rc0, 2.4.0rc1, 2.4.0rc2, 2.4.0rc3, 2.4.0rc4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.5.1, 2.5.2, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2)
ERROR: No matching distribution found for tensorflow==1.11.0rc1

In my installation, I used pip install tensorflow==1.11.0 instead for now.

LogP_optimization_demo-RNNPredictor

Dear author, I would like to ask: when I run this line of code: my_predictor = RNNPredictor(path_to_params, path_to_checkpoint, predictor_tokens)
I get the following error:

How can I solve this error? It seems to be a problem caused by OpenChem.

Error while running the LogP prediction model

Hi,
I tried to run the following example (https://github.com/isayev/ReLeaSE/blob/master/RecurrentQSAR-example-logp.ipynb), and it ends with the following error.

Can you please let us know whether RecurrentQSAR-example-logp.ipynb is working or whether there are any bugs?

Thanks,

IndexError: too many indices for array in ipynb file

Hi all, I am getting a "too many indices for array" error while trying to run the ipynb file with a new CSV. See the attachment for the error.

jak2_data.csv

I can't find jak2_data.csv in the clone. How can I get jak2_data.csv?

Model Loading Error

Sorry for the inconvenience!

my_generator.load_model('checkpoints/generator/checkpoint_biggest') leads to the following error:

RuntimeError: Error(s) in loading state_dict for StackAugmentedRNN:
Missing key(s) in state_dict: "rnn.weight_ih_l0", "rnn.weight_hh_l0", "rnn.bias_ih_l0", "rnn.bias_hh_l0", "rnn.weight_ih_l0_reverse", "rnn.weight_hh_l0_reverse", "rnn.bias_ih_l0_reverse", "rnn.bias_hh_l0_reverse".
Unexpected key(s) in state_dict: "gru.weight_ih_l0", "gru.weight_hh_l0", "gru.bias_ih_l0", "gru.bias_hh_l0".
size mismatch for stack_controls_layer.weight: copying a param of torch.Size([3, 1000]) from checkpoint, where the shape is torch.Size([3, 1500]) in current model.
size mismatch for stack_input_layer.weight: copying a param of torch.Size([100, 1000]) from checkpoint, where the shape is torch.Size([1500, 1500]) in current model.
size mismatch for stack_input_layer.bias: copying a param of torch.Size([100]) from checkpoint, where the shape is torch.Size([1500]) in current model.
size mismatch for encoder.weight: copying a param of torch.Size([45, 500]) from checkpoint, where the shape is torch.Size([45, 1500]) in current model.
size mismatch for decoder.weight: copying a param of torch.Size([45, 1000]) from checkpoint, where the shape is torch.Size([45, 1500]) in current model.

Do you potentially know why this is the case? Thank you!
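The key names in the message (gru.* on one side, rnn.* plus *_reverse on the other) and the size mismatches suggest that the model was constructed with a different layer type, directionality and hidden size than the one that produced the checkpoint. A quick way to see every difference at once is to compare the two sets of parameter names and shapes. The sketch below works on plain dictionaries; `diff_state_shapes` is a hypothetical helper, and the shapes are toy values, not those of the real checkpoint.

```python
def diff_state_shapes(checkpoint_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings."""
    missing = sorted(set(model_shapes) - set(checkpoint_shapes))
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    mismatched = {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in sorted(set(checkpoint_shapes) & set(model_shapes))
        if checkpoint_shapes[name] != model_shapes[name]
    }
    return missing, unexpected, mismatched


# With real PyTorch objects, each mapping could be built as
#   {k: tuple(v.shape) for k, v in state_dict.items()}
# The names below are borrowed from the error message; the shapes are made up.
checkpoint_shapes = {"gru.weight_ih_l0": (12, 4), "encoder.weight": (45, 4)}
model_shapes = {"rnn.weight_ih_l0": (8, 8), "encoder.weight": (45, 8)}

missing, unexpected, mismatched = diff_state_shapes(checkpoint_shapes, model_shapes)
print("missing in checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
print("shape mismatches (checkpoint, model):", mismatched)
```

If the report is non-empty, the constructor arguments (for example layer_type, is_bidirectional, hidden_size and stack_width) probably need to match the values used when the checkpoint was saved.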

Multiple parameters and new checkpoint_biggest

Hello,
I was wondering if you have the Python code showing how you created checkpoint_biggest from the ChEMBL file. The abstract of the paper also mentions loading multiple parameters; do you have any implementation with more than one parameter?

Thank you,
Kevin

uploading the JAK2 model

I read that the JAK2 data itself is proprietary, but would it be possible to upload the trained model without the data? As noted in the JAK2 notebook, 2000 samples are not many for training a neural-network predictor, so it would be great if the model itself could be uploaded.

RuntimeError: dimension specified as 0 but tensor has no dimensions

When I run the model training loop:

for _ in range(n_iterations):

    ### Transfer learning 
    RL.transfer_learning(transfer_data, n_epochs=n_transfer)
    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

    ### Policy gradient with experience replay 
    for _ in range(n_policy_replay):
        rewards.append(RL.policy_gradient_replay(gen_data, replay, threshold=threshold, n_batch=10))
        print(rewards[-1])

    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

    ### Policy gradient without experience replay 
    for _ in range(n_policy):
        rewards.append(RL.policy_gradient(gen_data, threshold=threshold, n_batch=10))
        print(rewards[-1]) 

    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

I get this error:

RuntimeError Traceback (most recent call last)
in ()
10
11 ### Transfer learning
---> 12 RL.transfer_learning(transfer_data, n_epochs=n_transfer)
13 _, prediction = estimate_and_update(n_to_generate)
14 prediction_log.append(prediction)

~/ReLeaSE/reinforcement.py in transfer_learning(self, data, n_epochs, augment)
131
132 def transfer_learning(self, data, n_epochs, augment=False):
--> 133 _ = self.generator.fit(data, n_epochs, augment=augment)

~/ReLeaSE/stackRNN.py in fit(self, data, n_epochs, all_losses, print_every, plot_every, augment)
330 for epoch in range(1, n_epochs + 1):
331 inp, target = data.random_training_set(smiles_augmentation)
--> 332 loss = self.train_step(inp, target)
333 loss_avg += loss
334

~/ReLeaSE/stackRNN.py in train_step(self, inp, target)
285 for c in range(len(inp)):
286 output, hidden, stack = self(inp[c], hidden, stack)
--> 287 loss += self.criterion(output, target[c])
288
289 loss.backward()

~/anaconda3/envs/ReLeaSE/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
489 result = self._slow_forward(*input, **kwargs)
490 else:
--> 491 result = self.forward(*input, **kwargs)
492 for hook in self._forward_hooks.values():
493 hook_result = hook(self, input, result)

~/anaconda3/envs/ReLeaSE/lib/python3.6/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
757 _assert_no_grad(target)
758 return F.cross_entropy(input, target, self.weight, self.size_average,
--> 759 self.ignore_index, self.reduce)
760
761

~/anaconda3/envs/ReLeaSE/lib/python3.6/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce)
1440 >>> loss.backward()
1441 """
-> 1442 return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
1443
1444

~/anaconda3/envs/ReLeaSE/lib/python3.6/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce)
1326 raise ValueError('Expected 2 or more dimensions (got {})'.format(dim))
1327
-> 1328 if input.size(0) != target.size(0):
1329 raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
1330 .format(input.size(0), target.size(0)))

RuntimeError: dimension specified as 0 but tensor has no dimensions

The load_dictionary method of the PredictorData class should not have been removed?

Just run the example JAK2-demo.ipynb with the recent modifications and you'll see what I mean.

gpu running out of memory

Good afternoon,

I've been using the code from the develop branch with pytorch 0.4. I am having this memory issue below when executing this piece of code from the notebook example:

    ### Transfer learning 
    RL.transfer_learning(transfer_data, n_epochs=n_transfer)
    _, prediction = estimate_and_update(n_to_generate)
    prediction_log.append(prediction)
    if len(np.where(prediction >= threshold)[0])/len(prediction) > 0.15:
        threshold = min(threshold + 0.05, 0.8)

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCTensorMath.cu:35

Any idea of what might be causing this problem?

A little question about the reward function

Hello, I want to try your method to design some molecules with my own data set. Do the two models for pIC50 and metal melting temperature differ only in their reward functions? Can you explain the significance of the reward function for the melting temperature of metals, and why it has this form? Or could you provide the demo code for this experiment? I look forward to your reply!

JAK2-demo fine-tuning

Hello!

I am a student studying your code, and I have some questions.
At the end of your 'JAK2-demo.ipynb' notebook, there is this code:

while len(jak2_compounds) < 1000:
    sm = my_generator.evaluate(gen_data, temperature=0.5)[1:-1]
    clean_sm, pred, nan_sm = jak2_predictor.predict([sm])
    if len(clean_sm) > 0 and pred[0] >= 0.8:  # probably greater than 0.8
        jak2_compounds += clean_sm
save_smi_to_file('/generated_compounds/test_ask1/' + str(i+1) + '.txt', jak2_compounds)

The gen_data passed to the 'my_generator.evaluate' function comes from a file of the ChEMBL 22 database. However, the file you save at the end is called jak2_compounds.

If the last part creates new jak2_compounds through fine-tuning, I think the data used for gen_data should contain JAK2 data rather than ChEMBL data.

May I ask whether I am right?
Also, I would like to know which part of the code does the fine-tuning. I'm learning a lot thanks to you, and I would appreciate your reply.

Transfer Data

Hi!

I'm unable to find the transfer learning data-- transfer_data.smi. Do you potentially know what the cause might be?

Thanks!

How to randomly generate molecules?

Hello, I am very happy to see your research. I want to use the trained generative model to generate molecules randomly, without setting any conditions, but I don't know how. Could you give me some guidance? Is it possible to generate molecules randomly rather than conditionally?
Best!

How to determine the tokens for a new data set

I collected a large SMILES data set and wanted to train the generative model from scratch. I counted the unique characters across all SMILES, as follows:
#%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy

But in your `JAK2_min_max_demo.ipynb` I see:

tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']','\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

Then I read the SMILES data file you provided, chembl_22_clean_1576904_sorted_std_final.smi, and collected its unique characters, but I found that the result is not equal to the tokens in `JAK2_min_max_demo.ipynb`:

chem_smiles = read_smi_file("ReLeaSE/data/chembl_22_clean_1576904_sorted_std_final.smi")
ch_smiles = [i.split("\t")[0] for i in chem_smiles[0]]

tokens2 = list(set(''.join(ch_smiles)))
tokens2 = list(np.sort(tokens2))
tokens2 = ''.join(tokens2)

The tokens2 result is: #%()+-./0123456789=BCFHINOPS[\\]clnoprs

Even setting aside < and >, which denote the beginning and end, the two token sets are not equal. Why is that? What processing did you apply to chembl_22_clean_1576904_sorted_std_final.smi?

Can you give me more guidance? Thank you very much.
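A fixed token list can contain characters that never occur in a particular file; the notebook list quoted above, for instance, includes '@', 'A', 'e', 'i' and '\n', none of which appear in the tokens2 string. The sketch below derives a character-level token list from a set of SMILES and reports the differences against a reference list. `build_tokens`, `compare_tokens` and the toy data are hypothetical and only illustrate the comparison; they are not the repository's preprocessing.

```python
def build_tokens(smiles_list, start_token="<", end_token=">"):
    """Character-level vocabulary: start/end markers first, then sorted characters."""
    chars = sorted(set("".join(smiles_list)) - {start_token, end_token})
    return [start_token, end_token] + chars


def compare_tokens(reference_tokens, derived_tokens):
    """Return (only in reference, only in derived), each sorted."""
    ref, der = set(reference_tokens), set(derived_tokens)
    return sorted(ref - der), sorted(der - ref)


# Toy data, not the ChEMBL file.
toy_smiles = ["CCO", "c1ccccc1", "C[N+](C)(C)C"]
derived = build_tokens(toy_smiles)

# A reference list may carry extra symbols (here '@' and a newline).
reference = ["<", ">", "C", "N", "O", "c", "1", "(", ")", "[", "]", "+", "@", "\n"]
only_in_reference, only_in_derived = compare_tokens(reference, derived)
print("".join(derived))
print(only_in_reference, only_in_derived)
```

Extra entries in the reference list are harmless for encoding, but characters that appear only in the derived list would have no index in a model trained with the reference list.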

missing jak2 data

I can't find jak2_data.csv in the clone. Should one download it from ChEMBL?

A generative model trained with transfer learning cannot generate molecules similar to those in the training set?

Hello, I used SMILES data I collected and your transfer learning approach to train a generative model on my data. I want the model to generate new molecules similar to my training set, but unfortunately the model does not seem to have learned any characteristics of the training data. Here are some SMILES generated by my model:

'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC        ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC             ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                    ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                         ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCCCCCCCCCCCCCCCCC              ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                               ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                 ',
 'CC=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC=CCC=CSC     ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                     ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                      ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                        ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                         ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                          ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC(C)C                                 ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC(CC)CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC=C(C)C                       ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC(C)C                                   ',
 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC                                            ',

I converted the generated SMILES into molecular structure diagrams. Some examples are as follows:

They don't look like real compounds at all. What might be the problem?

The number of tokens is inconsistent with the tokens provided in `LogP_optimization_demo.ipynb`, so transfer learning cannot be used?

Hello, when I train a generative model with my own SMILES data using the tokens from LogP_optimization_demo.ipynb:
tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']', '\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n'], I get characters outside the tokens list, which prevents me from continuing with the transfer learning method. So I changed the code as follows for training:

gen_data_path = "data/nueji_data2.csv"
gen_data = GeneratorData(training_data_path=gen_data_path, delimiter='\t', 
                         cols_to_read=[0], keep_header=True, tokens=None)
hidden_size = 1500
stack_width = 1500
stack_depth = 200
layer_type = 'GRU'
lr = 0.001
optimizer_instance = torch.optim.Adadelta

my_generator = StackAugmentedRNN(input_size=gen_data.n_characters, hidden_size=hidden_size,
                                 output_size=gen_data.n_characters, layer_type=layer_type,
                                 n_layers=1, is_bidirectional=False, has_stack=True,
                                 stack_width=stack_width, stack_depth=stack_depth, 
                                 use_cuda=use_cuda, 
                                 optimizer_instance=optimizer_instance, lr=lr)
model_path = './checkpoints/generator/checkpoint_biggest_rnn'
my_generator.load_model(model_path)

But I get the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-11-3c9498b26c8c> in <module>()
----> 1 my_generator.load_model(model_path)

/scratch2/hzhou/Drug/generate_smiles/ReLeaSE/release/stackRNN.py in load_model(self, path)
    140         """
    141         weights = torch.load(path)
--> 142         self.load_state_dict(weights)
    143 
    144     def save_model(self, path):

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    717         if len(error_msgs) > 0:
    718             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 719                                self.__class__.__name__, "\n\t".join(error_msgs)))
    720 
    721     def parameters(self):

RuntimeError: Error(s) in loading state_dict for StackAugmentedRNN:
	size mismatch for encoder.weight: copying a param of torch.Size([40, 1500]) from checkpoint, where the shape is torch.Size([45, 1500]) in current model.
	size mismatch for decoder.weight: copying a param of torch.Size([40, 1500]) from checkpoint, where the shape is torch.Size([45, 1500]) in current model.
	size mismatch for decoder.bias: copying a param of torch.Size([40]) from checkpoint, where the shape is torch.Size([45]) in current model.

But my data set is very small. Without transfer learning, my generative model may not be able to learn the chemical rules of SMILES. So my idea is this: I retrain a model on the `data/chembl_22_clean_1576904_sorted_std_final.smi` data set, but with a custom token list that also includes the characters in my data set, and then continue training on my own data from that pre-trained model. Is my idea right? I'm not sure.
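The size mismatch in the traceback (40 versus 45 rows in encoder.weight and decoder.weight) reflects the number of tokens, so pre-trained weights can only be loaded into a model built with a token list of the same length as the one used for pre-training. As a rough sketch of the idea described above, the helper below finds the characters of a new data set that a base token list lacks and appends them, giving one list that could be fixed before pre-training. `out_of_vocabulary`, `extend_tokens` and the toy data are hypothetical names, not part of ReLeaSE.

```python
def out_of_vocabulary(smiles_list, tokens):
    """Sorted characters that occur in the data but not in the token list."""
    known = set(tokens)
    return sorted({ch for smiles in smiles_list for ch in smiles if ch not in known})


def extend_tokens(tokens, smiles_list):
    """Append unseen characters at the end so existing indices keep their meaning."""
    return list(tokens) + out_of_vocabulary(smiles_list, tokens)


# Toy base list and toy data set (hypothetical).
base_tokens = ["<", ">", "C", "N", "O", "(", ")", "=", "1", "c"]
my_smiles = ["CC(=O)O", "C[Si](C)(C)C", "c1ccccc1Br"]

extra = out_of_vocabulary(my_smiles, base_tokens)
all_tokens = extend_tokens(base_tokens, my_smiles)
print(extra)
print(len(base_tokens), "->", len(all_tokens))
```

With such a combined list, the same token list would be passed both when pre-training on the large data set and when continuing on the small one, so the encoder and decoder sizes agree.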

logp accuracy error.

Hi,
I ran your LogP notebook twice. In the first test, using my retrained generator and predictor for reinforcement learning, only 0.2413 of the generated SMILES were valid. Before reinforcement learning, my retrained generator gave 0.8876497315159025 for the drug-like region and 0.7263 for valid SMILES. I don't know why the fraction of valid SMILES from my generator is so low after reinforcement learning.
In the second test, using your generator checkpoint, I got 0.7698 for valid SMILES and 0.9664848012470771 for the drug-like region. Do you have any tricks for training the generator?

Error no module named sklearn

The logp notebook throws an error trying to import what I think is scikit-learn

How to generate a specified number of molecules using a trained generation model?

I loaded the pre-trained model you provided and trained it further on my own SMILES data, so that it generates new molecules structurally similar to my training data. How can I use the trained generative model to generate a specified number of new molecules?
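One generic pattern, sketched here without any ReLeaSE-specific calls, is to sample repeatedly and keep unique, valid strings until the requested count is reached. `sample_fn` stands in for whatever call draws one SMILES string from the trained generator; it, `generate_n`, `is_valid`, `max_tries` and the toy sampler are assumptions for illustration.

```python
import random


def generate_n(sample_fn, n, is_valid=None, max_tries=10000):
    """Collect up to n unique samples, optionally filtered by a validity check."""
    results, seen = [], set()
    for _ in range(max_tries):
        if len(results) >= n:
            break
        candidate = sample_fn()
        if candidate in seen:
            continue  # skip duplicates
        seen.add(candidate)
        if is_valid is None or is_valid(candidate):
            results.append(candidate)
    return results


# Toy sampler (hypothetical): draws from a small fixed pool instead of a model.
_rng = random.Random(0)
_pool = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "C(C"]


def toy_sampler():
    return _rng.choice(_pool)


molecules = generate_n(toy_sampler, n=3)
print(molecules)
```

The `max_tries` cap keeps the loop from running forever when the generator cannot produce enough distinct valid strings, in which case fewer than n results are returned.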

How to reproduce the results of the paper presentation?

Hello, I want to use the data you provided to reproduce the results reported in the paper, such as the following:

In particular, I could not find the following in your paper:

How are the similarities between the generated molecules and the training data calculated?
How is the complexity of the generated molecules (for example polyphenyl rings or multiple substituents) handled in the code during training?

Can you provide a complete example? Thank you.
Best wishes!
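This page does not say which similarity measure the paper uses. A common choice in cheminformatics is the Tanimoto coefficient between molecular fingerprints, and the sketch below shows that measure on plain sets of "on" bit indices, purely as an assumption. `tanimoto`, `nearest_training_similarity` and the toy fingerprints are hypothetical; real fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def nearest_training_similarity(query_bits, training_fingerprints):
    """Highest Tanimoto similarity between one query and a training set."""
    return max((tanimoto(query_bits, fp) for fp in training_fingerprints), default=0.0)


# Toy fingerprints given as sets of bit indices (made up for illustration).
generated = {1, 2, 3}
training = [{2, 3, 4}, {1, 2, 3, 4}, {7, 8}]
print(tanimoto(generated, training[0]))
print(nearest_training_similarity(generated, training))
```

Reporting the nearest-neighbour similarity for each generated molecule is one way to summarize how close a generated set is to the training data.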

Garbled characters appear when displaying molecular images

When I optimized the molecular properties following the LogP_optimization_demo.ipynb tutorial, I converted the generated SMILES into images, but the images came out garbled:

I don't know the reason; I just followed the tutorial. Could you tell me why this happens and how I can solve it?
