maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.

License: MIT License

Python 100.00%

keras-molecules's Issues

problem with loading sample

Hi, when I try to use the model with the command:
python2 sample.py data/smiles_50k.h5 data/model_50k.h5 --target autoencoder
I have the following error:

Using Theano backend.
Traceback (most recent call last):
File "sample.py", line 97, in <module>
main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 41, in autoencoder
data, charset = load_dataset(args.data, split = False)
File "/home/jarbona/keras-molecules/autoencoder/utils.py", line 20, in load_dataset
h5f = h5py.File(filename, 'r')
File "/usr/lib/python2.7/site-packages/h5py/_hl/files.py", line 272, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/lib/python2.7/site-packages/h5py/_hl/files.py", line 92, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5f.pyx", line 76, in h5py.h5f.open (/tmp/pip-4rPeHA-build/h5py/h5f.c:1930)
IOError: Unable to open file (File signature not found)

I am not sure if it comes from my install.
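
One thing worth ruling out: "File signature not found" means the file does not begin with the HDF5 magic bytes. Since the files in data/ are stored with Git LFS, a checkout without LFS leaves small text pointer files in their place. The following is a minimal sketch of a check, assuming the signature sits at offset 0 (HDF5 also permits it at later offsets when a user block is present); the function name classify_h5_file is made up for this illustration.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"            # first 8 bytes of a plain HDF5 file
LFS_POINTER_PREFIX = b"version https://git-lfs"  # start of a Git LFS pointer file


def classify_h5_file(path):
    """Return 'hdf5', 'lfs-pointer', or 'unknown' based on the first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(64)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    return "unknown"


# Demonstration on a throwaway file that mimics an LFS pointer.
with tempfile.NamedTemporaryFile(delete=False, suffix=".h5") as tmp:
    tmp.write(b"version https://git-lfs.github.com/spec/v1\noid sha256:0000\nsize 1\n")
    demo_path = tmp.name

demo_kind = classify_h5_file(demo_path)
os.remove(demo_path)
print(demo_kind)
```

Running classify_h5_file on data/smiles_50k.h5 and data/model_50k.h5 would show whether the real files or only pointers were downloaded.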

Generative Adversarial Networks

In case you guys haven't seen it, this paper came out recently and looks kind of interesting: https://arxiv.org/abs/1701.01329

My first couple of read-throughs leave me with some questions. The paper triggers a couple of my first-order heuristics (explaining basic stuff like RNNs, seemingly magical performance on generating long valid SMILES strings that suggests overfitting) and has kind of a weird application of fine-tuning as transfer learning, among other things. Over this weekend I'm planning on working up some parts of this paper, such as the stacked LSTMs as a SMILES generator for transfer to a property prediction network. Does anyone else have any comments on this paper or things to try?

@pechersky @dribnet @dakoner

Problem with the pretrained model on autoencoding

Dear author,

I downloaded your code and the pre-trained model (model_500k.h5) and tried the following commands:
python preprocess.py data/smiles_500k.h5 data/processed_500k.h5
python sample.py data/processed_500k.h5 data/model_500k.h5 --target autoencoder

Then it outputs:

NC(=O)c1nc(cnc1N)c2ccc(Cl)c(c2)S(=O)(=O)Nc3cccc(Cl)c3
(-> encoder -> decoder ->)
7-7ASC-F@@7N7AAAAAAAAAAAAAlllllNAACAAC7lll7AlllAAACC%CLA-VVVVVVVVFF--lAAAAAAAAAAAAAAVVAAAAACCAACCAAACAAACCA77A-VVV--

I am not sure what happened to the pre-trained model; it does not seem to do a good job at all. Do you see a similar problem, or did I do something wrong?

Encoding representations

Hello,

I'm pretty new to autoencoders, and I know we can use them for unsupervised learning. Is it possible to use this model to create representations (by encoding) for a set of SMILES?

If so, I guess I would first have to preprocess my dataset and then use sample.py?

Thanks!

preprocess.py reduce

Hi,

I am trying to run preprocess.py and there is an unresolved reference to reduce:
charset = list(reduce(lambda x, y: set(y) | x, structures, set()))

However, if I add:

from functools import reduce
I then get the following error:

Traceback (most recent call last):
File "preprocess.py", line 88, in <module>
main()
File "preprocess.py", line 60, in main
h5f.create_dataset('charset', data = charset)
File "C:\Users\dmcclymo\AppData\Local\Continuum\Anaconda3\lib\site-packages\h5py_hl\group.py", line 105, in crea
taset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "C:\Users\dmcclymo\AppData\Local\Continuum\Anaconda3\lib\site-packages\h5py_hl\dataset.py", line 93, in mak
_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py\h5t.pyx", line 1450, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:16
File "h5py\h5t.pyx", line 1470, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:15
File "h5py\h5t.pyx", line 1531, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:15
TypeError: No conversion path for dtype: dtype('<U1')
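
The traceback points at the charset being a list of Python 3 unicode strings (NumPy dtype '<U1'), which h5py reports it cannot convert. A minimal sketch of one workaround follows, assuming the charset only contains ASCII characters: encode each character to bytes before handing the list to create_dataset. The structures list here is a stand-in for the real SMILES column.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

structures = ["CCO", "c1ccccc1", "CC(=O)Cl"]  # stand-in for the SMILES column

# Same reduction as in preprocess.py: union of all characters seen.
charset = sorted(reduce(lambda acc, smiles: acc | set(smiles), structures, set()))

# Python 3 strings are unicode ('<U1' in NumPy terms); encoding them to bytes
# gives fixed-width byte strings instead.
charset_bytes = [ch.encode("ascii") for ch in charset]

print(charset)
print(charset_bytes[:3])
```

With this change, charset_bytes rather than charset would be passed as the data argument. Code that later reads the charset back would then need to decode the bytes again.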

Replace SMILES input to Coulomb matrix

I am working on changing the input from SMILES to Coulomb matrices. 200 Coulomb matrices (29 × 29 each) with their HOMO-LUMO gaps have been produced and saved into a .h5 file by the following code:

# Saving in .h5 format
h5f = h5py.File('processed.h5','w')
h5f.create_dataset('homo_lumo_gaps',data=homo_lumo_gaps)
h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)

When I run train.py directly on the generated processed.h5, I get this error message:

KeyError: "Unable to open object (Object 'data_train' doesn't exist)"

I think the problem is that I save the file differently from the original preprocess.py, but I do not understand the original design and so do not know how I should modify my code.
The preprocess.py I am using is here:

https://docs.google.com/document/d/17f9n7tzeadpCo0_pit548QiU1-Loib2opMcm0I4MxzQ/edit?usp=sharing

Other than the naming problem mentioned above, I also want to know whether the network will work as expected if I directly feed in Coulomb matrices in place of the SMILES strings. Is there any part of the code I will need to modify?
I know that this is not a good way to ask questions, but I really need some help. Any help is appreciated. Thank you.

Problems with the model_500k.h5

To test the model_500k.h5 in the data folder, I ran the following commands:
python2.7 preprocess.py data/smiles_500k.h5 data/processed_500.h5

python2.7 sample.py data/processed_500.h5 data/model_500k.h5 --target autoencoder

However, the second step gives me the following error message:

main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 44, in autoencoder
model.load(charset, args.model, latent_rep_size = latent_dim)
File "/Users/flynn/Documents/desktop/GT_second_semester/song_lab/nanoparticle_research/molecule_BO/keras_molecule/keras-molecules/molecules/model.py", line 95, in load
self.create(charset, weights_file = weights_file, latent_rep_size = latent_rep_size)
File "/Users/flynn/Documents/desktop/GT_second_semester/song_lab/nanoparticle_research/molecule_BO/keras_molecule/keras-molecules/molecules/model.py", line 50, in create
self.autoencoder.load_weights(weights_file)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2500, in load_weights
self.load_weights_from_hdf5_group(f)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2585, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 990, in batch_set_value
assign_op = x.assign(assign_placeholder)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 575, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2242, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1617, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1568, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
debug_python_shape_fn, require_shape_fn)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 675, in _call_cpp_shape_fn_impl
raise ValueError(err.message)

ValueError: Dimension 2 in both shapes must be equal, but are 56 and 55 for 'Assign' (op: 'Assign') with input shapes: [9,1,56,9], [9,1,55,9].

Is model_500k.h5 incompatible with smiles_500k.h5? If so, how can I make use of model_500k.h5?

Is there a list of generated candidates available?

i.e. list of outputs from the VAE. If not, are you aware of any other resources that might have molecular candidates via a generative algorithm? (VAE, GAN, etc.)

Will there be problems if I use TensorFlow for CPU?

I tried to run your repository on my laptop, which doesn't have a discrete GPU such as an Nvidia card. So I use TensorFlow (version 1.0.0) for CPU, and I have satisfied all other requirements in requirements.txt. I wonder whether this could work.
When I was trying to replicate the tutorial on the homepage of your repository, after successfully creating processed.h5 in the data/ folder from smiles_50k.h5, I ran the following command and got this error:

python train.py data/processed.h5 model.h5 --epochs 20
Using TensorFlow backend.
Traceback (most recent call last):
  File "train.py", line 59, in <module>
    main()
  File "train.py", line 37, in main
    model.create(charset, latent_rep_size = args.latent_dim)
  File "/home/simonfqy/Documents/keras-molecules/molecules/model.py", line 23, in create
    _, z = self._buildEncoder(x, latent_rep_size, max_length)
  File "/home/simonfqy/Documents/keras-molecules/molecules/model.py", line 62, in _buildEncoder
    h = Flatten(name='flatten_1')(h)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 514, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 149, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 409, in call
    return K.batch_flatten(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 825, in batch_flatten
    x = tf.reshape(x, tf.pack([-1, prod(shape(x)[1:])]))
AttributeError: 'module' object has no attribute 'pack'

Can anyone please help me with this? Is it related to my using tensorflow for CPU only?

Version numbers for working install

I've created a fresh conda environment and run pip install -r requirements.txt, but I'm still getting errors that are likely due to a version mismatch, e.g.:

  File "preprocess.py", line 85, in <module>
    main()
  File "preprocess.py", line 72, in main
    apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
  File "preprocess.py", line 63, in create_chunk_dataset
    chunks=tuple([chunk_size]+list(dataset_shape[1:])))
  File "/home/spadavec/.local/lib/python2.7/site-packages/h5py/_hl/group.py", line 105, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/home/spadavec/.local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 76, in make_new_dset
    if isinstance(chunks, tuple) and (-numpy.array([ i>=j for i,j in zip(tmp_shape,chunks) if i is not None])).any():
TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.

Does anyone have version numbers that aren't in conflict?

binary cross-entropy in vae_loss

Hi, why do you use binary crossentropy in vae_loss https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py#L77 ?
Categorical cross-entropy looks more natural here, and the authors of the original article also used categorical cross-entropy. Have you compared these two losses?
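
To make the difference between the two losses concrete, here is a small sketch in plain Python. It is not the repository's code: the target and prediction vectors are made up, and the binary version averages over classes in the way Keras's binary_crossentropy averages over the last axis.

```python
import math


def categorical_crossentropy(target, predicted):
    """-sum_i t_i * log(p_i): only the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def binary_crossentropy(target, predicted):
    """Mean over classes of -[t*log(p) + (1-t)*log(1-p)]."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(target, predicted)]
    return sum(terms) / len(terms)


target = [0.0, 1.0, 0.0, 0.0]        # one-hot: the true character is index 1
predicted = [0.1, 0.7, 0.1, 0.1]     # softmax output, sums to 1

cce = categorical_crossentropy(target, predicted)
bce = binary_crossentropy(target, predicted)
print(round(cce, 4), round(bce, 4))
```

The binary form also scores every "wrong" class as an independent yes/no decision and then averages, so with a large charset the many easy zeros dilute the contribution of the one true character.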

ResourceExhaustedError: OOM when allocating tensor with shape[600,120,1503]

I am new to Ubuntu and to this program, so this question may be very simple.
This error message appears when I run train.py:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[600,120,1503]

I think my GPU may be running out of memory. My graphics card is a GTX 970 with 4 GB of memory. I am running the program according to the README, using this command:

python train.py data/processed.h5 model.h5 --epochs 20

I have read about a very similar problem here:

http://stackoverflow.com/questions/39076388/tensorflow-deep-mnist-resource-exhausted-oom-when-allocating-tensor-with-shape

but I still cannot find a way to fix my problem. Can anyone please give me a hand? Thank you very much.
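
A rough back-of-the-envelope check of the tensor in the error message is sketched below. It assumes float32 values (4 bytes each) and that the leading 600 is the batch size; neither is stated in the error itself.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor of the given shape (float32 default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


full = tensor_bytes((600, 120, 1503))
smaller = tensor_bytes((100, 120, 1503))   # same tensor with a leading dim of 100

print(full / 2**20, "MiB")
print(smaller / 2**20, "MiB")
```

A single tensor of this shape takes roughly 413 MiB, and training keeps several such intermediates alive at once, so a 4 GB card can plausibly run out. Under the assumption above, shrinking the batch size reduces the requirement proportionally.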

Use more up-to-date ZINC database

ZINC15 is now available

sample.py assumes what input file to sample from

Currently, sample.py assumes the presence of a particular h5 file in data. Instead, sample.py should take the input data as a parameter. See related pull request.

Preprocess uses too much RAM

preprocess.py loads the entire pre-preprocessed data into RAM, then applies transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it's just about filling up my 64GB RAM machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take work to make all the steps incremental.

Cannot fetch LFS-stored files.

The files in data/ cannot be retrieved via git lfs, because the bandwidth quota seems to be exceeded (see error message below). Is there an alternate method of obtaining the files?

keras-molecules$ git lfs fetch
Fetching master
Git LFS: (0 of 3 files) 0 B / 156.20 MB                                                                                                                                                                            
batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-a-personal-account/
Warning: errors occurred

KL divergence term in loss function

The loss function for the autoencoder is calculated here using
kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
taking the mean over the dimensions of the latent representation. However, several other sources, including the VAE example in the keras repo, use the sum instead:
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
Is there a reason for the difference? Given the relatively large number of latent dimensions, it seems like this would significantly impact the strength of the KL regularization.
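
For a sense of scale, here is a minimal sketch comparing the two reductions on made-up values. The latent size of 292 is the default latent_rep_size in the repository's model code; the z_mean and z_log_var values are arbitrary.

```python
import math


def kl_terms(z_mean, z_log_var):
    """Per-dimension 1 + log_var - mean^2 - exp(log_var)."""
    return [1 + lv - m * m - math.exp(lv) for m, lv in zip(z_mean, z_log_var)]


latent_dim = 292                      # default latent_rep_size in the repository
z_mean = [0.5] * latent_dim           # arbitrary illustrative values
z_log_var = [-1.0] * latent_dim

terms = kl_terms(z_mean, z_log_var)
kl_mean = -0.5 * sum(terms) / latent_dim
kl_sum = -0.5 * sum(terms)

print(kl_mean, kl_sum, kl_sum / kl_mean)
```

The two versions differ by exactly a factor of the latent dimension, so with 292 dimensions the mean-based term is 292 times weaker relative to the reconstruction loss.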

Trained model

Hello, I really appreciate that you reimplemented the model structure. But I think most people don't have such GPU resources, so could you add already-trained models to this repo?

switch to keras pip?

Currently the requirements.txt file references keras from git:
-e git+https://github.com/fchollet/keras.git#egg=keras

However, keras is on PyPI.

If possible, I suggest using the pip release with a pinned version number, which should make it less likely to get hit by future changes to the Keras API.

Try Gomez-Bombarelli's TerminalGRU

I was looking to see whether the paper's authors had any draft code for the model they describe, to check whether they did something differently. I found that Gomez-Bombarelli had implemented some new RNN layers that they might have used for the decoder. I don't know whether they ended up using the conventional GRU or this new TerminalGRU instead. Might be worth trying out:

keras-team/keras@master...rgbombarelli:master#diff-3118e4e28157032506f771f279a551c3R639

h5 files with non-sequential indices cannot be processed

I had an h5 file with a saved "structure" column with an index that did not start at 0. Because of that, the np.arange call to generate train_idx, test_idx would generate indices that were not valid for the h5 file. Instead, we can generate train_idx, test_idx by running train_test_split on the index of the column itself.

See my proposed pull request.
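
The idea can be sketched as follows. This is not the pull request's code: it uses the standard library's random module as a stand-in for scikit-learn's train_test_split, and the function name split_indices and the index values are made up for illustration.

```python
import random


def split_indices(index_values, test_fraction=0.2, seed=0):
    """Shuffle the real index labels and split them into train/test lists."""
    shuffled = list(index_values)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# An index that starts at 1000 and has gaps, as in the issue.
index_values = [1000, 1001, 1005, 1010, 1011, 1020, 1021, 1022, 1030, 1031]
train_idx, test_idx = split_indices(index_values)
print(train_idx, test_idx)
```

Because the split is drawn from the labels that actually exist, every returned index is valid for the file, whereas np.arange(len(...)) would only produce 0..n-1.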

Trying to preprocess an h5 file with a large number of structures (~500K) results in out-of-memory error

The preprocessing script as written tries to generate the entire tensor in memory at once and then write it to the processed h5 file in one go. Instead, we can generate the tensor one chunk at a time and write each chunk as it is produced. See related pull request.
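
A minimal sketch of the chunked approach, in plain Python, is below. The helper names iter_chunks and one_hot, the tiny SMILES list, and the chunk size are all illustrative assumptions; the actual write to the h5 file is left as a comment.

```python
def iter_chunks(items, chunk_size):
    """Yield (start, slice) pairs of at most chunk_size items each."""
    for start in range(0, len(items), chunk_size):
        yield start, items[start:start + chunk_size]


def one_hot(smiles, charset, max_length):
    """One-hot encode a SMILES string padded with spaces to max_length."""
    padded = smiles.ljust(max_length)
    return [[1 if ch == symbol else 0 for symbol in charset] for ch in padded]


smiles_list = ["CCO", "CCN", "C=O", "CCC", "OCC"]
charset = sorted(set("".join(smiles_list)) | {" "})

written_rows = 0
for start, chunk in iter_chunks(smiles_list, chunk_size=2):
    encoded = [one_hot(s, charset, max_length=5) for s in chunk]
    # In a real script, this is where `encoded` would be written to rows
    # start .. start + len(chunk) of the output dataset and then discarded.
    written_rows += len(encoded)

print(written_rows)
```

Peak memory then depends on the chunk size rather than on the total number of structures.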

Strange results for the autoencoder with preloaded weights

Hi,
I am not sure if the weights of the pretrained network have some problem or if I made an error in my small pipeline, but the results from the autoencoder seem very bad when I run sample.py:

First I preprocessed the data for 500k molecules
python2 preprocess.py data/smiles_500k.h5 data/processed_500.h5
which created a file of about 13G (processed_500.h5)
then I run:
python2 sample.py data/processed_500.h5 data/model_500k.h5 --target autoencoder

And I get:
CCOC(=O)CSC1=NC(=O)N2C=CC(=CC2=N1)C
S.+[SSSSS+[.(b..FFFFFFFFF(F%(F%FFF%%FFF

training accuracy @ 99.99%, validation never goes above 97%

This is probably more of a machine learning issue in general, but in training keras-molecules on a small input set (~50K strings), the training eventually reaches 99.99% accuracy (loss ~.01), but the validation never exceeds 97%.

My feeling is this is a sign of overfitting the training data. The trainer has gotten effectively perfect at reconstructing the input SMILES strings but can't generalize to the odd cases in the test set that have no analogs in the training set.

I'm not sure how to debug this or if it really even represents a bug.

Training and canonicalization

My training has been on the ZINC database, but when I decode generated vectors into SMILES strings, I pass them into rdkit's canonicalizer. It looks to me like ZINC used a different canonicalization mechanism.

It's not clear to me at all whether this would make any difference- obviously, if you train on a set generated by one canonicalizer, the system is going to learn to produce outputs that are consistent with that canonicalizer. In a sense, this is a bias (the canonicalization rules are based mostly in graph theory rather than chemistry).

TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.

python preprocess.py data/smiles_50k.h5 data/processed_50k.h5

Traceback (most recent call last):
File "preprocess.py", line 85, in <module>
main()
File "preprocess.py", line 72, in main
apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
File "preprocess.py", line 63, in create_chunk_dataset
chunks=tuple([chunk_size]+list(dataset_shape[1:])))
File "/home/vinay/chemistrytensor/local/lib/python2.7/site-packages/h5py/_hl/group.py", line 105, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/home/vinay/chemistrytensor/local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 76, in make_new_dset
if isinstance(chunks, tuple) and (-numpy.array([ i>=j for i,j in zip(tmp_shape,chunks) if i is not None])).any():
TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.

VAE part of model

Hi, it looks like this code actually trains a simple autoencoder rather than a VAE. Here are the reasons:

Epsilon std is 0.01 https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py#L58 when it should be 1. I think it is safe to say that there is almost no sampling.
The KL loss should be very small because of the mean operation https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py#L78. In that case the mean is taken along the feature and sequence dimensions, but both of these should be summed to obtain the right KL loss relative to the cross-entropy loss.
The picture in the README also indicates this, because not all regions of the latent space are covered by points. The authors wrote in the paper that they observed this when they trained a simple autoencoder model.

Maybe it makes sense to simply train an autoencoder model and compare the results.
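
The effect of the epsilon standard deviation can be illustrated with a one-dimensional sketch of the reparameterization step. The function name sample_z, the seed, and the choice of z_mean = 0 and z_log_var = 0 are assumptions made only for this example.

```python
import math
import random
import statistics


def sample_z(z_mean, z_log_var, epsilon_std, rng):
    """Reparameterization: z = mean + exp(log_var / 2) * epsilon."""
    epsilon = rng.gauss(0.0, epsilon_std)
    return z_mean + math.exp(z_log_var / 2) * epsilon


rng = random.Random(0)
z_mean, z_log_var = 0.0, 0.0   # a unit-variance posterior, for illustration

small = [sample_z(z_mean, z_log_var, 0.01, rng) for _ in range(10000)]
unit = [sample_z(z_mean, z_log_var, 1.0, rng) for _ in range(10000)]

spread_small = statistics.pstdev(small)
spread_unit = statistics.pstdev(unit)
print(spread_small, spread_unit)
```

With epsilon_std = 0.01 the samples stay within about a hundredth of z_mean, so the decoder effectively sees the mean itself, which is the "almost no sampling" behaviour described above.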

Getting tables.exceptions.HDF5ExtError: HDF5 error back trace

When I run python preprocess.py data/smiles_50k.h5 data/processed.h5, it generates an error.
The detailed error is attached in the image. How to correct this?

KeyError: "Unable to open object (Object 'latent_vectors' doesn't exist)"

I'm trying to mess around with some custom datasets, and keep getting the following error when trying to call read_latent_data:

  File "ava.py", line 165, in <module>
    main()
  File "ava.py", line 127, in main
    data, charset = read_encoded_vecs(data_path)
  File "ava.py", line 63, in read_encoded_vecs
    data, charset = read_latent_data(data_path)
  File "/home/lerche/word_embedding_molecules/ava/keras_molecules/sample.py", line 34, in read_latent_data
    data = h5f['latent_vectors'][:]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
  File "/home/lerche/miniconda2/lib/python2.7/site-packages/h5py/_hl/group.py", line 166, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/tmp/pip-4rPeHA-build/h5py/h5o.c:3570)
KeyError: "Unable to open object (Object 'latent_vectors' doesn't exist)"

My process for creating the input is:

create a .txt file that has 2 columns, 1 for SMILES and 1 for MoleculeID (~800k total)
Convert the .txt to h5 using the following:

import import_smiles
import sys
i_file = import_smiles.read_smiles(sys.argv[1], column=1)
import_smiles.create_h5(d, 'output.h5')

Preprocess the above h5 file using python preprocess.py output.h5 preprocessed.h5
run python train processed.h5 model.h5 --epochs 20

Then, if I try to run the following code, the error above is displayed:

from keras_molecules.sample import read_latent_data

data, charset = read_latent_data(data_path)

Interestingly, I can still run python sample.py on the data and model files; any idea where I'm going wrong?

EDIT: I should note that it looks like the keys that exist in processed.h5 file are:

[u'charset', u'data_test', u'data_train']

EDIT2: I'm getting this same behavior with the 50k data/model files that are 'included' as examples.

Open source license?

The unnatural encoding of current implementation

After testing, I found that the procedure for building the list of unique characters used in the dataset (the "charset") is weird. The current encoding makes the resulting output much more fragile, because it does not prevent Cl from being interpreted as "C" followed by "l". We should treat 'Cl' as a single token rather than as 'C' and 'l' separately; it is chemically unreasonable to see 'l' alone.
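
One possible way to do this is a small regex tokenizer, sketched below. It is an illustration, not the repository's code: it only special-cases the two-letter halogens Cl and Br and treats every other character as its own token, so bracket atoms such as [Na+] are still split character by character.

```python
import re

# Two-letter halogens first, then any single character as a fallback.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")


def tokenize(smiles):
    """Split a SMILES string so that Cl and Br stay single tokens."""
    return TOKEN_PATTERN.findall(smiles)


tokens = tokenize("Clc1ccc(Br)cc1")
print(tokens)
```

The charset would then be built from the set of tokens rather than the set of characters, so a lone 'l' or 'r' can no longer appear in decoded output.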

Error: No object named table in the file

Hi
On running
python preprocess.py data/dataset.h5 data/processed.h5

I get an error: no object named table in the file.

Validation dataset

I'm curious - can anybody help me understand why there's a separate validation dataset for the auto-encoder?

If the purpose of the autoencoder is to take us from (categorical, discrete) to (multinomial, continuous) and then back to (categorical, discrete), then isn't it better if the network is trained on all of the samples, even to the point of overfitting? I'm saying this because we're not trying to train the network to generalize molecular structure - it's simply a transformer. Once we construct the latent space, it's up to us to analyse it, but shouldn't the autoencoder simply be trained to encode and decode, and not to generalize?

We still get predictive ability by manipulating transforms through the latent space (as a totally separate task), then the decoder can spit out the molecule that was the result of our latent transform.

No?

Issue replicating graph

I would like to generate a picture like the one displayed in the README. However, even though I converge my model beyond the point shown in the README, I do not get the distinct striations shown. Instead I get a more spread-out graph that still has some striations.

Has anyone been able to replicate the image as displayed in the paper and README? Was it generated using an actual 2D latent dimension, or a higher dimension reduced to 2D with PCA? (I have tried both and neither has worked.) Any help would be greatly appreciated.

Architecture issues

Reading your code:

    def create(self,
               charset,
               original_dim = 120,
               epsilon_std = 0.01,
               latent_rep_size = 292,
               weights_file = None):
        charset_length = len(charset)

        x = Input(shape=(original_dim, charset_length))
        h = Convolution1D(9, 9, input_dim=60)(x)
        h = Convolution1D(9, 9)(h)
        h = Convolution1D(10, 11)(h)
        h = Flatten()(h)
        h = Dense(435)(h)
        z_mean = Dense(latent_rep_size, name='z_mean')(h)
        z_log_var = Dense(latent_rep_size, name='z_log_var')(h)

        def sampling(args):
            z_mean, z_log_var = args
            batch_size = K.shape(z_mean)[0]
            epsilon = K.random_normal(shape=(batch_size, latent_rep_size), mean=0., std=epsilon_std)
            return z_mean + K.exp(z_log_var / 2) * epsilon

        z = Lambda(sampling)([z_mean, z_log_var])

        h = Dense(latent_rep_size, name='latent_input')(z)
        h = RepeatVector(original_dim)(h)
        h = GRU(501, return_sequences = True)(h)
        h = GRU(501, return_sequences = True)(h)
        h = GRU(501, return_sequences = True)(h)
        decoded_mean = TimeDistributedDense(charset_length, activation='softmax', name='decoded_mean')(h)

        def vae_loss(x, x_decoded_mean):
            x = K.flatten(x)
            x_decoded_mean = K.flatten(x_decoded_mean)
            xent_loss = original_dim * objectives.binary_crossentropy(x, x_decoded_mean)
            kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
            return xent_loss + kl_loss

        encoded_input = Input(shape=(original_dim, latent_rep_size))

        self.autoencoder = Model(x, decoded_mean)
        self.encoder = Model(x, z_mean)
        #self.decoder = Model(self.autoencoder.get_layer('latent_input')(encoded_input),
        #                     self.autoencoder.get_layer('decoded_mean')(encoded_input))

What you call original_dim should actually be called something like max_length.
You pass an input_dim argument to the first conv layer; this is not necessary (you already have an Input layer and you're using the functional API anyway).
Your conv layers do not have activations! You should add an argument activation='relu' to all of them.
The same goes for most of your Dense layers, except these two, for which a linear layer makes sense:

z_mean = Dense(latent_rep_size, name='z_mean')(h)
z_log_var = Dense(latent_rep_size, name='z_log_var')(h)

Enable light compression on h5 files

The h5 files are being written without compression. This makes them really large, wasting disk space. Compression might add some runtime cost, although it is usually faster to read and write compressed data than uncompressed data. I suggest using a fairly light compression: it won't take much time to encode or decode and will still compact things significantly.
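
To give a feel for how compressible one-hot data is, here is a small standard-library sketch using zlib, which implements the same DEFLATE algorithm as h5py's "gzip" filter (selected with the compression and compression_opts arguments of create_dataset). The matrix sizes are illustrative assumptions and the data is synthetic, so the exact ratio will differ on real files.

```python
import zlib

max_length, charset_length = 120, 56   # illustrative sizes, not read from a file

# One molecule's one-hot matrix as raw bytes: a single 1 per row, zeros elsewhere.
rows = []
for position in range(max_length):
    row = bytearray(charset_length)
    row[position % charset_length] = 1
    rows.append(bytes(row))
raw = b"".join(rows)

light = zlib.compress(raw, 1)   # level 1: fastest, lightest compression
heavy = zlib.compress(raw, 9)   # level 9: slowest, strongest compression

print(len(raw), len(light), len(heavy))
```

Even the lightest level shrinks this kind of mostly-zero data substantially, which is the argument for a low compression level rather than none.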

Top-level dir is not a module

I've been trying to package up keras-molecules as a module so I can run it on Google Cloud Machine Learning. Unfortunately, the top level of keras-molecules (with train.py etc.) isn't really module-ready. The repo name has a "-" in it, and there's no top-level __init__.py.

ImportError: libcudart.so.8.0

I receive the following error when trying to train the model.
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

I am using cuda 10.0, python 2.7.15, on GPU.

Better molecule representation

Using SMILES makes this problem unnecessarily hard, likely detracting from the smoothness of the latent space and causing the majority of sampled SMILES to be invalid even after extended training. The graph convolution work isn't reversible so you can't use it to generate new structures not in the initial set. This is pretty open-ended but what are other representations to try? Ideally somewhere where every point corresponds to a valid molecule.

One idea I've had is to borrow on the concept of cellular encodings, where rather than specifying the molecular graph directly we specify a set of operations which when evaluated yield a valid molecule. Then we're doing seq-to-seq learning for a list of "molecule construction ops." Careful thought would have to be put into the instruction set, but I think it might be possible to get something expressive where every combination is also valid.

@dakoner ?

Convolution1D Issue

If my understanding is right, when we use the convolution1D, the timestep should stay constant. But interestingly here, I found the structure doesn't follow this principle:

Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_15 (InputLayer)            (None, 277, 76)       0                                            
____________________________________________________________________________________________________
conv_1 (Convolution1D)           (None, 269, 9)        6165        input_15[0][0]                   
____________________________________________________________________________________________________
conv_2 (Convolution1D)           (None, 261, 9)        738         conv_1[0][0]                     
____________________________________________________________________________________________________
conv_3 (Convolution1D)           (None, 251, 10)       1000        conv_2[0][0]                     
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 2510)          0           conv_3[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 435)           1092285     flatten_1[0][0]                  
____________________________________________________________________________________________________
z_mean (Dense)                   (None, 56)            24416       dense_1[0][0]                    
____________________________________________________________________________________________________
z_log_var (Dense)                (None, 56)            24416       dense_1[0][0]                    
____________________________________________________________________________________________________
lambda (Lambda)                  (None, 56)            0           z_mean[0][0]                     
                                                                   z_log_var[0][0]                  
____________________________________________________________________________________________________
latent_input (Dense)             (None, 56)            3192        lambda[0][0]                     
____________________________________________________________________________________________________
repeat_vector (RepeatVector)     (None, 277, 56)       0           latent_input[0][0]               
____________________________________________________________________________________________________
gru_1 (GRU)                      (None, 277, 501)      838674      repeat_vector[0][0]              
____________________________________________________________________________________________________
gru_2 (GRU)                      (None, 277, 501)      1507509     gru_1[0][0]                      
____________________________________________________________________________________________________
gru_3 (GRU)                      (None, 277, 501)      1507509     gru_2[0][0]                      
____________________________________________________________________________________________________
decoded_mean (TimeDistributed)   (None, 277, 76)       38152       gru_3[0][0]                      
====================================================================================================
Total params: 5,044,056
Trainable params: 5,044,056
Non-trainable params: 0

From conv_2 to conv_3, the channel dimension increases (9 to 10) while the number of timesteps decreases (261 to 251). Is this a bug in Keras, or am I missing something?
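
For what it's worth, the shapes in the summary are what an unpadded ("valid") 1D convolution would produce: each layer shortens the sequence by `kernel_size - 1`, and the last dimension is just the number of filters. The sketch below checks this arithmetic in plain Python. The kernel sizes (9, 9, 11) are an assumption inferred from the output shapes and parameter counts above, not read from the model code, and `conv1d_valid` is a made-up helper name.

```python
def conv1d_valid(steps, in_channels, filters, kernel_size):
    """Output length and parameter count of an unpadded 1D convolution."""
    out_steps = steps - kernel_size + 1
    params = kernel_size * in_channels * filters + filters  # weights + biases
    return out_steps, params

# (filters, kernel_size) per layer; inferred from the summary, an assumption.
layers = [(9, 9), (9, 9), (10, 11)]

steps, channels = 277, 76  # input shape from the summary: (None, 277, 76)
shapes = []
for filters, kernel_size in layers:
    steps, params = conv1d_valid(steps, channels, filters, kernel_size)
    shapes.append((steps, filters, params))
    channels = filters  # the next layer sees this layer's filters as channels
```

Under these assumptions `shapes` matches the conv_1, conv_2 and conv_3 rows of the summary, so the shrinking timestep count looks like expected behaviour rather than a bug. Keeping the length constant would require "same" padding.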

locale-gen: not found when running docker build

$ docker build .

Step 3/16 : RUN locale-gen en_US.UTF-8
 ---> Running in XXXXXXXXX
/bin/sh: 1: locale-gen: not found
The command '/bin/sh -c locale-gen en_US.UTF-8' returned a non-zero code: 127

I think I have to run 'apt-get install locales' first, as in this question.

Incorporate Grammar VAE

A new paper came out:
https://arxiv.org/abs/1703.01925

They were able to implement a VAE that autoencodes the SMILES grammar as a context-free grammar, which is a pretty good approximation. They improved over the GB SMILES VAE. This would solve several of the issues that have been filed, such as #31 and #54.

The Grammar VAE code is at https://github.com/mkusner/grammarVAE. It looks a lot like the VAE code we already have, in terms of the actual model code. Incorporating the zinc_grammar and the masking shouldn't be too difficult -- just some work to implement the masking in Theano.

In fact, their code also uses the mean rather than the sum in the KL loss term, which is relevant to #59.
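
As a rough illustration of the mean-versus-sum distinction, here is a plain-Python sketch of the usual KL term for a diagonal Gaussian posterior against a standard normal prior. The values of `mu` and `log_var` are made up, and this is not the loss code from either repository. The two reductions differ only by a factor equal to the latent dimension, which changes how strongly the KL term is weighted relative to the reconstruction loss.

```python
import math

def kl_per_dim(mu, log_var):
    """Per-dimension KL(N(mu, exp(log_var)) || N(0, 1))."""
    return [-0.5 * (1 + lv - m * m - math.exp(lv)) for m, lv in zip(mu, log_var)]

# Made-up encoder outputs for a 4-dimensional latent space.
mu = [0.5, -1.0, 0.0, 2.0]
log_var = [0.0, -0.5, 0.3, 0.1]

terms = kl_per_dim(mu, log_var)
kl_sum = sum(terms)               # sum over latent dimensions
kl_mean = kl_sum / len(terms)     # mean over latent dimensions
```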

Sharing trained weight

Hello, I am very interested in testing your model.
Do you think you could share the trained model file (the model.h5 file)?
Thank you.

Property prediction

This codebase works well for me and I'm able to replicate the current results. Having worked a good bit with other latent spaces, I'm curious to find out what other operations the latent space of this model might support. Specifically, I suspect that the latent space could also support analogies and attribute vectors, but unfortunately I'm not familiar with chemistry datasets and SMILES strings.

Would anyone be interested in helping me build a labelled dataset of molecules that includes binary attributes and then investigating the results of applying attribute vectors? An example structure of the dataset would be:

SMILES string	Polar	Toxic	Flammable	Positive Oxidation State
CN1CCC[C@H]1c2cccnc2	True	False	False	True
O=C1Oc2ccccc2c3ccccc13	False	True	True	False
...

Generally, these datasets can be useful even if they are much smaller than the training dataset - say dozens to hundreds of rows. Ideally, the chosen attributes would be those that could serve as unambiguous labels and operators. For example, pretend the following is true:

Carbon dioxide is a polar molecule.
The equivalent to carbon dioxide without polarity is carbon monoxide.

Then this would be a great attribute because it follows the formula:

Molecule X has (doesn't have) attribute Y
The equivalent of Molecule X with (without) attribute Y is Z

I don't know enough chemistry to know if there are even such attributes for subsets of molecules. But if there are, then a small dataset of molecules with and without attribute Y would be sufficient to see if Z could be inferred from this model given X.
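
To show what I mean by an attribute vector, here is a minimal sketch in plain Python. The latent vectors are made-up toy values, not outputs of this model, and the helper names are hypothetical. The attribute direction is the difference between the mean latent vector of molecules with the attribute and the mean of those without it; adding it to the encoding of X gives a candidate latent point for Z, which would then be decoded.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attribute_vector(with_attr, without_attr):
    """Direction pointing from 'without attribute' to 'with attribute'."""
    pos = mean_vector(with_attr)
    neg = mean_vector(without_attr)
    return [p - q for p, q in zip(pos, neg)]

def apply_attribute(z, direction, strength=1.0):
    """Shift a latent point along the attribute direction."""
    return [zi + strength * di for zi, di in zip(z, direction)]

# Toy 2-D latent codes (made up); the real model uses a larger latent space.
with_attr = [[1.0, 0.0], [3.0, 2.0]]
without_attr = [[0.0, 0.0], [2.0, 0.0]]

direction = attribute_vector(with_attr, without_attr)
z_shifted = apply_attribute([0.5, 0.5], direction)
```

Using a negative `strength` would move in the "remove the attribute" direction instead.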

Reproduce GDB-17 construction pipeline

I think this is probably out of scope for this project but I think this may be the right community to target.

I read the GDB-17 construction paper (http://pubs.acs.org/doi/abs/10.1021/ci300415d). GDB-17 explicitly enumerates a subset of the total chemical space of molecules with up to 17 nuclei.

GDB-17 itself is not available in total, although the paper says that a sample of 1-2M entries is available and is sufficient for any training purposes.

The paper describes the construction process in enough detail to reproduce (although it would take effort), but does not provide code to do so. A number of the construction steps are fairly subtle (particularly with respect to rings, conjugation and aromatics) and there are a few "arbitrary" pruning choices that I'm not happy with. I believe a simple distributed pipeline could produce GDB-17 or an even better library automatically. Once the pipeline is built, larger GDBs could be constructed easily.

The challenge here is parsing what they describe in the methods section and converting that to code. They didn't do a good job of describing their methods in detail.

Issue when sampling with pretrained model - ValueError : axes don't match array

Hello, I've been trying to test out the autoencoder using the pre-trained model and examples in the readme. I pre-processed both the 50k and 500k datasets and get the same error with both. Has anyone also experienced this error?

python sample.py data/processed_500k.h5 data/model_500k.h5 --target autoencoder

Traceback (most recent call last):
File "sample.py", line 97, in
main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 44, in autoencoder
model.load(charset, args.model, latent_rep_size = latent_dim)
File "/Users/brentkuenzi/Documents/GitHub/keras-molecules/molecules/model.py", line 95, in load
self.create(charset, weights_file = weights_file, latent_rep_size = latent_rep_size)
File "/Users/brentkuenzi/Documents/GitHub/keras-molecules/molecules/model.py", line 50, in create
self.autoencoder.load_weights(weights_file)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/network.py", line 1180, in load_weights
f, self.layers, reshape=reshape)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/saving.py", line 916, in load_weights_from_hdf5_group
reshape=reshape)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/saving.py", line 675, in preprocess_weights_for_loading
weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 575, in transpose
return _wrapfunc(a, 'transpose', axes)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 52, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: axes don't match array

Error downloading the chembl22 and Zinc

[xianghui@login-c01 keras-molecules]$ python download_dataset.py --dataset chembl22
Bus error (core dumped)#################################################################################################################################### | ETA: 0:00:00 1.44 MB/s

Please kindly suggest.

Problem running plot.py

After running the sampling from trained model, I tried running the following: python plot.py --data 'encoded.h5' --model 'model.h5'. However, when I do this, I am getting the following error message:

Traceback (most recent call last):
File "plot.py", line 58, in
main()
File "plot.py", line 52, in main
plot_2d(args)
File "plot.py", line 41, in plot_2d
data = np.loadtxt(fname=args.data, dtype=str, delimiter='\t')
File "/Users/mythriambatipudi/miniconda2/envs/venv/lib/python2.7/site-packages/numpy/lib/npyio.py", line 927, in loadtxt
raise ValueError("Wrong number of columns at line %d" % line_num)
ValueError: Wrong number of columns at line 5

I am not sure what is causing this error. Could you please help me with this?

Updated pretrained model?

First of all I want to thank you all for your immense contributions to this library, it is truly helpful.

As someone without access to a powerful GPU, training these models is very time consuming. I encountered an error when testing out the model_500k.h5 on smiles_50k.h5 with sample_gen.py

I ran python sample_gen.py smiles_50k.h5 data/model_500k.h5 --target autoencoder
and received the error ValueError: Shapes (9, 1, 56, 9) and (9, 1, 55, 9) are not compatible from within the load_weights_from_hdf5_group function.

Perhaps an updated pretrained model is needed? The models I train compile, but are very inaccurate (because of my machine's limitations), so something tells me it has to do with the provided model. I might be able to get an AWS server running to help out if needed.

Regards,
Liam
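
One possible explanation, offered as a guess rather than a diagnosis: the two shapes differ only in one axis (56 versus 55), which would be consistent with the one-hot character set of the dataset having a different size from the one the weights were trained with. The sketch below illustrates that effect with toy data. `build_charset` is a hypothetical helper, and the padding character is an assumption, not the repository's preprocessing code.

```python
def build_charset(smiles_list, pad_char=" "):
    """Sorted set of characters in the dataset, plus a padding character."""
    chars = {pad_char}  # assumption: padding is part of the charset
    for smiles in smiles_list:
        chars.update(smiles)
    return sorted(chars)

# Toy datasets (made up): the second contains one extra character, 'N'.
dataset_a = ["CCO", "c1ccccc1"]
dataset_b = dataset_a + ["CCN"]

charset_a = build_charset(dataset_a)
charset_b = build_charset(dataset_b)

# The input width of the first convolution equals the charset size, so
# weights trained with one charset will not fit a model built from the other.
same_input_width = len(charset_a) == len(charset_b)
```

If this is what is happening, preprocessing the data with the same character set as the pretrained weights would be the thing to check first.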

Add random seed to make the training consistent

As title
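
A minimal sketch of the idea using only the standard library: with a fixed seed, the shuffle order (for example, the order in which training samples are visited) is identical across runs. `shuffled_order` is a hypothetical helper, not a function in this repository. In the actual training script one would presumably also seed NumPy via `numpy.random.seed`, and possibly the backend, which is not shown here.

```python
import random

def shuffled_order(n_items, seed):
    """Return a reproducible permutation of range(n_items) for a given seed."""
    rng = random.Random(seed)  # local generator, does not touch global state
    order = list(range(n_items))
    rng.shuffle(order)
    return order

run_1 = shuffled_order(10, seed=1337)
run_2 = shuffled_order(10, seed=1337)
```

Both calls return the same permutation because they share the seed.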