maxhodak / keras-molecules
Autoencoder network for learning a continuous representation of molecular structures.
License: MIT License
Hi, when I try to use the model with the command:
python2 sample.py data/smiles_50k.h5 data/model_50k.h5 --target autoencoder
I have the following error:
Using Theano backend.
Traceback (most recent call last):
File "sample.py", line 97, in <module>
main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 41, in autoencoder
data, charset = load_dataset(args.data, split = False)
File "/home/jarbona/keras-molecules/autoencoder/utils.py", line 20, in load_dataset
h5f = h5py.File(filename, 'r')
File "/usr/lib/python2.7/site-packages/h5py/_hl/files.py", line 272, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/lib/python2.7/site-packages/h5py/_hl/files.py", line 92, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5f.pyx", line 76, in h5py.h5f.open (/tmp/pip-4rPeHA-build/h5py/h5f.c:1930)
IOError: Unable to open file (File signature not found)
I am not sure if it comes from my install.
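A possible way to narrow this down: h5py raises "File signature not found" when the file does not begin with the HDF5 magic bytes. The sketch below reads the first bytes of a file and classifies them. It assumes (not confirmed by the report) that the file on disk might be a Git LFS pointer text file rather than the real data, since the data files in this repository are distributed through git lfs. `classify_header` and `classify_file` are hypothetical helper names, and HDF5 files with a user block keep their signature at a later offset, which this check ignores.

```python
# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
# First line of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_header(header):
    """Classify the leading bytes of a file (hypothetical helper)."""
    if header.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if header.startswith(LFS_POINTER_PREFIX):
        return "git-lfs-pointer"
    return "unknown"


def classify_file(path):
    """Read the start of `path` and classify it."""
    with open(path, "rb") as f:
        return classify_header(f.read(64))
```

Calling `classify_file("data/smiles_50k.h5")` and getting `"git-lfs-pointer"` would suggest the data was never actually downloaded.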
In case you guys haven't seen it, this paper came out recently and looks kind of interesting: https://arxiv.org/abs/1701.01329
My first couple of read-throughs leave me with some questions. The paper triggers a couple of my first-order heuristics (explaining basic stuff like RNNs, seemingly magical performance on generating long valid SMILES strings that suggests overfitting) and has a somewhat odd application of fine-tuning as transfer learning, among other things. Over this weekend I'm planning to work up some parts of this paper, such as the stacked LSTMs as a SMILES generator for transfer to a property prediction network. Does anyone else have comments on this paper or things to try?
Dear author,
I downloaded your code and the pre-trained model (model_500k.h5) and tried the following commands:
python preprocess.py data/smiles_500k.h5 data/processed_500k.h5
python sample.py data/processed_500k.h5 data/model_500k.h5 --target autoencoder
Then it outputs:
NC(=O)c1nc(cnc1N)c2ccc(Cl)c(c2)S(=O)(=O)Nc3cccc(Cl)c3
(-> encoder -> decoder ->)
7-7ASC-F@@7N7AAAAAAAAAAAAAlllllNAACAAC7lll7AlllAAACC%CLA-VVVVVVVVFF--lAAAAAAAAAAAAAAVVAAAAACCAACCAAACAAACCA77A-VVV--
I am not sure what happened to the pre-trained model; it does not seem to do a good job at all. Do you see a similar problem, or did I do something wrong?
Hello,
I'm pretty new to autoencoders, and I know we can use them for unsupervised learning. Is it possible to use this model to create representations (with encoding) for a set of SMILES?
If so, I guess I would first have to preprocess my data set and then use sample.py?
Thanks!
Hi,
I am trying to run preprocess.py and there is an unresolved reference to reduce:
charset = list(reduce(lambda x, y: set(y) | x, structures, set()))
However, if I add:
from functools import reduce
I then get this error:
Traceback (most recent call last):
File "preprocess.py", line 88, in <module>
main()
File "preprocess.py", line 60, in main
h5f.create_dataset('charset', data = charset)
File "C:\Users\dmcclymo\AppData\Local\Continuum\Anaconda3\lib\site-packages\h5py_hl\group.py", line 105, in crea
taset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "C:\Users\dmcclymo\AppData\Local\Continuum\Anaconda3\lib\site-packages\h5py_hl\dataset.py", line 93, in mak
_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py\h5t.pyx", line 1450, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:16
File "h5py\h5t.pyx", line 1470, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:15
File "h5py\h5t.pyx", line 1531, in h5py.h5t.py_create (C:\Minonda\conda-bld\h5py_1474482825505\work\h5py\h5t.c:15
TypeError: No conversion path for dtype: dtype('<U1')
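One workaround that may help here: under Python 3 the characters in `charset` are unicode strings, which NumPy stores as the `<U1` dtype that h5py cannot convert. A minimal sketch, assuming the dataset only contains ASCII characters, is to encode each character to bytes before passing the list to `create_dataset`. The toy `structures` list is illustrative only, and the `sorted` call is added just to make the order deterministic.

```python
from functools import reduce  # builtin in Python 2, must be imported in Python 3

# Toy SMILES strings, for illustration only.
structures = ["CCO", "c1ccccc1", "CC(=O)N"]

# Same reduction as in preprocess.py; sorted so the order is reproducible.
charset = sorted(reduce(lambda x, y: set(y) | x, structures, set()))

# Encode each character to a one-byte string. A list of byte strings maps to
# a fixed-length bytes dtype instead of the unicode '<U1' dtype.
charset_bytes = [ch.encode("ascii") for ch in charset]
```

The encoded list would then be passed as `data = charset_bytes`; whoever reads the file back has to decode the bytes again.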
I am working on changing the input from SMILES to Coulomb matrices. 200 Coulomb matrices (29x29 each) with their HOMO-LUMO gaps have been produced and saved into an .h5 file by the following code:
# Saving in .h5 format
h5f = h5py.File('processed.h5','w')
h5f.create_dataset('homo_lumo_gaps',data=homo_lumo_gaps)
h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)
When I run train.py directly with the generated processed.h5, it gives me this error message:
KeyError: "Unable to open object (Object 'data_train' doesn't exist)"
I think the problem is that the way I save the file differs from the original preprocess.py, but I don't understand the original layout and thus don't know how I should modify my code.
The preprocess.py I am using is here
https://docs.google.com/document/d/17f9n7tzeadpCo0_pit548QiU1-Loib2opMcm0I4MxzQ/edit?usp=sharing
I also want to know: other than the naming problem I mentioned, will the NN work as expected if I directly feed in the Coulomb matrices in place of the SMILES strings? Is there any part of the code I will need to modify? Thank you.
I know that this is not a good way to ask questions, but I really need some help. Any help is appreciated. Thank You.
To test the model_500k.h5 in the data folder, I ran the following commands:
python2.7 preprocess.py data/smiles_500k.h5 data/processed_500.h5;
python2.7 sample.py data/processed_500.h5 data/model_500k.h5 --target autoencoder
However, the second step gives the following error message:
main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 44, in autoencoder
model.load(charset, args.model, latent_rep_size = latent_dim)
File "/Users/flynn/Documents/desktop/GT_second_semester/song_lab/nanoparticle_research/molecule_BO/keras_molecule/keras-molecules/molecules/model.py", line 95, in load
self.create(charset, weights_file = weights_file, latent_rep_size = latent_rep_size)
File "/Users/flynn/Documents/desktop/GT_second_semester/song_lab/nanoparticle_research/molecule_BO/keras_molecule/keras-molecules/molecules/model.py", line 50, in create
self.autoencoder.load_weights(weights_file)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2500, in load_weights
self.load_weights_from_hdf5_group(f)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2585, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 990, in batch_set_value
assign_op = x.assign(assign_placeholder)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 575, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2242, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1617, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1568, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
debug_python_shape_fn, require_shape_fn)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 675, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension 2 in both shapes must be equal, but are 56 and 55 for 'Assign' (op: 'Assign') with input shapes: [9,1,56,9], [9,1,55,9].
Is model_500k.h5 not compatible with smiles_500k.h5? If so, how can I make use of model_500k.h5?
i.e. list of outputs from the VAE. If not, are you aware of any other resources that might have molecular candidates via a generative algorithm? (VAE, GAN, etc.)
I tried to run your repository on my laptop, which doesn't have a discrete GPU such as an Nvidia card, so I use the CPU build of tensorflow (version 1.0.0). I have satisfied all the other requirements in requirements.txt. I wonder whether this could work.
When I was trying to replicate the tutorial on the homepage of your repository, after successfully creating processed.h5 under the /data folder from smiles_50k.h5, I ran the following command and got this error:
python train.py data/processed.h5 model.h5 --epochs 20
Using TensorFlow backend.
Traceback (most recent call last):
File "train.py", line 59, in <module>
main()
File "train.py", line 37, in main
model.create(charset, latent_rep_size = args.latent_dim)
File "/home/simonfqy/Documents/keras-molecules/molecules/model.py", line 23, in create
_, z = self._buildEncoder(x, latent_rep_size, max_length)
File "/home/simonfqy/Documents/keras-molecules/molecules/model.py", line 62, in _buildEncoder
h = Flatten(name='flatten_1')(h)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 514, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 149, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 409, in call
return K.batch_flatten(x)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 825, in batch_flatten
x = tf.reshape(x, tf.pack([-1, prod(shape(x)[1:])]))
AttributeError: 'module' object has no attribute 'pack'
Can anyone please help me with this? Is it related to my using tensorflow for CPU only?
I've created a fresh conda environment and run pip install -r requirements.txt, but I'm still getting errors that are likely due to a version mismatch, e.g.:
File "preprocess.py", line 85, in <module>
main()
File "preprocess.py", line 72, in main
apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
File "preprocess.py", line 63, in create_chunk_dataset
chunks=tuple([chunk_size]+list(dataset_shape[1:])))
File "/home/spadavec/.local/lib/python2.7/site-packages/h5py/_hl/group.py", line 105, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/home/spadavec/.local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 76, in make_new_dset
if isinstance(chunks, tuple) and (-numpy.array([ i>=j for i,j in zip(tmp_shape,chunks) if i is not None])).any():
TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.
Does anyone have version numbers that aren't in conflict?
Hi, why do you use binary cross-entropy in vae_loss https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py#L77 ?
Categorical cross-entropy seems more natural here, and the authors of the original article also used categorical cross-entropy. Have you compared these two losses?
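To make the difference concrete, here is a small sketch in plain Python (not the Keras implementation) that evaluates both losses for a single one-hot character. Categorical cross-entropy only looks at the probability assigned to the true class, while binary cross-entropy treats every entry as an independent yes/no prediction and averages over them. The `target` and `predicted` values are made up for illustration.

```python
import math


def categorical_crossentropy(target, predicted):
    """-sum(t * log(p)): only the true class contributes for one-hot targets."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def binary_crossentropy(target, predicted):
    """Mean of per-entry Bernoulli cross-entropies."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
             for t, p in zip(target, predicted)]
    return -sum(terms) / len(terms)


# One character position over a hypothetical 4-symbol charset.
target = [0.0, 1.0, 0.0, 0.0]
predicted = [0.1, 0.7, 0.1, 0.1]

cce = categorical_crossentropy(target, predicted)
bce = binary_crossentropy(target, predicted)
```

With a large charset most entries are zeros that are easy to predict, so the averaged binary loss can look small even when the true character gets a modest probability.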
I am new to Ubuntu and this program, so this question may be very simple.
This error message appears when I run train.py:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[600,120,1503]
I think my GPU may be running out of memory. My graphics card is a GTX 970 with 4 GB of memory. I am running the program according to the README, using this line:
python train.py data/processed.h5 model.h5 --epochs 20
I have read about a very similar problem here:
http://stackoverflow.com/questions/39076388/tensorflow-deep-mnist-resource-exhausted-oom-when-allocating-tensor-with-shape
but I still cannot find a way to fix my problem. Can anyone please give me a hand? Thank you very much.
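A rough back-of-the-envelope check of the tensor named in the error, assuming 32-bit floats and assuming the first dimension (600) is the batch size:

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor (float32 assumed by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_element


full = tensor_bytes((600, 120, 1503))      # shape from the error message
smaller = tensor_bytes((100, 120, 1503))   # same tensor with a batch of 100
full_mib = full / 2.0 ** 20
```

Under these assumptions a single tensor of that shape needs roughly 413 MiB, and training keeps several such tensors alive at once (activations and gradients), so a 4 GB card can plausibly run out. The cost scales linearly with the first dimension, which is why a smaller batch size is the usual first thing to try.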
ZINC15 is now available
Currently, sample.py assumes the presence of a particular h5 file in data. Instead, sample.py should take the input data as a parameter. See related pull request.
preprocess.py loads the entire pre-preprocessed data into RAM, then applies transforms that require even more RAM. I'm trying to preprocess GDB-17 with 50M SMILES strings and it's just about filling up my 64 GB RAM machine. We should be able to go directly from SMILES input to preprocessed input files with far fewer resources, although it would take work to make all the steps incremental.
The files in data/ cannot be retrieved via git lfs, because the bandwidth quota seems to be exceeded (see error message below). Is there an alternate method of obtaining the files?
keras-molecules$ git lfs fetch
Fetching master
Git LFS: (0 of 3 files) 0 B / 156.20 MB
batch response: http: This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-a-personal-account/
Warning: errors occurred
The loss function for the autoencoder is calculated here using
kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
taking the mean over the dimensions of the latent representation. However, several other sources, including the VAE example in the keras repo, use the sum instead:
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
Is there a reason for the difference? Given the relatively large number of latent dimensions, it seems like this would significantly impact the strength of the KL regularization.
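The size of the effect can be checked with a few lines of plain Python. The sketch below evaluates both variants of the KL term for one sample, using the default `latent_rep_size` of 292 and made-up values for `z_mean` and `z_log_var`.

```python
import math


def kl_terms(z_mean, z_log_var):
    """Per-dimension terms 1 + log_var - mean^2 - exp(log_var)."""
    return [1 + lv - m * m - math.exp(lv) for m, lv in zip(z_mean, z_log_var)]


def kl_mean(z_mean, z_log_var):
    terms = kl_terms(z_mean, z_log_var)
    return -0.5 * sum(terms) / len(terms)


def kl_sum(z_mean, z_log_var):
    return -0.5 * sum(kl_terms(z_mean, z_log_var))


latent_dim = 292                    # default latent_rep_size in model.py
z_mean = [0.1] * latent_dim         # illustrative values only
z_log_var = [-0.2] * latent_dim

ratio = kl_sum(z_mean, z_log_var) / kl_mean(z_mean, z_log_var)
```

The two variants differ by exactly a factor of the latent dimension, so using the mean weights the KL regularizer 292 times less than the sum does, relative to the reconstruction term.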
Hello, I really appreciate that you reimplemented the model structure. But I think most people don't have such GPU resources, so could you add already-trained models to this repo?
Currently the requirements.txt file references keras from git:
-e git+https://github.com/fchollet/keras.git#egg=keras
However, keras is on PyPI.
If possible, I suggest using the one from pip with a pinned version number; that should make it less likely to get hit by future changes to the Keras API.
I was looking to see whether the paper's authors had any draft code for the model they describe, to check whether they did something differently. I found that R GB had implemented some new RNN layers they might have used for the decoder. I don't know if they ended up using the conventional GRU or this new TerminalGRU instead. Might be worth trying out:
keras-team/keras@master...rgbombarelli:master#diff-3118e4e28157032506f771f279a551c3R639
I had an h5 file with a saved "structure" column with an index that did not start at 0. Because of that, the np.arange call to generate train_idx, test_idx would generate indices that were not valid for the h5 file. Instead, we can generate train_idx, test_idx by running train_test_split on the index of the column itself.
See my proposed pull request.
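The idea can be sketched without any dependencies: split the actual index labels rather than the positions 0..n-1. The example below uses the standard library's `random` module as a stand-in for `train_test_split`; `split_index` is a hypothetical helper, not the code from the pull request.

```python
import random


def split_index(index_values, test_fraction=0.2, seed=0):
    """Split the real index labels into train and test lists (hypothetical helper)."""
    labels = list(index_values)
    random.Random(seed).shuffle(labels)
    n_test = int(round(len(labels) * test_fraction))
    return labels[n_test:], labels[:n_test]


# An index that does not start at 0, as in the report.
index_values = range(1000, 1010)
train_idx, test_idx = split_index(index_values)
```

Every value in `train_idx` and `test_idx` is then a label that actually exists in the column, whatever the index starts at.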
The preprocessing script as written tries to generate the entire vector all at once in memory, then write it in one go to the h5 processed file. Instead, we can generate the tensor one chunk at a time and write each chunk as soon as it is produced. See related pull request.
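A minimal sketch of the chunked approach, in plain Python so it runs without h5py. The `written` list stands in for the output dataset, and `iter_chunks` and `one_hot` are hypothetical helpers rather than the functions used in the pull request.

```python
def iter_chunks(items, chunk_size):
    """Yield (start_offset, chunk) pairs covering `items` in order."""
    for start in range(0, len(items), chunk_size):
        yield start, items[start:start + chunk_size]


def one_hot(smiles, charset, width):
    """One-hot encode a SMILES string, right-padded with spaces to `width`."""
    padded = smiles.ljust(width)
    return [[1 if ch == c else 0 for c in charset] for ch in padded]


structures = ["CCO", "CN", "C", "OCC", "N"]   # toy input
charset = [" ", "C", "N", "O"]

written = []  # stand-in for writes into the processed h5 dataset
for start, chunk in iter_chunks(structures, 2):
    encoded = [one_hot(s, charset, 4) for s in chunk]
    # A real implementation would write `encoded` at rows
    # start .. start + len(encoded) and then discard it.
    written.append((start, len(encoded)))
```

Only one chunk of encoded data is held in memory at a time, so peak memory depends on the chunk size rather than on the dataset size.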
Hi,
I am not sure whether the weights of the pretrained network have some problem or whether I made an error in my small pipeline, but the results from the autoencoder seem very bad when I run sample.py.
First I preprocessed the data for 500k molecules:
python2 preprocess.py data/smiles_500k.h5 data/processed_500.h5
which created a file of about 13 GB (processed_500.h5).
Then I ran:
python2 sample.py data/processed_500.h5 data/model_500k.h5 --target autoencoder
And I get:
CCOC(=O)CSC1=NC(=O)N2C=CC(=CC2=N1)C
S.+[SSSSS+[.(b..FFFFFFFFF(F%(F%FFF%%FFF
This is probably more of a machine learning issue in general, but in training keras-molecules on a small input set (~50K strings), the training eventually reaches 99.99% accuracy (loss ~.01), but the validation never exceeds 97%.
My feeling is this is a sign of overfitting the training data. The trainer has gotten effectively perfect at reconstructing the input SMILES strings but can't generalize to the odd cases in the test set that have no analogs in the training set.
I'm not sure how to debug this or if it really even represents a bug.
My training has been on the ZINC database, but when I decode generated vectors into SMILES strings, I pass them into rdkit's canonicalizer. It looks to me like ZINC used a different canonicalization mechanism.
It's not clear to me at all whether this would make any difference- obviously, if you train on a set generated by one canonicalizer, the system is going to learn to produce outputs that are consistent with that canonicalizer. In a sense, this is a bias (the canonicalization rules are based mostly in graph theory rather than chemistry).
python preprocess.py data/smiles_50k.h5 data/processed_50k.h5
Traceback (most recent call last):
File "preprocess.py", line 85, in <module>
main()
File "preprocess.py", line 72, in main
apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
File "preprocess.py", line 63, in create_chunk_dataset
chunks=tuple([chunk_size]+list(dataset_shape[1:])))
File "/home/vinay/chemistrytensor/local/lib/python2.7/site-packages/h5py/_hl/group.py", line 105, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/home/vinay/chemistrytensor/local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 76, in make_new_dset
if isinstance(chunks, tuple) and (-numpy.array([ i>=j for i,j in zip(tmp_shape,chunks) if i is not None])).any():
TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.
Hi, it looks like this code actually trains not a VAE model but a simple autoencoder model. Here are the reasons:
Maybe it makes sense to simply train an autoencoder model and compare the results.
I'm trying to mess around with some custom datasets, and keep getting the following error when trying to call read_latent_data:
File "ava.py", line 165, in <module>
main()
File "ava.py", line 127, in main
data, charset = read_encoded_vecs(data_path)
File "ava.py", line 63, in read_encoded_vecs
data, charset = read_latent_data(data_path)
File "/home/lerche/word_embedding_molecules/ava/keras_molecules/sample.py", line 34, in read_latent_data
data = h5f['latent_vectors'][:]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "/home/lerche/miniconda2/lib/python2.7/site-packages/h5py/_hl/group.py", line 166, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/tmp/pip-4rPeHA-build/h5py/h5o.c:3570)
KeyError: "Unable to open object (Object 'latent_vectors' doesn't exist)"
My process for creating the input is:
import import_smiles
import sys
i_file = import_smiles.read_smiles(sys.argv[1], column=1)
import_smiles.create_h5(d, 'output.h5')
python preprocess.py output.h5 preprocessed.h5
python train processed.h5 model.h5 --epochs 20
Then, if I try to run the following code, the error above is displayed:
from keras_molecules.sample import read_latent_data
data, charset = read_latent_data(data_path)
Interestingly, I can still run python sample.py on the data and model files; any idea where I'm going wrong?
EDIT: I should note that the keys that exist in the processed.h5 file are:
[u'charset', u'data_test', u'data_train']
EDIT2: I'm getting this same behavior with the 50k data/model files that are 'included' as examples.
After testing, I found that the procedure for building the list of unique characters used in the dataset (the "charset") is weird. The current encoding makes the resulting output much more fragile, because it does not prevent Cl from being interpreted as "C", "l". We should treat 'Cl' as a single token rather than as 'C' and 'l' separately. It is chemically unreasonable to see 'l' alone.
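One way to do this is to tokenize with a regular expression that tries the two-letter symbols before falling back to single characters. This is only a sketch: it covers Cl and Br, and it does not handle bracket atoms such as `[Na+]` or multi-character stereo markers, which would need their own alternatives in the pattern.

```python
import re

# Two-letter halogens are matched first; anything else is a single character.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, keeping Cl and Br intact."""
    return TOKEN_PATTERN.findall(smiles)


tokens = tokenize("Clc1ccc(Br)cc1")
```

The charset would then be built from these tokens instead of from raw characters, so 'l' and 'r' never appear as standalone symbols.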
I'm curious: can anybody help me understand why there's a separate validation dataset for the autoencoder?
If the purpose of the autoencoder is to take us from (categorical, discrete) to (multinomial, continuous) and then back to (categorical, discrete), then isn't it better if the network is trained on all of the samples, even to the point of overfitting? I'm saying this because we're not trying to train the network to generalize molecular structure; it's simply a transformer. Once we construct the latent space, it's up to us to analyse it, but shouldn't the autoencoder simply be trained to encode and decode, and not to generalize?
We still get predictive ability by manipulating transforms through the latent space (as a totally separate task), then the decoder can spit out the molecule that was the result of our latent transform.
No?
I would like to be able to generate a picture akin to the one displayed in the README. However, even though I converge my model beyond the point in the README, I do not get the distinct striations shown. Rather, I get a more spread-out graph, still with some striations.
Has anyone been able to replicate the image as displayed in the paper and README? Was it generated using an actual 2D latent dim, or a higher dimension then reduced to 2D with PCA (I have tried both and neither has worked)? Any help would be greatly appreciated.
Reading your code:
def create(self,
           charset,
           original_dim = 120,
           epsilon_std = 0.01,
           latent_rep_size = 292,
           weights_file = None):
    charset_length = len(charset)
    x = Input(shape=(original_dim, charset_length))
    h = Convolution1D(9, 9, input_dim=60)(x)
    h = Convolution1D(9, 9)(h)
    h = Convolution1D(10, 11)(h)
    h = Flatten()(h)
    h = Dense(435)(h)
    z_mean = Dense(latent_rep_size, name='z_mean')(h)
    z_log_var = Dense(latent_rep_size, name='z_log_var')(h)
    def sampling(args):
        z_mean, z_log_var = args
        batch_size = K.shape(z_mean)[0]
        epsilon = K.random_normal(shape=(batch_size, latent_rep_size), mean=0., std=epsilon_std)
        return z_mean + K.exp(z_log_var / 2) * epsilon
    z = Lambda(sampling)([z_mean, z_log_var])
    h = Dense(latent_rep_size, name='latent_input')(z)
    h = RepeatVector(original_dim)(h)
    h = GRU(501, return_sequences = True)(h)
    h = GRU(501, return_sequences = True)(h)
    h = GRU(501, return_sequences = True)(h)
    decoded_mean = TimeDistributedDense(charset_length, activation='softmax', name='decoded_mean')(h)
    def vae_loss(x, x_decoded_mean):
        x = K.flatten(x)
        x_decoded_mean = K.flatten(x_decoded_mean)
        xent_loss = original_dim * objectives.binary_crossentropy(x, x_decoded_mean)
        kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
        return xent_loss + kl_loss
    encoded_input = Input(shape=(original_dim, latent_rep_size))
    self.autoencoder = Model(x, decoded_mean)
    self.encoder = Model(x, z_mean)
    #self.decoder = Model(self.autoencoder.get_layer('latent_input')(encoded_input),
    #                     self.autoencoder.get_layer('decoded_mean')(encoded_input))
original_dim should actually be called something like max_length.
The input_dim argument to the first conv layer is not necessary (you already have an Input layer and you're using the functional API anyway).
Add activation='relu' to all of them (the conv layers).
The same goes for the Dense layers, except these two, for which a linear layer makes sense:
z_mean = Dense(latent_rep_size, name='z_mean')(h)
z_log_var = Dense(latent_rep_size, name='z_log_var')(h)
The h5 files are being written without compression. This makes them really large, wasting disk space. It might have some runtime costs (usually it's faster to read/write compressed data into the CPU than uncompressed data). I suggest using a fairly light compression, it won't take much time to encode or decode and will still compact things significantly.
I've been trying to package up keras-molecules as a module so I can run it on Google Cloud Machine Learning. Unfortunately, the top level of keras-molecules (with train.py etc.) isn't really module-ready. The repo name has a "-" in it, and there's no top-level __init__.py.
I receive the following error when trying to train the model.
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory
I am using cuda 10.0, python 2.7.15, on GPU.
Using SMILES makes this problem unnecessarily hard, likely detracting from the smoothness of the latent space and causing the majority of sampled SMILES to be invalid even after extended training. The graph convolution work isn't reversible, so you can't use it to generate new structures not in the initial set. This is pretty open-ended, but what other representations are there to try? Ideally one where every point corresponds to a valid molecule.
One idea I've had is to borrow the concept of cellular encodings, where rather than specifying the molecular graph directly we specify a set of operations which, when evaluated, yield a valid molecule. Then we're doing seq-to-seq learning for a list of "molecule construction ops." Careful thought would have to be put into the instruction set, but I think it might be possible to get something expressive where every combination is also valid.
@dakoner ?
If my understanding is right, when we use Convolution1D, the number of timesteps should stay constant. But interestingly, I found that the structure here doesn't follow this principle:
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_15 (InputLayer)            (None, 277, 76)       0
____________________________________________________________________________________________________
conv_1 (Convolution1D)           (None, 269, 9)        6165        input_15[0][0]
____________________________________________________________________________________________________
conv_2 (Convolution1D)           (None, 261, 9)        738         conv_1[0][0]
____________________________________________________________________________________________________
conv_3 (Convolution1D)           (None, 251, 10)       1000        conv_2[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 2510)          0           conv_3[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 435)           1092285     flatten_1[0][0]
____________________________________________________________________________________________________
z_mean (Dense)                   (None, 56)            24416       dense_1[0][0]
____________________________________________________________________________________________________
z_log_var (Dense)                (None, 56)            24416       dense_1[0][0]
____________________________________________________________________________________________________
lambda (Lambda)                  (None, 56)            0           z_mean[0][0]
                                                                   z_log_var[0][0]
____________________________________________________________________________________________________
latent_input (Dense)             (None, 56)            3192        lambda[0][0]
____________________________________________________________________________________________________
repeat_vector (RepeatVector)     (None, 277, 56)       0           latent_input[0][0]
____________________________________________________________________________________________________
gru_1 (GRU)                      (None, 277, 501)      838674      repeat_vector[0][0]
____________________________________________________________________________________________________
gru_2 (GRU)                      (None, 277, 501)      1507509     gru_1[0][0]
____________________________________________________________________________________________________
gru_3 (GRU)                      (None, 277, 501)      1507509     gru_2[0][0]
____________________________________________________________________________________________________
decoded_mean (TimeDistributed)   (None, 277, 76)       38152       gru_3[0][0]
====================================================================================================
Total params: 5,044,056
Trainable params: 5,044,056
Non-trainable params: 0
From conv2 to conv3, the dim is increasing and the number of timesteps is decreasing. Is this a bug in keras, or am I missing something?
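The shapes in the summary are consistent with a 1D convolution that uses no padding and a stride of 1, where each layer shortens the sequence by `kernel_size - 1`. A small sketch, assuming that "valid" behaviour and taking the kernel sizes (9, 9, 11) and filter counts (9, 9, 10) from the `Convolution1D` calls in model.py:

```python
def conv1d_valid_length(input_length, kernel_size):
    """Output length of an unpadded, stride-1 1D convolution."""
    return input_length - kernel_size + 1


def conv1d_params(in_channels, kernel_size, filters):
    """Weights plus one bias per filter."""
    return in_channels * kernel_size * filters + filters


length = 277                      # input timesteps from the summary
lengths = []
for kernel_size in (9, 9, 11):
    length = conv1d_valid_length(length, kernel_size)
    lengths.append(length)

params = [
    conv1d_params(76, 9, 9),      # conv_1: 76 input channels (charset size)
    conv1d_params(9, 9, 9),       # conv_2
    conv1d_params(9, 11, 10),     # conv_3
]
```

Both the sequence lengths and the parameter counts reproduce the numbers in the summary, so the shrinking timestep dimension looks like expected behaviour for unpadded convolutions rather than a bug. The last dimension changes from 9 to 10 simply because conv_3 has 10 filters.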
$ docker build .
Step 3/16 : RUN locale-gen en_US.UTF-8
---> Running in XXXXXXXXX
/bin/sh: 1: locale-gen: not found
The command '/bin/sh -c locale-gen en_US.UTF-8' returned a non-zero code: 127
I think I have to run 'apt-get install locale' like this question
A new paper came out:
https://arxiv.org/abs/1703.01925
They were able to implement a VAE that autoencodes the SMILES grammar as a context-free grammar, which is a pretty good approximation. They improved over the GB SMILES VAE. This would solve several of the issues that have been filed such as #31 #54.
The Grammar VAE code is at https://github.com/mkusner/grammarVAE. It looks a lot like the VAE code we already have, in terms of the actual model code. Incorporating the zinc_grammar and the masking shouldn't be too difficult; it just takes some work to implement the masking in Theano.
In fact, their code also uses mean and not sum in the KL loss term, considering #59.
Hello, I am very interested in testing your model.
Do you think you could share the trained model file (the model.h5 file)?
Thank you
This codebase works well for me and I'm able to replicate the current results. Having worked a good bit with other latent spaces, I'm curious to find out what other operations the latent space of this model might support. Specifically, I suspect that the latent space could also support analogies and attribute vectors, but unfortunately I'm not familiar with chemistry datasets and SMILES strings.
Would anyone be interested in helping me build a labelled dataset of molecules that includes binary attributes and then investigating the results of applying attribute vectors? An example structure of the dataset would be:
SMILES string | Polar | Toxic | Flammable | Positive Oxidation State |
---|---|---|---|---|
CN1CCC[C@H]1c2cccnc2 | True | False | False | True |
O=C1Oc2ccccc2c3ccccc13 | False | True | True | False |
... |
Generally, these datasets can be useful even if they are much smaller than the training dataset - say dozens to hundreds of rows. Ideally, the chosen attributes would be those that could serve as unambiguous labels and operators. For example, pretend the following is true:
Carbon dioxide is a polar molecule.
The equivalent to carbon dioxide without polarity is carbon monoxide.
Then this would be a great attribute because it follows the formula:
Molecule X has (doesn't have) attribute Y
The equivalent of Molecule X with (without) attribute Y is Z
I don't know enough chemistry to know if there are even such attributes for subsets of molecules. But if there are, then a small dataset of molecules with and without attribute Y would be sufficient to see if Z could be inferred from this model given X.
I think this is probably out of scope for this project but I think this may be the right community to target.
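For readers unfamiliar with attribute vectors, one common construction is the difference between the mean latent vector of examples that have the attribute and the mean of those that don't. The sketch below shows only that arithmetic on made-up 2D latent vectors. Whether it yields chemically meaningful results in this model's latent space is exactly the open question, and encoding or decoding real molecules would still require the trained model.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]


def attribute_vector(with_attr, without_attr):
    """Difference of means: direction from 'without' towards 'with'."""
    return [a - b for a, b in zip(mean_vector(with_attr), mean_vector(without_attr))]


def apply_attribute(z, attr, strength=1.0):
    """Move latent point z along the attribute direction."""
    return [zi + strength * ai for zi, ai in zip(z, attr)]


# Made-up latent vectors for molecules with and without some attribute Y.
with_attr = [[1.0, 0.0], [3.0, 2.0]]
without_attr = [[0.0, 0.0], [2.0, 0.0]]

attr = attribute_vector(with_attr, without_attr)
shifted = apply_attribute([0.5, 0.5], attr)
```

In the proposed experiment, `shifted` would be passed to the decoder and the result compared with the expected molecule Z.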
I read the GDB-17 construction paper (http://pubs.acs.org/doi/abs/10.1021/ci300415d). GDB-17 explicitly enumerates a subset of the total chemical space of molecules with up to 17 nuclei.
GDB-17 itself is not available in total, although the paper says that a sample of 1-2M entries is available and is sufficient for any training purposes.
The paper describes the construction process in enough detail to reproduce (although it would take effort), but does not provide code to do so. A number of the construction steps are fairly subtle (particularly with respect to rings, conjugation and aromatics) and there are a few "arbitrary" pruning choices that I'm not happy with. I believe a simple distributed pipeline could produce GDB-17 or an even better library automatically. Once the pipeline is built, larger GDBs could be constructed easily.
The challenge here is parsing what they describe in the methods section and converting that to code. They didn't do a good job of describing their methods in detail.
Hello, I've been trying to test the autoencoder using the pre-trained model and the examples in the readme. I pre-processed both the 50k and 500k datasets and get the same error with both. Has anyone else experienced this error?
python sample.py data/processed_500k.h5 data/model_500k.h5 --target autoencoder
Traceback (most recent call last):
File "sample.py", line 97, in <module>
main()
File "sample.py", line 90, in main
autoencoder(args, model)
File "sample.py", line 44, in autoencoder
model.load(charset, args.model, latent_rep_size = latent_dim)
File "/Users/brentkuenzi/Documents/GitHub/keras-molecules/molecules/model.py", line 95, in load
self.create(charset, weights_file = weights_file, latent_rep_size = latent_rep_size)
File "/Users/brentkuenzi/Documents/GitHub/keras-molecules/molecules/model.py", line 50, in create
self.autoencoder.load_weights(weights_file)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/network.py", line 1180, in load_weights
f, self.layers, reshape=reshape)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/saving.py", line 916, in load_weights_from_hdf5_group
reshape=reshape)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/keras/engine/saving.py", line 675, in preprocess_weights_for_loading
weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 575, in transpose
return _wrapfunc(a, 'transpose', axes)
File "/Users/brentkuenzi/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 52, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: axes don't match array
[xianghui@login-c01 keras-molecules]$ python download_dataset.py --dataset chembl22
Bus error (core dumped)#################################################################################################################################### | ETA: 0:00:00 1.44 MB/s
Please kindly suggest.
After sampling from the trained model, I tried running the following: python plot.py --data 'encoded.h5' --model 'model.h5'. However, I get the following error message:
Traceback (most recent call last):
File "plot.py", line 58, in <module>
main()
File "plot.py", line 52, in main
plot_2d(args)
File "plot.py", line 41, in plot_2d
data = np.loadtxt(fname=args.data, dtype=str, delimiter='\t')
File "/Users/mythriambatipudi/miniconda2/envs/venv/lib/python2.7/site-packages/numpy/lib/npyio.py", line 927, in loadtxt
raise ValueError("Wrong number of columns at line %d" % line_num)
ValueError: Wrong number of columns at line 5
I am not sure what is causing this error. Could you please help me with this?
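One thing worth ruling out, as a guess rather than a diagnosis: np.loadtxt expects delimited text, and the file passed here has an .h5 name. The sketch below checks whether a file starts with the HDF5 signature and otherwise reports how many tab-separated fields each line has. The function names are hypothetical and not part of this repo, and the signature check assumes the file has no user block in front of the HDF5 data.

```python
import collections
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file without a user block


def looks_like_hdf5(path):
    """True if the file begins with the HDF5 format signature."""
    with open(path, "rb") as handle:
        return handle.read(8) == HDF5_SIGNATURE


def column_counts(path, delimiter="\t"):
    """Map field count -> 1-based line numbers having that many fields."""
    counts = collections.defaultdict(list)
    with open(path, "r") as handle:
        for line_num, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            counts[len(line.split(delimiter))].append(line_num)
    return dict(counts)


# Small demo on a throwaway text file whose third line is one field short.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\t1\t2\nb\t3\t4\nc\t5\n")
    demo_path = tmp.name

demo_is_hdf5 = looks_like_hdf5(demo_path)
demo_counts = column_counts(demo_path)
os.remove(demo_path)
```

If the file turns out to be plain text with consistent field counts, another possibility is that np.loadtxt treats `#` as a comment character by default while SMILES uses `#` for triple bonds, so a row can be cut short that way. The sketch above does not model that case.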
First of all I want to thank you all for your immense contributions to this library, it is truly helpful.
As someone without access to a powerful GPU, training these models is very time consuming. I encountered an error when testing out the model_500k.h5 on smiles_50k.h5 with sample_gen.py
I ran python sample_gen.py smiles_50k.h5 data/model_500k.h5 --target autoencoder
and received the error ValueError: Shapes (9, 1, 56, 9) and (9, 1, 55, 9) are not compatible
from within the load_weights_from_hdf5_group function.
Perhaps an updated pretrained model is needed? The models I train compile, but are very inaccurate (because of my machine's limitations), so something tells me it has to do with the provided model. I might be able to get an AWS server running to help out if needed.
Regards,
Liam
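A possible reading of the 56 vs 55 in that error, offered as an assumption and not verified against the repo: if that axis of the weight tensor is the one-hot character dimension, then the dataset being sampled has one more distinct character than the data the model was trained on. The sketch below shows how a charset built from the data changes size when the data changes. `build_charset` and `charset_diff` are hypothetical helpers, and padding with a space character is also an assumption.

```python
def build_charset(smiles_list, pad_char=" "):
    """Sorted list of every character seen in the data, plus the padding character."""
    chars = {pad_char}
    for smiles in smiles_list:
        chars.update(smiles)
    return sorted(chars)


def charset_diff(charset_a, charset_b):
    """Characters that appear in only one of the two charsets."""
    a, b = set(charset_a), set(charset_b)
    return sorted(a - b), sorted(b - a)


# Two toy datasets: the second contains one molecule with extra characters.
training_like = build_charset(["CCO", "c1ccccc1", "CC(=O)O"])
sampling_like = build_charset(["CCO", "c1ccccc1", "CC(=O)O", "C#N"])

only_in_training, only_in_sampling = charset_diff(training_like, sampling_like)
```

If this is what is happening, sampling with the same processed dataset (and therefore the same charset) that the model was trained on should avoid the shape mismatch.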
As title