hypnopump / minifold


MiniFold: Deep Learning for Protein Structure Prediction inspired by DeepMind AlphaFold algorithm

License: MIT License

Jupyter Notebook 97.15% Python 2.62% Julia 0.23%

minifold's Introduction

MiniFold

Abstract

Introduction: The Protein Folding Problem (predicting a protein's structure from its sequence) is an interesting one because DNA sequence data is becoming cheaper at an unprecedented rate, even faster than Moore's law [1]. Recent research has applied Deep Learning techniques to accurately predict the structure of polypeptides [2, 3].
Methods: In this work, we present an attempt to imitate the AlphaFold protein structure prediction architecture [3]. We use 1-D Residual Networks (ResNets) to predict dihedral torsion angles and 2-D ResNets to predict distance maps between the protein's amino acids [4]. We use the CASP7 section of the ProteinNet dataset for training and evaluation of the model [5]. An open-source implementation of the system described can be found here.
Results: We are able to obtain distance maps and torsion angle predictions for a protein given its sequence and PSSM. Our angle prediction model achieves an MAE (Mean Absolute Error) of 0.39, and R^2 coefficients of 0.39 and 0.43 for Phi and Psi respectively, whereas SoTA is around 0.69 (Phi) and 0.73 (Psi). Our methods do not include post-processing of the Deep Learning outputs, which can be very noisy.
Conclusion: We have shown the potential of Deep Learning methods and their possible application to the Protein Folding Problem. Despite technical limitations, Neural Networks are able to capture relations in the data. Despite its visually pleasing results, our system lacks components such as reconstruction of the protein structure from the dihedral torsion angles and the distance map of a given protein, and post-processing of our predictions to reduce noise.

Citation

@misc{ericalcaide2019,
  title = {MiniFold: a DeepLearning-based Mini Protein Folding Engine},
  publisher = {GitHub},
  journal = {GitHub repository},
  author = {Alcaide, Eric},
  year = {2019},
  howpublished = {\url{https://github.com/EricAlcaide/MiniFold/}},
  doi = {10.5281/zenodo.3774491},
  url = {https://doi.org/10.5281/zenodo.3774491}
}

Introduction

DeepMind, a company affiliated with Google and specialized in AI, presented a novel algorithm for Protein Structure Prediction at CASP13 (a competition whose goal is to find the best algorithms for predicting protein structures in different categories).

The Protein Folding Problem is an interesting one since there is a huge amount of DNA sequence data available and it is becoming cheaper at an unprecedented rate (faster than Moore's law). Cells build the proteins they need through transcription (from DNA to RNA) and translation (from RNA to amino acids (AAs)). However, the function of a protein does not depend solely on the sequence of AAs that form it, but also on their spatial 3D folding. Thus, it's hard to predict the function of a protein from its DNA sequence. AI can help solve this problem by learning the relations that exist between a given sequence and its spatial 3D folding.

The DeepMind work presented at CASP was not a technological breakthrough (they did not invent any new type of AI) but an engineering one: they applied well-known AI algorithms to a problem, along with lots of data and computing power, and found a great solution through model design, feature engineering, model ensembling and so on. DeepMind has no plan to open source the code of their model or to set up a prediction server.

Based on this premise, the aim of this project is to build a model suitable for protein 3D structure prediction, inspired by AlphaFold and by any other AI solutions that may appear and achieve SOTA results.

Methods

Proposed Architecture

The methods implemented are inspired by DeepMind's original post. Two different residual neural networks (ResNets) are used to predict the angles between adjacent amino acids (AAs) and the distances between every pair of AAs of a protein. A 2D ResNet is used for distance prediction and a 1D ResNet for angle prediction.

Image from DeepMind's original blogpost.

Distance prediction

The ResNet for distance prediction is built as a 2D-ResNet and takes as input tensors of shape LxLxN (a normal image would be LxLx3). The window length is set to 200 (we only train and predict proteins of less than 200 AAs) and smaller proteins are padded to match the window size. No larger proteins nor crops of larger proteins are used.

The 41 (N) channels of the input are distributed as follows: 20 for AAs in one-hot encoding (LxLx20), 1 for the Van der Waals radius of the AA encoded previously, and 20 for the Position Specific Scoring Matrix (PSSM).

The network consists of packs of residual blocks, with the architecture illustrated below, whose blocks cycle through strides of 1, 2, 4 and 8. These are preceded by a plain convolutional layer and followed by a final convolutional layer, where a Softmax activation function is applied to obtain an output of shape LxLx7 (6 classes for the different distance ranges + 1 trash class for the padding, which is penalized less).

Architecture of the residual block used. A mini version of the block in this description

The network has been trained with 134 proteins and evaluated with 16 more. This is clearly insufficient data, but memory constraints did not allow for more. By comparison, AlphaFold was trained with 29k proteins.

The output of the network is, then, a classification among 6 classes, which are ranges of distances between a pair of AAs. Here is an example of the distances predicted by AlphaFold and the distances predicted by our model:

Ground truth (left) and predicted distances (right) by AlphaFold.

Ground truth (left) and predicted distances (right) by MiniFold.

The architecture of the Residual Network for distance prediction is very similar to AlphaFold's, the main difference being that the model described here was trained with windows of 200x200 AAs while AlphaFold was trained with crops of 64x64 AAs. At prediction time, AlphaFold used the smaller window size to average across different outputs and achieve a smoother result. Our prediction, however, comes from a single window, so there is no averaging (hence noisier predictions).
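
To make the output encoding concrete, the following sketch turns a matrix of pairwise distances into an LxLx7 one-hot target (6 distance ranges plus one padding class) for a 200-residue window. The bin edges, function names and random coordinates are placeholders chosen for illustration; the thresholds actually used by the model are not stated here.

```python
import numpy as np

WINDOW = 200
# Hypothetical upper edges (in angstroms) separating the 6 distance classes.
BIN_EDGES = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
N_CLASSES = len(BIN_EDGES) + 2   # 6 distance classes + 1 padding class
PAD_CLASS = N_CLASSES - 1

def distance_map(coords):
    """Pairwise Euclidean distances between residues, shape (L, L)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def encode_target(dist, window=WINDOW):
    """Bin the distances, pad to the window size and one-hot encode."""
    L = dist.shape[0]
    labels = np.full((window, window), PAD_CLASS)
    labels[:L, :L] = np.digitize(dist, BIN_EDGES)   # class indices 0..5
    return np.eye(N_CLASSES)[labels]                # (window, window, 7)

# Toy protein of 50 residues with random coordinates.
coords = np.random.default_rng(0).normal(scale=10.0, size=(50, 3))
target = encode_target(distance_map(coords))
print(target.shape)  # (200, 200, 7)
```

Keeping the padding in its own class is what allows it to be given a smaller weight in the loss than the real distance classes.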

Angles prediction

The ResNet for angle prediction is built as a 1D-ResNet and takes as input tensors of shape LxN. The window length is set to 34 and we only train on and predict angles of proteins with fewer than 200 (L) AAs. Neither larger proteins nor crops of larger proteins are used.

The 42 (N) channels of the input are distributed as follows: 20 for AAs in one-hot encoding (Lx20), 2 for the Van der Waals radius and the surface accessibility of the AA encoded previously, and 20 for the Position Specific Scoring Matrix (PSSM).
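
As a rough illustration of this input layout, the sketch below assembles the 42 channels for one protein and slices the result into windows of 34 residues. The function names, the amino-acid ordering, the one-residue step between windows and the random placeholder values are all assumptions made for the example, not details taken from the notebooks.

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 amino acids
WINDOW = 34

def one_hot(seq):
    """Return an (L, 20) one-hot matrix for an amino-acid sequence."""
    idx = np.array([AA_ORDER.index(aa) for aa in seq])
    return np.eye(len(AA_ORDER))[idx]

def build_features(seq, vdw, access, pssm):
    """Stack one-hot (20) + VdW radius (1) + accessibility (1) + PSSM (20)."""
    return np.concatenate(
        [one_hot(seq), vdw[:, None], access[:, None], pssm], axis=1
    )  # shape (L, 42)

def sliding_windows(features, window=WINDOW):
    """Return every contiguous crop of `window` residues."""
    n = features.shape[0] - window + 1
    return np.stack([features[i:i + window] for i in range(n)])

seq = "ACDEFGHIKLMNPQRSTVWY" * 2            # toy 40-residue sequence
L = len(seq)
rng = np.random.default_rng(0)
feats = build_features(seq, rng.random(L), rng.random(L), rng.random((L, 20)))
crops = sliding_windows(feats)
print(feats.shape, crops.shape)  # (40, 42) (7, 34, 42)
```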

We followed the ResNet20 architecture but replaced the 2D convolutions with 1D convolutions. The network output is a vector of 4 numbers that represent the sin and cos of the 2 dihedral angles between two AAs (Phi and Psi).
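
Predicting sin and cos instead of the raw angle avoids the discontinuity at ±180°, and the angle can be recovered with `arctan2`. A minimal sketch of that encoding and decoding follows; the order of the 4 numbers in the vector is an assumption made for the example.

```python
import numpy as np

def encode_angles(phi, psi):
    """Map angles in radians to [sin(phi), cos(phi), sin(psi), cos(psi)]."""
    return np.stack([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)], axis=-1)

def decode_angles(vec):
    """Recover (phi, psi) in radians from the 4-number encoding."""
    phi = np.arctan2(vec[..., 0], vec[..., 1])
    psi = np.arctan2(vec[..., 2], vec[..., 3])
    return phi, psi

phi = np.array([-1.0, 3.0])
psi = np.array([2.5, -3.0])
phi_back, psi_back = decode_angles(encode_angles(phi, psi))
print(np.allclose(phi, phi_back), np.allclose(psi, psi_back))  # True True
```

Because `arctan2` only depends on the ratio of its arguments, decoding still works when a network's predicted sin/cos pair is not exactly on the unit circle.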

Dihedral angles were extracted from the raw coordinates of the protein backbone atoms (the amide nitrogen N, C-alpha and the carbonyl carbon C of each AA). The plot of Phi against Psi is known as the Ramachandran plot:

The cluster observed in the upper-left region corresponds to the angles between AAs when they form a Beta-sheet, while the cluster observed in the central-left region corresponds to the angles between AAs when they form an Alpha-helix.
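
For reference, the dihedral extraction mentioned above can be sketched with the standard four-point torsion formula. The array layout (one N, C-alpha, C triple per residue) and the function names are assumptions for this example; the preprocessing notebook may organize the data differently.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle in radians defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def phi_psi(backbone):
    """backbone: (L, 3, 3) array holding N, CA, C coordinates per residue.

    Phi is undefined for the first residue and Psi for the last one,
    so those entries are left as NaN.
    """
    L = backbone.shape[0]
    phi = np.full(L, np.nan)
    psi = np.full(L, np.nan)
    for i in range(L):
        n, ca, c = backbone[i]
        if i > 0:
            phi[i] = dihedral(backbone[i - 1, 2], n, ca, c)   # C(i-1), N, CA, C
        if i < L - 1:
            psi[i] = dihedral(n, ca, c, backbone[i + 1, 0])   # N, CA, C, N(i+1)
    return phi, psi

# Toy backbone with random coordinates, only to show the shapes involved.
backbone = np.random.default_rng(0).normal(size=(5, 3, 3))
phi, psi = phi_psi(backbone)
print(phi.shape, psi.shape)  # (5,) (5,)
```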

The results of the model when making predictions can be observed below:

The network has been trained with 38.7k crops from 600 different proteins and evaluated with some 4.3k more.

The architecture of the Residual Network is different from the one implemented in AlphaFold. The model implemented here was inspired by this paper and this one.

Results

While the architectures implemented in this first preliminary version of the project are inspired by papers with great results, the results obtained here are not as good as they could be. It's likely that the lack of Multiple Sequence Alignments (MSA), MSA-based features and physicochemical properties of AAs (beyond the Van der Waals radius), as well as the lack of both model and feature engineering, have affected the models negatively, together with the small amount of data they have been trained on.

For that reason, we can conclude that this has been a somewhat naive approach, and we expect to implement further ideas and improvements to these models. As the DeepMind team says: "With few or no alignments accuracy is much worse". It would be interesting to use the predictions made by the models as constraints for a folding algorithm (e.g. Rosetta) in order to visualize our results.

Reproducing the results

Follow these steps to run the code locally or in the cloud:

Clone the repo: git clone https://github.com/EricAlcaide/MiniFold
Install dependencies: pip install -r requirements.txt
Get & format the data
1. Download data here (select CASP7 text-based format)
2. Extract/Decompress the data in any directory
3. Create the /data folder inside the MiniFold directory and copy the training_30, training_70 and training_90 files to it. Change their extensions to .txt.
Execute data preprocessing notebooks (preprocessing folder) in the following order (we plan to release simple scripts instead of notebooks very soon):
1. get_proteins_under_200aa.jl *source_path* *destin_path*: - selects proteins under 200 residues from the source_path file (alternatively can be declared in the script itself) - (you will need the Julia programming language v1.0 in order to run it)
  1. Alternatively: julia_get_proteins_under_200aa.ipynb (you will need Julia as well as iJulia)
2. get_angles_from_coords_py.ipynb - calculates dihedral angles from raw coordinates
3. angle_data_preparation_py.ipynb
Run the models!
1. For angles prediction: models/predicting_angles.ipynb
2. For distance prediction:
  1. models/distance_pipeline/pretrain_model_pssm_l_x_l.ipynb
  2. models/distance_pipeline/pipeline_caller.py
3D structure modelling from predicted results
1. For RR format conversion and 3D structure modelling follow the steps given in models/distance_pipeline/Tutorials/README.pdf
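
The length filter performed in the first preprocessing step can be sketched as follows. The repository does this with a Julia script; Python is used here only for illustration, and the record layout (an `[ID]` line opening each record and the sequence on the line after `[PRIMARY]`) is an assumption about the ProteinNet text format. The function names are hypothetical.

```python
def iter_records(lines):
    """Yield each record as a list of lines; a record starts at an '[ID]' line."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "[ID]" and record:
            yield record
            record = []
        if line or record:
            record.append(line)
    if record:
        yield record

def primary_length(record):
    """Length of the sequence on the line that follows '[PRIMARY]'."""
    return len(record[record.index("[PRIMARY]") + 1])

def filter_short(lines, max_len=200):
    """Keep only the records whose sequence is shorter than `max_len`."""
    return [r for r in iter_records(lines) if primary_length(r) < max_len]

# Toy input with one short and one long protein.
toy = ["[ID]", "short", "[PRIMARY]", "ACDEF", "",
       "[ID]", "long", "[PRIMARY]", "A" * 250, ""]
kept = filter_short(toy)
print([r[1] for r in kept])  # ['short']
```

With a real file, `lines` could simply be the open file object, since the functions only iterate over it line by line.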

If you encounter any errors during installation, don't hesitate to open an issue.

Post processing of predictions (added end 2020 - not by the original author)

Presently, the post-processing of the predictions is done using a Python script which converts the predicted results into RR format, the Residue-Residue contact prediction format. This format represents the probability of contact between pairs of residues. Data in this format are inserted between the MODEL and END records of the submission file. The prediction starts with the sequence of the predicted target, split across lines. The sequence is followed by the list of contacts in the five-column format shown below:

	PFRMAT RR
	TARGET T0999
	AUTHOR 1234-5678-9000
	REMARK Predictor remarks
	METHOD Description of methods used
	METHOD Description of methods used
	MODEL  1
	HLEGSIGILLKKHEIVFDGC # <- entire target sequence (up to 50 
	HDFGRTYIWQMSDASHMD   #   residues per line)
	1 8 0 8 0.720        
	1 10 0 8 0.715       # <- i=1 j=10: indices of residues (integers), 
	31 38 0 8 0.710       
	10 20 0 8 0.690      # <- d1=0  d2=8: the range of Cb-Cb distance   
	30 37 0 8 0.678      #    predicted for the residue pair (i,j)  
	11 29 0 8 0.673       
	1 9 0 8 0.63         # <- p=0.63: probability of the residues i=1 and j=9 
	21 37 0 8 0.502      #    being in contact (in descending order) 
	8 15 0 8 0.401
	3 14 0 8 0.400
	5 15 0 8 0.307
	7 14 0 8 0.30
	END

The predictions in this format can then be used as input to build 3D models with structure modelling software.
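
A minimal sketch of how such a file could be produced from a list of predicted contacts is shown below. It mirrors the layout of the example above; the function name, its parameters and the default header values are hypothetical and do not describe the script shipped in the repository.

```python
def format_rr(sequence, contacts, target="T0999", author="1234-5678-9000",
              method="Description of methods used", d_min=0, d_max=8):
    """Build an RR-format string.

    contacts: iterable of (i, j, p) with 1-based residue indices i, j
    and the predicted contact probability p.
    """
    lines = [
        "PFRMAT RR",
        f"TARGET {target}",
        f"AUTHOR {author}",
        f"METHOD {method}",
        "MODEL  1",
    ]
    # Target sequence, at most 50 residues per line.
    lines += [sequence[k:k + 50] for k in range(0, len(sequence), 50)]
    # Contacts in descending order of probability.
    for i, j, p in sorted(contacts, key=lambda c: c[2], reverse=True):
        lines.append(f"{i} {j} {d_min} {d_max} {p:.3f}")
    lines.append("END")
    return "\n".join(lines)

print(format_rr("HLEGSIGILLKKHEIVFDGC", [(1, 9, 0.63), (1, 8, 0.72)]))
```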

Discussion

Future

There are plenty of ideas that could not be tried in this project due to computational and time constraints. Briefly, some promising ideas or future directions are listed below:

Train with crops of 64x64 AAs, not windows of 200x200 AAs and average at prediction time.
Use data from Multiple Sequence Alignments (MSA) such as paired changes between AAs.
Use distance map as potential input for angle prediction or vice versa.
Train with more data
Use predictions as constraints to a Protein Structure Prediction pipeline (CNS, Rosetta Solve or others).
Set up a prediction script/pipeline from raw text/FASTA file
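
The first idea in the list, predicting on 64x64 crops and averaging the overlapping outputs, could look roughly like the sketch below. `predict_crop` stands in for a trained model and is hypothetical, as are the stride and the function names; AlphaFold's actual cropping scheme is more elaborate than this simple tiling, and the sketch assumes the protein is at least as long as one crop.

```python
import numpy as np

def average_crops(features, predict_crop, crop=64, stride=32, n_classes=7):
    """Average per-crop class probabilities over an (L, L, N) input.

    predict_crop: callable mapping a (crop, crop, N) array to
    (crop, crop, n_classes) probabilities. Assumes L >= crop.
    """
    L = features.shape[0]
    total = np.zeros((L, L, n_classes))
    counts = np.zeros((L, L, 1))
    last = L - crop
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)          # make sure the final rows/columns are covered
    for i in starts:
        for j in starts:
            window = features[i:i + crop, j:j + crop]
            total[i:i + crop, j:j + crop] += predict_crop(window)
            counts[i:i + crop, j:j + crop] += 1
    return total / counts

def dummy_model(window):
    """Placeholder model that returns uniform class probabilities."""
    return np.full(window.shape[:2] + (7,), 1.0 / 7.0)

features = np.zeros((200, 200, 41))
averaged = average_crops(features, dummy_model)
print(averaged.shape)  # (200, 200, 7)
```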

Limitations

This project was developed mainly over 3 weeks by 1 person, so many limitations have appeared. They are listed below to give a sense of what this project is and what it is not.

No usage of Multiple Sequence Alignments (MSA): The methods developed in this project don't use MSA nor MSA-based features as input.
Computing power/memory: The project was developed on a computer with the following specifications: Intel i7-6700k, 8 GB RAM, NVIDIA GTX-1060Ti 6 GB and 256 GB of storage. The capacity for data exploration, processing, training and evaluating the models is therefore limited.
GPU/TPUs for training: The models were trained and evaluated on a single GPU. No cloud servers were used.
Time: Three weeks of development during spare time.
Domain expertise: No experts in the field of genomics, proteomics or bioinformatics were involved. The author knows the basics of Biochemistry and Deep Learning.
Data: The average paper about Protein Structure Prediction uses a custom dataset acquired from the Protein Data Bank (PDB). No such dataset was used. Instead, we used a subset of the ProteinNet dataset from CASP7. Our models are trained with just 150 proteins (distance prediction) and 600 proteins (angle prediction) due to memory constraints.

Due to these limitations and constraints, the precision/accuracy that the methods developed here can achieve is limited when compared against State Of The Art algorithms.

References

Contribute

Hey there! New ideas are welcome: open/close issues, fork the repo and share your code with a Pull Request. Clone this project to your computer:

git clone https://github.com/EricAlcaide/MiniFold

By participating in this project, you agree to abide by the thoughtbot code of conduct.

minifold's Issues

How to visualize the predicted data by VMD or other visual tools?

Error when loading the model

Currently I am using this code to load the model:
model = keras.models.load_model("tester_28_lxl.h5")

However, when I am loading this I'm getting this error.


Traceback (most recent call last):
  File "pretrain_model_pssm_l_x_l.py", line 255, in <module>
    model2 = keras.models.load_model("tester_28_lxl.h5")
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 207, in load_model
    compile)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 184, in load_model_from_hdf5
    custom_objects=custom_objects)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/saving/model_config.py", line 64, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/layers/serialization.py", line 177, in deserialize
    printable_module_name='layer')
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 358, in deserialize_keras_object
    list(custom_objects.items())))
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py", line 669, in from_config
    config, custom_objects)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py", line 1275, in reconstruct_from_config
    process_layer(layer_data)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py", line 1257, in process_layer
    layer = deserialize_layer(layer_data, custom_objects=custom_objects)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/layers/serialization.py", line 177, in deserialize
    printable_module_name='layer')
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 360, in deserialize_keras_object
    return cls.from_config(cls_config)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 720, in from_config
    return cls(**config)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py", line 437, in __init__
    self.activation = activations.get(activation)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/activations.py", line 573, in get
    return deserialize(identifier)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/activations.py", line 536, in deserialize
    printable_module_name='activation function')
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 378, in deserialize_keras_object
    'Unknown ' + printable_module_name + ': ' + object_name)
ValueError: Unknown activation function: softMaxAxis2

Ultimately the error says that there is an unknown activation function: softMaxAxis2.
I am not sure what is causing this issue.

Training data set

Hi @hypnopump
I am learning a bit about deep learning since I want to try to apply it to a particular study that I am working on. I'm just getting started so I've had some trouble identifying some items. I would like to know if you have some fragment of the training data set to be able to execute all the code from the training to the prediction.
It would be very helpful to me.
PS: excellent work.
Mario S.

Error when running angle_data_preparation_py.ipynb

I am running the README.md steps on the Intel DevCloud. I generated full_under_200.txt both with the julia script get_proteins_under_200aa.jl and julia_get_proteins_under_200aa.ipynb for good measure. A diff says files are different but they look the same (tab separated values).
In the DevCloud environment, when I run angle_data_preparation_py.ipynb, I get an error when extracting data from text:

# Scan first n proteins
names = []
seqs = []
psis = []
phis = []
pssms = []
(...)

ValueError: could not convert string to float: '0.0\ (...)

Which can be suppressed by changing the function parse_lines(raw) to:

# Helper functions to extract numeric data from text
def parse_lines(raw):
    # added tab \t to suppress previous error
    return np.array([[float(x) for x in line.split("\t") if x != ""] for line in raw])
(...)

That gets past the first error, but then throws another one further down:

(...)
---> 10             outputs.append([phis[i][j], psis[i][j]])
     11             # break
     12         # print(i, "Added: ", len(seqs[i])-34,"total for now:  ", long)

IndexError: list index out of range

Which I suspect has something to do with one of the previous outputs, and the features' "n. prots" not being the same:

# Ensure all features have same n. prots

print("Names: ", len(names))
print("Seqs: ", len(seqs))
print("PSSMs: ", len(pssms))
print("Phis: ", len(phis))
print("Psis: ", len(psis))

Names:  601
Seqs:  600
PSSMs:  600
Phis:  0
Psis:  0

Any suggestions on what could be wrong in parsing the full_under_200.txt file?

ValueError: None values not supported.

When I run "predicting_angles.ipynb", I get an error:

`

ValueError Traceback (most recent call last)
in
1 # Resnet (pre-act structure) with 34*42 columns as inputs - leaving a subset for validation
2 his = model.fit(x_train, y_train, epochs=5, batch_size=16, verbose=1, shuffle=True,
----> 3 validation_data=(x_test, y_test))

~/.local/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
1211 else:
1212 fit_inputs = x + y + sample_weights
-> 1213 self._make_train_function()
1214 fit_function = self.train_function
1215

~/.local/lib/python3.6/site-packages/keras/engine/training.py in _make_train_function(self)
314 training_updates = self.optimizer.get_updates(
315 params=self._collected_trainable_weights,
--> 316 loss=self.total_loss)
317 updates = self.updates + training_updates
318

~/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your ' + object_name + ' call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper

~/.local/lib/python3.6/site-packages/keras/optimizers.py in get_updates(self, loss, params)
538 if self.amsgrad:
539 vhat_t = K.maximum(vhat, v_t)
--> 540 p_t = p - lr_t * m_t / (K.sqrt(vhat_t) + self.epsilon)
541 self.updates.append(K.update(vhat, vhat_t))
542 else:

~/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py in binary_op_wrapper(x, y)
886 try:
887 y = ops.convert_to_tensor_v2(
--> 888 y, dtype_hint=x.dtype.base_dtype, name="y")
889 except TypeError:
890 # If the RHS is not a tensor, it might be a tensor aware object

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
1143 name=name,
1144 preferred_dtype=dtype_hint,
-> 1145 as_ref=False)
1146
1147

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
1222
1223 if ret is None:
-> 1224 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1225
1226 if ret is NotImplemented:

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
303 as_ref=False):
304 _ = as_ref
--> 305 return constant(v, dtype=dtype, name=name)
306
307

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
244 """
245 return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246 allow_broadcast=True)
247
248

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
282 tensor_util.make_tensor_proto(
283 value, dtype=dtype, shape=shape, verify_shape=verify_shape,
--> 284 allow_broadcast=allow_broadcast))
285 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
286 const_tensor = g.create_op(

~/.local/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
452 else:
453 if values is None:
--> 454 raise ValueError("None values not supported.")
455 # if dtype is provided, forces numpy array to be the type
456 # provided if possible.

ValueError: None values not supported.
`
What should I do?

Problem: assign weights to classes - solve via Genetic Algorithm/PSO?

If you solve this you'll be mentioned as a special contributor (it shouldn't be very hard but right now I don't have the time to do this if I want to keep developing the model).

A big problem is defining the ideal weights for the different classes. Since it can be framed as an optimization problem (find the ideal weights so the predictions are the best possible given a model structure, the different classes and some data), I think we could use a Genetic Algorithm/PSO/some derivative-free optimization algorithm to get the ideal weights.
Basic formulation:
Given:

a model structure (depth, inputs, ...)
The number of classes and the limits between them
some data

Do:

Get a pretrained model and train it with the new class weights for a short time (1-2 epochs?)
Make predictions and evaluate them according to some metric (MSE wrt ground truth once padding is removed?)
If the metric is better than the previous model: replace old model by new model.
Else: Perform some mutation/perturbation on class weights
GO TO number 1 and repeat
STOP after k iterations without improvement / run the algorithm for k iterations (your choice)

Tips:

The distance prediction notebook eats a lot of RAM (~4gb). My advice is to save the variables such as best model, best performance, model configuration, classes... on text files and load them on every iteration since each time a model is compiled with different weights for classes it consumes new RAM.
My preferred implementation would be:

Create a genetic algorithm as a script on a separate file and execute the script containing the model and the evaluation from it as a terminal command.
Store the results/metrics of a given model and its configuration in a text file, and access them from the genetic algorithm script.

The genetic algorithm/PSO/other solution should be modular/adaptable to future changes in the model's part (that's why my preferred implementation would be as different scripts).
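
A minimal sketch of the loop described above is given below, written as a simple mutation-based hill climb rather than a full genetic algorithm or PSO. The `evaluate` callable is hypothetical: in practice it would briefly fine-tune the pretrained model with the candidate class weights and return the chosen metric (lower is better). Here it is replaced by a toy function so the example runs on its own.

```python
import random

def search_class_weights(evaluate, initial, iterations=50, patience=10,
                         sigma=0.1, seed=0):
    """Derivative-free search over class weights.

    evaluate: callable taking a list of weights and returning a score
    to minimize. Stops after `iterations` rounds or after `patience`
    consecutive rounds without improvement.
    """
    rng = random.Random(seed)
    best, best_score = list(initial), evaluate(initial)
    stale = 0
    for _ in range(iterations):
        # Mutate the current best weights with small Gaussian noise,
        # keeping every weight strictly positive.
        candidate = [max(1e-3, w + rng.gauss(0.0, sigma)) for w in best]
        score = evaluate(candidate)
        if score < best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score

# Toy stand-in for "train briefly and measure the error".
ideal = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.1]

def toy_metric(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, ideal))

start = [1.0] * 7
best, best_score = search_class_weights(toy_metric, start)
print(best_score <= toy_metric(start))  # True
```

Because `evaluate` is passed in as a plain callable, it could equally wrap a terminal command that runs the model script and reads the resulting metric back from a text file, as suggested above.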

Installation problem

Hello everyone,

I am trying to install minifold on my CentOS 7 system. When I run "pip install -r requirements.txt", I get the following error:

ERROR: No matching distribution found for tensorflow-gpu==1.12.0 (from -r requirements.txt (line 86))

After I remove the version number of tensorflow-gpu in the requirements.txt file, tensorflow-gpu 2.0.0 is installed. However, I get the following errors:

Building wheel for pywinpty (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/sunyp/software/build/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-22q2dxeq/pywinpty/setup.py'"'"'; file='"'"'/tmp/pip-install-22q2dxeq/pywinpty/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-h5jnr0xb --python-tag cp37
cwd: /tmp/pip-install-22q2dxeq/pywinpty/
Complete output (27 lines):
winpty/cywinpty.pyx: cannot find cimported module 'winpty._winpty'
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/winpty
copying winpty/ptyprocess.py -> build/lib.linux-x86_64-3.7/winpty
copying winpty/winpty_wrapper.py -> build/lib.linux-x86_64-3.7/winpty
copying winpty/__init__.py -> build/lib.linux-x86_64-3.7/winpty
creating build/lib.linux-x86_64-3.7/winpty/tests
copying winpty/tests/test_cywinpty.py -> build/lib.linux-x86_64-3.7/winpty/tests
copying winpty/tests/test_ptyprocess.py -> build/lib.linux-x86_64-3.7/winpty/tests
copying winpty/tests/test_winpty_wrapper.py -> build/lib.linux-x86_64-3.7/winpty/tests
copying winpty/tests/__init__.py -> build/lib.linux-x86_64-3.7/winpty/tests
creating build/lib.linux-x86_64-3.7/winpty/_winpty
copying winpty/_winpty/__init__.py -> build/lib.linux-x86_64-3.7/winpty/_winpty
running build_ext
building 'winpty.cywinpty' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/winpty
gcc -pthread -B /home/sunyp/software/build/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunyp/software/build/anaconda3/include/python3.7m -c winpty/cywinpty.c -o build/temp.linux-x86_64-3.7/winpty/cywinpty.o
winpty/cywinpty.c:597:21: fatal error: Windows.h: No such file or directory
#include "Windows.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1

ERROR: Failed building wheel for pywinpty

And:

 ERROR: Cannot uninstall 'certifi'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Could you tell me how to solve the problem?

Thank you!

How to make pssm file?

I want to make a PSSM file. Is this the file used to predict the 3D structure?
Please give me solutions.
Do you have a "make pssm file function" pipeline?

Can't run distance_pipeline pretrain_model_pssm_l_x_l

In order to run the Jupyter notebook, I converted the notebook into a python file by doing:
jupyter nbconvert --to python pretrain_model_pssm_l_x_l.ipynb

I do not believe this made a difference as I was able to run every other file before it like this and it worked just fine. However, once running this file, I'm getting this error just before the program starts the first epoch.

The program points to an error at this point:
his = model.fit(inputs, outputs, epochs=35, batch_size=2, verbose=1, shuffle=True, validation_split=0.1)

This is the full error stack:

Traceback (most recent call last):
  File "pretrain_model_pssm_l_x_l.py", line 245, in <module>
    his = model.fit(inputs, outputs, epochs=35, batch_size=2, verbose=1, shuffle=True, validation_split=0.1)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /Users/sid/MiniFold/models/distance_pipeline/elu_resnet_2d_distances.py:41 loss  *
        loss = y_true * K.log(y_pred) * weights
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:1180 binary_op_wrapper
        raise e
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:1164 binary_op_wrapper
        return func(x, y, name=name)
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:1496 _mul_dispatch
        return multiply(x, y, name=name)
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:518 multiply
        return gen_math_ops.mul(x, y, name)
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py:6078 mul
        "Mul", x=x, y=y, name=name)
    /Users/sid/opt/anaconda3/envs/mini_fold/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:558 _apply_op_helper
        inferred_from[input_arg.type_attr]))

This is the last part of the error, and I'm not entirely sure what this means:
TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int64 of argument 'x'.

Thanks in advance!
Edit 1:
I looked into this further and found that the error is raised in the call gen_math_ops.mul(x, y, name).
However, I'm not sure where that call is located.

Regarding input and distance matrix padding.

Hi!
Wonderful work here, and wonderful code as well.
I have a few questions regarding your model and some of your input preparation steps.
1- Why did you implement padding as a new class rather than as a mask, multiplying every add layer by a binary mask in order to avoid backpropagation through the padded regions?
2- Why did you create a different embedding for the distances instead of using only the threshold function?
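
For readers following question 1, here is a minimal sketch of what masking out padded positions can look like at the loss level. It is not the repository's code: `padding_mask` and `masked_mean_absolute_error` are hypothetical helper names, and the example is 1-D and plain Python purely for illustration.

```python
def padding_mask(length, padded_length):
    """Return 1.0 for real positions and 0.0 for padded positions."""
    return [1.0 if i < length else 0.0 for i in range(padded_length)]


def masked_mean_absolute_error(y_true, y_pred, mask):
    """Mean absolute error over the positions where mask == 1.0 only."""
    total = sum(m * abs(t - p) for t, p, m in zip(y_true, y_pred, mask))
    n_real = sum(mask)
    # Guard against a sequence that is entirely padding.
    return total / n_real if n_real else 0.0


mask = padding_mask(length=2, padded_length=4)
error = masked_mean_absolute_error([1.0, 2.0, 0.0, 0.0], [1.5, 2.5, 9.0, 9.0], mask)
```

Because the padded positions are multiplied by zero, their (possibly large) errors contribute nothing to the loss or to its gradient.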

Where is the file "90_full_under_200.txt"?

I cannot find this file. Please help me! Thank you so much!

Engineer PSSM-based features

Help wanted to engineer PSSM-based features that are LxLxN instead of LxN. It's the most straightforward alternative I see right now to MSA-based features (MSAs are huge in memory so we can't use them right now).
Examples:

pssm[i,j] = pssm[i,x]*pssm[j,y] where x,y are the different amino acids found in positions i,j
distance to diagonal in a LxL map
...
more ideas are welcome (features should be relevant since we don't have infinite memory to store them)
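
A minimal sketch of the first example, under one reading of the formula: x is the residue at position i and y is the residue at position j. The column order `AA_ORDER` and the function name `pssm_pair_feature` are assumptions made for illustration, and plain nested lists stand in for whatever array type the pipeline uses.

```python
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the PSSM


def pssm_pair_feature(sequence, pssm):
    """Return an L x L map with entry [i][j] = pssm[i][x] * pssm[j][y],
    where x and y are the residues found at positions i and j."""
    column = {aa: k for k, aa in enumerate(AA_ORDER)}
    # PSSM score of each position for its own residue (length L).
    own = [pssm[i][column[aa]] for i, aa in enumerate(sequence)]
    # Outer product of that vector with itself gives the L x L channel.
    return [[a * b for b in own] for a in own]


# Toy 2-residue example: "AC" with a 2 x 20 PSSM.
toy_pssm = [[0.5] + [0.0] * 19, [0.0, 0.25] + [0.0] * 18]
feature = pssm_pair_feature("AC", toy_pssm)
```

This yields a single LxL channel; stacking several such channels would give the LxLxN shape, at L*L floats per channel.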

Add more training features

It would be great to add more features that distribute themselves across the whole LxL surface and not just the diagonal.
Examples:
pos[i,j] = pssm[i,x]*pssm[j,y]
distance_to_diagonal (in range 0-1)
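
The distance_to_diagonal feature could be sketched as follows, assuming that "in range 0-1" means |i - j| divided by the largest possible offset, L - 1. The function name is taken from the example above; the normalisation is an assumption.

```python
def distance_to_diagonal(length):
    """Return an L x L map of |i - j| scaled to the range 0-1."""
    if length < 2:
        # A 0 x 0 or 1 x 1 map has no off-diagonal cells to scale.
        return [[0.0] * length for _ in range(length)]
    scale = float(length - 1)
    return [[abs(i - j) / scale for j in range(length)] for i in range(length)]


feature = distance_to_diagonal(3)
# feature == [[0.0, 0.5, 1.0], [0.5, 0.0, 0.5], [1.0, 0.5, 0.0]]
```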

Ideate MSA features derived from the PSSM (e.g. paired substitutions of AAs) and see whether they are feasible even with low accuracy/precision.

Julia preprocessing error

When using Julia to preprocess the protein data, I get the error below:

ERROR: LoadError: syntax: line break in ":" expression
Stacktrace:
[1] include at ./boot.jl:317 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1044
[3] include(::Module, ::String) at ./sysimg.jl:29
[4] exec_options(::Base.JLOptions) at ./client.jl:266
[5] _start() at ./client.jl:425
in expression starting at /disk2/minifold/MiniFold/preprocessing/get_proteins_under_200aa.jl:20

I'm a total newbie with Julia. Can anyone help me with this?

I use Julia version 1.0.4, and I have made sure that source_path and dst_path are right.

The pretrained model doesn't load

Greetings.
I get the following error when I try to load the "resnet_1d_angles.h5" file that is provided in the repository:

model = load_model("resnet_1d_angles.h5", custom_objects={'custom_mse_mae': custom_mse_mae})

AttributeError Traceback (most recent call last)
in
----> 1 model = load_model(
2 "./MiniFold/models/angles/resnet_1d_angles.h5",
3 custom_objects={'custom_mse_mae': angles.custom_mse_mae}
4 )

~/anaconda3/envs/lapki/lib/python3.8/site-packages/keras/engine/saving.py in load_model(filepath, custom_objects, compile)
417 f = h5dict(filepath, 'r')
418 try:
--> 419 model = _deserialize_model(f, custom_objects, compile)
420 finally:
421 if opened_new_file:

~/anaconda3/envs/lapki/lib/python3.8/site-packages/keras/engine/saving.py in _deserialize_model(f, custom_objects, compile)
222 if model_config is None:
223 raise ValueError('No model found in config.')
--> 224 model_config = json.loads(model_config.decode('utf-8'))
225 model = model_from_config(model_config, custom_objects=custom_objects)
226 model_weights_group = f['model_weights']

AttributeError: 'str' object has no attribute 'decode'

Could you please look into that?
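
A note for anyone hitting the same traceback: this `'str' object has no attribute 'decode'` error in `_deserialize_model` is commonly reported when h5py 3.x is installed next to an older Keras. h5py 3 returns string attributes as `str`, while the older loader still expects `bytes`. A frequently suggested workaround is to install `h5py<3.0` in the environment; whether that applies here is an assumption, since the installed versions are not shown. The sketch below only illustrates the failing pattern and a tolerant version of it; `load_model_config` is a hypothetical helper, not a Keras function.

```python
import json


def load_model_config(raw_config):
    """Parse a model_config attribute that may arrive as bytes or str."""
    if isinstance(raw_config, bytes):
        # Older h5py versions return bytes, which must be decoded first.
        raw_config = raw_config.decode("utf-8")
    # h5py 3.x already returns str, so there is nothing to decode.
    return json.loads(raw_config)


config_from_bytes = load_model_config(b'{"class_name": "Model"}')
config_from_str = load_model_config('{"class_name": "Model"}')
```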

Train w/ 64x64 crops and average at prediction time

Training with 64x64 crops would remove the need for a 200 embedding of all proteins and would also reduce noisy predictions, since overlapping crops would be averaged at prediction time.
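
A rough sketch of the prediction-time averaging, assuming square crops taken on a regular grid with a stride no larger than the crop size. `predict_crop` stands in for the trained model and, like the other names here, is hypothetical. Plain nested lists are used to keep the example dependency-free.

```python
def crop_offsets(length, crop, stride):
    """Start indices such that crops of size `crop` cover 0..length-1."""
    if length <= crop:
        return [0]
    offsets = list(range(0, length - crop + 1, stride))
    if offsets[-1] != length - crop:
        offsets.append(length - crop)  # make sure the last rows/columns are covered
    return offsets


def predict_by_averaged_crops(features, predict_crop, crop=64, stride=32):
    """Average overlapping crop predictions into a single L x L map."""
    if stride < 1 or stride > crop:
        raise ValueError("stride must be between 1 and the crop size")
    length = len(features)
    size = min(crop, length)  # proteins shorter than the crop use one window
    total = [[0.0] * length for _ in range(length)]
    count = [[0] * length for _ in range(length)]
    offsets = crop_offsets(length, size, stride)
    for r in offsets:
        for c in offsets:
            window = [row[c:c + size] for row in features[r:r + size]]
            pred = predict_crop(window)
            for i in range(size):
                for j in range(size):
                    total[r + i][c + j] += pred[i][j]
                    count[r + i][c + j] += 1
    # Every cell is covered at least once, so count is never zero here.
    return [[total[i][j] / count[i][j] for j in range(length)] for i in range(length)]
```

With a stride smaller than the crop size, most cells are predicted several times, and the averaging is what would smooth out the per-crop noise.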

How to get GDT score?

How can I get the GDT score for this project?
Can you suggest a solution?
Does this project have a pipeline with a function that computes the GDT score?
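
For context, GDT_TS is usually described as the average, over the cutoffs 1, 2, 4 and 8 Å, of the percentage of C-alpha atoms lying within the cutoff of their position in the reference structure. The sketch below is a simplified version that assumes the two structures are already superimposed and have the same residues in the same order. A full GDT implementation also searches over superpositions, which is omitted here. The function name `gdt_ts` is hypothetical.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for pre-superimposed C-alpha coordinates.

    predicted, reference: equal-length lists of (x, y, z) tuples in angstroms.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("structures must be non-empty and of equal length")
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in zip(predicted, reference)
    ]
    # Fraction of residues within each cutoff, then the mean over cutoffs.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(fractions)


reference = [(0.0, 0.0, 0.0)] * 4
predicted = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
score = gdt_ts(predicted, reference)  # 56.25
```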

What's the meaning of parameter "17"?

The picture is from the file "angle_data_preparation_py.ipynb".

Distance prediction threshold

Hello,

Nice work !

In the notebook predicting_distances, why did you choose to predict classes of distances instead of the distance values directly?
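
To make the distinction in the question concrete, here is a small sketch of turning a continuous distance into a class label by thresholding. The bin edges are hypothetical and are not taken from the notebook.

```python
from bisect import bisect_right

BIN_EDGES = (4.0, 8.0, 12.0)  # hypothetical thresholds in angstroms


def distance_to_class(distance, edges=BIN_EDGES):
    """Return the index of the bin that `distance` falls into.

    Class 0 is below the first edge; class len(edges) is at or above the last.
    """
    return bisect_right(edges, distance)


def one_hot(index, n_classes):
    """Encode a class index as a one-hot vector for a classification target."""
    return [1.0 if k == index else 0.0 for k in range(n_classes)]


label = distance_to_class(10.0)               # 2
target = one_hot(label, len(BIN_EDGES) + 1)   # [0.0, 0.0, 1.0, 0.0]
```

With class targets the network outputs a probability per bin rather than a single number, whereas direct regression would predict the distance value itself.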

Error while executing pretrain_model_pssn_l_x_l.ipynb

I'm a newbie in deep learning and I'm just trying to replicate the results you got from this code using Google Colab. I tried to solve some of the issues myself, but I got stuck at this point.

First of all, I'd like to know whether the library versions in requirements.txt are up to date, because some of the problems I had were caused by outdated libraries. (Installing notebook==6.1.5 requires updating jupyter-client==5.3.4, jupyter-core==4.6.1 and terminado==0.8.3.) I also had to remove pywinpty because the Google Colab environment is based on Linux (I believe this library is meant to work in a Windows environment through Jupyter or something like that; correct me if I'm wrong).

Now, the error that I get when I run the code occurs at the line below:
In [13]: dists = np.array([embedding_matrix(matrix) for matrix in dists])

If I run the entire notebook without installing the library versions listed in requirements.txt (which means working with tensorflow-gpu==2.4.1 instead of 1.12.0, with all the other libraries newer than the ones in the txt file), I get a little further and the error only appears a couple of lines later:

In [22]: his = model.fit(inputs, outputs, epochs=35, batch_size=2, verbose=1, shuffle=True, validation_split=0.1)
print(his.history)

I can still run the rest of the notebook, but the model won't be trained and I get meaningless results.

Is there a way to fix this and run all the notebooks smoothly in Google Colab, or would it be better to run them locally in Jupyter Notebook? I hope someone can help me with this issue.

Use Physicochemical features of AAs as input.

It would be useful to use physicochemical properties of AAs beyond the Van der Waals radius as input, such as:

surface exposure
predicted solvent accessibility
polarity
isoelectric point
pairwise potential
...?
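
As a rough illustration of how such per-residue properties could be turned into input channels, the sketch below maps each amino acid to two values: its Kyte-Doolittle hydropathy index and its formal side-chain charge at neutral pH. These two properties are chosen only because their values are standard. They are stand-ins for the properties listed above, and `physicochemical_features` is a hypothetical name.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Formal side-chain charge at neutral pH; residues not listed are treated as 0.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def physicochemical_features(sequence):
    """Return an L x 2 list: [hydropathy, charge] for each residue."""
    return [[KD_HYDROPATHY[aa], CHARGE.get(aa, 0.0)] for aa in sequence]


features = physicochemical_features("AKD")
# features == [[1.8, 0.0], [-3.9, 1.0], [-3.5, -1.0]]
```

Each extra property adds one column per residue, so the result can be concatenated with the existing LxN inputs.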