patrickbryant1 / umol
Protein-ligand structure prediction
Hi Patrick, thank you for sharing the Colab version of Umol. It works well and is friendly to Python beginners. I'm just wondering if Umol can output more predicted poses (e.g., the top 10), which would make the next steps of MD filtering and SAR study more thorough.
Hi,
I'm trying to run Umol but am currently stuck at the "Predict" step and don't know how to get the "target_pos $POCKET_INDICES" data. Can you help me?
Thanks for your program and I hope to receive your response.
Best wishes,
Livia.
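If you already have a reference structure with the ligand bound, one common way to build the pocket indices is to take every residue whose CA atom lies within some cutoff of the ligand. This is a minimal stdlib sketch, not part of Umol itself (the helper names are mine); it assumes a single-chain PDB with the ligand stored as HETATM records:

```python
import math

def parse_coords(pdb_text):
    """Split a PDB string into protein CA coordinates (keyed by residue
    number) and ligand heavy-atom coordinates (HETATM records)."""
    ca, ligand = {}, []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            ca[resnum] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        elif line.startswith("HETATM") and line[76:78].strip() != "H":
            ligand.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return ca, ligand

def pocket_residues(pdb_text, cutoff=10.0):
    """Residue numbers whose CA lies within `cutoff` Angstrom of any
    ligand atom, formatted like the notebook's TARGET_POSITIONS field."""
    ca, ligand = parse_coords(pdb_text)
    hits = [r for r, c in sorted(ca.items())
            if any(math.dist(c, l) <= cutoff for l in ligand)]
    return ",".join(str(r) for r in hits)
```

The resulting comma-separated string can be pasted into the TARGET_POSITIONS field; the cutoff (10 Å here) is a judgment call, not a value prescribed by Umol.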
Hi there,
Thanks to you guys for developing such a powerful tool.
I have some models predicted by ColabFold, and to ensure data consistency I decided to use my own model with Umol.
But when I tried to upload my own ColabFold-predicted model, I received the error below at this step. By the way, I could see the "_pred_raw.pdb" file was generated, but "generate_best_conformer" seems unable to handle it.
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in <cell line: 9>()
    107 #Get a conformer
    108 pred_ligand = read_pdb(RAW_PDB)
--> 109 best_conf, best_conf_pos, best_conf_err, atoms, nonH_inds, mol, best_conf_id = generate_best_conformer(pred_ligand['chain_coords'], LIGAND)
    110
    111 #Align it to the prediction

/content/Umol/src/relax/align_ligand_conformer_colab.py in generate_best_conformer(pred_coords, ligand_smiles)
    102 nonH_pos = pos[nonH_inds]
    103 conf_dmat = np.sqrt(1e-10 + np.sum((nonH_pos[:,None]-nonH_pos[None,:])**2,axis=-1))
--> 104 err = np.mean(np.sqrt(1e-10 + (conf_dmat-pred_dmat)**2))
    105 conf_errs.append(err)
    106

ValueError: operands could not be broadcast together with shapes (19,19) (2861,2861)
```
Any advice would be helpful,
Best regards!
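A note on reading that traceback: the (19,19) matrix matches the SMILES ligand's heavy atoms, while (2861,2861) is suspiciously protein-sized, suggesting the coordinates read from "_pred_raw.pdb" are not the ligand chain's. A quick stdlib diagnostic (my own helper, not part of Umol) is to count atoms per chain in the raw PDB and check which chain actually has ~19 entries:

```python
from collections import Counter

def atoms_per_chain(pdb_text):
    """Count ATOM/HETATM records per chain ID in a PDB string.
    The ligand chain should have roughly as many entries as the
    SMILES has heavy atoms; a count in the thousands means protein
    coordinates are being picked up instead."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            counts[line[21]] += 1  # chain ID lives in PDB column 22
    return dict(counts)
```

If no chain has a ligand-sized count, the uploaded model likely contains no ligand records at all, which would explain the broadcast failure.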
Hi, as a test case I was using a protein of 539 residues in length and an MSA of 25 protein sequences,
but I am getting a memory-exhaustion error. Is Colab limited by protein length or MSA size?
I uploaded the .a3m file from HHblits as outlined in the first cell, but ran into this error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-55efa3eb56f8> in <cell line: 10>()
      8
      9 #Predict
---> 10 predict(config.CONFIG,
     11     MSA_FEATS,
     12     LIGAND_FEATS,

7 frames

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def, extract_traceback)
   1965   except errors.InvalidArgumentError as e:
   1966     # Convert to ValueError for backwards compatibility.
-> 1967     raise ValueError(e.message)
   1968
   1969   # Record the current Python stack trace as the creating stacktrace of this

ValueError: Cannot reshape a tensor with 345790 elements to shape [2290,344,1] (787760 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [2290,151], [3] and with input tensors computed as partial shapes: input[1] = [2290,344,1].
```
The example in the notebook works fine, so it may be my formatting; just thought I'd raise it so you're aware. I'll try a local install and see if I can get past this.
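The shapes in that error are informative: 345790 = 2290 × 151, so the MSA has 2290 rows of width 151, while the model expects rows of the query length (344). In a3m format, lowercase letters are insertions and should be dropped before comparing widths; a quick stdlib check (helper name is mine, not part of Umol) is:

```python
def a3m_aligned_lengths(a3m_text):
    """Return the set of aligned-row lengths in an a3m string, after
    removing lowercase insertion characters. For a valid a3m, every
    row has the same length as the query, so the set has one element."""
    lengths, seq = set(), []
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            if seq:
                s = "".join(seq)
                lengths.add(sum(1 for c in s if not c.islower()))
            seq = []
        elif line.strip():
            seq.append(line.strip())
    if seq:
        s = "".join(seq)
        lengths.add(sum(1 for c in s if not c.islower()))
    return lengths
```

If this returns more than one length, or a length different from your query sequence, the a3m was built against a different (or truncated) query than the SEQUENCE pasted into the notebook.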
Hello Patrick,
Thank you for sharing your amazing work!
Forgive me if I missed something, but I could not find the details of your training regimen in your article. I am particularly interested in how training was performed on large proteins and ligands (>500 tokens). On page 10 you mention that "15 complexes are out of memory" and that you "crop these to 500 residues"; did you do the same for training? Did you randomly crop proteins as in AF2, and if so, what sequence size did you choose?
Thank you for your help in advance.
Hi Patrick,
I was wondering what, in your experience, would be a good selection of interaction sites. I have noticed the selection has an impact on the outcome, for instance between 10 and 7 Å. Another question: if I test different sites on the same protein, could the pLDDT give a kind of score for the probable binding site?
Thanks a lot,
Cesar
Hello,
I am trying to run on the Colab
With the Input: ID:1ct9_happy
LIGAND: OC(=O)CC(C(=O)O)N
SEQUENCE:DDLQGMFAFALYDSEKDAYLIGRDHLGIIPLYMGYDEHGQLYVASEMKALVPVCRTIKEFPAGSYLWSQDGEIRSYYHRDWFDYDAVKDNVTDKNELRQALEDSVKSHLMSDVPYGVLLSGGLDSSIISAITKKYAARRVEDQERSEAWWPQLHSFAVGLPGSPDLKAAQEVANHLGTVHHEIHFTVQEGLDAIRDVIYHIETYDVTTIRASTPMYLMSRKIKAMGIKMVLSGEGSDEVFGGYLYFHKAPNAKELHEETVRKLLALHMYDCARANKAMSAWGVEARVPFLDKKFLDVAMRINPQDKMCGNGKMEKHILRECFEAYLPASVAWRQKEQFSDGVGYSWIDTLKEVAAQQVSDQQLETARFRFPYNTPTSKEAYLYREIFEELFPLPSAAECVPGGPSVACSSAKAIEWDEAFKKMDDPSGRAVGVHQSAYK
TARGET_POSITIONS:117,120,144,178
NUM_RECYCLES:3
When I try to run my example I get the following error in the "Predict the protein-ligand structure" section:
XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 13608450984 bytes.
BufferAssignment OOM Debugging. (I am running on GPU)
Can you help me resolve this problem please?
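Two standard JAX/XLA environment settings are commonly tried first for these OOM errors. They reduce preallocation and fragmentation rather than total memory, so they may not be enough for a 13 GB allocation; cropping the sequence or moving to a larger GPU may be the only real fix. Set them before launching Python:

```shell
# Ask JAX not to preallocate most of GPU memory up front,
# and cap the fraction it is allowed to grow to.
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.90
```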
Is there a Docker image to use it? The script installation fails for me.
Can the ligand be input via SDF files instead of the LIGAND_SMILES format?
In other words, can the structure-data file (SDF) be converted into the corresponding LIGAND_SMILES?
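In principle yes: RDKit can read an SDF and emit a SMILES string, which Umol then consumes as the LIGAND field. A minimal sketch (assumes RDKit is installed; the function name is mine):

```python
from rdkit import Chem

def sdf_to_smiles(sdf_path):
    """Read the first molecule from an SDF file and return its
    canonical SMILES, suitable for the LIGAND field."""
    supplier = Chem.SDMolSupplier(sdf_path)
    mol = next(m for m in supplier if m is not None)
    return Chem.MolToSmiles(mol)
```

Note that SMILES discards the 3D coordinates stored in the SDF; Umol generates its own conformers from the SMILES.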
I think you should change the code to prevent the warning.
```
/home/tools/umol_package/Umol/src/net/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/home/tools/umol_package/Umol/src/net/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/home/tools/umol_package/Umol/src/net/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
```
Looking at the Colab notebook, there seems to be a sequence length limit of 400. However, this is not apparent anywhere else in the code or manuscript.
Where does this limit come from? Is it only due to computational constraints (i.e. VRAM)? Can this limit be surpassed with larger GPUs?
Hi all,
I got an error when I submitted a new protein.
Could you help with this please? I have updated the MSA, sequence and positions:
LIGAND = "N#CCC(=O)N(CC1)CC@@HN(C)c2ncnc(c23)[nH]cc3" # @param {type:"string"}
SEQUENCE = "RKSPLTLEDFKFLAVLGRGHFGKVLLSEFRPSGELFAIKALKKGDIVARDEVESLMCEKRILAAVTSAGHPFLVNLFGCFQTPEHVCFVMEYSAGGDLMLHIHSDVFSEPRAIFYSACVVLGLQFLHEHKIVYRDLKLDNLLLDTEGYVKIADFGLCKEGMGYGDRTSTFCGTPEFLAPEVLTDTSYTRAVDWWGLGVLLYEMLVGESPFPGDDEEEVFDSIVNDEVRYPRFLSAEAIGIMRRLLRRNPERRLGSSERDAEDVKKQPFFRTLGWEALLARRLPPPFVPTLSGRTDVSNFDEEFTGEAPTLSPPRDARPLTAAEQAAFLDFDFVAGGC" #@param {type:"string"}
TARGET_POSITIONS = "17,28,19,20,23,24,25,91,92,93,94" #@param {type:"string"}
error:
```
File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/make_msa_seq_feats_colab.py:98, in process(input_fasta_path, input_msas)
     96     parsed_msa, parsed_deletion_matrix, _ = parsers.parse_stockholm(msa)
     97 elif custom_msa[-3:] == 'a3m':
---> 98     parsed_msa, parsed_deletion_matrix = parsers.parse_a3m(msa)
     99 else: raise TypeError('Unknown format for input MSA, please make sure '
    100     'the MSA files you provide terminates with (and '
    101     'are formatted as) .sto or .a3m')
    102 parsed_msas.append(parsed_msa)

File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/net/data/parsers.py:142, in parse_a3m(a3m_string)
    127 def parse_a3m(a3m_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
    128     """Parses sequences and deletion matrix from a3m format alignment.
    129
    130     Args:
    (...)
    140         the aligned sequence i at residue position j.
    141     """
--> 142     sequences, _ = parse_fasta(a3m_string)
    143     deletion_matrix = []
    144     for msa_sequence in sequences:

File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/net/data/parsers.py:62, in parse_fasta(fasta_string)
     60 elif not line:
     61     continue  # Skip blank lines.
---> 62 sequences[index] += line

     64 return sequences, descriptions

IndexError: list index out of range
```
Best wishes,
Cesar
Hi,
I am trying to create the environment with the environment.yml file but am getting the following error:
Could not solve for environment specs
The following packages are incompatible
├─ ambertools ==23.3 py312h1577c9a_6 is requested and can be installed;
└─ openmmforcefields ==0.11.2 pyhd8ed1ab_1 is not installable because it requires
└─ ambertools >=20.0,<23 , which conflicts with any installable versions previously reported.
Is there a work around for this?
Thank you for the help.
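The solver output says openmmforcefields 0.11.2 requires ambertools <23, while the file pins ambertools 23.3. One workaround (my own suggestion, not an official fix) is to relax the ambertools pin in environment.yml so the solver can pick a version both packages accept:

```yaml
dependencies:
  # let the solver reconcile the two packages instead of pinning
  # ambertools to a version openmmforcefields rejects
  - ambertools<23
  - openmmforcefields=0.11.2
```

Alternatively, a newer openmmforcefields release may have lifted the <23 ceiling; checking its changelog before downgrading ambertools is worthwhile.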
Hi Mr Bryant,
When I tried to predict some huge proteins (about 1,179 amino acids),
it returned an error like the one below:
XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 45026377728 bytes.
I've run the code on a Tesla A100 GPU and tried setting os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false',
but it still returns the same error.
May I ask if you have any good advice on this? For example, predicting in parts, though I'm not sure whether that would affect the prediction accuracy.
Any response would be helpful.
Gratefully!
My machine has a powerful GPU, but the CPU is not very good. When running localcolabfold, I enabled GPU acceleration and the performance was great. So I want to ask whether this script can also be accelerated by the GPU.
Thanks for sharing the Umol code, it is a highly helpful resource. I want to try it on a protein with multiple (4) chains. Each chain contains around 900 residues, and the ligand interacts with residues from all 4 chains. I want to use only a subset of residues near the binding site as the input sequence. Is it possible to run Umol for such a case? If yes, how can I do it? E.g., what should the sequence format be for the MSA and for Umol?
Hi,
I was wondering if there is any way to generate a number of good, ranked models. As of now I only get one model, but I can see the model sometimes gives a bad pose. Just wondering if it is possible to generate more than one pose with different rankings, or, alternatively, to have control over the parameters to get a different result. Well, I don't know the code well, so better to ask. Great tool by the way; in other cases this approach works really well.
Best wishes,
Cesar
...
Collecting tb-nightly==2.16.0a20231211 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 57))
Downloading tb_nightly-2.16.0a20231211-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorboard-data-server==0.7.2 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 58))
Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl.metadata (1.1 kB)
Collecting tensorstore==0.1.51 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 59))
Downloading tensorstore-0.1.51-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting termcolor==2.4.0 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 60))
Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Collecting tf-estimator-nightly==2.14.0.dev2023080308 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 61))
Downloading tf_estimator_nightly-2.14.0.dev2023080308-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tf-keras-nightly==2.16.0.dev2023121110 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 62))
Downloading tf_keras_nightly-2.16.0.dev2023121110-py3-none-any.whl.metadata (1.5 kB)
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement tf-nightly==2.16.0.dev20231211 (from versions: 2.16.0.dev20231225, 2.16.0.dev20231226, 2.16.0.dev20231227, 2.16.0.dev20231228, 2.16.0.dev20231229, 2.16.0.dev20231230, 2.16.0.dev20231231, 2.16.0.dev20240101, 2.16.0.dev20240102, 2.16.0.dev20240103, 2.16.0.dev20240104, 2.16.0.dev20240105, 2.16.0.dev20240106, 2.16.0.dev20240107, 2.16.0.dev20240108, 2.16.0.dev20240110, 2.16.0.dev20240119, 2.16.0.dev20240124, 2.16.0.dev20240125, 2.16.0.dev20240126, 2.16.0.dev20240127, 2.16.0.dev20240128, 2.16.0.dev20240129, 2.16.0.dev20240130, 2.16.0.dev20240201, 2.16.0.dev20240202, 2.16.0.dev20240203, 2.16.0.dev20240204, 2.16.0.dev20240205, 2.16.0.dev20240206, 2.16.0.dev20240207, 2.16.0.dev20240209, 2.16.0, 2.17.0.dev20240213, 2.17.0.dev20240214, 2.17.0.dev20240215, 2.17.0.dev20240216, 2.17.0.dev20240217, 2.17.0.dev20240218, 2.17.0.dev20240219, 2.17.0.dev20240220, 2.17.0.dev20240221, 2.17.0.dev20240222, 2.17.0.dev20240223, 2.17.0.dev20240225, 2.17.0.dev20240226, 2.17.0.dev20240227, 2.17.0.dev20240228, 2.17.0.dev20240229, 2.17.0.dev20240301, 2.17.0.dev20240302, 2.17.0.dev20240303, 2.17.0.dev20240304, 2.17.0.dev20240305, 2.17.0.dev20240306, 2.17.0.dev20240308, 2.17.0.dev20240309, 2.17.0.dev20240310, 2.17.0.dev20240312, 2.17.0.dev20240313, 2.17.0.dev20240314, 2.17.0.dev20240315, 2.17.0.dev20240316, 2.17.0.dev20240317, 2.17.0.dev20240318, 2.17.0.dev20240319, 2.17.0.dev20240320, 2.17.0.dev20240322, 2.17.0.dev20240323, 2.17.0.dev20240324)
ERROR: No matching distribution found for tf-nightly==2.16.0.dev20231211
failed
CondaEnvException: Pip failed
I'll try with tf-nightly==2.16.0
and less strict version requirements for tf-related pip packages later.