patrickbryant1 / umol
Protein-ligand structure prediction
Hi Patrick, thank you for sharing the Colab version of Umol. It works well and is friendly to Python beginners. I'm just wondering if Umol can output more predicted poses (e.g., the top 10), which would make the next steps of MD filtering and SAR study more thorough.
Hi,
I'm trying to run Umol but am currently stuck at the "Predict" step and don't know how to get the "target_pos $POCKET_INDICES" data. Can you help me?
Thanks for your program and I hope to receive your response.
Best wishes,
Livia.
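If you already have a reference structure with the ligand bound, one common way to build the pocket indices is to take every residue whose CA atom lies within some cutoff of the ligand. This is a minimal stdlib sketch, not part of Umol itself (the helper names are mine); it assumes a single-chain PDB with the ligand stored as HETATM records:

```python
import math

def parse_coords(pdb_text):
    """Split a PDB string into protein CA coordinates (keyed by residue
    number) and ligand heavy-atom coordinates (HETATM records)."""
    ca, ligand = {}, []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            ca[resnum] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        elif line.startswith("HETATM") and line[76:78].strip() != "H":
            ligand.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return ca, ligand

def pocket_residues(pdb_text, cutoff=10.0):
    """Residue numbers whose CA lies within `cutoff` Angstrom of any
    ligand atom, formatted like the notebook's TARGET_POSITIONS field."""
    ca, ligand = parse_coords(pdb_text)
    hits = [r for r, c in sorted(ca.items())
            if any(math.dist(c, l) <= cutoff for l in ligand)]
    return ",".join(str(r) for r in hits)
```

The resulting comma-separated string can be pasted into the TARGET_POSITIONS field; the cutoff (10 Å here) is a judgment call, not a value prescribed by Umol.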
Hi there,
Thanks to you guys for developing such a powerful tool.
I have some models predicted by ColabFold, and to ensure data consistency I decided to use my own model with Umol.
But when I tried to upload my own ColabFold-predicted model, I received the error below at this step. By the way, I could see the "_pred_raw.pdb" file was generated, but "generate_best_conformer" seems unable to handle it.
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in <cell line: 9>()
    107 #Get a conformer
    108 pred_ligand = read_pdb(RAW_PDB)
--> 109 best_conf, best_conf_pos, best_conf_err, atoms, nonH_inds, mol, best_conf_id = generate_best_conformer(pred_ligand['chain_coords'], LIGAND)
    110
    111 #Align it to the prediction

/content/Umol/src/relax/align_ligand_conformer_colab.py in generate_best_conformer(pred_coords, ligand_smiles)
    102 nonH_pos = pos[nonH_inds]
    103 conf_dmat = np.sqrt(1e-10 + np.sum((nonH_pos[:,None]-nonH_pos[None,:])**2,axis=-1))
--> 104 err = np.mean(np.sqrt(1e-10 + (conf_dmat-pred_dmat)**2))
    105 conf_errs.append(err)
    106

ValueError: operands could not be broadcast together with shapes (19,19) (2861,2861)
```
Any advice would be helpful,
Best regards!
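A note on reading that traceback: the (19,19) matrix matches the SMILES ligand's heavy atoms, while (2861,2861) is suspiciously protein-sized, suggesting the coordinates read from "_pred_raw.pdb" are not the ligand chain's. A quick stdlib diagnostic (my own helper, not part of Umol) is to count atoms per chain in the raw PDB and check which chain actually has ~19 entries:

```python
from collections import Counter

def atoms_per_chain(pdb_text):
    """Count ATOM/HETATM records per chain ID in a PDB string.
    The ligand chain should have roughly as many entries as the
    SMILES has heavy atoms; a count in the thousands means protein
    coordinates are being picked up instead."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            counts[line[21]] += 1  # chain ID lives in PDB column 22
    return dict(counts)
```

If no chain has a ligand-sized count, the uploaded model likely contains no ligand records at all, which would explain the broadcast failure.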
Hi, as a test case I was using a protein of 539 residues in length and an MSA of 25 protein sequences,
but I am getting a memory-exhaustion error. Is Colab limited by protein length or MSA size?
I uploaded the .a3m file from HHblits as outlined in the first cell, but ran into this error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-55efa3eb56f8> in <cell line: 10>()
      8
      9 #Predict
---> 10 predict(config.CONFIG,
     11     MSA_FEATS,
     12     LIGAND_FEATS,

7 frames

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def, extract_traceback)
   1965   except errors.InvalidArgumentError as e:
   1966     # Convert to ValueError for backwards compatibility.
-> 1967     raise ValueError(e.message)
   1968
   1969   # Record the current Python stack trace as the creating stacktrace of this

ValueError: Cannot reshape a tensor with 345790 elements to shape [2290,344,1] (787760 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [2290,151], [3] and with input tensors computed as partial shapes: input[1] = [2290,344,1].
```
The example in the notebook works fine, so it may be my formatting; just thought I'd raise it so you're aware. I'll try a local install and see if I can get past this.
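The shapes in that error are informative: 345790 = 2290 × 151, so the MSA has 2290 rows of width 151, while the model expects rows of the query length (344). In a3m format, lowercase letters are insertions and should be dropped before comparing widths; a quick stdlib check (helper name is mine, not part of Umol) is:

```python
def a3m_aligned_lengths(a3m_text):
    """Return the set of aligned-row lengths in an a3m string, after
    removing lowercase insertion characters. For a valid a3m, every
    row has the same length as the query, so the set has one element."""
    lengths, seq = set(), []
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            if seq:
                s = "".join(seq)
                lengths.add(sum(1 for c in s if not c.islower()))
            seq = []
        elif line.strip():
            seq.append(line.strip())
    if seq:
        s = "".join(seq)
        lengths.add(sum(1 for c in s if not c.islower()))
    return lengths
```

If this returns more than one length, or a length different from your query sequence, the a3m was built against a different (or truncated) query than the SEQUENCE pasted into the notebook.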
Hello Patrick,
Thank you for sharing your amazing work!
Forgive me if I missed something, but I could not find the details of your training regimen in your article. I am particularly interested in how training was performed on large proteins and ligands (>500 tokens). On page 10 you mention that "15 complexes are out of memory" and that you "crop these to 500 residues"; did you do the same for training? Did you randomly crop proteins as in AF2, and if so, what sequence size did you choose?
Thank you for your help in advance.
Hi Patrick,
I was wondering what, in your experience, would be a good selection of interaction sites. I have noticed the selection has an impact on the outcome, for instance between 10 and 7 Å. Another question: if I test different sites on the same protein, could the pLDDT give a kind of score for the probable binding site?
Thanks a lot,
Cesar
Hello,
I am trying to run on the Colab
With the Input: ID:1ct9_happy
LIGAND: OC(=O)CC(C(=O)O)N
SEQUENCE:DDLQGMFAFALYDSEKDAYLIGRDHLGIIPLYMGYDEHGQLYVASEMKALVPVCRTIKEFPAGSYLWSQDGEIRSYYHRDWFDYDAVKDNVTDKNELRQALEDSVKSHLMSDVPYGVLLSGGLDSSIISAITKKYAARRVEDQERSEAWWPQLHSFAVGLPGSPDLKAAQEVANHLGTVHHEIHFTVQEGLDAIRDVIYHIETYDVTTIRASTPMYLMSRKIKAMGIKMVLSGEGSDEVFGGYLYFHKAPNAKELHEETVRKLLALHMYDCARANKAMSAWGVEARVPFLDKKFLDVAMRINPQDKMCGNGKMEKHILRECFEAYLPASVAWRQKEQFSDGVGYSWIDTLKEVAAQQVSDQQLETARFRFPYNTPTSKEAYLYREIFEELFPLPSAAECVPGGPSVACSSAKAIEWDEAFKKMDDPSGRAVGVHQSAYK
TARGET_POSITIONS:117,120,144,178
NUM_RECYCLES:3
When I try to run my example I get the following error in the "Predict the protein-ligand structure" section:
XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 13608450984 bytes.
BufferAssignment OOM Debugging. (I am running on GPU)
Can you help me resolve this problem please?
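Two standard JAX/XLA environment settings are commonly tried first for these OOM errors. They reduce preallocation and fragmentation rather than total memory, so they may not be enough for a 13 GB allocation; cropping the sequence or moving to a larger GPU may be the only real fix. Set them before launching Python:

```shell
# Ask JAX not to preallocate most of GPU memory up front,
# and cap the fraction it is allowed to grow to.
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.90
```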
Is there a Docker image to use it? The script installation fails for me.
Can the ligand be input via SDF files instead of the LIGAND_SMILES format?
In other words, can the structure-data file (SDF) be converted into the corresponding LIGAND_SMILES?
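In principle yes: RDKit can read an SDF and emit a SMILES string, which Umol then consumes as the LIGAND field. A minimal sketch (assumes RDKit is installed; the function name is mine):

```python
from rdkit import Chem

def sdf_to_smiles(sdf_path):
    """Read the first molecule from an SDF file and return its
    canonical SMILES, suitable for the LIGAND field."""
    supplier = Chem.SDMolSupplier(sdf_path)
    mol = next(m for m in supplier if m is not None)
    return Chem.MolToSmiles(mol)
```

Note that SMILES discards the 3D coordinates stored in the SDF; Umol generates its own conformers from the SMILES.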
I think you should change the code to prevent the warning.
```
/home/tools/umol_package/Umol/src/net/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/home/tools/umol_package/Umol/src/net/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/home/tools/umol_package/Umol/src/net/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
```
Looking at the Colab notebook, there seems to be a sequence length limit of 400. However, this is not apparent anywhere else in the code or manuscript.
Where does this limit come from? Is it only due to computational constraints (i.e. VRAM)? Can this limit be surpassed with larger GPUs?
Hi all,
I got an error when I submitted a new protein.
Could you help with this please? I have updated the MSA, sequence and positions:
LIGAND = "N#CCC(=O)N(CC1)CC@@HN(C)c2ncnc(c23)[nH]cc3" # @param {type:"string"}
SEQUENCE = "RKSPLTLEDFKFLAVLGRGHFGKVLLSEFRPSGELFAIKALKKGDIVARDEVESLMCEKRILAAVTSAGHPFLVNLFGCFQTPEHVCFVMEYSAGGDLMLHIHSDVFSEPRAIFYSACVVLGLQFLHEHKIVYRDLKLDNLLLDTEGYVKIADFGLCKEGMGYGDRTSTFCGTPEFLAPEVLTDTSYTRAVDWWGLGVLLYEMLVGESPFPGDDEEEVFDSIVNDEVRYPRFLSAEAIGIMRRLLRRNPERRLGSSERDAEDVKKQPFFRTLGWEALLARRLPPPFVPTLSGRTDVSNFDEEFTGEAPTLSPPRDARPLTAAEQAAFLDFDFVAGGC" #@param {type:"string"}
TARGET_POSITIONS = "17,28,19,20,23,24,25,91,92,93,94" #@param {type:"string"}
error:
```
File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/make_msa_seq_feats_colab.py:98, in process(input_fasta_path, input_msas)
     96     parsed_msa, parsed_deletion_matrix, _ = parsers.parse_stockholm(msa)
     97 elif custom_msa[-3:] == 'a3m':
---> 98     parsed_msa, parsed_deletion_matrix = parsers.parse_a3m(msa)
     99 else: raise TypeError('Unknown format for input MSA, please make sure '
    100     'the MSA files you provide terminates with (and '
    101     'are formatted as) .sto or .a3m')
    102 parsed_msas.append(parsed_msa)

File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/net/data/parsers.py:142, in parse_a3m(a3m_string)
    127 def parse_a3m(a3m_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
    128     """Parses sequences and deletion matrix from a3m format alignment.
    129
    130     Args:
    (...)
    140         the aligned sequence i at residue position j.
    141     """
--> 142     sequences, _ = parse_fasta(a3m_string)
    143     deletion_matrix = []
    144     for msa_sequence in sequences:

File /cluster/ddu/cmmartinez001/Projects/Umol/content/Umol/src/net/data/parsers.py:62, in parse_fasta(fasta_string)
     60 elif not line:
     61     continue  # Skip blank lines.
---> 62 sequences[index] += line

     64 return sequences, descriptions

IndexError: list index out of range
```
Best wishes,
Cesar
Hi,
I am trying to create the environment with the environment.yml file but am getting the following error:
Could not solve for environment specs
The following packages are incompatible
├─ ambertools ==23.3 py312h1577c9a_6 is requested and can be installed;
└─ openmmforcefields ==0.11.2 pyhd8ed1ab_1 is not installable because it requires
└─ ambertools >=20.0,<23 , which conflicts with any installable versions previously reported.
Is there a work around for this?
Thank you for the help.
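The solver output says openmmforcefields 0.11.2 requires ambertools <23, while the file pins ambertools 23.3. One workaround (my own suggestion, not an official fix) is to relax the ambertools pin in environment.yml so the solver can pick a version both packages accept:

```yaml
dependencies:
  # let the solver reconcile the two packages instead of pinning
  # ambertools to a version openmmforcefields rejects
  - ambertools<23
  - openmmforcefields=0.11.2
```

Alternatively, a newer openmmforcefields release may have lifted the <23 ceiling; checking its changelog before downgrading ambertools is worthwhile.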
Hi Mr Bryant,
When I tried to predict some huge proteins (about 1,179 amino acids),
it returned an error like the one below:
XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 45026377728 bytes.
I've run the code on a Tesla A100 GPU and tried setting os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false',
but it still returns the same error.
May I ask if you have any good advice on this? For example, predicting in parts, though I'm not sure whether that would affect the prediction accuracy.
Any response would be helpful.
Gratefully!
My machine has a powerful GPU, but the CPU is not very good. When running localcolabfold, I enabled GPU acceleration and the performance was great. So I want to ask whether this script can also be accelerated by the GPU.
Thanks for sharing the Umol code, it is a highly helpful resource. I want to try it on a protein with multiple (4) chains. Each chain contains around 900 residues, and the ligand interacts with residues from all 4 chains. I want to use only a subset of residues near the binding site as the input sequence. Is it possible to run Umol for such a case? If yes, how can I do it? E.g., what should the sequence format be for the MSA and for Umol?
Hi,
I was wondering if there is any way to generate a number of good, ranked models. As of now I only get one model, but I can see the model sometimes gives a bad pose. Just wondering if it is possible to generate more than one pose with different rankings, or, alternatively, to have control over the parameters to get a different result. Well, I don't know the code well, so better to ask. Great tool by the way; in other cases this approach works really well.
Best wishes,
Cesar
...
Collecting tb-nightly==2.16.0a20231211 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 57))
Downloading tb_nightly-2.16.0a20231211-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorboard-data-server==0.7.2 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 58))
Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl.metadata (1.1 kB)
Collecting tensorstore==0.1.51 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 59))
Downloading tensorstore-0.1.51-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting termcolor==2.4.0 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 60))
Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Collecting tf-estimator-nightly==2.14.0.dev2023080308 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 61))
Downloading tf_estimator_nightly-2.14.0.dev2023080308-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tf-keras-nightly==2.16.0.dev2023121110 (from -r /path/to/bin/Umol/condaenv.h37_frer.requirements.txt (line 62))
Downloading tf_keras_nightly-2.16.0.dev2023121110-py3-none-any.whl.metadata (1.5 kB)
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement tf-nightly==2.16.0.dev20231211 (from versions: 2.16.0.dev20231225, 2.16.0.dev20231226, 2.16.0.dev20231227, 2.16.0.dev20231228, 2.16.0.dev20231229, 2.16.0.dev20231230, 2.16.0.dev20231231, 2.16.0.dev20240101, 2.16.0.dev20240102, 2.16.0.dev20240103, 2.16.0.dev20240104, 2.16.0.dev20240105, 2.16.0.dev20240106, 2.16.0.dev20240107, 2.16.0.dev20240108, 2.16.0.dev20240110, 2.16.0.dev20240119, 2.16.0.dev20240124, 2.16.0.dev20240125, 2.16.0.dev20240126, 2.16.0.dev20240127, 2.16.0.dev20240128, 2.16.0.dev20240129, 2.16.0.dev20240130, 2.16.0.dev20240201, 2.16.0.dev20240202, 2.16.0.dev20240203, 2.16.0.dev20240204, 2.16.0.dev20240205, 2.16.0.dev20240206, 2.16.0.dev20240207, 2.16.0.dev20240209, 2.16.0, 2.17.0.dev20240213, 2.17.0.dev20240214, 2.17.0.dev20240215, 2.17.0.dev20240216, 2.17.0.dev20240217, 2.17.0.dev20240218, 2.17.0.dev20240219, 2.17.0.dev20240220, 2.17.0.dev20240221, 2.17.0.dev20240222, 2.17.0.dev20240223, 2.17.0.dev20240225, 2.17.0.dev20240226, 2.17.0.dev20240227, 2.17.0.dev20240228, 2.17.0.dev20240229, 2.17.0.dev20240301, 2.17.0.dev20240302, 2.17.0.dev20240303, 2.17.0.dev20240304, 2.17.0.dev20240305, 2.17.0.dev20240306, 2.17.0.dev20240308, 2.17.0.dev20240309, 2.17.0.dev20240310, 2.17.0.dev20240312, 2.17.0.dev20240313, 2.17.0.dev20240314, 2.17.0.dev20240315, 2.17.0.dev20240316, 2.17.0.dev20240317, 2.17.0.dev20240318, 2.17.0.dev20240319, 2.17.0.dev20240320, 2.17.0.dev20240322, 2.17.0.dev20240323, 2.17.0.dev20240324)
ERROR: No matching distribution found for tf-nightly==2.16.0.dev20231211
failed
CondaEnvException: Pip failed
I'll try with tf-nightly==2.16.0
and less strict version requirements for tf-related pip packages later.