
progen's Introduction

ProGen: Language Modeling for Protein Engineering

Suite of open-sourced projects and models for protein engineering and design.

License

Our code and models are BSD-3 licensed. See LICENSE.txt for details.

Ethics

Predicting the fitness of a protein sequence and capturing the distribution of natural proteins for generative purposes could be a powerful tool for protein design. If our technique or a future iteration thereof is adopted broadly, care should be taken in terms of the end use-cases of these designed samples and downstream effects to ensure safe, non-nefarious, and ethical applications. For projects in any domain, active oversight during project initiation, experimental optimization, and deployment phases should be put in place to ensure safe usage and limitation of unintended harmful effects.

progen's People

Contributors

a-mad · enijkamp · jimjag · madani-sf


progen's Issues

Predicting model for CM and MDH dataset

Hi, thank you for the beautiful work.

ProGen has been applied to generate proteins for the CM and MDH families. In the Methods section, the details are described as:

We computed the AUC in receiver operating characteristic (ROC) curves for predicting binary function labels from model scores. We computed model scores for each sequence in both CM and MDH by using the per-token model log-likelihood in Eq. 2.

Does this mean that (1) the log-likelihood is calculated for each token of a sequence, and (2) a classifier model is then used to predict whether the whole sequence is reactive or not (with labels from experimental data), using the calculated per-token log-likelihood scores as features? Could you please also release the data/code/models for this part?

Best regards
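A minimal sketch of how such a score could be computed, assuming an already-loaded ProGen model and tokenizer (variable names are illustrative). One reading of the quoted passage is that this scalar mean log-likelihood is itself the ranking score for the ROC curve, with no separate classifier:

import torch
import torch.nn.functional as F

def mean_log_likelihood(model, tokenizer, sequence, device='cpu'):
    # Encode; the tokenizers library returns an Encoding with .ids
    ids = torch.tensor(tokenizer.encode(sequence).ids, device=device)
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    # Position i predicts token i+1; average the log-probs of the true tokens
    log_probs = F.log_softmax(logits[:-1], dim=-1)
    return log_probs.gather(-1, ids[1:].unsqueeze(-1)).mean().item()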

Training details of discriminator

Is the code for the discriminator used for selection available? The description of this part in the article does not seem detailed enough for reimplementation; most importantly, the loss function is not stated.
Does the discriminator use an adversarial loss or a simple ranking loss (e.g., as in ChatGPT)?
Could you provide the code for this part?

I want to generate unnatural protein sequences from a gene family using Progen2

I want to generate unnatural protein sequences from a gene family using ProGen2, but I find that there are no protein family keywords or taxIDs in the Progen2/tokenizer.json file.
Could you provide a complete tokenizer.json file (containing family keywords and taxIDs, etc.), just like the one in the mapping_files/ folder from "https://doi.org/10.5281/zenodo.7296780"?
Also, can the keywords dictionary from mapping_files/ be used with ProGen2 models? (I find "<|bos|>": 1 in "Progen2/tokenizer.json" but 1: '2Fe-2S' in "mapping_files/kw_to_name.p2".) I look forward to your reply.
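One way to check which control tokens a given tokenizer file actually contains is to load it and dump its vocabulary; a small sketch using the tokenizers library (the file path is an assumption):

from tokenizers import Tokenizer

with open('tokenizer.json', 'r') as f:
    tokenizer = Tokenizer.from_str(f.read())

# token -> id mapping; family keywords/taxIDs would appear here if present
for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
    print(idx, token)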

Generate sequence based on natural one

Hi,
First of all, thank you for sharing ProGen2 code.
I read the article as part of my thesis, and started playing with the model.

My questions:

  1. How can I generate a new sequence with ProGen2 that is based on a natural sequence?
    For example:
    Natural seq: "DQSVRKLVRKLPDEGLDREKVKTYLDKLGVDREELQKFSDAIGLESSGGS"
    A new generated seq: "GSSDIEITVEGKEQADKVIEEMKRRNLEVHVEEHNGQYIDKASLESSGGS"

  2. In the generation process, is it possible to define which position(s) in the natural sequence I want ProGen2 to change? If yes, how? (See the sketch below.)
    Example: I want to change only the 3rd position in the sequence, so:
    Natural seq: "DQSV..."
    Possible generated sequence: "DQGV..."
    So the 3rd position "S" was changed to "G".

Best Regards,
Aviv.
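A causal model like ProGen2 has no built-in way to freeze positions, but one common workaround for question 2 is to enumerate the 20 residues at the target position and rank the resulting full sequences by model log-likelihood. A sketch, reusing a mean_log_likelihood helper like the one sketched earlier and the "1"/"2" terminus tokens from the repo's examples:

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def rank_substitutions(model, tokenizer, sequence, position):
    # position is 0-indexed into the raw amino-acid sequence
    scores = {}
    for aa in AMINO_ACIDS:
        mutant = sequence[:position] + aa + sequence[position + 1:]
        scores[aa] = mean_log_likelihood(model, tokenizer, '1' + mutant + '2')
    # Highest-likelihood substitution first
    return sorted(scores.items(), key=lambda kv: -kv[1])

For question 1, the mechanism the repo exposes is passing a prefix of the natural sequence as --context; the model then samples a new continuation.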

Assertion error when running example code

Thanks for all your hard work in getting this out! I get the error below when running the example likelihood calculation.

Also, I wonder if you could comment on the example itself? I didn't find homology to a known protein when I BLASTed the sequence below, and it appears to begin with an "end-of-sequence" token and end with a "beginning-of-sequence" token.

python3 likelihood.py --model progen2-small --context "2PAQGRARLAAHYGTGRIGREVTVDERCRNLDRLEPSWELLRLLDDMGFIEGQNGLRRYVAEVFALDEPYDMTWRLRSLDEPHEVNAIEFAAPHERVYATLSERFFPDSVERDLRELVTRSLVEVDLGDPFTPPFVNSVYELRGASRRWVGVVRDVLAPDVLPCDATIRVLADAGTRAATRGLREILDTESGRVCVLGLHAALDAIADDRNEVSTSVAVADLEQCVALREAIRQITPRGAISVLVKGPLRTSGMRAQIAAVVHLRAKSSHLLPGGTDVVTFGAREFAIRSAANERKVVASMRLLALPGFAERSLCGLARPGVGRGRWEPAINVSVAADRDQIDLRVMGADVGDASVIFLKRDFRKLTEEFWRTHTDVPIEREDVSAQRTEPDNRWRWLVPCDDLVAPRLTVVPPRSVGHGM1"
loading parameters
loading parameters took 5.17s
loading tokenizer
loading tokenizer took 0.00s
sanity log-likelihood
ll_0=-3.6930243968963623
ll_1=-3.7106473445892334
ll_2=-3.7106471061706543
sanity log-likelihood took 0.11s
Traceback (most recent call last):
  File "likelihood.py", line 226, in <module>
    main()
  File "likelihood.py", line 183, in main
    assert abs(ll_0 - ll_1) < 1e-2
AssertionError

RuntimeError

Greetings, really impressive work! However, I came across a problem when I ran the command below:
python3 sample.py --model ${model} --t 0.8 --p 0.9 --max-length 1500 --num-samples 10 --context "1MDKKY
SIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKA
DLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAA
KNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNR
EKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEIS
GVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRAFAALIADDSLTFKEDIQKAQVSGQGDSLHEHIA
NLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDN
VPSEEVVKKMKNYWRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDY
KVYDVRKMIAKSEQEIGKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKL
KSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKP
IREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD2"

The protein is about 1370 aa long, and I got errors like:
RuntimeError: The size of tensor a (1024) must match the size of tensor b (1370) at non-singleton dimension 3
Besides, any input longer than 1024 causes the error. Is the maximum input sequence length only 1024?
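The checkpoints have a fixed number of learned positional slots, so contexts past that limit must be shortened before the forward pass. A sketch of a guard, assuming the limit is exposed as model.config.n_positions (1024 here, judging by the error):

max_len = getattr(model.config, 'n_positions', 1024)
ids = tokenizer.encode(context).ids
if len(ids) > max_len:
    # Keep the C-terminal window so generation continues from the true end;
    # a sliding-window scheme is another option
    ids = ids[-max_len:]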

Weird behavior with memory usage within sample.py

So for some reason progen2-base starts to use an ungodly amount of VRAM as I increase the value of --num-samples. If I set --num-samples to 50, I get the following error; yet if I set it to 30, 40, even 45, no issue occurs. I assume this is unintentional.

sampling
sampling took 36.29s
Traceback (most recent call last):
  File "sample.py", line 207, in <module>
    main()
  File "sample.py", line 193, in main
    completions = sample(device=device, model=model, tokenizer=tokenizer, context=args.context, pad_token_id=tokenizer.encode('<|pad|>').ids[0], num_return_sequences=args.num_samples, temp=args.t, top_p=args.p, max_length=args.max_length)
  File "sample.py", line 73, in sample
    tokens_batch = model.generate(input_ids, do_sample=True, temperature=temp, max_length=max_length, top_p=top_p, num_return_sequences=num_return_sequences, pad_token_id=pad_token_id)
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/transformers/generation_utils.py", line 1210, in generate
    **model_kwargs,
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/transformers/generation_utils.py", line 1714, in sample
    output_hidden_states=output_hidden_states,
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.../.../progen/progen2/models/progen/modeling_progen.py", line 640, in forward
    return_dict=return_dict,
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.../.../progen/progen2/models/progen/modeling_progen.py", line 507, in forward
    output_attentions=output_attentions,
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.../.../progen/progen2/models/progen/modeling_progen.py", line 269, in forward
    output_attentions=output_attentions,
  File "/home/.../.../progen/progen2/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/.../.../progen/progen2/models/progen/modeling_progen.py", line 203, in forward
    value = torch.cat((past_value, value), dim=-2)
RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 0; 14.76 GiB total capacity; 13.21 GiB already allocated; 37.75 MiB free; 13.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For reference, this is the command I'm running:

python sample.py --model progen2-base --t 0.8 --p 90 --max-length 512 --num-samples 40 --context <232 AA sequence>
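Sampling memory grows with num_return_sequences × max_length because a key/value cache is kept for every sequence in the batch, so the usual workaround is to generate in smaller chunks and concatenate. A sketch around the sample() call from the traceback above (the chunk size is an assumption to tune); note also that --p expects a nucleus probability in (0, 1], so --p 90 in the command is likely meant to be --p 0.9:

completions = []
chunk = 10  # largest batch that fits on your GPU
remaining = args.num_samples
while remaining > 0:
    n = min(chunk, remaining)
    completions += sample(device=device, model=model, tokenizer=tokenizer,
                          context=args.context,
                          pad_token_id=tokenizer.encode('<|pad|>').ids[0],
                          num_return_sequences=n,
                          temp=args.t, top_p=args.p, max_length=args.max_length)
    remaining -= n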

Client Error: Unauthorized for url

Hello,

I was installing ProGen2 within a Conda env. I followed all the steps in the documentation, except that I replaced:

python3.8 -m venv .venv
source .venv/bin/activate
with
conda create -n progen -c anaconda python=3.8
source ~/Anaconda/bin/activate progen

When I ran sample.py, I received the following output:

falling back to cpu
falling back to fp32
loading parameters
loading parameters took 0.88s

and the following as an error:

python3 /lustre/scratch/x_kazlakam/progen/progen2/sample.py --model progen2-large --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context 1
    401 Client Error: Unauthorized for url: https://huggingface.co/checkpoints/progen2-large/resolve/main/config.json
    Traceback (most recent call last):
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/configuration_utils.py", line 585, in _get_config_dict
    resolved_config_file = cached_path(
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/file_utils.py", line 1846, in cached_path
    output_path = get_from_cache(
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/file_utils.py", line 2050, in get_from_cache
    _raise_for_status(r)
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/file_utils.py", line 1977, in _raise_for_status
    request.raise_for_status()
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/checkpoints/progen2-large/resolve/main/config.json
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
    File "/lustre/scratch/x_kazlakam/progen/progen2/sample.py", line 207, in
    main()
    File "/lustre/scratch/x_kazlakam/progen/progen2/sample.py", line 145, in main
    model = create_model(ckpt=ckpt, fp16=args.fp16).to(device)
    File "/lustre/scratch/x_kazlakam/progen/progen2/sample.py", line 57, in create_model
    return ProGenForCausalLM.from_pretrained(ckpt)
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1268, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/configuration_utils.py", line 510, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/configuration_utils.py", line 537, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
    File "/home/x_kazlakam/.conda/envs/progen/lib/python3.8/site-packages/transformers/configuration_utils.py", line 618, in _get_config_dict
    raise EnvironmentError(
    OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like ./checkpoints/progen2-large is not the path to a directory conaining a config.json file.
    Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
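The 401 here usually just means the relative path ./checkpoints/progen2-large does not exist in the working directory, so transformers falls through to huggingface.co, where no such public repo exists. A sketch that fails fast with a clearer message, assuming the checkpoint was downloaded and extracted per the repo's setup instructions:

import os
from models.progen.modeling_progen import ProGenForCausalLM

ckpt = './checkpoints/progen2-large'
if not os.path.isdir(ckpt):
    raise FileNotFoundError(f'{ckpt} not found; download and extract the '
                            'checkpoint there before calling from_pretrained')
model = ProGenForCausalLM.from_pretrained(ckpt)

Running the script from inside the progen2 directory, so the relative checkpoint path resolves, is often the whole fix.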

How does progen handle sequences greater than 2048?

I ran progen until I hit the first sequence greater than 2048 and it threw an exception:

2024-01-10 10:26:32.104 | INFO     | __main__:main:100 - falling back to cpu
2024-01-10 10:26:32.105 | WARNING  | __main__:main:105 - falling back to fp32
2024-01-10 10:26:32.105 | INFO     | __main__:main:107 - loading parameters
2024-01-10 10:26:38.012 | INFO     | __main__:main:111 - loading tokenizer
  0%|          | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/public/home/wenyuhao/embedding/progen/progen2/embedding.py", line 137, in <module>
    main(args)
  File "/public/home/wenyuhao/embedding/progen/progen2/embedding.py", line 120, in main
    hidden_states,lm_logits = model.embedding(target) #.logits
  File "/public/home/wenyuhao/embedding/progen/progen2/models/progen/modeling_progen.py", line 700, in embedding
    transformer_outputs = self.transformer(
  File "/public/home/wenyuhao/embedding/progen/progen2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/wenyuhao/embedding/progen/progen2/models/progen/modeling_progen.py", line 503, in forward
    outputs = block(
  File "/public/home/wenyuhao/embedding/progen/progen2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/wenyuhao/embedding/progen/progen2/models/progen/modeling_progen.py", line 265, in forward
    attn_outputs = self.attn(
  File "/public/home/wenyuhao/embedding/progen/progen2/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/wenyuhao/embedding/progen/progen2/models/progen/modeling_progen.py", line 213, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/public/home/wenyuhao/embedding/progen/progen2/models/progen/modeling_progen.py", line 131, in _attn
    attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2073) at non-singleton dimension 3
  0%|          | 0/10 [00:08<?, ?it/s]%       

I found that when the sequence length is greater than 2048, the dimensions of the query, key and mask do not match.

Can progen handle sequences longer than 2048? If not, should I truncate sequences beyond 2048 before feeding them into the model?
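The trained context window covers 2048 tokens, so longer sequences must be truncated or processed in windows. For the embedding use-case in the traceback, a sketch that chunks the token ids and averages the hidden states (the window size, and the custom model.embedding method from the user's script, are assumptions taken from the traceback above):

import torch

def embed_long(model, ids, window=2048):
    states = []
    with torch.no_grad():
        for i in range(0, len(ids), window):
            chunk = torch.tensor(ids[i:i + window]).unsqueeze(0)
            hidden, _ = model.embedding(chunk)  # user's custom method
            states.append(hidden[0])
    # One fixed-size vector per sequence (mean over all positions)
    return torch.cat(states, dim=0).mean(dim=0)

Tokens in one window cannot attend to another window, so this is an approximation rather than a full-context embedding.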

Releasing cleaned version of ProGen1?

Congratulations on a great paper! We are currently trying to reproduce the steps for finetuning ProGen (ProGen1, the Nature Biotech version) on lysozymes, and then use the same pipeline for another protein of interest. However, while trying to reproduce the finetuning pipeline, we discovered that the GitHub repo link posted in the NBT paper is for ProGen2. Although the Zenodo link contains the desired ProGen1 code, the training datasets (both the pre-training dataset and the finetuning ones) are missing. Furthermore, the Zenodo codebase is apparently used for development, is not cleaned for public use, and would be difficult for us (experimental biologists) to run. If possible, would you mind releasing a cleaned open-source version (like ProGen2) with instructions for sampling and pretraining? This would be most welcome in the AI+protein community and among experimental biologists who are not experts in transformer models. Many thanks in advance!

URL from huggingface

When setting up ProGen2, I ran into the problem that the Hugging Face URL needed by sample.py is not available.

Duplicate Sequences with Different runs

Hello,

I had some GPU RAM issues when trying to generate sequences. I found I can reliably produce only 15 sequences at a time.

But when I do separate runs to get up to 100 sequences, I find some of the output sequences are 100% identical to sequences produced by other runs.

Is this process deterministic?
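Sampling is stochastic, but if every run starts from the same RNG seed (sample.py appears to fix one by default), separate runs will draw identical sequences. A sketch of per-run seeding, with the seed source as an assumption:

import time, random
import numpy as np
import torch

seed = int(time.time())  # or pass a distinct seed per run on the command line
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)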

How to obtain Test-max90 and Test-max50?

Hello,
Thanks for releasing the ProGen2 code for protein generation!
However, I wonder how I can obtain the Test-max90 and Test-max50 held-out test sequences, and how to calculate perplexity.
Hope for your suggestions. Thanks in advance!

Sampling conditional token distribution

It would be super valuable to have an example script to sample conditional token probabilities for a target index given sequence context.

There seem to be some technical details that are important but not easy to figure out.

Finally, the way I'm currently evaluating mutations is by sequentially computing sequence likelihoods for each possible mutated sequence, so this takes 20 forward passes per single point mutation. But I think this is vastly inefficient: since the model produces logits for every position, can the logits at the target index simply be used as a proxy for token probability?
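For a causal LM, the logits at position i are exactly the model's distribution over token i+1 given the prefix, so a single forward pass does yield per-position conditional probabilities — but conditioned on the left context only; the 20 full-sequence likelihoods additionally capture how a mutation changes the likelihood of the suffix. A sketch of the single-pass reading (target_idx and the "1"/"2" terminus tokens follow the repo's conventions):

import torch
import torch.nn.functional as F

ids = torch.tensor(tokenizer.encode('1' + sequence + '2').ids).unsqueeze(0)
with torch.no_grad():
    logits = model(ids).logits[0]

# Distribution over the token at target_idx, given the tokens before it only
probs = F.softmax(logits[target_idx - 1], dim=-1)

If the suffix matters, batching the 20 mutated sequences into one padded forward pass still reduces the cost to a single call.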

How to generate sequences of a particular family using PROGEN2?

Hello,
I would like to use the ProGen2 package to generate the Cas (CRISPR-Cas) family of proteins. How can I use sample.py for this?
The paper says that ProGen2 generates sequences in a controlled fashion using tags. What is the tag for the Cas family?
In the following command:
python3 sample.py --model ${model} --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context "1"
what is context? Is this the tags?

Lysozyme dataset needed

Hello, congrats on the work and thanks for sharing.
I am interested in replicating the paper's fine-tuning on lysozymes and wanted to know if you could share the txt file.
Also, I have doubts about the conditioning tags and how you formatted the tags to put them into each AA sequence.
Thank you in advance!

How to generate non-natural sequences of an interest protein using ProGen

When running the script with the command 'python3 sample.py --model ${model} --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context "1"', the program runs successfully. If I want to generate non-natural amino acid sequences for a novel protein, what steps should I take? Specifically, should I prepare an input file containing the natural amino acid sequence in FASTA format, or a PDB file containing the protein structure? Please provide guidance on the appropriate input format for generating non-natural sequences using ProGen.

How should I write the command to run the program for generating non-natural sequences using ProGen?

Data for antibody set

Wonderful paper, a great experimental exposition of where and why natural-sequence models will cap out, in support of Weinstein's theoretical work.

Could you provide the dataset used from Koenig?

Lastly, for the antibody-specific landscape, we compiled a dataset consisting of binding, expression, and thermal stability measurements for variants derived from eight distinct antibodies. We collected expression and antigen-binding enrichment measurements for variants of the anti-VEGF g6 antibody from a DMS study (Koenig et al., 2017).

correlation with Facebook ESM's log_likelihood

Hi all,

I've tested ProGen2 on an antibody Fv for which we've also calculated Facebook ESM log-likelihood stability values, and we see very little correlation between the two. I've tested the small (msma), medium (mmed), and OAS (moas) models; the Pearson correlation results are below. This is an AA scan (changing each amino acid to each of the other 19 possible AAs) over all positions in the heavy and light chains. The Fv is represented as heavy + (GGGGS x4) + light chain.

prg2_ll_sum1 is the first likelihood sum value that appears in the output; prg2_ll_sum3 is the second one (right to left?).

Is this expected? I naively expected a good correlation between these likelihood values and Facebook's ESM likelihood values. Thanks in advance.

msma===
prg2_ll_sum3:count    4064
-------------------   -------
prg2_ll_sum3:min      -840.91
prg2_ll_sum3:median   -836.44
prg2_ll_sum3:max      -830.70
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.msma.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,0.0185
log_likelihood,prg2_ll_sum1,0.0270
prg2_ll_sum3,llh_delta,0.0048
prg2_ll_sum1,llh_delta,-0.0034
mmed===
prg2_ll_sum3:count    4064
-------------------   --------
prg2_ll_sum3:min      -1119.47
prg2_ll_sum3:median   -1111.60
prg2_ll_sum3:max      -1105.50
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.mmed.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,-0.0787
prg2_ll_sum1,log_likelihood,-0.0682
prg2_ll_sum3,llh_delta,0.0685
llh_delta,prg2_ll_sum1,0.0616
moas===
prg2_ll_sum3:count    4064
-------------------   --------
prg2_ll_sum3:min      -1541.08
prg2_ll_sum3:median   -1530.23
prg2_ll_sum3:max      -1515.66
file                                                                          num_cols   num_rows
c5f047165d4b778ddd16807a6840a52a.moas.prg2.scan.csv          5      4,064
prg2_ll_sum3,log_likelihood,0.0116
prg2_ll_sum1,log_likelihood,-0.0096
prg2_ll_sum3,llh_delta,0.0134
prg2_ll_sum1,llh_delta,0.0272

A question about "context"

What does the 'context' parameter in this command represent?
If I change --context "1" to --context "2", the generated files differ in the number at the beginning of the sequence.

python3 sample.py --model ${model} --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context "1"

CUDA Out of Memory Error When Running sample.py with a large --num-samples

I'm trying to generate a number of sequences with sample.py, so I set the --num-samples parameter to 100.

However, upon execution, I received the following error message: RuntimeError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 9.77 GiB total capacity; 8.11 GiB already allocated; 149.81 MiB free; 8.15 GiB reserved in total by PyTorch). My GPU seems to run out of memory during the process.

Has anyone else experienced anything similar, or does anyone have a suggestion on how to get around this memory limitation? Thank you in advance for your time and help!

Segmentation fault (core dumped)

Hello,
when I run sample.py on Ubuntu WSL under Windows, it appears like this:
:~/progen-main/progen2$ python3 sample.py --model ${model} --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context "1"
falling back to cpu
falling back to fp32
loading parameters
Segmentation fault (core dumped)
What's wrong with it? I also can't download torch==1.9.0, so I changed requirements.txt as follows, and then it could continue:
--find-links https://download.pytorch.org/whl/torch_stable.html
torch
transformers
tokenizers

Fine Tuning the Model

I want to fine-tune ProGen2-small on my own dataset.
See this Google Colab notebook for an annotated version of the code and the error:
https://colab.research.google.com/drive/1_R0xgf6Kw0K88PYF7-ZOCIh9WRSmXN8C?usp=sharing

First I load the model like this:

import torch
from tokenizers import Tokenizer
from progen.progen2.models.progen.modeling_progen import ProGenForCausalLM

model = ProGenForCausalLM.from_pretrained('/content/drive/MyDrive/progen2-small', torch_dtype=torch.float16, low_cpu_mem_usage=True).to(device)

I am using the Hugging Face Trainer to fine-tune the model with DataCollatorForLanguageModeling. I load the tokenizer like this:

def create_tokenizer_custom(file):
    with open(file, 'r') as f:
        return Tokenizer.from_str(f.read())

tokenizer = create_tokenizer_custom(file='/content/progen/progen2/tokenizer.json')

And then I convert it to a PreTrainedTokenizerFast, as suggested in huggingface/tokenizers#325:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer.save("my-tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")

During fine-tuning, the training loss becomes 0.0000. After training, I attempt to generate new samples:

with torch.no_grad():
  input_ids = torch.tensor(fast_tokenizer.encode("1GRGL")).view([1, -1]).to(device)
  tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7, max_length=50, top_p=10, num_return_sequences=1, pad_token_id=0)
  as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
  print(tokenizer.decode_batch(as_lists(tokens_batch))[0])

However, I get this error: RuntimeError: probability tensor contains either inf, nan or element < 0. Please see the Google Colab notebook above for the entire code.
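Two details in the notebook snippets above can produce exactly these symptoms, so a hedged sketch of the likely fixes: loading the master weights in float16 makes small gradient updates underflow (loss pinned at 0.0000, weights drifting to NaN, which later surfaces as the inf/nan probability tensor), and top_p=10 lies outside the valid nucleus range of (0, 1]:

# Keep master weights in fp32 for training; enable mixed precision through the
# Trainer (fp16=True in TrainingArguments) rather than via torch_dtype
model = ProGenForCausalLM.from_pretrained('/content/drive/MyDrive/progen2-small').to(device)

# top_p is a cumulative probability and must lie in (0, 1]
tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.7,
                              max_length=50, top_p=0.9,
                              num_return_sequences=1, pad_token_id=0)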

Many repeats in Progen2 model predictions

I'm following the setup instructions and sampling from a few of the different models, but the outputs are very repetitive.

For example:

python3 sample.py --model progen2-small --t 0.8 --p 0.9 --max-length 1024 --num-samples 1 --context "1"
outputs
1SPPPPPPGP2

python3 sample.py --model progen2-oas --t 0.8 --p 0.9 --max-length 1024 --num-samples 1 --context "1"
outputs
1MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

python3 sample.py --model progen2-oas --t 0.8 --p 0.9 --max-length 1024 --num-samples 1 --context "1EVQ"
outputs
1EVQMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

python3 sample.py --model progen2-large --t 0.8 --p 0.9 --max-length 1024 --num-samples 1 --context "1"
outputs
1SEEFSFEEEWFFWMFALMEFFWWAFFEFWFEFEMWEEFFFEWFWFFFWFWFWEMLMFFFWWFEEEWEEEFFFEFWFWFWEEFMSFMWFEWWFWWFSEWFAFFEFEWWWWSSAFFFFSFFFFFFFWWFWFWFWFFFAFEFFWFEFWFWFEMWFMWFFFFFFFEWFWFFFFWFFEFWFFFWWEWEFFFFFWFFEEAFFWFFEFWFFESFSWEFEFFFWMEMFEWFFEFFEFFWFFWWFAFWWFFMEWFFFFFWFFFFFMWFMWFEWFFFFFFEFFWFFFFFFWMWFFWWMFSFFWFFAFFFFEWEFEFFAWFFEWFAAEAFFEFFFFFEFEFFFMWFWWWEAWMWEFSFFWFFFWWAFWFFWWESAFFSFFFFFWFFFWFFSFSWEEFFWAFFFWFAWFAFAWWMWEFFEFSFFWFFEEEFWFWFFFFMFFFEFWWFFFAEFMWFFFWEFWWEWFSFFWFWWFFWEFFFFWWAEFWFFAWEFALWWFFWWFFFEWWFFFAWFWWFFWEFFWFFFEFWSWFFWWAFFFWEFWSLFAMWFSEMSFAWFEFFWMWWFEFFFEEFFFFFFWWWEFMFFFFWFSLEEFFWEEFFFFFEFMFFWWFWWFWSFWWEEEWWFFEEWFWFFFWFWWFWFWSWEESWWAFFFWFESWWWWSAWSWEEFFWWFAWFFFMFFFWFFFFFWFFEWEEWSWFSWAWFWWWFWFWFWFWFWFFAWWWAEMWWFEWFWMAWWAWWFFEFEEFWFWWFEWWFFFFFWWWWEFALFFSEFAEWAFWEMWFFEFWSFMEEFFFAFFAEEMAFWWFEWSWWFFFFSFWMSFFWFWFFEFFWFWWSFWFMEFEFWEEFWWFMWFWFWFFWAFWWWFFWMWFWFFSWFWALFWSEFSEFFWFFFFFFFEFFFMFFFEFFFFWFWWFEWAFFFSFAFFWWWFEFFWFFWFFWWWFFEWWAMEFFFEWWAWFEFSFWSFAFAFEFEFFEWFWFWFEFWSFFFFFWFFFFFFWFWMEFMFFFEFFWEFSWFFFEFFFFAWFWFFEWMFFE
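Degenerate repetition like this often responds to decoding settings; transformers' generate() accepts a repetition penalty and n-gram blocking that could be threaded through sample.py. The values below are assumptions to tune, and a longer, more informative --context than a bare "1" typically helps as well:

tokens_batch = model.generate(input_ids, do_sample=True, temperature=0.8,
                              top_p=0.9, max_length=1024,
                              repetition_penalty=1.2,   # >1.0 discourages reused tokens
                              no_repeat_ngram_size=4,   # hard-blocks repeated 4-grams
                              pad_token_id=pad_token_id)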

How is the sequence identity calculated in an efficient manner?

Hello,
In your excellent paper, a key aspect is the sequence identity between an artificial sequence and any known natural sequence.
May I ask how this sequence identity can be calculated efficiently, given that it requires screening entire databases for each sequence?
Many thanks in advance
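At database scale this is normally done with a fast homology-search tool (e.g. BLAST or MMseqs2) that reports identity for the top hits, rather than exhaustive pairwise alignment. For a handful of pairs, a global-alignment identity can be computed directly; a sketch with Biopython (an assumption, since the paper's exact pipeline isn't specified here):

from Bio import Align

def percent_identity(a, b):
    aligner = Align.PairwiseAligner()
    aligner.mode = 'global'
    aln = aligner.align(a, b)[0]
    # Gapped aligned strings for the two rows of the best alignment
    s1, s2 = str(aln[0]), str(aln[1])
    matches = sum(x == y and x != '-' for x, y in zip(s1, s2))
    return 100.0 * matches / len(s1)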

Add a sequence classification head to ProGen2

I'm trying to put a sequence prediction head (i.e., sequence classification with one output) on top of the ProGen2 model and am experiencing some problems.

For this, one usually pools the last_hidden_state outputs into a vector and adds a simple MLP on top. When doing this with ProGen2, the predictions end up exactly the same no matter the input sequence. This happens because the pooler just takes the hidden state corresponding to the first token, which in this case is always a 1. The first hidden state is not influenced by any subsequent tokens, so the predictions are always identical. The attention for the first token depends solely on the value of the first token, which is not the case for other transformer models (where the pooler works fine).

Is this a conscious decision, designing the hidden states to depend only on tokens that come before the current token? In that case, I assume pooling over the last hidden state should work for a value prediction head?
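This behavior is inherent to causal (left-to-right) attention rather than a ProGen-specific choice: position 0 can attend only to itself, so its hidden state never sees the rest of the sequence. For decoder-only models the usual pooling choices are the last non-pad token, whose state has seen the whole sequence, or a masked mean; a sketch, assuming batch × seq × hidden tensors:

import torch

def pool_causal(last_hidden_state, attention_mask):
    # Last non-pad token per sequence: the only state conditioned on everything
    lengths = attention_mask.sum(dim=1) - 1                      # (batch,)
    idx = lengths.view(-1, 1, 1).expand(-1, 1, last_hidden_state.size(-1))
    last_tok = last_hidden_state.gather(1, idx).squeeze(1)

    # Alternative: mean over non-pad positions
    mask = attention_mask.unsqueeze(-1).float()
    mean = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return last_tok, mean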

Error when running likelihood.py example for progen2-large on cpu

I get the following error when running:
python3 likelihood.py --model progen2-large --context "2PAQGRARLAAHYGTGRIGREVTVDERCRNLDRLEPSWELLRLLDDMGFIEGQNGLRRYVAEVFALDEPYDMTWRLRSLDEPHEVNAIEFAAPHERVYATLSERFFPDSVERDLRELVTRSLVEVDLGDPFTPPFVNSVYELRGASRRWVGVVRDVLAPDVLPCDATIRVLADAGTRAATRGLREILDTESGRVCVLGLHAALDAIADDRNEVSTSVAVADLEQCVALREAIRQITPRGAISVLVKGPLRTSGMRAQIAAVVHLRAKSSHLLPGGTDVVTFGAREFAIRSAANERKVVASMRLLALPGFAERSLCGLARPGVGRGRWEPAINVSVAADRDQIDLRVMGADVGDASVIFLKRDFRKLTEEFWRTHTDVPIEREDVSAQRTEPDNRWRWLVPCDDLVAPRLTVVPPRSVGHGM1" --device cpu

I do not get this error if I use the progen2-medium model; instead I get the assertion error I describe in a different issue. From the output, the error with the large model happens earlier in the code than the error with the medium model (it's not that the assertion passes with the larger model and a later issue is hit).

loading parameters
loading parameters took 56.16s
loading tokenizer
loading tokenizer took 0.00s
sanity log-likelihood
sanity log-likelihood took 0.01s
File "likelihood.py", line 226, in <module>
  main()
File "likelihood.py", line 175, in main
  ll_0 = ll(observation, f=log_likelihood, reduction='mean')
File "likelihood.py", line 165, in ll
  logits = model(input_ids, labels=input_ids).logits
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/cmaher/repos/progen/progen2/models/progen/modeling_progen.py", line 628, in forward
  transformer_outputs = self.transformer(
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/cmaher/repos/progen/progen2/models/progen/modeling_progen.py", line 500, in forward
  outputs = block(
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/cmaher/repos/progen/progen2/models/progen/modeling_progen.py", line 261, in forward
  hidden_states = self.ln_1(hidden_states)
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 173, in forward
  return F.layer_norm(
File "/home/cmaher/miniconda3/envs/progen2/lib/python3.8/site-packages/torch/nn/functional.py", line 2346, in layer_norm
  return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Models difference

Hello, I don't understand the differences between the models.

Can't find the patch file named 'estimator.patch'

I followed your setup instructions to the third step, patching keras.py, but I could not find that file. After entering the command, it showed: '**** Can't open patch file estimator.patch: No such file or directory'.

Control tags not in tokenizer

Curious how we can prepend control tags for conditional generation. I noticed the control tags are not in the tokenizer, so how could we input them to the model?
