
embeddings_reproduction's People

Contributors

cthoyt, yangkky


embeddings_reproduction's Issues

Did you compare your results with BioVec by Ehsaneddin Asgari?

Thank you so much for your great work.

I read a paper called "DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences" by Asgari, E., Poerner, N., McHardy, A., & Mofrad, M. (https://github.com/ehsanasgari/DeepPrime2Sec).
In the paper, the authors use five kinds of features to predict protein secondary structure from the primary sequence. The five features are:

  1. One-hot vector representation (length: 21) --- onehot: a vector at each position indicating which amino acid is present, with each index marking the presence or absence of that amino acid.
  2. ProtVec embedding (length: 50) --- protvec: a representation trained with a Skip-gram neural network on protein amino-acid sequences (ProtVec). The only difference is character-level training instead of n-gram-based training.
  3. Contextualized embedding (length: 300) --- elmo: the contextualized embedding of amino acids trained in the course of language modeling, known as ELMo, used as a new feature for the secondary structure task. The contextualized embedding is the concatenation of the hidden states of a deep bidirectional language model. The main difference between the ProtVec and ELMo embeddings is that the ProtVec embedding of a given amino acid or amino-acid k-mer is fixed and identical across sequences, whereas the contextualized embedding, as its name suggests, changes based on context. The ELMo embedding of amino acids is trained on the UniRef50 dataset with dimension 300.
  4. Position-Specific Scoring Matrix (PSSM) features (length: 21) --- pssm: amino-acid substitution scores calculated from a multiple sequence alignment of homologous sequences, for each position in the protein sequence.
  5. Biophysical features (length: 16) --- biophysical: for each amino acid, a normalized vector of its biophysical properties, e.g., flexibility, instability, surface accessibility, kd-hydrophobicity, and hydrophilicity.

However, the paper does not show how this feature extraction is done. I am not sure whether you compared your embeddings to this work.
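
For reference, feature 1 is straightforward to reproduce; here is a minimal sketch (the amino-acid alphabet and its ordering below are assumptions, not taken from the paper):

import numpy as np

AA = 'ACDEFGHIKLMNPQRSTVWYX'  # 20 canonical amino acids plus 'X' for unknowns (assumed ordering)
AA_IDX = {a: i for i, a in enumerate(AA)}

def one_hot(seq):
    # Length-21 indicator vector per residue, as in feature 1 above.
    out = np.zeros((len(seq), len(AA)))
    for j, a in enumerate(seq):
        out[j, AA_IDX.get(a, len(AA) - 1)] = 1.0  # unknown residues map to 'X'
    return out

print(one_hot('ACD').shape)  # (3, 21)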

By the way,
in my ML project I want to embed each protein as a vector and then use DL models for drug-protein interaction prediction. Do you have an example showing usage similar to RDKit, e.g.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)?
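
For comparison, a rough sketch of what the analogous call might look like with this library, assuming get_embeddings_new() takes a trained doc2vec model file as its first argument (the model file name below is hypothetical):

from embeddings_reproduction import embedding_tools

# 'original_5_7.pkl' is a hypothetical file name; use one of the doc2vec
# model files from http://cheme.caltech.edu/~kkyang/models/.
seqs = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ']
embeds = embedding_tools.get_embeddings_new('original_5_7.pkl', seqs, k=5, overlap=False)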

Many thanks!

Examples in README

As a user, it would be nice to have some examples directly in the README showing how to use this library. Two simple scenarios would personally benefit me (a sketch of the first follows the list):

  1. Load the embeddings and generate a hierarchical clustering (showing a plot, perhaps?)
  2. Load the embeddings and train a simple model. One example would be a target-prioritization approach - @ozlemmuslu would be able to help using the example from her master's thesis if we can see exactly how to make a dataframe out of the embeddings!
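
A minimal sketch of scenario 1, assuming the embeddings come as a pickled (X, terms) tuple like the input files used elsewhere in this repo's notebooks:

import pickle
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

with open('../inputs/localization_seq.pkl', 'rb') as f:
    X, terms = pickle.load(f)  # assuming X is an (n_proteins, dim) matrix and terms labels its rows

Z = linkage(X, method='average', metric='cosine')  # agglomerative clustering of the embeddings
dendrogram(Z, labels=list(terms))
plt.show()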

An error in visualize page

Hi,
when I run the whole script I get this error message:

in plot_ChRs()
      4 df = pd.read_csv('../inputs/localization.txt')
      5 with open('../inputs/localization_seq.pkl', 'rb') as f:
----> 6 X_1, terms = pickle.load(f)
      7 X_p = pd.read_csv('../inputs/localization_profet.tsv', delimiter='\t')
      8 X_p.index = X_p['name']

UnpicklingError: invalid load key, 'v'

UnpicklingError: invalid load key, 'v'.

Hello,
many thanks for sharing this repository.

When I run test_predictions, I get the following errors:

UnpicklingError                          Traceback (most recent call last)
in
      1 with open('../inputs/X_aaindex_64_cosine.pkl', 'rb') as f:
----> 2 X_aa = pickle.load(f)

UnpicklingError: invalid load key, 'v'.

and also:

UnpicklingError                          Traceback (most recent call last)
in
     13 # Sequence and structure
     14 with open('../inputs/T50_seq_struct.pkl', 'rb') as f:
---> 15 X, _ = pickle.load(f)
     16 evals, mu = evaluate(df_train, df_test, X, y_col, 'seq_struc', guesses=(1, 100))
     17 res = pd.concat((res, evals), ignore_index=True)

UnpicklingError: invalid load key, 'v'.

My versions:

print(np.__version__)
1.18.5
print(pd.__version__)
1.1.0

Or could the pkl files be corrupted?

thanks,
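
Not an official answer, but one common cause of "invalid load key, 'v'" is that the .pkl files were downloaded as Git LFS pointer files (small text files beginning with "version https://git-lfs...") rather than the real binaries. A quick check:

# A genuine pickle starts with b'\x80'; an LFS pointer starts with
# b'version https://git-lfs'. In the latter case, fetch the real files
# with `git lfs pull` or download them directly.
with open('../inputs/X_aaindex_64_cosine.pkl', 'rb') as f:
    print(f.read(64))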

User assistance

Hi, I was really interested in your paper, but this repository isn't very user friendly. It would be wonderful to add a setup.py so it can be installed with pip, along with some documentation for users on how to access the embeddings.

I would be happy to send a PR for the setup.py; then we could discuss further on the PR.
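
For concreteness, a minimal sketch of such a setup.py (the version number and dependency list below are assumptions for the maintainers to correct):

from setuptools import setup, find_packages

setup(
    name='embeddings_reproduction',
    version='0.0.1',  # placeholder version
    packages=find_packages(),
    install_requires=['gensim', 'numpy', 'pandas'],  # assumed dependencies
)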

Separate code from data

I'm 7 GB+ into trying to clone this repo and my computer is incredibly upset. I would suggest moving all of the data into a separate repository so the code can be downloaded and used independently.

'Doc2Vec' object has no attribute 'running_training_loss'

from embeddings_reproduction import embedding_tools
embeds = embedding_tools.get_embeddings_new(['ABCFFFFFFFFFFFF', 'EFGHQWERRTTUIIO'], seqs, k=5, overlap=False)

I get the following error:

'Doc2Vec' object has no attribute 'running_training_loss'
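
Not confirmed by the maintainers, but this AttributeError typically appears when a Doc2Vec model saved under an older gensim is loaded by a newer gensim that expects attributes the old save lacks. A sketch for checking:

import gensim

# If this prints a much newer version than the one that trained the models,
# try pinning an older release, e.g. pip install "gensim<4" (the exact
# version to pin is an assumption; match whatever produced the saved models).
print(gensim.__version__)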

Which model to use for computing new embeddings?

As some users have noted in other issues, it is unclear how to use the final model to generate embeddings for a new set of protein sequences. I have identified the files located at http://cheme.caltech.edu/~kkyang/models/ and found the script embedding_tools.py, from which I suppose the function get_embeddings_new() is the relevant one. But which doc2vec file should I use to compute embeddings for my set of sequences? Which one is the "final" one?

As previously noted, if a minimal example of this were included in the main README, I am sure it would enable many more users to benefit from your work.
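
In the meantime, a sketch for inspecting a downloaded model's hyperparameters with gensim, which may help in choosing among the files (the file name is hypothetical, and the attribute names assume a reasonably recent gensim):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('original_5_7.pkl')  # hypothetical file from the models directory above
print(model.vector_size, model.window)    # embedding dimension and context window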

Error in train_docvec_models.ipynb

I tried to recreate the original doc2vec models in train_docvec_models.ipynb but ran into the following error at model.build_vocab(documents) when using merge=True in the kmer_hypers:

TypeError: unhashable type: 'list'

Do you have any suggestions? Thanks!
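
A diagnostic sketch, under the assumption that the error comes from a list-valued word or tag inside the documents passed to build_vocab (gensim requires every word and tag to be hashable):

# `documents` is assumed to be the list of TaggedDocuments from the notebook.
for doc in documents:
    bad = [t for t in list(doc.words) + list(doc.tags) if isinstance(t, list)]
    if bad:
        print('non-hashable tokens found:', bad[:3])
        break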
