Model Name Predict bioactivity against Main Protease of SARS-CoV-2

Comments (25)

GemmaTuron commented on July 22, 2024 1

Hi @HarmonySosa

Good start, a couple comments before we approve the request:
- I think the tags will not work, as they are not from the approved list (in GitBook). They are Python-based, so the strings need to match.
- Slugs also have a word limit; something like mpro-covid19 would be better.

Can you modify those fields before we approve the model?

from ersilia.

HarmonySosa commented on July 22, 2024 1

Just to document an inconsistency I encountered when I ran the web app (https://share.streamlit.io/nadimfrds/mpropred/Mpropred_app.py) -
When I use a list of molecules as input, I get a different result than when I run each molecule in the list individually. I believe this is because the result values are put in a random order while the molecule names are kept in the original order, so the results are sometimes mapped to different values.
Here is an example:

DB14761 gives a different result when I run it individually compared to when I run it with other molecules. The expected result of 5.7078 is mapped to a different molecule instead.

I got around this by reindexing the descriptors output to match the original input order, but I want to leave a note here in case the issue needs to be referenced again.
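
A minimal sketch of that kind of reindexing, assuming the descriptor table has a `Name` column holding the molecule identifier. The helper name `reorder_rows` and the sample rows are made up for illustration and are not taken from the model code:

```python
def reorder_rows(rows, input_order, key="Name"):
    """Return rows rearranged to follow input_order, matching on the key column."""
    by_name = {row[key]: row for row in rows}
    missing = [name for name in input_order if name not in by_name]
    if missing:
        raise KeyError(f"no descriptor row for: {missing}")
    return [by_name[name] for name in input_order]


# Hypothetical descriptor rows, returned in a different order than the input
rows = [
    {"Name": "5280445", "MACCSFP1": 0},
    {"Name": "DB14761", "MACCSFP1": 1},
]
ordered = reorder_rows(rows, ["DB14761", "5280445"])
print([row["Name"] for row in ordered])
```

Failing loudly on a missing name is a deliberate choice here: a silent gap would shift every later prediction onto the wrong molecule.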

GemmaTuron commented on July 22, 2024 1

Hi Harmony,

The output is handled on the Ersilia side. From the code I shared, you will see that you only need to pass the list of outputs (in this case, the pIC50 values) in the same order as the molecules were input. It will be written to a CSV file that Ersilia then parses.

GemmaTuron commented on July 22, 2024

/approve

github-actions commented on July 22, 2024

New Model Repository Created! 🎉

@HarmonySosa ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos3nn9

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Adding source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa !

This model uses PADEL descriptors to calculate MACCS fingerprints. In our experience, the PADEL package is not very well integrated with Python and can cause problems.
Can you check whether the MACCS fingerprints we obtain with RDKit (MACCS keys) are the same as the ones obtained with the MPro Predictor?

It should be something like this (writing from memory, so double-check the function names):

from rdkit import Chem
from rdkit.Chem import MACCSkeys
maccskeys = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smi)) for smi in smiles_list]

This will allow us to modify the calculate descriptors function:

# Molecular descriptor calculator option
def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')

HarmonySosa commented on July 22, 2024

Hi @GemmaTuron!

Here are the desc_calc and build_model functions using PADEL:

def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')

def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) #  06.25.24 broken up to handle dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)

These are the results I get when I run the model with PADEL:

This is how I have been trying to use RDKit, but I get different results:

# Convert RDKit bit vector to a list of ints

def bitvector_to_list(bitvector):
    return [int(bit) for bit in bitvector]

def calculate_maccs_keys(smiles_list):
    maccs_keys = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            maccs_key = MACCSkeys.GenMACCSKeys(mol)
            maccs_keys.append(bitvector_to_list(maccs_key))
        else:
            maccs_keys.append([0]*167)  # MACCS keys are 167 bits long
    return maccs_keys

def desc_calc(smiles_list, output_file='descriptors_output.csv'):
    # Calculate MACCS fingerprints using RDKit
    maccs_keys = calculate_maccs_keys(smiles_list)
    
    # Create a DataFrame and name columns appropriately, save to CSV
    df = pd.DataFrame(maccs_keys, columns=[f'MACCSFP{i}' for i in range(167)])
    df.to_csv(output_file, index=False)
    return df

# Model building section
def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) #  06.25.24 dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)

These are the results I get when I use RDKit:

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa
I am trying to reproduce the results, but I cannot get the SMILES from the molecule_name alone. Where did you get the molecules from?

GemmaTuron commented on July 22, 2024

Hmm, in any case it does seem that the RDKit implementation and the PADEL descriptors differ slightly, which is quite surprising, as the MACCS keys are essentially just substructure searches.
It must be due to differences between the preprocessing PADEL does and the preprocessing RDKit does. We can go ahead and use PADEL in the model; I guess it should work if we just keep the folder there.
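
To see which keys actually disagree for a given molecule, a small helper along these lines could be used. It is a sketch under two assumptions that should be checked against real output: RDKit's `GenMACCSKeys` returns 167 bits with bit 0 unused, and the PADEL CSV has 166 fingerprint columns for keys 1 to 166. The name `diff_maccs` is hypothetical:

```python
def diff_maccs(rdkit_bits, padel_bits):
    """Return the 1-based MACCS key numbers where the two fingerprints differ.

    rdkit_bits: 167 ints (index 0 assumed to be the unused placeholder bit).
    padel_bits: 166 ints (assumed to correspond to keys 1..166).
    """
    if len(rdkit_bits) != 167 or len(padel_bits) != 166:
        raise ValueError("expected 167 RDKit bits and 166 PADEL bits")
    aligned = rdkit_bits[1:]  # drop the placeholder so both start at key 1
    return [i + 1 for i, (a, b) in enumerate(zip(aligned, padel_bits)) if a != b]


# Toy vectors that disagree only on key 42
rdkit_bits = [0] * 167
padel_bits = [0] * 166
padel_bits[41] = 1
print(diff_maccs(rdkit_bits, padel_bits))
```

If the length assumption holds, feeding all 167 RDKit columns to a model trained on 166 PADEL columns would by itself change the predictions, independent of any preprocessing difference.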

HarmonySosa commented on July 22, 2024

Overall focus: Trying to get my code to run in the Ersilia framework

I transferred the code that I used to run the model into the Ersilia framework and I updated the environment, but I need to modify the code more to fit Ersilia. I am not sure how to use the input and output format in the main.py file, so I have been manually testing the code on an input file that I have, and I am now getting an index error "list index out of range" in the line input_order = [line.split()[1] for line in f] in desc_calc.

This is my desc_calc function:

def desc_calc():
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv" # original bash command line
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()

    # Read original input file
    with open("molecule.smi", "r") as f:
        input_order = [line.split()[1] for line in f]

    descriptors_df = pd.read_csv("descriptors_output.csv")
    sorted_descriptors_df = descriptors_df.set_index('Name').reindex(input_order).reset_index() # sort output based on input to get correct ordering

    # Save the sorted output to a new file
    sorted_descriptors_df.to_csv("sorted_descriptors_output.csv", index=False)

This is the input I ran it on as a text file:
[C@]12(C@@(C@H C@@Hc(cc(C)cc3O)c34)[C@@]4(C5)[C@@h]5CC2)C CMNPD20798
CCC(CC)COC(=O)C@HNP@(OC[C@H]1OC@(C@H[C@@h]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1 DB14761
C1=CC(=C(C=C1C2=CC(=O)C3=C(C=C(C=C3O2)O)O)O)O 5280445
C1=C(C=C(C(=C1O)O)O)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O 4444991
C[C@@]12C(=C(O[C@@h]1O)C(=O)C(=CC@(C)CC3)[C@]34O)[C@]4(C)C@HCC2 CMNPD20802

I have not changed molecule.smi, which is based on the input, from my code that successfully runs the model outside of the Ersilia framework.
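
For reference, `line.split()[1]` raises `IndexError: list index out of range` on any line with fewer than two whitespace-separated fields, such as a trailing blank line or a SMILES with no name after it. A defensive sketch (the helper name `read_names` is hypothetical) that skips blank lines and reports the offending line:

```python
def read_names(lines):
    """Collect the second field (molecule name) from 'SMILES name' lines."""
    names = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            raise ValueError(f"line {lineno} has no molecule name: {line!r}")
        names.append(fields[1])
    return names


sample = ["CCO ethanol\n", "\n", "c1ccccc1 benzene\n"]
print(read_names(sample))
```

The same function accepts an open file object, so it can be called on `molecule.smi` inside a `with open(...)` block.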

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa !

Thanks for the explanation. A few considerations:
If I run the following code with the example you suggest I get a nice output, I cannot reproduce your error:

with open("molecule.smi", "r") as f:
    input_order = [line.split()[1] for line in f]
print(input_order)

Output:

(base) gturon@pujarnol:~/github/ersilia-os$ python test.py
['CMNPD20798', 'DB14761', '5280445', '4444991', 'CMNPD20802']

If I understand it correctly, you are trying to get the molecule names so that you can map them onto the descriptors later, right? That is a good idea, but it is difficult to implement within Ersilia because Ersilia preprocesses the inputs.
The code that gets you a SMILES list (in the right order) is:

# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    smiles_list = [r[0] for r in reader]

So we need to work from that list (which only has SMILES). If you think the only way is to assign an ID to each SMILES and then re-map it, you can take this list and create a dataframe with an ID column that you generate, something like:
ids = [f"smi_{x}" for x in range(len(smiles_list))]

HarmonySosa commented on July 22, 2024

Thank you, Gemma! I was able to resolve my errors so I can run my input and see the pIC50 values in the result, but I am not sure that I understand how to use the Smiles list with the preprocessed inputs that you mentioned, so I likely still need to make adjustments to fit Ersilia. Can you explain a little more what you mean by creating the ids dataframe? This is the output that I currently get when I use Padel and print my dataframe:
  molecule_name     pIC50
0       DB14761  5.707753
1       5280445  5.194536
2       4444991  5.194536
3    CMNPD20802  5.095158

The output matches what I got when I ran the model locally, but I am not sure if this fits Ersilia properly.

GemmaTuron commented on July 22, 2024

When you use Ersilia, you will not be able to retain the ID column. As you can see from the code snippet I shared, the input file only provides a list of SMILES. So, if you need the IDs to make sure the rearrangement is correct, you will need to create them yourself.
Try the model within Ersilia and you will understand better. Remember to use the repo_path flag.

GemmaTuron commented on July 22, 2024

@HarmonySosa

Can you update your code so I can see what you are writing to help you out? I see you have not yet pushed to github: https://github.com/HarmonySosa/eos3nn9

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa

Just to confirm next steps:

- Contact the authors to let them know that running the same molecules several times gives different outputs because the descriptors are rearranged across molecules.
- Understand whether the rearrangement is due to the PADEL descriptors or only to the Streamlit implementation (that would be ideal, but I think the problem is with the PADEL descriptors themselves, as you point out). Can you confirm that you still observe this effect when running the code outside Streamlit?
- Define a way to make sure the results are output in the original order (you already did that, actually). If this is based on using an ID per SMILES, you will need to do something like the following in the main.py file for Ersilia:

# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    smiles_list = [r[0] for r in reader]
ids = [f"smi_{x}" for x in range(len(smiles_list))]
df = pd.DataFrame({"id": ids, "smiles": smiles_list})

# here function to convert the smiles to descriptors

# here function to re-arrange descriptors ensuring order, you can use the id column we have just created in the dataframe

# here function to obtain the predictions

# and if we are 100% sure the predictions are in the right order corresponding to the molecules in smiles_list, and that the predictions are in a list format:
# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])  # header, change as needed
    for o in outputs:
        writer.writerow([o])

Something like this should work!
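
For illustration, here is a self-contained toy version of that flow. The descriptor step is replaced by a shuffle that stands in for output arriving in arbitrary order, and the "model" is just the SMILES length; `run`, `predict` and the file names are placeholders, not the real main.py:

```python
import csv
import os
import random
import tempfile


def run(input_file, output_file, predict):
    """Read SMILES, predict, and write one value per molecule in input order."""
    # read SMILES from .csv file, assuming one column with header
    with open(input_file, "r") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        smiles_list = [r[0] for r in reader]

    # generate one id per SMILES so rows can be traced back to their input
    ids = [f"smi_{i}" for i in range(len(smiles_list))]

    # stand-in for a descriptor step that returns rows in arbitrary order
    rows = list(zip(ids, smiles_list))
    random.shuffle(rows)

    # restore the input order with the generated ids, then predict
    by_id = dict(rows)
    outputs = [predict(by_id[i]) for i in ids]

    # write output in a .csv file
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])  # header, change as needed
        for o in outputs:
            writer.writerow([o])
    return outputs


# Toy run: three SMILES in, three values out, same order
tmp = tempfile.mkdtemp(prefix="ersilia-")
input_file = os.path.join(tmp, "input.csv")
output_file = os.path.join(tmp, "output.csv")
with open(input_file, "w", newline="") as f:
    csv.writer(f).writerows([["smiles"], ["CCO"], ["c1ccccc1"], ["C"]])
outputs = run(input_file, output_file, predict=len)
print(outputs)
```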

HarmonySosa commented on July 22, 2024

Hi Gemma, thank you for the clarification! In response to the next steps,

- I contacted the authors after our meeting a few days ago, but I am still waiting for a response.
- I verified that the rearrangement is due to the PADEL descriptors themselves.
- I am printing the SMILES in the proper order, but I am not following the Ersilia structure; this is what I am currently working on.

I pushed the code so you can hopefully see it. The code currently doesn’t use the proper format you mentioned, but I was just trying to get code that worked and gave me an output that matched. Now, I am trying to adjust the code based on your suggestions to fit the Ersilia format.
Just to clarify, what is the desired output that I should have? Currently, my output is a printed table of the pIC50 values with the molecule name. Is this the proper format?
Thanks for your help!

HarmonySosa commented on July 22, 2024

Hi! I changed main.py to use absolute paths and adjusted it so that it gives the desired pIC50 outputs when I use an input in a single column. I am trying to see if I can transfer all of the files that I have been using to the GitHub model. I have the Mpro pkl file in the checkpoints folder on my machine, and I also use the PaDEL-Descriptor folder, but I do not see these in the GitHub. Is there a way to add these files?

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa

If I understand correctly, the necessary files are currently somewhere outside your GitHub folder? Or are they in your local GitHub folder but do not appear online when you push to your fork?
If they are in your local GitHub folder, you simply need to do a git push. Check whether any error message appears (e.g., files too large).
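
If file size turns out to be the issue, a quick way to find the offenders before pushing is to walk the folder and list anything above a threshold. This is only a sketch: the helper name `large_files` is made up, and the default limit reflects GitHub's 100 MB per-file block:

```python
import os


def large_files(root, limit_bytes=100 * 1024 * 1024):
    """Yield (path, size) for files under root that are larger than limit_bytes."""
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip git internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                yield path, size


# List oversized files in the current folder
for path, size in large_files("."):
    print(f"{path}: {size / 1e6:.1f} MB")
```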

HarmonySosa commented on July 22, 2024

Hi @GemmaTuron , I had the files on my Ubuntu machine but not on github because I kept getting errors when I tried to push them, including an error about the file size, but I think I have managed to get the PaDEL Descriptor folder in the github. I am just resolving conflicts between my Ubuntu and remote repos so I can add the remaining files, so I think it is figured out now!

HarmonySosa commented on July 22, 2024

Hi, I updated main.py to get rid of the filedownload function, as per Miquel’s suggestion, so main.py seems alright for now. Is it possible to verify the next steps in incorporating the model?

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa

The code looks good, well done. A few suggestions:

- Move the PaDEL folder inside framework and update the path to it in main.py.
- The temporary folder is not actually a temporary folder; it is safer to create a real one with tempfile.mkdtemp(prefix="ersilia-"). We like to use the prefix ersilia to ensure they will all eventually be removed. You can clean it up with something like shutil.rmtree.

Once this is done, you simply have to make sure all the other files (particularly metadata.json) are properly completed, and try to fetch the model locally inside Ersilia using the repo_path flag. If that works, you should be ready to open a PR and merge the code!
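
A minimal sketch of the temporary folder pattern mentioned above, using only the standard library. The file name and its contents are placeholders, and the descriptor step is left as a comment:

```python
import os
import shutil
import tempfile

# Create a real temporary folder with the ersilia- prefix
tmp_folder = tempfile.mkdtemp(prefix="ersilia-")
try:
    smi_path = os.path.join(tmp_folder, "molecule.smi")
    with open(smi_path, "w") as f:
        f.write("CCO mol_0\n")
    # ... run the descriptor calculation against tmp_folder here ...
    created = os.path.exists(smi_path)
finally:
    shutil.rmtree(tmp_folder)  # remove the folder and everything in it
```

Wrapping the work in `try`/`finally` means the folder is removed even if the descriptor step raises.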

HarmonySosa commented on July 22, 2024

Hi @GemmaTuron, I made the changes you suggested and verified that I have the most recent version of Ersilia. I have been working through the fetch errors when I use the repo_path flag and this is what I have now:
eos3nn9_fetch_error_180724.txt

I am going to go through the Ersilia troubleshooting guide again in case I missed anything, but I would appreciate it if you have any ideas. Thanks!

GemmaTuron commented on July 22, 2024

Hi @HarmonySosa

From this line: /bin/sh: 1: bentoml: not found I suspect something might not be working properly with your Ersilia installation.

- Have you made sure the model works using bash run.sh first?
- Is it an Ubuntu, Mac or Windows machine? Remember that Ersilia does not work on Windows.
- Did you figure out the problems you mentioned in the eos1pu1 issue?
- Can you confirm Ersilia works fine by testing another model?

Remember in CodeSpaces you can use it as your own system, so you can open a codespace from Ersilia, git clone locally the repository you have forked (with all files) and test it there (still using the repo path, or the run.sh file)
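
As a quick check for the `bentoml: not found` message, you could verify whether the executable is visible on the PATH of the environment you are running from. This is a generic standard-library sketch, not an Ersilia command, and `find_executable` is a hypothetical helper name:

```python
import shutil
import sys


def find_executable(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


path = find_executable("bentoml")
if path is None:
    print("bentoml is not on PATH for", sys.executable)
else:
    print("bentoml found at", path)
```

If it prints that bentoml is missing while the Ersilia environment is active, the installation is the likely culprit rather than the model code.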

HarmonySosa commented on July 22, 2024

Hi @GemmaTuron, thank you for the help! I have tested that the model works with the run.sh command. I am using Ubuntu on a Windows machine, but I haven't been able to successfully fetch another model, including eos1pu1. I am trying to use CodeSpaces now to see if it works, and I will try reinstalling Ersilia again if it doesn't work.

HarmonySosa commented on July 22, 2024

Hi @GemmaTuron ,
I tried using the CodeSpace but ended up reinstalling Ersilia. I am running Ubuntu and I now get this error:
err.txt

I can access the url to get:

As such, I know it is loading this far to be able to see the url, but it is failing on the join. The error suggests a path problem, but I have been testing that and I do not see where it is. Do you have any suggestions?
Thanks!

🦠 Model Request: Predict bioactivity against Main Protease of SARS-CoV-2 about ersilia HOT 25 OPEN
