Comments (25)
Hi @HarmonySosa
Good start, a couple comments before we approve the request:
I think the tags will not work as they are not from the approved list (in GitBook) - they are python-based so strings need to match
Slugs also have a word limit, something like mpro-covid19 would be better
Can you modify those fields before we approve the model?
from ersilia.
Just to document an inconsistency I encountered when I ran the web app (https://share.streamlit.io/nadimfrds/mpropred/Mpropred_app.py) -
When I use a list of molecules as input, I get a different result than when I run each molecule in the list individually. I believe this is because the result values are put in a random order while the molecule names are kept in the original order, so the results are sometimes mapped to different values.
Here is an example:
DB14761 gives a different result when I run it individually compared to when I run it with other molecules. The expected result of 5.7078 is mapped to a different molecule instead.
I got around this by reindexing the descriptors output to match the original input order, but I want to leave a note here in case the issue needs to be referenced again.
from ersilia.
Hi Harmony,
The output is handled on the Ersilia side. From the code I shared, you will see that you only need to pass the list of outputs (in this case, the pIC50) in the same order as the molecules were input. It will be written to a CSV file that Ersilia then parses.
from ersilia.
/approve
from ersilia.
New Model Repository Created!
@HarmonySosa ersilia model repository has been successfully created and is available at:
ersilia-os/eos3nn9
Next Steps
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
- Get started by creating a fork of your new model repository - docs
- Clone your forked repository - docs
- Make edits to your new forked model repository - docs - Edits might include:
  - Updating the README.md file to accurately describe your model
  - Adding source code for your model
  - Adding documentation for your model
- Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs
Additional Resources
If you have any questions, please feel free to open an issue and get support from the community!
from ersilia.
Hi @HarmonySosa !
This model uses PaDEL descriptors to calculate MACCS fingerprints. In our experience, the PaDEL package is not well integrated with Python and can cause problems.
Can you check whether the MACCS fingerprints we obtain with RDKit (MACCS keys) are the same as the ones we obtain with the MPro Predictor?
It should be something like this (a rough sketch, please double-check the exact call):
from rdkit import Chem
from rdkit.Chem import MACCSkeys
maccskeys = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smi)) for smi in smiles_list]
This will allow us to modify the calculate descriptors function:
# Molecular descriptor calculator option
def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')
from ersilia.
Hi @GemmaTuron!
Here are the desc_calc and build_model functions using PADEL:
def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')
def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) # 06.25.24 broken up to handle dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)
These are the results I get when I run the model with PADEL:
This is how I have been trying to use RDKit, but I get different results:
from rdkit import Chem
from rdkit.Chem import MACCSkeys
import pandas as pd
# Convert RDKit bit vector to a list of ints
def bitvector_to_list(bitvector):
    return [int(bit) for bit in bitvector]
def calculate_maccs_keys(smiles_list):
    maccs_keys = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            maccs_key = MACCSkeys.GenMACCSKeys(mol)
            maccs_keys.append(bitvector_to_list(maccs_key))
        else:
            # SMILES could not be parsed: fall back to an all-zero fingerprint
            maccs_keys.append([0]*167) # MACCS keys are 167 bits long
    return maccs_keys
def desc_calc(smiles_list, output_file='descriptors_output.csv'):
    # Calculate MACCS fingerprints using RDKit
    maccs_keys = calculate_maccs_keys(smiles_list)
    # Create a DataFrame and name columns appropriately, save to CSV
    df = pd.DataFrame(maccs_keys, columns=[f'MACCSFP{i}' for i in range(167)])
    df.to_csv(output_file, index=False)
    return df
# Model building section
def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) # 06.25.24 dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)
These are the results I get when I use RDKit:
from ersilia.
Hi @HarmonySosa
I am trying to reproduce the results, but with just the molecule_name I cannot get the SMILES. Where did you get the molecules from?
from ersilia.
Hmm, in any case it does seem that the RDKit implementation and the PaDEL descriptors differ slightly, which is quite surprising, as the MACCS keys are essentially substructure searches.
It must be due to differences between the preprocessing that PaDEL does and the preprocessing that RDKit does. We can go ahead and use PaDEL in the model; I guess it should work if we just keep the folder there.
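To pin down where the two fingerprint sets disagree, a small comparison can help. The following is a minimal sketch, not code from the Mpro predictor: it assumes both tools' fingerprints have been saved as CSV tables that share a Name column and use the same column labels (the RDKit desc_calc above writes no Name column, so one would have to be added). The function names and the bit values are made up for illustration.

```python
import csv
import io


def read_fingerprints(handle, name_column="Name"):
    """Return {molecule name: {column label: bit}} from an open CSV handle."""
    table = {}
    for row in csv.DictReader(handle):
        name = row.pop(name_column)
        table[name] = {col: int(float(val)) for col, val in row.items()}
    return table


def diff_fingerprints(fp_a, fp_b):
    """Return {name: [columns whose bits differ]} for shared molecules and columns."""
    report = {}
    for name in fp_a.keys() & fp_b.keys():
        shared = fp_a[name].keys() & fp_b[name].keys()
        changed = sorted(c for c in shared if fp_a[name][c] != fp_b[name][c])
        if changed:
            report[name] = changed
    return report


# In-memory stand-ins for the two CSV files (illustrative bits, not real output).
padel_csv = io.StringIO("Name,MACCSFP1,MACCSFP2,MACCSFP3\nmolA,0,1,1\nmolB,1,0,0\n")
rdkit_csv = io.StringIO("Name,MACCSFP1,MACCSFP2,MACCSFP3\nmolA,0,1,0\nmolB,1,0,0\n")

differences = diff_fingerprints(read_fingerprints(padel_csv), read_fingerprints(rdkit_csv))
print(differences)
```

Matching by column label rather than by position also guards against an off-by-one between the two tools' column numbering, which may be worth checking given that the RDKit vector is 167 bits long.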
from ersilia.
Overall focus: Trying to get my code to run in the Ersilia framework
I transferred the code that I used to run the model into the Ersilia framework and I updated the environment, but I need to modify the code more to fit Ersilia. I am not sure how to use the input and output format in the main.py file, so I have been manually testing the code on an input file that I have, and I am now getting an index error "list index out of range" in the line input_order = [line.split()[1] for line in f] in desc_calc.
This is my desc_calc function:
def desc_calc():
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv" # original bash command line
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    # Read original input file
    with open("molecule.smi", "r") as f:
        input_order = [line.split()[1] for line in f]
    descriptors_df = pd.read_csv("descriptors_output.csv")
    sorted_descriptors_df = descriptors_df.set_index('Name').reindex(input_order).reset_index() # sort output based on input to get correct ordering
    # Save the sorted output to a new file
    sorted_descriptors_df.to_csv("sorted_descriptors_output.csv", index=False)
This is the input I ran it on as a text file:
[C@]12(C@@(C@HC@@Hc(cc(C)cc3O)c34)[C@@]4(C5)[C@@h]5CC2)C CMNPD20798
CCC(CC)COC(=O)C@HNP@(OC[C@H]1OC@(C@H[C@@h]1O)C1=CC=C2N1N=CN=C2N)OC1=CC=CC=C1 DB14761
C1=CC(=C(C=C1C2=CC(=O)C3=C(C=C(C=C3O2)O)O)O)O 5280445
C1=C(C=C(C(=C1O)O)O)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O 4444991
C[C@@]12C(=C(O[C@@h]1O)C(=O)C(=CC@(C)CC3)[C@]34O)[C@]4(C)C@HCC2 CMNPD20802
I have not changed molecule.smi, which is based on the input, from my code that successfully runs the model outside of the Ersilia framework.
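For reference, line.split()[1] raises "list index out of range" whenever a line splits into fewer than two fields, for example a blank trailing line or a line that holds only a SMILES string. Whether that is the cause here is an assumption. A minimal sketch of a more defensive reader (read_names is a hypothetical helper, not part of the original code) could look like this:

```python
def read_names(lines):
    """Return molecule names from '<SMILES> <name>' lines, in input order.

    Blank lines are skipped; a line without a name raises a clear error
    instead of a bare IndexError.
    """
    names = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank line, e.g. at the end of the file
        if len(fields) < 2:
            raise ValueError(f"line {number} has no molecule name: {line!r}")
        names.append(fields[1])
    return names


# Example lines (made up); an open file object can be passed the same way.
sample = ["CCO ethanol\n", "\n", "c1ccccc1 benzene\n"]
print(read_names(sample))
```

The error message reports the offending line number, which makes it easier to see which row of molecule.smi is malformed.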
from ersilia.
Hi @HarmonySosa!
Thanks for the explanation. A few considerations:
If I run the following code with the example you suggest, I get the expected output, so I cannot reproduce your error:
with open("molecule.smi", "r") as f:
    input_order = [line.split()[1] for line in f]
print(input_order)
Output:
(base) gturon@pujarnol:~/github/ersilia-os$ python test.py
['CMNPD20798', 'DB14761', '5280445', '4444991', 'CMNPD20802']
If I understand correctly, you are trying to get the molecule names so that you can map them onto the descriptors later, right? That is a good idea, but it is difficult to implement within Ersilia because Ersilia preprocesses the inputs.
The code that gets you a SMILES list (in the right order) is:
# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[0] for r in reader]
So we need to work from that list (which only has SMILES). If you think the only way is to assign an id to each SMILES and then re-map it, you can take this SMILES list and create a dataframe with an id column that you generate, something like
ids = [f"smi_{x}" for x in range(len(smiles_list))]
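As a rough illustration of the re-mapping idea, here is a minimal sketch. It assumes the generated id travels with each molecule through the descriptor step (for PaDEL, that would mean writing it as the name in molecule.smi so it comes back in the Name column). The shuffled rows and the numeric values are placeholders, not model outputs.

```python
import random

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]  # example input, in original order

# One generated id per input row.
ids = [f"smi_{x}" for x in range(len(smiles_list))]

# Stand-in for a descriptor step that returns rows in arbitrary order.
# Each row is (id, value); the values are placeholders.
rows = [(mol_id, float(i)) for i, mol_id in enumerate(ids)]
random.Random(0).shuffle(rows)

# Re-map by id so the results line up with smiles_list again.
by_id = dict(rows)
ordered = [by_id[mol_id] for mol_id in ids]
print(ordered)
```

Because the lookup is keyed on the id rather than on row position, the final list follows the input order no matter how the intermediate rows were arranged.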
from ersilia.
Thank you, Gemma! I was able to resolve my errors, so I can run my input and see the pIC50 values in the result. However, I am not sure I understand how to use the SMILES list with the preprocessed inputs that you mentioned, so I likely still need to make adjustments to fit Ersilia. Can you explain a little more what you mean by creating the ids dataframe? This is the output that I currently get when I use PaDEL and print my dataframe:
molecule_name pIC50
0 DB14761 5.707753
1 5280445 5.194536
2 4444991 5.194536
3 CMNPD20802 5.095158
The output matches what I got when I ran the model locally, but I am not sure if this fits Ersilia properly.
from ersilia.
When you use Ersilia, you will not be able to retain the ID column. As you can see from the code snippet I shared, the input file will only provide you with a list of SMILES. So, if you need the ids to make sure the rearrangement is correct, you will need to create them.
Try the model within Ersilia and you'll understand better. Remember to use the repo_path flag.
from ersilia.
Can you update your code so I can see what you are writing to help you out? I see you have not yet pushed to github: https://github.com/HarmonySosa/eos3nn9
from ersilia.
Hi @HarmonySosa
Just to confirm next steps:
- Contact the authors to let them know that running the same molecules several times gives different outputs because the descriptors are rearranged across molecules
- Understand whether the rearrangement is due to the PaDEL descriptors or only to the Streamlit implementation (that would be ideal, but I think the problem is with the PaDEL descriptors themselves, as you point out). Can you confirm that you still observe this effect when running the code outside Streamlit?
- Define a way to make sure the results are output in the original order (you actually did that already). If this is based on using an ID per SMILES, you will need to do something like the following in the main.py file for Ersilia:
import csv
import pandas as pd
# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[0] for r in reader]
ids = [f"smi_{x}" for x in range(len(smiles_list))]
df = pd.DataFrame({"id": ids, "smiles": smiles_list})
# here function to convert the smiles to descriptors
# here function to re-arrange descriptors ensuring order, you can use the id column we have just created in the dataframe
# here function to obtain the predictions
# and if we are 100% sure the predictions are in the right order corresponding to the molecules in smiles_list, and that the predictions are in a list format:
# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"]) # header, change as needed
    for o in outputs:
        writer.writerow([o])
Something like this should work!
from ersilia.
Hi Gemma, thank you for the clarification! In response to the next steps,
- I contacted the authors after our meeting a few days ago, but I am still waiting for a response.
- I verified that the rearrangement is due to the PADEL descriptors themselves
- I am printing the SMILES in the proper order, but I am not following the Ersilia structure; this is what I am currently working on
I pushed the code so you can hopefully see it. The code currently doesn't use the proper format you mentioned, but I was just trying to get code that worked and gave me an output that matched. Now, I am trying to adjust the code based on your suggestions to fit the Ersilia format.
Just to clarify, what is the desired output that I should have? Currently, my output is a printed table of the pIC50 values with the molecule name. Is this the proper format?
Thanks for your help!
from ersilia.
Hi! I changed main.py to use absolute paths and adjusted it so that it gives the desired pIC50 outputs when I use an input in a single column. I am trying to see if I can transfer all of the files that I have been using to the GitHub model. I have the Mpro pkl file in the checkpoints folder on my machine, and I also use the PaDEL-Descriptor folder, but I do not see these in the GitHub. Is there a way to add these files?
from ersilia.
Hi @HarmonySosa
If I understand correctly, you now have the necessary files somewhere outside your GitHub folder? Or are they in your local GitHub folder but do not appear online when you push to your fork?
If they are in your local GitHub folder, you simply need to do a git push. Check whether any error message appears (e.g., files too large).
from ersilia.
Hi @GemmaTuron , I had the files on my Ubuntu machine but not on github because I kept getting errors when I tried to push them, including an error about the file size, but I think I have managed to get the PaDEL Descriptor folder in the github. I am just resolving conflicts between my Ubuntu and remote repos so I can add the remaining files, so I think it is figured out now!
from ersilia.
Hi, I updated main.py to get rid of the filedownload function, as per Miquel's suggestion, so main.py seems alright for now. Is it possible to verify the next steps in incorporating the model?
from ersilia.
Hi @HarmonySosa
The code looks good, well done. A few suggestions:
- Move the PaDEL folder inside framework and amend the path to it in main.py
- The temporary folder is not actually a temporary folder; it is safer to create a real one with tempfile.mkdtemp(prefix="ersilia-"). We like to use the prefix ersilia to ensure these folders will all eventually be removed. You can delete it afterwards with something like shutil.rmtree
Once this is done, you simply have to make sure all the other files (particularly metadata.json) are properly completed, and try to fetch the model locally inside Ersilia using the repo_path flag. If that works you should be ready to open a PR and merge the code!
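A minimal sketch of the temporary-folder pattern mentioned above, using only the standard library. The file name and its contents are placeholders; the real main.py would write its own intermediate files there.

```python
import os
import shutil
import tempfile

# Create a real temporary folder with the "ersilia-" prefix.
tmp_dir = tempfile.mkdtemp(prefix="ersilia-")
try:
    # Placeholder intermediate file standing in for the model's working files.
    smi_path = os.path.join(tmp_dir, "molecule.smi")
    with open(smi_path, "w") as f:
        f.write("CCO smi_0\n")
    created = os.path.exists(smi_path)
finally:
    # Remove the folder and everything in it, even if a step above failed.
    shutil.rmtree(tmp_dir, ignore_errors=True)

print(created, os.path.exists(tmp_dir))
```

Wrapping the work in try/finally means the folder is cleaned up even when the descriptor calculation or the prediction step raises an error.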
from ersilia.
Hi @GemmaTuron, I made the changes you suggested and verified that I have the most recent version of Ersilia. I have been working through the fetch errors when I use the repo_path flag and this is what I have now:
eos3nn9_fetch_error_180724.txt
I am going to go through the Ersilia troubleshooting guide again in case I missed anything, but I would appreciate it if you have any ideas. Thanks!
from ersilia.
Hi @HarmonySosa
From this line: /bin/sh: 1: bentoml: not found
I suspect something might not be working properly with your Ersilia installation.
- Have you first made sure the model works using bash run.sh?
- Is it an Ubuntu, Mac or Windows machine? Remember that Ersilia does not work on Windows
- Did you figure out the problems you mentioned in the eos1pu1 issue?
- Can you confirm Ersilia works fine by testing another model?
Remember that you can use a Codespace as your own system: open a codespace from Ersilia, git clone the repository you have forked (with all files) and test it there (still using the repo path, or the run.sh file)
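The message "/bin/sh: 1: bentoml: not found" means the shell could not find a bentoml executable on the PATH. One possible explanation, offered only as an assumption, is that the environment where Ersilia was installed is not the active one. A quick check can be sketched with the standard library (missing_tools is a hypothetical helper name):

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "bentoml" comes from the error line; "java" is needed by the PaDEL command.
print(missing_tools(["bentoml", "java"]))
```

An empty list means both executables are visible from the current environment; any name that is printed is one the shell would also fail to find.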
from ersilia.
Hi @GemmaTuron, thank you for the help! I have tested that the model works with the run.sh command. I am using Ubuntu on a Windows machine, but I haven't been able to successfully fetch another model, including eos1pu1. I am trying to use CodeSpaces now to see if it works, and I will try reinstalling Ersilia again if it doesn't work.
from ersilia.
Hi @GemmaTuron ,
I tried using the CodeSpace but ended up reinstalling Ersilia. I am running Ubuntu and I now get this error:
err.txt
As such, I know it is loading this far to be able to see the url, but it is failing on the join. The error suggests a path problem, but I have been testing that and I do not see where it is. Do you have any suggestions?
Thanks!
from ersilia.