microsoftresearch / azimuth
Machine Learning-Based Predictive Modelling of CRISPR/Cas9 guide efficiency
License: BSD 3-Clause "New" or "Revised" License
Azimuth requires scikit-learn >=0.17.1, <0.18.1, but pip install azimuth may install scikit-learn 0.18.1, so the tests fail when you run nosetests:
c:\python27\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
c:\python27\lib\site-packages\sklearn\grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
EE
======================================================================
ERROR: test_predictions_nopos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
File "sklearn\tree\_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn\tree\_tree.c:8125)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_predictions_pos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
File "sklearn\tree\_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn\tree\_tree.c:8125)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_full
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 2 tests in 0.813s
FAILED (errors=2)
I know I can suppress warnings on the client side, but it would be better to get rid of them at the source.
import azimuth.model_comparison as mc, numpy as np
/opt/apps/tools/anaconda2-2.5.0/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/apps/tools/anaconda2-2.5.0/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
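Until the imports are updated upstream, a client-side filter can silence just these two modules. A minimal sketch (the module-regex filter is my own suggestion, not something Azimuth ships), with a stand-in demonstration that needs no sklearn:

```python
import warnings

def silence_sklearn_deprecations():
    # Ignore the deprecation chatter emitted by the two renamed sklearn
    # modules; other DeprecationWarnings still get through.
    warnings.filterwarnings(
        "ignore",
        category=DeprecationWarning,
        module=r"sklearn\.(cross_validation|grid_search)",
    )

silence_sklearn_deprecations()

# Stand-in demonstration: a filtered DeprecationWarning is not recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.warn("This module will be removed in 0.20.", DeprecationWarning)

print(len(caught))  # 0
```

The filter is most-recent-first, so the "ignore" entry takes precedence over the blanket "always".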
A recent commit (2c4665e) added an assert that 'seq' is a string, but downstream code seems to expect a numpy array. Additionally, the README states that this argument should be a numpy array.
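A minimal stand-in reproducing the mismatch (the `_check_seq` helper is hypothetical, not Azimuth's code; a plain list stands in for the numpy array the README documents):

```python
def _check_seq(seq):
    # Hypothetical stand-in for the assert added in 2c4665e: it demands a str.
    assert isinstance(seq, str), "seq must be a string"
    return True

def rejected(x):
    # True if the assert rejects this input.
    try:
        _check_seq(x)
        return False
    except AssertionError:
        return True

# A sequence container (stand-in for the documented numpy array of 30-mers)
# trips the assert, while a bare string passes:
print(rejected(["AACTGATTTCTGGCGTTTTCTTTCTGGCTC"]))  # True
print(rejected("AACTGATTTCTGGCGTTTTCTTTCTGGCTC"))    # False
```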
The project website sends me to http://www.jennifer.listgarten.com/azimuthFC_plus_RES_withPredictions.csv, which returns Not Found [CFN #0005].
The data seems to be available in the repo though: https://github.com/MicrosoftResearch/Azimuth/blob/master/azimuth/FC_plus_RES_withPredictions.csv
I created an Azimuth Docker container, https://github.com/antonkulaga/azimuth-docker, to use Azimuth in bioinformatic pipelines. In fact, there are two containers there: one for console usage, and one with Azimuth wrapped in Flask for use as a microservice.
I also put the Microsoft license inside it: https://github.com/antonkulaga/azimuth-docker/blob/master/cli/LICENSE
Hello! We've noticed that occasionally, guides that we score will get a negative predicted score from azimuth. Here's an example:
> from azimuth.model_comparison import predict
> import numpy
> predict(numpy.array(['AACTGATTTCTGGCGTTTTCTTTCTGGCTC']), numpy.array([8905]), numpy.array([96]))
No model file specified, using V3_model_full
array([-0.04603427])
Empirically, it looks like negative scores are more likely to happen with peptide percentages closer to 100. However, from the Azimuth documentation (and the general understanding of CRISPR on-target scores), scores are expected to be between 0.0 and 1.0. Is this possibly a bug in the scoring system?
If it's helpful, this does not happen if the peptide percentage is set to 95 or less.
Thanks!
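As a stop-gap on the caller's side, out-of-range predictions can be clipped into the documented range. This is my own workaround sketch, not Azimuth's behaviour, and it papers over rather than fixes the underlying issue:

```python
def clamp_score(score, lo=0.0, hi=1.0):
    # Workaround sketch: clip a predicted efficiency into the documented
    # [0.0, 1.0] range. Negative regression output becomes 0.0.
    return max(lo, min(hi, score))

# The negative prediction from the example above, plus in-range values:
print([clamp_score(s) for s in (-0.04603427, 0.5656, 1.02)])  # [0.0, 0.5656, 1.0]
```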
Azimuth doesn't seem to be Python 3-ready. Could support be added?
Hi,
I'm trying to compute Rule Set 2 scores using model_comparison.predict on the nopos model.
If I download the latest version from github, the pickle files don't seem to load:
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'
So I run model_comparison.py as instructed on the main github page to reproduce the two models.
It prints a LOT of warnings:
WARNING: trimming max_index_to use down to length of string=30
but does eventually produce two model files.
However, if I use those files, I get a different value from the one you report in the README that comes with the Rule Set 2 Calculator: 0.5909 vs 0.5656.
Looking at the latest test_saved_models.py, it looks like you expect the new value. So I guess I'm just wondering what the reason is for the change. Which one corresponds to the method used in your paper? I'm guessing the older one? In which case, how do I reproduce it?
(sorry I didn't mean to submit this issue when I submitted it, and have since resolved some issues with versioning, so now it's a question instead!)
Thanks,
Felicity
Fixed: I seem to have missed reading this section, which solved my issue.
Generating new model .pickle files
Sometimes the pre-computed .pickle files in the saved_models directory are incompatible with different versions of scikit-learn. You can re-train the files saved_models/V3_model_full.pickle and saved_models/V3_model_nopos.pickle by running the command python model_comparison.py (which will overwrite the saved models). You can check that the resulting models match the models we precomputed by running python test_saved_models.py within the tests directory.
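The "models match" check is an absolute-tolerance comparison (the test file uses np.allclose with atol=1e-3). A numpy-free sketch of the same check, with illustrative numbers except for the 0.5909 vs 0.5656 pair from the question above:

```python
def matches_stable(predictions, stable, atol=1e-3):
    # Mirror of the tolerance check in tests/test_saved_models.py
    # (np.allclose with atol=1e-3), written without numpy.
    return all(abs(p - s) <= atol for p, s in zip(predictions, stable))

print(matches_stable([0.5656, 0.4500], [0.5658, 0.4499]))  # True: within 1e-3
print(matches_stable([0.5909], [0.5656]))                  # False: differs by 0.0253
```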
Using:
sklearn version 0.18.1
python version 2.7.6
I believe that the error arises because the pickled models were built using sklearn 0.17, and it seems like the API for trees changed slightly with sklearn 0.18.
Please let me know if there is any other helpful information I can provide.
$ nosetests
/afs/csail.mit.edu/u/m/maxwshen/.local/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/afs/csail.mit.edu/u/m/maxwshen/.local/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
EE
======================================================================
ERROR: test_predictions_nopos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/cluster/mshen/tools/Azimuth/azimuth/tests/test_saved_models.py", line 17, in test_predictions_nopos
predictions = azimuth.model_comparison.predict(np.array(df['guide'].values), None, None)
File "/cluster/mshen/tools/Azimuth/azimuth/model_comparison.py", line 550, in predict
model, learn_options = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
setstate(state)
File "sklearn/tree/_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_predictions_pos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/cluster/mshen/tools/Azimuth/azimuth/tests/test_saved_models.py", line 22, in test_predictions_pos
predictions = azimuth.model_comparison.predict(np.array(df['guide'].values), np.array(df['AA cut'].values), np.array(df['Percent peptide'].values))
File "/cluster/mshen/tools/Azimuth/azimuth/model_comparison.py", line 550, in predict
model, learn_options = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
setstate(state)
File "sklearn/tree/_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_full
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 2 tests in 4.093s
FAILED (errors=2)
It was very hard to get Azimuth 2.0 installed in Python 2.7, so I thought I'd share my experience with others.
1. conda create --name azimuth python=2.7
2. conda activate azimuth
3. conda install biopython
4. conda install scikit-learn=0.17.1
5. pip install 'numpy==1.12.1'
6. pip install azimuth
When I run nosetests on azimuth, it fails with the following output:
# nosetests
/Library/Python/2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Python/2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
F
======================================================================
FAIL: test_predictions (azimuth.tests.test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/azimuth/tests/test_saved_models.py", line 19, in test_predictions
self.assertTrue(np.allclose(predictions, df['Stable prediction'].values, atol=1e-3))
AssertionError: False is not true
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 1 test in 3.481s
FAILED (failures=1)
This script reproduces the problem.
Also, the scores obtained by that script from the azimuth Python package seem to agree with those produced by https://crispr.ml when submitting ENSG00000100823 (APEX1), and with those produced by GPP sgRNA Designer when submitting gene ID 328 (also APEX1), indicating that both servers ignore the parameters "Target Cut Length" and "Target Cut %" when calculating the "On-Target Efficacy Score". This limitation is not obvious from the documentation.
On the project page https://www.microsoft.com/en-us/research/project/azimuth/, the User's guide section states that when calling predict with no aa cut or peptide percent you should pass -1 as a sentinel, when the program actually requires None.
http://www.nature.com/nbt/journal/v32/n12/fig_tab/nbt.3026_F3.html shows the PAM at positions 24-27 within the 30-mer (using zero-based Python indexing) and the length-20 targeting sequence at positions 4-24. Azimuth's pam_audit code uses this definition, since it requires a 'GG' at seq[25:27]. nucleotide_features_dictionary also uses this definition.
However, countGC assumes the targeting sequence is at positions 5-25: return len(s[5:25].replace('A', '').replace('T', '')). Tm_feature also assumes this: featarray[i,1] = Tm.Tm_staluc(seq[20:25], rna=rna) #5nts immediately proximal of the NGG PAM. These seem incorrect. Could you clarify in the documentation whether the targeting sequence is at positions 4-24 or 5-25?
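The discrepancy can be made concrete with the 30-mer from the negative-score report above. A sketch using the slices quoted from pam_audit and countGC:

```python
seq = "AACTGATTTCTGGCGTTTTCTTTCTGGCTC"  # 30-mer from an issue above
assert len(seq) == 30 and seq[25:27] == "GG"  # pam_audit's NGG requirement

guide_4_24 = seq[4:24]  # guide if it occupies positions 4-24 (per nbt.3026 Fig. 3)
guide_5_25 = seq[5:25]  # guide if it occupies positions 5-25 (countGC's slice)
gc_count = len(seq[5:25].replace('A', '').replace('T', ''))  # countGC's computation

print(guide_4_24)  # GATTTCTGGCGTTTTCTTTC
print(guide_5_25)  # ATTTCTGGCGTTTTCTTTCT
print(gc_count)    # 7
```

The two conventions yield 20-mers shifted by one base, so features like GC count differ depending on which slice is used.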
Thanks for making this tool available to the community!
Hey,
I've been trying to install Azimuth on Windows 10 and have been running into problems even after installing http://aka.ms/vcpython27
It'd be nice to have some clear documentation of the setup process and the list of libraries required to get this running. Here's the log:
https://gist.github.com/sudheesh001/2b70e0ed4371032cab58d1efa367cee3
Hi Author,
I am new to Python, so I only know that the problem is related to the renaming and deprecation of the cross_validation sub-module to model_selection.
But I don't know how to fix it.
Look forward to your reply.
Yao
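The usual fix is a fallback import: try the new sklearn.model_selection path first and fall back to the old names on older sklearn. A generic sketch of the pattern (demonstrated with stdlib stand-in module names, since the pattern is what matters):

```python
import importlib

def resolve_module(new_name, old_name):
    # Return whichever of the two module paths is importable, preferring
    # the new one; raise only if neither exists.
    for name in (new_name, old_name):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("neither %s nor %s is available" % (new_name, old_name))

# In Azimuth this would be something like:
#   ms = resolve_module("sklearn.model_selection", "sklearn.cross_validation")
# Stand-in demo with stdlib names:
mod = resolve_module("json", "simplejson")
print(mod.__name__)  # json
```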
Hi,
I always get the follows error when running Azimuth.
File "/home/Azimuth-2.0/azimuth/model_comparison.py", line 559, in predict
feature_sets = feat.featurize_data(Xdf, learn_options, pandas.DataFrame(), gene_position, pam_audit=pam_audit, length_audit=length_audit)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 31, in featurize_data
get_all_order_nuc_features(data['30mer'], feature_sets, learn_options, learn_options["order"], max_index_to_use=30, quiet=quiet)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 153, in get_all_order_nuc_features
include_pos_independent=True, max_index_to_use=max_index_to_use, prefix=prefix)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 423, in apply_nucleotide_features
feat_pd = seq_data_frame.apply(nucleotide_features, args=(order, max_index_to_use, prefix, 'pos_dependent'))
File "/home/lib/python2.7/site-packages/pandas/core/series.py", line 3194, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
File "/home/lib/python2.7/site-packages/pandas/core/series.py", line 3181, in <lambda>
f = lambda x: func(x, *args, **kwds)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 468, in nucleotide_features
features_pos_dependent[alphabet.index(nucl) + (position*len(alphabet))] = 1.0
ValueError: 'R' is not in list
Sometimes I also get the error ValueError: 'K' is not in list.
I have searched via Google, but found no solution.
So, how can I solve the problem? Thanks.
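'R' and 'K' are IUPAC ambiguity codes (R = A/G, K = G/T), so the input sequences contain non-ACGT letters that Azimuth's one-hot alphabet cannot index. A pre-flight check (my own sketch, not part of Azimuth) finds the offending guides before featurization:

```python
def find_ambiguous(seqs, alphabet=frozenset("ACGT")):
    # Return the sequences containing any character outside A/C/G/T.
    return [s for s in seqs if any(ch not in alphabet for ch in s.upper())]

guides = ["AACTGATTTCTGGCGTTTTCTTTCTGGCTC", "AACTGRTTTCTGGCGTTTTCTTTCTGGCTC"]
print(find_ambiguous(guides))  # ['AACTGRTTTCTGGCGTTTTCTTTCTGGCTC']
```

Sequences flagged this way need to be resolved to concrete bases (or dropped) before calling predict.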
Attempts to run model_comparison.py to re-train for scikit-learn>=0.17.1 fail with the traceback below. There is no upper bound on the scikit-learn version, so pip install azimuth installed scikit-learn==0.19.1. What is the solution (and why is xlrd not a required package)?
[~]# python /Library/Python/2.7/site-packages/azimuth/model_comparison.py
/Library/Python/2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Python/2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/azimuth/model_comparison.py", line 609, in <module> save_final_model_V3(filename='saved_models/V3_model_nopos.pickle', include_position=False)
File "/Library/Python/2.7/site-packages/azimuth/model_comparison.py", line 468, in save_final_model_V3
'train_genes': azimuth.load_data.get_V3_genes(),
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 466, in get_V3_genes
target_genes = np.concatenate((get_V1_genes(data_fileV1), get_V2_genes(data_fileV2)))
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 456, in get_V1_genes
annotations, gene_position, target_genes, Xdf, Y = read_V1_data(data_file, learn_options=None)
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 132, in read_V1_data
human_data = pandas.read_excel(data_file, sheetname=0, index_col=[0, 1])
File "/Library/Python/2.7/site-packages/pandas/io/excel.py", line 203, in read_excel
io = ExcelFile(io, engine=engine)
File "/Library/Python/2.7/site-packages/pandas/io/excel.py", line 232, in __init__
import xlrd # throw an ImportError if we need to
ImportError: No module named xlrd
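Installing xlrd (pip install xlrd) resolves the ImportError; it is only pulled in by pandas.read_excel, which is why a plain install may not include it. A defensive check before retraining (my own sketch; the stdlib "json" module stands in for the demo so the snippet runs anywhere):

```python
import importlib.util

def dependency_status(pkg):
    # Report whether an optional dependency is importable, without importing it.
    if importlib.util.find_spec(pkg) is None:
        return "missing: try 'pip install %s'" % pkg
    return "ok"

# e.g. check dependency_status("xlrd") before running model_comparison.py
print(dependency_status("json"))  # ok
```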
It looks like you hardcoded NGG PAMs; what about CRISPR systems with other PAMs, like Cpf1?
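For illustration, the hardcoded check (seq[25:27] == 'GG') could be generalized to an arbitrary PAM pattern. This is my own sketch of such a parameterization, not anything Azimuth provides, and note the model itself was trained only on NGG guides:

```python
def pam_matches(seq, pam_pattern="NGG", start=24):
    # Parameterized PAM check. IUPAC codes used here: N = any base, V = A/C/G.
    iupac = {"N": "ACGT", "V": "ACG", "A": "A", "C": "C", "G": "G", "T": "T"}
    window = seq[start:start + len(pam_pattern)]
    return len(window) == len(pam_pattern) and all(
        base in iupac[code] for code, base in zip(pam_pattern, window)
    )

seq = "AACTGATTTCTGGCGTTTTCTTTCTGGCTC"  # 30-mer from an issue above
print(pam_matches(seq, "NGG", 24))       # True: seq[24:27] == 'TGG'
print(pam_matches(seq, "TTTV", 0))       # False: no Cpf1-style 5' TTTV PAM here
```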
As the typical gRNA length is 20 nt, I wonder why 30 nt are needed and how to define those nucleotides (I could take the gRNA with 10 nt before it, or the gRNA with 20 nt after it).
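For reference, the 30-mer layout implied by nbt.3026 Fig. 3 (cited in an issue above) is 4 nt upstream + 20-nt guide + 3-nt NGG PAM + 3 nt downstream. A sketch of assembling it, reusing the example 30-mer's pieces:

```python
def build_30mer(upstream4, guide20, pam3, downstream3):
    # Assemble the 30-mer context: 4 nt upstream, 20-nt guide,
    # 3-nt PAM (NGG), 3 nt downstream.
    assert (len(upstream4), len(guide20), len(pam3), len(downstream3)) == (4, 20, 3, 3)
    return upstream4 + guide20 + pam3 + downstream3

seq = build_30mer("AACT", "GATTTCTGGCGTTTTCTTTC", "TGG", "CTC")
print(len(seq), seq[25:27])  # 30 GG
```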