microsoftresearch / azimuth
Machine Learning-Based Predictive Modelling of CRISPR/Cas9 guide efficiency
License: BSD 3-Clause "New" or "Revised" License
Azimuth requires scikit-learn >=0.17.1, <0.18.1, but pip install azimuth may install scikit-learn 0.18.1, so the tests fail when you run nosetests:
c:\python27\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
c:\python27\lib\site-packages\sklearn\grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
EE
======================================================================
ERROR: test_predictions_nopos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
File "sklearn\tree\_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn\tree\_tree.c:8125)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_predictions_pos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
File "sklearn\tree\_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn\tree\_tree.c:8125)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_full
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 2 tests in 0.813s
FAILED (errors=2)
I know I can suppress warnings on the client side, but it would be better to get rid of them at the source.
import azimuth.model_comparison as mc, numpy as np
/opt/apps/tools/anaconda2-2.5.0/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/opt/apps/tools/anaconda2-2.5.0/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
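Until the imports are updated upstream, a client-side filter can silence just these two modules. A minimal sketch (the module-regex filter is my own suggestion, not something Azimuth ships), with a stand-in demonstration that needs no sklearn:

```python
import warnings

def silence_sklearn_deprecations():
    # Ignore the deprecation chatter emitted by the two renamed sklearn
    # modules; other DeprecationWarnings still get through.
    warnings.filterwarnings(
        "ignore",
        category=DeprecationWarning,
        module=r"sklearn\.(cross_validation|grid_search)",
    )

silence_sklearn_deprecations()

# Stand-in demonstration: a filtered DeprecationWarning is not recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.warn("This module will be removed in 0.20.", DeprecationWarning)

print(len(caught))  # 0
```

The filter is most-recent-first, so the "ignore" entry takes precedence over the blanket "always".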
A recent commit (2c4665e) added an assert that 'seq' is a string, but downstream code seems to expect a numpy array. Additionally, the README states that this argument should be a numpy array.
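A minimal stand-in reproducing the mismatch (the `_check_seq` helper is hypothetical, not Azimuth's code; a plain list stands in for the numpy array the README documents):

```python
def _check_seq(seq):
    # Hypothetical stand-in for the assert added in 2c4665e: it demands a str.
    assert isinstance(seq, str), "seq must be a string"
    return True

def rejected(x):
    # True if the assert rejects this input.
    try:
        _check_seq(x)
        return False
    except AssertionError:
        return True

# A sequence container (stand-in for the documented numpy array of 30-mers)
# trips the assert, while a bare string passes:
print(rejected(["AACTGATTTCTGGCGTTTTCTTTCTGGCTC"]))  # True
print(rejected("AACTGATTTCTGGCGTTTTCTTTCTGGCTC"))    # False
```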
The project website sends me to http://www.jennifer.listgarten.com/azimuthFC_plus_RES_withPredictions.csv, which returns Not Found [CFN #0005].
The data seems to be available in the repo though: https://github.com/MicrosoftResearch/Azimuth/blob/master/azimuth/FC_plus_RES_withPredictions.csv
I created an Azimuth Docker container, https://github.com/antonkulaga/azimuth-docker, to use Azimuth in bioinformatic pipelines. In fact, there are two containers there: one for console usage, and one with Azimuth wrapped in Flask for use as a microservice.
I also put the Microsoft license inside it: https://github.com/antonkulaga/azimuth-docker/blob/master/cli/LICENSE
Hello! We've noticed that occasionally, guides that we score will get a negative predicted score from azimuth. Here's an example:
> from azimuth.model_comparison import predict
> import numpy
> predict(numpy.array(['AACTGATTTCTGGCGTTTTCTTTCTGGCTC']), numpy.array([8905]), numpy.array([96]))
No model file specified, using V3_model_full
array([-0.04603427])
Empirically, it looks like negative scores are more likely to happen with peptide percentages closer to 100. However, from the Azimuth documentation (and the general understanding of CRISPR on-target scores), scores are expected to be between 0.0 and 1.0. Is this possibly a bug in the scoring system?
If it's helpful, this does not happen if the peptide percentage is set to 95 or less.
Thanks!
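As a stop-gap on the caller's side, out-of-range predictions can be clipped into the documented range. This is my own workaround sketch, not Azimuth's behaviour, and it papers over rather than fixes the underlying issue:

```python
def clamp_score(score, lo=0.0, hi=1.0):
    # Workaround sketch: clip a predicted efficiency into the documented
    # [0.0, 1.0] range. Negative regression output becomes 0.0.
    return max(lo, min(hi, score))

# The negative prediction from the example above, plus in-range values:
print([clamp_score(s) for s in (-0.04603427, 0.5656, 1.02)])  # [0.0, 0.5656, 1.0]
```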
Azimuth doesn't seem to be Python 3-ready. Could support be added?
Hi,
I'm trying to compute Rule Set 2 scores using model_comparison.predict on the nopos model.
If I download the latest version from github, the pickle files don't seem to load:
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'
So I run model_comparison.py as instructed on the main github page to reproduce the two models.
It prints a LOT of warnings:
WARNING: trimming max_index_to use down to length of string=30
but does eventually produce two model files.
However, if I use those files, I get a different value from the one you report in the README that comes with the Rule Set 2 Calculator: 0.5909 vs 0.5656.
Looking at the latest test_saved_models.py, it looks like you expect the new value. So I guess I'm just wondering what the reason is for the change. Which one corresponds to the method used in your paper? I'm guessing the older one? In which case, how do I reproduce it?
(sorry I didn't mean to submit this issue when I submitted it, and have since resolved some issues with versioning, so now it's a question instead!)
Thanks,
Felicity
Fixed: I seem to have missed reading this section, which solved my issue.
Generating new model .pickle files
Sometimes the pre-computed .pickle files in the saved_models directory are incompatible with different versions of scikit-learn. You can re-train the files saved_models/V3_model_full.pickle and saved_models/V3_model_nopos.pickle by running the command python model_comparison.py (which will overwrite the saved models). You can check that the resulting models match the models we precomputed by running python test_saved_models.py within the tests directory.
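The "models match" check is an absolute-tolerance comparison (the test file uses np.allclose with atol=1e-3). A numpy-free sketch of the same check, with illustrative numbers except for the 0.5909 vs 0.5656 pair from the question above:

```python
def matches_stable(predictions, stable, atol=1e-3):
    # Mirror of the tolerance check in tests/test_saved_models.py
    # (np.allclose with atol=1e-3), written without numpy.
    return all(abs(p - s) <= atol for p, s in zip(predictions, stable))

print(matches_stable([0.5656, 0.4500], [0.5658, 0.4499]))  # True: within 1e-3
print(matches_stable([0.5909], [0.5656]))                  # False: differs by 0.0253
```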
Using:
sklearn version 0.18.1
python version 2.7.6
I believe that the error arises because the pickled models were built using sklearn 0.17, and it seems like the API for trees changed slightly with sklearn 0.18.
Please let me know if there is any other helpful information I can provide.
$ nosetests
/afs/csail.mit.edu/u/m/maxwshen/.local/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/afs/csail.mit.edu/u/m/maxwshen/.local/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
EE
======================================================================
ERROR: test_predictions_nopos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/cluster/mshen/tools/Azimuth/azimuth/tests/test_saved_models.py", line 17, in test_predictions_nopos
predictions = azimuth.model_comparison.predict(np.array(df['guide'].values), None, None)
File "/cluster/mshen/tools/Azimuth/azimuth/model_comparison.py", line 550, in predict
model, learn_options = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
setstate(state)
File "sklearn/tree/_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_predictions_pos (test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/cluster/mshen/tools/Azimuth/azimuth/tests/test_saved_models.py", line 22, in test_predictions_pos
predictions = azimuth.model_comparison.predict(np.array(df['guide'].values), np.array(df['AA cut'].values), np.array(df['Percent peptide'].values))
File "/cluster/mshen/tools/Azimuth/azimuth/model_comparison.py", line 550, in predict
model, learn_options = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
setstate(state)
File "sklearn/tree/_tree.pyx", line 632, in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:8128)
KeyError: 'max_depth'
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_full
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 2 tests in 4.093s
FAILED (errors=2)
It was very hard to get Azimuth 2.0 installed in Python 2.7, so I thought I'd share my experience with others.
1. conda create --name azimuth python=2.7
2. conda activate azimuth
3. conda install biopython
4. conda install scikit-learn=0.17.1
5. pip install 'numpy==1.12.1'
6. pip install azimuth
When I run nosetests on azimuth, it fails with the following output:
# nosetests
/Library/Python/2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Python/2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
F
======================================================================
FAIL: test_predictions (azimuth.tests.test_saved_models.SavedModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/azimuth/tests/test_saved_models.py", line 19, in test_predictions
self.assertTrue(np.allclose(predictions, df['Stable prediction'].values, atol=1e-3))
AssertionError: False is not true
-------------------- >> begin captured stdout << ---------------------
No model file specified, using V3_model_nopos
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 1 test in 3.481s
FAILED (failures=1)
This script reproduces the problem.
Also, the scores obtained by that script from the azimuth Python package seem to agree with those produced by https://crispr.ml when submitting ENSG00000100823 (APEX1), and with those produced by GPP sgRNA Designer when submitting gene ID 328 (also APEX1), indicating that both servers ignore the parameters "Target Cut Length" and "Target Cut %" when calculating the "On-Target Efficacy Score". This limitation is not obvious from the documentation.
On the project page https://www.microsoft.com/en-us/research/project/azimuth/, the User's guide section states that when calling predict with no aa cut or peptide percent you should pass -1 as a sentinel, when the program actually requires None.
http://www.nature.com/nbt/journal/v32/n12/fig_tab/nbt.3026_F3.html shows the PAM at positions 24-27 within the 30-mer (using zero-based Python indexing) and the length-20 targeting sequence at positions 4-24. Azimuth's pam_audit code uses this definition, since it requires a 'GG' at seq[25:27]. nucleotide_features_dictionary also uses this definition.
However, countGC assumes the targeting sequence is at positions 5-25: return len(s[5:25].replace('A', '').replace('T', '')). Tm_feature also assumes this: featarray[i,1] = Tm.Tm_staluc(seq[20:25], rna=rna) #5nts immediately proximal of the NGG PAM. These seem incorrect. Could you clarify in the documentation whether the targeting sequence is at positions 4-24 or 5-25?
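The discrepancy can be made concrete with the 30-mer from the negative-score report above. A sketch using the slices quoted from pam_audit and countGC:

```python
seq = "AACTGATTTCTGGCGTTTTCTTTCTGGCTC"  # 30-mer from an issue above
assert len(seq) == 30 and seq[25:27] == "GG"  # pam_audit's NGG requirement

guide_4_24 = seq[4:24]  # guide if it occupies positions 4-24 (per nbt.3026 Fig. 3)
guide_5_25 = seq[5:25]  # guide if it occupies positions 5-25 (countGC's slice)
gc_count = len(seq[5:25].replace('A', '').replace('T', ''))  # countGC's computation

print(guide_4_24)  # GATTTCTGGCGTTTTCTTTC
print(guide_5_25)  # ATTTCTGGCGTTTTCTTTCT
print(gc_count)    # 7
```

The two conventions yield 20-mers shifted by one base, so features like GC count differ depending on which slice is used.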
Thanks for making this tool available to the community!
Hey,
I've been trying to install Azimuth on Windows 10 and have been running into problems even after installing http://aka.ms/vcpython27
It'd be nice to have some clear documentation of the setup process and the list of libraries required to get this running. Here's the log:
https://gist.github.com/sudheesh001/2b70e0ed4371032cab58d1efa367cee3
Hi Author,
I am new to Python, so I only know that the problem is related to the renaming and deprecation of the cross_validation sub-module to model_selection.
But I don't know how to fix it.
Look forward to your reply.
Yao
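The usual fix is a fallback import: try the new sklearn.model_selection path first and fall back to the old names on older sklearn. A generic sketch of the pattern (demonstrated with stdlib stand-in module names, since the pattern is what matters):

```python
import importlib

def resolve_module(new_name, old_name):
    # Return whichever of the two module paths is importable, preferring
    # the new one; raise only if neither exists.
    for name in (new_name, old_name):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("neither %s nor %s is available" % (new_name, old_name))

# In Azimuth this would be something like:
#   ms = resolve_module("sklearn.model_selection", "sklearn.cross_validation")
# Stand-in demo with stdlib names:
mod = resolve_module("json", "simplejson")
print(mod.__name__)  # json
```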
Hi,
I always get the follows error when running Azimuth.
File "/home/Azimuth-2.0/azimuth/model_comparison.py", line 559, in predict
feature_sets = feat.featurize_data(Xdf, learn_options, pandas.DataFrame(), gene_position, pam_audit=pam_audit, length_audit=length_audit)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 31, in featurize_data
get_all_order_nuc_features(data['30mer'], feature_sets, learn_options, learn_options["order"], max_index_to_use=30, quiet=quiet)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 153, in get_all_order_nuc_features
include_pos_independent=True, max_index_to_use=max_index_to_use, prefix=prefix)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 423, in apply_nucleotide_features
feat_pd = seq_data_frame.apply(nucleotide_features, args=(order, max_index_to_use, prefix, 'pos_dependent'))
File "/home/lib/python2.7/site-packages/pandas/core/series.py", line 3194, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
File "/home/lib/python2.7/site-packages/pandas/core/series.py", line 3181, in <lambda>
f = lambda x: func(x, *args, **kwds)
File "/home/Azimuth-2.0/azimuth/features/featurization.py", line 468, in nucleotide_features
features_pos_dependent[alphabet.index(nucl) + (position*len(alphabet))] = 1.0
ValueError: 'R' is not in list
Sometimes I also get the error ValueError: 'K' is not in list.
I have searched via Google, but found no solution.
So, how can I solve the problem? Thanks.
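'R' and 'K' are IUPAC ambiguity codes (R = A/G, K = G/T), so the input sequences contain non-ACGT letters that Azimuth's one-hot alphabet cannot index. A pre-flight check (my own sketch, not part of Azimuth) finds the offending guides before featurization:

```python
def find_ambiguous(seqs, alphabet=frozenset("ACGT")):
    # Return the sequences containing any character outside A/C/G/T.
    return [s for s in seqs if any(ch not in alphabet for ch in s.upper())]

guides = ["AACTGATTTCTGGCGTTTTCTTTCTGGCTC", "AACTGRTTTCTGGCGTTTTCTTTCTGGCTC"]
print(find_ambiguous(guides))  # ['AACTGRTTTCTGGCGTTTTCTTTCTGGCTC']
```

Sequences flagged this way need to be resolved to concrete bases (or dropped) before calling predict.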
Attempts to run model_comparison.py to re-train for scikit-learn>=0.17.1 fail with the traceback below. There is no upper bound on the scikit-learn version, so pip install azimuth installed scikit-learn==0.19.1. What is the solution (and why is xlrd not a required package)?
[~]# python /Library/Python/2.7/site-packages/azimuth/model_comparison.py
/Library/Python/2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Python/2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/azimuth/model_comparison.py", line 609, in <module> save_final_model_V3(filename='saved_models/V3_model_nopos.pickle', include_position=False)
File "/Library/Python/2.7/site-packages/azimuth/model_comparison.py", line 468, in save_final_model_V3
'train_genes': azimuth.load_data.get_V3_genes(),
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 466, in get_V3_genes
target_genes = np.concatenate((get_V1_genes(data_fileV1), get_V2_genes(data_fileV2)))
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 456, in get_V1_genes
annotations, gene_position, target_genes, Xdf, Y = read_V1_data(data_file, learn_options=None)
File "/Library/Python/2.7/site-packages/azimuth/load_data.py", line 132, in read_V1_data
human_data = pandas.read_excel(data_file, sheetname=0, index_col=[0, 1])
File "/Library/Python/2.7/site-packages/pandas/io/excel.py", line 203, in read_excel
io = ExcelFile(io, engine=engine)
File "/Library/Python/2.7/site-packages/pandas/io/excel.py", line 232, in __init__
import xlrd # throw an ImportError if we need to
ImportError: No module named xlrd
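Installing xlrd (pip install xlrd) resolves the ImportError; it is only pulled in by pandas.read_excel, which is why a plain install may not include it. A defensive check before retraining (my own sketch; the stdlib "json" module stands in for the demo so the snippet runs anywhere):

```python
import importlib.util

def dependency_status(pkg):
    # Report whether an optional dependency is importable, without importing it.
    if importlib.util.find_spec(pkg) is None:
        return "missing: try 'pip install %s'" % pkg
    return "ok"

# e.g. check dependency_status("xlrd") before running model_comparison.py
print(dependency_status("json"))  # ok
```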
It looks like you hardcoded NGG PAMs; what about CRISPR systems with other PAMs, like Cpf1?
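For illustration, the hardcoded check (seq[25:27] == 'GG') could be generalized to an arbitrary PAM pattern. This is my own sketch of such a parameterization, not anything Azimuth provides, and note the model itself was trained only on NGG guides:

```python
def pam_matches(seq, pam_pattern="NGG", start=24):
    # Parameterized PAM check. IUPAC codes used here: N = any base, V = A/C/G.
    iupac = {"N": "ACGT", "V": "ACG", "A": "A", "C": "C", "G": "G", "T": "T"}
    window = seq[start:start + len(pam_pattern)]
    return len(window) == len(pam_pattern) and all(
        base in iupac[code] for code, base in zip(pam_pattern, window)
    )

seq = "AACTGATTTCTGGCGTTTTCTTTCTGGCTC"  # 30-mer from an issue above
print(pam_matches(seq, "NGG", 24))       # True: seq[24:27] == 'TGG'
print(pam_matches(seq, "TTTV", 0))       # False: no Cpf1-style 5' TTTV PAM here
```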
As the typical gRNA length is 20 nt, I wonder why 30 nt are needed and how to define those nucleotides (I could take the gRNA with 10 nt before it, or the gRNA with 20 nt after it).
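For reference, the 30-mer layout implied by nbt.3026 Fig. 3 (cited in an issue above) is 4 nt upstream + 20-nt guide + 3-nt NGG PAM + 3 nt downstream. A sketch of assembling it, reusing the example 30-mer's pieces:

```python
def build_30mer(upstream4, guide20, pam3, downstream3):
    # Assemble the 30-mer context: 4 nt upstream, 20-nt guide,
    # 3-nt PAM (NGG), 3 nt downstream.
    assert (len(upstream4), len(guide20), len(pam3), len(downstream3)) == (4, 20, 3, 3)
    return upstream4 + guide20 + pam3 + downstream3

seq = build_30mer("AACT", "GATTTCTGGCGTTTTCTTTC", "TGG", "CTC")
print(len(seq), seq[25:27])  # 30 GG
```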