recommenders-team / recommenders
Best Practices on Recommendation Systems
Home Page: https://recommenders-team.github.io/recommenders/intro.html
License: MIT License
The pySpark unit test CI/CD pipeline is also executing the Spark notebooks; this shouldn't happen.
https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/results?buildId=1669963&view=logs
Spark uses self.k and Python uses self.top_k in the init method. They should both be called self.top_k (I guess? lol)
Apart from jaccard and lift, we can add mutual information, cosine similarity, or the inclusion index.
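A sketch of the extra metrics, computed from an item co-occurrence matrix c (c[i, j] = number of users who interacted with both items i and j). The function name and layout are illustrative, not the repo's API; mutual information is omitted because it also needs the total user count.

```python
import numpy as np

def similarity_metrics(c):
    """Compute item-item similarity matrices from a co-occurrence matrix."""
    c = c.astype(float)
    diag = np.diag(c)  # item occurrence counts c[i, i]
    jaccard = c / (diag[:, None] + diag[None, :] - c)
    lift = c / (diag[:, None] * diag[None, :])
    cosine = c / np.sqrt(diag[:, None] * diag[None, :])
    # Inclusion index: co-occurrence normalized by the rarer item's count.
    inclusion = c / np.minimum(diag[:, None], diag[None, :])
    return jaccard, lift, cosine, inclusion
```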
Drawing schema about SAR
Match metrics of Spark and Python evaluators with large instead of dummy datasets (e.g., Netflix, Movielens, etc.).
When using the reco_bare environment and running pytest -m "not notebooks and not spark" tests/unit/, I get this on a Mac:
(reco_bare) MININT-JFKQCE5:Recommenders miguel$ pytest -m "not notebooks and not spark" tests/unit/
================== test session starts ===================
platform darwin -- Python 3.6.0, pytest-3.6.4, py-1.6.0, pluggy-0.7.1
rootdir: /Users/miguel/MS/code/Recommenders, inifile:
plugins: pylint-0.11.0, datafiles-2.0, cov-2.6.0
collected 39 items / 3 errors / 3 deselected
========================= ERRORS =========================
____ ERROR collecting tests/unit/test_sar_pyspark.py _____
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_sar_pyspark.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_sar_pyspark.py:7: in <module>
from reco_utils.recommender.sar.sar_pyspark import SARpySparkReference
reco_utils/recommender/sar/sar_pyspark.py:13: in <module>
import pyspark.sql.functions as F
E ModuleNotFoundError: No module named 'pyspark'
__ ERROR collecting tests/unit/test_spark_evaluation.py __
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_spark_evaluation.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_spark_evaluation.py:7: in <module>
from reco_utils.evaluation.spark_evaluation import (
reco_utils/evaluation/spark_evaluation.py:5: in <module>
from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics
E ModuleNotFoundError: No module named 'pyspark'
___ ERROR collecting tests/unit/test_spark_splitter.py ___
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_spark_splitter.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_spark_splitter.py:10: in <module>
from reco_utils.dataset.spark_splitters import spark_chrono_split, spark_random_split
reco_utils/dataset/spark_splitters.py:6: in <module>
from pyspark.sql import Window
E ModuleNotFoundError: No module named 'pyspark'
!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!
========= 3 deselected, 3 error in 2.02 seconds ==========
The idea was taken from https://engineeringblog.yelp.com/2018/05/pyspark-coding-practices-lessons-learned.html
@pytest.fixture(scope='session')
def spark():
    ....
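Following that Yelp post, a minimal conftest.py sketch could look like this. The master setting and app name are assumptions; importing pyspark inside the fixture means collecting non-Spark tests does not require pyspark to be installed, which also addresses the collection errors above.

```python
import pytest

@pytest.fixture(scope="session")
def spark():
    # Import here so that collecting non-Spark tests does not need pyspark.
    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        .master("local[2]")  # assumption: small local cluster for tests
        .appName("recommenders-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```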
Tests on certain modules such as evaluation and split should be performed on a sufficiently large dataset (e.g., Netflix, Movielens-1M, etc.).
SAR unit test configs for both single node and pySpark unit tests are the same - right now they are duplicated in each of the test files. They should be imported from conftest.py
Add logger in the same style as in the pySpark SAR unit tests file - the two should be consistent.
As the DSVM upgraded to Spark 2.3, the supporting libraries and environment don't work with Spark 2.2, which is required for Airship (now Recommenders). The quick fix is to upgrade the virtual env to Spark 2.3. Going forward, we should decide whether to keep upgrading Spark versions as the DSVM upgrades them, or to anchor to a specific Spark version with standalone DSVM libs.
The predict method should fill in the SAR score for a given (user, item) pair into a column called prediction. This is needed in order to use existing MLlib libraries.
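A pandas sketch of the intended shape. The scores dict stands in for the fitted model, and the function and column names here are hypothetical; only the "prediction" output column matches what MLlib evaluators expect.

```python
import pandas as pd

def predict(scores, pairs, col_user="UserId", col_item="MovieId"):
    """Look up the SAR score for each (user, item) pair and return it
    in a column named 'prediction'."""
    out = pairs.copy()
    out["prediction"] = [
        scores.get((u, i), 0.0) for u, i in zip(out[col_user], out[col_item])
    ]
    return out

pairs = pd.DataFrame({"UserId": ["u1", "u1"], "MovieId": ["m1", "m2"]})
scores = {("u1", "m1"): 0.9}  # hypothetical fitted scores
result = predict(scores, pairs)
```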
Sync with @dciborow about the work he's done already in this space
@eisber - can you add your updates to this notebook: https://github.com/Microsoft/Recommenders/blob/staging/notebooks/02_modeling/sar_educational_walkthrough.ipynb
As of now, the smoke and integration tests run on a pre-created environment.
We have to test the full pipeline. This means that, for each environment (Python, Spark, and GPU), we create the conda file, install the environment, execute the tests, and then remove the environment.
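A dry-run sketch of that pipeline. The script path, env names, and flavors are assumptions about the repo's tooling; echo is used so the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the create/install/test/remove cycle for one environment flavor.
run_env_pipeline() {
    local flavor="$1"
    local env_name="reco_${flavor}"
    echo "python scripts/generate_conda_file.py --name ${env_name} --${flavor}"
    echo "conda env create -f ${env_name}.yaml"
    echo "conda run -n ${env_name} pytest tests/unit"
    echo "conda env remove -n ${env_name}"
}

for flavor in python pyspark gpu; do
    run_env_pipeline "${flavor}"
done
```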
TODO:
Illustrate how a model is deployed as a service.
When creating a new CI/CD system, we had to register a Jupyter kernel:
python -m ipykernel install --user --name py36 --display-name "Python (py36)"
python -m ipykernel install --user --name recommender --display-name "Python (recommender)"
We need to review this issue.
We had a problem when installing pyspark: the conda file had 2.3.0 but the DSVM had 2.3.1.
We need to find a way to make the installation robust, independent of the Spark version on the machine.
One suggestion for the SAR implementation: SAR currently does both the user-item affinity matrix calculation and the item-item similarity matrix calculation in the same fit() function. It would be good to have them separate, in case we want to recalculate (update) only one of the matrices. Even better would be an update function for individual user or item records that recalculates only the cells of the matrices related to that user or item.
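A minimal numpy sketch of the split. This is a hypothetical class, not the repo's SARSingleNodeReference; similarity here is plain jaccard on a binary user-item matrix, with no time decay.

```python
import numpy as np

class SARSketch:
    """Sketch: fit() delegates to two steps that can be re-run
    independently when only one side of the model needs updating."""

    def fit(self, user_item):
        self.compute_user_affinity(user_item)
        self.compute_item_similarity(user_item)
        return self

    def compute_user_affinity(self, user_item):
        # The real model would apply time decay; here it is just the
        # interaction matrix.
        self.user_affinity = user_item.astype(float)

    def compute_item_similarity(self, user_item):
        c = (user_item.T @ user_item).astype(float)  # item co-occurrence
        diag = np.diag(c)
        self.item_similarity = c / (diag[:, None] + diag[None, :] - c)

    def score(self):
        # SAR scores: user affinity times item similarity.
        return self.user_affinity @ self.item_similarity
```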
Consider using nbdiff
The SAR single-node implementation header has pySpark implementation info instead of the Python implementation docstring.
Evaluation metrics from the two implementations do not match.
As per Tao, it would be less confusing if we used a name like rec_utils instead of utilities:
from utilities.recommender.sar.sar_singlenode import SARSingleNodeReference
from utilities.dataset.url_utils import maybe_download
from utilities.dataset.python_splitters import python_random_split
from utilities.evaluation.python_evaluation import PythonRatingEvaluation, PythonRankingEvaluation
https://github.com/Microsoft/Recommenders/blob/staging/notebooks/02_modeling/sar_deep_dive.ipynb
We need to do import pyspark.
Look at the cold-user filtering section in the pySpark SAR notebook and implement the same with pandas in the single-node SAR notebook.
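A pandas sketch of that filtering step (function name and column default are assumptions): keep only test rows whose user also appears in the training set.

```python
import pandas as pd

def filter_cold_users(train, test, col_user="UserId"):
    """Drop test rows for users that never appear in training data."""
    warm_users = set(train[col_user].unique())
    return test[test[col_user].isin(warm_users)]

train = pd.DataFrame({"UserId": ["u1", "u2"], "MovieId": ["m1", "m2"]})
test = pd.DataFrame({"UserId": ["u1", "u3"], "MovieId": ["m2", "m1"]})
warm_test = filter_cold_users(train, test)
```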
This allows running the test cases offline (e.g., on a plane).
The time import and a class method are missing; we somehow left those in Airship. Oops.
Add those back in.
Something in the library turned on info-level logging and is causing a flood of "INFO:py4j.java_gateway:Received command c on object id p0" messages in the console. Please disable it.
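A quick mitigation, assuming the flood comes through Python's standard logging: raise the py4j logger's threshold so the gateway's INFO messages are dropped. The root cause (whatever configured the root logger at INFO level) should still be tracked down.

```python
import logging

# py4j's gateway logs under the "py4j" logger hierarchy; raising its
# level suppresses the "Received command ..." INFO chatter without
# touching other loggers.
logging.getLogger("py4j").setLevel(logging.WARNING)
```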
As the title says, in the SAR pySpark unit tests we should load data directly from remote WASB/HDFS using spark.read.load, instead of going through urllib -> pandas -> spark.DataFrame.
Related to this PR #22
test this notebook
Change the interface to be compatible with sklearn.
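A sketch of the conventions involved. This is a hypothetical wrapper that follows the sklearn estimator contract without importing sklearn itself: the constructor only stores hyperparameters, fit returns self, and get_params/set_params round-trip.

```python
class SARSklearnStyle:
    """Hypothetical sklearn-style wrapper around SAR; only the
    interface is shown, training itself is elided."""

    def __init__(self, top_k=10, similarity_type="jaccard"):
        self.top_k = top_k
        self.similarity_type = similarity_type

    def get_params(self, deep=True):
        return {"top_k": self.top_k, "similarity_type": self.similarity_type}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # Real training (affinity + similarity matrices) would go here.
        self.is_fitted_ = True
        return self

    def recommend_k_items(self, X):
        if not getattr(self, "is_fitted_", False):
            raise RuntimeError("Call fit before recommend_k_items.")
        return []  # placeholder for real recommendations
```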
For a large number of users (>500,000), the SAR pySpark recommend_k_items method thinks there are cold users in the test set, but there are actually only warm users there. This is not reproducible on non-customer data, though; related to #88.
Add a header to all Python files in the repo before releasing.
Some files don't have complete docstrings. For example:
class SparkRatingEvaluation:
"""Spark Rating Evaluator"""
def __init__(
self,
rating_true,
rating_pred,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
col_prediction=PREDICTION_COL,
):
"""Initializer.
Args:
rating_true (spark.DataFrame): True labels.
rating_pred (spark.DataFrame): Predicted labels.
"""
Related to #72. On some machines, using a tolerance of 1e-8, the tests pass, but on others they don't.
We got this error on Prometheus when testing test_sar_singlenode.py:
(py36) miguel@prometheus:~/repos/Recommenders$ pytest tests/unit/test_sar_singlenode.py
=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.6.5, pytest-3.6.4, py-1.7.0, pluggy-0.7.1
rootdir: /home/miguel/repos/Recommenders, inifile:
collected 15 items
tests/unit/test_sar_singlenode.py ...........FFFF [100%]
========================================================================================= FAILURES =========================================================================================
____________________________________________________________________________________ test_user_affinity ____________________________________________________________________________________
demo_usage_data = UserId MovieId Timestamp Rating exponential rating_exponential
0 0003000098E85347 DQF...076
11837 00030000822E3BAE DAF-00448 1.416292e+09 1 0.009076 0.009076
[11838 rows x 6 columns]
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
def test_user_affinity(demo_usage_data, sar_settings, header):
time_now = demo_usage_data[header["col_timestamp"]].max()
model = SARSingleNodeReference(
remove_seen=True,
similarity_type="cooccurrence",
timedecay_formula=True,
time_decay_coefficient=30,
time_now=time_now,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_user_affinity, items = load_affinity(sar_settings["FILE_DIR"] + "user_aff.csv")
user_index = model.user_map_dict[sar_settings["TEST_USER_ID"]]
test_user_affinity = np.reshape(
np.array(
_rearrange_to_test(
model.user_affinity, None, items, None, model.item_map_dict
)[user_index,].todense()
),
-1,
)
> assert np.allclose(
true_user_affinity.astype(test_user_affinity.dtype),
test_user_affinity,
atol=sar_settings["ATOL"],
)
E AssertionError: assert False
E + where False = <function allclose at 0x7f6110e1d730>(array([0. , 0. , 0. , 0. , 0. ,\n 0. , 0. , 0. , 0. ... , 0. , 0. ,\n 0. , 0. , 0.15181286, 1. , 0. ,\n 0. ]), array([0. , 0. , 0. , 0. , 0. ,\n 0. , 0. , 0. , 0. ... , 0. , 0. ,\n 0. , 0. , 0.15195908, 1. , 0. ,\n 0. ]), atol=1e-08)
E + where <function allclose at 0x7f6110e1d730> = np.allclose
E + and array([0. , 0. , 0. , 0. , 0. ,\n 0. , 0. , 0. , 0. ... , 0. , 0. ,\n 0. , 0. , 0.15181286, 1. , 0. ,\n 0. ]) = <built-in method astype of numpy.ndarray object at 0x7f60fc6adee0>(dtype('float64'))
E + where <built-in method astype of numpy.ndarray object at 0x7f60fc6adee0> = array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',\n '0', '0.0221122254449968', '0', '0', '0..., '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',\n '0', '0.151812861826336', '1', '0', '0'], dtype='<U18').astype
E + and dtype('float64') = array([0. , 0. , 0. , 0. , 0. ,\n 0. , 0. , 0. , 0. ... , 0. , 0. ,\n 0. , 0. , 0.15195908, 1. , 0. ,\n 0. ]).dtype
tests/unit/test_sar_singlenode.py:201: AssertionError
___________________________________________________________________________ test_userpred[3-cooccurrence-count] ____________________________________________________________________________
threshold = 3, similarity_type = 'cooccurrence', file = 'count', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data = UserId MovieId Timestamp Rating exponential rating_exponential
0 0003000098E85347 DQF...076
11837 00030000822E3BAE DAF-00448 1.416292e+09 1 0.009076 0.009076
[11838 rows x 6 columns]
@pytest.mark.parametrize(
"threshold,similarity_type,file",
[(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
)
def test_userpred(
threshold, similarity_type, file, header, sar_settings, demo_usage_data
):
time_now = demo_usage_data[header["col_timestamp"]].max()
model = SARSingleNodeReference(
remove_seen=True,
similarity_type=similarity_type,
timedecay_formula=True,
time_decay_coefficient=30,
time_now=time_now,
threshold=threshold,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_items, true_scores = load_userpred(
sar_settings["FILE_DIR"]
+ "userpred_"
+ file
+ str(threshold)
+ "_userid_only.csv"
)
test_results = model.recommend_k_items(
demo_usage_data[
demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
],
top_k=10,
)
test_items = list(test_results[header["col_item"]])
test_scores = np.array(test_results["prediction"])
assert true_items == test_items
> assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E assert False
E + where False = <function allclose at 0x7f6110e1d730>(array([40.96870941, 40.37760085, 19.55002941, 18.10756063, 13.24775154,\n 12.67358812, 12.49898911, 12.0359004 , 10.91842008, 10.91185623]), array([41.00239015, 40.41649126, 19.5650067 , 18.12114858, 13.26051135,\n 12.6742369 , 12.50043289, 12.047493 , 10.92893636, 10.92236618]), atol=1e-08)
E + where <function allclose at 0x7f6110e1d730> = np.allclose
tests/unit/test_sar_singlenode.py:245: AssertionError
_______________________________________________________________________________ test_userpred[3-jaccard-jac] _______________________________________________________________________________
threshold = 3, similarity_type = 'jaccard', file = 'jac', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data = UserId MovieId Timestamp Rating exponential rating_exponential
0 0003000098E85347 DQF...076
11837 00030000822E3BAE DAF-00448 1.416292e+09 1 0.009076 0.009076
[11838 rows x 6 columns]
@pytest.mark.parametrize(
"threshold,similarity_type,file",
[(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
)
def test_userpred(
threshold, similarity_type, file, header, sar_settings, demo_usage_data
):
time_now = demo_usage_data[header["col_timestamp"]].max()
model = SARSingleNodeReference(
remove_seen=True,
similarity_type=similarity_type,
timedecay_formula=True,
time_decay_coefficient=30,
time_now=time_now,
threshold=threshold,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_items, true_scores = load_userpred(
sar_settings["FILE_DIR"]
+ "userpred_"
+ file
+ str(threshold)
+ "_userid_only.csv"
)
test_results = model.recommend_k_items(
demo_usage_data[
demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
],
top_k=10,
)
test_items = list(test_results[header["col_item"]])
test_scores = np.array(test_results["prediction"])
assert true_items == test_items
> assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E assert False
E + where False = <function allclose at 0x7f6110e1d730>(array([0.0616357 , 0.04918001, 0.04247487, 0.04009872, 0.03847229,\n 0.03839772, 0.03251167, 0.02474822, 0.02432458, 0.0224889 ]), array([0.06163639, 0.04921205, 0.04247624, 0.04011545, 0.03848885,\n 0.03843471, 0.0325135 , 0.02477206, 0.02432508, 0.02249099]), atol=1e-08)
E + where <function allclose at 0x7f6110e1d730> = np.allclose
tests/unit/test_sar_singlenode.py:245: AssertionError
________________________________________________________________________________ test_userpred[3-lift-lift] ________________________________________________________________________________
threshold = 3, similarity_type = 'lift', file = 'lift', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data = UserId MovieId Timestamp Rating exponential rating_exponential
0 0003000098E85347 DQF...076
11837 00030000822E3BAE DAF-00448 1.416292e+09 1 0.009076 0.009076
[11838 rows x 6 columns]
@pytest.mark.parametrize(
"threshold,similarity_type,file",
[(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
)
def test_userpred(
threshold, similarity_type, file, header, sar_settings, demo_usage_data
):
time_now = demo_usage_data[header["col_timestamp"]].max()
model = SARSingleNodeReference(
remove_seen=True,
similarity_type=similarity_type,
timedecay_formula=True,
time_decay_coefficient=30,
time_now=time_now,
threshold=threshold,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_items, true_scores = load_userpred(
sar_settings["FILE_DIR"]
+ "userpred_"
+ file
+ str(threshold)
+ "_userid_only.csv"
)
test_results = model.recommend_k_items(
demo_usage_data[
demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
],
top_k=10,
)
test_items = list(test_results[header["col_item"]])
test_scores = np.array(test_results["prediction"])
assert true_items == test_items
> assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E assert False
E + where False = <function allclose at 0x7f6110e1d730>(array([0.00134902, 0.00084695, 0.00072497, 0.00072133, 0.00066855,\n 0.0006003 , 0.00045299, 0.00045202, 0.00041803, 0.00034772]), array([0.00134902, 0.00084696, 0.00072513, 0.00072134, 0.00066871,\n 0.00060031, 0.00045312, 0.00045204, 0.00041804, 0.00034806]), atol=1e-08)
E + where <function allclose at 0x7f6110e1d730> = np.allclose
tests/unit/test_sar_singlenode.py:245: AssertionError
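The differences above are on the order of 1e-2 to 1e-4, so a pure absolute tolerance of 1e-8 can only pass on machines that reproduce the reference scores bit-for-bit. A quick check with the first cooccurrence score from the log (the rtol value used here is a judgment call for illustration, not the repo's setting):

```python
import numpy as np

# First entry of the failing cooccurrence comparison above.
true_score = 40.96870941
test_score = 41.00239015

# With only an absolute tolerance of 1e-8 the comparison fails...
assert not np.isclose(true_score, test_score, rtol=0, atol=1e-8)

# ...while a relative tolerance absorbs cross-platform floating-point drift.
assert np.isclose(true_score, test_score, rtol=1e-2)
```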