recommenders-team / recommenders Goto Github PK

Best Practices on Recommendation Systems

Home Page: https://recommenders-team.github.io/recommenders/intro.html

License: MIT License

Jupyter Notebook 0.40% Python 97.73% Dockerfile 0.56% C++ 0.47% Scala 0.84%

machine-learning recommender ranking deep-learning python jupyter-notebook recommendation-algorithm rating operationalization kubernetes

recommenders's People

Stargazers

Watchers

Forkers

nikhilrj eisber sumedhvdatar wesszumino aaronheee tycooniyke daosanjiao jiamim jizhihang zorrock allensmile hulalazz hfxunlp songxianjin submergerock 10080115 charlottesean yasin0924 karmajun liuyuchen0504 ninghongbo123 eshener danielsc lobby66 hxsylzpf gjwgit jreynolds01 susangzj hunglethanh9 fedorkononov smaticndx robert-joscelyne leochencipher 0xdaksh laanak08 rvaughan wguo123 rtvt123 soraismus ghanima shaunstanislauslau buxxux priteshgudge personalized-news loehndorf arvidkahl rahulsoibam cszhan163 smorin antoniosql sagar2366 hack121 lvzcl arunkumarramanan namaljayathunga awesome-archive chet-sheng hbcbh1999 finesure2017 mozamani kant azurementor giovan shubhampachori12110095 davidmertenjones awalkinclouds tomarraj008 sweetice assassinsurvivor naushadzaman veryfatboy suvratjain1995 bilalmaxood mldeveloper01 deep-brainz rrmina tianyiwangnova efleurine michaelchi08 codeaudit joggyjagz7 cuulee saeedesmaili nikopilibosyan shahidash yhygta dhirendra101 stalinkay leostera chirayukong wangkanger williamtran29 cbentes salah93 yushu-liu peteroxic linyubupa victor8733 chenchaodev saurabh23

recommenders's Issues

Clean up saved sparse matrices / npz files

https://github.com/Microsoft/Recommenders/blob/staging/utilities/recommender/sar/sar_singlenode.py#L325

review pyspark unit tests

Pyspark unit test cicd pipeline is executing also the spark notebooks, this shouldn't happen

https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/results?buildId=1669963&view=logs

python and Spark evaluators use different API for top_k

Spark uses self.k and python uses self.top_k in the init method. The should both be called self.top_k (I guess? lol )

Investigate more similarity metrics for SAR for single node

Apart from having jaccard and lift, we can add mutual information, cosine similarity or inclusion index.

Evaluate the journey between SVD -> SAR / ALS -> SAR - share learnings in notebooks

write learnings into the notebooks themselves
cross links between notebooks when appropriate

Match results from evaluators in different versions

Match metrics of Spark and Python evaluators with large instead of dummy datasets (e.g., Netflix, Movielens, etc.).

BUG: python tests doesn't pass if we use only python environment

when using reco_bare environment and do pytest -m "not notebooks and not spark" tests/unit/, I get in a Mac:

(reco_bare) MININT-JFKQCE5:Recommenders miguel$ pytest -m "not notebooks and not spark" tests/unit/
================== test session starts ===================
platform darwin -- Python 3.6.0, pytest-3.6.4, py-1.6.0, pluggy-0.7.1
rootdir: /Users/miguel/MS/code/Recommenders, inifile:
plugins: pylint-0.11.0, datafiles-2.0, cov-2.6.0
collected 39 items / 3 errors / 3 deselected             

========================= ERRORS =========================
____ ERROR collecting tests/unit/test_sar_pyspark.py _____
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_sar_pyspark.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_sar_pyspark.py:7: in <module>
    from reco_utils.recommender.sar.sar_pyspark import SARpySparkReference
reco_utils/recommender/sar/sar_pyspark.py:13: in <module>
    import pyspark.sql.functions as F
E   ModuleNotFoundError: No module named 'pyspark'
__ ERROR collecting tests/unit/test_spark_evaluation.py __
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_spark_evaluation.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_spark_evaluation.py:7: in <module>
    from reco_utils.evaluation.spark_evaluation import (
reco_utils/evaluation/spark_evaluation.py:5: in <module>
    from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics
E   ModuleNotFoundError: No module named 'pyspark'
___ ERROR collecting tests/unit/test_spark_splitter.py ___
ImportError while importing test module '/Users/miguel/MS/code/Recommenders/tests/unit/test_spark_splitter.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/unit/test_spark_splitter.py:10: in <module>
    from reco_utils.dataset.spark_splitters import spark_chrono_split, spark_random_split
reco_utils/dataset/spark_splitters.py:6: in <module>
    from pyspark.sql import Window
E   ModuleNotFoundError: No module named 'pyspark'
!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!
========= 3 deselected, 3 error in 2.02 seconds ==========

change the test to load spark as session instead of in module

idea was taken from https://engineeringblog.yelp.com/2018/05/pyspark-coding-practices-lessons-learned.html

@pytest.fixture(scope='session')
def spark():
    ....

Add documentation / text content to SAR Notebooks

Add test data sets

Tests on certain modules such as evaluation and split should be performed on a sufficiently large dataset (e.g., Netflix, Movielens-1M, etc.).

SAR unit test configs

SAR unit test configs for both single node and pySpark unit tests are the same - right now they are duplicated in each of the test files. They should be imported from conftest.py

add logging logger to SAR single node unit tests

Add logger in the same style as in the pySpark SAR unit tests file - the two should be consistent.

DS VM pySpark version and library mismatch

As DS VM upgraded to Spark 2.3 the supporting libraries and environment don't work with Spark 2.2 which is required for Airship (note Recommenders). The quick fix is to upgrade virtual env to Spark 2.3. Going forward, we should figure out if we want to keep upgrading Spark versions as DS VM upgrades them OR anchor to specific Spark version with standalone DS VM libs.

SAR has no predict method

The predict method should fill in the SAR score for a given User, Item pair, into a column called prediction. This is needed in order to utilize existing ML Lib libraries.

Write installation guideline INSTALL.md

notebook ALS

Development of SARplus

Sync with @dciborow about the work he's done already in this space

SAR SQL notebook updates

@eisber - can you add your updates to this notebook: https://github.com/Microsoft/Recommenders/blob/staging/notebooks/02_modeling/sar_educational_walkthrough.ipynb

Add environment installation to smoke and integration tests

As of now, the smoke and integration tests run on a pre created environment.

We have to test the full pipeline, this means that, for each environment, python, spark and GPU, we create the conda file, install the environment, execute the tests and then remove the environment.

TODO:

Change names of builds in VSTS
Add different environments for unit tests
Create and delete the environment after smoke and integration tests
Modify README and link statuses

Notebook of model operationalization

To illustrate a model is deployed as a service.

Add tests for new notebooks

data split
sar deep dive
evaluation
ALS deep dive

Review problem with Jupyter kernel in the unit tests

When creating a new cicd system, we had to register a jupyter kernel.

ipykernel install --user --name py36 --display-name "Python (py36)"

python -m ipykernel install --user --name recommender --display-name "Python (recommender)"

We need to review this issue

Make the system robust to different pyspark versions

We had a problem when installing pyspark. The conda file had 2.3.0 but the DSVM had 2.3.1.

We need to find a way to make the installation robust independently to the spark version of the machine.

Recalculate / update SAR user-item affinity matrix or item-item similarity matrix

One suggestion for SAR implementation - SAR currently seems does both user-item affinity matrix calculation and item-item similarity matrix calculation in the same function fit(). Would be good to have them separately in the case we want to re-calculate (update) one of the matrix. Or even better if we have an update function for individual user or item records that only re-calculate the cells related to the user or item from the matrices.

Clean up notebooks in git

Find a way to reduce git changes created by changes in kernel, cell run history.
Make sure notebooks are following correct templates, review templates

Consider using nbdiff

SAR: move SAR SQL implementation with unit test from Airship into Recommenders

Finish README under each notebook sub-directory

SAR single node implementation header has pySpark implementation info

SAR single node implementation header has pySpark implementation info instead of python implementation docstring.

Mismatch between Spark and Python evaluation

Evlauation metrics between from two implementations are not matched.

Rename "utilities" folder to "reco_utils"

As per Tao - it would be less confusing if we use a name like rec_utils instead of utilities.

from utilities.recommender.sar.sar_singlenode import SARSingleNodeReference
from utilities.dataset.url_utils import maybe_download
from utilities.dataset.python_splitters import python_random_split
from utilities.evaluation.python_evaluation import PythonRatingEvaluation, PythonRankingEvaluation

Define and create master PR strategy

Consider what additional smoke tests / other tests we need in place
Timers on integration tests
Consider tests on larger datasets
Test bench on cosmosdb - consider the scaling of each algorithm

bug in first cell deep dive SAR

https://github.com/Microsoft/Recommenders/blob/staging/notebooks/02_modeling/sar_deep_dive.ipynb

we need to do import pyspark

SAR single node notebook: add Cold User filtering section like in the pySpark SAR notebook

Look at the cold user filtering section in pySpark SAR notebook and implement the same with Pandas for SAR on single node notebook.

Migrate test reference data to git lfs

This allows to run the test cases offline (e.g. on a plane)

DEBUG mode is broken for SAR implementations

time import and class method are missing - we somehow left those in airship. Ooops.

Add those back in

INFO log spamming standard out when using SAR on databricks

Something in the library turned on info level logging and is causing a flood of "INFO:py4j.java_gateway:Received command c on object id p0" to come to standard console. Please disable...

SAR pySpark unit tests use urllib and pandas for loaders

As the title says, with pySpark we should be loading using spark.read.load directly on unit test files from remote WASB/HDFS, and not using urllib -> pandas -> spark.DataFrame in the SAR pySpark unit tests.

Review fixtures in SAR unit tests

Related to this PR #22

SAR PySpark Notebook

SAR PySpark notebook working
Include notebook tests as well

class SparkRatingEvaluation:
    """Spark Rating Evaluator"""

    def __init__(
        self,
        rating_true,
        rating_pred,
        col_user=DEFAULT_USER_COL,
        col_item=DEFAULT_ITEM_COL,
        col_rating=DEFAULT_RATING_COL,
        col_prediction=PREDICTION_COL,
    ):
        """Initializer.
        Args:
            rating_true (spark.DataFrame): True labels.
            rating_pred (spark.DataFrame): Predicted labels.
        """

integration tests SAR with big dataset

related to #72

debug issue with pysar test and tolerance

On some machines, using a tolerance of 1e-8, the tests pass, but in others they don't.

We got this error on Prometheus, when testing test_sar_single_node.py:

(py36) miguel@prometheus:~/repos/Recommenders$ pytest tests/unit/test_sar_singlenode.py 
=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.6.5, pytest-3.6.4, py-1.7.0, pluggy-0.7.1
rootdir: /home/miguel/repos/Recommenders, inifile:
collected 15 items                                                                                                                                                                         

tests/unit/test_sar_singlenode.py ...........FFFF                                                                                                                                    [100%]

========================================================================================= FAILURES =========================================================================================
____________________________________________________________________________________ test_user_affinity ____________________________________________________________________________________

demo_usage_data =                  UserId    MovieId     Timestamp  Rating  exponential  rating_exponential
0      0003000098E85347  DQF...076
11837  00030000822E3BAE  DAF-00448  1.416292e+09       1     0.009076            0.009076

[11838 rows x 6 columns]
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}

    def test_user_affinity(demo_usage_data, sar_settings, header):
        time_now = demo_usage_data[header["col_timestamp"]].max()
        model = SARSingleNodeReference(
            remove_seen=True,
            similarity_type="cooccurrence",
            timedecay_formula=True,
            time_decay_coefficient=30,
            time_now=time_now,
            **header
        )
        _apply_sar_hash_index(model, demo_usage_data, None, header)
        model.fit(demo_usage_data)
    
        true_user_affinity, items = load_affinity(sar_settings["FILE_DIR"] + "user_aff.csv")
        user_index = model.user_map_dict[sar_settings["TEST_USER_ID"]]
        test_user_affinity = np.reshape(
            np.array(
                _rearrange_to_test(
                    model.user_affinity, None, items, None, model.item_map_dict
                )[user_index,].todense()
            ),
            -1,
        )
>       assert np.allclose(
            true_user_affinity.astype(test_user_affinity.dtype),
            test_user_affinity,
            atol=sar_settings["ATOL"],
        )
E       AssertionError: assert False
E        +  where False = <function allclose at 0x7f6110e1d730>(array([0.        , 0.        , 0.        , 0.        , 0.        ,\n       0.        , 0.        , 0.        , 0.      ...       , 0.        , 0.        ,\n       0.        , 0.        , 0.15181286, 1.        , 0.        ,\n       0.        ]), array([0.        , 0.        , 0.        , 0.        , 0.        ,\n       0.        , 0.        , 0.        , 0.      ...       , 0.        , 0.        ,\n       0.        , 0.        , 0.15195908, 1.        , 0.        ,\n       0.        ]), atol=1e-08)
E        +    where <function allclose at 0x7f6110e1d730> = np.allclose
E        +    and   array([0.        , 0.        , 0.        , 0.        , 0.        ,\n       0.        , 0.        , 0.        , 0.      ...       , 0.        , 0.        ,\n       0.        , 0.        , 0.15181286, 1.        , 0.        ,\n       0.        ]) = <built-in method astype of numpy.ndarray object at 0x7f60fc6adee0>(dtype('float64'))
E        +      where <built-in method astype of numpy.ndarray object at 0x7f60fc6adee0> = array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',\n       '0', '0.0221122254449968', '0', '0', '0..., '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',\n       '0', '0.151812861826336', '1', '0', '0'], dtype='<U18').astype
E        +      and   dtype('float64') = array([0.        , 0.        , 0.        , 0.        , 0.        ,\n       0.        , 0.        , 0.        , 0.      ...       , 0.        , 0.        ,\n       0.        , 0.        , 0.15195908, 1.        , 0.        ,\n       0.        ]).dtype

tests/unit/test_sar_singlenode.py:201: AssertionError
___________________________________________________________________________ test_userpred[3-cooccurrence-count] ____________________________________________________________________________

threshold = 3, similarity_type = 'cooccurrence', file = 'count', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data =                  UserId    MovieId     Timestamp  Rating  exponential  rating_exponential
0      0003000098E85347  DQF...076
11837  00030000822E3BAE  DAF-00448  1.416292e+09       1     0.009076            0.009076

[11838 rows x 6 columns]

    @pytest.mark.parametrize(
        "threshold,similarity_type,file",
        [(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
    )
    def test_userpred(
        threshold, similarity_type, file, header, sar_settings, demo_usage_data
    ):
        time_now = demo_usage_data[header["col_timestamp"]].max()
        model = SARSingleNodeReference(
            remove_seen=True,
            similarity_type=similarity_type,
            timedecay_formula=True,
            time_decay_coefficient=30,
            time_now=time_now,
            threshold=threshold,
            **header
        )
        _apply_sar_hash_index(model, demo_usage_data, None, header)
        model.fit(demo_usage_data)
    
        true_items, true_scores = load_userpred(
            sar_settings["FILE_DIR"]
            + "userpred_"
            + file
            + str(threshold)
            + "_userid_only.csv"
        )
        test_results = model.recommend_k_items(
            demo_usage_data[
                demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
            ],
            top_k=10,
        )
        test_items = list(test_results[header["col_item"]])
        test_scores = np.array(test_results["prediction"])
        assert true_items == test_items
>       assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E       assert False
E        +  where False = <function allclose at 0x7f6110e1d730>(array([40.96870941, 40.37760085, 19.55002941, 18.10756063, 13.24775154,\n       12.67358812, 12.49898911, 12.0359004 , 10.91842008, 10.91185623]), array([41.00239015, 40.41649126, 19.5650067 , 18.12114858, 13.26051135,\n       12.6742369 , 12.50043289, 12.047493  , 10.92893636, 10.92236618]), atol=1e-08)
E        +    where <function allclose at 0x7f6110e1d730> = np.allclose

tests/unit/test_sar_singlenode.py:245: AssertionError
_______________________________________________________________________________ test_userpred[3-jaccard-jac] _______________________________________________________________________________

threshold = 3, similarity_type = 'jaccard', file = 'jac', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data =                  UserId    MovieId     Timestamp  Rating  exponential  rating_exponential
0      0003000098E85347  DQF...076
11837  00030000822E3BAE  DAF-00448  1.416292e+09       1     0.009076            0.009076

[11838 rows x 6 columns]

    @pytest.mark.parametrize(
        "threshold,similarity_type,file",
        [(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
    )
    def test_userpred(
        threshold, similarity_type, file, header, sar_settings, demo_usage_data
    ):
        time_now = demo_usage_data[header["col_timestamp"]].max()
        model = SARSingleNodeReference(
            remove_seen=True,
            similarity_type=similarity_type,
            timedecay_formula=True,
            time_decay_coefficient=30,
            time_now=time_now,
            threshold=threshold,
            **header
        )
        _apply_sar_hash_index(model, demo_usage_data, None, header)
        model.fit(demo_usage_data)
    
        true_items, true_scores = load_userpred(
            sar_settings["FILE_DIR"]
            + "userpred_"
            + file
            + str(threshold)
            + "_userid_only.csv"
        )
        test_results = model.recommend_k_items(
            demo_usage_data[
                demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
            ],
            top_k=10,
        )
        test_items = list(test_results[header["col_item"]])
        test_scores = np.array(test_results["prediction"])
        assert true_items == test_items
>       assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E       assert False
E        +  where False = <function allclose at 0x7f6110e1d730>(array([0.0616357 , 0.04918001, 0.04247487, 0.04009872, 0.03847229,\n       0.03839772, 0.03251167, 0.02474822, 0.02432458, 0.0224889 ]), array([0.06163639, 0.04921205, 0.04247624, 0.04011545, 0.03848885,\n       0.03843471, 0.0325135 , 0.02477206, 0.02432508, 0.02249099]), atol=1e-08)
E        +    where <function allclose at 0x7f6110e1d730> = np.allclose

tests/unit/test_sar_singlenode.py:245: AssertionError
________________________________________________________________________________ test_userpred[3-lift-lift] ________________________________________________________________________________

threshold = 3, similarity_type = 'lift', file = 'lift', header = {'col_item': 'MovieId', 'col_rating': 'Rating', 'col_timestamp': 'Timestamp', 'col_user': 'UserId'}
sar_settings = {'ATOL': 1e-08, 'FILE_DIR': 'http://recodatasets.blob.core.windows.net/sarunittest/', 'TEST_USER_ID': '0003000098E85347'}
demo_usage_data =                  UserId    MovieId     Timestamp  Rating  exponential  rating_exponential
0      0003000098E85347  DQF...076
11837  00030000822E3BAE  DAF-00448  1.416292e+09       1     0.009076            0.009076

[11838 rows x 6 columns]

    @pytest.mark.parametrize(
        "threshold,similarity_type,file",
        [(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
    )
    def test_userpred(
        threshold, similarity_type, file, header, sar_settings, demo_usage_data
    ):
        time_now = demo_usage_data[header["col_timestamp"]].max()
        model = SARSingleNodeReference(
            remove_seen=True,
            similarity_type=similarity_type,
            timedecay_formula=True,
            time_decay_coefficient=30,
            time_now=time_now,
            threshold=threshold,
            **header
        )
        _apply_sar_hash_index(model, demo_usage_data, None, header)
        model.fit(demo_usage_data)
    
        true_items, true_scores = load_userpred(
            sar_settings["FILE_DIR"]
            + "userpred_"
            + file
            + str(threshold)
            + "_userid_only.csv"
        )
        test_results = model.recommend_k_items(
            demo_usage_data[
                demo_usage_data[header["col_user"]] == sar_settings["TEST_USER_ID"]
            ],
            top_k=10,
        )
        test_items = list(test_results[header["col_item"]])
        test_scores = np.array(test_results["prediction"])
        assert true_items == test_items
>       assert np.allclose(true_scores, test_scores, atol=sar_settings["ATOL"])
E       assert False
E        +  where False = <function allclose at 0x7f6110e1d730>(array([0.00134902, 0.00084695, 0.00072497, 0.00072133, 0.00066855,\n       0.0006003 , 0.00045299, 0.00045202, 0.00041803, 0.00034772]), array([0.00134902, 0.00084696, 0.00072513, 0.00072134, 0.00066871,\n       0.00060031, 0.00045312, 0.00045204, 0.00041804, 0.00034806]), atol=1e-08)
E        +    where <function allclose at 0x7f6110e1d730> = np.allclose

tests/unit/test_sar_singlenode.py:245: AssertionError

recommenders-team / recommenders Goto Github PK

recommenders's People

Stargazers

Watchers

Forkers

recommenders's Issues

Recommend Projects

Recommend Topics

Recommend Org