
hello-ltr's Introduction

Hello LTR :)

The overall goal of this project is to demonstrate all the steps required to work with LTR in Elasticsearch, Solr, or OpenSearch. There are two modes of running this project: you can run and edit notebooks in a Docker container, or you can do local development on the notebooks and connect to the search engine(s) running in Docker.

No-fuss setup: you just want to play with LTR

Follow these steps if you're just playing around and are OK with possibly losing some work (all notebooks exist only in the Docker container)

With Docker and docker-compose installed, simply run

docker-compose up

from the root directory and go to town!

This will run Jupyter and all the search engines in Docker containers. Check that each is up at its default port:

  • Solr: localhost:8983
  • Elasticsearch: localhost:9200
  • Kibana: localhost:5601
  • OpenSearch: localhost:9201
  • OpenSearch Dashboards: localhost:5602
  • Jupyter: localhost:8888

You want to build your own LTR notebooks

Follow these steps if you want to do more serious work with the notebooks, for example to build a demo with your work's data or something you want to preserve later.

Run your search engine with Docker

You probably just want to work with one search engine. So whichever one you're working with, launch that search engine in Docker.

Running Solr w/ LTR

Set up Solr with docker compose to work with just the Solr examples:

cd notebooks/solr
docker-compose up

Running Elasticsearch w/ LTR

Set up Elasticsearch with docker compose to work with just the Elasticsearch examples:

cd notebooks/elasticsearch
docker-compose up

Running OpenSearch w/ LTR

Set up OpenSearch with docker compose to work with just the OpenSearch examples:

cd notebooks/opensearch
docker-compose up

Run Jupyter locally w/ Python 3 and all prereqs

Setup Python requirements

  • Ensure Python 3.7 or later is installed on your system
  • Create a virtual environment: python3 -m venv venv
  • Start the virtual environment: source venv/bin/activate
  • Make sure the install tooling is up to date: python -m pip install -U pip wheel setuptools
  • Install the requirements: pip install -r requirements.txt

Note: The above commands should be run from the root folder of the project.

Start Jupyter notebook and confirm operation

  • Run jupyter notebook
  • Browse to notebooks/{search_engine}/{collection}
  • Open either the "hello-ltr (Solr)" or "hello-ltr (ES)" notebook as appropriate and ensure you get a graph at the last cell

Tests

Automatically run everything...

NB: On macOS it may be necessary to increase the number of open files above the default of 256 for the tests to complete successfully. Use:

$ ulimit -n 4096

to increase the value to a sensible amount.

To run a full suite of tests, such as to verify a PR, you can simply run

./tests/test.sh

Optionally with containers rebuilt

./tests/test.sh --rebuild-containers

Failing tests will have their output in tests/last_run.ipynb

You can test one or more engines by specifying a comma delimited list: ./tests/test.sh --engines=solr,opensearch,elasticsearch

While developing...

For more informal development:

  • Start up the Solr and ES Docker containers
  • Do your development
  • Run the command as needed: python tests/run_most_nbs.py
  • Tests fail if notebooks return any errors
    • The failing notebook will be stored at tests/last_run.ipynb

hello-ltr's People

Contributors

binarymax, david-fisher, dependabot[bot], digital-thinking, dmitrykey, macohen, nathancday, risdenk, softwaredoug, tanjie123, worleydl, wrigleydan


hello-ltr's Issues

Output start time & end time for ltr training methods

Training can take a while, and it's helpful to know how long it took in order to assess performance.

We should add start, end, and duration print output (or use tqdm) for the train, kcv, and feature_search methods in the ltr.ranklib module.
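A minimal sketch of what that could look like, assuming a hypothetical timed decorator wrapped around the training helpers in ltr.ranklib:

import functools
import time
from datetime import datetime

# Hypothetical decorator: prints start time, end time, and duration for
# training helpers such as train, kcv, and feature_search.
def timed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        print(f"[{fn.__name__}] started at {datetime.now():%H:%M:%S}")
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"[{fn.__name__}] finished at {datetime.now():%H:%M:%S} ({elapsed:.1f}s)")
    return wrapper

Applying @timed to train, kcv, and feature_search would cover the three methods named above without touching their bodies.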

FileNotFoundError: [Errno 2] No such file or directory: 'data/latest_model.txt'

Hi,
I am trying to run it on Ubuntu 18.04 as described in the readme file. When I run this code in hello-ltr (ES)

from ltr.ranklib import train
train(client, training_set=latest_training_set, 
      index='tmdb', featureSet='release', modelName='latest')
train(client, training_set=classic_training_set, 
      index='tmdb', featureSet='release', modelName='classic')

I am getting this error

/tmp/RankyMcRankFace.jar already exists
Running java -jar /tmp/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /tmp/training.txt -save data/latest_model.txt 
DONE

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-8-61d89782267a> in <module>
      1 from ltr.ranklib import train
      2 train(client, training_set=latest_training_set, 
----> 3       index='tmdb', featureSet='release', modelName='latest')
      4 train(client, training_set=classic_training_set, 
      5       index='tmdb', featureSet='release', modelName='classic')

~/Documents/hello-ltr/ltr/ranklib.py in train(client, training_set, modelName, featureSet, index, features, metric2t, leafs, trees, frate, srate, bag, ranker, shrinkage)
     80                                trees=trees,
     81                                shrinkage=shrinkage)
---> 82     save_model(client, modelName, modelFile, index, featureSet)
     83     assert len(ranklibResult.trainingLogs) == 1
     84     return ranklibResult.trainingLogs[0]

~/Documents/hello-ltr/ltr/ranklib.py in save_model(client, modelName, modelFile, index, featureSet)
     56 
     57 def save_model(client, modelName, modelFile, index, featureSet):
---> 58     with open(modelFile) as src:
     59         definition = src.read()
     60         client.submit_ranklib_model(featureSet, index, modelName, definition)

FileNotFoundError: [Errno 2] No such file or directory: 'data/latest_model.txt'

Any suggestion to avoid this error?

elasticsearch client version in requirements.txt should match Dockerfile elasticsearch server version

The client module version should match the server's installed version.

Currently, requirements.txt pins elasticsearch==7.0.0

But in notebooks/elasticsearch/.docker/es-docker/Dockerfile, the Elasticsearch version is 7.6.2

The latest compatible client version is elasticsearch==7.6.0, and if the Dockerfile version changes (to, for example, 7.9.x), an appropriate client version needs to be matched and explicitly specified in requirements.txt.
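As a guard against future drift, a notebook cell could compare the installed client with the running server; a sketch, assuming a standard elasticsearch-py 7.x client named es:

import elasticsearch

# Warn when client and server disagree on major.minor version.
def check_versions(es):
    client_ver = elasticsearch.__versionstr__       # e.g. '7.6.0'
    server_ver = es.info()['version']['number']     # e.g. '7.6.2'
    if client_ver.split('.')[:2] != server_ver.split('.')[:2]:
        print(f"WARNING: client {client_ver} vs server {server_ver}; "
              "pin a matching client in requirements.txt")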

Getting Started Guide

Hi,
Thanks for making this public. Do you have a getting started guide? I am not able to follow the readme file correctly. It would be great if you could share one.

TermStatQuery in OpenSearch notebook not working

This might rather be a bug in the OpenSearch plugin than a bug in this repository, so I'm mainly posting this here for visibility.

In the notebook notebooks/opensearch/tmdb/term-stat-query.ipynb in step two, the feature tsq_expr_title_tfidf cannot be logged and an exception is thrown:

---------------------------------------------------------------------------
RequestError                              Traceback (most recent call last)
Input In [3], in <cell line: 6>()
      6 with judgments_open('data/title_judgments.txt') as judgment_list:
      7     for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
----> 8         ftr_logger.log_for_qid(judgments=query_judgments, 
      9                                qid=qid,
     10                                keywords=judgment_list.keywords(qid))
     12 df = judgments_to_dataframe(ftr_logger.logged)
     13 df

File ~/IdeaProjects/OpenSearchWork/hello-ltr/ltr/log.py:56, in FeatureLogger.log_for_qid(self, qid, judgments, keywords)
     48 keywords = re.sub('([^\s\w]|_)+', '', keywords)
     50 params = {
     51     "keywords": keywords,
     52     "fuzzy_keywords": ' '.join([x + '~' for x in keywords.split(' ')]),
     53     "keywordsList": [keywords] # Needed by TSQ for the time being
     54 }
---> 56 res = self.client.log_query(self.index, self.feature_set, ids, params)
     58 # Add feature back to each judgment
     59 for doc in res:

File ~/IdeaProjects/OpenSearchWork/hello-ltr/ltr/client/opensearch_client.py:145, in OpenSearchClient.log_query(self, index, featureset, ids, params)
    142 if ids is not None:
    143     params["query"]["bool"]["must"] = terms_query
--> 145 resp = self.es.search(index=index, body=params)
    146 # resp_msg(msg="Searching {} - {}".format(index, str(terms_query)[:20]), resp=SearchResp(resp))
    148 matches = []

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/client/utils.py:177, in query_params.<locals>._wrapper.<locals>._wrapped(*args, **kwargs)
    175     if p in kwargs:
    176         params[p] = kwargs.pop(p)
--> 177 return func(*args, params=params, headers=headers, **kwargs)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/client/__init__.py:1544, in OpenSearch.search(self, body, index, params, headers)
   1541 if "from_" in params:
   1542     params["from"] = params.pop("from_")
-> 1544 return self.transport.perform_request(
   1545     "POST",
   1546     _make_path(index, "_search"),
   1547     params=params,
   1548     headers=headers,
   1549     body=body,
   1550 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/transport.py:407, in Transport.perform_request(self, method, url, headers, params, body)
    405             raise e
    406     else:
--> 407         raise e
    409 else:
    410     # connection didn't fail, confirm it's live status
    411     self.connection_pool.mark_live(connection)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/transport.py:368, in Transport.perform_request(self, method, url, headers, params, body)
    365 connection = self.get_connection()
    367 try:
--> 368     status, headers_response, data = connection.perform_request(
    369         method,
    370         url,
    371         params,
    372         body,
    373         headers=headers,
    374         ignore=ignore,
    375         timeout=timeout,
    376     )
    378     # Lowercase all the header names for consistency in accessing them.
    379     headers_response = {
    380         header.lower(): value for header, value in headers_response.items()
    381     }

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/connection/http_urllib3.py:275, in Urllib3HttpConnection.perform_request(self, method, url, params, body, timeout, ignore, headers)
    271 if not (200 <= response.status < 300) and response.status not in ignore:
    272     self.log_request_fail(
    273         method, full_url, url, orig_body, duration, response.status, raw_data
    274     )
--> 275     self._raise_error(
    276         response.status,
    277         raw_data,
    278         self.get_response_headers(response).get("content-type"),
    279     )
    281 self.log_request_success(
    282     method, full_url, url, orig_body, response.status, raw_data, duration
    283 )
    285 return response.status, response.getheaders(), raw_data

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/opensearchpy/connection/base.py:300, in Connection._raise_error(self, status_code, raw_data, content_type)
    297 except (ValueError, TypeError) as err:
    298     logger.warning("Undecodable raw error response from server: %s", err)
--> 300 raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
    301     status_code, error_message, additional_info
    302 )

RequestError: RequestError(400, 'search_phase_execution_exception', 'Cannot create query while parsing feature [tsq_expr_title_tfidf]')

Removing the feature and running the notebook prevents this error from happening. The feature works in the ES version of the notebook, so this likely is a bug in the OS plugin.

Judgement ctor requires doc id to be int

self.docId = str(int(docId)) # To force ValueError

In our product we use string doc ids, which is more convenient and flexible than ints.
Because of the requirement above, I had to enumerate documents manually before indexing in Solr.

I propose relaxing the requirement to strings, if there are no issues with that in downstream components.
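A sketch of the proposed relaxation (coerce_doc_id is a hypothetical helper, not the shipped code):

# Relax the doc id requirement from int to any non-empty string,
# replacing the str(int(docId)) round-trip in the Judgment constructor.
def coerce_doc_id(docId):
    doc_id = str(docId).strip()
    if not doc_id:
        raise ValueError("docId must be a non-empty string")
    return doc_id

# In Judgment.__init__:  self.docId = coerce_doc_id(docId)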

Add code for test/train split

KCV is nice, but we really need a test/validation set example to show folks the benefits of using hold-out data.
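A sketch of what such an example could start from, holding out whole queries rather than individual rows so no query id leaks between sets (assumes a list of judgments with a .qid attribute, as produced by ltr.judgments):

import random

# Split judgments into train/test by query id so each query's rows
# land entirely on one side of the split.
def train_test_split_by_qid(judgments, test_frac=0.2, seed=42):
    judgments = list(judgments)
    qids = sorted({j.qid for j in judgments})
    random.Random(seed).shuffle(qids)
    test_qids = set(qids[:int(len(qids) * test_frac)])
    train = [j for j in judgments if j.qid not in test_qids]
    test = [j for j in judgments if j.qid in test_qids]
    return train, test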

Starting elasticsearch container failed

Hi there,

I got the following error:

bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
bootstrap checks failed

I guess it's probably related to this (I'm not an es expert):
elastic/elasticsearch#19987

I fixed it by setting:
transport.host: localhost

Cheers.

Change print statements to logging

There are a lot of helpful print statements throughout the code that we should change to logging. I would prefer using the built-in Python logging library to cut down on dependencies, but I'm open to other options.

It should be possible to switch logging on/off in a notebook to see what's happening. Logging should go to the notebook itself by default (i.e. stdout, like a print statement).

I'm not too stressed about the right granularity right now; if everything were "INFO" or "ERROR", that'd be awesome, so long as we get things into logging statements...
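One way this could look in a notebook, as a sketch using only the standard library:

import logging
import sys

# Route ltr's output through stdlib logging, sent to stdout so it shows
# up in the notebook just like the current print statements.
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("ltr")

log.info("Indexed 100 movies")                      # replaces a print(...)
logging.getLogger("ltr").setLevel(logging.ERROR)    # switch off from a notebook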

msmarco downloads could get duplicated between solr and elastic

Right now the downloads go to a data/ folder that is a child of the respective search engine. This could cause two copies of the same large data files to be downloaded if users are switching between engines.

Should we move the msmarco data storage location up higher so it could be easily shared? Or check multiple locations for existence before downloading?
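A sketch of the second option, checking a few candidate locations before downloading (paths illustrative):

from pathlib import Path

# Directories where a corpus may already have been downloaded.
CANDIDATE_DIRS = [Path("data"),
                  Path("notebooks/solr/data"),
                  Path("notebooks/elasticsearch/data")]

def find_existing(filename):
    for d in CANDIDATE_DIRS:
        if (d / filename).exists():
            return d / filename
    return None   # not found anywhere; download into a shared data/ dir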

Score explains

It would be useful for debugging model performance to include the explain output in the logged features. This would be similar to Splainer in terms of showing exactly where the matching occurred and what values were produced.
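A sketch of where that explain output could come from, using the stock Elasticsearch explain API (elasticsearch-py 7.x style; wiring the result into the feature logger is the open work):

# Fetch the scoring explanation for one document/query pair.
def explain_doc(es, index, doc_id, query):
    resp = es.explain(index=index, id=doc_id, body={"query": query})
    return resp["explanation"]

# e.g. explain_doc(es, "tmdb", "1368", {"match": {"title": "star wars"}})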

Uploading model to elasticsearch 7.9.3 fails

When trying the following request as per the docs here: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/learning-to-rank.html
POST _ltr/_featureset/movie_features/_createmodel

{
  "model": {
    "name": "my_ranklib_model",
    "model": {
      "type": "model/ranklib+json",
      "definition": "<ensemble>
   <tree id="1" weight="0.1">
      <split>
         <feature>1</feature>
         <threshold>10.357876</threshold>
         <split pos="left">
            <feature>1</feature>
            <threshold>0.0</threshold>
            <split pos="left">
               <output>-2.0</output>
            </split>
            <split pos="right">
               <feature>1</feature>
               <threshold>7.0105133</threshold>
               <split pos="left">
                  <output>-2.0</output>
               </split>
               <split pos="right">
                  <output>-2.0</output>
               </split>
            </split>
         </split>
         <split pos="right">
            <output>2.0</output>
         </split>
      </split>
   </tree>
   <tree id="2" weight="0.1">
      <split>
         <feature>1</feature>
         <threshold>10.357876</threshold>
         <split pos="left">
            <feature>1</feature>
            <threshold>0.0</threshold>
            <split pos="left">
               <output>-1.67031991481781</output>
            </split>
            <split pos="right">
               <feature>1</feature>
               <threshold>7.0105133</threshold>
               <split pos="left">
                  <output>-1.67031991481781</output>
               </split>
               <split pos="right">
                  <output>-1.6703200340270996</output>
               </split>
            </split>
         </split>
         <split pos="right">
            <output>1.6703201532363892</output>
         </split>
      </split>
   </tree>
   <tree id="3" weight="0.1">
      <split>
         <feature>2</feature>
         <threshold>10.573917</threshold>
         <split pos="left">
            <output>1.479954481124878</output>
         </split>
         <split pos="right">
            <feature>1</feature>
            <threshold>7.0105133</threshold>
            <split pos="left">
               <feature>1</feature>
               <threshold>0.0</threshold>
               <split pos="left">
                  <output>-1.4799546003341675</output>
               </split>
               <split pos="right">
                  <output>-1.479954481124878</output>
               </split>
            </split>
            <split pos="right">
               <output>-1.479954481124878</output>
            </split>
         </split>
      </split>
   </tree>
   <tree id="4" weight="0.1">
      <split>
         <feature>1</feature>
         <threshold>10.357876</threshold>
         <split pos="left">
            <feature>1</feature>
            <threshold>0.0</threshold>
            <split pos="left">
               <output>-1.3569872379302979</output>
            </split>
            <split pos="right">
               <feature>1</feature>
               <threshold>7.0105133</threshold>
               <split pos="left">
                  <output>-1.3569872379302979</output>
               </split>
               <split pos="right">
                  <output>-1.3569872379302979</output>
               </split>
            </split>
         </split>
         <split pos="right">
            <output>1.3569873571395874</output>
         </split>
      </split>
   </tree>
   <tree id="5" weight="0.1">
      <split>
         <feature>1</feature>
         <threshold>10.357876</threshold>
         <split pos="left">
            <feature>1</feature>
            <threshold>0.0</threshold>
            <split pos="left">
               <output>-1.2721362113952637</output>
            </split>
            <split pos="right">
               <feature>1</feature>
               <threshold>7.0105133</threshold>
               <split pos="left">
                  <output>-1.2721363306045532</output>
               </split>
               <split pos="right">
                  <output>-1.2721363306045532</output>
               </split>
            </split>
         </split>
         <split pos="right">
            <output>1.2721362113952637</output>
         </split>
      </split>
   </tree>
</ensemble>"
    }
  }
}

I get the following error back because this is not valid JSON:

{
    "error": {
        "root_cause": [
            {
                "type": "x_content_parse_exception",
                "reason": "[6:21] [model] failed to parse field [definition]"
            }
        ],
        "type": "x_content_parse_exception",
        "reason": "[6:21] [create_model_from_set] failed to parse field [model]",
        "caused_by": {
            "type": "x_content_parse_exception",
            "reason": "[6:21] [model] failed to parse field [model]",
            "caused_by": {
                "type": "x_content_parse_exception",
                "reason": "[6:21] [model] failed to parse field [definition]",
                "caused_by": {
                    "type": "json_parse_exception",
                    "reason": "Illegal unquoted character ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in string value\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 6, column: 33]"
                }
            }
        }
    },
    "status": 400
}

Clearly what I'm trying to submit is not valid JSON, but the docs seem to suggest this should work. I'm unsure whether this is an issue with later versions of Elastic being stricter, something wrong in the docs, or just me misunderstanding.
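The newlines and double quotes inside the XML are what the parser rejects; they have to be escaped before the definition is embedded in the JSON body. A sketch of doing that from Python with json.dumps (file path and model name taken from the examples above):

import json

with open("data/latest_model.txt") as src:
    definition = src.read()            # raw Ranklib XML, full of newlines

body = {
    "model": {
        "name": "my_ranklib_model",
        "model": {
            # for raw Ranklib XML the plugin docs use "model/ranklib";
            # "model/ranklib+json" expects a JSON wrapper instead
            "type": "model/ranklib",
            "definition": definition,
        },
    }
}
payload = json.dumps(body)   # escapes the \n and " characters the parser choked on
# POST payload to _ltr/_featureset/movie_features/_createmodel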

Consistent versioning/duplication clean up

We now have two Dockerfiles for setting up an Elasticsearch env, and the versions differ between them. We need to remove the duplicates, get everything on one version, and make sure all the documentation points to one place.

Typo in netfix movies (Solr) feature

In the netfix movies notebook for Solr, the following feature is used:

    #1
    {
      "name" : "title_has_phrase",
      "store": "title2",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "title:\"${keywords})\"^=1"
      }
    },

Note the hanging paren in "title:\"${keywords})\"...
This isn't a problem, as the ) gets stripped out during analysis, but it's worth a cleanup.

Solr: unknown field 'poster_path' When indexing tmdb data

Deleted index tmdb [Status: 400]
{
  "responseHeader":{
    "status":400,
    "QTime":28},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Cannot unload non-existent core [tmdb]",
    "code":400}}

Created index tmdb [Status: 200]
Reindexing...
Indexed 0 movies (last Black Mirror: White Christmas)
Indexed 100 movies (last Apocalypse Now)
Indexed 200 movies (last Crooks in Clover)
Indexed 300 movies (last For a Few Dollars More)
Indexed 400 movies (last Downfall)
Flushing 500 docs
Done [Status: 400]
{
  "responseHeader":{
    "status":400,
    "QTime":50},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"ERROR: [doc=374430] unknown field 'poster_path'",
    "code":400}}

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-caf912dfad7c> in <module>
      1 from ltr.index import rebuild_tmdb
----> 2 rebuild_tmdb(client)

~/search/hello-ltr/ltr/index.py in rebuild_tmdb(client, enrich)
     23 def rebuild_tmdb(client, enrich=noop):
     24     movies=indexable_movies(enrich=enrich)
---> 25     rebuild(client, index='tmdb', doc_type='movie', doc_src=movies)

~/search/hello-ltr/ltr/index.py in rebuild(client, index, doc_type, doc_src)
     16     client.index_documents(index,
     17                            doc_type=doc_type,
---> 18                            doc_src=doc_src)
     19 
     20     print('Done')

~/search/hello-ltr/ltr/client/solr_client.py in index_documents(self, index, doc_type, doc_src)
     59 
     60             if len(docs) % BATCH_SIZE == 0:
---> 61                 flush(docs)
     62 
     63         flush(docs)

~/search/hello-ltr/ltr/client/solr_client.py in flush(docs)
     47             resp = requests.post('{}/{}/update?commitWithin=1500'.format(
     48                 self.solr_base_ep, index), json=docs)
---> 49             resp_msg(msg="Done", resp=resp)
     50             docs.clear()
     51 

~/search/hello-ltr/ltr/helpers/handle_resp.py in resp_msg(msg, resp, throw)
      6         print(resp.text)
      7         if throw:
----> 8             raise RuntimeError(resp.text)
      9 

RuntimeError: {
  "responseHeader":{
    "status":400,
    "QTime":50},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"ERROR: [doc=374430] unknown field 'poster_path'",
    "code":400}}

Failure Running /tests/test.sh: TypeError: 'coroutine' object is not subscriptable

I checked out a fresh copy of this repo on MacOS, ran ./tests/test.sh and got the failure below.

Is this a known issue and reproducible? I'm happy to take a deeper look with any guidance/feedback on this.

Thanks,
Mark


================================================
== RUN TESTS: 
== tests/run_most_nbs.py 
..EXECUTING NBS IN DIRECTORY: ./notebooks/
Running... ./notebooks/conversion-augmented-click-models.ipynb
/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/debugpy/_vendored/pydevd/pydevd_plugins/extensions/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pydevd_plugins.extensions')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  __import__('pkg_resources').declare_namespace(__name__)
/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pydevd_plugins')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(parent)
/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/debugpy/_vendored/pydevd/pydevd_plugins/extensions/types/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pydevd_plugins.extensions.types')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  __import__('pkg_resources').declare_namespace(__name__)
/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pydevd_plugins.extensions')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(parent)
E/usr/local/Cellar/[email protected]/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/tracemalloc.py:401: ResourceWarning: unclosed context <zmq.asyncio.Context() at 0x10af9d590>
  class DomainFilter(BaseFilter):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/local/Cellar/[email protected]/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py:686: ResourceWarning: unclosed event loop <_UnixSelectorEventLoop running=False closed=False debug=False>
  _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/local/Cellar/[email protected]/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/unittest/case.py:614: RuntimeWarning: coroutine 'ZMQSocketChannel.get_msg' was never awaited
  outcome.errors.clear()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
.
======================================================================
ERROR: test_for_no_errors (__main__.RunMostNotebooksTestCase)
Run all nbs in directories at test_paths()
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests/notebook_test_case.py", line 42, in test_for_no_errors
    nb, errors = runner.run_notebook(nb, save_nb_path=NotebooksTestCase.SAVE_NB_PATH)
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests/runner.py", line 22, in run_notebook
    proc.preprocess(nb, {'metadata': {'path': dirname}})
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 405, in preprocess
    nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/nbconvert/preprocessors/base.py", line 69, in preprocess
    nb.cells[index], resources = self.preprocess_cell(cell, resources, index)
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 438, in preprocess_cell
    reply, outputs = self.run_cell(cell, cell_index, store_history)
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 578, in run_cell
    exec_reply = self._poll_for_reply(parent_msg_id, cell, timeout)
  File "/Users/macohen/IdeaProjects/OpenSearchWork/hello-ltrfresh/tests_venv/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py", line 479, in _poll_for_reply
    if msg['parent_header'].get('msg_id') == msg_id:
TypeError: 'coroutine' object is not subscriptable

----------------------------------------------------------------------
Ran 4 tests in 4.219s

FAILED (errors=1)
================================================
== TEARDOWN 
[+] Running 3/3
 ⠿ Container solr-solr-1  Removed                                                                                                                             1.8s
 ⠿ Volume solr_data       Removed                                                                                                                             0.0s
 ⠿ Network solr_default   Removed                                                                                                                             0.1s
[+] Running 4/4
 ⠿ Container elasticsearch-elasticsearch-1  Removed                                                                                                           6.7s
 ⠿ Container elasticsearch-kibana-1         Removed                                                                                                           2.5s
 ⠿ Volume elasticsearch_tlre-es-data        Removed                                                                                                           0.0s
 ⠿ Network elasticsearch_default            Removed                                                                                                           0.8s
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
================================================
> POOP!    Tests Failed 💩 For:
commit c101343aef9d4c43a1bacdc82f98089347e16553 (HEAD -> main, origin/main, origin/HEAD)
Author: David Fisher <[email protected]>
Date:   Mon Feb 20 15:38:19 2023 -0500

    Fix labs (#87)
    
    * fix typos and bugs that got fixed while running live, but never fixed in the repo.
    
    * column is named clicks, not downloads
    
    * remove visualization exercise that isn't complete
    
    * copy the data directory too...
    
    * remove query cell that isn't part of the exercise.
    
    * Clean up and excise character name items to synchronize with the ES version of this exercise.
    
    * update lesson to not be fill in the blanks
    
    ---------
    
    Co-authored-by: David Fisher <[email protected]>
================================================
 ===============================================
 HELLO-LTR TEST DETAILS
 Containers Rebuilt? false
 Test Command: tests/run_most_nbs.py
================================================

Re-consider Tale of Two Queries notebook intent

This could be as simple as better highlighting the issue we are trying to show (imbalanced training data, missing use cases), or it could be overhauled into a new notebook altogether. Right now it falls a little flat in the class cadence.

Clean out 404 for "index doesn't exist" from rebuild()

This gets me every time I come back to this code after a while:

Reconfig from disk...
Deleted index tmdb [Status: 404]
{
  "error": {
    "root_cause": [
      {
        "type": "index_not_found_exception",
        "reason": "no such index [tmdb]",
        "resource.type": "index_or_alias",
        "resource.id": "tmdb",
        "index_uuid": "_na_",
        "index": "tmdb"
      }
    ],
    ...

In the Elasticsearch calls we can be more graceful and handle this silently, so I don't see the 404 and worry.
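A sketch of the graceful version, assuming elasticsearch-py 7.x, where delete can be told to treat 404 as a non-error:

# Delete an index without surfacing a scary 404 when it doesn't exist.
def delete_index_quietly(es, index):
    es.indices.delete(index=index, ignore=[400, 404])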

Disable dependabot

In our PR party, we discussed disabling dependabot on this repo

  • We don't have tests (though we might look into adding some)
  • This is a set of demos, not prod ready...

Prevent re-indexing when index already exists

Thinking of adding a rebuild=False or force=False argument to the index functions that would skip indexing when an identically named index already exists. This would save time from accidental reindexing when moving between notebooks. A user could still trigger a fresh rebuild when they wanted to, and they would be messaged about the skipping behavior.
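A sketch of the proposed guard (create_index is hypothetical here; the other client methods appear in the tracebacks elsewhere on this page):

# Skip the rebuild when the index already exists, unless forced.
def rebuild(client, index, doc_src, force=False):
    if client.check_index_exists(index):
        if not force:
            print(f"Index '{index}' already exists; pass force=True to rebuild")
            return
        client.delete_index(index)
    client.create_index(index)
    client.index_documents(index, doc_src=doc_src)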

ValueError: could not convert string to float: ' 6.439472E '

Hi,
I am getting the following error while training the model.

/tmp/RankyMcRankFace.jar already exists
Running java -jar /tmp/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /tmp/training.txt -save /home/laxmi/Documents/hello-ltr/notebooks/elasticsearch/prods/data/title_model.txt 
DONE

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-124-6037c914b727> in <module>
      4                   index='products',
      5                   featureSet='prods',
----> 6                   modelName='title')

~/Documents/hello-ltr/ltr/ranklib.py in train(client, training_set, modelName, featureSet, index, features, metric2t, leafs, trees, frate, srate, bag, ranker, shrinkage)
     89                                frate=frate,
     90                                trees=trees,
---> 91                                shrinkage=shrinkage)
     92 
     93     # print('Saving model: ', modelFile)

~/Documents/hello-ltr/ltr/ranklib.py in trainModel(training_set, out, features, kcv, ranker, leafs, trees, frate, shrinkage, srate, bag, metric2t)
     55     # print('result: ', result)
     56     # print('parsed result', parse_training_log(result))
---> 57     return parse_training_log(result)
     58 
     59 def save_model(client, modelName, modelFile, index, featureSet):

~/Documents/hello-ltr/ltr/helpers/ranklib_result.py in parse_training_log(rawResult)
     82             if m:
     83                 values = line.split('|')
---> 84                 metricTrain = float(values[1])
     85                 rounds.append(metricTrain)
     86             m = re.match(trainMetricRe, line)

ValueError: could not convert string to float: ' 6.439472E '

I have around 3000 judgments. It works for some files but not for others.

Any suggestions to fix it?
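Without knowing why Ranklib emitted that truncated value, a defensive workaround in parse_training_log could skip unparseable metric cells rather than crash (a sketch, not the shipped fix):

# Parse a metric cell from the Ranklib training log, tolerating junk.
def parse_metric(cell):
    try:
        return float(cell.strip())
    except ValueError:
        return None    # e.g. ' 6.439472E ', a truncated scientific-notation value

# in the parsing loop:
#     metricTrain = parse_metric(values[1])
#     if metricTrain is not None:
#         rounds.append(metricTrain)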

TypeError: search() got multiple values for argument 'body'

Hi,
When I run this in Docker

from ltr.log import FeatureLogger
from ltr.judgments import judgments_open
from itertools import groupby

ftr_logger=FeatureLogger(client, index='tmdb', feature_set='movies')
with judgments_open('data/title_judgments.txt') as judgment_list:
    for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
        ftr_logger.log_for_qid(judgments=query_judgments, 
                               qid=qid,
                               keywords=judgment_list.keywords(qid))

I am getting this error

Recognizing 40 queries...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      8         ftr_logger.log_for_qid(judgments=query_judgments, 
      9                                qid=qid,
---> 10                                keywords=judgment_list.keywords(qid))

c:\Users\Laxmi\Downloads\hello-ltr-master\hello-ltr-master\ltr\log.py in log_for_qid(self, qid, judgments, keywords)
     52             }
     53 
---> 54             res = self.client.log_query(self.index, self.feature_set, ids, params)
     55 
     56             # Add feature back to each judgment

c:\Users\Laxmi\Downloads\hello-ltr-master\hello-ltr-master\ltr\client\elastic_client.py in log_query(self, index, featureset, ids, params)
    135             params["query"]["bool"]["must"] = terms_query
    136 
--> 137         resp = self.es.search(index, body=params)
    138         resp_msg(msg="Searching {} - {}".format(index, str(terms_query)[:20]), resp=SearchResp(resp))
    139 

C:\ProgramData\Anaconda3\lib\site-packages\elasticsearch\client\utils.py in _wrapped(*args, **kwargs)
    137                 if p in kwargs:
    138                     params[p] = kwargs.pop(p)
--> 139             return func(*args, params=params, headers=headers, **kwargs)
    140 
    141         return _wrapped

TypeError: search() got multiple values for argument 'body'

This seems like an ES client/server version mismatch. I have ES 7.6.1 on my local computer.
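Likely so: newer elasticsearch-py clients stopped accepting the index as a positional argument, which produces exactly this TypeError. A sketch of the one-line fix in ltr/client/elastic_client.py:

# old call in log_query, breaks on newer clients:
#     resp = self.es.search(index, body=params)
# passing index by keyword works across client versions:
#     resp = self.es.search(index=index, body=params)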

LambdaMart training is killing the other processes.

Hi,

I am using the command below to train an LTR model with LambdaMART, and whenever it executes, it kills all the other processes. Is there any way I can control this to use just one core?

java -jar /tmp/RankyMcRankFace.jar -ranker 6 -tts 0.7 -shrinkage 0.05 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train -save

Kindly reply, as it would really help out.

ltr.train.kcv method throws an error

It's not possible to specify a kcv param to the ltr.train.train method, and the ltr.train.kcv method does not work (for either Solr or ES). We need to fix this!

hello-ltr (ES) demo notebook doesn't work.

Hi Folks,

I was following hello-ltr (ES) to play with ES and LTR. But when I ran the notebook, it failed.

---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py in _new_conn(self)
    173         try:
--> 174             conn = connection.create_connection(
    175                 (self._dns_host, self.port), self.timeout, **extra_kw

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     94     if err is not None:
---> 95         raise err
     96 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     84                 sock.bind(source_address)
---> 85             sock.connect(sa)
     86             return sock

ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    254 
--> 255             response = self.pool.urlopen(
    256                 method, url, body, retries=Retry(False), headers=request_headers, **kw

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    784 
--> 785             retries = retries.increment(
    786                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    524             # Disabled, indicate to re-raise the error.
--> 525             raise six.reraise(type(error), error, _stacktrace)
    526 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    769                 raise value.with_traceback(tb)
--> 770             raise value
    771         finally:

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702             # Make the request on the httplib connection object.
--> 703             httplib_response = self._make_request(
    704                 conn,

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    397             else:
--> 398                 conn.request(method, url, **httplib_request_kw)
    399 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py in request(self, method, url, body, headers)
    238             headers["User-Agent"] = _get_default_user_agent()
--> 239         super(HTTPConnection, self).request(method, url, body=body, headers=headers)
    240 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1281         """Send a complete request to the server."""
-> 1282         self._send_request(method, url, body, headers, encode_chunked)
   1283 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1327             body = _encode(body, 'body')
-> 1328         self.endheaders(body, encode_chunked=encode_chunked)
   1329 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py in endheaders(self, message_body, encode_chunked)
   1276             raise CannotSendHeader()
-> 1277         self._send_output(message_body, encode_chunked=encode_chunked)
   1278 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py in _send_output(self, message_body, encode_chunked)
   1036         del self._buffer[:]
-> 1037         self.send(msg)
   1038 

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py in send(self, data)
    974             if self.auto_open:
--> 975                 self.connect()
    976             else:

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py in connect(self)
    204     def connect(self):
--> 205         conn = self._new_conn()
    206         self._prepare_conn(conn)

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py in _new_conn(self)
    185         except SocketError as e:
--> 186             raise NewConnectionError(
    187                 self, "Failed to establish a new connection: %s" % e

NewConnectionError: <urllib3.connection.HTTPConnection object at 0x10c019300>: Failed to establish a new connection: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
/var/folders/v3/7xgvf6c935j7zk8c78116mrc0000gp/T/ipykernel_83431/1089693102.py in <module>
      1 movies = helpers.indexable_movies(movies='data/tmdb.json')
      2 
----> 3 index.rebuild(client, index='tmdb', doc_src=movies)

~/RiderProjects/hello-ltr-main/ltr/index.py in rebuild(client, index, doc_src, force)
      7     """
      8 
----> 9     if client.check_index_exists(index):
     10         if (force):
     11             client.delete_index(index)

~/RiderProjects/hello-ltr-main/ltr/client/elastic_client.py in check_index_exists(self, index)
     61 
     62     def check_index_exists(self, index):
---> 63         return self.es.indices.exists(index=index)
     64 
     65     def delete_index(self, index):

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/client/utils.py in _wrapped(*args, **kwargs)
    345                 if p in kwargs:
    346                     params[p] = kwargs.pop(p)
--> 347             return func(*args, params=params, headers=headers, **kwargs)
    348 
    349         return _wrapped

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/client/indices.py in exists(self, index, params, headers)
    369             raise ValueError("Empty value passed for a required argument 'index'.")
    370 
--> 371         return self.transport.perform_request(
    372             "HEAD", _make_path(index), params=params, headers=headers
    373         )

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/transport.py in perform_request(self, method, url, headers, params, body)
    415         # Before we make the actual API call we verify the Elasticsearch instance.
    416         if self._verified_elasticsearch is None:
--> 417             self._do_verify_elasticsearch(headers=headers, timeout=timeout)
    418 
    419         # If '_verified_elasticsearch' isn't 'True' then we raise an error.

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/transport.py in _do_verify_elasticsearch(self, headers, timeout)
    604             # anywhere then we re-raise the more appropriate error.
    605             if error and not info_response:
--> 606                 raise error
    607 
    608             # Check the information we got back from the index request.

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/transport.py in _do_verify_elasticsearch(self, headers, timeout)
    567 
    568                 try:
--> 569                     _, info_headers, info_response = conn.perform_request(
    570                         "GET", "/", headers=headers, timeout=timeout
    571                     )

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    278             if isinstance(e, ReadTimeoutError):
    279                 raise ConnectionTimeout("TIMEOUT", str(e), e)
--> 280             raise ConnectionError("N/A", str(e), e)
    281 
    282         # raise warnings if any from the 'Warnings' header.

ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x10c019300>: Failed to establish a new connection: [Errno 61] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x10c019300>: Failed to establish a new connection: [Errno 61] Connection refused)

Did I miss something?


Improve LTR reset

Currently each lab resets the LTR store completely and works from scratch; this gets annoying when working with labs that build on the steps from previous ones.

Update reset_ltr so it only resets the models/feature stores associated with a given lab.
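A sketch of a scoped reset, using the LTR plugin's named feature stores (one store per lab; endpoint paths per the plugin docs, store name illustrative):

import requests

# Reset only the store belonging to one lab, leaving the others intact.
def reset_ltr_store(base_url, store):
    requests.delete(f"{base_url}/_ltr/{store}")   # drop this lab's store
    requests.put(f"{base_url}/_ltr/{store}")      # recreate it empty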

Confusion over ltr.train and ltr.ranklib

We have different modules that seemingly do the same thing, and it's causing confusion. See the netfix movies notebooks: lots of references to from ltr import train, but there is no train module. There IS a module ltr.ranklib. We need to figure out and clean up the module dependencies.

Organize ES/Solr labs better

Instead of having duplicates of all the labs at the top level, create folders for ES/Solr and move the labs into each. Also review the text of each lab to make sure it only talks about the applicable backend.

Lambdamart returns the same score for popular and non-popular documents

Hi Team,

I am using the same code used here in hello-ltr to improve my search. My relevance grades run from 4 down to 0 (4, 3, 2, 1, 0).
I have used customer interaction details such as clicks, views, add-to-carts, and ratings, along with some product attributes, in my feature set. I have 2 documents, one extremely popular and the other not: the clicks, views, add-to-carts, and ratings values are very high for one and very low for the other. However, the LambdaMART model returns the same rescore score for these 2 documents. As a result the non-popular doc is displayed on top instead of the popular one. Can someone please help me understand this behavior and how to solve it?

Add OpenSearch as an Option

I'd like to add OpenSearch as an option in this tutorial. Hopefully, you want the same, so I opened a draft PR. It will take some time to develop the rest.

  • closed by #91

Incorrect docid check for collection_name_en In netfix movies (Solr)

I think the corpus has changed since this notebook was last updated, and the document retrieved in this line

client.get_doc(index='tmdb', doc_id=319074)

doesn't have a collection_name_en field, so the validation fails. However, it might help to leave it in to show that some docs are missing the field, and then follow up with the correct doc:
client.get_doc(index='tmdb', doc_id=1368)

Optimize feature logging process to make it faster

Hi Team,

I have a judgment file with over 30k queries, with at most 1000 docs per query. I have used the code present in hello-ltr for the feature logging process. It takes more than 7 hours to log the feature values and get the training set ready. Is there any way I can reduce the time? I am new to LTR and cannot find a way to optimize this process. Any help would be appreciated. Thank you.
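The per-query searches are I/O bound, so one option is issuing them concurrently; a sketch with a thread pool (this assumes FeatureLogger.log_for_qid is safe to call from multiple threads, which is worth verifying, since the logger appends to shared state):

from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

# Log features for many qids in parallel; tune max_workers to what the
# cluster tolerates.
def log_all(ftr_logger, judgment_list, max_workers=8):
    grouped = [(qid, list(js))
               for qid, js in groupby(judgment_list, key=lambda j: j.qid)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(ftr_logger.log_for_qid,
                               judgments=js, qid=qid,
                               keywords=judgment_list.keywords(qid))
                   for qid, js in grouped]
        for f in futures:
            f.result()   # surface any per-query errors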

Discarded 0 Keep 0 : Features are not logging

Hi,
I am trying to build it for a custom product list. When I run the code below, I don't get any errors, but the features are not logged.

(screenshot of the code omitted)

Can you please give me a hint? Let me know if you need more info. Awaiting your response.

Deploying Doubts

Good morning! I'm working on a search improvement project and this plugin meets my needs well, but I have questions about deployment, because in my case there is no possibility of using Python in production.

After training with Python and generating the model, can I use the model with plain Elastic syntax, without needing Python? If so, do you know where I can find an example or tutorial on this?
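Yes: once the model is uploaded to the LTR plugin, ranking needs no Python at all; any HTTP client can send plain query DSL with an sltr rescore. A sketch (index, model, and parameter names illustrative):

# Body for POST /tmdb/_search, runnable from curl or any client.
query = {
    "query": {"match": {"title": "rambo"}},
    "rescore": {
        "window_size": 100,
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": "rambo"},
                    "model": "my_ranklib_model"
                }
            }
        }
    }
}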

Decode problems with json.loads

I'm simply running "osc-blog.ipynb" inside "hello-ltr/notebooks/elasticsearch/osc-blog/" with Jupyter notebook. I can run only the first cell, because in the second one, which calls json.loads, I get

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1780: character maps to <undefined>

I read that the encoding argument was removed in Python 3.9, so I don't know how to resolve the problem.
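The error comes from reading the corpus with the platform default codec (cp1252 on Windows), not from json.loads itself; opening the file with an explicit UTF-8 encoding sidesteps it. A sketch (path illustrative):

import json

with open("data/blog.json", encoding="utf-8") as f:
    docs = json.load(f)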
