
ml4ir's Introduction

ml4ir: Machine Learning for Information Retrieval

CircleCI | changelog

Quickstart → ml4ir Read the Docs | ml4ir pypi | python README

ml4ir is an open source library for training and deploying deep learning models for search applications. ml4ir is built on top of python3 and tensorflow 2.x for training and evaluation. It also comes packaged with scala utilities for JVM inference.

ml4ir is designed as a set of modular subcomponents, each of which can be combined and customized to build a variety of search ML models such as:

  • Learning to Rank
  • Query Auto Completion
  • Document Classification
  • Query Classification
  • Named Entity Recognition
  • Top Results
  • Query2SQL
  • add your application here


Motivation

Search is a complex data space with many different types of ML tasks working on a combination of structured and unstructured data sources. There was no single library that

  • provides an end-to-end training and serving solution for a variety of search applications
  • allows training of models with limited coding expertise
  • allows easy customization to build complex models to tackle the search domain
  • focuses on performance and robustness
  • enables fast prototyping

So, we built ml4ir to do all of the above.

Guiding Principles

Customizable Library

Firstly, we want ml4ir to be an easy-to-use and highly customizable library so that you can build the search application you need. ml4ir allows each of its subcomponents to be overridden, mixed and matched with other custom modules to create and deploy models.
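
As a sketch of this idea (the class and names below are illustrative only, not ml4ir's actual API), a custom listwise loss could be defined as a standard Keras component and swapped in for one of the packaged losses:

import tensorflow as tf

# Illustrative sketch only: the point is that any subcomponent (loss, metric,
# embedding, layer) can be replaced by a custom implementation like this one.
class CustomListwiseLoss(tf.keras.losses.Loss):
    """A custom listwise loss that could stand in for a packaged ml4ir loss."""

    def call(self, y_true, y_pred):
        # Softmax cross-entropy over the documents of each query.
        return tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)

A component like this would then be referenced from the model configuration or passed in when assembling the model.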

Configurable Toolkit

While ml4ir can be used as a library, it also comes prepackaged with the popular search-based losses, metrics, embeddings, layers, etc. to enable someone with limited tensorflow expertise to quickly load their training data and train models for the task of interest. ml4ir achieves this by following a hybrid approach that allows each subcomponent to be completely controlled through configurations alone. Most search-based ML applications can be built this way.
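
For illustration, a configuration-driven feature definition might look roughly like the following; the keys are hypothetical and only convey the flavor of a YAML-configured feature set, not ml4ir's exact schema:

import yaml

# Hypothetical feature configuration: key and feature names are illustrative only.
feature_config_yaml = """
query_key:
  name: query_id
  dtype: string
label:
  name: clicked
  dtype: int64
features:
  - name: query_text
    dtype: string
    trainable: true
  - name: popularity_score
    dtype: float
    trainable: true
"""

feature_config = yaml.safe_load(feature_config_yaml)
print([f["name"] for f in feature_config["features"]])  # ['query_text', 'popularity_score']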

Performance First

ml4ir is built using the TFRecord data pipeline, which is the recommended data format for tensorflow data loading. We combine ml4ir's high configurability with out-of-the-box tensorflow data optimization utilities to define model features and build a data pipeline that easily allows training on huge amounts of data. ml4ir also comes packaged with utilities to convert data from CSV and libsvm formats to TFRecord.
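
As a minimal sketch of what that conversion involves (using standard TensorFlow APIs rather than ml4ir's own converter, and with made-up feature names), a CSV row can be serialized into a TFRecord Example like this:

import tensorflow as tf

def row_to_example(row):
    """Convert a dict of raw feature values into a tf.train.Example."""
    feature = {
        "query_text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row["query_text"].encode("utf-8")])),
        "popularity_score": tf.train.Feature(
            float_list=tf.train.FloatList(value=[row["popularity_score"]])),
        "clicked": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[row["clicked"]])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write rows to a TFRecord file that tf.data can stream efficiently at training time.
rows = [{"query_text": "red shoes", "popularity_score": 0.42, "clicked": 1}]
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for row in rows:
        writer.write(row_to_example(row).SerializeToString())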

Training-Serving Handshake

Because ml4ir is a common library for training and serving deep learning models, we can build tight integration and fault tolerance into the models that are trained. ml4ir also uses the same configuration files for both training and inference, keeping the end-to-end handshake clean. This allows users to easily plug any feature store (or Solr) into ml4ir's serving utilities and deploy models in their production environments.

Search Model Hub

The goal of ml4ir is to form a common hub for the most popular deep learning layers, losses, metrics, and embeddings used in the search domain. We've built ml4ir with a focus on quick prototyping with a wide variety of network architectures and optimizations. We encourage contributors to add to ml4ir's arsenal of search deep learning utilities as we continue to do so ourselves.

Continuous Integration

We use CircleCI for running tests. Both JVM and Python tests run on each commit and pull request. You can find both CI pipelines here.

Unit tests can be run directly from the Python/Java IDEs or with the dedicated mvn or python commands.

For integration tests, run one of the following in the jvm directory:

  • mvn verify -Pintegration_tests after enabling your Python environment as described in the python README.md
  • or, if you prefer running the Python training in Docker, mvn verify -Pintegration_tests -DuseDocker

Alternatively, you can repurpose the e2e test to test the JVM inference against a custom model directory through this command: mvn test -Dtest=TensorFlowInferenceIT#testRankingSavedModelBundleWithCSVData -DbundleLocation=/path/to/my/trained/model -DrunName=myRunName

Documentation

We use sphinx for ml4ir documentation. The documentation is hosted using Read the Docs at ml4ir.readthedocs.io/en/latest.

For python doc strings, please use the numpy docstring format specified here.
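
For reference, a function documented in the numpy style looks like this (the function itself is illustrative, not taken from the codebase):

import re

def clean_query(query, max_length=128):
    """
    Lowercase a raw search query and strip non-alphanumeric characters.

    Parameters
    ----------
    query : str
        Raw user query string.
    max_length : int, optional
        Maximum number of characters to keep, by default 128.

    Returns
    -------
    str
        The cleaned, truncated query.
    """
    cleaned = re.sub(r"[^0-9a-z ]+", " ", query.lower())
    return " ".join(cleaned.split())[:max_length]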

ml4ir's People

Contributors

armandgiraud, arvindsrikantan, balikasg, calliebradley, darshshah, dependabot[bot], ducouloa, jakemannix, lastmansleeping, marjanhs, mbrette, mohazahran, svc-scm, tanmaylaud, ullimague


ml4ir's Issues

Use model built using ml4ir inside of SOLR

Hi,
I am new to the Solr space and came across this library. I have been trying to use NER on the search query at query time.
So, for instance, if I type Puma Black Shoes, the input for matching should be something like Puma Black Shoes. I have scouted the net but couldn't find any good resource on this. I would basically want an externally trained model to be integrated inside Solr, so that once a model is trained, its JSON file is utilised inside Solr.
Any ideas? Also, how can I do NER using ml4ir? There doesn't seem to be a guide on that in the documentation provided.
Thanks

Support MRR and ACR for Example dataset

Background: Currently, ranking metrics require a rank and mask field to function. Remove this dependency so that they can be used for k-class classification problems as well.
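
For context, a minimal sketch (not ml4ir code) of how MRR could be computed for a k-class classification dataset directly from model scores and class labels, with no rank or mask feature required:

import numpy as np

def mean_reciprocal_rank(scores, labels):
    """scores: (batch, num_classes) model scores; labels: (batch,) true class ids."""
    # Rank of the true class = 1 + number of classes scored strictly higher.
    true_scores = scores[np.arange(len(labels)), labels]
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
labels = np.array([1, 2])
print(mean_reciprocal_rank(scores, labels))  # (1.0 + 1/3) / 2 ≈ 0.67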

CC: @Ullimague Filed this one.

ml4ir-inference integration test seems to be broken when there are missing feature values

Steps to repro

  1. Train a model and generate the model_predictions.csv
  2. In the model_predictions.csv, set the feature for any record to null by removing its value
    Example: query_0,1,0.314,0.0,0.0,1.0,m01h9eeb,0,domain_0,1,0.4095464,1 -> query_0,1,,0.0,0.0,1.0,m01h9eeb,0,domain_0,1,0.4095464,1
  3. Run the integration test:
    mvn scala:run "-DaddArgs=../../python/models/end_to_end_test_ranking/final/tfrecord/|../../python/logs/end_to_end_test_ranking/model_predictions.csv|../../python/ml4ir/applications/ranking/tests/data/configs/feature_config_integration_test.yaml"

Error Trace

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/Users/ashish.srinivasa/.m2/repository/com/google/protobuf/protobuf-java/3.5.1/protobuf-java-3.5.1.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at scala_maven_executions.MainHelper.runMain(MainHelper.java:161)
	at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.NumberFormatException: empty String
	at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
	at java.base/jdk.internal.math.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.base/java.lang.Float.parseFloat(Float.java:461)
	at scala.collection.immutable.StringLike$class.toFloat(StringLike.scala:281)
	at scala.collection.immutable.StringOps.toFloat(StringOps.scala:29)
	at ml4ir.inference.tensorflow.data.FeatureProcessors$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(FeatureProcessors.scala:14)
	at ml4ir.inference.tensorflow.data.FeatureProcessors$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(FeatureProcessors.scala:14)
	at scala.Option.map(Option.scala:146)
	at ml4ir.inference.tensorflow.data.FeatureProcessors$$anonfun$3$$anonfun$apply$3.apply(FeatureProcessors.scala:14)
	at ml4ir.inference.tensorflow.data.FeatureProcessors$$anonfun$3$$anonfun$apply$3.apply(FeatureProcessors.scala:14)
	at ml4ir.inference.tensorflow.data.FeaturePreprocessor$$anonfun$extractFloatFeatures$1.apply(FeaturePreprocessor.scala:41)
	at ml4ir.inference.tensorflow.data.FeaturePreprocessor$$anonfun$extractFloatFeatures$1.apply(FeaturePreprocessor.scala:38)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.Map$Map4.foreach(Map.scala:188)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at ml4ir.inference.tensorflow.data.FeaturePreprocessor.extractFloatFeatures(FeaturePreprocessor.scala:38)
	at ml4ir.inference.tensorflow.data.FeaturePreprocessor.apply(FeaturePreprocessor.scala:34)
	at ml4ir.inference.tensorflow.data.FeaturePreprocessor.apply(FeaturePreprocessor.scala:17)
	at scala.collection.immutable.List.map(List.scala:284)
	at ml4ir.inference.tensorflow.data.SequenceExampleBuilder.apply(TFRecordBuilders.scala:40)
	at ml4ir.inference.tensorflow.data.SequenceExampleBuilder.build(TFRecordBuilders.scala:48)
	at ml4ir.inference.tensorflow.SequenceExampleInference$$anonfun$runQueriesAgainstDocs$2.apply(SequenceExampleInference.scala:132)
	at ml4ir.inference.tensorflow.SequenceExampleInference$$anonfun$runQueriesAgainstDocs$2.apply(SequenceExampleInference.scala:130)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml4ir.inference.tensorflow.SequenceExampleInference$.runQueriesAgainstDocs(SequenceExampleInference.scala:130)
	at ml4ir.inference.tensorflow.SequenceExampleInference$.evaluateRankingInferenceAccuracy(SequenceExampleInference.scala:68)
	at ml4ir.inference.tensorflow.SequenceExampleInference$.main(SequenceExampleInference.scala:64)
	at ml4ir.inference.tensorflow.SequenceExampleInference.main(SequenceExampleInference.scala)
	... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  12.111 s
[INFO] Finished at: 2021-06-09T12:32:05-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.3.1:run (default-cli) on project ml4ir-inference: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 240 (Exit value: 240) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
