myrrix-recommender's Issues

"Tag" API

User/item metadata, in the form of tags, can be integrated into the model as 
"fake" items/users, respectively. It works but requires filtering them from 
results. It would be good to make this a first-class concept in the API somehow.

Original issue reported on code.google.com by [email protected] on 7 Feb 2013 at 9:06

/ingest doesn't do appropriate input validation

Run this command to send data to /ingest. The command itself is wrong: curl's 
-d option strips newlines when it reads from a file or stdin, so the entire CSV 
file 'pref_data' is concatenated into one line.

cat pref_data | curl --trace-ascii /tmp/curl.out -X POST -d @- http://localhost:8080/ingest

/ingest happily accepts the one line, and adds one user, and one item. 

It would be preferable if /ingest threw an exception due to the malformed input.

For the record, the correct curl command would be: 
cat pref_data | curl -X POST --data-binary @- http://localhost:8080/ingest
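
The kind of check requested above could look roughly like the sketch below. It 
assumes each input line has the form userID,itemID or userID,itemID,value; the 
class and method names are hypothetical and are not part of Myrrix.

```java
final class IngestLineValidator {

  /**
   * Returns a description of the problem, or null if the line looks well-formed.
   * Assumed format (not confirmed by the issue): userID,itemID[,value].
   */
  static String validate(String line) {
    // The -1 limit keeps trailing empty fields so they are counted too
    String[] fields = line.split(",", -1);
    if (fields.length < 2 || fields.length > 3) {
      return "expected 2 or 3 comma-separated fields but found " + fields.length;
    }
    if (fields[0].trim().isEmpty() || fields[1].trim().isEmpty()) {
      return "empty user or item ID";
    }
    if (fields.length == 3 && !fields[2].trim().isEmpty()) {
      try {
        Float.parseFloat(fields[2].trim());
      } catch (NumberFormatException e) {
        return "value is not numeric: " + fields[2];
      }
    }
    return null;
  }
}
```

With this check, two lines run together by curl -d, such as 1,2,1.03,4,1.0, 
are rejected because they contain five fields.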

Original issue reported on code.google.com by [email protected] on 3 Oct 2012 at 5:52

Merge user / item API method

Sometimes a user, or item, exists in the data under one ID to start, but a 
different one later. For example, a user may log in part-way through a session, 
at which point the data from the session so far needs to be quickly merged into 
the 'real' user ID.

Original issue reported on code.google.com by [email protected] on 23 Nov 2012 at 4:27

Can't specify cluster members to Serving Layer instances on AWS

Right now there's a chicken-and-egg problem when running many Serving Layers in 
a cluster on AWS. They need to know about each other to forward requests, but 
are started up in serial and the host names are not known ahead of time. 

Specify dummy host names as a workaround for now and use the Java client (with 
all correct host names) instead to put the request to the right instance.

It's also possible to use DNS aliases to know the DNS names ahead of time and 
assign them after the instances are up.

Original issue reported on code.google.com by [email protected] on 13 Oct 2012 at 6:27

Process stuck uploading to HDFS

Tracking an issue report here that the Serving Layer can get stuck uploading a 
file, it seems, to HDFS and a Computation Layer. I will post more detail as I 
have it.

Original issue reported on code.google.com by [email protected] on 29 Aug 2012 at 4:31

Real-time recommender performance evaluation

It would be of great benefit to display some measure of the recommender's 
effectiveness in real-time. As each new datum is ingested, it's possible to 
evaluate how good a recommendation the engine thought it was before it was 
ingested. Real data ought to have been viewed as good recommendations 
previously. This could be as simple as average estimated strength.
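
As a minimal sketch of the "average estimated strength" idea, the class below 
keeps a running mean of the estimates recorded just before each datum is 
ingested. The class name and the convention that NaN means "unknown user or 
item" are assumptions made for illustration.

```java
final class RunningEstimateAverage {

  private long count;
  private double mean;

  /** Records the strength the model estimated for a datum just before it was ingested. */
  synchronized void record(double estimateBeforeIngest) {
    if (Double.isNaN(estimateBeforeIngest)) {
      return; // assumed convention: NaN means the user or item was not yet known
    }
    count++;
    // Incremental mean update; avoids storing every estimate
    mean += (estimateBeforeIngest - mean) / count;
  }

  /** Average of all recorded estimates, or NaN if none were recorded. */
  synchronized double average() {
    return count == 0 ? Double.NaN : mean;
  }

  synchronized long count() {
    return count;
  }
}
```

An average that drifts toward 0 over time would suggest that newly arriving 
data is increasingly surprising to the model.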

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 9:33

Fold-in for new data improvement

Currently, new data is projected into feature space and added to existing 
user/item feature vectors. This is simple, but very approximate. The side 
effects of this approximation happen to be useful: emphasizing recency by 
over-stating the importance of recent data, and tending to promote recently 
active items, since the feature vectors become sums over new data rather than 
averages, and so grow large.

At larger data volumes these effects may become too pronounced. There are some 
smarter ways to fold in data. Placeholder here for sorting that out.

Original issue reported on code.google.com by [email protected] on 26 Sep 2012 at 11:57

Anti-spam measures

It would be good if the front-end had some anti-spam measures, like a simple 
anti-DDoS filter. It should also be possible to ignore users who seem to have 
far too many data points, or at least down-sample them, as they are likely spam.

Original issue reported on code.google.com by [email protected] on 7 Feb 2013 at 9:05

Don't fail mostSimilarItems / recommendToAnonymous if an item isn't known

These operations return an answer that is a function of many items. Right now, 
if one is not known, the operation fails. In practice, this may not be 
desirable. It is possible in some architectures to make a request for a new 
item before it's been fully ingested. It's also likely that the caller would 
rather have an answer based on partial input rather than none at all.

Original issue reported on code.google.com by [email protected] on 26 Nov 2012 at 10:35

Improve speed / scale of new self-organizing maps visualization

On large data sets, it can take a long time to generate the self-organizing 
map, the new visualization in the Serving Layer. The result is often mostly 
filled to capacity since there are so many points. It should increase the map 
resolution, and cap / sample the data, as data increases, to result in a 
meaningful map that doesn't take quite so long.

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 12:01

Improve Computation Layer to operate on current generation

Right now the Computation Layer can only operate on the generation behind the 
current one, as it attempts to ensure nothing is writing to the generation it 
processes (or else data is lost). There are better ways to manage this, such 
that the CL can wait for Serving Layers to switch to a new generation and then 
run the current generation. This will lower latency by one cycle.

Original issue reported on code.google.com by [email protected] on 8 Oct 2012 at 12:48

Tiny data can lead to near-singular matrices, bad results

When the data set is very small, it's possible (and even easy) for the internal 
computations, which involve inverting a matrix, to end up with something 
(nearly) uninvertible. Symptoms are usually very large estimates or NaN.

The code will check for this condition and warn that the number of features 
needs to be decreased. I am not yet sure the check is good enough, and the 
behavior is surprising to new users testing with a little data.

Original issue reported on code.google.com by [email protected] on 2 Oct 2012 at 4:46

Asynchronous client

In many contexts it would be useful to have an asynchronous client that can, at 
minimum, perform updates asynchronously. For example in most code paths, it's 
likely that the update will happen in the context of other updates, like 
updating a database. The app will not necessarily want to wait, or even fail, 
if the recommender update fails.

It would also be nice to implement asynchronous versions of methods like 
"recommend".
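
One possible shape for such a client is sketched below: a blocking update is 
handed to a background thread, and a failure is logged and reported through the 
returned future instead of being thrown into the caller's code path. The 
BlockingUpdate interface is a hypothetical stand-in for the real client call, 
and AsyncUpdater is not an existing Myrrix class.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncUpdater implements AutoCloseable {

  /** Hypothetical stand-in for a blocking client update call. */
  interface BlockingUpdate {
    void setPreference(long userID, long itemID, float value) throws Exception;
  }

  private final BlockingUpdate delegate;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  AsyncUpdater(BlockingUpdate delegate) {
    this.delegate = delegate;
  }

  /** Queues the update and returns at once; the future yields false if the update failed. */
  CompletableFuture<Boolean> setPreferenceAsync(long userID, long itemID, float value) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        delegate.setPreference(userID, itemID, value);
        return true;
      } catch (Exception e) {
        // Logged rather than rethrown, so the caller's own work is unaffected
        System.err.println("Recommender update failed: " + e);
        return false;
      }
    }, executor);
  }

  @Override
  public void close() {
    executor.shutdown();
  }
}
```

A caller that does not care about the outcome can simply ignore the returned 
future; one that does can attach a callback to it.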

Original issue reported on code.google.com by [email protected] on 25 Nov 2012 at 2:04

Use Runtime.getRuntime().addShutdownHook instead of SignalManager

It seems that SignalManager tries to do what Runtime#addShutdownHook already 
does, but by using sun.misc.Signal, which is not part of the public Java API.

It would be simpler just to use Runtime#addShutdownHook, and remove the 
net.myrrix.common.signal package.

Note: I've tested it (using Runtime#addShutdownHook) and it behaves nicely when 
receiving SIGINT, SIGTERM and SIGQUIT (the hook is executed), while SIGKILL 
terminates the application forcibly, and the hook is not executed.
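
For reference, a minimal sketch of the proposed approach follows. The class 
name and the cleanup action are illustrative only; the real cleanup would be 
whatever SignalManager currently triggers.

```java
final class ShutdownHookExample {

  /** Registers a hook that runs the given cleanup when the JVM shuts down. */
  static Thread registerCleanup(Runnable cleanup) {
    Thread hook = new Thread(cleanup, "shutdown-cleanup");
    Runtime.getRuntime().addShutdownHook(hook);
    return hook;
  }

  public static void main(String[] args) throws InterruptedException {
    registerCleanup(() -> System.out.println("Closing resources before exit"));
    System.out.println("Running; press Ctrl-C to trigger the hook");
    Thread.sleep(60_000L);
  }
}
```

Returning the Thread lets the caller remove the hook later with 
Runtime#removeShutdownHook if the resource is closed by other means.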

Original issue reported on code.google.com by [email protected] on 6 Dec 2012 at 3:22

Load new X/Y model incrementally

Right now new models are loaded alongside the existing model. This is simple 
and allows for an atomic swap, but means heap usage peaks well above normal 
levels. Ideally the model is loaded incrementally, with entries replaced one by 
one. This would mean the number of features can't change from run to run, but 
that's rare.


Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 12:31

Remove idea of "instance"

Right now, the computation layer uses the idea of a 'bucket' and 'instance' 
within that bucket to isolate files used by one logical recommender instance. 
The reasons that there are both a bucket and an instance are largely historical, 
and both are not required. For that reason I would like to remove 'instance'; 
callers would use a different bucket for each logical instance.

This corresponds to an Amazon S3 bucket for each logical instance, or a root 
directory on HDFS (or, really, any directory on HDFS), and neither is in short 
supply.

It simplifies the code and configuration, and solves at least one potential 
problem: right now, SSL certs are per-bucket, not per instance. So are 
rescorers. This is probably no longer a viable assumption.

I also believe that everyone uses one instance per bucket anyway.

However it also requires existing users to make a change; data must be moved to 
continue processing. That means shutting down the CL, copying the data, 
restarting SLs with new configuration/version (which has to happen anyway for 
an upgrade), and then removing the old copy.

I'd like to solicit feedback first. If this change were made it would be 
accompanied by release notes detailing the above.

Original issue reported on code.google.com by [email protected] on 12 Feb 2013 at 11:51

Rest set/add preference does not support `0` as an input

What steps will reproduce the problem?
1. Call set preference (http://myrrix.com/rest-api/#setaddpreference) with the 
value `0`.
2. Server will fail

What is the expected output? What do you see instead?
Should return 200 and ingest the value the same way you ingest `0.0`

What version of the product are you using? On what operating system?
I am using the 0.9 standalone serving layer.

Please provide any additional information below.
Note that a workaround is to ingest `0.0` instead of `0`.
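
The report does not state the cause of the failure. As a hedged illustration of 
the expected behavior only, the standard Float.parseFloat accepts both forms, 
so a parser along these lines would treat `0` and `0.0` identically. The class 
and method names are hypothetical.

```java
final class PreferenceValueParser {

  /** Parses a request body as a preference value; "0" and "0.0" both yield 0.0f. */
  static float parse(String body) {
    if (body == null || body.trim().isEmpty()) {
      throw new IllegalArgumentException("No preference value supplied");
    }
    // Float.parseFloat accepts integer-style input such as "0" as well as "0.0"
    return Float.parseFloat(body.trim());
  }
}
```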

Original issue reported on code.google.com by [email protected] on 10 Jan 2013 at 7:51

In distributed mode, brand-new items may be unavailable in mostSimilarItems, recommendToAnonymous methods

In distributed mode, front-ends can be partitioned by user. This causes a 
potential issue when a brand new item (not user) arrives, since it will become 
known immediately to only 1 of the N front-ends. This is not an issue for most 
methods, but a call to mostSimilarItems will tend to fail (unless it happens to 
use that 1 of N front-ends) for this brand-new item until the model is rebuilt.

The behavior should at least be deterministic and predictable. The view from 
the client needs to return an answer.

Original issue reported on code.google.com by [email protected] on 25 Nov 2012 at 2:07

Choose stopping point for iterations programmatically

Right now the user chooses the number of iterations. This is not ideal as it's 
not meaningfully choosable by the user. Really the iterations should stop when 
the results stop moving much. The implementations should sample this movement 
and stop when some threshold is reached instead.
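
A minimal sketch of such a stopping rule, assuming the implementation can 
evaluate the same sample of estimates after each iteration: compute the mean 
absolute change between successive iterations and stop once it falls below a 
threshold. The names and the choice of mean absolute change as the measure are 
assumptions, not the project's method.

```java
final class ConvergenceCheck {

  /** Mean absolute difference between the same sampled estimates from two successive iterations. */
  static double meanAbsoluteChange(double[] previous, double[] current) {
    if (previous.length == 0 || previous.length != current.length) {
      throw new IllegalArgumentException("Samples must be non-empty and of equal length");
    }
    double total = 0.0;
    for (int i = 0; i < previous.length; i++) {
      total += Math.abs(current[i] - previous[i]);
    }
    return total / previous.length;
  }

  /** True when the sampled estimates moved less than the threshold on average. */
  static boolean hasConverged(double[] previous, double[] current, double threshold) {
    return meanAbsoluteChange(previous, current) < threshold;
  }
}
```

In practice a maximum iteration count would still be kept as a safety limit in 
case the threshold is never reached.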

Original issue reported on code.google.com by [email protected] on 24 Jan 2013 at 2:07

Local input dir variable ignored in run-serving-layer.sh script

Using myrrix-serving-0.7.jar

Using run-serving-layer.sh script from myrrix-web, the LOCAL_INPUT_DIR variable 
is never appended in the ALL_ARGS variable, since the ALL_ARGS variable is 
overwritten on the next lines (choosing PORT/SECURE_PORT/KEYSTORE_FILE). 

The fix is simple: append the LOCAL_INPUT_DIR after these lines.

P.S.: I'm sorry for posting this issue on getsatisfaction.com/myrrix first, I 
should have posted it here.

Original issue reported on code.google.com by [email protected] on 24 Oct 2012 at 3:21

Implement a mostPopular method

It would be nice to have a mostPopular in MyrrixRecommender.

The signature would be something like this:
List<RecommendedItem> mostPopular(int howMany) throws TasteException;

It then would be implemented in ServerRecommender as a selectTopN from a 
special iterator which returns, as the score for a given item, the summed 
preference weight of all users (or the count of users who have a preference 
for this item, if that is easier).
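
The simpler count-based variant could be sketched as follows. This is a 
stand-alone illustration over a plain map of user ID to known item IDs; it does 
not use ServerRecommender, selectTopN or RecommendedItem, and the class name is 
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MostPopularSketch {

  /** Returns up to howMany item IDs, ordered by the number of users with a preference for them. */
  static List<Long> mostPopular(Map<Long, Set<Long>> itemsByUser, int howMany) {
    Map<Long, Integer> userCounts = new HashMap<>();
    for (Set<Long> items : itemsByUser.values()) {
      for (Long item : items) {
        userCounts.merge(item, 1, Integer::sum);
      }
    }
    List<Map.Entry<Long, Integer>> entries = new ArrayList<>(userCounts.entrySet());
    // Highest count first; ties broken by item ID so the order is deterministic
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : Long.compare(a.getKey(), b.getKey());
    });
    List<Long> result = new ArrayList<>();
    for (Map.Entry<Long, Integer> entry : entries) {
      if (result.size() >= howMany) {
        break;
      }
      result.add(entry.getKey());
    }
    return result;
  }
}
```

Sorting every item is the easy route; a bounded heap would avoid the full sort 
when the number of items is large and howMany is small.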

What do you think? I will propose a patch later.

Original issue reported on code.google.com by [email protected] on 8 Jan 2013 at 10:51

removePreference should be exposed by the REST API

And, it should be idempotent. It should also remove not only items that no 
longer exist in a user's set of known items, but the user too if applicable (no 
more items known). This should also be documented more clearly versus 
setPreference() in the javadoc.

Original issue reported on code.google.com by [email protected] on 25 Jun 2012 at 12:32

Add cluster-related API methods

The Computation Layer already computes clusters, optionally, with a kmeans++ / 
spectral variant. There is not yet an API method to access the clusters, and 
there should be.

Original issue reported on code.google.com by [email protected] on 8 Jan 2013 at 3:19

Apply logistic function to reconstructed / estimated values?

The values on which recommendations are ranked, and the result of the 
estimatePreference method, are actually elements in the reconstruction of the 
0/1 input matrix P. The values are typically between 0 and 1, but need not be 
in practice.

It may be more intuitive to limit the output to the range (0,1) by passing the 
result through the logistic function 1/(1+e^-x). In practice we would need to 
apply the logistic function to some function of x, like 5(x-0.5) in order to 
scale it appropriately.
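
The transformation described above can be written directly. The sketch below 
uses the example scaling 5(x-0.5) from the text; the class and method names are 
illustrative.

```java
final class LogisticScale {

  /** The logistic function 1/(1+e^-x). */
  static double logistic(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  /** Maps an estimate into (0,1) by applying the logistic function to 5(x-0.5). */
  static double squash(double estimate) {
    return logistic(5.0 * (estimate - 0.5));
  }
}
```

Because the function is strictly increasing, an estimate of 0.5 maps to 0.5 and 
the ordering of any two estimates is preserved.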

This would not affect relative rank of recommendations. It would affect the 
actual values.

Original issue reported on code.google.com by [email protected] on 31 Aug 2012 at 2:59

Print low-memory warnings in Serving Layer

Big input to the Serving Layer can exhaust the heap if the heap is not sized 
appropriately, up from the default. It would probably be less surprising and more helpful to 
the user to periodically check heap availability during the load phase and 
print helpful warnings if it gets very low.
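
A check of this kind can be built on the standard Runtime memory methods. The 
sketch below is one possible form; the class name, the threshold parameter and 
the wording of the warning are assumptions.

```java
final class HeapWatcher {

  /** Fraction of the maximum heap that is currently in use. */
  static double usedFraction() {
    Runtime runtime = Runtime.getRuntime();
    long used = runtime.totalMemory() - runtime.freeMemory();
    return (double) used / runtime.maxMemory();
  }

  /** Prints a warning and returns true if heap usage exceeds the given fraction. */
  static boolean warnIfLow(double threshold) {
    double used = usedFraction();
    if (used > threshold) {
      System.err.printf(
          "Warning: %.0f%% of the heap is in use; consider increasing -Xmx%n", used * 100.0);
      return true;
    }
    return false;
  }
}
```

During the load phase this could be called every few thousand lines read, for 
example with a threshold of 0.9.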

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 1:38

Need a better recommendedBecause algorithm

The "recommendedBecause" method just finds the user's item that is closest to a 
given target in feature space. This is not personalized. It is not the full 
process described well in the Hu/Koren/Volinsky paper. 

Naively, implementing that process involves pre-computing an f x f matrix Wu 
per user and storing it, and also storing the original input matrix R (or C). 
None of these are done yet and holding these in memory would be infeasible.

The task is to figure out a compromise or way around this.

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 6:30

Web app needs better error page

Instead of the default error page, it would be good to present something 
simpler and more customized in case of an error like 404 or 500

Original issue reported on code.google.com by [email protected] on 3 Oct 2012 at 8:08

Consider different handling of negative input

The original ALS formulation does not allow negative values. The current 
implementation does, but assigns them very low weight. This is better than 
negative weight, which is ill-formed, but not as principled as it could be.

While it's a corner case, and not intended to be used with negative input, it 
should be possible to modify the formulation to use *increasing* weight for 
more negative values, but penalize difference from 0 instead of 1. This would 
be more principled, and likely to give more intuitive results in the case that 
someone does want to use negative input.

Original issue reported on code.google.com by [email protected] on 31 Oct 2012 at 12:48

Add Computation Layer driver program

The Computation Layer is already a command-line program, but each invocation 
runs only one generation. It is meant to be run repeatedly, perhaps at regular 
intervals by a cron job.

We should make that available as a Java program as well, something that can run 
continuously and run the Computation Layer at fixed delay or after a certain 
amount of data is written.
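
The fixed-delay part of such a driver could be sketched with a standard 
ScheduledExecutorService, as below. The Runnable stands in for "run one 
generation of the Computation Layer"; GenerationDriver is a hypothetical name, 
and triggering on the amount of data written is not covered.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class GenerationDriver implements AutoCloseable {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Runs one generation, waits for the delay after it finishes, then repeats. */
  void start(Runnable runOneGeneration, long delay, TimeUnit unit) {
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        runOneGeneration.run();
      } catch (RuntimeException e) {
        // An uncaught exception would cancel all later runs, so log and continue
        System.err.println("Generation failed: " + e);
      }
    }, 0L, delay, unit);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

scheduleWithFixedDelay measures the delay from the end of one run to the start 
of the next, so a slow generation never overlaps with the following one.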

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 1:39

Add client integration examples

For example, it would be interesting to provide sample code showing how to add 
data in response to a message from a JMS queue.

Original issue reported on code.google.com by [email protected] on 26 Nov 2012 at 10:28

[PATCH] Myrrix-server: Allow to pass multiple rescorerParams values in url

In AbstractMyrrixServlet#getRescorerParams, the method used 
(request.getParameter) only allows a single value to be passed. The method used 
should be request.getParameterValues, which allows multiple values to be passed 
and is consistent with the RescorerProvider contract (String... args).

Patch attached.

Original issue reported on code.google.com by [email protected] on 14 Nov 2012 at 9:22
