nageshbhattu / myrrix-recommender

Automatically exported from code.google.com/p/myrrix-recommender

License: Apache License 2.0

HTML 0.44% Shell 0.87% Java 98.69%

myrrix-recommender's Issues

"Tag" API

User/item metadata, in the form of tags, can be integrated into the model as 
"fake" items/users, respectively. It works but requires filtering them from 
results. It would be good to make this a first-class concept in the API somehow.

Original issue reported on code.google.com by [email protected] on 7 Feb 2013 at 9:06

/ingest doesn't do appropriate input validation

Run this command to send data via /ingest. The command is wrong: curl's -d 
option strips newlines, so the entire CSV file 'pref_data' is concatenated 
into one line.

cat pref_data | curl --trace-ascii /tmp/curl.out -X POST -d @- http://localhost:8080/ingest

/ingest happily accepts the one line, and adds one user, and one item. 

It would be preferable if /ingest threw an exception due to the malformed input.

For the record, the correct curl command would be: 
cat pref_data | curl -X POST --data-binary @- http://localhost:8080/ingest
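
A minimal sketch of the kind of per-line check that could reject the concatenated input. It assumes each line has the form userID,itemID or userID,itemID,value; IngestLineValidator and validate are hypothetical names, not existing Myrrix code.

```java
final class IngestLineValidator {

  // Hypothetical helper. Assumed line format: userID,itemID or userID,itemID,value
  static void validate(String line) {
    // Limit -1 keeps trailing empty fields so "1,2," is seen as three fields
    String[] fields = line.split(",", -1);
    if (fields.length < 2 || fields.length > 3) {
      throw new IllegalArgumentException(
          "Expected 2 or 3 fields but found " + fields.length + ": " + line);
    }
    if (fields[0].isEmpty() || fields[1].isEmpty()) {
      throw new IllegalArgumentException("Empty user or item ID: " + line);
    }
    if (fields.length == 3) {
      try {
        Float.parseFloat(fields[2]);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Bad value: " + fields[2], e);
      }
    }
  }
}
```

With a check like this, two lines such as "1,2,3.0" and "4,5,1.0" that have been joined into "1,2,3.04,5,1.0" would be rejected for having five fields instead of being ingested as one user and one item.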

Original issue reported on code.google.com by [email protected] on 3 Oct 2012 at 5:52

Merge user / item API method

Sometimes a user, or item, exists in the data under one ID to start, but a 
different one later. For example, a user may log in part-way through a session, 
at which point the data from the session so far needs to be quickly merged into 
the 'real' user ID.

Original issue reported on code.google.com by [email protected] on 23 Nov 2012 at 4:27

Can't specify cluster members to Serving Layer instances on AWS

Right now there's a chicken-and-egg problem when running many Serving Layers in 
a cluster on AWS. They need to know about each other to forward requests, but 
are started up in serial and the host names are not known ahead of time. 

Specify dummy host names as a workaround for now and use the Java client (with 
all correct host names) instead to put the request to the right instance.

It's also possible to use DNS aliases to know the DNS names ahead of time and 
assign them after the instances are up.

Original issue reported on code.google.com by [email protected] on 13 Oct 2012 at 6:27

Patch for /trunk/client/src/net/myrrix/client/CLI.java

Changed HOST_FLAG command line option description from "Serving Layer port 
number" to "Serving Layer host address".

Original issue reported on code.google.com by [email protected] on 16 Aug 2012 at 9:25

Process stuck uploading to HDFS

Tracking an issue report here that the Serving Layer can get stuck uploading a 
file, it seems, to HDFS and a Computation Layer. I will post more detail as I 
have it.

Original issue reported on code.google.com by [email protected] on 29 Aug 2012 at 4:31

Serving Layer Console test methods don't work with HTTPS + DIGEST auth

Presumably the JavaScript isn't playing nice with HTTPS and DIGEST 
authentication. It may be something that only happens with a fake SSL cert. But 
all methods get "HTTP Error 0".

Original issue reported on code.google.com by [email protected] on 7 Oct 2012 at 7:06

Real-time recommender performance evaluation

It would be of great benefit to display some measure of the recommender's 
effectiveness in real-time. As each new datum is ingested, it's possible to 
evaluate how good a recommendation the engine thought it was before it was 
ingested. Real data ought to have been viewed as good recommendations 
previously. This could be as simple as average estimated strength.
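
As a rough illustration of the "average estimated strength" idea, here is a sketch of a running mean. It assumes the estimate for each incoming user/item pair is obtained just before the datum is ingested (for example from estimatePreference); RunningAverage is a hypothetical class, not part of the project.

```java
final class RunningAverage {

  private long count;
  private double mean;

  // Fold one pre-ingest estimate into the mean without storing past values
  synchronized void add(double estimate) {
    count++;
    mean += (estimate - mean) / count;
  }

  synchronized double get() {
    return mean;
  }

  synchronized long count() {
    return count;
  }
}
```

A value that stays high over time would suggest that newly observed data was already considered a good recommendation.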

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 9:33

[PATCH] Allow Recommend to Anonymous with IDRescorer

It would be useful to be able to provide an IDRescorer to the 
recommendToAnonymous method.

I've implemented just that, the patch is attached to this issue.
Let me know if you think this is a good or a bad idea.

Regards,
Julien

Original issue reported on code.google.com by [email protected] on 9 Nov 2012 at 11:11

Attachments:

recommendToAnonymousWithIDRescorer.patch

ingest failing on large uploads when authentication is turned on

Noting here that I've seen large uploads fail with "Unauthorized" when HTTP 
DIGEST authentication is enabled in the Serving Layer. The security settings 
like nonce validity time are probably too tight.

Original issue reported on code.google.com by [email protected] on 14 Jun 2012 at 11:20

Fold-in for new data improvement

Currently, new data is projected into feature space and added to existing 
user/item feature vectors. This is simple, but very approximate. The side 
effects of this approximation happen to be useful: it emphasizes recency by 
over-stating the importance of recent data, and it tends to promote recently 
active items, since the feature vectors become sums over new data rather than 
averages and so grow large.

At larger data volumes these effects may become too pronounced. There are some 
smarter ways to fold in data. Placeholder here for sorting that out.

Original issue reported on code.google.com by [email protected] on 26 Sep 2012 at 11:57

Patch for /trunk/web/src/net/myrrix/web/AllRecommendations.java

Fixed typo in the NamedThreadFactory

Original issue reported on code.google.com by [email protected] on 2 Oct 2012 at 9:12

Attachments:

AllRecommendations.java.patch

No way to provide a rescorer .jar or keystore file on AWS

Known issue -- there's no way yet to provide a JAR containing a 
RescorerProvider on Amazon AWS, or a keystore for SSL.

Original issue reported on code.google.com by [email protected] on 13 Oct 2012 at 6:26

Anti-spam measures

It would be good if the front-end had some anti-spam measures, like a simple 
anti-DDOS filter. It should also be possible to ignore users who seem to have 
far too many data points, or at least down-sample them, as they are likely spam.

Original issue reported on code.google.com by [email protected] on 7 Feb 2013 at 9:05

Don't fail mostSimilarItems / recommendToAnonymous if an item isn't known

These operations return an answer that is a function of many items. Right now, 
if one is not known, the operation fails. In practice, this may not be 
desirable. It is possible in some architectures to make a request for a new 
item before it's been fully ingested. It's also likely that the caller would 
rather have an answer based on partial input rather than none at all.

Original issue reported on code.google.com by [email protected] on 26 Nov 2012 at 10:35

Improve speed / scale of new self-organizing maps visualization

On large data sets, it can take a long time to generate the self-organizing 
map, the new visualization in the Serving Layer. The result is often mostly 
filled to capacity since there are so many points. It should increase the map 
resolution, and cap / sample the data, as data increases, to result in a 
meaningful map that doesn't take quite so long.

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 12:01

Support Hadoop 2 / CDH4

There's a request to support Hadoop 2.x / CDH4. It would require modest changes.

Original issue reported on code.google.com by [email protected] on 10 Aug 2012 at 4:07

Protect console with password separately

It should be possible to set a username/password for the console page only

Original issue reported on code.google.com by [email protected] on 7 Oct 2012 at 4:16

Improve Computation Layer to operate on current generation

Right now the Computation Layer can only operate on the generation behind the 
current one, as it attempts to ensure nothing is writing to the generation it 
processes (or else data is lost). There are better ways to manage this, such 
that the CL can wait for Serving Layers to switch to a new generation and then 
run the current generation. This will lower latency by one cycle.

Original issue reported on code.google.com by [email protected] on 8 Oct 2012 at 12:48

Tiny data can lead to near-singular matrices, bad results

When data is very small, it's easy for the internal computations, which involve 
inverting a matrix, to end up with something (nearly) uninvertible. Symptoms 
are usually very large estimates or NaN.

It will check for this condition and warn that the number of features needs to 
be decreased. I am not yet sure the check is good enough. And it is surprising 
behavior to new users testing with a little data.

Original issue reported on code.google.com by [email protected] on 2 Oct 2012 at 4:46

catch NumberFormatException in xxxServlet and return HTTP error 400 (Bad request)

In myrrix-web's servlets, the Long.parseLong and Integer.parseInt calls should 
be wrapped in a try/catch for NumberFormatException and return HTTP Error 400, 
as it's done with missing arguments. This should make the servlets more robust.
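
A sketch of the parsing helper this suggests, written without the servlet API so it stands alone. BadRequestException and ParamParser are hypothetical names; in a real servlet the catch site would presumably call response.sendError(HttpServletResponse.SC_BAD_REQUEST, message).

```java
// Hypothetical exception that a servlet would translate into an HTTP 400 response
final class BadRequestException extends RuntimeException {

  static final int STATUS = 400;

  BadRequestException(String message, Throwable cause) {
    super(message, cause);
  }
}

final class ParamParser {

  // Parses a request parameter, converting malformed input into a "bad request" error
  static long parseLongParam(String name, String raw) {
    try {
      // Long.parseLong also throws NumberFormatException for null input
      return Long.parseLong(raw);
    } catch (NumberFormatException e) {
      throw new BadRequestException("Bad value for " + name + ": " + raw, e);
    }
  }
}
```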

Original issue reported on code.google.com by [email protected] on 5 Nov 2012 at 1:55

Patch for /trunk/web/src/net/myrrix/web/InitListener.java

looking to? or looking for?

Original issue reported on code.google.com by [email protected] on 2 Nov 2012 at 2:27

Attachments:

InitListener.java.patch

Asynchronous client

In many contexts it would be useful to have an asynchronous client that can, at 
minimum, perform updates asynchronously. For example in most code paths, it's 
likely that the update will happen in the context of other updates, like 
updating a database. The app will not necessarily want to wait, or even fail, 
if the recommender update fails.

It would also be nice to implement asynchronous versions of methods like 
"recommend".
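
One possible shape for fire-and-forget updates, sketched with the JDK's CompletableFuture. AsyncUpdater is a hypothetical wrapper, and the Runnable stands in for a call such as setPreference on the existing client; failures are logged rather than propagated to the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncUpdater {

  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Runs the update on a background thread; a failed update is logged and swallowed
  CompletableFuture<Void> submit(Runnable update) {
    return CompletableFuture.runAsync(update, executor)
        .exceptionally(t -> {
          System.err.println("Recommender update failed: " + t);
          return null;
        });
  }

  void shutdown() {
    executor.shutdown();
  }
}
```

The caller can ignore the returned future entirely, or join on it where it does want to wait for the update.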

Original issue reported on code.google.com by [email protected] on 25 Nov 2012 at 2:04

Use Runtime.getRuntime().addShutdownHook instead of SignalManager

It seems that SignalManager tries to do what Runtime#addShutdownHook already 
does, but by using sun.misc.Signal, which is not part of the public Java API.

It would be simpler just to use Runtime#addShutdownHook, and remove the 
net.myrrix.common.signal package.

Note: I've tested it (using Runtime#addShutdownHook) and it behaves nicely when 
receiving SIGINT, SIGTERM and SIGQUIT (the hook is executed), while SIGKILL 
terminates the application forcibly, and the hook is not executed.
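
For reference, a minimal sketch of registering such a hook. ShutdownHookExample is a hypothetical name and the cleanup Runnable stands in for whatever orderly shutdown the server needs.

```java
final class ShutdownHookExample {

  // Registers cleanup to run when the JVM shuts down normally or on SIGINT / SIGTERM
  static Thread register(Runnable cleanup) {
    Thread hook = new Thread(cleanup, "shutdown-hook");
    Runtime.getRuntime().addShutdownHook(hook);
    return hook;
  }
}
```

Keeping the returned Thread allows the hook to be removed later with Runtime#removeShutdownHook if needed.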

Original issue reported on code.google.com by [email protected] on 6 Dec 2012 at 3:22

Load new X/Y model incrementally

Right now new models are loaded alongside the existing model. This is simple 
and allows for an atomic swap, but, means heap usage peaks well above normal 
levels. Ideally the model is loaded incrementally, with entries replaced one by 
one. This would mean the number of features can't change from run to run, but 
that's rare.

Original issue reported on code.google.com by [email protected] on 5 Sep 2012 at 12:31

Remove idea of "instance"

Right now, the computation layer uses the idea of a 'bucket' and an 'instance' 
within that bucket to isolate files used by one logical recommender instance. 
The reasons for having both a bucket and an instance are largely historical, 
and both are not required. For that reason I would like to remove 'instance'; 
callers would use a different bucket for each logical instance.

This corresponds to an Amazon S3 bucket for each logical instance, or a root 
directory on HDFS (or, really, any directory on HDFS), and neither is in short 
supply.

It simplifies the code and configuration, and solves at least one potential 
problem: right now, SSL certs are per-bucket, not per instance. So are 
rescorers. This is probably no longer a viable assumption.

I also believe that everyone uses one instance per bucket anyway.

However, it also requires existing users to make a change; data must be moved 
to continue processing. That means shutting down the CL, copying the data, 
restarting SLs with new configuration/version (which has to happen anyway for 
an upgrade), and then removing the old copy.

I'd like to solicit feedback first. If this change were made it would be 
accompanied by release notes detailing the above.

Original issue reported on code.google.com by [email protected] on 12 Feb 2013 at 11:51

Rest set/add preference does not support `0` as an input

What steps will reproduce the problem?
1. Call set preference (http://myrrix.com/rest-api/#setaddpreference) with the 
value `0`.
2. Server will fail

What is the expected output? What do you see instead?
Should return 200 and ingest the value the same way you ingest `0.0`

What version of the product are you using? On what operating system?
I am using the 0.9 standalone serving layer.

Please provide any additional information below.
Note that a workaround is to ingest `0.0` instead of `0`.

Original issue reported on code.google.com by [email protected] on 10 Jan 2013 at 7:51

SIGINT / SIGTERM bypass orderly shutdown of server process and sync of unflushed data

SIGINT (ctrl-C) or SIGTERM (kill) will cause the JVM to shut down in a hurry, 
and the Tomcat / app shutdown process is not invoked. This means data not 
stored to S3 / HDFS is lost. A signal handler should be in place to invoke 
orderly shutdown.

Original issue reported on code.google.com by [email protected] on 31 Oct 2012 at 10:47

In distributed mode, brand-new items may be unavailable in mostSimilarItems, recommendToAnonymous methods

In distributed mode, front-ends can be partitioned by user. This causes a 
potential issue when a brand new item (not user) arrives, since it will become 
known immediately to only 1 of the N front-ends. This is not an issue for most 
methods, but, a call to mostSimilarItems will tend to fail (unless it randomly 
uses that 1 of N frontends) for this brand-new item until the model is rebuilt.

The behavior should at least be deterministic and predictable. The view from 
the client needs to return an answer.

Original issue reported on code.google.com by [email protected] on 25 Nov 2012 at 2:07

Choose stopping point for iterations programmatically

Right now the user chooses the number of iterations. This is not ideal as it's 
not meaningfully choosable by the user. Really the iterations should stop when 
the results stop moving much. The implementations should sample this movement 
and stop when some threshold is reached instead.
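
A sketch of one such stopping rule, assuming a fixed sample of estimated values is recomputed after each iteration and compared with the previous iteration's values. ConvergenceCheck and the choice of mean absolute change as the measure are illustrative assumptions.

```java
final class ConvergenceCheck {

  // Average absolute difference between the same sampled estimates in two iterations
  static double meanAbsoluteChange(double[] previous, double[] current) {
    if (previous.length != current.length || previous.length == 0) {
      throw new IllegalArgumentException("Samples must be non-empty and the same length");
    }
    double total = 0.0;
    for (int i = 0; i < previous.length; i++) {
      total += Math.abs(current[i] - previous[i]);
    }
    return total / previous.length;
  }

  // True when the sampled estimates have moved less than the threshold on average
  static boolean hasConverged(double[] previous, double[] current, double threshold) {
    return meanAbsoluteChange(previous, current) < threshold;
  }
}
```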

Original issue reported on code.google.com by [email protected] on 24 Jan 2013 at 2:07

Local input dir variable ignored in run-serving-layer.sh script

Using myrrix-serving-0.7.jar

Using the run-serving-layer.sh script from myrrix-web, the LOCAL_INPUT_DIR 
variable is never appended to the ALL_ARGS variable, since the ALL_ARGS 
variable is overwritten on the next lines (choosing 
PORT/SECURE_PORT/KEYSTORE_FILE).

The fix is simple: append the LOCAL_INPUT_DIR after these lines.

P.S.: I'm sorry for posting this issue on getsatisfaction.com/myrrix first, I 
should have posted it here.

Original issue reported on code.google.com by [email protected] on 24 Oct 2012 at 3:21

removePreference() / setPreference API methods should not throw NotReadyException

Most methods can't operate until a model has been computed or loaded, like 
recommend(), so they throw NotReadyException. setPreference() and 
removePreference() can meaningfully operate without a model -- record the 
input, and simply not update the (non-existent) model.

Original issue reported on code.google.com by [email protected] on 23 Aug 2012 at 9:41

Implement a mostPopular method

It would be nice to have a mostPopular in MyrrixRecommender.

The signature would be something like this:
List<RecommendedItem> mostPopular(int howMany) throws TasteException;

It would then be implemented in ServerRecommender as a selectTopN from a 
special iterator which returns, as the score for a given item, the summed 
preference weight of all users (or the count of users who have a preference 
for this item, if that is easier).
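
To make the count-based variant concrete, here is a standalone sketch. It assumes the known items per user are available as a map from user ID to item IDs; MostPopularSketch is hypothetical and returns bare item IDs rather than RecommendedItem objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MostPopularSketch {

  // Ranks items by the number of users with a preference for them, ties broken by item ID
  static List<Long> mostPopular(Map<Long, Set<Long>> itemsByUser, int howMany) {
    Map<Long, Integer> counts = new HashMap<>();
    for (Set<Long> items : itemsByUser.values()) {
      for (Long item : items) {
        counts.merge(item, 1, Integer::sum);
      }
    }
    List<Map.Entry<Long, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : Long.compare(a.getKey(), b.getKey());
    });
    List<Long> result = new ArrayList<>();
    for (int i = 0; i < entries.size() && i < howMany; i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```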

What do you think? I will propose a patch later.

Original issue reported on code.google.com by [email protected] on 8 Jan 2013 at 10:51

Specify command line args in AWS EC2 instance

Right now when running the Serving Layer on EC2, user-data can be used to 
specify program arguments, but not JVM arguments. There should be a way to 
specify JVM args.

Original issue reported on code.google.com by [email protected] on 28 Sep 2012 at 2:38

removePreference should be exposed by the REST API

And, it should be idempotent. It should also remove not only items that no 
longer exist in a user's set of known items, but the user too if applicable (no 
more items known). This should also be documented more clearly versus 
setPreference() in the javadoc.

Original issue reported on code.google.com by [email protected] on 25 Jun 2012 at 12:32

Add mean average precision to PrecisionRecallEvaluator

This utility code should also compute MAP, mean average precision, as it's easy 
to do and a well understood metric.
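
For illustration, average precision for a single user could be computed as below; MAP is then the mean of this value over all evaluated users. This sketch divides by the number of relevant items, which is one common convention, and AveragePrecision is a hypothetical class, not the existing evaluator code.

```java
import java.util.List;
import java.util.Set;

final class AveragePrecision {

  // Sums precision at each rank where a relevant item appears, then normalizes
  static double averagePrecision(List<Long> ranked, Set<Long> relevant) {
    if (relevant.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    double sum = 0.0;
    for (int i = 0; i < ranked.size(); i++) {
      if (relevant.contains(ranked.get(i))) {
        hits++;
        sum += (double) hits / (i + 1);
      }
    }
    return sum / relevant.size();
  }
}
```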

Original issue reported on code.google.com by [email protected] on 24 Jan 2013 at 2:08

Add cluster-related API methods

The Computation Layer already computes clusters, optionally, with a kmeans++ / 
spectral variant. There is not yet an API method to access the clusters, and 
there should be.

Original issue reported on code.google.com by [email protected] on 8 Jan 2013 at 3:19

Support instance IDs that are non-numeric

There is no longer a good reason that instance IDs must be numbers. It should 
be easily possible to support any string, which may be more intuitive and 
usable.

Original issue reported on code.google.com by [email protected] on 23 Oct 2012 at 4:57

Apply logistic function to reconstructed / estimated values?

The values on which recommendations are ranked, and the result of the 
estimatePreference method, are actually elements in the reconstruction of the 
0/1 input matrix P. The values are typically between 0 and 1, but need not be 
in practice.

It may be more intuitive to limit the output to the range (0,1) by passing the 
result through the logistic function 1/(1+e^-x). In practice we would need to 
apply the logistic function to some function of x, like 5(x-0.5) in order to 
scale it appropriately.
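
The proposed transform is small enough to sketch directly. LogisticScaling is a hypothetical name, and the scaling 5(x-0.5) is just the example given above, not a tuned choice.

```java
final class LogisticScaling {

  // Standard logistic function, mapping any real x into (0,1)
  static double logistic(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  // Applies the logistic function to 5(x-0.5), so an estimate of 0.5 maps to 0.5
  static double scaledEstimate(double estimate) {
    return logistic(5.0 * (estimate - 0.5));
  }
}
```

Because the transform is strictly increasing, ordering by the scaled value gives the same ranking as ordering by the raw estimate.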

This would not affect relative rank of recommendations. It would affect the 
actual values.

Original issue reported on code.google.com by [email protected] on 31 Aug 2012 at 2:59

Print low-memory warnings in Serving Layer

Big input to the Serving Layer can exhaust the heap if it is not sized from 
default appropriately. It would probably be less surprising and more helpful to 
the user to periodically check heap availability during the load phase and 
print helpful warnings if it gets very low.
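
A sketch of the heap check using only java.lang.Runtime. HeapMonitor and the idea of comparing against a fixed free-fraction threshold are assumptions for illustration; the loader would call this periodically and log a warning when isLow returns true.

```java
final class HeapMonitor {

  // Fraction of the maximum heap that is still available to the JVM
  static double freeHeapFraction() {
    Runtime runtime = Runtime.getRuntime();
    long used = runtime.totalMemory() - runtime.freeMemory();
    long max = runtime.maxMemory();
    return (double) (max - used) / max;
  }

  static boolean isLow(double freeFraction, double threshold) {
    return freeFraction < threshold;
  }
}
```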

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 1:38

Need a better recommendedBecause algorithm

The "recommendedBecause" method just finds the user's item that is closest to a 
given target in feature space. This is not personalized. It is not the full 
process described well in the Hu/Koren/Volinsky paper. 

Naively, implementing that process involves pre-computing an f x f matrix Wu 
per user and storing it, and also storing the original input matrix R (or C). 
None of these are done yet and holding these in memory would be infeasible.

The task is to figure out a compromise or way around this.

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 6:30

Client should send compressed data to ingest endpoint

The ingest endpoint actually can accept compressed data already, but the client 
does not handle compressed files locally nor does it compress the data it 
sends. It should.

Original issue reported on code.google.com by [email protected] on 15 Jun 2012 at 5:42

Web app needs better error page

Instead of the default error page, it would be good to present something 
simpler and more customized in case of an error like 404 or 500

Original issue reported on code.google.com by [email protected] on 3 Oct 2012 at 8:08

Incorrect output format for recs / similar items from Computation Layer

The output on HDFS looks like

user   [item,value:item,value ...]

when it should be

user   [item:value,item:value ...]

and similarly for similar items.

Original issue reported on code.google.com by [email protected] on 17 Oct 2012 at 4:47

Consider different handling of negative input

The original ALS formulation does not allow negative values. The current 
implementation does, but assigns them very low weight. This is better than 
negative weight, which is ill-formed, but not as principled as it could be.

While it's a corner case, and not intended to be used with negative input, it 
should be possible to modify the formulation to use *increasing* weight for 
more negative values, but penalize difference from 0 instead of 1. This would 
be more principled, and likely to give more intuitive results in the case that 
someone does want to use negative input.

Original issue reported on code.google.com by [email protected] on 31 Oct 2012 at 12:48

Replace _LOCK file mechanism with query for running jobs?

Instead of using a _LOCK file on HDFS / S3 to indicate a CL is running, perhaps 
better to look for running jobs by querying the cluster.

Original issue reported on code.google.com by [email protected] on 29 Nov 2012 at 1:13

Add Computation Layer driver program

The Computation Layer is already a command-line program, but running it runs 
for one generation. It is meant to be run repeatedly, perhaps at regular 
intervals by a cron job.

We should make that available as a Java program as well, something that can run 
continuously and run the Computation Layer at fixed delay or after a certain 
amount of data is written.
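
The fixed-delay part could look roughly like this, using the JDK's ScheduledExecutorService. PeriodicDriver is a hypothetical name and the Runnable stands in for one Computation Layer run; triggering on the amount of data written is not covered here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class PeriodicDriver {

  // Runs one generation, waits for the delay after it finishes, then runs the next
  static ScheduledExecutorService start(Runnable runGeneration, long delay, TimeUnit unit) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    Runnable safe = () -> {
      try {
        runGeneration.run();
      } catch (RuntimeException e) {
        // An uncaught exception would cancel all future runs, so log and continue
        System.err.println("Generation failed: " + e);
      }
    };
    executor.scheduleWithFixedDelay(safe, 0, delay, unit);
    return executor;
  }
}
```

The caller is responsible for calling shutdown() on the returned executor to stop the loop.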

Original issue reported on code.google.com by [email protected] on 28 Aug 2012 at 1:39

Add client integration examples

For example, would be interesting to provide sample code showing how to add 
data in response to a message from a JMS queue.

Original issue reported on code.google.com by [email protected] on 26 Nov 2012 at 10:28

Automate tuning of lambda, number of features

The project should have some additional utility class to choose lambda and 
number of features, within a range, that seem to maximize a given metric.

Original issue reported on code.google.com by [email protected] on 24 Jan 2013 at 2:09

[PATCH] Myrrix-server: Allow to pass multiple rescorerParams values in url

In AbstractMyrrixServlet#getRescorerParams, the method used 
(request.getParameter) only allows a single value to be passed. The method used 
should be request.getParameterValues, which allows multiple values to be 
passed and is consistent with the RescorerProvider contract (String... args).

Patch attached.

Original issue reported on code.google.com by [email protected] on 14 Nov 2012 at 9:22

Attachments:

AbstractMyrrixServlet.java.patch
