
spark-ffm's Introduction

Spark-FFM

A Spark-based implementation of Field-aware Factorization Machines. See http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf

The data should be formatted as

label field1:feat1:val1 field2:feat2:val2

to fit FFM; that is, it extends the LIBSVM data format by adding field information to each feature.
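For example, two records in this format might look like the following (the field/feature indices and values here are made up purely for illustration):

```text
1 1:3:1.0 2:7:0.5 3:10:1.0
-1 1:2:1.0 2:8:0.3 3:11:1.0
```

Each triple is field:feature:value, with the label (1 or -1) leading the line.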

Currently, we support parallel SGD and parallel AdaGrad as optimization methods, as they are more efficient for dealing with large datasets.

In addition, users can choose to train the FFMModel with or without a global bias and one-way interactions.

Contact & Feedback

If you encounter bugs, feel free to submit an issue or pull request.

spark-ffm's People

Contributors: vinceshieh

spark-ffm's Issues

initial coef and predict value

Thanks for your Spark implementation of FFM.
Here's my question:

generateInitWeights() ... val coef = 0.5 / Math.sqrt(param.k) ...
predict() ... val v: Double = 2.0 * v1 * v2 * r ...

Why not

generateInitWeights() ... val coef = 1.0 / Math.sqrt(param.k) ...
predict() ... val v: Double = 1.0 * v1 * v2 * r ...

This is closer to the description in the original paper: "The initial values of w are randomly sampled from a uniform distribution between [0, 1.0/sqrt(k)]".
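A minimal sketch of the paper-style initialization, assuming n features, m fields, and k latent factors (function and parameter names here are illustrative, not the project's actual API):

```scala
import scala.util.Random

// Initialize FFM latent weights uniformly in [0, 1/sqrt(k)],
// as described in the original FFM paper.
def initWeights(n: Int, m: Int, k: Int, seed: Long = 42L): Array[Double] = {
  val rand = new Random(seed)
  val coef = 1.0 / math.sqrt(k)  // upper bound of the uniform range
  Array.fill(n * m * k)(rand.nextDouble() * coef)
}

val w = initWeights(n = 10, m = 4, k = 8)
```

Whether the project should use 1.0 or 0.5 as the scaling constant is exactly the question above; the sketch simply follows the paper's wording.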

Weight Initialisation might be a cause for loss NaN

As the weights are initialized at lines 70-75 of FFMWithAdag.scala, the standard deviation of Z = W * X + b is roughly sqrt(mn/2). Once Z grows past about 709, exp(z) overflows and the loss becomes NaN. A proper weight initialization should make the Gaussian distribution narrower. coef = sqrt(1/(mnk))?
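A quick illustration, in plain Scala independent of the project code, of why the loss turns into NaN once z overflows:

```scala
// Double overflows to Infinity once the exponent exceeds ~709.78.
val zSafe  = 709.0
val zLarge = 710.0

println(math.exp(zSafe))   // large but finite
println(math.exp(zLarge))  // Infinity

// A naively computed sigmoid then yields Infinity / Infinity = NaN.
val sigmoid = math.exp(zLarge) / (1.0 + math.exp(zLarge))
println(sigmoid.isNaN)     // true
```

This is consistent with the report above: once the weights are wide enough that |z| routinely exceeds the overflow threshold, both the loss and subsequent gradient updates become NaN.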

tr_loss is always NaN

Hi, nice code! I tried to run this code on my dataset, but the tr_loss is always NaN. I found that part of the weights is NaN too. Is there any possible reason for it?

Can't load FFM model

Hello, the FFM model generates fine, but why can't I load it back to predict?
The error is as follows:
Exception in thread "main" java.lang.Exception: FFMModel.load did not recognize model with (className, format version):(org.apache.spark.mllib.classification.FFMModel$SaveLoadV1_0$, 1.0).

Prediction Label >= 0.5 should change to > 0.5 ?

In TestFFM.scala,

val scores: RDD[(Double, Double)] = testing.map(x => {
  val p = ffm.predict(x._2)
  val ret = if (p >= 0.5) 1.0 else -1.0
  (ret, x._1)
})

should change to

val scores: RDD[(Double, Double)] = testing.map(x => {
  val p = ffm.predict(x._2)
  val ret = if (p > 0.5) 1.0 else -1.0
  (ret, x._1)
})

because in the Spark source code (BinaryLabelCounter.scala) it is:

/** Processes a label. */
def +=(label: Double): BinaryLabelCounter = {
  // Though we assume 1.0 for positive and 0.0 for negative, the following check will handle
  // -1.0 for negative as well.
  if (label > 0.5) numPositives += 1L else numNegatives += 1L
  this
}

Feature indices do not coincide

As per the weights array, feature indices should coincide with each other.
Say a feature j1 = data(i)._2 - 1 gets its weight for field f2 as: j1 * align1 + f2 * align0.
In FFMModel in FieldAwaredFactorizationMachine.scala (lines 122 and 131), the index for data(i) is data(i)._2 - 1, but the index for data(ii) is data(ii)._2 with no - 1. Shouldn't these conform with each other?

val j1 = data(i)._2 - 1  // line 122
val f1 = data(i)._1
val v1 = data(i)._3
......
while (ii < valueSize) {
  val j2 = data(ii)._2   // line 131
  val f2 = data(ii)._1
......
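A small sketch of why the off-by-one matters, assuming the flattened layout implied by the issue, align0 = k and align1 = m * k (the actual strides in the project may differ):

```scala
val m = 4   // number of fields
val k = 8   // number of latent factors
val align0 = k       // stride per field
val align1 = m * k   // stride per feature

// Weight offset for feature j interacting with field f.
def offset(j: Int, f: Int): Int = j * align1 + f * align0

// The same raw feature index (say 5, 1-based) addressed two ways:
val withMinusOne    = offset(5 - 1, f = 2)  // as data(i)._2 - 1
val withoutMinusOne = offset(5,     f = 2)  // as data(ii)._2
```

The two conventions land on different features' weight blocks, exactly align1 positions apart, so one side of each pairwise interaction would read another feature's weights.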

I got this error

get numFields:96,nunFeatures:998430273,numFactors:3
allocating:-429780416
Exception in thread "main" java.lang.NegativeArraySizeException
at org.apache.spark.mllib.classification.FFMWithAdag.generateInitWeights(FFMWithAdag.scala:68)
at org.apache.spark.mllib.classification.FFMWithAdag.run(FFMWithAdag.scala:110)
at org.apache.spark.mllib.classification.FFMWithAdag$.train(FFMWithAdag.scala:142)
at TestFFM$.main(TestFFM.scala:33)
at TestFFM.main(TestFFM.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
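The negative allocation size in the log is consistent with 32-bit integer overflow: with numFeatures = 998430273, numFields = 96, and numFactors = 3, a weight array sized as numFeatures * numFields * numFactors * 2 (weights plus per-weight AdaGrad accumulators, assuming that is the layout generateInitWeights uses) wraps around when computed in Int arithmetic:

```scala
val numFeatures = 998430273
val numFields   = 96
val numFactors  = 3

// Computed in Int arithmetic, the size wraps around to a negative value --
// matching the "allocating:-429780416" seen in the log above.
val sizeInt: Int = numFeatures * numFields * numFactors * 2
println(sizeInt)   // -429780416

// The true size needs a Long, and is far beyond the JVM's maximum array length.
val sizeLong: Long = numFeatures.toLong * numFields * numFactors * 2
println(sizeLong)  // 575095837248
```

So with this many features the model cannot fit in a single JVM array at all; reducing the feature space (e.g. by hashing) would be needed, and the size computation should at least be done in Long to fail with a clear error.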

features to complete

I summarized the current design and came up with a few things that can be done to make spark-ffm better:
1. add a global bias (intercept) and one-way interactions; the current design only supports pairwise interactions
2. normalization support

It'd be nice if any of you are interested in working on this project; you are always welcome, and I will try my best to help. :)

average loss bug

val (wSum, lSum, miniBatchSize) = data.treeAggregate((BDV(bcWeights.value.toArray), 0.0, 0L))(
  seqOp = (c, v) => {
    val r = gradient.asInstanceOf[FFMGradient].computeFFM(v._1, (v._2), Vectors.fromBreeze(c._1),
      1.0, eta, regParam, true, i, solver)
    (r._1, r._2 + c._2, c._3 + 1)
  },
  combOp = (c1, c2) => {
    (c1._1 + c2._1, c1._2 + c2._2, c1._3 + c2._3)
  }) // TODO: add depth level
bcWeights.destroy(blocking = false)

  if (miniBatchSize > 0) {
    stochasticLossHistory += lSum / miniBatchSize
    weights = Vectors.dense(wSum.toArray.map(_ / miniBatchSize))
    println("iter:" + (i + 1) + ",tr_loss:" + lSum / miniBatchSize)
  } else {
    println(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
  }

Averaging the weights with weights = Vectors.dense(wSum.toArray.map(_ / miniBatchSize)) is not correct: combOp sums one weight vector per partition, so wSum holds the sum of per-partition weight vectors, not per-example contributions. Dividing by miniBatchSize (the total number of examples) therefore makes all weights far too small.

The right way is weights = Vectors.dense(wSum.toArray.map(_ / data.getNumPartitions)).
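A toy illustration of the scale problem, using plain Scala arrays to stand in for the per-partition weight vectors (the numbers are made up):

```scala
// Suppose 4 partitions each end the iteration with identical trained weights,
// while the mini-batch held 1000 examples in total.
val perPartitionWeights = Seq.fill(4)(Array(1.0, -2.0, 0.5))
val miniBatchSize = 1000L

// combOp sums one vector per partition:
val wSum = perPartitionWeights.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

// Dividing by the example count collapses the weights toward zero...
val wrong = wSum.map(_ / miniBatchSize)   // ~0.004, -0.008, 0.002
// ...while dividing by the partition count recovers the right scale.
val right = wSum.map(_ / perPartitionWeights.size)  // 1.0, -2.0, 0.5
```

With real data the per-partition vectors differ slightly, but the scale argument is the same: the divisor must be the number of summed vectors (partitions), not the number of examples.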
