
spark-ffm's Introduction

Spark-FFM

A Spark-based implementation of Field-aware Factorization Machines. See http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf

The data should be formatted as

label field1:feat1:val1 field2:feat2:val2

to fit FFM; that is, it extends the LIBSVM data format by adding field information to each feature.
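For example, two records in this format might look like the following (the field/feature indices and values here are made up purely for illustration):

```text
1 1:3:1.0 2:7:0.5 3:10:1.0
-1 1:2:1.0 2:8:0.3 3:11:1.0
```

Each triple is field:feature:value, with the label (1 or -1) leading the line.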

Currently, we support parallel SGD and parallel AdaGrad as optimization methods, as they are more efficient for dealing with large datasets.

In addition, users can choose to train the FFMModel with or without a global bias and one-way interactions.

Contact & Feedback

If you encounter bugs, feel free to submit an issue or pull request.

spark-ffm's People

Contributors: vinceshieh

spark-ffm's Issues

initial coef and predict value

Thanks for your Spark implementation of FFM.
Here's my question:

generateInitWeights() ... val coef = 0.5 / Math.sqrt(param.k) ...
predict() ... val v: Double = 2.0 * v1 * v2 * r ...

Why not

generateInitWeights() ... val coef = 1.0 / Math.sqrt(param.k) ...
predict() ... val v: Double = 1.0 * v1 * v2 * r ...

This is closer to the description in the original paper: "The initial values of w are randomly sampled from a uniform distribution between [0, 1.0/sqrt(k)]".
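A minimal sketch of the paper-style initialization, assuming n features, m fields, and k latent factors (function and parameter names here are illustrative, not the project's actual API):

```scala
import scala.util.Random

// Initialize FFM latent weights uniformly in [0, 1/sqrt(k)],
// as described in the original FFM paper.
def initWeights(n: Int, m: Int, k: Int, seed: Long = 42L): Array[Double] = {
  val rand = new Random(seed)
  val coef = 1.0 / math.sqrt(k)  // upper bound of the uniform range
  Array.fill(n * m * k)(rand.nextDouble() * coef)
}

val w = initWeights(n = 10, m = 4, k = 8)
```

Whether the project should use 1.0 or 0.5 as the scaling constant is exactly the question above; the sketch simply follows the paper's wording.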

Weight Initialisation might be a cause for loss NaN

As the weights are initialized at lines 70-75 of FFMWithAdag.scala, the standard deviation of Z = W * X + b is roughly sqrt(mn/2). Once Z grows past about 709, exp(z) overflows and the loss becomes NaN. A proper weight initialization should make the Gaussian distribution narrower. coef = sqrt(1/(mnk))?
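A quick illustration, in plain Scala independent of the project code, of why the loss turns into NaN once z overflows:

```scala
// Double overflows to Infinity once the exponent exceeds ~709.78.
val zSafe  = 709.0
val zLarge = 710.0

println(math.exp(zSafe))   // large but finite
println(math.exp(zLarge))  // Infinity

// A naively computed sigmoid then yields Infinity / Infinity = NaN.
val sigmoid = math.exp(zLarge) / (1.0 + math.exp(zLarge))
println(sigmoid.isNaN)     // true
```

This is consistent with the report above: once the weights are wide enough that |z| routinely exceeds the overflow threshold, both the loss and subsequent gradient updates become NaN.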

tr_loss is always NaN

Hi, nice code! I tried to run this code on my dataset, but the tr_loss is always NaN. I found that part of the weights is NaN too. Is there any possible reason for it?

Can't load FFM model

Hello, the FFM model generates fine, but why can't I load it back to predict?
The error is as follows:
Exception in thread "main" java.lang.Exception: FFMModel.load did not recognize model with (className, format version):(org.apache.spark.mllib.classification.FFMModel$SaveLoadV1_0$, 1.0).

Prediction Label >= 0.5 should change to > 0.5 ?

In TestFFM.scala,

val scores: RDD[(Double, Double)] = testing.map(x => {
  val p = ffm.predict(x._2)
  val ret = if (p >= 0.5) 1.0 else -1.0
  (ret, x._1)
})

should change to

val scores: RDD[(Double, Double)] = testing.map(x => {
  val p = ffm.predict(x._2)
  val ret = if (p > 0.5) 1.0 else -1.0
  (ret, x._1)
})

because in the Spark source code (BinaryLabelCounter.scala) it is:

/** Processes a label. */
def +=(label: Double): BinaryLabelCounter = {
  // Though we assume 1.0 for positive and 0.0 for negative, the following check will handle
  // -1.0 for negative as well.
  if (label > 0.5) numPositives += 1L else numNegatives += 1L
  this
}

Feature indices do not coincide

As per the weights array, feature indices should coincide with each other.
Say a feature j1 = data(i)._2 - 1 gets its weight for field f2 as: j1 * align1 + f2 * align0.
In FFMModel in FieldAwaredFactorizationMachine.scala (lines 122 and 131), the index for data(i) is data(i)._2 - 1, but the index for data(ii) is data(ii)._2 with no - 1. Shouldn't these conform with each other?

val j1 = data(i)._2 - 1  // line 122
val f1 = data(i)._1
val v1 = data(i)._3
......
while (ii < valueSize) {
  val j2 = data(ii)._2   // line 131
  val f2 = data(ii)._1
......
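A small sketch of why the off-by-one matters, assuming the flattened layout implied by the issue, align0 = k and align1 = m * k (the actual strides in the project may differ):

```scala
val m = 4   // number of fields
val k = 8   // number of latent factors
val align0 = k       // stride per field
val align1 = m * k   // stride per feature

// Weight offset for feature j interacting with field f.
def offset(j: Int, f: Int): Int = j * align1 + f * align0

// The same raw feature index (say 5, 1-based) addressed two ways:
val withMinusOne    = offset(5 - 1, f = 2)  // as data(i)._2 - 1
val withoutMinusOne = offset(5,     f = 2)  // as data(ii)._2
```

The two conventions land on different features' weight blocks, exactly align1 positions apart, so one side of each pairwise interaction would read another feature's weights.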

I got this error

get numFields:96,nunFeatures:998430273,numFactors:3
allocating:-429780416
Exception in thread "main" java.lang.NegativeArraySizeException
at org.apache.spark.mllib.classification.FFMWithAdag.generateInitWeights(FFMWithAdag.scala:68)
at org.apache.spark.mllib.classification.FFMWithAdag.run(FFMWithAdag.scala:110)
at org.apache.spark.mllib.classification.FFMWithAdag$.train(FFMWithAdag.scala:142)
at TestFFM$.main(TestFFM.scala:33)
at TestFFM.main(TestFFM.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
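The negative allocation size in the log is consistent with 32-bit integer overflow: with numFeatures = 998430273, numFields = 96, and numFactors = 3, a weight array sized as numFeatures * numFields * numFactors * 2 (weights plus per-weight AdaGrad accumulators, assuming that is the layout generateInitWeights uses) wraps around when computed in Int arithmetic:

```scala
val numFeatures = 998430273
val numFields   = 96
val numFactors  = 3

// Computed in Int arithmetic, the size wraps around to a negative value --
// matching the "allocating:-429780416" seen in the log above.
val sizeInt: Int = numFeatures * numFields * numFactors * 2
println(sizeInt)   // -429780416

// The true size needs a Long, and is far beyond the JVM's maximum array length.
val sizeLong: Long = numFeatures.toLong * numFields * numFactors * 2
println(sizeLong)  // 575095837248
```

So with this many features the model cannot fit in a single JVM array at all; reducing the feature space (e.g. by hashing) would be needed, and the size computation should at least be done in Long to fail with a clear error.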

features to complete

I summarized the current design and came up with a few things that can be done to make spark-ffm better:
1. add a global bias (intercept) and one-way interactions; the current design only supports pairwise interactions
2. normalization support

It'd be nice if any of you are interested in working on this project; you are always welcome, and I will try my best to help. :)

average loss bug

val (wSum, lSum, miniBatchSize) = data.treeAggregate((BDV(bcWeights.value.toArray), 0.0, 0L))(
  seqOp = (c, v) => {
    val r = gradient.asInstanceOf[FFMGradient].computeFFM(v._1, (v._2), Vectors.fromBreeze(c._1),
      1.0, eta, regParam, true, i, solver)
    (r._1, r._2 + c._2, c._3 + 1)
  },
  combOp = (c1, c2) => {
    (c1._1 + c2._1, c1._2 + c2._2, c1._3 + c2._3)
  }) // TODO: add depth level
bcWeights.destroy(blocking = false)

  if (miniBatchSize > 0) {
    stochasticLossHistory += lSum / miniBatchSize
    weights = Vectors.dense(wSum.toArray.map(_ / miniBatchSize))
    println("iter:" + (i + 1) + ",tr_loss:" + lSum / miniBatchSize)
  } else {
    println(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
  }

Averaging the weights with weights = Vectors.dense(wSum.toArray.map(_ / miniBatchSize)) is not correct: combOp sums one weight vector per partition, so wSum holds the sum of per-partition weight vectors, not per-example contributions. Dividing by miniBatchSize (the total number of examples) therefore makes all weights far too small.

The right way is weights = Vectors.dense(wSum.toArray.map(_ / data.getNumPartitions)).
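A toy illustration of the scale problem, using plain Scala arrays to stand in for the per-partition weight vectors (the numbers are made up):

```scala
// Suppose 4 partitions each end the iteration with identical trained weights,
// while the mini-batch held 1000 examples in total.
val perPartitionWeights = Seq.fill(4)(Array(1.0, -2.0, 0.5))
val miniBatchSize = 1000L

// combOp sums one vector per partition:
val wSum = perPartitionWeights.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

// Dividing by the example count collapses the weights toward zero...
val wrong = wSum.map(_ / miniBatchSize)   // ~0.004, -0.008, 0.002
// ...while dividing by the partition count recovers the right scale.
val right = wSum.map(_ / perPartitionWeights.size)  // 1.0, -2.0, 0.5
```

With real data the per-partition vectors differ slightly, but the scale argument is the same: the divisor must be the number of summed vectors (partitions), not the number of examples.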
