ycjuan / libffm

A Library for Field-aware Factorization Machines

License: BSD 3-Clause "New" or "Revised" License
Table of Contents
=================

- What is LIBFFM
- Overfitting and Early Stopping
- Installation
- Data Format
- Command Line Usage
- Examples
- OpenMP and SSE
- Building Windows Binaries
- FAQ

What is LIBFFM
==============

LIBFFM is a library for field-aware factorization machines (FFM). The
field-aware factorization machine is an effective model for CTR prediction.
It has been used to win the top-3 positions in the following competitions:

* Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge
* Avazu: https://www.kaggle.com/c/avazu-ctr-prediction
* Outbrain: https://www.kaggle.com/c/outbrain-click-prediction
* RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934

You can find more information about FFM in the following papers and slides:

* http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
* http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
* https://arxiv.org/abs/1701.04099

Overfitting and Early Stopping
==============================

FFM is prone to overfitting, and the solution we have so far is early
stopping. See how FFM behaves on a certain data set:

    > ffm-train -p va.ffm -l 0.00002 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
       9      0.41394      0.47088
      10      0.40326      0.47228
      11      0.39156      0.47435
      12      0.37886      0.47683
      13      0.36522      0.47975
      14      0.35079      0.48321
      15      0.33578      0.48703

The best validation loss is achieved at the 7th iteration; if we keep
training, overfitting begins. It is worth noting that increasing the
regularization parameter does not help:

    > ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.50532      0.49905
       2      0.48782      0.49242
       3      0.48136      0.48748
     ...
      29      0.42183      0.47014
     ...
      48      0.37071      0.47333
      49      0.36767      0.47374
      50      0.36472      0.47404

To avoid overfitting, we recommend always providing a validation set with
the option `-p.' You can use the option `--auto-stop' to stop at the
iteration that reaches the best validation loss:

    > ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
    Auto-stop. Use model at 7th iteration.

Installation
============

Requirement: a C++11 compatible compiler. We also use OpenMP to provide
multi-threading. If OpenMP is not available on your platform, please refer
to the section `OpenMP and SSE.'

- Unix-like systems: type `make' in the command line.

- Windows: see `Building Windows Binaries' to compile.

Data Format
===========

The data format of LIBFFM is:

    <label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...

`field' and `feature' should be non-negative integers; see the example
`bigdata.tr.txt.'

It is important to understand the difference between `field' and
`feature'. For example, if we have raw data like this:

    Click  Advertiser  Publisher
    =====  ==========  =========
        0  Nike        CNN
        1  ESPN        BBC

here we have

* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionaries, one for fields and one
for features, like this:

    DictField[Advertiser] -> 0
    DictField[Publisher]  -> 1

    DictFeature[Advertiser-Nike] -> 0
    DictFeature[Publisher-CNN]   -> 1
    DictFeature[Advertiser-ESPN] -> 2
    DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM-format data:

    0 0:0:1 1:1:1
    1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.
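As an illustration of the dictionary-building step above, here is a minimal
standalone sketch (not part of LIBFFM; the code and all names are mine) that
maps the raw Advertiser/Publisher table to the two FFM lines shown:

    // csv2ffm_sketch.cpp: build field/feature dictionaries and emit
    // FFM-format lines for the toy table above. Illustrative only.
    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::string> fields = {"Advertiser", "Publisher"};
        // label, then one categorical value per field
        std::vector<std::pair<int, std::vector<std::string>>> rows = {
            {0, {"Nike", "CNN"}},
            {1, {"ESPN", "BBC"}},
        };

        std::map<std::string, int> dict_field, dict_feature;
        for (const auto &f : fields)
            dict_field.emplace(f, (int)dict_field.size());

        for (const auto &row : rows) {
            std::printf("%d", row.first);
            for (std::size_t j = 0; j < row.second.size(); j++) {
                std::string feat = fields[j] + "-" + row.second[j];
                // Assign the next unused feature id on first sight.
                auto it = dict_feature.emplace(feat, (int)dict_feature.size()).first;
                // Categorical features always get value 1.
                std::printf(" %d:%d:1", dict_field[fields[j]], it->second);
            }
            std::printf("\n");
        }
        return 0;
    }

Running it prints exactly the two lines from the example: `0 0:0:1 1:1:1'
and `1 0:2:1 1:3:1'.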
Command Line Usage
==================

- `ffm-train'

    usage: ffm-train [options] training_set_file [model_file]

    options:
    -l <lambda>: set regularization parameter (default 0.00002)
    -k <factor>: set number of latent factors (default 4)
    -t <iteration>: set number of iterations (default 15)
    -r <eta>: set learning rate (default 0.2)
    -s <nr_threads>: set number of threads (default 1)
    -p <path>: set path to the validation set
    --quiet: quiet mode (no output)
    --no-norm: disable instance-wise normalization
    --auto-stop: stop at the iteration that achieves the best validation
                 loss (must be used with -p)

  By default we do instance-wise normalization; that is, we normalize the
  2-norm of each instance to 1. You can use `--no-norm' to disable this
  behavior.

  A binary file `training_set_file.bin' will be generated to store the
  data in binary format.

  Because FFM usually needs early stopping for better test performance, we
  provide the option `--auto-stop' to stop at the iteration that achieves
  the best validation loss. Note that you need to provide a validation set
  with `-p' when you use this option.

- `ffm-predict'

    usage: ffm-predict test_file model_file output_file

Examples
========

Download the toy data from:

    zip:    https://drive.google.com/open?id=1HZX7zSQJy26hY4_PxSlOWz4x7O-tbQjt
    tar.gz: https://drive.google.com/open?id=12-EczjiYGyJRQLH5ARy1MXRFbCvkgfPx

This data set is subsampled 1% from Criteo's challenge.

    > tar -xzf libffm_toy.tar.gz

or

    > unzip libffm_toy.zip

Train a model using the default parameters:

    > ./ffm-train -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model

Do prediction:

    > ./ffm-predict libffm_toy/criteo.va.r100.gbdt0.ffm model output

Train a model using the following parameters:

    regularization cost = 0.0001
    latent factors = 15
    iterations = 30
    learning rate = 0.05
    threads = 4
    let it auto-stop

    > ./ffm-train -l 0.0001 -k 15 -t 30 -r 0.05 -s 4 --auto-stop -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model

OpenMP and SSE
==============

We use OpenMP for parallelization. If OpenMP is not available on your
platform, please comment out the following lines in the Makefile:

    DFLAG += -DUSEOMP
    CXXFLAGS += -fopenmp

Note: please run `make clean all' if these flags are changed.

We use SSE instructions to perform fast computation. If you do not want to
use them, comment out the following line:

    DFLAG += -DUSESSE

Then, run `make clean all'.
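If you are unsure whether your toolchain actually picked up OpenMP, a quick
standalone check can help (illustrative code, not part of LIBFFM): compile
a file like the following with `g++ -fopenmp' and run it; it should print
one line per available thread.

    #include <cstdio>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main() {
    #ifdef _OPENMP
        // Each thread in the team executes the statement below once.
        #pragma omp parallel
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    #else
        std::printf("compiled without OpenMP\n");
    #endif
        return 0;
    }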
Building Windows Binaries
=========================

The Windows part is maintained by a different maintainer, so it may not
always support the latest version. The latest version it supports is v1.21.

To build the binaries via the command-line tools of Visual C++, use the
following steps:

1. Open a DOS command box (or Developer Command Prompt for Visual Studio)
   and go to the LIBFFM directory. If the environment variables of VC++
   have not been set, type

       "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

   You may have to modify the above command according to which version of
   VC++ you have and where it is installed.

2. Type

       nmake -f Makefile.win clean all

FAQ
===

Q: Why do I get the same model size when k = 1 and k = 4?

A: This is because we use SSE instructions. In order to use SSE, the
   memory needs to be aligned, so even if you assign k = 1, we still fill
   dummy zeros from k = 2 to 4.

Q: Why is the logloss slightly different on the same data when I run the
   program two or more times with multi-threading?

A: When there is more than one thread, the program becomes
   non-deterministic. To make it deterministic, you can use only one
   thread.

Contributors
============

Yuchin Juan, Wei-Sheng Chin, and Yong Zhuang

For questions, comments, feature requests, or bug reports, please send
email to Yuchin Juan ([email protected]).

For Windows-related questions, please send email to Wei-Sheng Chin
([email protected]).
Hello,
I'm trying to use the libffm-linear library. Here are my outputs:
libffm-linear>windows\ffm-train -s 2 -l 0 -k 10 -t 50 -r 0.01 --auto-stop -p test_data.txt train_data.txt model
iter   tr_logloss   va_logloss
   1      0.25510      0.25017
   2      0.25129      0.24927
   3      0.25070      0.24882
   4      0.25041      0.24843
   5      0.25020      0.24821
   6      0.25005      0.24808
   7      0.24990      0.24801
   8      0.24977      0.24800
   9      0.24968      0.24820
Auto-stop. Use model at 8th iteration.
libffm-linear>windows\ffm-predict test_data.txt model output_file
logloss = 0.34800
Why does the prediction logloss differ from the validation logloss on the same file?

How can I output AUC-ROC to the console?

Unknown features (such as a new app_id or device_id that was not in the training data) lead to random probabilities (too small or too high). Could you suggest a workaround for using LIBFFM in that case?
Thanks for your amazing libffm.
When using ffm_predict, I am not sure how to fill in the FFM data format when the test data set has no labels.
Thanks again.
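For what it's worth, the predicted probabilities do not depend on the label
column, so one common workaround (my suggestion, not documented behavior) is
to write a placeholder label such as 0 for every test instance and simply
ignore the logloss that ffm-predict reports in that case:

    0 0:0:1 1:2:1 2:7:1
    0 0:1:1 1:3:1 2:9:1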
Can two fields have the same feature id? In other words, is the feature id associated with the field?
It would be useful to mention in the README that memory allocation depends on k_aligned, not just k. So changing k from 4 to 5 actually doubles memory requirements.
Is there any particular reason why you align k to a power of 2?
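A small sketch of the arithmetic described here (my reading of the reported
behavior, with illustrative sizes; per the FAQ above, at least 4 floats are
kept per slot for SSE): if k is rounded up to the next power of two and each
latent weight is stored with one companion accumulator float, then going
from k = 4 to k = 5 doubles the allocation.

    #include <cstdio>
    #include <initializer_list>

    // Hypothetical helper: round k up to the next power of two, minimum 4.
    int k_aligned(int k) {
        int a = 4;
        while (a < k)
            a *= 2;
        return a;
    }

    int main() {
        long long n = 1000000; // number of features (illustrative)
        long long m = 26;      // number of fields (illustrative)
        for (int k : {1, 4, 5, 8, 9}) {
            // factor 2: one extra float stored alongside each weight
            long long bytes = n * m * k_aligned(k) * 2 * (long long)sizeof(float);
            std::printf("k = %d -> k_aligned = %d -> %.2f GB\n",
                        k, k_aligned(k), bytes / 1e9);
        }
        return 0;
    }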
There are almost no comments in the implementation, which makes it hard to read and learn from. C++ code is already harder to read than Python, and the lack of comments makes it much harder for learners. All in all, the implementation is unfriendly to newcomers. Please add the necessary comments; at the very least, the members of the structs should be documented.
Thank you on behalf of everyone.
How can tags associated with an item be used as a field in FFM? In FFM, only one feature for a given field is normally turned on, but with tags several features have the value "1" for the same field. So how can tags be used as a field in FFM?
g++ -Wall -O3 -std=c++0x -march=native -fopenmp -DUSESSE -DUSEOMP -c -o ffm.o ffm.cpp
/tmp/cc2xJsit.s: Assembler messages:
/tmp/cc2xJsit.s:3277: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cc2xJsit.s:3286: Error: suffix or operands invalid for `vpaddd'
/tmp/cc2xJsit.s:3598: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cc2xJsit.s:3609: Error: suffix or operands invalid for `vpaddd'
/tmp/cc2xJsit.s:3949: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cc2xJsit.s:3955: Error: suffix or operands invalid for `vpaddd'
/tmp/cc2xJsit.s:4273: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cc2xJsit.s:4284: Error: suffix or operands invalid for `vpaddd'
Consider a case where several of the binary features in a field can be true. For example, one might want to encode the history of recent advertisers that were shown to a user.
In regards to this, the paper says:

    Note that according to the number of possible values in a categorical
    feature, the same number of binary features are generated and every
    time only one of them has the value 1.
I'm using the Python wrapper, and it trains on such a feature configuration. For example, the following (field, feature, value) sample will run:

    [(1, 2, 1), (2, 3, 1), (3, 5, 1), (3, 6, 1), (3, 7, 1)]

But this seems to go against the statement from the paper. So is this code just working by coincidence, or is FFM actually capable of learning from this sort of "history" encoding?
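For concreteness, here is what such a multi-hot field looks like in LIBFFM's
text format, using the same field/feature ids as the tuple sample above (my
illustration; the parser accepts it, but the paper does not describe it).
Some users also down-weight each active tag so the field's total value stays
1, rather than using 1 for every tag:

    0 1:2:1 2:3:1 3:5:0.33333 3:6:0.33333 3:7:0.33333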
To learn about FFM, I would like to find a TensorFlow version of it.
Regarding train() in ffm.cpp (lines 228-375), I have a question about thread safety. Below are lines 288-312:
#if defined USEOMP
#pragma omp parallel for schedule(static) reduction(+: tr_loss)
#endif
for(ffm_int ii = 0; ii < (ffm_int)order.size(); ii++)
{
    ffm_int i = order[ii];
    ffm_float y = tr->Y[i];
    ffm_node *begin = &tr->X[tr->P[i]];
    ffm_node *end = &tr->X[tr->P[i+1]];
    ffm_float r = R_tr[i];
    ffm_float t = wTx(begin, end, r, *model);
    ffm_float expnyt = exp(-y*t);
    tr_loss += log(1+expnyt);
    ffm_float kappa = -y*expnyt/(1+expnyt);
    wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true);
}
I'm new to OpenMP parallel operations, and I'm curious whether thread safety is ensured for the wTx call at the very bottom: wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true). Since wTx with do_update = true updates the weights, it seems it could interfere with other threads updating the same weights.
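For context, my understanding is that this is a deliberate lock-free
(Hogwild-style) design rather than strict per-weight thread safety: threads
may race on weight updates, which is also why multi-threaded runs are
non-deterministic (see the FAQ above). A minimal self-contained sketch of
that pattern (illustrative code, not the library's):

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<float> w(1000, 0.0f); // shared weights, no lock

    #if defined USEOMP
    #pragma omp parallel for schedule(static)
    #endif
        for (int i = 0; i < 1000000; i++) {
            int j = i % 1000;
            // Read-modify-write race: two threads may update w[j] at once,
            // so final values can differ slightly from run to run.
            w[j] += 0.01f;
        }

        std::printf("w[0] = %f\n", w[0]);
        return 0;
    }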
Waiting for reply.
Hello!
I'm about to finish a generalised wrapper for the `predict' and `ffm_load_model' functions in Java. It would be great if you could review my code and then add it to your library if you deem it fit.
Thank You
Hi @guestwalk !
Thanks a lot for the awesome library. It's certainly made my life a lot easier.
Since we get a segfault for files that are too large, is there a way to learn from chunks of data? In other words, can an existing model be updated with new data?
Thanks again,
Sorry for being ignorant: can I put float-type values in the input data?
Hi, could you please help me with how to transform CSV data into the FFM data format? Thanks.
I'm confused about the "ffm_predict" function in ffm.cpp:

    ffm_float ffm_predict(ffm_node *begin, ffm_node *end, ffm_model &model) {
        ffm_float r = 1;
        if(model.normalization) {
            r = 0;
            for(ffm_node *N = begin; N != end; N++)
                r += N->v*N->v;
            r = 1/r;
        }
        ffm_float t = wTx(begin, end, r, model);
        return 1/(1+exp(-t));
    }
After reading the paper "Field-aware Factorization Machines for CTR Prediction", I think the return value should be the variable t, not 1/(1+exp(-t)). Could you clear up my doubt?
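A short worked note, which is just my own reading of the code and the paper:
t is the raw FFM score, and the sigmoid maps it to a click probability,
consistent with the logistic loss log(1 + exp(-y*t)) that appears in the
training loop quoted in another issue above:

    \[
        t = \phi_{\mathrm{FFM}}(\mathbf{w}, \mathbf{x}), \qquad
        p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-t}}, \qquad
        \text{loss} = \log\!\left(1 + e^{-yt}\right), \; y \in \{-1, +1\}.
    \]

So the paper's \phi_{\mathrm{FFM}} corresponds to t, while ffm-predict
outputs probabilities.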
Hi, libffm is a really useful model. I want to use it as a regression model; may I know how to do that? Thanks.
It seems we cannot set the optimization objective. Can we use FFM for a regression problem?
Hi, I am trying to use libffm on Ubuntu 16.04. I have a C++11-capable compiler and OpenMP installed via apt-get; I downloaded libffm and ran make. I am in the libffm directory, and running the following gives:
josh:~/libffm-master$ ffm-train bigdata.tr.txt model
ffm-train: command not found
When I check the directory, you can see it is there:
josh@josh-HP-ZBook-17-G2:~/libffm-master$ dir
bigdata.te.txt ffm.cpp ffm-predict ffm-train.cpp README
bigdata.tr.txt ffm.h ffm-predict.cpp Makefile
COPYRIGHT ffm.o ffm-train Makefile.win
Any help would be great. Thanks.
I read the source code, but I cannot figure out why the size of model.w is model.n * model.m * k_aligned * 2.
Did you think about porting this to CUDA/CUBLAS?
I used this package a few months ago, and I remember I was able to run `head model' and see the model weights. It seems that the model is now encoded somehow (binarized?). Am I correct? Is there a way to see the model as before?
In ffm.cpp, `ffm_node* end = &prob.X[prob.P[i + 1]];' can access the array out of bounds.
When I was training the model, the first few iterations worked fine, but subsequent iterations returned "-nan" for the log losses of the training and validation data sets. Any ideas what went wrong?
Sample of the data used for training:
1 0:400492:1 1:977206:1 2:861366:1 3:223345:1 4:4:0.0 5:5:9567.0 6:6:31835.0 7:7:0.300471105528 8:8:0.0 9:9:0.0 10:35822:1 11:486386:1 12:528723:1 13:662860:1 14:990282:1 15:406964:1 16:698517:1 17:585048:1 18:18:0.38219606197 19:19:0.125217833586 20:20:0.438929013305 21:21:0.216453092359 22:923220:1 23:63477:1 24:216531:1 25:461117:1
0 0:400492:1 1:203267:1 2:861366:1 3:223345:1 4:4:0.0 5:5:1642.0 6:6:9441.0 7:7:0.173830192674 8:8:0.0 9:9:0.0644 10:709579:1 11:486386:1 12:528723:1 13:662860:1 14:778015:1 15:581435:1 16:698517:1 17:181797:1 18:18:0.581693006318 19:19:0.097000178732 20:20:0.367630745198 21:21:0.182764132116 22:923220:1 23:63477:1 24:216531:1 25:461117:1
Hi,
It seems like bigdata.tr.txt has feature 2739 with multiple fields (5 and 13) in lines 21, 36, and 88. Shouldn't fields be unique per feature?
Hello!
I have a very imbalanced dataset that has only 0.5% clicks, so I get very poor results. Can I increase the weight of the clicks to make them more important, or is oversampling them the only way?
Hi, what is the optimization method used in this model?
Are there any plans to incorporate bias and linear terms in this new refactored version? I know they're included in v114 on the website, but if I'm not mistaken they're still not on master (I think?).
Thanks!
Using the Python wrapper (libffm-python): for some reason, when the input dataset becomes too large (too many fields, about 29 or more), the predictions (at least in the first iterations; I haven't checked whether this changes after N iterations) are all NaN.
Edit: even a few samples of data, or a one-row dataframe, show the same issue, so it appears to be related to the fields.
Edit 2: tested; it doesn't converge after N iterations.
Hello,
Thank you for your excellent method, software and description.
I ran into a problem trying to employ libffm in my ML task: I get a segmentation fault when using it with the cross-validation option. Here are my setup and data:
Ubuntu 13.10
~/libffm$ ./ffm-train -k 5 -t 30 -r 0.03 -v 2 data.txt
fold logloss
0 0.1080
Segmentation fault (core dumped)
The data.txt can be downloaded here https://drive.google.com/open?id=0B9HyQ7ZccW4-VFE0VWtxUHF2R3c
The problem arises only when working with big data files like that one. If you cut it to 100K lines (it is around 250K lines), everything works fine.
Regards,
Sergey
Hi, just wanted to share that LIBFFM is now available in Rust. Thanks for the neat project!
I found that the loss decreases very quickly when all features are categorical. But when some numerical features are included in the model, the loss decreases very slowly, even after 150 iterations. Could you tell me why, or give me some advice?