- CPython 3.6.x, 3.7.x
$ pip install cdfm
The dataset format is similar to the SVM-rank format. The difference is that an eid must be specified on every line.
Each line is defined as follows. The | symbol means OR, so <str>|<int> means the value must be either a str or an int.
<line> .=. <label> qid:<qid> eid:<eid> <features>#<comments>
<label> .=. <float>|<str as a class>
<qid> .=. <str>|<int>
<eid> .=. <str>|<int>
<features> .=. <dim>:<value>
<dim> .=. <0 or Natural Number>
<value> .=. <float>
<comments> .=. <Any text will do>
For example:
0.5 qid:1 eid:x 1:0.1 2:-0.2 3:0.3 # comment A
0.0 qid:1 eid:y 1:-0.1 2:0.2 4:0.4
-0.5 qid:1 eid:z 2:-0.2 3:0.3 4:-0.4 # comment C
0.5 qid:2 eid:y 1:0.1 2:-0.2 3:0.3
0.0 qid:2 eid:z 1:-0.1 2:0.2 4:0.4
-0.5 qid:2 eid:w 2:-0.2 3:0.3 4:-0.4 # comment E
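To make the grammar above concrete, here is a minimal sketch of a parser for a single feature line. It is not the parser used by cdfm itself. The names `FeatureLine` and `parse_feature_line` are hypothetical, and the label is kept as text because the format allows either a float or a class name.

```python
from typing import Dict, NamedTuple, Optional


class FeatureLine(NamedTuple):
    """Hypothetical container for one parsed line (not part of cdfm)."""
    label: str  # kept as text; call float() on it for numeric labels
    qid: str
    eid: str
    features: Dict[int, float]
    comment: Optional[str]


def parse_feature_line(line: str) -> FeatureLine:
    """Parse `<label> qid:<qid> eid:<eid> <dim>:<value> ... # <comments>`."""
    # Everything after the first '#' is treated as a free-text comment.
    body, sep, comment = line.partition('#')
    tokens = body.split()
    if len(tokens) < 3:
        raise ValueError(f'expected label, qid and eid: {line!r}')
    label, qid_token, eid_token = tokens[:3]
    if not qid_token.startswith('qid:') or not eid_token.startswith('eid:'):
        raise ValueError(f'expected qid:<qid> then eid:<eid>: {line!r}')
    features = {}
    for token in tokens[3:]:
        dim, _, value = token.partition(':')
        features[int(dim)] = float(value)
    return FeatureLine(
        label=label,
        qid=qid_token[len('qid:'):],
        eid=eid_token[len('eid:'):],
        features=features,
        comment=comment.strip() if sep else None,
    )


parsed = parse_feature_line('0.5 qid:1 eid:x 1:0.1 2:-0.2 3:0.3 # comment A')
```

Dimensions missing from a line (such as dimension 4 in the line above) are simply absent from the resulting dictionary, which keeps the sparse layout of the file.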
Additionally, you can use the distances between entities in a group, given as one entity pair (eid and cid) per line.
<line> .=. qid:<qid> eid:<eid> cid:<cid> <factors> # <comments>
<cid> .=. <str>|<int>
<factors> .=. <dim>:<value>
<dim> .=. <0 or Natural Number>
<value> .=. <float>
<comments> .=. <Any text will do>
For example:
qid:3 eid:x cid:y 1:0.5 2:-0.3 3:1.2 # comment A
qid:3 eid:x cid:z 1:0.0 2:0.2 3:0.8 # comment B
qid:3 eid:y cid:z 1:0.2 2:0.3 3:-0.7 # comment C
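The proximity lines above can be read into a lookup keyed by (qid, eid, cid). The following is a minimal sketch under that assumption, not the loader that cdfm ships. The function name `parse_proximity_line` is hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (qid, eid, cid)


def parse_proximity_line(line: str) -> Tuple[Key, Dict[int, float]]:
    """Parse `qid:<qid> eid:<eid> cid:<cid> <dim>:<value> ... # <comments>`."""
    # Drop the optional trailing comment before tokenizing.
    body = line.partition('#')[0]
    tokens = body.split()
    if len(tokens) < 3:
        raise ValueError(f'expected qid, eid and cid: {line!r}')
    ids = []
    for prefix, token in zip(('qid:', 'eid:', 'cid:'), tokens[:3]):
        if not token.startswith(prefix):
            raise ValueError(f'expected {prefix}<value>, got {token!r}')
        ids.append(token[len(prefix):])
    factors = {}
    for token in tokens[3:]:
        dim, _, value = token.partition(':')
        factors[int(dim)] = float(value)
    return (ids[0], ids[1], ids[2]), factors


lines = [
    'qid:3 eid:x cid:y 1:0.5 2:-0.3 3:1.2 # comment A',
    'qid:3 eid:x cid:z 1:0.0 2:0.2 3:0.8 # comment B',
    'qid:3 eid:y cid:z 1:0.2 2:0.3 3:-0.7 # comment C',
]
proximities = dict(parse_proximity_line(line) for line in lines)
```

The sketch stores each pair exactly as written in the file. It does not assume that the pair (x, y) also stands for (y, x).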
# build_cdfmdata is assumed to be importable from cdfm.utils as well
from cdfm.utils import build_cdfmdata, load_cdfmdata
# loading dataset as a DataFrame object
# 1. features
features_path = '/path/to/features'
n_dimensions = 10
features = load_cdfmdata(features_path, n_dimensions)
# features.columns
# >>> Index(['label', 'qid', 'eid', 'features'], dtype='object')
# 2. proximities
proximities_path = '/path/to/proximities'
n_dimensions = 2
proximities = load_cdfmdata(proximities_path, n_dimensions, mode='proximity')
# proximities.columns
# >>> Index(['qid', 'eid', 'cid', 'proximities'], dtype='object')
# some preprocessing here...
# Finally, build a dataset in one of two ways
train = build_cdfmdata(features)  # features only
train = build_cdfmdata(features, proximities)  # features and proximities
from cdfm.models import CDFMRanker
# define your model
model = CDFMRanker(k=8, n_iter=300, init_eta=1e-2)
# fit the model; epoch losses are printed when verbose is True
model.fit(train, verbose=True)
import pickle
with open('/path/to/file.pkl', mode='wb') as fp:
    pickle.dump(model, fp)
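A saved model can be restored with `pickle.load`. The sketch below wraps both directions in two small helpers. The names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for a fitted model so that the example runs without cdfm installed.

```python
import os
import pickle
import tempfile
from typing import Any


def save_model(model: Any, path: str) -> None:
    """Serialize any picklable object to `path`."""
    with open(path, mode='wb') as fp:
        pickle.dump(model, fp)


def load_model(path: str) -> Any:
    """Restore an object written by save_model."""
    # Only unpickle files you trust: loading a pickle can run arbitrary code.
    with open(path, mode='rb') as fp:
        return pickle.load(fp)


# Round trip with a stand-in object instead of a fitted CDFMRanker.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')
    save_model({'k': 8, 'n_iter': 300}, model_path)
    restored = load_model(model_path)
```

When a real model is unpickled, the class it was created from must be importable in the loading process, so cdfm needs to be installed there too.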
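# loading test dataset
test_path = '/path/to/test'  # placeholder path
# n_dimensions must match the value used when loading the training features
test_df = load_cdfmdata(test_path, n_dimensions)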
test = build_cdfmdata(test_df)
pred = model.predict(test)
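Predictions are usually read per group. The sketch below turns scores into a ranking of entities within each qid. It assumes that the predictions form a flat sequence of scores aligned row by row with the test data and that a higher score means a better rank. Neither assumption is confirmed here, and `rank_within_groups` is a hypothetical helper, not part of cdfm.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rank_within_groups(
    qids: Sequence[Hashable],
    eids: Sequence[Hashable],
    scores: Sequence[float],
) -> Dict[Hashable, List[Hashable]]:
    """Return, for each qid, its eids ordered from highest to lowest score."""
    grouped = defaultdict(list)
    for qid, eid, score in zip(qids, eids, scores):
        grouped[qid].append((score, eid))
    return {
        # Sorting on the score alone is stable, so ties keep their input order.
        qid: [eid for _, eid in sorted(pairs, key=lambda p: p[0], reverse=True)]
        for qid, pairs in grouped.items()
    }


# Made-up scores, used only to illustrate the grouping.
ranking = rank_within_groups(
    qids=[1, 1, 1, 2, 2, 2],
    eids=['x', 'y', 'z', 'y', 'z', 'w'],
    scores=[0.4, -0.1, 0.9, 0.2, 0.7, 0.1],
)
# ranking == {1: ['z', 'x', 'y'], 2: ['z', 'y', 'w']}
```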
Tutorial using NAR Horse Racing dataset.
# pwd
# >>> path/to/cdfm
$ mkdir dumps
$ mkdir dumps/models # fitted models are pickled here.
$ mkdir dumps/predictions # evaluation datasets are dumped here by pandas.
$ python example.py --k 2 --n-iter 100
# 1. install development dependencies
$ pip install -e .[dev]
# 2. linting
$ pylint cdfm # check pylintrc for more details...
# 3. type checking
$ mypy @mypy_check_files --config-file=mypy.ini
# 4. testing
$ pytest