
Comments (21)

GillesVandewiele commented on May 27, 2024

Hey @qingyuanxingsi, I know this repository is very dependency-heavy. On the one hand, it is a collection of many different ensemble and decision tree induction techniques, each requiring its own dependencies.

Did you manage to get it up and running? Feel free to copy-paste your errors so I can help out.

qingyuanxingsi commented on May 27, 2024
  1. Can you give some instructions on how to run the R inTrees algorithm with xgboost? This is what I want to do most right now! Maybe some demo code would be quite useful? It seems that example.py may contain some information, but it contains too many unnecessary details. A shorter and cleaner guide would be much more helpful.
  2. Is the orange package really necessary? It is really hard to install, and still very hard to install on Windows!
  3. See the following screenshot: is this a bug?
    [screenshot 1]
  4. This is also flagged red in my IDE.
    [screenshot 2]

GillesVandewiele commented on May 27, 2024
  1. The orange package is only needed if you want the C4.5 decision tree induction algorithm, so yes, you can skip that installation for your specific use case.

  2. That is indeed a bug; it only occurs when your column is of datatype datetime, which never happened to be the case for my datasets. I think replacing robj with ro should fix it.

  3. Indeed, it should be commented out. It still stems from experiments I conducted with https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_079.pdf

Let me check how point number 1 can easily be done. Hold on.

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele Given the rare usage of C4.5, I strongly suggest you remove C4.5 and its related dependencies (maybe Orange?) from this package. It would be much cleaner; right now it is really hard to use the code.
If it becomes easier to get started, many people may be willing to explore it.

GillesVandewiele commented on May 27, 2024

That is true, I could add an extra option that either includes or excludes this package.

That being said, just because something is rarely used does not mean it is not good. I empirically tested all these induction algorithms, and the C4.5 algorithm outperforms the sklearn CART algorithm on almost every single dataset.

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele Looking forward to your guide on inTrees with xgboost! Many thanks.

GillesVandewiele commented on May 27, 2024

So here's an outline of what you will need to do. I can write an example script for you, but of course only when I have some time to spare; that will probably be this weekend.

  1. Create your XGBoost ensemble (with Python).

  2. Iterate over the models in your XGBoost ensemble and convert them to GENESIM DecisionTrees. For this, please take a look at https://github.com/IBCNServices/GENESIM/blob/master/constructors/genesim.py#L628 where this is being done:

# excerpt from genesim.py: self is the GENESIM object and n_classes
# is the number of classes in the label column
for idx, tree_string in enumerate(xgb_model.clf._Booster.get_dump()):
   tree = self.parse_xgb_tree_string(tree_string, train, feature_cols, label_col,
                                             np.unique(train[label_col].values)[idx % n_classes])
   tree_list.append(tree)

  3. Now you have a list of GENESIM DecisionTree objects, which can be passed to inTrees after converting them to compliant R DataFrames, using the _tree_to_R_object function in the inTrees.py file:

    def _tree_to_R_object(self, tree, feature_mapping):

     which is called as:

    if tree.count_nodes() > 1: treeList.append(self._tree_to_R_object(tree, feature_mapping))

  4. Execute inTrees as follows (the only variable in the snippet below is treeList, which is the list of GENESIM DecisionTrees):

        ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
        ro.r('names(treeList) <- c("ntree", "list")')

        rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
        rules=list(rules)
        conditions=rules[int(0.6*len(rules)):int(0.8*len(rules))]
        predictions=rules[int(0.8*len(rules)):]

        # Create a OrderedRuleList
        rulesets = []
        for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
            # Split each condition in Rules to form a RuleSet
            rulelist = []
            condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
            for rule in condition_split:
                feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

                lte = re.findall(r'<=', rule)
                gt = re.findall(r'>', rule)
                eq = re.findall(r'==', rule)
                cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

                extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
                if len(extract_value):
                    value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
                else:
                    feature = 'True'
                    value = None

                rulelist.append(Condition(feature, cond, value))
            rulesets.append(Rule(idx, rulelist, prediction))

        return OrderedRuleList(rulesets)

Entirely taken from inTrees.py. Make sure to import the OrderedRuleList objects etc.

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele Following your guide, I've written the following code snippet! Would you mind checking it for me, in case I made any mistakes?

Note: I made two methods of inTreesClassifier public to shorten the code.

# -*- coding:utf-8 -*-

import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re

import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri

pandas2ri.activate()
import rpy2.robjects as ro

local_model = r'xxx.model'
train_df_file = r'xxx.pkl'
# python data frame
train_df = pickle.load(open(train_df_file, 'rb'))

bst = xgb.Booster({'nthread': 4})
bst.load_model(local_model)

genesim = GENESIM()
feat_names = ["aaa", "bbb", "ccc"]

# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat

inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
for idx, tree_string in enumerate(bst.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf.tree_to_R_object(tree, feature_mapping))

ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')

rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]

# Create a OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None

        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))

# print rules
for rule in rulesets:
    print(rule)

GillesVandewiele commented on May 27, 2024

Looks good at first sight @qingyuanxingsi. Strong work! I'll check it out right now. If you want, you can always make a pull request (call the file xgb_intrees.py or something similar).

GillesVandewiele commented on May 27, 2024

I made some small adaptations to your code. I got it up and running now :)

# -*- coding:utf-8 -*-

import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition, OrderedRuleList
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re

import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

from sklearn.datasets import make_classification

pandas2ri.activate()
import rpy2.robjects as ro


# Create a dataframe with feature and target columns 
X, y = make_classification(n_samples=500, n_features=3, n_redundant=0)
train_df = pd.DataFrame(X)
feat_names = ["aaa", "bbb", "ccc"]
train_df.columns = feat_names
train_df['label'] = pd.Series(y)

# Fit an XGBClassifier
bst = xgb.XGBClassifier()
bst.fit(train_df[feat_names], train_df['label'])

# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat

# Now the real work. Iterate over the dumps (string format) of the 
# different models/trees in our XGBoost model. Convert them to
# a `decisiontree`
inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
genesim = GENESIM()
for idx, tree_string in enumerate(bst._Booster.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf._tree_to_R_object(tree, feature_mapping))

# Do some python magic: call the R module inTrees with our newly composed
# treelist, consisting of GENESIM `decisiontree`s
importr('inTrees')
ro.globalenv["X"] = pandas2ri.py2ri(train_df[feat_names])
ro.globalenv["target"] = ro.FactorVector(train_df['label'])
ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')

rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)

print('Standard output from the inTrees algorithm:')
print(rules)

# Now parse the std output into python object so that they can be used
# for classification etc.
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]

print(conditions)

# Create a OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None

        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))
orl = OrderedRuleList(rulesets)

# print rules
print('Parsed rules:')
orl.print_rules()

Btw, I don't know if you knew this already, but I had to hack my way around a bit to get a probability for each class in the leaves of the XGBoost decision trees (gradient boosting models work a lot differently than the other classical ensemble techniques). Make sure to check out dmlc/xgboost#1746 for some more information on that :)
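
For the binary case, the intuition is roughly the following (a minimal sketch, not the actual GENESIM code; it assumes the default binary:logistic objective, where every leaf stores an additive log-odds contribution):

import numpy as np

def leaf_margins_to_probability(leaf_values):
    """Sum the leaf outputs of all boosting rounds (they are additive
    log-odds) and map the total margin to a class-1 probability."""
    margin = np.sum(leaf_values)
    return 1.0 / (1.0 + np.exp(-margin))  # sigmoid link

# e.g. three rounds whose leaves contributed 0.3, -0.1 and 0.25:
print(leaf_margins_to_probability([0.3, -0.1, 0.25]))  # ~0.61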

Finally, just out of interest: in what kind of application and how are you going to use GENESIM?

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele
Thanks for your help. However, rpy2 doesn't support Windows at the moment (or not well). So what I'm trying to do now is to export the tree data frames to files, load them in R later, and generate the rules there. I'm not familiar with R; can you help me modify the code to make it work?

library(inTrees)
library(xgboost)
library(randomForest)

# binary model file
bst <- xgb.load("E:\\data\\jump\\xxx.model")

tree_dir <- "E:\\data\\jump\\gen_rule_v1"
train_data_file <- "E:\\data\\jump\\rp_jump_train_pd.csv"

filenames <- list.files(tree_dir)

treeNum <- length(filenames)

train_ds <- read.csv(train_data_file)

treeList <- NULL
treeList$ntree <- treeNum
treeList$list <- vector("list", treeNum)
for (j in 1:treeNum) {
  cur_filename = paste(tree_dir, "\\", filenames[j], sep = "")
  cur_df <- read.csv(cur_filename, check.names = FALSE)  # keep names like "left daughter" intact for inTrees
  row.names(cur_df) <- cur_df$id
  cur_df$id <- NULL
  treeList$list[[j]] <- cur_df
}

X <- train_ds[, 1:(ncol(train_ds) - 1)]
target <- train_ds[, "label"]

exec <- extractRules(treeList, X)
exec[1:2,]

Here is the content of one of the exported trees:

id,left daughter,right daughter,split var,split point,status,prediction
1,2,3,combo_avg_,3.13423,1,0
2,4,5,time_min_,1.813,1,0
4,8,9,hit_cnt_,5.5,1,0
8,16,17,time_avg_,2.5265,1,0
16,0,0,,0.0,-1,1
17,0,0,,0.0,-1,0
9,18,19,time_wait_,129.0,1,0
18,0,0,,0.0,-1,1
19,0,0,,0.0,-1,0
5,10,11,score_,322.0,1,0
10,20,21,time_avg_,4.1905,1,0
20,0,0,,0.0,-1,1
21,0,0,,0.0,-1,0
11,22,23,time_avg_,2.788,1,0
22,0,0,,0.0,-1,1
23,0,0,,0.0,-1,0
3,6,7,time_avg_,2.2305,1,0
6,12,13,time_min_,0.616,1,0
12,0,0,,0.0,-1,0
13,24,25,combo_avg_,4.53862,1,0
24,0,0,,0.0,-1,1
25,0,0,,0.0,-1,1
7,14,15,fast_action_,3.5,1,0
14,26,27,score_,3329.5,1,0
26,0,0,,0.0,-1,0
27,0,0,,0.0,-1,1
15,28,29,per_step_val_,12.2173,1,0
28,0,0,,0.0,-1,1
29,0,0,,0.0,-1,0

Many thanks!

Usage: I'm exploring generating rules from an xgboost model to turn it into a rule-based classifier; if it is understandable by humans, it will be very helpful.

GillesVandewiele commented on May 27, 2024

I wish I could help, but my knowledge of R is very, very limited... You just need to create dataframes that are the same as the output of my _tree_to_R_object function; a sketch of exporting such a dataframe is shown below.

Other options are using a good OS for development ;) or just using a docker image (this repo already has a Dockerfile).

I would be interested to hear about results you are achieving with this approach, especially how they compare to rule learners that operate directly on the data (RIPPER, CN2, ...)
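
For reference, a dataframe in that format can be written out from Python with plain pandas (a minimal sketch; the rows here are hypothetical, and in practice you would produce them by walking each GENESIM DecisionTree; the column names follow the randomForest::getTree convention that inTrees expects):

import pandas as pd

# Hypothetical rows for one tree; status -1 marks a leaf, and leaves
# carry the prediction instead of a split.
rows = [
    (1, 2, 3, 'combo_avg_', 3.13423, 1, 0),
    (2, 0, 0, '', 0.0, -1, 1),
    (3, 0, 0, '', 0.0, -1, 0),
]
df = pd.DataFrame(rows, columns=['id', 'left daughter', 'right daughter',
                                 'split var', 'split point', 'status', 'prediction'])
df.to_csv('tree_0.csv', index=False)  # same layout as the exported trees above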

GillesVandewiele commented on May 27, 2024

Also, maybe you can get the rpy2 library working on Windows anyway, by using another method than just pip install:

https://stackoverflow.com/questions/14882477/rpy2-install-on-windows-7

qingyuanxingsi commented on May 27, 2024

Rule learners that operate directly on the data (RIPPER, CN2, ...)
Can you give me some papers (links) on these methods? Learning rules directly from data can be an alternative direction, as sometimes you cannot use (trust) ML algorithms for prediction!

GillesVandewiele commented on May 27, 2024

Sure: https://link.springer.com/content/pdf/10.1007/s10994-005-5011-x.pdf

The first author, Fürnkranz, has a lot of work on rule learning. One paragraph in that paper (the first paragraph of Section 3) lists all prominent algorithms, with corresponding references.

Btw, this is where the Orange package comes into play again. It has implementations of e.g. CN2.

Another note is that decision trees can easily be converted to rule lists as well, by just listing all paths from the root to the leaf nodes, so every decision tree induction technique, and techniques such as GENESIM or ISM, could be handy as well :). Moreover, I think the representation format of decision trees is much more interpretable than that of rule lists (Fig. 1 of https://biblio.ugent.be/publication/8537061/file/8537064.pdf). A sketch of the path-enumeration idea is shown below.
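
To illustrate that conversion (a minimal sketch that enumerates the root-to-leaf paths of an sklearn tree; a GENESIM tree would be walked analogously):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=3, n_redundant=0)
t = DecisionTreeClassifier(max_depth=3).fit(X, y).tree_

def paths_to_rules(node=0, conds=()):
    """Print every root-to-leaf path as an IF ... THEN ... rule."""
    if t.children_left[node] == -1:  # leaf node: emit the accumulated conditions
        print('IF ' + (' AND '.join(conds) or 'True') +
              ' THEN class=%d' % t.value[node].argmax())
        return
    feat, thr = t.feature[node], t.threshold[node]
    paths_to_rules(t.children_left[node], conds + ('x%d <= %.3f' % (feat, thr),))
    paths_to_rules(t.children_right[node], conds + ('x%d > %.3f' % (feat, thr),))

paths_to_rules()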

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele
Finally made it work on Windows, many thanks.

Moreover, can you parse the metrics of the learnt rules into the output, so I can analyse the generated rules? Just like the R output below!

[screenshot of R output]

GillesVandewiele commented on May 27, 2024

Yes, you can. OrderedRuleList has a prediction function, which allows you to calculate things such as accuracy (the complement of the error). Moreover, you can also calculate the coverage of each rule by counting how many times it gets triggered on your dataset; see the sketch below.
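
A minimal sketch of that coverage count, assuming the rules have been parsed into (feature, operator, value) triples as in the Condition objects built earlier:

import operator
import pandas as pd

OPS = {'<=': operator.le, '>': operator.gt, '==': operator.eq}

def coverage(df, conditions):
    """Fraction of rows of df on which every condition of a rule holds."""
    mask = pd.Series(True, index=df.index)
    for feature, op, value in conditions:
        mask &= OPS[op](df[feature], value)
    return mask.mean()

# e.g. for a rule "aaa <= 0.5 & bbb > 1.2" on the training frame:
# print(coverage(train_df, [('aaa', '<=', 0.5), ('bbb', '>', 1.2)]))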

qingyuanxingsi commented on May 27, 2024

@GillesVandewiele
Well, I mean, can't you just parse the freq and err from the inTrees std output? Here:

print('Standard output from the inTrees algorithm:')
print(rules)

Or, can you tell me the format of the output of the inTrees package, so I can parse it myself?

GillesVandewiele commented on May 27, 2024

@qingyuanxingsi good point! Of course you can :)

the lengths are in rules[:int(len(rules)*0.2)] (the first 20% of entries),
the frequencies are in the next 20%: rules[int(len(rules)*0.2):int(len(rules)*0.4)],
and finally the errors are in the next 20%: rules[int(len(rules)*0.4):int(len(rules)*0.6)].

GillesVandewiele commented on May 27, 2024
lengths = rules[:int(0.2 * len(rules))]
frequencies = rules[int(0.2 * len(rules)):int(0.4 * len(rules))]
errors = rules[int(0.4 * len(rules)):int(0.6 * len(rules))]
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]
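
Putting those slices together, each rule can then be printed next to its metrics (a small sketch continuing from the snippet above):

for length, freq, err, cond, pred in zip(lengths, frequencies, errors,
                                         conditions, predictions):
    print('len=%s freq=%s err=%s: IF %s THEN %s' % (length, freq, err, cond, pred))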

GillesVandewiele commented on May 27, 2024

@qingyuanxingsi did you manage to get everything up and running? Did you obtain any nice results with it? Otherwise I'm going to close the issue :)
