eps's Introduction

Eps

Machine learning for Ruby

  • Build predictive models quickly and easily
  • Serve models built in Ruby, Python, R, and more

Check out this post for more info on machine learning with Rails

Installation

Add this line to your application’s Gemfile:

gem "eps"

On Mac, also install OpenMP:

brew install libomp

Getting Started

Create a model

data = [
  {bedrooms: 1, bathrooms: 1, price: 100000},
  {bedrooms: 2, bathrooms: 1, price: 125000},
  {bedrooms: 2, bathrooms: 2, price: 135000},
  {bedrooms: 3, bathrooms: 2, price: 162000}
]
model = Eps::Model.new(data, target: :price)
puts model.summary

Make a prediction

model.predict(bedrooms: 2, bathrooms: 1)

Store the model

File.write("model.pmml", model.to_pmml)

Load the model

pmml = File.read("model.pmml")
model = Eps::Model.load_pmml(pmml)

A few notes:

  • The target can be numeric (regression) or categorical (classification)
  • Pass an array of hashes to predict to make multiple predictions at once
  • Models are stored in PMML, a standard for model storage

Building Models

Goal

Often, the goal of building a model is to make good predictions on future data. To help achieve this, Eps splits the data into training and validation sets if you have 30+ data points. It uses the training set to build the model and the validation set to evaluate the performance.

If your data has a time associated with it, it’s highly recommended to use that field for the split.

Eps::Model.new(data, target: :price, split: :listed_at)

Otherwise, the split is random. There are a number of other options as well.
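As a rough illustration of what a time-based split does, the sketch below sorts rows by a time field and holds out the most recent quarter for validation. It is not Eps's internal implementation: the `time_split` helper is hypothetical, and the 0.25 default simply mirrors the validation size documented under Validation Options.

```ruby
require "date"

# Hypothetical helper (not part of Eps): hold out the most recent rows
# so the model is evaluated on data newer than anything it was trained on.
def time_split(rows, column:, validation_size: 0.25)
  sorted = rows.sort_by { |row| row[column] }
  validation_count = (sorted.size * validation_size).round
  split_at = sorted.size - validation_count
  [sorted.first(split_at), sorted.last(validation_count)]
end

rows = [
  {price: 100000, listed_at: Date.new(2019, 3, 1)},
  {price: 125000, listed_at: Date.new(2019, 1, 1)},
  {price: 135000, listed_at: Date.new(2019, 4, 1)},
  {price: 162000, listed_at: Date.new(2019, 2, 1)}
]

train, validation = time_split(rows, column: :listed_at)
```

Here the three oldest listings end up in the training set and the April listing is held out, which mimics predicting on future data.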

Performance is reported in the summary.

  • For regression, it reports validation RMSE (root mean squared error) - lower is better
  • For classification, it reports validation accuracy - higher is better
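For reference, both metrics can be computed by hand. The sketch below uses the standard definitions; the `rmse` and `accuracy` helpers are hypothetical names for illustration and are not Eps methods.

```ruby
# Root mean squared error: square root of the average squared difference.
def rmse(actual, predicted)
  squared_errors = actual.zip(predicted).map { |a, b| (a - b)**2 }
  Math.sqrt(squared_errors.sum / squared_errors.size.to_f)
end

# Accuracy: fraction of predictions that exactly match the actual label.
def accuracy(actual, predicted)
  correct = actual.zip(predicted).count { |a, b| a == b }
  correct / actual.size.to_f
end

rmse([100, 200], [110, 190])                          # both errors are 10
accuracy(["a", "b", "b", "a"], ["a", "b", "a", "a"])  # 3 of 4 match
```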

Typically, the best way to improve performance is feature engineering.

Feature Engineering

Features are extremely important for model performance. Features can be:

  1. numeric
  2. categorical
  3. text

Numeric

For numeric features, use any numeric type.

{bedrooms: 4, bathrooms: 2.5}

Categorical

For categorical features, use strings or booleans.

{state: "CA", basement: true}

Convert any ids to strings so they’re treated as categorical features.

{city_id: city_id.to_s}

For dates, create features like day of week and month.

{weekday: sold_on.strftime("%a"), month: sold_on.strftime("%b")}

For times, create features like day of week and hour of day.

{weekday: listed_at.strftime("%a"), hour: listed_at.hour.to_s}

Text

For text features, use strings with multiple words.

{description: "a beautiful house on top of a hill"}

This creates features based on word count.

You can specify text features explicitly with:

Eps::Model.new(data, target: :price, text_features: [:description])

You can set advanced options with:

text_features: {
  description: {
    min_occurences: 5,          # min times a word must appear to be included in the model
    max_features: 1000,         # max number of words to include in the model
    min_length: 1,              # min length of words to be included
    case_sensitive: true,       # how to treat words with different case
    tokenizer: /\s+/,           # how to tokenize the text, defaults to whitespace
    stop_words: ["and", "the"]  # words to exclude from the model
  }
}
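To make the per-string options concrete, here is a minimal sketch of word counting with a tokenizer, minimum length, case handling, and stop words. The `word_counts` helper is hypothetical and is not how Eps implements text features. The `min_occurences` and `max_features` options are left out because they apply across the whole dataset rather than to a single string.

```ruby
# Hypothetical helper (not Eps's implementation): count the words in one
# string using options that mirror the ones described above.
def word_counts(text, tokenizer: /\s+/, min_length: 1, case_sensitive: false, stop_words: [])
  text = text.downcase unless case_sensitive
  words = text.split(tokenizer).select { |word| word.length >= min_length }
  words -= stop_words
  words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

word_counts("a beautiful house on top of a hill", stop_words: ["on", "of"])
```

Each distinct word becomes a candidate feature whose value is its count in the string.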

Full Example

We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.

In Rails, we recommend creating an app/ml_models directory. Be sure to restart Spring after creating the directory so files are autoloaded.

bin/spring stop

Here’s what a complete model in app/ml_models/price_model.rb may look like:

class PriceModel < Eps::Base
  def build
    houses = House.all

    # train
    data = houses.map { |v| features(v) }
    model = Eps::Model.new(data, target: :price, split: :listed_at)
    puts model.summary

    # save to file
    File.write(model_file, model.to_pmml)

    # ensure reloads from file
    @model = nil
  end

  def predict(house)
    model.predict(features(house))
  end

  private

  def features(house)
    {
      bedrooms: house.bedrooms,
      city_id: house.city_id.to_s,
      month: house.listed_at.strftime("%b"),
      listed_at: house.listed_at,
      price: house.price
    }
  end

  def model
    @model ||= Eps::Model.load_pmml(File.read(model_file))
  end

  def model_file
    File.join(__dir__, "price_model.pmml")
  end
end

Build the model with:

PriceModel.build

This saves the model to price_model.pmml. Check this into source control or use a tool like Trove to store it.

Predict with:

PriceModel.predict(house)

Monitoring

We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:

actual = houses.map(&:price)
predicted = houses.map(&:predicted_price)
Eps.metrics(actual, predicted)

For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
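A minimal sketch of such an alert check is shown below. The helpers and thresholds are hypothetical and would need tuning for your data. The sign convention for ME (predicted minus actual) is an assumption and may differ from what Eps.metrics reports.

```ruby
# Mean error (bias). Assumed convention: positive when predictions run high.
def mean_error(actual, predicted)
  actual.zip(predicted).sum { |a, b| b - a } / actual.size.to_f
end

# Mean absolute error: average size of the errors, ignoring direction.
def mean_absolute_error(actual, predicted)
  actual.zip(predicted).sum { |a, b| (a - b).abs } / actual.size.to_f
end

# Hypothetical thresholds: returns a list of alert messages (empty if all is well).
def alerts(actual, predicted, max_mae:, max_bias:)
  messages = []
  mae = mean_absolute_error(actual, predicted)
  me = mean_error(actual, predicted)
  messages << "MAE #{mae} is above #{max_mae}" if mae > max_mae
  messages << "ME #{me} is more than #{max_bias} away from 0" if me.abs > max_bias
  messages
end

actual = [100_000, 125_000, 135_000]
predicted = [110_000, 130_000, 150_000]
alerts(actual, predicted, max_mae: 20_000, max_bias: 5_000)
```

In this example the MAE is within bounds, but every prediction is too high, so only the bias alert fires.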

Other Languages

Eps makes it easy to serve models from other languages. You can build models in Python, R, and others and serve them in Ruby without having to worry about how to deploy or run another language.

Eps can serve LightGBM, linear regression, and naive Bayes models. Check out ONNX Runtime and Scoruby to serve other models.

Python

To create a model in Python, install the sklearn2pmml package

pip install sklearn2pmml

And check out the examples:

R

To create a model in R, install the pmml package

install.packages("pmml")

And check out the examples:

Verifying

It’s important for features to be implemented consistently when serving models created in other languages. We highly recommend verifying this programmatically. Create a CSV file with ids and predictions from the original model.

house_id,prediction
1,145000
2,123000
3,250000

Once the model is implemented in Ruby, confirm the predictions match.

require "csv"

model = Eps::Model.load_pmml(File.read("model.pmml"))

# preload houses to prevent n+1
houses = House.all.index_by(&:id)

CSV.foreach("predictions.csv", headers: true, converters: :numeric) do |row|
  house = houses[row["house_id"]]
  expected = row["prediction"]

  actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)

  success = actual.is_a?(String) ? actual == expected : (actual - expected).abs < 0.001
  raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})" unless success

  putc "✓"
end

Data

A number of data formats are supported. You can pass the target variable separately.

x = [{x: 1}, {x: 2}, {x: 3}]
y = [1, 2, 3]
Eps::Model.new(x, y)

Data can be an array of arrays

x = [[1, 2], [2, 0], [3, 1]]
y = [1, 2, 3]
Eps::Model.new(x, y)

Or Numo arrays

x = Numo::NArray.cast([[1, 2], [2, 0], [3, 1]])
y = Numo::NArray.cast([1, 2, 3])
Eps::Model.new(x, y)

Or a Rover data frame

df = Rover.read_csv("houses.csv")
Eps::Model.new(df, target: "price")

Or a Daru data frame

df = Daru::DataFrame.from_csv("houses.csv")
Eps::Model.new(df, target: "price")

When reading CSV files directly, be sure to convert numeric fields. The table method does this automatically.

CSV.table("data.csv").map { |row| row.to_h }
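The difference is easy to see with Ruby's standard csv library. In this sketch, `rows_from_csv` is a hypothetical helper that parses an in-memory string and applies the same numeric and symbol converters that CSV.table uses.

```ruby
require "csv"

# Hypothetical helper: parse CSV text into an array of hashes,
# optionally converting numeric fields and symbolizing headers.
def rows_from_csv(text, convert: true)
  options = {headers: true}
  options.merge!(converters: :numeric, header_converters: :symbol) if convert
  CSV.parse(text, **options).map(&:to_h)
end

csv_text = "bedrooms,bathrooms,price\n2,1.5,125000\n"

rows_from_csv(csv_text, convert: false) # every value is a String
rows_from_csv(csv_text)                 # values are Integer or Float
```

Without conversion, a column such as price would arrive as strings and be treated as categorical rather than numeric.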

Algorithms

Pass an algorithm with:

Eps::Model.new(data, algorithm: :linear_regression)

Eps supports:

  • LightGBM (default)
  • Linear Regression
  • Naive Bayes

LightGBM

Pass the learning rate with:

Eps::Model.new(data, learning_rate: 0.01)

Linear Regression

By default, an intercept is included. Disable this with:

Eps::Model.new(data, intercept: false)

To speed up training on large datasets with linear regression, install GSL. With Homebrew, you can use:

brew install gsl

Then, add this line to your application’s Gemfile:

gem "gslr", group: :development

It only needs to be available in environments used to build the model.

Probability

To get the probability of each category for predictions with classification, use:

model.predict_probability(data)

Naive Bayes is known to produce poor probability estimates, so stick with LightGBM if you need this.

Validation Options

Pass your own validation set with:

Eps::Model.new(data, validation_set: validation_set)

Split on a specific value

Eps::Model.new(data, split: {column: :listed_at, value: Date.parse("2019-01-01")})

Specify the validation set size (the default is 0.25, which is 25%)

Eps::Model.new(data, split: {validation_size: 0.2})

Disable the validation set completely with:

Eps::Model.new(data, split: false)

Database Storage

The database is another place you can store models. It’s good if you retrain models automatically.

We recommend adding monitoring and guardrails as well if you retrain automatically.

Create an Active Record model to store the predictive model.

rails generate model Model key:string:uniq data:text

Store the model with:

store = Model.where(key: "price").first_or_initialize
store.update(data: model.to_pmml)

Load the model with:

data = Model.find_by!(key: "price").data
model = Eps::Model.load_pmml(data)

Jupyter & IRuby

You can use IRuby to run Eps in Jupyter notebooks. Here’s how to get IRuby working with Rails.

Weights

Specify a weight for each data point

Eps::Model.new(data, weight: :weight)

You can also pass an array

Eps::Model.new(data, weight: [1, 2, 3])

Weights are supported for metrics as well

Eps.metrics(actual, predicted, weight: weight)

Reweighing is one method to mitigate bias in training data

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/ankane/eps.git
cd eps
bundle install
bundle exec rake test

eps's People

Contributors

ankane, schmijos

eps's Issues

erroneous results when using categorical variables with linear regression algorithm

We have a categorical variable for day_of_week as one of 4 independent variables in our model. The LightGBM algorithm works correctly but when I force the model to use the linear regression algorithm, the resultant prediction is incorrect. If I subsequently remove the categorical variable, the linear regression algorithm gives an accurate prediction. Here's an example of what our data set looks like:

{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.714285714285714, :block_minutes=>420.0, :week_day=>"Fri"},
{:day_of_service_util=>0.69047619047619, :day_in_advance_util=>0.214285714285714, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"}

day_of_service_util is the Target dependent variable.

Thanks for this great gem!

`_summary': undefined method `feature_importance' for nil:NilClass (NoMethodError)

Given you've installed the prerequisites:

gem install 'eps'
brew install libomp

And you have a file such as:

# test.rb
#!/home/test/.rvm/rubies/ruby-2.6.3/bin/ruby
require 'eps'

data = [
  {bedrooms: 1, bathrooms: 1, price: 100000},
  {bedrooms: 2, bathrooms: 1, price: 125000},
  {bedrooms: 2, bathrooms: 2, price: 135000},
  {bedrooms: 3, bathrooms: 2, price: 162000}
]

model = Eps::Model.new(data, target: :price)

File.write("model.pmml", model.to_pmml)

pmml = File.read("model.pmml")

model = Eps::Model.load_pmml(pmml)

puts model.summary

You will receive an error:

Traceback (most recent call last):
        4: from ./test.rb:21:in `<main>'
        3: from /workspace/.rvm/gems/eps-0.3.2/lib/eps/model.rb:60:in `method_missing'
        2: from /workspace/.rvm/gems/eps-0.3.2/lib/eps/model.rb:60:in `public_send'
        1: from /workspace/.rvm/gems/eps-0.3.2/lib/eps/base_estimator.rb:69:in `summary'
/workspace/.rvm/gems/eps-0.3.2/lib/eps/lightgbm.rb:7:in `_summary': undefined method `feature_importance' for nil:NilClass (NoMethodError)

Which is this line:

importance = @booster.feature_importance

I assume the instance variable just isn't set at that point, so I may simply misunderstand how the file is meant to be used. Still, I would expect to be able to load the model and get the summary without needing to retrain.

Is that not the case?

Update production models without reprocessing all data

Hi there

I am using EPS version 0.1.1 on production to predict some values to detect fraudulent data.
The algorithm itself works fine, but I am struggling to update the models with new values.

I have the models saved on database, exported to JSON (I know it's not supported on newer versions, I am planning to migrate soon) and they are loaded (Regressor.load_json) when I need to make a prediction.

But now I need to update these models daily, and my training data is too big: it takes a lot of time to normalize the data, remove outliers, etc.
Is there a way to load a stored model, add new data, retrain, and update the model with just the new data (without loading all previously trained data)?

I thought the Daru Dataframes could be used to achieve this, but it seems that is not the case (or I can't find how).

If there isn't a way to do that with EPS (no matter what version), does anyone know if it can be done with any Python framework?

Thanks!

How to retrain / continue training a model

It seems like you can create a new model with a data set and save it and load it up, but how do I keep retraining the model?

I'm expecting the flow to be:

  1. Model.new(starter_data)
  2. model.save (to the database or file)

.... later

  1. Model.load(from db or file)
  2. model.train(new_data)

I'm not sure how to do that though.  Train seems to be a private method on model, so I'm not sure how to get more data to an existing model without needing to just load it all up from the new method all over again, but with a large dataset that's difficult.

RuntimeError: Number of samples must be at least two more than number of features

I have a dataset containing 100 rows. When I run

model = Eps::Regressor.new(data, target: :twelve)

I get this error

RuntimeError: Number of samples must be at least two more than number of features

Dataset:

[{"nine"=>"322", "ten"=>"303", "eleven"=>"309", "twelve"=>"317"},
{"nine"=>"476", "ten"=>"417", "eleven"=>"428", "twelve"=>"332"},
{"nine"=>"345", "ten"=>"387", "eleven"=>"461", "twelve"=>"348"},
{"nine"=>"486", "ten"=>"487", "eleven"=>"368", "twelve"=>"445"},
{"nine"=>"360", "ten"=>"311", "eleven"=>"394", "twelve"=>"364"},
{"nine"=>"473", "ten"=>"470", "eleven"=>"307", "twelve"=>"353"},
{"nine"=>"432", "ten"=>"323", "eleven"=>"439", "twelve"=>"360"},
{"nine"=>"403", "ten"=>"318", "eleven"=>"492", "twelve"=>"348"},
{"nine"=>"386", "ten"=>"464", "eleven"=>"404", "twelve"=>"422"},
{"nine"=>"432", "ten"=>"318", "eleven"=>"331", "twelve"=>"382"},
{"nine"=>"374", "ten"=>"499", "eleven"=>"472", "twelve"=>"359"},
{"nine"=>"462", "ten"=>"473", "eleven"=>"364", "twelve"=>"320"},
{"nine"=>"475", "ten"=>"495", "eleven"=>"502", "twelve"=>"395"},
{"nine"=>"381", "ten"=>"332", "eleven"=>"326", "twelve"=>"366"},
{"nine"=>"498", "ten"=>"504", "eleven"=>"446", "twelve"=>"355"},
...
...
...
]
Am I missing something?

Memory requirements?

I'm new to machine learning and just found this gem.

After reading the README, it seems memory might be a considerable constraint here. We all know ActiveRecord can get a bit bloated, and if you are training on the entirety of a large table, you might run out of memory quickly. You could scale up the containers used for training, but would loading the PMML file into memory on boot require all containers to be heavy on memory?

Any thoughts or suggestions? Sorry if this is obvious.

Question: Does the linear regression class consider weights?

Hey there!

This is an awesome library -- thank you!

Quick question... In python, I can call .fit on a linear regression and pass it a sample_weight (as seen here https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit).

I've been reading your code here https://github.com/ankane/eps/blob/master/lib/eps/linear_regression.rb and I don't see if there's a way to incorporate weights. Is this possible?

Question: number of samples

I'm getting "Number of samples must be at least two more than number of features"
Can you please describe what 'samples' are?

raise "Number of samples must be at least two more than number of features"

[
      { bedrooms: 1, bathrooms: 1, price: 100000, city_id: 1, listed_at: Date.new(2018,01,10) },
      { bedrooms: 2, bathrooms: 1, price: 125000, city_id: 2, listed_at: Date.new(2018,02,10) },
      { bedrooms: 2, bathrooms: 2, price: 135000, city_id: 1, listed_at: Date.new(2018,03,10) },
      { bedrooms: 3, bathrooms: 2, price: 162000, city_id: 1, listed_at: Date.new(2018,04,10) },
      { bedrooms: 3, bathrooms: 2, price: 142000, city_id: 2, listed_at: Date.new(2018,02,10) },
      { bedrooms: 2, bathrooms: 2, price: 128000, city_id: 3, listed_at: Date.new(2018,01,10) }
    ]
def features(house)
  {
    bedrooms: house[:bedrooms],
    city_id: house[:city_id],
    month: house[:listed_at].strftime("%b")
  }
end

Could not find OpenMP

In console when I try to build my model I get this error:

LoadError: Could not find OpenMP
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/lightgbm-0.1.7/lib/lightgbm/ffi.rb:10:in `rescue in <module:FFI>'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/lightgbm-0.1.7/lib/lightgbm/ffi.rb:5:in `<module:FFI>'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/lightgbm-0.1.7/lib/lightgbm/ffi.rb:2:in `<module:LightGBM>'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/lightgbm-0.1.7/lib/lightgbm/ffi.rb:1:in `<top (required)>'

Then if I try to build it again I get a different error:

NoMethodError: undefined method `LGBM_DatasetCreateFromMat' for LightGBM::FFI:Module
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/lightgbm-0.1.7/lib/lightgbm/dataset.rb:44:in `initialize'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/lightgbm.rb:72:in `new'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/lightgbm.rb:72:in `_train'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/base_estimator.rb:167:in `train'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/base_estimator.rb:7:in `initialize'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/model.rb:47:in `new'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/model.rb:47:in `train'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/model.rb:7:in `initialize'
	from /Users/user/Projects/playground/app/ml_models/visit_duration.rb:24:in `new'
	from /Users/user/Projects/playground/app/ml_models/visit_duration.rb:24:in `build'
	from /Users/user/.rvm/gems/ruby-2.4.6/gems/eps-0.3.2/lib/eps/base.rb:5:in `build'

FWIW the version and path brew installed libomp: /usr/local/homebrew/Cellar/libomp/9.0.0

Question about computing the intercept

Hi there!

I have the following code:

model = Eps::LinearRegression.new(x, y)

It outputs something like

image

I have a question about that text highlighted in pink. Is there a way to prevent the intercept from being calculated?

For example, the scikit-learn library has a fit_intercept parameter that can be set to true or false, allowing me to say when I want an intercept to be used in calculations.

Is this possible with this library?

runtime error: Unknown Label with lightgbm algorithm

I'm receiving a runtime error: Unknown Label when I use the lightgbm algorithm in certain circumstances, but not when I use the linear regression algorithm - on the exact same data set. Here's the full error:
RuntimeError: Unknown label: Tue
from /Users/michaelburke/.rvm/gems/ruby-2.6.5@copient_health_rails6/bundler/gems/eps-509da754d6e9/lib/eps/label_encoder.rb:28:in `block in transform'
The name of the label varies with different models' error messages. And SOME of the lightgbm models actually build without error, but others fail every time, depending on what filter of the dataset I use to build the model.

Using a categorical target raises an exception

According to the README it's possible to use a categorical target. I tried this:

data = [
  {bedrooms: 1, bathrooms: 1, price: 100000, tag: 'Low'},
  {bedrooms: 2, bathrooms: 1, price: 125000, tag: 'Med'},
  {bedrooms: 2, bathrooms: 2, price: 135000, tag: 'Med'},
  {bedrooms: 3, bathrooms: 2, price: 162000, tag: 'High'}
]
model = Eps::Model.new(data, target: :tag)
puts model.summary

That raises comparison of Array with Array failed (ArgumentError) from NaiveBayes#_predict. The calculated probs array is [[NaN, "Low"], [NaN, "Med"], [NaN, "High"]] so presumably choking on those NaN values.

Not sure what the correct fix is—should NaiveBayes#calculate_class_probabilities be changed so that it doesn't return incomparable values? Or am I using it wrong?

Any data on performance?

Do you have any benchmarks to share? I’m curious how slow it is (I bet it is because Ruby), but I’m mostly curious about the usage of GSL and how much it improves performance.

Can't run the detailed example in plain Ruby project

As always, very good stuff Andrew 👏

When running the detailed example in a plain Ruby file, it breaks at the save section because of dependency loading/resolution.
uninitialized constant Eps::LinearRegression::Nokogiri

I noticed Nokogiri is declared only for dev. environment and don't know if you plan to keep as-is or maybe declare as top-level ::Nokogiri

ArgumentError: comparison of String with 1.0000000180025095e-35 failed

I'm seeing a weird issue trying out your example code with some of my data.

class MarkModel < Eps::Base
  def build
    data = [{:title=>"Full Stack Developer Needed for property rental management website", :category=>"Full Stack Development", :mark=>true}, {:title=>"UI/UX design - Upwork", :category=>"UX/UI Design", :mark=>false}]

    model = Eps::Model.new(data, target: :mark)
    puts model.summary
  end

  private

  def features(p)
    {
      title: p.title.to_s,
      category: p.category.to_s,
      mark: p.mark
    }
  end

end

With this data I get:

ArgumentError: comparison of String with 1.0000000180025095e-35 failed
from .../eps-0.3.7/lib/eps/evaluators/lightgbm.rb:95:in `>'

but change the 2nd title to :title=>"UI/UX design" and it works fine. Certain combinations of words work and others don't.

help with errors

I've followed the simple boilerplate for creating and training a model, and I keep getting this error message. Do you have any hints on what it means so I can debug?

[LightGBM] [Fatal] The number of features in data (59) is not the same as it was in training data (103).

I've verified that each item in my data has the same number of features, so not sure what this means. It seems that this message is coming from a linked lib call, so couldn't get into the code:

https://github.com/ankane/lightgbm/blob/master/lib/lightgbm/booster.rb#L139

LightGBM model summary raises ArgumentError

Calling #summary on a LightGBM model raises a comparison of Array with Array failed (ArgumentError) exception, which is coming from the #sort_by call in Eps::LightGBM#_summary. I don't know how to best add test coverage to expose the issue in the tests, but I can demonstrate it using data from LightGBMTest:

test_data = [
  ["drv", 1029],
  ["class", 144],
  ["displ", 4322],
  ["year", 1126],
  ["cyl", 320],
  [["model", "4wd"], 364],
  [["model", "pickup"], 160],
  [["model", "2wd"], 72],
  [["model", "a4"], 0],
  [["model", "awd"], 0]]

test_data << ["foo", 0]

p test_data.sort_by { |k, v| [-v, k] }

ArgumentError will be raised if there are any string keys and array keys with the same value.

This is because strings can't be compared to arrays:

irb(main):001:0> ["model", "awd"] <=> "foo"
=> nil

Converting all the keys to strings using #display_field before the call to #sort_by fixes the issue, so does flattening the temporary arrays inside the #sort_by block.
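The flattening approach mentioned above could look like the following sketch. The `sort_importance` helper is a hypothetical name and this is not the library's actual patch. Flattening each sort key means elements are compared one by one (Integer with Integer, String with String) instead of comparing a String with an Array.

```ruby
# Hypothetical helper: sort by descending importance, then by key,
# flattening the sort key so string and array keys stay comparable.
def sort_importance(pairs)
  pairs.sort_by { |key, value| [-value, key].flatten }
end

importance = [
  ["drv", 1029],
  [["model", "a4"], 0],
  ["displ", 4322],
  [["model", "awd"], 0],
  ["foo", 0]
]

sorted = sort_importance(importance)
```

With this data, the zero-importance entries sort as "foo", then ["model", "a4"], then ["model", "awd"], without raising.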

Ideas

  • Add support for getting probabilities for classification
  • Add random hyperparameter search - https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
    • try 30-100 combinations
    • learning rate (10^-4 to 1, log scale)
    • depth (2-6)
    • feature_fraction (.5 to 1)
  • Drop support for gsl gem (no longer maintained and Eps supports gslr gem)
  • (maybe) Move to ONNX format
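As a sketch of the random search idea, the sampler below draws one combination from the ranges listed above. The `sample_params` helper is hypothetical, and mapping "depth" to LightGBM's `max_depth` parameter is an assumption. Training and scoring each candidate is left out.

```ruby
# Hypothetical helper: draw one random hyperparameter combination.
def sample_params(rng)
  {
    learning_rate: 10**rng.rand(-4.0..0.0), # log scale between 10^-4 and 1
    max_depth: rng.rand(2..6),              # assumed mapping for "depth (2-6)"
    feature_fraction: rng.rand(0.5..1.0)
  }
end

rng = Random.new(42) # fixed seed so the search is repeatable
candidates = Array.new(30) { sample_params(rng) }
```

Each candidate would then be used to train a model, keeping the one with the best validation score.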

Possible edge case issue with intercept and RMSE calculated value

Hi Andrew -

I have 1596 models that are all variations of training data derived from a single data set. I'm using Eps Linear Regression with GSL to build the models. The RMSE for all of the models is within my expected range (almost zero up to 0.398) EXCEPT for 6 of the models (out of 21) from a single user (user is one of the independent variables). These 6 models have ridiculous RMSE values like 4072322534930.

I have attached datasets in yml format in the attached zip file if you want to build a model and look at the RMSE values. The "good" example should build a model with an RMSE in my accepted range. The "bad" example will build a model with a ridiculously high RMSE. The target is the first column of data.

Archive.zip

I've reviewed the training data set and the raw data from which it was calculated, and there don't appear to be any outliers or red flags. I can't find any meaningful differences between the good and bad yaml files. The other odd thing is that even with the ridiculous RMSE (and intercept) values, the models still predict well.

Any thoughts on why the RMSE is messed up for this "bad" data set?

Stack level too deep

I'm getting a stack level too deep error in lib/eps/data_frame.rb when trying to train my model. The error is occurring at the values_at call on line 121 (https://github.com/ankane/eps/blob/master/lib/eps/data_frame.rb#L121) with the splat (*rows). The rows size is 130518 so perhaps larger than expected/supported?

  cols.each do |c|
    raise "Undefined column: #{c}" unless columns.include?(c)

    df.columns[c] = columns[c].values_at(*rows)
  end
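One possible workaround, sketched below, is to select elements with map instead of splatting every index into a single method call. The `select_rows` helper is hypothetical and is not the library's fix. Like values_at, it returns nil for an index that is out of range.

```ruby
# Hypothetical helper: pick elements by index without building a huge
# argument list, which is what values_at(*rows) does for large arrays.
def select_rows(column, rows)
  rows.map { |i| column[i] }
end

column = (0...200_000).map { |i| i * 2 }
rows = (0...200_000).step(2).to_a
subset = select_rows(column, rows)
```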
