Giter Club home page Giter Club logo

Comments (10)

cjekel avatar cjekel commented on May 23, 2024

While this is not the focus of this library, it is possible to use pwlf to find an optimum number of line segments.

The problem with automatically detecting the number of line segments is that the objective function is not as well defined. Many questions will need to be answered. Do you introduce a regularizer? Do you introduce a penalty? If you do not have a lot of noise, it's likely that without some form of regularization or penalty function that you may have as many line segments as you have data points. This no longer becomes simply minimizing a L2 norm.

Do you have any suggestions on how to find the optimum number of line segments?


Let me give you an example of how you could use pwlf to find the best number of line segments. This example just uses a simple penalty parameter to penalize the model complexity, or the number of line segments. Essentially we want to find x such that my_obj is minimized. I then go on to minimize this function using bayesian optimization (or EGO). To see how this is done, check out the newly added example.

# define your objective function
def my_obj(x):
    # define some penalty parameter l
    # you'll have to arbitrarily pick this
    # it depends upon the noise in your data,
    # and the value of your sum of square of residuals
    l = y.mean()*0.001
    f = np.zeros(x.shape[0])
    for i, j in enumerate(x):
        my_pwlf.fit(j[0])
        f[i] = my_pwlf.ssr + (l*j[0])
    return f

from piecewise_linear_fit_py.

cjekel avatar cjekel commented on May 23, 2024

Would it make sense in your application that breakpoint locations can only occur at a data point? That could greatly reduce the time of the inner optimization.

You could also try using the fitfast() function instead of the fit() function within my_obj.

from piecewise_linear_fit_py.

cjekel avatar cjekel commented on May 23, 2024

I want to open this up, because depending upon your problem their may be easier ways to deal with finding the best number of line segments.

If you don't have noise in your data, it may be possible to detect the number of line segments from a simple least squares fit. I try to show that in the first example here (this requires the 0.4.0 version in development)

from piecewise_linear_fit_py.

hyperactivetimes avatar hyperactivetimes commented on May 23, 2024

While this is not the focus of this library, it is possible to use pwlf to find an optimum number of line segments.

The problem with automatically detecting the number of line segments is that the objective function is not as well defined. Many questions will need to be answered. Do you introduce a regularizer? Do you introduce a penalty? If you do not have a lot of noise, it's likely that without some form of regularization or penalty function that you may have as many line segments as you have data points. This no longer becomes simply minimizing a L2 norm.

Do you have any suggestions on how to find the optimum number of line segments?

Let me give you an example of how you could use pwlf to find the best number of line segments. This example just uses a simple penalty parameter to penalize the model complexity, or the number of line segments. Essentially we want to find x such that my_obj is minimized. I then go on to minimize this function using bayesian optimization (or EGO). To see how this is done, check out the newly added example.

# define your objective function
def my_obj(x):
    # define some penalty parameter l
    # you'll have to arbitrarily pick this
    # it depends upon the noise in your data,
    # and the value of your sum of square of residuals
    l = y.mean()*0.001
    f = np.zeros(x.shape[0])
    for i, j in enumerate(x):
        my_pwlf.fit(j[0])
        f[i] = my_pwlf.ssr + (l*j[0])
    return f

hello there!
i'm currently working on a problem which requires using this package. however, determining the optimal number of line segments in advance is a must. so as i was searching the net i stumbled upon this piece of code in which the underlying idea i assume to be a tradeoff between "ssr" and the "number of line segments". however "j" is a scalar and can not be indexed so i got an error. then i was wondering if i could replace the "j[0]" with "i" or not?
on top of that for i = 1 i receive another error.

from piecewise_linear_fit_py.

cjekel avatar cjekel commented on May 23, 2024

The short answer is that x in the above case was expected to be a 2d array.

just do

def my_obj(x):
    # x is integer, returns float
    l = y.mean()*0.001
    my_pwlf.fit(x)
    f = my_pwlf.ssr + (l*x)
    return f

For a longer answer. I've been working on a better way to do this.

How many data points do you have in your application? What is the upper limit of line segments you are searching for?

from piecewise_linear_fit_py.

hyperactivetimes avatar hyperactivetimes commented on May 23, 2024

The short answer is that x in the above case was expected to be a 2d array.

just do

def my_obj(x):
    # x is integer, returns float
    l = y.mean()*0.001
    my_pwlf.fit(x)
    f = my_pwlf.ssr + (l*x)
    return f

For a longer answer. I've been working on a better way to do this.

How many data points do you have in your application? What is the upper limit of line segments you are searching for?

thanks for the fast response.
the short version solved the issue.
i have 25 data points sampled from "a/x" curve in which "a" is the constant and "x" is the variable. the less line segments i have the better it will be for my following intentions.

from piecewise_linear_fit_py.

cjekel avatar cjekel commented on May 23, 2024

@hyperactivetimes Another approach I have been playing around with, that would work well in your case is to compare the model's prediction variance, and compare this to an approximate variance from the data. Ideally this ratio will be 1.0.

This is a proof of concept, that manually tries 2 segments, then 3, then 4, until we've broken the ratio.

import numpy as np
import matplotlib.pyplot as plt
import pwlf


def calcGSJ(data):
    """
    Estimate the variance in data u sing the GSJ method
    """
    x = data[:, 0]
    y = data[:, 1]
    n = len(x)
    eTilde = np.zeros(len(x)-2)
    ci2 = np.zeros(len(x)-2)
    #    GSJ method
    for i in range(1, n-1):
        di = x[i+1]-x[i]
        if di == 0:
            ai = 0.5
        else:
            ai = (x[i+1] - x[i]) / di
        bi = 1.0 - ai
        ci2[i-1] = 1.0 / (ai**2 + bi**2 + 1.0)
        eTilde[i-1] = ai*y[i-1] + bi*y[i+1] - y[i]        
    sigmaHat2 = (1.0/(n-2))*np.dot(ci2.T, eTilde**2)
    print('GSJ: ', sigmaHat2)
    return sigmaHat2


np.random.seed(1231)
# generate sin wave data
x = np.linspace(0, 10, num=100)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, 100) + y

data = np.vstack((x, y)).T

sigma_hat = calcGSJ(data)

# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y)

keep_going = True
n_segment = 2
results = {}
while keep_going:
    _ = my_pwlf.fitfast(n_segment, pop=10)
    nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
    sigma = my_pwlf.ssr / (my_pwlf.n_data - nb)
    F = sigma_hat / sigma
    results[n_segment] = F
    if F > 1.0:
        keep_going = False
        print(f'best result found with n_segment={n_segment - 1}')
        n_segment -= 1
    else:
        n_segment += 1

# perform the fit again
_ = my_pwlf.fitfast(n_segment, pop=10)
# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment}')
_ = my_pwlf.fitfast(n_segment+1, pop=10)
yHat = my_pwlf.predict(xHat)
plt.plot(xHat, yHat, '-', label=f'pwlf {n_segment+1}')
plt.legend()
plt.show()

I would love to hear if this works better, or worse in your application. There could be better ways to do the search for the magic ratio? (binary search, golden ratio, quadratic fits? )

from piecewise_linear_fit_py.

thunderbug1 avatar thunderbug1 commented on May 23, 2024

A veeery simplistic approach to this problem is to fit more and more lines to the data until adding more lines does not
give any meaningful improvement. Maybe not the fastest approach but works well enough for my application.

x = series.index.to_numpy()
y = series.to_numpy()
my_pwlf = pwlf.PiecewiseLinFit(x, y)
max_line_count = 10
min_percent_improvement_to_add_line = 80

last_ssr = None
best_pwlf = None
for i in range(2, max_line_count):
    #res = my_pwlf.fit(i)
    res = my_pwlf.fitfast(i, pop=3) # runs faster than fit but does not guarantee optimal solution
    ssr = my_pwlf.ssr
    if last_ssr is not None:
        improvement_percent = ( last_ssr - ssr )/ last_ssr * 100
        if improvement_percent < min_percent_improvement_to_add_line:
            break
    last_ssr = ssr
    best_pwlf = my_pwlf
print(f"best line_count: {i-1}")
xhat = np.linspace(x.min(), x.max(), len(x))
yhat = best_pwlf.predict(xhat)

from piecewise_linear_fit_py.

cjekel avatar cjekel commented on May 23, 2024

@thunderbug1 Thanks for sharing this!

One thing to note as the number of line segments approaches the number of data points, SSR will approach zero.

from piecewise_linear_fit_py.

thunderbug1 avatar thunderbug1 commented on May 23, 2024

@cjekel Thats very true, from what I have seen so far, the code stops at a reasonable number of lines. One could even add a dynamically set max_line_count if the signal length/type varies to avoid overfitting in edge cases.

from piecewise_linear_fit_py.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.