
symfit's Introduction


Please cite this DOI if symfit benefited your publication. Building this has been a lot of work, and as young researchers, citations mean a lot to us. Martin Roelfs & Peter C Kroon, symfit. doi:10.5281/zenodo.1133336

Documentation

http://symfit.readthedocs.org

Project Goals

The goal of this project is simple: to make fitting in Python pythonic. What does pythonic fitting look like? Well, there's a simple test. If I can give you pieces of example code and don't have to use any additional words to explain what it does, it's pythonic.

from symfit import parameters, variables, Fit, Model
import numpy as np

xdata = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ydata = np.array([2.3, 3.3, 4.1, 5.5, 6.7])
yerr = np.array([0.1, 0.1, 0.1, 0.1, 0.1])

a, b = parameters('a, b')
x, y = variables('x, y')
model = Model({y: a * x + b})

fit = Fit(model, x=xdata, y=ydata, sigma_y=yerr)
fit_result = fit.execute()

Cool right? So now that we have done a fit, how do we use the results?

import matplotlib.pyplot as plt

yfit = model(x=xdata, **fit_result.params)[y]
plt.plot(xdata, yfit)
plt.show()

Need I say more? How about I let another code example do the talking?

from symfit import parameters, Fit, Equality, GreaterThan

x, y = parameters('x, y')
model = 2 * x * y + 2 * x - x**2 - 2 * y**2
constraints = [
    Equality(x**3, y),
    GreaterThan(y, 1),
]

fit = Fit(- model, constraints=constraints)
fit_result = fit.execute()

I know what you are thinking. "What if I need to fit to a system of Ordinary Differential Equations?"

from symfit import variables, Parameter, ODEModel, Fit, D
import numpy as np

tdata = np.array([10, 26, 44, 70, 120])
adata = 10e-4 * np.array([44, 34, 27, 20, 14])

a, b, t = variables('a, b, t')
k = Parameter('k', 0.1)

model_dict = {
    D(a, t): - k * a**2,
    D(b, t): k * a**2,
}

ode_model = ODEModel(model_dict, initial={t: 0.0, a: 54 * 10e-4, b: 0.0})

fit = Fit(ode_model, t=tdata, a=adata, b=None)
fit_result = fit.execute()

For more fitting delight, check the docs at http://symfit.readthedocs.org.

symfit's People

Contributors

antonykamp, brocksam, elgalu, jbarnoud, jhsmit, pckroon, pitje06, stephanlachnit, tbuli


symfit's Issues

Order parameters alphabetically

Because the order is not well defined, tests sometimes fail because the parameter order can change from run to run. This could potentially be dangerous in the field as well (though it shouldn't be, because the parameters are assumed to be unordered in the API), so it's better to fix the order alphabetically.

Although this ordering would also allow models to be called with ordered arguments rather than keyword arguments, I don't want that, as it makes the API less pythonic.

LogLikelihood fitting

Currently only least squares fitting is supported. I propose implementing a maximum-likelihood style of fitting as well. The following is needed:

What is now the Fit object should become LeastSquares.
Fit wraps LeastSquares.
A new LogLikelihood object should be defined which has exactly the same API as Fit.
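
For reference, here is a minimal, self-contained sketch of what maximum-likelihood fitting boils down to, using plain numpy/scipy rather than the proposed symfit API (the exponential toy model and all names below are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize_scalar

np.random.seed(0)
data = np.random.exponential(scale=2.0, size=1000)  # toy data, true rate = 0.5

def neg_log_likelihood(lam):
    # exponential pdf: lam * exp(-lam * x); we minimise the negative log-likelihood
    return -np.sum(np.log(lam) - lam * data)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method='bounded')
print(result.x)  # should be close to 1 / 2.0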

Likelihood NaN

Likelihood fitting does not always work; it quite easily results in NaN. This can be caused by poor initial conditions, but there are also more fundamental reasons. In particular, I would prefer to use the likelihood rather than the log-likelihood, because the likelihood is always a number between 0 and 1, whereas the log-likelihood can be anywhere between -inf and 0, which might be problematic.

Currently, the following is minimised:
ans = - np.nansum(np.log(func(x, p)))
thereby maximizing the log-likelihood. I would prefer to minimize the following instead:
ans = - np.product(func(x, p))
However, this does not work nearly as well. How come?
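
A quick, self-contained numerical illustration of why the raw product behaves so much worse: the product of many probabilities underflows to zero in floating point, while the sum of their logs stays finite.

import numpy as np

probs = np.full(1000, 0.1)    # 1000 likelihood values of 0.1 each
print(np.prod(probs))         # 0.0: the product underflows
print(np.sum(np.log(probs)))  # roughly -2302.6: the log-sum stays representable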

Open sympy api?

Should the API of SymFit also expose the SymPy api for convenience?

Traceback error on first documentation example in python3

Going through the documentation and trying to assign a Parameter to A, I ran into a traceback. I'm using Python 3 (with sympy installed) on a Linux Mint 17 system.

Error found:

In [1]: from symfit.api import Parameter, Variable, exp, Fit

In [2]: A = Parameter(100, min=0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-dc44c935119d> in <module>()
----> 1 A = Parameter(100, min=0)

/usr/local/lib/python3.4/dist-packages/symfit/core/argument.py in __init__(self, value, min, max, fixed, name, *args, **kwargs)
     27     """
     28     def __init__(self, value=1.0, min=None, max=None, fixed=False, name=None, *args, **kwargs):
---> 29         super(Parameter, self).__init__(name, *args, **kwargs)
     30         self.value = value
     31         self.fixed = fixed

TypeError: object.__init__() takes no parameters

Clear out the traits dependency

And make this package purely Pythonic. This has numerous advantages, the most important of these being Python 3.x compatibility.

Because a lot of things in this package need validation and are read-only, the use of certain traits can be replaced by plain Python properties.
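
As a rough sketch of the replacement pattern (the class name and validation rule below are illustrative, not symfit's actual code), a validated, read-only attribute can be expressed with a plain property:

class Argument(object):
    def __init__(self, name):
        # illustrative validation: the name must be a valid Python identifier
        if not isinstance(name, str) or not name.isidentifier():
            raise ValueError("name must be a valid Python identifier")
        self._name = name

    @property
    def name(self):
        # read-only: no setter is defined
        return self._name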

Allow easy reuse of parameters by allowing **fit_result.params

Let g = a*x + b
Suppose we have fitted the data and the result is stored in fit_result.
To re-evaluate the function at a given x, I suggest allowing the following syntax:

g(x=2, **fit_result.params)

To get the value of the model g at x=2. It seems more pythonic to me than doing
g(fit_result.params, x=2)
and certainly better than the current

g(
    x=2,
    a=fit_result.params.a,
    b=fit_result.params.b
)

As this can get very involved and requires you to remember all the parameter names.

To do this, ParameterList must emulate a dict, and might as well be called ParameterDict from now on.
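
A minimal sketch of such a dict-emulating container (names and internals are assumptions, not the real ParameterDict): ** unpacking only requires keys() and __getitem__.

class ParameterDict(object):
    def __init__(self, values):
        self._values = dict(values)  # e.g. {'a': 2.3, 'b': 0.1}

    def keys(self):
        return self._values.keys()

    def __getitem__(self, key):
        return self._values[key]

params = ParameterDict({'a': 2.3, 'b': 0.1})

def g(x, a, b):
    return a * x + b

print(g(x=2, **params))  # 4.7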

import symfit vs symfit.api

For increased readability of code, I think symfit.api should be replaced by just symfit. The .api syntax was taken from several other projects I had been working with, where it seemed to make sense (Traits, Enaml).

However, in Python we don't typically do this, so it feels wrong to me now.

Importing from .api should give a deprecation warning.
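
A sketch of how that warning could be emitted, e.g. at the top of symfit/api.py (module layout assumed):

import warnings

warnings.warn(
    "Importing from symfit.api is deprecated; import from symfit directly instead.",
    DeprecationWarning,
    stacklevel=2,
)

from symfit import *  # keep the old import path working for now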

test_gaussian_2d_fitting fails

This test has been unreliable since day 1. For one, I have noticed different behavior between py2.7 and py3.x. This seems to be related to x, y versus y, x ordering in 2D numpy arrays.

I therefore ignore it when this fit errors, since I assume it is not fundamental to symfit. Obviously assumption is the root of all cock-ups, so it would be nice if someone could investigate.

Easy evaluation of submodels

In one of my applications, I use the following:

pop_1 = A_1 * exp(-(x - x0_1)**2/(2 * sig_1**2))
pop_2 = A_2 * exp(-(x - x0_2)**2/(2 * sig_2**2))
model = pop_1 + pop_2

fit = Fit(model, xdata, ydata)
fit_result = fit.execute()

After fitting, the model is easily evaluated using

y = model(x=xdata, **fit_result.params)

But what if you only want to evaluate pop_1? The following currently gives an error:

y_1 = pop_1(x=xdata, **fit_result.params)

Should this be supported, or do we want another shorthand for doing this? If you have a good idea, please propose your suggestion as a reply to this issue.
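
In the meantime, a hedged workaround with plain sympy (not an official shorthand) is to lambdify the sub-expression yourself and substitute the fitted values; the parameter values below are placeholders:

import numpy as np
import sympy

x, A_1, x0_1, sig_1 = sympy.symbols('x A_1 x0_1 sig_1')
pop_1 = A_1 * sympy.exp(-(x - x0_1)**2 / (2 * sig_1**2))

pop_1_func = sympy.lambdify((x, A_1, x0_1, sig_1), pop_1, modules='numpy')
xdata = np.linspace(-5, 5, 11)
y_1 = pop_1_func(xdata, 1.0, 0.0, 1.0)  # substitute the fitted values here
print(y_1)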

fit.data should have dependent vars as keys (?)

fit.data and its derivatives currently have Variable names as keys and np.array objects as values. To make the internal API cleaner, it should be considered to change this to Variable: np.array pairs. This would make the code in certain other methods cleaner, since you often encounter things like this (self is a Fit instance):

{var.name: self.data[var.name] for var in self.model.dependent_vars}

Could be sexyfied into

{var: self.data[var] for var in self.model.dependent_vars}

Since I don't think it is more convenient to use the name anywhere except when initializing the data.

Vector valued functions

In order to allow vector valued functions as models, a nice way to define such functions is needed. Unless I am mistaken, sympy does not have a standard way of doing this (yet?).

What would be the best way of doing this? I propose the following:

  • list of functions
a = [x**2, y**2, 2 * x * y]

Or a tuple could be used if it makes more sense for it to be immutable.

  • vector object
a = vector(x**2, y**2, 2 * x * y)
  • overload an otherwise unused operator, e.g.
a = x**2 & y**2 & 2 * x * y

Other operators that could be used are | and ^ but 'bitwise and' makes more sense to me.

Better error messages

Some Exception subclasses should be defined for better Exception handling, and to give users useful errors.
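
A minimal sketch of what such a hierarchy could look like (class names are assumptions, not symfit's actual API):

class SymfitError(Exception):
    """Base class for all symfit-specific errors."""

class ModelError(SymfitError):
    """Raised when a model is ill-defined, e.g. it references unknown variables."""

class FitError(SymfitError):
    """Raised when the fitting procedure cannot be executed or fails to converge."""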

Parameter errors greatly overestimated when weights are provided

It turns out that when weights are provided, the fitting process greatly overestimates the parameter errors.

Example from the tests:

t_data = np.array([1.4, 2.1, 2.6, 3.0, 3.3])
y_data = np.array([10, 20, 30, 40, 50])

sigma = 0.2
n = np.array([5, 3, 8, 15, 30])
sigma_t = sigma / np.sqrt(n)

# We now define our model
y = Variable()
g = Parameter()
t_model = (2 * y / g)**0.5

fit = Fit(t_model, y_data, t_data)
fit_result = fit.execute()
self.assertAlmostEqual(fit_result.params.g, 9.0879601268490919)
self.assertAlmostEqual(fit_result.params.g_stdev, 0.15247139498208956)

fit = Fit(t_model, y_data, t_data, sigma=sigma_t)
fit_result = fit.execute()
self.assertAlmostEqual(fit_result.params.g, 9.0951297073387991)
self.assertAlmostEqual(fit_result.params.g_stdev, 2.0562987023203601)

I thought the 2.056 stdev wasn't unexpected, given the poor dataset used. However, it was pointed out to me that Mathematica gives 9.095 ± 0.105 for the same data when using weights.

I am now researching the causes and so far I have discovered that the mistake was using the residuals as calculated by leastsq to calculate the errors using the following line:

# in LeastSquares.execute. infodic['fvec'] contains scipy's residuals
s_sq = (infodic['fvec'] ** 2).sum() / (len(self.ydata) - len(popt))
pcov = cov_x * s_sq if cov_x is not None else None

Changing this to

# in LeastSquares.execute
s_sq = ((self.scipy_func(self.xdata, popt) - self.ydata) ** 2).sum() / (len(self.ydata) - len(popt))
pcov = cov_x * s_sq if cov_x is not None else None

gave 9.095 ± 0.153; in other words, the residual variance is recalculated from the unweighted residuals. This is still slightly more than Mathematica. However, it is the same as what curve_fit gives with absolute_sigma=True. With absolute_sigma=False, curve_fit agrees with Mathematica.

The question now becomes: which is correct?

After reading up on the subject, I have the impression that it depends on whether your weights are measurement errors or just relative weights. In the latter case you do not want your parameter errors to depend on the absolute values of the weights; you just want to use them to assign different significance to the data points. In the former case, however, the absolute value of the errors in the data points is significant and should therefore be propagated into the parameter errors.

Sources:
http://nbviewer.ipython.org/urls/gist.github.com/taldcroft/5014170/raw/31e29e235407e4913dc0ec403af7ed524372b612/curve_fit.ipynb
http://nbviewer.ipython.org/gist/cdeil/5030045
http://reference.wolfram.com/language/howto/FitModelsWithMeasurementErrors.html

I would greatly appreciate help on this topic. In the meantime, I will build all the links above into tests to see what happens.
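
For reference, the distinction can be reproduced directly with scipy's curve_fit on the same data (a self-contained sketch; p0 is just a reasonable starting guess). Switching absolute_sigma toggles between the two interpretations of the weights discussed above:

import numpy as np
from scipy.optimize import curve_fit

t_data = np.array([1.4, 2.1, 2.6, 3.0, 3.3])
y_data = np.array([10., 20., 30., 40., 50.])
sigma_t = 0.2 / np.sqrt(np.array([5, 3, 8, 15, 30]))

def t_model(y, g):
    return (2 * y / g)**0.5

for absolute in (True, False):
    popt, pcov = curve_fit(t_model, y_data, t_data, p0=[9.0],
                           sigma=sigma_t, absolute_sigma=absolute)
    print(absolute, popt[0], np.sqrt(pcov[0, 0]))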

Make sympy_to_scipy More efficient

def sympy_to_scipy(func, vars, params):
    """
    Convert a symbolic expression to one scipy digs.
    :param func: sympy expression
    :param vars: variables
    :param params: parameters
    """
    lambda_func = sympy_to_py(func, vars, params)
    def f(x, p):
        """
        Scipy style function.
        :param x: list of arrays, NxM
        :param p: tuple of parameter values.
        """
        x = np.atleast_2d(x)
        y = [x[i] for i in range(len(x))]
        return lambda_func(*(y + list(p)))

    return f

The current definition of f will perform the list comprehension every time it is called, which can add up when a lot of iterations are needed for the fit to converge. A more efficient function is therefore desirable.
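
A self-contained sketch of one possible micro-optimisation, written with plain sympy/numpy instead of symfit's internal sympy_to_py: star-unpack the rows of x directly rather than building an intermediate list on every call. Whether this is noticeably faster in practice would need to be measured.

import numpy as np
import sympy

x_sym, y_sym, a_sym = sympy.symbols('x y a')
lambda_func = sympy.lambdify((x_sym, y_sym, a_sym), a_sym * x_sym**2 + y_sym, modules='numpy')

def f(x, p):
    """Scipy-style function: x is an (N, M) array of variables, p a tuple of parameters."""
    x = np.atleast_2d(x)
    return lambda_func(*x, *p)  # replaces the per-call list comprehension

xdata = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])
print(f(xdata, (2.0,)))  # 2*x**2 + y evaluated elementwise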

Two different syntaxes for getting Parameter values?

Currently there are two different ways to get the value and stdev of a parameter:

# a = Parameter()

assert fit_result.params.a == fit_result.value(a)

This kinda conflicts with the principle that there should be only one obvious way to do something. Which is better?

If we decide to remove the first one, that means the ParameterDict object can be removed entirely and replaced with an OrderedDict. This is something I would like, as the current ParameterDict is kind of a "don't touch it, it works"-thing.

Furthermore, the second syntax is better when dealing with dynamically generated Parameter objects whose name you don't know or care about.

The only thing going for the first one is that I think it's rather elegant.
@Jhsmit, @Pitje06, @Eljee What do you think?

Reorganise tests

Tests should be restructured over multiple files depending on what they test, because the current test object has become too big.

Function objects

For common functions, we could create base objects, e.g.

x0 = Parameter()
sig = Parameter()
model = Gaussian(x0 = x0, sig = sig)

This way you don't have to remember the exact description for certain commonly used functions.
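
A rough sketch of what such a convenience constructor could look like, built on plain sympy (the name, signature and explicit x argument are assumptions, not a symfit API):

import sympy

def Gaussian(x, x0, sig):
    """Return a (non-normalised) Gaussian expression in x, centred at x0 with width sig."""
    return sympy.exp(-(x - x0)**2 / (2 * sig**2))

x, x0, sig = sympy.symbols('x x0 sig')
model = Gaussian(x, x0, sig)
print(model)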

Use of min & max keywords

By using the min and max keywords for Parameter I have shadowed the Python built-ins in that namespace. However, I do not think this is a bad thing, as:

  • they are simply the most logical words to use;
  • if we need min() and max() we would use numpy.min()/numpy.max() anyway.

This issue is therefore created solely to document the reasoning, should the question ever be posed.

Proposal to prevent ambiguity in variable assignment

In the current API, the way data is assigned to the variables can be unclear. Although symfit checks that either the first or the last dimension of the xdata has the same length as the number of variables in the model, this can be quite unreadable. Furthermore, the unpacking is done by the order in which the data is presented in this dimension, so if the user provides the arrays in the wrong order the fit will be wrong.

Example in the current API, concerning some function defined on the grid -5 <= x < 5, -5 <= y < 5:

xdata = np.arange(-5, 5, 1) 
ydata = np.arange(-5, 5, 1) 
xx, yy = np.meshgrid(xdata, ydata, sparse=False) 
xdata_coor = np.dstack((xx, yy)) 

zdata = 2.5 * xx**2 + 3.0 * yy**2

a, b = parameters('a, b')
x, y = variables('x, y')
new = a * x**2 + b * y**2

fit = Fit(new, xdata_coor, zdata) 

I would like to change the API so as to prevent this ambiguity and to increase readability. Proposal:

xdata = np.arange(-5, 5, 1) 
ydata = np.arange(-5, 5, 1) 
xx, yy = np.meshgrid(xdata, ydata, sparse=False) 

zdata = 2.5 * xx**2 + 3.0 * yy**2

a, b = parameters('a, b')
x, y = Variable(data=xx), Variable(data=yy)  # We have to prepare the objects separately because we want to supply kwargs
new = a * x**2 + b * y**2

fit = Fit(new, zdata) 

This would increase readability and remove the need for stacking the data. However, the call to fit is less clear to me.

Alternatively,

xdata = np.arange(-5, 5, 1) 
ydata = np.arange(-5, 5, 1) 
xx, yy = np.meshgrid(xdata, ydata, sparse=False) 

zdata = 2.5 * xx**2 + 3.0 * yy**2

a, b = parameters('a, b')
x, y = variables('x, y')
new = a * x**2 + b * y**2

fit = Fit(new, x=xx, y=yy, data=zdata) 

This way the fit object can link the keyword arguments to the corresponding variables, and throw an error if no Variable by that name exists:

fit = Fit(new, q=xx, y=yy, data=zdata) # raises TypeError: no variable by the name of 'q' exists.

Furthermore, it would still be possible to make the data a positional argument, but the variables would have to be keyword arguments:

fit = Fit(new, zdata, x=xx, y=yy)

What do you think? Do you like one of these suggestions, or would you like something else? More ideas are welcome.

Docs: add a style guide

The docs need a style guide with some best practices for symfit.

Both for users, and developers.

Piecewise continuous function fitting

Fitting to piecewise functions is quite common. Luckily, most of the code already seems to work, thanks to sympy! The thing that doesn't work yet is the fitting...

Interesting example:
http://stackoverflow.com/questions/35795341/fitting-a-two-layer-model-to-wind-profile-data-in-python#comment60105265_35795341

symfit implementation of the solution:

k, z0, ust, a, b, zsl = parameters('k, z0, ust, a, b, zsl')
z, = variables('z')

model = Piecewise(
    (ust/k * ln(z/z0), z < zsl),
    (a*z + b, z >= zsl),
)

xdata = np.linspace(0, 10)
y = model(xdata, k=0.4, zsl=2., ust=0.3, z0=0.002, a=0.1, b=10.)

fit = Fit(model, xdata, y)
fit_result = fit.execute()

It's pretty awesome that the Piecewise object is now behaving exactly as one would expect! However, there is one problem:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I think this is caused by the lines like z < zsl, which when converted to numpy code will be of the form np.array < float. So one possible solution could be to find a way to convert the parameter values to arrays upon calling, such that the comparison can be made elementwise.
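
A small plain-numpy illustration of where this particular ValueError message comes from: the elementwise comparison itself is fine, but using the resulting boolean array as a single truth value is not.

import numpy as np

z = np.linspace(0, 10, 5)
zsl = 2.0

mask = z < zsl  # elementwise comparison: perfectly fine, gives a boolean array
print(mask)

try:
    if z < zsl:  # a whole array has no single truth value
        pass
except ValueError as e:
    print(e)  # "The truth value of an array with more than one element is ambiguous..."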

Add convenience functions for multiple Parameter/Variable creation

sympy supports the following to make multiple symbols:

x, y, z, t = symbols('x y z t')

Although I'm not a big fan of determining the names from the string x y z t, symfit should have something similar:

x, y = variables('x y')
a, b = parameters('a b')

Or, if you use the inspect-based DRY philosophy used in single Argument definitions,

x, y = variables()
a, b = parameters()

The function itself would be a one-liner in the first case:

def variables(names):
    return [Variable(name=name) for name in names.split()]

Bounds can cause fitting to fail

In some cases, supplying symfit with bounds on the parameters makes the fitting converge to a non-solution, whereas without bounds it finds the solution just fine. Since all bounds are handled by leastsqbound, I think the bug is in that package.

It would therefore be prudent to reintroduce the Minimize facilities and use them to handle all fitting.

Model should be iterable

When initiating a Model, Model.model_dict is set with an OrderedDict containing dependent_variable: expression pairs.

In developing, I see myself writing lines such as this A LOT:

for var, expr in self.model.model_dict.items():

And this isn't even the least clear occasion. I plan to cut out the middle man such that iterating over a Model basically means iterating over its model_dict.

This would also imply the following:

x, y_1, y_2 = variables('x, y_1, y_2')
a, b = parameters('a, b')

model_dict = {y_1: a * x**2, y_2: 2 * x * b}
model = Model.from_dict(model_dict)

print(model[y_1])
>>> a * x**2
print(len(model))  # This already works
>>> 2
print(model.items())
>>> ((y_1, a * x**2), (y_2, 2 * x * b))

Although it only saves a call to .model_dict, it makes the API infinitely more readable. Plus, how are you supposed to know that the thing you need to call is .model_dict? This will be far more intuitive.
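
A minimal sketch of the delegation this implies (simplified; not symfit's actual Model implementation):

from collections import OrderedDict

class Model(object):
    def __init__(self, model_dict):
        self.model_dict = OrderedDict(model_dict)

    def __getitem__(self, var):
        return self.model_dict[var]

    def __iter__(self):
        return iter(self.model_dict)

    def __len__(self):
        return len(self.model_dict)

    def items(self):
        return self.model_dict.items()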

Symfit.version is missing

Currently it is impossible to easily check programmatically what version of symfit is installed, since there is no symfit.version or any equivalent.

Remove self.sigma from error on fitting

At the moment the LeastSquares object takes sigma from self. This is not transparent, as all other input to the calculation of error and eval_jacobian is passed as args.

Fit to a constant

Fitting to a constant doesn't work at the moment. The following should be allowed:

a = Parameter()
model = a

But it is not; you have to provide at least one variable. And the following hack doesn't always work either:

x = Variable()
a = Parameter(0.0)
b = Parameter(0.0, fixed=True)
model = a + b*x

It gives nan for a and b.

Fitting to an ODE as model

I would really like to be able to fit to differential equations. This problem recently came up while helping a friend with his data.

This thread is meant to be a brainstorm about potential syntaxes as well as the implementation.

p, q, r = parameters('p, q, r')
x, y, t = variables('x, y, t')

model_dict = {
    Derivative(x, t): x + p * y,
    Derivative(y, t): q * y + r * x,
}

Fit(model_dict, t=tdata, x=xdata, y=ydata)

This uses the existing Derivative object from sympy.

Internally, I think it would be awesome if an ODEModel could be defined which takes any (set of) differential equations and integrates them internally. A call to this model then returns the values of the solutions at the given points, so the model basically behaves as its own solution.

When no exact solution can be found a numerical integration should be performed. Therefore, initial conditions and range should be optional parameters. Some ideas on a syntax:

ODEModel(model_dict, initial=({t: 0, y: 0}, {t: 0, x: 0}), range=(t, 0, 25))
ODEModel(model_dict, initial={y[0] = 0, x[0] = 0}, range=(t, 0, 25))

(Although the range can be taken from tdata)

A good syntax for initial conditions is crucial, including the derivatives.

As I currently understand it, the steps are to first integrate the ODE using the initial guesses for the parameters, then fit this to the data, redoing the integration every time the parameters are varied. This is a very expensive process. So I think a good ODE solver is one that can leave the parameters symbolic while integrating numerically in parts, such that the integration doesn't have to be repeated when doing the fitting. I have no idea if this is possible, just thinking out loud here.

The integration technique recommended to me was Runge-Kutta (http://mathworld.wolfram.com/Runge-KuttaMethod.html), but I'm not sure yet whether it meets this requirement. As I continue to learn more about this, I will report back here.
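
As a baseline for comparison, here is a self-contained sketch of the brute-force approach described above (plain scipy, not the proposed ODEModel API), re-integrating the ODE for every trial parameter value; it reuses the data from the README example, and all function names are illustrative:

import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def reaction(y, t, k):
    a, b = y
    return [-k * a**2, k * a**2]

def a_of_t(t, k, a0=54e-3):
    # integrate from t=0 up to the measurement times, then return a(t)
    sol = odeint(reaction, [a0, 0.0], np.concatenate(([0.0], t)), args=(k,))
    return sol[1:, 0]

tdata = np.array([10., 26., 44., 70., 120.])
adata = 1e-3 * np.array([44., 34., 27., 20., 14.])

k_opt, k_cov = curve_fit(a_of_t, tdata, adata, p0=[0.1])
print(k_opt, np.sqrt(np.diag(k_cov)))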

Reintroduce scipy.optimize.minimize

This has been commented out in the source for lack of consistency with Fit, but I suggest reintroducing it by defining the following objects:

from abc import ABCMeta, abstractmethod, abstractproperty

class BaseJacobian:
    __metaclass__ = ABCMeta

    @abstractproperty
    def jacobian(self):
        """ Return the jacobian functions. """
        pass

    @abstractmethod
    def eval_jacobian(self, p, func, x, y):
        """ Evaluate self.jacobian. """
        pass

class ParameterJacobian(BaseJacobian):
    """ Provides support for a Jacobian with respect to Parameters in the given model """
    pass

class VariableJacobian(BaseJacobian):
    """ Provides support for a Jacobian with respect to Variables in the given model """
    pass

Then we could have BaseFit inherit from ParameterJacobian, and BaseMinimize inherit from VariableJacobian.

lambdify with dummify=False

It turns out that upon creating a pythonic function of our models, we must make sure to use dummify=False; otherwise dummy variables, rather than our own symbols, will be used as the argument names of the returned lambda function.

BaseFit should have error_func and eval_jacobian

I disabled these on BaseFit because currently not all objects require them. But I regret this decision, as it potentially makes subclassing from BaseFit's children a mess: you would have to redefine execute even when only the error_func is different.

Related to #38.

Model ABC's

A recent discussion (#48) highlighted an important point: what exactly is a Model, and what isn't it?

As the number of Model subclasses is gradually growing, it is time to create some order. Current subclasses are TaylorModel, Constraint and the soon-to-be-introduced ODEModel.

Most importantly, I currently define too many numerical functions in an independent way even though they are not independent. If you know the numerical way to evaluate a function, you can also evaluate its chi-squared, to give just one example.

So let's look at some of the ways the model ABCs could be defined. I left out a lot of default behavior to clarify what is interesting about each object. If methods have not been completed below, that means they are implemented in a standard way.

from abc import abstractmethod

class BaseModel:
    """
    Defines all the default behavior of a model: it's iterable, and
    when called the components are evaluated and returned as
    a namedtuple. Furthermore, the model has a call signature, so it's inspect-friendly.
    """

class Callable(BaseModel):
    @abstractmethod
    def eval_components(self, *ordered_data, **named_data):
        """ Return the components of the model evaluated at the given data point(s). """

    def eval_chi_squared(self, *ordered_data, **named_data):
        ...  # implemented in the standard way, in terms of eval_components

class Differentiable(Callable):
    @abstractmethod
    def eval_jacobian(self, *ordered_data, **named_data):
        """ Return the jacobian evaluated at the given data point(s). """

    def eval_chi_squared_jacobian(self, *ordered_data, **named_data):
        ...  # implemented in the standard way, in terms of eval_jacobian
Fitting subclasses already use these eval_ functions, so this would allow for plug-and-play definitions of custom Model objects which immediately play nice with the symfit tools designed for these classes of objects.

Currently, what is called eval_ here is actually called numerical_, and these are properties. However, this has certain limitations that I recently ran into again, so I intend to replace them with these eval_ methods, since it's a better name.

Remove abc, raise NotImplementedError instead

Part of the reason py2 is now disabled is that Abstract Base Classes are defined differently there. But since symfit doesn't take full advantage of abc anyway, a simple raise NotImplementedError in the base class would suffice.
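
The pattern would simply be something like this (class and method names are illustrative):

class BaseModel(object):
    def eval_components(self, *ordered_data, **named_data):
        raise NotImplementedError(
            "Subclasses of BaseModel must implement eval_components."
        )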

FitResults should be read only

as people should not be allowed to mess with the FitResults. Do not like the results? Try a different model or change your experiment to match your expectations ^^.

But no messing about with FitResults.
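
A rough sketch of how the read-only behaviour could look (attribute names are illustrative, not the real FitResults): store the results privately and expose them through properties without setters.

class FitResults(object):
    def __init__(self, params):
        self._params = dict(params)

    @property
    def params(self):
        # return a copy so callers cannot mutate the internal state either
        return dict(self._params)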

Datapoint weights

Weights are currently not supported by the Fit object and therefore have to be passed to Fit.execute as kwargs. This should be changed so objects can be initiated with weights immediately.

Analytical Constrained Fitting

I am currently working on analytic constrained fitting using generalised Lagrange Multipliers. This method finds the saddle points of the model and therefore allows for analytical expressions for the parameters. These can then be used to calculate the value from the data.

Since multiple solutions are possible, we want to enable the user to select the desired solution as sexy as possible. To give an example of how this would work, I reproduce an example I found here: http://www.asp.ucar.edu/colloquium/1992/notes/part1/node36.html.

The challenge is to calculate the angles of a triangle from a set of measurements on said angles. The constraint is that we know the angles should add up to 180 degrees.

This is how this will look in symfit after vector-valued models (#15) have been introduced.

a, b, c = parameters('a, b, c')  # fitted angles
a_i, b_i, c_i = variables('a_i, b_i, c_i')  # variables representing the i'th measured angles.

model = {a_i: a, b_i: b, c_i: c}
constraints = [Eq(a + b + c, 180)]

analytic_fit = LagrangeMultipliers(model, constraints=constraints)
extrema = analytic_fit.extrema
>>> [Extremum(a=2*a_i/3 - b_i/3 - c_i/3 + 60, b=-a_i/3 + 2*b_i/3 - c_i/3 + 60, c=-a_i/3 - b_i/3 + 2*c_i/3 + 60)]

Extremum is a NamedTuple containing an analytical extremum of the system of equations. The beauty of this is that the user can now provide this solution to his favourite fitting object as the one to use to calculate the parameters by averaging over all the data:

data = np.array([
    [10., 10., 10., 10., 10., 10., 10.],
    [100., 100., 100., 100., 100., 100., 100.],
    [70., 70., 70., 70., 70., 70., 70.],
])

fit = Fit(model, a_i=data[0], b_i=data[1], c_i=data[2], extremum=extrema[0])
fit.execute()

I also plan on building a ConstrainedFit object to automate the entire process, if it is possible to determine the desired solution from the bounds provided on the parameters. In the case of only one solution this is easy, of course. If no solution can be found, or multiple solutions, the object should (?) raise an Exception.

data = np.array([
    [10., 10., 10., 10., 10., 10., 10.],
    [100., 100., 100., 100., 100., 100., 100.],
    [70., 70., 70., 70., 70., 70., 70.],
])

fit = ConstrainedFit(model, a_i=data[0], b_i=data[1], c_i=data[2], constraints=constraints)
fit.execute()

Make model calls consistent

Currently there are a couple of equivalent functions in Model and Expr that do roughly the same thing. I should check to make sure that there is no overlap in their tasks. How I think it should work:

Expr.__call__ evaluates a sympy Expr with given args and kwargs.
Model.__call__ should evaluate each of the Expr in model_dict.
Model.numerical_components is thereby redundant and should be removed. There should be only one obvious way, and that's to call Model.

ODE Documentation

The ODE branch is almost finished. Now it needs to have some good documentation. Any nice toy models to serve as examples are welcome.

I will also include a list of things that need to be done on the ODEModel, for reference:

  • Higher-order ODEs; currently only systems of 1st order are allowed.
  • Check the standard deviations of the parameters against analytical solutions for some model systems. They appear to be unrealistically high.
  • Initial conditions should be allowed to be Parameter objects as well, because the Fit seems to be very sensitive to getting them exactly right.

I wanted to have all of these finished before releasing, but I now realize that it's better to release the functionality we have now so people can start using it already.

built-in vars() overwritten

I like using vars as a short hand notation in Fit and other objects. I must admit that I didn't realise it was a built-in at first. This might be a potential issue, so I should consider renaming this in future functions.

Luckily this does not influence the user API.

FitResults interface

The interface for getting values, stdevs and covariances from a FitResults instance should be changed to the following for readability:

(Let a, b be instances of Parameter)

fit_result = fit.execute() # Any FitResults instance

print(fit_result.value(a))
>>> 3.0

print(fit_result.stdev(a))
>>> 0.5

print(fit_result.covariance(a, b))
>>> 0.0

print(fit_result.variance(a))
>>> 0.25

Flatten for multidimensional data

In order to deal with multidimensional data in a consistent way, we need to flatten the input arrays so the mean square deviation can be calculated by scipy.

To give an example:

xdata = np.random.randint(-10, 11, size=(2, 400))
zdata = 2.5*xdata[0]**2 + 7.0*xdata[1]**2

a = Parameter('a')
b = Parameter('b')
x = Variable('x')
y = Variable('y')
new = a*x**2 + b*y**2

fit = Fit(new, xdata, zdata)

Upon init, the Fit should look at the shape of xdata, and compare that with the number of variables in the model. The first dimension of xdata should equal the number of variables. E.g. the following should be valid syntax:

x, y = xdata

All remaining dimensions should be flattened. If we had for example a dataset where xdata.shape == (10, 10, 2) and zdata.shape == (10, 10), then internally these will be flattened to xdata.shape == (100, 2) and zdata.shape == (100,).

Obviously this should be done in a consistent way, so as not to lose the correspondence between data points.
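
A small plain-numpy illustration of such a flattening, assuming the variables sit along the first axis as in x, y = xdata, which keeps coordinates and data points paired:

import numpy as np

xdata = np.random.randint(-10, 11, size=(2, 10, 10))  # 2 variables on a 10x10 grid
zdata = 2.5 * xdata[0]**2 + 7.0 * xdata[1]**2

x_flat = xdata.reshape(xdata.shape[0], -1)  # (2, 100): one row per variable
z_flat = zdata.reshape(-1)                  # (100,)

assert np.array_equal(2.5 * x_flat[0]**2 + 7.0 * x_flat[1]**2, z_flat)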

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

Traceback (most recent call last):
  File "C:\Users\Martin\Dropbox\Code\temp\ALEX-Suite\alex\symfit_wrapper.py", line 246, in run_fitting
    fit_results.append(fit.execute())
  File "C:\Python27\lib\site-packages\symfit\core\fit.py", line 340, in execute
    pcov = cov_x * s_sq
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

This seems to happen when a model with a lot of parameters is used, in which case the fitting fails to find cov_x and instead returns None.

This is a fatal error that needs to be fixed asap.

ConstrainedNumericalLeastSquares

Should be easy to make: inherit from Minimize and redefine error_func and eval_jacobian to be those for a least squares optimization. This will then immediately allow for constraints.

More pressing is finding a good dataset to write tests with.

And the name for this object is a bit long, maybe something shorter.

Fit can then be adapted to accept constraints.

Docs: add glossary

I would love for the docs to have a glossary of commonly used (statistical) terms and their precise definition in symfit.

When working with a certain fitting framework commonly used in particle physics, I noticed that they use the term error both when fitting and when making error-bar plots. However, in one context this is taken to mean the standard deviation, in the other the standard error.

Such ambiguities should be prevented, and a glossary might go a long way towards preventing them.
