Hi @bgriffen, thank you for your interest and well-described questions! I will try to answer them below.
- If I remember correctly, the algorithm "forgets" previous runs. Since it uses least-squares modelling, there is a risk that the model will be biased towards the region of data space where there are more observations, i.e. in the direction from which previous iterations were run. The purpose of statistical experimental designs is to eliminate bias and provide good support for the desired type of model.
- The algorithm returns the best experiment that has actually been run. The update between iterations is done based on the "interpolated" predictions. The reason is that the curvature of the response surface probably does not perfectly match the model surface, meaning that we return the run that we have confirmed is the best.
- Your model is a simple linear model that approximates the Gaussian surface very poorly, since the surface is very curved. The `fullfactorial2levels` design does not support quadratic models, so my first choice would be to try a response surface design that allows for quadratic models (e.g. `ccc`). Hopefully that will improve the convergence of the algorithm in this toy example as well.
- Hopefully this problem is solved by my suggestion in 3. We have used this algorithm in-house for many different real-world applications for a number of years, and find that it converges well.
- To model the response with OLS and find a predicted optimum, `predict_optimum` is used. The limits of the design are updated based on the predicted optimum in `update_factors_from_optimum`. `_new_optimization_design` outputs a design based on the current state of the factors.
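To illustrate the first point, here is a tiny sketch (plain Python, made-up numbers) of how observations clustered on one side of the region bias a least-squares line, compared to a symmetric design over the same region:

```python
def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

true = lambda x: x * x  # a curved "response surface"

# observations piled up on one side (like runs inherited from earlier iterations)
xs_biased = [2.0, 2.2, 2.5, 2.8, 3.0]
a1, b1 = fit_line(xs_biased, [true(x) for x in xs_biased])

# a symmetric design over the region of interest
xs_design = [-3.0, -1.5, 0.0, 1.5, 3.0]
a2, b2 = fit_line(xs_design, [true(x) for x in xs_design])

# prediction error at the far edge of the region
err_biased = abs((a1 * -2 + b1) - true(-2))
err_design = abs((a2 * -2 + b2) - true(-2))
print(err_biased > err_design)  # True: the one-sided fit extrapolates badly
```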
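To sketch that last step with hypothetical numbers (the idea only, not doepipeline's actual code): fit a quadratic to the measured responses by OLS and take its stationary point as the predicted optimum, around which the factor limits are then re-centred:

```python
def predicted_optimum(y_lo, y_mid, y_hi):
    """Stationary point of the quadratic through (-1, y_lo), (0, y_mid), (1, y_hi)."""
    b1 = (y_hi - y_lo) / 2               # linear coefficient
    b2 = (y_lo - 2 * y_mid + y_hi) / 2   # quadratic coefficient
    return -b1 / (2 * b2)                # dy/dx = 0  ->  x* = -b1 / (2*b2)

# hypothetical response y = 5 - (x - 0.3)**2, measured at coded levels -1, 0, +1
f = lambda x: 5 - (x - 0.3) ** 2
x_star = predicted_optimum(f(-1), f(0), f(1))
print(round(x_star, 6))  # 0.3 -- the design limits would then be re-centred here
```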
I hope this answers your questions.
from doepipeline.
Hi Richard -- thank you for the clarification. My understanding improves each day. Just a few additional comments/questions below if it's OK.
> The algorithm returns the best experiment that has actually been run. The update between iterations is done based on the "interpolated" predictions. The reason is that the curvature of the response surface probably does not perfectly match the model surface, meaning that we return the run that we have confirmed is the best.
Indeed, I did think I was overstepping the mark with my multivariate normal distribution. Though, with a sufficiently large variance, it should be easier to model with a `fullfactorial2levels` design. Indeed, the predicted surface optimum is 0.01562, which is rather close to the actual optimum of 0.01592. The predicted values of A and B, however, are quite far off: A ~ 71 (actual = 60) and B ~ 90.5 (actual = 75). Then again, ff2level might not cut it, even for large variances in the distribution (updated below).
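As a sanity check on that number: if the toy surface is an uncorrelated bivariate normal density (peak at the mean, here A = 60, B = 75) with σ_A·σ_B = 10, its maximum works out to 1/(2π·10), which reproduces the "actual optimum" above:

```python
import math

# peak of an uncorrelated bivariate normal pdf, attained at the mean:
# f_max = 1 / (2 * pi * sigma_a * sigma_b)
def peak_height(sigma_a, sigma_b):
    return 1.0 / (2 * math.pi * sigma_a * sigma_b)

# any sigma pair with product 10 gives the same peak value
print(round(peak_height(5, 2), 5))  # 0.01592
```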
> Your model is a simple linear model that approximates the Gaussian surface very poorly, since the surface is very curved. The `fullfactorial2levels` design does not support quadratic models, so my first choice would be to try a response surface design that allows for quadratic models (e.g. `ccc`). Hopefully that will improve the convergence of the algorithm in this toy example as well.
OK, I'll experiment with the other designs. The trouble I face is choosing a design that can be implemented physically versus one that suits the response surface function (which is unknown a priori). I need to actually perform these experiments physically, not in silico, so a two-level or three-level full factorial is easier to arrange, but it may not be the best choice given that `ccc` allows quadratic responses to be well described.
On that note: if I have historic data that doesn't fit the current design, e.g. a sparse sampling, can that data still be used to initialize the search? Or do I have to set up the whole pipeline "clean" and carry out the experimental designs as generated by `fullfactorial2levels`, attaching their responses once measured? Additionally, say I perform the `fullfactorial2levels` runs; can I then change the design mid-way through the optimization to get a better functional fit/prediction of the optima?
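Incidentally, a quick way I convinced myself why a two-level factorial can't support quadratic terms: at coded levels ±1 the squared columns of the model matrix collapse into the intercept, while a central composite design keeps them estimable (face-centred CCD below for simplicity; `ccc` is the circumscribed variant):

```python
import numpy as np

def model_matrix(points):
    # columns: intercept, A, B, A^2, B^2, A*B (full quadratic model in 2 factors)
    return np.array([[1, a, b, a * a, b * b, a * b] for a, b in points], float)

# two-level full factorial: corners of the square only
ff2 = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
# central composite design: corners + axial points + centre point
ccd = ff2 + [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]

print(np.linalg.matrix_rank(model_matrix(ff2)))  # 4 -> quadratic terms aliased
print(np.linalg.matrix_rank(model_matrix(ccd)))  # 6 -> full quadratic estimable
```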
> To model the response with OLS and find a predicted optimum, `predict_optimum` is used. The limits of the design are updated based on the predicted optimum in `update_factors_from_optimum`. `_new_optimization_design` outputs a design based on the current state of the factors.
I guess what I asked poorly was whether I need to run a `predict_optimum` step first in order for the OLS model to be used to generate new experiments.
Just to double check -- is my order of doing things correct? I added comments to the lines in question below.
```python
exp = ExperimentDesigner(factors,
                         'ccc',
                         responses,
                         model_selection='greedy',
                         skip_screening=True,
                         shrinkage=0.9)
df = exp.new_design()

# single iteration e.g.
for niters in range(number_of_iterations):
    r_0 = generate_response(df, rv)
    fractioni = pd.DataFrame.from_dict({"fraction": r_0})
    # get the best experiment of the run
    bi = exp.get_best_experiment(df, fractioni)
    # it is strange that it is named "from_optimum", but the function
    # receives the result from exp.get_best_experiment?
    fi = exp.update_factors_from_optimum(bi)
    if fi[1]:  # if converged, stop generating new experiments
        break
    # slight modification of what is returned, originally set in designer.py
    dfoptimal, model, prediction = exp.get_optimal_settings(fractioni)
    # generate a new design based on responses and best-guess optima
    df = exp.new_design()
```
I'm just struggling to get the order of `get_best_experiment`, `get_optimal_settings`, `update_factors_from_optimum`, and `new_design` correct, given the internal variables that are updated. I'm just trying to get the logic of the functions in the correct order, given a response measured in each loop (though synthetically generated here).
Lastly, if a factor's weight in the prediction is ~0, is there a way to remove it when `new_design()` is run?
Thank you for your help. Much appreciated.