tirthajyoti / doepy

Design of Experiment Generator. Read the docs at: https://doepy.readthedocs.io/en/latest/

License: MIT License

Python 100.00%

design design-of-experiments statistics engineering science research phsyics python doe random-design factorial-experiment

doepy's Introduction

Welcome to DOEPY

Design of Experiments Generator in Python (`pip install doepy`)

Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, California.

Check my website for details about my other projects and data science/ML articles.

Introduction

Design of Experiments (DOE) is an important activity for any scientist, engineer, or statistician planning to conduct experimental analysis. This exercise has become critical in this age of the rapidly expanding field of data science and its associated statistical modeling and machine learning. A well-planned DOE can give a researcher a meaningful data set to act upon with the optimal number of experiments, thus preserving critical resources.

After all, the aim of Data Science is essentially to conduct the highest quality scientific investigation and modeling with real world data. And to do good science with data, one needs to collect it through carefully thought-out experiments to cover all corner cases and reduce any possible bias.

How to use it?

What supporting packages are required?

First, make sure you have all the necessary packages installed. You can run the .bash (Unix/Linux) or .bat (Windows) file provided in the repo to install those packages from your command line interface. They contain the following commands:

pip install numpy
pip install pandas
pip install pydoe
pip install diversipy

How to install the package?

(On Linux and Windows) You can use pip to install doepy:

pip install doepy

(On Mac OS) First install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.

Github

The package is hosted at this Github repo.

Quick start

Let's say you have a design problem with the following table for the parameters range. Imagine this as a generic example of a chemical process in a manufacturing plant. You have 3 levels of Pressure, 3 levels of Temperature, 2 levels of FlowRate, and 2 levels of Time.

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	320	0.3	8
70	350	-	-

First, import the build module from the package:

from doepy import build

Then, try a simple example by building a full factorial design. We will use the build.full_fact() function for this. You have to pass the function a dictionary object that encodes your experimental data.

build.full_fact(
{'Pressure':[40,55,70],
'Temperature':[290, 320, 350],
'Flow rate':[0.2,0.4], 
'Time':[5,8]}
)

If you build a full-factorial DOE out of this, you should get a table with 3x3x2x2 = 36 entries.

Pressure	Temperature	FlowRate	Time
40	290	0.2	5
50	290	0.2	5
70	290	0.2	5
40	320	0.2	5
...	...	...	...
...	...	...	...
40	290	0.3	8
50	290	0.3	8
70	290	0.3	8
40	320	0.3	8
...	...	...	...
...	...	...	...
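The row count is simply the product of the number of levels per factor. As a rough illustration in plain Python (a sketch only, not doepy's implementation; the helper name full_factorial is hypothetical), the same enumeration can be written with itertools.product:

```python
from itertools import product

def full_factorial(levels):
    """Enumerate every combination of factor levels (hypothetical helper, not doepy's API)."""
    names = list(levels)
    rows = []
    # product() cycles its last argument fastest, so iterate the factors in
    # reverse to make the first factor cycle fastest, as in the table above.
    for combo in product(*(levels[name] for name in reversed(names))):
        rows.append(dict(zip(names, reversed(combo))))
    return rows

design = full_factorial({
    'Pressure': [40, 55, 70],
    'Temperature': [290, 320, 350],
    'Flow rate': [0.2, 0.4],
    'Time': [5, 8],
})
print(len(design))   # 3 * 3 * 2 * 2 = 36
print(design[0])
```

Each row is a plain dictionary here; doepy itself returns a Pandas DataFrame.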

There are, of course, half-factorial designs to try!

Latin Hypercube design

Sometimes, a set of randomized design points within a given range could be attractive for the experimenter to assess the impact of the process variables on the output. Monte Carlo simulations are a close example of this approach.

However, a Latin Hypercube design is a better choice for experimental design than building a completely random matrix, as it subdivides the sample space into smaller cells and chooses only one element out of each subcell. This way, a more uniform spread of the random sample points can be obtained.

The user can choose the density of sample points. For example, if we choose to generate a Latin Hypercube of 12 experiments from the same kind of input, that could look like,

build.space_filling_lhs(
{'Pressure':[40,55,70],
'Temperature':[290, 320, 350],
'Flow rate':[0.2,0.4], 
'Time':[5,11]},
num_samples = 12
)

Pressure	Temperature	FlowRate	Time
63.16	313.32	0.37	10.52
61.16	343.88	0.23	5.04
57.83	327.46	0.35	9.47
68.61	309.81	0.35	8.39
66.01	301.29	0.22	6.34
45.76	347.97	0.27	6.94
40.48	320.72	0.29	9.68
51.46	293.35	0.20	7.11
43.63	334.92	0.30	7.66
47.87	339.68	0.26	8.59
55.28	317.68	0.39	5.61
53.99	297.07	0.32	10.43

Of course, there is no guarantee that you will get the same matrix if you run this function, because these points are randomly sampled, but you get the idea!
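To make the "one point per cell" idea concrete, here is a minimal sketch of a simple (not space-filling) Latin hypercube in plain Python. It is not how doepy or its underlying packages implement it; the function name simple_lhs and the (min, max) tuple input are assumptions made for illustration.

```python
import random

def simple_lhs(ranges, num_samples, seed=None):
    """Basic Latin hypercube sketch: one sample per equal-width stratum per factor."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        # Draw one uniform point inside each of num_samples strata of [0, 1).
        unit = [(i + rng.random()) / num_samples for i in range(num_samples)]
        # Shuffle so the strata of different factors are paired at random.
        rng.shuffle(unit)
        columns[name] = [low + u * (high - low) for u in unit]
    return [{name: columns[name][i] for name in ranges} for i in range(num_samples)]

samples = simple_lhs({'Pressure': (40, 70), 'Time': (5, 11)}, num_samples=12, seed=0)
for row in samples[:3]:
    print(row)
```

Passing a seed makes the draw repeatable, which is one way around the run-to-run variation mentioned above.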

Other functions to try

Try any one of the following designs,

Full factorial: build.full_fact()
2-level fractional factorial: build.frac_fact_res()
Plackett-Burman: build.plackett_burman()
Sukharev grid: build.sukharev()
Box-Behnken: build.box_behnken()
Box-Wilson (Central-composite) with center-faced option: build.central_composite() with face='ccf' option
Box-Wilson (Central-composite) with center-inscribed option: build.central_composite() with face='cci' option
Box-Wilson (Central-composite) with center-circumscribed option: build.central_composite() with face='ccc' option
Latin hypercube (simple): build.lhs()
Latin hypercube (space-filling): build.space_filling_lhs()
Random k-means cluster: build.random_k_means()
Maximin reconstruction: build.maximin()
Halton sequence based: build.halton()
Uniform random matrix: build.uniform_random()

Read from and write to CSV files

Internally, you pass on a dictionary object and get back a Pandas DataFrame. But, for reading from and writing to CSV files, you have to use the read_write module of the package.

from doepy import read_write
data_in=read_write.read_variables_csv('../Data/params.csv')

Then you can use this data_in object in the DOE generating functions.

For writing back to a CSV,

df_lhs=build.space_filling_lhs(data_in,num_samples=100)
filename = 'lhs'
read_write.write_csv(df_lhs,filename=filename)

You should see an lhs.csv file in your directory.
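For readers who want to see what such a round trip involves, the following standard-library sketch reads a table of levels into a dictionary and writes design rows back out. The CSV layout (one column per variable, blank or "-" cells for missing levels) and the helper names read_levels and write_rows are assumptions for illustration; they are not the read_write module's actual code.

```python
import csv
import io

def read_levels(csv_file):
    """Read a header row of variable names and columns of numeric levels."""
    reader = csv.DictReader(csv_file)
    levels = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            # Skip empty cells so factors may have different numbers of levels.
            if cell not in (None, '', '-'):
                levels[name].append(float(cell))
    return levels

def write_rows(rows, csv_file):
    """Write a list of row dictionaries as CSV with a header line."""
    writer = csv.DictWriter(csv_file, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-in for a parameters file.
source = io.StringIO("Pressure,Temperature,FlowRate,Time\n40,290,0.2,5\n50,320,0.3,8\n70,350,,\n")
levels = read_levels(source)
print(levels)

out = io.StringIO()
write_rows([{'Pressure': 40.0, 'Time': 5.0}], out)
print(out.getvalue())
```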

A simple pipeline for building a DOE table

Combining the build functions and the read_write module, one can devise a simple pipeline to build a DOE from a CSV file input.

Suppose you have a file in your directory called ranges.csv that contains min/max values of an arbitrary number of parameters. Just two lines of code will generate a space-filling Latin hypercube design, based on this file, with 100 randomized samples spanning over the min/max ranges and save it to a file called DOE_table.csv.

from doepy import build, read_write

read_write.write_csv(
build.space_filling_lhs(read_write.read_variables_csv('ranges.csv'),
num_samples=100),
filename='DOE_table.csv'
)

Features

At its heart, doepy is just a collection of functions, which wrap around the core packages (mentioned below) and generate design-of-experiment (DOE) matrices for a statistician or engineer from an arbitrary range of input variables.

Limitation of the foundation packages used

Neither of the core packages that act as foundations to this repo is complete, in the sense that they do not cover all the functions a design engineer may need to generate a DOE table while planning an experiment. Also, they offer only low-level APIs, in the sense that their standard output is normalized numpy arrays. It was felt that users who may not be comfortable dealing with Python objects directly should be able to take advantage of their functionality through a simplified user interface.

Simplified user interface

There are other DOE generators out there, but they generate n-dimensional arrays. doepy is built on the simple theme of being intuitive and easy to work with - for researchers, engineers, and social scientists of all backgrounds - not just the ones who can code.

The user just needs to provide a simple CSV file with a single table of variables and their ranges (2-level, i.e. min/max, or 3-level).

Some of the functions work with 2-level min/max range while some others need 3-level ranges from the user (low-mid-high). Intelligence is built into the code to handle the case if the range input is not appropriate and to generate levels by simple linear interpolation from the given input.
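As a sketch of what "generating levels by simple linear interpolation" can mean, the hypothetical helper below expands a min/max pair into evenly spaced levels, or collapses a longer list to its end points. It illustrates the idea only and is not taken from doepy's source.

```python
def interpolate_levels(levels, count):
    """Return `count` evenly spaced levels between min(levels) and max(levels)."""
    if count < 2:
        raise ValueError("count must be at least 2")
    low, high = min(levels), max(levels)
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]

print(interpolate_levels([5, 8], 3))        # [5.0, 6.5, 8.0]
print(interpolate_levels([40, 55, 70], 2))  # [40.0, 70.0]
```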

The code will generate the DOE as per the user's choice and write the matrix to a CSV file on disk.

In this way, the only API the user needs to be exposed to is the input and output CSV files. These files can then be used in any engineering simulator, software, or process-control module, or fed into process equipment.

Pandas DataFrame support

Under the hood, doepy generates Numpy arrays and converts them to Pandas DataFrames. Therefore, it is simple to get those Numpy arrays or DataFrames programmatically to do more, if the user so wishes.

Coming in a future release - support for more types of files

Support for more input/output types will come in future releases - MS Excel, JSON, etc.

Designs available

Full factorial,
2-level fractional factorial,
Plackett-Burman,
Sukharev grid,
Box-Behnken,
Box-Wilson (Central-composite) with center-faced option,
Box-Wilson (Central-composite) with center-inscribed option,
Box-Wilson (Central-composite) with center-circumscribed option,
Latin hypercube (simple),
Latin hypercube (space-filling),
Random k-means cluster,
Maximin reconstruction,
Halton sequence based,
Uniform random matrix

About Design of Experiment

What is a scientific experiment?

In its simplest form, a scientific experiment aims to predict the outcome by introducing a change of the preconditions, which is represented by one or more independent variables, also referred to as “input variables” or “predictor variables.” The change in one or more independent variables is generally hypothesized to result in a change in one or more dependent variables, also referred to as “output variables” or “response variables.” The experimental design may also identify control variables that must be held constant to prevent external factors from affecting the results.

What is Experimental Design?

Experimental design involves not only the selection of suitable independent, dependent, and control variables, but planning the delivery of the experiment under statistically optimal conditions given the constraints of available resources. There are multiple approaches for determining the set of design points (unique combinations of the settings of the independent variables) to be used in the experiment.

Main concerns in experimental design include the establishment of validity, reliability, and replicability. For example, these concerns can be partially addressed by carefully choosing the independent variable, reducing the risk of measurement error, and ensuring that the documentation of the method is sufficiently detailed. Related concerns include achieving appropriate levels of statistical power and sensitivity.

The need for careful design of experiment arises in all fields of serious scientific, technological, and even social science investigation — computer science, physics, geology, political science, electrical engineering, psychology, business marketing analysis, financial analytics, etc…

Options for open-source DOE builder package in Python?

Unfortunately, the majority of the state-of-the-art DOE generators are part of commercial statistical software packages like JMP (SAS) or Minitab. However, a researcher will surely benefit if there is open-source code that presents an intuitive user interface for generating an experimental design plan from a simple list of input variables. There are a couple of DOE builder Python packages but individually they don’t cover all the necessary DOE methods and they lack a simplified user API, where one can just input a CSV file of input variables’ range and get back the DOE matrix in another CSV file.

Acknowledgements and Requirements

The code was written in Python 3.7. It uses the following external packages, which need to be installed on your system:

pydoe: A package designed to help the scientist, engineer, statistician, etc., to construct appropriate experimental designs. Check the docs here.
diversipy: A collection of algorithms for sampling in hypercubes, selecting diverse subsets, and measuring diversity. Check the docs here.
numpy
pandas

doepy's Issues

Float precision

In doe_function.construct_df, the pandas dataframe is forced to be of dtype=float32. This leads to a non-negligible loss of accuracy. Double-precision floats are now common in all applications.

Is there any reason for this dtype specification?

If yes, I would propose to make it an option
If no, I would propose to remove it and let pandas handle the dtype from the given array.

Thanks.
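The size of that loss can be checked without pandas. The sketch below round-trips a Python float (double precision) through a 32-bit float using the standard struct module; the helper name to_float32 is made up for this illustration.

```python
import struct

def to_float32(value):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack('f', struct.pack('f', value))[0]

print(to_float32(0.1))          # 0.10000000149011612
print(to_float32(0.1) == 0.1)   # False: 0.1 is not exactly representable in float32
print(to_float32(0.5) == 0.5)   # True: 0.5 is a power of two and survives unchanged
```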

Analyzing doe?

Hello,
Once doe is generated and results acquired, how to analyze results?
Do you have plans to integrate corresponding statistical analysis tools (such as ANOVA, regression etc.) into the doepy or can you advise which respective python libraries to use for results analysis?

Fractional Factorial design changes level values

Using the code below I would expect a design with 8 experiments whereby the min and max levels are used for each attribute. That works and I get a design that makes sense except for one item: the levels for G2 are changed from 0.2 and 0.4 into 0 and 1. This behavior does not change when I add a middle level, if I change the order of the attributes in the design space, or if I change the name of attribute G2. It does work however, when I change the values to 2 and 4. It seems that when one of the levels is below a value of 1, that the levels are changed to 0 and 1.

My code:

from doepy import build

# Define the design space
design_space = {'P_CG_substance':['P','CG'],
'P_CG_level':[1,2,3],
'AF':[1, 1.5, 2],
'MX':[1.25, 1.5, 2],
'G2':[0.2, 0.4],
}

print(design_space)

# Build the design
design = build.frac_fact_res(design_space)

# In the design for column P_CG_substance, replace 0 with P and 1 with CG
design['P_CG_substance'] = design['P_CG_substance'].replace({0:'P', 1:'CG'})

# Print the design
print(design)

# Print the number of experiments
print(f'number of experiments is {len(design)}')

Expected result:
P_CG_substance P_CG_level AF MX G2
0 P 1.0 1.0 2.00 0.4
1 CG 1.0 1.0 1.25 0.2
2 P 3.0 1.0 1.25 0.4
3 CG 3.0 1.0 2.00 0.2
4 P 1.0 2.0 2.00 0.2
5 CG 1.0 2.0 1.25 0.4
6 P 3.0 2.0 1.25 0.2
7 CG 3.0 2.0 2.00 0.4

What I get:
P_CG_substance P_CG_level AF MX G2
0 P 1.0 1.0 2.00 1.0
1 CG 1.0 1.0 1.25 0.0
2 P 3.0 1.0 1.25 1.0
3 CG 3.0 1.0 2.00 0.0
4 P 1.0 2.0 2.00 0.0
5 CG 1.0 2.0 1.25 1.0
6 P 3.0 2.0 1.25 0.0
7 CG 3.0 2.0 2.00 1.0

Constraints

Hi there, is it possible to introduce constraints such as A + B + C = 1?
Thank you!
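Nothing in the README above describes constraint support. One generic approach that is independent of doepy is to enumerate a full grid and keep only the rows that satisfy the constraint. The sketch below does this in plain Python; constrained_grid is a hypothetical helper, and the level values are chosen as multiples of 0.25 so the sums are exact in floating point.

```python
import math
from itertools import product

def constrained_grid(levels, constraint):
    """Enumerate all level combinations and keep those passing `constraint`."""
    names = list(levels)
    rows = (dict(zip(names, combo)) for combo in product(*levels.values()))
    return [row for row in rows if constraint(row)]

fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
mixture = constrained_grid(
    {'A': fractions, 'B': fractions, 'C': fractions},
    lambda row: math.isclose(row['A'] + row['B'] + row['C'], 1.0),
)
print(len(mixture))   # 15 of the 125 grid points satisfy A + B + C = 1
```

Filtering a grid this way does not preserve the balance properties of a designed experiment, so it is a starting point rather than a substitute for a dedicated mixture design.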

Full Factorial not accepting/parsing floats value correctly

As shown in the Flow rate column, 'Flow rate':[0.2,0.4] was label-encoded into 0.0 and 1.0. This differs from what is shown in the documentation.

build.full_fact({'Pressure':[40,55,70],'Temperature':[290, 320, 350],'Flow rate':[0.2,0.4],'Time':[5,8]})
    Pressure  Temperature  Flow rate  Time
0       40.0        290.0        0.0   5.0
1       55.0        290.0        0.0   5.0
2       70.0        290.0        0.0   5.0
3       40.0        320.0        0.0   5.0
4       55.0        320.0        0.0   5.0
5       70.0        320.0        0.0   5.0
6       40.0        350.0        0.0   5.0
7       55.0        350.0        0.0   5.0
8       70.0        350.0        0.0   5.0
9       40.0        290.0        1.0   5.0
10      55.0        290.0        1.0   5.0
22      55.0        320.0        0.0   8.0
23      70.0        320.0        0.0   8.0
24      40.0        350.0        0.0   8.0
25      55.0        350.0        0.0   8.0
26      70.0        350.0        0.0   8.0
27      40.0        290.0        1.0   8.0
28      55.0        290.0        1.0   8.0
29      70.0        290.0        1.0   8.0
30      40.0        320.0        1.0   8.0
31      55.0        320.0        1.0   8.0
32      70.0        320.0        1.0   8.0
33      40.0        350.0        1.0   8.0
34      55.0        350.0        1.0   8.0
35      70.0        350.0        1.0   8.0

What was shown in Documentation:

prob_distribution

Hello, argument prob_distribution is not used for LHS model, in build_lhs function?

DOE Full Factorial

The Full Factorial DOE algorithm is repeating some experiments.
Have you faced this issue?

replicates?

Hello,
How to handle replication - is the doe designer capable of designing replicated experiments?
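The README does not list a replication option. Independently of doepy, replication can be added after a design is generated by repeating every run and randomizing the run order. The following is a minimal sketch with a hypothetical replicate helper operating on a list of row dictionaries.

```python
import random

def replicate(rows, replicates, seed=None):
    """Repeat each design row `replicates` times and shuffle the run order."""
    rng = random.Random(seed)
    # Tag every copy with its replicate number; the input rows are not modified.
    runs = [dict(row, replicate=r + 1) for r in range(replicates) for row in rows]
    rng.shuffle(runs)
    return runs

base = [{'Pressure': 40, 'Time': 5}, {'Pressure': 70, 'Time': 8}]
runs = replicate(base, replicates=3, seed=1)
print(len(runs))   # 2 design points * 3 replicates = 6 runs
```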

Critical issue: full_fact fails after recent update of pandas to 1.5.0

This code from your README file worked until pandas was updated to 1.5.0.

from doepy import build
df = build.full_fact(
{
    'Pressure':[40,55,70],
    'Temperature':[290, 320, 350],
    'Flow rate':[0.2,0.4], 
    'Time':[5,8]}
)
print(df)

It now gives the following. Note how Flow rate returns the values 0 and 1 instead of 0.2 and 0.4.

    Pressure  Temperature  Flow rate  Time
0       40.0        290.0        0.0   5.0
1       55.0        290.0        0.0   5.0
...
7       55.0        350.0        0.0   5.0
8       70.0        350.0        0.0   5.0
9       40.0        290.0        1.0   5.0
10      55.0        290.0        1.0   5.0
...
16      55.0        350.0        1.0   5.0
17      70.0        350.0        1.0   5.0

I did a bit of digging around and the issue seems to depend on the factor values. Once you switch a factor to float, the value is no longer returned; just 0, 1, 2, and so on:

from doepy import build
df = build.full_fact(
{
    'Pressure':[40.1,55,70],
    'Flow rate':[5, 8.123], 
})
print(df)

   Pressure  Flow rate
0       0.0        5.0
1      55.0        5.0
2      70.0        5.0
3       0.0        1.0
4      55.0        1.0
5      70.0        1.0

Python version used was 3.9
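Until the behavior is fixed, one possible workaround (an untested assumption, not something confirmed by the maintainer) is to build the design on integer level indices 0, 1, 2, ... and map them back to the real levels afterwards. An index and an integer level coincide in that case, so it should not matter whether the builder returns values or positions. The helpers index_space and decode_levels below are hypothetical and operate on plain dictionaries.

```python
def index_space(design_space):
    """Replace each factor's levels with integer indices 0..k-1."""
    return {name: list(range(len(levels))) for name, levels in design_space.items()}

def decode_levels(index_rows, design_space):
    """Map level indices in each row back to the original level values."""
    return [
        {name: design_space[name][int(index)] for name, index in row.items()}
        for row in index_rows
    ]

design_space = {'Pressure': [40.1, 55, 70], 'Flow rate': [5, 8.123]}
print(index_space(design_space))   # {'Pressure': [0, 1, 2], 'Flow rate': [0, 1]}

# index_rows stands in for the rows of a design built from index_space(design_space).
index_rows = [{'Pressure': 0.0, 'Flow rate': 1.0}, {'Pressure': 2.0, 'Flow rate': 0.0}]
print(decode_levels(index_rows, design_space))
```

The same mapping also works for string-valued levels, since the levels are only looked up, never converted.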

Negative floats return incorrect table

When generating a Latin hypercube sampling (simple or space filled), input factors with negative value levels return a table with incorrect values.

For example,

build.lhs(
{'a':[-1,-5],
'b':[-3,-6],
'c':[1,2]})

returns a table like (numbers truncated for display)

   a        b      c
 0.5     -1.9     1.0
-0.76    -0.78    1.95
-0.04    -2.14    1.43

where the values for a and b are clearly out of bounds for the provided levels, whereas c is correct.

Python 3.8.13
doepy version 0.0.1 installed from pip
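The cause inside doepy is not identified here. For reference, this is what a scaling step that stays in bounds looks like when it uses the minimum and maximum of the levels rather than their order or sign; scale_to_range is a hypothetical helper, not doepy's code.

```python
def scale_to_range(unit_values, levels):
    """Map samples from [0, 1] onto [min(levels), max(levels)]."""
    low, high = min(levels), max(levels)
    return [low + u * (high - low) for u in unit_values]

print(scale_to_range([0.0, 0.5, 1.0], [-1, -5]))   # [-5.0, -3.0, -1.0]
print(scale_to_range([0.0, 0.5, 1.0], [1, 2]))     # [1.0, 1.5, 2.0]
```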

Parsing factor levels for two leveled designs

Thank you for this great work.
I think that the code snippet that is used to trim factor_level_ranges in each two leveled design should search for the min, max value of the array and avoid duplicate values.
something like:
for key in factor_level_ranges:
    if len(factor_level_ranges[key]) != 2:
        # Take min and max before overwriting any entry
        low = min(factor_level_ranges[key])
        high = max(factor_level_ranges[key])
        factor_level_ranges[key] = [low, high]
        print(
            f"{key} had more than two levels. Assigning the end point to the high level."
        )
    if factor_level_ranges[key][0] == factor_level_ranges[key][1]:
        rep_value = factor_level_ranges[key][0]
        raise ValueError(f"duplicate value '{rep_value}' found in key '{key}'")
Furthermore, this code should be included in a function or a decorator since you are reusing it a lot in your code.
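One way to factor this out is a small standalone function that each two-level builder could call first. The sketch below is an illustration of that suggestion, not code from the doepy repository; the name trim_to_two_levels is hypothetical. Unlike the snippet above, it returns a new dictionary instead of modifying its input, and it also orders two-level inputs as [min, max].

```python
def trim_to_two_levels(factor_level_ranges):
    """Return a copy in which every factor is reduced to [min, max]."""
    trimmed = {}
    for key, levels in factor_level_ranges.items():
        low, high = min(levels), max(levels)
        if low == high:
            raise ValueError(f"duplicate value '{low}' found in key '{key}'")
        trimmed[key] = [low, high]
    return trimmed

print(trim_to_two_levels({'AF': [1, 1.5, 2], 'G2': [0.4, 0.2]}))
# {'AF': [1, 2], 'G2': [0.2, 0.4]}
```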

Support for parameters with string values

Is there a way to prevent string values from being converted to floats? E.g. if I define string_parameter=['a', 'b', 'c']

from doepy import build

build.full_fact({
    'int_param':[1, 2, 3],
    'float_param': [0.1, 0.2, 0.3],
    'string_param': ['a', 'b', 'c'],
})

the values for string_param are implicitly converted to 0.0, 1.0, 2.0 respectively. Instead I'd like to get the plain strings 'a', 'b', 'c'.

Fix a couple of typos in the readme.md

There's a couple of typos I can fix ("supporitng" etc.).
