This package provides a set of tools for designing surveys and conducting power analyses for choice-based conjoint survey experiments in R. Each function in the package begins with cbc_ and supports a step in the overall process of designing and analyzing a conjoint survey.
The current version is not yet on CRAN, but you can install it from GitHub using the {remotes} package:
# install.packages("remotes")
remotes::install_github("jhelvy/cbcTools")
Load the library with:
library(cbcTools)
The first step in designing an experiment is to define the attributes and levels for your experiment and then generate all of the profiles of each possible combination of those attributes and levels. For example, let's say you're designing a conjoint experiment about apples and you want to include price, type, and freshness as attributes. You can obtain all of the possible profiles for these attributes using the cbc_profiles() function:
profiles <- cbc_profiles(
price = seq(1, 4, 0.5), # $ per pound
type = c('Fuji', 'Gala', 'Honeycrisp'),
freshness = c('Poor', 'Average', 'Excellent')
)
nrow(profiles)
#> [1] 63
head(profiles)
#> profileID price type freshness
#> 1 1 1.0 Fuji Poor
#> 2 2 1.5 Fuji Poor
#> 3 3 2.0 Fuji Poor
#> 4 4 2.5 Fuji Poor
#> 5 5 3.0 Fuji Poor
#> 6 6 3.5 Fuji Poor
tail(profiles)
#> profileID price type freshness
#> 58 58 1.5 Honeycrisp Excellent
#> 59 59 2.0 Honeycrisp Excellent
#> 60 60 2.5 Honeycrisp Excellent
#> 61 61 3.0 Honeycrisp Excellent
#> 62 62 3.5 Honeycrisp Excellent
#> 63 63 4.0 Honeycrisp Excellent
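As a quick sanity check, the 63 profiles are simply the full factorial of the attribute levels:

```r
# 7 price levels x 3 types x 3 freshness levels = 63 possible profiles
length(seq(1, 4, 0.5)) * 3 * 3
#> [1] 63
```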
Depending on the context of your survey, you may wish to eliminate or modify some profiles before designing your conjoint survey (e.g., some profile combinations may be illogical or unrealistic). WARNING: including hard constraints in your designs can substantially reduce the statistical power of your design, so use them cautiously and avoid them if possible.
If you do wish to set some levels conditional on those of other attributes, you can do so by setting each level of an attribute to a list that defines these constraints. In the example below, the type attribute has constraints such that only certain price levels will be shown for each level. In addition, for the "Honeycrisp" level, only two of the three freshness levels are included: "Excellent" and "Average". Note that the other attributes (price and freshness) should still contain all of the possible levels. With these constraints applied, there are only 30 profiles compared to 63 without constraints:
profiles <- cbc_profiles(
price = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5),
freshness = c('Poor', 'Average', 'Excellent'),
type = list(
"Fuji" = list(
price = c(2, 2.5, 3)
),
"Gala" = list(
price = c(1, 1.5, 2)
),
"Honeycrisp" = list(
price = c(2.5, 3, 3.5, 4, 4.5, 5),
freshness = c("Average", "Excellent")
)
)
)
nrow(profiles)
#> [1] 30
head(profiles)
#> profileID price freshness type
#> 1 1 2.0 Poor Fuji
#> 2 2 2.5 Poor Fuji
#> 3 3 3.0 Poor Fuji
#> 4 4 2.0 Average Fuji
#> 5 5 2.5 Average Fuji
#> 6 6 3.0 Average Fuji
tail(profiles)
#> profileID price freshness type
#> 25 25 2.5 Excellent Honeycrisp
#> 26 26 3.0 Excellent Honeycrisp
#> 27 27 3.5 Excellent Honeycrisp
#> 28 28 4.0 Excellent Honeycrisp
#> 29 29 4.5 Excellent Honeycrisp
#> 30 30 5.0 Excellent Honeycrisp
Once a set of profiles is obtained, a randomized conjoint survey can then be generated using the cbc_design() function:
design <- cbc_design(
profiles = profiles,
n_resp = 900, # Number of respondents
n_alts = 3, # Number of alternatives per question
n_q = 6 # Number of questions per respondent
)
dim(design) # View dimensions
#> [1] 16200 8
head(design) # Preview first 6 rows
#> respID qID altID obsID profileID price type freshness
#> 1 1 1 1 1 8 1.0 Gala Poor
#> 2 1 1 2 1 53 2.5 Gala Excellent
#> 3 1 1 3 1 40 3.0 Honeycrisp Average
#> 4 1 2 1 2 20 3.5 Honeycrisp Poor
#> 5 1 2 2 2 10 2.0 Gala Poor
#> 6 1 2 3 2 24 2.0 Fuji Average
For now, the cbc_design() function only generates a randomized design. Other packages, such as the {idefix} package, are able to generate other types of designs, such as Bayesian D-efficient designs. The randomized design simply samples from the set of profiles. It also ensures that no two profiles are the same in any choice question.
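That no-repeat property is easy to check yourself. The sketch below is illustrative base R (not package internals), run on a toy stand-in for the design data frame:

```r
# Toy stand-in for a design data frame with two choice questions
design_toy <- data.frame(
  obsID     = c(1, 1, 1, 2, 2, 2),
  profileID = c(8, 53, 40, 20, 10, 24)
)
# TRUE for any question that shows the same profile twice
has_dupes <- tapply(design_toy$profileID, design_toy$obsID,
                    function(x) any(duplicated(x)))
any(has_dupes)
#> [1] FALSE
```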
The resulting design data frame includes the following columns:

- respID: Identifies each survey respondent.
- qID: Identifies the choice question answered by the respondent.
- altID: Identifies the alternative in any one choice observation.
- obsID: Identifies each unique choice observation across all respondents.
- profileID: Identifies the profile in profiles.
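A hedged sketch of how these ID columns appear to relate, inferred from the printed design above rather than from package code: obsID increments across respondents as (respID - 1) * n_q + qID.

```r
# Rebuild the ID structure for a tiny design:
# 2 respondents, 3 questions each, 2 alternatives per question
n_resp <- 2; n_q <- 3; n_alts <- 2
ids <- expand.grid(altID = 1:n_alts, qID = 1:n_q, respID = 1:n_resp)
ids$obsID <- (ids$respID - 1) * n_q + ids$qID
max(ids$obsID)  # one unique obsID per question per respondent
#> [1] 6
```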
You can also make a "labeled" design (also known as an "alternative-specific" design) where the levels of one attribute are used as a label by setting the label argument to that attribute. This by definition sets the number of alternatives in each question to the number of levels in the chosen attribute, so the n_alts argument is overridden. Here is an example labeled survey using the type attribute as the label:
design_labeled <- cbc_design(
profiles = profiles,
n_resp = 900, # Number of respondents
n_alts = 3, # Number of alternatives per question
n_q = 6, # Number of questions per respondent
label = "type" # Set the "type" attribute as the label
)
dim(design_labeled)
#> [1] 16200 8
head(design_labeled)
#> respID qID altID obsID profileID price type freshness
#> 1 1 1 1 1 47 3.0 Fuji Excellent
#> 2 1 1 2 1 8 1.0 Gala Poor
#> 3 1 1 3 1 17 2.0 Honeycrisp Poor
#> 4 1 2 1 2 43 1.0 Fuji Excellent
#> 5 1 2 2 2 50 1.0 Gala Excellent
#> 6 1 2 3 2 58 1.5 Honeycrisp Excellent
In the above example, you can see in the first six rows of the survey that the type attribute is always in the same fixed order, ensuring that each level of the type attribute is shown in each choice question.
You can include a "no choice" (also known as an "outside good") option in your survey by setting no_choice = TRUE. If included, all categorical attributes will be dummy-coded so that the "no choice" alternative can be represented appropriately.
design_nochoice <- cbc_design(
profiles = profiles,
n_resp = 900, # Number of respondents
n_alts = 3, # Number of alternatives per question
n_q = 6, # Number of questions per respondent
no_choice = TRUE
)
dim(design_nochoice)
#> [1] 21600 13
head(design_nochoice)
#> respID qID altID obsID profileID price type_Fuji type_Gala type_Honeycrisp
#> 1 1 1 1 1 49 4.0 1 0 0
#> 2 1 1 2 1 51 1.5 0 1 0
#> 3 1 1 3 1 36 1.0 0 0 1
#> 4 1 1 4 1 0 0.0 0 0 0
#> 5 1 2 1 2 46 2.5 1 0 0
#> 6 1 2 2 2 28 4.0 1 0 0
#> freshness_Poor freshness_Average freshness_Excellent no_choice
#> 1 0 0 1 0
#> 2 0 0 1 0
#> 3 0 1 0 0
#> 4 0 0 0 1
#> 5 0 0 1 0
#> 6 0 1 0 0
The package includes some functions to quickly inspect some basic metrics of a design.
The cbc_balance() function prints out a summary of the individual and pairwise counts of each level of each attribute across all choice questions:
cbc_balance(design)
#> ==============================
#> price x type
#>
#> Fuji Gala Honeycrisp
#> NA 5364 5397 5439
#> 1 2264 736 755 773
#> 1.5 2359 754 774 831
#> 2 2339 799 763 777
#> 2.5 2368 791 776 801
#> 3 2334 810 766 758
#> 3.5 2238 754 759 725
#> 4 2298 720 804 774
#>
#> price x freshness
#>
#> Poor Average Excellent
#> NA 5481 5343 5376
#> 1 2264 754 724 786
#> 1.5 2359 810 799 750
#> 2 2339 799 772 768
#> 2.5 2368 790 810 768
#> 3 2334 789 736 809
#> 3.5 2238 737 754 747
#> 4 2298 802 748 748
#>
#> type x freshness
#>
#> Poor Average Excellent
#> NA 5481 5343 5376
#> Fuji 5364 1836 1756 1772
#> Gala 5397 1775 1829 1793
#> Honeycrisp 5439 1870 1758 1811
The cbc_overlap() function prints out a summary of the amount of "overlap" across attributes within the choice questions. For example, for each attribute, the count under "1" is the number of choice questions in which the same level was shown across all alternatives for that attribute (because there was only one level shown). Likewise, the count under "2" is the number of choice questions in which only two unique levels of that attribute were shown, and so on:
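The overlap metric can be reproduced by hand for a toy example. This is an illustration of what is being counted, not the package's internal code:

```r
# Two choice questions, three alternatives each
type  <- c("Fuji", "Fuji", "Gala",   # question 1: 2 unique levels
           "Gala", "Gala", "Gala")   # question 2: 1 unique level
obsID <- c(1, 1, 1, 2, 2, 2)
# Number of unique levels of the attribute shown within each question
n_unique <- tapply(type, obsID, function(x) length(unique(x)))
table(n_unique)  # one question with 1 unique level, one with 2
```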
cbc_overlap(design)
#> ==============================
#> Counts of attribute overlap:
#> (# of questions with N unique levels)
#>
#> price:
#>
#> 1 2 3
#> 91 1853 3456
#>
#> type:
#>
#> 1 2 3
#> 530 3617 1253
#>
#> freshness:
#>
#> 1 2 3
#> 553 3589 1258
You can simulate choices for a given design using the cbc_choices() function. By default, random choices are simulated:
data <- cbc_choices(
design = design,
obsID = "obsID"
)
head(data)
#> respID qID altID obsID profileID price type freshness choice
#> 1 1 1 1 1 8 1.0 Gala Poor 0
#> 2 1 1 2 1 53 2.5 Gala Excellent 0
#> 3 1 1 3 1 40 3.0 Honeycrisp Average 1
#> 4 1 2 1 2 20 3.5 Honeycrisp Poor 0
#> 5 1 2 2 2 10 2.0 Gala Poor 1
#> 6 1 2 3 2 24 2.0 Fuji Average 0
You can also pass a list of prior parameters to define a utility model that will be used to simulate choices. In the example below, the choices are simulated using a utility model with the following parameters:

- 1 continuous parameter for price
- 2 categorical parameters for type ('Gala' and 'Honeycrisp')
- 2 categorical parameters for freshness ('Average' and 'Excellent')

Note that for categorical variables (type and freshness in this example), the first level defined when using cbc_profiles() is set as the reference level. The example below defines the following utility model for simulating choices for each alternative j:

u_j = 0.1 * price_j + 0.1 * typeGala_j + 0.2 * typeHoneycrisp_j + 0.1 * freshnessAverage_j + 0.2 * freshnessExcellent_j + e_j
data <- cbc_choices(
design = design,
obsID = "obsID",
priors = list(
price = 0.1,
type = c(0.1, 0.2),
freshness = c(0.1, 0.2)
)
)
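These priors imply a logit-style data-generating process: each alternative's utility is the sum of its attribute levels weighted by the prior coefficients, and choice probabilities follow from the logit formula. Here is a hedged, self-contained sketch of that calculation for one question (illustrative only, with hand-coded dummy variables; not the package's internal code):

```r
# Prior coefficients matching the example above
beta <- c(price = 0.1, typeGala = 0.1, typeHoneycrisp = 0.2,
          freshnessAverage = 0.1, freshnessExcellent = 0.2)
# Dummy-coded attributes for one 3-alternative question (columns:
# price, typeGala, typeHoneycrisp, freshnessAverage, freshnessExcellent)
X <- rbind(
  c(1.0, 1, 0, 0, 0),  # Gala, Poor, $1.00
  c(2.5, 1, 0, 0, 1),  # Gala, Excellent, $2.50
  c(3.0, 0, 1, 1, 0)   # Honeycrisp, Average, $3.00
)
v <- as.vector(X %*% beta)  # utilities
p <- exp(v) / sum(exp(v))   # logit choice probabilities
round(p, 3)
#> [1] 0.256 0.363 0.381
```

The highest-utility alternative (Honeycrisp at $3.00 here) is the most likely to be chosen, but every alternative retains some probability of selection.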
If you wish to include a prior model with an interaction, you can do so inside the priors list. For example, here is the same example as above but with an interaction between price and type added:
data <- cbc_choices(
design = design,
obsID = "obsID",
priors = list(
price = 0.1,
type = c(0.1, 0.2),
freshness = c(0.1, 0.2),
`price*type` = c(0.1, 0.5)
)
)
Finally, you can also simulate data for a mixed logit model where parameters follow a normal or log-normal distribution across the population. In the example below, the randN() function is used to specify the type attribute with 2 random normal parameters with a specified vector of means (mean) and standard deviations (sd) for each level of type. Log-normal parameters are specified using randLN().
data <- cbc_choices(
design = design,
obsID = "obsID",
priors = list(
price = 0.1,
type = randN(mean = c(0.1, 0.2), sd = c(1, 2)),
freshness = c(0.1, 0.2)
)
)
The simulated choice data can be used to conduct a power analysis by estimating the same model multiple times with incrementally increasing sample sizes. As the sample size increases, the estimated coefficient standard errors will decrease (i.e., coefficient estimates become more precise). The cbc_power() function achieves this by partitioning the choice data into multiple sizes (defined by the nbreaks argument) and then estimating a user-defined choice model on each data subset. In the example below, 10 different sample sizes are used. All models are estimated using the {logitr} package:
power <- cbc_power(
data = data,
pars = c("price", "type", "freshness"),
outcome = "choice",
obsID = "obsID",
nbreaks = 10,
n_q = 6
)
head(power)
#> sampleSize coef est se
#> 1 90 price 0.059100275 0.05355244
#> 2 90 typeGala 0.086638491 0.12738712
#> 3 90 typeHoneycrisp 0.048171450 0.12811675
#> 4 90 freshnessAverage 0.146950632 0.12869842
#> 5 90 freshnessExcellent 0.122825059 0.12774457
#> 6 180 price -0.007340841 0.03746124
tail(power)
#> sampleSize coef est se
#> 45 810 freshnessExcellent 0.045353237 0.04271695
#> 46 900 price 0.021581802 0.01680005
#> 47 900 typeGala -0.005864412 0.04044297
#> 48 900 typeHoneycrisp -0.072816654 0.04055494
#> 49 900 freshnessAverage 0.059181159 0.04044227
#> 50 900 freshnessExcellent 0.047257238 0.04052068
The power data frame contains the coefficient estimates and standard errors for each sample size. You can quickly visualize the outcome to identify a required sample size for a desired level of parameter precision by using the plot() method:
plot(power)
If you want to examine aspects of the models beyond the standard errors, you can set return_models = TRUE and cbc_power() will return a list of estimated models. The example below prints a summary of the last model in the list of models:
library(logitr)
models <- cbc_power(
data = data,
pars = c("price", "type", "freshness"),
outcome = "choice",
obsID = "obsID",
nbreaks = 10,
n_q = 6,
return_models = TRUE
)
summary(models[[10]])
#> =================================================
#>
#> Model estimated on: Fri Sep 02 17:30:02 2022
#>
#> Using logitr version: 0.7.2
#>
#> Call:
#> FUN(data = X[[i]], outcome = ..1, obsID = ..2, pars = ..3, randPars = ..4,
#> panelID = ..5, clusterID = ..6, robust = ..7, predict = ..8)
#>
#> Frequencies of alternatives:
#> 1 2 3
#> 0.32296 0.34056 0.33648
#>
#> Exit Status: 3, Optimization stopped because ftol_rel or ftol_abs was reached.
#>
#> Model Type: Multinomial Logit
#> Model Space: Preference
#> Model Run: 1 of 1
#> Iterations: 9
#> Elapsed Time: 0h:0m:0.04s
#> Algorithm: NLOPT_LD_LBFGS
#> Weights Used?: FALSE
#> Robust? FALSE
#>
#> Model Coefficients:
#> Estimate Std. Error z-value Pr(>|z|)
#> price 0.0215818 0.0168001 1.2846 0.19892
#> typeGala -0.0058644 0.0404430 -0.1450 0.88471
#> typeHoneycrisp -0.0728167 0.0405549 -1.7955 0.07257 .
#> freshnessAverage 0.0591812 0.0404423 1.4633 0.14337
#> freshnessExcellent 0.0472572 0.0405207 1.1662 0.24351
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Log-Likelihood: -5.928414e+03
#> Null Log-Likelihood: -5.932506e+03
#> AIC: 1.186683e+04
#> BIC: 1.189980e+04
#> McFadden R2: 6.897615e-04
#> Adj McFadden R2: -1.530526e-04
#> Number of Observations: 5.400000e+03
One of the convenient features of how the package is written is that the object generated in each step is used as the first argument to the function for the next step. Thus, the functions can be piped together, covering the full workflow from profiles to power analysis:
cbc_profiles(
price = seq(1, 4, 0.5), # $ per pound
type = c('Fuji', 'Gala', 'Honeycrisp'),
freshness = c('Poor', 'Average', 'Excellent')
) |>
cbc_design(
n_resp = 900, # Number of respondents
n_alts = 3, # Number of alternatives per question
n_q = 6 # Number of questions per respondent
) |>
cbc_choices(
obsID = "obsID",
priors = list(
price = 0.1,
type = c(0.1, 0.2),
freshness = c(0.1, 0.2)
)
) |>
cbc_power(
pars = c("price", "type", "freshness"),
outcome = "choice",
obsID = "obsID",
nbreaks = 10,
n_q = 6
) |>
plot()
- Author: John Paul Helveston https://www.jhelvy.com/
- Date First Written: October 23, 2020
- License: MIT
If you use this package in a publication, I would greatly appreciate it if you cited it. You can get the citation by typing citation("cbcTools") into R:
citation("cbcTools")
#>
#> To cite cbcTools in publications use:
#>
#> John Paul Helveston (2022). cbcTools: Tools For Designing Conjoint
#> Survey Experiments.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {cbcTools: Tools For Designing Choice-Based Conjoint Survey Experiments},
#> author = {John Paul Helveston},
#> year = {2022},
#> note = {R package version 0.0.3},
#> url = {https://jhelvy.github.io/cbcTools/},
#> }