xkdr / Survey.jl
Analysis of complex surveys
Home Page: https://xkdr.github.io/Survey.jl/
License: GNU General Public License v3.0
Right now, warnings generated by tests appear in the log messages. Since we intentionally test for these warnings, I think we can suppress them; that would make the logs easier to read.
apipop is a dataset included with Survey.jl. Its columns containing NA are not being parsed correctly: they are parsed as strings, when they should be Int or missing.
Load apipop
using Survey
data(api)
View target column
apipop[!,"target"]
The output is
6194-element PooledArrays.PooledVector{InlineStrings.String3, UInt32, Vector{UInt32}}:
"5"
"11"
"11"
"3"
"1"
"4"
"2"
"7"
"1"
"11"
⋮
"10"
"13"
"12"
"11"
"12"
"11"
"5"
"2"
"6"
The expected output is
6194-element Vector{Union{Missing, Int64}}:
5
11
11
3
1
4
2
7
1
11
⋮
10
13
12
11
12
11
5
2
6
Right now there is only rudimentary testing for total.jl and the total function. Add tests, especially for CategoricalArray count estimation.
Please also add an example for the function in its docstring.
DISCLAIMER: The following is based on my personal opinion, not on some official information. Please argue with me if you disagree.
The design_update branch is not tested properly or completely. I suggest a plan of action to improve code coverage and some guidelines for test-writing.
First of all, we need to differentiate between unit tests and integration tests. Since they are conceptually different (one of them is done in a white-box manner and the other in a black-box manner) they can be done in parallel, or by different people. Generally, unit testing should be done while developing.
Unit testing
Every implemented function needs to be tested for every functionality it provides, starting from the base up (i.e. from constructors to high-level functionality like plotting). This means:
Go through each file inside src
in the order low level - high level
Inside test/filename
write tests for each function from the file src/filename
such that the following are checked:
test main functionality
Example: @test nrow(svyby(:api00, :stype, srs, svymean)) == apisrs.stype |> unique |> size |> first
test all if-else
cases of the function
test errors (using @test_throws
)
test warnings if appropriate (using @test_warn
)
The simplest function should have at least two tests: one for main functionality and one for error check.
Make sure test/filename
is included in the test/runtests.jl
file.
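A minimal test file following this checklist might look as follows. This is a sketch using a toy function and only the Test stdlib, since the exact Survey.jl function signatures vary; the names here are illustrative, not the package's API.

```julia
using Test

# Toy stand-in for a Survey.jl function: errors on empty input, like the
# "error check" branch the checklist asks every function to have.
toy_mean(x) = isempty(x) ? throw(ArgumentError("empty input")) : sum(x) / length(x)

@testset "toy_mean" begin
    @test toy_mean([1.0, 2.0, 3.0]) == 2.0          # main functionality
    @test_throws ArgumentError toy_mean(Float64[])  # error branch
end
```

Even the simplest function gets these two tests; if-else branches and warnings each add one more.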
Integration testing
The overall package needs to behave as expected. We need to think about scenarios that we would use the package for and build tests according to our expectations. The tests should answer common sense questions and edge-case questions starting with "What should happen if ..." (e.g. What should happen if I have an empty survey? or What should happen if I have a strata specifier for a SimpleRandomSample
? or What should happen if I only have one data point (and the rest are missing
)? or anything else that your mind can think of).
General guidelines
test/
    integration/
        integration1.jl
        integration2.jl
        ...
    unit/
        SurveyDesign.jl
        svyby.jl
        ...
    runtests.jl
... but with more descriptive file names inside the integration
folder.
The svyboxplot function uses AlgebraOfGraphics, while the other plotting functions use Makie. I suggest using AlgebraOfGraphics everywhere.
The README currently shows examples using the old design. The values given in those examples are not even correct (we corrected them for the new design). The old examples need to be replaced with new ones corresponding to the new design.
Add multiple dispatch methods for CategoricalArray-type columns in the dataset. The behaviour should be like R's handling of factor variables.
We implemented ht_svytotal and HT_HartleyRaoVarApprox in the ht.jl file. They operate on a SurveyDesign object, giving correct totals for any design without the need to specify a cluster = or strata = argument. I think we should test the function and compare with some complex designs from R. If this holds, we could encapsulate the relevant parts of SurveyDesign into an HTSurveyDesign object. This would be a crude approximation of complex designs, but it would be a good starting point for the package.
ht_svymean is still a work in progress, as we haven't been able to figure out the correct scaling factor between its variance and the variance of ht_svytotal given by HT_HartleyRaoVarApprox. I think a look through the Särndal textbook will give us the relation between the variances of the Hartley-Rao total and mean (it is not as simple as 1/N^2).
In the current inline documentation and help pages we provide some examples for the user and explain the API.
Eventually, I think an extended tutorial page/Pluto notebook/Medium blog using Survey.jl would be beneficial to the community and to the package. Ideas along the lines of Getting Started with DataFrames or https://pandas.pydata.org/docs/user_guide/10min.html, but with more explanation and narrative, with a focus on one particular publicly available survey. The idea would be to take the first-time user through installing the package and explain its key functionality to answer a few research questions.
Since many functions require frequency weights, we can store them in design.variables rather than having each function derive them from probs.
Instead of probs, perhaps we should call it probability_weights_reserved? Similarly, frequency_weights_reserved.
The reason we're only storing probability weights is that we didn't anticipate that so many Julia functions would use frequency weights. I believe the survey package in R only stores probability weights, and most R functions use probability weights.
Note: we want the information about weights to be in the same data frame, and not a property of svydesign, for performance reasons. The combine function in DataFrames works a lot faster that way.
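The idea can be sketched as follows. This is a hypothetical snippet on a toy data frame; the reserved column names follow the naming suggestion above and are not the package's actual API.

```julia
using DataFrames

# Hypothetical sketch: derive frequency weights once, at construction time, and
# store them as a reserved column instead of recomputing 1 ./ probs inside
# every estimator that needs them.
variables = DataFrame(y = [1.0, 2.0, 3.0],
                      probability_weights_reserved = [0.1, 0.2, 0.5])
variables[!, :frequency_weights_reserved] =
    1 ./ variables.probability_weights_reserved
```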
The load_data function considers its argument, name, to be part of the "API" dataset. It should be a general function that loads data from any dataset.
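A dataset-agnostic loader could be sketched like this. The `dir` argument stands in for the package's asset directory (a real implementation would resolve it internally, e.g. via pkgdir); treating "NA" as missing is an assumption that also relates to the apipop parsing issue above.

```julia
using CSV, DataFrames

# Hypothetical sketch of a general load_data: any dataset name, any directory.
function load_data(name::AbstractString, dir::AbstractString)
    path = joinpath(dir, "$name.csv")
    return CSV.read(path, DataFrame; missingstring = "NA")
end
```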
So far, most of the testing suite is limited to the API dataset. I suggest improving testing by using other publicly available survey datasets. Examples from Lumley's R survey textbook (pg 7, section 1.2.1) could be used, e.g. NHANES, SHS, SIPP.
http://asdfree.com (Analyze Surveys Free) has many real-world datasets and examples with the corresponding R survey code.
There is another issue in SurveyDesign.jl. If you write the following code, the error message shown is not what it should be:
srs = SimpleRandomSample(apisrs, popsize = -2.8, ignorefpc = true)
the error message is
either population size or frequency/probability weights must be specified
Also for this code
srs = SimpleRandomSample(apisrs, sampsize = -2.8, ignorefpc = true)
the function works all the way up to line 55 and only then shows an error, giving the following message:
ERROR: InexactError: UInt64(-3.0)
My suggestion is to add a check that lets the function proceed only if sampsize and popsize are non-negative integers.
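Such a check could be sketched like this. `check_size` is a hypothetical helper name, and it is slightly stricter than the suggestion above (it requires positive rather than non-negative values); the point is to fail fast with a clear message instead of letting an InexactError surface deep inside the constructor.

```julia
# Hypothetical validation sketch for constructor arguments like sampsize/popsize.
function check_size(name::Symbol, value)
    value isa Integer && value > 0 ||
        throw(ArgumentError("$name must be a positive integer, got $value"))
    return value
end
```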
Right now there is only rudimentary testing for mean.jl. Add tests, especially for CategoricalArray proportion calculations.
Thomas Lumley replied:
It's a bug resulting from the new svyquantile() implementation. It's fixed in the development version, which you can get from r-forge here: https://r-forge.r-project.org/R/?group_id=1788
After re-reading the Särndal textbook (sections 3.3 and 5.8 and Chapter 10), there are alternate estimators of the domain mean depending on whether the domain population sizes are known or not.
What has been implemented is one of the options; we should clarify the underlying assumptions to the user in the documentation, docstrings, etc.
How about the following design for the general SurveyDesign
:
For example, a simple random sample is a multistage sample with one stratum in which every observation is a cluster.
All the formulae will be for multistage stratified sample.
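A sketch of such a general design might look as follows. The struct and field names here are assumptions for illustration, not the package's actual API; an SRS maps onto it as a single stratum in which every observation is its own cluster.

```julia
using DataFrames

# Hypothetical sketch of a general multistage, stratified survey design.
struct GeneralSurveyDesign
    data::DataFrame
    strata::Symbol    # column identifying strata (a constant column for an SRS)
    cluster::Symbol   # column identifying primary sampling units
    weights::Symbol   # column of sampling weights
end
```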
Calculating survey means and totals is relatively straightforward compared to calculating their variances in certain designs. For example, the Horvitz-Thompson estimator involves a double inclusion probability term (\pi_{ij}) which is not readily generalised to arbitrary sampling designs. For now I implemented the Hartley-Rao variance approximation, which bypasses the calculation of double inclusion probabilities.
It is interesting to note how R, SAS, Stata, SUDAAN etc. calculate their variances. For some 'elementary' sampling designs they use closed-form solutions, but for more complex multistage designs they use replication or resampling style methods like Balanced Repeated Replication (BRR), the jackknife, or other bootstrap-type techniques. I found it difficult to debug through the R survey package code, which is illegible in most places, to see what exactly R does in which cases.
Estimating variances of subpopulation mean/total estimators in Stata: here
The SAS 9.2 documentation on Variance Estimation (and here) says that for multistage sampling designs SAS uses a Taylor series variance method that only considers the first stage of the sample design, and that BRR- and jackknife-based estimation does not take the finite population correction into account.
Are these (or other) reasonable assumptions/relaxations of the variance estimation problem for the Survey.jl package? Perhaps, but it is worth exploring in more detail. I think initially trying to implement the analytical solutions for the variances of the mean/total estimators is worthwhile, especially given the power of the Julia language; if that doesn't work, we can fall back to replication/resampling methods.
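To make the replication fallback concrete, here is a minimal delete-1 jackknife sketch for the variance of an SRS mean (equal weights assumed). Methods like this sidestep joint inclusion probabilities entirely, which is why the commercial packages reach for them in complex designs.

```julia
# Delete-1 jackknife variance of the sample mean (SRS, equal weights).
function jackknife_var_mean(y::AbstractVector)
    n = length(y)
    est = sum(y) / n
    loo = [(sum(y) - y[k]) / (n - 1) for k in 1:n]   # leave-one-out estimates
    return (n - 1) / n * sum((loo .- est) .^ 2)
end
```

For the sample mean this reproduces the textbook result s²/n, which makes it easy to test against the closed-form solution.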
This functionality is not really needed: it simply calculates the size of the data frame used by the design, which is easily done using DataFrames.size. I suggest we remove it.
For SimpleRandomSample, setting ignorefpc=true should result in fpc being equal to 1.0, since we allow fpc to be of type Float64. But this doesn't happen:
julia> srs_ignorefpc = SimpleRandomSample(apisrs; popsize = :fpc, ignorefpc = true)
┌ Warning: assuming all weights are equal to 1.0
└ @ Survey ~/Work/xKDR/Survey.jl/src/SurveyDesign.jl:162
SimpleRandomSample:
data: 200x42 DataFrame
weights: 1.0, 1.0, 1.0, ..., 1.0
probs: 1.0, 1.0, 1.0, ..., 1.0
fpc: 6194, 6194, 6194, ..., 6194
popsize: 6194
sampsize: 200
sampfraction: 0.0323
ignorefpc: true
For StratifiedSample the same code looks OK:
julia> strat_ignorefpc = StratifiedSample(apistrat, :stype; popsize = :fpc, ignorefpc = true)
┌ Warning: assuming all weights are equal to 1.0
└ @ Survey ~/Work/xKDR/Survey.jl/src/SurveyDesign.jl:360
StratifiedSample:
data: 200x45 DataFrame
strata: stype
weights: 1.0, 1.0, 1.0, ..., 1.0
probs: 1.0, 1.0, 1.0, ..., 1.0
fpc: 1, 1, 1, ..., 1
popsize: 4421, 4421, 4421, ..., 755
sampsize: 100, 100, 100, ..., 50
sampfraction: 0.0226, 0.0226, 0.0226, ..., 0.0662
ignorefpc: true
(there's also extra information that shouldn't be there, I mentioned this in issue #111.)
TODO: a data structure for single-stage cluster sampling, following the setup in these lecture notes from PSU.
Add support in SurveyDesign.jl
with svymean
and svytotal
working as expected (compare with R). Later check usage with svyquantile
.
... to avoid name clash with AlgebraOfGraphics.
Also, you could return a named tuple and encourage the user to use
(; apistrat) = dataset(api)
to avoid cluttering the namespace.
The survey design does not check whether the sum of the probabilities is less than or equal to 1. Even when the sum is more than 1, the design works and throws no error.
srs_w_p = SimpleRandomSample(apisrs, ignorefpc = false, weights = :fpc, probs = fill(0.3, size(apisrs_original, 1)))
This chunk of code throws no error even though the sum of the probabilities is more than 1.
Make a blog post showcasing the speedups and comparisons with R, like was done in old landing page.
At first, we should make the package work for simple random samples and then extend the functionality to other design types. The current design of the package doesn't allow development to be done in that way.
How about:
abstract type SurveyDesign end

struct SimpleRandomSample <: SurveyDesign
    data::DataFrame
end

struct StratifiedSample <: SurveyDesign
    data::DataFrame
end

function plot(x::SurveyDesign)
    plot(x.data)  # works on any child of SurveyDesign
end

function mean(variable, x::SimpleRandomSample)
    # computes the mean of a variable from a simple random sample
end
Later, we can add:
function mean(variable, x::StratifiedSample)
    # computes the mean of a variable from a stratified sample
end
There is pristine clarity on which functions work on which design types.
If we want to keep the API similar to R, we should consider wrappers.
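The wrapper idea is cheap to realise: keep plain `mean` as the core Julian API and expose the R-style name as a one-line alias. The sketch below uses a toy `TinySRS` type since the real design types are still under discussion.

```julia
using Statistics

# Toy design type standing in for SimpleRandomSample.
struct TinySRS
    y::Vector{Float64}
end

# Core Julian API: extend Statistics.mean for the design type.
Statistics.mean(design::TinySRS) = mean(design.y)

# R-compatible wrapper is a thin alias over the core function.
svymean(design::TinySRS) = mean(design)
```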
There are some @show
macros in the constructor body for SimpleRandomSample
and StratifiedSample
.
Hi all, in some of our research we care not just about the combined survey mean/total for a StratifiedSample, but about the individual strata means/totals as well.
The suggestion is to return ȳₕ if specified as a keyword argument in the call to svymean(x::Symbol, design::StratifiedSample). Alternatively, this functionality could be separated out with multiple dispatch.
The following fields in svyglm have been implemented
aic
family
coefficients
formula
data
model
deviance
offset
y
rank
linear.predictors
fitted.values
prior.weights
weights
residuals
converged
terms
control
contrasts
naive.cov
df.null
null.deviance
df.residual
df.null is the dof_residual (residual degrees of freedom) of the null model. null.deviance is the deviance of the null model. Computing both requires the null model. In R these are output by GLM; Julia's GLM has no such output (nulldeviance will be in a future Julia GLM release). Right now this is implemented by creating a formula for the null model, fitting a GLM with it, and calling deviance and dof_residual on the result.
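The workaround described above can be sketched on toy data as follows, assuming GLM.jl and DataFrames.jl. Note that R's `+ 1` convention for df.null is not applied here.

```julia
using DataFrames, GLM

# Fit an intercept-only (null) model and read off its deviance and residual
# degrees of freedom, which play the roles of null.deviance and df.null.
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0])
null_model = glm(@formula(y ~ 1), df, Normal(), IdentityLink())
null_dev = deviance(null_model)
df_null  = dof_residual(null_model)
```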
In R, df.residual is defined as
g$df.residual <- degf(design)+1-length(coef(g)[!is.na(coef(g))])
However, this requires the degf function. Julia's GLM already has a dof_residual function.
This is implemented as vcov(glm)/GLM.dispersion(glm.model,true)
The following fields have not been implemented yet
xlevels
effects
cov.unscaled
xlevels is a record of the levels of the factors used in fitting. effects is related to the final weighted linear fit. It would be great if someone could point to some resources regarding these.
cov.unscaled requires translations of R functions svy.varcoef
and svyCprod
.
The following fields are out of scope
boundary
iter
call
method
survey.design
qr
R
boundary is a logical value answering the question "Is the fitted value on the boundary of the attainable values?". It is unclear to me how one specifies this boundary. Looking at the source, it seems that R's GLM outputs this value, which would mean that we can't implement it until Julia's GLM has it.
iter is the number of iterations. Julia's GLM does not output it (unlike R's GLM), so we cannot add this until it is added to Julia's GLM.
call, method, and survey.design store expressions which can be eval'ed: call is the expression with which svyglm was called, method is the name of the method, and survey.design is the expression used to create the design. Right now it does not make sense to implement these in the Julia svyglm.
qr is the QR decomposition and R is the R part of the QR decomposition. Julia's GLM does not use a QR decomposition, hence it does not make sense to implement these.
E.g. make svymean just mean.
We made a lot of changes to the design lately, but svyquantile was not tested. @sayantikaSSG, can you write some tests for svyquantile? And maybe @smishr can take a look at the source code and update it.
Remove the old survey design.
Branches like hartley-rao-se
and revert-*
need to be deleted.
It's really great that you aim to replicate features provided by the R survey package, as it's really the reference in that domain. However, its API is quite ad hoc and forces users to learn a completely new syntax when moving from unweighted to weighted/survey analyses. Have you considered adopting a syntax based on DataFrames' groupby and combine/select/transform? In R, the recent srvyr package does that by wrapping survey functions with a dplyr-like syntax.
In particular, svyby(:api00, [:cname, :meals], dclus1, svymean)
could be written as combine(groupby(dclus1, [:cname, :meals]), :api00 => svymean)
. With DataFramesMeta, this could become @combine(groupby(dclus1, [:cname, :meals]), svymean(:api00))
. One would also be able to compute the mean of all columns using combine(dclus1, All() .=> svymean)
.
In terms of implementation I saw that you already use combine
under the hood so it shouldn't be problematic.
Cc: @bkamins
if ignorefpc
    @warn "assuming all weights are equal to 1.0"
    weights = ones(nrow(data))
end
If the population size is not known or can't be estimated for some reason, the ignorefpc parameter should provide usage without the (N-n)/N correction factor applied to the mean/variance etc. But I don't think that the way it is being done currently (above) is fully correct. Right now the weights vector is set to all 1s, which is not quite ignoring the fpc (although it does end up giving correct variances and means from svymean and svytotal).
This functionality and usage should be explored in more depth. If the current implementation is found lacking, it could be removed and replaced. Although, for what it's worth, any survey that has sampled a substantial portion of the population would want to use the fpc correction factor in its calculations.
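One alternative, sketched below with a hypothetical helper name, is to leave the weights untouched and neutralize only the correction factor itself when ignorefpc is set.

```julia
# Hypothetical sketch: compute the (N - n)/N correction factor, returning the
# neutral value 1.0 when the caller asks to ignore the fpc.
function fpc_factor(popsize::Integer, sampsize::Integer; ignorefpc::Bool = false)
    return ignorefpc ? 1.0 : (popsize - sampsize) / popsize
end
```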
Currently, Statistics has all the Hyndman & Fan rules, and StatsBase has some rudimentary rule for weighted quantiles (which I have been unable to classify), but from what I could find, the formulae under the "Weighted" column on the right-hand side of the table are not implemented in Julia yet. They are straightforward to implement. Later, we could look at encapsulating parts of the code into StatsBase's quantile.
Except for simple random samples with qrule="hf7", our quantile function is incorrect. Weighting logic needs to be applied.
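The weighting logic could start from something like the sketch below: a plain step-function rule over the weighted empirical CDF. This is a placeholder, not a faithful implementation of any specific Hyndman & Fan "Weighted" formula (those interpolate between order statistics).

```julia
# Hypothetical sketch: quantile from the weighted empirical CDF, no interpolation.
function weighted_quantile(x::AbstractVector, w::AbstractVector, p::Real)
    perm = sortperm(x)
    xs, ws = x[perm], w[perm]
    cw = cumsum(ws) ./ sum(ws)          # weighted empirical CDF
    i = searchsortedfirst(cw, p)        # first point with CDF >= p
    return xs[clamp(i, 1, length(xs))]
end
```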
When running the command julia docs/make.jl
I get this output:
[ Info: SetupBuildDirectory: setting up build directory.
[ Info: Doctest: running doctests.
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
ell ~ 1 + meals
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 6.86665 0.350512 19.59 <1e-84 6.17966 7.55364
meals 0.410511 0.00613985 66.86 <1e-99 0.398477 0.422545
────────────────────────────────────────────────────────────────────────
Degrees of Freedom: 6193.000324249264 (i.e. Null); 6192.000324249264 Residual
Null Deviance: 1.7556928968296547e6
Residual Deviance: 1.0196009035970895e6
AIC: 49195.42124574161
[ Info: ExpandTemplates: expanding markdown templates.
┌ Warning: duplicate docs found for 'Survey.svyby' in src/index.md:15-17
│ ```@autodocs
│ Modules = [Survey]
│ ```
└ @ Documenter.Expanders ~/.julia/packages/Documenter/48VFh/src/Utilities/Utilities.jl:34
[ Info: CrossReferences: building cross-references.
[ Info: CheckDocument: running document checks.
[ Info: Populate: populating indices.
[ Info: RenderDocument: rendering document.
[ Info: HTMLWriter: rendering HTML pages.
┌ Warning: Documenter could not auto-detect the building environment Skipping deployment.
└ @ Documenter ~/.julia/packages/Documenter/48VFh/src/deployconfig.jl:75
We should get rid of the warnings and get rid of the unnecessary output coming from svyglm
probably:
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
ell ~ 1 + meals
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 6.86665 0.350512 19.59 <1e-84 6.17966 7.55364
meals 0.410511 0.00613985 66.86 <1e-99 0.398477 0.422545
────────────────────────────────────────────────────────────────────────
Degrees of Freedom: 6193.000324249264 (i.e. Null); 6192.000324249264 Residual
Null Deviance: 1.7556928968296547e6
Residual Deviance: 1.0196009035970895e6
AIC: 49195.42124574161
Implement the jackknife technique for survey sampling designs, and integrate it with the existing data structures and functions.
This method has 0 code coverage. It can be easily tested.
Clean up the design update branch so that it contains only working elements, i.e. remove structs and functions that have not been tested or are still in development. Work in progress could be moved to new branches, kept in local branches, or removed entirely.
Make issues out of the things that are removed, e.g. poststratify.jl etc., so everyone remembers them. Add them to a milestone and label them nicely.
#98 is a subtask of this.
Examples should have references to equivalent R code. The references could be done by creating an appendix in the API reference, or there should be a separate section.
In the svyby line for SimpleRandomSample we pass :weights, but we never use it later in svymean. Remove it if not required. Double-check against the Cochran slides this was implemented from:
return combine(gdf, [formula, :weights] => ((a, b) -> func(a, design, b, params...)) => AsTable)
Is there a reason to depend on a plotting package?
Is there a value added compared to using one directly?
(; apistrat) = dataset(api) # see issue #27 on renaming `data` to `dataset`
dstrat = svydesign(data = apistrat, id = :1, strata = :stype, weights = :pw, fpc = :fpc);
using AlgebraOfGraphics
data(dstrat.variables) * mapping(:api99, :api00, markersize = :pw) |> draw
data(dstrat.variables) * mapping(:enroll, weights = :pw) * visual(Hist) |> draw
data(dstrat.variables) * mapping(:stype, :enroll; weights = :pw) * visual(BoxPlot) |> draw
Then you don't need to define all kinds of plotting functions here, but you encourage the user to use the full power of Makie/AlgebraOfGraphics or StatsPlots.jl/Plots.jl.
To simplify the life of users, you could of course find a way to do
data(dstrat) * mapping(...)
in AoG or @df dstrat plot(...)
in StatsPlots (without the .variables
). Probably it would be enough to define getproperty
on a survey design.
No method of svymean or svytotal is defined for x::AbstractVector yet. Please add one.
In a few weeks, the design_update branch will overhaul the main branch with more robust and "Julian" data structures for survey sampling designs, more functionality, testing, and documentation.
Currently, the main branch supports weighted means, but doesn't quite work with cluster sampling and higher-order sampling schemes. Variance and SE estimates of the means and totals are also incorrect in places.
In design_update there are new data structures, including SimpleRandomSample and StratifiedSample, which provide functionality for simple random sampling and stratified random sampling respectively. These data structures support summary estimators like svymean, svytotal, and svyquantile. Estimation within subpopulations (domain estimation) using svyby has been implemented for SimpleRandomSample. Variances for specific designs using their closed-form solutions have been tested and added. The Hartley-Rao approximation of the Horvitz-Thompson total has been fixed, among various additions that will be documented.
The SurveyDesign structure attempts to create a 'general' structure that is able to intelligently parse any arbitrary survey design given cluster, strata, weights, and population size information, similar to the R survey package's svydesign() function. It currently supports SRS and stratified data. In due course we will add a ClusterSample structure, and then combine the building blocks to support multistage sampling designs, initially using unbiased estimators and their variances (Chap. 4, Särndal et al., Model Assisted Survey Sampling, 1992).
Improve the inline commenting and the explanation of what happens in the code. Add LaTeX formulae for the mean and total estimators etc., and link to the source and page number they were taken from.
The long-term key idea is that the package should be self-contained and self-explanatory, without the need to go to external resources (some of which may be paywalled).
In R's survey, population proportion and population total estimation is done by passing a 'factor' (CategoricalArray) variable to svymean or svytotal (see pg 21 of the Lumley textbook).
Basic proportion and count estimation functionality was added to SimpleRandomSample and started for StratifiedSample, inside their respective svymean and svytotal functions, to emulate the functionality from R. This is currently done using an if statement.
I think it would be neater and cleaner to implement these features via multiple dispatch instead of catching them with an if. There would also be minor speed improvements for very large datasets.
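The dispatch idea can be sketched like this. `_estimate` is a toy stand-in for the package's internals, not the actual Survey.jl code; the point is that Julia routes categorical columns to proportion estimation with no `if` branch.

```julia
using CategoricalArrays

# Numeric columns get a mean; categorical columns get per-level proportions.
_estimate(x::AbstractVector{<:Real}) = sum(x) / length(x)
_estimate(x::CategoricalVector) =
    Dict(l => count(==(l), x) / length(x) for l in levels(x))
```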