
survey.jl's People

Contributors

asinghvi17, ayushpatnaikgit, codetalker7, greimel, harsharora21, itsdebartha, iuliadmtru, jishnub, nadiaenh, sayantikassg, smishr


survey.jl's Issues

Suppress warning outputs during testing

Right now, warnings generated by tests appear in the log messages. Since we intentionally test for these warnings, we can suppress them; that would make the logs easier to read.
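One way to do this (a sketch; the constructor call follows the examples elsewhere in this tracker) is to capture the expected warning with the Test standard library, so it is asserted rather than printed:

using Test

# the warning is consumed by the test instead of being written to the log
@test_logs (:warn, "assuming all weights are equal to 1.0") SimpleRandomSample(apisrs; popsize = :fpc, ignorefpc = true)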

Columns with NA not being parsed correctly

apipop is a dataset included with Survey.jl. Its columns containing NA are not being parsed correctly: they are read as strings when they should be Int or missing.

How to reproduce

Load apipop

using Survey
data(api)

View target column

apipop[!,"target"]

The output is

6194-element PooledArrays.PooledVector{InlineStrings.String3, UInt32, Vector{UInt32}}:
 "5"
 "11"
 "11"
 "3"
 "1"
 "4"
 "2"
 "7"
 "1"
 "11"
 ⋮
 "10"
 "13"
 "12"
 "11"
 "12"
 "11"
 "5"
 "2"
 "6"

The expected output is

6194-element Vector{Union{Missing, Int64}}:
  5
 11
 11
  3
  1
  4
  2
  7
  1
 11
  ⋮
 10
 13
 12
 11
 12
 11
  5
  2
  6
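A possible fix in the data-loading code (a sketch; the actual file path and reader options used by Survey.jl may differ) is to tell CSV.jl that "NA" marks missing values, so the column is parsed as Union{Missing, Int64}:

using CSV, DataFrames

# treat "NA" as missing instead of letting the column fall back to strings
apipop = CSV.read("apipop.csv", DataFrame; missingstring = "NA")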

Testing and modifying `total.jl`

Right now there is only rudimentary testing for total.jl and the total function. Add tests, especially for CategoricalArray count estimation; a sketch of what such tests could look like is below.
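A sketch of the kind of tests this asks for (the design constructor follows examples elsewhere in this tracker; the return shapes and the error type are assumptions, not verified behaviour):

using Test, DataFrames

@testset "total.jl" begin
    srs = SimpleRandomSample(apisrs; popsize = :fpc)
    # main functionality: one estimate row for a single numeric variable
    @test nrow(total(:api00, srs)) == 1
    # count estimation for a categorical column: one row per level
    @test nrow(total(:stype, srs)) == length(unique(apisrs.stype))
    # error check: a non-existent column should throw
    @test_throws ArgumentError total(:not_a_column, srs)
end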

Code coverage needs to be improved

DISCLAIMER: The following is based on my personal opinion, not on some official information. Please argue with me if you disagree.

The design_update branch is neither properly nor completely tested. I suggest a plan of action to improve code coverage, along with some guidelines for test-writing.

First of all, we need to differentiate between unit tests and integration tests. Since they are conceptually different (unit testing is done in a white-box manner, integration testing in a black-box manner), they can be done in parallel or by different people. Generally, unit testing should be done while developing.

Unit testing
Every implemented function needs to be tested for every functionality it provides, starting from the base up (i.e. from constructors to high-level functionality like plotting). This means:

  1. Go through each file inside src in order, from low level to high level.

  2. Inside test/filename write tests for each function from the file src/filename such that the following are checked:

    • test main functionality

      Example: @test nrow(svyby(:api00, :stype, srs, svymean)) == apisrs.stype |> unique |> size |> first

    • test all if-else cases of the function

    • test errors (using @test_throws)

    • test warnings if appropriate (using @test_warn)

    The simplest function should have at least two tests: one for main functionality and one for error check.

  3. Make sure test/filename is included in the test/runtests.jl file.

Integration testing
The overall package needs to behave as expected. We need to think about scenarios that we would use the package for and build tests according to our expectations. The tests should answer common sense questions and edge-case questions starting with "What should happen if ..." (e.g. What should happen if I have an empty survey? or What should happen if I have a strata specifier for a SimpleRandomSample? or What should happen if I only have one data point (and the rest are missing)? or anything else that your mind can think of).

General guidelines

  • It is NOT necessary to avoid hardcoding tests; sometimes it is even advisable (as argued in several online discussions on the topic). What IS necessary is making sure that the tests do not depend on the environment.
  • The test files should be as clean as the source files: no lingering prints, shows or comments, clear code, no long lines etc.
  • We should separate the unit tests from the integration tests. I suggest a structure like:
test/
    integration/
        integration1.jl
        integration2.jl
        ...
    unit/
        SurveyDesign.jl
        svyby.jl
        ...
    runtests.jl

... but with more descriptive file names inside the integration folder.
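A possible wiring of this layout in test/runtests.jl (file names are the placeholders from the tree above):

# test/runtests.jl
using Survey, Test

@testset "Survey.jl" begin
    @testset "unit" begin
        include("unit/SurveyDesign.jl")
        include("unit/svyby.jl")
    end
    @testset "integration" begin
        include("integration/integration1.jl")
        include("integration/integration2.jl")
    end
end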

All plots should look alike

The svyboxplot uses AlgebraOfGraphics while the others use Makie. I suggest using AlgebraOfGraphics everywhere.

README needs to be updated

The README currently shows examples using the old design. The values given in those examples are not even correct (we corrected them for the new design). The old examples need to be replaced with new ones corresponding to the new design.

Enhancing, testing and documenting Horvitz-Thompson total/mean for `singledesign`

We had implemented ht_svytotal and HT_HartleyRaoVarApprox in the ht.jl file.

  1. Since the HT total is a design-agnostic weighted total (see the formula sketched below), and you only need the weighting information to calculate the total/mean, it should work with the currently implemented SurveyDesign object, giving correct totals for any design without the need to specify a cluster = or strata = argument. I think we should test the function and compare against some complex designs from R. If this is indeed the case, then we could encapsulate the relevant parts of SurveyDesign into an HTSurveyDesign object. This would be a crude approximation of complex designs, but it would be a good starting point for the package.
  2. ht_svymean is still a work in progress, as we haven't been able to figure out the correct scaling factor between the variance of ht_svytotal given by HT_HartleyRaoVarApprox and the variance of the mean. I think a look through the Sarndal textbook will give us the relation between the variances of the Hartley-Rao total and mean (it is not as simple as 1/N^2).
  3. We need to test and document all of the above.
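For reference, the Horvitz-Thompson total mentioned in point 1 is just the weighted total,

\hat{t}_{HT} = \sum_{i \in s} y_i / \pi_i = \sum_{i \in s} w_i y_i, with w_i = 1 / \pi_i,

which is why it only needs the weights (or inclusion probabilities) and no cluster or strata specification.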

Extended Tutorial using Survey.jl

In the current inline documentation and help pages we provide some examples for the user and explain the API.

Eventually, I think an extended tutorial page/Pluto notebook/Medium blog using Survey.jl would be beneficial to the community and to the package. Think of something along the lines of Getting Started with DataFrames or https://pandas.pydata.org/docs/user_guide/10min.html, but with more explanation and narrative, focused on one particular publicly available survey. The idea would be to take a first-time user through installing the package and explaining its key functionality to answer a few research questions.

Add frequency weights to `design.variables`

Since many functions require frequency weights, we can store them in design.variables rather than having each function derive them from probs.

Instead of probs, perhaps we should call it probability_weights_reserved, and similarly frequency_weights_reserved.

The reason we're only storing probability weights is that we didn't anticipate so many Julia functions would use frequency weights. I believe the survey package in R only stores probability weights, and most R functions use probability weights.

Note: we want the weight information to live in the same data frame, rather than as a property of svydesign, for performance reasons; the combine function in DataFrames works a lot faster that way.
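A minimal sketch of what this could look like (the column name follows the suggestion above; the field layout is an assumption):

# derive frequency weights once from the stored probability weights and keep
# them alongside the data, so combine/groupby can use them directly
design.variables[!, :frequency_weights_reserved] = 1 ./ design.variables[!, :probs]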

Testing using other survey datasets

So far, most of the testing suite is limited to the API dataset. I suggest improving testing by using other publicly available survey datasets. Examples from Lumley's R survey textbook could be used (pg. 7, section 1.2.1), e.g. NHANES, SHS, SIPP.

http://asdfree.com (Analyze Survey Data for Free) has many real-world datasets and examples with the corresponding R survey code.

Negative float as sampsize and popsize

There is another issue in SurveyDesign.jl. If you write the following code, the error message shown is not what it should be:
srs = SimpleRandomSample(apisrs, popsize = -2.8, ignorefpc = true)
the error message is

either population size or frequency/probability weights must be specified

Also for this code
srs = SimpleRandomSample(apisrs, sampsize = -2.8, ignorefpc = true)
the function runs all the way to line 55 and only then errors with the following message:

ERROR: InexactError: UInt64(-3.0)

My suggestion is to add a check that lets the function proceed only if sampsize and popsize are non-negative integers.
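A sketch of such a check (the variable names follow the constructor calls above; the error type is an assumption):

# reject negative or non-integer sizes before any further processing
for (name, value) in (("popsize", popsize), ("sampsize", sampsize))
    if value isa Real && !(value isa Integer && value >= 0)
        throw(ArgumentError("$name must be a non-negative integer, got $value"))
    end
end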

Add tests `mean.jl`

Right now there is only rudimentary testing for mean.jl. Add tests, especially for CategoricalArray proportion calculations.

Clarifying assumptions made for Domain estimation, other schemes

After re-reading the Sarndal textbook (sections 3.3 and 5.8, and chapter 10), there are alternative estimators of the domain mean depending on whether the domain population sizes are known or not.

What has been implemented is one of these options; we should clarify the underlying assumptions to the user in the documentation, docstrings, etc.

General survey design

How about the following design for the general SurveyDesign:

  1. Every sample is a multistage stratified sample.
  2. If no strata are passed, it means there is 1 stratum, that contains everything.
  3. If no clusters are passed, it means every observation is its own cluster.

For example, a simple random sample is then a multistage sample where there is one stratum and every observation is a cluster.

All the formulae will be for a multistage stratified sample.
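A rough sketch of the defaulting rules above (column names are illustrative, not the package's actual internals):

using DataFrames

# fill in the trivial strata/clusters when the user passes none
function default_design_columns!(data::DataFrame; strata = nothing, clusters = nothing)
    if strata === nothing
        data[!, :_strata] .= 1              # a single stratum containing everything
    end
    if clusters === nothing
        data[!, :_cluster] = 1:nrow(data)   # every observation is its own cluster
    end
    return data
end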

Variance Estimation for Multistage and other Complex designs

Calculating survey means and totals is relatively straightforward compared to calculating their variances under certain designs. For example, in the Horvitz-Thompson estimator there is a double inclusion probability term (\pi_{ij}) which is not readily generalised to arbitrary sampling designs. For now I implemented the Hartley-Rao variance approximation, which bypasses the calculation of double inclusion probabilities.
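For reference, the exact Horvitz-Thompson variance of the total is

V(\hat{t}_{HT}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) (y_i / \pi_i)(y_j / \pi_j),

which requires \pi_{ij} for every pair of units; approximations like Hartley-Rao exist precisely to avoid computing these.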

It is interesting to note how R, SAS, Stata, SUDAAN, etc. calculate their variances. For some 'elementary' sampling designs they use closed-form solutions, but for more complex multistage designs they use replication or resampling methods such as Balanced Repeated Replication (BRR), the jackknife, or other bootstrap-type techniques. Debugging through the R survey package code, which is illegible in most places, I found it difficult to work out exactly what R does in which cases.

Estimating variances of subpopulation mean/total estimators in Stata is covered in its documentation.

The SAS 9.2 documentation on variance estimation says that for multistage sampling designs it uses a Taylor series variance method that only considers the first stage of the sample design, and that BRR- and jackknife-based estimation does not take the finite population correction into account.

Are these (or other) reasonable assumptions/relaxations of the variance estimation problem for the Survey.jl package? Perhaps, but it is worth exploring in more detail. I think it is worth first trying to implement the analytical solutions for the variance of the mean/total estimators, especially given the power of the Julia language; if that doesn't work, we can fall back to replication/resampling methods.

Remove dimnames.jl?

This functionality is not really needed. It is simply calculating the size of the data frame used by the design. This is easily done using DataFrames.size. I suggest we remove it.

`ignorefpc=true` doesn't seem to work for `SimpleRandomSample`

For SimpleRandomSample setting ignorefpc=true should result in fpc being equal to 1.0, since we allow fpc to be of type Float64. But this doesn't happen:

julia> srs_ignorefpc = SimpleRandomSample(apisrs; popsize = :fpc, ignorefpc = true)
┌ Warning: assuming all weights are equal to 1.0
└ @ Survey ~/Work/xKDR/Survey.jl/src/SurveyDesign.jl:162
SimpleRandomSample:
data: 200x42 DataFrame
weights: 1.0, 1.0, 1.0, ..., 1.0
probs: 1.0, 1.0, 1.0, ..., 1.0
fpc: 6194, 6194, 6194, ..., 6194
popsize: 6194
sampsize: 200
sampfraction: 0.0323
ignorefpc: true

For StratifiedSample the same code looks ok:

julia> strat_ignorefpc = StratifiedSample(apistrat, :stype; popsize = :fpc, ignorefpc = true)
┌ Warning: assuming all weights are equal to 1.0
└ @ Survey ~/Work/xKDR/Survey.jl/src/SurveyDesign.jl:360
StratifiedSample:
data: 200x45 DataFrame
strata: stype
weights: 1.0, 1.0, 1.0, ..., 1.0
probs: 1.0, 1.0, 1.0, ..., 1.0
fpc: 1, 1, 1, ..., 1
popsize: 4421, 4421, 4421, ..., 755
sampsize: 100, 100, 100, ..., 50
sampfraction: 0.0226, 0.0226, 0.0226, ..., 0.0662
ignorefpc: true

(there's also extra information that shouldn't be there, I mentioned this in issue #111.)

`ClusterSample` with `mean` and `total`

TODO: a data structure for single-stage cluster sampling, following the approach in these lecture notes from PSU.

Add support in SurveyDesign.jl with svymean and svytotal working as expected (compare with R). Later check usage with svyquantile.
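A hypothetical sketch of the intended usage (the constructor name and keyword arguments are assumptions about the future API, modelled on the existing SimpleRandomSample/StratifiedSample constructors):

# single-stage cluster sample of the api data, clustered on district number
dclus = ClusterSample(apiclus1; clusters = :dnum, weights = :pw)
svymean(:api00, dclus)     # compare against R's svymean(~api00, dclus1)
svytotal(:enroll, dclus)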

Rename `data` to `load_data` and improve it

... to avoid name clash with AlgebraOfGraphics.

Also, you could return a named tuple and encourage the user to use

(; apistrat) = dataset(api)

to avoid cluttering the namespace.

Parsing of keywords in Survey Design

The survey design does not check whether the sum of the probabilities is less than or equal to 1. Even when the sum is more than one, the design works and throws no error.
srs_w_p = SimpleRandomSample(apisrs, ignorefpc = false, weights = :fpc, probs = fill(0.3, size(apisrs_original, 1)))
This chunk of code throws no error even though the sum of the probabilities is more than 1.
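A sketch of the kind of validation this suggests (whether to bound each probability, their sum, or both is up for discussion; the error type is an assumption):

# each sampling probability must lie in (0, 1]
all(0 .< probs .<= 1) || throw(ArgumentError("each probability must be in (0, 1]"))
# the property flagged in this issue
sum(probs) <= 1 || @warn "sum of probabilities exceeds 1"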

Julian design for the package

At first, we should make the package work for simple random samples and then extend the functionality to other design types. The current design of the package doesn't allow development to be done in that way.

How about:

abstract type SurveyDesign end

struct SimpleRandomSample <: SurveyDesign
    data::DataFrame
end

struct StratifiedSample <: SurveyDesign
    data::DataFrame
end

function plot(x::SurveyDesign)
    # works on any child of SurveyDesign
    plot(x.data)
end

function mean(variable, x::SimpleRandomSample)
    # computes the mean of a variable from a simple random sample
end

Later, we can add:

function mean(variable, x::StratifiedSample)
    # computes the mean of a variable from a stratified sample
end

This gives pristine clarity on which functions work on which design types.
If we want to keep the API similar to R, we should consider wrappers.

Return strata means if option is specified in `svymean` function call

Hi all, in some of our research, we care about not just the combined survey mean/total for a StratifiedSample, but the individual strata means/totals as well.

The suggestion is to return ȳₕ if specified via a keyword argument in the call to svymean(x::Symbol, design::StratifiedSample). Alternatively, this functionality could be separated out using multiple dispatch.

svyglm field status

Implemented Fields

The following fields in svyglm have been implemented

aic
family
coefficients
formula
data
model
deviance
offset
y
rank
linear.predictors
fitted.values
prior.weights
weights
residuals
converged
terms
control
contrasts
naive.cov
df.null
null.deviance
df.residual

df.null, null.deviance

df.null is the dof_residual (residual degrees of freedom) of the null model, and null.deviance is the deviance of the null model. Computing both of these requires the null model. In R these are output by GLM; Julia's GLM has no such output (nulldeviance will be in a future Julia GLM release). Right now this is implemented by creating a glm with the null-model formula and calling deviance and dof_residual on it.
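A minimal sketch of that workaround, reusing the ell ~ meals example from the api data (the exact data frame and family are assumptions based on the doctest output quoted elsewhere in this tracker):

using GLM, DataFrames, Distributions

# fit the intercept-only (null) model explicitly, then read off
# the null deviance and the null residual degrees of freedom
null_model = glm(@formula(ell ~ 1), apipop, Normal(), IdentityLink())
null_deviance = deviance(null_model)
df_null = dof_residual(null_model)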

df.residual

It is defined as

g$df.residual <- degf(design)+1-length(coef(g)[!is.na(coef(g))])

However, it requires the degf function.
Julia's GLM already provides the equivalent of df.residual (dof_residual).

naive.cov

This is implemented as vcov(glm)/GLM.dispersion(glm.model,true)

Unimplemented Fields

The following fields have not been implemented yet

xlevels
effects
cov.unscaled

xlevels, effects

xlevels is a record of the levels of the factors used in fitting. effects is related to the final weighted linear fit. It would be great if someone could point to some resources regarding these.

cov.unscaled

cov.unscaled requires translations of R functions svy.varcoef and svyCprod.

Out of scope

The following fields are out of scope

boundary
iter
call
method
survey.design
qr
R

boundary

This is a logical value which answers the question "Is the fitted value on the boundary of the attainable values?". It is unclear to me how one specifies this boundary. Looking at the source, it seems that R's GLM outputs this value, which would mean we can't implement it until Julia's GLM has it.

iter

Number of iterations. Julia's GLM does not output it (unlike R's GLM), so we cannot add this field until it is added to Julia's GLM.

call, method, survey.design

All of these fields store an expression which can be eval'ed: call is the expression that was used to call svyglm, method is the name of the method, and survey.design is the expression that was used to create the design. Right now it does not make sense to implement these in the Julia svyglm.

qr, R

qr is the QR decomposition and R is the R factor of that decomposition. Julia's GLM does not use a QR decomposition, hence it does not make sense to implement these.

DataFrames-like API

It's really great that you aim to replicate the features provided by the R survey package, as it's really the reference in that domain. However, its API is quite ad hoc and forces users to learn a completely new syntax when moving from unweighted to weighted/survey analyses. Have you considered adopting a syntax based on DataFrames' groupby and combine/select/transform? In R, the recent srvyr package does that by wrapping survey functions in a dplyr-like syntax.

In particular, svyby(:api00, [:cname, :meals], dclus1, svymean) could be written as combine(groupby(dclus1, [:cname, :meals]), :api00 => svymean). With DataFramesMeta, this could become @combine(groupby(dclus1, [:cname, :meals]), svymean(:api00)). One would also be able to compute the mean of all columns using combine(dclus1, All() .=> svymean).

In terms of implementation I saw that you already use combine under the hood so it shouldn't be problematic.

Cc: @bkamins

ignorefpc correct usage and implementation

if ignorefpc
  @warn "assuming all weights are equal to 1.0"
  weights = ones(nrow(data))
end

If the population size is not known or can't be estimated for some reason, the ignorefpc parameter should allow usage without the (N-n)/N correction factor applied to the mean/variance etc. But I don't think the way it is currently done (above) is fully correct. Right now the weights vector is set to all 1s, which is not quite ignoring the fpc (although it does end up giving correct variances and means from svymean and svytotal).

This functionality and its usage should be explored in more depth. If the current implementation is found lacking, it could be removed and replaced. Although, for what it's worth, any survey that has sampled a substantial portion of the population would want to use the fpc correction factor in its calculations.
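One possible alternative (a sketch, not the current implementation, and mirroring what the StratifiedSample output in another issue appears to do): leave the weights untouched and neutralise the correction factor itself, so (N - n)/N is effectively 1 in the variance formulas.

if ignorefpc
    fpc = ones(nrow(data))   # correction factor of 1 for every observation
end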

Weighted `quantile` implementation

Currently, Statistics has all the Hyndman & Fan rules, and StatsBase has some rudimentary rule for weighted quantiles (which I have been unable to classify), but from what I could find, the formulae under the "Weighted" column on the right-hand side of the table are not implemented in Julia yet. They are straightforward to implement. Later, we could look at encapsulating parts of the code into StatsBase's quantile.
[Table of quantile definitions, with the weighted analogues in the "Weighted" column on the right-hand side]

Except for simple random samples when qrule="hf7", our quantile function is incorrect. Weighting logic needs to be applied.
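As a starting point, a minimal sketch of a weighted quantile (one possible step-function rule, not necessarily any of the weighted analogues in the table above):

# assumes w contains non-negative weights aligned with x
function weighted_quantile(x::AbstractVector, w::AbstractVector, p::Real)
    perm = sortperm(x)
    xs, ws = x[perm], w[perm]
    cw = cumsum(ws) ./ sum(ws)        # cumulative weight fractions
    i = searchsortedfirst(cw, p)      # first index whose cumulative weight reaches p
    return xs[min(i, length(xs))]
end

weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5)   # == 2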

Doctests giving warnings and unnecessary output

When running the command julia docs/make.jl I get this output:

[ Info: SetupBuildDirectory: setting up build directory.
[ Info: Doctest: running doctests.
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

ell ~ 1 + meals

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  6.86665   0.350512    19.59    <1e-84   6.17966    7.55364
meals        0.410511  0.00613985  66.86    <1e-99   0.398477   0.422545
────────────────────────────────────────────────────────────────────────
Degrees of Freedom: 6193.000324249264 (i.e. Null); 6192.000324249264 Residual
Null Deviance: 1.7556928968296547e6
Residual Deviance: 1.0196009035970895e6
AIC: 49195.42124574161
[ Info: ExpandTemplates: expanding markdown templates.
┌ Warning: duplicate docs found for 'Survey.svyby' in src/index.md:15-17```@autodocs
│ Modules = [Survey]
```
└ @ Documenter.Expanders ~/.julia/packages/Documenter/48VFh/src/Utilities/Utilities.jl:34
[ Info: CrossReferences: building cross-references.
[ Info: CheckDocument: running document checks.
[ Info: Populate: populating indices.
[ Info: RenderDocument: rendering document.
[ Info: HTMLWriter: rendering HTML pages.
┌ Warning: Documenter could not auto-detect the building environment Skipping deployment.
└ @ Documenter ~/.julia/packages/Documenter/48VFh/src/deployconfig.jl:75

We should get rid of the warnings and, probably, of the unnecessary output coming from svyglm:

StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Normal{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

ell ~ 1 + meals

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  6.86665   0.350512    19.59    <1e-84   6.17966    7.55364
meals        0.410511  0.00613985  66.86    <1e-99   0.398477   0.422545
────────────────────────────────────────────────────────────────────────
Degrees of Freedom: 6193.000324249264 (i.e. Null); 6192.000324249264 Residual
Null Deviance: 1.7556928968296547e6
Residual Deviance: 1.0196009035970895e6
AIC: 49195.42124574161

Cleaning up `design_update`

Clean up the design_update branch so it contains only working elements, i.e. remove structs and functions that have not been tested or are still in development. Work in progress could be moved to new branches, kept in local branches, or removed entirely.

Make issues out of the things that are removed (e.g. poststratify.jl) so everyone remembers; add them to a milestone and label them nicely.

#98 is a subtask of this.

Why need `svyplot`, `svyhist`, `svyboxplot`?

Is there a reason to depend on a plotting package?

Is there a value added compared to using one directly?

(; apistrat) = dataset(api) # see issue #27 on renaming `data` to `dataset`

dstrat = svydesign(data = apistrat, id = :1, strata = :stype, weights = :pw, fpc = :fpc);

using AlgebraOfGraphics
data(dstrat.variables) * mapping(:api99, :api00, markersize = :pw) |> draw
data(dstrat.variables) * mapping(:enroll, weights = :pw) * visual(Hist) |> draw
data(dstrat.variables) * mapping(:stype, :enroll; weights = :pw) * visual(BoxPlot) |> draw

Then you don't need to define all kinds of plotting functions here, but you encourage the user to use the full power of Makie/AlgebraOfGraphics or StatsPlots.jl/Plots.jl.

To simplify the life of users, you could of course find a way to do
data(dstrat) * mapping(...) in AoG or @df dstrat plot(...) in StatsPlots (without the .variables). Probably it would be enough to define getproperty on a survey design.
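A very rough sketch of that idea (assuming the design is, or becomes, a struct with a variables::DataFrame field; the type and field names are assumptions):

# forward unknown properties to the wrapped DataFrame, so that column access
# like dstrat.enroll works without going through dstrat.variables
function Base.getproperty(d::SurveyDesign, name::Symbol)
    name in fieldnames(typeof(d)) ? getfield(d, name) :
                                    getproperty(getfield(d, :variables), name)
end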

Progressing towards merging design_update into main

In a few weeks, the design_update branch will be overhauling the main branch with more robust and "Julian" data structures for survey sampling designs, more functionality, testing and documentation.

Currently, the main branch supports weighted means, but doesn't quite work with cluster sampling and higher-order sampling schemes. Variance and SE estimates of the means and totals are also incorrect in places.

In design_update there are new data structures, including SimpleRandomSample and StratifiedSample, which provide functionality for simple random sampling and stratified random sampling respectively. These data structures support summary estimators like svymean, svytotal and svyquantile. Estimation within subpopulations (domain estimation) using svyby has been implemented for SimpleRandomSample. Variances for specific designs, using their closed-form solutions, have been tested and added. The Hartley-Rao approximation of the Horvitz-Thompson total has been fixed, among various other additions that will be documented.

The SurveyDesign structure attempts to be a 'general' structure that can intelligently parse any arbitrary survey design given cluster, strata, weights and population size information, similar to the R survey package's svydesign() function. It currently supports SRS and stratified data. In due course we will add a ClusterSample structure, and then combine the building blocks to support multistage sampling designs, initially using unbiased estimators and their variances (Chap. 4, Sarndal et al., Model Assisted Survey Sampling, 1992).

Improved commenting and resource linking

Improve the inline comments and the explanation of what happens in the code: LaTeX formulas for the mean and total estimators etc., with links to the source and page numbers they were taken from.

The key long-term idea is that the package should be self-contained and self-explanatory, without the need to go to external resources (some of which may be paywalled).

Proportion and count estimation for `SimpleRandomSample` and `StratifiedSample`

In R's survey, population proportion and population count estimation is done by passing a 'factor' (the analogue of a CategoricalArray) variable to svymean/svytotal (see pg. 21 of the Lumley textbook).

Basic proportion and count estimation functionality was added to SimpleRandomSample and started for StratifiedSample, inside their respective svymean and svytotal functions, to emulate the functionality from R. This is currently done using an if statement.

I think it would be neater and cleaner to implement these features via multiple dispatch instead of catching them with the if statement. Multiple dispatch would also give minor speed improvements for very large datasets.
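A rough sketch of the dispatch-based approach (function and field names are illustrative, not the package's actual internals):

using CategoricalArrays, Statistics

# dispatch on the column's element type rather than branching with `if`
svymean(x::Symbol, design::SimpleRandomSample) = _svymean(design.data[!, x], design)

# numeric column: the usual estimator of the population mean
_svymean(v::AbstractVector{<:Real}, design) = mean(v)

# categorical column: estimate the proportion of each level
_svymean(v::CategoricalVector, design) =
    [(level = l, proportion = mean(v .== l)) for l in levels(v)]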
