mountainmath / cancensus

R wrapper for calling CensusMapper APIs

Home Page: https://mountainmath.github.io/cancensus/index.html

License: Other

R 99.40% CSS 0.06% Shell 0.54%

cancensus's Issues

Improve search_census_vectors and child_census_vectors functionality

Add functionality to search by CensusMapper internal census vector variable as additional option to pinpoint variables
Add functionality to set maximum depth for child census vectors, i.e. max_level=NA as an additional parameter. Then e.g. max_level=1 would only get direct children and no grandchildren.
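A minimal sketch of how a max_level cap could behave, using a toy table with the vector and parent_vector columns. The helper name children_up_to and the toy codes are hypothetical and are not part of the package.

```r
# Toy stand-in for the output of list_census_vectors(); the codes are made up.
vectors <- data.frame(
  vector        = c("v1", "v2", "v3", "v4", "v5"),
  parent_vector = c(NA,   "v1", "v1", "v2", "v4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect descendants of `root`, going at most
# `max_level` generations down. max_level = NA means no limit.
children_up_to <- function(vectors, root, max_level = NA) {
  found <- character(0)
  frontier <- root
  level <- 0
  while (length(frontier) > 0 && (is.na(max_level) || level < max_level)) {
    # Next generation: every vector whose parent is in the current frontier
    frontier <- vectors$vector[vectors$parent_vector %in% frontier]
    frontier <- setdiff(frontier, found)  # never revisit a vector
    found <- c(found, frontier)
    level <- level + 1
  }
  found
}

children_up_to(vectors, "v1", max_level = 1)  # direct children only
children_up_to(vectors, "v1")                 # all descendants
```

With max_level = 1 the loop stops after one generation, so grandchildren such as v4 are left out.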

parent/child_census_vectors should also allow named vectors

Currently the functionality of child_census_vectors and parent_census_vectors requires as an input a standard census vectors list, like the one produced by a call to list_census_vectors.

This works fine in a workflow where you are browsing through the vectors, but it adds overhead when you already know which vector to target.

Suppose you want to get the child vectors of Non-official languages. Currently you have to:

search_census_vectors("Non-official language",'CA16')
list_census_vectors('CA16') %>% 
  filter(vector == 'v_CA16_551') %>% 
  child_census_vectors()

We should allow this to work in the situation where you already know the vector code.

child_census_vectors('v_CA16_551')

Also, running the above currently does nothing useful, yet it throws neither a warning nor an error, which is a problem in itself. The call simply returns its input:

[1] "v_CA16_551"

Unfortunately I didn't think about this while testing the latest version before submitting to CRAN, but we can add this for the next release, which can come sooner than the last one did.
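One possible shape for this, as a sketch only: normalize the input at the top of child_census_vectors and parent_census_vectors so that a data frame, a plain character vector, or a named character vector all reduce to a set of vector codes. The helper name as_vector_codes is hypothetical.

```r
# Hypothetical helper: accept either a data frame of vectors (with a `vector`
# column, as used in the pipeline above) or a character vector of codes,
# named or unnamed, and return the bare codes.
as_vector_codes <- function(x) {
  if (is.data.frame(x)) {
    if (!"vector" %in% names(x)) {
      stop("Data frame input needs a `vector` column.", call. = FALSE)
    }
    return(as.character(x$vector))
  }
  if (is.character(x)) {
    return(unname(x))
  }
  # Fail loudly instead of silently returning the input
  stop("Expected a data frame of census vectors or a character vector of codes.",
       call. = FALSE)
}

as_vector_codes("v_CA16_551")
as_vector_codes(c(non_official = "v_CA16_551"))
as_vector_codes(data.frame(vector = "v_CA16_551", stringsAsFactors = FALSE))
```

Raising an error for unsupported input also addresses the silent pass-through described above.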

Error in list_census_regions() for CA01 and CA06

> list_census_regions('CA01')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) : 
  could not find function "handle_cm_status_code"

> list_census_regions('ca06')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) : 
  could not find function "handle_cm_status_code"

> list_census_regions('CA06')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) : 
  could not find function "handle_cm_status_code"

list_census_vectors description not working properly.

There is an issue with the description field in list_census_vectors: it pulls incorrect parent vector labels.

Inconsistent handling of search queries by find_census_vectors

(1) In find_census_vectors(), type = c("male", "female") sometimes works, but throws a warning. Try:

find_census_vectors("full year, full time", "CA16",
                    type = c("male", "female"),
                    query_type = "exact")

Could you maybe remove the warning? Just please don't remove the ability to return data for both genders with one command, it is very handy.

(2) And sometimes type = c("male", "female") doesn't work:

find_census_vectors("part year and/or part time", 
                    "CA16",
                    type = c("male", "female"),
                    query_type = "exact")

Says 'No exact matches found. Please check spelling and try again or consider using semantic or keyword search.' However, this works (note exact same query term):

find_census_vectors("part year and/or part time", 
                    "CA16",
                    type = "male",
                    query_type = "exact")

Could you please make this consistent and allow to set type = c("male", "female")?

Regions parameter should accept an R list, not a stringified JSON dictionary

Right now the region parameter looks something like region='{"PR":["59"]}'. It would be nice to be able to use something more idiomatic to the R language, like

regions = list(PR = c("59"), ...)

This should be possible with jsonlite::toJSON, although we may need to ensure that the list elements are vectors to avoid making the server barf.
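To illustrate the target format, here is a small base R sketch that turns a named list into the JSON string the API currently expects. It is written without jsonlite only so that it runs with no extra packages; for character vectors, jsonlite::toJSON with its default settings should produce the same string. The function name regions_to_json is hypothetical.

```r
# Hypothetical helper: convert list(PR = c("59")) into '{"PR":["59"]}'.
# Every element is coerced to character and always written as a JSON array,
# even when it has length one.
regions_to_json <- function(regions) {
  stopifnot(is.list(regions), !is.null(names(regions)))
  parts <- vapply(names(regions), function(level) {
    ids <- paste0('"', as.character(regions[[level]]), '"', collapse = ",")
    sprintf('"%s":[%s]', level, ids)
  }, character(1))
  paste0("{", paste(parts, collapse = ","), "}")
}

regions_to_json(list(PR = c("59")))
regions_to_json(list(PR = 59, CMA = c("59933", "35535")))
```

Coercing to character first means a numeric region id such as 59 is still sent as the string "59".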

Regions list for invalid datasets gives strange errors

This is due to the fact that if you query the API for an invalid dataset, like

https://censusmapper.ca/data_sets/CA17/place_names.csv

You get a malformed CSV file with no content instead of some sort of error. I don't know whether it is possible to check for this issue on the server side, or if we should implement validation for the dataset parameter on the client side to prevent this issue.
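A sketch of the client-side option, assuming the set of valid dataset codes is available before the request is made. The function name validate_dataset is hypothetical, and the default list of known codes is hard-coded here purely for illustration; in practice it would more likely come from the datasets listing.

```r
# Hypothetical client-side check. `known` is an illustrative assumption,
# not the authoritative list of datasets.
validate_dataset <- function(dataset, known = c("CA01", "CA06", "CA16")) {
  dataset <- toupper(dataset)
  if (!dataset %in% known) {
    stop("Unknown dataset '", dataset, "'. Known datasets: ",
         paste(known, collapse = ", "), call. = FALSE)
  }
  dataset
}

validate_dataset("ca16")
# validate_dataset("CA17") stops with an error before any API call is made
```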

Possible vignette amalgamation

I've been working on fixing vignette builds for Travis, and it has occurred to me that the four existing vignettes (plus the README) cover pretty similar ground. It might be better to amalgamate the existing ones into a single, longer vignette (or perhaps two) with more subsections. I don't think "more is better" really applies here, since most people will probably only read one vignette.

The README file can then highlight a single cool example, and then link to the hosted vignette sections for more info.

It is also possible that some of the existing examples, particularly the thoughtful or complex ones, would make better blog posts than vignette content. It's not unusual to point to blog posts in the README file, either.

Thoughts on this? Should I try amalgamating the vignettes myself? And how should this relate to #65?

More informative server messages

This isn't really an issue with the client R package, but the fix would need to be implemented in the package as well. We should return more informative error messages for malformed API calls and for API rate-limit issues, so that users know what went wrong.
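On the package side, this could be as simple as translating the HTTP status code of a response into a readable message. The sketch below assumes the server signals these conditions with the standard status codes 400, 401 and 429; which codes the CensusMapper API actually returns is not confirmed here, and the function name explain_status is hypothetical.

```r
# Hypothetical mapping from HTTP status code to a user-facing message.
explain_status <- function(status_code) {
  switch(as.character(status_code),
    "200" = "OK",
    "400" = "Malformed request: check the dataset, regions and vectors arguments.",
    "401" = "Unauthorized: check that the API key is set and valid.",
    "429" = "Rate limit reached: wait before retrying or reduce the number of calls.",
    # Fallback for anything not listed above
    paste0("Unexpected response from the server (HTTP ", status_code, ").")
  )
}

explain_status(429)
explain_status(503)
```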

Variable description in list_census_vectors

This is a more general look at the issue identified in #87

The process for creating the variable descriptions should work roughly something like this:

Take the list of vectors and identify the appropriate parent vector as specified in the parent_vector field
Look up that parent_vector's label and prepend it to the child vector's label
Do this recursively to pull in the grand-parent vector, the grand-grand-parent vector, and so on, as necessary. The 2016 census has up to 6 levels of depth, but I don't want to hard code that.

The current process to traverse the label hierarchy and create concatenated variable descriptions works like this, where result is a data frame of the census vectors and labels for that census dataset:

# traverse hierarchy to add description field to variables
    result$description <- result$label
    list <- result
    browser()
    while (any(!is.na(list$parent_vector))) {
      parent_list=result %>% dplyr::filter(vector %in% list$parent_vector)
      result$description[!is.na(list$parent_vector)] <-
        paste(parent_list$label,result$description[!is.na(list$parent_vector)],sep=", ")
      list=parent_list
    }

This doesn't actually work correctly, as it doesn't index the positions of the parent vectors correctly, and results in misaligned concatenation.

@mountainMath and I have tried a few different approaches over the last while to fix this, but have been stumped. I've had better luck by focusing on recursive join functions but can't quite get it to work.

As an alternative, I put together an approach that uses a function to traverse the label hierarchy for any given vector, then splits the data frame into rows and applies that function over the resulting list. It works like this:

# traversal function
parent_labels <- function(vector_list) {
  base <- vector_list
  n <- 0
  vector_list <- result[result$vector == base$parent_vector,] %>% distinct(vector, .keep_all = TRUE)
  # Recursively assemble all parents of any vector
  while(n != nrow(vector_list)) {
    n = nrow(vector_list)
    new_list <- result %>% filter(vector %in% vector_list$parent_vector)
    vector_list <- vector_list %>% rbind(new_list) %>% distinct(vector, .keep_all = TRUE)
  }
  # Reverse order and collapse the parent labels into a single string
  labels <- vector_list[order(desc(row.names(vector_list))),]$label
  labels <- paste(labels, collapse = ": ")
  return(labels)
}

result_split <- split(result,seq_len(nrow(result)))
full_labels <- lapply(result_split, parent_labels)

Despite using lapply rather than an explicit loop, this process is unacceptably slow when run on the entire dataset. I've benchmarked each function call at around 0.015 seconds, so with 5000 vectors that adds up to roughly 75 seconds, which is too slow to be used reliably.

IMO, this is the biggest outstanding issue left before cleaning up for CRAN. The vectors are hard to find and hard to parse without reliable descriptions, and this affects search_census_vectors as well. Does anyone have a better solution for addressing this?
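One possible way around the per-row traversal is to climb the hierarchy one level at a time for all rows at once, using match() to find the row position of each parent. The sketch below runs on a toy table with the vector, label and parent_vector columns described above. The function name build_descriptions, the toy data and the ": " separator are illustrative choices, and the sketch assumes the hierarchy contains no cycles.

```r
# Toy stand-in for the vector table; codes and labels are made up.
result <- data.frame(
  vector        = c("v1", "v2", "v3", "v4"),
  label         = c("Total", "Male", "0 to 14 years", "Female"),
  parent_vector = c(NA, "v1", "v2", "v1"),
  stringsAsFactors = FALSE
)

build_descriptions <- function(vectors, sep = ": ") {
  description <- vectors$label
  # Row position of each vector's parent (NA for top-level vectors)
  ancestor <- match(vectors$parent_vector, vectors$vector)
  # Each pass prepends one more level of ancestry to every row that still
  # has an ancestor, so the number of passes equals the hierarchy depth.
  while (any(!is.na(ancestor))) {
    has <- !is.na(ancestor)
    description[has] <- paste(vectors$label[ancestor[has]], description[has], sep = sep)
    # Move every pointer up one level, to the ancestor's own parent
    ancestor[has] <- match(vectors$parent_vector[ancestor[has]], vectors$vector)
  }
  description
}

result$description <- build_descriptions(result)
result$description
```

Because the parent is looked up by position for each row, the labels stay aligned with their own children, and the depth is never hard-coded: the loop simply ends once no row has an ancestor left.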

Recent month data is not coming

Update making maps with cancensus vignette

Requires update for security flaw in rendering of html page in documentation. Not an issue with the package.

Could also benefit from an update to the vignette to account for newer versions of ggplot2 + sf interaction, and other example tools like mapdeck.

list_regions sends down CMAs that don't have census data

I noticed that the list_regions call sends down data for some CMAs that don't actually have census data attached to them. This is fixed on the server now, but it will take a day for the server-side cache to expire, so we should all make sure we refresh the cached regions in 24 hours.

Reproducibility issue in get_census

Related to #126

The following code throws a parsing error now.

dataset <- "CA16"

regions_list10 <- list_census_regions(dataset) %>% 
  filter(level=="CMA") %>% 
  top_n(10,pop) %>% 
  as_census_region_list

csd_geo <- get_census(dataset, level = 'CSD', regions = regions_list10)

Warning: 1 parsing failure.
row  col expected    actual         file
  9 rpid a double s_1_35_35 literal data

Data looks fine in download, but the warning is unexpected.

This is from code written pre-CRAN release.

"cannot open the connection" errors

I followed the cancensus installation instructions but was having some trouble running the example commands:

Querying CensusMapper API for regions data...
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'name,geo_uid,type,population,flag,CMA_UID,CD_UID,PR_UID
Canada,01,C,35151728,,,,
Ontario,35,PR,13448494,,,,
Quebec,24,PR,8164361,,,,

+                           vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
+                           level='CSD', use_cache = FALSE, geo_format = NA)
Querying CensusMapper API...
Downloading: 1.4 kB     Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'GeoUID,Type,Region Name,Area (sq km),Population ,Dwellings ,Households ,v_CA16_408: Occupied private dwellings by structural type of dwelling data,v_CA16_409: Single-detached house,v_CA16_410: Apartment in a building that has five or more storeys
5915001,CSD,Langley (DM),314.76313,117285,43720,41982,41980,21690,1100
5915002,CSD,Langley (CY),10.222,25888,12264,11840,11840,2730,40

These are my sessionInfo() results:

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] cancensus_0.1.5
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     digest_0.6.12    dplyr_0.7.4      withr_2.1.0     
 [5] assertthat_0.2.0 R6_2.2.2         jsonlite_1.5     magrittr_1.5    
 [9] git2r_0.19.0     httr_1.3.1       rlang_0.1.4      curl_3.0        
[13] bindrcpp_0.2     devtools_1.13.4  tools_3.4.3      glue_1.2.0      
[17] compiler_3.4.3   pkgconfig_2.0.1  memoise_1.1.0    bindr_0.1       
[21] tibble_1.3.4

There is a resolution suggested by @dshkol – install readr (install.packages('readr')).

Encoding issue with French names

French characters are not imported correctly. Here's an example:

data <- get_census(dataset='CA16', regions=list(PR="24"), 
  vectors=c("pop2016"= "v_CA16_401"),
  level='CSD',geo_format="sf") 
data %>% filter(GeoUID == 2401023) %>% pull(name)
[1] "Les Îles-de-la-Madeleine (Mï¿½)"

Submission to CRAN

My goal is to have this package end up on CRAN eventually. There's still work to do to clean up the vector/geo search and discovery, but I think it's pretty close otherwise. I'm not familiar with the process, so if anyone knows what we need to do in order to be approved, please share insights.

Migrate options() to environment variables as default

Right now we set the API key and cache path via options. It might be cleaner to read these from environment variables by default and fall back to options when the environment variables aren't set. Nothing would change in terms of user experience, but we would need to change the defaults in the function calls and update the docs and vignette accordingly.
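A sketch of the lookup order described here: environment variable first, then the existing option, then a default. The option name cancensus.api_key is the one currently in use; the environment variable name CM_API_KEY and the helper name cm_setting are assumptions made for illustration.

```r
# Hypothetical helper: prefer the environment variable, fall back to options().
cm_setting <- function(env_name, option_name, default = NULL) {
  value <- Sys.getenv(env_name, unset = NA)
  if (!is.na(value) && nzchar(value)) {
    return(value)
  }
  getOption(option_name, default = default)
}

# Function defaults could then look like:
#   api_key = cm_setting("CM_API_KEY", "cancensus.api_key")
options(cancensus.api_key = "key-from-options")
cm_setting("CM_API_KEY", "cancensus.api_key")
```

Existing users who only set options would see no change, since the option is still consulted whenever the environment variable is missing or empty.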

Better vector search handling of key words

search_census_vectors("income","CA16")

Returns too many rows but

search_census_vectors("median household income","CA16")

Returns nothing, not even related vectors. This is because the process for searching takes the entire string rather than tokenizing it.

search_census_vectors <- function(searchterm, dataset, type=NA, ...) {
  #to do: add caching of vector list here
  veclist <- list_census_vectors(dataset, ...)
  result <- veclist[grep(searchterm, veclist$label, ignore.case = TRUE),]
... lots of other code

We should tokenize the search term and search against the individual tokens rather than the complete string. Adding this is a to-do for the next version.
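A rough sketch of token-based matching in base R, run against a few made-up labels rather than the real vector list. The function name search_by_tokens and the ranking by number of matched tokens are illustrative choices, not an agreed design.

```r
# Hypothetical helper: split the search term into tokens, count how many
# tokens each label contains, and return matching labels, best match first.
search_by_tokens <- function(searchterm, labels) {
  tokens <- strsplit(tolower(searchterm), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  lowered <- tolower(labels)
  score <- rep(0L, length(labels))
  for (token in tokens) {
    # fixed = TRUE treats the token as a literal string, not a regex
    score <- score + grepl(token, lowered, fixed = TRUE)
  }
  hits <- which(score > 0)
  hits <- hits[order(-score[hits])]
  data.frame(label = labels[hits], matched_tokens = score[hits],
             stringsAsFactors = FALSE)
}

# Made-up labels for illustration
labels <- c("Total population",
            "Average household size",
            "Median total income of households in 2015 ($)")
search_by_tokens("median household income", labels)
```

Here the whole-string search would find nothing, while the token search ranks the income label first because it contains all three tokens.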

Deprecation warnings related to dplyr/tibble packages

I'm getting a bunch of deprecation warnings related to usage of dplyr and tibble packages when fetching information. Here's a reprex (note that I'm hiding my api key). This doesn't impact my work, but worth looking into:

library(cancensus)
#> Census data is currently stored temporarily.
#> 
#>  In order to speed up performance, reduce API quota usage, and reduce unnecessary network calls, please set up a persistent cache directory by setting options(cancensus.cache_path = '<path to cancensus cache directory>')
#> 
#>  You may add this option, together with your API key, to your .Rprofile.
my_api_key <- "<HIDDEN>"
options(cancensus.api_key = my_api_key)

all_regions <- list_census_regions("CA16")
#> Querying CensusMapper API for regions data...
all_vars <- list_census_vectors("CA16")

region_age_gender_educ <- get_census(dataset = "CA16", regions = list(C = "01"),
                                     vectors = c("v_CA16_65", "v_CA16_66", 
                                                 "v_CA16_83", "v_CA16_84", "v_CA16_101", "v_CA16_102",
                                                 "v_CA16_119", "v_CA16_120", "v_CA16_137", "v_CA16_138", "v_CA16_155", "v_CA16_156",
                                                 "v_CA16_173", "v_CA16_174", "v_CA16_191", "v_CA16_192", "v_CA16_209", "v_CA16_210",
                                                 "v_CA16_227", "v_CA16_228", "v_CA16_245", "v_CA16_246",
                                                 "v_CA16_5055", "v_CA16_5056",
                                                 "v_CA16_5058", "v_CA16_5059",
                                                 "v_CA16_5064", "v_CA16_5065", 
                                                 "v_CA16_5073", "v_CA16_5074",
                                                 "v_CA16_5076", "v_CA16_5077",
                                                 "v_CA16_5079", "v_CA16_5080"),
                                     level = "PR")
#> Census data is currently stored temporarily.
#> 
#>  In order to speed up performance, reduce API quota usage, and reduce unnecessary network calls, please set up a persistent cache directory by setting options(cancensus.cache_path = '<path to cancensus cache directory>')
#> 
#>  You may add this option, together with your API key, to your .Rprofile.
#> Querying CensusMapper API...
#> Downloading: 2.2 kB     Downloading: 2.2 kB     Downloading: 2.2 kB     Downloading: 2.2 kB
#> Warning: `data_frame()` is deprecated as of tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
#> Please use `as_tibble()` instead.
#> The signature and semantics have changed, see `?as_tibble`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.

Travis-CI Integration

I think this would be nice to add, but since I don't own the repository I can't set it up myself. If we enable it I'm happy to work on the configuration.

Fair warning, though: R CMD check fails pretty hard on the vignettes on my local machine, so it would likely fail at the outset on Travis as well.

Consider moving away from Travis CI for CI

Let's check whether we're affected by the upcoming changes to Travis for open-source projects and, if so, adjust and use one of the recommended alternatives: https://ropensci.org/technotes/2020/11/19/moving-away-travis/

Consider replacing jsonlite with rcppsimdjson

The new C++-based JSON parser simdjson is making some waves with very impressive benchmarks compared to established parsers like jsonlite. An R implementation just hit CRAN: https://github.com/eddelbuettel/rcppsimdjson.

Perhaps risky to adopt just yet but we should monitor. Possible downside would be adding an Rcpp dependency. But the upside could be significant improvement in parsing speed of long datasets for cancensus and cansim.

Force load spatial library if spatial data selected

When downloading census data using get_census with sf spatial format, one has to have already loaded the sf library, otherwise the resulting data object will be a tbl_df with a geometry variable rather than an sf class object. This can lead to all sorts of problems when doing any operations on that data, such as transformation or summarization or plotting.

Something to look at. I imagine it's bad practice to force load the sf library with the get_census call if the user selects the sf format?
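One commonly used alternative to attaching a package with library() from inside a function is to check for it with requireNamespace(), which loads the namespace (and registers its methods) without changing the user's search path. Whether that alone resolves the class problem described here would need to be tested. The helper name check_geo_dependency and its package argument are hypothetical.

```r
# Hypothetical guard to run before requesting spatial data.
check_geo_dependency <- function(geo_format, package = "sf") {
  if (is.na(geo_format)) {
    return(invisible(TRUE))  # no spatial data requested, nothing to check
  }
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("geo_format = '", geo_format, "' needs the '", package,
         "' package. Install it with install.packages('", package, "').",
         call. = FALSE)
  }
  invisible(TRUE)
}

check_geo_dependency(NA)
# check_geo_dependency("sf") errors with an install hint if sf is missing
```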

Column names should be clear, meaningful, and thoughtfully ordered

At the moment a basic load call like the following

census_data <- cancensus::cancensus.load(
  dataset='CA16', regions='{"PR":["59"]}',
  vectors = c("v_CA16_408"),
  level='CD'
)

Returns the following sf object:

# Simple feature collection with 39 features and 18 fields
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -123.4319 ymin: 49.00193 xmax: -122.4091 ymax: 49.57428
# epsg (SRID):    4326
# proj4string:    +proj=longlat +datum=WGS84 +no_defs
# # A tibble: 39 x 19
#          a     t    dw    hh      id   pop              name  pop2  rgid  rpid  ruid   Type     `Region Name` `Area (sq km)` Population Dwellings Households
#  *   <chr> <chr> <chr> <chr>   <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <fctr>            <fctr>          <dbl>      <dbl>     <dbl>      <dbl>
#  1 0.28012   CSD   286   285 5915825   471         Matsqui 4   498    59  5915 59933    CSD         Matsqui 4        0.28012        471       286        285
#  2 0.53308   CSD     4     4 5915810    10        Musqueam 4     5    59  5915 59933    CSD        Musqueam 4        0.53308         10         4          4
#  3 0.02195   CSD    25    22 5915805    54       Coquitlam 1    39    59  5915 59933    CSD       Coquitlam 1        0.02195         54        25         22
#  4 1.05651   CSD   932   920 5915806  1855   Burrard Inlet 3  1472    59  5915 59933    CSD   Burrard Inlet 3        1.05651       1855       932        920
#  5 0.27764   CSD   178   160 5915807   576         Mission 1   574    59  5915 59933    CSD         Mission 1        0.27764        576       178        160
#  6 1.79039   CSD  1433  1309 5915808  2931        Capilano 5  2700    59  5915 59933    CSD        Capilano 5        1.79039       2931      1433       1309
#  7 0.58363   CSD    16    15 5915809    49 Barnston Island 3    47    59  5915 59933    CSD Barnston Island 3        0.58363         49        16         15
#  8 0.49049   CSD    40    37 5915811   123   Seymour Creek 2   107    59  5915 59933    CSD   Seymour Creek 2        0.49049        123        40         37
#  9 0.30729   CSD    15    15 5915813    40          Katzie 2     0    59  5915 59933    CSD          Katzie 2        0.30729         40        15         15
# 10 1.78376   CSD    32    32 5915816    94 McMillan Island 6    68    59  5915 59933    CSD McMillan Island 6        1.78376         94        32         32
# # ... with 29 more rows, and 2 more variables: `v_CA16_408: Occupied private dwellings by structural type of dwelling data` <dbl>, geometry <simple_feature>

There are quite a number of duplicate columns here with unhelpful names; the order looks pretty arbitrary; and some of the columns clearly have the wrong type.

My suggestion for the appropriate result is the transformation (abusing dplyr here for illustration):

df %>%
  mutate_at(vars(Households, Dwellings, Population), as.integer) %>%
  select(id, name, level = t, pop = Population, area = `Area (sq km)`,
         dwellings = Dwellings, households = Households, starts_with("v_"))

# Simple feature collection with 39 features and 8 fields
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -123.4319 ymin: 49.00193 xmax: -122.4091 ymax: 49.57428
# epsg (SRID):    4326
# proj4string:    +proj=longlat +datum=WGS84 +no_defs
# # A tibble: 39 x 9
#         id              name level   pop    area dwellings households
#      <chr>             <chr> <chr> <int>   <dbl>     <int>      <int>
#  1 5915825         Matsqui 4   CSD   471 0.28012       286        285
#  2 5915810        Musqueam 4   CSD    10 0.53308         4          4
#  3 5915805       Coquitlam 1   CSD    54 0.02195        25         22
#  4 5915806   Burrard Inlet 3   CSD  1855 1.05651       932        920
#  5 5915807         Mission 1   CSD   576 0.27764       178        160
#  6 5915808        Capilano 5   CSD  2931 1.79039      1433       1309
#  7 5915809 Barnston Island 3   CSD    49 0.58363        16         15
#  8 5915811   Seymour Creek 2   CSD   123 0.49049        40         37
#  9 5915813          Katzie 2   CSD    40 0.30729        15         15
# 10 5915816 McMillan Island 6   CSD    94 1.78376        32         32
# # ... with 29 more rows, and 2 more variables: `v_CA16_408: Occupied private dwellings by
# #   structural type of dwelling data` <dbl>, geometry <simple_feature>

We should also explain these columns in the help entry.

Unexpected behaviour with parsing region lists if readr package not available

There appears to be an issue where list_census_region_list can in some situations load and keep region ids as integers instead of converting them into characters. This is not picked up when converting to a region list using as_census_region_list, which will lead to a malformed API call.

This occurs for some users who do not have the readr package installed, for whom the data is loaded using the base read.csv call instead. There is no need to force a dependency on the readr package, however: this is straightforward to fix by either ensuring that region_ids are converted to strings in the first step, or by ensuring that they are strings when sent through as_census_region_list.
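The first option can be sketched with base read.csv and its colClasses argument. The CSV text below is a shortened, illustrative version of the regions file, keeping only the name, geo_uid and type columns.

```r
# Shortened stand-in for the regions CSV returned by the API.
csv_text <- "name,geo_uid,type\nCanada,01,C\nOntario,35,PR\n"

# Default parsing guesses integer for geo_uid, so "01" becomes 1.
guessed <- utils::read.csv(text = csv_text, stringsAsFactors = FALSE)

# Naming the column in colClasses forces it to stay character;
# the other columns are still guessed.
safe <- utils::read.csv(text = csv_text,
                        colClasses = c(geo_uid = "character"),
                        stringsAsFactors = FALSE)

guessed$geo_uid
safe$geo_uid
```

Note that converting later with as.character() would turn the integer 1 into "1" rather than "01", so forcing the type at read time looks like the safer of the two options.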

Some thoughts on function names, and possible changes

The package currently uses a very unusual naming convention for functions, namely package.function_name. I haven't seen this style in R before, but R is also famous for its inconsistent naming conventions. I didn't give it much thought for a while, but as the package draws nearer to release, I thought it might be a good idea to think more carefully about naming. I think we should be motivated to have clear, unambiguous names, of course, but beyond that things are more a matter of style.

What Do Other Packages Do?

Most other R packages in my experience do not use a prefix or suffix at all (e.g. dplyr). However, I did take a look at the most recently updated 50 or so API-backed R packages, and there the naming conventions are more varied.

Probably the closest package to this one is the censusapi package, which uses a getCensus() function for querying and a listCensusMetadata() function for inspection. The junr package, which is also somewhat similar, uses get_index(), get_data() and list_titles(), which I think are too generic -- I would prefer to stick "census" in there somewhere. (I think the dwapi package also suffers from this problem).

Some packages use nouns; for example the hansard package has commons_divisions() and mp_vote_record() and the HIBPwned package has account_breaches(). I like the specificity of this approach. Others use verbs, e.g. geocode() in the banR package.

My favourite approach at the moment is, as with the censusapi package, to take a specific noun and prefix it with "get", as with the GetSports() function in the pinnacle package or the get_lang() function in the languagelayeR package, or the get_owf() function in the openwindfarm package, or even the get_video_details() function in the tuber package.

Of course, there are some packages that use prefixes, e.g.

the bold package uses bold_
the datadogr package uses k9_
the refimpact package uses ref_
the bea.R package uses bea
the comtradr package uses ct_
the rcoreoa package uses core_
the nneo package uses nneo_

There are also some that use suffixes, e.g.

the sidrar package uses a _sidrar suffix for its functions, adopting get_, search_, etc function names -- so this could also be considered in the first category
the patentsview package uses a single search_pv function (that is, it uses a suffix)
the SkyWatchR package uses a single querySW function

There is also the approach taken by the rosettaApi package, which is to have a single, top-level api() function.

Proposed API Changes

All this taken in stride, my proposal is to make the following changes:

cancensus::cancensus.load becomes cancensus::get_census
cancensus::cancensus.load_data becomes cancensus::get_census_data
cancensus::cancensus.load_geo becomes cancensus::get_census_geometry
cancensus::cancensus.list_datasets becomes list_census_datasets, although I believe @mountainMath wants to add some non-census datasets in the future.
cancensus::cancensus.list_vectors becomes cancensus::list_census_vectors
cancensus::cancensus.search_vectors becomes cancensus::search_census_vectors
cancensus::cancensus.list_regions becomes cancensus::list_census_regions
In order to keep naming consistent, I think it makes sense to have cancensus::cancensus.census_labels become cancensus::census_vectors, since that is what they are termed in the parameters and in the list_* function.

All internal functions should just ditch their cancensus. prefix, since they're not exported anyway.

We can also keep aliases around for all these changes for a release or two so that no-one's code breaks.
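A sketch of what such an alias could look like, using base R's .Deprecated() to warn and then forwarding all arguments to the new name. The get_census body below is only a stand-in so the example runs without calling the API.

```r
# Stand-in for the renamed function; the real one would query the API.
get_census <- function(dataset, ...) {
  paste("loading", dataset)
}

# Old name kept as a thin alias: warn once per call, then forward everything.
cancensus.load <- function(...) {
  .Deprecated("get_census")
  get_census(...)
}

suppressWarnings(cancensus.load("CA16"))
```

Old code keeps working and returns the same result, and the warning tells users which name to switch to before the alias is removed.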

I think that the current function parameters all have pretty good names, so I don't have any suggested changes there.

Alternatives

We can keep some element of namespacing, e.g. by using cm as a prefix, which is the same prefix currently used by the API key when it is an environment variable. So for example cm_get and cm_list_datasets.
We could use a top-level function with the same name as the package, that is, cancensus::cancensus() as the main entry point (replacing existing load*() functions). I don't think this is a particularly intuitive approach for this package.

Master Documentation and Examples

Issues may not be the best place for this but wanted to let you guys know that I've started working on a comprehensive set of documentation/vignettes to detail cancensus usage. I've been in and out of town recently but over the next week or so I intend to put together this material.

I see it working best as a three-part document:

Part 1: Why cancensus and accessing data
- Stats Can
- Background and Censusmapper
- Data load functions
- Data parameters
- Working with regions
- Working with census vectors
- Spatial formats
Part 2: Analyzing and Visualizing cancensus data
- cancensus and tidy data manipulation (dplyr, %>% etc.)
- Visualizing (non-spatial) Census data
Part 3: Making maps with cancensus
- cancensus + sf + ggplot2
- cancensus + leaflet

Any suggestions?

I'm halfway through Part 1, and I'll post them here first when ready. I'm going to be out of town Friday-Monday, but should gradually work towards finishing these by end of next week.

My plan is to put this series up here as vignettes, but also on a site (good reason to finally finish that rmarkdown+blogdown+hugo site). I think a condensed version would also work well as a submission to Rbloggers, R views, as well as general social media.

In the meantime, is there anything we still need to do in order to submit to CRAN?

Prefer ggplot2 to tmap in vignettes and README

As discussed in #44, it does not seem desirable to use tmap. So we should remove it in favour of ggplot2 examples.

Ensure that messages/warnings are not emitted when "quietly = TRUE"

As emerged in #99, some of the current functions still emit warnings/messages when quietly = TRUE. This should be resolved, and the vignettes can then be updated to remove warning/message suppression.
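One way to make this easy to enforce is to route every progress message through a single helper that respects the flag. This is only a sketch; the helper name cm_inform is hypothetical, and the argument is called quietly here to match the wording of this issue.

```r
# Hypothetical helper: emit a message only when quietly is FALSE.
cm_inform <- function(..., quietly = FALSE) {
  if (!quietly) {
    message(...)
  }
  invisible(NULL)
}

cm_inform("Querying CensusMapper API...", quietly = TRUE)  # prints nothing
cm_inform("Querying CensusMapper API...")                  # prints the message
```

Functions would then pass their own quietly argument down to the helper instead of calling message() directly.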

Soft deprecate get_census_geometry

See related #126

This function is duplicated by just using get_census with geo parameters set and no variable parameters.

We can soft deprecate this so it still works in legacy code but throws a warning on deprecation, and remove it from active documentation.

can't find `agr` columns

Getting a really weird error when running

get_census("CA16",regions=list(CSD="5915022"), vectors = c("med_hh_inc"="v_CA16_2397"), geo_format = 'sf')

Error message.

Error in rename.sf(., !!!vectors) : internal error: can't find `agr` columns

using sf_0.9-5

Convenience functions

I can think of a couple of convenience functions that we might want to add.

Take a subset of the regions_list array (filtered by the user) and turn it into a region object needed for the cancensus load calls.
Given a (list of) vector(s), or rows from the list_vectors csv, get all children rows from list_vectors
Given a list of vectors or rows from list_vectors csv, select all leaves (i.e. elements without children)

I can imagine that I, at least, will use these quite a bit; I'm not sure about others. Should we add cancensus functions to do this?
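To make the first and third ideas concrete, here is a base R sketch on toy data. The column names region, level, vector and parent_vector, the helper names to_region_list and leaf_vectors, and the vector codes are assumptions for illustration.

```r
# Toy filtered subset of a regions listing.
regions <- data.frame(
  region = c("59933", "59935", "59"),
  level  = c("CMA", "CMA", "PR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group region ids by level into a named list,
# the shape needed for the regions argument of the load calls.
to_region_list <- function(regions) {
  split(as.character(regions$region), regions$level)
}

# Toy vector hierarchy.
vectors <- data.frame(
  vector        = c("v1", "v2", "v3", "v4", "v5"),
  parent_vector = c(NA,   "v1", "v1", "v2", "v4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: leaves are vectors that never appear as a parent.
leaf_vectors <- function(vectors) {
  vectors$vector[!vectors$vector %in% vectors$parent_vector]
}

to_region_list(regions)
leaf_vectors(vectors)
```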

Census Variable Labels

We're currently providing people with the option to return short census vector variables by passing labels = "short" when calling get_census(...). Somewhere along the way, the functionality I built in an earlier branch to store and return the detailed variables was lost, but we're still making reference to it in the documentation.

The previous functionality was called via cancensus.labels() which doesn't appear to be in the master code anymore. I think it's a pretty useful feature when working with short form variable names, and easy enough to implement, so I'm wondering if we removed that on purpose or by accident. If an accident, we can re-insert it where appropriate.

If anyone has a suggestion for a better way of handling the variable descriptions, feel free to comment.

caching and geo_format

There seems to be an issue with caching data and geo_format. The current implementation stores files using the geo_format specified in the first call. If a subsequent call requests the same geographic data in a different geo_format, the call fails.
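
One possible fix is to make the cache file name depend on geo_format, so each format gets its own cache entry. A sketch under that assumption; `cache_file_name` and the naming scheme are hypothetical and not the package's current behaviour.

```r
# Hypothetical helper: build a cache file name that includes the geo format.
cache_file_name <- function(dataset, level, regions, geo_format = NA) {
  fmt <- if (is.na(geo_format)) "none" else geo_format
  # Flatten the named region list into a string such as "CMA-59933".
  region_part <- paste(names(regions), unlist(regions), sep = "-", collapse = "_")
  paste0(paste("CM_geo", dataset, level, region_part, fmt, sep = "_"), ".rda")
}

regions <- list(CMA = "59933")
cache_file_name("CA16", "CSD", regions, geo_format = "sf")
cache_file_name("CA16", "CSD", regions, geo_format = "sp")
```

An alternative would be to cache the raw download once and convert to the requested format on read, which avoids storing the same geography twice.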

Add discoverability for vectors and regions

We discussed this via email, but I think it is worth filing an issue to highlight that this is a desired feature. Something like the current list_datasets() command for vectors and regions would be nice. They could also return tidy data frames that could be filtered, examined, etc.

I'm imagining something like the following for regions:

> cancensus::list_regions(dataset = "CA16") %>%
>   filter(level %in% c("CMA", "PR"), province == "British Columbia")

Would yield

# # A tibble: 9 x 4
#   region level                 name         province
#    <chr> <chr>                <chr>            <chr>
# 1  59915   CMA              Kelowna British Columbia
# 2  59925   CMA             Kamloops British Columbia
# 3  59930   CMA           Chilliwack British Columbia
# 4  59932   CMA Abbotsford - Mission British Columbia
# 5  59933   CMA            Vancouver British Columbia
# 6  59935   CMA             Victoria British Columbia
# 7  59938   CMA              Nanaimo British Columbia
# 8  59970   CMA        Prince George British Columbia
# 9  59       PR     British Columbia British Columbia

For vectors it could be even more simple: list_vectors(dataset = "CA16") could return a two-column data frame with the vector and its accompanying description.

label metadata is missing when geo_format="sf"

Cancensus adds the detailed labels as an attribute to the data it returns, which can be read with the label_vectors function if geo_format=NA, but this attribute gets lost when geo_format="sf".

Problem with labels and sf format

The labels=short option breaks when the sf format is selected. I added a check to fix this, but I am now thinking it would be better to implement the code higher up, directly on the dat object. I'm not sure whether that survives the potential merge with the geo data, but that can be dealt with.

Vectors list for invalid datasets gives strange errors

Dupe of #47, but for list_vectors():

> cancensus.list_vectors("CA17")
## Querying CensusMapper API for vectors data...
## Downloading: 20 B     Error: '' does not exist in current working directory ('...').

versus the current list_regions():

> cancensus.list_regions("CA17")
## Querying CensusMapper API for regions data...
## Error in cancensus.handle_status_code(response, NULL) : 
##  Download of Census Data failed. Invalid Dataset Parameter

If this can't be addressed easily on the server side, we should be able to implement dataset validation on the client side.
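
A minimal sketch of client-side validation, assuming the valid codes are the three datasets currently supported ("CA16", "CA11", "CA06"). `validate_dataset` is a hypothetical helper name.

```r
# Assumed set of valid dataset codes; this could later come from list_datasets().
valid_datasets <- c("CA16", "CA11", "CA06")

# Hypothetical helper: fail early with a clear message for unknown datasets.
validate_dataset <- function(dataset, valid = valid_datasets) {
  if (!is.character(dataset) || length(dataset) != 1 || !(dataset %in% valid)) {
    stop("Invalid dataset '", paste(dataset, collapse = ", "),
         "'. Valid options are: ", paste(valid, collapse = ", "),
         call. = FALSE)
  }
  invisible(dataset)
}

validate_dataset("CA16")                       # passes silently
try(validate_dataset("CA17"), silent = TRUE)   # fails before any API call
```

Calling this at the top of both list functions would give the same error for either, regardless of what the server returns.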

Default data when no spatial format requested

I noticed that when data is pulled without geo, we lose a couple of useful columns.

 cma.ct <- get_census("CA16", regions=list(CMA=cma), 
                        vectors = vectors, level = "CT",
                        labels = "short", geo_format = NA)

produces

$ GeoUID         <chr> "8250001.01", "8250001.02", "8250001.03", "8250001.04", "8250001.05", "8250...
$ Type           <fct> CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT,...
$ `Region Name`  <fct> Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Cal...
$ `Area (sq km)` <dbl> 1.72193, 3.94892, 1.04878, 2.57535, 1.15113, 3.34596, 2.97159, 3.66137, 3.5...
$ Population     <dbl> 5232, 6517, 2205, 5942, 2905, 3793, 6123, 5132, 6218, 2837, 5192, 4671, 261...
$ Dwellings      <dbl> 2156, 2619, 823, 2325, 1045, 1448, 2410, 2082, 2407, 1073, 1876, 1746, 1056...
$ Households     <dbl> 2104, 2571, 820, 2293, 1042, 1445, 2315, 2011, 2382, 1064, 1846, 1737, 1040...
...

Whereas

cma.ct <- get_census("CA16", regions=list(CMA=cma), 
                        vectors = vectors, level = "CT",
                        labels = "short", geo_format = "sf")

produces

$ `Shape Area`                            <dbl> 1.88067, 0.58484, 1.41712, 400.47943, 261.22246, 8...
$ Type                                    <fct> CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT...
$ Dwellings                               <int> 1220, 850, 1943, 1271, 2339, 1666, 1021, 2264, 227...
$ Households                              <int> 1114, 835, 1928, 1192, 2271, 1508, 923, 2227, 2213...
$ GeoUID                                  <chr> "8250055.00", "8250076.15", "8250052.09", "8250204...
$ Population                              <int> 3141, 2214, 5733, 3931, 6852, 4116, 2550, 6903, 65...
$ `Adjusted Population (previous Census)` <int> 2906, 2239, 5437, 3448, 6002, 3593, 2448, 5871, 64...
$ PR_UID                                  <chr> "48", "48", "48", "48", "48", "48", "48", "48", "4...
$ CMA_UID                                 <chr> "48825", "48825", "48825", "48825", "48825", "4882...
$ CSD_UID                                 <chr> "4806016", "4806016", "4806016", "4806014", "48060...
$ CD_UID                                  <chr> "4806", "4806", "4806", "4806", "4806", "4806", "4...
$ `Region Name`                           <fct> Calgary, Calgary, Calgary, Rocky View County, Rock...
$ `Area (sq km)`                          <dbl> 1.88067, 0.58484, 1.41712, 400.47943, 261.22246, 8...
...
$ geometry                                <MULTIPOLYGON [°]> MULTIPOLYGON (((-114.1179 5..., MULTI...

I think that the additional columns for PR_UID, CMA_UID, CSD_UID, CD_UID etc. should be retained even when no geo format is specified, because it is a common requirement to merge and aggregate at different levels of census geography even when not explicitly working with spatial data. This would reduce load on the server by reducing the number of unnecessary calls for spatial data.

Reproducibility issue in get_census_geometry

Just ran into some errors trying to recompile old code. Reprex below:

dataset <- "CA16"

regions_list10 <- list_census_regions(dataset) %>% 
  filter(level=="CMA") %>% 
  top_n(10,pop) %>% 
  as_census_region_list

csd_geo <- get_census_geometry(dataset, level = 'Regions', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CSD', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CD', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CMA', regions = regions_list10)

Each of the get_census_geometry calls here fails with the error

Error in get_census(dataset, level, regions, vectors = c(), geo_format = geo_format, :
the level parameter must be one of 'Regions', 'PR', 'CMA', 'CD', 'CSD', 'CT', or 'DA'
In addition: Warning message:
In get_census(dataset, level, regions, vectors = c(), geo_format = geo_format, :
passing regions as a character vector is depreciated, and will be removed in future versions

Flagging this to figure out what is causing the issue. If it is a deprecation issue, we should think about making it a softer deprecation for legacy code.

The original code was written pre-CRAN release.

Default geo_format

I am thinking we should change the default geo_format to NA. I had some conversations with people on older R versions who could not get cancensus to work because

they could not or did not want to install a dev version of ggplot and sf, and
their R version was too old to install standard gdal to make sp work, and they did not want to upgrade in the middle of an important project

That's probably an extreme case, but I feel we should keep the entry barrier as low as possible. Also, I ran into a couple of bugs in sf/ggplot, so it might still take some time before this matures. So it might be better to let the user explicitly select which geography format they want to work with rather than selecting a default.

Thoughts?

Error with get_census(geo_format = "sp")

 census_data_sp <- get_census(dataset='CA16', regions=list(CMA="59933"),
                          vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                           level='CSD', geo_format = "sp", use_cache = FALSE)
Querying CensusMapper API...
Downloading: 1.4 kB     Querying CensusMapper API...
Downloading: 25 kB     Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv,  : 
  Cannot open layer

The "sf" option does work

GDAL Error 1: geoJSON object too complex

{cancensus} does not read complex geometries like Nunavut any more. It throws an error

GDAL Error 1: geoJSON object too complex

My hunch is that the {sf} package recently switched out their geoJSON driver to rely on GDAL, which has a fairly low memory limit (unless explicitly compiled with higher memory limit).

One way around this is to use a different geoJSON driver. For example, geojsonsf::geojson_sf is very fast and has no problem reading in large geojson.

Debian CRAN error in find_census_vectors

Caught in a CRAN check
https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/cancensus-00check.html

find_census_vectors('after tax income', dataset = 'CA16', type = 'total', query_type = 'semantic')

Error in if (min(lev_dist_df) > 2 | is.infinite(min(lev_dist_df))) { :
  missing value where TRUE/FALSE needed
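
The error suggests the condition evaluates to NA, which happens when `min()` sees a missing value. A possible guard is sketched below, assuming the NA comes from missing distances in `lev_dist_df`; the actual cause on the Debian check machine has not been confirmed. `no_close_match` is a hypothetical helper name.

```r
# Hypothetical guard: TRUE when no distance is within max_dist.
# min(..., na.rm = TRUE) returns Inf (with a warning) when nothing is left,
# so the condition can never be NA.
no_close_match <- function(distances, max_dist = 2) {
  d <- suppressWarnings(min(distances, na.rm = TRUE))
  isTRUE(is.infinite(d) || d > max_dist)
}

no_close_match(c(NA, 1))        # a close match exists despite the NA
no_close_match(c(NA, NA))       # nothing usable, treated as no match
```

Using `||` instead of `|` also makes the intent clearer, since the `if` condition is a single logical value.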

Add discoverability for datasets

The package's main function currently accepts a datasets parameter that specifies which census to query, one of c("CA16", "CA11", "CA06") at the moment. It would be nice if there was some way for users to "discover" what data were available, for example with a function like

> cancensus::list_datasets()
# # A tibble: 3 x 2
#   dataset                    description
#     <chr>                          <chr>
# 1    CA16                    2016 Census
# 2    CA11 2011 National Household Survey
# 3    CA06                    2006 Census

As I said via email, I think there are two ways of doing this, either (1) adding a datasets endpoint to the API that would return this kind of information, and/or (2) just hardcoding it into the package for now.
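
Option (2) could be as small as the following sketch. The function is named `list_datasets_local` here to make clear it is an illustration with hardcoded values, not the final implementation.

```r
# Hardcoded dataset listing; a datasets API endpoint could replace this later.
list_datasets_local <- function() {
  data.frame(
    dataset = c("CA16", "CA11", "CA06"),
    description = c("2016 Census",
                    "2011 National Household Survey",
                    "2006 Census"),
    stringsAsFactors = FALSE
  )
}

list_datasets_local()
```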

Create vignettes for CMHC data

There's currently very little documentation on how to use the additional data open-sourced by CMHC used here:

Can these articles be adapted into a vignette or two to include on the site and bundled with the package?

Give option of retrieving data in wide or long formats

Currently, get_census() returns variables in wide format:

cancensus::get_census(dataset='CA16', regions=list(CMA="59933"),
                      vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                      level='CSD') %>% glimpse()
Rows: 39
Columns: 13
$ GeoUID                                                                       <chr> "5915001", …
$ Type                                                                         <fct> CSD, CSD, C…
$ `Region Name`                                                                <fct> Langley (DM…
$ `Area (sq km)`                                                               <dbl> 314.76313, …
$ Population                                                                   <dbl> 117285, 258…
$ Dwellings                                                                    <dbl> 43720, 1226…
$ Households                                                                   <dbl> 41982, 1184…
$ CD_UID                                                                       <chr> "5915", "59…
$ PR_UID                                                                       <chr> "59", "59",…
$ CMA_UID                                                                      <chr> "59933", "5…
$ `v_CA16_408: Occupied private dwellings by structural type of dwelling data` <dbl> 41980, 1184…
$ `v_CA16_409: Single-detached house`                                          <dbl> 21690, 2730…
$ `v_CA16_410: Apartment in a building that has five or more storeys`          <dbl> 1100, 40, 6…

When working with multiple vector variables it might be preferable in some cases to have these in long format. Frequently I will call tidyr::gather() or pivot_longer() to collect these variables after retrieval.

Tidycensus has a parameter output that works like this:

One of "tidy" (the default) in which each row represents an enumeration unit-variable combination, or "wide" in which each row represents an enumeration unit and the variables are in the columns.

And in practice:

get_acs(geography = "county", variables = vars, state = "CA", geometry = TRUE, output = "wide")

I think this would be pretty easy to add but would require some work to make sure the various utility functions still work with data in long format.
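
For reference, the reshaping itself is straightforward. Below is a base R sketch on made-up data (the GeoUIDs and values are placeholders, not real census figures); `to_long` is a hypothetical helper, and in practice tidyr's pivot_longer() would do the same job.

```r
# Mock wide-format result; values are placeholders for illustration.
wide <- data.frame(
  GeoUID = c("A", "B"),
  v_CA16_408 = c(10, 20),
  v_CA16_409 = c(3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per region-variable combination.
to_long <- function(data, id_cols, value_cols = setdiff(names(data), id_cols)) {
  pieces <- lapply(value_cols, function(col) {
    out <- data[, id_cols, drop = FALSE]
    out$variable <- col
    out$value <- data[[col]]
    out
  })
  long <- do.call(rbind, pieces)
  rownames(long) <- NULL
  long
}

long <- to_long(wide, id_cols = "GeoUID")
long
```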

Incorrect variable descriptions in search_census_vectors()

Replicate by running any of the following:

View(search_census_vectors("income","CA11"))
View(search_census_vectors("income","CA16"))

The description field is concatenating incorrectly. I suspect this is due to an issue in how it traverses the variable hierarchy after we switched to a recursive approach rather than the bumbling hard-coded version that was there before.

Another thing we should consider is which fields should be included in the search. Currently the search works as: veclist[grep(searchterm, veclist$label, ignore.case = TRUE),] which searches solely through the label field, but not the description field. If the description field works accurately, it may be useful to include it. However, we may then return too many variables when someone searches for something general like language or income. Thoughts?
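
One way to keep both behaviours is to make the searched fields a parameter that defaults to the label only. A sketch on a mock vector list; the rows and the `search_vectors` helper are made up for illustration.

```r
# Mock vector list; contents are placeholders.
veclist <- data.frame(
  vector = c("v_a", "v_b", "v_c"),
  label = c("Total income", "Median age", "Mother tongue"),
  description = c("Income; Total income", "Age; Median age",
                  "Language; Mother tongue"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match the term in any of the requested fields.
search_vectors <- function(searchterm, veclist, fields = "label") {
  hits <- Reduce(`|`, lapply(fields, function(f) {
    grepl(searchterm, veclist[[f]], ignore.case = TRUE)
  }))
  veclist[hits, , drop = FALSE]
}

search_vectors("language", veclist)                                       # label only
search_vectors("language", veclist, fields = c("label", "description"))  # broader
```

Defaulting to the label keeps general searches narrow, while users who want the broader match can opt in.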

Caching

We are making quite heavy use of caching, and I am wondering if we should either have a separate convenience function where we can bundle that code or use a package for caching.

Also, this code

cache_dir <- system.file("cache/", package = "cancensus")

does not seem to be doing anything, at least for me.
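
A likely explanation is that system.file() returns an empty string when the requested path does not exist in the installed package, so cache_dir ends up as "". A sketch of a small helper that resolves and creates the directory explicitly follows; the helper name `resolve_cache_dir` and the option name `cancensus.cache_path` are assumptions made for this example.

```r
# system.file() yields "" for paths that are not present in the package.
system.file("no/such/dir", package = "base")

# Hypothetical helper: use a user-set option if present, else a temp folder,
# and make sure the directory exists before returning it.
resolve_cache_dir <- function(
    path = getOption("cancensus.cache_path",
                     file.path(tempdir(), "cancensus_cache"))) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  path
}

cache_dir <- resolve_cache_dir()
```

Bundling this with the read and write logic in one place would also address the duplication mentioned above.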