lucasmation / microdadosBrasil
Reads the most common Brazilian public microdata (CENSO, PNAD, etc.) easily and fast.
Yearly data will accept the year in numeric (2010) or string ('2010') format.
Quarterly data and monthly data have to be strings, and explicit:
'2014-1q' (1st quarter)
'2015-1m' (January)
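As an illustration, here are hedged calls following the package's read_<DATASET> naming pattern (the specific function and subdataset names below are assumptions, not confirmed API):

```r
# Hypothetical calls illustrating the accepted period formats
d1 <- read_CENSO("pessoas", 2010)              # yearly: numeric year
d2 <- read_CENSO("pessoas", "2010")            # yearly: string year
d3 <- read_PNADcontinua("pessoas", "2014-1q")  # quarterly: must be a string
d4 <- read_PME("pessoas", "2015-1m")           # monthly: must be a string
```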
Hi guys,
I was wondering if you could include an argument to the read_DATA
functions that specifies the variables that should be read. This argument would be extremely useful for time efficiency and/or memory limits. Most of the time we don't need to read the entire census data set, only a subset of its columns.
As you know, this is quite straightforward with readr. Here is what I usually do:
# load the data dictionary as a data.table
dic_dom <- get_import_dictionary("CENSO", 2000, "domicilios") %>% setDT()
# select the subset of variables to be read from the .txt files
my_variables_dom <- c("V0102", "V0103", "V0104", "V0300", "V0400", "V1007")
dic_dom <- dic_dom[var_name %in% my_variables_dom]  # filter the dictionary
# read only the selected variables
read_fwf(datafile,
         fwf_positions(start = dic_dom[, int_pos],
                       end = dic_dom[, fin_pos],
                       col_names = dic_dom[, var_name]),
         progress = interactive())
I changed the function name:
read_PNADContinua
to
read_PNADcontinua
I only changed the function name and the README.
Please check whether other internal components (dictionary, metadata csv) also need to change accordingly, and the test files too.
Create a test file that expands the tests run on PNAD to all datasets.
There should be a download loop
and an "import loop".
The import loop should actually be two nested loops, one over years and one over datasets.
Each iteration should record the time it takes, the number of observations of the imported file and the number of columns of the imported file.
The test should be run twice, once from the local PC and once from the powerful server.
In the server tests it would be useful to keep the imported dataset of each iteration in a list (or two nested lists, as in the case of the dictionaries) so they can be used to test the compatibility of the files over time.
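The nested import loop described above could be sketched like this (the dataset, subdataset and year values are illustrative, and the read functions are looked up by the package's read_<DATASET> naming pattern):

```r
# Sketch: time each import and record the dimensions of the result
results <- list()
for (dataset in c("PNAD", "CENSO")) {
  for (year in c(2000, 2010)) {
    t0 <- Sys.time()
    d  <- tryCatch(get(paste0("read_", dataset))("pessoas", year),
                   error = function(e) NULL)  # keep looping on failure
    results[[paste(dataset, year, sep = "_")]] <- data.frame(
      dataset = dataset, year = year,
      seconds = as.numeric(difftime(Sys.time(), t0, units = "secs")),
      n_obs   = if (is.null(d)) NA else nrow(d),
      n_cols  = if (is.null(d)) NA else ncol(d))
  }
}
do.call(rbind, results)  # one row per dataset-year with timing and dimensions
```

On the server, the same loop could additionally store each `d` in a list for the later compatibility checks.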
I could not download Census data. Did we break something?
I tried:
download_sourceData("CENSO", 2000, unzip = T)
Error in download_sourceData("CENSO", 2000, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB
download_sourceData("CENSO", 2010, unzip = T)
Error in download_sourceData("CENSO", 2010, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB
The code above was working before (it is the first example in the package README file).
Also, the error message indicates that microdadosBrasil expects "CensoIBGE".
I tried that, but it also did not work:
download_sourceData("CensoIBGE", 2010, unzip = T)
Error in download_sourceData("CensoIBGE", 2010, unzip = T) :
Can't download dataset, there are no information about the source
Then I tried repeating the code above, just changing the year to 2000, but got a weird message:
download_sourceData("CensoIBGE", 2000, unzip = T)
Error in download_sourceData("CensoIBGE", 2000, unzip = T) :
This data was already downloaded.(check:
./testing_integen_mobility_within_young_adults_living_with_parents_Brazil_Census.R)
If you want to overwrite the previous files add replace=T to the function call.
The check for whether the data already exists is wrong. I think it currently just checks whether there is any file in the folder (see the message above).
@nicolassoarespinto , can you take a look?
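A sketch of a stricter check, testing for the specific files the dataset is expected to produce rather than for any file in the folder (`expected_files` is a hypothetical name for information that would presumably come from the download metadata):

```r
# Only report "already downloaded" when every expected file is actually present
already_downloaded <- function(dest, expected_files) {
  all(file.exists(file.path(dest, expected_files)))
}
```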
Working on the RAIS files (if it matters). Not sure if this is a problem only for me. But if not, it should be included in the "installation" section of the README file:
#installation
install.packages("devtools")
...
install.packages("dplyr") # needed this installed first to avoid an error in install_github() below
install.packages("readr") # needed this installed first to avoid an error in install_github() below
devtools::install_github("lucasmation/microdadosBrasil")
library('microdadosBrasil')
#end installation
#use RAIS
library(RCurl) #needed this before download_sourceData would work
download_sourceData("RAIS", i = "2000")
...
See the section on publishing to CRAN in Hadley's R Packages book,
in particular the checking chapter:
http://r-pkgs.had.co.nz/check.html
after reading the documentation, run:
devtools::check()
output: Status: 1 ERROR, 8 WARNINGs, 5 NOTEs
then look at the log.
I looked quickly, and most errors can be fixed with the suggestions in that chapter.
The function appears to have a problem when downloading the 2014 RAIS.
Run the following code and pay attention to the first value printed by the function:
download_sourceData("RAIS", i = "2013")
download_sourceData("RAIS", i = "2014")
This is my output:
download_sourceData("RAIS", i = "2013")
[1] "2013/AC2013.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z"
trying URL 'ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z'
...
download_sourceData("RAIS", i = "2014")
[1] "/AC2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z"
downloaded 0 bytes
Error in download.file(file_links[y], destfile = paste(c(dest, file_dir, :
cannot download all files
In addition: Warning messages:
1: In dir.create(paste(c(dest, file_dir), collapse = "/")) :
cannot create dir '', reason 'No such file or directory'
2: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
URL ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z: cannot open destfile '/AC2014.7z', reason 'Permission denied'
3: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
downloaded length 0 != reported length 3625219
[1] "/AL2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AL2014.7z"
downloaded 0 bytes
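The log suggests that for 2014 the year subdirectory is dropped when the local destination path is built ("/AC2014.7z" instead of "2014/AC2014.7z"). A minimal defensive sketch, using hypothetical internal names (`dest`, `file_dir`) inferred from the error messages:

```r
# Hypothetical sketch: always include the year in the local destination path
year     <- "2014"
file_dir <- file.path(dest, year)             # e.g. "<dest>/2014", never ""
dir.create(file_dir, recursive = TRUE, showWarnings = FALSE)
destfile <- file.path(file_dir, "AC2014.7z")  # avoids writing to "/AC2014.7z"
```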
There is an error in the import dictionary for this dataset-year.
check:
CensoEscolar_dics[[5]][[1]] %>% View()
I think this is a problem in the original import dictionary, as the extracts below, from the dictionary and the data, show.
the SAS dictionary contains:
DATA CENSOESC;
INFILE "C:\DADOS_CENSOESC.TXT" LRECL=7880 MISSOVER;
INPUT
@ 1 MASCARA 8. /* Código da Escola */
@ 9 ANO 5. /* Mascara da Escola */
@ 14 CODMUNIC $12. /* Ano do Censo Escolar */
@ 26 UF $50. /* Código do Município */
@ 76 SIGLA $2. /* Nome da Unidade Federativa */
the corresponding data (for these columns) is:
0000000001111111111122222222223333333333444444444455555555556666666666
1234567890123456789012345678901234567890123456789012345678901234567890
20046137 1999110100100205Rondonia ROPORTO VELHO
20043596 1999110100100205Rondonia ROPORTO VELHO
This works:
# Set working directory to folder in the local computer
setwd('C:/myfolder')
download_sourceData("CensoEducacaoSuperior",i = 2005)
This does not:
#Set working directory to a folder over the network, outside current station
setwd("//path_to_network_location")
download_sourceData("CensoEducacaoSuperior",i = 2005)
ftp://ftp.mtps.gov.br/pdet/microdados/CAGED/
I suggest:
if the files already exist:
stop("This data was already downloaded. If you want to overwrite it, add replace=T to the function call.")
devtools::install_github("lucasmation/microdadosBrasil")
Downloading GitHub repo lucasmation/microdadosBrasil@master
from URL https://api.github.com/repos/lucasmation/microdadosBrasil/zipball/master
Installing microdadosBrasil
"C:/PROGRA~1/R/R-3.3.1/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore --quiet CMD INSTALL
"C:/Users/r342471958/AppData/Local/Temp/RtmpUdb7Hu/devtools102440056502/lucasmation-microdadosBrasil-3d7133d"
--library="C:/Users/r342471958/Documents/R/win-library/3.3"
--install-tests
Weird. The 'pessoas' file is taking way too long to read compared to the 'domicilios' file.
d <- read_PNAD("domicilios", 2002)
Time difference of 1.341 secs
size: 0.1 Gb
d2 <- read_PNAD("pessoas", 2002)
Time difference of 3.777167 mins
0.9 Gb
This function only works if the package was loaded by devtools::load_all().
Didn't we use to have a get_import_dictionary() function?
It should be added to dictionary_functions.R.
ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014/
What do you think of adding a new parameter, Source_file_mark=TRUE, to the import functions?
This would add a column containing the name of the file the observation was read from.
This is relevant for datasets that come spread across multiple files, which our import functions gather into a single data.table. Afterward, if someone sees something weird in an observation, it is not clear which file it came from. So the package would do something like this:
Source_file_mark=TRUE >> add column Source_file
which would contain, for
read_SISOB("obitos", c(200103,200104) ) -> d
d would contain
Source_file
200103.txt
200103.txt
...
...
200304.txt
...
The default would be Source_file_mark=TRUE, but the user could opt out with Source_file_mark=FALSE.
Let me know what you think and I can implement it.
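A minimal sketch of how Source_file_mark could work internally, assuming the import step reads each file with data.table::fread and stacks the pieces (the function and internal names here are hypothetical):

```r
library(data.table)

# Read several files and optionally tag each row with its source file name
read_all_files <- function(files, source_file_mark = TRUE) {
  pieces <- lapply(files, fread)
  names(pieces) <- basename(files)  # e.g. "200103.txt", "200104.txt"
  if (source_file_mark) {
    rbindlist(pieces, idcol = "Source_file")  # adds the Source_file column
  } else {
    rbindlist(pieces)
  }
}
```

`rbindlist(idcol=)` fills the new column from the list names, which is exactly the file-of-origin information the proposal asks for.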
Existential question: why don't we have the import dictionaries as .csv files, as we do for the metadata files?
I think we should. That way all metadata is in the same file format, and all of it can be revised by non-programmers if necessary.
@nicolassoarespinto: do you see any problem with this?
CensoEscolar_dics[[20]] %>% str()
logi NA
Position 20 of the list is where the 2014 dictionaries should be, but it is empty.
PNAD is also not downloading, although it may be a problem with the source.
In any case, despite the data not downloading, the download function tries to unzip it. It would be nice to avoid that, if it is not too difficult to implement.
regards
Lucas
download_sourceData("PNAD", 2002, unzip = T)
[1] "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002"
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
cannot open URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
InternetOpenUrl failed: 'Uma conexão com o servidor não pôde ser estabelecida' (a connection to the server could not be established)
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file
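A sketch of guarding the unzip step so it only runs after a successful, non-empty download (the variable names follow the error messages above, but the surrounding internals are assumed):

```r
# Only unzip when the download actually succeeded and produced a non-empty file
destfile <- paste(c(dest, filename), collapse = "/")
status <- tryCatch(download.file(link, destfile = destfile),
                   error = function(e) 1L)  # treat errors as a nonzero status
if (status == 0 && file.exists(destfile) && file.size(destfile) > 0) {
  unzip(destfile, exdir = dest)
} else {
  warning("Download failed; skipping unzip for ", filename)
}
```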
The traditional package is survey, explained here:
http://r-survey.r-forge.r-project.org/survey/index.html
Other options are listed here:
http://www.hcp.med.harvard.edu/statistics/survey-soft/
In particular sqlsurvey:
http://sqlsurvey.r-forge.r-project.org/
http://r-forge.r-project.org/projects/sqlsurvey/
https://www.r-project.org/conferences/useR-2007/program/presentations/lumley.pdf
Everything is a bit old; the last developments in these packages were in 2012...
download_sourceData("PNAD", 2002, unzip = T)
dom <- read_PNAD("domicilios", 2002) %>% data.table
Running the above, the positions of the variables beyond UF are wrong, compared to the import dictionary and visual inspection of the txt file.
This is confirmed when I inspect the import dictionary for PNAD-2002-dom:
get_import_dictionary('PNAD',2002,'domicilios')
All of this may have been caused by us updating the source files to a newer version from IBGE but forgetting to update the import dictionaries. @nicolassoarespinto can you import all the PNAD import dictionaries into R again, from the SAS import dictionaries?
First of all, congratulations on the excellent work you have done in creating this package. It saved me a lot of time.
The issue I have is that I was unable to download the PNAD data from 2014. When I tried, I got the following error:
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip'
downloaded 0 bytes
Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
cannot download all files
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip': status was '550 Requested action not taken; file unavailable'
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file
Some filenames contain Latin characters. This is a problem when extracting files with the unzip() function; apparently there is no built-in option to deal with these characters. Example: in the Censo da Educação Superior dataset, after extracting year 2011, the data folder became "Microdados Educa‡Æo Superior 2011".
Until this problem is solved, the metadata for Censo da Educação Superior contains corrupted filenames, so years 2005, 2006, 2007 and 2011 can only be read if the download and unzip were done by download_sourceData().
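One possible workaround (an untested assumption, not a confirmed fix) is to delegate extraction to the system unzip binary where available, which tends to handle the archive's original filename encoding better than R's internal method:

```r
# Sketch: prefer the system unzip tool, falling back to R's internal unzip
unzip_safely <- function(zipfile, exdir) {
  if (nzchar(Sys.which("unzip"))) {
    system2("unzip", c("-o", shQuote(zipfile), "-d", shQuote(exdir)))
  } else {
    utils::unzip(zipfile, exdir = exdir)
  }
}
```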
dataset_list <- c("PNAD", "CENSO", "POF", "CensoEscolar",
"CensoEducacaoSuperior")
currently only in download_sourceData
Before running this:
devtools::install_github("lucasmation/microdadosBrasil")
I needed to install the following packages, or the command above would fail:
chron, DBI, assertthat, tibble
The following files didn't pass test_read()
When I try to download one of the numerous datasets, I get a strange error message that doesn't list any of the datasets that should "follow"; it stops at the warning pasted below:
> download_sourceData("CENSO", 2000, unzip = T)
Error in download_sourceData("CENSO", 2000, unzip = T) : Invalid dataset. Must be one of the following:
I downloaded the extdata folder from git, but I am unsure if there is something I missed in downloading or reading the files. Is there some working directory I should have set (right now it is set to the folder containing the extdata folder), or data I should have previously downloaded? Unsure why this isn't working. Thanks!
Sometimes a file on an FTP server will not download even after a minute of trying. R then proceeds to the next file, but we need to warn the user about the missing file.
Also, maybe the download function should do a second pass, trying to download only the files that failed in the first (and maybe even a third).
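A sketch of such a multi-pass download with a final warning for the files that never arrived (the function and argument names are made up for illustration):

```r
# Retry failed downloads for a fixed number of passes, then warn about leftovers
download_with_retries <- function(urls, destdir, passes = 3) {
  pending <- urls
  for (p in seq_len(passes)) {
    failed <- character(0)
    for (u in pending) {
      dest <- file.path(destdir, basename(u))
      ok <- tryCatch({ download.file(u, dest); TRUE },
                     error = function(e) FALSE)
      if (!ok) failed <- c(failed, u)
    }
    pending <- failed
    if (length(pending) == 0) break  # everything downloaded; stop early
  }
  if (length(pending) > 0)
    warning("Could not download: ", paste(pending, collapse = ", "))
  invisible(pending)
}
```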
It would be nice to add a README_PT file, to have the documentation in Portuguese.
We need to check how to do that with devtools.
Perhaps one solution is to create a function similar to use_readme_rmd (see the code below), replacing every README with README_PT.
Please check if this could work.
use_readme_rmd
function (pkg = ".")
{
pkg <- as.package(pkg)
use_template("README.Rmd", ignore = TRUE, open = TRUE, pkg = pkg)
use_build_ignore("^README-..png$", escape = FALSE, pkg = pkg)
if (uses_git(pkg$path) && !file.exists(pkg$path, ".git",
"hooks", "pre-commit")) {
message(" Adding pre-commit hook")
use_git_hook("pre-commit", render_template("readme-rmd-pre-commit.sh"),
pkg = pkg)
}
invisible(TRUE)
}
Once working on a branch, the "pull" and "push" buttons are greyed out. How do I push changes on a branch to GitHub?
maybe using: ffdf
We should check whether a newer, simpler approach exists.
I installed all the packages but I could not download the RAIS data. This is the return message:
download_sourceData("RAIS", i = "2000")
Error in function (type, msg, asError = TRUE) :
Failed to connect to ftp.mtps.gov.br port 21: Timed out
Do you know if "ftp.mtps.gov.br" is down? What other problem may I have?
Best,
David
Based on the list with the imported data for all years ( #11 ), maybe create another list where each element is a data.frame containing the output of str().
I.e., each row should represent a variable in the imported dataset, with the following columns:
var_name, var_type, num_of_categories.
Then merge all the elements of the list for each subdataset into a single data frame, so we can see how var_type and num_of_categories evolve over the years.
The merge should be on var_name.
var_type and num_of_categories of each year should be renamed before the merge, to keep the year information in the column name.
Ex: for 2005:
var_type, num_of_categories become var_type2005, num_of_categories2005
@lucasmation should we do that?
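A sketch of the proposed summary-and-merge step (the `imported_list` and `years` objects are assumed to come from the import loop discussed in #11):

```r
# Build one str()-like summary per year, tagging columns with the year
summarise_vars <- function(d, year) {
  s <- data.frame(
    var_name          = names(d),
    var_type          = vapply(d, function(x) class(x)[1], character(1)),
    num_of_categories = vapply(d, function(x) length(unique(x)), integer(1)),
    stringsAsFactors  = FALSE)
  names(s)[2:3] <- paste0(names(s)[2:3], year)  # e.g. var_type2005
  s
}

# Merge all yearly summaries on var_name into a single comparison table
# merged <- Reduce(function(a, b) merge(a, b, by = "var_name", all = TRUE),
#                  Map(summarise_vars, imported_list, years))
```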
In the file CensoEscolar_files_metadata_harmonization.csv, for the year 2006, the column ft_escola_em_e_emprof has the value INPUT_SAS_EM22.sas&EM22_2005.TXT, i.e., it tries to read the 2005 data (if I understood correctly). The same happens in the ft_escola_educprof columns.
I'm running tests for all datasets. I will use this issue to record all the problems I find.
@lucasmation the dest argument seems outside the pattern used by the other functions. Should we change it to root_path?
It loads until 99% and then R stops. This was tested on the server and on a local computer.