lucasmation / microdadosBrasil
Reads the most common Brazilian public microdata (CENSO, PNAD, etc.) easily and fast.
Yearly data will accept the year in numeric (2010) or string ('2010') format.
Quarterly data and monthly data have to be strings, and explicit:
'2014-1q' (1st quarter)
'2015-1m' (January)
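As an illustration, here are hedged calls following the package's read_<DATASET> naming pattern (the specific function and subdataset names below are assumptions, not confirmed API):

```r
# Hypothetical calls illustrating the accepted period formats
d1 <- read_CENSO("pessoas", 2010)              # yearly: numeric year
d2 <- read_CENSO("pessoas", "2010")            # yearly: string year
d3 <- read_PNADcontinua("pessoas", "2014-1q")  # quarterly: must be a string
d4 <- read_PME("pessoas", "2015-1m")           # monthly: must be a string
```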
Hi guys,
I was wondering if you could include an argument to the read_DATA
functions that specifies the variables that should be read. This argument would be extremely useful for time efficiency and/or memory limits. Most of the time we don't need to read the entire census data set, only a subset of its columns.
As you know, this is quite straightforward with readr. Here is what I usually do:
# load the data dictionary as a data.table
dic_dom <- get_import_dictionary("CENSO", 2000, "domicilios") %>% setDT()
# select the subset of variables to be read from the .txt files
my_variables_dom <- c("V0102", "V0103", "V0104", "V0300", "V0400", "V1007")
dic_dom <- dic_dom[var_name %in% my_variables_dom]  # filter the dictionary
# read only the selected variables
read_fwf(datafile,
         fwf_positions(start = dic_dom[, int_pos],
                       end = dic_dom[, fin_pos],
                       col_names = dic_dom[, var_name]),
         progress = interactive())
I changed the function name:
read_PNADContinua
to
read_PNADcontinua
I only changed the function name and the README.
Please check whether other internal components (dictionary, metadata csv) also need to change accordingly, and the test files too.
Create a test file that expands the tests run on PNAD to all datasets.
There should be a download loop
and an "import loop".
The import loop should actually be two nested loops, one over years and one over datasets.
Each iteration should record the time it takes, the number of observations of the imported file and the number of columns of the imported file.
The test should be run twice, once from the local PC and once from the powerful server.
In the server tests it would be useful to keep the imported dataset of each iteration in a list (or two nested lists, as in the case of the dictionaries) so they can be used to test the compatibility of the files over time.
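The nested import loop described above could be sketched like this (the dataset, subdataset and year values are illustrative, and the read functions are looked up by the package's read_<DATASET> naming pattern):

```r
# Sketch: time each import and record the dimensions of the result
results <- list()
for (dataset in c("PNAD", "CENSO")) {
  for (year in c(2000, 2010)) {
    t0 <- Sys.time()
    d  <- tryCatch(get(paste0("read_", dataset))("pessoas", year),
                   error = function(e) NULL)  # keep looping on failure
    results[[paste(dataset, year, sep = "_")]] <- data.frame(
      dataset = dataset, year = year,
      seconds = as.numeric(difftime(Sys.time(), t0, units = "secs")),
      n_obs   = if (is.null(d)) NA else nrow(d),
      n_cols  = if (is.null(d)) NA else ncol(d))
  }
}
do.call(rbind, results)  # one row per dataset-year with timing and dimensions
```

On the server, the same loop could additionally store each `d` in a list for the later compatibility checks.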
I could not download Census data. Did we break something?
I tried:
download_sourceData("CENSO", 2000, unzip = T)
Error in download_sourceData("CENSO", 2000, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB
download_sourceData("CENSO", 2010, unzip = T)
Error in download_sourceData("CENSO", 2010, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB
The code above was working before (it is the first example in the package README file).
Also, the error message indicates that microdadosBrasil expects "CensoIBGE".
I tried that, but it also did not work:
download_sourceData("CensoIBGE", 2010, unzip = T)
Error in download_sourceData("CensoIBGE", 2010, unzip = T) :
Can't download dataset, there are no information about the source
Then I tried repeating the code above, just changing the year to 2000, but got a weird message:
download_sourceData("CensoIBGE", 2000, unzip = T)
Error in download_sourceData("CensoIBGE", 2000, unzip = T) :
This data was already downloaded.(check:
./testing_integen_mobility_within_young_adults_living_with_parents_Brazil_Census.R)
If you want to overwrite the previous files add replace=T to the function call.
The check for whether the data already exists is wrong. I think it currently just checks whether there is any file in the folder (see the message above).
@nicolassoarespinto , can you take a look?
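A sketch of a stricter check, testing for the specific files the dataset is expected to produce rather than for any file in the folder (`expected_files` is a hypothetical name for information that would presumably come from the download metadata):

```r
# Only report "already downloaded" when every expected file is actually present
already_downloaded <- function(dest, expected_files) {
  all(file.exists(file.path(dest, expected_files)))
}
```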
Working on the RAIS files (if it matters). Not sure if this is a problem only for me. But if not, it should be included in the "installation" section of the README file:
#installation
install.packages("devtools")
...
install.packages("dplyr") # needed this installed first to avoid an error in install_github() below
install.packages("readr") # needed this installed first to avoid an error in install_github() below
devtools::install_github("lucasmation/microdadosBrasil")
library('microdadosBrasil')
#end installation
#use RAIS
library(RCurl) #needed this before download_sourceData would work
download_sourceData("RAIS", i = "2000")
...
See the section on publishing to CRAN in Hadley's R Packages book,
in particular the checking chapter:
http://r-pkgs.had.co.nz/check.html
after reading the documentation, run:
devtools::check()
output: Status: 1 ERROR, 8 WARNINGs, 5 NOTEs
then look at the log.
I looked quickly, and most errors can be fixed with the suggestions in that chapter.
The function appears to have a problem when downloading the 2014 RAIS.
Run the following code and pay attention to the first value printed by the function:
download_sourceData("RAIS", i = "2013")
download_sourceData("RAIS", i = "2014")
This is my output:
download_sourceData("RAIS", i = "2013")
[1] "2013/AC2013.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z"
trying URL 'ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z'
...
download_sourceData("RAIS", i = "2014")
[1] "/AC2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z"
downloaded 0 bytes
Error in download.file(file_links[y], destfile = paste(c(dest, file_dir, :
cannot download all files
In addition: Warning messages:
1: In dir.create(paste(c(dest, file_dir), collapse = "/")) :
cannot create dir '', reason 'No such file or directory'
2: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
URL ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z: cannot open destfile '/AC2014.7z', reason 'Permission denied'
3: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
downloaded length 0 != reported length 3625219
[1] "/AL2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AL2014.7z"
downloaded 0 bytes
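The log suggests that for 2014 the year subdirectory is dropped when the local destination path is built ("/AC2014.7z" instead of "2014/AC2014.7z"). A minimal defensive sketch, using hypothetical internal names (`dest`, `file_dir`) inferred from the error messages:

```r
# Hypothetical sketch: always include the year in the local destination path
year     <- "2014"
file_dir <- file.path(dest, year)             # e.g. "<dest>/2014", never ""
dir.create(file_dir, recursive = TRUE, showWarnings = FALSE)
destfile <- file.path(file_dir, "AC2014.7z")  # avoids writing to "/AC2014.7z"
```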
There is an error in the import dictionary for this dataset-year.
check:
CensoEscolar_dics[[5]][[1]] %>% View()
I think this is a problem in the original import dictionary, as the extracts below, from the dictionary and the data, show.
the SAS dictionary contains:
DATA CENSOESC;
INFILE "C:\DADOS_CENSOESC.TXT" LRECL=7880 MISSOVER;
INPUT
@ 1 MASCARA 8. /* Código da Escola */
@ 9 ANO 5. /* Mascara da Escola */
@ 14 CODMUNIC $12. /* Ano do Censo Escolar */
@ 26 UF $50. /* Código do Município */
@ 76 SIGLA $2. /* Nome da Unidade Federativa */
the corresponding data (for these columns) is:
0000000001111111111122222222223333333333444444444455555555556666666666
1234567890123456789012345678901234567890123456789012345678901234567890
20046137 1999110100100205Rondonia ROPORTO VELHO
20043596 1999110100100205Rondonia ROPORTO VELHO
This works:
# Set working directory to folder in the local computer
setwd('C:/myfolder')
download_sourceData("CensoEducacaoSuperior",i = 2005)
This does not:
#Set working directory to a folder over the network, outside current station
setwd("//path_to_network_location")
download_sourceData("CensoEducacaoSuperior",i = 2005)
ftp://ftp.mtps.gov.br/pdet/microdados/CAGED/
I suggest:
if the files already exist:
stop("This data was already downloaded. If you want to overwrite it, add replace=T to the function call.")
devtools::install_github("lucasmation/microdadosBrasil")
Downloading GitHub repo lucasmation/microdadosBrasil@master
from URL https://api.github.com/repos/lucasmation/microdadosBrasil/zipball/master
Installing microdadosBrasil
"C:/PROGRA~1/R/R-3.3.1/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore --quiet CMD INSTALL
"C:/Users/r342471958/AppData/Local/Temp/RtmpUdb7Hu/devtools102440056502/lucasmation-microdadosBrasil-3d7133d"
--library="C:/Users/r342471958/Documents/R/win-library/3.3"
--install-tests
Weird. The 'pessoas' file is taking way too long to read compared to the 'domicilios' file.
d <- read_PNAD("domicilios", 2002)
Time difference of 1.341 secs
size: 0.1 Gb
d2 <- read_PNAD("pessoas", 2002)
Time difference of 3.777167 mins
0.9 Gb
This function only works if the package was loaded by devtools::load_all().
Didn't we use to have a get_import_dictionary() function?
It should be added to dictionary_functions.R.
ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014/
What do you think of adding a new parameter, Source_file_mark=TRUE, to the import functions?
This would add a column containing the name of the file the observation was read from.
This is relevant for datasets that come spread across multiple files, which our import functions gather into a single data.table. Afterward, if someone sees something weird in an observation, it is not clear which file it came from. So the package would do something like this:
Source_file_mark=TRUE >> add column Source_file
which would contain, for
read_SISOB("obitos", c(200103,200104) ) -> d
d would contain
Source_file
200103.txt
200103.txt
...
...
200304.txt
...
The default would be Source_file_mark=TRUE, but the user could opt out with Source_file_mark=FALSE.
Let me know what you think and I can implement it.
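A minimal sketch of how Source_file_mark could work internally, assuming the import step reads each file with data.table::fread and stacks the pieces (the function and internal names here are hypothetical):

```r
library(data.table)

# Read several files and optionally tag each row with its source file name
read_all_files <- function(files, source_file_mark = TRUE) {
  pieces <- lapply(files, fread)
  names(pieces) <- basename(files)  # e.g. "200103.txt", "200104.txt"
  if (source_file_mark) {
    rbindlist(pieces, idcol = "Source_file")  # adds the Source_file column
  } else {
    rbindlist(pieces)
  }
}
```

`rbindlist(idcol=)` fills the new column from the list names, which is exactly the file-of-origin information the proposal asks for.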
Existential question: why don't we have the import dictionaries as .csv files, as we do for the metadata files?
I think we should. That way all metadata is in the same file format, and all of it can be revised by non-programmers if necessary.
@nicolassoarespinto: do you see any problem with this?
CensoEscolar_dics[[20]] %>% str()
logi NA
Position 20 of the list is where the 2014 dictionaries should be, but it is empty.
PNAD is also not downloading, although it may be a problem with the source.
In any case, despite the data not downloading, the download function tries to unzip it. It would be nice to avoid that, if it is not too difficult to implement.
regards
Lucas
download_sourceData("PNAD", 2002, unzip = T)
[1] "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002"
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
cannot open URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
InternetOpenUrl failed: 'Uma conexão com o servidor não pôde ser estabelecida' (a connection to the server could not be established)
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file
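A sketch of guarding the unzip step so it only runs after a successful, non-empty download (the variable names follow the error messages above, but the surrounding internals are assumed):

```r
# Only unzip when the download actually succeeded and produced a non-empty file
destfile <- paste(c(dest, filename), collapse = "/")
status <- tryCatch(download.file(link, destfile = destfile),
                   error = function(e) 1L)  # treat errors as a nonzero status
if (status == 0 && file.exists(destfile) && file.size(destfile) > 0) {
  unzip(destfile, exdir = dest)
} else {
  warning("Download failed; skipping unzip for ", filename)
}
```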
The traditional package is survey, explained here:
http://r-survey.r-forge.r-project.org/survey/index.html
Other options are listed here:
http://www.hcp.med.harvard.edu/statistics/survey-soft/
In particular sqlsurvey:
http://sqlsurvey.r-forge.r-project.org/
http://r-forge.r-project.org/projects/sqlsurvey/
https://www.r-project.org/conferences/useR-2007/program/presentations/lumley.pdf
Everything is a bit old; the last developments in these packages were in 2012...
download_sourceData("PNAD", 2002, unzip = T)
dom <- read_PNAD("domicilios", 2002) %>% data.table
Running the above, the positions of the variables beyond UF are wrong, compared to the import dictionary and visual inspection of the txt file.
This is confirmed when I inspect the import dictionary for PNAD-2002-dom:
get_import_dictionary('PNAD',2002,'domicilios')
All of this may have been caused by us updating the source files to a newer version from IBGE but forgetting to update the import dictionaries. @nicolassoarespinto can you import all the PNAD import dictionaries into R again, from the SAS import dictionaries?
First of all, congratulations on the excellent work you have done in creating this package. It saved me a lot of time.
The issue I have is that I was unable to download the PNAD data from 2014. When I tried, I got the following error:
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip'
downloaded 0 bytes
Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
cannot download all files
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip': status was '550 Requested action not taken; file unavailable'
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file
Some filenames contain Latin characters. This is a problem when extracting files with the unzip() function; apparently there is no built-in option to deal with these characters. Example: in the Censo da Educação Superior dataset, after extracting year 2011, the data folder became "Microdados Educa‡Æo Superior 2011".
Until this problem is solved, the metadata for Censo da Educação Superior contains corrupted filenames, so years 2005, 2006, 2007 and 2011 can only be read if the download and unzip were done by download_sourceData().
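One possible workaround (an untested assumption, not a confirmed fix) is to delegate extraction to the system unzip binary where available, which tends to handle the archive's original filename encoding better than R's internal method:

```r
# Sketch: prefer the system unzip tool, falling back to R's internal unzip
unzip_safely <- function(zipfile, exdir) {
  if (nzchar(Sys.which("unzip"))) {
    system2("unzip", c("-o", shQuote(zipfile), "-d", shQuote(exdir)))
  } else {
    utils::unzip(zipfile, exdir = exdir)
  }
}
```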
dataset_list <- c("PNAD", "CENSO", "POF", "CensoEscolar",
"CensoEducacaoSuperior")
currently only in download_sourceData
Before running this:
devtools::install_github("lucasmation/microdadosBrasil")
I needed to install the following packages, or the command above would fail:
chron, DBI, assertthat, tibble
The following files didn't pass test_read()
When I try to download one of the numerous datasets, I get a strange error message that doesn't list any of the datasets that should "follow"; it stops at the warning pasted below:
> download_sourceData("CENSO", 2000, unzip = T)
Error in download_sourceData("CENSO", 2000, unzip = T) : Invalid dataset. Must be one of the following:
I downloaded the extdata folder from git, but I am unsure if there is something I missed in downloading or reading the files. Is there some working directory I should have set (right now it is set to the folder containing the extdata folder), or data I should have previously downloaded? Unsure why this isn't working. Thanks!
Sometimes a file on an FTP server will not download even after a minute of trying. R then proceeds to the next file, but we need to warn the user about the missing file.
Also, maybe the download function should do a second pass, trying to download only the files that failed in the first (and maybe even a third).
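A sketch of such a multi-pass download with a final warning for the files that never arrived (the function and argument names are made up for illustration):

```r
# Retry failed downloads for a fixed number of passes, then warn about leftovers
download_with_retries <- function(urls, destdir, passes = 3) {
  pending <- urls
  for (p in seq_len(passes)) {
    failed <- character(0)
    for (u in pending) {
      dest <- file.path(destdir, basename(u))
      ok <- tryCatch({ download.file(u, dest); TRUE },
                     error = function(e) FALSE)
      if (!ok) failed <- c(failed, u)
    }
    pending <- failed
    if (length(pending) == 0) break  # everything downloaded; stop early
  }
  if (length(pending) > 0)
    warning("Could not download: ", paste(pending, collapse = ", "))
  invisible(pending)
}
```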
It would be nice to add a README_PT file, to have the documentation in Portuguese.
We need to check how to do that with devtools.
Perhaps one solution is to create a function similar to use_readme_rmd (see the code below), replacing every README with README_PT.
Please check if this could work.
use_readme_rmd
function (pkg = ".")
{
pkg <- as.package(pkg)
use_template("README.Rmd", ignore = TRUE, open = TRUE, pkg = pkg)
use_build_ignore("^README-..png$", escape = FALSE, pkg = pkg)
if (uses_git(pkg$path) && !file.exists(pkg$path, ".git",
"hooks", "pre-commit")) {
message(" Adding pre-commit hook")
use_git_hook("pre-commit", render_template("readme-rmd-pre-commit.sh"),
pkg = pkg)
}
invisible(TRUE)
}
Once working on a branch, the "pull" and "push" buttons are greyed out. How do I push changes on a branch to GitHub?
maybe using: ffdf
We should check whether a newer, simpler approach exists.
I installed all the packages but I could not download the RAIS data. This is the return message:
download_sourceData("RAIS", i = "2000")
Error in function (type, msg, asError = TRUE) :
Failed to connect to ftp.mtps.gov.br port 21: Timed out
Do you know if "ftp.mtps.gov.br" is down? What other problem may I have?
Best,
David
Based on the list with the imported data for all years ( #11 ), maybe create another list where each element is a data.frame containing the output of str().
I.e., each row should represent a variable in the imported dataset, with the following columns:
var_name, var_type, num_of_categories.
Then merge all the elements of the list for each subdataset into a single data frame, so we can see how var_type and num_of_categories evolve over the years.
The merge should be on var_name.
var_type and num_of_categories of each year should be renamed before the merge, to keep the year information in the column name.
Ex: for 2005:
var_type, num_of_categories become var_type2005, num_of_categories2005
@lucasmation should we do that?
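A sketch of the proposed summary-and-merge step (the `imported_list` and `years` objects are assumed to come from the import loop discussed in #11):

```r
# Build one str()-like summary per year, tagging columns with the year
summarise_vars <- function(d, year) {
  s <- data.frame(
    var_name          = names(d),
    var_type          = vapply(d, function(x) class(x)[1], character(1)),
    num_of_categories = vapply(d, function(x) length(unique(x)), integer(1)),
    stringsAsFactors  = FALSE)
  names(s)[2:3] <- paste0(names(s)[2:3], year)  # e.g. var_type2005
  s
}

# Merge all yearly summaries on var_name into a single comparison table
# merged <- Reduce(function(a, b) merge(a, b, by = "var_name", all = TRUE),
#                  Map(summarise_vars, imported_list, years))
```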
In the file CensoEscolar_files_metadata_harmonization.csv, for the year 2006, the column ft_escola_em_e_emprof has the value INPUT_SAS_EM22.sas&EM22_2005.TXT, i.e., it tries to read the 2005 data (if I understood correctly). The same happens in the ft_escola_educprof columns.
I'm running tests for all datasets. I will use this issue to record all the problems I find.
@lucasmation the dest argument seems outside the pattern used by the other functions. Should we change it to root_path?
It loads until 99% and then R stops. This was tested on the server and on a local computer.