microdadosbrasil's People

Contributors

brunolgc, daniellima123, gutorc92, leoniedu, lnribeiro, lucasmation, lucianomoura90, monteirogustavo, nicolassoarespinto, raphael-gouvea, zlkrvsm


microdadosbrasil's Issues

Reading a subset of variables

Hi guys,

I was wondering if you could include an argument to the read_DATA functions that specifies the variables to be read. This argument would be extremely useful for time efficiency and/or memory limits. Most of the time we don't need to read the entire census data set, only a subset of its columns.

As you know, this is quite straightforward with readr. Here is what I usually do:

# load the import dictionary as a data.table
  dic_dom <- get_import_dictionary("CENSO", 2000, "domicilios") %>% setDT()

# select the subset of variables to be read from the .txt files
  myvariables_dom <- c("V0102", "V0103", "V0104", "V0300", "V0400", "V1007")
  dic_dom <- dic_dom[var_name %in% myvariables_dom]  # filter the dictionary

# read only the selected variables
  read_fwf(datafile,
           fwf_positions(start     = dic_dom[, int_pos],
                         end       = dic_dom[, fin_pos],
                         col_names = dic_dom[, var_name]),
           progress = interactive())
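For reference, here is a sketch of what the proposed interface could look like. The wrapper name and the `vars` argument are hypothetical; it simply wraps the get_import_dictionary() + readr workflow above, reusing the dictionary columns int_pos, fin_pos and var_name.

# hypothetical wrapper illustrating the proposed `vars` argument
read_CENSO_subset <- function(ft, year, datafile, vars) {
  dic <- get_import_dictionary("CENSO", year, ft)
  dic <- dic[dic$var_name %in% vars, ]              # keep only the requested variables
  readr::read_fwf(datafile,
                  readr::fwf_positions(start     = dic$int_pos,
                                       end       = dic$fin_pos,
                                       col_names = dic$var_name),
                  progress = interactive())
}

# usage:
# dom <- read_CENSO_subset("domicilios", 2000, datafile,
#                          vars = c("V0102", "V0103", "V0300"))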

add tests for all file types

Create a test file that expands the tests currently run on PNAD to all datasets.

There should be a download loop,

and an "import loop". The import loop should actually be two nested loops, one over datasets and one over years.

Each iteration should record the time it takes, the number of observations of the imported file and the number of columns of the imported file (see the sketch below).

The test should be run twice, once from the local PC and once from the powerful server.

In the server tests, it would be useful to keep the imported dataset of each iteration in a list (or two nested lists, as in the case of the dictionaries) so they can be used to test the compatibility of the files over time.
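A rough sketch of what the import loop could look like. The helper names get_available_periods() and read_data() are assumptions about the package's interface; adjust to whatever the final API is.

datasets <- c("PNAD", "CENSO", "CensoEscolar", "RAIS")   # extend to all supported datasets
results  <- list()
for (ds in datasets) {
  for (yr in get_available_periods(ds)) {                # assumed helper returning the years available for ds
    t0 <- Sys.time()
    d  <- read_data(ds, ft = "pessoas", i = yr)          # assumed generic reader; pick the right file type per dataset
    results[[paste(ds, yr, sep = "_")]] <- data.frame(
      dataset = ds,
      year    = yr,
      seconds = as.numeric(difftime(Sys.time(), t0, units = "secs")),
      n_obs   = nrow(d),
      n_cols  = ncol(d)
    )
  }
}
summary_table <- do.call(rbind, results)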

Can't download CENSUS data

I could not download Census data. Did we break something?

I tried:

download_sourceData("CENSO", 2000, unzip = T)
Error in download_sourceData("CENSO", 2000, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB

download_sourceData("CENSO", 2010, unzip = T)
Error in download_sourceData("CENSO", 2010, unzip = T) :
Invalid dataset. Must be one of the following: CadUnico, CAGED, CensoEducacaoSuperior, CensoEscolar, CensoIBGE, PME, PNAD, PNADContinua, POF, RAIS, SISOB

The code above was working before (it is the first example in the package README file).

Also, the error message indicates that microdadosBrasil expects "CensoIBGE".
I tried with that but it also did not work:

download_sourceData("CensoIBGE", 2010, unzip = T)
Error in download_sourceData("CensoIBGE", 2010, unzip = T) :
Can't download dataset, there are no information about the source

Then I tried repeating the code above, just changing the year to 2000, but got a weird message:

download_sourceData("CensoIBGE", 2000, unzip = T)
Error in download_sourceData("CensoIBGE", 2000, unzip = T) :
This data was already downloaded.(check:
./testing_integen_mobility_within_young_adults_living_with_parents_Brazil_Census.R)
If you want to overwrite the previous files add replace=T to the function call.

The check for whether the data already exists is wrong. I think it is currently just checking whether there is any file in the folder (see message above).
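A sketch of a stricter check (the file names used here are only illustrative): instead of testing whether the destination folder contains any file at all, test for the specific files that download_sourceData() is supposed to create.

already_downloaded <- function(dest, expected_files) {
  # TRUE only if every expected file is actually present in the destination folder
  all(file.exists(file.path(dest, expected_files)))
}

# e.g. already_downloaded("CENSO/2000", c("Dom2000.zip", "Pes2000.zip"))  # illustrative names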

@nicolassoarespinto , can you take a look?

Needed to install/load other packages for microdados functions to work

I am working on the RAIS files (if it matters). Not sure if this is a problem for me only, but if not, this should be included in the "installation" section of the README file:

# installation
install.packages("devtools")
...
install.packages("dplyr")  # needed this installed first to avoid an error in install_github() below
install.packages("readr")  # needed this installed first to avoid an error in install_github() below
devtools::install_github("lucasmation/microdadosBrasil")
library('microdadosBrasil')
# end installation

# use RAIS
library(RCurl)  # needed this before download_sourceData() would work
download_sourceData("RAIS", i = "2000")
...

Apply test for CRAN publication

See the section on publishing to CRAN in Hadley's R Packages book, in particular the checking chapter:

http://r-pkgs.had.co.nz/check.html

After reading the documentation, run:

devtools::check()

Output: Status: 1 ERROR, 8 WARNINGs, 5 NOTEs

Then look at the log.

I looked quickly and most of the errors can be fixed following the suggestions in the chapter.

Strange filepath error in RAIS 2014 Download

The function appears to have a problem when downloading the 2014 RAIS.

Run the following code and pay attention to the first printed value from the function:

download_sourceData("RAIS", i = "2013")
download_sourceData("RAIS", i = "2014")

This is my output:

download_sourceData("RAIS", i = "2013")
[1] "2013/AC2013.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z"
trying URL 'ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2013/AC2013.7z'

...

download_sourceData("RAIS", i = "2014")
[1] "/AC2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z"
downloaded 0 bytes
Error in download.file(file_links[y], destfile = paste(c(dest, file_dir, :
cannot download all files
In addition: Warning messages:
1: In dir.create(paste(c(dest, file_dir), collapse = "/")) :
cannot create dir '', reason 'No such file or directory'
2: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
URL ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AC2014.7z: cannot open destfile '/AC2014.7z', reason 'Permission denied'
3: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
downloaded length 0 != reported length 3625219
[1] "/AL2014.7z"
[1] "ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014//AL2014.7z"
downloaded 0 bytes
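One plausible explanation (an assumption based on the printed values above) is that the per-year subdirectory recorded in the metadata is empty for 2014, so the destination path collapses to the filesystem root:

paste(c("2013", "AC2013.7z"), collapse = "/")   # "2013/AC2013.7z"  (what the 2013 call prints)
paste(c("",     "AC2014.7z"), collapse = "/")   # "/AC2014.7z"      (what the 2014 call prints)

If that is the case, fixing the 2014 entry in the RAIS metadata (or skipping empty directory components when building the path) should solve it.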

error in CensoEscolar - escola - 1999 import dictionary

There is an error in the import dictionary for this dataset-year.
check:

CensoEscolar_dics[[5]][[1]] %>% View()

I think this is a problem in the original import dictionary, as the extracts below, from the dictionary and from the data, show.

The SAS dictionary contains:

DATA CENSOESC;                                                      

     INFILE "C:\DADOS_CENSOESC.TXT"  LRECL=7880 MISSOVER;                                           
     INPUT                                                      

       @ 1 MASCARA 8.       /*  Código da Escola   */                                      
       @ 9 ANO 5.       /*  Mascara da Escola   */                                      
       @ 14 CODMUNIC $12.       /*  Ano do Censo Escolar    */                                      
       @ 26 UF $50.         /*  Código do Município   */                                      
       @ 76 SIGLA $2.       /*  Nome da Unidade Federativa  */

the corresponding data (for these columns) is:
0000000001111111111122222222223333333333444444444455555555556666666666
1234567890123456789012345678901234567890123456789012345678901234567890
20046137 1999110100100205Rondonia ROPORTO VELHO
20043596 1999110100100205Rondonia ROPORTO VELHO

add CAGED

ftp://ftp.mtps.gov.br/pdet/microdados/CAGED/

error installing the package

devtools::install_github("lucasmation/microdadosBrasil")
Downloading GitHub repo lucasmation/microdadosBrasil@master
from URL https://api.github.com/repos/lucasmation/microdadosBrasil/zipball/master
Installing microdadosBrasil
"C:/PROGRA1/R/R-331.1/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore --quiet CMD INSTALL
"C:/Users/r342471958/AppData/Local/Temp/RtmpUdb7Hu/devtools102440056502/lucasmation-microdadosBrasil-3d7133d"
--library="C:/Users/r342471958/Documents/R/win-library/3.3"
--install-tests

* installing *source* package 'microdadosBrasil' ...
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  there is no package called 'chron'
ERROR: lazy loading failed for package 'microdadosBrasil'
* removing 'C:/Users/r342471958/Documents/R/win-library/3.3/microdadosBrasil'
Error: Command failed (1)

pnad 'pessoas' file, too slow to read

Weird. The 'pessoas' file is taking way too long to read compared to the 'domicilios' file.

d <- read_PNAD("domicilios", 2002)
Time difference of 1.341 secs
size: 0.1 Gb
d2 <- read_PNAD("pessoas", 2002)
Time difference of 3.777167 mins
0.9 Gb

add RAIS

ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2014/

Source_file_mark=TRUE

@nicolassoarespinto ,

what do you think of adding a new parameter "Source_file_mark=TRUE" to the import functions?
This would add a column containing the name of the file the observation was read from.
This is relevant for datasets that come spread across multiple files, which our import functions gather into a single data.table. Afterwards, if someone sees something weird in an observation, it is not clear which file it came from. So the package would do something like this:

Source_file_mark=TRUE >> add column Source_file

This column would contain, for

read_SISOB("obitos", c(200103,200104) ) -> d

d would contain

Source_file
200103.txt
200103.txt
...
...
200104.txt
...

The default would be Source_file_mark=TRUE, but the user could opt out with Source_file_mark=FALSE.

Let me know what you think and I can implement it.
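For reference, a minimal sketch of the idea. read_one_file() is a hypothetical stand-in for whatever the import functions already do per file; only the Source_file column is the new part.

library(data.table)

read_many <- function(files, Source_file_mark = TRUE) {
  pieces <- lapply(files, function(f) {
    d <- read_one_file(f)                         # hypothetical per-file reader
    if (Source_file_mark) d$Source_file <- basename(f)   # tag each observation with its source file
    d
  })
  rbindlist(pieces, use.names = TRUE, fill = TRUE)
}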

Download PNAD errors

PNAD is also not downloading, although it may be a problem with the source.
In any case, despite the data not downloading, the download function tries to unzip it. It would be nice to avoid that, if it is not too difficult to implement.

regards
Lucas

download_sourceData("PNAD", 2002, unzip = T)
[1] "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002.zip"
[1] "PNAD_reponderado_2002"
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
cannot open URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/"), :
InternetOpenUrl failed: 'Uma conexão com o servidor não pôde ser estabelecida
'
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file

PNAD domicílios import dictionary is wrong

download_sourceData("PNAD", 2002, unzip = T)
dom <- read_PNAD("domicilios", 2002) %>% data.table

Running the above, the positions of the variables beyond UF are wrong when compared to the import dictionary and to a visual inspection of the txt file.

This is confirmed when I inspect the import dictionary for PNAD-2002-dom:

get_import_dictionary('PNAD',2002,'domicilios')

  • the first variable (ano) is not included
  • the variable UF is OK, but all other variables start and end in the wrong positions.
  • this is probably because this import dictionary is specified by start position + length rather than start + end. For instance, UF and v0102 both start at position 5 (see the small example below).
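A small example of the fix that would be needed if the dictionary really stores start + length. The column name `length` is an assumption; the actual dictionary column may be named differently.

# the end position must be derived from start + length, not taken as-is:
dic$fin_pos <- dic$int_pos + dic$length - 1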

All of this may have been caused by us updating the source files to a newer version from IBGE but forgetting to update the import dictionaries. @nicolassoarespinto, can you re-import all the PNAD import dictionaries from the SAS import dictionaries into R?

Unable to download pnad data from 2014

First of all, congratulations on the excellent work you have done in creating this package. It has saved me a lot of time.

The issue I have is that I was unable to download the PNAD data from 2014. When I tried, I got the following error:

trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip'
downloaded 0 bytes

Error in download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
cannot download all files
In addition: Warning message:
In download.file(link, destfile = paste(c(dest, filename), collapse = "/")) :
URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2014/Dados201604.zip': status was '550 Requested action not taken; file unavailable'
Warning message:
In unzip(paste(c(dest, filename), collapse = "/"), exdir = paste(c(dest, :
error 1 in extracting from zip file

Encoding problem in unzip()

Some filenames contain Latin characters. This is a problem when extracting files with the unzip() function; apparently there is no built-in option to deal with these characters. Example: in the Censo da Educação Superior dataset, after extracting the year 2011, the data folder became "Microdados Educa‡Æo Superior 2011".
Until this problem is solved, the metadata for Censo da Educação Superior contains corrupted filenames, so the years 2005, 2006, 2007 and 2011 can only be read if the download and extraction were done by download_sourceData().
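Until a proper fix exists, a possible workaround (a sketch, not part of the package) is to transliterate the extracted directory names to plain ASCII right after unzip(), so the metadata can refer to stable paths:

fix_extracted_names <- function(path) {
  dirs <- list.dirs(path, recursive = FALSE)
  for (d in dirs) {
    # transliterate non-ASCII characters; anything unconvertible becomes "_"
    ascii <- iconv(basename(d), to = "ASCII//TRANSLIT", sub = "_")
    if (!identical(ascii, basename(d))) {
      file.rename(d, file.path(dirname(d), ascii))
    }
  }
}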

Dependencies to add for main installation

Before running this:

devtools::install_github("lucasmation/microdadosBrasil")

I needed to install the following packages first, or the command above would fail:

chron, DBI, assertthat, tibble
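Until these are declared as package dependencies, a simple workaround is to install them up front:

install.packages(c("chron", "DBI", "assertthat", "tibble"))
devtools::install_github("lucasmation/microdadosBrasil")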

Problems accessing data

When I try to download one of the datasets I get a strange error message: the "Must be one of the following:" list is empty, and it stops at the error pasted below:

> download_sourceData("CENSO", 2000, unzip = T) Error in download_sourceData("CENSO", 2000, unzip = T) : Invalid dataset. Must be one of the following:

I have downloaded the extdata folder from git, but I am unsure whether I missed something when downloading or reading the files. Is there a working directory I should have set (right now it is set to the folder containing the extdata folder), or data I should have downloaded previously? Unsure why this isn't working. Thanks!

ftp server data: should warn user of download errors

Sometimes a file on an FTP server will not download even after a minute of trying. R then proceeds to the next file, but we need to warn the user about the missing file.

Also, maybe the download function should do a second pass, trying to download only the files that failed in the first one (and maybe even a third pass). A sketch is below.
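A sketch of what the retry passes could look like (a hypothetical helper, not the package's current code): try each URL up to `passes` times and warn about whatever still failed at the end.

download_with_retry <- function(urls, destfiles, passes = 2) {
  failed <- seq_along(urls)
  for (p in seq_len(passes)) {
    still_failed <- integer(0)
    for (i in failed) {
      # treat any error or warning from download.file() as a failure for this pass
      ok <- tryCatch({
        download.file(urls[i], destfile = destfiles[i], mode = "wb")
        TRUE
      }, error = function(e) FALSE, warning = function(w) FALSE)
      if (!ok) still_failed <- c(still_failed, i)
    }
    failed <- still_failed
    if (length(failed) == 0) break
  }
  if (length(failed) > 0) {
    warning("Could not download: ", paste(basename(urls[failed]), collapse = ", "))
  }
  invisible(failed)
}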

add README_POR file

It would be nice to add a README_PT file, to have the documentation in Portuguese.

We need to check how to do that with devtools.

Perhaps one solution is to create a function similar to use_readme_rmd (see the code below), changing every "README" to "README_PT".

Please check whether this could work.

use_readme_rmd
function (pkg = ".")
{
    pkg <- as.package(pkg)
    use_template("README.Rmd", ignore = TRUE, open = TRUE, pkg = pkg)
    use_build_ignore("^README-.*\\.png$", escape = FALSE, pkg = pkg)
    if (uses_git(pkg$path) && !file.exists(pkg$path, ".git", "hooks", "pre-commit")) {
        message("* Adding pre-commit hook")
        use_git_hook("pre-commit", render_template("readme-rmd-pre-commit.sh"), pkg = pkg)
    }
    invisible(TRUE)
}

Failed to access RAIS data

I installed all the packages but I could not download the RAIS data. This is the return message:

download_sourceData("RAIS", i = "2000")
Error in function (type, msg, asError = TRUE) :
Failed to connect to ftp.mtps.gov.br port 21: Timed out

Do you know if "ftp.mtps.gov.br" is down? What other problem may I have?
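One quick way to check from R whether the FTP host itself is reachable (a sketch; it assumes the RCurl package, which an earlier issue suggests the download functions already rely on):

RCurl::url.exists("ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/")   # TRUE if the FTP server answers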
Best,
David

within dataset-subdataset check variable type consistency over time

Based on the list with the imported data for all years (#11), maybe create another list where each element is a data.frame summarizing the output of str(). I.e., each row should represent a variable in the imported dataset, with the following columns:
var_name, var_type, num_of_categories.

Then merge all the elements of the list for each subdataset into a single data.frame, so we can see how var_type and num_of_categories evolve over time.
The merge should be on var_name.
var_type and num_of_categories of each year should be renamed before the merge, to keep the year information in the column names.
Ex: for 2005, var_type and num_of_categories become var_type2005 and num_of_categories2005. A sketch is below.
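A sketch of how each element of that list could be built and then merged on var_name (the helper name is hypothetical; `imported_list` stands for the per-year list of imported data.frames from #11):

summarise_vars <- function(d, year) {
  out <- data.frame(
    var_name          = names(d),
    var_type          = vapply(d, function(x) class(x)[1], character(1)),
    num_of_categories = vapply(d, function(x) length(unique(x)), integer(1)),
    stringsAsFactors  = FALSE
  )
  names(out)[-1] <- paste0(names(out)[-1], year)   # e.g. var_type2005, num_of_categories2005
  out
}

# summaries <- Map(summarise_vars, imported_list, names(imported_list))
# Reduce(function(a, b) merge(a, b, by = "var_name", all = TRUE), summaries)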

Does the 2006 Censo Escolar read 2005 data?

In the file CensoEscolar_files_metadata_harmonization.csv, for the year 2006, the column ft_escola_em_e_emprof has the value INPUT_SAS_EM22.sas&EM22_2005.TXT, i.e., it tries to read the 2005 data (if I understood it correctly). The same happens in the ft_escola_educprof columns.
