
scrape_oryx's People

Contributors

ajnafa, kaldjian, scarnecchia


scrape_oryx's Issues

Issue in get_inputfile(): Files must all have 6 columns, File 20 has 7 columns.

When running on Linux, I get the following error:

Linux

Local .Rprofile detected at /mnt/d/Github/scrape_oryx/.Rprofile

library(rvest)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(purrr)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tibble)
library(stringr)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(glue)
library(logr)
library(ggplot2)
library(scales)
#> 
#> Attaching package: 'scales'
#> The following object is masked from 'package:readr':
#> 
#>     col_factor
#> The following object is masked from 'package:purrr':
#> 
#>     discard
library(ggthemes)
library(fs)

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file", regexp=".file") %>%
    dplyr::filter(!stringr::str_detect(path, ".bak")) %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

get_inputfile(.file="totals_by_system")
#> inputfiles/totals_by_system2022-04-08.csvinputfiles/totals_by_system2022-04-09.csvinputfiles/totals_by_system2022-04-10.csvinputfiles/totals_by_system2022-04-11.csvinputfiles/totals_by_system2022-04-12.csvinputfiles/totals_by_system2022-04-14.csvinputfiles/totals_by_system2022-04-15.csvinputfiles/totals_by_system2022-04-16.csvinputfiles/totals_by_system2022-04-17.csvinputfiles/totals_by_system2022-04-18.csvinputfiles/totals_by_system2022-04-19.csvinputfiles/totals_by_system2022-04-20.csvinputfiles/totals_by_system2022-04-21.csvinputfiles/totals_by_system2022-04-22.csvinputfiles/totals_by_system2022-04-23.csvinputfiles/totals_by_system2022-04-24.csvinputfiles/totals_by_system2022-04-25.csvinputfiles/totals_by_system2022-04-26.csvinputfiles/totals_by_system2022-05-02.csvinputfiles/totals_by_system2022-05-04.csvinputfiles/totals_by_system2022-05-06.csv
#> Error: Files must all have 6 columns:
#> * File 20 has 7 columns

Session Info

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] reprex_2.0.1 ggthemes_4.2.4 scales_1.1.1 ggplot2_3.3.5
[5] logr_1.2.9 glue_1.6.2 readr_2.1.2 stringr_1.4.0
[9] tibble_3.1.6 magrittr_2.0.2 purrr_0.3.4 lubridate_1.8.0
[13] tidyr_1.2.0 dplyr_1.0.8 rvest_1.0.2 renv_0.15.4

loaded via a namespace (and not attached):
[1] highr_0.9 pillar_1.7.0 compiler_4.2.0 tools_4.2.0
[5] digest_0.6.29 bit_4.0.4 evaluate_0.15 lifecycle_1.0.1
[9] gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.2 rstudioapi_0.13
[13] cli_3.2.0 yaml_2.3.5 parallel_4.2.0 xfun_0.30
[17] fastmap_1.1.0 knitr_1.37 withr_2.5.0 httr_1.4.2
[21] xml2_1.3.3 generics_0.1.2 vctrs_0.3.8 fs_1.5.2
[25] hms_1.1.1 bit64_4.0.5 grid_4.2.0 tidyselect_1.1.2
[29] R6_2.5.1 processx_3.5.2 this.path_0.5.1 fansi_1.0.2
[33] vroom_1.5.7 rmarkdown_2.13 clipr_0.8.0 callr_3.7.0
[37] tzdb_0.2.0 ps_1.6.0 htmltools_0.5.2 ellipsis_0.3.2
[41] colorspace_2.0-3 utf8_1.2.2 stringi_1.7.6 munsell_0.5.0
[45] crayon_1.5.0

Created on 2022-05-08 by the reprex package (v2.0.1)

When running on Windows, the error doesn't occur:

Windows

setwd("D:/Github/scrape_oryx")
library(rvest)
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(purrr)
library(magrittr)
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tibble)
library(stringr)
library(readr)
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(glue)
library(logr)
library(ggplot2)
library(scales)
#> Attaching package: 'scales'
#> The following object is masked from 'package:readr':
#> 
#>     col_factor
#> The following object is masked from 'package:purrr':
#> 
#>     discard
library(ggthemes)
library(fs)

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file", regexp=".file") %>%
    dplyr::filter(!stringr::str_detect(path, ".bak")) %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)
  
  message(path)
  
  # logr::put(path)
  
  readr::read_csv(path)
}

get_inputfile(.file="totals_by_system")
#> inputfiles/totals_by_system2022-05-06.csv
#> Rows: 4557 Columns: 6
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr  (5): country, origin, system, status, url
#> date (1): date_recorded
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 4,557 x 6
#>    country origin system            status    url                  date_recorded
#>    <chr>   <chr>  <chr>             <chr>     <chr>                <date>       
#>  1 Russia  Russia (Unknown) truck   destroyed https://twitter.com~ 2022-05-05   
#>  2 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  3 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  4 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  5 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  6 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  7 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  8 Russia  Russia (Unknown) vehicle captured  https://i.postimg.c~ 2022-03-19   
#>  9 Russia  Russia (Unknown) vehicle destroyed https://twitter.com~ 2022-03-26   
#> 10 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-26   
#> # ... with 4,547 more rows


Created on 2022-05-08 by the reprex package (v2.0.1)

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22610)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] ggthemes_4.2.4 scales_1.1.1 ggplot2_3.3.5 logr_1.2.9 glue_1.6.2 readr_2.1.2 stringr_1.4.0 tibble_3.1.6 magrittr_2.0.2 purrr_0.3.4
[11] lubridate_1.8.0 tidyr_1.2.0 dplyr_1.0.8 rvest_1.0.2

loaded via a namespace (and not attached):
[1] svglite_2.1.0 ps_1.6.0 digest_0.6.29 utf8_1.2.2 R6_2.5.1 reprex_2.0.1 evaluate_0.15 httr_1.4.2 highr_0.9
[10] pillar_1.7.0 rlang_1.0.2 rstudioapi_0.13 callr_3.7.0 R.utils_2.11.0 R.oo_1.24.0 rmarkdown_2.13 styler_1.7.0 webshot_0.5.2
[19] this.path_0.5.1 bit_4.0.4 munsell_0.5.0 compiler_4.1.0 xfun_0.30 pkgconfig_2.0.3 systemfonts_1.0.4 clipr_0.8.0 htmltools_0.5.2
[28] tidyselect_1.1.2 fansi_1.0.2 viridisLite_0.4.0 crayon_1.5.0 tzdb_0.2.0 withr_2.5.0 R.methodsS3_1.8.1 grid_4.1.0 gtable_0.3.0
[37] lifecycle_1.0.1 cli_3.2.0 stringi_1.7.6 vroom_1.5.7 renv_0.15.4 fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.2
[46] vctrs_0.3.8 kableExtra_1.3.4 tools_4.1.0 bit64_4.0.5 R.cache_0.15.0 hms_1.1.1 processx_3.5.2 parallel_4.1.0 fastmap_1.1.0
[55] yaml_2.3.5 colorspace_2.0-3 knitr_1.37
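
In the Linux run above, message(path) prints 21 paths, so the birth_time filter did not narrow the listing to a single file there, and read_csv() received every file at once. Below is a minimal sketch of one possible workaround. It assumes (the thread does not confirm this) that birth_time is not a reliable tiebreaker on that filesystem, and that file names always end in a YYYY-MM-DD date as in the listing. The helper name latest_inputfile is hypothetical and not part of the repository.

```r
# Hypothetical helper: choose the newest input file by the date embedded in
# its name instead of relying on filesystem birth_time.
# Assumes `prefix` contains no regex metacharacters.
latest_inputfile <- function(paths, prefix) {
  # Match e.g. "totals_by_system2022-05-06.csv"; ".csv.bak" files do not match.
  pattern <- paste0("^", prefix, "([0-9]{4}-[0-9]{2}-[0-9]{2})\\.csv$")
  candidates <- paths[grepl(pattern, basename(paths))]
  if (length(candidates) == 0) {
    stop("No input file found for prefix: ", prefix)
  }
  # Pull the date out of each file name and keep the most recent one.
  dates <- as.Date(sub(pattern, "\\1", basename(candidates)))
  candidates[which.max(as.numeric(dates))]
}

paths <- c(
  "inputfiles/totals_by_system2022-05-02.csv",
  "inputfiles/totals_by_system2022-05-06.csv",
  "inputfiles/totals_by_system2022-05-04.csv",
  "inputfiles/totals_by_system2022-05-06.csv.bak"
)
latest <- latest_inputfile(paths, "totals_by_system")
```

In the real script the paths vector could come from list.files("inputfiles", full.names = TRUE), and the single path returned would then be passed to readr::read_csv().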

scraper: no updates in 4 days

Saw that the *.csv repo hadn't been updated in 4 days, and was wondering if something broke ='(. I hope all is well.

There's a bug.

For Helicopters, the script is unable to pull data for the 'abandoned' status. I guess it's probably because the order is different from that of other equipment types. This also happens with other equipment_type values.

One possible fix (and note, switching to argument defaults instead of hard links makes it compatible with the Wayback Machine for getting archival data):

Creating a new issue for this idea by @leedrake5, as it offers a more flexible option for coding the URLs.

One possible fix (and note, switching to argument defaults instead of hard links makes it compatible with the Wayback Machine for getting archival data):

russia_materiel <-
  get_data(
    russia_link,
    "article"
  ) %>%
  rvest::html_elements("li")

ukraine_materiel <-
  get_data(
    ukraine_link,
    "article"
  ) %>%
  rvest::html_elements("li")

materiel <- c(russia_materiel, ukraine_materiel)

Note that this requires the 'do' package to concatenate XML node sets.
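
If only the text of each list item is needed downstream (an assumption, since the thread doesn't show how materiel is used), one way to sidestep concatenating node sets is to extract the text from each page first and combine plain character vectors. This is only a sketch: the inline HTML stands in for the pages the script would fetch, and the variable names are illustrative.

```r
library(rvest)

# Stand-in documents; the real script would fetch these from the two links.
russia_page <- read_html(
  "<article><ul><li>7 BMP-1KSh command and staff vehicle</li></ul></article>"
)
ukraine_page <- read_html(
  "<article><ul><li>2 9P148 Konkurs</li></ul></article>"
)

# Extract the text of every <li> inside <article> for each page separately.
russia_text <- html_text2(html_elements(russia_page, "article li"))
ukraine_text <- html_text2(html_elements(ukraine_page, "article li"))

# Character vectors combine with base c(); no extra package is needed.
materiel_text <- c(russia_text, ukraine_text)
```

The trade-off is that the node structure (links, images, attributes) is dropped at this step, so this only fits if the later parsing works on text alone.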

Originally posted by @leedrake5 in #18 (comment)

Quality Assurance

Generate a report which notes differences in counts between datasets to indicate when a quality assurance issue exists.
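
A rough sketch of what such a report could look like, assuming the check is a per-group row count compared across two data frames. The function name count_differences, the grouping column, and the example rows are all hypothetical; the thread does not specify which datasets or counts should be compared.

```r
# Hypothetical helper: report groups whose row counts differ between two
# data frames that share a grouping column `by`.
count_differences <- function(a, b, by) {
  key_a <- as.character(a[[by]])
  key_b <- as.character(b[[by]])
  groups <- sort(union(unique(key_a), unique(key_b)))

  # Count rows per group; a group missing from one side counts as 0.
  n_a <- vapply(groups, function(g) sum(key_a == g, na.rm = TRUE),
                integer(1), USE.NAMES = FALSE)
  n_b <- vapply(groups, function(g) sum(key_b == g, na.rm = TRUE),
                integer(1), USE.NAMES = FALSE)

  report <- data.frame(group = groups, n_a = n_a, n_b = n_b,
                       diff = n_a - n_b, stringsAsFactors = FALSE)
  # Keep only the groups where the two datasets disagree.
  report[report$diff != 0, , drop = FALSE]
}

# Illustrative data only.
by_system <- data.frame(country = c("Russia", "Russia", "Russia", "Ukraine"))
by_type <- data.frame(country = c("Russia", "Russia", "Ukraine"))
report <- count_differences(by_system, by_type, by = "country")
```

An empty report would mean the counts agree; any remaining rows point at groups worth inspecting by hand.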

Israel-Hamas Conflict

It looks like there is a similar effort to log Israel and Hamas equipment losses here: https://armadarotta.blogspot.com/2023/10/israel-at-war-tracking-equipment-losses.html

However it looks like the functions don't apply here. I'm poking around to see if there is any way to adapt the script to a different page's headings (Google Blogspot), but not having much success yet. I wasn't sure if this was an easy incorporation or if the different platform meant it would be more difficult to adapt to tracking.

get_inputfile(): the line filtering on str_detect has a misspelled argument name (file instead of .file), causing an error

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file") %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

Should be:

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file") %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

Script can't handle non-breaking space

It seems that the script can't handle the non-breaking space (&nbsp;) on the Oryx website.
Source code from webpage: ... />&nbsp;7 BMP-1KSh command and staff vehicle: the result is incorrect, with 7 BMP-1KSh treated as a new system (the system is BMP-1KSh, not 7 BMP-1KSh).
Source code from webpage: ... /> 2 9P148 Konkurs: the result is correct, 9P148 Konkurs.
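
One way this could be handled, sketched under the assumption that each entry is a leading count followed by the system name: replace U+00A0 with an ordinary space before splitting. The helper parse_entries is hypothetical and is not how the script is known to parse entries.

```r
# Hypothetical helper: split "<count> <system>" entries after normalising
# non-breaking spaces (U+00A0), which trimws() does not strip by default.
parse_entries <- function(x) {
  clean <- trimws(gsub("\u00a0", " ", x, fixed = TRUE))
  pattern <- "^([0-9]+) +(.+)$"
  data.frame(
    count = as.integer(sub(pattern, "\\1", clean)),
    system = sub(pattern, "\\2", clean),
    stringsAsFactors = FALSE
  )
}

entries <- c(
  "\u00a07 BMP-1KSh command and staff vehicle",  # leading non-breaking space
  " 2 9P148 Konkurs"                             # leading ordinary space
)
parsed <- parse_entries(entries)
```

Entries that do not start with a count would come back with an NA count (and a coercion warning), which makes them easy to spot.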
