scarnecchia / scrape_oryx

A simple R script for extracting tabular data from Oryx' excellent post detailing materiel lost by all sides in the [Russian invasion of Ukraine](https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html).

License: MIT License

R 100.00%

scrape_oryx's Issues

Issue in get_inputfile(): Files must have all 6 columns, File 20 has 7 columns.

When running on Linux, I get the following error:

Linux

Local .Rprofile detected at /mnt/d/Github/scrape_oryx/.Rprofile

library(rvest)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(purrr)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tibble)
library(stringr)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(glue)
library(logr)
library(ggplot2)
library(scales)
#> 
#> Attaching package: 'scales'
#> The following object is masked from 'package:readr':
#> 
#>     col_factor
#> The following object is masked from 'package:purrr':
#> 
#>     discard
library(ggthemes)
library(fs)

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file", regexp=".file") %>%
    dplyr::filter(!stringr::str_detect(path, ".bak")) %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

get_inputfile(.file="totals_by_system")
#> inputfiles/totals_by_system2022-04-08.csvinputfiles/totals_by_system2022-04-09.csvinputfiles/totals_by_system2022-04-10.csvinputfiles/totals_by_system2022-04-11.csvinputfiles/totals_by_system2022-04-12.csvinputfiles/totals_by_system2022-04-14.csvinputfiles/totals_by_system2022-04-15.csvinputfiles/totals_by_system2022-04-16.csvinputfiles/totals_by_system2022-04-17.csvinputfiles/totals_by_system2022-04-18.csvinputfiles/totals_by_system2022-04-19.csvinputfiles/totals_by_system2022-04-20.csvinputfiles/totals_by_system2022-04-21.csvinputfiles/totals_by_system2022-04-22.csvinputfiles/totals_by_system2022-04-23.csvinputfiles/totals_by_system2022-04-24.csvinputfiles/totals_by_system2022-04-25.csvinputfiles/totals_by_system2022-04-26.csvinputfiles/totals_by_system2022-05-02.csvinputfiles/totals_by_system2022-05-04.csvinputfiles/totals_by_system2022-05-06.csv
#> Error: Files must all have 6 columns:
#> * File 20 has 7 columns

Session Info

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] reprex_2.0.1 ggthemes_4.2.4 scales_1.1.1 ggplot2_3.3.5
[5] logr_1.2.9 glue_1.6.2 readr_2.1.2 stringr_1.4.0
[9] tibble_3.1.6 magrittr_2.0.2 purrr_0.3.4 lubridate_1.8.0
[13] tidyr_1.2.0 dplyr_1.0.8 rvest_1.0.2 renv_0.15.4

loaded via a namespace (and not attached):
[1] highr_0.9 pillar_1.7.0 compiler_4.2.0 tools_4.2.0
[5] digest_0.6.29 bit_4.0.4 evaluate_0.15 lifecycle_1.0.1
[9] gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.2 rstudioapi_0.13
[13] cli_3.2.0 yaml_2.3.5 parallel_4.2.0 xfun_0.30
[17] fastmap_1.1.0 knitr_1.37 withr_2.5.0 httr_1.4.2
[21] xml2_1.3.3 generics_0.1.2 vctrs_0.3.8 fs_1.5.2
[25] hms_1.1.1 bit64_4.0.5 grid_4.2.0 tidyselect_1.1.2
[29] R6_2.5.1 processx_3.5.2 this.path_0.5.1 fansi_1.0.2
[33] vroom_1.5.7 rmarkdown_2.13 clipr_0.8.0 callr_3.7.0
[37] tzdb_0.2.0 ps_1.6.0 htmltools_0.5.2 ellipsis_0.3.2
[41] colorspace_2.0-3 utf8_1.2.2 stringi_1.7.6 munsell_0.5.0
[45] crayon_1.5.0

<sup>Created on 2022-05-08 by the reprex package (v2.0.1)</sup>

When running on Windows, the error doesn't occur:

Windows

setwd("D:/Github/scrape_oryx")
library(rvest)
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(purrr)
library(magrittr)
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tibble)
library(stringr)
library(readr)
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(glue)
library(logr)
library(ggplot2)
library(scales)
#> Attaching package: 'scales'
#> The following object is masked from 'package:readr':
#> 
#>     col_factor
#> The following object is masked from 'package:purrr':
#> 
#>     discard
library(ggthemes)
library(fs)

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file", regexp=".file") %>%
    dplyr::filter(!stringr::str_detect(path, ".bak")) %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)
  
  message(path)
  
  # logr::put(path)
  
  readr::read_csv(path)
}

get_inputfile(.file="totals_by_system")
#> inputfiles/totals_by_system2022-05-06.csv
#> Rows: 4557 Columns: 6
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr  (5): country, origin, system, status, url
#> date (1): date_recorded
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 4,557 x 6
#>    country origin system            status    url                  date_recorded
#>    <chr>   <chr>  <chr>             <chr>     <chr>                <date>       
#>  1 Russia  Russia (Unknown) truck   destroyed https://twitter.com~ 2022-05-05   
#>  2 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  3 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  4 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  5 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  6 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  7 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-19   
#>  8 Russia  Russia (Unknown) vehicle captured  https://i.postimg.c~ 2022-03-19   
#>  9 Russia  Russia (Unknown) vehicle destroyed https://twitter.com~ 2022-03-26   
#> 10 Russia  Russia (Unknown) vehicle destroyed https://i.postimg.c~ 2022-03-26   
#> # ... with 4,547 more rows


<sup>Created on 2022-05-08 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

SessionInfo

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22610)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] ggthemes_4.2.4 scales_1.1.1 ggplot2_3.3.5 logr_1.2.9 glue_1.6.2 readr_2.1.2 stringr_1.4.0 tibble_3.1.6 magrittr_2.0.2 purrr_0.3.4
[11] lubridate_1.8.0 tidyr_1.2.0 dplyr_1.0.8 rvest_1.0.2

loaded via a namespace (and not attached):
[1] svglite_2.1.0 ps_1.6.0 digest_0.6.29 utf8_1.2.2 R6_2.5.1 reprex_2.0.1 evaluate_0.15 httr_1.4.2 highr_0.9
[10] pillar_1.7.0 rlang_1.0.2 rstudioapi_0.13 callr_3.7.0 R.utils_2.11.0 R.oo_1.24.0 rmarkdown_2.13 styler_1.7.0 webshot_0.5.2
[19] this.path_0.5.1 bit_4.0.4 munsell_0.5.0 compiler_4.1.0 xfun_0.30 pkgconfig_2.0.3 systemfonts_1.0.4 clipr_0.8.0 htmltools_0.5.2
[28] tidyselect_1.1.2 fansi_1.0.2 viridisLite_0.4.0 crayon_1.5.0 tzdb_0.2.0 withr_2.5.0 R.methodsS3_1.8.1 grid_4.1.0 gtable_0.3.0
[37] lifecycle_1.0.1 cli_3.2.0 stringi_1.7.6 vroom_1.5.7 renv_0.15.4 fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.2
[46] vctrs_0.3.8 kableExtra_1.3.4 tools_4.1.0 bit64_4.0.5 R.cache_0.15.0 hms_1.1.1 processx_3.5.2 parallel_4.1.0 fastmap_1.1.0
[55] yaml_2.3.5 colorspace_2.0-3 knitr_1.37
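
The Linux run above prints every matching file instead of one, which suggests that the birth_time filter did not narrow the listing on that filesystem. One possible workaround, sketched here and not taken from the repository, is to pick the newest file by the date embedded in its name. The helper name latest_by_name is hypothetical, and the sketch assumes every file name contains a YYYY-MM-DD date as in the listing above.

```r
library(stringr)

# Hypothetical helper: choose the newest file by the date in its file name,
# so the result does not depend on filesystem timestamps.
latest_by_name <- function(paths) {
  # Assumes each file name contains a YYYY-MM-DD date
  dates <- as.Date(str_extract(basename(paths), "\\d{4}-\\d{2}-\\d{2}"))
  paths[which.max(as.numeric(dates))]
}

# Paths copied from the Linux output above
paths <- c(
  "inputfiles/totals_by_system2022-04-26.csv",
  "inputfiles/totals_by_system2022-05-06.csv",
  "inputfiles/totals_by_system2022-05-02.csv"
)

latest <- latest_by_name(paths)
```

The single path returned could then be passed to readr::read_csv() in place of the vector of paths.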

scraper: no updates in 4 days

Saw that the *.csv repo hadn't been updated in 4 days, and was wondering if something broke ='(. I hope all is well.

Log open is missing NAMESPACE

Line 27 is currently lf <- log_open(tmp). It should be lf <- logr::log_open(tmp).

Oryx Site has Changed

Russia is still located here: https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html

But Ukraine is now listed here: https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-ukrainian.html

The good news is that you no longer need to reference the T-64BV tank to find positions; the bad news is that a lot of functions need to be updated.

Rename output "totals_by_system_wide".

Data output in "totals_by_system_wide" is not actually wide, just aggregated by system.

Rewrite to produce actual wide dataset.
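
As a rough illustration of what a wide layout could look like, the sketch below counts records per country, system, and status and then spreads the status values into columns with tidyr::pivot_wider(). The toy rows are invented for the example (only the column names country, system, and status come from the output shown earlier), and this is one possible shape, not necessarily the one the maintainer intends.

```r
library(dplyr)
library(tidyr)

# Toy data (invented rows); column names follow the totals_by_system output
totals <- tibble(
  country = c("Russia", "Russia", "Russia", "Ukraine"),
  system  = c("T-72B", "T-72B", "T-72B", "T-64BV"),
  status  = c("destroyed", "destroyed", "captured", "destroyed")
)

# One row per country/system, one column per status
wide <- totals %>%
  count(country, system, status) %>%
  pivot_wider(names_from = status, values_from = n, values_fill = 0L)
```

Each status becomes its own column, with 0 filled in where a system has no record for that status.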

There's a bug.

For helicopters, the script is unable to pull the 'abandoned' data. I guess this is probably because the order differs from that of other equipment types. This also happens with other equipment_type values.

One possible fix (and note, switching to argument defaults instead of hard links makes it compatible with the Wayback Machine for getting archival data):

Creating a new issue for this idea by @leedrake5, as it offers a more flexible option for coding the URLs.

One possible fix (and note, switching to argument defaults instead of hard links makes it compatible with the Wayback Machine for getting archival data):

russia_materiel <-
  get_data(
    russia_link,
    "article"
  ) %>%
  rvest::html_elements("li")

ukraine_materiel <-
  get_data(
    ukraine_link,
    "article"
  ) %>%
  rvest::html_elements("li")

materiel <- c(russia_materiel, ukraine_materiel)

Note that this requires the 'do' package to concat xml node sets.

Originally posted by @leedrake5 in #18 (comment)
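
To make the "argument defaults" idea concrete, here is a minimal sketch. The function name get_materiel is hypothetical, and it calls rvest directly instead of the repository's get_data(), whose exact behaviour is not shown in this thread. The default URLs are the two Oryx pages mentioned in the "Oryx Site has Changed" issue. Returning a named list keeps the two node sets separate, which sidesteps the need to concatenate them.

```r
library(rvest)

# Hypothetical wrapper: the page URLs are arguments with defaults, so a caller
# can substitute another URL (for example an archived copy of the page).
get_materiel <- function(
    russia_link = "https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html",
    ukraine_link = "https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-ukrainian.html") {
  links <- c(russia = russia_link, ukraine = ukraine_link)
  # One node set of <li> elements per page, named by country
  lapply(links, function(link) {
    read_html(link) %>%
      html_elements("article") %>%
      html_elements("li")
  })
}
```

Calling get_materiel() with no arguments would fetch the live pages (this needs network access); passing russia_link or ukraine_link overrides the default for that page only.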

Twitter request to extract html elements

Quality Assurance

Generate a report which notes differences in counts between datasets to indicate when a quality assurance issue exists.
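
A possible starting point for such a report is sketched below. The two tibbles are toy stand-ins for two outputs that should agree on per-country totals; their names and contents are assumptions made for the example, not the repository's actual files.

```r
library(dplyr)

# Toy stand-ins (invented values) for two datasets that should agree
by_system <- tibble(
  country = c("Russia", "Russia", "Ukraine"),
  system  = c("A", "B", "A"),
  n       = c(3L, 2L, 1L)
)
by_type <- tibble(
  country = c("Russia", "Ukraine"),
  n       = c(5L, 2L)
)

# Keep only countries whose totals differ or are missing from one dataset
qa_report <- by_system %>%
  group_by(country) %>%
  summarise(n_system = sum(n), .groups = "drop") %>%
  full_join(rename(by_type, n_type = n), by = "country") %>%
  mutate(diff = n_system - n_type) %>%
  filter(is.na(diff) | diff != 0)
```

An empty qa_report would mean the two datasets agree; any remaining rows point at where the counts diverge.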

Israel-Hamas Conflict

It looks like there is a similar effort to log Israel and Hamas equipment losses here: https://armadarotta.blogspot.com/2023/10/israel-at-war-tracking-equipment-losses.html

However, it looks like the functions don't apply here. I'm poking around to see if there is any way to adapt the script to a different page's headings (Google Blogspot), but I'm not having much success yet. I wasn't sure whether this would be an easy incorporation or whether the different platform would make it more difficult to adapt.

get_inputfile(): the line filtering on str_detect has a misspelled argument name (file instead of .file), causing an error

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file") %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

Should be:

get_inputfile <- function(.file) {
  path <- fs::dir_info("inputfiles", type = "file") %>%
    dplyr::select(path, change_time, birth_time) %>%
    dplyr::filter(stringr::str_detect(path, .file)) %>%
    dplyr::filter(birth_time == max(birth_time)) %>%
    dplyr::pull(path)

  message(path)

  # logr::put(path)

  readr::read_csv(path)
}

Script can't handle non-breaking space

It seems to me that the script can't handle the non-breaking space (&nbsp;) on the Oryx website.
Source code from webpage: ... /> 7 BMP-1KSh command and staff vehicle: the result incorrectly lists 7 BMP-1KSh as a new system (the system is BMP-1KSh, not 7 BMP-1KSh).
Source code from webpage: ... /> 2 9P148 Konkurs: the result is correctly 9P148 Konkurs.
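
One way to handle this, sketched under the assumption that the affected entries separate the count from the system name with U+00A0 instead of a regular space, is to normalise non-breaking spaces before parsing. The helper name normalize_spaces and the parsing regex are illustrative only and are not taken from the script.

```r
library(stringr)

# Hypothetical helper: replace non-breaking spaces (U+00A0) with regular spaces
normalize_spaces <- function(x) {
  str_replace_all(x, "\u00a0", " ")
}

# First entry uses a non-breaking space after the count, second a regular one
raw <- c(
  "7\u00a0BMP-1KSh command and staff vehicle:",
  "2 9P148 Konkurs:"
)

clean <- normalize_spaces(raw)

# Split "<count> <system>:" into its two parts
parts   <- str_match(clean, "^(\\d+) (.+):$")
counts  <- as.integer(parts[, 2])
systems <- parts[, 3]
```

After normalisation both entries split the same way, so the count no longer ends up inside the system name.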
