
adar's Introduction

adaR


adaR is a wrapper for ada-url, a fast, WHATWG-compliant URL parser written in modern C++.

It also implements several auxiliary functions for working with URLs:

  • public suffix extraction (top level domain excluding private domains) like psl
  • fast C++ implementation of utils::URLdecode (~40x speedup)

More general information on URL parsing can be found in the introductory vignette via vignette("adaR").

adaR is part of a series of R packages to analyse webtracking data:

Installation

You can install the development version of adaR from GitHub with:

# install.packages("devtools")
devtools::install_github("gesistsa/adaR")

The version on CRAN can be installed with

install.packages("adaR")

Example

This is a basic example which shows all the returned components of a URL.

library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#>                                                      href protocol username
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag   https:   user_1
#>     password             host    hostname port pathname search  hash
#> 1 password_1 example.org:8080 example.org 8080     /api   ?q=1 #frag
  /*
   * https://user:pass@example.com:1234/foo/bar?baz#quux
   *       |     |    |          | ^^^^|       |   |
   *       |     |    |          | |   |       |   `----- hash_start
   *       |     |    |          | |   |       `--------- search_start
   *       |     |    |          | |   `----------------- pathname_start
   *       |     |    |          | `--------------------- port
   *       |     |    |          `----------------------- host_end
   *       |     |    `---------------------------------- host_start
   *       |     `--------------------------------------- username_end
   *       `--------------------------------------------- protocol_end
   */
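
To make the offsets in the diagram concrete, here is a minimal base R sketch that slices a URL of the same shape by position. The offset values are hand-computed for this one string and are an assumption for illustration (in particular, host_end = 29 is read here as an exclusive end); they are not values returned by adaR.

```r
url <- "https://user:pass@example.com:1234/foo/bar?baz#quux"

# 0-based offsets read off the diagram (hand-computed, assumed for illustration).
off <- list(
    protocol_end = 6, username_end = 12, host_start = 17, host_end = 29,
    pathname_start = 34, search_start = 42, hash_start = 46
)

# substr() is 1-based and inclusive, hence the +1/+2/+3 shifts below.
components <- c(
    protocol = substr(url, 1, off$protocol_end),
    username = substr(url, off$protocol_end + 3, off$username_end),
    password = substr(url, off$username_end + 2, off$host_start),
    hostname = substr(url, off$host_start + 2, off$host_end),
    port     = substr(url, off$host_end + 2, off$pathname_start),
    pathname = substr(url, off$pathname_start + 1, off$search_start),
    search   = substr(url, off$search_start + 1, off$hash_start),
    hash     = substr(url, off$hash_start + 1, nchar(url))
)
print(components)
```

Storing offsets instead of copies of each component means one string plus a handful of integers describes the whole URL.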

It solves some problems of urltools with more complex urls.

urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
   7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>   scheme                            domain port
#> 1  https 40.7519848,-74.0015045,14.\n   7z <NA>
#>                                                                                 path
#> 1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   parameter fragment
#> 1      <NA>     <NA>

ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>                                                                                                                                                                         href
#> 1 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   protocol username password           host       hostname port
#> 1   https:                   www.google.com www.google.com     
#>                                                                                                                                               pathname
#> 1 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   search hash
#> 1

A “raw” URL parse using ada is extremely fast (see ada-url.com), but carrying that speed over to R is tricky. The performance is still comparable to urltools::url_parse, with the noted advantage in accuracy in some practical circumstances.

bench::mark(
    ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
    urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
    iterations = 1, check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ada          2.43ms   2.43ms      411.    2.49KB        0
#> 2 urltools   526.26µs 526.26µs     1900.    2.49KB        0

For further benchmark results, see benchmark.md in data_raw.

There are four more groups of functions available to work with url parsing:

  • ada_get_*() get a specific component
  • ada_has_*() check if a specific component is present
  • ada_set_*() set a specific component of URLs
  • ada_clear_*() remove a specific component from URLs

Public Suffix extraction

public_suffix() extracts the top-level domain of each URL according to the public suffix list, excluding private domains.

urls <- c(
    "https://subsub.sub.domain.co.uk",
    "https://domain.api.gov.uk",
    "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk"                            "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"

If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched.
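
The wildcard matching can be sketched in base R as a longest-suffix lookup over hostname labels. This is only an illustration under stated assumptions: toy_rules is a tiny made-up subset of the public suffix list, toy_public_suffix() is a hypothetical helper, and neither reflects how adaR implements the lookup internally.

```r
# Toy rule set (assumption); the real public suffix list is far longer.
toy_rules <- c("uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

# Hypothetical helper, not part of adaR: longest-suffix match on labels.
toy_public_suffix <- function(hostname, rules = toy_rules) {
    labels <- strsplit(tolower(hostname), ".", fixed = TRUE)[[1]]
    n <- length(labels)
    # Go from the longest candidate to the shortest, so the first hit wins.
    for (i in seq_len(n)) {
        candidate <- labels[i:n]
        exact <- paste(candidate, collapse = ".")
        # A wildcard rule stands in for the leftmost label of the candidate.
        wildcard <- paste(c("*", candidate[-1]), collapse = ".")
        if (exact %in% rules || wildcard %in% rules) {
            return(exact)
        }
    }
    NA_character_
}

hosts <- c(
    "subsub.sub.domain.co.uk",
    "domain.api.gov.uk",
    "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, toy_public_suffix, character(1), USE.NAMES = FALSE)
print(suffixes)
```

The sketch ignores exception rules (entries starting with "!" in the real list), which a complete implementation would have to handle.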

Acknowledgement

The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.

adar's People

Contributors

chainsawriot, schochastics

adar's Issues

crash for bad input

R> adaR::ada_url_parse(NA)

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: Rcpp_ada_parse(url, nchar(url))
 2: adaR::ada_url_parse(NA)
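
One way to avoid handing missing values to compiled code is to filter them on the R side first. The following is a minimal sketch, assuming a hypothetical wrapper parse_non_missing() and a stand-in toy_parser(); it does not show how adaR actually fixed this.

```r
# Hypothetical wrapper: call `parser` only on non-missing strings.
parse_non_missing <- function(url, parser) {
    url <- as.character(url)  # a logical NA becomes NA_character_
    out <- rep(NA_character_, length(url))
    ok <- !is.na(url)
    if (any(ok)) {
        out[ok] <- parser(url[ok])
    }
    out
}

# Stand-in parser for illustration; a real one would call into C++.
toy_parser <- function(x) sub("^([a-z]+:).*$", "\\1", x)

print(parse_non_missing(c("https://example.org", NA), toy_parser))
print(parse_non_missing(NA, toy_parser))
```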

public_suffix() doesn't work with paths

adaR::public_suffix("http://google.de/test")
#> [1] NA

related to #51

Note that psl also fails.

psl::public_suffix("http://google.de/test")
#> [1] "de/test"

need to reconsider if this is a bug or a feature
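
If this is treated as a bug, one option is to reduce each URL to its hostname before the suffix lookup. A rough base R sketch follows; strip_to_hostname() is a hypothetical helper, and the regular expressions are a simplification that does not handle IPv6 literals or every edge case a full parser covers.

```r
# Hypothetical helper, not part of adaR or psl.
strip_to_hostname <- function(url) {
    # Drop the scheme and the "//" that follows it.
    rest <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    # Keep everything before the first "/", "?" or "#".
    authority <- sub("[/?#].*$", "", rest)
    # Drop optional userinfo and an optional numeric port.
    authority <- sub("^.*@", "", authority)
    sub(":[0-9]*$", "", authority)
}

print(strip_to_hostname("http://google.de/test"))
print(strip_to_hostname("https://user:pw@sub.google.de:8080/a?b#c"))
```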

Forced encoding change in `ada_set_*()` functions

Hi!
I've encountered a bug in the adaR::ada_set_* family of functions related to pathname processing.
In cases where a URL is in punycode (domain starting with xn--), using adaR's set functions changes the pathname encoding, and I don't know how to prevent (or revert) this behavior.

For example:

examples <- c(
  "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
  "http://xn--80abb0biooohbv.xn--p1ai/",
  "http://xn--alicantesueo-khb.com/insomnio",
  "https://normal-url.com/this-path-will-be-fine",
  "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)
pathnames <- adaR::ada_get_pathname(examples, decode = FALSE)
result_pathnames <- adaR::ada_set_pathname(examples, pathnames, decode = FALSE)

will return:

result_pathnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"

Notice the 1st and 5th URLs.

This happens even though adaR::ada_get_pathname(examples, decode = FALSE) returns correct output:

pathnames 
[1] "/pood/junior-electrical-engineer-jobs-remote.html" 
[2] "/"                                                
[3] "/insomnio"                                         
[4] "/this-path-will-be-fine"                          
[5] "/this-path-will-not-be-fine"  

The same behavior is present even when pathname isn't changed, for example:

hostnames <- adaR::ada_get_hostname(examples, decode = FALSE)
result_hostnames <- adaR::ada_set_hostname(examples, hostnames, decode = FALSE)
result_hostnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"  

Also it's worth noting that hostnames looks different (is encoded), but the function call above didn't change the hostname at all.

hostnames
[1] "поверкадома53.рф"  "бамбукхутор.рф"    "alicantesueño.com" "normal-url.com"    "поверкадома53.рф" 

My sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Warsaw
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] adaR_0.3.1

loaded via a namespace (and not attached):
[1] compiler_4.3.0    tools_4.3.0       rstudioapi_0.15.0 yaml_2.3.8        Rcpp_1.0.11       triebeard_0.4.1   renv_0.17.3

Support for idn/punycode

A problem in #17 is the decoding of IDN/punycode.
There are pointers in the ada code to support that, but this needs further investigation.

public_suffix() fails with wildcard-only URL

adaR::ada_url_parse("http://kobe.jp")
#>              href protocol username password    host hostname port pathname
#> 1 http://kobe.jp/    http:                   kobe.jp  kobe.jp             /
#>   search hash
#> 1
adaR::public_suffix("http://kobe.jp")
#> [1] "jp.kobe.jp"

Created on 2023-09-26 with reprex v2.0.2

One-row data frame is coerced to a named vector

https://github.com/schochastics/adaR/blob/b2eb3e4662423b53db979541777da0e5847a7b69/R/parse.R#L13

str(ada_url_parse("https://www.google.co.jp/search?q=ドイツ"))
# Named chr [1:10] "https://www.google.co.jp/search?q=ドイツ" "https:" "" "" "www.google.co.jp" "www.google.co.jp" "" "/search" "?q=ドイツ" ...
# - attr(*, "names")= chr [1:10] "href" "protocol" "username" "password" ...

simplify should be FALSE and the result coerced to a data.frame again; or there may be a better way.

ada_url_parse <- function(url, decode = TRUE) {
    url <- utf8::as_utf8(url)
    # url_parsed <- Rcpp_ada_parse(url, nchar(url, type = "bytes"))
    url_parsed <- as.data.frame(do.call("rbind", lapply(url, function(x) Rcpp_ada_parse(x, nchar(x, type = "bytes")))))
    if (isTRUE(decode)) {
        url_parsed <- apply(url_parsed, 2, function(x) utils::URLdecode(x), simplify = FALSE)
        return(as.data.frame(url_parsed))
    }
    return(url_parsed)
}
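
The coercion can be reproduced without adaR. The sketch below uses a made-up one-row data frame and a stand-in decode_col() helper (both assumptions for illustration) to show how apply() collapses the result, and one alternative that keeps the data.frame shape.

```r
# Hypothetical one-row parse result; column names mirror the output above.
one_row <- data.frame(
    href = "https://example.org/a%20b",
    protocol = "https:",
    stringsAsFactors = FALSE
)

# Stand-in for the decoding step, applied to one string at a time.
decode_col <- function(x) {
    vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

# apply() collapses the one-row result into a named character vector.
collapsed <- apply(one_row, 2, decode_col)
print(is.data.frame(collapsed))

# Assigning into df[] keeps the data.frame shape for any number of rows.
kept <- one_row
kept[] <- lapply(one_row, decode_col)
print(kept)
```

Assigning into kept[] avoids the round trip through a matrix that apply() performs, so it is not affected by how apply() simplifies its result.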

Doesn't compile on Ubuntu 18.04

Hi, when compiling on Ubuntu 18.04 with GCC 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) I unfortunately get:

In file included from ada/ada.cpp:3:0,
                 from adaR.h:5,
                 from adaR.cpp:1:
ada/ada.h:5067:10: fatal error: charconv: No such file or directory
 #include <charconv>
          ^~~~~~~~~~
compilation terminated.
make: *** [/usr/lib/R/etc/Makeconf:200: adaR.o] Error 1
ERROR: compilation failed for package ‘adaR’

Release adaR 0.3.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

add ada_get_domain

kindly requested by webtrack team:

ada_get_domain("https://subsub.sub.domain.co.uk")
#> domain.co.uk

This just needs gluing together some existing functions.

Make `URLdecode` optional

Sometimes, I might want to have those %.

ada_url_parse <- function(url, decode = TRUE) {
    url <- stringi::stri_enc_toutf8(url)
    url_parsed <- Rcpp_ada_parse(url, nchar(url, type = "bytes"))
    if (isTRUE(decode)) {
      return(lapply(url_parsed, URLdecode))
    }
    return(url_parsed)
}
ada_url_parse("https://www.google.co.jp/search?q=ドイツ")$search
## [1] "?q=ドイツ"
ada_url_parse("https://www.google.co.jp/search?q=ドイツ", decode = FALSE)$search
## [1] "?q=%E3%83%89%E3%82%A4%E3%83%84"

speedup

Runtime is OK, but given how fast ada-url is by itself, there is room for improvement in a) the R/C++ interface and b) the URL encoding used to fix UTF-8 support (see #1).

support PSL

The main purpose of the package is to wrap ada-url. But there might be other features we could support (which are needed for webtrackR).

One such feature is public suffix extraction via PSL. There is a package for that but it is not on CRAN.

The list is accessible as a textfile here and here.

We may not be able to implement something fast, but a simple lookup-ish thing should be possible

C++17

checking C++ specification
     Not all R platforms support C++17

might haunt us on CRAN

not UTF8-safe

adaR::ada_url_parse("https://www.hk01.com/zone/1/港聞")
#> $href
#> [1] "https://www.hk01.com/zone/1/%E6%B8"
#>
#> $protocol
#> [1] "https:"
#>
#> $username
#> [1] ""
#>
#> $password
#> [1] ""
#>
#> $host
#> [1] "www.hk01.com"
#>
#> $hostname
#> [1] "www.hk01.com"
#>
#> $port
#> [1] ""
#>
#> $pathname
#> [1] "/zone/1/%E6%B8"
#>
#> $search
#> [1] ""
#>
#> $hash
#> [1] ""

Created on 2023-09-22 with reprex v2.0.2

Feature Request: `ada_get_basename`

Probably a fringe use case, but the other day I tried to read the HTML data from the root of a website and thought ada_get_domain would get me there.

adaR::ada_get_domain("https://github.com/schochastics/adaR/issues") |> 
  rvest::read_html()
#> Error: 'github.com' does not exist in current working directory ('/tmp/RtmpWgmD8k/reprex-95ac10e83d89-wax-mouse').

Unfortunately, without the protocol the domain is treated as a local path. It would be fantastic if there was a function to get the base name. This is roughly the behaviour I would expect.

ada_get_basename <- function(x) {
  sub(adaR::ada_get_pathname(x), "", x, fixed = TRUE)
}
ada_get_basename("https://github.com/schochastics/adaR/issues") |> 
  rvest::read_html()
#> {html_document}
#> <html lang="en" data-a11y-animated-images="system" data-a11y-link-underlines="true">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="logged-out env-production page-responsive header-overlay hom ...

Created on 2023-10-05 with reprex v2.0.2
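
A note on the sketch above: removing the pathname with sub() is fragile, because sub() only uses the first pattern when given several, and a pathname of "/" matches the first slash of "https://" rather than the trailing one. A base R alternative that keeps only the scheme and authority might look like this; get_origin() is a hypothetical name and the regular expression is a simplification, not adaR functionality.

```r
# Hypothetical helper, not part of adaR: keep scheme and authority only.
get_origin <- function(x) {
    sub("^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*).*$", "\\1", x)
}

print(get_origin("https://github.com/schochastics/adaR/issues"))

# The pitfall: with pathname "/", the first "/" of "https://" is removed.
broken <- sub("/", "", "https://github.com/", fixed = TRUE)
print(broken)
```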

Thanks for considering!

ada_get_domain() doesn't work if path present

adaR::ada_get_domain("http://sub.google.de/test")
#> [1] NA

This is due to public_suffix() not being able to parse URLs with paths

adaR::public_suffix("http://google.de/test")
#> [1] NA
psl::public_suffix("http://google.de/test")
#> [1] "de/test"

Not sure if this should be NA instead of de. But then again, psl also fails.
For now, we fix ada_get_domain internally and reconsider public_suffix in #54

parsing fails when protocol is missing

adaR::ada_url_parse("bit.ly/32G1ciy")
#>             href protocol username password host hostname port pathname search
#> 1 bit.ly/32G1ciy     <NA>     <NA>     <NA> <NA>     <NA> <NA>     <NA>   <NA>
#>   hash
#> 1 <NA>
urltools::url_parse("bit.ly/32G1ciy")
#>   scheme domain port    path parameter fragment
#> 1   <NA> bit.ly <NA> 32G1ciy      <NA>     <NA>

Created on 2023-09-25 with reprex v2.0.2

not sure if/how to handle this
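
If guessing is acceptable, one possible workaround is to prepend a default scheme before parsing. The sketch below is only an illustration: add_default_scheme() is a hypothetical helper, and choosing "https" as the default is an assumption that may be wrong for a given data set.

```r
# Hypothetical helper, not part of adaR.
add_default_scheme <- function(url, scheme = "https") {
    # Require "://" so that inputs like "localhost:8080" are not misread
    # as already having a scheme.
    has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
    # Leave missing values untouched instead of producing "https://NA".
    ifelse(has_scheme | is.na(url), url, paste0(scheme, "://", url))
}

print(add_default_scheme(c("bit.ly/32G1ciy", "http://example.org", NA)))
```

Returning NA for scheme-less input, as the parser does above, is the stricter reading of the URL standard; the helper trades that strictness for convenience.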

Release adaR 0.2.0

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet
