urltools's Introduction

urltools

A package for elegantly handling and parsing URLs from within R.

Author: Oliver Keyes, Jay Jacobs
License: MIT
Status: Stable

Description

URLs in R are often treated as nothing more than part of data retrieval - they're used for making connections and reading data. With web analytics and research, however, URLs can be the data, and R's default handlers are not well suited to vectorised operations over large datasets. urltools is intended to solve this.

It contains drop-in replacements for R's URLdecode and URLencode functions, along with new functionality such as a URL parser and parameter value extractor. In all cases, the functions are designed to be content-safe (not breaking on unexpected values) and fully vectorised, resulting in a dramatic speed improvement over existing implementations - crucial for large datasets. For more information, see the urltools vignette.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Installation

The latest CRAN version can be obtained via:

install.packages("urltools")

To get the development version:

devtools::install_github("ironholds/urltools")

Dependencies

urltools's People

Contributors

akristiansson, alexcpsec, alexxyjiang, bryant1410, cderv, crowding, emilbode, hrbrmstr, ironholds, jayjacobs, marcolussetti, okeyes-r7, quiri, stephlocke, wrathematics, zeloff

urltools's Issues

Reinstate `suffix_refresh()` returning an object with the updated suffix list

Here is the idea we discussed about updating the suffix list:

  1. Have suffix_refresh() go back to returning a data.frame/data.table object in the same format as the internal object used by suffix_extract().
  2. Make suffix_extract() take an optional argument suffix_list that defaults to the internal object.

That way there is no need to update the internal object, and anyone who needs up-to-date info can fetch it without worrying.
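
A sketch of how the proposed interface might look (the suffix_list argument name comes from point 2 above; the rest is purely illustrative, not the current API):

fresh_suffixes <- suffix_refresh()  # would return a data.frame/data.table of suffixes
suffix_extract("www.google.co.uk", suffix_list = fresh_suffixes)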

`suffix_refresh()` is pointing to the wrong file internally

When I try to call suffix_refresh(), it is trying to get the path of a specific file inside the package directory:

> library(urltools)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] urltools_1.2.1

loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.1
> suffix_refresh()
Error in save(suffix_dataset, file = system.file("data/suffix_dataset.rda",  : 
  'file' must be non-empty string

Enter a frame number, or 0 to exit   

1: suffix_refresh()
2: save(suffix_dataset, file = system.file("data/suffix_dataset.rda", package = "urltools"))

Selection:

However, that file does not exist:

> system.file(package = "urltools")
[1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/urltools"
$ ls -la /Library/Frameworks/R.framework/Versions/3.2/Resources/library/urltools/data
total 96
drwxrwxr-x   5 alexcp  admin    170 Aug 31 20:40 .
drwxrwxr-x  14 alexcp  admin    476 Aug 31 20:40 ..
-rw-rw-r--   1 alexcp  admin  37325 Aug 31 20:40 Rdata.rdb
-rw-rw-r--   1 alexcp  admin     78 Aug 31 20:40 Rdata.rds
-rw-rw-r--   1 alexcp  admin    144 Aug 31 20:40 Rdata.rdx
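
For what it's worth, system.file() returns an empty string when the requested file does not exist, which is exactly what triggers the "'file' must be non-empty string" error. Because the package data is lazy-loaded, the dataset lives inside Rdata.rdb/rdx/rds rather than in a standalone suffix_dataset.rda:

> system.file("data/suffix_dataset.rda", package = "urltools")
[1] ""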

Handling the `!`s in the public_suffix_list dataset

With this on the dataset:

// jp geographic type names
// http://jprs.jp/doc/rule/saisoku-1.html
*.kawasaki.jp
*.kitakyushu.jp
*.kobe.jp
*.nagoya.jp
*.sapporo.jp
*.sendai.jp
*.yokohama.jp
!city.kawasaki.jp
!city.kitakyushu.jp
!city.kobe.jp
!city.nagoya.jp
!city.sapporo.jp
!city.sendai.jp
!city.yokohama.jp

This is incorrect:

> suffix_extract("city.sapporo.jp")
             host subdomain domain          suffix
1 city.sapporo.jp      <NA>   <NA> city.sapporo.jp

Not sure what the "right" result is right now.
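
For reference, the Public Suffix List rules treat a leading ! as an exception to a wildcard rule, so the expected output would presumably be suffix "sapporo.jp" with domain "city" (hypothetical output below, not what the package currently returns):

> suffix_extract("city.sapporo.jp")
             host subdomain domain     suffix
1 city.sapporo.jp      <NA>   city sapporo.jp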

Handling of NAs by url_parse and url_compose

I ran into this behavior today that I thought could be improved:

> str(url_parse(c("http://www.google.com", NA_character_, "")))
'data.frame':   3 obs. of  6 variables:
 $ scheme   : chr  "http" "" ""
 $ domain   : chr  "www.google.com" "na" ""
 $ port     : chr  "" "" ""
 $ path     : chr  "" "" ""
 $ parameter: chr  "" "" ""
 $ fragment : chr  "" "" ""
> url_compose(url_parse(c("http://www.google.com", NA_character_, "")))
[1] "http://www.google.com/" "na/"                    ""     

I would like to suggest that url_compose(url_parse(NA_character_)) be made to return NA_character_ for consistency.

Maybe this could be done by having url_parse return a row with all NAs, which url_compose then checks for.
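
In the meantime, a caller could get the suggested behaviour by masking NA inputs before the round trip (purely an illustrative workaround, not a package feature):

roundtrip <- function(urls) {
  # start with NA everywhere, then fill in the non-NA inputs
  out <- rep(NA_character_, length(urls))
  ok  <- !is.na(urls)
  out[ok] <- url_compose(url_parse(urls[ok]))
  out
}
roundtrip(c("http://www.google.com", NA_character_))
#> [1] "http://www.google.com/" NA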

set new component does not work on NA component

Hi,

I tried to construct a URL from its components using the setter functionality, but it does not work if a part of the URL is empty. See the example:

library(urltools)
example_url <- "http://cran.r-project.org:80"
path(example_url)
#> [1] NA
path(example_url) <- "bin/windows/"
path(example_url)
#> [1] NA
example_url
#> [1] "http://cran.r-project.org:80"

I believe it should have worked. What do you think?
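
One workaround in the meantime (my own assumption, not from the package documentation) is to append the path manually whenever the component comes back as NA:

example_url <- "http://cran.r-project.org:80"
if (is.na(path(example_url))) {
  # the setter has nothing to replace, so build the URL by hand
  example_url <- paste0(example_url, "/bin/windows/")
} else {
  path(example_url) <- "bin/windows/"
}
example_url
#> [1] "http://cran.r-project.org:80/bin/windows/"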

A literal bug!! - Domain extraction woes

It seems the domain extraction has issues if for some reason there is no trailing slash after the domain name.

> library(urltools)
> domain("http://www.nextpedition.com?inav=menu_travel_nextpedition")
[1] "www.nextpedition.com?inav=menu_travel_nextpedition"
> domain("http://www.nextpedition.com/?inav=menu_travel_nextpedition")
[1] "www.nextpedition.com"                         

decode wikipedia url

Hi

I am trying to decode this wikipedia url:

"https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live"

that should result in

"https://en.wikipedia.org/wiki/Games for Windows – Live"

instead I get this:

url_decode("https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live")
[1] "https://en.wikipedia.org/wiki/Games_for_Windows_–_Live"

Please advise
Felipe


sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] urltools_1.3.2 RCurl_1.95-4.7 bitops_1.0-6 httr_1.0.0
[5] XML_3.98-1.3 Rquantview_0.1.3 shinyTree_0.2.2 DT_0.1
[9] readr_0.2.2 shiny_0.12.2 latticeExtra_0.6-26 RColorBrewer_1.1-2
[13] lattice_0.20-33 stringr_1.0.0 dplyr_0.4.3 plyr_1.8.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 magrittr_1.5 ProjectTemplate_0.6 xtable_1.8-0
[5] R6_2.1.1 tools_3.2.2 parallel_3.2.2 grid_3.2.2
[9] DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10 assertthat_0.1
[13] digest_0.6.8 htmlwidgets_0.5 curl_0.9.3 mime_0.4
[17] stringi_1.0-1 httpuv_1.3.3
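
A note on this one: the %E2%80%93 has in fact been decoded correctly into an en dash. url_decode only reverses percent-encoding; the underscores are literal characters, and turning them into spaces is a MediaWiki title convention rather than part of URL decoding. A separate replacement step (just a sketch) gives the expected string:

gsub("_", " ", url_decode("https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live"), fixed = TRUE)
#> [1] "https://en.wikipedia.org/wiki/Games for Windows – Live"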

Multi-level suffix handling woes

When suffix_extract runs on a "multi-level suffix" from the list, it behaves inconsistently depending on whether or not there is an extra label in front of the suffix.

> suffix_extract("googleapis.com")
            host subdomain     domain tld
1 googleapis.com      <NA> googleapis com
> suffix_extract("myapi.googleapis.com")
                  host subdomain domain            tld
1 myapi.googleapis.com      <NA>  myapi googleapis.com

URL Encoding of Slashes '/' - bug?

Hey, nice work ...

In your example you use something like this:

url_encode("https://de.wikipedia.org/wiki/Käse")
## [1] "https://de.wikipedia.org%2fwiki%2fK%e4se"

... which encodes the ä but also the slashes, making it basically a non-working URL. Is this known/wanted/unpreventable, or did some bugs creep in? (Without encoding the slashes, the URL works: http://de.wikipedia.org/wiki/K%e4se )
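
One way to work around this (a sketch, under the assumption that only the path segments should be encoded and the reserved "/" delimiters left alone) is to encode each path segment separately and reassemble; the exact escape produced for "ä" may vary by version and encoding:

u <- "https://de.wikipedia.org/wiki/Käse"
parts <- strsplit(u, "/", fixed = TRUE)[[1]]
# keep the scheme and host untouched, encode only the path segments
paste(c(parts[1:3], url_encode(parts[-(1:3)])), collapse = "/")
#> [1] "https://de.wikipedia.org/wiki/K%e4se"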

Not out of the woods yet: new `suffix_extract` bug

I believe the trie configuration needs a bit more tuning. Look at how it is doing this incorrectly still:

> new_suffix <- suffix_refresh()
> suffix_extract("0-ac.els-cdn.com.oasis.unisa.ac.za", new_suffix)
                                host        subdomain domain      suffix
1 0-ac.els-cdn.com.oasis.unisa.ac.za 0-ac.els-cdn.com  oasis unisa.ac.za
> as.data.table(new_suffix)[str_detect(suffixes, "ac.za")]
   suffixes                                                        comments
1:    ac.za // za : http://www.zadna.org.za/content/page/domain-information
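
Given that only "ac.za" appears in the suffix list, the split should presumably look like this instead (expected output, not what the function currently returns):

                                host              subdomain domain suffix
1 0-ac.els-cdn.com.oasis.unisa.ac.za 0-ac.els-cdn.com.oasis  unisa  ac.za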

punycode handling

For:

urltools::suffix_extract("xn--80aagrchk9a2a2a.xn--p1ai")

this:

                          host subdomain domain  tld
1 xn--80aagrchk9a2a2a.xn--p1ai      <NA>   <NA> <NA>

should not be the outcome.

might want to handle punycode encoding & decoding at the same time (perhaps via)

tld_refresh is ignoring punycode TLDs

There is a side effect from this line here

raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "(#|--)")])

It is removing not only lines with #, but also ones with --, including the XN--* lines that represent the punycode entries.

> ss
[1] "xn--80aa8argd0e.xn--80aswg"      "xn--80aadccbto5aw2bf4m.xn--p1ai" "xn--80aatgnxx.xn--80asehdb"     
> tld_extract(ss)
                           domain  tld
1      xn--80aa8argd0e.xn--80aswg <NA>
2 xn--80aadccbto5aw2bf4m.xn--p1ai <NA>
3      xn--80aatgnxx.xn--80asehdb <NA>
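
A possible fix (a sketch, not the package's actual code) is to anchor the comment filter so that only lines beginning with "#" are dropped, leaving the xn--* entries intact:

raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "^#")])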

Add unit tests

The unit tests are from an earlier, more civilized time. Make them more extensive.
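
A minimal example of the kind of test that could be added (testthat syntax; the expected values follow the parsing behaviour shown elsewhere in these issues):

library(testthat)
library(urltools)

test_that("url_parse splits a simple URL", {
  parsed <- url_parse("http://www.google.com/")
  expect_equal(parsed$scheme, "http")
  expect_equal(parsed$domain, "www.google.com")
})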

Fix OOR character bug

It's an improvement on URLdecode in the sense that it doesn't explicitly BREAK, but it does lose characters. My suspicion is that this is where the unterminated string error BDR complained about was coming from. Stupid C.

Should parameters not present in the URL be passed back as NA?

It seems that parameters not present in the URL and parameters set equal to nothing are both passed back as "":

myurl = "http://fios.verizon.com/hop.html?CMP=111&s_ace=&gclid=111"
out = param_get(myurl, c('s_ace', 's_stlnk'))
t(out)
# >         1 
# > s_stlnk ""
# > s_ace   ""

This is troubling for my use case, but more broadly I think it makes sense for a query for a URL parameter that is not there to return something different from one that is there but is set to empty.
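
Under the suggested behaviour, the output would presumably distinguish the two cases, something like (hypothetical output, not what the package currently returns):

t(out)
# >         1
# > s_stlnk NA
# > s_ace   ""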

url_encode only certain characters.

Synopsis

I want to be able to efficiently encode a specified set of characters:

  • eg. only encode # and :
uniform_name <- 'example=idenfitier#with:odd_format.gff3,gz'

urltools::url_encode(uniform_name, only = c(':', '#'))
#> [1] "example=idenfitier%23with%3Aodd_format.gff3,gz"


# OR equivalently...
urltools::url_encode_chars(uniform_name, c(':', '#'))

Current Use Case

Consider the following unique identifier for a file:

uniform_name <- "H1:Cell-Line=S2-DRSC#Developmental-Stage=Late-Embryonic-stage#Tissue=Embryo-derived-cell-line:ChIP-chip:Rep-1:input:Dmel_r5.32:modENCODE_3300:12_A8_Input.62.CEL.gz"

uniform_name is the name of a file at:

ftp_dir <- "ftp://data.modencode.org/D.melanogaster/Histone-Modification/ChIP-chip/raw-arrayfile_CEL/"

With the data above I should be able to download the file with curl::curl_download() or equivalent. BUT, I can't, because the metadata I have access to (uniform_name, ...) actually represents a user-facing link. To actually download the file I need to transform uniform_name to:

transformed_uniform_name <- "H1%3ACell-Line=S2-DRSC%23Developmental-Stage=Late-Embryonic-stage%23Tissue=Embryo-derived-cell-line%3AChIP-chip%3ARep-1%3Ainput%3ADmel_r5.32%3AmodENCODE_3300%3A12_A8_Input.62.CEL.gz"

Yup, the download URL is only "partially encoded" (only : and # are encoded). So to make transformed_uniform_name I currently do something like:

stringi::stri_replace_all_fixed(uniform_name, ':', urltools::url_encode(':'))
# and again with '#' ...

This is unsatisfying to say the least. It would be great if I could do something like:

urltools::url_encode(uniform_name, only = c(':', '#'))
# OR...
urltools::url_encode_chars(uniform_name, c(':', '#'))

This way collaborators and future me don't have to spend time figuring out why I wasn't just encoding the url properly.
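
A minimal sketch of how the requested helper might behave (url_encode_chars is the name proposed above, not an existing urltools function; note that urltools::url_encode currently emits lower-case hex escapes):

url_encode_chars <- function(x, chars) {
  # replace each requested character with its percent-encoded form,
  # leaving everything else untouched
  for (ch in chars) {
    x <- gsub(ch, urltools::url_encode(ch), x, fixed = TRUE)
  }
  x
}
url_encode_chars(uniform_name, c(':', '#'))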

lower case in url_encode

Got a problem when using url_encode after url_decode.
urltools::url_encode(urltools::url_decode("%2B")) == "%2B" # FALSE

Really we should have some parameter to control the case of the result, something like:
urltools::url_encode(urltools::url_decode("%2B"), upper = TRUE) == "%2B" # TRUE

With a default value of FALSE, to preserve the current behaviour.
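
In the meantime, since the round trip only differs in the case of the hex digits ("%2b" vs "%2B"), a workaround (just a sketch, and only safe for strings where upper-casing is harmless) is to compare case-insensitively or upper-case the result:

toupper(urltools::url_encode(urltools::url_decode("%2B"))) == "%2B" # TRUE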

[FR] url_compose()

Apologies if there's already an awesome way to do this.

Compose a URL from its parsed data.frame.

Round-trip!

"https://en.wikipedia.org/wiki/Article" %>%
    url_parse() %>%
    url_compose()
#> [1] "https://en.wikipedia.org/wiki/Article"

Issue parsing parameters

Hey Oliver, really nice work on this package - the url_parameters function and the output to data frames are awesome additions.

Also, I just can't believe how quick the thing is - it will turn my daily url parsing job from literally hours (using httr and loops) to a few seconds work!!!

One issue I have come up against with the url_parameters function: one of my parameters is 'to' (as in a dated search). The parser seems to look for the first instance of 'to' anywhere in the whole URL rather than restricting the search to the query string. That means that in the URL:

www.housetrip.es/es/buscar-apartamentos-de-vacaciones/comporta/geo?from=01/04/2015&guests=4&to=05/04/2015

I get -de-vacaciones/comporta/geo?from=01/04/2015 instead of 05/04/2015.

On this basis, I could probably also see a situation where, even if the search were restricted to just the query, the parser could potentially be greedy about the 'to' and find it somewhere else.

Anything I can do to help just let me know - I've always got loads of URLs to parse although sadly your C++ is all greek to me at this point.

Thanks
Jacob
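
As a stop-gap while the parser is being fixed, one can anchor the parameter name to the "?" or "&" delimiter so that it cannot match inside the path (a plain-regex sketch, not the package API):

u <- "www.housetrip.es/es/buscar-apartamentos-de-vacaciones/comporta/geo?from=01/04/2015&guests=4&to=05/04/2015"
sub(".*[?&]to=([^&#]*).*", "\\1", u)
#> [1] "05/04/2015"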

url_parse fails to separate port without trailing slash

> url_parse("http://localhost:8000")
  scheme         domain port path parameter fragment
1   http localhost:8000                             
> url_parse("http://localhost:8000/")
  scheme    domain port path parameter fragment
1   http localhost 8000                        

I'm in no way a URL nerd, but this looks to be in violation of the RFC (RFC 3986):

 authority   = [ userinfo "@" ] host [ ":" port ]

URI producers and normalizers should omit the ":" delimiter that
separates host from port if the port component is empty. Some
schemes do not allow the userinfo and/or port subcomponents.

If a URI contains an authority component, then the path component
must either be empty or begin with a slash ("/") character. Non-
validating parsers (those that merely separate a URI reference into
its major components) will often ignore the subcomponent structure of
authority, treating it as an opaque string from the double-slash to
the first terminating delimiter, until such time as the URI is
dereferenced.

(also why are they requests for comments if they're meant to be authoritative I don't understand).
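
Until this is fixed, a possible stop-gap (my own assumption, not an official recommendation) is to append a trailing "/" when the URL has no path, which yields the second parse above:

u <- "http://localhost:8000"
if (!grepl("://[^/]+/", u)) u <- paste0(u, "/")
url_parse(u)
#>   scheme    domain port path parameter fragment
#> 1   http localhost 8000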

Parse URLs into a data.frame instead of a list

  1. It makes it convenient to pass the "parameters" field to url_parameters
  2. data.frames are just easier for subsetting on a per-column basis.

Use a combination of reference pointers and splitting individual chunks of the URL parser out into their own (private) methods to make this readable and something less than an utter shitshow.
