urltools's Introduction

urltools

A package for elegantly handling and parsing URLs from within R.

Author: Oliver Keyes, Jay Jacobs
License: MIT
Status: Stable

Description

URLs in R are often treated as nothing more than part of data retrieval - they're used for making connections and reading data. With web analytics and research, however, URLs can be the data, and R's default handlers are not well suited to vectorised operations over large datasets. urltools is intended to solve this.

It contains drop-in replacements for R's URLdecode and URLencode functions, along with new functionality such as a URL parser and parameter value extractor. In all cases, the functions are designed to be content-safe (not breaking on unexpected values) and fully vectorised, resulting in a dramatic speed improvement over existing implementations - crucial for large datasets. For more information, see the urltools vignette.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Installation

The latest CRAN version can be obtained via:

install.packages("urltools")

To get the development version:

devtools::install_github("ironholds/urltools")

Dependencies

urltools's People

Contributors

akristiansson, alexcpsec, alexxyjiang, bryant1410, cderv, crowding, emilbode, hrbrmstr, ironholds, jayjacobs, marcolussetti, okeyes-r7, quiri, stephlocke, wrathematics, zeloff

urltools's Issues

Reinstate `suffix_refresh()` returning an object with the updated suffix list

Here is the idea we discussed about updating the suffix list:

  1. Have suffix_refresh() go back to returning a data.frame/data.table object in the same format as the internal object used by suffix_extract().
  2. Make suffix_extract() take an optional argument suffix_list that defaults to the internal object.

That way there is no need to update the internal object, and anyone who needs up-to-date info can fetch it without worrying.
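
A sketch of how the proposed interface might look (the suffix_list argument name comes from point 2 above; the rest is purely illustrative, not the current API):

fresh_suffixes <- suffix_refresh()  # would return a data.frame/data.table of suffixes
suffix_extract("www.google.co.uk", suffix_list = fresh_suffixes)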

`suffix_refresh()` is pointing to the wrong file internally

When I try to call suffix_refresh(), it is trying to get the path of a specific file inside the package directory:

> library(urltools)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] urltools_1.2.1

loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.1
> suffix_refresh()
Error in save(suffix_dataset, file = system.file("data/suffix_dataset.rda",  : 
  'file' must be non-empty string

Enter a frame number, or 0 to exit   

1: suffix_refresh()
2: save(suffix_dataset, file = system.file("data/suffix_dataset.rda", package = "urltools"))

Selection:

However, that file does not exist:

> system.file(package = "urltools")
[1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/urltools"
$ ls -la /Library/Frameworks/R.framework/Versions/3.2/Resources/library/urltools/data
total 96
drwxrwxr-x   5 alexcp  admin    170 Aug 31 20:40 .
drwxrwxr-x  14 alexcp  admin    476 Aug 31 20:40 ..
-rw-rw-r--   1 alexcp  admin  37325 Aug 31 20:40 Rdata.rdb
-rw-rw-r--   1 alexcp  admin     78 Aug 31 20:40 Rdata.rds
-rw-rw-r--   1 alexcp  admin    144 Aug 31 20:40 Rdata.rdx
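
For what it's worth, system.file() returns an empty string when the requested file does not exist, which is exactly what triggers the "'file' must be non-empty string" error. Because the package data is lazy-loaded, the dataset lives inside Rdata.rdb/rdx/rds rather than in a standalone suffix_dataset.rda:

> system.file("data/suffix_dataset.rda", package = "urltools")
[1] ""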

Handling the `!`s in the public_suffix_list dataset

With this on the dataset:

// jp geographic type names
// http://jprs.jp/doc/rule/saisoku-1.html
*.kawasaki.jp
*.kitakyushu.jp
*.kobe.jp
*.nagoya.jp
*.sapporo.jp
*.sendai.jp
*.yokohama.jp
!city.kawasaki.jp
!city.kitakyushu.jp
!city.kobe.jp
!city.nagoya.jp
!city.sapporo.jp
!city.sendai.jp
!city.yokohama.jp

This is incorrect:

> suffix_extract("city.sapporo.jp")
             host subdomain domain          suffix
1 city.sapporo.jp      <NA>   <NA> city.sapporo.jp

Not sure what the "right" result is right now.
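
For reference, the Public Suffix List rules treat a leading ! as an exception to a wildcard rule, so the expected output would presumably be suffix "sapporo.jp" with domain "city" (hypothetical output below, not what the package currently returns):

> suffix_extract("city.sapporo.jp")
             host subdomain domain     suffix
1 city.sapporo.jp      <NA>   city sapporo.jp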

Handling of NAs by url_parse and url_compose

I ran into this behavior today that I thought could be improved:

> str(url_parse(c("http://www.google.com", NA_character_, "")))
'data.frame':   3 obs. of  6 variables:
 $ scheme   : chr  "http" "" ""
 $ domain   : chr  "www.google.com" "na" ""
 $ port     : chr  "" "" ""
 $ path     : chr  "" "" ""
 $ parameter: chr  "" "" ""
 $ fragment : chr  "" "" ""
> url_compose(url_parse(c("http://www.google.com", NA_character_, "")))
[1] "http://www.google.com/" "na/"                    ""     

I would like to suggest that url_compose(url_parse(NA_character_)) be made to return NA_character_ for consistency.

Maybe this could be done by having url_parse return a row with all NAs, which url_compose then checks for.
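
In the meantime, a caller could get the suggested behaviour by masking NA inputs before the round trip (purely an illustrative workaround, not a package feature):

roundtrip <- function(urls) {
  # start with NA everywhere, then fill in the non-NA inputs
  out <- rep(NA_character_, length(urls))
  ok  <- !is.na(urls)
  out[ok] <- url_compose(url_parse(urls[ok]))
  out
}
roundtrip(c("http://www.google.com", NA_character_))
#> [1] "http://www.google.com/" NA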

set new component does not work on NA component

Hi,

I tried to construct a URL from its components using the setter functionality, but it does not work if a part of the URL is empty. See the example:

library(urltools)
example_url <- "http://cran.r-project.org:80"
path(example_url)
#> [1] NA
path(example_url) <- "bin/windows/"
path(example_url)
#> [1] NA
example_url
#> [1] "http://cran.r-project.org:80"

I believe it should have worked. What do you think?
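
One workaround in the meantime (my own assumption, not from the package documentation) is to append the path manually whenever the component comes back as NA:

example_url <- "http://cran.r-project.org:80"
if (is.na(path(example_url))) {
  # the setter has nothing to replace, so build the URL by hand
  example_url <- paste0(example_url, "/bin/windows/")
} else {
  path(example_url) <- "bin/windows/"
}
example_url
#> [1] "http://cran.r-project.org:80/bin/windows/"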

A literal bug!! - Domain extraction woes

It seems the domain extraction has issues if for some reason there is no trailing slash after the domain name.

> library(urltools)
> domain("http://www.nextpedition.com?inav=menu_travel_nextpedition")
[1] "www.nextpedition.com?inav=menu_travel_nextpedition"
> domain("http://www.nextpedition.com/?inav=menu_travel_nextpedition")
[1] "www.nextpedition.com"                         

decode wikipedia url

Hi

I am trying to decode this wikipedia url:

"https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live"

that should result in

"https://en.wikipedia.org/wiki/Games for Windows – Live"

instead I get this:

url_decode("https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live")
[1] "https://en.wikipedia.org/wiki/Games_for_Windows_–_Live"

Please advise
Felipe


sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] urltools_1.3.2 RCurl_1.95-4.7 bitops_1.0-6 httr_1.0.0
[5] XML_3.98-1.3 Rquantview_0.1.3 shinyTree_0.2.2 DT_0.1
[9] readr_0.2.2 shiny_0.12.2 latticeExtra_0.6-26 RColorBrewer_1.1-2
[13] lattice_0.20-33 stringr_1.0.0 dplyr_0.4.3 plyr_1.8.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 magrittr_1.5 ProjectTemplate_0.6 xtable_1.8-0
[5] R6_2.1.1 tools_3.2.2 parallel_3.2.2 grid_3.2.2
[9] DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10 assertthat_0.1
[13] digest_0.6.8 htmlwidgets_0.5 curl_0.9.3 mime_0.4
[17] stringi_1.0-1 httpuv_1.3.3
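
A note on this one: the %E2%80%93 has in fact been decoded correctly into an en dash. url_decode only reverses percent-encoding; the underscores are literal characters, and turning them into spaces is a MediaWiki title convention rather than part of URL decoding. A separate replacement step (just a sketch) gives the expected string:

gsub("_", " ", url_decode("https://en.wikipedia.org/wiki/Games_for_Windows_%E2%80%93_Live"), fixed = TRUE)
#> [1] "https://en.wikipedia.org/wiki/Games for Windows – Live"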

Multi-level suffix handling woes

When suffix_extract runs on a "multi-level suffix" from the list, it behaves inconsistently depending on whether or not there is an extra label in front of the suffix.

> suffix_extract("googleapis.com")
            host subdomain     domain tld
1 googleapis.com      <NA> googleapis com
> suffix_extract("myapi.googleapis.com")
                  host subdomain domain            tld
1 myapi.googleapis.com      <NA>  myapi googleapis.com

URL Encoding of Slashes '/' - bug?

Hey, nice work ...

In your example you use something like this:

url_encode("https://de.wikipedia.org/wiki/Käse")
## [1] "https://de.wikipedia.org%2fwiki%2fK%e4se"

... which encodes the ä but also the slashes, making it basically a non-working URL. Is this known/wanted/unpreventable, or did some bugs creep in? (Without encoding the slashes, the URL works: http://de.wikipedia.org/wiki/K%e4se )
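
One way to work around this (a sketch, under the assumption that only the path segments should be encoded and the reserved "/" delimiters left alone) is to encode each path segment separately and reassemble; the exact escape produced for "ä" may vary by version and encoding:

u <- "https://de.wikipedia.org/wiki/Käse"
parts <- strsplit(u, "/", fixed = TRUE)[[1]]
# keep the scheme and host untouched, encode only the path segments
paste(c(parts[1:3], url_encode(parts[-(1:3)])), collapse = "/")
#> [1] "https://de.wikipedia.org/wiki/K%e4se"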

Not out of the woods yet: new `suffix_extract` bug

I believe the trie configuration needs a bit more tuning. Look at how it is doing this incorrectly still:

> new_suffix <- suffix_refresh()
> suffix_extract("0-ac.els-cdn.com.oasis.unisa.ac.za", new_suffix)
                                host        subdomain domain      suffix
1 0-ac.els-cdn.com.oasis.unisa.ac.za 0-ac.els-cdn.com  oasis unisa.ac.za
> as.data.table(new_suffix)[str_detect(suffixes, "ac.za")]
   suffixes                                                        comments
1:    ac.za // za : http://www.zadna.org.za/content/page/domain-information
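
Given that only "ac.za" appears in the suffix list, the split should presumably look like this instead (expected output, not what the function currently returns):

                                host              subdomain domain suffix
1 0-ac.els-cdn.com.oasis.unisa.ac.za 0-ac.els-cdn.com.oasis  unisa  ac.za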

punycode handling

For:

urltools::suffix_extract("xn--80aagrchk9a2a2a.xn--p1ai")

this:

                          host subdomain domain  tld
1 xn--80aagrchk9a2a2a.xn--p1ai      <NA>   <NA> <NA>

should not be the outcome.

might want to handle punycode encoding & decoding at the same time (perhaps via)

tld_refresh is ignoring punycode TLDs

There is a side effect from this line here

raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "(#|--)")])

It is removing not only lines with #, but also ones with --, including the XN--* lines that represent the punycode entries.

> ss
[1] "xn--80aa8argd0e.xn--80aswg"      "xn--80aadccbto5aw2bf4m.xn--p1ai" "xn--80aatgnxx.xn--80asehdb"     
> tld_extract(ss)
                           domain  tld
1      xn--80aa8argd0e.xn--80aswg <NA>
2 xn--80aadccbto5aw2bf4m.xn--p1ai <NA>
3      xn--80aatgnxx.xn--80asehdb <NA>
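
A possible fix (a sketch, not the package's actual code) is to anchor the comment filter so that only lines beginning with "#" are dropped, leaving the xn--* entries intact:

raw_tlds <- tolower(raw_tlds[!grepl(x = raw_tlds, pattern = "^#")])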

Add unit tests

The unit tests are from an earlier, more civilized time. Make them more extensive.
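
A minimal example of the kind of test that could be added (testthat syntax; the expected values follow the parsing behaviour shown elsewhere in these issues):

library(testthat)
library(urltools)

test_that("url_parse splits a simple URL", {
  parsed <- url_parse("http://www.google.com/")
  expect_equal(parsed$scheme, "http")
  expect_equal(parsed$domain, "www.google.com")
})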

Fix OOR character bug

It's an improvement on URLdecode in the sense that it doesn't explicitly BREAK, but it does lose characters. My suspicion is that this is where the unterminated string error BDR complained about was coming from. Stupid C.

Should parameters not present in the URL be passed back as NA?

It seems that parameters not present in the URL and parameters set equal to nothing are both passed back as "":

myurl = "http://fios.verizon.com/hop.html?CMP=111&s_ace=&gclid=111"
out = param_get(myurl, c('s_ace', 's_stlnk'))
t(out)
# >         1 
# > s_stlnk ""
# > s_ace   ""

This is troubling for my use case, but more broadly I think it makes sense for a query for a URL parameter that is not there to return something different from one that is there but is set to empty.
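
Under the suggested behaviour, the output would presumably distinguish the two cases, something like (hypothetical output, not what the package currently returns):

t(out)
# >         1
# > s_stlnk NA
# > s_ace   ""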

url_encode only certain characters.

Synopsis

I want to be able to efficiently encode a specified set of characters:

  • eg. only encode # and :
uniform_name <- 'example=idenfitier#with:odd_format.gff3,gz'

urltools::url_encode(uniform_name, only = c(':', '#'))
#> [1] "example=idenfitier%23with%3Aodd_format.gff3,gz"


# OR equivalently...
urltools::url_encode_chars(uniform_name, c(':', '#'))

Current Use Case

Consider the following unique identifier for a file:

uniform_name <- "H1:Cell-Line=S2-DRSC#Developmental-Stage=Late-Embryonic-stage#Tissue=Embryo-derived-cell-line:ChIP-chip:Rep-1:input:Dmel_r5.32:modENCODE_3300:12_A8_Input.62.CEL.gz"

uniform_name is the name of a file at:

ftp_dir <- "ftp://data.modencode.org/D.melanogaster/Histone-Modification/ChIP-chip/raw-arrayfile_CEL/"

With the data above I should be able to download the file with curl::curl_download() or equivalent. BUT, I can't, because the metadata I have access to (uniform_name, ...) actually represents a user-facing link. To actually download the file I need to transform uniform_name to:

transformed_uniform_name <- "H1%3ACell-Line=S2-DRSC%23Developmental-Stage=Late-Embryonic-stage%23Tissue=Embryo-derived-cell-line%3AChIP-chip%3ARep-1%3Ainput%3ADmel_r5.32%3AmodENCODE_3300%3A12_A8_Input.62.CEL.gz"

Yup, the download URL is only "partially encoded" (only : and # are encoded). So to make transformed_uniform_name I currently do something like:

stringi::stri_replace_all_fixed(uniform_name, ':', urltools::url_encode(':'))
# and again with '#' ...

This is unsatisfying to say the least. It would be great if I could do something like:

urltools::url_encode(uniform_name, only = c(':', '#'))
# OR...
urltools::url_encode_chars(uniform_name, c(':', '#'))

This way collaborators and future me don't have to spend time figuring out why I wasn't just encoding the url properly.
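
A minimal sketch of how the requested helper might behave (url_encode_chars is the name proposed above, not an existing urltools function; note that urltools::url_encode currently emits lower-case hex escapes):

url_encode_chars <- function(x, chars) {
  # replace each requested character with its percent-encoded form,
  # leaving everything else untouched
  for (ch in chars) {
    x <- gsub(ch, urltools::url_encode(ch), x, fixed = TRUE)
  }
  x
}
url_encode_chars(uniform_name, c(':', '#'))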

lower case in url_encode

Got a problem when using url_encode after url_decode.
urltools::url_encode(urltools::url_decode("%2B")) == "%2B" # FALSE

Really we should have some parameter to control the case of the result, something like:
urltools::url_encode(urltools::url_decode("%2B"), upper = TRUE) == "%2B" # TRUE

With a default value of FALSE, to preserve the current behaviour.
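
In the meantime, since the round trip only differs in the case of the hex digits ("%2b" vs "%2B"), a workaround (just a sketch, and only safe for strings where upper-casing is harmless) is to compare case-insensitively or upper-case the result:

toupper(urltools::url_encode(urltools::url_decode("%2B"))) == "%2B" # TRUE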

[FR] url_compose()

Apologies if there's already an awesome way to do this.

Compose a URL from its parsed data.frame.

Round-trip!

"https://en.wikipedia.org/wiki/Article" %>%
    url_parse() %>%
    url_compose()
#> [1] "https://en.wikipedia.org/wiki/Article"

Issue parsing parameters

Hey Oliver, really nice work on this package - the url_parameters function and the output to data frames are awesome additions.

Also, I just can't believe how quick the thing is - it will turn my daily url parsing job from literally hours (using httr and loops) to a few seconds work!!!

One issue I have come up against with the url_parameters function: one of my parameters is 'to' (as in a dated search). The parser seems to look for the first instance of 'to' anywhere in the whole URL rather than restricting the search to the query string. That means that in the URL:

www.housetrip.es/es/buscar-apartamentos-de-vacaciones/comporta/geo?from=01/04/2015&guests=4&to=05/04/2015

I get -de-vacaciones/comporta/geo?from=01/04/2015 instead of 05/04/2015.

On this basis, I could probably also see a situation where, even if the search were restricted to just the query, the parser could potentially be greedy about the 'to' and find it somewhere else.

Anything I can do to help just let me know - I've always got loads of URLs to parse although sadly your C++ is all greek to me at this point.

Thanks
Jacob
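
As a stop-gap while the parser is being fixed, one can anchor the parameter name to the "?" or "&" delimiter so that it cannot match inside the path (a plain-regex sketch, not the package API):

u <- "www.housetrip.es/es/buscar-apartamentos-de-vacaciones/comporta/geo?from=01/04/2015&guests=4&to=05/04/2015"
sub(".*[?&]to=([^&#]*).*", "\\1", u)
#> [1] "05/04/2015"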

url_parse fails to separate port without trailing slash

> url_parse("http://localhost:8000")
  scheme         domain port path parameter fragment
1   http localhost:8000                             
> url_parse("http://localhost:8000/")
  scheme    domain port path parameter fragment
1   http localhost 8000                        

I'm in no way a URL nerd, but this looks to be in violation of the RFC (RFC 3986):

 authority   = [ userinfo "@" ] host [ ":" port ]

URI producers and normalizers should omit the ":" delimiter that
separates host from port if the port component is empty. Some
schemes do not allow the userinfo and/or port subcomponents.

If a URI contains an authority component, then the path component
must either be empty or begin with a slash ("/") character. Non-
validating parsers (those that merely separate a URI reference into
its major components) will often ignore the subcomponent structure of
authority, treating it as an opaque string from the double-slash to
the first terminating delimiter, until such time as the URI is
dereferenced.

(also why are they requests for comments if they're meant to be authoritative I don't understand).
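
Until this is fixed, a possible stop-gap (my own assumption, not an official recommendation) is to append a trailing "/" when the URL has no path, which yields the second parse above:

u <- "http://localhost:8000"
if (!grepl("://[^/]+/", u)) u <- paste0(u, "/")
url_parse(u)
#>   scheme    domain port path parameter fragment
#> 1   http localhost 8000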

Parse URLs into a data.frame instead of a list

  1. It makes it convenient to pass the "parameters" field to url_parameters
  2. data.frames are just easier for subsetting on a per-column basis.

Use a combination of reference pointers and splitting individual chunks of the URL parser out into their own (private) methods to make this readable and something less than an utter shitshow.
