rdomains's People

Contributors

quantifiedcode-bot, soodoku

rdomains's Issues

virustotal_cat() returns different structures depending upon results from virustotal.com

Calls to virustotal_cat() return different structures depending upon what information is available for the domain on virustotal.com. The difference in structure makes programming more difficult when combining information from multiple domains. A better approach would be to return the same structure with NA for missing information.

> virustotal_cat("www.google.com",apikey = virusTotalApiKey);
          domain   bitdefender dr_web  alexa        google       websense             trendmicro
1 www.google.com searchengines  chats google searchengines advertisements search engines portals
> virustotal_cat("www.social-buttons.com",apikey = virusTotalApiKey);
                  domain               bitdefender                    dr_web
1 www.social-buttons.com newly registered websites newly registered websites
Warning message:
In names(d_res)[names(d_res) %in% cat_names] <- c("bitdefender", :
  number of items to replace is not a multiple of replacement length
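
The fixed-structure idea can be sketched as follows. This is only an illustration, not rdomains code: `pad_result()` and `vt_columns` are hypothetical names, and the column set is taken from the first output above.

```r
# Hypothetical helper (not part of rdomains): pad a result to a fixed set of
# columns, filling anything the service did not return with NA.
vt_columns <- c("domain", "bitdefender", "dr_web", "alexa",
                "google", "websense", "trendmicro")

pad_result <- function(res, columns = vt_columns) {
  for (col in setdiff(columns, names(res))) {
    res[[col]] <- rep(NA_character_, nrow(res))
  }
  # Reorder so every result has the same column layout.
  res[, columns, drop = FALSE]
}

# A partial result shaped like the second call above.
partial <- data.frame(domain = "www.social-buttons.com",
                      bitdefender = "newly registered websites",
                      dr_web = "newly registered websites",
                      stringsAsFactors = FALSE)
full <- pad_result(partial)
print(names(full))
```

With every result padded this way, rows from different domains can be stacked with rbind() without any merge logic.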

Problem trusted_cat() on Windows for non-admin users

I think there is a permissions problem with trusted_cat() for non-admin users on Windows machines. I can start a Selenium server with startServer(log=FALSE), but trusted_cat() offers no way to pass a log=FALSE argument. Judging by the error messages from startServer(), you may want to add an option for specifying a path to a Selenium server for non-admin users.

> startServer()
Error in file(file, ifelse(append, "a", "w")) :
  cannot open the connection
In addition: Warning messages:
1: startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
2: In file(file, ifelse(append, "a", "w")) :
  cannot open file 'C:/Program Files/R/R-3.3.1/library/RSelenium/bin/sellog.txt': Permission denied
> ?startServer
> startServer(log=FALSE)
$stop
function ()
{
    tools::pskill(selPID)
}

$getPID
function ()
{
    return(selPID)
}

Warning message:
startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")

> trusted_cat("http://www.crazyguyonabike.com")
Error in file(file, ifelse(append, "a", "w")) :
  cannot open the connection
In addition: Warning messages:
1: checkForServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
2: startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
3: In file(file, ifelse(append, "a", "w")) :
  cannot open file 'C:/Program Files/R/R-3.3.1/library/RSelenium/bin/sellog.txt': Permission denied
> trusted_cat("http://www.crazyguyonabike.com",log=FALSE)
Error in trusted_cat("http://www.crazyguyonabike.com", log = FALSE) :
  unused argument (log = FALSE)
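
The failure above comes from trying to write sellog.txt under C:/Program Files. One possible way around it, sketched here under assumptions, is to test whether the preferred log location is writable and fall back to the session temp directory when it is not. `writable_log_path()` is a hypothetical helper; neither rdomains nor RSelenium is known to expose such an option.

```r
# Hypothetical helper: return `preferred` if a file can be opened for appending
# there (this creates the file as a side effect), otherwise return a path with
# the same file name inside the per-session temp directory.
writable_log_path <- function(preferred, fallback_dir = tempdir()) {
  ok <- tryCatch({
    con <- suppressWarnings(file(preferred, open = "a"))
    close(con)
    TRUE
  }, error = function(e) FALSE)
  if (ok) preferred else file.path(fallback_dir, basename(preferred))
}

# A directory that does not exist stands in for a location the user cannot
# write to.
blocked <- file.path(tempdir(), "no-such-dir", "sellog.txt")
log_file <- writable_log_path(blocked)
print(log_file)
```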

In virustotal_cat() invalid domain causes call to throw an error on dataframe

When you pass an invalid domain, virustotal_cat() throws an error with no information about the problem or the data that caused the problem. The error causes an exit from any current function. The behavior should probably be to issue a warning or message that the domain was not found and return a normal structure with NA for the various entries.

> virustotal_cat("53275950.videos-for-your-business.com",apikey=virusTotalApiKey)
Error in data.frame(domain, d_res) :
  arguments imply differing number of rows: 1, 0
> virustotal_cat("videos-for-your-business.com",apikey=virusTotalApiKey)
                        domain   bitdefender        dr_web
1 videos-for-your-business.com uncategorized uncategorized
Warning message:
In names(d_res)[names(d_res) %in% cat_names] <- c("bitdefender", :
  number of items to replace is not a multiple of replacement length
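
The suggested behavior (warn, then return a normal row of NA values) could look roughly like the sketch below. `safe_cat()` is a hypothetical wrapper and `fetch` is a stand-in for a call such as `function(d) virustotal_cat(d, apikey = virusTotalApiKey)`; the demo uses a fake fetch function so that it runs without an API key.

```r
# Hypothetical wrapper: turn an error from the lookup into a warning plus a
# one-row data frame of NA values, so the caller's loop keeps running.
safe_cat <- function(domain, fetch, columns = c("bitdefender", "dr_web")) {
  tryCatch(
    fetch(domain),
    error = function(e) {
      warning("No categories for '", domain, "': ", conditionMessage(e),
              call. = FALSE)
      na_row <- as.list(rep(NA_character_, length(columns)))
      names(na_row) <- columns
      data.frame(domain = domain, na_row, stringsAsFactors = FALSE)
    }
  )
}

# Fake lookup that always fails, standing in for an invalid domain.
failing_fetch <- function(d) stop("domain not found")
res <- suppressWarnings(safe_cat("53275950.videos-for-your-business.com",
                                 failing_fetch))
print(res)
```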

No results from dmoz_cat() call for domain where dmoz.com data exists.

All or at least one of the queries below should return a valid category for the www.crazyguyonabike.com domain. I've used the get_dmoz_data() command and the domain is present in the downloaded file, as well as the dmoz.com web site.

> dmoz_cat(domains="http://www.crazyguyonabike.com")
          domain_name dmoz_category
1 crazyguyonabike.com
> dmoz_cat(domains="https://www.crazyguyonabike.com")
                      domain_name dmoz_category
1 https://www.crazyguyonabike.com
> dmoz_cat(domains="www.crazyguyonabike.com")
          domain_name dmoz_category
1 crazyguyonabike.com
> dmoz_cat(domains="crazyguyonabike.com")
          domain_name dmoz_category
1 crazyguyonabike.com
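
The output above suggests that the http:// prefix is stripped while https:// is not. A minimal sketch of one way to make all four spellings equivalent before lookup is shown below. `normalize_domain()` is a hypothetical helper, and whether the dmoz data stores names with or without "www." is not something this sketch knows.

```r
# Hypothetical helper: reduce a URL or host name to a bare lower-case domain.
normalize_domain <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^[a-z][a-z0-9+.-]*://", "", x)  # drop any scheme (http, https, ...)
  x <- sub("[/?#].*$", "", x)               # drop path, query and fragment
  x <- sub("^www\\.", "", x)                # drop a leading "www."
  x
}

inputs <- c("http://www.crazyguyonabike.com",
            "https://www.crazyguyonabike.com",
            "www.crazyguyonabike.com",
            "crazyguyonabike.com")
print(unique(normalize_domain(inputs)))
```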

non admin selenium issues

via Bruce. Same report as "Problem trusted_cat() on Windows for non-admin users" above.

Problem with get_shalla_data() permissions on Windows

The get_shalla_data() call fails on Windows for non-admin users during its call to "tar"; it appears to be a permissions or related problem. The environment is Windows 10 with Cygwin installed.

> get_shalla_data(outdir = "./shalla_domain_category.csv", overwrite = FALSE)
tar (child): Cannot connect to C: resolve failed

gzip: stdin: unexpected end of file
/usr/bin/tar: Child returned status 128
/usr/bin/tar: Error is not recoverable: exiting now
Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: running command 'tar.exe -zxf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815" -C "C:/Users/bwmoore/Documents/Consulting_Business/R/working"' had status 2
2: In untar(tmp, exdir = getwd()) :
  ‘tar.exe -zxf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815" -C "C:/Users/bwmoore/Documents/Consulting_Business/R/working"’ returned error code 2
3: running command 'tar.exe -ztf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815"' had status 2
4: In file(file, "rt") :
  cannot open file './shalla_domain_category.csv': No such file or directory
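
The message "Cannot connect to C: resolve failed" is what GNU tar prints when it reads an archive name containing a colon as a remote host:file specification, so the Cygwin tar.exe may be misreading the Windows path rather than hitting a permissions limit. One possible workaround, shown here only as a sketch, is to have untar() use R's built-in implementation instead of an external tar. The archive below is a small stand-in built on the fly; whether get_shalla_data() can be told to pass tar = "internal" is not known from this report.

```r
# Build a small .tar.gz to stand in for the downloaded archive.
src_dir <- file.path(tempdir(), "demo_src")
dir.create(src_dir, showWarnings = FALSE)
writeLines("example.com,demo", file.path(src_dir, "domains.csv"))
archive <- file.path(tempdir(), "demo.tar.gz")
old_wd <- setwd(src_dir)
tar(archive, files = "domains.csv", compression = "gzip", tar = "internal")
setwd(old_wd)

# Extract with R's internal tar implementation, so no external tar.exe
# (Cygwin or otherwise) is involved.
out_dir <- file.path(tempdir(), "demo_out")
dir.create(out_dir, showWarnings = FALSE)
untar(archive, exdir = out_dir, tar = "internal")
print(file.exists(file.path(out_dir, "domains.csv")))
```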

dockerize / example

Hi,

Hope you are all well!

Is it possible to dockerize rdomains (e.g., on alpine-r)? I am interested in running some tests on the ODP directory.

I have 5M websites and wanted to see how the classification goes, but I am a newbie in R and could not find a way to make it work.

Can you help me/us by dockerizing rdomains and providing an example on a random domain?

Thanks in advance for any insights or inputs on these questions.

Cheers,
X

Error in virustotal_cat() when calls exceed four per minute

virustotal_cat() throws an error with the message shown below when you exceed the API limit of four calls per minute. Consider a more descriptive message, and perhaps waiting and then retrying.

From a programming perspective, it would be nice if virustotal_cat() took a vector like the other _cat calls. That would require some rate-limiting code.

[1] "i = 1"
[1] "Domain = 100dollars-seo.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 2"
[1] "Domain = best-seo-offer.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 3"
[1] "Domain = best-seo-solution.com"
[1] "Domain results = 1" "Domain results = 1" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 4"
[1] "Domain = buttons-for-website.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 5"
[1] "Domain = buttons-for-your-website.com"
Error in if (res$verbose_msg == "Domain not found") { :
argument is of length zero
d> traceback()
2: virustotal_cat(domainDf[i], apikey = virusTotalApiKey) at #19
1: getVirusTotal(gaRefSpamDf$referrerDomain, virusTotalApiKey)
d> virustotal_cat("buttons-for-your-website.com",apikey = virusTotalApiKey)
                        domain bitdefender      websense        google dr_web trendmicro alexa
1 buttons-for-your-website.com          NA uncategorized uncategorized     NA         NA    NA

Code to recreate

getVirusTotal <- function(domainDf, virusTotalApiKey) {
  library(rdomains)
  print(NROW(domainDf))
  # Empty frame with the full set of category columns to merge results into.
  virusDomain <- data.frame(domain = character(),
                            bitdefender = character(),
                            dr_web = character(),
                            alexa = character(),
                            google = character(),
                            websense = character(),
                            trendmicro = character(),
                            stringsAsFactors = FALSE)
  for (i in seq_along(domainDf)) {
    print(paste("i = ", i))
    print(paste("Domain = ", domainDf[i]))
    # One API call per domain; this is the call that fails on the fifth domain.
    thisDomain <- virustotal_cat(domainDf[i], apikey = virusTotalApiKey)
    print(paste("Domain results = ", thisDomain))
    # all = TRUE keeps the rows from both sides of the merge.
    virusDomain <- merge(virusDomain, thisDomain, all = TRUE)
  }
  # Return the accumulated results.
  virusDomain
}
gaRefSpam1Df <- getVirusTotal(gaRefSpamDf$referrerDomain, virusTotalApiKey)

d> gaRefSpamDf$referrerDomain[1:10]
[1] "100dollars-seo.com" "best-seo-offer.com" "best-seo-solution.com" "buttons-for-website.com"
[5] "buttons-for-your-website.com" "crazyguyonabike.com" "darodar.com" "delta-search.com"
[9] "duckduckgo.com" "facebook.com"
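
The vectorised, rate-limited behavior requested above might be sketched as follows. `rate_limited_cat()` is a hypothetical wrapper, not an rdomains function. `fetch` stands in for something like `function(d) virustotal_cat(d, apikey = virusTotalApiKey)`, and the default spacing of 60 / 4 seconds simply reflects the four-per-minute limit mentioned in the report. The demo uses a fake fetch function and no delay so that it runs instantly.

```r
# Hypothetical wrapper: look up each domain in turn, spacing the calls out and
# retrying after a pause when a call fails.
rate_limited_cat <- function(domains, fetch, per_minute = 4,
                             retries = 1, wait = 60 / per_minute) {
  results <- vector("list", length(domains))
  for (i in seq_along(domains)) {
    attempt <- 0
    repeat {
      res <- tryCatch(fetch(domains[i]), error = function(e) NULL)
      if (!is.null(res) || attempt >= retries) break
      attempt <- attempt + 1
      Sys.sleep(wait)                          # back off before retrying
    }
    results[i] <- list(res)                    # keeps NULL for failed lookups
    if (i < length(domains)) Sys.sleep(wait)   # space out successive calls
  }
  names(results) <- domains
  results
}

# Fake lookup that fails once, on the second call, to mimic a rate-limit error.
calls <- 0
flaky_fetch <- function(d) {
  calls <<- calls + 1
  if (calls == 2) stop("rate limit exceeded")
  data.frame(domain = d, bitdefender = "uncategorized",
             stringsAsFactors = FALSE)
}
out <- rate_limited_cat(c("a.example", "b.example"), flaky_fetch, wait = 0)
print(names(out))
```

Returning a named list rather than one data frame sidesteps the differing column sets described in the first issue; the results can be combined once they share a fixed structure.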
