themains / rdomains
Classifying the content of domains
Home Page: http://themains.github.io/rdomains/
License: Other
The call get_shalla_data() appears to misspell the name of the downloaded file as "shalla_domain_cateory.csv"; it should probably be "shalla_domain_category.csv", with a "g" in "category".
Calls to virustotal_cat() return different structures depending on what information virustotal.com has for the domain. The varying structure makes it harder to combine results from multiple domains programmatically. A better approach would be to return the same structure every time, with NA for missing information.
virustotal_cat("www.google.com",apikey = virusTotalApiKey);
domain bitdefender dr_web alexa google websense trendmicro
1 www.google.com searchengines chats google searchengines advertisements search engines portals
virustotal_cat("www.social-buttons.com",apikey = virusTotalApiKey);
domain bitdefender dr_web
1 www.social-buttons.com newly registered websites newly registered websites
Warning message:
In names(d_res)[names(d_res) %in% cat_names] <- c("bitdefender", :
number of items to replace is not a multiple of replacement length
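One way to get the uniform structure requested above is to map whatever columns come back onto a fixed set. The following is only a minimal sketch: `normalize_vt_result` and `vt_columns` are hypothetical names, not part of rdomains, and the column list is assumed from the output shown above.

```r
# Assumed column set, taken from the www.google.com output shown above
vt_columns <- c("domain", "bitdefender", "dr_web", "alexa",
                "google", "websense", "trendmicro")

# Hypothetical helper: pad a one-row result to the fixed column set,
# leaving NA wherever the service returned nothing.
normalize_vt_result <- function(res, columns = vt_columns) {
  out <- as.data.frame(
    setNames(as.list(rep(NA_character_, length(columns))), columns),
    stringsAsFactors = FALSE
  )
  for (nm in intersect(names(res), columns)) {
    out[[nm]][1] <- as.character(res[[nm]][1])
  }
  out
}

# A partial result like the one returned for www.social-buttons.com
partial <- data.frame(domain = "www.social-buttons.com",
                      bitdefender = "newly registered websites",
                      dr_web = "newly registered websites",
                      stringsAsFactors = FALSE)
full <- normalize_vt_result(partial)
```

Because every normalized result has the same columns in the same order, results for several domains can be stacked with `rbind()` without the name-matching warning.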
There appears to be a permissions problem with trusted_cat() for non-admin users on Windows machines. I can start a Selenium server with startServer(log=FALSE), but trusted_cat() offers no way to pass a log=FALSE argument. Judging by the startServer() error messages, you may want to add an option for specifying a path to a Selenium server for non-admin users.
startServer()
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning messages:
1: startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
2: In file(file, ifelse(append, "a", "w")) :
cannot open file 'C:/Program Files/R/R-3.3.1/library/RSelenium/bin/sellog.txt': Permission denied
?startServer
startServer(log=FALSE)
$stop
function ()
{
tools::pskill(selPID)
}
$getPID
function ()
{
return(selPID)
}
Warning message:
startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
trusted_cat("http://www.crazyguyonabike.com")
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning messages:
1: checkForServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
2: startServer is deprecated.
Users in future can find the function in file.path(find.package("RSelenium"), "example/serverUtils").
The sourcing/starting of a Selenium Server is a users responsiblity.
Options include manually starting a server see vignette("RSelenium-basics", package = "RSelenium")
and running a docker container see vignette("RSelenium-docker", package = "RSelenium")
3: In file(file, ifelse(append, "a", "w")) :
cannot open file 'C:/Program Files/R/R-3.3.1/library/RSelenium/bin/sellog.txt': Permission denied
trusted_cat("http://www.crazyguyonabike.com",log=FALSE)
Error in trusted_cat("http://www.crazyguyonabike.com", log = FALSE) :
unused argument (log = FALSE)
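The "unused argument" error above could be avoided if the categorizing function forwarded extra arguments to whatever starts the server. The sketch below only illustrates that forwarding pattern with `...`; both functions are hypothetical stand-ins and do not reflect how trusted_cat() or RSelenium are actually implemented.

```r
# Hypothetical stand-in for a server-start routine; it only reports
# whether it would write a log file.
start_server_stub <- function(log = TRUE) {
  if (log) "logging to file" else "no log file"
}

# Hypothetical wrapper: extra arguments in `...` are handed to the
# server-start routine instead of being rejected as unused.
categorize_sketch <- function(url, ...) {
  server_status <- start_server_stub(...)
  list(url = url, server = server_status)
}

status <- categorize_sketch("http://www.crazyguyonabike.com", log = FALSE)$server
```

With this pattern a non-admin user could switch off logging, so nothing has to be written under the R library directory.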
When you pass an invalid domain, virustotal_cat() throws an error that says nothing about the problem or the input that caused it. The error also aborts any calling function. It would probably be better to issue a warning or message that the domain was not found and return the normal structure with NA for each entry.
virustotal_cat("53275950.videos-for-your-business.com",apikey=virusTotalApiKey)
Error in data.frame(domain, d_res) :
arguments imply differing number of rows: 1, 0
virustotal_cat("videos-for-your-business.com",apikey=virusTotalApiKey)
domain bitdefender dr_web
1 videos-for-your-business.com uncategorized uncategorized
Warning message:
In names(d_res)[names(d_res) %in% cat_names] <- c("bitdefender", :
number of items to replace is not a multiple of replacement length
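Until the package handles this itself, a caller can get the suggested behavior with a thin wrapper. This is a sketch under stated assumptions: `safe_lookup` and `failing_lookup` are hypothetical names, the default column list is only an example, and `lookup` stands for any function that takes a domain and returns a one-row data frame.

```r
# Hypothetical wrapper: turn a lookup error into a warning plus an NA row.
safe_lookup <- function(domain, lookup,
                        columns = c("domain", "bitdefender", "dr_web")) {
  tryCatch(
    lookup(domain),
    error = function(e) {
      warning("Lookup failed for '", domain, "': ", conditionMessage(e),
              call. = FALSE)
      na_row <- as.data.frame(
        setNames(as.list(rep(NA_character_, length(columns))), columns),
        stringsAsFactors = FALSE
      )
      na_row$domain <- domain
      na_row
    }
  )
}

# Stand-in that always fails, mimicking the error message shown above
failing_lookup <- function(domain) {
  stop("arguments imply differing number of rows: 1, 0")
}

res <- suppressWarnings(
  safe_lookup("53275950.videos-for-your-business.com", failing_lookup)
)
```

The calling loop keeps running, and the failed domain still appears in the combined results with NA categories.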
At least one of the queries below, if not all, should return a valid category for the www.crazyguyonabike.com domain. I ran get_dmoz_data(), and the domain is present both in the downloaded file and on the dmoz.com web site.
dmoz_cat(domains="http://www.crazyguyonabike.com")
domain_name dmoz_category
1 crazyguyonabike.com
dmoz_cat(domains="https://www.crazyguyonabike.com")
domain_name dmoz_category
1 https://www.crazyguyonabike.com
dmoz_cat(domains="www.crazyguyonabike.com")
domain_name dmoz_category
1 crazyguyonabike.com
dmoz_cat(domains="crazyguyonabike.com")
domain_name dmoz_category
1 crazyguyonabike.com
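The output above shows the "https://" form passing through unchanged while the other three forms reduce to the bare domain. A minimal sketch of consistent input normalization follows; `normalize_domain` is a hypothetical helper, not the package's own code, and it addresses only the input handling, not why the lookup returns an empty category.

```r
# Hypothetical helper: reduce a URL or host name to a bare domain.
normalize_domain <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^[a-z][a-z0-9+.-]*://", "", x)  # drop any scheme (http, https, ...)
  x <- sub("[/?#].*$", "", x)               # drop path, query, and fragment
  x <- sub("^www\\.", "", x)                # drop a leading "www."
  x
}

inputs <- c("http://www.crazyguyonabike.com",
            "https://www.crazyguyonabike.com",
            "www.crazyguyonabike.com",
            "crazyguyonabike.com")
normalized <- unique(normalize_domain(inputs))
```

All four spellings collapse to the single value "crazyguyonabike.com", so they would all hit the same entry in a lookup table.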
via Bruce
Offer hook to brightcloud API
The get_shalla_data() call fails on Windows for non-admin users when it calls "tar", apparently because of a permissions or related problem. The environment is Windows 10 with Cygwin installed.
get_shalla_data(outdir = "./shalla_domain_category.csv", overwrite = FALSE)
tar (child): Cannot connect to C: resolve failed
gzip: stdin: unexpected end of file
/usr/bin/tar: Child returned status 128
/usr/bin/tar: Error is not recoverable: exiting now
Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: running command 'tar.exe -zxf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815" -C "C:/Users/bwmoore/Documents/Consulting_Business/R/working"' had status 2
2: In untar(tmp, exdir = getwd()) :
‘tar.exe -zxf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815" -C "C:/Users/bwmoore/Documents/Consulting_Business/R/working"’ returned error code 2
3: running command 'tar.exe -ztf "C:\Users\bwmoore\AppData\Local\Temp\RtmpYd0hIU\file21e4636a815"' had status 2
4: In file(file, "rt") :
cannot open file './shalla_domain_category.csv': No such file or directory
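The "Cannot connect to C: resolve failed" line suggests that the Cygwin tar is reading the drive letter in the archive path as a remote host name, which GNU tar does for archive names containing a colon. If so, this would be a path-parsing problem and not a permissions problem. One possible workaround, sketched below, is to have untar() use R's internal implementation via `tar = "internal"` so that no external tar.exe is called. The file names here are made up for the example, and whether get_shalla_data() could adopt this is an assumption.

```r
# Build a small .tar.gz in a temporary directory so the example is self-contained
work_dir <- file.path(tempdir(), "untar_sketch")
dir.create(work_dir, showWarnings = FALSE)
old_wd <- setwd(work_dir)
writeLines("domain,category", "sample.csv")
utils::tar("sample.tar.gz", files = "sample.csv",
           compression = "gzip", tar = "internal")
setwd(old_wd)

# Extract with R's internal implementation instead of an external tar.exe
out_dir <- file.path(work_dir, "extracted")
dir.create(out_dir, showWarnings = FALSE)
utils::untar(file.path(work_dir, "sample.tar.gz"),
             exdir = out_dir, tar = "internal")
extracted_ok <- file.exists(file.path(out_dir, "sample.csv"))
```

The internal implementation never shells out, so the drive letter is not parsed by an external program.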
Hi,
Hope you are all well !
Is it possible to dockerize rdomains (e.g., with alpine-r)? I am interested in running some tests on the ODP directory.
I have 5M websites and wanted to see how the classification performs, but I am a newbie in R and could not find a way to make it work.
Could you help by dockerizing rdomains and providing an example for a random domain?
Thanks in advance for any insights or inputs on these questions.
Cheers,
X
virustotal_cat() throws the error shown below when you exceed the API limit of four requests per minute. You might consider a more descriptive message, and perhaps a wait followed by a retry.
From a programming perspective, it would be nice if virustotal_cat() took a vector like the other _cat calls. That would require some rate-limiting code.
[1] "i = 1"
[1] "Domain = 100dollars-seo.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 2"
[1] "Domain = best-seo-offer.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 3"
[1] "Domain = best-seo-solution.com"
[1] "Domain results = 1" "Domain results = 1" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 4"
[1] "Domain = buttons-for-website.com"
[1] "Domain results = 1" "Domain results = NA" "Domain results = 1" "Domain results = 1" "Domain results = NA" "Domain results = NA"
[7] "Domain results = NA"
[1] "i = 5"
[1] "Domain = buttons-for-your-website.com"
Error in if (res$verbose_msg == "Domain not found") { :
argument is of length zero
d> traceback()
2: virustotal_cat(domainDf[i], apikey = virusTotalApiKey) at #19
1: getVirusTotal(gaRefSpamDf$referrerDomain, virusTotalApiKey)
d> virustotal_cat("buttons-for-your-website.com",apikey = virusTotalApiKey)
domain bitdefender websense google dr_web trendmicro alexa
1 buttons-for-your-website.com NA uncategorized uncategorized NA NA NA
getVirusTotal <- function(domainDf, virusTotalApiKey) {
  require(rdomains)
  require(dplyr)
  print(NROW(domainDf))
  # Empty accumulator with the full set of category columns
  virusDomain <- data.frame(domain = character(),
                            bitdefender = character(),
                            dr_web = character(),
                            alexa = character(),
                            google = character(),
                            websense = character(),
                            trendmicro = character(),
                            stringsAsFactors = FALSE)
  for (i in seq_len(NROW(domainDf))) {
    print(paste("i = ", i))
    print(paste("Domain = ", domainDf[i]))
    # One API request per domain
    thisDomain <- virustotal_cat(domainDf[i], apikey = virusTotalApiKey)
    print(paste("Domain results = ", thisDomain))
    virusDomain <- merge(virusDomain, thisDomain, all = TRUE)
  }
  # Return the accumulated per-domain results
  return(virusDomain)
}
gaRefSpam1Df <- getVirusTotal(gaRefSpamDf$referrerDomain, virusTotalApiKey)
d> gaRefSpamDf$referrerDomain[1:10]
[1] "100dollars-seo.com" "best-seo-offer.com" "best-seo-solution.com" "buttons-for-website.com"
[5] "buttons-for-your-website.com" "crazyguyonabike.com" "darodar.com" "delta-search.com"
[9] "duckduckgo.com" "facebook.com"
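The vector interface with rate limiting and retry suggested in this report could look roughly like the sketch below. It is not the package's implementation: `lookup_all` and `flaky_lookup` are hypothetical names, `lookup` stands for any single-domain function such as a call to virustotal_cat() with the API key filled in, and the default 15-second spacing simply follows from four requests per minute.

```r
# Hypothetical wrapper: call `lookup` once per domain, pausing between
# calls and retrying after a wait when a call fails.
lookup_all <- function(domains, lookup, min_interval = 15,
                       max_tries = 3, retry_wait = 60) {
  results <- vector("list", length(domains))
  for (i in seq_along(domains)) {
    if (i > 1) Sys.sleep(min_interval)
    res <- NULL
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(lookup(domains[i]), error = function(e) NULL)
      if (!is.null(res)) break
      if (attempt < max_tries) Sys.sleep(retry_wait)
    }
    # Keep a placeholder row when every attempt failed
    results[[i]] <- if (is.null(res)) {
      data.frame(domain = domains[i], stringsAsFactors = FALSE)
    } else {
      res
    }
  }
  results
}

# Stand-in lookup that fails on its second call, as a rate limit might
calls <- 0
flaky_lookup <- function(domain) {
  calls <<- calls + 1
  if (calls == 2) stop("rate limit exceeded")
  data.frame(domain = domain, category = "uncategorized",
             stringsAsFactors = FALSE)
}

# Waits are set to zero here only so the example runs instantly
out <- lookup_all(c("a.example", "b.example"), flaky_lookup,
                  min_interval = 0, retry_wait = 0)
combined <- do.call(rbind, out)
```

The function returns a list with one data frame per domain. Stacking them with `rbind()` works here because the stand-in always returns the same columns; with real results the columns would first need to be made uniform.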