jkeirstead / scholar
Analyse citation data from Google Scholar
License: Other
I get the following error when trying to run get_publications():
Error in assign(name, value, envir = attr(static, ".env")) : use of NULL environment is defunct
@cimentadaj I just tested the new PR (#54), it's super cool, but I've encountered these three issues:
coauthors <- scholar::get_coauthors("bg0BZ-QAAAAJ", n_coauthors=5, n_deep=1)
scholar::plot_coauthors(coauthors)
As you can see, the coauthor Pascale Piolino seems to be lacking her coauthors. This does not appear to change when modifying n_coauthors and n_deep.
Warning messages:
1: In grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database
This might be caused by a non-standard font used for the plot?
Error in data.frame(author = author_name, author_url = url, coauthors = coauthors, :
arguments imply differing number of rows: 0, 1
Thanks ;)
Hi
I ran:
sen <- get_publications(id="sLNFo0sAAAAJ", cstart = 0, pagesize = 10, flush = FALSE)
I am getting the following error message:
Error in read_xml.response(x, encoding, ..., as_html = TRUE) :
Service Unavailable (HTTP 503).
Am I getting this because I am being banned by Google Scholar? If so, how can I slow-down the downloading?
Thanks
Adel
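A common mitigation for HTTP 503 responses (not a scholar package feature; the delay value and loop shape below are illustrative assumptions) is to pause between requests:

```r
# Sketch: throttle requests with Sys.sleep() to reduce the chance of
# an HTTP 503 from Google Scholar. The 30-second delay is a guess;
# tune it to your tolerance.
library(scholar)

ids <- c("sLNFo0sAAAAJ", "B7vSqZsAAAAJ")
pubs <- list()
for (id in ids) {
  pubs[[id]] <- get_publications(id, cstart = 0, pagesize = 10, flush = FALSE)
  Sys.sleep(30)  # wait before the next request
}
```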
As of today (3AM CET, as the earliest measured occurrence):
Error in tables[[1]] : subscript out of bounds
I suspect something has changed in the Google API?
I tried to figure this out (but I'm not that skilled). However, pulling the page content using RCurl like so:
getURL('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')
returns a bunch of source code but with an error message at the end:
We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue. Please try again later.
I'll let you decide if this is worth closing this case. I imagine it is, but there may be more to it that someone more expert can check out.
Looks like compare_scholar_careers()
is capped at 8 years. E.g. for Feynman and Hawking the output is:
id year cites career_year name
1 B7vSqZsAAAAJ 2009 3161 0 Richard Feynman
2 B7vSqZsAAAAJ 2010 3375 1 Richard Feynman
3 B7vSqZsAAAAJ 2011 3471 2 Richard Feynman
4 B7vSqZsAAAAJ 2012 4060 3 Richard Feynman
5 B7vSqZsAAAAJ 2013 4146 4 Richard Feynman
6 B7vSqZsAAAAJ 2014 4039 5 Richard Feynman
7 B7vSqZsAAAAJ 2015 3843 6 Richard Feynman
8 B7vSqZsAAAAJ 2016 4069 7 Richard Feynman
9 B7vSqZsAAAAJ 2017 2409 8 Richard Feynman
10 qj74uXkAAAAJ 2009 4942 0 Stephen W. Hawking
11 qj74uXkAAAAJ 2010 4860 1 Stephen W. Hawking
12 qj74uXkAAAAJ 2011 5477 2 Stephen W. Hawking
13 qj74uXkAAAAJ 2012 5730 3 Stephen W. Hawking
14 qj74uXkAAAAJ 2013 5853 4 Stephen W. Hawking
15 qj74uXkAAAAJ 2014 6207 5 Stephen W. Hawking
16 qj74uXkAAAAJ 2015 5944 6 Stephen W. Hawking
17 qj74uXkAAAAJ 2016 6196 7 Stephen W. Hawking
18 qj74uXkAAAAJ 2017 4084 8 Stephen W. Hawking
The function description mentions the bar chart in scholar profiles as a source; however, clicking on that chart reveals citations for all years of a career, which doesn't seem to be used here. Is this expected behaviour? It limits the utility of this function quite drastically.
Hello, I am trying to pull publications for multiple authors. However, get_publications() only works with the first ID it is used on in a session; if I try to use it with a subsequent ID, the following is returned:
[1] title author journal number cites year cid pubid
<0 rows> (or 0-length row.names)
Any thoughts? I've carefully checked the IDs, and they are valid and all work, as long as each is the first one tried in a particular session. To do another one, I have to end the session and reopen R, which is very inefficient!
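One thing worth trying (a sketch, on the assumption that the package's cache is returning a stale empty result for subsequent IDs) is to flush the cache on each call:

```r
# Sketch: force a fresh fetch per ID with flush = TRUE, assuming the
# cache is what makes only the first ID of a session work.
library(scholar)

ids <- c("B7vSqZsAAAAJ", "qj74uXkAAAAJ")
all_pubs <- lapply(ids, function(id) get_publications(id, flush = TRUE))
```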
Like others, I was having problems with the cookies / 2 requests issues just recently fixed. Was happy to see the new version posted fixing the error.
With the new version, however, everything I try gives me the following error:
Error: is.handle(handle) is not TRUE
This includes all the simple examples on the Readme doc, such as the following code:
# Define the id for Richard Feynman
id <- 'B7vSqZsAAAAJ'
# Get his profile and print his name
l <- get_profile(id)
l$name
# Get his citation history, i.e. citations to his work in a given year
get_citation_history(id)
# Get his publications (a large data frame)
get_publications(id)
Hi.
I love this package!
But I get errors when getting some citation histories. The error is very inconsistent: the message is always the same, but it occurs at differing times (sometimes the loop runs through all the publications I want without error; other times it fails at varying papers, i.e. it does not always fail at the same paper!):
Error in if (zero_range(from) || zero_range(to)) { :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(as.numeric(cit$year)) :
no non-missing arguments to max; returning -Inf
2: In max(cit$cites) : no non-missing arguments to max; returning -Inf
I am trying to get citation data for all of this user's publications (excluding the ones that have NAs in them).
I'll try to give you a reproducible example (but as I said, it does not always fail!)
library(scholar)
library(reshape)  # for cast()

USR <- "yiSLTAcAAAAJ"

### Publications and citation history ####
Pubs <- get_publications(USR)  # get publication information
Pubs$cites <- as.numeric(as.character(Pubs$cites))
Pubs <- Pubs[complete.cases(Pubs), ]
Years <- seq(from = min(Pubs$year, na.rm = TRUE), to = max(Pubs$year, na.rm = TRUE), by = 1)
CiteYears <- as.data.frame(matrix(0, ncol = length(Years), nrow = nrow(Pubs)))
names(CiteYears) <- as.character(Years)

### Get each article's cite history
pb <- txtProgressBar(min = 1, max = nrow(CiteYears), initial = 1, style = 3)  # set progress bar
for (i in 1:nrow(CiteYears)) {
  setTxtProgressBar(pb, i)
  CitePub <- get_article_cite_history(USR, Pubs$pubid[i])
  CitePub <- cast(CitePub, value = "cites", ~year)[, -1]
  CiteYears[i, grepl(paste(names(CitePub), collapse = "|"), names(CiteYears))] <- CitePub
}
Pubs <- cbind(Pubs, CiteYears)
Some users may choose to make their profiles private, in which case the current code is unable to get this data. Investigate whether it's possible to support such profiles.
In addition to giving citation histories for an author, you can also view the results for a single article. It would be nice to be able to retrieve these values as well.
I extracted the publication list from one author and tried to get an article's citation history, but got an error message. The article tried here is from 2012 and has 86 citations. Not that much to crash a function.
article1 <- get_article_cite_history('fYQY8Y8AAAAJ', '7591188124196201684')
Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf
@cimentadaj as mentioned in discussion of #54, I think it would be best if you made it clear at the appropriate place in the docs and vignette that coauthor_network
and friends only return co-authors listed on the google scholar profile, not from all retrieved publications.
This is what I would like to do:
get_publications(author)
get_article_cite_history(author,pubid)
However, get_publications(author)
returns void(0)
values in the pubid
column, which means I cannot run get_article_cite_history
.
Any idea what is not working? The same script was working a couple of months ago, so could it be caused by the recent Google Scholar update?
I am trying to retrieve current h-indices for about 200 scholars at once. I have the ID list, and have run a loop that runs predict_h_index() on every ID. It worked fine for the first 70 or so and then ran into Error in tables[[1]] : subscript out of bounds, which also occurred when I was running a similar loop to get said IDs. It seems the first time Google blocked me for scraping (as indicated by the fact that when I went on Scholar manually it asked me to verify I wasn't a robot), and getting on a different network solved it naturally.
However several minutes later, I try rerunning the command and am now receiving:
Error in if (any(diff(h.vals) < 0)) warning(paste0("Decreasing h-values predicted. ", : missing value where TRUE/FALSE needed In addition: Warning message: In min(papers$year, na.rm = TRUE) : no non-missing arguments to min; returning Inf
I can't tell if I'm being blocked again, because like I said the command worked fine 70-ish times in a row and nothing changed, but now it doesn't work on any of the IDs. If this is literally Google blocking me every time for scraping, even though I did put in Sys.sleep()
, what would you recommend? If not, how might I solve this problem?
Edit: Hello. After further inspection I found that only certain IDs prompt this new error, even though the data manually inspected on Scholar looks fine. The rest work as intended. I made a workaround with tryCatch
but I'd love a solution for these particular IDs. Examples of ones that prompt errors: 5JserkUAAAAJ
and EdV8gVgAAAAJ
.
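The tryCatch workaround mentioned above might look like this sketch (the loop shape and delay are assumptions, not the reporter's actual code):

```r
# Sketch: skip IDs whose profile makes predict_h_index() error,
# recording NA for them instead of stopping the whole run.
library(scholar)

ids <- c("5JserkUAAAAJ", "EdV8gVgAAAAJ")
results <- lapply(ids, function(id) {
  Sys.sleep(5)  # stay polite to Google Scholar
  tryCatch(predict_h_index(id), error = function(e) NA)
})
```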
Scholar only provides citation history values for the past 9 years. This should be clarified in the documentation.
As reported by email, predict_h_index
can give negative numbers. This is because the underlying method is a regression analysis calibrated on neuroscientists, so those in other fields can get weird answers (see the documentation).
While this isn't a bug per se, it might be preferable to have the method restrict future h-index values to be greater than or equal to the current h-index. Any objections?
@cimentadaj A minor suggestion: I've noticed that some people have their name written in full uppercase (Serge NICOLAS) while others don't (Dominique Makowski).
Wouldn't it look neater (especially in the network graph) to homogenise the author names?
It can be easily done by title-casing the names vector:
stringr::str_to_title("Serge NICOLAS")
stringr::str_to_title("Dominique Makowski")
What do you think?
Is there a reason why the count of cites using the two methods should be different? Thanks.
id="B7vSqZsAAAAJ"
citationsbyyear=get_citation_history(id)
sum(citationsbyyear$cites)
pubs=get_publications(id)
sum(pubs$cites)
The get_profile() function returns total citations in the h_index slot, h_index in the i10_index slot, and so on. Output and a sessionInfo dump below.
get_profile('0ryVFl8AAAAJ')
$id
[1] "0ryVFl8AAAAJ"
$name
[1] "Chris Miller"
$affiliation
[1] "The Genome Institute at Washington University"
$total_cites
[1] NA
$h_index
[1] 1257
$i10_index
[1] 11
$fields
[1] "cancer genomics" "computational biology" "systems biology"
$homepage
[1] "http://www.chrisamiller.com/"
Warning message:
In get_profile("0ryVFl8AAAAJ") : NAs introduced by coercion
sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scholar_0.1.0
loaded via a namespace (and not attached):
[1] R.cache_0.9.0 R.methodsS3_1.5.2 R.oo_1.15.8 R.utils_1.27.1
[5] XML_3.2-0 digest_0.4.2 plyr_1.8 stringr_0.4
It seems get_publications() is not retrieving the first 100 papers as mentioned in the documentation, but only the first 20.
library(scholar)
id = "xJaxiEEAAAAJ" # Isaac Newton
get_num_articles(id)
[1] 20
packageVersion("scholar")
[1] ‘0.1.2’
It would be great if get_publications() could retrieve all papers, or a specific number of papers (with some safe defaults).
Hello,
In the given example, you have started with an ID for the author. Is there a way to find out the author ID given his/her name?
Thanks
Google Scholar changed its design recently and pubid
return by scholar::get_publications()
are now "void(0)".
It would be great if this table also included the ID used by google for each article.
This information is available next to citation count in the form of a link, e.g. for this link http://scholar.google.com/scholar?oi=bibs&hl=en&cites=6728932339706166581
the article id is 6728932339706166581.
As far as I can tell, this should be relatively straightforward; we just need to extract an extra element at this point in the code.
There are three advantages of having the article id:
I'll have a go at this sometime soon, but don't let that discourage anyone who already knows how to do it.
Hello, Thank you for your package.
I found that publication IDs are not returned from get_publications. Without publication IDs, it is impossible to run commands like get_article_cite_history.
chaoyisheng
If I execute this code:
ids <- c('dYWNWicAAAAJ', 'R2ZrHtsAAAAJ', 'yg0LY3QAAAAJ', 'OixfQOcAAAAJ', 'k6b-lLYAAAAJ', 'g-U6tyoAAAAJ', 'ixGcu5gAAAAJ', 'docGNYEAAAAJ', 'kU2jvOMAAAAJ', '_XJMeP0AAAAJ', 'aqsaHZwAAAAJ', 'wEU99lsAAAAJ')
cmp2 <- compare_scholar_careers(ids)
I get:
Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 7, 6
Calls: compare_scholar_careers ... lapply -> FUN -> cbind -> get_citation_history -> data.frame
Execution halted
PS. These same instructions worked before, but I guess some change in the Google data triggers this error.
I get an 'empty' object with 3 variables (year, cites, pubid) and zero observations with the command
get_article_cite_history(id, article)
I obtained the article IDs using the get_publications function and have tried different article IDs without success.
There is no problem with: get_citation_history(id)
I am using Version: 0.1.6
Is this a bug or am I doing something wrong?
I observed that Google Scholar does not list all authors if a publication has more than 5 or 6 authors ("..." appears at the end of the author list).
To get them, we need to parse the publication page instead of the profile page for any such publication. I wrote some simple code to achieve this. Can you incorporate it into the package so other people can benefit?
library(XML)

getCompleteAuthors <- function(id, pubid) {
  url_template <- "http://scholar.google.com/citations?view_op=view_citation&citation_for_view=%s:%s"
  url <- sprintf(url_template, id, pubid)
  tree <- htmlTreeParse(url, useInternalNodes = TRUE)
  # the full author list is the first "gsc_value" field on the citation page
  auths <- xpathApply(tree, '//*/div[@class="gsc_value"]', xmlValue)[[1]]
  return(auths)
}
Google Scholar has changed the layout of the page so the xpaths no longer extract the correct elements.
Is there a way to download not an aggregate of citations for a paper of a person but a detailed list of who is citing that paper and, ideally, when the citation came in? This would open many doors for great things in terms of building networks and analyzing the "rate of adoption" (how long it takes for the first citation to drop, etc.).
> scholar::tidy_id("cuXoCA8AAAAJ")
Error in if (getOption("scholar_call_home")) { :
argument is of length zero
> library(scholar)
> scholar::tidy_id("cuXoCA8AAAAJ")
[1] "cuXoCA8AAAAJ"
This is relevant if the scholar package is imported by another package.
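A defensive fix for this (a sketch of the general R idiom, not the package's actual patch) is to guard the option check so an unset option never reaches `if`:

```r
# getOption() returns NULL when an option is unset, and `if (NULL)`
# throws "argument is of length zero". isTRUE() is NULL-safe.
opt <- getOption("scholar_call_home")  # NULL unless the package set it
if (isTRUE(opt)) {
  message("calling home")
}
```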
Hi. For some reason when I try to get my citation history I get the following error:
Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 9, 8
This doesn't happen for many other IDs that I've tried. My id is 'xpECwJQAAAAJ'... do you know why it doesn't work? Too few citations? ;-)
Very nice package!
But I found some problems. I am doing
id <- "xLd8lNoAAAAJ&hl"
augusto <- get_profile(id)
get_publications(id)
get_num_articles(id)
The result is showing only 20 of my papers (not all of them). Am I doing something wrong?
Thanks for your help.
Additionally
@GuangchuangYu are you OK to review develop and take this the next step?
I tried install.packages("scholar", dependencies=T)
but eventually I get the error above. I'm using R version 3.0.1 (2013-05-16) under Linux Ubuntu 13.10 (64 bit).
I have found by trial and error that get_article_cite_history fails when there are 0 citations to the article, giving the error message below. I know you're aware of this, so please keep reading after the error message.
Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf
I have run into another case where I get the same error when there is one citation to the article with a publication year in the future. Consider the following example:
get_article_cite_history("xXHEaAUAAAAJ","Wp0gIr-vW9MC")
There is one citing article, with publication year 2016. Looking at the source code for this function, there is an attempt to read the citation bar chart provided by GS; however, GS reports no bar chart for the citations to this article, thus the error and perhaps no easy fix, although a check and a graceful return from the function would be welcome.
2015 is almost over, so this particular example will fix itself soon :)
Here is another example with two citations, one in 2016 and one with no year. The same error occurs:
get_article_cite_history("0pYNftwAAAAJ","M3ejUd6NZC8C")
In the examples, compare_scholars and compare_scholar_careers (src) take ages to run because they're pulling down hundreds of publications for Stephen Hawking (damn those prolific scientists). I've changed this to Isaac Newton for the moment and made the slowest compare_scholars example DONTRUN, but that's a rather temporary fix.
sorry - should've read the documentation before posting this issue!
Rob
Just a simple idea, but
pubs <- get_publications(id, cstart = 0, pagesize = 400, flush = FALSE)
pubs$cumsum <- cumsum(pubs$cites)
pubs$citerank <- get_num_articles(id) - rank(pubs$cites, ties.method = "last") + 1
pubs$htest <- (pubs$cites - pubs$citerank) >= 0     # does this paper count toward h?
pubs$hvalue <- sum(pubs$htest)                      # h-index
pubs$gtest <- (pubs$cumsum - pubs$citerank^2) >= 0  # does this paper count toward g?
pubs$gvalue <- sum(pubs$gtest)
pubs$i10 <- pubs$cites >= 10                        # at least 10 citations (Google's i10-index)
pubs$i10value <- sum(pubs$i10)
pubs$i50 <- pubs$cites >= 50
pubs$i50value <- sum(pubs$i50)
pubs$i100 <- pubs$cites >= 100
pubs$i100value <- sum(pubs$i100)
You could probably do similar things to get the more exotic indices that Harzing's PoP produces
https://harzing.com/pophelp/metrics.htm
Just discovered this nice package for extracting the SJR index of journals' prestige. It could potentially be a nice addition to the impact_factor related functions.
Specifically, as it includes the SJR index for different years, it would provide a unique opportunity to compute this index for each author's publication at their time of publication. Could be interesting for developing new authors' impact metrics.
get_oldest_article()
returns Inf
when some articles do not have a year,
as in this profile: get_oldest_article("QW5aIMgAAAAJ")
I would like to get multiple authors' information using purrr::map(ids, get_oldest_article)
and the function stops due to Inf
result.
Is it possible to return NA
or the smallest available year?
Best wishes
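Until the function itself handles this, a small wrapper can turn the Inf into NA so a purrr::map run keeps going (safe_oldest is a hypothetical helper, and it assumes get_oldest_article returns a single numeric year):

```r
# Sketch: convert the Inf produced by year-less articles into NA
# so mapping over many IDs does not stop.
library(scholar)
library(purrr)

safe_oldest <- function(id) {
  out <- get_oldest_article(id)
  if (is.infinite(out)) NA else out
}

oldest_years <- map(c("QW5aIMgAAAAJ"), safe_oldest)
```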
@GuangchuangYu @jefferis The CRAN version of scholar
is still v0.1.4, built on 2015-11-21. Some recent issues (#47, #48) are strange and seem to stem from the older build. Please push the latest release to CRAN so people can update via update.packages()
or install.packages("scholar")
.
Greetings,
The following shows that get_citation_history is barfing on a year with zero counts. I have the latest versions of R, RStudio, and all packages installed.
library(scholar)
library(ggplot2)
cit <- get_citation_history('juybEFMAAAAJ&hl=en')
Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 5, 4
Thanks!
This StackOverflow question clearly reflects user error/confusion, but the error message that you get when putting in a bogus article ID,
Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf
is not terribly transparent to new users, and could probably be clearer ...
It seems the tidy_id function is not working anymore because of the "https" call in the "sample_url" variable.
Replacing "https" with "http" in the "sample_url" variable solves the issue; however, it requires recompiling the package.
There appear to be some issues with Unicode support. See jaumebonet@e90bd0e and problems parsing the number of citations for struck-through values (e.g. when citations are grouped with another article like 'Le cours de physique de Feynman')
Is there a way to get ID of the scholar by entering his name? Imagine I have a large character vector of scholar names and I want to get their IDs without searching for them manually one by one. Thanks.
Caching is used to avoid hammering Google's servers when making multiple requests for a scholar's data. However it would be useful to have an option to flush the cache, at least for development.
Would it be possible for scholar::get_publications() to also return the DOI / URL? That would be very useful!
Thanks for that package!
What would be really useful is an addition that allows calculating the overall lifetime h-index for a group of scholars, such as a lab, a department, and so forth: let the user input all IDs within the group, and get back the total number of citations and the h-index for the group as a whole. A g-index would be a welcome addition to this proposed module or existing ones.
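As a starting point, here is a sketch of a pooled group h-index, on the assumption that "group h-index" means the h-index of all the group's publications taken together (group_h_index is a hypothetical helper, not a scholar function):

```r
library(scholar)

# Hypothetical helper: h-index of the pooled publication list of
# several scholar IDs. After sorting citation counts in descending
# order, h is the number of ranks r with cites[r] >= r.
group_h_index <- function(ids) {
  cites <- unlist(lapply(ids, function(id) {
    Sys.sleep(5)  # avoid hammering Google Scholar
    as.numeric(as.character(get_publications(id)$cites))
  }))
  cites <- sort(cites[!is.na(cites)], decreasing = TRUE)
  sum(cites >= seq_along(cites))
}
```

Note that papers co-authored within the group would be counted once per author here; deduplicating by pubid would be a further refinement.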
Hi! Thanks a lot for scholar
, it's a very interesting package. Inadvertently, I created the coauthornetwork
package, which does a very simple thing: it extracts your coauthor network and visualizes it.
Of course the package would be much better if it used the already existing functions from the scholar
package. Would you be open to integrating the function from coauthornetwork
in scholar
? @mkiang suggested the idea here and it made sense to me, as this is purely a two-function package which will probably fit much better within the already existing structure of scholar
.
I'd adapt the code to match the style/dependencies of the package, of course.
You can check out the package here.