This package has been archived. The former README is now in README-not.
ropensci-archive / solrium
:warning: ARCHIVED :warning: A general purpose R interface to Solr
License: Other
Hi,
Thanks for this great work.
I am new to R and tried to use this package today.
I followed the test case in test-solr_search.r and ran:
a <- solr_search(q='*:*', rows=2, fl='size', url='mysolrurl', key='size')
Calling a$response then failed with the following error:
Error in a$response : $ operator is invalid for atomic vector
> a['response']
[1] NA
> class(a)
[1] "character"
Any idea why this happened and how to fix it?
Thanks again :)
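The error itself is base R behaviour: $ only works on lists and other recursive objects, and class(a) shows that a is a plain character vector here. A minimal sketch of how to check what came back before indexing into it; the string assigned to a is a made-up stand-in, not real output from solr_search():

```r
# Hypothetical stand-in for what solr_search() returned in the report above:
# a character vector rather than a parsed list.
a <- "some raw text"

# `$` on an atomic vector raises an error; capture the message instead of stopping.
msg <- tryCatch(a$response, error = function(e) conditionMessage(e))
print(msg)

# Inspect the object before indexing into it.
print(is.list(a))
print(class(a))
str(a)
```

If the object turns out to be a character vector, the next thing to look at is what the string contains (for example an error page or an unparsed response body), which this sketch does not attempt to guess.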
https://issues.apache.org/jira/browse/SOLR-66 and note which version first had it for solr_search(), to warn users who may be using older Solr installs
@ropensci/owners Does this package name need to change from solr to something else? Or is it okay?
Using the solr package for medium-size datasets (5000+ observations; 30+ variables) is quite a stretch. I therefore wonder what size of datasets you're targeting.
E.g. http://api.plos.org/search?q=*:*&wt=csv&fl=id,score
id,score
10.1371/journal.pone.0107569/introduction,1.0
10.1371/journal.pone.0107569/results_and_discussion,1.0
10.1371/journal.pone.0107569/materials_and_methods,1.0
10.1371/journal.pone.0107569/supporting_information,1.0
10.1371/journal.pone.0062138,1.0
10.1371/journal.pone.0044030/title,1.0
10.1371/journal.pone.0062138/title,1.0
10.1371/journal.pone.0044030/abstract,1.0
10.1371/journal.pone.0062138/abstract,1.0
10.1371/journal.pone.0044030/references,1.0
I think we just handle an API key passed in the url for now...
url is a function in base R, so change the url parameter to base, for base url, in all functions.
@sckott I think this package is a brilliant idea. Standardizing how we handle solr queries across packages would be a big boost. dataone supports a full set of queries as well, and the dataone package provides a basic interface for this. CC'ing @mbjones in case he wants to take a look at how you're going about this or has any suggestions for you.
I see your query about XML in the README. Though I haven't done much with solr at this time, I'd nonetheless recommend we consider supporting XML as well as JSON. I don't think it makes sense for the package to make this decision for the user: a user who wants solr queries that return XML should be able to have them, yes?
JSON certainly has its advantages, but we have a lot of tools for working with XML that don't really have analogs in JSON: XPath, XPointer, schema, XSLT, etc., that can all be pretty handy.
This is probably not an issue in the code, but some data may be missing.
solr_group(q='*:*', group.field='journal', rows=5, group.limit=1,
group.sort='publication_date desc', fl='journal, publication_date', url=url, key=key)
Error in rbindlist(lapply(datout, function(x) { :
Item 4 has 4 columns, inconsistent with item 1 which has 5 columns
The problem is that I get a similar error with many queries to http://api.plos.org/search
Can you deal with this missing data?
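The rbindlist() error above means the per-group records do not all carry the same fields. One way to tolerate that is to pad absent fields with NA before binding. The sketch below uses base R only; bind_fill() and the sample records are hypothetical, not part of the package. (data.table::rbindlist() also has a fill argument that may serve the same purpose.)

```r
# Hypothetical parsed records; the third one lacks publication_date.
records <- list(
  list(journal = "plos one", publication_date = "2014-01-17T00:00:00Z"),
  list(journal = "plos biology", publication_date = "2014-01-14T00:00:00Z"),
  list(journal = "none")
)

# Bind a list of records into a data frame, filling absent fields with NA.
bind_fill <- function(records) {
  # Union of all field names seen across the records.
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(rec) {
    vals <- lapply(fields, function(f) if (is.null(rec[[f]])) NA else rec[[f]])
    names(vals) <- fields
    as.data.frame(vals, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

out <- bind_fill(records)
print(out)
```

This assumes every field holds a single value per record; multivalued fields would need to be collapsed or stored as list columns first.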
Right now, we have solr set up so that users pass in options to the fl parameter in a vector like c("one", "two"), which gets parsed to fl=one&fl=two in the URL string, but it doesn't always work. In Dryad's Solr endpoint this doesn't work:
http://datadryad.org/solr/search/select?q=Galliard&wt=json&fl=handle&fl=dc.title_sort
The second fl parameter is ignored.
But this works in PLOS's search API.
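A comma-separated field list in a single fl parameter is the form used in the PLOS example URL earlier on this page (fl=id,score), and may be the more portable choice. A minimal sketch of collapsing an R vector that way; collapse_fl() is a hypothetical helper, not a function in the package:

```r
# Collapse a character vector of field names into one fl=... parameter.
# Hypothetical helper for illustration only.
collapse_fl <- function(fl) {
  paste0("fl=", paste(trimws(fl), collapse = ","))
}

print(collapse_fl(c("handle", "dc.title_sort")))
# [1] "fl=handle,dc.title_sort"
```

Whether a given Solr endpoint accepts repeated fl parameters, the comma form, or both would still need to be checked against that endpoint.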
I cannot reproduce it in PLOS.
A normal query with 3 attributes in the field list gives a warning:
> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url)
groupValue numFound start rating touroperator
1 137a9c30-ee49-11df-a13b-0050569335f3 12545 0 3 JI
2 103fe3f0-8f5c-11df-a2df-001c42000009 6702 0 3 JI
3 79760f30-5b14-11e2-bb05-000c297659d3 19983 0 1 CH
4 50e1b6f0-5fe7-11e2-bb05-000c297659d3 39773 0 2 JI
5 fda70780-9b3c-11e0-9153-005056930057 1659 0 4 JI
6 8fdcaba0-bc9c-11e2-a109-000c297659d3 69484 0 2 JI
7 10d5bb50-8f5c-11df-a2df-001c42000009 5235 0 4 JI
8 0e3769c0-8f5c-11df-a2df-001c42000009 51906 0 4 JI
9 108fd8b0-8f5c-11df-a2df-001c42000009 2584 0 3 JI
10 0c880c10-8f5c-11df-a2df-001c42000009 57270 0 2 JI
price
1 12700
2 13500
3 13700
4 14017
5 14700
6 14833
7 15166
8 15208
9 15225
10 15233
Warning message:
In if (names(input) == "response") { :
the condition has length > 1 and only the first element will be used
Same query with raw=TRUE:
> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, raw=TRUE)
[1] "{\"responseHeader\":{\"status\":0,\"QTime\":1472},\"grouped\":{\"accoid\":{\"matches\":34553291,\"groups\":[{\"groupValue\":\"137a9c30-ee49-11df-a13b-0050569335f3\",\"doclist\":{\"numFound\":12545,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":12700}]}},{\"groupValue\":\"103fe3f0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":6702,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":13500}]}},{\"groupValue\":\"79760f30-5b14-11e2-bb05-000c297659d3\",\"doclist\":{\"numFound\":19983,\"start\":0,\"docs\":[{\"rating\":1,\"touroperator\":\"CH\",\"price\":13700}]}},{\"groupValue\":\"50e1b6f0-5fe7-11e2-bb05-000c297659d3\",\"doclist\":{\"numFound\":39773,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":14017}]}},{\"groupValue\":\"fda70780-9b3c-11e0-9153-005056930057\",\"doclist\":{\"numFound\":1659,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":14700}]}},{\"groupValue\":\"8fdcaba0-bc9c-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":69484,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":14833}]}},{\"groupValue\":\"10d5bb50-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":5235,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":15166}]}},{\"groupValue\":\"0e3769c0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":51906,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":15208}]}},{\"groupValue\":\"108fd8b0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":2584,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":15225}]}},{\"groupValue\":\"0c880c10-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":57270,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":15233}]}}]}}}\n"
attr(,"class")
[1] "sr_group"
attr(,"wt")
[1] "json"
When I add qt='distributedSearch', the last 2 attributes are missing from the response:
> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, qt='distributedSearch', raw=FALSE)
groupValue numFound start touroperator
1 accaa2a0-fb51-11e2-a109-000c297659d3 17750 0 CH
2 77f8e0f0-9c42-11e2-a109-000c297659d3 4084 0 JI
3 53432a60-c7df-11e0-aa1b-005056930057 6636 0 JI
4 edefdd00-8f5b-11df-a2df-001c42000009 23974 0 JI
5 137a9c30-ee49-11df-a13b-0050569335f3 12545 0 JI
6 10438d70-8f5c-11df-a2df-001c42000009 13220 0 CH
7 110c34a0-8f5c-11df-a2df-001c42000009 13384 0 CH
8 10427c00-8f5c-11df-a2df-001c42000009 8898 0 JI
9 c69d6fb0-9c41-11e2-a109-000c297659d3 4104 0 JI
10 6f885e80-9336-11e0-9153-005056930057 13065 0 CH
Warning message:
In if (names(input) == "response") { :
the condition has length > 1 and only the first element will be used
In the raw response they are also missing
> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, qt='distributedSearch', raw=TRUE)
[1] "{\"responseHeader\":{\"status\":0,\"QTime\":1774},\"grouped\":{\"accoid\":{\"matches\":141800873,\"groups\":[{\"groupValue\":\"accaa2a0-fb51-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":17750,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"77f8e0f0-9c42-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":4084,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"53432a60-c7df-11e0-aa1b-005056930057\",\"doclist\":{\"numFound\":6636,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"edefdd00-8f5b-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":23974,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"137a9c30-ee49-11df-a13b-0050569335f3\",\"doclist\":{\"numFound\":12545,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"10438d70-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":13220,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"110c34a0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":13384,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"c69d6fb0-9c41-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":4104,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"10427c00-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":8898,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"6f885e80-9336-11e0-9153-005056930057\",\"doclist\":{\"numFound\":13065,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}}]}}}\n"
attr(,"class")
[1] "sr_group"
attr(,"wt")
[1] "json"
I don't know how to get the URL that was sent to Solr, to check whether it was built correctly.
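The warning in both parsed outputs above comes from passing a logical vector of length greater than one to if(): names(input) == "response" yields one TRUE/FALSE per name. R 4.2.0 and later treat this as an error rather than a warning. A minimal sketch of a length-one test; the input list below only mimics the top-level names seen in the raw responses, and the package's actual parsing code is not shown in the report:

```r
# Stand-in for a parsed grouped response; names taken from the raw JSON above.
input <- list(responseHeader = list(status = 0), grouped = list())

# One logical per name, which is what triggers the warning inside if().
print(names(input) == "response")

# Length-one alternative: is "response" among the names at all?
has_response <- "response" %in% names(input)
if (has_response) {
  message("plain search response")
} else {
  message("no 'response' element; top-level names: ",
          paste(names(input), collapse = ", "))
}
```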
Doesn't seem necessary as far as I can tell.
These 2 properties make sense when you do a search with groups and facets.
Then the results in the facets are related to the grouping.
I can remove them from the documentation, but I wanted to ask first.
\item{group.truncate}{(logical) If true, facet counts are
based on the most relevant document of each group
matching the query. Same applies for StatsComponent.
Default is false. Supported from Solr 3.4 and
up.}
\item{group.facet}{(logical) Whether to compute grouped
facets for the field facets specified in facet.field
parameters. Grouped facets are computed based on the
first specified group. Just like normal field faceting,
fields shouldn't be tokenized (otherwise counts are
computed for each token). Grouped faceting supports
single and multivalued fields. Default is false. Supported
from Solr 4.0. WARNING: If this parameter is set to true on a
sharded environment, all the documents that belong to the
same group have to be located in the same shard,
otherwise the count will be incorrect. If you are using
SolrCloud, consider using "custom hashing"}
Docs: http://wiki.apache.org/solr/FieldCollapsing
An example query:
http://api.plos.org/search/?q=ecology&group=true&group.field=journal&group.limit=3&fl=id,score
{
grouped: {
journal: {
matches: 18120,
groups: [
{
groupValue: "plos one",
doclist: {
numFound: 13939,
start: 0,
docs: [
{
id: "10.1371/journal.pone.0059813"
}
]
}
},
{
groupValue: "plos biology",
doclist: {
numFound: 746,
start: 0,
docs: [
{
id: "10.1371/journal.pbio.0020072"
}
]
}
},
{
...cutoff
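To make the shape of that response concrete, here is a rough sketch of flattening the grouped structure into one row per document. The nested list mirrors the two groups shown above (as a JSON parser might return them); flatten_groups() is a hypothetical helper, not the package's implementation:

```r
# The two groups from the example response, written as a nested R list.
journal <- list(
  matches = 18120,
  groups = list(
    list(groupValue = "plos one",
         doclist = list(numFound = 13939, start = 0,
                        docs = list(list(id = "10.1371/journal.pone.0059813")))),
    list(groupValue = "plos biology",
         doclist = list(numFound = 746, start = 0,
                        docs = list(list(id = "10.1371/journal.pbio.0020072"))))
  )
)

# One row per document, with the group's value and counts repeated alongside.
# Assumes every group has at least one doc and docs share the same fields.
flatten_groups <- function(field) {
  rows <- lapply(field$groups, function(g) {
    docs <- do.call(rbind, lapply(g$doclist$docs, as.data.frame,
                                  stringsAsFactors = FALSE))
    data.frame(groupValue = g$groupValue,
               numFound = g$doclist$numFound,
               start = g$doclist$start,
               docs,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

out <- flatten_groups(journal)
print(out)
```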
group.main should be a boolean, following the doc:
\item{group.main}{(logical) If true, the result of the
last field grouping command is used as the main result
list in the response, using group.format=simple}
And it should return the group field in the main result:
solr_group(q='*:*', group.field='journal', rows=5, group.limit=3, group.sort='publication_date desc', group.format='simple', group.main='true', fl='publication_date', url=url, key=key)
numFound start publication_date
1 889099 0 2014-01-17T00:00:00Z
2 889099 0 2014-01-17T00:00:00Z
3 889099 0 2014-01-17T00:00:00Z
4 889099 0 2014-01-16T00:00:00Z
5 889099 0 2014-01-16T00:00:00Z
like all other DB connector clients
This Ruby gem has a nice template we could look at: http://www.rubydoc.info/gems/rsolr/1.0.12
Carl says Dataone has a solr interface. Test from here and make sure it works, give examples, etc.
Grouping, faceting, etc. are all just param options in search, so all could be done from one function. Returning raw data would be easy. However, you don't get search results (i.e., the docs element) back when group=true, but perhaps there is a way to return docs.
response <- solr_search(q='*:*', fl='id', rows=2, url=url, key=key)
response$numFound
NULL
Looks like the solr_search response lost its attributes when the solr_group function was added, and now only returns the docs:
solr_search(q='*:*', fl='id', rows=2, url=url, key=key)
id
1 10.1371/journal.pone.0071557
2 10.1371/journal.pone.0064577/title
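One possible way to return the docs as a data frame while keeping values such as numFound is to attach them as attributes. This is only a sketch of that idea, not how the package does or did it; with_meta() is a hypothetical helper and the numFound value is illustrative:

```r
# The docs from the output above as a plain data frame.
docs <- data.frame(
  id = c("10.1371/journal.pone.0071557", "10.1371/journal.pone.0064577/title"),
  stringsAsFactors = FALSE
)

# Attach response metadata as attributes so the docs stay a data frame.
with_meta <- function(docs, numFound, start) {
  attr(docs, "numFound") <- numFound
  attr(docs, "start") <- start
  docs
}

response <- with_meta(docs, numFound = 889099, start = 0)  # illustrative values

print(attr(response, "numFound"))
print(response$id)
```

With this layout response$numFound is still NULL, because $ looks for a column; the value is read with attr(response, "numFound").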
This should provide a significant speed advantage over XML and JSON, and appears to have been in Solr for many versions now, meaning it should work for most Solr installations, hopefully.
Should write a larger test suite for wt=csv specifically, to make sure it's not failing anywhere and that data output is identical to wt=json and wt=xml.
Also, experiment with replacing read.table() with something else, like data.table::fread() or readr::read_csv() from https://github.com/hadley/readr
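As a small illustration of the wt=csv path, the sketch below parses a few lines of the PLOS CSV output shown earlier on this page with base R. It only demonstrates the parsing step on inline text; fetching the response and swapping in another reader are left out:

```r
# A few rows copied from the wt=csv example output earlier on this page.
csv_text <- "id,score
10.1371/journal.pone.0107569/introduction,1.0
10.1371/journal.pone.0107569/results_and_discussion,1.0
10.1371/journal.pone.0062138,1.0"

# Parse the CSV body into a data frame; ids stay character, score is numeric.
docs <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(docs)
str(docs)
```

A test comparing wt=csv against wt=json could parse both and check the resulting data frames with identical() or all.equal() after aligning column order and types.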
From Kurt Hornik:
These have README.md files which when converted to (X)HTML using a current version of pandoc show problems when validated using W3C Markup Validator, see below.
Most of these problems are caused by using images without giving a name (so the required alt attribute for <img> is not provided), or using <br> instead of <br/>.
Pls fix these problems in your README.md files for your next release: in all cases I inspected, the fixes were obvious and confirmation using pandoc and W3C markup validator seemed unnecessary.
Please also visit your package check web page at http://cran.r-project.org/web/checks/check_results_PACKAGENAME.html to see if other problems need to be addressed as well.
These packages contain README.md files with invalid HTML output created
by pandoc 1.12.4.2 according to W3C-validator.
I attach the HTML errors and warnings found below, and will put copies
of the corresponding HTML files up at
http://www.r-project.org/nosvn/pandoc.
Please investigate the problems and fix as needed.
Afaics, many of the problems are caused by adding "raw" HTML elements in the README.md files and not realizing that the default output format "html" is XHTML 1 (and not HTML 5). E.g., a raw <br> results in an
end tag for "br" omitted, but OMITTAG NO was specified
error.
Best
-k
solr.html:
Valid: FALSE (errors: 1, warnings: 0)
Errors:
line col message
339 98 required attribute "alt" not specified
Already started this, and slightly changed solr_search() in 46debc5
Nice work on this package, looks awesome and really useful.
Minor quibble: in an example like the one you give below:
solr_facet(q = "*:*", facet.field = "journal", facet.query = "cell,bird", url = url)
it feels a bit un-R-like to me that facet.query is "cell,bird" instead of c("cell", "bird"). As an R user I expect a query on two facets to be a length 2 character object in R, not a character string separated by some particular syntax. (Yeah, the c notation is more verbose, but if I'm programmatically assembling my query from an R object I've created some other way, then paste0(facets, collapse=",") is even more verbose...)
Anyway, just a minor thought, probably fine either way.
Right now the parameter wt is available but internally forced to json
The sorting of groups is done using the default sorting
solr_group(q='*:*', group.field='journal', rows=5, group.limit=1, group.sort='publication_date desc', sort='publication_date desc', fl='publication_date', url=url, key=key)
groupValue numFound start publication_date
1 plos one 677297 0 2014-01-17T00:00:00Z
2 plos neglected tropical diseases 19106 0 2014-01-16T00:00:00Z
3 plos genetics 33698 0 2014-01-16T00:00:00Z
4 none 62518 0 2012-10-23T00:00:00Z
5 plos biology 24111 0 2014-01-14T00:00:00Z
Looks like the sort param is not sent to the Solr server; an invalid value gives the same result:
solr_group(q='*:*', group.field='journal', rows=5, group.limit=1,
group.sort='publication_date desc', sort='error', fl='publication_date', url=url, key=key)
groupValue numFound start publication_date
1 plos one 677297 0 2014-01-17T00:00:00Z
2 plos neglected tropical diseases 19106 0 2014-01-16T00:00:00Z
3 plos genetics 33698 0 2014-01-16T00:00:00Z
4 none 62518 0 2012-10-23T00:00:00Z
5 plos biology 24111 0 2014-01-14T00:00:00Z
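One way to check whether a parameter such as sort reaches the server is to build the query string explicitly and print it before sending. The sketch below is a base R illustration under that assumption; build_query_url() is a hypothetical helper and does not reflect how the package assembles its requests:

```r
# Build a query URL from named parameters, dropping any that are NULL.
# Hypothetical helper for illustration only.
build_query_url <- function(base, ...) {
  args <- list(...)
  args <- args[!vapply(args, is.null, logical(1))]
  # Percent-encode each value, including reserved characters and spaces.
  vals <- vapply(args, function(x) {
    utils::URLencode(as.character(x), reserved = TRUE)
  }, character(1))
  paste0(base, "?", paste(paste0(names(args), "=", vals), collapse = "&"))
}

u <- build_query_url("http://api.plos.org/search",
                     group = "true",
                     group.field = "journal",
                     group.sort = "publication_date desc",
                     sort = "publication_date desc",
                     fq = NULL)
print(u)
```

If sort is missing from the printed URL, the parameter is being dropped on the client side; if it is present, the server's handling of sort together with grouping is the next thing to look at.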
This is different from what data to return, XML or JSON, via the wt param