ropensci-archive / solrium

:warning: ARCHIVED :warning: A general purpose R interface to Solr

License: Other

Makefile 0.39% R 99.61%

solr solr-client solr-search rstats database r r-package

solrium's Introduction

This package has been archived. The former README is now in README-not.

solrium's Issues

Error in a$response : $ operator is invalid for atomic vector

Hi,

Thanks for this great work.
I am new to R language, and trying to use this today.

I followed the test case:
test-solr_search.r

and ran: a <- solr_search(q='*:*', rows=2, fl='size', url='mysolrurl', key='size')
and then called a$response, which failed with the following error:
Error in a$response : $ operator is invalid for atomic vector

a['response']
[1] NA
class(a)
[1] "character"

Any idea why this happened and how to fix?
Thanks again :)
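
The `class(a)` output above suggests the call returned a plain character vector rather than a parsed list, and `$` only works on list-like objects. A minimal sketch of a defensive check follows; `safe_field` is a hypothetical helper written for illustration, not a function from the package.

```r
# Hypothetical helper (not part of the package): extract a named element
# only when the result is list-like, and fail with a clearer message otherwise.
safe_field <- function(x, name) {
  if (!is.list(x)) {
    stop("expected a list but got an object of class '", class(x)[1],
         "'; the request may have returned raw text or an error message")
  }
  x[[name]]
}

# A parsed result behaves as expected.
parsed <- list(response = list(numFound = 2))
safe_field(parsed, "response")$numFound

# A character result produces a readable error instead of the `$` failure.
raw_text <- "some unparsed text"
tryCatch(safe_field(raw_text, "response"),
         error = function(e) conditionMessage(e))
```

Printing the character value of `a` itself is usually the quickest way to see what the server actually sent back.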

Make note about how long csv has been around

https://issues.apache.org/jira/browse/SOLR-66

and note which version first had it for solr_search() to warn users that may be using older solr installs

Before 1st version to CRAN

Test against some other endpoints, BISON, Dataone, etc.
More examples exploring all/most parameters for users not familiar with solr
Figure out instructions for local setup and interaction with solr via this pkg
Do we need formal S3 classes for return objects? made some simple stuff
Do internal parsing of arguments within each function consistently; it is not consistent right now. NO NEED, ONLY MULTIPLE ARGS ALLOWED FOR FACETING, SO LEAVING ALONE

Package name

@ropensci/owners Does this package name need to change from solr to something else? Or is it okay

Expected capacity of the solr package is unknown

Using the solr package for medium-size datasets (5000+ observations; 30+ variables) is quite a stretch. I therefore wonder what size of datasets you're targeting.

Add tests for solr_group

Support wt=csv

E.g. http://api.plos.org/search?q=*:*&wt=csv&fl=id,score

id,score
10.1371/journal.pone.0107569/introduction,1.0
10.1371/journal.pone.0107569/results_and_discussion,1.0
10.1371/journal.pone.0107569/materials_and_methods,1.0
10.1371/journal.pone.0107569/supporting_information,1.0
10.1371/journal.pone.0062138,1.0
10.1371/journal.pone.0044030/title,1.0
10.1371/journal.pone.0062138/title,1.0
10.1371/journal.pone.0044030/abstract,1.0
10.1371/journal.pone.0062138/abstract,1.0
10.1371/journal.pone.0044030/references,1.0
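
A CSV body like the one above can be read straight into a data frame with base R. This is only a sketch under the assumption that the response text is already available as a single string; `parse_csv_response` is a hypothetical name, and the sample text is a shortened copy of the listing above.

```r
# Hypothetical helper: turn a wt=csv response body into a data frame.
parse_csv_response <- function(txt) {
  read.csv(text = txt, stringsAsFactors = FALSE)
}

# Shortened copy of the response shown above.
csv_text <- "id,score
10.1371/journal.pone.0107569/introduction,1.0
10.1371/journal.pone.0062138,1.0"

res <- parse_csv_response(csv_text)
str(res)
```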

Handle various authentications schemes

I think we just handle an API key passed in the url for now...

Change url param out in all fxns to base

url is a function in base R, so change the url parameter to base, for base url in all functions.

Remove assertthat dep

XML, dataone, etc

@sckott I think this package is a brilliant idea. Standardizing how we handle solr queries across packages would be a big boost. dataone supports a full set of queries as well, and the dataone package provides a basic interface for this. CC'ing @mbjones in case he wants to take a look at how you're going about this or has any suggestions for you.

I see your query about XML in the README. Though I haven't done much with solr at this time, I'd nonetheless recommend we consider supporting XML as well as JSON. I don't think it makes sense for the package to make this decision for the user: a user who wants solr queries that return XML should be able to have them, yes?

JSON certainly has its advantages, but we have a lot of tools for working with XML that don't really have analogs in JSON: XPath, XPointer, schema, XSLT, etc. that can all be pretty handy.

Some queries return errors I don't understand

Probably this is not an issue in the code but some data maybe missing.

solr_group(q='*:*', group.field='journal', rows=5, group.limit=1,
group.sort='publication_date desc', fl='journal, publication_date', url=url, key=key)

Error in rbindlist(lapply(datout, function(x) { :
Item 4 has 4 columns, inconsistent with item 1 which has 5 columns

The problem is that I get a similar error with many queries to http://api.plos.org/search

Can you deal with this missing data?
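
The error indicates that some documents lack a field that others have, so the per-document rows differ in width. One way to cope, sketched here in base R, is to pad every row with NA for the fields it is missing before binding. `bind_fill` is a hypothetical helper and the two sample documents are made up for illustration; `data.table::rbindlist()` also has a `fill = TRUE` argument that may serve the same purpose.

```r
# Hypothetical helper: bind a list of named lists into one data frame,
# filling fields that a document does not have with NA.
bind_fill <- function(rows) {
  all_names <- unique(unlist(lapply(rows, names)))
  padded <- lapply(rows, function(r) {
    missing <- setdiff(all_names, names(r))
    r[missing] <- NA                      # add absent fields as NA
    as.data.frame(r[all_names], stringsAsFactors = FALSE)
  })
  do.call(rbind, padded)
}

# Made-up documents: the second one has no 'journal' field.
docs <- list(
  list(journal = "PLoS ONE", publication_date = "2014-01-17T00:00:00Z"),
  list(publication_date = "2012-10-23T00:00:00Z")
)

out <- bind_fill(docs)
out
```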

Multiple fl parameters not accepted with some versions of Solr

Right now, we have solr set up so that users pass in options to the fl parameter in a vector like c("one", "two"), which gets parsed to fl=one&fl=two in the URL string, but it doesn't always work. In Dryad's Solr endpoint this doesn't work

http://datadryad.org/solr/search/select?q=Galliard&wt=json&fl=handle&fl=dc.title_sort

the second fl parameter is ignored.

But this works in PLOS's search API

http://api.plos.org/search?q=*:*&wt=json&fl=id&fl=journal
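
A possible workaround is to collapse the vector into a single comma-separated fl value before building the URL. The sketch below assumes this form is accepted by both endpoints, which has not been verified here; `collapse_fl` and `build_query` are hypothetical helpers, and no URL encoding is applied.

```r
# Hypothetical helper: collapse a character vector of field names
# into one comma-separated fl value.
collapse_fl <- function(fl) paste(trimws(fl), collapse = ",")

# Hypothetical helper: join a named list of parameters into a query string.
# Values are used as-is; real code would need to URL-encode them.
build_query <- function(base, params) {
  pairs <- paste0(names(params), "=",
                  vapply(params, as.character, character(1)))
  paste0(base, "?", paste(pairs, collapse = "&"))
}

params <- list(q = "Galliard", wt = "json",
               fl = collapse_fl(c("handle", "dc.title_sort")))
query_url <- build_query("http://datadryad.org/solr/search/select", params)
query_url
```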

Add ping function

Strange behaviour with param qt for solr_group

I can not reproduce it in PLOS.

Normal query with 3 attributes in field list. I get a warning

> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url)
                             groupValue numFound start rating touroperator
1  137a9c30-ee49-11df-a13b-0050569335f3    12545     0      3           JI
2  103fe3f0-8f5c-11df-a2df-001c42000009     6702     0      3           JI
3  79760f30-5b14-11e2-bb05-000c297659d3    19983     0      1           CH
4  50e1b6f0-5fe7-11e2-bb05-000c297659d3    39773     0      2           JI
5  fda70780-9b3c-11e0-9153-005056930057     1659     0      4           JI
6  8fdcaba0-bc9c-11e2-a109-000c297659d3    69484     0      2           JI
7  10d5bb50-8f5c-11df-a2df-001c42000009     5235     0      4           JI
8  0e3769c0-8f5c-11df-a2df-001c42000009    51906     0      4           JI
9  108fd8b0-8f5c-11df-a2df-001c42000009     2584     0      3           JI
10 0c880c10-8f5c-11df-a2df-001c42000009    57270     0      2           JI
   price
1  12700
2  13500
3  13700
4  14017
5  14700
6  14833
7  15166
8  15208
9  15225
10 15233
Warning message:
In if (names(input) == "response") { :
  the condition has length > 1 and only the first element will be used

Same query with raw true

> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, raw=TRUE)
[1] "{\"responseHeader\":{\"status\":0,\"QTime\":1472},\"grouped\":{\"accoid\":{\"matches\":34553291,\"groups\":[{\"groupValue\":\"137a9c30-ee49-11df-a13b-0050569335f3\",\"doclist\":{\"numFound\":12545,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":12700}]}},{\"groupValue\":\"103fe3f0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":6702,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":13500}]}},{\"groupValue\":\"79760f30-5b14-11e2-bb05-000c297659d3\",\"doclist\":{\"numFound\":19983,\"start\":0,\"docs\":[{\"rating\":1,\"touroperator\":\"CH\",\"price\":13700}]}},{\"groupValue\":\"50e1b6f0-5fe7-11e2-bb05-000c297659d3\",\"doclist\":{\"numFound\":39773,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":14017}]}},{\"groupValue\":\"fda70780-9b3c-11e0-9153-005056930057\",\"doclist\":{\"numFound\":1659,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":14700}]}},{\"groupValue\":\"8fdcaba0-bc9c-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":69484,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":14833}]}},{\"groupValue\":\"10d5bb50-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":5235,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":15166}]}},{\"groupValue\":\"0e3769c0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":51906,\"start\":0,\"docs\":[{\"rating\":4,\"touroperator\":\"JI\",\"price\":15208}]}},{\"groupValue\":\"108fd8b0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":2584,\"start\":0,\"docs\":[{\"rating\":3,\"touroperator\":\"JI\",\"price\":15225}]}},{\"groupValue\":\"0c880c10-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":57270,\"start\":0,\"docs\":[{\"rating\":2,\"touroperator\":\"JI\",\"price\":15233}]}}]}}}\n"
attr(,"class")
[1] "sr_group"
attr(,"wt")
[1] "json"

When I add qt='distributedSearch', the last 2 attributes are missing from the response

> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, qt='distributedSearch', raw=FALSE)
                             groupValue numFound start touroperator
1  accaa2a0-fb51-11e2-a109-000c297659d3    17750     0           CH
2  77f8e0f0-9c42-11e2-a109-000c297659d3     4084     0           JI
3  53432a60-c7df-11e0-aa1b-005056930057     6636     0           JI
4  edefdd00-8f5b-11df-a2df-001c42000009    23974     0           JI
5  137a9c30-ee49-11df-a13b-0050569335f3    12545     0           JI
6  10438d70-8f5c-11df-a2df-001c42000009    13220     0           CH
7  110c34a0-8f5c-11df-a2df-001c42000009    13384     0           CH
8  10427c00-8f5c-11df-a2df-001c42000009     8898     0           JI
9  c69d6fb0-9c41-11e2-a109-000c297659d3     4104     0           JI
10 6f885e80-9336-11e0-9153-005056930057    13065     0           CH
Warning message:
In if (names(input) == "response") { :
  the condition has length > 1 and only the first element will be used

In the raw response they are also missing

> solr_group(q='*:*', group.field='accoid', group.limit=1, group.sort='price asc', sort='price asc', fl="touroperator, rating, price", fq="transport:VL", url = url, qt='distributedSearch', raw=TRUE) 
[1] "{\"responseHeader\":{\"status\":0,\"QTime\":1774},\"grouped\":{\"accoid\":{\"matches\":141800873,\"groups\":[{\"groupValue\":\"accaa2a0-fb51-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":17750,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"77f8e0f0-9c42-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":4084,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"53432a60-c7df-11e0-aa1b-005056930057\",\"doclist\":{\"numFound\":6636,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"edefdd00-8f5b-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":23974,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"137a9c30-ee49-11df-a13b-0050569335f3\",\"doclist\":{\"numFound\":12545,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"10438d70-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":13220,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"110c34a0-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":13384,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}},{\"groupValue\":\"c69d6fb0-9c41-11e2-a109-000c297659d3\",\"doclist\":{\"numFound\":4104,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"10427c00-8f5c-11df-a2df-001c42000009\",\"doclist\":{\"numFound\":8898,\"start\":0,\"docs\":[{\"touroperator\":\"JI\"}]}},{\"groupValue\":\"6f885e80-9336-11e0-9153-005056930057\",\"doclist\":{\"numFound\":13065,\"start\":0,\"docs\":[{\"touroperator\":\"CH\"}]}}]}}}\n"
attr(,"class")
[1] "sr_group"
attr(,"wt")
[1] "json"

I don't know how to get the URL that is sent to Solr, to check whether it was built correctly.
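
The warning in both outputs comes from comparing a vector of names against a single string inside `if()`. A grouped response has at least two top-level names (responseHeader and grouped), so the comparison yields more than one logical value; recent R versions (4.2.0 and later) treat this as an error rather than a warning. A minimal sketch of a length-safe check follows, where `has_element` is a hypothetical helper and `input` is a stand-in for the parsed response.

```r
# Stand-in for a parsed grouped response: two top-level elements.
input <- list(responseHeader = list(status = 0), grouped = list())

# names(input) == "response" has length 2 here, which is what if() complains about.
length(names(input) == "response")

# Hypothetical helper: %in% always returns a single TRUE/FALSE for a single name.
has_element <- function(x, name) name %in% names(x)

if (has_element(input, "response")) {
  message("regular search result")
} else if (has_element(input, "grouped")) {
  message("grouped result")
}
```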

remove R version dep in description file

Doesn't seem necessary as far as I can tell

Start functions for writing

create document
edit document
delete document

etc.

e.g. see https://wiki.apache.org/solr/UpdateJSON

group.truncate and group.facet do not make sense for solr_group

These 2 properties make sense when you do a search with groups and facets.
Then the results in the facets are related to the grouping.

I can remove it from the documentation, but I wanted to ask first.

 \item{group.truncate}{(logical) If true, facet counts are
  based on the most relevant document of each group
  matching the query. Same applies for StatsComponent.
  Default is false. <!> Solr3.4 Supported from Solr 3.4 and
  up.}

  \item{group.facet}{(logical) Whether to compute grouped
  facets for the field facets specified in facet.field
  parameters. Grouped facets are computed based on the
  first specified group. Just like normal field faceting,
  fields shouldn't be tokenized (otherwise counts are
  computed for each token). Grouped faceting supports
  single and multivalued fields. Default is false. <!>
  Solr4.0 WARNING: If this parameter is set to true on a
  sharded environment, all the documents that belong to the
  same group have to be located in the same shard,
  otherwise the count will be incorrect. If you are using
  SolrCloud, consider using "custom hashing"}

Add result grouping/field collapsing

Docs: http://wiki.apache.org/solr/FieldCollapsing

An example query:

http://api.plos.org/search/?q=ecology&group=true&group.field=journal&group.limit=3&fl=id,score

{
  grouped: {
    journal: {
      matches: 18120,
      groups: [
        {
          groupValue: "plos one",
          doclist: {
            numFound: 13939,
            start: 0,
            docs: [
              {
                id: "10.1371/journal.pone.0059813"
              }
            ]
          }
        },
        {
          groupValue: "plos biology",
          doclist: {
            numFound: 746,
            start: 0,
            docs: [
              {
                id: "10.1371/journal.pbio.0020072"
              }
            ]
          }
        },
        {

...cutoff
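
To illustrate how a grouped response of this shape might be flattened, the sketch below rebuilds the two complete groups shown above as a nested R list and turns them into one row per group. `flatten_groups` is a hypothetical helper, and a real implementation would also need to carry the fields of each doc.

```r
# The two complete groups from the response above, as a nested list.
grouped <- list(journal = list(
  matches = 18120,
  groups = list(
    list(groupValue = "plos one",
         doclist = list(numFound = 13939, start = 0,
                        docs = list(list(id = "10.1371/journal.pone.0059813")))),
    list(groupValue = "plos biology",
         doclist = list(numFound = 746, start = 0,
                        docs = list(list(id = "10.1371/journal.pbio.0020072"))))
  )
))

# Hypothetical helper: one data frame row per group.
flatten_groups <- function(field) {
  rows <- lapply(field$groups, function(g) {
    data.frame(groupValue = g$groupValue,
               numFound = g$doclist$numFound,
               start = g$doclist$start,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

out <- flatten_groups(grouped$journal)
out
```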

Add stats function

See here http://wiki.apache.org/solr/StatsComponent

Have tests run on travis

solr_group group.main

group.main should be a boolean following the doc

  \item{group.main}{(logical) If true, the result of the
  last field grouping command is used as the main result
  list in the response, using group.format=simple}

And should return the group field in the main result

solr_group(q='*:*', group.field='journal', rows=5, group.limit=3, group.sort='publication_date desc', group.format='simple', group.main='true', fl='publication_date', url=url, key=key)
  numFound start     publication_date
1   889099     0 2014-01-17T00:00:00Z
2   889099     0 2014-01-17T00:00:00Z
3   889099     0 2014-01-17T00:00:00Z
4   889099     0 2014-01-16T00:00:00Z
5   889099     0 2014-01-16T00:00:00Z
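
If group.main is to be accepted as an R logical, it still has to be sent to Solr as the string true or false. A minimal sketch of that conversion is below; `solr_bool` is a hypothetical helper, not an existing function in the package.

```r
# Hypothetical helper: convert an R logical into the lowercase string
# that Solr expects, leaving unset (NULL) parameters untouched.
solr_bool <- function(x) {
  if (is.null(x)) return(NULL)
  if (is.logical(x) && length(x) == 1 && !is.na(x)) {
    return(if (x) "true" else "false")
  }
  stop("expected a single TRUE or FALSE")
}

solr_bool(TRUE)
solr_bool(FALSE)
```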

Should probably add connection fxn and object

like all other DB connector clients

This Ruby gem has a nice template we could look at: http://www.rubydoc.info/gems/rsolr/1.0.12

provide function to write template files for a solr db

use compact internal version of fxn

Test Dataone from solr

Carl says Dataone has a solr interface. Test from here and make sure it works, give examples, etc.

Pull out GET helper fxn to use across solr fxns

Try to unify group, facet, any other functionality into solr_search, or a new fxn

grouping, faceting, etc. are all just param options in search, so all could be done from one function. Returning raw data would be easy. However:

Dealing with parsing the complex result might be tricky, though perhaps we could simply use the mlt, group, facet, etc. parsers for each component returned.
As far as I know, you can't get regular search results (i.e., the docs element) back when group=true, but perhaps there is a way to return docs

Regression in solr_search

response <- solr_search(q='*:*', fl='id', rows=2, url=url, key=key)
response$numFound
NULL

Looks like while adding the solr_group function the solr_search response has lost the attributes

numFound
start

and only returns the docs

solr_search(q='*:*', fl='id', rows=2, url=url, key=key)
id
1 10.1371/journal.pone.0071557
2 10.1371/journal.pone.0064577/title
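
One way to return the docs as a data frame without losing numFound and start is to attach them as attributes. This is a sketch only, not a description of how the package does or did it; `with_meta` is a hypothetical helper and the numFound value is a placeholder.

```r
# The two docs from the output above.
docs <- data.frame(
  id = c("10.1371/journal.pone.0071557", "10.1371/journal.pone.0064577/title"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep response metadata as attributes on the data frame.
with_meta <- function(docs, numFound, start) {
  attr(docs, "numFound") <- numFound
  attr(docs, "start") <- start
  docs
}

res <- with_meta(docs, numFound = 100, start = 0)  # 100 is a placeholder value

attr(res, "numFound")  # metadata is read with attr()
res$numFound           # still NULL, because numFound is not a column
```

With this design, code that relied on `response$numFound` would have to switch to `attr(response, "numFound")`.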

Datacite testing

Pretty sure datacite API is using Solr

http://search.datacite.org/help.html

Test out

set wt=csv as default

This should provide significant speed advantage over xml and json, and appears to be in Solr for many versions now, meaning it should work for most Solr installations, hopefully.

Should write larger test suite for wt=csv specifically to make sure it's not failing anywhere, and data output is identical to wt=json and wt=xml

also, experiment with replacing read.table() with something else, like data.table::fread(), readr::read_csv() from https://github.com/hadley/readr

readme xhtml

From kurt hornik

These have README.md files which when converted to (X)HTML using a
current version of pandoc show problems when validated using W3C Markup
Validator, see below.

Most of these problems are caused by using images without giving a name
(so the required alt attribute for <img> is not provided), or using <br>
instead of <br/>.

Pls fix these problems in your README.md files for your next release: in
all cases I inspected, the fixes were obvious and confirmation using
pandoc and W3C markup validator seemed unnecessary.

Please also visit your package check web page at http://cran.r-project.org/web/checks/check_results_PACKAGENAME.html to see if other problems need to be addressed as well.

Update fxn for inserting from R objects

Test using two other history APIs

HathiTrust - may expose solr endpoints
Internet Archive - may expose solr endpoints

Skip on cran

Add FunctionQuery examples to solr_search

Docs: http://wiki.apache.org/solr/FunctionQuery#query

readme fixes

These packages contain README.md files with invalid HTML output created
by pandoc 1.12.4.2 according to W3C-validator.

I attach the HTML errors and warnings found below, and will put copies
of the corresponding HTML files up at
http://www.r-project.org/nosvn/pandoc.

Please investigate the problems and fix as needed.

Afaics, many of the problems are caused by adding "raw" HTML elements in
the README.md files and not realizing that the default output format
"html" is XHTML 1 (and not HTML 5). E.g., a raw <br>
results in an

end tag for "br" omitted, but OMITTAG NO was specified

error.

Best
-k

solr.html:
  Valid: FALSE (errors: 1, warnings: 0)
  Errors:
    line  col  message
     339   98  required attribute "alt" not specified

Test using the Europeana API

Already started this, and changed slightly solr_search() in 46debc5

change vignette setup to install from cran

class assignment and attr -> structure

giving multiple facets as character vectors instead of comma-separated strings?

Nice work on this package, looks awesome and really useful.

Minor quibble: in an example like the one you give below:

solr_facet(q = "*:*", facet.field = "journal", facet.query = "cell,bird", url = url)

it feels a bit un-R-like to me that facet.query is "cell,bird" instead of c("cell", "bird"). As an R user I expect a query on two facets to be a length 2 character object in R, not a character string separated by some particular syntax. (Yeah, the c notation is more verbose, but if I'm programmatically assembling my query from an R object I've created some other way, then paste0(facets, collapse=",") is even more verbose...)

Anyway, just a minor thought, probably fine either way.
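
Both styles could be supported by normalising the input before the URL is built. The sketch below accepts either a character vector or a comma-separated string and emits one facet.query parameter per value; `as_params` is a hypothetical helper. It assumes that no individual query contains a comma, which would not hold for some function queries.

```r
# Hypothetical helper: accept c("cell", "bird") or "cell,bird" and
# produce one name=value pair per facet query.
as_params <- function(name, values) {
  values <- unlist(strsplit(values, ",", fixed = TRUE))
  paste0(name, "=", trimws(values), collapse = "&")
}

as_params("facet.query", c("cell", "bird"))
as_params("facet.query", "cell,bird")
```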

solr_group(q='*:*', group.field='journal', rows=5, group.limit=1, group.sort='publication_date desc', sort='publication_date desc', fl='publication_date', url=url, key=key)

                        groupValue numFound start     publication_date
  1                         plos one   677297     0 2014-01-17T00:00:00Z
  2 plos neglected tropical diseases    19106     0 2014-01-16T00:00:00Z
  3                    plos genetics    33698     0 2014-01-16T00:00:00Z
  4                             none    62518     0 2012-10-23T00:00:00Z
  5                     plos biology    24111     0 2014-01-14T00:00:00Z

Looks like the param sort is not sent to Solr server

solr_group(q='*:*', group.field='journal', rows=5, group.limit=1, 
group.sort='publication_date desc', sort='error', fl='publication_date', url=url, key=key)

                        groupValue numFound start     publication_date
  1                         plos one   677297     0 2014-01-17T00:00:00Z
  2 plos neglected tropical diseases    19106     0 2014-01-16T00:00:00Z
  3                    plos genetics    33698     0 2014-01-16T00:00:00Z
  4                             none    62518     0 2012-10-23T00:00:00Z
  5                     plos biology    24111     0 2014-01-14T00:00:00Z