jedahan / collections-api

API scraping from the metmuseum website

Home Page: http://scrAPI.org

Shell 6.79% HTML 6.09% JavaScript 83.80% Dockerfile 3.33%

collections-api's Introduction

⚠️ the museum api has changed a bunch, this needs a rewrite

scrapi, a metropolitan museum collections api

scrAPI.org is an api that grabs object information from the metropolitan museum's collections website.

Get a random object (`/random`)

Try curl scrapi.org/random in a terminal, or just click on /random

$ curl 'scrapi.org/random'
{
  "CRDID": 12351,
  "accessionNumber": "65.211.3",
  ...
}

Object information (`/object/:id`)

Try curl scrapi.org/object/123 in a terminal, or just click on /object/123

$ curl 'scrapi.org/object/123'
{
  "CRDID": 123,
  "accessionNumber": "64.291.2",
  ...
}

Searching for object ids (`/search/:terms`)

You can search for terms and get back an array of hrefs to object pages.

$ curl 'scrapi.org/search/mirror'
{
  "collection": {
    "items": [
      {
        "href": "http://scrapi.org/object/156225"
      },
      {
        "href": "http://scrapi.org/object/207785"
      },
      ...
    ]
  }
}
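
As a rough illustration of consuming this response, the sketch below pulls the numeric object ids out of the hrefs. The function name `extractObjectIds` is hypothetical and not part of the API, and it assumes every href ends in `/object/<id>` as in the sample above.

```javascript
// Hypothetical helper: collect object ids from a /search response body.
// Assumes the shape shown above: { collection: { items: [{ href }, ...] } }.
function extractObjectIds(response) {
  const items = (response.collection && response.collection.items) || [];
  return items
    .map((item) => /\/object\/(\d+)$/.exec(item.href || ""))
    .filter((match) => match !== null) // skip hrefs that do not look like object links
    .map((match) => Number(match[1]));
}

// Sample data copied from the response above.
const sample = {
  collection: {
    items: [
      { href: "http://scrapi.org/object/156225" },
      { href: "http://scrapi.org/object/207785" },
    ],
  },
};

console.log(extractObjectIds(sample));
```

Each id can then be passed to `/object/:id` to fetch the full record.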

Additional parameters for search:

&page=X for additional pages

&gallerynos=X for only objects in that gallery
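
A small sketch of how a search URL with these parameters might be assembled. It assumes the parameters are ordinary query-string parameters (the first one following a `?`), which the README does not state explicitly. The helper name `buildSearchUrl` and the gallery number used below are made up for illustration.

```javascript
// Hypothetical helper: build a /search URL with the optional parameters.
// Assumption: page and gallerynos are plain query-string parameters.
function buildSearchUrl(terms, options = {}, base = "http://scrapi.org") {
  const url = new URL("/search/" + encodeURIComponent(terms), base);
  if (options.page !== undefined) {
    url.searchParams.set("page", String(options.page));
  }
  if (options.gallerynos !== undefined) {
    url.searchParams.set("gallerynos", String(options.gallerynos));
  }
  return url.toString();
}

// 131 is an arbitrary example gallery number, not taken from the README.
console.log(buildSearchUrl("mirror", { page: 2, gallerynos: 131 }));
```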

Filtering with the `fields` parameter

If you want to filter any response, use the fields parameter, like so:

$ curl 'scrapi.org/object/123?fields=title,whoList/who/name'
{
  "whoList": {
    "who": {
      "name": "Richard Wittingham"
    }
  },
  "title": "Andiron"
}

The syntax for selecting fields is loosely based on XPath:

a,b,c comma-separated list will select multiple fields
a/b/c path will select a field from its parent
a(b,c) sub-selection will select many fields from a parent
a/*/c the star * wildcard will select all items in a field
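
To make the first two rules concrete, here is a minimal sketch of how comma-separated lists and slash paths could be applied to an object. This is not the server's actual implementation: `selectFields` and its helpers are hypothetical names, and sub-selections `a(b,c)`, the `*` wildcard, and arrays are deliberately left out.

```javascript
// Walk obj along keys; return undefined if any step is missing.
function getPath(obj, keys) {
  let current = obj;
  for (const key of keys) {
    if (current === null || typeof current !== "object" || !(key in current)) {
      return undefined;
    }
    current = current[key];
  }
  return current;
}

// Store value in target under the nested keys, creating parents as needed.
function setPath(target, keys, value) {
  let current = target;
  for (const key of keys.slice(0, -1)) {
    if (current[key] === undefined) current[key] = {};
    current = current[key];
  }
  current[keys[keys.length - 1]] = value;
}

// Supports only "a,b,c" lists and "a/b/c" paths.
function selectFields(obj, fields) {
  const result = {};
  for (const path of fields.split(",")) {
    const keys = path.trim().split("/");
    const value = getPath(obj, keys);
    if (value !== undefined) setPath(result, keys, value);
  }
  return result;
}

// The "role" and "medium" values are invented to show that unselected fields are dropped.
const object123 = {
  title: "Andiron",
  medium: "Iron",
  whoList: { who: { name: "Richard Wittingham", role: "Maker" } },
};

console.log(JSON.stringify(selectFields(object123, "title,whoList/who/name"), null, 2));
```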

I like the following fields for basic object information: fields=title,primaryArtistNameOnly,primaryImageUrl,medium,whatList/what/name,whenList/when/name,whereList/where/name,whoList/who/name

Guidelines

The code is CC0, but if you do anything interesting with the data, it would be nice to give attribution to The Metropolitan Museum of Art. If you do anything interesting with the code, it would be nice to give attribution to the contributors, or even better, become one!

Please submit all questions, bugs and feature requests to the issue page.

Dedicated to the memory of Aaron Swartz.

Installation and Deployment

The API requires node.js, uses redis for caching, and is built on the koa web framework.

If you already have nodejs installed:

which yarn || npm install -g yarn
yarn
yarn start
open http://127.0.0.1:8080 || xdg-open http://127.0.0.1:8080

If you don't want to set up node, yarn, and redis on your local machine, I published a docker image:

which docker || { sudo apt-get install -y docker.io || brew install --cask docker; }
docker pull jedahan/collections-api
docker run -d -p 8080:8080 --name collections-api jedahan/collections-api
open http://127.0.0.1:8080 || xdg-open http://127.0.0.1:8080
curl localhost:8080/random

You can build the docker image yourself if you want:

which docker || { sudo apt-get install -y docker.io || brew install --cask docker; }
docker build -t jedahan/collections-api:latest .
docker run -d -p 8080:8080 --name collections-api jedahan/collections-api
open http://127.0.0.1:8080 || xdg-open http://127.0.0.1:8080
curl localhost:8080/random

collections-api's People

yellowelephant bdkauff pdanshov philbritton lwalley panman metmuseum-medialab imclab josegonzalez ragusamj akoel nunb

collections-api's Issues

add face detect info

for smiles project

Determine media types

For object, ids.

Request for functionality to return featured image

It would be great to be able to request only featured items.

Make swagger ui domain independent

random only returns from the first page of results

such low id numbers!

add ids

with hypermedia paging

investigate switching to koa

date of publication of object

separate from the date of creation of the image of the object (for cosmo wenmen)

/random content-type wrong

is text/plain, should be application/json

Add Rights and Reproduction

Rights and Reproduction is on some pages, like http://www.metmuseum.org/Collections/search-the-collections/499717

Create a dump of all the data

To make it easier so people don't have to get all the code up and running

/random not showing x-response-time-metmusuem header

add related exhibitions and events

wrong image sometimes grabbed

http://www.metmuseum.org/collections/search-the-collections/80262?img=0 << see that is the label image

return array of images if there is an array...

Reserved word "yield"

trying to run
coffee -c server.coffee

gets:
/Users/undeed/.Trash/collections-api/server.coffee:18:34: error: reserved word "yield"

offending line is
cache = ratelimit = -> (next) -> yield next

Passing both a query and images parameter breaks /ids

not sure why

merge node.io into master

next and previous links don't work in /search

implement /random

host headers wrong for /random

0.0.0.0:5000 instead of scrapi.org

check req.headers.host

req.url for /random is actually a self url! need to change it for fixing /random

http://0.0.0.0:5000/object/447008 instead of http://scrapi.org/object/464341

implement /search

create /random endpoint to grab a random object

Trying to make the api more interesting than just a straight dump of what you get from the website. First new endpoint is simple, just grabs a random piece of art in the collection.

Xml output

Does anyone want this?

EXPIRE cache for /ids

set the /ids cache to expire after a week; right now entries are cached forever

if /search has no objects, it returns status code 500 instead of 404

Make deployment easier

We have a deliver config
Working on a Vagrantfile
Would like to try SmartOS from joyent, as it gives DTrace, which is useful for figuring out why shit gets slow sometimes

CSV output

provenance data split wrong

Only split on ; if it is not in ()
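
One possible way to do this is to track parenthesis depth while scanning the string, and only split when the depth is zero. The function name `splitProvenance` and the sample string are invented for illustration; this is a sketch of the idea, not the fix that was applied in the repository.

```javascript
// Split on ";" only when not inside parentheses.
function splitProvenance(text) {
  const parts = [];
  let depth = 0;
  let current = "";

  // Push the accumulated segment if it is non-empty, then reset it.
  const flush = () => {
    const trimmed = current.trim();
    if (trimmed !== "") parts.push(trimmed);
    current = "";
  };

  for (const ch of text) {
    if (ch === "(") depth++;
    else if (ch === ")" && depth > 0) depth--;

    if (ch === ";" && depth === 0) {
      flush();
    } else {
      current += ch;
    }
  }
  flush();
  return parts;
}

// Made-up provenance string with a semicolon inside parentheses.
console.log(splitProvenance("Dealer A (Paris; London); Collector B; Museum"));
```

A regular expression with lookahead could do the same for non-nested parentheses, but the explicit counter also copes with nesting.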

404 on empty id pages

evaluate HAL / collections+JSON

cache returns objects that should 404

first hit shows 404, but still saves an empty object in the cache >_>

ssh scrapi.org
redis-cli
DEL /object/700
exit
exit
http -h scrapi.org/object/700
http -h scrapi.org/object/700

replace request with restify.client

Setup production server

So people can just hit the api, and it will be fast because of the redis cache

Currently installed on http://collections-api.herokuapp.com but it crashes a bunch and is slow
Trying to migrate from heroku to linode using deliver
I have a linode instance, and am looking at joyent's smartmachines

Redirect / -> /index.html

random endpoint is yielding an Internal Server Error

Use TMS to check permissions data

image missing for some objects

eg:
http://www.metmuseum.org/Collections/search-the-collections/1981
has images on website
http://scrapi.org:80/object/1981
No image attribute (but does have related-images).

But:
http://www.metmuseum.org/Collections/search-the-collections/1814
has images on website
http://scrapi.org/object/1814
Has image attribute (and also related-images)

or
http://www.metmuseum.org/Collections/search-the-collections/5403
Image, no related images
http://scrapi.org/object/5403
no image

use more node.io builtins

in line 38, try using the node.io builtins
@filter($(k)).text().trim()).ifNull()

self href duplicating information

"href": "http://0.0.0.0:5000http://0.0.0.0:5000/object/452102"

Setup vagrant

Vagrant would make setup a bit easier both locally and remotely.

test using zombie

zombie seems closer to the metal and more robust as it uses webkit

add stats

maybe with https://github.com/koajs/statsd

problem in json's structure

When crawling /random, I see fields such as "timelineList" that contain only one "timeline" item. In that case the list is serialized as a single object, without the enclosing "[" and "]" of a JSON array. That form only works for a single item, so it is unclear what the structure looks like when the response contains more than one "timeline" item in the list. Please help.
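
Until the response shape is made consistent, a client could work around this by normalizing such fields to arrays. The helper below is a hypothetical client-side sketch; the nesting `timelineList.timeline` and the sample values are assumed from the field names mentioned in this issue, not confirmed against a real response.

```javascript
// Hypothetical client-side helper: always return an array,
// whether the API sent a single object, an array, or nothing.
function asArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Assumed response shapes, for illustration only.
const single = { timelineList: { timeline: { name: "A" } } };
const many = { timelineList: { timeline: [{ name: "A" }, { name: "B" }] } };

for (const response of [single, many]) {
  const timelines = asArray(response.timelineList && response.timelineList.timeline);
  console.log(timelines.length);
}
```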

add human readable titles to rel links

"_links": {
  "self": {
    "href": "https://api-sandbox.foxycart.com/users/2/stores",
    "title": "This Collection"
  },
  "first": {
    "href": "https://api-sandbox.foxycart.com/users/2/stores?offset=0",
    "title": "First Page of this Collection"
  },
  "prev": {
    "href": "https://api-sandbox.foxycart.com/users/2/stores?offset=0",
    "title": "Previous Page of this Collection"
  },
  "next": {
    "href": "https://api-sandbox.foxycart.com/users/2/stores?offset=0",
    "title": "Next Page of this Collection"
  },
  "last": {
    "href": "https://api-sandbox.foxycart.com/users/2/stores?offset=0",
    "title": "Last Page of this Collection"
  }