

Project Transferred

This project was transferred to WordPress:

  • WordPress/openverse-api: The Openverse API allows programmatic access to search for CC-licensed and public domain digital media.

For additional context see:


Creative Commons Catalog API


Purpose

The Creative Commons Catalog API ('cccatalog-api') is a system that allows programmatic access to public domain digital media. It is our ambition to index and catalog billions of Creative Commons works, including articles, songs, videos, photographs, paintings, and more. Using this API, developers will be able to access the digital commons in their own applications.

This repository is primarily concerned with back end infrastructure like datastores, servers, and APIs. The pipeline that feeds data into this system can be found in the cccatalog repository. A front end web application that interfaces with the API can be found at the cccatalog-frontend repository.


API Documentation

In the API documentation, you can find more details about the endpoints, with examples of how to use them.


How to Run the Server Locally

Prerequisites

You need to install Docker (with Docker Compose), Git, and PostgreSQL client tools. On Debian, the package is called postgresql-client-common.


How to Do It

  1. Start the Docker daemon

  2. Open your terminal (a command prompt on Windows)

  3. Clone the CC Catalog API repository

git clone https://github.com/creativecommons/cccatalog-api.git

  4. Change directory to the CC Catalog API

cd cccatalog-api

  5. Start the CC Catalog API locally

docker-compose up

  6. Wait until your terminal reports that it is starting the development server at http://0.0.0.0:8000/

Initialization

  1. Open your browser and navigate to localhost:8000

  2. Make sure you see the local API documentation


Local API Documentation

  1. Open a new terminal and change directory to the CC Catalog API

  2. In the new terminal, load the sample data

./load_sample_data.sh

  3. In the new terminal, hit the API with a request (quote the URL so the shell does not interpret the `?`)

curl 'localhost:8000/v1/images?q=honey'

  4. Make sure you see the following response from the API

(Screenshot: a sample API response)

Congratulations! You just ran the server locally.
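Beyond curl, you can query the same endpoint from code. The sketch below only builds the request URL for the /v1/images search endpoint; sending it (e.g. with the requests library) requires the server from the steps above to be running. The `page` and `pagesize` parameter names are illustrative assumptions; check the API documentation for the exact names supported.

```python
from urllib.parse import urlencode

# Base URL of the locally running API (assumes the default docker-compose port).
BASE = "http://localhost:8000/v1/images"

def search_url(query, page=1, pagesize=20):
    """Build a search request URL for the /v1/images endpoint.

    `page` and `pagesize` are hypothetical pagination parameters used
    here for illustration; consult the browsable API docs for the
    parameters the server actually accepts.
    """
    return f"{BASE}?{urlencode({'q': query, 'page': page, 'pagesize': pagesize})}"

print(search_url("honey"))
# http://localhost:8000/v1/images?q=honey&page=1&pagesize=20
```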


What Happens In the Background

After executing docker-compose up (in Step 5), you will be running:

  • A Django API server
  • Two PostgreSQL instances (one simulates the upstream data source, the other serves as the application database)
  • Elasticsearch
  • Redis
  • A thumbnail-generating image proxy
  • ingestion-server, a service for bulk ingesting and indexing search data
  • analytics, a REST API server for collecting search usage data

Diagnosing local Elasticsearch issues

If the API server container fails to start, there's a good chance that Elasticsearch failed to start on your machine. Ensure that you have allocated enough memory to Docker applications; otherwise, the container will instantly exit with an error. Also, if the logs mention "insufficient max map count", increase the maximum number of memory map areas a process may have on your system. For most Linux machines, you can fix this by adding the following line to /etc/sysctl.conf:

vm.max_map_count=262144

To make this setting take effect, run:

sudo sysctl -p
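Before editing /etc/sysctl.conf, you can check whether the current setting already meets the requirement. This is a small Linux-only sketch (it reads /proc directly; on macOS, Docker Desktop manages this inside its VM):

```shell
# Read the current vm.max_map_count and compare it to the value
# recommended above; print what, if anything, needs to change.
current=$(cat /proc/sys/vm/max_map_count)
required=262144
if [ "$current" -lt "$required" ]; then
    echo "vm.max_map_count is $current; raise it to at least $required"
else
    echo "vm.max_map_count is $current; no change needed"
fi
```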

System Architecture

System Architecture


Basic flow of data

Search data is ingested from upstream sources provided by the data pipeline. At the time of writing, this includes data from Common Crawl and multiple third-party APIs. Once the data has been scraped and cleaned, it is transferred to the upstream database, indicating that it is ready for production use.

Every week, the Ingestion Server automatically bulk copies ("ingests") the latest version of the data from the upstream database to the production database. Once the data has been copied into the production database, it is indexed in Elasticsearch, at which point the new data can be served by the CC Catalog API servers.
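The indexing step above boils down to turning database rows into Elasticsearch bulk-indexing actions. Below is a minimal, illustrative Python sketch of that transformation. The field names, index name, and `_op_type` convention follow the elasticsearch-py `helpers.bulk` action format, but none of this is the actual ingestion-server code:

```python
# Illustrative sketch only: convert database rows (dicts) into
# elasticsearch-py helpers.bulk-style actions. Field and index names
# are assumptions for demonstration, not the real schema.

def to_bulk_actions(rows, index="image"):
    """Yield one bulk 'index' action per database row.

    The primary key is used as the Elasticsearch document id but is
    kept out of the document body, so the search index does not
    expose internal ids.
    """
    for row in rows:
        doc = {k: v for k, v in row.items() if k != "id"}
        yield {
            "_op_type": "index",  # create-or-replace semantics
            "_index": index,
            "_id": row["id"],
            "_source": doc,
        }

rows = [
    {"id": 1, "title": "Honey bees", "license": "by"},
    {"id": 2, "title": "Beehive", "license": "by-sa"},
]
actions = list(to_bulk_actions(rows))
print(actions[0]["_id"], actions[0]["_source"]["title"])
# 1 Honey bees
```

In the real system, these actions would be passed to `elasticsearch.helpers.bulk` against a live cluster; generating them lazily with a generator keeps memory flat even for very large tables.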


Description of subprojects

  • cccatalog-api is a Django Rest Framework API server. For a full description of its capabilities, please see the browsable documentation.
  • ingestion-server is a service for downloading and indexing search data once it has been prepared by the CC Catalog.
  • analytics is a Falcon REST API for collecting usage data.

Running the tests

How to Run API live integration tests

You can check the health of a live deployment of the API by running the live integration tests.

  1. Change directory to the CC Catalog API

cd cccatalog-api

  2. Install all dependencies for the CC Catalog API

pipenv install

  3. Launch a new shell session

pipenv shell

  4. Run the API live integration tests

./test/run_test.sh

How to Run Ingestion Server tests

You can ingest and index some dummy data using the Ingestion Server API.

  1. Change directory to the Ingestion Server

cd ingestion_server

  2. Install all dependencies for the Ingestion Server API

pipenv install

  3. Launch a new shell session

pipenv shell

  4. Run the integration tests

python3 test/integration_tests.py

Deploying and monitoring the API

The API infrastructure is orchestrated using Terraform hosted in creativecommons/ccsearch-infrastructure. You can find more details on this wiki page.


Django Admin

You can view the custom administration views at the /admin/ endpoint.


Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers on #cc-search.

You are welcome to take any open issue in the tracker labeled help wanted or good first issue; there's no need to ask for permission in advance. Other issues are open for contribution as well, but may be less accessible or well defined in comparison to those that are explicitly labeled.

See the CONTRIBUTING file for details.


