

Project Transferred

This project was transferred to WordPress:

  • WordPress/openverse-api: The Openverse API allows programmatic access to search for CC-licensed and public domain digital media.

For additional context see:


Creative Commons Catalog API


Purpose

The Creative Commons Catalog API ('cccatalog-api') is a system that allows programmatic access to public domain digital media. It is our ambition to index and catalog billions of Creative Commons works, including articles, songs, videos, photographs, paintings, and more. Using this API, developers will be able to access the digital commons in their own applications.

This repository is primarily concerned with back end infrastructure like datastores, servers, and APIs. The pipeline that feeds data into this system can be found in the cccatalog repository. A front end web application that interfaces with the API can be found at the cccatalog-frontend repository.


API Documentation

In the API documentation, you can find more details about the endpoints, with examples of how to use them.


How to Run the Server Locally

Prerequisites

You need to install Docker (with Docker Compose), Git, and PostgreSQL client tools. On Debian, the package is called postgresql-client-common.


How to Do It

  1. Start the Docker daemon

  2. Open your terminal (a command prompt on Windows)

  3. Clone the CC Catalog API repository

git clone https://github.com/creativecommons/cccatalog-api.git

  4. Change directory to the CC Catalog API

cd cccatalog-api

  5. Start the CC Catalog API locally

docker-compose up

  6. Wait until your terminal reports that it is starting the development server at http://0.0.0.0:8000/

Initialization

  1. Open your browser and navigate to localhost:8000

  2. Make sure you see the local API documentation


Local API Documentation

  1. Open a new terminal and change directory to the CC Catalog API

  2. In the new terminal, load the sample data

./load_sample_data.sh

  3. In the new terminal, hit the API with a request (quote the URL so the shell does not interpret the `?`)

curl 'localhost:8000/v1/images?q=honey'

  4. Make sure you see the following response from the API

(Screenshot: a sample API response)

Congratulations! You just ran the server locally.
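Beyond curl, you can query the same endpoint from code. The sketch below only builds the request URL for the /v1/images search endpoint; sending it (e.g. with the requests library) requires the server from the steps above to be running. The `page` and `pagesize` parameter names are illustrative assumptions; check the API documentation for the exact names supported.

```python
from urllib.parse import urlencode

# Base URL of the locally running API (assumes the default docker-compose port).
BASE = "http://localhost:8000/v1/images"

def search_url(query, page=1, pagesize=20):
    """Build a search request URL for the /v1/images endpoint.

    `page` and `pagesize` are hypothetical pagination parameters used
    here for illustration; consult the browsable API docs for the
    parameters the server actually accepts.
    """
    return f"{BASE}?{urlencode({'q': query, 'page': page, 'pagesize': pagesize})}"

print(search_url("honey"))
# http://localhost:8000/v1/images?q=honey&page=1&pagesize=20
```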


What Happens In the Background

After executing docker-compose up (in Step 5), you will be running:

  • A Django API server
  • Two PostgreSQL instances (one simulates the upstream data source, the other serves as the application database)
  • Elasticsearch
  • Redis
  • A thumbnail-generating image proxy
  • ingestion-server, a service for bulk ingesting and indexing search data
  • analytics, a REST API server for collecting search usage data

Diagnosing local Elasticsearch issues

If the API server container fails to start, there's a good chance that Elasticsearch failed to start on your machine. Ensure that you have allocated enough memory to Docker applications; otherwise, the container will instantly exit with an error. Also, if the logs mention "insufficient max map count", increase the maximum number of memory map areas a process may have on your system. For most Linux machines, you can fix this by adding the following line to /etc/sysctl.conf:

vm.max_map_count=262144

To make this setting take effect, run:

sudo sysctl -p
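Before editing /etc/sysctl.conf, you can check whether the current setting already meets the requirement. This is a small Linux-only sketch (it reads /proc directly; on macOS, Docker Desktop manages this inside its VM):

```shell
# Read the current vm.max_map_count and compare it to the value
# recommended above; print what, if anything, needs to change.
current=$(cat /proc/sys/vm/max_map_count)
required=262144
if [ "$current" -lt "$required" ]; then
    echo "vm.max_map_count is $current; raise it to at least $required"
else
    echo "vm.max_map_count is $current; no change needed"
fi
```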

System Architecture

System Architecture


Basic flow of data

Search data is ingested from upstream sources provided by the data pipeline. At the time of writing, this includes data from Common Crawl and multiple third-party APIs. Once the data has been scraped and cleaned, it is transferred to the upstream database, indicating that it is ready for production use.

Every week, the Ingestion Server automatically bulk copies ("ingests") the latest version of the data from the upstream database to the production database. Once the data has been copied into the production database, it is indexed in Elasticsearch, at which point the new data can be served by the CC Catalog API servers.
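The indexing step above boils down to turning database rows into Elasticsearch bulk-indexing actions. Below is a minimal, illustrative Python sketch of that transformation. The field names, index name, and `_op_type` convention follow the elasticsearch-py `helpers.bulk` action format, but none of this is the actual ingestion-server code:

```python
# Illustrative sketch only: convert database rows (dicts) into
# elasticsearch-py helpers.bulk-style actions. Field and index names
# are assumptions for demonstration, not the real schema.

def to_bulk_actions(rows, index="image"):
    """Yield one bulk 'index' action per database row.

    The primary key is used as the Elasticsearch document id but is
    kept out of the document body, so the search index does not
    expose internal ids.
    """
    for row in rows:
        doc = {k: v for k, v in row.items() if k != "id"}
        yield {
            "_op_type": "index",  # create-or-replace semantics
            "_index": index,
            "_id": row["id"],
            "_source": doc,
        }

rows = [
    {"id": 1, "title": "Honey bees", "license": "by"},
    {"id": 2, "title": "Beehive", "license": "by-sa"},
]
actions = list(to_bulk_actions(rows))
print(actions[0]["_id"], actions[0]["_source"]["title"])
# 1 Honey bees
```

In the real system, these actions would be passed to `elasticsearch.helpers.bulk` against a live cluster; generating them lazily with a generator keeps memory flat even for very large tables.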


Description of subprojects

  • cccatalog-api is a Django Rest Framework API server. For a full description of its capabilities, please see the browsable documentation.
  • ingestion-server is a service for downloading and indexing search data once it has been prepared by the CC Catalog.
  • analytics is a Falcon REST API for collecting usage data.

Running the tests

How to Run API live integration tests

You can check the health of a live deployment of the API by running the live integration tests.

  1. Change directory to the CC Catalog API

cd cccatalog-api

  2. Install all dependencies for the CC Catalog API

pipenv install

  3. Launch a new shell session

pipenv shell

  4. Run the API live integration tests

./test/run_test.sh

How to Run Ingestion Server tests

You can ingest and index some dummy data using the Ingestion Server API.

  1. Change directory to the Ingestion Server

cd ingestion_server

  2. Install all dependencies for the Ingestion Server API

pipenv install

  3. Launch a new shell session

pipenv shell

  4. Run the integration tests

python3 test/integration_tests.py

Deploying and monitoring the API

The API infrastructure is orchestrated using Terraform hosted in creativecommons/ccsearch-infrastructure. You can find more details on this wiki page.


Django Admin

You can view the custom administration views at the /admin/ endpoint.


Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers on #cc-search.

You are welcome to take any open issue in the tracker labeled help wanted or good first issue; there's no need to ask for permission in advance. Other issues are open for contribution as well, but may be less accessible or well defined in comparison to those that are explicitly labeled.

See the CONTRIBUTING file for details.


