
Docker scripts for Hoover

This repository contains a Docker Compose configuration for Hoover.

Installation

These instructions have been tested on Debian Jessie.

  1. Increase vm.max_map_count to at least 262144, to make elasticsearch happy (see the sketch after this list for one way to set it)
  2. Install Docker and Docker Compose:

    apt-get install -y apt-transport-https ca-certificates curl gnupg2 software-properties-common
    curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
    add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
    apt-get update
    apt-get install -y docker-ce
    service docker start
    curl -L https://github.com/docker/compose/releases/download/1.13.0/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
    chmod +x /usr/local/bin/docker-compose
  3. Clone the repo and set up folders:

    git clone https://github.com/hoover/docker-setup /opt/hoover
    cd /opt/hoover
    mkdir volumes volumes/metrics volumes/metrics/users volumes/search-es-snapshots collections
    chmod 777 volumes/search-es-snapshots
  4. Create configuration files:

    • /opt/hoover/snoop.env:

      DOCKER_HOOVER_SNOOP_SECRET_KEY=some-random-secret
      DOCKER_HOOVER_SNOOP_DEBUG=on
      DOCKER_HOOVER_SNOOP_BASE_URL=http://snoop.hoover.example.com
    • /opt/hoover/search.env:

      DOCKER_HOOVER_SEARCH_SECRET_KEY=some-random-secret
      DOCKER_HOOVER_SEARCH_DEBUG=on
      DOCKER_HOOVER_BASE_URL=http://hoover.example.com
  5. Run the database migrations, create the admin user, build the UI, collect static files, and spin up the docker containers:

    docker-compose run --rm snoop ./manage.py migrate
    docker-compose run --rm snoop ./manage.py resetstatsindex
    docker-compose run --rm search ./manage.py migrate
    docker-compose run --rm search ./manage.py createsuperuser
    docker-compose run --rm ui node build.js
    docker-compose run --rm search ./manage.py collectstatic --noinput
    docker-compose up -d
  6. Import the test dataset:

    git clone https://github.com/hoover/testdata collections/testdata
    docker-compose run --rm snoop ./manage.py createcollection testdata /opt/hoover/collections/testdata/data
    
    # wait for jobs to finish, i.e. when this command stops printing messages:
    docker-compose logs -f snoop-worker
    
    docker-compose run --rm search ./manage.py addcollection testdata http://snoop/collections/testdata/json --public
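
The vm.max_map_count change from step 1 can be applied like this (a minimal sketch; persist the value in /etc/sysctl.conf or a sysctl.d drop-in so it survives reboots):

sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf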

Configuring two-factor authentication

hoover-search has built-in support for TOTP two-factor authentication; to enable it, add this line to search.env:

DOCKER_HOOVER_TWOFACTOR_ENABLED=on

Then generate an invitation for your user (replace admin with your username):

docker-compose run --rm search ./manage.py invite admin

Importing OCR'ed documents

The OCR process (Optical Character Recognition – extracting machine-readable text from scanned documents) is performed outside of Hoover, using e.g. Tesseract; the Python pypdftoocr package is one option. The resulting OCR'ed documents should be PDF files whose filenames are the MD5 checksums of the original documents, e.g. d41d8cd98f00b204e9800998ecf8427e.pdf. Put all the OCR'ed files in a folder (we'll call it the ocr folder below) and follow these steps to import them into Hoover:
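
As an illustration of the naming convention, here is a minimal shell sketch (run_ocr stands in for whatever OCR tool you use, and the paths are examples, so adapt both):

mkdir -p /opt/hoover/collections/testdata/ocr/myocr
for f in /path/to/scanned/*.pdf; do
    sum=$(md5sum "$f" | awk '{print $1}')   # MD5 checksum of the original document
    run_ocr "$f" "/opt/hoover/collections/testdata/ocr/myocr/$sum.pdf"   # run_ocr is a hypothetical OCR command
done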

  • The ocr folder should be in a path accessible to the hoover docker images, e.g. in the shared "collections" folder, /opt/hoover/collections/testdata/ocr/myocr.

  • Register the ocr folder as an OCR source named myocr (choose any name you like):

    docker-compose run --rm snoop ./manage.py createocrsource myocr /opt/hoover/collections/testdata/ocr/myocr
    # wait for jobs to finish
    

Decrypting PGP emails

If you have access to PGP private keys, snoop can decrypt emails that were encrypted for those keys. Import the keys into a gnupg home folder placed next to the docker-compose.yml file. Snoop will automatically use this folder when it encounters an encrypted email.

gpg --homedir gnupg --import < path_to_key_file

If a key is protected by a passphrase you know, you may need to strip the passphrase once and import the passphrase-free key instead:

gpg --homedir gnupg --export-options export-reset-subkey-passwd --export-secret-subkeys ABCDEF01 > path_to_key_nopassword
gpg --homedir gnupg --delete-secret-keys ABCDEF01
gpg --homedir gnupg --delete-key ABCDEF01
gpg --homedir gnupg --import < path_to_key_nopassword
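
To confirm that the keys are now available in that home folder, you can list them (a quick sanity check):

gpg --homedir gnupg --list-secret-keys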

Development

Clone the code repositories:

git clone https://github.com/hoover/docker-setup
git clone https://github.com/hoover/snoop2
git clone https://github.com/hoover/search
git clone https://github.com/hoover/ui

Create a docker-compose.override.yml file in docker-setup with the following content. It will mount the code repositories inside the docker containers to run the local development code:

version: "2"

services:

  snoop-worker:
    volumes:
      - ../snoop2:/opt/hoover/snoop

  snoop:
    volumes:
      - ../snoop2:/opt/hoover/snoop

  search:
    volumes:
      - ../search:/opt/hoover/search

  ui:
    volumes:
      - ../ui:/opt/hoover/ui
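
After creating the override file, (re)start the containers so the volume mounts take effect:

docker-compose up -d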

Docker images

Docker Hub builds images from the Hoover GitHub repositories (snoop2, search, ui), triggered by pushes to their master branches.

You can also build images locally. For example, the snoop2 image:

cd snoop2
docker build . --tag snoop2

Then add this snippet to docker-compose.override.yml to test the image locally, and run docker-compose up -d to (re)start the containers:

version: "2"

services:

  snoop-worker:
    image: snoop2

  snoop:
    image: snoop2

Testing

For snoop and search, the pytest-based tests can be executed using these commands:

docker-compose run --rm snoop pytest
docker-compose run --rm search pytest

The test definitions can be found in the testsuite folder of each project. Individual tests can be started using:

docker-compose run --rm snoop pytest testsuite/test_tika.py
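
A single test within a file can be selected with pytest's usual -k filter (the expression below is just a placeholder):

docker-compose run --rm snoop pytest testsuite/test_tika.py -k some_test_name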

Working with collections

Creating a collection

To create a collection, copy the original files into a folder inside the collections folder. Then run the createcollection command for snoop and the addcollection command for search. Together these set up a new collection in the snoop SQL database, create an elasticsearch index, and trigger "walk" tasks that analyze the collection's contents. As the files get processed they will show up in the search results.

In this example, we'll name the collection foo, and assume the data is copied to the collections/foo directory. The final --public flag will make the collection accessible to all users (or anybody who can access the server if two-factor authentication is not enabled).

docker-compose run --rm snoop ./manage.py createcollection foo /opt/hoover/collections/foo
docker-compose run --rm search ./manage.py addcollection foo http://snoop/collections/foo/json --public

Exporting and importing collections

Snoop2 provides commands to export and import collection database records, blobs, and elasticsearch indexes. The collection name must be the same on export and import; this limitation could be lifted if the elasticsearch import code were modified to rename the index on import.

Exporting:

docker-compose run --rm -T snoop ./manage.py exportcollectiondb testdata | gzip -1 > testdata-db.tgz
docker-compose run --rm -T snoop ./manage.py exportcollectionindex testdata | gzip -1 > testdata-index.tgz
docker-compose run --rm -T snoop ./manage.py exportcollectionblobs testdata | gzip -1 > testdata-blobs.tgz

Importing:

docker-compose run --rm -T snoop ./manage.py importcollectiondb testdata < testdata-db.tgz
docker-compose run --rm -T snoop ./manage.py importcollectionindex testdata < testdata-index.tgz
docker-compose run --rm -T snoop ./manage.py importblobs < testdata-blobs.tgz

Note that the importblobs command doesn't expect a collection as argument; the blobs have no connection to any particular collection.

Deleting a collection

docker-compose run --rm snoop ./manage.py deletecollection testdata

This will delete the collection and associated files and directories, the elasticsearch index, and all tasks directly linked to the collection. It does NOT delete any blobs or tasks potentially shared with other collections, i.e. tasks that only handle content from specific blobs.

Monitoring snoop processing of a collection

Snoop provides an administration interface with statistics on the progress of analysis on collections. It is exposed via docker on port 45023. To access it you need to create an account:

docker-compose run --rm snoop ./manage.py createsuperuser

Sometimes it's necessary to rerun some snoop tasks. You can reschedule them using this command:

docker-compose run --rm snoop ./manage.py retrytasks --func filesystem.walk --status pending

Both --func and --status are optional and serve to filter down the number of tasks.
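
For example, to reschedule all pending tasks regardless of which function they run, drop the --func filter:

docker-compose run --rm snoop ./manage.py retrytasks --status pending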
