act-scio2

Scio v2 is a reimplementation of Scio in Python3.

Scio uses Apache Tika to extract text from documents (PDF, HTML, DOC, etc.).

The result is sent to the Scio Analyzer that extracts information using a combination of NLP (Natural Language Processing) and pattern matching.
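As a rough illustration of the pattern-matching half of that step, the sketch below pulls a few indicator types out of plain text with regular expressions. The pattern set and the function name `extract_indicators` are assumptions made for this example; they are not taken from the Scio Analyzer's actual rules.

```python
import re
from typing import Dict, List

# Illustrative patterns only; these are assumptions, not Scio's actual rules.
PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def extract_indicators(text: str) -> Dict[str, List[str]]:
    """Return a sorted, de-duplicated list of matches for each pattern."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}


if __name__ == "__main__":
    sample = (
        "The actor used 192.168.10.5 and exploited CVE-2017-0199; "
        "dropper hash d41d8cd98f00b204e9800998ecf8427e."
    )
    for kind, values in extract_indicators(sample).items():
        print(kind, values)
```

A real analyzer would combine matches like these with NLP output (for example named entities), which this sketch does not attempt.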

Changelog

0.0.42

Scio now supports setting a TLP (Traffic Light Protocol) level on data upload, so that documents are annotated with a TLP tag. Documents downloaded by feeds default to TLP white, but this can be changed in the feed configuration.

Source code

The source code for the workers is available on GitHub.

Setup

To set up Scio, first install it from PyPI:

sudo pip3 install act-scio

You will also need to install beanstalkd. On Debian/Ubuntu you can run:

sudo apt install beanstalkd

Configure beanstalkd to accept larger payloads with the -z option. For Red Hat-derived setups this can be configured in /etc/sysconfig/beanstalkd:

MAX_JOB_SIZE=-z 524288
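To make the numbers concrete: 524288 bytes is 512 KiB, while beanstalkd's built-in default maximum job size is 65535 bytes. The helper below is a minimal sketch for checking whether a payload fits under a given limit; `fits_in_job` is a hypothetical name, and the job that Scio actually places on the tube may be larger than the raw document because of any wrapping it adds.

```python
DEFAULT_MAX_JOB_SIZE = 65535   # beanstalkd's default limit, in bytes
RAISED_MAX_JOB_SIZE = 524288   # the value passed to -z above (512 * 1024)


def fits_in_job(payload: bytes, max_job_size: int = RAISED_MAX_JOB_SIZE) -> bool:
    """Return True if the payload is no larger than the configured job size."""
    return len(payload) <= max_job_size


if __name__ == "__main__":
    body = b"x" * 100_000
    print("fits with default limit:", fits_in_job(body, DEFAULT_MAX_JOB_SIZE))
    print("fits with raised limit:", fits_in_job(body))
```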

You then need to install NLTK data files. A helper utility to do this is included:

scio-nltk-download

You will also need to create a default configuration:

scio-config user

API

To run the api, execute:

scio-api

This will set up the API on 127.0.0.1:3000. Use --port <PORT> and --host <IP> to listen on another port and/or another interface.
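A quick way to confirm that something is listening on the chosen host and port is a plain TCP connection attempt. The sketch below uses only the Python standard library; `api_is_listening` is a hypothetical helper name, and it only checks that the port accepts connections, not that the process behind it is scio-api.

```python
import socket


def api_is_listening(host: str = "127.0.0.1", port: int = 3000,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("listening:", api_is_listening())
```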

For documentation of the API endpoint see API.md.

Configuration

You can create a default configuration using this command (should be run as the user running scio):

scio-config user

Common configuration can be found under ~/.config/scio/etc/scio.ini
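To inspect the generated file programmatically, a sketch along these lines may help. It assumes scio.ini is a standard INI file that Python's `configparser` can read; the helper names `load_scio_config` and `summarize` are hypothetical, and no particular section or option names are assumed.

```python
import configparser
from pathlib import Path

# Path taken from the text above.
DEFAULT_CONFIG = Path.home() / ".config" / "scio" / "etc" / "scio.ini"


def load_scio_config(path: Path = DEFAULT_CONFIG) -> configparser.ConfigParser:
    """Parse the INI file at `path`, failing loudly if it cannot be read."""
    # Interpolation is disabled so literal '%' characters in values are kept.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"No readable configuration at {path}")
    return parser


def summarize(parser: configparser.ConfigParser) -> dict:
    """Return {section: {option: value}} for every section in the file."""
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, options in summarize(load_scio_config()).items():
        print(f"[{section}]")
        for key, value in options.items():
            print(f"  {key} = {value}")
```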

Running Manually

Scio Tika Server

The Scio Tika server reads jobs from the beanstalk tube scio_doc and sends the extracted text to the tube scio_analyze.

The first time the server runs, it will download tika using maven. It will use a proxy if $https_proxy is set.

scio-tika-server

scio-tika-server uses tika-python, which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If it does not, or if you need a proxy to connect to the internet, follow the instructions under "Airgap Environment Setup" here: https://github.com/chrismattmann/tika-python. Currently only tested with tika-server version 2.7.0.

Scio Analyze Server

By default, the Scio Analyze Server reads jobs from the beanstalk tube scio_analyze.

scio-analyze

You can also read directly from stdin like this:

echo "The companies in the Bus; Finanical, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=

Scio Submit

Submit a document (from a file or URI) to the Scio API.

Example:

scio-submit \
   --uri https://www2.fireeye.com/rs/848-DID-242/images/rpt-apt29-hammertoss.pdf \
   --scio-baseuri http://localhost:3000/submit \
   --tlp white

Running as a service

Systemd compatible service scripts can be found under examples/systemd.

To install:

sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze

scio-feeds cron job

To continuously fetch new content from feeds, you can add scio-feeds to cron like this (make sure the directory $HOME/logs exists):

# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1

# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
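The -mmin +10080 threshold is 7 days expressed in minutes (7 × 24 × 60 = 10080). For setups where the cleanup is easier to run from Python, the sketch below roughly mirrors the find command. `delete_old_logs` is a hypothetical helper and is not part of act-scio; unlike find, it compares exact timestamps rather than whole minutes and only touches regular files.

```python
import time
from pathlib import Path

MAX_AGE_MINUTES = 7 * 24 * 60  # 10080, the same threshold as -mmin +10080


def delete_old_logs(log_dir, pattern="scio-feed.log.*",
                    max_age_minutes=MAX_AGE_MINUTES, now=None):
    """Delete files under log_dir matching pattern that are older than the limit.

    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    removed = []
    # rglob searches subdirectories too, as find does by default.
    for path in sorted(Path(log_dir).rglob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


if __name__ == "__main__":
    for path in delete_old_logs(Path.home() / "logs"):
        print("removed", path)
```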

Local development

Use pip to install in local development mode. act-scio uses namespacing, so it is not compatible with using setup.py install or setup.py develop.

In the repository, run:

pip3 install --user -e .
