When indexing a collection, the process fails after ca. 74,000 documents:
```
2018-02-08 12:33:35 1 INFO hoover.search.index updating <Collection: mycollection>
2018-02-08 12:33:35 1 INFO hoover.search.index resuming load: {'feed_state': 'http://snoop/htmidi/feed?lt=2018-02-07T04:20:59.559099Z', 'report': {'indexed': 74000}}
2018-02-08 12:34:00 1 WARNING elasticsearch POST http://search-es:9200/_bulk [status:N/A request:1.140s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 356, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/usr/local/lib/python3.6/http/client.py", line 986, in send
    self.sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer
```
I assume there is a very large file in mycollection causing this error, i.e. the `_bulk` request body exceeds what Elasticsearch accepts and the connection is reset.
As a workaround, a maximum document size could be introduced, or large files could be split into several pieces before indexing.
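The max-document-size workaround could look roughly like the sketch below. This is not the actual Hoover indexing code; the function name, the `text` field, and the 10 MB limit are all assumptions for illustration. The idea is to cap each document's payload before it is handed to the bulk indexer:

```python
MAX_DOC_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB limit per document


def cap_large_docs(docs, max_bytes=MAX_DOC_BYTES):
    """Yield documents, truncating the 'text' field of any whose
    UTF-8 size exceeds max_bytes (a real fix might split instead)."""
    for doc in docs:
        text = doc.get('text', '')
        encoded = text.encode('utf-8')
        if len(encoded) > max_bytes:
            # truncate at a byte boundary; 'ignore' drops a split multi-byte char
            doc = dict(doc, text=encoded[:max_bytes].decode('utf-8', 'ignore'))
        yield doc
```

Alternatively (or additionally), large files could be split into several child documents so no single bulk item exceeds the limit; that would preserve the full content for search at the cost of more index entries.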