
hoover-snoop2's Introduction

Welcome to Liquid Investigations!

The 'Liquid Investigations' project is a Google DNI-funded project, driven by CRJI.org, where coders and journalists work together to make investigative collaborations less burdensome and more secure.

We are creating an open source digital toolkit, based on existing hardware and software. When fully developed, the kit will allow for distributed data search and sharing, annotations, wiki and chat. While the software can run on any server, focus will be placed on small and portable devices.

Please take a look at our website at https://liquidinvestigations.org

You can find our project wiki here.

hoover-snoop2's People

Contributors

alexneamtu, dependabot-preview[bot], dependabot[bot], gabriel-v, ioanpocol, jarib, k-jell, mattesr, mgax, morten-spiegel, mugurrus, raduklb, razorbest, salevajo


hoover-snoop2's Issues

Exif data extraction: KeyError: 'location'

Traceback (most recent call last):
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 98, in laterz_shaorma
    result = shaormerie[task.func](*args, **depends_on)
  File "/opt/hoover/snoop/snoop/data/digests.py", line 64, in gather
    rv['location'] = exif_data['location']
KeyError: 'location'
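A minimal defensive fix would copy the key only when the EXIF extractor produced it. This is a sketch; `gather_location` and the dict shapes are hypothetical stand-ins for the real code in digests.py:

```python
# Hypothetical sketch: only copy optional EXIF fields that exist,
# instead of indexing unconditionally and raising KeyError.
def gather_location(rv, exif_data):
    """Copy the optional 'location' field into the digest dict if present."""
    if 'location' in exif_data:
        rv['location'] = exif_data['location']
    return rv
```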

Collapsing multiple blank lines

For some Excel (and other?) files, Tika extracts hundreds of thousands of blank lines (representing empty cells). This breaks the fulltext view of the search UI and information extraction (as performed in newsleak).

I suggest collapsing multiple blank lines down to a maximum number (e.g. 50).
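The cap could be a one-line regex pass over the extracted text before indexing. A sketch, assuming the text arrives as a single string (the function name is hypothetical):

```python
import re

# Sketch: cap runs of blank lines at max_blank (50 in the proposal above).
# N blank lines correspond to N+1 consecutive newline characters, so a run
# of more than max_blank+1 newlines is collapsed to exactly max_blank+1.
def collapse_blank_lines(text, max_blank=50):
    """Limit any run of consecutive blank lines to at most max_blank."""
    pattern = re.compile(r'\n{%d,}' % (max_blank + 2))
    return pattern.sub('\n' * (max_blank + 1), text)
```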

Create a separate "walk" task for each file

Right now, "walk" imports a whole folder in a task. If a file is problematic, the whole folder will fail, its transaction will be rolled back, and no other files or walk tasks will be created.
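A sketch of the proposed split, with scheduling and filesystem access injected as callables since the real task plumbing lives in snoop's shaorma machinery (all names here are hypothetical):

```python
# Hypothetical sketch: instead of importing a whole directory in one
# transaction, enqueue one child task per entry so a single bad file
# cannot roll back its siblings.
def walk(path, list_dir, schedule_walk, schedule_handle_file, is_dir):
    """Dispatch a separate task for every directory entry."""
    for entry in list_dir(path):
        if is_dir(entry):
            schedule_walk(entry)  # recurse in its own task
        else:
            schedule_handle_file(entry)  # each file gets its own transaction
```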

Admin page with stats about task errors

>>> from django.db import connection
>>> with connection.cursor() as cursor:
...     cursor.execute("select func, substring(error for position('(' in error) - 1) as short_error, count(*) from data_task where status = 'error' group by func, short_error;")
...     rows = cursor.fetchall()

This returns a list of roughly 30 (func, short_error, count) tuples.

Shaorma must run the task in a subtransaction

If there is a database-related error in a task, this will put the SQL transaction in an error state, so the task result can't be saved to the task table. So let's run the task logic in a subtransaction that can be rolled back if there is any exception.
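The pattern can be demonstrated with SQL savepoints, which is what a nested Django `transaction.atomic()` block uses under the hood. A sketch against stdlib sqlite3 (the helper name is hypothetical):

```python
import sqlite3

# Sketch of the proposed fix: run the task body inside a savepoint
# ("subtransaction") so a database error inside it can be rolled back
# without poisoning the outer transaction. In the Django code this
# would be a nested `transaction.atomic()` block.
def run_task_in_subtransaction(conn, task_body):
    conn.execute("SAVEPOINT task_body")
    try:
        result = task_body(conn)
    except Exception as exc:
        conn.execute("ROLLBACK TO SAVEPOINT task_body")
        result = ('error', str(exc))  # safe to persist: transaction is clean
    conn.execute("RELEASE SAVEPOINT task_body")
    return result
```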

Add view for the File model

We have views for Digests and Directories, but not Files.

The UI needs access to the File model to properly show the file tree structure, since the digests are de-duplicated and will contain bad parent links.

The File view should probably also return the digest for that file.

OperationalError from workers

When adding a large collection, some tasks fail with errors like this.

| ERROR:  deadlock detected
| DETAIL:  Process 29292 waits for ShareLock on transaction 974325; blocked by process 29293.
|   Process 29293 waits for ShareLock on transaction 974324; blocked by process 29292.
|   Process 29292: INSERT INTO "data_blob" ("sha3_256", "sha256", "sha1", "md5", "size", "magic", "mime_type", "mime_encoding", "date_created", "date_modified") VALUES ('602e11efade1949e6c80a197138b1e1e659e7bb67ec30acf22553642cda9fdbf', '39859d7b234a72f1f6abf901b3434a1149f075b1df6469b19765b371c2dcd3a1', '4847a2cd294e4a00de06888d600d1926c6e1faa7', 'f5f2e0fc7005c475e759384e22902c27', 2803, 'ASCII text, with very long lines', 'message/rfc822', 'us-ascii', '2018-07-21T09:42:48.334830+00:00'::timestamptz, '2018-07-21T09:42:48.334927+00:00'::timestamptz)
|   Process 29293: INSERT INTO "data_blob" ("sha3_256", "sha256", "sha1", "md5", "size", "magic", "mime_type", "mime_encoding", "date_created", "date_modified") VALUES ('c093c99ac68ce1c2f8c0f3e568054f21e0696040ecb4f0be9db913862dd4e274', 'a39f0a886eec503762d0b7f6218b686a5a28805b325a911ada1707f85e5327b2', '39905a6f6021eb8cd9ddb5fab5aae2ad83cc700b', 'a8f18dfd2fcb06f5e39cab42c31622cd', 874, 'ASCII text, with very long lines', 'message/rfc822', 'us-ascii', '2018-07-21T09:39:20.353531+00:00'::timestamptz, '2018-07-21T09:39:20.353609+00:00'::timestamptz)
| HINT:  See server log for query details.
| CONTEXT:  while inserting index tuple (10482,23) in relation "data_blob_pkey"

So far I've only seen this happen for tasks where func = archives.unarchive.

I also see quite a few of these, though I'm not sure if they are related:

ERROR:  duplicate key value violates unique constraint "data_task_func_args_4615ae37_uniq"
DETAIL:  Key (func, args)=(email.parse, ["31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51"]) already exists.
STATEMENT:  INSERT INTO "data_task" ("func", "blob_arg_id", "args", "result_id", "date_created", "date_modified", "date_started", "date_finished", "worker", "status", "error", "broken_reason", "log") VALUES ('email.parse', '31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51', '["31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51"]', NULL, '2018-07-21T09:49:59.200550+00:00'::timestamptz, '2018-07-21T09:49:59.200588+00:00'::timestamptz, NULL, NULL, '', 'pending', '', '', '') RETURNING "data_task"."id"
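Both errors are races between workers inserting the same row, so one plausible mitigation is to retry the conflicting operation with backoff, since the row the other worker committed is exactly the one we wanted. A generic sketch, not snoop's actual error handling:

```python
import random
import time

# Hypothetical sketch: deadlocks and unique-constraint races between
# workers inserting identical blob/task rows are safe to retry, because
# both sides are trying to create the same content-addressed row.
def retry_on_conflict(operation, is_conflict, attempts=5, sleep=time.sleep):
    """Run `operation`, retrying with jittered exponential backoff on conflicts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_conflict(exc) or attempt == attempts - 1:
                raise
            sleep(random.uniform(0, 0.1 * 2 ** attempt))
```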

Export & Import collections

Two management commands: export $collection and import $collection. They will read/write .tar format files.

Don't filesystem.walk inside archives when looking for .emlxpart

When emlx processing encounters a .partial.emlx file, it looks for .emlxpart files in the same folder, by waiting for the relevant filesystem.walk task. This doesn't work inside archives because there is no filesystem.walk inside archives. The code should detect that it's in an archive and expect the files to be there already.

Email parsing: handle bad base64 encodings

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/email/_encoded_words.py", line 109, in decode_b
    return base64.b64decode(padded_encoded, validate=True), defects
  File "/usr/local/lib/python3.6/base64.py", line 86, in b64decode
    raise binascii.Error('Non-base64 digit found')
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 98, in laterz_shaorma
    result = shaormerie[task.func](*args, **depends_on)
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 233, in wrapper
    rv = func(*args, **kwargs)
  File "/opt/hoover/snoop/snoop/data/analyzers/email.py", line 104, in parse
    data = dump_part(message, depends_on)
  File "/opt/hoover/snoop/snoop/data/analyzers/email.py", line 55, in dump_part
    payload_bytes = message.get_payload(decode=True)
  File "/usr/local/lib/python3.6/email/message.py", line 286, in get_payload
    value, defects = decode_b(b''.join(bpayload.splitlines()))
  File "/usr/local/lib/python3.6/email/_encoded_words.py", line 124, in decode_b
    raise AssertionError("unexpected binascii.Error")
AssertionError: unexpected binascii.Error

These documents should be marked as broken.
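A sketch of the marking logic; `mark_broken` is a hypothetical stand-in for however snoop records `broken_reason` on a task:

```python
import binascii

# Sketch: catch the decode failure and tag the document as broken
# instead of letting the task crash. AssertionError is included because
# email._encoded_words re-raises binascii.Error as AssertionError.
def decode_payload(get_payload, mark_broken):
    """Decode an email part, marking the document broken on bad base64."""
    try:
        return get_payload()
    except (binascii.Error, AssertionError):
        mark_broken('bad_base64_encoding')
        return None
```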

More database indexes!!!

Somehow, we have no indexes set up. Let's start with

  • Task: date_started and date_finished (in case we need to redo all tasks in some time interval), func, status
  • Digest: date_modified, as it's used for pagination
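Expressed as plain SQL (in the Django models these would be `Meta.indexes` entries), the proposal looks roughly like this; the `data_digest` table name is an assumption based on the other table names in the logs above:

```python
import sqlite3

# Sketch of the proposed indexes. Index names are illustrative; Django
# would generate its own names from Meta.indexes declarations.
INDEX_STATEMENTS = [
    "CREATE INDEX IF NOT EXISTS data_task_date_started ON data_task (date_started)",
    "CREATE INDEX IF NOT EXISTS data_task_date_finished ON data_task (date_finished)",
    "CREATE INDEX IF NOT EXISTS data_task_func_status ON data_task (func, status)",
    "CREATE INDEX IF NOT EXISTS data_digest_date_modified ON data_digest (date_modified)",
]

def create_indexes(conn):
    for stmt in INDEX_STATEMENTS:
        conn.execute(stmt)
```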

Configure the collection's title

The collection API should return a pretty user-configurable value for the title field. Let's define a new setting, SNOOP_COLLECTION_TITLE. Read the value from an environment variable when run in Docker.
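A sketch of the fallback logic for the proposed `SNOOP_COLLECTION_TITLE` variable (the helper name is hypothetical):

```python
import os

# Sketch: read the proposed SNOOP_COLLECTION_TITLE setting from the
# environment (as the Docker entrypoint would), falling back to the
# collection's internal name when unset.
def collection_title(default_name, environ=os.environ):
    """Return the user-configured title, or the collection name as fallback."""
    return environ.get('SNOOP_COLLECTION_TITLE') or default_name
```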

Run dispatcher for a single collection

Instead of dispatching everything for all collections again it would be more efficient to start dispatching for a specified collection:

docker-compose run --rm snoop ./manage.py rundispatcher --collection=testdata

Make digest.gather always succeed

Make digest.gather always succeed by adding details about broken dependencies to the broken field on the result blob.

This should mean that the digest.gather code uses isinstance to see if the dependency is a blob or an error.
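A sketch of that dispatch; the `Blob` class here is a stand-in for snoop's model, and failed dependencies are assumed to arrive as exception objects:

```python
# Sketch of the dependency check in digest.gather: dependencies that
# failed arrive as exceptions instead of blobs, and get recorded in a
# `broken` list on the result rather than aborting the whole digest.
class Blob:
    """Stand-in for snoop's Blob model."""
    def __init__(self, data):
        self.data = data

def gather(depends_on):
    result, broken = {}, []
    for name, dep in depends_on.items():
        if isinstance(dep, Exception):
            broken.append(name)  # record the broken dependency, keep going
        else:
            result[name] = dep.data
    result['broken'] = broken
    return result
```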

Store filenames as bytes

In the case of broken filenames (bytes that are not valid utf-8) snoop is not able to write the filenames to the database. We could store the filenames as raw bytes and as a cleaned-up text version for display purposes.
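Python's `surrogateescape` error handler gives a lossless bytes-to-text round trip, so the display string can always be mapped back to the raw name. A sketch (function names are hypothetical):

```python
# Sketch: keep the raw bytes authoritative and derive a display string
# with surrogateescape, which round-trips any byte sequence losslessly.
def filename_pair(raw: bytes):
    """Return (raw_bytes, display_text) for a possibly invalid-UTF-8 name."""
    display = raw.decode('utf-8', errors='surrogateescape')
    return raw, display

def display_to_bytes(display: str) -> bytes:
    """Recover the original raw bytes from the display string."""
    return display.encode('utf-8', errors='surrogateescape')
```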

./manage.py updatename not working since Collection has been removed from snoop 2

Traceback (most recent call last):
  File "./manage.py", line 16, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 371, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 365, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 288, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 335, in execute
    output = self.handle(*args, **options)
  File "/opt/hoover/snoop/snoop/data/management/commands/updatename.py", line 19, in handle
    collection = models.Collection.objects.get(name=collection_name)
AttributeError: module 'snoop.data.models' has no attribute 'Collection'

updatename should be removed or updated

Process emails

Use the new method (create a Directory for all Files that have children) to populate email attachments.
Use https://github.com/hoover/snoop/blob/master/snoop/emails.py for reference.

  • Process eml file headers
  • Process eml file attachments (and create the file structures)
  • Process msg files by turning them into eml first
  • Process emlx + emlxpart files by turning them into eml files

Extract unnamed calendar attachment from emails

Some emails contain calendar (vcal?) attachments that have no filename, only these headers:

Content-Type: text/calendar; charset="utf-8"; method=CANCEL
Content-Transfer-Encoding: base64

It would be nice to show them as plain-text attachments.

When somebody decides to work on this, they should ping this issue and I'll add an example file to testdata and link it here.
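A sketch of a fallback-naming approach (the mapping and names are illustrative, not snoop's behavior):

```python
# Sketch: synthesize a filename for attachment parts that carry a
# Content-Type but no name, so they can be shown as attachments.
FALLBACK_NAMES = {'text/calendar': 'calendar.ics'}

def attachment_name(filename, content_type):
    """Return the declared filename, or a synthesized one based on type."""
    if filename:
        return filename
    return FALLBACK_NAMES.get(content_type, 'attachment.bin')
```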
