
hoover-snoop2's Introduction

Welcome to Liquid Investigations!

The 'Liquid Investigations' project is a Google DNI-funded project, driven by CRJI.org, where coders and journalists work together to make investigative collaborations less burdensome and more secure.

We are creating an open source digital toolkit, based on existing hardware and software. When fully developed, the kit will allow for distributed data search and sharing, annotations, wiki and chat. While the software can run on any server, focus will be placed on small and portable devices.

Please take a look at our website at https://liquidinvestigations.org

You can find our project wiki here.

hoover-snoop2's People

Contributors

alexneamtu, dependabot-preview[bot], dependabot[bot], gabriel-v, ioanpocol, jarib, k-jell, mattesr, mgax, morten-spiegel, mugurrus, raduklb, razorbest, salevajo


hoover-snoop2's Issues

Exif data extraction: KeyError: 'location'

Traceback (most recent call last):
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 98, in laterz_shaorma
    result = shaormerie[task.func](*args, **depends_on)
  File "/opt/hoover/snoop/snoop/data/digests.py", line 64, in gather
    rv['location'] = exif_data['location']
KeyError: 'location'
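A minimal defensive fix would copy the key only when the EXIF extractor produced it. This is a sketch; `gather_location` and the dict shapes are hypothetical stand-ins for the real code in digests.py:

```python
# Hypothetical sketch: only copy optional EXIF fields that exist,
# instead of indexing unconditionally and raising KeyError.
def gather_location(rv, exif_data):
    """Copy the optional 'location' field into the digest dict if present."""
    if 'location' in exif_data:
        rv['location'] = exif_data['location']
    return rv
```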

Collapsing multiple blank lines

For some Excel (and other?) files, Tika extracts hundreds of thousands of blank lines (representing empty cells). This breaks the fulltext view of the search UI and information extraction (as performed in newsleak).

I suggest collapsing multiple blank lines down to a maximum number (e.g. 50).
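The cap could be a one-line regex pass over the extracted text before indexing. A sketch, assuming the text arrives as a single string (the function name is hypothetical):

```python
import re

# Sketch: cap runs of blank lines at max_blank (50 in the proposal above).
# N blank lines correspond to N+1 consecutive newline characters, so a run
# of more than max_blank+1 newlines is collapsed to exactly max_blank+1.
def collapse_blank_lines(text, max_blank=50):
    """Limit any run of consecutive blank lines to at most max_blank."""
    pattern = re.compile(r'\n{%d,}' % (max_blank + 2))
    return pattern.sub('\n' * (max_blank + 1), text)
```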

Create a separate "walk" task for each file

Right now, "walk" imports a whole folder in a task. If a file is problematic, the whole folder will fail, its transaction will be rolled back, and no other files or walk tasks will be created.
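A sketch of the proposed split, with scheduling and filesystem access injected as callables since the real task plumbing lives in snoop's shaorma machinery (all names here are hypothetical):

```python
# Hypothetical sketch: instead of importing a whole directory in one
# transaction, enqueue one child task per entry so a single bad file
# cannot roll back its siblings.
def walk(path, list_dir, schedule_walk, schedule_handle_file, is_dir):
    """Dispatch a separate task for every directory entry."""
    for entry in list_dir(path):
        if is_dir(entry):
            schedule_walk(entry)  # recurse in its own task
        else:
            schedule_handle_file(entry)  # each file gets its own transaction
```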

Admin page with stats about task errors

>>> from django.db import connection
>>> with connection.cursor() as cursor:
...     cursor.execute("select func, substring(error for position('(' in error) - 1) as short_error, count(*) from data_task where status = 'error' group by func, short_error;")
...     rows = cursor.fetchall()

This returns a list of roughly 30 (func, short_error, count) tuples.

Shaorma must run the task in a subtransaction

If there is a database-related error in a task, this will put the SQL transaction in an error state, so the task result can't be saved to the task table. So let's run the task logic in a subtransaction that can be rolled back if there is any exception.
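The pattern can be demonstrated with SQL savepoints, which is what a nested Django `transaction.atomic()` block uses under the hood. A sketch against stdlib sqlite3 (the helper name is hypothetical):

```python
import sqlite3

# Sketch of the proposed fix: run the task body inside a savepoint
# ("subtransaction") so a database error inside it can be rolled back
# without poisoning the outer transaction. In the Django code this
# would be a nested `transaction.atomic()` block.
def run_task_in_subtransaction(conn, task_body):
    conn.execute("SAVEPOINT task_body")
    try:
        result = task_body(conn)
    except Exception as exc:
        conn.execute("ROLLBACK TO SAVEPOINT task_body")
        result = ('error', str(exc))  # safe to persist: transaction is clean
    conn.execute("RELEASE SAVEPOINT task_body")
    return result
```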

Add view for the File model

We have views for Digests and Directories, but not Files.

The UI needs access to the File model to properly show the file tree structure, since the digests are de-duplicated and will contain bad parent links.

The File view should probably also return the digest for that file.

OperationalError from workers

When adding a large collection, some tasks fail with errors like this.

| ERROR:  deadlock detected
| DETAIL:  Process 29292 waits for ShareLock on transaction 974325; blocked by process 29293.
|   Process 29293 waits for ShareLock on transaction 974324; blocked by process 29292.
|   Process 29292: INSERT INTO "data_blob" ("sha3_256", "sha256", "sha1", "md5", "size", "magic", "mime_type", "mime_encoding", "date_created", "date_modified") VALUES ('602e11efade1949e6c80a197138b1e1e659e7bb67ec30acf22553642cda9fdbf', '39859d7b234a72f1f6abf901b3434a1149f075b1df6469b19765b371c2dcd3a1', '4847a2cd294e4a00de06888d600d1926c6e1faa7', 'f5f2e0fc7005c475e759384e22902c27', 2803, 'ASCII text, with very long lines', 'message/rfc822', 'us-ascii', '2018-07-21T09:42:48.334830+00:00'::timestamptz, '2018-07-21T09:42:48.334927+00:00'::timestamptz)
|   Process 29293: INSERT INTO "data_blob" ("sha3_256", "sha256", "sha1", "md5", "size", "magic", "mime_type", "mime_encoding", "date_created", "date_modified") VALUES ('c093c99ac68ce1c2f8c0f3e568054f21e0696040ecb4f0be9db913862dd4e274', 'a39f0a886eec503762d0b7f6218b686a5a28805b325a911ada1707f85e5327b2', '39905a6f6021eb8cd9ddb5fab5aae2ad83cc700b', 'a8f18dfd2fcb06f5e39cab42c31622cd', 874, 'ASCII text, with very long lines', 'message/rfc822', 'us-ascii', '2018-07-21T09:39:20.353531+00:00'::timestamptz, '2018-07-21T09:39:20.353609+00:00'::timestamptz)
| HINT:  See server log for query details.
| CONTEXT:  while inserting index tuple (10482,23) in relation "data_blob_pkey"

So far I've only seen this happen for tasks where func = archives.unarchive.

I also see quite a few of these, though I'm not sure if they are related:

ERROR:  duplicate key value violates unique constraint "data_task_func_args_4615ae37_uniq"
DETAIL:  Key (func, args)=(email.parse, ["31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51"]) already exists.
STATEMENT:  INSERT INTO "data_task" ("func", "blob_arg_id", "args", "result_id", "date_created", "date_modified", "date_started", "date_finished", "worker", "status", "error", "broken_reason", "log") VALUES ('email.parse', '31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51', '["31e9ff163ee482bd8a8c4f3b432bdc87705bd79315f92f3ca95cbcd0691c5a51"]', NULL, '2018-07-21T09:49:59.200550+00:00'::timestamptz, '2018-07-21T09:49:59.200588+00:00'::timestamptz, NULL, NULL, '', 'pending', '', '', '') RETURNING "data_task"."id"
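Both errors are races between workers inserting the same row, so one plausible mitigation is to retry the conflicting operation with backoff, since the row the other worker committed is exactly the one we wanted. A generic sketch, not snoop's actual error handling:

```python
import random
import time

# Hypothetical sketch: deadlocks and unique-constraint races between
# workers inserting identical blob/task rows are safe to retry, because
# both sides are trying to create the same content-addressed row.
def retry_on_conflict(operation, is_conflict, attempts=5, sleep=time.sleep):
    """Run `operation`, retrying with jittered exponential backoff on conflicts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_conflict(exc) or attempt == attempts - 1:
                raise
            sleep(random.uniform(0, 0.1 * 2 ** attempt))
```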

Export & Import collections

Two management commands: export $collection and import $collection. They will read/write .tar format files.

Don't filesystem.walk inside archives when looking for .emlxpart

When emlx processing encounters a .partial.emlx file, it looks for .emlxpart files in the same folder, by waiting for the relevant filesystem.walk task. This doesn't work inside archives because there is no filesystem.walk inside archives. The code should detect that it's in an archive and expect the files to be there already.

Email parsing: handle bad base64 encodings

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/email/_encoded_words.py", line 109, in decode_b
    return base64.b64decode(padded_encoded, validate=True), defects
  File "/usr/local/lib/python3.6/base64.py", line 86, in b64decode
    raise binascii.Error('Non-base64 digit found')
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 98, in laterz_shaorma
    result = shaormerie[task.func](*args, **depends_on)
  File "/opt/hoover/snoop/snoop/data/tasks.py", line 233, in wrapper
    rv = func(*args, **kwargs)
  File "/opt/hoover/snoop/snoop/data/analyzers/email.py", line 104, in parse
    data = dump_part(message, depends_on)
  File "/opt/hoover/snoop/snoop/data/analyzers/email.py", line 55, in dump_part
    payload_bytes = message.get_payload(decode=True)
  File "/usr/local/lib/python3.6/email/message.py", line 286, in get_payload
    value, defects = decode_b(b''.join(bpayload.splitlines()))
  File "/usr/local/lib/python3.6/email/_encoded_words.py", line 124, in decode_b
    raise AssertionError("unexpected binascii.Error")
AssertionError: unexpected binascii.Error

These documents should be marked as broken.
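A sketch of the marking logic; `mark_broken` is a hypothetical stand-in for however snoop records `broken_reason` on a task:

```python
import binascii

# Sketch: catch the decode failure and tag the document as broken
# instead of letting the task crash. AssertionError is included because
# email._encoded_words re-raises binascii.Error as AssertionError.
def decode_payload(get_payload, mark_broken):
    """Decode an email part, marking the document broken on bad base64."""
    try:
        return get_payload()
    except (binascii.Error, AssertionError):
        mark_broken('bad_base64_encoding')
        return None
```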

More database indexes!!!

Somehow, we have no indexes set up. Let's start with

  • Task: date_started and date_finished (in case we need to redo all tasks in some time interval), func, status
  • Digest: date_modified, as it's used for pagination
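Expressed as plain SQL (in the Django models these would be `Meta.indexes` entries), the proposal looks roughly like this; the `data_digest` table name is an assumption based on the other table names in the logs above:

```python
import sqlite3

# Sketch of the proposed indexes. Index names are illustrative; Django
# would generate its own names from Meta.indexes declarations.
INDEX_STATEMENTS = [
    "CREATE INDEX IF NOT EXISTS data_task_date_started ON data_task (date_started)",
    "CREATE INDEX IF NOT EXISTS data_task_date_finished ON data_task (date_finished)",
    "CREATE INDEX IF NOT EXISTS data_task_func_status ON data_task (func, status)",
    "CREATE INDEX IF NOT EXISTS data_digest_date_modified ON data_digest (date_modified)",
]

def create_indexes(conn):
    for stmt in INDEX_STATEMENTS:
        conn.execute(stmt)
```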

Configure the collection's title

The collection API should return a pretty user-configurable value for the title field. Let's define a new setting, SNOOP_COLLECTION_TITLE. Read the value from an environment variable when run in Docker.
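A sketch of the fallback logic for the proposed `SNOOP_COLLECTION_TITLE` variable (the helper name is hypothetical):

```python
import os

# Sketch: read the proposed SNOOP_COLLECTION_TITLE setting from the
# environment (as the Docker entrypoint would), falling back to the
# collection's internal name when unset.
def collection_title(default_name, environ=os.environ):
    """Return the user-configured title, or the collection name as fallback."""
    return environ.get('SNOOP_COLLECTION_TITLE') or default_name
```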

Run dispatcher for a single collection

Instead of dispatching everything for all collections again it would be more efficient to start dispatching for a specified collection:

docker-compose run --rm snoop ./manage.py rundispatcher --collection=testdata

Make digest.gather always succeed

Make digest.gather always succeed by adding details about broken dependencies to the broken field on the result blob.

This should mean that the digest.gather code uses isinstance to see if the dependency is a blob or an error.
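A sketch of that dispatch; the `Blob` class here is a stand-in for snoop's model, and failed dependencies are assumed to arrive as exception objects:

```python
# Sketch of the dependency check in digest.gather: dependencies that
# failed arrive as exceptions instead of blobs, and get recorded in a
# `broken` list on the result rather than aborting the whole digest.
class Blob:
    """Stand-in for snoop's Blob model."""
    def __init__(self, data):
        self.data = data

def gather(depends_on):
    result, broken = {}, []
    for name, dep in depends_on.items():
        if isinstance(dep, Exception):
            broken.append(name)  # record the broken dependency, keep going
        else:
            result[name] = dep.data
    result['broken'] = broken
    return result
```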

Store filenames as bytes

In the case of broken filenames (bytes that are not valid utf-8) snoop is not able to write the filenames to the database. We could store the filenames as raw bytes and as a cleaned-up text version for display purposes.
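Python's `surrogateescape` error handler gives a lossless bytes-to-text round trip, so the display string can always be mapped back to the raw name. A sketch (function names are hypothetical):

```python
# Sketch: keep the raw bytes authoritative and derive a display string
# with surrogateescape, which round-trips any byte sequence losslessly.
def filename_pair(raw: bytes):
    """Return (raw_bytes, display_text) for a possibly invalid-UTF-8 name."""
    display = raw.decode('utf-8', errors='surrogateescape')
    return raw, display

def display_to_bytes(display: str) -> bytes:
    """Recover the original raw bytes from the display string."""
    return display.encode('utf-8', errors='surrogateescape')
```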

./manage.py updatename not working since Collection has been removed from snoop 2

Traceback (most recent call last):
  File "./manage.py", line 16, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 371, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 365, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 288, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/base.py", line 335, in execute
    output = self.handle(*args, **options)
  File "/opt/hoover/snoop/snoop/data/management/commands/updatename.py", line 19, in handle
    collection = models.Collection.objects.get(name=collection_name)
AttributeError: module 'snoop.data.models' has no attribute 'Collection'

updatename should be removed or updated

Process emails

Use the new method (create a Directory for all Files that have children) to populate email attachments.
Use https://github.com/hoover/snoop/blob/master/snoop/emails.py for reference.

  • Process eml file headers
  • Process eml file attachments (and create the file structures)
  • Process msg files by turning them into eml first
  • Process emlx + emlxpart files by turning them into eml files

Extract unnamed calendar attachment from emails

Some emails contain calendar (vcal?) attachments that have no filename, only these headers:

Content-Type: text/calendar; charset="utf-8"; method=CANCEL
Content-Transfer-Encoding: base64

It would be nice to show them as plain-text attachments.

When somebody decides to work on this, they should ping this issue and I'll add an example file to testdata and link it here.
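A sketch of a fallback-naming approach (the mapping and names are illustrative, not snoop's behavior):

```python
# Sketch: synthesize a filename for attachment parts that carry a
# Content-Type but no name, so they can be shown as attachments.
FALLBACK_NAMES = {'text/calendar': 'calendar.ics'}

def attachment_name(filename, content_type):
    """Return the declared filename, or a synthesized one based on type."""
    if filename:
        return filename
    return FALLBACK_NAMES.get(content_type, 'attachment.bin')
```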
