steiza / docstore



For any civics-minded organization that needs a simple place to host documents publicly

Home Page: http://a2docs.org/

CSS 2.16% HTML 37.11% Python 60.73%

docstore's People

eby cdzombak a2civictech

docstore's Issues

Make webserver script executable

I've looked at OSX and a few linux distros. Especially ones that support having both 2.x and 3.x pythons installed. It seems that the executable python2.7 exists on all of them.

On my server I made webserver.py executable and added the following to the top:

#!/usr/bin/env python2.7

That looks up python2.7 on the PATH instead of using a hard-coded path. Maybe give it a try on your dev box and see if it causes issues. It seems to work wherever I put it.

This allows it to run as just ./webserver.py, which helps when writing an init.d script to control it. I need to polish up the init.d script and then I'll share that in a support scripts folder along with the nginx config.

500 error when not logged in

When I hit a2docs.org/review as a user who is not logged in, I get this 500 error. I'm trying to review a submission.

File import: confirm that import includes all files for multi-file imports

Related to #3 -

The current a2docs has a number of docs where a single document id is associated with multiple files, e.g.

http://a2docs.org/doc/292/ "Ann Arbor Fire Department response times"

which is different from

https://a2docs.aadl.org/view/292 "Ann Arbor Golf Proposal for Huron Hills"

I'm not sure where the ID skew is coming from, but the goal is to preserve the old URLs so that Arborwiki doesn't require a bunch of updates.

Run minio on the docstore server in read-only mode to support S3 access to files?

See https://github.com/minio/minio - the opportunity is to allow people to download files directly from a2docs without going through the web interface by using a Minio server. Minio provides an Amazon S3 compatible interface layer.

Urgency: low. Interestingness: high.

Comma in filename

If you upload a file that has a comma in the name, it goes boom. System is Chrome, running against localhost.

The localhost page isn’t working

localhost sent an invalid response.
ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION

_AAATA Board Packet November 19, 2015_Revised.pdf is the filename.


The Stack Overflow answer to this is here:

http://stackoverflow.com/questions/8588818/chrome-pdf-display-duplicate-headers-received-from-the-server

and that issue has something to do with the comma (",") character in the filename.
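
A common way to avoid this class of error is to send exactly one Content-Disposition header with the filename in double quotes, so a comma is not read as a separator between header values. The following is a minimal sketch of that idea, not docstore's actual code: the helper name `content_disposition` is hypothetical, and the `filename*` parameter (percent-encoded UTF-8, per RFC 6266) is an optional extra for non-ASCII names.

```python
from urllib.parse import quote


def content_disposition(filename, disposition="attachment"):
    """Build one Content-Disposition header value for `filename` (hypothetical helper)."""
    # Escape backslashes and double quotes so the quoted-string stays valid.
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    # Plain `filename` parameter: ASCII only, unknown characters become "?".
    fallback = escaped.encode("ascii", "replace").decode("ascii")
    # Extended `filename*` parameter: percent-encoded UTF-8, commas included.
    encoded = quote(filename, safe="")
    return "{}; filename=\"{}\"; filename*=UTF-8''{}".format(
        disposition, fallback, encoded
    )


header = content_disposition("_AAATA Board Packet November 19, 2015_Revised.pdf")
print(header)
```

In a Tornado handler the resulting string would be passed to a single `self.set_header("Content-Disposition", ...)` call.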

Feature: Narrative Description field to support Markdown formatting

Noted in #4, this feature would support and render Markdown in the Narrative Description field.

Per that discussion, this is probably behind #3 (multiple file uploads).

"413 Request Entity Too Large nginx/1.9.13" when uploading 10.6 megabyte file

I attempted to upload a 10.6 megabyte site plan from a project in front of the Planning Commission, and got this error message. @eby - the logs should show this from June 6 at about 12:50 pm.

The docs I can find about nginx refer to a stanza declaring client_max_body_size as the thing to change.

consider setting content-type on attached files

It would be nice to have the Content-type response header set for attached files, which might make reading on e.g. Chrome, iOS webview, etc. more convenient. I'm not sure if setting content-disposition: attachment prevents the webview from displaying the document in the native PDF viewer, but I can experiment with that if needed.

$ curl -v https://a2docs.org/file/570/2760+Stanton+-+FOIA+Final.pdf
> GET /file/570/2760+Stanton+-+FOIA+Final.pdf HTTP/2
> Host: a2docs.org
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/2 200
< date: Fri, 11 Dec 2020 16:44:08 GMT
< content-type: application/octet-stream
< content-length: 167169
< server: TornadoServer/6.0.3
< content-disposition: attachment; filename="2760 Stanton - FOIA Final.pdf"
< etag: "770df252e24b5b9c39539ec2a8a459da19a45e1e"
< strict-transport-security: max-age=15768000

A link that does display inline correctly:

$ curl -v https://cdn.ballotpedia.org/images/c/cf/2020_Hawaii_sample_ballot_%28Hawaii_County%29.pdf
> Host: cdn.ballotpedia.org
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/2 200
< content-type: application/pdf
< content-length: 648057
< date: Fri, 11 Dec 2020 16:45:16 GMT
< last-modified: Tue, 20 Oct 2020 16:35:33 GMT
< etag: "bd9648313b96686eb357f26a728f7914"
< accept-ranges: bytes
< server: AmazonS3
< x-cache: Miss from cloudfront
< via: 1.1 63b9a4cda82206b6b34aab8f3e958cbe.cloudfront.net (CloudFront)
< x-amz-cf-pop: ORD52-C1
< x-amz-cf-id: l2t0ZfreqWrmhldPoyPu70kdH7JORaGyjK_ZIRpP_U6cOaLV2gyJTQ==
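
One possible approach, sketched here with the standard library's `mimetypes` module, is to guess the Content-Type from the file extension and fall back to `application/octet-stream` when the type is unknown. The function name `guess_content_type` is an assumption for illustration; how docstore would wire this into its file handler is not shown.

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    """Guess a Content-Type from the filename extension (hypothetical helper)."""
    # guess_type returns (type, encoding); type is None for unknown extensions.
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default


print(guess_content_type("2760 Stanton - FOIA Final.pdf"))
print(guess_content_type("mystery.unknownext"))
```

Whether the PDF then opens inline may also depend on the content-disposition value, which this sketch leaves unchanged.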

HTTPS support for a2docs.org

Some notes on a transition:

We have an old URL (a2docs.org) and a new URL (a2docs.aadl.org). It would be good to have a plan to consolidate the two, and I think that the surviving URL is the .aadl.org domain.

I suspect that the long-term answer is to transfer the a2docs.org domain handling to an nginx configuration that does whatever domain mapping is necessary.

The main reason for wanting this is to ensure that all of the old links to a2docs.org that are in Arborwiki still work. An alternative plan is to identify all of those pages that have those links one by one and fix them, and then retire the old a2docs.org name entirely.

Several templates' title blocks have "A2" hardcoded in them

Looking at index.html, org.html, search.html and probably others, the title typically includes "A2":

{% block title %}
Search A2 Government Document Repository
{% end %}

I note that base.html uses {{region}}. I'm too tired to fix this now and verify it actually works on my machine (and I am not familiar with Tornado's templating so I'll need to test this locally), so filing this as a note for later.

User Management and FOIA Request Tracking

This is just here for discussion and is likely a long-term change. Auth helps keep things from getting deleted and keeps out spam, and the admin user is a good fit for that.

It seems from glancing at some of the uploads and the fields that one use case is tracking requests that have been placed: putting in a stub record of what was requested and the date requested, then coming back later and uploading the doc when it is received. Correct me if I'm wrong @vielmetti

If that is the case, then it might be worth discussing what user management might look like, along with views for managing requests.

We could probably do something external, like basic webserver auth, where the app just associates the file with the login name. That would avoid the need for user admin interfaces.

500: Internal Server Error after upload to a2docs

On the evening of 2022-06-11 I set up an upload of 3 files to a2docs (FOIA 1258).
The upload failed with an Internal Server Error.

No idea what went wrong.

Source Organization field should be an autocomplete to prevent duplicates and aid search

Feature request from @vielmetti

Old system had an autocomplete on the source organization field. It also had a couple common ones.

In my test of the old autocomplete it was still showing quite a few slight spelling differences, so it obviously isn't foolproof, but it would aid some of the browsing if most things are uniform.

Narrative / Description field should have more detailed prompts for info and be larger

Feature request from @vielmetti

Original a2docs.org has the text

Add any relevant details about the documents. What are the documents about? Were there any problems or revelations? If your request was denied, what reason was given? What is the larger issue?

Could use this text as the alt or discuss a different form of the text.

The styling of the box should probably be more fluid for browser size.

Feature: multiple "request tracking numbers"

This is a feature not in the current system, and needs a little thought.

Any given request might have multiple tracking numbers; e.g. the tracking number assigned by the reader to their own request, the tracking number assigned by the institution for internal use, and the tracking number kept by a third party like a2civictech or seeclickfix for external review.

Sometimes these tracking numbers have URLs too.

I don't know how to represent this.
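
One way this could be represented, offered only as a starting point for discussion, is a list of small records per request, each holding who issued the number, the number itself, and an optional URL. All class names, field names, and sample values below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackingNumber:
    issuer: str                # who assigned it, e.g. "requester" or "institution"
    number: str                # the tracking number as the issuer wrote it
    url: Optional[str] = None  # some tracking numbers have URLs, some do not


@dataclass
class DocumentRequest:
    title: str
    tracking_numbers: List[TrackingNumber] = field(default_factory=list)

    def numbers_from(self, issuer):
        """Return every tracking number assigned by the given issuer."""
        return [t.number for t in self.tracking_numbers if t.issuer == issuer]


# Illustrative values only; none of these identifiers come from real requests.
request = DocumentRequest(
    title="Example request",
    tracking_numbers=[
        TrackingNumber("requester", "MY-001"),
        TrackingNumber("institution", "INT-42"),
        TrackingNumber("third party", "EXT-7", url="https://example.org/track/EXT-7"),
    ],
)
print(request.numbers_from("institution"))
```

In a relational schema the same shape would be a separate table keyed by document id, which keeps the number of tracking numbers per request open-ended.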

Review queued docs due to earlier server error

In #35 it was noted that there was a server error (now fixed) when uploading to a2docs.

There are a couple of documents stuck in the queue as a result. Review them, and when they are reviewed, close this.

requirements.txt needs pyyaml as a requirement

Maybe some distros have it by default, but I had to install pyyaml. I'm not sure if there is a minimum version, so I'm filing an issue instead of a pull request. I installed 3.11.

Source Organization should link to organization search

Just putting in a couple of things to meet @vielmetti's hope for parity with the old version.

In a2docs.org when viewing a doc the source organization goes to a search for other things for that organization.

Compatibility: "view" URL for detail of each uploaded document

Compare http://a2docs.org/doc/382/

Note that there are 4 documents in this document set, and that each of them has a detail page URL, e.g.

http://a2docs.org/doc/382/view/496/ "BlockbyBlock_Ann Arbor DDA - OperatingBudget - 436 hours.xlsx"

While I can think of all kinds of features that might be on this page, the minimum necessary for it is to have a compatible URL so that a deep link to that particular record continues to work.

Database cleanup tools

As I was doing an upload this a.m. I noticed that there were two semi-identical names for agencies that came up in the popup - "Ann Arbor Area Transportation Authority" and "Ann Arbor Area Transit Authority". Only one of those is correct.

The hope would be for some administrative way to remedy this, not sure the precise best way yet.
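
As a possible first step toward such a tool, near-duplicate agency names could be flagged for an administrator to review. The sketch below uses the standard library's `difflib`; the function name `near_duplicates` and the 0.85 threshold are assumptions chosen for illustration, and the list of names stands in for whatever the database actually holds.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity ratio is at or above `threshold`."""
    pairs = []
    # Compare each distinct pair once, ignoring case.
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


names = [
    "Ann Arbor Area Transportation Authority",
    "Ann Arbor Area Transit Authority",
    "Washtenaw County",
]
for a, b in near_duplicates(names):
    print("possible duplicate:", a, "|", b)
```

This only reports candidates; deciding which spelling is correct and merging the records would still be a manual or separate administrative step.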

Sample Support Scripts

This is here for my tracking. Need to create a directory (support-scripts ??) and provide the following sample docs:

Nginx Config File
Apache Config File
Init.d startup script

Should probably also do a systemd script but will have to throw up a VM to test.

Auth Broken in Python 3

Haven't had time to dig but guessing maybe a python 3+ issue? Could also be nginx needs specific config for that path but looking at some other posts it sounds like behaviour changed in 3.x and things have to be encoded manually.

Traceback (most recent call last):
  File "/usr/lib/python3.8/base64.py", line 510, in _input_type_check
    m = memoryview(s)
TypeError: memoryview: a bytes-like object is required, not 'str'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/web.py", line 1702, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "/var/www/a2docs/docstore", line 490, in get
    auth_decoded = base64.decodestring(auth_header[6:])
  File "/usr/lib/python3.8/base64.py", line 554, in decodestring
    return decodebytes(s)
  File "/usr/lib/python3.8/base64.py", line 545, in decodebytes
    _input_type_check(s)
  File "/usr/lib/python3.8/base64.py", line 513, in _input_type_check
    raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
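
The traceback shows `base64.decodestring` receiving a `str`, which Python 3 rejects because that function expects bytes (it is also a deprecated alias that was removed in Python 3.9). A minimal sketch of a Python 3 friendly replacement follows; the function name `parse_basic_auth` is hypothetical, and it assumes the header value has the usual `Basic <base64>` form with UTF-8 credentials.

```python
import base64


def parse_basic_auth(auth_header):
    """Return (username, password) from a Basic Authorization header value."""
    scheme, _, encoded = auth_header.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        raise ValueError("not a Basic Authorization header")
    # b64decode accepts an ASCII str or bytes and always returns bytes.
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Only the first colon separates the fields; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password


# Illustrative credentials only.
example_header = "Basic " + base64.b64encode(b"admin:s3cret").decode("ascii")
print(parse_basic_auth(example_header))
```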

Enhancement: RSS Feed

It'd be neat to have an RSS/Atom feed of new documents.

Document details page: "date posted" metadata

Cf https://a2docs.aadl.org/view/292

A date field displayed to the reader should include not only "date requested" and "date received" but also "date posted". This ensures that there's at least one date displayed.

The relevant bit from the old system here:

500: Internal Server Error on download after upload

See https://a2docs.aadl.org/view/408, especially at the bottom:

Download 3-16-2016 HDC Minutes with Live Links.pdf

where I get "500: Internal Server Error" as a response.

The three docs had been uploaded as a batch in a single transaction from the "upload" function on my Mac running Chrome.

500: Internal Server Error on upload

I just tried to upload the CARD presentation on the 1,4 Dioxane plume and got an internal server error. The time was approximately 0815 on 2/29.

500: Internal Server Error

tornado.web.stream_request_body vs. Request Entity Too Large.

In #24 there have been ongoing problems with "413 Request Entity Too Large" errors. Not seeing any right now, but it is anticipated that by switching to the tornado.web.stream_request_body method this problem might go away entirely.

See e.g.

https://www.tornadoweb.org/en/stable/guide/structure.html?highlight=stream_request_body#handling-request-input

Add Document form should allow multiple files per document

Feature request to bring it up to the old a2docs format.

The old setup allowed this by cloning the browse field. Not sure if easier to do it as a multiselect browse instead.

Deploy "autocomplete" version of code to a2docs.aadl.org

The current version of the code has autocomplete, but the aadl version doesn't have that yet.

Zach raised the question of whether his import script properly imported the files for entries that contain multiple documents, so a redeploy will need to track that issue too.