allenai / pawls

Software that makes labeling PDFs easy.

Home Page: https://pawls.apps.allenai.org

License: Apache License 2.0

Jsonnet 6.72% Dockerfile 0.76% Python 57.29% HTML 0.21% TypeScript 34.79% Shell 0.18% JavaScript 0.06%

pawls's Introduction

Demo Server | Video Tutorial | Paper

PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document. It was written specifically for annotating academic papers within the Semantic Scholar corpus, but can be used with any collection of PDF documents.

Quick Start

Quick start will download some pre-processed PDFs and get the UI set up so that you can see them. If you want to pre-process your own PDFs, keep reading! If it's your first time working with PAWLS, we recommend you try the quick start first though.

First, we need to download some processed PDFs to view in the UI. PAWLS uses the PDFs themselves to render in the browser, as well as using a JSON file of extracted token bounding boxes per page, called pdf_structure.json. The PAWLS CLI can be used to do this pre-processing, but for the quick start, we have done it for you. Download them from the provided AWS S3 Bucket like so:

aws s3 sync s3://ai2-s2-pawls-public/example-data ./skiff_files/apps/pawls/papers/ --no-sign-request

Configuration in PAWLS is controlled by a JSON file located in the api/config directory. The directory we downloaded the PDFs to above is mounted into the container by docker-compose.yaml, at the location named in the config file. So, when PAWLS starts up, the API knows where to look to serve the PDFs we want.

Next, we can start the services required to use PAWLS using docker-compose:

~ docker-compose up --build

This process launches 4 services:

the ui, which renders the user interface that PAWLS uses
the api, which serves PDFs and saves/receives annotations
a proxy, which forwards traffic to the appropriate services
a grobid service, running a fork of Grobid. This is not actually necessary for the application, but is useful for the CLI.

You'll see output from each.

Once all of these have come up, navigate to localhost:8080 in your browser and you should see the PAWLS UI! Happy annotating.

Getting Started

To run a local environment, you'll need to use the PAWLS CLI to preprocess and assign the PDFs you want to serve. When using PDFs from Semantic Scholar, the CLI is also used to download the PDFs. The PDFs have to be put in a directory structure within skiff_files/apps/pawls (see PAWLS CLI usage for details).

For instance, you can run the following commands to download, preprocess, and assign PDFs:

  # Fetches PDFs from semantic scholar's S3 buckets.
  python scripts/ai2-internal/fetch_pdfs.py skiff_files/apps/pawls/papers 34f25a8704614163c4095b3ee2fc969b60de4698 3febb2bed8865945e7fddc99efd791887bb7e14f 553c58a05e25f794d24e8db8c2b8fdb9603e6a29
  # ensure that the papers are pre-processed with grobid so that they have token information.
  pawls preprocess grobid skiff_files/apps/pawls/papers
  # Assign the development user to all the papers we've downloaded.
  pawls assign skiff_files/apps/pawls/papers [email protected] --all --name-file skiff_files/apps/pawls/papers/name_mapping.json

and then open up the UI locally by running docker-compose up.

Authentication and Authorization

Authentication is simply checking that users are who they say they are. Whether or not these users' requests are allowed (e.g., to view a PDF) is considered authorization. See more about this distinction at Skiff Login.

Authentication

All requests must be authenticated.

The production deployment of PAWLS uses Skiff Login to authenticate requests. New users are bounced to a Google login workflow, and redirected back to the site if they authenticate with Google. Authenticated requests carry an HTTP header that identifies the user.
For local development, there is no login workflow. Instead, all requests are supplemented with a hard-coded authentication header in proxy/local.conf specifying that the user is [email protected].

Look at the function get_user_from_header in main.py for details.

Authorization

Authorization is enforced by the PAWLS app. A file of allowed user email addresses is consulted on every request.

In production, this file is sourced from the secret named "users" in Marina, which is projected to /users/allowed.txt in the container.
For local development, this file is sourced from allowed_users_local_development.txt, and also projected to /users/allowed.txt in the Docker container.

The format of the file is simply a list of allowed email addresses.

There's a special case when an allowed email address in this file starts with "@", meaning all users in that domain are allowed. That is, an entry "@allenai.org" will grant access to all AI2 people.

Look at the function user_is_allowed in main.py for details.
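
The allow-list rule described above can be sketched in a few lines of Python. This is an illustration only, not the code in main.py: the helper name email_is_allowed and its parameters are hypothetical, and exact, case-sensitive matching is an assumption.

```python
from typing import Iterable


def email_is_allowed(email: str, allowed_entries: Iterable[str]) -> bool:
    """Return True if `email` matches an entry in the allow-list.

    Hypothetical helper for illustration; see user_is_allowed in main.py
    for the real check.
    """
    for entry in allowed_entries:
        entry = entry.strip()
        if not entry:
            continue  # ignore blank lines in the file
        if entry.startswith("@"):
            # Domain entry: allow every address that ends with this domain.
            if email.endswith(entry):
                return True
        elif email == entry:
            # Plain entry: the whole address must match.
            return True
    return False


allowed = ["alice@example.com", "@allenai.org"]
print(email_is_allowed("alice@example.com", allowed))
print(email_is_allowed("bob@allenai.org", allowed))
print(email_is_allowed("bob@example.com", allowed))
```

In this sketch, the entries would come from reading /users/allowed.txt line by line.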

Python Development

The Python service and Python CLI are formatted with black and linted with flake8. Currently these are run in a local environment using the app's requirements.txt. To run them:

black api/
flake8 api/

Prerequisites

Make sure that you have the latest version of Docker 🐳 installed on your local machine.

To start a version of the application locally for development purposes, run this command:

~ docker-compose up --build

This process launches the services described in Quick Start: the ui, the api, a proxy responsible for forwarding traffic to the appropriate services, and grobid. You'll see output from each.

It might take a minute or two for the application to start, particularly if it's the first time you've executed this command. Be patient and wait for a clear message indicating that all of the required services have started up.

As you make changes the running application will be automatically updated. Simply refresh your browser to see them.

Sometimes one portion of your application will crash due to errors in the code. When this occurs resolve the related issue and re-run docker-compose up --build to start things back up.

Development Tips and Tricks

The skiff template contains some features which are ideal for a robust web application, but might be un-intuitive for researchers. Below are some small technical points that might help you if you are making substantial changes to the skiff template.

Skiff uses sonar to check that all parts of the application (frontend, backend) are up and running before serving requests. To do this, it checks that your api returns 2XX codes from its root url - if you change the server, you'll need to make sure to add code which returns a 2XX response from your server.
To ease development/deployment differences, skiff uses a proxy to route different urls to different containers in your application. The TL;DR of this is the following:

External URL	Internal URL	Container
`localhost:8080/*`	`localhost:3000/*`	`ui`
`localhost:8080/api/*`	`localhost:8000/*`	`api`

So, in your web application, you would make a request, e.g. axios.get("/api/route", data), which the server receives at localhost:8000/route. This makes it easy to develop without worrying about where APIs will be hosted in production vs development, and also allows for things like rate limiting. The configuration for the proxy lives in proxy/local.conf for development and in prod.conf for production.
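
The routing table above can be modelled as a small Python function. This is a sketch of the mapping only, not the proxy configuration itself: the function name resolve and the constants are hypothetical, and only the two rules in the table are covered.

```python
from typing import Tuple

# Hypothetical constants mirroring the routing table.
API_PREFIX = "/api/"
UI_UPSTREAM = "localhost:3000"
API_UPSTREAM = "localhost:8000"


def resolve(path: str) -> Tuple[str, str]:
    """Map a path requested on localhost:8080 to (container, internal URL)."""
    if path.startswith(API_PREFIX):
        # The /api prefix is dropped before the request reaches the api container.
        return "api", API_UPSTREAM + "/" + path[len(API_PREFIX):]
    # Everything else is served by the ui container, path unchanged.
    return "ui", UI_UPSTREAM + path


print(resolve("/api/route"))
print(resolve("/pdf/abc"))
```

The point of the design is that front-end code only ever uses relative paths such as /api/route, so the same code works in both development and production.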

For example, if you wanted to expose the docs route localhost:8000/docs from your api container to users of your app in production, you would add this to prod.conf:

location /docs/ {
    limit_req zone=api;
    proxy_pass http://api:8000/;
}

Troubleshooting

Updating UI Dependencies based on Dependabot Alerts

Add the package and version requirements to the resolutions field in package.json.
Run yarn install to update the yarn.lock file.
Start the containers with docker-compose up --build and check that the UI still works.

Windows EOL format (CRLF) vs Linux (LF)

The application was developed for Linux, and might fail to start on Windows because of line-ending differences.

To fix this, run this command from the root of the repository:

~ (cd ./ui && yarn && yarn lint:fix) # in parentheses, so your shell stays in the same directory

Cite PAWLS

If you find PAWLS helpful for your research, please consider citing it.

@misc{neumann2021pawls,
      title={PAWLS: PDF Annotation With Labels and Structure}, 
      author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
      year={2021},
      eprint={2101.10281},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

PAWLS is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

pawls's Issues

Add image or basic demo to README

This looks like a really useful project, so thank you for sharing it.

However, when people first read the README, because it is all text, it doesn't really confirm whether it's the kind of tool a potential user might be looking for. There are no images or brief demos of how it actually works (i.e. the UI), and that's a key part of assessing it. You'd typically want a bit more confidence before you take the plunge and install it.

I had a feeling it was the kind of tool I was after but I couldn't tell till I found a tweet showing a v. quick demo of the UI (here)

Therefore would it be okay to add a screenshot / brief demo of the UI (or at least link to somewhere that sort of thing is hosted)?

Relationally N-ary annotations on Shift + Click

Allow the definition of relations between spans/groupings of spans
Link spans using shift + click with a selected relation

Incorporate the PDFPlumber Token Extraction Function

Add CLI dataset management

I was thinking of a set of commands, to create a dataset:

pawls dataset create [DATASET NAME] [INITIAL PDFS]

Add pdfs to the dataset:

pawls dataset add [PDFS]

And offer per-dataset configuration for the label-set. Some discussion happening in #144, with the proposal that datasets are top-level folders (w/in skiff_files). I think that's the simplest, and would let you drop an overriding configuration file into each dataset folder.

One last concern is where the datasets should reside. Would a user need to provide the relative path to the skiff_files folder for each of the sub-commands, to make sure they're copied into the right place?

Add an instructions modal

This should be specified in the configuration file.

Ensure problematic pages won't be assigned

When performing pawls assign, check whether a pdf could be parsed in the Grobid server.

Onboard to Skiff and Skiff Files.

The application should be deployed to Skiff and have access to Skiff files, as data will be read from and possibly persisted to disk.

Add the ability to select to the left, and up.

Right now a selection is limited to moving down and to the right from the origin. We should support selections which move in all directions (up and to the right, up and to the left, down and to the left).
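
One common way to support dragging in any direction is to normalize the two corner points before using them. The sketch below shows the idea in Python for illustration only; the names Box and box_from_drag are hypothetical and are not taken from the PAWLS UI code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def box_from_drag(x0: float, y0: float, x1: float, y1: float) -> Box:
    """Build a box from the drag origin (x0, y0) and the current point (x1, y1).

    Taking min/max per axis makes the result independent of drag direction.
    """
    return Box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Dragging up and to the left from (100, 100) to (40, 20).
print(box_from_drag(100, 100, 40, 20))
```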

Create new labels functionality

Hi,

Is it possible to create new labels? I couldn't find such an option in the demo here.

Many thanks,
Amit

CTRL-Z doesn't stop propagation of event stack on some browsers

Feedback from Doug's contact at Northeastern:

"Lastly, one very minor bug report: it says I can use command-z to undo my last annotation, and while that appears to be true, it also doesn't seem to capture/stop the propagation of that event as it also fires an undo event at the browser level, too (which, at least in Safari, causes the most recently closed tab to respawn)"

Explore possibility of a sequential token stream annotation mode

The grobid token stream defines an ordering for tokens, which we might be able to use to highlight tokens easily across lines. This might be "expected" behaviour from annotators. However, it relies on the quality of the token stream, and so might have unexpected behaviour for poorly parsed pdfs.
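
As a rough sketch of what such a mode could look like: if every token has a position in the stream, a selection is the run of tokens between the first and last token the annotator touches, regardless of line breaks. The function name tokens_between and the plain-string token list are hypothetical; nothing here is taken from the PAWLS code.

```python
from typing import List, Sequence


def tokens_between(tokens: Sequence[str], first: int, last: int) -> List[str]:
    """Return the tokens from index `first` to `last` inclusive, in stream order.

    The two indices may be given in either order, so dragging backwards works.
    """
    start, end = sorted((first, last))
    return list(tokens[start:end + 1])


# A sentence that wraps over two lines is still contiguous in the stream.
stream = ["We", "propose", "a", "new", "model", "for", "parsing", "."]
print(tokens_between(stream, 1, 4))
print(tokens_between(stream, 6, 3))
```

As the issue notes, this only behaves well when the stream order matches reading order, which may not hold for poorly parsed PDFs.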

Loading a .pdf in PAWLS causes the user's "_annotations.json" file in the folder to be overwritten with empty annotations

I'm having an issue which is causing all annotations (for example from the pre-annotation command "pawls preannotate...") stored in the directory of the .pdf to be overwritten with the following, when I actually load the .pdf in PAWLS

{"annotations": [], "relations": []}

What I want instead (and what happens with a previous version of PAWLS running on another system) is for the pre-annotations to be visible as bounding boxes on the .pdf. With the current behavior, there is no way to use pre-annotations and PAWLS is not very useful, because then all annotations have to be labelled from scratch rather than corrected from a model's output.

Is there something hidden in a configuration that could be causing this behavior?
All the .pdfs involved are assigned to the same user.

403 Forbidden for api/annotation/allocation/info at startup

Hi,

when I run $ docker-compose up --build all containers appear to be starting and running. But when I then point my browser at localhost:8080 I get a loading icon and nothing happens. When I reload the page with the browser developer tools open I get an error message "Unhandled Rejection (Error): Request failed with status code 403".

Looking at the network traffic in the browser developer tools it appears that an access of api/annotation/allocation/info results in a 403 response.

Complete console output and website with error message attached.

console_output.txt
trace.html.txt

Any idea what's going wrong in this case?

Thanks in advance for any help.

Random missing of labeling objects

Do we plan to support entity annotation?

Sometimes an entity such as a person's name or an address may span two lines without covering all of the tokens on each line. The same applies to sentence annotation. For example, I currently find no way to annotate the sentence inside the red-outlined area. Thank you for your answers.

Sometimes the number of annotations for a given paper in the assigned papers is incorrect

Bailey - "also not sure what's going on here. had a mini panic attack but my annotations are there, it's just showing "0". I refreshed and it still says 0"

Running with Node 10.24.1

The earliest Node 10.x.y I can run the UI with is 10.24.1, because of security vulnerabilities in the version of node in the current Dockerfile.

When I run with the version of typescript in the current package.json for the ui, I get typescript errors. ("=" expected TS1005 on AntdIcon.d.ts, line 2 "import type...").
When I run with typescript 4.2.4 (changing the version in the ui/package.json) I get the error "Cannot read property 'map' of undefined" on ./src/pages/PDFPage.tsx.

Is there a version of typescript which will run the code with Node 10.24.1? Or is it necessary to change the PAWLS code?

There's some latency after releasing a selection.

After releasing the mouse to finish a selection there's a little latency, where the mouse feels unresponsive.

We should dig into what's going on. It's probably some sort of unintentional re-render.

Quickstart - s3 access error

I tried to download the pre-processed papers from the s3 bucket but got an access error. When I run the aws s3 sync command I get this error:

fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

If I open https://s3.console.aws.amazon.com/s3/buckets/ai2-s2-pawls-public/example-data in a browser I get this error:

You don’t have permission to get object details
After you or your AWS administrator have updated your permissions to allow the s3:ListBucket action, refresh the page. Learn more about Identity and access management in Amazon S3
API response
Access Denied

I have a free-tier AWS account but haven't used it before (only GCP) so perhaps I need to do additional configuration? However, I am able to access other public S3 buckets.

Any thoughts much appreciated (and thanks for what looks like a really impressive project)!

Annotation status

Tracking on branch pdf-status.

To do this I need to:

Add a command line option to add annotators and assign them pdfs. Needs an easy way to assign all annotators all pdfs.

This will probably be something like pawls add <annotators> and pawls assign <annotator> <pdfs>/ all

Add a pawls/output_dir/status/<user>.json which would be created by the CLI and read/modified by the service.
Add a dropdown enum to the sidebar which sets the pdf status.
Switch the /api/annotation/allocation/metadata to use this status json file, so that the status can be shown in the sidebar.

Should we add containing tokens for free-form annotations?

Junk papers should appear at the bottom of the list

PAWLS and sub-service configuration

Service Name	Branch	Main Usage	URL	Annotation File Path
prod	master	for demonstration and debugging	https://pawls.apps.allenai.org/	`pawls/papers`

Individual annotations can be deleted but layered annotations {onClick} get obscured by the most recent annotation.

Incorporate Bailey and Sophie's feedback after densely annotating 8 papers each

Would it be possible/feasible to add a "command + z" hotkey to undo? I accidentally added a box over 'technology' (see screenshot 1), and since there is a lot going on/overlap of token labels within the author box, it's sometimes hard to specifically select the box I would like to delete (or sometimes, a new box is accidentally created on top of it if your click isn't right on the 'x').
Going off of point 2: Would it be possible/feasible to have some sort of order/layer distinction of the labels? For example, if I click on a specific token label, is there a way to have that box/label come to the front of the plane, since all token tags seem to be in one plane? I think this could potentially help with the busyness in certain sections, like author, or header with paper venue (see screenshot 2, where the "header" label is no longer visible to delete without deleting both labels).
The comment option is great- but can the comment save at a specific location on the pdf?
Colours of labels are not clear from the background of the sidebar
Use arrow keys instead of tabs for switching labels, make it clear you can actually do that in the UI. This will fix the other problem where the tab key sometimes defaults to the browser functionality also.
Is there a functionality currently for listing all of the annotations in the "Annotation" section of the sidebar? It looks like it just lists all the annotations I've done. It could be potentially helpful for the annotator (to check work, see if they missed anything) if these were linked to their respective annotations in the pdf. For example, if I click on a specific annotation, and it then navigated me to that annotation in the pdf. Or, if it grouped your annotations by label here. For example, if I select "Paragraph" in this section, and all of my corresponding 'paragraph' labels become highlighted in the pdf so I can check them one label category at a time, rather than looking through the entire pdf for errors. Definitely not necessary (I tend to check my work a lot when I annotate), but just a thought since it's already displayed and is quite a long list.
When boxing two successive paragraphs, there was (pretty frequently) accidental overlap (such as boxing the last sentence of the paragraph above with the paragraph below it). When this happens, it's pretty subtle (screenshot, "when applied to NB"), so if there's any way to notify the annotator, or flag this in some way, it might help to avoid/notice & remove it-- since I think it is probably not desirable for what you guys are trying to do?
Is it possible to have a single save button instead of two? It's a little clunky to have individual saves for both comments and annotations. Or just cmd + s (or autosave after certain time period; or every time you switch labels) would be nice. This would be especially nice because "comments" are located at the very bottom of the sidebar, so there's a lot of scrolling up and down to go back and forth and save comments and annotations.
It could be nice to allow the annotator to control the layout of the sidebar display. For example, I'd personally rather have the "papers" section be out of the way while I'm annotating a paper, and either move it to the top or bottom (or have it collapse) so I can more easily access the labels/comments sections.

Enrich Annotations and Relations in Sidebar

Currently, the annotation and relation view in the sidebar is very basic. We should enrich this view with labels, colours, bounding box information, annotation order, page numbers etc.

Feature Request: Add Annotation Analytics

Logging user stats would be nice to track the annotation details. Like in CVAT

This could lead to a better management and tool improvement, identifying where and how people are struggling.

Save intersecting tokens.

When a user finishes selecting tokens that intersect with the bounds of their selection, we should persist this to the client in some form.
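
A minimal sketch of how intersecting tokens could be computed is shown below, assuming each token is a dict with x, y, width and height fields and the selection is given as left/top/right/bottom page coordinates. The function names are hypothetical, and this is not the PAWLS implementation.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]


def intersects(token: Token, left: float, top: float, right: float, bottom: float) -> bool:
    """True if the token's box overlaps the selection box on both axes."""
    return (
        token["x"] < right
        and token["x"] + token["width"] > left
        and token["y"] < bottom
        and token["y"] + token["height"] > top
    )


def tokens_in_selection(
    tokens: List[Token], left: float, top: float, right: float, bottom: float
) -> List[int]:
    """Return the indices of all tokens that overlap the selection."""
    return [i for i, t in enumerate(tokens) if intersects(t, left, top, right, bottom)]


tokens = [
    {"text": "Journal", "x": 90.0, "y": 42.0, "width": 28.0, "height": 8.0},
    {"text": "of", "x": 120.0, "y": 42.0, "width": 7.0, "height": 8.0},
]
print(tokens_in_selection(tokens, 85.0, 40.0, 100.0, 50.0))
```

Returning indices rather than copies of the tokens keeps the saved annotation small and easy to relate back to the token stream.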

The "labels" definition in `api/config/configuration.json` should be project specific

Add a Junk button to the sidebar

relations - example/documentation

Would it be possible to see an example of a valid relations configuration? For example:

{
    "output_directory": "/skiff_files/apps/pawls/papers/",
    "labels": [...],
    "relations": [<what should this be?>],
    "users_file": "/users/allowed.txt"
}

I have tried a few different things but I don't seem able to export the relations. Many thanks in advance

Create a command line tool for PAWLS

Instead of specifying PDF allocations within the pawls configuration file, we should have a commandline tool to interact with PAWLS. At a minimum it should do the following:

pawls fetch pdfs <s2-paper shas> <path to a file containing s2 paper shas> - fetches pdfs from s2.
pawls fetch metadata <path to directory containing pdfs> - fetches metadata from s2 for pdfs.
pawls add <directory to pdfs> /path/to/skiff-files - copies pdfs/metadata into pawls.
pawls annotate <pdf/list of pdfs> <preprocessor name, e.g. 'grobid'> --commit - add annotations to pdfs, --commit flag to commit annotations to pdf-structure service.
pawls export /path/to/skiff-files - exports a dataset in pawls, with optional formats etc.

unauthorized request shows a bug

When I request https://pawls.apps.allenai.org/pdf/e47c046a1837fb25e4da091e4d36a3ed5cd45604 I see an "Unable to Render Document" message, but then the page goes blank.

In the console is an error about something being undefined.

Grobid returned status code 500

I followed all the steps to install pawls.

This works: pawls preprocess pdfplumber skiff_files/apps/pawls/papers

This doesn't: pawls preprocess grobid skiff_files/apps/pawls/papers
pawls_grobid is running as a service in a separate shell.

$ pawls preprocess grobid skiff_files/apps/pawls/papers
Processing using the grobid preprocessor...
Processing 01E4VGC1YN...:   0%|                                                                                     | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/vuser/anaconda3/bin/pawls", line 33, in <module>
    sys.exit(load_entry_point('pawls==0.0.1', 'console_scripts', 'pawls')())
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/commands/preprocess.py", line 45, in preprocess
    data = process_grobid(str(path))
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/preprocessors/grobid.py", line 60, in process_grobid
    grobid_structure = fetch_grobid_structure(pdf_file, grobid_host)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/preprocessors/grobid.py", line 17, in fetch_grobid_structure
    raise Exception("Grobid returned status code {}".format(resp.status_code))
Exception: Grobid returned status code 500

Adding a log out button for PAWLS?

Remove Pawls service requiring access to s2 pdf buckets

We should move the pdf fetching into a command line tool, so that PAWLS only serves PDFs that it knows about, and cannot access the entire S2 corpus. This would allow us to use a basic, per user http login to get us off the ground, rather than requiring more advanced https oauth login via Skiff, which is currently only available within AI2.

Add a free text comments field

We should add a free text comments field to pdfs, so annotators can note things/communicate.

Make PAWLS a template

Pawls should be a github template, to make it easy to set up new annotation tasks.

Should we add a "non_textual_categories" key in config?

{
    "labels": ...,
    "non_textual_categories": [
        "Figure", "Table", "Algorithm"
    ],
    "relations": ...
}

Do we plan to support cross page annotation?

It seems a label cannot span two pages, even if it is a free-form annotation.
Just wondering if the cross-page annotation is possible and on the roadmap?
Thanks!

PAWLS Configuration

We might need to think about ways to set up either a global PAWLS configuration or a .pawls file stored locally in some folders.

Render a local PDF using PDF.js

The application at a minimum needs to render a local PDF using PDF.js.

Split tokens

Sometimes we need to select specific parts of the text inside a token. I wanted to select only the amount in this case.

Adding the ability to split it somehow would help a lot. PyMuPDF might help here as a preprocessor, as it can extract the text at the character level.

PDF Zoom

When labeling large figures or huge blocks of text, it would be ideal to be able to zoom out of the PDF document.

Add the ability to draw free-form selections.

Currently we only support selecting Grobid tokens that intersect with a user's selection. We also need the ability to draw an arbitrary bounding box and preserve the bounds as an annotation.

Support token stream, search-based and "freehand" bounding boxes

At a minimum, the application needs to support the user drawing bounding boxes in 3 modes:

Token Stream Annotation

The user draws a bounding box, which highlights a number of raw token bounding boxes described in #3
Extraneously highlighted raw spans can be deleted from the selected annotation

Search Based Annotation

The user can use text search to highlight the same phrase in the whole document.
Token bounding boxes are highlighted based on the raw token stream.
Bounding boxes highlighted by the search can be modified or deleted
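
The search-based mode could be sketched as a phrase match over the raw token stream. The example below is a simplified illustration: find_phrase is a hypothetical name, tokens are plain strings, and the match is exact and whitespace-delimited, which a real implementation would probably relax.

```python
from typing import List, Sequence


def find_phrase(tokens: Sequence[str], phrase: str) -> List[List[int]]:
    """Return the token indices of every occurrence of `phrase` in `tokens`."""
    words = phrase.split()
    if not words:
        return []
    n = len(words)
    matches = []
    for start in range(len(tokens) - n + 1):
        # Compare a window of n tokens against the words of the phrase.
        if list(tokens[start:start + n]) == words:
            matches.append(list(range(start, start + n)))
    return matches


stream = ["Journal", "of", "Machine", "Learning", "and", "Journal", "of", "AI"]
print(find_phrase(stream, "Journal of"))
```

Each match is a list of token indices, so the corresponding token bounding boxes can be highlighted and later modified or deleted one match at a time.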

Freehand

The user draws annotation boxes which are not aligned with the raw token stream.

"Render" token level bounding boxes for a given paper sha

Given a PDF which has been ingested into the pdf structure service using Grobid, we will then want to fetch the lowest level of token stream annotation to display "invisibly" on top of the PDF. This will allow functionality such as "snapping" to a span, and also make it easy to correlate new annotations to the raw token stream.

An example response from this low level service would be:

GET /raw/{sha}/{page}/

Response:

{
    "pages": [
        {
            "page": {
                "index": 0,
                "width": 612.0,
                "height": 792.0
            },
            "tokens": [
                {
                    "text": "Journal",
                    "x": 90.0,
                    "y": 41.97740173339844,
                    "width": 27.878599166870117,
                    "height": 7.970099925994873,
                    "styleName": "style167"
                },
                {
                    "text": "of",
                    "x": 120.69999694824219,
                    "y": 41.97740173339844,
                    "width": 6.8224101066589355,
                    "height": 7.970099925994873,
                    "styleName": "style167"
                }
            ]
        }
    ]
}
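
One possible reading of "snapping" to a span, given tokens in the shape of the response above, is to shrink a rough selection to the tight box around the tokens it captured. The sketch below assumes the x, y, width and height fields from the example response; the function name enclosing_box and the output keys are hypothetical.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]


def enclosing_box(tokens: List[Token]) -> Dict[str, float]:
    """Return the smallest box that contains every token in `tokens`."""
    if not tokens:
        raise ValueError("cannot snap to an empty list of tokens")
    return {
        "left": min(t["x"] for t in tokens),
        "top": min(t["y"] for t in tokens),
        "right": max(t["x"] + t["width"] for t in tokens),
        "bottom": max(t["y"] + t["height"] for t in tokens),
    }


tokens = [
    {"text": "Journal", "x": 90.0, "y": 40.0, "width": 28.0, "height": 8.0},
    {"text": "of", "x": 120.0, "y": 40.0, "width": 7.0, "height": 8.0},
]
print(enclosing_box(tokens))
```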

Add an annotation management view

Sometimes we get an error about the canvas being used multiple times.

Sometimes we see an error like:

Uncaught (in promise) Error: Cannot use the same canvas during multiple render() operations. Use different canvas or ensure previous operations were cancelled or completed.

The error should be caught.
The error shouldn't happen.

Add the ability to remove tokens from a selection.

After selecting tokens, but before finalizing the selection, the user should be able to remove certain tokens from the selection.