
pawls's Introduction


PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document. It was written specifically for annotating academic papers within the Semantic Scholar corpus, but can be used with any collection of PDF documents.

Quick Start

Quick start will download some pre-processed PDFs and get the UI set up so that you can see them. If you want to pre-process your own PDFs, keep reading! If it's your first time working with PAWLS, we recommend you try the quick start first though.

First, we need to download some processed PDFs to view in the UI. PAWLS renders the PDFs themselves in the browser, and also relies on a JSON file of extracted token bounding boxes per page, called pdf_structure.json. The PAWLS CLI can be used to do this pre-processing, but for the quick start, we have done it for you. Download them from the provided AWS S3 bucket like so:

aws s3 sync s3://ai2-s2-pawls-public/example-data ./skiff_files/apps/pawls/papers/ --no-sign-request

Configuration in PAWLS is controlled by a JSON file, located in the api/config directory. The directory that we downloaded the PDFs to above corresponds to the location named in the config file, which is mounted in via docker-compose.yaml. So, when PAWLS starts up, the API knows where to look to serve the PDFs we want.

Next, we can start the services required to use PAWLS using docker-compose:

~ docker-compose up --build

This process launches 4 services:

  • the ui, which renders the user interface that PAWLS uses
  • the api, which serves PDFs and saves/receives annotations
  • a proxy, responsible for forwarding traffic to the appropriate services
  • a grobid service, running a fork of Grobid. This is not actually necessary for the application, but is useful for the CLI.

You'll see output from each.

Once all of these have come up, navigate to localhost:8080 in your browser and you should see the PAWLS UI! Happy annotating.

Getting Started

To run a local environment, you'll need to use the PAWLS CLI to preprocess and assign the PDFs you want to serve. When using PDFs from Semantic Scholar, the CLI is also used to download the PDFs. The PDFs have to be put in a directory structure within skiff_files/apps/pawls (see PAWLS CLI usage for details).

For instance, you can run the following commands to download, preprocess, and assign PDFs:

  # Fetches PDFs from semantic scholar's S3 buckets.
  python scripts/ai2-internal/fetch_pdfs.py skiff_files/apps/pawls/papers 34f25a8704614163c4095b3ee2fc969b60de4698 3febb2bed8865945e7fddc99efd791887bb7e14f 553c58a05e25f794d24e8db8c2b8fdb9603e6a29
  # ensure that the papers are pre-processed with grobid so that they have token information.
  pawls preprocess grobid skiff_files/apps/pawls/papers
  # Assign the development user to all the papers we've downloaded.
  pawls assign skiff_files/apps/pawls/papers [email protected] --all --name-file skiff_files/apps/pawls/papers/name_mapping.json

and then open up the UI locally by running docker-compose up.

Authentication and Authorization

Authentication is simply checking that users are who they say they are. Whether or not these users' requests are allowed (e.g., to view a PDF) is considered authorization. See more about this distinction at Skiff Login.

Authentication

All requests must be authenticated.

  • The production deployment of PAWLS uses Skiff Login to authenticate requests. New users are bounced to a Google login workflow, and redirected back to the site if they authenticate with Google. Authenticated requests carry an HTTP header that identifies the user.
  • For local development, there is no login workflow. Instead, all requests are supplemented with a hard-coded authentication header in proxy/local.conf specifying that the user is [email protected].

Look at the function get_user_from_header in main.py for details.

Authorization

Authorization is enforced by the PAWLS app. A file of allowed user email addresses is consulted on every request.

The format of the file is simply a list of allowed email addresses.

There's a special case when an allowed email address in this file starts with "@", meaning all users in that domain are allowed. That is, an entry "@allenai.org" will grant access to all AI2 people.

Look at the function user_is_allowed in main.py for details.
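
As a rough illustration of the rule described above (this is not the code in main.py), the check might look like the following sketch. The function name is_email_allowed, the in-memory list of entries, the example addresses, and the case-insensitive comparison are all assumptions made for this example.

```python
from typing import Iterable


def is_email_allowed(email: str, allowed_entries: Iterable[str]) -> bool:
    """Return True if `email` matches an entry in the allow list.

    Hypothetical helper: an entry starting with "@" is treated as a
    whole-domain rule; any other entry must match the address exactly.
    """
    email = email.strip().lower()
    for entry in allowed_entries:
        entry = entry.strip().lower()
        if not entry:
            continue  # skip blank lines in the file
        if entry.startswith("@"):
            if email.endswith(entry):
                return True
        elif email == entry:
            return True
    return False


# Example allow list: one exact address and one domain rule.
allowed = ["alice@example.com", "@allenai.org"]
print(is_email_allowed("bob@allenai.org", allowed))  # matched by the domain rule
print(is_email_allowed("bob@example.com", allowed))  # not listed
```

Because the domain entry keeps its leading "@", an address such as someone@notallenai.org does not match "@allenai.org".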

Python Development

The Python service and Python CLI are formatted with black and linted with flake8. Currently this is run in a local environment using the app's requirements.txt. To run both tools:

black api/
flake8 api/

Prerequisites

Make sure that you have the latest version of Docker 🐳 installed on your local machine.

To start a version of the application locally for development purposes, run this command:

~ docker-compose up --build

This process launches 3 services, the ui, api and a proxy responsible for forwarding traffic to the appropriate services. You'll see output from each.

It might take a minute or two for the application to start, particularly if it's the first time you've executed this command. Be patient and wait for a clear message indicating that all of the required services have started up.

As you make changes, the running application will be updated automatically. Simply refresh your browser to see them.

Sometimes one portion of your application will crash due to errors in the code. When this occurs, resolve the related issue and re-run docker-compose up --build to start things back up.

Development Tips and Tricks

The skiff template contains some features which are ideal for a robust web application, but might be un-intuitive for researchers. Below are some small technical points that might help you if you are making substantial changes to the skiff template.

  • Skiff uses sonar to check that all parts of the application (frontend, backend) are up and running before serving requests. To do this, it checks that your api returns 2XX codes from its root url - if you change the server, you'll need to make sure to add code which returns a 2XX response from your server.

  • To ease development/deployment differences, skiff uses a proxy to route different urls to different containers in your application. The TL;DR of this is the following:

External URL           Internal URL        Container
localhost:8080/*       localhost:3000/*    ui
localhost:8080/api/*   localhost:8000/*    api

So, in your web application, you would make a request, e.g. axios.get("/api/route", data), which the server receives at localhost:8000/route. This makes it easy to develop without worrying about where APIs will be hosted in production vs development, and also allows for things like rate limiting. The configuration for the proxy lives here for development and here for production.

For example, if you wanted to expose the docs route localhost:8000/docs from your api container to users of your app in production, you would add this to prod.conf:

location /docs/ {
    limit_req zone=api;
    proxy_pass http://api:8000/;
}
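
To make the routing table above concrete, here is a small Python model of the prefix matching it describes. It is only a sketch of the idea, not how the proxy is implemented; the ROUTES list and the resolve function are names invented for this example.

```python
from typing import Tuple

# (external prefix, container, internal base URL), most specific prefix first.
ROUTES = [
    ("/api/", "api", "http://localhost:8000/"),
    ("/", "ui", "http://localhost:3000/"),
]


def resolve(external_path: str) -> Tuple[str, str]:
    """Map a path requested on localhost:8080 to (container, internal URL)."""
    for prefix, container, base in ROUTES:
        if external_path.startswith(prefix):
            # Drop the matched prefix and append the rest to the internal base.
            return container, base + external_path[len(prefix):]
    raise ValueError(f"no route for {external_path!r}")


print(resolve("/api/route"))   # ('api', 'http://localhost:8000/route')
print(resolve("/index.html"))  # ('ui', 'http://localhost:3000/index.html')
```

Listing the more specific /api/ prefix first matters here, because every path also matches the catch-all / route.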

Troubleshooting

Updating UI Dependencies based on Dependabot Alerts

  1. Add the package and the required version to the resolutions field of the package.json file.
  2. Run yarn install to update the yarn.lock file.
  3. Start the containers with docker-compose up --build and check that the UI still works.

Windows EOL format (CRLF) vs Linux (LF)

The application was developed for Linux, and might fail to start on Windows because of line-ending differences.

To fix this, run this command from the root of the repository:

~ (cd ./ui && yarn && yarn lint:fix) # in parentheses, so your shell stays in the current directory

Cite PAWLS

If you find PAWLS helpful for your research, please consider citing it.

@misc{neumann2021pawls,
      title={PAWLS: PDF Annotation With Labels and Structure}, 
      author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
      year={2021},
      eprint={2101.10281},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

PAWLS is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

pawls's People

Contributors

bartbroere, bwindsor22, codeviking, dependabot[bot], e-tornike, egork520, geli-gel, illdepence, jbarrow, jsv4, julianmack, lolipopshock, mewil, schmmd, tjaffri, vtcaregorodtcev


pawls's Issues

Add image or basic demo to README

This looks like a really useful project, so thank you for sharing it.

However, when people first read the README, because it is all text, it doesn't really reinforce if it's the kind of tool a potential user might be looking for - there are no images or brief demos of how it actually works (ie the UI) and that's a key part of assessing it. You'd typically want a bit more confidence before you take the plunge and install it.

I had a feeling it was the kind of tool I was after but I couldn't tell till I found a tweet showing a v. quick demo of the UI (here)

Therefore would it be okay to add a screenshot / brief demo of the UI (or at least link to somewhere that sort of thing is hosted)?

Add CLI dataset management

I was thinking of a set of commands, to create a dataset:

pawls dataset create [DATASET NAME] [INITIAL PDFS]

Add pdfs to the dataset:

pawls dataset add [PDFS]

And offer per-dataset configuration for the label-set. Some discussion happening in #144, with the proposal that datasets are top-level folders (w/in skiff_files). I think that's the simplest, and would let you drop an overriding configuration file into each dataset folder.

One last concern is where the datasets should reside. Would a user need to provide the relative path to the skiff_files folder for each of the sub-commands, to make sure they're copied into the right place?

Onboard to Skiff and Skiff Files.

The application should be deployed to Skiff and have access to Skiff files, as data will be read from and possibly persisted to disk.

Add the ability to select to the left, and up.

Right now a selection is limited to moving down and to the right from the origin. We should support selections which move in all directions (up and to the right, up and to the left, down and to the left).
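
One common way to support all four drag directions is to normalize the two corner points before using them. The sketch below shows the idea in Python for consistency with the other examples here; the UI itself is not written in Python, and the Bounds type and normalize_selection function are hypothetical names.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def normalize_selection(x0: float, y0: float, x1: float, y1: float) -> Bounds:
    """Build a rectangle from the drag origin (x0, y0) and the current
    pointer position (x1, y1), whichever direction the pointer moved."""
    return Bounds(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Dragging up and to the left, from (100, 100) to (40, 60).
print(normalize_selection(100, 100, 40, 60))
```

With the corners normalized, the rest of the selection logic can keep assuming that left <= right and top <= bottom.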

CTRL-Z doesn't stop propagation of event stack on some browsers

Feedback from Doug's contact at Northeastern:

"Lastly, one very minor bug report: it says I can use command-z to undo my last annotation, and while that appears to be true, it also doesn't seem to capture/stop the propagation of that event as it also fires an undo event at the browser level, too (which, at least in Safari, causes the most recently closed tab to respawn)."

Explore possibility of a sequential token stream annotation mode

The grobid token stream defines an ordering for tokens, which we might be able to use to highlight tokens easily across lines. This might be "expected" behaviour from annotators. However, it relies on the quality of the token stream, and so might have unexpected behaviour for poorly parsed pdfs.

Loading a .pdf in PAWLS causes the user's "_annotations.json" file in the folder to be overwritten with empty annotations

I'm having an issue which is causing all annotations (for example from the pre-annotation command "pawls preannotate...") stored in the directory of the .pdf to be overwritten with the following, when I actually load the .pdf in PAWLS

{"annotations": [], "relations": []}

What I want instead (and what happens with a previous version of PAWLS running on another system) is for the pre-annotations to be visible as bounding boxes on the .pdf. With the current behavior, there is no way to use pre-annotations and PAWLS is not very useful, because all annotations then have to be labelled from scratch rather than corrected from a model's output.

Is there something hidden in a configuration that could be causing this behavior?
All the .pdfs involved are assigned to the same user.

403 Forbidden for api/annotation/allocation/info at startup

Hi,

when I run $ docker-compose up --build all containers appear to be starting and running. But when I then point my browser at localhost:8080 I get a loading icon and nothing happens. When I reload the page with the browser developer tools open I get an error message "Unhandled Rejection (Error): Request failed with status code 403".

Looking at the network traffic in the browser developer tools it appears that an access of api/annotation/allocation/info results in a 403 response.

Complete console output and website with error message attached.

console_output.txt
trace.html.txt

Any idea what's going wrong in this case?

Thanks in advance for any help.

Do we plan to support entity annotation?

Sometimes an entity, such as a person's name or an address, spans two lines without fully covering the tokens of either line. The same applies to sentence annotation. For example, I currently find no way to annotate the sentence in the area outlined in red in the attached image. Thank you for your answers.

Running with Node 10.24.1

The earliest Node 10.x.y I can run the UI with is 10.24.1, because of security vulnerabilities in the version of node in the current Dockerfile.

When I run with the version of typescript in the current package.json for the ui, I get typescript errors. ("=" expected TS1005 on AntdIcon.d.ts, line 2 "import type...").
When I run with typescript 4.2.4 (changing the version in the ui/package.json) I get the syntax error "Cannot read property 'map' of undefined" on ./src/pages/PDFPage.tsx.

Is there a version of typescript which will run the code with Node 10.24.1? Or is it necessary to change the PAWLS code?

There's some latency after releasing a selection.

After releasing the mouse to finish a selection there's a little latency, where the mouse feels unresponsive.

We should dig into what's going on. It's probably some sort of unintentional re-render.

Quickstart - s3 access error

I tried to download the pre-processed papers from the s3 bucket but got an access error. When I run the aws s3 sync command I get this error:

fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

If I open https://s3.console.aws.amazon.com/s3/buckets/ai2-s2-pawls-public/example-data in a browser I get this error:

You don’t have permission to get object details
After you or your AWS administrator have updated your permissions to allow the s3:ListBucket action, refresh the page. Learn more about Identity and access management in Amazon S3
API response
Access Denied

I have a free-tier AWS account but haven't used it before (only GCP), so perhaps I need to do additional configuration? However, I am able to access other public S3 buckets.

Any thoughts much appreciated (and thanks for what looks like a really impressive project)!

Annotation status

Tracking on branch pdf-status.

To do this I need to:

  • Add a command line option to add annotators and assign them pdfs. Needs an easy way to assign all annotators all pdfs.

This will probably be something like pawls add <annotators> and pawls assign <annotator> <pdfs>/ all

  • Add a pawls/output_dir/status/<user>.json which would be created by the CLI and read/modified by the service.
  • Add a dropdown enum to the sidebar which sets the pdf status.
  • Switch the /api/annotation/allocation/metadata to use this status json file, so that the status can be shown in the sidebar.

Incorporate Bailey and Sophie's feedback after densely annotating 8 papers each

  • Would it be possible/feasible to add a "command + z" hotkey to undo? I accidentally added a box over 'technology' (see screenshot 1), and since there is a lot going on/overlap of token labels within the author box, it's sometimes hard to specifically select the box I would like to delete (or sometimes, a new box is accidentally created on top of it if your click isn't right on the 'x').

  • Going off of point 2: Would it be possible/feasible to have some sort of order/layer distinction of the labels? For example, if I click on a specific token label, is there a way to have that box/label come to the front of the plane, since all token tags seem to be in one plane? I think this could potentially help with the busyness in certain sections, like author, or header with paper venue (see screenshot 2, where the "header" label is no longer visible to delete without deleting both labels).

  • The comment option is great- but can the comment save at a specific location on the pdf?

  • Colours of labels are not clear from the background of the sidebar

  • Use arrow keys instead of tabs for switching labels, make it clear you can actually do that in the UI. This will fix the other problem where the tab key sometimes defaults to the browser functionality also.

  • Is there a functionality currently for listing all of the annotations in the "Annotation" section of the sidebar? It looks like it just lists all the annotations I've done. It could be potentially helpful for the annotator (to check work, see if they missed anything) if these were linked to their respective annotations in the pdf. For example, if I click on a specific annotation, and it then navigated me to that annotation in the pdf. Or, if it grouped your annotations by label here. For example, if I select "Paragraph" in this section, and all of my corresponding 'paragraph' labels become highlighted in the pdf so I can check them one label category at a time, rather than looking through the entire pdf for errors. Definitely not necessary (I tend to check my work a lot when I annotate), but just a thought since it's already displayed and is quite a long list.

  • When boxing two successive paragraphs, there was (pretty frequently) accidental overlap (such as boxing the last sentence of the paragraph above with the paragraph below it). When this happens, it's pretty subtle (screenshot, "when applied to NB"), so if there's any way to notify the annotator, or flag this in some way, it might help to avoid/notice & remove it-- since I think it is probably not desirable for what you guys are trying to do?

  • Is it possible to have a single save button instead of two? It's a little clunky to have individual saves for both comments and annotations. Or just cmd + s (or autosave after certain time period; or every time you switch labels) would be nice. This would be especially nice because "comments" are located at the very bottom of the sidebar, so there's a lot of scrolling up and down to go back and forth and save comments and annotations.

  • It could be nice to allow the annotator to control the layout of the sidebar display. For example, I'd personally rather have the "papers" section be out of the way while I'm annotating a paper, and either move it to the top or bottom (or have it collapse) so I can more easily access the labels/comments sections.

Enrich Annotations and Relations in Sidebar

Currently, the annotation and relation view in the sidebar is very basic. We should enrich this view with labels, colours, bounding box information, annotation order, page numbers etc.

Feature Request: Add Annotation Analytics

Logging user stats, as CVAT does, would be a nice way to track annotation details.

This could lead to better management and tool improvement by identifying where and how people are struggling.

Save intersecting tokens.

When a user finishes selecting tokens that intersect with the bounds of their selection, we should persist this to the client in some form.

relations - example/documentation

Would it be possible to see an example of a valid relations configuration? For example:

{
    "output_directory": "/skiff_files/apps/pawls/papers/",
    "labels": [...],
    "relations": [<what should this be?>],
    "users_file": "/users/allowed.txt"
}

I have tried a few different things but I don't seem able to export the relations. Many thanks in advance

Create a command line tool for PAWLS

Instead of specifying PDF allocations within the pawls configuration file, we should have a commandline tool to interact with PAWLS. At a minimum it should do the following:

  • pawls fetch pdfs <s2-paper shas> <path to a file containing s2 paper shas> - fetches pdfs from s2.
  • pawls fetch metadata <path to directory containing pdfs> - fetches metadata from s2 for pdfs.
  • pawls add <directory to pdfs> /path/to/skiff-files - copies pdfs/metadata into pawls.
  • pawls annotate <pdf/list of pdfs> <preprocessor name, e.g. 'grobid'> --commit - adds annotations to pdfs; the --commit flag commits annotations to the pdf-structure service.
  • pawls export /path/to/skiff-files - exports a dataset in pawls, with optional formats etc.

Grobid returned status code 500

I followed all the steps to install pawls.

This works: pawls preprocess pdfplumber skiff_files/apps/pawls/papers

This doesn't: pawls preprocess grobid skiff_files/apps/pawls/papers
pawls_grobid is running as a service in a separate shell.

$ pawls preprocess grobid skiff_files/apps/pawls/papers
Processing using the grobid preprocessor...
Processing 01E4VGC1YN...:   0%|                                                                                     | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/vuser/anaconda3/bin/pawls", line 33, in <module>
    sys.exit(load_entry_point('pawls==0.0.1', 'console_scripts', 'pawls')())
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/commands/preprocess.py", line 45, in preprocess
    data = process_grobid(str(path))
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/preprocessors/grobid.py", line 60, in process_grobid
    grobid_structure = fetch_grobid_structure(pdf_file, grobid_host)
  File "/home/vuser/anaconda3/lib/python3.8/site-packages/pawls-0.0.1-py3.8.egg/pawls/preprocessors/grobid.py", line 17, in fetch_grobid_structure
    raise Exception("Grobid returned status code {}".format(resp.status_code))
Exception: Grobid returned status code 500

Remove Pawls service requiring access to s2 pdf buckets

We should move the pdf fetching into a command line tool, so that PAWLS only serves PDFs that it knows about, and cannot access the entire S2 corpus. This would allow us to use a basic, per user http login to get us off the ground, rather than requiring more advanced https oauth login via Skiff, which is currently only available within AI2.

Make PAWLS a template

Pawls should be a github template, to make it easy to set up new annotation tasks.

PAWLS Configuration

We might need to think about ways of setting up either a global PAWLS configuration or a .pawls file stored locally in some folders.

Split tokens

Sometimes we need to select specific parts of the text inside a token. I wanted to select only the amount in this case.

Adding the ability to split it somehow would help a lot. PyMuPDF might help here as a preprocessor, as it can extract the text at the character level.

PDF Zoom

When labeling large figures or huge blocks of text, it would be ideal to be able to zoom out of the PDF document.

Add the ability to draw free-form selections.

Currently we only support selecting Grobid tokens that intersect with a user's selection. We also need the ability to draw an arbitrary bounding box and preserve the bounds as an annotation.

Support token stream, search-based and "freehand" bounding boxes

At a minimum, the application needs to support the user drawing bounding boxes in 3 modes:

Token Stream Annotation

  • The user draws a bounding box, which highlights a number of raw token bounding boxes described in #3
  • Extraneously highlighted raw spans can be deleted from the selected annotation

Search Based Annotation

  • The user can use text search to highlight the same phrase in the whole document.
  • Token bounding boxes are highlighted based on the raw token stream.
  • Bounding boxes highlighted by the search can be modified or deleted

Freehand

  • The user draws annotation boxes which are not aligned with the raw token stream.
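
As an illustration of the search-based mode described above, the following sketch finds every occurrence of a phrase in an ordered list of token texts and returns index ranges that could then be mapped back to token bounding boxes. The find_phrase function, the case-insensitive matching, and the whitespace tokenization of the query are assumptions of this example rather than a description of PAWLS.

```python
from typing import List, Sequence, Tuple


def find_phrase(tokens: Sequence[str], phrase: str) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of
    tokens whose text matches the whitespace-split phrase, ignoring case."""
    words = phrase.lower().split()
    if not words:
        return []
    lowered = [t.lower() for t in tokens]
    n = len(words)
    return [
        (i, i + n)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == words
    ]


stream = ["Journal", "of", "Machine", "Learning", "and", "machine", "learning"]
print(find_phrase(stream, "machine learning"))  # [(2, 4), (5, 7)]
```

Like the token stream mode, this approach depends on the quality of the extracted tokens: a phrase split or merged differently by the preprocessor would not be found.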

"Render" token level bounding boxes for a given paper sha

Given a PDF which has been ingested into the pdf structure service using Grobid, we will then want to fetch the lowest level of token stream annotation to display "invisibly" on top of the PDF. This will allow functionality such as "snapping" to a span, and also make it easy to correlate new annotations to the raw token stream.

An example response from this low level service would be:

GET /raw/{sha}/{page}/

Response:

{
    "pages": [
        {
            "page": {
                "index": 0,
                "width": 612.0,
                "height": 792.0
            },
            "tokens": [
                {
                    "text": "Journal",
                    "x": 90.0,
                    "y": 41.97740173339844,
                    "width": 27.878599166870117,
                    "height": 7.970099925994873,
                    "styleName": "style167"
                },
                {
                    "text": "of",
                    "x": 120.69999694824219,
                    "y": 41.97740173339844,
                    "width": 6.8224101066589355,
                    "height": 7.970099925994873,
                    "styleName": "style167"
                }
            ]
        }
    ]
}
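
Given a page shaped like the example response above, picking the tokens touched by a user's selection comes down to a rectangle overlap test. The sketch below assumes that x and y give the top-left corner of each token with y increasing downwards; the function names and the rounded coordinates are made up for illustration.

```python
from typing import Dict, List


def intersects(token: Dict, left: float, top: float,
               right: float, bottom: float) -> bool:
    """True if the token's box overlaps the selection rectangle."""
    return (
        token["x"] < right
        and token["x"] + token["width"] > left
        and token["y"] < bottom
        and token["y"] + token["height"] > top
    )


def select_tokens(page: Dict, left: float, top: float,
                  right: float, bottom: float) -> List[str]:
    """Return the text of every token on the page that the selection touches."""
    return [
        t["text"] for t in page["tokens"]
        if intersects(t, left, top, right, bottom)
    ]


page = {
    "page": {"index": 0, "width": 612.0, "height": 792.0},
    "tokens": [
        {"text": "Journal", "x": 90.0, "y": 41.98, "width": 27.88, "height": 7.97},
        {"text": "of", "x": 120.7, "y": 41.98, "width": 6.82, "height": 7.97},
    ],
}
print(select_tokens(page, 85.0, 40.0, 119.0, 50.0))  # ['Journal']
```

The selection ends at x = 119.0, just before the second token starts at x = 120.7, so only the first token is returned.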
