demand-editable-drafts's Introduction

Made by Ted Han & with 💖 & 🤔 for Demand Progress

demand-editable-drafts's People

Contributors

knowtheory

demand-editable-drafts's Issues

Layout Analysis

To convert PDFs to docx, we'll need to analyze uploaded PDFs to get some sense of their internal structure.

There are general document processing techniques which we can use to infer things about document structure, but we can also leverage the info we know from Congress's guides to legislative style, and from the structure of Congress's legislative XML & its associated stylesheets.

tktk

Font Identification

As it turns out, Congress's Office of Legislative Counsel provides guidelines for legislation. These guidelines include specific advice about formatting and style.

The guide specifies a font hierarchy, and we can use that hierarchy as the basis for assumptions about the draft legislative documents which are uploaded to the tool.
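As a rough sketch of how a font hierarchy could drive those assumptions, the lookup below maps a font name and size to a structural role. Everything here is hypothetical: the role names, the `FontRule` shape, and especially the values in `placeholderRules`, which are placeholders for illustration and are not taken from the guide.

```typescript
// Structural roles we might want to recognize (hypothetical labels).
type StructuralRole = "title" | "section-heading" | "body";

// One rule in the hierarchy: a font-name fragment plus a minimum size in points.
interface FontRule {
  fontNameIncludes: string;
  minSize: number;
  role: StructuralRole;
}

// PLACEHOLDER values for illustration only; replace with the guide's real hierarchy.
const placeholderRules: FontRule[] = [
  { fontNameIncludes: "Bold", minSize: 14, role: "title" },
  { fontNameIncludes: "Bold", minSize: 10, role: "section-heading" },
];

// Return the role of the first matching rule, falling back to body text.
function roleForFont(fontName: string, size: number, rules: FontRule[]): StructuralRole {
  for (const rule of rules) {
    if (fontName.indexOf(rule.fontNameIncludes) !== -1 && size >= rule.minSize) {
      return rule.role;
    }
  }
  return "body";
}
```

Rules are checked in order, so the most specific rules should come first.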

Investigate Font Sizing Issues

We have some reports that the font size ends up way too small. I suspect that the PDF rendering width is determining the font size absolutely, rather than relative to the canvas width.
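If that suspicion is right, one possible fix is to treat the font size as a proportion of the page width and rescale it to whatever width the canvas is displayed at. This is only a sketch of the arithmetic; `scaledFontSize` and its parameters are hypothetical names, not code from the app.

```typescript
// Scale a font size measured in PDF page units to the width the page is
// actually rendered at, so text keeps the same proportion of the page.
function scaledFontSize(pdfFontSize: number, pdfPageWidth: number, renderedWidth: number): number {
  if (pdfPageWidth <= 0) {
    throw new RangeError("pdfPageWidth must be positive");
  }
  return pdfFontSize * (renderedWidth / pdfPageWidth);
}
```

For example, 12-unit text on a 612-unit-wide page rendered into a 1224px-wide canvas would come out at 24px.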

DOCS formats?

Just a thought. Would using Pandoc, or initially representing this information as Markdown, make it easier down the line to convert into other formats beyond DOCX, like HTML or a future version of docx? This is out of spec, but I'm thinking about what might make future extensions easier.
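To make the idea concrete, here is a minimal sketch of an intermediate representation that is rendered to Markdown, which a converter such as Pandoc could then turn into other formats. The `DraftBlock` type and `toMarkdown` function are hypothetical and only cover headings and paragraphs.

```typescript
// Hypothetical intermediate representation: a flat list of typed blocks.
type DraftBlock =
  | { kind: "heading"; level: number; text: string }
  | { kind: "paragraph"; text: string };

// Render the blocks as Markdown, one blank line between blocks.
function toMarkdown(blocks: DraftBlock[]): string {
  const parts: string[] = [];
  for (const block of blocks) {
    if (block.kind === "heading") {
      // Clamp the level to Markdown's 1..6 range and build the "#" prefix.
      const level = Math.min(6, Math.max(1, block.level));
      let hashes = "";
      for (let i = 0; i < level; i++) {
        hashes += "#";
      }
      parts.push(hashes + " " + block.text);
    } else {
      parts.push(block.text);
    }
  }
  return parts.join("\n\n") + "\n";
}
```

Keeping the representation separate from any one output format means the docx exporter would become just one renderer among several.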

Build a Tabula clone to handle data tables

Much to my dismay, defense appropriations bills (and perhaps other bills) include data tables which are laid out in text.

These tables (as @jazzido warned me about years ago) include text elements that span columns and all sorts of messes.

Short of building a Tabula clone, it may be sufficient to notify users that tables show up on certain pages (and that they aren't being handled).
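A cheap way to produce that notification might be to flag pages where several consecutive lines contain unusually wide horizontal gaps between text runs. The sketch below is a heuristic under stated assumptions, not Tabula's approach: the `PositionedRun` shape and the `minGap` and `minRows` defaults are hypothetical and would need tuning against real bills.

```typescript
// A positioned run of text on one line (hypothetical shape; units are PDF points).
interface PositionedRun {
  x: number;
  width: number;
}

// Count the gaps between consecutive runs on a line that are wider than minGap.
function wideGapCount(line: PositionedRun[], minGap: number): number {
  const sorted = line.slice().sort((a, b) => a.x - b.x);
  let gaps = 0;
  let previousEnd: number | null = null;
  for (const run of sorted) {
    if (previousEnd !== null && run.x - previousEnd > minGap) {
      gaps++;
    }
    previousEnd = run.x + run.width;
  }
  return gaps;
}

// Flag a page when at least minRows consecutive lines each contain a wide gap.
function pageLikelyHasTable(lines: PositionedRun[][], minGap = 20, minRows = 3): boolean {
  let streak = 0;
  for (const line of lines) {
    streak = wideGapCount(line, minGap) >= 1 ? streak + 1 : 0;
    if (streak >= minRows) {
      return true;
    }
  }
  return false;
}
```

Text elements that span columns will defeat this on individual rows, which is why it looks for a run of rows rather than a single one.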

PDF Rendering

In-browser access to PDF internals should be possible through pdf.js. Firefox relies on pdf.js as its main PDF display toolkit, so we can rely on the repo being battle-hardened enough to reliably access and render PDF internals.

pdf.js renders pdf pages to an HTML Canvas element, which is a requirement for our fallback layout analysis strategy.

pdf.js isn't trivial to integrate, unfortunately, but this ground is well trodden enough that there are other examples we can follow. By default pdf.js isn't set up to be used as an ES6 import. Regretfully, Mozilla's instructions aren't super clear on how to integrate it, and it appears that pdf.js is set up to use CommonJS modules. There's a ticket about this.

There's a react-pdf library which wraps pdf.js and explains how they integrate pdf.js's workers.

We're importing the pdf.js core directly into the main app bundle to provide our ability to manipulate PDFs from the app.

Add description somewhere

It would be cool to document the purpose of the repo somewhere -- ideally both in a repo description but also in the README to explain what the project intends to do as well as how to set up a dev environment.

Text & Layout Analysis Feasibility

pdf.js affords us access to PDF internals. Unfortunately for us, PDF internals could mean many things. In the best case scenario, it means a digitally native PDF where each page has instructions on how to render text and what positions to put it onto the page, or in the worst case, it could mean a biiiiiiiiig image file.

Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.
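The choice between the two cases could be made per page. The sketch below assumes the text items come from pdf.js's `getTextContent()` and only looks at their `str` field; the function and type names are hypothetical.

```typescript
// Only the `str` field of a pdf.js text item matters for this check.
interface ExtractedText {
  str: string;
}

type ExtractionStrategy = "native-text" | "ocr";

// Use the PDF's own text when a page has any non-whitespace text;
// otherwise fall back to OCR on the rendered canvas.
function chooseExtractionStrategy(items: ExtractedText[]): ExtractionStrategy {
  for (const item of items) {
    if (item.str.trim().length > 0) {
      return "native-text";
    }
  }
  return "ocr";
}
```

A scanned page typically yields no text items at all, so it would be routed to tesseract.js.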

In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
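In that best case, a first step might be grouping positioned text items into lines. This sketch assumes items shaped like those returned by pdf.js's `getTextContent()`, where `transform[4]` and `transform[5]` hold the x and y of the text origin; `TextItemLike`, `TextLine`, `groupIntoLines`, and the `tolerance` default are hypothetical.

```typescript
// Subset of the fields on a pdf.js text item that this sketch relies on.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]; x and y locate the text origin
}

interface TextLine {
  y: number;
  text: string;
}

// Group items whose y positions are within `tolerance` units of the first item
// in a line. Lines are ordered top to bottom (PDF y grows upward) and the items
// within each line are ordered left to right.
function groupIntoLines(items: TextItemLike[], tolerance = 2): TextLine[] {
  const sorted = items.slice().sort((a, b) => b.transform[5] - a.transform[5]);
  const buckets: { y: number; items: TextItemLike[] }[] = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = buckets.length > 0 ? buckets[buckets.length - 1] : undefined;
    if (last !== undefined && Math.abs(last.y - y) <= tolerance) {
      last.items.push(item);
    } else {
      buckets.push({ y: y, items: [item] });
    }
  }
  return buckets.map((bucket) => ({
    y: bucket.y,
    text: bucket.items
      .slice()
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(" "),
  }));
}
```

The tolerance absorbs small baseline differences between runs on the same line, such as a font change mid-line.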

What to do about line numbers?

Congressional bills number lines of bill text. Unfortunately they also include unnumbered text, including tables and section listings.

The DocX file format provides flexibility around sections and line numbers, but the implementations across MS Word, Google Docs, and LibreOffice are inconsistent, and there is no shared baseline of capabilities between the three.

So there's a question about what the correct priorities are in terms of preserving line numbers.

  • The only way to do auto-incrementing line numbers (which is nice if someone is editing the document) is to use the line numbering feature in the DocX format. Unfortunately line numbers are per-section, and sections have inconsistent support.
  • It is possible to build a table of line numbers and lines to fake numbers.
  • It may be possible to use numbered lists, however doing so will be messy in terms of keeping numbering consistent in the right places.
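The table option above can be sketched as a function that pairs each line with a label and leaves the label empty for unnumbered text. The `BillLine` shape and `buildLineNumberRows` are hypothetical, and how numbering restarts (per page or per section, say) is left to the caller via `start`.

```typescript
// One line of bill text; `numbered` is false for tables, section listings, etc.
interface BillLine {
  text: string;
  numbered: boolean;
}

// Pair each line with its label: a running number for numbered lines and an
// empty string otherwise. Each row is [label, text] for a two-column table.
function buildLineNumberRows(lines: BillLine[], start = 1): [string, string][] {
  const rows: [string, string][] = [];
  let next = start;
  for (const line of lines) {
    if (line.numbered) {
      rows.push([String(next), line.text]);
      next++;
    } else {
      rows.push(["", line.text]);
    }
  }
  return rows;
}
```

The trade-off is that these numbers are static text, so they won't renumber when someone edits the document.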

Document Output

We'll go with exporting as a docx file.

There's a JavaScript library of the same name which should provide most of the functionality needed.

It is notably missing the ability to add attachments, so it won't be possible to include any legislative XML files unfortunately. ☹️

Additionally it's missing the ability to do line numbers (there's a pull request out for this), so if we include the line numbers we'll have to fake it with either a numbered list, or a columnar layout.

Editable Draft MVP

Phase 1: Technology proving ground

Our initial milestone is proving out that layout data can be extracted in browser via some combination of pdf.js and tesseract.js.

Phase 2: Main Functionality

Phase 3: App & Deployment

  • Analytics configuration
  • Stress testing & performance improvement
  • Site Copy
  • Ad hoc user testing & further refinements
