badideafactory / demand-editable-drafts

Home Page: https://billtotext.com

CSS 0.19% HTML 2.28% JavaScript 6.20% XSLT 91.33%

demand-editable-drafts's Introduction

Made by Ted Han with 💖 & 🤔 for Demand Progress

demand-editable-drafts's Issues

Layout Analysis

Line repair (fixing SmallCaps and other random breaks in text elements)
White space detection
Tab & Column identification (Converting white space into things that are columns)
Inference about Fonts and structural hierarchy

To perform the PDF-to-DOCX conversion we're interested in, we'll need to analyze uploaded PDFs to get some sense of their internal structure.

There are general document processing techniques which we can use to infer things about document structure, but we can also leverage the info we know from Congress's guides to legislative style, and from the structure of Congress's legislative XML & its associated stylesheets.
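As a rough illustration of the line repair and column identification steps listed above, here is a minimal sketch. It assumes each text fragment has already been reduced to a hypothetical `{ str, x, width }` shape for a single line of text; the function names and the gap thresholds are illustrative choices, not anything taken from the project.

```javascript
// Each item is a hypothetical simplified text fragment: { str, x, width },
// with x and width in page units, all taken from one line of text.

// Line repair: merge fragments separated by less than maxGap
// (for example SmallCaps runs that were emitted as several pieces).
function repairLine(items, maxGap) {
  const sorted = [...items].sort((a, b) => a.x - b.x);
  const merged = [];
  for (const item of sorted) {
    const last = merged[merged.length - 1];
    if (last && item.x - (last.x + last.width) < maxGap) {
      last.str += item.str;
      last.width = item.x + item.width - last.x;
    } else {
      merged.push({ ...item });
    }
  }
  return merged;
}

// Column identification: start a new column wherever the white space
// between two neighbouring fragments is at least columnGap wide.
// Expects items already sorted left to right (as repairLine returns them).
function splitColumns(items, columnGap) {
  const columns = [];
  let prevEnd = null;
  for (const item of items) {
    if (prevEnd === null || item.x - prevEnd >= columnGap) {
      columns.push(item.str);
    } else {
      columns[columns.length - 1] += ' ' + item.str;
    }
    prevEnd = item.x + item.width;
  }
  return columns;
}
```

Running `splitColumns(repairLine(items, 1), 20)` on one line would first glue broken fragments back together and then split on the wide gaps. Real thresholds would probably need to scale with the font size.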

tktk

Font Identification

As it turns out, Congress's Office of Legislative Counsel provides guidelines for legislation. These guidelines include specific advice about formatting and style.

The guide specifies a font hierarchy, and we can use it as the basis for assumptions about the draft legislative documents uploaded to the tool.
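Until the guide's hierarchy is encoded, a generic fallback could rank the font sizes actually observed in a document. The sketch below is that generic heuristic only, not the hierarchy from the guide: it assumes hypothetical fragments of the form `{ str, fontSize }`, treats the size covering the most characters as body text, and labels larger sizes as heading levels. The role names are made up for illustration.

```javascript
// items: hypothetical fragments { str, fontSize }.
// Returns a Map from font size to an illustrative role name.
function inferFontRoles(items) {
  // Count how many characters are set in each font size.
  const charsBySize = new Map();
  for (const { str, fontSize } of items) {
    charsBySize.set(fontSize, (charsBySize.get(fontSize) || 0) + str.length);
  }

  // Body size: the size that covers the most characters.
  let bodySize = null;
  let maxChars = -1;
  for (const [size, chars] of charsBySize) {
    if (chars > maxChars) {
      bodySize = size;
      maxChars = chars;
    }
  }

  // Sizes larger than the body become heading levels, largest first.
  const headingSizes = [...charsBySize.keys()]
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);

  const roles = new Map();
  for (const size of charsBySize.keys()) {
    if (size === bodySize) {
      roles.set(size, 'body');
    } else if (size > bodySize) {
      roles.set(size, 'heading-' + (headingSizes.indexOf(size) + 1));
    } else {
      roles.set(size, 'small');
    }
  }
  return roles;
}
```

Once the guide's actual font hierarchy is written down as data, a lookup against it would likely be more reliable than this frequency-based guess.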

Investigate Font Sizing Issues

We have some reports that the font size ends up way too small. I suspect the PDF rendering width is determining the font size absolutely, rather than relative to the canvas width.
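If that suspicion is right, one possible fix is to derive a single scale factor from the page width and the width the page is actually drawn at, and to convert font sizes through it in both directions. This is only a sketch of the arithmetic; the function names are hypothetical, and it assumes the page width in points is available (in pdf.js, the viewport at scale 1 exposes a `width`).

```javascript
// pageWidthPt: page width in PDF points (the viewport width at scale 1).
// canvasWidthPx: the width the page is actually drawn at.
function renderScale(pageWidthPt, canvasWidthPx) {
  return canvasWidthPx / pageWidthPt;
}

// Size to draw a font at so it stays proportional to the canvas.
function scaledFontSizePx(fontSizePt, pageWidthPt, canvasWidthPx) {
  return fontSizePt * renderScale(pageWidthPt, canvasWidthPx);
}

// Inverse: recover the point size from a size measured on the canvas,
// which is the value that should end up in the exported document.
function pointSizeFromPx(measuredPx, pageWidthPt, canvasWidthPx) {
  return measuredPx / renderScale(pageWidthPt, canvasWidthPx);
}
```

For a US Letter page (612 points wide) drawn on a 1224-pixel canvas the scale is 2, so 14-point text is drawn at 28 pixels, and a 28-pixel measurement maps back to 14 points regardless of how wide the canvas happens to be.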

DOCS formats?

Just a thought: would using Pandoc, or initially representing this information as Markdown, make it easier down the line to convert into all sorts of formats beyond DOCX, like HTML or a future version of DOCX? This is out of spec; I'm just thinking ahead about what may make future extensions easier.
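To make the suggestion concrete, here is a minimal sketch of what a Markdown intermediate representation could look like. The block shape (`type`, `level`, `text`, `items`) is a hypothetical structure invented for this example, not something the project defines.

```javascript
// blocks: hypothetical intermediate representation, e.g.
//   { type: 'heading', level: 1, text: 'A BILL' }
//   { type: 'paragraph', text: '...' }
//   { type: 'list', items: ['...', '...'] }
function toMarkdown(blocks) {
  const parts = blocks.map((block) => {
    switch (block.type) {
      case 'heading':
        return '#'.repeat(block.level) + ' ' + block.text;
      case 'list':
        return block.items.map((item) => '- ' + item).join('\n');
      default:
        // Treat anything else as a plain paragraph.
        return block.text;
    }
  });
  // Blank lines between blocks, single trailing newline.
  return parts.join('\n\n') + '\n';
}
```

The resulting file could then be handed to Pandoc on a server or desktop, for example `pandoc draft.md -o draft.docx` or `pandoc draft.md -o draft.html`. Markdown has no notion of line numbers or precise fonts, so this route would trade layout fidelity for portability.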

Build a Tabula clone to handle data tables

Much to my dismay, defense appropriations bills (and perhaps other bills) include data tables which are laid out in text.

These tables (as @jazzido warned me about years ago) include text elements that span columns and all sorts of messes.

Short of building a Tabula clone, it may be sufficient to notify users that tables show up on certain pages (and that they aren't being handled).
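A notification like that needs some way to guess which pages contain tables. One crude heuristic, sketched below, flags a page when several consecutive lines each contain multiple wide gaps. Everything here is an assumption for illustration: the `{ x, width }` fragment shape, the function names, the default thresholds, and the wording of the notice.

```javascript
// Count the wide horizontal gaps between fragments on one line.
// line: array of hypothetical fragments { x, width }.
function countWideGaps(line, minGap) {
  const sorted = [...line].sort((a, b) => a.x - b.x);
  let gaps = 0;
  for (let i = 1; i < sorted.length; i++) {
    const prevEnd = sorted[i - 1].x + sorted[i - 1].width;
    if (sorted[i].x - prevEnd >= minGap) gaps++;
  }
  return gaps;
}

// pages: array of pages, each an array of lines, each an array of fragments.
// Returns 1-based page numbers that have at least minRows consecutive lines
// with minGapsPerLine or more wide gaps each.
function pagesWithLikelyTables(
  pages,
  { minGap = 20, minGapsPerLine = 2, minRows = 3 } = {}
) {
  const flagged = [];
  pages.forEach((lines, index) => {
    let run = 0;
    for (const line of lines) {
      run = countWideGaps(line, minGap) >= minGapsPerLine ? run + 1 : 0;
      if (run >= minRows) {
        flagged.push(index + 1);
        break;
      }
    }
  });
  return flagged;
}

// Build the user-facing message (empty string when nothing was flagged).
function tableNotice(pageNumbers) {
  if (pageNumbers.length === 0) return '';
  return (
    'Possible tables on page(s) ' +
    pageNumbers.join(', ') +
    '. Tables are not converted yet.'
  );
}
```

This will miss tables whose cells span columns and may flag multi-column prose, so it is only suitable for a warning, not for extraction.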

Loading User PDFs

The browser File API should get us where we need to go.

The File API (as MDN notes) has a variety of options and it's feasible to handle multiple files, although doing so in a manner which provides a speedy and pleasant user experience may require some thoughtfulness.
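As a small sketch of the multi-file case, the code below reads each selected file into memory and checks that it starts with the PDF header before handing it on. `looksLikePdf` and `readPdfFiles` are hypothetical names; `arrayBuffer()` is the standard method on `File` and `Blob` objects. The header check is deliberately strict and is an assumption about what should count as a valid upload.

```javascript
// True when the bytes start with the PDF header "%PDF-".
function looksLikePdf(bytes) {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= magic.length &&
    magic.every((byte, i) => bytes[i] === byte)
  );
}

// Read every selected file and report which ones look like PDFs.
// `files` is a FileList or an array of File objects, for example from
// <input type="file" multiple accept="application/pdf">.
async function readPdfFiles(files) {
  const results = [];
  for (const file of Array.from(files)) {
    // Files are read one at a time to keep memory use predictable.
    const bytes = new Uint8Array(await file.arrayBuffer());
    results.push({ name: file.name, ok: looksLikePdf(bytes), bytes });
  }
  return results;
}
```

In a page this would typically be called from the input's `change` handler with `event.target.files`. Reading sequentially keeps things simple; for many large files, showing per-file progress would probably matter more than raw speed.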

PDF Rendering

In-browser access to PDF internals should be possible through pdf.js. Firefox relies on pdf.js as its main PDF display toolkit, so we can rely on the repo being battle-hardened enough to reliably access and render PDF internals.

pdf.js renders PDF pages to an HTML Canvas element, which is a requirement for our fallback layout analysis strategy.

Unfortunately, pdf.js isn't trivial to integrate, but this ground is well trodden enough that there are other examples we can follow. By default, pdf.js isn't set up to be used as an ES6 import. Regretfully, Mozilla's instructions aren't super clear on how to integrate it, and it appears that pdf.js is set up to use CommonJS modules. There's a ticket about this.

There's a react-pdf library which wraps pdf.js and explains how they integrate pdf.js's workers.

We're importing the pdf.js core directly into the main app bundle to provide our ability to manipulate PDFs from the app.

Add description somewhere

It would be cool to document the purpose of the repo somewhere, ideally both in the repo description and in the README, to explain what the project intends to do as well as how to set up a dev environment.

Text & Layout Analysis Feasibility

pdf.js affords us access to PDF internals. Unfortunately for us, PDF internals could mean many things. In the best case, it means a digitally native PDF where each page has instructions on how to render text and where to place it on the page. In the worst case, it could mean a biiiiiiiiig image file.

Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.

In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
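As a sketch of the best case, the functions below work on the array of text items that pdf.js returns from a page's `getTextContent()` call, where each item carries a `str`, a `width`, and a six-element `transform` matrix whose last two entries are the x and y position. The helper names, the grouping tolerance, and the rule "no text items means fall back to OCR" are assumptions for illustration, not the project's method.

```javascript
// Convert pdf.js text items into simple positioned boxes.
// transform is [a, b, c, d, e, f]; e and f are the x/y position and
// the length of (c, d) approximates the rendered font height.
function toBoxes(items) {
  return items
    .filter((item) => item.str.trim() !== '')
    .map((item) => ({
      str: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      fontSize: Math.hypot(item.transform[2], item.transform[3]),
    }));
}

// Group boxes into lines. PDF y coordinates grow upward, so sorting by
// descending y yields lines from the top of the page to the bottom.
function groupIntoLines(boxes, tolerance = 2) {
  const sorted = [...boxes].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines = [];
  for (const box of sorted) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(line.y - box.y) <= tolerance) {
      line.boxes.push(box);
    } else {
      lines.push({ y: box.y, boxes: [box] });
    }
  }
  // Within each line, order fragments left to right.
  for (const line of lines) {
    line.boxes.sort((a, b) => a.x - b.x);
  }
  return lines;
}

// A page with no usable text items is probably a scanned image,
// so it would need the Canvas + tesseract.js fallback.
function needsOcr(items) {
  return toBoxes(items).length === 0;
}
```

A caller would pass `textContent.items` from pdf.js into `needsOcr` first, and only run `groupIntoLines(toBoxes(items))` when text is actually present.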

What to do about line numbers?

Congressional bills number lines of bill text. Unfortunately, they also include unnumbered text, including tables and section listings.

The DocX file format provides flexibility around sections and line numbering, but implementations across MS Word, Google Docs, and LibreOffice are inconsistent, and there is no shared baseline of capabilities among the three.

So there's a question about what the correct priorities are in terms of preserving line numbers.

The only way to do auto-incrementing line numbers (which is nice if someone is editing the document) is to use the line numbering feature in the DocX format. Unfortunately, line numbers are per-section, and sections have inconsistent support.
It is possible to build a table of line numbers and lines to fake the numbers.
It may be possible to use numbered lists; however, doing so will be messy in terms of keeping numbering consistent in the right places.
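The table-based option above can be sketched as a function that pairs each line with either a number or an empty cell. The line shape (`text`, `numbered`, `pageBreakBefore`) and the `restartEachPage` option are assumptions for this example; whether numbering should restart on each page is a decision the source leaves open.

```javascript
// lines: hypothetical records { text, numbered, pageBreakBefore? }.
// Returns table rows as [numberCell, textCell] pairs; unnumbered lines
// (tables, section listings, and so on) get an empty number cell.
function buildLineNumberRows(lines, { restartEachPage = true } = {}) {
  let counter = 0;
  return lines.map((line) => {
    // Assumption: numbering restarts at the top of each page when enabled.
    if (restartEachPage && line.pageBreakBefore) counter = 0;
    if (!line.numbered) return ['', line.text];
    counter += 1;
    return [String(counter), line.text];
  });
}
```

Numbers produced this way are static text: they keep the original numbering visible, but they will not renumber themselves when someone edits the document, which is the trade-off against the DocX line numbering feature.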

Wrap processing up in a web worker

Document Output

We'll go with exporting as a docx file.

There's a JavaScript library of the same name which should provide most of the functionality needed.

It is notably missing the ability to add attachments, so it won't be possible to include any legislative XML files unfortunately. ☹️

Additionally, it's missing the ability to do line numbers (there's a pull request out for this), so if we include line numbers we'll have to fake them with either a numbered list or a columnar layout.

Progress bar (and error messages)

Editable Draft MVP

Phase 1: Technology proving ground

Our initial milestone is proving that layout data can be extracted in the browser via some combination of pdf.js and tesseract.js.

Load pdf.js with a Svelte app & render a test PDF
Extract text & layout from PDF (with pdf.js or with tesseract.js)

Phase 2: Main Functionality

Loading PDFs from the user (and perhaps by URL?)
Layout analysis to find main content
Font style identification?
Site layout designs
Decisions about output formats

Phase 3: App & Deployment

Analytics configuration
Stress testing & performance improvement
Site Copy
Ad hoc user testing & further refinements
