badideafactory / demand-editable-drafts
Home Page: https://billtotext.com
In order to perform the conversion from pdf to docx that we're interested in, we'll need to analyze uploaded pdfs in order to gain some sense of the internal structure.
There are general document processing techniques which we can use to infer things about document structure, but we can also leverage the info we know from Congress's guides to legislative style, and from the structure of Congress's legislative XML & its associated stylesheets.
tktk
As it turns out, Congress's Office of Legislative Counsel provides guidelines for legislation. These guidelines include specific advice about formatting and style.
The guide specifies a font hierarchy, and we can use it as the basis for assumptions about the draft legislative documents which are uploaded to the tool.
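A rough sketch of how a font hierarchy could be inferred from an uploaded document before mapping it onto the guide's hierarchy. It assumes the most frequent font size is body text and that larger sizes are headings; `inferFontHierarchy` and the role names are hypothetical, not taken from the guide or the repo.

```javascript
// Hypothetical helper: guess structural roles from observed font sizes.
// Assumption: the most common size is body text, larger sizes are headings.
function inferFontHierarchy(fontSizes) {
  const roles = new Map();
  const counts = new Map();
  for (const size of fontSizes) {
    counts.set(size, (counts.get(size) || 0) + 1);
  }
  if (counts.size === 0) return roles;

  // Find the most frequent size (first one seen wins ties).
  let bodySize = null;
  let bodyCount = -1;
  for (const [size, count] of counts) {
    if (count > bodyCount) {
      bodySize = size;
      bodyCount = count;
    }
  }

  // Larger sizes become heading-1, heading-2, ... from largest down.
  const headingSizes = [...counts.keys()]
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);
  headingSizes.forEach((size, i) => roles.set(size, `heading-${i + 1}`));

  roles.set(bodySize, 'body');
  for (const size of counts.keys()) {
    if (size < bodySize) roles.set(size, 'small');
  }
  return roles;
}
```

The resulting map could then be compared against the sizes the guide prescribes, which would be a separate lookup table.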
We have some reports that the font size ends up way too small. I suspect that there's an issue with the PDF rendering width determining the font size absolutely, rather than relative to the canvas width.
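If that suspicion is right, one possible fix is to derive every size from a single scale factor tied to the canvas width. This is a sketch only: `scaleForWidth` and `scaledFontSize` are hypothetical names, and it assumes the unscaled page width comes from `page.getViewport({ scale: 1 }).width` in pdf.js.

```javascript
// Hypothetical helpers: keep font sizes proportional to the canvas width.
// pageWidth: unscaled page width (e.g. from page.getViewport({ scale: 1 }).width)
// targetWidth: width in pixels of the canvas we are rendering into
function scaleForWidth(pageWidth, targetWidth) {
  if (!(pageWidth > 0)) {
    throw new RangeError('pageWidth must be a positive number');
  }
  return targetWidth / pageWidth;
}

// Font size as it should appear on a canvas of the given width.
function scaledFontSize(fontSize, pageWidth, targetWidth) {
  return fontSize * scaleForWidth(pageWidth, targetWidth);
}
```

The same factor can be passed to `page.getViewport({ scale })` so that the rendered page and any overlaid text agree on size.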
Just a thought. Would using Pandoc, or initially representing this information as Markdown, make it easier down the line to convert into all sorts of different formats beyond DOCX, like HTML or a future version of docx? This is out of spec, but I'm just thinking down the line about what may make future extensions easier.
Much to my dismay, defense appropriations bills (and perhaps other bills) include data tables which are laid out in text.
These tables (as @jazzido warned me about years ago) include text elements that span columns and all sorts of messes.
Short of building a Tabula clone, it may be sufficient to notify users that tables show up on certain pages (and that they aren't being handled).
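A crude heuristic for that notification might look like the following. It is a sketch under assumptions: the input shape (pages with `pageNumber` and `lines`, each line an array of items with `x` and `width`) and the thresholds are hypothetical, and a page is flagged when several consecutive lines each contain multiple wide horizontal gaps.

```javascript
// Count gaps between neighbouring items on a line that are at least minGap wide.
function countWideGaps(lineItems, minGap) {
  const sorted = [...lineItems].sort((a, b) => a.x - b.x);
  let gaps = 0;
  for (let i = 1; i < sorted.length; i++) {
    const prevEnd = sorted[i - 1].x + sorted[i - 1].width;
    if (sorted[i].x - prevEnd >= minGap) gaps++;
  }
  return gaps;
}

// Return the page numbers that look like they contain text-laid-out tables.
// All thresholds are guesses and would need tuning on real bills.
function pagesWithLikelyTables(
  pages,
  { minGap = 30, minGapsPerLine = 2, minLines = 3 } = {}
) {
  const flagged = [];
  for (const page of pages) {
    let run = 0; // consecutive column-like lines seen so far
    for (const line of page.lines) {
      run = countWideGaps(line, minGap) >= minGapsPerLine ? run + 1 : 0;
      if (run >= minLines) {
        flagged.push(page.pageNumber);
        break;
      }
    }
  }
  return flagged;
}
```

This will not handle text that spans columns; it only aims to tell users which pages to check by hand.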
The browser File API should get us where we need to go.
The File API (as MDN notes) has a variety of options and it's feasible to handle multiple files, although doing so in a manner which provides a speedy and pleasant user experience may require some thoughtfulness.
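A minimal sketch of multi-file handling with the File API, assuming the files come from an `<input type="file" multiple>` element. The helper names are hypothetical; `File.name`, `File.type`, and `arrayBuffer()` are standard.

```javascript
// Keep only files that look like PDFs, by MIME type or by extension.
function selectPdfFiles(files) {
  return Array.from(files).filter(
    (file) => file.type === 'application/pdf' || /\.pdf$/i.test(file.name)
  );
}

// Read every selected PDF into memory in parallel.
// Resolves to [{ name, bytes }] where bytes is a Uint8Array.
async function readPdfFiles(files) {
  return Promise.all(
    selectPdfFiles(files).map(async (file) => ({
      name: file.name,
      bytes: new Uint8Array(await file.arrayBuffer()),
    }))
  );
}
```

In a `change` handler this could be called as `readPdfFiles(event.target.files)`. Reading everything up front is the simplest option; for large uploads, reading one file at a time may give a better experience.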
In-browser access to PDF internals should be possible through pdf.js. Firefox relies on pdf.js as its main PDF display toolkit, and so we can rely on the repo being battle-hardened enough to reliably access and render PDF internals.
pdf.js renders pdf pages to an HTML Canvas element, which is a requirement for our fallback layout analysis strategy.
pdf.js isn't trivial to integrate unfortunately, but this is ground that's trod well enough that there are other examples we can follow. By default pdf.js isn't set up to be used as an ES6 import. Regretfully, Mozilla's instructions aren't super clear on how to integrate it, and it appears that pdf.js is set up to use CommonJS modules. There's a ticket about this.
There's a react-pdf library which wraps pdf.js and explains how they integrate pdf.js's workers.
We're importing the pdf.js core directly into the main app bundle to provide our ability to manipulate PDFs from the app.
It would be cool to document the purpose of the repo somewhere -- ideally both in a repo description but also in the README to explain what the project intends to do as well as how to set up a dev environment.
pdf.js affords us access to PDF internals. Unfortunately for us, PDF internals could mean many things. In the best case scenario, it means a digitally native PDF where each page has instructions on how to render text and where to position it on the page. In the worst case, it could mean a biiiiiiiiig image file.
Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.
In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
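The best-case path can be sketched as follows, assuming the input is the object resolved by pdf.js's `page.getTextContent()`, whose text items carry a `str` and a six-element `transform` with the x and y position in the last two slots. The helper names and the `tolerance` value are hypothetical.

```javascript
// True when the page has real text, so the OCR fallback is not needed.
function hasExtractableText(textContent) {
  return textContent.items.some(
    (item) => typeof item.str === 'string' && item.str.trim().length > 0
  );
}

// Group text items into lines by their baseline (y) position.
// tolerance: how far apart two baselines can be and still count as one line.
function groupIntoLines(items, tolerance = 2) {
  const positioned = items
    .filter((item) => typeof item.str === 'string' && item.str.trim() !== '')
    .map((item) => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
    }));

  // PDF coordinates start at the bottom left, so larger y is higher on the page.
  positioned.sort((a, b) => b.y - a.y || a.x - b.x);

  const lines = [];
  for (const item of positioned) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }

  // Within a line, read left to right.
  return lines.map((line) => ({
    y: line.y,
    text: line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(' '),
  }));
}
```

When `hasExtractableText` returns false, the page is probably a scanned image and would go down the Canvas plus tesseract.js route instead.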
Congressional bills number lines of bill text. Unfortunately they also include unnumbered text, including tables and section listings.
The docx file format provides flexibility around sections and line numbers, but implementations across MS Word, Google Docs, and LibreOffice are inconsistent, and there is no shared baseline of capabilities between the three.
So there's a question about what the correct priorities are in terms of preserving line numbers.
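Whatever the export priorities end up being, the extracted text first has to be split into numbered and unnumbered lines. A sketch, assuming each line is an array of items with `text` and `x`, and that line numbers are short integers sitting left of some margin position; `splitLineNumber` and `maxNumberX` are hypothetical.

```javascript
// Separate a leading line number from the rest of a line's text.
// maxNumberX: furthest-right x position at which a line number may start.
function splitLineNumber(lineItems, maxNumberX) {
  const sorted = [...lineItems].sort((a, b) => a.x - b.x);
  const first = sorted[0];
  const isNumbered =
    first !== undefined &&
    /^\d{1,2}$/.test(first.text.trim()) &&
    first.x <= maxNumberX;

  const rest = isNumbered ? sorted.slice(1) : sorted;
  return {
    lineNumber: isNumbered ? Number(first.text.trim()) : null,
    text: rest.map((item) => item.text).join(' '),
  };
}
```

Lines that come back with a `lineNumber` of `null` are the unnumbered text (tables, section listings) that needs separate handling.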
We'll go with exporting as a docx file.
There's a JavaScript library of the same name which should provide most of the functionality needed.
It is notably missing the ability to add attachments, so it won't be possible to include any legislative XML files unfortunately.
Additionally it's missing the ability to do line numbers (there's a pull request out for this), so if we include the line numbers we'll have to fake it with either a numbered list, or a columnar layout.
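For the columnar option, the data could be prepared as two-cell rows before it is handed to the docx library. This sketch only shapes the data and deliberately leaves out the docx calls; `toNumberedRows` and the input shape (`lineNumber` possibly `null`, plus `text`) are hypothetical.

```javascript
// Turn extracted lines into [numberCell, textCell] pairs for a two-column layout.
// Unnumbered lines get an empty first cell so the text column stays aligned.
function toNumberedRows(lines) {
  return lines.map((line) => [
    line.lineNumber == null ? '' : String(line.lineNumber),
    line.text,
  ]);
}
```

Each pair would then become one row of a narrow two-column table, for example via the library's `Table`, `TableRow`, and `TableCell` classes. A numbered list would not need this step, but it could not leave unnumbered text without a number.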
Our initial milestone is proving out that layout data can be extracted in browser via some combination of pdf.js and tesseract.js.
Integrate pdf.js with a Svelte app & render a test PDF
Extract layout data (with pdf.js or with tesseract.js)