Awesome Document Understanding

A curated list of resources on Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) over unstructured data, especially from Visually Rich Documents (VRDs).

Note 1: bolded entries are more important than the others.

Note 2: due to the novelty of the field, this list is under construction - contributions are welcome (thank you in advance!). Please remember to use the following convention:





Table of contents

  1. Introduction
  2. Research topics
    1. Key Information Extraction (KIE)
    2. Document Layout Analysis (DLA)
    3. Document Question Answering (DQA)
    4. Scientific Document Understanding (SDU)
    5. Optical Character Recognition (OCR)
    6. Related
      1. General
      2. Tabular Data Comprehension (TDC)
      3. Robotic Process Automation (RPA)
  3. Others
    1. Resources
      1. Datasets for Pre-training Language Models
      2. PDF processing tools
    2. Conferences / workshops
    3. Blogs
    4. Solutions
  4. Examples
    1. Visually Rich Documents (VRDs)
    2. Key Information Extraction (KIE)
    3. Document Layout Analysis (DLA)
    4. Document Question Answering (DQA)
  5. Inspirations

Introduction

Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. (Source: A Survey of Deep Learning Approaches for OCR and Document Understanding, listed under Papers, 2020.)

Papers

2021

  • Efficient Automated Processing of the Unstructured Documents using Artificial Intelligence: A Systematic Literature Review and Future Directions

    Dipali Baviskar, Swati Ahirrao, Vidyasagar Potdar, Ketan Kotecha
    IEEE Access 2021

    The unstructured data impacts 95% of the organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. The traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to recognize, and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were adhered to conduct a literature search on various databases between 2010 and 2020. We found that: 1. The existing information extraction techniques are template-based or rule-based, 2. The existing methods lack the capability to tackle complex document layouts in real-time situations such as invoices and purchase orders, 3. The datasets available publicly are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR discovered that AI-based approaches have a strong potential to extract useful information from unstructured documents automatically. However, they face certain challenges in processing multiple layouts of the unstructured documents. Our SLR brings out conceptualization of a framework for construction of high-quality unstructured documents dataset with strong data validation techniques for automated information extraction. Our SLR also reveals a need for a close association between the businesses and researchers to handle various challenges of the unstructured data analysis.

  • Automating Paperwork

    Ted Benson - 2021

    Automating Paperwork is a practical, no-hype technical guide for business leaders, product managers, and operations teams who are pursuing a document automation initiative at their company. Its goal is to provide an end-to-end tour of the technical decisions and tradeoffs involved so that you can prepare for success, know what to expect, and ask the right questions of engineers and vendors along the way.

2020

  • A Survey of Deep Learning Approaches for OCR and Document Understanding

    Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam
    ML-RSA Workshop at NeurIPS 2020

    Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. In this survey paper, we review different techniques for document understanding for documents written in English and consolidate methodologies present in literature to act as a jumping-off point for researchers exploring this area.

  • Conversations with Documents. An Exploration of Document-Centered Assistance

    Maartje ter Hoeve, Robert Sim, Elnaz Nouri, Adam Fourney, Maarten de Rijke, Ryen W. White
    CHIIR 2020

    The role of conversational assistants has become more prevalent in helping people increase their productivity. Document-centered assistance, for example to help an individual quickly review a document, has seen less significant progress, even though it has the potential to tremendously increase a user's productivity. This type of document-centered assistance is the focus of this paper. Our contributions are three-fold: (1) We first present a survey to understand the space of document-centered assistance and the capabilities people expect in this scenario. (2) We investigate the types of queries that users will pose while seeking assistance with documents, and show that document-centered questions form the majority of these queries. (3) We present a set of initial machine learned models that show that (a) we can accurately detect document-centered questions, and (b) we can build reasonably accurate models for answering such questions. These positive results are encouraging, and suggest that even greater results may be attained with continued study of this interesting and novel problem space. Our findings have implications for the design of intelligent systems to support task completion via natural interactions with documents.

2018

  • Future paradigms of automated processing of business documents

    Matteo Cristani, Andrea Bertolaso, Simone Scannapieco, Claudio Tomazzoli
    International Journal of Information Management 2018

    In this paper we summarize the results obtained so far in the communities interested in the development of automated processing techniques as applied to business documents, and devise a few evolutions that are demanded by the current stage of either those techniques by themselves or by collateral sector advancements. It emerges a clear picture of a field that has put an enormous effort in solving problems that changed a lot during the last 30 years, and is now rapidly evolving to incorporate document processing into workflow management systems on one side and to include features derived by the introduction of cloud computing technologies on the other side. We propose an architectural schema for business document processing that comes from the two above evolution lines.

Older

  • Machine Learning for Intelligent Processing of Printed Documents

    F. Esposito, D. Malerba, F. Lisi - 2004

    A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.

  • Document Understanding: Research Directions

    S. Srihari, S. Lam, V. Govindaraju, R. Srihari, J. Hull - 1994

    A document image is a visual representation of a printed page such as a journal article page, a facsimile cover page, a technical document, an office letter, etc. Document understanding as a research endeavor consists of studying all processes involved in taking a document through various representations: from a scanned physical document to high-level semantic descriptions of the document. Some of the types of representation that are useful are: editable descriptions, descriptions that enable exact reproductions and high-level semantic descriptions about document content. This report is a definition of five research subdomains within document understanding as pertaining to predominantly printed documents. The topics described are: modular architectures for document understanding; decomposition and structural analysis of documents; model-based OCR; table, diagram and image understanding; and performance evaluation under distortion and noise.

Research topics

Others

Resources

Back to top

Datasets for Pre-training Language Models

  1. The RVL-CDIP Dataset - consists of 400,000 grayscale images in 16 classes, with 25,000 images per class
  2. The Industry Documents Library - a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
  3. Color Document Dataset - from the Intelligent Sensory Information Systems, University of Amsterdam
  4. The IIT CDIP Test Collection - link currently broken, see the GitHub discussion

PDF processing tools

  1. borb - a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.)
  2. pawls - PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
  3. pdfplumber - Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging
  4. Pdfminer.six - Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data
  5. Layout Parser - Layout Parser is a deep learning based tool for document image layout analysis tasks
  6. Tabulo - Table extraction from images
  7. OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
  8. PDFBox - The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents
  9. PdfPig - This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C#
  10. parsing-prickly-pdfs - Resources and worksheet for the NICAR 2016 workshop of the same name
  11. pdf-text-extraction-benchmark - PDF tools benchmark
  12. Born digital pdf scanner - checks whether a PDF is born-digital
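
To illustrate what a born-digital check can look like, here is a minimal Python sketch. It is an assumption for illustration, not the method used by the Born digital pdf scanner project: it treats a PDF as born-digital when most pages carry an extractable text layer. The names `looks_born_digital` and `count_chars_per_page`, and the two thresholds, are hypothetical. The optional helper relies on pdfplumber's `pdfplumber.open`, `pdf.pages` and `page.chars`.

```python
def looks_born_digital(chars_per_page, min_chars=20, min_page_ratio=0.8):
    """Heuristic: True when enough pages carry an extractable text layer.

    chars_per_page: number of text-layer characters found on each page.
    min_chars and min_page_ratio are illustrative thresholds, not tuned values.
    """
    if not chars_per_page:
        return False
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    return pages_with_text / len(chars_per_page) >= min_page_ratio


def count_chars_per_page(pdf_path):
    """Count text-layer characters on each page with pdfplumber."""
    # Third-party dependency, imported lazily so the heuristic above
    # can be used and tested without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [len(page.chars) for page in pdf.pages]


# Per-page character counts for two made-up documents.
digital_like = looks_born_digital([1200, 950, 1100])
scan_like = looks_born_digital([0, 0, 3])
```

A scanned PDF that has already been given an OCR text layer (for example by OCRmyPDF) would also pass this check, so the heuristic only separates PDFs with a text layer from pure image scans.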

Conferences / workshops

Back to top

General / Business / Finance

  1. International Conference on Document Analysis and Recognition (ICDAR) [2021, 2019, 2017]
  2. Workshop on Document Intelligence (DI) [2021, 2019]
  3. Financial Narrative Processing Workshop (FNP) [2021, 2020, 2019]
  4. Workshop on Economics and Natural Language Processing (ECONLP) [2021, 2019, 2018]
  5. International Workshop on Document Analysis Systems (DAS) [2020, 2018, 2016]
  6. ACM International Conference on AI in Finance (ICAIF)
  7. The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
  8. CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
  9. KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
  10. FinIR 2020: The First Workshop on Information Retrieval in Finance
  11. 2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
  12. Document Understanding Conference (DUC 2007)

Scientific Document Understanding

  1. The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
  2. First Workshop on Scholarly Document Processing (SDProc 2020)
  3. International Workshop on SCIentific DOCument Analysis (SCIDOCA) [2020, 2018, 2017]

Blogs

Back to top

  1. A Survey of Document Understanding Models, 2021
  2. Document Form Extraction, 2021
  3. How to automate processes with unstructured data, 2021
  4. A Comprehensive Guide to OCR with RPA and Document Understanding, 2021
  5. Information Extraction from Receipts with Graph Convolutional Networks, 2021
  6. How to extract structured data from invoices, 2021
  7. Extracting Structured Data from Templatic Documents, 2020
  8. To apply AI for good, think form extraction, 2020
  9. UiPath Document Understanding Solution Architecture and Approach, 2020
  10. How Can I Automate Data Extraction from Complex Documents?, 2020
  11. LegalTech: Information Extraction in legal documents, 2020

Solutions

Back to top

Big companies:

  1. ABBYY
  2. Accenture
  3. Amazon
  4. Google
  5. Microsoft
  6. UiPath

Smaller:

  1. Applica.ai
  2. Docstack
  3. Element AI
  4. Indico
  5. Nanonets
  6. Rossum
  7. Silo

Examples

Visually Rich Documents

Back to top

In VRDs, layout information is crucial to understanding the whole document correctly (this is the case with almost all business documents). For humans, spatial information improves readability and speeds up document understanding.
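
A minimal Python sketch can make the role of layout concrete. It assumes OCR output in the form of tokens with bounding boxes; the `Token` class, the `value_right_of` helper and all coordinates are hypothetical and chosen for illustration only. The flattened text follows the left column first, so the word after "Total:" is the date, while a lookup along the same visual line recovers the amount.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def value_right_of(tokens, key, y_tolerance=5.0):
    """Return the text of the nearest token to the right of `key` on the same line."""
    anchor = next((t for t in tokens if t.text == key), None)
    if anchor is None:
        return None
    anchor_mid = (anchor.y0 + anchor.y1) / 2
    candidates = [
        t for t in tokens
        if t is not anchor
        and t.x0 >= anchor.x1
        and abs((t.y0 + t.y1) / 2 - anchor_mid) <= y_tolerance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.x0 - anchor.x1).text


# Made-up two-column header, listed in the order a column-first reader would emit it.
tokens = [
    Token("Date:", 10, 10, 50, 22),
    Token("Total:", 10, 40, 55, 52),
    Token("2021-03-01", 300, 10, 380, 22),
    Token("149.00", 300, 40, 350, 52),
]

flat_text = " ".join(t.text for t in tokens)
word_after_total_in_text = flat_text.split()[flat_text.split().index("Total:") + 1]
total_by_layout = value_right_of(tokens, "Total:")
```

Here the plain-text neighbour of "Total:" is the date, while the layout-aware lookup returns the amount printed on the same line.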

Invoice / Resume / Job Ad



NDA / Annual reports



Key Information Extraction

Back to top

The aim of this task is to extract texts of a number of key fields from a given collection of documents containing similar key entities.
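
As a rough illustration of the task's input and output, the sketch below applies a rule-based baseline to a small collection of receipt texts. It is an assumption made for illustration, not the approach of any paper or dataset listed here; the field names, the patterns in `FIELD_PATTERNS`, the `extract_key_fields` helper and the receipt texts are all hypothetical.

```python
import re

# Hypothetical patterns for two key fields shared by all documents in the collection.
FIELD_PATTERNS = {
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total": re.compile(
        r"^\s*total\s*:?\s*\$?\s*(\d+(?:\.\d{2})?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_key_fields(text):
    """Return one value (or None) per key field for a single document."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Made-up OCR text of two receipts with similar key entities but different wording.
receipts = [
    "ACME STORE\n2021-03-01\nCoffee 3.50\nTOTAL: 3.50\n",
    "Corner Shop\nDate 2020-12-24\nTea 2.00\nCake 4.25\nTotal 6.25\n",
]

extracted = [extract_key_fields(text) for text in receipts]
```

Rules like these break quickly when layouts vary, which is the limitation of template-based and rule-based extraction noted in the 2021 systematic literature review listed above.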


Scanned Receipts



NDA / Annual reports

Examples of real business applications and data for the Kleister datasets (the key entities are in blue)



Multimedia Online Flyers

An example of a commercial real estate flyer and manually entered listing information © ProMaker Commercial Real Estate LLC, © BrokerSavant Inc.



Value-added tax invoice



Webpages



Document Layout Analysis

Back to top

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. (https://en.wikipedia.org/wiki/Document_layout_analysis)
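
The two steps described above, separating text zones from non-text zones and arranging them in reading order, can be sketched in Python as follows. This is a simplified illustration under a two-column page assumption, not an implementation of any listed tool; the `Block` class, the `reading_order` function and the coordinates are hypothetical. Each block carries a geometric label (`label`) and a logical role (`role`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    label: str  # geometric label, e.g. "text", "figure", "table"
    role: str   # logical role, e.g. "title", "body", "caption"
    x0: float   # left edge
    y0: float   # top edge
    x1: float   # right edge
    y1: float   # bottom edge


def reading_order(blocks, page_width):
    """Keep text blocks and order them column by column, top to bottom.

    Assumes a two-column page: a block belongs to the left column when its
    horizontal center lies left of the page midline.
    """
    text_blocks = [b for b in blocks if b.label == "text"]

    def sort_key(block):
        column = 0 if (block.x0 + block.x1) / 2 < page_width / 2 else 1
        return (column, block.y0)

    return sorted(text_blocks, key=sort_key)


# Made-up detections for one page, in arbitrary detector order.
page = [
    Block("text", "body", 320, 100, 580, 400),           # right column
    Block("figure", "illustration", 20, 300, 280, 500),  # non-text zone
    Block("text", "caption", 20, 510, 280, 540),
    Block("text", "title", 20, 20, 280, 60),
    Block("text", "body", 20, 100, 280, 290),            # left column
]

ordered_roles = [b.role for b in reading_order(page, page_width=600)]
```

Real pages with full-width titles, more than two columns, or rotated text need a more careful ordering rule than this midline split.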

Scientific publication





Historical newspapers



Business documents

Red: text block, Blue: figure.



Document Question Answering

Back to top

DocVQA example





Inspirations

Back to top

Domain

  1. https://github.com/kba/awesome-ocr
  2. https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics
  3. https://github.com/icoxfog417/awesome-financial-nlp
  4. https://github.com/BobLd/DocumentLayoutAnalysis
  5. https://github.com/bikash/DocumentUnderstanding
  6. https://github.com/harpribot/awesome-information-retrieval
  7. https://github.com/roomylee/awesome-relation-extraction
  8. https://github.com/caufieldjh/awesome-bioie
  9. https://github.com/HelloRusk/entity-related-papers
  10. https://github.com/pliang279/awesome-multimodal-ml
  11. https://github.com/thunlp/LegalPapers
  12. https://github.com/heartexlabs/awesome-data-labeling

General AI/DL/ML

  1. https://github.com/jsbroks/awesome-dataset-tools
  2. https://github.com/EthicalML/awesome-production-machine-learning
  3. https://github.com/eugeneyan/applied-ml
  4. https://github.com/awesomedata/awesome-public-datasets
  5. https://github.com/keon/awesome-nlp
  6. https://github.com/thunlp/PLMpapers
  7. https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists
  8. https://github.com/papers-we-love/papers-we-love
  9. https://github.com/BAILOOL/DoYouEvenLearn
  10. https://github.com/hibayesian/awesome-automl-papers
