GLAM

Graph-based Layout Analysis Model (GLAM) is a deep learning model for document layout analysis.

Unofficial implementation in PyTorch of "A Graphical Approach to Document Layout Analysis" [arXiv].

Examples of GLAM predictions on PubLayNet documents. Bounding boxes in orange, blue, green, magenta, and red represent predicted segments in the text, title, list, table, and figure categories, respectively.

Introduction

The Graph-based Layout Analysis Model (GLAM) is a deep learning model for document layout analysis. This repository contains an unofficial PyTorch implementation of the model described in the paper "A Graphical Approach to Document Layout Analysis" (arXiv: 2308.02051).

Retrieval Augmented Generation (RAG) improves the output of large language models by integrating external knowledge sources. A fundamental difficulty in RAG pipelines is processing PDF files. Unlike standard text documents, PDFs are composed of positioned font glyphs that often lack labels, which makes them inherently unstructured. Traditional methods such as image embedding or OCR can extract the content, but they do not organize it into meaningful structures: they do not distinguish titles from paragraphs, or tables from figures. GLAM addresses this gap by converting the unstructured content of a PDF into structured data that a RAG pipeline can retrieve from and pass to a large language model.

Prerequisites

  • Python 3.6+
  • pip
  • Optional: Tesseract or EasyOCR
  • Optional: Git with Git LFS support

Ubuntu/Debian

apt-get update -q -y
apt-get install -q -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-jpn
python -m pip install -q -U -r requirements.txt
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
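
Before running the OCR-dependent steps, it can help to confirm that the Tesseract setup above is visible to Python. The following is a minimal sketch, not part of this repository: the helper name `check_ocr_setup` is hypothetical, and it only checks that a `tesseract` executable is on `PATH` and that `TESSDATA_PREFIX` points to an existing directory.

```python
import os
import shutil


def check_ocr_setup(env=None):
    """Return a list of detected problems; an empty list means none were found.

    Hypothetical helper for illustration only. `env` defaults to the current
    process environment and can be replaced by a plain dict.
    """
    env = os.environ if env is None else env
    problems = []

    # shutil.which searches the given PATH string for the executable.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")

    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append("TESSDATA_PREFIX is not a directory: " + tessdata)

    return problems
```

Calling `check_ocr_setup()` with no arguments inspects the current environment; printing the returned list shows what, if anything, still needs to be configured.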

Dataset preparation

Download and extract DocLayNet dataset:

python dln_download_and_extract.py --download-path /home/i/dataset/DocLayNet/raw --extract-path /home/i/dataset/DocLayNet/raw/DocLayNet

Build your own bug-free DocLayNet-v1.1 by re-parsing the PDFs; spans that contain unlabelled glyphs are recognized with Tesseract:

python dln_parse_pdf.py --dataset-path /home/i/dataset/DocLayNet/raw/DocLayNet --image-scale 1

Make training examples:

python dln_glam_prepare.py --dataset-path /home/i/dataset/DocLayNet/raw/DocLayNet/DATA --output-path /home/i/dataset/DocLayNet/glam
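
The three preparation commands above share one dataset root, so they can be driven from a single script. This is a sketch under assumptions: the function names `build_commands` and `run_all` are hypothetical, the directory layout (`raw`, `raw/DocLayNet`, `raw/DocLayNet/DATA`, `glam`) is inferred from the example paths above, and only the script names and flags shown in this README are used.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(root):
    """Build the three preparation commands for a dataset root directory."""
    root = Path(root)
    raw = root / "raw"
    extracted = raw / "DocLayNet"
    return [
        # 1. Download and extract DocLayNet.
        [sys.executable, "dln_download_and_extract.py",
         "--download-path", str(raw), "--extract-path", str(extracted)],
        # 2. Re-parse the PDFs.
        [sys.executable, "dln_parse_pdf.py",
         "--dataset-path", str(extracted), "--image-scale", "1"],
        # 3. Make training examples.
        [sys.executable, "dln_glam_prepare.py",
         "--dataset-path", str(extracted / "DATA"),
         "--output-path", str(root / "glam")],
    ]


def run_all(root):
    """Run the steps in order; check=True stops at the first failing step."""
    for command in build_commands(root):
        subprocess.run(command, check=True)
```

With the paths used in this README, the call would be `run_all("/home/i/dataset/DocLayNet")`, executed from the repository directory so that the scripts are found.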

Training

Some paths are hardcoded in dln_glam_train.py. Change them before training.

python dln_glam_train.py

Evaluation

Change the paths in dln_glam_evaluate.py before evaluation.

python dln_glam_inference.py

Features

  • Simple architecture.
  • Fast. On CPU with a batch size of 128, one epoch takes 00:11:35 for training (507 batches) and 00:02:17 for validation (48 batches).
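
The timings above can be converted to per-batch and per-example rates with simple arithmetic. The sketch below assumes every batch is full (128 examples), which overstates the example count slightly if the last batch is smaller; the function names are hypothetical.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def rates(hms, batches, batch_size=128):
    """Return (seconds per batch, examples per second), assuming full batches."""
    total = to_seconds(hms)
    return total / batches, batches * batch_size / total


train_sec_per_batch, train_examples_per_sec = rates("00:11:35", 507)
val_sec_per_batch, val_examples_per_sec = rates("00:02:17", 48)

print(f"training:   {train_sec_per_batch:.2f} s/batch, {train_examples_per_sec:.1f} examples/s")
print(f"validation: {val_sec_per_batch:.2f} s/batch, {val_examples_per_sec:.1f} examples/s")
```

Under the full-batch assumption this works out to roughly 1.37 seconds per training batch and 2.85 seconds per validation batch.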

Limitations

  • No reading order prediction. This is not an objective of the model, and the dataset does not contain such information.

TODO

  • Implement the mAP@IoU[0.5:0.05:0.95] metric; without it there is no way to compare with other models yet.
  • Implement input feature normalization.
  • Implement text and image features.
  • Batching in inference. Currently, only one page is processed at a time.
  • W&B integration for training.
  • Some text spans in PDFs contain unlabelled font glyphs. Currently, the whole span is passed to OCR. It would be faster to OCR the unlabelled glyphs separately and then merge them into spans.
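
For the mAP@IoU[0.5:0.05:0.95] item, the building block is the intersection over union (IoU) of a predicted and a ground-truth box, tested against ten thresholds from 0.50 to 0.95. The sketch below shows only that matching step; it is not this repository's code and not a full mAP implementation, which would also need per-class precision-recall integration. Boxes are assumed to be `(x0, y0, x1, y1)` tuples, and all names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    # Corners of the intersection rectangle.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matches_per_threshold(pred, truth):
    """Map each IoU threshold to whether the prediction counts as a match."""
    iou = box_iou(pred, truth)
    return {t: iou >= t for t in IOU_THRESHOLDS}


matches = matches_per_threshold((0, 0, 10, 10), (0, 0, 10, 8))
```

In this example the two boxes overlap with an IoU of 0.8, so the prediction counts as a match at the thresholds up to 0.80 and as a miss at 0.85, 0.90, and 0.95.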

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Acknowledgements

  • Jilin Wang, Michael Krumdick, Baojia Tong, Hamima Halim, Maxim Sokolov, Vadym Barda, Delphine Vendryes, and Chris Tanner. "A Graphical Approach to Document Layout Analysis". 2023. arXiv: 2308.02051
