Hi, I'm working on a page layout analysis and information extractor

Original Training image with XML labels to extract data from documents (dhsegment, closed)

dhlab-epfl commented on June 15, 2024

Original Training image with XML labels to extract data from documents

from dhsegment.

Comments (4)

solivr commented on June 15, 2024

Hi,
dhSegment takes as input a pair of images: the original image and the labelled image, in which the regions you want to extract are annotated with different 'colors'. It is not restricted to any annotation format, as long as you are able to convert your annotations to the above-mentioned labelled image.
So to answer your question: if you want to feed XML files directly to dhSegment, no, it will not work, but if you generate the corresponding labelled images, then yes, you'll be able to train a model.
There are already some functions implemented in the PAGE.py file that parse PAGE-XML files and generate the corresponding masks. You can also have a look at the exps/diva/utils.py file, which may give you some hints on how to adapt it to your specific experiment (the Layout Analysis example is the DIVA experiment with DIVA-HisDB data).
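To make the XML-to-labelled-image step concrete, here is a minimal sketch that does not use the PAGE.py helpers mentioned above. It assumes a PAGE-XML layout where each region element carries a `Coords` child with a `points="x,y x,y ..."` attribute, and it uses a hypothetical colour scheme (`CLASS_COLORS`) that you would replace with your own. The polygon fill is a plain NumPy even-odd test chosen to keep the example dependency-light; in practice a drawing routine such as OpenCV's `cv2.fillPoly` would be faster.

```python
import xml.etree.ElementTree as ET

import numpy as np

# Hypothetical colour scheme: one RGB colour per region type, black = background.
CLASS_COLORS = {
    "TextRegion": (255, 0, 0),
    "TableRegion": (0, 255, 0),
    "ImageRegion": (0, 0, 255),
}


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def polygon_mask(points, height, width):
    """Boolean mask of the pixels inside the polygon (even-odd rule)."""
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.zeros((height, width), dtype=bool)
    n = len(points)
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        crosses = (y0 > ys) != (y1 > ys)
        x_at_y = x0 + (ys - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (xs < x_at_y)
    return inside


def page_xml_to_label_image(xml_text):
    """Rasterize the regions of one PAGE-XML document into an RGB label image."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local_name(el.tag) == "Page")
    height, width = int(page.get("imageHeight")), int(page.get("imageWidth"))
    label = np.zeros((height, width, 3), dtype=np.uint8)
    for region in page:
        color = CLASS_COLORS.get(local_name(region.tag))
        coords = next((c for c in region if local_name(c.tag) == "Coords"), None)
        if color is None or coords is None:
            continue  # unknown region type or no outline: leave as background
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points").split()
        ]
        label[polygon_mask(points, height, width)] = color
    return label


# Tiny synthetic document: one rectangular text region on a 6x5 page.
EXAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.jpg" imageWidth="6" imageHeight="5">
    <TextRegion id="r1"><Coords points="1,1 4,1 4,3 1,3"/></TextRegion>
  </Page>
</PcGts>"""

label_image = page_xml_to_label_image(EXAMPLE)
```

The resulting array can be saved next to the original scan as its labelled counterpart. Overlapping regions are painted in document order, so a later region overwrites an earlier one.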


Omua commented on June 15, 2024

Ok, thanks!
Right now I'm using the page.py functions to convert the XML files I currently have into the labelled images that dhSegment takes as input. After that, I should be able to train the system to recognize the type of documents I need to analyze.
But what about extracting the text to postprocess it and analyze what is written? Is that possible?


Omua commented on June 15, 2024

After thinking about the last question I asked, I think I have the solution.
After training dhSegment, the output will be the page regions classified by different colours. I then have to analyze that output image: since I know beforehand which colour corresponds to which element, I can take the coordinates of each region and extract it from the original image. Only then can I analyze it properly, because I know exactly what type of information is in that region (table, image, text...).
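The idea described here can be sketched roughly as follows, assuming the prediction is already available as a colour-coded RGB array with the same height and width as the original image. The mapping `COLOR_TO_CLASS` and the function `crop_regions` are hypothetical names for illustration, not part of dhSegment. The sketch takes one bounding box per class; if a page contains several separate regions of the same class, a connected-component step (for example `scipy.ndimage.label`) would be needed first.

```python
import numpy as np

# Hypothetical colour-to-class mapping; it must match the colours used in training.
COLOR_TO_CLASS = {(255, 0, 0): "text", (0, 255, 0): "table"}


def crop_regions(original, prediction):
    """Return {class name: crop of `original`}, one bounding box per class."""
    crops = {}
    for color, name in COLOR_TO_CLASS.items():
        # Pixels whose three channels all match this class colour.
        mask = np.all(prediction == np.array(color, dtype=prediction.dtype), axis=-1)
        if not mask.any():
            continue  # this class was not predicted on the page
        rows, cols = np.nonzero(mask)
        top, bottom = rows.min(), rows.max() + 1
        left, right = cols.min(), cols.max() + 1
        crops[name] = original[top:bottom, left:right]
    return crops


# Tiny synthetic example: an 8x10 greyscale "page" and a matching prediction.
original = np.arange(8 * 10, dtype=np.uint8).reshape(8, 10)
prediction = np.zeros((8, 10, 3), dtype=np.uint8)
prediction[1:4, 2:7] = (255, 0, 0)  # a text block
prediction[5:7, 0:3] = (0, 255, 0)  # a table

crops = crop_regions(original, prediction)
```

Each crop can then be passed to whatever step fits its type, for instance an OCR engine for the text regions.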


Aminfaraji commented on June 15, 2024

How do I train dhSegment using my own dataset?

