
HTR-United

Website: htr-united.github.io

CC BY 4.0

What is HTR-United

HTR-United is a GitHub organization without any other form of legal personality. It aims to gather HTR/OCR transcriptions from all periods and styles of writing, mostly but not exclusively in French. It was born from a simple need shared by many projects: having ground truth at hand to rapidly train models on smaller corpora.

What is shared?

Datasets shared or referenced with HTR-United must, at minimum, take the form of:

  • a set of ALTO XML and/or PAGE XML files containing either the segmentation information only, or the segmentation and the corresponding transcription;
  • a set of corresponding images. They can be shared in the form of a simple permalink to resources hosted elsewhere, or through the contact information needed to request access to the images. It must be possible to re-establish the link between the XML files and the images without any intermediate step such as renaming the images;
  • documentation on the context in which the dataset was produced and the rules followed to segment and transcribe the documents. For GitHub repositories, this documentation is usually presented in the README.
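The requirement that XML files and images can be matched without renaming can be checked mechanically. The following is a minimal sketch, not an official HTR-United tool: it assumes the image name is stored where the two formats usually put it (the `fileName` element under `sourceImageInformation` in ALTO, the `imageFilename` attribute of `Page` in PAGE XML). The function names `referenced_image` and `missing_images` and the directory layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List, Optional


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def referenced_image(root: ET.Element) -> Optional[str]:
    """Return the image file name declared in an ALTO or PAGE XML tree."""
    for element in root.iter():
        name = local_name(element.tag)
        # PAGE XML: <Page imageFilename="...">
        if name == "Page" and element.get("imageFilename"):
            return element.get("imageFilename")
        # ALTO XML: <sourceImageInformation><fileName>...</fileName>
        if name == "fileName" and element.text and element.text.strip():
            return element.text.strip()
    return None


def missing_images(xml_dir: Path, image_dir: Path) -> List[str]:
    """List XML files whose declared image is absent from image_dir."""
    problems = []
    for xml_path in sorted(xml_dir.glob("*.xml")):
        image = referenced_image(ET.parse(xml_path).getroot())
        if image is None:
            problems.append(f"{xml_path.name}: no image reference found")
        elif not (image_dir / image).exists():
            problems.append(f"{xml_path.name}: {image} not found")
    return problems
```

Calling `missing_images(Path("data"), Path("data"))` on a folder that holds both the XML files and the images returns an empty list when every reference resolves as-is.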

A corpus can be subdivided into smaller sets if necessary.

If you need help setting up your repository, you can check our template!

Only data?

Eventually, this organization also aims to share, under free licenses, models suited to the requested HTR engines. This will allow projects with smaller capacities to benefit from ready-to-use models. Thus, if you share your data, and depending on the pace of contributions from the other members, you will soon be able to use such models.

However, keep in mind that transcription and training form a virtuous circle (Transcription <-> Training) which, we hope, will eventually bring considerable improvements to the transcriptions created by young projects starting from scratch.

How does it work?

There are two cases:

  1. You already have data in a repository
  2. You don't have one and prefer to help the organization directly

You already have data in a repository

This is the most convenient case: you stay in control, and joining the organization is not an issue. However, if you want your dataset to gain visibility, we think it is important that you describe it here. After all, if you benefit from data or models provided by HTR-United, you may as well contribute!

To do so, open an issue or a pull request on the repository, adding a YAML file generated with our form. The file looks like this:

    schema: https://htr-united.github.io/schema/2021-10-15/schema.json
    title: My Example Dataset
    url: http://link.to.repository
    authors:
      - name: John
        surname: Doe
        roles:
          - transcriber
      - name: Jeanne
        surname: Dupont
        roles:
          - project-manager
    description: A short description of the content of the dataset.
    project-name: My Awesome Project
    project-website: http://optional.link.to.project
    language:
      - fra
    script:
      - Latn
    script-type: only-manuscript
    time:
      notBefore: '1830'
      notAfter: '1875'
    hands:
      count: '1'
      precision: exact
    license:
      - name: CC-BY 4.0
        url: https://creativecommons.org/licenses/by/4.0/
    format: Page-XML
    volume:
      - metric: pages
        count: 42
      - metric: lines
        count: 420
      - metric: characters
        count: 4200
    transcription-guidelines: A presentation of the rules established for the transcription.
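Before submitting such a record, a few sanity checks can be run on it. The sketch below is illustrative only: it works on a plain dictionary (as a YAML parser would produce), and the list of expected keys is an assumption taken from the example above, not from the JSON schema, which remains the authoritative reference. The name `check_record` is hypothetical.

```python
from typing import Any, Dict, List

# Keys copied from the example record. Whether the schema actually
# requires each of them is an assumption made for this sketch.
EXPECTED_KEYS = [
    "schema", "title", "url", "authors", "description", "language",
    "script", "script-type", "time", "license", "format", "volume",
]


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a catalog record."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in record]

    # The example stores years as quoted strings, so convert before comparing.
    if "time" in record:
        time = record["time"] or {}
        try:
            if int(time["notBefore"]) > int(time["notAfter"]):
                problems.append("time: notBefore is later than notAfter")
        except (KeyError, ValueError, TypeError):
            problems.append("time: notBefore and notAfter must both be years")

    # Each volume entry pairs a metric (pages, lines, ...) with a count.
    for entry in record.get("volume") or []:
        count = entry.get("count")
        if not isinstance(count, int) or count < 0:
            problems.append(
                f"volume: count for {entry.get('metric')!r} "
                "is not a non-negative integer"
            )
    return problems
```

An empty list means none of these checks failed; it does not mean the record is valid against the schema.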

You don't have one

We will be happy to have your help. Open an issue here and we will gladly help you create and share your repository on HTR-United. Git skills are appreciated but, if you want to share data, we will help you. That is the purpose of this organization!

Overview

You can browse the content of the catalog from our website: here.

Here is an overview of the periods covered by the datasets documented in HTR-United's catalog!

[Graph: periods covered by the datasets documented in the catalog]

Quality Check

To help you improve and guarantee the quality of your dataset, we developed a series of tools which can easily be added to your repositories. Check out our Tools webpage to see descriptions and demos!

Publications

  • (FR) Alix Chagué, Thibault Clérice, Laurent Romary. HTR-United : Mutualisons la vérité de terrain !. DHNord2021 - Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux, MESHS, Nov 2021, Lille, France. ⟨hal-03398740⟩

  • (FR) Alix Chagué. Conditions de la mutualisation : les principes FAIR et HTR-United. Humanistica 2022, May 2022, Montréal, Canada. ⟨hal-03685731⟩


Logo by Alix Chagué.

HTR-United's Projects

cremma-medieval

Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.

cremma-wikipedia

A collection of ground truth to train HTR models on contemporary French handwriting

dahncorpus

Ground truth dataset for French 20th-century typewritten OCR produced by the DAHN project

htr-united

Ground Truth Resources for the HTR of patrimonial documents

htruc

HTR-United Catalog tooling (pronounced EuchTruc)

htrvx

HTRVX: HTR Validation with XSD

lectaurep-bronod

Lectaurep-Bronod, ground truth for Maître Bronod's documents (French 18th century)

lectaurep-mariages-et-divorces

Lectaurep-Mariages-et-Divorces, ground truth for the Registres des Contrats de Mariages et des Séparations et Divorces (French 19th century)

schema

Repository for schema-related business

tapuscorpus

Ground truth for French 20th-century typewritten documents collected on Gallica and Europeana

timeuscorpus

Ground truth datasets for French 18th- and 19th-century HTR produced by the ANR project TIME US
