abdoelsayed2016 / hkr_dataset

Handwritten Kazakh and Russian (HKR) database for text recognition

Home Page: https://doi.org/10.1007/s11042-021-11399-6

handwriting-recognition handwriting handwritten-text-recognition handwritten-character-recognition database dataset deep-learning convolutional-neural-networks russian-language kazakhstan recurrent-neural-networks ctc-loss computer-vision pattern-recognition

hkr_dataset's Introduction

The HKR dataset is a Russian and Kazakh handwriting database (about 95% Russian and 5% Kazakh words/sentences) for offline handwriting recognition. The dataset can be downloaded through the following link:

Note: The HKR Dataset can only be used for non-commercial research purposes. Researchers who want to use the HKR database should first fill in this Application Form and send it to us via email ([email protected],[email protected]).

Description

Kazakh and Russian are both written in Cyrillic and share the same 33 characters. Besides these characters, the Kazakh alphabet contains 9 additional specific characters. This dataset is a collection of forms. The sources of all the forms in the dataset were generated with LaTeX and subsequently filled out by persons in their own handwriting. The database consists of more than 1400 filled forms. There are approximately 63000 sentences and more than 715699 symbols, produced by approximately 200 different writers. We utilized three different datasets, described as follows:

  • Handwritten samples (forms) of keywords in Kazakh and Russian (areas, cities, villages, etc.)
  • Handwritten Kazakh and Russian alphabet in Cyrillic
  • Handwritten samples (forms) of poems in Russian
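
The alphabet described above (33 shared Cyrillic letters plus 9 Kazakh-specific ones) can be written down as a label set for a recognizer. The following is a minimal sketch, not part of the HKR tooling: it assumes lower-case letters only and reserves index 0 for a CTC blank, and the names `ALPHABET`, `encode`, and `decode` are hypothetical. A real label set would likely also need upper-case letters, digits, spaces, and punctuation.

```python
# 33 letters shared by Russian and Kazakh (lower case).
RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# 9 Kazakh-specific letters (ә ғ қ ң ө ұ ү һ і), written as Unicode
# escapes so they cannot be confused with look-alike Latin characters.
KAZAKH_EXTRA_LOWER = "\u04d9\u0493\u049b\u04a3\u04e9\u04b1\u04af\u04bb\u0456"

ALPHABET = RUSSIAN_LOWER + KAZAKH_EXTRA_LOWER

# Assumption: index 0 is reserved for the CTC blank symbol.
BLANK = 0
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}


def encode(text):
    """Map text to label indices, skipping characters outside the alphabet."""
    return [CHAR_TO_IDX[ch] for ch in text.lower() if ch in CHAR_TO_IDX]


def decode(indices):
    """Map label indices back to text, ignoring the blank index."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i != BLANK)
```

With this mapping, the full Kazakh label set has 42 letter classes plus the blank.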

The following are some sample forms from the HKR dataset:

The following are some word images obtained after segmenting the forms:

For example, the following image shows the number of characters in the HKR dataset.

Split Dataset

The handwriting text dataset is divided into Training, Validation, Test 1, and Test 2 folders. Test 1 is a subset of unseen words; Test 2 is a subset of words that were seen, but written by other writers. You can use the following Python code to split the dataset as described above: https://github.com/bosskairat/Dataset
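
As a rough illustration of the idea only (this is not the authors' script from the repository linked above), the sketch below holds out a set of words for Test 1 and a set of writers for Test 2. The `Sample` record, the function name `split_hkr_style`, and all split fractions are assumptions made for this example; the actual annotation format and ratios should be taken from the dataset and the paper.

```python
import random
from collections import namedtuple

# Hypothetical record: image path, transcription, and writer identifier.
Sample = namedtuple("Sample", ["image", "text", "writer"])


def split_hkr_style(samples, test1_word_frac=0.1, test2_writer_frac=0.1,
                    val_frac=0.1, seed=0):
    """Split samples into train / val / test1 (unseen words) / test2
    (seen words written by held-out writers)."""
    rng = random.Random(seed)

    words = sorted({s.text for s in samples})
    writers = sorted({s.writer for s in samples})
    rng.shuffle(words)
    rng.shuffle(writers)

    # Hold out some words entirely (Test 1) and some writers entirely (Test 2).
    held_words = set(words[:max(1, int(len(words) * test1_word_frac))])
    held_writers = set(writers[:max(1, int(len(writers) * test2_writer_frac))])

    test1 = [s for s in samples if s.text in held_words]
    test2 = [s for s in samples
             if s.text not in held_words and s.writer in held_writers]
    rest = [s for s in samples
            if s.text not in held_words and s.writer not in held_writers]

    # Everything else is shuffled and divided into validation and training.
    rng.shuffle(rest)
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]

    # Test 2 must contain only words that actually occur in training;
    # samples that do not satisfy this are dropped in this sketch.
    train_words = {s.text for s in train}
    test2 = [s for s in test2 if s.text in train_words]

    return {"train": train, "val": val, "test1": test1, "test2": test2}
```

By construction, no Test 1 word appears in training, and every Test 2 sample pairs a training word with a writer who contributed nothing to training.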

Citation and Contact

Please consider citing our papers when you use our dataset:

@article{nurseitov2021handwritten,
  title={Handwritten Kazakh and Russian (HKR) database for text recognition},
  author={Nurseitov, Daniyar and Bostanbekov, Kairat and Kurmankhojayev, Daniyar and Alimova, Anel and Abdallah, Abdelrahman and Tolegenov, Rassul},
  journal={Multimedia Tools and Applications},
  pages={1--23},
  year={2021},
  publisher={Springer}
}
@article{Abdallah_2020, 
  title={Attention-Based Fully Gated CNN-BGRU for Russian Handwritten Text}, 
  volume={6}, 
  ISSN={2313-433X}, 
  url={http://dx.doi.org/10.3390/jimaging6120141}, 
  DOI={10.3390/jimaging6120141},
  number={12}, 
  journal={Journal of Imaging}, 
  publisher={MDPI AG}, 
  author={Abdallah, Abdelrahman and Hamada, Mohamed and Nurseitov, Daniyar},
  year={2020}, 
  month={Dec},
  pages={141}
}
@article{DaniyarNurseitov2020,
  author = {Nurseitov, Daniyar and Bostanbekov, Kairat and Kanatov, Maksat and Alimova, Anel and Abdallah, Abdelrahman and Abdimanap, Galymzhan},
  doi = {10.25046/aj0505114},
  journal = {Advances in Science, Technology and Engineering Systems Journal},
  keywords = {CNN,CTC,Convolutional neural networks,RNN,Recurrent neural networks},
  number = {5},
  pages = {934--943},
  title = {{Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models}},
  volume = {5},
  year = {2020}
}

For any questions about the dataset, please contact the authors by email: Prof. Daniyar Nurseitov ([email protected]), Dr. Kairat Bostanbekov ([email protected]).

hkr_dataset's Issues

Unable to download dataset

I am a Chinese graduate student. Downloading the dataset from the website you provided seems to require registering with a Russian email address, which is not convenient for me. Could you upload the dataset to Google Drive or OneDrive? This would make downloading easier. In addition, since a separate application would still be required for decompression, I don't think this would cause the dataset to be misused.

License Type

Hi!

It's me again :) I would like to know the license type of the HKR Dataset.

Note: The HKR Dataset can only be used for non-commercial research purpose. For researchers who wants to use the HKR database, please first fill in this Application Form and send it via email to us ([email protected],[email protected]).

Could you add a license file to your repository? https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository

P.S. I would like to publish my GitHub repository with the weights of my model, also for non-commercial research, with references to your works.

Thank you!

Data Splitting

Hello! Thank you all for your hard work, especially for creating this dataset!

Could you provide the data split from your article? https://arxiv.org/pdf/2008.05373.pdf
(image IDs for Validation_images, Test_1_images and Test_2_images)

It would be very useful for comparison, for publishing other articles, and for citing your original article and dataset :)
