
SynthText

Code for generating synthetic text images as described in "Synthetic Data for Text Localisation in Natural Images", Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, CVPR 2016.

Synthetic scene-text image samples

The code in the master branch is for Python2. Python3 is supported in the python3 branch.

The main dependencies are:

pygame==2.0.0, opencv (cv2), PIL (Image), numpy, matplotlib, h5py, scipy
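
A quick way to confirm the dependencies are importable in your environment (a small sketch, not part of the repository):

import pygame, cv2, PIL, numpy, matplotlib, h5py, scipy
print('pygame', pygame.version.ver)
print('opencv', cv2.__version__)
print('h5py', h5py.__version__)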

Generating samples

python gen.py --viz [--datadir <path-to-downloaded-renderer-data>]

where --datadir points to the renderer_data directory included in the data torrent. Specifying this datadir is optional; if it is not specified, the script will automatically download and extract the same renderer.tar.gz data file (~24 MB). This data file includes:

  • sample.h5: This is a sample h5 file which contains a set of 5 images along with their depth and segmentation information. Note: this is given only as an example; you are encouraged to add more images (along with their depth and segmentation information) to this database for your own use. A short reading sketch follows this list.
  • fonts: three sample fonts (add more fonts to this folder and then update fonts/fontlist.txt with their paths).
  • newsgroup: Text source (from the News Group dataset). This can be substituted with any text file. Look inside text_utils.py to see how the text inside this file is used by the renderer.
  • models/colors_new.cp: Color-model (foreground/background text color model), learnt from the IIIT-5K word dataset.
  • models: Other cPickle files (char_freq.cp: frequency of each character in the text dataset; font_px2pt.cp: conversion from pt to px for various fonts. If you add a new font, make sure the corresponding model is present in this file; if it is not, you can add it by adapting invert_font_size.py).
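
The following is a minimal sketch for inspecting such a database with h5py. It assumes top-level groups named image, depth and seg keyed by image name, with area and label stored as attributes on each seg dataset, and a hypothetical path renderer_data/sample.h5; check gen.py and the comment linked below for the exact layout (e.g. the depth arrays may need transposing or slicing before use).

import h5py

# Path is an assumption; point this at the downloaded sample database.
with h5py.File('renderer_data/sample.h5', 'r') as db:
    for imname in db['image']:
        img   = db['image'][imname][:]              # RGB background image
        depth = db['depth'][imname][:]              # depth map
        seg   = db['seg'][imname][:]                # segmentation map
        area  = db['seg'][imname].attrs['area']     # area of each segment
        label = db['seg'][imname].attrs['label']    # segment labels
        print(imname, img.shape, depth.shape, seg.shape, len(label))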

This script will generate random scene-text image samples and store them in an h5 file in results/SynthText.h5. If the --viz option is specified, the generated output is visualized as the script runs; omit the --viz option to turn off the visualizations. If you want to visualize the results stored in results/SynthText.h5 later, run:

python visualize_results.py
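
For programmatic access, a minimal reading sketch is below. It assumes the layout used by visualize_results.py: a top-level data group whose datasets hold the rendered images, with charBB, wordBB and txt stored as attributes.

import h5py

with h5py.File('results/SynthText.h5', 'r') as db:
    for name in db['data']:
        rgb    = db['data'][name][...]               # rendered image
        charBB = db['data'][name].attrs['charBB']    # 2 x 4 x n_chars box corners
        wordBB = db['data'][name].attrs['wordBB']    # 2 x 4 x n_words box corners
        txt    = db['data'][name].attrs['txt']       # text strings (may be bytes)
        print(name, rgb.shape, wordBB.shape, len(txt))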

Pre-generated Dataset

A dataset with approximately 800,000 synthetic scene-text images generated with this code can be found in the SynthText.zip file in the torrent here; the dataset details/description are in the readme.txt file in the same torrent.

Adding New Images

Segmentation and depth-maps are required to use new images as background. Sample scripts for obtaining these are available here.

  • predict_depth.m: MATLAB script to regress a depth mask for a given RGB image; uses the network of Liu et al. However, more recent works (e.g., this) might give better results.
  • run_ucm.m and floodFill.py for getting segmentation masks using gPb-UCM.

For an explanation of the fields in sample.h5 (e.g. seg, area, label), please check this comment.

Pre-processed Background Images

The 8,000 background images used in the paper, along with their segmentation and depth masks, are included in the same torrent as the pre-generated dataset under the bg_data directory. The files are:

  • imnames.cp: names of images which do not contain background text
  • bg_img.tar.gz: the images (filter these using imnames.cp)
  • depth.h5: depth maps
  • seg.h5: segmentation maps

use_preproc_bg.py provides sample code for reading this data.
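
A minimal sketch for filtering the extracted background images with imnames.cp is shown below; it assumes imnames.cp is a pickled list of filenames and that bg_img.tar.gz has been extracted to a bg_img/ directory (use_preproc_bg.py is the reference for the exact handling).

import os
import pickle

# imnames.cp is assumed to be a pickled list of filenames; under Python 3,
# reading a Python 2 pickle may need pickle.load(f, encoding='latin1').
with open('imnames.cp', 'rb') as f:
    valid = set(pickle.load(f))

bg_dir = 'bg_img'   # directory where bg_img.tar.gz was extracted (assumption)
bg_files = [os.path.join(bg_dir, n) for n in os.listdir(bg_dir) if n in valid]
print(len(bg_files), 'usable background images')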

Note: We do not own the copyright to these images.

Generating Samples with Text in non-Latin (non-English) Scripts

  • @JarveeLee has modified the pipeline for generating samples with Chinese text here.
  • @adavoudi has modified it for Arabic/Persian script, which flows right-to-left, here.
  • @MichalBusta has adapted it for a number of languages (e.g. Bangla, Arabic, Chinese, Japanese, Korean) here.
  • @gachiemchiep has adapted it for Japanese here.
  • @gungui98 has adapted it for Vietnamese here.
  • @youngkyung has adapted it for Korean here.
  • @kotomiDu has developed an interactive UI for generating images with text here.
  • @LaJoKoch has adapted it for German here.

Further Information

Please refer to the paper for more information, or contact me (email address in the paper).

Contributors

ankush-me, carandraug, codeveryslow


SynthText Issues

Problems with OpenCV 3.0

This code cannot run with OpenCV 3.1. Is there a version of this code that can run with OpenCV 3.1?

seg question

How do I get the "seg", "label" and "area" from the gPb-UCM (https://github.com/jponttuset/mcg) code that you advised? I only found that it produces a UCM image. Is this information ("seg", "label", "area") related to superpixels? Can you give me a clue about the specific process for gathering this information? Thanks a lot. @ankush-me

ValueError: too many values to unpack

I get this error:
xting@xiaomi-To-be-filled-by-O-E-M:~/SynthText$ python gen.py
getting data..
-> done
Storing the output in: results/SynthText.h5
0 of 4
Traceback (most recent call last):
File "/home/xting/SynthText/synthgen.py", line 615, in render_text
regions = self.filter_for_placement(xyz,seg,regions)
File "/home/xting/SynthText/synthgen.py", line 391, in filter_for_placement
res = get_text_placement_mask(xyz,seg==l,regions['coeff'][idx],pad=2)
File "/home/xting/SynthText/synthgen.py", line 219, in get_text_placement_mask
contour,hier = cv2.findContours(mask.copy().astype('uint8'),mode=cv2.RETR_CCOMP,method=cv2.CHAIN_APPROX_SIMPLE)
ValueError: too many values to unpack

The same traceback repeats for "1 of 4" through "4 of 4", interleaved with a few ALSA "underrun occurred" messages.
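
A likely cause (an observation, not from the original thread): OpenCV 3.x changed cv2.findContours to return three values (image, contours, hierarchy), while OpenCV 2.x and 4.x return two (contours, hierarchy). A version-agnostic sketch of the call in get_text_placement_mask:

import cv2
import numpy as np

mask = np.zeros((32, 32), dtype='uint8')   # stand-in for the placement mask
mask[8:24, 8:24] = 1

res = cv2.findContours(mask.copy(), mode=cv2.RETR_CCOMP,
                       method=cv2.CHAIN_APPROX_SIMPLE)
contours, hier = res[-2:]                  # works across OpenCV 2.x/3.x/4.x
print(len(contours))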

Have you tried any type of stochastic pooling for regularization similar to dropout?

I'm training my network now, and from epoch to epoch I'm testing the model to see how well it predicts. So far it's getting me less than 25% recall, and I'm trying to figure out what I'm doing wrong. I believe the math I'm doing for converting the pose parameters back to bounding boxes is correct. Anyway, some light research brought me to stochastic pooling, and I'm curious whether you have ever tried this in your networks and whether it's worth trying in case my network is overfitting.

Also- how many epochs did you end up running? My epochs are each taking 30 hours with roughly 550k images (I will try a larger set once I am able to see if this approach is going to generalize well, if at all, to my actual data).

question in function get_text_placement_mask(xyz,mask,plane,pad=2,viz=False):

In synthgen.py, I know that pts holds the contour points in the 2D image and pts_fp is where these points should be in the fronto-parallel view. But what does pts_tmp mean?
rect = cv2.minAreaRect(pts_fp[0].copy().astype('float32'))
box = np.array(cv2.cv.BoxPoints(rect))
R2d = su.unrotate2d(box.copy())
box = np.vstack([box,box[0,:]]) #close the box for visualization

mu = np.median(pts_fp[0],axis=0)
pts_tmp = (pts_fp[0]-mu[None,:]).dot(R2d.T) + mu[None,:]
boxR = (box-mu[None,:]).dot(R2d.T) + mu[None,:]
s = rescale_frontoparallel(pts_tmp,boxR,pts[0])

Why are these lines necessary? Why can't pts_fp and box be used in rescale_frontoparallel directly, instead of pts_tmp and boxR?

How to filter this?

[attached image: 43_0]
Some characters are not placed correctly (maybe because of the font), but I cannot find how to filter this in pygame.

Generating Cropped Word Images

@ankush-me This is great work. Thanks for sharing it.
Currently I am working on the problem of text recognition from cropped word images. Initially I used the MJSynth dataset provided by the Visual Geometry Group to train my LSTM-based model, but this dataset is heavily biased towards words and has very few occurrences of numbers. I have a dictionary containing the types of text/numbers/symbols that I want to recognize, but I am more concerned about the rendering process. Can I use your script to synthetically generate cropped images (resembling natural text images) containing words and numbers?
Please help.

model training question

Hi,
Is there any trick or constraint to force the confidence in the 7-length vector to lie between 0 and 1 during training?

Thanks

IoU in CVPR 2016 paper

Hi,
For the paper "Synthetic Data for Text Localisation in Natural Images", did you calculate IoU for oriented bounding boxes, or did you convert them to axis-aligned boxes?

Thanks
M

The rotations of some word bounding boxes do not line up properly with the word itself.

It's strange and it happens quite often, maybe 20% of the time. It usually occurs when a word is rotated itself or has been projected onto a surface causing it to have a strange dimension. Thing is, though, it does not only happen on extreme projections. It also happens on only slightly projected text. It makes it hard for the fully convolutional network to converge well on the sin/cos pose params.

questions on your paper

Hi Ankush,
In your paper you mention Hough voting, but I do not see it being used in the paper. Can you clarify?

Thanks

How to get some bigger text?

Thanks for sharing the code. I want to get some bigger text (for example, text that occupies a large part of the image); can you tell me how to configure your code to do this? Thank you very much.

SynthText question

Hi,
Sorry to ask a question regarding the SynthText dataset here, since I am not aware of any other place to pose it.

I am guessing that the sizes of the text boxes and the angles are with respect to the original image size and not with respect to 512 x 512?

Thanks

Using the "FCRN" approach to create a saliency map

I've recently been kicking around the idea of possibly bypassing the localization stage altogether and moving right into a holistic method for text recognition, more specifically creating a saliency map of the characters found within an image. My thought is that instead of trying to predict the pose params for a bounding box around the text, I predict a vector that gives the confidence that a specific cell contains each character of the alphabet.

I don't recall seeing any research papers published on this topic yet. Ankush, have you seen any research done on this? Have you tried this yourself?

Recently, I trained up a text detection network that, similar to your FCRN, only tried to detect the presence of text within each cell. Basically a 0 meant no text while a 1 meant high confidence of text. The results I've had both on the synth text dataset and on other datasets have been very promising.

I'm having a lot of fun with this FCRN approach, if you couldn't tell. For my specific case, I don't care as much about having bounding boxes around text as I do for knowing what the text in the image says.

Thanks again for any advice, comments, lessons learned, or references you may have.

It's really hard to hack the code

First, thanks for sharing the code. I want to remove some of the restrictions in your filtering (which uses the depth and segmentation masks), but the many numeric operations (like RANSAC, the depth camera model, ...) are not very intuitive to understand. Can you point me in the right direction? Some detailed material on how the depth and segmentation information is used would help a lot.

Text angle

I found that the text angle is related to the plane normal, and it ranges from -90 to 90 degrees. Is it possible for the text angle to range from -180 to 180 degrees?

Loss function

Hi,
I have a question about the loss function in your SynthText paper. Can you give a LaTeX formula for the version used in the paper?

When I run the code, it generates errors

When I use python gen.py --viz on the command line, it generates:
~/SynthText-master$ python gen.py --viz
Traceback (most recent call last):
File "gen.py", line 19, in
from synthgen import *
File "/home/ubuntu/SynthText-master/synthgen.py", line 20, in
import text_utils as tu
File "/home/ubuntu/SynthText-master/text_utils.py", line 12, in
from pygame import freetype
ImportError: cannot import name freetype

What is wrong?
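
One thing worth checking (a suggestion, not from the original thread): the freetype module is missing from some older pygame builds, while the pygame==2.0.0 pinned in the dependencies ships it. A quick check:

import pygame
print(pygame.version.ver)
from pygame import freetype   # raises ImportError on builds without freetype support
freetype.init()
print('freetype OK')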

Ability to specify minimum words per image

I have some images with quite a bit of open space for text to be drawn, but only a single 2-3 character word gets drawn on the entire image. It would be nice to be able to specify the minimum number of words I'd like drawn on an image, so that the script keeps looking for words to draw.

What training params did you use for the discount in your FCRN?

Similar to your research, I'm finding that about 3% of the cells in my input images contain bounding boxes, and I'm starting my discount at 0.01 of the loss for the negative cells in the confidence matrix, c, during training. However, I am having trouble finding a good step value for increasing the discount during training. How long did it take you to train your final network with 800,000 images? How often did you increase the discount applied to the loss of the negative c classes during training? What did you change it by?

Question about FCRN downscaling inference

In the paper "Synthetic Data for Text Localisation in Natural Images", you mention that you downscale the images by 1/2, 1/4 and 1/8 in order to pick up larger words. I'm wondering if you are able to provide more details about this. For instance, when you downscale the images, did you feed them into the same (1x512x512) network? Did you fill the empty area with black/white pixels?

Cannot download 'http://www.robots.ox.ac.uk/~ankush/data.tar.gz'

Hello, when I run gen.py it fails. I cannot reach "http://www.robots.ox.ac.uk/~ankush/data.tar.gz". Is there another way to get the data? Thanks.

The error is as follows:
File "gen.py", line 55
print colorize(Color.RED,'Data not found and have problems downloading.',bold=True)
^
SyntaxError: invalid syntax

By the way, the download of the 800,000 synthetic scene-text images does not complete; the URL I tried is http://www.robots.ox.ac.uk/~vgg/data/scenetext/, is it right?
Thanks a lot!

Difference between px and pt?

def get_nline_nchar(self, mask_size, font_height, font_width):
    """
    Returns the maximum number of lines and characters which can fit
    in the MASK_SIZED image.
    """
    H, W = mask_size
    nline = int(np.ceil(H / (2 * font_height)))
    nchar = int(np.floor(W / font_width))
    return nline, nchar
In this function, font_height and font_width are in pt, but H and W are in px. Why can a px value be divided by a pt value? What is the difference between px and pt?
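
For reference (general typography, not from the original thread): one point is 1/72 of an inch, so the conversion depends on the rendering DPI, px = pt * dpi / 72. The font_px2pt.cp model mentioned in the README stores a fitted per-font conversion rather than a fixed DPI. A tiny sketch:

def pt_to_px(pt, dpi=96.0):
    """Convert a font size in points to pixels at a given DPI."""
    return pt * dpi / 72.0

print(pt_to_px(12))   # 16.0 px at 96 DPI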

Dataset

If I use my own images, how do I get the depth and segmentation information for them? Is there code available for this?
