Saturday, April 30, 2016

Tesseract API - Providing training data

Tesseract is fully trainable. To learn the different shapes a single character can take, it needs samples from different fonts, with each font kept explicitly separate. At most 64 fonts can be trained. Note that recognition run time depends heavily on the number of fonts provided, and training more than 32 will result in a significant slowdown.

Some of the procedure is inevitably manual. As much automated help as possible is provided. The tools referenced below are all built in the training subdirectory.

The first step is to determine the full character set to be used, and to prepare a text or word processor file containing a set of examples. Here are a few points to remember while generating the sample files:

1. Make sure there is a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
2. There should be more samples of the more frequent characters: at least 20.
3. Don't make the mistake of grouping all the non-letters together. Make the text more realistic. For example, "The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.{}<>/?" is terrible. Much better is "The (quick) brown {fox} jumps! over the $3,456.78 #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam?" This gives the textline finding code a much better chance of getting sensible baseline metrics for the special characters.
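
The sample counts in points 1 and 2 are easy to check before generating any images. The following is a minimal Python sketch, not part of the Tesseract tools: the helper names count_samples and undersampled are made up for illustration, and the default threshold of 5 simply mirrors the "5 is OK" figure above.

```python
from collections import Counter


def count_samples(text):
    """Count how often each non-whitespace character occurs in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def undersampled(text, minimum=5):
    """Return {character: count} for characters seen fewer than `minimum` times.

    `minimum` is an assumption taken from the guideline above (5 is OK,
    10 is good); raise it to 20 when checking the frequent characters.
    """
    counts = count_samples(text)
    return {ch: n for ch, n in counts.items() if n < minimum}


# Illustrative check on the "better" sample sentence from point 3.
sample = ("The (quick) brown {fox} jumps! over the $3,456.78 #90 dog & "
          "duck/goose, as 12.5% of E-mail from aspammer@website.com is spam?")
for ch, n in sorted(undersampled(sample).items(), key=lambda kv: kv[1]):
    print(f"{ch!r}: {n} sample(s)")
```

To check a real file, read training_text.txt with open(..., encoding="utf-8") and pass its contents to undersampled(). A single sentence like the one above will of course flag most characters; the full training text should repeat such realistic sentences until the list is empty.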

Version 3.03 provides an automated method. We need to prepare a UTF-8 text file (training_text.txt) containing training text according to the above specification, and obtain TrueType/OpenType font files for the fonts we wish to recognize. Then run the following command for each font in turn to create a matching tif/box pair:

training/text2image --text=training_text.txt --outputbase=[lang].[fontname].exp0 --font='Font name' --fonts_dir=/path/to/fonts
For example:
training/text2image --text=training_text.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman Bold' --fonts_dir=/usr/share/fonts
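
Since the command has to be repeated for each font, it can help to generate the invocations from a font list. Below is a rough Python sketch under a few assumptions: the helper names are hypothetical, the font name is stripped of spaces to form the [fontname] part (as in the example above), and the second font in the demo list is only a placeholder. The flags are exactly the four shown above, and the 64/32 limits are the ones mentioned at the top of this post.

```python
import shlex

MAX_FONTS = 64   # upper limit on the number of fonts, as stated in this post
SLOW_FONTS = 32  # beyond this, expect a significant slowdown


def text2image_command(lang, font, text_file="training_text.txt",
                       fonts_dir="/usr/share/fonts", exp=0):
    """Build the text2image argument list for one font."""
    font_id = font.replace(" ", "")  # 'Times New Roman Bold' -> 'TimesNewRomanBold'
    return [
        "training/text2image",
        f"--text={text_file}",
        f"--outputbase={lang}.{font_id}.exp{exp}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def commands_for_fonts(lang, fonts, **kwargs):
    """Build one command per font, enforcing the font-count limits."""
    if len(fonts) > MAX_FONTS:
        raise ValueError(f"{len(fonts)} fonts given; the limit is {MAX_FONTS}")
    if len(fonts) > SLOW_FONTS:
        print(f"warning: more than {SLOW_FONTS} fonts will slow recognition")
    return [text2image_command(lang, font, **kwargs) for font in fonts]


# 'Times New Roman' is a placeholder second font, not taken from the post.
for cmd in commands_for_fonts("eng", ["Times New Roman Bold", "Times New Roman"]):
    print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

To run the commands instead of printing them, each list can be passed to subprocess.run(cmd, check=True) from the directory that contains the training subdirectory.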
