Infty Project
Research Project on Mathematical Information Processing
Mathematical Document Recognition and Analysis, User Interface, Accessibility of Scientific Documents


To promote further research into OCR for scientific documents, the InftyProject releases databases that may be suitable for research outside of InftyProject development. We have carefully scrutinized the data in these releases, and expect that the databases can be relied upon with high confidence. Nevertheless, if you experience any problems with them, please contact us.

1. InftyCDB-1

A Ground Truth Database of Characters, Symbols, Words and Formulas in Mathematical Documents; First Distribution, March 18, 2005

  • Description:

    InftyCDB-1 consists of 30 mathematics articles in English. In all, it comprises 688,580 character samples, from 476 pages of text. The image of each alphanumeric character or mathematical symbol is recorded, together with the character code of the symbol it represents. In addition, links are recorded that represent the structure of each word or mathematical expression that appears. Thus, InftyCDB-1 can be used as a word or mathematical expression database, as well as a character image database.

    There are 108,914 words and 21,056 mathematical expressions in InftyCDB-1. For more details about the database, please see here.

  • Conditions of use:

    Usage for research, development, or testing of OCR systems (possibly commercial) for scientific documents is permitted, free of charge.

  • Download:


2. InftyCDB-2

A Ground Truth Database of Characters, Symbols, Words and Formulas in Mathematical Documents; Second Distribution, December 27, 2006

  • Description:

    This is a continuation of InftyCDB-1, with the same structure. It contains some documents in German and French, as well as many in English.

    There are 662,142 characters from English articles, 37,439 from French articles, and 77,812 from German articles. For a complete list of the articles, see here. Note that the database was corrected recently; details about the revisions are here.

  • Conditions of use:

    Same as InftyCDB-1.

  • Download:


3. InftyCDB-3

A Ground Truth Database of Characters and Symbols in Mathematical Documents; Third Distribution, October 24, 2006

  • Description:

    InftyCDB-3 is a database of single alphanumeric characters and mathematical symbols, divided into two data sets. Unlike InftyCDB-1 and InftyCDB-2, word and mathematical expression structure is not included. The images are of individual characters only. To make it easy to use for experimentation and development with single-character recognition engines, symbols whose form is identical (for example, the summation symbol and the Greek capital sigma) are assigned the same character code.

    In InftyCDB-3-A, there are 188,752 characters; in InftyCDB-3-B, there are 70,637 characters.

  • InftyCDB-3-A is the training set used to produce recent versions of InftyReader (Versions 2.0 - 2.5.0). Taking data from more than 300 sources, we have tried to cover as many varieties of characters and symbols as possible. The data was extracted from books and journals from various publishers, Japanese documents, typesetting samples from printing companies, fonts installed on personal computer operating systems, and LaTeX fonts.

  • InftyCDB-3-B is an extract of InftyCDB-1, which includes data from 20 of its articles. To reduce the number of samples with the same character code, size, and shape, clustering was applied to the data from these 20 articles, reducing the number of data samples to about 70,000. The data is written in the same format as in InftyCDB-3-A.

    Please see this explanation for more details.
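
The shared character codes described above for InftyCDB-3 can be illustrated with a small sketch. Everything here is hypothetical: the label names, the merge table, and the sample layout are placeholders chosen for illustration, not the actual InftyCDB-3 codes or file format. It only shows the idea of mapping same-shape symbols (such as the summation symbol and the Greek capital sigma) to one label before training a single-character recognizer.

```python
# Hypothetical label names; the real InftyCDB-3 character codes are
# documented separately and are not reproduced here.
MERGED_LABEL = {
    "summation": "Sigma",  # summation symbol has the same form as capital sigma
}


def canonical_label(label: str) -> str:
    """Return the representative label for a group of same-shape symbols."""
    return MERGED_LABEL.get(label, label)


def merge_labels(samples):
    """Rewrite (label, image) pairs so same-shape symbols share one label."""
    return [(canonical_label(label), image) for label, image in samples]


# Toy samples; the image values are placeholders for real bitmaps.
samples = [("summation", "img0"), ("Sigma", "img1"), ("alpha", "img2")]
merged = merge_labels(samples)
```

With labels merged this way, a recognizer is never asked to separate two symbols that cannot be told apart from the image alone; telling them apart again would need context, which InftyCDB-3 does not record.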

  • Note:

    This data set does not include any German symbols.

  • Conditions of use:

    Same as InftyCDB-1.

  • Download:


4. InftyMDB-1

A Ground Truth Database of Mathematical Expressions, August 12, 2009

  • Description:

Finding recognition errors is an important task in OCR. This database was prepared for the research and development of algorithms that detect misrecognitions in mathematical OCR.
InftyProject encourages everyone to use the database to evaluate new verification methods.

This database was used in the following paper: A. Fujiyoshi, M. Suzuki, and S. Uchida, "Verification of mathematical formulae based on a combination of context-free grammar and tree grammar," in Proceedings of MKM 2008, LNCS (LNAI) 5144, pp. 415-429, 2008.

The database was constructed by collecting 3,000 correctly recognized mathematical formulae and 1,400 misrecognized mathematical formulae generated by InftyReader.
The formulae were collected from 32 pure mathematics articles, 30 of which are the same as in InftyCDB-1.
Each mathematical formula in the database consists of 10 or more symbols.
The original images of the formulae are available.
A CSV file containing the ground truth (the corrected results) for the misrecognized mathematical formulae is also available.
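
As a rough illustration of how a verification method might be scored against the two groups of formulae described above, the sketch below computes a detection rate on misrecognized formulae and a false alarm rate on correctly recognized ones. The function name, the result fields, and the idea of representing a verifier's output as a list of booleans are assumptions made for this example; they are not part of the database or of the cited paper's method.

```python
from dataclasses import dataclass


@dataclass
class VerificationScore:
    detected: int            # misrecognized formulae that were flagged
    false_alarms: int        # correctly recognized formulae that were flagged
    detection_rate: float    # detected / number of misrecognized formulae
    false_alarm_rate: float  # false_alarms / number of correct formulae


def score_verifier(flags_misrecognized, flags_correct):
    """Score a verifier from its per-formula flags (True = 'looks wrong').

    Both arguments are hypothetical inputs: one boolean per formula in the
    misrecognized group and one per formula in the correctly recognized group.
    """
    detected = sum(1 for flag in flags_misrecognized if flag)
    false_alarms = sum(1 for flag in flags_correct if flag)
    n_mis = len(flags_misrecognized)
    n_ok = len(flags_correct)
    return VerificationScore(
        detected=detected,
        false_alarms=false_alarms,
        detection_rate=detected / n_mis if n_mis else 0.0,
        false_alarm_rate=false_alarms / n_ok if n_ok else 0.0,
    )


# Toy flags standing in for a real verifier's output on a handful of formulae.
score = score_verifier(
    flags_misrecognized=[True, True, False, True],
    flags_correct=[False, False, True, False, False],
)
```

On the full database the two lists would have 1,400 and 3,000 entries respectively. Reporting both rates separately avoids the imbalance between the two groups hiding a verifier that simply flags everything or nothing.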

  • To download the database file, please click the following:



