Leptonica 1.68
C Image Processing Library

partition.c File Reference

Bruel algorithm for extracting large whitespace boxes from a binary image: classification and location of connected components (characters or words) More...

#include <stdio.h>
#include <stdlib.h>
#include "allheaders.h"

Go to the source code of this file.

Data Structures

struct  PartitionElement

Defines

#define OUTPUT_HEAP_STATS   0

Typedefs

typedef struct PartitionElement PARTEL

Functions

static PARTELpartelCreate (BOX *box)
static void partelDestroy (PARTEL **ppartel)
static l_int32 partelSetSize (PARTEL *partel, l_int32 sortflag)
static BOXAboxaGenerateSubboxes (BOX *box, BOXA *boxa, l_int32 maxperim, l_float32 fract)
static BOXboxaSelectPivotBox (BOX *box, BOXA *boxa, l_int32 maxperim, l_float32 fract)
static l_int32 boxCheckIfOverlapIsBig (BOX *box, BOXA *boxa, l_float32 maxoverlap)
BOXAboxaGetWhiteblocks (BOXA *boxas, BOX *box, l_int32 sortflag, l_int32 maxboxes, l_float32 maxoverlap, l_int32 maxperim, l_float32 fract, l_int32 maxpops)
BOXAboxaPruneSortedOnOverlap (BOXA *boxas, l_float32 maxoverlap)

Variables

static const l_int32 DEFAULT_MAX_POPS = 20000

Detailed Description

Bruel algorithm for extracting large whitespace boxes from a binary image: classification and location of connected components (characters or words)

    Whitespace block extraction
        BOXA            *boxaGetWhiteblocks()

    Helpers
        static PARTEL   *partelCreate()
        static void      partelDestroy()
        static l_int32   partelSetSize()
        static BOXA     *boxaGenerateSubboxes()
        static BOX      *boxaSelectPivotBox()
        static l_int32   boxaCheckIfOverlapIsSmall()
        BOXA            *boxaPruneSortedOnOverlap()

Definition in file partition.c.


Define Documentation

#define OUTPUT_HEAP_STATS   0

Definition at line 57 of file partition.c.


Typedef Documentation

typedef struct PartitionElement PARTEL

Definition at line 41 of file partition.c.


Function Documentation

static PARTEL * partelCreate ( BOX box) [static]

partelCreate()

Input: box (region; inserts a copy) Return: partel, or null on error

Definition at line 289 of file partition.c.

References PartitionElement::box, boxCopy(), CALLOC, ERROR_PTR, NULL, and PROCNAME.

Referenced by boxaGetWhiteblocks().

static void partelDestroy ( PARTEL **  ppartel) [static]

partelDestroy()

Input: &partel (<will be="" set="" to="" null="" before="" returning>="">) Return: void

Definition at line 310 of file partition.c.

References PartitionElement::box, PartitionElement::boxa, boxaDestroy(), boxDestroy(), FREE, L_WARNING, NULL, and PROCNAME.

Referenced by boxaGetWhiteblocks().

static l_int32 partelSetSize ( PARTEL partel,
l_int32  sortflag 
) [static]

partelSetSize()

Input: partel sortflag (L_SORT_BY_WIDTH, L_SORT_BY_HEIGHT, L_SORT_BY_MIN_DIMENSION, L_SORT_BY_MAX_DIMENSION, L_SORT_BY_PERIMETER, L_SORT_BY_AREA) Return: 0 if OK, 1 on error

Definition at line 342 of file partition.c.

References PartitionElement::box, boxGetGeometry(), ERROR_INT, L_MAX, L_MIN, L_SORT_BY_AREA, L_SORT_BY_HEIGHT, L_SORT_BY_MAX_DIMENSION, L_SORT_BY_MIN_DIMENSION, L_SORT_BY_PERIMETER, L_SORT_BY_WIDTH, NULL, PROCNAME, and PartitionElement::size.

Referenced by boxaGetWhiteblocks().

static BOXA * boxaGenerateSubboxes ( BOX box,
BOXA boxa,
l_int32  maxperim,
l_float32  fract 
) [static]

boxaGenerateSubboxes()

Input: box (region to be split into up to four overlapping subregions) boxa (boxes of rectangles intersecting the box) maxperim (maximum half-perimeter for which pivot is selected by proximity to box centroid) fract (fraction of box diagonal that is an acceptable distance from the box centroid to select the pivot) Return: boxa (of four or less overlapping subrectangles of the box), or null on error

Definition at line 384 of file partition.c.

References boxaAddBox(), boxaCreate(), boxaSelectPivotBox(), boxCreate(), boxDestroy(), boxGetGeometry(), ERROR_PTR, L_INSERT, NULL, and PROCNAME.

Referenced by boxaGetWhiteblocks().

static BOX * boxaSelectPivotBox ( BOX box,
BOXA boxa,
l_int32  maxperim,
l_float32  fract 
) [static]

boxaSelectPivotBox()

Input: box (containing box; to be split by the pivot box) boxa (boxes of rectangles, from which 1 is to be chosen) maxperim (maximum half-perimeter for which pivot is selected by proximity to box centroid) fract (fraction of box diagonal that is an acceptable distance from the box centroid to select the pivot) Return: box (pivot box for subdivision into 4 rectangles), or null on error

Notes: (1) This is a tricky piece that wasn't discussed in the Breuel's 2002 paper. (2) Selects a box from boxa whose centroid is reasonably close to the centroid of the containing box (xc, yc) and whose half-perimeter does not exceed the maxperim value. (3) If there are no boxes in the boxa that are small enough, then it selects the smallest of the larger boxes, without reference to its location in the containing box. (4) If a small box has a centroid at a distance from the centroid of the containing box that is not more than the fraction 'fract' of the diagonal of the containing box, that box is chosen as the pivot, terminating the search for the nearest small box. (5) Use fract in the range [0.0 ... 1.0]. Set fract = 0.0 to choose the small box nearest the centroid. (6) Choose maxperim to represent a connected component that is small enough so that you don't care about the white space that could be inside of it.

Definition at line 460 of file partition.c.

References boxaGetBox(), boxaGetBoxGeometry(), boxaGetCount(), boxDestroy(), boxGetCenter(), boxGetGeometry(), ERROR_PTR, FALSE, L_CLONE, L_COPY, L_WARNING, NULL, PROCNAME, and TRUE.

Referenced by boxaGenerateSubboxes().

static l_int32 boxCheckIfOverlapIsBig ( BOX box,
BOXA boxa,
l_float32  maxoverlap 
) [static]

boxCheckIfOverlapIsBig()

Input: box (to be tested) boxa (of boxes already stored) maxoverlap (maximum fractional overlap of the input box by any of the boxes in boxa) Return: 0 if box has small overlap with every box in boxa; 1 otherwise or on error

Definition at line 540 of file partition.c.

References boxaGetBox(), boxaGetCount(), boxDestroy(), boxOverlapFraction(), ERROR_INT, L_CLONE, and PROCNAME.

Referenced by boxaGetWhiteblocks().

BOXA* boxaGetWhiteblocks ( BOXA boxas,
BOX box,
l_int32  sortflag,
l_int32  maxboxes,
l_float32  maxoverlap,
l_int32  maxperim,
l_float32  fract,
l_int32  maxpops 
)

boxaGetWhiteblocks()

Input: boxas (typically, a set of bounding boxes of fg components) box (initial region; typically including all boxes in boxas; if null, it computes the region to include all boxes in boxas) sortflag (L_SORT_BY_WIDTH, L_SORT_BY_HEIGHT, L_SORT_BY_MIN_DIMENSION, L_SORT_BY_MAX_DIMENSION, L_SORT_BY_PERIMETER, L_SORT_BY_AREA) maxboxes (maximum number of output whitespace boxes; e.g., 100) maxoverlap (maximum fractional overlap of a box by any of the larger boxes; e.g., 0.2) maxperim (maximum half-perimeter, in pixels, for which pivot is selected by proximity to box centroid; e.g., 200) fract (fraction of box diagonal that is an acceptable distance from the box centroid to select the pivot; e.g., 0.2) maxpops (maximum number of pops from the heap; use 0 as default) Return: boxa (of sorted whitespace boxes), or null on error

Notes: (1) This uses the elegant Breuel algorithm, found in "Two Geometric Algorithms for Layout Analysis", 2002, url: "citeseer.ist.psu.edu/breuel02two.html". It starts with the bounding boxes (b.b.) of the connected components (c.c.) in a region, along with the rectangle representing that region. It repeatedly divides the rectangle into four maximal rectangles that exclude a pivot rectangle, sorting them in a priority queue according to one of the six sort flags. It returns a boxa of the "largest" set that have no intersection with boxes from the input boxas. (2) If box == NULL, the initial region is the minimal region that includes the origin and every box in boxas. (3) maxboxes is the maximum number of whitespace boxes that will be returned. The actual number will depend on the image and the values chosen for maxoverlap and maxpops. In many cases, the actual number will be 'maxboxes'. (4) maxoverlap allows pruning of whitespace boxes depending on the overlap. To avoid all pruning, use maxoverlap = 1.0. To select only boxes that have no overlap with each other (maximal pruning), choose maxoverlap = 0.0. Otherwise, no box can have more than the 'maxoverlap' fraction of its area overlapped by any larger (in the sense of the sortflag) box. (5) Choose maxperim (actually, maximum half-perimeter) to represent a c.c. that is small enough so that you don't care about the white space that could be inside of it. For all such c.c., the pivot for 'quadfurcation' of a rectangle is selected as having a reasonable proximity to the rectangle centroid. (6) Use fract in the range [0.0 ... 1.0]. Set fract = 0.0 to choose the small box nearest the centroid as the pivot. If you choose fract > 0.0, it is suggested that you call boxaPermuteRandom() first, to permute the boxes (see usage below). This should reduce the search time for each of the pivot boxes. (7) Choose maxpops to be the maximum number of rectangles that are popped from the heap. This is an indirect way to limit the execution time. Use 0 for default (a fairly large number). At any time, you can expect the heap to contain about 2.5 times as many boxes as have been popped off. (8) The output result is a sorted set of overlapping boxes, constrained by 'maxboxes', 'maxoverlap' and 'maxpops'. (9) The main defect of the method is that it abstracts out the actual components, retaining only the b.b. for analysis. Consider a component with a large b.b. If this is chosen as a pivot, all white space inside is immediately taken out of consideration. Furthermore, even if it is never chosen as a pivot, as the partitioning continues, at no time will any of the whitespace inside this component be part of a rectangle with zero overlapping boxes. Thus, the interiors of all boxes are necessarily excluded from the union of the returned whitespace boxes. (10) USAGE: One way to accommodate to this weakness is to remove such large b.b. before starting the computation. For example, if 'box' is an input image region containing 'boxa' b.b. of c.c.:

// Faster pivot choosing boxaPermuteRandom(boxa, boxa);

// Remove anything either large width or height boxat = boxaSelectBySize(boxa, maxwidth, maxheight, L_SELECT_IF_BOTH, L_SELECT_IF_LT, NULL);

boxad = boxaGetWhiteblocks(boxat, box, type, maxboxes, maxoverlap, maxperim, fract, maxpops);

The result will be rectangular regions of "white space" that extend into (and often through) the excluded components. (11) As a simple example, suppose you wish to find the columns on a page. First exclude large c.c. that may block the columns, and then call:

boxad = boxaGetWhiteblocks(boxa, box, L_SORT_BY_HEIGHT, 20, 0.15, 200, 0.2, 2000);

to get the 20 tallest boxes with no more than 0.15 overlap between a box and any of the taller ones, and avoiding the use of any c.c. with a b.b. half perimeter greater than 200 as a pivot.

Definition at line 168 of file partition.c.

References PartitionElement::box, PartitionElement::boxa, boxaAddBox(), boxaCopy(), boxaCreate(), boxaDestroy(), boxaGenerateSubboxes(), boxaGetBox(), boxaGetCount(), boxaGetExtent(), boxaIntersectsBox(), boxCheckIfOverlapIsBig(), boxClone(), boxCreate(), boxDestroy(), DEFAULT_MAX_POPS, ERROR_PTR, FALSE, L_CLONE, L_INSERT, L_SORT_BY_AREA, L_SORT_BY_HEIGHT, L_SORT_BY_MAX_DIMENSION, L_SORT_BY_MIN_DIMENSION, L_SORT_BY_PERIMETER, L_SORT_BY_WIDTH, L_SORT_DECREASING, L_WARNING, lheapAdd(), lheapCreate(), lheapDestroy(), lheapRemove(), NULL, partelCreate(), partelDestroy(), partelSetSize(), and PROCNAME.

Referenced by main().

BOXA* boxaPruneSortedOnOverlap ( BOXA boxas,
l_float32  maxoverlap 
)

boxaPruneSortedOnOverlap()

Input: boxas (sorted by size in decreasing order) maxoverlap (maximum fractional overlap of a box by any of the larger boxes) Return: boxad (pruned), or null on error

Notes: (1) This selectively removes smaller boxes when they are overlapped by any larger box by more than the input 'maxoverlap' fraction. (2) To avoid all pruning, use maxoverlap = 1.0. To select only boxes that have no overlap with each other (maximal pruning), set maxoverlap = 0.0. (3) If there are no boxes in boxas, returns an empty boxa.

Definition at line 593 of file partition.c.

References boxaAddBox(), boxaCopy(), boxaCreate(), boxaGetBox(), boxaGetCount(), boxDestroy(), boxOverlapFraction(), ERROR_PTR, FALSE, L_CLONE, L_COPY, L_INSERT, NULL, PROCNAME, and TRUE.


Variable Documentation

const l_int32 DEFAULT_MAX_POPS = 20000 [static]

Definition at line 53 of file partition.c.

Referenced by boxaGetWhiteblocks().

 All Data Structures Files Functions Variables Typedefs Enumerations Enumerator Defines