Fun with Computer Vision
6.27.2010 | Blog, Computer Vision
One of my early projects in Computer Vision was a simple Optical Character Recognition (OCR) program. Although this is by no means anything new, it’s still fun to learn how it works, and I’m going to share with you a little demo of OCR that you can create yourself!
Your mission, should you choose to accept it: Detect all the letter i’s on a document!
A couple of things to note:
- Not all of the letter i’s will look the same (there will be some blotches of ink that make the letters look a little different)
- You won’t have knowledge of what every single letter i looks like (or ever will look like) in the universe
- You won’t have to recognize handwriting (that’s a whole other party)
We’ll be using the following document for our character detection:
First, we can break the problem down into a bunch of subtasks, namely:
- Manipulating the data (inverting and thresholding the image) so the text is easier for the program to pick out.
- Cropping the image to the area of interest (getting rid of the image and handwriting above).
- Finding and extracting ‘templates’, regions of the image which contain the ‘i’s we’re looking for.
- Determining the locations of each row of text.
- Correlating the template of the i character with each possible character.
- Determining thresholds for each of the templates (the ‘bouncer’ which determines if a character is really an i or not).
- Displaying the results with the rectangles around each i.
So let’s get started! I’ll be using MATLAB and the Image Processing Toolbox for my implementation in this post, but I’ll also attach a script which accomplishes the same results using Python.
Manipulating the Data
First things first, we need to read in our image. We’ll do that as follows:
im = imread('TextImage.tif');
im = im2double(im);
This reads in the grayscale image TextImage.tif, and im2double converts each pixel to an intensity between 0 and 1 inclusive. However, we’re not quite done yet. By convention, data of interest is represented as white (1) and data that isn’t of interest as black (0), so we’ll invert the image to make the black text we just read in white. We then threshold it, so that any pixel brighter than THRESHOLD becomes 1 (text) and everything else becomes 0 (background):
im = imadjust(im, [0 1],[1 0]);
THRESHOLD = 0.05;
im = im > THRESHOLD;
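If you’re following along in Python (the language of the companion script mentioned above, although the sketch below is my own and not taken from that script), here’s a minimal version of the invert-and-threshold step. It assumes the image is already a list of rows with intensities between 0 and 1, and the function name invert_and_threshold is hypothetical. Computing 1 - p is the same linear mapping that imadjust(im, [0 1], [1 0]) applies.

```python
def invert_and_threshold(image, threshold=0.05):
    """Invert a grayscale image (values in [0, 1]) and binarize it.

    Dark ink becomes 1 (data of interest) and white paper becomes 0,
    mirroring imadjust(im, [0 1], [1 0]) followed by im > THRESHOLD.
    """
    binary = []
    for row in image:
        # 1 - p flips dark to bright; keep anything brighter than the threshold
        binary.append([1 if (1.0 - p) > threshold else 0 for p in row])
    return binary


page = [
    [1.0, 1.0, 0.0],   # white, white, black ink
    [0.98, 0.3, 1.0],  # faint smudge, dark grey ink, white
]
print(invert_and_threshold(page))  # prints [[0, 0, 1], [0, 1, 0]]
```

Note how the faint smudge (0.98) stays background: after inversion it’s only 0.02, below the 0.05 threshold.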
Cropping the Image
Although this step isn’t essential, and the algorithm we’ll be implementing soon works fine without it, we’re going to restrict the area we’ll be looking at. This helps with visualization at the end, and can be accomplished pretty quickly.
im = im(400:end,:);
This line of code keeps row 400 through the last row of the image (and every column), cutting off the 399 rows above it. Note that end is a MATLAB keyword meaning “the last index in this dimension”; it’s not a variable we set earlier.
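MATLAB counts from 1 and its ranges include both endpoints, while Python counts from 0 and its slices exclude the stop value. Here’s a small sketch of the same crop in Python, using a hypothetical helper called crop_from_row on a nested-list image:

```python
def crop_from_row(image, first_row_matlab):
    """Equivalent of MATLAB's image(first_row_matlab:end, :).

    MATLAB row N (1-based) is Python index N - 1, and a slice with no
    stop value runs to the last row, just like MATLAB's `end`.
    """
    return [row[:] for row in image[first_row_matlab - 1:]]


image = [[r * 10 + c for c in range(3)] for r in range(5)]  # 5 rows, 3 columns
cropped = crop_from_row(image, 3)  # like image(3:end, :)
print(len(cropped), cropped[0])  # prints 3 [20, 21, 22]
```

For the post’s image, crop_from_row(im, 400) would correspond to im(400:end,:).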
Finding Templates
The next goal is to extract i’s from the image above that are of interest to us (computer vision junkies call these ‘templates’). I usually open up Adobe Photoshop, manually select areas of interest, and copy and paste the pixel coordinates. It’s important to note that we’re ‘cheating’ on some level, since we have a priori knowledge of what our document looks like, a luxury that commercial OCR systems usually don’t have.
Luckily, I’ve done some of the dirty work for you and found some i’s that are representative of the i’s across the entire document. Here are my findings:
templateOne = im(510:542, 532:551);
templateOne = im2double(templateOne);
templateTwo = im(68:100, 390:409);
templateTwo = im2double(templateTwo);
templateThree = im(266:298, 899:918);
templateThree = im2double(templateThree);
templateFour = im(67:99, 144:163);
templateFour = im2double(templateFour);
templateFive = im(168:200, 875:894);
templateFive = im2double(templateFive);
So we now have five templates (patches of the letter i in the image). Each template is 20 pixels wide and 33 pixels tall, since MATLAB ranges like 510:542 include both endpoints. Now we’re going to combine all of them into an array for easy access later.
templates(:,:,1) = templateOne;
templates(:,:,2) = templateTwo;
templates(:,:,3) = templateThree;
templates(:,:,4) = templateFour;
templates(:,:,5) = templateFive;
This creates a three-dimensional array. The first two dimensions hold the pixel values of each template (0 or 1 here, since we thresholded the image earlier; im2double just stores them as doubles). Each index in the third dimension represents another template (the colon (:) operator in MATLAB is shorthand for “include everything”). So imagine a cardboard box stacked with sheets of paper, each with a letter i on it, roughly speaking.
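For the Python-inclined, here’s a sketch of the extraction and stacking steps on nested lists. extract_patch and stack_templates are hypothetical helper names, and the blank page is a stand-in for the real cropped image, since we only want to check shapes here. It reuses the post’s 1-based, inclusive coordinates and puts the template index first, so templates[k - 1] plays the role of MATLAB’s templates(:,:,k):

```python
def extract_patch(image, rows, cols):
    """Equivalent of MATLAB's image(r1:r2, c1:c2) with 1-based, inclusive ranges."""
    (r1, r2), (c1, c2) = rows, cols
    return [row[c1 - 1:c2] for row in image[r1 - 1:r2]]


def stack_templates(patches):
    """Stack equally sized 2-D patches, like templates(:,:,k) = patch in MATLAB.

    The template index comes first in Python: stack[k] is one sheet in the
    box and stack[k][r][c] is one pixel of it.
    """
    height, width = len(patches[0]), len(patches[0][0])
    for patch in patches:
        # MATLAB rejects assigning a differently sized slice, so check explicitly
        if len(patch) != height or any(len(row) != width for row in patch):
            raise ValueError("all templates must have the same size")
    return [[row[:] for row in patch] for patch in patches]


# (rows, columns) ranges copied from the MATLAB snippets above
TEMPLATE_RANGES = [
    ((510, 542), (532, 551)),
    ((68, 100), (390, 409)),
    ((266, 298), (899, 918)),
    ((67, 99), (144, 163)),
    ((168, 200), (875, 894)),
]

# Blank stand-in for the cropped page; big enough to contain every range
page = [[0.0] * 1000 for _ in range(600)]
templates = stack_templates([extract_patch(page, r, c) for r, c in TEMPLATE_RANGES])
print(len(templates), "templates, each", len(templates[0]), "x", len(templates[0][0]))
# prints 5 templates, each 33 x 20
```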
Finding the rows of text
Here’s some fun debauchery. We need to detect where each row of text begins (and ends) so that we know where to match our templates, and this can be accomplished fairly quickly.
First, we transpose the image (flip it along its diagonal, which roughly puts it on its side) and add up the pixel values in each column. Because of the transpose, what used to be rows are now columns. We transpose because MATLAB’s sum adds down each column by default (sum(im, 2) would sum across rows directly and give the same values as a column vector). We’ll store the result in an array called spaces.
spaces = sum(im');
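To convince yourself the transpose trick works, here’s a tiny pure-Python check (row_profile and transpose are hypothetical names, and the 4x4 image is made up): summing the columns of the transposed image gives exactly the row sums of the original.

```python
def row_profile(image):
    """Sum of each pixel row, the Python counterpart of MATLAB's sum(im').

    sum(im') transposes the image so rows become columns, then sums each
    column; summing each row directly gives the same numbers.
    """
    return [sum(row) for row in image]


def transpose(image):
    """Swap rows and columns, like MATLAB's ' operator on a real matrix."""
    return [list(col) for col in zip(*image)]


binary = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
# Column sums of the transposed image equal the row sums of the original
column_sums_of_transpose = [sum(col) for col in zip(*transpose(binary))]
print(row_profile(binary), column_sums_of_transpose)  # prints [0, 2, 4, 0] [0, 2, 4, 0]
```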
Next, we need to determine (another) threshold: the minimum sum a row of pixels should have if it contains text. If a row’s sum doesn’t exceed the threshold, there probably isn’t any text in it. We can threshold the row sums by doing the following:
ROW_SPACE_THRESHOLD = 10;
spaces = spaces > (ROW_SPACE_THRESHOLD);
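Now let’s detect where changes occur by taking the discrete derivative of spaces with diff, which highlights where values change from 0 to 1 and vice versa. begins stores where each row of text begins (derivative is 1), and ends stores where each row ends (derivative is -1). Strictly speaking, since diff compares each element with the next one, begins(i) points at the blank pixel row just above the text, while ends(i) points at the last row of text.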
spaceDerivative = diff(spaces);
begins = find(spaceDerivative == 1);
ends = find(spaceDerivative == -1);
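Here’s the same threshold-then-diff idea as a Python sketch (find_runs is a hypothetical name). It returns 1-based indices so they line up with MATLAB’s find, and the profile values are made up for illustration:

```python
def find_runs(profile, threshold):
    """Locate runs of 'text' in a 1-D profile the way the MATLAB code does.

    Mirrors: spaces = profile > threshold; d = diff(spaces);
             begins = find(d == 1); ends = find(d == -1);
    Returned indices are 1-based, like MATLAB's find.
    """
    spaces = [1 if value > threshold else 0 for value in profile]
    # diff: derivative[k] = spaces[k + 1] - spaces[k]
    derivative = [b - a for a, b in zip(spaces, spaces[1:])]
    begins = [k + 1 for k, d in enumerate(derivative) if d == 1]
    ends = [k + 1 for k, d in enumerate(derivative) if d == -1]
    return begins, ends


# Row sums: blank, noise, two rows of text, noise, blank, one row of text, noise
profile = [0, 3, 25, 30, 2, 0, 40, 1]
begins, ends = find_runs(profile, threshold=10)
print(begins, ends)  # prints [2, 6] [4, 7]
```

One consequence worth knowing: a run of text that touches the very first row has no 0-to-1 transition, so it gets an end but no begin (find_runs([20, 20, 0], 10) returns ([], [2])). That’s one way the counts of begins and ends can end up different.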
Correlating the template
Before we sink our teeth into the meat of the OCR, we first need to define thresholds for each character template we found. Roughly speaking, this is the amount of ‘slack’ we’ll give to each character we find. If a character falls within this ‘slack’ region, the program will think it’s an i; otherwise it won’t. This is useful because we’ll never know what every single i looks like (and even if we did, comparing against all of them would be extremely inefficient), so this defines a level of tolerance: how much a character can deviate from our templates of the letter i and still be considered an i. We’ll define these using integer constants determined by experimentation.
MAX_FILTERED_I_VALUES = [ 145, 147, 147, 170, 150 ];
Each index of MAX_FILTERED_I_VALUES corresponds to the threshold for one template (e.g. the first number corresponds to the first template, and so on). The meaning and purpose of these thresholds will be explored in greater detail very soon.
Now! This is where the fun starts. First we’re going to want to iterate through each row we found. We can do that like so:
for i = 1 : (min([length(begins) length(ends)])),
Next we’re going to want to extract each row, and then shortly thereafter each character from each row. We can get the current row this way:
currentRow = im(begins(i):ends(i),:);
This says, “for this particular row index, we only care about the pixel rows from where this line of text begins to where it ends” (across every column). Now we need to find the spaces between the characters:
row_spaces = sum(currentRow);
row_spaces = row_spaces > 0.8;
row_diff = diff(row_spaces);
charBegin = find(row_diff == 1);
charEnd = find(row_diff == -1);
This is similar to what we did before: we take the row of text we found, check each column’s sum against a threshold (which I found by experimentation for you) to see whether there’s actually text there, and then take the derivative of the thresholded result to see where each character begins and ends.
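The column version is the same trick turned sideways. Here’s a hypothetical split_characters helper in Python, run on a made-up three-pixel-tall row of text; zip stops at the shorter list, just like looping up to min([length(charEnd) length(charBegin)]):

```python
def split_characters(row_band, threshold=0.8):
    """Find 1-based (begin, end) column pairs for characters in one row of text.

    Mirrors: row_spaces = sum(currentRow) > 0.8; row_diff = diff(row_spaces);
             charBegin = find(row_diff == 1); charEnd = find(row_diff == -1);
    """
    column_sums = [sum(col) for col in zip(*row_band)]  # MATLAB sum() adds down columns
    ink = [1 if s > threshold else 0 for s in column_sums]
    derivative = [b - a for a, b in zip(ink, ink[1:])]
    begins = [k + 1 for k, d in enumerate(derivative) if d == 1]
    ends = [k + 1 for k, d in enumerate(derivative) if d == -1]
    # Pair them up the way the loop does: only as many as the shorter list
    return list(zip(begins, ends))


# Two "characters" separated by blank columns
band = [
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0],
]
print(split_characters(band))  # prints [(1, 2), (4, 6)]
```

As with the rows, each begin index is the blank column just before a character, so the extracted character includes a one-pixel margin on its left.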
Now let’s loop through each character we found:
for j = 1 : min([length(charEnd) length(charBegin)]),
Let’s get things set up for processing:
currentChar = currentRow(:,charBegin(j):charEnd(j));
currentChar = im2double(currentChar);
possibleMach = 0;
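In the above code, we’re segmenting the character from the row and converting it to double precision. The conversion matters because imfilter returns a result of the same class as its input, so a logical image couldn’t hold the large correlation sums we’ll compare against MAX_FILTERED_I_VALUES. We’re also creating a flag called possibleMach (short for “possible match”) to record whether we’ve found something that is likely to be an i.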
Now, for each character we found, iterate through the templates we extracted earlier to see if there’s a match (that’s three nested loops, so enjoy your O(n^3) efficiency!). We can iterate through each template this way. templates(1,1,:) grabs the top-left pixel of every template, one value per slice along the third dimension (which is where we stacked our templates), so its length is the number of templates; size(templates, 3) returns the same number.
for k = 1 : length(templates(1,1,:)),
And we can correlate our above character with the template this way:
filtered = abs(imfilter(currentChar,templates(:,:,k),'corr'));
possibleMach = max(filtered(:)) > MAX_FILTERED_I_VALUES(k);
This is where the magic happens! Let’s break apart what we’re doing here:
In the first line of code, we compute a correlation (requested by passing the 'corr' option to imfilter). At each position, the correlation lines the template up with the character, multiplies each pixel in the neighborhood by the corresponding template pixel, and sums the products; that sum becomes the output value at that position. The output tends to be highest where the template most closely matches. In the second line, if the largest of those values exceeds the template’s threshold (MAX_FILTERED_I_VALUES(k)), we have a possible match.
In other words, the largest value in the array returned by imfilter is (roughly speaking) a score of how closely the character we just found matches a template, and if that score is greater than the threshold we set earlier in MAX_FILTERED_I_VALUES, we’ve got a match!
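Here’s a pure-Python sketch of what imfilter(currentChar, template, 'corr') computes, assuming zero padding outside the character and an output the same size as the character. That’s my understanding of imfilter’s defaults, and I’m also assuming the template’s centre sits at index floor((size+1)/2) in each dimension, so check the imfilter documentation for the exact border behaviour. correlate_same and is_match are hypothetical names, and the 3x3 template is a made-up toy “i”, not one of the real templates.

```python
def correlate_same(image, template):
    """2-D correlation with zero padding, output the same size as `image`.

    out[r][c] = sum over (u, v) of image[r + u - cr][c + v - cc] * template[u][v],
    where (cr, cc) is the template centre. Pixels outside the image count as 0.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    # 0-based centre; equals MATLAB's 1-based floor((size + 1) / 2) minus 1
    cr, cc = (th - 1) // 2, (tw - 1) // 2
    out = [[0.0] * iw for _ in range(ih)]
    for r in range(ih):
        for c in range(iw):
            total = 0.0
            for u in range(th):
                for v in range(tw):
                    rr, col = r + u - cr, c + v - cc
                    if 0 <= rr < ih and 0 <= col < iw:
                        total += image[rr][col] * template[u][v]
            out[r][c] = total
    return out


def is_match(character, template, max_value):
    """True if the best correlation score beats the template's threshold."""
    scores = correlate_same(character, template)
    return max(abs(s) for row in scores for s in row) > max_value


# Toy 3x3 "i": a vertical stroke in the middle column
template = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
same_shape = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
different = [[1, 1, 1], [0, 0, 0], [1, 1, 1]]
print(max(max(r) for r in correlate_same(same_shape, template)))  # prints 3.0
print(max(max(r) for r in correlate_same(different, template)))   # prints 2.0
print(is_match(same_shape, template, 2), is_match(different, template, 2))  # prints True False
```

One caveat the toy example hints at: on binary images, plain correlation rewards any ink that overlaps the template’s ink, so a solid blob scores at least as high as a perfect i (a 3x3 block of ones also scores 3.0 here). That’s part of why the thresholds need per-template tuning; normalized cross-correlation is the usual refinement.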
Displaying the Results
The party is over, and now it’s time to clean up. So if we’ve found a possible match, draw a red rectangle around it:
if (possibleMach)
% Draw a red box around each character matching
% Left edge
finalRGB(begins(i):ends(i),charBegin(j),1) = 1.0;
finalRGB(begins(i):ends(i),charBegin(j),2) = 0.0;
finalRGB(begins(i):ends(i),charBegin(j),3) = 0.0;
% Right edge
finalRGB(begins(i):ends(i),charEnd(j),1) = 1.0;
finalRGB(begins(i):ends(i),charEnd(j),2) = 0.0;
finalRGB(begins(i):ends(i),charEnd(j),3) = 0.0;
% Top edge
finalRGB(begins(i),charBegin(j):charEnd(j),1) = 1.0;
finalRGB(begins(i),charBegin(j):charEnd(j),2) = 0.0;
finalRGB(begins(i),charBegin(j):charEnd(j),3) = 0.0;
% Bottom edge
finalRGB(ends(i),charBegin(j):charEnd(j),1) = 1.0;
finalRGB(ends(i),charBegin(j):charEnd(j),2) = 0.0;
finalRGB(ends(i),charBegin(j):charEnd(j),3) = 0.0;
% One matching template is enough; skip the remaining templates
break;
end
end % for k (templates)
end % for j (characters)
end % for i (rows)
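We draw each edge of the rectangle by setting the red channel (in MATLAB, index 1 in the third dimension of an RGB image) to 1 (all the way) and setting green and blue to 0. Note that finalRGB isn’t created in any of the snippets above: it needs to be an RGB image the same size as the cropped im, made before the loops start (for example, finalRGB = repmat(double(im), [1 1 3]); gives an RGB copy of the thresholded image, white text on black). Once the loops finish, imshow(finalRGB) displays the result.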
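Here’s a minimal Python sketch of the same drawing logic, assuming an RGB image stored as nested lists of [red, green, blue] values between 0 and 1. draw_red_box and gray_to_rgb are hypothetical helpers; the second one shows one way to build the RGB copy to draw on.

```python
def draw_red_box(rgb, top, bottom, left, right):
    """Outline rows top..bottom and columns left..right (0-based, inclusive) in red.

    rgb[r][c] is a [red, green, blue] list with channels in [0, 1], so setting
    it to [1.0, 0.0, 0.0] paints that pixel pure red, like the MATLAB code does
    one channel at a time.
    """
    for r in range(top, bottom + 1):   # left and right edges
        rgb[r][left] = [1.0, 0.0, 0.0]
        rgb[r][right] = [1.0, 0.0, 0.0]
    for c in range(left, right + 1):   # top and bottom edges
        rgb[top][c] = [1.0, 0.0, 0.0]
        rgb[bottom][c] = [1.0, 0.0, 0.0]


def gray_to_rgb(gray):
    """Copy a grayscale image into three identical channels (like repmat(im, [1 1 3]))."""
    return [[[p, p, p] for p in row] for row in gray]


canvas = gray_to_rgb([[0.0] * 4 for _ in range(3)])  # 3 rows x 4 columns, all black
draw_red_box(canvas, top=0, bottom=2, left=0, right=3)
print(canvas[1][0], canvas[1][1])  # prints [1.0, 0.0, 0.0] [0.0, 0.0, 0.0]
```

The edge pixel turns red while the interior pixel is untouched, so only the outline is drawn and the character inside stays visible.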
We’re done! Our final product now looks like this:
If you’d like to download the example source code, it can be found here:
http://chriscowdery.com/download/blog/ocr-fun.zip





