OpenCV is the de facto standard library in the computer vision world. Besides its many useful features, it has a machine learning module that the community has paid less attention to. In this tutorial, we will see how to use well-known ML models to classify handwritten digits. As always, the code is in C++ and available on GitHub.
Loading the data
Machine learning is, most of the time, about data preparation. To avoid messy data manipulation and quickly start using OpenCV’s ML module, we use a simple dataset of handwritten digits, which is available at samples/data/digits.png in the OpenCV repo.
It contains 500 samples for each digit, which adds up to 5000 samples in total. All we need to do is extract the 20×20 image of each sample:
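std::vector<cv::Mat> extractDigits(const cv::Mat &img)
{
    constexpr int digitSize = 20;  // each digit is a 20x20 tile
    std::vector<cv::Mat> digits;
    for(int i = 0; i < img.rows; i += digitSize)
        for(int j = 0; j < img.cols; j += digitSize) {
            cv::Rect roi = cv::Rect(j, i, digitSize, digitSize);
            cv::Mat digitImg = img(roi);  // a view that shares pixels with img
            digits.push_back(digitImg);
        }
    return digits;
}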
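The labels that go with these tiles are not shown in this section. As a minimal sketch, assuming the tiles come out in consecutive runs of 500 per digit (which is what a row-by-row scan of digits.png should give), they could be generated like this. The name makeLabels is hypothetical and is not the loadLabels() helper used later.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: one label per sample, assuming the samples are stored
// in consecutive runs of equal length, one run per class, in the order
// 0, 1, ..., numClasses - 1.
std::vector<int> makeLabels(int numClasses, int samplesPerClass)
{
    assert(numClasses > 0 && samplesPerClass > 0);
    std::vector<int> labels;
    for (int c = 0; c < numClasses; ++c)
        for (int s = 0; s < samplesPerClass; ++s)
            labels.push_back(c);
    return labels;
}
```

Calling makeLabels(10, 500) gives 5000 labels, where the first 500 are 0 and the last 500 are 9.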
Now, if we extract features from these digits and train a model, we can hopefully classify digits in any English document. But before that, let’s apply a preprocessing step to help the classifier.
As you may have noticed, the handwritten digits are not necessarily straight. Some of them are slanted to the left or right. Clearly, the features of a right-slanted “1” would be different from those of a straight “1”. To help the classifier, we can apply an affine transformation to the slanted digits and make them straight. The affine matrix can be estimated using the moments of the image.
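To make the moment-based estimate concrete, here is a minimal sketch in plain C++ (no OpenCV), assuming the commonly used recipe: the skew is the ratio of the central moments mu11 and mu02, and the correction is a horizontal shear. The Image alias and the function name deskewMatrix are made up for this illustration. In OpenCV, a matrix like this is typically passed to cv::warpAffine together with the cv::WARP_INVERSE_MAP flag.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<float>>;  // row-major: img[y][x]

// Returns a 2x3 affine matrix stored as {a, b, c, d, e, f}, meaning the rows
// [a b c] and [d e f]. It describes a horizontal shear that undoes the slant.
std::array<double, 6> deskewMatrix(const Image &img)
{
    assert(!img.empty() && !img[0].empty());
    const std::array<double, 6> identity = {1.0, 0.0, 0.0, 0.0, 1.0, 0.0};
    const double size = static_cast<double>(img.size());

    // Raw moments m00, m10, m01 give the centre of mass (cx, cy).
    double m00 = 0.0, m10 = 0.0, m01 = 0.0;
    for (std::size_t y = 0; y < img.size(); ++y)
        for (std::size_t x = 0; x < img[y].size(); ++x) {
            const double v = img[y][x];
            m00 += v;
            m10 += static_cast<double>(x) * v;
            m01 += static_cast<double>(y) * v;
        }
    if (m00 == 0.0) return identity;  // blank image: nothing to correct
    const double cx = m10 / m00, cy = m01 / m00;

    // Central moments: mu11 measures how x drifts as y changes,
    // mu02 is the vertical spread of the stroke.
    double mu11 = 0.0, mu02 = 0.0;
    for (std::size_t y = 0; y < img.size(); ++y)
        for (std::size_t x = 0; x < img[y].size(); ++x) {
            const double v = img[y][x];
            const double dx = static_cast<double>(x) - cx;
            const double dy = static_cast<double>(y) - cy;
            mu11 += dx * dy * v;
            mu02 += dy * dy * v;
        }
    if (std::abs(mu02) < 1e-2) return identity;  // too flat to estimate a slant

    const double skew = mu11 / mu02;
    return {1.0, skew, -0.5 * size * skew, 0.0, 1.0, 0.0};
}
```

A perfectly vertical stroke gives a skew of 0 (the identity matrix), while a stroke along the diagonal gives a skew of 1.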
Histogram of Oriented Gradients (HOG) is a powerful and discriminative feature that has been used successfully for several tasks, such as pedestrian detection. With OpenCV, we can create an instance of cv::HOGDescriptor and call its compute() function to calculate the features.
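To give a feeling for what HOG measures, here is a heavily simplified sketch in plain C++: a single orientation histogram over one image patch, weighted by gradient magnitude. It is not what cv::HOGDescriptor computes (the real descriptor works on cells and blocks and normalizes them), and the 9 unsigned bins and the function name are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<float>>;  // row-major: img[y][x]

// Magnitude-weighted histogram of unsigned gradient directions (0..180 deg)
// over the interior pixels of a patch. Border pixels are skipped because the
// central difference needs both neighbours.
std::vector<float> orientationHistogram(const Image &img, int bins = 9)
{
    assert(bins > 0);
    const double pi = std::acos(-1.0);
    std::vector<float> hist(bins, 0.0f);
    const int rows = static_cast<int>(img.size());
    const int cols = rows > 0 ? static_cast<int>(img[0].size()) : 0;

    for (int y = 1; y + 1 < rows; ++y)
        for (int x = 1; x + 1 < cols; ++x) {
            const double gx = img[y][x + 1] - img[y][x - 1];
            const double gy = img[y + 1][x] - img[y - 1][x];
            const double mag = std::sqrt(gx * gx + gy * gy);
            if (mag == 0.0) continue;  // flat pixel: no direction to vote for

            double angle = std::atan2(gy, gx);  // in [-pi, pi]
            if (angle < 0.0) angle += pi;       // unsigned: fold into [0, pi]
            int bin = static_cast<int>(angle / pi * bins);
            if (bin >= bins) bin = 0;           // angle == pi is the same as 0
            hist[bin] += static_cast<float>(mag);
        }
    return hist;
}
```

A patch whose intensity grows from left to right puts all of its votes into the first bin, while a patch that grows from top to bottom votes for the middle bin.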
Now that the features and labels of the digits are available, let’s create our dataset. OpenCV provides an easy and abstract cv::ml::TrainData class for this. It allows us to shuffle the data and split it into training and testing sets via the setTrainTestSplitRatio() function.
cv::Ptr<cv::ml::TrainData> createTrainData(const std::string &imgPath)
{
    cv::Mat img = cv::imread(imgPath, cv::IMREAD_GRAYSCALE);
    auto digitImages = extractDigits(img);
    auto features = extractFeatures(digitImages);
    auto labels = loadLabels();
    cv::Ptr<cv::ml::TrainData> data = cv::ml::TrainData::create(features, cv::ml::ROW_SAMPLE, labels);
    data->setTrainTestSplitRatio(0.8);  // 80% for training; shuffles by default
    return data;
}
Now, 80% of the data (4000 digits) is used for training and the remaining 20% (1000 digits) is used for testing.
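If you are curious what a shuffled split amounts to, here is a minimal sketch in plain C++. It only illustrates the idea and makes no claim about how cv::ml::TrainData does it internally; the function name and the explicit seed are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffles the sample indices 0..sampleCount-1 and cuts them in two.
// Returns {training indices, testing indices}.
std::pair<std::vector<int>, std::vector<int>>
splitIndices(int sampleCount, double trainRatio, unsigned seed)
{
    assert(sampleCount >= 0 && trainRatio >= 0.0 && trainRatio <= 1.0);
    std::vector<int> idx(sampleCount);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...

    std::mt19937 rng(seed);                // fixed seed: reproducible split
    std::shuffle(idx.begin(), idx.end(), rng);

    const long trainCount = std::lround(sampleCount * trainRatio);
    std::vector<int> train(idx.begin(), idx.begin() + trainCount);
    std::vector<int> test(idx.begin() + trainCount, idx.end());
    return std::make_pair(train, test);
}
```

With 5000 samples and a ratio of 0.8, this yields 4000 training indices and 1000 testing indices, with no index appearing in both.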
K Nearest Neighbor is the simplest classifier in our list. It is so simple that no training phase is involved! To label a new sample, it calculates the sample’s distance from all the training samples, sorts the list of distances in ascending order, and selects the label that is dominant among the first K items of the list. Note that “distance” here refers to the Euclidean distance between feature vectors (here, HOG features).
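The procedure described above fits in a few lines. The following is a minimal sketch in plain C++, not OpenCV’s implementation; the names and the tie-breaking rule (the smallest label wins) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Feature = std::vector<float>;

// Euclidean distance between two feature vectors of equal length.
double euclidean(const Feature &a, const Feature &b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Labels the query with the dominant label among its k nearest neighbours.
int knnPredict(const std::vector<Feature> &trainX, const std::vector<int> &trainY,
               const Feature &query, int k)
{
    assert(!trainX.empty() && trainX.size() == trainY.size() && k > 0);

    // 1. Distance from the query to every training sample.
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    for (std::size_t i = 0; i < trainX.size(); ++i)
        dist.emplace_back(euclidean(trainX[i], query), trainY[i]);

    // 2. Sort in ascending order of distance.
    std::sort(dist.begin(), dist.end());

    // 3. Count the labels of the first k items.
    const std::size_t kk = std::min<std::size_t>(k, dist.size());
    std::map<int, int> votes;
    for (std::size_t i = 0; i < kk; ++i) ++votes[dist[i].second];

    // 4. Pick the label with the most votes (ties go to the smallest label).
    int best = votes.begin()->first, bestCount = 0;
    for (const auto &v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}
```

Notice that all the work happens at prediction time, which is why the method needs no training but gets slower as the training set grows.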
The Normal Bayes classifier assumes that the distribution of the features in each class is Gaussian. In the training phase, it finds the mean and covariance of each class’s distribution, and in the testing phase, it assigns a sample to the class whose distribution it is closest to, that is, the most likely class.
I found something strange with OpenCV’s implementation of this algorithm. Feeding the dataset object directly to the train() function results in a much less accurate model than one trained with separate features and labels!
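Here is a minimal sketch of a Gaussian classifier in plain C++ to illustrate the idea of fitting one distribution per class. To stay short, it assumes a diagonal covariance (one variance per feature) instead of a full covariance matrix, so it is a simplification and not what OpenCV implements. All names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <vector>

using Feature = std::vector<double>;

struct GaussianClass {
    int label;
    double logPrior;   // log of the fraction of samples in this class
    Feature mean;      // per-feature mean
    Feature variance;  // per-feature variance (diagonal covariance)
};

// Training: estimate the mean and variance of every class.
std::vector<GaussianClass> fitGaussians(const std::vector<Feature> &X,
                                        const std::vector<int> &y)
{
    assert(!X.empty() && X.size() == y.size());
    const std::size_t dims = X[0].size();
    std::map<int, std::vector<std::size_t>> rowsOf;  // label -> sample indices
    for (std::size_t i = 0; i < y.size(); ++i) rowsOf[y[i]].push_back(i);

    std::vector<GaussianClass> model;
    for (const auto &entry : rowsOf) {
        const std::vector<std::size_t> &rows = entry.second;
        const double n = static_cast<double>(rows.size());
        GaussianClass g{entry.first, std::log(n / X.size()),
                        Feature(dims, 0.0), Feature(dims, 0.0)};
        for (std::size_t i : rows)
            for (std::size_t j = 0; j < dims; ++j) g.mean[j] += X[i][j] / n;
        for (std::size_t i : rows)
            for (std::size_t j = 0; j < dims; ++j) {
                const double d = X[i][j] - g.mean[j];
                g.variance[j] += d * d / n;
            }
        for (double &v : g.variance) v += 1e-6;  // avoid division by zero
        model.push_back(g);
    }
    return model;
}

// Testing: choose the class with the highest Gaussian log-likelihood.
int predictGaussian(const std::vector<GaussianClass> &model, const Feature &x)
{
    assert(!model.empty());
    const double twoPi = 2.0 * std::acos(-1.0);
    int best = model.front().label;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const GaussianClass &g : model) {
        double score = g.logPrior;
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double d = x[j] - g.mean[j];
            score -= 0.5 * (std::log(twoPi * g.variance[j]) + d * d / g.variance[j]);
        }
        if (score > bestScore) { bestScore = score; best = g.label; }
    }
    return best;
}
```

With one class centred around 0 and another around 10, a sample at 0.5 is assigned to the first class and a sample at 9.5 to the second.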
Unlike linear regression, which tries to fit a line to the data using the least-squares method, logistic regression tries to fit a logistic (sigmoid) function using the maximum-likelihood technique. The simplicity of doing inference with this method has made it very popular among machine learning practitioners. OpenCV provides functions to set the optimizer (batch gradient descent or mini-batch gradient descent) as well as the learning rate and the number of iterations. To avoid overfitting, we enable regularization based on the L2 norm. Note that, unlike the other classifiers, logistic regression requires the labels to be float (CV_32F) instead of int (CV_32S).
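To show how the learning rate, the number of iterations, and the L2 term interact, here is a minimal sketch of binary logistic regression trained with batch gradient descent in plain C++. It is an illustration under simple assumptions (two classes, labels stored as 0.0 and 1.0), not OpenCV’s implementation, and the names are made up.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

struct LogisticModel {
    std::vector<double> w;  // one weight per feature
    double b = 0.0;         // bias
};

// Probability that x belongs to class 1.
double predictProbability(const LogisticModel &m, const std::vector<double> &x)
{
    double z = m.b;
    for (std::size_t j = 0; j < m.w.size(); ++j) z += m.w[j] * x[j];
    return sigmoid(z);
}

// Batch gradient descent on the negative log-likelihood plus an L2 penalty.
LogisticModel trainLogistic(const std::vector<std::vector<double>> &X,
                            const std::vector<double> &y,  // 0.0 or 1.0
                            double learningRate, int iterations, double l2)
{
    assert(!X.empty() && X.size() == y.size());
    const std::size_t n = X.size(), dims = X[0].size();
    LogisticModel m;
    m.w.assign(dims, 0.0);

    for (int it = 0; it < iterations; ++it) {
        std::vector<double> gradW(dims, 0.0);
        double gradB = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double err = predictProbability(m, X[i]) - y[i];
            for (std::size_t j = 0; j < dims; ++j) gradW[j] += err * X[i][j];
            gradB += err;
        }
        // The l2 * w term shrinks the weights and so limits overfitting.
        for (std::size_t j = 0; j < dims; ++j)
            m.w[j] -= learningRate * (gradW[j] / n + l2 * m.w[j]);
        m.b -= learningRate * gradB / n;
    }
    return m;
}
```

For more than two classes, a common extension is to train one such model per class (one-vs-rest) and pick the class with the highest probability.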
Decision trees can be regarded as a sequence of binary decisions that try to get close to the answer by rejecting false answers. A tree starts at a root node and ends at the leaf nodes. The decision rule at each node is learned from the features in the training phase. The depth of the tree determines the complexity of the classifier.
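Here is a minimal sketch in plain C++ of how a trained tree is used for prediction: each inner node tests one feature against a threshold, and each leaf holds a class label. The Node layout and the helper names are assumptions for illustration, and the training step (choosing the features and thresholds) is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Node {
    int feature = -1;        // index of the tested feature (inner nodes only)
    double threshold = 0.0;  // go left if x[feature] <= threshold
    int label = -1;          // class label (leaves only)
    std::unique_ptr<Node> left, right;
    bool isLeaf() const { return !left && !right; }
};

std::unique_ptr<Node> makeLeaf(int label)
{
    std::unique_ptr<Node> n(new Node);
    n->label = label;
    return n;
}

std::unique_ptr<Node> makeSplit(int feature, double threshold,
                                std::unique_ptr<Node> left,
                                std::unique_ptr<Node> right)
{
    assert(left && right);  // an inner node always has two children
    std::unique_ptr<Node> n(new Node);
    n->feature = feature;
    n->threshold = threshold;
    n->left = std::move(left);
    n->right = std::move(right);
    return n;
}

// Walks from the root to a leaf, making one binary decision per level.
int predict(const Node &root, const std::vector<double> &x)
{
    const Node *n = &root;
    while (!n->isLeaf())
        n = (x[n->feature] <= n->threshold) ? n->left.get() : n->right.get();
    return n->label;
}

// Number of decisions on the longest path from this node to a leaf.
int depth(const Node &n)
{
    if (n.isLeaf()) return 0;
    return 1 + std::max(depth(*n.left), depth(*n.right));
}
```

A deeper tree can carve the feature space into more regions, which is why the depth controls how complex the classifier can get.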
If we build multiple decision trees and take a majority vote on their outputs, we have a random forest. So the term “forest” comes from the collection of trees. The word “random” comes from the way the trees are trained: each tree is fed its own randomly drawn subset of the training data.
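The two ingredients, random sampling and majority voting, can be sketched in plain C++ as follows. The sampling shown here is bootstrap sampling (drawing with replacement), which is a common choice for random forests; whether OpenCV samples in exactly this way is not claimed here. The trees are represented as generic callables, and all names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <random>
#include <vector>

// Any callable that maps a feature vector to a class label can act as a tree.
using Classifier = std::function<int(const std::vector<double> &)>;

// Draws sampleCount indices with replacement: the random subset for one tree.
std::vector<int> bootstrapSample(int sampleCount, std::mt19937 &rng)
{
    assert(sampleCount > 0);
    std::uniform_int_distribution<int> pick(0, sampleCount - 1);
    std::vector<int> indices(sampleCount);
    for (int &i : indices) i = pick(rng);
    return indices;
}

// Asks every tree for its answer and returns the most frequent label
// (ties go to the smallest label).
int majorityVote(const std::vector<Classifier> &trees, const std::vector<double> &x)
{
    assert(!trees.empty());
    std::map<int, int> votes;
    for (const Classifier &tree : trees) ++votes[tree(x)];

    int best = votes.begin()->first, bestCount = 0;
    for (const auto &v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}
```

Because every tree sees a slightly different dataset, the trees make different mistakes, and the vote tends to cancel them out.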
Support Vector Machines are very powerful classifiers. The basic idea is to draw a line that separates the features of two classes. When the features have three or more dimensions, the line becomes a plane or, more generally, a hyperplane. So, SVM is originally a binary classifier (but it can be easily extended to multi-class use cases). The real power of SVM comes from the kernel trick. When the features are not linearly separable, a kernel (e.g. RBF) implicitly maps them to a higher-dimensional space where a hyperplane can separate them more easily.
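To see the kernel trick at work, here is a minimal sketch in plain C++ of the RBF kernel and the SVM decision function. It only covers prediction: the support vectors and their coefficients are assumed to be given (in the usage example they are picked by hand, not trained), and the names are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RBF kernel: exp(-gamma * ||a - b||^2). It is 1 for identical points and
// decays towards 0 as the points move apart.
double rbfKernel(const std::vector<double> &a, const std::vector<double> &b,
                 double gamma)
{
    assert(a.size() == b.size());
    double sq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sq += (a[i] - b[i]) * (a[i] - b[i]);
    return std::exp(-gamma * sq);
}

struct SupportVector {
    std::vector<double> x;
    double alphaY;  // coefficient times class label (+1 or -1)
};

// Signed score: positive on one side of the decision boundary, negative
// on the other.
double decision(const std::vector<SupportVector> &svs, double bias, double gamma,
                const std::vector<double> &x)
{
    double score = bias;
    for (const SupportVector &sv : svs) score += sv.alphaY * rbfKernel(sv.x, x, gamma);
    return score;
}

int classify(const std::vector<SupportVector> &svs, double bias, double gamma,
             const std::vector<double> &x)
{
    return decision(svs, bias, gamma, x) >= 0.0 ? 1 : -1;
}
```

A classic example is XOR, which no straight line can separate. With the four corner points as support vectors, alphaY set to -1 for (0,0) and (1,1) and to +1 for (0,1) and (1,0), a bias of 0, and gamma = 1, this decision function labels all four corners correctly.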
Multi-Layer Perceptron (MLP) is a widely used form of neural network that loosely mimics the functionality of our brain’s neurons. Each neuron gets a number of inputs, multiplies them by weights, adds a bias term, and uses an activation function to produce an output. During the training phase, the algorithm searches for the weights that best map the inputs to the outputs.
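The computation of a single neuron, and of a whole layer of neurons, can be sketched in plain C++ like this. The activation used here is the plain logistic sigmoid, which is an assumption for illustration; OpenCV’s own sigmoid variant may be defined differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// One neuron: weighted sum of the inputs, plus a bias, through an activation.
double neuronOutput(const std::vector<double> &inputs,
                    const std::vector<double> &weights, double bias)
{
    assert(inputs.size() == weights.size());
    double z = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i) z += weights[i] * inputs[i];
    return sigmoid(z);
}

// One layer: every neuron sees the same inputs but has its own weights and
// bias. weights[n] holds the weights of neuron n.
std::vector<double> layerOutput(const std::vector<double> &inputs,
                                const std::vector<std::vector<double>> &weights,
                                const std::vector<double> &biases)
{
    assert(weights.size() == biases.size());
    std::vector<double> out;
    for (std::size_t n = 0; n < weights.size(); ++n)
        out.push_back(neuronOutput(inputs, weights[n], biases[n]));
    return out;
}
```

Stacking layers means feeding the output of one layerOutput() call as the inputs of the next one.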
OpenCV provides Identity, Sigmoid, Gaussian, ReLU, and Leaky ReLU activation functions. We use the classic backpropagation algorithm for training the network, but you can choose the RPROP or simulated annealing methods instead. Note that, unlike the other classifiers, which accept labels as a single-column matrix, MLP requires the labels to be “one-hot encoded”. In other words, instead of [0;2;1], we should feed it with [1 0 0; 0 0 1; 0 1 0].
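One-hot encoding is easy to write by hand. Here is a minimal sketch in plain C++ using nested vectors instead of cv::Mat; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Turns each label into a row of zeros with a single 1 at the label's index.
std::vector<std::vector<float>> oneHot(const std::vector<int> &labels, int numClasses)
{
    assert(numClasses > 0);
    std::vector<std::vector<float>> encoded(labels.size(),
                                            std::vector<float>(numClasses, 0.0f));
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] >= 0 && labels[i] < numClasses);
        encoded[i][labels[i]] = 1.0f;
    }
    return encoded;
}
```

For the labels {0, 2, 1} and 3 classes, this produces the rows {1, 0, 0}, {0, 0, 1}, and {0, 1, 0}. To go back from the network’s output to a label, take the index of the largest value in the output row.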
The following table compares the training and testing times, as well as the error rate, of the mentioned models. Note that:
We only studied the task of handwritten digit classification with HOG features. It is highly probable that these classifiers would behave differently on other datasets with different features.
I didn’t tweak the parameters of each model. With some tuning, the accuracies may change.
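For reference, the two quantities in such a comparison can be measured with a few lines of plain C++. This is a minimal sketch with made-up helper names, and it is not necessarily how the numbers in this post were produced.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Fraction of samples whose predicted label differs from the true label.
double errorRate(const std::vector<int> &predicted, const std::vector<int> &actual)
{
    assert(!actual.empty() && predicted.size() == actual.size());
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < actual.size(); ++i)
        if (predicted[i] != actual[i]) ++wrong;
    return static_cast<double>(wrong) / actual.size();
}

// Runs the given callable once and returns the elapsed wall-clock time in
// milliseconds. Wrap the training or the testing call in a lambda.
template <typename F>
double elapsedMs(F &&f)
{
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

For example, if one prediction out of four is wrong, errorRate() returns 0.25.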
Let’s do it with a CNN
All the mentioned algorithms require features to work. If we feed them better and more discriminative features, we will get higher classification accuracy. But designing good features (also known as feature engineering) is not easy. In the next post, we will see how a simple Convolutional Neural Network can achieve superb accuracy without the need for handcrafted features. Stay tuned.