
OpenCV 3.x K Nearest Neighbour (KNN) Regression

OpenCV k nearest neighbors tutorial

K-Nearest Neighbour (or K-NN for short) is a non-parametric algorithm used in pattern recognition for classification and regression. The idea is simple: for an incoming sample (the test data), find the closest samples in the training data and use them to decide its class or value.

In the following we first give a short description of nearest-neighbour search (knnsearch/knnmatch), and then an OpenCV K-NN example at the end.

OpenCV K Nearest Neighbour (KNN) Regression - Sample

Assume that in your city there are families with blue and red houses, as in the image above. A newcomer builds a house in one region of the city (shown in green). How can we decide whether it belongs to the blue family or the red family? One idea is to look for the single closest neighbour (here the red triangle) and say that the new house belongs to the red family. This is 1-NN: in other words, we compute the distance to each house, find that the closest one is red, and conclude that the newcomer belongs to the red family. But what if we look at the 7 nearest neighbours? In the image above, choosing the 7 nearest neighbours (7-NN) gives 5 blue and 2 red houses, and we now conclude that the new house belongs to the blue family.
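As a quick illustration, here is a minimal C++ sketch of this vote with OpenCV's cv::ml::KNearest on hypothetical toy data (the coordinates and labels are made up for the example): with k = 1 the newcomer lands in the red family, while with k = 7 the majority of neighbours is blue.

#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

using namespace cv;
using namespace cv::ml;

int main()
{
    // Hypothetical toy data matching the story: 2 red houses (label 1) near
    // the newcomer and 5 blue houses (label 0) a bit farther away.
    float pts[7][2] = { {2, 2}, {3, 2},                            // red
                        {5, 5}, {6, 5}, {5, 6}, {6, 6}, {7, 5} };  // blue
    float lbls[7]   = { 1, 1, 0, 0, 0, 0, 0 };
    Mat features(7, 2, CV_32F, pts);
    Mat labels(7, 1, CV_32F, lbls);

    Ptr<KNearest> knn = KNearest::create();
    knn->setIsClassifier(true);
    knn->train(TrainData::create(features, ROW_SAMPLE, labels));

    Mat newcomer = (Mat_<float>(1, 2) << 3.0f, 3.0f);  // the green house
    Mat res1, res7;
    knn->findNearest(newcomer, 1, res1);  // 1-NN: closest house is red -> 1
    knn->findNearest(newcomer, 7, res7);  // 7-NN: 5 blue vs 2 red      -> 0
    return 0;
}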

Modified KNN

What if we choose 4-NN? Now we have two houses of each colour, but it seems more reasonable to assign the new house to the red family because it is closer to them. This kind of decision amounts to giving each family a weight that depends on its distance to the newcomer: neighbours that are near get higher weights, while those farther away get lower weights. We then sum the total weight of each family separately, and whichever family gets the highest weight is the final category of the newcomer. This is called modified KNN (a sketch of this weighted vote is given below).
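A minimal sketch of such a distance-weighted vote, built on top of the optional neighborResponses and dist outputs of findNearest (reusing the hypothetical knn model and newcomer point from the sketch above):

Mat results, neighborResponses, dists;
knn->findNearest(newcomer, 4, results, neighborResponses, dists);

float weightRed = 0.f, weightBlue = 0.f;
for (int j = 0; j < neighborResponses.cols; j++) {
    // closer neighbours get larger weights
    float w = 1.0f / (dists.at<float>(0, j) + 1e-6f);
    if (neighborResponses.at<float>(0, j) == 1.0f) weightRed  += w;
    else                                           weightBlue += w;
}
float weightedVote = (weightRed > weightBlue) ? 1.0f : 0.0f;  // 1 = red, 0 = blue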

Choosing the Optimal K

Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K reduces the influence of noise, but the trade-off is that the distinct boundaries within the feature space become blurred. Cross-validation is another way to determine a good K retrospectively, by using an independent data set to validate candidate K values. For many datasets a K of 10 or more gives noticeably better results than 1-NN, but the best value remains data-dependent.
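A rough sketch of such a validation loop, assuming hypothetical matTrainFeatures/matTrainLabels and held-out matValFeatures/matValLabels matrices prepared like the ones in the sample code further below:

Ptr<KNearest> knn = KNearest::create();
knn->setIsClassifier(true);
knn->train(TrainData::create(matTrainFeatures, ROW_SAMPLE, matTrainLabels));

int bestK = 1;
double bestAccuracy = 0.0;
for (int k = 1; k <= 20; k++) {
    Mat predictions;
    knn->findNearest(matValFeatures, k, predictions);
    int correct = 0;
    for (int r = 0; r < predictions.rows; r++)
        if (predictions.at<float>(r, 0) == matValLabels.at<float>(r, 0))
            correct++;
    double accuracy = (double)correct / predictions.rows;
    if (accuracy > bestAccuracy) { bestAccuracy = accuracy; bestK = k; }
}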

 

KNN Features

  • The distance from the new sample to every previously collected training sample has to be computed, so the method needs a lot of memory and computation time.
  • There is almost no training or preparation time; virtually all of the computation is deferred to the test (classification) phase.
  • It approximates the target function locally, so it is sensitive to the local structure of the data.

Standardized Distance

One major drawback of computing distance measures directly from the training set arises when the variables have different measurement scales, or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income in dollars and another is age in years, income will have a much larger influence on the computed distance. One solution is to standardize the training set by dividing each feature by its highest value, or by some target value (i.e. normalizing the features).
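A small sketch of this scaling for a CV_32F feature matrix, here simply dividing each column by its maximum value ('features' is a hypothetical matrix):

for (int j = 0; j < features.cols; j++) {
    double minVal, maxVal;
    minMaxLoc(features.col(j), &minVal, &maxVal);   // largest value in this feature
    if (maxVal != 0.0)
        features.col(j) = features.col(j) / maxVal; // scale the column in place
}

The same scale factors must of course be applied to the test samples before calling findNearest, otherwise the distances would be computed in mismatched units.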

K Nearest Neighbour Regression

When KNN is used for regression, the predicted result is the mean of the property values (labels) of the chosen neighbours. In classification, by contrast, the class is determined by voting.
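This can be checked directly with the optional outputs of findNearest; the sketch below assumes a hypothetical trained regressor knn (setIsClassifier(false)) and a single 1 x N CV_32F query row:

Mat results, neighborResponses, dists;
knn->findNearest(query, 7, results, neighborResponses, dists);

float predicted        = results.at<float>(0, 0);            // model prediction
float meanOfNeighbours = (float)mean(neighborResponses)[0];  // average of the 7 responses
// predicted and meanOfNeighbours should agree (up to floating-point rounding)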

C++ OpenCV KNN

Function Header

C++ opencv knn: float cv::ml::KNearest::findNearest(InputArray samples, int k, OutputArray results, OutputArray neighborResponses = noArray(), OutputArray dist = noArray()) const

Parameters:

  • samples: Input samples stored by rows. It is a single-precision floating-point matrix of number_of_samples x number_of_features size.
  • k: Number of nearest neighbours to use.
  • results: Vector with the prediction results (regression or classification) for each input sample. It is a single-precision floating-point vector with number_of_samples elements.
  • neighborResponses: Optional output values of the corresponding neighbours. It is a single-precision floating-point matrix of number_of_samples x k size; the neighbours are sorted by their distance to the input vector.
  • dist: Optional output distances from the input vectors to the corresponding neighbours. It is a single-precision floating-point matrix of number_of_samples x k size.

Some Notes

  • The function is parallelized with the TBB library.
  • With the C++ interface you can pass empty output matrices and the function will allocate the memory itself.
  • If only a single input vector is passed, all output matrices are optional and the predicted value is returned by the method (see the short sketch below).
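For example, a minimal single-sample call might look like this (kclassifier and oneSample are hypothetical, prepared as in the sample code below):

Mat singleResult;
float prediction = kclassifier->findNearest(oneSample, kclassifier->getDefaultK(), singleResult);
// 'prediction' holds the predicted value for the single sample; the same
// value is also written to singleResult.at<float>(0, 0).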

Sample code

The following example shows how to train an OpenCV KNN model (opencv knn train) and run a nearest-neighbour search (knnsearch) with it in regression mode.

#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

using namespace cv;
using namespace cv::ml;
using namespace std;

int main()
{
    const int number_of_train_elements = 1000;
    const int number_of_sample_elements = 10;
    const int vectorSize = 5;

    // Raw data as it might come from your application; here just
    // zero-initialised placeholders so the example compiles.
    float trainData[number_of_train_elements][vectorSize] = {};
    float trainLabels[number_of_train_elements] = {};
    float sampleData[number_of_sample_elements][vectorSize] = {};
    float sampleLabel[number_of_sample_elements] = {};

    Mat matTrainFeatures(number_of_train_elements, vectorSize, CV_32F);
    Mat matTrainLabels(number_of_train_elements, 1, CV_32F);
    Mat matSample(number_of_sample_elements, vectorSize, CV_32F);
    Mat matSampleLabels(number_of_sample_elements, 1, CV_32F);
    Mat matResults;

    // Copy the raw training data and labels into the OpenCV matrices
    for (int trainIndex = 0; trainIndex < number_of_train_elements; trainIndex++) {
        for (int feature = 0; feature < vectorSize; feature++)
            matTrainFeatures.at<float>(trainIndex, feature) = trainData[trainIndex][feature];
        matTrainLabels.at<float>(trainIndex, 0) = trainLabels[trainIndex];
    }

    // Copy the test samples and their labels into the OpenCV matrices
    for (int sampleIndex = 0; sampleIndex < number_of_sample_elements; sampleIndex++) {
        for (int feature = 0; feature < vectorSize; feature++)
            matSample.at<float>(sampleIndex, feature) = sampleData[sampleIndex][feature];
        matSampleLabels.at<float>(sampleIndex, 0) = sampleLabel[sampleIndex];
    }

    Ptr<TrainData> trainingData =
        TrainData::create(matTrainFeatures, ROW_SAMPLE, matTrainLabels);
    Ptr<KNearest> kclassifier = KNearest::create();

    kclassifier->setIsClassifier(false);                  // regression, not classification
    kclassifier->setAlgorithmType(KNearest::BRUTE_FORCE);
    kclassifier->setDefaultK(10);

    kclassifier->train(trainingData);
    kclassifier->findNearest(matSample, kclassifier->getDefaultK(), matResults);

    // Just checking the settings
    cout << "Training data: " << endl << "getNSamples\t"
         << trainingData->getNSamples() << endl << "getSamples\n"
         << trainingData->getSamples() << endl << "getResponses\n"
         << trainingData->getResponses() << endl << endl;   // regression targets

    cout << "Classifier :" << endl << "kclassifier->getDefaultK(): "
         << kclassifier->getDefaultK() << endl
         << "kclassifier->getIsClassifier()  : "
         << kclassifier->getIsClassifier() << endl
         << "kclassifier->getAlgorithmType(): " << kclassifier->getAlgorithmType()
         << endl << endl;

    // Confirming sample order
    cout << "matSample: " << endl << matSample << endl << endl;

    // Displaying the results (one predicted value per test sample)
    cout << "matResults: " << endl << matResults << endl << endl;

    return 0;
}
