Nearest Neighbor Classification

Procedures

  1. Assemble a training set, e.g., the MNIST data
  2. How do we classify a new image x?
    • find its closest neighbor y in the training set and give x the same label as y

Notes:

  • training set of 60000 images
  • test set of 10000 images

How do we determine how close two images are?

A 28 x 28 image can be stretched into a vector of 784 pixel values.

We then use the Euclidean distance formula to measure the distance between them:

$$||x - z|| = \sqrt{\sum_{i=1}^{784}(x_i - z_i)^2}$$
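A minimal sketch of the resulting 1-NN rule, assuming the training images are already flattened into a NumPy array train_X of shape (60000, 784) with labels train_y (these variable names are illustrative, not from the notes):

```python
import numpy as np

def nn_classify(x, train_X, train_y):
    """Label a query image x (a length-784 vector) with the label of its
    nearest training image under Euclidean distance."""
    # Squared distances to every training image; taking the square root is
    # unnecessary because it does not change which neighbor is closest.
    dists = np.sum((train_X - x) ** 2, axis=1)
    return train_y[np.argmin(dists)]
```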

Using only the single closest neighbor (1-NN), the test error is 3.09%.

Ideas for Improvement

  1. Choose a different number of neighbors (kNN instead of 1-NN)
  2. Better distance functions

With different k:

| k            | 1    | 3    | 5    | 7    | 9    | 11   |
|--------------|------|------|------|------|------|------|
| Test error % | 3.09 | 2.94 | 3.13 | 3.10 | 3.43 | 3.34 |
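A quick sketch of the kNN variant measured in the table above, with the same assumed train_X / train_y arrays; ties in the vote are broken toward the smaller digit, which is one simple choice among several:

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=3):
    """Label x by a majority vote among its k nearest training images."""
    dists = np.sum((train_X - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest images
    votes = train_y[nearest].astype(int)         # their labels (digits 0-9)
    return np.argmax(np.bincount(votes))         # most frequent label wins
```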

With different distance functions

There is much active research on distance functions that accommodate translations, rotations, and deformations.

For example, the 1 on the right is generated by shifting the 1 on the left by a few pixels.

| distance function | L2 (Euclidean) | tangent distance | shape context |
|-------------------|----------------|------------------|---------------|
| test error %      | 3.09           | 1.10             | 0.63          |

Problem

The curse of dimensionality

Although kNN gives a very good classifier, the naive search is slow: classifying a single query takes time proportional to the number of training points n and the dimension d, i.e., O(nd).

Ideas for addressing the issue:

  • k-d tree: a space-partitioning data structure that can prune large parts of the training set during a nearest-neighbor query
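A sketch of the k-d tree idea using SciPy's cKDTree (a real SciPy class; the surrounding variable names are again illustrative). Note that with d = 784 the pruning is often weak, which is exactly the curse of dimensionality described above:

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the (assumed) flattened training set.
tree = cKDTree(train_X)

def kdtree_classify(x, train_y, k=1):
    """Query the prebuilt tree for the k nearest neighbors of x and vote."""
    dists, idx = tree.query(x, k=k)                   # distances and indices
    labels = np.atleast_1d(train_y[idx]).astype(int)  # neighbor labels
    return np.argmax(np.bincount(labels))             # majority label
```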

Credit to CSE 250 Class notes.