K-Nearest Neighbor Algorithm (KNN)
Theory
K-Nearest Neighbor is one of the most basic and essential classification algorithms in Machine Learning. It is a type of supervised learning and has applications in pattern recognition, data mining, etc. It is widely applicable in real-life scenarios since it is non-parametric, meaning it makes no underlying assumptions about the distribution of the data. We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
Algorithm
Implement the KNN algorithm using the following steps (a minimal code sketch follows the list):
- Load data set
- Initialize value of k
- For getting the predicted class, iterate from 1 to the total number of training data points:
- Calculate the distance between the test data and each row of the training data using the Euclidean distance.
- Sort the calculated distances in ascending order based on distance values.
- Get the top k rows from the sorted array.
- Get the most frequent class of these rows.
- Return the predicted class.
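To make these steps concrete, here is a minimal from-scratch sketch in Python. The dataset, labels, and function names (`euclidean_distance`, `knn_predict`) are illustrative assumptions, not from this article:

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, test_point, k=3):
    # Compute the distance from the test point to every training row.
    distances = [(euclidean_distance(row, test_point), label)
                 for row, label in zip(train_X, train_y)]
    # Sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Take the top k rows and return the most frequent class among them.
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Tiny hand-made dataset: two clusters in 2D (illustrative only).
train_X = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))  # prints "A"
```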
Euclidean Distance
It is the distance between two points. These points can lie in spaces of different dimensions and are represented by different forms of coordinates. In 1D space, the points lie on a straight line. In 2D, the coordinates are given as points on the x and y axes, and in 3D the x, y and z axes are used. Finding the Euclidean distance between two points depends on the dimensional space in which they are found. In 2D, the Euclidean distance between points (x1, y1) and (x2, y2) is:

d = sqrt((x2 − x1)^2 + (y2 − y1)^2)
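As a quick check of the formula, here is a minimal worked example in Python (the two points are illustrative):

```python
import math

# Distance between (1, 2) and (4, 6):
# sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
x1, y1 = 1, 2
x2, y2 = 4, 6
d = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
print(d)  # 5.0
```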
Pros and Cons of KNN
Below are the pros of choosing the KNN algorithm:
- KNN is very simple and easy to understand and use.
- KNN is a non-parametric algorithm, which means it doesn't make any assumptions about the underlying data.
- It doesn't have any training step.
- It constantly evolves as new data arrives, allowing the algorithm to respond quickly to changes in the input.
- KNN is easily implementable for multi-class problems.
- It can be used for both Classification and Regression.
- One hyperparameter: KNN might take some time while selecting the first hyperparameter (k), but after that the rest of the parameters are aligned to it.
- Variety of distance criteria to choose from: the KNN algorithm gives the user the flexibility to choose a distance metric while building the model (a small sketch comparing these metrics follows the list):
- Euclidean Distance
- Hamming Distance
- Manhattan Distance
- Minkowski Distance
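Here is a small hand-rolled sketch of these four metrics, with illustrative points. Minkowski distance generalizes both Euclidean (p = 2) and Manhattan (p = 1), while Hamming distance counts the positions where two vectors differ:

```python
def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # General form: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

p1, p2 = (1, 2, 3), (4, 6, 3)
print(euclidean(p1, p2))     # 5.0
print(manhattan(p1, p2))     # 7
print(minkowski(p1, p2, 2))  # 5.0 (same as Euclidean)
print(hamming(p1, p2))       # 2 (first two coordinates differ)
```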
Below are the cons of choosing the KNN algorithm:
- KNN is a slow algorithm: every prediction requires a scan of the full training set.
- It suffers from the curse of dimensionality: distances become less informative as the number of features grows.
- It needs homogeneous features: features on very different scales should be normalized, or the largest-scale feature will dominate the distance.
- The optimal number of neighbors (k) needs to be considered when classifying a new data entry (see the cross-validation sketch after this list).
- Unbalanced data causes problems, since the majority class tends to dominate the neighborhood vote.
- It is very sensitive to outliers, as it simply chooses the neighbors based on distance criteria.
- KNN has no capability of dealing with missing values.
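On the "optimal number of neighbors" point above, a common remedy is to pick k by cross-validation. Below is a hedged sketch using scikit-learn (assuming it is installed), with its bundled Iris dataset used purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this candidate k.
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```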
In the next part, I'll explain the coding part of the KNN algorithm.