"In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method [...] k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation" 1
This project is an implementation of the kNN algorithm, to estimate the type of Iris flower based on four properties:
- Sepal length
- Sepal width
- Petal length
- Petal width
This project relies on CMake to generate the executable file. If you don't have CMake yet, you can download it here, or alternatively, in a Debian-based system, run
$ sudo apt install cmake
Then, once cmake is installed, clone this project's repository with:
$ git clone 'https://github.com/bananabajanana/KNNClassifier.git'
You can run the code with our default input files, or alternatively put in your own.
The server has it's training data in TrainingData.csv
.
In addition the client's input and output files can be given as arguments during compilation time, but it is recommended to put them as Unclassified.csv
and Classified.csv
folders as shown below.
KNNClassifier
│ ...
│
└───Client
│ │ ...
│ │
│ └───Data
│ │ │ Unclassified.csv
│ │
│ └───Output
│ │ Classified.csv
│
└───Server
│ │ ...
│ │
│ └───Data
│ │ │ TrainingData.csv
You can run the project using our provided CMakeLists.txt
file:
$ mkdir -p build && cd build
$ cmake ..
$ make -j && make Server && make Client
$ ./Server
After this, the compilation and linkage of the codes is complete and the Server is running. All that is left is to run the Client program from a different terminal, as following:
$ cd {path}/build
$ ./Client {IPath} {OPath}
Replacing {path} with the folder path to the project's repository on your personal computer.
In addition, {IPath} should be replaced with the path to the input file to be classified, and {OPath} with the path to output the data to. It is also possible to keep not specify {IPath} and {OPath} arguments, which will cause the program to work with the default arguments specified above.
We first created a Flower object, that will store a single flower's type and parameters. The flower's type is stored as an enum, with four options: the three possible types, and an undefined option.
enum typeIris { versicolor, virginica, setosa, undifined };
class Flower {
private:
typeIris type;
const NPoint character;
...
};
The NPoint is a representation of the Flower's parameters, and functions as a point in an n-dimensional coordinate system (in 4d with the current implementation example, but can easily be expanded).
Our code implements the algorithm with three different possible distance functions: Euclidean distance, Manhattan distance, and Chebyshev distance, but to allow the addition of other distance functions, we implemented generic code with an abstract distance class.
class DistanceCalc {
public:
virtual double distance(NPoint p1, NPoint p2) = 0;
...
};
In addition, to keep track of all our different types of distances implemented, we made a DistancesData builder class.
class DistancesData {
/**
* @return all the types of distance calculators that can be used.
*/
static std::vector<DistanceCalc*>& getAllTypes();
};
Finally, we created a Classifier class. This class identifies a given vector of flowers based on an input list of already identified Irises. This is implemented step-by-step according to the kNN algorithm, first finding the k closest neighbors to the unidentifiable flower, and then finding which category is most common among them.
With these implemented, the server runs indefinitely in the following loop:
graph TD;
Listen-->Accept;
Accept-->DataFromUser;
DataFromUser-->OutputClass;
OutputClass-->DataFromUser;
OutputClass-->UserDisconnects;
UserDisconnects-->Accept;
Listen
: The server waits for a user to connect.Accept
: The accept stage in the connection process.DataFromUser
: The server receives information about an unclassified flower from the user.OutputClass
: The server sends to the user the classification of the flower.UserDisconnects
: The current user disconnects, allowing the server to interact with a new user.
All the transitions are managed by a middle step we'll call Select
. The Select
is in charge of managing the following things:
- Gracefully closing the server in case of internal errors.
- Managing timeouts: if there is a non-active client - kick it, and in any case keep listening for further actions.
- If there are no connected clients, a client must be connecting, and we'll accept it.
- If there is an active client, we'll communicate with it through the server protocol.
- Port = 6969 We chose this port number since it is not a commonly used port and is not part of the super-user port range (0-1024). 2
- Timeout = 5sec We gave clients a generous amount of time before kicking them out.
- Buffer Size = 128 Since the user can only send one flower info at a time, we limited the user's message size to 128 bytes.
The client is a sort of simple demo for interaction with the server. It takes an input file from the user, and sends the flower information - one by one - to the server, while outputting the given classifications to an output file.
graph TD;
ReadLine-.No lines left.->Exit;
ReadLine-->SendInfo;
SendInfo-->WriteLine;
WriteLine-->ReadLine;
ReadLine
: The client program reads a line from the input file.SendInfo
: The client sends the flower info previously read to the server.WriteLine
: The client writes the given class to the output file.Exit
: If there are no lines left, the client closes its connection with the server.
For a better understanding of the algorithms, c++ language, and multiple dependencies, we used the following sites: