This project uses the malware dataset provided by Microsoft for a Kaggle competition to classify malware files into one of nine classes. It tries various combinations of vectorization schemes and machine-learning models and selects the combination that yields the best results.
As part of this project/assignment, I first download the dataset from Kaggle using the Kaggle API. The dataset needs almost 150 GiB of space, which is more than a Colab notebook environment can hold, so I mount my Google Drive in the Colab notebook and unzip the archive provided by Kaggle directly into Drive.
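The extraction step could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual code: the helper name `extract_zip`, the paths, and the competition slug in the comments are all assumptions. The Colab-specific mount and download commands are shown only as comments, because they run only inside a Colab session.

```python
import zipfile
from pathlib import Path

# In Colab, the Drive mount and the download would come first, for example:
#   from google.colab import drive
#   drive.mount("/content/drive")
#   !kaggle competitions download -c <competition-slug> -p /content
# The competition slug and all paths here are placeholders (assumptions).


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir, one member at a time.

    Returns the list of extracted member names in archive order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # extract() copies each member to disk in chunks, so the whole
            # archive is never held in memory at once.
            archive.extract(member, path=dest)
            extracted.append(member.filename)
    return extracted
```

With Drive mounted, a call such as `extract_zip("/content/archive.zip", "/content/drive/MyDrive/malware")` (both paths hypothetical) would write the files straight into Drive instead of the notebook's local disk.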
Since the dataset contains nine malware classes, I first check how the data points are distributed across these classes using seaborn's countplot.
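A class-distribution check of this kind might be sketched as follows. The helper names and the example labels are hypothetical and stand in for the real label column of the dataset. The plotting libraries are imported inside the plotting function so that the counting helper can be used on its own.

```python
from collections import Counter


def class_counts(labels):
    """Return {class_label: number_of_samples}, ordered by class label."""
    counts = Counter(labels)
    return dict(sorted(counts.items()))


def plot_class_distribution(labels):
    """Draw one bar per class with seaborn's countplot and return the Axes."""
    # Imported here so class_counts() works without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    labels = list(labels)
    ax = sns.countplot(x=labels, order=sorted(set(labels)))
    ax.set_xlabel("Malware class")
    ax.set_ylabel("Number of files")
    plt.tight_layout()
    return ax


# Hypothetical labels for illustration only; the real ones come from the dataset.
example_labels = [1, 2, 2, 3, 3, 3, 9]
print(class_counts(example_labels))
```

In a notebook, calling `plot_class_distribution(example_labels)` renders the bar chart inline. A strongly uneven chart would suggest using stratified splits or class-aware metrics later on.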