
linear-ranking-of-webpages

A linear ranking approach to LEarning-TO-Rank (LETOR) using Linear Regression, K-means clustering, and Gaussian RBFs

This project was implemented as part of an assignment for the course CSE574: Introduction to Machine Learning at the University at Buffalo, The State University of New York, in Fall 2016. The goal of the project was to develop a Linear Regression system to perform fine-grained ranking of webpages.

Dataset

The developed system was trained on the LETOR 4.0 dataset. The 'Querylevelnorm.txt' file from MQ2007 was used for this project. The dataset consists of 69623 samples with 46 features each and relevance rankings of 0, 1 or 2. The dataset is divided into three sets: 76% of the samples for training, 12% for validation and the remaining 12% for testing.
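A minimal sketch of loading and splitting the data, assuming the file follows the SVMlight-style LETOR format that scikit-learn can parse; the split fractions come from the text, while the file path and the use of scikit-learn are assumptions rather than the project's actual code:

```python
# Hypothetical loading/splitting sketch; assumes Querylevelnorm.txt is in the
# SVMlight-style LETOR format (label qid:<id> 1:v1 ... 46:v46 # docid ...).
import numpy as np
from sklearn.datasets import load_svmlight_file

X, y, qid = load_svmlight_file("Querylevelnorm.txt", query_id=True)
X = X.toarray()                                  # 69623 x 46 dense feature matrix

# 76% / 12% / 12% split into training, validation and test sets.
n = X.shape[0]
n_train, n_val = int(0.76 * n), int(0.12 * n)
X_train, y_train = X[:n_train], y[:n_train]
X_val,   y_val   = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test,  y_test  = X[n_train + n_val:], y[n_train + n_val:]
```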

Implementation

Data Pre-processing

The implementation of the project involved performing dimensionality reduction and then training the system to learn to rank. Dimensionality reduction is performed by clustering the training samples with K-means and then applying Gaussian Radial Basis Functions centered on the identified centroids, producing a lower-dimensional design matrix.
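A minimal sketch of this step, building on the loading sketch above; the number of basis functions M and the RBF widths are illustrative assumptions, not the values used in the project:

```python
# K-means over the training samples, then Gaussian RBFs centered on the
# centroids; M and the per-dimension widths sigma2 are illustrative choices.
from sklearn.cluster import KMeans

M = 10                                           # assumed number of basis functions
kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X_train)
centroids = kmeans.cluster_centers_              # M x 46
sigma2 = X_train.var(axis=0) + 1e-6              # assumed diagonal RBF widths

def gaussian_rbf(X, centroids, sigma2):
    """Map raw features to an N x (M+1) RBF design matrix (bias column prepended)."""
    diff = X[:, None, :] - centroids[None, :, :]              # N x M x 46
    phi = np.exp(-0.5 * np.sum(diff ** 2 / sigma2, axis=2))   # N x M
    return np.hstack([np.ones((X.shape[0], 1)), phi])

Phi_train = gaussian_rbf(X_train, centroids, sigma2)
Phi_val   = gaussian_rbf(X_val,   centroids, sigma2)
Phi_test  = gaussian_rbf(X_test,  centroids, sigma2)
```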

The training samples are then randomly shuffled so that Stochastic Gradient Descent does not stagnate at a local optimum and converges faster.
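A sketch of the shuffle, applying one random permutation jointly to the training design matrix and its targets:

```python
# Shuffle the training set once so SGD does not see query-ordered samples.
rng = np.random.default_rng(0)
perm = rng.permutation(Phi_train.shape[0])
Phi_train, y_train = Phi_train[perm], y_train[perm]
```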

Linear Regression

Three implementations of Linear Regression were developed:

  • Gradient Descent (GD)

A Linear Regression system using the gradient descent algorithm was developed to learn the rankings of the pages. A scheme to adapt the learning rate was added: the learning rate is penalized by two-thirds if the cost increases at any iteration and rewarded marginally if the cost decreases. This optimization was carried out to speed up training (a sketch of the adaptive scheme follows the list).

  • Stochastic Gradient Descent (SGD)
  • Closed Form Solution
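A minimal sketch of the gradient-descent variant with the adaptive learning rate described above; the two-thirds cut follows the text, while the reward factor, initial rate, iteration count and L2 strength are illustrative assumptions:

```python
# Batch gradient descent on the RBF design matrix with an adaptive learning
# rate: cut the rate by two-thirds when the cost rises, nudge it up when it falls.
def train_gd(Phi, t, lam=0.1, eta=0.1, iters=1000):
    N, D = Phi.shape
    w = np.zeros(D)
    prev_cost = np.inf
    for _ in range(iters):
        err = Phi @ w - t
        cost = 0.5 * (err @ err) / N + 0.5 * lam * (w @ w)   # regularized SSE
        if cost > prev_cost:
            eta /= 3.0        # penalize by two-thirds (keep one-third of the rate)
        else:
            eta *= 1.05       # reward marginally (assumed factor)
        prev_cost = cost
        w -= eta * (Phi.T @ err / N + lam * w)               # gradient step
    return w

w_gd = train_gd(Phi_train, y_train)
```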

Optimization of the hyperparameters was carried out by iteratively searching the hyperparameter space for the best-performing configuration. The hyperparameters tuned include the number of basis functions used for dimensionality reduction, the learning rate and the L2 regularization coefficient.
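A minimal sketch of such a search, assuming a simple grid scored by Erms on the validation set using the closed-form regularized least-squares solution and reusing the gaussian_rbf helper sketched above; the grid values are illustrative, not those found by the project:

```python
# Hypothetical grid over the number of basis functions M and the L2 strength lambda.
def fit_closed_form(Phi, t, lam):
    """Closed-form ridge solution: w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ t)

def erms(Phi, t, w):
    err = Phi @ w - t
    return np.sqrt(err @ err / len(t))

best_score, best_cfg = np.inf, None
for M in (5, 10, 20, 40):                          # candidate basis-function counts
    km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X_train)
    Phi_tr = gaussian_rbf(X_train, km.cluster_centers_, sigma2)
    Phi_va = gaussian_rbf(X_val,   km.cluster_centers_, sigma2)
    for lam in (0.01, 0.1, 1.0):                   # candidate L2 coefficients
        w = fit_closed_form(Phi_tr, y_train, lam)
        score = erms(Phi_va, y_val, w)
        if score < best_score:
            best_score, best_cfg = score, (M, lam)
print("best (M, lambda):", best_cfg, "validation Erms:", best_score)
```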

Results

The results are reported as root-mean-square error (Erms). The Erms observed over the validation and test sets is:

System         Validation Erms   Test Erms
GD             0.5832            0.5681
SGD            0.5743            0.5566
Closed Form    0.5736            0.5565

On analysis of the dataset, 51632 samples (roughly 74%) have an output rank of 0, 14128 samples (roughly 20%) have a rank of 1, and the remaining 3863 samples (roughly 6%) have a rank of 2. This skew in the ground-truth labels is a likely cause of the system's relatively poor performance.
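The distribution can be checked directly from the labels loaded in the earlier sketch:

```python
# Count how many samples carry each relevance label (0, 1, 2).
counts = np.bincount(y.astype(int), minlength=3)
print(dict(enumerate(counts)), np.round(counts / counts.sum(), 3))
```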
