
Spark Search Pipeline

This is an Assessed Coursework for the Big Data postgraduate course at the University of Glasgow during the academic year 2022-2023.

Coursework Specification

Summary

The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data analysis tasks using Apache Spark. You will be required to design and implement a single, reasonably complex Spark application. You will then test this application locally on datasets of various sizes. Finally, you will write a short report describing your design and design decisions and, where appropriate, critiquing that design. You will be evaluated on code functionality (does it produce the expected outcome), code quality (is it well designed and does it follow good software engineering practices) and efficiency (how fast is it and does it use resources efficiently), as well as on your submitted report.

Task Description

You are to develop a batch-based text search and filtering pipeline in Apache Spark. The core goal of this pipeline is to take in a large set of text documents and a set of user-defined queries and then, for each query, rank the documents by relevance while filtering out any overly similar documents from the final ranking. The top 10 documents for each query should be returned as output.

Each document and query should be processed to remove stopwords (words with little discriminative value, e.g. ‘the’) and to apply stemming (which converts each word into its ‘stem’, a shorter form that helps with term mismatch between documents and queries). Documents should be scored using the DPH ranking model.

As a final stage, the ranking of documents for each query should be analysed to remove unneeded redundancy (near-duplicate documents): if any pair of documents have titles whose textual distance (computed using a provided comparison function) is less than 0.5, then only the more relevant of the two (based on the DPH score) should be kept. Note that 10 documents should still be returned for each query, even after redundancy filtering.
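The two scoring steps above can be sketched in plain Python. The `dph_term_score` function below implements one common form of the DPH (Divergence from Randomness) term weight; the exact variant expected by the coursework, and the provided title-comparison function, may differ, so the function names and parameters here are illustrative assumptions rather than the actual coursework API.

```python
import math

def dph_term_score(tf, doc_len, avg_doc_len, num_docs, term_corpus_freq):
    """One common form of the DPH term score.

    tf: term frequency in the document; doc_len: document length in tokens;
    avg_doc_len: average document length in the corpus; num_docs: number of
    documents in the corpus; term_corpus_freq: total occurrences of the term
    across the whole corpus.
    """
    if tf == 0:
        return 0.0
    f = tf / doc_len
    if f >= 1.0:  # degenerate single-term document; avoid log2(0)
        return 0.0
    norm = (1.0 - f) * (1.0 - f) / (tf + 1.0)
    return norm * (
        tf * math.log2((tf * avg_doc_len / doc_len) * (num_docs / term_corpus_freq))
        + 0.5 * math.log2(2.0 * math.pi * tf * (1.0 - f))
    )

def top10_with_redundancy_filter(ranked, distance, threshold=0.5):
    """Greedily keep the highest-scoring documents, skipping any whose title
    is within `threshold` textual distance of an already-kept document, until
    10 survive. `ranked` is a list of (title, score) sorted by score
    descending; `distance` stands in for the provided comparison function.
    """
    kept = []
    for title, score in ranked:
        if all(distance(title, k) >= threshold for k, _ in kept):
            kept.append((title, score))
        if len(kept) == 10:
            break
    return kept
```

Filtering greedily from the top of the ranking is what lets the pipeline still return 10 documents per query: each removed near-duplicate is replaced by the next most relevant survivor.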

Results

Efficiency

Running the pipeline over the full dataset (~5 GB) with a set of ten queries using four local executors took ~69 seconds.

Note

  • The dataset was too large to include in the project, but it can be accessed from here.

Contributors

saadanis

