Aidocs

This is a project that I completed in my final years of Electrical and Computer Engineering education at Haramaya University. It is a web-based document similarity analyzer written in Python that can compute document similarity using two different approaches.

Objectives

General Objective

To create a web application in which an average user can perform document similarity analysis with both semantic and syntactic methods, using NLP and machine learning algorithms, through a simple interface.

Specific Objective

  • demonstrate TF-IDF algorithm for syntactic feature extraction from a text document
  • demonstrate Word2Vec algorithm for semantic feature extraction from a text document
  • develop a web application that can perform text similarity analysis
  • demonstrate the use of similarity algorithms through a plagiarism detector
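The syntactic objective can be illustrated with a small, dependency-free sketch of TF-IDF weighting followed by cosine similarity. This is not the project's actual code: the whitespace tokenizer, the smoothed IDF formula, and the function names `tokenize`, `tfidf_vectors`, and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


corpus = ["the cat sat on the mat", "the cat lay on the mat", "stock prices fell"]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))  # shared vocabulary, so a relatively high score
print(cosine(vecs[0], vecs[2]))  # no shared terms, so 0.0
```

Because TF-IDF only looks at which words appear and how often, two documents that express the same idea with different words score low, which is the gap the semantic (Word2Vec) method is meant to cover.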

Walkthrough and Screenshots

Authentication

Through the authentication setup, first-time users can sign up and users who already have an account can sign in. When signing up for the first time, users are required to fill out a form that asks for their username, first name, last name, email, and password. When a user is signed in with a valid identity, the navbar of the application shows the username of the signed-in user, with a button beside the username for signing out of the system.

Home

The home page is what any user receives when navigating to the root URL of the web application. The server responds with a page that gives users a short introduction to what the web app does and a call-to-action button to get them started. Further down the page, it answers the questions of what document similarity is, how it works, and what it is used for.

Projects

"Projects" is the part of the web app that is presented after users have successfully signed in. This is the section where users can create a project and perform all the analysis they need. The first time users navigate to the projects page after signing up, they see only the "Plagiarism Detection Database" project, which is created by the admin of the site and is accessible from all users' accounts. Apart from that, no projects are listed on that user's projects page.

A button with a plus icon in the header of the page redirects users to create a new project. On the new project page, users are required to give their project a name and add files to the project with a file picker. Users can choose documents in .pdf or .txt format, and both formats are allowed within one corpus. After the information is filled in and the user clicks the "create project" button, the backend processes all the input files and determines their file types automatically. A .txt file is saved to the database directly, while a .pdf file is first converted to a .txt file by iterating through its pages and extracting all the text.

On successful completion of project creation, users are redirected to the projects page again, where they see the list of projects they have created.
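The file handling described above could look roughly like the following sketch. The function name `extract_text` is hypothetical, and the use of the `pypdf` library for page-by-page extraction is an assumption: the README does not say which PDF library the backend uses, and here the file type is inferred from the extension only.

```python
from pathlib import Path


def extract_text(path):
    """Return the plain text of a .txt or .pdf file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # Text files can be stored as they are.
        return path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        # Assumption: pypdf is installed; imported lazily so that
        # .txt-only use does not require it.
        from pypdf import PdfReader

        reader = PdfReader(str(path))
        # Iterate through the pages and collect the text of each one.
        pages = [(page.extract_text() or "") for page in reader.pages]
        return "\n".join(pages)

    raise ValueError(f"unsupported file type: {suffix!r}")
```

Rejecting unknown extensions explicitly keeps a mixed corpus limited to the two formats the project supports.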

When users select a project from the list of projects they have created, they are redirected to that project's page, which allows them to add more files or remove the project. Apart from that, users see a list of the added files and an option to compute document similarity on the corpus.

When the user clicks the compute document similarity button, they are redirected to a page that lists the available ways of computing document similarity. The project offers two main ways: syntactic and semantic. The syntactic way is presented as a TF-IDF cosine similarity method, and the semantic way is presented as a Word2Vec cosine similarity method. Besides these two, users can choose an option called plagiarism detection, which uses both the semantic and the syntactic method and gives the average of the two results.

Once a user selects an algorithm, they are redirected to a page with a dropdown menu, where they select the one file from the corpus that they want to compare against the rest of the corpus. Once the user selects the file and clicks the "compare" button, the backend processes all the data in the corpus based on the selected algorithm and responds with a value indicating how similar each document in the corpus is to the selected document.
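The semantic method and the plagiarism detection option can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the tiny `TOY_EMBEDDINGS` table stands in for trained Word2Vec vectors, averaging word vectors into a document vector is an assumed (though common) choice, and all function names are hypothetical.

```python
import math

# Hypothetical 2-dimensional embeddings; a real system would load trained
# Word2Vec vectors with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "car": [0.90, 0.10],
    "automobile": [0.85, 0.15],
    "banana": [0.10, 0.90],
}


def document_vector(text, embeddings):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(text_a, text_b, embeddings):
    """Cosine similarity of the averaged word vectors of two texts."""
    va = document_vector(text_a, embeddings)
    vb = document_vector(text_b, embeddings)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)


def plagiarism_score(syntactic, semantic):
    """Average of the syntactic and semantic scores, as the text describes."""
    return (syntactic + semantic) / 2


# "car" and "automobile" share no characters as tokens, yet score as close.
print(semantic_similarity("car", "automobile", TOY_EMBEDDINGS))
print(semantic_similarity("car", "banana", TOY_EMBEDDINGS))
```

Averaging the two scores means a document has to be close in both wording and meaning to reach a high plagiarism value, while a paraphrase still registers through the semantic half.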

Contributors

tadiosabebe
