CS3-Lab4--Natural-Language-Processing

Description: Natural Language Processing (NLP) is the subfield of artificial intelligence that deals with designing algorithms, programs, and systems that can understand human language in written and spoken form. A useful first analysis in NLP is the extraction of n-grams, that is, sequences of n consecutive words. N-grams commonly encode concepts and are used by many natural language processing algorithms. Consider, for example, the first paragraph of this assignment. Each individual word constitutes a 1-gram. The 2-grams are “Natural Language”, “Language Processing”, “Processing is”, “is the”, and so on. The 3-grams begin with “Natural Language Processing”.
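The extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the assignment: the function name `ngrams` is hypothetical, words are assumed to be separated by whitespace, and the sample sentence omits the parenthetical "(NLP)", as the assignment's own example does.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Assumption: splitting on whitespace is a good enough tokenizer for a demo.
words = "Natural Language Processing is the subfield".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(bigrams[:4])
print(trigrams[0])
```

A sliding window of width n visits each starting position once, so a text of w words yields w - n + 1 n-grams.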

Problem: Implement a very simple text analysis tool that receives a text file and an integer n and prints all n-grams in the text that appear at least twice. Notice that the number of possible n-grams is huge (up to |V|^n, where V is the vocabulary), so an array of counters is infeasible. Instead, use a hash table that resolves collisions by chaining to keep track of the n-grams that have been read. Either modify the intNode object described in class to contain the n-gram, encoded as a single string, and a counter for the number of occurrences, or use a linked list utility containing objects that consist of a string and a counter.

Since the key used for hashing is a string, you need to convert it to an integer value in a way that results in as few collisions as possible. A simple way is to add the int values of all the characters in the string and then apply the mod operation, but perhaps you can propose a method that results in fewer collisions. Also, we will consider strings in uppercase and lowercase as equivalent, so convert your strings to lowercase before hashing.
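One possible shape for the chained hash table is sketched below in Python. Everything here is an assumption made for illustration: the names `NGramNode`, `NGramTable`, and `count_ngrams`, the table size of 1009, and the multiplier 31 in the hash are not prescribed by the assignment, and `NGramNode` only stands in for the modified intNode from class. The hash is a polynomial one, chosen as one candidate alternative to the plain character sum.

```python
class NGramNode:
    """One chain entry: an n-gram string, its count, and the next node."""

    def __init__(self, ngram, next_node=None):
        self.ngram = ngram
        self.count = 1
        self.next = next_node


class NGramTable:
    """Hash table of n-gram counters; collisions are resolved by chaining."""

    def __init__(self, size=1009):
        self.buckets = [None] * size  # each slot is the head of a chain

    def _hash(self, key):
        # Polynomial hash: unlike a plain sum, it depends on character order.
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, ngram):
        key = ngram.lower()  # uppercase and lowercase count as equivalent
        index = self._hash(key)
        node = self.buckets[index]
        while node is not None:
            if node.ngram == key:
                node.count += 1
                return
            node = node.next
        # Not found in the chain: insert a new node at the head.
        self.buckets[index] = NGramNode(key, self.buckets[index])

    def repeated(self, minimum=2):
        """Return (ngram, count) pairs seen at least `minimum` times."""
        result = []
        for head in self.buckets:
            node = head
            while node is not None:
                if node.count >= minimum:
                    result.append((node.ngram, node.count))
                node = node.next
        return result


def count_ngrams(text, n, size=1009):
    """Build a table holding every n-gram of whitespace-separated words."""
    words = text.split()
    table = NGramTable(size)
    for i in range(len(words) - n + 1):
        table.add(" ".join(words[i:i + n]))
    return table


table = count_ngrams("the cat sat on the mat and The Cat slept", 2)
print(sorted(table.repeated()))
```

In this sample, "the cat" and "The Cat" fold to the same key, so it is the only 2-gram reported. Passing `size=1` forces every n-gram into one chain, which is a quick way to check that chaining still returns correct counts.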

Run experiments with several plain text files and different values of n, and write a report describing your results.
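As one idea for the experiments, the sketch below compares how two hash functions spread a set of keys over the buckets. The function names, the statistics reported, and the table size are all hypothetical choices; a real report would feed in n-grams read from the text files rather than the two hand-picked keys used here.

```python
def additive_hash(key, size):
    """The simple method from the assignment: sum of character codes, mod size."""
    return sum(ord(ch) for ch in key) % size


def polynomial_hash(key, size):
    """An order-sensitive alternative (the multiplier 31 is an assumption)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % size
    return h


def summarize(keys, hash_fn, size):
    """Report how the distinct keys are distributed over `size` buckets."""
    lengths = [0] * size
    for key in set(keys):
        lengths[hash_fn(key, size)] += 1
    return {
        "distinct": sum(lengths),
        "buckets_used": sum(1 for c in lengths if c > 0),
        "longest_chain": max(lengths),
    }


# Two 3-grams made of the same characters always collide under the sum.
keys = ["dog bites man", "man bites dog"]
print(summarize(keys, additive_hash, 1009))
print(summarize(keys, polynomial_hash, 1009))
```

Because addition ignores order, any two n-grams that are rearrangements of the same characters land in the same bucket under the additive hash, whatever the table size. The longest chain and the number of buckets used are two simple figures that could be tabulated per file and per value of n.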
