Giter Club home page Giter Club logo

indexingexercise's Introduction

IndexingExercise

##Entry Points:

  • Run IndexConstructor.java <location_to_save_index>
  • Run QueryServer.java <location_to_read_index>

##Assumptions:

  • Matching is first character inclusive, but middle of string underscore exclusive. For example, __abc__def_ghi would match strings like __abc..,_abc..,abc..,_def..,ghi. Not __def,_ghi.
  • Name length is limited to a max of 65,000 characters, so (max search query) << n.
  • Preconditions on name and score are met, so they are not checked.
  • No deletion from index required.

Design Overview:

A container object was created to hold the name and score pair, called datum. The primary index structure consists of an array of tries, with each trie root (element of the array) corresponding to the first character of a string. The biggest challenge was prefix matching queries with underscores. This was solved by separating the leading underscores from the strings. First, a list of all possible full prefix substrings were generated (for example: abc__def = abc, _def) and added individually to the tries. Before adding, underscores would be counted and stripped from the substrings (ex: _def = def w/ 1 underscore). During insertion, the string and score would be added to each node of the trie. Each node has an unbounded array called underscores, and each element of this array would hold a top 10 ranking of names that pass through that node (ex: _def would be added to underscores[1] since it has 1 underscore, but it would also be added to underscores[0] since 'def' would match the search). To prevent duplicate insertion, the node is updated with the current string number (coinciding with the new size of the index), and the maximum number of underscores previously traversed. If the node is traversed again for the same string, the string number will match. The score will only be updated for the same string if the underscore number is greater than the one previously recorded for the string (and only to the ones which have not already had the element added).

Development Process:

An iterative test driven development process was used. Basic structure of index was determined, and it was made to work with just letters, then leading underscores, then underscores in the middle of a string. (see IndexTester.java)

##Complexity: At the expense of memory, search complexity should be proportional to the length of the search query, k, or O(k), where the underlying assumption is that k << n.

indexingexercise's People

Contributors

mahsu avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.