
searchbetter's Introduction

SearchBetter: query rewriting for search engines on small corpuses

by Neel Mehta, Harvard University

SearchBetter lets you make powerful, fast, and drop-in search engines for any dataset, no matter how small or how large. It also offers built-in query rewriting, which uses NLP to help your search engines find semantically-related content to the user's search term.

For instance, a search for "machine learning" might only return results for items that contain the exact phrase "machine learning". But with query rewriting, you would get results not only for "machine learning" but also for, say, "artificial intelligence" and "neural networks".

SearchBetter lets you power up your search engines with minimal effort. It's especially useful if you have a small dataset to search on, or if you don't have the time or data to make fancy bespoke query rewriting algorithms.

Getting started

To drop this module into your app:

pip install searchbetter

For more advanced analysis and research purposes, use the interactive demo to get yourself set up!

Usage

Try out the interactive demo!

For a truly quick-and-dirty dive into SearchBetter (no setup required), use:

from searchbetter import rewriter

# a rewriter backed by Wikipedia, which works out of the box
query_rewriter = rewriter.WikipediaRewriter()
# returns a list of queries related to 'biochemistry'
query_rewriter.rewrite('biochemistry')
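
Under the hood, the core pattern is simple: rewrite the query into several related queries, run each one through an ordinary search engine, and merge the results. The sketch below illustrates that flow; the engine object and its search() method are illustrative stand-ins, not necessarily the library's actual API.

def search_with_rewriting(engine, query_rewriter, query):
    # rewrite() returns a list of related queries, e.g.
    # 'machine learning' -> ['machine learning', 'artificial intelligence', ...]
    results = []
    seen = set()
    for rewritten in query_rewriter.rewrite(query):
        for hit in engine.search(rewritten):
            if hit not in seen:    # dedupe hits found under multiple queries
                seen.add(hit)
                results.append(hit)
    return results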

Documentation

Documentation is available online at http://searchbetter.readthedocs.io/.

To build the docs yourself using Sphinx:

cd docs
make html
open _build/html/index.html

Where to find data

Some of this data is proprietary to Harvard and HarvardX. Other data, like the Udacity API and the Wikipedia dump, is open to the public.

Name            URL                                            What to name file
Udacity API     https://www.udacity.com/public-api/v0/courses  udacity-api.json
Wikipedia dump  See below                                      wikiclean8
edX courses     Proprietary                                    Master CourseListings - edX.csv
DART data       Proprietary                                    corpus_HarvardX_LatestCourses_based_on_2016-10-18.csv

How to prepare Wikipedia data

Download and unzip the enwik8 dataset from http://www.mattmahoney.net/dc/enwik8.zip. Then run:

perl processing-scripts/wiki-clean.pl enwik8 > wikiclean8

This might take a minute or two to run.

Context

SearchBetter was designed as part of a research project by Neel Mehta, Daniel Seaton, and Dustin Tingley for Harvard's CS 91r, a research-for-credit course.

It was originally designed for Harvard DART, a tool that helps educators reuse HarvardX assets such as videos and exercises in their online or offline courses. SearchBetter is especially useful for MOOCs, which often have small corpuses and have to deal with many uncommon queries (students will search for the most unfamiliar terms, after all). Still, SearchBetter has been made general-purpose enough that it can be used with any corpus or any search engine.

searchbetter's People

Contributors

dseaton, hathix


searchbetter's Issues

Add support for ngrams

I'm not sure if I'm reading the enwik8 wrong or something, but when I try adding the Phraser, the word2vec model messes up spectacularly...
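
For reference, here is a minimal sketch of how bigram support might be wired in with gensim, which appears to be what this issue is attempting. The file name, chunking, and tokenization below are assumptions based on the data-prep steps above, not the repo's actual preprocessing, and the keyword arguments use the gensim 4 API.

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# the wikifil.pl output is one long stream of tokens, so chunk it into
# pseudo-sentences for training (an assumption, not the repo's actual code)
with open('wikiclean8') as f:
    tokens = f.read().split()
sentences = [tokens[i:i + 1000] for i in range(0, len(tokens), 1000)]

# learn frequent bigrams ('machine learning' -> 'machine_learning') ...
phraser = Phraser(Phrases(sentences, min_count=5, threshold=10.0))

# ... then train word2vec on the phrased corpus
model = Word2Vec(phraser[sentences], vector_size=100, window=5, min_count=5)

# works only if the bigram was frequent enough to be joined by the Phraser
print(model.wv.most_similar('machine_learning', topn=5))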

Generalize rewriter

rewriter.py needs more work on generalization. This is a bit harder, as we have prebuilt models and APIs, and we also want to build our own models (word2vec, doc2vec). Let's talk Thursday a bit about how you view this file growing in the next couple of weeks.
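
One plausible shape for that generalization, sketched under assumptions (the class names here are illustrative, not the repo's actual classes): a small abstract base class that API-backed and model-backed rewriters both implement.

from abc import ABC, abstractmethod

class BaseRewriter(ABC):
    """Anything that can turn one query into a list of related queries."""

    @abstractmethod
    def rewrite(self, query):
        """Return a list of queries related to `query`, including `query` itself."""

class Word2VecRewriter(BaseRewriter):
    def __init__(self, model, topn=5):
        self.model = model      # a trained gensim word2vec model
        self.topn = topn

    def rewrite(self, query):
        token = query.replace(' ', '_')
        if token not in self.model.wv:
            return [query]      # fall back to the original query
        related = [w.replace('_', ' ')
                   for w, _ in self.model.wv.most_similar(token, topn=self.topn)]
        return [query] + related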

If no existing model, create a new one

This prevents errors. In that case, override the create flag.

        # TODO have an auto-detect feature that will determine if the
        # index exists, and depending on that creates or loads the index
        # TODO have the `create` option become `force_create`; normally
        #   it'll intelligently auto-generate, but if you force it it'll
        #   do what you say
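
A minimal sketch of the auto-detect behavior those TODOs describe, assuming a Whoosh-backed index; the helper name and index_path parameter are illustrative.

import os
import whoosh.index

def open_or_create_index(index_path, schema, force_create=False):
    # load the index if it's already on disk; (re)build it otherwise,
    # or whenever the caller forces a rebuild
    if not force_create and whoosh.index.exists_in(index_path):
        return whoosh.index.open_dir(index_path)
    os.makedirs(index_path, exist_ok=True)
    return whoosh.index.create_in(index_path, schema)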

Clean the Wikipedia data first

http://www.mattmahoney.net/dc/textdata.html

Appendix A

This Perl program filters Wikipedia text dumps to produce 27-character text (lowercase letters and spaces) as described in this article. To use:

perl wikifil.pl enwik9 > text

Then truncate the text to the desired length (e.g. 10^8 bytes).

You can cut and paste the program below.

#!/usr/bin/perl

# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.

# Written by Matt Mahoney, June 10, 2006. This program is released to the public domain.

$/=">";                         # input record separator
while (<>) {
  if (/<text /) {$text=1;}      # remove all but between <text> ... </text>
  if (/#redirect/i) {$text=0;}  # remove #REDIRECT
  if ($text) {

    # Remove any text not normally visible
    if (/<\/text>/) {$text=0;}
    s/<.*>//;               # remove xml tags
    s/&amp;/&/g;            # decode URL encoded chars
    s/&lt;/</g;
    s/&gt;/>/g;
    s/<ref[^<]*<\/ref>//g;  # remove references <ref...> ... </ref>
    s/<[^>]*>//g;           # remove xhtml tags
    s/\[http:[^] ]*/[/g;    # remove normal url, preserve visible text
    s/\|thumb//ig;          # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig;  # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g;  # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g;  # remove wiki url, preserve visible text
    s/{{[^}]*}}//g;         # remove {{icons}} and {tables}
    s/{[^}]*}//g;           # remove remaining table markup
    s/\[//g;                # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g;          # remove URL encoded chars

    # convert to lowercase letters and spaces, spell digits
    $_=" $_ ";
    tr/A-Z/a-z/;
    s/0/ zero /g;
    s/1/ one /g;
    s/2/ two /g;
    s/3/ three /g;
    s/4/ four /g;
    s/5/ five /g;
    s/6/ six /g;
    s/7/ seven /g;
    s/8/ eight /g;
    s/9/ nine /g;
    tr/a-z/ /cs;            # convert anything else to a single space
    chop;                   # drop the trailing ">" record separator
    print $_;
  }
}

Search engine metadata

How can we add the idea of 'metadata' to search.py? For example, in the Udacity dataset there is all this great metadata, ranging from tags to course images. Pulling that into a metadata object in search.py would give access to a lot of useful information.
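
One hypothetical shape for that metadata object (the field names and values are illustrative, loosely based on the Udacity dataset mentioned above): each search result carries a free-form metadata dict alongside the matched fields.

from dataclasses import dataclass, field

@dataclass
class SearchResult:
    title: str
    snippet: str
    metadata: dict = field(default_factory=dict)  # e.g. tags, course image URL

result = SearchResult(
    title='Intro to Machine Learning',
    snippet='Learn the basics of supervised learning...',
    metadata={'tags': ['machine-learning'], 'image': 'https://...'},
)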
