
searchbetter's Introduction

SearchBetter: query rewriting for search engines on small corpuses

by Neel Mehta, Harvard University

SearchBetter lets you make powerful, fast, and drop-in search engines for any dataset, no matter how small or how large. It also offers built-in query rewriting, which uses NLP to help your search engines find semantically-related content to the user's search term.

For instance, a search for "machine learning" might only return results for items that contain the exact phrase "machine learning". But with query rewriting, you would get results not only for "machine learning" but also for, say, "artificial intelligence" and "neural networks".

SearchBetter lets you power up your search engines with minimal effort. It's especially useful if you have a small dataset to search on, or if you don't have the time or data to make fancy bespoke query rewriting algorithms.

Getting started

To drop this module into your app:

pip install searchbetter

For more advanced analysis and research purposes, use the interactive demo to get yourself set up!

Usage

Try out the interactive demo!

For a truly quick-and-dirty dive into SearchBetter (no setup required), use:

from searchbetter import rewriter

# a rewriter backed by Wikipedia, which works out of the box
query_rewriter = rewriter.WikipediaRewriter()
# returns a list of queries related to 'biochemistry'
query_rewriter.rewrite('biochemistry')
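
Under the hood, the core pattern is simple: rewrite the query into several related queries, run each one through an ordinary search engine, and merge the results. The sketch below illustrates that flow; the engine object and its search() method are illustrative stand-ins, not necessarily the library's actual API.

def search_with_rewriting(engine, query_rewriter, query):
    # rewrite() returns a list of related queries, e.g.
    # 'machine learning' -> ['machine learning', 'artificial intelligence', ...]
    results = []
    seen = set()
    for rewritten in query_rewriter.rewrite(query):
        for hit in engine.search(rewritten):
            if hit not in seen:    # dedupe hits found under multiple queries
                seen.add(hit)
                results.append(hit)
    return results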

Documentation

Documentation is available online at http://searchbetter.readthedocs.io/.

To build the docs yourself using Sphinx:

cd docs
make html
open _build/html/index.html

Where to find data

Some of this data is proprietary to Harvard and HarvardX. Other data, like the Udacity API and the Wikipedia dump, is open to the public.

Name            URL                                            What to name file
Udacity API     https://www.udacity.com/public-api/v0/courses  udacity-api.json
Wikipedia dump  See below                                      wikiclean8
edX courses     Proprietary                                    Master CourseListings - edX.csv
DART data       Proprietary                                    corpus_HarvardX_LatestCourses_based_on_2016-10-18.csv

How to prepare Wikipedia data

Download and unzip the enwik8 dataset from http://www.mattmahoney.net/dc/enwik8.zip. Then run:

perl processing-scripts/wiki-clean.pl enwik8 > wikiclean8

This might take a minute or two to run.

Context

SearchBetter was designed as part of a research project by Neel Mehta, Daniel Seaton, and Dustin Tingley for Harvard's CS 91r, a research-for-credit course.

It was originally designed for Harvard DART, a tool that helps educators reuse HarvardX assets such as videos and exercises in their online or offline courses. SearchBetter is especially useful for MOOCs, which often have small corpuses and have to deal with many uncommon queries (students will search for the most unfamiliar terms, after all). Still, SearchBetter has been made general-purpose enough that it can be used with any corpus or any search engine.

searchbetter's People

Contributors

dseaton, hathix


searchbetter's Issues

Add support for ngrams

I'm not sure if I'm reading the enwik8 wrong or something, but when I try adding the Phraser, the word2vec model messes up spectacularly...
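
For reference, here is a minimal sketch of how bigram support might be wired in with gensim, which appears to be what this issue is attempting. The file name, chunking, and tokenization below are assumptions based on the data-prep steps above, not the repo's actual preprocessing, and the keyword arguments use the gensim 4 API.

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# the wikifil.pl output is one long stream of tokens, so chunk it into
# pseudo-sentences for training (an assumption, not the repo's actual code)
with open('wikiclean8') as f:
    tokens = f.read().split()
sentences = [tokens[i:i + 1000] for i in range(0, len(tokens), 1000)]

# learn frequent bigrams ('machine learning' -> 'machine_learning') ...
phraser = Phraser(Phrases(sentences, min_count=5, threshold=10.0))

# ... then train word2vec on the phrased corpus
model = Word2Vec(phraser[sentences], vector_size=100, window=5, min_count=5)

# works only if the bigram was frequent enough to be joined by the Phraser
print(model.wv.most_similar('machine_learning', topn=5))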

Generalize rewriter

rewriter.py needs more work on generalization. This is a bit harder, as we have prebuilt models and APIs, and we also want to build our own models (word2vec, doc2vec). Let's talk Thursday a bit about how you view this file growing in the next couple of weeks.
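
One plausible shape for that generalization, sketched under assumptions (the class names here are illustrative, not the repo's actual classes): a small abstract base class that API-backed and model-backed rewriters both implement.

from abc import ABC, abstractmethod

class BaseRewriter(ABC):
    """Anything that can turn one query into a list of related queries."""

    @abstractmethod
    def rewrite(self, query):
        """Return a list of queries related to `query`, including `query` itself."""

class Word2VecRewriter(BaseRewriter):
    def __init__(self, model, topn=5):
        self.model = model      # a trained gensim word2vec model
        self.topn = topn

    def rewrite(self, query):
        token = query.replace(' ', '_')
        if token not in self.model.wv:
            return [query]      # fall back to the original query
        related = [w.replace('_', ' ')
                   for w, _ in self.model.wv.most_similar(token, topn=self.topn)]
        return [query] + related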

If no existing model, create a new one

This prevents errors. In that case, override the create flag.

        # TODO have an auto-detect feature that will determine if the
        # index exists, and depending on that creates or loads the index
        # TODO have the `create` option become `force_create`; normally
        #   it'll intelligently auto-generate, but if you force it it'll
        #   do what you say
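
A minimal sketch of the auto-detect behavior those TODOs describe, assuming a Whoosh-backed index; the helper name and index_path parameter are illustrative.

import os
import whoosh.index

def open_or_create_index(index_path, schema, force_create=False):
    # load the index if it's already on disk; (re)build it otherwise,
    # or whenever the caller forces a rebuild
    if not force_create and whoosh.index.exists_in(index_path):
        return whoosh.index.open_dir(index_path)
    os.makedirs(index_path, exist_ok=True)
    return whoosh.index.create_in(index_path, schema)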

Clean the Wikipedia data first

http://www.mattmahoney.net/dc/textdata.html

Appendix A

This Perl program filters Wikipedia text dumps to produce 27-character text (lowercase letters and spaces) as described in this article. To use:

perl wikifil.pl enwik9 > text

Then truncate the text to the desired length (e.g. 10^8 bytes).

You can cut and paste the program below.

#!/usr/bin/perl

# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.

# Written by Matt Mahoney, June 10, 2006. This program is released to the public domain.

$/=">";                         # input record separator
while (<>) {
  if (/<text /) {$text=1;}      # remove all but between <text> ... </text>
  if (/#redirect/i) {$text=0;}  # remove #REDIRECT
  if ($text) {

    # Remove any text not normally visible
    if (/<\/text>/) {$text=0;}
    s/<.*>//;               # remove xml tags
    s/&amp;/&/g;            # decode URL encoded chars
    s/&lt;/</g;
    s/&gt;/>/g;
    s/<ref[^<]*<\/ref>//g;  # remove references <ref...> ... </ref>
    s/<[^>]*>//g;           # remove xhtml tags
    s/\[http:[^] ]*/[/g;    # remove normal url, preserve visible text
    s/\|thumb//ig;          # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig;  # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g;  # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g;  # remove wiki url, preserve visible text
    s/{{[^}]*}}//g;         # remove {{icons}} and {tables}
    s/{[^}]*}//g;           # remove remaining table markup
    s/\[//g;                # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g;          # remove URL encoded chars

    # convert to lowercase letters and spaces, spell digits
    $_=" $_ ";
    tr/A-Z/a-z/;
    s/0/ zero /g;
    s/1/ one /g;
    s/2/ two /g;
    s/3/ three /g;
    s/4/ four /g;
    s/5/ five /g;
    s/6/ six /g;
    s/7/ seven /g;
    s/8/ eight /g;
    s/9/ nine /g;
    tr/a-z/ /cs;            # convert anything else to a single space
    chop;                   # drop the trailing ">" record separator
    print $_;
  }
}

Search engine metadata

How can we add the idea of 'metadata' to search.py? For example, in the Udacity dataset there is all this great metadata, ranging from tags to course images. Pulling that into a metadata object in search.py would give access to a lot of useful information.
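
One hypothetical shape for that metadata object (the field names and values are illustrative, loosely based on the Udacity dataset mentioned above): each search result carries a free-form metadata dict alongside the matched fields.

from dataclasses import dataclass, field

@dataclass
class SearchResult:
    title: str
    snippet: str
    metadata: dict = field(default_factory=dict)  # e.g. tags, course image URL

result = SearchResult(
    title='Intro to Machine Learning',
    snippet='Learn the basics of supervised learning...',
    metadata={'tags': ['machine-learning'], 'image': 'https://...'},
)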
