Search Engine Similarity Analysis: A Combined Content and Rankings Approach

Are search engines different? The search engine wars are a favourite topic of online analysts, as two of the biggest companies in the world, Google and Microsoft, battle for dominance of the web search space. Differences in search engine popularity can be explained by one of them giving better results to its users, or by other factors, such as familiarity with the most popular engine, peer imitation, or force of habit. In this work we present a thorough analysis of the affinity of the two major search engines, adding a third one, DuckDuckGo, which goes to great lengths to emphasize its privacy-friendly credentials. To carry out the analysis, we collected response data using a comprehensive set of queries over a period of one month. We then applied a number of different similarity methods, including a new measure that we developed specifically for this task, which leverages both the content and the ranking of search responses. The results show that Google stands apart, while Bing and DuckDuckGo are largely indistinguishable from each other. It therefore appears that, while the difference between Google and the rest can be explained by the different results they produce, search results cannot account for the difference in market share between Bing and DuckDuckGo. Moreover, through the new measure we propose, our work also contributes to the literature on comparing top-k lists and on applying machine learning techniques to search engine comparisons.

The full description of our work will appear in the proceedings of the upcoming International Conference on Web Information Systems Engineering (WISE 2020).

Setup

Create a Python virtual environment and install the necessary Python packages.

virtualenv .env
source .env/bin/activate
pip install -r requirements.txt

Install the seanlz tool used to analyze the search engine results (see below for the tool's documentation):

python setup.py install

To download the nltk vocabulary, run the following in an IPython shell:

In [1]: import nltk
In [2]: nltk.download('punkt')

Reproduce the results of the paper

Open the notebook located in the notebook directory and run it:

jupyter notebook notebook/evaluation.ipynb

The notebook first fetches the dataset used in our work and then reproduces the results for each research question presented in the paper.

We have two datasets: one for the search engine results collected in 2016 and one for the web results collected in 2019.

The seanlz Tool

To estimate the degree of similarity between the web results of search engines, we implemented a command-line tool which computes the similarity in various ways. The tool takes as input the web results produced by the search engines for multiple queries over multiple days.

Usage

The seanlz command provides multiple methods for computing the similarity of search engines through the following sub-commands:

  • cont: Evaluates the results of search engines by exploiting the snippets of the results, using machine learning and tensor decomposition techniques.
  • rank: Measures the similarity of a pair of search engines, using ranking-based methods.
  • metrict: Computes the similarity between two search engines, using the metric T.

Config

The seanlz tool needs a configuration file which defines which queries are used in your dataset. These queries are grouped into categories, so you can run the analysis per category.

The configuration file must be valid JSON. An example of a valid configuration is the following:

{
    "categories": {
        "Query Category 1": ["Query1", "Query2"],
        "Query Category 2": ["Query3", "Query4"],
    },
    "results": "directory/to/your/dataset"
}

The configuration file used for our work is .config.sea, located at the root of this repository.
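
If you keep your configuration elsewhere, you can point the tool at it with the --config option (documented below). A hypothetical invocation with a custom configuration path and placeholder engine names might look like:

seanlz --config /path/to/my-config.json -s foo,bar rank --metric LEV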

CLI Reference

seanlz

Description

Estimates the degree of similarity between the results of multiple search engines produced from queries grouped into categories.

Synopsis

seanlz [options] <sub-command>

Options

  • --config: (optional) The location of the configuration file. This location can also be specified by an environment variable named SESIM_CONFIG. The default location is ~/.config.sea.
  • -s, --search_engines: (required) The search engines for the analysis (separated by ',').
  • --categories: (optional) The categories of the queries for the analysis (separated by ','). By default, the sub-command inspects all categories specified in the configuration file.
  • --merge: (flag option) Display the analysis of query categories in a single diagram.
  • -N: (int) Number of retrieved results per query. Default 10. (An example invocation is shown after this list.)
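
For example, a hypothetical invocation that combines these global options with the rank sub-command (foo and bar are placeholder engine names, as in the examples below) might be:

seanlz -s foo,bar --categories "Query Category 1" -N 10 rank --metric KENDALL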

cont

Description

Evaluates the results of search engines by exploiting the snippets of the results, using machine learning and tensor decomposition techniques.

Synopsis

seanlz [options] cont [options]

Options

  • --method: (cmp or clf -- required) Strategy used to evaluate the results. cmp denotes that the results are decomposed into components to examine the similarity of the search engines, while clf indicates that a classifier is used as a means of measuring the similarity between search engines.

  • -m, --model: (tensor | lda | query | se -- required) The model used to estimate the similarity.

    • tensor: A four-mode tensor (query, search engine, day, result) is constructed, and the CP Alternating Poisson Regression algorithm is applied to it.
    • lda: The Latent Dirichlet Allocation algorithm is used.
    • query: Classification model for predicting queries is used.
    • se: Classification model for predicting search engines is used.
  • -c: (required) Model-specific configuration, given as key-value pairs separated by =.

    The tensor and lda models need the number of components to be specified.

    Example:

    seanlz -s foo,bar cont --method cmp -m lda -c components=10

    The query model needs the following configuration options:

    • evaluation: Evaluation method for the classification model. Currently, only the CrossLearnCompare (clc) method is supported.
    • classifier: Classifier for predicting queries (e.g. SVC).

    Example:

    seanlz -s foo,bar cont --method clf -m query -c evaluation=clc -c classifier=SVC

    The se model needs:

    • folds: Number of folds for the cross-validation evaluation method.
    • metric: (roc | cm) Metric for computing the effectiveness of classification model (Receiver Operating Characteristic or Confusion Matrix).
    • classifier: Classifier for predicting search engines (e.g. SVC).
  • --vocabulary: (int) Number of vocabulary words to be used for the analysis. Default 100.

  • --per-day / --per-result: Specify what the bag-of-words representation of the results contains. The --per-day option denotes that all top results are taken into account together, whereas --per-result indicates that every single result is converted into its own bag-of-words representation. (An example invocation using the se model is shown after this list.)
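
Putting these options together, a sketch of an se-model invocation might look as follows (the engine names foo and bar and the chosen values are illustrative, not recommendations):

seanlz -s foo,bar cont --method clf -m se -c folds=10 -c metric=roc -c classifier=SVC --vocabulary 200 --per-result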

rank

Description

Measures the similarity of a pair of search engines, using ranking-based methods.

Synopsis

seanlz [options] rank [options]

Options

  • --metric: (required) Ranking-based method used to measure similarity (see the example after this list). Methods:
    • LEV: Levenshtein distance.
    • DAM-LEV: Damerau-Levenshtein distance.
    • HAM: Hamming distance.
    • JAR: Jaro distance.
    • JAR-WIN: Jaro-Winkler distance.
    • KENDALL: Kendall-Tau distance.
    • SPEARMAN: Spearman's Footrule distance.
    • G: G distance.
    • M: M distance.
    • PYTHON: Uses Python's difflib.SequenceMatcher.
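
For instance, a hypothetical invocation comparing two engines with the Kendall-Tau distance (foo and bar are placeholder engine names) might be:

seanlz -s foo,bar rank --metric KENDALL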

metrict

Description

Computes the similarity between two search engines, using the metric T.

Synopsis

seanlz [options] metrict [options]

Options

  • --evol: (flag option) Display the similarity of two search engines over time.
  • -a: (float, between 0 and 1) The weight attached to the penalty due to snippets. It accepts more than one value (separated by ','), but only with the --evol option. Default is 0.8.
  • -b: (float, between 0 and 1) The weight attached to the penalty due to titles. It accepts more than one value (separated by ','), but only with the --evol option. Default is 1.0.
  • -c: (float, between 0 and 1) The weight attached to the penalty due to transpositions. It accepts more than one value (separated by ','), but only with the --evol option. Default is 0.33. (An example invocation is shown below.)
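
For instance, a sketch of an invocation that plots the evolution of the similarity over time for several snippet weights (foo, bar, and the weight values are illustrative) might be:

seanlz -s foo,bar metrict --evol -a 0.2,0.5,0.8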
