
crawler-extension's Introduction


Mwmbl - the Open Source Web Search Engine


No ads, no tracking, no profit

Mwmbl is a non-profit, open source search engine where the community determines the rankings. We aim to be a replacement for commercial search engines such as Google and Bing.


We have our own index powered by our community. Our index is currently much smaller than those of commercial search engines, with around 500 million unique URLs (more stats). The quality is still a long way behind that of the commercial engines, but you can help change that by joining us! We aim to have 1 billion unique URLs indexed by the end of 2024, 10 billion by the end of 2025 and 100 billion by the end of 2026, by which point we should be comparable with the commercial search engines.

Community

Our main community is on Matrix, but we also have a Discord server for non-development-related discussion.

The community is responsible for crawling the web (see below) and curating search results. We are friendly and welcoming. Join us!

Documentation

All documentation is at https://book.mwmbl.org.

Crawling

Crawling is distributed across the community, while indexing is centralised on the main server.

If you have spare compute and bandwidth, the best way you can help is by running our command line crawler with as many threads as you can spare.

If you have Firefox you can help out by installing our extension. This will crawl the web in the background. It does not use or access any of your personal data. Instead it crawls a set of URLs sent from our central server. After extracting a summary of each page, it batches these up and sends the data to the central server to be stored and indexed.
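As a purely illustrative sketch (not the extension's actual code), the crawl-and-submit flow described above might look roughly like this; the endpoint paths, batch size and payload field names are assumptions:

```typescript
// Hedged sketch of the background crawl flow described above.
// The API base URL, paths and payload shape are assumptions, not the real API.
const API = "https://api.example.org/crawler"; // hypothetical endpoint
const BATCH_SIZE = 20;                         // hypothetical batch size

interface PageSummary {
  url: string;
  title: string;
  extract: string;
}

async function crawlBatch(): Promise<void> {
  // 1. Fetch a batch of URLs chosen by the central server (no personal data involved)
  const urls: string[] = await (await fetch(`${API}/urls`)).json();

  // 2. Crawl each URL and extract a short summary of the page
  const summaries: PageSummary[] = [];
  for (const url of urls.slice(0, BATCH_SIZE)) {
    const html = await (await fetch(url)).text();
    const doc = new DOMParser().parseFromString(html, "text/html");
    summaries.push({
      url,
      title: doc.title,
      extract: doc.body?.textContent?.trim().slice(0, 500) ?? "",
    });
  }

  // 3. Send the batched summaries back to the central server for storage and indexing
  await fetch(`${API}/batches`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ items: summaries }),
  });
}
```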

Why a non-profit search engine?

The motives of ad-funded search engines are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads which are often not clearly distinguished from search results. Also, as eitland comments on Hacker News:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

  • search.marginalia.nu - a search engine favouring text-heavy websites
  • SearXNG - an open source meta search engine
  • YaCy - an open source distributed search engine
  • Gigablast - a privacy-focused search engine whose owner makes money by selling the technology to third parties
  • Brave
  • DuckDuckGo
  • Kagi

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision makes search very slow.

Marginalia Search is fantastic, but our goals are different: we aim to be a replacement for commercial search engines, while Marginalia aims to provide a different type of search.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To retrieve items, we compute the hash of each term in the user query, load the corresponding pages, filter the items to those containing the term, and rank them. Since each page is small, this can be done very quickly.

Because we compress the list of items, we can rank for more than a single term and maintain an index smaller than the inverted index design. Well, that's the theory. This idea has yet to be tested out on a large scale.
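As a minimal in-memory sketch of this design (with the per-page compression and on-disk storage omitted), storing and retrieving could look like the following; the number of pages and the hash function are illustrative assumptions:

```typescript
// Minimal in-memory sketch of the page-based hash map index described above.
const NUM_PAGES = 1 << 20;  // N: fixed number of pages (assumed value)
const PAGE_SIZE = 4096;     // bytes per page in the real index, matching a memory page

interface Item {
  url: string;
  title: string;
  terms: string[];          // terms this item ranks for
}

// One (uncompressed) list of items per page
const pages: Item[][] = Array.from({ length: NUM_PAGES }, () => []);

// Map a term to a page number between 0 and N - 1
function pageForTerm(term: string): number {
  let hash = 0;
  for (const ch of term) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % NUM_PAGES;
}

// Store: an item is placed in the page of a term it should rank for
// (the real index compresses each page's item list to fit within PAGE_SIZE)
function indexItem(term: string, item: Item): void {
  pages[pageForTerm(term)].push(item);
}

// Retrieve: hash each query term, load its page, keep items containing the term
function search(query: string): Item[] {
  const results: Item[] = [];
  for (const term of query.toLowerCase().split(/\s+/)) {
    const page = pages[pageForTerm(term)];
    results.push(...page.filter(item => item.terms.includes(term)));
  }
  return results;           // ranking of the filtered items is omitted here
}
```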

How to contribute

There are lots of ways to help.

If you would like to contribute in any way, thank you! Please join our Matrix chat server or email the main author (the email address is in the git commit history).

Development

Local Testing

For trying out the service locally, see the section in the Mwmbl book.

Using Dokku

Note: this method is not recommended as it is more involved, and your index will not have any data in it unless you set up a crawler that submits data to your server. You will also need to set up your own Backblaze or S3-equivalent storage, or have access to the production keys, which we probably won't give you.

Follow the deployment instructions

Frequently Asked Questions

How do you pronounce "mwmbl"?

Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

crawler-extension's People

Contributors

adjagu, colinespinas, daoudclarke, omasanori

crawler-extension's Issues

Add dev mode for panel and options

At the moment, the dev script from package.json does not serve any files and throws a 404.

We could use the dev mode to live-reload the panel and options files, making for a better developer experience.
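One possible direction, sketched here with assumed file paths, is to declare the panel and options pages as explicit build entries so that `vite` in dev mode can serve and live-reload them:

```typescript
// vite.config.ts — a hedged sketch; the entry paths below are assumptions,
// not the repository's actual layout.
import { resolve } from "path";
import { defineConfig } from "vite";

export default defineConfig({
  build: {
    rollupOptions: {
      input: {
        panel: resolve(__dirname, "src/panel/index.html"),
        options: resolve(__dirname, "src/options/index.html"),
      },
    },
  },
});
```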

[Feature] Crawl page on demand

I think this one is easy to add.

A contextual menu option to crawl the specific page (not the whole website) of a site you may be visiting.

This could help to crawl your favourite or most-used pages. These could be tutorials you look at from time to time, really useful replies on Stack Overflow, etc.

These pages could be weighted more heavily in the index if sanitized properly (to avoid spam from corporations).
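A hedged sketch of how this could look in the extension's background script, using the standard WebExtensions menus API; the menu id, title and the `crawlSinglePage` helper are hypothetical:

```typescript
import browser from "webextension-polyfill";

// Hypothetical helper: reuse the extension's existing crawl + submit logic for one URL
async function crawlSinglePage(url: string): Promise<void> {
  // fetch the page, extract a summary, and send it to the central server
}

// Add a context-menu entry for crawling just the current page
browser.contextMenus.create({
  id: "mwmbl-crawl-page",
  title: "Crawl this page with Mwmbl",
  contexts: ["page"],
});

browser.contextMenus.onClicked.addListener(async (info, tab) => {
  if (info.menuItemId === "mwmbl-crawl-page" && tab?.url) {
    await crawlSinglePage(tab.url);
  }
});
```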

Adding a new URL to a query on mwmbl.org with "Search Google" enabled redirects back to the homepage

While I have found another example for mwmbl/mwmbl/issues/158, I have also noticed that the issue only happens in a browser with this addon enabled. I then noticed that it also only happens when I have the "Search Google" option enabled in the addon's modal window.

When making the query: "agile roles"
https://mwmbl.org/?q=agile+roles
I try to add the following URL:
https://neilkillick.wordpress.com/2013/09/05/scrum-basics-part-1-activities-not-roles/

I can either click the Save button or press Enter on my keyboard to confirm it, but I still get redirected to the homepage with a URL parameter like so:
https://mwmbl.org/?new_url=https%3A%2F%2Fneilkillick.wordpress.com%2F2013%2F09%2F05%2Fscrum-basics-part-1-activities-not-roles%2F

Maybe it's trying to perform some sort of duplicate check? Let me know if I can provide any more information.

Browser Compatibility

As the Chrome Web Store has deprecated Manifest V2 and Firefox does not yet support Manifest V3, we will need a different version for each browser.

The ideal solution would be a script that runs at build time to convert a single manifest to each standard and distributes two separate versions, one for each store/browser.

This could be run by a workflow once/if #5 is done.
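A hedged sketch of such a build-time script; the base manifest file name, output paths and the exact key differences are assumptions:

```typescript
// Derive both manifest versions from a single base manifest at build time.
import { mkdirSync, readFileSync, writeFileSync } from "fs";

const base = JSON.parse(readFileSync("manifest.base.json", "utf8"));
const { action, ...rest } = base;

// Firefox build: Manifest V2 with a background page and browser_action
const mv2 = {
  ...rest,
  manifest_version: 2,
  background: { scripts: ["background.js"] },
  browser_action: action,
};

// Chrome build: Manifest V3 with a service worker background and action
const mv3 = {
  ...base,
  manifest_version: 3,
  background: { service_worker: "background.js" },
};

mkdirSync("dist/firefox", { recursive: true });
mkdirSync("dist/chrome", { recursive: true });
writeFileSync("dist/firefox/manifest.json", JSON.stringify(mv2, null, 2));
writeFileSync("dist/chrome/manifest.json", JSON.stringify(mv3, null, 2));
```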

Don't try and crawl really large pages

Our extension could potentially cause issues for users if it attempts to crawl a really large page. We should have a safeguard so that if we attempt to fetch a large page, we bail out before causing memory issues for our users.
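A hedged sketch of such a safeguard, assuming a 1 MB limit (the actual threshold would need to be agreed on): it bails out early if the declared Content-Length is too large, and aborts mid-download if the body grows past the limit anyway:

```typescript
const MAX_PAGE_BYTES = 1_000_000; // assumed limit

async function fetchWithLimit(url: string): Promise<string | null> {
  const response = await fetch(url);

  // Bail out early if the server declares an oversized body
  const declared = Number(response.headers.get("content-length") ?? 0);
  if (declared > MAX_PAGE_BYTES) return null;

  // Content-Length can be missing or wrong, so also enforce the limit while streaming
  const reader = response.body!.getReader();
  const chunks: Uint8Array[] = [];
  let received = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    received += value.length;
    if (received > MAX_PAGE_BYTES) {
      await reader.cancel(); // stop downloading before it causes memory issues
      return null;
    }
    chunks.push(value);
  }

  // Concatenate the chunks and decode to text
  const body = new Uint8Array(received);
  let offset = 0;
  for (const chunk of chunks) {
    body.set(chunk, offset);
    offset += chunk.length;
  }
  return new TextDecoder().decode(body);
}
```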

Crawl one page at a time

At the moment we use setInterval to crawl one page per second. Instead, it would be safer to ensure we only crawl one page at a time in a continuous loop using async/await.

Also, ensure that this loop is only called once. At the moment this can be triggered by multiple events: onInstalled, onStartup and onEnabled. We need to ensure there is exactly one loop running at all times.
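A hedged sketch of what that could look like; `crawlOnePage` is a hypothetical stand-in for the extension's existing per-page logic:

```typescript
// Hypothetical helper: fetch the next URL from the server, crawl it, submit the result
async function crawlOnePage(): Promise<void> {
  // ...
}

let loopRunning = false;

async function crawlLoop(): Promise<void> {
  // Guard so that onInstalled, onStartup and onEnabled can all call this safely:
  // only the first call actually starts the loop.
  if (loopRunning) return;
  loopRunning = true;

  while (true) {
    try {
      await crawlOnePage(); // crawl exactly one page at a time
    } catch (e) {
      console.error("Crawl failed", e);
    }
    // Pause between pages instead of overlapping setInterval callbacks
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}
```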

Build Failed

Issue: Build failed when following the "How to build" section of README.md.

Steps to reproduce:

  1. git clone https://github.com/mwmbl/crawler-extension
  2. cd crawler-extension
  3. npm run build

Output:

> [email protected] build
> vite build
/tmp/build-ddf9cdff.sh: line 1: vite: command not found

System Information:

  • NPM v8.14.0
  • NodeJS v18.6.0
  • OS: EndeavourOS
  • Kernel: 5.18.12-zen1-1-zen

I'm probably missing something obvious.

Improve crawler prioritisation

At the moment, the crawler uses heuristics to decide which pages to crawl; however, this often fails and the crawler gets stuck in loops of crawling the same rubbish site for days on end.

To get around this, I propose that we choose which sites we wish to crawl, and distribute the crawling across those sites - instead of distributing by individual pages.

So the algorithm would work something like this:

  • We have a list of top sites that we would like to crawl. We will spend say 50% of our time crawling these sites.
  • For each top site, choose a maximum N pages to crawl for that site. We can use our current scoring technique to choose which pages.
  • For every other site, choose a maximum M pages to crawl for those sites, where M < N. Again we can use our current scoring to choose which sites/pages to crawl, up to a maximum of 50% of the total number of pages to crawl.

I would suggest initial values of N = 100 and M = 1.
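A hedged sketch of the proposed allocation; the scoring values and the list of top sites are assumed to come from the existing crawler infrastructure:

```typescript
const TOP_SITE_SHARE = 0.5; // spend ~50% of crawl effort on top sites
const N = 100;              // max pages per top site (suggested initial value)
const M = 1;                // max pages per other site (suggested initial value)

interface ScoredPage {
  url: string;
  site: string;
  score: number;            // from the current scoring technique
}

function choosePagesToCrawl(
  candidates: ScoredPage[],
  topSites: Set<string>,
  totalBudget: number,
): string[] {
  // Group candidate pages by site
  const bySite = new Map<string, ScoredPage[]>();
  for (const page of candidates) {
    const pages = bySite.get(page.site) ?? [];
    pages.push(page);
    bySite.set(page.site, pages);
  }

  // Pick up to N pages per top site and up to M pages per other site,
  // taking the best-scoring pages first
  const topPicks: string[] = [];
  const otherPicks: string[] = [];
  for (const [site, pages] of bySite) {
    pages.sort((a, b) => b.score - a.score);
    const isTop = topSites.has(site);
    const picks = pages.slice(0, isTop ? N : M).map(p => p.url);
    (isTop ? topPicks : otherPicks).push(...picks);
  }

  // Split the total crawl budget roughly 50/50 between top sites and the rest
  const topBudget = Math.floor(totalBudget * TOP_SITE_SHARE);
  return [
    ...topPicks.slice(0, topBudget),
    ...otherPicks.slice(0, totalBudget - topBudget),
  ];
}
```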
