phinde - generic web search engine

Self-hosted search engine you can use for your static blog or just about any other website that needs search functionality.

My live instance is at http://search.cweiske.de/ and indexes my website, blog and all linked URLs.

Features

  • Crawler and indexer with the ability to run many in parallel
  • Shows and highlights text that contains search words
  • Boolean search queries:
    • foo bar searches for foo AND bar
    • foo OR bar
    • title:foo searches for foo only in the page title
  • Facets for tag, domain, language and type
  • Date search:
    • before:2016-08-30 - modification date before that day
    • after:2016-08-30 - modified after that day
    • date:2016-08-30 - exact modification day match
  • Site search
    • Query: foo bar site:example.org/dir/
    • or use the site GET parameter: /?q=foo&site=example.org/dir
  • OpenSearch support with HTML and Atom result lists
  • Instant indexing with WebSub (formerly PubSubHubbub)

Dependencies

  • PHP 8.x
  • Elasticsearch 2.0
  • MySQL or MariaDB for WebSub subscriptions
  • Gearman (Debian 9: gearman-job-server, not gearman-server)
  • gearadmin command line tool (gearman-tools package)
  • PHP Gearman extension
  • Some PHP libraries that get installed with composer
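
Before starting the setup it can help to check that the command line tools from the list above are installed. The following is a minimal sketch, not part of phinde: the function name missing_commands is made up for illustration, and the list of commands is an assumption derived from the dependencies and from the commands used elsewhere in this document.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Tools named in the dependency list or used in the setup steps.
missing=$(missing_commands php composer gearadmin git curl)
if [ -n "$missing" ]; then
    printf 'Missing commands:\n%s\n' "$missing"
fi

# The Gearman extension appears in PHP's module list when it is loaded.
if command -v php >/dev/null 2>&1; then
    php -m | grep -qi '^gearman$' || echo 'PHP Gearman extension not loaded'
fi
```

The script only checks for the presence of the tools. It does not verify versions or whether the Elasticsearch and Gearman servers are running.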

Setup

  1. Install and run Elasticsearch and Gearman
  2. Install php-gearman and gearman-tools
  3. Get a local copy of the code:

    $ git clone https://git.cweiske.de/phinde.git phinde
  4. Install dependencies via composer:

    $ composer install --no-dev
  5. Point your webserver's document root to phinde's www directory
  6. Copy data/config.php.dist to data/config.php and adjust it. Make sure you add your domain to the crawl whitelist.
  7. Create a MySQL database and import the schema from data/schema.sql
  8. Run bin/setup.php which sets up the Elasticsearch schema
  9. Put your homepage into the queue:

    $ ./bin/process.php http://example.org/
  10. Start at least one worker to process the crawl+index queue:

    $ ./bin/phinde-worker.php
  11. Check phinde's status page in your browser. Both the number of open tasks and the number of workers should be greater than 0.
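
The queue can probably also be inspected from the command line with gearadmin. The sketch below is an assumption, not part of phinde: it expects the output of gearadmin --status to consist of tab-separated lines (function name, jobs in queue, jobs running, available workers) followed by a single "." line, and the helper name summarise_gearman_status is made up for illustration.

```shell
# Summarise `gearadmin --status` output read from stdin.
# Assumed line format: function<TAB>queued<TAB>running<TAB>workers
summarise_gearman_status() {
    awk -F '\t' '
        $0 == "." { next }     # end-of-list marker
        NF >= 4   { jobs += $2; if ($4 > workers) workers = $4 }
        END       { printf "open tasks: %d, workers: %d\n", jobs, workers }
    '
}

# Usage on the machine that runs the Gearman job server:
#   gearadmin --status | summarise_gearman_status
```

The sketch reports the largest worker count instead of the sum, on the assumption that one worker process may register for several functions. The numbers on phinde's status page remain the authoritative ones.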

Re-index when your site changes

When your site changes, the search engine needs to re-crawl and re-index the changed pages.

Simply tell phinde that something changed by running:

$ ./bin/process.php http://example.org/foo.htm

phinde supports HTML pages and Atom feeds, so if your blog has a feed it's enough to let phinde reindex that one. It will find all linked pages automatically.
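
If several pages changed at once, the command above can be wrapped in a small loop. This is a minimal sketch that assumes it is run from the phinde directory. The function name queue_urls, the PHINDE_PROCESS override and the file name changed-urls.txt are hypothetical and not part of phinde.

```shell
# Queue every URL read from stdin, one per line.
# Blank lines and lines starting with "#" are skipped.
# PHINDE_PROCESS is a hypothetical override for the script path.
queue_urls() {
    while IFS= read -r url; do
        case $url in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the called script from consuming the URL list.
        "${PHINDE_PROCESS:-./bin/process.php}" "$url" </dev/null
    done
}

# Usage, run from the phinde directory:
#   queue_urls < changed-urls.txt
```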

Website integration

Adding a simple search form to your website is easy. It needs two things:

  • A <form> tag with an action that points to the phinde instance
  • A search text field named q

Example:

<form method="get" action="http://phinde.example.org">
  <input type="text" name="q" placeholder="Search text"/>
  <button type="submit">Search</button>
</form>
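
The request that this form sends can also be issued from the command line, for example to test an instance. The sketch below assumes curl is installed and uses the example host from the form above. The function name phinde_search is made up for illustration, and its optional third argument maps to the site GET parameter described under Features.

```shell
# Run a phinde search with curl.
# $1: base URL of the phinde instance, $2: query, $3: optional site filter
phinde_search() {
    base=$1
    query=$2
    site=${3:-}
    # -G sends the --data-urlencode values as a GET query string.
    set -- -sS -G --data-urlencode "q=$query"
    if [ -n "$site" ]; then
        set -- "$@" --data-urlencode "site=$site"
    fi
    curl "$@" "$base"
}

# Usage:
#   phinde_search http://phinde.example.org/ 'foo bar'
#   phinde_search http://phinde.example.org/ 'title:foo' example.org/dir
```

Letting curl do the URL encoding avoids quoting problems with spaces and colons in queries such as before:2016-08-30.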

System service

When using systemd, you can let it run multiple worker instances when the system boots up:

  1. Copy files data/systemd/phinde*.service into /etc/systemd/system/
  2. Adjust user and group names, and the work directories
  3. Enable three worker processes:

    $ systemctl daemon-reload
    $ systemctl enable phinde@1
    $ systemctl enable phinde@2
    $ systemctl enable phinde@3
    $ systemctl enable phinde
    $ systemctl start phinde
  4. Now three workers are running. Restarting the phinde service also restarts the workers.
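
The unit names for the worker instances can be generated instead of typed out. This is a minimal sketch with a made-up helper name, phinde_worker_units. Whether a number of workers other than three is started together with the phinde service depends on how the shipped unit files are wired, which is not verified here.

```shell
# Print the unit names for N worker instances: phinde@1 ... phinde@N.
phinde_worker_units() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'phinde@%d\n' "$i"
        i=$((i + 1))
    done
}

# Usage (as root), equivalent to the three enable commands above:
#   systemctl daemon-reload
#   phinde_worker_units 3 | xargs systemctl enable
#   systemctl enable phinde
#   systemctl start phinde
```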

Cron job

Run bin/renew-subscriptions.php once a day with cron. It will renew the WebSub subscriptions.
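
A crontab entry for this could look like the one printed by the sketch below. The time of day (03:15), the install directory and the helper name phinde_cron_line are placeholders chosen for illustration. Changing into the phinde directory first is a precaution, not a documented requirement.

```shell
# Print a crontab entry that runs the renewal script once a day at 03:15.
# $1: directory of the phinde checkout (placeholder)
phinde_cron_line() {
    printf '15 3 * * * cd %s && php bin/renew-subscriptions.php\n' "$1"
}

# Usage: append the entry to the current user's crontab.
#   ( crontab -l 2>/dev/null; phinde_cron_line /path/to/phinde ) | crontab -
```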

Howto

Delete index data from one domain:

$ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query

This request uses the delete-by-query plugin of Elasticsearch 2.0, see https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html
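
Before deleting, it may be worth checking how many documents the query matches. The sketch below builds the same term query for a given domain and, in the usage comment, sends it to the _count endpoint of Elasticsearch. The helper name domain_query is made up for illustration, and the domain is inserted without JSON escaping, so it should only be used with plain host names.

```shell
# Build the term query from the delete command for a given domain.
domain_query() {
    printf '{"query":{"term":{"domain":"%s"}}}\n' "$1"
}

# Usage: count the matching documents before deleting them.
#   curl -s -H 'Content-Type: application/json' \
#        -d "$(domain_query example.org)" \
#        'http://127.0.0.1:9200/phinde/_count'
```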

Subscribe to a website/feed

Phinde supports WebSub to subscribe to changes of a website. When phinde gets notified by the website's hub about changes, it will immediately crawl and index the changed pages.

Subscribe to a website's feed:

$ php bin/subscribe.php http://example.org/feed.atom

Phinde will determine the website's hub and send a registration request to it.

The status page shows the number of working subscriptions and the number of open subscriptions.

Unsubscribing is also done on the command line:

$ php bin/unsubscribe.php http://example.org/feed.atom

About phinde

Source code

phinde's source code is available from http://git.cweiske.de/phinde.git or the mirror on GitHub.

License

phinde is licensed under the AGPL v3 or later.

Author

phinde was written by Christian Weiske.