rp-indexer's Introduction

๐Ÿ—ƒ๏ธ Indexer

Service for indexing RapidPro/TextIt contacts into Elasticsearch.

Deploying

As it is a Go application, it compiles to a binary and that binary along with the config file is all you need to run it on your server. You can find bundles for each platform in the releases directory. You should only run a single instance for a deployment.

It can run in two modes:

  1. the default mode, which queries Elasticsearch for the most recently modified contact, then on a schedule queries the contacts_contact table in the database for contacts to add or delete. You should run this as a long-running service which constantly keeps Elasticsearch in sync with your contacts.

  2. a rebuild mode, started with --rebuild. This builds a brand new index from nothing, querying all contacts on RapidPro. Once complete, it switches the alias for the contact index over to the newly built index. This can be run on a cron (in parallel with the mode above) to rebuild your index occasionally and get rid of bloat.
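The alias switch at the end of a rebuild can be pictured as a single call to the Elasticsearch POST /_aliases endpoint, which applies a list of remove and add actions atomically. The following is only a sketch of how such a request body might be built; the function name buildAliasSwap and the index and alias names are hypothetical and are not taken from the rp-indexer source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aliasTarget names the index and alias that one action applies to.
type aliasTarget struct {
	Index string `json:"index"`
	Alias string `json:"alias"`
}

// aliasAction is one entry in the "actions" array of a POST /_aliases body.
// Exactly one of Add or Remove is set.
type aliasAction struct {
	Add    *aliasTarget `json:"add,omitempty"`
	Remove *aliasTarget `json:"remove,omitempty"`
}

// buildAliasSwap (hypothetical helper) returns a request body that removes
// alias from each old index and adds it to newIndex in a single request.
func buildAliasSwap(alias, newIndex string, oldIndexes []string) ([]byte, error) {
	actions := make([]aliasAction, 0, len(oldIndexes)+1)
	for _, old := range oldIndexes {
		actions = append(actions, aliasAction{Remove: &aliasTarget{Index: old, Alias: alias}})
	}
	actions = append(actions, aliasAction{Add: &aliasTarget{Index: newIndex, Alias: alias}})
	return json.Marshal(map[string][]aliasAction{"actions": actions})
}

func main() {
	// Index and alias names here are made up for illustration.
	body, err := buildAliasSwap("contacts", "contacts_new", []string{"contacts_old"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Because the removes and the add travel in one request, searches against the alias should never see a moment where it points at no index.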

Configuration

The service uses a tiered configuration system, in which each option takes precedence over the ones above it:

  1. The configuration file
  2. Environment variables starting with INDEXER_
  3. Command line parameters
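The ordering above can be sketched as a small lookup in which later tiers overwrite earlier ones. This is an illustration only, not the service's actual configuration loader: the function resolve, the map-based inputs, and the rule that an empty environment variable counts as unset are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolve (hypothetical) returns the effective value of one option.
// Tiers are applied in order: config file, then INDEXER_* environment
// variable, then command line flag, so the last tier present wins.
func resolve(key string, file map[string]string, getenv func(string) string, flags map[string]string) string {
	val := file[key]

	// Assumption: an empty environment variable is treated as unset.
	if v := getenv("INDEXER_" + strings.ToUpper(key)); v != "" {
		val = v
	}

	// A flag overrides everything else when it was passed explicitly.
	if v, ok := flags[key]; ok {
		val = v
	}
	return val
}

func main() {
	// Values below are placeholders for illustration.
	file := map[string]string{"elastic_url": "http://localhost:9200"}
	flags := map[string]string{}
	fmt.Println(resolve("elastic_url", file, os.Getenv, flags))
}
```

Passing the environment lookup as a function keeps the sketch easy to exercise without touching the real process environment.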

We recommend running it with no changes to the configuration and no parameters, using only environment variables to configure it. You can use % rp-indexer --help to see a list of the environment variables and parameters and for more details on each option.

RapidPro

For use with RapidPro, you will want to configure these settings:

  • INDEXER_DB: a URL connection string for your RapidPro database or read replica
  • INDEXER_ELASTIC_URL: the URL for your ElasticSearch endpoint

Recommended settings for error reporting:

  • INDEXER_SENTRY_DSN: DSN to use when logging errors to Sentry

Development

Once you've checked out the code, you can build the service with:

go build github.com/nyaruka/rp-indexer/cmd/rp-indexer

This will create a new executable called rp-indexer in the current directory. To place it in $GOPATH/bin instead, use go install with the same package path.

To run the tests you need to create the test database:

$ createdb elastic_test

To run all of the tests:

go test ./... -p=1

rp-indexer's People

Contributors

brianhlin, chris-erickson, dependabot[bot], dodobas, ericnewcomer, morrismukiri, nicpottier, norkans7, rowanseymour, tybritten

rp-indexer's Issues

Error creating index

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Custom normalizer [lowercase] may not use filter [trim]"}],"type":"illegal_argument_exception","reason":"Custom normalizer [lowercase] may not use filter [trim]"},"status":400}

This is with the latest version of Elasticsearch.

Reduce number of shards

The number of shards is set to 5, which follows what ES used to use as its default. As of the most recent version they have reduced the default to 1, since most people never touched it and were left with ballooning indexes for no real benefit. They now recommend setting this to 1 unless you know you need otherwise. Also, if you have fewer than X nodes (we have 2), this doesn't really buy much durability either. Would like to set this to 1 if you have no issues with that.

rp-indexer/indexer.go

Lines 410 to 414 in ca85fff

"index": {
"number_of_shards": 5,
"number_of_replicas": 1,
"routing_partition_size": 3
},

ElasticSearch does not return expected results

The options are to redefine the index for text fields and reindex all contacts, either by:

  • setting "ignore_above": 640
  • removing the ignore_above param

Consider tokenizing text fields

Would allow for much more sophisticated searching of text fields, i.e., we could do contains instead of exact matches. May also reduce index size, though we would need to experiment to know for sure. It does have some implications for exact searches, however, and that would also need to be experimented with.

[Security] Potential Secret Leak

It has been noticed that, while using harmon758/postgresql-action@v1, your PostgreSQL password temba is present in plaintext. Please ensure that secrets are encrypted and not passed as plain text in GitHub workflows.

Contacts with ridiculous numeric values don't get indexed

example:

"number": 700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Log sentry error if see large blocks of contacts with same modified_on

We're currently loading up to 500,000 contacts to index which is far from ideal, but was the only way to deal with hundreds of thousands of contacts having the same modified_on. We've gotten better at only ever updating contacts in small batches, so this could be lower.

If we load up to N contacts, and we get N contacts with the same modified_on, then we won't advance properly, so that's a sentry error and something that needs to be addressed ASAP.
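The condition described above could be checked with something like the following sketch. It is not the project's implementation: the contact struct, batchStalled, and makeBatch are hypothetical names, and the actual Sentry reporting is left out because the source does not show how errors are sent.

```go
package main

import (
	"fmt"
	"time"
)

// contact is a hypothetical, trimmed-down view of a contacts_contact row.
type contact struct {
	ID         int64
	ModifiedOn time.Time
}

// batchStalled reports whether a batch hit the load limit while every
// contact in it shares one modified_on value. In that case the next
// query cannot advance past that timestamp, which is the error case.
func batchStalled(batch []contact, limit int) bool {
	if limit <= 0 || len(batch) < limit {
		return false
	}
	first := batch[0].ModifiedOn
	for _, c := range batch[1:] {
		if !c.ModifiedOn.Equal(first) {
			return false
		}
	}
	return true
}

// makeBatch builds n contacts for demonstration. When same is false,
// each contact gets a modified_on one second later than the previous one.
func makeBatch(n int, same bool) []contact {
	base := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	batch := make([]contact, n)
	for i := range batch {
		ts := base
		if !same {
			ts = base.Add(time.Duration(i) * time.Second)
		}
		batch[i] = contact{ID: int64(i + 1), ModifiedOn: ts}
	}
	return batch
}

func main() {
	const limit = 5 // stand-in for N, kept tiny for the example
	if batchStalled(makeBatch(limit, true), limit) {
		fmt.Println("error: full batch shares one modified_on, indexer cannot advance")
	}
}
```

A check like this would run once per loaded batch, so its cost is a single pass over contacts that are already in memory.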
