garvincasimir / elasticsearch-azure-paas

Visual Studio Project which creates an Elasticsearch cluster on Microsoft Azure using worker roles

License: MIT License

C# 85.32% PowerShell 14.68%

worker-role c-sharp elasticsearch azure

elasticsearch-azure-paas's Introduction

Elasticsearch-Azure-PAAS

This is a Visual Studio project for creating an Elasticsearch cluster on Microsoft Azure using worker roles.

Who is this for?

This is for people who want to run Elasticsearch on Azure in the Platform as a Service environment. It is also an opportunity to test Elasticsearch in a simulated distributed environment.

How does this work?

This is a Visual Studio project which can serve as a base for a solution built on an Elasticsearch cluster. The intent is to handle all the different aspects of setting up and managing a cluster:

Installation
Configuration
Plugin Setup
Logging
Snapshots
Automatic Node Discovery
Security

Typical usage involves installing the NuGet package on an existing Web or Worker Role in your Visual Studio project. Once the package is installed, update the configuration settings and include the service initializer in your WorkerRole.cs or WebRole.cs.

Do I need an Azure Account to try this?

No, it runs in the full Azure Emulator on your desktop. The project is designed to work with the Azure Files service for data and snapshots, but falls back to a resource folder in the Azure Emulator. Other than that, there is no significant difference between running this project on the Azure Emulator and publishing it to Azure.

Installation

Install the NuGet package

Install-Package Elasticsearch-Azure-PAAS

This package will add the required settings to any cloud service projects that refer to the Web or Worker Role the package was installed on. It also adds two folders called Config and Plugins. Please set the contents of these folders to always copy to the output directory.

Config/elasticsearch.yml and Config/logging.yml are the base config files for Elasticsearch. You can modify them if you want to add any settings of your own. The settings in these files will apply to all instances in the cluster.

Settings

There are a couple config steps before you can run it either in the Azure Emulator or in an Azure Cloud Service. The NuGet package has already added those settings with defaults from this project. Please change them where necessary.
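For orientation, cloud service settings like these normally live in the role's ConfigurationSettings element of the service configuration (.cscfg) file. The fragment below is only a sketch: it shows a few of the settings described in this section with the defaults listed here, and it omits the surrounding Role and ServiceConfiguration elements, which depend on your own project.

```xml
<!-- Sketch of a .cscfg fragment; setting names and defaults are taken from this README -->
<ConfigurationSettings>
  <Setting name="StorageConnection" value="UseDevelopmentStorage=true" />
  <Setting name="ShareName" value="elasticdata" />
  <Setting name="ShareDrive" value="P:" />
  <Setting name="EndpointName" value="elasticsearch" />
  <Setting name="UseElasticLocalDataFolder" value="true" />
</ConfigurationSettings>
```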

JavaDownloadURL

The service will download the Java JRE installer from this URL.

Default: http://127.0.0.1:10000/devstoreaccount1/installers/jre-8u40-windows-x64.exe

JavaDownloadType

This tells the service whether JavaDownloadURL is a web accessible location or on the configured storage account

Default: storage Options: web, storage

JavaInstallerName

This is simply the name used to save the jre installer into the download cache on the role instance

Default: jre-8u40-windows-x64.exe

ElasticsearchDownloadURL

The service will download the Elasticsearch package from this URL.

Default: https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-5.3.0.zip

ElasticsearchDownloadType

This tells the service whether ElasticsearchDownloadURL is a web accessible location or on the configured storage account

Default: storage Options: web, storage

ElasticsearchZip

This is simply the name used to save the elasticsearch package into the download cache on the role instance

Default: elasticsearch-5.3.0.zip

StorageConnection

The service will use this connection string to download any download types marked as storage. It will also be used to create the share used to store elasticsearch data.

Default: UseDevelopmentStorage=true

ShareName

This config value will be used to name the azure file service share. https://myaccount.file.core.windows.net/[ShareName]. This share will be used as a persistent store for elasticsearch data and snapshots.

Default: elasticdata

ShareDrive

This is the drive letter assigned to the azure file service share on the role instance

Default: P:

EndpointName

This is the name of the endpoint Elasticsearch nodes in the cluster will use to communicate with each other

Default: elasticsearch

UseElasticLocalDataFolder

If this option is enabled, the service will store data on the role instance rather than on the Azure Files share. This might be handy when you need the maximum I/O speed and your data is easily replaceable. This is the only available option when using the emulator.

Default: true

ElasticsearchPluginContainer

This is the name of a container accessible through the storage account in the StorageConnection setting which contains plugin zip files you intend to install in your cluster.

Default: elasticsearchplugins

NamedPlugins

This is a pipe separated list of plugins. They will be installed using the built in plugin installer.

/bin/elasticsearch-plugin.bat install plugin-name

Sample: analysis-stempel|analysis-phonetic|analysis-smartcn
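As a rough illustration of how a pipe separated value like the sample above could be turned into one installer call per plugin, here is a minimal sketch. The class and method names (NamedPluginList, Parse, InstallArguments) are hypothetical and are not taken from the project's source; only the `install plugin-name` argument shape comes from the command shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the project: parses a NamedPlugins value.
public static class NamedPluginList
{
    // Splits a pipe separated setting value into plugin names,
    // trimming whitespace and dropping empty entries.
    public static IReadOnlyList<string> Parse(string settingValue)
    {
        if (string.IsNullOrWhiteSpace(settingValue))
        {
            return new List<string>();
        }

        return settingValue
            .Split('|')
            .Select(name => name.Trim())
            .Where(name => name.Length > 0)
            .ToList();
    }

    // Builds the argument string for one elasticsearch-plugin.bat call.
    public static string InstallArguments(string pluginName)
    {
        return "install " + pluginName;
    }
}
```

Each returned name would then be passed to the plugin installer in turn; how the project actually launches the installer is not shown here.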

Usage

Once the package is installed and all configuration values are correct, you can go ahead and initialize the service in WebRole.cs or WorkerRole.cs

    using System.Diagnostics;
    using System.Net;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using Microsoft.WindowsAzure.Storage;
    // Also import the namespace that provides ElasticsearchService and
    // ElasticsearchServiceSettings from the installed package.

    public class WorkerRole : RoleEntryPoint
    {
        // Storage account read from the StorageConnection setting
        private CloudStorageAccount storage = CloudStorageAccount.Parse(RoleEnvironment.GetConfigurationSettingValue("StorageConnection"));
        private ElasticsearchService service;

        public override void Run()
        {
            // Blocks for the lifetime of the role instance
            service.RunAndBlock();
        }

        public override bool OnStart()
        {
            // Not sure what this should be. Hopefully storage over smb doesn't open a million connections
            ServicePointManager.DefaultConnectionLimit = 12;

            var settings = ElasticsearchServiceSettings.FromStorage(storage);
            service = ElasticsearchService.FromSettings(settings);
            bool result = base.OnStart();

            Trace.TraceInformation("ElasticsearchRole has been started");

            return result;
        }

        public override void OnStop()
        {
            Trace.TraceInformation("ElasticsearchRole is stopping");

            service.OnStop();

            base.OnStop();

            Trace.TraceInformation("ElasticsearchRole has stopped");
        }
    }

Running in the Emulator

If you find things a bit sluggish on startup in the emulator don't be alarmed. The code is written to use as much of the available resources as possible to minimize startup time. As a result, the initialization steps run concurrently using async tasks. After deployment to Azure, it will not re-run the initialization steps after the initial config. Therefore, subsequent role instance recycles will be much quicker.

NuGet Package Source

The source of the NuGet package used to install this project on Web and Worker Roles is the Package.NuGet project located in this repository.

Alternate Configurations

There are different options for configuring your cluster and other services on top of it. Here are a few ideas:

Worker Roles only with public communication using Shield or private communication over a virtual network
Worker Roles for elasticsearch and separate Public facing Web Roles which use elasticsearch as a backend service
Public facing Web Roles which run both IIS and Elasticsearch

I hope this project is useful to you. If you have a quick question and don't want to create an issue, you can reach me on Twitter @garvincasimir.

elasticsearch-azure-paas's Issues

Store additional plugins in storage

It would be nice if we moved marvel and any additional plugins to a configurable storage container. The only plugin required by this solution is the discovery plugin so we can make an exception and keep it in the solution. We don't want to be dictating what plugins should be used. We also don't want to make it difficult for people to add their own plugins for their purposes.

JAVA_HOME not set

I added some more logging and have found the reason why Elasticsearch does not start on new instances during scaling.
When Elasticsearch is started, it complains about a missing JAVA_HOME variable. This variable has been set, but is somehow not picked up.
After restarting the worker role, everything works as expected.

By starting the Elastic Search process like this:

    _process = new Process();
    _process.StartInfo = new ProcessStartInfo
    {
        FileName = startupScript,
        UseShellExecute = false,
        RedirectStandardOutput = true
    };
    _process.Start();
    _process.BeginOutputReadLine();

and then capturing the standard output like this:

    _process.OutputDataReceived +=
        delegate(object sender, DataReceivedEventArgs args)
        {
            var output = args.Data;
            if (output == "JAVA_HOME environment variable must be set!")
            {
                Trace.TraceInformation("JAVA_HOME not set restarting");
                throw new Exception("JAVA_HOME not set restarting");
            }
        };

It is possible to listen for the text "JAVA_HOME environment variable must be set!", which the elasticsearch.bat file outputs, and restart the worker role by throwing an exception. Not very elegant, but it works as a temporary solution.

Elasticsearch command line plugin install

This project currently supports installing plugins by placing the zip files in a special storage container. However, using a local zip file is only one of the ways an Elasticsearch plugin can be installed. It would be nice if the project supported the following pattern

plugin --install <org>/<user/component>/<version>

I am not sure how this would work but we would need to store a list of these org/user/component/version combinations somewhere then feed it into the plugin installer for processing. It would be nice if the service configuration supported arbitrary lists of strings in a setting.

<ListSettings>
    <ListSetting name="ElasticsearchPlugins">
        <Setting>elasticsearch/marvel</Setting>
        <Setting>elasticsearch/shield</Setting>
    </ListSetting>
</ListSettings>

Can this be run in web jobs?

I'm pretty unfamiliar with how this actually works. I have a really low traffic site but I need ES and was hoping I could just add it via a web job.

es_heap_size

We need a way to set the es_heap_size. I think this value should be set automatically based on the available memory on the worker role. The recommended setting is 50% of the available memory. It should also be possible to override this setting through a config value.

Elasticsearch recommends setting this value using the ES_HEAP_SIZE environment variable, but I think a better approach would be to set the -Xms and -Xmx parameters when starting Elasticsearch, to minimize dependencies.
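To make the proposal concrete, here is a minimal sketch of the sizing rule: half of the instance's memory unless an override is supplied, expressed as -Xms and -Xmx flags. The HeapSizing class and BuildHeapArguments method are hypothetical names, the memory figure is passed in by the caller rather than detected, and how the flags reach the JVM depends on the Elasticsearch version and is not shown.

```csharp
using System;

// Hypothetical helper, not part of the project: derives JVM heap flags.
public static class HeapSizing
{
    // Returns "-Xms<N>m -Xmx<N>m" where N is the override in megabytes if given,
    // otherwise half of the supplied physical memory (rounded down to whole MB).
    public static string BuildHeapArguments(long totalPhysicalMemoryBytes, int? overrideMegabytes = null)
    {
        if (totalPhysicalMemoryBytes <= 0)
        {
            throw new ArgumentOutOfRangeException("totalPhysicalMemoryBytes");
        }

        long heapMegabytes = overrideMegabytes.HasValue
            ? overrideMegabytes.Value
            : totalPhysicalMemoryBytes / 2 / (1024L * 1024L);

        // Guard against a zero-sized heap on very small inputs.
        if (heapMegabytes < 1)
        {
            heapMegabytes = 1;
        }

        return string.Format("-Xms{0}m -Xmx{0}m", heapMegabytes);
    }
}
```

Setting both flags to the same value keeps the heap from resizing at runtime; the override parameter corresponds to the config value suggested above.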

Bootstrap Bulk Data

It would be nice if there was a simple process for adding bulk data to the cluster when it is first created. A good sample dataset can be found at courtlistener.com. The bulk processing PowerShell script I created for processing the data is a good place to start.

Elasticsearch binding 'problem' when deployed to Azure

This is not really an issue with the project, but something I learned that might save others time. Elasticsearch 2.4.1 (don't know about other versions) will bind to the loopback address by default if you don't specify an address in the configuration file. This is OK when running in the Emulator, but can be problematic when you deploy to Azure because even if you define the correct endpoint in the role settings, you won't be able to connect to Elasticsearch. Binding to the address '0' will allow the Elasticsearch instance to accept requests on the endpoint you have defined.
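For reference, a binding like the one described would go in Config/elasticsearch.yml. The fragment below is a sketch that assumes the `network.host` setting and uses the explicit wildcard address 0.0.0.0 in place of the shorthand '0' mentioned above; binding to all interfaces exposes the node on every network the instance is attached to, so check that this suits your deployment.

```yaml
# Config/elasticsearch.yml (sketch)
# Bind to all interfaces instead of the loopback default
network.host: 0.0.0.0
```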

RoleRoot missing separator character when deployed to Azure

When deployed to Azure, the RoleRoot points to a drive, not a folder as in the emulator.
This leads to an invalid path when constructing PackagePluginPath. The path will look like this:
E:approot\plugins. The code should append a valid separator character to the drive, like:
roleRoot = roleRoot + @"\";
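The fix described here can be sketched as a small guard that appends the separator only when it is missing, so the same code works for a bare drive on Azure and a folder path in the emulator. RolePaths, EnsureTrailingSeparator and PluginPath are hypothetical names used for illustration; only the approot\plugins layout comes from the report above.

```csharp
using System;

// Hypothetical helper, not part of the project: normalizes the role root path.
public static class RolePaths
{
    // Appends a backslash unless the root already ends with one,
    // so "E:" becomes "E:\" and "C:\role\" is left unchanged.
    public static string EnsureTrailingSeparator(string roleRoot)
    {
        if (string.IsNullOrEmpty(roleRoot))
        {
            throw new ArgumentException("roleRoot must not be empty", "roleRoot");
        }

        return roleRoot.EndsWith("\\", StringComparison.Ordinal)
            ? roleRoot
            : roleRoot + "\\";
    }

    // Builds the plugin folder path under the role root.
    public static string PluginPath(string roleRoot)
    {
        return EnsureTrailingSeparator(roleRoot) + @"approot\plugins";
    }
}
```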

Azure files share

I have been testing the performance of Elasticsearch using the Azure Files share as persistent storage for Elasticsearch indexes, and the performance is not good.
I ran two tests, one using local storage and one using the file share; both tests indexed 13 million documents. When using local storage, indexing took about 45 minutes; using the share, it took over 2 hours.

There should be a config value specifying whether to use local storage or Azure Files.
For systems using an external data store to populate the index, there may not be any need for a persistent disk for the index.

Error using Azure worker role temp folder for download

In softwareManager.cs line 40 download is called using:

_artifact.Download(_binaryArchive);

This causes the download method to use the Azure Worker Role temp folder. This folder has a size limit of 100 MB, which is too small to download both Elasticsearch and Java.
Changing the call to download to this fixes the problem:

 _artifact.Download(_binaryArchive, false);

Implement internal load balancing

Configure worker roles (Elasticsearch Nodes) to have a single internally load balanced endpoint on the default Elasticsearch port.

Elastic Search shutdown

I have been trying to debug this part of the code trying to shutdown Elastic Search:

public virtual void Stop()
{
   if (_process != null)
        {
            return;
        }

        if (_process.HasExited)
        {
            return;
        }

        _process.CloseMainWindow();
}

To me it seems like _process is never null, which means that the method always returns early and _process.CloseMainWindow(); is never called.

Should the correct implementation be this?

    public virtual void Stop()
    {
        if (_process == null)
        {
            return;
        }

        if (_process.HasExited)
        {
            return;
        }

        _process.CloseMainWindow();

    }

[feature request] file based discovery

With file based discovery, I think it will be easier to upgrade to the latest ES by getting rid of the azure-runtime Java plugin
https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-file.html
