
Uno


Uno automates setting up Apache Accumulo or Apache Fluo (and their dependencies) on a single machine.

Uno makes it easy for a developer to experiment with Accumulo or Fluo in a realistic environment. Uno is designed for developers who need to frequently upgrade and test their code and do not care about preserving data. While Uno makes it easy to set up a dev stack running Fluo or Accumulo, it also makes it easy to clear your data and set up your dev stack again. To avoid inadvertent data loss, Uno should not be used in production.

Check out Muchos for setting up Accumulo or Fluo on multiple machines.

Requirements

Uno requires the following software to be installed on your machine.

  • Java - JDK 11 is required for running Fluo (with Accumulo 2.1).
  • wget - Needed for the fetch command to download tarballs.
  • Maven - Only needed if the fetch command builds a tarball from a local repo.

You should also be able to ssh to localhost without a passphrase; a minimal key setup is sketched below.
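If passphrase-less ssh is not already configured, here is a minimal sketch, assuming OpenSSH with an ed25519 key (adjust for your environment):

ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519    # create a key with an empty passphrase
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost true                                  # should succeed without a prompt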

Quickstart

The following commands will get you up and running with an Accumulo instance if you have satisfied the requirements mentioned above. Replace accumulo with fluo to set up a Fluo instance.

git clone https://github.com/apache/fluo-uno.git
cd fluo-uno
./bin/uno fetch accumulo            # Fetches binary tarballs of Accumulo and its dependencies
./bin/uno setup accumulo            # Sets up Accumulo and its dependencies (Hadoop & ZooKeeper)
source <(./bin/uno env)             # Bash-specific command that sets up current shell

Accumulo is now ready to use. Verify your installation by checking the Accumulo Monitor and Hadoop NameNode status pages.

Note that the Accumulo shell can be accessed in one of two ways. The easiest method is to use the uno command.

./bin/uno ashell

You can also access the shell directly. The Accumulo installation is initialized using the username root and password secret (set in the uno.conf file). Therefore, the shell can be accessed directly using:

accumulo shell -u root -p secret

Starting with Accumulo 2.1, a JShell session can also be used.

./bin/uno jshell

When you're all done testing out Accumulo you can clean up:

./bin/uno wipe

For a more complete understanding of Uno, please continue reading.

Installation

First, clone the Uno repo on a local disk with enough space to run Hadoop, Accumulo, etc:

git clone https://github.com/apache/fluo-uno.git

The uno command uses conf/uno.conf for its default configuration which should be sufficient for most users.

Optionally, you can customize this configuration by modifying the uno.conf file for your environment. Inside this script the variable UNO_HOME defaults to the root of the Uno repository.

vim conf/uno.conf

If you would like to avoid modifying uno.conf because it is managed by git, there is a second way to configure Uno: if conf/uno-local.conf exists, it is used instead of uno.conf. After pulling the latest changes to Uno, a tool like meld can be used to compare uno.conf and uno-local.conf.

cp conf/uno.conf conf/uno-local.conf
vim conf/uno-local.conf

All commands are run using the uno script in bin/. Uno has a command that helps you configure your shell so that you can run commands from any directory and easily set common environment variables for Uno, Hadoop, ZooKeeper, Fluo, and Spark. Run the following command to print this shell configuration. You can add --paths or --vars to limit the output to PATH or environment variable configuration:

uno env

You can either copy and paste this output into your shell or add the following (with a correct path) to your ~/.bashrc to automatically configure every new shell.

source <(/path/to/uno/bin/uno env)

With the uno script set up, you can now use it to download, configure, and run Fluo's dependencies.

Fetch command

The uno fetch <component> command fetches the tarballs of a component and its dependencies for later use by the setup command. By default, the fetch command downloads tarballs but you can configure it to build Fluo or Accumulo from a local git repo by setting FLUO_REPO or ACCUMULO_REPO in uno.conf. Run uno fetch to see a list of possible components.
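For example, to have uno fetch accumulo build from a local checkout instead of downloading a release, you might add something like this to your configuration (the path is illustrative):

# In conf/uno.conf or conf/uno-local.conf
ACCUMULO_REPO=/path/to/your/accumulo/clone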

After the fetch command is run for the first time, it only needs to be run again when you want to upgrade components and need to download or build the latest version.

Setup command

The uno setup command combines uno install and uno run into one command. It will install the downloaded tarballs to the directory set by $INSTALL in your uno.conf and run your local development cluster. The command can be run in several different ways:

  1. Sets up Apache Accumulo and its dependencies (Hadoop and ZooKeeper). This starts all processes and will wipe Accumulo/Hadoop if this command was run previously.

     uno setup accumulo
    
  2. Sets up Apache Fluo along with Accumulo (and its dependencies). This command will wipe your cluster. While Fluo is set up, it does not start any Fluo applications.

     uno setup fluo
    
  3. For Fluo & Accumulo, you can set up the software again without wiping or setting up their underlying dependencies. You can upgrade Accumulo or Fluo by running uno fetch before running this command.

     uno setup fluo --no-deps
     uno setup accumulo --no-deps
    

You can confirm that everything started by checking the Accumulo Monitor and Hadoop NameNode status pages.

If you run some tests and then want a fresh cluster, run the setup command again; it will kill all running processes, clear any data and logs, and restart your cluster.

Plugins

Uno is focused on running Accumulo & Fluo. Optional features and services can be run using plugins. These plugins can optionally execute after the install or run commands. They are configured by setting POST_INSTALL_PLUGINS and POST_RUN_PLUGINS in uno.conf (see the example after the plugin lists below).

Post install plugins

These plugins can optionally execute after the install command for Accumulo and Fluo:

  • accumulo-encryption - Turns on Accumulo encryption
  • influx-metrics - Install and run metrics service using InfluxDB & Grafana

Post run plugins

These plugins can optionally execute after the run command for Accumulo and Fluo:

  • spark - Install Apache Spark and start Spark's History server
  • accumulo-proxy - Starts an Accumulo Proxy which enables Accumulo clients in other languages.
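
A minimal sketch of the plugin configuration, assuming the plugin names listed above and that each variable takes a space-separated list (check the comments in uno.conf for the exact syntax):

# In conf/uno.conf or conf/uno-local.conf
POST_INSTALL_PLUGINS="influx-metrics"
POST_RUN_PLUGINS="spark accumulo-proxy"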

Wipe command

The uno wipe command will kill all running processes for your local development cluster and clear all data and logs. It does not delete the binary tarballs downloaded by the fetch command, so you can run setup again directly in the future. If you need to reclaim the space used by the binary tarballs, you'll have to delete them manually.
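The downloaded tarballs live in the downloads/ directory under the Uno checkout (per the default configuration), so reclaiming the space might look like this (path is illustrative; check your uno.conf):

rm -rf /path/to/fluo-uno/downloads/*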

Running Apache Fluo applications

Before running an Apache Fluo application, it is recommended that you configure your shell using uno env. If this is done, many Fluo example applications (such as Webindex and Phrasecount) can be run by simply cloning their repo and executing their start scripts (which will use environment variables set in your shell by uno env).
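A hypothetical session (the repo URL and script name are placeholders; each application's README documents its actual start script):

source <(/path/to/uno/bin/uno env)   # sets PATH and environment variables for Hadoop, ZooKeeper, Fluo, etc.
git clone <example-app-repo>
cd <example-app>
./bin/run.sh                         # whatever start script the app provides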

If you want to create your own Fluo application, you should mimic the scripts of example Fluo applications or follow the instructions starting at the Configure a Fluo application section of the Fluo install instructions. These instructions will guide you through the process of configuring, initializing, and starting your application.


fluo-uno's Issues

Support Mesos development

Support using Mesos with fluo-dev.

Apache Mesos has source tarball releases which must be built. Rather than scripting this build, fluo-dev should instruct users to install Mesos on their own using yum, apt, etc. Fluo-dev could handle the configuration and management of Mesos.

Look for uno.conf in persistent location

I would like to do things like git clean -fdx without accidentally blowing away the git-ignored env.sh file. It would be preferable if the scripts looked in a directory like $HOME/.fluo-dev/env.sh first, and only used conf/env.sh if the first location didn't exist.

FLUO_DEV?

Is FLUO_DEV meant to point to the location of this Github repo, or does it have a different meaning?

In fetch.sh and all the setup-*.sh files, I replaced

source "$FLUO_DEV"/bin/impl/util.sh

with

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/../.."
source "$DIR"/bin/impl/util.sh

which gets the location of the Github repo, as long as the scripts are not moved or symlinked. I also changed FLUO_DEV to DIR when referenced. Then the scripts worked.

Make it easier for user to deploy Fluo

Currently fluo-dev deploy builds a tarball from a repo. It would be nice to have two other ways to deploy Fluo:

  1. Download latest release tarball
  2. Specify path to a Fluo distribution tarball

Create download and install commands

It would be nice if fluo-dev could automate the download and install of dependencies. The download command could also download hashes and signatures for users to verify releases.
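A hypothetical verification flow such a command could automate (mirror URL and version are placeholders):

wget https://<mirror>/accumulo/X.Y.Z/accumulo-X.Y.Z-bin.tar.gz
wget https://<mirror>/accumulo/X.Y.Z/accumulo-X.Y.Z-bin.tar.gz.asc
wget https://<mirror>/accumulo/X.Y.Z/accumulo-X.Y.Z-bin.tar.gz.sha512
gpg --verify accumulo-X.Y.Z-bin.tar.gz.asc accumulo-X.Y.Z-bin.tar.gz
sha512sum -c accumulo-X.Y.Z-bin.tar.gz.sha512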

Updated README.md

uno setup all is no longer a usable command and should probably be taken out of the README.md

Create 'uno setup metrics' command

Currently, the metrics service (Grafana+InfluxDB) is set up by 'uno setup fluo'. While this should continue, it would be nice if the metrics service could be set up on its own using 'uno setup metrics'.

Add monitoring tool to fluo-dev

Fluo produces metrics using the Dropwizard Metrics library. It would be nice if fluo-dev could set up and configure a monitoring tool to display these metrics. Setting up the monitoring should be optional/configurable in the configuration for fluo-dev.

Document minimum ram tested with

It would be nice to document in the README the minimum RAM fluo-dev has been tested with. The minimum I have run fluo-dev on is a laptop with 8G. Not really sure it would work with less.

Configure log dirs

Configure HADOOP_LOG_DIR, YARN_LOG_DIR, ACCUMULO_LOG_DIR, and ZOO_LOG_DIR so they aren't directly underneath their unpacked tarballs.

They should also be under the same directory tree, so they can be easily reset. This also addresses an annoyance with ZooKeeper where the default log directory is "." (the current directory).
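A hypothetical layout, assuming a single log root variable (names are illustrative):

# In env.sh: keep all component logs under one tree so they can be reset together
LOGS_DIR=$FLUO_DEV/logs
export HADOOP_LOG_DIR=$LOGS_DIR/hadoop
export YARN_LOG_DIR=$LOGS_DIR/yarn
export ACCUMULO_LOG_DIR=$LOGS_DIR/accumulo
export ZOO_LOG_DIR=$LOGS_DIR/zookeeper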

Minimize output during setup

When running setup, a lot of stuff is printed. It would be nice if it printed less, like the following:

uno setup fluo
Setting up Zookeeper at <ZK dir> logging setup output to logs/setup/zookeeper-setup.log
Setting up Hadoop at <Hadoop dir> logging setup output to logs/setup/hadoop-setup.log
Setting up Accumulo at <Accumulo dir> logging setup output to logs/setup/accumulo-setup.log
Setting up Fluo at <Fluo dir> logging setup output to logs/setup/fluo-setup.log
Setup complete.

Improve how observers are configured

Currently, the command fluo-dev configure fluo generates a fluo.properties file. When it does, it just hard codes the stress test observers. It would be better if the generated fluo.properties were concatenated with an observer.props file specified by the user. An observer.props.example could be configured for the stress tests.
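A minimal sketch of the proposed concatenation (paths are illustrative):

# Append user-specified observers to the generated configuration
cat "$FLUO_DEV"/conf/observer.props >> "$FLUO_HOME"/conf/fluo.properties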

Include command to set up environment in fluo-dev

fluo-dev paths is great for setting up the paths for the various utilities, using shell code like:

PATH=$(/path/to/fluo-dev paths):${PATH}

...but there are a couple of other environment variables that would be useful to have set, such as FLUO_HOME, and maybe even some Hadoop or Accumulo environment variables. It would be useful to have something like fluo-dev env that operates similarly to docker-machine env and spits out the necessary shell code to establish all of the required environment settings (including PATH, as above). It might look something like this:

$> docker-machine env big
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="/Users/andrewfarris/.docker/machine/machines/big"
export DOCKER_MACHINE_NAME="big"
# Run this command to configure your shell:
# eval "$(docker-machine env big)"

Make scripts work with Mac OS X

Some commands in the fluo-dev script are incompatible with Mac OS X. The script should recognize the user's OS and pick the right command to use.

Allow users to specify where data is stored

Fluo-dev currently configures Hadoop and Zookeeper to store their data in hard-coded directories in /tmp. It would be better if fluo-dev used a variable in env.sh to let users specify where data is stored. The variable could be called DATA_DIR and could default to $FLUO_DEV/data. With this change, several configuration files will need to be modified using sed to store data based off the DATA_DIR variable.
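A minimal sketch of the proposed variable and rewrite (the config file and /tmp path are placeholders):

# In env.sh
DATA_DIR=${DATA_DIR:-$FLUO_DEV/data}
# Point a generated config at DATA_DIR instead of the hard-coded /tmp location
sed -i "s#/tmp#${DATA_DIR}#g" <config-file>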

Reduce number of fluo-dev commands

Fluo-dev currently has more commands than needed. The commands install, configure, and reset could probably be removed and combined into a single command called setup. This would make the tool easier to use.

Document source for public keys

I am experimenting with fluo-dev in a Docker container. After running fluo-dev download, I see the messages:

Verifying the authenticity of tarballs using gpg and downloaded signatures:

Verifying accumulo-1.7.1-bin.tar.gz
gpg: directory `/root/.gnupg' created
gpg: new configuration file `/root/.gnupg/gpg.conf' created
gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/root/.gnupg/pubring.gpg' created
gpg: Signature made Mon 22 Feb 2016 09:32:43 PM UTC using RSA key ID 00B6899D
gpg: Can't check signature: No public key

verifying hadoop-2.6.3.tar.gz
gpg: Signature made Fri 18 Dec 2015 02:22:42 AM UTC using RSA key ID 526633F3
gpg: Can't check signature: No public key

Verifying zookeeper-3.4.6.tar.gz
gpg: Signature made Thu 20 Feb 2014 11:09:58 AM UTC using RSA key ID D2C80E32
gpg: Can't check signature: No public key

Verifying spark-1.5.1-bin-hadoop2.6.tgz
gpg: Signature made Thu 24 Sep 2015 06:13:27 AM UTC using RSA key ID FC8ED089
gpg: Can't check signature: No public key

If we are going to the trouble of doing a gpg verify, perhaps fluo-dev should fetch and install the public keys into the appropriate public keyring?
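Importing the project's KEYS file before verifying would address this; a hypothetical sketch (the URL pattern follows the usual Apache convention, so verify it per project):

wget https://downloads.apache.org/accumulo/KEYS
gpg --import KEYS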

Mysterious File appears

Whenever I do a "uno fetch accumulo" from my Uno directory, Uno will build Accumulo properly but a strange file (a.out) will show up in my Accumulo source directory under the assemble module.
$ git status
On branch 1.8
Untracked files:
(use "git add ..." to include in what will be committed)

assemble/a.out

nothing added to commit but untracked files present (use "git add" to track)

$ file assemble/a.out
assemble/a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), not stripped

$ strings assemble/a.out
/lib/ld64.so.1
libhadoop.so.1.0.0
_edata
__bss_start
_end
.symtab
.strtab
.shstrtab
.interp
.hash
.dynsym
.dynstr
.eh_frame
.dynamic
_DYNAMIC
GLOBAL_OFFSET_TABLE
_edata
_end
_start
__bss_start

DNS check is problematic in certain environments

The call to host in setup.sh added in 666eb3c doesn't work if we're using a hostname that's in /etc/hosts but not in DNS. I ran into a problem when running in Docker because hostnames are an arbitrary hexadecimal string that gets an entry in /etc/hosts, but naturally has no corresponding DNS entry.

I see the following error:

[root@0250a97efa16 /]# fluo-dev setup
fluo-dev is using custom configuration at /root/fluo-dev/conf/env.sh
ERROR - Your machine failed to do a DNS lookup of your IP given your hostname using 'host 0250a97efa16'.  This is likely a DNS issue
that can cause fluo-dev services (such as Hadoop) to not start up.  You should confirm that /etc/resolv.conf is correct.

Removing the check from fluo-dev/bin/impl/setup.sh allows setup to complete seemingly without issue in this context. Unfortunately there doesn't seem to be a good way to perform this check that is cross-platform. getent hosts works on Linux, but prefers IPv6 over IPv4, so it may return unexpected results.

A Stack Overflow post suggests something like the following, which may be the most appropriate in this case:

hostname=$(hostname)
if grep -q "${hostname}" /etc/hosts; then
   : # OK, hostname is in /etc/hosts
elif host "${hostname}" > /dev/null 2>&1; then
   : # OK, hostname resolves via DNS
else
   echo "ERROR: ${hostname} not found in /etc/hosts or DNS" >&2
fi

404 Not Found

I did a straight https clone of Uno master onto an AWS instance and ran using all the defaults. It was unable to find Accumulo with the selected mirror. Zookeeper and Hadoop used the same mirror and had no problem downloading the tars.

--2017-03-02 11:13:08-- http://mirrors.advancedhosters.com/apache//accumulo/1.8.0/accumulo-1.8.0-bin.tar.gz
Resolving mirrors.advancedhosters.com (mirrors.advancedhosters.com)... 46.229.166.133, 2a02:b48:6:1::2
Connecting to mirrors.advancedhosters.com (mirrors.advancedhosters.com)|46.229.166.133|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-03-02 11:13:08 ERROR 404: Not Found.

The tarball accumulo-1.8.0-bin.tar.gz does not exist in downloads/
