
Mantl

Overview


Mantl is a modern, batteries-included platform for rapidly deploying globally distributed services.

Features

Core Components

  • Consul for service discovery
  • Vault for managing secrets
  • Mesos cluster manager for efficient resource isolation and sharing across distributed services
  • Marathon for cluster management of long running containerized services
  • Kubernetes for managing, organizing, and scheduling containers
  • Terraform deployment to multiple cloud providers
  • Docker container runtime
  • Traefik for proxying external traffic
  • mesos-consul for populating Consul service discovery with Mesos tasks
  • Mantl API for easily installing supported Mesos frameworks on Mantl
  • Mantl UI, a beautiful administrative interface to Mantl

Addons

  • ELK Stack for log collection and analysis
  • GlusterFS for container volume storage
  • Docker Swarm for clustering Docker daemons between networked hosts
  • etcd, distributed key-value store for Calico
  • Calico, a new kind of virtual network
  • collectd for metrics collection
  • Chronos a distributed task scheduler
  • Kong for managing APIs

See the addons/ directory for the most up-to-date information.

Goals

  • Security
  • High availability
  • Rapid immutable deployment (with Terraform + Packer)

Architecture

The base platform contains control nodes that manage the cluster and any number of agent nodes. Containers automatically register themselves into DNS so that other services can locate them.
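For example, once a service is registered, any node can resolve it through Consul's DNS interface (a quick check, assuming the default dnsmasq-to-Consul forwarding; the service name is illustrative):

dig marathon.service.consul +short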

[architecture diagram: mantl-diagram]

Control Nodes

The control nodes manage a single datacenter. Each control node runs Consul for service discovery; Mesos and Kubernetes leaders for resource scheduling; and Mesos frameworks like Marathon.

The Consul Ansible role will automatically bootstrap and join multiple Consul nodes. The Mesos role will provision highly available Mesos and ZooKeeper environments when more than one node is provisioned.

Agent Nodes

Agent nodes launch containers and other Mesos- or Kubernetes-based workloads.

Edge Nodes

Edge nodes are responsible for proxying external traffic into services running in the cluster.

Getting Started

All development is done on the master branch. Tested, stable versions are identified via git tags. To get started, you can clone or fork this repo:

git clone https://github.com/mantl/mantl.git

To use a stable version, use git tag to list the stable versions, then check out the one you want:

git tag
0.1.0
0.2.0
...
1.2.0


git checkout 1.2.0

A Vagrantfile is provided that provisions everything on a few VMs. To run, first ensure that your system has at least 2GB of RAM free, then just:

vagrant up

Note:

  • There is no support for Windows at this time, but support is planned.
  • Use the latest version of Vagrant for best results; version 1.8 is required.
  • There is no support for the VMware Fusion Vagrant provider, so the provider is set to VirtualBox in the Vagrantfile.

Software Requirements

The only requirements for running Mantl are working installations of Terraform and Ansible (or Vagrant, if you're deploying to VMs). See the "Development" section for requirements for developing Mantl.

Deploying on multiple servers

Please refer to the Getting Started Guide, which covers cloud deployments.

Documentation

All documentation is located at http://docs.mantl.io.

To build the documentation locally, run:

sudo pip install -r requirements.txt
cd docs
make html

Roadmap

Mesos Frameworks

  • Marathon
  • Kafka
  • Riak
  • Cassandra
  • Elasticsearch
  • HDFS
  • Spark
  • Storm
  • Chronos
  • MemSQL

Note: The most up-to-date list of Mesos frameworks that are known to work with Mantl is always in the mantl-universe repo.

Security

  • Manage Linux user accounts
  • Authentication and authorization for Consul
  • Authentication and authorization for Mesos
  • Authentication and authorization for Marathon
  • Application load balancer (based on Traefik)
  • Application dynamic firewalls (using consul template)

Operations

  • Logging (with the ELK stack)
  • Metrics (with the collectd addon)
  • In-service upgrade with rollback
  • Autoscaling of worker nodes
  • Self maintaining system (log rotation, etc)
  • Self healing system (automatic failed instance replacement, etc)

Supported Platforms

Community Supported Platforms

Please see milestones for more details on the roadmap.

Development

If you're interested in contributing to the project, install Terraform and the Python modules listed in requirements.txt and follow the Getting Started instructions. To build the docs, enter the docs directory and run make html. The docs will be output to _build/html.

Good issues to start with are marked with the low hanging fruit tag.

To keep your fork up to date:

1. Clone your fork:

git clone [email protected]:YOUR-USERNAME/mantl.git

2. Add a remote for the original repository to your forked repository:

cd into/cloned/fork-repo
git remote add upstream git://github.com/mantl/mantl.git
git fetch upstream

3. Update your fork from the original repo to keep up with its changes:

git pull upstream master

Getting Support

If you encounter any issues, please open a GitHub issue against the project. We review issues daily.

We also have a Gitter chat room. Drop by and ask any questions you might have. We'd be happy to walk you through your first deployment.

Cisco Intercloud Services provides support for OpenStack based deployments of Mantl.

License

Copyright © 2015 Cisco Systems, Inc.

Licensed under the Apache License, Version 2.0 (the "License").

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Issues

Rename hosts in inventory files

Host naming is very limited: we can only have 5 hosts in each DC.

Refactor host names so we can provision a large number of hosts in each DC without conflicts.

Change Ansible vars to use OpenStack environment vars

These values are currently set in inventory/group_vars

os_auth_url: xxx
os_tenant_name: xxxxx
os_tenant_id: xxx
os_net_id: xxx

The downloadable OpenStack rc file sets values for the environment variables OS_AUTH_URL, OS_TENANT_ID, and OS_TENANT_NAME.
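A minimal sketch of the proposed direction: source the rc file and let the playbooks read the environment (e.g., via Ansible's lookup('env', ...)) instead of duplicating the values in group_vars. The file path here is hypothetical:

# exports OS_AUTH_URL, OS_TENANT_ID, OS_TENANT_NAME, etc.
source ~/Downloads/openrc.sh
echo "$OS_AUTH_URL" "$OS_TENANT_ID" "$OS_TENANT_NAME"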

Pre-Release documentation tasks

Before the 0.1 release is tagged, the following needs to happen in the docs:

  • release in docs/conf.py needs to be set to "0.1" instead of "0.1pre"
  • after merging development into master, there will be a CHANGELOG.md and a CHANGELOG.rst. The Markdown version should be removed.
  • improve README.md to give general introduction and link to full documentation. (probably move it to RST too)
  • diagrams of architecture

After the repo is published:

Standardize on variable names

I shortened some of the variable names in the consul and registrator playbooks as a matter of personal preference.

We should decide as a team whether we want the short variable names or the longer, more descriptive ones, and update roles/playbooks so they are uniform across the project.

consul_image: progrium/consul
consul_image_tag: latest

vs.

zookeeper_docker_image: asteris/zookeeper
zookeeper_docker_tag: latest

Standardize on Ansible task names

Define standard Ansible task names and update roles / playbooks to the standard:

A possible standard could be:

  • verb noun
  • all lower case
  • state what is happening vs. how it is happening

Support for multi-dc deployment

Support for multi-DC deployment is a differentiator.

The Consul role supports multi-DC already.

We need to test/update the following roles to support multi-DC:

  • ZooKeeper
  • Mesos
  • Marathon

Investigate removing dnsmasq and using consul dns directly

Instead of going through 127.0.0.1:53 -> consul:8600, let's have consul listen on port 53 and have resolv.conf point to the consul servers (a sketch follows the task list):

Tasks:

  • update consul role to listen on host port 53 vs. 8600
  • update NetworkManager configuration to use consul servers before dhcp provided servers (/etc/NetworkManager/dhclient.d?)
  • remove dns settings from docker role (file: /etc/default/docker-network)
  • remove dnsmasq from zookeeper systemd dependency roles/zookeeper/templates/zookeeper-service.j2
  • remove dnsmasq role from site.yml
  • remove dnsmasq role from vagrant.yml
  • remove roles/dnsmasq
  • test: Build
  • test: Reboot
  • test: Vagrant
  • test: Single-DC
  • test: Multi-DC
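A minimal sketch of the first task, using Consul's standard ports configuration (the config file path is hypothetical; binding to port 53 requires root or CAP_NET_BIND_SERVICE):

cat > /etc/consul/dns-port.json <<'EOF'
{ "ports": { "dns": 53 } }
EOF
# after a restart, queries should work without dnsmasq:
dig @127.0.0.1 -p 53 consul.service.consul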

Auth setup script

We should have a crypto setup script to automate all the key setup. Related to #45 and #46.

The usage would be something like this:

$ crypto-setup init
===== consul =====
----> encryption key
no encryption key found, generating into inventory/group_vars/all/all.yml

==== marathon ====
----> certificate
no certificate found, generate? [Y/n]
generating certificate...
password? []: *****
writing certificate to inventory/marathon.jks
writing password to inventory/group_vars/marathon/all.yml

----> user
use basic auth to connect? [Y/n]
username [admin]: *****
password: *****
writing user:pass to inventory/group_vars/marathon/all.yml

If an update is needed (password change, etc), the user can run crypto-setup update COMPONENT [COMPONENT...]

The following needs to be done in this script (from @stevendborrelli):

  • mesos leader/follower credentials
  • mesos --authenticate flag (if credential is set)
  • mesos --authenticate_slaves flag (if credential is set)
  • consul acl_master token
  • consul gossip key (see the sketch after this list)
  • marathon framework auth
  • lock down mesos credentials (0600)
  • ZK credentials
  • ZK superuser
  • single Mesos follower credential
  • CA generation
  • certificate generation for nginx
  • certificate generation for consul
  • output JSON format in setup script to pass to the extra vars argument
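Some of these steps are one-liners; for example, the Consul gossip key can be generated with the stock tooling (a sketch, assuming the consul binary is on the PATH):

# prints a random base64 key suitable for Consul's "encrypt" setting
consul keygen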

Performance + usage metrics Linux + Mesos + Marathon + Containers

High level story is that we want the entire system to be well instrumented right out of the box. This will include:

  • Linux host metrics
    • CPU states
      • Per core
      • Average for all cores
    • Memory
    • Disk
      • IO count
      • IO size
      • IO service time (histogram):
        • I'd like to leverage Brendan Gregg's great work in this space.
          • Example collectd metric:
            • disk.sda.block_size_kb.1 18
            • disk.sda.block_size_kb.2 2
            • disk.sda.block_size_kb.4 2
            • disk.sda.block_size_kb.8 2
            • disk.sda.block_size_kb.16 10
            • disk.sda.block_size_kb.32 2
            • disk.sda.block_size_kb.64 17
            • disk.sda.block_size_kb.128 0
            • disk.sda.block_size_kb.256 0
            • disk.sda.block_size_kb.512 0
            • disk.sda.block_size_kb.1024 0
            • disk.sda.read_time_ms.1 10
            • disk.sda.read_time_ms.2 2
            • disk.sda.read_time_ms.4 100
            • disk.sda.read_time_ms.8 12
            • disk.sda.read_time_ms.16 12
            • disk.sda.read_time_ms.32 12
            • disk.sda.read_time_ms.64 100
            • disk.sda.read_time_ms.128 0
            • disk.sda.read_time_ms.256 0
            • disk.sda.read_time_ms.512 12
            • disk.sda.read_time_ms.1024 0
            • disk.sda.read_time_ms.2048 0
            • disk.sda.read_time_ms.4096 12
            • disk.sda.read_time_ms.8192 0
            • disk.sda.read_time_ms.16384 0
      • Queue depth
    • Network
      • IO count
      • IO size
    • Users
      • Count
      • Logged in
    • Uptime
  • Mesos metrics
  • Marathon metrics
  • ZooKeeper
  • Consul metrics
  • Container metrics
    • Size
    • Uptime
    • CPU
    • RAM
    • Network IO
    • Disk IO
    • Disk space

This needs to be broken into a number of smaller issues.

executor_registration_timeout may need to be tuned further

The current timeout is 5 min, but it seems like some docker image pulls are taking longer than 5 min. So we may have to increase the timeout further if image pulls are consistently taking longer.
cat /etc/mesos-slave/executor_registration_timeout
5mins
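A possible remediation, as a sketch: Mesos packages read per-flag files from /etc/mesos-slave, so bumping the value is a one-liner (the 10mins value is illustrative; tune it to observed pull times):

echo '10mins' | sudo tee /etc/mesos-slave/executor_registration_timeout
sudo systemctl restart mesos-slave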

Marathon UI does not start (Vagrant)

After spawning and provisioning an instance with vagrant up, the marathon service does not listen on 8080, and in the Consul UI, the "Service 'marathon' check" is critical.

The service is running.

-bash-4.2# curl http://127.0.0.1:8080
curl: (7) Failed connect to 127.0.0.1:8080; Connection refused
-bash-4.2# service marathon status -l
Redirecting to /bin/systemctl status  -l marathon.service
marathon.service - Marathon
   Loaded: loaded (/usr/lib/systemd/system/marathon.service; enabled)
   Active: active (running) since Wed 2015-03-04 13:37:37 UTC; 4min 31s ago
 Main PID: 876 (java)
   CGroup: /system.slice/marathon.service
           ├─ 876 java -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://zookeeper.service.consul:2181/marathon --master zk://zookeeper.service.consul:2181/mesos
           ├─ 999 logger -p user.info -t marathon[876]
           └─1000 logger -p user.notice -t marathon[876]

Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,384] INFO Initiating client connection, connectString=zookeeper.service.consul:2181 sessionTimeout=10000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$3@61526468 (org.apache.zookeeper.ZooKeeper:379)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,384] WARN Unable to connect to Zookeeper, retrying... (mesosphere.marathon.Main$:45)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Connecting to Zookeeper... (mesosphere.marathon.Main$:39)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Initiating client connection, connectString=zookeeper.service.consul:2181 sessionTimeout=10000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$3@683e19c2 (org.apache.zookeeper.ZooKeeper:379)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] WARN Unable to connect to Zookeeper, retrying... (mesosphere.marathon.Main$:45)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Connecting to Zookeeper... (mesosphere.marathon.Main$:39)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Initiating client connection, connectString=zookeeper.service.consul:2181 sessionTimeout=10000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$3@450d4505 (org.apache.zookeeper.ZooKeeper:379)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] WARN Unable to connect to Zookeeper, retrying... (mesosphere.marathon.Main$:45)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Connecting to Zookeeper... (mesosphere.marathon.Main$:39)
Mar 04 13:42:08 localhost.localdomain marathon[999]: [2015-03-04 13:42:08,389] INFO Initiating client connection, connectString=zookeeper.service.consul:2181 sessionTimeout=10000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$3@6a2e6ead (org.apache.zookeeper.ZooKeeper:379)

I can telnet ZooKeeper:

-bash-4.2# telnet zookeeper.service.consul 2181
Trying 10.0.2.15...
Connected to zookeeper.service.consul.
Escape character is '^]'.

Every other service starts correctly. I tried re-provisioning and rebooting, still no luck. The only changes I made in the Vagrantfile were setting 2 CPU cores instead of 1 and assigning 4096 MB of RAM instead of 1024.
I'm on Mac OS X Yosemite with VirtualBox 4.3.24.

I'm on commit ddf2598.

PS: very promising project

Fix OpenStack hostname issues

After reboot, nodes are being named hostname.novalocal, which causes issues with both Marathon and Mesos.

We need to investigate how to get consistent naming lookups for the hosts.

TLS for Consul

This is a little tricky with the current docker-consul image we're using.

Marathon worker instance size

Currently using 2 medium instances as Marathon workers, so if I create 2 Docker instances with 1 vCPU each, there are no resources left for a third.
Should we allocate GP2-Large instances instead of Medium instances for slaves?

Ensure that consul clients are configured correctly

The current consul role is trying to configure all consul hosts as servers. We need to make sure that clients are correctly configured (a sketch of the client invocation follows the list).

Changes for agents:

  • remove -server from clients
  • no bootstrap-expect
  • remove data volume
  • clean up bootstrap flags. Only one consul server should be in bootstrap mode
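A sketch of what the client invocation could look like with the progrium/consul image the role already uses (the join address is illustrative):

# client agent: no -server, no -bootstrap-expect, no data volume
docker run -d --name consul progrium/consul -join consul.service.consul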

Remove unneeded vars for zookeeper role

Assign {{ zk_id }} automatically:
e.g., https://github.com/hpcloud-mon/ansible-zookeeper/blob/master/templates/myid.j2

Validate that the following are required and remove them from the code base if not:

zookeeper_env: dev
zookeeper_ensemble: cluster1
zookeeper_container_name: "{{ zookeeper_service }}-{{ zookeeper_env }}-{{ zookeeper_ensemble }}-zkid{{ zk_id }}"

When we are done, the inventory file should look like:

[zookeeper_servers]
host-[01:03]

vs.

[zookeeper_servers:vars]
service_tags=ensemble1

[zookeeper_servers]
host-01 zk_id=1
host-02 zk_id=2
host-03 zk_id=3

Bad error reporting when environment variables aren't set

If OS_USERNAME and/or OS_PASSWORD aren't set, Ansible fails with a raw Python traceback:

failed: [host-05 -> 127.0.0.1] => {"failed": true, "parsed": false}
Traceback (most recent call last):
  File "/Users/chris/.ansible/tmp/ansible-tmp-1424446963.22-640673646885/nova_keypair", line 1771, in <module>
    main()
  File "/Users/chris/.ansible/tmp/ansible-tmp-1424446963.22-640673646885/nova_keypair", line 105, in main
    nova.authenticate()
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 169, in wrapper
    return f(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/v1_1/client.py", line 239, in authenticate
    self.client.authenticate()
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 586, in authenticate
    auth_url = self._v2_auth(auth_url)
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 677, in _v2_auth
    return self._authenticate(url, body)
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 690, in _authenticate
    **kwargs)
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 439, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/Library/Python/2.7/site-packages/python_novaclient-2.20.0-py2.7.egg/novaclient/client.py", line 433, in request
    raise exceptions.from_response(resp, body, url, method)
novaclient.exceptions.BadRequest: object of type 'NoneType' has no len() (HTTP 400)
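One cheap fix, as a sketch: add a pre-flight guard (in a wrapper script or setup step) so the failure message is human-readable:

: "${OS_USERNAME:?OS_USERNAME is not set; source your OpenStack rc file}"
: "${OS_PASSWORD:?OS_PASSWORD is not set; source your OpenStack rc file}"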

Kibana 4

Kibana 4 for viewing logs and metrics.

Needs to support authentication.
Multi-tenancy needs to be considered for hosted use-cases.

  • This is a FRAMEWORK (tied to the roadmapped Elasticsearch / Logstash frameworks); however, it may not be configured out of the box to receive logs from the M-I platform.

YAML formatting issues

The YAML standard is to have --- as the first line of the file. Some of our files don't have it.
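A quick way to list the offenders (a sketch, run from the repo root):

# print YAML files whose first line is not the document-start marker
for f in $(find . -name '*.yml'); do
  head -n 1 "$f" | grep -q '^---' || echo "$f"
done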

Standard playbook tags

For all roles and playbooks, define and implement standards for:

  • Types of tags (e.g., config, upgrade, ...)
  • Naming convention for tags (e.g., config, consul-config, ...)

When defining our standard, let's KISS.
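Whatever convention we pick, the payoff is targeted runs, e.g. with a <role>-config naming pattern (tag name illustrative):

ansible-playbook site.yml --tags consul-config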

100% self maintaining system

Control and Compute nodes need to be 100% self maintaining. If an administrator needs to log in, it's a bug.

Some areas that will need attention (a sketch for one of them follows the list):
- OS log rotation
- Purge old containers
- Healing after a fault
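For the container-purging item, a sketch of a periodic cleanup (scheduling via cron or a systemd timer is left open):

# remove all exited containers
docker rm $(docker ps -aq -f status=exited)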

Add application load balancer

Add application load balancer for web services (a sketch of the dynamic-configuration piece follows the list):

  • HAProxy
  • Dynamic configuration of endpoints via Consul + Consul Template
  • SSL Support
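A sketch of the dynamic-configuration piece using Consul Template (the template path and reload command are illustrative):

consul-template \
  -template "/etc/haproxy/haproxy.ctmpl:/etc/haproxy/haproxy.cfg:systemctl reload haproxy"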

Kafka message bus

Implement Kafka message bus

Should this run on the controller or on compute nodes? It's sort of oxygen for our stack ...

Needs to be self maintaining

We need to consider security aspects

Let's use the bits from Confluent: http://confluent.io/downloads/
