hashicorp / nomad Goto Github PK

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

Home Page: https://www.nomadproject.io/

License: Other

Makefile 0.09% Go 72.50% Shell 0.26% Python 0.01% Java 0.01% HTML 0.01% JavaScript 11.31% HCL 1.67% Dockerfile 0.02% PowerShell 0.07% SCSS 0.61% Handlebars 2.15% MDX 11.30%

nomad's Introduction

Nomad

Nomad is a simple and flexible workload orchestrator to deploy and manage containers (docker, podman), non-containerized applications (executable, Java), and virtual machines (qemu) across on-prem and clouds at scale.

Nomad is supported on Linux, Windows, and macOS. A commercial version of Nomad, Nomad Enterprise, is also available.

Website: https://developer.hashicorp.com/nomad
Tutorials: HashiCorp Developer
Forum: Discuss

Nomad provides several key features:

Deploy Containers and Legacy Applications: Nomad’s flexibility as an orchestrator enables an organization to run containers, legacy, and batch applications together on the same infrastructure. Nomad brings core orchestration benefits to legacy applications without needing to containerize via pluggable task drivers.
Simple & Reliable: Nomad runs as a single binary and is entirely self contained - combining resource management and scheduling into a single system. Nomad does not require any external services for storage or coordination. Nomad automatically handles application, node, and driver failures. Nomad is distributed and resilient, using leader election and state replication to provide high availability in the event of failures.
Device Plugins & GPU Support: Nomad offers built-in support for GPU workloads such as machine learning (ML) and artificial intelligence (AI). Nomad uses device plugins to automatically detect and utilize resources from hardware devices such as GPU, FPGAs, and TPUs.
Federation for Multi-Region, Multi-Cloud: Nomad was designed to support infrastructure at a global scale. Nomad supports federation out-of-the-box and can deploy applications across multiple regions and clouds.
Proven Scalability: Nomad is optimistically concurrent, which increases throughput and reduces latency for workloads. Nomad has been proven to scale to clusters of 10K+ nodes in real-world production environments.
HashiCorp Ecosystem: Nomad integrates seamlessly with Terraform, Consul, Vault for provisioning, service discovery, and secrets management.

Quick Start

Testing

See Developer: Getting Started for instructions on setting up a local Nomad cluster for non-production use.

Optionally, find Terraform manifests for bringing up a development Nomad cluster on a public cloud in the terraform directory.

Production

See Developer: Nomad Reference Architecture for recommended practices and a reference architecture for production deployments.

Documentation

Full, comprehensive documentation is available on the Nomad website: https://developer.hashicorp.com/nomad/docs

Guides are available on HashiCorp Developer.

Roadmap

A timeline of major features expected for the next release or two can be found in the Public Roadmap.

This roadmap is a best guess at any given point, and both release dates and projects in each release are subject to change. Do not take any of these items as commitments, especially ones later than one major release away.

Contributing

See the contributing directory for more developer documentation.

nomad's People

Contributors

Stargazers

Watchers

Forkers

jbayer rbramwell prodigeni jefflaplante pmoust eftychis orivej carlosdp ranjib spacejam mbrukman ziporah mesosphere-backup timfallmk asbjornenge lalyos aldergren jweissig huangyan9188 geminas gosuri wotek mrwacky42 forrestfinch twinuma savaki jimmyyan qinglouer alihalabyah carevoyance macb bscott aniculescu errordeveloper legal90 jmitchell iverberk pinterb crosbymichael tomzhang cartertsai th3architect jmptrader xavm duanshuaimin dancannon cigolabs kikitux crgm steve-jansen riddopic bascht tharanga-abeyseela charlieok bvector ldesiqueira sdedelbrock the-control-group theckman adamenger rowhit fernandezvara marcelocorreia gjacquet wyattanderson hooklift chrismckenzie saluber brmatt maximilize feliksik arminc huiliang dgshep kangkot rajamony cdrage wathiede asteris-llc rmoorman nautsio jorgemarey bnordbo sean- cautio altoplano kennethzfeng simudream sheerun msabramo bastiaanb ljt achanda gavioto rudyjessop ryanslade ericpfisher fanyeren ignaciovaquero mkabischev

nomad's Issues

Truncate UUID to the first 8 characters

When displaying the UUID in the UI we should truncate to the first 8 chars. Two considerations:

The UI should accept short UUIDs
We should have a way to detect a duplicate truncation (pretty unlikely, but possible).

-config option fails and throws error when directory is passed.

When starting nomad with the -config option and passing a path to a directory, Nomad fails and throws an error. Passing a path to a file works fine.

Error

$ nomad agent -config="/etc/nomad.d/"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x49120d]

goroutine 1 [running]:
github.com/hashicorp/nomad/command/agent.(*Config).Merge(0xc820077b00, 0x0, 0x0)
    /home/ubuntu/go/src/github.com/hashicorp/nomad/command/agent/config.go:251 +0x6d
github.com/hashicorp/nomad/command/agent.(*Command).readConfig(0xc820087a40, 0xb09e00)
    /home/ubuntu/go/src/github.com/hashicorp/nomad/command/agent/command.go:109 +0xf09
...

Successful

$ nomad agent -config="/etc/nomad.d/nomad.conf"
==> Starting Nomad agent...
==> Error starting agent: must have at least client or server mode enabled
ubuntu@ip-172-31-10-48:~$

Full error log and successful log at https://gist.github.com/clstokes/4cb766a9209dda8fff65.

Put demo Java app in git

The demo Java app is currently in Dropbox. We should put this in the git repo and serve via http test server or similar.

Change default region to 'global'

Most deployments will have only a single region and "global" sounds better than "region1"

Running test on OS X does not seem to work.

Looks like make test assumes it's running on AWS with access to the docker daemon socket. Maybe the README should be updated to reflect environment requirements for development.

2015/09/27 18:29:05 [INFO] raft: Node at 127.0.0.1:17017 [Leader] entering Leader state
2015/09/27 18:29:05 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2015/09/27 18:29:05 [INFO] serf: EventMemberJoin: Node 17017.global 127.0.0.1
2015/09/27 18:29:05 [DEBUG] raft: Node 127.0.0.1:17017 updated peer set (2): [127.0.0.1:17017]
2015/09/27 18:29:05 [INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
2015/09/27 18:29:05 [INFO] nomad: cluster leadership acquired
2015/09/27 18:29:05 [INFO] client: using alloc directory /var/folders/ds/jbxmdvmx0rj1z7vc4br6j97h0000gn/T/NomadClient631364695
2015/09/27 18:29:05 [INFO] nomad: adding server Node 17017.global (Addr: 127.0.0.1:17017) (DC: dc1)
2015/09/27 18:29:05 [WARN] fingerprint.network: Ethtool not found, checking /sys/net speed file
2015/09/27 18:29:05 [WARN] fingerprint.network: Error getting information about net speed
2015/09/27 18:29:05 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
2015/09/27 18:29:05 [DEBUG] client: applied fingerprints [arch cpu host memory storage network]
2015/09/27 18:29:05 [DEBUG] driver.java: must run as root user, disabling
2015/09/27 18:29:05 [DEBUG] driver.qemu: must run as root user, disabling
--- FAIL: TestAgent_RPCPing (0.04s)
    agent_test.go:67: err: client setup failed: driver setup failed: Get http://unix.sock/version: dial unix /var/run/docker.sock: connect: no such file or directory

Sidebar Nav

The sidebar in the docs (and probably other) sections is a bit too tight. We end up line-breaking just about everything we put there. Can we make it wider? /cc @captainill

Add cron type to job description

We have the ability to create "service" and "batch" jobs, but it'd be really nice to have an additional "cron" type. In this type, a task would be created whenever the cron triggers and once the task is complete (be it the process exits gracefully or not), the task is not recreated until the next cron trigger.

The type would not allocate a new task if the number of tasks are already at the maximum count for the group. This might be a nice configuration item to control how this works; performing a no-op, replacement or forced addition.

Assuming we think this a good idea, I'd be happy to jump into it.

Export the proper alloc dirs in environment variables

Right now we use something like execContext.AllocDir but we actually want to use task.AllocDir

"Smarter" job specifications

I am not sure if this is within the scope of the project, but I think a use-case like this is pretty common place.

I would like to run something like hdfs on a nomad cluster. Mesos uses schedulers and executors for their frameworks. They have a hdfs framework here: https://github.com/mesosphere/hdfs

Basically I would like to be able to do the following:

Collocate a namenode, journalnode and zookeeper failover controller on 1 node in the cluster.
Collocate a standby namenode, journalnode and zookeeper failover controller on another node in the cluster.
Run a journalnode on another node in the cluster.
Run datanodes on all other nodes in the cluster.
Easily scale up and down the number of namenodes, journalnodes and datanodes (along with their collocated components).

There doesn't really appear to be a way to do this in a job specification at the moment, but if we can do something like this, it would be really awesome!

Docker driver doesn't resolve the correct image if there are multiple versions of the same image on the machine

I think this is a good time to refactor this call to use the API instead of CLI. The CLI doesn't give me the data I want here.

Change termination to SIGTERM instead of SIGINT

POSIX conventions give us three means to stop a program:

SIGINT is actually a keyboard interrupt (Ctrl-C). Many traditional daemons (apache, mysql) ignore this.
SIGTERM means please cleanup and shutdown immediately. Apache, MySQL obey this as the termination command.
SIGKILL Aaaand you're done. In the application is stuck or you want to crash instead of waiting for cleanup, you can kill the process outright.

I think we should TERM and follow with KILL (vs. INT followed by KILL that we do currently).

Support mixed case / lowercase for port labels

These will all be uppercased for use in environment variables. As such they should be normalized and de-duped. We should do the same thing with meta tags.

Pass dynamic port(s) down to the executor via ENV or similar

Docs for Task Drivers

Could you guys please fill in the docs for the task drivers as you are more familiar.
Feel free to copy/paste from exec for consistency.

Exec
Java
Docker
Qemu

/cc: @cbednarski @catsby @dadgar

Rename some agent-* commands

Since some of the agent-* commands only pertain to servers, it would be more obvious if they were named accordingly.

agent
agent-info
agent-force-leave -> server-force-leave
agent-join        -> server-join
agent-members     -> server-members
node-drain
node-status
run
status
stop

Thoughts?

Add volumes for docker driver

Similar in theme to #60 and #61, we want to share data between tasks in a group. We will need to mount the task group allocation directory as a docker volume so we can share data between these processes.

Handle nil task resources consistently

We have some spots (maybe even just docker) where we panic if the provided task resources is a typed nil. We should either handle this uniformly before invoking a specific driver, or remove the panic.

Example: https://github.com/hashicorp/nomad/blob/master/client/driver/docker.go#L65-L67

Need a new API/CLI to query and set server list for client agents

Client agents do not use Serf, and instead are provided a list of server hostnames or IPs. As a result, we need some way to query and manipulate that.

Support mounting volumes into executor context / task alloc dir

When running a task we would like to share data between other tasks in a task group. Also, if using something like ceph or EBS we would like to support mounting a volume into the executor's context (chroot, cgroup, etc.) so it can write persistent data.

Probably depends on #60, so mounts are mounted into a tasks's environment and have some level of isolation from other tasks on the system.

Scan for and reserve already-used ports when nomad starts

Ports are dynamically allocated in a range between 20000 and 60000. We should scan and detect ports in this range that are already bound so we don't collide with other services. The high port range is likely safe, but we should check.

Also, we should check for low ports in case a service tries to use a reserved port in that range.

Support static ports for docker

Docker currently supports dynamic ports but not static ports

Reserve nomad's ports

When nomad starts it should mark its own ports as reserved to prevent tasks from trying to use that port number.

Add driver for rkt

It'd be fantastic if Nomad supported CoreOS's rkt container engine in addition to Docker.

Drop privileges on exec'd processes

Nomad itself will need root to access things like network resources and cgroups. However, processes that are started by nomad should be forked as nobody or other non-privileged access by default.

nomad agent fails to start when it can't locate a container.

After stopping nomad for about a minute or two and attempting to restart it, I get the following error:

nomad agent -config /etc/nomad.hcl

==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* Failed to find container ed401155afbf0686f372be304a23eae6b3b57555730ab42df34f04abe746dfa3: <nil>

Maybe nomad should start regardless of missing containers and just log an error.

Can't schedule a job because "resources exhausted"

I tried starting nomad on both CoreOS (local and GCE) and Debian (GCE) and running a cluster (both with -dev and with client/server. For installation, I follow the steps described in the Vagrantfile.

As soon as I try to run:

nomad init
nomad run example.nomad

I get a message that my resources are exhausted:

username@instance-2:~$ nomad run example.nomad
==> Monitoring evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "2263a54c-ea0e-ab68-adee-30f8d2e0827a" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: bandwidth exceeded" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981" finished with status "complete"

Or this variant:

core@core-01 ~ $ ./nomad run example.nomad 
==> Monitoring evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "c23ca37e-2de5-b10d-d90a-d36bc468092c" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2" finished with status "complete"

This is the first job I am scheduling so it is unlikely that the network resources are actually exhausted. Am I missing some kind of configuration telling nomad about the network?

Bootstrapping guide

There are a few places in the docs or CLI usage where it would be nice to reference a comprehensive bootstrapping guide, like Consul has, rather than re-documenting bootstrap a couple of times.

Add reserved ports from drivers

When a driver or fingerprinted runs it should mark any related ports as reserved. For example, we should prevent a task from trying to allocate the same ports used by docker or consul, if we detect those on the nomad host.

Clean up docker images periodically

It would be nice to remove docker images only after they are unused after X time. This allows minor gaps in time where the image isn't in use, but still cleans up stale images that aren't needed.

Support chroot for exec / java

Chroot gives us disk isolation and also provides a foundation to mount shared data and persistent volumes.

API client does not use ListStub

@ryanuber Many of the "list" APIs use a Stub response instead of the full object. For example, job list returns the JobListStub instead of a list of Job structs. We are decoding into the full struct, which is fine as its a super set. I just wonder if its confusing since many fields will be blank. It may be more clear to also have ListStub versions in the API. Up for discussion, I'm not sure either way.

Duplicate allocations during reschedule

@armon documenting this here, but it's the same bug we were seeing earlier today.

We are seeing duplicate allocations when a job is rescheduled. For example, register a simple job with 1 task. You get 1 allocation. Then schedule the same thing again. You get 2 more new allocations. Scheduling yet again yields 4 new allocations.

Steps to reproduce:

$ cat ~/test.job 
job "job1" {
    region = "region1"
    datacenters = ["dc1"]
    type = "batch"
    group "group1" {
        count = 1
        task "task1" {
            driver = "exec"

            resources {
                cpu = 1
                memory = 1
            }

            config {
                command = "yes"
                args = "hello world"
            }
        }
    }
}

$ nomad run ~/test.job 
JobID  = job1
EvalID = 7c675f82-2493-f3c1-7adf-32cf5d699ee8

$ curl localhost:4646/v1/evaluation/7c675f82-2493-f3c1-7adf-32cf5d699ee8/allocations?pretty
[
    {
        "ID": "effb8e12-faaa-01d3-0e9b-75c5a290b3e1",
        "EvalID": "7c675f82-2493-f3c1-7adf-32cf5d699ee8",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 14,
        "ModifyIndex": 18
    }
]

$ nomad run ~/test.job 
JobID  = job1
EvalID = 56156e2a-99e1-11de-0d5f-efa448a442dd

$ curl localhost:4646/v1/evaluation/56156e2a-99e1-11de-0d5f-efa448a442dd/allocations?pretty
[
    {
        "ID": "af746b79-d35a-3c1f-e17d-d73221a20e60",
        "EvalID": "56156e2a-99e1-11de-0d5f-efa448a442dd",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 23,
        "ModifyIndex": 27
    },
    {
        "ID": "c9a1dcbc-a259-ad43-9645-49b9d110123e",
        "EvalID": "56156e2a-99e1-11de-0d5f-efa448a442dd",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 23,
        "ModifyIndex": 26
    }
]

$ nomad run ~/test.job 
JobID  = job1
EvalID = 2d64a70b-ef9a-51f1-99ab-a558b6f9e122

$ curl localhost:4646/v1/evaluation/2d64a70b-ef9a-51f1-99ab-a558b6f9e122/allocations?pretty
[
    {
        "ID": "59ccc408-c3f5-7785-ef4f-1fbb06e05f0b",
        "EvalID": "2d64a70b-ef9a-51f1-99ab-a558b6f9e122",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 35,
        "ModifyIndex": 40
    },
    {
        "ID": "6b49c24b-ac95-6028-9765-e5d10ac176df",
        "EvalID": "2d64a70b-ef9a-51f1-99ab-a558b6f9e122",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 35,
        "ModifyIndex": 42
    },
    {
        "ID": "c5b7b880-e044-4dc8-9ac1-40fdc8ead470",
        "EvalID": "2d64a70b-ef9a-51f1-99ab-a558b6f9e122",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 35,
        "ModifyIndex": 41
    },
    {
        "ID": "f783b3c7-8df0-1da1-fe71-90fedbb654f4",
        "EvalID": "2d64a70b-ef9a-51f1-99ab-a558b6f9e122",
        "Name": "job1.group1[0]",
        "NodeID": "3202242b-f871-d578-59a3-5e044d3780aa",
        "JobID": "job1",
        "TaskGroup": "group1",
        "DesiredStatus": "run",
        "DesiredDescription": "",
        "ClientStatus": "running",
        "ClientDescription": "{\"task1\":{\"Status\":\"running\",\"Description\":\"task started\"}}",
        "CreateIndex": 35,
        "ModifyIndex": 39
    }
]

I think this is also the reason behind the multiple "Allocation xxx created on node yyy" log messages. Looking at the CreateIndex for each allocation associated with a given evaluation, they are all always the same, and always greater than the evaluation's CreateIndex, which makes me think we might just be dup'ing something on the server side each time.

Consider managing deps with something like Godeps

In my attempt to build nomad I had to update the following deps:

github.com/fsouza/go-dockerclient
github.com/mitchellh/mapstructure

Might be better to manage this with something like Godeps.

Artifact fetcher

We need a common utility for fetching source artifacts over the internet. It should at least:

support retries
some authentication support

Possibly an interface that has more concrete implementations, like SSHFetcher or HTTPSFetcher, etc.

Client should inject nomad version information as attributes

We should have a "fingerprint" that just reports our build information as a node attribute.

Community Updates

Please made sure you add yourselves to community page.
/cc: @cbednarski @catsby @dadgar @ryanuber

`nomad run` doesn't tell you if the allocation succeed

I run nomad run example.nomad and I see the eval all worked and is now complete.

vagrant@nomad:/opt/gopath/src/github.com/hashicorp/nomad$ nm run example.nomad
==> Monitoring evaluation "ac974418-715f-4349-9807-cc5c213697a3"
    Evaluation triggered by job "example"
    Allocation "c3d0fc1e-9b17-2ec3-4ba8-89270a7231d1" created: node "f89dcb8c-f0f3-68c9-8cfd-04ef1f09def9", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ac974418-715f-4349-9807-cc5c213697a3" finished with status "complete"

However, I do not see if my thing is actually running. In my example, it's not..

vagrant@nomad:/opt/gopath/src/github.com/hashicorp/nomad$ nm status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      =

==> Evaluations
ID                                    Priority  TriggeredBy   Statusac974418-715f-4349-9807-cc5c213697a3  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
c3d0fc1e-9b17-2ec3-4ba8-89270a7231d1  ac974418-715f-4349-9807-cc5c213697a3  f89dcb8c-f0f3-68c9-8cfd-04ef1f09def9  cache      run      failed
vagrant@nomad:/opt/gopath/src/github.com/hashicorp/nomad$

CLI command to display the agents fingerprint(s)

For operating or debugging, it would be helpful if I could run a command on a node and get some information on the results of the Fingerprints, both which ones ran, and what they recorded. This would be helpful for understanding why a job isn't placing where I thought it would, or debugging a Fingerprint.

Ex:

$ nomad node-status -fingerprint
Fingerprints ran:
- Java
  - version: 1.7...
- Docker
- CPU:
  - cores: 24
  - mhz: 9001
- Blah..
- Blah

Race in nomad/nomad tests

Periodic failure (probably a race condition) while running tests on nomad/nomad; seen on linux with 2 cpus, running make test. Log below.

/cc @armon

Resource constraints for exec

On Linux we can use cgroups to constrain resource usage. There is initial support and documentation for this feature in the Docker driver, but we should extend it to other drivers.

Questions:

Is there a good cgroups library for go? If not we can use the cg* tools but will need to note that they are an external dependency that must be installed on the system in advance.

Docs:

/cc @catsby

Cross-compilation failures

When building #63 on OS X, I get the following errors:

echo "--> Building..."
--> Building...
gox \
        -osarch="darwin/amd64 freebsd/amd64 linux/amd64 linux/arm windows/amd64" \
        -output="bin/nomad_{{.OS}}_{{.Arch}}" \
        -verbose \
        .
Number of parallel builds: 8

-->   freebsd/amd64: github.com/hashicorp/nomad
-->    darwin/amd64: github.com/hashicorp/nomad
-->       linux/arm: github.com/hashicorp/nomad
-->     linux/amd64: github.com/hashicorp/nomad
-->   windows/amd64: github.com/hashicorp/nomad

3 errors occurred:
--> freebsd/amd64 error: exit status 2
Stderr: # github.com/fsouza/go-dockerclient/external/github.com/Sirupsen/logrus
../../fsouza/go-dockerclient/external/github.com/Sirupsen/logrus/terminal_freebsd.go:10: ioctlReadTermios redeclared in this block
    previous declaration at ../../fsouza/go-dockerclient/external/github.com/Sirupsen/logrus/terminal_bsd.go:7
../../fsouza/go-dockerclient/external/github.com/Sirupsen/logrus/terminal_freebsd.go:12: Termios redeclared in this block
    previous declaration at ../../fsouza/go-dockerclient/external/github.com/Sirupsen/logrus/terminal_bsd.go:9

--> linux/arm error: exit status 2
Stderr: # github.com/shirou/gopsutil/host
../../shirou/gopsutil/host/host_linux.go:110: cannot use u.User[:] (type []byte) as type []int8 in argument to common.IntToString
../../shirou/gopsutil/host/host_linux.go:111: cannot use u.Line[:] (type []byte) as type []int8 in argument to common.IntToString
../../shirou/gopsutil/host/host_linux.go:112: cannot use u.Host[:] (type []byte) as type []int8 in argument to common.IntToString

--> windows/amd64 error: exit status 2
Stderr: # github.com/hashicorp/nomad/client/executor
client/executor/exec.go:125: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)
client/executor/exec.go:126: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)
client/executor/exec.go:126: undefined: syscall.Credential
client/executor/exec.go:128: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)
client/executor/exec.go:141: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)
client/executor/exec.go:142: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)
client/executor/exec.go:142: undefined: syscall.Credential
client/executor/exec.go:144: c.Cmd.SysProcAttr.Credential undefined (type *syscall.SysProcAttr has no field or method Credential)

make: *** [xc] Error 1

Log the config files used

Suggested by @cbednarski in #33: It could be helpful to an operator if we had a debug-level log to show the names of the config files used when starting the agent, especially in cases where a config dir is used.

QEmu driver should use NetworkResource ports not user configured ports

There are two sets of ports we are concerned with, those internal to the VM and those external which are running on the host. The internal ports are specific to the machine and application and we don't know what they are. Likely they are hard coded based on the application.

The external ports may either be statically reserved ("always give me 8080") or dynamically assigned. The task driver should take a list of internal ports to map to, and then use the list of ReservedPorts as provided by the scheduler and do the 1:1 mapping.

For example:

task "web" {
   driver = "qemu"
   config {
      image = "foo"

      # SSH on port 22 always, web app on 8000
      ports = [22, 8000]
   }
   resources {
       ....
       network {
           dynamic_ports = ["ssh", "web"] # Dynamically assign 2 ports from the host
       }
   }
}

/cc: @catsby

Service jobs should support the ability to be restarted after a failure

While testing the example redis job, I stopped the agent and the job was marked dead. When I restarted the agent, the job did not get rescheduled. This maybe by design, but it was unexpected, and I assumed since the job was a service nomad would just reschedule the job to an health node.

Support ip / virtual ip per task group

In kubernetes, there is the idea of services. Essentially, it assigns a virtual ip (from a range) to each task launched in the cluster.

This means, that if we launch task, we do not have to ask nomad to dynamically assign a port, or force it to assign a port and hope it exists. Since each task has a virtual ip, we can access all http services on 80, https on 443, etc.

Extract `httptest.NewServer` from AWS ENV Test into something reusable

#14 introduced a mock server for HTTP requests. It should be extracted into a generic thing that wraps net.httptestserver, and allows a given test library to append "endpoints" to it:

s := nomadTesting.NewTestHTTPServer()
s.AddEndpoint(&nomadTesting.HTTPTestEndpoint{
  Endpoint: "/v1/some/endpoint",
  ContentType: "text/json",
  Body: <some body>,
})

Or something like that...

Use nomad.structs import in api package

@armon Right now we have mirrored structs in the API the same way we do in other projects. I think the point of this was to avoid the extra import of the structs package everywhere. Can we just import the structs package in the API, and use structs.Job and the stubs directly? Or are there other reasons I'm not thinking of?

Motivation is #46. We are going to need this signature to take a *structs.Job instead of *api.Job.

Adjustable kill time

Right now processes have 5 seconds to clean up before a force kill.
We should explore ways to make this configurable, depending on the job(s)

Backstory:

#15 (comment)

Add vagrant file for Getting Started guide

The Getting Started guide references a vagrant file to download,we need to post it and update the link:

http://localhost:4567/intro/getting-started/install.html

generic_sched panic

@armon seeing this during some tests: https://gist.github.com/ryanuber/c358e95f2e3e468f1210

After tracing some code, it looks like it stems from this: https://github.com/hashicorp/nomad/blob/master/scheduler/generic_sched.go#L106 if Process() is called on an eval with its initial TriggeredBy != "job-register" && TriggeredBy != "node-update".