
incus-deploy's Introduction

Incus deployment tools

This is a collection of Ansible playbooks, Terraform configurations and scripts to deploy and operate Incus clusters.

How to get the test setup running:

Install incus and OpenTofu

Install the Incus stable or LTS release on your system from the zabbly/incus package repository and initialize it on your local machine.

Install OpenTofu.

Install the Ceph packages required by Ansible on the controller. On Debian, these are the ceph-base and ceph-common packages:

apt install --no-install-recommends ceph-base ceph-common

Create the test VMs with OpenTofu

Go to the terraform directory:

cd terraform/

Initialize the Terraform project:

tofu init

Create the VMs for testing:

tofu apply

Run the Ansible Playbook

Go to the ansible directory:

cd ../ansible/

Copy the example inventory file:

cp hosts.yaml.example hosts.yaml

Run the Playbooks:

ansible-playbook deploy.yaml

NOTE: When re-deploying the same cluster (e.g. following a terraform destroy), make sure to also clear any local state from the data directory. Failing to do so will cause Ceph/OVN to attempt connections to the previously deployed systems, which will cause the deployment to get stuck.
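
The cleanup described above can be sketched as follows. This is a hedged sketch: it assumes the playbooks keep their generated Ceph/OVN state in a data/ directory next to them (the exact path may differ in your checkout), and it demonstrates the removal on a stand-in directory so it is safe to run anywhere.

```shell
# After `tofu destroy`, also clear the locally cached Ceph/OVN state so a
# fresh `ansible-playbook deploy.yaml` does not reuse stale credentials.
DATA_DIR="${DATA_DIR:-./data}"       # path is an assumption; adjust as needed
mkdir -p "$DATA_DIR/ceph"            # stand-in for previously generated state
rm -rf "$DATA_DIR"                   # clear local state before re-deploying
if [ ! -d "$DATA_DIR" ]; then
    cleared=yes
else
    cleared=no
fi
echo "local state cleared: $cleared"
```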

Deploying against production systems

Requirements (when using Incus with both Ceph and OVN)

  • At least 3 servers
  • One main network interface (or bond/VLAN), referred to as enp5s0 in the examples
  • One additional network interface (or bond/VLAN) to use for ingress into OVN, referred to as enp6s0 in the examples
  • Configured IPv4/IPv6 subnets on that additional network interface. In the examples, we're using:
    • IPv4 subnet: 172.31.254.0/24
    • IPv4 gateway: 172.31.254.1
    • IPv6 subnet: fd00:1e4d:637d:1234::/64
    • IPv6 gateway: fd00:1e4d:637d:1234::1
    • DNS server: 1.1.1.1
  • A minimum of 3 disks (or partitions) on distinct servers across the cluster for consumption by Ceph. In the examples, we're using (on each server):
    • nvme-QEMU_NVMe_Ctrl_incus_disk1
    • nvme-QEMU_NVMe_Ctrl_incus_disk2
  • A minimum of 1 disk (or partition) on each server for local storage; in the examples, that's /dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_incus_disk3

Configuring Ansible

With a deployment against physical servers, Terraform isn't currently used at all. Ansible will be used to deploy Ceph, OVN and Incus on the servers.

You'll need to create a new hosts.yaml which you can base on the example one provided.

You'll then need to make at least the following changes:

  • Generate a new ceph_fsid (use uuidgen)
  • Set a new incus_name
  • Set a new ovn_name
  • Update the number and names of the servers to match the FQDNs of your machines
  • Ensure that you have 3 servers with the mon ceph_role and 3 servers with the central ovn_role
  • Update the connection details to fit your deployment:
    • Unset ansible_connection, ansible_incus_remote, ansible_user and ansible_become as those are specific to our test environment
    • Set the appropriate connection information to access your servers (ansible_connection, ansible_user, SSH key, ...)
  • Update the list of Ceph and local disks for each server (look at /dev/disk/by-id for identifiers)
  • Tweak the incus_init variable to match your environment
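
Putting the changes above together, a minimal hosts.yaml might look roughly like the sketch below. This is a hedged illustration only: the key names come from this README, but the real example file's layout (and whether the role keys take strings or lists) may differ, and every value here is a placeholder.

```yaml
# Illustrative sketch only — base your real file on hosts.yaml.example.
all:
  vars:
    ceph_fsid: "REPLACE-WITH-uuidgen-OUTPUT"
    incus_name: "my-cluster"
    ovn_name: "my-cluster"
  hosts:
    server1.example.com:
      ansible_connection: ssh      # replace the test-environment connection vars
      ansible_user: root
      ceph_role: mon               # 3 servers need the mon role
      ovn_role: central            # 3 servers need the central role
```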

You'll find more details about the Ansible configuration options in ansible/README.md.

incus-deploy's People

Contributors

gigadjo, jarrodu, keestux, mkbrechtel, stgraber


incus-deploy's Issues

Ansible namespace

Do we already have an lxc or incus Ansible namespace? If not, should we create one? :)

support Terraform lxc/incus provider 0.1.2 renaming of incus_volume to incus_storage_volume

The Terraform provider lxc/incus v0.1.2 release has renamed some of the resource names used in Terraform resulting in the incus-deploy repo referencing resources that do not exist if you upgrade the provider, or install it for the first time.

https://github.com/lxc/terraform-provider-incus/releases/tag/v0.1.2

  • storage: Refactor storage volume resource names

https://registry.terraform.io/providers/lxc/incus/latest/docs/resources/storage_volume

I don't have a fully functional environment yet where I can spin all this up to confirm the details of what this needs, though I'm getting closer; this was just one of the things I hit along the way of getting things working.

I think this just involves renaming incus_volume to incus_storage_volume in terraform/baremetal-incus/main.tf, and then updating terraform/versions.tf to reflect the new minimum version of 0.1.2.
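
The rename suggested above can be done mechanically with sed. The sketch below runs against a throwaway copy so nothing real is touched; to apply it for real, point it at the main.tf mentioned above (and remember to also bump the provider version constraint in terraform/versions.tf).

```shell
# Demonstrate the incus_volume -> incus_storage_volume rename on a stand-in file.
tmp=$(mktemp)
printf 'resource "incus_volume" "disk1" {}\n' > "$tmp"   # stand-in for main.tf
sed -i 's/incus_volume/incus_storage_volume/g' "$tmp"
renamed=$(cat "$tmp")
rm -f "$tmp"
echo "$renamed"
```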

Add support for generating OVN client certificates

We currently generate server-specific certificates for OVN, but those are meant for use by OVN services.
It's currently being slightly abused by Incus for its own connection to OVN when it should instead be using a separate client certificate for that.

We should add an ovn_clients config key holding the list of names for which we should generate client certificates.
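
The proposed key might look something like this; the key name comes from the issue text, but the value is purely an illustrative guess:

```yaml
# Hypothetical shape of the proposed config key:
ovn_clients:
  - incus
```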

Error enabling msgr2 messenger in Ceph during Ansible playbook execution

Description: When running the Ansible playbook deploy.yaml from the incus-deploy project, an error occurs while attempting to enable the msgr2 messenger in Ceph. The ceph mon enable-msgr2 command fails with a timeout, indicating that it could not connect to the RADOS cluster.

Error Message:
fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

Steps to Reproduce:

1. Execute the Ansible playbook deploy.yaml in the directory ~/incus-deploy/ansible.
2. Observe the error during the task to enable the msgr2 messenger in Ceph.

Expected Behavior:

The ceph mon enable-msgr2 command should execute without errors, enabling the msgr2 messenger in the Ceph cluster.

Actual Behavior:

The ceph mon enable-msgr2 command fails with a timeout, indicating it could not connect to the RADOS cluster.

Additional Details:

  • The error occurs on multiple servers (server01, server02, server03).
  • Specific error message: RADOS timed out (error connecting to the cluster).
  • The playbook was executed as root.

Environment:

  • Ansible version: 2.17.1
  • Ubuntu: 22.04


Execute:

root@haruunkal:/incus-deploy/terraform# cd ../ansible/
root@haruunkal:/incus-deploy/ansible# ansible-playbook deploy.yaml

PLAY [Ceph - Generate cluster keys and maps] ********************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
[WARNING]: Platform linux on host server03 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server03]
[WARNING]: Platform linux on host server04 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server04]
[WARNING]: Platform linux on host server02 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server02]
[WARNING]: Platform linux on host server05 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server05]
[WARNING]: Platform linux on host server01 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of
another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-
core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server01]

TASK [Generate mon keyring] *************************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate client.admin keyring] ****************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate bootstrap-osd keyring] ***************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

TASK [Generate mon map] *****************************************************************************************************************
changed: [server03 -> 127.0.0.1]
ok: [server04 -> 127.0.0.1]
ok: [server01 -> 127.0.0.1]
ok: [server05 -> 127.0.0.1]
ok: [server02 -> 127.0.0.1]

RUNNING HANDLER [Add key to client.admin keyring] ***************************************************************************************
changed: [server03 -> 127.0.0.1]

RUNNING HANDLER [Add key to bootstrap-osd keyring] **************************************************************************************
changed: [server03 -> 127.0.0.1]

RUNNING HANDLER [Add nodes to mon map] **************************************************************************************************
changed: [server03 -> 127.0.0.1] => (item={'name': 'server01', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe2d:4c57'})
changed: [server03 -> 127.0.0.1] => (item={'name': 'server02', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe05:31f6'})
changed: [server03 -> 127.0.0.1] => (item={'name': 'server03', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe01:1c21'})

PLAY [Ceph - Add package repository] ****************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server04]
ok: [server05]
ok: [server03]
ok: [server01]
ok: [server02]

TASK [Create apt keyring path] **********************************************************************************************************
ok: [server03]
ok: [server01]
ok: [server05]
ok: [server04]
ok: [server02]

TASK [Add ceph GPG key] *****************************************************************************************************************
changed: [server04]
changed: [server03]
changed: [server05]
changed: [server01]
changed: [server02]

TASK [Get DPKG architecture] ************************************************************************************************************
ok: [server04]
ok: [server03]
ok: [server05]
ok: [server01]
ok: [server02]

TASK [Add ceph package sources] *********************************************************************************************************
changed: [server03]
changed: [server05]
changed: [server04]
changed: [server02]
changed: [server01]

RUNNING HANDLER [Update apt] ************************************************************************************************************
changed: [server01]
changed: [server04]
changed: [server05]
changed: [server03]
changed: [server02]

PLAY [Ceph - Install packages] **********************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server01]
ok: [server04]
ok: [server05]
ok: [server03]
ok: [server02]

TASK [Install ceph-common] **************************************************************************************************************
changed: [server02]
changed: [server03]
changed: [server05]
changed: [server04]
changed: [server01]

TASK [Install ceph-mon] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server03]
changed: [server01]
changed: [server02]

TASK [Install ceph-mgr] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Install ceph-mds] *****************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server01]
changed: [server02]
changed: [server03]

TASK [Install ceph-osd] *****************************************************************************************************************
changed: [server01]
changed: [server04]
changed: [server03]
changed: [server02]
changed: [server05]

TASK [Install ceph-rbd-mirror] **********************************************************************************************************
skipping: [server01]
skipping: [server02]
skipping: [server04]
skipping: [server05]
skipping: [server03]

TASK [Install radosgw] ******************************************************************************************************************
skipping: [server01]
skipping: [server02]
skipping: [server03]
changed: [server04]
changed: [server05]

PLAY [Ceph - Set up config and keyrings] ************************************************************************************************

TASK [Transfer the cluster configuration] ***********************************************************************************************
changed: [server01]
changed: [server04]
changed: [server03]
changed: [server05]
changed: [server02]

TASK [Create main storage directory] ****************************************************************************************************
ok: [server04]
ok: [server01]
ok: [server03]
ok: [server05]
ok: [server02]

TASK [Create monitor bootstrap path] ****************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

TASK [Create OSD bootstrap path] ********************************************************************************************************
changed: [server05]
changed: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

TASK [Transfer main admin keyring] ******************************************************************************************************
changed: [server05]
changed: [server03]
changed: [server01]
changed: [server02]
changed: [server04]

TASK [Transfer additional client keyrings] **********************************************************************************************
skipping: [server05]
skipping: [server03]
skipping: [server04]
skipping: [server01]
skipping: [server02]

TASK [Transfer bootstrap mon keyring] ***************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Transfer bootstrap mon map] *******************************************************************************************************
skipping: [server05]
skipping: [server04]
changed: [server03]
changed: [server02]
changed: [server01]

TASK [Transfer bootstrap OSD keyring] ***************************************************************************************************
changed: [server05]
changed: [server04]
changed: [server01]
changed: [server03]
changed: [server02]

RUNNING HANDLER [Restart Ceph] **********************************************************************************************************
changed: [server05]
changed: [server03]
changed: [server02]
changed: [server04]
changed: [server01]

PLAY [Ceph - Deploy mon] ****************************************************************************************************************

TASK [Gathering Facts] ******************************************************************************************************************
ok: [server01]
ok: [server02]
ok: [server05]
ok: [server04]
ok: [server03]

TASK [Bootstrap Ceph mon] ***************************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server02]
changed: [server03]
changed: [server01]

TASK [Enable and start Ceph mon] ********************************************************************************************************
skipping: [server04]
skipping: [server05]
changed: [server02]
changed: [server03]
changed: [server01]

RUNNING HANDLER [Enable msgr2] **********************************************************************************************************
fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

PLAY RECAP ******************************************************************************************************************************
server01 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server02 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server03 : ok=32 changed=25 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server04 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0
server05 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0

Make ansible role consumable from requirements.yml

I'm wondering what the intended way to make use of the playbooks is. Generally we would reference the git repository in our requirements.yml, but the current directory structure seems unsuitable for that.

https://docs.ansible.com/ansible/latest/galaxy/user_guide.html#installing-multiple-roles-from-a-file

This is how I would expect to use it:

roles:
  - name: lxc.incus
    src: https://github.com/lxc/incus-deploy
    scm: git

And this is how it currently fails.

$ ansible-galaxy install -r requirements.yml
[...]
[WARNING]: - lxc.incus was NOT installed successfully: this role does not appear to have a meta/main.yml file.

For one, I think the ansible directory is not something that can be referenced from a requirements.yml, but also the role is missing a meta/main.yaml, likely with a galaxy_info dict.

https://docs.ansible.com/ansible/latest/galaxy/user_guide.html#using-meta-main-yml
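
A minimal meta/main.yml of the kind described above might look like the following sketch. All field values here are placeholders, not the project's actual metadata; only the overall galaxy_info shape follows the Ansible Galaxy documentation.

```yaml
# Illustrative sketch of a minimal role metadata file:
galaxy_info:
  role_name: incus
  namespace: lxc
  author: lxc
  description: Deploy and operate Incus clusters
  license: Apache-2.0
  min_ansible_version: "2.14"
dependencies: []
```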

WDYT?
