HOWTO: Ubuntu High Availability - Shared SCSI Disk only Environment (Azure and other Environments)

--------------------

This is a mini tutorial showing how to deploy a High Availability cluster in an environment that supports SCSI shared disks. Instead of relying on the APIs of public or private clouds to fence the clustered virtual machines, this example relies only on the SCSI shared disk feature, which makes it a good fit for any virtual and/or physical machines that have shared SCSI disks.

NOTES:

1. I wrote this document with the Microsoft Azure cloud environment in mind, and that's why the beginning of this document shows how to get a SHARED SCSI DISK in an Azure environment. The clustering examples given below will work in any environment, physical or virtual.

2. If you want to skip the cloud provider configuration, just search for the BEGIN keyword and you will be taken to the cluster and OS specifics.

--------------------

Like all High Availability clusters, this one also needs some way to guarantee consistency among the different cluster resources. Clusters usually do that with fencing mechanisms: a way to guarantee that the other nodes are *not* accessing the resources before the services running on them, and managed by the cluster, are taken over.

If following this mini tutorial in a Microsoft Azure environment, keep in mind that this example needs the Microsoft Azure Shared Disk feature:

- docs.microsoft.com/en-us/azure/virtual-machines/windows/disks-shared-enable

And the Linux kernel module called "softdog":

- /lib/modules/xxxxxx-azure/kernel/drivers/watchdog/softdog.ko

--------------------

Azure clubionicshared01 disk json file "shared-disk.json":

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "diskName": {
            "type": "string",
            "defaultValue": "clubionicshared01"
        },
        "diskSizeGb": {
            "type": "int",
            "defaultValue": 1024
        },
        "maxShares": {
            "type": "int",
            "defaultValue": 4
        }
    },
    "resources": [
        {
            "apiVersion": "2019-07-01",
            "type": "Microsoft.Compute/disks",
            "name": "[parameters('diskName')]",
            "location": "westcentralus",
            "sku": {
                "name": "Premium_LRS"
            },
            "properties": {
                "creationData": {
                    "createOption": "Empty"
                },
                "diskSizeGB": "[parameters('diskSizeGb')]",
                "maxShares": "[parameters('maxShares')]"
            },
            "tags": {}
        }
    ]
}

--------------------

Command to create the resource in a resource-group called "clubionic":

$ az group deployment create --resource-group clubionic \
    --template-file ./shared-disk.json

--------------------

Basics:

- You will create a resource-group called "clubionic" with the following resources at first:

clubionicplacement      Proximity placement group
clubionicnet            Virtual network
    subnets:
        private 10.250.3.0/24
        public  10.250.98.0/24

clubionic01             Virtual machine
clubionic01-ip          Public IP address
clubionic01private      Network interface
clubionic01public       Network interface (clubionic01-ip associated)
clubionic01_OsDisk...   OS Disk (automatic creation)

clubionic02             Virtual machine
clubionic02-ip          Public IP address
clubionic02private      Network interface
clubionic02public       Network interface (clubionic02-ip associated)
clubionic02_OsDisk...   OS Disk (automatic creation)

clubionic03             Virtual machine
clubionic03-ip          Public IP address
clubionic03private      Network interface
clubionic03public       Network interface (clubionic03-ip associated)
clubionic03_OsDisk...
OS Disk (automatic creation) clubionicshared01 Shared Disk (created using cmdline and json file) rafaeldtinocodiag Storage account (needed for console access) -------------------- Initial idea is to create the network interfaces: - clubionic{01,02,03}{public,private} - clubionic{01,02,03}-public - associate XXX-public interfaces to clubionic{01,02,03}public And then create then create the clubionicshared01 disk (using yaml file). After those are created, next step is to create the 3 needed virtual machines with the proper resources, like showed above, so we can move on in with the cluster configuration. -------------------- I have created a small cloud-init file that can be used in "advanced" tab during VM creation screens (you can copy and paste it there): #cloud-config package_upgrade: true packages: - man - manpages - hello - locales - less - vim - jq - uuid - bash-completion - sudo - rsync - bridge-utils - net-tools - vlan - ncurses-term - iputils-arping - iputils-ping - iputils-tracepath - traceroute - mtr-tiny - tcpdump - dnsutils - ssh-import-id - openssh-server - openssh-client - software-properties-common - build-essential - devscripts - ubuntu-dev-tools - linux-headers-generic - gdb - strace - ltrace - lsof - sg3-utils write_files: - path: /etc/ssh/sshd_config content: | Port 22 AddressFamily any SyslogFacility AUTH LogLevel INFO PermitRootLogin yes PubkeyAuthentication yes PasswordAuthentication yes ChallengeResponseAuthentication no GSSAPIAuthentication no HostbasedAuthentication no PermitEmptyPasswords no UsePAM yes IgnoreUserKnownHosts yes IgnoreRhosts yes X11Forwarding yes X11DisplayOffset 10 X11UseLocalhost yes PermitTTY yes PrintMotd no TCPKeepAlive yes ClientAliveInterval 5 PermitTunnel yes Banner none AcceptEnv LANG LC_* EDITOR PAGER SYSTEMD_EDITOR Subsystem sftp /usr/lib/openssh/sftp-server - path: /etc/ssh/ssh_config content: | Host * ForwardAgent no ForwardX11 no PasswordAuthentication yes CheckHostIP no AddressFamily any SendEnv LANG LC_* EDITOR PAGER StrictHostKeyChecking no HashKnownHosts yes - path: /etc/sudoers content: | Defaults env_keep += "LANG LANGUAGE LINGUAS LC_* _XKB_CHARSET" Defaults env_keep += "HOME EDITOR SYSTEMD_EDITOR PAGER" Defaults env_keep += "XMODIFIERS GTK_IM_MODULE QT_IM_MODULE QT_IM_SWITCHER" Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" Defaults logfile=/var/log/sudo.log,loglinelen=0 Defaults !syslog, !pam_session root ALL=(ALL) NOPASSWD: ALL %wheel ALL=(ALL) NOPASSWD: ALL %sudo ALL=(ALL) NOPASSWD: ALL rafaeldtinoco ALL=(ALL) NOPASSWD: ALL runcmd: - systemctl stop snapd.service - systemctl stop unattended-upgrades - systemctl stop systemd-remount-fs - system reset-failed - passwd -d root - passwd -d rafaeldtinoco - echo "debconf debconf/priority select low" | sudo debconf-set-selections - DEBIAN_FRONTEND=noninteractive dpkg-reconfigure debconf - DEBIAN_FRONTEND=noninteractive apt-get update -y - DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y - DEBIAN_FRONTEND=noninteractive apt-get autoremove -y - DEBIAN_FRONTEND=noninteractive apt-get autoclean -y - systemctl disable systemd-remount-fs - systemctl disable unattended-upgrades - systemctl disable apt-daily-upgrade.timer - systemctl disable apt-daily.timer - systemctl disable accounts-daemon.service - systemctl disable motd-news.timer - systemctl disable irqbalance.service - systemctl disable rsync.service - systemctl disable ebtables.service - systemctl disable pollinate.service - systemctl disable ufw.service - systemctl disable apparmor.service 
- systemctl disable apport-autoreport.path - systemctl disable apport-forward.socket - systemctl disable iscsi.service - systemctl disable open-iscsi.service - systemctl disable iscsid.socket - systemctl disable multipathd.socket - systemctl disable multipath-tools.service - systemctl disable multipathd.service - systemctl disable lvm2-monitor.service - systemctl disable lvm2-lvmpolld.socket - systemctl disable lvm2-lvmetad.socket apt: preserve_sources_list: false primary: - arches: [default] uri: http://us.archive.ubuntu.com/ubuntu sources_list: | deb $MIRROR $RELEASE main restricted universe multiverse deb $MIRROR $RELEASE-updates main restricted universe multiverse deb $MIRROR $RELEASE-proposed main restricted universe multiverse deb-src $MIRROR $RELEASE main restricted universe multiverse deb-src $MIRROR $RELEASE-updates main restricted universe multiverse deb-src $MIRROR $RELEASE-proposed main restricted universe multiverse conf: | Dpkg::Options { "--force-confdef"; "--force-confold"; }; sources: debug.list: source: | # deb http://ddebs.ubuntu.com $RELEASE main restricted universe multiverse # deb http://ddebs.ubuntu.com $RELEASE-updates main restricted universe multiverse # deb http://ddebs.ubuntu.com $RELEASE-proposed main restricted universe multiverse keyid: C8CAB6595FDFF622 -------------------- After provisioning machines "clubionic01, clubionic02, clubionic03" (Standard D2s v3 (2 vcpus, 8 GiB memory)) with Linux Ubuntu Bionic (18.04), using the same resource-group (clubionic), located in "West Central US" AND having the same proximity placement group (clubionicplacement), you will be able to access all the VMs through their public IPs... and make sure the shared disk works as a fencing mechanism by testing SCSI persistent reservations using the "sg3-utils" tools. Run these commands in *at least* 1 node after the shared disk attached to it: # clubionic01 # read current reservations: rafaeldtinoco@clubionic01:~$ sudo sg_persist -r /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk PR generation=0x0, there is NO reservation held # register new reservation key 0x123abc: rafaeldtinoco@clubionic01:~$ sudo sg_persist --out --register \ --param-sark=123abc /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk # To reserve the DEVICE (write exclusive): rafaeldtinoco@clubionic01:~$ sudo sg_persist --out --reserve \ --param-rk=123abc --prout-type=5 /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk # Check reservation created: rafaeldtinoco@clubionic01:~$ sudo sg_persist -r /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk PR generation=0x3, Reservation follows: Key=0x123abc scope: LU_SCOPE, type: Write Exclusive, registrants only # To release the reservation: rafaeldtinoco@clubionic01:~$ sudo sg_persist --out --release \ --param-rk=123abc --prout-type=5 /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk # To unregister a reservation key: rafaeldtinoco@clubionic01:~$ sudo sg_persist --out --register \ --param-rk=123abc /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk # Make sure reservation is gone: rafaeldtinoco@clubionic01:~$ sudo sg_persist -r /dev/sdc Msft Virtual Disk 1.0 Peripheral device type: disk PR generation=0x4, there is NO reservation held BEGIN -------------------- Now it is time to configure the cluster network. 
In the beginning of this recipe you saw there were 2 subnets created in the virtual network assigned to this environment:

clubionicnet            Virtual network
    subnets:
        private 10.250.3.0/24
        public  10.250.98.0/24

Since there might be a limit of 2 extra virtual network adapters attached to your VMs, we are using the *minimum* required number of networks for the HA cluster to operate in good conditions.

public network: This is the network where the HA cluster virtual IPs will be placed. This means that every cluster node will have 1 IP from this subnet assigned to itself and possibly a floating IP, depending on where the service is running (where the resource is active).

private network: This is the "internal-to-cluster" network, where all the cluster nodes continuously exchange messages regarding the cluster state. This network is important as corosync relies on it to know whether the cluster nodes are online or not.

It is also possible to add a 2nd virtual adapter to each of the nodes, creating a 2nd private network (a 2nd ring in the messaging layer). This can help avoid false positives in cluster failure detection caused by network jitter/delays when a single NIC adapter is used for the inter-node messaging.

Instructions:

- Provision the 3 VMs with 2 network interfaces each (public & private)
- Make sure that, when started, all 3 of them have an external IP (for access)
- A 4th machine is possible (just to access the env, depending on topology)
- Make sure both public and private networks are configured as:

clubionic01:
    - public  = 10.250.98.10/24
    - private = 10.250.3.10/24
clubionic02:
    - public  = 10.250.98.11/24
    - private = 10.250.3.11/24
clubionic03:
    - public  = 10.250.98.12/24
    - private = 10.250.3.12/24

And that all interfaces are configured as "static". Then, after powering up the virtual machines, make sure to disable cloud-init networking configuration AND to set the interfaces as "static" interfaces.

--------------------

Ubuntu Bionic cloud images, deployed by Microsoft Azure to our VMs, come by default with the "netplan.io" network tool installed, using systemd-networkd as its backend network provider. This means that all the network interfaces are configured and managed by systemd.

Unfortunately, because of the following bug:

https://bugs.launchpad.net/netplan/+bug/1815101 (currently being worked on)

any HA environment that wants to have "virtual aliases" on a network interface should rely on the previous "ifupdown" network management method. This happens because systemd-networkd only recently "learned" how to deal with restarting interfaces that are controlled by HA software and, before that, it used to remove the aliases without cluster synchronization (fixed in Eoan by using the KeepConfiguration= stanza in the systemd-networkd .network file).
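Just for reference, if you are on a release where systemd-networkd already carries that fix (Eoan or later) and you prefer to keep netplan/systemd-networkd, the stanza goes into the interface's .network file; the file name and addresses below are only illustrative:

# /etc/systemd/network/10-eth0.network (illustrative)
[Match]
Name=eth0

[Network]
Address=10.250.98.10/24
Gateway=10.250.98.1
# keep addresses not configured by networkd (e.g. cluster-managed aliases)
# when the interface is reconfigured or networkd is restarted
KeepConfiguration=static

For this tutorial, on Bionic, we will switch to ifupdown instead.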
With that, here are the instructions on how to remove netplan.io AND install ifupdown + resolvconf packages: $ sudo apt-get remove --purge netplan.io $ sudo apt-get install ifupdown bridge-utils vlan resolvconf $ sudo apt-get install cloud-init $ sudo rm /etc/netplan/50-cloud-init.yaml $ sudo vi /etc/cloud/cloud.cfg.d/99-custom-networking.cfg $ sudo cat /etc/cloud/cloud.cfg.d/99-custom-networking.cfg network: {config: disabled} And how to configure the interfaces using ifupdown: $ cat /etc/network/interfaces auto lo iface lo inet loopback dns-nameserver 168.63.129.16 # public auto eth0 iface eth0 inet static address 10.250.98.10 netmask 255.255.255.0 gateway 10.250.98.1 # private auto eth1 iface eth1 inet static address 10.250.3.10 netmask 255.255.255.0 $ cat /etc/hosts 127.0.0.1 localhost ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts And disable systemd-networkd: $ sudo systemctl disable systemd-networkd.service \ systemd-networkd.socket systemd-networkd-wait-online.service \ systemd-resolved.service $ sudo update-initramfs -k all -u And make sure grub configuration is right: $ cat /etc/default/grub GRUB_DEFAULT=0 GRUB_TIMEOUT=5 GRUB_DISTRIBUTOR="Ubuntu" GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 earlyprintk=ttyS0 rootdelay=300 elevator=noop apparmor=0" GRUB_CMDLINE_LINUX="" GRUB_TERMINAL=serial GRUB_SERIAL_COMMAND="serial --speed=9600 --unit=0 --word=8 --parity=no --stop=1" GRUB_RECORDFAIL_TIMEOUT=0 $ sudo update-grub and reboot (stop and start the instance so grub cmdline is changed). $ ifconfig -a eth0: flags=4163 mtu 1500 inet 10.250.98.10 netmask 255.255.255.0 broadcast 10.250.98.255 inet6 fe80::20d:3aff:fef8:6551 prefixlen 64 scopeid 0x20 ether 00:0d:3a:f8:65:51 txqueuelen 1000 (Ethernet) RX packets 483 bytes 51186 (51.1 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 415 bytes 65333 (65.3 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 eth1: flags=4163 mtu 1500 inet 10.250.3.10 netmask 255.255.255.0 broadcast 10.250.3.255 inet6 fe80::20d:3aff:fef8:3d01 prefixlen 64 scopeid 0x20 ether 00:0d:3a:f8:3d:01 txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11 bytes 866 (866.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 lo: flags=73 mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10 loop txqueuelen 1000 (Local Loopback) RX packets 84 bytes 6204 (6.2 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 84 bytes 6204 (6.2 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 Note: This has to be done in ALL cluster nodes in order for the HA software, pacemaker in our case, to correctly manage the interfaces, virtual aliases and services. -------------------- Now let's start configuring the cluster. First /etc/hosts with all names. 
For all nodes make sure you have something similar to: rafaeldtinoco@clubionic01:~$ cat /etc/hosts 127.0.0.1 localhost 127.0.1.1 clubionic01 ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts # cluster 10.250.98.13 clubionic # floating IP (application) 10.250.98.10 bionic01 # node01 public IP 10.250.98.11 bionic02 # node02 public IP 10.250.98.12 bionic03 # node03 public IP 10.250.3.10 clubionic01 # node01 ring0 private IP 10.250.3.11 clubionic02 # node02 ring0 private IP 10.250.3.12 clubionic03 # node03 ring0 private IP And that all names are accessible from all nodes: $ ping clubionic01 -------------------- And let's install corosync package and make sure we are able to create a messaging (only, for now) cluster with corosync. Install corosync in all the 3 nodes: $ sudo apt-get install pacemaker pacemaker-cli-utils corosync corosync-doc \ resource-agents fence-agents crmsh With packages properly installed it is time to create the corosync.conf file: $ sudo cat /etc/corosync/corosync.conf totem { version: 2 secauth: off cluster_name: clubionic transport: udpu } nodelist { node { ring0_addr: 10.250.3.10 # ring1_addr: 10.250.4.10 name: clubionic01 nodeid: 1 } node { ring0_addr: 10.250.3.11 # ring1_addr: 10.250.4.11 name: clubionic02 nodeid: 2 } node { ring0_addr: 10.250.3.12 # ring1_addr: 10.250.4.12 name: clubionic03 nodeid: 3 } } quorum { provider: corosync_votequorum two_node: 0 } qb { ipc_type: native } logging { fileline: on to_stderr: on to_logfile: yes logfile: /var/log/corosync/corosync.log to_syslog: no debug: off } But, before restarting corosync with this new configuration, we have to make sure we create a keyfile and share among all the cluster nodes: rafaeldtinoco@clubionic01:~$ sudo corosync-keygen Corosync Cluster Engine Authentication key generator. Gathering 1024 bits for key from /dev/random. Press keys on your keyboard to generate entropy. Press keys on your keyboard to generate entropy (bits = 920). Press keys on your keyboard to generate entropy (bits = 1000). Writing corosync key to /etc/corosync/authkey. rafaeldtinoco@clubionic01:~$ sudo scp /etc/corosync/authkey \ root@clubionic02:/etc/corosync/authkey rafaeldtinoco@clubionic01:~$ sudo scp /etc/corosync/authkey \ root@clubionic03:/etc/corosync/authkey And now we are ready to make corosync service started by default: rafaeldtinoco@clubionic01:~$ systemctl enable --now corosync rafaeldtinoco@clubionic01:~$ systemctl restart corosync rafaeldtinoco@clubionic02:~$ systemctl enable --now corosync rafaeldtinoco@clubionic02:~$ systemctl restart corosync rafaeldtinoco@clubionic03:~$ systemctl enable --now corosync rafaeldtinoco@clubionic03:~$ systemctl restart corosync Finally it is time to check if the messaging layer of our new cluster is good. Don't worry too much about restarting nodes as the resource-manager (pacemaker) is not installed yet and quorum won't be enforced in any way. rafaeldtinoco@clubionic01:~$ sudo corosync-quorumtool -si Quorum information ------------------ Date: Mon Feb 24 01:54:10 2020 Quorum provider: corosync_votequorum Nodes: 3 Node ID: 1 Ring ID: 1/16 Quorate: Yes Votequorum information ---------------------- Expected votes: 3 Highest expected: 3 Total votes: 3 Quorum: 2 Flags: Quorate Membership information ---------------------- Nodeid Votes Name 1 1 10.250.3.10 (local) 2 1 10.250.3.11 3 1 10.250.3.12 Perfect! We have the messaging layer ready for the resource-manager to be configured ! 
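Optionally, you can also check the health of the totem ring itself with corosync-cfgtool (shipped with the corosync package); on a healthy node the output should look roughly like this:

rafaeldtinoco@clubionic01:~$ sudo corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 10.250.3.10
        status  = ring 0 active with no faults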
--------------------

It is time to configure the resource-manager (pacemaker) now:

rafaeldtinoco@clubionic01:~$ systemctl enable --now pacemaker
rafaeldtinoco@clubionic02:~$ systemctl enable --now pacemaker
rafaeldtinoco@clubionic03:~$ systemctl enable --now pacemaker

rafaeldtinoco@clubionic01:~$ sudo crm_mon -1
Stack: corosync
Current DC: NONE
Last updated: Mon Feb 24 01:56:11 2020
Last change: Mon Feb 24 01:40:53 2020 by hacluster via crmd on clubionic01

3 nodes configured
0 resources configured

Node clubionic01: UNCLEAN (offline)
Node clubionic02: UNCLEAN (offline)
Node clubionic03: UNCLEAN (offline)

No active resources

As you can see, we have to wait until the resource manager uses the messaging transport layer and determines the status of all nodes. Give it a few seconds to settle and you will have:

rafaeldtinoco@clubionic01:~$ sudo crm_mon -1
Stack: corosync
Current DC: clubionic01 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Feb 24 01:57:22 2020
Last change: Mon Feb 24 01:40:54 2020 by hacluster via crmd on clubionic02

3 nodes configured
0 resources configured

Online: [ clubionic01 clubionic02 clubionic03 ]

No active resources

--------------------

Perfect! It is time to do some "basic" setup for pacemaker. In this doc I'm using the "crmsh" tool to configure the cluster. For Ubuntu Bionic this is the preferred way of configuring pacemaker. At any time you can execute "crmsh" and enter/leave the command levels as if they were directories:

rafaeldtinoco@clubionic01:~$ sudo crm
crm(live)# ls
cibstatus help site cd cluster quit end script verify exit ra maintenance bye ? ls node configure back report cib resource up status corosync options history
crm(live)# cd configure
crm(live)configure# ls
.. get_property cibstatus primitive set validate_all help rsc_template ptest back cd default-timeouts erase validate-all rsctest rename op_defaults modgroup xml quit upgrade group graph load master location template save collocation rm bye clone ? ls node default_timeouts exit acl_target colocation fencing_topology assist alert ra schema user simulate rsc_ticket end role rsc_defaults monitor cib property resource edit show up refresh order filter get-property tag ms verify commit history delete

And you can even edit the CIB file for the cluster:

rafaeldtinoco@clubionic01:~$ crm configure edit

rafaeldtinoco@clubionic01:~$ crm
crm(live)# cd configure
crm(live)configure# edit
crm(live)configure# commit
INFO: apparently there is nothing to commit
INFO: try changing something first

--------------------

Let's check the current cluster configuration:

rafaeldtinoco@clubionic01:~$ crm configure show
node 1: clubionic01
node 2: clubionic02
node 3: clubionic03
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic

With these basic settings we can see 2 important things before we attempt to configure any resource: we are missing a "watchdog" device AND there is no "fencing" configured for the cluster.

NOTE:

1. This is an important note to read. Since we are going to rely on pacemaker for our cluster health, it is mandatory that pacemaker knows how to decide which side of the cluster should keep its resources enabled IF there is a rupture in the messaging (internal / ring0) layer. The side with more "votes" is the side that will become "active", while the node(s) left without communication will be "fenced".
Usually fencing comes in the form of power fencing: the quorate side of the cluster is able to get a positive response from the fencing mechanism of the broken side through an external communication path (like a network talking to ILOs or BMCs).

In this case, we are going to use the shared SCSI disk and its SCSI-3 feature called SCSI PERSISTENT RESERVATIONS as the fencing mechanism: every time the interconnect communication faces a disruption, the quorate side (in this 3-node example, the side that still has 2 nodes communicating through the private ring network) will make sure to "fence" the other node using SCSI PERSISTENT RESERVATIONS (by removing the SCSI reservation key used by the node to be fenced, for example). Other fencing mechanisms support a "reboot/reset" action whenever the quorate cluster wants to fence some node.

Let's start calling things by name: pacemaker has a service called "stonith" (shoot the other node in the head) and that's how it executes fencing actions: by having fencing agents (fence_scsi in our case) and passing arguments to these agents that will execute programmed actions to "shoot the other node in the head".

Since the fence_scsi agent does not have a "reboot/reset" action, it is good to have a "watchdog" device capable of realizing that the node can no longer read and/or write to the shared disk and of killing the node whenever that happens. With a watchdog device we have a "complete" solution for HA: a fencing mechanism that will block the fenced node from reading or writing to the application disk (saving a shared filesystem from being corrupted, for example) AND a watchdog device that will, as soon as it realizes the node has been fenced, reset the node.

--------------------

There are multiple HW watchdog devices around, but if you don't have one in your HW (and/or virtual machine) you can always count on the in-kernel software watchdog device (kernel module called "softdog").

$ apt-get install watchdog

For the questions asked when installing the "watchdog" package, make sure to set:

    Watchdog module to preload: softdog

and leave all the others at their defaults. Install the "watchdog" package in all 3 nodes.

Of course watchdog won't do anything for pacemaker by itself. We have to tell watchdog that we would like it to check the fence_scsi shared disk access from time to time. The way we do this is:

$ apt-file search fence_scsi_check
fence-agents: /usr/share/cluster/fence_scsi_check

$ sudo mkdir /etc/watchdog.d/
$ sudo cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/
$ systemctl restart watchdog

$ ps -ef | grep watch
root  41    2  0 00:10 ?  00:00:00 [watchdogd]
root  8612  1  0 02:21 ?  00:00:00 /usr/sbin/watchdog

Also do that on all 3 nodes.

After configuring watchdog, let's keep it disabled and stopped for now... or else your nodes will keep rebooting because the reservations are not on the shared disk yet (as pacemaker is not configured).

$ systemctl disable watchdog
Synchronizing state of watchdog.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable watchdog
$ systemctl stop watchdog

--------------------

Now our cluster has the "fence_scsi" agent to fence a node AND watchdog devices (/dev/watchdog) created by the kernel module "softdog" and managed by the watchdog daemon, which executes our fence_scsi_check script.
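Before telling the cluster about any of this, it is worth double checking that the softdog module is loaded and that /dev/watchdog exists on every node; the commands below are just a suggested sanity check:

$ lsmod | grep softdog
$ ls -l /dev/watchdog

# if the module was not preloaded by the watchdog package, load it by hand:
$ sudo modprobe softdog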
Let's tell this to the cluster:

rafaeldtinoco@clubionic01:~$ crm configure
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-policy=stop
crm(live)configure# property have-watchdog=true
crm(live)configure# commit
crm(live)configure# end
crm(live)# end
bye

rafaeldtinoco@clubionic01:~$ crm configure show
node 1: clubionic01
node 2: clubionic02
node 3: clubionic03
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop

Besides telling the cluster that we have a watchdog and what the fencing policy is, we also have to configure the fence resource and tell the cluster where to run it.

--------------------

Let's continue by creating the fencing resource in the cluster:

rafaeldtinoco@clubionic03:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG   cluster.bionic.   4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys

rafaeldtinoco@clubionic03:~$ sudo sg_persist -r /dev/sda
LIO-ORG   cluster.bionic.   4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held

rafaeldtinoco@clubionic01:~$ crm configure primitive fence_clubionic \
    stonith:fence_scsi params \
    pcmk_host_list="clubionic01 clubionic02 clubionic03" \
    devices="/dev/disk/by-path/acpi-VMBUS:01-scsi-0:0:0:0" \
    meta provides=unfencing

After creating the fencing agent, make sure it is running:

rafaeldtinoco@clubionic01:~$ crm_mon -1
Stack: corosync
Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Feb 24 04:06:15 2020
Last change: Mon Feb 24 04:06:11 2020 by root via cibadmin on clubionic01

3 nodes configured
1 resource configured

Online: [ clubionic01 clubionic02 clubionic03 ]

Active resources:

fence_clubionic (stonith:fence_scsi): Started clubionic01

and also make sure that the reservations are in place:

rafaeldtinoco@clubionic03:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG   cluster.bionic.   4.0
Peripheral device type: disk
PR generation=0x3, 3 registered reservation keys follow:
  0x3abe0001
  0x3abe0000
  0x3abe0002

Having 3 registered keys shows that all nodes have registered their keys, while, when checking which host holds the reservation, you should see a single node key:

rafaeldtinoco@clubionic03:~$ sudo sg_persist -r /dev/sda
LIO-ORG   cluster.bionic.   4.0
Peripheral device type: disk
PR generation=0x3, Reservation follows:
  Key=0x3abe0001
  scope: LU_SCOPE, type: Write Exclusive, registrants only

--------------------

Testing fencing before moving on

It is very important to make sure that we are able to fence a node that faces issues. In our case, as we are also using a watchdog device, we want to make sure that the node will reboot in case it loses access to the shared SCSI disk. To verify that, we can do a simple test:

rafaeldtinoco@clubionic01:~$ crm_mon -1
Stack: corosync
Current DC: clubionic01 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Fri Mar 6 16:43:01 2020
Last change: Fri Mar 6 16:38:55 2020 by hacluster via crmd on clubionic01

3 nodes configured
1 resource configured

Online: [ clubionic01 clubionic02 clubionic03 ]

Active resources:

fence_clubionic (stonith:fence_scsi): Started clubionic01

You can see that the fence_clubionic resource is running on clubionic01.
With that information we can stop the interconnect (private) network communication of that node only and check 2 things: 1) fence_clubionic service has to be started in another node 2) clubionic01 (where fence_clubionic is running) will reboot rafaeldtinoco@clubionic01:~$ sudo iptables -A INPUT -i eth2 -j DROP rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Fri Mar 6 16:45:31 2020 Last change: Fri Mar 6 16:38:55 2020 by hacluster via crmd on clubionic01 3 nodes configured 1 resource configured Online: [ clubionic02 clubionic03 ] OFFLINE: [ clubionic01 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic02 Okay (1) worked. fence_clubionic resource migrated to clubionic02 node AND the reservation key from clubionic01 node was removed from the shared storage: rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda LIO-ORG cluster.bionic. 4.0 Peripheral device type: disk PR generation=0x4, 2 registered reservation keys follow: 0x3abe0001 0x3abe0002 After up to 60sec (default timeout for the softdog driver + watchdog daemon): [ 596.943649] reboot: Restarting system clubionic01 is rebooted by watchdog daemon (remember the file /etc/watchdog.d/fence_scsi_check ? that file was responsible for making watchdog daemon to reboot the node... when it realized the scsi disk wasn't accessible any longer by our node). After the reboot succeeds: rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda LIO-ORG cluster.bionic. 4.0 Peripheral device type: disk PR generation=0x5, 3 registered reservation keys follow: 0x3abe0001 0x3abe0002 0x3abe0000 rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Fri Mar 6 16:49:44 2020 Last change: Fri Mar 6 16:38:55 2020 by hacluster via crmd on clubionic01 3 nodes configured 1 resource configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic02 Its all back to normal, but fence_clubionic agent stays where it was: clubionic02 node. This cluster behavior is usually to avoid the "ping-pong" effect for intermittent failures. -------------------- Now we will install a simple lighttpd service in all the nodes and have it managed by pacemaker. The idea is simple: to have a virtual IP migrating in between the nodes, serving a lighttpd service with files coming from the shared filesystem disk. AN IMPORTANT THING TO NOTE HERE: If you are using SHARED SCSI disk to protect cluster concurrency, it is imperative that the data being serviced by HA application is also contained in the shared disk. 
rafaeldtinoco@clubionic01:~$ apt-get install lighttpd
rafaeldtinoco@clubionic01:~$ systemctl stop lighttpd.service
rafaeldtinoco@clubionic01:~$ systemctl disable lighttpd.service

rafaeldtinoco@clubionic02:~$ apt-get install lighttpd
rafaeldtinoco@clubionic02:~$ systemctl stop lighttpd.service
rafaeldtinoco@clubionic02:~$ systemctl disable lighttpd.service

rafaeldtinoco@clubionic03:~$ apt-get install lighttpd
rafaeldtinoco@clubionic03:~$ systemctl stop lighttpd.service
rafaeldtinoco@clubionic03:~$ systemctl disable lighttpd.service

By having the hostname as the index.html file, we will be able to know which node is active when accessing the virtual IP that will be migrating among all 3 nodes:

rafaeldtinoco@clubionic01:~$ sudo rm /var/www/html/*.html
rafaeldtinoco@clubionic01:~$ echo $HOSTNAME | sudo tee /var/www/html/index.html
clubionic01

rafaeldtinoco@clubionic02:~$ sudo rm /var/www/html/*.html
rafaeldtinoco@clubionic02:~$ echo $HOSTNAME | sudo tee /var/www/html/index.html
clubionic02

rafaeldtinoco@clubionic03:~$ sudo rm /var/www/html/*.html
rafaeldtinoco@clubionic03:~$ echo $HOSTNAME | sudo tee /var/www/html/index.html
clubionic03

And we will have a good way to tell from which source the lighttpd daemon is getting its files:

rafaeldtinoco@clubionic01:~$ curl localhost
clubionic01     -> local disk

rafaeldtinoco@clubionic01:~$ curl clubionic02
clubionic02     -> local (to clubionic02) disk

rafaeldtinoco@clubionic01:~$ curl clubionic03
clubionic03     -> local (to clubionic03) disk

--------------------

The next step is to configure the cluster as an HA Active-Passive only cluster. The shared disk in this scenario only works as a fencing mechanism.

rafaeldtinoco@clubionic01:~$ crm configure sh
node 1: clubionic01
node 2: clubionic02
node 3: clubionic03
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="clubionic01 clubionic02 clubionic03" plug="" \
        devices="/dev/sda" meta provides=unfencing
primitive virtual_ip IPaddr2 \
        params ip=10.250.98.13 nic=eth3 \
        op monitor interval=10s
primitive webserver systemd:lighttpd \
        op monitor interval=10 timeout=30
group webserver_vip webserver virtual_ip
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop

As you can see, I have created 2 resources and 1 group of resources. You can copy and paste the commands above inside "crmsh", do a "commit" at the end, and it will create the resources for you.

After creating the resources, check if everything is working:

rafaeldtinoco@clubionic01:~$ crm_mon -1
Stack: corosync
Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Fri Mar 6 18:57:54 2020
Last change: Fri Mar 6 18:52:17 2020 by root via cibadmin on clubionic01

3 nodes configured
3 resources configured

Online: [ clubionic01 clubionic02 clubionic03 ]

Active resources:

fence_clubionic (stonith:fence_scsi): Started clubionic02
Resource Group: webserver_vip
    webserver (systemd:lighttpd): Started clubionic01
    virtual_ip (ocf::heartbeat:IPaddr2): Started clubionic01

rafaeldtinoco@clubionic01:~$ ping -c 1 clubionic.public
PING clubionic.public (10.250.98.13) 56(84) bytes of data.
64 bytes from clubionic.public (10.250.98.13): icmp_seq=1 ttl=64 time=0.025 ms

--- clubionic.public ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.025/0.025/0.025/0.000 ms

And testing that the resource is really active on the clubionic01 host:

rafaeldtinoco@clubionic01:~$ curl clubionic.public
clubionic01

Note that, in this example, we are not using the shared disk for much, only as a way of fencing the failed host. This is important for virtual environments that do not necessarily give you a power fencing mechanism, for example, and where you have to rely on SCSI FENCE + WATCHDOG to guarantee cluster consistency, as said in the beginning of this document.

The final step is to start using the shared SCSI disk as an HA active/passive resource. It means that the webserver we are clustering will serve files from the shared disk, but there won't be multiple active nodes simultaneously, just one. This can serve as a clustering example for other services such as: CIFS, SAMBA, NFS, MTAs and MDAs such as postfix/qmail, etc.

--------------------

Note: I'm using the "systemd" resource agent standard because it's not relying on older agents. You can check the supported agent standards by executing:

rafaeldtinoco@clubionic01:~$ crm_resource --list-standards
ocf
lsb
service
systemd
stonith

rafaeldtinoco@clubionic01:~$ crm_resource --list-agents=systemd
apt-daily
apt-daily-upgrade
atd
autovt@
bootlogd
...

The agent list will reflect the software you have installed at the moment you execute that command on a node (as the systemd standard basically uses the existing systemd service units on the nodes).

--------------------

For an HA environment we need to first migrate the shared disk (meaning unmounting it from one node and mounting it on the other one) and then migrate the dependent services. For this scenario there isn't a need for configuring a lock manager of any kind.

Let's install the LVM2 packages on all nodes:

$ apt-get install lvm2

And configure LVM2 to have a system id based on the uname cmd output:

rafaeldtinoco@clubionic01:~$ sudo vi /etc/lvm/lvm.conf
...
system_id_source = "uname"

Do that in all 3 nodes.

rafaeldtinoco@clubionic01:~$ sudo lvm systemid
system ID: clubionic01

rafaeldtinoco@clubionic02:~$ sudo lvm systemid
system ID: clubionic02

rafaeldtinoco@clubionic03:~$ sudo lvm systemid
system ID: clubionic03

Configure 1 partition for the shared disk:

rafaeldtinoco@clubionic01:~$ sudo gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-2047966, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-2047966, default = 2047966) or {+-}size{KMGTP}:
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300):
Changed type of partition to 'Linux filesystem'

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
The operation has completed successfully.
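If you prefer to script this partitioning step instead of answering gdisk's prompts, the same single partition can be created non-interactively with sgdisk (from the same gdisk package); this is only a suggested equivalent, pointed at the shared disk as seen by your node:

# create partition 1 spanning the whole disk, type 8300 (Linux filesystem)
$ sudo sgdisk --new=1:0:0 --typecode=1:8300 /dev/sda
$ sudo sgdisk --print /dev/sda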
And create the physical and logical volumes using LVM2: rafaeldtinoco@clubionic01:~$ sudo pvcreate /dev/sda1 rafaeldtinoco@clubionic01:~$ sudo vgcreate clustervg /dev/sda1 rafaeldtinoco@clubionic01:~$ sudo vgs -o+systemid VG #PV #LV #SN Attr VSize VFree System ID clustervg 1 0 0 wz--n- 988.00m 988.00m clubionic01 rafaeldtinoco@clubionic01:~$ sudo lvcreate -l100%FREE -n clustervol clustervg Logical volume "clustervol" created. rafaeldtinoco@clubionic01:~$ sudo mkfs.ext4 -LCLUSTERDATA /dev/clustervg/clustervol mke2fs 1.44.1 (24-Mar-2018) Creating filesystem with 252928 4k blocks and 63232 inodes Filesystem UUID: d0c7ab5c-abf6-4ee0-aee1-ec1ce7917bea Superblock backups stored on blocks: 32768, 98304, 163840, 229376 Allocating group tables: done Writing inode tables: done Creating journal (4096 blocks): done Writing superblocks and filesystem accounting information: done Let's now create a directory to mount this volume in all 3 nodes. Remember, we are not *yet* configuring a cluster filesystem. The disk should be mounted in one node AT A TIME. rafaeldtinoco@clubionic01:~$ sudo mkdir /clusterdata rafaeldtinoco@clubionic02:~$ sudo mkdir /clusterdata rafaeldtinoco@clubionic03:~$ sudo mkdir /clusterdata And, in this particular case, it should be tested in the node that you did all the LVM2 commands and created the EXT4 filesystem: rafaeldtinoco@clubionic01:~$ sudo mount /dev/clustervg/clustervol /clusterdata rafaeldtinoco@clubionic01:~$ mount | grep cluster /dev/mapper/clustervg-clustervol on /clusterdata type ext4 (rw,relatime,stripe=2048,data=ordered) Now we can go ahead and disable the volume group: rafaeldtinoco@clubionic01:~$ sudo umount /clusterdata rafaeldtinoco@clubionic01:~$ sudo vgchange -an clustervg -------------------- Its time to remove the resources we have configured and re-configure them. This is needed because the resources of a group are started in the order you created them and, in this new case, lighttpd resource will depend on the shared disk filesystem we are creating on the node that has lighttpd started. rafaeldtinoco@clubionic01:~$ sudo crm resource stop webserver_vip rafaeldtinoco@clubionic01:~$ sudo crm configure delete webserver rafaeldtinoco@clubionic01:~$ sudo crm configure delete virtual_ip rafaeldtinoco@clubionic01:~$ sudo crm configure sh node 1: clubionic01 node 2: clubionic02 node 3: clubionic03 primitive fence_clubionic stonith:fence_scsi \ params pcmk_host_list="clubionic01 clubionic02 clubionic03" \ plug="" devices="/dev/sda" meta provides=unfencing property cib-bootstrap-options: \ have-watchdog=false \ dc-version=1.1.18-2b07d5c5a9 \ cluster-infrastructure=corosync \ cluster-name=clubionic \ stonith-enabled=on \ stonith-action=off \ no-quorum-policy=stop Now we can create the resource responsible for taking care of the LVM volume group migration: ocf:heartbeat:LVM-activate. crm(live)configure# primitive lvm2 ocf:heartbeat:LVM-activate vgname=clustervg \ vg_access_mode=system_id crm(live)configure# commit With only those 2 commands our cluster shall have one of the nodes accessing the volume group "clustervg" we have created. 
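You can also ask pacemaker directly where the resource ended up; crm_resource comes with pacemaker-cli-utils, and the exact output format may vary slightly between versions:

rafaeldtinoco@clubionic01:~$ sudo crm_resource --resource lvm2 --locate
resource lvm2 is running on: clubionic02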
In my case it got enabled in the 2nd node of the cluster: rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic01 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Fri Mar 6 20:59:44 2020 Last change: Fri Mar 6 20:58:33 2020 by root via cibadmin on clubionic01 3 nodes configured 2 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic01 lvm2 (ocf::heartbeat:LVM-activate): Started clubionic02 And I can check that by executing: rafaeldtinoco@clubionic02:~$ sudo vgs VG #PV #LV #SN Attr VSize VFree clustervg 1 1 0 wz--n- 988.00m 0 rafaeldtinoco@clubionic02:~$ sudo lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert clustervol clustervg -wi-a----- 988.00m rafaeldtinoco@clubionic02:~$ sudo vgs -o+systemid VG #PV #LV #SN Attr VSize VFree System ID clustervg 1 1 0 wz--n- 988.00m 0 clubionic02 rafaeldtinoco@clubionic02:~$ sudo mount -LCLUSTERDATA /clusterdata and rafaeldtinoco@clubionic02:~$ sudo umount /clusterdata should work in the node having "lvm2" resource started. Now its time to re-create the resources we had before, in the group "webservergroup". crm(live)configure# primitive webserver systemd:lighttpd \ op monitor interval=10 timeout=30 crm(live)configure# group webservergroup lvm2 virtual_ip webserver crm(live)configure# commit Now pacemaker should show all resources inside "webservergroup": - lvm2 - virtual_ip - webserver enabled in the *same* node: rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic01 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Fri Mar 6 21:05:24 2020 Last change: Fri Mar 6 21:04:55 2020 by root via cibadmin on clubionic01 3 nodes configured 4 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic01 Resource Group: webservergroup lvm2 (ocf::heartbeat:LVM-activate): Started clubionic02 virtual_ip (ocf::heartbeat:IPaddr2): Started clubionic02 webserver (systemd:lighttpd): Started clubionic02 And it does: clubionic02 node. -------------------- Perfect. Its time to configure the filesystem mount and umount now. 
Before moving on, make sure to install "psmisc" package in all nodes and: crm(live)configure# primitive ext4 ocf:heartbeat:Filesystem device=/dev/clustervg/clustervol directory=/clusterdata fstype=ext4 crm(live)configure# del webservergroup crm(live)configure# group webservergroup lvm2 ext4 virtual_ip webserver crm(live)configure# commit You will have: rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic01 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Fri Mar 6 21:16:39 2020 Last change: Fri Mar 6 21:16:36 2020 by hacluster via crmd on clubionic03 3 nodes configured 5 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic01 Resource Group: webservergroup lvm2 (ocf::heartbeat:LVM-activate): Started clubionic03 ext4 (ocf::heartbeat:Filesystem): Started clubionic03 virtual_ip (ocf::heartbeat:IPaddr2): Started clubionic03 webserver (systemd:lighttpd): Started clubionic03 rafaeldtinoco@clubionic03:~$ mount | grep -i clu /dev/mapper/clustervg-clustervol on /clusterdata type ext4 (rw,relatime,stripe=2048,data=ordered) And that makes the environment we just created perfect to host the lighttpd service files, as the physical and logical volume will migrate from one node to another together with the needed service (lighttpd) AND virtual IP being used to serve our end users: rafaeldtinoco@clubionic01:~$ curl clubionic.public clubionic03 rafaeldtinoco@clubionic01:~$ crm resource move webservergroup clubionic01 INFO: Move constraint created for webservergroup to clubionic01 rafaeldtinoco@clubionic01:~$ curl clubionic.public clubionic01 We can start serving files/data from the volume that is currently being managed by the cluster. In the node with the resource group "webservergroup" enabled you could: rafaeldtinoco@clubionic01:~$ sudo rsync -avz /var/www/ /clusterdata/www/ sending incremental file list created directory /clusterdata/www ./ cgi-bin/ html/ html/index.html rafaeldtinoco@clubionic01:~$ sudo rm -rf /var/www rafaeldtinoco@clubionic01:~$ sudo ln -s /clusterdata/www /var/www rafaeldtinoco@clubionic01:~$ cd /clusterdata/www/html/ rafaeldtinoco@clubionic01:.../html$ echo clubionic | sudo tee index.html and in all other nodes: rafaeldtinoco@clubionic02:~$ sudo rm -rf /var/www rafaeldtinoco@clubionic02:~$ sudo ln -s /clusterdata/www /var/www rafaeldtinoco@clubionic03:~$ sudo rm -rf /var/www rafaeldtinoco@clubionic03:~$ sudo ln -s /clusterdata/www /var/www and test the fact that, now, data being distributed by lighttpd is shared among the nodes in an active-passive way: rafaeldtinoco@clubionic01:~$ curl clubionic.public clubionic rafaeldtinoco@clubionic01:~$ crm resource move webservergroup clubionic02 INFO: Move constraint created for webservergroup to clubionic02 rafaeldtinoco@clubionic01:~$ curl clubionic.public clubionic rafaeldtinoco@clubionic01:~$ crm resource move webservergroup clubionic03 INFO: Move constraint created for webservergroup to clubionic03 rafaeldtinoco@clubionic01:~$ curl clubionic.public clubionic -------------------- -------------------- Okay, so... 
we've already done 3 important things with our scsi-shared-disk fenced (+ watchdog'ed) cluster:

- configured scsi persistent-reservation based fencing
- configured watchdog to reboot a host that has lost its reservations
- configured an HA resource group that migrates disk, ip and service among nodes

--------------------
--------------------

It is time to go further and make all the nodes access the same filesystem on the shared disk being managed by the cluster. This allows, for example, different applications to be enabled on different nodes while accessing the same disk, among several other use cases you can find online.

Let's install the distributed lock manager on all cluster nodes:

rafaeldtinoco@clubionic01:~$ apt-get install -y dlm-controld
rafaeldtinoco@clubionic02:~$ apt-get install -y dlm-controld
rafaeldtinoco@clubionic03:~$ apt-get install -y dlm-controld

NOTE:

1. Before enabling the dlm-controld service you should disable the watchdog daemon, "just in case", as it can cause you problems, rebooting your cluster nodes, if the dlm_controld daemon does not start successfully.

Check that the dlm service has started successfully:

rafaeldtinoco@clubionic01:~$ systemctl status dlm
● dlm.service - dlm control daemon
   Loaded: loaded (/etc/systemd/system/dlm.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-03-06 20:25:05 UTC; 1 day 22h ago
     Docs: man:dlm_controld
           man:dlm.conf
           man:dlm_stonith
 Main PID: 4029 (dlm_controld)
    Tasks: 2 (limit: 2338)
   CGroup: /system.slice/dlm.service
           └─4029 /usr/sbin/dlm_controld --foreground

and, if it didn't, try removing the dlm module:

rafaeldtinoco@clubionic01:~$ sudo modprobe -r dlm

and loading it again:

rafaeldtinoco@clubionic01:~$ sudo modprobe dlm

as this might happen because the udev rules were not interpreted yet during package installation and the /dev/misc/XXXX devices were not created. One way of guaranteeing that dlm will always find the correct devices is to add it to the /etc/modules file:

rafaeldtinoco@clubionic01:~$ cat /etc/modules
virtio_balloon
virtio_blk
virtio_net
virtio_pci
virtio_ring
virtio
ext4
9p
9pnet
9pnet_virtio
dlm

So it is loaded during boot time:

rafaeldtinoco@clubionic01:~$ sudo update-initramfs -k all -u
rafaeldtinoco@clubionic01:~$ sudo reboot

rafaeldtinoco@clubionic01:~$ systemctl --value is-active corosync.service
active
rafaeldtinoco@clubionic01:~$ systemctl --value is-active pacemaker.service
active
rafaeldtinoco@clubionic01:~$ systemctl --value is-active dlm.service
active
rafaeldtinoco@clubionic01:~$ systemctl --value is-active watchdog.service
inactive

And, after making sure it works, disable the dlm service:

rafaeldtinoco@clubionic01:~$ systemctl disable dlm
rafaeldtinoco@clubionic02:~$ systemctl disable dlm
rafaeldtinoco@clubionic03:~$ systemctl disable dlm

because this service will be managed by the cluster resource manager. The watchdog service will be re-enabled at the end, because it is the watchdog daemon that reboots/resets the node after its SCSI reservation is fenced.
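As an optional extra check, dlm_tool (from the same dlm-controld package) can show the daemon status and any active lockspaces; at this point the lockspace list is expected to be empty:

$ sudo dlm_tool status
$ sudo dlm_tool ls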
--------------------

In order to install the cluster filesystem (GFS2 in this case) we first have to remove the configuration we previously created in the cluster:

rafaeldtinoco@clubionic01:~$ sudo crm conf show
node 1: clubionic01
node 2: clubionic02
node 3: clubionic03
primitive ext4 Filesystem \
        params device="/dev/clustervg/clustervol" directory="/clusterdata" \
        fstype=ext4
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="clubionic01 clubionic02 clubionic03" plug="" \
        devices="/dev/sda" meta provides=unfencing target-role=Started
primitive lvm2 LVM-activate \
        params vgname=clustervg vg_access_mode=system_id
primitive virtual_ip IPaddr2 \
        params ip=10.250.98.13 nic=eth3 \
        op monitor interval=10s
primitive webserver systemd:lighttpd \
        op monitor interval=10 timeout=30
group webservergroup lvm2 ext4 virtual_ip webserver \
        meta target-role=Started
location cli-prefer-webservergroup webservergroup role=Started inf: clubionic03
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop \
        last-lrm-refresh=1583529396

rafaeldtinoco@clubionic01:~$ sudo crm resource stop webservergroup
rafaeldtinoco@clubionic01:~$ sudo crm conf delete webservergroup
rafaeldtinoco@clubionic01:~$ sudo crm resource stop webserver
rafaeldtinoco@clubionic01:~$ sudo crm conf delete webserver
rafaeldtinoco@clubionic01:~$ sudo crm resource stop virtual_ip
rafaeldtinoco@clubionic01:~$ sudo crm conf delete virtual_ip
rafaeldtinoco@clubionic01:~$ sudo crm resource stop lvm2
rafaeldtinoco@clubionic01:~$ sudo crm conf delete lvm2
rafaeldtinoco@clubionic01:~$ sudo crm resource stop ext4
rafaeldtinoco@clubionic01:~$ sudo crm conf delete ext4

rafaeldtinoco@clubionic01:~$ crm conf sh
node 1: clubionic01
node 2: clubionic02
node 3: clubionic03
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="clubionic01 clubionic02 clubionic03" \
        plug="" devices="/dev/sda" meta provides=unfencing target-role=Started
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop \
        last-lrm-refresh=1583529396

Now we are ready to create the needed resources.

--------------------

Because now we want multiple cluster nodes to access the LVM volumes simultaneously, in an active/active way, we have to install "clvm". This package provides the clustering interface for lvm2, when used with a corosync based (eg Pacemaker) cluster infrastructure. It allows logical volumes to be created on shared storage devices (eg Fibre Channel, or iSCSI).

rafaeldtinoco@clubionic01:~$ egrep "^\s+locking_type" /etc/lvm/lvm.conf
locking_type = 1

The type being:

0 = no locking
1 = local file-based locking
2 = external shared lib locking_library
3 = built-in clustered locking with clvmd
    - disable use_lvmetad and lvmetad service (incompatible)
4 = read-only locking (forbids metadata changes)
5 = dummy locking

Let's change the LVM locking type to clustered in all 3 nodes:

rafaeldtinoco@clubionic01:~$ sudo lvmconf --enable-cluster
rafaeldtinoco@clubionic02:~$ ...
rafaeldtinoco@clubionic03:~$ ...

rafaeldtinoco@clubionic01:~$ egrep "^\s+locking_type" /etc/lvm/lvm.conf
rafaeldtinoco@clubionic02:~$ ...
rafaeldtinoco@clubionic03:~$ ...
locking_type = 3

rafaeldtinoco@clubionic01:~$ systemctl disable lvm2-lvmetad.service
rafaeldtinoco@clubionic02:~$ ...
rafaeldtinoco@clubionic03:~$ ... Finally, enable clustered lvm resource in the cluster: # clubionic01 storage resources crm(live)configure# primitive clubionic01_dlm ocf:pacemaker:controld op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# primitive clubionic01_lvm ocf:heartbeat:clvm op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# group clubionic01_storage clubionic01_dlm clubionic01_lvm crm(live)configure# location l_clubionic01_storage clubionic01_storage \ rule -inf: #uname ne clubionic01 # clubionic02 storage resources crm(live)configure# primitive clubionic02_dlm ocf:pacemaker:controld op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# primitive clubionic02_lvm ocf:heartbeat:clvm op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# group clubionic02_storage clubionic02_dlm clubionic02_lvm crm(live)configure# location l_clubionic02_storage clubionic02_storage \ rule -inf: #uname ne clubionic02 # clubionic03 storage resources crm(live)configure# primitive clubionic03_dlm ocf:pacemaker:controld op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# primitive clubionic03_lvm ocf:heartbeat:clvm op \ monitor interval=10s on-fail=fence interleave=true ordered=true crm(live)configure# group clubionic03_storage clubionic03_dlm clubionic03_lvm crm(live)configure# location l_clubionic03_storage clubionic03_storage \ rule -inf: #uname ne clubionic03 crm(live)configure# commit Note: I created the resource groups one by one and specified they could run in just one node each. This is basically to guarantee that all nodes will have the services "clvmd" and "dlm_controld" always running (or restarted in case of issues). rafaeldtinoco@clubionic01:~$ crm_mon -1 Stack: corosync Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Mon Mar 9 02:18:51 2020 Last change: Mon Mar 9 02:17:58 2020 by root via cibadmin on clubionic01 3 nodes configured 7 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic02 Resource Group: clubionic01_storage clubionic01_dlm (ocf::pacemaker:controld): Started clubionic01 clubionic01_lvm (ocf::heartbeat:clvm): Started clubionic01 Resource Group: clubionic02_storage clubionic02_dlm (ocf::pacemaker:controld): Started clubionic02 clubionic02_lvm (ocf::heartbeat:clvm): Started clubionic02 Resource Group: clubionic03_storage clubionic03_dlm (ocf::pacemaker:controld): Started clubionic03 clubionic03_lvm (ocf::heartbeat:clvm): Started clubionic03 So... now we are ready to have a clustered filesystem running in this cluster! -------------------- Before creating the "clustered" volume group in LVM, I'm going to remove the previous volume group and volumes we had: rafaeldtinoco@clubionic03:~$ sudo vgchange -an clustervg rafaeldtinoco@clubionic03:~$ sudo vgremove clustervg rafaeldtinoco@clubionic03:~$ sudo pvremove /dev/sda1 And re-create them as "clustered": rafaeldtinoco@clubionic03:~$ sudo pvcreate /dev/sda1 rafaeldtinoco@clubionic03:~$ sudo vgcreate -Ay -cy --shared clustervg /dev/sda1 From man page: --shared Create a shared VG using lvmlockd if LVM is compiled with lockd support. lvmlockd will select lock type san‐ lock or dlm depending on which lock manager is running. This allows multiple hosts to share a VG on shared devices. 
rafaeldtinoco@clubionic03:~$ sudo vgs VG #PV #LV #SN Attr VSize VFree clustervg 1 0 0 wz--nc 988.00m 988.00m rafaeldtinoco@clubionic03:~$ sudo lvcreate -l 100%FREE -n clustervol clustervg -------------------- rafaeldtinoco@clubionic01:~$ sudo apt-get install gfs2-utils rafaeldtinoco@clubionic02:~$ sudo apt-get install gfs2-utils rafaeldtinoco@clubionic03:~$ sudo apt-get install gfs2-utils mkfs.gfs2 -j3 -p lock_dlm -t clustername:lockspace /dev/vgname/lvname - 3 journals (1 per node is the minimum) - use lock_dlm as the locking protocol - -t clustername:lockspace The "lock table" pair used to uniquely identify this filesystem in a cluster. The cluster name segment (maximum 32 characters) must match the name given to your cluster in its configuration; only members of this cluster are permitted to use this file system. The lockspace segment (maximum 30 characters) is a unique file system name used to distinguish this gfs2 file system. Valid clusternames and lockspaces may only contain alphanumeric characters, hyphens (-) and underscores (_). rafaeldtinoco@clubionic01:~$ sudo mkfs.gfs2 -j3 -p lock_dlm \ -t clubionic:clustervol /dev/clustervg/clustervol Are you sure you want to proceed? [y/n]y Discarding device contents (may take a while on large devices): Done Adding journals: Done Building resource groups: Done Creating quota file: Done Writing superblock and syncing: Done Device: /dev/clustervg/clustervol Block size: 4096 Device size: 0.96 GB (252928 blocks) Filesystem size: 0.96 GB (252927 blocks) Journals: 3 Resource groups: 6 Locking protocol: "lock_dlm" Lock table: "clubionic:clustervol" UUID: dac96896-bd83-d9f4-c0cb-e118f5572e0e Make sure the new filesystem can be mounted (and unmounted) on each node: rafaeldtinoco@clubionic01:~$ sudo mount /dev/clustervg/clustervol /clusterdata && sudo umount /clusterdata rafaeldtinoco@clubionic02:~$ sudo mount /dev/clustervg/clustervol /clusterdata && sudo umount /clusterdata rafaeldtinoco@clubionic03:~$ sudo mount /dev/clustervg/clustervol /clusterdata && sudo umount /clusterdata
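Optionally, before handing the mounts over to the cluster, you can also mount the filesystem on all three nodes at the same time and confirm that a file written on one node is immediately visible on the others, which is the whole point of GFS2. A quick test, just a sketch (the test file name is arbitrary):

rafaeldtinoco@clubionic01:~$ sudo mount /dev/clustervg/clustervol /clusterdata
rafaeldtinoco@clubionic02:~$ sudo mount /dev/clustervg/clustervol /clusterdata
rafaeldtinoco@clubionic03:~$ sudo mount /dev/clustervg/clustervol /clusterdata

rafaeldtinoco@clubionic01:~$ echo "hello from clubionic01" | sudo tee /clusterdata/test.txt
rafaeldtinoco@clubionic02:~$ cat /clusterdata/test.txt    # should print: hello from clubionic01

rafaeldtinoco@clubionic01:~$ sudo rm /clusterdata/test.txt
rafaeldtinoco@clubionic01:~$ sudo umount /clusterdata
rafaeldtinoco@clubionic02:~$ sudo umount /clusterdata
rafaeldtinoco@clubionic03:~$ sudo umount /clusterdata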
-------------------- Now, since we want to add new resources to already existing resource groups, I prefer executing the command "crm configure edit" and manually editing the cluster configuration file into the following (or something like this in your case): node 1: clubionic01 node 2: clubionic02 node 3: clubionic03 primitive clubionic01_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic01_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic01_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic02_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic02_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic02_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic03_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic03_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic03_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive fence_clubionic stonith:fence_scsi \ params pcmk_host_list="clubionic01 clubionic02 clubionic03" plug="" \ devices="/dev/sda" meta provides=unfencing target-role=Started group clubionic01_storage clubionic01_dlm clubionic01_lvm clubionic01_gfs2 group clubionic02_storage clubionic02_dlm clubionic02_lvm clubionic02_gfs2 group clubionic03_storage clubionic03_dlm clubionic03_lvm clubionic03_gfs2 location l_clubionic01_storage clubionic01_storage \ rule -inf: #uname ne clubionic01 location l_clubionic02_storage clubionic02_storage \ rule -inf: #uname ne clubionic02 location l_clubionic03_storage clubionic03_storage \ rule -inf: #uname ne clubionic03 property cib-bootstrap-options: \ have-watchdog=false \ dc-version=1.1.18-2b07d5c5a9 \ cluster-infrastructure=corosync \ cluster-name=clubionic \ stonith-enabled=on \ stonith-action=off \ no-quorum-policy=stop \ last-lrm-refresh=1583708321 # vim: set filetype=pcmk: NOTE: 1. I have created the following resources: - clubionic01_gfs2 - clubionic02_gfs2 - clubionic03_gfs2 and added each of them to its corresponding group. The final result is: rafaeldtinoco@clubionic02:~$ crm_mon -1 Stack: corosync Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Mon Mar 9 03:26:43 2020 Last change: Mon Mar 9 03:24:14 2020 by root via cibadmin on clubionic01 3 nodes configured 10 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic02 Resource Group: clubionic01_storage clubionic01_dlm (ocf::pacemaker:controld): Started clubionic01 clubionic01_lvm (ocf::heartbeat:clvm): Started clubionic01 clubionic01_gfs2 (ocf::heartbeat:Filesystem): Started clubionic01 Resource Group: clubionic02_storage clubionic02_dlm (ocf::pacemaker:controld): Started clubionic02 clubionic02_lvm (ocf::heartbeat:clvm): Started clubionic02 clubionic02_gfs2 (ocf::heartbeat:Filesystem): Started clubionic02 Resource Group: clubionic03_storage clubionic03_dlm (ocf::pacemaker:controld): Started clubionic03 clubionic03_lvm (ocf::heartbeat:clvm): Started clubionic03 clubionic03_gfs2 (ocf::heartbeat:Filesystem): Started clubionic03 And each node now has the GFS2 filesystem properly mounted: rafaeldtinoco@clubionic01:~$ for node in clubionic01 clubionic02 \ clubionic03; do ssh $node "df -kh | grep cluster"; done /dev/mapper/clustervg-clustervol 988M 388M 601M 40% /clusterdata /dev/mapper/clustervg-clustervol 988M 388M 601M 40% /clusterdata /dev/mapper/clustervg-clustervol 988M 388M 601M 40% /clusterdata -------------------- Now we can go back to the previous (and original) idea of having lighttpd resources serving files from the same shared filesystem. NOTES 1. So... this is just an example, and this setup isn't meant to be good for anything but showing pacemaker working in an environment like this. I'm enabling 3 instances of lighttpd using the "systemd" resource class, and it is very likely that lighttpd does not accept multiple instances on the same node. 2. That is the reason I'm not allowing the instances to run on all nodes. Using the right resource agent you can make the instances, and their virtual IPs, migrate among all nodes if one of them fails. 3. Instead of having 3 lighttpd instances here you could have 1 lighttpd, 1 postfix and 1 mysql instance, all of them floating among the cluster nodes with no particular preference, for example (see the sketch right after these notes). All 3 instances would be able to access the same clustered filesystem mounted at /clusterdata.
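To illustrate note 3 above: an instance that floats freely among the nodes is simply a group without a location constraint pinning it anywhere. A minimal sketch, using a hypothetical virtual IP (10.250.98.20) that is not part of this setup:

crm(live)configure# primitive floating_ip IPaddr2 \
    params ip=10.250.98.20 nic=eth3 \
    op monitor interval=10s
crm(live)configure# primitive floating_web systemd:lighttpd \
    op monitor interval=10 timeout=30
crm(live)configure# group floating_instance floating_web floating_ip
crm(live)configure# commit

With no location constraint, pacemaker places the group on any online node and moves it (together with its IP) to another node if the current one fails or gets fenced. In what follows, though, I'll keep the three pinned lighttpd instances described in the notes.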
-------------------- rafaeldtinoco@clubionic01:~$ crm config show | cat - node 1: clubionic01 node 2: clubionic02 node 3: clubionic03 primitive clubionic01_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic01_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic01_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic02_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic02_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic02_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic03_dlm ocf:pacemaker:controld \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive clubionic03_gfs2 Filesystem \ params device="/dev/clustervg/clustervol" directory="/clusterdata" \ fstype=gfs2 options=noatime \ op monitor interval=10s on-fail=fence interleave=true primitive clubionic03_lvm clvm \ op monitor interval=10s on-fail=fence interleave=true ordered=true primitive fence_clubionic stonith:fence_scsi \ params pcmk_host_list="clubionic01 clubionic02 clubionic03" plug="" \ devices="/dev/sda" \ meta provides=unfencing target-role=Started primitive instance01_ip IPaddr2 \ params ip=10.250.98.13 nic=eth3 \ op monitor interval=10s primitive instance01_web systemd:lighttpd \ op monitor interval=10 timeout=30 primitive instance02_ip IPaddr2 \ params ip=10.250.98.14 nic=eth3 \ op monitor interval=10s primitive instance02_web systemd:lighttpd \ op monitor interval=10 timeout=30 primitive instance03_ip IPaddr2 \ params ip=10.250.98.15 nic=eth3 \ op monitor interval=10s primitive instance03_web systemd:lighttpd \ op monitor interval=10 timeout=30 group clubionic01_storage clubionic01_dlm clubionic01_lvm clubionic01_gfs2 group clubionic02_storage clubionic02_dlm clubionic02_lvm clubionic02_gfs2 group clubionic03_storage clubionic03_dlm clubionic03_lvm clubionic03_gfs2 group instance01 instance01_web instance01_ip group instance02 instance02_web instance02_ip group instance03 instance03_web instance03_ip location l_clubionic01_storage clubionic01_storage \ rule -inf: #uname ne clubionic01 location l_clubionic02_storage clubionic02_storage \ rule -inf: #uname ne clubionic02 location l_clubionic03_storage clubionic03_storage \ rule -inf: #uname ne clubionic03 location l_instance01 instance01 \ rule -inf: #uname ne clubionic01 location l_instance02 instance02 \ rule -inf: #uname ne clubionic02 location l_instance03 instance03 \ rule -inf: #uname ne clubionic03 property cib-bootstrap-options: \ have-watchdog=false \ dc-version=1.1.18-2b07d5c5a9 \ cluster-infrastructure=corosync \ cluster-name=clubionic \ stonith-enabled=on \ stonith-action=off \ no-quorum-policy=stop \ last-lrm-refresh=1583708321 rafaeldtinoco@clubionic01:~$ crm_mon -1 Stack: corosync Current DC: clubionic02 (version 1.1.18-2b07d5c5a9) - partition with quorum Last updated: Mon Mar 9 03:42:11 2020 Last change: Mon Mar 9 03:39:32 2020 by root via cibadmin on clubionic01 3 nodes configured 16 resources configured Online: [ clubionic01 clubionic02 clubionic03 ] Active resources: fence_clubionic (stonith:fence_scsi): Started clubionic02 Resource Group: 
clubionic01_storage clubionic01_dlm (ocf::pacemaker:controld): Started clubionic01 clubionic01_lvm (ocf::heartbeat:clvm): Started clubionic01 clubionic01_gfs2 (ocf::heartbeat:Filesystem): Started clubionic01 Resource Group: clubionic02_storage clubionic02_dlm (ocf::pacemaker:controld): Started clubionic02 clubionic02_lvm (ocf::heartbeat:clvm): Started clubionic02 clubionic02_gfs2 (ocf::heartbeat:Filesystem): Started clubionic02 Resource Group: clubionic03_storage clubionic03_dlm (ocf::pacemaker:controld): Started clubionic03 clubionic03_lvm (ocf::heartbeat:clvm): Started clubionic03 clubionic03_gfs2 (ocf::heartbeat:Filesystem): Started clubionic03 Resource Group: instance01 instance01_web (systemd:lighttpd): Started clubionic01 instance01_ip (ocf::heartbeat:IPaddr2): Started clubionic01 Resource Group: instance02 instance02_web (systemd:lighttpd): Started clubionic02 instance02_ip (ocf::heartbeat:IPaddr2): Started clubionic02 Resource Group: instance03 instance03_web (systemd:lighttpd): Started clubionic03 instance03_ip (ocf::heartbeat:IPaddr2): Started clubionic03 Like we did previously, let's create, on each node, a /var/www symbolic link pointing to /clusterdata/www. rafaeldtinoco@clubionic01:~$ sudo ln -s /clusterdata/www /var/www rafaeldtinoco@clubionic02:~$ sudo ln -s /clusterdata/www /var/www rafaeldtinoco@clubionic03:~$ sudo ln -s /clusterdata/www /var/www But now, as this is a clustered filesystem, we have to create the file just once =) and it will be served by all lighttpd instances, running on all 3 nodes: rafaeldtinoco@clubionic01:~$ echo "all instances show the same thing" | \ sudo tee /var/www/html/index.html all instances show the same thing Check it out: rafaeldtinoco@clubionic01:~$ curl http://instance01/ all instances show the same thing rafaeldtinoco@clubionic01:~$ curl http://instance02/ all instances show the same thing rafaeldtinoco@clubionic01:~$ curl http://instance03/ all instances show the same thing And voilà =) You now have a pretty cool cluster to play with! Congrats! Rafael D. Tinoco rafaeldtinoco@ubuntu.com Ubuntu Linux Core Engineer