Thursday, 4 April 2024

How to clone a fully installed Linux system to another computer, or perform a live offline system upgrade?

A. Full-system Clone

When managing a cluster of Linux servers, we often need to clone a fully installed Linux system to other computers or servers, so that we do not have to reinstall all required packages and libraries or redo configuration such as inputrc, vimrc, and bashrc. Here, I will describe two common methods:

1. The recommended way is to use MX-Linux's mx-snapshot

This is a great utility. It creates a large bootable live ISO image (a Linux rescue image) containing all packages, libraries, and configurations, and optionally the home folder contents. You can then boot the live system and install it onto as many hard disks as you like.

Remember to select the "Preserve /home (ext4)" option if you want to keep existing user folders.

2. Manually copy all folders and set up GRUB. When copying the files, you need to preserve file permissions, so use "cp -rfPp", "rsync -avlP", or "tar --numeric-owner -czf".

mount /dev/sda3 /mnt            # root partition
mount /dev/sda2 /mnt/boot       # only if /boot is a separate partition
mount /dev/sda1 /mnt/mnt/efi    # EFI system partition, seen as /mnt/efi inside the chroot
mount --bind /dev /mnt/dev
mount --bind /dev/pts /mnt/dev/pts
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt                     # run the remaining commands inside the cloned system

grub-mkdevicemap
grub-install --efi-directory=/mnt/efi /dev/sda
update-grub
update-initramfs -u -k all

However, the SUID/SGID/sticky bits may be reset when copying directories or extracting archives (even if a preserve-permission option is set, for example when the copy is not run as root or file ownership changes), so check them afterwards and set them again manually where needed. Commands that need the SUID bit include passwd, su, sudo, ping*, chsh, mount, umount, fusermount, etc.
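
One way to avoid hunting for these bits by hand is to record them on the source system and replay them on the copy. The following is a minimal sketch, assuming GNU find and GNU stat; the function names `record_special_modes` and `restore_special_modes` and the file /root/special-modes.txt are made up for this example, and file names containing newlines are not handled. Note that `chmod` here applies the full recorded mode, not only the special bits.

```shell
#!/bin/sh
# Print "MODE PATH" (path relative to $1) for every file or directory under $1
# that has the SUID, SGID or sticky bit set. -xdev keeps find on one filesystem.
record_special_modes() {
    ( cd "$1" && find . -xdev -perm /7000 -exec stat -c '%a %n' {} + )
}

# Read "MODE PATH" lines from stdin and apply each mode to the same
# relative path under $1. Missing paths are reported and skipped.
restore_special_modes() {
    while read -r mode path; do
        if [ -e "$1/$path" ]; then
            chmod "$mode" "$1/$path"
        else
            echo "skipped (missing): $path" >&2
        fi
    done
}

# Example (run as root, with the clone mounted at /mnt):
#   record_special_modes / > /root/special-modes.txt
#   restore_special_modes /mnt < /root/special-modes.txt
```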


B. Live Offline System Upgrade

For an offline live upgrade (performed while other users are still using the system), Method A1 introduces a long downtime (typically a few hours, depending on the size of the installed system), because the installation requires booting into the live system, during which users cannot access the server. So we typically use Method A2, for which the only downtime is the server reboot. The steps are as follows:

1. In the running Linux system, create two new folders under /, e.g., /full-backup and /full-upgrade.

2. Copy all root system folders of the new OS (i.e., /etc, /bin, /sbin, /usr, /opt, /root, /var, /lib, /lib64, /boot, etc., but not /home, /dev, /sys, /mnt, /proc, /run, etc.) from USB storage to /full-upgrade using one of the copy commands from Method A2 ("cp -rfPp", "rsync -avlP", or "tar --numeric-owner -czf"). This typically takes a few hours.
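
As a rough illustration of this step, the sketch below copies every top-level folder except the excluded ones through a tar pipe, which keeps permissions and symlinks. The function name `copy_root_tree` and the mount point /media/usb/new-root are assumptions for this example; the exclusion list is the one given above. The exit status of the archiving tar is not checked, so treat it as a starting point only.

```shell
#!/bin/sh
# copy_root_tree SRC DST: copy each top-level entry of SRC into DST,
# skipping the folders that must not be carried over.
copy_root_tree() {
    src=$1
    dst=$2
    for entry in "$src"/*; do
        # Skip the unexpanded pattern when SRC is empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name=${entry##*/}
        case "$name" in
            home|dev|sys|mnt|proc|run) continue ;;
        esac
        # -p keeps permissions, --numeric-owner keeps raw UID/GID numbers.
        tar -C "$src" --numeric-owner -cpf - "./$name" |
            tar -C "$dst" --numeric-owner -xpf -
    done
}

# Example (run as root, USB stick mounted at the hypothetical /media/usb):
#   copy_root_tree /media/usb/new-root /full-upgrade
```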

3. Copy a statically linked busybox binary to the root folder as /busybox (this build of busybox must not require the ld-linux-x86-64.so interpreter). Typically, this file is already available somewhere inside /full-upgrade.

root@my-laptop:~# file /usr/bin/busybox
/usr/bin/busybox: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=36c64fc4707a00db11657009501f026401385933, for GNU/Linux 3.2.0, stripped

4. Copy over credential/configuration files from / to /full-upgrade, e.g.,

    /etc/passwd, /etc/group, /etc/shadow (do NOT copy these files over wholesale; copy only the entries for real users, because the new system's IDs for service users such as gdm/sshd/_apt/openvpn/lp/etc. must be kept, otherwise these services will not function properly)
    /etc/network/interfaces, /etc/NetworkManager/*, /etc/openvpn, /etc/audit, /etc/motd, /etc/logrotate.*
    /etc/fstab, /etc/exports, /etc/hostname, /root/.ssh, etc.
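
Copying only the real user entries could be sketched as follows. The UID range 1000 to 60000 is an assumption (a common default for regular accounts; check UID_MIN and UID_MAX in /etc/login.defs), and the function names and /tmp/real-users.txt are made up for this example. Group entries and supplementary group memberships are not handled here and still need manual care.

```shell
#!/bin/sh
# Print login names from passwd file $1 whose UID is in the assumed
# regular-user range.
list_real_users() {
    awk -F: -v min=1000 -v max=60000 '$3 >= min && $3 <= max { print $1 }' "$1"
}

# append_missing OLD NEW NAMES: append to NEW the lines of OLD whose first
# field is listed in NAMES and is not already present in NEW.
append_missing() {
    extra=$(awk -F: '
        FILENAME == ARGV[1] { want[$1] = 1; next }
        FILENAME == ARGV[2] { have[$1] = 1; next }
        ($1 in want) && !($1 in have)
    ' "$3" "$2" "$1")
    [ -z "$extra" ] || printf '%s\n' "$extra" >> "$2"
}

# Example (run as root):
#   list_real_users /etc/passwd > /tmp/real-users.txt
#   append_missing /etc/passwd /full-upgrade/etc/passwd /tmp/real-users.txt
#   append_missing /etc/shadow /full-upgrade/etc/shadow /tmp/real-users.txt
```

Service accounts already present in the new files are left untouched, so their new IDs are kept.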

5. Move all old-version root system folders from / to /full-backup and move all new-version root system folders from /full-upgrade to / , i.e.,

cd /full-upgrade; for f in *; do /busybox mv -v "/$f" /full-backup/; /busybox mv -v "$f" /; done

From this point onwards, all newly launched programs use the new-version libraries and packages, while programs that are already running keep using the old ones. Because running processes may still open configuration files or spawn new processes, all of which now come from the new version, some conflicts, errors, or failures are possible while the running services and kernel are still the old version. This window is short, however, because only the following steps remain before the reboot.

6. Set up the boot loader for the new-version root system using the steps in Method A2.
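
Because a mistake in the swap of Step 5 is hard to undo, it can be worth rehearsing it on a scratch directory first. The sketch below is one possible parameterized variant of that loop; the function name `swap_trees` and its argument order are assumptions for this example. It stops at the first failed move and skips the backup move when the old system has no entry of that name.

```shell
#!/bin/sh
# swap_trees ROOT UPGRADE BACKUP MV_COMMAND...
# For each entry in UPGRADE: move ROOT/entry to BACKUP (if it exists),
# then move UPGRADE/entry to ROOT. Runs in a subshell so the caller's
# working directory is not changed.
swap_trees() (
    root=$1
    upgrade=$2
    backup=$3
    shift 3                      # the remaining arguments are the mv command
    cd "$upgrade" || exit 1
    for f in *; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        if [ -e "$root/$f" ] || [ -L "$root/$f" ]; then
            "$@" "$root/$f" "$backup/" || exit 1
        fi
        "$@" "$f" "$root/" || exit 1
    done
)

# Rehearsal on a scratch copy:
#   swap_trees /tmp/scratch /tmp/scratch/full-upgrade /tmp/scratch/full-backup mv -v
# Real run (as root, with the static busybox in place):
#   swap_trees / /full-upgrade /full-backup /busybox mv -v
```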

  • Unmount the old EFI partition. If your EFI partition was previously mounted at `/boot/efi`, the "/busybox mv" command has moved its mount point to `/full-backup/boot/efi`, so run `umount /full-backup/boot/efi`.
  • Mount the EFI partition at /boot/efi (`mount /dev/sda1 /boot/efi`).
  • Delete unused EFI boot images in /boot/efi/EFI. Some BIOS/UEFI firmware remembers the previously booted EFI image, and since the OS has changed, the old image might not work; typically just run `rm -rf /boot/efi/EFI/*`, or move the images to a backup location.
  • Install the new grub EFI boot image:
grub-mkdevicemap
grub-install --efi-directory=/boot/efi --root-directory=/ /dev/sda
update-grub
update-initramfs -u -k all 

7. Delete /busybox (for security purposes)

8. Reboot the server into the new-version system.

9. Double-check the auto-start status for system services such as nfs-kernel-server, auditd, rsyslog, clamav, openvpn, etc., which can be different on different computers.
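
A small helper can make this check repeatable: run it on the old system before the upgrade and again after the reboot, then compare the two outputs. The sketch assumes a systemd-based system (`systemctl is-enabled`); the function name `report_autostart` is made up, and the unit names are taken from the list above, so they may differ on your distribution.

```shell
#!/bin/sh
# Print one line per unit with its auto-start state as reported by
# "systemctl is-enabled"; units that produce no output show "unknown".
report_autostart() {
    for unit in "$@"; do
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        printf '%-24s %s\n' "$unit" "${state:-unknown}"
    done
}

# Example:
#   report_autostart nfs-kernel-server auditd rsyslog clamav openvpn
```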

Typically, to minimize system downtime, Steps 5-8 should be done in one go, preferably at the end of the day when most people have left the office. If ML/CUDA training runs overnight, Steps 5-8 should instead be done before lunch, so that users can restart all their programs after lunch.

Tuesday, 2 April 2024

How to bring a stopped process into a tmux session with console display?

Very often, we start a long-running command with tons of console output outside tmux, and then realize that we should have run it inside a tmux session so that we can log in remotely and monitor its progress. By then, terminating and re-running the process is hardly an option, and Ctrl+Z followed by `fg` can only resume it in the same console.

To do so:

1. Press Ctrl+Z to stop the process

2. Launch a new tmux session or attach to an existing one.

3. Run `reptyr <PID>` inside the tmux session, where <PID> is the process ID of the stopped process.
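
If you did not note the PID before switching to tmux (`jobs -l` in the original shell prints it), the stopped processes can be listed from /proc. This is a minimal sketch for Linux; the function name `list_stopped` is made up for this example, and it lists stopped processes of all users.

```shell
#!/bin/sh
# Print "PID COMMAND LINE" for every process whose state in
# /proc/PID/stat is T (stopped by a signal such as Ctrl+Z).
list_stopped() {
    for d in /proc/[0-9]*; do
        stat=$(cat "$d/stat" 2>/dev/null) || continue
        rest=${stat##*") "}      # drop "PID (command name) "
        state=${rest%% *}        # the first remaining field is the state
        if [ "$state" = "T" ]; then
            printf '%s %s\n' "${d#/proc/}" \
                "$(tr '\0' ' ' 2>/dev/null < "$d/cmdline")"
        fi
    done
}

# Example: pick the PID from this list, then run `reptyr <PID>` in tmux.
#   list_stopped
```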