Using Google Cloud Build To Build Linux Desktop Disk Images
For the SCUSA project, we need to ship bootable Linux desktop images that every site can use to easily convert a public computer lab into a contest environment. We have varied what we ship over the years, but for the last few we have opted for a LiveCD-based environment since it has minimal impact on an existing computer lab. Read more about SCUSA disk images.
To build the image we use livemedia-creator, which is used upstream in Fedora to build the LiveCD environments for that project. I got started in systems administration building Fedora/CentOS servers with PXE-booted Anaconda/Kickstart, so it is nice to see anaconda make its way into the livemedia-creator project as well.
Although there has been a lot of good work done to bring Anaconda to regular systems, I observed it leaving artifacts on the system if the process exits abnormally. Because of this, I was using a combination of Vagrant and Ansible to run livemedia-creator in a clean environment for every build.
While using Vagrant with a local VirtualBox instance on my laptop worked, the image build took so long that I would frequently leave my laptop, which would then go to sleep and interrupt the Ansible run. To fix this, I experimented with Vagrant using a remote libvirt+ssh instance on one of my local servers, which worked but was still cumbersome to set up and manage over time.
Looking to refine this process further, I started to experiment with Vagrant and its Google Compute Engine plugin, since I use GCE for hosting other aspects of the SCUSA infrastructure. Along the way I ran across Google Cloud Build and looked into whether it would serve my needs better.
Requirements:
- Be able to run livemedia-creator, which requires mount privileges (SYS_ADMIN capabilities)
- Clean build environment
- “Fire and forget” functionality where I can kick off a build and walk away
- Composable build process to split the layers of configuration and compose them into the final product
Basic steps of the image build process:
- Use livemedia-creator, which runs anaconda, to build an image and output a tarball
- Extract the tarball to a temporary directory and create a SquashFS filesystem
- Create a raw disk image with partitions and configure grub to boot
- Add kernel, initrd, SquashFS to disk image
Downstream, the disk image can be written directly to a USB drive for booting during the contest.
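The first two steps can be sketched roughly as follows. This is a dry-run outline under stated assumptions, not the actual builder script: the function names, the work directory, and the tarball name are hypothetical, and the livemedia-creator options shown (--make-tar, --no-virt, --ks, --resultdir, --image-name) are one plausible combination rather than the confirmed invocation. The partitioning and grub steps are left out because they depend on a disk layout that is not described here.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper covering steps 1 and 2: tarball, then SquashFS.
#   $1 = kickstart file, $2 = scratch directory (both assumptions)
build_rootfs_squashfs() {
    ks="$1"
    work="$2"
    # Step 1: livemedia-creator runs anaconda and emits a root filesystem tarball.
    run livemedia-creator --make-tar --no-virt --ks "$ks" \
        --resultdir "$work/result" --image-name rootfs.tar.xz &&
    # Step 2: unpack the tarball and squash the tree.
    run mkdir -p "$work/rootfs" &&
    run tar -xf "$work/result/rootfs.tar.xz" -C "$work/rootfs" &&
    run mksquashfs "$work/rootfs" "$work/squashfs.img"
}

build_rootfs_squashfs /workspace/centos8.ks /tmp/scusa-build
```

Chaining the stages with && means a failed anaconda run stops the pipeline before anything tries to unpack a tarball that does not exist.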
Cloud Build Configuration
I started with a simple Cloud Build configuration to build my builder image:
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build',
         '-t', 'gcr.io/$PROJECT_ID/scusa-image-builder:latest',
         '.']
images:
- 'gcr.io/$PROJECT_ID/scusa-image-builder:latest'
A second Cloud Build configuration runs that builder image to build my disk image:
steps:
- name: 'gcr.io/cloud-builders/docker'
  args:
  - run
  - --privileged
  - --tty
  - --name
  - build
  - --volume
  - /workspace:/workspace
  - gcr.io/$PROJECT_ID/scusa-image-builder:latest
  - /workspace/centos8.ks
timeout: 3600s
artifacts:
  objects:
    location: 'gs://[STORAGE_LOCATION]/'
    paths: ['output/disk.tar.gz']
I have to run docker with --privileged because we mount filesystems throughout the build process.
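One quick way to confirm from inside the container that the flag took effect is to look at the process's effective capability mask. The sketch below assumes a Linux /proc filesystem; the helper names are hypothetical. It only checks for CAP_SYS_ADMIN (capability number 21), while --privileged also relaxes other restrictions that this check does not cover.

```shell
#!/bin/sh
# Succeed if bit 21 (CAP_SYS_ADMIN) is set in a hex capability mask.
#   $1 = hex string as printed in /proc/<pid>/status, e.g. 00000000a80425fb
has_sys_admin() {
    mask="$1"
    [ -n "$mask" ] || return 1
    [ $(( (0x$mask >> 21) & 1 )) -eq 1 ]
}

# Print the effective capability mask of the current process (Linux only).
current_cap_mask() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

if [ -r /proc/self/status ]; then
    if has_sys_admin "$(current_cap_mask)"; then
        echo "CAP_SYS_ADMIN present"
    else
        echo "CAP_SYS_ADMIN missing; mount calls will likely be refused"
    fi
fi
```

Running this at the top of the entrypoint script makes a missing --privileged show up as one clear log line instead of a mount failure deep inside anaconda.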
First Issue: Immediate Exit from Missing DBUS
This issue came from porting the workflow to containers, and it already showed up in my initial testing on my local Docker instance.
I found that Anaconda seemed to be selecting the RHEL profile (not sure why?) and was therefore attempting to instantiate the Subscription module. By removing that module from the profile I was able to continue:
sed -i '/.*Subscription/d' /etc/anaconda/product.d/rhel.conf
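A slightly more defensive variant of the same edit is sketched below. It uses the same sed expression, but takes the profile path as an argument so it can be tried on a copy first, keeps a backup, and reports whether any matching line is left. The function name and the demo file contents are hypothetical and are not taken from the real rhel.conf.

```shell
#!/bin/sh
# Hypothetical helper: delete every line mentioning "Subscription" from a profile.
#   $1 = path to the profile (e.g. a copy of /etc/anaconda/product.d/rhel.conf)
remove_subscription_module() {
    conf="$1"
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    cp "$conf" "$conf.orig"               # keep the original next to the file
    sed -i '/.*Subscription/d' "$conf"    # same expression as the one-liner above
    ! grep -q 'Subscription' "$conf"      # succeed only if nothing is left
}

# Demo on a throwaway file with made-up contents.
demo=$(mktemp)
printf '%s\n' '[Anaconda]' 'kickstart_modules =' \
    '    org.fedoraproject.Anaconda.Modules.Timezone' \
    '    org.fedoraproject.Anaconda.Modules.Subscription' > "$demo"
remove_subscription_module "$demo" && cat "$demo"
```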
Another Issue: Syslog in Containers
It was clear from the command line arguments that livemedia-creator and anaconda wanted to log to files or syslog. In order to debug my problems I really needed stdout/stderr, since that was the easiest to get from the cloud builder. I was about to reach for rsyslog/fluentd or something of that nature, but after some quick googling I stumbled upon the excellent ossobv/syslog2stdout:
RUN git clone https://github.com/ossobv/syslog2stdout /opt/syslog2stdout && \
    cd /opt/syslog2stdout && \
    cc -O3 -o /bin/syslog2stdout syslog2stdout.c
And in my entrypoint builder script:
syslog2stdout /dev/log &
Debugging Anaconda Storage module
After some iteration I was able to successfully build an image with local Docker, but when I tried to run it with Cloud Build I noticed it would hang while activating the modules:
2021-10-07 11:59:56,572 INFO pylorax: Running anaconda.
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Boss' requested by ':1.0' (uid=0 pid=11 comm="/usr/libexec/platform-python /usr/sbin/anaconda --" label="unconfined")
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Boss'
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Timezone' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Network' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Localization' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Security' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Users' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Payloads' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Storage' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Activating service name='org.fedoraproject.Anaconda.Modules.Services' requested by ':1.1' (uid=0 pid=20 comm="/usr/libexec/platform-python -m pyanaconda.modules" label="unconfined")
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Security'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Services'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Payloads'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Network'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Users'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Timezone'
daemon.info: dbus-daemon[15]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Localization'
We are missing full activation of org.fedoraproject.Anaconda.Modules.Storage.
strace is my friend:
strace -f -s 1024 -e fork,execve,open,listen,connect livemedia-creator ...
On local:
strace: Process 87 attached
daemon.info: dbus-daemon[19]: Successfully activated service 'org.fedoraproject.Anaconda.Modules.Storage'
On cloud build:
strace: Process 86 attached
[pid 86] execve("/usr/local/sbin/modprobe", ["modprobe", "--dry-run", "xfs"], 0x7f911725fc90 /* 21 vars */) = -1 ENOENT (No such file or directory)
[pid 86] execve("/usr/local/bin/modprobe", ["modprobe", "--dry-run", "xfs"], 0x7f911725fc90 /* 21 vars */) = -1 ENOENT (No such file or directory)
[pid 86] execve("/usr/sbin/modprobe", ["modprobe", "--dry-run", "xfs"], 0x7f911725fc90 /* 21 vars */) = 0
[pid 86] +++ exited with 1 +++
I went back to the configuration for anaconda in /etc/anaconda/product.d. It wasn’t entirely clear to me which profile was being selected, but I also saw that the profiles were setting file_system_type = xfs.
We don’t have xfs in /proc/filesystems on the cloud builder, but we do on local docker.
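That comparison is easy to script so it can run at the start of a build. The sketch below is an illustration with a hypothetical function name; the second argument exists only so the function can be pointed at a sample file. Note that /proc/filesystems lists only what the kernel currently has registered, so a filesystem whose module is loadable but not loaded will also show up as not listed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel lists the given filesystem type.
#   $1 = filesystem name, $2 = table to read (defaults to /proc/filesystems)
fs_supported() {
    fstype="$1"
    table="${2:-/proc/filesystems}"
    # Lines look like "nodev<TAB>tmpfs" or "<TAB>ext4"; the name is the last field.
    awk -v fs="$fstype" '$NF == fs { found = 1 } END { exit !found }' "$table"
}

for fs in xfs ext4; do
    if fs_supported "$fs"; then
        echo "$fs: listed"
    else
        echo "$fs: not listed"
    fi
done
```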
Finally, I ran strace without filtering the syscalls and found this write call. I am not sure why this traceback wasn’t making it all the way back to the logs, but it was definitely the culprit. I have seen issues with multi-threaded Python where logging to stdout/stderr gets lost.
[pid 51] write(2, "Traceback (most recent call last):
File \"/usr/lib64/python3.6/runpy.py\", line 193, in _run_module_as_main
\"__main__\", mod_spec)
File \"/usr/lib64/python3.6/runpy.py\", line 85, in _run_code
exec(code, run_globals)
File \"/usr/lib64/python3.6/site-packages/pyanaconda/modules/storage/__main__.py\", line 29, in <module>
service = StorageService()
File \"/usr/lib64/python3.6/site-packages/pyanaconda/modules/storage/storage.py\", line 138, in __init__
self._set_storage(create_storage())
File \"/usr/lib64/python3.6/site-packages/pyanaconda/modules/storage/devicetree/model.py\", line 50, in create_storage
return InstallerStorage()
File \"/usr/lib/python3.6/site-packages/blivet/threads.py\", line 53, in run_with_lock
return m(*args, **kwargs)
File \"/usr/lib64/python3.6/site-packages/pyanaconda/modules/storage/devicetree/model.py\", line 67, in __init__
self.set_default_fstype(conf.storage.file_system_type or self.default_fstype)
File \"/usr/lib/python3.6/site-packages/blivet/threads.py\", line 53, in run_with_lock
return m(*args, **kwargs)
File \"/usr/lib/python3.6/site-packages/blivet/blivet.py\", line 1069, in set_default_fstype
self._check_valid_fstype(newtype)
File \"/usr/lib/python3.6/site-packages/blivet/threads.py\", line 53, in run_with_lock
return m(*args, **kwargs)
File \"/usr/lib/python3.6/site-packages/blivet/blivet.py\", line 1054, in _check_valid_fstype
raise ValueError(\"new value %s is not valid as a default fs type\" % newtype)
ValueError: new value tmpfs is not valid as a default fs type
", 1568) = 1568
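An unfiltered trace is large, so one way to find a write like this is to save the trace to a file and keep only the writes to file descriptor 2. The sketch below is an illustration: the helper name is hypothetical, and the sample lines are abbreviated stand-ins modeled on the output above rather than real log output. The pattern accepts both the "[pid N]" prefix that strace -f prints to the terminal and the bare pid prefix it uses when writing with -o.

```shell
#!/bin/sh
# Hypothetical helper: print only write() calls to stderr (fd 2) from a strace log.
#   $1 = file produced by e.g. `strace -f -s 4096 -o trace.log <command>`
stderr_writes() {
    grep -E '^([0-9]+ +|\[pid +[0-9]+\] )?write\(2, ' "$1"
}

# Demo on a small sample file (contents are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
[pid    86] execve("/usr/sbin/modprobe", ["modprobe", "--dry-run", "xfs"], 0x7f911725fc90 /* 21 vars */) = 0
[pid    51] write(1, "hello\n", 6) = 6
[pid    51] write(2, "Traceback (most recent call last):\n", 35) = 35
EOF
stderr_writes "$sample"
```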
After I changed the configuration to file_system_type = ext4, I was able to complete my build with Cloud Build. Woo hoo!
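That change can be scripted in the builder image along the following lines. This is a sketch under assumptions: the function name is hypothetical, the profile path is passed in because it is not stated which file under /etc/anaconda/product.d was edited, and the demo file contents are made up.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every "file_system_type = ..." line in a profile.
#   $1 = path to the profile, $2 = new default filesystem type
set_default_fstype() {
    conf="$1"
    fstype="$2"
    sed -i "s/^\([[:space:]]*file_system_type[[:space:]]*=\).*/\1 $fstype/" "$conf"
    # Succeed only if the new value is actually present afterwards.
    grep -q "^[[:space:]]*file_system_type[[:space:]]*= $fstype\$" "$conf"
}

# Demo on a throwaway file with made-up contents.
demo=$(mktemp)
printf '%s\n' '[Storage]' 'file_system_type = xfs' > "$demo"
set_default_fstype "$demo" ext4 && cat "$demo"
```

Because the function fails when no matching key is found, a profile that never set file_system_type is reported instead of being silently left unchanged.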
Conclusion
Overall, Cloud Build is super easy to use and efficient for my use case. I wanted something that I could kick off and just let run without having to keep my laptop open and awake. At ~40 min per build, it was just too long to sit and wait for it to finish. Managing all of this in docker containers is significantly simpler than spinning up and managing VMs, either in Vagrant (local or with remote libvirt+ssh) or in Google Compute Engine (with Ansible).