Fix – Telco Cloud Automation by VMware (Broadcom): containerd service will not start after resizing disks

Published by Jimmy Mankowitz

I recently had a case with a customer where we resized the disks for Telco Cloud Automation (TCA), and after a reboot the TCA services would, for some reason, fail to start again.
After a lot of troubleshooting, we concluded that containerd was not starting correctly.

This post describes a workaround for getting containerd started again.

The error we saw when running journalctl -xfe was a lot of noise from containerd not being able to run its CRI service. Below is an example of the error:
1. Collect the logs for containerd

journalctl -xe --unit containerd -f
Dec 10 13:06:13 tcamgr.catbird.local containerd[8635]: time="2024-12-10T13:06:13.836586150Z" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve container name \"tca-database-admin-service_tca-database-admin-service-6cd59cf46f-thzt6_tca-mgr_79f08f9b-790e-471a-90ff-b663ecbb94a3_9\": name \"tca-database-admin-service_tca-database-admin-service-6cd59cf46f-thzt6_tca-mgr_79f08f9b-790e-471a-90ff-b663ecbb94a3_9\" is reserved for \"131388c0f5dfd2c73c35650c148b54148a3df321c38b6eeebd9c36f8ee534554\""
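
Before digging further, it is worth confirming the service state with standard systemd commands (nothing TCA-specific here):

systemctl status containerd --no-pager
systemctl is-active containerd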

As we can see in the error message, containerd is blocked from starting the container because its name is already reserved for another ID. The failure happens during the CRI plugin's state recovery, so the workaround is to temporarily disable that plugin, let containerd start without it, and then clean out the stale containers.

2. Disable the CRI plugin in the containerd config

To fix this, go to the following location and edit the config.toml file: on the top row, edit the disabled_plugins line and add the CRI plugin name, io.containerd.grpc.v1.cri, as shown in the listing below.
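
A sed one-liner does the same edit if you prefer to skip vi (a sketch, assuming the file still contains the stock disabled_plugins = [] line exactly as in the listing):

sed -i 's/^disabled_plugins = \[\]/disabled_plugins = ["io.containerd.grpc.v1.cri"]/' /etc/containerd/config.toml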

vi /etc/containerd/config.toml
[ /home/admin ]# cat /etc/containerd/config.toml
disabled_plugins = ["io.containerd.grpc.v1.cri"]
#disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = false
    disable_apparmor = true

3. Restart the containerd service

systemctl stop containerd.service
systemctl start containerd.service 
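
Once containerd is running again, confirm that the daemon answers on its socket; ctr version talks to the server side, so it fails if containerd is down. Depending on the containerd version, the CRI plugin row will either be missing from the plugin list or marked as disabled:

ctr version
ctr plugins ls | grep cri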

4. List the containers in containerd

ctr -n k8s.io containers ls
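
If you want a quick count of how many stale containers the next step will remove, pipe the quiet listing through wc:

ctr -n k8s.io c ls -q | wc -l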

5. Delete all containers in containerd

ctr -n k8s.io c rm $(ctr -n k8s.io c ls -q)
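
An equivalent xargs form works too, and is a little safer when the list is empty, since -r skips the rm call entirely in that case:

ctr -n k8s.io c ls -q | xargs -r -n1 ctr -n k8s.io c rm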

6. Stop the containerd service, re-edit the config.toml file, and change the plugin value back to the default
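
systemctl stop containerd.service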

vi /etc/containerd/config.toml
[ /home/admin ]# cat /etc/containerd/config.toml
disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2
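
A quick grep confirms the value is back to the default before starting the service again:

grep '^disabled_plugins' /etc/containerd/config.toml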

7. Start containerd and verify that all pods start to come back after some time.

systemctl start containerd.service 
ctr -n k8s.io containers ls
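
The TCA appliance runs Kubernetes on top of containerd, so if kubectl is available in your shell (an assumption, it is not needed for the steps above), you can also watch the pods return across all namespaces:

kubectl get pods -A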

It is best to now reboot the TCA Manager completely so that all related services start in the correct order.

