Demystifying Certificate Generation using Ansible in OpenShift

Background

Ansible is the only available tool for managing the deployment of OpenShift Origin clusters, now known as OKD. Some pitfalls were identified when the author tried to redeploy certificates in an Origin 3.6 cluster.

Logging Issues

We were trying to do some troubleshooting using the kibana-ops logging console and were surprised to find it not working. A quick check on the logging project showed that many of the pods had a warning status.

Checking these pods (mostly Elasticsearch instances), we found that the health check script reports 'Elasticsearch node is not ready to accept HTTP requests yet' with HTTP response code 000. The pods were at 90% storage, which I think was caused by the curator being unable to connect to the Elasticsearch instances (the curator is supposed to clean up old index data so the disk won't fill up).
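A quick way to confirm the storage pressure is to check disk usage from inside one of the Elasticsearch pods; a minimal sketch, where the pod name is just a placeholder for one of your logging-es pods:

oc -n logging exec <logging-es-pod> -- df -h    # look at the mount backing the Elasticsearch data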
Further investigation showed that the following curl command (adapted from the syntax in the chapter 'Aggregating Container Logs - Performing Administrative Elasticsearch Operations') results in an expired peer certificate error:

oc project logging
oc rsh logging-es-data-master-9oin****-10-c3*** 
curl --key /etc/elasticsearch/secret/admin-key \
  --cert /etc/elasticsearch/secret/admin-cert \
  --cacert /etc/elasticsearch/secret/admin-ca -XGET "https://localhost:9200"
Further checks (using the SSL Shopper checker :) ) on /etc/elasticsearch/secret/admin-cert showed that the certificate was expired as well.
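If you prefer to check locally rather than with an online checker, openssl can print the expiry date directly; a minimal sketch, assuming the openssl binary is available inside the pod (otherwise copy the cert out and inspect it on another machine):

oc -n logging rsh <logging-es-pod> \
  openssl x509 -noout -enddate -in /etc/elasticsearch/secret/admin-cert    # prints notAfter=<expiry date>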

Solution Seeking


In order to renew the certificates, we searched the internet and found a support article on this exact problem: https://access.redhat.com/solutions/4233251, which links to an Ansible redeploy playbook.
Well, this post is a bit more than just executing that playbook :)
In short, the command to be executed is "ansible-playbook playbooks/openshift-logging/redeploy-certificates.yml", but we were unable to run it because that playbook file doesn't exist in our OpenShift version's playbooks. Of course, there are several options:
-> upgrade to a newer OpenShift version, which I am not quite ready for because of the risks involved with live applications running in the cluster. If this is to be done, we would need to do some simulations first with a dummy cluster ..
-> copy the playbook from a newer version, which also has its own risks, such as incompatibility issues and missing variables
-> try to trick the existing 'logging installation' playbook into renewing the certificates. This is the approach I chose.
For certificates other than the logging ones, there is another redeploy-certificates playbook that can be run, but after reading the bug report (https://bugzilla.redhat.com/show_bug.cgi?id=1772543) we concluded that it just won't do the job. For your reference, this is the bug report description (which is marked for OpenShift version 3.11, the last stable version):
Running /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml does not redeploy logging certificates although it is expected it redeploys all the certificates (not CAs) of OpenShift components (as it already does with console, catalog, monitoring...). Users must be aware and run /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/redeploy-certificates.yml separately.

Running the Logging Installation Playbook

Here is the Ansible command I ran (it needs to be executed from the bootstrap node):
ansible-playbook -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml
The inventory file is specific to our deployment, so you might want to change it before running this against your cluster; the playbook is the one referenced under the subheading 'Deploying the EFK Stack' in the chapter 'Aggregating Container Logs'.

The first pitfall is a strange error message about the deployment config. It seems that the Ansible playbook generated a somewhat incorrect deployment config. The real culprit, however, is a wrong unit being used for the memory limit (4GB instead of 4G). Red Hat Bugzilla has a description of this problem ("Better error message for templates missing memory specification"). The strange error message is: "DeploymentConfig in version "v1" cannot be handled as a DeploymentConfig: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'", which we resolved by changing the memory limit to 4G in the inventory file, as shown in the fragment below.
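For reference, this is roughly what the corrected inventory entries look like; the variable names here are the standard openshift_logging ones and are an assumption on my part, since your inventory may use different ones:

# fragment of the [OSEv3:vars] section of the Ansible inventory (assumed variable names)
openshift_logging_es_memory_limit=4G        # '4G' is a valid Kubernetes quantity, '4GB' is not
openshift_logging_es_ops_memory_limit=4G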

The second pitfall is that the certificates are not renewed at all. So I figured I needed to delete the existing certificates to get them rebuilt. For more information, I enabled verbose logging on the Ansible playbook execution:

ansible-playbook --verbose -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

The interesting fragments looked something like this:
TASK [openshift_logging : Checking for system.admin.crt] ***************************************************************************************************************
ok: [master1.paas.telkom.co.id] => {
    "changed": false,
    "failed": false,
    "stat": {
        "atime": 1597830303.7311661,
        "attr_flags": "",
        "attributes": [],
        "block_size": 4096,
        "blocks": 8,
        "charset": "us-ascii",
        "checksum": "0d0987b2615864f95fe914aa29a27c3831e8c93e",
        "ctime": 1597830292.9452658,
        "dev": 64768,
        "device_type": 0,
        "executable": false,
        "exists": true,
        "gid": 0,
        "gr_name": "root",
        "inode": 51974025,
        "isblk": false,

At first I didn't know where this file is located. Which Ansible script is responsible for that?
Time to dig deeper. Let's see: byo/openshift-cluster/openshift-logging.yml contains this:
include: ../../common/openshift-cluster/openshift_logging.yml
This directs us to another playbook in the common directory, which contains:
- name: OpenShift Aggregated Logging
  hosts: oo_first_master
  roles:
  - openshift_logging

- name: Update Master configs
  hosts: oo_masters:!oo_first_master
  tasks:
  - block:
    - include_role:
        name: openshift_logging
        tasks_from: update_master_config
    when: openshift_logging_install_logging | default(false) | bool

Being just a beginner with Ansible, at first this didn't make much sense to me. What I make of it is that the first master is special, and the other masters are updated after the configuration on the first master is done. The next thing to check is the openshift_logging role:

[root@lb openshift-ansible-release-3.6]# ls -l roles/openshift_logging
total 24
drwxrwxr-x. 2 lbuser lbuser    22 Dec  6  2017 defaults
drwxrwxr-x. 2 lbuser lbuser    52 Dec  6  2017 files
drwxrwxr-x. 2 lbuser lbuser    75 Dec  6  2017 filter_plugins
drwxrwxr-x. 2 lbuser lbuser    22 Dec  6  2017 handlers
drwxrwxr-x. 2 lbuser lbuser    40 Dec  6  2017 library
drwxrwxr-x. 2 lbuser lbuser    23 Dec  6  2017 meta
-rw-rw-r--. 1 lbuser lbuser 18785 Dec  6  2017 README.md
drwxrwxr-x. 2 lbuser lbuser  4096 Aug 19 19:45 tasks
drwxrwxr-x. 2 lbuser lbuser    47 Dec  6  2017 templates
drwxrwxr-x. 2 lbuser lbuser    81 Dec  6  2017 vars

Quite a lot of directories to examine. Let's start with the tasks directory.

[root@lb openshift-ansible-release-3.6]# ls -l roles/openshift_logging/tasks
total 60
-rw-rw-r--. 1 lbuser lbuser   484 Dec  6  2017 annotate_ops_projects.yaml
-rw-rw-r--. 1 lbuser lbuser  2260 Dec  6  2017 delete_logging.yaml
-rw-rw-r--. 1 lbuser lbuser  4371 Dec  6  2017 generate_certs.yaml
-rw-rw-r--. 1 lbuser lbuser  3401 Dec  6  2017 generate_jks.yaml
-rw-rw-r--. 1 lbuser lbuser  1442 Dec  6  2017 generate_pems.yaml
-rw-rw-r--. 1 lbuser lbuser 16425 Dec  6  2017 install_logging.yaml
-rw-rw-r--. 1 lbuser lbuser  1478 Dec  6  2017 main.yaml
-rw-rw-r--. 1 lbuser lbuser  2824 Aug 19 18:54 procure_server_certs.yaml
-rw-rw-r--. 1 lbuser lbuser  1267 Dec  6  2017 procure_shared_key.yaml
-rw-rw-r--. 1 lbuser lbuser   377 Dec  6  2017 update_master_config.yaml

Well, it seems we found some treasure.
Let's find out which files contain the string 'system.admin':

[root@lb openshift-ansible-release-3.6]# grep system.admin roles/openshift_logging/tasks/*
roles/openshift_logging/tasks/generate_certs.yaml:    - system.admin
roles/openshift_logging/tasks/generate_jks.yaml:- name: Checking for system.admin.jks
roles/openshift_logging/tasks/generate_jks.yaml:  stat: path="{{generated_certs_dir}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:  register: system_admin_jks
roles/openshift_logging/tasks/generate_jks.yaml:  local_action: file path="{{local_tmp.stdout}}/system.admin.jks" state=touch mode="u=rw,g=r,o=r"
roles/openshift_logging/tasks/generate_jks.yaml:  when: system_admin_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:    src: "{{local_tmp.stdout}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:    dest: "{{generated_certs_dir}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:  when: not system_admin_jks.stat.exists
[root@lb openshift-ansible-release-3.6]#

In the generate_certs.yaml file we find this:

- name: Generate PEM certs
  include: generate_pems.yaml component={{node_name}}
  with_items:
    - system.logging.fluentd
    - system.logging.kibana
    - system.logging.curator
    - system.admin
  loop_control:
    loop_var: node_name

This in turn includes another YAML file (generate_pems.yaml):

---
- name: Checking for {{component}}.key
  stat: path="{{generated_certs_dir}}/{{component}}.key"
  register: key_file
  check_mode: no

- name: Checking for {{component}}.crt
  stat: path="{{generated_certs_dir}}/{{component}}.crt"
  register: cert_file
  check_mode: no

- name: Creating cert req for {{component}}
  command: >
    openssl req -out {{generated_certs_dir}}/{{component}}.csr -new -newkey rsa:2048 -keyout {{generated_certs_dir}}/{{component}}.key
    -subj "/CN={{component}}/OU=OpenShift/O=Logging/subjectAltName=DNS.1=localhost{{cert_ext.stdout}}" -days 712 -nodes
  when:
    - not key_file.stat.exists
    - cert_ext is defined
    - cert_ext.stdout is defined
  check_mode: no

- name: Creating cert req for {{component}}
  command: >
    openssl req -out {{generated_certs_dir}}/{{component}}.csr -new -newkey rsa:2048 -keyout {{generated_certs_dir}}/{{component}}.key
    -subj "/CN={{component}}/OU=OpenShift/O=Logging" -days 712 -nodes
  when:
    - not key_file.stat.exists
    - cert_ext is undefined or cert_ext is defined and cert_ext.stdout is undefined
  check_mode: no

- name: Sign cert request with CA for {{component}}
  command: >
    openssl ca -in {{generated_certs_dir}}/{{component}}.csr -notext -out {{generated_certs_dir}}/{{component}}.crt
    -config {{generated_certs_dir}}/signing.conf -extensions v3_req -batch -extensions server_ext
  when:
    - not cert_file.stat.exists
  check_mode: no

Seeing this, I can reuse the certificate key and CSR and redo just the last part (signing), which runs when the cert file is missing. So I need to delete the certificate file at "{{generated_certs_dir}}/{{component}}.crt"; for example, for the system.admin component this becomes "{{generated_certs_dir}}/system.admin.crt". But I still don't know where generated_certs_dir is..

[root@lb openshift-ansible-release-3.6]# grep generated_certs -R * | head
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"

OK, so it is under config_base:

[root@lb openshift-ansible-release-3.6]# grep config_base -R * | head
DEPLOYMENT_TYPES.md:| **openshift.common.config_base**                                | /etc/origin                              | /etc/origin                            |
docs/proposals/role_decomposition.md:    path: "{{ openshift.common.config_base }}/logging"

Luckily, the first result yields the correct location for config_base (/etc/origin). Thus the certificate directory is /etc/origin/logging. But it is not on the bootstrap server:

[root@lb openshift-ansible-release-3.6]# ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory

To my surprise, it exists on the first master only:

[root@lb openshift-ansible-release-3.6]# ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory
[root@lb openshift-ansible-release-3.6]# ssh master1 ls -l /etc/origin/logging
total 156
-rw-r--r--. 1 root root 1196 Aug 18  2018 02.pem
-rw-r--r--. 1 root root 1196 Aug 18  2018 03.pem
-rw-r--r--. 1 root root 1196 Aug 18  2018 04.pem
-rw-r--r--. 1 root root 1184 Aug 18  2018 05.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 06.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 07.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 08.pem
-rw-r--r--. 1 root root 1184 Aug 19 16:44 09.pem
-rw-r--r--. 1 root root 1050 Aug 18  2018 ca.crt
...

[root@lb openshift-ansible-release-3.6]# ssh master2 ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory
[root@lb openshift-ansible-release-3.6]# ssh master3 ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory

The key takeaways:
- the certificate files are generated in the /etc/origin/logging directory on the first master server
- we can remove the expired certificates in order to force certificate renewal
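Before deleting anything, it is worth confirming which certificates are actually expired; a quick sketch of a loop to run on the first master:

cd /etc/origin/logging
for f in *.crt; do echo -n "$f: "; openssl x509 -noout -enddate -in "$f"; done    # prints notAfter= for every cert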

Next, we moved the expired certificates out of the way:
[root@master1 ~]# cd /etc/origin/logging
[root@master1 logging]# mkdir /tmp/expiredcerts
[root@master1 logging]# mv kibana-internal.crt system.admin.crt  system.logging.curator.crt system.logging.fluentd.crt system.logging.kibana.crt /tmp/expiredcerts/

Then we ran the playbook again:

ansible-playbook  -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

This resulted in another error about kibana-internal.crt being missing. Why is it not being regenerated? Let's ask our friend grep:
[root@lb openshift-ansible-release-3.6]# grep kibana-internal -R playbooks/byo | head
grep: warning: playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_certificate_expiry/examples/playbooks: recursive directory loop
grep: warning: playbooks/byo/openshift-checks/roles/openshift_certificate_expiry/examples/playbooks/roles: recursive directory loop
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging/tasks/generate_certs.yaml:    - procure_component: kibana-internal
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:  - { name: "kibana_internal_key", file: "kibana-internal.key"}
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:  - { name: "kibana_internal_cert", file: "kibana-internal.crt"}
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:    #  path: "{{ generated_certs_dir }}/kibana-internal.key"
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:    #  path: "{{ generated_certs_dir }}/kibana-internal.crt"
grep: warning: playbooks/byo/openshift-cluster/roles/openshift_certificate_expiry/examples/playbooks/roles: recursive directory loop
playbooks/byo/openshift-checks/roles/openshift_logging/tasks/generate_certs.yaml:    - procure_component: kibana-internal

The kibana-internal certificate is handled in the generate_certs.yaml task (well, what do you know, there is certificate_expiry stuff for kibana-internal in the playbooks? I will need to check that later). Let's review that part of generate_certs.yaml:

- include: procure_server_certs.yaml
  loop_control:
    loop_var: cert_info
  with_items:
    - procure_component: kibana
    - procure_component: kibana-ops
    - procure_component: kibana-internal
      hostnames: "kibana, kibana-ops, {{openshift_logging_kibana_hostname}}, {{openshift_logging_kibana_ops_hostname}}"

This refers to procure_server_certs.yaml:

---
- name: Procure info
  debug:
    var: cert_info

- name: Checking for {{ cert_info.procure_component }}.crt
  stat: path="{{generated_certs_dir}}/{{ cert_info.procure_component }}.crt"
  register: component_cert_file
  check_mode: no

- name: Checking for {{ cert_info.procure_component }}.key
  stat: path="{{generated_certs_dir}}/{{ cert_info.procure_component }}.key"
  register: component_key_file
  check_mode: no

- name: Trying to discover server cert variable name for {{ cert_info.procure_component }}
  set_fact: procure_component_crt={{ lookup('env', '{{cert_info.procure_component}}' + '_crt') }}
  when:
  - cert_info.hostnames is undefined
  - cert_info[ cert_info.procure_component + '_crt' ] is defined
  - cert_info[ cert_info.procure_component + '_key' ] is defined
  check_mode: no

- name: Trying to discover the server key variable name for {{ cert_info.procure_component }}
  set_fact: procure_component_key={{ lookup('env', '{{cert_info.procure_component}}' + '_key') }}
  when:
  - cert_info.hostnames is undefined
  - cert_info[ cert_info.procure_component + '_crt' ] is defined
  - cert_info[ cert_info.procure_component + '_key' ] is defined
  check_mode: no

- name: Creating signed server cert and key for {{ cert_info.procure_component }}
  command: >
     {{ openshift.common.client_binary }} adm --config={{ mktemp.stdout }}/admin.kubeconfig ca create-server-cert
     --key={{generated_certs_dir}}/{{cert_info.procure_component}}.key --cert={{generated_certs_dir}}/{{cert_info.procure_component}}.crt
     --hostnames={{cert_info.hostnames|quote}} --signer-cert={{generated_certs_dir}}/ca.crt --signer-key={{generated_certs_dir}}/ca.key
     --signer-serial={{generated_certs_dir}}/ca.serial.txt
  check_mode: no
  when:
  - cert_info.hostnames is defined
  - not component_key_file.stat.exists
  - not component_cert_file.stat.exists

- name: Copying server key for {{ cert_info.procure_component }} to generated certs directory
  copy: content="{{procure_component_key}}" dest={{generated_certs_dir}}/{{cert_info.procure_component}}.key
  check_mode: no
... 

It seems that the signing process for this part requires both the key file and the cert file to be non-existent. A missing cert file alone will not get it regenerated. So the solution is to also remove kibana-internal.key, so that both the key and the crt file are regenerated.
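Concretely, on the first master, following the same pattern as before:

mv /etc/origin/logging/kibana-internal.key /tmp/expiredcerts/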
The strange thing is that the playbook also refers to kibana and kibana-ops, which should result in kibana.key, kibana-ops.key, kibana.crt and kibana-ops.crt being installed. But it seems that the missing hostnames clause causes those items to be skipped. Is it a typo? Everything seems fine without the kibana.* and kibana-ops.* files, though, so I will just ignore it.
After removing kibana-internal.key, the playbook works like a charm, but the running pods were still using the old certificates. The solution is either to kill the pods or to scale each deployment down to 0, wait for the pods to be destroyed, and scale back up to the original value (which might be 1 or 2 depending on the replication factor in the inventory file), for example as sketched below.
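A minimal sketch of the scale-down/scale-up approach; the deployment config name is a placeholder, so list yours with 'oc get dc' first and repeat for each logging deployment config:

oc project logging
oc get dc                                       # list the logging deployment configs
oc scale dc/<logging-dc-name> --replicas=0
# wait until the pods are gone, then restore the original replica count
oc scale dc/<logging-dc-name> --replicas=1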

Conclusion and Postscript

We conclude that the certificate generation process takes place only on the first master, with the directory /etc/origin/logging used for the logging certificates. In order to regenerate the certificates, we need to remove the expired .crt files as well as the kibana-internal.key file, and then execute the playbooks/byo/openshift-cluster/openshift-logging.yml Ansible playbook. Redeploying the pods is necessary in order for them to use the new certificates; the whole procedure is recapped below.
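For quick reference, the whole procedure condenses to roughly this (the inventory path is from our cluster and the deployment config name is a placeholder):

# on the first master: move the expired certs and kibana-internal.key out of the way
mkdir /tmp/expiredcerts
cd /etc/origin/logging
mv kibana-internal.crt kibana-internal.key system.admin.crt system.logging.curator.crt system.logging.fluentd.crt system.logging.kibana.crt /tmp/expiredcerts/

# on the bootstrap node: rerun the logging installation playbook
ansible-playbook -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

# redeploy the logging pods so they pick up the new certificates (repeat per deployment config)
oc -n logging scale dc/<logging-dc-name> --replicas=0
oc -n logging scale dc/<logging-dc-name> --replicas=1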
After going through this little adventure, I found out that there is a similar passage involving the removal of /etc/origin/logging files in the 'Redeploying EFK Certificates' section of the chapter 'Aggregating Container Logs' for OpenShift 3.9. Just be wary that the location of the Ansible scripts changes quite a bit between OpenShift Origin (OKD) releases.





