Demystifying Certificate Generation using Ansible in OpenShift

Background

Ansible is the only available tool for managing the deployment of OpenShift Origin clusters, now known as OKD. Some pitfalls were identified when the author tried to redeploy certificates in an Origin 3.6 cluster.

Logging Issues

We were trying to do some troubleshooting using the kibana-ops logging console and were surprised to find it not working. A quick check on the logging project showed that many of the pods had a warning status.

Checking these pods (mostly Elasticsearch instances), we found that the health check script reports 'Elasticsearch node is not ready to accept HTTP requests yet' with HTTP response code 000. The pods were at 90% storage, which I think was caused by the curator being unable to connect to the Elasticsearch instances (the curator is supposed to clean up old index data so the disk won't fill up).
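A quick way to confirm the storage pressure is to check disk usage from inside one of the Elasticsearch pods; a minimal sketch, where the pod name is just a placeholder for one of your logging-es pods:

oc -n logging exec <logging-es-pod> -- df -h    # look at the mount backing the Elasticsearch data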
Further investigation showed that the following curl command (adapted from the syntax in the chapter 'Aggregating Container Logs - Performing Administrative Elasticsearch Operations') results in an expired peer certificate error:

oc project logging
oc rsh logging-es-data-master-9oin****-10-c3*** 
curl --key /etc/elasticsearch/secret/admin-key \
  --cert /etc/elasticsearch/secret/admin-cert \
  --cacert /etc/elasticsearch/secret/admin-ca -XGET "https://localhost:9200"
Further checks (using the SSL Shopper checker :) ) on /etc/elasticsearch/secret/admin-cert showed that the certificate was expired as well.
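If you prefer to check locally rather than with an online checker, openssl can print the expiry date directly; a minimal sketch, assuming the openssl binary is available inside the pod (otherwise copy the cert out and inspect it on another machine):

oc -n logging rsh <logging-es-pod> \
  openssl x509 -noout -enddate -in /etc/elasticsearch/secret/admin-cert    # prints notAfter=<expiry date>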

Solution Seeking


In order to renew the certificates, we searched the internet and found a support article on this exact problem: https://access.redhat.com/solutions/4233251, which links to an Ansible redeploy playbook.
Well, this post is a bit more than just executing that playbook :)
In short, the command to be executed is "ansible-playbook playbooks/openshift-logging/redeploy-certificates.yml", but we were unable to run it because that playbook file doesn't exist in our OpenShift version's playbooks. Of course, there are several options:
-> upgrade to a newer OpenShift version, which I am not quite ready for because of the risks involved with live applications running in the cluster. If this is to be done, we would need to do some simulations first with a dummy cluster ..
-> copy the playbook from a newer version, which also has its own risks, such as incompatibility issues and missing variables
-> try to trick the existing 'logging installation' playbook into renewing the certificates. This is the approach I chose.
For certificates other than the logging ones, there is another redeploy-certificates playbook that can be run, but after reading the bug report (https://bugzilla.redhat.com/show_bug.cgi?id=1772543) we concluded that it just won't do the job. For your reference, this is the bug report description (which is marked for OpenShift version 3.11, the last stable version):
Running /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml does not redeploy logging certificates although it is expected it redeploys all the certificates (not CAs) of OpenShift components (as it already does with console, catalog, monitoring...). Users must be aware and run /usr/share/ansible/openshift-ansible/playbooks/openshift-logging/redeploy-certificates.yml separately.

Running the Logging Installation Playbook

Here is the Ansible command I ran (it needs to be executed from the bootstrap node):
ansible-playbook -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml
The inventory file is specific to our deployment, so you might want to change it before running this against your cluster; the playbook is the one referenced under the subheading 'Deploying the EFK Stack' in the chapter 'Aggregating Container Logs'.

The first pitfall is a strange error message about the deployment config. It seems that the Ansible playbook generated a somewhat incorrect deployment config. The real culprit, however, is a wrong unit being used for the memory limit (4GB instead of 4G). Red Hat Bugzilla has a description of this problem ("Better error message for templates missing memory specification"). The strange error message is: "DeploymentConfig in version "v1" cannot be handled as a DeploymentConfig: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'", which we resolved by changing the memory limit to 4G in the inventory file, as shown in the fragment below.
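For reference, this is roughly what the corrected inventory entries look like; the variable names here are the standard openshift_logging ones and are an assumption on my part, since your inventory may use different ones:

# fragment of the [OSEv3:vars] section of the Ansible inventory (assumed variable names)
openshift_logging_es_memory_limit=4G        # '4G' is a valid Kubernetes quantity, '4GB' is not
openshift_logging_es_ops_memory_limit=4G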

The second pitfall is that the certificates are not renewed at all. So I figured I needed to delete the existing certificates to get them rebuilt. For more information, I enabled verbose logging on the Ansible playbook execution:

ansible-playbook --verbose -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

The interesting fragments looked something like this:
TASK [openshift_logging : Checking for system.admin.crt] ***************************************************************************************************************
ok: [master1.paas.telkom.co.id] => {
    "changed": false,
    "failed": false,
    "stat": {
        "atime": 1597830303.7311661,
        "attr_flags": "",
        "attributes": [],
        "block_size": 4096,
        "blocks": 8,
        "charset": "us-ascii",
        "checksum": "0d0987b2615864f95fe914aa29a27c3831e8c93e",
        "ctime": 1597830292.9452658,
        "dev": 64768,
        "device_type": 0,
        "executable": false,
        "exists": true,
        "gid": 0,
        "gr_name": "root",
        "inode": 51974025,
        "isblk": false,

At first I didn't know where this file is located. Which Ansible script is responsible for that?
Time to dig deeper. Let's see: byo/openshift-cluster/openshift-logging.yml contains this:
include: ../../common/openshift-cluster/openshift_logging.yml
This directs us to another playbook in the common directory, which contains:
- name: OpenShift Aggregated Logging
  hosts: oo_first_master
  roles:
  - openshift_logging

- name: Update Master configs
  hosts: oo_masters:!oo_first_master
  tasks:
  - block:
    - include_role:
        name: openshift_logging
        tasks_from: update_master_config
    when: openshift_logging_install_logging | default(false) | bool

Being just a beginner with Ansible, at first this didn't make much sense to me. What I make of it is that the first master is special, and the other masters are updated after the configuration on the first master is done. The next thing to check is the openshift_logging role:

[root@lb openshift-ansible-release-3.6]# ls -l roles/openshift_logging
total 24
drwxrwxr-x. 2 lbuser lbuser    22 Dec  6  2017 defaults
drwxrwxr-x. 2 lbuser lbuser    52 Dec  6  2017 files
drwxrwxr-x. 2 lbuser lbuser    75 Dec  6  2017 filter_plugins
drwxrwxr-x. 2 lbuser lbuser    22 Dec  6  2017 handlers
drwxrwxr-x. 2 lbuser lbuser    40 Dec  6  2017 library
drwxrwxr-x. 2 lbuser lbuser    23 Dec  6  2017 meta
-rw-rw-r--. 1 lbuser lbuser 18785 Dec  6  2017 README.md
drwxrwxr-x. 2 lbuser lbuser  4096 Aug 19 19:45 tasks
drwxrwxr-x. 2 lbuser lbuser    47 Dec  6  2017 templates
drwxrwxr-x. 2 lbuser lbuser    81 Dec  6  2017 vars

Quite a lot of directories to examine. Let's start with the tasks directory.

[root@lb openshift-ansible-release-3.6]# ls -l roles/openshift_logging/tasks
total 60
-rw-rw-r--. 1 lbuser lbuser   484 Dec  6  2017 annotate_ops_projects.yaml
-rw-rw-r--. 1 lbuser lbuser  2260 Dec  6  2017 delete_logging.yaml
-rw-rw-r--. 1 lbuser lbuser  4371 Dec  6  2017 generate_certs.yaml
-rw-rw-r--. 1 lbuser lbuser  3401 Dec  6  2017 generate_jks.yaml
-rw-rw-r--. 1 lbuser lbuser  1442 Dec  6  2017 generate_pems.yaml
-rw-rw-r--. 1 lbuser lbuser 16425 Dec  6  2017 install_logging.yaml
-rw-rw-r--. 1 lbuser lbuser  1478 Dec  6  2017 main.yaml
-rw-rw-r--. 1 lbuser lbuser  2824 Aug 19 18:54 procure_server_certs.yaml
-rw-rw-r--. 1 lbuser lbuser  1267 Dec  6  2017 procure_shared_key.yaml
-rw-rw-r--. 1 lbuser lbuser   377 Dec  6  2017 update_master_config.yaml

Well, it seems we found some treasure.
Let's find out which files contain the string 'system.admin':

[root@lb openshift-ansible-release-3.6]# grep system.admin roles/openshift_logging/tasks/*
roles/openshift_logging/tasks/generate_certs.yaml:    - system.admin
roles/openshift_logging/tasks/generate_jks.yaml:- name: Checking for system.admin.jks
roles/openshift_logging/tasks/generate_jks.yaml:  stat: path="{{generated_certs_dir}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:  register: system_admin_jks
roles/openshift_logging/tasks/generate_jks.yaml:  local_action: file path="{{local_tmp.stdout}}/system.admin.jks" state=touch mode="u=rw,g=r,o=r"
roles/openshift_logging/tasks/generate_jks.yaml:  when: system_admin_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:  when: not elasticsearch_jks.stat.exists or not logging_es_jks.stat.exists or not system_admin_jks.stat.exists or not truststore_jks.stat.exists
roles/openshift_logging/tasks/generate_jks.yaml:    src: "{{local_tmp.stdout}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:    dest: "{{generated_certs_dir}}/system.admin.jks"
roles/openshift_logging/tasks/generate_jks.yaml:  when: not system_admin_jks.stat.exists
[root@lb openshift-ansible-release-3.6]#

In the generate_certs.yaml file we find this:

- name: Generate PEM certs
  include: generate_pems.yaml component={{node_name}}
  with_items:
    - system.logging.fluentd
    - system.logging.kibana
    - system.logging.curator
    - system.admin
  loop_control:
    loop_var: node_name

This in turn includes another YAML file (generate_pems.yaml):

---
- name: Checking for {{component}}.key
  stat: path="{{generated_certs_dir}}/{{component}}.key"
  register: key_file
  check_mode: no

- name: Checking for {{component}}.crt
  stat: path="{{generated_certs_dir}}/{{component}}.crt"
  register: cert_file
  check_mode: no

- name: Creating cert req for {{component}}
  command: >
    openssl req -out {{generated_certs_dir}}/{{component}}.csr -new -newkey rsa:2048 -keyout {{generated_certs_dir}}/{{component}}.key
    -subj "/CN={{component}}/OU=OpenShift/O=Logging/subjectAltName=DNS.1=localhost{{cert_ext.stdout}}" -days 712 -nodes
  when:
    - not key_file.stat.exists
    - cert_ext is defined
    - cert_ext.stdout is defined
  check_mode: no

- name: Creating cert req for {{component}}
  command: >
    openssl req -out {{generated_certs_dir}}/{{component}}.csr -new -newkey rsa:2048 -keyout {{generated_certs_dir}}/{{component}}.key
    -subj "/CN={{component}}/OU=OpenShift/O=Logging" -days 712 -nodes
  when:
    - not key_file.stat.exists
    - cert_ext is undefined or cert_ext is defined and cert_ext.stdout is undefined
  check_mode: no

- name: Sign cert request with CA for {{component}}
  command: >
    openssl ca -in {{generated_certs_dir}}/{{component}}.csr -notext -out {{generated_certs_dir}}/{{component}}.crt
    -config {{generated_certs_dir}}/signing.conf -extensions v3_req -batch -extensions server_ext
  when:
    - not cert_file.stat.exists
  check_mode: no

Seeing this, I can reuse the certificate key and CSR and redo just the last part (signing), which runs when the cert file is missing. So I need to delete the certificate file at "{{generated_certs_dir}}/{{component}}.crt"; for example, for the system.admin component this becomes "{{generated_certs_dir}}/system.admin.crt". But I still don't know where generated_certs_dir is..

[root@lb openshift-ansible-release-3.6]# grep generated_certs -R * | head
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"
docs/proposals/role_decomposition.md:    generated_certs_dir: "{{openshift.common.config_base}}/logging"

OK, so it is under config_base:

[root@lb openshift-ansible-release-3.6]# grep config_base -R * | head
DEPLOYMENT_TYPES.md:| **openshift.common.config_base**                                | /etc/origin                              | /etc/origin                            |
docs/proposals/role_decomposition.md:    path: "{{ openshift.common.config_base }}/logging"

Luckily, the first result yields the correct location for config_base (/etc/origin). Thus the certificate directory is /etc/origin/logging. But it is not on the bootstrap server:

[root@lb openshift-ansible-release-3.6]# ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory

To my surprise, it exists on the first master only:

[root@lb openshift-ansible-release-3.6]# ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory
[root@lb openshift-ansible-release-3.6]# ssh master1 ls -l /etc/origin/logging
total 156
-rw-r--r--. 1 root root 1196 Aug 18  2018 02.pem
-rw-r--r--. 1 root root 1196 Aug 18  2018 03.pem
-rw-r--r--. 1 root root 1196 Aug 18  2018 04.pem
-rw-r--r--. 1 root root 1184 Aug 18  2018 05.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 06.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 07.pem
-rw-r--r--. 1 root root 1196 Aug 19 16:44 08.pem
-rw-r--r--. 1 root root 1184 Aug 19 16:44 09.pem
-rw-r--r--. 1 root root 1050 Aug 18  2018 ca.crt
...

[root@lb openshift-ansible-release-3.6]# ssh master2 ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory
[root@lb openshift-ansible-release-3.6]# ssh master3 ls -l /etc/origin/logging
ls: cannot access /etc/origin/logging: No such file or directory

The key takeaways:
- the certificate files are generated in the /etc/origin/logging directory on the first master server
- we can remove the expired certificates in order to force certificate renewal
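Before deleting anything, it is worth confirming which certificates are actually expired; a quick sketch of a loop to run on the first master:

cd /etc/origin/logging
for f in *.crt; do echo -n "$f: "; openssl x509 -noout -enddate -in "$f"; done    # prints notAfter= for every cert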

Next, we moved the expired certificates out of the way:
[root@master1 ~]# cd /etc/origin/logging
[root@master1 logging]# mkdir /tmp/expiredcerts
[root@master1 logging]# mv kibana-internal.crt system.admin.crt  system.logging.curator.crt system.logging.fluentd.crt system.logging.kibana.crt /tmp/expiredcerts/

Then we ran the playbook again:

ansible-playbook  -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

This resulted in another error about kibana-internal.crt being missing. Why is it not being regenerated? Let's ask our friend grep:
[root@lb openshift-ansible-release-3.6]# grep kibana-internal -R playbooks/byo | head
grep: warning: playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_certificate_expiry/examples/playbooks: recursive directory loop
grep: warning: playbooks/byo/openshift-checks/roles/openshift_certificate_expiry/examples/playbooks/roles: recursive directory loop
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging/tasks/generate_certs.yaml:    - procure_component: kibana-internal
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:  - { name: "kibana_internal_key", file: "kibana-internal.key"}
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:  - { name: "kibana_internal_cert", file: "kibana-internal.crt"}
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:    #  path: "{{ generated_certs_dir }}/kibana-internal.key"
playbooks/byo/openshift-checks/certificate_expiry/roles/openshift_logging_kibana/tasks/main.yaml:    #  path: "{{ generated_certs_dir }}/kibana-internal.crt"
grep: warning: playbooks/byo/openshift-cluster/roles/openshift_certificate_expiry/examples/playbooks/roles: recursive directory loop
playbooks/byo/openshift-checks/roles/openshift_logging/tasks/generate_certs.yaml:    - procure_component: kibana-internal

The kibana-internal certificate is handled in the generate_certs.yaml task (well, what do you know, there is certificate_expiry stuff for kibana-internal in the playbooks? I will need to check that later). Let's review that part of generate_certs.yaml:

- include: procure_server_certs.yaml
  loop_control:
    loop_var: cert_info
  with_items:
    - procure_component: kibana
    - procure_component: kibana-ops
    - procure_component: kibana-internal
      hostnames: "kibana, kibana-ops, {{openshift_logging_kibana_hostname}}, {{openshift_logging_kibana_ops_hostname}}"

This refers to procure_server_certs.yaml:

---
- name: Procure info
  debug:
    var: cert_info

- name: Checking for {{ cert_info.procure_component }}.crt
  stat: path="{{generated_certs_dir}}/{{ cert_info.procure_component }}.crt"
  register: component_cert_file
  check_mode: no

- name: Checking for {{ cert_info.procure_component }}.key
  stat: path="{{generated_certs_dir}}/{{ cert_info.procure_component }}.key"
  register: component_key_file
  check_mode: no

- name: Trying to discover server cert variable name for {{ cert_info.procure_component }}
  set_fact: procure_component_crt={{ lookup('env', '{{cert_info.procure_component}}' + '_crt') }}
  when:
  - cert_info.hostnames is undefined
  - cert_info[ cert_info.procure_component + '_crt' ] is defined
  - cert_info[ cert_info.procure_component + '_key' ] is defined
  check_mode: no

- name: Trying to discover the server key variable name for {{ cert_info.procure_component }}
  set_fact: procure_component_key={{ lookup('env', '{{cert_info.procure_component}}' + '_key') }}
  when:
  - cert_info.hostnames is undefined
  - cert_info[ cert_info.procure_component + '_crt' ] is defined
  - cert_info[ cert_info.procure_component + '_key' ] is defined
  check_mode: no

- name: Creating signed server cert and key for {{ cert_info.procure_component }}
  command: >
     {{ openshift.common.client_binary }} adm --config={{ mktemp.stdout }}/admin.kubeconfig ca create-server-cert
     --key={{generated_certs_dir}}/{{cert_info.procure_component}}.key --cert={{generated_certs_dir}}/{{cert_info.procure_component}}.crt
     --hostnames={{cert_info.hostnames|quote}} --signer-cert={{generated_certs_dir}}/ca.crt --signer-key={{generated_certs_dir}}/ca.key
     --signer-serial={{generated_certs_dir}}/ca.serial.txt
  check_mode: no
  when:
  - cert_info.hostnames is defined
  - not component_key_file.stat.exists
  - not component_cert_file.stat.exists

- name: Copying server key for {{ cert_info.procure_component }} to generated certs directory
  copy: content="{{procure_component_key}}" dest={{generated_certs_dir}}/{{cert_info.procure_component}}.key
  check_mode: no
... 

It seems that the signing process for this part requires both the key file and the cert file to be non-existent. A missing cert file alone will not get it regenerated. So the solution is to also remove kibana-internal.key, so that both the key and the crt file are regenerated.
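Concretely, on the first master, following the same pattern as before:

mv /etc/origin/logging/kibana-internal.key /tmp/expiredcerts/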
The strange thing is that the playbook also refers to kibana and kibana-ops, which should result in kibana.key, kibana-ops.key, kibana.crt and kibana-ops.crt being installed. But it seems that the missing hostnames clause causes those items to be skipped. Is it a typo? Everything seems fine without the kibana.* and kibana-ops.* files, though, so I will just ignore it.
After removing kibana-internal.key, the playbook works like a charm, but the running pods were still using the old certificates. The solution is either to kill the pods or to scale each deployment down to 0, wait for the pods to be destroyed, and scale back up to the original value (which might be 1 or 2 depending on the replication factor in the inventory file), for example as sketched below.
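A minimal sketch of the scale-down/scale-up approach; the deployment config name is a placeholder, so list yours with 'oc get dc' first and repeat for each logging deployment config:

oc project logging
oc get dc                                       # list the logging deployment configs
oc scale dc/<logging-dc-name> --replicas=0
# wait until the pods are gone, then restore the original replica count
oc scale dc/<logging-dc-name> --replicas=1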

Conclusion and Postscript

We conclude that the certificate generation process takes place only on the first master, with the directory /etc/origin/logging used for the logging certificates. In order to regenerate the certificates, we need to remove the expired .crt files as well as the kibana-internal.key file, and then execute the playbooks/byo/openshift-cluster/openshift-logging.yml Ansible playbook. Redeploying the pods is necessary in order for them to use the new certificates; the whole procedure is recapped below.
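For quick reference, the whole procedure condenses to roughly this (the inventory path is from our cluster and the deployment config name is a placeholder):

# on the first master: move the expired certs and kibana-internal.key out of the way
mkdir /tmp/expiredcerts
cd /etc/origin/logging
mv kibana-internal.crt kibana-internal.key system.admin.crt system.logging.curator.crt system.logging.fluentd.crt system.logging.kibana.crt /tmp/expiredcerts/

# on the bootstrap node: rerun the logging installation playbook
ansible-playbook -i /root/OSV3_ansible_inventory.20200731 playbooks/byo/openshift-cluster/openshift-logging.yml

# redeploy the logging pods so they pick up the new certificates (repeat per deployment config)
oc -n logging scale dc/<logging-dc-name> --replicas=0
oc -n logging scale dc/<logging-dc-name> --replicas=1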
After going through this little adventure, I found out that there is a similar passage involving the removal of /etc/origin/logging files in the 'Redeploying EFK Certificates' section of the chapter 'Aggregating Container Logs' for OpenShift 3.9. Just be wary that the location of the Ansible scripts changes quite a bit between OpenShift Origin (OKD) releases.





