How to show cluster-wide utilization in OpenShift/Kubernetes

Background

When deploying an OpenShift/Kubernetes cluster in production, we usually use more than three machines; my company's deployment uses more than 10 VMs. Monitoring resource usage across multiple VMs is nothing new, but with Kubernetes we can use kubectl top nodes (or oc adm top nodes on OpenShift) to do it. Either command shows the overall CPU and memory usage of each node.

The old way (pre-Kubernetes)

Before using Kubernetes/OpenShift, we ran RHEL/CentOS VMs and used the sar utility to get resource usage. sar must be installed first (sudo yum install sysstat). What makes sar useful here is that it stores historical data in one file per day of the month, under /var/log/sa.

To get CPU usage for the current date:

sar

To get memory usage for the current date:

sar -r

To get memory usage for the 30th of the month:

sar -r -f /var/log/sa/sa30

To get CPU usage for the 30th of the month:

sar -f /var/log/sa/sa30
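Since the data file name is just the two-digit day of the month, a small helper can build the path for any day. This is a minimal sketch, assuming the default /var/log/sa location used in the examples above; sa_file is a hypothetical name, not part of sysstat.

```shell
#!/bin/bash
# Print the sar data file for a given day of the month (1-31).
# Assumes the default location /var/log/sa/saDD shown in the examples above.
sa_file() {
    local day=${1#0}    # drop one leading zero so "08" is not parsed as octal
    case $day in
        ''|*[!0-9]*) echo "usage: sa_file DAY" >&2; return 1 ;;
    esac
    if [ "$day" -lt 1 ] || [ "$day" -gt 31 ]; then
        echo "day must be between 1 and 31" >&2
        return 1
    fi
    printf '/var/log/sa/sa%02d\n' "$day"
}

# Example: memory usage recorded on the 5th of the month
# sar -r -f "$(sa_file 5)"
```

Keep in mind that each file is overwritten when the same day comes around again, depending on how long sysstat is configured to keep history.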

Other useful reports include per-core CPU usage (sar -P ALL), I/O and block device statistics (sar -b and sar -d), and network statistics (sar -n DEV); see https://www.thegeekstuff.com/2011/03/sar-examples/ for more examples.

The Kubernetes (and OpenShift) way

The command to show cluster-wide utilization is kubectl top nodes on Kubernetes, or oc adm top nodes on OpenShift:

kubectl top nodes

oc adm top nodes

However, in some cases the command fails and returns an error message.

1. Cannot find resource

Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

The solution is to specify the namespace where the heapster service is running:

oc adm top nodes --heapster-namespace='openshift-infra'

2. Malformed HTTP response

Error from server (InternalError): an error on the server ("Error: 'malformed HTTP response \"\\x15\\x03\\x01\\x00\\x02\\x02\"'\nTrying to reach: 'http://10.198.14.128:8082/apis/metrics/v1alpha1/nodes?labelSelector='") has prevented the request from succeeding (get services http:heapster:)

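The bytes \x15\x03\x01 at the start of the malformed response are the header of a TLS alert record, which indicates that heapster is serving HTTPS while the request was sent over plain HTTP. The solution is to change the scheme from http to https: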

oc adm top nodes --heapster-namespace='openshift-infra' --heapster-scheme=https

NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
infra3.paas.telkom.co.id    616m         3%     6776Mi          42%
node3.paas.telkom.co.id     2205m        14%    64724Mi         50%
node1.paas.telkom.co.id     1625m        10%    64409Mi         50%
node7.paas.telkom.co.id     2915m        18%    33371Mi         25%
logging1.paas.telkom.co.id  191m         4%     10529Mi         66%
master3.paas.telkom.co.id   138m         3%     8769Mi          55%
node5.paas.telkom.co.id     3026m        19%    62947Mi         49%
infra2.paas.telkom.co.id    1486m        9%     5485Mi          34%
infra1.paas.telkom.co.id    376m         2%     4378Mi          27%
node4.paas.telkom.co.id     699m         4%     60793Mi         47%
node6.paas.telkom.co.id     980m         6%     53378Mi         41%
node9.paas.telkom.co.id     1150m        7%     48668Mi         37%
master2.paas.telkom.co.id   162m         4%     8998Mi          57%
logging2.paas.telkom.co.id  379m         9%     9764Mi          61%
node2.paas.telkom.co.id     2796m        17%    68857Mi         53%
node8.paas.telkom.co.id     797m         4%     52673Mi         40%
logging3.paas.telkom.co.id  60m          1%     1307Mi          8%
master1.paas.telkom.co.id   138m         3%     7183Mi          45%
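With this many nodes it can help to filter the output down to the busy ones. The following is a rough sketch that reads the output above on standard input and prints only nodes whose MEMORY% is at or above a threshold; busy_nodes is a hypothetical helper name, and the default threshold of 50 is an arbitrary choice.

```shell
#!/bin/bash
# Print nodes whose MEMORY% is at or above a threshold (default 50).
# Reads `oc adm top nodes` output on stdin; column 5 is MEMORY%.
busy_nodes() {
    awk -v limit="${1:-50}" '
        NR == 1 && $1 == "NAME" { next }    # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)              # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $1, $5
        }'
}

# Example:
# oc adm top nodes --heapster-namespace='openshift-infra' --heapster-scheme=https | busy_nodes 60
```

Applied to the output above with a threshold of 60, this would print only logging1 (66%) and logging2 (61%).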

Using a cron job to record utilization

The previous method shows current resource utilization, but what about utilization from the previous day or a few hours ago? A cron job can run the oc adm top command periodically and save each output to a file in a chosen directory. You must first create a service account for the script to use when running oc adm; refer to the OpenShift documentation: https://docs.openshift.com/container-platform/3.6/dev_guide/service_accounts.html

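#!/bin/bash

# Service account token used to authenticate the oc command
TOKEN1="<put service account token here>"

# Directory that holds one output file per run (must already exist)
LOG1=/root/logtop

# Timestamp for the file name, e.g. 20200720-000001
TGL1=$(date +%Y%m%d-%H%M%S)

oc adm top node --token="$TOKEN1" --heapster-namespace=openshift-infra --heapster-scheme=https >> "$LOG1/d-$TGL1"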
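The script still needs a crontab entry to run on a schedule. The timestamps in the recorded files further down are ten minutes apart, so an entry along the following lines would produce that pattern. This is only a sketch: cron_entry is a hypothetical helper, and /root/record-top.sh is an assumed location for the script above, so adjust both to your setup.

```shell
#!/bin/bash
# Build a crontab line that runs a script every N minutes.
# $1 = interval in minutes, $2 = absolute path to the script
cron_entry() {
    printf '*/%s * * * * %s\n' "$1" "$2"
}

# Append the entry to the current user's crontab (run once, by hand).
# /root/record-top.sh is an assumed path, not taken from the setup above.
# ( crontab -l 2>/dev/null; cron_entry 10 /root/record-top.sh ) | crontab -
```

Cron runs jobs with a minimal PATH, so if oc is not found, use its full path inside the script.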

The resulting files can be searched with grep to show utilization over time for a single VM/node:

[root@master1 ~]# grep infra3 /root/logtop/d-2020072*

/root/logtop/d-20200720-000001:infra3.paas.telkom.co.id 415m 2% 7024Mi 44%

/root/logtop/d-20200720-001001:infra3.paas.telkom.co.id 456m 2% 6967Mi 44%

/root/logtop/d-20200720-002001:infra3.paas.telkom.co.id 601m 3% 7215Mi 45%

/root/logtop/d-20200720-003001:infra3.paas.telkom.co.id 525m 3% 6955Mi 44%

/root/logtop/d-20200720-004001:infra3.paas.telkom.co.id 485m 3% 6897Mi 43%

/root/logtop/d-20200720-005001:infra3.paas.telkom.co.id 417m 2% 6743Mi 42%

/root/logtop/d-20200720-010001:infra3.paas.telkom.co.id 519m 3% 6741Mi 42%

/root/logtop/d-20200720-011001:infra3.paas.telkom.co.id 479m 2% 6742Mi 42%

/root/logtop/d-20200720-012001:infra3.paas.telkom.co.id 413m 2% 6485Mi 41%

/root/logtop/d-20200720-013002:infra3.paas.telkom.co.id 386m 2% 6101Mi 38%

/root/logtop/d-20200720-014001:infra3.paas.telkom.co.id 726m 4% 6192Mi 39%

/root/logtop/d-20200720-015001:infra3.paas.telkom.co.id 464m 2% 5870Mi 37%

/root/logtop/d-20200720-020001:infra3.paas.telkom.co.id 438m 2% 5771Mi 36%

/root/logtop/d-20200720-021001:infra3.paas.telkom.co.id 494m 3% 5706Mi 36%

/root/logtop/d-20200720-022001:infra3.paas.telkom.co.id 395m 2% 5598Mi 35%

/root/logtop/d-20200720-023001:infra3.paas.telkom.co.id 356m 2% 5760Mi 36%

/root/logtop/d-20200720-024001:infra3.paas.telkom.co.id 577m 3% 6181Mi 39%

/root/logtop/d-20200720-025001:infra3.paas.telkom.co.id 478m 2% 5913Mi 37%

/root/logtop/d-20200720-030001:infra3.paas.telkom.co.id 437m 2% 5803Mi 36%

/root/logtop/d-20200720-031001:infra3.paas.telkom.co.id 513m 3% 5854Mi 37%
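For charting in a spreadsheet, the grep output above can be turned into CSV. The sketch below assumes the file naming used by the recording script (d-YYYYMMDD-HHMMSS) and that grep prefixes each line with the file name; top_log_to_csv and the column names in the header are made up for illustration.

```shell
#!/bin/bash
# Convert "path/d-STAMP:node cpu cpu% mem mem%" lines from grep into CSV.
top_log_to_csv() {
    awk '
        BEGIN { print "timestamp,node,cpu,cpu_pct,memory,memory_pct" }
        {
            split($1, a, ":")           # a[1] = file path, a[2] = node name
            n = split(a[1], p, "/")     # p[n] = file name, e.g. d-20200720-000001
            stamp = p[n]
            sub(/^d-/, "", stamp)       # keep only the timestamp part
            print stamp "," a[2] "," $2 "," $3 "," $4 "," $5
        }'
}

# Example:
# grep -H infra3 /root/logtop/d-2020072* | top_log_to_csv > infra3.csv
```

The -H flag makes grep print the file name even when the glob matches a single file; without it the timestamp would be lost in that case.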

Another example:

[root@master1 ~]# grep infra2 /root/logtop/d-20200723* | tail

/root/logtop/d-20200723-072001:infra2.paas.telkom.co.id 454m 2% 10399Mi 65%

/root/logtop/d-20200723-073001:infra2.paas.telkom.co.id 461m 2% 11237Mi 71%

/root/logtop/d-20200723-074001:infra2.paas.telkom.co.id 2713m 16% 13194Mi 83%

/root/logtop/d-20200723-075001:infra2.paas.telkom.co.id 9033m 56% 15134Mi 95%

/root/logtop/d-20200723-080001:infra2.paas.telkom.co.id 10490m 65% 15183Mi 96%

/root/logtop/d-20200723-081001:infra2.paas.telkom.co.id 925m 5% 2795Mi 17%

/root/logtop/d-20200723-083001:infra2.paas.telkom.co.id 8197m 51% 10963Mi 69%

/root/logtop/d-20200723-085001:infra2.paas.telkom.co.id 3783m 23% 5095Mi 32%

/root/logtop/d-20200723-090001:infra2.paas.telkom.co.id 5559m 34% 12428Mi 78%

/root/logtop/d-20200723-091001:infra2.paas.telkom.co.id 15261m 95% 14896Mi 94%
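When a node misbehaves like this, the first question is usually when it peaked. As a rough sketch, the helper below reads the same grep output and prints the sample with the highest MEMORY%; peak_memory is a hypothetical name, and on ties it keeps the earliest sample.

```shell
#!/bin/bash
# Print the recorded sample with the highest MEMORY% (column 5).
# Reads grep output such as the lines above on stdin.
peak_memory() {
    awk '
        {
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (NR == 1 || pct + 0 > max) { max = pct + 0; line = $0 }
        }
        END { if (NR > 0) print line }'
}

# Example:
# grep -H infra2 /root/logtop/d-20200723* | peak_memory
```

For the ten samples shown above, this would report the 08:00 sample, where memory usage reached 96%.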

Conclusion

This post showed how to get per-node resource utilization with oc adm top nodes, how to work around two common errors (malformed HTTP response and could not find the requested resource), and how to record utilization over time with a cron job.

Addendum

Inventor's Paradox