How to show cluster-wide utilization in OpenShift/Kubernetes

Background

A production OpenShift/Kubernetes cluster usually runs on more than three hosts; my company's deployment uses more than 10 VMs. Monitoring resource usage across multiple VMs is nothing new, but with Kubernetes we have the option of using kubectl top nodes (or oc adm top nodes on OpenShift), which shows the overall CPU and memory usage of each node.

The old ways (pre-kubernetes)

Before using Kubernetes/OpenShift, we ran RHEL/CentOS VMs and used the sar utility to get resource usage. sar has to be installed first (sudo yum install sysstat). It is somewhat unusual in that it stores historical data in one file per day of the month.
To get CPU usage for the current date:
sar
To get memory usage for the current date:
sar -r
To get memory usage for the 30th of the month:
sar -r -f /var/log/sa/sa30
To get CPU usage for the 30th of the month:
sar -f /var/log/sa/sa30
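
Since the data file name depends on the day of the month, a small helper can build the path for "N days ago" instead of typing it by hand. This is only a sketch: the function name sa_file_days_ago is made up for illustration, and it assumes GNU date and the RHEL/CentOS default directory /var/log/sa.

```shell
# Print the path of the sysstat data file for N days ago (default: 1).
# Assumes GNU date (for -d) and the default data directory /var/log/sa.
sa_file_days_ago() {
    days=${1:-1}
    day=$(date -d "${days} days ago" +%d)   # zero-padded day of month, e.g. 05
    printf '/var/log/sa/sa%s\n' "$day"
}

# Example, on a host that has sysstat data for that day:
#   sar -r -f "$(sa_file_days_ago 1)"
```

Note that sysstat only keeps a limited number of days, so older files may already have been rotated away or overwritten.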

Other useful reports include per-core CPU usage (sar -P ALL), I/O and block device statistics (sar -b and sar -d), and network statistics (sar -n DEV). For more examples, see https://www.thegeekstuff.com/2011/03/sar-examples/

The Kubernetes (and OpenShift) way

The command to show cluster-wide utilization is:
kubectl top nodes
or
oc adm top nodes

However, in some cases the command fails and returns an error message.

1. Cannot find resource

Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

The solution is to specify the namespace where the heapster service is running:
oc adm top nodes --heapster-namespace='openshift-infra'

2. Malformed HTTP response


Error from server (InternalError): an error on the server ("Error: 'malformed HTTP response \"\\x15\\x03\\x01\\x00\\x02\\x02\"'\nTrying to reach: 'http://10.198.14.128:8082/apis/metrics/v1alpha1/nodes?labelSelector='") has prevented the request from succeeding (get services http:heapster:)

The solution here is to change the scheme from http to https:
oc adm top nodes --heapster-namespace='openshift-infra' --heapster-scheme=https
NAME                         CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
infra3.paas.telkom.co.id     616m         3%        6776Mi          42%       
node3.paas.telkom.co.id      2205m        14%       64724Mi         50%       
node1.paas.telkom.co.id      1625m        10%       64409Mi         50%       
node7.paas.telkom.co.id      2915m        18%       33371Mi         25%       
logging1.paas.telkom.co.id   191m         4%        10529Mi         66%       
master3.paas.telkom.co.id    138m         3%        8769Mi          55%       
node5.paas.telkom.co.id      3026m        19%       62947Mi         49%       
infra2.paas.telkom.co.id     1486m        9%        5485Mi          34%       
infra1.paas.telkom.co.id     376m         2%        4378Mi          27%       
node4.paas.telkom.co.id      699m         4%        60793Mi         47%       
node6.paas.telkom.co.id      980m         6%        53378Mi         41%       
node9.paas.telkom.co.id      1150m        7%        48668Mi         37%       
master2.paas.telkom.co.id    162m         4%        8998Mi          57%       
logging2.paas.telkom.co.id   379m         9%        9764Mi          61%       
node2.paas.telkom.co.id      2796m        17%       68857Mi         53%       
node8.paas.telkom.co.id      797m         4%        52673Mi         40%       
logging3.paas.telkom.co.id   60m          1%        1307Mi          8%        
master1.paas.telkom.co.id    138m         3%        7183Mi          45%    
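
With this many nodes it helps to see the busiest ones first. The sketch below sorts the table by the CPU% column using only standard shell tools; the function name sort_by_cpu_pct is made up for illustration, and it assumes the five-column layout shown above.

```shell
# Read "top nodes" output on stdin and print it sorted by CPU%, busiest first.
# The header line is passed through unchanged and is not part of the sort.
sort_by_cpu_pct() {
    {
        IFS= read -r header
        printf '%s\n' "$header"
        sort -k3,3nr        # column 3 is CPU%; -n ignores the trailing "%"
    }
}

# Example:
#   oc adm top nodes --heapster-namespace='openshift-infra' --heapster-scheme=https | sort_by_cpu_pct
```

Changing the sort key to -k5,5nr would order the nodes by MEMORY% instead.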

Using a cron job to record utilization

The previous method gives us the current resource utilization, but what about utilization on the previous day or a few hours back? The script below, run from cron, executes the oc adm command and records each output in a timestamped file in a log directory. You must first create a service account to be used when running the oc adm command; refer to the OpenShift documentation: https://docs.openshift.com/container-platform/3.6/dev_guide/service_accounts.html

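#!/bin/bash

# Service account token used to authenticate the oc command
TOKEN1="<put service account token here>"

# Directory that holds one output file per run
LOG1=/root/logtop
mkdir -p "$LOG1"
# Timestamp for the file name, e.g. 20200720-000001
TGL1=$(date +%Y%m%d-%H%M%S)
oc adm top node --token="$TOKEN1" --heapster-namespace=openshift-infra --heapster-scheme=https >> "$LOG1/d-$TGL1"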
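
The script still needs a crontab entry to run it periodically. The file names in the sample output below suggest a 10-minute interval, so this sketch prints an entry with that schedule. The function name cron_entry and the script path /root/logtop/record-top.sh are assumptions for illustration, not taken from the original setup.

```shell
# Print a crontab line that runs the given script every 10 minutes.
cron_entry() {
    script_path=$1
    printf '*/10 * * * * %s\n' "$script_path"
}

# Example: append the entry to the current user's crontab (review it first):
#   (crontab -l 2>/dev/null; cron_entry /root/logtop/record-top.sh) | crontab -
```

The script must be executable (chmod +x), and oc must be on the PATH that cron uses, or be called with its full path inside the script.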

The resulting files can be searched with grep to show utilization for a single VM/node:

[root@master1 ~]# grep infra3 /root/logtop/d-2020072*
/root/logtop/d-20200720-000001:infra3.paas.telkom.co.id     415m         2%        7024Mi          44%       
/root/logtop/d-20200720-001001:infra3.paas.telkom.co.id     456m         2%        6967Mi          44%       
/root/logtop/d-20200720-002001:infra3.paas.telkom.co.id     601m         3%        7215Mi          45%       
/root/logtop/d-20200720-003001:infra3.paas.telkom.co.id     525m         3%        6955Mi          44%       
/root/logtop/d-20200720-004001:infra3.paas.telkom.co.id     485m         3%        6897Mi          43%       
/root/logtop/d-20200720-005001:infra3.paas.telkom.co.id     417m         2%        6743Mi          42%       
/root/logtop/d-20200720-010001:infra3.paas.telkom.co.id     519m         3%        6741Mi          42%       
/root/logtop/d-20200720-011001:infra3.paas.telkom.co.id     479m         2%        6742Mi          42%       
/root/logtop/d-20200720-012001:infra3.paas.telkom.co.id     413m         2%        6485Mi          41%       
/root/logtop/d-20200720-013002:infra3.paas.telkom.co.id     386m         2%        6101Mi          38%       
/root/logtop/d-20200720-014001:infra3.paas.telkom.co.id     726m         4%        6192Mi          39%       
/root/logtop/d-20200720-015001:infra3.paas.telkom.co.id     464m         2%        5870Mi          37%       
/root/logtop/d-20200720-020001:infra3.paas.telkom.co.id     438m         2%        5771Mi          36%       
/root/logtop/d-20200720-021001:infra3.paas.telkom.co.id     494m         3%        5706Mi          36%       
/root/logtop/d-20200720-022001:infra3.paas.telkom.co.id     395m         2%        5598Mi          35%       
/root/logtop/d-20200720-023001:infra3.paas.telkom.co.id     356m         2%        5760Mi          36%       
/root/logtop/d-20200720-024001:infra3.paas.telkom.co.id     577m         3%        6181Mi          39%       
/root/logtop/d-20200720-025001:infra3.paas.telkom.co.id     478m         2%        5913Mi          37%       
/root/logtop/d-20200720-030001:infra3.paas.telkom.co.id     437m         2%        5803Mi          36%       
/root/logtop/d-20200720-031001:infra3.paas.telkom.co.id     513m         3%        5854Mi          37%

Another example:

[root@master1 ~]# grep infra2 /root/logtop/d-20200723* | tail
/root/logtop/d-20200723-072001:infra2.paas.telkom.co.id     454m         2%        10399Mi         65%       
/root/logtop/d-20200723-073001:infra2.paas.telkom.co.id     461m         2%        11237Mi         71%       
/root/logtop/d-20200723-074001:infra2.paas.telkom.co.id     2713m        16%       13194Mi         83%       
/root/logtop/d-20200723-075001:infra2.paas.telkom.co.id     9033m        56%       15134Mi         95%       
/root/logtop/d-20200723-080001:infra2.paas.telkom.co.id     10490m       65%       15183Mi         96%       
/root/logtop/d-20200723-081001:infra2.paas.telkom.co.id     925m         5%        2795Mi          17%       
/root/logtop/d-20200723-083001:infra2.paas.telkom.co.id     8197m        51%       10963Mi         69%       
/root/logtop/d-20200723-085001:infra2.paas.telkom.co.id     3783m        23%       5095Mi          32%       
/root/logtop/d-20200723-090001:infra2.paas.telkom.co.id     5559m        34%       12428Mi         78%       
/root/logtop/d-20200723-091001:infra2.paas.telkom.co.id     15261m       95%       14896Mi         94%
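
Spikes like the ones above are easier to spot if the peak sample is picked out automatically. The sketch below reads grep output in the format shown here and prints the timestamp, CPU% and MEMORY% of the sample with the highest CPU%. The function name peak_cpu_sample is made up for illustration. It assumes grep was run over several files, so that every line starts with the file name followed by a colon.

```shell
# Read lines like "/root/logtop/d-YYYYMMDD-HHMMSS:node cpu cpu% mem mem%"
# on stdin and print "YYYYMMDD-HHMMSS cpu% mem%" for the highest CPU% sample.
peak_cpu_sample() {
    awk '
    {
        split($1, parts, ":")           # parts[1] = file path, parts[2] = node name
        n = split(parts[1], seg, "/")   # last path segment is d-YYYYMMDD-HHMMSS
        stamp = substr(seg[n], 3)       # drop the leading "d-"
        cpu = $3 + 0                    # "56%" becomes the number 56
        if (NR == 1 || cpu > max) { max = cpu; best = stamp " " $3 " " $5 }
    }
    END { if (NR > 0) print best }'
}

# Example:
#   grep infra2 /root/logtop/d-20200723* | peak_cpu_sample
```

On ties, the earliest sample with the highest CPU% is reported.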

Conclusion

This post showed how to get per-host/VM resource utilization using oc adm top nodes, along with workarounds for two common errors: the malformed HTTP response and the server not finding the requested resource.
