Tips on Recovering from Out of Disk Space (Linux Server)

Background

When multiple VMs are being used for application infrastructure, sooner or later a system administrator will face an out-of-disk-space condition. This post will show a few selected approaches to resolve such a condition.

First Step: Identify the Disk Configuration

Some commands to determine the disk and mount configuration:

determine disk usage and mount points: df -h
detailed mount points and options: mount | column -t
physical volumes for LVM: pvs
logical volumes for LVM: lvs
volume groups for LVM: vgs
block devices list: lsblk

Some VMs might use ZFS on Linux; to examine pool configurations, use:
zfs list
zpool list
zfs list -t snapshot
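
As a convenience, the checks above can be collected into one small script that writes a single report; a minimal sketch (the report path is arbitrary, and the ZFS commands will just print nothing on hosts without ZFS):

#!/bin/bash
# collect the disk layout into one report for later reference
REPORT=/root/disk-report.txt
{
  echo "== df -h ==" ; df -h
  echo "== lsblk ==" ; lsblk
  echo "== mount ==" ; mount | column -t
  echo "== LVM ==" ; pvs ; vgs ; lvs
  echo "== ZFS ==" ; zfs list 2>/dev/null ; zpool list 2>/dev/null ; zfs list -t snapshot 2>/dev/null
} > "$REPORT" 2>&1
echo "report written to $REPORT"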

Second Step: Determine Which Directories Are Using the Most Space

From the df -h command, we find out which partition or mount point is out of disk space or nearing it. A better way to determine which directories are the largest is the ncdu tool, but if you don't have it installed you can always use du -hs /<path>/*.

For example, df -h results in:
[root@pv1 ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/cl-root   23G  4.8G   19G  21% /
devtmpfs              16G     0   16G   0% /dev
tmpfs                 16G     0   16G   0% /dev/shm
tmpfs                 16G  1.6G   15G  11% /run
tmpfs                 16G     0   16G   0% /sys/fs/cgroup
/dev/sda1           1014M  144M  871M  15% /boot
/dev/mapper/cl-var    10G  580M  9.5G   6% /var
pool1/data1          813G  735G   79G  91% /data
pool1                 79G     0   79G   0% /pool1
tmpfs                3.2G     0  3.2G   0% /run/user/0
Then we check out /data:
[root@pv1 exports]# ls -l /data/
total 5
drwxrwxrwx+ 113 nfsnobody nfsnobody 113 May 27 19:57 exports

There is only one directory in /data, so let's descend one level:
[root@pv1 exports]# ls -l /data/exports
total 365
drwxrwxrwx+  4 nfsnobody nfsnobody   6 May 21  2018 clustermetrics
drwxr-xr-x+  2 nfsnobody nfsnobody   2 Jun 16  2017 mysql-nolsatu
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Apr  1 22:40 pv0000001
drwxrwxr-x+  6 nfsnobody nfsnobody  24 Jun 29 12:32 pv0000002
drwxrwxr-x+  6 nfsnobody nfsnobody  24 Jan 29  2020 pv0000003
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Apr  1 22:42 pv0000004
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Jun 29 12:35 pv0000005
drwxrwxr-x+  5 nfsnobody nfsnobody   5 Nov  6  2019 pv0000006
drwxrwxr-x+  9 nfsnobody nfsnobody  25 Apr  5 13:03 pv0000007
...<redacted>..

Then we check the usage summary at this level:
[root@pv1 exports]# du -hs /data/exports/*
21G /data/exports/clustermetrics
1.5K /data/exports/mysql-nolsatu
190M /data/exports/pv0000001
652M /data/exports/pv0000002
190M /data/exports/pv0000003
408M /data/exports/pv0000004
190M /data/exports/pv0000005
6.0K /data/exports/pv0000006
415M /data/exports/pv0000007
264M /data/exports/pv0000008
263M /data/exports/pv0000009
29M /data/exports/pv0000010
332M /data/exports/pv0000011
178M /data/exports/pv0000012
190M /data/exports/pv0000013
1.5K /data/exports/pv0000014
190M /data/exports/pv0000015
47M /data/exports/pv0000016
340M /data/exports/pv0000017
332M /data/exports/pv0000018
2.8M /data/exports/pv0000019
55M /data/exports/pv0000020
190M /data/exports/pv0000021

It might take a long time; you might want to run it inside a VNC session or screen.
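
For example, a sketch that runs the scan detached inside a screen session and sorts the result so the biggest directories come first (the report path is arbitrary):

screen -dmS dusage bash -c 'du -sh /data/exports/* 2>/dev/null | sort -rh > /root/du-report.txt'

Reattach with screen -r dusage while it is still running, or simply read /root/du-report.txt once it finishes.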

Deleting ZFS Snapshots (If There Are Any)

First trick: when we are using ZFS snapshots, we can delete old snapshots to reclaim space.
Check the snapshots using: zfs list -t snapshot

[root@pv1 exports]# zfs list -t snapshot
NAME                                                     USED  AVAIL  REFER  MOUNTPOINT
pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659        24.1G      -   728G  -
pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659        12.2G      -   729G  -
pool1/data1@zfs-auto-snap_weekly-2020-07-19-1659            0      -   728G  -
pool1/data1@zfs-auto-snap_daily-2020-07-19-1659             0      -   728G  -
..<redacted>..

Then we can delete the old snapshots:
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659

I sometimes use this shortcut (zfsdestlast) to destroy the oldest snapshot; -H drops the header and -s creation sorts oldest first, which is more robust than cutting columns from the default output:

alias zfsdestlast='zfs destroy "$(zfs list -H -o name -s creation -t snapshot | head -n 1)"'

Be aware that we will be unable to recover data to that point in time after deleting the snapshots.
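
When many auto-snapshots pile up, the dry-run flag of zfs destroy can preview how much space each deletion would reclaim before committing; a sketch assuming the zfs-auto-snap weekly naming shown above (drop the -n to actually destroy):

for snap in $(zfs list -H -o name -t snapshot | grep zfs-auto-snap_weekly); do
  zfs destroy -nv "$snap"   # -n = dry run, -v = show reclaimed space
done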

Deleting Unneeded Files

This step is only possible if we are certain that some large files are not needed at all. Just remove the offending files, and make sure that other people on your team (and also the team using the VM) have confirmed before doing any action.
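To hunt for candidates, find can list unusually large files on a single filesystem; a sketch with an arbitrary 500 MB threshold (adjust the path and size to your case):

find /data -xdev -type f -size +500M -exec ls -lh {} + 2>/dev/null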
If the directory nearing full is being used as an OpenShift container registry, you might want to run the OpenShift image prune (oadm prune images) from the master; running it from outside the cluster (such as from your laptop) failed in my case.
If the directory (or logical volume) nearing full is being used as Docker storage, you might want to clean up dangling images ( docker rmi $(docker images --filter "dangling=true" -q --no-trunc) ); refer to this Stack Overflow question. Newer versions of Docker have the docker image prune command. Make sure you know the implications of running each prune command (container, volume, and image) with or without the -a option before doing any action.
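
For orientation before pruning, these commands (available in newer Docker releases) show where the space goes and clean up step by step; review what each prune reports before confirming:

docker system df        # summary of space used by images, containers, and volumes
docker container prune  # remove stopped containers
docker image prune      # remove dangling images only
docker image prune -a   # remove ALL images not referenced by a container (aggressive)
docker volume prune     # remove unused volumes (risk of data loss, double check)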

Extending LVM Partition

If there is significant free space in the pvs command output, meaning the physical LVM disks are not entirely allocated, we can easily extend a logical LVM volume.

Commands:
pvs
lvs
lvextend -L +10G /dev/mapper/vgname-lvname
(replace vgname and lvname with the names shown in lvs and also df -h)
resize2fs /dev/mapper/vgname-lvname
Note that resize2fs only works for ext2/3/4 filesystems; on XFS (the default in CentOS 7/8, as in the cl-root example above) use xfs_growfs with the mount point instead, or pass -r to lvextend to resize the filesystem in the same step.
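
As a concrete sketch using the volume names from the df -h output above (this assumes the cl volume group actually has free extents; check the VFree column of vgs first):

vgs                                     # VFree column shows unallocated space
lvextend -r -L +5G /dev/mapper/cl-var   # grow cl-var by 5 GB and resize its filesystem in one step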

Another possibility is to add additional storage in VMware (if the Linux server is virtual), create a partition on the new storage, and add it into the volume group so we have free space to extend into.
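
A sketch of that flow, assuming the new VMware disk shows up as /dev/sdb (check lsblk for the real name); here the whole disk is used as a physical volume, which skips the partitioning step:

lsblk                                    # confirm the new, empty disk
pvcreate /dev/sdb                        # initialize it as an LVM physical volume
vgextend cl /dev/sdb                     # add it to the volume group (cl in the example above)
lvextend -r -L +20G /dev/mapper/cl-var   # now there is free space to extend into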

Borrowing Space from Another Partition

If one partition is low on space (for example, 99% used) and another partition is quite free (let's say it has 50 GB free), we might want to exchange disk space between them. One approach is shrinking an LVM partition, but it has shortcomings: if the partition isn't using LVM, this approach cannot be used, and shrinking a partition while it is still being used by a running process is difficult when you want to minimize server downtime. So I would propose a different approach.
Step A. Find the mount point that has the most free storage (for example, /var with 50 GB free); then find a directory in the low-space partition that uses a significant amount of storage, below those 50 GB. For example, I found the /es1/ops directory, which used 14 GB of storage, while /es1 usage was at 99%.
Step B. Make sure the server software that uses the directory is stopped.
Step C. Create a sparse file in the donor partition (for this example, /var): truncate -s 20G /var/storage/es-ops-1
Step D. Create a temporary mount point: mkdir /mnt/es-ops-1
Step E. Create a filesystem inside the sparse file: mkfs.xfs /var/storage/es-ops-1
Step F. Mount the filesystem on the temporary mount point: mount -o loop /var/storage/es-ops-1 /mnt/es-ops-1/
Step G. Move the directory contents from step A to the mount point from step F:
cd /es1/ops
mv * /mnt/es-ops-1/
(note that mv * skips hidden dot-files; check with ls -A that nothing was left behind)
Step H. Remount the sparse file at the original directory:
umount /mnt/es-ops-1
mount -o loop /var/storage/es-ops-1 /es1/ops
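
If the server might reboot before a permanent fix is in place, the loop mount can be persisted in /etc/fstab; a sketch using the example paths above:

/var/storage/es-ops-1  /es1/ops  xfs  loop  0 0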

