Tips on Recovering from Out of Disk Space (Linux Server)

Background

When multiple VMs are being used for application infrastructure, sooner or later a system administrator will face an out-of-disk-space condition. This post will show a few selected approaches to resolve such a condition.

First Step: Identify the Disk Configuration

Some commands to determine the disk and mount configuration:

determine disk usage and mount points: df -h
detailed mount points and options: mount | column -t
physical volumes for LVM: pvs
logical volumes for LVM: lvs
volume groups for LVM: vgs
block devices list: lsblk

Some VMs might use ZFS on Linux; to examine pool configurations, use:
zfs list
zpool list
zfs list -t snapshot
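
As a convenience, the checks above can be collected into one small script that writes a single report; a minimal sketch (the report path is arbitrary, and the ZFS commands will just print nothing on hosts without ZFS):

#!/bin/bash
# collect the disk layout into one report for later reference
REPORT=/root/disk-report.txt
{
  echo "== df -h ==" ; df -h
  echo "== lsblk ==" ; lsblk
  echo "== mount ==" ; mount | column -t
  echo "== LVM ==" ; pvs ; vgs ; lvs
  echo "== ZFS ==" ; zfs list 2>/dev/null ; zpool list 2>/dev/null ; zfs list -t snapshot 2>/dev/null
} > "$REPORT" 2>&1
echo "report written to $REPORT"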

Second Step: Determine Which Directories Are Using the Most Space

From the df -h command, we find out which partition or mount point is out of disk space or nearing it. A better way to determine which directories are the largest is the ncdu tool, but if you don't have it installed you can always use du -hs /<path>/*.

For example, df -h results in:
[root@pv1 ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/cl-root   23G  4.8G   19G  21% /
devtmpfs              16G     0   16G   0% /dev
tmpfs                 16G     0   16G   0% /dev/shm
tmpfs                 16G  1.6G   15G  11% /run
tmpfs                 16G     0   16G   0% /sys/fs/cgroup
/dev/sda1           1014M  144M  871M  15% /boot
/dev/mapper/cl-var    10G  580M  9.5G   6% /var
pool1/data1          813G  735G   79G  91% /data
pool1                 79G     0   79G   0% /pool1
tmpfs                3.2G     0  3.2G   0% /run/user/0
Then we check out /data:
[root@pv1 exports]# ls -l /data/
total 5
drwxrwxrwx+ 113 nfsnobody nfsnobody 113 May 27 19:57 exports

There is only one directory in /data, so let's descend one level:
[root@pv1 exports]# ls -l /data/exports
total 365
drwxrwxrwx+  4 nfsnobody nfsnobody   6 May 21  2018 clustermetrics
drwxr-xr-x+  2 nfsnobody nfsnobody   2 Jun 16  2017 mysql-nolsatu
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Apr  1 22:40 pv0000001
drwxrwxr-x+  6 nfsnobody nfsnobody  24 Jun 29 12:32 pv0000002
drwxrwxr-x+  6 nfsnobody nfsnobody  24 Jan 29  2020 pv0000003
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Apr  1 22:42 pv0000004
drwxrwxr-x+  6 nfsnobody nfsnobody  22 Jun 29 12:35 pv0000005
drwxrwxr-x+  5 nfsnobody nfsnobody   5 Nov  6  2019 pv0000006
drwxrwxr-x+  9 nfsnobody nfsnobody  25 Apr  5 13:03 pv0000007
...<redacted>..

Then we check the usage summary at this level:
[root@pv1 exports]# du -hs /data/exports/*
21G /data/exports/clustermetrics
1.5K /data/exports/mysql-nolsatu
190M /data/exports/pv0000001
652M /data/exports/pv0000002
190M /data/exports/pv0000003
408M /data/exports/pv0000004
190M /data/exports/pv0000005
6.0K /data/exports/pv0000006
415M /data/exports/pv0000007
264M /data/exports/pv0000008
263M /data/exports/pv0000009
29M /data/exports/pv0000010
332M /data/exports/pv0000011
178M /data/exports/pv0000012
190M /data/exports/pv0000013
1.5K /data/exports/pv0000014
190M /data/exports/pv0000015
47M /data/exports/pv0000016
340M /data/exports/pv0000017
332M /data/exports/pv0000018
2.8M /data/exports/pv0000019
55M /data/exports/pv0000020
190M /data/exports/pv0000021

It might take a long time; you might want to run it inside a VNC session or screen.
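
For example, a sketch that runs the scan detached inside a screen session and sorts the result so the biggest directories come first (the report path is arbitrary):

screen -dmS dusage bash -c 'du -sh /data/exports/* 2>/dev/null | sort -rh > /root/du-report.txt'

Reattach with screen -r dusage while it is still running, or simply read /root/du-report.txt once it finishes.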

Deleting ZFS Snapshots (If There Are Any)

First trick: when we are using ZFS snapshots, we can delete old snapshots to reclaim space.
Check the snapshots using: zfs list -t snapshot

[root@pv1 exports]# zfs list -t snapshot
NAME                                                     USED  AVAIL  REFER  MOUNTPOINT
pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659        24.1G      -   728G  -
pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659        12.2G      -   729G  -
pool1/data1@zfs-auto-snap_weekly-2020-07-19-1659            0      -   728G  -
pool1/data1@zfs-auto-snap_daily-2020-07-19-1659             0      -   728G  -
..<redacted>..

Then we can delete the old snapshots:
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659

I sometimes use this shortcut (zfsdestlast) to destroy the oldest snapshot; -H drops the header and -s creation sorts oldest first, which is more robust than cutting columns from the default output:

alias zfsdestlast='zfs destroy "$(zfs list -H -o name -s creation -t snapshot | head -n 1)"'

Be aware that we will be unable to recover data to that point in time after deleting the snapshots.
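
When many auto-snapshots pile up, the dry-run flag of zfs destroy can preview how much space each deletion would reclaim before committing; a sketch assuming the zfs-auto-snap weekly naming shown above (drop the -n to actually destroy):

for snap in $(zfs list -H -o name -t snapshot | grep zfs-auto-snap_weekly); do
  zfs destroy -nv "$snap"   # -n = dry run, -v = show reclaimed space
done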

Deleting Unneeded Files

This step is only possible if we are certain that some large files are not needed at all. Just remove the offending files, and make sure that other people on your team (and also the team using the VM) have confirmed before doing any action.
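To hunt for candidates, find can list unusually large files on a single filesystem; a sketch with an arbitrary 500 MB threshold (adjust the path and size to your case):

find /data -xdev -type f -size +500M -exec ls -lh {} + 2>/dev/null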
If the directory nearing full is being used as an OpenShift container registry, you might want to run the OpenShift image prune (oadm prune images) from the master; running it from outside the cluster (such as from your laptop) failed in my case.
If the directory (or logical volume) nearing full is being used as Docker storage, you might want to clean up dangling images ( docker rmi $(docker images --filter "dangling=true" -q --no-trunc) ); refer to this Stack Overflow question. Newer versions of Docker have the docker image prune command. Make sure you know the implications of running each prune command (container, volume, and image) with or without the -a option before doing any action.
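
For orientation before pruning, these commands (available in newer Docker releases) show where the space goes and clean up step by step; review what each prune reports before confirming:

docker system df        # summary of space used by images, containers, and volumes
docker container prune  # remove stopped containers
docker image prune      # remove dangling images only
docker image prune -a   # remove ALL images not referenced by a container (aggressive)
docker volume prune     # remove unused volumes (risk of data loss, double check)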

Extending LVM Partition

If there is significant free space in the pvs command output, meaning the physical LVM disks are not entirely allocated, we can easily extend a logical LVM volume.

Commands:
pvs
lvs
lvextend -L +10G /dev/mapper/vgname-lvname
(replace vgname and lvname with the names shown in lvs and also df -h)
resize2fs /dev/mapper/vgname-lvname
Note that resize2fs only works for ext2/3/4 filesystems; on XFS (the default in CentOS 7/8, as in the cl-root example above) use xfs_growfs with the mount point instead, or pass -r to lvextend to resize the filesystem in the same step.
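
As a concrete sketch using the volume names from the df -h output above (this assumes the cl volume group actually has free extents; check the VFree column of vgs first):

vgs                                     # VFree column shows unallocated space
lvextend -r -L +5G /dev/mapper/cl-var   # grow cl-var by 5 GB and resize its filesystem in one step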

Another possibility is to add additional storage in VMware (if the Linux server is virtual), create a partition on the new storage, and add it into the volume group so we have free space to extend into.
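
A sketch of that flow, assuming the new VMware disk shows up as /dev/sdb (check lsblk for the real name); here the whole disk is used as a physical volume, which skips the partitioning step:

lsblk                                    # confirm the new, empty disk
pvcreate /dev/sdb                        # initialize it as an LVM physical volume
vgextend cl /dev/sdb                     # add it to the volume group (cl in the example above)
lvextend -r -L +20G /dev/mapper/cl-var   # now there is free space to extend into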

Borrowing Space from Another Partition

If one partition is low on space (for example, 99% used) and another partition is quite free (let's say it has 50 GB free), we might want to exchange disk space between them. One approach is shrinking an LVM partition, but it has shortcomings: if the partition isn't using LVM, this approach cannot be used, and shrinking a partition while it is still being used by a running process is difficult when you want to minimize server downtime. So I would propose a different approach.
Step A. Find the mount point that has the most free storage (for example, /var with 50 GB free); then find a directory in the low-space partition that uses a significant amount of storage, below those 50 GB. For example, I found the /es1/ops directory, which used 14 GB of storage, while /es1 usage was at 99%.
Step B. Make sure the server software that uses the directory is stopped.
Step C. Create a sparse file in the donor partition (for this example, /var): truncate -s 20G /var/storage/es-ops-1
Step D. Create a temporary mount point: mkdir /mnt/es-ops-1
Step E. Create a filesystem inside the sparse file: mkfs.xfs /var/storage/es-ops-1
Step F. Mount the filesystem on the temporary mount point: mount -o loop /var/storage/es-ops-1 /mnt/es-ops-1/
Step G. Move the directory contents from step A to the mount point from step F:
cd /es1/ops
mv * /mnt/es-ops-1/
(note that mv * skips hidden dot-files; check with ls -A that nothing was left behind)
Step H. Remount the sparse file at the original directory:
umount /mnt/es-ops-1
mount -o loop /var/storage/es-ops-1 /es1/ops
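
If the server might reboot before a permanent fix is in place, the loop mount can be persisted in /etc/fstab; a sketch using the example paths above:

/var/storage/es-ops-1  /es1/ops  xfs  loop  0 0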

