Tips on Recovering from Out of Disk Space (Linux Server)
Background
When multiple VMs are used for application infrastructure, sooner or later a system administrator will face an out-of-disk-space condition. This post shows a few selected approaches to resolving such a condition.
First Step: Identify the Disk Configuration
Some commands to determine the disk and mount configuration:
determine disk usage and mount points: df -h
detailed mount points and options: mount | column -t
physical volumes for LVM: pvs
logical volumes for LVM: lvs
volume groups for LVM: vgs
list block devices: lsblk
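The commands above can be gathered into a one-shot report. A minimal sketch, which skips any tool that is not installed (LVM and lsblk may be absent on some VMs):

```shell
# Sketch: print a quick disk-layout report, skipping tools that are absent.
disk_report() {
  for cmd in "df -h" "lsblk" "pvs" "lvs" "vgs"; do
    tool=${cmd%% *}                      # first word is the binary name
    if command -v "$tool" >/dev/null 2>&1; then
      printf '== %s ==\n' "$cmd"
      $cmd
    fi
  done
}
disk_report
```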
Some VMs might use ZFS on Linux; to examine the pool configuration, use:
zfs list
zpool list
zfs list -t snapshot
Second Step: Determine Which Directories Use the Most Space
The df -h command tells us which partition or mount point is out of disk space, or nearing it. A better way to find the largest directories is the ncdu tool, but if you don't have it installed you can always use du -hs /<path>/*.
For example, df -h gives:
[root@pv1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cl-root 23G 4.8G 19G 21% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 1.6G 15G 11% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda1 1014M 144M 871M 15% /boot
/dev/mapper/cl-var 10G 580M 9.5G 6% /var
pool1/data1 813G 735G 79G 91% /data
pool1 79G 0 79G 0% /pool1
tmpfs 3.2G 0 3.2G 0% /run/user/0
Then we check /data:
[root@pv1 exports]# ls -l /data/
total 5
drwxrwxrwx+ 113 nfsnobody nfsnobody 113 May 27 19:57 exports
There is only one directory in /data, so let's descend one level:
[root@pv1 exports]# ls -l /data/exports
total 365
drwxrwxrwx+ 4 nfsnobody nfsnobody 6 May 21 2018 clustermetrics
drwxr-xr-x+ 2 nfsnobody nfsnobody 2 Jun 16 2017 mysql-nolsatu
drwxrwxr-x+ 6 nfsnobody nfsnobody 22 Apr 1 22:40 pv0000001
drwxrwxr-x+ 6 nfsnobody nfsnobody 24 Jun 29 12:32 pv0000002
drwxrwxr-x+ 6 nfsnobody nfsnobody 24 Jan 29 2020 pv0000003
drwxrwxr-x+ 6 nfsnobody nfsnobody 22 Apr 1 22:42 pv0000004
drwxrwxr-x+ 6 nfsnobody nfsnobody 22 Jun 29 12:35 pv0000005
drwxrwxr-x+ 5 nfsnobody nfsnobody 5 Nov 6 2019 pv0000006
drwxrwxr-x+ 9 nfsnobody nfsnobody 25 Apr 5 13:03 pv0000007
...<redacted>..
Then we check the usage summary at this level:
[root@pv1 exports]# du -hs /data/exports/*
21G /data/exports/clustermetrics
1.5K /data/exports/mysql-nolsatu
190M /data/exports/pv0000001
652M /data/exports/pv0000002
190M /data/exports/pv0000003
408M /data/exports/pv0000004
190M /data/exports/pv0000005
6.0K /data/exports/pv0000006
415M /data/exports/pv0000007
264M /data/exports/pv0000008
263M /data/exports/pv0000009
29M /data/exports/pv0000010
332M /data/exports/pv0000011
178M /data/exports/pv0000012
190M /data/exports/pv0000013
1.5K /data/exports/pv0000014
190M /data/exports/pv0000015
47M /data/exports/pv0000016
340M /data/exports/pv0000017
332M /data/exports/pv0000018
2.8M /data/exports/pv0000019
55M /data/exports/pv0000020
190M /data/exports/pv0000021
This might take a long time; you may want to run it inside a screen or VNC session.
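Rather than eyeballing the du listing, you can sort it numerically so the biggest offenders come first. A self-contained sketch, demonstrated on a throwaway temp dir; in real use, point it at the suspect path such as /data/exports:

```shell
# Sketch: rank subdirectories by size, largest first.
# Uses a temp dir here so the example is self-contained.
target=$(mktemp -d)
mkdir -p "$target/big" "$target/small"
head -c 1048576 /dev/zero > "$target/big/blob"    # 1 MiB
head -c 1024    /dev/zero > "$target/small/blob"  # 1 KiB
du -sk "$target"/* | sort -rn | head -n 10
```

With `du -h` the sizes are friendlier to read but `sort` needs its `-h` flag too (`du -sh "$target"/* | sort -rh`).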
Deleting ZFS Snapshots (If There Are Any)
First trick: when we are using ZFS snapshots, we can delete old snapshots to reclaim space.
Check the snapshots using zfs list -t snapshot:
[root@pv1 exports]# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659 24.1G - 728G -
pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659 12.2G - 729G -
pool1/data1@zfs-auto-snap_weekly-2020-07-19-1659 0 - 728G -
pool1/data1@zfs-auto-snap_daily-2020-07-19-1659 0 - 728G -
...<redacted>...
Then we can delete the old snapshots:
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659
zfs destroy pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659
I sometimes use this shortcut (zfsdestlast), which destroys the first snapshot in the listing:
alias zfsdestlast='zfs destroy `zfs list -t snapshot | head -n 2 | tail -n 1 | cut -d " " -f 1`'
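The alias works because the snapshot name is the first whitespace-delimited field on the second line of the listing (the first line is the header). Here is that extraction step alone, checked against a captured listing so it can be run safely; the `zfs destroy` itself is left as a comment:

```shell
# Sketch: the name-extraction step of the alias, run against a captured
# listing. In real use, pipe the live `zfs list -t snapshot` output instead.
sample='NAME USED AVAIL REFER MOUNTPOINT
pool1/data1@zfs-auto-snap_weekly-2020-07-05-1659 24.1G - 728G -
pool1/data1@zfs-auto-snap_weekly-2020-07-12-1659 12.2G - 729G -'
oldest=$(printf '%s\n' "$sample" | head -n 2 | tail -n 1 | cut -d ' ' -f 1)
echo "$oldest"
# the alias would then run: zfs destroy "$oldest"
```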
Be aware that we will be unable to recover data to that point in time after deleting the snapshots.
Deleting Unneeded Files
This step is only possible if we are certain that some large files are not needed at all. Just remove the offending files, but make sure that the other people on your team (and the team using the VM) confirm before you take any action.
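To find deletion candidates in the first place, it helps to list files above a size threshold, staying on one filesystem. A self-contained sketch, demonstrated on a temp dir; in real use, start from the full mount point:

```shell
# Sketch: list files over a size threshold, largest first.
# -xdev keeps find on one filesystem, so it won't wander into other mounts.
root=$(mktemp -d)
head -c 200000 /dev/zero > "$root/large.log"
head -c 100    /dev/zero > "$root/small.txt"
find "$root" -xdev -type f -size +100k -exec du -k {} + | sort -rn
```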
If the directory nearing full is used as an OpenShift container registry, you might want to run the OpenShift prune command (oadm prune images) from the master node. In my case, running it from outside the cluster (such as from my laptop) failed.
If the directory (or logical volume) nearing full is used as Docker storage, you might want to clean up dangling images ( docker rmi $(docker images --filter "dangling=true" -q --no-trunc) ); refer to this Stack Overflow question. Newer versions of Docker have the docker image prune command. Make sure you understand the implications of running each prune command (container, volume, and image), with or without the -a option, before doing anything.
Extending an LVM Partition
If the pvs output shows significant free space, meaning the physical LVM disks are not fully allocated, we can easily extend a logical volume.
Commands:
pvs
lvs
lvextend -L +10G /dev/mapper/vgname-lvname
(replace vgname and lvname with the names shown by lvs and df -h)
resize2fs /dev/mapper/vgname-lvname
(resize2fs works on ext2/3/4 filesystems; for XFS, run xfs_growfs on the mount point instead)
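Since lvextend and resize2fs are destructive if pointed at the wrong volume, it can help to generate the command sequence first and review it before running anything. A sketch with a hypothetical helper; the vg/lv names and size are placeholders taken from the df -h example above:

```shell
# Sketch: print the LVM extend sequence for review before running it.
# vg/lv names and size are placeholders; substitute the values from `lvs`.
lvm_extend_plan() {
  vg=$1; lv=$2; add=$3
  printf 'lvextend -L +%s /dev/mapper/%s-%s\n' "$add" "$vg" "$lv"
  printf 'resize2fs /dev/mapper/%s-%s\n' "$vg" "$lv"
}
lvm_extend_plan cl var 10G
```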
Another possibility is to add storage in VMware (if the Linux server is virtual), create a partition on the new disk, and add it to the volume group so there is free space to extend into.
Borrowing Space from Another Partition
If one partition is low on space (for example, 99% used) and another partition is quite free (say, with 50 GB available), we might want to move some disk space from one partition to the other. One approach is to shrink an LVM partition, but it has shortcomings: if the partition isn't using LVM this approach cannot be used at all, and shrinking a partition that is still in use by a running process, while minimizing server downtime, is difficult. So I propose a different approach.
Step A. Find the mount point that has the largest free space (for example, /var with 50 GB free); then find a directory on the low-space partition that uses a significant amount of storage, but less than that free space. For example, I found the /es1/ops directory, which used 14 GB of storage, while /es1 usage was at 99%.
Step B. Make sure the server software that uses the directory is stopped.
Step C. Create a sparse file on the donor partition (in this example, /var): truncate -s 20G /var/storage/es-ops-1
Step D. Create a temporary mount point: mkdir /mnt/es-ops-1
Step E. Create a filesystem inside the sparse file: mkfs.xfs /var/storage/es-ops-1
Step F. Mount the filesystem on the temporary mount point: mount -o loop /var/storage/es-ops-1 /mnt/es-ops-1/
Step G. Move the directory contents from step A to the mount point from step F:
cd /es1/ops
mv * /mnt/es-ops-1/
Step H. Remount the sparse file at the original directory:
umount /mnt/es-ops-1
mount /var/storage/es-ops-1 /es1/ops
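Step C relies on the file being sparse: it reports its full apparent size but allocates almost no blocks until data is written, which is why it can be created instantly even on a donor partition with less free space than the stated size. A self-contained sketch using a temp file:

```shell
# Sketch: a sparse file has a large apparent size but occupies ~0 blocks.
f=$(mktemp)
truncate -s 1G "$f"
apparent=$(stat -c %s "$f")        # apparent size in bytes (GNU stat)
actual_kb=$(du -k "$f" | cut -f1)  # blocks actually allocated, in KiB
echo "apparent=$apparent actual_kb=$actual_kb"
rm -f "$f"
```

Note that the donor partition must still have enough real free space for the data you move in step G, since blocks get allocated as files are written.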