draft an article about rescuing my disk after a bad fsck

Colin 2022-04-14 23:42:01 +00:00
parent 2bdf48ddb6
commit c5a9bcf04b
1 changed files with 422 additions and 0 deletions

@ -0,0 +1,422 @@
+++
title = "Rescuing a Broken EXT4 System With Ext4Magic and dd"
date = 2022-04-14
description = "for those poor fools who run `fsck --auto-fix`"
extra.hidden = true
+++
while i was setting up this machine, i made some mistakes along the way and observed the dreaded fsck messages on boot:
```
[kernel boot output]
...
/dev/sda1 contains a file system with errors, fsck required.
dropping into emergency shell
>
```
so i ran `fsck` on the disk like the computer told me to:
```
# fsck /dev/sda1
[...]
Inode 5202 has an invalid extent
(logical block 0, invalid physical block 2250752, len 2048)
Clear ('a' enables 'yes' to all) <y>? yes to all
```
and then i typed `a` ("yes to all") like a noob, expecting that any cornerstone tool of Linux in 2022 would act sanely.
i mean, it didn't give me any other apparent option: i had to complete `fsck` before it would let me boot,
and the only option `fsck` gives me on each error is `yes` or `yes to all`. so, obviously i'm supposed to
let it fix everything itself.
## Fsck fsck
so i let it whir. and it trashed everything. if i had properly read the output, i maybe would have
understood that it was simply deleting ("clearing") everything on the disk that it didn't understand.
but i'm an imperfect user, and i was in a hurry. so bye-bye `/opt/pleroma`. bye-bye `/home`. bye-bye
arbitrary chunks of `/var/lib/postgres/data/base/50616`. i didn't have backups in place:
i had naively decided to tackle that _after_ i finished installation and had a clear sense
of how this system would be structured long-term.
so do i just live with this, and redo two days of work?
hell no. i'd rather spend three days diving into EXT4 internals than redo anything. so i saved the
fsck output, attached (but didn't mount) the drive to a clean system, and got busy.
## What Did fsck Actually&nbsp;_Do_?
at the start of the fsck run was this:
```
Resize inode not valid. Recreate<y>? yes
```
the remainder of the messages were mostly of two forms:
```
Entry 'admin' in /home (8196) has invalid inode #: 2234378.
Clear? yes
```
and
```
Inode 4574 has an invalid extent
(logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
```
in fact, i had done an earlier `resize2fs` to expand the 8 GiB FS to fit its 2 TiB partition.
the docs say you can do this on a live filesystem, but... caveat emptor?
EXT4 defaults to a block size of 4096B (i.e., a traditional page of RAM). physical blocks
are a direct reference to some offset into the underlying device. so "physical block 4031998"
corresponds to the byte range on the device of 16515063808 - 16515067904.
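inode numbers aren't block addresses: they index into the per-block-group inode tables. but assuming
`mke2fs` defaults (one inode per 16 KiB), an 8 GiB fs only has about 524288 inodes, so `inode # 2234378`
has to live in a block group that `resize2fs` added.
notably, the block and the inode are both beyond the original 8 GiB fs size.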
this holds true of _every_ inode and data block fsck complained about.
there's a good chance that all/most of the actual data/inode blocks still hold valid data
on-disk, and the EXT4 drivers simply didn't understand their address.
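the block arithmetic above is easy to sanity-check. here's a minimal sketch, assuming the default 4096B block size and the 8 GiB original fs size mentioned earlier (the function name is mine, not part of any tool):

```python
BLOCK_SIZE = 4096            # assumed: EXT4 default block size
ORIG_FS_BYTES = 8 * 1024**3  # the fs was 8 GiB before the resize

def block_byte_range(physical_block: int) -> tuple:
    '''return the [start, end) byte offsets of a physical block on the device'''
    start = physical_block * BLOCK_SIZE
    return start, start + BLOCK_SIZE

start, end = block_byte_range(4031998)
print(start, end)              # 16515063808 16515067904
print(start >= ORIG_FS_BYTES)  # True: beyond the original 8 GiB
```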
we have two tasks here:
1. recover the unlinked inodes (i.e. directory entries).
2. recover the cleared extents (i.e. data blocks within a file).
there's a purpose-built tool for #1, and we can script our own thing for #2.
## Recovering Inodes with Ext4Magic
[ext4magic](https://github.com/gktrk/ext4magic) is a tool to manage data loss like this.
one of its modes is `ext4magic -I <inode> -R <device>`, wherein you pass
it an inode #, it parses the inode data structure off the disk, and then makes a best-effort
attempt to recover everything in the fs tree at, or referenced by, that inode.
so for the missing `/home/admin` directory, i need only run:
```
# ext4magic -I 2234378 -R -d recovered-inodes /dev/sda1
```
and out pops the _entire_ directory tree for the admin directory. e.g.
```
$ tree recovered-inodes
└── <2234378>
    └── gitea
        ├── assets
        │   ├── emoji.json
        │   └── logo.svg
        ├── BSDmakefile
        ├── build
        │   ├── code-batch-process.go
        │   ├── codeformat
        │   │   ├── formatimports.go
        │   │   └── formatimports_test.go
        │   ├── generate-bindata.go
        │   ├── generate-emoji.go
...
```
everything under that `<2234378>` even has the correct group/owner/permission bits.
you just need to rename `<2234378>` to `admin`, `chown` it to what it was before,
and then link it back into `/home` in your fs (but don't do that yet: put this in some
staging directory and link everything back in only after all the data has been recovered).
repeat this for all the "invalid inodes" referenced in the fsck output. then we'll recover
the data blocks.
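if there are many of these, the "invalid inode" lines can be pulled out of the saved fsck output with a regex. this is only a sketch: the pattern is my guess at the message format based on the sample above (it won't cope with directory names containing spaces), and the printed command just mirrors the invocation shown earlier.

```python
import re

# matches fsck lines like:
#   Entry 'admin' in /home (8196) has invalid inode #: 2234378.
ENTRY_RE = re.compile(
    r"Entry '(?P<name>[^']+)' in (?P<parent>\S+) \(\d+\) "
    r"has invalid inode #: (?P<inode>\d+)\."
)

def invalid_entries(fsck_output: str) -> list:
    '''return (parent directory, entry name, inode number) for each cleared entry'''
    return [
        (m['parent'], m['name'], int(m['inode']))
        for m in ENTRY_RE.finditer(fsck_output)
    ]

sample = """Entry 'admin' in /home (8196) has invalid inode #: 2234378.
Clear? yes
"""
for parent, name, inode in invalid_entries(sample):
    # one recovery command per cleared entry; the comment records where it belongs
    print(f'ext4magic -I {inode} -R -d recovered-inodes /dev/sda1  # {parent}/{name}')
```

keeping the parent/name alongside the inode number saves you from cross-referencing the fsck log again when it's time to rename `<inode>` back to its real name.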
## Recovering Data Blocks: EXT4 Data Structures
the 2nd class of message was:
```
Inode 4574 has an invalid extent
(logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
```
to understand that this even _is_ recoverable, it helps to understand the ext4 inode structure.
inodes are on-disk data structures, one for every directory entry on the system.
an inode might represent a file or a directory. they look similar in both cases, but we only care about
inodes which represent files here.
each inode is a fixed-size structure holding metadata, like file type/mtime and -- notably --
file _size_, and then they link to a dynamically-sized sequence of "extents"; roughly, pointers to where the file data
lives on disk. the English translation is like "data bytes 0-32768 occupy the physical blocks starting
at block 4156555; bytes 32768-36864 occupy the physical blocks starting at block 6285112". this is
all represented in terms of blocks, so based on the file length the last block may only be partially
filled with data.
EXT4 (and many file systems) largely keeps the file data entirely outside of the inode structure. fsck tells
us that it cleared the extent _entries_, but not the actual data blocks. `i_blocks` here refers to the
blocks allocated to the file, i.e. the data blocks its extents point at (for what seems to be legacy
reasons, this is denoted in 512B disk sectors instead of FS blocks, so one 4096B block shows up as `i_blocks is 8`).
so, all the inode metadata is still here; the data blocks exist but are unlinked, and only the extents
were lost. if you try reading the file, it'll still present its original length of data, but will show
a block's worth of zeros for every logical block whose extent was cleared.
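one way to convince yourself that fsck dropped _all_ of a file's extents: the old `i_blocks` value it reports should equal the total length of the cleared extents, counted in 512B sectors. a small sketch, assuming 4096B blocks and a file small enough that its extent list lives entirely inside the inode (the helper name is hypothetical):

```python
SECTOR_SIZE = 512  # i_blocks is counted in 512B sectors
BLOCK_SIZE = 4096  # assumed: EXT4 default block size

def i_blocks_for(extent_lens: list) -> int:
    '''i_blocks value implied by a list of extent lengths (in fs blocks)'''
    return sum(extent_lens) * (BLOCK_SIZE // SECTOR_SIZE)

print(i_blocks_for([1]))          # 8
print(i_blocks_for([1024, 623]))  # 13176
```

the first case is inode 4574 above (one block, `i_blocks is 8`). the second matches an inode that shows up later in the fsck log with two cleared extents of 1024 and 623 blocks and `i_blocks is 13176`.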
## Recovering Data Blocks
so we just need to link the data blocks back into the extents structure.
we could dive deeper into EXT4 data structures and twiddle those bits, but that would lead us into
having to understand the inode and block allocators. instead, we can just dump the block-level data, and use
fs-level APIs to put it back.
`ext4magic -B <block>` will dump the full 4096 bytes of some physical block. but because the physical
block is a direct index into the device, we can also just use `dd`. for example, let's recover
this cleared extent:
```
Inode 4574 has an invalid extent
(logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
```
first, we'll want to know which file this comes from:
```sh
$ mkdir preserved
# mount the drive READ-ONLY:
$ sudo mount -o ro /dev/sda1 preserved
$ find preserved/ -inum 4574
preserved/etc/passwd
```
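if you'd rather do the lookup from a script, the same search is a few lines of Python. this is a rough equivalent of `find -inum`, not a replacement: inode numbers are only unique within one filesystem, and this sketch doesn't stop at mount points, so point it at the read-only mount and nothing else.

```python
import os

def find_by_inode(root: str, inode: int) -> list:
    '''return every path under `root` whose inode number is `inode`'''
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat: inspect the entry itself, don't follow symlinks
            if os.lstat(path).st_ino == inode:
                matches.append(path)
    return matches
```

e.g. `find_by_inode('preserved', 4574)` should return the same path that `find` printed. it returns a list because hard links give one inode several paths.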
that's, uh, an important file. does the data block still hold proper data?
```sh
$ dd if=/dev/sda1 of=/dev/stdout bs=4096 skip=4031998 count=1
root:x:0:0::/root:/bin/bash
bin:x:1:1::/:/usr/bin/nologin
daemon:x:2:2::/:/usr/bin/nologin
mail:x:8:12::/var/spool/mail:/usr/bin/nologin
ftp:x:14:11::/srv/ftp:/usr/bin/nologin
http:x:33:33::/srv/http:/usr/bin/nologin
nobody:x:65534:65534:Nobody:/:/usr/bin/nologin
[...]
<NUL><NUL><NUL>[...]
```
yes!
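eyeballing `dd` output gets old once there are hundreds of blocks to check. here's a sketch of the same read in Python, plus a crude "does this look like text?" heuristic. both function names are mine, the 4096B block size is assumed, and the heuristic will say no to binaries that are perfectly intact.

```python
BLOCK_SIZE = 4096  # assumed: EXT4 default block size

def read_block(device: str, physical_block: int, n_blocks: int = 1) -> bytes:
    '''read `n_blocks` physical blocks from `device` (a block device or disk image)'''
    with open(device, 'rb') as dev:
        dev.seek(physical_block * BLOCK_SIZE)
        return dev.read(n_blocks * BLOCK_SIZE)

def looks_like_text(block: bytes) -> bool:
    '''heuristic: non-empty once trailing NULs are stripped, and decodes as UTF-8'''
    data = block.rstrip(b'\x00')
    if not data:
        return False
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True
```

for example, `looks_like_text(read_block('/dev/sda1', 4031998))` (run as root) is the scripted version of the check above.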
let's set up a scratch space. we can construct an overlay of our rootfs where we
place all the recovered and patched entries, and then apply that to the original device
once we're done recovering.
```sh
$ mkdir recovered
```
go ahead and manually link all the entries we recovered with `ext4magic -I` earlier into this `recovered` directory and fix up their group/owner/permissions.
now we can patch individual files by copying them from `preserved/<path>` to `recovered/<path>` and then `dd`ing specific
byte ranges from `/dev/sda1` into `recovered/<path>`. for example:
```sh
$ mkdir -p recovered/etc
$ sudo cp preserved/etc/passwd recovered/etc/passwd
$ sudo dd if=/dev/sda1 of=recovered/etc/passwd bs=4096 skip=4031998 count=1
$ ls -l preserved/etc/passwd
-rw-r--r-- 1 root root 3528 /etc/passwd
$ sudo truncate --size=3528 recovered/ext/passwd
```
because `dd` copies the whole block, we have that additional step of truncating the file to its original size.
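for reference, here's what that `dd` + `truncate` pair amounts to, written out in Python. it's a sketch for a single block under the same 4096B assumption, not the script used later in this article. opening the file in `r+b` mode matters: it updates the file in place, so data on either side of the patched block is left alone.

```python
import os

BLOCK_SIZE = 4096  # assumed: EXT4 default block size

def patch_block(device: str, physical_block: int, path: str, logical_block: int, size: int):
    '''copy one block from `device` into `path`, then restore the file's length'''
    with open(device, 'rb') as dev:
        dev.seek(physical_block * BLOCK_SIZE)
        data = dev.read(BLOCK_SIZE)
    # 'r+b' opens for in-place update without truncating the existing file
    with open(path, 'r+b') as f:
        f.seek(logical_block * BLOCK_SIZE)
        f.write(data)
    # the block is copied whole, so cut the file back to its recorded size
    os.truncate(path, size)
```

the `/etc/passwd` example above would be `patch_block('/dev/sda1', 4031998, 'recovered/etc/passwd', 0, 3528)`.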
## Bringing it Together
we've successfully recovered (into the `recovered` directory):
1. all unlinked directory entries.
2. the cleared extent in `/etc/passwd`.
we still need to:
1. recover all _other_ cleared extents.
2. link the recovered data back into the real file system.
step 2 is a simple `rsync`. step 1 is some nasty `dd` work. i demoed it for a file with only one cleared extent, but some files have _many_ cleared extents, often non-contiguous.
assume the presence of a script `patch_file.py` (see [Appendix](#appendix)) which takes:
- an inode number (`4574`)
- a file path (`etc/passwd`)
- the first logical block of the extent which was cleared (`0`)
- the length in blocks of the extent which was cleared (`1`)
- the first physical block containing the extent's data (`4031998`)
then we can parse the fsck output and script the rest of step 1.
```
# fsck /dev/sda1
[...]
Inode 34215 has an invalid extent
(logical block 14, invalid physical block 2207227, len 1)
Clear? yes
Inode 58213 has an invalid extent
(logical block 0, invalid physical block 3456000, len 1024)
Clear? yes
Inode 58213 has an invalid extent
(logical block 1024, invalid physical block 3463168, len 623)
Clear? yes
Inode 58213, i_blocks is 13176, should be 0. Fix? yes
Inode 58222 has an invalid extent
(logical block 0, invalid physical block 2207151, len 1)
Clear? yes
Inode 58222, i_blocks is 8, should be 0. Fix? yes
[...]
```
run `find preserved/ -inum <inode>` on each of these inodes to find the file they correspond to, and then you can create this script from that snippet of fsck output:
```sh
./patch_file.py -i 34215 -f var/log/pacman.log 14,1,2207227
./patch_file.py -i 58213 -f usr/bin/yay 0,1024,3456000 1024,623,3463168
./patch_file.py -i 58222 -f etc/fstab 0,1,2207151
```
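writing those lines by hand is error-prone, so here's a sketch of how the extent triples could be generated from the saved fsck log. the regex is my reading of the message format shown above, and the file path still has to come from `find`, so the output leaves a `<path>` placeholder for it.

```python
import re
from collections import defaultdict

EXTENT_RE = re.compile(
    r'Inode (\d+) has an invalid extent\s*'
    r'\(logical block (\d+), invalid physical block (\d+), len (\d+)\)'
)

def cleared_extents(fsck_output: str) -> dict:
    '''map inode -> list of (logical_block, n_blocks, physical_block)'''
    extents = defaultdict(list)
    for inode, logical, physical, length in EXTENT_RE.findall(fsck_output):
        extents[int(inode)].append((int(logical), int(length), int(physical)))
    return dict(extents)

def patch_args(extents: list) -> str:
    '''format extents the way patch_file.py expects: logical,len,physical'''
    return ' '.join(f'{l},{n},{p}' for l, n, p in extents)

sample = """Inode 58213 has an invalid extent
(logical block 0, invalid physical block 3456000, len 1024)
Clear? yes
Inode 58213 has an invalid extent
(logical block 1024, invalid physical block 3463168, len 623)
Clear? yes
Inode 58213, i_blocks is 13176, should be 0. Fix? yes
"""
for inode, extents in cleared_extents(sample).items():
    print(f'./patch_file.py -i {inode} -f <path> {patch_args(extents)}')
```

note the argument order: fsck prints logical, physical, len, while the script takes logical, len, physical.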
sometimes `find` won't find the inode that fsck updated. for example, if you booted the system after running `fsck`, Linux will notice that certain files have been corrupted
and will update them with placeholders, destroying the original inode. these are usually the more important files, so you can dump the data block with that `dd` command
and compare it to notable entries on a good file system to "guess" what it was originally.
since we don't have the original inode, we lost the metadata like its length, so use the `--auto-len` flag to guess the length by trimming trailing zeros off the
recovered data.
take this snippet of fsck output:
```
Inode 4997 has an invalid extent
(logical block 0, invalid physical block 3831801, len 1)
Clear<y>? yes
Inode 4997, i_blocks is 8, should be 0. Fix<y>? yes
```
try to find the file:
```sh
$ find preserved/ -inum 4997
# (no output)
```
but we dump physical block 3831801 and notice that it looks a lot like `/etc/shadow`. so:
```sh
./patch_file.py -i 4997 --auto-len -f etc/shadow 0,1,3831801
```
once you've patched all the files, then bring the file system back online, writeable, and copy over your changes.
```sh
$ sudo umount preserved
$ mkdir sda1
$ sudo mount /dev/sda1 sda1
$ sudo rsync -av --checksum recovered/ sda1/
$ sync && sudo umount sda1
```
if all went well, you can boot the disk now. cheers 🍻
## Appendix
the `patch_file.py` script:
```py
#!/usr/bin/env python3
'''
replaces zero-pages, or partial zero-pages within a single file
'''
import os
import subprocess
import sys

PAGE_LEN = 4096
DEVICE = '/dev/sda1'
IN_DIR = 'preserved'
OUT_DIR = 'recovered'


def patch_range(file_: str, logical_block: int, n_blocks: int, physical_block: int):
    '''
    patch a whole range of blocks within the file
    '''
    subprocess.check_output([
        'dd',
        f'if={DEVICE}',
        f'of={file_}',
        f'bs={PAGE_LEN}',
        f'seek={logical_block}',
        f'skip={physical_block}',
        f'count={n_blocks}',
        # without this, dd truncates the file at `seek` and any data
        # after the patched range is lost
        'conv=notrunc',
    ])


def copy_for_patch(path: str) -> str:
    '''
    copy the damaged file from IN_DIR to the same relative path under OUT_DIR
    '''
    # the '/./' tells `rsync --relative` where the relative path starts
    in_path = os.path.join(IN_DIR, '.', path)
    out_path = os.path.join(OUT_DIR, path)
    subprocess.check_output(['rsync', '-a', '--relative', in_path, OUT_DIR + '/'])
    return out_path


def estimate_length(path: str) -> int:
    '''
    return the length of the file were there to be no trailing zero bytes
    '''
    with open(path, 'rb') as f:
        contents = f.read()
    l = len(contents)
    while l and contents[l-1] == 0:
        l -= 1
    return l


def main(path: str, auto_len: bool, patches: list):
    path = copy_for_patch(path)
    old_size = os.stat(path).st_size
    for patch in patches:
        logical_block, n_blocks, physical_block = patch
        patch_range(path, logical_block, n_blocks, physical_block)
    # dd writes whole blocks, so restore (or guess) the real length afterwards
    if auto_len:
        os.truncate(path, estimate_length(path))
    else:
        os.truncate(path, old_size)


def parse_args(args: list):
    '''
    return:
        str: the relative file being operated on,
        bool: auto-estimate len,
        list: the ranges to patch
    '''
    i = 0
    inode = None
    file_ = None
    auto_len = False
    ranges = []
    while i < len(args):
        arg = args[i]
        if arg == '-i':
            inode = int(args[i+1])
            i += 2
        elif arg == '-f':
            file_ = args[i+1]
            i += 2
        elif arg == '--auto-len':
            auto_len = True
            i += 1
        else:
            logical_block, n_blocks, physical_block = map(int, arg.split(','))
            #vvv not actually required, but indicative of an error
            assert logical_block < physical_block
            ranges.append((logical_block, n_blocks, physical_block))
            i += 1
    # inode doesn't actually get used
    # it's useful just to keep the script invocations organized
    return file_, auto_len, ranges


if __name__ == '__main__':
    main(*parse_args(sys.argv[1:]))
```