draft an article about rescuing my disk after a bad fsck

2022-04-14 23:42:01 +00:00 · 2022-04-14 23:42:01 +00:00 · c5a9bcf04b
parent 2bdf48ddb6
commit c5a9bcf04b
1 changed files with 422 additions and 0 deletions
--- a/content/blog/2022-04-14-DRAFT-fsck-dd-rescue.md
+++ b/content/blog/2022-04-14-DRAFT-fsck-dd-rescue.md
@@ -0,0 +1,422 @@
+++
+title = "Rescuing a Broken EXT4 System With Ext4Magic and dd"
+date = 2022-04-14
+description = "for those poor fools who run `fsck --auto-fix`"
+extra.hidden = true
+++
+
+while i was setting up this machine, i made some mistakes along the way and observed the dreaded fsck messages on boot:
+```
+[kernel boot output]
+...
+/dev/sda1 contains a file system with errors, fsck required.
+dropping into emergency shell
+> 
+```
+
+so i ran `fsck` on the disk like the computer told me to:
+```
+# fsck /dev/sda1
+[...]
+Inode 5202 has an invalid extent
+        (logical block 0, invalid physical block 2250752, len 2048)
+Clear ('a' enables 'yes' to all) <y>? yes to all
+```
+
+and then i typed `a` ("yes to all") like a noob, expecting that any cornerstone tool of Linux in 2022 would act sanely.
+i mean, it didn't give me any other apparent option: i had to complete `fsck` before it would let me boot,
+and the only answers `fsck` offers on each error are `yes` or `yes to all`. so, obviously i'm supposed to
+let it fix everything itself.
+
+
+## Fsck fsck
+
+so i let it whir. and it trashed everything. if i had properly read the output, i maybe would have
+understood that it was simply deleting ("clearing") everything on the disk that it didn't understand.
+but i'm an imperfect user, and i was in a hurry. so bye-bye `/opt/pleroma`. bye-bye `/home`. bye-bye
+arbitrary chunks of `/var/lib/postgres/data/base/50616`. i didn't have backups in place:
+i had naively decided to tackle that _after_ i finished installation and had a clear sense
+of how this system would be structured long-term.
+
+so do i just live with this, and redo two days of work?
+
+hell no. i'd rather spend three days diving into EXT4 internals than redo anything. so i saved the
+fsck output, attached (but didn't mount) the drive to a clean system, and got busy.
+
+
+## What Did fsck Actually&nbsp;_Do_?
+
+at the start of the fsck run was this:
+```
+Resize inode not valid.  Recreate<y>? yes
+```
+
+the remainder of the messages were mostly of two forms:
+
+```
+Entry 'admin' in /home (8196) has invalid inode #: 2234378.
+Clear? yes
+```
+and
+
+```
+Inode 4574 has an invalid extent
+        (logical block 0, invalid physical block 4031998, len 1)
+Clear<y>? yes
+Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes
+```
+
+the first message is the giveaway: before all this, i had run `resize2fs` to expand the 8 GiB FS to fit its 2 TiB partition.
+the docs say you can do this on a live filesystem, but... caveat emptor?
+
+EXT4 defaults to a block size of 4096B (i.e., a traditional page of RAM). physical blocks
+are a direct reference to some offset into the underlying device. so "physical block 4031998"
+corresponds to the byte range on the device of 16515063808 - 16515067904.
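+inode numbers work differently: they aren't block addresses. ext4 finds inode N through block group
+`(N - 1) / inodes_per_group` and that group's inode table. but an 8 GiB fs made with stock `mke2fs`
+settings (an assumption: one inode per 16 KiB) holds only about 524288 inodes, so `inode # 2234378`
+has to live in a block group that the resize added.
+notably, then, both of these sit beyond the original 8 GiB fs size.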
+this holds true of _every_ inode and data block fsck complained about.
+
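the block arithmetic above can be checked with a few lines of python. this is a minimal sketch, assuming the default 4096B block size and the 8 GiB pre-resize fs size mentioned earlier; `block_to_byte_range` and `beyond_original_fs` are hypothetical helper names, not part of any ext4 tooling.

```python
from typing import Tuple

BLOCK_SIZE = 4096                 # assumed ext4 default; `dumpe2fs -h` reports the real value
ORIGINAL_FS_BYTES = 8 * 1024**3   # the fs was 8 GiB before resize2fs


def block_to_byte_range(block: int, block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Return the (start, end) byte offsets of a physical block; end is exclusive."""
    start = block * block_size
    return start, start + block_size


def beyond_original_fs(block: int) -> bool:
    """True if the block starts past the end of the original 8 GiB file system."""
    start, _ = block_to_byte_range(block)
    return start >= ORIGINAL_FS_BYTES


if __name__ == "__main__":
    # physical blocks taken from the fsck messages quoted in this post
    for block in (4031998, 2250752):
        start, end = block_to_byte_range(block)
        print(f"block {block}: bytes {start}-{end}, beyond original fs: {beyond_original_fs(block)}")
```

feeding it every "invalid physical block" from the saved fsck log is a quick way to confirm that the complaints all point past the old 8 GiB boundary.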
+there's a good chance that all/most of the actual data/inode blocks still hold valid data
+on-disk, and the EXT4 drivers simply didn't understand their address. 
+
+we have two tasks here:
+1. recover the unlinked inodes (i.e. directory entries).
+2. recover the cleared extents (i.e. data blocks within a file).
+
+there's a purpose-built tool for #1, and we can script our own thing for #2.
+
+
+## Recovering Inodes with Ext4Magic
+
+[ext4magic](https://github.com/gktrk/ext4magic) is a tool to manage data loss like this.
+one of its modes is `ext4magic -I <inode> -R <device>`, wherein you pass
+it an inode #, it parses the inode data structure off the disk, and then makes a best-effort
+attempt to recover everything in the fs tree at, or referenced by, that inode.
+
+so for the missing `/home/admin` directory, i need only run:
+```
+# ext4magic -I 2234378 -R -d recovered-inodes /dev/sda1
+```
+and out pops the _entire_ directory tree for the admin directory. e.g.
+```
+$ tree recovered-inodes
+└── <2234378>
+    └── gitea
+        ├── assets
+        │   ├── emoji.json
+        │   └── logo.svg
+        ├── BSDmakefile
+        ├── build
+        │   ├── code-batch-process.go
+        │   ├── codeformat
+        │   │   ├── formatimports.go
+        │   │   └── formatimports_test.go
+        │   ├── generate-bindata.go
+        │   ├── generate-emoji.go
+...
+```
+
+everything under that `<2234378>` even has the correct group/owner/permission bits.
+you just need to rename `<2234378>` to `admin`, `chown` it to what it was before,
+and then link it back into `/home` in your fs (but don't do that yet: put this in some
+staging directory and link everything back in only after all the data has been recovered).
+
+repeat this for all the "invalid inodes" referenced in the fsck output. then we'll recover
+the data blocks.
+
+
+## Recovering Data Blocks: EXT4 Data Structures
+
+the 2nd class of message was:
+
+```
+Inode 4574 has an invalid extent
+        (logical block 0, invalid physical block 4031998, len 1)
+Clear<y>? yes
+Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes
+```
+
+to understand that this even _is_ recoverable, it helps to understand the ext4 inode structure.
+inodes are on-disk data structures, one per file system object (hard links let several directory entries share one inode).
+an inode might represent a file or a directory. they look similar in both cases, but we only care about
+inodes which represent files here.
+
+each inode is a fixed-size structure holding metadata, like file type/mtime and -- notably --
+file _size_, followed by a link to a dynamically-sized sequence of "extents": roughly, pointers to where the file data
+lives on disk. the English translation is like "data bytes 0-32768 occupy the physical blocks starting
+at block 4156555; bytes 32768-36864 occupy the physical blocks starting at block 6285112". this is
+all represented in terms of blocks, so based on the file length the last block may only be partially
+filled with data.
+
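to make the extent idea concrete, here's a simplified in-memory model (not the on-disk ext4 extent layout) built from the two example extents in the paragraph above. `Extent` and `device_offset` are names invented for illustration, and the 4096B block size is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE = 4096  # assumed ext4 default


@dataclass
class Extent:
    logical: int   # first logical block within the file
    length: int    # number of blocks covered by this extent
    physical: int  # first physical block on the device


def device_offset(extents: List[Extent], file_offset: int) -> Optional[int]:
    """Map a byte offset in the file to a byte offset on the device.

    Returns None when no extent covers the offset, which is the situation
    fsck leaves behind after clearing an extent.
    """
    block, within = divmod(file_offset, BLOCK_SIZE)
    for ext in extents:
        if ext.logical <= block < ext.logical + ext.length:
            return (ext.physical + (block - ext.logical)) * BLOCK_SIZE + within
    return None


# "bytes 0-32768 start at block 4156555; bytes 32768-36864 start at block 6285112"
extents = [Extent(logical=0, length=8, physical=4156555),
           Extent(logical=8, length=1, physical=6285112)]

if __name__ == "__main__":
    for offset in (0, 32768, 36864):
        print(offset, device_offset(extents, offset))
```

a `None` result stands in for a logical block with no extent behind it.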
+like many file systems, EXT4 keeps the file data almost entirely outside of the inode structure. fsck tells
+us that it cleared the extent _entries_, but not the actual data blocks. `i_blocks` here is the inode's
+count of blocks allocated to the file (for what seems to be legacy reasons, it's counted in 512B disk
+sectors instead of FS blocks, so `8` means one 4096B block). with the only extent gone, fsck resets that count to 0.
+
+so, all the inode metadata is still here; the data blocks exist but are unlinked, and only the extents
+were lost. if you try reading the file, it'll still present its original length of data, but will show
+a block's worth of zeros for every logical block whose extent was cleared.
+
+
+## Recovering Data Blocks
+
+so we just need to link the data blocks back into the extents structure.
+we could dive deeper into EXT4 data structures and twiddle those bits, but that would lead us into
+having to understand the inode and block allocators. instead, we can just dump the block-level data, and use
+fs-level APIs to put it back.
+
+`ext4magic -B <block>` will dump the full 4096 bytes of some physical block. but because the physical
+block is a direct index into the device, we can also just use `dd`. for example, let's recover
+this cleared extent:
+
+```
+Inode 4574 has an invalid extent
+        (logical block 0, invalid physical block 4031998, len 1)
+Clear<y>? yes
+Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes
+```
+
+first, we'll want to know which file this comes from:
+```sh
+$ mkdir preserved
+# mount the drive READ-ONLY:
+$ sudo mount -o ro /dev/sda1 preserved
+$ find preserved/ -inum 4574
+preserved/etc/passwd
+```
+
+that's, uh, an important file. does the data block still hold proper data?
+
+```sh
+$ dd if=/dev/sda1 of=/dev/stdout bs=4096 skip=4031998 count=1
+root:x:0:0::/root:/bin/bash
+bin:x:1:1::/:/usr/bin/nologin
+daemon:x:2:2::/:/usr/bin/nologin
+mail:x:8:12::/var/spool/mail:/usr/bin/nologin
+ftp:x:14:11::/srv/ftp:/usr/bin/nologin
+http:x:33:33::/srv/http:/usr/bin/nologin
+nobody:x:65534:65534:Nobody:/:/usr/bin/nologin
+[...]
+<NUL><NUL><NUL>[...]
+```
+
+yes!
+
+let's set up a scratch space. we can construct an overlay of our rootfs where we
+place all the recovered and patched entries, and then apply that to the original device
+once we're done recovering.
+
+```sh
+$ mkdir recovered
+```
+
+go ahead and manually link all the entries we recovered with `ext4magic -I` earlier into this `recovered` directory and fix up their group/owner/permissions.
+
+now we can patch individual files by copying them from `preserved/<path>` to `recovered/<path>` and then `dd`ing specific
+byte ranges from `/dev/sda1` into `recovered/<path>`. for example:
+```sh
+$ mkdir -p recovered/etc
+$ sudo cp preserved/etc/passwd recovered/etc/passwd
+# conv=notrunc stops dd from truncating whatever follows the patched block
+$ sudo dd if=/dev/sda1 of=recovered/etc/passwd bs=4096 skip=4031998 count=1 conv=notrunc
+$ ls -l preserved/etc/passwd
+-rw-r--r-- 1 root root 3528 preserved/etc/passwd
+$ sudo truncate --size=3528 recovered/etc/passwd
+```
+
+because `dd` copies the whole block, we have that additional step of truncating the file to its original size.
+
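the same patch can be done without `dd`. a rough python sketch, assuming a 4096B block size; `patch_blocks` is a hypothetical helper that writes the blocks in place and then restores the file's original length:

```python
import os

BLOCK_SIZE = 4096  # assumed ext4 default


def patch_blocks(device: str, target: str, logical_block: int,
                 physical_block: int, n_blocks: int = 1) -> None:
    """Copy n_blocks from `device` into `target` at the given logical block."""
    original_size = os.path.getsize(target)
    with open(device, "rb") as dev, open(target, "r+b") as out:
        # the physical block number is a direct index into the device
        dev.seek(physical_block * BLOCK_SIZE)
        data = dev.read(n_blocks * BLOCK_SIZE)
        # "r+b" opens for update without truncating the existing contents
        out.seek(logical_block * BLOCK_SIZE)
        out.write(data)
    # the final block of a file is usually only partially filled, so the write
    # may have grown the file; cut it back to the length the inode recorded
    os.truncate(target, original_size)
```

for the `/etc/passwd` case that would be `patch_blocks('/dev/sda1', 'recovered/etc/passwd', 0, 4031998)`, run as root after copying the file out of `preserved`.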
+
+## Bringing it Together
+
+we've successfully recovered (into the `recovered` directory):
+1. all unlinked directory entries.
+2. the cleared extent in `/etc/passwd`.
+
+we still need to:
+1. recover all _other_ cleared extents.
+2. link the recovered data back into the real file system.
+
+step 2 is a simple `rsync`. step 1 is some nasty `dd` work. i demoed it for a file with only one cleared extent, but some files have _many_ cleared extents, often non-contiguous.
+
+assume the presence of a script `patch_file.py` (see [Appendix](#appendix)) which takes:
+- an inode number (`4574`)
+- a file path (`etc/passwd`)
+- the first logical block of the extent which was cleared (`0`)
+- the length in blocks of the extent which was cleared (`1`)
+- the first physical block containing the extent's data (`4031998`)
+
+then we can parse the fsck output and script the rest of step 1.
+
+```
+# fsck /dev/sda1 
+[...]
+
+Inode 34215 has an invalid extent
+        (logical block 14, invalid physical block 2207227, len 1)
+Clear? yes
+
+Inode 58213 has an invalid extent
+        (logical block 0, invalid physical block 3456000, len 1024)
+Clear? yes
+
+Inode 58213 has an invalid extent
+        (logical block 1024, invalid physical block 3463168, len 623)
+Clear? yes
+
+Inode 58213, i_blocks is 13176, should be 0.  Fix? yes
+
+Inode 58222 has an invalid extent
+        (logical block 0, invalid physical block 2207151, len 1)
+Clear? yes
+
+Inode 58222, i_blocks is 8, should be 0.  Fix? yes
+
+[...]
+```
+
+run `find preserved/ -inum <inode>` on each of these inodes to find the file they correspond to, and then you can create this script from that snippet of fsck output:
+
+```sh
+./patch_file.py -i 34215 -f var/log/pacman.log 14,1,2207227
+./patch_file.py -i 58213 -f usr/bin/yay 0,1024,3456000 1024,623,3463168
+./patch_file.py -i 58222 -f etc/fstab 0,1,2207151
+```
+
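turning the fsck output into those argument triples is mechanical, so it can be scripted too. a sketch, assuming the messages look exactly like the snippet above; `parse_cleared_extents` and `format_args` are hypothetical helpers, and mapping each inode to a path is still a manual lookup.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# matches the two-line "invalid extent" message; \s+ spans the line break
EXTENT_RE = re.compile(
    r"Inode (\d+) has an invalid extent\s+"
    r"\(logical block (\d+), invalid physical block (\d+), len (\d+)\)"
)


def parse_cleared_extents(fsck_output: str) -> Dict[int, List[Tuple[int, int, int]]]:
    """Group cleared extents by inode as (logical, n_blocks, physical) tuples."""
    extents = defaultdict(list)
    for inode, logical, physical, length in EXTENT_RE.findall(fsck_output):
        extents[int(inode)].append((int(logical), int(length), int(physical)))
    return dict(extents)


def format_args(ranges: List[Tuple[int, int, int]]) -> str:
    """Render the ranges in the logical,len,physical order patch_file.py takes."""
    return " ".join(f"{logical},{n},{physical}" for logical, n, physical in ranges)


SAMPLE = """\
Inode 34215 has an invalid extent
        (logical block 14, invalid physical block 2207227, len 1)
Clear? yes

Inode 58213 has an invalid extent
        (logical block 0, invalid physical block 3456000, len 1024)
Clear? yes

Inode 58213 has an invalid extent
        (logical block 1024, invalid physical block 3463168, len 623)
Clear? yes
"""

if __name__ == "__main__":
    for inode, ranges in parse_cleared_extents(SAMPLE).items():
        print(f"./patch_file.py -i {inode} -f <path> {format_args(ranges)}")
```

the `<path>` placeholder is deliberate: the script only knows inode numbers, so each path still has to be filled in by hand.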
+sometimes `find` won't find the inode that fsck updated. for example, if you booted the system after running `fsck`, Linux will notice that certain files have been corrupted
+and will update them with placeholders, destroying the original inode. these are usually the more important files, so you can dump the data block with that `dd` command
+and compare it to notable entries on a good file system to "guess" what it was originally.
+since we don't have the original inode, we lost the metadata like its length, so use the `--auto-len` flag to guess the length by trimming trailing zeros off the recovered
+data block.
+
+take this snippet of fsck output:
+```
+Inode 4997 has an invalid extent
+        (logical block 0, invalid physical block 3831801, len 1)
+Clear<y>? yes
+Inode 4997, i_blocks is 8, should be 0.  Fix<y>? yes
+```
+
+try to find the file:
+```sh
+$ find preserved/ -inum 4997
+# (no output)
+```
+
+but we dump physical block 3831801 and notice that it looks a lot like `/etc/shadow`. so:
+
+```sh
+./patch_file.py -i 4997 --auto-len -f etc/shadow 0,1,3831801
+```
+
+once you've patched all the files, then bring the file system back online, writeable, and copy over your changes.
+
+```sh
+$ sudo umount preserved
+$ mkdir sda1
+$ sudo mount /dev/sda1 sda1
+$ sudo rsync -av --checksum recovered/ sda1/
+$ sync && sudo umount sda1
+```
+
+if all went well, you can boot the disk now. cheers 🍻
+
+
+## Appendix
+
+the `patch_file.py` script:
+
+```py
+#!/usr/bin/env python3
+
+'''
+replaces zero-pages, or partial zero-pages within a single file
+'''
+
+import os
+import subprocess
+import sys
+
+PAGE_LEN = 4096
+IN_DIR = 'preserved'
+OUT_DIR = 'recovered'
+
+def patch_range(file_: str, logical_block: int, n_blocks: int, physical_block: int):
+    '''
+    patch a whole range of blocks within the file
+    '''
+    subprocess.check_output([
+        'dd',
+        'if=/dev/sda1',
+        f'of={file_}',
+        f'bs={PAGE_LEN}',
+        f'seek={logical_block}',
+        f'skip={physical_block}',
+        f'count={n_blocks}',
+        # without notrunc, dd truncates the file at the seek offset,
+        # discarding any data that follows the patched range
+        'conv=notrunc',
+    ])
+
+def copy_for_patch(path: str) -> str:
+    in_path = os.path.join(IN_DIR, '.', path)
+    out_path = os.path.join(OUT_DIR, path)
+    subprocess.check_output(['rsync', '-a', '--relative', in_path, OUT_DIR + '/'])
+    return out_path
+
+def estimate_length(path: str) -> int:
+    '''
+    return the length of the file were there to be no trailing zero bytes
+    '''
+    # read the whole file; fine for the small config files this is used on
+    with open(path, 'rb') as f:
+        contents = f.read()
+    l = len(contents)
+    while l and contents[l-1] == 0:
+        l -= 1
+    return l
+
+def main(path: str, auto_len: bool, patches: list):
+    path = copy_for_patch(path)
+    old_size = os.stat(path).st_size
+    for patch in patches:
+        logical_block, n_blocks, physical_block = patch
+        patch_range(path, logical_block, n_blocks, physical_block)
+
+    if auto_len:
+        os.truncate(path, estimate_length(path))
+    else:
+        os.truncate(path, old_size)
+    
+def parse_args(args: list):
+    '''
+    return:
+      str: the relative file being operated on,
+      bool: auto-estimate len,
+      list: the ranges to patch
+    '''
+    i = 0
+    inode = None
+    file_ = None
+    auto_len = False
+    ranges = []
+    while i < len(args):
+        arg = args[i]
+        if arg == '-i':
+            inode = int(args[i+1])
+            i += 2
+        elif arg == '-f':
+            file_ = args[i+1]
+            i += 2
+        elif arg == '--auto-len':
+            auto_len = True
+            i += 1
+        else:
+            logical_block, n_blocks, physical_block = map(int, arg.split(','))
+            #vvv not actually required, but indicative of an error
+            assert logical_block < physical_block
+            ranges.append((logical_block, n_blocks, physical_block))
+            i += 1
+    # inode doesn't actually get used
+    # it's useful just to keep the script invocations organized
+    return file_, auto_len, ranges
+
+
+if __name__ == '__main__':
+    main(*parse_args(sys.argv[1:]))
+```