draft an article about rescuing my disk after a bad fsck
+++
|
||||
title = "Rescuing a Broken EXT4 System With Ext4Magic and dd"
|
||||
date = 2022-04-14
|
||||
description = "for those poor fools who run `fsck --auto-fix`"
|
||||
extra.hidden = true
|
||||
+++
|
||||
|
||||
while i was setting up this machine, i made some mistakes along the way and observed the dreaded fsck messages on boot:
|
||||
```
|
||||
[kernel boot output]
|
||||
...
|
||||
/dev/sda1 contains a file system with errors, fsck required.
|
||||
dropping into emergency shell
|
||||
>
|
||||
```
|
||||
|
||||
so i ran `fsck` on the disk like the computer told me to:
|
||||
```
|
||||
# fsck /dev/sda1
|
||||
[...]
|
||||
Inode 5202 has an invalid extent
|
||||
(logical block 0, invalid physical block 2250752, len 2048)
|
||||
Clear ('a' enables 'yes' to all) <y>? yes to all
|
||||
```
|
||||
|
||||
and then i typed `a` ("yes to all") like a noob, expecting that any cornerstone tool of Linux in 2022 would act sanely.
|
||||
i mean, it didn't give me any other apparent option: i had to complete `fsck` before it would let me boot,
|
||||
and the only option `fsck` gives me on each error is `yes` or `yes to all`. so, obviously i'm supposed to
|
||||
let it fix everything itself.
|
||||
|
||||
|
||||
## Fsck fsck
|
||||
|
||||
so i let it whir. and it trashed everything. if i had properly read the output, i maybe would have
|
||||
understood that it was simply deleting ("clearing") everything on the disk that it didn't understand.
|
||||
but i'm an imperfect user, and i was in a hurry. so bye-bye `/opt/pleroma`. bye-bye `/home`. bye-bye
|
||||
arbitrary chunks of `/var/lib/postgres/data/base/50616`. i didn't have backups in place:
|
||||
i had naively decided to tackle that _after_ i finished installation and had a clear sense
|
||||
of how this system would be structured long-term.
|
||||
|
||||
so do i just live with this, and redo two days of work?
|
||||
|
||||
hell no. i'd rather spend three days diving into EXT4 internals than redo anything. so i saved the
|
||||
fsck output, attached (but didn't mount) the drive to a clean system, and got busy.
|
||||
|
||||
|
||||
## What Did fsck Actually _Do_?
|
||||
|
||||
at the start of the fsck run was this:
|
||||
```
|
||||
Resize inode not valid. Recreate<y>? yes
|
||||
```
|
||||
|
||||
the remainder of the messages were mostly of two forms:
|
||||
|
||||
```
|
||||
Entry 'admin' in /home (8196) has invalid inode #: 2234378.
|
||||
Clear? yes
|
||||
```
|
||||
and
|
||||
|
||||
```
|
||||
Inode 4574 has an invalid extent
|
||||
(logical block 0, invalid physical block 4031998, len 1)
|
||||
Clear<y>? yes
|
||||
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
|
||||
```
|
||||
|
||||
in fact, i had done an earlier `resize2fs` to expand the 8 GiB FS to fit its 2 TiB partition.
|
||||
the docs say you can do this on a live filesystem, but... caveat emptor?
|
||||
|
||||
EXT4 defaults to a block size of 4096B (i.e., a traditional page of RAM). physical blocks
are a direct index into the underlying device, so "physical block 4031998"
corresponds to the byte range 16515063808 - 16515067904 on the device.
inode numbers aren't block addresses; they index into the per-block-group inode tables.
but an 8 GiB fs with 4096B blocks has 64 block groups of at most 32768 inodes each
(2097152 inodes total), so `inode # 2234378` has to live in a block group that `resize2fs` added.
notably, these references both point beyond the original 8 GiB fs.
this holds true of _every_ inode and data block fsck complained about.
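
as a sanity check on the block arithmetic, here's a minimal sketch of the block-to-byte conversion. the function names are mine (not from any tool), and it assumes the default 4096B block size and the 8 GiB original fs size mentioned above.

```python
BLOCK_SIZE = 4096               # assumed: the ext4 default block size
ORIGINAL_FS_SIZE = 8 * 2**30    # assumed: the fs was 8 GiB before resize2fs


def block_byte_range(physical_block: int, block_size: int = BLOCK_SIZE) -> tuple:
    '''return the (start, end) byte offsets of a physical block; end is exclusive'''
    start = physical_block * block_size
    return start, start + block_size


def beyond_original_fs(physical_block: int) -> bool:
    '''true if the block starts past the end of the pre-resize file system'''
    start, _ = block_byte_range(physical_block)
    return start >= ORIGINAL_FS_SIZE


start, end = block_byte_range(4031998)
print(start, end)                   # 16515063808 16515067904
print(beyond_original_fs(4031998))  # True
```

running the same check over every "invalid physical block" in the fsck log is a quick way to confirm the resize theory before investing in recovery.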
|
||||
|
||||
there's a good chance that all/most of the actual data/inode blocks still hold valid data
|
||||
on-disk, and the EXT4 drivers simply didn't understand their address.
|
||||
|
||||
we have two tasks here:
|
||||
1. recover the unlinked inodes (i.e. directory entries).
|
||||
2. recover the cleared extents (i.e. data blocks within a file).
|
||||
|
||||
there's a purpose-built tool for #1, and we can script our own thing for #2.
|
||||
|
||||
|
||||
## Recovering Inodes with Ext4Magic
|
||||
|
||||
[ext4magic](https://github.com/gktrk/ext4magic) is a tool to manage data loss like this.
|
||||
one of its modes is `ext4magic -I <inode> -R <device>`, wherein you pass
|
||||
it an inode #, it parses the inode data structure off the disk, and then makes a best-effort
|
||||
attempt to recover everything in the fs tree at, or referenced by, that inode.
|
||||
|
||||
so for the missing `/home/admin` directory, i need only run:
|
||||
```
|
||||
# ext4magic -I 2234378 -R -d recovered-inodes /dev/sda1
|
||||
```
|
||||
and out pops the _entire_ directory tree for the admin directory. e.g.
|
||||
```
|
||||
$ tree recovered-inodes
|
||||
└── <2234378>
|
||||
└── gitea
|
||||
├── assets
|
||||
│ ├── emoji.json
|
||||
│ └── logo.svg
|
||||
├── BSDmakefile
|
||||
├── build
|
||||
│ ├── code-batch-process.go
|
||||
│ ├── codeformat
|
||||
│ │ ├── formatimports.go
|
||||
│ │ └── formatimports_test.go
|
||||
│ ├── generate-bindata.go
|
||||
│ ├── generate-emoji.go
|
||||
...
|
||||
```
|
||||
|
||||
everything under that `<2234378>` even has the correct group/owner/permission bits.
|
||||
you just need to rename `<2234378>` to `admin`, `chown` it to what it was before,
|
||||
and then link it back into `/home` in your fs (but don't do that yet: put this in some
|
||||
staging directory and link everything back in only after all the data has been recovered).
|
||||
|
||||
repeat this for all the "invalid inodes" referenced in the fsck output. then we'll recover
|
||||
the data blocks.
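
if the fsck log is long, pulling those inode numbers out by hand gets tedious. here's a rough sketch that does it with a regex. it assumes the messages look exactly like the `Entry ... has invalid inode #` sample above; the function names, the `recovered-inodes` output directory and the `/dev/sda1` default are illustrative, and it only builds the `ext4magic` command lines rather than running them.

```python
import re

# matches lines like:
#   Entry 'admin' in /home (8196) has invalid inode #: 2234378.
ENTRY_RE = re.compile(
    r"Entry '(?P<name>.+?)' in (?P<parent>.+?) \((?P<parent_inode>\d+)\) "
    r"has invalid inode #: (?P<inode>\d+)\."
)


def invalid_entries(fsck_log: str) -> list:
    '''return [(original path, inode number)] for every unlinked directory entry'''
    entries = []
    for m in ENTRY_RE.finditer(fsck_log):
        path = m['parent'].rstrip('/') + '/' + m['name']
        entries.append((path, int(m['inode'])))
    return entries


def ext4magic_commands(fsck_log: str, device: str = '/dev/sda1',
                       out_dir: str = 'recovered-inodes') -> list:
    '''build one recovery command per unlinked entry (same form as used above)'''
    return [
        f'ext4magic -I {inode} -R -d {out_dir} {device}  # was {path}'
        for path, inode in invalid_entries(fsck_log)
    ]


sample_log = """Entry 'admin' in /home (8196) has invalid inode #: 2234378.
Clear? yes
"""
for command in ext4magic_commands(sample_log):
    print(command)
```

keeping the original path next to each command makes the later "rename `<inode>` back to its real name" step much less error-prone.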
|
||||
|
||||
|
||||
## Recovering Data Blocks: EXT4 Data Structures
|
||||
|
||||
the 2nd class of message was:
|
||||
|
||||
```
|
||||
Inode 4574 has an invalid extent
|
||||
(logical block 0, invalid physical block 4031998, len 1)
|
||||
Clear<y>? yes
|
||||
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
|
||||
```
|
||||
|
||||
to understand that this even _is_ recoverable, it helps to understand the ext4 inode structure.
|
||||
inodes are on-disk data structures, one for every file or directory on the system (hard links let several directory entries share one inode).
|
||||
an inode might represent a file or a directory. they look similar in both cases, but we only care about
|
||||
inodes which represent files here.
|
||||
|
||||
each inode is a fixed-size structure holding metadata, like file type/mtime and -- notably --
|
||||
file _size_, and then they link to a dynamically-sized sequence of "extents"; roughly, pointers to where the file data
|
||||
lives on disk. the English translation is like "data bytes 0-32768 occupy the physical blocks starting
|
||||
at block 4156555; bytes 32768-36864 occupy the physical blocks starting at block 6285112". this is
|
||||
all represented in terms of blocks, so based on the file length the last block may only be partially
|
||||
filled with data.
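
to make that "English translation" concrete, here's a toy model of an extent list and the lookup it implies. this is a simplified sketch, not the on-disk ext4 format (which packs extents into a tree inside the inode); the `Extent` type and `locate` function are hypothetical names, and the 4096B block size is assumed.

```python
from typing import NamedTuple, Optional

BLOCK_SIZE = 4096  # assumed: the ext4 default block size


class Extent(NamedTuple):
    logical_block: int   # first block of this run, counted from the start of the file
    n_blocks: int        # how many consecutive blocks the run covers
    physical_block: int  # where the run starts on the device


def locate(offset: int, extents: list) -> Optional[tuple]:
    '''
    map a byte offset within the file to (physical block, offset inside that block).
    returns None when no extent covers the offset, e.g. after fsck cleared it.
    '''
    logical = offset // BLOCK_SIZE
    for extent in extents:
        if extent.logical_block <= logical < extent.logical_block + extent.n_blocks:
            delta = logical - extent.logical_block
            return extent.physical_block + delta, offset % BLOCK_SIZE
    return None


# bytes 0-32768 (8 blocks) start at block 4156555; bytes 32768-36864 (1 block) at 6285112
extents = [Extent(0, 8, 4156555), Extent(8, 1, 6285112)]
print(locate(0, extents))      # (4156555, 0)
print(locate(33000, extents))  # (6285112, 232)
print(locate(36864, extents))  # None
```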
|
||||
|
||||
EXT4 (like many file systems) keeps the file data almost entirely outside of the inode structure. fsck tells
us that it cleared the extent _entries_, but not the actual data blocks. `i_blocks` here is the inode's
count of blocks allocated to the file (for what seems to be legacy reasons, this is denoted in
512B disk sectors instead of FS blocks, so `8` means one 4096B block). fsck reset it to 0 because,
with the extents cleared, the inode no longer references any blocks.
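
a quick way to check that reading of `i_blocks` against the fsck output: convert the sector count to fs blocks and compare it with the cleared extent lengths. this is a small sketch with a made-up function name; it assumes 512B sectors and 4096B blocks, and ignores the `huge_file` case where the units differ.

```python
SECTOR_SIZE = 512   # i_blocks is counted in 512B sectors
BLOCK_SIZE = 4096   # assumed: the ext4 default block size


def i_blocks_to_fs_blocks(i_blocks: int) -> int:
    '''convert an inode's i_blocks (512B sectors) into a count of fs blocks'''
    return i_blocks * SECTOR_SIZE // BLOCK_SIZE


# "Inode 4574, i_blocks is 8": one cleared extent of len 1
print(i_blocks_to_fs_blocks(8))      # 1
# "Inode 58213, i_blocks is 13176": cleared extents of len 1024 and len 623
print(i_blocks_to_fs_blocks(13176))  # 1647
print(1024 + 623)                    # 1647
```

the numbers lining up is a nice hint that fsck listed every extent the inode used to own.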
|
||||
|
||||
so, all the inode metadata is still here; the data blocks exist but are unlinked, and only the extents
|
||||
were lost. if you try reading the file, it'll still present its original length of data, but will show
|
||||
a block's worth of zeros for every logical block whose extent was cleared.
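
that zero-filled behavior gives a cheap way to spot damage (and later, to verify a patch). below is a hedged sketch that lists the logical blocks of a file that read back as all zeros. `zero_blocks` is a name i made up, and a legitimately sparse or zero-filled file will show up too, so treat hits as candidates rather than proof.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed: the ext4 default block size


def zero_blocks(path: str, block_size: int = BLOCK_SIZE) -> list:
    '''return the logical block indexes of `path` that contain only zero bytes'''
    found = []
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            if not any(chunk):  # every byte in the chunk is 0
                found.append(index)
            index += 1
    return found


# demo on a throwaway file: block 0 has data, block 1 is zeroed, block 2 is a partial tail
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.bin')
    with open(sample, 'wb') as f:
        f.write(b'a' * BLOCK_SIZE + b'\0' * BLOCK_SIZE + b'b' * 100)
    demo_result = zero_blocks(sample)
print(demo_result)  # [1]
```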
|
||||
|
||||
|
||||
## Recovering Data Blocks
|
||||
|
||||
so we just need to link the data blocks back into the extents structure.
|
||||
we could dive deeper into EXT4 data structures and twiddle those bits, but that would lead us into
|
||||
having to understand the inode and block allocators. instead, we can just dump the block-level data, and use
|
||||
fs-level APIs to put it back.
|
||||
|
||||
`ext4magic -B <block>` will dump the full 4096 bytes of some physical block. but because the physical
|
||||
block is a direct index into the device, we can also just use `dd`. for example, let's recover
|
||||
this cleared extent:
|
||||
|
||||
```
|
||||
Inode 4574 has an invalid extent
|
||||
(logical block 0, invalid physical block 4031998, len 1)
|
||||
Clear<y>? yes
|
||||
Inode 4574, i_blocks is 8, should be 0. Fix<y>? yes
|
||||
```
|
||||
|
||||
first, we'll want to know which file this comes from:
|
||||
```sh
|
||||
$ mkdir preserved
|
||||
# mount the drive READ-ONLY:
|
||||
$ sudo mount -o ro /dev/sda1 preserved
|
||||
$ find preserved/ -inum 4574
|
||||
preserved/etc/passwd
|
||||
```
|
||||
|
||||
that's, uh, an important file. does the data block still hold proper data?
|
||||
|
||||
```sh
|
||||
$ dd if=/dev/sda1 of=/dev/stdout bs=4096 skip=4031998 count=1
|
||||
root:x:0:0::/root:/bin/bash
|
||||
bin:x:1:1::/:/usr/bin/nologin
|
||||
daemon:x:2:2::/:/usr/bin/nologin
|
||||
mail:x:8:12::/var/spool/mail:/usr/bin/nologin
|
||||
ftp:x:14:11::/srv/ftp:/usr/bin/nologin
|
||||
http:x:33:33::/srv/http:/usr/bin/nologin
|
||||
nobody:x:65534:65534:Nobody:/:/usr/bin/nologin
|
||||
[...]
|
||||
<NUL><NUL><NUL>[...]
|
||||
```
|
||||
|
||||
yes!
|
||||
|
||||
let's set up a scratch space. we can construct an overlay of our rootfs where we
|
||||
place all the recovered and patched entries, and then apply that to the original device
|
||||
once we're done recovering.
|
||||
|
||||
```sh
|
||||
$ mkdir recovered
|
||||
```
|
||||
|
||||
go ahead and manually link all the entries we recovered with `ext4magic -I` earlier into this `recovered` directory and fix up their group/owner/permissions.
|
||||
|
||||
now we can patch individual files by copying them from `preserved/<path>` to `recovered/<path>` and then `dd`ing specific
|
||||
byte ranges from `/dev/sda1` into `recovered/<path>`. for example:
|
||||
```sh
$ mkdir -p recovered/etc
$ sudo cp preserved/etc/passwd recovered/etc/passwd
$ sudo dd if=/dev/sda1 of=recovered/etc/passwd bs=4096 skip=4031998 count=1
$ ls -l preserved/etc/passwd
-rw-r--r-- 1 root root 3528 /etc/passwd
$ sudo truncate --size=3528 recovered/etc/passwd
```
|
||||
|
||||
because `dd` copies the whole block, we have that additional step of truncating the file to its original size.
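
the same copy-then-truncate step can be sketched in plain Python, which makes the offsets explicit. this is an illustration under assumptions, not what i actually ran: `patch_blocks` is a hypothetical helper, and the demo uses a small fake device image in a temp directory instead of `/dev/sda1`.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed: the ext4 default block size


def patch_blocks(device: str, out_path: str, physical_block: int,
                 logical_block: int, n_blocks: int, file_size: int) -> None:
    '''copy n_blocks from the device into out_path, then restore the original length'''
    with open(device, 'rb') as dev:
        dev.seek(physical_block * BLOCK_SIZE)
        data = dev.read(n_blocks * BLOCK_SIZE)
    # 'r+b' opens the existing copy for writing without truncating it
    with open(out_path, 'r+b') as out:
        out.seek(logical_block * BLOCK_SIZE)
        out.write(data)
        out.truncate(file_size)  # drop the zero padding at the end of the last block


with tempfile.TemporaryDirectory() as tmp:
    device = os.path.join(tmp, 'fake-device.img')
    target = os.path.join(tmp, 'passwd')
    content = b'root:x:0:0::/root:/bin/bash\n'
    with open(device, 'wb') as f:
        f.write(b'\xff' * 2 * BLOCK_SIZE)          # blocks 0-1: unrelated data
        f.write(content.ljust(BLOCK_SIZE, b'\0'))  # block 2: the "lost" file data
    with open(target, 'wb') as f:
        f.write(b'\0' * len(content))              # what the damaged file reads as
    patch_blocks(device, target, physical_block=2, logical_block=0,
                 n_blocks=1, file_size=len(content))
    with open(target, 'rb') as f:
        patched = f.read()
print(patched)
```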
|
||||
|
||||
|
||||
## Bringing it Together
|
||||
|
||||
we've successfully recovered (into the `recovered` directory):
|
||||
1. all unlinked directory entries.
|
||||
2. the cleared extent in `/etc/passwd`.
|
||||
|
||||
we still need to:
|
||||
1. recover all _other_ cleared extents.
|
||||
2. link the recovered data back into the real file system.
|
||||
|
||||
step 2 is a simple `rsync`. step 1 is some nasty `dd` work. i demoed it for a file with only one cleared extent, but some files have _many_ cleared extents, often non-contiguous.
|
||||
|
||||
assume the presence of a script `patch_file.py` (see [Appendix](#appendix)) which takes:
|
||||
- an inode number (`4574`)
|
||||
- a file path (`etc/passwd`)
|
||||
- the first logical block of the extent which was cleared (`0`)
|
||||
- the length in blocks of the extent which was cleared (`1`)
|
||||
- the first physical block containing the extent's data (`4031998`)
|
||||
|
||||
then we can parse the fsck output and script the rest of step 1.
|
||||
|
||||
```
|
||||
# fsck /dev/sda1
|
||||
[...]
|
||||
|
||||
Inode 34215 has an invalid extent
|
||||
(logical block 14, invalid physical block 2207227, len 1)
|
||||
Clear? yes
|
||||
|
||||
Inode 58213 has an invalid extent
|
||||
(logical block 0, invalid physical block 3456000, len 1024)
|
||||
Clear? yes
|
||||
|
||||
Inode 58213 has an invalid extent
|
||||
(logical block 1024, invalid physical block 3463168, len 623)
|
||||
Clear? yes
|
||||
|
||||
Inode 58213, i_blocks is 13176, should be 0. Fix? yes
|
||||
|
||||
Inode 58222 has an invalid extent
|
||||
(logical block 0, invalid physical block 2207151, len 1)
|
||||
Clear? yes
|
||||
|
||||
Inode 58222, i_blocks is 8, should be 0. Fix? yes
|
||||
|
||||
[...]
|
||||
```
|
||||
|
||||
run `find preserved/ -inum <inode>` on each of these inodes to find the file they correspond to, and then you can create this script from that snippet of fsck output:
|
||||
|
||||
```sh
|
||||
./patch_file.py -i 34215 -f var/log/pacman.log 14,1,2207227
|
||||
./patch_file.py -i 58213 -f usr/bin/yay 0,1024,3456000 1024,623,3463168
|
||||
./patch_file.py -i 58222 -f etc/fstab 0,1,2207151
|
||||
```
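
writing those invocations by hand doesn't scale to a long log. here's a rough sketch of generating them instead. it assumes the fsck messages match the format shown above, and that you've already built the inode-to-path mapping yourself (the `paths` dict below is a hand-filled example); `cleared_extents` and `patch_commands` are names i made up.

```python
import re
import shlex
from collections import defaultdict

# matches the two-line message:
#   Inode 58213 has an invalid extent
#       (logical block 0, invalid physical block 3456000, len 1024)
EXTENT_RE = re.compile(
    r"Inode (?P<inode>\d+) has an invalid extent\s+"
    r"\(logical block (?P<logical>\d+), "
    r"invalid physical block (?P<physical>\d+), len (?P<len>\d+)\)"
)


def cleared_extents(fsck_log: str) -> dict:
    '''return {inode: [(logical_block, n_blocks, physical_block), ...]}'''
    by_inode = defaultdict(list)
    for m in EXTENT_RE.finditer(fsck_log):
        by_inode[int(m['inode'])].append(
            (int(m['logical']), int(m['len']), int(m['physical'])))
    return dict(by_inode)


def patch_commands(fsck_log: str, paths: dict) -> list:
    '''build one patch_file.py invocation per inode whose path is known'''
    commands = []
    for inode, extents in cleared_extents(fsck_log).items():
        if inode not in paths:
            continue  # find came up empty; handle these by hand
        ranges = ' '.join(f'{l},{n},{p}' for l, n, p in extents)
        commands.append(
            f'./patch_file.py -i {inode} -f {shlex.quote(paths[inode])} {ranges}')
    return commands


sample_log = """Inode 34215 has an invalid extent
    (logical block 14, invalid physical block 2207227, len 1)
Clear? yes

Inode 58213 has an invalid extent
    (logical block 0, invalid physical block 3456000, len 1024)
Clear? yes

Inode 58213 has an invalid extent
    (logical block 1024, invalid physical block 3463168, len 623)
Clear? yes

Inode 58213, i_blocks is 13176, should be 0. Fix? yes
"""
paths = {34215: 'var/log/pacman.log', 58213: 'usr/bin/yay'}
for command in patch_commands(sample_log, paths):
    print(command)
```

inodes missing from `paths` are skipped on purpose, since those are the ones that need manual detective work.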
|
||||
|
||||
sometimes `find` won't find the inode that fsck updated. for example, if you booted the system after running `fsck`, Linux will notice that certain files have been corrupted
and will replace them with placeholders, destroying the original inode. these are usually the more important files, so you can dump the data block with that `dd` command
and compare it to notable files on a good file system to "guess" what it was originally.
since we don't have the original inode, we've lost metadata like the file's length, so use the `--auto-len` flag to guess the length by trimming trailing zeros off the recovered
data block.
|
||||
|
||||
take this snippet of fsck output:
|
||||
```
|
||||
Inode 4997 has an invalid extent
|
||||
(logical block 0, invalid physical block 3831801, len 1)
|
||||
Clear<y>? yes
|
||||
Inode 4997, i_blocks is 8, should be 0. Fix<y>? yes
|
||||
```
|
||||
|
||||
try to find the file:
|
||||
```sh
|
||||
$ find preserved/ -inum 4997
|
||||
# (no output)
|
||||
```
|
||||
|
||||
but we dump physical block 3831801 and notice that it looks a lot like `/etc/shadow`. so:
|
||||
|
||||
```sh
|
||||
./patch_file.py -i 4997 --auto-len -f etc/shadow 0,1,3831801
|
||||
```
|
||||
|
||||
once you've patched all the files, then bring the file system back online, writeable, and copy over your changes.
|
||||
|
||||
```sh
|
||||
$ sudo umount preserved
|
||||
$ mkdir sda1
|
||||
$ sudo mount /dev/sda1 sda1
|
||||
$ sudo rsync -av --checksum recovered/ sda1/
|
||||
$ sync && sudo umount sda1
|
||||
```
|
||||
|
||||
if all went well, you can boot the disk now. cheers 🍻
|
||||
|
||||
|
||||
## Appendix
|
||||
|
||||
the `patch_file.py` script:
|
||||
|
||||
```py
#!/usr/bin/env python3

'''
replaces zero-pages, or partial zero-pages within a single file
'''

import os
import subprocess
import sys

PAGE_LEN = 4096
DEVICE = '/dev/sda1'
IN_DIR = 'preserved'
OUT_DIR = 'recovered'

def patch_range(file_: str, logical_block: int, n_blocks: int, physical_block: int):
    '''
    patch a whole range of blocks within the file
    '''
    subprocess.check_output([
        'dd',
        f'if={DEVICE}',
        f'of={file_}',
        f'bs={PAGE_LEN}',
        f'seek={logical_block}',
        f'skip={physical_block}',
        f'count={n_blocks}',
        # without this, dd truncates the output file at the seek offset,
        # discarding everything that comes after the range being patched
        'conv=notrunc',
    ])

def copy_for_patch(path: str) -> str:
    '''
    copy the damaged file from IN_DIR to the same relative path under OUT_DIR
    '''
    in_path = os.path.join(IN_DIR, '.', path)
    out_path = os.path.join(OUT_DIR, path)
    subprocess.check_output(['rsync', '-a', '--relative', in_path, OUT_DIR + '/'])
    return out_path

def estimate_length(path: str) -> int:
    '''
    return the length of the file were there to be no trailing zero bytes
    '''
    with open(path, 'rb') as f:
        contents = f.read()
    l = len(contents)
    while l and contents[l-1] == 0:
        l -= 1
    return l

def main(path: str, auto_len: bool, patches: list):
    path = copy_for_patch(path)
    old_size = os.stat(path).st_size
    for patch in patches:
        logical_block, n_blocks, physical_block = patch
        patch_range(path, logical_block, n_blocks, physical_block)

    # dd writes whole blocks, so trim the file back to its real length
    if auto_len:
        os.truncate(path, estimate_length(path))
    else:
        os.truncate(path, old_size)

def parse_args(args: list):
    '''
    return:
        str: the relative file being operated on,
        bool: auto-estimate len,
        list: the ranges to patch
    '''
    i = 0
    inode = None
    file_ = None
    auto_len = False
    ranges = []
    while i < len(args):
        arg = args[i]
        if arg == '-i':
            inode = int(args[i+1])
            i += 2
        elif arg == '-f':
            file_ = args[i+1]
            i += 2
        elif arg == '--auto-len':
            auto_len = True
            i += 1
        else:
            logical_block, n_blocks, physical_block = map(int, arg.split(','))
            #vvv not actually required, but indicative of an error
            assert logical_block < physical_block
            ranges.append((logical_block, n_blocks, physical_block))
            i += 1
    # inode doesn't actually get used
    # it's useful just to keep the script invocations organized
    return file_, auto_len, ranges


if __name__ == '__main__':
    main(*parse_args(sys.argv[1:]))
```