tips on hardware failures and fs corruption

Scott Lamb 2021-02-11 16:04:40 -08:00
parent 972f26f30a
commit 5acca1a253
1 changed files with 75 additions and 3 deletions


@@ -1,6 +1,10 @@
# Troubleshooting
Here are some tips for diagnosing various problems with Moonfire NVR. Feel free
to open an [issue](https://github.com/scottlamb/moonfire-nvr/issues) if you
need more help.
## Viewing Moonfire NVR's logs
While Moonfire NVR is running, logs will be written to stderr.
@@ -35,8 +39,8 @@ If you use Docker, set these via Docker's `--env` argument.
### `Error: pts not monotonically increasing; got 26615520 then 26539470`
If your streams cut out and you see error messages like this one in Moonfire
NVR logs, it might mean that your camera outputs [B
frames](https://en.wikipedia.org/wiki/Video_compression_picture_types#Bi-directional_predicted_.28B.29_frames.2Fslices_.28macroblocks.29).
If you believe this is the case, file a feature request; Moonfire NVR
currently doesn't support B frames. You may be able to configure your camera
@@ -47,3 +51,71 @@ to disable B frames in the meantime.
This happens if your machine is configured to a non-UTF-8 locale, due to
gyscos/Cursive#13. As a workaround, try setting the environment variable
`LC_ALL=C.UTF-8`. The Docker container should set this automatically.
### Moonfire NVR reports problems with the database or filesystem
When diagnosing problems with Moonfire NVR, it helps to first check your
system's overall health.
1. Look at your kernel logs. On most Linux systems, you can browse them via
`journalctl`, `dmesg`, or `less /var/log/messages`. See [Errors in kernel
logs](#kernel-errors) below for some common problems.
2. Use [`smartctl`](https://linuxconfig.org/how-to-check-an-hard-drive-health-from-the-command-line-using-smartctl) to
look at the SMART (Self-Monitoring, Analysis and Reporting Technology)
attributes of your flash and hard drives. Backblaze
[reports](https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/)
that the following SMART attributes are most predictive of drive failure:
* SMART 5: Reallocated Sectors Count
* SMART 187: Reported Uncorrectable Errors
* SMART 188: Command Timeout
* SMART 197: Current Pending Sector Count
* SMART 198: Uncorrectable Sector Count
If the raw value (the `RAW_VALUE` column in `smartctl -A` output) for any of
these attributes is non-zero, a hardware problem is the likely cause.
3. Use `smartctl` to run a self-test on your flash and hard drives.
4. Run `fsck` on your filesystems.
Your root filesystem is best checked on startup, before it's mounted
read-write. On most Linux systems, you can force `fsck` to run on the next
startup via the `fsck.mode=force` kernel parameter, as documented in the
[`systemd-fsck` man page](https://www.freedesktop.org/software/systemd/man/systemd-fsck@.service.html).
If you have hard drives dedicated to Moonfire NVR, you can also shut down
Moonfire NVR, unmount the filesystem, and run `fsck` on them without
rebooting.
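
The SMART checks in steps 2 and 3 can be partly scripted. The following is a minimal sketch, not part of Moonfire NVR. The helper name `smart_failure_hints` is hypothetical, `/dev/sdX` is a placeholder for your drive, and the filter assumes the usual ATA attribute table printed by `smartctl -A`, in which the raw value is the tenth column. NVMe drives report health in a different format that this filter does not cover.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints any of
# the five attributes listed above (5, 187, 188, 197, 198) whose raw value
# is non-zero. Assumes the ATA table layout: ID in column 1, name in
# column 2, RAW_VALUE in column 10.
smart_failure_hints() {
  awk '$1 ~ /^(5|187|188|197|198)$/ && $10 + 0 != 0 { print $1, $2, "raw=" $10 }'
}

# Usage (requires root; /dev/sdX is a placeholder):
#   sudo smartctl -A /dev/sdX | smart_failure_hints
#
# Step 3: start a short self-test, then read the self-test log once it
# has finished:
#   sudo smartctl -t short /dev/sdX
#   sudo smartctl -l selftest /dev/sdX
```

No output from the filter only means none of these five attributes is raised. It does not prove the drive is healthy, so the self-test and `fsck` steps are still worth running.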
After the system as a whole is verified healthy, run `moonfire-nvr check` while
Moonfire NVR is stopped to verify the integrity of the SQLite database and
sample file directories.
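
For a drive dedicated to Moonfire NVR, the sequence described above (stop, unmount, `fsck`, then `moonfire-nvr check`) might look like the sketch below. Several details are assumptions rather than facts from this guide: that Moonfire NVR runs as a systemd unit named `moonfire-nvr`, that the mount point has an `/etc/fstab` entry so `mount` can be called with the path alone, and the placeholder device and mount point. The helper names `run` and `check_dedicated_drive` are hypothetical. By default the script only prints the commands it would run.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to really run the commands (requires root).
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper. $1 is the block device, $2 is its mount point.
check_dedicated_drive() {
  dev=$1
  mountpoint=$2
  run systemctl stop moonfire-nvr || return   # unit name is an assumption
  run umount "$mountpoint" || return
  run fsck "$dev"
  rc=$?
  # fsck exits 0 for a clean filesystem and 1 if it corrected errors;
  # anything higher means the filesystem needs more attention first.
  if [ "$rc" -gt 1 ]; then
    return "$rc"
  fi
  run mount "$mountpoint" || return           # assumes an /etc/fstab entry
  run moonfire-nvr check                      # Moonfire NVR is still stopped
}

# Usage (placeholders): check_dedicated_drive /dev/sdX1 /media/nvr
```

The sketch deliberately leaves Moonfire NVR stopped at the end, so you can read the output of `moonfire-nvr check` before deciding whether to start it again.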
### <a name="kernel-errors"></a> Errors in kernel logs
#### UAS errors
Some cheap USB SATA adapters don't appear to work reliably in UAS mode under
Linux. If you see errors like the following, try [disabling
UAS](https://unix.stackexchange.com/questions/239782/connection-problem-with-usb3-external-storage-on-linux-uas-driver-problem).
Unfortunately, your filesystem is likely to be corrupted, so after disabling
UAS, run `fsck` and then `moonfire-nvr check` to try to recover.
```
Sep 22 17:26:01 nuc kernel: sd 4:0:0:1: [sdb] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
Sep 22 17:26:01 nuc kernel: sd 4:0:0:1: [sdb] tag#2 CDB: Write(16) 8a 00 00 00 00 01 4d b4 c4 00 00 00 03 b0 00 00
```
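
To see which disks are affected, you can filter the kernel log for the abort handler messages shown above. This is a minimal sketch. The helper name `uas_abort_devices` is hypothetical, and it assumes log lines shaped like the sample, with the disk name in square brackets.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints the name
# of each disk (e.g. sdb) that logged a uas_eh_abort_handler message.
uas_abort_devices() {
  grep 'uas_eh_abort_handler' \
    | sed -n 's/.*\[\(sd[a-z]*\)\].*/\1/p' \
    | sort -u
}

# Usage: journalctl -k | uas_abort_devices
#    or: dmesg | uas_abort_devices
```

Any disk this prints is a candidate for the UAS workaround, followed by `fsck` and `moonfire-nvr check`.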
#### Filesystem errors
Errors that mention `EXT4-fs` (or your filesystem of choice) likely indicate
filesystem corruption. Run `fsck` as described above to repair it. Once the
corruption is addressed, use `moonfire-nvr check` to survey the damage to
your database.
```
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): error count since last fsck: 12
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): initial error at time 1576998292: ext4_validate_block_bitmap:376
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): last error at time 1579640202: ext4_validate_block_bitmap:376
...
Feb 13 04:48:43 nuc kernel: EXT4-fs error (device sdc1): ext4_validate_block_bitmap:376: comm kworker/u8:2: bg 57266: bad block bitmap checksum
Feb 13 04:48:43 nuc kernel: EXT4-fs (sdc1): Delayed block allocation failed for inode 7334278 at logical offset 0 with max blocks 11 with error 74
Feb 13 04:48:43 nuc kernel: EXT4-fs (sdc1): This should not happen!! Data will be lost
```
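
To find out which filesystems need a `fsck`, you can pull the periodic ext4 error summaries out of the kernel log. This is a sketch under the assumption that your kernel prints the `error count since last fsck` line in the same shape as the sample above. The helper name `ext4_error_counts` is hypothetical, and other filesystems log different messages.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<device> <count>" for every ext4 "error count since last fsck" line.
ext4_error_counts() {
  sed -n 's/.*EXT4-fs (\([^)]*\)): error count since last fsck: \([0-9]*\).*/\1 \2/p'
}

# Usage: journalctl -k | ext4_error_counts
```

A device that keeps reappearing with a growing count after a `fsck` suggests an underlying hardware problem, so it is worth rechecking its SMART attributes as well.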