Discussion:
[sqlite] Btree page corruption
Nikhil Deshpande
7 years ago
Permalink
Hi,

We have an application that in a Linux VM that's running into
SQLite DB corruption (after weeks and months of running,
4 such instances yet in different VMs).

We would appreciate some help in debugging this further to identify
source of corruption!

Symptom is btree page corruption, e.g.
$ sqlite3 stats.sqlite "pragma integrity_check;"
*** in database main ***
Page 3818: btreeInitPage() returns error code 11
Page 46: btreeInitPage() returns error code 11
Error: database disk image is malformed
(Same error is raised for SELECT queries too.)

There were no power-off or reboots in near time vicinity when the
corruption was detected. We have poured over this document
https://sqlite.org/howtocorrupt.html
many times to check if any of the conditions could apply,
but so far no leads yet.

We have also been unable to reproduce the corruption by stressing
application's SQLite DB read/write code paths for a week.

I'm attaching showdb output for the DB header and 2 corrupt pages
if it's of any hint.

---

A bit more application setup context/information:

- Linux kernel 4.4.41
- glibc 2.22
- Ext4 file system, mounted as (rw,relatime,data=ordered).

- Writer C++ process: sqlite-3.17
- Creates a set of "time series" tables, each table has 2 numeric
columns (timestamp, int) during initialization.
- Every 1 minute, 2 threads will do total 15 writes. (using "INSERT OR
REPLACE ... (timestamp, int)" SQL into 15 tables).
- SQLite DB opened with SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE
flags, initialized with "PRAGMA journal_mode=wal;", threading mode
is Serialized for the libsqlite build, uses default VFS ("unix").
All other config params are default (e.g. autovacuum is disabled
etc.).
- A separate thread runs "PRAGMA quick_check;" periodically every 5
minutes, in its own separate DB connection.
- Reader process: sqlite-3.11 + Python 2.7.11
- Periodically reads time series tables for a given timestamp range
(usually latest 5 minutes) using SELECT queries (no INSERT/UPDATE/
DELETE from this process).
- Uses same same "PRAGMA journal_mode=wal", uses the sqlite3 DBAPI
module from Python standard library.
Apart from above 2, no other processes are accessing the SQLite DB file.

We have updated both the reader and writer to use latest SQLite 3.21,
but without understanding the cause of corruption, we are unable to
say if this update to latest 3.21 would indeed prevent further
occurrences.

Thanks,
Nikhil
Simon Slavin
7 years ago
Permalink
...
Well done for finding the appropriate web page. You’ve done all the easy stuff which leaves us with just the hard questions. These won’t necessarily give us an answer, but they might remind one of your readers of something.

When "pragma integrity_check" detects an error, does "PRAGMA quick_check" do too ?

Is your database file really about 65 Meg in size ? Just roughly.

How is your application terminated ? User action or self-determination ?

Does your application call sqlite3_close() ? Does it cbeck the code returned ?

Does your application call sqlite3_shutdown() ? Does it cbeck the code returned ?

How does your VM get shutdown ? Does it correctly shut down your application before it quits ?

Can you include a "pragma integrity_check" at startup ?
Can you include a "pragma integrity_check" executed at regular intervals ?

Simon.
Nikhil Deshpande
7 years ago
Permalink
Thanks Simon for the response!
Post by Simon Slavin
When "pragma integrity_check" detects an error, does "PRAGMA quick_check" do too ?
Yes, both pragmas return same error.
Post by Simon Slavin
Is your database file really about 65 Meg in size ? Just roughly.
Yes:
$ du -sh *
66M appliance_stats.sqlite
32K appliance_stats.sqlite-shm
4.2M appliance_stats.sqlite-wal

num_pages X page_size seem to match file system metadata, fsck has not
revealed any inconsistencies.
Post by Simon Slavin
How is your application terminated ? User action or self-determination ?
The writer process receives a SIGTERM and in that handler signals a
condition that is waited by all other threads (when not doing any work).
If the threads are doing any work, they finish it and then wait on the
condition. Threads are not abnormally terminated.
The writer process runs till the VM is shutdown.
Post by Simon Slavin
Does your application call sqlite3_close() ? Does it cbeck the code returned ?
Does your application call sqlite3_shutdown() ? Does it cbeck the code returned ?
Yes, the return codes are logged and used as process exit code, our logs
do not show non-zero error codes/logs.
Post by Simon Slavin
How does your VM get shutdown ? Does it correctly shut down your application before it quits ?
VM shutdown is normal shutdown with systemd shutting down a simple
watchdog that forwards shutdown to the SQLite DB writer process.
Post by Simon Slavin
Can you include a "pragma integrity_check" at startup ?
Can you include a "pragma integrity_check" executed at regular intervals ?
The writer process does "pragma quick_check" on every startup at init,
bails out on failure and spawns a separate thread to do same
"pragma quick_check" every 5 minutes (opens it's own separate DB handle
and closes it). Would changing quick_check to integrity_check be
helpful? (Would integrity_check catch such corruption earlier than
quick_check? Would it hold longer exclusive locks on the DB file
that could prevent reads?)


Thanks,
Nikhil
Simon Slavin
7 years ago
Permalink
...
Thanks Nikhil.

Had I noticed your email address I wouldn’t have asked picky questions about abnormal termination and VM management.

Your responses indicate that you are doing everything properly. I don’t see anything there that makes me think that you’re missing a cause of corruption. However people who know more than I do about SQLite might spot something.

Changing "quick_check" to "integrity_check" may help in this case may be useful in investigating your problem. "integrity_check" is far more thorough and checks every aspect of data integrity from both directions. However, the type of corruption you’re experiencing may be one which is spotted just as well by "quick_check". There’s no way to know without checking it.

But it does suffer from the problem you noted (a 65 Meg database would be locked for longer) and this might be unacceptable in your environment. Or you might be able to do it just once an hour instead of every 5 minutes.

An alternative might be to run "integrity_check" on backup copies which don’t show up anything on "quick_check". This could be done without blocking the production system. If you never find anything then you know "quick_check" is all you need. But it might spot corruption in some databases you don’t think have been corrupted yet.

Simon.
Simon Slavin
7 years ago
Permalink
An alternative might be to run "integrity_check" on backup copies which don’t show up anything on "quick_check". This could be done without blocking the production system. If you never find anything then you know "quick_check" is all you need. But it might spot corruption in some databases you don’t think have been corrupted yet.
Just to be clear, SQLite database files can be corrupted in ways that don’t show up until a particular row of a particular index is used. If a row pointer of a particular index becomes corrupt, and something tries to UPDATE that row, SQLite will corrupt the file even more, causing corruption to spread to the table, and to other indexes of that table.

This is why using a database without problems for months, then suddenly having a problem, does not mean that the database only just became corrupt.

The only way to check everything (at least, the developers hope it’s everything) is "PRAGMA integrity_check".

Note: the above is simplified for the sake of simplicity. SQLite does a certain amount of internal checking.

Simon.
Rowan Worth
7 years ago
Permalink
...
I haven't checked the code but the docs say:

https://sqlite.org/pragma.html#pragma_quick_check
Post by Simon Slavin
The pragma is like integrity_check
<https://sqlite.org/pragma.html#pragma_integrity_check> except that it does
not verify UNIQUE constraints and does
Post by Simon Slavin
not verify that index content matches table content. By skipping UNIQUE
and index consistency checks,
Post by Simon Slavin
quick_check is able to run much faster than integrity_check. Otherwise
the two pragmas are the same.

Based on my personal experience/testing, quick_check still reads the entire
DB. The only type of corruption I've ever seen integrity_check report which
quick_check didn't was an index lacking an entry for a particular ROWID (in
an otherwise well-formed DB). Coupled with Nikhil's tests I'm certain
quick_check will catch this type of corruption.

So the only scenario left to worry about is an index corruption (ie. one
only detected by integrity_check) happening first, which somehow leads to a
DB page corruption AND masks the original index corruption. Doesn't seem
likely, but switching to integrity_check would rule out any chance.

It does take longer than quick_check, but how much longer will depend on
how big your indices are. On a 2GB DB I've measured 200 seconds for
quick_check vs 300 seconds for integrity_check, but that was over NFS.

I don't think either check takes an EXCLUSIVE lock on the DB? But a
competing writer might, which ends up locking out readers until the check
finishes (and the write completes).


I agree with Simon you're not doing anything obviously wrong... This is
listed in the how to corrupt page, but I'll ask anyway because it's a
subtle one: do any of your processes open the database file, for any
reason, without going through sqlite's API?

-Rowan
Simon Slavin
7 years ago
Permalink
Post by Rowan Worth
do any of your processes open the database file, for any
reason, without going through sqlite's API?
Just to note that a major offender in this respect is anti-virus software. So don’t think just of things that might want to read a SQLite database, but think of anything that might read all files, or changed files.

Simon.

Richard Hipp
7 years ago
Permalink
Post by Nikhil Deshpande
There were no power-off or reboots in near time vicinity when the
corruption was detected.
(1) Might the corruption have been sitting dormant due to some far
away power-off or reboot and was only recently discovered? How much
do you trust the fsync() system call and the filesystems ability to
honor that system call on the VM on which this code is running? Do
you run "PRAGMA quick_check" on initial power-up?

(2) Is this the only corruption instance you have seen?

(3) The showdb outputs you sent only shows me the first 8 bytes of the
page. From those 8 bytes, it looks to be an overflow page that is
being used as a btree page. Are there any further clues in the hex
dump of the bad pages?

(4) What does "showdb pgidx" look like?

(5) How many threads are accessing the database? Are you sure that
you have the database set in a thread-safe mode?
--
D. Richard Hipp
***@sqlite.org
Nikhil Deshpande
7 years ago
Permalink
Thanks Richard for the response!
Post by Richard Hipp
Post by Nikhil Deshpande
There were no power-off or reboots in near time vicinity when the
corruption was detected.
(1) Might the corruption have been sitting dormant due to some far
away power-off or reboot and was only recently discovered? How much
do you trust the fsync() system call and the filesystems ability to
honor that system call on the VM on which this code is running? Do
you run "PRAGMA quick_check" on initial power-up?
The previous reboot was almost 4 days before the corruption was
detected by the reader process. Trawling through the logs for those
4 days showed no anomalies or different behavior (the writer was
functioning and inserting data as expected, writer did not detect
the corruption that the reader process ran into, our guess is that
the writer was not touching the corrupt btree pages, writing to some
other part of the DB file).
Post by Richard Hipp
(2) Is this the only corruption instance you have seen?
We have seen such 4 instances on 4 different VMs in path 4 months.
The corruption symptom is same, btree page header lacking the correct
bits set.
Post by Richard Hipp
(3) The showdb outputs you sent only shows me the first 8 bytes of the
page. From those 8 bytes, it looks to be an overflow page that is
being used as a btree page. Are there any further clues in the hex
dump of the bad pages?
Attaching full dump of the corrupt pages 46 and 3818.
This page 46 seems to contain reasonable strings (written by the app,
metadata about the timeseries tables), not garbage/arbitrary data.
Page 3818 (belongs to a timeseries table) seems to contain zeros
except at the head 16 bytes and tail end 16 bytes of the page.
Post by Richard Hipp
(4) What does "showdb pgidx" look like?
Attaching "showdb pgidx" output, command runs fine (exiting with 0),
showing these 2 pages as corrupt:
46: corrupt node [sm_stats_metadata], child 2 of page 2
3818: root corrupt node [sm_stat_disk_latency_rate_dm-9]
Post by Richard Hipp
(5) How many threads are accessing the database? Are you sure that
you have the database set in a thread-safe mode?
There are total 3 threads in the writer process accessing the DB,
two threads do periodic "INSERT OR REPLACE (timestamp, val)" into
bunch of tables. The third thread runs "pragma quick_check"
periodically.

Logging sqlite3_threadsafe() output shows value 1 (serialized mode
build time setting) and sqlite3_db_mutex() on a db handle shows
a non-NULL value. The DB file is opened using sqlite3_open_v2()
with SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE flags, it does
not explicitly pass SQLITE_OPEN_FULLMUTEX flag (relies on the compile
time threading mode to be inherited, which is serialized threading
mode build time setting).

Thanks,
Nikhil
Rowan Worth
7 years ago
Permalink
Does either process take backups of the DB? If so, how is that implemented?
-Rowan
...
Nikhil Deshpande
7 years ago
Permalink
Post by Rowan Worth
Does either process take backups of the DB? If so, how is that implemented?
Thanks Rowan for the response!
Backup is done by a separate process through command:
sqlite3 /path/to/db_file .dump > dump.sql
and not using the sqlite3 backup API.

Thanks,
Nikhil
Loading...