I/O errors with WAL on ZFS

Richard Hipp

2013-04-24 13:57:27 UTC

Post by SaÅ¡o Kiselkov
I'm running into I/O errors when trying to access a sqlite3 database
which is using WAL from my app. While using journal_mode=delete,
everything is fine, but as soon as I switch over to journal_mode=wal, I
just get a load of I/O errors on any query, regardless if it is a SELECT
or UPDATE/INSERT.

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details about
the I/O error you are seeing?

Post by SaÅ¡o Kiselkov
All SQLite ops take place in a separate thread to prevent blocking the
main app thread (which is doing network io). The local filesystem where
database lives is ZFS (which has mmap support) on a recent build of
OpenIndiana (an OpenSolaris-derivative). I've downloaded the latest
stable SQLite release and compiled that with the default flags (which
has threading enabled, IIRC).
My SQL flow is as follows (this is my app creating a new DB and
<open empty new database>
PRAMA journal_mode=WAL
BEGIN
DROP TABLE IF EXISTS `version` <- [I/O error]
-rw-r--r-- 1 root root 1.0K 2013-04-24 14:21 block.db
-rw-r--r-- 1 root root 0 2013-04-24 14:21 block.db-shm
-rw-r--r-- 1 root root 0 2013-04-24 14:21 block.db-wal
Strangely enough, working on the database through the sqlite3
command-line tool works just fine. Is there some magic of sqlite API
calls that needs to be executed to support WAL databases? Unfortunately,
my Google searched turned up dry so far...
Cheers,
--
Saso
_______________________________________________
sqlite-users mailing list
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

--
D. Richard Hipp
***@sqlite.org

Sašo Kiselkov

2013-04-24 14:28:02 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details about
the I/O error you are seeing?

Here's my error log:

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument
#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument
#4874: statement aborts at 3: [DROP TABLE IF EXISTS `version`] disk I/O
error

Cheers,
--
Saso

Sašo Kiselkov

2013-04-24 14:41:19 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details about
the I/O error you are seeing?

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument
#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument
#4874: statement aborts at 3: [DROP TABLE IF EXISTS `version`] disk I/O
error

Just a quick addendum, configure did in fact correctly detect that I
have posix_fallocate on SunOS:

configure:19903: checking for posix_fallocate
configure:19903: gcc -o conftest -m64 conftest.c >&5
configure:19903: $? = 0
configure:19903: result: yes
...
ac_cv_func_posix_fallocate=yes

Truss'ing the process, I find this:

/2: open("/root/test/idx/block.db-shm", O_RDWR|O_CREAT, 0644) = 12
/2: fstat(12, 0xFFFFFD7FFF0CE260) = 0
/2: fcntl(12, F_GETFD, 0x00000000) = 0
/2: fcntl(12, F_SETFD, 0x00000001) = 0
/2: getuid() = 0 [0]
/2: fchown(12, 0, 0) = 0
/2: fcntl(12, F_SETLK, 0xFFFFFD7FFF0CE2B0) = 0
/2: fcntl(12, F_FREESP, 0xFFFFFD7FFF0CE290) = 0
/2: fcntl(12, F_SETLK, 0xFFFFFD7FFF0CE2B0) = 0
/2: fstat(12, 0xFFFFFD7FFF0CE400) = 0
/2: fcntl(12, F_SETLK, 0xFFFFFD7FFF0CE420) = 0
/2: fstat(12, 0xFFFFFD7FFF0CE400) = 0
/2: fcntl(12, F_ALLOCSP, 0xFFFFFD7FFF0CE380) Err#22 EINVAL
/2: fstat(12, 0xFFFFFD7FFF0CE2F0) = 0

Cheers,
--
Saso

Richard Hipp

2013-04-24 14:41:48 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details

about

Post by Richard Hipp
the I/O error you are seeing?

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument

So apparently, the call to fallocate() on the file
/root/test/idx/block.db-shm is failing with errno==22. Do you have any
idea why that might be?

Can you tell me exactly which version of SQLite you are using so that I can
figure out what line 27116 says? Or maybe look at line 27116 of sqlite3.c
yourself and let us know which line of code the error is occurring on?

Post by SaÅ¡o Kiselkov
#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument
#4874: statement aborts at 3: [DROP TABLE IF EXISTS `version`] disk I/O
error
Cheers,
--
Saso
_______________________________________________
sqlite-users mailing list
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org

Sašo Kiselkov

2013-04-24 14:44:48 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details

about

Post by Richard Hipp
the I/O error you are seeing?

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument

So apparently, the call to fallocate() on the file
/root/test/idx/block.db-shm is failing with errno==22. Do you have any
idea why that might be?
Can you tell me exactly which version of SQLite you are using so that I can
figure out what line 27116 says? Or maybe look at line 27116 of sqlite3.c
yourself and let us know which line of code the error is occurring on?

I'm running sqlite-autoconf-3071602, here's the relevant bits of code
from sqlite3.c:

if( sStat.st_size<nByte ){
/* The requested memory region does not exist. If bExtend is set to
** false, exit early. *pp will be set to NULL and SQLITE_OK returned.
**
** Alternatively, if bExtend is true, use ftruncate() to allocate
** the requested memory region.
*/
if( !bExtend ) goto shmpage_out;
#if defined(HAVE_POSIX_FALLOCATE) && HAVE_POSIX_FALLOCATE
if( osFallocate(pShmNode->h, sStat.st_size, nByte)!=0 ){
rc = unixLogError(SQLITE_IOERR_SHMSIZE, "fallocate",
pShmNode->zFilename);
goto shmpage_out;
}
#else
if( robust_ftruncate(pShmNode->h, nByte) ){
rc = unixLogError(SQLITE_IOERR_SHMSIZE, "ftruncate",
pShmNode->zFilename);
goto shmpage_out;
}
#endif
}

Cheers,
--
Saso

Sašo Kiselkov

2013-04-24 15:17:44 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details

about

Post by Richard Hipp
the I/O error you are seeing?

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument

So apparently, the call to fallocate() on the file
/root/test/idx/block.db-shm is failing with errno==22. Do you have any
idea why that might be?
Can you tell me exactly which version of SQLite you are using so that I can
figure out what line 27116 says? Or maybe look at line 27116 of sqlite3.c
yourself and let us know which line of code the error is occurring on?

I'm running sqlite-autoconf-3071602, here's the relevant bits of code
if( sStat.st_size<nByte ){
/* The requested memory region does not exist. If bExtend is set to
** false, exit early. *pp will be set to NULL and SQLITE_OK returned.
**
** Alternatively, if bExtend is true, use ftruncate() to allocate
** the requested memory region.
*/
if( !bExtend ) goto shmpage_out;
#if defined(HAVE_POSIX_FALLOCATE) && HAVE_POSIX_FALLOCATE
if( osFallocate(pShmNode->h, sStat.st_size, nByte)!=0 ){
rc = unixLogError(SQLITE_IOERR_SHMSIZE, "fallocate",
pShmNode->zFilename);
goto shmpage_out;
}
#else
if( robust_ftruncate(pShmNode->h, nByte) ){
rc = unixLogError(SQLITE_IOERR_SHMSIZE, "ftruncate",
pShmNode->zFilename);
goto shmpage_out;
}
#endif
}

I think I've found it. Dtracing around in the system, this is the ZFS
kernel code that's being called:

6 -> zfs_space
6 -> rrw_enter
6 -> rrw_enter_read
6 <- rrw_enter_read
6 <- rrw_enter
6 -> rrw_exit
6 <- rrw_exit
6 <- zfs_space

Looking at the implementation of zfs_space, I can see this tidbit:

/*
* ...
* Currently, this function only supports the `F_FREESP' command.
* ...
*/
static int
zfs_space(vnode_t *vp, int cmd, flock64_t *bfp, int flag,
offset_t offset, cred_t *cr, caller_context_t *ct)
{
...
if (cmd != F_FREESP) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EINVAL));
}
...
}

So it appears that F_ALLOCSP isn't support on ZFS. This appears to be
the case for all platforms where ZFS is available, not just SunOS. For
instance, ZFS on Linux has this problem as well:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_vnops.c#L4225-L4228

Is there some way to work around posix_fallocate and still have WAL
support in SQLite?

Cheers,
--
Saso

Sašo Kiselkov

2013-04-24 16:21:48 UTC

Can you please turn on error logging (as described at
http://www.sqlite.org/draft/errlog.html) and let us know more details

about

Post by Richard Hipp
the I/O error you are seeing?

#4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
Invalid argument

So apparently, the call to fallocate() on the file
/root/test/idx/block.db-shm is failing with errno==22. Do you have any
idea why that might be?
Can you tell me exactly which version of SQLite you are using so that I can
figure out what line 27116 says? Or maybe look at line 27116 of sqlite3.c
yourself and let us know which line of code the error is occurring on?

I think I've found it. Dtracing around in the system, this is the ZFS
6 -> zfs_space
6 -> rrw_enter
6 -> rrw_enter_read
6 <- rrw_enter_read
6 <- rrw_enter
6 -> rrw_exit
6 <- rrw_exit
6 <- zfs_space
/*
* ...
* Currently, this function only supports the `F_FREESP' command.
* ...
*/
static int
zfs_space(vnode_t *vp, int cmd, flock64_t *bfp, int flag,
offset_t offset, cred_t *cr, caller_context_t *ct)
{
...
if (cmd != F_FREESP) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EINVAL));
}
...
}
So it appears that F_ALLOCSP isn't support on ZFS. This appears to be
the case for all platforms where ZFS is available, not just SunOS. For
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_vnops.c#L4225-L4228
Is there some way to work around posix_fallocate and still have WAL
support in SQLite?

Just as a quick follow-up on this, when I manually undefine
HAVE_POSIX_FALLOCATE, which makes SQLite fall back to the
truncate-and-write implementation, everything works fine.

ZFS has been the filesystem of choice for SunOS-based systems for about
the last 5 years now, is becoming that for FreeBSD as we speak, and is
quickly gaining ground on Linux. The absence of support for
posix_fallocate() on ZFS kind of makes sense, since copy-on-write
filesystems cannot keep the posix_fallocate promise:

http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media."

COW filesystems never overwrite data in place and instead always
allocate new blocks, meaning even if the file being written to has data
blocks allocated, and the application thinks it's just overwriting the
existing blocks, under the hood the filesystem allocates new data
blocks, writes the data to them and then it *might* choose to discard
the original data (modulo snapshots, clones and a myriad of other
mechanisms in which data can be retained).

As such, I would suggest one of:

1) Introduce a configure option which allows SQLite users to explicitly
disable posix_fallocate support, if they expect to be running on
file systems without support for it. Merely switching by OS may not
be reliable enough, since for instanceUFS on SunOS implements it and
there is no simple way for libc to guess what file system a
particular file sits on.

2) Implement some sort of automatic fallback method which detects the
EINVAL condition and attempts to fall back to using the
truncate-and-write method.

If method #2 is acceptable for the SQLite project, I can attempt to
implement it. I could also implement support for posix_fallocate into
ZFS, but that will take a lot of time to get widely deployed (at least
several years), and even then the best ZFS could do is lie to the
applications (due to the aforementioned COW design).

Cheers,
--
Saso

Nico Williams

2013-04-24 20:17:48 UTC

Post by SaÅ¡o Kiselkov
ZFS has been the filesystem of choice for SunOS-based systems for about
the last 5 years now, is becoming that for FreeBSD as we speak, and is

More like 8 years :)

Post by SaÅ¡o Kiselkov
quickly gaining ground on Linux. The absence of support for
posix_fallocate() on ZFS kind of makes sense, since copy-on-write

Agreed.

Post by SaÅ¡o Kiselkov
http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media."

Pre-allocation should be a per-filesystem feature, discoverable via pathconf(3).

What would it take to add such a pathconf? (I should know this, but I don't.)

Post by SaÅ¡o Kiselkov
1) Introduce a configure option which allows SQLite users to explicitly
disable posix_fallocate support, if they expect to be running on
file systems without support for it. Merely switching by OS may not
be reliable enough, since for instanceUFS on SunOS implements it and
there is no simple way for libc to guess what file system a
particular file sits on.
2) Implement some sort of automatic fallback method which detects the
EINVAL condition and attempts to fall back to using the
truncate-and-write method.

EINVAL seems like a lousy error code to return here though. ENOTSUP
seems much better. EINVAL should be fatal here, but ENOTSUP should
cause SQLite3 to shrug and continue.

Post by SaÅ¡o Kiselkov
If method #2 is acceptable for the SQLite project, I can attempt to

I would think that it should be, but I think the errno that triggers
fallback should be ENOTSUP.

Post by SaÅ¡o Kiselkov
implement it. I could also implement support for posix_fallocate into
ZFS, but that will take a lot of time to get widely deployed (at least
several years), and even then the best ZFS could do is lie to the
applications (due to the aforementioned COW design).

Let's expand a bit on why. ZFS could save the DVAs of fallocated
blocks in the file's dnode for use later when either the file deleted
(last unlink) or written to. Admittedly it'd be tricky: the
pre-allocated blocks would have to include blocks for writing metadata
all the way up to the root, and the block sizes would have to be just
right, which would effectively mean having to pre-allocate the largest
possible block sizes (since ZFS has variable block sizes, but for any
given file the data blocks are all the same size, but this can change
when the file is one block long and grows; this applies to a bunch of
metadata as well), and that'd be rather painful.

For a SQLite3 DB/WAL in a dedicated ZFS dataset you could use
reservations to roughly equivalent effect to posix_fallocate(). But
that's not a solution.

Nico
--

Richard Hipp

2013-04-24 20:25:39 UTC

Post by SaÅ¡o Kiselkov
Just as a quick follow-up on this, when I manually undefine
HAVE_POSIX_FALLOCATE, which makes SQLite fall back to the
truncate-and-write implementation, everything works fine.
ZFS has been the filesystem of choice for SunOS-based systems for about
the last 5 years now, is becoming that for FreeBSD as we speak, and is
quickly gaining ground on Linux. The absence of support for
posix_fallocate() on ZFS kind of makes sense, since copy-on-write
http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media."

SQLite is using posix_fallocate() to allocate space for a region of shared
memory obtained using mmap(). If the space cannot be preallocated, then
when we use the mmapped region and an attempt is made to allocate the space
and the filesystem is full, we'll get a SIGBUS signal.

So what you seem to be saying is that if you (1) use ZFS and (2) use mmap()
and (3) your filesystem fills up, then there is nothing your application
can do to avoid a SIGBUS. Am I missing something here?

--
D. Richard Hipp
***@sqlite.org

Nico Williams

2013-04-24 20:34:25 UTC

...

SQLite is using posix_fallocate() to allocate space for a region of shared
memory obtained using mmap(). If the space cannot be preallocated, then
when we use the mmapped region and an attempt is made to allocate the space
and the filesystem is full, we'll get a SIGBUS signal.
So what you seem to be saying is that if you (1) use ZFS and (2) use mmap()
and (3) your filesystem fills up, then there is nothing your application
can do to avoid a SIGBUS. Am I missing something here?

That's right. You could use write(2)/pwrite(2) to avoid this and
still use mmap() for reading. That's what the OpenLDAP MDB does.

That still leaves EIO on read via mmap() -> SIGBUS. (I know, I
asserted that this was undesirable behavior by the OS, but in practice
I think all OSes do this, and if any apparently hang instead... it's
usually because of long hardware I/O timeouts.)

Nico
--

Sašo Kiselkov

2013-04-24 20:34:49 UTC