Dominique Devienne
7 years ago
I'm surprised about the commit time of SQLite, when writing blobs is
involved.
Can anybody shed light on this subject? Below's a description of what I do,
with the results.
I've exporting data into SQLite, spread in several tables.
I process only about 240,000 rows, and write around 1GB in 20,000 blobs.
I process the data in two passes:
1) write the rows, record rowid when a row has a blob,
zeroblob(N) during insert, to leave room for Incremental Blob I/O later
in 2nd pass.
2) process blobs only
Then I compare this with writing the blobs in HDF5 instead.
Instead of writing the blob itself in-row, I just insert its size in SQLite.
(HDF5 is a "filesystem in a file", non-transactional, with "files" being
N-dimensional typed arrays, thus a 1D array of bytes approximates a "file")
Results writing both rows and blobs in SQLite:
Write ROWs = 6.968s (18.7%) (includes the zeroblob(N))
Write BLOBs = 2.764s ( 7.4%)
Commit TX = 10.566s (28.4%)
Results is a single 1,110,253,568 bytes SQLite DB file.
Now writing only rows to SQLite and blobs to HDF5:
Write ROWs = 4.265s (16.6%)
Write BLOBs = 4.010s (15.6%)
Commit TX = 0.195s ( 0.8%) (includes flushing the HDF5 file too)
Results in two files, 64,475,136 bytes SQLite DB, and 1,039,652,416 bytes
HDF5 file.
SQLite (version 3.19.3) is used in journal mode, and DB written "from
scratch", so there's not much to put in the journal, AFAIK. My assumption
was that after the zeroblob(N), there was enough room in the main DBs
pages, such that the subsequent blob open+write+close did not need to
generate any "page churn" (i.e. journal activity) and could write directly
to the pages created on initial insert. But what surprised me was the
commit time jumping to 10s, i.e. 50x more that when not writing the blobs.
After the initial zeroblob(N) + sqlite3_last_insert_rowid(), part of "Write
ROWs",
the code writing the blobs is below (simplified, error handling removed).
sqlite3_blob* p_blob = nullptr;
sqlite3_blob_open(db, "main", tab, col, rowid, /*read-write*/1, &p_blob);
assert(sqlite3_blob_bytes(p_blob) == length, ;);
sqlite3_blob_write(p_blob, buffer, static_cast<int>(length), 0);
sqlite3_blob_close(p_blob);
So if the code above "the write way" to write blobs?
I don't yet have the buffers to write the blobs directly at-row-insert-time,
for reasons beyond this particular problem. Please assume this is by-design.
I can understand that 64MB I/O vs 1GB I/O takes more time, and that's
reflected
in the 4.2s writing just the rows, versus the 7s + 2.7s = 9.7s when writing
both rows and blobs,
but why the commit time jumps from 0.2s to 10.5s ??? Thanks for any
insights in this.
Thanks, --DD
involved.
Can anybody shed light on this subject? Below's a description of what I do,
with the results.
I've exporting data into SQLite, spread in several tables.
I process only about 240,000 rows, and write around 1GB in 20,000 blobs.
I process the data in two passes:
1) write the rows, record rowid when a row has a blob,
zeroblob(N) during insert, to leave room for Incremental Blob I/O later
in 2nd pass.
2) process blobs only
Then I compare this with writing the blobs in HDF5 instead.
Instead of writing the blob itself in-row, I just insert its size in SQLite.
(HDF5 is a "filesystem in a file", non-transactional, with "files" being
N-dimensional typed arrays, thus a 1D array of bytes approximates a "file")
Results writing both rows and blobs in SQLite:
Write ROWs = 6.968s (18.7%) (includes the zeroblob(N))
Write BLOBs = 2.764s ( 7.4%)
Commit TX = 10.566s (28.4%)
Results is a single 1,110,253,568 bytes SQLite DB file.
Now writing only rows to SQLite and blobs to HDF5:
Write ROWs = 4.265s (16.6%)
Write BLOBs = 4.010s (15.6%)
Commit TX = 0.195s ( 0.8%) (includes flushing the HDF5 file too)
Results in two files, 64,475,136 bytes SQLite DB, and 1,039,652,416 bytes
HDF5 file.
SQLite (version 3.19.3) is used in journal mode, and DB written "from
scratch", so there's not much to put in the journal, AFAIK. My assumption
was that after the zeroblob(N), there was enough room in the main DBs
pages, such that the subsequent blob open+write+close did not need to
generate any "page churn" (i.e. journal activity) and could write directly
to the pages created on initial insert. But what surprised me was the
commit time jumping to 10s, i.e. 50x more that when not writing the blobs.
After the initial zeroblob(N) + sqlite3_last_insert_rowid(), part of "Write
ROWs",
the code writing the blobs is below (simplified, error handling removed).
sqlite3_blob* p_blob = nullptr;
sqlite3_blob_open(db, "main", tab, col, rowid, /*read-write*/1, &p_blob);
assert(sqlite3_blob_bytes(p_blob) == length, ;);
sqlite3_blob_write(p_blob, buffer, static_cast<int>(length), 0);
sqlite3_blob_close(p_blob);
So if the code above "the write way" to write blobs?
I don't yet have the buffers to write the blobs directly at-row-insert-time,
for reasons beyond this particular problem. Please assume this is by-design.
I can understand that 64MB I/O vs 1GB I/O takes more time, and that's
reflected
in the 4.2s writing just the rows, versus the 7s + 2.7s = 9.7s when writing
both rows and blobs,
but why the commit time jumps from 0.2s to 10.5s ??? Thanks for any
insights in this.
Thanks, --DD