Discussion:
Tuning a SQLite database for best compression
Larry Knibb
2012-03-05 10:59:57 UTC
Hi,

I'm developing a mobile application which downloads my database so smallest
possible database size (i.e. on-disk/footprint) is important.

Currently my database is 41.5MB with indexes, and to reduce the download I'm
zipping it with maximum-compression Deflate, which gets it down to 18.4MB.

I noticed I can get it down to 12.1MB using LZMA compression but
unfortunately there's no LZMA library for Windows Phone yet... but anyway.

I further noticed that if I .dump the SQL to create the database and then zip
that, the resulting size is just 6.7MB; which got me thinking that maybe
there's some zip-unfriendly padding in the db file format that I could
potentially VACUUM away, defragment or otherwise remove as a cleanup
operation?
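
For reference, the comparison I ran was roughly this (sqlite3
command-line shell; file names are placeholders, and gzip stands in
for any Deflate-based zipper):

  sqlite3 my.db .dump > my.sql       # schema + data as SQL text
  gzip -9 < my.db  > my.db.gz        # Deflate the raw db file
  gzip -9 < my.sql > my.sql.gz       # Deflate the SQL text
  ls -l my.db.gz my.sql.gz           # compare compressed sizes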

I don't really want to reload from SQL on the device so if anyone has any
tips about how to optimise the db file for space then I'd be very grateful.

Cheers,
Larry

Sent from my Windows Phone
Simon Slavin
2012-03-05 11:53:44 UTC
Post by Larry Knibb
I'm developing a mobile application which downloads my database so smallest
possible database size (i.e. on-disk/footprint) is important.
Currently my database is 41.5MB with indexes and to reduce the download I'm
zipping it with max. compression Deflate down to 18.4MB.
I noticed I can get it down to 12.1MB using LZMA compression but
unfortunately there's no LZMA library for Windows Phone yet... but anyway.
Would this be useful? I think you might have to be using the SQLite API calls rather than a third-party extra layer.

<http://www.hwaci.com/sw/sqlite/cerod.html>

Simon.
Larry Knibb
2012-03-05 12:16:46 UTC
Thanks for the quick reply, Simon. I was rather hoping there might be
some efficiency tweaks I could make to the db file (e.g. more explicit
creation statements, or postprocessing) rather than getting into custom
SQLite builds.

I was on the bus when I wrote my earlier mail, but now that I'm home
I've tried VACUUM. Unfortunately it only shaves ~250KB off the ~41MB
database.

Open to other ideas?

Can I limit possible string lengths for the TEXT fields or is it
already using VARCHAR internally? If so, any idea where all the space
is being used? Can I profile the tables to learn how much data they
hold individually or the efficiency of that storage?

L
Larry Knibb
2012-03-05 12:43:17 UTC
The database went up by 1MB on disk but I saved over 1MB on the zip
file by dropping the page_size down from 1024 to 512. That's a slight
improvement. Earlier when I said the footprint was the issue, I really
meant the download footprint.
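
For anyone trying the same thing: the page size of an existing,
non-empty database only changes after a VACUUM, so the sequence is
roughly this (file name is a placeholder):

  sqlite3 my.db "PRAGMA page_size=512; VACUUM;"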

LZMA is also 500KB smaller @ 11.6MB (why oh why isn't there a working
LZMA implementation for Windows Phone?).

Still not a patch on the text data squeezing down to <7MB.

Nothing else in the pragma list looked immediately interesting.

L
Mario Becroft
2012-03-05 13:06:54 UTC
I don't know whether this is practical in your application, but if you
want to reduce redundancy in the database file when downloading, the
obvious thing is to leave off the indexes, and generate them once the
file is downloaded onto the device. This way you'll have to download
only the actual data set, and this combined with compression should be
about as small as you can go without using domain-specific data
compression or reformulating the data model.

This is similar to loading from DML statements, which you don't want to
do, but executing only the CREATE INDEX statements after downloading
might be more acceptable than loading the entire data set.
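
Concretely, something along these lines - the table and index names
here are made up for illustration:

  -- on the build machine, before publishing the download:
  DROP INDEX big_index;
  VACUUM;

  -- on the device, once the download completes:
  CREATE INDEX big_index ON big_table(col_a, col_b);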
--
Mario Becroft <mb-LmE78Sg+***@public.gmane.org>
Larry Knibb
2012-03-05 13:14:39 UTC
Thanks Mario. I did think about stripping the indexes to cut down the
size. Recreating them on-device wouldn't be difficult (just
time-consuming), but offset against the smaller download it's possibly
worth it... although if I'm getting into partial on-device db creation
then I should really do some timing to see whether importing the whole
.dump file on the device is really that bad. Anyway, cheers for the advice.

L
Scott Hess
2012-03-05 17:44:35 UTC
Note that importing the .dump file does effectively strip the indices
and re-create them. In fact, there's a good chance that if you drop
the indices, VACUUM, and then compress the database file, you'll find
that it comes close to what you get with the .dump-then-compress
values.
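
An easy way to test that hypothesis, working on a copy so the
original stays intact (index names are placeholders):

  cp my.db test.db
  sqlite3 test.db "DROP INDEX idx_one; DROP INDEX idx_two; VACUUM;"
  gzip -9 < test.db > test.db.gz
  ls -l test.db.gz          # compare against the compressed .dump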

-scott
Larry Knibb
2012-03-06 13:10:40 UTC
Thanks Scott,

It is much better without the indexes. The uncompressed db file after
dropping the indexes and vacuuming is 23MB compared to 43MB with
indexes. Deflate cuts the size down to 10MB. The LZMA archive is less
than 7MB - more or less the same as the Deflated SQL. I really need an
LZMA library for Windows Phone - maybe that's my next project ;)

Anyway, only rebuilding the indexes seems like a good starting point
for writing to the (previously read-only) database on-device, so I'll
start there. Thanks for the suggestion.

L
Simon Slavin
2012-03-05 13:02:11 UTC
Post by Larry Knibb
Thanks for the quick reply, Simon. I was rather hoping there might be
some efficiency tweaks I can do to the db file (e.g. more explicit
creation statements, or postprocessing) instead of getting into custom
sqlite builds. [snip]
SQLite is about equally space-efficient for pretty much everything you do. I have seen some reports that creating tables by INSERTing rows in primary key order makes a database smaller, but I don't remember any of the experts posting anything definite about it one way or the other.

Something you might try is to build your database file however you want, then use the shell tool's '.dump' and '.read' commands to make another database file out of it. Among other things I believe this does INSERT rows in primary key order. If the two files are radically different in size then there's something strange going on and, of course, you've found a way to make your database smaller.
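
The round trip can be done in a single pipeline, since the shell tool
reads SQL from standard input (file names are placeholders):

  sqlite3 original.db .dump | sqlite3 rebuilt.db
  ls -l original.db rebuilt.db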
Post by Larry Knibb
Can I limit possible string lengths for the TEXT fields or is it
already using VARCHAR internally?
Columns declared as VARCHAR get TEXT affinity, and the declared length limit is ignored. All strings in SQLite are variable-length strings. See

<http://sqlite.org/datatype3.html>
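
A quick demonstration at the sqlite3 prompt - the declared length is
simply not enforced:

  CREATE TABLE t (name VARCHAR(10));
  INSERT INTO t VALUES ('considerably longer than ten characters');
  -- the INSERT succeeds; nothing is truncated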
Post by Larry Knibb
If so, any idea where all the space
is being used? Can I profile the tables to learn how much data they
hold individually or the efficiency of that storage?
Take a look at

<http://sqlite.org/download.html>

to see if there's a build of 'sqlite-analyzer' for your platform. This is a good way of getting a ton of information about your database file. You might find something weird.
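
If memory serves, you just point it at the file:

  sqlite3_analyzer my.db

and it prints a report of pages, payload bytes and unused bytes for
each table and index.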

Generally, in cases where it /is/ possible to save a lot of space in a database, it's because people have made indexes which are unnecessary, or unnecessarily long, or duplicates of existing indexes. If you understand SQL well and made your indexes by examining your SELECT and UPDATE commands, then this probably isn't your problem.

Simon.
Larry Knibb
2012-03-05 13:12:14 UTC
Post by Simon Slavin
SQLite is about equally space-efficient for pretty-much everything you do.
Yeah, I figured it probably would be. I'm generating my database in
one shot off-device at the moment and it's all sequential, so there's
no chance of anything being out of order. Even the VACUUM only cleaned
up 250KB out of 41.5MB. The page_size result was half-expected,
because smaller chunks go through the Deflate algorithm better, but I
guess I was hoping someone had tried this before and found some magic
combination of pragmas that perfectly aligns with what Deflate expects
(an unusual page_size, perhaps?).

Anyway, I'll grab an analyser and see what's what.

Thanks Simon.

L
Roger Binns
2012-03-05 17:41:39 UTC
Post by Larry Knibb
Currently my database is 41.5MB with indexes and to reduce the download
I'm zipping it with max. compression Deflate down to 18.4MB.
I noticed I can get it down to 12.1MB using LZMA compression but
unfortunately there's no LZMA library for Windows Phone yet... but anyway.
zip (Deflate) uses a compression window of 32KB; bzip2 works on blocks of
up to 900KB; LZMA's dictionary at maximum settings can be gigabytes. The
bigger the window, the more long-range redundancy the compressor can exploit.
Post by Larry Knibb
I further noticed that if I .dump the SQL to create the database then
zip that, the resulting size is just 6.7MB;
A user on a mailing list I was on was trying to save space like you and
found the same pattern you did. A raw SQLite database has padding due to
page size and alignment, plus prefix bytes on values (e.g. giving
string/blob lengths), not to mention the btree structure for the records.

SQL text was more efficient because there is no padding/alignment/btree.
However, every line carries a redundant "insert into table values ..."
prefix, so CSV is more efficient still, using just one byte (the comma)
to mark the end of each value. His data contained large, somewhat
sequential numbers, which take more space as text (one byte per digit),
so the CSV was adapted to store the deltas between values/rows, each
usually only a digit or so. bzip2 was used on the final delta CSV to get
the smallest size. LZMA gained a few more percent over bzip2 but not
enough to bother with.
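
As a sketch of the delta idea - assuming the first CSV column holds the
large, roughly sequential numbers, and with all file names illustrative:

  awk -F, -v OFS=, 'NR==1 { prev=$1; print; next }
      { d=$1-prev; prev=$1; $1=d; print }' data.csv > deltas.csv

Reversing it on the device is the same loop with a running sum instead
of a difference.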

What works best for you will depend on what your data looks like, how much
it varies, the types etc. If there is a lot of text then you can also
look to the current best text compressors - see

http://mattmahoney.net/dc/text.html

You may also be able to save even more space by having the data a column
at a time instead of a row at a time, although it would require
considerably more temporary space to work out each row before
reconstructing the database.

Similarly you may be able to use a custom binary delta encoding that uses
fewer bits to encode each row.

Roger
Larry Knibb
2012-03-06 13:01:45 UTC
Hi Roger,

Fantastic - thanks for your comprehensive reply. Funnily enough, I had
already stripped out the INSERT INTO TABLE VALUES text to see if that
would make a difference, but it looked like both compression algorithms
were already eliminating pretty much all of that duplication. One thing
worth noting is that the 54MB of uncompressed SQL unzips lightning-fast
- seriously, ~1s on my bog-standard "home" laptop. Perhaps of use to
others?

CSV is the next logical step and gets me under 5MB with LZMA and just
a shade over 6MB with Deflate. FYI, BZip2 is a shade under 6MB, so not
a lot better than Deflate. My dataset is already fully denormalized
(although not sorted), and I'm using rowid instead of an ID column so
there's no sequential data either. But I do have a couple of large,
cross-referencing index tables (two of the largest), which I bet I can
squeeze down considerably by storing integer values instead of strings
in a "binary" CSV.

Now the only question is whether the db load (INSERT) performance of my
mobile app is actually better than the average download case... and for
that I'm going to have to do some real work instead of just playing
with sqlite3 and 7zip :)
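
If I do go that route, I assume the usual bulk-load treatment applies
- one big transaction, durability relaxed while building (safe here
only because the file can simply be re-downloaded):

  PRAGMA synchronous = OFF;      -- db is rebuildable, so OK
  PRAGMA journal_mode = MEMORY;
  BEGIN;
  -- ...all the INSERT statements...
  COMMIT;
  -- then the CREATE INDEX statements, after the data is in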

Cheers,
Larry