Discussion:
50% faster than 3.7.17
Richard Hipp
2014-09-20 01:14:17 UTC
Permalink
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago. That is to say, it does 50% more work using the same
number of CPU cycles.

This performance gain is over and above the query planner improvements that
have also been made. We are constantly looking for new ways to run queries
and adding those ways into the query planner. For example, in the previous
release, we added a new way to evaluate IN operators with non-constant
right-hand-sides that was reported on this mailing list to make some
queries run 5 times faster.

The 50% faster number above is not about better query plans. This is 50%
faster at the low-level grunt work of moving bits on and off disk and
searching b-trees. We have achieved this by incorporating hundreds of
micro-optimizations. Each micro-optimization might improve the performance
by as little as 0.05%. If we get one that improves performance by 0.25%,
that is considered a huge win. Each of these optimizations is unmeasurable
on a real-world system (we have to use cachegrind to get repeatable
run-times) but if you do enough of them, they add up.

A full 10% of the performance gain has come since the previous release.
There have been a lot of changes. All our tests pass, and we still have
100% branch test coverage, so we are confident that we didn't break too
much. But your testing is an important part of our quality process.
Please download a source archive or a DLL and give the latest alpha a
whirl, and let us know if you encounter any problems.

P.S.: Measurements were done using the "speedtest1 --size 5" workload on
Ubuntu 13.10 and gcc 4.8.1 with -Os. YMMV. Version 3.7.17 requires
1432835574 CPU cycles and the 3.8.7 alpha requires just 953861485 CPU
cycles, as measured by cachegrind.
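Those two cycle counts can be sanity-checked directly; a quick sketch of the arithmetic (the counts are the ones quoted above):

```python
# Verify the "50% more work using the same number of CPU cycles" claim.
cycles_3_7_17 = 1_432_835_574  # cachegrind cycle count for 3.7.17
cycles_3_8_7 = 953_861_485     # cachegrind cycle count for the 3.8.7 alpha

# The workload is fixed, so work-per-cycle scales as the inverse of the counts.
speedup = cycles_3_7_17 / cycles_3_8_7
print(f"{(speedup - 1) * 100:.1f}% more work per cycle")  # → 50.2% more work per cycle
```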
--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org
jose isaias cabrera
2014-09-20 01:22:14 UTC
Permalink
"Richard Hipp" wrote...
Post by Richard Hipp
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago. That is to say, it does 50% more work using the same
number of CPU cycles.
[...]
I don't know if folks have ever thanked you, Dr. Hipp, for this wonderful
gift to the world called SQLite. I have become a legend in my own world
with this tool. :-) I do have to say that I have used it since 2006, and
it has increased in speed every year. Thank you! Thank you! And in my own
native language, muchas gracias!

josé
Stephen Chrzanowski
2014-09-20 01:52:01 UTC
Permalink
I, as well, wish to thank you for this tool, Dr. Hipp. I've never
published a public application using this engine, but at my place of
employment, where my primary responsibility is just to monitor servers
worldwide, I've coded a few tidbit web and desktop applications that have
made my job SO much easier. Without this, my desktop apps would have to
rely on MySQL, which is massive overkill (resource-wise) for some of the
things I've needed to use it for.

Thanks again!

On Fri, Sep 19, 2014 at 9:22 PM, jose isaias cabrera <
Post by jose isaias cabrera
"Richard Hipp" wrote...
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago.
[...]
I don't know if folks have ever thanked you, Dr. Hipp, for this wonderful
gift to the world called SQLite.
[...]
josé
_______________________________________________
sqlite-users mailing list
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
t***@public.gmane.org
2014-09-20 17:34:00 UTC
Permalink
In trying to see if the new version breaks any of my queries, I ran several
of them repeatedly, and they all appear to have produced the expected
output.

The only thing I noticed which may be of interest with regard to speed
was (with .timer on) that although the first two run-time numbers (real &
user) were consistently smaller in 3.8.7 (compared to 3.8.6), the third
number (sys) was consistently higher (or the same on one occasion). I
guess the first number is the actual time (in seconds) it took to run the
query. I don't even know what the 2nd and 3rd numbers represent, or how
(or if) they relate to the first one. Is that increase in sys to be
expected?

A few examples from many more I tried that all follow the same pattern (same
query & database in each case):

3.8.6: Run Time: real 2.434 user 2.386815 sys 0.000000
3.8.7: Run Time: real 1.856 user 1.778411 sys 0.062400
---
3.8.6: Run Time: real 584.465 user 560.293192 sys 1.638011
3.8.7: Run Time: real 518.227 user 430.469159 sys 53.617544
---
3.8.6: Run Time: real 2.449 user 2.340015 sys 0.046800
3.8.7: Run Time: real 1.935 user 1.794012 sys 0.046800

(Thank you for two great solutions I use daily -- SQLite3 and Fossil)
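For reference, the three numbers reported by ".timer on" are: real = wall-clock time, user = CPU time spent running user-space code (SQLite itself), and sys = CPU time the kernel spends on the process's behalf (system calls, page faults). A small illustrative sketch of measuring them, with an arbitrary workload unrelated to SQLite:

```python
import os
import time

# The three numbers reported by ".timer on":
#   real - wall-clock time
#   user - CPU time spent running user-space code
#   sys  - CPU time the kernel spends on the process's behalf (syscalls, page faults)
start_wall = time.perf_counter()
start = os.times()

total = sum(i * i for i in range(500_000))   # user-space CPU work
with open(os.devnull, "wb") as f:            # kernel-side work: many small writes
    for _ in range(2_000):
        f.write(b"x" * 512)

end = os.times()
real = time.perf_counter() - start_wall
user = end.user - start.user
sys_time = end.system - start.system
print(f"Run Time: real {real:.3f} user {user:.6f} sys {sys_time:.6f}")
```

A rise in sys alongside a drop in real usually means some work moved from user space into the kernel (e.g. more write or mmap system calls), which need not indicate a regression as long as the total stays lower.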
Richard Hipp
2014-09-20 17:39:34 UTC
Permalink
Post by t***@public.gmane.org
In trying to see if the new version breaks any of my queries, I ran
several of them repeatedly, and they all appear to have produced the
expected output.
The only thing I noticed which may be of interest with regard to speed
was (with .timer on) that although the first two run-time numbers (real &
user) were consistently smaller in 3.8.7 (compared to 3.8.6), the third
number (sys) was consistently higher (or the same on one occasion). I
guess the first number is the actual time (in seconds) it took to run the
query. I don't even know what the 2nd and 3rd numbers represent, or how
(or if) they relate to the first one. Is that increase in sys to be
expected?
Thanks for the report. I think the increased system time is harmless, but
I want to investigate further to be sure. Except right now I'm preoccupied
with Mr. Xu's new bug. So please remind me next week if I don't bring this
up again. :-)

What OS are you using?

Can you share your database and test script with us?
Post by t***@public.gmane.org
A few examples from many more I tried that all follow the same pattern
[...]
--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org
Valentin Davydov
2014-09-22 12:29:16 UTC
Permalink
Post by Richard Hipp
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago. That is to say, it does 50% more work using the same
number of CPU cycles.
Are there any similar benchmarks with regard to disk I/O operations
rather than CPU? I mean especially random reads on cache misses.

Valentin Davydov.
Richard Hipp
2014-09-22 17:48:55 UTC
Permalink
Post by Valentin Davydov
Post by Richard Hipp
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17
release from 16 months ago. That is to say, it does 50% more work using
the same number of CPU cycles.
Are there any similar benchmarks with regard to disk I/O operations
rather than CPU? I mean especially random reads on cache misses.
No, there are no such benchmarks. On the other hand, we've been focused on
minimizing I/O in SQLite for the past decade. I think we have extracted
about as much as we can from that vein. But if you have any new ideas on
how we can further reduce the I/O, we'd love to hear from you.
--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org
Roger Binns
2014-09-24 02:23:44 UTC
Permalink
But if you have any new ideas on how we can further reduce the I/O, we'd love to hear from you.
The single biggest problem for me is defaults. SQLite supports
memory-mapped I/O, which has many advantages. The stat4 ANALYZE does a
really good job. WAL reduces write contention. These are all off by
default for various good reasons. But it also means that, by default, the
majority of developers are not getting the best performance SQLite
already has to offer, unless they happen to stumble on these. Some, like
stat4, are especially problematic since enabling them requires
recompilation.

It would be nice if the defaults were already the best settings. One
weak suggestion is to have a few profiles, selectable by pragma. The
profile would then set the various options, and future versions of SQLite
would update the profiles accordingly.

For example:

pragma profile='max_performance' -- turns on all above, ups caches etc
pragma profile='min_memory' -- tunes down everything related to memory

Various members of the SQLite consortium can then do things like:

pragma profile='firefox'
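No such profile pragma exists; the closest you can get today is issuing the individual pragmas on each connection yourself. A rough sketch of what a profile helper might look like (the function name, profile names, and specific values are all made up for illustration):

```python
import os
import sqlite3
import tempfile

def apply_profile(con, profile):
    # Hypothetical helper: SQLite has no "PRAGMA profile"; this just issues
    # the individual per-connection pragmas, with illustrative values.
    if profile == "max_performance":
        con.execute("PRAGMA journal_mode=WAL")      # better write concurrency
        con.execute("PRAGMA mmap_size=268435456")   # memory-mapped I/O, 256 MiB
        con.execute("PRAGMA cache_size=-65536")     # 64 MiB page cache
    elif profile == "min_memory":
        con.execute("PRAGMA mmap_size=0")           # no memory mapping
        con.execute("PRAGMA cache_size=-1024")      # 1 MiB page cache

path = os.path.join(tempfile.mkdtemp(), "app.db")
con = sqlite3.connect(path)
apply_profile(con, "max_performance")
mode = con.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # → wal
con.close()
```

stat4 is the exception: at the time of this thread SQLITE_ENABLE_STAT4 is a compile-time option, so no pragma bundle can switch it on.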

Roger
jose isaias cabrera
2014-09-24 13:13:59 UTC
Permalink
"Roger Binns" wrote...
Post by Roger Binns
The single biggest problem for me is defaults.
[...]
For example:
pragma profile='max_performance' -- turns on all above, ups caches etc
pragma profile='min_memory' -- tunes down everything related to memory
pragma profile='firefox'
This would be a nice set of options. In my case, I would set all
connections in our project to "max_performance", as that is what we need.
Just thinking out loud.

josé
Simon Slavin
2014-09-24 13:19:29 UTC
Permalink
This would be a nice set of options. In my case, I would set all connections in our project to "max_performance", as that is what we need. Just thinking out loud.
How much max is max? Are you willing to give up ACID? Are you willing to have your database corrupted if there's power loss?

Simon.
Roger Binns
2014-09-24 17:15:31 UTC
Permalink
Post by Simon Slavin
How much max is max ?
Not giving up ACID. But, for example, stat4 is better than the default
stat1, memory mapping (especially on 64-bit) is great, and so is WAL. All
are off by default.

If you want to give up ACID then you should really be on your own to look at
the various tradeoffs.

Roger
Stephen Chrzanowski
2014-09-25 07:15:45 UTC
Permalink
In the vein of configurability, and in a daydream I just had: it would
be nice (though probably not possible, as some compiler directives can't
be used at the same time) to have a single DLL/SO/whatever output from
the compiler with everything available, and then have if/then/else checks
(or whatever the C equivalent is) in the source decide which
functionality runs during query processing. Via a configuration file, or
via a pragma generated from a configuration, feature sets could then be
set at run time.

One fault with this would be the issue of compiled size. I realize we're
in an age where we generally talk megabytes instead of kilobytes for
main-line compiled code, but I'm sure there are platforms out there using
SQLite that have to fit in the kilobyte range and need specially compiled
sources.

The advantage of this is that I wouldn't have to compile X DLLs for Y
programs to get the results I need; I'd just have one configuration file
kicking around that I can adjust as needed. It'd also help with
fine-tuning, to validate whether I need certain features turned on or
off.
Post by Roger Binns
Post by Simon Slavin
How much max is max ?
Not giving up ACID. But for example stat4 is better than the default stat1.
Memory mapping (especially on 64 bit) is great. So is WAL. All are off by
default.
If you want to give up ACID then you should really be on your own to look at
the various tradeoffs.
Roger
jose isaias cabrera
2014-09-30 15:04:14 UTC
Permalink
"Simon Slavin" wrote...
Post by Simon Slavin
Post by jose isaias cabrera
This would be a nice set of options. In my case, I would set all
connections in our project to "max_performance", as that is what we
need. Just thinking out loud.
How much max is max? Are you willing to give up ACID? Are you willing
to have your database corrupted if there's power loss?
OK, back to reality... ;-) No, I will not. Good questions, Simon.
big stone
2014-09-22 17:43:30 UTC
Permalink
Hi,

This 3.8.7 alpha release seems to bring about a 5% win over 3.8.6 on my
particular SQL test case.

Question: "PRAGMA threads=2" didn't bring any speedup on my machine with
2 physical cores.

Did I miss a compilation option?
Richard Hipp
2014-09-22 17:46:39 UTC
Permalink
Post by big stone
Hi,
This 3.8.7 alpha release seems to bring about a 5% win over 3.8.6 on my
particular SQL test case.
Question: "PRAGMA threads=2" didn't bring any speedup on my machine with
2 physical cores.
Did I miss a compilation option?
The multi-threaded sort should be on by default. Probably you are just
not doing a big enough sort to make it worthwhile to start any new
threads. The "threads=2" pragma sets an upper limit on the number of
threads; there is no guarantee that SQLite will use that many.
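A sketch of the behaviour described here: the pragma is only a cap, and builds that lack the feature silently ignore the unknown pragma, so it is safe to issue unconditionally (the table and data below are arbitrary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Cap, not a command: SQLite decides whether spawning helper threads pays off.
# (Builds without multi-threaded sort support ignore the unrecognized pragma.)
con.execute("PRAGMA threads=2")

con.execute("CREATE TABLE t(x)")
con.executemany("INSERT INTO t VALUES (?)",
                ((i * 37 % 100_000,) for i in range(100_000)))

# Only a sort roughly this large or bigger is a candidate for helper
# threads; small ORDER BYs run single-threaded no matter what the cap is.
rows = [r[0] for r in con.execute("SELECT x FROM t ORDER BY x")]
print(len(rows), rows[0], rows[-1])  # → 100000 0 99999
```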
--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org
big stone
2014-09-22 18:04:29 UTC
Permalink
OK,

Nearly all the time is spent in one big CTE select.

So maybe this sort of ugly CTE query is not threadable.

WITH
  f0_k AS (
    SELECT f.rowid AS nof2, f.*, p.Dn, p.g1, p.g2, p.w, p.other
    FROM F0 AS f, Pt AS p
    WHERE f.Io = p.Io
  ),
  F0_mcalc AS (
    SELECT f.*, p.*,
           (CASE WHEN Priority = 999 THEN Cg ELSE Source END) AS Sourcef
    FROM F0_k AS f, Sor AS p
    WHERE p.omg IN (f.Cg, '')
      AND f.Slic BETWEEN Be AND En
      AND p.PtGroup IN (f.Io, f.g1, f.g2, f.w, f.other,
                        'PL', 'SO', 'ST', 'Not_found')
  ),
  F0_mcalcmin AS (
    SELECT nof2, min(priority) AS minp
    FROM F0_mcalc
    GROUP BY nof2
  ),
  F0_mcalc_final AS (
    SELECT f.*, round(qty * coefficient, 3) AS qtyf
    FROM F0_mcalcmin AS fm
    INNER JOIN F0_mcalc AS f
      ON f.nof2 = fm.nof2 AND f.priority = fm.minp
  )
SELECT * FROM F0_mcalc_final;
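One way to see whether a statement even has a sort step for the helper threads to parallelize is EXPLAIN QUERY PLAN: look for a temp b-tree used for GROUP BY, ORDER BY, or UNION. A sketch with a stand-in table, since the real schema isn't shown in the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE F0(Io, Cg, Slic, qty)")  # stand-in for the real schema

# A GROUP BY with no usable index forces a sort step (a temp b-tree),
# which is the kind of work the multi-threaded sorter can take on.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT Io, min(qty) FROM F0 GROUP BY Io"
).fetchall()
for row in plan:
    print(row[-1])  # plan detail text
```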
Donald Shepherd
2014-09-23 06:33:10 UTC
Permalink
Are any of these improvements specifically in the area of the online backup
API, or are they more in the general running of SQLite?
Post by Richard Hipp
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago.
[...]
Richard Hipp
2014-09-23 10:25:47 UTC
Permalink
Post by Donald Shepherd
Are any of these improvements specifically in the area of the online backup
API, or are they more in the general running of SQLite?
The "speedtest1" benchmark does not use the backup API. However, many of
the performance improvements are broadly applicable, so you should see
some improvement in that API too.
--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org
David Woodhouse
2014-09-23 16:48:40 UTC
Permalink
Post by Richard Hipp
The latest SQLite 3.8.7 alpha version (available on the download page
http://www.sqlite.org/download.html) is 50% faster than the 3.7.17 release
from 16 months ago.
[...]
Please download a source archive or a DLL and give the latest alpha a
whirl, and let us know if you encounter any problems.
That looks really promising; thanks for all this work.

Tristan, you have a comprehensive set of benchmarks for Evolution's
addressbook; is it possible for someone else to run those, or would it
take more of your time to babysit them than to run them yourself?
--
dwmw2
David Woodhouse
2014-09-25 11:55:44 UTC
Permalink
David Woodhouse
2014-09-24 10:37:47 UTC
Permalink
Post by Richard Hipp
The 50% faster number above is not about better query plans.
Speaking of better query plans, though... here's a query which takes
about 1700ms on my data set, followed by a couple of optimisations which
seem like they might be generically useful, and would help it go 300
times faster:

/* This is an address book. The 'main' table holds the primary
information while the 'email_list' table holds email addresses
because each contact can have more than one email address */
PRAGMA case_sensitive_like=on;
CREATE TABLE main (uid text primary key, first_name text, last_name text);
INSERT INTO main VALUES('aaaa','alice', 'X');
INSERT INTO main VALUES('bbbb','bert', 'Y');
INSERT INTO main VALUES('cccc','alison', 'Z');
CREATE INDEX main_first_name_index on main (first_name);
CREATE INDEX main_last_name_index on main (last_name);
CREATE TABLE email_list (uid text, email text);
INSERT INTO email_list VALUES('aaaa','alice-***@public.gmane.org');
INSERT INTO email_list VALUES('aaaa','alice2-***@public.gmane.org');
INSERT INTO email_list VALUES('bbbb','albert-***@public.gmane.org');
CREATE INDEX email_uid_index on email_list (uid);
CREATE INDEX email_email_index on email_list (email);

SELECT DISTINCT main.uid
FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
OR main.first_name like 'al%'
OR main.last_name like 'al%';
/*
0|0|0|SCAN TABLE main USING INDEX sqlite_autoindex_main_1
0|1|1|SEARCH TABLE email_list USING INDEX email_uid_index (uid=?)
*/

The first optimisation is to realise that sometimes, operations on the
result of a JOIN can be resolved more efficiently by operating on the
original tables. Specifically, in this example, there's no point in
looking for a given first_name or last_name in the *joined* table when
we have indices on those fields in the original 'main' table and can do
it much faster by querying those and then combining the results.

So the original query becomes:

SELECT DISTINCT main.uid
FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
UNION SELECT main.uid
FROM main WHERE main.first_name LIKE 'al%'
OR main.last_name LIKE 'al%';
/*
1|0|0|SCAN TABLE main USING COVERING INDEX sqlite_autoindex_main_1
1|1|1|SEARCH TABLE email_list USING INDEX email_uid_index (uid=?)
2|0|0|SEARCH TABLE main USING INDEX main_first_name_index (first_name>? AND first_name<?)
2|0|0|SEARCH TABLE main USING INDEX main_last_name_index (last_name>? AND last_name<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)
*/

With that optimisation, the time taken on my data set is reduced to
about 1000ms. Which is quite a nice improvement from 1700ms but that's
just the first step...

The second optimisation is to realise that there's no point in that
being a LEFT JOIN any more, given that it's now only ever going to
return results where the right-hand side actually does exist. Perhaps
normally we'd blame the user for such a naïve query, but when coupled
with the first optimisation, spotting this one does actually
start to make sense. So now the query becomes:

SELECT DISTINCT main.uid
FROM main JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
UNION SELECT main.uid
FROM main WHERE main.first_name LIKE 'al%'
OR main.last_name LIKE 'al%';
/*
1|0|1|SEARCH TABLE email_list USING INDEX email_email_index (email>? AND email<?)
1|1|0|SEARCH TABLE main USING COVERING INDEX sqlite_autoindex_main_1 (uid=?)
2|0|0|SEARCH TABLE main USING INDEX main_first_name_index (first_name>? AND first_name<?)
2|0|0|SEARCH TABLE main USING INDEX main_last_name_index (last_name>? AND last_name<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)
*/

Now the query takes 6ms on my data set.
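The equivalence of the original query and the final rewrite can be checked mechanically against the toy schema above (the email addresses here are placeholders, since the originals are elided):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    PRAGMA case_sensitive_like=on;
    CREATE TABLE main (uid text primary key, first_name text, last_name text);
    INSERT INTO main VALUES('aaaa','alice','X'),('bbbb','bert','Y'),('cccc','alison','Z');
    CREATE TABLE email_list (uid text, email text);
    INSERT INTO email_list VALUES('aaaa','alice@example.org'),
                                 ('aaaa','alice2@example.org'),
                                 ('bbbb','albert@example.org');
    CREATE INDEX main_first_name_index ON main(first_name);
    CREATE INDEX main_last_name_index ON main(last_name);
    CREATE INDEX email_uid_index ON email_list(uid);
    CREATE INDEX email_email_index ON email_list(email);
""")

original = """
    SELECT DISTINCT main.uid
    FROM main LEFT JOIN email_list ON main.uid = email_list.uid
    WHERE email_list.email LIKE 'al%'
       OR main.first_name LIKE 'al%'
       OR main.last_name LIKE 'al%'
"""
rewritten = """
    SELECT main.uid
    FROM main JOIN email_list ON main.uid = email_list.uid
    WHERE email_list.email LIKE 'al%'
    UNION
    SELECT uid FROM main
    WHERE first_name LIKE 'al%' OR last_name LIKE 'al%'
"""
a = {r[0] for r in con.execute(original)}
b = {r[0] for r in con.execute(rewritten)}
assert a == b  # the UNION rewrite returns the same uid set
print(sorted(a))  # → ['aaaa', 'bbbb', 'cccc']
```

The rewrite is safe here because the OR condition on email_list.email already discards the NULL rows the LEFT JOIN would add, which is exactly the second optimisation's point.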

If this were a hard-coded query, of course it would make sense for me to
just do it this way in the client. But in my case — and I suspect it's
this way in a number of other real-world examples — we're actually just
representing a set of user-input search criteria in the SQL query. To
make this optimisation in the client would be distinctly non-trivial,
and it would be *really* nice if the sqlite query planner could do it.

Is that feasible?
--
dwmw2
Keith Medcalf
2014-09-24 12:13:51 UTC
Permalink
Would it not be more efficient to skip the join altogether, since all you want is the list of uids, and assuming that you have maintained the referential integrity of your database (mail_list(list_uid) references main(uid))?

SELECT list_uid
FROM mail_list
WHERE email LIKE 'al%'
UNION
SELECT uid
FROM main
WHERE first_name LIKE 'al%'
OR last_name LIKE 'al%';
David Woodhouse
2014-09-24 13:24:51 UTC
Permalink
Post by Keith Medcalf
Would it not be more efficient to skip the join altogether since all
you want is the list of uid's, and assuming that you have maintained
the referential integrity of your database mail_list(list_uid)
references main(uid)?
SELECT list_uid
FROM mail_list
WHERE email LIKE 'al%'
UNION
SELECT uid
FROM main
WHERE first_name LIKE 'al%'
OR last_name LIKE 'al%';
Yes, but only because I oversimplified. In fact my queries are actually
after *other* fields from the main table. It's just that I'd elided
those fields from the example.

For the sake of the example, please assume my queries were actually
'SELECT DISTINCT main.uid, main.first_name, main.last_name FROM …'

The real-life query that this is simplified *from* is discussed at
https://git.gnome.org/browse/evolution-data-server/commit/?id=5f9f5b52807
--
dwmw2
Keith Medcalf
2014-09-25 01:36:39 UTC
Permalink
Interesting. From that code you might want to try something like this:
SELECT uid, vcard, bdata
FROM folder_id
WHERE uid in ( select uid FROM email_list where value like 'p%'
union
select uid from folder_id where nickname LIKE 'p%'
union
select uid from folder_id where full_name LIKE 'p%'
union
select uid from folder_id where family_name LIKE 'p%'
union
select uid from folder_id where given_name LIKE 'p%'
union
select uid from folder_id where file_as LIKE 'p%'
);
Then having nocase indexes on the various search fields will all work as expected.
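A quick way to see the nocase point in action, with Python's sqlite3 and a toy version of the table (column and index names are illustrative; the exact EXPLAIN QUERY PLAN wording varies between SQLite versions): with the default case-insensitive LIKE, a 'p%' prefix match can only use an index on a column whose collation is NOCASE.

```python
# Compare query plans for a NOCASE column vs a default (BINARY) column.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE folder_id(uid TEXT PRIMARY KEY,
                       nickname TEXT COLLATE NOCASE,
                       full_name TEXT);          -- default BINARY collation
CREATE INDEX nickname_index ON folder_id(nickname);
CREATE INDEX full_name_index ON folder_id(full_name);
""")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

# The NOCASE column's index satisfies the prefix LIKE...
print(plan("SELECT uid FROM folder_id WHERE nickname LIKE 'p%'"))
# ...while the BINARY column falls back to scanning the table.
print(plan("SELECT uid FROM folder_id WHERE full_name LIKE 'p%'"))
```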
-----Original Message-----
Sent: Wednesday, 24 September, 2014 07:25
To: Keith Medcalf
Cc: General Discussion of SQLite Database
Subject: Re: [sqlite] 50% faster than 3.7.17
Post by Keith Medcalf
Would it not be more efficient to skip the join altogether since all
you want is the list of uid's, and assuming that you have maintained
the referential integrity of your database mail_list(list_uid)
references main(uid)?
SELECT list_uid
FROM mail_list
WHERE email LIKE 'al%'
UNION
SELECT uid
FROM main
WHERE first_name LIKE 'al%'
OR last_name LIKE 'al%';
Yes, but only because I oversimplified. In fact my queries are actually
after *other* fields from the main table. It's just that I'd elided
those fields from the example.
For the sake of the example, please assume my queries were actually
'SELECT DISTINCT main.uid, main.first_name, main.last_name FROM …'
The real-life query that this is simplified *from* is discussed at
https://git.gnome.org/browse/evolution-data-server/commit/?id=5f9f5b52807
--
dwmw2
David Woodhouse
2014-09-25 10:13:34 UTC
Permalink
Post by Keith Medcalf
SELECT uid, vcard, bdata
FROM folder_id
WHERE uid in ( select uid FROM email_list where value like 'p%'
union
select uid from folder_id where nickname LIKE 'p%'
union
select uid from folder_id where full_name LIKE 'p%'
union
select uid from folder_id where family_name LIKE 'p%'
union
select uid from folder_id where given_name LIKE 'p%'
union
select uid from folder_id where file_as LIKE 'p%'
);
Then having nocase indexes on the various search fields will all work as expected.
Yeah, that achieves the same speed.
I'm not sure it addresses the real problem though. It still only really
applies when the user's query (which we're translating to SQL) contains
only 'OR' and no 'AND' clauses. It doesn't help me translate a query in
the general case.
Now, I'm not *entirely* averse to having a special case for the 'all OR'
query — this particular query is the address autocompletion so it's
common and the user is actually waiting for it as they type. In fact, as
a proof of concept I've already *implemented* a hackish special case to
spot this case and basically submit a hand-crafted query instead of the
normal translation to SQL:
https://bugzilla.gnome.org/show_bug.cgi?id=699597#c19
The problem is that I don't *want* to have to have that special case.
This is just a query optimisation, which is something the query planner
is supposed to do. I don't *want* to implement this and other
optimisations in the client, just to trick *today's* sqlite query
planner into spotting the best way to do it. That's the Wrong Way™ to do
things.
There are two alternative approaches which *don't* seem as wrong.
Firstly, if there's a sane way to rewrite our translator so that it
naturally uses UNION for OR clauses, that might make sense. But to cope
with AND clauses, AFAICT the natural extension of that approach would be
to use nested selects ('SELECT ... FROM (SELECT ... FROM ...)') and then
we lose the use of indices for *those* cases, right¹? Tristan started a
thread about this 'nested
select' last year, which I picked up a couple of weeks ago. It didn't
seem like it was a viable strategy for the general case.
The second approach, and this is why I started this thread, is to 'fix'
the query planner so that that it can see for *itself* the best way to
implement a given query given the constraints.
I suggested a couple of specific optimisations which the query planner
might be able to make, which should hopefully have benefits wider than
just my own use case. Are those not viable?
--
dwmw2
¹ I do realise that in the special case of a single top-level AND such
that all its sub-clauses are necessary conditions, I can do something
nicer. But again, it's a special case and doesn't handle the general
case of nested AND and OR clauses. And it's still the *client* code
doing the job of the optimiser, spotting necessary vs. sufficient
conditions and pulling them out to the top level for more efficient
implementation.
David Woodhouse
2014-10-09 09:38:44 UTC
Permalink
Post by David Woodhouse
I suggested a couple of specific optimisations which the query planner
might be able to make, which should hopefully have benefits wider than
just my own use case. Are those not viable?
I'm preparing to commit a workaround to Evolution to avoid this issue,
and then move on with my life and forget about it.
Before I do, is it worth me rephrasing this as a 'suboptimal query plan'
bug report so it gets tracked and might get attention later? Or should I
just forget the idea of getting it fixed in sqlite?
--
dwmw2
Dan Kennedy
2014-10-09 10:41:27 UTC
Permalink
Post by David Woodhouse
Post by David Woodhouse
I suggested a couple of specific optimisations which the query planner
might be able to make, which should hopefully have benefits wider than
just my own use case. Are those not viable?
I'm preparing to commit a workaround to Evolution to avoid this issue,
and then move on with my life and forget about it.
Before I do, is it worth me rephrasing this as a 'suboptimal query plan'
bug report so it gets tracked and might get attention later? Or should I
just forget the idea of getting it fixed in sqlite?
Well, you could always create a patch...
I think I understand the second optimization. You're saying that given this:
SELECT DISTINCT main.uid
FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
SQLite should deduce that since the LIKE implies "email_list.email IS NOT NULL" the LEFT JOIN is equivalent to a regular JOIN. Which would allow SQLite to reorder the tables and perhaps come up with a more efficient query plan. Correct?

Seems like a reasonable idea.
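A toy demonstration of that deduction (invented schema, Python's sqlite3): since a NULL email can never satisfy the LIKE, the NULL-padded rows that distinguish a LEFT JOIN from a plain JOIN are always filtered out by the WHERE clause.

```python
# LEFT JOIN vs JOIN under a WHERE clause on the right-hand table:
# identical results once the filter rejects NULLs.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE main(uid INTEGER PRIMARY KEY);
CREATE TABLE email_list(uid INTEGER, email TEXT);
INSERT INTO main VALUES (1),(2),(3);
INSERT INTO email_list VALUES (1,'alice@example.com'),(3,NULL);
""")

q = """SELECT DISTINCT main.uid
       FROM main %s JOIN email_list ON main.uid = email_list.uid
       WHERE email_list.email LIKE 'al%%'"""

left  = db.execute(q % "LEFT").fetchall()
inner = db.execute(q % "").fetchall()
assert left == inner   # the LEFT is redundant here
```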
The first optimization might be trickier. With queries that feature a single table:
SELECT cols FROM tbl WHERE a=? OR b=? OR c=?
the planner may elect to run something very close to this:
SELECT cols FROM tbl WHERE a=?
UNION ALL
SELECT cols FROM tbl WHERE b=?
UNION ALL
SELECT cols FROM tbl WHERE c=?
However, after returning each row, we remember its PRIMARY KEY (either the rowid, or the real PK for WITHOUT ROWID tables), so that a row matching more than one term is returned only once. Similar transformations for individual loops within join queries are also possible.
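Emulated from the application side, with an invented single-table schema, the per-term scan with key-based de-duplication looks roughly like this:

```python
# Run one indexed query per OR term and suppress rows whose rowid has
# already been returned -- a sketch of the OR optimization described above.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tbl(a TEXT, b TEXT, c TEXT);
CREATE INDEX ia ON tbl(a); CREATE INDEX ib ON tbl(b); CREATE INDEX ic ON tbl(c);
INSERT INTO tbl VALUES ('x','y','z'), ('x','q','r'), ('m','y','z'), ('m','n','o');
""")

def or_query(a, b, c):
    """Emulates: SELECT rowid, * FROM tbl WHERE a=? OR b=? OR c=?"""
    seen, out = set(), []
    for col, val in (("a", a), ("b", b), ("c", c)):
        for row in db.execute(f"SELECT rowid, * FROM tbl WHERE {col} = ?", (val,)):
            if row[0] not in seen:        # remember each rowid / PRIMARY KEY
                seen.add(row[0])
                out.append(row)
    return sorted(out)

direct = sorted(db.execute("SELECT rowid, * FROM tbl WHERE a=? OR b=? OR c=?",
                           ("x", "y", "z")))
assert or_query("x", "y", "z") == direct
```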
However, with a JOIN query, we don't currently attempt this kind of transform. I think because we would have to create some kind of composite key to use in place of the PRIMARY KEY to avoid returning duplicates. I guess queries that have a DISTINCT clause don't have this problem. So it could in theory transform your query to:
SELECT DISTINCT
main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
UNION ALL
SELECT main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE main.first_name like 'al%'
UNION ALL
SELECT main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE main.last_name like 'al%';
Then the hypothetical optimization above could figure out that the first LEFT JOIN could just as easily be a JOIN.

And that the other two LEFT JOINs are not required at all due to the DISTINCT and the fact that the WHERE clauses on the sub-selects do not reference the joined table at all.
It seems like there are a few moving parts here. And none of these are trivial changes. So, good ideas that might show up in an SQLite release at some point, but it's not realistic to expect these optimizations to be implemented quickly. Unless, of course, you can propose a patch and they turn out to be simpler than they look.
Regards,
Dan.
David Woodhouse
2014-10-09 11:49:53 UTC
Permalink
Post by Dan Kennedy
SELECT DISTINCT main.uid
FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
SQLite should deduce that since the LIKE implies "email_list.email IS
NOT NULL"
(
Digression: SQLite *should* deduce the LIKE implies the IS NOT NULL
condition. My *actual* queries all have, for some bizarre
historical reason I haven't yet explored,
email_list.value IS NOT NULL AND email_list.value LIKE 'foo%'

You'd think that the former condition would end up optimised away,
but timings show that it clearly isn't. I wasn't going to bring that
up here since I think it is so clearly a stupid thing to be putting
in our query in the first place. But you just made me do it anyway.
)
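For what it's worth, the redundancy is easy to confirm with a toy table in Python's sqlite3: LIKE on a NULL operand yields NULL, which a WHERE clause treats as false, so the IS NOT NULL guard cannot change the rows returned.

```python
# 'x IS NOT NULL AND x LIKE ...' returns the same rows as 'x LIKE ...'
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(value TEXT)")
db.executemany("INSERT INTO t VALUES (?)", [("foobar",), (None,), ("other",)])

plain = db.execute("SELECT value FROM t WHERE value LIKE 'foo%'").fetchall()
guarded = db.execute(
    "SELECT value FROM t WHERE value IS NOT NULL AND value LIKE 'foo%'").fetchall()
assert plain == guarded   # the guard is semantically redundant
```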
Post by Dan Kennedy
the LEFT JOIN is equivalent to a regular JOIN. Which would allow
SQLite to reorder the tables and perhaps come up with a more efficient
query plan. Correct?
Indeed.
As I said, a glib response to this suggested optimisation in *isolation*
might be that the user shouldn't have asked for a LEFT JOIN at all if
they didn't need it. But of course when coupled with the other suggested
transforms, that glib response is a lot less valid.
Post by Dan Kennedy
The first optimization might be trickier.
...
Post by Dan Kennedy
I guess queries that have a DISTINCT clause don't have this problem.
Right. I think it only works with DISTINCT, but that's OK.
Post by Dan Kennedy
SELECT DISTINCT
main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE email_list.email LIKE 'al%'
UNION ALL
SELECT main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE main.first_name like 'al%'
UNION ALL
SELECT main.uid FROM main LEFT JOIN email_list ON main.uid = email_list.uid
WHERE main.last_name like 'al%';
Then the hypothetical optimization above could figure out that the
first LEFT JOIN could just as easily be a JOIN.
Right.
Post by Dan Kennedy
And that the other two LEFT JOINs are not required at all due to the
DISTINCT and the fact that the WHERE clauses on the sub-selects do not
reference the joined table at all.
Yeah, that's probably true. I'd assumed that we'd do it as part of the
"first optimisation" introducing the UNION but it's probably better
handled as a separate transform as you have it here.
Post by Dan Kennedy
It seems like there are a few moving parts here. And none of these are
trivial changes. So, good ideas that might show up in an SQLite
release at some point, but it's not realistic to expect these
optimizations to be implemented quickly.
TBH that's all I was really looking for; thanks. I *hate* it when people
silently work around limitations in my software without bringing them to
my attention, and I didn't want to be guilty of doing the same.
I certainly don't expect an instant fix; just knowing that it's on the
radar is perfectly sufficient. Now I can commit my hackish workaround to
rewrite certain special-case queries to optimise the output of the
current query planner for them, and still sleep at night :)
Thanks.
Post by Dan Kennedy
Unless, of course, you can propose a patch and they turn out to be
simpler than they look.
That would be interesting but realistically it's outside my capacity at
the moment. I am already *so* far into shaving this yak that I can
barely remember what I was actually intending to do in the first
place :)
--
David Woodhouse Open Source Technology Centre
David.Woodhouse-***@public.gmane.org Intel Corporation