Discussion:
[sqlite] Increasing performance of query
Gabriele Lanaro
2017-04-19 18:50:19 UTC
Dear SQLite community,

I’m trying to increase the performance of a query to its maximum possible
speed.

The schema roughly consists of two tables: one is a full-text index,
fts5_table, while the other is called datatable.

The query in question is a full text query on the full text index joined
with the other table.

SELECT fts5_table.column, datatable.column2
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid
ORDER BY rank ASC
LIMIT 10;

Another query looks for counts of the same kind:

SELECT count(datatable.id)
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid;

The database is about 1 GB of text, evenly distributed among 100,000 rows.
To give an idea of the speeds obtained: I achieve about 40 ms for the first
query and 80 ms for the second.

Things I tried:

1) ANALYZE

Decreased the speed by orders of magnitude.

2) pragma cache_size=-256000;

Improved the performance considerably; increasing it beyond this value has
no further effect.

3) pragma mmap_size

I didn’t see an improvement

4) putting the database in a ramdisk

No improvement
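
(For reference, a rough sketch of how items 1)-3) above can be applied on a
connection; the cache_size value is the one mentioned, the mmap_size value
is only an example:)

-- 1) gather statistics for the query planner (stored in sqlite_stat1)
ANALYZE;

-- 2) page cache of roughly 256 MB (negative values are in KiB)
PRAGMA cache_size = -256000;

-- 3) memory-map up to 1 GB of the database file (example value only)
PRAGMA mmap_size = 1073741824;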

I was wondering which other strategies could be effective in increasing the
query speed.

I also noticed that the ORDER BY clause in the first query slows the query
down quite a bit (probably because the bm25 score has to be computed for
all matching rows in order to rank them). I was wondering if there are
other ways to prevent that.
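
(In case it helps: the plan chosen for the slow variant can be inspected by
prefixing the query with EXPLAIN QUERY PLAN, e.g.)

EXPLAIN QUERY PLAN
SELECT fts5_table.column, datatable.column2
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid
ORDER BY rank ASC
LIMIT 10;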

Thanks,

Gabriele
Simon Slavin
2017-04-20 12:52:51 UTC
Post by Gabriele Lanaro
SELECT count(datatable.id)
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid;
If datatable.id is never NULL, then "count(*)" should do the same thing and be faster.
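For example, keeping the same table names:

SELECT count(*)
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid;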
Post by Gabriele Lanaro
1) ANALYZE
Decreased the speed by orders of magnitude.
That’s never supposed to happen. Just to make sure I understand this: are you saying that you have a situation where ANALYZE made something slower? If so, have you kept a copy of the database from before you executed ANALYZE? There are some commands which will reveal what happened.

Given the efforts you’ve been to with other commands, you might want to try VACUUM.
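
For example (just one of the things worth checking), the statistics that
ANALYZE stores can be inspected directly:

-- statistics written by ANALYZE and used by the query planner
SELECT * FROM sqlite_stat1;

-- rebuild the database file, defragmenting tables and indexes
VACUUM;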

Simon.
R Smith
2017-04-20 18:25:57 UTC
Post by Gabriele Lanaro
Dear SQLite community,
I'm trying to increase the performance of a query to its maximum possible
speed.
The schema roughly consists of two tables: one is a full-text index,
fts5_table, while the other is called datatable.
The query in question is a full text query on the full text index joined
with the other table.
SELECT fts5_table.column, datatable.column2
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid
ORDER BY rank ASC
LIMIT 10;
SELECT count(datatable.id)
FROM fts5_table, datatable
WHERE fts5_table MATCH 'term'
  AND datatable.id = fts5_table.rowid;
The database is about 1 GB of text, evenly distributed among 100,000 rows.
To give an idea of the speeds obtained: I achieve about 40 ms for the first
query and 80 ms for the second.
Do you really mean "ms" as in milliseconds? Getting a result on an
aggregated or sorted query in under 100 milliseconds from a gigabyte-plus
table with 100K+ rows is what we would call: Very Fast.
Post by Gabriele Lanaro
1) ANALYZE
Decreased the speed by orders of magnitude.
Do you really mean "Decreased the speed" as in made it much much slower?

I'm going to assume for the sake of brevity that you meant "seconds" and
"Increased" so that there is real opportunity for improvement.

One strategy I would employ, if you always do those two queries
together, is to query into a temp table like this:

CREATE TEMP TABLE IF NOT EXISTS tmp_aq(
id INTEGER PRIMARY KEY,
col1 TEXT,
iRank INT
);

-- In case there are already values:
DELETE FROM tmp_aq;

INSERT INTO tmp_aq(id, col1, iRank)
SELECT rowid, column, rank
FROM fts5_table
WHERE fts5_table MATCH 'term';

-- The first query then becomes:
SELECT col1, datatable.column2
FROM tmp_aq
JOIN datatable ON datatable.id = tmp_aq.id
ORDER BY tmp_aq.iRank ASC
LIMIT 10;


-- If datatable has a 1-to-1 relationship with the fts table,
-- then this is enough for the second query:
SELECT count(*) FROM tmp_aq;

-- else you will still need the join:
SELECT count(*) FROM tmp_aq JOIN datatable ON datatable.id = tmp_aq.id;


An index on iRank may or may not improve things; you need to test it.
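If you want to try it, a one-line sketch (keeping the temp table name from
above):

CREATE INDEX IF NOT EXISTS temp.tmp_aq_rank ON tmp_aq(iRank);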


Cheers,
Ryan
Gabriele Lanaro
2017-04-21 08:34:37 UTC
Thanks for your answers!

First of all, my apologies about ANALYZE: I meant that it increased
performance (i.e. decreased query time) by orders of magnitude.

Before ANALYZE the query was in the order of seconds (about 2 seconds).
After ANALYZE the query was in the order of milliseconds.

I tried VACUUM too, but didn't improve query time.

I'm now trying to save the result to a temporary table (I set
temp_store=memory to avoid writing to disk). It looks like, by saving only
the rank, the query takes about 50 ms, which is a very good time.

Extracting the first 15 records ordered by rank then becomes very fast
thanks to an index. This is already very useful because it may happen that
we query the same term over and over to scroll through the results by rank.
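
(A rough sketch of what I mean; the table and column names are just
placeholders:)

PRAGMA temp_store = MEMORY;

CREATE TEMP TABLE IF NOT EXISTS tmp_rank(
    id INTEGER PRIMARY KEY,   -- rowid of the matched document
    iRank REAL                -- rank (bm25) reported by fts5
);
CREATE INDEX IF NOT EXISTS temp.tmp_rank_idx ON tmp_rank(iRank);

DELETE FROM tmp_rank;
INSERT INTO tmp_rank(id, iRank)
SELECT rowid, rank FROM fts5_table WHERE fts5_table MATCH 'term';

-- scrolling through the results by rank, e.g. the first 15:
SELECT datatable.column2
FROM tmp_rank
JOIN datatable ON datatable.id = tmp_rank.id
ORDER BY tmp_rank.iRank
LIMIT 15 OFFSET 0;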

Any other tips are always appreciated.

Thanks again!

Gabriele
Gabriele Lanaro
2017-04-26 15:57:01 UTC
Also, I have a question about the cache_size pragma.
If I run the query (this is regarding the count query), without first
running ANALYZE on the database, I can see that the times are affected by
the cache_size pragma as follows:

cache_size = 0
timing = 3.7 s

cache_size = -2000 (2000 kb)
timing = 2.0 s

cache_size = -1000000 (1000000 kb)
timing = 2.0 s

The size of the database is about 100 MB containing about 10000 rows. The
calls are all done from a freshly started database connection (but the
timings don't change if I repeat the query).

If, however, I run ANALYZE, the timings go down to the order of tens of
milliseconds, and cache_size doesn't have an effect (or at least not a
substantial one).

Are there any tips to make sure that the cache is used properly?

I also tried to put the database in a directory mounted in RAM and didn't
get an improvement in speed. (However, I wasn't able to load my dataset
into an in-memory database because, when using .dump, the fts5 table
doesn't get backed up correctly.)

Thanks,

Gabriele
Clemens Ladisch
2017-04-27 06:36:43 UTC
Post by Gabriele Lanaro
Are there any tips to make sure that the cache is used properly?
The default is 2000 pages (8 MB with the default page size), and in most
cases should be increased.

But what works best in your specific case depends on the hardware, the
OS, the software, the database schema, and the data. You have to test
and measure yourself.
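
For reference, both forms of the pragma (positive = pages, negative =
size in KiB):

PRAGMA cache_size;            -- read the current setting
PRAGMA cache_size = 100000;   -- 100000 pages
PRAGMA cache_size = -256000;  -- about 256 MB, independent of the page size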


Regards,
Clemens
