Discussion:
[sqlite] Knowing when not to use a R-Tree index, and clustering tables along spatial indexes
Andrea Aime
2017-12-29 18:59:12 UTC
Permalink
Hi,
I'm writing some software that can read data off GeoPackage (SQLite + rtree
+ standardized
set of metadata tables) as an alternative format for spatial databases,
like PostgreSql with the
PostGIS extension.

Now, when I use PostGIS the query plan optimizer checks the bbox provided
in the query
and verifies if using the spatial index is a good idea, or not. At
conferences I've been told
that the query has to be rather selective (e.g., retrieve less than 10% of
the data) in order
for the index to actually be used.

With SQLite R-Tree I'm using either a join with the index virtual table, or
a subquery
retrieving the ids from the rtree. Regardless, the query is basically
ordering SQLite
to use the index.
So I was wondering, is there any opportunity to run a blazing fast
pre-query against
the index that will tell me whether joining/subquerying into the rtree is
going to be a win, or not?

Also, while I'm here, in PostGIS there is an option to cluster a table
along the spatial
index, in order to reduce IO when the spatial index is the main access
driver (which is often
the case in geographic information systems). I looked at tables with no
rowids, but
it does not seem like a way to do it (spatial index not being suitable for
primary key).
Anything else that could be done here?

Cheers
Andrea
Clemens Ladisch
2017-12-30 09:30:31 UTC
Permalink
So I was wondering, is there any opportunity to run a blazing fast pre-query against
the index that will tell me whether joining/subquerying into the rtree is going to be a win, or not?
Each node in an R-tree index stores the coordinates of the leaf objects/child nodes;
see <https://stackoverflow.com/a/25242162/11654> for how to get this information.
To get the index extents, combine the extents of all entries in the root node.
Also, while I'm here, in PostGIS there is an option to cluster a table along the spatial
index, in order to reduce IO when the spatial index is the main access driver (which is often
the case in geographic information systems). I looked at tables with no rowids, but
it does not seem like a way to do it (spatial index not being suitable for primary key).
Anything else that could be done here?
For a static table, you could order the rows along a space-filling curve.
For a dynamic table, you would have to replace the R-tree with something else
(<https://en.wikipedia.org/wiki/Hilbert_R-tree>), which would no longer be portable.


Regards,
Clemens
Wolfgang Enzinger
2017-12-31 14:45:02 UTC
Permalink
Post by Andrea Aime
With SQLite R-Tree I'm using either a join with the index virtual table, or
a subquery
retrieving the ids from the rtree. Regardless, the query is basically
ordering SQLite
to use the index.
So I was wondering, is there any opportunity to run a blazing fast
pre-query against
the index that will tell me whether joining/subquerying into the rtree is
going to be a win, or not?
I had good results in a similar situation with this strategy:

First, query the overall extent of your data, like this:
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

Second, for every spatial query, calculate the size of the area in
question.

Then, with these two rectangles, you can calculate the LIKELIHOOD that a
particular record in your data is located within the requested area, i.e.
meets the spatial criteria.

Let SQLite know about that likelihood in a JOIN query, using the LIKELIHOOD
function (http://www.sqlite.org/lang_corefunc.html#likelihood). Also, if
possible, give LIKELIHOOD information to the query planner for any other
criteria used. SQLite will consider them.

HTH Wolfgang
Clemens Ladisch
2018-01-01 09:45:50 UTC
Permalink
Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;
This results in a full table scan. Instead of caching these values manually,
it would be a better idea to read them from the index:

SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;

(rtreenode() is undocumented; maybe you should use your own decoder.)
Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query
This does not appear to change anything with a virtual table:

CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);

SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)



Regards,
Clemens
Wolfgang Enzinger
2018-01-01 14:05:30 UTC
Permalink
Post by Clemens Ladisch
Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;
This results in a full table scan. Instead of caching these values manually,
SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
(rtreenode() is undocumented; maybe you should use your own decoder.)
Thanks, didn't know that, I'll look into it. You're right, my query results
in a full table scan, however it's pretty fast anyway - less than a second
with 160,000 rows and cold cache.
Post by Clemens Ladisch
Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query
CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
This is not surprising because only criteria concerning the "i" table are
in effect here. So it is clear that even a likelihood of 0.999999 is more
selective than a likelihood of 1.000 (= no filter criteria in this table).
However, if your query has criteria both in the "i" and the "t" table, it
can make a difference.

Of course, anybody correct me if I'm mistaken.

Happy new year!
Wolfgang
Clemens Ladisch
2018-01-01 15:20:21 UTC
Permalink
Post by Wolfgang Enzinger
Post by Clemens Ladisch
Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
...
This is not surprising because only criteria concerning the "i" table are
in effect here. So it is clear that even a likelihood of 0.999999 is more
selective than a likelihood of 1.000 (= no filter criteria in this table).
However, if your query has criteria both in the "i" and the "t" table, it
can make a difference.
It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?


Regards,
Clemens
Wolfgang Enzinger
2018-01-01 16:07:35 UTC
Permalink
Post by Clemens Ladisch
It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?
Try:

---

CREATE TABLE t(id INTEGER PRIMARY KEY,x INTEGER,y INTEGER,z INTEGER);
CREATE VIRTUAL TABLE i USING rtree(id,minx,maxx,miny,maxy);
CREATE INDEX t_x ON t(x);

---

EXPLAIN QUERY PLAN
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
AND i.minY>=35.00 AND i.maxY<=35.44, 0.999) -- 0.999
AND LIKELIHOOD(t.x=3, 0.001); -- 0.001

-> SEARCH TABLE t USING INDEX t_x (x=?)
-> SCAN TABLE i VIRTUAL TABLE INDEX 1:

---

SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
AND i.minY>=35.00 AND i.maxY<=35.44, 0.001) -- 0.001
AND LIKELIHOOD(t.x=3, 0.999); -- 0.999

-> SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B1D2B3
-> SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

---

Tested with SQLite 3.13.0 here, but IIRC newer versions behave the same.
Clemens Ladisch
2018-01-01 19:30:10 UTC
Permalink
Post by Wolfgang Enzinger
Post by Clemens Ladisch
It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.999) -- 0.999
AND LIKELIHOOD(t.x=3, 0.001); -- 0.001
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.001) -- 0.001
AND LIKELIHOOD(t.x=3, 0.999); -- 0.999
Sorry, this is not what I meant.

The original question was about manually removing the R-tree search
depending on the spatial window. So, do you have an example where
the query plan changes due to a difference in *only* the likelihood
of the R-tree search expression?


Regards,
Clemens
Wolfgang Enzinger
2018-01-01 19:48:57 UTC
Permalink
Post by Clemens Ladisch
Post by Wolfgang Enzinger
Post by Clemens Ladisch
It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.999) -- 0.999
AND LIKELIHOOD(t.x=3, 0.001); -- 0.001
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.001) -- 0.001
AND LIKELIHOOD(t.x=3, 0.999); -- 0.999
Sorry, this is not what I meant.
Seems like I didn't get the original question correctly, then.
Post by Clemens Ladisch
The original question was about manually removing the R-tree search
depending on the spatial window. So, do you have an example where
the query plan changes due to a difference in *only* the likelihood
of the R-tree search expression?
No.

Andrea Aime
2018-01-01 16:22:19 UTC
Permalink
Post by Wolfgang Enzinger
Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM
flst_shape_index;
This results in a full table scan. Instead of caching these values manually,
SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
(rtreenode() is undocumented; maybe you should use your own decoder.)
Intriguing! I'm looking at the output on my local data set (all Texas
roads, 3+ million records, large,
but not all that much in GIS systems):

SELECT rtreenode(2, data) FROM rtree_tiger_shp_geom_node WHERE nodeno = 1;
{26338 -100.598 -96.2836 25.8411 30.6962} {26337 -96.9349 -93.5195 28.598
31.1915} {43295 -96.9621 -93.6133 30.6486 33.9442} {62086 -106.644 -98.0079
28.887 32.1568} {80094 -97.6371 -96.4565 28.5382 33.9419} {97065 -98.5755
-97.2869 28.686 34.1405} {108645 -104.096 -98.3331 31.6511 36.5007}

So, if I'm guessing right, each set of numbers is a count of features and
then the rectangle intersecting them?
I've prepared a picture with the data:

https://drive.google.com/file/d/15WpyfWNKdVeBoLDDRCoCW2xx87T6qf_Z/view?usp=sharing

This query is indeed quite a bit faster and more interesting than scanning
the entire index virtual table, wondering,
is there also a way to get the "next level" of the rtree?
This one returns only a few records, I would not mind having a bit more in
memory for more accurate pre-checks (dabase
access can be optimized out fully if the requested rectangle falls outside
of the rtree boxes no?)

What worries me is that these functions are undocumented, and thus, I
imaging, unsupported and at risk
of being removed at any time. Do you have a feeling of how likely is this
to happen?
Post by Wolfgang Enzinger
Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query
The query is being generated after translation from another generic filter
language and at the moment it's
expressed as a sub-query (cause it can be negated, related with other
conditions by and/or, and so on).
I've tried to use it with little luck, the subquery is always run according
to explain plan:

sqlite> explain query plan SELECT "fid","geom" as "geom" FROM "tiger_shp"
WHERE likelihood("fid" IN (SELECT id FROM "rtree_tiger_shp_geom" r WHERE
r.maxx >= -98.41324970985764 AND r.minx <= -98.33594825643836 AND r.maxy >=
30.532285631231357 AND r.miny <= 30.595068029284644), *0.01*);
0|0|0|SEARCH TABLE tiger_shp USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE rtree_tiger_shp_geom AS r VIRTUAL TABLE INDEX 2:D1B0D3B2
sqlite> explain query plan SELECT "fid","geom" as "geom" FROM "tiger_shp"
WHERE likelihood("fid" IN (SELECT id FROM "rtree_tiger_shp_geom" r WHERE
r.maxx >= -98.41324970985764 AND r.minx <= -98.33594825643836 AND r.maxy >=
30.532285631231357 AND r.miny <= 30.595068029284644), *0.99*);
0|0|0|SEARCH TABLE tiger_shp USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE rtree_tiger_shp_geom AS r VIRTUAL TABLE INDEX 2:D1B0D3B2

Oh well, since the code is generating the SQL, it can directly omit the
subquery. Is there any literature on what a good threshold
would be?

Cheers
Andrea
Loading...