[sqlite] Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

So I was wondering, is there any opportunity to run a blazing fast pre-query against
the index that will tell me whether joining/subquerying into the rtree is going to be a win, or not?

Each node in an R-tree index stores the coordinates of the leaf objects/child nodes;
see <https://stackoverflow.com/a/25242162/11654> for how to get this information.
To get the index extents, combine the extents of all entries in the root node.

Also, while I'm here, in PostGIS there is an option to cluster a table along the spatial
index, in order to reduce IO when the spatial index is the main access driver (which is often
the case in geographic information systems). I looked at tables with no rowids, but
it does not seem like a way to do it (spatial index not being suitable for primary key).
Anything else that could be done here?

For a static table, you could order the rows along a space-filling curve.
For a dynamic table, you would have to replace the R-tree with something else
(<https://en.wikipedia.org/wiki/Hilbert_R-tree>), which would no longer be portable.

Regards,
Clemens

Wolfgang Enzinger

2017-12-31 14:45:02 UTC

Post by Andrea Aime
With SQLite R-Tree I'm using either a join with the index virtual table, or
a subquery
retrieving the ids from the rtree. Regardless, the query is basically
ordering SQLite
to use the index.
So I was wondering, is there any opportunity to run a blazing fast
pre-query against
the index that will tell me whether joining/subquerying into the rtree is
going to be a win, or not?

I had good results in a similar situation with this strategy:

First, query the overall extent of your data, like this:
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

Second, for every spatial query, calculate the size of the area in
question.

Then, with these two rectangles, you can calculate the LIKELIHOOD that a
particular record in your data is located within the requested area, i.e.
meets the spatial criteria.

Let SQLite know about that likelihood in a JOIN query, using the LIKELIHOOD
function (http://www.sqlite.org/lang_corefunc.html#likelihood). Also, if
possible, give LIKELIHOOD information to the query planner for any other
criteria used. SQLite will consider them.

HTH Wolfgang

Clemens Ladisch

2018-01-01 09:45:50 UTC

Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

This results in a full table scan. Instead of caching these values manually,
it would be a better idea to read them from the index:

SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;

(rtreenode() is undocumented; maybe you should use your own decoder.)

Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query

This does not appear to change anything with a virtual table:

CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);

SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

Regards,
Clemens

Wolfgang Enzinger

2018-01-01 14:05:30 UTC

Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

This results in a full table scan. Instead of caching these values manually,
SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
(rtreenode() is undocumented; maybe you should use your own decoder.)

Thanks, didn't know that, I'll look into it. You're right, my query results
in a full table scan, however it's pretty fast anyway - less than a second
with 160,000 rows and cold cache.

Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query

CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
--EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
--EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

This is not surprising because only criteria concerning the "i" table are
in effect here. So it is clear that even a likelihood of 0.999999 is more
selective than a likelihood of 1.000 (= no filter criteria in this table).
However, if your query has criteria both in the "i" and the "t" table, it
can make a difference.

Of course, anybody correct me if I'm mistaken.

Happy new year!
Wolfgang

Clemens Ladisch

2018-01-01 15:20:21 UTC

Post by Wolfgang Enzinger
Let SQLite know about that likelihood in a JOIN query

SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
...

It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?

Regards,
Clemens

Wolfgang Enzinger

2018-01-01 16:07:35 UTC

Post by Clemens Ladisch
It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1). However, any likelihood on the
R-tree search expression still did not make any difference. Do you have
an example?

Try:

---

CREATE TABLE t(id INTEGER PRIMARY KEY,x INTEGER,y INTEGER,z INTEGER);
CREATE VIRTUAL TABLE i USING rtree(id,minx,maxx,miny,maxy);
CREATE INDEX t_x ON t(x);

---

EXPLAIN QUERY PLAN
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
AND i.minY>=35.00 AND i.maxY<=35.44, 0.999) -- 0.999
AND LIKELIHOOD(t.x=3, 0.001); -- 0.001

-> SEARCH TABLE t USING INDEX t_x (x=?)
-> SCAN TABLE i VIRTUAL TABLE INDEX 1:

---

SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
AND i.minY>=35.00 AND i.maxY<=35.44, 0.001) -- 0.001
AND LIKELIHOOD(t.x=3, 0.999); -- 0.999

-> SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B1D2B3
-> SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

---

Tested with SQLite 3.13.0 here, but IIRC newer versions behave the same.

Clemens Ladisch

2018-01-01 19:30:10 UTC

SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.999) -- 0.999
AND LIKELIHOOD(t.x=3, 0.001); -- 0.001
SELECT * FROM t INNER JOIN i USING(id)
WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.001) -- 0.001
AND LIKELIHOOD(t.x=3, 0.999); -- 0.999

Sorry, this is not what I meant.

The original question was about manually removing the R-tree search
depending on the spatial window. So, do you have an example where
the query plan changes due to a difference in *only* the likelihood
of the R-tree search expression?

Regards,
Clemens

Wolfgang Enzinger

2018-01-01 19:48:57 UTC

Sorry, this is not what I meant.

Seems like I didn't get the original question correctly, then.

Post by Clemens Ladisch
The original question was about manually removing the R-tree search
depending on the spatial window. So, do you have an example where
the query plan changes due to a difference in *only* the likelihood
of the R-tree search expression?

No.

Andrea Aime

2018-01-01 16:22:19 UTC

Post by Wolfgang Enzinger
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM

flst_shape_index;
This results in a full table scan. Instead of caching these values manually,
SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
(rtreenode() is undocumented; maybe you should use your own decoder.)

Intriguing! I'm looking at the output on my local data set (all Texas
roads, 3+ million records, large,
but not all that much in GIS systems):

SELECT rtreenode(2, data) FROM rtree_tiger_shp_geom_node WHERE nodeno = 1;
{26338 -100.598 -96.2836 25.8411 30.6962} {26337 -96.9349 -93.5195 28.598
31.1915} {43295 -96.9621 -93.6133 30.6486 33.9442} {62086 -106.644 -98.0079
28.887 32.1568} {80094 -97.6371 -96.4565 28.5382 33.9419} {97065 -98.5755
-97.2869 28.686 34.1405} {108645 -104.096 -98.3331 31.6511 36.5007}

So, if I'm guessing right, each set of numbers is a count of features and
then the rectangle intersecting them?
I've prepared a picture with the data:

https://drive.google.com/file/d/15WpyfWNKdVeBoLDDRCoCW2xx87T6qf_Z/view?usp=sharing

This query is indeed quite a bit faster and more interesting than scanning
the entire index virtual table, wondering,
is there also a way to get the "next level" of the rtree?
This one returns only a few records, I would not mind having a bit more in
memory for more accurate pre-checks (dabase
access can be optimized out fully if the requested rectangle falls outside
of the rtree boxes no?)

What worries me is that these functions are undocumented, and thus, I
imaging, unsupported and at risk
of being removed at any time. Do you have a feeling of how likely is this
to happen?