slowish R*Tree performance

Discussion:

Eric Rubin-Smith

2014-06-15 04:25:12 UTC

I am exploring a mechanism for finding the longest covering IPv6 prefix for
a given IPv6 address by leveraging SQLite 3.8.5's R*Tree feature. I'm
getting pretty bad performance and would like to know if I'm doing
something obviously wrong.

A 1-dimensional R*Tree of integers of width 128 bits would be ideal. But
SQLite R*Trees are only 32 bits wide per dimension.

So I am modeling IPv6 prefixes as a pair of coordinates in 5-dimensional
integer space:

CREATE VIRTUAL TABLE ipIndex USING rtree_i32(
id, -- Integer primary key
minD1, maxD1,
minD2, maxD2,
minD3, maxD3,
minD4, maxD4,
minD5, maxD5
);

CREATE TABLE routeTarget(
id INTEGER PRIMARY KEY,
prefix TEXT NOT NULL,
target TEXT NOT NULL
);

To do this, I have come up with a mapping from an IPv6 prefix to a pair of
coordinates that has the geometric property that we would like: bounding
box 1 contains bounding box 2 if and only if prefix 1 contains prefix 2.
So if a search for bounding boxes containing a particular address's
coordinate returns rows, then each of those rows corresponds to a covering
prefix -- from there, we must simply find the smallest bounding box to find
the longest-matching prefix.

Functionally, this appears to work like a charm.

Performance, unfortunately, leaves much to be desired. I have seeded this
database with 100k randomly-generated prefixes, and am only able to push
through 250 searches per second. I am hoping to increase the database size
to ~50m records and would like to hit 50k searches per second.

This is not an unreasonable request based on my hardware, and SQLite has
easily hit this throughput (at least, this order of magnitude in
throughput) in other applications. For example, the analogue in IPv4
merely requires a 1-dimensional R*Tree, and with that I was able to hit
10kTPS throughput without breaking much of a sweat.

Here is my search query plan:

sqlite> explain query plan SELECT prefix, target FROM routeTarget WHERE id
= (
...> SELECT id FROM ipIndex
...> WHERE minD1 <= 1220818432 and 1220818432 <= maxD1
...> AND minD2 <= 2120561472 and 2120561472 <= maxD2
...> AND minD3 <= 1685398080 and 1685398080 <= maxD3
...> AND minD4 <= 1685755328 and 1685755328 <= maxD4
...> AND minD5 <= 538331072 and 538331072 <= maxD5
...> ORDER BY ((maxD5-minD5)*(maxD4-minD4)*(maxD3-minD3)*
...> (maxD2-minD2)*(maxD1-minD1)) ASC
...> LIMIT 1);
0|0|0|SEARCH TABLE routeTarget USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SCAN TABLE ipIndex VIRTUAL TABLE INDEX 2:B0D1B2D3B4D5B6D7B8D9
1|0|0|USE TEMP B-TREE FOR ORDER BY

I created a profiled SQLite build, and here are the top offenders for a run
on 1000 searches:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
40.00 1.58 1.58 300179 0.01 0.01 sqlite3VdbeExec
6.33 1.83 0.25 36628944 0.00 0.00 sqlite3VdbeMemMove
6.08 2.07 0.24 18314472 0.00 0.00 rtreeColumn
4.05 2.23 0.16 1665952 0.00 0.00 rtreeStepToLeaf
2.41 2.33 0.10 55549722 0.00 0.00 sqlite3VdbeMemRelease
2.28 2.42 0.09 1965104 0.00 0.00
sqlite3BtreeMovetoUnpacked
2.03 2.50 0.08 20187085 0.00 0.00
rtreeNodeOfFirstSearchPoint
1.90 2.57 0.08 19981424 0.00 0.00
sqlite3VtabImportErrmsg
1.77 2.64 0.07 1663952 0.00 0.00 sqlite3BtreeDelete
1.52 2.70 0.06 5297026 0.00 0.00 sqlite3VdbeSerialGet
1.52 2.76 0.06 1665952 0.00 0.00 btreeMoveto
1.27 2.81 0.05 29969136 0.00 0.00 numericType
1.27 2.86 0.05 22092688 0.00 0.00 sqlite3DbFree
1.27 2.91 0.05 16649521 0.00 0.00 sqlite3_result_int
1.27 2.96 0.05 5295009 0.00 0.00 moveToRoot
1.27 3.01 0.05 1844945 0.00 0.00 nodeAcquire
1.27 3.06 0.05 1665952 0.00 0.00 insertCell
1.27 3.11 0.05 1663952 0.00 0.00 dropCell
1.01 3.15 0.04 21214814 0.00 0.00 sqlite3_free
0.89 3.19 0.04 3335904 0.00 0.00 pager_write
0.76 3.22 0.03 9991712 0.00 0.00 sqlite3VdbeSerialType
0.76 3.25 0.03 3335904 0.00 0.00 sqlite3PagerWrite
0.76 3.28 0.03 903468 0.00 0.00 pcache1Fetch

I believe I have avoided the common pitfalls: everything is within a single
transaction, my cache_size pragma is large, etc.

I note from SQLite trace callbacks that a particular explicit search
results in a great many implicit SQL calls into the underlying R*Tree
tables. Three hundred or so per search. I assume this is expected? It
seems pretty large. They all look like this:

-- SELECT data FROM 'main'.'ipIndex_node' WHERE nodeno = :1
-- SELECT data FROM 'main'.'ipIndex_node' WHERE nodeno = :1
{etc, ~300 times}

I don't believe an R*Tree should be this deep; it makes me worry that
something about the data I am providing is causing the tree to degenerate
to a linear structure. Or something.

The code reads search keys from stdin, runs the query on them, writes the
results, and repeats:

if (sqlite3_exec(pDb, "BEGIN;", NULL, NULL, NULL)!=SQLITE_OK)
{
fprintf(stderr, "could not initiate transaction.\n");
return 1;
}
const char* zSearch =
"SELECT prefix, target FROM routeTarget WHERE id = (\n"
" SELECT id FROM ipIndex\n"
" WHERE minD1 <= $1 and $1 <= maxD1\n"
" AND minD2 <= $2 and $2 <= maxD2\n"
" AND minD3 <= $3 and $3 <= maxD3\n"
" AND minD4 <= $4 and $4 <= maxD4\n"
" AND minD5 <= $5 and $5 <= maxD5\n"
" ORDER BY ((maxD5-minD5)*(maxD4-minD4)*(maxD3-minD3)*\n"
" (maxD2-minD2)*(maxD1-minD1)) ASC \n"
" LIMIT 1"
");";
rc = sqlite3_prepare_v2(pDb, zSearch, strlen(zSearch),
&pSearch, NULL);
{snip}
uint32_t coord[5];
while (fgets(zLine, sizeof(zLine), stdin)!=NULL)
{
struct in6_addr addr;
size_t len = strlen(zLine);
if (zLine[len-1]=='\n') {zLine[len-1] = 0;}
if (inet_pton(AF_INET6, zLine, &addr)!=1)
{
fprintf(stderr, "Ignoring bad ip address \"%s\"\n", zLine);
continue;
}
addr2coord(addr, coord);
for (int i=0;i<5;i++)
{
rc = sqlite3_bind_int(pSearch, i+1, coord[i]);
assert(rc==SQLITE_OK);
}
rc = sqlite3_step(pSearch);
while (rc!=SQLITE_DONE)
{
switch(rc) {
case SQLITE_ROW:
const unsigned char* zPrefix, *zTarget;
zPrefix = sqlite3_column_text(pSearch, 0);
zTarget = sqlite3_column_text(pSearch, 1);
printf("%s --> %s %s\n", zLine, zPrefix, zTarget);
rc = sqlite3_step(pSearch);
break;
case SQLITE_DONE:
// no rows.
break;
case SQLITE_BUSY:
fprintf(stderr, "couldn't get lock\n");
return 1;
break;
default:
assert(0);
}
}
sqlite3_reset(pSearch);
struct timeval now;
gettimeofday(&now, NULL);
if (((now.tv_sec*1000+now.tv_usec/1000)-
(lastTxnTime.tv_sec*1000 + lastTxnTime.tv_usec/1000))>1000)
{
if (sqlite3_exec(pDb, "END;", NULL, NULL, NULL)!=SQLITE_OK ||
sqlite3_exec(pDb, "BEGIN;", NULL, NULL, NULL)!=SQLITE_OK)
{
fprintf(stderr, "could not initiate transaction.\n");
return 1;
}
lastTxnTime = now;
}
}
{snip}

SQLite build params:

gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\"
-DPACKAGE_VERSION=\"3.8.5\" -DPACKAGE_STRING=\"sqlite\ 3.8.5\"
-DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\"
-DPACKAGE=\"sqlite\" -DVERSION=\"3.8.5\" -DSTDC_HEADERS=1
-DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1
-DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"
-DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1
-DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_READLINE=1
-DHAVE_POSIX_FALLOCATE=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1
-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_RTREE -g -O2 -c sqlite3.c

OS:

Linux <anonymized> 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT
2012 x86_64 x86_64 x86_64 GNU/Linux
[root@<anonymized> sqlite-autoconf-3080500]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m

[root@<anonymized> sqlite-autoconf-3080500]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
cpu MHz : 2400.016
cache size : 12288 KB
physical id : 1
siblings : 8
core id : 0
cpu cores : 4
apicid : 32
initial apicid : 32
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dts
tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.03
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
{times 16 -- 4 cores * 2 hyperthreads * 2 chips}

Schema:

sqlite> select * from sqlite_master;
table|ipIndex|ipIndex|0|CREATE VIRTUAL TABLE ipIndex USING rtree_i32(
id, -- Integer primary key
minD1, maxD1,
minD2, maxD2,
minD3, maxD3,
minD4, maxD4,
minD5, maxD5
)
table|ipIndex_node|ipIndex_node|2|CREATE TABLE "ipIndex_node"(nodeno
INTEGER PRIMARY KEY, data BLOB)
table|ipIndex_rowid|ipIndex_rowid|3|CREATE TABLE "ipIndex_rowid"(rowid
INTEGER PRIMARY KEY, nodeno INTEGER)
table|ipIndex_parent|ipIndex_parent|4|CREATE TABLE "ipIndex_parent"(nodeno
INTEGER PRIMARY KEY, parentnode INTEGER)
table|routeTarget|routeTarget|5|CREATE TABLE routeTarget(
id INTEGER PRIMARY KEY,
prefix TEXT NOT NULL,
target TEXT NOT NULL
)
index|routeTargetIdxPrefix|routeTarget|6|CREATE INDEX routeTargetIdxPrefix
ON routeTarget(prefix)

Is there anything that I am doing obviously wrong?

Thanks!

Eric

Richard Hipp

2014-06-15 10:51:43 UTC

Permalink

Post by Eric Rubin-Smith
sqlite> explain query plan SELECT prefix, target FROM routeTarget WHERE id
= (
...> SELECT id FROM ipIndex
...> WHERE minD1 <= 1220818432 and 1220818432 <= maxD1
...> AND minD2 <= 2120561472 and 2120561472 <= maxD2
...> AND minD3 <= 1685398080 and 1685398080 <= maxD3
...> AND minD4 <= 1685755328 and 1685755328 <= maxD4
...> AND minD5 <= 538331072 and 538331072 <= maxD5
...> ORDER BY ((maxD5-minD5)*(maxD4-minD4)*(maxD3-minD3)*
...> (maxD2-minD2)*(maxD1-minD1)) ASC
...> LIMIT 1);
0|0|0|SEARCH TABLE routeTarget USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SCAN TABLE ipIndex VIRTUAL TABLE INDEX 2:B0D1B2D3B4D5B6D7B8D9
1|0|0|USE TEMP B-TREE FOR ORDER BY

What does this query return?

SELECT count(*) FROM ipIndex
WHERE minD1 <= 1220818432 and 1220818432 <= maxD1
AND minD2 <= 2120561472 and 2120561472 <= maxD2
AND minD3 <= 1685398080 and 1685398080 <= maxD3
AND minD4 <= 1685755328 and 1685755328 <= maxD4
AND minD5 <= 538331072 and 538331072 <= maxD5;

--
D. Richard Hipp
drh-CzDROfG0BjIdnm+***@public.gmane.org

Eric Rubin-Smith

2014-06-15 13:47:57 UTC

Permalink

Post by Richard Hipp
What does this query return?
SELECT count(*) FROM ipIndex
WHERE minD1 <= 1220818432 and 1220818432 <= maxD1
AND minD2 <= 2120561472 and 2120561472 <= maxD2
AND minD3 <= 1685398080 and 1685398080 <= maxD3
AND minD4 <= 1685755328 and 1685755328 <= maxD4
AND minD5 <= 538331072 and 538331072 <= maxD5;

Hm, it returns 1645. This indicates a bug (the max expected value is
128). I'm now highly suspicious of my mathematical reasoning or my
code. I'll take a look. Thanks, Richard!

Eric

Eric Rubin-Smith

2014-06-15 16:21:24 UTC

Permalink

Post by Eric Rubin-Smith

Hm, it returns 1645. This indicates a bug (the max expected value is
128). I'm now highly suspicious of my mathematical reasoning or my
code. I'll take a look. Thanks, Richard!

Follow-up for those who are curious. My program for randomly populating
the database was creating a bunch of identical bounding boxes for
short-length prefixes (i.e. prefixes corresponding to large bounding
boxes). This made the R*Tree do a bunch of redundant work. Eliminating
this issue led to a ~30x throughput improvement to ~6k searches per second
on a database with 100k prefixes in it.

After populating the database with 5.7 million such prefixes, we are at a
throughput of about 2.2kTPS. Not horrible, and not sure I can expect much
more out of SQLite -- but still not good enough for my use case
(unfortunately). Any further optimization tips are highly welcome. In the
mean time, I'm going to keep digging.

Thanks again to Richard for pointing me in the right direction.

Eric

Simon Slavin

2014-06-15 22:26:56 UTC

Permalink

Post by Eric Rubin-Smith
still not good enough for my use case
(unfortunately). Any further optimization tips are highly welcome.

Strongly suspect that although R*Trees produce an elegant solution to your problem, the fact that they're a general case tool will make them too slow to use for something like this.

I propose an alternative solution, though I have not tried it and do not have time to try it (sorry).

1) A function which turns a numeric IP address or a block into some standard easy-to-process representation in string form. Possibly a pair of strings with the first string being an address the second being something indicating the extent of the block, perhaps something like '2001:0db8:8500:0000:0000:0000:0000:0000vffff:ffff:ff00:0000:0000:0000:0000:0000'. You could make it shorter by leaving out the colons but my experience is that although this leads to less data stored on disk it doesn't speed up processing by much. But if you have a great deal of data you might want to do it anyway.

2) A comparator function (maybe a SQLite function, maybe not) which takes two such addresses or blocks and returns a value indicating whether they're identical, whether block A contains block or address B, or neither.

The closest I got to the above was when I needed a program which intensively searched and sorted individual IPv4 addresses. I got best results by defining a SQLite function which converted IP addresses of all formats into 'standard format' where each byte was two hex digits. All values stored in the database were stored in my 'standard' format. This allowed easy collation using standard text sorting. Everything else turned out faster to implement in my own programming language than it was to build as SQLite functions.

Simon.

Carlos Ferreira

2014-06-16 09:47:58 UTC

Permalink

Regarding the R.Tree performance problem,

What is the original problem that is causing slow performance in the SQlite
R-Tree implementation?

Thanks

Carlos

-----Original Message-----
From: sqlite-users-bounces-CzDROfG0BjIdnm+***@public.gmane.org
[mailto:sqlite-users-bounces-CzDROfG0BjIdnm+***@public.gmane.org] On Behalf Of Simon Slavin
Sent: domingo, 15 de Junho de 2014 23:27
To: General Discussion of SQLite Database
Subject: Re: [sqlite] slowish R*Tree performance

Post by Eric Rubin-Smith
still not good enough for my use case
(unfortunately). Any further optimization tips are highly welcome.

Strongly suspect that although R*Trees produce an elegant solution to your
problem, the fact that they're a general case tool will make them too slow
to use for something like this.

I propose an alternative solution, though I have not tried it and do not
have time to try it (sorry).

1) A function which turns a numeric IP address or a block into some standard
easy-to-process representation in string form. Possibly a pair of strings
with the first string being an address the second being something indicating
the extent of the block, perhaps something like
'2001:0db8:8500:0000:0000:0000:0000:0000vffff:ffff:ff00:0000:0000:0000:0000:
0000'. You could make it shorter by leaving out the colons but my
experience is that although this leads to less data stored on disk it
doesn't speed up processing by much. But if you have a great deal of data
you might want to do it anyway.

2) A comparator function (maybe a SQLite function, maybe not) which takes
two such addresses or blocks and returns a value indicating whether they're
identical, whether block A contains block or address B, or neither.

The closest I got to the above was when I needed a program which intensively
searched and sorted individual IPv4 addresses. I got best results by
defining a SQLite function which converted IP addresses of all formats into
'standard format' where each byte was two hex digits. All values stored in
the database were stored in my 'standard' format. This allowed easy
collation using standard text sorting. Everything else turned out faster to
implement in my own programming language than it was to build as SQLite
functions.

Simon.

Simon Slavin

2014-06-16 11:44:18 UTC

Permalink

Post by Carlos Ferreira
What is the original problem that is causing slow performance in the SQlite
R-Tree implementation?

Read earlier posts to this thread.

Simon.

Eric Rubin-Smith

2014-06-18 18:46:34 UTC

Permalink

Post by Carlos Ferreira
Regarding the R.Tree performance problem,
What is the original problem that is causing slow performance in the
SQlite R-Tree implementation?

I was populating my DB with bad data. In particular, I was first
choosing a random prefix length, then filling up that number of bits to
create a random prefix. I was then just blindly sticking that into the
database.

For very short prefix lengths, the chance of an exact collision was
therefore very high. E.g. if prefix length 0 is chosen, then the
chances of collision are 100%. And prefix length 0 is chosen 1% of the
time.

Collisions on the large bounding boxes are exactly what you don't want,
because of the way an R*Tree works. (Collisions in small bounding boxes
only matter for the rare searches that happen to fall within them.)

Fixed by checking for the existence of an identical bounding box, and
removing it if it does exist. (This population prototype code might
eventually feed production code, which will find it more useful to have
INSERT OR REPLACE semantics than INSERT OR IGNORE semantics.)

The above, however, raises an interesting point. Even when exact
collisions are detected and avoided, the above randomized scheme does
create a significant "matrioshka" structure that may not be present
in real-world datasets. That is, again, there is a 1/128 chance that
length 0 is chosen. There is an 8/128 or 6% chance that a prefix of
length <= 8 is chosen. I.e. we are creating a lot of large bounding
boxes that likely cover smaller ones, and that may or may not reflect
reality.

--
Eric A. Rubin-Smith

Computer programs don't like being anthropomorphized.

Eric Rubin-Smith

2014-06-18 17:32:07 UTC

Permalink

Post by Simon Slavin
Strongly suspect that although R*Trees produce an elegant solution to
your problem, the fact that they're a general case tool will make them too
slow to use for something like this.
I propose an alternative solution, though I have not tried it and do not
have time to try it (sorry).
1) A function which turns a numeric IP address or a block into some
standard easy-to-process representation in string form. Possibly a
pair of strings with the first string being an address the second being
something indicating the extent of the block, perhaps something like
'2001:0db8:8500:0000:0000:0000:0000:0000vffff:ffff:ff00:0000:0000:0000:0000:0000'.
You could make it shorter by leaving out the colons but my experience is
that although this leads to less data stored on disk it doesn't speed up
processing by much. But if you have a great deal of data you might want
to do it anyway.
2) A comparator function (maybe a SQLite function, maybe not) which
takes two such addresses or blocks and returns a value indicating whether
they're identical, whether block A contains block or address B, or
neither.
The closest I got to the above was when I needed a program which
intensively searched and sorted individual IPv4 addresses. I got best
results by defining a SQLite function which converted IP addresses of all
formats into 'standard format' where each byte was two hex digits. All
values stored in the database were stored in my 'standard' format. This
allowed easy collation using standard text sorting.

Thanks for the suggestion, Simon.

My use case is intensive on the search side, but will incur occasional
updates to the structure. No sorting necessary from an application
perspective. Perhaps I am being dense, but don't see how your
representation eases the burden of longest-prefix matching from within
SQL queries.

Post by Simon Slavin
Everything else turned out faster to implement in my own programming
language than it was to build as SQLite functions.

Yeah, I agree that the performance of a dedicated data structure will be
far better. Again, just wondering if I can stretch SQLite to solve this
problem, because it would be oh-so-nice to leverage all the other stuff
SQLite gives us.

--
Eric A. Rubin-Smith

:wq

Carlos Ferreira

2014-06-16 13:11:01 UTC

Permalink

Hi,

I am just a user of binary Sqlite. So, I do not know how much of custom
Sqlite tricks can be made with a copy of sqlite source code.

However, and if I understood well the problem is like this:

1 - There a data type named IPV6 Address.
2 - there is a table where this data type must be in. ( can be a set of
fields, one blob, one string ...)

You want to:

Given a certain IPV6, find in the database the existent IPV6 record with the
longest equal prefix to the given IPV6 value.

Again, if the problem is as I understood, the simplest solution is ( again I
am discussing it as if it could be implemented or available in SQLite..I do
not know..)

1 - encode the IPV6 field as a unique blob
2 - Implement an index to this field so that this particular field can be
ordered
3 - Make sure that the ordering compare function is a binary incremental
compare counting from the left ( in terms of the data...not sure how you
will represent it in the field )
4 - Each time you want to find the closed and longest prefix, you just need
to simulate an insert command.
Try to insert the given value. Instead of inserting, just return the ordered
position where it would be inserted. Check what is the record actually
standing in that ordered position...That would be your result!!

This can also be done outside Sqlite...but not sure if my understanding of
the problem is correct.

Regards,

Carlos

-----Original Message-----
From: sqlite-users-bounces-CzDROfG0BjIdnm+***@public.gmane.org
[mailto:sqlite-users-bounces-CzDROfG0BjIdnm+***@public.gmane.org] On Behalf Of Eric Rubin-Smith
Sent: domingo, 15 de Junho de 2014 05:25
To: General Discussion of SQLite Database
Subject: [sqlite] slowish R*Tree performance

I am exploring a mechanism for finding the longest covering IPv6 prefix for
a given IPv6 address by leveraging SQLite 3.8.5's R*Tree feature. I'm
getting pretty bad performance and would like to know if I'm doing something
obviously wrong.

A 1-dimensional R*Tree of integers of width 128 bits would be ideal. But
SQLite R*Trees are only 32 bits wide per dimension.

So I am modeling IPv6 prefixes as a pair of coordinates in 5-dimensional
integer space:

CREATE VIRTUAL TABLE ipIndex USING rtree_i32(
id, -- Integer primary key
minD1, maxD1,
minD2, maxD2,
minD3, maxD3,
minD4, maxD4,
minD5, maxD5
);

CREATE TABLE routeTarget(
id INTEGER PRIMARY KEY,
prefix TEXT NOT NULL,
target TEXT NOT NULL
);

To do this, I have come up with a mapping from an IPv6 prefix to a pair of
coordinates that has the geometric property that we would like: bounding box
1 contains bounding box 2 if and only if prefix 1 contains prefix 2.
So if a search for bounding boxes containing a particular address's
coordinate returns rows, then each of those rows corresponds to a covering
prefix -- from there, we must simply find the smallest bounding box to find
the longest-matching prefix.

Functionally, this appears to work like a charm.

Performance, unfortunately, leaves much to be desired. I have seeded this
database with 100k randomly-generated prefixes, and am only able to push
through 250 searches per second. I am hoping to increase the database size
to ~50m records and would like to hit 50k searches per second.

This is not an unreasonable request based on my hardware, and SQLite has
easily hit this throughput (at least, this order of magnitude in
throughput) in other applications. For example, the analogue in IPv4 merely
requires a 1-dimensional R*Tree, and with that I was able to hit 10kTPS
throughput without breaking much of a sweat.

Here is my search query plan:

sqlite> explain query plan SELECT prefix, target FROM routeTarget WHERE
sqlite> id
= (
...> SELECT id FROM ipIndex
...> WHERE minD1 <= 1220818432 and 1220818432 <= maxD1
...> AND minD2 <= 2120561472 and 2120561472 <= maxD2
...> AND minD3 <= 1685398080 and 1685398080 <= maxD3
...> AND minD4 <= 1685755328 and 1685755328 <= maxD4
...> AND minD5 <= 538331072 and 538331072 <= maxD5
...> ORDER BY ((maxD5-minD5)*(maxD4-minD4)*(maxD3-minD3)*
...> (maxD2-minD2)*(maxD1-minD1)) ASC
...> LIMIT 1);
0|0|0|SEARCH TABLE routeTarget USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SCAN TABLE ipIndex VIRTUAL TABLE INDEX 2:B0D1B2D3B4D5B6D7B8D9 USE
1|0|0|TEMP B-TREE FOR ORDER BY

I created a profiled SQLite build, and here are the top offenders for a run
on 1000 searches:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
40.00 1.58 1.58 300179 0.01 0.01 sqlite3VdbeExec
6.33 1.83 0.25 36628944 0.00 0.00 sqlite3VdbeMemMove
6.08 2.07 0.24 18314472 0.00 0.00 rtreeColumn
4.05 2.23 0.16 1665952 0.00 0.00 rtreeStepToLeaf
2.41 2.33 0.10 55549722 0.00 0.00 sqlite3VdbeMemRelease
2.28 2.42 0.09 1965104 0.00 0.00
sqlite3BtreeMovetoUnpacked
2.03 2.50 0.08 20187085 0.00 0.00
rtreeNodeOfFirstSearchPoint
1.90 2.57 0.08 19981424 0.00 0.00
sqlite3VtabImportErrmsg
1.77 2.64 0.07 1663952 0.00 0.00 sqlite3BtreeDelete
1.52 2.70 0.06 5297026 0.00 0.00 sqlite3VdbeSerialGet
1.52 2.76 0.06 1665952 0.00 0.00 btreeMoveto
1.27 2.81 0.05 29969136 0.00 0.00 numericType
1.27 2.86 0.05 22092688 0.00 0.00 sqlite3DbFree
1.27 2.91 0.05 16649521 0.00 0.00 sqlite3_result_int
1.27 2.96 0.05 5295009 0.00 0.00 moveToRoot
1.27 3.01 0.05 1844945 0.00 0.00 nodeAcquire
1.27 3.06 0.05 1665952 0.00 0.00 insertCell
1.27 3.11 0.05 1663952 0.00 0.00 dropCell
1.01 3.15 0.04 21214814 0.00 0.00 sqlite3_free
0.89 3.19 0.04 3335904 0.00 0.00 pager_write
0.76 3.22 0.03 9991712 0.00 0.00 sqlite3VdbeSerialType
0.76 3.25 0.03 3335904 0.00 0.00 sqlite3PagerWrite
0.76 3.28 0.03 903468 0.00 0.00 pcache1Fetch

I believe I have avoided the common pitfalls: everything is within a single
transaction, my cache_size pragma is large, etc.

I note from SQLite trace callbacks that a particular explicit search results
in a great many implicit SQL calls into the underlying R*Tree tables. Three
hundred or so per search. I assume this is expected? It seems pretty
large. They all look like this:

-- SELECT data FROM 'main'.'ipIndex_node' WHERE nodeno = :1
-- SELECT data FROM 'main'.'ipIndex_node' WHERE nodeno = :1 {etc, ~300
times}

I don't believe an R*Tree should be this deep; it makes me worry that
something about the data I am providing is causing the tree to degenerate to
a linear structure. Or something.

The code reads search keys from stdin, runs the query on them, writes the
results, and repeats:

if (sqlite3_exec(pDb, "BEGIN;", NULL, NULL, NULL)!=SQLITE_OK)
{
fprintf(stderr, "could not initiate transaction.\n");
return 1;
}
const char* zSearch =
"SELECT prefix, target FROM routeTarget WHERE id = (\n"
" SELECT id FROM ipIndex\n"
" WHERE minD1 <= $1 and $1 <= maxD1\n"
" AND minD2 <= $2 and $2 <= maxD2\n"
" AND minD3 <= $3 and $3 <= maxD3\n"
" AND minD4 <= $4 and $4 <= maxD4\n"
" AND minD5 <= $5 and $5 <= maxD5\n"
" ORDER BY ((maxD5-minD5)*(maxD4-minD4)*(maxD3-minD3)*\n"
" (maxD2-minD2)*(maxD1-minD1)) ASC \n"
" LIMIT 1"
");";
rc = sqlite3_prepare_v2(pDb, zSearch, strlen(zSearch),
&pSearch, NULL);
{snip}
uint32_t coord[5];
while (fgets(zLine, sizeof(zLine), stdin)!=NULL)
{
struct in6_addr addr;
size_t len = strlen(zLine);
if (zLine[len-1]=='\n') {zLine[len-1] = 0;}
if (inet_pton(AF_INET6, zLine, &addr)!=1)
{
fprintf(stderr, "Ignoring bad ip address \"%s\"\n", zLine);
continue;
}
addr2coord(addr, coord);
for (int i=0;i<5;i++)
{
rc = sqlite3_bind_int(pSearch, i+1, coord[i]);
assert(rc==SQLITE_OK);
}
rc = sqlite3_step(pSearch);
while (rc!=SQLITE_DONE)
{
switch(rc) {
case SQLITE_ROW:
const unsigned char* zPrefix, *zTarget;
zPrefix = sqlite3_column_text(pSearch, 0);
zTarget = sqlite3_column_text(pSearch, 1);
printf("%s --> %s %s\n", zLine, zPrefix, zTarget);
rc = sqlite3_step(pSearch);
break;
case SQLITE_DONE:
// no rows.
break;
case SQLITE_BUSY:
fprintf(stderr, "couldn't get lock\n");
return 1;
break;
default:
assert(0);
}
}
sqlite3_reset(pSearch);
struct timeval now;
gettimeofday(&now, NULL);
if (((now.tv_sec*1000+now.tv_usec/1000)-
(lastTxnTime.tv_sec*1000 + lastTxnTime.tv_usec/1000))>1000)
{
if (sqlite3_exec(pDb, "END;", NULL, NULL, NULL)!=SQLITE_OK ||
sqlite3_exec(pDb, "BEGIN;", NULL, NULL, NULL)!=SQLITE_OK)
{
fprintf(stderr, "could not initiate transaction.\n");
return 1;
}
lastTxnTime = now;
}
}
{snip}

SQLite build params:

gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\"
-DPACKAGE_VERSION=\"3.8.5\" -DPACKAGE_STRING=\"sqlite\ 3.8.5\"
-DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\"
-DPACKAGE=\"sqlite\" -DVERSION=\"3.8.5\" -DSTDC_HEADERS=1
-DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1
-DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"
-DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1
-DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_READLINE=1
-DHAVE_POSIX_FALLOCATE=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1
-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_RTREE -g -O2 -c sqlite3.c

OS:

Linux <anonymized> 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT
2012 x86_64 x86_64 x86_64 GNU/Linux
[root@<anonymized> sqlite-autoconf-3080500]# cat /etc/issue Red Hat
Enterprise Linux Server release 6.3 (Santiago) Kernel \r on an \m

[root@<anonymized> sqlite-autoconf-3080500]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
cpu MHz : 2400.016
cache size : 12288 KB
physical id : 1
siblings : 8
core id : 0
cpu cores : 4
apicid : 32
initial apicid : 32
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc
aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dts
tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.03
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
{times 16 -- 4 cores * 2 hyperthreads * 2 chips}

Schema:

sqlite> select * from sqlite_master;
table|ipIndex|ipIndex|0|CREATE VIRTUAL TABLE ipIndex USING rtree_i32(
id, -- Integer primary key
minD1, maxD1,
minD2, maxD2,
minD3, maxD3,
minD4, maxD4,
minD5, maxD5
)
table|ipIndex_node|ipIndex_node|2|CREATE TABLE "ipIndex_node"(nodeno
INTEGER PRIMARY KEY, data BLOB)
table|ipIndex_rowid|ipIndex_rowid|3|CREATE TABLE "ipIndex_rowid"(rowid
INTEGER PRIMARY KEY, nodeno INTEGER)
table|ipIndex_parent|ipIndex_parent|4|CREATE TABLE
table|"ipIndex_parent"(nodeno
INTEGER PRIMARY KEY, parentnode INTEGER)
table|routeTarget|routeTarget|5|CREATE TABLE routeTarget(
id INTEGER PRIMARY KEY,
prefix TEXT NOT NULL,
target TEXT NOT NULL
)
index|routeTargetIdxPrefix|routeTarget|6|CREATE INDEX
index|routeTargetIdxPrefix|routeTarget|6|routeTargetIdxPrefix
ON routeTarget(prefix)

Is there anything that I am doing obviously wrong?

Thanks!

Eric

Eric Rubin-Smith

2014-06-18 17:18:34 UTC

Permalink

1 - There a data type named IPV6 Address. 2 - there is a table where
this data type must be in. ( can be a set of fields, one blob, one string
...)
Given a certain IPV6, find in the database the existent IPV6 record with
the longest equal prefix to the given IPV6 value.

Not quite. Perhaps you were confused by my (probably unclear) use of
the word "prefix".

The data structure contains a set of IPv6 *network prefixes*. A prefix
is the first N bits of an IPv6 address and is denoted as an IP address
with a suffix of (128-N) bits of zeros, along with the length of the
prefix:

feed:beef::/32

(here N==32).

An IPv6 address is inside this prefix iff its first 32 bits are equal to
feed:beef.

Note that one prefix can be contained within another:

feed:beef:f00d::/48 is fully contained within feed:beef::/32. We say
that feed:beef:f00d::/48 is "more specific" than feed:beef::/32.

The problem (one that comes up quite often, e.g. in routers) is this:
given an IP address, find the longest-length prefix containing the
address. I.e. find the most specific prefix containing the address.

This is different from the problem you stated. One key thing to note is
that the two prefixes feed:beef::/32 and feed:beef::/48 are different,
even though the bits in the address parts are identical.

Again, if the problem is as I understood, the simplest solution is (
again I am discussing it as if it could be implemented or available in
SQLite..I do not know..)
1 - encode the IPV6 field as a unique blob 2 - Implement an index to
this field so that this particular field can be ordered 3 - Make sure that
the ordering compare function is a binary incremental compare counting
from the left ( in terms of the data...not sure how you will represent
it in the field ) 4 - Each time you want to find the closed and longest
prefix, you just need to simulate an insert command. Try to insert the
given value. Instead of inserting, just return the ordered position where
it would be inserted. Check what is the record actually standing in that
ordered position...That would be your result!!

The problem has many solutions outside of SQL. The most common data
structure is a "trie" (pronounced "tree" and short for "reTRIEval
tree"), though it turns out that in many subsets of this problem space
other data structures turn out to be more efficient.

The present motivation, however, is to see if we can leverage all
the ancillary sexiness of SQLite while still getting reasonably fast
searches within this problem space. Turns out we can do pretty darn
well with SQLite in this regard.

The key part is coming up with an isomorphism between overlapping
IPv6 prefixes and the overlapping geometric boxes represented in a
5-dimensional R*Tree. Not just any isomorphism, but one with the
property that for prefixes P1 and P2, P1 contains P2 if and only if
bbox(P1) fully contains bbox(P2). (It follows that the volume of
bbox(P1) is larger than the volume of bbox(P2), so you can sort by the
volumes of all the covering bboxes to find the most specific prefix.
Though my SQL had a bug in that regard, so treat it with care if you use
it.:-)

Again, the 5 dimensions are only needed because the R*Tree's integers
are only 32 bits wide. If they were 128 bits wide, you could just
represent an IPv6 prefix as an interval on the line segment [0,
2^128-1], and store those intervals in a 1-dimensional R*Tree (which
works great for IPv4, btw).

--
Eric A. Rubin-Smith

I still maintain the point that designing a monolithic kernel in
1991 is a fundamental error. Be thankful you are not my student.
You would not get a high grade for such a design.
-- Andrew Tanenbaum, to Linus Torvalds

Cory Nelson

2014-06-18 18:44:56 UTC

Permalink

Post by Simon Slavin

1 - There a data type named IPV6 Address. 2 - there is a table where
this data type must be in. ( can be a set of fields, one blob, one

string

...)
Given a certain IPV6, find in the database the existent IPV6 record with
the longest equal prefix to the given IPV6 value.

The phrase you're looking for here is "CIDR block". The way I'd handle this
is something like this:

Expand the prefix into the full feed:beef:0000:etc

Insert into a table (start binary(16), mask_length int)

select top 1 binary,length from table where start <= @input order by binary
desc

Check if the row is inside the range returned. This will take a single
index seek.

--
Cory Nelson
http://int64.org

Eric Rubin-Smith

2014-06-18 19:29:37 UTC

Permalink

Post by Cory Nelson
The phrase you're looking for here is "CIDR block".

Well, I was avoiding the phrase on purpose :-). I was worried that
using another bit of jargon -- one that is even more opaque than
"prefix" to someone unfamiliar with the space -- did not seem likely to
help get the idea across. But since you and this forum probably do not
have a burning interest in the minutiae of my flawed writing process, I
press on.

Post by Cory Nelson
Expand the prefix into the full feed:beef:0000:etc
Insert into a table (start binary(16), mask_length int)
binary desc
Check if the row is inside the range returned. This will take a single
index seek.

Um. This looks, wow, much simpler and better than the R*Tree trick.

I guess the only question is whether the binary search into the
(traditional) index will cost more than the R*Tree traversal. In a set
of 10m records we expect to bounce 23 times in a traditional index, if
my math is right. Not sure how that compares to the R*Tree.

I'll see if I can get an apples-to-apples performance comparison
going (and will reply back with the results, in case folks are still
interested).

Thank you!

--
Eric A. Rubin-Smith

I'm just glad it'll be Clark Gable who's falling on his face and
not Gary Cooper.
-- Gary Cooper on his decision not to take the leading role in
"Gone With The Wind."

Eric Rubin-Smith

2014-06-18 20:00:39 UTC

Permalink

Looking at this again, I do not think the solution is correct. E.g.
assume you have populated 10m prefixes, one of which is ::/0. Assume
the search key is ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff and ::/0
is the only covering prefix. Then your scheme will not find ::/0. And
simple extensions of your scheme involve searching through the whole set
to find ::/0.

--
Eric A. Rubin-Smith

Money is the root of all wealth.

Cory Nelson

2014-06-18 20:05:41 UTC

Permalink

Post by Eric Rubin-Smith

Correct, this will not work with overlapping blocks.

--
Cory Nelson
http://int64.org