[sqlite] Article: UUID or GUID as Primary Keys? Be Careful!

Discussion:

Simon Slavin

2017-06-09 22:05:19 UTC

Tangential to SQLite, but there’s little on the list at the moment so perhaps some of you might like this.

<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>

Several of his points don’t apply to SQLite, which works differently from most SQL engines, but it’s interesting reading nevertheless.

(I find it ironic that the URLs for his posts are composed of two elements: an assigned piece of content and an arbitrary long number coded as hexadecimal.)

Simon.

Jens Alfke

2017-06-10 04:27:56 UTC

Permalink

Post by Simon Slavin
Tangential to SQLite, but there’s little on the list at the moment so perhaps some of you might like this.
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439 <https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>>

He makes some questionable points, like saying that an ASCII string of hex has a “9x cost in size” compared to a binary representation, or that hex strings would somehow get larger when converted from ISO-8859-1 to UTF-8.

Post by Simon Slavin
Several of his points don’t apply to SQLite, which works differently from most SQL engines, but it’s interesting reading nevertheless.

Most of what he says is extremely RDBMS-centric, even though he never says so. I have the feeling he’s never used a NoSQL database. To me the whole thing comes off as pretty parochial — I have to laugh at his assertion that UUIDs aren’t scalable, since in systems like Couchbase Server*, which handles ridiculously huge data sets, it’s extremely common to use them as keys.

From a highly-scalable perspective, having a single global counter in the database to assign consecutive integer keys is a horrifying bottleneck!

Post by Simon Slavin
(I find it ironic that the URLs for his posts are composed of two elements: an assigned piece of content and an arbitrary long number coded as hexadecimal.)

Blame that on Medium, which is hosting his blog posts :)

—Jens

* Disclaimer: I work for Couchbase (but on mobile software, not on servers.)

Darko Volaric

2017-06-10 06:02:14 UTC

Permalink

And between centrally assigned "consecutive integer keys" and elaborate, probabilistic UUIDs are centrally allocated number ranges of a natively supported integer, eg 64 bit on SQLite. Problem solved.

Post by Jens Alfke

Post by Simon Slavin
Several of his points don’t apply to SQLite, which works differently from most SQL engines, but it’s interesting reading nevertheless.

Most of what he says is extremely RDBMS-centric, even though he never says so. I have the feeling he’s never used a NoSQL database. To me the whole thing comes off as pretty parochial — I have to laugh at his assertion that UUIDs aren’t scalable, since in systems like Couchbase Server*, which handles ridiculously huge data sets, it’s extremely common to use them as keys.
From a highly-scalable perspective, having a single global counter in the database to assign consecutive integer keys is a horrifying bottleneck!

Post by Simon Slavin
(I find it ironic that the URLs for his posts are composed of two elements: an assigned piece of content and an arbitrary long number coded as hexadecimal.)

Blame that on Medium, which is hosting his blog posts :)
—Jens
* Disclaimer: I work for Couchbase (but on mobile software, not on servers.)
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

R Smith

2017-06-10 12:13:48 UTC

Permalink

Post by Jens Alfke

Just in case people wonder about your assertion that his assertion is
wrong - to be specific: UTF8 consumes the exact same space as ASCII when
you use only characters from the first block (0..7F) - which is exactly
what a UUID uses, at a maximum characters 0..9, A..F or a..f, -, {, and
}. Nothing outside of the first Unicode block - though it should be
noted that some systems may use Double-byte (16 bit) character
representations internally whenever the DB table text *storage* type is
set to UTF-anything (which should be noted is not the same thing as the
*DB-Interface* type being UTF-anything).

So the blogger's point doesn't hold on that assertion.

Further, a UUID/GUID as per the standard (RFC4122) consists of 128 bit
value formatted to present like this 36-character sequence:
xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx

where A is a variant and B is the version of the UUID represented.
Variants define different methods of calculation, like whether the MAC
address with a time component was used, or a Domain/Namespace based UUID
etc. In DB systems we usually use variant 1 (MAC+Time with 100
nanosecond precision) which, unless mechanical failure or intentional
deceit, must be unique (i.e. probability for global collision = 0 if
created exactly as described and all systems work as designed, and some
cosmic ray doesn't hit your processor just right [or is it just wrong?]).

Anyway, about the layout, you can of course simply store the UUID as a
128 bit value (or 2 64-bit INTs - considering you use the exact same
variant and version for all your IDs, but this takes processing and you
end up with a value that needs to be re-computed before it can be
compared to anything outside of your system), or at a minimum remove any
dashes and braces, but in reality most people will just plop it as-is
into a Text/Varchar field that's been Uniqued and probably PK'd.

In that worst case scenario (all of the UUID plus dashes and braces),
the full storage requirement for a UUID would look like this:
{xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of
ASCII (or UTF-8) text space which totals 38 bytes.
Let's be generous and assume the user made VARCHAR(40) provision on an
old-style DB which reserves all the bytes, or better yet, a modern one
with a length definition that takes a further 32-bit value, so 42 bytes
then. Even in this very worst case scenario, the full space requirement
for a UUID is a dismal ~2.7 times more than the 16 bytes of space the
original 128-bit value consumed. Let's further assume the worst text
storage system using DBCS to store 16 bits per character (and nobody
really does this), even then we only get to just over 5 times. Where did
he get 9 times from??
The typical usage, storing full text UUID minus braces in an ASCII/UTF-8
sequence will result in a hair over 2.3 times[1] the storage of INTs.
Not really that bad I think.

I find it fascinating that the number 1 reason to not use UUIDs, and
probably the only reason, he never even mentioned. Sheer speed. (He
refers sorting speed, but the real gain is look-up speed, which gets
compounded in a compound query). In MSSQL I measured almost double the
lookup speed using INTs in a PK in stead of VARCHARs (I didn't even use
UUIDs, simply 6-character client codes of the form ABC001 etc.).

Where I DO agree with the blogger: Where space is not a big concern, use
both UUIDs and INTs locally in your DB, that way it is always scalable,
always merge-able with other global data and always fast with the right
query.

Cheers,
Ryan

[1] - It's hard to say exactly, most DBs use extra bits/bytes for field
specifications, lengths etc, even for the INT fields, so making an exact
blanket assertion here about ratio of char vs. int storage is not
possible, but the given ratio should be close.

Daniel Anderson

2017-06-10 13:58:44 UTC

Permalink

the article was integers (4 bytes) vs guid as a primary key, which as
string takes 36 bytes. so he was right in saying that string guid/uuid take
9 time the space of simple integers.

in general the article was quite good and allowed people to ponder about
their choices/beliefs.

he said be: *careful*, which is a good rule of thumb, as design decisions
are often cast in stone and cannot be change because to many external
programs depends on them.

regards

Post by R Smith

Post by Jens Alfke

Post by Simon Slavin
Tangential to SQLite, but there’s little on the list at the moment so
perhaps some of you might like this.
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-
careful-7b2aa3dcb439 <https://tomharrisonjr.com/uui
d-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>>

He makes some questionable points, like saying that an ASCII string of
hex has a “9x cost in size” compared to a binary representation, or that
hex strings would somehow get larger when converted from ISO-8859-1 to
UTF-8.

Just in case people wonder about your assertion that his assertion is
wrong - to be specific: UTF8 consumes the exact same space as ASCII when
you use only characters from the first block (0..7F) - which is exactly
what a UUID uses, at a maximum characters 0..9, A..F or a..f, -, {, and }.
Nothing outside of the first Unicode block - though it should be noted that
some systems may use Double-byte (16 bit) character representations
internally whenever the DB table text *storage* type is set to UTF-anything
(which should be noted is not the same thing as the *DB-Interface* type
being UTF-anything).
So the blogger's point doesn't hold on that assertion.
Further, a UUID/GUID as per the standard (RFC4122) consists of 128 bit
xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx
where A is a variant and B is the version of the UUID represented.
Variants define different methods of calculation, like whether the MAC
address with a time component was used, or a Domain/Namespace based UUID
etc. In DB systems we usually use variant 1 (MAC+Time with 100 nanosecond
precision) which, unless mechanical failure or intentional deceit, must be
unique (i.e. probability for global collision = 0 if created exactly as
described and all systems work as designed, and some cosmic ray doesn't hit
your processor just right [or is it just wrong?]).
Anyway, about the layout, you can of course simply store the UUID as a 128
bit value (or 2 64-bit INTs - considering you use the exact same variant
and version for all your IDs, but this takes processing and you end up with
a value that needs to be re-computed before it can be compared to anything
outside of your system), or at a minimum remove any dashes and braces, but
in reality most people will just plop it as-is into a Text/Varchar field
that's been Uniqued and probably PK'd.
In that worst case scenario (all of the UUID plus dashes and braces), the
{xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of
ASCII (or UTF-8) text space which totals 38 bytes.
Let's be generous and assume the user made VARCHAR(40) provision on an
old-style DB which reserves all the bytes, or better yet, a modern one with
a length definition that takes a further 32-bit value, so 42 bytes then.
Even in this very worst case scenario, the full space requirement for a
UUID is a dismal ~2.7 times more than the 16 bytes of space the original
128-bit value consumed. Let's further assume the worst text storage system
using DBCS to store 16 bits per character (and nobody really does this),
even then we only get to just over 5 times. Where did he get 9 times from??
The typical usage, storing full text UUID minus braces in an ASCII/UTF-8
sequence will result in a hair over 2.3 times[1] the storage of INTs. Not
really that bad I think.
I find it fascinating that the number 1 reason to not use UUIDs, and
probably the only reason, he never even mentioned. Sheer speed. (He refers
sorting speed, but the real gain is look-up speed, which gets compounded in
a compound query). In MSSQL I measured almost double the lookup speed using
INTs in a PK in stead of VARCHARs (I didn't even use UUIDs, simply
6-character client codes of the form ABC001 etc.).
Where I DO agree with the blogger: Where space is not a big concern, use
both UUIDs and INTs locally in your DB, that way it is always scalable,
always merge-able with other global data and always fast with the right
query.
Cheers,
Ryan
[1] - It's hard to say exactly, most DBs use extra bits/bytes for field
specifications, lengths etc, even for the INT fields, so making an exact
blanket assertion here about ratio of char vs. int storage is not possible,
but the given ratio should be close.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

--
Daniel
*L'action accède à la perfection quand, bien que vivant, vous êtes déjà
mort*
*Bunan*

Jens Alfke

2017-06-10 18:26:26 UTC

Permalink

Post by Daniel Anderson
the article was integers (4 bytes) vs guid as a primary key, which as
string takes 36 bytes. so he was right in saying that string guid/uuid take
9 time the space of simple integers.

Ah, that explains where he got 9x. I still think it’s an unfair comparison, given the inefficient encoding of the UUID (it should be a 16-byte blob.) And in any design problem that’s at sufficient scale that you’re even considering UUIDs, a 32-bit primary key isn’t going to be sufficient, so it should be 64-bit. Now we’re down to a 2x difference.

Post by Daniel Anderson
in general the article was quite good and allowed people to ponder about
their choices/beliefs.

Daniel Anderson

2017-06-10 21:18:55 UTC

Permalink

I drink to that!

Post by Jens Alfke

Post by Daniel Anderson
the article was integers (4 bytes) vs guid as a primary key, which as
string takes 36 bytes. so he was right in saying that string guid/uuid

take

Post by Daniel Anderson
9 time the space of simple integers.

Ah, that explains where he got 9x. I still think it’s an unfair
comparison, given the inefficient encoding of the UUID (it should be a
16-byte blob.) And in any design problem that’s at sufficient scale that
you’re even considering UUIDs, a 32-bit primary key isn’t going to be
sufficient, so it should be 64-bit. Now we’re down to a 2x difference.

Post by Daniel Anderson
in general the article was quite good and allowed people to ponder about
their choices/beliefs.

I would have been less critical, had the author identified that the
subject domain is traditional RDBMSs. Instead the implication is that this
is relevant to any database, which it isn’t.
Anyway, arguing about databases is boring. It’s the weekend — I’m off to
play guitar and go to parties ;-)
—Jens
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

--
Daniel
*L'action accède à la perfection quand, bien que vivant, vous êtes déjà
mort*
*Bunan*

J Decker

2017-06-10 17:47:19 UTC

Permalink

Post by R Smith

Post by Jens Alfke

Post by Simon Slavin
Tangential to SQLite, but there’s little on the list at the moment so
perhaps some of you might like this.
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-
careful-7b2aa3dcb439 <https://tomharrisonjr.com/uui
d-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>>

He makes some questionable points, like saying that an ASCII string of
hex has a “9x cost in size” compared to a binary representation, or that
hex strings would somehow get larger when converted from ISO-8859-1 to
UTF-8.

Just in case people wonder about your assertion that his assertion is
wrong - to be specific: UTF8 consumes the exact same space as ASCII when
you use only characters from the first block (0..7F) - which is exactly
what a UUID uses, at a maximum characters 0..9, A..F or a..f, -, {, and }.
Nothing outside of the first Unicode block - though it should be noted that
some systems may use Double-byte (16 bit) character representations
internally whenever the DB table text *storage* type is set to UTF-anything
(which should be noted is not the same thing as the *DB-Interface* type
being UTF-anything).
So the blogger's point doesn't hold on that assertion.
Further, a UUID/GUID as per the standard (RFC4122) consists of 128 bit
xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx
where A is a variant and B is the version of the UUID represented.
Variants define different methods of calculation, like whether the MAC
address with a time component was used, or a Domain/Namespace based UUID
etc. In DB systems we usually use variant 1 (MAC+Time with 100 nanosecond
precision) which, unless mechanical failure or intentional deceit, must be
unique (i.e. probability for global collision = 0 if created exactly as
described and all systems work as designed, and some cosmic ray doesn't hit
your processor just right [or is it just wrong?]).
Anyway, about the layout, you can of course simply store the UUID as a 128
bit value (or 2 64-bit INTs - considering you use the exact same variant
and version for all your IDs, but this takes processing and you end up with
a value that needs to be re-computed before it can be compared to anything
outside of your system), or at a minimum remove any dashes and braces, but
in reality most people will just plop it as-is into a Text/Varchar field
that's been Uniqued and probably PK'd.
In that worst case scenario (all of the UUID plus dashes and braces), the
{xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of
ASCII (or UTF-8) text space which totals 38 bytes.
Let's be generous and assume the user made VARCHAR(40) provision on an
old-style DB which reserves all the bytes, or better yet, a modern one with
a length definition that takes a further 32-bit value, so 42 bytes then.
Even in this very worst case scenario, the full space requirement for a
UUID is a dismal ~2.7 times more than the 16 bytes of space the original
128-bit value consumed. Let's further assume the worst text storage system
using DBCS to store 16 bits per character (and nobody really does this),
even then we only get to just over 5 times. Where did he get 9 times from??
The typical usage, storing full text UUID minus braces in an ASCII/UTF-8
sequence will result in a hair over 2.3 times[1] the storage of INTs. Not
really that bad I think.
I find it fascinating that the number 1 reason to not use UUIDs, and
probably the only reason, he never even mentioned. Sheer speed. (He refers
sorting speed, but the real gain is look-up speed, which gets compounded in
a compound query). In MSSQL I measured almost double the lookup speed using
INTs in a PK in stead of VARCHARs (I didn't even use UUIDs, simply
6-character client codes of the form ABC001 etc.).

but then you have to compare a lot of characters before you get a mismatch;
a uuid will fail comparisons sooner since the leading characters will be
different sooner than your example data would. Definitely comparing apples
and oranges there.

Post by R Smith
Where I DO agree with the blogger: Where space is not a big concern, use
both UUIDs and INTs locally in your DB, that way it is always scalable,
always merge-able with other global data and always fast with the right
query.
Cheers,
Ryan
[1] - It's hard to say exactly, most DBs use extra bits/bytes for field
specifications, lengths etc, even for the INT fields, so making an exact
blanket assertion here about ratio of char vs. int storage is not possible,
but the given ratio should be close.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Simon Slavin

2017-06-10 18:18:13 UTC

Permalink

There’s no reason for a human to want to sort by GUID. They’re arbitrary symbols and don’t have an order. Anyone who says "Show me our spare parts in GUID order." needs to rethink why they’re doing what they’re doing.

However, SQLite does searching by sorting. Binary chop is still the best way we know to find one value of many (though hashes look interesting), and for that you need your values in sorted order. So yes, you’re going to sort by your key values.

The other thing is that your primary key should be values which don’t matter to you. Because if you care about a value sooner or later you’re going to think of a reason to change it. So if you understand UUID, appreciate their structure, print them out, or use parts of them to indicate factories or data-sources, they’re significant values and you need something else to use as a primary key.

Simon.

Richard Damon

2017-06-10 18:24:32 UTC

Permalink

Post by R Smith
In that worst case scenario (all of the UUID plus dashes and braces),
{xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of
ASCII (or UTF-8) text space which totals 38 bytes.
Let's be generous and assume the user made VARCHAR(40) provision on an
old-style DB which reserves all the bytes, or better yet, a modern one
with a length definition that takes a further 32-bit value, so 42
bytes then. Even in this very worst case scenario, the full space
requirement for a UUID is a dismal ~2.7 times more than the 16 bytes
of space the original 128-bit value consumed. Let's further assume the
worst text storage system using DBCS to store 16 bits per character
(and nobody really does this), even then we only get to just over 5
times. Where did he get 9 times from??
The typical usage, storing full text UUID minus braces in an
ASCII/UTF-8 sequence will result in a hair over 2.3 times[1] the
storage of INTs. Not really that bad I think.

If the field was declared as char[38], and UNICODE collation, many
systems will allocate 4 bytes per character to allow for any possible
character (of course, since we are only storing ASCII characters, we
need to declare that so they take only 1 byte per character). Comparing
that to 128 bits (16 bytes) is the 9x factor.

Richard Damon

Jens Alfke

2017-06-10 18:37:45 UTC

Permalink

If the field was declared as char[38], and UNICODE collation, many systems will allocate 4 bytes per character to allow for any possible character

Maybe, but it’s a bad design! It’s a huge amount of bloat for most text that would be stored in a database (even most Asian characters will fit in 16 bits.) And worse, it doesn’t actually make working with text easier, because you can _not_ treat every 32-bit code point as a character. There are arcane Unicode rules for combining code points — an accented letter may be represented as the base letter followed by an accent mark, and some ideographs (including many emoji!) are composed of multiple combined ideographs. So even something simple like “how many characters are in this string” requires scanning the string, not just a simple array lookup.

In the end, UTF-8 almost always becomes the best encoding, since everything has to be treated as variable-width anyway.

—Jens

R Smith

2017-06-10 22:31:54 UTC

Permalink

Post by Richard Damon

Post by R Smith
The typical usage, storing full text UUID minus braces in an
ASCII/UTF-8 sequence will result in a hair over 2.3 times[1] the
storage of INTs. Not really that bad I think.

If the field was declared as char[38], and UNICODE collation, many
systems will allocate 4 bytes per character to allow for any possible
character (of course, since we are only storing ASCII characters, we
need to declare that so they take only 1 byte per character).
Comparing that to 128 bits (16 bytes) is the 9x factor.

I've known systems to use Double-Byte character space in the past, but
most RDBMSes repented and mended their evil ways by moving to a
Multi-byte encoding variable length solution - but allocating 4-bytes
blindly to anything Unicode?, well, let's just say in that case you
don't need to consider whether to use UUIDs or not, you need to consider
changing Database engines asap. (or perhaps at least the storage engine
or type).

I mean 99.99% of the World's language and communication requirement
Unicode characters sits in the BMP (Basic Multilingual Plane) which is
in its entirety below the need for a 3rd byte. The higher planes that
are currently allocated contains mostly non-linguistic definitions, like
this smiley poo - 💩 SELECT Char(0x1F4A9); - though one can argue it
has become a staple of the chat-client vernacular.

Richard Damon

2017-06-10 22:57:45 UTC

Permalink

Post by R Smith

Post by Richard Damon

Post by R Smith
The typical usage, storing full text UUID minus braces in an
ASCII/UTF-8 sequence will result in a hair over 2.3 times[1] the
storage of INTs. Not really that bad I think.

If the field was declared as char[38], and UNICODE collation, many
systems will allocate 4 bytes per character to allow for any possible
character (of course, since we are only storing ASCII characters, we
need to declare that so they take only 1 byte per character).
Comparing that to 128 bits (16 bytes) is the 9x factor.

I've known systems to use Double-Byte character space in the past, but
most RDBMSes repented and mended their evil ways by moving to a
Multi-byte encoding variable length solution - but allocating 4-bytes
blindly to anything Unicode?, well, let's just say in that case you
don't need to consider whether to use UUIDs or not, you need to
consider changing Database engines asap. (or perhaps at least the
storage engine or type).
I mean 99.99% of the World's language and communication requirement
Unicode characters sits in the BMP (Basic Multilingual Plane) which is
in its entirety below the need for a 3rd byte. The higher planes that
are currently allocated contains mostly non-linguistic definitions,
like this smiley poo - 💩 SELECT Char(0x1F4A9); - though one can
argue it has become a staple of the chat-client vernacular.

The issue is defining a field as CHAR not VARCHAR. For many systems, if
the record is fixed size, there is an efficiency gain of CHAR over
VARCHAR, so some older guidelines suggest using fixed width fields when
possible. If the field is just ASCII, then it works well, but when
UNICODE fixed width fields can become less useful, as the storage engine
needs to be prepared for the totally worse case situation. Thus the
situation I described is likely a design error, but is one I have seen.
(It comes up more often in Indexes, which are usually by definition
fixed width fields)

--
Richard Damon