Post by Jens AlfkePost by Simon SlavinTangential to SQLite, but there’s little on the list at the moment so perhaps some of you might like this.
<https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439 <https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439>>
He makes some questionable points, like saying that an ASCII string of hex has a “9x cost in size” compared to a binary representation, or that hex strings would somehow get larger when converted from ISO-8859-1 to UTF-8.
Just in case people wonder about your assertion that his assertion is
wrong - to be specific: UTF8 consumes the exact same space as ASCII when
you use only characters from the first block (0..7F) - which is exactly
what a UUID uses, at a maximum characters 0..9, A..F or a..f, -, {, and
}. Nothing outside of the first Unicode block - though it should be
noted that some systems may use Double-byte (16 bit) character
representations internally whenever the DB table text *storage* type is
set to UTF-anything (which should be noted is not the same thing as the
*DB-Interface* type being UTF-anything).
So the blogger's point doesn't hold on that assertion.
Further, a UUID/GUID as per the standard (RFC4122) consists of 128 bit
value formatted to present like this 36-character sequence:
xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx
where A is a variant and B is the version of the UUID represented.
Variants define different methods of calculation, like whether the MAC
address with a time component was used, or a Domain/Namespace based UUID
etc. In DB systems we usually use variant 1 (MAC+Time with 100
nanosecond precision) which, unless mechanical failure or intentional
deceit, must be unique (i.e. probability for global collision = 0 if
created exactly as described and all systems work as designed, and some
cosmic ray doesn't hit your processor just right [or is it just wrong?]).
Anyway, about the layout, you can of course simply store the UUID as a
128 bit value (or 2 64-bit INTs - considering you use the exact same
variant and version for all your IDs, but this takes processing and you
end up with a value that needs to be re-computed before it can be
compared to anything outside of your system), or at a minimum remove any
dashes and braces, but in reality most people will just plop it as-is
into a Text/Varchar field that's been Uniqued and probably PK'd.
In that worst case scenario (all of the UUID plus dashes and braces),
the full storage requirement for a UUID would look like this:
{xxxxxxxx-xxxx-Axxx-Bxxx-xxxxxxxxxxxx} which totals 38 characters of
ASCII (or UTF-8) text space which totals 38 bytes.
Let's be generous and assume the user made VARCHAR(40) provision on an
old-style DB which reserves all the bytes, or better yet, a modern one
with a length definition that takes a further 32-bit value, so 42 bytes
then. Even in this very worst case scenario, the full space requirement
for a UUID is a dismal ~2.7 times more than the 16 bytes of space the
original 128-bit value consumed. Let's further assume the worst text
storage system using DBCS to store 16 bits per character (and nobody
really does this), even then we only get to just over 5 times. Where did
he get 9 times from??
The typical usage, storing full text UUID minus braces in an ASCII/UTF-8
sequence will result in a hair over 2.3 times[1] the storage of INTs.
Not really that bad I think.
I find it fascinating that the number 1 reason to not use UUIDs, and
probably the only reason, he never even mentioned. Sheer speed. (He
refers sorting speed, but the real gain is look-up speed, which gets
compounded in a compound query). In MSSQL I measured almost double the
lookup speed using INTs in a PK in stead of VARCHARs (I didn't even use
UUIDs, simply 6-character client codes of the form ABC001 etc.).
Where I DO agree with the blogger: Where space is not a big concern, use
both UUIDs and INTs locally in your DB, that way it is always scalable,
always merge-able with other global data and always fast with the right
query.
Cheers,
Ryan
[1] - It's hard to say exactly, most DBs use extra bits/bytes for field
specifications, lengths etc, even for the INT fields, so making an exact
blanket assertion here about ratio of char vs. int storage is not
possible, but the given ratio should be close.