Discussion:
Newbie Question about ASCII and UTF-16 strings
k***@public.gmane.org
2009-05-10 05:29:46 UTC
Permalink
Hi,

I am a newbie to sqlite.
I want to create a sqlite database in Linux OS.

The data that I wish to put in the table contains both ASCII and UTF-16 encoded strings.
For eg. File and directory names are UTF-16 where the URL, date time information about the files are in ASCII.

How do I create a table that holds both ASCII and UTF-16 strings.

The documentation says that TEXT datatype can hold both the encodings,
if i do a sqlite3_open -> TEXT is only ASCII, UTF-8
if i do a sqlite3_open16 -> TEXT is only UTF-16

Thanks in advance.

________________________________________________________________________
You are invited to Get a Free AOL Email ID. - http://webmail.aol.in
John Machin
2009-05-10 09:46:09 UTC
Permalink
Post by k***@public.gmane.org
Hi,
I am a newbie to sqlite.
I want to create a sqlite database in Linux OS.
The data that I wish to put in the table contains both ASCII and UTF-16 encoded strings.
For eg. File and directory names are UTF-16 where the URL, date time information about the files are in ASCII.
How do I create a table that holds both ASCII and UTF-16 strings.
The documentation says that TEXT datatype can hold both the encodings,
if i do a sqlite3_open -> TEXT is only ASCII, UTF-8
if i do a sqlite3_open16 -> TEXT is only UTF-16
AFAICT ... and I am by no means omniscient :-) ... the story is
something like this:

There are THREE possibilities for how the data is declared to be stored.
They are: UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three.

The choice is made when a database is created and is applied to the
whole database. There is no way of making a choice per table.

When you subsequently feed data into SQLite, you have two [1] choices:
(a) you say that you are supplying bytes that are encoded in UTF-8
(b) you say that you are supplying wide chars that are encoded in the
UTF-16xE that is appropriate to the endianness of the machine that you
are running on.

If the choice differs from that recorded in the database, SQLite will
attempt to convert your data for you.

If the declaration(s) that you make implicitly is/are untrue e.g. your
data is encoded in ISO 8859-xx but you say that it is encoded in UTF-8,
you may not get any error message. SQLite3 does little checking.

Which choice you make for database storage would depend on two things:

(1) the time required for any transcoding (1.1) by SQLite automatically
as described above, and/or (1.2) by you to convert raw input into the
encoding that you will use to feed your data to SQLite

(2) the resultant size of the database:
[following assumes all of your data is in the range U+0000 to U+FFFF]

If stored in UTF-16xE, each character will occupy 2 bytes.

If stored in UTF-8, the bytes per char is variable:

U+0000 to U+007F (ASCII) -> 1 byte per char
U+0080 to U+07FF (covers e.g. accented-and-otherwise-adorned Latin
letters, Greek, Cyrillic, Hebrew, Arabic) -> 2 bytes per char
U+0800 to U+FFFF (covers e.g. Devanagari and other Indian languages,
Chinese, Japanese, Korean) -> 3 bytes per char

So it's impossible to say whether UTF-8 would be better/same/worse
without examining your actual data. You could write a script that loaded
a typical data mix into a database, and run it with the database
encoding specified as either UTF-8 or UTF-16xE and compare the resultant
sizes (and time difference).

[1]: If you are using a wrapper like e.g the Python sqlite3 module, your
choices may be different. Possibilities include: you supply a unicode
object or you supply a str object (8-bit characters) plus an encoding
(which may default to UTF-8) and the wrapper will do whatever is necessary.

HTH,
John
P Kishor
2009-05-10 12:15:28 UTC
Permalink
Post by John Machin
Post by k***@public.gmane.org
Hi,
I am a newbie to sqlite.
I want to create a sqlite database in Linux OS.
The data that I wish to put in the table contains both ASCII and UTF-16 encoded strings.
For eg. File and directory names are UTF-16 where the URL, date time information about the files are in ASCII.
How do I create a table that holds both ASCII and UTF-16 strings.
The documentation says that TEXT datatype can hold both the encodings,
if i do a sqlite3_open -> TEXT is only ASCII, UTF-8
if i do a sqlite3_open16 -> TEXT is only UTF-16
AFAICT ... and I am by no means omniscient :-) ... the story is
There are THREE possibilities for how the data is declared to be stored.
They are: UTF-8, UTF-16LE, UTF-16BE. You must choose ONE of the three.
The choice is made when a database is created and is applied to the
whole database. There is no way of making a choice per table.
(a) you say that you are supplying bytes that are encoded in UTF-8
(b) you say that you are supplying wide chars that are encoded in the
UTF-16xE that is appropriate to the endianness of the machine that you
are running on.
If the choice differs from that recorded in the database, SQLite will
attempt to convert your data for you.
If the declaration(s) that you make implicitly is/are untrue e.g. your
data is encoded in ISO 8859-xx but you say that it is encoded in UTF-8,
you may not get any error message. SQLite3 does little checking.
(1) the time required for any transcoding (1.1) by SQLite automatically
as described above, and/or (1.2) by you to convert raw input into the
encoding that you will use to feed your data to SQLite
[following assumes all of your data is in the range U+0000 to U+FFFF]
If stored in UTF-16xE, each character will occupy 2 bytes.
U+0000 to U+007F (ASCII) -> 1 byte per char
U+0080 to U+07FF (covers e.g. accented-and-otherwise-adorned Latin
letters, Greek, Cyrillic, Hebrew, Arabic) -> 2 bytes per char
U+0800 to U+FFFF (covers e.g. Devanagari and other Indian languages,
Chinese, Japanese, Korean) -> 3 bytes per char
So it's impossible to say whether UTF-8 would be better/same/worse
without examining your actual data. You could write a script that loaded
a typical data mix into a database, and run it with the database
encoding specified as either UTF-8 or UTF-16xE and compare the resultant
sizes (and time difference).
[1]: If you are using a wrapper like e.g the Python sqlite3 module, your
choices may be different. Possibilities include: you supply a unicode
object or you supply a str object (8-bit characters) plus an encoding
(which may default to UTF-8) and the wrapper will do whatever is necessary.
This is a really nice explanation, worthy of being enshrined in the wiki.
Post by John Machin
HTH,
John
_______________________________________________
sqlite-users mailing list
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
--
Puneet Kishor http://www.punkish.org/
Carbon Model http://carbonmodel.org/
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org/
Science Commons Fellow, Geospatial Data http://sciencecommons.org
Nelson Institute, UW-Madison http://www.nelson.wisc.edu/
-----------------------------------------------------------------------
collaborate, communicate, compete
=======================================================================
Loading...