Discussion:
[sqlite] hex and char functions
x
2017-08-07 13:38:59 UTC
Permalink
In c++ I tried to call a sqlite udf using the following sql

UnicodeString SQL="select udf('5\u00856');";

Which was prepared using sqlite3_prepare16_v2.

(I was experimenting with sending a udf a group of numbers without getting involved with blobs).

Debugging the udf I recovered the string (call it Str) using sqlite3_value_text16.

Str[1]=='5' & Str[3]=='6' but Str[2]=='?' (or Unicode 3F)

I also tried retrieving the string using sqlite3_value_text to return a const unsigned char * but that also seemed to have a problem with values beyond 127.

What am I missing?

Related

Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.

What is the ‘C2’ about?
Igor Tandetnik
2017-08-07 13:54:16 UTC
Permalink
On 8/7/2017 9:38 AM, x wrote:
> In c++ I tried to call a sqlite udf using the following sql
>
> UnicodeString SQL="select udf('5\u00856');";

You are using a narrow string literal to try and hold a Unicode character. You are at the whim of your compiler as to how it represents the latter in the former. My guess is, it tries to convert according to the system default code page (I assume Windows here), and since the character is not in fact representable therein, it's converted to '?'.

That is, the character was lost before the program even ran, let alone before SQLite got involved.
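For example (a rough sketch, not your code, assuming a Windows compiler where the narrow execution character set is the ANSI code page and wchar_t is 16 bits):

const char *narrow = "5\u00856";   // translated to the ANSI code page; U+0085 has no slot there, so it becomes '?'
const wchar_t *wide = L"5\u00856"; // wide literal, kept as UTF-16: 0x0035 0x0085 0x0036

The wide form never makes the lossy round trip through the code page.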

Also, what's UnicodeString?
--
Igor Tandetnik
Hick Gunter
2017-08-07 14:37:45 UTC
Permalink
The sqlite char() function returns Unicode text. Apparently, the UTF-8 encoding for code point 133 is two bytes, namely C2 85. You seem to be expecting char() to return ISO characters, which it does not do.

Calling sqlite3_value_text16 instructs SQLite to convert the contents of the field into utf16 with native byte order from (assumed) utf8, which may have funny results if the source is not indeed UTF8 but rather ISO or some strange (windows) codepage.
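For illustration, a bare-bones UDF body (just a sketch using the plain sqlite3 C API, not the original poster's code) that pulls its argument out as UTF-16:

static void udf(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
    /* sqlite3_value_text16 returns native-byte-order UTF-16, converting
       from the stored encoding if necessary */
    const unsigned short *s = (const unsigned short *)sqlite3_value_text16(argv[0]);
    int n = sqlite3_value_bytes16(argv[0]) / 2;   /* number of 16-bit code units */
    /* if '5', U+0085, '6' arrived intact, s[0..2] hold 0x0035 0x0085 0x0036 */
    sqlite3_result_int(ctx, n);
}

Such a function would be registered with something like sqlite3_create_function(db, "udf", 1, SQLITE_UTF8, 0, udf, 0, 0).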

Igor Tandetnik
2017-08-07 14:48:37 UTC
Permalink
On 8/7/2017 9:38 AM, x wrote:
> Related
>
> Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.
>
> What is the ‘C2’ about?

Two-byte sequence C2 85 is the UTF-8 encoding of the Unicode codepoint U+0085.
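(The arithmetic, for reference: code point 0x85 needs two UTF-8 bytes of the form 110xxxxx 10xxxxxx; its 11 payload bits are 000 1000 0101, giving 1100 0010 = C2 and 1000 0101 = 85.)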
--
Igor Tandetnik
x
2017-08-07 15:29:41 UTC
Permalink
Apologies, I should have said I was using c++ builder Berlin on windows 10 and that UnicodeString was UTF16.

I thought I had learned enough about this string lunacy to get by but finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped me over the edge. I assumed they both used the same codes but UTF16 allowed some characters UTF8 didn’t have.

I’m now wondering if I should go to the trouble of changing my sqlite wrapper over to communicate with the sqlite utf8 functions rather than the utf16 ones. Trouble is many of c++ builder’s built in types such as TStringList etc are utf16.

Nico Williams
2017-08-07 15:33:57 UTC
Permalink
On Mon, Aug 07, 2017 at 03:29:41PM +0000, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on
> windows 10 and that UnicodeString was UTF16.
>
> I thought I had learned enough about this string lunacy to get by but
> finding out that the UTF8 code for the UTF16 code \u0085 is in fact
> \uc285 has tipped me over the edge. I assumed they both used the same
> codes but UTF16 allowed some characters UTF8 didn’t have.

Internally SQLite3 uses UTF-8. The SQLite3 API lets you deal with
UTF-16, but this just transcodes to/from UTF-8 internally.

So when you SELECT hex(<some value of type TEXT>); the hex() function
sees UTF-8, not UTF-16.

Nico
--
Richard Hipp
2017-08-07 15:45:50 UTC
Permalink
On 8/7/17, Nico Williams <***@cryptonector.com> wrote:
>
> Internally SQLite3 uses UTF-8. The SQLite3 API lets you deal with
> UTF-16, but this just transcodes to/from UTF-8 internally.
>

That is not quite correct.

SQL statements are always converted into UTF8 for parsing and code
generation. But data can be stored in the database file and processed
as UTF8, UTF16be, or UTF16le. All text content for a single database
must use the same encoding. When creating the database file, set the
encoding before adding any content by running one of:

PRAGMA encoding('utf-8');
PRAGMA encoding('utf-16be');
PRAGMA encoding('utf-16le');

See https://www.sqlite.org/pragma.html#pragma_encoding for additional
information.
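For example (a quick sketch, run against a brand-new, empty database file):

PRAGMA encoding('utf-16le');   -- must come before any content is created
CREATE TABLE t(v TEXT);        -- the first write fixes the encoding for good
SELECT hex(char(65,133,66));   -- gives 410085004200 instead of the UTF-8 41C28542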

--
D. Richard Hipp
***@sqlite.org
Nico Williams
2017-08-07 15:47:09 UTC
Permalink

Ah, OK, thanks for the correction!
R Smith
2017-08-07 18:32:57 UTC
Permalink
On 2017/08/07 5:29 PM, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on windows 10 and that UnicodeString was UTF16.
>
> I thought I had learned enough about this string lunacy to get by but finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped me over the edge. I assumed they both used the same codes but UTF16 allowed some characters UTF8 didn’t have.
>
> I’m now wondering if I should go to the trouble of changing my sqlite wrapper over to communicate with the sqlite utf8 functions rather than the utf16 ones. Trouble is many of c++ builder’s built in types such as TStringList etc are utf16.

No, you shouldn't. UTF16 doesn't have "more" characters than UTF8, and
TStringList is not UTF16 - let me see if I can clear up a bit of the
confusion. This next bit is very short and really requires much more
study, but I hope I say enough (and correct enough) for you to get the
picture a little better.

First some (very short) history on the "string lunacy" you refer to.
Note that when storing text in any system there are two confusing
concepts that are hard to get your head around: Firstly there is the
actual Characters, or Character-sets - these are the things referred to
as ANSI, Latin, CN-Big5 etc., then there are Character Code Index
mappings, these are things such as Code-Pages and the like that
basically says stuff like the Uppercase Latin character A has a code of
65 (Hex 41) in the ASCII code-page etc. These may all differ for
different code-pages, though there was good overlap. Eventually
Unicode intended to save the World by indeed unifying all the
code-paging (hence "Unicode") and they did a marvelous job of it - but
there were very many real-World characters to cater for, so they have
code-point indices much larger than any single or even double-byte
character array or string can ever contain.

Here we enter the character-encodings. These are things like UTF8,
UTF16LE and they specify an encoding, a way to make a sequence of bytes
refer to a specific codepoint in a code-space (typically the Unicode
code-point space) that can be much larger than 8 or 16 bits may
accommodate. UTF-8 for instance specifies that any byte value less than
128 refers to one of the first 128 code points; as soon as that top bit
(MSB) goes high, it means another byte is needed (or bytes, depending on
how many high bits follow the initial one) to complete the encoding, and
the follow-on bytes must fit the 10xxxxxx pattern in turn, which keeps
things consistent and safely lets any reader know, as soon as they
encounter a high MSB, that it is definitely part of a multi-byte UTF8
sequence - which is a brilliant encoding. Although slightly technical,
it is very lean; we only escalate bytes when needed, and only as much as
is needed. The UTF16 encoding is a bit less technical; we can represent
far more code points with a consistent 2-byte setup, but even that is
much smaller than the full Unicode world, so UTF16 reserves a specific
range (0xD800 to 0xDBFF) for lead units that must be followed by a
second double-byte unit (0xDC00 to 0xDFFF) - together known as a
"surrogate pair" (this is the thing that you said pushed you over the
edge, finding that some Unicode characters are represented by 2
double-byte units, so 4 bytes total width). There is much more to be
said about all this, but I don't want to take everyone's time and the
above is enough to understand the next bit regarding C++ history:

One of the great features of the bcc32 compilers of yonder was that they
(Borland) embraced strong typing, probably to this day the
strongest-typed language around is Pascal, later Turbo Pascal, Delphi
etc. I mention this because you use (apparently) the Borland origin
version of C++ and they always had a stronger typed vision than the C++
standard. There is a precise type for everything - but that also came
with problems when adapting as times changed. C++ specifically avoided a
standard string type for quite a while, which was one of the design
mistakes often noted by people reflecting on the early C++ development.
Anyway...

The first iterations of C++ started out long ago using the convention
borrowed from C for pointer strings with null terminators. Indeed, a
constant string such as 'Hello World!' today still is a pointer and
null terminator setup and the "std::string" type has ways of converting
itself to that still. After some time came a standardization in C++
with the "std::string" type. This standard string usually represents an
array of 8-bit characters in ASCII encoding.

ASCII of course did not last long as the de facto character set
definition, and eventually everything went Unicode (as mentioned above).
The first adaptations seen were "Wide" types (std::wstring) which allowed
a nice range of 65536 character codes per string index "character", same
as what Windows went for, but in no way guaranteed compatibility with
specific encodings. Later still came things like "std::u16string" and
"std::u32string" to try and fix some of the incompatibilities. This
went much more smoothly for the other Borland projects - the
Delphi/Lazarus platforms used the length specifier so all those #0
characters in between the wide bytes did not trip up the bcc compilers.
To this day, any normal string in Delphi/Lazarus and its ilk is made of
double-byte characters starting at index 1 (not 0, because of the
backward compatibility, though these days the length is contained fully
in a sort of variable registry - known as RTTI or
Run-Time Type Info [Google it] - and you can switch your compiler to use
zero-indexed strings instead).

Back to the point: Unicode is much larger than those 65536 code-points,
so Delphi/C++ devised the "UTF8String" to accompany the "WideString" and
other string types, which simply lets the compiler know that you wish
for that string type to automatically assume that any byte-sequence
stored to it comes in the form of UTF8, so it needs to be translated
from UTF8, and also when you read that memory, you intend for it to be
translated out of UTF8. Widestring/wstring on the other hand does no
such thing, but there are functions as part of the standard interfaces
that will encode to and from anything - very easy in fact, but of course
you first have to know WHY and WHAT you want to encode for anything to
make sense. This can be especially confusing when you use an object made
by someone else, such as an SQLite wrapper, and don't know whether it
already takes care of UTF8 encoding or not, or whether the object will
have a setting or property like "useUTF8" or similar.
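As a tiny sketch of the idea (assuming C++Builder's System::UTF8String converts on assignment, as described):

UnicodeString ws = L"5\u00856";   // held as UTF-16: 0035 0085 0036
UTF8String u8 = ws;               // transcoded to UTF-8 on assignment
// u8 now holds the bytes 35 C2 85 36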

You can further read up on the Encoding class (not sure if it is the
exact same name in C++).

Lastly, I think Unicode is great and UTF8 and UTF16 both have merit -
right tool for the job and all that, but it can be confusing. You may be
overwhelmed by needing to know all the above to simply get a string out
of a DB, but trust me, your simple string will one day need to store
Chinese, German, French characters and, naturally, an icon of a smiley
poo - and then you will be delighted about how easy and automatic that
ended up being - all because you went through the effort at the start of
learning the Unicode way of doing things.


Hope that did not add to the confusion...
Cheers,
Ryan

PS: Since this is an SQLite forum, and the topic of Unicode comes up
from time to time, allow me to share a personal project I had to
accomplish as part of a larger project - if anyone wishes. This
downloadable DB contains the complete Unicode standard in SQLite format
via a few tables that can be joined in obvious ways to get complete
Unicode character information currently defined by the Unicode standard,
each codepoint, its Plane, Section, surrogate pair, HTML entity, HTML
code etc. This DB will be available in the future and updated whenever
the Unicode standard changes, though I might host it in a different place.
Mail me offline if interested in updates.
http://rifin.co.za/software/glyphlib/unicode.zip
x
2017-08-07 19:01:33 UTC
Permalink
Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one fingered typewriter like myself. The borland related stuff is welcome but I still can’t say I’m any less confused by it all.

I’m having a bad day today. I’ve spent most of it trying to fathom this stuff out. Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before the string is anywhere near sqlite. Why I don’t know. It doesn’t seem unreasonable to want to put a Unicode code into a UnicodeString. As regards the hex(char(133)) returning C285, following the posts by Nico and Richard I’m wondering if it’s because I’m using SQLite Expert pro on a database that’s encoded in utf-8. I tried to change the encoding to utf-16 to see if I would get a different result but, while the software seemed to accept the request, the request was never completed and no feedback was given aside from the ‘Apply’ and ‘Cancel’ buttons both being greyed out for hours (it’s only a small database). I’ve had enough for today though.

Thanks to all who have contributed.

R Smith
2017-08-07 19:45:02 UTC
Permalink
On 2017/08/07 9:01 PM, x wrote:
> Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one fingered typewriter like myself. The borland related stuff is welcome but I still can’t say I’m any less confused by it all.
>
> I’m having a bad day today. I’ve spent most of it trying to fathom this stuff out. Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before the string is anywhere near sqlite. Why I don’t know. It doesn’t seem unreasonable to want to put a Unicode code into a UnicodeString. As regards the hex(char(133)) returning C285, following the posts by Nico and Richard I’m wondering if it’s because I’m using SQLite Expert pro on a database that’s encoded in utf-8. I tried to change the encoding to utf-16 to see if I would get a different result but, while the software seemed to accept the request, the request was never completed and no feedback was given aside from the ‘Apply’ and ‘Cancel’ buttons both being greyed out for hours (it’s only a small database). I’ve had enough for today though.

Don't worry, I can type faster than I can think (which isn't very fast,
and quite evident from the typos!).

I feel your pain though, and hope we can help get you un-discouraged
soon. (Imagine that is a word.)

One thing to Note: You can only change the Database encoding at the
START of making the Database. Once you put stuff in the database, the
encoding is set for life. That is an SQLite quirk - but that said, it
wouldn't solve your problem, since the encoding problem (by your
account) happens before the database is reached. It's real easy to test.

Use the lowest ranked interface to simply add a string into the DB. By
lowest ranked, I mean before it goes through any of your own code, or
other wrapper code.

Use direct SQL, such as DB.Execute('INSERT INTO t(v1)
VALUES(''Geronimo'');');
(This assumes a table called t with a column v1 of type TEXT. Those are
2 single-quotes btw, not double-quotes).

Use some SQLite tool or the command line to check what ended up in that DB.

If that works, add some Unicode value, like this:
DB.Execute('INSERT INTO t VALUES(''©Geronimo®'');');

If it ends up as something else in that database, follow the call stack
chain, see how those look in the debugger initially (which may be
different to what you expect, depending on the debugger display
encoding), what you are looking for is Change - the moment it changes -
you will see which thing is responsible for misinterpreting it, if any.

Tell us more when you did the tests. Don't give up!

Also - make sure that whatever you read it with reads (and can
faithfully display) any unicode characters. Often the fault is not
yours. The SQLite CLI for instance can read it correctly, but sometimes
your computer stdout uses a codepage that doesn't display it correctly.
Mail me off-list if you would like an SQLite DB and table with lots of
Unicode test data in it and a correct interpretation of it to test with.

Best of luck!
Ryan
Simon Slavin
2017-08-07 19:50:43 UTC
Permalink
On 7 Aug 2017, at 8:45pm, R Smith <***@rsweb.co.za> wrote:

> If it ends up as something else in that database, follow the call stack chain, see how those look in the debugger initially (which may be different to what you expect, depending on the debugger display encoding)

Note this point very hard. It is always possible that your program is working perfectly, and your debugger is faulty. I’ve wasted a lot of time over the years through believing the debugger after multiple versions of my own code all produced the same, apparently wrong, result.

Simon.
x
2017-08-08 08:24:22 UTC
Permalink
I accidentally deleted the latest 3 replies to this thread in my mail so apologies for not including any content from those mails.

Ryan, the error happens immediately in c++ builder. E.g.

UnicodeString S="\u0085";
int Code=S[1];

Code returns 63 (x3F) which is code for ‘?’ and the question mark is also what shows when I hover the cursor over S during debugging.
As u+0085 is a control character, and so has no associated graphic, I can forgive the ‘?’ being printed in its place but why does the underlying Code change? I have posted this question on the c++ builder website and will post any answer here.


Simon’s timing with his comment about debugging is spooky. Last night I found that when I try to ‘Evaluate/Modify’ S[1] during debugging I get the following error message

E2015 Ambiguity between 'System::UnicodeString::System::UnicodeString::operator [](const int)' and 'System::UnicodeString::System::UnicodeString::operator [](const int)'

Ambiguity between X and X?? I had no such problems earlier yesterday but, as it was working fine when I first encountered the ‘\u0085’ problem, I doubt it’s the debugger.


I conducted the following experiment in SQLite Expert Pro by creating a couple of new empty databases.
select hex(char(65,133,66))
returned
'41C28542' when the current db was utf-8
'410085004200' when the current db was utf-16le
'004100850042' when the current db was utf-16be

The bottom one of those is what I expected when I first ran the select, yet I think I'm right in saying my system (windows 10) is little endian. This is a further confusion thrown into the mix for me. As is usually the case, I've little doubt sqlite is coming up with the right answer in each case.
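(Thinking about it, I suppose hex() just shows the stored bytes in order, and UTF-16LE stores the low byte of each 16-bit unit first, so 'A' comes out as 41 00 and the whole string as 41 00 85 00 42 00, i.e. '410085004200', which would be consistent with little endian after all.)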

Tom
x
2017-08-08 10:32:06 UTC
Permalink
Just received this on the Embarcadero forum.

You are assigning an Ansi literal string to a Unicode string so a conversion from the current Ansi locale is occurring.
Try assigning a unicode literal:

UnicodeString S= L"\u0085";

That L in front of the string makes Code==133 and it’s also 133 inside the udf after retrieval using sqlite3_value_text16.
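Presumably the original statement needs the same treatment, e.g.

UnicodeString SQL=L"select udf('5\u00856');";

so the literal never takes the lossy detour through the Ansi code page.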

Why does it always turn out to be my own fault ☹

Many thanks to all who contributed.




Rowan Worth
2017-08-09 02:29:34 UTC
Permalink
On 8 August 2017 at 18:32, x <***@hotmail.com> wrote:

> Why does it always turn out to be my own fault ☹
>

Welcome to programming ;)

It's very rare in practice to find an actual bug in a library or language,
especially one as widely deployed and tested as sqlite. Glad you're sorted,
I just wanted to have a quick go at clearing this up:

but finding out that the UTF8 code for the UTF16 code \u0085 is in fact
> \uc285 has tipped me over the edge
>

1. \u0085 is a unicode code-point, 'NEXT LINE (NEL)'
2. \uc285 is a unicode code-point, 'HANGUL SYLLABLE SWIT (쉩)'
3. Neither of these is a UTF8 or UTF16 "code" -- UTF8 and UTF16 are
different ways of encoding/decoding unicode code-points
4. In UTF8, \u0085 is encoded as two bytes -- 0xc2 0x85
5. In UTF8, \uc285 is encoded as three bytes -- 0xec 0x8a 0x85

HTH,
-Rowan
x
2017-08-09 10:39:02 UTC
Permalink
Thanks Rowan. I’ve picked up bits and pieces from the various replies to get a basic idea of what’s going on. My question should’ve been posted on the c++ builder forum in the first place. Sorry for wasting everyone’s time.

Tom



Jens Alfke
2017-08-08 00:44:33 UTC
Permalink
> On Aug 7, 2017, at 8:29 AM, x <***@hotmail.com> wrote:
>
> I thought I had learned enough about this string lunacy to get by but finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped me over the edge. I assumed they both used the same codes but UTF16 allowed some characters UTF8 didn’t have.

UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a sequence of two to five bytes in the range 80-ff to encode a single Unicode character/code-point. (You can sort of think of this as every byte holding 7 bits of the actual character number, with its MSB set to 1. It’s not exactly like that, but close.)

IMHO UTF-8 is the best general purpose text encoding. Code that works with ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally work with UTF-8; the main thing to watch out for tends to be breaking or trimming strings, because you don’t want to cut part of a multibyte sequence. UTF-8 is also quite compact for Roman languages (although not non-Roman ones.)
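A minimal sketch of that trimming point (assuming a std::string that holds UTF-8):

#include <string>

// Trim to at most max_bytes without cutting a multibyte sequence in half:
// back up while the byte at the cut point is a continuation byte (10xxxxxx).
std::string truncate_utf8(const std::string &s, size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    size_t cut = max_bytes;
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return s.substr(0, cut);
}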

16-bit encodings used to seem like a good idea back when Unicode had fewer than 65,536 characters, so you could assume that one unichar = one character. Those days are long gone. Now dealing with UTF-16 has all the same problems of dealing with UTF-8 (i.e. multi-word sequences) without the benefits of compactness or ASCII compatibility.

32-bit encodings are just silly, unless for some reason you really really have to optimize for speed over size (and even then the added size may well blow out your CPU caches and negate the speed boost.)

—Jens

PS: Apparently C++11 allows Unicode string literals by putting a letter u or U in front of the initial quote; u"..." gives a string of char16_t and U"..." gives char32_t, while the older L prefix gives a string of wchar_t.
Darko Volaric
2017-08-08 01:39:29 UTC
Permalink
Actually a maximum of 4 bytes are required to encode a single valid code-point in UTF-8.

