Discussion:
[sqlite] How to search for fields with accents in UTF-8 data?
CC
2017-06-18 10:52:33 UTC
Permalink
Hello

I imported a CSV file where data are encoded in UTF-8.

Some of the characters (like Î) are not available in the ASCII table, so
I can't use the CLI sqlite3.exe to search.

As an alternative, I tried SQLite Studio, but it fails:

;Returns no record
SELECT COUNT(*) FROM MyTable WHERE REGION="Île-de-France";

;Returns the expected records
SELECT COUNT(*) FROM MyTable WHERE "LIBREG" LIKE "%le-de-France";

I found nothing in SQLite Studio's menus that could be related to
encoding so that I could tell it the DB contains UTF-8 instead of ANSI.

Is there another Windows application I could try that is more likely to
work with UTF-8 data?

Thank you.
Clemens Ladisch
2017-06-18 12:36:01 UTC
Permalink
Post by CC
I imported a CSV file where data are encoded in UTF-8.
Some of the characters (like Î) are not available in the ASCII table, so I can't use the CLI sqlite3.exe to search.
The latest version of sqlite3.exe might work.

Anyway, to check that whatever tool you're using uses Unicode correctly, execute this:

SELECT char(206), unicode('Î');

This should output Î and 206.
Post by CC
;Returns the expected records
SELECT COUNT(*) FROM MyTable WHERE "LIBREG" LIKE "%le-de-France";
Does "SELECT unicode(Libreg) FROM MyTable WHERE Libreg LIKE '%le-de-France' LIMIT 1"
return the correct code?
Post by CC
I found nothing in SQLite Studio's menus that could be related to encoding so that I could tell it the DB contains UTF-8 instead of ANSI.
The database API always uses Unicode.


Regards,
Clemens
hfiandor
2017-06-18 16:46:20 UTC
Permalink
Dear CC:
I use Spanish language, with letters with accents.

When I try to import from a .csv the data obtained in the table was "rare" and I have to implement some routines to convert "What the program read to the letter I want to write", and I solve the problem. I think in a prehistory way.

If you want, I can send to you this routines, and maybe you can suggest me a better way.

Yours,
Hfiandor.
Klaas Van B.
2017-06-19 06:41:38 UTC
Permalink
You can even make UTF-8 the default encoding in Windows as it is in SQLite

https://superuser.com/questions/239810/setting-utf8-as-default-character-encoding-in-windows-7
;>Returns no record
Post by CC
SELECT COUNT(*) FROM MyTable WHERE REGION="Île-de-France";
;>Returns the expected records
Post by CC
SELECT COUNT(*) FROM MyTable WHERE "LIBREG" LIKE "%le-de-France";
I found nothing in SQLite Studio's menus that could be related to
encoding so that I could tell it the DB contains UTF-8 instead of ANSI.
Is there another Windows application I could try that is more likely to
work with UTF-8 data?
Kind regards | Vriendelijke groeten | Cordiali saluti,
Klaas `Z4us` van Buiten V, Experienced Freelance ICT-Guy
https://www.linkedin.com/in/klaas-van-buiten-0325b2102
Gilles
2017-06-19 14:22:32 UTC
Permalink
Found the problem: Turns out the CSV file isn't in UTF8 but in CP1252 :-/

Icon.exe can be used to convert a file before importing it in SQLite.
https://dbaportal.eu/2012/10/24/iconv-for-windows/

Thanks everyone for the help.
Hick Gunter
2017-06-19 14:43:57 UTC
Permalink
CP1252 = Windows-1252 = ISO 8859-1 aka Latin-1, an extension of ASCII

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:sqlite-users-***@mailinglists.sqlite.org] Im Auftrag von Gilles
Gesendet: Montag, 19. Juni 2017 16:23
An: SQLite Maillist <sqlite-***@mailinglists.sqlite.org>
Betreff: Re: [sqlite] How to search for fields with accents in UTF-8 data?

Found the problem: Turns out the CSV file isn't in UTF8 but in CP1252 :-/

Icon.exe can be used to convert a file before importing it in SQLite.
https://dbaportal.eu/2012/10/24/iconv-for-windows/

Thanks everyone for the help.
_______________________________________________
sqlite-users mailing list
sqlite-***@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: ***@scigames.at

This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation.
Klaas Van B.
2017-06-19 06:51:29 UTC
Permalink
For some applications it is, for others not in all cases. For "just" accented characters it should be no problem following these instructions.
General advice: download OpenOffice or similar OpenSource packages. They are completely free and support almost all OS's.

Kind regards | Vriendelijke groeten | Cordiali saluti,
Klaas `Z4us` van Buiten V, Experienced Freelance ICT-Guy
https://www.linkedin.com/in/klaas-van-buiten-0325b2102

--------------------------------------------
On Mon, 19/6/17, Klaas Van B. <***@yahoo.com> wrote:

Subject: Re: How to search for fields with accents in UTF-8 data?
To: "SQLite Maillist" <sqlite-***@mailinglists.sqlite.org>
Cc: "CC" <***@free.fr>
Date: Monday, 19 June, 2017, 8:41

You can even make UTF-8 the default encoding
in Windows as it is in SQLite

https://superuser.com/questions/239810/setting-utf8-as-default-character-encoding-in-windows-7
Post by CC
As an alternative, I tried SQLite
Studio, but it fails:

;>Returns no record
Post by CC
SELECT COUNT(*) FROM MyTable WHERE
REGION="Île-de-France";

;>Returns the expected records
Post by CC
SELECT COUNT(*) FROM MyTable WHERE
"LIBREG" LIKE "%le-de-France";
Post by CC
I found nothing in SQLite Studio's
menus that could be related to
Post by CC
encoding so that I could tell it
the DB contains UTF-8 instead of ANSI.
Post by CC
Is there another Windows
application I could try that is more likely to
Post by CC
work with UTF-8 data?
Kind regards | Vriendelijke groeten |
Cordiali saluti,
Klaas `Z4us` van Buiten V, Experienced
Freelance ICT-Guy
https://www.linkedin.com/in/klaas-van-buiten-0325b2102
Gilles
2017-06-19 10:13:10 UTC
Permalink
Thanks everyone.

It looks running sqlite3.exe in a terminal window (CMD) in Windows 7
doesn't work: Apparently, it doesn't support UTF-8.

And when using DB Browser for SQLite, it does work only if I copy/paste
the output with the "?" where an accented character lives:
Loading Image...

Could the problem be with fonts not supporting UTF8?

I'd rather not mess with Windows encoding, especially since one of the
answers in SuperUser says that Windows only partially support Unicode.

Here's the output of the commands:

sqlite> SELECT unicode(Libreg) FROM MyTable WHERE Libreg LIKE
'%le-de-France' LIMIT 1;
65533
sqlite> SELECT char(206), unicode('I');
I;73
sqlite> SELECT hex('Ile-de-France');
496C652D64652D4672616E6365
sqlite> SELECT hex(region) FROM MyTable WHERE LIBREG like
'%le-de-France' LIMIT 1;
Error: no such column: region
sqlite> SELECT hex(libreg) FROM MyTable WHERE LIBREG like
'%le-de-France' LIMIT 1;
CE6C652D64652D4672616E6365

PS: I might be breaking the thread in the mailing list. For some reason,
the SQLite mailing lists refuses my post from Nabble although I used the
same email address to register 1) with Nabble and 2) with the SQLite
mailing list http://sqlite.1065341.n5.nabble.com/
Simon Slavin
2017-06-19 13:20:06 UTC
Permalink
It looks running sqlite3.exe in a terminal window (CMD) in Windows 7 doesn't work: Apparently, it doesn't support UTF-8.
Correct. And the "it" that doesn’t support UTF-8 is the Windows console. SQLite works fine and handles everything as Unicode internally. The Windows console won’t process multibyte characters internally and can’t display them correctly.

<https://msdn.microsoft.com/en-us/library/windows/desktop/dd317752(v=vs.85).aspx>

"many legacy applications continue to use character sets based on code pages. Even new applications sometimes have to work with code pages, often for one of the following reasons: […]
• To communicate with the Windows Console, which does not support Unicode."


Some people have found ways to hack around this, but they simulate compliance for a certain codepage rather than implement UTF-8 globally.

<https://www.curlybrace.com/words/2014/10/03/windows-console-and-doublemulti-byte-character-set/>

"The Windows Console doesn’t support Unicode."

<https://social.technet.microsoft.com/Forums/sharepoint/en-US/c42a0300-1803-475d-9438-d39e6672cc69/unicode-characters-in-powershell>

Simon.
Olivier Mascia
2017-06-19 15:01:19 UTC
Permalink
Post by Simon Slavin
It looks running sqlite3.exe in a terminal window (CMD) in Windows 7 doesn't work: Apparently, it doesn't support UTF-8.
Correct. And the "it" that doesn’t support UTF-8 is the Windows console. SQLite works fine and handles everything as Unicode internally. The Windows console won’t process multibyte characters internally and can’t display them correctly.
<https://msdn.microsoft.com/en-us/library/windows/desktop/dd317752(v=vs.85).aspx>
"many legacy applications continue to use character sets based on code pages. Even new applications sometimes have to work with code pages, often for one of the following reasons: […]
• To communicate with the Windows Console, which does not support Unicode."
Some people have found ways to hack around this, but they simulate compliance for a certain codepage rather than implement UTF-8 globally.
<https://www.curlybrace.com/words/2014/10/03/windows-console-and-doublemulti-byte-character-set/>
"The Windows Console doesn’t support Unicode."
<https://social.technet.microsoft.com/Forums/sharepoint/en-US/c42a0300-1803-475d-9438-d39e6672cc69/unicode-characters-in-powershell>
Simon.
Switch the console I/O (windows only of course) of sqlite3 shell.c to use WriteConsoleW and ReadConsoleW, and there you go and forget about CHCP, codepages... Learned it the hard way last year, well after at some point I though DBCS would be enough. No.
--
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software
Hick Gunter
2017-06-19 07:34:50 UTC
Permalink
What do the following statements return, when run in sqlite3.exe (Please note that single quotes are SQLite3 string delimiters):

SELECT hex('Île-de-France');

SELECT hex(region) FROM MyTable WHERE LIBREG like '%le-de-France' LIMIT 1;

I expect one of them is ISO (lead character > 7F) and the other UTF8 (2 character sequence), so they can never match.
Alternatively, I have also seen "double conversion" ISO -> UTF8 when the encoding was already UTF8 but the conversion ISO -> UTF8 was performed anyway.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:sqlite-users-***@mailinglists.sqlite.org] Im Auftrag von CC
Gesendet: Sonntag, 18. Juni 2017 12:53
An: sqlite-***@mailinglists.sqlite.org
Betreff: [sqlite] How to search for fields with accents in UTF-8 data?

Hello

I imported a CSV file where data are encoded in UTF-8.

Some of the characters (like Î) are not available in the ASCII table, so I can't use the CLI sqlite3.exe to search.

As an alternative, I tried SQLite Studio, but it fails:

;Returns no record
SELECT COUNT(*) FROM MyTable WHERE REGION="Île-de-France";

;Returns the expected records
SELECT COUNT(*) FROM MyTable WHERE "LIBREG" LIKE "%le-de-France";

I found nothing in SQLite Studio's menus that could be related to encoding so that I could tell it the DB contains UTF-8 instead of ANSI.

Is there another Windows application I could try that is more likely to work with UTF-8 data?

Thank you.

_______________________________________________
sqlite-users mailing list
sqlite-***@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: ***@scigames.at

This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation.
David Raymond
2017-06-19 15:22:00 UTC
Permalink
The Windows command prompt and unicode have always not played well with each other. SQLite itself works perfectly with data on disk or in the database, there are just translation and display problems when going to and from the command prompt.

If you write out your query in, say, Notepad++ and save it in UTF-8, then you can do ".read queryFile.txt" from the CLI and be sure that it's reading it ok. (Assuming of course your DB isn't using one of the UTF-16 options) The output may still look weird if it would include accented characters, but anything like count(*) or unicode(something) that return numbers, or anything that's ASCII will always look ok.


foo.txt: Saved in UTF-8

.bail on
.echo on
create table if not exists foo (foo text collate nocase);
insert or ignore into foo values ('Île-de-France');
select * from foo;
select char(206), unicode('Î');
select count(*) from foo where foo = 'Île-de-France';

end foo.txt



D:\Temp>sqlite3
SQLite version 3.19.3 2017-06-08 14:26:16
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> .read foo.txt
create table if not exists foo (foo text collate nocase);
Run Time: real 0.001 user 0.000000 sys 0.015600
insert or ignore into foo values ('Île-de-France');
Run Time: real 0.000 user 0.000000 sys 0.000000
select * from foo;
--EQP-- 0,0,0,SCAN TABLE foo
foo
Île-de-France
Run Time: real 0.001 user 0.000000 sys 0.000000
select char(206), unicode('Î');
char(206)|unicode('Î')
Î|206
Run Time: real 0.000 user 0.000000 sys 0.000000
select count(*) from foo where foo = 'Île-de-France';
--EQP-- 0,0,0,SCAN TABLE foo
count(*)
1
Run Time: real 0.000 user 0.000000 sys 0.000000

sqlite>
Winfried [via SQLite]
2017-06-20 12:26:00 UTC
Permalink
Thanks for the infos.




______________________________________
If you reply to this email, your message will be added to the discussion below:
http://sqlite.1065341.n5.nabble.com/How-to-search-for-fields-with-accents-in-UTF-8-data-tp96249p96294.html
This email was sent by Winfried (via Nabble)
To receive all replies by email, subscribe to this discussion: http://sqlite.1065341.n5.nabble.com/template/NamlServlet.jtp?macro=subscribe_by_code&node=96249&code=c3FsaXRlLXVzZXJzQG1haWxpbmdsaXN0cy5zcWxpdGUub3JnfDk2MjQ5fC0xNDUwNjI0MDQ5
Gilles
2017-06-20 12:34:13 UTC
Permalink
Posted by David Raymond on Jun 19, 2017; 5:22pm > /The Windows command
prompt and unicode have always not played well with each other. SQLite
itself works perfectly with data on disk or in the database, there are
just translation and display problems when going to and from the command
prompt. /

Thanks much for the infos.

Lessons I learned:

1. In CSV files, double-check how data are encoded

2. Do not use the sqlite3.exe CLI if the data use anything more than the
basic latin alphabet. Instead, use a GUI application (eg. for Windows,
SQLite Studio, SQLitespeed, etc.)

Thank you all.
R Smith
2017-06-20 13:24:12 UTC
Permalink
Post by Gilles
1. In CSV files, double-check how data are encoded
2. Do not use the sqlite3.exe CLI if the data use anything more than
the basic latin alphabet. Instead, use a GUI application (eg. for
Windows, SQLite Studio, SQLitespeed, etc.)
Every lesson is valuable! Just to be clear - there is nothing wrong
with using the CLI. When pointing it to a file that is correctly encoded
the import must work correctly (if not, it's a bug) - It's just
difficult to enter weird and wonderful Unicode characters outside the
BMP basic Latin plane (the first 127 code-points) via the console, or do
queries using them, all because the Windows console specifically is not
Unicode-enabled.

As an aside - I never understood the reasons for that. I get that
Windows has a less "techy" clientèle than Linux for instance, and that
the backwards compatibility is paramount, and that no console command
ever need fall outside the 7-bit ANSI range of characters... but geez,
how much effort can it be to make it Unicode-friendly? It's not like the
Windows API lacks any Unicode functionality - even Notepad can handle it
masterfully.
J. King
2017-06-20 13:32:57 UTC
Permalink
Indeed. Technically-minded Windows users do exist (Hi, Microsoft, I'm right here!), and I have neither the time nor the inclination to learn PowerShell when the Windows terminal is already adequate---with a set of ports of GNU tools, anyway. :)
Post by R Smith
Post by Gilles
1. In CSV files, double-check how data are encoded
2. Do not use the sqlite3.exe CLI if the data use anything more than
the basic latin alphabet. Instead, use a GUI application (eg. for
Windows, SQLite Studio, SQLitespeed, etc.)
Every lesson is valuable! Just to be clear - there is nothing wrong
with using the CLI. When pointing it to a file that is correctly encoded
the import must work correctly (if not, it's a bug) - It's just
difficult to enter weird and wonderful Unicode characters outside the
BMP basic Latin plane (the first 127 code-points) via the console, or do
queries using them, all because the Windows console specifically is not
Unicode-enabled.
As an aside - I never understood the reasons for that. I get that
Windows has a less "techy" clientèle than Linux for instance, and that
the backwards compatibility is paramount, and that no console command
ever need fall outside the 7-bit ANSI range of characters... but geez,
how much effort can it be to make it Unicode-friendly? It's not like the
Windows API lacks any Unicode functionality - even Notepad can handle it
masterfully.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Simon Slavin
2017-06-20 13:56:36 UTC
Permalink
Every lesson is valuable! Just to be clear - there is nothing wrong with using the CLI. When pointing it to a file that is correctly encoded the import must work correctly (if not, it's a bug) - It's just difficult to enter weird and wonderful Unicode characters outside the BMP basic Latin plane (the first 127 code-points) via the console, or do queries using them, all because the Windows console specifically is not Unicode-enabled.
To clarify the clarification, you can use the SQLite shell tool just fine as long as you use it to process files, rather than expect characters which are entered through the keyboard or shown on the display to work. So use ".read" or ".output" or ".once", and then use a non-console program to view the results. Don’t type your text and view the results on the display.

I don’t know the technical details of how windows does piping within the console. It may or may not work to use command lines with ">" or "|" in.
As an aside - I never understood the reasons for that. I get that Windows has a less "techy" clientèle than Linux for instance, and that the backwards compatibility is paramount, and that no console command ever need fall outside the 7-bit ANSI range of characters... but geez, how much effort can it be to make it Unicode-friendly? It's not like the Windows API lacks any Unicode functionality - even Notepad can handle it masterfully.
The console you see is pretty-much the one which was in Windows 3.1. It does not use the modern API written post-unicode, it calls the old single-character Windows routines which are still in Windows so old programs don’t suddenly stop working. It has numerous parts which assume

one keypress == one character == one octet == one space on the display

these assumptions are not only in the code for the console itself but in the Windows routines it calls to do the work. Rewriting the console to use the newer API calls, and also deal with the above assumptions not being true would be so major that Microsoft might as well start again from scratch. And I’d expect the resulting code to be two or three times the size, and to be slower in execution.

This affects Powershell too, since Powershell runs inside the console. But Powershell might do piping correctly, or be an improvement on the original shell in some other way. I don’t use Windows at work so I don’t know.

Simon.
Olivier Mascia
2017-06-20 14:46:12 UTC
Permalink
Post by Simon Slavin
The console you see is pretty-much the one which was in Windows 3.1. It does not use the modern API written post-unicode, it calls the old single-character Windows routines which are still in Windows so old programs don’t suddenly stop working. It has numerous parts which assume
one keypress == one character == one octet == one space on the display
The console I/O can be used to enter (keyboard) not-so-basic unicode codepoints and the output can be sent pretty much anything you want, as long as the font selected in the console has been built with enough support for the eventually exotic glyphs you want to output. That's the application you run in the console that has to use the right APIs for the job (ReadConsoleW/WriteConsoleW). The support of these goes back to Windows 2000.

With some modified sqlite3.exe using those, here is what I can type interactively at the console:

SQLite version 3.19.3 2017-06-08 14:26:16
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table é€(a text);
sqlite> insert into é€ values('ñ');
sqlite> select * from é€;
ñ
sqlite> .once a.txt
sqlite> .dump
sqlite> .q

Here is the content of the file as seen on console when doing "type a.txt":

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE ├®Ôé¼(a text);
INSERT INTO "├®Ôé¼" VALUES('├▒');
COMMIT;

You can see the UTF-8 output sent to the dump file is correct - open it with notepad for instance. Side note: the system command "type" is not able to display it unless you do a CHCP 65001. It just proves I could type characters on the keyboard input which do not follow the concept 'one keypress - one character - one octet - one space on the display'.
--
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software
Olivier Mascia
2017-06-20 14:17:37 UTC
Permalink
As an aside - I never understood the reasons for that. I get that Windows has a less "techy" clientèle than Linux for instance, and that the backwards compatibility is paramount, and that no console command ever need fall outside the 7-bit ANSI range of characters... but geez, how much effort can it be to make it Unicode-friendly? It's not like the Windows API lacks any Unicode functionality - even Notepad can handle it masterfully.
I wouldn't like looking like I'm trolling this subject, but this is only a matter of I/O functions used by programs built to interact with the display and keyboard when run in a console. Windows needs those programs to use ReadConsoleW/WriteConsoleW to do the proper thing. Those programs using C library to read or output byte streams can't do anything equivalent no matter what 'codepage' is set to be used or to/from what DBCS the program attempts conversion to/from.

I learned this postulat here last year and have had excellent success with console I/O ever since in my programmings.

To be complete, regarding proper display of the output, there is a secondary consideration. The fonts available in Windows are far from covering a large subset of the glyphs. For eastern languages on a western Windows edition, you generally need to change your console settings to make it use another font than the default one, just so that it can draw the needed glyphs. But the basic thing to do is get the program running in the console (here we are talking shell.c - sqlite3.exe) to output Windows wide-chars using the function WriteConsoleW(). And use ReadConsoleW() to read wide-chars chunks from the console input, before converting internally to UTF-8 or whatever wanted.

Sqlite3 shell.c when patched that way is as pleasant to use on Windows console as it can be on a modern Linux or macOS.

Input files feeded to sqlite3.exe need to be in UTF-8, as well as output sent by sqlite3.exe will be: that part is perfectly OK today in sqlite3.exe. Only the keyboard reading and console output writing lacks a little.
but geez, how much effort can it be to make it Unicode-friendly?
To further comment on a more general plane than the sqlite3.exe, the issue is deeper in Windows than in its console. Once upon a time (!), they made the choice of 16 bits per characters encoding as the *right* way (their right way!) to do Unicode. It took time for this to evolve, recognizing the need for multi-16 bits words encoding (UTF-16), so they could have chosen UTF-8 from day one, but that was not what history recorded. Later UTF-8 got *some* support in the OS (through conversion functions). But never UTF-8 was raised to full citizenship. There is even a CHCP 65001 to set the 'codepage' to UTF-8. It works partly in some circumstances, but is far from being 'right'. No matter what you would do, there is no way for any file I/O primitive of the OS to take an UTF-8 string as a filename. And this extend to the C-library on Windows platform. The only unicode support is to pass a UTF-16 filename through functions ending with a W in the name. Those 'ansi' functions, ending with an A in the name, are merely wrappers converting to the wide chars versions. There have been numerous requests to Microsoft to let people and developers set the ANSI codepage to UTF-8 so that the file I/O functions taking a narrow char filename string can interpret it as UTF-8. Some are still waiting for that day to come, others use the W-side of things, complicating portability of their codebase. :)
--
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software
Scott Robison
2017-06-20 18:04:21 UTC
Permalink
Post by Olivier Mascia
As an aside - I never understood the reasons for that. I get that Windows has a less "techy" clientèle than Linux for instance, and that the backwards compatibility is paramount, and that no console command ever need fall outside the 7-bit ANSI range of characters... but geez, how much effort can it be to make it Unicode-friendly? It's not like the Windows API lacks any Unicode functionality - even Notepad can handle it masterfully.
I wouldn't like looking like I'm trolling this subject, but this is only a matter of I/O functions used by programs built to interact with the display and keyboard when run in a console. Windows needs those programs to use ReadConsoleW/WriteConsoleW to do the proper thing. Those programs using C library to read or output byte streams can't do anything equivalent no matter what 'codepage' is set to be used or to/from what DBCS the program attempts conversion to/from.
I learned this postulat here last year and have had excellent success with console I/O ever since in my programmings.
About a year ago I had to write an emergency fixup tool for my
employer because of a backward breaking change at Microsoft that was
almost certainly due to a breakdown in revision control. The tool
needed to be localized, but it was sufficiently simple that a console
mode executable was sufficient. I had to jump through hoops to make it
work, but (by way of confirmation), the problems were in the CRT, not
the Win32 API. It was able to write and read Unicode ...
Post by Olivier Mascia
To be complete, regarding proper display of the output, there is a secondary consideration. The fonts available in Windows are far from covering a large subset of the glyphs. For eastern languages on a western Windows edition, you generally need to change your console settings to make it use another font than the default one, just so that it can draw the needed glyphs. But the basic thing to do is get the program running in the console (here we are talking shell.c - sqlite3.exe) to output Windows wide-chars using the function WriteConsoleW(). And use ReadConsoleW() to read wide-chars chunks from the console input, before converting internally to UTF-8 or whatever wanted.
... assuming of course that the locale was using a font that supported
the character set for that area. This was true for our purposes by
default, as we weren't expecting English speaking customers to need to
see Asian languages.
Post by Olivier Mascia
Sqlite3 shell.c when patched that way is as pleasant to use on Windows console as it can be on a modern Linux or macOS.
Input files feeded to sqlite3.exe need to be in UTF-8, as well as output sent by sqlite3.exe will be: that part is perfectly OK today in sqlite3.exe. Only the keyboard reading and console output writing lacks a little.
Agreed.
Post by Olivier Mascia
but geez, how much effort can it be to make it Unicode-friendly?
To further comment on a more general plane than the sqlite3.exe, the issue is deeper in Windows than in its console. Once upon a time (!), they made the choice of 16 bits per characters encoding as the *right* way (their right way!) to do Unicode. It took time for this to evolve, recognizing the need for multi-16 bits words encoding (UTF-16), so they could have chosen UTF-8 from day one, but that was not what history recorded. Later UTF-8 got *some* support in the OS (through conversion functions). But never UTF-8 was raised to full citizenship. There is even a CHCP 65001 to set the 'codepage' to UTF-8. It works partly in some circumstances, but is far from being 'right'. No matter what you would do, there is no way for any file I/O primitive of the OS to take an UTF-8 string as a filename. And this extend to the C-library on Windows platform. The only unicode support is to pass a UTF-16 filename through functions ending with a W in the name. Those 'ansi' functions, ending with an A in the name, are merely wrappers converting to the wide chars versions. There have been numerous requests to Microsoft to let people and developers set the ANSI codepage to UTF-8 so that the file I/O functions taking a narrow char filename string can interpret it as UTF-8. Some are still waiting for that day to come, others use the W-side of things, complicating portability of their codebase. :)
Windows NT was released in 1993. It had been in development for years.
It decided Unicode for I18N/L10N/W6R (WhateveR) purposes was better
than a ton of different code pages. At the point Microsoft committed
to Unicode, it was a two byte / sixteen bit encoding. There was no
UTF-8. There was no UTF-16. Other than endian issues, there was
nothing to worry about. Win32 was an "all new" API.

POSIX people didn't want to re-write the entire API to support 16 bit
characters, so they came up with the FSS-UTF (File System Safe UCS
Transformation Format) alternative that eventually led to what we know
as UTF-8 today. Had Microsoft made the decision to implement a
variable width encoding of Unicode on their own, I dare say they'd be
excoriated for embrace/extend/extinguish practices! UTF-8 wasn't a
thing until late 1992 and wasn't presented until 1993, long after
Microsoft had committed to UCS. Unicode and ISO-10646 had not yet
unified when Microsoft made their call, and Unicode 1 was only 16 bits
(but ISO-10646 was 31 bits). By the time of Unicode 2 UTF-16 was
developed to give early Unicode adopters a means of accessing code
points beyond the basic multilingual plane & the code space was
limited to 17 planes. Not until 2003 was UTF-8 finally limited to the
current 4 byte form from the original six byte form.

Microsoft wasn't the only organization to commit to UCS-2.
https://en.wikipedia.org/wiki/UTF-16#Usage provides a list of others.

Regardless, I prefer UTF-8 to UCS-2 / UTF-16. Microsoft has certainly
had time to make their interfaces more UTF-8 friendly. I just don't
think they get enough credit for committing to Unicode in an era when
few were.
Olivier Mascia
2017-06-20 21:23:51 UTC
Permalink
Post by Scott Robison
Regardless, I prefer UTF-8 to UCS-2 / UTF-16. Microsoft has certainly
had time to make their interfaces more UTF-8 friendly. I just don't
think they get enough credit for committing to Unicode in an era when
few were.
We are in full agreement, here. :)
--
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software
Gilles
2017-06-20 13:54:03 UTC
Permalink
R Smith >
Post by R Smith
Post by Gilles
2. Do not use the sqlite3.exe CLI if the data use anything more than
the basic latin alphabet. Instead, use a GUI application (eg. for
Windows, SQLite Studio, SQLitespeed, etc.)
Every lesson is valuable! Just to be clear - there is nothing wrong
with using the CLI. When pointing it to a file that is correctly encoded
the import must work correctly (if not, it's a bug) - It's just
difficult to enter weird and wonderful Unicode characters outside the
BMP basic Latin plane (the first 127 code-points) via the console, or do
queries using them, all because the Windows console specifically is not
Unicode-enabled.

Yes, I should have been more precise: Using the CLI for importing data
works fine; It's when typing accented characers that it fails.
Post by R Smith
As an aside - I never understood the reasons for that.
Beats me. Maybe there is some legacy code somewhere deep in Windows'
bowels that explains why the console (cmd.exe) isn't yet Unicode-capable.
Gilles
2017-06-20 14:12:57 UTC
Permalink
Simon Slavin-3 > To clarify the clarification, you can use the SQLite
shell tool just fine as long as you use it to process files, rather than
expect characters which are entered through the keyboard or shown on the
display to work. So use ".read" or ".output" or ".once", and then use a
non-console program to view the results. Don’t type your text and view
the results on the display.

Looks like using an alternative shell solves the problem:

http://conemu.github.io/
Keith Medcalf
2017-06-20 15:40:18 UTC
Permalink
Post by R Smith
Every lesson is valuable! Just to be clear - there is nothing wrong
with using the CLI. When pointing it to a file that is correctly encoded
the import must work correctly (if not, it's a bug) - It's just
difficult to enter weird and wonderful Unicode characters outside the
BMP basic Latin plane (the first 127 code-points) via the console, or do
queries using them, all because the Windows console specifically is not
Unicode-enabled.
The Windows Console is fully UNICODE enabled and always has been, ever since the beginning of time. It does not, however, support any character encoding other than the proprietary Microsoft DBCS/MBCS, and then those only when one uses specific console I/O -- just like you have to use specific I/O in a gooey application -- streams do not have the code necessary to perform anything other than ANSI Code Page I/O. Specifically you CANNOT send any encoded anything to a stdio stream and expect it to get decoded. What you send (and what you receive from the keyboard) is ANSI only.

This is the design of Windows.
Post by R Smith
As an aside - I never understood the reasons for that. I get that
Windows has a less "techy" clientèle than Linux for instance, and that
the backwards compatibility is paramount, and that no console command
ever need fall outside the 7-bit ANSI range of characters... but geez,
how much effort can it be to make it Unicode-friendly? It's not like the
Windows API lacks any Unicode functionality - even Notepad can handle it
masterfully.
It is unicode friendly, and unicode is fully supported.
The I/O streams do not support encoded byte streams is all.
Loading...