Discussion:
[sqlite] Finding records containing non-ascii characters
Cecil Westerhof
2016-01-29 09:00:36 UTC
Permalink
In H2 I can find records containing non-ascii characters with:
SELECT *
FROM <TABLE>
WHERE STRINGENCODE(<COLUMN>) LIKE '%\\u%'

Is something like this also possible with SQLite?
--
Cecil Westerhof
Richard Hipp
2016-01-29 09:12:52 UTC
Permalink
Post by Cecil Westerhof
SELECT *
FROM <TABLE>
WHERE STRINGENCODE(<COLUMN>) LIKE '%\\u%'
Is something like this also possible with SQLite?
Perhaps something like this:

SELECT * FROM <table> WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');

Note, however, that the "*[" combination in a GLOB pattern in SQLite
is inefficient. So the above might be slow for a large amount of text.
Perhaps a better approach would be to create an application-defined
function to do the search.

SELECT * FROM <table> WHERE contains_non_ascii(<column>);
--
D. Richard Hipp
***@sqlite.org
Cecil Westerhof
2016-01-29 12:53:02 UTC
Permalink
Post by Richard Hipp
Post by Cecil Westerhof
SELECT *
FROM <TABLE>
WHERE STRINGENCODE(<COLUMN>) LIKE '%\\u%'
Is something like this also possible with SQLite?
​​
SELECT * FROM <table>
​​
WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
I tried it and it looks like it works. But I do not understand how it
works. Could you enlighten me?
Post by Richard Hipp
Note, however, that the "*[" combination in a GLOB pattern in SQLite
is inefficient. So the above might be slow for a large amount of text.
Perhaps a better approach would be to create an application-defined
function to do the search.
SELECT * FROM <table> WHERE contains_non_ascii(<column>);
​I am working (in this case) with bash. That is not possible then I think.
Or is it?

At the moment I do not need it a lot, so probably not a big problem (yet).
--
Cecil Westerhof
Dominique Devienne
2016-01-29 13:39:53 UTC
Permalink
Post by Cecil Westerhof
Post by Richard Hipp
Post by Cecil Westerhof
SELECT *
FROM <TABLE>
WHERE STRINGENCODE(<COLUMN>) LIKE '%\\u%'
Is something like this also possible with SQLite?
SELECT * FROM <table>
WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
I tried it and it looks like it works. But I do not understand how it
works. Could you enlighten me?
char() returns text with 1 (unicode) char per unicode code point integer
used as argument to this function.
So char(1,45,127) return first-ascii-char - (i.e. dash) last-ascii-char
(unicode is a superset of ascii).
So the same way a-z is the character range between a and z, i.e. all
lower-case letters
NUL - DEL is the full range of ascii characters. (char(1) = NUL, char(127)
= DEL)
[^...] is the negation of the characters inside.
So [^NUL-DEL] is all non-ascii (unicode) characters. It matches a single
character.
And *[^NULL-DEL]* allows that 1 character to be matched anywhere.
* is the kleen star, which matches 0 or more arbitrary characters.

Note that glob above is not unicode aware per se. It operates on UTF-8
encoded strings,
and in that encoding, all non-ascii characters are encoded into multiple
bytes, which (the bytes) all are
in the 128-255 range. So it works indeed.
Post by Cecil Westerhof
Note, however, that the "*[" combination in a GLOB pattern in SQLite
Post by Richard Hipp
is inefficient. So the above might be slow for a large amount of text.
because the kleen star is "griddy" by default.
Post by Cecil Westerhof
Post by Richard Hipp
Perhaps a better approach would be to create an application-defined
function to do the search.
SELECT * FROM <table> WHERE contains_non_ascii(<column>);
​I am working (in this case) with bash. That is not possible then I
think.Or is it?
You can create a C shared library SQLite extension, that you .load in the
sqlite3 shell.
So that works in bash too, but you need to do some C programming. --DD
Cecil Westerhof
2016-01-29 14:18:13 UTC
Permalink
Post by Dominique Devienne
Post by Cecil Westerhof
Post by Richard Hipp
Post by Cecil Westerhof
SELECT *
FROM <TABLE>
WHERE STRINGENCODE(<COLUMN>) LIKE '%\\u%'
Is something like this also possible with SQLite?
SELECT * FROM <table>
WHERE <column> GLOB ('*[^'||char(1,45,127)||']*');
I tried it and it looks like it works. But I do not understand how it
works. Could you enlighten me?
char() returns text with 1 (unicode) char per unicode code point integer
used as argument to this function.
​​
So char(1,45,127) return first-ascii-char - (i.e. dash) last-ascii-char
(unicode is a superset of ascii).
So the same way a-z is the character range between a and z, i.e. all
lower-case letters
NUL - DEL is the full range of ascii characters. (char(1) = NUL, char(127)
= DEL)
​I should have been a little more persevering: I did not understand the
usage of char, but I could/should have worked it out.

Thanks.


Post by Dominique Devienne
[^...] is the negation of the characters inside.
So [^NUL-DEL] is all non-ascii (unicode) characters. It matches a single
character.
And *[^NULL-DEL]* allows that 1 character to be matched anywhere.
* is the kleen star, which matches 0 or more arbitrary characters.
Note that glob above is not unicode aware per se. It operates on UTF-8
encoded strings,
and in that encoding, all non-ascii characters are encoded into multiple
bytes, which (the bytes) all are
in the 128-255 range. So it works indeed.
Post by Cecil Westerhof
Note, however, that the "*[" combination in a GLOB pattern in SQLite
Post by Richard Hipp
is inefficient. So the above might be slow for a large amount of text.
because the kleen star is "griddy" by default.
Post by Cecil Westerhof
Post by Richard Hipp
Perhaps a better approach would be to create an application-defined
function to do the search.
SELECT * FROM <table> WHERE contains_non_ascii(<column>);
​I am working (in this case) with bash. That is not possible then I
think.Or is it?
You can create a C shared library SQLite extension, that you .load in the
sqlite3 shell.
So that works in bash too, but you need to do some C programming. --DD
​In principal I share my things, so that could make things a little ​

​complicated, but I'll cross that bridge when I get there.
--
Cecil Westerhof
Continue reading on narkive:
Loading...