Discussion:
[sqlite] Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?
邱朗
2018-09-21 02:21:40 UTC
Permalink
Hi,
I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese Korean I verify my sqlite supports fts5


sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5


sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = 'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work for chinese');
sqlite> select * from ft5_test where ft5_test = '中文';
sqlite>
sqlite> select * from ft5_test where ft5_test = 'Chinese';
为什么不支持中文 fts5 does not seem to work for chinese


But to my surprise it can't find any CJK word at all. Why is that ?
Thanks,
Qiulang
Scott Robison
2018-09-21 05:03:54 UTC
Permalink
Post by 邱朗
Hi,
I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese
Korean I verify my sqlite supports fts5
{snipped}
But to my surprise it can't find any CJK word at all. Why is that ?
Based on my experience with such things, I suspect that the tokenizer
requires whitespace between adjacent words, which is not the case with CJK.
Word breaks are implicit, not explicit.

Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
for software at work that used functionality from ICU to find CJK word
boundaries, so I believe it is possible, just not as straightforward as
whitespace delimited words.
邱朗
2018-09-21 06:01:38 UTC
Permalink
https://www.sqlite.org/fts5.html said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... " I really doubt unicode tokenizer requires white space, that is ascii tokenizer.


That was why I thought it might work for CJK.


Qiulang
Post by Scott Robison
Post by 邱朗
Hi,
I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese
Korean I verify my sqlite supports fts5
{snipped}
But to my surprise it can't find any CJK word at all. Why is that ?
Based on my experience with such things, I suspect that the tokenizer
requires whitespace between adjacent words, which is not the case with CJK.
Word breaks are implicit, not explicit.
Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
for software at work that used functionality from ICU to find CJK word
boundaries, so I believe it is possible, just not as straightforward as
whitespace delimited words.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Scott Robison
2018-09-21 06:14:56 UTC
Permalink
Post by 邱朗
https://www.sqlite.org/fts5.html said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... " I really doubt unicode tokenizer requires white space, that is ascii tokenizer.
Forgive my imprecise use of language. I should have said separators
instead of whitespace. Regardless, CJK uses implicit separation
between words, and that description seems to indicate that the unicode
tokenizer expects explicit separators (be they whitespace or
punctuation or something else) between tokens.
Post by 邱朗
That was why I thought it might work for CJK.
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
邱朗
2018-09-21 06:38:49 UTC
Permalink
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?


./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"


Thanks,
Qiulang
Scott Robison
2018-09-21 06:42:35 UTC
Permalink
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
I do not have macOS experience, and I don't have a lot of free time
either. Mainly I was trying to explain what I had done in the past and
what I used. Hopefully someone else will chime in with more details
about the Unicode61 tokenizer and whether it is standalone or depends
on ICU.
Dan Kennedy
2018-09-21 09:32:38 UTC
Permalink
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
Can you post the complete output of the failed build attempt? Thanks.

Dan.
Post by 邱朗
Thanks,
Qiulang
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
邱朗
2018-09-21 10:21:03 UTC
Permalink
Hi,

Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html


I am using macOS 10.13 & xcode 10


Undefined symbols for architecture x86_64:
"_u_errorName_62", referenced from:
_icuFunctionError in sqlite3.o
"_u_foldCase_62", referenced from:
_icuOpen in sqlite3.o
_icuLikeCompare in sqlite3.o
"_u_isspace_62", referenced from:
_icuNext in sqlite3.o
"_u_strToLower_62", referenced from:
_icuCaseFunc16 in sqlite3.o
"_u_strToUTF8_62", referenced from:
_icuNext in sqlite3.o
"_u_strToUpper_62", referenced from:
_icuCaseFunc16 in sqlite3.o
"_ubrk_close_62", referenced from:
_icuClose in sqlite3.o
"_ubrk_current_62", referenced from:
_icuNext in sqlite3.o
"_ubrk_first_62", referenced from:
_icuOpen in sqlite3.o
"_ubrk_next_62", referenced from:
_icuNext in sqlite3.o
"_ubrk_open_62", referenced from:
_icuOpen in sqlite3.o
"_ucol_close_62", referenced from:
_icuLoadCollation in sqlite3.o
_icuCollationDel in sqlite3.o
"_ucol_open_62", referenced from:
_icuLoadCollation in sqlite3.o
"_ucol_strcoll_62", referenced from:
_icuCollationColl in sqlite3.o
"_uregex_close_62", referenced from:
_icuRegexpDelete in sqlite3.o
"_uregex_matches_62", referenced from:
_icuRegexpFunc in sqlite3.o
"_uregex_open_62", referenced from:
_icuRegexpFunc in sqlite3.o
"_uregex_setText_62", referenced from:
_icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1
Post by Dan Kennedy
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
Can you post the complete output of the failed build attempt? Thanks.
Dan.
Dan Kennedy
2018-09-21 13:52:30 UTC
Permalink
Post by 邱朗
Hi,
Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
I am using macOS 10.13 & xcode 10
The text below is just the error. If you post the compiler commands that
appear before it in the build log somebody might be able to spot the
problem.

From the error message, it may be that you have mismatched ICU header
and library files, or it may be that not all required ICU libraries are
being linked. If you remove the -I... and -L... switches from your
command line does it make any difference?

Dan.
Post by 邱朗
_icuFunctionError in sqlite3.o
_icuOpen in sqlite3.o
_icuLikeCompare in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuClose in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationDel in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationColl in sqlite3.o
_icuRegexpDelete in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1
Post by Dan Kennedy
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
Can you post the complete output of the failed build attempt? Thanks.
Dan.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
邱朗
2018-09-21 14:44:06 UTC
Permalink
I actually first used ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`" But I got the error


sqlite3.c:184184:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>


Then I added -I -L switches and if I remembered correct I used brew to install icu4c. The compiler command are these


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ make
/bin/sh ./libtool --tag=CC --mode=compile gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" -DPACKAGE_STRING=\"sqlite\ 3.25.1\" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c -o sqlite3.lo sqlite3.c
libtool: compile: gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -fno-common -DPIC -o .libs/sqlite3.o
libtool: compile: gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -o sqlite3.o >/dev/null 2>&1
mv -f .deps/sqlite3.Tpo .deps/sqlite3.Plo
/bin/sh ./libtool --tag=CC --mode=link gcc -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -no-undefined -version-info 8:6:8 -L/usr/local/opt/icu4c/lib -o libsqlite3.la -rpath /usr/local/lib sqlite3.lo -lz
libtool: link: gcc -dynamiclib -o .libs/libsqlite3.0.dylib .libs/sqlite3.o -L/usr/local/opt/icu4c/lib -lz -install_name /usr/local/lib/libsqlite3.0.dylib -compatibility_version 9 -current_version 9.6 -Wl,-single_module
Undefined symbols for architecture x86_64:
"_u_errorName_62", referenced from:
_icuFunctionError in sqlite3.o
"_u_foldCase_62", referenced from:
...
Post by Dan Kennedy
Post by 邱朗
Hi,
Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
I am using macOS 10.13 & xcode 10
The text below is just the error. If you post the compiler commands that
appear before it in the build log somebody might be able to spot the
problem.
From the error message, it may be that you have mismatched ICU header
and library files, or it may be that not all required ICU libraries are
being linked. If you remove the -I... and -L... switches from your
command line does it make any difference?
Dan.
Post by 邱朗
_icuFunctionError in sqlite3.o
_icuOpen in sqlite3.o
_icuLikeCompare in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuClose in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationDel in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationColl in sqlite3.o
_icuRegexpDelete in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1
Post by Dan Kennedy
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
Can you post the complete output of the failed build attempt? Thanks.
Dan.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Dan Kennedy
2018-09-21 15:43:01 UTC
Permalink
Post by 邱朗
I actually first used ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`" But I got the error
When you ran this configure command, is the first line out output
something like the following?

bash: icu-config: command not found

Is [icu-config] actually in your path? And if so, what does the
[icu-config --ldflags] command return?

Dan.
Post by 邱朗
sqlite3.c:184184:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>
Then I added -I -L switches and if I remembered correct I used brew to install icu4c. The compiler command are these
qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ make
/bin/sh ./libtool --tag=CC --mode=compile gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" -DPACKAGE_STRING=\"sqlite\ 3.25.1\" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c -o sqlite3.lo sqlite3.c
libtool: compile: gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -fno-common -DPIC -o .libs/sqlite3.o
libtool: compile: gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -o sqlite3.o >/dev/null 2>&1
mv -f .deps/sqlite3.Tpo .deps/sqlite3.Plo
/bin/sh ./libtool --tag=CC --mode=link gcc -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -no-undefined -version-info 8:6:8 -L/usr/local/opt/icu4c/lib -o libsqlite3.la -rpath /usr/local/lib sqlite3.lo -lz
libtool: link: gcc -dynamiclib -o .libs/libsqlite3.0.dylib .libs/sqlite3.o -L/usr/local/opt/icu4c/lib -lz -install_name /usr/local/lib/libsqlite3.0.dylib -compatibility_version 9 -current_version 9.6 -Wl,-single_module
_icuFunctionError in sqlite3.o
...
Post by Dan Kennedy
Post by 邱朗
Hi,
Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
I am using macOS 10.13 & xcode 10
The text below is just the error. If you post the compiler commands that
appear before it in the build log somebody might be able to spot the
problem.
From the error message, it may be that you have mismatched ICU header
and library files, or it may be that not all required ICU libraries are
being linked. If you remove the -I... and -L... switches from your
command line does it make any difference?
Dan.
Post by 邱朗
_icuFunctionError in sqlite3.o
_icuOpen in sqlite3.o
_icuLikeCompare in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuNext in sqlite3.o
_icuCaseFunc16 in sqlite3.o
_icuClose in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuNext in sqlite3.o
_icuOpen in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationDel in sqlite3.o
_icuLoadCollation in sqlite3.o
_icuCollationColl in sqlite3.o
_icuRegexpDelete in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
_icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1
Post by Dan Kennedy
Post by 邱朗
Post by Scott Robison
I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
I am Chinese and I know a little bit of Korean, I can help to test your product :D All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
Can you post the complete output of the failed build attempt? Thanks.
Dan.
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
邱朗
2018-09-22 04:43:00 UTC
Permalink
Hi,


It was exactly like you said, my bad, so now I have built an icu version. BUT unfortunately it still does not support CJK, why is that ?


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ ./sqlite3
SQLite version 3.25.1 2018-09-18 20:20:44
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for chinese');
sqlite> select * from zh_text where text match 'work';
为什么不支持中文 icu does not seem to work for chinese
sqlite> select * from zh_text where text match '中';
sqlite>


BTW, whoever hit the icu4c error it may be because you make the same mistake as I did. So I first run brew link icu4c, but brew refused, "Warning: Refusing to link macOS-provided software: icu4c", then I forgot to add it to my path :$


If you run brew info icu4c, it will tell you that but actually I didn't set them and compiler still can find them


For compilers to find icu4c you may need to set:
export LDFLAGS="-L/usr/local/opt/icu4c/lib"
export CPPFLAGS="-I/usr/local/opt/icu4c/include"


Thanks,
Qiulang
Post by Dan Kennedy
Post by 邱朗
I actually first used ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`" But I got the error
When you ran this configure command, is the first line out output
something like the following?
bash: icu-config: command not found
Hideaki Takahashi
2018-09-22 14:49:24 UTC
Permalink
Hello,

full text search index can be used to see how the text is tokenized for
both FTS4 and FTS5.
for FTS4, fts3tokenize can be used too.

sqlite> CREATE VIRTUAL TABLE icu_zh_cn USING fts3tokenize(icu, zh_CN);
sqlite> SELECT token, start, end, position FROM icu_zh_cn WHERE
INPUT='为什么不支持中文 fts5 does not seem to work for chinese';
为什么|0|9|0
不|9|12|1
支持|12|18|2
中文|18|24|3
fts5|25|29|4
does|30|34|5
not|35|38|6
seem|39|43|7
to|44|46|8
work|47|51|9
for|52|55|10
chinese|56|63|11

based on the output above, the query below works and makes sense to me.
sqlite> select * from zh_text where text match '中文';
为什么不支持中文 icu does not seem to work for chinese


FTS5 + unicode61
sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize =
'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work
for chinese');
sqlite> CREATE VIRTUAL TABLE ft5_test_vocab_i USING fts5vocab(ft5_test,
'instance');
sqlite> SELECT term, doc, col, offset FROM ft5_test_vocab_i;
(snip non-Chinese portion)
为什么不支持中文|1|content|0

FTS4 + ICU(zh_CN)
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
chinese');
sqlite> CREATE VIRTUAL TABLE zh_terms USING fts4aux(zh_text);
sqlite> SELECT term, col, documents FROM zh_terms;
(snip non-Chinese portion)
不|*|1
不|0|1
中文|*|1
中文|0|1
为什么|*|1
为什么|0|1
支持|*|1
支持|0|1

Thanks,
Hideaki
Post by 邱朗
Hi,
It was exactly like you said, my bad, so now I have built an icu version.
BUT unfortunately it still does not support CJK, why is that ?
qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ ./sqlite3
SQLite version 3.25.1 2018-09-18 20:20:44
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
chinese');
sqlite> select * from zh_text where text match 'work';
为什么不支持中文 icu does not seem to work for chinese
sqlite> select * from zh_text where text match '中';
sqlite>
BTW, whoever hit the icu4c error it may be because you make the same
mistake as I did. So I first run brew link icu4c, but brew refused,
"Warning: Refusing to link macOS-provided software: icu4c", then I forgot
to add it to my path :$
If you run brew info icu4c, it will tell you that but actually I didn't
set them and compiler still can find them
export LDFLAGS="-L/usr/local/opt/icu4c/lib"
export CPPFLAGS="-I/usr/local/opt/icu4c/include"
Thanks,
Qiulang
Post by Dan Kennedy
Post by 邱朗
I actually first used ./configure CFLAGS="-DSQLITE_ENABLE_ICU
`icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`" But I got the
error
Post by Dan Kennedy
When you ran this configure command, is the first line out output
something like the following?
bash: icu-config: command not found
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
邱朗
2018-09-24 03:22:42 UTC
Permalink
Hi Hideaki,

Thanks for your reply which made me figure out why I said icu version does "not" support Chinese: b/c in Chinese '中文' can be tokenize as either '中文' or '中' or '文' so when query '中文' or '中*' I can get the result but no result when query '文'. The same goes to '为什么', which can be be tokenize as either '为什么' or '为' or '什么' so no result for when query '什么'

And sadly fts5+unicode 61 definitely does not support Chinese.



BTW, it also helps me realize that I had answered this question myself at 2014 here, https://stackoverflow.com/a/31396975/301513. So basically icu does the same as iOS CFStringTokenizer


Qiulang
Post by Hideaki Takahashi
Hello,
full text search index can be used to see how the text is tokenized for
both FTS4 and FTS5.
for FTS4, fts3tokenize can be used too.
sqlite> CREATE VIRTUAL TABLE icu_zh_cn USING fts3tokenize(icu, zh_CN);
sqlite> SELECT token, start, end, position FROM icu_zh_cn WHERE
INPUT='为什么不支持中文 fts5 does not seem to work for chinese';
为什么|0|9|0
不|9|12|1
支持|12|18|2
中文|18|24|3
fts5|25|29|4
does|30|34|5
not|35|38|6
seem|39|43|7
to|44|46|8
work|47|51|9
for|52|55|10
chinese|56|63|11
based on the output above, the query below works and makes sense to me.
sqlite> select * from zh_text where text match '中文';
为什么不支持中文 icu does not seem to work for chinese
FTS5 + unicode61
sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize =
'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work
for chinese');
sqlite> CREATE VIRTUAL TABLE ft5_test_vocab_i USING fts5vocab(ft5_test,
'instance');
sqlite> SELECT term, doc, col, offset FROM ft5_test_vocab_i;
(snip non-Chinese portion)
为什么不支持中文|1|content|0
FTS4 + ICU(zh_CN)
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
chinese');
sqlite> CREATE VIRTUAL TABLE zh_terms USING fts4aux(zh_text);
sqlite> SELECT term, col, documents FROM zh_terms;
(snip non-Chinese portion)
不|*|1
不|0|1
中文|*|1
中文|0|1
为什么|*|1
为什么|0|1
支持|*|1
支持|0|1
Thanks,
Hideaki
Jens Alfke
2018-09-21 19:56:27 UTC
Permalink
https://www.sqlite.org/fts5.html <https://www.sqlite.org/fts5.html> said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... " I really doubt unicode tokenizer requires white space, that is ascii tokenizer.
Detecting word breaks in many East Asian languages (not just CJK; Thai is another) is a rather difficult task and requires having a non-small database of character sequences to match. I’m sure the SQLite maintainers considered it too large to build into their Unicode tokenizer.

IIRC, ICU can do this, as can special libraries like Mecab.

—Jens
Loading...