[sqlite] Finding second occurrence of character in string

Discussion:

Baruch Burstein

2014-10-26 12:27:24 UTC

Hi!

I have a column which represents a file path:

ab/cd/gf
ab/qw/ert
ab/fgrd/ert
ab/foo/bar/fgr
ab/bar/foo/foobar/etc
ab/etc/d
etc...

I happen to know in my case that the first part of the path is a certain
fixed string ('ab' in the above example). I need to get the path with the
first 2 parts stripped off. Currently I am doing:

substr(path, 4+instr(substr(path,4),'/'))

But that seems long and probably inefficient.
What is the best/simplest way to find the second occurrence of the '/' in a
string?

Also, a suggestion for an SQLite improvement: The builtin function instr()
should have another form that takes 3 arguments, with the 3rd being either
an offset from where to start the search, or which occurrence to search for
(1st, 2nd, etc.)

--
˙uʍop-ǝpısdn sı ɹoʇıuoɯ ɹnoʎ 'sıɥʇ pɐǝɹ uɐɔ noʎ ɟı

Andrea Peri

2014-10-26 12:46:10 UTC

Permalink

Hi,

I don't know if the SQLite SQL function are coming from a sql specs standard.

I guess however that a better usable string function for manage paths
is one function that retrieve the last occurrence of a string.
Because very often the need is to extract the last part of a filepath.

my 2ct,

A.

Post by Baruch Burstein
Hi!
ab/cd/gf
ab/qw/ert
ab/fgrd/ert
ab/foo/bar/fgr
ab/bar/foo/foobar/etc
ab/etc/d
etc...
I happen to know in my case that the first part of the path is a certain
fixed string ('ab' in the above example). I need to get the path with the
substr(path, 4+instr(substr(path,4),'/'))
But that seems long and probably inefficient.
What is the best/simplest way to find the second occurrence of the '/' in a
string?
Also, a suggestion for an SQLite improvement: The builtin function instr()
should have another form that takes 3 arguments, with the 3rd being either
an offset from where to start the search, or which occurrence to search for
(1st, 2nd, etc.)
--
˙uʍop-ǝpısdn sı ɹoʇıuoɯ ɹnoʎ 'sıɥʇ pɐǝɹ uɐɔ noʎ ɟı
_______________________________________________
sqlite-users mailing list
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

--
-----------------
Andrea Peri
. . . . . . . . .
qwerty àèìòù
-----------------

Stadin, Benjamin

2014-10-26 13:57:24 UTC

Permalink

Hi,

One possible way could be to combine this with FTS4 (with parenthesis
support enabled) and a LIKE clause:

SELECT substr(path, 4+instr(substr(path,4),'/‚)) as relativepath FROM
table WHERE table MATCH "path:ab AND path:cd“ AND path LIKE "ab/cd%“

How it works:
- The match clause efficiently filters for all records containing ab and
cd (the / is treated as a separator for words by the fulltext tokenizer).
- The records returned by match mean that the words we searched for
occurred SOMEWHERE within the record. Since you’re only interested those
beginning with "ab/cd“ we must use another LIKE. But this is not a
peformance issue, because you effectively only apply the LIKE to the
records that MATCH returned
- The substr(…) is just applied to those records returned

It should be fairly efficient this way. A limitation is however that MATCH
doesn’t like special characters. If you have for example hyphens or spaces
in your path names. But even then it would work, you’d just need to build
an array of words, replacing any special chars with spaces. For example:

Path-_1/path%2/path3_

When we build an array by replacing all non-alphanumeric chars with empty
spaces we get:
"Path 1 path 2 path3 "

Notice we have two spaces in path 1 for „-_“. Normalize that as well,
removing duplicate as well as leading / trailing spaces:
"Path 1 path 2 path3“

Make an array of search words, breaking at the spaces:
[Path, 1, path, 2, path3]

Create your match statement with that:
MATCH "pathcolumn:Path AND pathcolumn:1, AND pathcolumn:path AND
pathcolumn:2 AND pathcolumn:path3“

Create your like statement, appending the % at the end, and normalizing
the input string (to make % inside your text not recognized by like but
treated as test really):
char *zSQL = sqlite3_mprintf("Path-_1/path%2/path3_", zText);

This will give some properly escaped string. Then use that string for your
LIKE statement, which will look something like (just quick example, not
sure if entirely correct)
LIKE „Path-_1/path\%2/path3_"

Regards
Ben

Stadin, Benjamin

2014-10-26 14:21:05 UTC

Permalink

Sorry, this should rather be something like sqlite3_mprintf("%q",
"Path-_1/path%2/path3_³);

https://www.sqlite.org/c3ref/mprintf.html

Am 26.10.14 14:57 schrieb "Stadin, Benjamin" unter

Post by Stadin, Benjamin
char *zSQL = sqlite3_mprintf("Path-_1/path%2/path3_", zText);

James K. Lowden

2014-10-26 19:16:48 UTC

Permalink

On Sun, 26 Oct 2014 15:27:24 +0300

Post by Baruch Burstein
I need to get the path with the
substr(path, 4+instr(substr(path,4),'/'))
But that seems long and probably inefficient.
What is the best/simplest way to find the second occurrence of the
'/' in a string?

Fast is fast enough. If your SQL solves your problem acceptably fast,
you're done.

If you need something faster, you could implement support for
regular expressions. Posix filename rules are very strict: the only
disallowed characters are '/' and NUL. A simple regex returns the Nth
occurence of a pattern. I would expect such a solution to move the
performance constraint from the SQL interpreter (if that's where it is)
to I/O.

My example implementation is at
http://www.schemamania.org/sql/sqlite/udf/regex.pdf Note however I
wrote it as a user-defined function before I understood that REGEXP is a
supported keyword in the SQLite syntax. If you implement a regexp()
user function, you have access to it as an SQL operator, not just as a
UDF.

--jkl

Stephen Chrzanowski

2014-10-26 23:17:21 UTC

Permalink

Another method beyond what was suggested above/below would be that since
the / could be considered a delimiter, you could consider each field a word
and insert each word into a separate table and index each word. Have
another table reference the indexed word to match whatever table you've
mentioned. This is my personal pref, as I've not a lot of experience with
FTS.

So using your sample data above, I designed this database;

CREATE TABLE [main].[RawData] (
[PathID] INTEGER PRIMARY KEY AUTOINCREMENT,
[Path] CHAR);

CREATE TABLE [main].[Keywords] (
[KeywordID] INTEGER PRIMARY KEY AUTOINCREMENT,
[Keyword] CHAR);
CREATE UNIQUE INDEX [main].[idxKeywords] ON [Keywords] ([Keyword] COLLATE
NOCASE);

CREATE TABLE [main].[DataKeywords] (
[PathID] INTEGER NOT NULL,
[KeywordID] integer NOT NULL,
CONSTRAINT [sqlite_autoindex_DataKeywords_1] PRIMARY KEY ([PathID],
[KeywordID]));

insert into [main].[RawData] values(1, 'ab/cd/gf');
insert into [main].[RawData] values(2, 'ab/qw/ert');
insert into [main].[RawData] values(3, 'ab/fgrd/ert');
insert into [main].[RawData] values(4, 'ab/foo/bar/fgr');
insert into [main].[RawData] values(5, 'ab/bar/foo/foobar/etc');
insert into [main].[RawData] values(6, 'ab/etc/d');

insert into [main].[Keywords] values(1, 'ab');
insert into [main].[Keywords] values(2, 'cd');
insert into [main].[Keywords] values(3, 'gf');
insert into [main].[Keywords] values(4, 'qw');
insert into [main].[Keywords] values(5, 'ert');
insert into [main].[Keywords] values(6, 'fgrd');
insert into [main].[Keywords] values(8, 'foo');
insert into [main].[Keywords] values(9, 'bar');
insert into [main].[Keywords] values(10, 'fgr');
insert into [main].[Keywords] values(11, 'foobar');
insert into [main].[Keywords] values(12, 'etc');
insert into [main].[Keywords] values(13, 'd');

insert into [main].[DataKeywords] values(1, 1);
insert into [main].[DataKeywords] values(1, 2);
insert into [main].[DataKeywords] values(1, 3);
insert into [main].[DataKeywords] values(2, 1);
insert into [main].[DataKeywords] values(2, 4);
insert into [main].[DataKeywords] values(2, 5);
insert into [main].[DataKeywords] values(3, 1);
insert into [main].[DataKeywords] values(3, 6);
insert into [main].[DataKeywords] values(3, 5);
insert into [main].[DataKeywords] values(4, 1);
insert into [main].[DataKeywords] values(4, 8);
insert into [main].[DataKeywords] values(4, 9);
insert into [main].[DataKeywords] values(4, 10);
insert into [main].[DataKeywords] values(5, 1);
insert into [main].[DataKeywords] values(5, 9);
insert into [main].[DataKeywords] values(5, 8);
insert into [main].[DataKeywords] values(5, 11);
insert into [main].[DataKeywords] values(5, 12);
insert into [main].[DataKeywords] values(6, 1);
insert into [main].[DataKeywords] values(6, 12);
insert into [main].[DataKeywords] values(6, 13);

select RawData.PathID, Path from RawData
join DataKeywords on RawData.PathID=DataKeywords.PathID
join Keywords on DataKeywords.KeywordID=Keywords.KeywordID
where keyword='ert'

PathID Path
------ -----------
2 ab/qw/ert
3 ab/fgrd/ert

select RawData.PathID, Path from RawData
join DataKeywords on RawData.PathID=DataKeywords.PathID
join Keywords on DataKeywords.KeywordID=Keywords.KeywordID
where keyword='ab'
PathID Path
------ ---------------------
1 ab/cd/gf
2 ab/qw/ert
3 ab/fgrd/ert
4 ab/foo/bar/fgr
5 ab/bar/foo/foobar/etc
6 ab/etc/d