Discussion:
[sqlite] Save text file content in db: lines or whole file?
Abramo Bagnara
7 years ago
Permalink
I need to save text files (let say between 1 KB to 20 MB) in a SQLite DB.

I see two possibilities:

1) save all the content in a single column:

create table content(id integer not null primary key,
text blob not null);

2) split the content in lines:

create table line(content integer not null,
line integer not null,
text blob not null,
primary key(content, line));

Some queries will need to extract the whole file, while other queries
will need to extract the text for a range of lines.

According to your experience it is better/faster the first option, the
second option or a smarter option I've not considered?

My partial considerations are:

- 1 is simpler
- 1 leads to faster load
- 1 is slower to retrieve a range of lines (not 100% sure)
--
Abramo Bagnara
Simon Slavin
7 years ago
Permalink
Post by Abramo Bagnara
Some queries will need to extract the whole file, while other queries
will need to extract the text for a range of lines.
Can you give us an idea of how many lines you expect per text file ? Minimum and maximum for 90% of the files would be good.

Also, in whatever programming language you're using, it is easy/fast or difficult/slow to locate lines 21 to 41 of a 100-line text string ?

Simon.
Abramo Bagnara
7 years ago
Permalink
Post by Simon Slavin
Post by Abramo Bagnara
Some queries will need to extract the whole file, while other queries
will need to extract the text for a range of lines.
Can you give us an idea of how many lines you expect per text file ? Minimum and maximum for 90% of the files would be good.
They are (possibly preprocessed) source files. Then the size is not
known, as I wrote I assume it is in the range 1KB, 20MB, with most cases
under 1MB.
Post by Simon Slavin
Also, in whatever programming language you're using, it is easy/fast or difficult/slow to locate lines 21 to 41 of a 100-line text string ?
The language is C++. Without caching it needs a linear scan of bytes
from the beginning counting newlines for every lines range query. With
caching this counting can be done once and saved somewhere.

The lines will be served to an infinite scroll web app, so the retrieval
speed is important to reduce the latency.
--
Abramo Bagnara
Abroży Nieprzełoży
7 years ago
Permalink
-- One table with whole files
CREATE TABLE content(
id INTEGER PRIMARY KEY,
data BLOB NOT NULL
);
-- And second table with line boundaries
CREATE TABLE lines(
id INTEGER NOT NULL REFERENCES content(id),
line_no INTEGER NOT NULL,
bytes_from INTEGER NOT NULL,
bytes_to INTEGER NOT NULL,
PRIMARY KEY(id, line_num)
) WITHOUT ROWID;
-- Use Incremental BLOB I/O https://www.sqlite.org/c3ref/blob_open.html

I would also consider keeping the largest files external to the
database. https://www.sqlite.org/intern-v-extern-blob.html
...
Richard Hipp
7 years ago
Permalink
Post by Abroży Nieprzełoży
I would also consider keeping the largest files external to the
database. https://www.sqlite.org/intern-v-extern-blob.html
See also https://www.sqlite.org/fasterthanfs.html

Yes, it is a little faster to store larger files directly in the
filesystem. But not that much faster. And it is certainly convenient
to have all your content in one place and to have transactional reads
and writes. So, unless there is a compelling need for an extra 10%
performance, I would argue for keeping all content in the database
file.

FWIW, I also think you should store your files as a single big blob
and parse out the individual lines using C++ code when you need that.
--
D. Richard Hipp
***@sqlite.org
Abramo Bagnara
7 years ago
Permalink
Post by Abroży Nieprzełoży
-- One table with whole files
CREATE TABLE content(
id INTEGER PRIMARY KEY,
data BLOB NOT NULL
);
-- And second table with line boundaries
CREATE TABLE lines(
id INTEGER NOT NULL REFERENCES content(id),
line_no INTEGER NOT NULL,
bytes_from INTEGER NOT NULL,
bytes_to INTEGER NOT NULL,
PRIMARY KEY(id, line_num)
) WITHOUT ROWID;
-- Use Incremental BLOB I/O https://www.sqlite.org/c3ref/blob_open.html
Incremental BLOB I/O is faster than using substr(data, start, len) in a
SELECT?
--
Abramo Bagnara
Abramo Bagnara
7 years ago
Permalink
...
Someone familiar with implementation of BLOB I/O and sqlite VM can
answer to that?

I'd guess that *if* there is a difference it is greater with bigger
blobs, but it is also possible there is not any sensible difference if
substr act similarly to sqlite3_blob_read.

But my uninformed guesses are definitely not relevant, until an aware
Samaritan will come to clarify that ;-)
--
Abramo Bagnara

BUGSENG srl - http://bugseng.com
mailto:***@bugseng.com
Abroży Nieprzełoży
7 years ago
Permalink
substr(data, start, len) loads entire value and then substrs it.
sqlite3_blob_read reads the desired parts.
...
Abramo Bagnara
7 years ago
Permalink
Post by Abroży Nieprzełoży
substr(data, start, len) loads entire value and then substrs it.
sqlite3_blob_read reads the desired parts.
I've verified that not only substr(data, start, len) loads entire row,
but with my surprise that also retrieving a sibling column (I've tried
to add another column "info blob not null") loads entire row (i.e.
including whole data column)...

IOW "select info from content;" retrieves also data column content.

Weird, isn't it?
...
--
Abramo Bagnara

BUGSENG srl - http://bugseng.com
mailto:***@bugseng.com
Graham Holden
7 years ago
Permalink
Post by Abramo Bagnara
I've verified that not only substr(data, start, len) loads entire row,
but with my surprise that also retrieving a sibling column (I've tried
to add another column "info blob not null") loads entire row (i.e.
including whole data column)...
IOW "select info from content;" retrieves also data column content.
Weird, isn't it?
If the "info" column is added to the end of the table, then this is a
well-known "optimisation opportunity"... because of the way SQLite
stores records, it has to read past all intervening columns to get to
the ones required. If "info" is after your BLOB, the BLOB has to be
read to get to it. A common "good practice" is to ensure any BLOB (or
similar large fields) are placed at the end of the table definition.

If the "info" column is added _before_ the BLOB, it _should_ (as I
understand it) not need to read the blob to handle "select info from
content".

Graham

Hick Gunter
7 years ago
Permalink
Please try to avoid using keywords as names, especially if they conflict with the intended datatype. "text blob not null" creates a field of name "text" whose content is a blob and yet you intend to store text data (with embedded newlines) in it.

If you store the lines separately, you can always group_concat() them together on retrieval.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:sqlite-users-***@mailinglists.sqlite.org] Im Auftrag von Abramo Bagnara
Gesendet: Freitag, 03. August 2018 21:04
An: SQLite mailing list <sqlite-***@mailinglists.sqlite.org>
Betreff: [EXTERNAL] [sqlite] Save text file content in db: lines or whole file?


I need to save text files (let say between 1 KB to 20 MB) in a SQLite DB.

I see two possibilities:

1) save all the content in a single column:

create table content(id integer not null primary key,
text blob not null);

2) split the content in lines:

create table line(content integer not null,
line integer not null,
text blob not null,
primary key(content, line));

Some queries will need to extract the whole file, while other queries will need to extract the text for a range of lines.

According to your experience it is better/faster the first option, the second option or a smarter option I've not considered?

My partial considerations are:

- 1 is simpler
- 1 leads to faster load
- 1 is slower to retrieve a range of lines (not 100% sure)
--
Abramo Bagnara
_______________________________________________
sqlite-users mailing list
sqlite-***@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
R Smith
7 years ago
Permalink
...
Why not do both?

If it was me, I would write some code to split the text into sentences
(not lines - which is rather easy in English, but might be harder in
some other languages).

Then I would save the lines in a two-table Database like this:

CREATE TABLE files(
  ID INTEGER PRIMARY KEY, -- This is only to get a handle on the rowid
  fileName TEXT,  -- Add COLLATE NOCASE if the file system is case
insensitive, like Windows.
  filePath TEXT,
  ...  -- Any other things you want to store about the origin file, or
perhaps a timestamp etc.
);

CREATE TABLE content(
  ID INTEGER PRIMARY KEY,
  fileID INT NOT NULL REFERENCES files(ID), -- This to know which
origin file [*].
  lineNo INT NOT NULL,  -- So that a line can be referenced by position
into a file.
  txtLine TEXT,
  ...
);

Now you can easily query all lines with a specific fileID to see the
entire document, or JOIN by fileName even to list the content of any
file by name, or simply refer to any specific line in any file by either
its fileID+lineNo or simply its own ID.  I would probably go further
(since I'm code-parsing the file anyway) and include a paragraph number
or even chapter + page numbers if that's relevant.

This approach will work whether you split lines into sentences or just
physical positional lines, though a sentence-split makes more sense to
me (unless it's a data format).

[*] - I'm only showing basic options, but you'd typically want to add
some ON UPDATE CASCADE or ON DELETE referencing methods.


Cheers,
Ryan
R Smith
7 years ago
Permalink
Post by R Smith
Post by Abramo Bagnara
I need to save text files (let say between 1 KB to 20 MB) in a SQLite DB.
Why not do both?
If it was me, I would write some code to split the text into sentences
(not lines - which is rather easy in English, but might be harder in
some other languages).
//...

I've received two off-line questions as to how I could parse text into
sentences in "English" even, and thought I would reply here since it
might clear up the confusion for others too.

The said questions indicated that the authors probably imagined me
possessing some fancy AI comprehending the language into what
constitutes notional sentences (Subject+Predicate) or such, but I fear
the meaning was much more arbitrary, based on common syntax for written
English - as William Faulkner wrote in a letter to Malcolm Cowley:

*"I am trying to say it all in one sentence, between one Cap and one
period."*


Think of paragraphs in English as large records delimited by 2 or more
Line-break characters (#10+#13 or perhaps only #10 if on a *nix
platform) between texts.

Each paragraph record could be comprised of one or more sentences (in
English) as records delimited by a full-stop+Space or
full-stop+linebreak, or even simply the paragraph end.

By these simple rules, the following can easily parsed into 1 paragraph
with 2 sentences and a second paragraph with 1 sentence (lines here used
as formatting only, actual line-breaks indicated with "<-" marker):
<-
The quick brown fox jumps over the
lazy dog.  My grandma said to your
grandma, I'm gonna set your flag
on fire.<-
<-
Next paragraph here...<-
<-

Now a more difficult paragraph would be a the following, all of which
would translate in to 1 single sentence if only the above rules are
catered for:
<-
I have three wishes:<-
  - to be outlived by my children<-
  - to fly in space once before I die<-
  - to see Halley's comet once more<-
<-

That will be a single-sentenced paragraph.  It's up to the
end-implementation to gauge whether that would be sufficient a split or
not.

To put this into a DB, I would strip out the line-breaks inside
sentences (perhaps not strip out, but replace with space characters,
much like HTML does) to make them more easily handled as "lines". The
final DB table might then look like this:

ID |  fileID | parNo | parLineNo | docLineNo | txtLine
 1 |     1   |   1   |     1     |     1     | The quick brown fox
jumps over the lazy dog.
 2 |     1   |   1   |     2     |     2     | My grandma said to your
grandma, I'm gonna set your flag on fire.
 3 |     1   |   2   |     1     |     3     | Next paragraph here...
 4 |     1   |   3   |     1     |     4     | I have three wishes: -
to be outlived by my children - to fly in space once before I die - to
see Halley's comet once more

So yes, not a perfect walk-in-the-park, but easy to do for basic text
parsing.
Stating the obvious: If the intent is to re-construct the file 100%
exact (so it scores the same output for a hashing algorithm) then you
cannot strip out line-breaks and you need to carefully include each and
every character byte-for-byte used to split paragraphs and the like. It
all depends on the implementation requirements.

The above text format should hold for 99.9% of English literature text
that can be had in text files (i.e. no images, tables, etc.). Not so
easy for scientific papers, research material, movie scripts and a few
others.

Sorry for not presenting that great AI solution.  :)
Ryan
Hick Gunter
7 years ago
Permalink
Good luck with quoted speech that contains more than one sentence. E.g.

William Faulkner said, “Never be afraid to raise your voice for honesty and truth and compassion against injustice and lying and greed. If people all over the world...would do this, it would change the earth.”

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:sqlite-users-***@mailinglists.sqlite.org] Im Auftrag von R Smith
Gesendet: Montag, 06. August 2018 16:20
An: sqlite-***@mailinglists.sqlite.org
Betreff: Re: [sqlite] [EXTERNAL] Save text file content in db: lines or whole file?
Post by R Smith
Post by Abramo Bagnara
I need to save text files (let say between 1 KB to 20 MB) in a SQLite DB.
Why not do both?
If it was me, I would write some code to split the text into sentences
(not lines - which is rather easy in English, but might be harder in
some other languages).
//...

I've received two off-line questions as to how I could parse text into sentences in "English" even, and thought I would reply here since it might clear up the confusion for others too.

The said questions indicated that the authors probably imagined me possessing some fancy AI comprehending the language into what constitutes notional sentences (Subject+Predicate) or such, but I fear the meaning was much more arbitrary, based on common syntax for written English - as William Faulkner wrote in a letter to Malcolm Cowley:

*"I am trying to say it all in one sentence, between one Cap and one
period."*


Think of paragraphs in English as large records delimited by 2 or more
Line-break characters (#10+#13 or perhaps only #10 if on a *nix
platform) between texts.

Each paragraph record could be comprised of one or more sentences (in
English) as records delimited by a full-stop+Space or
full-stop+linebreak, or even simply the paragraph end.

By these simple rules, the following can easily parsed into 1 paragraph
with 2 sentences and a second paragraph with 1 sentence (lines here used
as formatting only, actual line-breaks indicated with "<-" marker):
<-
The quick brown fox jumps over the
lazy dog. My grandma said to your
grandma, I'm gonna set your flag
on fire.<-
<-
Next paragraph here...<-
<-

Now a more difficult paragraph would be a the following, all of which
would translate in to 1 single sentence if only the above rules are
catered for:
<-
I have three wishes:<-
- to be outlived by my children<-
- to fly in space once before I die<-
- to see Halley's comet once more<-
<-

That will be a single-sentenced paragraph. It's up to the
end-implementation to gauge whether that would be sufficient a split or
not.

To put this into a DB, I would strip out the line-breaks inside
sentences (perhaps not strip out, but replace with space characters,
much like HTML does) to make them more easily handled as "lines". The
final DB table might then look like this:

ID | fileID | parNo | parLineNo | docLineNo | txtLine
1 | 1 | 1 | 1 | 1 | The quick brown fox
jumps over the lazy dog.
2 | 1 | 1 | 2 | 2 | My grandma said to your
grandma, I'm gonna set your flag on fire.
3 | 1 | 2 | 1 | 3 | Next paragraph here...
4 | 1 | 3 | 1 | 4 | I have three wishes: -
to be outlived by my children - to fly in space once before I die - to
see Halley's comet once more

So yes, not a perfect walk-in-the-park, but easy to do for basic text
parsing.
Stating the obvious: If the intent is to re-construct the file 100%
exact (so it scores the same output for a hashing algorithm) then you
cannot strip out line-breaks and you need to carefully include each and
every character byte-for-byte used to split paragraphs and the like. It
all depends on the implementation requirements.

The above text format should hold for 99.9% of English literature text
that can be had in text files (i.e. no images, tables, etc.). Not so
easy for scientific papers, research material, movie scripts and a few
others.

Sorry for not presenting that great AI solution. :)
Ryan

_______________________________________________
sqlite-users mailing list
sqlite-***@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
R Smith
7 years ago
Permalink
Post by Hick Gunter
Good luck with quoted speech that contains more than one sentence. E.g.
William Faulkner said, “Never be afraid to raise your voice for honesty and truth and compassion against injustice and lying and greed. If people all over the world...would do this, it would change the earth.”
Very true, and even if you parse Quoted text as individual sections, it
can get weird since sometimes a Quote can contain partial sentences,
full sentences, full paragraph and even whole multi-paragraph pages. Do
we regard those as one sentence? or one "line" or do you break it into
lines on a second-level.
All these are possible via varying levels of difficulty, but they are
all questions to be answered by the source format, the use case and the
application designer.
Rowan Worth
7 years ago
Permalink
Post by R Smith
Think of paragraphs in English as large records delimited by 2 or more
Line-break characters (#10+#13 or perhaps only #10 if on a *nix platform)
between texts.
Each paragraph record could be comprised of one or more sentences (in
English) as records delimited by a full-stop+Space or full-stop+linebreak,
or even simply the paragraph end.
This is perfect because English has such strict rules and is so consistent;
how fortunate that the symbol used for delimiting sentences is never reused
for other purposes eg. to indicate an abbreviation or something crazy like
that.

:P
-Rowan
Continue reading on narkive:
Loading...