Post by Keith Medcalf
Post by Jay A. Kreibich
each column is usually undesirable. A given SELECT can usually only
use one index per query (or sub-query), so it rarely makes sense to
stack up the indexes... adding unused indexes only slows down
insert/update/deletes, as well as makes the file size much larger.
Generally speaking indexes should be treated as a performance trade-off.
Ideally they're a performance gain. That's kind of the point. If
they're not an overall gain, you likely shouldn't be using them.
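To make the one-index-at-a-time point concrete, you can ask SQLite
itself with EXPLAIN QUERY PLAN. A rough sketch with a made-up table
(the exact plan wording varies by version), where an AND-ed filter
over two indexed columns still only gets one index:

  -- hypothetical schema, purely for illustration
  CREATE TABLE parts ( id INTEGER PRIMARY KEY, vendor TEXT, status TEXT );
  CREATE INDEX parts_vendor_idx ON parts ( vendor );
  CREATE INDEX parts_status_idx ON parts ( status );

  -- both columns are filtered, but the plan normally reports a
  -- SEARCH through only one of the two indexes, with row lookups
  -- to check the other condition
  EXPLAIN QUERY PLAN
  SELECT * FROM parts WHERE vendor = 'acme' AND status = 'active';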
Post by Keith Medcalf
When you create an index (other than a UNIQUE index used to enforce a
constraint, or an index on a parent or child key in a foreign-key
relationship, where such an index may greatly increase INSERT or UPDATE
performance) you are "moving" execution time from the retrieval
processing to the maintenance processing of your application.
Yes and no. If you view an index as an optimization, then the idea
is usually to increase overall performance, so that there is a net win.
It is true that an index will (hopefully) increase the performance
of many queries at the possible cost of additional maintenance processing,
but the net change depends on the application's read/write ratios,
and the maintenance patterns.
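(To pick made-up numbers: if an index shaves 5 ms off each of 1,000
SELECTs per hour but adds 1 ms to each of 100 INSERTs per hour, that's
5,000 ms saved against 100 ms spent-- a clear win. Flip the ratio to
100 SELECTs and 10,000 INSERTs and the very same index is a net loss.)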
In that sense, I'd say the indexes do not move execution *time*, so
much as move and adjust execution *cost*. If the application's needs
and access patterns are such that the refactored set of costs is lower,
the indexes are usually seen as a correct and positive optimization.
If the new set of costs results in lower overall performance, the
index is seen as a negative thing.
There is also nothing special about an automatic UNIQUE index.
The costs are the same. The only difference is the motivation for
creating it, accepting the write performance cost as the price of
enforcing the constraint. Adding a UNIQUE constraint (and the index
that goes with it) is not an optimization, but an application requirement.
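To be clear, by "automatic" I mean the index SQLite builds on its own
to back the constraint, as in this hypothetical table:

  -- the UNIQUE constraint makes SQLite create an internal index
  -- (named sqlite_autoindex_<table>_<n>) to enforce it; the write
  -- cost is the same as an explicit CREATE UNIQUE INDEX would impose
  CREATE TABLE users (
      id    INTEGER PRIMARY KEY,
      email TEXT UNIQUE
  );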
Post by Keith Medcalf
When you add an index, you are (usually) optimizing retrieval and
query operations. The execution time saved during such query
operations does not disappear (it is not recovered). These processor
cycles and I/O operations are "removed" from retrieval operations
and "spent" when you perform updates to the database to maintain
the indexes.
I disagree with this idea, as it implies there is a 1:1 exchange
between read performance gains and write performance losses. That is very
rarely the case. Even the relative costs of a single read vs a
single write are not linear or particularly related. All of these
things are highly dependent on the table size, the different indexes,
and the operation being performed.
It isn't about moving costs from category A to category B; it is
about finding efficient workflows that run faster and better. We
already know there are some situations when an index will provide
significant performance benefits, and other cases when the same index
may slow things down. A big part of what the query optimizer must do
is identify these situations and pick the best use of the available
resources-- in many situations that may be to ignore the index.
That's not about shifting execution time, it is about getting rid of
it, and picking the fastest path for the situation at hand. The
balance is that an index opens up more options to the query
optimizer, but there is also an associated cost-- and that cost
should only be paid if the index is used from time to time.
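You can actually watch the optimizer make that call. Sketching with
the made-up "parts" table from above, and assuming the statistics say
most rows share one status value, the plan may well come back as a
plain scan even though an index is sitting right there:

  -- give the planner statistics to work with
  ANALYZE;

  -- if 'active' covers most of the table, the optimizer may decide
  -- the index isn't worth the extra lookups and report a plain
  -- "SCAN" instead of a "SEARCH ... USING INDEX"
  EXPLAIN QUERY PLAN
  SELECT * FROM parts WHERE status = 'active';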
This balance is true of all operations, not just SELECTs. For
example, a good index can make UPDATE and DELETE operations faster,
just as it may make SELECTs faster. That kind of blows the theory of
moving execution time around, since there can be advantages in both
types of operations.
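Sticking with the same made-up table: the index on "vendor" helps the
statements below exactly the way it helps a SELECT, because the rows
to change have to be located before they can be touched:

  -- the WHERE clause can be satisfied through parts_vendor_idx, so
  -- the affected rows are found without a full table scan (the
  -- status index still has to be maintained by the UPDATE, of course)
  UPDATE parts SET status = 'retired' WHERE vendor = 'acme';
  DELETE FROM parts WHERE vendor = 'acme';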
Index optimization is in no way a zero-sum game. The bag of water is
*not* fixed, and it is just as possible to remove water as it is to
add it.
Post by Keith Medcalf
Post by Jay A. Kreibich
Additionally, indexing any column (even one used heavily in a query
filter) is not useful if the index doesn't reduce the working data
set to ~10% of the rows. In other words, having an index on a column
that has all one value (be it NULL or anything else) accomplishes
nothing but significantly slowing down queries that use the index.
Same is true, for example, of a True/False column with an even
distribution. Generally, unless the index can be used to eliminate
the vast majority of rows, it will slow down, rather than speed up, a
query.
The exception to this is, of course, where the index created is a
covering index because using a covering index, while it may not
necessarily reduce the number of rows significantly, eliminates the
accesses to the base table to retrieve data that might be being used
in further operations.
Hence the "generally." Covering indexes are a somewhat special case,
and not every database can use them. Normally they still don't result
in much savings unless a noticeable percentage of rows are eliminated
(or very expensive columns are eliminated, such as large BLOBs), but
the break-even point is definitely lower than a normal index, with the
worst case being no worse than a normal table scan. All of this is
true for read operations only, of course. Covering indexes offer no
advantage over normal indexes in the case of UPDATE and DELETE
operations (and no disadvantages either; they can still be used to
speed up these operations, just no more than a "normal" index). And,
of course, covering indexes often have a higher maintenance cost than
normal indexes, since they typically contain more columns.
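As a rough sketch of the difference, again with made-up tables: the
first query below filters on a 50/50 flag, so the single-column index
barely helps (if it is used at all), while the two-column index lets
the second query be answered from the index alone-- the plan should
say something like "USING COVERING INDEX":

  CREATE TABLE orders ( id INTEGER PRIMARY KEY, shipped INTEGER, total REAL );

  -- low selectivity: roughly half the table matches, so this index
  -- mostly adds maintenance cost without buying the query much
  CREATE INDEX orders_shipped_idx ON orders ( shipped );
  EXPLAIN QUERY PLAN
  SELECT total FROM orders WHERE shipped = 0;

  -- covering: both the filter column and the selected column live in
  -- the index, so the base table never needs to be visited
  CREATE INDEX orders_ship_total_idx ON orders ( shipped, total );
  EXPLAIN QUERY PLAN
  SELECT total FROM orders WHERE shipped = 0;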
Post by Keith Medcalf
Care needs to be taken to not prematurely add indexes that will add
maintenance cost but not significantly improve query performance (ie,
be careful not to just add water into the bag -- the objective is to
poke it around, not just add more water).
If it is possible to add water, it is also possible to take it away.
As I said, this is not a zero-sum game. The bag of water is not fixed.
I agree that premature optimization is a bad idea-- not just with
indexes, but with just about any aspect of software development.
Unless you know the patterns and needs of your application, you're
optimizing guesses, which is always a bad thing.
In the end, a big part of the reason index optimization is so
difficult is because it makes the cost structure much more complex.
Adding an index is not as simple as making every query 50% faster
while making every insert/update/delete 50% slower. If it were that easy,
the query optimizer would be much better at suggesting indexes (and/or
automatically creating them, as it does create temporary indexes
for some queries).
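You can see those temporary ones in the query plan, too. Assuming a
join column with no index on it, something along these lines will
often show an entry like "AUTOMATIC COVERING INDEX" for the inner
table:

  -- no index on line_items.order_id, so the planner may build a
  -- transient index just for this one statement and throw it away
  CREATE TABLE line_items ( id INTEGER PRIMARY KEY, order_id INTEGER, qty INTEGER );

  EXPLAIN QUERY PLAN
  SELECT o.id, li.qty
    FROM orders AS o JOIN line_items AS li ON li.order_id = o.id;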
Rather, the cost/benefit requirements of an index are much more
complex, and often have to do with the number of rows being operated
on, as well as the total number of rows in the table. This is why
it is important to understand how indexes work, and how they provide
(or fail to provide) advantages to the application.
"Using SQLite" (http://shop.oreilly.com/product/9780596521196.do)
has a very lengthy discussion of indexes and how they work,
specifically because it is difficult to generalize the use of
indexes. One must really look at each application and each database
(and the data in that database!) with a solid knowledge of what an
index can-- or cannot-- provide in order to find places where an
index will make a positive difference.
-j
--
Jay A. Kreibich < J A Y @ K R E I B I.C H >
"Intelligence is like underwear: it is important that you have it,
but showing it to the wrong people has the tendency to make them
feel uncomfortable." -- Angela Johnson