Discussion:
[sqlite] Deleting records from a large table
Dave Blake
2017-03-03 10:53:23 UTC
Permalink
Say table1 has more then 500000 records, and there is a second table
tmp_keep with the ids of the records in table1 to be kept, the rest need to
be deleted. The number of records in tmp_keep can vary from 0 to all the
records in table1, with any values in between.

What is the best strategy for doing the deletion?

For deleting a large number of records (tmp_keep is small), this works
fine:
DELETE FROM table1 WHERE id1 NOT IN (SELECT id1 FROM tmp_keep);

But this becomes inefficient when tmp_keep is large.

Any suggestions?
R Smith
2017-03-03 11:04:32 UTC
Permalink
Post by Dave Blake
Say table1 has more then 500000 records, and there is a second table
tmp_keep with the ids of the records in table1 to be kept, the rest need to
be deleted. The number of records in tmp_keep can vary from 0 to all the
records in table1, with any values in between.
What is the best strategy for doing the deletion?
For deleting a large number of records (tmp_keep is small), this works
DELETE FROM table1 WHERE id1 NOT IN (SELECT id1 FROM tmp_keep);
But this becomes inefficient when tmp_keep is large.
BEGIN TRANSACTION;
ALTER TABLE table1 RENAME TO tmp1;
CREATE TABLE table1 (
-- Your standard Table creation code here for table1...
);
INSERT INTO table1 SELECT tmp1.* FROM tmp1 JOIN tmp_keep ON tmp_keep.id1
= tmp1.id1;
DROP TABLE tmp1;
COMMIT;
Clemens Ladisch
2017-03-03 12:33:48 UTC
Permalink
Post by Dave Blake
For deleting a large number of records (tmp_keep is small), this works
DELETE FROM table1 WHERE id1 NOT IN (SELECT id1 FROM tmp_keep);
But this becomes inefficient when tmp_keep is large.
SQLite usually creates a temporary index for the values in the IN clause.
How large is "large"?

Are the id1 values integers? Then you can make tmp_keep.id1 the INTEGER
PRIMARY KEY.


Regards,
Clemens
Dave Blake
2017-03-03 12:39:40 UTC
Permalink
Could be keep almost all the records so ~500000, but it varies greatly so
sometimes will be just keep 10. I can adjust approach depending on size if
necessary.

Yes the id1 are integer primary keys.

Table1 has a number of indexes and views, so the create new table approach
is less attractive

Clemens Ladisch
2017-03-03 15:57:12 UTC
Permalink
Post by Dave Blake
Yes the id1 are integer primary keys.
In both tables? If yes, then there is not much you could do, and the
problem probably are all the modifications done to the actual table
and its indexes.


Regards,
Clemens
Dave Blake
2017-03-03 17:05:13 UTC
Permalink
Thanks all for your input, it has really helped.

In my real world application tmp_keep is a temporary table populated by
examinining a number of other tables etc., and I suddenly realsied that it
could even contain duplicate ids. Sloppy thinking on my part.

I get the best results by creating another table:
CREATE TEMPORARY TABLE tmp_keep_unique (id1 integer primary key);
INSERT INTO tmp_keep_unique SELECT DISTINCT id1 from tmp_keep;

It takes far longer to create an index on tmp_keep, than it save times on
the above query with one.

Then
*with a primary key on both table1 and tmp_keep_unique*DELETE FROM table1
WHERE id1 NOT IN (SELECT id1 FROM tmp_keep_unique);

is acceptably efficient.

On SQLite NOT EXISTS is quicker than NOT IN, but I also need to use same
SQL on a MySQL implementation, and it behaves the oppoiste.

Keith Medcalf
2017-03-03 14:21:37 UTC
Permalink
DELETE FROM table1 WHERE NOT EXISTS (SELECT 1 from tmp_keep WHERE id1 = table1.id1);

Does it in a single pass by doing a correlated subquery on each row in table1 to see if the id is in tmp_keep.
-----Original Message-----
On Behalf Of Dave Blake
Sent: Friday, 3 March, 2017 03:53
To: SQLite mailing list
Subject: [sqlite] Deleting records from a large table
Say table1 has more then 500000 records, and there is a second table
tmp_keep with the ids of the records in table1 to be kept, the rest need
to
be deleted. The number of records in tmp_keep can vary from 0 to all the
records in table1, with any values in between.
What is the best strategy for doing the deletion?
For deleting a large number of records (tmp_keep is small), this works
DELETE FROM table1 WHERE id1 NOT IN (SELECT id1 FROM tmp_keep);
But this becomes inefficient when tmp_keep is large.
Any suggestions?
_______________________________________________
sqlite-users mailing list
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Simon Slavin
2017-03-03 14:30:39 UTC
Permalink
Post by Dave Blake
DELETE FROM table1 WHERE id1 NOT IN (SELECT id1 FROM tmp_keep);
But this becomes inefficient when tmp_keep is large.
Do you have an appropriate index on tmp_keep ?

CREATE INDEX tk_id1 ON tmp_keep (id1)

then try it again.

Simon.
Continue reading on narkive:
Loading...