背景:
数据库包含重复数据,需要清理掉重复数据,并只保留其中一条。
结论
优化:百万数据查询删除重复数据,耗时从5423秒下降到2秒左右
优化过程:
根据搜索到的资料:
4、删除表中多余的重复记录(多个字段),只留有rowid最小的记录
delete from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq
from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae
group by peopleId,seq having count(*)>1)
根据搜索到的资料,编写第一个版本的sql语句:
delete from lcfyjttz where
(fdate, ffjdm, flcdm, ffytype, fgsbz) in(
select fdate,ffjdm,flcdm, ffytype, fgsbz
from lcfyjttz group by fdate,ffjdm,flcdm, ffytype, fgsbz having count(1) > 1)
and rowid not in(
select min(rowid) as rid from lcfyjttz
group byfdate,ffjdm,flcdm, ffytype, fgsbz having count(1) > 1 )
百万数据量的情况下,其执行结果如下:
慢得吓人
看这个sql的执行结果就很吓人,作为一个追求3秒级的人,简直忍受不了,开始尝试优化这条sql。经历一段自残式的试错,也终于是实现了。
优化后sql:
DELETE
FROM
LCFYJTTZ c
WHERE
EXISTS (
SELECT
a.ROWID
FROM
LCFYJTTZ a,
(
SELECT
fdate,
ffjdm,
flcdm,
ffytype,
fgsbz,
MIN( ROWID ) rid
FROM
lcfyjttz
GROUP BY
fdate,
ffjdm,
flcdm,
ffytype,
fgsbz
HAVING
count( 1 ) > 1
) b
WHERE
a.FDATE = b.FDATE
AND a.FFJDM = b.FFJDM
AND a.FLCDM = b.FLCDM
AND a.ffytype = b.FFYTYPE
AND a.ROWID != b.rid
AND c.ROWID = a.ROWID
)
其执行结果如下:
结果还很满意吧!
在优化过程中,还是学习到很多知识,比如in和exists关键字的使用,with...as的语法使用,我尝试过用但没用上。
网友评论