各位高手,目前碰到一个X亿级数据检索速度优化的难题,大家帮看看怎么解决。
涉及到的表有:
KeywordIndex:2.7亿条记录
Original:1014万条记录
KeywordIndex表包括四个字段,ID(PK,int)、KeywordID(int)、OriginalID(int)、ColumnID(int)
建立的索引:
PK_KeywordIndex([ID] ASC)
IX_KeywordIndex_KeywordID([KeywordID] ASC,包含[OriginalID]列)
业务需求是从KeywordIndex表中查找到同时符合多个KeywordID的记录(交集),然后根据这个交集从Original表中取出相应记录、排序后取出前X条
基本的语句如下SELECT * FROM ( SELECT OuterID AS ResultID, ROW_NUMBER() OVER(Order by Weights Desc, ProAddtime Desc) AS RowNum FROM Original Where
ID IN (
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1933
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1932
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1934
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1935
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1931
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 14
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 21
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 20
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 23
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 22
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 24
) ) AS T WHERE RowNum > 0 AND RowNum < 21
现在的问题是,上面的语句包含11个KeywordID ,查询速度需要18秒,怎样能够优化到3秒以内?
测试发现,如果符合某一个KeywordID记录数较少的话(比如几万条),查询可以再1秒内完成
如果减少KeywordID 数量,能较少一定的时间(去掉几个KeywordID ,时间在13秒左右),但并不明显
感觉主要的耗时在INTERSECT上。
我试过对KeywordIndex 分区,每5000万一个分区,按ID左右分区依据,没有效果
上面的SQL语句中,已经是按照KeywordID记录数从少到多拼接的KeywordID = 1933(346613条记录) , KeywordID = 24(10080873条记录)
执行计划如下:
SQL Server 分析和编译时间:
CPU 时间 = 0 毫秒,占用时间 = 0 毫秒。
SQL Server 执行时间:
CPU 时间 = 0 毫秒,占用时间 = 0 毫秒。
SQL Server 分析和编译时间:
CPU 时间 = 0 毫秒,占用时间 = 0 毫秒。
(20 行受影响)
表 'Worktable'。扫描计数 0,逻辑读取 0 次,物理读取 0 次,预读 0 次,lob 逻辑读取 0 次,lob 物理读取 0 次,lob 预读 0 次。
表 'KeywordIndex'。扫描计数 11,逻辑读取 121615 次,物理读取 0 次,预读 0 次,lob 逻辑读取 0 次,lob 物理读取 0 次,lob 预读 0 次。
表 'Original'。扫描计数 1,逻辑读取 350977 次,物理读取 0 次,预读 0 次,lob 逻辑读取 0 次,lob 物理读取 0 次,lob 预读 0 次。
(31 行受影响)
(1 行受影响)
SQL Server 执行时间:
CPU 时间 = 15366 毫秒,占用时间 = 15479 毫秒。
SQL Server 执行时间:
CPU 时间 = 0 毫秒,占用时间 = 0 毫秒。
SQL Server 分析和编译时间:
CPU 时间 = 0 毫秒,占用时间 = 0 毫秒。
希望各位高手帮忙指点一下,方向不限,优化SQL,更改表结构、索引,更改实现方式,换数据库都可以,非常感谢
------解决思路----------------------
A)把 OuterID、Weights、ProAddtime 包含在 Original 的主键中,可以避免“逻辑读取 350977 次”。
B)KeywordIndex表就是多对多的关系,字段ID没有用处,主键设为(KeywordID,OriginalID)吧。
C)个人认为大数据(本身有索引时)用JOIN应该比INTERSECT更有效率吧,试试下面的语句
SELECT TOP 20 *
o.OuterID AS ResultID
ROW_NUMBER() OVER(Order by o.Weights Desc, o.ProAddtime Desc) AS RowNum
FROM Original o
JOIN KeywordIndex k1 ON k1.OrigianlID = o.ID AND k1.KeywordID = 1933
JOIN KeywordIndex k2 ON k2.OrigianlID = o.ID AND k2.KeywordID = 1932
JOIN KeywordIndex k3 ON k3.OrigianlID = o.ID AND k3.KeywordID = 1934
JOIN KeywordIndex k4 ON k4.OrigianlID = o.ID AND k4.KeywordID = 1935
JOIN KeywordIndex k5 ON k5.OrigianlID = o.ID AND k5.KeywordID = 1931
JOIN KeywordIndex k6 ON k6.OrigianlID = o.ID AND k6.KeywordID = 14
JOIN KeywordIndex k7 ON k7.OrigianlID = o.ID AND k7.KeywordID = 21
JOIN KeywordIndex k8 ON k8.OrigianlID = o.ID AND k8.KeywordID = 20
JOIN KeywordIndex k9 ON k9.OrigianlID = o.ID AND k9.KeywordID = 23
JOIN KeywordIndex k10 ON k10.OrigianlID = o.ID AND k10.KeywordID = 22
JOIN KeywordIndex k11 ON k11.OrigianlID = o.ID AND k11.KeywordID = 24
ORDER BY o.Weights Desc, o.ProAddtime Desc
------解决思路----------------------
我觉得中间那段可以简化一下,只查询一次,然后聚合一次,不知道会不会快点
SELECT OriginalID最后的11是指要查询的 KeywordID 的个数
FROM KeywordIndex
WHERE KeywordID IN(1933,1932,1934,1935,1931,14,21,20,23,22,24)
GROUP BY OriginalID
HAVING COUNT(1)=11
------解决思路----------------------
不管如何设主键,必须让(OriginalID,KeywordID)在同一个索引中,把KeywordIndex的逻辑读取降下来。
然后再来测试到底是INTERSECT还是JOIN比较快。
SELECT k1.OriginalID
FROM KeywordIndex
JOIN KeywordIndex k2 ON k2.OriginalID = k1.OriginalID AND k2.KeywordID = 1932
JOIN KeywordIndex k3 ON k3.OriginalID = k1.OriginalID AND k3.KeywordID = 1934
JOIN KeywordIndex k4 ON k4.OriginalID = k1.OriginalID AND k4.KeywordID = 1935
JOIN KeywordIndex k5 ON k5.OriginalID = k1.OriginalID AND k5.KeywordID = 1931
JOIN KeywordIndex k6 ON k6.OriginalID = k1.OriginalID AND k6.KeywordID = 14
JOIN KeywordIndex k7 ON k7.OriginalID = k1.OriginalID AND k7.KeywordID = 21
JOIN KeywordIndex k8 ON k8.OriginalID = k1.OriginalID AND k8.KeywordID = 20
JOIN KeywordIndex k9 ON k9.OriginalID = k1.OriginalID AND k9.KeywordID = 23
JOIN KeywordIndex k10 ON k10.OriginalID = k1.OriginalID AND k10.KeywordID = 22
JOIN KeywordIndex k11 ON k11.OriginalID = k1.OriginalID AND k11.KeywordID = 24
WHERE k1.KeywordID = 1933
------解决思路----------------------
把查询执行计划贴出来。
------解决思路----------------------
你代码中的当中一段代码,intersect是求出多个集合中的交集吗?
------解决思路----------------------
把指定的KeywordID 插入到一个临时表#t table (fid int primary key)
再
SELECT * FROM ( SELECT OuterID AS ResultID, ROW_NUMBER() OVER(Order by Weights Desc, ProAddtime Desc) AS RowNum FROM #t a
inner join Original b on b.ID=a.fID
) a
另外,按ID的后2位(ID % 100) 分100个区,能重复利用多核的好处
------解决思路----------------------
关注,最多和最小的先交集 呢? 个人模拟测试这种开销最小,但可能逻辑读增多
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1933
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 24
------解决思路----------------------
关键是要把intersect这个语句的时间降下来,要看查询计划
------解决思路----------------------
从SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1933
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1932
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1934
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1935
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 1931
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 14
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 21
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 20
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 23
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 22
INTERSECT
SELECT OriginalID FROM KeywordIndex WHERE KeywordID = 24 中读取的实际可用的数据量 大概有多少?
交集中重复的数据是不是很多呢?,我觉得需要从这个下手,读取其中 实际用到的不重复的OriginalID ,另外你测试的分区有问题啊,这样加剧了复杂读啊!建议从业务需求 下手,简化 不从2.7亿这么庞大的数据量读取,将2.7压缩想要的数据,这样依赖 会大大缩减 SQL执行时间
------解决思路----------------------
远程看下实际情况,才能给个准确的实现方案
设计、开发是个技术活,必须从细节、全面上着手