场景:同一张表中,需要找到表中某些特定商品在品类中sku数量和销售额所占的比例。
由于Hive 中不支持 in 用法 ( select xxx from table_namewhere product_id in (select xxx from table)这种写法不可)
子查询必须用 impala 的semi left join 写法
就是说把联表条件写成semi join,但是这样有个问题,没办法取到b的值
比如以下语句,要用b.product_id 做计算,就不能用LEFT SEMI JOIN,只能用JOIN
b表必须用JOIN,且不能算精确值,只能使用NDV算近似值
// 这种写法,只能算a表的product_id;如果要计算b表的内容联表必须用JOIN
// 且不能算b表的精确值,只能使用NDV算近似值SELECT a.category_name AS category_name,count(DISTINCT a.product_id) AS total_distinct_count,NDV(DISTINCT b.product_id) AS distinct_ndv_count
FROM {ds} a LEFT SEMI
JOIN {ds} b ON(a.category_name = b.category_nameAND b.product_id IN {product_clause}AND b.month >= '{date_start}'AND b.month<= '{date_end}')
WHERE a.month >= '{date_start}'AND a.month<= '{date_end}'
GROUP BY category_name// 如果要获得精确值,可以像下面一样,改写为多个子查询然后join
SELECT total.category_name,total.distinct_count,s.distinct_count
FROM(SELECT a.category_name AS category_name,count(DISTINCT a.product_id) AS distinct_countFROM tmall.monthly_sales a LEFT SEMIJOIN tmall.monthly_sales b ON(a.category_name = b.category_nameAND b.product_id IN ({product_clause})AND b.month >= '2020-01-01'AND b.month<= '2020-01-01')WHERE a.month >= '2020-01-01'AND a.month<= '2020-01-01'GROUP BY category_name) total
JOIN(SELECT category_name,count(distinct product_id) AS distinct_countFROM tmall.monthly_salesWHERE product_id IN ({product_clause})AND month >= '2020-01-01'AND month<= '2020-01-01'GROUP BY category_name) sON total.category_name = s.category_name
上面的两种写法在只计算dinstinct product_id时不会有问题
但如果需要计算sales,不能像下面写法这样直接加参数
// 这种先联表再聚合的写法是错的,会有问题。
// 得出的结果,按理说b表是小表,a表是大表。但是计算sales时,小表的结果会大于大表
// 这种写法会先用a表join b表,因为是cateogry_name 作为连接key,所以是一对多的,最后的结果表会比a表大很多SELECT a.category_name AS category_name,count(DISTINCT a.product_id) AS total_distinct_count,NDV(DISTINCT b.product_id) AS distinct_ndv_count,ROUND(SUM(a.sales)) AS total_sales,ROUND(SUM(b.sales)) AS part_salesFROM tmall.monthly_sales aJOIN tmall.monthly_sales b ON(a.category_name = b.category_nameAND b.product_id IN (600224193594,602940287671,544560842135,596578254983,574306481012,565310791654,564612912671,587161279286,595825457694,602458587472,605508995612,612854392019,566808774081,597693067543,555143433791,569329581922,602457799114,613772034225,608833215256,536247480963,612115360237,606906659277,540081840176,602292808985,531692537958,536209773161,531865136366,600220091746,615509973951,586872753337,614324527815,602883927868,561861435110,602052585790,566875674241,606944262413,557192082055,576130847482,616016223150,36487591409,555144225674,613274285623,39502488873,549120416050,556955739771,565042448245,596166012938,580051828054,615850782644,569425850653,613524629350,542617381615,602774974796,596820717557,549345912979,615996739549,615507097842,611611619291,597505494996)AND b.month = '2020-01-01')WHERE a.month = '2020-01-01'GROUP BY category_name// 应该改成这样
// 需要先将两表group by 再聚合
// 之前查count 没问题,是因为我本来就查的是distinct count;如果查sales,就需要格外注意影响
with
a as (select category_name,count(DISTINCT a.product_id) AS total_distinct_count,ROUND(SUM(a.sales)) AS total_salesFROM tmall.monthly_sales aWHERE a.month = '2020-01-01'GROUP BY a.category_name
),
b as (select category_name,count(DISTINCT product_id) AS distinct_ndv_count,ROUND(SUM(sales)) part_salesfrom tmall.monthly_saleswhere month = '2020-01-01'AND product_id IN (600224193594,602940287671,544560842135,596578254983,574306481012,565310791654,564612912671,587161279286,595825457694,602458587472,605508995612,612854392019,566808774081,597693067543,555143433791,569329581922,602457799114,613772034225,608833215256,536247480963,612115360237,606906659277,540081840176,602292808985,531692537958,536209773161,531865136366,600220091746,615509973951,586872753337,614324527815,602883927868,561861435110,602052585790,566875674241,606944262413,557192082055,576130847482,616016223150,36487591409,555144225674,613274285623,39502488873,549120416050,556955739771,565042448245,596166012938,580051828054,615850782644,569425850653,613524629350,542617381615,602774974796,596820717557,549345912979,615996739549,615507097842,611611619291,597505494996)group by category_name
)
SELECT a.category_name,a.total_distinct_count,b.distinct_ndv_count,a.total_sales,b.part_sales
FROM a
JOIN b on a.category_name=b.category_name