Doris Data Modeling Notes

Guidelines

Modeling

  1. [Mandatory] The partition column and the bucketing column must not be the same column. Partitioning already splits the data by that column, so bucketing by it again adds nothing and easily causes data skew.
  2. [Mandatory] Every table must have a table comment and per-column comments, and they must be meaningful and match the business. The platform's table-creation page can sync column comments directly from the Hive table; using it is recommended.
  3. [Mandatory] Choice of Doris table model: use the REPLACE (unique) model for pre-aggregated results from the warehouse (e.g. product dashboards, the business overview), and the aggregate model for detail data from the warehouse; storing warehouse detail data in the duplicate model is forbidden.
  4. [Mandatory] A Doris table keeps only the columns the current requirement needs; importing extra warehouse columns into Doris is forbidden. If a later iteration needs them, import them then.
  5. [Recommended] Create separate Doris tables for day, week and month, so partitions and rollups are used most effectively. If the warehouse only provides a daily table, importing week and month on top of it multiplies the import load; if instead you skip the monthly table and derive months from a rollup on the daily table, the monthly query cannot prune by month partitions (a rollup is built within each partition), so it is certainly slower than a dedicated monthly table. In short, it is a trade-off between import cost and query cost.
  6. [Recommended] bu_id can be used as the bucketing column (wh_id for warehouse-dimension tables). A bucket is roughly analogous to an Elasticsearch shard. The new cluster currently has 5 Doris machines, so do not over-bucket; 3 to 8 buckets is recommended, tuned by data volume: the larger the volume, the more buckets.
  7. [Recommended] Do not hand-write the CREATE TABLE SQL; use the platform's table-creation page. The platform currently only supports daily-partitioned tables, not weekly or monthly ones; once those are supported, this rule will become [Mandatory].
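Rules 1, 2 and 6 can be illustrated with a minimal sketch (the table and columns here are hypothetical, not from an actual system):

```sql
-- Hypothetical daily aggregate table: partitioned by dt, bucketed by a
-- different column (bu_id), with comments on the table and every column.
CREATE TABLE app_board_demo_day (
    `dt`     BIGINT        NOT NULL COMMENT "date partition, yyyyMMdd",
    `bu_id`  BIGINT        NOT NULL COMMENT "business unit id",
    `sku_id` BIGINT        NOT NULL COMMENT "sku id",
    `amt`    DECIMAL(27,6) SUM NULL COMMENT "sales amount"
) ENGINE = OLAP
AGGREGATE KEY(`dt`, `bu_id`, `sku_id`)
COMMENT "demo: daily sales aggregate"
PARTITION BY RANGE(`dt`)
(
    PARTITION p20210501 VALUES [("20210501"), ("20210502"))
)
DISTRIBUTED BY HASH(`bu_id`) BUCKETS 5;  -- bucket column differs from partition column
```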

Querying

  1. [Mandatory] On aggregate-model Doris tables, never add functions such as if inside the aggregate function, e.g. sum(if ...). With the extra function the query can no longer use rollups; explain will show PreAggregation: OFF on the ScanNode, and OFF means rollups cannot be used.
  2. [Mandatory] For features where the page passes metric ids to fetch specific metrics (e.g. custom metric selection, clicking a core metric to load its curve), the backend must query only the metrics the frontend asked for; querying all metrics from the database and returning only the requested subset is forbidden.
  3. [Mandatory] On duplicate or unique models, select * is forbidden. Reason: Doris is a columnar store, so for a row such as id=1 each column's value lives in a separate file, unlike a row store such as MySQL where a row is physically contiguous; for Doris, the IO cost grows with every extra column queried.
  4. [Mandatory] When building SQL, match the literal to the column type: for an int column write dt=20210101, not dt='20210101', or partition pruning is skipped (note: a later version apparently handles this, but writing literals to match the column type is still the right habit).
  5. [Mandatory] Querying Doris from multiple threads in application code is forbidden. Doris is CPU-sensitive; if an endpoint is slow, the right fix is to optimize its slow SQL, not to run the statements concurrently. If one statement takes 500 ms, even 10 statements in one request take only 5 s sequentially, which is enough to satisfy users.
  6. [Recommended] Drill-down queries should not return all data up front (the frontend should only call the backend when the user actually drills down). A one-shot query puts more load on the database, and on our pages most users only look at the summary and never drill down, so fetching everything wastes query resources, especially for queries that aggregate in real time. This rule should eventually become [Mandatory].

No if (or other functions) inside aggregate functions

If Doris finds another function inside an aggregate function, the query cannot use rollups.

In explain output, the PreAggregation value on the ScanNode will be OFF, and OFF means rollups cannot be used.

Applying an aggregate function to a dimension column does not prevent rollup usage.

# The table is at sku granularity, with a rollup at level-3 category granularity.
# With or without count(distinct bu_id), the sql below can use the rollup.
explain select count(distinct bu_id),sum(sku_cnt) from app_vendor_board_basesku_day where dt=20210501;
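For contrast, a hypothetical query of the shape below, with if inside sum, would show PreAggregation: OFF in its plan and fall back to the base table:

```sql
# Hypothetical counter-example: the if() inside sum() disables pre-aggregation,
# so this query cannot use the category rollup.
explain select sum(if(sku_cnt > 0, sku_cnt, 0))
from app_vendor_board_basesku_day where dt = 20210501;
```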

Do not query columns you do not need

# The curve query below displays at most 2 curves at a time, yet fetches every metric at once; it takes about 1.3 s
select *
from (
select date_scale as dt,
max(begin_date_id) as begin_date_id,
max(end_date_id) as end_date_id,
sum(arranged_amt) as arranged_amt,
sum(arranged_sale_amt) as arranged_sale_amt,
sum(arranged_cnt) as arranged_cnt,
sum(arranged_coupon_amt) as arranged_coupon_amt,
sum(arranged_profit) as arranged_profit,
sum(arranged_sale_profit) as arranged_sale_profit,
sum(stock_cost_not_presale) as stock_cost_not_presale,
sum(base_sku_code_stock_cost_not_presale) as base_sku_code_stock_cost_not_presale,
sum(base_sku_code_ord_outbound_cost_not_presale) as base_sku_code_ord_outbound_cost_not_presale,
sum(d30_no_move_stock_cost) as d30_no_move_stock_cost,
max(days_count) as days_count,
sum(arranged_headquarters_coupon_amt) as arranged_headquarters_coupon_amt,
sum(arranged_cost) as arranged_cost,
sum(gy_days_count) as gy_days_count,
sum(gy_days_count_unpresale) as gy_days_count_unpresale,
sum(gy_outstock_days_count) as gy_outstock_days_count,
sum(gy_offshelf_days_count) as gy_offshelf_days_count,
sum(gy_offshelf_days_count_unpresale) as gy_offshelf_days_count_unpresale,
sum(outstock_days_count) as outstock_days_count,
sum(onshelf_days_count) as onshelf_days_count,
sum(onshelf_days_count_unpresale) as onshelf_days_count_unpresale,
sum(case_nopur_coupon) as case_nopur_coupon,
sum(case_nopur_rtn_amt) as case_nopur_rtn_amt,
sum(scrapped_amt) as scrapped_amt,
sum(zonghe_kaohe_numerator_amt) as zonghe_kaohe_numerator_amt,
sum(zonghe_numerator_amt) as zonghe_numerator_amt,
sum(background_gross_profit) as background_gross_profit,
sum(view_uv) as view_uv,
sum(outstock_uv) as outstock_uv,
sum(view_pv) as view_pv,
sum(outstock_pv) as outstock_pv,
sum(view_uv_unpresale) as view_uv_unpresale,
sum(outstock_uv_unpresale) as outstock_uv_unpresale,
sum(view_pv_unpresale) as view_pv_unpresale,
sum(outstock_pv_unpresale) as outstock_pv_unpresale,
sum(old_stock_cost_not_presale) as old_stock_cost_not_presale,
sum(old_ord_outbound_cost_not_presale) as old_ord_outbound_cost_not_presale
from app_prod_sku_purchase_board_sale_kpi_cat3
where 1 =1
AND date_scale BETWEEN '20210321' AND '20210420'
AND business_type = 2
AND bu_id IN (110100, 120100, 310100, 340100, 710400, 11000003, 11000006, 11000008, 11000009, 11000010, 11000011, 11000012, 11000016, 11000017, 11000018, 11000019, 11000020, 11000021, 11000024, 11000027, 11000028, 11000031, 11000032, 11000033, 11000037, 11000039, 11000040, 11000041, 11000042, 11000043, 11000044, 11000045, 11000046, 11000047, 11000073, 11000075, 11000088, 11000111, 11000116, 11000125, 11000126, 11000129, 11000132, 11000133, 11000134, 11000135, 11000136, 11000137, 11000138, 11000139, 11000140, 11000141, 11000142, 11000143, 11000144, 11000145, 11000146, 11000147, 11000148, 11000149, 11000150, 11000151, 11000152, 11000153, 11000154, 11000155, 11000156, 11000157, 11000158, 11000159, 11000160, 11000161, 11000162, 11000163, 11000164, 11000165, 11000166, 11000167, 11000168, 11000169, 11000170, 11000171, 11000172, 11000173, 11000174, 11000175, 11000176, 11000177, 11000178, 11000179, 11000180, 11000181, 11000182, 11000183, 11000184, 11000185, 11000186, 11000187, 11000188, 11000189, 11000190, 11000191, 11000192, 11000193, 11000194, 11000195, 11000196, 11000197, 11000198, 11000199, 11000200, 11000201, 11000202, 11000203, 11000204, 11000205, 11000206, 11000207, 11000208, 11000209, 11000210, 11000211, 11000212, 11000213, 11000214, 11000215, 11000216, 11000217, 11002915, 11002954, 11002961)
AND time_tag = 0
group by dt
) t1
left join(
select date_scale as dt,
count(distinct if(is_arranged = 1, bu_sku_id, null)) as has_sale_sku_cnt,
count(distinct if(is_onself = 1, bu_sku_id, null)) as on_shelf_sku_cnt,
count(distinct if(is_onself = 1 and is_presale_no = 0, bu_sku_id, null)) as on_shelf_sku_cnt_unpresale,
count(distinct if(is_outstock = 1, bu_sku_id, null)) as outstock_sku_cnt,
count(distinct if(is_unsalable = 1, bu_sku_id, null)) as dull_sku_cnt
from app_prod_sku_purchase_board_sale_kpi_cat3
where 1 =1
AND date_scale BETWEEN '20210321' AND '20210420'
AND business_type = 2
AND bu_id IN (110100, 120100, 310100, 340100, 710400, 11000003, 11000006, 11000008, 11000009, 11000010, 11000011, 11000012, 11000016, 11000017, 11000018, 11000019, 11000020, 11000021, 11000024, 11000027, 11000028, 11000031, 11000032, 11000033, 11000037, 11000039, 11000040, 11000041, 11000042, 11000043, 11000044, 11000045, 11000046, 11000047, 11000073, 11000075, 11000088, 11000111, 11000116, 11000125, 11000126, 11000129, 11000132, 11000133, 11000134, 11000135, 11000136, 11000137, 11000138, 11000139, 11000140, 11000141, 11000142, 11000143, 11000144, 11000145, 11000146, 11000147, 11000148, 11000149, 11000150, 11000151, 11000152, 11000153, 11000154, 11000155, 11000156, 11000157, 11000158, 11000159, 11000160, 11000161, 11000162, 11000163, 11000164, 11000165, 11000166, 11000167, 11000168, 11000169, 11000170, 11000171, 11000172, 11000173, 11000174, 11000175, 11000176, 11000177, 11000178, 11000179, 11000180, 11000181, 11000182, 11000183, 11000184, 11000185, 11000186, 11000187, 11000188, 11000189, 11000190, 11000191, 11000192, 11000193, 11000194, 11000195, 11000196, 11000197, 11000198, 11000199, 11000200, 11000201, 11000202, 11000203, 11000204, 11000205, 11000206, 11000207, 11000208, 11000209, 11000210, 11000211, 11000212, 11000213, 11000214, 11000215, 11000216, 11000217, 11002915, 11002954, 11002961)
group by dt
) t2
on t1.dt = t2.dt
left join(
select dt as dt,
count(distinct customer_id) as customer_cnt,
count(distinct dt_customer_id) as dt_customer_cnt
from app_prod_sku_purchase_board_customer_order_cat3
where 1 = 1
AND dt BETWEEN '20210321' AND '20210420'
AND business_type = 2
AND bu_id IN (110100, 120100, 310100, 340100, 710400, 11000003, 11000006, 11000008, 11000009, 11000010, 11000011, 11000012, 11000016, 11000017, 11000018, 11000019, 11000020, 11000021, 11000024, 11000027, 11000028, 11000031, 11000032, 11000033, 11000037, 11000039, 11000040, 11000041, 11000042, 11000043, 11000044, 11000045, 11000046, 11000047, 11000073, 11000075, 11000088, 11000111, 11000116, 11000125, 11000126, 11000129, 11000132, 11000133, 11000134, 11000135, 11000136, 11000137, 11000138, 11000139, 11000140, 11000141, 11000142, 11000143, 11000144, 11000145, 11000146, 11000147, 11000148, 11000149, 11000150, 11000151, 11000152, 11000153, 11000154, 11000155, 11000156, 11000157, 11000158, 11000159, 11000160, 11000161, 11000162, 11000163, 11000164, 11000165, 11000166, 11000167, 11000168, 11000169, 11000170, 11000171, 11000172, 11000173, 11000174, 11000175, 11000176, 11000177, 11000178, 11000179, 11000180, 11000181, 11000182, 11000183, 11000184, 11000185, 11000186, 11000187, 11000188, 11000189, 11000190, 11000191, 11000192, 11000193, 11000194, 11000195, 11000196, 11000197, 11000198, 11000199, 11000200, 11000201, 11000202, 11000203, 11000204, 11000205, 11000206, 11000207, 11000208, 11000209, 11000210, 11000211, 11000212, 11000213, 11000214, 11000215, 11000216, 11000217, 11002915, 11002954, 11002961)
group by dt
) t3
on t1.dt = t3.dt
left join(
select dt as dt,
count(distinct customer_id) as bu_customer_cnt,
sum(arranged_amt) as bu_arranged_amt
from app_prod_sku_purchase_board_customer_order_cat3
where 1 = 1
AND dt BETWEEN '20210321' AND '20210420'
AND business_type = 2
AND bu_id IN (110100, 120100, 310100, 340100, 710400, 11000003, 11000006, 11000008, 11000009, 11000010, 11000011, 11000012, 11000016, 11000017, 11000018, 11000019, 11000020, 11000021, 11000024, 11000027, 11000028, 11000031, 11000032, 11000033, 11000037, 11000039, 11000040, 11000041, 11000042, 11000043, 11000044, 11000045, 11000046, 11000047, 11000073, 11000075, 11000088, 11000111, 11000116, 11000125, 11000126, 11000129, 11000132, 11000133, 11000134, 11000135, 11000136, 11000137, 11000138, 11000139, 11000140, 11000141, 11000142, 11000143, 11000144, 11000145, 11000146, 11000147, 11000148, 11000149, 11000150, 11000151, 11000152, 11000153, 11000154, 11000155, 11000156, 11000157, 11000158, 11000159, 11000160, 11000161, 11000162, 11000163, 11000164, 11000165, 11000166, 11000167, 11000168, 11000169, 11000170, 11000171, 11000172, 11000173, 11000174, 11000175, 11000176, 11000177, 11000178, 11000179, 11000180, 11000181, 11000182, 11000183, 11000184, 11000185, 11000186, 11000187, 11000188, 11000189, 11000190, 11000191, 11000192, 11000193, 11000194, 11000195, 11000196, 11000197, 11000198, 11000199, 11000200, 11000201, 11000202, 11000203, 11000204, 11000205, 11000206, 11000207, 11000208, 11000209, 11000210, 11000211, 11000212, 11000213, 11000214, 11000215, 11000216, 11000217, 11002915, 11002954, 11002961)
group by dt
) t4
on t1.dt = t4.dt

# If the subquery fetches a single metric instead, it is very fast, about 0.3 s
select date_scale as dt,
max(begin_date_id) as begin_date_id,
max(end_date_id) as end_date_id,
sum(arranged_amt) as arranged_amt
from app_prod_sku_purchase_board_sale_kpi_cat3
where 1 =1
AND date_scale BETWEEN '20210321' AND '20210420'
AND business_type = 2
AND bu_id IN (110100, 120100, 310100, 340100, 710400, 11000003, 11000006, 11000008, 11000009, 11000010, 11000011, 11000012, 11000016, 11000017, 11000018, 11000019, 11000020, 11000021, 11000024, 11000027, 11000028, 11000031, 11000032, 11000033, 11000037, 11000039, 11000040, 11000041, 11000042, 11000043, 11000044, 11000045, 11000046, 11000047, 11000073, 11000075, 11000088, 11000111, 11000116, 11000125, 11000126, 11000129, 11000132, 11000133, 11000134, 11000135, 11000136, 11000137, 11000138, 11000139, 11000140, 11000141, 11000142, 11000143, 11000144, 11000145, 11000146, 11000147, 11000148, 11000149, 11000150, 11000151, 11000152, 11000153, 11000154, 11000155, 11000156, 11000157, 11000158, 11000159, 11000160, 11000161, 11000162, 11000163, 11000164, 11000165, 11000166, 11000167, 11000168, 11000169, 11000170, 11000171, 11000172, 11000173, 11000174, 11000175, 11000176, 11000177, 11000178, 11000179, 11000180, 11000181, 11000182, 11000183, 11000184, 11000185, 11000186, 11000187, 11000188, 11000189, 11000190, 11000191, 11000192, 11000193, 11000194, 11000195, 11000196, 11000197, 11000198, 11000199, 11000200, 11000201, 11000202, 11000203, 11000204, 11000205, 11000206, 11000207, 11000208, 11000209, 11000210, 11000211, 11000212, 11000213, 11000214, 11000215, 11000216, 11000217, 11002915, 11002954, 11002961)
AND time_tag = 0
group by dt

Sorting also has a large impact on performance

# Query one day's sku metrics ordered by sales amount, selecting *; it is very slow, 8 s+
select *
from app_prod_analyze_dashboard_cube_dt
WHERE date_scale in (20210425)
and time_tag = 0
and bu_id = -1
and sku_id > 0
and sku_band = -1
and activity_type = -1
and sku_life_status = -1
and is_gy = -1
and is_new = -1
and business_type = 2
order by arranged_amt desc
limit 20
offset 0
# With the order by removed, the query is fast, about 0.2 s
select *
from app_prod_analyze_dashboard_cube_dt
WHERE date_scale in (20210425)
and time_tag = 0
and bu_id = -1
and sku_id > 0
and sku_band = -1
and activity_type = -1
and sku_life_status = -1
and is_gy = -1
and is_new = -1
and business_type = 2
# Keeping the order by but selecting a specific column instead of *, the query takes about 0.9 s, also much faster
select arranged_amt
from app_prod_analyze_dashboard_cube_dt
WHERE date_scale in (20210425)
and time_tag = 0
and bu_id = -1
and sku_id > 0
and sku_band = -1
and activity_type = -1
and sku_life_status = -1
and is_gy = -1
and is_new = -1
and business_type = 2
order by arranged_amt desc
limit 20
offset 0

Match literal types to column types in queries

If a column is int but the SQL condition uses a string literal, e.g. dt='20210423', partition pruning does not happen: explain shows that all partitions are scanned instead of only the partition matching the condition.
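A minimal before/after sketch (table and column names are hypothetical):

```sql
# dt is an int column; the string literal defeats partition pruning,
# so explain shows every partition being scanned.
explain select sum(arranged_amt) from app_board_demo_day where dt = '20210423';

# The int literal allows pruning; explain shows a single partition scanned.
explain select sum(arranged_amt) from app_board_demo_day where dt = 20210423;
```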

Impact of storing a dimension-name column as a dimension column

There are currently three ways to handle dimension names:

  • Do not store the name in the Doris table; look it up from a dimension table.
  • Store the name in the Doris table as a metric, aggregated with max.
  • Store the name in the Doris table as a dimension column, alongside the id column.

The first option does not involve Doris, so set it aside. Of the other two, the metric approach has been verified: stored as a max-aggregated metric, the name has essentially no impact on query performance.

The last option, name as a dimension, requires adding the name column to group by in order to select it. Whether that hurts performance still needs verification; extra grouping columns do carry some cost.

grouping set

SELECT
id,
name,
GROUPING(id),
GROUPING(name),
GROUPING_ID(id, name),
SUM(amount) as res
FROM
myTest
GROUP BY
GROUPING SETS ((id, name), (id))
order by GROUPING(name) desc ,res desc

Ordering grouping-set results:

  • The GROUPING SETS clause defines two groupings; in rows from the (id) grouping, name is always null.
  • In rows from the (id, name) grouping, name can also be null if the data itself contains null names.
  • To tell whether a null name is a real null or comes from name being absent from the grouping, use GROUPING(name): a return value of 1 means the row was definitely produced by the (id) grouping. Those rows are the aggregated totals and usually belong first, hence the order by GROUPING(name) desc.


Drawbacks of grouping sets

If the grouping combinations in a grouping set differ greatly in granularity, performance suffers. Take the all-category level versus the sku level: one is a total over all skus, the other shows sku detail.

  • Because sku is the finest granularity, the query can only run against the base table; the summary therefore cannot use a rollup either and must be aggregated from the base table in real time, which is slow.
  • Pages are normally paginated; if the summary rows already fill the page, the detail does not need to be queried at all, yet a grouping set queries it anyway, wasting resources and hurting performance.

The solution is to run several queries in code and combine the results:

  • First query the summary row count; if it exceeds 3000 rows, skip the detail count and detail data entirely.
  • If the summary count is below 3000, query the detail count as well and return the combined total to the frontend.
  • Work out from the requested page which level is needed, summary, detail, or both, and query only that.
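The steps above can be sketched as separate statements (all table and column names here are hypothetical); the application runs them in order and stops as early as it can:

```sql
# Step 1: count summary rows; if the count is >= 3000, stop here.
select count(*) from (
    select cat_id from agg_table where dt = 20210501 group by cat_id
) t;

# Step 2: only when the summary count is below 3000, count the detail rows
# so the frontend gets the combined total.
select count(*) from (
    select sku_id from detail_table where dt = 20210501 group by sku_id
) t;

# Step 3: fetch only the page the frontend asked for, from whichever level
# the requested offset falls into (summary shown here).
select cat_id, sum(amt) from agg_table
where dt = 20210501 group by cat_id
order by sum(amt) desc limit 20 offset 0;
```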

Impact of bucket count on queries

The bucket count should be neither too small nor too large:

  • In the extreme case of one bucket per partition, a query on one date uses the resources of only a single machine.
  • Tablets appear to be scanned concurrently, so a very large bucket count means very high concurrency and high Doris CPU usage.

Rules of thumb:

  • Queries over pre-aggregated results: there is little computation and no cross-BU real-time aggregation, so bucketing by business unit with somewhat more buckets does no harm; a query targets one BU and hits one tablet, and even multi-BU queries over aggregated results are IO-heavy but computation-light. Still, keep the bucket count no higher than the machine count.
  • Real-time aggregation over detail data: keep the bucket count no higher than the machine count.

One query saturating Doris

A download endpoint on the vendor dashboard saturated Doris with a single download. The cause: it queried Doris from multiple threads, and the table involved many joins. Because the export drilled down, it issued over a thousand Doris queries; even though that was not 1000 concurrent queries but only a dozen or so in parallel, sustained over about 100 rounds, the load was still heavy.

Global dictionary for bitmaps

The maximum number a bigint can hold is 18446744073709551615 (20 digits).

For daily-average or per-city style metrics, the bitmap key has to be built by concatenating several columns, e.g. daily-average per-city on-shelf sku count: dt + bu + onshelf_sku_id.

Concatenating the digits directly can overflow bigint: 20210527 + 11000039 + 50004121, for example, concatenates into a 24-digit number, well past the 20-digit bigint maximum.
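One possible workaround (an assumption on my part, not stated above) is to hash the composite key into the bigint range with Doris's bitmap_hash, accepting a small collision error, or to maintain a global dictionary table that maps the composite string to a compact id. A sketch of the hash variant, with a hypothetical table name:

```sql
# Hypothetical: hash dt + bu_id + sku_id into the bitmap instead of
# concatenating digits (which can overflow bigint). bitmap_hash trades a
# small hash-collision error for a bounded key space.
select bitmap_union_count(
    bitmap_hash(concat(cast(dt as varchar), '_',
                       cast(bu_id as varchar), '_',
                       cast(sku_id as varchar)))
)
from onshelf_detail_table
where dt = 20210527;
```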

Splitting detail data from aggregated data

Cases like the seller multi-dimensional analysis need both detail data and category-level aggregates; because category-level permissions must be applied, the warehouse provides detail rows and the backend aggregates them in real time.

Query performance suffers, though, when detail rows are read out of an aggregate model, as the two statements below show:

# Detail-level metrics queried through aggregation: 5 s
SELECT t44923_0.`dt` `dt`,
t44923_0.`bu_name` `buname`,
t44923_0.`seller_id` `seller_id`,
t44923_0.`seller_name` `seller_name`,
t44923_0.`sku_id` `sku_id`,
t44923_0.`sku_name` `sku_name`,
t44923_0.`brand_name` `brand_name`,
t44923_0.`sku_unit_desc` `sku_unit_desc`,
t44923_0.`cat1_id` `cat1_id`,
t44923_0.`cat1_name` `cat1_name`,
sum(t44923_0.arranged_weight) `arranged_weight_3p`,
sum(t44923_0.arranged_cnt) `arranged_cnt_3p`,
sum(t44923_0.billing_amt)/(sum(t44923_0.arranged_amt)+sum(t44923_0.coupon_amt)) `billing_rate_3p`,
bitmap_union_count(t44923_0.quality_case_customer_str)/bitmap_union_count(t44923_0.ord_customer_str) `quality_case_cust_rate_3p`,
sum(t44923_0.arranged_amt) `arranged_amt_3p`,
bitmap_union_count(t44923_0.order_code) `order_cnt_3p`,
sum(t44923_0.arranged_amt)/bitmap_union_count(t44923_0.dt_customer_id) `day_cust_order_amt_3p`
FROM kldp_data_stat.app_invite_business_analysis_view_all t44923_0
where t44923_0.`dt` = 20210523
and t44923_0.`bu_id` in(11002915,11000217,11000215,11000216,11000213,11000214,11000178,11000211,11000179,11000212,11000176,11000177,11000210,11000185,11000186,11000183,11000184,11000181,11000182,120100,11000180,11000189,11000187,11000188,11000075,11000196,11000197,11000073,11000194,11000195,11000192,11000193,11000190,11000191,11000039,11000037,11000158,11000159,11000156,11000157,11000033,11000154,11000155,11000042,11000163,11000043,11000164,11000040,11000161,11000041,11000162,11000160,340100,11000208,11000209,11000206,11000207,11000204,11000205,11000169,11000202,11000203,11000046,11000167,11000200,11000047,11000168,11000201,11000044,11000165,11000045,11000166,11000174,11000175,11000172,11000173,11000170,11000171,710400,11000019,110100,11000017,11000138,11000018,11000139,11002954,11000136,11000016,11000137,11000134,11000135,11000011,11000132,11000012,11000133,11000020,11000141,11000021,11000142,11000140,11000028,11000149,11000147,11000027,11000148,11000024,11000145,11000146,11002961,11000143,11000144,11000031,11000152,11000032,11000153,11000150,11000151,11000116,11000198,310100,11000111,11000199,11000008,11000129,11000009,11000006,11000125,11000126,11000003,11000088,11000010)
GROUP BY
dt,bu_id,cat1_id,cat2_id,cat3_id,cat4_id,seller_id,sku_id,bu_name,cat1_name,cat2_name,cat3_name,cat4_name,seller_name,sku_name,brand_name,sku_unit_desc
order by `arranged_amt_3p` DESC nulls last
limit 20 offset 0;
# The detail rows queried directly, without group by: 500 ms
SELECT t44923_0.`dt` `dt`,
t44923_0.`bu_name` `buname`,
t44923_0.`seller_id` `seller_id`,
t44923_0.`seller_name` `seller_name`,
t44923_0.`sku_id` `sku_id`,
t44923_0.`sku_name` `sku_name`,
t44923_0.`brand_name` `brand_name`,
t44923_0.`sku_unit_desc` `sku_unit_desc`,
t44923_0.`cat1_id` `cat1_id`,
t44923_0.`cat1_name` `cat1_name`,
t44923_0.arranged_weight `arranged_weight_3p`,
t44923_0.arranged_cnt `arranged_cnt_3p`,
t44923_0.billing_amt/(t44923_0.arranged_amt+t44923_0.coupon_amt) `billing_rate_3p`,
bitmap_count(t44923_0.quality_case_customer_str)/bitmap_count(t44923_0.ord_customer_str) `quality_case_cust_rate_3p`,
t44923_0.arranged_amt `arranged_amt_3p`,
bitmap_count(t44923_0.order_code) `order_cnt_3p`,
t44923_0.arranged_amt/bitmap_count(t44923_0.dt_customer_id) `day_cust_order_amt_3p`
FROM kldp_data_stat.app_invite_business_analysis_view_dt t44923_0
where t44923_0.`dt` = 20210523
and t44923_0.`bu_id` in(11002915,11000217,11000215,11000216,11000213,11000214,11000178,11000211,11000179,11000212,11000176,11000177,11000210,11000185,11000186,11000183,11000184,11000181,11000182,120100,11000180,11000189,11000187,11000188,11000075,11000196,11000197,11000073,11000194,11000195,11000192,11000193,11000190,11000191,11000039,11000037,11000158,11000159,11000156,11000157,11000033,11000154,11000155,11000042,11000163,11000043,11000164,11000040,11000161,11000041,11000162,11000160,340100,11000208,11000209,11000206,11000207,11000204,11000205,11000169,11000202,11000203,11000046,11000167,11000200,11000047,11000168,11000201,11000044,11000165,11000045,11000166,11000174,11000175,11000172,11000173,11000170,11000171,710400,11000019,110100,11000017,11000138,11000018,11000139,11002954,11000136,11000016,11000137,11000134,11000135,11000011,11000132,11000012,11000133,11000020,11000141,11000021,11000142,11000140,11000028,11000149,11000147,11000027,11000148,11000024,11000145,11000146,11002961,11000143,11000144,11000031,11000152,11000032,11000153,11000150,11000151,11000116,11000198,310100,11000111,11000199,11000008,11000129,11000009,11000006,11000125,11000126,11000003,11000088,11000010)
order by `arranged_amt_3p` DESC nulls last
limit 20 offset 0;

As the numbers show, for the same data the aggregated query takes roughly 10 times as long as reading the detail rows directly (5 s versus 500 ms). Hence the modeling recommendation:

  • For results below level-3 category (level-4 and sku): have the warehouse pre-aggregate them, store them in Doris with the unique key model, and query with a plain select, no group by.
  • For level-3 category and above: use the Doris aggregate model to roll the warehouse detail up to level-3 category granularity, and aggregate in real time at query time.
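A minimal sketch of the first bullet (table and columns hypothetical): pre-aggregated sku-level results live in a unique-key table and are read back without any group by:

```sql
# Hypothetical unique-key table holding warehouse-aggregated sku-level results.
CREATE TABLE app_seller_sku_result_dt (
    `dt`           BIGINT        COMMENT "date partition, yyyyMMdd",
    `bu_id`        BIGINT        COMMENT "business unit id",
    `sku_id`       BIGINT        COMMENT "sku id",
    `arranged_amt` DECIMAL(27,6) COMMENT "pre-aggregated sales amount"
) ENGINE = OLAP
UNIQUE KEY(`dt`, `bu_id`, `sku_id`)
DISTRIBUTED BY HASH(`bu_id`) BUCKETS 3;

# Rows are already at the target granularity, so no group by is needed.
select dt, bu_id, sku_id, arranged_amt
from app_seller_sku_result_dt
where dt = 20210523
order by arranged_amt desc
limit 20 offset 0;
```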

How to query a metric tied to multiple dimension scopes

Take sales-amount share, defined as the selected category's sales divided by the business unit's total sales. How should SQL for such a metric be generated dynamically?

The duplicate model and rollups

The duplicate model can also have rollups, but they behave differently from aggregate-model rollups:

  • An aggregate-model rollup aggregates the metric columns; the duplicate model has no notion of metric columns and performs no aggregation at all.
  • A duplicate-model rollup can likewise select a subset of columns, but it acts only as a secondary index: it holds exactly as many rows as the base table, just with different columns.

An example of creating a rollup:

CREATE TABLE kldp_data_stat_test_stage.doris_sm_star_pa_org_area (
`dt` BIGINT NULL COMMENT "extraction date partition",
`bu_id` BIGINT NULL COMMENT "business unit id",
`cat_type` BIGINT NULL COMMENT "merchant type: 1 catering, 2 distribution, 0 all",
`org_id` BIGINT NULL COMMENT "sales group id",
`area_id` BIGINT NULL COMMENT "area id",
`type` VARCHAR(32) NULL COMMENT "time granularity: day, wk (7 days), mo (month)",
`area_name` VARCHAR(256) NULL COMMENT "area name",
`bu_name` VARCHAR(256) NULL COMMENT "business unit name",
`city_id` BIGINT NULL COMMENT "city id",
`last_day_sale_amt` DECIMAL(27,6) NULL COMMENT "previous-period sales amount",
`total_new_byr_amt` DECIMAL(27,6) NULL COMMENT "new-customer sales amount",
`total_old_byr_amt` DECIMAL(27,6) NULL COMMENT "returning-customer sales amount",
`total_byr_cnt_daily` BIGINT NULL COMMENT "daily average cooperating merchants",
`gross_rate` DECIMAL(27,6) NULL COMMENT "gross margin rate",
`gross_amt` DECIMAL(27,6) NULL COMMENT "gross margin amount"
) ENGINE = OLAP
DUPLICATE KEY(`dt`, `bu_id`, `cat_type`, `org_id`, `area_id`, `type`)
COMMENT "staff-efficiency theme, BD efficiency, group-level summary metrics"
PARTITION BY RANGE(`dt`)
(
PARTITION p20210713 VALUES [("19700101"),("20210713")),
PARTITION p20210714 VALUES [("20210713"),("20210714"))
)
DISTRIBUTED BY HASH(`type`) BUCKETS 3
rollup (
xxx_rollup( bu_id,org_id,area_id,gross_amt,gross_rate)
)
PROPERTIES (
"storage_format" = "v2")