
拉链表的设计是将更新的数 据进行状态记录,没有发生更新的数据不进行状态存储,用于存储所有数据在不 同时间上的所有状态,通过时间进行标记每个状态的生命周期,查询时,根据需 求可以获取指定时间范围状态的数据,默认用 9999-12-31 等最大值来表示最新状态
1:一张ods原始表负责采集原始数据
2:一张每日增量表
3:一张临时表
4:一张拉链表
代码流程如下
(1) 创建表并载入数据。
--1.创建表
CREATE TABLE orders (
166
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
) row format delimited fields terminated by ','
CREATE TABLE ods_orders_inc (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
) PARTITIONED BY (day STRING)
row format delimited fields terminated by ','
CREATE TABLE dw_orders_his (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
dw_start_date STRING,
dw_end_date STRING
) row format delimited fields terminated by ',' ;
--2.载入数据
load data local inpath '/home/ds_teacher/order.txt' overwrite into table orders;
(2)首先全量更新,我们先到 2021-08-20 为止的数据。
初始化,先把 2021-08-20 的数据初始化进去;然后刷到结果表(拉链表)中:
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-20')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime < '2021-08-21' and modifiedtime <'2021-08-21';
INSERT overwrite TABLE dw_orders_his
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2021-08-20';
查询结果表里面的数据:select * from dw_orders_his;
1 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31
2 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31
3 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31
(2) 剩余需要进行增量更新:
--1.获取 2021-08-21 新增或者修改的数据
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-21')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE (createtime = '2021-08-21' and modifiedtime = '2021-08-
21') OR modifiedtime = '2021-08-21';
--查询 2021-08-21 的增量数据
select * from ods_orders_inc where day='2021-08-21';
查询结果:
1 2021-08-20 2021-08-21 支付 2021-08-21
2 2021-08-20 2021-08-21 完成 2021-08-21
4 2021-08-21 2021-08-21 创建 2021-08-21
(3)将增量数据写入到临时表:
--2.增量数据先写入临时表
DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2021-08-21' THEN '2021-
08-21' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2021-08-21') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2021-08-21'
) x
ORDER BY orderid,dw_start_date;
--查询临时表的数据
select * from dw_orders_his_tmp;
结果如下:
1 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21
1 2021-08-20 2021-08-21 支付 2021-08-21 9999-12-31
2 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21
2 2021-08-20 2021-08-21 完成 2021-08-21 9999-12-31
3 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31
4 2021-08-21 2021-08-21 创建 2021-08-21 9999-12-31
(3) 将临时数据写入到结果:
--3.临时数据写入结果表
INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;
--查询结果表的数据,结果与临时表一样
select * from dw_orders_his;
(4)再根据上面(2)(3)步骤把 2021-08-22 号的数据更新进去,最后结果如下:
--1.获取 2021-08-22 新增或者修改的数据
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-22')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE (createtime = '2021-08-22' and modifiedtime = '2021-08-
22') OR modifiedtime = '2021-08-22';
--2.增量数据先写入临时表
DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
169
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2021-08-22' THEN '2021-
08-22' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2021-08-22') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2021-08-22'
) x
ORDER BY orderid,dw_start_date;
--3.临时数据写入结果表
INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;
(5)得到结果表的数据:
1 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21
1 2021-08-20 2021-08-21 支付 2021-08-21 2021-08-22
1 2021-08-20 2021-08-22 完成 2021-08-22 9999-12-31
2 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21
2 2021-08-20 2021-08-21 完成 2021-08-21 9999-12-31
3 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-22
3 2021-08-20 2021-08-22 支付 2021-08-22 9999-12-31
4 2021-08-21 2021-08-21 创建 2021-08-21 2021-08-22
4 2021-08-21 2021-08-22 支付 2021-08-22 9999-12-31
5 2021-08-22 2021-08-22 创建 2021-08-22 9999-12-31
拉链表的使用
--查询当前所有有效的记录
select * from dw_orders_his where dw_end_date = '9999-12-31';
--查询 2021-08-21 的历史快照
select * from dw_orders_his where dw_start_date <= '2021-08-
21' and dw_end_date >= '2021-08-21';