拉链表的设计是将更新的数 据进行状态记录,没有发生更新的数据不进行状态存储,用于存储所有数据在不 同时间上的所有状态,通过时间进行标记每个状态的生命周期,查询时,根据需 求可以获取指定时间范围状态的数据,默认用 9999-12-31 等最大值来表示最新状态

1:一张ods原始表负责采集原始数据

2:一张每日增量表

3:一张临时表

4:一张拉链表

代码流程如下

(1) 创建表并载入数据。 
--1.创建表 
CREATE TABLE orders (  
 166 
orderid INT, 
createtime STRING, 
modifiedtime STRING, 
status STRING 
) row format delimited fields terminated by ',' 
 
CREATE TABLE ods_orders_inc ( 
orderid INT, 
createtime STRING, 
modifiedtime STRING, 
status STRING 
) PARTITIONED BY (day STRING) 
row format delimited fields terminated by ',' 
 
CREATE TABLE dw_orders_his ( 
orderid INT, 
createtime STRING, 
modifiedtime STRING, 
status STRING, 
dw_start_date STRING, 
dw_end_date STRING 
) row format delimited fields terminated by ',' ; 
 
--2.载入数据 
load data local inpath '/home/ds_teacher/order.txt' overwrite into table orders; 
(2)首先全量更新,我们先到 2021-08-20 为止的数据。 
初始化,先把 2021-08-20 的数据初始化进去;然后刷到结果表(拉链表)中: 
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-20') 
SELECT orderid,createtime,modifiedtime,status 
FROM orders 
WHERE createtime < '2021-08-21' and modifiedtime <'2021-08-21'; 
 
INSERT overwrite TABLE dw_orders_his 
SELECT orderid,createtime,modifiedtime,status, 
createtime AS dw_start_date, 
'9999-12-31' AS dw_end_date 
FROM ods_orders_inc 
WHERE day = '2021-08-20'; 
查询结果表里面的数据:select * from dw_orders_his; 
1 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31 
2 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31 
3 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31 
(2) 剩余需要进行增量更新: 
--1.获取 2021-08-21 新增或者修改的数据 
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-21') 
SELECT orderid,createtime,modifiedtime,status 
FROM orders 
WHERE (createtime = '2021-08-21' and modifiedtime = '2021-08-
21') OR modifiedtime = '2021-08-21'; 
 
--查询 2021-08-21 的增量数据 
select * from ods_orders_inc where day='2021-08-21'; 
查询结果: 
1 2021-08-20 2021-08-21 支付 2021-08-21 
2 2021-08-20 2021-08-21 完成 2021-08-21 
4 2021-08-21 2021-08-21 创建 2021-08-21 
(3)将增量数据写入到临时表: 
--2.增量数据先写入临时表 
DROP TABLE IF EXISTS dw_orders_his_tmp; 
CREATE TABLE dw_orders_his_tmp AS 
SELECT orderid, 
createtime, 
modifiedtime, 
status, 
dw_start_date, 
dw_end_date 
FROM ( 
 SELECT a.orderid, 
 a.createtime, 
 a.modifiedtime, 
 a.status, 
 a.dw_start_date, 
 CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2021-08-21' THEN '2021-
08-21' ELSE a.dw_end_date END AS dw_end_date 
 FROM dw_orders_his a 
 left outer join (SELECT * FROM ods_orders_inc WHERE day = '2021-08-21') b 
 ON (a.orderid = b.orderid) 
 UNION ALL 
 SELECT orderid, 
 createtime, 
 modifiedtime, 
 status, 
 modifiedtime AS dw_start_date, 
 '9999-12-31' AS dw_end_date 
 FROM ods_orders_inc 
 WHERE day = '2021-08-21' 
) x 
ORDER BY orderid,dw_start_date; 
--查询临时表的数据 
select * from dw_orders_his_tmp; 
结果如下: 
1 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21 
1 2021-08-20 2021-08-21 支付 2021-08-21 9999-12-31 
2 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21 
2 2021-08-20 2021-08-21 完成 2021-08-21 9999-12-31 
3 2021-08-20 2021-08-20 创建 2021-08-20 9999-12-31 
4 2021-08-21 2021-08-21 创建 2021-08-21 9999-12-31 
(3) 将临时数据写入到结果: 
--3.临时数据写入结果表 
INSERT overwrite TABLE dw_orders_his 
SELECT * FROM dw_orders_his_tmp; 
 
--查询结果表的数据,结果与临时表一样 
select * from dw_orders_his; 
 
(4)再根据上面(2)(3)步骤把 2021-08-22 号的数据更新进去,最后结果如下: 
 
--1.获取 2021-08-22 新增或者修改的数据 
INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2021-08-22') 
SELECT orderid,createtime,modifiedtime,status 
FROM orders 
WHERE (createtime = '2021-08-22' and modifiedtime = '2021-08-
22') OR modifiedtime = '2021-08-22'; 
 
--2.增量数据先写入临时表 
DROP TABLE IF EXISTS dw_orders_his_tmp; 
CREATE TABLE dw_orders_his_tmp AS 
SELECT orderid, 
createtime, 
modifiedtime, 
status, 
dw_start_date, 
dw_end_date 
FROM ( 
 SELECT a.orderid, 
 a.createtime, 
 a.modifiedtime, 
 a.status, 
 a.dw_start_date,  
 169 
 CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2021-08-22' THEN '2021-
08-22' ELSE a.dw_end_date END AS dw_end_date 
 FROM dw_orders_his a 
 left outer join (SELECT * FROM ods_orders_inc WHERE day = '2021-08-22') b 
 ON (a.orderid = b.orderid) 
 UNION ALL 
 SELECT orderid, 
 createtime, 
 modifiedtime, 
 status, 
 modifiedtime AS dw_start_date, 
 '9999-12-31' AS dw_end_date 
 FROM ods_orders_inc 
 WHERE day = '2021-08-22' 
) x 
ORDER BY orderid,dw_start_date; 
 
--3.临时数据写入结果表 
INSERT overwrite TABLE dw_orders_his 
SELECT * FROM dw_orders_his_tmp; 
(5)得到结果表的数据: 
1 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21 
1 2021-08-20 2021-08-21 支付 2021-08-21 2021-08-22 
1 2021-08-20 2021-08-22 完成 2021-08-22 9999-12-31 
2 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-21 
2 2021-08-20 2021-08-21 完成 2021-08-21 9999-12-31 
3 2021-08-20 2021-08-20 创建 2021-08-20 2021-08-22 
3 2021-08-20 2021-08-22 支付 2021-08-22 9999-12-31 
4 2021-08-21 2021-08-21 创建 2021-08-21 2021-08-22 
4 2021-08-21 2021-08-22 支付 2021-08-22 9999-12-31 
5 2021-08-22 2021-08-22 创建 2021-08-22 9999-12-31 

拉链表的使用

--查询当前所有有效的记录 
select * from dw_orders_his where dw_end_date = '9999-12-31'; 
--查询 2021-08-21 的历史快照 
select * from dw_orders_his where dw_start_date <= '2021-08-
21' and dw_end_date >= '2021-08-21'; 

作者 admin

张宴银,大数据开发工程师

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注