Implementing Incremental Updates in Hive

Editor: DB2教程

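An insurance company keeps a table of customer information that includes each customer's id, name, and age (only these fields are listed, for demonstration purposes).
Create the Hive table: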
-- Customer table: partitioned by dt, stored as '|'-delimited text
create table customer
(
  id   int,
  age  tinyint,
  name string
)
partitioned by (dt string)
row format delimited
fields terminated by '|'
stored as textfile;

Load the initial data:
load data local inpath '/home/hadoop/hivetestdata/customer.txt' into table customer partition(dt = '201506');
hive> select * from customer order by id;
customer.id customer.age customer.name customer.dt
1 25 jiangshouzhuang 201506
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506

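For an insurance company, customer data changes every day. We use a temporary table, customer_temp, to hold each day's customer changes; its columns and properties are identical to those of the customer table: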

create table customer_temp like customer;

load data local inpath '/home/hadoop/hivetestdata/customer_temp.txt' into table customer_temp partition(dt = '201506');

It contains sample data such as the following:

hive> select * from customer_temp;
customer_temp.id customer_temp.age customer_temp.name customer_temp.dt
1 26 jiangshouzhuang 201506
5 45 xiaosan 201506

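To update the customer table incrementally, we need to merge the two tables so that new and changed rows in customer_temp take precedence over the rows in customer. The query takes every row from customer_temp, then adds the rows of customer whose id does not appear in customer_temp: a left outer join filtered on b.id is null selects those unchanged rows, and union all combines the two sets.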
hive (hive)> select * from customer_temp
> union all
> select a.* from customer a
> left outer join customer_temp b
> on a.id = b.id where b.id is null;
_u1.id _u1.age _u1.name _u1.dt
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506
1 26 jiangshouzhuang 201506
5 45 xiaosan 201506
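
The query above only displays the merged rows. A minimal sketch of writing them back into customer follows. It assumes that every row involved belongs to the single partition dt = '201506', as in the sample data; that assumption is mine and is not stated in the original. The partition column is left out of the select list because the partition is given statically.

```sql
-- Sketch: persist the merged result into the customer partition.
-- Assumption: all rows live in partition dt = '201506' (as in the sample data).
insert overwrite table customer partition (dt = '201506')
select t.id, t.age, t.name
from (
  -- new and changed customers
  select id, age, name
  from customer_temp
  where dt = '201506'
  union all
  -- existing customers with no matching row in customer_temp
  select a.id, a.age, a.name
  from customer a
  left outer join customer_temp b
    on a.id = b.id and b.dt = '201506'
  where a.dt = '201506'
    and b.id is null
) t;
```

This statement reads the same partition it overwrites. If that is a concern on your Hive version, one option is to write the merged rows to an intermediate table first and then overwrite customer from that table.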

I have seen an approach like the following used online, and it seems to be flawed:
hive> select customer.id,
coalesce(customer_temp.age,customer.age),
customer.name,
coalesce(customer_temp.dt,customer.dt)
from customer_temp
full outer join customer on customer_temp.id = customer.id;
Running it gives:
customer.id _c1 customer.name _c3
1 26 jiangshouzhuang 201506
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506
NULL 45 NULL 201506

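The result is indeed wrong: the new customer (id 5) comes back with NULL for id and name, because those two columns are read only from customer, which has no row for id 5.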
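
One way to repair the full outer join variant, offered as a sketch rather than a definitive fix, is to take every column from customer_temp whenever a matching row exists there and fall back to customer otherwise. The aliases t and c and the use of Hive's if() function are my own choices for illustration; the table and column names come from the example above.

```sql
-- Sketch: full outer join in which every column prefers customer_temp.
-- t.id is not null means a row for this id exists in customer_temp.
select coalesce(t.id, c.id)                 as id,
       if(t.id is not null, t.age,  c.age)  as age,
       if(t.id is not null, t.name, c.name) as name,
       if(t.id is not null, t.dt,   c.dt)   as dt
from customer_temp t
full outer join customer c
  on t.id = c.id;
```

With the sample data, customer 5 should now appear as 5, 45, xiaosan, 201506 instead of carrying NULLs. Testing t.id rather than coalescing each column separately means that a value that is legitimately NULL in customer_temp is not silently replaced by the old value from customer.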

If anyone knows a better or more efficient approach, please share it. Thanks.