Mungeol Heo: Reassigning HDFS Data using Hive

Friday, December 30, 2016

Introduction
- This page documents a test that uses Hive to rewrite the data in HDFS files into a specific format or partition layout.
Details
- hdfs dfs -mkdir /test_cassan
- hdfs dfs -put sknightsgb_20151218__nmslog_raw.data_cassandra01 /test_cassan
- hive query
  - create database test;
  - use test;
  - drop table if exists test01;
    create external table test01 (
    log string
    )
    location '/test_cassan/';
  - drop table if exists test02;
    create external table test02 (
    log string
    )
    partitioned by(yyyymmdd string, hh string)
    location '/test_nmslog/';
  - set hive.exec.dynamic.partition.mode=nonstrict;
  - INSERT OVERWRITE TABLE test02 PARTITION (yyyymmdd, hh)
    select log, date_format(ddate, 'yyyyMMdd') as yyyymmdd, date_format(ddate, 'HH') as hh from (
    select log, from_unixtime(cast(substr(regexp_extract(log, '\"I_RegDateTime\":\"([0-9]+)\"', 1), 1, 10) as int)) as ddate from test01
    )t1;
- hdfs dfs -ls /test_nmslog
  - drwxr-xr-x - hadoop supergroup 0 2016-04-08 14:03 /test_nmslog/yyyymmdd=20151217
    drwxr-xr-x - hadoop supergroup 0 2016-04-08 14:03 /test_nmslog/yyyymmdd=20151218
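
To sanity-check how the partition keys are derived, the inner expression can be run on a single sample line before loading the whole table. The sketch below assumes that I_RegDateTime holds an epoch timestamp whose first 10 digits are seconds, which is what the substr call in the query implies. The sample JSON value is made up for illustration, and the resulting date and hour depend on the time zone Hive runs in. The second half lists the partitions Hive registered for test02 and counts the rows in each.

```sql
-- Hypothetical sample line; only the I_RegDateTime field matters here.
select
  raw_ts,
  from_unixtime(cast(substr(raw_ts, 1, 10) as bigint)) as ddate,
  date_format(from_unixtime(cast(substr(raw_ts, 1, 10) as bigint)), 'yyyyMMdd') as yyyymmdd,
  date_format(from_unixtime(cast(substr(raw_ts, 1, 10) as bigint)), 'HH') as hh
from (
  -- Same regular expression as in the INSERT query.
  select regexp_extract(log, '\"I_RegDateTime\":\"([0-9]+)\"', 1) as raw_ts
  from (select '{"I_RegDateTime":"1450396800000"}' as log) s
) t;

-- After the INSERT has finished: partitions known to Hive, and rows per partition.
use test;
show partitions test02;
select yyyymmdd, hh, count(*) as cnt
from test02
group by yyyymmdd, hh;
```

The hdfs dfs -ls output only shows the first partition level, so show partitions is a quick way to see the hh level as well.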
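Further work
- Extending the Hive query above allows the data to be partitioned down to the minute level (e.g. /mi=00).
- Adding a storage option when creating the table stores the data as sequence files.
- Hive settings can be used to store the data compressed with gzip.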
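
The three items above could be combined roughly as follows. This is only a sketch: the table name test03 and the location /test_nmslog_mi/ are hypothetical, the partition limits are arbitrary values picked to be larger than Hive's defaults, and the compression property names assume a Hadoop 2 style configuration (older clusters use the mapred.output.compression.* names instead).

```sql
use test;

-- Hypothetical table name and location, chosen for this sketch only.
drop table if exists test03;
create external table test03 (
  log string
)
partitioned by (yyyymmdd string, hh string, mi string)
stored as sequencefile
location '/test_nmslog_mi/';

set hive.exec.dynamic.partition.mode=nonstrict;

-- Minute-level partitioning creates up to 1440 partitions per day, which can
-- exceed the default limits (1000 in total, 100 per node). Values are assumptions.
set hive.exec.max.dynamic.partitions=5000;
set hive.exec.max.dynamic.partitions.pernode=5000;

-- Write gzip-compressed output; BLOCK applies to sequence file compression.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test03 partition (yyyymmdd, hh, mi)
select
  log,
  date_format(ddate, 'yyyyMMdd') as yyyymmdd,
  date_format(ddate, 'HH') as hh,
  date_format(ddate, 'mm') as mi  -- lowercase mm is minutes, uppercase MM is month
from (
  select log,
         from_unixtime(cast(substr(regexp_extract(log, '\"I_RegDateTime\":\"([0-9]+)\"', 1), 1, 10) as bigint)) as ddate
  from test01
) t1;
```

Partitioning by minute produces many small files, so whether this layout is worthwhile depends on the data volume per minute.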

Mungeol Heo