Oushu Database 3.0新特性 (发布日期:2017年9月20日)

Oushu Database 3.0是一个Major Release,最大的一个亮点功能是完全重新设计的新执行器,性>能比2.2版本的执行器要快10倍左右,是当之无愧的世界上最快的执行器。新执行器充分利用了新>的CPU硬件指令,比如SIMD等,可以做到不浪费CPU的每一个时钟周期。

众所周知,Oushu Database 2.2版本的执行器基于Apache HAWQ的执行器,而Apache HAWQ的执行器 起源于Greenplum Database和PostgreSQL,是一个很成熟的执行器,经过多年的优化和沉淀。Oushu Database 3.0的全新执行器能够比老的执行器快10倍左右,其中难度可想而知,性能的指数级提 升是偶数卓越技术团队经过无数个日夜探索的巨大突破。

具体来说,Oushu Database 3.0版本比Oushu Database 2.2.0版本有如下新特性增强:

  • 全新执行器,目前已支持scan,projection,filter,aggregation等基本操作,性能提升约10倍。不支持的操作会fallback到老的执行器
  • 支持ORC外部存储格式,结合新的执行器,外部存储的性能提升10-50倍
  • 支持新一代可插拔存储框架,只需编写几个函数就可以添加一个外部数据源
  • 支持PostGIS,提供了存储、查询和修改空间关系的能力
  • 关键Bug fixes

下表给出了Oushu Database 3.0版本支持的操作系统。

操作系统 版本
Redhat/Centos 7.0
Redhat/Centos 7.1
Redhat/Centos 7.2
Redhat/Centos 7.3

体验新执行器

本章节通过TPCH lineitem 表来展示新执行器的使用。

建立lineitem 外部表用来生成TPCH lineitem 数据,

CREATE EXTERNAL WEB TABLE E_LINEITEM ( L_ORDERKEY    INT8 ,
                                       L_PARTKEY     INTEGER ,
                                       L_SUPPKEY     INTEGER ,
                                       L_LINENUMBER  INTEGER ,
                                       L_QUANTITY    FLOAT ,
                                       L_EXTENDEDPRICE  FLOAT ,
                                       L_DISCOUNT    FLOAT ,
                                       L_TAX         FLOAT ,
                                       L_RETURNFLAG  VARCHAR(1) ,
                                       L_LINESTATUS  VARCHAR(1) ,
                                       L_SHIPDATE    TEXT ,
                                       L_COMMITDATE  TEXT ,
                                       L_RECEIPTDATE TEXT ,
                                       L_SHIPINSTRUCT CHAR(25) ,
                                       L_SHIPMODE     VARCHAR(10) ,
                                       L_COMMENT      VARCHAR(44) )
                                       EXECUTE 'bash -c "$GPHOME/bin/dbgen -b $GPHOME/bin/dists.dss -T L -s 1 -N 6 -n $((GP_SEGMENT_ID + 1))"'
                                       on 6 format 'text' (delimiter '|');

创建orc表, 其中localhost:9000 指向 “hawq_dfs_url”。

CREATE WRITABLE EXTERNAL TABLE lineitem
                            ( L_ORDERKEY    INT8,
                              L_PARTKEY     INTEGER,
                              L_SUPPKEY     INTEGER,
                              L_LINENUMBER  INTEGER,
                              L_QUANTITY    FLOAT,
                              L_EXTENDEDPRICE  FLOAT,
                              L_DISCOUNT    FLOAT,
                              L_TAX         FLOAT,
                              L_RETURNFLAG  TEXT,
                              L_LINESTATUS  TEXT,
                              L_SHIPDATE    TEXT,
                              L_COMMITDATE  TEXT,
                              L_RECEIPTDATE TEXT,
                              L_SHIPINSTRUCT TEXT,
                              L_SHIPMODE     TEXT,
                              L_COMMENT      TEXT)
LOCATION ('hdfs://localhost:9000/hawq_default/lineitem')
FORMAT 'orc';

插入数据

INSERT INTO lineitem SELECT * FROM e_lineitem;

使用新执行器(注:现在新执行器只支持ORC格式,不支持的功能会自动fallback到老执行器)

-----获取表行数------
postgres=# set new_executor = on;
SET
postgres=# SELECT COUNT(*) FROM lineitem;
count
---------
6001215
(1 row)
Time: 17.006 ms

postgres=# set new_executor = off;
SET
postgres=# SELECT COUNT(*) FROM lineitem;
count
---------
6001215
(1 row)
Time: 213.248 ms

-----TPCH 查询 1 ------
postgres=# set new_executor = on;
SET
postgres=#  SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity)::bigint as sum_qty,
        sum(l_extendedprice)::bigint as sum_base_price,
        sum(l_extendedprice * (1 - l_discount))::bigint as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax))::bigint as sum_charge,
        avg(l_quantity)::bigint as avg_qty,
        avg(l_extendedprice)::bigint as avg_price,
        avg(l_discount)::bigint as avg_disc,
        count(*) as count_order
FROM
        lineitem
        WHERE
        l_shipdate <= '1998-08-20'
GROUP BY
        l_returnflag,
        l_linestatus;

l_returnflag | l_linestatus | sum_qty  | sum_base_price | sum_disc_price |  sum_charge  | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+----------+----------------+----------------+--------------+---------+-----------+----------+-------------
R            | F            | 37719753 |    56568041381 |    53741292685 |  55889619120 |      26 |     38251 |        0 |     1478870
N            | F            |   991417 |     1487504710 |     1413082168 |   1469649223 |      26 |     38284 |        0 |       38854
A            | F            | 37734107 |    56586554401 |    53758257135 |  55909065223 |      26 |     38273 |        0 |     1478493
N            | O            | 73808911 |   110700990251 |   105167436999 | 109377979031 |      26 |     38248 |        0 |     2894278
(4 rows)

Time: 234.376 ms

postgres=# set new_executor = off;
SET
postgres=#  SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity)::bigint as sum_qty,
        sum(l_extendedprice)::bigint as sum_base_price,
        sum(l_extendedprice * (1 - l_discount))::bigint as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax))::bigint as sum_charge,
        avg(l_quantity)::bigint as avg_qty,
        avg(l_extendedprice)::bigint as avg_price,
        avg(l_discount)::bigint as avg_disc,
        count(*) as count_order
FROM
        lineitem
WHERE
        l_shipdate <= '1998-08-20'
GROUP BY
        l_returnflag,
        l_linestatus;

l_returnflag | l_linestatus | sum_qty  | sum_base_price | sum_disc_price |  sum_charge  | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+----------+----------------+----------------+--------------+---------+-----------+----------+-------------
R            | F            | 37719753 |    56568041381 |    53741292685 |  55889619120 |      26 |     38251 |        0 |     1478870
N            | F            |   991417 |     1487504710 |     1413082168 |   1469649223 |      26 |     38284 |        0 |       38854
A            | F            | 37734107 |    56586554401 |    53758257135 |  55909065223 |      26 |     38273 |        0 |     1478493
N            | O            | 73808911 |   110700990251 |   105167436999 | 109377979031 |      26 |     38248 |        0 |     2894278
(4 rows)

Time: 2341.147 ms

从 2.2.0 升级到 3.0.0

  1. 在master节点关闭hawq 2.2.0集群
sudo su gpadmin
hawq stop cluster -a
  1. 在所有hawq节点上备份 hawq2.2.0的设置
hawq ssh -f hostfile -e 'cp -rf $GPHOME/etc ~/'
  1. 在每个hawq节点上安装hawq 3.0.0的yum源
hawq ssh -f hostfile -e 'sudo wget -O /etc/yum.repos.d/oushu-database.repo http://yum.oushu.io/oushurepo/yumrepo/release/oushu-database/centos7/3.0.0.0/oushu-database.repo'
hawq ssh -f hostfile -e 'sudo wget -O /etc/yum.repos.d/oushu-database-utils.repo http://yum.oushu.io/oushurepo/yumrepo/oushu-database-utils/centos7/1.1.0.0/oushu-database-utils.repo'
hawq ssh -f hostfile -e 'sudo yum makecache'
  1. 在每个hawq节点上安装hawq 3.0.0的rpm包
hawq ssh -f hostfile -e 'sudo yum remove -y hawq'
sudo yum install -y hawq
hawq ssh -f hostfile -e 'sudo yum install -y hawq'
  1. 配置每个hawq节点
hawq ssh -f hostfile -e 'cp -rf ~/etc/* $GPHOME/etc/'
  1. 安装ORC

    1) 在master节点上配置集群到升级模式

    hawq start cluster
    hawq config -c upgrade_mode -v on --skipvalidation
    hawq restart cluster -a
    

    2) 在master节点上的template1库的pg_catalog.pg_proc表上安装ORC,在$GPHOME/etc/hawq-site.xml文件中找到$hawq_master_address_port的值,用其实际值替换命令中的$hawq_master_address_port

    PGOPTIONS='-c gp_session_role=utility' psql -a -p $hawq_master_address_port -d template1 -f $GPHOME/share/postgresql/orc_install.sql > orc_install.out 2>&1
    

    3) 在每个segment节点的template1库的pg_catalog.pg_proc表上安装ORC,在$GPHOME/etc/hawq-site.xml文件中找到$hawq_segment_address_port的值,用其实际值替换命令中的$hawq_segment_address_port

    source /usr/local/hawq/greenplum_path.sh
    PGOPTIONS='-c gp_session_role=utility' psql -a -p $hawq_segment_address_port -d template1 -f $GPHOME/share/postgresql/orc_install.sql > orc_install.out 2>&1
    

    4) 在master节点上将集群配置为普通模式

    hawq config -c upgrade_mode -v off --skipvalidation
    hawq restart cluster -a
    

    5) 在master节点获取数据库用户名称

    psql -a -d template1 -c "select datname from pg_database where datname not in ('hcatalog', 'template0', 'template1') order by datname;"
    

    6) 在master节点的每个数据库的pg_catalog.pg_proc表中安装ORC

    psql -a -d $user_database_name -f $GPHOME/share/postgresql/orc_install.sql > orc_install.out 2>&1
    

    7) 在master节点重启集群

    hawq restart cluster -a