Architecture

Overview

Figure 1 shows the components of a typical HAWQ database cluster. There are several master nodes, including the HAWQ masters (master and standby), the HDFS namenodes (master and standby), and an optional YARN resource manager. The HAWQ catalog service is now provided by the active HAWQ master node. All the other nodes in the cluster are slave nodes, where an HDFS datanode and a HAWQ segment are typically deployed and started. If the user prefers to run HAWQ without YARN, HAWQ uses its own built-in standalone resource manager, in which case HAWQ has exclusive use of the whole cluster's resources.

When HAWQ runs a SQL statement that queries or manipulates tables, multiple QEs (Query Executors) are allocated for parallel execution across the segments. All QEs run inside logical resource containers, called virtual segments, that are allocated by the HAWQ resource manager.

Thanks to the MPP++ architecture, HAWQ slave nodes can be dynamically added to the cluster on demand, and no data redistribution is required. When a new HAWQ slave node is started and configured to join the cluster, it sends heartbeats to the active HAWQ master node. Once the master recognizes the node, it is ready to accept future query plans.
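This join protocol can be pictured with a minimal sketch; the names (Master, on_heartbeat, the interval values) are illustrative stand-ins, not HAWQ internals:

    import time

    HEARTBEAT_INTERVAL_SECS = 5   # illustrative value, not a HAWQ default
    HEARTBEAT_TIMEOUT_SECS = 30   # master treats a silent segment as down

    class Master:
        """Hypothetical master-side view of segment liveness."""
        def __init__(self):
            self.segments = {}    # host -> timestamp of last heartbeat

        def on_heartbeat(self, host):
            first_time = host not in self.segments
            self.segments[host] = time.monotonic()
            if first_time:
                print(f"segment {host} joined; eligible for future query plans")

        def live_segments(self):
            now = time.monotonic()
            return [h for h, t in self.segments.items()
                    if now - t < HEARTBEAT_TIMEOUT_SECS]

    def segment_heartbeat_loop(master, host):
        """Slave-side loop: periodically report liveness to the active master."""
        while True:
            master.on_heartbeat(host)
            time.sleep(HEARTBEAT_INTERVAL_SECS)

    m = Master()
    m.on_heartbeat("newnode7")    # first heartbeat: the node joins the cluster
    print(m.live_segments())      # ['newnode7']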

Figure 1. HAWQ architecture

As shown in Figure 2, there are several important components inside the HAWQ master node: Query Parser/Analyzer, Optimizer, Resource Manager, Fault Tolerance Service, Query Dispatcher, and Catalog Service.

On each slave node, only one HAWQ segment needs to be installed and started. The HAWQ resource manager decides the degree of parallelism for each query according to the query's cost and the resources available. In a traditional MPP database architecture, by contrast, the degree of parallelism of a query is fixed and cannot be changed without reconfiguring the cluster and shuffling the data, which leads to throughput and scalability issues. A high-speed interconnect is used for data exchange among QEs.

Besides flexible query execution, HAWQ evenly assigns query tasks by splitting the input data, which helps achieve optimal parallel execution performance. For instance, if a query starts 1000 virtual segments, its target table data is evenly divided into 1000 tasks, and these tasks are then assigned to segments according to data locality to maximize the local I/O ratio. A rough sketch of this assignment step follows.
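The sketch below divides a table's splits among virtual segments, preferring hosts that already hold a replica; the data structures are simplified assumptions, not HAWQ's actual ones:

    from collections import defaultdict

    def assign_splits(splits, vseg_hosts):
        """Evenly assign table splits to virtual segments, preferring locality.

        splits:      list of (split_id, [hosts holding a replica])
        vseg_hosts:  list of hosts, one entry per virtual segment
        Returns a mapping: virtual segment index -> list of split ids.
        """
        quota = -(-len(splits) // len(vseg_hosts))  # ceil: even task count
        assignment = defaultdict(list)
        by_host = defaultdict(list)                 # host -> vseg indices
        for i, host in enumerate(vseg_hosts):
            by_host[host].append(i)

        remote = []
        for split_id, replica_hosts in splits:
            # First pass: place the split on a vseg whose host has a replica.
            placed = False
            for host in replica_hosts:
                for vseg in by_host.get(host, []):
                    if len(assignment[vseg]) < quota:
                        assignment[vseg].append(split_id)
                        placed = True
                        break
                if placed:
                    break
            if not placed:
                remote.append(split_id)

        # Second pass: spill leftovers to the least-loaded vseg (remote reads).
        for split_id in remote:
            vseg = min(range(len(vseg_hosts)), key=lambda v: len(assignment[v]))
            assignment[vseg].append(split_id)
        return assignment

    splits = [(i, ["host%d" % (i % 3)]) for i in range(6)]  # 6 blocks, 3 hosts
    print(dict(assign_splits(splits, ["host0", "host1", "host2"])))
    # {0: [0, 3], 1: [1, 4], 2: [2, 5]} -- every task reads locally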

Figure 2. HAWQ internal architecture

  • Parser and Analyzer: parses queries, performs grammar checking, and generates internal query parse trees.
  • Optimizer: the cost-based optimizer generates an optimized query plan from the query tree. For a given query there may be thousands of equivalent plans, but their execution performance varies widely; the more complex the query, the more sophisticated the optimizer needs to be.
  • Resource Manager: a newly implemented component that dynamically requests global resources through a resource broker, which is responsible for negotiating resources with an optional global resource manager (such as YARN), and returns resources when it is unnecessary to cache too many. In standalone mode, the HAWQ resource manager exclusively occupies the resources of the cluster. The resource manager allocates resources in milliseconds and enforces resource consumption on each cluster node.
  • Fault Tolerance Service: detects the status of each HAWQ node. When a node becomes unavailable, its resources are automatically excluded, and no queries are dispatched to it until the issues are fixed and the node becomes active again.
  • Query Dispatcher: after a query is optimized, the dispatcher sends the plan to the nodes according to the locations of the allocated resources; it acts as the coordinator of query execution.
  • Catalog Service: stores HAWQ database object definitions and access control definitions. It is also the key to distributed transactions.
  • High-speed Interconnect: implemented on top of the UDP protocol to facilitate high-performance data exchange among nodes. UDP needs no established connections, so it avoids the connection-count limits that arise when supporting a large number of concurrent queries.

Query execution flow

Figure 3. Query execution flow

When a SQL query is submitted from a client to the master through JDBC/ODBC, the parser parses the statement into a query tree, the planner (optimizer) generates an optimized query plan, and the dispatcher then interacts with the resource manager to acquire resources, splits the query plan into slices, and dispatches the plan to run as a set of parallel QEs. Finally, the result is returned from the master to the client.
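In code form, the master-side flow could be sketched roughly as follows; every component here is a runnable stand-in, not HAWQ's actual API:

    # Hypothetical stand-ins for the master-side components.
    def parse(sql):            return ("QUERY_TREE", sql)
    def optimize(tree):        return ("PLAN", tree)
    def split_into_slices(p):  return [("SLICE", i, p) for i in range(4)]

    class ResourceManager:
        def acquire(self, plan):  return ["vseg-%d" % i for i in range(4)]
        def release(self, rs):    pass

    def dispatch(slices, resources):
        # Pair each plan slice with a virtual segment and "run" it.
        return [f"rows from {s[0]}-{s[1]} on {r}"
                for s, r in zip(slices, resources)]

    def run_query(sql):
        """Sketch of Figure 3: parse -> plan -> acquire -> dispatch -> return."""
        tree = parse(sql)                       # Parser: SQL text -> query tree
        plan = optimize(tree)                   # Optimizer: tree -> best plan
        rm = ResourceManager()
        resources = rm.acquire(plan)            # virtual segments from the RM
        try:
            slices = split_into_slices(plan)    # plan is cut into slices
            return dispatch(slices, resources)  # slices run as parallel QEs
        finally:
            rm.release(resources)               # resources go back to the RM

    print(run_query("SELECT count(*) FROM t"))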

Elastic execution engine

The elastic execution engine has several key design points: complete separation of storage and compute, stateless segments, and resource usage. The separation of storage and compute allows HAWQ to dynamically start any number of virtual segments to execute a query. Stateless segments make dynamic expansion and shrinking possible. Resource usage covers how to estimate a query's cost and thus the resources it needs, how to use those resources effectively, and how to make data locality optimal. A rough sketch of the cost-to-parallelism decision follows.
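One way to picture the "how many virtual segments" decision is a simple cost-driven rule; the function name, constants, and units below are assumptions for illustration only:

    def choose_vseg_count(estimated_cost, cost_per_vseg, available_vsegs,
                          min_vsegs=1):
        """Pick a degree of parallelism from query cost and available resources.

        estimated_cost:   optimizer's cost estimate (illustrative units)
        cost_per_vseg:    how much cost one virtual segment should handle
        available_vsegs:  virtual segments the resource manager can grant now
        """
        wanted = max(min_vsegs, -(-estimated_cost // cost_per_vseg))  # ceil
        return min(wanted, available_vsegs)   # never exceed the cluster's grant

    # A cheap query gets few vsegs; an expensive one scales out to the limit.
    print(choose_vseg_count(500,    cost_per_vseg=1000, available_vsegs=64))  # 1
    print(choose_vseg_count(120000, cost_per_vseg=1000, available_vsegs=64))  # 64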

High-speed interconnect

The purpose of introducing a high-speed interconnect is to efficiently exchange data among nodes. The HAWQ interconnect is implemented on top of the UDP protocol, because TCP runs into the system limit on the number of connections, which severely limits concurrency.

Figure 4. High-speed interconnect

As shown in Figure 4, the query executor processes form a pipeline for data exchange. Assume there are 1000 processes on each node; since these processes have to interact with each other, there may be millions of connections on each node, and there is no way to support that many connections efficiently with TCP. This is why a UDP-based interconnect was developed. With UDP, however, the operating system does not guarantee reliability or delivery order at the receiving side, so the HAWQ interconnect must ensure the following properties (a toy illustration follows the list):

  • Reliability: packets are automatically retransmitted when data loss is detected.
  • Order: packets are delivered to the receiver in their original sending sequence.
  • Flow control: the interconnect regulates the sending speed to avoid overwhelming the corresponding receivers, so HAWQ handles network congestion gracefully without a large performance penalty.
  • High performance and high scalability: the primary goals of the interconnect.
  • Portability: support for multiple platforms.
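The following toy sketch illustrates the first three properties (sequence numbers for ordering, cumulative acknowledgements with retransmission for reliability, and a bounded send window for flow control); it is a single-process simulation with invented names, not the real interconnect:

    import random

    WINDOW = 4        # flow control: at most WINDOW unacknowledged packets
    LOSS_RATE = 0.3   # simulated probability that the "network" drops a packet

    class Receiver:
        """Reassembles packets in sending order using sequence numbers."""
        def __init__(self):
            self.expected = 0     # next sequence number to deliver
            self.buffer = {}      # out-of-order packets parked until their turn
            self.delivered = []

        def on_packet(self, seq, payload):
            if seq < self.expected:              # duplicate of delivered data
                return self.expected
            self.buffer[seq] = payload
            while self.expected in self.buffer:  # deliver in order
                self.delivered.append(self.buffer.pop(self.expected))
                self.expected += 1
            return self.expected                 # cumulative acknowledgement

    def send_reliably(payloads, rx):
        acked = 0                                # everything below is ACKed
        while acked < len(payloads):
            # Send every packet the window allows; lost ones are retried on
            # the next round, playing the role of a retransmission timeout.
            for seq in range(acked, min(acked + WINDOW, len(payloads))):
                if random.random() >= LOSS_RATE:      # packet survives
                    acked = max(acked, rx.on_packet(seq, payloads[seq]))
        return rx.delivered

    random.seed(7)
    batches = [f"tuple-batch-{i}" for i in range(10)]
    assert send_reliably(batches, Receiver()) == batches  # reliable and ordered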

Transaction management

Transactions are an important database feature. Most SQL-on-Hadoop engines do not support transactional consistency, which forces software developers to implement data consistency logic themselves, a notoriously error-prone burden.

HAWQ supports snapshot isolation. Because the HAWQ cluster has only stateless segment nodes, all transactions are implemented on the master side. Internally, this uses a lane model: different concurrent insert operations use independent lanes, which do not conflict with each other. The logical file lengths are maintained in the catalog service to guarantee consistency; in case of transaction rollback, the garbage data at the end of the data files is truncated.
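A minimal sketch of the lane model, with invented names: each concurrent insert appends to its own lane, commit publishes the new logical length to the catalog, and rollback truncates the uncommitted tail:

    class Lane:
        """One append-only lane: concurrent inserts use distinct lanes."""
        def __init__(self):
            self.data = bytearray()    # stands in for an HDFS file
            self.committed_len = 0     # logical EOF recorded in the catalog

        def begin_insert(self, payload):
            self.data.extend(payload)  # append past the logical EOF

        def commit(self):
            self.committed_len = len(self.data)   # catalog exposes new rows

        def rollback(self):
            # Truncate the garbage tail; this is what HDFS truncate enables.
            del self.data[self.committed_len:]

        def visible(self):
            # Readers only see bytes up to the committed logical length.
            return bytes(self.data[:self.committed_len])

    lane_a, lane_b = Lane(), Lane()    # two concurrent, non-conflicting inserts
    lane_a.begin_insert(b"rows-from-txn-1")
    lane_b.begin_insert(b"rows-from-txn-2")
    lane_a.commit()                    # txn 1 commits: its rows become visible
    lane_b.rollback()                  # txn 2 aborts: its tail is truncated
    assert lane_a.visible() == b"rows-from-txn-1"
    assert lane_b.visible() == b""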

In fact, the file truncate feature in HDFS was originally proposed and driven by the HAWQ project.

Resource manager

HAWQ supports resource management at three levels:

  • Global resource management: HAWQ can be configured to use a global resource manager, for example YARN. The global resource manager can dynamically control the total resources consumable by HAWQ, which makes it possible for HAWQ to coexist with other applications in the same physical cluster.
  • HAWQ internal resource management: HAWQ introduces resource queues that let users define resource consumption policies and guide resource allocation at the query level. Users can define and modify resource queues with DDL statements to adjust resource consumption behavior (a toy model follows this list).
  • Operator-level resource management: HAWQ automatically assigns exact resource limits to plan operators, so HAWQ executors on the segment side can enforce the resource consumption of each query operator.
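In HAWQ, resource queues are defined and altered through SQL DDL; the sketch below merely models the queue semantics in miniature, with invented names, sizes, and percentages, to show how a query's allocation is bounded by its queue:

    class ResourceQueue:
        """Hypothetical model of a resource queue's memory accounting."""
        def __init__(self, name, cluster_memory_mb, memory_limit_pct):
            self.name = name
            self.capacity_mb = cluster_memory_mb * memory_limit_pct // 100
            self.used_mb = 0

        def try_allocate(self, query_id, request_mb):
            # Admission control: a query waits if its queue is at its limit.
            if self.used_mb + request_mb > self.capacity_mb:
                print(f"{query_id}: queued, {self.name} is at its limit")
                return False
            self.used_mb += request_mb
            print(f"{query_id}: granted {request_mb} MB from {self.name}")
            return True

        def release(self, request_mb):
            self.used_mb -= request_mb

    # Two queues splitting a 100 GB cluster between ETL and ad-hoc users.
    etl   = ResourceQueue("etl_queue",   102400, memory_limit_pct=60)
    adhoc = ResourceQueue("adhoc_queue", 102400, memory_limit_pct=40)
    etl.try_allocate("q1", 40960)   # granted
    etl.try_allocate("q2", 30720)   # queued: would exceed etl_queue's 60% share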

Figure 5 below shows some details of the HAWQ resource manager:

Figure 5. HAWQ resource manager design

Storage module

Internally, HAWQ supports a variety of optimized storage formats, such as AO (Append-Only) and Parquet. HAWQ also provides a MapReduce InputFormat to support MapReduce jobs. In addition, HAWQ supports pluggable external storage, which lets HAWQ access external data efficiently.
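To make "pluggable" concrete, the sketch below shows one possible shape for a format plugin; the interface is an invented illustration, not HAWQ's actual extension API:

    from abc import ABC, abstractmethod

    class FormatPlugin(ABC):
        """Hypothetical contract a storage format would implement."""
        @abstractmethod
        def scan(self, split):
            """Yield tuples from one split of the underlying storage."""

        @abstractmethod
        def insert(self, tuples):
            """Append tuples in this format's on-disk layout."""

    class Split:
        def __init__(self, start, end):
            self.start, self.end = start, end

    class CsvExternalFormat(FormatPlugin):
        """Toy plugin: comma-separated rows in an in-memory 'file'."""
        def __init__(self):
            self.blob = []

        def scan(self, split):
            for line in self.blob[split.start:split.end]:
                yield tuple(line.split(","))

        def insert(self, tuples):
            self.blob.extend(",".join(map(str, t)) for t in tuples)

    fmt = CsvExternalFormat()
    fmt.insert([(1, "a"), (2, "b")])
    print(list(fmt.scan(Split(0, 2))))   # [('1', 'a'), ('2', 'b')]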