To optimize the performance of queries on sequence-based table structures, such as Kudu, you must create a sequence-based analytical view.
Storing analytical view in Kudu necessitates addition of some SQL clauses to the
CREATE ANALYTICAL VIEW
statement as the analytical view needs to satisfy
Kudu requirement that every Kudu table must have a primary key.
We recommend that you use partitioned base tables. Even though base table partitions generally do not affect how you optimally structure your analytical views, having partitioned tables enables partition pruning of queries routed to the analytical view, and improves performance. When building analytical views with highly filtering predicates, we recommend that you partition analytical views on these columns to enable partition pruning.
This topic contains the following sections:
The CREATE ANALYTICAL VIEW
command for Kudu has the following syntax. Note
that PRIMARY KEY
is a necessary component of Kudu analytical views, as is
the STORED AS KUDU
statement.
CREATE ANALYTICAL VIEW [IF NOT EXISTS] aview_name [COMMENT]
PRIMARY KEY (column_name[, ...])
[PARTITION BY kudu_partition_clause]
[SEQUENCE COLUMN sequence_column_name
[STEP SIZE step_size]]
STORED AS KUDU
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
SELECT_STMT
The name of the new analytical view
The definition of the partition:
kudu_partition_clause ::= [ hash_clause [, ...]] [, range_clause ]
The definition of the hashing clause used in partitioning:
hash_clause ::= HASH [ (pk_col [, ...]) ] PARTITIONS n
The definition of the range clause used in partitioning:
range_clause ::= RANGE [ (pk_col [, ...]) ]
[Optional] Determines how many unique sequence ids to use in internal range partitioning, indicating sequence ranges. By default, this parameter is equivalent to the number of microseconds in a day, 86,400,000,000.
The PRIMARY KEY
clause is mandatory for sequence-based analytical
views. The primary key must consist of simple columns, not expressions.
The STORED AS KUDU
clause is mandatory for sequence-based analytical
views.
Partitioning has a number of considerations:
To learn more about table partitioning in Kudu and any of their restrictions, see Apache Kudu Schema Design, Partitioning.
The PARTITION BY HASH
clause is optional. The user may choose to
hash partition the analytical view on a set of columns and may also specify the number
of buckets to use. If this clause is not specified, the analytical view is not
hash-partitioned.
Any predicates specified on the hash partition columns can be used for partition pruning. Hash partitioning is very useful when a column with high cardinality is used in predicates, so the query execution benefits from partition pruning using that column. Partitioning the table by hash enables partition pruning without creating a large number of partitions. See Examples.
The PARTITION BY RANGE
clause is optional. The user may choose to
range partition the analytical view (this may be in addition to any hash
partitioning). Any predicates specified on range partition columns can be used for
partition pruning. See Examples.
The SEQUENCE COLUMN
clause is optional. The user may choose to
specify the name of a numeric column, the value of which is increasing over time. This
column must be immutable. This column is used for incremental refresh of
analytical view. If this clause is not specified, the analytical view will undergo a
full refresh every time the 'refresh analytical view' command is issued. See Examples.
When we issue the DESCRIBE aview_name
command, the system displays a new STRING
-type column based on the seq_id
column, in addition to other user-specified columns. Arcadia Engine uses the sequence column internally to create ranges. The internal column is step-size wide, and aligns along the step-size boundaries, with the first range starting at 0
.
Arcadia Engine must read the data added to the base table after the previous refresh to refresh the analytical view incrementally. Depending on the sequence id values in the base table, Arcadia Engine constructs the sequence ranges: if the sequence ids range from 53
to 67
and step size is set to 10
, it creates two sequence ranges, 50-59
and 60-69
.
Arcadia Engine tracks the maximum sequence number it encounters in a refresh, max_seq_id
, to identify the starting point of new data in the table. Note that all rows >= max_seq_id
.
Arcadia Engine runs the following query to identify the end-point of new data in the table, which is important because it guarantees integrity of the data in the analytical view in cases when new data is inserted into the table at the same time as a REFRESH
command reads the data from the same table.
SELECT max(seq_id) FROM table
Here is an example usage of create analytical view ... stored as
kudu
.
CREATE ANALYTICAL VIEW sales_aview_simple
PRIMARY KEY (custid, sale_year, sale_month, sale_day)
PARTITION BY HASH(custid) PARTITIONS 10
RANGE(sale_year, sale_month, sale_day)
SEQUENCE COLUMN seq_id
STORED AS KUDU
AS (SELECT custid, sale_year, sale_month, sale_day, sum(amt)
FROM sales
GROUP BY 1, 2, 3, 4);
This example shows an analytical view that hash partitions the data based on
custid
, and further hashes it based on region
.
Each partition is further partitioned using RANGE
partitions based on
year, month and day of sale.
CREATE ANALYTICAL VIEW sales_aview_nested
PRIMARY KEY (custid, region, sale_year, sale_month, sale_day)
PARTITION BY HASH(custid) PARTITIONS 10
HASH(region) PARTITIONS 4
RANGE(sale_year, sale_month, sale_day)
SEQUENCE COLUMN seq_id
STORED AS KUDU
AS (SELECT custid, sale_year, sale_month, sale_day, sum(amt)
FROM sales
GROUP BY 1, 2, 3, 4);