CREATE ANALYTICAL VIEW; Sequence-Based

To optimize the performance of queries on sequence-based table structures, such as Kudu, you must create a sequence-based analytical view.

Storing analytical view in Kudu necessitates addition of some SQL clauses to the CREATE ANALYTICAL VIEW statement as the analytical view needs to satisfy Kudu requirement that every Kudu table must have a primary key.

We recommend that you use partitioned base tables. Even though base table partitions generally do not affect how you optimally structure your analytical views, having partitioned tables enables partition pruning of queries routed to the analytical view, and improves performance. When building analytical views with highly filtering predicates, we recommend that you partition analytical views on these columns to enable partition pruning.

Developer Note. Before creating analytical views, consider the restrictions on Impala Identifiers.

This topic contains the following sections:

Syntax

The CREATE ANALYTICAL VIEW command for Kudu has the following syntax. Note that PRIMARY KEY is a necessary component of Kudu analytical views, as is the STORED AS KUDU statement.

CREATE ANALYTICAL VIEW [IF NOT EXISTS] aview_name [COMMENT]
PRIMARY KEY (column_name[, ...])
[PARTITION BY kudu_partition_clause]
[SEQUENCE COLUMN sequence_column_name
  [STEP SIZE step_size]]
STORED AS KUDU
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
SELECT_STMT

Parameters

aview_name

The name of the new analytical view

column_name
The name of the column that specifies the primary key.
kudu_partition_clause

The definition of the partition:

kudu_partition_clause ::= [ hash_clause [, ...]] [, range_clause ]
hash_clause

The definition of the hashing clause used in partitioning:

hash_clause ::= HASH [ (pk_col [, ...]) ] PARTITIONS n
range_clause

The definition of the range clause used in partitioning:

range_clause ::= RANGE [ (pk_col [, ...]) ]
sequence_column_name
[Optional] Name of an increasing column in the base table; this column must be immutable. This column typically holds the information regarding an event,insertion of a new row in a table, and so on. It is measured as microseconds after the Unix epoch. This parameter is used for incremental refresh of the analytical view. If this column is not specified, Arcadia Engine refreshes the entire analytical view every time it executes a refresh.
step_size

[Optional] Determines how many unique sequence ids to use in internal range partitioning, indicating sequence ranges. By default, this parameter is equivalent to the number of microseconds in a day, 86,400,000,000.

Notes

Primary Key

The PRIMARY KEY clause is mandatory for sequence-based analytical views. The primary key must consist of simple columns, not expressions.

  • If an analytical view does not perform any grouping, you may use the base table's primary key columns as analytical view's primary key.
  • If an analytical view performs grouping, you may choose the analytical view's 'group by columns' as the primary key for the analytical view, because it is unique.
Stored as Kudu

The STORED AS KUDU clause is mandatory for sequence-based analytical views.

Partitioning

Partitioning has a number of considerations:

  • Partition key columns must be a subset of the primary key columns.
  • Primary key columns must be the first columns in the column order of the base table definition.
  • A table may be defined using nested partitioning, and an arbitrary number of hash partitions can be nested. These hash partitions may consists of multiple columns.
  • You can partition by range, and these range partitions can consist of multiple columns, but they cannot be nested.
  • You can define range partitions within hash partitions. However, hash partitions cannot be nested within range partitions.

To learn more about table partitioning in Kudu and any of their restrictions, see Apache Kudu Schema Design, Partitioning.

Partition by Hash

The PARTITION BY HASH clause is optional. The user may choose to hash partition the analytical view on a set of columns and may also specify the number of buckets to use. If this clause is not specified, the analytical view is not hash-partitioned.

Any predicates specified on the hash partition columns can be used for partition pruning. Hash partitioning is very useful when a column with high cardinality is used in predicates, so the query execution benefits from partition pruning using that column. Partitioning the table by hash enables partition pruning without creating a large number of partitions. See Examples.

Partition by Range

The PARTITION BY RANGE clause is optional. The user may choose to range partition the analytical view (this may be in addition to any hash partitioning). Any predicates specified on range partition columns can be used for partition pruning. See Examples.

Sequence Column

The SEQUENCE COLUMN clause is optional. The user may choose to specify the name of a numeric column, the value of which is increasing over time. This column must be immutable. This column is used for incremental refresh of analytical view. If this clause is not specified, the analytical view will undergo a full refresh every time the 'refresh analytical view' command is issued. See Examples.

Sequence Tracking

When we issue the DESCRIBE aview_name command, the system displays a new STRING-type column based on the seq_id column, in addition to other user-specified columns. Arcadia Engine uses the sequence column internally to create ranges. The internal column is step-size wide, and aligns along the step-size boundaries, with the first range starting at 0.

Arcadia Engine must read the data added to the base table after the previous refresh to refresh the analytical view incrementally. Depending on the sequence id values in the base table, Arcadia Engine constructs the sequence ranges: if the sequence ids range from 53 to 67 and step size is set to 10, it creates two sequence ranges, 50-59 and 60-69.

Arcadia Engine tracks the maximum sequence number it encounters in a refresh, max_seq_id, to identify the starting point of new data in the table. Note that all rows >= max_seq_id.

Arcadia Engine runs the following query to identify the end-point of new data in the table, which is important because it guarantees integrity of the data in the analytical view in cases when new data is inserted into the table at the same time as a REFRESH command reads the data from the same table.

SELECT max(seq_id) FROM table

Examples

Simple Sequential-Based Analytical View

Here is an example usage of create analytical view ... stored as kudu.

CREATE ANALYTICAL VIEW sales_aview_simple
  PRIMARY KEY (custid, sale_year, sale_month, sale_day)
  PARTITION BY HASH(custid) PARTITIONS 10 
    RANGE(sale_year, sale_month, sale_day) 
  SEQUENCE COLUMN seq_id
 STORED AS KUDU
  AS (SELECT custid, sale_year, sale_month, sale_day, sum(amt)
       FROM sales
       GROUP BY 1, 2, 3, 4);
Analytical View with Nested Hash Partitioning

This example shows an analytical view that hash partitions the data based on custid, and further hashes it based on region. Each partition is further partitioned using RANGE partitions based on year, month and day of sale.

CREATE ANALYTICAL VIEW sales_aview_nested
  PRIMARY KEY (custid, region, sale_year, sale_month, sale_day)
  PARTITION BY HASH(custid) PARTITIONS 10 
    HASH(region) PARTITIONS 4
    RANGE(sale_year, sale_month, sale_day) 
  SEQUENCE COLUMN seq_id
 STORED AS KUDU
  AS (SELECT custid, sale_year, sale_month, sale_day, sum(amt)
       FROM sales
       GROUP BY 1, 2, 3, 4);