Best Practices for Partition-Based Analytical Views

Here are some best practice guidelines for creating and using analytical views:

Partitioning Analytical Views
Including Query Sub-Components in the Analytical View
Cardinality Considerations
Incremental Refresh of Analytical Views
Eager Aggregation and Use of Analytical Views
Support Restrictions

Partitioning Analytical Views

Partitioning analytical views confers similar performance advantages as partitioning base tables, because it supports partition pruning and incremental refresh of the analytical view.

Partitions of the analytical view do not have to match the partitions of the underlying base table. However, when refreshing the analytical views, having the same partitions significantly reduces computational time and memory usage in many scenarios.

Consider the following case:

This is the definition of base table default.events.

CREATE EXTERNAL TABLE default.events (
        event_id STRING,
        app_id STRING,
        time TIMESTAMP,
        user_id STRING,
        device_id STRING,
        platform STRING,
        sdk_version STRING,
        version_code INT
      )
      PARTITIONED BY (
        year INT,
        month INT,
        day INT
      )

location '/warehouse/events';

The analytical view events_aview1 is not partitioned. Therefore, it cannot be refreshed incrementally. Any changes to the base table data results in a full refresh of the analytical view, and this recomputes the entire associated query. For very large base tables, this operation is computationally intensive.
```
CREATE ANALYTICAL VIEW events_aview1
                       STORED AS PARQUET
                       AS (SELECT   count(*), platform, sdk_version
                       FROM     events
                       GROUP BY 2, 3);
```

The analytical view events_aview2 partitions the data along the same dimensions as the base table, and in the same order. This facilitates incremental refresh of the analytical view. Only the partitions of the analytical view that correspond to changed partitions of the base table must be recomputed.

CREATE ANALYTICAL VIEW events_aview2
                             PARTITIONED BY (year, month, day)
                             STORED AS PARQUET
                             AS (SELECT   count(*), platform, sdk_version,
                             year, month, day
                             FROM     events
                             GROUP BY 2, 3, 4, 5, 6);

The analytical view events_aview3 is based on a logical view lv1 that accesses a single table. It partitions the data along the same dimensions as the base table. This facilitates incremental refresh of the analytical view. When the data in the base table changes, only the partitions of the analytical view that correspond to changed partitions must be recomputed.

CREATE VIEW lvl as select platform, sdk_version, year, 
                          month, day from events where year=2016 and month=2;
                          
                          CREATE ANALYTICAL VIEW events_aview3
                          PARTITIONED BY (year, month, day)
                          STORED AS PARQUET
                          AS (SELECT   count(*), platform, sdk_version,
                          year, month, day
                          FROM     lv1
                          GROUP BY 2, 3, 4, 5, 6);

Note that using a self-join in the logical view constitutes multiple table references, and cannot benefit from incremental refresh.

Including Query Sub-Components in the Analytical View

Add the sub-components of calculations into the analytical view. The matching logic correctly identifies the various components, and combines them into what is defined by the user's query. For example:

Create separate columns for sum(a) and sum(b).
When the user runs a query of the form select sum(a)/sum(b), the matching algorithm uses the sum(a) and sum(b) from the analytical view.

Cardinality Considerations

Analytical Views improve performance because they leverage inherent reduction in cardinality as a result of grouping and aggregation. For this reason, ensure that the grouping columns have a reasonable reduction in cardinality over the base tables.

It is best to avoid using float or decimal data type columns with large number of distinct values in the GROUP BY clause.

Incremental Refresh of Analytical Views

A full refresh of analytical views is a resource-intensive operation. Arcadia Engine mitigates the performance issues through incremental refresh of analytical views.

The incremental refresh of analytical views follows these steps:

Set metadata to specify last attempted refresh time.
Insert data into partitions.
Set metadata to track information regarding files in the base table, which is necessary to establish analytical view 'freshness'.
Repeat the previous steps for each partition that is being updated.
Compute statistics on the analytical view.

You can optimize the refresh performance by combining the various INSERT statements (Step 2) for multiple partitions. The following three parameters (MAX_PARTITIONS_PER_REFRESH_INSERT, NUMBER_OF_INSERTS_FOR_REFRESH, and REFRESH_RESTARTABILITY) combine to determine how to perform the refresh.

MAX_PARTITIONS_PER_REFRESH_INSERT

The query option MAX_PARTITIONS_PER_REFRESH_INSERT, which may be specified in connection settings (see Parameter Options), controls the number of partitions that refresh in a single INSERT statement; our default is 20 partitions.

Note that combining too many partitions in a single INSERT may make the operations so large that it runs out of memory. Combining several smaller INSERT operations typically improves overall performance and optimize access to the metastore. We suggest that you experiment with the value of the MAX_PARTITIONS_PER_REFRESH_INSERT option to find the best value for your configuration.

NUMBER_OF_INSERTS_FOR_REFRESH

This parameter specifies how many total inserts should trigger a refresh of an analytical view; our default is 10. Remember that each data insert may effect multiple partitions.

REFRESH_RESTARTABILITY

If the refresh operation on an analytical view fails, the following call to refresh the analytical view picks up the work where previous call ended, so the partially completed work still applies. Setting the REFRESH_RESTARTABILITY option to false reduces the time spent in saving partial work (such as ALTER commands to save some state) at the expense of restartability.

By default, the REFRESH_RESTARTABILITY setting is true.

The default settings for these options work well for most analytical views. For atypical analytical views, they should be altered. Consider the following situations:

Large number of small partitions in base table
When the base table has a large number of small-sized partitions, in the 1,000s, setting MAX_PARTITIONS_PER_REFRESH_INSERT to higher number improves the refresh time. Depending on available memory and size of partitions, set the parameter to 200-300, or even more.
Overloaded statestore
If the statestore in a cluster is loaded close to capacity, and ddl commands take a long time to complete, we recommend setting the REFRESH_RESTARTABILITY option to false. This makes refresh operations on analytical views faster at the expense of restartability. If the analytical view refresh operation fails, the subsequent refresh command runs as if the previous failed refresh did not perform any work.

Eager Aggregation and Use of Analytical Views

Arcadia Engine supports Eager Aggregation rewrites in conjunction with analytical views to significantly improve performance of join queries.

Join queries benefit from eager aggregation rewrite. Any analytical view that matches the aggregated fact table can be used to accelerate an eager-aggregated query. See Eager Aggregation.

Support Restrictions

Analytical views must be created either as stored as parquet for HDFS/S3/ADLS tables, or stored as kudu.

Note that the two types of analytical views, partition-based and sequence-based, have different capabilities, syntax, and refresh mechanisms.
A user-defined function name cannot end in '_finalize' or '_no_finalize', which are reserved words in function names in Arcadia Data.
Analytical views may not contain analytic functions, also known as 'window' functions. However, queries that include window functions may be routed to analytical views that contain the corresponding simple aggregate functions.

For example, a query that contains the window function sum(sales) over () can be routed to an analytical view that has a column sum(sales).

Queries may contain window functions, such as moving average, percent of group, percentage of previous, and running total.
Arcadia Engine does not support the use of non-deterministic functions, such as rand(), user(), pid(), current_database(), version(), and so on, in analytical views.
Analytical views based on tables cannot contain queries with joins, unions, and sub-selects. In order to use these SQL features, use analytical views based on logical views, which encapsulate the calculations that involve joins, unions, sub-selects, and so on. Analytical views based on tables can be used to substitute parts of a more complex join query, or a query with sub-selects.