Best Practices for Sequence-Based Analytical Views

Here are some best practice guidelines for creating sequence-based analytical views:

Data Consistency for Sequence-Based Analytical Views
Partition Pruning on Sequence-Based Analytical Views
Eager Aggregation and Use of Analytical Views
Support Restrictions for Sequence-Based Analytical Views
Nullability Constraints
Including Query Sub-Components in the Analytical View
Cardinality Considerations

Data Consistency for Sequence-Based Analytical Views

We engineered sequence-based Analytical Views in Arcadia Data to robustly manage data consistency issues.

It is essential that the analytical views maintain an accurate and consistent view of the contents of the base table.

Kudu tables store data in partitioned sections that contain a sequence column range. Consider the following example with partitioning based on a 24 hour period: aggregated data for "Jan 3, 2017" is in one partition, and aggregated data for "Jan 4, 2017" is in another partition.

Time Lag During in Delete/Insert
It can become challenging to maintain data consistency when changes to the data in the base table fall into the range of an existing partition. To continue our example, let us consider that one data row is added to the base table, and it falls into the "Jan 3, 2017" partition. Because this new record is unknown to the analytical view, the entire aggregate data from the base table for "Jan 3, 2017" must be recalculated. Arcadia Engine deletes the current aggregate data from the "Jan 3, 2017" partition, recalculates it, and inserts fresh data into the analytical view. Because Kudu does not support ACID transactions, there is a lag during this data update. Between deletion and insertion, the analytical view partition for the time period contains no data. If a query were to route to the analytical view at this time, it would return incorrect data.
Using Time Snapshots
Kudu allows queries to view data at a specific time snapshot, as it describes in greater detail in Transaction Semantics in Apache Kudu. Arcadia Engine's analytical views take advantage of this feature by using the snapshot of the time when the data is known to be consistent to return query results to the user. The data from that time can be safely used when there is no refresh in progress, and the previous analytical view refresh operation was successful. When running a refresh on an analytical view, Arcadia Engine records the snapshot time before any delete/insert operations can execute. After each delete/insert completes, Arcadia Engine updates the snapshot time so that the new data for the given range can be viewed. In summary, if a query routes to the analytical view during a refresh on a relevant sequence range, it then returns data from the snapshot time immediately before the delete. This mechanism ensures that the user does not see inconsistent data when it is in transition.
Managing Snapshot History
Kudu maintains snapshot history for a limited time, specified by the Kudu server parameter TABLET_HISTORY_MAX_AGE_SECS. The Arcadia Engine startup option KUDU_TABLET_HISTORY_MAX_AGE_SECS should be set to the same value. Both TABLET_HISTORY_MAX_AGE_SECS and KUDU_TABLET_HISTORY_MAX_AGE_SECS have a default value of 15 minutes, and you should modify these settings to be minimally equal to the longest delete/insert operation for any sequence range. If the snapshot history expires too soon (because these parameters are set too low), the analytical view is marked UNUSABLE until the delete/insert operation finishes.

We also use the snapshot history parameter when a refresh is cancelled, or if it fails. After an unsuccessful run, the next refresh can use the most recently recorded Kudu snapshot time provided that the new run covers the same range as the unsuccessful refresh. This new refresh fixes any inconsistent data in the analytical view. During this refresh, while correcting the data consistency, Arcadia Engine marks the analytical view as UNUSABLE, so queries are not able to route to it.

If the data has not been refreshed after a certain amount of time, Arcadia Engine marks an analytical view as STALE. The default for this time period is 24 hours. This STALENESS_PERIOD query option can be altered to change this behavior. A STALENESS_PERIOD of -1 prevents Kudu analytical views from being treated stale. Note that this query option is set at the level of the session, and applies to all Kudu analytical views running in the cluster. See Staleness Period.
User Action
Arcadia Engine's data consistency management is transparent to the user. You only have to ensure that Kudu's TABLET_HISTORY_MAX_AGE_SECS and Arcadia Engine's corresponding KUDU_TABLET_HISTORY_MAX_AGE_SECS parameter values are identical, and that this value is greater than the time of the slowest delete-insert of the Kudu analytical view partition.

Partition Pruning on Sequence-Based Analytical Views

Partitioning analytical views confers performance advantages because it supports partition pruning of the analytical view.

When Arcadia Engine runs with analytical views enabled, all queries pass through a matching and routing process.

If a query is fully or partially matches an existing analytical view (or views), it is routed to these analytical views.
While running the analytical view(s), Arcadia Engine prunes partitions if applicable predicates are present, so it never reads the data that does not apply to the query.

Eager Aggregation and Use of Analytical Views

Arcadia Engine supports Eager Aggregation rewrites in conjunction with analytical views to significantly improve performance of join queries.

Join queries benefit from eager aggregation rewrite. Any analytical view that matches the aggregated fact table can be used to accelerate an eager-aggregated query. See Eager Aggregation.

Support Restrictions for Sequence-Based Analytical Views

Analytical views must be created either as stored as parquet for HDFS/S3/ADLS tables, or stored as kudu.

Note that the two types of analytical views, partition-based and sequence-based, have different capabilities, syntax, and refresh mechanisms.
A user-defined function name cannot end in '_finalize' or '_no_finalize', which are reserved words in function names in Arcadia Data.
Analytical views may not contain analytic functions, also known as 'window' functions. However, queries that include window functions may be routed to analytical views that contain the corresponding simple aggregate functions.

For example, a query that contains the window function sum(sales) over () can be routed to an analytical view that has a column sum(sales).

Queries may contain window functions, such as moving average, percent of group, percentage of previous, and running total.
Arcadia Engine does not support the use of non-deterministic functions, such as rand(), user(), pid(), current_database(), version(), and so on, in analytical views.

Nullability Constraints

The base table columns used in analytical views for primary key, hash partitioning, range partitioning, and sequencing must not be null.
In rare cases, an analytical view may be range-partitioned on an expression that introduces null values, such as the following:
```
if c1 < 0 then null else c1 
```
In this case, the null constraint is violated, and analytical views drop the rows that contain null values.

Including Query Sub-Components in the Analytical View

Add the sub-components of calculations into the analytical view. The matching logic correctly identifies the various components, and combines them into what is defined by the user's query. For example:

Create separate columns for sum(a) and sum(b).
When the user runs a query of the form select sum(a)/sum(b), the matching algorithm uses the sum(a) and sum(b) from the analytical view.

Cardinality Considerations

Analytical Views improve performance because they leverage inherent reduction in cardinality as a result of grouping and aggregation. For this reason, ensure that the grouping columns have a reasonable reduction in cardinality over the base tables.

It is best to avoid using float or decimal data type columns with large number of distinct values in the GROUP BY clause.