Here are some best practice guidelines for creating and using analytical views:
Partitions of the analytical view do not have to match the partitions of the underlying base table. However, when refreshing the analytical views, having the same partitions significantly reduces computational time and memory usage in many scenarios.
Consider the following case:
default.events
.
CREATE EXTERNAL TABLE default.events (
event_id STRING,
app_id STRING,
time TIMESTAMP,
user_id STRING,
device_id STRING,
platform STRING,
sdk_version STRING,
version_code INT
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
location '/warehouse/events';
events_aview1
is not partitioned. Therefore, it cannot be
refreshed incrementally. Any changes to the base table data results in a full refresh of
the analytical view, and this recomputes the entire associated query. For very large
base tables, this operation is computationally intensive.
CREATE ANALYTICAL VIEW events_aview1
STORED AS PARQUET
AS (SELECT count(*), platform, sdk_version
FROM events
GROUP BY 2, 3);
The analytical view events_aview2
partitions the data along the same dimensions as the base table, and in the same order. This facilitates incremental refresh of the analytical view. Only the partitions of the analytical view that correspond to changed partitions of the base table must be recomputed.
CREATE ANALYTICAL VIEW events_aview2
PARTITIONED BY (year, month, day)
STORED AS PARQUET
AS (SELECT count(*), platform, sdk_version,
year, month, day
FROM events
GROUP BY 2, 3, 4, 5, 6);
The analytical view events_aview3
is based on a logical view
lv1
that accesses a single table. It partitions the data along the
same dimensions as the base table. This facilitates incremental refresh of the
analytical view. When the data in the base table changes, only the partitions of the
analytical view that correspond to changed partitions must be recomputed.
CREATE VIEW lvl as select platform, sdk_version, year,
month, day from events where year=2016 and month=2;
CREATE ANALYTICAL VIEW events_aview3
PARTITIONED BY (year, month, day)
STORED AS PARQUET
AS (SELECT count(*), platform, sdk_version,
year, month, day
FROM lv1
GROUP BY 2, 3, 4, 5, 6);
Note that using a self-join in the logical view constitutes multiple table references, and cannot benefit from incremental refresh.
Add the sub-components of calculations into the analytical view. The matching logic correctly identifies the various components, and combines them into what is defined by the user's query. For example:
sum(a)
and sum(b)
.select sum(a)/sum(b)
, the matching algorithm uses the sum(a)
and sum(b)
from the analytical view.Analytical Views improve performance because they leverage inherent reduction in cardinality as a result of grouping and aggregation. For this reason, ensure that the grouping columns have a reasonable reduction in cardinality over the base tables.
It is best to avoid using float or decimal data type columns with large number of distinct values in the GROUP BY
clause.
A full refresh of analytical views is a resource-intensive operation. Arcadia Engine mitigates the performance issues through incremental refresh of analytical views.
The incremental refresh of analytical views follows these steps:
You can optimize the refresh performance by combining the various INSERT
statements (Step 2) for multiple partitions. The following three parameters (MAX_PARTITIONS_PER_REFRESH_INSERT
, NUMBER_OF_INSERTS_FOR_REFRESH
, and REFRESH_RESTARTABILITY
) combine to determine how to perform the refresh.
The query option MAX_PARTITIONS_PER_REFRESH_INSERT
, which may be specified
in connection settings (see Parameter Options),
controls the number of partitions that refresh in a single INSERT
statement;
our default is 20
partitions.
Note that combining too many partitions in a single
INSERT
may make the operations so large that it runs out of memory.
Combining several smaller INSERT
operations typically improves overall
performance and optimize access to the metastore. We suggest that you experiment with the
value of the MAX_PARTITIONS_PER_REFRESH_INSERT
option to find the best
value for your configuration.
This parameter specifies how many total inserts should trigger a refresh of an analytical view; our default is 10. Remember that each data insert may effect multiple partitions.
REFRESH_RESTARTABILITY
option
to false
reduces the time spent in saving partial work (such as
ALTER
commands to save some state) at the expense of
restartability.By default, the REFRESH_RESTARTABILITY
setting is
true
.
The default settings for these options work well for most analytical views. For atypical analytical views, they should be altered. Consider the following situations:
When the base table has a large number of small-sized partitions, in the 1,000s,
setting MAX_PARTITIONS_PER_REFRESH_INSERT
to higher number improves the
refresh time. Depending on available memory and size of partitions, set the parameter to
200-300, or even more.
If the statestore in a cluster is loaded close to
capacity, and ddl commands take a long time to complete, we recommend setting the
REFRESH_RESTARTABILITY
option to false
. This makes
refresh operations on analytical views faster at the expense of restartability. If the
analytical view refresh operation fails, the subsequent refresh command runs as if the
previous failed refresh did not perform any work.
Arcadia Engine supports Eager Aggregation rewrites in conjunction with analytical views to significantly improve performance of join queries.
Join queries benefit from eager aggregation rewrite. Any analytical view that matches the aggregated fact table can be used to accelerate an eager-aggregated query. See Eager Aggregation.
Analytical views must be created either as stored as parquet
for HDFS/S3/ADLS
tables, or stored as kudu
.
Note that the two types of analytical views, partition-based and sequence-based, have different capabilities, syntax, and refresh mechanisms.
_finalize
' or '_no_finalize
', which are reserved words in function names in Arcadia Data.Analytical views may not contain analytic functions, also known as 'window' functions. However, queries that include window functions may be routed to analytical views that contain the corresponding simple aggregate functions.
For example, a query that contains the window function sum(sales) over ()
can be routed to an analytical view that has a column sum(sales)
.
Queries may contain window functions, such as moving average, percent of group, percentage of previous, and running total.
Arcadia Engine does not support the use of non-deterministic functions, such as rand()
, user()
, pid()
, current_database()
, version()
, and so on, in analytical views.