Partitions of the analytical view do not have to match the partitions of the underlying base table. However, when refreshing the analytical views, having the same partitions significantly reduces computational time and memory usage in many scenarios.
Consider the following case:
default.events
.
CREATE EXTERNAL TABLE default.events (
event_id STRING,
app_id STRING,
time TIMESTAMP,
user_id STRING,
device_id STRING,
platform STRING,
sdk_version STRING,
version_code INT
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
location '/warehouse/sales/events';
events_aview1
is not partitioned. Therefore, it cannot be refreshed incrementally. Any changes to the base table data results in a full refresh of the analytical view, and this recomputes the entire associated query. For very large base tables, it may cause serious problems, such as 'out of memory' errors.
create analytical view events_aview1 stored as parquet as (select count(*), platform, sdk_version from events group by 2, 3);
The analytical view events_aview2
partitions the data along the same dimensions as the base table, and in the same order. This facilitates incremental refresh of the analytical view. Only the partitions of the analytical view that correspond to changed partitions of the base table must be recomputed.
create analytical view events_aview2
partitioned by (year, month, day)
stored as parquet
as (select count(*), platform, sdk_version,
year, month, day
from events
group by 2, 3, 4, 5, 6);
The analytical view events_aview3
is based on a logical view
lv1
that accesses a single table. It partitions the data along the
same dimensions as the base table. This facilitates incremental refresh of the
analytical view. When the data in the base table changes, only the partitions of the
analytical view that correspond to changed partitions must be recomputed.
Note that using a self-join in the logical view constitutes multiple table references, and cannot benefit from incremental refresh.
create view lv1 as select platform, sdk_version, year,
month, day from events where year=2016 and month=2;
create analytical view events_aview3
partitioned by (year, month, day)
stored as parquet
as (select count(*), platform, sdk_version,
year, month, day
from lv1
group by 2, 3, 4, 5, 6);