Here, we provide some typical examples of creating and using analytical views:
When we include a sequence column in the definition of an analytical view, it ensures that we can refresh that analytical view incrementally. For an in-depth description of this mechanism, see Sequence Tracking.
The following scenario describes how to use analytical views with a defined sequence column:
The following analytical view, high_sales_av
, shows sales that
are greater than 50,000. Because we specified a sequence column, Arcadia Engine
can refresh the analytical view with new data,
incrementally.
CREATE ANALYTICAL VIEW high_sales_av
PRIMARY KEY (seq_id)
SEQUENCE COLUMN (seq_id)
STORED AS KUDU
AS SELECT seq_id, custid, amt from sales where amt > 50000;
Assuming current max_seq_id
is 19
and the new
data in the table is identified to be between sequence ids 20
and
29
, ArcEngine performs the following operations to refresh the
analytical view. In this example, the highest known sequence range is
20-29
, the two numbers indicating the starting and ending
points of a range. With each REFRESH ANALYTICAL VIEW
statement,
Arcadia Engine reevaluates the highest existing sequence range (if needed) and
creates any necessary new sequence
ranges.
ALTER TABLE high_sales_av ADD RANGE PARTITION VALUE = "20-29";
INSERT INTO high_sales SELECT seq_id, "20-29", custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 20 and 29;
Let us assume that the user appended data to the base table, and that Arcadia
Engine correspondingly refreshed the analytical view. The analytical view now has
range partitions that correspond to sequence ranges 0-9
,
10-19
, 20-29
, and 30-39
.
Subsequently, the user modified rows with sequence numbers 19
,
20
, 21
, and 22
. In order to
update the analytical view, the user must then issue the following command:
REFRESH ANALYTICAL VIEW high_sales_av FROM 19 TO 22;
Internally,
Arcadia Engine computes the partitions spanned by the FROM-TO
clause. The affected ranges are 10-19
and 20-29
.
Internally, it executes the REFRESH ANALYTICAL VIEW
statement
through the following DELETE
and INSERT
commands:
DELETE FROM high_sales_av where seq_range = '10-19';
INSERT INTO high_sales SELECT seq_id,"10-19", custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 10 and 19;
DELETE FROM high_sales where seq_range = '20-29';
INSERT INTO high_sales SELECT seq_id,"20-29", custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 20 and 29;
We can define the primary key of an analytical view to be the same as the primary key of the base table. The following scenario describes how to use analytical views with same primary key as the base table. For an in-depth description of the sequencing mechanism, see Sequence Tracking.
The following analytical view, high_sales2
, has the same primary
key column as its base table.
CREATE ANALYTICAL VIEW high_sales2
PRIMARY KEY (seq_id, sale_year, sale_month, sale_day)
SEQUENCE COLUMN (seq_id)
STORED AS KUDU
AS SELECT seq_id, sale_year, sale_month, sale_day, custid, amt from sales where amt > 50000;
The first REFRESH ANALYTICAL VIEW
statement changes the
high_sales2
analytical view in the following manner. Note that
we are adding range partitions to an internal column with details in the previous
section.:
ALTER TABLE high_sales2 ADD RANGE PARTITION VALUE = "0-9";
INSERT INTO high_sales2 SELECT seq_id, "0-9", sale_year, sale_month, sale_day, custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 0 and 9;
Let us assume that the user appended data with sequence ids 9-19
to the base table, and that Arcadia Engine correspondingly refreshed the
analytical view.
Remember that the system expects the sequence id column to
be an increasing column. So, the new ranges of data that may be added to base
table are 9-19
, 10-20
, 12-20
,
but not 8-20
. When the user issues the command to refresh the
analytical view, the previously created range 0-9
must be
recomputed because it overlaps with the previous range. Internally, it executes
the following DELETE
, ALTER
, and
INSERT
commands:
DELETE FROM high_sales2 WHERE seq_range = '0-9';
INSERT INTO high_sales2 SELECT seq_id,"0-9", sale_year,
sale_month, sale_day, custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 0 and 9;
ALTER TABLE high_sales2 ADD RANGE PARTITION VALUE = '10-19';
INSERT INTO high_sales2 SELECT seq_id,"10-19", sale_year,
sale_month, sale_day, custid, amt
FROM sales
WHERE amt > 50000 and seq_id between 10 and 19;
When defining an analytical view with grouping and aggregation, we recommend that you
use the GROUP BY
clause of the analytical view as the primary key.
The following scenario describes how to use analytical views with a primary key based
on the GROUP BY
clause. For an in-depth description of the sequencing
mechanism, see Sequence Tracking.
The following analytical view, daily_sales
, has the same primary key
column as its base table.
CREATE ANALYTICAL VIEW daily_sales
PRIMARY KEY (sale_year, sale_month, sale_day)
SEQUENCE COLUMN (seq_id)
PARTITION BY RANGE (sale_year, sale_month, sale_day)
STORED AS KUDU
AS SELECT sale_year, sale_month, sale_day, sum(amt) sum_amt,
count(qty) cnt_qty
FROM sales GROUP BY 1, 2, 3;
The first REFRESH ANALYTICAL VIEW
is similar to previous
examples. Let us assume that the user appended data with sequence ids
0-9
to the base table, and that Arcadia Engine correspondingly
refreshed the analytical view. Internally, it executes the REFRESH
ANALYTICAL VIEW
statement through the following ALTER
and INSERT
commands:
ALTER TABLE daily_sales ADD RANGE PARTITION VALUE = ("0-9", 2018, 08, 01);
INSERT INTO daily_sales SELECT "0-9", sale_year, sale_month,
sale_day, sum(amt), count(qty)
FROM sales
WHERE seq_id BETWEEN 0 AND 9 AND sale_year = 2018 AND month = 08
AND day = 01
GROUP BY 2, 3, 4;
Subsequent
REFRESH
commands add partitions for the sequence ranges
10-19
and 20-29
. The ranges in analytical
view look like the following:
"0-9", 2018, 08, 01
"10-19", 2018, 08, 01
"10-19", 2018, 08, 02
"20-29", 2018, 08, 03
Note that the date value (2018, 08, 01)
repeats in two range
partitions, 0-9
and 10-19
, because the data for
that date spans the two ranges.
Arcadia Engine supports sequence-based analytical views for fact-dimension (star schema) joins, by expediting query execution through eager aggregation.
Imagine a fact table message
that joins with dimension tables
carrier
, account
, and others. Most of the queries are star
schema queries, joining on primary-foreign keys. When we enable eager aggregation rewrites,
most of the queries reduce to a join of aggregated fact table with dimension tables.
This section focuses on the fact table and possible analytical views on the fact table to accelerate the star schema queries. For an in-depth description of the sequencing mechanism, see Sequence Tracking.
Conceptually, the fact table looks as follows:
CREATE TABLE message (acct_id BIGINT, date_initiated BIGINT, xid INT,
status STRING, num_segments INT, price INT,
PRIMARY KEY (acct_id, date_initiated, xid))
PARTITION BY HASH (acct_id) PARTITIONS 10,
RANGE (date_initiated) (PARTITION VALUE = (0))
STORED AS KUDU;
A typical query constrains on acct_id
and
date_initiated
columns. Therefore, the analytical view should
be partitioned so that Arcadia Engine can prune partitions on these constraints.
We define an analytical view for tracking daily message volumes as
follows:
CREATE ANALYTICAL VIEW daily_message_vol
PRIMARY KEY (acct_id, ymd)
SEQUENCE COLUMN (date_initiated)
PARTITION BY HASH (acct_id) PARTITIONS 50
RANGE (ymd)
STORED AS KUDU
AS SELECT acct_id, from_unixtime(date_initiated, "yyyy/MM/dd") ymd, sum(1) sum_msgs,
max(num_segments) max_segs
FROM message
GROUP BY 1, 2;
Note
that the sequence column date_initiated
is not unique, but we
assume that it is increasing.
The REFRESH
command works similarly to the examples we described
in Using Analytical View with Primary Key Identical to the Base Table's Primary Key and Using Analytical View with Grouping and Aggregation. Within the
internally-created ranges on date_initiated_range
column, Arcadia
Engine creates range partitions on ymd
, or year-month-date. So,
if the first REFRESH
command reads
date_initiated_range
values from
20180801123
to 20180803789
,
the system executes the command through the following internal
ALTER
and INSERT
commands:
ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/01");
ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/02");
ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/03");
INSERT INTO daily_message_vol
SELECT acct_id, "20180801123-20180803789", from_unixtime(date_initiated, "yyyy/MM/dd") ymd,
sum(1) sum_msgs, max(num_segments) max_segs
FROM message
WHERE date_initiated between 20180801123 and 20180803789
GROUP BY 1, 3;
Within
the acct_id
hash partitions, the ranges have the following
values:
"20180801123-20180803789", "2018/08/01", sum1, max1
"20180801123-20180803789", "2018/08/02", sum2, max2
"20180801123-20180803789", "2018/08/03", sum3, max3
Let us add data for two more days, 2018/08/03
and
2018/08/04
, before the next REFRESH ANALYTICAL
VIEW
command runs. Also, let us assume that the
date_initiated_range
for the new data is
20180803789
to 20180804567
,
overlapping the previous refresh.
Arcadia Engine determines the range
partitions that it must recompute or create by running the following query. Note
that the WHERE
predicate examines distinct ranges in the new data
bounded by last known max_seq_id
and the new actual
max_seq_id
:
SELECT DISTINCT (from_unixtime(date_initiated, “yyyy/MM/dd”))
FROM message
WHERE date_initiated between 20180803789 and 20180804567;
In this scenario, using the results of the preceding query and the existing
metadata, Arcadia Engine determines that the 2018/08/03
partition
must be recomputed, and a 2018/08/04
partition must be added.
Internally, it executes the REFRESH ANALYTICAL VIEW
statement
through the following ALTER
and INSERT
commands:
ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789","2018/08/03");
ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180803790-20180804567","2018/08/04");
-- Inserts the 2018/08/03 partition
INSERT INTO daily_message_vol
SELECT acct_id, "20180801123-20180803789",
from_unixtime(date_initiated, "yyyy/MM/dd"),
sum(1) sum_msgs, max(num_segments) max_segs
FROM message
WHERE date_initiated between 20180801123 and 20180803789
GROUP BY 1, 3;
-- Inserts the 2018/08/04 partition
INSERT INTO daily_message_vol
SELECT acct_id, "20180803790-20180804567",
from_unixtime(date_initiated, "yyyy/MM/dd"),
sum(1) sum_msgs, max(num_segments) max_segs
FROM message
WHERE date_initiated between 20180803790 and 20180804567
GROUP BY 1, 3;
Within
the acct_id
hash partitions, the ranges have the following
updated values:
"20180801123-20180803789", "2018/08/01", sum1, max1
"20180801123-20180803789", "2018/08/02", sum2, max2
"20180801123-20180803789", "2018/08/03", sum3_new,
max3_new
"20180803790-20180804567", "2018/08/04", sum4, max4
We can make all updates or deletes to the message table, and Arcadia Engine
makes corresponding changes to the analytical view by issuing a REFRESH
ANALYTICAL VIEW
command, along with the specification of the
FROM ... TO
clause, as described in Using Analytical View with Primary Key Identical to the Base Table's Primary Key.
When Arcadia Engine receives a query, it automatically routes it to a suitable analytical view. It also prunes partitions based on the predicates on hash and range partitioning columns.