Using Sequence-Based Analytical Views

Here, we provide some typical examples of creating and using analytical views:

Using Analytical View with a Sequence Column

When we include a sequence column in the definition of an analytical view, it ensures that we can refresh that analytical view incrementally. For an in-depth description of this mechanism, see Sequence Tracking.

The following scenario describes how to use analytical views with a defined sequence column:

  1. Creating an Analytical View
  2. Inserting Data into Base Table
  3. Updating Data in Base Table
  1. Creating an Analytical View

    The following analytical view, high_sales_av, shows sales that are greater than 50,000. Because we specified a sequence column, Arcadia Engine can refresh the analytical view with new data, incrementally.

    CREATE ANALYTICAL VIEW high_sales_av 
                      PRIMARY KEY (seq_id) 
                      SEQUENCE COLUMN (seq_id) 
                      STORED AS KUDU 
                      AS SELECT seq_id, custid, amt from sales where amt > 50000;
  2. Inserting Data into Base Table

    Assuming current max_seq_id is 19 and the new data in the table is identified to be between sequence ids 20 and 29, ArcEngine performs the following operations to refresh the analytical view. In this example, the highest known sequence range is 20-29, the two numbers indicating the starting and ending points of a range. With each REFRESH ANALYTICAL VIEW statement, Arcadia Engine reevaluates the highest existing sequence range (if needed) and creates any necessary new sequence ranges.

    ALTER TABLE high_sales_av ADD RANGE PARTITION VALUE = "20-29";
                      INSERT INTO high_sales SELECT seq_id, "20-29", custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 20 and 29;
  3. Updating Data in Base Table

    Let us assume that the user appended data to the base table, and that Arcadia Engine correspondingly refreshed the analytical view. The analytical view now has range partitions that correspond to sequence ranges 0-9, 10-19, 20-29, and 30-39. Subsequently, the user modified rows with sequence numbers 19, 20, 21, and 22. In order to update the analytical view, the user must then issue the following command:

    REFRESH ANALYTICAL VIEW high_sales_av FROM 19 TO 22;

    Internally, Arcadia Engine computes the partitions spanned by the FROM-TO clause. The affected ranges are 10-19 and 20-29. Internally, it executes the REFRESH ANALYTICAL VIEW statement through the following DELETE and INSERT commands:

    DELETE FROM high_sales_av where seq_range = '10-19';
                      INSERT INTO high_sales SELECT seq_id,"10-19", custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 10 and 19;
                      
                      DELETE FROM high_sales where seq_range = '20-29';
                      INSERT INTO high_sales SELECT seq_id,"20-29", custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 20 and 29;

Using Analytical View with Primary Key Identical to the Base Table's Primary Key

We can define the primary key of an analytical view to be the same as the primary key of the base table. The following scenario describes how to use analytical views with same primary key as the base table. For an in-depth description of the sequencing mechanism, see Sequence Tracking.

  1. Creating an Analytical View
  2. Inserting Data into Base Table
  3. Updating Data into Base Table
  1. Creating an Analytical View

    The following analytical view, high_sales2, has the same primary key column as its base table.

    CREATE ANALYTICAL VIEW high_sales2 
                      PRIMARY KEY (seq_id, sale_year, sale_month, sale_day) 
                      SEQUENCE COLUMN (seq_id) 
                      STORED AS KUDU 
                      AS SELECT seq_id, sale_year, sale_month, sale_day, custid, amt from sales where amt > 50000;
  2. Inserting Data into Base Table

    The first REFRESH ANALYTICAL VIEW statement changes the high_sales2 analytical view in the following manner. Note that we are adding range partitions to an internal column with details in the previous section.:

    ALTER TABLE high_sales2 ADD RANGE PARTITION VALUE = "0-9";
                      INSERT INTO high_sales2 SELECT seq_id, "0-9", sale_year, sale_month, sale_day, custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 0 and 9;
  3. Updating Data in Base Table

    Let us assume that the user appended data with sequence ids 9-19 to the base table, and that Arcadia Engine correspondingly refreshed the analytical view.

    Remember that the system expects the sequence id column to be an increasing column. So, the new ranges of data that may be added to base table are 9-19, 10-20, 12-20, but not 8-20. When the user issues the command to refresh the analytical view, the previously created range 0-9 must be recomputed because it overlaps with the previous range. Internally, it executes the following DELETE, ALTER, and INSERT commands:

    DELETE FROM high_sales2 WHERE seq_range  = '0-9';
                      INSERT INTO high_sales2 SELECT seq_id,"0-9", sale_year, 
                      sale_month, sale_day, custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 0 and 9;
                      
                      ALTER TABLE high_sales2 ADD RANGE PARTITION VALUE = '10-19';
                      INSERT INTO high_sales2 SELECT seq_id,"10-19", sale_year, 
                      sale_month, sale_day, custid, amt 
                      FROM sales 
                      WHERE amt > 50000 and seq_id between 10 and 19;

Using Analytical View with Grouping and Aggregation

When defining an analytical view with grouping and aggregation, we recommend that you use the GROUP BY clause of the analytical view as the primary key.

The following scenario describes how to use analytical views with a primary key based on the GROUP BY clause. For an in-depth description of the sequencing mechanism, see Sequence Tracking.

  1. Creating an Analytical View
  2. Updating Data in Base Table
  1. Creating an Analytical View

    The following analytical view, daily_sales, has the same primary key column as its base table.

    CREATE ANALYTICAL VIEW daily_sales 
                      PRIMARY KEY (sale_year, sale_month, sale_day) 
                      SEQUENCE COLUMN (seq_id) 
                      PARTITION BY RANGE (sale_year, sale_month, sale_day) 
                      STORED AS KUDU 
                      AS SELECT sale_year, sale_month, sale_day, sum(amt) sum_amt, 
                      count(qty) cnt_qty 
                      FROM sales GROUP BY 1, 2, 3;
  2. Updating Data in Base Table

    The first REFRESH ANALYTICAL VIEW is similar to previous examples. Let us assume that the user appended data with sequence ids 0-9 to the base table, and that Arcadia Engine correspondingly refreshed the analytical view. Internally, it executes the REFRESH ANALYTICAL VIEW statement through the following ALTER and INSERT commands:

    ALTER TABLE daily_sales ADD RANGE PARTITION VALUE = ("0-9", 2018, 08, 01);
                      INSERT INTO daily_sales SELECT "0-9", sale_year, sale_month, 
                      sale_day, sum(amt), count(qty) 
                      FROM sales 
                      WHERE seq_id BETWEEN 0 AND 9 AND sale_year = 2018 AND month = 08 
                      AND day = 01 
                      GROUP BY 2, 3, 4;

    Subsequent REFRESH commands add partitions for the sequence ranges 10-19 and 20-29. The ranges in analytical view look like the following:

    • "0-9", 2018, 08, 01
    • "10-19", 2018, 08, 01
    • "10-19", 2018, 08, 02
    • "20-29", 2018, 08, 03

    Note that the date value (2018, 08, 01) repeats in two range partitions, 0-9 and 10-19, because the data for that date spans the two ranges.

Using Analytical Views with Eager Aggregation

Arcadia Engine supports sequence-based analytical views for fact-dimension (star schema) joins, by expediting query execution through eager aggregation.

Imagine a fact table message that joins with dimension tables carrier, account, and others. Most of the queries are star schema queries, joining on primary-foreign keys. When we enable eager aggregation rewrites, most of the queries reduce to a join of aggregated fact table with dimension tables.

This section focuses on the fact table and possible analytical views on the fact table to accelerate the star schema queries. For an in-depth description of the sequencing mechanism, see Sequence Tracking.

  1. Fact Table Definition
  2. Creating an Analytical View
  3. Refreshing the Analytical View
  4. Inserting Data into Base Table
  5. Updating Data in Base Table
  1. Fact Table Definition

    Conceptually, the fact table looks as follows:

    CREATE TABLE message (acct_id BIGINT, date_initiated BIGINT, xid INT, 
                      status STRING, num_segments INT, price INT,
                      PRIMARY KEY (acct_id, date_initiated, xid)) 
                      PARTITION BY HASH (acct_id) PARTITIONS 10, 
                      RANGE (date_initiated) (PARTITION VALUE = (0))
                      STORED AS KUDU;
  2. Creating an Analytical View

    A typical query constrains on acct_id and date_initiated columns. Therefore, the analytical view should be partitioned so that Arcadia Engine can prune partitions on these constraints. We define an analytical view for tracking daily message volumes as follows:

    CREATE ANALYTICAL VIEW daily_message_vol 
                      PRIMARY KEY (acct_id, ymd) 
                      SEQUENCE COLUMN (date_initiated) 
                      PARTITION BY HASH (acct_id) PARTITIONS 50 
                      RANGE (ymd)
                      STORED AS KUDU 
                      AS SELECT acct_id, from_unixtime(date_initiated, "yyyy/MM/dd") ymd, sum(1) sum_msgs, 
                      max(num_segments) max_segs 
                      FROM message 
                      GROUP BY 1, 2;

    Note that the sequence column date_initiated is not unique, but we assume that it is increasing.

  3. Refreshing the Analytical View

    The REFRESH command works similarly to the examples we described in Using Analytical View with Primary Key Identical to the Base Table's Primary Key and Using Analytical View with Grouping and Aggregation. Within the internally-created ranges on date_initiated_range column, Arcadia Engine creates range partitions on ymd, or year-month-date. So, if the first REFRESH command reads date_initiated_range values from 20180801123 to 20180803789, the system executes the command through the following internal ALTER and INSERT commands:

    ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/01");
                      ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/02");
                      ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789", "2018/08/03");
                      INSERT INTO daily_message_vol
                      SELECT acct_id, "20180801123-20180803789", from_unixtime(date_initiated, "yyyy/MM/dd") ymd, 
                      sum(1) sum_msgs, max(num_segments) max_segs
                      FROM message
                      WHERE date_initiated between 20180801123 and 20180803789
                      GROUP BY 1, 3;

    Within the acct_id hash partitions, the ranges have the following values:

    • "20180801123-20180803789", "2018/08/01", sum1, max1
    • "20180801123-20180803789", "2018/08/02", sum2, max2
    • "20180801123-20180803789", "2018/08/03", sum3, max3
  4. Inserting Data into Base Table

    Let us add data for two more days, 2018/08/03 and 2018/08/04, before the next REFRESH ANALYTICAL VIEW command runs. Also, let us assume that the date_initiated_range for the new data is 20180803789 to 20180804567, overlapping the previous refresh.

    Arcadia Engine determines the range partitions that it must recompute or create by running the following query. Note that the WHERE predicate examines distinct ranges in the new data bounded by last known max_seq_id and the new actual max_seq_id:

    SELECT DISTINCT (from_unixtime(date_initiated, “yyyy/MM/dd”)) 
                      FROM message 
                      WHERE date_initiated between 20180803789 and 20180804567;
  5. Updating Data in Base Table

    In this scenario, using the results of the preceding query and the existing metadata, Arcadia Engine determines that the 2018/08/03 partition must be recomputed, and a 2018/08/04 partition must be added. Internally, it executes the REFRESH ANALYTICAL VIEW statement through the following ALTER and INSERT commands:

    ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180801123-20180803789","2018/08/03");
                      ALTER TABLE daily_message_vol ADD RANGE PARTITION VALUE = ("20180803790-20180804567","2018/08/04");
                      
                      -- Inserts the 2018/08/03 partition
                      INSERT INTO daily_message_vol
                      SELECT acct_id, "20180801123-20180803789", 
                      from_unixtime(date_initiated, "yyyy/MM/dd"), 
                      sum(1) sum_msgs, max(num_segments) max_segs
                      FROM message
                      WHERE date_initiated between 20180801123 and 20180803789
                      GROUP BY 1, 3;
                      
                      -- Inserts the 2018/08/04 partition
                      INSERT INTO daily_message_vol
                      SELECT acct_id, "20180803790-20180804567", 
                      from_unixtime(date_initiated, "yyyy/MM/dd"), 
                      sum(1) sum_msgs, max(num_segments) max_segs
                      FROM message
                      WHERE date_initiated between 20180803790 and 20180804567
                      GROUP BY 1, 3;

    Within the acct_id hash partitions, the ranges have the following updated values:

    • "20180801123-20180803789", "2018/08/01", sum1, max1
    • "20180801123-20180803789", "2018/08/02", sum2, max2
    • "20180801123-20180803789", "2018/08/03", sum3_new, max3_new
    • "20180803790-20180804567", "2018/08/04", sum4, max4

    We can make all updates or deletes to the message table, and Arcadia Engine makes corresponding changes to the analytical view by issuing a REFRESH ANALYTICAL VIEW command, along with the specification of the FROM ... TO clause, as described in Using Analytical View with Primary Key Identical to the Base Table's Primary Key.

    When Arcadia Engine receives a query, it automatically routes it to a suitable analytical view. It also prunes partitions based on the predicates on hash and range partitioning columns.