Best Practices for Arcadia Catalog Tables

Extract Completed Queries to Permanent Storage

The arcadia_catalog.completed_queries table relies on the parsing profile files that the arcengined (ArcEngine daemon) process generates. These files reside on the coordinator node, where the query ran originally. They are not replicated, and take much longer to process than parquet tables.

In scenarios that rely on extensive use of these tables, we strongly recommend that you extract this information into a parquet-formated destination table. Running regularly-scheduled extraction jobs to update this table improves the speed for analytics, replication, and durability; it also mitigates the persistence and performance issues we identify in the discussion on Limitations of Arcadia Catalog Tables.

For example, to copy data into a new table, use the following approach to extract information on completed tables into more permanent storage:

  1. To create a parquet destination table (historical_queries), we use the same schema as arcadia_catalog.completed_queries, and use an additional partitioning column start_date.

    Issue the following command:

    CREATE TABLE historical_queries
      PARTITIONED BY (start_date)
      SORT BY (start_time, end_time)
      STORED AS PARQUET 
      AS SELECT *, to_date(start_time) 
      AS start_date 
      FROM arcadia_catalog.completed_queries;

    Note that this statement fully populates the new table with all data in the arcadia_catalog.completed_queries table. Also, the SORT BY clause enables you to run more efficient queries by referencing start_time and end_time columns.

  2. To update the historical_queries table periodically with new queries in the arcadia_catalog.completed_queries table, run the following command:

    INSERT INTO historical_queries
    PARTITION (start_date) 
      SELECT *, to_date(start_time) 
        AS start_date 
        FROM arcadia_catalog.completed_queries 
        WHERE end_time > (SELECT max(end_time) FROM historical_queries);

    The end_time predicate copies all queries that ran after the previous data upload into the historical_queries table.

  3. Note that for simple comparison predicates greater-than, equal, or less-than (>, =, <) on the end_time column, Arcadia Enterprise optimizes queries through predicate push down, allowing the arcengined process to skip over profile files that do not match the end_time predicate.