Limitations of Arcadia Catalog Tables

Note the following limitations and drawbacks when using the catalog tables:

Creating Catalog Tables

If catalog tables do not exist, the catalogd process creates them at startup time. The tables reside in the Hive metastore, and appear as empty tables in Hive or Impala connections.

If the catalogd process cannot create the catalog tables, it does not stop; it continues with the remaining startup steps. In that case, the catalog tables are not available for queries.

No Support for Analytical Views

It is not possible to create analytical views over the catalog tables.

Support for Pretty Printing in Beeswax

The beeswax interface uses tab characters in its output. As a result, when you run a SELECT query on the completed_queries catalog table through arcadia-shell, column values that themselves contain tabs may cause pretty-printing of the output to fail.

To address this issue, we recommend applying the regexp_replace function to columns that may contain tabs. For example, if you expect the query_status column to contain tabs, run the following query:
SELECT regexp_replace(query_status, '\t', ' ') 
  AS query_status 
  FROM arcadia_catalog.completed_queries;

This restriction does not apply to Hive Server sessions.

Format for Extracted Catalog Data

When extracting data from catalog tables into another table, we recommend using the PARQUET format for the destination table. This avoids many parsing errors.

Using other destination table types, such as TEXT, may lead to errors during insert operations from the catalog table because the data may contain tabs, commas, or other characters that often function as delimiters during conversion.
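
As an illustration of this recommendation, the following is a minimal sketch of an extraction into a PARQUET table. The database name catalog_archive and the table name completed_queries_copy are hypothetical, and the sketch assumes that your engine accepts Impala-style CREATE TABLE ... AS SELECT syntax and that every column of the catalog table can be written to PARQUET:

```sql
-- Hypothetical destination database; replace with your own.
CREATE DATABASE IF NOT EXISTS catalog_archive;

-- Copy the current contents of the catalog table into a PARQUET table.
-- PARQUET is a binary format, so tabs or commas inside column values
-- are not treated as field delimiters.
CREATE TABLE catalog_archive.completed_queries_copy
  STORED AS PARQUET
AS
SELECT *
  FROM arcadia_catalog.completed_queries;
```

If a column cannot be written as-is, replace SELECT * with an explicit list of the columns you need.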

Persistence for Completed Queries

The completed_queries catalog table itself does not contain any data; instead, it reads data from profiles stored on local file systems. Because this data is not replicated, the loss of a disk results in data loss. Therefore, we recommend that you extract the data to an external table, as described in Recommendations for Arcadia Catalog Tables.

Performance of Queries over Completed Queries

Queries against the completed_queries catalog table may take a long time to run because the table holds no data of its own and must read query profiles from local file systems each time it is queried. Instead, we recommend that you run these queries against an external table that contains the same data, as described in Recommendations for Arcadia Catalog Tables.
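
For example, assuming that the data has already been extracted into a table named catalog_archive.completed_queries_copy (a hypothetical name, not one defined by the product), a summary query can read the copy instead of the catalog table. This sketch uses only the query_status column mentioned earlier:

```sql
-- Count completed queries by status, reading from the extracted copy
-- (hypothetical table name) rather than from the catalog table itself.
SELECT query_status,
       COUNT(*) AS num_queries
  FROM catalog_archive.completed_queries_copy
 GROUP BY query_status;
```

The copy reflects only the data present at the time of the last extraction, so refresh it as often as your reporting requires.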