Creating New Databricks / Spark SQL Connections

Arcadia Enterprise supports Databricks / Spark SQL data connections.

This connection type provides basic connectivity and functionality for Databricks / Spark SQL connections, and runs queries faster than Hive connections. However, it does not support the save table and data sampling features, analytical views, or flow and funnel visuals.

You can create fully managed Spark clusters on Amazon AWS Databricks or Microsoft Azure Databricks.

The following steps demonstrate how to create new Databricks / Spark SQL data connections.

Availability Note. This connection type requires the presence of a Spark Thrift Server.
Developer Notes:

Note the following known issue that affects Databricks / Spark SQL connections:

Create New Data Connection Modal Window: Databricks / Spark SQL
Create a New Databricks / Spark SQL Connection
  1. On the main navigation bar, click Data.

    Click DATA on main navigation bar

    The Data view appears, open on the Datasets tab.

    Main landing page of DATA
  2. In the side bar, click New Connection.

    Create New Connection

    The Create New Data Connection modal window appears.

  3. In the Create New Data Connection modal window, under Connection type, select Databricks / Spark SQL.
  4. Under Connection name, specify the name of the new connection. Here, we use DatabricksSparkSQLConnection.
  5. Under Hostname or IP address, specify the name of your database host, or its IP address; use localhost when the data source is local.

  6. Under Port #, enter the port number. The default port # for Databricks / Spark SQL connections is 10000.
  7. Under Credentials, complete the following entries.

    • Under Username, enter the username for establishing the connection.

    • Under Password, enter the password for establishing the connection.

  8. At the bottom of the modal, click Test.

    Testing the New Connection

    If the connection is valid, the system returns a 'Connection Verified' message.

  9. Click Save.

    Connecting a Verified Connection

After this operation succeeds, the new connection name appears on the side navigation bar.
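
If the Test step fails, it can help to confirm that the Spark Thrift Server port is reachable from the machine in question. The following is a minimal sketch, not part of Arcadia Enterprise: it uses only the Python standard library, the function name thrift_port_reachable is hypothetical, and it checks TCP reachability only, not credentials or authentication mode.

```python
import socket


def thrift_port_reachable(host: str, port: int = 10000, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The default port of 10000 matches the default stated in step 6.
    This does not log in; it only checks that something is listening.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, thrift_port_reachable("localhost") checks the default port on the local machine. A False result suggests a network, firewall, or host name problem rather than a credentials problem.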

Advanced Options

To select advanced connection options, click the Advanced tab, and make changes to any of the following options:

Advanced Options
  • Connection Mode
    To authenticate and access a Databricks / Spark SQL connection, follow these steps:
  • Socket Type

    Choose one of the following Socket types:

    • Normal

      Default setting

    • SSL
    • SSL with certificate

      If you select this option, you can use the Allow Common Name - Host Name Mismatch option, which means that the issued SSL certificate name does not have to match the host name of the server. By default, the names must match.

  • Authentication Mode

    Choose one of the Authentication modes:

    • NoSasl
    • Plain
    • LDAP
    • Kerberos

      Kerberos authentication is only available on Linux platforms.

  • Query Timeout

    Specify Query Timeout. The default value is 60 seconds.

  • Session Timeout

    Specify Session Timeout. The default value is 0, which means that sessions do not time out.

  • Socket Timeout

    Specify Socket Timeout. The default value is 60.

  • Queue depth

    Queue depth controls the maximum number of simultaneous queries on the connection.

    • The default value is 2; you do not have to specify it.

    • Valid values are integers 1 through 100.

  • Impersonation

    Select Impersonation if you use this feature to control table-level access for individual users or user groups.
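
The allowed values and defaults listed above can be summarized in a small validation sketch. This is illustrative only and is not an Arcadia Enterprise API: the function and parameter names are hypothetical, and the rule that timeouts must be non-negative integers is an assumption, since the text above states only the defaults.

```python
SOCKET_TYPES = ("Normal", "SSL", "SSL with certificate")
AUTH_MODES = ("NoSasl", "Plain", "LDAP", "Kerberos")


def validate_advanced_options(auth_mode, socket_type="Normal", query_timeout=60,
                              session_timeout=0, socket_timeout=60, queue_depth=2):
    """Return a list of problems found; an empty list means the values look valid.

    Defaults mirror the ones documented above. No default is given for
    auth_mode because the documentation does not state one.
    """
    problems = []
    if socket_type not in SOCKET_TYPES:
        problems.append(f"unknown socket type: {socket_type!r}")
    if auth_mode not in AUTH_MODES:
        problems.append(f"unknown authentication mode: {auth_mode!r}")
    # Assumption: timeouts are non-negative integers (0 = no session timeout).
    for name, value in (("query_timeout", query_timeout),
                        ("session_timeout", session_timeout),
                        ("socket_timeout", socket_timeout)):
        if not isinstance(value, int) or value < 0:
            problems.append(f"{name} must be a non-negative integer")
    # Documented range: integers 1 through 100.
    if not isinstance(queue_depth, int) or not 1 <= queue_depth <= 100:
        problems.append("queue_depth must be an integer from 1 through 100")
    return problems
```

For example, validate_advanced_options("Kerberos") returns an empty list, while passing queue_depth=0 returns a single problem describing the valid range.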

Parameter Options

  • Click the Parameters tab.

    Specifying Connection Parameters
  • Click the first row to add a parameter name/value pair, then type the parameter name and its value.

  • To remove an existing parameter, click the (trash can) icon next to it.

Cache Options

  • Click the Cache tab.

    Specifying Cache
  • Select the Result Cache option to enable periodic cache updates.

  • In the Retention Time field, specify the frequency of cache updates, in seconds.

    For example, the default value of 86400 indicates an update every 24 hours, and a value of 300 initiates a refresh every 5 minutes.
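
Because Retention Time is entered in seconds, a quick conversion can help when choosing a value. The helper below is a hypothetical sketch for checking the arithmetic, not something the product provides.

```python
def describe_retention(seconds: int) -> str:
    """Describe a Retention Time value (in seconds) in the largest whole unit."""
    if seconds <= 0:
        raise ValueError("retention time must be a positive number of seconds")
    # Try hours first, then minutes; fall back to seconds.
    for unit_seconds, unit in ((3600, "hour"), (60, "minute")):
        if seconds % unit_seconds == 0:
            count = seconds // unit_seconds
            return f"every {count} {unit}{'' if count == 1 else 's'}"
    return f"every {seconds} second{'' if seconds == 1 else 's'}"


print(describe_retention(86400))  # every 24 hours
print(describe_retention(300))    # every 5 minutes
```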