Deploying Arcadia Enterprise on an EMR Cluster

This article describes how to deploy Arcadia Enterprise on an Amazon EMR Cluster.

Developer Notes:

After completing the prerequisites for EMR installation, run the following commands on AWS:

  1. Download the Arcadia Enterprise EMR deployment package provided by the Arcadia Data support team to a local directory. See Arcadia Enterprise Deployment Package.
  2. Extract the EMR deployment package that contains all Arcadia Enterprise scripts and binaries. Use the following command:

    tar -xf ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1.tar.gz
  3. Copy the entire deployment package to an Amazon S3 bucket. Use a bucket that is dedicated to Arcadia deployments and backups. Sync the package into a folder that has the same name as the package; the deployment scripts rely on this name to find the installable binaries.

    aws s3 sync ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1 \
        s3://arc-emr-test-alex/ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1
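Because the S3 folder name must match the package name, it can help to derive both from the tarball name instead of typing them twice. The following is a minimal sketch, not part of the Arcadia package: the function names are hypothetical, and it only prints the sync command so that you can review it before running it.

```shell
# Sketch: derive the S3 destination from the package name so the two always match.
# Function names are hypothetical helpers, not part of the Arcadia package.

# Strip the .tar.gz suffix to get the directory name created by tar -xf.
package_dir() {
  basename "$1" .tar.gz
}

# Build an S3 URI whose last path component equals the package directory name.
# $1 is the bucket name, $2 is the tarball file name.
s3_destination() {
  printf 's3://%s/%s\n' "$1" "$(package_dir "$2")"
}

# Print (rather than run) the sync command so it can be reviewed first.
print_sync_command() {
  printf 'aws s3 sync %s %s\n' "$(package_dir "$2")" "$(s3_destination "$1" "$2")"
}

print_sync_command "arc-emr-test-alex" "ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1.tar.gz"
```

Substitute your own bucket name for arc-emr-test-alex, which is the example bucket used in this article.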
  4. The scripts extract the following three files and install them on the machine that runs the AWS CLI.

    ./ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1/local/config_template.json
    ./ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1/local/run_emr_arc
    ./ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1/local/sample_deployment.conf

    You may choose to delete the other files, as they are not required on the AWS CLI machine.
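Before deleting anything, you may want to confirm that the three files listed above are present. The following is a small sketch under that assumption; missing_local_files is a hypothetical helper that prints the path of each required file it cannot find and prints nothing when all three exist.

```shell
# Sketch: report which of the three required local files are missing.
# $1 is the extracted package directory.
missing_local_files() {
  for name in config_template.json run_emr_arc sample_deployment.conf; do
    if [ ! -f "$1/local/$name" ]; then
      printf '%s\n' "$1/local/$name"
    fi
  done
}

# Example: check the package directory used in this article.
missing_local_files "./ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1"
```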

  5. Complete the config_template.json template file, or create your own template that uses the values required for the deployment. At a minimum, we recommend that you enter the following values in the template:

    • Specify the access and secret keys for storing data in S3 buckets when using the EMR File System (EMRFS). The system uses these access and secret keys when you specify LOCATION s3://... in a Hive clause.
    • If you use an external Hive metastore, fill out the hive-site section in the template.

      Remove this section from the template if you are not using an external metastore.

      Arcadia Enterprise works even if you do not configure an external Hive metastore.
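For orientation, the sketch below writes a hive-site block in the generic Amazon EMR configuration format for an external metastore. It is an illustration only: the layout of the config_template.json file shipped with Arcadia Enterprise may differ, the output file name hive-site-example.json is hypothetical, and the host name, user name, and password are placeholders.

```shell
# Sketch: write a hive-site block in the generic Amazon EMR configuration format.
# The output file name is hypothetical; hostname, username, and password are placeholders.
cat > hive-site-example.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "username",
      "javax.jdo.option.ConnectionPassword": "password"
    }
  }
]
EOF
```

Copy only the property values into the corresponding section of your template rather than replacing the template with this file.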

  6. To deploy Arcadia Enterprise on EMR, you can use either of the following methods:
    • Run the run_emr_arc command

      Execute the run_emr_arc command from the top-level directory, and enter the relevant information from the template in the EMR wizard. The wizard generates an AWS CLI command that deploys Arcadia Enterprise on EMR.

      When choosing the instance type, note our sizing recommendations in Prerequisites for Deploying Arcadia Enterprise on Amazon EMR.

      The CLI wizard has the following format:

      [arcuser@local]$ ./run_emr_arc
      Provisioning EMR with Arcadia Enterprise <version>
      Arcadia Enterprise Version?: <version>
      S3 bucket which contains the directory[]: <bucket-name>
      s3://<bucket-name> Access Key? []: <Access key of S3 bucket>
      s3://<bucket-name> Secret Key? []: <Secret key of S3 bucket>
      Location of additional configurations to supply (file://)? []: <Path of the file generated from the config_template.json template file>
      Instance count?: <Number of EMR nodes>
      Instance type?: <AWS EC2 instance type>
      AWS ssh key to use for ec2 instances? []: <Preconfigured SSH Key to use for EMR spawned instances>
      Cluster Name?: <Name of Arcadia Enterprise EMR Cluster>
      Skip deployment and save aws cli output to a file (y/n)?:<Yes/No>

      For example:

      [arcuser@local]$ ./run_emr_arc
      Provisioning EMR with Arcadia Enterprise <version> (ver. ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1)
      Arcadia Enterprise Version? [ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1]:
      S3 bucket which contains the directory ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1? []: arc-emr-test-alex
      s3://arc-emr-test-alex Access Key? []: AKIAIOSFODNN7EXAMPLE
      s3://arc-emr-test-alex Secret Key? []: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      Location of additional configurations to supply (file://)? []: file://./config.json
      Instance count? [3]:
      Instance type? [m5.2xlarge]:
      AWS ssh key to use for ec2 instances? []: devops
      Cluster Name? [Arcadia Cluster ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1]:
      Skip deployment and save aws cli output to a file (y/n)?[n]:n
      j-29M75IXAOT0LP

      [Optional] If you have your own EMR deployment script and want to use the generated bootstrap action as part of a larger EMR deployment, you can skip the deployment and save the AWS CLI command to the emr_deployment.txt output file instead:

      Skip deployment and save aws cli output to a file (y/n)? [n]:y
      Deployment command saved in emr_deployment.txt
    • Run Deployment Script

      Run the deployment script saved in the emr_deployment.txt file.

      [arcuser@local]$ ./run_emr_arc -f "./sample_deployment.conf" -d
      Provisioning EMR with Arcadia Enterprise (script ver. ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1)
      Loading configuration file: ./sample_deployment.conf
      ARCADIA-ENTERPRISE Version: ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1
      INSTALL BUCKET: emr-deployment-bucket
      CONFIG PATH: file:///tmp/config.json
      INSTANCE COUNT: 3
      INSTANCE TYPE: m4.xlarge
      SSH KEY NAME: devops
      CLUSTER NAME:[Arcadia Cluster ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1]
  7. If you do not skip the deployment, the AWS CLI output shows the EMR cluster ID at the end. In the preceding example, the EMR cluster ID is j-29M75IXAOT0LP. You can use this ID to check the status of the EMR cluster deployment and the Arcadia installation from the EMR console.
  8. Log in to the EMR console in AWS. The cluster status changes from Starting to Bootstrapping. Wait for the instances to build; this process may take a few minutes. The Arcadia Enterprise deployment is complete when the cluster status changes to Waiting.
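As an alternative to watching the console, the same status can be polled from the AWS CLI with the cluster ID from the previous step. The following is a minimal sketch, assuming the AWS CLI is configured with credentials that can describe the cluster; classify_state and wait_for_cluster are hypothetical helper names. It treats WAITING as complete, in line with the step above, and any terminating or terminated state as a failure.

```shell
# Sketch: classify an EMR cluster state string as reported by the AWS CLI.
# Prints "ready", "failed", or "pending".
classify_state() {
  case "$1" in
    WAITING) echo "ready" ;;
    TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) echo "failed" ;;
    *) echo "pending" ;;
  esac
}

# Poll the cluster until it is ready or has failed.
# $1 is the cluster ID printed at the end of the deployment output.
# $2 is the poll interval in seconds (defaults to 30).
wait_for_cluster() {
  while :; do
    state="$(aws emr describe-cluster --cluster-id "$1" \
      --query 'Cluster.Status.State' --output text)" || return 2
    echo "cluster $1: $state"
    case "$(classify_state "$state")" in
      ready) return 0 ;;
      failed) return 1 ;;
    esac
    sleep "${2:-30}"
  done
}

# Usage (requires AWS credentials, so it is not run here):
#   wait_for_cluster j-29M75IXAOT0LP 30
```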

  9. Click the cluster name (in the preceding example, Arcadia Cluster ARCADIA-ENTERPRISE-5.0.0.0_1547496452-1.amzn1) to view the summary and configuration details of the Arcadia Cluster.

    [Figure: EMR console in AWS showing the Arcadia Cluster with status Starting]

    The following image shows the summary and configuration details of the Arcadia Cluster:

    [Figure: EMR console in AWS showing the summary and configuration details of the Arcadia Cluster]
  10. To check the status of a bootstrapping EMR cluster, view the details in the /tmp/bootrap.log file on each node. By default, the EMR deployment command generated by run_emr_arc archives EMR logs to the install bucket.

    For more information on troubleshooting an EMR cluster, see Troubleshoot a Cluster.
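To locate the archived logs from the command line, one option is to read the log location from the cluster itself instead of guessing the path inside the install bucket. The following is a sketch under two assumptions: EMR archives logs under the cluster's log URI in a folder named after the cluster ID, and the log URI may be reported with an s3n:// scheme that the aws s3 commands do not accept. All function names are hypothetical helpers.

```shell
# Convert the s3n:// scheme that EMR may report in LogUri to s3:// for the AWS CLI.
normalize_log_uri() {
  case "$1" in
    s3n://*) printf 's3://%s\n' "${1#s3n://}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Build the per-cluster log prefix: <LogUri>/<cluster-id>/
# $1 is the log URI, $2 is the cluster ID.
cluster_log_prefix() {
  base="$(normalize_log_uri "$1")"
  printf '%s/%s/\n' "${base%/}" "$2"
}

# List archived logs for a cluster.
# Requires AWS credentials, so it is defined here but not run.
list_cluster_logs() {
  log_uri="$(aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.LogUri' --output text)" || return 1
  aws s3 ls --recursive "$(cluster_log_prefix "$log_uri" "$1")"
}
```

For example, list_cluster_logs j-29M75IXAOT0LP lists the archived log objects for the cluster from the earlier example, provided that log archiving was left enabled.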

  11. To verify your Arcadia Enterprise installation on EMR, connect to ArcViz.