Flintrock - command-line tool for launching Apache Spark clusters

November 22nd, 2022

What is Flintrock
Usage
Accessing data on S3
Installation
- Standalone version
- Community-supported distributions
Automated Pipelines
Managing permanent infrastructure
Launching non-Spark-related services
Configurable CLI Defaults

What is Flintrock

Flintrock is a command-line tool for launching Apache Spark clusters.

Usage

Flintrock works best with Amazon Linux.

A quick way to launch a cluster on EC2 is to run "flintrock launch test-cluster", passing any launch options you need on the command line.
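The launch command usually carries a handful of options. As a rough sketch, the Python snippet below assembles such a command as an argument list and prints it for review instead of running it. The flag names follow Flintrock's documented launch options; `build_launch_command` is a hypothetical helper (not part of Flintrock), and every value (key name, identity file path, AMI ID, Spark version) is a placeholder assumption.

```python
import shlex


def build_launch_command(cluster_name, num_slaves, spark_version,
                         key_name, identity_file, ami, user="ec2-user"):
    """Return the argument list for a `flintrock launch` call.

    Hypothetical helper for illustration; it only builds the command.
    """
    return [
        "flintrock", "launch", cluster_name,
        "--num-slaves", str(num_slaves),
        "--spark-version", spark_version,
        "--ec2-key-name", key_name,
        "--ec2-identity-file", identity_file,
        "--ec2-ami", ami,
        "--ec2-user", user,
    ]


# All values below are placeholders; substitute your own.
command = build_launch_command(
    cluster_name="test-cluster",
    num_slaves=1,
    spark_version="3.5.0",
    key_name="key_name",
    identity_file="/path/to/key.pem",
    ami="ami-xxxxxxxx",
)

# Print a copy-pasteable shell command rather than executing it.
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list keeps each argument separate, so it can later be handed to `subprocess.run` without shell quoting concerns.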

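If you persist your options to a file (run "flintrock configure" and save your preferences in the editor that opens), you can do the same thing more concisely with "flintrock launch test-cluster" alone.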

Once you're done using the cluster, destroy it with "flintrock destroy test-cluster".

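Other things you can do with Flintrock:

flintrock login test-cluster
flintrock describe test-cluster
flintrock add-slaves test-cluster --num-slaves 2
flintrock remove-slaves test-cluster --num-slaves 1
flintrock run-command test-cluster 'sudo yum install -y package'
flintrock copy-file test-cluster /local/path /remote/path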

Accessing data on S3

Set up an IAM role that grants access to S3 as desired.

Reference this role when you launch your cluster using the "--ec2-instance-profile-name" option, and reference S3 paths in your Spark code using the "s3a://" prefix.

Call Spark with the hadoop-aws package to enable s3a://.
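To make the three steps concrete, here is a minimal sketch in Python that builds the two commands involved and prints them. The instance profile name, script name, and hadoop-aws version are placeholder assumptions; the hadoop-aws version in particular should match the Hadoop version your Spark build uses.

```python
import shlex

# Placeholder assumptions: replace with your own profile name and a
# hadoop-aws version that matches your Spark build's Hadoop version.
INSTANCE_PROFILE = "my-s3-access-profile"
HADOOP_AWS_VERSION = "3.3.4"

# Run locally: launch the cluster with the IAM role attached.
launch_command = [
    "flintrock", "launch", "test-cluster",
    "--ec2-instance-profile-name", INSTANCE_PROFILE,
]

# Run on the cluster: pull in hadoop-aws so that s3a:// paths resolve.
# Inside my-app.py (a placeholder script), S3 data would then be read
# with paths such as "s3a://my-bucket/path/".
submit_command = [
    "spark-submit",
    "--packages", f"org.apache.hadoop:hadoop-aws:{HADOOP_AWS_VERSION}",
    "my-app.py",
]

for cmd in (launch_command, submit_command):
    print(" ".join(shlex.quote(part) for part in cmd))
```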

Installation

Flintrock requires Python 3.7 or newer unless you use one of our standalone packages.

Standalone version

If you don't have a recent enough version of Python, or if you don't have Python installed at all, you can still use Flintrock.

Standalone packages of Flintrock are published on GitHub.

Download the standalone package for your OS, unzip it to a location of your choice, and run the flintrock executable inside.

Community-supported distributions

Flintrock is also available via the following package managers:

Homebrew: "brew install flintrock"

Automated Pipelines

Flintrock is designed to be used as part of an automated pipeline.
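One possible shape for such a pipeline step is sketched below in Python. It assumes that launch options come from a persisted configuration file, that a non-zero exit code from Flintrock signals failure, and that `--assume-yes` suppresses the confirmation prompt on destroy. `run_pipeline` is a hypothetical helper, not part of Flintrock.

```python
import subprocess


def run_pipeline(cluster_name, job_command, run=subprocess.run):
    """Launch a cluster, run one command on its nodes, then destroy it.

    `run` is injectable so the flow can be exercised without AWS access.
    """
    # Launch options are assumed to come from Flintrock's config file.
    run(["flintrock", "launch", cluster_name], check=True)
    try:
        # check=True raises CalledProcessError on a non-zero exit code,
        # which stops the pipeline at the failing step.
        run(["flintrock", "run-command", cluster_name, job_command],
            check=True)
    finally:
        # Tear the cluster down even if the job failed, without
        # waiting for an interactive confirmation.
        run(["flintrock", "destroy", "--assume-yes", cluster_name],
            check=True)
```

A call such as `run_pipeline("test-cluster", "echo hello")` would launch, run, and destroy in sequence. The `try`/`finally` block is the main design choice: it keeps a failed job from leaving a billable cluster running.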

Managing permanent infrastructure

Flintrock is not for managing long-lived clusters or any infrastructure that is a permanent part of some environment.

If you are looking for ways to manage permanent infrastructure, look at tools like Terraform, Ansible, SaltStack, or Ubuntu Juju.

Launching non-Spark-related services

Flintrock is meant for launching Spark clusters that include closely related services like HDFS, Mesos, and YARN. It is not meant for launching services that are not closely integrated with Spark.

Configurable CLI Defaults

Flintrock lets you persist your desired configuration to a YAML file, so you don't have to keep typing options in the command line.
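As an illustration, the Python sketch below writes a small configuration file and builds a launch command that points at it through the top-level `--config` option. The key layout is modeled on the template that "flintrock configure" opens, so treat it as an assumption and check it against the template shipped with your version; all values and the file location are placeholders.

```python
from pathlib import Path
import tempfile

# Assumed layout, modeled on Flintrock's config template.
# Every value is a placeholder.
CONFIG_TEXT = """\
services:
  spark:
    version: 3.5.0

provider: ec2

providers:
  ec2:
    key-name: key_name
    identity-file: /path/to/key.pem
    instance-type: m5.large
    region: us-east-1
    ami: ami-xxxxxxxx
    user: ec2-user

launch:
  num-slaves: 1
"""

# Write the file to a temporary directory for the sake of the example.
config_path = Path(tempfile.mkdtemp()) / "flintrock-config.yaml"
config_path.write_text(CONFIG_TEXT)

# With the options persisted, the launch command itself stays short.
command = ["flintrock", "--config", str(config_path), "launch", "test-cluster"]
print(" ".join(command))
```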

Flintrock's typical launch time will be a minute or two longer.

Flintrock is a single-purpose tool with minimal focus.

Repository: https://github.com/nchammas/flintrock
