Spark RAPIDS Benchmark Data Generation Script 🚀

This Spark job generates synthetic retail data for benchmarking Spark RAPIDS performance on Kubernetes. The generated data includes sales, customer, and product information in CSV format and is designed for large-scale testing scenarios. By default, the script generates:

📊 Sales Table: 1 billion rows, producing ~60GB of sales data.

👥 Customer Table: 100 million rows, producing ~6GB of customer data.

🛒 Product Table: 100,000 rows, producing ~5MB of product data.
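For reference, the sketch below shows how such a generator might be structured in PySpark. The column names, value ranges, and output path are illustrative assumptions, not the exact schema used by data-generation-retail.py.

# Minimal PySpark sketch of synthetic sales data generation (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-generation-sketch").getOrCreate()

num_sales_rows = 1_000_000_000  # default sales table size (1 billion rows)

sales_df = (
    spark.range(num_sales_rows)                                         # monotonically increasing ids
    .withColumnRenamed("id", "sale_id")
    .withColumn("customer_id", (F.rand() * 100_000_000).cast("long"))   # assumed FK into the customer table
    .withColumn("product_id", (F.rand() * 100_000).cast("long"))        # assumed FK into the product table
    .withColumn("quantity", (F.rand() * 10 + 1).cast("int"))
    .withColumn("price", F.round(F.rand() * 500, 2))
)

# The real script writes to the S3 path passed as the first job argument.
sales_df.write.mode("overwrite").csv("s3a://<S3_BUCKET>/benchmark/input/sales", header=True)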

Spark Operator Configuration ⚙️

Below is a sample Spark Operator YAML configuration for submitting the data generation job on a Kubernetes cluster. This job runs the PySpark script stored in an S3 bucket and generates data that is saved back to S3.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-data-generation
  namespace: data-eng-team
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: kubedai/spark:spark-3.5.1_hadoop-3.3.4_aws-sdk-1.12.773
  imagePullPolicy: Always
  mainApplicationFile: "s3a://<S3_BUCKET>/benchmark/scripts/data-generation-retail.py"
  arguments:
    - "s3a://<S3_BUCKET>/benchmark/input"
    - "1000000000"  # Sales table (60GB)
    - "100000000"   # Customer table (6GB)
    - "100000"      # Product table (5MB)
  sparkVersion: "3.3.1"
  ...
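The spec above is truncated. A typical continuation defines the restart policy and the driver and executor resources; the values below are illustrative assumptions, not the repository's exact settings.

  restartPolicy:
    type: Never
  driver:
    cores: 2
    memory: "8g"
    serviceAccount: spark   # hypothetical service account name
  executor:
    instances: 8
    cores: 4
    memory: "16g"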

How to Deploy on Kubernetes

Step 1: Copy the PySpark Script to S3

Upload the PySpark script to the S3 bucket (e.g., s3://<S3_BUCKET>/benchmark/scripts/data-generation-retail.py). The S3 bucket is created automatically during cluster deployment by the Terraform templates. To find the bucket name, run terraform output from the infra/aws/terraform directory.
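For example, to print the Terraform outputs (which include the bucket name):

cd infra/aws/terraform
terraform output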

aws s3 cp benchmarks/dat-gen/data-generation-retail.py s3://<S3_BUCKET>/benchmark/scripts/

Step 2: Execute the Spark Job 🚦

Make sure you replace <S3_BUCKET> in the YAML with the actual bucket name before running the following command.

kubectl apply -f spark-operator-job.yaml
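To confirm the SparkApplication was accepted and check its current state:

kubectl get sparkapplication spark-data-generation -n data-eng-team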

Step 3: Monitor the Job 🔍

Track the job's progress by tailing the driver pod logs:

kubectl logs -f spark-data-generation-driver -n data-eng-team
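You can also watch the driver and executor pods as they are created and completed:

kubectl get pods -n data-eng-team -w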

Verifying Output Data ✅

Once the job completes, verify that the generated data has been written to the S3 bucket specified in the script arguments. The output data will be stored in CSV format at the path s3a://<S3_BUCKET>/benchmark/input/.

Use the following command to list the files in the S3 bucket:

aws s3 ls s3://<S3_BUCKET>/benchmark/input/
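To check the size of the generated tables, summarize the output recursively:

aws s3 ls s3://<S3_BUCKET>/benchmark/input/ --recursive --human-readable --summarize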

Next Steps: Running Benchmarks

The data generated by this job will be used as the input dataset for running Spark RAPIDS benchmarks. For details on running the benchmarks, refer to the benchmark guide.

Cleanup

To delete the job and clean up resources:

kubectl delete -f spark-operator-job.yaml
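Note that deleting the SparkApplication does not remove the generated data. If you also want to delete the output from S3:

aws s3 rm s3://<S3_BUCKET>/benchmark/input/ --recursive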