
🗓️ 05092025 1514

databricks_cluster_config

🎯 Core Configuration Fields

cluster_name (String)

  • Required: Yes
  • When to use: Always set a descriptive name
  • Best practices: Include environment, team, purpose (e.g., "prod-etl-daily", "dev-analytics")
  • Limits: Max 100 characters

spark_version (String)

  • Required: Yes
  • When to use LTS: Production workloads (stability)
  • When to use latest: Development, new features
  • Avoid: photon- prefixed versions (set runtime_engine instead)
  • Reference: Databricks Runtime Versions

node_type_id (String)

  • Required: Yes
  • General Purpose (m5): Balanced CPU/memory, development
  • Storage Optimized (i3): Local NVMe storage, Delta cache acceleration
  • Memory Optimized (r5): ETL, analytics, large datasets
  • Compute Optimized (c5): CPU-intensive, streaming
  • Reference: AWS Instance Types
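
Putting the three required fields together, a minimal sketch of a create-cluster payload as a Python dict; the runtime string and node type are illustrative placeholders, not recommendations:

```python
# Minimal cluster spec: just the three required fields,
# written as the Python-dict equivalent of the JSON API payload.
cluster_spec = {
    "cluster_name": "prod-etl-daily",      # env-purpose naming convention
    "spark_version": "14.3.x-scala2.12",   # an LTS runtime (illustrative)
    "node_type_id": "m5.xlarge",           # general purpose (illustrative)
}
```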

⚖️ Scaling Configuration

autoscale

  • When to use: Variable workloads, cost optimization
  • Development: 1-3 workers
  • Production ETL: 2-10 workers
  • Analytics: 3-20 workers
  • Mutually exclusive with: num_workers
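
A minimal sketch of the autoscale block, using the production ETL range above:

```python
# Autoscale block (mutually exclusive with num_workers).
autoscale_spec = {
    "autoscale": {
        "min_workers": 2,    # floor: always-on capacity
        "max_workers": 10,   # ceiling: cost guardrail
    }
}
```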

num_workers

  • When to use: Consistent workloads, ML training
  • ML/Training: Fixed size for stability
  • Streaming: Fixed size for predictable performance
  • Mutually exclusive with: autoscale
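
The fixed-size alternative is a single field; the worker count below is illustrative:

```python
# Fixed-size cluster (mutually exclusive with autoscale),
# e.g. for ML training where a stable worker count matters.
fixed_size_spec = {"num_workers": 8}  # illustrative count
```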

💰 Cost Optimization

autotermination_minutes (Long)

  • Required: No, but highly recommended
  • Range: 10-10000 minutes
  • Development: 15-30 minutes
  • Production: 60-120 minutes
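
A one-line sketch for a dev cluster, using a value from the 15-30 minute range above:

```python
# Terminate the cluster after 20 idle minutes (dev-range value).
idle_spec = {"autotermination_minutes": 20}
```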

aws_attributes (AwsAttributes)

  • availability = SPOT_WITH_FALLBACK: ~60% savings, production-safe (falls back to on-demand)
  • availability = SPOT: Maximum savings, dev/testing only (no fallback)
  • availability = ON_DEMAND: Highest reliability, mission-critical
  • Reference: Spot Instance Best Practices
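
A hedged sketch of a production-safe spot setup; field names follow the AwsAttributes structure, values are illustrative:

```python
# Spot with on-demand fallback; keep the driver on-demand.
aws_spec = {
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",
        "first_on_demand": 1,           # first node (the driver) stays on-demand
        "spot_bid_price_percent": 100,  # bid up to the on-demand price
    }
}
```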

🚀 Performance Configuration

runtime_engine (RuntimeEngine)

  • PHOTON: SQL workloads, ETL, analytics (up to ~3x faster)
  • STANDARD: ML, streaming, general compute
  • Cost: Photon carries a ~20% DBU premium, often offset by the speedup
  • Reference: Photon Engine
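
Selecting the engine is a single enum field:

```python
# Photon for SQL/ETL-heavy clusters; use "STANDARD" for ML/streaming.
engine_spec = {"runtime_engine": "PHOTON"}
```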

spark_conf

  • Always enable: Adaptive Query Execution (AQE)
  • ETL workloads: Enable Delta optimizations
  • Large datasets: Tune partition settings
  • Reference: Spark Configuration
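
A sketch of common starting points; the keys are standard Spark/Databricks settings, the values are defaults to tune per workload:

```python
# spark_conf values are strings in the API payload.
conf_spec = {
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",                     # AQE
        "spark.sql.adaptive.coalescePartitions.enabled": "true",  # AQE partition coalescing
        "spark.databricks.delta.optimizeWrite.enabled": "true",   # Delta ETL optimization
        "spark.sql.shuffle.partitions": "auto",                   # let AQE size partitions
    }
}
```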

🔒 Security & Governance

data_security_mode

  • SINGLE_USER: Production, highest security, Unity Catalog
  • USER_ISOLATION: Shared clusters, user separation
  • NONE: Legacy, not recommended for new clusters
  • Reference: Data Security Modes

enable_local_disk_encryption

  • true: Production, compliance requirements
  • false: Development, testing only
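
A production posture combining both security fields; the single_user_name principal is hypothetical:

```python
# Single-user Unity Catalog cluster with encrypted local disks.
security_spec = {
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "etl-sp@example.com",  # hypothetical service principal
    "enable_local_disk_encryption": True,
}
```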

🏷️ Resource Management

custom_tags

  • Required for: Cost tracking, resource management
  • Limit: 45 custom tags max
  • Best practices: Environment, Team, Project, CostCenter
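
A sketch using the recommended tag keys; values are illustrative:

```python
# Tags propagate to underlying cloud resources for cost tracking.
tag_spec = {
    "custom_tags": {
        "Environment": "prod",
        "Team": "data-platform",
        "Project": "daily-etl",
        "CostCenter": "cc-1234",  # illustrative value
    }
}
```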

policy_id

  • When to use: Enforce organizational standards
  • Governance: Restrict instance types, regions, settings
  • Cost control: Limit expensive configurations
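
Attaching a policy is a single ID reference; the value below is a hypothetical placeholder:

```python
# The cluster spec is validated against the policy's rules at creation.
policy_spec = {"policy_id": "ABC123DEF4567890"}  # hypothetical policy ID
```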

🐳 Container & Advanced Options

docker_image

  • When to use: Custom libraries, golden images
  • Benefits: Consistent environments, faster startup
  • Reference: Databricks Container Services
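
A sketch following the Databricks Container Services structure; the registry URL and secret path are hypothetical:

```python
# Custom image; basic_auth can be omitted for public registries.
docker_spec = {
    "docker_image": {
        "url": "myregistry.example.com/golden-image:1.0",  # hypothetical registry
        "basic_auth": {
            "username": "token",
            "password": "{{secrets/registry/token}}",      # hypothetical secret ref
        },
    }
}
```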

init_scripts

  • When to use: Custom software installation, configuration
  • Execution: Sequential order, before Spark starts
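
A sketch with workspace-file script locations (paths illustrative); scripts run in list order before Spark starts:

```python
# Each entry is one script; order in the list is execution order.
init_spec = {
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/install-libs.sh"}},
        {"workspace": {"destination": "/Shared/init/configure-proxy.sh"}},
    ]
}
```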
