🗓️ 05092025 1514
📎
databricks_cluster_config
🎯 Core Configuration Fields
cluster_name (String)
- Required: Yes
- When to use: Always set a descriptive name
- Best practices: Include environment, team, purpose (e.g., "prod-etl-daily", "dev-analytics")
- Limits: Max 100 characters
spark_version (String)
- Required: Yes
- When to use LTS: Production workloads (stability)
- When to use latest: Development, new features
- Avoid: photon- runtime strings (enable Photon via runtime_engine instead)
- Reference: Databricks Runtime Versions
node_type_id (String)
- Required: Yes
- General Purpose (i3): Balanced CPU/memory, development
- Memory Optimized (r5): ETL, analytics, large datasets
- Compute Optimized (c5): CPU-intensive, streaming
- Reference: AWS Instance Types
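A minimal sketch of how these core fields fit together as a Clusters API-style payload; the runtime string and node type are illustrative placeholders, not recommendations:
```python
# Core fields of a cluster spec, expressed as the JSON-style payload the
# Databricks Clusters API expects. Values are illustrative placeholders.
core_config = {
    "cluster_name": "prod-etl-daily",     # environment-team-purpose naming
    "spark_version": "15.4.x-scala2.12",  # an LTS runtime string; pick from your workspace's list
    "node_type_id": "r5.xlarge",          # memory-optimized worker for ETL
}
```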
⚖️ Scaling Configuration
autoscale
- When to use: Variable workloads, cost optimization
- Development: 1-3 workers
- Production ETL: 2-10 workers
- Analytics: 3-20 workers
- Mutually exclusive with: num_workers
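A sketch of the autoscaling block for a production ETL cluster, using the worker ranges above:
```python
# Autoscaling range for a production ETL cluster: Databricks adds and
# removes workers with load, so capacity is only paid for while needed.
scaling_config = {
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10,
    }
}
```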
num_workers
- When to use: Consistent workloads, ML training
- ML/Training: Fixed size for stability
- Streaming: Fixed size for predictable performance
- Mutually exclusive with: autoscale
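The fixed-size alternative, as a sketch; a single spec carries either num_workers or autoscale, never both:
```python
# Fixed-size cluster for ML training or streaming: predictable resources,
# no rebalancing mid-job. Do not combine this with an "autoscale" block;
# the two scaling styles are mutually exclusive within one spec.
fixed_size_config = {
    "num_workers": 8,
}
```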
💰 Cost Optimization
autotermination_minutes (Long)
- Required: No, but highly recommended
- Range: 10-10000 minutes (0 disables auto-termination)
- Development: 15-30 minutes
- Production: 60-120 minutes
aws_attributes (AwsAttributes)
- SPOT_WITH_FALLBACK: ~60% savings, production-safe (falls back to on-demand when spot capacity is unavailable)
- SPOT: Maximum savings, dev/testing only
- ON_DEMAND: Highest reliability, mission-critical
- Reference: Spot Instance Best Practices
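A hedged sketch combining auto-termination with spot-with-fallback AWS attributes; the bid percentage and zone choice are assumptions to tune per workload:
```python
cost_config = {
    # Tear the cluster down after 30 idle minutes (60-120 for production,
    # 0 disables auto-termination entirely).
    "autotermination_minutes": 30,
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot first, on-demand if spot is unavailable
        "first_on_demand": 1,                  # keep the driver node on-demand
        "zone_id": "auto",                     # let Databricks pick an AZ with capacity
        "spot_bid_price_percent": 100,         # bid up to the on-demand price
    },
}
```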
🚀 Performance Configuration
runtime_engine (RuntimeEngine)
- PHOTON: SQL workloads, ETL, analytics (often ~3x faster)
- STANDARD: ML, streaming, general compute
- Cost: higher DBU rate (~20% premium), but the speedup typically lowers total job cost
- Reference: Photon Engine
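Tying back to the spark_version note above, Photon is enabled through runtime_engine against a plain runtime string; a sketch:
```python
# Photon is switched on via runtime_engine against a plain runtime string,
# not by picking a photon-prefixed spark_version.
performance_engine = {
    "spark_version": "15.4.x-scala2.12",  # illustrative runtime string
    "runtime_engine": "PHOTON",           # use "STANDARD" for ML / general compute
}
```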
spark_conf
- Always enable: Adaptive Query Execution (AQE) (enabled by default on recent runtimes)
- ETL workloads: Enable Delta optimizations
- Large datasets: Tune partition settings
- Reference: Spark Configuration
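A sketch of a spark_conf block for an ETL cluster; these are standard Spark/Delta settings, though recent runtimes already enable some of them (AQE in particular) by default, and the "auto" partition value is Databricks-specific:
```python
etl_spark_conf = {
    "spark_conf": {
        # Adaptive Query Execution: runtime re-optimization of joins and partitions
        "spark.sql.adaptive.enabled": "true",
        # Delta write optimizations useful for ETL workloads
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
        # Let Databricks size shuffle partitions for large datasets
        "spark.sql.shuffle.partitions": "auto",
    }
}
```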
🔒 Security & Governance
data_security_mode
- SINGLE_USER: Production, highest security, Unity Catalog
- USER_ISOLATION: Shared clusters, user separation
- NONE: Legacy, not recommended for new clusters
- Reference: Data Security Modes
enable_local_disk_encryption
- true: Production, compliance requirements
- false: Development, testing only
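A sketch of the security fields for a production Unity Catalog cluster; single_user_name is the companion field the API pairs with SINGLE_USER mode, and the principal shown is a placeholder:
```python
security_config = {
    "data_security_mode": "SINGLE_USER",   # or "USER_ISOLATION" for shared clusters
    # SINGLE_USER mode ties the cluster to one identity; placeholder principal below.
    "single_user_name": "etl-sp@example.com",
    "enable_local_disk_encryption": True,  # encrypt local disks for compliance workloads
}
```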
🏷️ Resource Management
custom_tags
- Required for: Cost tracking, resource management
- Limit: 45 custom tags max
- Best practices: Environment, Team, Project, CostCenter
policy_id
- When to use: Enforce organizational standards
- Governance: Restrict instance types, regions, settings
- Cost control: Limit expensive configurations
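A sketch of tagging plus a policy reference; the tag values and policy ID are placeholders:
```python
governance_config = {
    "custom_tags": {
        "Environment": "prod",
        "Team": "data-platform",
        "Project": "daily-etl",
        "CostCenter": "CC-1234",
    },
    # Attach an existing cluster policy; fields the policy fixes or forbids
    # are validated against it at create time.
    "policy_id": "ABC123DEF4567890",  # placeholder ID
}
```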
🐳 Container & Advanced Options
docker_image
- When to use: Custom libraries, golden images
- Benefits: Consistent environments, faster startup
- Reference: Databricks Container Services
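A sketch of the docker_image block, assuming Databricks Container Services is enabled on the workspace; the registry URL and credentials are placeholders:
```python
container_config = {
    "docker_image": {
        "url": "registry.example.com/data-eng/golden-image:1.4.2",  # placeholder image
        "basic_auth": {
            "username": "registry-user",   # placeholders; pull real values from a secret store
            "password": "registry-token",
        },
    }
}
```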
init_scripts
- When to use: Custom software installation, configuration
- Execution: Sequential order, before Spark starts
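A sketch of init_scripts entries; scripts run in the listed order during startup. The Unity Catalog volume and S3 destinations are placeholders:
```python
startup_config = {
    "init_scripts": [
        # Runs first: shared dependencies from a Unity Catalog volume
        {"volumes": {"destination": "/Volumes/main/ops/scripts/install_deps.sh"}},
        # Runs second: monitoring agent setup from S3
        {"s3": {"destination": "s3://my-bucket/init/configure_monitoring.sh",
                "region": "us-east-1"}},
    ]
}
```
In practice the per-section dictionaries above would be merged into a single spec and submitted through the Clusters API create call (or the equivalent SDK/Terraform resource).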