hologres_sql_optimization

🗓️ 03022025 1110

Basic principles

ABSTRACT

Reduce I/O resource consumption and optimize concurrency

If you configure appropriate indexes for data distribution when you create a table, the data that you want to query can be quickly located after an SQL statement is executed
This reduces I/O and computing resource consumption and accelerates queries
Balanced data distribution also helps increase the utilization of concurrent resources and prevent single-node bottlenecks
The following figure shows the execution process from executing an SQL statement to obtaining data

If you perform an SQL query on a partitioned table, the system locates the partition in which the required data is located by using the partition pruning feature
If no partition is matched based on query conditions, the system needs to traverse all partitions, which consumes excessive I/O resources
In most cases, partitioning by day is more appropriate
For a non-partitioned table, partition pruning is not involved

You can use a hologres_distribution_key to quickly locate the shard in which the desired data is located
This reduces the resources that are required by a single SQL statement and supports a high throughput if multiple SQL statements are executed at the same time
If the specific shard cannot be located, the system schedules all shards for computing based on the distributed framework
In this case, a single SQL statement is executed on multiple shards, which consumes a large amount of resources and decreases the overall concurrency of the system
Extra shuffle overhead is incurred if specific operators are executed in a centralized manner
In most cases, you can configure fields whose values are evenly distributed as distribution key fields, such as the order ID, user ID, and event ID fields
We recommend that you configure the same distribution key for the tables that you want to join to allow correlative data to be distributed to the same shard
This helps achieve high efficiency in local join operations

You can use an event time column to quickly locate the file in which the required data is located from multiple files on a node
This eliminates the need to access other files
If the filtering fails, the system needs to traverse all files

You can use the clustering key to quickly locate the desired data range in a file
This helps improve the efficiency of range queries and field sorting