Edit

Share via


Optimize for high concurrency with Azure Data Explorer

Highly concurrent applications are necessary in scenarios with a large user base, where the application simultaneously handles many requests with low latency and high throughput.

Use cases include large-scale monitoring and alerting dashboards. Examples include Microsoft products and services such as Azure Monitor, and Playfab. All these services use Azure Data Explorer for serving high-concurrency workloads. Azure Data Explorer is a fast, fully managed big data analytics service for real-time analytics on large volumes of data streaming from applications, websites, IoT devices, and more.

Note

The actual number of queries that can run concurrently on a cluster depends on factors such as cluster SKU, data volumes, query complexity, and usage patterns.

To set up high-concurrency applications, design the back-end architecture as follows:

This article presents recommendations for each of the preceding subjects that you can implement to achieve high concurrency in an optimal, cost-effective way. Use these features alone or in combination.

Optimize data

For high concurrency, queries should consume the least possible amount of CPU resources. Use any or all of the following methods:

Use table schema design best practices

Use the following table schema design suggestions to minimize the CPU resources used:

  • Define ID columns as string data types, even if the values are numeric. Indexing for string columns is more sophisticated than for numeric columns and provides better filtering performance.
  • Match the column data type optimally to the actual data stored in these columns. For example, don't store datetime values in a string column.
  • Avoid a large sparse table with many columns, and use dynamic columns to store sparse properties.
  • Store frequently used properties in their own column with a nondynamic datatype.
  • Denormalize data to avoid joins that demand relatively large CPU resources.

Partition data

Data is stored in the form of extents (data shards) and is partitioned by ingestion time by default. You can use the partitioning policy to repartition the extents based on a single string column or a single datetime column in a background process. Partitioning can provide significant performance improvements when most of the queries use partition keys to filter, aggregate, or both.

Note

The partitioning process itself uses CPU resources. However, the CPU reduction during query time should outweigh the CPU used for partitioning.

Preaggregate your data by using materialized views

Preaggregate your data to significantly reduce CPU resources during query time. Example scenarios include summarizing data points over a reduced number of time bins, keeping the latest record of a given record, or deduplicating the dataset. Use materialized views for an easy-to-configure aggregated view over source tables. This feature simplifies the effort of creating and maintaining these aggregated views.

Note

The background aggregation process uses CPU resources. However, the CPU reduction during query time should outweigh the CPU consumption for aggregation.

Configure the caching policy

Configure the caching policy so that queries run on data stored in the hot storage, also known as the disk cache. Only run limited, carefully designed scenarios on the cold storage, or external, tables.

Set a leader-follower architecture pattern

The follower database is a feature that follows a database or a set of tables in a database from another cluster located in the same region. You can access this feature through Azure Data Share, Azure Resource Manager APIs, and a set of cluster commands.

Use the leader-follower pattern to set compute resources for different workloads. For example, set up a cluster for ingestions, a cluster for querying or serving dashboards or applications, and a cluster that serves the data science workloads. Each workload in this case has dedicated compute resources that you can scale independently, and different caching and security configurations. All clusters use the same data, with the leader writing the data and the followers using it in a read-only mode.

Note

Follower databases have a lag from the leader, usually of a few seconds. If your solution requires the latest data with no latency, this solution might not be useful. Use a view on the follower cluster that unions the data from the leader and the follower and queries the latest data from the leader and the rest of the data from the follower.

To improve the performance of queries on the follower cluster, you can enable prefetch extents configuration. Use this configuration carefully, because it could affect the freshness of data in the follower database.

Optimize queries

Use the following methods to optimize your queries for high concurrency.

Follow query best practices so that your queries are as efficient as possible.

Use a query results cache

When more than one user loads the same dashboard at a similar time, the dashboard to the second and following users can be served from the cache. This setup provides high performance with almost no CPU usage. Use the query results cache feature, and send query results cache configuration with the query by using the set statement.

Grafana contains a configuration setting for the query results cache at the data source level, so all dashboards use this setting by default and don't need to modify the query.

Configure query consistency

The default query consistency mode is strong. In this mode, an admin node manages metadata and ingestion for the cluster, as well as query planning and delegating execution to other nodes.

In high-concurrency applications, managing queries can cause the admin node's CPU usage to be high, while other nodes are less busy. This situation can cause a bottleneck where the number of concurrent queries can't grow. However, this bottleneck might not be apparent in the cluster's CPU report (Azure portal > {your_cluster} > Metrics > CPU Metric) which shows the average CPU usage for the cluster.

For this scenario, use weak consistency mode. In this mode, more nodes manage queries, which makes it possible to horizontally scale the number of concurrent queries. Nodes in this mode periodically refresh their copy of metadata and newly ingested data, which leads to a latency of typically less than a minute as the data is synchronized. However, this short latency is preferable to the bottleneck situation that can arise when using strong consistency mode.

Set the consistency mode in a workload group query consistency policy, in the client request properties, or in the Grafana data source configuration.

Set cluster policies

The Request rate limit policy caps the number of concurrent requests by default so that the cluster doesn't get overloaded. You can adjust this policy for high-concurrency situations. Adjust this policy only after rigorous testing, preferably on production-like usage patterns and datasets. Testing ensures the cluster can sustain the modified value. Configure this limit based on application needs.

Monitor Azure Data Explorer clusters

Monitoring the health of your cluster resources helps you build an optimization plan by using the features suggested in the preceding sections. Azure Monitor for Azure Data Explorer provides a comprehensive view of your cluster's performance, operations, usage, and failures. Get insights on the queries' performance, concurrent queries, throttled queries, and various other metrics by selecting the Insights (preview) tab under the Monitoring section of the Azure Data Explorer cluster in the Azure portal.

For information on monitoring clusters, see Azure Monitor for Azure Data Explorer.