Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health


Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in AWS to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite.

When working with OpenSearch Service, shard strategy is important. Shards distribute your workload across the data nodes of your cluster. When creating an index, you tell OpenSearch Service how many primary shards to create and how many replicas to create of each shard. The primary shards are independent partitions of the full dataset. OpenSearch Service automatically distributes your data across the primary shards in an index. Our recommendation is to use two replicas for your index. For example, if you set your index’s shard count to three primary shards and two replicas, you will have a total of nine shards. Properly configured indexes can help improve overall domain performance, whereas a misconfigured index will lead to storage and performance skew.
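The shard arithmetic in the example above can be checked with a short sketch. The function name `total_shards` is only illustrative and is not part of the solution’s code.

```python
def total_shards(primary_shards: int, replicas: int) -> int:
    """Return the total number of shard copies for one index.

    Every primary shard is stored once, plus once per configured replica.
    """
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least one primary shard and a non-negative replica count")
    return primary_shards * (1 + replicas)


# Three primary shards with two replicas each.
print(total_shards(3, 2))
```

With three primaries and two replicas, this yields the nine shards mentioned above.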

OpenSearch Service distributes the shards in your indexes to the data nodes in your domain, ensuring that no primary shard and its replicas are placed on the same node. The data for the shards is stored in the node’s storage. If your indexes (and therefore their shards) are very different sizes, the storage used on the data nodes in the domain will be unequal, or skewed. Storage skew leads to uneven memory and CPU utilization, intermittent and uneven latency, and uneven queueing and rejecting of requests. Therefore, it’s important to configure and maintain indexes such that shards can be distributed evenly across the data nodes of your cluster.

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Solution overview

The solution and associated resources are available for you to deploy into your own AWS account as a CloudFormation template. The template deploys the following resources:

  • An AWS Identity and Access Management (IAM) role for the Lambda function called OpensearchSkewMetricsLambdaRole. This allows write access to CloudWatch metrics and access to the CloudWatch log group and OpenSearch APIs.
  • An AWS Lambda function called Opensearch-SkewMetricsPublisher-py.
  • An Amazon CloudWatch log group for the Lambda function called /aws/lambda/Opensearch-skewmetrics-publisher-py.
  • An Amazon EventBridge rule for the Lambda function called EventRuleForOSSkew.
  • The following CloudWatch metrics for the Lambda function:
    • aws_/<region-name>/<MetricIdentifier>/_storagemetric
    • aws_/<region-name>/<MetricIdentifier>/_shardmetric
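As a rough illustration of how per-node values could be published under names like these, the following sketch builds a payload for the CloudWatch `put_metric_data` API. It is not the deployed Lambda function’s code: the helper names, the metric name `SkewPercent`, and the choice of `Percent` as the unit are assumptions made for the example. Only the name pattern and the NodeID dimension come from this post.

```python
def namespace_for(region: str, metric_identifier: str, suffix: str) -> str:
    """Build a name following the aws_/<region-name>/<MetricIdentifier>/<suffix> pattern."""
    return f"aws_/{region}/{metric_identifier}/{suffix}"


def build_metric_data(skew_by_node: dict, metric_name: str = "SkewPercent") -> list:
    """Turn {node_id: skew_percent} into MetricData entries, one per node.

    The metric name is hypothetical; the NodeID dimension mirrors the
    dimension used when browsing the metrics in the CloudWatch console.
    """
    return [
        {
            "MetricName": metric_name,
            "Dimensions": [{"Name": "NodeID", "Value": node_id}],
            "Value": float(skew_percent),
            "Unit": "Percent",
        }
        for node_id, skew_percent in skew_by_node.items()
    ]


def publish(namespace: str, metric_data: list) -> None:
    """Send the payload to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the helpers above work without boto3 installed

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=metric_data)


# Hypothetical Region, identifier, and node values.
storage_namespace = namespace_for("us-east-1", "mydomain", "_storagemetric")
payload = build_metric_data({"node-a": 50.0, "node-b": -50.0})
```

Calling `publish(storage_namespace, payload)` would then write one data point per node.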

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • An OpenSearch Service domain.
  • This post requires you to add a Lambda role to the OpenSearch Service domain’s security configuration access policy. If your domain is using fine-grained access control, then you must follow the steps described in the section Mapping roles to users to enable access to the domain for the newly deployed Lambda execution role after deploying the CloudFormation template.

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

  1. Log in to your AWS account.
  2. Select the Region where you’re running your OpenSearch Service domain.
  3. To launch your CloudFormation stack, choose Launch Stack.
  4. For Stack name, enter a name for the stack (maximum length 30 characters).
  5. For MetricIdentifier, enter a unique identifier that will help you identify the custom CloudWatch metrics for your domain.
  6. For OpensearchDomainURL, enter the domain endpoint that you’re monitoring.
  7. Choose Next.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Create stack.
  9. Wait for the stack creation to complete.
  10. On the Lambda console, choose Functions in the navigation pane.
  11. Choose the Lambda function called Opensearch-SkewMetricsPublisher-py-<stackname>.
  12. In the Code section, choose Test.
  13. Keep the default values for the test event and run a quick test.

Make sure to grant the Lambda execution role permission in the OpenSearch Service domain’s resource-based policy, if you are using one. If fine-grained access control is enabled on the domain, then follow the steps in Mapping roles to users (as mentioned in the prerequisites) to give the Lambda function read-only access to the domain.
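For a domain that uses a resource-based policy, the statement that grants the Lambda execution role read access might look like the following sketch. The account ID, Region, domain name, and role path are placeholders, and limiting the action to `es:ESHttpGet` is an assumption based on the read-only access described above. Adapt it to your own policy rather than copying it as is.

```python
import json


def read_only_statement(role_arn: str, domain_arn: str) -> dict:
    """Build a policy statement that lets one IAM role send HTTP GET requests to a domain."""
    return {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "es:ESHttpGet",
        # The trailing /* covers the API paths under the domain.
        "Resource": f"{domain_arn}/*",
    }


# Placeholder ARNs; replace them with the values from your own account.
statement = read_only_statement(
    "arn:aws:iam::123456789012:role/OpensearchSkewMetricsLambdaRole",
    "arn:aws:es:us-east-1:123456789012:domain/my-domain",
)
policy = {"Version": "2012-10-17", "Statement": [statement]}
print(json.dumps(policy, indent=2))
```

The printed JSON is a complete policy document. In practice you would add this statement to the statements your domain’s access policy already contains.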

The Lambda function that sends OpenSearch domain metrics to CloudWatch is set to a default frequency of 1 day. You can change this configuration to monitor the domain at the required granularity by updating the event schedule for the rule deployed by the CloudFormation stack on the EventBridge console. Note that if the frequency is set to 1 minute, this will trigger the Lambda function every minute and will increase the Lambda cost.
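If you prefer to script the schedule change instead of using the console, a minimal sketch with boto3 could look like this. The helper names are illustrative, and the rule name you pass in must be the actual name of the deployed rule, which may differ from EventRuleForOSSkew if CloudFormation added a suffix. Keep in mind that `put_rule` replaces the existing rule definition, and that changing a stack-managed resource outside of CloudFormation causes drift.

```python
def rate_expression(value: int, unit: str) -> str:
    """Build an EventBridge rate expression such as 'rate(1 day)' or 'rate(5 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects the singular unit for 1 and the plural otherwise.
    plural = "" if value == 1 else "s"
    return f"rate({value} {unit}{plural})"


def update_schedule(rule_name: str, value: int, unit: str) -> None:
    """Apply a new schedule to an existing rule (requires boto3 and AWS credentials)."""
    import boto3  # imported here so rate_expression works without boto3 installed

    boto3.client("events").put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(value, unit),
        State="ENABLED",
    )


print(rate_expression(1, "day"))
print(rate_expression(12, "hour"))
```

For example, `update_schedule("<your-rule-name>", 12, "hour")` would run the function twice a day.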

This solution uses the _cat/allocation API, which provides the number of data nodes in the domain along with each data node’s number of shards and storage usage attributes. For further details on domain storage and shard skew, refer to Node shard and storage skew. The Lambda function processes and sorts each data node’s storage and shard skew from the average value. Any data node whose skew is more than 10% from the average is generally considered to be significantly skewed. This will start to impact CPU, network, and disk bandwidth usage because the nodes with the highest storage utilization tend to be the resource-strained nodes, whereas nodes with less than 10% utilization represent underutilized capacity.
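The calculation described above can be sketched as follows. This is not the deployed Lambda function’s code; it assumes the rows come from `_cat/allocation?format=json&bytes=b`, where each row carries `node`, `shards`, and `disk.indices` fields, and it expresses skew as the percent deviation from the average across data nodes. The function names and sample values are hypothetical.

```python
from statistics import mean


def skew_from_average(rows: list, field: str) -> list:
    """Return (node, percent deviation from the average of `field`) pairs, largest first.

    Rows for unassigned shards carry no disk values and are skipped.
    """
    values = {
        row["node"]: float(row[field])
        for row in rows
        if row.get("node") != "UNASSIGNED" and row.get(field) is not None
    }
    if not values:
        return []
    average = mean(values.values())
    if average == 0:
        return [(node, 0.0) for node in values]
    skew = {node: (value - average) / average * 100.0 for node, value in values.items()}
    return sorted(skew.items(), key=lambda item: item[1], reverse=True)


def significantly_skewed(skew: list, threshold_percent: float = 10.0) -> list:
    """List the nodes whose deviation from the average exceeds the threshold."""
    return [node for node, percent in skew if abs(percent) > threshold_percent]


# Hypothetical rows shaped like the JSON output of _cat/allocation.
sample_rows = [
    {"node": "node-a", "shards": "12", "disk.indices": "300"},
    {"node": "node-b", "shards": "10", "disk.indices": "200"},
    {"node": "node-c", "shards": "8", "disk.indices": "100"},
    {"node": "UNASSIGNED", "shards": "2", "disk.indices": None},
]

storage_skew = skew_from_average(sample_rows, "disk.indices")
shard_skew = skew_from_average(sample_rows, "shards")
print(storage_skew)
print(significantly_skewed(shard_skew))
```

In the sample data, node-a and node-c deviate from the average by more than 10% for both storage and shard count, so both would be flagged, while node-b sits exactly at the average.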

Refer to Demystifying Elasticsearch shard allocation for details related to shard size and shard count strategy. Generally, we recommend keeping shard sizes between 10–30 GB for workloads where search latency is a key performance objective and 30–50 GB for write-heavy workloads. For shard count, we recommend maintaining index shard counts that are divisible by the data node count. For additional details, refer to Sizing Amazon OpenSearch Service domains and Shard strategy.
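One way to combine the two recommendations is sketched below: estimate the primary shard count from a target shard size, then round up to the next multiple of the data node count. The function name and the rounding-up choice are assumptions for illustration. Applying the rule to primary shards is one interpretation of the guidance, and rounding up makes each shard somewhat smaller than the target, so check that the result still falls in the recommended range.

```python
import math


def suggest_primary_shards(index_size_gb: float, target_shard_gb: float, data_nodes: int) -> int:
    """Suggest a primary shard count that is a multiple of the data node count."""
    if index_size_gb <= 0 or target_shard_gb <= 0 or data_nodes < 1:
        raise ValueError("sizes must be positive and data_nodes must be at least 1")
    # Shards needed so that no shard exceeds the target size.
    by_size = math.ceil(index_size_gb / target_shard_gb)
    # Round up to the next multiple of the node count for an even spread.
    return math.ceil(by_size / data_nodes) * data_nodes


# A 300 GB index, 30 GB target shards, and 4 data nodes.
shards = suggest_primary_shards(300, 30, 4)
print(shards, 300 / shards)
```

Here the size-based estimate of 10 shards is rounded up to 12, which gives 25 GB per shard spread evenly over four nodes.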

View skew metrics in CloudWatch

After you run this solution in your account, it will create two CloudWatch metrics for monitoring. To access these CloudWatch metrics, use the following steps:

  1. On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
  2. Choose Browse and select Custom namespaces. You should see two custom metrics ending with _storagemetric and _shardmetric, respectively.
  3. Choose either of the custom metrics and then select NodeID.
  4. In the list of node IDs, select all the nodes displayed in the list, and the graph will be plotted automatically.

You can hover the mouse over the plotted lines to see the node skew information.

The following screenshots show examples of how the CloudWatch metrics will appear on the console.

The storage skew metrics will be similar to the following screenshot. Storage skew metrics show the domain storage skew. If you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by the storage size (largest to smallest). The Lambda function will periodically publish the latest storage skew results.

The shard skew metrics will be similar to the following screenshot. Shard skew metrics show the domain shard skew. If you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by the shard size (largest to smallest). The Lambda function will periodically publish the latest shard skew results.

Storage skew occurs when one or more nodes within the domain have significantly more storage than other nodes. The CloudWatch metric will show higher deviation of storage usage for these nodes vs. other nodes. Similarly, shard skew occurs when one or more nodes have significantly more shards than other nodes. The CloudWatch metric will show higher deviation for these nodes vs. other nodes in the domain. When domain storage or shard skew is detected, you can raise a support case to work with the AWS team on remediation actions. See How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster for information on how to take remediation actions to configure your domain shard strategy for optimal performance.

Costs

The cost associated with using this solution is minimal, around a few cents per month, because it generates CloudWatch metrics. The solution also runs Lambda code, and in this case the Lambda functions make API calls. For pricing details, refer to Amazon CloudWatch Pricing and AWS Lambda Pricing.

Clean up

If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to the AWS CloudFormation console, choose the stack, and choose Delete.

If you want to add the CloudWatch skew monitoring metrics back in at any point, you can create the stack again from the CloudFormation template.

Conclusion

You can use this solution to get a better understanding of your OpenSearch Service domain’s storage and shard skew to improve its performance and possibly lower the cost of operating your domain. See Use Elasticsearch’s _rollover API for efficient storage distribution for more details related to shard allocation and efficient storage distribution strategy.


About the authors

Nikhil Agarwal is a Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also an AI/ML enthusiast and deep dives into customers’ ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 19 years of IT experience in software engineering, cloud operations, and automation. Karthik joined AWS in 2016 as a TAM and has worked with more than a dozen Enterprise customers across US-West. Outside of work, he enjoys spending time with his family.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work, he enjoys mountain biking, traveling, and playing Population:One in VR.
