Enterprise Manager for ZFS Storage: High Latency and High Utilization Disk Count Metric Extension for more controlled notifications

Asset ID:	1-72-2270407.1
Update Date:	2017-06-01
Keywords:

Solution Type: Problem Resolution Sure

Solution 2270407.1 : Enterprise Manager for ZFS Storage: High Latency and High Utilization Disk Count Metric Extension for more controlled notifications

Applies to:

Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS5-2 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS5-4 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

When several clients have data residing on a similar set of drives, they can run into performance issues when that data is accessed simultaneously under heavy load, because multiple drives become over-utilized.

Cause

Systems under heavy load can go through periods in which several disks experience high utilization or high latency events at the same time, resulting in slower client performance.

If an administrator knows about this condition, new clients can be created on alternative appliances with less load.

The appliance lets users alert on individual occurrences of high utilization through analytics datasets and alerts, but because disks can have high utilization events without performance impacts, clients often get alert storms when requesting alerts on individual occurrences.

Solution

One way to detect this condition is to monitor for groups of drives having issues at the same time.

This Metric Extension for the Oracle Enterprise Manager Plug-in for Oracle ZFS Storage monitors two analytics datasets on each target where the Metric Extension is enabled (or run in real time):

io.ops[latency=100000][disk]
io.disks[utilization=95][disk]

The Metric Extension gathers 5 minutes of historical data and returns:

Average number of disks encountering the utilization or latency issue over the 5-minute period - gives a smoothed view that shows whether the problem is sustained across the time window
Maximum number of drives that encountered the issue in any one-second period - shows the administrator the worst case within the window
Minimum number of drives that encountered the issue in any one-second period - shows the administrator whether the event is sustained throughout the time window
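
The three values can be illustrated with a minimal Python sketch. It is not the Metric Extension's own code; it assumes the dataset yields one count of affected disks per second, and the function name summarize_window and the sample data are hypothetical.

```python
def summarize_window(disk_counts):
    """Summarize per-second counts of affected disks for one window.

    disk_counts: one integer per second (300 values for a 5-minute window).
    """
    if not disk_counts:
        raise ValueError("no samples in window")
    return {
        "average": sum(disk_counts) / len(disk_counts),  # smoothed view
        "max": max(disk_counts),  # worst single second in the window
        "min": min(disk_counts),  # above 0 only if the issue never cleared
    }


# Hypothetical window: quiet for 100 s, then 3 disks, then 6 disks affected.
samples = [0] * 100 + [3] * 100 + [6] * 100
summary = summarize_window(samples)
print(summary)
```

In this made-up window the Max shows a surge while the Min stays at 0, which is the pattern of a load that is not sustained across the whole window.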

Requirements

Oracle Enterprise Manager 13.1+
Oracle Enterprise Manager Plug-in for Oracle ZFS Storage 2.1.3 or above
Enterprise Manager Agent hosting Plug-in deployed on a Linux Operating System Environment

Installation

This Metric Extension is written for a Linux environment; it would have to be edited for a Windows environment prior to deployment. It should work unchanged in a Solaris environment, but it has not been tested on Solaris.

Download the appropriate version of the metric extension:

Enterprise Manager 13.1: MEA_ME$BackendLoad131.zip
Enterprise Manager 13.2: MEA_ME$BackendLoad.zip

Installation Steps

Import the appropriate Metric Extension archive (.zip file) to Enterprise Manager:

Log into Oracle Enterprise Manager
Select from the top-level menu: "Enterprise" - "Monitoring" - "Metric Extensions"
Click "Actions" - "Import...", click "Browse", and select the downloaded .zip file
Choose the newly added metric extension and click "Actions" - "Save As Deployable Draft"
Choose the newly added metric extension and click "Actions" - "Publish Metric Extension"

To deploy to targets:

Choose the extension row and click "Actions" - "Deploy To Targets", then click "+Add" and select the targets.

After Deployment

By default, the Metric Extension is "Disabled"
The analytics datasets on the appliance are also disabled by default
For each appliance on which you would like to monitor these metrics:
- Log into the appliance CLI
- Go to "analytics datasets"
- If the dataset io.disks[utilization=95][disk] is not created, create it: create io.disks[utilization=95][disk]
- If the dataset io.ops[latency=100000][disk] is not created, create it: create io.ops[latency=100000][disk]
- Note that if the datasets above are created but suspended, they should be resumed
Modify each target (or a monitoring template) to:
- Enable the Dataset
- Set a collection schedule. A collection interval shorter than 5 minutes is not helpful, and one longer than 5 minutes leaves gaps, because only the last 5 minutes are sampled at each collection
- Diagnosing the over-utilization issue should not require capturing every minute of data, because high utilization is typically sustained through core hours
- Ensure the thresholds on the number of disks are properly set

Each collection covers a 5-minute interval. Three columns of data are captured for each metric:

Average - the average number of disks per second that encountered the high utilization or high latency issue over the 5-minute period. This is most likely the column the administrator will want to set a threshold on.
Max - the maximum number of disks that encountered the issue in any one-second window during the 5-minute period. This can reveal sudden surges in utilization that may be of interest but, because they are not sustained, may not have impacted a client.
Min - the minimum number of disks that encountered the issue in any one-second window during the 5-minute period. If this number is continuously above 0 and increasing over time, the appliance is under a sustained load, and as the number increases, clients will likely be impacted.
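
The guidance on the Min column can be sketched as a check across consecutive collections. This is only an illustration: the function name is hypothetical, and "increasing over time" is interpreted here as a series that never drops and ends higher than it started, which is an assumption rather than a rule from the Metric Extension.

```python
def is_sustained_and_rising(min_values):
    """Return True if Min stayed above 0 and trended upward.

    min_values: the Min column from consecutive 5-minute collections,
    oldest first.
    """
    if len(min_values) < 2:
        return False  # a trend needs at least two collections
    all_positive = all(v > 0 for v in min_values)
    never_drops = all(b >= a for a, b in zip(min_values, min_values[1:]))
    net_increase = min_values[-1] > min_values[0]
    return all_positive and never_drops and net_increase


# Hypothetical Min values from four consecutive collections.
print(is_sustained_and_rising([1, 2, 2, 4]))
print(is_sustained_and_rising([0, 3, 0, 5]))
```

The second series contains surges but returns to 0 between them, so it would not be flagged as a sustained load under this interpretation.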

If the Max and Min show very large values (-9,000,000 and 9,000,000), the dataset on the target is likely suspended; log into the appliance and resume it.
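
A report or script that consumes the collected values could screen for this condition before interpreting them. The sketch below is an assumption-laden example: the function and constant names are hypothetical, and it tests only the magnitude of both values because the note above does not spell out which column carries which sign.

```python
# Magnitude of the placeholder values described above (9,000,000).
PLACEHOLDER_MAGNITUDE = 9_000_000


def looks_suspended(max_value, min_value):
    """Return True if both Max and Min carry the placeholder magnitude."""
    return (abs(max_value) >= PLACEHOLDER_MAGNITUDE
            and abs(min_value) >= PLACEHOLDER_MAGNITUDE)


# Hypothetical rows of (Max, Min) from two collections.
print(looks_suspended(-9_000_000, 9_000_000))
print(looks_suspended(12, 0))
```

A row flagged this way says nothing about disk load; it only suggests checking the dataset state on the appliance.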

Uninstallation Steps

Undeploy from all targets:

Go to "Enterprise" - "Monitoring" - "Metric Extensions"
Select the BackendLoad metric extension, choose "Actions" - "Manage Target Deployments"
Select all deployed targets and click "Undeploy"

Delete the Metric Extension:

Go to "Enterprise" - "Monitoring" - "Metric Extensions"
Select the BackendLoad metric extension, choose "Actions" - "Delete"

Attachments

This solution has no attachment