Asset ID: 1-72-2270407.1
Update Date: 2017-06-01
Solution Type: Problem Resolution Sure Solution
2270407.1: Enterprise Manager for ZFS Storage: High Latency and High Utilization Disk Count Metric Extension for more controlled notifications
Related Items
- Oracle ZFS Storage ZS5-4
- Oracle ZFS Storage ZS3-2
- Oracle ZFS Storage ZS5-2
- Oracle ZFS Storage ZS4-4
Related Categories
- PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
This document provides a Metric Extension for the Oracle Enterprise Manager Plug-in for Oracle ZFS Storage that lets customers set thresholds on the number of drives experiencing high latency or high utilization over a period of time.
Created from <SR 3-12414059291>
Applies to:
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS5-2 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS5-4 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)
Symptoms
When clients have data residing on a similar set of drives, they can run into performance issues when that data is accessed simultaneously under heavy load, because multiple drives become over-utilized.
Cause
Systems under heavy load can experience periods when several disks have high utilization or high latency events, resulting in slower client performance.
Should an administrator know about this condition, new clients could be created on alternative appliances with less load.
The appliance lets users alert on individual occurrences of high utilization through analytics datasets and alerts, but because disks can have high utilization events without performance impacts, clients often get alert storms when requesting alerts on individual occurrences.
Solution
One solution to understanding this issue is to monitor for groups of drives having issues at the same time.
The Metric Extension for the Oracle Enterprise Manager Plug-in for Oracle ZFS Storage monitors two analytics datasets on each target where the Metric Extension is enabled (or run in real time):
- io.ops[latency=100000][disk]
- io.disks[utilization=95][disk]
The Metric Extension gathers 5 minutes of historical data and returns:
- Average number of disks encountering the utilization or latency issue over the 5-minute period. This smoothed value lets a user determine whether the problem is sustained in the time window.
- Maximum number of drives in a one-second period that encountered the issue. This lets an administrator see the worst case in the window.
- Minimum number of drives in a one-second period that encountered the issue. This lets an administrator note whether the event is sustained through the time window.
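The three returned values can be sketched as follows. This is a minimal illustration, not the Metric Extension's actual script: it assumes the 5 minutes of history arrive as a list with one entry per second, where each entry is the number of disks that exceeded the threshold in that second. The function name `summarize_disk_counts` and the sample window are hypothetical.

```python
from statistics import mean


def summarize_disk_counts(per_second_counts):
    """Return (average, maximum, minimum) for a window of per-second disk counts."""
    if not per_second_counts:
        raise ValueError("no samples in window")
    return (
        mean(per_second_counts),
        max(per_second_counts),
        min(per_second_counts),
    )


# Hypothetical 300-second window: 2 disks affected throughout,
# plus a 10-second burst in which 12 disks are affected.
window = [2] * 290 + [12] * 10
average, maximum, minimum = summarize_disk_counts(window)
print(f"average={average:.2f} max={maximum} min={minimum}")
# prints: average=2.33 max=12 min=2
```

In this sample the burst raises the maximum to 12 but barely moves the average, while the minimum of 2 shows that at least two disks were affected in every second of the window.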
Requirements
- Oracle Enterprise Manager 13.1+
- Oracle Enterprise Manager Plug-in for Oracle ZFS Storage 2.1.3 or above
- Enterprise Manager Agent hosting the Plug-in, deployed on a Linux operating system environment
Installation
This Metric Extension is written for a Linux environment. It would have to be edited for a Windows environment prior to deployment. It should work unchanged in a Solaris environment, but it was not tested on Solaris.
Download the appropriate version of the metric extension:
- Enterprise Manager 13.1: MEA_ME%24BackendLoad131.zip
- Enterprise Manager 13.2: MEA_ME%24BackendLoad.zip
Installation Steps
Import the appropriate Metric Extension archive (.zip file) to Enterprise Manager:
- Log into Oracle Enterprise Manager
- Select from the top-level menu: "Enterprise" - "Monitoring" - "Metric Extensions"
- Click "Actions" - "Import...", click "Browse", and select the downloaded .zip file
- Choose the newly added metric extension and click "Actions" - "Save As Deployable Draft"
- Choose the newly added metric extension and click "Actions" - "Publish Metric Extension"
To deploy to targets:
- Choose the extension row and click "Actions" - "Deploy To Targets", click "+Add" and select targets.
After Deployment
- By default, the Metric Extension is "Disabled"
- The analytics datasets on the appliance are also disabled by default
- For each appliance you would like to monitor these metrics against:
  - Log into the appliance CLI
  - Go to "analytics datasets"
  - If the dataset io.disks[utilization=95][disk] is not created, create it: create io.disks[utilization=95][disk]
  - If the dataset io.ops[latency=100000][disk] is not created, create it: create io.ops[latency=100000][disk]
  - Note that if the datasets above are created but suspended, they should be resumed
- Modify each target (or a monitoring template) to:
  - Enable the Dataset
  - Set a collection schedule. Collection cycles shorter than 5 minutes are not helpful, and cycles longer than 5 minutes leave gaps in knowledge because only the last 5 minutes are sampled
  - Diagnosing the over-utilization issue should not require every minute of data capture, because high utilization on busy systems is sustained through core hours
  - Ensure the thresholds on the number of disks are properly set
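The effect of the collection schedule on coverage can be worked through with simple arithmetic. The sketch below assumes only what is stated above, namely that each collection looks back over the last 5 minutes. The function names `coverage` and `unobserved_minutes` are hypothetical and are not part of the Metric Extension.

```python
WINDOW_MINUTES = 5  # each collection samples the last 5 minutes


def coverage(interval_minutes):
    """Fraction of wall-clock time observed when collecting every interval_minutes."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    return min(1.0, WINDOW_MINUTES / interval_minutes)


def unobserved_minutes(interval_minutes):
    """Minutes in each collection cycle that fall outside every sample window."""
    return max(0, interval_minutes - WINDOW_MINUTES)


# Compare a 1-minute, 5-minute, and 15-minute collection schedule.
for interval in (1, 5, 15):
    print(interval, f"{coverage(interval):.0%}", unobserved_minutes(interval))
```

Under this assumption, a 1-minute schedule re-reads mostly the same data on each cycle, a 5-minute schedule covers every minute exactly once, and a 15-minute schedule leaves 10 minutes of each cycle unobserved.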
Each collection covers a 5-minute interval. Three columns of data are captured for each metric:
- Average: the average number of disks per second encountering the high utilization or high latency issue over the 5-minute period. This is most likely the value the administrator will want to set a threshold on.
- Max: the maximum number of disks in a one-second window that encountered the high utilization or high latency issue during the 5-minute window. This can indicate sudden surges in utilization that may be of interest but, because they are not sustained, may not have impacted a client.
- Min: the minimum number of disks in a one-second window that encountered the high utilization or high latency issue during the 5-minute window. If this number is continuously above 0 and increasing over time, the appliance is under a sustained load, and as the number increases clients will likely be impacted.
If Max and Min show very large values (-9,000,000 and 9,000,000), the dataset on the target is likely suspended. Log into the appliance and resume it.
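The interpretation rules above can be summarized in a short sketch. It is illustrative only: the function name `interpret`, the category strings, and the `avg_threshold` parameter (a site-specific disk count) are hypothetical, and the only values taken from this document are the 9,000,000 magnitude and the meaning of the three columns.

```python
SENTINEL_MAGNITUDE = 9_000_000  # values this large suggest a suspended dataset


def interpret(average, maximum, minimum, avg_threshold):
    """Classify one collected row of (Average, Max, Min) disk counts."""
    if abs(maximum) >= SENTINEL_MAGNITUDE or abs(minimum) >= SENTINEL_MAGNITUDE:
        return "suspended: resume the dataset on the appliance"
    if average >= avg_threshold:
        return "sustained: average disk count at or above threshold"
    if minimum > 0:
        return "every second affected: watch whether Min rises over time"
    if maximum > 0:
        return "transient: surge not sustained across the window"
    return "ok"


# Hypothetical rows, using an assumed threshold of 4 disks on the average.
print(interpret(0, -9_000_000, 9_000_000, avg_threshold=4))
print(interpret(5.0, 12, 2, avg_threshold=4))
print(interpret(0.3, 12, 0, avg_threshold=4))
```

The suspended-dataset check comes first so that placeholder values are never compared against a threshold. A single row with Min above 0 is only a hint; the document's guidance is to watch whether that value stays above 0 and grows across successive collections.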
Uninstallation Steps
Undeploy from all targets:
- Go to "Enterprise" - "Monitoring" - "Metric Extensions"
- Select the BackendLoad metric extension, choose "Actions" - "Manage Target Deployments"
- Select all deployed targets and click "Undeploy"
Delete the Metric Extension:
- Go to "Enterprise" - "Monitoring" - "Metric Extensions"
- Select the BackendLoad metric extension, choose "Actions" - "Delete"
Attachments
This solution has no attachment