VSM - Performance Problem Analysis

Asset ID:	1-75-1487938.1
Update Date:	2016-08-01
Keywords:

Solution Type Troubleshooting Sure

Solution 1487938.1 : VSM - Performance Problem Analysis

Applies to:

Sun StorageTek VSM System - Version 4 to 5C [Release 4.0 to 5.0]
Sun StorageTek VSM5 System - Version All Versions to All Versions [Release All Releases]
IBM z/OS on System z

Purpose

VSM performance problems can be induced by changes in workload or other changes to the environment, by microcode faults, or by hardware faults. This document is intended to help the reader isolate the source of a performance problem.

Troubleshooting Steps

Symptoms

The user has detected a decline in performance of the VSM solution.

What indicators are being used to determine that you have a VSM performance problem?
1. Wall clock/job run time
2. SMF performance data
3. Jobs are backing up in the queue
4. Other
When did the problem first occur, or when was it first noticed?
What is the impact of the performance problem?
1. Batch windows or SLAs are being missed
2. Backups are not completing before the online day begins
3. Other
Is data available that documents the change in performance?

Examples: SMF, RMF data, Channel Utilization information
Can the performance problem be demonstrated at will?

Changes

Has anything changed in the environment?
1. Workload
2. Hardware changes
  PLEASE NOTE:
  The introduction of a new VTSS with an empty or nearly empty buffer can trigger a performance problem.
  This is because software will direct work to the VTSS with the most space available in its buffer.
  The imbalance of workload that takes place when too much of the work is directed to the VTSS with the most available space can result in a performance problem.
  Another contributing factor here can be the fact that any updates made via the new machine will first require the data to be recalled into the machine. These recalls take time and can impact performance.
  Once the buffer utilization of the new machine matches that of the older machines this type of performance problem should be resolved.
  This kind of situation can be mitigated by providing smaller PCAP StorageKeys at first and increasing them over time to the full size of the buffer that was purchased.
  A similar benefit may be possible if the customer initially reduces the LAMT and HAMT values on the new VTSS and gradually raises them until they match the LAMT and HAMT values used on the other VTSS units at the site.
3. Microcode changes
4. Software changes
5. Configuration changes
6. Other
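The imbalance described in the hardware-changes note above can be illustrated with a toy model. This is only a sketch under stated assumptions: the VTSS names, the utilization figures, and the rule "every job goes to the VTSS with the lowest buffer utilization and adds one percentage point" are hypothetical simplifications, not the actual selection logic of the software, and migration and recalls are ignored.

```python
def assign_jobs(utilization, jobs, job_size=1.0):
    """Toy model: send each job to the VTSS with the lowest buffer utilization.

    utilization: dict of VTSS name -> buffer utilization in percent (hypothetical).
    Returns (jobs assigned per VTSS, final utilization per VTSS).
    """
    util = dict(utilization)              # copy so the caller's data is untouched
    counts = {name: 0 for name in util}
    for _ in range(jobs):
        target = min(util, key=util.get)  # VTSS with the most free space
        util[target] += job_size          # assumed cost of one job
        counts[target] += 1
    return counts, util


# Two established units and one newly installed, nearly empty unit.
counts, final = assign_jobs({"OLD1": 70.0, "OLD2": 72.0, "NEW": 5.0}, jobs=60)
print(counts)
print(final)
```

In this model the new unit receives every job until its utilization catches up with the older units, which is the situation the note describes.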

Possible Causes

Have you received any SLS messages on the console indicating the VSM has a hardware problem?

SLS message example:

01.13.15 STC00736 *:SLS6659I VTSS VSMPROD
SIM:0010100000008FE0111000039E003F1041030E9E0000735905104200F1008FEE
01.13.15 STC00736 :SLS6974I Fault reported by VTSS: VSMPROD Model:VSM5 Serial:00201234 FSC:7359 FRU:FEE

If an SLS message similar to the above has been received, please open a Service Request to have the hardware repaired.
Are any RTDs offline? The display command is: D RTD
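Where console output is saved to a file, a search for SLS6974I can be scripted. The following is a minimal sketch: the regular expression is inferred solely from the single example message above, so the exact field layout is an assumption, and the function name is hypothetical.

```python
import re

# Pattern derived from the one SLS6974I example in this document (assumption).
SLS6974I = re.compile(
    r"SLS6974I Fault reported by VTSS:\s*(?P<vtss>\S+)\s+"
    r"Model:\s*(?P<model>\S+)\s+"
    r"Serial:\s*(?P<serial>\S+)\s+"
    r"FSC:\s*(?P<fsc>\S+)\s+"
    r"FRU:\s*(?P<fru>\S+)"
)


def find_vtss_faults(log_text):
    """Return one dict per SLS6974I message found in the console log text."""
    return [m.groupdict() for m in SLS6974I.finditer(log_text)]


sample = (
    "01.13.15 STC00736 *:SLS6659I VTSS VSMPROD\n"
    "SIM:0010100000008FE0111000039E003F1041030E9E0000735905104200F1008FEE\n"
    "01.13.15 STC00736 :SLS6974I Fault reported by VTSS: VSMPROD "
    "Model:VSM5 Serial:00201234 FSC:7359 FRU:FEE\n"
)
faults = find_vtss_faults(sample)
for fault in faults:
    print(fault)
```

The extracted VTSS name, serial number, FSC, and FRU values are the details worth including when the Service Request is opened.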

Information on why RTD offline status can impact performance:
1. If too many RTDs are offline then this can lead to performance issues because the VTSS cannot migrate data out quickly enough and it can fill up. The performance issue is then induced when it approaches being full. If RTDs are found to be offline, please vary them online. If RTDs go offline repeatedly please open a Service Request to have this problem investigated.
2. If all or nearly all RTDs are busy doing migrates and few or none are doing recalls, this can also manifest as a performance problem. It can indicate that the workload going to the VTSS is too great: the VTSS may be saturated and unable to migrate data out at the rate data is coming in. In this case, reducing the workload may allow the VTSS to catch up and thereby resolve the performance problem. Alternatively, see the statesave workload analysis described below.
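The relationship between incoming data rate, migrate rate, and a filling buffer can be estimated with simple arithmetic. The sketch below is a rough model only: the function name, parameters, and the example figures are hypothetical, and it assumes constant rates and that migrated data frees buffer space immediately.

```python
def hours_until_full(capacity_gb, dbu_percent, ingest_gb_per_hr, migrate_gb_per_hr):
    """Estimate the hours until the buffer is full, or None if it is not filling.

    All inputs are assumed values supplied by the reader, for example from
    SMF data; none of them are read from the VSM by this sketch.
    """
    net_gb_per_hr = ingest_gb_per_hr - migrate_gb_per_hr
    if net_gb_per_hr <= 0:
        return None                      # migration keeps up with the workload
    free_gb = capacity_gb * (100 - dbu_percent) / 100
    return free_gb / net_gb_per_hr


# Example: 10,000 GB buffer at 60% DBU, 500 GB/h coming in, 300 GB/h migrating out.
estimate = hours_until_full(10_000, 60, 500, 300)
print(estimate)
```

If RTDs go offline, the effective migrate rate drops, the net rate rises, and the estimate shrinks accordingly.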
For legacy VSM (VSM4, VSM5, or VSM5C; not applicable to VSM6): what is the level of the VSM Buffer Utilization (DBU)? The display command is: D VTSS

Information on why Buffer Utilization can affect performance:
1. The VSM has an internal background task called ‘Free Space Collection’ (FSC). FSC takes a storage area that contains deleted data and moves it into the ‘Collected Free Space’ area, thereby making that storage area available to receive future write command data.
  
  FSC collects cylinders only when the system is down to 25% collected cylinders left and there are 10 cylinders with at least 25% free space on them. EFS will then collect cylinders until the system is back above 25% collected cylinders left.
  
  FSC normally runs as a background task and has minimal or no impact on performance. However, if the rate of data coming into the VSM is greater than its ability to migrate data out then FSC may become more aggressive. There are two stages of aggressive FSC. The first level of aggressive free space collection engages when there are 12.5% free cylinders left. The second level of aggressive free space collection engages when there are 6.25% free cylinders left. FSC becomes more aggressive by using more resources within the VSM to accomplish its task. This leaves fewer resources available to perform work for the customer. This behavior can then be perceived as a performance problem.
  
  Consequently, it is counterproductive from a performance perspective to try to run the VSM at or near full capacity.
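The free-space thresholds quoted above can be summarized as a small lookup. This is a sketch, not VSM microcode behavior: the stage labels and the function name are hypothetical, treating each threshold as inclusive is an assumption, and the additional criterion of 10 cylinders with at least 25% free space is not modeled.

```python
def fsc_stage(free_cylinder_percent):
    """Map the percentage of free cylinders to the FSC stage described above."""
    if free_cylinder_percent <= 6.25:
        return "aggressive level 2"      # second level of aggressive collection
    if free_cylinder_percent <= 12.5:
        return "aggressive level 1"      # first level of aggressive collection
    if free_cylinder_percent <= 25.0:
        return "background collection"   # normal collection criteria reached
    return "no collection needed"


for pct in (30.0, 20.0, 10.0, 5.0):
    print(pct, fsc_stage(pct))
```

The lower the free-cylinder percentage, the more internal resources FSC consumes and the fewer remain for customer workload.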
If a VSM statesave (NDSS or disruptive) is captured near the time of the performance problem, the statesave files can be used to analyze the internal workload of the VSM. This analysis compares the front-end (host interface) and back-end (hard drive and/or VLE storage) workload against the product's available bandwidth. The statesave must be taken on the VSM and its files sent to support, who then perform the analysis. The analysis can reveal whether a performance problem is being caused by an excessive workload (meaning the VSM is being worked beyond its design specifications). To have this work performed, please open a Service Request.
Would you be willing to have an organization like Oracle’s Advanced Customer Services (formerly Professional Services) perform a study of your VSM solution to ensure it is running at the optimum level? Please note that there will be a cost associated with this activity.

Solution

As can be seen above, VSM performance problems have multiple potential causes. Therefore, the resolution will vary according to the problem's root cause.

Attachments

This solution has no attachment