Asset ID: 1-75-1021113.1
Update Date: 2018-04-20
Solution Type: Troubleshooting Sure Solution
1021113.1: Sun Storage 2500, 2500-M2, 6000 and Flexline Arrays: Troubleshooting RAID Controller Failures
Related Items
- Sun Storage 6580 Array
- Sun Storage 6180 Array
- Sun Storage Flexline 280 Array
- Sun Storage 6780 Array
- Sun Storage 2540-M2 Array
- Sun Storage 2540 Array
- Sun Storage 2510 Array
- Sun Storage 6140 Array
- Sun Storage Flexline 210 Array
- Sun Storage 2530-M2 Array
- Sun Storage 2530 Array
- Sun Storage Flexline 380 Array
- Sun Storage 6540 Array
- Sun Storage Flexline 240 Array
- Sun Storage 6130 Array
Related Categories
- PLA-Support>Sun Systems>DISK>Arrays>SN-DK: 6130
Previously Published As: 271129
Applies to:
Sun Storage Flexline 240 Array - Version Not Applicable and later
Sun Storage 2530-M2 Array - Version Not Applicable and later
Sun Storage 2530 Array - Version Not Applicable and later
Sun Storage 6140 Array - Version Not Applicable and later
Sun Storage 6780 Array - Version Not Applicable and later
All Platforms
Purpose
This document describes how to troubleshoot Sun Storage Array controller failures in a duplex (dual controller) environment. For problems with a simplex (single controller) array, please contact Oracle support.
Symptoms:
- Seven-segment display of the controller shows a repeating pattern:
  - 88
  - L# (where # is some value)
- Amber LED on controller
- Critical Fault for RPA Memory Error (xx.66.1041)
- Critical Fault for Controller is Offline (xx.66.1028)
Please validate that each troubleshooting step below is true for your environment. Each step links to a document with instructions for validating the step and taking corrective action as necessary. The steps are ordered to isolate the issue and identify the proper resolution, so please do not skip a step.
Troubleshooting Steps
1. Verify Array Critical Faults
Refer to <Document 1021057.1> Sun Storage Common Array Manager (CAM): How to Verify Critical Faults for Sun Storage 2500, 2500-M2, 6000 and J4000 Arrays.
- If the fault is listed as OFFLINE or RPA Memory Error, go to Step 2.
- Otherwise, go to Step 11.
2. Verify Array Model
Refer to <Document 1021066.1> Verify Sun Storage[TM] Array Array Type via the User Interface.
Array Model | Instructions |
Sun Storage 2510, Sun Storage 2530, Sun Storage 2540 | If the Critical Fault from Step 1 was Controller Offline, go to Step 5. If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7. To identify the correct replacement part number for a Controller, refer to <Document 1546658.1> Sun Storage 2500 Arrays: Controller Replacement Requires Verifying Cache Size. This document only applies to those arrays where a cache upgrade has been done using marketing option XTA-2500-1GBMEM. |
Sun Storage 6140, Sun Storage 6540, Sun Storage Flexline 380 | If the Critical Fault from Step 1 was Controller Offline, go to Step 3. If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7. |
Sun Storage 6130, Sun Storage Flexline 240, Sun Storage Flexline 280 | If the Critical Fault from Step 1 was Controller Offline, go to Step 5. If the Critical Fault from Step 1 was RPA Memory Error, go to Step 10. |
Sun Storage 6180, Sun Storage 2530-M2, Sun Storage 2540-M2 | If the Critical Fault from Step 1 was Controller Offline, go to Step 4. If the Critical Fault from Step 1 was RPA Memory Error, go to Step 10. |
Sun Storage 6580, Sun Storage 6780 | If the Critical Fault from Step 1 was Controller Offline, go to Step 4. If the Critical Fault from Step 1 was RPA Memory Error, go to Step 7. |
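The routing in the Step 2 table can be written as a small lookup, which may help when scripting a triage checklist. This is only a sketch: the function name and the fault labels "offline" and "rpa_memory" are hypothetical names chosen here, and the step numbers are transcribed from the table above.

```python
# Routing table transcribed from the Step 2 table. The labels "offline" and
# "rpa_memory" are invented for this sketch; they are not CAM identifiers.
STEP_2_ROUTES = {
    ("2510", "2530", "2540"): {"offline": 5, "rpa_memory": 7},
    ("6140", "6540", "Flexline 380"): {"offline": 3, "rpa_memory": 7},
    ("6130", "Flexline 240", "Flexline 280"): {"offline": 5, "rpa_memory": 10},
    ("6180", "2530-M2", "2540-M2"): {"offline": 4, "rpa_memory": 10},
    ("6580", "6780"): {"offline": 4, "rpa_memory": 7},
}


def next_step_after_model_check(model: str, fault: str) -> int:
    """Return the step number that Step 2 points to for this model and fault."""
    for models, routes in STEP_2_ROUTES.items():
        if model in models:
            return routes[fault]
    raise KeyError(f"array model not covered by the Step 2 table: {model}")
```

For example, `next_step_after_model_check("6140", "offline")` returns 3, matching the second row of the table.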
3. Verify 7-segment Display on 6140/6540/FLX380 array controller.
The user interface does not currently show the contents of the seven-segment display, which shows the tray ID for the array under optimal conditions. On arrays of these types, the display provides additional system status. If you are not local to the system, you will need someone on site to read the display. What it shows varies with the array model and the error status.
Refer to <Document 1021109.1> Sun StorageTek[TM] 6140, 6540, and Flexline 380 Array Controller 7-Segment LED.
- If the 7-segment display shows "88", this indicates a possible intermittent issue. Go to Step 5.
- If the 7-segment display shows L2 or L3 for the controller, the subsystem has taken the controller offline due to persistent memory faults (L2) or a hardware fault (L3). Go to Step 10.
- If the 7-segment display shows an L-code other than L2 or L3, go to Step 11.
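The three outcomes of Step 3 amount to a short decision function. The sketch below assumes the display reading is passed in as a plain string such as "88" or "L2"; the function name is hypothetical, and readings the step does not mention raise an error instead of guessing.

```python
def next_step_after_step_3(display: str) -> int:
    """Map a 6140/6540/Flexline 380 7-segment reading to the next step."""
    code = display.strip().upper()
    if code == "88":
        return 5   # possible intermittent issue
    if code in ("L2", "L3"):
        return 10  # controller taken offline for memory or hardware faults
    if code.startswith("L"):
        return 11  # any other L-code needs further analysis
    raise ValueError(f"reading not covered by Step 3: {display!r}")
```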
4. Verify 7-segment Display on 6180/6580/6780/2530M2/2540M2 array controller.
The user interface does not currently show the contents of the seven-segment display, which shows the tray ID for the array under optimal conditions. On arrays of these types, the display provides additional system status. If you are not local to the system, you will need someone on site to read the display. What it shows varies with the array model and the error status.
Refer to <Document 1021110.1> Sun Storage[TM] 6x80 and 2500-M2 Array Controller 7-Segment Display.
For 2530-M2/2540-M2/6180/6580/6780:
- If the 7-segment display flashes either OS+ OL+ blank- or SE+ 88+ blank-, this indicates a possible intermittent issue. Go to Step 5.
- If the 7-segment display flashes OE+ L2+ dash+ CF+ P#+ blank-, SE+ dF+ dash+ CF+ P#+ blank-, or OE+ L3+ blank-, a Controller Processor Memory DIMM has failed due to parity errors (L2), or the system has detected a hardware fault, and the controller has been placed offline. Go to Step 10.
- If the 7-segment display flashes OE+ L2+ dash+ CF+ d#+ blank- or SE+ dF+ dash+ CF+ d#+ blank-, the Processor Memory on the 6180 has failed due to parity errors, or the system has detected a hardware fault, and the controller has been placed offline. Go to Step 10.
For 6580/6780 only:
- If the 7-segment display flashes OE+ L2+ dash+ CF+ C#+ blank- or SE+ dF+ dash+ CF+ C#+ blank-, the Controller Data Cache Memory DIMM has failed due to parity errors, and the controller has been placed offline. Refer to <Document 1117584.1> Troubleshooting Sun Storage[TM] 6580/6780 Cache Memory DIMM Faults.
- If the 7-segment display flashes SE+ dF+ dash+ CF+ H#+ blank-, the Host Interface Card (HIC) in slot # for the controller has either failed or is missing. Refer to <Document 1120725.1> Troubleshooting Sun Storage[TM] 6580/6780 Host Interface Card Faults.
- If the 7-segment display flashes SE+ L8+ blank+ CF+ Cx+ blank-, the cache configuration does not match the alternate controller's configuration. Refer to <Document 1117584.1> Troubleshooting Sun Storage[TM] 6580/6780 Cache Memory DIMM Faults.
If none of the errors above are displayed, go to Step 10.
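As a rough illustration of the Step 4 rules, the sketch below classifies a flashing sequence written in the notation used above (tokens separated by spaces, each ending in "+" or "-"). The function name and the `is_6580_or_6780` flag are hypothetical. It relies on the fact that every sequence in this step other than the two intermittent patterns and the 6580/6780-only component codes leads to Step 10.

```python
def step_4_action(sequence: str, is_6580_or_6780: bool = False) -> str:
    """Return the action Step 4 gives for a flashing 7-segment sequence."""
    # Drop the trailing "+" / "-" markers so "P1+" becomes "P1".
    tokens = [token.rstrip("+-") for token in sequence.split()]

    # The two patterns described as a possible intermittent issue.
    if tokens in (["OS", "OL", "blank"], ["SE", "88", "blank"]):
        return "Step 5"

    # 6580/6780 only: the token after "CF" names the failed component.
    if is_6580_or_6780 and "CF" in tokens[:-1]:
        component = tokens[tokens.index("CF") + 1]
        if component.startswith("C"):  # cache DIMM fault or cache mismatch
            return "Document 1117584.1"
        if component.startswith("H"):  # host interface card failed or missing
            return "Document 1120725.1"

    # Everything else in Step 4, including "none of the above", is Step 10.
    return "Step 10"
```

For instance, `step_4_action("SE+ dF+ dash+ CF+ H2+ blank-", is_6580_or_6780=True)` returns "Document 1120725.1", while the same sequence with P1 in place of H2 returns "Step 10".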
5. Online the RAID controller.
Attempt to place the RAID controller online using the user interface. The symptoms indicated so far point to something other than a hardware problem on the RAID controller itself.
Sun Storage Common Array Manager
Browser
- Expand Storage Array in the left window menu tree
- Click on your array name
- Click on the Service Advisor button in the top right corner of the browser window.
- Find and Expand Place a Controller Online in the Troubleshooting and Recovery Section
- Select the faulted controller, and follow the instructions in the right hand pane to place the controller online.
Service CLI
Locations:
Solaris: /opt/SUNWsefms/bin
Windows: c:\Program Files\Sun\Common Array Manager\Component\fms\bin
Linux: /opt/sun/cam/private/fms/bin
service -d array_name -c revive -t [a | b]
NOTE: You must specify controller slot location A or B.
SANtricity Storage Manager
GUI
- Open the Array Management Window for your array
- Select the array controller (will have a red X on it)
- Open the Advanced Menu
- Select the Recovery Sub-Menu
- Select the Place Controller Sub-Menu
- Select Online
SMcli
SMcli -n array_name -c "set controller [(a|b)] availability=online;"
NOTE: You must specify controller slot location A or B.
- If the request to online the controller fails, go to Step 11.
- If the request to online the controller is successful, and the controller stays up for longer than 5 minutes, go to Step 6.
- If the request to online the controller was successful, but the controller went offline again, go to Step 11.
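When the same action has to be run against several arrays, the two command lines from this step can be generated from the array name and controller slot. This is a minimal sketch that only fills in the templates shown above; the function name and dictionary keys are hypothetical, and it builds strings without executing anything.

```python
def online_controller_commands(array_name: str, slot: str) -> dict:
    """Fill in the Step 5 command templates for one controller slot."""
    slot = slot.lower()
    if slot not in ("a", "b"):
        raise ValueError("controller slot must be A or B")
    return {
        # Common Array Manager service CLI form
        "cam": f"service -d {array_name} -c revive -t {slot}",
        # SANtricity SMcli form
        "smcli": f'SMcli -n {array_name} -c "set controller [{slot}] availability=online;"',
    }
```

Calling `online_controller_commands("array01", "A")["cam"]` yields `service -d array01 -c revive -t a`, where "array01" is a placeholder array name.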
6. Reset SOC and RLS counters on the array for monitoring.
You have indicated that the array controller was successfully placed online and made available for longer than 5 minutes. If this issue is intermittent, the controller may go offline again. In order to help with diagnosis in the event that this occurs, we need to set baselines for error statistics on the array.
The RLS (Read Link Status) and SOC (Switch On Chip) statistics are collected as part of normal array support collections, and can be zeroed out very easily for further diagnosis, as follows. Often, a controller will go offline due to a communication issue, which requires this data as part of the investigation.
NOTE: Resetting the counters is not available for 2510, 2530, or 2540 arrays, although the error counters are still collected. If this is your array type, you do not need to run the commands, but should still follow the instructions below on what action to take.
Sun Storage Common Array Manager
Service CLI
Locations:
Solaris: /opt/SUNWsefms/bin
Windows: c:\Program Files\Sun\Common Array Manager\Component\fms\bin
Linux: /opt/sun/cam/private/fms/bin
service -d array_name -c reset -t soc
service -d array_name -c reset -t rls
SANtricity Storage Manager
SMcli
SMcli -n array_name -c "reset storageArray RLSBaseline;"
SMcli -n array_name -c "reset storageArray SOCBaseline;"
- If the array controller remains online for longer than 48 hours, continue monitoring for a period of 2 weeks. If it is still online after that point, the problem was likely due to a software error or state inconsistency, and no further action is required. You may want to consider updating firmware if an update is available.
- If the array controller goes offline again in less than 2 weeks, go to Step 11.
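The baseline reset commands in this step, together with the note that they do not apply to 2510, 2530, and 2540 arrays, can be sketched as a small generator. The function name and the `tool` values "cam" and "santricity" are hypothetical labels; the command strings are the ones listed above, and nothing is executed.

```python
from typing import List

# Models for which the note in this step says the reset is not available.
NO_RESET_MODELS = {"2510", "2530", "2540"}


def baseline_reset_commands(array_name: str, model: str, tool: str) -> List[str]:
    """Return the SOC and RLS baseline reset commands, or [] if not applicable."""
    if model in NO_RESET_MODELS:
        return []
    if tool == "cam":
        return [
            f"service -d {array_name} -c reset -t soc",
            f"service -d {array_name} -c reset -t rls",
        ]
    if tool == "santricity":
        return [
            f'SMcli -n {array_name} -c "reset storageArray RLSBaseline;"',
            f'SMcli -n {array_name} -c "reset storageArray SOCBaseline;"',
        ]
    raise ValueError(f"unknown tool: {tool}")
```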
7. Check for additional Critical Faults besides the RPA Memory Error
Under certain circumstances documented in unpublished bugs, the RPA Memory Error may be false. Check the list of faults on the array for any of the following:
REC_LOST_REDUNDANCY_DRIVE(xx.66.1076)
REC_PATH_DEGRADED(xx.66.1032)
Note: The opening sentence of this step originally attributed the false RPA Memory Error to bugs 6767241 and 6797173 by number. It was reworded because those bugs are internal.
- If any of these faults exists in addition to the RPA Memory Error, the RPA Memory Error may be false. Continue to Step 8 to review your firmware revisions.
- If none of these faults exists on your array, go to Step 10.
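The check in this step can be sketched as a set lookup over the array's fault codes. This assumes the codes are available as strings in the xx.66.NNNN form used in this document; the function name is hypothetical, and only the last field of each code is compared because the leading field varies.

```python
RPA_MEMORY_ERROR = "1041"
# Faults that, alongside an RPA Memory Error, suggest the error may be false.
COMPANION_FAULTS = {
    "1076": "REC_LOST_REDUNDANCY_DRIVE",
    "1032": "REC_PATH_DEGRADED",
}


def next_step_after_step_7(fault_codes) -> int:
    """Return 8 if a companion fault accompanies the RPA Memory Error, else 10."""
    suffixes = {code.rsplit(".", 1)[-1] for code in fault_codes}
    if RPA_MEMORY_ERROR not in suffixes:
        raise ValueError("Step 7 applies only when an RPA Memory Error is present")
    return 8 if suffixes & set(COMPANION_FAULTS) else 10
```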
8. Verify your array firmware.
Use the following document to check your firmware against the table below:
<Document 1021067.1> How to Verify Sun Storage[TM] Array Firmware Using Sun Storage Common Array Manager or SANtricity.
Array Model | Firmware | Action |
Sun Storage Flexline 380, Sun Storage 6540, Sun Storage 6140 | 07.50.xx.xx, 07.60.xx.xx | You are not exposed to the bugs. The controller should be replaced. Continue to Step 10. |
Sun Storage Flexline 380, Sun Storage 6540, Sun Storage 6140 | 07.10.xx.xx, 07.15.xx.xx | You are exposed to bug 6767241, which causes false RPA Memory Errors along with the faults listed in Step 7. To correct the condition, go to Step 9. |
Sun Storage Flexline 380, Sun Storage 6540, Sun Storage 6140 | 06.60.xx.xx, 06.19.xx.xx, 06.16.xx.xx, 06.15.xx.xx | You are not exposed to the bugs. The controller should be replaced. Continue to Step 10. |
Sun Storage 6580/6780 | 07.50.xx.xx, 07.60.xx.xx | You are not exposed to the bugs. The controller should be replaced. Continue to Step 10. |
Sun Storage 6580/6780 | 07.30.xx.xx | You are exposed to bug 6767241, which causes false RPA Memory Errors along with the faults listed in Step 7. To correct the condition, go to Step 9. |
Sun Storage 2510/2530/2540 | 07.35.50.10, 07.35.55.10, 07.35.44.10 | You are not exposed to the bugs. The controller should be replaced. Continue to Step 10. |
Sun Storage 2510/2530/2540 | 07.35.10.10 | You are exposed to bug 6767241, which causes false RPA Memory Errors along with the faults listed in Step 7. To correct the condition, go to Step 9. |
Sun Storage 2510/2530/2540 | 06.70.xx.xx, 06.17.xx.xx | You are not exposed to the bugs. The controller should be replaced. Continue to Step 10. |
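The firmware table can be applied mechanically by treating "xx" as a wildcard field. The sketch below transcribes the rows above; the family labels and function names are hypothetical, and a firmware level that the table does not list returns None instead of a guessed step.

```python
# (family label, firmware patterns, next step) transcribed from the table.
FIRMWARE_ROWS = [
    ("6140/6540/Flexline 380", ["07.50.xx.xx", "07.60.xx.xx"], 10),
    ("6140/6540/Flexline 380", ["07.10.xx.xx", "07.15.xx.xx"], 9),
    ("6140/6540/Flexline 380",
     ["06.60.xx.xx", "06.19.xx.xx", "06.16.xx.xx", "06.15.xx.xx"], 10),
    ("6580/6780", ["07.50.xx.xx", "07.60.xx.xx"], 10),
    ("6580/6780", ["07.30.xx.xx"], 9),
    ("2510/2530/2540", ["07.35.50.10", "07.35.55.10", "07.35.44.10"], 10),
    ("2510/2530/2540", ["07.35.10.10"], 9),
    ("2510/2530/2540", ["06.70.xx.xx", "06.17.xx.xx"], 10),
]


def _matches(version: str, pattern: str) -> bool:
    """Compare dotted fields, letting "xx" in the pattern match any value."""
    v_parts, p_parts = version.split("."), pattern.split(".")
    return len(v_parts) == len(p_parts) and all(
        p == "xx" or p == v for v, p in zip(v_parts, p_parts)
    )


def next_step_after_step_8(family: str, version: str):
    """Return 9 (exposed), 10 (not exposed), or None if the table has no row."""
    for row_family, patterns, step in FIRMWARE_ROWS:
        if row_family == family and any(_matches(version, p) for p in patterns):
            return step
    return None
```

A 6580 at a 07.30 firmware level, for example `next_step_after_step_8("6580/6780", "07.30.12.10")` with an illustrative version string, returns 9.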
9. Power Cycle the RAID Controller Tray to clear the false RPA memory error.
This procedure requires an outage, as the surviving controller will hold the faulted controller in a fault state. The RAID Tray and only the RAID Tray requires a power cycle.
After performing a power cycle of the RAID Tray, review the Critical Fault list in your user interface.
- If the fault persists, the controller will require replacement. Continue to Step 10.
- If the fault is cleared, update your firmware to a version where 6767241 is fixed:
2510/2530/2540 arrays: 07.35.44.10 or later
6580/6780/6140/6540/Flexline 380 arrays: 07.50.08.10 or later
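Whether a firmware level meets the "or later" thresholds above can be checked by comparing the dotted fields numerically. This is a sketch under the assumption that firmware versions order field by field as integers; the family labels and function names are hypothetical.

```python
# Minimum versions from Step 9 in which 6767241 is fixed.
MIN_FIXED_VERSION = {
    "2510/2530/2540": "07.35.44.10",
    "6580/6780/6140/6540/Flexline 380": "07.50.08.10",
}


def _as_tuple(version: str):
    """Turn "07.35.44.10" into (7, 35, 44, 10) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def meets_fixed_version(family: str, version: str) -> bool:
    """True if the version is at or above the Step 9 threshold for the family."""
    return _as_tuple(version) >= _as_tuple(MIN_FIXED_VERSION[family])
```

This only tests the threshold. Older 06.xx release trains fall below it even though the Step 8 table lists them as not exposed, so a False result is not by itself evidence of exposure.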
10. Have the controller replaced.
You have indicated that the 7-segment display on the array controller, or a critical fault for an RPA Memory Error, shows that the RAID controller requires replacement.
Please supply:
Critical Fault
7-Segment display
Array Support Data Collection:
- Refer to <Document 1002514.1> Collecting Sun Storage Common Array Manager Support Data for Arrays.
- Refer to <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek SANtricity Storage Manager.
and contact Oracle.
11. Provide Data for further analysis
At this point you have validated that each troubleshooting step is true for your environment and the issue still exists. Therefore further troubleshooting is required to identify the issue.
Please provide:
7-Segment Display if available
Array Support Data Collection:
- Refer to <Document 1002514.1> Collecting Sun Storage Common Array Manager Support Data for Arrays.
- Refer to <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek SANtricity Storage Manager.
and contact Oracle.
Internal Comments
WARNING!!!!
For HIC and Cache DIMM replacements on a 6580 and 6780, the module should be taken offline.
While the array controllers can be removed for replacement of these components, you cannot have the controller
out of the array for more than 15 minutes or the module will overheat, causing residual damage to the remaining
components.