How to collect data for Netra CT900 related problems

Asset ID:	1-71-1534465.1
Update Date:	2017-05-18
Keywords:

Solution Type Technical Instruction Sure

Solution 1534465.1 : How to collect data for Netra CT900 related problems

Applies to:

Sun Netra CT900 Server - Version All Versions and later
Sun Netra CP3010 Blade Server - Version Not Applicable and later
Sun Netra CP3060 ATCA Blade Server - Version Not Applicable and later
Sun Netra CP3260 ATCA Blade Server - Version Not Applicable and later
Netra T3-1BA - Version Not Applicable and later
Information in this document applies to any platform.

Goal

Describe the data the customer should collect for troubleshooting.

Solution

The customer should collect the relevant information for each of the following chassis components:

SAP (Shelf Alarm Panel)
FT (Fan Tray)
PEM (Power Entry Module)
ShMM (Shelf Management Module)
Blade/RTM

SAP (Shelf Alarm Panel)

Symptom: Serial port to ShMM not working

To do:

Connect using a terminal server or a laptop/workstation
Verify that the serial port baud rate is set correctly
Check the integrity of the cable
If all of the above are OK, replace the SAP

Fan Tray (FT)

Symptom:

SAP alarm LED lights up
FT RED/BLUE LED lights up
Sensor shows an abnormal temperature (reported by an SNMP message or a triggered alarm)
Fans run at an abnormal speed (FT not at level 5)

To do:

Identify which FT is at fault
Check the LED status using: [clia] getfruledstate 20 [3|4|5]
Check fan state ([clia] fans) and cooling state ([clia] shelf -v fs, [clia] shelf -v cs)
Check sensor reading: [clia] sensordata board <slot #>
Check air filter

Example output:

      # clia getfruledstate 20 3
      20: FRU # 3, Led # 0 ("BLUE LED"):
          Local Control LED State: LED OFF

      20: FRU # 3, Led # 1 ("LED 1"):
          Local Control LED State: LED OFF

      20: FRU # 3, Led # 2 ("LED 2"):
          Local Control LED State: LED ON, color: GREEN

      # clia fans
      20: FRU # 3
          Current Level: 5
          Minimum Speed Level: 0, Maximum Speed Level: 15
      20: FRU # 4
          Current Level: 5
          Minimum Speed Level: 0, Maximum Speed Level: 15
      20: FRU # 5
          Current Level: 5
          Minimum Speed Level: 0, Maximum Speed Level: 15

      # clia shelf -v cs
          Cooling state: "Normal"
          Sensor(s) at this state: (0x8e,6,0) (0x90,7,0) (0x90,8,0) (0x90,23,0)
                                   (0x90,24,0) (0x90,25,0) (0x90,40,0) (0x90,41,0)
                                   (0x90,42,0) (0x8e,5,0) (0x90,6,0) (0x86,20,0)
                                   (0x86,21,0) (0x86,22,0) (0x86,23,0) (0x86,24,0)
                                   (0x86,25,0) (0x86,26,0) (0x92,6,0) (0x92,7,0)
                                   (0x92,30,0) (0x92,31,0) (0x86,19,0) (0x94,7,0)
                                   (0x94,8,0) (0x94,23,0) (0x94,24,0) (0x94,25,0)
                                   (0x94,40,0) (0x94,41,0) (0x94,42,0) (0x92,5,0)
                                   (0x9a,5,0) (0x9a,6,0) (0x9a,29,0) (0x9a,30,0)
                                   (0x9a,31,0) (0x94,6,0) (0x96,5,0) (0x96,6,0)
                                   (0x96,29,0) (0x96,30,0) (0x96,31,0) (0x9a,4,0)
                                   (0x9c,4,0) (0x9c,5,0) (0x96,4,0) (0x82,20,0)
                                   (0x82,36,0) (0x82,37,0) (0x82,44,0) (0x82,45,0)
                                   (0x82,52,0) (0x82,53,0) (0x9c,3,0) (0x82,10,0)
                                   (0x88,6,0) (0x88,7,0) (0x88,8,0) (0x88,23,0)
                                   (0x88,24,0) (0x88,25,0) (0x20,120,0) (0x20,121,0)
                                   (0x20,122,0) (0x20,123,0) (0x20,124,0) (0x20,125,0)
                                   (0x20,126,0) (0x20,200,0) (0x20,201,0) (0x98,3,0)
                                   (0x98,4,0) (0x98,5,0)

      # clia shelf -v fs
          Fans state: "Normal"
          Sensor(s) at this state: (0x10,8,0) (0x10,10,0) (0x10,11,0) (0x10,13,0)
                                   (0x10,14,0) (0x10,7,0)

NOTE: When only the SAP LED is lit, all of the data above should check out as normal, because the fan trays have already sped up and cooled the chassis. In that case, simply clear the alarm ([clia] alarm clear).

NOTE: Be sure to re-seat the FT at least once before deciding to replace it.
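
The fan level check above can be scripted once the "clia fans" output has been captured to a file or string. The following is a minimal sketch, not a supported tool: the function names parse_fan_levels and abnormal_fans are hypothetical, the sample text is copied from the example output above, and level 5 is taken as the expected level only because the symptom list above treats "FT not at level 5" as abnormal.

```python
import re

# Sample text copied from the "clia fans" example output in this document.
SAMPLE_FANS_OUTPUT = """
20: FRU # 3
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 4
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 5
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
"""

# "20: FRU # 3" introduces a fan tray; leading whitespace is ignored.
FRU_RE = re.compile(r"^\s*[0-9a-fA-F]+:\s*FRU\s*#\s*(\d+)")
LEVEL_RE = re.compile(r"^\s*Current Level:\s*(\d+)")


def parse_fan_levels(text):
    """Return {fru_number: current_level} parsed from 'clia fans' output."""
    levels = {}
    current_fru = None
    for line in text.splitlines():
        match = FRU_RE.match(line)
        if match:
            current_fru = int(match.group(1))
            continue
        match = LEVEL_RE.match(line)
        if match and current_fru is not None:
            levels[current_fru] = int(match.group(1))
    return levels


def abnormal_fans(levels, expected_level=5):
    """Return the sorted FRU numbers whose level differs from expected_level."""
    return sorted(fru for fru, level in levels.items() if level != expected_level)


if __name__ == "__main__":
    fan_levels = parse_fan_levels(SAMPLE_FANS_OUTPUT)
    print("Fan levels:", fan_levels)
    print("FRUs not at the expected level:", abnormal_fans(fan_levels))
```

Any FRU number the sketch reports is a candidate for the re-seat and air filter checks listed above; the script only reads captured text and does not talk to the ShMM.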

PEM (Power Entry Module)

Symptom: RED/BLUE LED lights up

To do:

If replacement is needed, the customer must provide a licensed electrician.
Check the LED status using: [clia] getfruledstate 20 [6|7]
Check PEM sensors for any "Entity Absent" state
Replace the faulty PEM

Example output:

      # clia getfruledstate 20 6
      20: FRU # 6, Led # 0 ("BLUE LED"):
          Local Control LED State: LED OFF

      20: FRU # 6, Led # 1 ("LED 1"):
          Local Control LED State: LED OFF

      20: FRU # 6, Led # 2 ("LED 2"):
          Local Control LED State: LED ON, color: GREEN

      # clia sensor 20 | grep PEM
      20: LUN: 0, Sensor # 162 ("PEM A In 2")
      20: LUN: 0, Sensor # 163 ("PEM A In 2 Fused")
      20: LUN: 0, Sensor # 164 ("PEM A In 1")
      20: LUN: 0, Sensor # 165 ("PEM A In 1 Fused")
      20: LUN: 0, Sensor # 166 ("PEM A In 4")
      20: LUN: 0, Sensor # 167 ("PEM A In 4 Fused")
      20: LUN: 0, Sensor # 168 ("PEM A In 3")
      20: LUN: 0, Sensor # 169 ("PEM A In 3 Fused")
      20: LUN: 0, Sensor # 174 ("PEM B In 2")
      20: LUN: 0, Sensor # 175 ("PEM B In 2 Fused")
      20: LUN: 0, Sensor # 176 ("PEM B In 1")
      20: LUN: 0, Sensor # 177 ("PEM B In 1 Fused")
      20: LUN: 0, Sensor # 178 ("PEM B In 4")
      20: LUN: 0, Sensor # 179 ("PEM B In 4 Fused")
      20: LUN: 0, Sensor # 180 ("PEM B In 3")
      20: LUN: 0, Sensor # 181 ("PEM B In 3 Fused")
      20: LUN: 0, Sensor # 192 ("PEM A")
      20: LUN: 0, Sensor # 193 ("PEM B")
      20: LUN: 0, Sensor # 200 ("PEM A Temp")
      20: LUN: 0, Sensor # 201 ("PEM B Temp")

      # clia sensordata 20 164
      20: LUN: 0, Sensor # 164 ("PEM A In 1")
          Type: Discrete (0x6f), "Entity Presence" (0x25)
          Status: 0xc0
              All event messages enabled from this sensor
              Sensor scanning enabled
              Initial update completed
          Sensor reading: 0x00
          Current State Mask 0x0001
              Entity Present
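
The LED and "Entity Absent" checks for the PEM can be applied to captured command output with a short script. This is a rough sketch under stated assumptions: the helper names parse_led_states and entity_absent are hypothetical, the sample text is copied from the example output above, and the "Entity Absent" string is assumed to appear in the sensordata output exactly as quoted in the To do list.

```python
import re

# Sample text copied from the example output in this document.
SAMPLE_LED_OUTPUT = """
20: FRU # 6, Led # 0 ("BLUE LED"):
    Local Control LED State: LED OFF

20: FRU # 6, Led # 1 ("LED 1"):
    Local Control LED State: LED OFF

20: FRU # 6, Led # 2 ("LED 2"):
    Local Control LED State: LED ON, color: GREEN
"""

SAMPLE_SENSORDATA_OUTPUT = """
20: LUN: 0, Sensor # 164 ("PEM A In 1")
    Type: Discrete (0x6f), "Entity Presence" (0x25)
    Status: 0xc0
    Sensor reading: 0x00
    Current State Mask 0x0001
        Entity Present
"""

LED_HEADER_RE = re.compile(r'FRU\s*#\s*(\d+),\s*Led\s*#\s*(\d+)\s*\("([^"]*)"\)')
LED_STATE_RE = re.compile(
    r"Local Control LED State:\s*LED\s+(ON|OFF)(?:,\s*color:\s*(\w+))?"
)


def parse_led_states(text):
    """Return a list of (fru, led_name, state, color) tuples.

    color is None when the output gives no color (for example, LED OFF).
    """
    results = []
    header = None
    for line in text.splitlines():
        match = LED_HEADER_RE.search(line)
        if match:
            header = (int(match.group(1)), match.group(3))
            continue
        match = LED_STATE_RE.search(line)
        if match and header is not None:
            results.append((header[0], header[1], match.group(1), match.group(2)))
            header = None
    return results


def entity_absent(sensordata_text):
    """Return True if the captured sensordata output reports 'Entity Absent'."""
    return "Entity Absent" in sensordata_text


if __name__ == "__main__":
    for fru, name, state, color in parse_led_states(SAMPLE_LED_OUTPUT):
        print(fru, name, state, color)
    print("Entity absent:", entity_absent(SAMPLE_SENSORDATA_OUTPUT))
```

Running entity_absent over the sensordata output of each PEM sensor number listed by "clia sensor 20 | grep PEM" would narrow down which PEM input to inspect.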

ShMM (Shelf Management Module)

Symptom:

Could not log in
Could not ping
No console access to blade ([clia] console <slot #>)
Firmware upgrade problem
SNMP related problem

To do:

Obtain from the customer a clear description and a log of what has been attempted
Collect:

/tmp/debug.log (created by command /etc/summary)
/etc/shelfman.conf
/etc/openhpi.conf
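
The three files above can be bundled into one archive before they are sent in. The following is only a sketch: it assumes Python is available on the machine where the files reside, which may not be true on the ShMM itself (in that case, copy the files off first and adjust the paths). The function name bundle_files and the archive name are hypothetical.

```python
import os
import tarfile

# Paths named in the list above; adjust them if the files were copied elsewhere.
SHMM_FILES = ["/tmp/debug.log", "/etc/shelfman.conf", "/etc/openhpi.conf"]


def bundle_files(paths, archive_path):
    """Add each existing file to a gzip-compressed tar archive.

    Returns (added, missing) so that missing files can be reported
    rather than silently skipped.
    """
    added, missing = [], []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            if os.path.isfile(path):
                archive.add(path)
                added.append(path)
            else:
                missing.append(path)
    return added, missing


if __name__ == "__main__":
    # "shmm-data.tar.gz" is a hypothetical archive name.
    added, missing = bundle_files(SHMM_FILES, "shmm-data.tar.gz")
    print("Added:", added)
    print("Missing:", missing)
```

A missing /tmp/debug.log usually just means that /etc/summary has not been run yet, since that command creates the file.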

For any networking problem (remote login or ping), make sure there is a route to the ShMM and that the route is pingable in both directions.

For console access (or netconsole), check /etc/openhpi.conf and the switch blade settings. Make sure the VLAN 55 IP addresses (from /var/netcons.ip) are pingable.

For a firmware upgrade problem, obtain a complete log and check the command arguments carefully. Make sure the correct version is used and the README file is followed.

For an SNMP problem, check two things:

"df -k" output: /dev/ram0 should not be full; otherwise there is not enough swap memory and some processes will be shut down
CP3060 voltage events on sensor 9 (see doc 1346085.1 for sensor numbers) or on other voltage sensors: if the voltage drops to the threshold, numerous IPMB events are generated and the ShMM stops responding to SNMP probing. The workaround is to lower the threshold:

        # clia help setthreshold

          Set the specified threshold of the dedicated sensor
                unc    - Upper Non Critical
                uc     - Upper Critical
                unr    - Upper Non Recoverable
                lnc    - Lower Non Critical
                lc     - Lower Critical
                lnr    - Lower Non Recoverable
          instead of <addr> user may use:
                board <N>
                shm <N>
          to access the sensor on the specified board
                "-r <value>" considers <value> as unsigned byte
                just "<value>" considers as the floating point number
                setthreshold board 21 "IPMB LINK" unc -r 34
                setthreshold 20 8 lc -45.67

          setthreshold <addr> [ lun: ] | unc | uc | unr | lnc | lc | lnr [-r] value
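
The "df -k" check described above can be applied to captured output with a few lines of Python. This is a sketch only: the function name ram0_usage_percent is hypothetical, and the sample numbers are invented for illustration and are not taken from a real ShMM. It assumes the usual "df -k" column order (filesystem, total 1K blocks, used, available).

```python
# Hypothetical "df -k" output; the numbers are illustrative only.
SAMPLE_DF_OUTPUT = """\
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/ram0                 8192      4096      4096  50% /
"""


def ram0_usage_percent(df_output, device="/dev/ram0"):
    """Return used/total as a percentage for device, or None if not listed."""
    for line in df_output.splitlines():
        fields = line.split()
        # Expect at least: filesystem, total blocks, used, available.
        if len(fields) >= 4 and fields[0] == device:
            total_kb = int(fields[1])
            used_kb = int(fields[2])
            if total_kb == 0:
                return None
            return 100.0 * used_kb / total_kb
    return None


if __name__ == "__main__":
    usage = ram0_usage_percent(SAMPLE_DF_OUTPUT)
    if usage is None:
        print("/dev/ram0 not found in df output")
    else:
        print("/dev/ram0 usage: %.1f%%" % usage)
```

A value close to 100 would match the "filled up" condition described above. The document does not give a threshold short of full, so any warning level chosen is a local decision.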

Blade / RTM

Symptom: Blade shut down, hang, panic; could not boot, etc.

To do:

Collect explorer / core dump / snapshot
Troubleshoot as you would any other UltraSPARC system
Be sure to re-seat the blade before replacing it: when a blade dies, residual values in the IPMB controller (H8) might prevent the blade from booting back up normally. If rebooting the blade does not clear them, re-seating the blade will.

Attachments

This solution has no attachment