Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1534465.1
Update Date:2017-05-18
Keywords:

Solution Type  Technical Instruction Sure

Solution  1534465.1 :   How to collect data for Netra CT900 related problems  


Related Items
  • Netra T3-1BA
  • Sun Netra CT900 Server
  • Sun Netra CP3260 ATCA Blade Server
  • Sun Netra CT1600 Server
  • Sun Netra CP3010 Blade Server
  • Sun Netra CP3060 ATCA Blade Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Usx/Blade/Netra>SN-SPARC: Netra Cxxxx


How to collect relevant data based on problematic component of Netra CT900



Applies to:

Sun Netra CT900 Server - Version All Versions and later
Sun Netra CP3010 Blade Server - Version Not Applicable and later
Sun Netra CP3060 ATCA Blade Server - Version Not Applicable and later
Sun Netra CP3260 ATCA Blade Server - Version Not Applicable and later
Netra T3-1BA - Version Not Applicable and later
Information in this document applies to any platform.

Goal

Have the customer collect the data relevant to troubleshooting a Netra CT900 problem.

Solution

The customer should collect the relevant information according to which of the following chassis components is affected:

  •     SAP (Shelf Alarm Panel)
  •     FT (Fan Tray)
  •     PEM (Power Entry Module)
  •     ShMM (Shelf Management Module)
  •     Blade/RTM



SAP (Shelf Alarm Panel)

  Symptom: Serial port to ShMM not working

  To do:

  1. Connect using a terminal server or a laptop/workstation
  2. Verify that the serial port baud rate is set correctly
  3. Check the integrity of the cable
  4. If all of the above are OK, replace the SAP



Fan Tray (FT)

  Symptom:

  • SAP alarm LED lights up
  • FT RED/BLUE LED lights up
  • Sensor shows an abnormal temperature (either an SNMP message is received or an alarm is triggered)
  • Fans run at an abnormal speed (FT not at level 5)

  To do:

  1. Identify which FT is at fault
  2. Check the LED status using: [clia] getfruledstate 20 [3|4|5]
  3. Check fan state ([clia] fans) and cooling state ([clia] shelf -v fs, [clia] shelf -v cs)
  4. Check sensor reading: [clia] sensordata board <slot #>
  5. Check air filter

  Example output:

        # clia getfruledstate 20 3
        20: FRU # 3, Led # 0 ("BLUE LED"):
            Local Control LED State: LED OFF

        20: FRU # 3, Led # 1 ("LED 1"):
            Local Control LED State: LED OFF

        20: FRU # 3, Led # 2 ("LED 2"):
            Local Control LED State: LED ON, color: GREEN

        # clia fans
        20: FRU # 3
            Current Level: 5
            Minimum Speed Level: 0, Maximum Speed Level: 15
        20: FRU # 4
            Current Level: 5
            Minimum Speed Level: 0, Maximum Speed Level: 15
        20: FRU # 5
            Current Level: 5
            Minimum Speed Level: 0, Maximum Speed Level: 15

        # clia shelf -v cs
            Cooling state: "Normal"
            Sensor(s) at this state: (0x8e,6,0) (0x90,7,0) (0x90,8,0) (0x90,23,0)
                                     (0x90,24,0) (0x90,25,0) (0x90,40,0) (0x90,41,0)
                                     (0x90,42,0) (0x8e,5,0) (0x90,6,0) (0x86,20,0)
                                     (0x86,21,0) (0x86,22,0) (0x86,23,0) (0x86,24,0)
                                     (0x86,25,0) (0x86,26,0) (0x92,6,0) (0x92,7,0)
                                     (0x92,30,0) (0x92,31,0) (0x86,19,0) (0x94,7,0)
                                     (0x94,8,0) (0x94,23,0) (0x94,24,0) (0x94,25,0)
                                     (0x94,40,0) (0x94,41,0) (0x94,42,0) (0x92,5,0)
                                     (0x9a,5,0) (0x9a,6,0) (0x9a,29,0) (0x9a,30,0)
                                     (0x9a,31,0) (0x94,6,0) (0x96,5,0) (0x96,6,0)
                                     (0x96,29,0) (0x96,30,0) (0x96,31,0) (0x9a,4,0)
                                     (0x9c,4,0) (0x9c,5,0) (0x96,4,0) (0x82,20,0)
                                     (0x82,36,0) (0x82,37,0) (0x82,44,0) (0x82,45,0)
                                     (0x82,52,0) (0x82,53,0) (0x9c,3,0) (0x82,10,0)
                                     (0x88,6,0) (0x88,7,0) (0x88,8,0) (0x88,23,0)
                                     (0x88,24,0) (0x88,25,0) (0x20,120,0) (0x20,121,0)
                                     (0x20,122,0) (0x20,123,0) (0x20,124,0) (0x20,125,0)
                                     (0x20,126,0) (0x20,200,0) (0x20,201,0) (0x98,3,0)
                                     (0x98,4,0) (0x98,5,0)

        # clia shelf -v fs
            Fans state: "Normal"
            Sensor(s) at this state: (0x10,8,0) (0x10,10,0) (0x10,11,0) (0x10,13,0)
                                     (0x10,14,0) (0x10,7,0)

  NOTE: When only the SAP LED is lit, all of the data above should check out as normal, because the FTs have already sped up and cooled the chassis.  Only the alarm needs to be cleared ([clia] alarm clear).

  NOTE: Be sure to re-seat the FT at least once before deciding to replace it.
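
  The fan-level check in step 3 can be sketched as a small parser. This is only an illustration, assuming the "clia fans" output has been saved as text in the format shown in the example above; the function name and the idea of scripting the check are not part of the original procedure.

```python
import re

NORMAL_LEVEL = 5  # fan level this article treats as normal


def abnormal_fan_trays(clia_fans_output, normal_level=NORMAL_LEVEL):
    """Return {FRU number: current level} for fan trays not at normal_level."""
    abnormal = {}
    fru = None
    for line in clia_fans_output.splitlines():
        # Header line such as "20: FRU # 3"
        header = re.match(r"\s*[0-9a-fA-F]+:\s*FRU\s*#\s*(\d+)", line)
        if header:
            fru = int(header.group(1))
            continue
        # Detail line such as "Current Level: 5"
        level = re.match(r"\s*Current Level:\s*(\d+)", line)
        if level and fru is not None and int(level.group(1)) != normal_level:
            abnormal[fru] = int(level.group(1))
    return abnormal


# Output copied from the example in this article.
SAMPLE = """\
20: FRU # 3
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 4
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 5
    Current Level: 5
    Minimum Speed Level: 0, Maximum Speed Level: 15
"""

if __name__ == "__main__":
    print(abnormal_fan_trays(SAMPLE))
```

  With the example output above, every fan tray is at level 5, so the result is an empty dictionary. Any FRU number that does appear in the result identifies the fan tray to inspect first.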

 


PEM (Power Entry Module)

  Symptom: RED/BLUE LED lights up

  To do:

  1. If replacement is needed, the customer must provide a licensed electrician.
  2. Check the LED status using: [clia] getfruledstate 20 [6|7]
  3. Check PEM sensors for any "Entity Absent" state
  4. Replace the faulty PEM

  Example Output:

        # clia getfruledstate 20 6
        20: FRU # 6, Led # 0 ("BLUE LED"):
            Local Control LED State: LED OFF

        20: FRU # 6, Led # 1 ("LED 1"):
            Local Control LED State: LED OFF

        20: FRU # 6, Led # 2 ("LED 2"):
            Local Control LED State: LED ON, color: GREEN


        # clia sensor 20 | grep PEM
        20: LUN: 0, Sensor # 162 ("PEM A In 2")
        20: LUN: 0, Sensor # 163 ("PEM A In 2 Fused")
        20: LUN: 0, Sensor # 164 ("PEM A In 1")
        20: LUN: 0, Sensor # 165 ("PEM A In 1 Fused")
        20: LUN: 0, Sensor # 166 ("PEM A In 4")
        20: LUN: 0, Sensor # 167 ("PEM A In 4 Fused")
        20: LUN: 0, Sensor # 168 ("PEM A In 3")
        20: LUN: 0, Sensor # 169 ("PEM A In 3 Fused")
        20: LUN: 0, Sensor # 174 ("PEM B In 2")
        20: LUN: 0, Sensor # 175 ("PEM B In 2 Fused")
        20: LUN: 0, Sensor # 176 ("PEM B In 1")
        20: LUN: 0, Sensor # 177 ("PEM B In 1 Fused")
        20: LUN: 0, Sensor # 178 ("PEM B In 4")
        20: LUN: 0, Sensor # 179 ("PEM B In 4 Fused")
        20: LUN: 0, Sensor # 180 ("PEM B In 3")
        20: LUN: 0, Sensor # 181 ("PEM B In 3 Fused")
        20: LUN: 0, Sensor # 192 ("PEM A")
        20: LUN: 0, Sensor # 193 ("PEM B")
        20: LUN: 0, Sensor # 200 ("PEM A Temp")
        20: LUN: 0, Sensor # 201 ("PEM B Temp")

        # clia sensordata 20 164
        20: LUN: 0, Sensor # 164 ("PEM A In 1")
            Type: Discrete (0x6f), "Entity Presence" (0x25)
            Status: 0xc0
                All event messages enabled from this sensor
                Sensor scanning enabled
                Initial update completed
            Sensor reading: 0x00
            Current State Mask 0x0001
                Entity Present
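
  The "Entity Absent" check in step 3 can be sketched as follows. This is a minimal illustration, assuming the "clia sensordata" output has been captured as text in the format shown above; the function names are hypothetical and the parser relies only on the lines visible in this example.

```python
import re


def parse_sensor_states(sensordata_output):
    """Map each sensor name to the state lines listed under 'Current State Mask'."""
    result = {}
    current = None
    in_states = False
    for line in sensordata_output.splitlines():
        # Header line such as: 20: LUN: 0, Sensor # 164 ("PEM A In 1")
        header = re.search(r'Sensor # (\d+) \("([^"]*)"\)', line)
        if header:
            current = header.group(2)
            result[current] = []
            in_states = False
        elif current is not None and line.strip().startswith("Current State Mask"):
            in_states = True
        elif in_states and line.strip():
            result[current].append(line.strip())
    return result


def absent_sensors(sensordata_output):
    """Return the names of sensors that report the 'Entity Absent' state."""
    states = parse_sensor_states(sensordata_output)
    return [name for name, lines in states.items() if "Entity Absent" in lines]


# Output copied from the example in this article.
SAMPLE = """\
20: LUN: 0, Sensor # 164 ("PEM A In 1")
    Type: Discrete (0x6f), "Entity Presence" (0x25)
    Status: 0xc0
        All event messages enabled from this sensor
        Sensor scanning enabled
        Initial update completed
    Sensor reading: 0x00
    Current State Mask 0x0001
        Entity Present
"""

if __name__ == "__main__":
    print(absent_sensors(SAMPLE))
```

  For the example above the list is empty, because sensor 164 reports "Entity Present". The output of several sensordata commands can be concatenated and passed in together, since each sensor header starts a new entry.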

 


ShMM (Shelf Management Module)

  Symptom:

  • Could not log in
  • Could not ping
  • No console access to blade ([clia] console <slot #>)
  • Firmware upgrade problem
  • SNMP related problem

  To do:

 

  1. Obtain from the customer a clear description and a log of what has been attempted
  2. Collect:
  • /tmp/debug.log  (created by command /etc/summary)
  • /etc/shelfman.conf
  • /etc/openhpi.conf
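
  Once the files have been copied off the ShMM, a quick completeness check can be sketched as below. This assumes the three files listed above were copied into a single local directory under their original base names; that layout and the function name are assumptions made for illustration only.

```python
import os

# Files this article asks to be collected from the ShMM.
REQUIRED_FILES = ["/tmp/debug.log", "/etc/shelfman.conf", "/etc/openhpi.conf"]


def missing_files(collected_dir, required=REQUIRED_FILES):
    """Return the ShMM paths whose local copy is missing or empty."""
    missing = []
    for shmm_path in required:
        local_copy = os.path.join(collected_dir, os.path.basename(shmm_path))
        if not os.path.isfile(local_copy) or os.path.getsize(local_copy) == 0:
            missing.append(shmm_path)
    return missing


if __name__ == "__main__":
    # "." is a placeholder for the directory holding the collected files.
    for path in missing_files("."):
        print("still needed:", path)
```

  An empty /tmp/debug.log is reported as missing, since that usually means /etc/summary was not run before the file was copied.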


  For any networking-related problem (remote login or ping), make sure there is a route to the ShMM and that the route is pingable from both directions.

  For console access (or netconsole), check /etc/openhpi.conf and the switch blade settings.  Make sure the VLAN 55 IP addresses (from /var/netcons.ip) are pingable.

  For a firmware upgrade problem, obtain a complete log and check the command arguments carefully.  Make sure the correct version is used and the README file is followed.

  For an SNMP problem, check two things:

  1. "df -k" output: /dev/ram0 should not be full; otherwise there is not enough swap memory and some processes will be shut down
  2. CP3060 voltage events on sensor 9 (see doc 1346085.1 for sensor numbers) or on other voltage sensors: if the voltage drops to the threshold, numerous IPMB events are generated and the ShMM stops responding to SNMP probing; the workaround is to lower the threshold:


        # clia help setthreshold

          Set the specified threshold of the dedicated sensor
                unc    - Upper Non Critical
                uc     - Upper Critical
                unr    - Upper Non Recoverable
                lnc    - Lower Non Critical
                lc     - Lower Critical
                lnr    - Lower Non Recoverable
          instead of <addr> user may use:
                board <N>
                shm <N>
          to access the sensor on the specified board
                "-r <value>" considers <value> as unsigned byte
                just "<value>" considers as the floating point number
                setthreshold board 21 "IPMB LINK" unc -r 34
                setthreshold 20 8 lc -45.67

          setthreshold <addr> [ lun: ] | unc | uc | unr | lnc | lc | lnr [-r] value
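
  The "df -k" check for /dev/ram0 mentioned above can be sketched as a small parser. The sample output below is hypothetical (it is not taken from a real ShMM), and the 90% warning level is an assumed value rather than a documented limit; the sketch only assumes the usual df layout, in which one column ends with "%".

```python
def ram0_usage_percent(df_output, device="/dev/ram0"):
    """Return the Use% value for the given device, or None if it is not listed."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            for field in fields[1:]:
                if field.endswith("%"):
                    return int(field.rstrip("%"))
    return None


# Hypothetical "df -k" output, for illustration only.
SAMPLE = """\
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/ram0                15863     15100       763  95% /
"""

WARN_PERCENT = 90  # assumed warning level, not a documented limit

if __name__ == "__main__":
    usage = ram0_usage_percent(SAMPLE)
    if usage is not None and usage >= WARN_PERCENT:
        print("/dev/ram0 is %d%% full; check for processes being shut down" % usage)
```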

 


Blade / RTM

  Symptom: Blade shut down, hang, panic; could not boot, etc.

  To do:

  1. Collect explorer / core dump / snapshot
  2. Troubleshoot as you would any other UltraSPARC system
  3. Be sure to re-seat the blade before replacing it.  When a blade dies, residual values in the IPMB controller (H8) might prevent the blade from booting back up normally; if rebooting the blade does not clear them, re-seating the blade definitely will.

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.