Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1589906.1
Update Date:2017-10-18
Keywords:

Solution Type  Troubleshooting Sure

Solution  1589906.1 :   Sun Fire [TM] SF4800/SF4810/SF6800 - E4900/E6900 Platforms: discussion of RTU/RTS Behavior and Related Power Issues  


Related Items
  • Sun Fire 6800 Server
  • Sun Fire 4800 Server
  • Sun Fire E4900 Server
  • Sun Fire E6900 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-x8x0/Ex900




Applies to:

Sun Fire E4900 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire E6900 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 6800 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 4800 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Sun OS

Purpose


 

 

The purpose of this document is to familiarise support engineers with the inner workings of the power grids within a Sun Fire [TM] SF4800/SF6800/E4900/E6900 rack in order to assist in troubleshooting an observed power issue. It will focus on power grids in general, and the RTU enclosure in particular. Along the way, we will mention the AC Input Boxes, the Rack Fan Trays, the Frame Manager, Power Centerplane, and various internal breakers and switches that are capable of shutting off power to specific areas in the rack, including to any installed disk arrays such as a D240. An astute engineer with knowledge of the power layout within the rack will quickly home in on the cause of any sectioned power losses. Such sectioned power losses could occur within the Sun Fire [TM] SF4800/SF6800/E4900/E6900 chassis which contains the main circuit boards of the server, or outside the chassis but within the rack (peripheral trays, etc).

 

Troubleshooting Steps

The following document should be read in conjunction with this document...

Document 1010053.1 Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 Troubleshooting complete system power outages

 

A Sun Fire [TM] E4900/E6900 chassis contains two internal 56V DC power grids that power all the internal components of the chassis, such as System Boards, Repeater Boards, I/O Boards etc. These grids obtain their power from the AC power inlet of the customer's Datacenter. This AC power is passed through a number of qualification and conversion components before being distributed around the server's chassis and the rack in which it is installed. The System Controller (SC) can list all the internal components in the chassis and indicate the power grid on which each component depends.

sc0:SC> showboards -p power

Component     Pwr   Grid
---------     ---   ----
SSC0          On    No grid
SSC1          On    No grid
ID0           On    No grid
PS0           On    Grid 0
PS1           On    Grid 0
PS2           On    Grid 0
PS3           On    Grid 1
PS4           On    Grid 1
PS5           On    Grid 1
FT0           On    Grid 0 and/or Grid 1
FT1           On    Grid 0 and/or Grid 1
FT2           On    Grid 0 and/or Grid 1
FT3           On    Grid 0 and/or Grid 1
RP0           On    Grid 0
RP1           On    Grid 0
RP2           On    Grid 1
RP3           On    Grid 1
/N0/SB0       On    Grid 0
SB1           -     Grid 1
/N0/SB2       On    Grid 0
/N0/SB3       On    Grid 1
/N0/SB4       On    Grid 0
/N0/SB5       On    Grid 1
/N0/IB6       On    Grid 0
/N0/IB7       Off   Grid 1
/N0/IB8       On    Grid 0
/N0/IB9       On    Grid 1
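When a long "showboards -p power" listing has to be checked by hand, it can help to group the components by grid. The following Python sketch does this for text pasted from the SC console. It is only an illustration: the function name parse_power_listing and the shortened sample listing are assumptions, and the parser expects exactly the three-column layout shown above.

```python
from collections import defaultdict

# Shortened sample in the same layout as the listing above.
SAMPLE = """\
Component     Pwr   Grid
---------     ---   ----
SSC0          On    No grid
PS0           On    Grid 0
PS3           On    Grid 1
FT0           On    Grid 0 and/or Grid 1
/N0/SB0       On    Grid 0
SB1           -     Grid 1
/N0/IB7       Off   Grid 1
"""

def parse_power_listing(text):
    """Return {grid label: [(component, power state), ...]}."""
    by_grid = defaultdict(list)
    for line in text.splitlines():
        # Split into component, power state, and the rest (the grid label).
        parts = line.split(None, 2)
        if len(parts) != 3 or parts[0] in ("Component", "---------"):
            continue  # skip blank lines, the header and the ruler
        component, state, grid = parts
        by_grid[grid.strip()].append((component, state))
    return dict(by_grid)

grids = parse_power_listing(SAMPLE)
for grid, members in grids.items():
    not_on = [name for name, state in members if state != "On"]
    print(f"{grid}: {len(members)} components, not powered on: {not_on}")
```

If every component listed against one grid is off while the other grid is healthy, the fault is likely to lie on that grid's AC path rather than on an individual board.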

As a minimum, each power grid requires its own AC source (an RTS), implying that two AC sources are required to power both internal grids and all the components in the chassis. Another two AC sources are required for redundancy, making four AC connections to the Sun Fire [TM] SF4800/SF6800/E4900/E6900 rack in total. The Sun Fire rack accepts the four AC connections through four Remote Transfer Switches (RTS) housed in two Remote Transfer Units (RTU) at the base of the rack. See Figure 1.

 

Figure 1  - RTU enclosure containing two RTS Modules (power and backup power for one grid)

 Populated RTU

To begin any AC investigation, check the Frame Manager (FM) on the top of the 6800/6900 rack for an amber LED (Figure 2). This amber LED implies a fault with one of the following components:

  • The FM Itself
  • The Rack Fan Trays
  • The RTU enclosure
  • One or more of the RTS Modules

 For this document, we are interested in the RTU and RTS modules.

 

Figure 2:

 FM Fault LEDs

If the FM does indicate that a fault exists, check the system logs for any AC issues. As already noted, AC power enters the rack by way of the RTS modules. You should be able to get RTS reports in the system logs as follows...

SC> showplatform -p frame

  Frame information is not available.

If you get the above message about the "Frame", you won't receive any RTS failure reports in the logs. This problem must be rectified in order to get useful log data about the health of each RTS. The Frame Manager may not be cabled up, or the FM itself may be faulty or need rebooting (by power-cycling the rack Fan Trays). More information about the Frame Manager, including FM images and a fault LED table, is available from the following document...

Document 1001696.1 Sun Fire [TM] 3800-6900 Frame Manager FAQ

The output below shows how the "showplatform -p frame" command should respond, with an example of a failing Rear-Left RTS module...

SC> showplatform -p frame
[...]
Frame RTS
Rear/Left       - OK, AC Failed
Rear/Right      - OK, Connected to the System
Front/Left      - OK, Connected to the System
Front/Right     - OK

Notice above that the Rear-Right RTS has taken over and become connected. The following messages can be seen in the 'showlogs -v' output taken at the SC prompt, or from a full explorer...

Platform.SC: [ID 694629 local0.notice]
   Frame RTS Rear/Left Disconnected from the System
Platform.SC: [ID 741218 local0.notice]
   Frame RTS Rear/Left AC Failed

Platform.SC: [ID 967149 local0.notice]
   Frame RTS Rear/Right Connected to the System
[...]
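When the log is long, the RTS events can be pulled out mechanically. The Python sketch below is an illustration only: it assumes the two-line message layout shown above, and the names rts_events and failed_over are hypothetical. It treats the Left and Right RTS at the same rack position (Front or Rear) as one RTU pair, as in the example above.

```python
import re

LOG = """\
Platform.SC: [ID 694629 local0.notice]
   Frame RTS Rear/Left Disconnected from the System
Platform.SC: [ID 741218 local0.notice]
   Frame RTS Rear/Left AC Failed
Platform.SC: [ID 967149 local0.notice]
   Frame RTS Rear/Right Connected to the System
"""

# Matches e.g. "Frame RTS Rear/Left AC Failed".
RTS_EVENT = re.compile(r"Frame RTS (Front|Rear)/(Left|Right)\s+(.+?)\s*$")

def rts_events(text):
    """Return [(position, event text), ...] in log order."""
    events = []
    for line in text.splitlines():
        match = RTS_EVENT.search(line)
        if match:
            position = f"{match.group(1)}/{match.group(2)}"
            events.append((position, match.group(3)))
    return events

def failed_over(events):
    """Rack positions where the Left RTS lost AC and the Right RTS took over."""
    lost_ac = {pos.split("/")[0] for pos, what in events
               if pos.endswith("/Left") and what == "AC Failed"}
    took_over = {pos.split("/")[0] for pos, what in events
                 if pos.endswith("/Right") and what == "Connected to the System"}
    return sorted(lost_ac & took_over)

events = rts_events(LOG)
print(events)
print("Failover seen at:", failed_over(events))
```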


To accurately interpret the above data, we must now become familiar with the workings of the Remote Transfer Unit (RTU) and the Remote Transfer Switch (RTS). In a broad sense, an RTU is an enclosure that can hold up to two RTS units and switch between them if the AC connected to either RTS is lost or becomes disqualified. That is, each RTS in the RTU has external AC power connected to it. Only one RTS needs to be active and supplying power to a single E6900 grid at any given moment; the other is in standby mode, waiting to take over and supply power if the active RTS loses AC. In other words, one RTS powers an entire grid on its own and does not share the load with its neighboring RTS. In practice, the standby (slave) RTS does supply a small amount of power to an unswitched outlet (J11 or J12) on its side of the RTU. More about this later. The RTU favours the Left-Hand RTS, the Right-Hand RTS being mostly in standby mode. If the Right-Hand RTS has to take over for any reason, an automatic failover back to the Left-Hand RTS occurs about five minutes after the Left-Hand RTS's AC power has been requalified as good. It's _about_ five minutes because there is a random factor involved to prevent multiple racks switching their AC loads simultaneously.
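The purpose of the random factor can be illustrated with a short Python sketch. The text only says "about five minutes", so the 30-second window and the uniform distribution used below are assumptions made purely for illustration; they are not documented RTS firmware values.

```python
import random

FAILBACK_BASE_S = 300.0  # "about five minutes", per the text
JITTER_S = 30.0          # ASSUMPTION: illustrative window, not a documented value

def failback_delay(rng, base_s=FAILBACK_BASE_S, jitter_s=JITTER_S):
    """Seconds to wait after the left-hand feed requalifies before failing back."""
    return base_s + rng.uniform(-jitter_s, jitter_s)

rng = random.Random(2018)  # fixed seed so the sketch is repeatable
delays = [failback_delay(rng) for _ in range(4)]  # four hypothetical racks
for rack, delay in enumerate(delays):
    print(f"rack {rack}: fails back {delay:.1f} s after requalification")
```

Because each rack draws its own delay, racks that lost the same feed at the same moment do not all transfer their load back in the same instant.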

Note: 

If the standby (slave) RTS loses AC power, no RTS switching takes place. The master RTS is already in control. But log data will indicate the slave RTS failure (as well as the fault LED on the failed RTS).


There are six internal bulk power supplies for converting AC (from the RTSs) to DC to supply both internal power grids. Three of the bulk power supplies supply Grid-0 with 56V DC, the other three supply Grid-1. As we can see from the above "showboards -p power" output, PS0, PS1, PS2 supply Grid-0, and PS3, PS4, PS5 supply Grid-1. Each of the six bulk PSUs also supplies a separate 56V housekeeping voltage (auxiliary power grid) which supplies both System Controllers with power. As can be seen from the above "showboards -p power" output, both SCs show a "No grid" status to indicate independence from Grid-0 and Grid-1. What this implies is important to note. Even when all components are powered off with the 'poweroff' command, both SCs will continue to receive power from the auxiliary grid. This of course allows you to type commands at the SC prompt as normal. If both SCs do lose power, you will know that the whole rack went down and not just an isolated part of the rack.

As indicated above, the six bulk power supplies are arranged in two groups of three. Each group gets its AC power from an AC Input Box (filter tray), making two AC Input Boxes in the rack. Each of these AC Input Boxes gets its AC power from one of the two RTUs. To be more precise, each RTU houses two RTSs (one for redundancy as already discussed), so each AC Input Box gets its power feed from an RTS (in an RTU). A loss of power on one or both groups of bulk power supplies makes it necessary to check the breakers on the AC Input Boxes and the RTS modules. The AC Input Boxes and RTS modules have magnetic circuit breakers on them. These can be switched on/off manually, or they can trip by themselves if an AC problem is encountered. If one of these breakers trips, it will remain that way until someone manually switches it back on. A visual check on these breakers will be necessary if you are investigating a sectioned power loss (perhaps one power grid is down).

RTU enclosures also have switches that need to be checked to ensure they are in an on state. These switches are labelled J3-to-J12 (see Figure 1). That is, on the left-hand side of the RTU there are four AC sockets for supplying power to various components in the rack. These sockets have switches J3, J5, J7, J9. Another four are grouped together on the right-hand side of the RTU, controlled by J4, J6, J8, J10. So long as there is one RTS installed in the RTU, all of these sockets, left and right, have AC power. Here too, a visual check on these switches will be necessary if you are investigating a sectioned power loss. For additional data, see the following document...

Document 1011893.1 Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 AC/DC Power Distribution

 

J11 and J12 are unswitched (unshared) and each requires the presence of an RTS on its side of the RTU in order to receive power. See Figure 3, which shows that J11 and J12 are connected to an unshared port on an RTS. If the RTS is not present, J11 (J12) will not have power. Frequently, J11 and J12 are used to power the rack Fan Trays. Checking to see if the rack Fan Trays have power is useful in determining if there is an RTS power issue, assuming that the redundant RTS is installed. Sometimes the redundant RTS (slave, spare) is faulty but, as it's not in use most of the time, nobody notices it. The log data will indicate if a Right-Hand RTS becomes faulty.
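The outlet rules described above can be summarised in a small Python model. This is a sketch under stated assumptions: the text does not say which of J11 and J12 sits on which side, so the mapping of J11 to the left and J12 to the right simply follows the odd/even pattern of the other outlets. An RTS counts as "working" here only if it is installed and has AC, and all outlet switches and breakers are taken to be on.

```python
LEFT_SHARED = ("J3", "J5", "J7", "J9")
RIGHT_SHARED = ("J4", "J6", "J8", "J10")
# ASSUMPTION: J11 on the left, J12 on the right (not stated in the text).
UNSHARED = {"J11": "left", "J12": "right"}

def powered_outlets(left_rts_working, right_rts_working):
    """Return the set of RTU outlets expected to have AC power."""
    powered = set()
    # Shared outlets on both sides are fed by whichever RTS carries the load.
    if left_rts_working or right_rts_working:
        powered.update(LEFT_SHARED + RIGHT_SHARED)
    # Unshared outlets need a working RTS on their own side of the RTU.
    working = {"left": left_rts_working, "right": right_rts_working}
    for outlet, side in UNSHARED.items():
        if working[side]:
            powered.add(outlet)
    return powered

# Only the left-hand RTS present: one unshared outlet is dead.
missing = set(LEFT_SHARED + RIGHT_SHARED + tuple(UNSHARED)) - powered_outlets(True, False)
print("Outlets without power:", sorted(missing))
```

A rack Fan Tray that is dark while the grids are up therefore points at the RTS on that Fan Tray's side of the RTU.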

 

Figure 3:

 RTU Unswitched

In summary, the path from the AC source through to the various circuit boards can be shown as follows...

 

Figure 4:

 Components on the grid path

Next, we must consider the effect of the Frame Manager on the integrity of the AC supply. To power on a Sun Fire [TM] SF4800/SF6800/E4900/E6900 chassis you must first power on the four RTS Modules located at the bottom front and bottom rear of the system rack. Allow 30 seconds for the RTS modules to complete their power-on sequence. Check the LEDs on the left-hand RTSs to ensure that the left-most LED is green (AC), and the middle LED is also green (relay). Next, power on the two AC Input Boxes in the middle portion of the system chassis at the rear. Lastly, turn the Frame Manager keyswitch to the power-on position. The keyswitch is located on the Frame Manager (FM) towards the top-most portion of the system chassis at the front. The Frame Manager can be identified as a unit with a keyswitch and an LCD display. See the following document for image examples...

Document 1001696.1 Sun Fire [TM] 3800-6900 Frame Manager FAQ

 

Remember that the Frame Manager gets its power from either of the two rack Fan Trays in the roof of the rack. And these Fan Trays in turn get their power (usually) from the previously mentioned unswitched outlets on the RTU (J11,J12). The usual method of rebooting the operating system on the FM is to power-cycle the rack Fan Trays. This won't harm the domains and can be done live. Otherwise, these rack Fan Trays should remain in an ON position. If the FM has any power issues, check the rack Fan Trays and their connections to the RTUs.

RTS modules contain embedded firmware. With earlier firmware, the left-hand (master) RTS would disconnect the AC load and hand over to the right-hand RTS whenever the AC input became unqualified (dropped below 180V). The right-hand (slave) RTS would then connect the load and become the master if its power was qualified (180-264V, 47-63Hz). The Grid being powered by this RTS pair, and the Solaris domains depending on this Grid, would remain up before, during, and after the handover to the slave RTS.

Take note of the above-mentioned qualification boundaries. The lower frequency boundary, 47Hz, is not far below the European standard of 50Hz, and brief periods below 47Hz have been noticed during times when backup generators switch in or out. If you encounter unexplained AC outages for the E6900 rack while other Oracle or Third-Party servers appear to be undisturbed, consider the possibility of a sag in the frequency. You might have to ask the customer to monitor the frequency for a time.
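If the customer can supply voltage and frequency readings, they can be compared against the qualification window with a few lines of Python. This is a minimal sketch: the function name qualify is hypothetical, and treating the boundary values themselves as qualified is an assumption, since the text only gives the ranges 180-264V and 47-63Hz.

```python
V_MIN, V_MAX = 180.0, 264.0   # volts, from the text
HZ_MIN, HZ_MAX = 47.0, 63.0   # hertz, from the text

def qualify(volts, hertz):
    """Return a list of reasons the AC reading is unqualified (empty if qualified)."""
    reasons = []
    # ASSUMPTION: readings exactly on a boundary count as qualified.
    if not V_MIN <= volts <= V_MAX:
        reasons.append(f"voltage {volts}V outside {V_MIN}-{V_MAX}V")
    if not HZ_MIN <= hertz <= HZ_MAX:
        reasons.append(f"frequency {hertz}Hz outside {HZ_MIN}-{HZ_MAX}Hz")
    return reasons

# Hypothetical readings: nominal, generator switch-in, voltage sag.
for volts, hertz in [(230.0, 50.0), (230.0, 46.5), (172.0, 50.0)]:
    problems = qualify(volts, hertz)
    print(volts, hertz, "qualified" if not problems else problems)
```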

Getting back to the point about the embedded firmware on the master RTS unit: if both RTSs in an RTU are "plugged into the same AC power feed", some undesirable behaviour can occur. During an AC power issue, the master RTS module would find the power source unqualified as expected, disconnect the AC load, and initiate the transfer to the slave RTS module. The slave RTS module, however, is observing the same unqualified power and will not connect the load. In other words, the master RTS tries to hand over to the slave RTS but the slave RTS sees unqualified AC power and refuses to take over. Consequently, neither RTS carries the load and one of the E6900 grids goes down until the AC input power becomes qualified again. This failed Grid could have been supplying a number of Solaris domains with power and so some domains may have gone down - the number of domains depending on how the customer configured the boards into the domains.

Document 1316607.1 Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, and E6900: Sun Fire Segments, Domains, and Power Grids

 

Expanding on this, consider that the internal bulk power supplies can survive an AC loss of up to one cycle. At 50Hz, this means that the bulk power supplies can survive a 20 millisecond outage. The master RTS is capable of a transfer to the slave RTS in under 20ms, so the grid survives. When the failed AC feed to the RTS is eventually restored, that RTS needs about 100ms to requalify the power input before being available to become master again. So in the above example where both RTSs were connected to the same AC source, and both RTSs saw unqualified AC power, about 100ms would pass before power can be requalified by both RTS modules. This is far greater than the 20ms holdup time of the internal bulk power supplies and so the grid would go down. However, if it is a brownout, a loss of AC for only a few milliseconds, we would not want both RTSs taking time out (100ms) to requalify the AC power. It would be better for one of the RTSs to attempt to ride through the brownout. Many AC outages in Datacenters are brief sags below 180V, and servers which don't have RTS units in their configuration tend to survive brief sags in AC power. It follows that if both RTSs in an RTU are plugged into the same AC power feed, it is better to leave the right-hand RTS powered off. In such a configuration, the left-hand RTS will have no slave RTS to hand over to and will attempt to ride out the brownout, which it will do without any problems if the brownout lasts less than 20ms.

Said another way, if the Datacenter power outage can be measured in time intervals of many seconds, the RTS configuration doesn't matter. All internal power grids will be lost. But if it is an all too common brownout, where the power drops below 180V for less than 20ms, and both RTSs in the RTU are plugged into the exact same AC feed, we are better off with only one RTS in the RTU so as to avoid a failover attempt. In any event, the RTS was designed for datacenters with multiple, independent AC power sources. Under such conditions, a sag below 180V on the left-hand RTS is not likely to also occur on the right-hand RTS, which is getting its AC feed elsewhere. In such a case, an RTS failover is appropriate.
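The timing argument can be checked with simple arithmetic. The Python sketch below uses the figures quoted above (one cycle of holdup, about 100ms to requalify). The 60Hz row is derived from the same one-cycle rule and is not quoted in the text, and the function names are hypothetical.

```python
REQUALIFY_MS = 100.0  # approximate RTS requalification time, from the text

def holdup_ms(mains_hz):
    """Holdup time of the bulk power supplies: one AC cycle, in milliseconds."""
    return 1000.0 / mains_hz

def grid_survives(gap_ms, mains_hz):
    """True if a gap in the supply to the bulk PSUs is within one AC cycle."""
    return gap_ms <= holdup_ms(mains_hz)

for hz in (50, 60):
    print(f"{hz}Hz: holdup {holdup_ms(hz):.1f}ms, "
          f"survives a 10ms brownout: {grid_survives(10.0, hz)}, "
          f"survives a {REQUALIFY_MS:.0f}ms requalification gap: "
          f"{grid_survives(REQUALIFY_MS, hz)}")
```

At either mains frequency the requalification time is several times the holdup time, which is why a failover attempt between two RTSs on the same failing feed costs the grid.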

Older RTS modules with part numbers of 300-1396-05 and earlier behave differently from those with later part numbers 300-1396-06 and above or 300-1928-xx (RoHS). The embedded FW within the later units was modified so that if the master RTS detects a power anomaly and the slave RTS does not have qualified power either, a transfer from the master to the slave is not attempted. Instead, the master RTS will try to ride through the power anomaly as it would do if the slave RTS was not present. This eliminates the problem of plugging in the slave RTS to the same AC source when an undervoltage/overvoltage condition occurs. However, keep in mind that if the frequency becomes unqualified, we run into the same problem as mentioned above - unqualified AC on both RTSs in an RTU and the grids go down, regardless of the RTS type/dash number (unless each RTS is connected to independent power sources). So Oracle documentation will still stipulate that there should only be one RTS per RTU for datacenters _without_ multiple, independent power sources. So long as this rule is observed, the dash number of the RTS does not matter.
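The behaviour described in the last few paragraphs can be collected into one decision sketch. This Python model is a reading of the text above, not the RTS firmware itself: the function names and outcome strings are hypothetical, and the single-RTS case is assumed to behave the same way for voltage and frequency anomalies.

```python
def has_later_firmware(part_number):
    """True for 300-1396-06 and above or 300-1928-xx, per the part numbers above."""
    prefix, _, dash = part_number.rpartition("-")
    if prefix == "300-1928":
        return True
    if prefix == "300-1396":
        return int(dash) >= 6
    raise ValueError(f"not an RTS part number covered here: {part_number}")

def rtu_outcome(master_ok, slave_present, slave_ok, later_fw, anomaly="voltage"):
    """Expected result for one RTU; anomaly is 'voltage' or 'frequency'."""
    if master_ok:
        return "master carries the load"
    if not slave_present:
        return "master attempts to ride through"
    if slave_ok:
        return "load transfers to slave"
    # From here on, both RTSs see unqualified AC (typically a shared feed).
    if later_fw and anomaly == "voltage":
        return "master attempts to ride through"
    return "load dropped until AC requalifies"

# Shared feed, brief voltage sag, older versus later RTS.
for part in ("300-1396-05", "300-1396-06", "300-1928-01"):
    print(part, "->", rtu_outcome(False, True, False, has_later_firmware(part)))
```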

We should, of course, specify what we mean by an "independent power source". Independent power sources are those which don't observe an over/undervoltage condition at exactly the same time. So it's possible that the customer has installed a UPS between the AC power source and one of the RTS modules. Any brownouts or power interruptions will not reach the RTS that has the UPS in the loop. In effect, with the UPS in the loop, we have an independent power feed. However, note that a UPS can be configured in Standby mode, whereby the AC is directly connected to the RTS (bypassing the UPS) until an AC interruption occurs. On detection of the AC interruption, the UPS kicks in and supplies power within 20ms. However, this UPS kick-in is often too late to stop the RTS from seeing the undervoltage condition, which leads us back to the issue of both RTS modules being connected to the same power feed. UPSs in always-connected mode do not suffer from this problem.

If multiple independent power sources are in use, don't worry about the type and dash number of the RTSs. They will all work correctly under conditions of multiple independent power sources.

 

Tips on Troubleshooting: 

Check to see if the whole rack lost power by checking "showlogs -v" on the Main SC. If the uptime heartbeat message returned to zero days after the reported AC outage, it is almost certain that the AC loss was not confined to a single grid or section of the rack. Check the Datacenter's wall breakers to see if they have tripped. You might have to examine the customer's power logs to verify this.
   
  Heartbeat message:
      [ID 372088 local0.notice] Main, up 0 days 12:30:07, Memory 7,455,840
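The heartbeat check can be scripted when the log is long. The Python sketch below assumes the heartbeat format shown above. The function names are hypothetical, and the first sample line (35 days of uptime) is invented for illustration.

```python
import re

HEARTBEAT = re.compile(r"up (\d+) days? (\d+):(\d+):(\d+)")

def uptime_seconds(line):
    """Return the SC uptime in seconds, or None if the line is not a heartbeat."""
    match = HEARTBEAT.search(line)
    if not match:
        return None
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def sc_restarted(lines):
    """True if any heartbeat reports less uptime than the one before it."""
    previous = None
    for line in lines:
        current = uptime_seconds(line)
        if current is None:
            continue
        if previous is not None and current < previous:
            return True
        previous = current
    return False

log = [
    "[ID 372088 local0.notice] Main, up 35 days 02:10:44, Memory 7,455,840",  # hypothetical
    "[ID 372088 local0.notice] Main, up 0 days 12:30:07, Memory 7,455,840",
]
print("SC restarted between heartbeats:", sc_restarted(log))
```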

 

Check the keyswitch on the FM to see if it has been turned to the off position. The key often has a spare on the same ring, and this dangling spare key can get caught in the front door of the rack, which can unintentionally rock the key to the off position. Consider the Frame Manager keyswitch cable as a possible faulty FRU if all visual checks on the key appear normal. Consider also the Frame Manager itself. If it looks normal, reboot it anyway by power cycling the rack Fan Trays.

 

If only one internal grid goes down, check the relevant group of three bulk power supplies. If two or more of the grid's PSUs are in a faulted condition, the grid will go down and you have multiple faulty PSUs to deal with (replace). On the other hand, if the PSUs look good, check their relevant AC Input Box for a tripped breaker. Then, move along to the RTU supplying that AC Input Box (are both RTSs working? - check their fault LEDs). Consider that the RTS or the AC Input Box is faulty. Also consider any interconnecting cables. If you are onsite and have access to the whole rack, remember that you have multiple RTS Modules, and two AC Input Boxes. Move these around and see if the power fault moves with one of these FRUs. If you can move the fault, you can track it. And if you can track it, you can eliminate it.
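The order of checks for a single failed grid can be written down as a small Python sketch. It only restates the sequence given above. The function name triage_grid is hypothetical, and the PSU grouping is the one shown in the "showboards -p power" output.

```python
GRID_PSUS = {0: ("PS0", "PS1", "PS2"), 1: ("PS3", "PS4", "PS5")}

def triage_grid(grid, faulted_psus):
    """Return the next actions for one failed grid, in the order given above."""
    faulted = [psu for psu in GRID_PSUS[grid] if psu in faulted_psus]
    if len(faulted) >= 2:
        # Two or more faulted PSUs in the group take the grid down.
        return ["replace faulted PSUs: " + ", ".join(faulted)]
    return [
        "check the grid's AC Input Box for a tripped breaker",
        "check the RTU feeding that AC Input Box (RTS breakers and fault LEDs)",
        "check the interconnecting cables",
        "swap RTS modules / AC Input Boxes and see if the fault moves",
    ]

for step in triage_grid(1, {"PS3", "PS5"}):
    print(step)
for step in triage_grid(0, set()):
    print(step)
```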

 

Figure 4 drills down to the Power Centerplane. If the SC showenv command shows that all power supplies are present, and there are no errors reported against any of the PSUs, then the power centerplane is likely to be good. The power centerplane has no active components on it so any problems with it tend to be connection issues (loose screws).

The 'showenv -ltuv' command will exercise the global I2C buses to the power supplies.  There are two I2C buses to each PSU, one from each SC. So to check the PSU-to-Power Centerplane connections, fail over to the other SC and run the 'showenv -ltuv' command again. Once again, if there are no errors, the power centerplane is likely to be good.

If all else fails, remove all AC power from the rack and check and tighten the bus bar screws to the power centerplane. There are two removable panels just below the I/O Boards (above the Fan Trays) that allow access to the bus bar screws. The screwdriver should be a #2 Phillips and its length needs to be about 18 inches (450mm) to reach. Only consider this as a last resort; loose bus bar screws are a thing of the past and very rare today.

   Ref:
   https://mosemp.us.oracle.com/handbook_internal/fin-fco/1-7-I1124-1-1.html

   Screws on the power bus bar can become loose after prolonged periods of server
   operation on Sun Fire 6800 Servers manufactured prior to the 23rd week of 2001.  

  

References

Document 1003348.1 Sun Fire[TM] 3800-6800 servers: RTU RTS Failover Fails
Document 1017518.1 Sun Fire[TM] 6800/E6900 Server: How to completely power off/on the server for datacenter maintenance.

  Copyright © 2018 Oracle, Inc.  All rights reserved.