Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution
1589906.1 : Sun Fire [TM] SF4800/SF4810/SF6800 - E4900/E6900 Platforms: discussion of RTU/RTS Behavior and Related Power Issues
Applies to:
Sun Fire E4900 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire E6900 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 6800 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 4800 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Sun OS
Purpose
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in an appropriate My Oracle Support Community - Oracle Sun Technologies Community.
The purpose of this document is to familiarise support engineers with the inner workings of the power grids within a Sun Fire [TM] SF4800/SF6800/E4900/E6900 rack in order to assist in troubleshooting an observed power issue. It will focus on power grids in general, and the RTU enclosure in particular. Along the way, we will mention the AC Input Boxes, the Rack Fan Trays, the Frame Manager, Power Centerplane, and various internal breakers and switches that are capable of shutting off power to specific areas in the rack, including to any installed disk arrays such as a D240. An astute engineer with knowledge of the power layout within the rack will quickly home in on the cause of any sectioned power losses. Such sectioned power losses could occur within the Sun Fire [TM] SF4800/SF6800/E4900/E6900 chassis which contains the main circuit boards of the server, or outside the chassis but within the rack (peripheral trays, etc).
Troubleshooting Steps
The following document should be read in conjunction with this document:
Document 1010053.1 Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 Troubleshooting complete system power outages
A Sun Fire [TM] E4900/E6900 chassis contains two internal 56V DC power grids to power all the internal components of the chassis such as System Boards, Repeater Boards, I/O Boards etc. These grids obtain their power from the AC power inlet of the customer's Datacenter. This AC power is passed through a number of qualification and conversion components before being distributed around the server's chassis and the rack in which it is installed. The System Controller (SC) can list all the internal components in the chassis and indicate which power grid each component depends on for power:
sc0:SC> showboards -p power
As a minimum, each power grid requires its own AC source (an RTS), implying that two AC sources are required to power both internal grids and all the components in the chassis. Another two AC sources will be required for redundancy, making four AC connections to the Sun Fire [TM] SF4800/SF6800/E4900/E6900 rack in total. The Sun Fire rack accepts each of the four AC connections through four Remote Transfer Switches (RTS) housed in two Remote Transfer Units (RTU) at the base of the Sun Fire Rack. See Figure 1.
Figure 1 - RTU enclosure containing two RTS Modules (power and backup power for one grid)
To begin any AC investigation, check the Frame Manager (FM) on the top of the 6800/6900 rack for an amber LED (Figure 2). This amber LED implies a fault with one of the following components:
For this document, we are interested in the RTU and RTS modules.
Figure 2:
If the FM does indicate that a fault exists, check the system logs for any AC issues. As already noted, AC power enters the rack by way of the RTS modules. You should be able to get RTS reports in the system logs as follows...
Notice above that the Rear-Right RTS has taken over and become connected. The following messages can be seen in the 'showlogs -v' output taken at the SC prompt, or from a full explorer...
To accurately interpret the above data, we must now become familiar with the workings of the Remote Transfer Unit (RTU) and the Remote Transfer Switch (RTS). In a broad sense, an RTU is an enclosure that can hold up to two RTS units and switch between them if the AC connected to either RTS is lost or becomes disqualified. That is, each RTS in the RTU has external AC power connected to it. Only one RTS is required to be active and supplying power to a single E6900 grid at any given moment; the other is in standby mode, waiting to jump in and supply power if the active RTS loses AC. That is, one RTS powers an entire grid on its own and does not share the load with its neighboring RTS. In practice, the standby (slave) RTS is supplying a small amount of power to an unswitched outlet (J11 or J12) on its side of the RTU. More about this later.
The RTU favours the Left-Hand RTS, the Right-Hand RTS being mostly in standby mode. If the Right-Hand RTS has to take over for any reason, an automatic failover back to the Left-Hand RTS occurs about five minutes after the Left-Hand RTS's AC power has been requalified as good. It's _about_ five minutes because there is a random factor involved to prevent multiple racks switching their AC loads simultaneously.
Note: If the standby (slave) RTS loses AC power, no RTS switching takes place. The master RTS is already in control. But the log data will indicate the slave RTS failure (as will the fault LED on the failed RTS).
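The left-hand preference and delayed failback described above can be sketched as a small selection function. This is an illustrative model only, not the RTS firmware: the function names, the 30-second jitter range, and the behaviour when the right-hand feed is lost before the failback delay has elapsed are all assumptions made for the example.

```python
import random

FAILBACK_BASE_S = 300.0    # "about five minutes", from the text above
FAILBACK_JITTER_S = 30.0   # ASSUMED spread; the real random factor is not documented here


def failback_delay(rng=random):
    """Seconds to wait after the left-hand feed requalifies (base plus assumed jitter)."""
    return FAILBACK_BASE_S + rng.uniform(0.0, FAILBACK_JITTER_S)


def select_active(left_ok, right_ok, active, secs_since_left_requalified, delay):
    """Return which RTS ('left', 'right' or None) should carry the grid.

    left_ok / right_ok: whether each RTS currently sees qualified AC.
    active: the RTS carrying the load right now ('left', 'right' or None).
    secs_since_left_requalified: how long the left-hand feed has been good again.
    delay: the failback delay drawn when the left-hand feed requalified.
    """
    if active == "left":
        if left_ok:
            return "left"       # losing only the standby feed causes no switching
        return "right" if right_ok else None
    if active == "right":
        if left_ok and secs_since_left_requalified >= delay:
            return "left"       # automatic failback to the preferred side
        if right_ok:
            return "right"
        # ASSUMPTION: if the right-hand feed is also lost, use the left-hand feed at once.
        return "left" if left_ok else None
    # Nothing active yet: the RTU favours the left-hand RTS.
    if left_ok:
        return "left"
    return "right" if right_ok else None
```

For example, `select_active(True, True, "right", 100, 300)` keeps the load on the right-hand RTS because the left-hand feed has only been good for 100 seconds, while the same call with 301 seconds returns the load to the left-hand RTS.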
As indicated above, the six bulk power supplies are arranged in two groups of three. Each group gets its AC power from an AC Input Box (filter tray), making two AC Input Boxes in the rack. Each of these AC Input Boxes gets its AC power from one of the two RTUs. To be more precise, each RTU houses two RTSs (one for redundancy as already discussed), so each AC Input Box gets its power feed from an RTS (in an RTU). A loss of power on one or both groups of bulk power supplies makes it necessary to check the breakers on the AC Input Boxes and the RTS modules. The AC Input Boxes and RTS modules have magnetic circuit breakers on them. These can be switched on/off manually, or they can trip by themselves if an AC problem is encountered. That is, if one of these breakers trips, it will remain that way until someone manually switches it back on. A visual check on these breakers will be necessary if you are investigating a sectioned power loss (perhaps one power grid is down).
Document 1011893.1 Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 AC/DC Power Distribution
J11 and J12 are unswitched (unshared), and each requires the presence of an RTS on its side of the RTU in order to receive power. Figure 3 shows that J11 and J12 are each connected to an unshared port on an RTS. If that RTS is not present, J11 (or J12) will not have power. Frequently, J11 and J12 are used to power the rack Fan Trays. Checking whether the rack Fan Trays have power is useful in determining if there is an RTS power issue, assuming that the redundant RTS is installed. Sometimes the redundant RTS (slave, spare) is faulty, but as it's not in use most of the time, nobody notices it. The log data will indicate if a Right-Hand RTS becomes faulty.
Figure 3:
In summary, the path from the AC source through to the various circuit boards can be shown as follows...
Figure 4:
Next, we must consider the effect of the Frame Manager on the integrity of the AC supply. To power on a Sun Fire [TM] SF4800/SF6800/E4900/E6900 chassis you must first power on the four RTS Modules located at the bottom front and bottom rear of the system rack. Allow 30 seconds for the RTS modules to complete their power-on sequence. Check the LEDs on the left-hand RTSs to ensure that the left-most LED is green (AC), and the middle LED is also green (relay). Next, power on the two AC Input Boxes in the middle portion of the system chassis at the rear. Lastly, turn the Frame Manager keyswitch to the power-on position. The keyswitch is located on the Frame Manager (FM) towards the top-most portion of the system chassis at the front. The Frame Manager can be identified as a unit with a keyswitch and an LCD display. See the following document for image examples:
Document 1001696.1 Sun Fire [TM] 3800-6900 Frame Manager FAQ
Remember that the Frame Manager gets its power from either of the two rack Fan Trays in the roof of the rack. These Fan Trays in turn get their power (usually) from the previously mentioned unswitched outlets on the RTU (J11, J12). The usual method of rebooting the operating system on the FM is to power-cycle the rack Fan Trays. This won't harm the domains and can be done live. Otherwise, these rack Fan Trays should remain in the ON position. If the FM has any power issues, check the rack Fan Trays and their connections to the RTUs.
RTS modules contain embedded firmware. Earlier firmware, on the left-hand (master) RTS, would disconnect the AC load and hand over to the right-hand RTS whenever the AC input became unqualified (dropped below 180V). The right-hand (slave) RTS would then connect the load and become the master if its power was qualified (180-264V, 47-63Hz). The Grid being powered by this RTS pair, and the Solaris domains depending on this Grid, would remain up before, during, and after the handover to the slave RTS.
Document 1316607.1 Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, and E6900: Sun Fire Segments, Domains, and Power Grids
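The qualification window quoted above (180-264V, 47-63Hz) and the resulting handover decision can be expressed as a short check. This is a sketch for reasoning about a reported AC reading, not a description of the firmware itself: the function names are hypothetical, and treating the limits as inclusive is an assumption.

```python
V_MIN, V_MAX = 180.0, 264.0    # volts, from the text above
HZ_MIN, HZ_MAX = 47.0, 63.0    # hertz, from the text above


def is_qualified(volts, hertz):
    """True if an AC reading falls inside the qualification window.

    ASSUMPTION: the limits are treated as inclusive.
    """
    return V_MIN <= volts <= V_MAX and HZ_MIN <= hertz <= HZ_MAX


def handover_result(master_feed, slave_feed):
    """Model the earlier-firmware handover described above.

    Each feed is a (volts, hertz) tuple; slave_feed is None when no
    slave RTS is installed or it is powered off.
    """
    if is_qualified(*master_feed):
        return "master keeps load"
    if slave_feed is not None and is_qualified(*slave_feed):
        return "slave takes load"
    return "no qualified feed"
```

With these definitions, `handover_result((170.0, 50.0), (230.0, 50.0))` returns `"slave takes load"`, which corresponds to the case where the grid and its domains stay up across the handover.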
Expanding on this, consider that the internal bulk power supplies can survive an AC loss of up to one cycle. At 50Hz, this means that the bulk power supplies can survive a 20 millisecond outage. The master RTS is capable of a transfer to the slave RTS in under 20ms, so the grid survives. When the failed AC feed to the RTS is eventually restored, that RTS needs about 100ms to requalify the power input before being available to become master again. So in the above example where both RTSs were connected to the same AC source, and both RTSs saw unqualified AC power, about 100ms would pass before power could be requalified by both RTS modules. This is far greater than the 20ms holdup time of the internal bulk power supplies, and so the grid would go down.
However, if it is a brownout, a loss of AC for only a few milliseconds, we would not want both RTSs taking time out (100ms) to requalify the AC power. It would be better for one of the RTSs to attempt to ride through the brownout. Many AC outages in Datacenters are brief sags below 180V, and servers that don't have RTS units in their configuration tend to survive brief sags in AC power. And so it is that if both RTSs in an RTU are plugged into the same AC power feed, it is better to leave the right-hand RTS powered off. In such a configuration, the left-hand RTS will have no slave RTS to hand over to and will attempt to ride out the brownout, which it will do without any problems if the brownout lasts less than 20ms.
If multiple independent power sources are in use, don't worry about the type and dash number of the RTSs. They will all work correctly under conditions of multiple independent power sources.
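The timing argument above reduces to simple arithmetic: one AC cycle of holdup compared against how long the grid is actually without power. The sketch below uses only the figures quoted in the text (one cycle of holdup, about 100ms to requalify); the function names are hypothetical and the 50Hz default simply mirrors the example in the text.

```python
REQUALIFY_MS = 100.0   # approximate requalification time, from the text above


def holdup_ms(mains_hz=50.0, cycles=1.0):
    """Holdup time of the bulk power supplies, taken as `cycles` AC cycles."""
    return cycles * 1000.0 / mains_hz


def grid_survives(gap_ms, mains_hz=50.0):
    """True if the grid sees no power for no longer than the holdup time."""
    return gap_ms <= holdup_ms(mains_hz)


# Both RTSs on the same failed feed: the grid waits out a full requalification.
same_feed_outcome = grid_survives(REQUALIFY_MS)   # 100ms gap against 20ms holdup

# Single RTS riding through a short sag (5ms chosen as an illustrative value).
ride_through_outcome = grid_survives(5.0)
```

Here `holdup_ms()` evaluates to 20.0, `same_feed_outcome` is `False` (the grid goes down), and `ride_through_outcome` is `True`, matching the recommendation to leave the right-hand RTS powered off when both RTSs share one AC feed.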
Tips on Troubleshooting:
Check to see if the whole rack lost power by checking "showlogs -v" on the Main SC. If the uptime heartbeat message returned to zero days after the reported AC outage, it is almost certain that the AC loss was not confined to a single grid or section of the rack. Check the Datacenter's wall breakers to see if they have tripped. You might have to examine the customer's power logs to verify this.
Heartbeat message: [ID 372088 local0.notice] Main, up 0 days 12:30:07, Memory 7,455,840
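When many log lines have to be reviewed, the uptime in the heartbeat message can be extracted with a regular expression and compared across two messages. This is a minimal sketch based solely on the single sample line above; the pattern, the function names, and the tolerance for a singular "day" are assumptions, and other heartbeat variants may exist.

```python
import re

# Pattern derived from the one sample heartbeat line shown above (assumption:
# other SC firmware levels may format this message differently).
HEARTBEAT_RE = re.compile(r"Main, up (\d+) days? (\d+):(\d+):(\d+)")


def uptime_seconds(line):
    """Return the SC uptime in seconds from a heartbeat line, or None if absent."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def sc_restarted(earlier_line, later_line):
    """True if uptime went backwards between two heartbeats (the SC restarted).

    Returns None when either line is not a recognisable heartbeat.
    """
    earlier = uptime_seconds(earlier_line)
    later = uptime_seconds(later_line)
    if earlier is None or later is None:
        return None
    return later < earlier
```

Applied to the sample line above, `uptime_seconds` yields 45007 (12 hours, 30 minutes and 7 seconds). An uptime that drops between a heartbeat logged before the reported outage and one logged after it suggests the SC itself lost power.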
Check the keyswitch on the FM to see if it has been turned to the off position. The key often has a spare on the same ring, and this dangling spare key can get caught in the front door of the rack, which can unintentionally rock the key to the off position. Consider the Frame Manager keyswitch cable as a possible faulty FRU if all visual checks on the key appear normal. Consider also the Frame Manager itself. If it looks normal, reboot it anyway by power-cycling the rack Fan Trays.
If only one internal grid goes down, check the relevant group of three bulk power supplies. If two or more of the grid's PSUs are in a faulted condition, the grid will go down and you have multiple faulty PSUs to deal with (replace). On the other hand, if the PSUs look good, check their relevant AC Input Box for a tripped breaker. Then, move along to the RTU supplying that AC Input Box (are both RTSs working? Check their fault LEDs). Consider that the RTS or the AC Input Box is faulty. Also consider any interconnecting cables. If you are onsite and have access to the whole rack, remember that you have multiple RTS Modules, and two AC Input Boxes. Move these around and see if the power fault moves with one of these FRUs. If you can move the fault, you can track it. And if you can track it, you can eliminate it.
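The order of checks for a single-grid outage can be written down as a small decision function, which is sometimes handy as a checklist. The function name, its parameters and the wording of the returned actions are illustrative assumptions; only the ordering of the checks is taken from the paragraph above.

```python
def next_check(faulted_psus, input_box_breaker_tripped, rts_fault_led_lit):
    """Suggest the next action when one internal grid is down.

    faulted_psus: number of faulted PSUs in the grid's group of three.
    input_box_breaker_tripped: visual check of the grid's AC Input Box breaker.
    rts_fault_led_lit: whether a fault LED is lit on either RTS in the feeding RTU.
    """
    if faulted_psus >= 2:
        return "replace the faulted PSUs"
    if input_box_breaker_tripped:
        return "investigate the AC Input Box and its tripped breaker"
    if rts_fault_led_lit:
        return "investigate the RTS showing a fault LED"
    return "swap RTS modules, AC Input Boxes and cables to see if the fault moves"
```

For instance, `next_check(0, False, False)` falls through to the FRU-swapping step, reflecting the advice that a fault which can be moved can be tracked and eliminated.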
Figure 4 drills down to the Power Centerplane. If the SC showenv command shows that all power supplies are present, and there are no errors reported against any of the PSUs, then the power centerplane is likely to be good. The power centerplane has no active components on it so any problems with it tend to be connection issues (loose screws).
References
<NOTE:1003348.1> - Sun Fire[TM] 3800-6800 servers: RTU RTS Failover Fails
<NOTE:1017518.1> - Sun Fire[TM] 6800/E6900 Server: How to completely power off/on the server for datacenter maintenance.