Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1002125.1: Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang
PreviouslyPublishedAs 203025

Applies to:
Sun Fire 15K Server - Version All Versions and later
Sun Fire E25K Server - Version All Versions and later
All Platforms

Symptoms

A remote Dynamic Reconfiguration (DR) operation (from the System Controller) or a local DR operation (from the domain) works fine until a DR operation stops responding. The $SMSLOGGER/domain_Id/messages file then reports messages like:

    DCA/DCS communication error

and/or

    dca[...]-S(): [... ERR DCSInterface.cc 378] message receive failed: DCSInterface :: receiveResponse errCode:502

In some cases, it may not be possible to kill the associated process (cfgadm, rcfgadm, deleteboard, showdevices).

Cause

This is related to the libthread issue described in Document 1000512.1 (Applications Linked to "libthread" may Hang or Terminate Abnormally During Initialization - Solaris Bug 4730459). Note that this affects Solaris 8 (and below) only.

NOTE: As Solaris 8 is currently EOL (see the Oracle Lifetime Support Policy for detailed information), the first recommendation when this issue occurs is to consider upgrading to a current Solaris release.
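The log signatures listed under Symptoms can be searched for with a small helper. This is a minimal sketch: the function name count_dr_errors is hypothetical (it is not an SMS command), and it assumes a POSIX-conforming grep that accepts several -e patterns.

```shell
#!/bin/sh
# count_dr_errors: hypothetical helper, not part of SMS.
# Prints how many lines of a messages file match either error
# signature quoted in the Symptoms section (case-insensitive).
count_dr_errors() {
    # $1: path to a messages file, e.g. one under $SMSLOGGER
    grep -c -i \
        -e 'DCA/DCS communication error' \
        -e 'receiveResponse errCode:502' "$1"
}
```

A count of 0 only means that these two strings were not logged; it does not rule out a hung DR operation.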
    18541/4: 0.3145 creat("/var/opt/SUNWSMS/SMS1.4.1/pipes/C/scdr0", 0666) = 8
    18541/4: 0.3149 pipe() = 8 [9]
    [...]
    18541/4: 1.5339 ioctl(8, I_RECVFD, 0xFE77BF24) (sleeping...)
    18541/4: fd=9 uid=11 gid=20
    18541/4: 0.3517 open("/var/opt/SUNWSMS/SMS1.4.1/doors/H/dca", O_RDONLY) = 7
    18541/4: door_call(7, 0x00048CD8) (sleeping...)
    18541/4: door_call(7, 0x00048CD8) (sleeping...)
    29675/232: 13.3519 poll(0xFE3FBAF0, 1, 43200000) (sleeping...)
    29675/232: fd=12 ev=POLLIN rev=0
    # pfiles 29675
    29675: dca -d C
    [...]
    12: S_IFSOCK mode:0666 dev:308,0 ino:30682 uid:0 gid:0 size:0
        O_RDWR
        sockname: AF_INET 10.2.1.1 port: 39601
        peername: AF_INET 10.2.1.4 port: 665
    # grep sun-dr /etc/inetd.conf
    sun-dr stream tcp wait root /usr/lib/dcs dcs
    sun-dr stream tcp6 wait root /usr/lib/dcs dcs
    # grep sun-dr /etc/services
    sun-dr 665/tcp # Remote Dynamic Reconfiguration
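The pfiles and /etc/services output above ties the peer port of the dca socket (665) to the sun-dr service, which inetd hands to dcs. The lookup can be sketched as below. The function name port_service is hypothetical, and the sketch assumes a POSIX awk (on Solaris 8 this likely means nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk).

```shell
#!/bin/sh
# port_service: hypothetical helper.
# Prints the service name(s) that a services(4)-format file maps
# to a given port/protocol pair.
port_service() {
    # $1: port number, $2: protocol (tcp or udp)
    # $3: services file (defaults to /etc/services)
    awk -v want="$1/$2" '
        $1 ~ /^#/  { next }      # skip comment lines
        $2 == want { print $1 }  # field 2 looks like "665/tcp"
    ' "${3:-/etc/services}"
}

# Example, run on the domain:
#   port_service 665 tcp
```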
Trussing the dcs process(es) should confirm that they are all waiting for an update from the Reconfiguration Coordination Manager daemon (rcm_daemon):

    # ptree 432
    155 /usr/sbin/inetd -s
      432 dcs
        7122 dcs
    # truss -p 7122
    ...
    7122/1: 10.2993 open("/var/run/rcm_daemon_door", O_RDONLY) = 8
    7122/1: 55.7444 door_call(8, 0xFFBEE978) (sleeping...)
    # pfiles 7122
    7122: dcs
    [...]
    8: S_IFDOOR mode:0400 dev:305,0 ino:40644 uid:0 gid:1 size:0
       O_RDONLY door to rcm_daemon[7053]
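When several dcs processes are present, the door check above can be repeated for each of them. The following is a minimal sketch that only parses pfiles output; the function name door_servers is hypothetical.

```shell
#!/bin/sh
# door_servers: hypothetical helper.
# Reads pfiles(1) output on stdin and prints the server named on
# each "door to <server>[pid]" line, e.g. rcm_daemon[7053].
door_servers() {
    sed -n 's/.*door to \([^ ]*\).*/\1/p'
}

# Example, run on the domain as root:
#   for pid in $(pgrep -x dcs); do
#       echo "dcs $pid:"; pfiles "$pid" | door_servers
#   done
```

A dcs process that holds a door to rcm_daemon is not necessarily blocked on it; the sleeping door_call in the truss output is what shows the wait.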
    # pgrep rcm_daemon
    7053
    # pstack 7053
    7053: /usr/lib/rcm/rcm_daemon
    [...]
    ----------------- lwp# 5 / thread# 4 --------------------
    ff09f3d8 lwp_mutex_lock (ff29cd10)
    ff287698 fork1 (ff29c000, a, 35cc8, ff29d670, 534d, 1) + 50
    0001ac1c run_script (0, 35cc0, 0, 0, 2, 35ca0) + 154
    0001b4c4 do_cmd (30910, fea0b62c, 30910, fea0b62c, 0, 35ca0) + 34
    0001bf2c script_register_interest (35cd8, ffffffff, 0, 35ca0, 354c0, 0) + 98
    000173fc rcmd_db_sync (308e8, 35c68, ffffffff, 19598, 19fbc, 0) + 7c
    000195c0 rcmd_thr_incr (30e06, 89200, 6, fea0b798, 35f80, 0) + c4
    00012bd8 event_service (fea0bc50, fea0bc54, 0, fea0bc88, 0, 0) + f4
    ff2b40dc door_service (31f28, ff2c6000, b0, 31f28, 0, 4) + 64
    ff09c9ec _door_return (0, 38, e0000, 1, 11, 72636d2e) + 68
    [...]
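In the stack above, the rcm_daemon thread is blocked in lwp_mutex_lock, called directly from fork1. A rough way to look for the same pair of frames in other pstack output is sketched below. The function name has_fork1_mutex_hang is hypothetical, the match is only a heuristic based on this one stack, and a POSIX awk is assumed.

```shell
#!/bin/sh
# has_fork1_mutex_hang: hypothetical helper.
# Reads pstack(1) output on stdin and returns 0 if some frame
# containing lwp_mutex_lock is immediately followed by a fork1
# frame, as in the stack shown in this document.
has_fork1_mutex_hang() {
    awk '
        /lwp_mutex_lock/ { prev = 1; next }   # remember the top frame
        prev && /fork1/  { found = 1 }        # caller frame is fork1
        { prev = 0 }
        END { exit (found ? 0 : 1) }
    '
}

# Example, run on the domain as root:
#   pstack "$(pgrep rcm_daemon)" | has_fork1_mutex_hang && echo "signature found"
```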
Solution

The workaround applies once SunOS 5.8 rcm_daemon patch 116991-03 has been installed, so that rcm_daemon links with the alternate libthread. It can be done with a script:

    #!/bin/sh
    # Put the alternate libthread directories first in the library search path
    LD_LIBRARY_PATH=/usr/lib/lwp
    export LD_LIBRARY_PATH
    LD_LIBRARY_PATH_64=/usr/lib/lwp/64
    export LD_LIBRARY_PATH_64
    # Start rcm_daemon with that environment
    /usr/lib/rcm/rcm_daemon

or from the command line:

    # pkill -9 rcm_daemon
    # LD_LIBRARY_PATH=/usr/lib/lwp LD_LIBRARY_PATH_64=/usr/lib/lwp/64 /usr/lib/rcm/rcm_daemon

On the Solaris[TM] 8 Operating System, to verify that rcm_daemon is using the alternate libthread, run pldd(1) against the process. The output should include /usr/lib/lwp/libthread.so.1, for example:

    # pgrep rcm
    7204
    # pldd 7204
    7204: /usr/lib/rcm/rcm_daemon
    /usr/lib/libgen.so.1
    /usr/lib/libelf.so.1
    /usr/lib/libdl.so.1
    /usr/lib/libcmd.so.1
    /usr/lib/libdoor.so.1
    /usr/lib/librcm.so.1
    /usr/lib/lwp/libthread.so.1
    /usr/lib/libnvpair.so.1
    /usr/lib/libdevinfo.so.1
    /usr/lib/libnsl.so.1
    /usr/lib/libsocket.so.1
    /usr/lib/libc.so.1
    /usr/lib/libmp.so.2
    /usr/platform/sun4u-us3/lib/libc_psr.so.1
    /usr/lib/rcm/modules/SUNW_cluster_rcm.so
    /usr/lib/rcm/modules/SUNW_dump_rcm.so
    /usr/lib/rcm/modules/SUNW_filesys_rcm.so
    /usr/lib/rcm/modules/SUNW_ip_rcm.so
    /usr/lib/rcm/modules/SUNW_network_rcm.so
    /usr/lib/rcm/modules/SUNW_swap_rcm.so
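The pldd check above can be scripted. This is a minimal sketch: the function name uses_alt_libthread is hypothetical, and it only tests whether the 32-bit alternate library path from the example appears in the pldd output.

```shell
#!/bin/sh
# uses_alt_libthread: hypothetical helper.
# Reads pldd(1) output on stdin and returns 0 if the alternate
# libthread (/usr/lib/lwp/libthread.so.1) is among the loaded objects.
uses_alt_libthread() {
    grep '/usr/lib/lwp/libthread\.so\.1' >/dev/null
}

# Example, run on the domain as root:
#   if pldd "$(pgrep rcm_daemon)" | uses_alt_libthread; then
#       echo "rcm_daemon is using the alternate libthread"
#   else
#       echo "rcm_daemon is using the default libthread"
#   fi
```

Output is redirected to /dev/null rather than using grep -q, since not every grep on older Solaris releases is certain to accept -q.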
Also, see: Internal BugID 15413600 - libthread`_co_timerset() may attempt to acquire _calloutlock twice

References

<BUG:4730459> - HC4/5: KOREAN VAT CONCURRENT PROGRAM NOT AVAILABLE
<BUG:4825286> - DEFECT000341773 NON MERCHANDISE COST - WHEN CREATING DEBIT MEMOS, CREDIT NOTE
<BUG:6234740> - THERE IS WRONG MAPPING FOR USER EXIT TYPE FOR C_DIDATA-VOICEMAIL_1-0-0_ADD_U...
<NOTE:1000512.1> - Applications Linked to "libthread" may Hang or Terminate Abnormally During Initialization
<NOTE:1003582.1> - Sun Fire[TM] 12K/15K/E20K/E25K: What Happens in a DR Slot0 Detach Operation
<NOTE:1004922.1> - Sun Fire[TM] servers: Trouble-shooting RCM failures events in DR operations
<NOTE:1008803.1> - Sun Fire[TM] 12K/15K/E20K/E25K: showdevices can hang if sd.conf is large or misleading
<NOTE:1008805.1> - Sun Fire[TM] 12K/15K/E20K/E25K: Remote Dynamic Reconfiguration (DR) generates "DCA/DCS Communication Error" and showdevices is "Unable to get device information from domain".
<NOTE:1009124.1> - Sun Fire[TM] 12K/15K/E20K/E25K: showdevices takes a long time to return
<NOTE:1012320.1> - Sun Fire[TM] 12K/15K/20K/25K: Domain reports "sun-dr/tcp: bind: Address already in use"

Attachments

This solution has no attachment