Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1007790.1: Sun Fire[TM] 12K/15K/E20K/E25K: System Controller (SC) platform messages file reports "FRAD chkpt WRITE failed. session id: 128, return code: 8" errors.
Previously Published As: 210776

Applies to:
Sun Fire 15K Server - Version Not Applicable and later
Sun Fire E20K Server - Version Not Applicable and later
Sun Fire 12K Server - Version Not Applicable and later
Sun Fire E25K Server - Version Not Applicable and later
All Platforms

Symptoms

In the /var/opt/SUNWSMS/adm/platform/messages file on a Sun Fire[TM] 12K/15K/E20K/E25K SC, all components in the platform report FruAccess errors and "FRAD chkpt WRITE failed" messages such as the following:

Oct 4 07:00:43 2004 sc0 frad[660]: [10009 1379176584473204 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176638578801 ERR FruAccess.cc 554] Failed to update the power summary record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176639358700 ERR DynamicFru.cc 256] Failed to update the power summary record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 frad[660]: [10009 1379176775710362 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1991 1379176829667639 ERR FruAccess.cc 473] Failed to write the power event record of fru FT5: rc=-2
Oct 4 07:00:43 2004 sc0 esmd[1422]: [1992 1379176830622863 ERR DynamicFru.cc 394] Failed to write the power event record, STILL_ON, of fru FT5: rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357012147050 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357084158464 ERR FruAccess.cc 554] Failed to update the power summary record of fru SB14: rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357085045960 ERR DynamicFru.cc 256] Failed to update the power summary record of fru SB14: rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357173410163 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357221559279 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357222339133 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357334767963 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357388884910 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
Oct 4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357389675546 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
Oct 4 07:03:43 2004 sc0 frad[660]: [10009 1379357502066976 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
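To gauge how widespread the failure is, the messages file can be summarized from the shell. The following is a minimal sketch, not an SMS utility: the function names count_frad_chkpt_errors and list_affected_frus are hypothetical, and the default path is simply the platform messages file named above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SMS) for summarizing the symptom.

# Count "FRAD chkpt WRITE failed" entries in a messages file.
# Usage: count_frad_chkpt_errors [messages_file]
count_frad_chkpt_errors() {
    msgfile="${1:-/var/opt/SUNWSMS/adm/platform/messages}"
    # n+0 forces a numeric 0 when no line matches.
    awk '/FRAD chkpt WRITE failed/ {n++} END {print n+0}' "$msgfile"
}

# List each FRU named in an esmd "... of fru <name>" error, once per FRU.
# Usage: list_affected_frus [messages_file]
list_affected_frus() {
    msgfile="${1:-/var/opt/SUNWSMS/adm/platform/messages}"
    # Keep only the alphanumeric FRU name that follows " of fru ".
    sed -n 's/.* of fru \([A-Za-z0-9]*\).*/\1/p' "$msgfile" | sort -u
}
```

Called with no argument, each function reads the default messages path; pass a different file to examine a saved copy of the log.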
Cause

The error message indicates that FRAD, the FRU Access Daemon, cannot write to a checkpoint file because the daemon does not have permission to write to it.

Solution

For the FRAD chkpt error, the file in question is located in the /var/opt/SUNWSMS/data/.failover/chkpt directory on the SC. It is a checkpoint file that FOMD (Failover Management Daemon) uses as a reference for file propagation between SCs. If the permissions on this chkpt file are incorrect, the SMS daemon cannot write to it and the error messages appear. One possible "fix" is simply to open up the permissions on this file or directory so that the daemons can write to the chkpt file. As root:
chmod -R 777 /var/opt/SUNWSMS/data/.failover/chkpt
However, this is not a good solution, because the chkpt permissions may not be the real root cause and there may be other problems that need to be resolved. If the directory /var/opt/SUNWSMS/SMS1.4.1/data/.failover has the wrong group ownership, its subdirectories are not writable by the SMS daemons, and the error messages above appear. Changing only the permissions on the chkpt files or the chkpt directory is therefore not the correct course of action; first make sure the parent directory is not the actual root cause. The ownership of the whole directory structure needs to be corrected to head off future issues.

BAD CONFIGURATION (NOTE: ".cod" and ".failover" directories should be root:sms)
sms-svc> cd /var/opt/SUNWSMS/SMS1.4.1/data/
sms-svc> ls -la
total 54
drwxrwxr-x+ 23 root sms 512 Oct 4 14:15 .
drwxr-xr-x+ 8 root sys 512 Oct 2 00:52 ..
drwxrwxr-x 2 root bin 512 Jun 18 2002 .cod
drwxrwxr-x 6 root bin 512 Jun 18 2002 .failover
-r-------- 1 root sys 17 Sep 16 17:46 .remotesc
drwxr-xr-x 2 root sms 512 Oct 2 01:55 .wcapp
drwxrwx---+ 2 root sms 512 Oct 2 01:55 A
drwxrwx--- 2 root sms 512 Sep 12 02:45 B
drwxrwx--- 2 root sms 512 Sep 12 02:45 C
drwxrwx--- 2 root sms 512 Sep 12 02:45 D
drwxrwx--- 2 root sms 512 Sep 12 02:45 E
drwxrwx--- 2 root sms 512 Sep 12 02:45 F
drwxrwx--- 2 root sms 512 Sep 12 02:45 G
drwxrwx--- 2 root sms 512 Sep 12 02:45 H
drwxrwx--- 2 root sms 512 Sep 12 02:45 I
drwxrwx--- 2 root sms 512 Sep 12 02:45 J
drwxrwx--- 2 root sms 512 Sep 12 02:45 K
drwxrwx--- 2 root sms 512 Sep 12 02:46 L
drwxrwx--- 2 root sms 512 Sep 12 02:46 M
drwxrwx--- 2 root sms 512 Sep 12 02:46 N
drwxrwx--- 2 root sms 512 Sep 12 02:46 O
drwxrwx--- 2 root sms 512 Sep 12 02:46 P
drwxrwx--- 2 root sms 512 Sep 12 02:46 Q
drwxrwx--- 2 root sms 512 Sep 12 02:46 R
-rw-r----- 1 sms-dsmd sms 288 Oct 2 02:04 dsmd_domain_info
srwxrwxrwx 1 sms-efe sms 0 Oct 2 02:03 efeSock
-rw-r--r-- 1 sms-osd sms 72 Oct 2 00:11 osdTimeDeltas
-rw-r--r-- 1 root root 4 Oct 2 01:52 ssd_loop.pid

GOOD CONFIGURATION

sms-svc> pwd
/var/opt/SUNWSMS/SMS1.4.1/data
sms-svc> ls -la
total 54
drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 .
drwxr-xr-x+ 8 root sys 512 Sep 22 11:47 ..
drwxrwxr-x 2 root sms 512 Sep 22 11:51 .cod
drwxrwxr-x 6 root sms 512 Sep 22 11:46 .failover
-r-------- 1 root sys 17 Sep 23 12:08 .remotesc
drwxr-xr-x 2 root sms 512 Oct 1 11:00 .wcapp
drwxrwx---+ 2 root sms 512 Oct 1 11:00 A
drwxrwx---+ 2 root sms 512 Sep 29 14:17 B
drwxrwx---+ 2 root sms 512 Sep 27 10:27 C
drwxrwx---+ 2 root sms 512 Sep 27 10:27 D
drwxrwx---+ 2 root sms 512 Sep 22 11:51 E
drwxrwx---+ 2 root sms 512 Sep 22 11:51 F
drwxrwx---+ 2 root sms 512 Sep 22 11:51 G
drwxrwx---+ 2 root sms 512 Sep 22 11:51 H
drwxrwx---+ 2 root sms 512 Sep 22 11:51 I
drwxrwx---+ 2 root sms 512 Sep 22 11:51 J
drwxrwx---+ 2 root sms 512 Sep 22 11:51 K
drwxrwx---+ 2 root sms 512 Sep 22 11:51 L
drwxrwx---+ 2 root sms 512 Sep 22 11:51 M
drwxrwx---+ 2 root sms 512 Sep 22 11:51 N
drwxrwx---+ 2 root sms 512 Sep 22 11:51 O
drwxrwx---+ 2 root sms 512 Sep 22 11:51 P
drwxrwx---+ 2 root sms 512 Sep 30 14:21 Q
drwxrwx---+ 2 root sms 512 Sep 22 11:51 R
-rw-r----- 1 sms-dsmd sms 288 Oct 1 21:01 dsmd_domain_info
srwxrwxrwx 1 sms-efe sms 0 Oct 1 11:02 efeSock
-rw-r--r-- 1 sms-osd bin 72 Oct 1 17:58 osdTimeDeltas
-rw-r--r-- 1 root root 5 Oct 1 10:58 ssd_loop.pid
sms-svc> cd .failover
sms-svc> ls -la
total 12
drwxrwxr-x 6 root sms 512 Sep 22 11:46 .
drwxrwxr-x+ 23 root sms 512 Oct 2 16:30 ..
drwxrwxr-x 2 root sms 512 Oct 5 10:15 chkpt
drwxrwxr-x 2 root sms 512 Sep 22 11:51 fomd
drwxrwxr-x 2 root sms 512 Sep 22 11:46 local
drwxrwxrwx 2 root sms 512 Oct 5 10:55 tmp
sms-svc> cd chkpt
sms-svc> ls -la
total 10
drwxrwxr-x 2 root sms 512 Oct 5 10:15 .
drwxrwxr-x 6 root sms 512 Sep 22 11:46 ..
-rw-r--r-- 1 root other 544 Oct 1 17:32 2.128.1.0
-rw-r--r-- 1 root other 544 Oct 1 11:03 2.130.1.0
-rw-rw-rw- 1 root other 434 Oct 5 10:15 chkpt.list
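The two listings above differ in the group owner of the .cod and .failover directories (bin versus sms). Rather than comparing listings by eye, the check could be scripted roughly as follows. This is a sketch under stated assumptions: group_of and check_sms_groups are hypothetical helper names, not SMS commands, and the group is read from the fourth field of `ls -ld` output.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SMS) for checking group ownership.

# Print the group owner of a path (fourth field of `ls -ld`).
group_of() {
    ls -ld "$1" | awk '{print $4}'
}

# Report each named subdirectory whose group differs from the expected one.
# Returns 0 if every directory matches, 1 otherwise.
# Usage: check_sms_groups data_dir expected_group dir...
check_sms_groups() {
    datadir="$1"
    expected="$2"
    shift 2
    status=0
    for d in "$@"; do
        actual=$(group_of "$datadir/$d")
        if [ "$actual" != "$expected" ]; then
            echo "$datadir/$d: group is $actual, expected $expected"
            status=1
        fi
    done
    return $status
}
```

For the directories discussed here, the call would be `check_sms_groups /var/opt/SUNWSMS/SMS1.4.1/data sms .cod .failover`. On the bad configuration shown above it would print one line for each of the two directories and return 1.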
Ultimately, changing the permissions on only the /var/opt/SUNWSMS/SMS1.4.1/data/.failover/chkpt directory would allow SMS to write to that particular chkpt file, but it would leave the actual root cause, the bad group ownership of the top-level directories, in place, along with any other problems it may be causing. The fix is therefore to issue the following commands as root:
cd /var/opt/SUNWSMS/SMS1.4.1/data
chgrp -R sms .failover
chgrp -R sms .cod

Please see Additional Information for more suggestions.
It is also a good idea to confirm that the SMS daemons have the correct UIDs. In /etc/passwd, the entries for the various daemons are as follows:

sms-codd:x:10:54:SMS Capacity On Demand Daemon::
sms-dca:x:11:54:SMS Domain Configuration Agent::
sms-dsmd:x:12:54:SMS Domain Status Monitoring Daemon::
sms-dxs:x:13:54:SMS Domain Server::
sms-efe:x:14:54:SMS Event Front-End Daemon::
sms-esmd:x:15:54:SMS Environ. Status Monitoring Daemon::
sms-fomd:x:16:54:SMS Failover Management Daemon::
sms-frad:x:17:54:SMS FRU Access Daemon::
sms-osd:x:18:54:SMS OBP Service Daemon::
sms-pcd:x:19:54:SMS Platform Config. Database Daemon::
sms-tmd:x:20:54:SMS Task Management Daemon::
sms-svc:x:6:10:SMS Service User:/export/home/sms-svc:/bin/csh
sms-efhd:x:21:54:SMS Error and Fault Handling Daemon::
sms-elad:x:22:54:SMS Event Log Access Daemon::
sms-erd:x:23:54:SMS Event Reporting Daemon::
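The UID comparison can also be done from the shell instead of reading /etc/passwd by hand. The sketch below assumes the standard colon-separated passwd layout (name, password, UID, GID, ...). The function names list_sms_accounts and has_uid are hypothetical, and the expected UIDs are the ones listed in this article, which may not match every SMS release.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SMS) for checking SMS account IDs.

# Print "name uid gid" for every account whose name begins with "sms-".
# Usage: list_sms_accounts [passwd_file]
list_sms_accounts() {
    pwfile="${1:-/etc/passwd}"
    awk -F: '$1 ~ /^sms-/ {print $1, $3, $4}' "$pwfile"
}

# Succeed only if the named account has the given UID.
# Usage: has_uid account uid [passwd_file]
has_uid() {
    pwfile="${3:-/etc/passwd}"
    # Match "account:<password field>:uid:" at the start of a line.
    grep "^$1:[^:]*:$2:" "$pwfile" >/dev/null
}
```

For example, `has_uid sms-frad 17 || echo "sms-frad UID differs from the documented value"` flags a mismatch for the FRU Access Daemon account.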
Attachments

This solution has no attachment