Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution
2100823.1 : Troubleshooting ODA REPO/VM Guest related issues : Basic Collection
Explains some basic concepts behind ODA VMs, along with data collection and analysis methods.
Applies to:
Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
This note focuses on VM related issues on the ODA platform. Much of it may be similar in a non-ODA environment, but the focus here is on ODA related technology.

Purpose
Explain how to analyze issues related to Linux guest VMs on ODA. Windows and Solaris guest VM issues are not covered here.

Troubleshooting Steps
What is special about a VM guest on ODA compared with a general VM guest?
1. A VM on ODA can reside on a shared REPO on ACFS at ODA_BASE.
A guest VM image can be placed either on a local repo, which is the same as for a general OVM guest, or on a shared repo, which is special to ODA.
Example of how a shared repo works on the ODA platform:
192.168.16.10:/u01/app/sharedrepo/lab1 839909376 114954400 724954976 14% /OVS/Repositories/lab1
This is an NFS mount point, mounted through HANFS from oda_base.
This example includes various example names such as lab1-375.
/dev/asm/lab1-375 839909376 114954420 724954956 14% /u01/app/sharedrepo/lab1
Volume Name: LAB1
Summary: A shared repo on ODA goes through the following path:
ASM volume -> ACFS directory (oda_base) -> HANFS mount (dom0)
If this stack has an issue at any point, a VM on the shared repo may crash, go read-only, have problems during startup, or possibly show performance issues.
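The path in the summary can be read directly off the two df lines shown above. The following is a minimal Python sketch, not an ODA tool: the function names are made up for illustration, and it assumes the standard six-column `df -k` layout (source, blocks, used, available, use%, mount point).

```python
from typing import List, NamedTuple


class DfEntry(NamedTuple):
    source: str
    mountpoint: str


def parse_df_line(line: str) -> DfEntry:
    """Parse one df -k data line (assumed six whitespace-separated columns)."""
    fields = line.split()
    return DfEntry(source=fields[0], mountpoint=fields[5])


def repo_stack(dom0_line: str, oda_base_line: str) -> List[str]:
    """Return [ASM volume device, ACFS directory on oda_base, NFS mount on dom0]."""
    dom0 = parse_df_line(dom0_line)
    base = parse_df_line(oda_base_line)
    # The dom0 source looks like host:/exported/path; the exported path
    # should be the ACFS mount point seen on oda_base.
    _, exported = dom0.source.split(":", 1)
    if exported != base.mountpoint:
        raise ValueError("dom0 NFS source does not match the ACFS mount point")
    return [base.source, base.mountpoint, dom0.mountpoint]


# The two df lines quoted in this note.
DOM0_LINE = ("192.168.16.10:/u01/app/sharedrepo/lab1 839909376 114954400 "
             "724954976 14% /OVS/Repositories/lab1")
ODA_BASE_LINE = ("/dev/asm/lab1-375 839909376 114954420 724954956 14% "
                 "/u01/app/sharedrepo/lab1")

if __name__ == "__main__":
    print(" -> ".join(repo_stack(DOM0_LINE, ODA_BASE_LINE)))
```

If the exported path on dom0 does not match any ACFS mount on oda_base, the sketch raises an error, which is a hint to look at the HANFS or ACFS layer rather than at the guest VM.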
These commands are accepted by oakd and redirected to the master oakd, which dispatches them to odaBaseAgent on oda_base and oakVmagent on dom0 to execute the related command.

Steps for troubleshooting:
1. We need to identify which layer has the issue. The layers that may be involved are:
An issue at a lower layer (except L0) causes issues at all the layers above it. In some cases we cannot identify which layer has the issue, or several layers may have issues together (for example, a performance issue at the VM). We need to be very careful about how we validate the layers in such cases.
The detailed commands to identify whether each layer has an issue:
L1. DOM0 OS -- Try to ssh to the related dom0 IP address; use the ILOM console to access the related dom0 console; run some basic OS commands (ls, df, top).
L2. ODA_BASE OS -- Try to ssh to the related oda_base IP address; use xm list/console to access the related oda_base console, or use VNC to connect to dom0:5900 to access the console; run some basic OS commands (ls, df, top).
L3. GI/ASM -- As the grid user, run crsctl stat res -t, crsctl stat res -init -t and ps -ef|grep smon, or log in to the ASM instance with sqlplus to check.
L4. ASM Diskgroup -- As the grid user, run asmcmd lsdg to make sure all the diskgroups are in the MOUNTED state.
L5. ASM Volume Group -- Run asmcmd volinfo --all to make sure the related VG state is ENABLED.
L6. ACFS/ADVM driver mount -- On ODA_BASE, run df -k to check that the related directory is mounted, then try to read files (more) and write files (touch) in that directory.
L7. HANFS -- HAVIP/exportfs -- Run crsctl stat res -t (the related file system and HAVIP resources exist and are online); run exportfs -v (the related filesystem is in the export list).
L8. NFS -- On dom0, run df -k to make sure the related directory is mounted, then try to read files (more) and write files (touch) in that directory.
L9. DOM0 xen -- xm list; xm console -- these commands can show output on dom0.
L10. Inside the Guest VM -- Try to ssh to the related guest VM IP address; use xm list/console to access the related guest VM console, or use VNC to connect to dom0:59XX to access the console; run some basic OS commands (ls, df, top) inside the guest VM.
L0 oakd layer
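The rule that a lower-layer issue shows up at every layer above it can be written down as a small lookup. This is only an illustrative Python sketch: the function names are hypothetical, and treating L0 (oakd) as not propagating upward is one reading of the "except L0" remark.

```python
from typing import Dict, List, Optional

# Layer numbers and names as listed in this note.
LAYERS = {
    0: "oakd",
    1: "DOM0 OS",
    2: "ODA_BASE OS",
    3: "GI/ASM",
    4: "ASM diskgroup",
    5: "ASM volume group",
    6: "ACFS/ADVM driver mount",
    7: "HANFS (HAVIP/exportfs)",
    8: "NFS mount on dom0",
    9: "DOM0 xen",
    10: "Guest VM",
}


def affected_layers(failed: int) -> List[int]:
    """Layers expected to show symptoms when layer `failed` has an issue."""
    if failed == 0:
        # Assumption: an oakd (L0) issue does not propagate to the layers above.
        return [0]
    return [n for n in sorted(LAYERS) if n >= failed]


def lowest_failing_layer(results: Dict[int, bool]) -> Optional[int]:
    """Given {layer: check passed?}, return the lowest failing layer in L1-L10."""
    failing = [n for n, ok in results.items() if n >= 1 and not ok]
    return min(failing) if failing else None


if __name__ == "__main__":
    # Hypothetical check results: the ACFS mount and everything above it fail.
    checks = {1: True, 2: True, 3: True, 4: True, 5: True,
              6: False, 7: False, 8: False, 9: True, 10: False}
    layer = lowest_failing_layer(checks)
    print("Start the analysis at L%d: %s" % (layer, LAYERS[layer]))
```

The lowest failing layer is only a starting point. As the case studies show, a symptom at one layer (for example an ODA_BASE crash at L2) can still be caused by a different layer.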
Sometimes it is not obvious which layer to start testing from. We may need to start from one layer and test the layers above and below it to verify.
2. Based on the related layer/type, collect information and analyze the data for the issue (Case Study)
L1 DOM0 OS
Data collection:
1. sosreport from dom0 -- OS messages
2. ILOM snapshot -- console log for the dom0
3. dmesg output
--- When dom0 is hanging we cannot do anything except reboot. Most of the time the reboot completes and we then need an RCA.
--- For the RCA we usually look for HW issues in ILOM, kernel panics in the console logs or OS message file, and a core dump file in the related core dump directory if kdump has been set up.
--- If nothing can be found, we need to install OSW on dom0 so that CPU/memory related data is collected the next time the issue happens.
Case Study 1.1 Bug 17896838 - DOMU LOST NETWORK CONNECTION DURING COPYING BIG FILES OVER NFS
Symptom: All VMs, including ODA_BASE, hang on the system. dom0 can be accessed but all the network connections are lost.
Finding: meminfo shows that dom0 is short of memory during the connection problem.
Case Study 1.2 BUG 19358298 - DOM0 SYSTEM BOOT DISK CAUSE THE WHOLE VM SYSTEM HANG
Symptom: All VMs, including ODA_BASE, were hanging on the system and dom0 could not be accessed while this was happening.
Finding: The I/O service time on the local disk at dom0 is very high and the related disk utilization is 100%.
Solution: Replace the related disk.
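For a finding like the one in Case Study 1.1, the memory figures can be pulled out of a saved copy of /proc/meminfo (for example from a sosreport). This is a rough Python sketch: the 5% threshold and the sample values are made up for illustration and are not taken from the bug.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def is_low_memory(meminfo: Dict[str, int], threshold: float = 0.05) -> bool:
    """True when MemFree/MemTotal is below `threshold` (hypothetical cut-off)."""
    return meminfo["MemFree"] / meminfo["MemTotal"] < threshold


# Illustrative sample only; not data from the case.
SAMPLE = """MemTotal:        2097152 kB
MemFree:           52428 kB
Buffers:           10240 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("low memory" if is_low_memory(info) else "memory looks fine")
```

MemFree alone understates reclaimable memory, so a low value is a reason to look closer at OSW data rather than proof of a memory shortage.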
L2 oda_base OS issue
Data collection:
1. sosreport from oda_base
2. xm console (output from dom0)
3. dmesg output
4. If the issue is happening:
5. OSW information (OSW is installed by default on ODA_BASE)
Case Study 2.1: Bug 22495710 / SR 3-11955744301
Symptom: ODA_BASE crashes every time we try to boot it, so the whole system is down.
Finding: Using xm console we can see that ODA_BASE crashes with a panic in the ACFS driver. (So it shows up as an L2 issue but is actually an L6 issue.) We still need the diagnostics from L2.
Solution: Disable CRS and apply the related ACFS patch; the issue was then solved.
Comments: This was first verified as an L2 issue, but after GI was disabled ODA_BASE was able to boot, so an L3 issue was causing the L2 issue. We then started CRS without ASM to verify that GI had no issue. With the ACFS driver disabled, ASM and the related diskgroup could start, which verified that there were no L4/L5 issues. The console messages also indicated an L6 issue.
Case Study 2.2
It is a common issue that ODA_BASE runs out of guest VM resources such as memory/CPU, which can be identified with OSW.
Case Study 2.3 3-9272040911 and 3-9354104211
Symptom: In an ODA VM environment, pinging oda_base shows random packet loss. Both customers have Cisco switches and VLAN settings.
Finding: It is a Xen driver bug on ODA.
Solution: echo 0 > /sys/class/net/net1/bridge/multicast_snooping
Comments: tcpdump is available by default on ODA_BASE/dom0 and can be installed on a guest VM, which helps in such cases.
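Before or after applying the workaround from Case Study 2.3, the current setting can be read back from sysfs. The sketch below is a hedged Python example: the function name and the `sysfs_root` parameter are illustrative, and the bridge name `net1` is the one used in the workaround above.

```python
from pathlib import Path


def multicast_snooping_enabled(bridge: str,
                               sysfs_root: str = "/sys/class/net") -> bool:
    """Return True unless the bridge's multicast_snooping file contains 0."""
    path = Path(sysfs_root) / bridge / "bridge" / "multicast_snooping"
    return path.read_text().strip() != "0"


if __name__ == "__main__":
    # Needs a host where the bridge exists, for example dom0.
    try:
        state = multicast_snooping_enabled("net1")
        print("multicast_snooping on net1:", "on" if state else "off")
    except OSError as exc:
        print("could not read setting:", exc)
```

The `sysfs_root` parameter exists only so the function can be pointed at a copied directory tree instead of the live /sys.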
L3. GI/ASM
Data Collection:
1. GI related logs: Data Collection for Troubleshooting Oracle Clusterware (CRS or GI) And Real Application Cluster (RAC) Issues (Doc ID 289690.1)
2. ASM related logs/traces: /u01/app/grid/diag/asm/+asm/+ASM1(2)/trace
Case 3.1 3-11575993804/Doc ID 2013879.1
Symptom: When one node is powered off, the other node crashes.
Finding: ocssd.log shows that the interconnect was lost and that the restart happened after that.
Solution: It is a GI split-brain issue related to the IB driver on ODA X5-2. A node reboot is actually an L2 issue, but because OCSSD rebooted the node we need to treat it as an L3 issue: check the related GI logs, then focus on the interconnect and find the IB driver issue at L2.
Case 3.2 ASM Instance Fails with an ORA-600 [723] Error (Doc ID 1581539.1)
An ASM memory leak can cause oda_base to hang and crash. We need to get OSW data, find out what is using most of the memory, and get an RCA.
L4. ASM diskgroup
Data collection:
1. ASM metadata
2. ASM related logs/traces
/u01/app/grid/diag/asm/+asm/+ASM1(2)/trace
3. Some internal events may be needed to collect more data.
Case 4.1:3-11787213061/ 3-11787213061 Bug 22294722 (ASM / OSD ) / Bug 21300303 (ODA)
Symptom: The customer complained that all the VMs suddenly stopped working. After checking, we found that while rebalancing the REDO diskgroup, ASM may offline two good disks without reason.
WARNING: Write Failed. group:3 disk:3 AU:17500 offset:7340032 size:1048576
path:/dev/mapper/SSD_E0_S20_805741335p1 incarnation:0x752dc23b asynchronous result:'I/O error' subsys:System krq:0x7fcaa5f10f90 bufp:0x7fcaa49f4000 osderr1:0x69b5 osderr2:0x0 IO elapsed time: 0 usec Time waited on I/O: 0 usec
--- ASM shows osderr2:0x0, which gives no reason, yet the write still failed and the disk was taken offline, which caused the issue.
Solution: Apply ASM related diag patches to find more detail.
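When many such warnings appear in the ASM alert log, it can help to pull the key:value fields out of each message. The following Python sketch works on the message quoted above. The function names are hypothetical, and treating `osderr2` of zero as "no reason reported" follows the comment in this case study rather than any documented rule.

```python
import re
from typing import Dict

# The warning from Case 4.1, joined into one string.
ASM_WARNING = (
    "WARNING: Write Failed. group:3 disk:3 AU:17500 offset:7340032 size:1048576 "
    "path:/dev/mapper/SSD_E0_S20_805741335p1 incarnation:0x752dc23b "
    "asynchronous result:'I/O error' subsys:System krq:0x7fcaa5f10f90 "
    "bufp:0x7fcaa49f4000 osderr1:0x69b5 osderr2:0x0 "
    "IO elapsed time: 0 usec Time waited on I/O: 0 usec"
)


def parse_write_failed(message: str) -> Dict[str, str]:
    """Collect every key:value token (no space after the colon) into a dict."""
    return dict(re.findall(r"(\w+):(\S+)", message))


def no_reason_reported(fields: Dict[str, str]) -> bool:
    """True when osderr2 is zero, as in the case described in this note."""
    return int(fields["osderr2"], 16) == 0


if __name__ == "__main__":
    fields = parse_write_failed(ASM_WARNING)
    print(fields["group"], fields["disk"], fields["path"],
          no_reason_reported(fields))
```

Tokens with a space after the colon, such as `IO elapsed time: 0 usec`, are deliberately skipped by the pattern.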
Case 4.2 Bug 21369858 - ORA-15196: INVALID ASM BLOCK HEADER [KFC.C:29297] [ENDIAN_KFBH]
This is an ACFS driver bug at L6. What the customer sees is that the related VM crashes.
The ASM diskgroup then cannot mount because of the error INVALID ASM BLOCK HEADER [KFC.C:29297] [ENDIAN_KFBH] in the ASM alert log.
L5. ASM volume Group
Data collection:
--- Make sure the related VG state is ENABLED (asmcmd volinfo --all).
--- Make sure the related entries match the output of: acfsutil registry -l
Case 5.1:
ODA: OAKERR:5003 Command 'oakcli show repo' Does Not Show Shared Repository (Doc ID 2057349.1)
This is related to the OAKD layer, which we will discuss later, but the real issue here is that the customer deleted the VG in asmcmd and did not clean up the ACFS registry.
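The mismatch in Case 5.1 (a registry entry whose volume no longer exists) can be spotted by comparing volume devices from asmcmd volinfo --all with the devices listed by acfsutil registry -l. The Python sketch below is only an illustration: the sample text assumes a "Volume Device:" / "State:" layout for the volinfo output, the diskgroup name DATA and the device /dev/asm/old1-375 are hypothetical, and the registry devices are passed in as a plain list rather than parsed.

```python
from typing import Dict, Iterable, List


def parse_volinfo(text: str) -> Dict[str, str]:
    """Return {volume device: state} from volinfo-style 'Key: value' lines."""
    states = {}
    device = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Volume Device":
            device = value
        elif key == "State" and device is not None:
            states[device] = value
            device = None
    return states


def stale_registry_entries(volumes: Dict[str, str],
                           registry_devices: Iterable[str]) -> List[str]:
    """Devices still registered in ACFS but missing from the volume list."""
    return sorted(set(registry_devices) - set(volumes))


# Assumed output layout; the diskgroup name is hypothetical.
VOLINFO_SAMPLE = """Diskgroup Name: DATA

         Volume Name: LAB1
         Volume Device: /dev/asm/lab1-375
         State: ENABLED
"""

if __name__ == "__main__":
    vols = parse_volinfo(VOLINFO_SAMPLE)
    registered = ["/dev/asm/lab1-375", "/dev/asm/old1-375"]  # hypothetical
    print(stale_registry_entries(vols, registered))
```

Any device reported as stale is a candidate for the cleanup described in Doc ID 2057349.1. The sketch only reports the difference and changes nothing.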
L6. ACFS/ADVM driver mount
Data collection:
Troubleshooting ACFS Repository/VM Mounting Issues on Oracle Database Appliance (Doc ID 2037999.1)
Case 6.1 BUG 21307906 -- The ACFS driver cannot mount on one node after upgrade.
Cause: The ACFS driver versions on the two nodes are different. One node was upgraded and the other has not been upgraded yet.
Solution: Manually upgrade the other node.
L7. HANFS -- HAVIP/exportfs
Data Collection:
1. crsctl stat res -t (the related file system and HAVIP resources exist and are online)
2. exportfs -v (the related filesystem is in the export list)
3. Details of the related fs/havip are listed in the white paper.
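The export-list check in step 2 can be done against saved exportfs -v output. This is a small Python sketch under stated assumptions: it assumes each export starts at the beginning of a line with the exported path as the first token, and the client address and options in the sample are hypothetical.

```python
from typing import List


def exported_paths(exportfs_output: str) -> List[str]:
    """Return the first token of every line that does not start with whitespace."""
    paths = []
    for line in exportfs_output.splitlines():
        if line and not line[0].isspace():
            paths.append(line.split()[0])
    return paths


def is_exported(exportfs_output: str, path: str) -> bool:
    """True when `path` appears as an exported directory."""
    return path in exported_paths(exportfs_output)


# Hypothetical sample; the client address and options are illustrative only.
EXPORTFS_SAMPLE = (
    "/u01/app/sharedrepo/lab1\n"
    "\t\t192.168.16.24(rw,wdelay,no_root_squash)\n"
)

if __name__ == "__main__":
    print(is_exported(EXPORTFS_SAMPLE, "/u01/app/sharedrepo/lab1"))
```

A missing path here points at the HANFS layer (L7) even when the ACFS mount on oda_base (L6) looks healthy.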
Case Study 7.1 SR 3-11618165711
Symptom: After the interconnect interface was changed, the REPO cannot start up.
Cause: After the interconnect change, the related HAVIP interface was not changed.
Solution: Change the HAVIP interface.
Case Study 7.2 SR 3-11430277331
Symptom: The guest VM system suddenly turns read-only.
Cause: The customer manually ran "exportfs -ra" on the system.
L8. dom0 NFS client issue/oda_base NFS server issue
Data collection:
1. sosreport from dom0 and oda_base -- focus on OS messages
2. dmesg output
3. OSW data from oda_base and dom0
------- In most cases the related errors show up in the OS messages.
Case Study 8.1: 3-12074018491
We are still working on this case, but it points to a memory issue on oda_base for the NFS server.
L9. dom0 OVM (xen) related issue
1. sosreport from dom0
2. /var/log/xen/
3. xm list -- output
4. xm console VM_ID -- output
Case Study 9.1: 3-11995002281
Oracle VM: Failure to start guest Virtual Machine: "Hotplug scripts not working" (Doc ID 1089604.1)
Message Log: Errors first reported
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51728/node failed. <<<<<<<<<<<<< Here
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: /etc/xen/scripts/block failed; error detected.
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51744/node failed. <<<<<<<<<<<<< Here
XEND Log
=========
File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2294, in _restart
new_dom.waitForDevices()
File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1249, in waitForDevices
self.getDeviceController(devclass).waitForDevices()
File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
VmError: Device 51728 (vbd) could not be connected. /OVS/Repositories/drake0/.ACFS/snaps/oakvdk_lnx-inb-test-home/VirtualDisks/oakvdk_lnx-inb-test-home does not exist. <<<<<<<<<<<********* <<<<<<<<<<<< Error reported on the VM
[2016-01-04 09:23:17 15469] DEBUG (XendDomainInfo:108) XendDomainInfo.create(['vm', ['name', 'lnx-inb-test'], ['memory', 12288], ['maxmem', 12288], ['on_xend_start', 'ignore'], ['on_xend_stop', 'ignore'], ['vcpu_avail', '1'], ['vcpus', 1], ['cpus
This is related to OVM bugs, which can be worked around by setting parameters.
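When scanning dom0 messages and xend logs for this pattern, the failing virtual block devices can be extracted with two regular expressions. This Python sketch uses only lines quoted in Case Study 9.1. The function names are hypothetical, and reading the two numbers in the xenstore path as domain ID and device ID is an assumption based on the usual backend/vbd layout.

```python
import re
from typing import List, Tuple

VBD_FAIL = re.compile(r"xenstore-read backend/vbd/(\d+)/(\d+)/node failed")
VM_ERROR = re.compile(r"VmError: Device (\d+) \((\w+)\) could not be connected")


def failed_vbds(log_text: str) -> List[Tuple[int, int]]:
    """Return (domain id, device id) pairs from failed hotplug block-script lines."""
    return [(int(dom), int(dev)) for dom, dev in VBD_FAIL.findall(log_text)]


def unconnected_devices(log_text: str) -> List[Tuple[int, str]]:
    """Return (device id, device class) pairs from xend VmError lines."""
    return [(int(dev), cls) for dev, cls in VM_ERROR.findall(log_text)]


# Lines taken from the case study above.
LOG_SAMPLE = """\
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51728/node failed.
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: /etc/xen/scripts/block failed; error detected.
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51744/node failed.
VmError: Device 51728 (vbd) could not be connected.
"""

if __name__ == "__main__":
    print(failed_vbds(LOG_SAMPLE))
    print(unconnected_devices(LOG_SAMPLE))
```

Matching the device IDs between the two logs, as with 51728 here, ties the hotplug failure in the OS messages to the VmError reported by xend.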
L10. Inside the Guest VM
Data collection:
1. sosreport from the related VM.
2. OSW from the guest VM.
3. xm console output.
Most issues happening inside the guest VM can be purely Linux OS related.
Case Study: Trouble shooting Guest VM Startup related issue on share repo of ODA (Doc ID 2102380.1)
"Oracle Database Appliance", ODA, ODAVP, ODA Virtualized, ODA troubleshooting, ODA Debug, Guest VM, VM, shared REPO, ACFS, ODA_BASE, Dom0, DomU, HANFS, HAVIP
Attachments
This solution has no attachment