Troubleshooting ODA REPO/VM Guest Related Issues: Basic Collection

Asset ID:	1-75-2100823.1
Update Date:	2017-09-18
Keywords:

Solution Type Troubleshooting Sure

Solution 2100823.1 : Troubleshooting ODA REPO/VM Guest Related Issues: Basic Collection

Applies to:

Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
This note focuses on VM-related issues on the ODA platform. Much of it is similar to non-ODA environments, but the emphasis here is on ODA-specific technology.

Purpose

Explain how to analyze issues related to Linux guest VMs on ODA. Windows and Solaris guest VM issues are not covered here.

Troubleshooting Steps

What is special about the VM guest on ODA compared with a general VM guest?

1. A VM on ODA can reside in a shared REPO on ACFS in ODA_BASE

A guest VM image can be placed either in a local repo, which works the same way as for a general OVM guest, or in a shared repo, which is specific to ODA.

This note will focus on the shared repo.

Example of how a shared repo works on the ODA platform:

On dom0: /OVS/Repositories/lab1

192.168.16.10:/u01/app/sharedrepo/lab1 839909376 114954400 724954976 14% /OVS/Repositories/lab1

It is an NFS mount point, mounted through HANFS from oda_base.

This example uses sample names such as lab1-375.

On oda_base, the related ASM volume device /dev/asm/lab1-375
is mounted on the directory /u01/app/sharedrepo/lab1, which is exported to dom0 through NFS:

/dev/asm/lab1-375 839909376 114954420 724954956 14% /u01/app/sharedrepo/lab1

Volume Name: LAB1
Volume Device: /dev/asm/lab1-375
State: ENABLED
Size (MB): 820224
Resize Unit (MB): 64
Redundancy: MIRROR
Stripe Columns: 8
Stripe Width (K): 1024
Usage: ACFS
Mountpath: /u01/app/sharedrepo/lab1

Summary: A shared repo on ODA goes through the following path:

ASM Volume -> to ACFS directory (oda_base) -> to HANFS mount (dom0)

If this stack has an issue at any point, a VM on the shared repo may crash, go read-only, fail during startup, or suffer performance problems.
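
As a minimal sketch of a first sanity check on this path, the function below reads df output and reports the NFS source behind a repository mount point, as in the lab1 example above. The function name repo_nfs_source is hypothetical; it only parses text and changes nothing on the system.

```shell
# Hypothetical helper: read `df -P` style lines on stdin and print the NFS
# source (host:/export) of the given mount point. Returns 1 if the mount
# point is not listed or its source does not look like an NFS export.
repo_nfs_source() {
    awk -v mp="$1" '
        $NF == mp && $1 ~ "^[^/:]+:/" { print $1; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on dom0 (paths taken from the example above):
#   df -P /OVS/Repositories/lab1 | repo_nfs_source /OVS/Repositories/lab1
```

If the function returns non-zero, the repo directory on dom0 is either not mounted or not backed by the HANFS export, which points at the NFS/HANFS part of the stack rather than at the guest VM.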

2. ODA uses "oakd" instead of "OVM Manager".

OVM Manager commands cannot be run on dom0.
ODA uses oakd and the VM agents on ODA_BASE and dom0 to monitor and control repo and VM operations.
All VM related commands should be run from ODA_BASE using oakcli.

These commands are accepted by oakd and redirected to the master oakd, which dispatches them to odaBaseAgent on oda_base and oakVmagent on dom0 for execution.
Without oakd we can still manually mount the repos and start the related VMs.
Most of the time a problem shows up as an oakd issue, but the underlying cause is in another layer.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Steps for troubleshooting:

1. We need to identify which layer has the issue.

So here are the layers which may be involved:

L0. OAKD
L1. DOM0 OS
L2. ODA_BASE OS
L3. GI/ASM
L4. ASM Diskgroup
L5. ASM Volume Group
L6. ACFS/ADVM driver
L7. HANFS -- HAVIP/exportfs
L8. NFS
L9. Dom0 OVM (xen)
L10. Inside the Guest VM

L0. The OAKD layer is the monitor/control layer and relates to all other layers.

An issue in a lower layer (except L0) causes issues in all the layers above it.
For example: if an ASM diskgroup is dismounted, none of the VMs on that diskgroup can start. Conversely, if a higher layer is working fine, the layers below it must be fine.
For example: if the NFS mount point on dom0 can be read and written without issue, the related ASM diskgroup must be mounted.

In some cases we cannot identify which layer has the issue, or several layers have issues at the same time, for example a performance issue in a VM. In such cases we need to be very careful about how we validate each layer.
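
The layering rule above can be sketched as a small driver that runs one check per layer, lowest layer first, and reports the first one that fails. This is only an illustration: the function name first_failing_layer and the NAME=COMMAND argument format are hypothetical, and the check commands have to be chosen per system.

```shell
# Hypothetical helper: run the checks in the order given (lowest layer first)
# and print the name of the first layer whose check command fails.
# Each argument has the form "NAME=COMMAND". Returns 0 if all checks pass.
first_failing_layer() {
    for entry in "$@"; do
        name=${entry%%=*}      # text before the first "="
        check=${entry#*=}      # text after the first "="
        if ! sh -c "$check" >/dev/null 2>&1; then
            echo "$name"
            return 1
        fi
    done
    return 0
}

# Usage sketch (placeholder checks, adapt to the system under test):
#   first_failing_layer \
#       "L6-acfs=test -d /u01/app/sharedrepo/lab1" \
#       "L8-nfs=test -d /OVS/Repositories/lab1"
```

The first failing layer is a starting point for diagnosis, not a verdict: as the case studies show, a symptom seen at one layer can still be rooted in another.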

Detailed commands to check whether each layer has an issue:

L1. DOM0 OS -- Try to ssh to the dom0 IP address; use the ILOM console to access the dom0 console; run some basic OS commands (ls, df, top).

L2. ODA_BASE OS -- Try to ssh to the oda_base IP address; use xm list/xm console from dom0 to access the oda_base console, or connect with VNC to dom0:5900; run some basic OS commands (ls, df, top).

L3. GI/ASM

As the grid user, run crsctl stat res -t, crsctl stat res -init -t, and ps -ef|grep smon, then connect to the ASM instance with sqlplus to check it.

L4. ASM Diskgroup -- As the grid user, run asmcmd lsdg and make sure all diskgroups are in the MOUNTED state.

L5. ASM Volume Group -- Run asmcmd volinfo --all and make sure the related volume state is ENABLED.

L6. ACFS/ADVM driver mount -- On ODA_BASE, use df -k to confirm the related directory is mounted, then try to read files (more) and write files (touch) in that directory.

L7. HANFS -- HAVIP/exportfs -- crsctl stat res -t (the related file system and HAVIP resources exist and are online); exportfs -v (the related file system is in the export list).

L8. NFS -- On dom0, use df -k to confirm the related directory is mounted, then try to read files (more) and write files (touch) in that directory.

L9. DOM0 xen -- xm list; xm console -- these commands should return output on dom0.

L10. Inside the Guest VM -- Try to ssh to the guest VM IP address; use xm list/xm console to access the guest VM console, or connect with VNC to dom0:59XX; run some basic OS commands (ls, df, top) inside the guest VM.

L0 oakd layer

oakcli show version -detail ;
oakcli show repo;
oakcli show vm; --- Sometimes the related VM/REPO status is wrong or the entries do not show up.

It is not always obvious which layer to start testing from. We may need to start at one layer and test the layers above and below it to verify.
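
The read/write test used for L6 (on ODA_BASE) and L8 (on dom0) can be sketched as a small probe. The function name probe_rw and the probe file name are hypothetical. Note that on a hung NFS or ACFS mount the probe itself may block, which is a useful signal in its own right.

```shell
# Hypothetical helper: check that a directory is usable by writing,
# reading back, and removing a small probe file.
probe_rw() {
    dir="$1"
    probe="$dir/.rw_probe.$$"
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    { echo probe > "$probe"; } 2>/dev/null || { echo "write failed: $dir"; return 1; }
    if [ "$(cat "$probe" 2>/dev/null)" != "probe" ]; then
        rm -f "$probe"
        echo "read failed: $dir"
        return 1
    fi
    rm -f "$probe"
    echo "ok: $dir"
}

# Usage (paths from the shared repo example in this note):
#   on oda_base:  probe_rw /u01/app/sharedrepo/lab1
#   on dom0:      probe_rw /OVS/Repositories/lab1
```

If the probe succeeds on oda_base but fails on dom0, the ACFS layer is working and the problem is more likely in L7/L8.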

2. Based on the layer involved, collect information and analyze the data for the issue (Case Studies)

L1 DOM0 OS

Data collection:

1. sosreport from dom0 -- os message

2. ilom snapshot -- console log for the dom0

3. dmesg output

--- When dom0 is hung, the only option is a reboot. Most of the time the reboot completes and we then need an RCA.
If the reboot does not succeed, we need to boot into single-user/rescue mode for recovery; the steps are the same as for general Linux.

--- For the RCA we usually look for hardware issues in ILOM, kernel panics in the console log or OS messages file, and core dump files in the core dump directory if kdump has been set up.

--- If nothing can be found, we need to install OSW on dom0 to capture CPU/memory data the next time the issue happens.
If kdump has not been enabled, we may enable it. By default DOM0 has no OSW.

Case Study 1.1 Bug 17896838 - DOMU LOST NETWORK CONNECTION DURING COPYING BIG FILES OVER NFS

Symptom: All VMs, including ODA_BASE, hang on the system. dom0 can be accessed but all network connections are lost.

Finding: meminfo showed that dom0 was short of memory during the connection problem.
Eventually we found a netback driver issue on dom0.

Case Study 1.2 BUG 19358298 - DOM0 SYSTEM BOOT DISK CAUSE THE WHOLE VM SYSTEM HANG

Symptom: All VMs, including ODA_BASE, were hanging and dom0 could not be accessed while the issue was happening.
We were able to reproduce it on the second day, with OSW enabled.

Finding: The I/O service time on the local disk of dom0 was very high and the disk utilization was 100%.

Solution: Replace the related disk.

L2 oda_base OS issue

Data collection:

1. sosreport from oda_base

2. xm console (output from dom0)

3. dmesg output

4. If the issue is happening:

      xm console oakDom1 (press a key a couple of times)

   -- to check whether you can get into the xm console
   -- Often you will need to have the console open while the issue is happening to capture the call stack.

5. OSW information (OSW is installed by default on ODA_BASE)

Case Study 2.1:

Bug 22495710 /SR 3-11955744301

Symptom: ODA_BASE crashes every time we try to boot it, so the whole system is down.

Finding: Using the xm console we can see that ODA_BASE crashes with a panic in the ACFS driver. (So it shows up as an L2 issue but is actually an L6 issue.) We still need the diagnostics from L2.

Solution: Disable CRS and apply the related ACFS patch; this solved the issue.

Comments: This was first identified as an L2 issue, but after GI was disabled ODA_BASE was able to boot, so an L3 issue was causing the L2 symptom. We then started CRS without ASM to verify that GI itself had no issue. With the ACFS driver disabled, ASM and the related diskgroups could start, which verified that there were no L4/L5 issues. The console messages also indicated an L6 issue.

Case Study 2.2

A common issue is ODA_BASE running out of guest VM resources such as memory or CPU, which can be identified with OSW.

Case Study 2.3 3-9272040911 and 3-9354104211

Symptom: In the ODA VM environment, pinging oda_base shows random packet loss. Both customers have Cisco switches and VLAN settings.

Finding: Xen driver bugs on ODA.

Solution: echo 0 > /sys/class/net/net1/bridge/multicast_snooping

Comments: tcpdump is available by default on ODA_BASE/dom0 and can be installed on the guest VM, which helps in such cases.
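
A minimal sketch of applying the multicast_snooping workaround from Case Study 2.3 while recording the value before and after. The function name disable_mcast_snooping and the SYSFS_NET override are hypothetical; the override exists only so the function can be tried against a scratch directory instead of a real bridge.

```shell
# Hypothetical helper: print the current multicast_snooping value of a bridge,
# write 0 to it, and print the value again. Must run as root on a real system.
# SYSFS_NET defaults to /sys/class/net (override is for dry runs only).
disable_mcast_snooping() {
    bridge="$1"
    f="${SYSFS_NET:-/sys/class/net}/$bridge/bridge/multicast_snooping"
    [ -f "$f" ] || { echo "no such bridge setting: $f"; return 1; }
    echo "before: $(cat "$f")"
    echo 0 > "$f" || return 1
    echo "after: $(cat "$f")"
}

# Usage (bridge name net1 as in the case study):
#   disable_mcast_snooping net1
```

A value written to sysfs does not survive a reboot, so the setting has to be reapplied or made persistent by whatever mechanism the site uses.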

L3. GI/ASM

Data Collection:

1. GI related log: Data Collection for Troubleshooting Oracle Clusterware (CRS or GI) And Real Application Cluster (RAC) Issues (Doc ID 289690.1)

2. ASM related log/trace: /u01/app/grid/diag/asm/+asm/+ASM1(2)/trace

Case 3.1 3-11575993804/Doc ID 2013879.1

Symptom: When one node is powered off, the other node crashes.

Finding: In ocssd.log we can see that the interconnect was lost and the restart happened after that.

Solution: This is a GI split-brain issue related to the IB driver on ODA X5-2. The node reboot is actually an L2 issue, but because OCSSD rebooted the node we need to treat it as an L3 issue first: check the related GI logs, then focus on the interconnect, which leads to the IB driver issue at L2.

Case 3.2 ASM Instance Fails with an ORA-600 [723] Error (Doc ID 1581539.1)

An ASM memory leak can cause oda_base to hang and crash. Collect OSW data to find out which process consumes most of the memory and to establish the RCA.

L4. ASM diskgroup

Data collection:

1. ASM metadata
How To Gather & Backup ASM/ACFS Metadata In A Formatted Manner version 10.1, 10.2, 11.1, 11.2 and 12.1?(Doc ID 470211.1)

2. ASM related log/trace

/u01/app/grid/diag/asm/+asm/+ASM1(2)/trace

3. May need some internal events to collect more data.

Case 4.1: 3-11787213061 / Bug 22294722 (ASM / OSD) / Bug 21300303 (ODA)

Symptom: The customer complained that all the VMs suddenly stopped working. After checking, we found that during a rebalance on the REDO diskgroup, ASM may take two good disks offline for no apparent reason.

WARNING: Write Failed. group:3 disk:3 AU:17500 offset:7340032 size:1048576 path:/dev/mapper/SSD_E0_S20_805741335p1 incarnation:0x752dc23b asynchronous result:'I/O error' subsys:System krq:0x7fcaa5f10f90 bufp:0x7fcaa49f4000 osderr1:0x69b5 osderr2:0x0 IO elapsed time: 0 usec Time waited on I/O: 0 usec

--- ASM shows osderr2:0x0, which gives no reason, yet the write still failed and ASM took the disk offline, which caused the issue.

Solution: Apply ASM related diag patches to find more detail.
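
To see which disks are affected when many such warnings are present, the key fields can be pulled out of the alert log. This is a sketch that assumes the "WARNING: Write Failed." line format shown above; the function name asm_write_failures is hypothetical.

```shell
# Hypothetical helper: read ASM alert log text on stdin and print
# "group disk path osderr1" for every "WARNING: Write Failed." line.
# A field that is absent from the line is printed as "?".
asm_write_failures() {
    awk '/WARNING: Write Failed\./ {
        g = d = p = e = "?"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^group:/)   g = substr($i, 7)
            if ($i ~ /^disk:/)    d = substr($i, 6)
            if ($i ~ /^path:/)    p = substr($i, 6)
            if ($i ~ /^osderr1:/) e = substr($i, 9)
        }
        print g, d, p, e
    }'
}

# Usage (the alert log file name is a placeholder for the file under the
# ASM trace directory listed in the data collection steps):
#   asm_write_failures < alert_+ASM1.log | sort | uniq -c
```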

Case 4.2 Bug 21369858 - ORA-15196: INVALID ASM BLOCK HEADER [KFC.C:29297] [ENDIAN_KFBH]

This is an ACFS driver bug at L6. What the customer sees is that the related VM crashes.
After that the ASM diskgroup cannot be mounted because of the error INVALID ASM BLOCK HEADER [KFC.C:29297] [ENDIAN_KFBH] in the ASM alert log.

L5. ASM volume Group

Data collection:

asmcmd volinfo
acfsutil registry -l

--- Make sure the related volume State is ENABLED.

--- Make sure the volume entries match the output of: acfsutil registry -l
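
A minimal sketch of the State check, assuming the "Volume Name:" / "State:" layout shown in the volume listing earlier in this note. The function name volumes_not_enabled is hypothetical.

```shell
# Hypothetical helper: read `asmcmd volinfo` style text on stdin and print
# "NAME STATE" for every volume whose State is not ENABLED.
# Returns 1 if at least one such volume is found, 0 otherwise.
volumes_not_enabled() {
    awk -F': *' '
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }   # trim the line
        $1 == "Volume Name" { name = $2 }
        $1 == "State" && $2 != "ENABLED" { print name, $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage as the grid user on oda_base:
#   asmcmd volinfo --all | volumes_not_enabled
```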

Case 5.1:

ODA: OAKERR:5003 Command 'oakcli show repo' Does Not Show Shared Repository (Doc ID 2057349.1)

This shows up at the OAKD layer, which we will discuss later, but the real issue here is that the customer deleted the volume in asmcmd without cleaning up the ACFS registry.

L6. ACFS/ADVM driver mount

Data collection:

Troubleshooting ACFS Repository/VM Mounting Issues on Oracle Database Appliance (Doc ID 2037999.1)

Case 6.1 BUG 21307906 -- ACFS driver cannot mount on one node after upgrade.

Cause: The ACFS driver versions on the two nodes are different. One node was upgraded and the other was not.

Solution: Manually upgrade the other node.

L7. HANFS -- HAVIP/exportfs

Concept: http://www.oracle.com/ocom/groups/public/@otn/documents/webcontent/2011281.pdf

Data Collection:

1. crsctl stat res -t (the related file system and HAVIP resources exist and are online)

2. exportfs -v (the related file system is in the export list)

3. Details of the related file system/HAVIP resources are listed in the white paper.
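
The export list check in step 2 can be sketched as a filter over exportfs -v output. The function name is_exported is hypothetical. It matches the path in the first column, which also covers the case where exportfs prints a long path on its own line and the client on the next.

```shell
# Hypothetical helper: read `exportfs -v` output on stdin and return 0 if the
# given path appears as an exported file system, 1 otherwise.
is_exported() {
    awk -v p="$1" '$1 == p { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on oda_base (path from the shared repo example in this note):
#   exportfs -v | is_exported /u01/app/sharedrepo/lab1 || echo "not exported"
```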

Case Study 7.1 SR 3-11618165711

Symptom: After the interconnect interface was changed, the REPO cannot start.

Cause: After the interconnect change, the related HAVIP interface was not changed.

Solution: Change the HAVIP interface.

Case Study 7.2 SR 3-11430277331

Symptom: The guest VM file system suddenly turns read-only.

Cause: The customer manually ran "exportfs -ra" on the system.

L8. dom0 NFS client issue / oda_base NFS server issue

Data collection:

1. sosreport from dom0 and oda_base -- focus on the OS messages

2. dmesg output

3. OSW data from oda_base and dom0

------- In most cases the related errors show up in the OS messages.

Case Study 8.1: 3-12074018491

We are still working on this case, but it points to a memory issue on oda_base for the NFS server.

L9. dom0 OVM (xen) related issue

Data collection:

1. sosreport from dom0

2. /var/log/xen/

3. xm list -- output

4. xm console VM_ID -- output

Case Study 9.1: 3-11995002281

Oracle VM: Failure to start guest Virtual Machine: "Hotplug scripts not working" (Doc ID 1089604.1)

Message Log : Errors first reported

Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51728/node failed. <<<<<<<<<<<<< Here
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: /etc/xen/scripts/block failed; error detected.
Jan 4 09:23:11 xen-shakoda1dom0-test logger: /etc/xen/scripts/block: xenstore-read backend/vbd/12/51744/node failed.<<<<<<<<<<<<< Here

XEND Log
=========

File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2294, in _restart
new_dom.waitForDevices()
File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1249, in waitForDevices
self.getDeviceController(devclass).waitForDevices()
File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
VmError: Device 51728 (vbd) could not be connected. /OVS/Repositories/drake0/.ACFS/snaps/oakvdk_lnx-inb-test-home/VirtualDisks/oakvdk_lnx-inb-test-home does not exist. <<<<<<<<<<<********* <<<<<<<<<<<< Error reported on the VM
[2016-01-04 09:23:17 15469] DEBUG (XendDomainInfo:108) XendDomainInfo.create(['vm', ['name', 'lnx-inb-test'], ['memory', 12288], ['maxmem', 12288], ['on_xend_start', 'ignore'], ['on_xend_stop', 'ignore'], ['vcpu_avail', '1'], ['vcpus', 1], ['cpus

This is related to OVM bugs, which can be worked around by setting parameters.
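
To match the hotplug errors in the message log against the "Device ... (vbd) could not be connected" entries in the xend log, the failing device numbers can be extracted from the messages file. This sketch assumes the log line format shown in Case Study 9.1; the function name failed_vbd_devices is hypothetical, and reading the two numbers in backend/vbd/N/M as domain ID and device ID is an interpretation of the usual xenstore layout.

```shell
# Hypothetical helper: read dom0 messages text on stdin and print the two
# numbers from each failed "xenstore-read backend/vbd/N/M/node" line,
# one unique "N M" pair per line.
failed_vbd_devices() {
    sed -n 's|.*xenstore-read backend/vbd/\([0-9][0-9]*\)/\([0-9][0-9]*\)/node failed.*|\1 \2|p' | sort -u
}

# Usage on dom0:
#   failed_vbd_devices < /var/log/messages
```

For the log excerpt above this yields the pairs 12 51728 and 12 51744; 51728 is the device that the xend log reports as not connectable.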

L10. Inside the Guest VM

Data collection:

1. sosreport from related VM.

2. OSW from guest VM.

3. xm console output.

Most issues that happen inside the guest VM are purely Linux OS related.
Any kernel bug that can cause a system hang or crash can also occur in a guest VM.
A customer application with a memory leak can likewise cause the guest VM to hang.
All of these issues need the general Linux diagnostic approach, such as checking the configuration with sosreport or using OSW data for memory/IO/CPU issues.

Case Study:

Trouble shooting Guest VM Startup related issue on share repo of ODA (Doc ID 2102380.1)

"Oracle Database Appliance"

ODA , ODAVP, ODA Virtualized, ODA troubleshooting, ODA Debug, Guest VM, VM, shared REPO ,ACFS, ODA_BASE, Dom0, DomU, HANFS, HAVIP,

Attachments

This solution has no attachment