
Asset ID: 1-72-2278551.1
Update Date: 2018-03-07
Keywords:

Solution Type: Problem Resolution Sure

Solution 2278551.1: Solaris 11 Control Domain Panic Triggered by Guest LDom Reboot - Fatal error has occured in: PCIe fabric.(0x1)(0x243)


Related Items
  • SPARC M6-32
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-15081472961>

Applies to:

SPARC M6-32 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

This is an M6-32 control domain running Solaris 11.1 SRU 21,
with several Oracle Emulex 16Gb FC HBAs accessing an IBM disk storage array.


With no other messages preceding it, the server panicked with "Fatal error has occured in: PCIe fabric.(0x1)(0x243)":

Jun 8 13:14:07 control-dom01 genunix: [ID 843051 kern.info] NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
Jun 8 13:14:07 control-dom01 unix: [ID 836849 kern.notice]
Jun 8 13:14:07 control-dom01 ^Mpanic[cpu0]/thread=2a10009dc60:
Jun 8 13:14:07 control-dom01 unix: [ID 198415 kern.notice] Fatal error has occured in: PCIe fabric.(0x1)(0x243)
Jun 8 13:14:07 control-dom01 unix: [ID 100000 kern.notice]
Jun 8 13:14:07 control-dom01 genunix: [ID 723222 kern.notice] 000002a10009d6a0 px:px_err_panic+1c4 (106f2400, 1, 243, 7bfba800, 1, 106f0530)
Jun 8 13:14:07 control-dom01 genunix: [ID 702911 kern.notice] %l0-3: 000002a10009d750 0000000000000016 00000000106f2800 000000000000005f
Jun 8 13:14:07 control-dom01 %l4-7: 0000000000000000 0000000010508400 ffffffffffffffff 0000000000000000
Jun 8 13:14:07 control-dom01 genunix: [ID 723222 kern.notice] 000002a10009d7b0 px:px_err_fabric_intr+1ac (c4001cda4000, 1, 220, 1, 243, 4000dd2c8a8)
Jun 8 13:14:07 control-dom01 genunix: [ID 702911 kern.notice] %l0-3: 0000000000000220 000000007bfba970 0000000000000000 0000000000000220
Jun 8 13:14:07 control-dom01 %l4-7: 0000000000000001 000000007bfba800 0000000000000001 0000c4001be781d8
Jun 8 13:14:07 control-dom01 genunix: [ID 723222 kern.notice] 000002a10009d930 px:px_msiq_intr+208 (c4001cd84b28, 0, 9, c4001cda62c8, 0, 2)
Jun 8 13:14:07 control-dom01 genunix: [ID 702911 kern.notice] %l0-3: 0000000000000000 0000000038790080 0000c4001be53998 0000c4001cda4000
Jun 8 13:14:07 control-dom01 %l4-7: 0000c4001cda6418 000004000dd2c8a8 0000c4001cd9fb50 0000000000000030
Jun 8 13:14:07 control-dom01 unix: [ID 100000 kern.notice]
Jun 8 13:14:07 control-dom01 genunix: [ID 672855 kern.notice] syncing file systems...
Jun 8 13:14:07 control-dom01 genunix: [ID 904073 kern.notice] done
Jun 8 13:14:12 control-dom01 genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Jun 8 13:14:33 control-dom01 genunix: [ID 100000 kern.notice]
Jun 8 13:14:33 control-dom01 genunix: [ID 665016 kern.notice] ^M100% done: 714892 pages dumped,
Jun 8 13:14:33 control-dom01 genunix: [ID 851671 kern.notice] dump succeeded
Jun 8 13:16:54 control-dom01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version 11.1 64-bit
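
The saved crash dump can be examined with mdb once it has been extracted by savecore. The file names below (vmdump.0, unix.0, vmcore.0 under /var/crash) are the usual defaults and are only illustrative; they may differ on your system:

# cd /var/crash
# savecore -vf vmdump.0          <<<--- only needed if the dump is still in compressed vmdump form
# mdb -k unix.0 vmcore.0
> ::status                       <<<--- panic string, matching "Fatal error has occured in: PCIe fabric"
> ::stack                        <<<--- panic thread stack, matching the px_err_panic frames above
> ::msgbuf                       <<<--- kernel message buffer leading up to the panic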



There are a number of PCI/PCIe ereports logged just before the panic:

Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.3487 ereport.io.pciex.linkbw.down
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pci.fabric
Jun 08 13:14:07.2903 ereport.io.pciex.pl.re
Jun 08 13:14:07.2903 ereport.io.pciex.dl.btlp
Jun 08 13:14:07.2903 ereport.io.pciex.a-nonfatal
Jun 08 13:14:07.2903 ereport.io.pciex.tl.uc
Jun 08 13:14:07.2903 ereport.io.pciex.tl.uc
Jun 08 13:14:07.2903 ereport.io.pciex.rc.ce-msg
Jun 08 13:14:07.2903 ereport.io.pciex.rc.mce-msg
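
The ereport summary above is taken from the fault management error log. A minimal way to reproduce it on the affected control domain, assuming the error log still covers the time of the panic, is:

# fmdump -e                      <<<--- one-line summary of logged ereports, as above
# fmdump -eV                     <<<--- full nvlist detail of each ereport, as shown further below

Both commands also accept -t/-T options to restrict the output to the time window around the panic.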



In more detail, the ereports relate to an FC HBA and an Ethernet card:

Jun 08 2017 13:14:07.290398687 ereport.io.pci.fabric
nvlist version: 0
  class = ereport.io.pci.fabric
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0
  (end detector)

  primary = 1
  pcie_adv_rp_status = 0x1
  pcie_adv_rp_command = 0x0
  pcie_adv_rp_ce_src_id = 0x220
  pcie_adv_rp_ue_src_id = 0x0
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.348767145 ereport.io.pciex.linkbw.down
nvlist version: 0
  class = ereport.io.pciex.linkbw.down
  ena = 0x2b4b003d24101801
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0/pci@1/pci@0/pci@4
  (end detector)

  source-id = 0x220
  device-id = 0x80ba
  vendor-id = 0x111d
  expected = 0x1
  supported-link-speeds = 0xe
  current-link-speed = 0x1
  current-link-width = 0x4
  prior-link-speed = 0x3
  prior-link-width = 0x8
  target-link-speed = 0x0
  max-link-speed = 0x3
  max-link-width = 0x8
  runtime = 0x0
  __ttl = 0x1
  __tod = 0x5939317f 0x14c9c3a9

....

Jun 08 2017 13:14:07.290398687 ereport.io.pci.fabric
nvlist version: 0
  class = ereport.io.pci.fabric
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  device-path = /pci@3c0/pci@1/pci@0/pci@6/network@0,1
  (end detector)

  bdf = 0x7301
  device_id = 0x1528
  vendor_id = 0x8086
  rev_id = 0x1
  dev_type = 0x0
  pcie_off = 0xa0
  pcix_off = 0x0
  aer_off = 0x100
  ecc_ver = 0x0
  func_type = 0x1
  pci_status = 0x10
  pci_command = 0x147
  pcie_status = 0x0
  pcie_command = 0x201f
  pcie_dev_cap = 0x10008cc2
  pcie_link_status = 0x1082
  pcie_dev_ctl2 = 0x5
  pcie_adv_ctl = 0x1e0
  pcie_ue_status = 0x0
  pcie_ue_mask = 0x0
  pcie_ue_sev = 0x462031
  pcie_ue_hdr0 = 0x0
  pcie_ue_hdr1 = 0x0
  pcie_ue_hdr2 = 0x0
  pcie_ue_hdr3 = 0x0
  pcie_ce_status = 0x0
  pcie_ce_mask = 0x0
  pcie_aff_flags = 0x0
  pcie_aff_bdf = 0xffff
  orig_sev = 0x1
  remainder = 0x0
  severity = 0x1
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.290398687 ereport.io.pciex.pl.re
nvlist version: 0
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0/pci@1/pci@0/pci@4/SUNW,emlxs@0,1
  (end detector)

  class = ereport.io.pciex.pl.re
  dev-status = 0x1
  ce-status = 0x2041
  link-width = 0x4
  link-speed = 0x9c4
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.290398687 ereport.io.pciex.dl.btlp
nvlist version: 0
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0/pci@1/pci@0/pci@4/SUNW,emlxs@0,1
  (end detector)

  class = ereport.io.pciex.dl.btlp
  dev-status = 0x1
  ce-status = 0x2041
  link-width = 0x4
  link-speed = 0x9c4
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.290398687 ereport.io.pciex.a-nonfatal
nvlist version: 0
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0/pci@1/pci@0/pci@4/SUNW,emlxs@0,1
  (end detector)

  class = ereport.io.pciex.a-nonfatal
  dev-status = 0x1
  ce-status = 0x2041
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.290398687 ereport.io.pciex.tl.uc
nvlist version: 0
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd57840000027e
  device-path = /pci@3c0/pci@1/pci@0/pci@4/SUNW,emlxs@0,1
  (end detector)

  class = ereport.io.pciex.tl.uc
  dev-status = 0x1
  ue-status = 0x10000
  ue-severity = 0x62011
  adv-ctl = 0x1f0
  source-id = 0xffff
  source-valid = 1
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df

Jun 08 2017 13:14:07.290398687 ereport.io.pciex.tl.uc
nvlist version: 0
  ena = 0x2b4ac89307700001
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x55fd578400000062
  device-path = /pci@3c0/pci@1/pci@0/pci@4/SUNW,emlxs@0
  (end detector)

  class = ereport.io.pciex.tl.uc
  dev-status = 0x1
  ue-status = 0x10000
  ue-severity = 0x62011
  adv-ctl = 0x1f0
  source-id = 0xffff
  source-valid = 1
  __ttl = 0x1
  __tod = 0x5939317f 0x114f21df


After the panic and reboot, all FC HBAs are operating with no issues: no hardware problems, no PCI errors, and no SCSI or FC errors are observed.
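
A quick post-reboot health check along these lines (the exact command set is a suggestion, not taken from the original SR) can be done with:

# fmadm faulty                   <<<--- no outstanding FMA faults expected
# fmdump -e                      <<<--- no new pci/pciex ereports after the reboot
# fcinfo hba-port                <<<--- all emlxs HBA ports online at the expected link speed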

 

If you see a similar panic, check from the control domain whether any guest LDoms have virtual functions assigned, using the command:

# ldm list-io -l

....

/SYS/IOU0/PCIE1/IOVFC.PF0.VF0 VF pci_1 guestldom01 <<<--- virtual function assigned to guest LDom "guestldom01"
[pci@340/pci@1/pci@0/pci@8/SUNW,emlxs@0,2]
Class properties [FIBRECHANNEL]
port-wwn = 10:00:00:14:XX:XX:XX:0c
node-wwn = 20:00:00:14:XX:XX:XX:0c
bw-percent = 0

If that is the case, check whether any of these LDoms was rebooted just before the panic on the control domain.
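
One way to correlate the two events, sketched here with the hypothetical guest name guestldom01 from the example above, is to compare the guest's boot time with the panic time on the control domain:

# ldm list guestldom01           <<<--- the UPTIME column shows how long ago the guest booted
# ldm list -l guestldom01        <<<--- detailed view, including the assigned virtual functions

Inside the guest, /var/adm/messages around the panic timestamp will also show the reboot.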

 

Changes

Just before the panic, a guest LDom that had FC HBA virtual functions assigned was rebooted unexpectedly due to an unrelated issue.

 

 

Cause

You could be facing this known issue: in some rare situations, PCIe errors caused by the stop/start of guests with VFs assigned may cause the primary domain to panic.

Bug 15906060 : Panic seen on Primary after multiple start/stop of io-domains w/VF's
--> Closed as duplicate of 21352084

Bug 21352084 - root domain panic due to fatal error occured in pcie fabric
--> Solution Description: Fix delivered in Oracle Solaris 11.3.1.5.0 (or greater)
 

 

Solution

 

Upgrade both the control domain and the guest LDoms to Oracle Solaris 11.3.1.5.0 (or greater).
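
As a sketch of how to verify the installed level and update it (the commands below assume the standard IPS tooling; consult the README of the target SRU for the exact procedure):

# pkg info entire                <<<--- the 'entire' incorporation version encodes the SRU level
# pkg update --accept            <<<--- update to the latest SRU available in the configured repository

Both the control domain and the guest LDoms owning the virtual functions need to be updated for the fix to be effective.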

 

 

 

References

<BUG:15906060> - PANIC SEEN ON PRIMARY AFTER MULTIPLE START/STOP OF IO-DOMAINS W/VF'S
<BUG:21352084> - ROOT DOMAIN PANIC DUE TO FATAL ERROR OCCURED IN PCIE FABRIC

Attachments
This solution has no attachment