Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2029948.1
Update Date: 2017-05-23
Keywords:

Solution Type: Problem Resolution (Sure Solution)

Solution  2029948.1 :   Solaris 11 - After Expanding EMC LUN : Corrupt label - bad geometry - Label says 681561600 blocks; Drive says 419430400 blocks  


Related Items
  • Solaris Operating System
  • SPARC T5-2
Related Categories
  • PLA-Support>Sun Systems>DISK>Storage Drivers>SN-DK: Storage Drivers




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-10876411501>

Applies to:

SPARC T5-2 - Version All Versions and later
Solaris Operating System - Version 11 11/11 and later
Information in this document applies to any platform.

Symptoms

A Solaris 11.1 server with an Oracle FC HBA is connected through a SAN to an EMC VNX disk array.

The EMC LUNs are under EMC PowerPath control, and each LUN has four paths.
The customer expanded the size of LUN 0x3a (decimal 58):

Pseudo name=emcpower110a
VNX ID=CKM00112201084 [T5-2_server01]
Logical device ID=600601608C403400BEC7C3FEB30DE511 [LUN 968_SMA]
state=alive; policy=CLAROpt; queued-IOs=0
Owner: default=SP A, current=SP A Array failover mode: 4
==============================================================================
--------------- Host --------------- - Stor - -- I/O Path -- -- Stats ---
### HW Path I/O Paths Interf. Mode State Q-IOs Errors
==============================================================================
3083 pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 c14t500601683EA01BB9d58s0 SP B0 active alive 0 0
3083 pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 c14t500601613EA01BB9d58s0 SP A1 active alive 0 0
3080 pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 c7t500601693EA01BB9d58s0 SP B1 active alive 0 0
3080 pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 c7t500601603EA01BB9d58s0 SP A0 active alive 0 0
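Each of the four per-path lines in the listing above should report "active alive". A quick shell sketch (not an EMC utility; the sample lines are copied verbatim from the powermt listing above) that counts the live paths:

```shell
# Count live paths for the LUN in 'powermt display' style output.
powermt_out='3083 pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 c14t500601683EA01BB9d58s0 SP B0 active alive 0 0
3083 pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 c14t500601613EA01BB9d58s0 SP A1 active alive 0 0
3080 pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 c7t500601693EA01BB9d58s0 SP B1 active alive 0 0
3080 pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 c7t500601603EA01BB9d58s0 SP A0 active alive 0 0'

# grep -c counts matching lines: every healthy path is "active alive".
live_paths=$(echo "$powermt_out" | grep -c 'active alive')
echo "$live_paths"
```

A count below the expected four would point at a path problem rather than the label issue described here.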


This LUN was mapped with the new size at this point in time:


Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 (fcp11):
Jun 8 13:25:57 server01 Lun=3a for target=11700 reappeared
Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 (fcp8):
Jun 8 13:25:57 server01 Lun=3a for target=11700 reappeared
Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 (fcp11):
Jun 8 13:25:57 server01 ndi_devi_online: failed for scsa,00.bfcp: target=11700 lun=3a ffffffff
Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 (fcp8):
Jun 8 13:25:57 server01 ndi_devi_online: failed for scsa,00.bfcp: target=11700 lun=3a ffffffff
Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 (fcp8):
Jun 8 13:25:57 server01 Lun=3a for target=10f00 reappeared
Jun 8 13:25:57 server01 scsi: [ID 243001 kern.info] /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0 (fcp8):
Jun 8 13:25:57 server01 ndi_devi_online: failed for scsa,00.bfcp: target=10f00 lun=3a ffffffff
Jun 8 13:26:00 server01 scsi: [ID 243001 kern.info] /pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 (fcp11):
Jun 8 13:26:00 server01 Lun=3a for target=10f00 reappeared

Jun 8 13:26:00 server01 scsi: [ID 243001 kern.info] /pci@380/pci@1/pci@0/pci@5/SUNW,qlc@0/fp@0,0 (fcp11):
Jun 8 13:26:00 server01 ndi_devi_online: failed for scsa,00.bfcp: target=10f00 lun=3a ffffffff
Jun 8 13:27:04 server01 cmlb: [ID 107833 kern.warning] WARNING: /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0/ssd@w500601603ea01bb9,3a (ssd240):
Jun 8 13:27:04 server01 Corrupt label; wrong magic number
Jun 8 13:27:04 server01 cmlb: [ID 107833 kern.warning] WARNING: /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0/ssd@w500601603ea01bb9,3a (ssd240):
Jun 8 13:27:04 server01 Corrupt label; wrong magic number
...


After that, the following errors appear in the messages file:

Jun 8 20:27:56 server01 cmlb: [ID 107833 kern.warning] WARNING: /pci@300/pci@1/pci@0/pci@4/SUNW,qlc@0/fp@0,0/ssd@w500601693ea01bb9,3a (ssd249):
Jun 8 20:27:56 server01 Corrupt label - bad geometry
Jun 8 20:27:56 server01 cmlb: [ID 107833 kern.notice] Label says 681561600 blocks; Drive says 419430400 blocks

The Solaris partition table shows the correct information, and EMC has confirmed from the storage array that this LUN is now approximately 325 GB:

bash-3.2$ more c14t500601613EA01BB9d58s0
* /dev/rdsk/c14t500601613EA01BB9d58s0 partition map
*
* Dimensions:
* 512 bytes/sector
* 50 sectors/track
* 256 tracks/cylinder
* 12800 sectors/cylinder
* 53248 cylinders
* 53246 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
  0 0 00 0 209715200 209715199
  1 0 00 209715200 157286400 367001599
  2 5 01 0 681548800 681548799    <<<-----------
  3 0 00 367001600 157286400 524287999
  4 0 00 524288000 157260800 681548799
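As a sanity check on the partition map above (a sketch using the geometry values it prints), slice 2, the backup slice, should span the accessible cylinders exactly:

```shell
# Slice 2 (the whole disk) covers sectors/cylinder * accessible cylinders.
sectors_per_cyl=12800    # from the "Dimensions" section above
accessible_cyls=53246    # ditto
s2_blocks=$(( sectors_per_cyl * accessible_cyls ))
echo "$s2_blocks"        # matches slice 2's sector count of 681548800
```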

From partition s2 (the whole disk) we see 681548800 blocks / 2 = 340774400 KB / 1024 = 332787 MB / 1024 = ~325 GB, very close to:

"Label says 681561600 blocks" / 2 = 340780800 KB / 1024 = 332793.75 MB / 1024 = 324.99 GB, the same size as reported by EMC.

So the message reports the correct information ("Label says 681561600 blocks"), and when format runs, the disk responds with the correct size,

but the Solaris driver still believes the LUN has the old, smaller size ("Drive says 419430400 blocks").
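The block-to-GB arithmetic above can be reproduced with a small shell sketch (512-byte blocks: divide by 2 for KB, then by 1024 twice for GB; shell integer division rounds down):

```shell
# Convert a count of 512-byte blocks to whole gigabytes.
blocks_to_gb() {
    echo $(( $1 / 2 / 1024 / 1024 ))
}

blocks_to_gb 681561600   # label size  -> 324 (~325 GB, matching EMC's figure)
blocks_to_gb 419430400   # driver size -> 200 (the stale, pre-expansion size)
```

The 125 GB gap between the two results is exactly the expansion the driver has not yet noticed.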

 

In addition, there are many transport errors against this LUN, observed on each ssd instance associated with it, e.g.:

ssd232 Soft Errors: 0 Hard Errors: 3 Transport Errors: 18664
Vendor: DGC Product: VRAID Revision: 0532 Device Id: id1,ssd@n600601609a402d00a872244180c2e011
Size: 214.75GB
Media Error: 0 Device Not Ready: 0 No Device: 3 Recoverable: 0
Illegal Request: 30 Predictive Failure Analysis: 0
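To scan for paths accumulating transport errors, the counter can be pulled out of 'iostat -En' style output. A sketch (the sample line is the ssd232 entry above):

```shell
# Extract the Transport Errors counter from one iostat -En summary line.
line='ssd232 Soft Errors: 0 Hard Errors: 3 Transport Errors: 18664'
# Strip everything up to and including the "Transport Errors:" label.
terrs=$(echo "$line" | sed 's/.*Transport Errors: *//')
echo "$terrs"
```

Running the equivalent over every ssd instance in the iostat -En output quickly shows which paths are affected.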

 

Cause

You are most probably hitting Bug 18239194 - syslog shows errors after LUN expansion on Solaris 11.1.

 

A new bug was opened and closed for a similar issue on a Solaris 11.3 SRU 10.5.0 T5-4 server. The customer initially mapped a wrong volume of 141419520 blocks, then unmapped it and mapped the correct volume (using the same LUN number) with a size of 176770560 blocks:

Feb 19 02:49:29 server03 cmlb: [ID 107833 kern.notice] Label says 176770560 blocks; Drive says 141419520 blocks

Bug 25584831 - syslog shows errors after LUN replacement 11.3

To troubleshoot this cosmetic error further, the bug engineer created a new DTrace script, "analyze_sense_1.d", to capture the sense data returned by the target.

Save the DTrace script to a file in /var/tmp/ named analyze_sense_1.d, then make it executable:
chmod u+x /var/tmp/analyze_sense_1.d

Then, as root, run:
# ./analyze_sense_1.d > analyze_sense_1.out

Once you have reproduced the problem (you should see the previous error in the messages file), you can Ctrl-C the script.

Then collect a new explorer from the system, and upload the DTrace output and the new explorer to the SR.

 

 

Solution

The fix has been delivered in:

Solaris 10 SPARC: kernel patch 150400-31
Solaris 10 x86: kernel patch 150401-31

Solaris 11.2 SRU 13.6.0 or greater

See this document for Solaris 11.2 SRUs available to download:

Oracle Solaris 11.2 Support Repository Updates (SRU) Index (Doc ID 1672221.1)


NOTE:
1. This issue, or a similar one, reappeared on Solaris 11.3 SRU 10.5.0.
2. Workaround: # update_drv -f ssd
This should fix the issue; if not, the last resort is to reboot.
If you require a fix, open an SR with Oracle Support and request to be added to Bug 25584831 to address this issue.


 

References

<BUG:18239194> - SYSLOG SHOWS ERRORS AFTER LUN EXPANSION ON SOLARIS 11.1
<NOTE:1672221.1> - Oracle Solaris 11.2 Support Repository Updates (SRU) Index
<BUG:25584831> - SYSLOG SHOWS ERRORS AFTER LUN REPLACEMENT 11.3

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.