Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution (Sure Solution)

2017033.1: Creating RAIDZ ZFS Pools With Large ashift/physical-block-size VDEVs May Result In Lower Than Expected Usable Capacity
Applies to:

Solaris Operating System - Version 10 8/07 U4 and later
F40 Flash Accelerator Card - Version All Versions and later
Flash Accelerator F80 PCIe Card - Version All Versions and later
Sun Flash F20 PCIe Card - Version All Versions and later
Sun Flash F5100 Array - Version All Versions and later
Information in this document applies to any platform.

Symptoms

Newer storage devices, particularly Solid State Disks (SSD), Non-Volatile Memory Express (NVMe) devices, and Flash Module (FMOD) HBAs, are being released with increasingly large native physical block sizes. Traditional spinning disks have a physical block size of 512 bytes; SSD, NVMe, and FMOD devices typically have 4K or 8K native block sizes with backwards compatibility to 512e (512-byte emulated).

For optimal performance the native physical-block-size should be used. This avoids mis-aligned IOs and the multiple Read-Modify-Write (RMW) operations within the device itself that can lead to premature wear failures. Usually this requires the 'physical-block-size' parameter to be added to the driver configuration file; refer to the product documentation for the recommended settings. For example:

For F40 or F80 flash devices, add the following entry to /kernel/drv/sd.conf. In the entry below, ensure that "ATA     " is padded to 8 characters, so the string length of "ATA     2E256" is 13 and the string length of "ATA     3E128" is also 13. Note that this should be a single line in sd.conf.

sd-config-list= "ATA     2E256", "disksort:false, cache-nonvolatile:true, physical-block-size:8192", "ATA     3E128", "disksort:false, cache-nonvolatile:true, physical-block-size:4096";

For F20 and F5100 flash devices, add the following entry to /kernel/drv/sd.conf. In the entry below, "ATA     " is padded to 8 characters and "MARVELL SD88SA02" contains 16 characters, so the total string length is 24.
sd-config-list= "ATA MARVELL SD88SA02" , "throttle-max:32, disksort:false, cache-nonvolatile:true, physical-block-size:4096" ; ZFS uses the physical-block-size to calculate the appropriate value for its 'ashift' property. The 'ashift' property ensures the IO is correctly aligned for optimal performance. ashift:9 == physical-block-size: 512b
ashift:12 == physical-block-size: 4K/512e ashift:13 == physical-block-size: 8K To verify the physical-block-size of 'sd' devices, use: $ echo "*sd_state::walk softstate | ::print -d struct sd_lun un_sd un_f_disksort_disabled un_f_suppress_cache_flush un_phy_blocksize" | mdb –k
un_sd = 0xc4015ff47880
un_f_disksort_disabled = 0
un_f_suppress_cache_flush = 0
un_phy_blocksize = 0t512      <-- Physical Block Size 512b

For 'ssd' devices, use:

$ echo "*ssd_state::walk softstate | ::print -d struct sd_lun un_sd un_f_disksort_disabled un_f_suppress_cache_flush un_phy_blocksize" | mdb -k

un_sd = 0xc4015ff47990
un_f_disksort_disabled = 0
un_f_suppress_cache_flush = 0
un_phy_blocksize = 0t512      <-- Physical Block Size 512b

To verify the ZFS ashift value, choose one of the Virtual Devices (VDEVs) within the pool, then use zdb to print the label and grep for 'ashift':

$ zdb -l /dev/rdsk/c0t0d0s0 | grep ashift

ashift: 9      <-- Physical Block Size 512b
Note: If the entries in the driver config file do not appear to take effect, verify the syntax and review 'Unable to Override Physical Block Size Specification for Some Devices in Solaris' (Doc ID 1666907.1).
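The mapping between ashift and physical-block-size shown above is simply ashift = log2(physical-block-size). As an informal illustration only (a standalone user-space sketch, not Solaris or ZFS source code):

#include <stdio.h>

int
main(void)
{
        /* Physical block sizes from the table above: 512n/512e, 4K, 8K. */
        unsigned int sizes[] = { 512, 4096, 8192 };

        for (int i = 0; i < 3; i++) {
                unsigned int ashift = 0;

                /* ashift is the base-2 logarithm of the block size. */
                while ((1u << ashift) < sizes[i])
                        ashift++;

                /* Prints 512 -> 9, 4096 -> 12, 8192 -> 13. */
                printf("physical-block-size %4u -> ashift %u\n",
                    sizes[i], ashift);
        }
        return (0);
}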
Creating a RAIDZ1 (7 Data + 1 Parity) pool consisting of 16 devices (Oracle F80 HBAs in this case) with various physical-block-size/ashift values results in the following 'loss' of usable capacity as the physical-block-size is increased from 512e to 4K to 8K (native):

$ zpool create tank raidz1 c0t5002361000146897d0 c0t5002361000147254d0 c0t5002361000147259d0 c0t5002361000147541d0 \
    c0t5002361000147639d0 c0t5002361000147642d0 c0t5002361000147695d0 c0t5002361000147787d0 \
    raidz1 c0t5002361000148012d0 c0t5002361000148122d0 c0t5002361000148134d0 c0t5002361000148161d0 \
    c0t5002361000148568d0 c0t5002361000148674d0 c0t5002361000148698d0 c0t5002361000148702d0

ashift=9:  'zpool list': 2.91T, 'zfs list': 2.49T avail
ashift=12: 'zpool list': 2.91T, 'zfs list': 2.41T avail
ashift=13: 'zpool list': 2.91T, 'zfs list': 2.29T avail

Using the same devices to create a new pool without RAID while varying the physical-block-size/ashift values results in the following:

$ zpool create ilx c0t5002361000146897d0 c0t5002361000147254d0 c0t5002361000147259d0 c0t5002361000147541d0 \
    c0t5002361000147639d0 c0t5002361000147642d0 c0t5002361000147695d0 c0t5002361000147787d0 \
    c0t5002361000148012d0 c0t5002361000148122d0 c0t5002361000148134d0 c0t5002361000148161d0 \
    c0t5002361000148568d0 c0t5002361000148674d0 c0t5002361000148698d0 c0t5002361000148702d0

ashift=9:  'zpool list': 2.91T, 'zfs list': 2.86T avail
ashift=12: 'zpool list': 2.91T, 'zfs list': 2.86T avail
ashift=13: 'zpool list': 2.91T, 'zfs list': 2.86T avail
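To put the difference in perspective (simple arithmetic on the figures above): on the RAIDZ1 layout, raising the ashift from 9 to 13 reduces the reported available space from 2.49T to 2.29T, a drop of 0.20T or roughly 8%, while the plain striped pool reports 2.86T available at every ashift because no parity or padding sectors are allocated.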
Cause

ZFS is both a volume/pool manager and a filesystem. ZFS does not pre-allocate anything when the pool and datasets are created; instead, it creates the required metadata as write IOs occur, and it supports variable stripe and RAID widths, chosen as data is written out. Because of this, the usable space reported at pool creation is only an initial estimate, and with large ashift values that estimate already reflects a significant loss. The key function is:

static uint64_t
vdev_raidz_asize(vdev_t *vd, uint64_t psize, dva_layout_t layout, int copies)
{
        uint64_t asize;
        uint64_t ashift = vd->vdev_top->vdev_ashift;
        uint64_t cols = vd->vdev_children;
        uint64_t nparity = vd->vdev_nparity;

        asize = ((psize - 1) >> ashift) + 1;

        if (layout == DVA_LAYOUT_RAIDZ_MIRROR) {
                ASSERT(copies > nparity);
                asize *= copies;
        } else {
                ASSERT(layout == DVA_LAYOUT_STANDARD);
                asize += nparity * ((asize + cols - nparity - 1) /
                    (cols - nparity));
        }

        asize = roundup(asize, nparity + 1) << ashift;

        return (asize);
}

This is used to compute the vdev_deflate_ratio:

        vd->vdev_deflate_ratio = (1 << 17) /
            (vdev_psize_to_asize(vd, 1 << 17, DVA_LAYOUT_STANDARD, 1) >>
            SPA_MINBLOCKSHIFT);

We're asking the system how much deflation we have to apply to physical space when storing a 128K block (close to the best case). If smaller ZFS recordsizes are used, the free space will drop much faster: for example, with an 8K recordsize on an ashift=13 (8K sector) 7+1 RAIDZ1, each 8K block allocates one data sector plus one parity sector, so only about half of the raw space is usable. In addition, ZFS reserves a small amount of slop space:
/*
 * Reserve about 1.6% (1/64), or at least 32MB, for allocation efficiency.
 * XXX The intent log is not accounted for, so it must fit within this slop.
 * If we're trying to assess whether it's OK to do a free,
 * cut the reservation by the factor of netfree to allow forward progress.
 */
#define SPA_SPACE_RESERVE(space, netfree) \
        (MAX((space) >> 6, SPA_MINDEVSIZE >> 1) >> (netfree))
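To see how these formulas produce the 'zfs list' figures shown in the Symptoms section, the following is a standalone user-space sketch of the same arithmetic. It is illustrative only, not ZFS source; it assumes a 128K block, cols=8, nparity=1, the 2.91T raw size reported by 'zpool list', and the 1/64 reserve applied to the deflated space.

#include <stdio.h>
#include <stdint.h>

#define SPA_MINBLOCKSHIFT   9

static uint64_t
roundup(uint64_t x, uint64_t m)
{
        return (((x + m - 1) / m) * m);
}

/* Same arithmetic as vdev_raidz_asize() for the standard layout. */
static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t cols, uint64_t nparity)
{
        uint64_t asize = ((psize - 1) >> ashift) + 1;

        asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
        return (roundup(asize, nparity + 1) << ashift);
}

int
main(void)
{
        uint64_t ashifts[] = { 9, 12, 13 };
        double raw = 2.91;      /* TB reported by 'zpool list' */

        for (int i = 0; i < 3; i++) {
                /* Allocated size and deflate ratio for a 128K block. */
                uint64_t asize = raidz_asize(1ULL << 17, ashifts[i], 8, 1);
                uint64_t deflate = (1ULL << 17) / (asize >> SPA_MINBLOCKSHIFT);

                /* Apply the deflate ratio, then the ~1/64 space reserve. */
                double avail = raw * deflate / 512.0 * (1.0 - 1.0 / 64.0);

                printf("ashift=%llu: deflate_ratio=%llu/512, approx avail=%.2fT\n",
                    (unsigned long long)ashifts[i],
                    (unsigned long long)deflate, avail);
        }
        return (0);
}

Compiled and run, this prints approximately 2.49T, 2.41T, and 2.29T available for ashift 9, 12, and 13 respectively.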
Going through the computation in vdev_raidz with cols=8 and nparity=1, the numbers posted by 'zpool list' and 'zfs list' match.

Solution

No solution is available. ZFS uses the available space within the pool intelligently and makes its initial calculations based on the pool configuration and predicted future use; depending on the actual workload, the initial estimate may turn out to be an over-estimate, an under-estimate, or a very good estimate. Reducing the physical-block-size in the driver config file will permit ZFS pools to be created with more usable capacity. See RFE 25304772 - "Raidz wastes more space than required for padding", which will help but will not fully resolve the issue.

References

NOTE:1666907.1 - Unable to Override Physical Block Size Specification for Some Devices in Solaris
BUG:25304772 - Raidz wastes more space than required for padding

Attachments

This solution has no attachment.