Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1315536.1
Update Date: 2018-01-08
Keywords:

Solution Type: Problem Resolution (Sure Solution)

Solution  1315536.1 :   Sun Storage 7000 Unified Storage System: RAIDZ2 Performance Issues With High I/O Wait Queues  


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7120
  • Oracle ZFS Storage ZS3-4
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
  Symptoms
  Cause
    Performance with raidz2 profile
    Performance with mirror profile
  Solution
  References


Created from <SR 3-3256848516>

Applies to:

Sun ZFS Storage 7320 - Version All Versions and later
Sun Storage 7110 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [not dependent]


Symptoms

Lower than expected random-read I/O performance on the 7000 series with a RAID-Z2 pool profile.

This is especially noticeable in pools with fewer than 15 spindles assigned to a single pool, but it applies to all RAID-Z2 pools.


Cause

A zpool is constructed of one or more virtual devices (vdevs). These vdevs are themselves constructed of block devices, which in the case of the ZFS Storage Appliance are always entire disks (spindles).

Performance with raidz2 profile

CLI:configuration storage> ls
Properties:
                          pool = TestPool
                        status = Online
                        errors = 0
                       profile = raidz2
                   log_profile = log_stripe
                 cache_profile = cache_stripe
                         scrub = none requested

The corresponding zpool output looks like:
NAME STATE READ WRITE CKSUM
TestPool Raidz2 ONLINE 0 0 0
       raidz2-0 ONLINE 0 0 0
         c0t1d0 ONLINE 0 0 0
         c0t2d0 ONLINE 0 0 0
         c0t3d0 ONLINE 0 0 0
         c0t4d0 ONLINE 0 0 0
         c0t5d0 ONLINE 0 0 0
         c0t6d0 ONLINE 0 0 0
         c0t7d0 ONLINE 0 0 0
         c0t8d0 ONLINE 0 0 0
         c0t9d0 ONLINE 0 0 0
        c0t10d0 ONLINE 0 0 0
        c0t11d0 ONLINE 0 0 0
        c0t12d0 ONLINE 0 0 0
        c0t13d0 ONLINE 0 0 0
        c0t14d0 ONLINE 0 0 0
       logs
          c4t5d0 ONLINE 0 0 0
          c4t5d1 ONLINE 0 0 0
          c4t5d2 ONLINE 0 0 0
        cache
          c1t2d0 ONLINE 0 0 0
          c1t3d0 ONLINE 0 0 0
       spares
        c0t15d0 AVAIL
        c0t16d0 AVAIL

Here, we see that all data drives (spindles) belong to a single top-level vdev under ZFS. When a read I/O is done, the data has to be read from all the disks inside the raidz2 vdev.
In the worst case, with an extremely random workload, a disk can sustain about 150 IOPS. So TestPool will be limited to roughly: 1 vdev * 150 IOPS = 150 IOPS.

Please note: the 150 IOPS figure used in this document is an example only; the IOPS rating of a drive varies with vendor, capacity, RPM, etc.

The IOPS ratings quoted here for HDDs are pessimal, worst-case figures: they assume a seek between every single I/O. In practice, because of the way ZFS works, we typically see about 60% more than these figures for HDDs.
[Figure: double-parity (RAID-Z2) pool layout]
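
As a rough illustration of the arithmetic above, a minimal Python sketch (assuming the example figure of 150 IOPS per spindle; the helper name is illustrative, not an appliance API):

# Back-of-the-envelope estimate: random-read IOPS scale with the number of
# top-level vdevs, not with the number of spindles inside them.
def pool_random_read_iops(vdev_count, per_disk_iops=150):
    """Approximate worst-case random-read IOPS ceiling of a pool."""
    return vdev_count * per_disk_iops

# The raidz2 TestPool above: 14 data spindles, but only 1 top-level vdev.
print(pool_random_read_iops(vdev_count=1))   # 150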

 Performance with mirror profile

CLI:configuration storage> ls
Properties:
                          pool = TestPool
                        status = online
                        errors = 0
                       profile = mirror_nspf
                   log_profile = log_stripe
                 cache_profile = cache_stripe
                         scrub = none requested

The corresponding zpool output looks like:
NAME STATE READ WRITE CKSUM
TestPool Mirror ONLINE 0 0 0
       mirror-0 ONLINE 0 0 0
         c0t1d0 ONLINE 0 0 0
         c0t2d0 ONLINE 0 0 0
       mirror-1 ONLINE 0 0 0
         c0t3d0 ONLINE 0 0 0
         c0t4d0 ONLINE 0 0 0
       mirror-2 ONLINE 0 0 0
         c0t5d0 ONLINE 0 0 0
         c0t6d0 ONLINE 0 0 0
       mirror-3 ONLINE 0 0 0
         c0t7d0 ONLINE 0 0 0
         c0t8d0 ONLINE 0 0 0
       mirror-4 ONLINE 0 0 0
         c0t9d0 ONLINE 0 0 0
        c0t10d0 ONLINE 0 0 0
       mirror-5 ONLINE 0 0 0
        c0t11d0 ONLINE 0 0 0
        c0t12d0 ONLINE 0 0 0
       mirror-6 ONLINE 0 0 0
        c0t13d0 ONLINE 0 0 0
        c0t14d0 ONLINE 0 0 0
       logs
          c4t5d0 ONLINE 0 0 0
          c4t5d1 ONLINE 0 0 0
          c4t5d2 ONLINE 0 0 0
        cache
          c1t2d0 ONLINE 0 0 0
          c1t3d0 ONLINE 0 0 0
       spares
        c0t15d0 AVAIL
        c0t16d0 AVAIL


Here, we have 7 vdevs. When a read I/O is done, it can be serviced from one vdev while the next I/O is serviced from another vdev in parallel.
So TestPool will be able to sustain roughly: 7 vdevs * 150 IOPS = 1050 IOPS. In practice it can be even better, because different data can be read from each disk of a mirror vdev in parallel.

The key concept to remember is that random-read IOPS are determined per vdev, not per spindle (disk).
Mirrored storage profiles allow a much higher vdev count for the same number of spindles.
[Figure: mirrored pool layout]
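
To make the trade-off concrete, a minimal Python sketch comparing the two TestPool layouts above, again assuming the example figure of 150 IOPS per spindle (variable names are illustrative):

PER_DISK_IOPS = 150   # example worst-case figure used in this document
DATA_SPINDLES = 14    # same spindles in both layouts

# raidz2 profile: one 14-wide vdev  -> 1 vdev's worth of random-read IOPS
raidz2_iops = 1 * PER_DISK_IOPS      # 150

# mirror profile: 7 two-way mirrors -> 7 vdevs' worth of random-read IOPS
mirror_iops = 7 * PER_DISK_IOPS      # 1050

print(raidz2_iops, mirror_iops)      # 150 1050

The same 14 spindles deliver roughly seven times the random-read IOPS when laid out as mirrors, at the cost of usable capacity.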

Solution

Use mirrored profiles for pools whenever possible to greatly improve IOPS.
This reduces usable capacity but greatly increases performance.

Sun ZFS Storage Configuration Rules and Guidelines
Sun ZFS Storage Performance Impact

ZPOOL and IOPS notes

Notes on IOPS (Input/Output Operations Per Second) and zpools (example zpool name: tank)

Choose any 2: speed | reliability | cost
If you pick speed and reliability, it will not be cheap
If you pick reliability and low cost, it will not be fast
If you pick speed and low cost, it will not be reliable

IOPS computation/approximation via seek time
IOPS = 1 / ( average latency + average seek time )        lower bound calculated value
IOPS = 1 / ( average seek time )                          upper bound calculated value

Using the specifications of a specific disk, Seagate ST31000424SS [ST31001SSSUN1.0T] (1 TB, 7200 RPM, SAS): average seek time 8.5 ms (read) / 9.5 ms (write), average latency 4.16 ms.

Read seek (8.5 ms):
IOPS = 1 / (0.00416 + 0.0085) = 1000 / (4.16 + 8.5) = ~78.99 IOPS     lower bound calculated value
IOPS = 1 / (0.0085)           = 1000 / 8.5          = ~117.65 IOPS    upper bound calculated value

Write seek (9.5 ms):
IOPS = 1 / (0.00416 + 0.0095) = 1000 / (4.16 + 9.5) = ~73.21 IOPS     lower bound calculated value
IOPS = 1 / (0.0095)           = 1000 / 9.5          = ~105.26 IOPS    upper bound calculated value

Alternatively, via pure rotational speed (one full rotation):
IOPS = RPM / 60
IOPS = 7200 / 60 = 120 IOPS

One argument says it is actually 2 x 120 = 240, because on average you wait only half a revolution (180 degrees), i.e. IOPS ~= RPM / 30, rather than a full revolution.
However, if the seek does not complete within, say, a 10-degree rotation, you end up waiting a full revolution plus that delta, i.e. 360 + 10 = 370 degrees.
For a 7200 RPM disk (rotational method):  120 <= IOPS <= 240
For a 7200 RPM disk (seek method):         78 <= IOPS <= 117
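
As a cross-check of the figures above, a minimal Python sketch of both estimation methods, using the ST31000424SS numbers quoted earlier (function names are illustrative):

# Seek-time method: lower/upper bound IOPS for a single HDD.
def iops_seek_method(avg_seek_ms, avg_latency_ms):
    lower = 1000.0 / (avg_latency_ms + avg_seek_ms)  # seek + rotational latency
    upper = 1000.0 / avg_seek_ms                     # seek time only
    return lower, upper

# Rotational method: one full revolution vs half a revolution per I/O.
def iops_rotational_method(rpm):
    return rpm / 60.0, rpm / 30.0

print(iops_seek_method(8.5, 4.16))    # (~78.99, ~117.65) - read seek
print(iops_seek_method(9.5, 4.16))    # (~73.21, ~105.26) - write seek
print(iops_rotational_method(7200))   # (120.0, 240.0)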

What does this all mean for my POOL?

RAID            Read IO penalty    Write IO penalty
RAID 0                 1                  1
RAID 1, 10             1                  2
RAID-Z                 1                  4
RAID-Z2                1                  6

IOPS needed = ( Total IOPS x % read ) + ( Total IOPS x % write x RAID write penalty )
So, if we needed 300 IOPS with a 50% read and 50% write workload from a RAID-Z (write penalty = 4) pool:
IOPS needed = (300 x 0.5) + (300 x 0.5 x 4) = 150 + 600 = 750 IOPS

For a similar 300 IOPS with a 50% read and 50% write workload on a RAID 1, 10 (write penalty = 2) pool:
IOPS needed = (300 x 0.5) + (300 x 0.5 x 2) = 150 + 300 = 450 IOPS
This says that a pool capable of 750 backend IOPS is needed to support a 300 IOPS RAID-Z workload with a 50/50 read/write mix.
Note that only a 450 IOPS pool is needed to support the same 300 IOPS on a RAID 1, 10 pool with the same 50/50 read/write workload.
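
The same write-penalty arithmetic as a minimal Python sketch (penalty values taken from the table above; names are illustrative):

# Backend IOPS required to deliver a frontend workload, given the write
# penalty of the chosen RAID level (see the table above).
WRITE_PENALTY = {"raid0": 1, "raid1_10": 2, "raidz": 4, "raidz2": 6}

def backend_iops_needed(total_iops, read_fraction, raid_level):
    write_fraction = 1.0 - read_fraction
    return (total_iops * read_fraction
            + total_iops * write_fraction * WRITE_PENALTY[raid_level])

print(backend_iops_needed(300, 0.5, "raidz"))     # 150 + 600 = 750.0
print(backend_iops_needed(300, 0.5, "raid1_10"))  # 150 + 300 = 450.0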

See the zpool read/write IOPS CLI metrics (operations columns):
zpool iostat -v 1
               capacity       operations    bandwidth
pool        alloc   free     read  write   read    write
----------  -----  -----    -----  -----  -----   -----
tank        6.78G  9.75T      0      0    4.84K  5.57K
  c5d0      6.78G  9.75T      0      0    4.84K  5.57K
----------  -----  -----    -----  -----  -----   -----

Optimal RAID-Zx pool members per vdev follow the rule 2^n + p
where n is 1, 2, 3, 4, . . .
and p is the parity count: p=1 for RAID-Z1, p=2 for RAID-Z2, and p=3 for RAID-Z3
RAID-Z  = (2^1 + 1) … (2^n + 1) = 3, 5, 9,  17, …
RAID-Z2 = (2^1 + 2) … (2^n + 2) = 4, 6, 10, 18, …
RAID-Z3 = (2^1 + 3) … (2^n + 3) = 5, 7, 11, 19, …
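
These width sequences follow directly from the rule; a minimal Python sketch (illustrative only):

# Optimal RAID-Zx vdev widths: 2^n + p disks, where p is the parity count.
def optimal_raidz_widths(parity, max_n=4):
    return [2 ** n + parity for n in range(1, max_n + 1)]

print(optimal_raidz_widths(1))   # RAID-Z : [3, 5, 9, 17]
print(optimal_raidz_widths(2))   # RAID-Z2: [4, 6, 10, 18]
print(optimal_raidz_widths(3))   # RAID-Z3: [5, 7, 11, 19]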

RAID1 aka mirror
This example creates a mirror with 1 vdev containing 2 disks (1 data disk plus its mirror copy).
zpool create tank mirror disk1 disk2
Adding another mirrored vdev stripes across the mirrors, so growth becomes similar to RAID 10; each add contributes an additional vdev.
zpool add tank mirror disk3 disk4

3 way mirror
zpool create tank mirror disk1 disk2 disk3
zpool add tank mirror disk4 disk5 disk6

RAID-Z
Similar to RAID 5, but uses a variable-width stripe for parity, which allows better performance than RAID 5. RAID-Z tolerates a single disk failure. This example creates a pool with 1 vdev: 2 data disks and 1 parity disk.
zpool create tank raidz disk1 disk2 disk3
Adding a vdev to grow the RAID-Z pool:
zpool add tank raidz disk4 disk5 disk6

RAID-Z2
Similar to RAID 6; tolerates 2 drive failures before being vulnerable to data loss. Here we have 1 vdev with 2 data and 2 parity disks.
zpool create tank raidz2 disk1 disk2 disk3 disk4
Adding a vdev to grow the RAID-Z2 pool:
zpool add tank raidz2 disk5 disk6 disk7 disk8

RAID-Z3
Tolerates 3 drive failures before being vulnerable to data loss. Here we have 1 vdev with 2 data and 3 parity disks.
zpool create tank raidz3 disk1 disk2 disk3 disk4 disk5
Adding a vdev to grow the RAID-Z3 pool:
zpool add tank raidz3 disk6 disk7 disk8 disk9 disk10

Disk type can be one of 'system', 'data', 'log', 'cache', or 'spare'. When a spare is active, it is displayed as 'spare [A]'.
Adding a spare
Another consideration: if two disks fail in the same vdev (for example raidz2-0), replacing both failed disks takes longer because only one disk at a time can be replaced within the same vdev.
The advantage when the two failed disks are in different mirror vdevs (for example mirror-0 and mirror-1) is that both disks can be replaced at the same time and resilvered from the spares.
Having spares minimizes the time your pool is unprotected; you can begin replacement of a failed disk as soon as the resilver completes.
zpool add tank spare disk1

Adding a log or Logzillas (ZIL/Write cache)
Log devices can be configured using only one of two different profiles: striped or mirrored. Log devices are only used in the event of node failure, so for data to be lost with unmirrored logs it would be necessary for both the device to fail and the node to reboot immediately thereafter. This highly unlikely event would constitute a double failure; mirroring log devices makes it effectively impossible, requiring two simultaneous device failures and a node failure within a very small time window. Log devices improve write IOPS by hiding the latency of the HDDs, and can meet the access-time demands of virtual machine workloads.
An example configuration: a mirrored disk pool of at least 20 x 300/600/900 GB disks (10,000 or 15,000 RPM performance disks) or 44 x 3 TB SAS-2 disks (7200 RPM capacity drives), with at least two 73 GB SSD LogZilla devices in a striped log profile.
zpool add tank log disk1                     ok
zpool add tank log mirror disk1 disk2        better (mirrored log)
zpool add tank log disk1 disk2               striped log, I/O performance increase

[Figure: log device configuration]

Adding a cache or Readzillas (L2ARC/Read cache) (cache devices are available only to the node on which the storage pool is imported)
In a cluster, it is possible to configure cache devices on both nodes to be part of the same pool. To do this, take over the pool on the passive node, then add storage and select the cache devices. This has the effect of having half of the global cache devices configured at any one time. While the data on the cache devices is lost on failover, the new cache devices are used on the new node once the resources (network and pools) move from the active node to the peer node.
SSDs are much faster than traditional hard drives, so cached data can be read much faster than if it had to be read directly from the hard drive. The main advantage of the SSD cache comes into play when booting or when a program is run for the first time after a reboot or power-off: since the contents of RAM are cleared each time the system power cycles, the data is no longer in RAM but is still present on the SSD cache device.
At least 2 x 512 GB for L2 cache (L2ARC) – striped cache
zpool add tank cache disk1
zpool add tank cache disk2

[Figure: cache device configuration]

References

https://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
https://blogs.oracle.com/7000tips/entry/vdev_what_is_a_vdev
https://blogs.oracle.com/ahl/entry/what_is_raid_z
https://blogs.oracle.com/roch/entry/when_to_and_not_to
<NOTE:1333120.1> - Sun Storage 7000 Unified Storage System: How to add L2ARC cache SSDs (Readzillas) to a pool
<NOTE:1452452.1> - Sun Storage 7000 Unified Storage System: How to add a Logzilla to an existing pool

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.