Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1213714.1
Update Date:2018-01-05
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1213714.1 :   Sun ZFS Storage Appliance: Performance clues and considerations  


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Purpose
Scope
Details
 Disk performance
 SSD
 Network
 Pool Layout
 Recordsize/volblocksize
 Block alignment
 Deduplication (dedup)
 Compression
 ORACLE database scenarios
References


Applies to:

Sun ZFS Storage 7320 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]


Purpose

Giving accurate values for expected system-level performance on the ZFS Storage Appliance is not possible, as it depends on many parameters - such as client workload, feature set used, memory size, network interface type, number of disks, optional use of logzilla and/or readzilla (SSD), pool layout (mirror vs raidz2), filesystem recordsize and FC/iSCSI LUN volblocksize.

However, this document gives you some basics that may help you estimate what can be expected, along with some best practices for advising the customer on the best configuration.

Please note : This document is not intended to be used in isolation for performance troubleshooting. Please refer to the ZFSSA Performance Troubleshooting resolution path document (see Document 1331769.1) to guide you through the prerequisite stages of performance problem separation, clarification, categorization, basic system health checks and further information gathering - before comparing the customer's system performance with the information described in this document.

 

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance.

 

Scope

This documentation provides some guidelines, but it will not replace a serious and reliable benchmark based on the customer's genuine workload.

Details

Disk performance

Actual performance depends heavily on the I/O workload. The following numbers do not take disk cache hits and prefetching into account :
   7200 rpm : 120 IOPS
  10000 rpm : 166 IOPS
  15000 rpm : 250 IOPS

Note1 : with cache read hits, and because ZFS is able to use the disk write cache, these numbers can be doubled.
Note2 : a full disk provides fewer IOPS. This means that the IOPS ratio between 7200rpm and 15000rpm disks might be 1.5 instead of 2.0 for the same amount of stored data.


We traditionally think of IOPS as "IOPS under fully random conditions," which is a worst-case situation. Fully random means that there is a seek and rotate delay between every I/O, and that is lost time as far as transfers are concerned. If the workload is massively random, this can drop to around 70 IOPS.
But in fact, most I/O is not fully random. If you have some sequential I/O - which requires either no seek or a very short one - you can easily see vastly higher I/O rates from a disk. In the limit, sequential 512-byte I/Os can easily achieve rates in the thousands.
The NAS head is based on ZFS for data handling. Each pool is made of a certain number of vdevs of the same type (mirror, raidz, raidz2, raidz3). vdevs are striped in a pool, and each vdev contains disks. To simplify greatly : when an I/O comes in, it is written to the first vdev, the next I/O is written to the second vdev, and so on. The bandwidth (and IOPS) limit is per vdev, so the more vdevs are used, the more bandwidth can be expected.
For example, take a configuration with 16 disks. With a mirror pool layout, we will have 8 vdevs (2 disks per vdev), hence we can expect up to 8*150 = 1200 IOPS. With a raidz2 pool layout, we might have only 2 vdevs (each made of 8 disks), hence we can expect only 2*150 = 300 IOPS.
On large configurations, a raidz pool layout can end up with 5x or 10x more vdevs than a raidz2 layout. Mirroring is still faster, but the point is that raidz and raidz2 are not necessarily equivalent.
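
To illustrate the vdev arithmetic above, here is a minimal sketch (a hypothetical helper, not an appliance tool) that estimates pool IOPS from the disk count and the chosen layout, assuming roughly 150 random IOPS per vdev as in the 16-disk example :

 # Rough pool IOPS estimate based on the per-vdev reasoning above.
 # Assumptions: random workload, no cache hits, ~150 IOPS per vdev.
 def estimate_pool_iops(total_disks, layout, disks_per_raidz2_vdev=8, iops_per_vdev=150):
     """Return (vdev_count, estimated_iops) for a mirror or raidz2 layout."""
     if layout == "mirror":
         vdevs = total_disks // 2                    # 2 disks per mirror vdev
     elif layout == "raidz2":
         vdevs = total_disks // disks_per_raidz2_vdev
     else:
         raise ValueError("layout must be 'mirror' or 'raidz2'")
     return vdevs, vdevs * iops_per_vdev

 print(estimate_pool_iops(16, "mirror"))   # (8, 1200)
 print(estimate_pool_iops(16, "raidz2"))   # (2, 300)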


See also Document 1315536.1 for a detailed example (RAIDZ2 Performance Issues With High I/O Wait Queues)

SSD

Two types of SSD can be used in the NAS heads.

  • SSD logzilla is used for synchronous I/Os (iSCSI and files opened with the O_DSYNC attribute; see the sketch after this list). When a synchronous I/O comes into the NAS head, it is written to DRAM (memory) as well as to the logzilla. Within the next 5 seconds (or less) the grouped data is flushed from DRAM to the SATA or SAS-2 disks. We never read from the logzilla except after a system crash. The logzilla size is 18GB on the 7x10 series and 73GB on the 7x20 series. In the 7120 series, a 96GB Flash Module card is integrated as a PCI device; it is divided into 4 modules, each usable as a log device.

    Perf :
      18GB : 120 MB/s of synchronous writes and up to 3000 IOPS.
      73GB gen3 : 200 MB/s of synchronous writes and up to 7000 IOPS.
      73GB gen4 : 350MB/s of synchronous writes and up to 11000 IOPS.
      200GB : 510MB/s of synchronous writes and up to 33000 IOPS.

  • SSD readzilla is used as an L2 cache for the ZFS ARC. After some time, old data/metadata are pushed from the L1 ARC (kernel memory) to the level-2 ARC; this is called "eviction". Accessing this data is still 10 times faster than retrieving it from the SATA/SAS disks. SSD readzillas are populated from memory and read back into memory; there is no direct copy from the SATA/SAS disks to the SSD readzilla. The readzilla size is 100GB on the 7x10 series and 512GB on the 7x20 series.

    The readzillas have two workloads : the read requests they are satisfying and the writes that are trying to fill the device. These trade off against each other. The write rate is a function of the eviction rate from the L1 cache and a variety of other factors, and it is explicitly throttled to avoid suppressing the ability to satisfy reads.

    Perf :
      3100 8Kbyte IOPS and up to 10000 IOPS with a synthetic benchmark.
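
The synchronous-write path served by the logzilla (described in the first bullet above) can be exercised from a client host. Below is a minimal sketch; the mount point /mnt/zfssa_share is an assumption used for illustration only :

 # Illustrative only: generates synchronous writes (O_DSYNC), the kind of I/O
 # that the appliance acknowledges from the logzilla before it is flushed to
 # the data disks.  /mnt/zfssa_share is an assumed mount of an appliance share.
 import os

 BLOCK = b"x" * 8192                       # 8 KB writes
 fd = os.open("/mnt/zfssa_share/syncfile", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
 try:
     for _ in range(1000):
         os.write(fd, BLOCK)               # each write returns only once it is on stable storage
 finally:
     os.close(fd)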

See also Document 1213725.1 to learn when logzilla/readzilla devices can be added (search for "observing hardware bottlenecks in Analytics").

 

More details can be found here:

https://stbeehive.oracle.com/teamcollab/wiki/Elite+Engineering+Exchange:ZFSSA+Sizing+Q+and+A#.26.2339.3B.26.2339.3BWhat+is+the+maximum+IOPS+per+15k+disk.26.2339.3B.26.2339.3B

 

Network

Two types of network interfaces can be used :
  • PCI Express Quad Gigabit Ethernet UTP
  • Dual 10-Gigabit Ethernet

A 1Gb device can push ~120MBytes/sec.
A 10Gb device can push ~1.20GBytes/sec.

Most of the time, using LACP with two 10Gb interfaces will not let a single client get 20Gb of bandwidth.
Some load balancing can be done at the protocol level (TCP/UDP port) by setting the LACP policy to L4, which uses the source and destination transport-level ports. This means that a client can use different interfaces in the LACP group if different protocols are used at the same time. The more clients use the same LACP group, the better the efficiency.

Jumbo Frames can be used to make the MTU larger (9000). The clients must have Jumbo Frames enabled to see performance improvements. The MTU is negotiated between the client and the NAS head, and the lower of the two values is used.

Single threaded workload :
Even if the network seems fast with 10Gb interfaces, a single-threaded workload may not reach the expected maximum throughput. This is especially true with small I/O sizes, such as 8KB. In this scenario, performance can be limited by the Round Trip Time (RTT) : the length of time it takes for a data packet to be sent plus the length of time it takes for an acknowledgment of that packet to be received. A good estimate of the network RTT can be obtained with the 'ping' command issued from the client side :

 [jack@oracle]$ ping 192.168.10.2
 PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
 64 bytes from 192.168.10.2: icmp_seq=1 ttl=255 time=0.146 ms
 64 bytes from 192.168.10.2: icmp_seq=2 ttl=255 time=0.116 ms
 64 bytes from 192.168.10.2: icmp_seq=3 ttl=255 time=0.111 ms
 64 bytes from 192.168.10.2: icmp_seq=4 ttl=255 time=0.117 ms
 64 bytes from 192.168.10.2: icmp_seq=5 ttl=255 time=0.125 ms
 64 bytes from 192.168.10.2: icmp_seq=6 ttl=255 time=0.116 ms
 64 bytes from 192.168.10.2: icmp_seq=7 ttl=255 time=0.096 ms
 64 bytes from 192.168.10.2: icmp_seq=8 ttl=255 time=0.113 ms
 64 bytes from 192.168.10.2: icmp_seq=9 ttl=255 time=0.096 ms
 [..]
 --- 192.168.10.2 ping statistics ---
 49 packets transmitted, 49 received, 0% packet loss, time 48007ms
 rtt min/avg/max/mdev = 0.086/0.107/0.266/0.028 ms

    Retaining the average RTT of 0.10 ms, we can estimate the maximum throughput for a single-threaded stream such as 'cp' or 'dd' using an 8KB block size :

    throughput = 8192 bytes * (1000 / 0.10) I/Os per second = 81,920,000 bytes/s, i.e. about 82 MB/s

    Single-threaded 8KB I/O is really bounded by the RTT : it is a latency test, not a throughput test.
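
    The same estimate can be computed directly from a measured RTT. A minimal sketch of the calculation above (illustrative only) :

 # Estimate the throughput ceiling of a single-threaded stream that waits one
 # network round trip per I/O, as in the calculation above.
 def single_stream_throughput(block_bytes=8192, rtt_ms=0.10):
     """Return the RTT-bound throughput in MB/s (decimal)."""
     ios_per_second = 1000.0 / rtt_ms            # one I/O per round trip
     return block_bytes * ios_per_second / 1e6

 print(single_stream_throughput())               # ~81.9 MB/s for 8 KB at 0.10 ms RTT
 print(single_stream_throughput(131072, 0.10))   # larger I/Os raise the ceiling until the link speed caps it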

Some possibly related bugs :
15666668 : Suggest lowering of max db_ref value for DEBUG kernels
15329612 : Found memory leaks in tcp_send
15329609 : found memory leaks in strmakedata

Pool Layout

As introduced previously, many pool layouts can be used when configuring the storage. For low-latency usage (VMware, VDI), it is highly recommended to use a mirror layout.
A raidz2 layout is an acceptable choice for sequential I/Os and can perform quite well with many vdevs (but that requires many disks).
A mirror layout remains better because the number of vdevs is far larger than with raidz2, and reads can be serviced by each of the 2 submirror disks at the same time.

Recordsize/volblocksize

The recordsize specifies a suggested block size for files in a filesystem. It can be set up to 128 Kbytes (the default) and can be changed at any time, but it only applies to files created after the change.
The volblocksize specifies the block size of a volume (iSCSI, FC). The block size cannot be changed once the volume has been written, so set the block size at volume creation time. The default block size for volumes is 8 Kbytes, and it can be set up to 128 Kbytes.

It is very important to match the client block size with the filesystem recordsize or volume block size.
Wrong sizing might lead to unexpected performance degradation, especially for random reads in a raidz2 vdev.
Even on iSCSI with a logzilla in use, wrong sizing may lead to bad performance : if the I/O is smaller than the volblocksize, ZFS has to retrieve the rest of the block from memory (if still present) or from disk (far slower) in order to write the entire block back to the logzilla, as sketched below. Block alignment has to be taken into consideration too; see the next section.
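
To make the cost concrete, here is a minimal sketch (an illustrative model, not appliance code) of the extra appliance I/O triggered when an aligned client write is smaller than the volblocksize :

 # Illustrative model of write amplification when the client I/O size does not
 # match the volume block size (aligned I/O assumed; alignment is covered in
 # the next section).
 def write_amplification(client_io, volblocksize):
     """Bytes the appliance must read and write to service one aligned client write."""
     extra_read = volblocksize - client_io if client_io < volblocksize else 0
     return {"client_write": client_io,
             "appliance_read": extra_read,               # rest of the block, from ARC or disk
             "appliance_write": max(client_io, volblocksize)}

 print(write_amplification(8192, 8192))    # matched sizes: no extra read
 print(write_amplification(8192, 32768))   # 8 KB write forces a 24 KB read plus a 32 KB write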

Block alignment

This topic is detailed in a blog post by David Lutz.
With proper alignment, a client block that is the same size as or smaller than the volume block size of a LUN is contained entirely within a single volume block in the LUN. Without proper alignment, that same client block may span multiple volume blocks in the LUN. That can result in 2 appliance reads for a single client read, and 2 appliance reads plus 2 appliance writes for a single client write. This obviously has a big impact on performance if ignored.
For details, see Partition Alignment Guidelines for Unified Storage.
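
A minimal sketch (illustrative model) of the spanning effect : how many volume blocks a single client block touches, depending on its starting offset within the LUN :

 # Illustrative model of the alignment effect described above: a client block
 # that starts on a volume-block boundary stays within one volume block, while
 # a misaligned block of the same size spans two.
 def volume_blocks_touched(offset, client_block, volblocksize):
     """Number of volume blocks covered by a client I/O at the given byte offset."""
     first = offset // volblocksize
     last = (offset + client_block - 1) // volblocksize
     return last - first + 1

 print(volume_blocks_touched(0,   8192, 8192))   # 1 : aligned
 print(volume_blocks_touched(512, 8192, 8192))   # 2 : misaligned partition start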


 

Deduplication (dedup)

Dedup is good for capacity but has some known performance caveats : throughput to and from shares with deduplication enabled is within 30% of the throughput available without deduplication enabled.
For details, see the dedup design and implementation guidelines : http://www.oracle.com/technetwork/articles/servers-storage-admin/zfs-storage-deduplication-335298.html#Perf

 

Compression

Some compression methods can be highly CPU-intensive, especially the gzip methods with high compression ratios. See Document 1012836.1 : Understanding Oracle Solaris ZFS Compression.

 

ORACLE database scenarios

To get correct performance when running databases on volumes (or shares) exported by the ZFS appliance, some rules have to be followed. Logbias and recordsize/volblocksize settings have to be configured correctly; misalignment can cost extra reads and writes, overloading the ZFSSA SAS-2 drives. To absorb redo log activity, SSD log devices have to be added to the ZFSSA pool. For OLTP databases, SSD cache devices are of great benefit.
See Document 2079993.1 : Best practices to use ZFSSA for databases.
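
As a rough illustration of sizing log devices for redo activity, here is a minimal sketch that compares a peak redo generation rate (a hypothetical figure you would measure on the database side, e.g. from AWR) against the per-device logzilla write figures listed in the SSD section of this document, assuming write throughput scales across striped log devices :

 # Rough check: does peak redo generation fit within the synchronous-write
 # capability of the configured log devices?  The per-device figures come from
 # the SSD section above; the redo rate passed in is a hypothetical value.
 LOGZILLA_MBPS = {"18GB": 120, "73GB gen3": 200, "73GB gen4": 350, "200GB": 510}

 def log_devices_needed(peak_redo_mbps, device="200GB"):
     """Minimum number of striped log devices to absorb the given redo write rate."""
     per_device = LOGZILLA_MBPS[device]
     return -(-peak_redo_mbps // per_device)     # ceiling division

 print(log_devices_needed(900, "200GB"))         # e.g. 2 devices for ~900 MB/s of peak redo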

 


 Back to Document 1331769.1  Sun Storage 7000 Unified Storage System: How to Troubleshoot Performance Issues

References

<BUG:15329612> - SUNBT6423877 FOUND MEMORY LEAKS IN TCP_SEND
<BUG:15329609> - SUNBT6423874 FOUND MEMORY LEAKS IN STRMAKEDATA
<NOTE:1175573.1> - Sun Storage 7000 Unified Storage System: Configuration and tuning for iSCSI performance
<NOTE:1213725.1> - Sun Storage 7000 Unified Storage System: Configuration and tuning for NFS performance
<NOTE:1229193.1> - Sun Storage 7000 Unified Storage System: Collecting analytics data for iSCSI performance issues
<NOTE:1230145.1> - Sun Storage 7000 Unified Storage System: Collecting analytics data for CIFS performance issues
<NOTE:1315536.1> - Sun Storage 7000 Unified Storage System: RAIDZ2 Performance Issues With High I/O Wait Queues
<NOTE:1331769.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Performance Issues
<BUG:15666668> - SUNBT6981953 SUGGEST LOWERING OF MAX DB_REF VALUE FOR DEBUG KERNELS

Attachments
This solution has no attachment