Challenges of using cutting-edge hardware and older software (and why whitelists can be more annoying than blacklists)

One of the fun things about Cirrascale is that we get to work with a wide variety of customers, and have access to the latest hardware and software from world-leading vendors.  This means we’re often putting together solutions using pieces that have never tried to work together before.  While fun, doing this early integration often leads to unexpected challenges.

Recently, we needed to evaluate the performance improvement of a still-in-development SSD to be used as a ZFS Intent Log (ZIL) for a storage solution. Because the intended application is ZFS, there were only a few reasonable choices for the test environment: Linux using ZFS-on-Linux, FreeBSD and its ZFS port, or Solaris in the form of OpenIndiana. Solaris seemed like the best choice since it still has the most robust ZFS implementation (not to disparage ZoL or FreeBSD…both are coming along really, really well!), and because it matches what many of our customers use. This is where the first challenge started.

Challenge One: Too New for Redundancy

After getting OpenIndiana installed and configured, and all of the SAS cabling taken care of, mpathadm(1M) was only showing some older drives: the new SSD that I wanted to test was shown by format(1M), so I knew it was physically connected and working, but it was seen by only one controller. After taking the (to me) obvious steps of trying different bays in the storage array, and verifying I had known-working firmware for the SAS controllers and expanders, I had to cry "uncle!" and resort to Google (obviously it's been a long while since I had to deal with Solaris!). I ended up on a My South blog entry which clued me in to the fact that Solaris has a whitelist of "multi-path capable" drives and wouldn't let other drives take advantage of redundant SAS links. Not a big deal, as a quick edit of /kernel/drv/scsi_vhci.conf should fix things up. I added the vendor and product IDs and tried again, but that seemingly had no effect. It turns out that, as an added layer of archaicness (is that a word?), the format used to identify disks in scsi_vhci.conf needs to have the vendor field padded to 8 bytes! So, for example, using format(1M) to perform an inquiry on the drive returns a result of something like:

format> inquiry
Vendor:   ABC
Product:  FOOBAR
Revision: 0123

The scsi_vhci.conf entry would need to have "ABC     FOOBAR" to properly match the drive (note the five spaces between "C" and "F"). It's not as though the parser doesn't understand structured entries, as the actual scsi_vhci.conf file is made up of name=value pairs, where the value is a comma-separated, semicolon-terminated list. But I digress… In the end, the problem was solved by adding an entry like:

scsi-vhci-failover-override =
     "ABC     FOOBAR", "f_sym";

While editing the scsi_vhci.conf file, I decided to bring it in line with the latest Nexenta configuration by changing the load-balance mechanism to "logical-block". (This is as simple as adding load-balance="logical-block"; to the configuration file…note the lack of pad-to-X-bytes silliness!) With that done, and a reboot just to be safe, the new SSD under test showed up in mpathadm(1M) as it was supposed to. In retrospect, Solaris does document most of this in "Configuring Third-Party Storage Devices", but it omits some key points like the padding to 8 bytes.
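
To recap, the working additions to /kernel/drv/scsi_vhci.conf and a quick sanity check looked roughly like the snippet below. "ABC" and "FOOBAR" are stand-ins for the real vendor and product IDs, and the device name passed to mpathadm(1M) is a placeholder, not an actual drive from this system:

# /kernel/drv/scsi_vhci.conf additions (vendor field padded to 8 characters)
scsi-vhci-failover-override =
     "ABC     FOOBAR", "f_sym";
load-balance="logical-block";

# after the reboot, confirm the SSD is now seen by the multipath driver
mpathadm list lu
mpathadm show lu /dev/rdsk/c0t5000CCA012345678d0s2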

In this age of rapidly improving standards and fast product development cycles, having a blacklist of known offenders seems like it would be more likely to just DTRT than the currently implemented whitelist. This is the path that NetBSD has taken with its quirk tables, which allow for an "innocent until proven guilty" model. I guess that's closer to my thought process than a "members only" list. But heck, it was a good learning experience.

Challenge Two: Too Fast for Old Software

After getting over the hump of having Solaris properly see the new SSD and access it using the SAS multipath driver, the next step was to test the impact the new SSD had on the storage appliance. Here at Cirrascale, as well as most other sane places, the preference is to test under a realistic customer workload. As is often the case, the customer for which this new SSD is most interesting in the near term doesn't have a benchmark suite that's representative of their actual workloads, but fortunately they have used sysbench in the past to characterize the performance of their existing solutions, and asked us to do the same. It's useful to note here that sysbench seems to have been last updated in 2009, and that it's somewhat popular among the MySQL crowd but not nearly as comprehensive as other "real world" benchmarking tools like fio or completely artificial ones like iometer. In this instance, the request was to use sysbench "fileio" tests to generate a workload against the storage array, quantify the improvement that the new SSD makes as a ZIL, and contrast that with other ZIL solutions. "No problem," I figured, "a quick pkg install sysbench and I'll be off to the races." A read of the manpage made me think that a quick bash(1) script to create and destroy zpools, and to iterate through a few runs at various numbers of I/O threads, would be useful; 5 minutes later I was off and running. The output of the first few test runs was in line with my expectations, and I shoved the window off to another monitor so I could get to work on something else. Then the window grabbed my attention far sooner than it should have.
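
For the curious, the wrapper script was nothing fancy; it looked roughly like the sketch below, where the pool name, device names, test mode, and run length are placeholders rather than the real hardware and parameters:

#!/usr/bin/bash
# Rough sketch of the test wrapper: build a fresh zpool with the SSD under
# test as the log device (ZIL), run sysbench's fileio test at each thread
# count, then tear everything back down.
# POOL, DISKS, and LOGDEV are placeholders for the real devices.
POOL=testpool
DISKS="c0t...d0 c0t...d0"
LOGDEV="c0t...d0"          # the SSD under evaluation
SIZE=200G                  # total test file size used for the first runs

for threads in 1 16 64 256; do
    zpool create -f $POOL $DISKS log $LOGDEV
    cd /$POOL

    sysbench --test=fileio --file-total-size=$SIZE prepare
    sysbench --test=fileio --file-total-size=$SIZE \
             --file-test-mode=rndrw --file-fsync-all \
             --num-threads=$threads --max-time=600 --max-requests=0 run
    sysbench --test=fileio --file-total-size=$SIZE cleanup

    cd /
    zpool destroy $POOL
done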

Like most people, I decided to iterate through an increasing number of threads: 1, 16, 64, 256. Since I was creating and destroying the zpool (and obviously the associated test files) between each test run, I didn't need to care about randomizing the test order (note to future self: always randomize the test order, even if you don't have to). After the first few test runs with 1 thread, the test with 16 I/O threads started causing sysbench to fail with "FATAL: Incorrect file discovered in request". I knew in my gut that this wasn't really a filesystem error; ZFS does many things really well, one of which is ensuring that what you read from the disk is what you wrote to it. Nonetheless, I spent some time reviewing the zpool members, scrubbing the pool, and doing the kinds of checks I'd do for a normal filesystem. To help with my sanity, I tried different numbers of threads: 2 threads in case it was a "too many" type of problem, 7 in case it was an "even number" type of problem. All of this turned up nothing at all: running sysbench's fileio test with more than one thread would stop with a fatal error. So back to Google I went.

Most of the immediate results that seemed related to my problem were about the lack of mutexes in older versions of the software, back in the sysbench 0.4.10 days. By most accounts, sysbench 0.4.12 seemed to have fixed all of those, and that's the version I was using. Eventually I found, buried in the SourceForge sysbench discussion groups under a topic totally unrelated to the problem I was having ("fileo non-zero write pattern…possible?"), a description of an error I wasn't getting ("FATAL: Too large position discovered in request!"). Clearly I was getting desperate, so I figured I'd try the solution to that problem: make sure the total size of the file data is evenly divisible by the number of files being used. What the wha? This sort of check is trivially done by the argument processor at runtime if it's really necessary, and c'mon…why would that configuration stipulation even need to exist in modern software?

For the tests I was doing, I was specifying a total file size of 200GB ("--file-total-size=200G"), as this was sufficiently more than the 64GB of RAM I had in the system that it should put the ARC under some stress and cause lots of data to actually hit the disks. By default, the test data is split into evenly sized files. 128 evenly sized files. The easiest thing for me to try was changing the 200GB into 128GB, so I tried that. Huzzah! Sysbench ran through the suite of tests at various thread counts with nary a problem. I don't want to think about what hackery is in sysbench to make this problem a problem, but I'm glad there was a fix that didn't involve me debugging the application or filesystem code.
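
Whatever sysbench is actually unhappy about internally, the arithmetic with its default of 128 files works out tidily at 128GB: an even 1GB per file. The adjusted invocation looked roughly like this sketch (the test mode and thread count shown are just examples):

# 128GB total over sysbench's default of 128 files (--file-num) = 1GB per file
sysbench --test=fileio --file-total-size=128G prepare
sysbench --test=fileio --file-total-size=128G \
         --file-test-mode=rndrw --file-fsync-all \
         --num-threads=16 run
sysbench --test=fileio --file-total-size=128G cleanup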

So I continued with the tests…ZFS with no ZIL, ZIL on the new SSD, ZIL on incumbent SSDs, and a variety of other configurations. In short order (well, not that short…each test run takes a while to prepare and execute) I'd collected the data I was looking for. Because I was explicitly looking at the impact of a particular ZIL, which is meant to quickly commit writes to a non-volatile source, I had run all of my sysbench tests with fsync called after every write ("--file-fsync-all"). (For the record, I'm aware I could have set the ZFS sync property to "always" and had a similar effect, but I prefer to keep as many filesystem-related attributes at their defaults as possible.) Since the data I needed to collect was collected, the nerd in me decided to run a few tests with syncing happening as ZFS saw fit, rather than being forced by the application ("--file-fsync-end"). I should have stopped while I was ahead, since dropping the per-write sync caused the "FATAL: Too large position discovered in request!" error to show up. A quick comb through the code makes me think that the mutexes added since 0.4.10 aren't enough to prevent threads from stepping on each other in the face of the very fast I/O seen in modern storage arrays.
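
To make the comparison concrete, the two variants differ only in which layer forces the sync, roughly as in the sketch below; the pool name is a placeholder, and the zfs(1M) line shows the filesystem-side alternative I mentioned but chose not to use:

# what the real test runs used: an fsync() forced by sysbench after every write
sysbench --test=fileio --file-total-size=128G --file-test-mode=rndrw \
         --file-fsync-all --num-threads=16 run

# the "let ZFS schedule the writes" variant that tripped the fatal error
sysbench --test=fileio --file-total-size=128G --file-test-mode=rndrw \
         --file-fsync-end --num-threads=16 run

# the alternative I avoided: force synchronous semantics in ZFS itself
# rather than from the application
zfs set sync=always testpool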

Not wanting to delve any further into the issue just for the sake of my own curiosity, I decided to take my "--file-fsync-all" results and be happy with them. I've had enough fun trying to use old software to benchmark high-performance, modern hardware. I give it a 50/50 chance that Cirrascale will be asked for similar data in the future and will run into that issue again, and I'll have to figure out whether I want to steer the customer to newer (or at least more robust) software, fire up vi(1) and sprinkle locks all around, or run performance tests with lightly-tested software (I see the version is bumped to 0.5.0 in the source, but don't know what changes are present). Until then, I'll be content with what I have.
