Comments on: Exploring the PCIe Bus Routes
http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/

By: Scott Ellis (Fri, 10 Apr 2015 20:34:29 +0000)
http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/#comment-1862

Hi Scott,

Sorry, I missed this comment! I don’t usually ignore things for 4 months! :-)

The 2011-3 processors (E5-2600v3 series) don’t seem markedly different in our testing when it comes to passing PCIe frames across the QPI link. I expected there to be some improvements (since the buffer sizes changed), but initial testing doesn’t show a dramatic improvement with the p2pBandwidthLatency tool. I have an update to the benchmarks (along with CUDA 7 or better) on my to-do list, but it seems it never makes its way to the top of that list.
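For anyone who wants to poke at their own topology, here’s a minimal sketch using only stock CUDA runtime calls (device numbering is just whatever your system enumerates) that prints the peer-access matrix, much like the p2pBandwidthLatency sample reports at startup. GPU pairs behind the same root complex typically report peer access; pairs split across the QPI link usually don’t, so those copies end up staged through host memory.

    // Minimal sketch: print the peer-access matrix for all visible GPUs.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        printf("Peer-access matrix (%d GPUs):\n", n);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                int ok = 0;
                if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
                printf(" %d", ok);
            }
            printf("\n");
        }
        return 0;
    }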

What I do find interesting, though, is that more recent NVIDIA drivers (and/or CUDA 7, not entirely sure which yet) have started to do a darned good job of masking the QPI latency, so larger transfers (not the 1-byte cudaMemcpy() that the p2pBandwidthLatency test does) can get pretty close to maximum bandwidth across the QPI link. Latency is still horrible, of course, but if you ask CUDA to move a lot of bits from one side of the QPI link to the other, you can achieve respectable goodput numbers.
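Roughly, that “big transfer” case looks like the sketch below; the 256 MiB payload and the assumption that GPUs 0 and 1 sit on opposite sockets are illustrative, not the exact benchmark we run. Dividing the byte count by the event-timed duration is what gives the goodput number I mentioned.

    // Minimal sketch: time one large GPU-to-GPU copy and report goodput.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256u << 20;        // 256 MiB payload (illustrative)
        void *src = nullptr, *dst = nullptr;

        cudaSetDevice(0); cudaMalloc(&src, bytes);
        cudaSetDevice(1); cudaMalloc(&dst, bytes);

        cudaSetDevice(0);
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpyPeer(dst, 1, src, 0, bytes);  // crosses QPI if the two GPUs sit on different sockets
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }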

By: Scott Ellis (Fri, 10 Apr 2015 20:25:20 +0000)
http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/#comment-1861

Technically, if it’s embarrassingly parallel, then the 1st configuration (the “Typical 8 GPU Server”) would make the most sense. That would let you move data H2D/D2H while putting minimal thought into scheduling those copies to minimize bandwidth bottlenecks. It does preclude you from taking real advantage of D2D (“P2P” in NVIDIA parlance) transfers, though, which isn’t terribly forward-looking.
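As a rough illustration of that “minimal scheduling” point, something like the sketch below is usually all it takes: one pinned host chunk and one stream per GPU, with each device pulling only its own slice. The 64 MiB chunk size and the flat host-buffer layout are placeholder assumptions.

    // Minimal sketch: independent H2D copies for an embarrassingly parallel job.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        const size_t chunk = 64u << 20;                 // 64 MiB per GPU (placeholder)

        char* host = nullptr;
        cudaMallocHost((void**)&host, chunk * n);       // pinned so the async copies can overlap

        for (int dev = 0; dev < n; ++dev) {
            cudaSetDevice(dev);
            cudaStream_t stream;
            cudaStreamCreate(&stream);

            void* d = nullptr;
            cudaMalloc(&d, chunk);
            // Each GPU pulls only its own slice; no cross-GPU (P2P) traffic involved.
            cudaMemcpyAsync(d, host + dev * chunk, chunk,
                            cudaMemcpyHostToDevice, stream);
            // ... launch the per-GPU kernel on `stream` here ...
        }
        for (int dev = 0; dev < n; ++dev) {             // wait for every device to finish
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
        printf("issued %d independent H2D copies\n", n);
        return 0;
    }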

Practically, genomics workloads are moving toward data flows where the 2nd or 3rd configuration makes the most sense. From a Cirrascale product standpoint, I see people buying a GB5470 configured like the 2nd configuration (4x cards on each of two CPUs), and converting that to the 3rd configuration (8x cards on one CPU) for their leading-edge developers.

By: Richard Casey (Sun, 01 Feb 2015 00:17:59 +0000)
http://www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/#comment-1859

Hi,

We have embarrassingly parallel CUDA code in genomics/bioinformatics. It looks like the third configuration above would map well to these algorithms. Does that sound right?

Email: info@rmcsoftwareinc.com
LinkedIn: http://www.linkedin.com/in/richardcaseyhpc
Blog: rmcsoftwareinc.wordpress.com
Twitter: @rmcsoftwareinc
Facebook: http://www.facebook.com/richardcaseyhpc
Google+: plus.google.com/107594981582657849119/posts
