Hyperscaled Cluster Storage – Hardware 101

As a tech, I must admit that spending money on "new toys" is oddly satisfying, almost to the point of guilt.

But when it comes to deciding what kind of infrastructure to use for the community-sourced hyperscaled storage solution your business has agreed to build out, whether it's Ceph, ScaleIO (Dell EMC), GlusterFS, or something similar, it's all too easy to make a decision that greatly hinders your performance or scalability.

While there's a ton of entry-level, piecemeal advice on the web about what to do vs. what not to do, it's difficult to tell what perspective the writer is coming from. If the writer is familiar with software development (which is great!) but isn't as familiar with systems engineering or networking, their assessment of what hardware you need could be distorted simply because they don't see the bigger picture.

As such, I wanted to walk through the hardware selection process for my anime website, Otaku Central, and how I arrived at what I've decided to use. This comes from the perspective of someone who works as an infrastructure architect in this area and has a more well-rounded understanding of systems implementation as a whole than someone who's purely a hardware engineer and nothing else.

Take it as you will; throw my findings out completely if you want, but I hope you can at least get a little use from this as a reference point for a real-world production environment that follows a number of best practices and successfully runs the largest non-profit anime streaming service in the world.

Hyperscaled Systems – The Basic Requirements

This evaluation is broken into two major parts. First, we'll look at the hardware principles that can be applied to any platform you go with; then we'll look more specifically at the hardware that applies to the platform I use, Ceph.

To kick off our discussion, let’s start by addressing the most common misconception of the hardware selection process – which hardware vendor you should use.

As a reference point for our survey, consider some of the big industry players who've implemented their own systems based on this model: Google. Facebook. Amazon.

What do all of these companies have in common? Their server infrastructure is custom-designed and custom-built in-house to directly satisfy their need for a server model that works well with their hyperscaled storage solutions, each of which is proprietary but shares some commonalities.

Those commonalities include:

  • RAID is generally not preferred (except with ScaleIO), because redundancy is handled by the cluster storage algorithm, and not by a disk hardware controller. JBOD is usually the order of the day for storage disks.
  • When spec'ing hardware, a larger number of less reliable servers is preferred over a smaller number of more reliable servers. Redundancy is handled at the system level rather than at the system-component level in these environments, and, hard as it may be to believe, the odds of a partial cluster failure are actually marginally lower in an environment with many servers than in one with fewer, more hardware-stable servers.
  • The fewer “pieces” in the puzzle, the better. What I mean by this is that, for example, since we’re not looking for RAID in most of our deployment environments, why have an expansion card-based RAID controller at all? It’s just another piece of the equation that could potentially fail, and if your server’s motherboard can otherwise handle all the needed data connections and throughput, why complicate the situation further? Design systems around this mentality.
  • For 99.9% of storage algorithms, the best tradeoff of system performance to disk space is:
    • Have 1GHz of CPU capacity on an individual processing thread for every 1TB of disk space in the system; for example, if you have 24TB of total disk capacity in a server, a single hex-core hyperthreaded CPU running at 2GHz (12 threads x 2GHz = 24GHz of thread capacity) gets you to break-even.
    • Have 1GB of RAM for every 1TB of disk space you have in your system; for example, a 48TB storage server should have at least 48GB of RAM.
    • To clarify, these numbers come into play both for the normal number-crunching of the storage algorithm and for the additional CPU/RAM overhead incurred when a physical disk fails and the algorithm has to reconstruct it (see the sizing sketch just after this list).
  • At a minimum, 10Gb networking is preferred. While this is a little-known fact, this isn't only because of the obvious bandwidth increase of 10Gb over 1Gb, but also because of the way packet switching occurs on a 10Gb Layer 2 fabric. 1Gb switching fabrics average a round-trip response latency of roughly 1ms; 10Gb switching fabrics average around 90μs (that's microseconds), purely because 10Gb switching is more efficient in design than 1Gb.
  • Infiniband networking (or, more specifically, IP over Infiniband, or IPoIB) is discouraged. Some people immediately look at the 40Gb/sec throughput of Infiniband and think that number is obviously higher than the 10Gb/sec throughput of fiber Ethernet, but what they don't see is that the 40Gb/sec figure isn't achieved using TCP/IP. To get Infiniband to carry TCP/IP, every transmission has to be broken down and reconstructed in software, which relies solely on your server's CPU rather than on Infiniband's native transport offloads. 40Gb/sec becomes a highly unrealistic number when you factor in that it's achieved in DAS environments and isn't feasible in an IP-based infrastructure; attempting it in an IP environment would absolutely crush your server's CPU.
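
To make the CPU and RAM rules of thumb above a bit more concrete, here's a minimal sizing sketch in Python. It encodes nothing beyond the 1GHz-of-thread-capacity-per-TB and 1GB-of-RAM-per-TB guidelines from the list; the function and parameter names are my own, purely for illustration.

  # Rough storage-node sizing check based on the rules of thumb above.
  # Purely illustrative; tune the ratios to your own platform and workload.

  def check_node_sizing(disk_tb, cpu_ghz, cpu_cores, hyperthreaded=True, ram_gb=0):
      """Return (cpu_ok, ram_ok) for a single storage node."""
      threads = cpu_cores * (2 if hyperthreaded else 1)
      cpu_capacity = threads * cpu_ghz   # total "GHz of thread capacity" available
      cpu_needed = disk_tb * 1.0         # 1GHz of thread capacity per 1TB of disk
      ram_needed = disk_tb * 1.0         # 1GB of RAM per 1TB of disk
      return cpu_capacity >= cpu_needed, ram_gb >= ram_needed

  # The example from the list: 24TB of disk, one hex-core hyperthreaded 2GHz CPU, 24GB of RAM.
  print(check_node_sizing(disk_tb=24, cpu_ghz=2.0, cpu_cores=6, ram_gb=24))
  # -> (True, True): 12 threads x 2GHz = 24GHz against 24TB, and 24GB of RAM against 24TB.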

Picking A Hardware Vendor (Controversy Alert!)

  …you have been warned.

While this goes against most of my industry judgment, understand that this statement is made in a slightly different context compared to other deployments, and our needs are notably different from what we'd normally be looking for in SMB (small- to mid-sized business) environments:

Using Dell, HP, IBM, or another primarily SMB-oriented hardware provider is likely a mistake.

Take into consideration that these vendors, while I love them to death in the SMB sector, don't really build hyperscale cluster-grade hardware; their focus is mainly on systems that are more expensive by comparison but more reliable. Because of this, they feature a number of things we don't really need (RAID cards, battery backups, and so on) while lacking things we do need, like on-motherboard SATA/SAS connections and a disk-slot count proportionate to the rackspace used.

To get into the hyperscaled clustering mentality, we're going to have to deviate from the beaten path a bit. While companies such as Google have done this by designing their own hardware in-house, we're likely going to have to opt for a slightly different method, since we don't have millions of dollars in R&D budget floating around (if you actually do, cheers!).

The vendor that I normally go with for my storage nodes is SuperMicro because they’ve designed a number of systems specifically for Ceph or HADOOP deployments that pretty closely fit the bill for what I need, but other vendors (such as Tyan Microsystems or Huawei International, for example) have designed systems that fit this role as well. In addition, SuperMicro’s offerings in this area are very competitively priced.

Selecting Our Hardware Model (Part 1)

Ok, so SuperMicro is who we’re going to use for a hardware vendor in this example.

That being said, SuperMicro makes literally hundreds of different server models between a number of different system categories – where do we even start with breaking down these systems to find some potential candidates?

To make this mountain a bit more climbable, let’s start by reminding ourselves in a nutshell what we’re looking for:

  • No RAID
  • A decent number of drives per unit of rackspace
  • No unnecessary components
  • 10Gb network capability (onboard or expansion)
  • Raw number of servers preferred over potential reliability of individual servers

Since we’re looking to build out storage nodes, SuperMicro’s server category of SuperStorage machines is a great place to start. Many of these systems are designed for Ceph or HADOOP already, and may require only minimal system change, or no change at all, to get them to where we want them for our production environment.

Another potential pool of hardware, although searching within it is a bit more involved, is the SuperServer product matrix: open up the list, go through it, and select candidates that closely fit the bill for what we want, giving us a narrower set to examine more closely in our next step.

I've picked out a couple of models from these methods to examine for our study based on our needs. Both offer twelve 3.5″ drive bays in a 1U rack form factor, offer modern CPU/RAM capability, and are fairly inexpensive to buy (meaning we can afford more of them). These models are the SuperServer 6019P-ACR12L and the SuperServer 6017R-73THDP+.

[Images: SuperServer 6019P-ACR12L and SuperServer 6017R-73THDP+]

Selecting Our Hardware Model (Part 2)

One of the things I truly love about SuperMicro as a vendor is their circuit schematics for their systems, which are published under the model documentation on their website. In a moment, you'll understand a bit more about why I feel this way.

Looking at the two server models we've selected for this review, we immediately note several things from the image comparison:

  • The 6019P has dual power supplies, whereas the 6017R does not.
  • Both servers have IPMI and integrated 10Gb Ethernet connections.
  • Airflow is more evenly distributed across hard drives in the 6019P.
  • The 6019P, being a newer model, supports DDR4 RAM whereas the 6017R uses DDR3 RAM.

With this information in hand, it seems the 6019P is winning the race at the moment, right? Alright, so let's go to the motherboard manual section of each server's product page and open up the circuit schematic to make sure that the "guts" of each one are up to snuff. First off, here's the 6019P:

[Images: 6019P-ACR12L motherboard circuit topology diagram]

What we're looking for here specifically is the SATA storage connections and how they tie into the Northbridge/Southbridge chipsets and the CPU(s). In a nutshell, we're checking to ensure that we can use the full allotment of throughput potential across all the disks and NICs at once. For example, if we have 12 SATA disks running at SATA3 speeds, we need at least 12 x 6Gb/sec, or 72Gb/sec of throughput, on our Southbridge chipset to fully leverage this interface potential.

You might be genuinely surprised at the number of server vendors that don’t build their systems with this thought process in mind. At the end of the day, if you’re planning on pushing your system to near 100% of its total potential, your circuit topology and chipset has to be capable of handling nearly everything at once instead of topping out at two-thirds of the way.

Going back to our manual's circuit diagram, we see that there are two x8 PCI-E 3.0 channels connecting CPU1 to the PCH chipset controller that both the NICs and SATA drives operate off of. Doing some quick math, we know that PCI-E 3.0 runs at 1GB (that's Gigabytes) per lane, which equates to 8Gb (that's Gigabits) per lane; that gives us a total of 64Gb/sec of throughput potential per x8 channel, or 128Gb/sec total.

Our 12 x 6Gb/sec SATA3 slots take up a total of 72Gb/sec throughput at maximum load, and 2 x 10Gb for our NICs gives us 20Gb/sec there, so 92Gb/sec total. This is well within 128Gb/sec, considering that we’re otherwise not going to leverage anything off of the PCH controller since this is a single-purpose server dedicated to onboard storage.
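
If you'd like to sanity-check this kind of bus budget yourself, here's a rough sketch of the arithmetic in Python. The helper names and per-lane figures are my own simplifications of the numbers quoted above, not anything pulled from SuperMicro's documentation.

  # Back-of-the-envelope bus budget for the 6019P layout described above.
  # All figures in Gb/sec; PCI-E 3.0 runs at roughly 1GB/sec (8Gb/sec) per lane.

  def pcie3_uplink_gbps(lanes):
      """Aggregate PCI-E 3.0 uplink bandwidth, in Gb/sec."""
      return lanes * 8

  def device_demand_gbps(sata3_ports=0, sata2_ports=0, nic_10g_ports=0):
      """Worst-case aggregate demand if every port runs flat out, in Gb/sec."""
      return sata3_ports * 6 + sata2_ports * 3 + nic_10g_ports * 10

  uplink = pcie3_uplink_gbps(lanes=16)   # two x8 PCI-E 3.0 channels into the PCH
  demand = device_demand_gbps(sata3_ports=12, nic_10g_ports=2)
  print(f"uplink={uplink}Gb/s, demand={demand}Gb/s, bottleneck={demand > uplink}")
  # -> uplink=128Gb/s, demand=92Gb/s, bottleneck=False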

In passing, I want you to note something here before moving on. Let's suppose that instead of using onboard storage controller technology, we'd opted to use an expansion RAID card for storage control, such as the LSI 3108. 99.9% of RAID expansion cards come in an x8 PCI-E form factor, which, operating at PCI-E 3.0 speeds, gives us 64Gb/sec of total throughput, as noted prior. See the problem here? 64Gb/sec is less than the 72Gb/sec aggregate maximum throughput of all our drives; the RAID controller would actually bottleneck our performance! To get around this, we'd have to buy a second RAID controller to split the drives across, which would add anywhere from $300 to $500 in costs as of the time of this writing. It would also mean an additional component in the mix that could potentially fail.
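
Running the same style of quick check against this hypothetical RAID-card scenario makes the bottleneck obvious (again, just a sketch using the figures above):

  # Hypothetical x8 PCI-E 3.0 RAID card sitting in front of the same twelve SATA3 drives.
  card_uplink = 8 * 8      # x8 PCI-E 3.0 at ~8Gb/sec per lane = 64Gb/sec
  drive_demand = 12 * 6    # twelve SATA3 ports running flat out = 72Gb/sec
  print(card_uplink >= drive_demand)   # -> False: the card itself becomes the bottleneck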

Another "takeaway" from our circuit diagram is that all our data storage connections run off of CPU1; nothing in this department is load-balanced through, or even routed through, CPU2! As such, it may not make much sense to invest in a second CPU for this system at all. There's heavy controversy in the industry today around Ceph, GlusterFS, and HADOOP as to whether a second CPU is even warranted in most storage server environments. Personally, I lean toward "If you don't need a second CPU, don't get one." If a second CPU burns 100 watts, skipping it adds up to 10,000 watts of power saved across a 100-server deployment, which is a BIG deal!

I trust you can see by now that having access to the server motherboard’s circuit topology diagram has been a MASSIVE help in determining that the 6019P server will do everything we need it to do!

Since you’ve probably got the basic gist of what we’re doing nailed down, let’s skim through the same process on the 6017R server system:

[Images: 6017R-73THDP+ motherboard circuit topology diagram]

Alright, so we immediately notice a couple of similarities as well as differences between the two server systems, right? We’re running the SATA-based drive connections off of CPU0’s chipset controller, but the SAS2-based drive connections are going off CPU1! In other words, to leverage this system to its full potential, we’re going to have to get 2 CPUs. Good to know.

We also note that the circuits are a bit more segregated in this system. We have two SATA3 connections and four SATA2 connections going off an x4 PCI-E 2.0 interface to the PCH controller; doing some quick math, 2 x 6Gb/sec for the SATA3 connections plus 4 x 3Gb/sec for the SATA2 connections gives us 24Gb/sec of drive throughput potential, but an x4 PCI-E 2.0 link only gives us 500MB/sec x 4 lanes, or 16Gb/sec of total throughput once you convert bytes to bits. Uh oh! 16Gb/sec is less than the 24Gb/sec aggregate sum of our interface speeds, not to mention that we may not want four of our drives operating at half speed anyway! This is a bit of an issue!
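
Here's the same back-of-the-envelope check for the 6017R's PCH-attached ports, again as a rough self-contained sketch (PCI-E 2.0 runs at roughly 500MB/sec, or about 4Gb/sec, per lane):

  # 6017R: two SATA3 ports and four SATA2 ports hang off an x4 PCI-E 2.0 link to the PCH.
  uplink = 4 * 4             # x4 PCI-E 2.0 at ~4Gb/sec per lane = 16Gb/sec
  demand = 2 * 6 + 4 * 3     # two SATA3 + four SATA2 ports = 24Gb/sec
  print(f"uplink={uplink}Gb/s, demand={demand}Gb/s, bottleneck={demand > uplink}")
  # -> uplink=16Gb/s, demand=24Gb/s, bottleneck=True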

Between the cost of a second CPU we otherwise wouldn't need and the disk throughput bottleneck we're seeing, it seems that while the 6017R is a fantastic platform, it may not be quite what we're looking for in our environment unless we plan on putting it into a long-term, low-access storage pool in our Ceph cluster.

Putting It All Together

Both of these systems have very relevant, practical, real-world use scenarios. While our study shows that the 6019P easily wins on the all-around performance and efficiency front, the 6017R, albeit the more affordable system, fits better into a backseat role in our infrastructure due to the inherent bottlenecks in its storage layout, its lack of PSU redundancy, and its need for a second CPU.

So, the million-dollar question (or more like 50,000-dollar question, in this case!) comes to the fore: which system do we go with?

It really depends on our needs, and what we’re carving out the storage pool in question for. In a tiered Ceph deployment, you’re likely going to have multiple different pools for your data storage depending on access frequency, caching level, and disk potential. The 6019P would be a prime candidate for front-line, high-performance data utilization on either the front-end access tier or on the SSD-grade caching tier; it boasts great performance with minimal components necessary to get us to the level we need. On the flip side, the 6017R would be a great solution for long-term easily-accessible storage retention; it would be easy to spec it to minimum standards on the CPU/RAM front to lower costs even further just to ensure that we’d be getting the most bang for our buck out of the role it would be performing.

Take into consideration that the 6017R uses DDR3 RAM, which means it's an older model than the DDR4-based 6019P and is consequently going to be more affordable as a whole (or, if the situation merits it, easier to purchase refurbished or secondhand). If we needed down-and-dirty storage for minimal up-front cost, it could easily get us where we need to go, considering that at the time of this writing it comes in at about half the cost of the 6019P, and even less on the secondhand market.

For my environment, I can see a use for both platforms – and may very well end up going with a mix of both systems over time. While the 6019P gets me to where I want to go right now with my front-end, the 6017R gives me the data longevity at a better price point that I’m going to need in the future. Since I usually buy storage servers in batches, if I can pick up a great deal for either platform in the 8-20 count ballpark, why not plan on getting both?

Summary

I hope this has been a helpful read for you; there aren't too many step-by-step walkthroughs like this on the Net, and this guide was composed with the idea of bridging that knowledge gap somewhat for medium- to large-scale deployments of these types of systems in a cloud environment.

Whether you agree or disagree (it takes all kinds to make a world, after all!), the one point I want to stress, since it absolutely makes or breaks hyperscaled deployments, is having the server motherboard's circuit topology diagram available. In the case of the 6017R, that diagram gave us absolutely vital insight into where the system should and shouldn't be used in a modern production environment.

Feel free to drop me a note with suggestions or comments on how this guide could be improved, or examples from your own deployment that could be added here as a secondary reference!

Caleb
Caleb Huggenberger is a 31 year-old systems engineer, old-school guitar and amplifier builder, and Eastern culture enthusiast. Outside of long work days, he enjoys electronics engineering, cast iron campfire cooking, and homesteading on his acreage in the Indiana countryside.
