HSFRA Algorithm – Calculating Hyperscale Storage Needs

In the process of working on Otaku Central’s video storage infrastructure, I’d previously determined that I wanted the entire video server topology to run on JBOD-style storage because it offers the throughput potential I need, while giving me failover capacity at the server level instead of the individual disk level. Most large-scale enterprises (Google, Facebook, and Amazon, to name a few) don’t use RAID in their production clusters because it’s simply not an efficient way of managing performance and failover at that scale.

But, with that in mind, how risky is it to run a JBOD-style hyperscaled storage infrastructure? How much server-level redundancy do you need to reasonably mitigate that risk, and how much will it cost you? And given how much server-level redundancy is needed, is it even a good idea to begin with if your organization is just “starting out”?

Armed with these questions, I set out to find something in the industry that would help me answer them, but came up with nothing after an extensive search and a round of questioning colleagues. My guess is that the large enterprises that have successfully implemented JBOD-style cluster deployments have already done the number-crunching on this, but kept that knowledge in-house as a technical secret.

So, left with no choice but to figure out how to do this myself, I’ve come up with what I refer to as the “Hyperscaled Storage Failure Risk Assessment” algorithm, or HSFRA for short. I’ve successfully used it in test deployment scenarios to calculate risk levels, and it can be easily modified or expanded to calculate other storage risk scenarios such as those including RAID, multi-clustering, and multi-stage clustering.

If you find this even the least bit useful, or would suggest changes to it, I’d love to hear your feedback. I’m hoping that others can possibly find this helpful for their own deployments, or as an extensible framework to apply to other technical situations as well.

Variables Needed

To use HSFRA, you’re going to need some standard variables to plug into the algorithm first. Some of these can be easily approximated via an Internet search for statistics, but others may need to be guessed at – and that’s alright, because there’s a percentage-based margin of error built into the algorithm to compensate for this.

This algorithm assumes that when you build your cluster, you’re using the same model of everything across the board for each year’s deployment; not only is it MUCH easier to build your cluster this way, but it also greatly simplifies the math involved.

What you’ll need are:

  • HA – Hard Drive Age; how old your hard drives are in years.
    • If they’re brand new, then this will be ‘0’.
  • HM – Hard Drive Age Modifier; how much the risk percentage that your hard drive will fail increases with each passing year.
    • You can often find statistics on this for your make/model of drive from companies like Google or Facebook that have run assessments on hundreds of thousands of drives in their datacenters already. If you’re not sure what this is for you, these are the numbers I typically use by vendor:
      • Western Digital: 0.05
      • Seagate: 0.035
      • HGST: 0.02
  • HC – Hard Drive Cost; how much it costs to buy each individual hard drive.
  • HQ – Hard Drive Quantity; how many hard drives will be in each server.
  • HW – Hard Drive Warranty; how long the vendor warranty for your hard drives is in years.
  • HS – Hard Drive Size; how much storage capacity each individual hard drive provides.
  • SA – Server Age; how old your servers are in years.
    • If they’re brand new, then this will be ‘0’.
  • SM – Server Age Modifier; how much the risk percentage that your server will fail increases with each passing year.
    • As long as you’re using an industry-standard manufacturer’s server (Dell, HP, IBM, Supermicro, etc.), set this number to ‘0.02’ unless your research justifies changing it to something else.
    • If you’re using a custom-built server or custom-designed server model, this number usually jumps to around ‘0.04-0.05’, but do your research to confirm.
  • SC – Server Cost; how much it costs to buy each individual server.
  • SQ – Server Quantity; how many of this server build will be in the cluster as a whole.
  • SW – Server Warranty; how long the vendor warranty for your server is in years.
  • EM – Error Margin; what percentage of human error or statistical error you want to factor into the algorithm.
    • Set this to the equivalent of ‘+15%’ (i.e. a multiplier of ‘1.15’) if you’re not sure; this usually gives you plenty of cushion to fall back on if needed.
  • SN – Storage Amount Needed; how much usable disk space you need from this cluster.
  • RI – Replication Index; how many other servers in your cluster you want to hold a copy of the same data in order to mitigate the risk of data loss.
  • AC – Annual Cost Budget; how much money your company has allowed you to spend on the storage infrastructure for the fiscal year.

The following variables are ones that we’re going to calculate, since we need them from the earlier stages of the algorithm in order to solve the later stages:

  • HF – Hard Drive Annual Failure Risk %; how likely it will be that an individual hard drive could fail.
  • SF – Server Annual Failure Risk %; how likely it will be that an individual server could fail to the point it would have to be taken out of the cluster and repaired due to a non-hard drive-related fault.
  • AF – Aggregate (Server + HD) Failure Risk %; how likely it will be that a server must be removed from the cluster either due to a fault of its own, or because of the hard drives it contains.
  • AL – Annual Risk of Data Loss; this is the percentage chance that you’ll lose enough of your infrastructure simultaneously due to failures that you’ll incur a loss of data from it.
    • If this ever gets above 5%, you’re in the danger zone: your company going under this year due to data loss is now about as likely as a D20 rolling a ‘crit’. Keeping it below 2% at all times is the safer goal to start with, but strive post-build to get it even lower.

Hopefully, you’re not completely intimidated by the number of variables here! To help make this simpler to implement, the HSFRA algorithm is divided into five parts that can each be pieced together at the end to provide you the information you ultimately want.

Running Through HSFRA – Sample Scenario

Let’s suppose you have a deployment situation that’s provided you with the following variable details:

  • HA = 0
  • HM = 0.035
  • HC = $155
  • HQ = 8
  • HW = 3
  • HS = 4TB
  • SA = 0
  • SM = 0.02
  • SC = $2,250
  • SQ = 8
  • SW = 1
  • EM = 1.15
  • SN = 40TB
  • RI = 6
  • AC = $60,000
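
If you want to follow along in code, here’s that same scenario expressed as a plain Python dictionary. The variable names mirror the list above and the values are exactly the ones just given; only the dictionary itself is my addition.

# The sample HSFRA scenario, expressed as plain Python.
scenario = {
    "HA": 0,       # Hard Drive Age (years)
    "HM": 0.035,   # Hard Drive Age Modifier (risk added per year)
    "HC": 155,     # Hard Drive Cost ($ per drive)
    "HQ": 8,       # Hard Drive Quantity (drives per server)
    "HW": 3,       # Hard Drive Warranty (years)
    "HS": 4,       # Hard Drive Size (TB per drive)
    "SA": 0,       # Server Age (years)
    "SM": 0.02,    # Server Age Modifier (risk added per year)
    "SC": 2250,    # Server Cost ($ per server)
    "SQ": 8,       # Server Quantity (servers in the cluster)
    "SW": 1,       # Server Warranty (years)
    "EM": 1.15,    # Error Margin (multiplier form of +15%)
    "SN": 40,      # Storage Amount Needed (TB)
    "RI": 6,       # Replication Index (other servers holding a copy)
    "AC": 60000,   # Annual Cost Budget ($)
}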

Let’s step through HSFRA using this information and see if we can find a storage solution that would work for this organization.

Part #1 – HF

The formula to calculate ‘HF’ is:

[equation image]

To apply this to our variables, we get:

[equation image]

Interpreting this into real-world information, ‘0.05’ here means that there’s a 5% chance that one of the brand-new datacenter-grade drives we picked will fail right out of the gate when we install them.

Had we re-run this assessment on a 3-year-old hard drive (such as one from a cluster we’d built out 3 years ago and now wanted a risk assessment on), the result would be:

[equation image]

Wow! On a 3-year-old hard drive, our failure risk percentage goes up from 5% to 23.5%! For those of you who have worked in engineering or datacenter hardware roles before, this number probably sounds about right based on what you’ve worked with.
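
Since the equation images didn’t make it into this text version, here’s a hedged Python sketch of how I read the HF formula. Treat it as an interpretation rather than the pictured equation itself: it assumes a flat 5% baseline risk, adds HM for each additional year of age, and combines the per-year risks by multiplying their inversions together (the same ‘stacking diminishingly’ trick that shows up again in Part #3). The function name and the ‘base’ parameter are mine; the payoff is that it reproduces both worked results above.

def hard_drive_failure_risk(ha: int, hm: float, base: float = 0.05) -> float:
    """HF, as I read it: each year of service carries a risk of (base + year * HM),
    and the yearly risks are stacked diminishingly (multiply the inverted risks
    together, then invert the result back)."""
    survival = 1.0
    for year in range(max(ha, 1)):  # a brand-new drive still carries its first-year risk
        survival *= 1 - (base + year * hm)
    return 1 - survival

print(round(hard_drive_failure_risk(0, 0.035), 3))  # 0.05  -> the 5% figure for our new drives
print(round(hard_drive_failure_risk(3, 0.035), 3))  # 0.235 -> the 23.5% figure for the 3-year-old drive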

Part #2 – SF

The formula to calculate ‘SF’ is:

[equation image]

Applying this to the variables we were given, we get:

[equation image]

In similar fashion to ‘HF’, our initial server failure rate for the first year of production is 5%.

To offer a little clarification on the ‘0.06’ number in the far right bracket: servers statistically see an increase in failure risk percentage after they’re 6 years old. The far right bracket accounts for this, which is why it should only be factored in if the server is past the 6-year mark.

If the server was 3 years old, our equation would be a bit different:

[equation image]

Notice how servers, at 11%, see a slower increase in risk percentage than hard drives do? Most engineers could also attest that hard drive failures are much more common than server failures, and this speaks to that fact.
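
Same caveat as before: the pictured formula isn’t reproduced here, so this is my reading of SF in Python. The linear part (a 5% baseline plus SA × SM) matches both worked results above, while the far-right bracket is sketched as an extra 0.06 for each year beyond the 6-year mark; the exact shape of that bracket is an assumption on my part, based on the description in the previous paragraph.

def server_failure_risk(sa: int, sm: float, base: float = 0.05, old_age_rate: float = 0.06) -> float:
    """SF, as I read it: a flat baseline plus SA * SM, with the extra 0.06-per-year
    term only applied once the server is past the 6-year mark."""
    risk = base + sa * sm
    if sa > 6:
        risk += (sa - 6) * old_age_rate
    return risk

print(round(server_failure_risk(0, 0.02), 2))  # 0.05 -> 5% for our brand-new servers
print(round(server_failure_risk(3, 0.02), 2))  # 0.11 -> the 11% figure for a 3-year-old server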

Part #3 – AF

To solve for AF using the two variables we’ve obtained in the prior steps, we use:

[equation image]

Plugging in our numbers, we get:

[equation image]

Interesting! So between all the components in our server, including both the server itself and each individual hard drive, we’re looking at just below a 37% chance that something in it will die in the first year to the point we have to take the server out of the cluster and repair it.

You might be thinking that this number is unnaturally high – I mean, all the components are brand new, and there was only a 5% chance that each of them individually would fail! And why are these numbers calculated in the manner that they are, anyway?

To explain, this equation applies what’s generally known as ‘stacking diminishingly’. The term is most famous from its use in the popular video game “Dota 2”, where it describes how several independent chance factors combine into a realistic aggregate percentage.

To use it, we apply it to the inverted form of each risk percentage, which is why we use “1 – SF” to invert ‘0.05’ into ‘0.95’; we then multiply all of the inverted values against each other to get our diminishingly-stacked aggregate, and invert the result back.
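
As a quick illustration, here’s that trick as a few lines of Python, applied to the year-one case we just worked through (one server risk of 0.05 plus eight drive risks of 0.05 each). The helper name is mine; the technique is exactly as described above.

def stack_diminishingly(risks):
    """Invert each risk, multiply the inverted values together, then invert back."""
    survival = 1.0
    for risk in risks:
        survival *= 1 - risk
    return 1 - survival

# One server at SF = 0.05 plus eight drives at HF = 0.05 each:
print(round(stack_diminishingly([0.05] * 9), 4))  # 0.3698 -> the "just below 37%" AF from above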

If we run the same formula against a 3-year old system, we get:

[equation image]

Well, this is pretty staggering, isn’t it? An 89.5% chance that the system will incur a failure. While this seems pretty scary, keep a couple of things in mind when analyzing this information.

First, the odds are that at least one of those hard drives will fail and be replaced during its 1st or 2nd year of life. If two hard drives had failed and been replaced in this fashion, recalculating with the newer replacement drives factored in would bring the AF down to 83.8%. If three hard drives had failed, it would bring the AF down to 77.6%.

So, our AF formula is really just a guidepost; in a real-world scenario, you’d want to re-tailor it in a more granular fashion on a server-by-server basis. It would make a really nice back-end scripting addition to your asset management system to automatically track this against your database; that’s how I’d implement it for my own organization.
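
For what that scripting addition might look like, here’s a minimal sketch that recomputes AF for a single server while letting each drive carry its own age-dependent failure risk (for example, after a couple of failed drives have been swapped for new ones). The helper name and the specific risk numbers plugged in are illustrative; they come from my hedged reconstructions earlier, not from the original equation images.

def per_server_af(server_risk, drive_risks):
    """Aggregate failure risk for a single server, where each drive can carry
    its own (age-dependent) failure risk instead of all being treated the same."""
    survival = 1 - server_risk
    for risk in drive_risks:
        survival *= 1 - risk
    return 1 - survival

# A 3-year-old server (SF = 0.11) where two of the eight original drives have
# already been swapped for new ones (HF = 0.05 instead of 0.235):
drive_risks = [0.235] * 6 + [0.05] * 2
print(round(per_server_af(0.11, drive_risks), 3))  # 0.839 -> close to the 83.8% quoted above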

Part #4 – Adding in AC

Alright, so now let’s work with our AC and see if our game plan for our infrastructure meets our budget here:

[equation image]

While this equation’s a lot more complex than the previous ones, let’s step through it a piece at a time with our variables and see what we get:

[equation image]

Our result confirms that 32,080 is indeed less than 60,000, so our deployment is within our budget; very much so, in fact.

Notice that all of the potential failure costs for replacing components are mitigated by the fact that all our hardware is brand new, and thus covered by warranty, so we won’t have to pay to replace any of it. If we were running the same calculation on a 3-year-old server cluster, we’d be looking at $4,554 per year once we factor our replacement costs against the 15% margin of error (EM).
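
The pictured cost equation isn’t reproduced here either, but the gist as I read it: price out the hardware, add whatever replacement spend isn’t covered by warranty, and pad the total by EM. Here’s a rough sketch under those assumptions; note that my hardware-only total lands at about $32,100 rather than the $32,080 quoted above, so take the exact constants with a grain of salt.

def annual_cluster_cost(sq, sc, hq, hc, em, expected_replacement_cost=0.0):
    """Annual cost, as I read it: hardware purchase price plus any expected
    out-of-warranty replacement spend, all multiplied by the error margin EM."""
    hardware = sq * (sc + hq * hc)
    return (hardware + expected_replacement_cost) * em

cost = annual_cluster_cost(sq=8, sc=2250, hq=8, hc=155, em=1.15)
print(round(cost), round(cost) <= 60000)  # 32108 True -> comfortably inside the $60,000 budget (AC)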

Alright, so we’re within our budget with our design, but will it actually meet our business requirements?

Part #5 – AL & RI

Finally, let’s see how bad the damage is as far as the risk for our overall organization goes:

[equation image]

This equation inverts our AF, then runs it through the “stacking diminishingly” function to compare what our total annual risk of data loss is if we don’t replicate the data to every single server in our cluster vs. if we do. The result:

[equation image]

So if we went with the original proposal of an RI of 6, we’re sitting on a 6.27% chance this year that we’ll incur data loss, in exchange for gaining an additional 25% storage capacity; that 6.27% is above the 5% threshold we’re never supposed to cross, though. On the flip side, that additional 25% storage capacity gets us to exactly the 40TB of disk space our business needs demand per our SN.

If we choose to replicate the data to all of the servers instead of using an RI, our AL drops to 2.48% – which is looking a lot better! The drawback to this approach is that we’re now limited to 32TB of disk capacity in the cluster, which is below the 40TB we need!
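
For reference on where that 32TB figure comes from: 8 servers × 8 drives × 4TB (our HS) gives 256TB of raw capacity, and keeping a full copy of everything on all 8 servers leaves 256TB ÷ 8 = 32TB of usable space. I won’t attempt to re-derive the usable-capacity figure for the RI = 6 case here, since that depends on the replication equation pictured above.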

The easy solution here is to fall back on the 27K-ish dollars we had left over from the budget, add more servers to the proposal, and re-run the HSFRA algorithm as a whole to confirm that we’re looking good on all fronts. I won’t take the trouble to do this here since there are a number of different solutions within those means, but this effectively brings us to the end of the road for HSFRA and some of what it can bring to the table.

Closing Thoughts

For Otaku Central, it’s pretty cool to now have some hard numbers concerning risk and data storage, so that I can effectively plan out my business needs into the future without having to guess or estimate my way through problems as they arise – I now have a solution that can give me the answer I need in 5 algebraic steps.

I hope this has been insightful for you, and a reminder that even though most systems engineers haven’t studied math since college or high school, it definitely has a place in more advanced server work as well as scripting.

Caleb
Caleb Huggenberger is a 31 year-old systems engineer, owner of the non-profit animation streaming service 'Otaku Central', and Eastern culture enthusiast. Outside of long work days, he enjoys electronics engineering, cast iron campfire cooking, and homesteading on his acreage in the Indiana countryside.
