Cost-Effective Scaling: Save Money by Adding More Nodes

Maximizing (Nutanix) Cluster Resiliency while Optimizing Costs

The notion of enhancing cluster resiliency while simultaneously reducing costs may sound like a paradox, but rest assured, it’s not just wishful thinking. The following principle can be applied to any virtualization solution.

Balancing Performance and Cost Efficiency

When architecting a Nutanix solution, particularly in scenarios where CPU cores dictate licensing expenses, finding the optimal balance between performance and cost efficiency is paramount. This challenge becomes more pronounced in smaller environments, where the impact of each additional node on both performance and budget is more significant.

The Dilemma: Node Count vs. Budget Constraints

Increasing the number of nodes in a cluster undoubtedly boosts resiliency. More nodes translate to faster rebuilds, a reduced blast radius during node failures, and minimized disruptions during system upgrades. However, this also means escalating costs, potentially pushing you beyond your allocated budget. The question then arises: Is a four-node cluster truly better than a three-node one, considering the increased expenses?

A Clever Approach: Increase Nodes, Decrease Costs

Failover Capacity Optimization

To address this conundrum, consider failover capacity as a key factor. In the event of a node failure, reserved resources within the cluster mitigate the impact of the outage. In a three-node cluster, a third of the total capacity must be reserved for failover. In contrast, a four-node cluster requires only a quarter, presenting an intriguing opportunity to reduce costs.
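To make that relationship concrete, here is a minimal sketch in Python (my own illustration, not from any Nutanix tool) that computes the fraction of cluster capacity reserved to survive one node failure, assuming identically configured nodes:

```python
# Fraction of total cluster capacity one node represents (N+1 sizing).
# Assumes all nodes are identically configured, as in the examples below.

def failover_reserve_fraction(node_count: int) -> float:
    """Share of total capacity that must be held back for one node failure."""
    return 1 / node_count

for nodes in (3, 4, 5):
    reserve = failover_reserve_fraction(nodes)
    print(f"{nodes} nodes: reserve {reserve:.0%} for failover, {1 - reserve:.0%} usable")

# 3 nodes: reserve 33% for failover, 67% usable
# 4 nodes: reserve 25% for failover, 75% usable
# 5 nodes: reserve 20% for failover, 80% usable
```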

Practical Example: CPU Core Sizing

Let’s break it down with a practical example. Assume you need 192 vCPUs for your virtual machines, with an oversubscription ratio of 3:1, and that you want to tolerate one node failure. The workload therefore requires 192 / 3 = 64 physical cores in the cluster. Traditionally, you might size a three-node cluster with a total of 96 cores.

  • Three-Node Cluster: As we want to survive a complete node outage, two nodes must be able to accommodate the entire workload during a node failure, so we need 64 cores spread across two nodes. Because we cannot know in advance which node will eventually fail, all three nodes must have an identical hardware configuration. Each node therefore needs 32 cores so that 64 physical cores remain available during a node failure, and we end up with a three-node cluster totaling 96 physical cores.
  • Four-Node Cluster: Three operational nodes must handle the workload when a node fails. Dividing 64 cores by three nodes gives around 21 cores per node, so you might choose a node with 20 cores and a slightly higher clock rate to compensate for the lower core count. While seemingly counterintuitive, this approach proves effective at reducing costs: the total number of cores in the cluster is now 80. We reduced the total core count while also shrinking the blast radius when a node goes offline; fewer VMs have to be restarted, and the overall impact is lower than in a three-node cluster.
  • Five-Node Cluster: The same logic applies with five nodes. In case of a node outage, four operational nodes must carry the workload, requiring 64 / 4 = 16 cores each, resulting in a five-node cluster with 16 cores per node. Again, the total number of cores is 80.

The following table summarizes the calculation for our example; a small script that reproduces these numbers follows below.

Nodes | Cores per node | Total cores | Cores available after one node failure
3     | 32             | 96          | 64
4     | 20             | 80          | 60 (higher clock rate compensates)
5     | 16             | 80          | 64
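If you want to play with the numbers yourself, here is a small Python sketch (the function name is my own, not from any Nutanix sizing tool) that derives per-node and total core counts from the vCPU demand, oversubscription ratio, and node count:

```python
import math

def size_cluster(vcpus: int, oversubscription: float, nodes: int,
                 tolerated_failures: int = 1):
    """Size identical nodes so the surviving nodes can carry the whole workload.

    Returns (cores_per_node, total_cores). Per-node rounding is up, so the
    result can land slightly above the theoretical minimum.
    """
    required_cores = math.ceil(vcpus / oversubscription)   # 192 / 3 = 64
    surviving_nodes = nodes - tolerated_failures
    cores_per_node = math.ceil(required_cores / surviving_nodes)
    return cores_per_node, cores_per_node * nodes

for n in (3, 4, 5):
    per_node, total = size_cluster(vcpus=192, oversubscription=3, nodes=n)
    print(f"{n} nodes: {per_node} cores/node, {total} cores total")

# 3 nodes: 32 cores/node, 96 cores total
# 4 nodes: 22 cores/node, 88 cores total  (the post rounds down to 20 cores
#                                          and compensates with clock speed -> 80)
# 5 nodes: 16 cores/node, 80 cores total
```

Note the four-node case: a strict ceiling gives 22 cores per node, while the example above trades two cores per node for a higher clock rate. That is a judgment call to match available CPU SKUs and per-core licensing tiers rather than pure arithmetic.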

Extending the Strategy: Memory and Storage with HCI

Applying the same methodology to memory, each node might have fewer resources individually, but the cluster as a whole can still seamlessly accommodate a complete node failure.

With an HCI solution like Nutanix, this also extends to storage. So you can save money because adding more nodes decreases the required failover capacity, as the rough sketch below illustrates.
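The same (N − 1) logic can be expressed for any capacity dimension. The following Python sketch (my own illustration with a hypothetical 1536 GB memory footprint; it deliberately ignores Nutanix-specific details such as the storage replication factor and CVM overhead) applies it to memory:

```python
import math

def per_node_capacity(total_required: float, nodes: int,
                      tolerated_failures: int = 1) -> float:
    """Capacity each identical node needs so the cluster survives a node loss."""
    return total_required / (nodes - tolerated_failures)

total_memory_gb = 1536  # hypothetical workload footprint

for n in (3, 4, 5):
    per_node = math.ceil(per_node_capacity(total_memory_gb, n))
    print(f"{n} nodes: {per_node} GB/node, {per_node * n} GB raw in the cluster")

# 3 nodes: 768 GB/node, 2304 GB raw
# 4 nodes: 512 GB/node, 2048 GB raw
# 5 nodes: 384 GB/node, 1920 GB raw
```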

Conclusion: Strategic Node Scaling for Cost Optimization

In essence, by strategically increasing the node count while tailoring the resources per node, you can decrease the total number of cores required. Consequently, this allows for a reduction in license costs, all while keeping hardware expenses within reasonable bounds.

In some cases this approach won’t work, and it also has its limits. What I have seen, though, is that you can often save a lot of money when you can decrease the overall number of cores required, as the additional node is cheaper than all the additional software costs for HCI software and operating system licenses.

In the intricate dance between cluster resilience and budgetary constraints, finding the right balance involves leveraging failover capacity and adopting a nuanced approach to node scaling. In doing so, you can attain the twin goals of heightened resiliency and optimized costs.
