Storage and Resource Sharding and Partitioning With Azure Resources

SQLInSix Minutes
12 min read · Feb 12, 2024

In this post, we look at sharding and partitioning approaches with various Azure technologies. This post is a work in progress and will be updated over time.

Terms

  • Horizontal scaling: the practice of distributing a workload across multiple resources.
  • Sharding: the database application of horizontal scaling where data are demarcated into separate shards.
  • Shards: a subset of the overall data; if you think of a pie chart that represents all the data, a shard would be one slice of that pie.
  • Hash: a unique fingerprint for a data entity.
  • Round robin: a cyclic order of division, usually one after the other.

Reasoning Behind Horizontal Scaling

We may scale for a variety of reasons, but the biggest reasons behind scaling data involve resources, such as compute and storage. If you’ve ever managed replication with the publisher, distributor, and subscriber all on the same server, you understand the competition over resources. One technique is to have a separate server for each role: one for publishing, one for distributing, and one for subscribing. The net effect is more resources available to each role.

One obvious downside is cost; however, in the cloud this is not always the case due to economies of scale. Likewise, you can use the appropriate data tools, such as purpose-built databases, for different tasks and leverage their advantages. This is the business equivalent of outsourcing tasks to specialists and having every major project completed faster.

Another obvious downside is complexity. If we aren’t careful in how we architect scaling, we may overcomplicate a design. Likewise, spreading data across more resources can mean more miscommunication among those resources, so it’s possible that we could face data corruption. Horizontal scaling is not a quick solution to problems; it requires planning and thinking about where the weaknesses may appear, but overall it can help increase performance.

Horizontal Scaling Is Not Vertical Scaling

If we threw more memory, CPU, or storage at a solution, we would be vertically scaling. Remember that in the context of horizontal scaling, we are creating more resources, not adding more power to the resources we already have.

Sharding Approaches: Hash, Lookup, Range, Round Robin

To understand how hash, lookup, and range sharding work, we’ll apply a simple example. Suppose that we have a massive data set that we organize by an alphanumeric symbol, and we decide to shard our data with each of these approaches. To frame the problem, here are a few of the symbols that would appear in the symbol column on each row:

A, AB, ABC, AD, ...

Hash sharding. In hash sharding, each shard stores a range of hash values; when data need to be returned, the hash value of the data is calculated and the correct shard is used to return the data. A hash algorithm does not require a lookup table, nor does it involve any special resources; it is simply a mathematical function that creates a hash from data input. The biggest benefit of hash sharding is that a good hash function yields a balanced distribution of data across the various shards.

If we think about a situation where we have 10 shards, each with 100 TB of storage capacity, but only 2 of them are used, we can see the downside to how we’ve scaled our resources. A good hash function avoids this skew. In the case of our symbol example, the hashing algorithm would store the data evenly across the shards regardless of the symbol value.
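To make this concrete, here is a minimal Python sketch. The shard count, the `shard_for` helper, and the two-letter symbol set are all hypothetical; the point is only that a hash function spreads even alphabetically clustered symbols roughly evenly across shards:

```python
import hashlib
from collections import Counter
from itertools import product
from string import ascii_uppercase

def shard_for(symbol: str, num_shards: int = 10) -> int:
    """Map a symbol to a shard by hashing it and taking the modulus."""
    digest = hashlib.sha256(symbol.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Generate every two-letter symbol (AA..ZZ, 676 total) and count how
# many land on each of the 10 shards.
symbols = ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]
counts = Counter(shard_for(s) for s in symbols)
print(sorted(counts.items()))  # each shard holds roughly 676 / 10, about 68 symbols
```

Even though the raw symbols cluster alphabetically, the hash values do not, which is what produces the balanced spread.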

Lookup sharding. Lookup sharding involves the use of a map, similar to a lookup table, that points to the appropriate shard where the data live. In this type of sharding, the data will generally not be evenly distributed because there isn’t an algorithm enforcing this. However, lookup sharding allows more control over data querying and is incredibly useful in many business contexts, especially those following the 80–20 rule.

Using our symbol example, suppose that we knew that 20% of our symbols receive the most querying. With a lookup sharding approach, we could back the shards holding these data with more resources than other shards, because we know the traffic to them will be greater. While this gives us more control and flexibility, we must remember that it does mean data are not evenly distributed and that we need to maintain the lookup.
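A lookup (directory) approach can be sketched in a few lines of Python; the shard names and the map below are invented for illustration:

```python
# Hypothetical directory table: hot symbols are pinned to a shard with
# more resources, while everything else falls back to "cold" shards.
SHARD_MAP = {
    "A": "shard-hot",      # part of the 20% that gets most of the querying
    "AB": "shard-hot",
    "ABC": "shard-cold-1",
    "AD": "shard-cold-2",
}
DEFAULT_SHARD = "shard-cold-1"

def lookup_shard(symbol: str) -> str:
    """Return the shard for a symbol, falling back for unmapped symbols."""
    return SHARD_MAP.get(symbol, DEFAULT_SHARD)

print(lookup_shard("A"))    # shard-hot
print(lookup_shard("ZZZ"))  # shard-cold-1 (not in the map, so the default)
```

Note that nothing balances the load here: the map is under our control, which is exactly the trade-off described above.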

Range sharding. Range sharding is easiest to understand with calendar timing because range sharding groups data based on sequential values. However, it applies to any numerical range, such as splitting the values 1 through 10 into 5 shards of 2 values each. While range sharding can result in an even distribution of data, this is not enforced. For example, if each month holds the same number of data values and we range shard by month, we will have an even distribution of values. However, if we were to range shard by month with one value per day, we would not have an even distribution of data, since months differ in length.

In the case of our symbol data, we could use range sharding to store all symbol data based on year. This would mean that when we query the data, the year is evaluated and the appropriate shard is selected. Keep in mind that this would not necessarily result in an even distribution of data, because some years may have more or fewer symbols.
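Sketched in Python, range sharding by year reduces to finding which boundary interval a date falls in; the boundary years below are arbitrary:

```python
import bisect
from datetime import date

# Hypothetical boundaries: shard 0 holds years before 2018, shard 1 holds
# 2018-2020, shard 2 holds 2021-2023, and shard 3 holds 2024 onward.
YEAR_BOUNDARIES = [2018, 2021, 2024]

def range_shard(d: date) -> int:
    """Pick the shard whose year range contains the date."""
    return bisect.bisect_right(YEAR_BOUNDARIES, d.year)

print(range_shard(date(2016, 5, 1)))   # 0
print(range_shard(date(2019, 5, 1)))   # 1
print(range_shard(date(2024, 1, 2)))   # 3
```

If one year produces far more rows than another, the corresponding shard simply grows larger, which is the uneven-distribution risk noted above.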

Round robin sharding. Round robin equally partitions and distributes a data set across compute resources. While round robin and hash may seem similar based on the definitions, they differ in how they achieve their final result. Round robin sharding uses a cyclic-order approach: it cycles through the shards one by one and distributes a record to each in turn. This differs from using a hash function, as in hash sharding. Let’s look at a practical example of both in detail and then compare the benefits and costs:

Suppose that we have a data set of 10,000,000 records and we want to store the data by distributing it equally across compute resources. As we’ve already seen, neither lookup nor range sharding would help us here, as we wouldn’t have any guarantee that the data set would be equally distributed across compute resources. In our example, both hash and round robin sharding would apply, as both come closer to achieving this outcome.

Round robin will load the data set in a balanced manner like hash, except it uses no hash function to map the data to the shard. Rather, using our 10,000,000-record example, it will distribute the data in cyclic order across the number of partitions that exist. For instance, if we had 5 partitions, the 1st record would go to partition 1, the 2nd record to partition 2, the 3rd to partition 3, the 4th to partition 4, the 5th to partition 5, and the 6th record would cycle back to partition 1. In the case of remainders (our example, at 2,000,000 records per partition, wouldn’t have any), the leftover records would go to the next partitions in the cycle. This is worth noting because with remainders, neither approach would distribute the data perfectly equally.
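The cyclic distribution described above can be sketched as follows; the partition count and the tiny record set are illustrative stand-ins for the 10,000,000-record example:

```python
from collections import defaultdict
from itertools import cycle

def round_robin_distribute(records, num_partitions):
    """Assign records to partitions 1..n in cyclic order, wrapping around."""
    partitions = defaultdict(list)
    assign = cycle(range(1, num_partitions + 1))
    for record in records:
        partitions[next(assign)].append(record)
    return dict(partitions)

# 10 records across 5 partitions: record 6 cycles back to partition 1.
parts = round_robin_distribute(range(1, 11), 5)
print(parts[1])  # [1, 6]
print(parts[5])  # [5, 10]
```

Notice that no property of the record itself decides its partition; only its arrival order does, which is the key contrast with hash sharding.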

Benefits:

  • Round robin sharding is considered easier to achieve as we won’t have to develop a complex hash algorithm to balance our data load. This relates to the next point as well.
  • Because round robin sharding is simpler, it’s easier to migrate. This is important when we have legacy systems that we need to update.
  • Hash sharding will result in faster querying, which includes faster aggregations of data.

Costs:

  • Neither round robin nor hash sharding follows the 80–20 business rule. In 90% (or more) of data situations I’ve seen, most data are seldom or never queried. The amount of resources wasted in equally distributing the data is enormous considering that we tend to query only a small fraction of our data frequently.
  • In the case of hash sharding, the hash algorithm for mapping the data must be sound; otherwise, you’ll get the same hash value for two different entities. For instance, a hash key built only from first name and last name would be extremely unwise, since we could have 100 John Smiths, creating a problem.
  • In terms of data migration, hash sharding will require that we re-hash the data. In the case of round robin, we’ll have to re-cycle the data through the shards again, if we want an equal distribution of data across shards.
  • Both will cost in terms of data fragmentation since we are distributing data to balance the data load. This can become a problem as the data set grows.
  • Delayed data can cause issues (especially with hash sharding) if the hash involves the timing of the data, since some data may come out of order. The same could possibly occur if we shard with round robin where we order by a time field and we end up with late data added. By contrast, if we think about range sharding where our range is organized by time, this doesn’t have an impact, since the time will determine where the data live.
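The hash-key pitfall in the costs above is easy to demonstrate: a key built only from first and last name cannot tell two different customers apart, whereas a key built from a unique identifier can. The customer IDs below are made up:

```python
import hashlib

def name_key(first: str, last: str) -> str:
    """An unwise hash key built only from first and last name."""
    return hashlib.sha256(f"{first} {last}".encode("utf-8")).hexdigest()

# Two different customers with the same name produce identical keys,
# so the system cannot distinguish them.
customer_1001 = name_key("John", "Smith")
customer_2002 = name_key("John", "Smith")
print(customer_1001 == customer_2002)  # True -- a collision by construction

def id_key(customer_id: int) -> str:
    """Including a unique identifier yields a unique key per entity."""
    return hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()

print(id_key(1001) == id_key(2002))  # False
```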

As a quick note, because I’ve heard and seen discussions and exams where replication or replicated data was brought up when discussing data loading and querying, it is worth briefly mentioning it here. When we replicate data, we are creating a copy of those data. Replication does not create shards of partial data that we then need to put together. While we can query full replicas and distribute how we query the full data across them, this is not sharding, as each replica holds the entire data set.

Scenarios

Other than our example, let’s look at some scenarios when we might want to consider using hash, lookup or range sharding. Keep in mind that we have to consider the full picture along with support of our approach.

  • Heavy time-based querying. Range sharding will generally be the best choice here because the queries depend heavily on the time of the data. There may be a few exceptions where I would consider lookup sharding; for instance, if time-based querying is a big part of querying but there are a few other nuances where we want more control over how the data are sharded.
  • More control over sharding data. When I think of the 80–20 rule of business, lookup sharding is the best choice. In these scenarios, which are frequent in business, we uncover that a small fraction of our data is queried often, and we want more control over where these data live. Remember that there’s extra complexity with the lookup and that the data load is not evenly distributed.
  • Even distribution of data along with less complexity. Hash sharding is the best choice here.

Azure Storage and Partitioning Storage Solutions

The two scenarios below each cover a business challenge that involves architecting an Azure storage solution for various applications. For these questions, we are only looking at Azure storage. Outside of Azure storage, we may have alternative routes to build a solution that doesn’t require storage, but we’ll skip those for these questions, though I may explain why I pick an Azure storage option for a particular solution that involves a related Azure technology. In this post, I also explain why one route or another is a better decision.

As an added challenge to these questions, you are required to come up with at least one key question to ask the client. Part of consulting involves understanding the challenge, not “solutioning” before you understand the problem or details. While I list some key questions you could ask, you could ask many more relative to the company and its data. These make good starting points and show that you will need to ask more questions even with the following scenarios.

Questions

Question 1: you are working with an oil company that owns two pipelines. The company is building an application that must be able to track the flow of the pipeline along with compliance data. Since each region has its own compliance requirements, the compliance data should be tracked separately from the flow data. The company puts you in charge of a storage solution that uses Azure storage for their data and has the following requirements:

  1. The flow and compliance data must be stored in the region where the pipeline is.
  2. The flow data are ongoing and can update past data with different values because at the moment the pipeline streams data, the data are not always adjusted accurately.
  3. The solution must optimize for flow data being accessed and updated as quickly as possible.

Question 2: you are working with a retail company that sells products made by various companies along with its own products. The company wants to use machine learning to categorize the product categories and determine which company tends to produce the highest-demand products. However, the company is skeptical that machine learning will help them beyond their own analysis, so they will provide your team with one-time data dumps of the retail data from several data sources that need to be combined, and they want you to keep the costs as low as possible. The machine learning engineers on your team instruct you to create a solution that combines all the data from the various sources and outputs a parquet, CSV, or text file that can be passed into their algorithm. In review, the requirements are:

  1. You must keep costs as low as possible
  2. You must combine the various data sources
  3. You must output a CSV, parquet, text or other file that can be easily consumed by the machine learning algorithm

Solutions

Solution 1: in this case, I would use two storage accounts, each belonging to the legal jurisdiction of its pipeline. In each storage account, there would be two containers: one for the flow data and one for the compliance data. The data would be stored with Data Lake Storage Gen2; since the flow data are streaming and can update past values, Delta Lake would be preferred over a multi-file import-and-merge solution.

Key questions:

  • I would ask a question about the streaming data window. When any client says that “early data can be inaccurate”, I want “early” defined. Part of this is because I want to know when I can assume that updating past data will be accurate.
  • I would ask a question involving the security of compliance data along with protecting the intellectual property of flow data. Compliance data generally need to be secured, and flow data often come with intellectual property that must also be protected; around intellectual property in particular, I would have many questions.

Solution 2: in this case, Azure Blob Storage from an Azure storage account with one container in one region would suffice, because these are one-time data dumps. The analysis of combining data could be done for about $5 using an Azure Synapse serverless SQL pool (the price at this time), as views can be created on top of the data dumps and then joined, outputting a CSV.
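To illustrate the combine-and-output step in miniature, here is a standard-library Python sketch. The column names and sample rows are invented, and in the actual Synapse serverless approach this would instead be views over the dumps joined in SQL with the result exported:

```python
import csv
import io

# Two hypothetical one-time dumps that share a product_id key.
products_dump = io.StringIO("product_id,category\n1,tools\n2,toys\n")
sales_dump = io.StringIO("product_id,units_sold\n1,500\n2,120\n")

# Index one dump by the join key, stream the other through it, and
# write a single combined CSV for the machine learning team.
products = {row["product_id"]: row for row in csv.DictReader(products_dump)}
combined = io.StringIO()
writer = csv.DictWriter(combined, fieldnames=["product_id", "category", "units_sold"])
writer.writeheader()
for row in csv.DictReader(sales_dump):
    writer.writerow({**products[row["product_id"]], "units_sold": row["units_sold"]})

print(combined.getvalue())
```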

Key questions:

  • I would ask a question to ensure that none of these data dumps contain customer information or intellectual property that must be protected. For example, product demand without customer details is fine, as there won’t be customer details to protect.
  • I would ask a question about the data dumps themselves. In most cases a data dump is a file like a CSV or text file, but I would want to confirm this. For instance, a data dump could be a BAK file, and then you wouldn’t be using storage in this way at all.
  • I would ask a question about the data architecture when combining normalized data. If they do not have a data model ahead of time, then I would validate that my data combinations are correct. We do not want to be combining normalized data inaccurately. This can be a key question because it’s not uncommon to meet with clients who don’t have deep insight into their data or don’t understand why their data environment lacks integrity, often due to data combinations being challenging or impossible.

More details from Microsoft and recommended reading:

Videos that cover sharding and partitioning tables, data and collections from my YouTube channel that may help you understand how this applies to data overall:

Note: all images in the post are either created through actual code runs or from Pixabay.
