Capped Collections In MongoDB

SqlInSix Tech Blog
5 min read · Apr 1, 2021

Architecting data solutions with data limitations built-in

Around 2010, four years before I recorded the video MongoDB: Capped Collections and Data Expiration, I wrote that the rise of social media would eventually encourage people to participate in platforms that favored small data over large data. I made this prediction because people were only beginning to see how social media platforms were using their data, and how other users on those platforms were using it as well.

As more people became familiar with how these data were being used, platforms that didn’t store much data would eventually become popular. Snapchat was released a year later, but wasn’t popular at the time. It has since become immensely popular with the younger generation, and one reason is that younger people like a “non-history” of activities.

That’s one example of why we might want this technical solution. MongoDB provides us with a great tool for when we want to limit the data size of a collection: capped collections and data expiration. Not only does this feature help us minimize the size of our environment (saving us resources), it can also be used for tools where we only want to keep records around for a period of time, automatically, without having to worry about removing data later or scheduling data removal (which may fail).

Basics

In our example, we create a capped collection called “AutoPrivacy” with the following syntax:

db.createCollection("AutoPrivacy", {capped: 1, size: 2, max: 2})

In this syntax, the capped specification with 1 makes this AutoPrivacy collection capped (true also works). The size specifies the number of bytes allowed in the collection and the max specifies the number of documents allowed. As a quick note on the size of a capped collection — it cannot be 0 bytes nor can it exceed 1 petabyte at the time of this article. Since this is only a test, our specifications are extremely limited.
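For anything beyond a test, the two limits have to be chosen together. MongoDB’s documentation notes that the size limit takes precedence over max, so a size that is too small can evict documents before the document count is ever reached. The sketch below is one rough way to pick the numbers, assuming we know an approximate average document size. The function name cappedOptions, the 200-byte average, and the 1.5 headroom factor are illustrative assumptions, not values from the video.

```javascript
// Builds an options document for db.createCollection().
// avgDocBytes and headroom are estimates we supply ourselves (assumptions).
function cappedOptions(avgDocBytes, maxDocs, headroom = 1.5) {
  // Leave extra room so the document count, not the byte size,
  // is the limit we expect to hit first.
  const size = Math.ceil(avgDocBytes * maxDocs * headroom);
  return { capped: true, size: size, max: maxDocs };
}

const options = cappedOptions(200, 1000);
console.log(options);

// In the mongo shell, the result could be passed straight through:
//   db.createCollection("AutoPrivacy", options)
```

If the real documents turn out larger than the estimate, the byte limit will start removing documents early, so it is worth checking the count after loading some realistic data.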

Now, let’s test our example capped collection by adding some data we might expect to find in a micro-blogging platform like Twitter:

db.AutoPrivacy.insert({post: "Hey everyone!"})
db.AutoPrivacy.insert({post: "Hey y'all. I just moved."})
db.AutoPrivacy.insert({post: "Hey y'all, that was a great event last night."})

As we see, we’ve exceeded the collection max of 2. What do you think will be returned when we query this collection? Let’s query it:

db.AutoPrivacy.find().pretty()

We see that we only get back the last two posts we inserted and no longer see the Hey everyone! post.
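To make the behavior concrete without a running server, here is a small in-memory model of the document-count limit. It is only a sketch of the first-in, first-out idea, not how MongoDB implements capped collections, and it ignores the byte-size limit entirely. The class name CappedModel is made up for this illustration.

```javascript
// In-memory model of a capped collection's `max` limit (not MongoDB itself).
class CappedModel {
  constructor(max) {
    this.max = max;  // maximum number of documents, like the `max` option
    this.docs = [];  // oldest document first, newest last
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the cap is exceeded, drop the oldest documents to make room.
    while (this.docs.length > this.max) {
      this.docs.shift();
    }
  }

  find() {
    // Return a copy, in insertion order.
    return this.docs.slice();
  }
}

const autoPrivacy = new CappedModel(2);
autoPrivacy.insert({ post: "Hey everyone!" });
autoPrivacy.insert({ post: "Hey y'all. I just moved." });
autoPrivacy.insert({ post: "Hey y'all, that was a great event last night." });

// Only the two most recent posts remain.
console.log(autoPrivacy.find());
```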

If we want to check if an existing collection is capped, we can execute the below:

db.AutoPrivacy.isCapped()

This will return true in our case, since we just created a capped collection. Although the video doesn’t cover it, we can also convert an existing collection to a capped collection. Since this change could introduce costs and bugs, we should review it against our architecture first. In general, it’s better to create a new capped collection than to convert an existing one. However, it is worth noting that the conversion is possible.
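For reference, the conversion is done with MongoDB’s convertToCapped database command. The sketch below wraps it in a small helper so the shape of the command is clear. The helper name, the “Posts” collection, and the 1 MiB size are hypothetical, and this is best tried on a copy of the data first.

```javascript
// Thin wrapper around MongoDB's convertToCapped database command.
// `db` is the shell's database handle; the name and size are up to us.
function convertToCapped(db, collectionName, sizeInBytes) {
  // The command name must be the first key in the command document.
  return db.runCommand({ convertToCapped: collectionName, size: sizeInBytes });
}

// In the mongo shell, against a hypothetical existing "Posts" collection:
//   convertToCapped(db, "Posts", 1048576)   // cap at 1 MiB
//   db.Posts.isCapped()                     // check the result afterwards
```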

Video

Some questions that are answered in the video — MongoDB: Capped Collections and Data Expiration:

  • What is one reason that we may use a capped collection or use the feature of data expiration mentioned in the video?
  • What is another use case that you can think of based on the example?
  • In the example video, what is one technique we use to verify that MongoDB “expired” the data?
  • What is another way we can verify that data are not stored?
  • What do I note about the performance of a capped collection and where might this benefit us?
  • What is the final point I make about optionality and how could we use this in our architecture?

Related Points — Data Accuracy and Security

As a related data note, storing history can often be more distracting than helpful for accurate predictions (prediction is often the rationalization for storing historical data). In addition, the costs may not offset the benefits. For instance, the cost of storing a person’s data, if that data is compromised, may be significantly higher than the cost of not storing it. In my years of research and predictions involving people, I’ve rarely used more than 2,000–5,000 data points to make my predictions and they’ve been extremely accurate. None of these research points involved storing people’s private information.

An example use case of the above technique using capped collections would be if we had a search engine where we stored the last 30 searches of a user. The user could download their last 30 searches and continue to save these, but our collection would only hold the last 30 searches. In this case, we wouldn’t store the full history (this could get expensive on the storage side), but we could still have a feature that allows the user to retain this if the user wanted. This provides the user with the freedom to retain or forget their historical searches.
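The search example can be sketched in memory as follows. One caveat worth stating: a capped collection’s max applies to the collection as a whole rather than per user, so this sketch assumes the history is tracked separately for each user. The names SearchHistory, record, and exportFor are hypothetical, and the Map stands in for whatever storage we would actually use.

```javascript
// Keeps only the most recent `limit` searches per user (in-memory sketch).
class SearchHistory {
  constructor(limit = 30) {
    this.limit = limit;
    this.byUser = new Map(); // userId -> array of searches, oldest first
  }

  record(userId, query) {
    const searches = this.byUser.get(userId) || [];
    searches.push({ query: query, at: new Date().toISOString() });
    // Keep only the newest `limit` entries; older ones are forgotten.
    this.byUser.set(userId, searches.slice(-this.limit));
  }

  // Lets the user download what we currently hold for them.
  exportFor(userId) {
    return JSON.stringify(this.byUser.get(userId) || []);
  }
}

const searchHistory = new SearchHistory(30);
for (let i = 1; i <= 35; i++) {
  searchHistory.record("user-1", "search " + i);
}

const exported = JSON.parse(searchHistory.exportFor("user-1"));
console.log(exported.length, exported[0].query); // prints: 30 search 6
```

The user can save each export on their side, while our side never holds more than the last 30 searches.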

We should consider one security point here with capped collections. In general, a capped collection offers stronger data protection, because at any given moment a hacker can only get access to the data currently held. If our capped collection, as an example, only stored a person’s healthcare data for the current month, only that period would be leaked in a compromise. By contrast, if we stored the person’s full historic data, a hack that compromised the data would give the hacker the full history.

However, if a hacker remains in our system for a period of time and monitors our data in order to extract it later, they may be able to get access to more of our data. This is a challenge for the hacker as well: because older data are no longer present in our system, they have to find a way to collect and store the data outside our system. By contrast, in a database where we’re storing the full history, the attacker can devise a plan to extract the data later. Still, we should be extremely careful about assuming that a capped collection will limit a hacker’s ability to leak data across the board. It does mean hackers will have to use different techniques, but it isn’t 100% secure. We can use some technical features to possibly disrupt attackers here, which makes this a huge challenge for them. MongoDB’s feature does offer some significant security advantages over recording historic data for some use cases.

Privacy and security are two use cases of capped collections and data expiration, but there are numerous other use cases. We may want to use capped collections for daily documents, where we only want the last day’s or last week’s document. Another use case is data restriction in general: we may have a limited environment and want to enforce that by forcing some data to become irrelevant in time. To the earlier point about data and research — historic data is not always important for us to retain, so provided that we know our use case, allowing data to expire may be a useful feature we want to consider.
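For the time-based cases, such as keeping only the last week of documents, a capped collection limits by size and count rather than by age. One MongoDB feature for age-based removal is a TTL index, created with the expireAfterSeconds option on a field that holds a date. Whether this matches the expiration technique shown in the video is an assumption on my part. MongoDB’s documentation has also stated that TTL indexes cannot be created on capped collections, so the sketch below assumes a regular collection. The helper name, the “DailyDocs” collection, and the createdAt field are hypothetical.

```javascript
// Creates a TTL index so documents are removed some time after the
// date stored in `dateField`. `db` is the shell's database handle.
function expireDocumentsAfter(db, collectionName, dateField, seconds) {
  const keys = {};
  keys[dateField] = 1; // single-field ascending index on the date field
  return db
    .getCollection(collectionName)
    .createIndex(keys, { expireAfterSeconds: seconds });
}

const ONE_WEEK_IN_SECONDS = 7 * 24 * 60 * 60;

// In the mongo shell, on a hypothetical regular (non-capped) collection:
//   expireDocumentsAfter(db, "DailyDocs", "createdAt", ONE_WEEK_IN_SECONDS)
//   db.DailyDocs.insert({ createdAt: new Date(), body: "today's document" })
```

Removal is done by a background task that runs periodically, so expired documents may linger briefly past their expiry time.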

As shown in the linked video, the best part is that all of this is automatic. When we design it well, we don’t have to maintain the limits ourselves.
