VNet Injection With Azure Databricks

SQLInSix Minutes
6 min read · Apr 6, 2024

One challenge with division of labor arises when two teams must design a technical architecture together and neither team fully controls or understands all the moving parts. In this post, we’ll look at an example of this with Azure Databricks and private networking built on Azure Virtual Networks. We assume that readers are using the premium tier of Azure Databricks and that the goal is for no part of these workspaces to be connected to the public internet. This means that all the moving parts of Azure Databricks are private: the cluster, the workspace, the key vault and the storage accounts.

Quick thanks to my colleague Callen Smerker for the back-and-forth on this topic this past week (more at end).

Note: we’ll use a contrived example in this post with an address space that starts at 192.168.0.0; this is for example purposes only.

Networking and Security Note

For the context of this post, we assume that the networking configuration involves Azure only. The complexity can increase if we have an on-premises network that we must communicate with through a VPN gateway or ExpressRoute. Likewise, in this post we are not looking at specific Network Security Group configurations; if our environment has these requirements, then we would need to add them to our VNET injection configuration.

Another related point here is RBAC permissions or Unity Catalog setup for allowing access to data. This is not covered in this post. Depending on how we architect our environment, we would use one of these or a combination of both for access and governance.

Terms

  • VNET: Azure virtual network environment that is used for running VMs, nodes and (or) other application processes in Azure.
  • Subnet: a smaller network within a network or a grouped network within a network; in the context of Azure, a subnet would be network space that is defined within a VNET.
  • VNET Injection: the process of deploying an Azure Databricks instance to an existing Azure virtual network.
  • Secure Cluster Connectivity: no open ports on the virtual network and Databricks cluster nodes with no public IP addresses.
  • Private endpoint: receiving device that uses a private IP address from a private network; in the context of Azure, this would be a receiving endpoint that uses a private IP address from our subnet within our VNET.
  • Service endpoint: in the context of Azure, service endpoints differ from private endpoints in that the Azure service keeps its publicly routable IP address, and traffic from the subnet reaches it over the Azure backbone network (Ref).
  • Network security groups: configurations that control network traffic within Azure virtual networks.
  • ExpressRoute: an Azure service that allows users to connect their on-premises infrastructure to Microsoft infrastructure over a private connection. As a reminder, this setup is excluded from this post, but it is worth mentioning here as some configurations with Azure Databricks will involve on-premises communication.
  • VPN Gateway: an Azure service that provides an on-premises to Azure connection through a site-to-site VPN. As a reminder, this setup is also excluded, but it is worth mentioning for the same reason.

Setting Up A Secure Cluster

As a quick reminder, this setup will have no public IP connectivity and is designed as a secure solution where all the Azure Databricks pieces can communicate with each other.

  1. Create an Azure VNET, if a virtual network does not already exist
  • When defining the VNET’s address space, make sure to input the correct range (ie: 192.168.0.0/16 in our contrived example, which has 65,536 addresses)
  • Within this step, create as many subnets as you need, making sure that each one falls within the address space entered above and that none of them overlap. At a minimum, 2 subnets are needed: 1 public subnet as the host and 1 private subnet as the container. We’ll create 2 more subnets: 1 for applications and 1 for private endpoints.

For our contrived example:

  • Our first subnet is 192.168.12.0/22 and is the public subnet. A /22 has 1,024 addresses; Azure reserves 5 of them, which leaves 1,019 usable addresses. This subnet would need the Service Endpoint of Microsoft.Sql added.
  • Our next subnet is the private one, 192.168.8.0/22. It also has 1,019 usable addresses plus the 5 that Azure reserves. This subnet would also need the Service Endpoint of Microsoft.Sql added.
  • Next we’d have one subnet for apps and one for private endpoints: 192.168.0.0/22 and 192.168.4.0/22, respectively. For the latter subnet, we would want to add 2 Service Endpoints: Microsoft.Storage and Microsoft.KeyVault.
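Since the subnet math is easy to get wrong when two teams are passing ranges back and forth, here is a minimal sketch that checks the contrived plan with Python’s standard `ipaddress` module. The subnet labels and the `check_plan` and `usable_addresses` helpers are names made up for this illustration; only the CIDR ranges come from the example above.

```python
import ipaddress

# Address space and subnets from the contrived example in this post.
VNET = ipaddress.ip_network("192.168.0.0/16")
SUBNETS = {
    "public (host)": ipaddress.ip_network("192.168.12.0/22"),
    "private (container)": ipaddress.ip_network("192.168.8.0/22"),
    "apps": ipaddress.ip_network("192.168.0.0/22"),
    "private endpoints": ipaddress.ip_network("192.168.4.0/22"),
}
AZURE_RESERVED = 5  # Azure reserves 5 addresses in every subnet


def usable_addresses(subnet):
    """Addresses left in a subnet after Azure's reserved ones."""
    return subnet.num_addresses - AZURE_RESERVED


def check_plan(vnet, subnets):
    """Return a list of problems: subnets outside the VNET or overlapping."""
    problems = []
    names = list(subnets)
    for name in names:
        if not subnets[name].subnet_of(vnet):
            problems.append(f"{name} is outside {vnet}")
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    for name, subnet in SUBNETS.items():
        print(name, subnet, usable_addresses(subnet))
    print(check_plan(VNET, SUBNETS) or "plan looks consistent")
```

An empty list from `check_plan` means every subnet sits inside the VNET and no two subnets overlap; it says nothing about whether the sizes are right for your cluster counts.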

If we have an Azure VNET already created, ensure that the VNET has at least two subnets that can be used to match the configuration described above, even if they don’t have the same number of addresses as in our example.

2. Create an Azure Databricks workspace

  • Given the security configurations required, we will only use premium tier. If you need a lower tier, review whether your use case allows it.
  • Under networking select “Yes” for “Deploy Azure Databricks workspace with Secure Cluster Connectivity (No Public IP)”
  • Under networking select “Yes” for “Deploy Azure Databricks workspace in your own Virtual Network (VNET)”
  • Once you select “Yes” to the above step, more dialog options will pop up asking us about our VNET details.
  • Enter the Virtual Network
  • Enter the Public Subnet Name
  • Enter the Public Subnet CIDR Range (this would match the subnet’s range)
  • Enter the Private Subnet Name
  • Enter the Private Subnet CIDR Range
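For teams that deploy through templates rather than the portal, the same choices can be sketched as a resource fragment. This is an illustration under assumptions, not the author’s deployment: the `workspace_fragment` helper and all IDs and names in the usage are hypothetical, and the parameter names (`customVirtualNetworkId`, `customPublicSubnetName`, `customPrivateSubnetName`, `enableNoPublicIp`) follow the ARM schema for `Microsoft.Databricks/workspaces` as I understand it, so verify them against the current template reference before use.

```python
import json


def workspace_fragment(vnet_id, public_subnet, private_subnet, managed_rg_id):
    """Build a workspace fragment for VNET injection with no public IP.

    All arguments are caller-supplied; nothing here talks to Azure.
    """
    if public_subnet == private_subnet:
        raise ValueError("public and private subnets must be different")
    return {
        "sku": {"name": "premium"},  # premium tier, as assumed in this post
        "properties": {
            "managedResourceGroupId": managed_rg_id,
            "parameters": {
                "customVirtualNetworkId": {"value": vnet_id},
                "customPublicSubnetName": {"value": public_subnet},
                "customPrivateSubnetName": {"value": private_subnet},
                # Secure Cluster Connectivity (No Public IP)
                "enableNoPublicIp": {"value": True},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical IDs and names, for illustration only.
    fragment = workspace_fragment(
        vnet_id="/subscriptions/<sub>/resourceGroups/<rg>/providers/"
                "Microsoft.Network/virtualNetworks/<vnet>",
        public_subnet="public-subnet",
        private_subnet="private-subnet",
        managed_rg_id="/subscriptions/<sub>/resourceGroups/<managed-rg>",
    )
    print(json.dumps(fragment, indent=2))
```

The template route refers to existing subnets by name, so the CIDR ranges from the portal dialog do not appear in this fragment.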

3. Connect Azure Storage To Subnet

If you already have an Azure storage account, connect it to the public subnet. If you do not have an existing storage account, then create one and ensure that the networking options below are set.

Under networking when creating the storage account (or, when updating an existing storage account, under “Security + networking”):

  • Under “Firewalls and virtual networks”, choose the option “Selected networks”
  • After selecting the above option, more dialog will pop up asking us to enter the virtual network
  • We’ll select the VNET and the appropriate subnet. As a note, if the subnet doesn’t have service endpoints enabled for Microsoft.Storage, then we’ll have to enable this. (Note that if we want to confirm we’ve enabled this, we can go to the Virtual Network, select the option “Service Endpoints” and review the result under Microsoft.Storage — we’ll see our configured item.)
  • After all this, we’ll save the configurations
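The steps above can be sketched as the `networkAcls` block a storage account would carry in a template. This is a rough sketch, assuming the ARM shape of `defaultAction` plus `virtualNetworkRules`; the helper names, the subnet ID and the list of enabled endpoints are hypothetical stand-ins for values you would read from your own VNET.

```python
def missing_service_endpoint(subnet_endpoints, required="Microsoft.Storage"):
    """True when the subnet still needs the service endpoint enabled."""
    return required not in subnet_endpoints


def storage_network_acls(subnet_ids):
    """Deny by default and allow only the listed subnets ("Selected networks")."""
    if not subnet_ids:
        raise ValueError("this sketch expects at least one subnet to allow")
    return {
        "defaultAction": "Deny",
        "virtualNetworkRules": [
            {"id": subnet_id, "action": "Allow"} for subnet_id in subnet_ids
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    subnet_id = ("/subscriptions/<sub>/resourceGroups/<rg>/providers/"
                 "Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>")
    enabled_on_subnet = ["Microsoft.Sql"]
    if missing_service_endpoint(enabled_on_subnet):
        print("enable Microsoft.Storage on the subnet first")
    print(storage_network_acls([subnet_id]))
```

Checking the service endpoint before building the rule mirrors the portal, which prompts us to enable Microsoft.Storage on the subnet if it is missing.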

4. Connecting the Azure Key Vault

Under settings, click on “Networking” and review the settings for the “Firewall and virtual networks”:

  • We want to select the option for “Private endpoint and selected networks”
  • Once we select the above, more dialog will appear asking us which Virtual Network and subnet we want the Key Vault to connect to
  • Since we already have an existing VNET at this point, select the option “Add existing Virtual Network”
  • From here, we’d select the appropriate VNET and subnet. As a note, if the subnet doesn’t have service endpoints enabled for Microsoft.KeyVault, then we’ll have to enable this. (Note that if we want to confirm we’ve enabled this, we can go to the Virtual Network, select the option “Service Endpoints” and review the result under Microsoft.KeyVault, where we’ll see our configured item.)
  • We need to whitelist the IP addresses that will access the KeyVault in the option “IPv4 address or CIDR”
  • After all this, we’ll save the configurations
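A typo in the “IPv4 address or CIDR” field is an easy way to lock ourselves out, so here is a small sketch that validates the entries before building the Key Vault’s `networkAcls`. The `key_vault_network_acls` helper, the subnet ID and the 203.0.113.x addresses (a documentation-only range) are assumptions for illustration; the `ipRules` and `virtualNetworkRules` keys follow the ARM shape for Key Vault as I understand it.

```python
import ipaddress


def key_vault_network_acls(subnet_ids, allowed_cidrs):
    """Deny by default; allow the given subnets and IPv4 addresses or ranges."""
    ip_rules = []
    for entry in allowed_cidrs:
        # Raises ValueError on malformed input or when host bits are set,
        # for example "203.0.113.10/24".
        network = ipaddress.IPv4Network(entry)
        ip_rules.append({"value": str(network)})
    return {
        "defaultAction": "Deny",
        "virtualNetworkRules": [{"id": subnet_id} for subnet_id in subnet_ids],
        "ipRules": ip_rules,
    }


if __name__ == "__main__":
    # Hypothetical subnet ID and documentation-range addresses.
    acls = key_vault_network_acls(
        subnet_ids=["/subscriptions/<sub>/resourceGroups/<rg>/providers/"
                    "Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"],
        allowed_cidrs=["203.0.113.10", "203.0.113.64/26"],
    )
    print(acls)
```

A single address is normalized to a /32, which makes it obvious in review whether an entry allows one host or a whole range.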

Summary

This post provides an overview of how VNET Injection works with Azure Databricks. Every network configuration has nuances that matter, so while this helps you understand how each of these tools communicates and connects on a VNET, the actual configuration will differ in your environment.

Another acknowledgement to my colleague Callen Smerker for reviewing the breakdown of Azure Databricks VNET injection with the subnets and the discussion. As he correctly mentioned, in environments where there are role separations (ie: data engineering and networking), we’ll sometimes see miscommunication because different roles don’t fully understand what others are doing. This can mean that data engineers might have the right design in mind, but during communication the networking team doesn’t quite have the same view. The purpose of this post is to make this clearer for everyone.

As I wrote, every environment will be different and could be much more in depth in the security architecture, so while this provides an overview, it does not dive into the depths we may see.

Further reading from Microsoft on VNET Injection.

Note: all images in the post are either created through actual code runs or from Pixabay.
