Uploading Files from FTP Server to Databricks Unity Catalog

Shaun De Ponte
February 12, 2024

In today's blog post, we demystify the process of uploading files from an FTP server to a Databricks external location with Unity Catalog enabled. The task might seem straightforward, but with Unity Catalog enabled, the standard DBFS file store won't cut it. Let's delve into the intricacies and walk through a step-by-step guide to achieve this seamlessly.

Create an External Location

When you store your data outside of the Databricks workspace or cluster environment, it's referred to as an external location. Databricks is frequently used in conjunction with many storage options, including on-premises storage systems and cloud storage services (such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage).

These external locations can be read from and written to when working with data in Databricks. For instance, you may use Databricks notebooks or tasks to access and process a sizeable dataset that is kept in an Azure Storage Account.

  • First, create a container in your storage account. In our case it's called "ftp-files".
  • Next, create an external location in Databricks: give it a name, select the required storage credential, and enter the URL of the container in the storage account.
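If you prefer to script this step, the same external location can be created in SQL. The credential name and storage account below are placeholders for illustration; substitute your own:

```sql
-- Hypothetical names: replace the credential and storage account with your own
CREATE EXTERNAL LOCATION IF NOT EXISTS ftp_files_location
URL 'abfss://ftp-files@mystorageaccount.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL my_storage_credential)
COMMENT 'Landing area for files pulled from the FTP server';
```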

Create a Managed Catalog

The second step is to create a catalog and schema within Unity Catalog. Picture them as the foundation on which our file upload process will stand. Take, for example, a scenario where we have a catalog named "bronze_dev" with a corresponding external location. Inside it, we then create a managed schema called "ftp-files".
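The same structure can be sketched in SQL; the names below match the example used throughout this post:

```sql
-- Catalog plus a schema to hold the volume (backticks escape the hyphen)
CREATE CATALOG IF NOT EXISTS bronze_dev;
CREATE SCHEMA IF NOT EXISTS bronze_dev.`ftp-files`;
```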


Navigating External Volumes

Once our managed catalog is in place, the next task is to create an external volume. This step adds the necessary layer for uploading files from the FTP server to the Unity Catalog. The file path we'll be working with is shaped by this external volume path.

  • When creating the external volume, select the external location created in the previous step from the "External Location" drop-down.
  • Once the external volume has been created, we can view its details. Notice that the volume type is "External".
  • In the Catalog browser, the external volume shows the full file path to the storage location assigned when the volume was created. You can also upload files directly to the volume by clicking the button at the top right.
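For completeness, the external volume can also be created in SQL. The storage path below is a placeholder and must sit under the external location's URL:

```sql
-- Placeholder path: must fall under the external location created earlier
CREATE EXTERNAL VOLUME bronze_dev.`ftp-files`.ftp_files
LOCATION 'abfss://ftp-files@mystorageaccount.dfs.core.windows.net/ftp_files';
```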

Crafting the File Path and Upload Mechanism

Now that our catalog groundwork is complete, the focus shifts to uploading files to the defined location. The process begins by creating a new notebook dedicated to handling the FTP external volume.

In the code snippet below, we use secret-scope credentials to retrieve the IP address, password, port, and username. We then specify the FTP server's directory and the local path, which mirrors the volume path in Unity Catalog.

# Import the FTP module from the standard ftplib library
import os
from ftplib import FTP

# Retrieve FTP credentials from Key Vault using scoped credentials
ip_address = dbutils.secrets.get(scope='dev', key='ip-address')
password = dbutils.secrets.get(scope='dev', key='password')
port = int(dbutils.secrets.get(scope='dev', key='port'))
username = dbutils.secrets.get(scope='dev', key='username')

# Set the source directory on the FTP server
ftp_path = "/Path/To/FTP"

# Set the local destination: the Unity Catalog external volume path
local_path = "/Volumes/bronze_dev/ftp-files/ftp_files"

# Create an FTP object and connect using the provided credentials
ftp = FTP()
ftp.connect(ip_address, port)
ftp.login(username, password)
print("connected")

# Change the working directory on the FTP server
ftp.cwd(ftp_path)

# List the files in the current directory on the FTP server
files_list = ftp.nlst()

# Iterate through each file in the FTP server directory
for file in files_list:
    # Some servers return full paths from NLST, so keep only the file name
    file_name = os.path.basename(file)
    print('Downloading file from remote server: ' + file_name)

    # Open a local file on the external volume for writing in binary mode
    with open(os.path.join(local_path, file_name), "wb") as local_file:
        # Download the file from the FTP server and write it to the local file
        ftp.retrbinary("RETR " + file_name, local_file.write)

# Close the FTP connection
ftp.quit()

With the setup in place, the code instantiates the FTP object, connects to the FTP server, logs in, and confirms the connection. The pivotal moment arrives as the code iterates through the files in the specified remote directory, downloading each one to the volume path within the Unity Catalog.
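The same logic can be packaged as a reusable function. This is a minimal sketch under two assumptions: `NLST` may return full remote paths on some servers (hence the `posixpath.basename` call), and entries that turn out to be directories raise a permanent FTP error, which we skip. The function names here are our own, not part of `ftplib`:

```python
import os
import posixpath
from ftplib import FTP, error_perm

def remote_to_local(remote_name, local_dir):
    """Map an FTP listing entry (possibly a full remote path) to a local file path."""
    return os.path.join(local_dir, posixpath.basename(remote_name))

def download_ftp_dir(ftp, remote_dir, local_dir):
    """Download every file in remote_dir into local_dir, returning the local paths."""
    ftp.cwd(remote_dir)
    os.makedirs(local_dir, exist_ok=True)
    downloaded = []
    for name in ftp.nlst():
        target = remote_to_local(name, local_dir)
        try:
            with open(target, "wb") as fh:
                ftp.retrbinary("RETR " + posixpath.basename(name), fh.write)
            downloaded.append(target)
        except error_perm:
            # Entry was a directory or unreadable; discard the empty local file
            os.remove(target)
    return downloaded
```

Wrapping the transfer this way also makes it easy to call `ftp.quit()` in a `finally` block from the caller, so the connection is closed even if a download fails midway.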

Conclusion

In conclusion, moving files from an FTP server into the Databricks Unity Catalog involves creating an external location, a catalog and schema, an external volume, and a simple download script. This process ensures that your files seamlessly find their place in the Unity Catalog, ready for further analysis and use.
