AI / Machine Learning
-
February 3, 2023

A Breakdown of Remote Data Science - a New Concept by OpenMined

As a data scientist, I was recently selected for the Padawan Program at OpenMined, which focuses on Privacy-Enhancing Technologies (PETs). The program consists of one-to-one classes with a mentor who provides training on the framework and a breakdown of how it works. In this blog post, I will share in detail what I learned from the Padawan Program. 

What is OpenMined?

OpenMined is an Open Source community focused on researching, developing, and promoting tools for secure and privacy-preserving computation. This community introduced the concept of Remote Data Science, which allows people and organizations to share information without compromising privacy. 

To provide some context: positive outcomes, such as collaboration among researchers, companies, and public institutions, sometimes cannot happen because certain information can't be shared. On the other hand, there can be negative outcomes when too much information is shared. This is where OpenMined can help.

This is because OpenMined seeks to create tools that let people and organizations share information and machine learning models without putting privacy at risk. This can be a big benefit to government organizations and medical institutions where privacy is of the utmost importance. 

OK, so why is this useful?

Good question; here's one of the main reasons. Data anonymization and de-identification are techniques applied to data to protect the privacy of personal information. The assumption is that if your ID, name, or gender are removed from the data, an attacker cannot work out who you are. But that is not true! Data re-identification is possible. Let's take a look at how.
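To make this concrete, here is a toy linkage attack in Python. The records and names below are invented for illustration, but the technique mirrors the well-known approach of re-identifying "anonymized" medical data by joining it with public data (such as a voter roll) on shared quasi-identifiers:

```python
# Toy linkage attack: re-identifying "anonymized" records by joining
# quasi-identifiers (zip code, birth year, gender) with public data.
# All records here are made up for illustration.

anonymized_medical = [
    {"zip": "02139", "birth_year": 1954, "gender": "F", "diagnosis": "diabetes"},
    {"zip": "90210", "birth_year": 1988, "gender": "M", "diagnosis": "asthma"},
]

public_voter_roll = [
    {"name": "Alice Smith", "zip": "02139", "birth_year": 1954, "gender": "F"},
    {"name": "Bob Jones", "zip": "90210", "birth_year": 1988, "gender": "M"},
]

def reidentify(medical, voters):
    """Match records on the shared quasi-identifiers."""
    matches = []
    for m in medical:
        for v in voters:
            if (m["zip"], m["birth_year"], m["gender"]) == (
                v["zip"], v["birth_year"], v["gender"]
            ):
                matches.append((v["name"], m["diagnosis"]))
    return matches

# Each tuple links a name back to a supposedly anonymous diagnosis.
matches = reidentify(anonymized_medical, public_voter_roll)
```

Even though the medical records contain no names, the combination of zip code, birth year, and gender is often unique enough to pin down an individual.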

Privacy-Enhancing Technologies (PETs)

PETs are technologies that embody fundamental data protection principles by minimizing the use of personal data. Some privacy-enhancing technologies are:

  • Homomorphic encryption: allows computation over encrypted data without first decrypting it.
  • Zero-knowledge proof: the prover can prove to another party that a statement is true without conveying any information beyond the fact that the statement is true.
  • Secure multi-party computation: multiple parties jointly compute a function over their inputs while keeping those inputs private.
  • Differential privacy: publicly sharing aggregate information about a dataset without revealing whether any individual's data is part of it.
  • Federated learning: training machine learning models across distributed nodes, where each node holds a private dataset.
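As a sketch of one of these techniques, here is a minimal, illustrative implementation of differential privacy's Laplace mechanism. A counting query has sensitivity 1 (adding or removing one person changes the count by at most 1), so adding Laplace noise with scale 1/epsilon yields epsilon-DP. This is a teaching toy, not production code:

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale): the difference of two iid
    exponential samples with mean `scale` is Laplace-distributed."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon):
    """Epsilon-differentially-private count.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon satisfies epsilon-DP."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 29, 51, 47, 62, 38]
noisy = private_count(ages, lambda a: a > 40, epsilon=0.5)
```

The noisy answer is close to the true count (3 here) on average, but no single person's presence or absence can be confidently inferred from it.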

What is Syft, and how is it used here?

Syft is an open-source framework developed by OpenMined that provides secure and private data science in Python. The objective is to apply techniques like Federated Learning, Differential Privacy, and Encrypted Computation over the data, integrated with deep learning frameworks. 

Syft vs. Grid 

Syft - mimics popular data science tools for remote data access, using pointers to data that cannot be copied without special permission.

Grid - the server component that runs containers around the data that you want to protect and safely utilize.
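The pointer idea can be illustrated with a toy sketch. To be clear, this is not Syft's actual API — the classes below are invented to show the underlying pattern: the client holds only a reference to remote data, may request computations on it, but cannot download raw values without the data owner's permission.

```python
class Pointer:
    """Toy stand-in for Syft's pointer concept: a reference to remote
    data. You can request computations, but you cannot read the raw
    values without permission from the data owner."""

    def __init__(self, server, obj_id):
        self.server = server
        self.obj_id = obj_id

    def sum(self):
        # Computation happens server-side; only the result travels back.
        return self.server.compute(self.obj_id, sum)

    def get(self):
        # Downloading raw data requires explicit permission.
        return self.server.download(self.obj_id)


class ToyDomainServer:
    """Hypothetical server that holds private data and enforces access rules."""

    def __init__(self):
        self._store = {}
        self._allowed = set()

    def upload(self, obj_id, data, public=False):
        self._store[obj_id] = data
        if public:
            self._allowed.add(obj_id)
        return Pointer(self, obj_id)

    def compute(self, obj_id, fn):
        return fn(self._store[obj_id])

    def download(self, obj_id):
        if obj_id not in self._allowed:
            raise PermissionError("data owner has not approved this download")
        return self._store[obj_id]


server = ToyDomainServer()
ptr = server.upload("ages", [34, 29, 51], public=False)
total = ptr.sum()   # allowed: remote computation, only the result returns
# ptr.get()         # would raise PermissionError: raw data stays private
```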


[Image taken from Padawan course]

How Differential Privacy works in Syft

Several roles are considered in Syft:

  1. Data Subjects: the people represented in the data, whose privacy we want to protect.
  2. Data Owners: the person who uploads the data.
  3. Data Scientists: the person who accesses the data.

With Differential Privacy (DP), we want to protect Data Subjects. How? By controlling the output privacy of the computations done over the data. 

Syft provides Differential Privacy on 3 levels:

  1. Adversarial Differential Privacy: 3 different roles are involved: Data Subjects, Data Owners, and Data Scientists. The Data Scientist can only query the data if they have enough privacy budget. 
  2. Individual Differential Privacy: the algorithm's output does not depend on the value of any single individual. The privacy budget is tracked at the level of the individual Data Subject. 
  3. Automatic Differential Privacy: with enough privacy budget, you can get results without blocking or waiting for the Data Owner's approval. 
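The privacy-budget bookkeeping behind this can be sketched as follows. This is hypothetical code, not Syft's implementation: each query spends some epsilon, and once the budget is exhausted, further queries are rejected rather than auto-approved.

```python
class PrivacyBudget:
    """Toy epsilon-budget ledger for one Data Scientist (illustrative
    only, not Syft's actual code). Queries that fit within the
    remaining budget are auto-approved; the rest are rejected."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise PermissionError("privacy budget exhausted")
        self.remaining -= epsilon
        return self.remaining


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)   # first query is auto-approved
budget.spend(0.4)   # second query still fits in the budget
# budget.spend(0.4) would now raise PermissionError
```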

You can take a deeper look at this notebook that explains in detail how Syft handles Differential Privacy in a way that is easy for the user. 

How is the Syft architecture built?

Several components make up this architecture and together guarantee the framework's main objectives. 

Where it runs:

On any supported container host such as Docker or Kubernetes.

Domain:

A domain is the collection of containers and code that encapsulates private data and exposes APIs that let a client perform Remote Data Science operations. A Grid server runs inside each domain node. 

Network:

The network lets different domains communicate with each other; it operates at the VPN level. 

Client:

The client accesses the domain with certain privileges. The client-side code runs on the user's computer, in a Python interpreter, while the code running inside the domain (server side) runs inside the container. 

Messages:

A container called proxy runs a system called Traefik, where the HTTP routes are defined. The message protocol is based on RPC, and messages can be sent synchronously (blocking) or asynchronously, depending on what we want to do.

For instance, most actions are async, sent to a streaming endpoint (a RabbitMQ queue), and consumed by a Celery worker.  
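The async path can be sketched with Python's standard library, using a `queue.Queue` as a stand-in for the RabbitMQ streaming endpoint and a worker thread as a stand-in for the Celery worker. This illustrates the pattern only; it is not Syft's code:

```python
import queue
import threading

# Stand-in for the async path: a producer drops messages on a queue
# (RabbitMQ in Syft's case) and a background worker (Celery in Syft's
# case) consumes them while the client carries on.

task_queue = queue.Queue()
results = {}

def worker():
    while True:
        msg = task_queue.get()
        if msg is None:          # sentinel: shut the worker down
            break
        msg_id, payload = msg
        results[msg_id] = payload.upper()   # pretend "processing"
        task_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# The client returns immediately after enqueueing (non-blocking).
task_queue.put(("msg-1", "train model"))
task_queue.put(("msg-2", "fetch result"))

task_queue.join()   # block only at the point where results are needed
task_queue.put(None)
t.join()
```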

Storage:

Temporary objects are stored in Redis, and persistent objects are stored in Postgres.

Since every object in Syft has a UUID, most of the data is stored in a key/value store, with the UUID as the key. This database is being migrated to MongoDB, a document-oriented database. 

When the data is transferred between systems, it needs to be serialized in a blob (Binary Large Object).
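A toy sketch of a UUID-keyed blob store looks like this. It is illustrative only: `pickle` is used here for simplicity, whereas Syft uses its own serialization format.

```python
import pickle
import uuid

# Toy blob store keyed by UUID, mirroring how objects are addressed.
store = {}

def put(obj):
    """Serialize an object to a blob and store it under a fresh UUID."""
    key = uuid.uuid4()
    store[key] = pickle.dumps(obj)   # the blob that travels between systems
    return key

def get(key):
    """Fetch the blob for a given UUID and deserialize it."""
    return pickle.loads(store[key])

key = put({"dataset": "ages", "values": [34, 29, 51]})
obj = get(key)   # round-trips back to the original object
```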

Backend:

The backend and backend_stream are in charge of receiving the synchronous and asynchronous messages, respectively. 

Frontend:

The frontend operates as a UI to interact with the domain and manages users and datasets. 

[Image from Padawan course]

How Do I Launch it? 

To launch, follow these instructions:

  1. Install Syft with pip (the latest version at the moment):
    pip install syft==0.7.0b57

  2. Install HAGrid:
    pip install hagrid

  3. Use HAGrid with docker-compose:
    hagrid launch xyz --tag=latest  

And that's it; you should be up and running.

What to take away from OpenMined

There is still much to be done within PySyft before it can be applied to real case studies. Even so, the tool promises huge advances in data science, enabling data sharing among organizations while preserving privacy guarantees. 

I am very excited to see how this tool evolves in the future and helps relevant industries such as healthcare to build more accurate models to accelerate illness predictions. Would you like to take the course? Applications are open!

Want to get involved?

OpenMined is an active community that collaborates to improve the project and help users understand how it works. If you are interested and would like to know more, check out the OpenMined Slack channel!