If this sounds familiar...
“I wish I could get access to that data to train our models…”
You’re not alone.
In the data-hungry world of machine learning and analytics, researchers and product teams are looking for ways to access their customers’ and partners’ data to improve their insights, only to be blocked by regulatory, trust, and technical barriers. These problems come up not only in areas like healthcare and financial services where the data used in collaborations is particularly sensitive, but also in mobile or IoT use cases where models are trained on data generated by edge devices.
The best solution in many of these use cases is federated learning.
Federated learning is a method for training machine learning models on distributed datasets.
No raw data moves within the system, which means that you can use other people’s data for machine learning and analytics that you otherwise wouldn’t be able to access.
In this article, we’ll introduce some of the basic concepts behind privacy-safe machine learning with federated learning. If you’re looking for more of a technical deep dive, we recommend checking out this paper by Yang et al.
What is Federated Learning?
Imagine you’re a world-famous conductor, and you’ve decided to bring together the top musicians from around the world to record a new symphony. The most obvious way to do the recording would be to buy plane tickets for everybody to join you at a recording studio for a few days.
This is similar to traditional machine learning and analytics, where you need to move a number of datasets (musicians) to one central database (conductor) to calculate statistics and train your model.
But what happens if the musicians aren’t able to travel to one central location? Maybe they have family commitments. Maybe they’re worried about losing their instruments on their trip. Or maybe they’re grounded by a global pandemic…
As the conductor, you come up with a different strategy. Instead of bringing the musicians together, you send each musician the score, along with some directions on how to play their part. Each musician records their part and sends it back to you, and then you mix all of the parts together so it sounds like the whole orchestra is playing the symphony together. You can even send the final mixed recording back out to the musicians a few times so that they can listen to it and re-record their parts until the symphony sounds perfect.
This is similar to how federated learning works. Sometimes, data cannot move to a central database because of privacy laws, data residency regulations, or concerns about IP risk. In federated learning…
- A data scientist provides the model training instructions to a central server, which sends those instructions to local clients at each of the individual data nodes
- The local clients train a local model using the data at that node and send the model parameters (never the raw data) back to the central server
- The central server uses the model parameters it receives from each data node to build one global model
- The central server sends the global model back to the local clients, and they continue iterating on the model until a pre-set number of training rounds is reached or the model converges
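The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production system or any particular platform's implementation: it assumes a toy linear model, a size-weighted average of client parameters (an approach commonly called federated averaging), synthetic data, and plain function calls standing in for the network. All function names are hypothetical.

```python
import random


def local_train(weights, data, lr=0.1, epochs=5):
    """Client step: a few passes of gradient descent on y ~ w*x + b,
    using only the data held at this node."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def aggregate(updates):
    """Server step: average client parameters, weighted by the number
    of examples each node trained on."""
    total = sum(n for _, n in updates)
    w = sum(params[0] * n for params, n in updates) / total
    b = sum(params[1] * n for params, n in updates) / total
    return (w, b)


def federated_train(node_datasets, rounds=50):
    """Run a fixed number of rounds. Only parameters and example counts
    cross the client/server boundary, never the (x, y) pairs."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [
            (local_train(global_weights, data), len(data))
            for data in node_datasets
        ]
        global_weights = aggregate(updates)
    return global_weights


def make_node_data(rng, n):
    """Synthetic local dataset drawn from y = 3x + 1 plus a little noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.05)) for x in xs]


rng = random.Random(0)
node_datasets = [make_node_data(rng, n) for n in (20, 50, 80)]
w, b = federated_train(node_datasets)
print(f"global model: y = {w:.2f}x + {b:.2f}")
```

The loop here stops after a fixed number of rounds. A real system might also stop early once the global parameters change by less than some tolerance between rounds.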
Throughout the federated learning process, the only information that leaves the data nodes is model parameters - never raw data. However, there is still some risk that bad actors could reidentify individual data points from the parameters themselves, so differential privacy is often applied during model training. Noise is added to each of the local models before its parameters are sent back to the central server, which reduces the risk of reidentification.
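One common way to add that noise is to clip each local update to a maximum size and then add Gaussian noise before it leaves the node. The sketch below shows only the mechanism. The function names, `clip_norm`, and `noise_multiplier` are hypothetical, and the actual privacy guarantee depends on how the noise is calibrated and accounted for across rounds, which is not shown here.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update down so its L2 norm is at most clip_norm.
    This bounds how much any single node can influence the global model."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the update, then add Gaussian noise to every parameter.
    The noise standard deviation is noise_multiplier * clip_norm
    (an illustrative choice, not a calibrated privacy budget)."""
    if rng is None:
        rng = random.Random()
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


# The client would send noisy_update to the server instead of local_update.
local_update = [3.0, 4.0]
noisy_update = privatize_update(local_update, rng=random.Random(42))
```

More noise means stronger protection but a less accurate global model, so the noise level is a trade-off that has to be tuned for each use case.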
How can I get started with federated learning?
At a basic level, you need three components to get started with federated learning:
- A centralized service to coordinate between the data nodes and monitor training progress
- A client system to perform client-side training and exchange model parameters with the central service
- A networking/protocol mechanism (there are some open-source options like Flower or FATE) to pass instructions and model parameters during model training
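To make the third component concrete, here is a rough sketch of the two kinds of messages such a protocol might carry: instructions going out to the clients and parameters coming back. The message classes, field names, and JSON envelope are all hypothetical and are not the API of Flower, FATE, or any other framework.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class TrainRequest:
    """Central service -> client: instructions for one training round."""
    round_id: int
    global_params: List[float]
    local_epochs: int


@dataclass
class TrainResponse:
    """Client -> central service: parameters only, never raw data."""
    round_id: int
    node_id: str
    params: List[float]
    num_examples: int


MESSAGE_TYPES = {cls.__name__: cls for cls in (TrainRequest, TrainResponse)}


def encode(message):
    """Serialize a message to JSON, tagged with its type name."""
    return json.dumps({"type": type(message).__name__, "body": asdict(message)})


def decode(raw):
    """Rebuild the message object from its JSON form."""
    envelope = json.loads(raw)
    return MESSAGE_TYPES[envelope["type"]](**envelope["body"])


request = TrainRequest(round_id=1, global_params=[0.5, -0.2], local_epochs=5)
wire_bytes = encode(request)          # what would travel over the network
received = decode(wire_bytes)         # what the client reconstructs
```

A real deployment would add authentication, encryption in transit, retries, and handling for nodes that drop out mid-round on top of this.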
In reality, federated learning systems are complicated, difficult to set up, and difficult to scale. That’s why we developed integrate.ai, a production-ready federated learning platform that makes privacy-enhancing technologies like federated learning and differential privacy accessible and easy to set up, so you don’t need to worry about building the infrastructure.
At integrate.ai, we’re experts in private data collaboration. If you’re interested in exploring whether federated learning is right for your use case, contact us - we’d love to chat!