When it comes to privacy, best intentions often fall short. Today, we’re going to look at how companies like Apple and Google are adapting privacy techniques, and how this fits into the bigger puzzle of trust and responsibility in artificial intelligence. Along the way, we’ll take a closer look at a technical approach called differential privacy, which is reshaping how companies analyze and share user data.
Anonymization is harder than it looks
Many privacy programs focus on personally identifiable information (PII), but over the last few years it has become clear that removing PII from a dataset isn’t enough to protect privacy. Even more rigorous attempts to conceal PII, such as the cryptographic hash functions Facebook uses to protect user IDs in its Custom Audience advertising products, have their limitations: they attend to the security of the data without attending to its underlying statistical features. In an age increasingly dominated by machine learning, this limitation can have serious repercussions.
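To see why, consider a toy sketch (the names, data, and hashing scheme below are illustrative, not Facebook’s actual system). Hashing an identifier hides the raw value, but the hash is deterministic, so records about the same person still line up across datasets, and every statistical pattern the data contains survives intact:

```python
import hashlib

def pseudonymize(email):
    """Hashing hides the raw identifier, but the output is deterministic."""
    return hashlib.sha256(email.lower().encode()).hexdigest()

# Two hypothetical "anonymized" datasets released by different teams.
ad_clicks = {pseudonymize("alice@example.com"): ["ad_42", "ad_17"]}
purchases = {pseudonymize("alice@example.com"): ["insulin", "test strips"]}

# Because the hash is stable, the records still join perfectly --
# the statistical structure (and the linkage risk) is untouched.
for user_hash in ad_clicks.keys() & purchases.keys():
    print(user_hash[:12], ad_clicks[user_hash], purchases[user_hash])
```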
There’s a famous anecdote that shows how best intentions in data practices end up deeply—even comically—inadequate. In 1996, William Weld authorized the release of Massachusetts state employee medical records for research. Explicit identifiers (name, address, phone number, etc.) were removed, but zip code and date of birth remained. Within a few days of the data being released, Latanya Sweeney re-identified the governor’s own records and sent him a copy.
The recent Cambridge Analytica fallout has brought privacy concerns into the spotlight, and it has underscored how much education people still need about how their data is used. A recent survey, for instance, found that roughly two-thirds of Twitter users were unaware that tweets are sometimes incorporated into research studies.
But the bigger question still looms: how do companies ensure that any research done internally or externally on their users’ data remains truly, incontrovertibly anonymized?
Differential privacy and the quantification of risk
Differential privacy is a relatively new technique for protecting privacy while still harnessing the power of machine learning. The core idea is to add carefully calibrated noise when querying a dataset, so that most of the dataset’s statistical features are preserved while the risk of identifying any individual’s data is strictly limited. Unlike standard de-identification techniques, it rests on a mathematical guarantee: a parameter, usually called epsilon, bounds how much the inclusion or exclusion of any single person’s record can change the result of an analysis. In ideal implementations, that bound keeps the risk close to zero, guaranteeing that whether or not an individual’s data is included in a dataset will have virtually no adverse effect on them from an informational standpoint.
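To make the noise idea concrete: the textbook way to answer a numeric query under a privacy budget epsilon is the Laplace mechanism. Compute the true answer, then add noise scaled to how much one person’s record can change it, so that for any two datasets differing in a single record, the probability of any given output changes by at most a factor of e^epsilon. The sketch below is an illustrative toy with made-up data and parameters, not any company’s implementation:

```python
import numpy as np

def laplace_count(records, predicate, epsilon):
    """Answer a counting query with the Laplace mechanism.

    A counting query has sensitivity 1 (one person can change the true
    count by at most 1), so Laplace noise with scale 1/epsilon yields an
    epsilon-differentially private answer.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical data: how many patients are over 65?
ages = [34, 71, 68, 45, 80, 52, 67]
print(laplace_count(ages, lambda age: age > 65, epsilon=0.5))
```

Smaller values of epsilon mean more noise and stronger privacy; larger values mean more accurate answers and weaker privacy.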
Until recently, differential privacy was used in practice only at large companies such as Apple, Google, and Uber. Apple incorporates differential privacy into its user analytics, enabling improvements across various products. Google applies differential privacy locally on Android devices and pairs it with federated learning, so it can learn from its users in aggregate without collecting their raw data. Uber has also invested engineering resources in more rigorous privacy algorithms and released the resulting code as open source.
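A simple way to build intuition for this “local” flavor, where randomization happens on the device before anything is collected, is classic randomized response. The sketch below is a toy illustration of that idea, not Apple’s or Google’s production pipeline, and the probabilities and attribute are invented:

```python
import random

def randomized_response(truth, p_honest=0.75):
    """Report the true bit with probability p_honest, a coin flip otherwise.

    No single report can be trusted, so no individual answer is revealing.
    """
    if random.random() < p_honest:
        return truth
    return random.random() < 0.5

def estimate_rate(reports, p_honest=0.75):
    """Invert the known noise to recover the population-level rate."""
    observed = sum(reports) / len(reports)
    # observed = p_honest * true_rate + (1 - p_honest) * 0.5
    return (observed - (1 - p_honest) * 0.5) / p_honest

# Toy simulation: 10,000 users, 30% of whom have some sensitive attribute.
true_bits = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(bit) for bit in true_bits]
print(estimate_rate(reports))  # close to 0.3, without trusting any one report
```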
Differential privacy will really hit its stride as it becomes easier for startups to apply. Georgian Partners’ new Epsilon product, cleverly named after the parameter that bounds privacy loss in a differential privacy guarantee, suggests this is on the horizon. Bluecore is using the technique to let companies benefit from shared data without compromising competitive insights, with the goal of helping to solve the infamous cold start problem in recommendation engines.
At the same time, research in differential privacy continues to advance. Uber has focused on ensuring the privacy of SQL database queries with Elastic Sensitivity. Bolt-on differential privacy is designed for stochastic gradient descent (SGD), a standard optimization technique for training machine learning models. Two issues complicate making SGD differentially private: reduced model accuracy (from the added noise) and reduced runtime efficiency. The bolt-on approach sidesteps both by perturbing only the output of the training process (the learned model parameters), essentially treating training itself as a black box. The result is an SGD-ready privacy algorithm that both maintains accuracy and keeps runtime reasonably low.
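Here is a deliberately simplified sketch of that output-perturbation idea, not the published bolt-on algorithm itself: train with ordinary SGD, then add noise to the finished weights. In the actual bolt-on analysis, the noise scale is derived from the privacy budget and properties of the loss function (its Lipschitz constant and strong convexity); in this sketch, `noise_scale` is just a placeholder, and the data is synthetic.

```python
import numpy as np

def sgd_train(X, y, lr=0.01, epochs=50):
    """Plain (non-private) SGD for a linear model -- treated as a black box."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi  # gradient of squared error for one example
            w -= lr * grad
    return w

def bolt_on_private_weights(w, noise_scale):
    """Output perturbation: noise is added to the finished model, not during training.

    noise_scale is a placeholder; the published analysis calibrates it to the
    privacy budget and the loss function's properties, and the noise
    distribution here is likewise only illustrative.
    """
    return w + np.random.laplace(scale=noise_scale, size=w.shape)

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w_private = bolt_on_private_weights(sgd_train(X, y), noise_scale=0.1)
print(w_private)
```

Because training is untouched, accuracy and runtime stay close to the non-private baseline; the privacy cost is paid once, at the end.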
Differential privacy will soon be a privacy best practice. Still, it’s only one piece of a larger puzzle. If you’re ensuring user data remains private but failing to treat users fairly and consistently, you haven’t really done much to build trust. As machine learning models become increasingly ubiquitous, so do the threats of bias, misallocation, and misunderstanding.