Share & sell data with zero privacy risks using AI synthetic data generation

kevinyee · April 1, 2021, 2:32pm

Key Project Data

Name of project: SecondLook

Proposal Wallet Address: 0xb720883cb0e6FF8e9978c91009e3a49F09a13047

The proposal in one sentence: Share & sell data with zero privacy risk using AI synthetic data generation

Which category best describes your project? Pick one or more.
- [x] Build / improve applications or integrations to Ocean
- [x] Unleash data

Project Overview

Description of the project:

Share & sell data with zero privacy risks using AI synthetic data generation.

Problem introduction:

The world is producing more and more valuable data.

“90% of the world’s data was generated in the last two years with 2.5 quintillion bytes data being created each day.” - Forbes

But more and more of the data held by companies is being restricted from being bought, sold or shared because of increasing privacy regulation.

“By 2023, 65% of the world’s population will have its personal data covered under modern privacy regulations, up from 10% in 2020.” - Gartner

Privacy laws are increasingly being implemented and enhanced around the world:

Thailand: Personal Data Protection Act (PDPA) → May 2021

Singapore: Enhanced Personal Data Protection Act (PDPA) → Feb 2021

Brazil: Lei Geral de Proteção de Dados Pessoais (LGPD) → Sep 2020

USA: California Consumer Privacy Act (CCPA) → July 2020

India: Personal Data Protection Bill → Proposed Dec 2019

Europe: General Data Protection Regulation (GDPR) → May 2018

Ocean Protocol’s team have rightly identified that the “most valuable data is private data — using it can improve research and business outcomes. But concerns over privacy and control make it hard to access.”

Could concerns over privacy and control be a reason why the majority of data is accumulated and left trapped in companies, unanalysed and unsold?

“Nearly 97 percent of data sits unused by organizations” - Gartner

Problem 1: Even when data is being shared and sold, it has to be anonymised beforehand, which results in a loss of information and reduces its utility.

Imperva explains in detail the current methods of anonymisation:

Data masking: hides data by replacing it with altered values

Generalization: deliberately removes some of the data to make it less identifiable

Pseudonymisation: replaces private identifiers with fake identifiers or pseudonyms

Data perturbation: modifies the original dataset by rounding numbers and adding random noise

The more anonymisation techniques are applied, the greater the information loss, which reduces the data's utility for the data buyer.
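
To make the four techniques above concrete, here is a minimal Python sketch applied to a single toy record. The record, field names, salt and noise scale are all made up for illustration; this is not Imperva's or SecondLook's implementation, just one simple way each technique could look. Notice how every step throws away some detail, which is the information loss described above.

```python
import hashlib
import random

# Hypothetical toy record; the field names are illustrative only.
record = {"name": "Alice Tan", "email": "alice@example.com", "age": 34, "salary": 5230.0}

def mask(value: str, visible: int = 2) -> str:
    """Data masking: keep the first `visible` characters and hide the rest."""
    return value[:visible] + "*" * (len(value) - visible)

def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a bucket such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def pseudonymize(value: str, salt: str) -> str:
    """Pseudonymisation: replace an identifier with a salted hash token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]

def perturb(value: float, rng: random.Random, scale: float = 100.0) -> float:
    """Data perturbation: add Gaussian noise, then round to the nearest 100."""
    return float(round((value + rng.gauss(0.0, scale)) / 100.0) * 100)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
anonymised = {
    "name": mask(record["name"]),
    "email": pseudonymize(record["email"], salt="demo-salt"),
    "age": generalize_age(record["age"]),
    "salary": perturb(record["salary"], rng),
}
print(anonymised)
```

After this pass the exact age, salary and name are no longer recoverable by the data buyer, so any analysis that needs them loses accuracy.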

What if there is a method that resolves the trade-off between data privacy and utility?

“Compute-to-data resolves the tradeoff between the benefits of using private data, and the risks of exposing it. It lets the data stay on-premise, yet allows 3rd parties to run specific compute jobs on it to get useful compute results like averaging or building an AI model.” -Ocean Protocol

Problem 2: However, compute-to-data still requires trust in the algorithm to ensure data does not get exposed.

Fig 1: Ocean Protocol Technical Whitepaper - Section 3.7.7

Solution:

SecondLook is a data-as-a-service platform that allows users to generate realistic and privacy-safe synthetic data from sensitive personal data.

The synthetic data generated would have a similar structure and similar statistical properties to the original data, without the privacy compliance or data exposure risks, because it cannot be attributed back to any individual record in the original data.

We are using the same class of AI models behind deepfakes, which enabled the realistic generation of human faces that you may have seen in the media. Specifically, we are focusing on Generative Adversarial Networks (GANs), which have shown impressive improvements over previous generative methods.

The synthetic data generation model can be used with or without Compute-to-Data.

With Compute-to-Data: This adds an additional layer of privacy since the synthetic data provided is realistic but cannot be traced back to the original private data.

Fig 2: Ocean Protocol Technical Whitepaper - Section 3.7.7

Without Compute-to-Data: Instead of the same data being hosted at an “encrypted data storage URL” that is distributed to every data buyer, synthetic data generation allows the data seller to generate a separate synthetic dataset for each data buyer, which is unique and traceable.
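
As a rough illustration of the "unique and traceable dataset per buyer" idea, the sketch below derives a reproducible random seed from a buyer ID and a seller-side secret, then samples a dataset from a fitted model. Everything here is an assumption made for illustration: the function names, the buyer IDs and the secret are hypothetical, and an independent per-column Gaussian stands in for the GAN (it ignores correlations between columns and gives no formal privacy guarantee).

```python
import hashlib
import random
import statistics

def fit_column_model(rows):
    """Fit a mean and standard deviation per column (a simple stand-in for a GAN)."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def buyer_seed(buyer_id: str, secret: str) -> int:
    """Derive a reproducible seed from a buyer identifier and a seller secret."""
    digest = hashlib.sha256(f"{secret}:{buyer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def generate_for_buyer(model, buyer_id: str, secret: str, n_rows: int):
    """Sample a synthetic dataset whose randomness is tied to one buyer."""
    rng = random.Random(buyer_seed(buyer_id, secret))
    return [[rng.gauss(mu, sigma) for mu, sigma in model] for _ in range(n_rows)]

# Hypothetical original data: [age, monthly salary]
original = [[34.0, 5200.0], [41.0, 6100.0], [29.0, 4300.0], [52.0, 7800.0]]
model = fit_column_model(original)

data_a = generate_for_buyer(model, "buyer-a", "seller-secret", 5)
data_b = generate_for_buyer(model, "buyer-b", "seller-secret", 5)
```

Because the seed depends only on the buyer ID and the secret, the seller can regenerate a given buyer's dataset later and compare it against a leaked copy, which is one possible way to make each dataset traceable.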

What is the final product (e.g. App, URL, Medium, etc)?

The final product is a web application that would be available at https://www.secondlook.ai/. We have bought the domain and are working towards our first MVP.

The easy-to-use web application allows users to:

Input: Upload and preview sensitive personal data
Process: Automatically identify the data structure and select the most appropriate generative model
Output: Generate synthetic data
Metrics: Evaluate the utility & privacy of generated data
Report: One-stop compliance report
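
For the Metrics step, a minimal sketch of what a utility check and a privacy check could look like is shown below. These two metrics are our own illustrative choices, not a final design: the utility proxy compares column means between the original and synthetic data, and the privacy proxy is the distance from each synthetic row to its closest original row (a distance of 0 means a real record was copied verbatim). The data values are made up, and the sketch assumes numeric columns with non-zero means.

```python
import math
import statistics

def column_mean_gap(original, synthetic):
    """Utility proxy: gap between column means, relative to the original mean."""
    gaps = []
    for orig_col, synth_col in zip(zip(*original), zip(*synthetic)):
        orig_mean = statistics.mean(orig_col)
        gaps.append(abs(statistics.mean(synth_col) - orig_mean) / abs(orig_mean))
    return gaps

def distance_to_closest_record(original, synthetic):
    """Privacy proxy: for each synthetic row, the distance to its nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]

# Hypothetical data: [age, monthly salary]
original = [[34.0, 5200.0], [41.0, 6100.0], [29.0, 4300.0]]
synthetic = [[33.0, 5100.0], [40.0, 6200.0], [29.0, 4300.0]]

gaps = column_mean_gap(original, synthetic)
dcr = distance_to_closest_record(original, synthetic)
copied_rows = sum(1 for d in dcr if d == 0.0)  # exact copies of real records

print("relative mean gap per column:", gaps)
print("synthetic rows identical to an original row:", copied_rows)
```

In this toy example the last synthetic row is an exact copy of an original record, so the privacy check flags it even though the utility numbers look good. A real evaluation pipeline would need richer metrics on both sides.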

Fig 3: Web application user flow preview: Anonymisation vs Synthesis

How does this project drive value to the Ocean ecosystem? This is best expressed as Expected ROI, details here.

We drive value to the Ocean ecosystem by removing one of the biggest sources of friction in sharing and selling private data: the privacy risk. Data owners need to be assured that the data they publish is privacy compliant and risk-free.

Our synthetic data generation model would allow data owners to convert their sensitive data into realistic, privacy-preserving synthetic data that has both high utility and no privacy risk. This is beneficial for both data owners and data buyers.

If Gartner’s estimate that “nearly 97 percent of data sits unused by organizations” is accurate, only 3 percent of data is being utilised, shared or sold now.

We think our solution would encourage more companies to share their untapped private data and capture a fraction of the unused 97% of data. This would increase the Total Value Locked (TVL) from the total OCEAN staked in data token pools. The demand for staking OCEAN drives demand for OCEAN and therefore drives the value of $OCEAN.

Conservatively, we expect the chain effect to increase the value of $OCEAN by 1%. At a total market cap of ~$600m, we would create a value of $6m.

Bang = USD 6m
Buck = Grant size = 10K OCEAN = USD 15k
Estimated % chance of success = 0.75
Expected ROI = Bang / Buck * Estimated % chance of success = USD 6000k / USD 15k * 0.75 = 400 * 0.75 = 300

We expect an ROI of 300, which is greater than 1.
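
The arithmetic above can be reproduced with a few lines of Python. The only number not stated directly in the proposal is the OCEAN price of USD 1.5, which is implied by 10K OCEAN = USD 15k; the variable names are ours.

```python
market_cap_usd = 600_000_000   # approximate total market cap of $OCEAN
expected_uplift_pct = 1        # conservative 1% increase in value
grant_ocean = 10_000           # grant size in OCEAN
ocean_price_usd = 1.5          # implied by 10K OCEAN = USD 15k
success_probability = 0.75     # estimated chance of success

bang = market_cap_usd * expected_uplift_pct / 100   # value created, in USD
buck = grant_ocean * ocean_price_usd                # grant cost, in USD
expected_roi = bang / buck * success_probability

print(f"Bang = USD {bang:,.0f}, Buck = USD {buck:,.0f}, Expected ROI = {expected_roi:.0f}")
```

Changing the uplift or the chance of success shows how sensitive the result is to these estimates. For example, an uplift of 0.1% instead of 1% would still give an expected ROI of 30.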

Project Deliverables - Roadmap

Any prior work completed thus far?

30 completed interviews with privacy officers, data protection officers and C-suite executives from international mid-to-large enterprises to understand the legal challenges and requirements for companies to share and sell data
7 interested parties from banks, consultancy firms and co-working spaces to try our demo
Preliminary evaluation and testing of Generative Adversarial Network (GAN) variants

Project Roadmap & Key Milestones:

Q2, 2021:

Build a synthetic data generative model MVP and provide API gateway access for developers
Develop automatic evaluation pipelines of the utility and privacy of generated data
Develop an easy-to-use web application interface that gives non-developers easy access, for greater adoption

Q3, 2021: Future possible developments

Integration into Ocean Protocol’s Compute-to-data

Fig 4: Ocean Protocol Technical Whitepaper - Section 3.7.3

Ocean Market fork focused on synthetically generated private data

Fig 5: Ocean Protocol Technical Whitepaper - Section 8.7

Please include the milestone: publish an article/tutorial explaining your project as part of the grant (eg medium, etc).

Medium article: https://hellosecondlook.medium.com/secondlook-x-ocean-protocol-proposal-round-4-9a317b3fb2d

Please include the team’s future plans and intentions.

This grant would be the first investment in our team and would be used to cover the cost of:
– Hiring to develop frontend and backend infrastructure
– GPU and server compute to test and train our AI models
We hope to use the Ocean Protocol’s data marketplace as our first case study to show the value proposition of synthetic data generation in encouraging companies to share their sensitive data by reducing the privacy risk involved.
Our broader vision is to remove the friction in sharing and accessing data across any medium, starting with privacy compliance.
This grant and case study are important in helping us raise our pre-seed funding round of ~56k USD with Entrepreneur First, scheduled for May 2021, which would allow this project to turn into a proper self-sustaining start-up.

Project Details

Technology stack:

AI models: Variations of Generative Adversarial Networks (GANs)
Data & server infrastructure: AWS
Frontend & backend: ReactJS, NodeJS

Team members

Kevin Yee

Role: Software & AI Engineer
Relevant Credentials:
- GitHub: https://github.com/yee-kevin
- LinkedIn: https://www.linkedin.com/in/yee-kevin/
Background/Experience:
- BSc (Computer Science) from Singapore University of Technology & Design, Singapore
- Data & AI Developer at IBM, Singapore
- AI Researcher using GANs at Ben Gurion University (Israel) & Singapore University of Technology & Design
- Founder at Entrepreneur First, Singapore

Additional Information

Scheduled pre-seed fundraising round with Entrepreneur First in May 2021 to raise ~56k USD