[Proposal] AI Synthetic Data Generation - Sell sensitive data with zero privacy risk

Key Project Data

Name of project: SecondLook - AI Synthetic Data Generation
Team Website (coming soon): https://www.secondlook.ai/
Current country of residence: Singapore
Contact Email: hellosecondlook@gmail.com
Proposal Wallet Address: 0xb720883cb0e6FF8e9978c91009e3a49F09a13047

The proposal in one sentence: SecondLook converts sensitive customer & production data into realistic synthetic data, allowing organisations to sell data with zero privacy risk using AI synthetic data generation.

Which category best describes your project? Pick one or more.
- [x] Build / improve applications or integrations to Ocean
- [x] Unleash data

Project Overview

Description of the project:

Strict privacy laws & consumers’ increasing awareness of how companies use their personal data make data sharing a long, costly and risky process, driven by fear of breaches, fines and loss of customer trust.

This makes it near impossible to sell data without jumping through multiple hoops, for good reason. But this comes at a cost.

Our clinic records could be used to predict illnesses such as cancer or heart disease in advance for future patients. But would you consent to that?

The location data from our Uber rides could be used to determine travel patterns and design better transportation infrastructure, saving us time stuck in traffic jams and consequently reducing vehicle pollution. But would you consent to that?

Our AI synthetic data generation engine helps companies convert this valuable but sensitive customer & production data into realistic synthetic data that is up to 99% statistically similar to the real data. Individual consumers cannot be identified from the synthetic data, so organisations can sell data with zero privacy risk while keeping the trust of their consumers.

The most valuable datasets are owned by companies that are bound by privacy restrictions. Our solution complements Ocean Protocol by helping to increase the number of datasets published to the Ocean Market and, consequently, the number of trades and TVL.

Project updates since our previous Round 4 Proposal:

Q2, 2021:


  • Built a synthetic data generative model engine (completed Apr 2021)
  • Developed automatic evaluation pipelines for the utility and privacy of generated data (completed Apr 2021)

Next update:

  • Provide API gateway access for developers (expected May 2021)
  • Develop an easy-to-use web application interface for easy access by non-developers, for greater adoption (expected June 2021)

As per the timeline proposed in Round 4, we have successfully built a synthetic data generative model MVP, along with evaluation pipelines to measure the utility & privacy of generated data, over the past few weeks in April.

Although we did not receive funding in the previous round, we hope to have your support this round, and we will continue to do our best to achieve the milestones we promised. The grant money for this round would be used to hire for frontend UI/UX development, making our product easy to use even for non-tech-savvy users.

Try our demo + video explanation:

I’m very excited to share our newly created YouTube video explaining how our AI synthetic data generation engine works, along with an open invitation to try our MVP for free in 3 easy steps:

  1. Prepare your dataset as an Excel/CSV file (supports tabular structured data types including categorical, numerical, discrete-ordinal and datetime)
  2. Submit it to our Google Forms (our backend is ready but our frontend still requires development)
  3. Receive 1000 synthetic data samples and a report on utility & privacy by email
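To give a feel for this flow, here is a toy, self-contained sketch of steps 1–3. It fits naive per-column statistics and samples 1000 new rows; our real engine uses GAN models instead, and the column names and values below are invented purely for illustration:

```python
import csv
import io
import random
import statistics

# Toy stand-in for the synthesis step: fit simple per-column statistics,
# then sample new rows. (Illustration only -- the production engine models
# joint structure with GANs rather than independent columns.)
def synthesize(rows, n_samples, seed=0):
    rng = random.Random(seed)
    samplers = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        try:
            nums = [float(v) for v in values]
            # Numerical column: sample from a fitted normal distribution.
            mu, sigma = statistics.mean(nums), statistics.pstdev(nums)
            samplers[col] = lambda mu=mu, sigma=sigma: round(rng.gauss(mu, sigma), 1)
        except ValueError:
            # Categorical column: sample according to observed frequencies.
            samplers[col] = lambda values=values: rng.choice(values)
    return [{col: fn() for col, fn in samplers.items()} for _ in range(n_samples)]

# Step 1: a tiny made-up CSV dataset.
raw = "age,city\n34,Singapore\n29,Jakarta\n41,Singapore\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Steps 2-3: submit the data, receive 1000 synthetic samples.
synthetic = synthesize(rows, n_samples=1000)
print(len(synthetic))  # 1000
```

The synthetic rows look plausible but none of them is a copy of an original record, which is the property the utility & privacy report then quantifies.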

Problem introduction:

The world is producing more and more valuable data.

“90% of the world’s data was generated in the last two years, with 2.5 quintillion bytes of data being created each day.” - Forbes

But more and more of this corporate data is being restricted from being bought, sold or shared because of increasing privacy regulation.

“By 2023, 65% of the world’s population will have its personal data covered under modern privacy regulations, up from 10% in 2020.” - Gartner

Privacy laws are increasingly being implemented and strengthened in major economies around the world:

  • Europe: General Data Protection Regulation (GDPR) → May 2018
  • India: Personal Data Protection Bill → Proposed Dec 2019
  • USA: California Consumer Privacy Act (CCPA) → July 2020
  • Brazil: Lei Geral de Proteção de Dados Pessoais (LGPD) → Sep 2020
  • Singapore: Enhanced Personal Data Protection Act (PDPA) → Feb 2021

Ocean Protocol’s team have rightly identified that the “most valuable data is private data — using it can improve research and business outcomes. But concerns over privacy and control make it hard to access.”

There is no doubt that privacy is one of the biggest reasons why the majority of data accumulates and remains trapped in companies, unanalysed and unsold.

“Nearly 97 percent of data sits unused by organizations” - Gartner

Problem 1: Even when data is being shared and sold, it has to be anonymised beforehand, which results in a loss of information and reduces its utility.

Imperva explains in detail the current methods of anonymisation:

  • Data masking: hides data behind altered values
  • Generalization: deliberately removes some of the data to make it less identifiable
  • Pseudo-anonymisation: replaces private identifiers with fake identifiers or pseudonyms
  • Data perturbation: modifies the original dataset by rounding numbers and adding random noise

The more anonymisation techniques are applied, the greater the information loss, which reduces the data’s utility for the buyer.
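A minimal illustration of the four techniques applied to a single hypothetical record (all values invented), showing how each one destroys information the buyer might have wanted:

```python
import random

# A made-up customer record for illustration.
record = {"name": "Alice Tan", "age": 34, "postcode": "529510", "income": 84250}

# Data masking: hide the value behind an altered one.
masked = {**record, "name": "****"}

# Generalization: deliberately coarsen the data (exact age -> age band).
generalized = {**record, "age": "30-39"}

# Pseudo-anonymisation: replace the identifier with a pseudonym.
pseudonymised = {**record, "name": "user_8271"}

# Data perturbation: round numbers and add random noise.
rng = random.Random(0)
perturbed = {**record, "income": round(record["income"], -3) + rng.randint(-500, 500)}

# In each case some real information is gone: the name, the exact age,
# the precise income. That loss is exactly the utility cost paid by
# the data buyer.
print(masked["name"], generalized["age"], perturbed["income"])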

What if there is a method that resolves the trade-off between data privacy and utility?

“Compute-to-data resolves the trade-off between the benefits of using private data, and the risks of exposing it. It lets the data stay on-premise, yet allows 3rd parties to run specific compute jobs on it to get useful compute results like averaging or building an AI model.” -Ocean Protocol

Problem 2: However, compute-to-data still requires trust in the algorithm to ensure data does not get exposed.

Fig 1: Ocean Protocol Technical Whitepaper - Section 3.7.7

Problem 3: Even with compute-to-data, which prevents data buyers from viewing the data, privacy laws require data owners to seek consent from data subjects before their personal data is used for the data buyer’s use case. This is often impossible to determine in advance and involves a complicated back-and-forth process.

Problem 4: Data buyers often prefer to view, wrangle and combine the datasets they purchase, which is not possible with compute-to-data.

Solution & How we complement Ocean Protocol:

SecondLook is a data-as-a-service platform that allows users to convert sensitive customer/production data into realistic and privacy-safe synthetic data.

The synthetic data generated would have a similar structure and statistical properties to the original data, without privacy compliance or data exposure risks, because the synthetic data cannot be attributed back to any individual record in the original data.

We are using the same family of AI models behind deepfakes, which enabled the realistic generation of human faces you may have seen in the media. Specifically, we are focusing on Generative Adversarial Networks (GANs), which have shown impressive improvements over previous generative methods.
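To make the GAN idea concrete, here is a deliberately tiny, self-contained sketch: a generator with two parameters learns to imitate 1-D Gaussian "real" data by playing against a logistic discriminator. This only illustrates the alternating adversarial training loop; it is not our production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data the GAN should imitate: 1-D samples from N(4, 1.5).
def real_batch(n):
    return rng.normal(4.0, 1.5, n)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Tiny GAN: generator G(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    z = rng.normal(size=batch)
    x_real, x_fake = real_batch(batch), a * z + b

    # Discriminator step: ascend on log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    grad_w = np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)
    grad_c = np.mean(1 - d_real) - np.mean(d_fake)
    w, c = w + lr * grad_w, c + lr * grad_c

    # Generator step: ascend on log D(fake) (non-saturating loss).
    d_fake = sigmoid(w * (a * z + b) + c)
    grad_a = np.mean((1 - d_fake) * w * z)
    grad_b = np.mean((1 - d_fake) * w)
    a, b = a + lr * grad_a, b + lr * grad_b

# After training, generated samples should cluster near the real mean of 4.0.
samples = a * rng.normal(size=1000) + b
print(round(float(samples.mean()), 2))
```

Our production models are deep GAN variants suited to tabular data, but the alternating generator/discriminator update structure is the same.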

The synthetic data generation model can be used with or without Compute-to-Data.

  • With Compute-to-Data: This adds an additional layer of privacy since the synthetic data provided is realistic but cannot be traced back to the original private data.

    Fig 2: Ocean Protocol Technical Whitepaper - Section 3.7.7

  • Without Compute-to-Data: Instead of hosting the same dataset at an “encrypted data storage URL” that is distributed to every data buyer, synthetic data generation lets the data seller generate a unique, traceable synthetic dataset per data buyer.

What is the final product (e.g. App, URL, Medium, etc)?

The final product is a web application that would be available at https://www.secondlook.ai/. We have bought the domain and have built the backend. We are developing the frontend for easy access to our first MVP.

The easy-to-use web application allows users to:

  • Input: Upload and preview data
  • Process: Automatically identify the data structure and select the best AI model
  • Output: Train AI model & generate synthetic data
  • Metrics: Evaluate the utility & privacy of generated data
  • Report: One-stop compliance report
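The Metrics step can be sketched with two simple scores: a utility gap (how closely the synthetic marginals track the real ones) and a distance-to-closest-record privacy check. The data, column choices and thresholds below are illustrative, not our actual report format:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: 500 real and 500 synthetic records with two
# numerical columns (age, income). The synthetic set is a close fit.
real = rng.normal([40, 60000], [10, 15000], size=(500, 2))
synthetic = rng.normal([41, 59000], [11, 16000], size=(500, 2))

# Utility: relative error of per-column means (a real pipeline would
# also compare full distributions, correlations and model performance).
utility_gap = np.abs(real.mean(0) - synthetic.mean(0)) / np.abs(real.mean(0))

# Privacy: distance-to-closest-record. If a synthetic row sits almost
# on top of a real row, it may leak that individual's data.
scale = real.std(0)  # normalise columns so distances are comparable
dists = np.linalg.norm((synthetic[:, None, :] - real[None, :, :]) / scale, axis=2)
dcr = dists.min(axis=1)  # per-synthetic-row distance to nearest real row

print(utility_gap.round(3), round(float(dcr.min()), 3))
```

Low utility gaps and comfortably positive closest-record distances together indicate synthetic data that is useful to buyers without exposing any individual.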

Fig 3: Web application user flow preview: Anonymisation vs Synthesis

How does this project drive value to the Ocean ecosystem? This is best expressed as Expected ROI, details here.

We drive value to the Ocean ecosystem by removing one of the biggest frictions in sharing and selling private data: the privacy risk. Data owners need to be assured that the data they publish is privacy-compliant and risk-free.

Our synthetic data generation model would allow data owners to convert their sensitive data into realistic privacy-preserving synthetic data that offers high utility with no privacy risk. This is beneficial for both data owners and data buyers.

If Gartner’s estimate that “nearly 97 percent of data sits unused by organizations” is accurate, only 3 percent of data is currently utilised, shared or sold. We believe our solution would encourage more companies to share their untapped private data, capturing a fraction of that unused 97%.

Primary Metric: “Datatoken Consuming Volume”
Secondary Metric: “Total Value Locked”

We believe our solution can bring an additional 50 datasets published per month, with each buyer doing 10K OCEAN / week (40K OCEAN / month) of consume volume on Ocean Market, for a total of 2000K OCEAN / month of consume volume, recurring over the next year.

Assuming the Ocean Community gets 0.2% of consume volume:

  • 0.2% * 2000K / month * 12 months = 48K OCEAN ≈ 50K OCEAN
  • Bang = 50K OCEAN = 50K * 1.4 = 70K USD
  • Buck = Grant Size = 10K USD (in OCEAN)
  • ROI = Bang / Buck = 70K USD / 10K USD = 7.0
  • Chance of success = 75%
  • Final expected ROI = 0.75 x 7.0 = 5.25 > 1
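The arithmetic above, spelled out (all figures are our own projections, not measurements):

```python
# ROI projection from the bullet points above.
datasets_per_month = 50
consume_per_dataset_month = 40_000  # 10K OCEAN/week per buyer
monthly_volume = datasets_per_month * consume_per_dataset_month  # 2,000K OCEAN

community_take = 0.002  # Ocean Community gets 0.2% of consume volume
yearly_community_ocean = community_take * monthly_volume * 12  # 48K OCEAN

ocean_usd = 1.4
bang = 50_000 * ocean_usd   # ~48K rounded to 50K OCEAN -> 70K USD
buck = 10_000               # grant size in USD
roi = bang / buck           # 7.0
expected_roi = 0.75 * roi   # 5.25, given a 75% chance of success

print(monthly_volume, yearly_community_ocean, roi, expected_roi)
```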

Project Deliverables - Roadmap

Any prior work completed thus far?

  • 50+ completed interviews with privacy officers, data protection officers and C-suite executives from international mid-to-large enterprises to understand the legal challenges and requirements companies face when sharing and selling data
  • 10+ interested parties from banking, insurance and consultancy looking to try our demo
  • Synthetic data generation MVP (details below)

Project Roadmap & Key Milestones:

Q2, 2021: Deliverables

  • Build a synthetic data generative model MVP (completed Apr 2021)
  • Develop automatic evaluation pipelines of the utility and privacy of generated data (completed Apr 2021)
  • Provide API gateway access for developers (expected May 2021)
  • Develop an easy-to-use web application interface for easy access by non-developers, for greater adoption (expected June 2021)

Q3, 2021: Future developments

  • Publish an article/tutorial explaining our project (e.g. on Medium) as part of the grant

Future plans and intentions:

  • This grant would be the first investment in our team and would be used to cover the cost of:
    – Hiring to develop frontend and backend infrastructure
    – GPU and server compute to test and train our AI models
  • We hope to use the Ocean Protocol’s data marketplace as our first case study to show the value proposition of synthetic data generation in encouraging companies to share their sensitive data by reducing the privacy risk involved.
  • This grant and case study are important to help us raise our scheduled pre-seed funding round of ~56k USD with Entrepreneur First in May 2021, which would allow this project to turn into a proper self-sustaining start-up.

Project Details

Technology stack:

  • AI models: Variations of Generative Adversarial Networks (GANs)
  • Data & server infrastructure: AWS/Google Cloud
  • Frontend & backend: ReactJS, NodeJS

Team members

Kevin Yee

  • Role: Software & AI Engineer
  • Relevant Credentials:
  • Background/Experience:
    • BSc (Computer Science) from Singapore University of Technology & Design
    • Data & AI Developer at IBM
    • AI Researcher using GANs at Ben Gurion University (Israel) & Singapore University of Technology & Design