Datatera | Datatera Metadata Functions | Round 19

Project Name

Datatera


Project Category

Build & Integrate


Proposal Earmark

General


Proposal Description

We would like to add a metadata feature that inspects datasets and detects sensitive data using AI, specifically a Convolutional Neural Network (CNN). Columns in a CSV file that are detected as sensitive will be ignored when we run the Compute Job: when we configure the dataset path for the given algorithm, we read the results of the Sensitive Data Inspector Module, which are produced in JSON. In this way, we will provide sensitive data security and also support the “training data” concept. We will also assess the quality of the data by scanning the data points and checking that the main dimensions of data quality are met, based on the relevant KPIs applied in the AI model.
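The column-filtering step described above could look roughly like the sketch below. It is only an illustration: the real schema of the Sensitive Data Inspector Module's JSON is not specified in this proposal, so the `sensitive_columns` key and the `drop_sensitive_columns` function name are assumptions.

```python
import csv
import io
import json


def drop_sensitive_columns(csv_text: str, inspector_json: str) -> str:
    """Return the CSV text with every column flagged by the inspector removed."""
    # Assumed (hypothetical) result schema: {"sensitive_columns": ["col_a", ...]}
    flagged = set(json.loads(inspector_json)["sensitive_columns"])

    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name not in flagged]

    out = io.StringIO()
    # extrasaction="ignore" silently skips the flagged columns in each row.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


sample_csv = "patient_name,age,diagnosis\nJane Doe,42,flu\n"
inspector_result = '{"sensitive_columns": ["patient_name"]}'
filtered = drop_sensitive_columns(sample_csv, inspector_result)
```

Here the Compute Job would only ever see `filtered`, which no longer contains the `patient_name` column.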


Grant Deliverables

Grant Deliverables 1: Sensitive Text Data Inspector Function powered by AI

  • Output a result with a ratio indicating which columns possibly contain sensitive data
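To make the shape of such a result concrete, the sketch below computes, per column, the share of values that look sensitive and serializes it to JSON. A simple regular-expression check stands in for the CNN classifier here, and the function name and the `sensitivity_ratio` key are assumptions rather than the project's actual API.

```python
import csv
import io
import json
import re

# Stand-in for the CNN: flags e-mail addresses and 13-16 digit card-like numbers.
LOOKS_SENSITIVE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$|^\d{13,16}$")


def sensitivity_ratios(csv_text: str) -> dict:
    """Map each column name to the fraction of its non-empty values flagged."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = {name: 0 for name in (reader.fieldnames or [])}
    seen = {name: 0 for name in flagged}
    for row in reader:
        for name in flagged:
            value = (row.get(name) or "").strip()
            if value:
                seen[name] += 1
                flagged[name] += bool(LOOKS_SENSITIVE.match(value))
    return {
        name: (flagged[name] / seen[name] if seen[name] else 0.0)
        for name in flagged
    }


sample = "email,age\na@example.com,42\nb@example.com,37\n"
ratios = sensitivity_ratios(sample)
report = json.dumps({"sensitivity_ratio": ratios})
```

A consumer of the report could then treat any column above a chosen threshold as sensitive; the threshold itself would be a project decision.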

Grant Deliverables 2: Qualitative Text Data Inspector Function powered by AI

  • Output a result with a ratio indicating the quality of the dataset, based on the KPIs applied
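A quality ratio of this kind could be sketched as follows. The proposal does not list the KPIs, so completeness (share of non-empty cells) and uniqueness (share of distinct rows) are assumed examples, and averaging them into one `overall` score is likewise an assumption made for illustration.

```python
import csv
import io


def quality_ratio(csv_text: str) -> dict:
    """Score a CSV on two assumed KPIs and their plain average."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    # Completeness: non-empty cells divided by expected cells.
    total_cells = len(data) * len(header)
    filled = sum(1 for row in data for cell in row if cell.strip())
    completeness = filled / total_cells if total_cells else 0.0

    # Uniqueness: distinct rows divided by all rows.
    uniqueness = len({tuple(row) for row in data}) / len(data) if data else 0.0

    return {
        "completeness": completeness,
        "uniqueness": uniqueness,
        "overall": (completeness + uniqueness) / 2,
    }


sample = "id,weight\n1,70\n2,\n2,\n1,70\n"
scores = quality_ratio(sample)
```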

Project Roadmap:

Grant Deliverables 1: Sensitive Text Data Inspector Function powered by AI (Convolutional Neural Network (CNN))

  • Sensitive Data Inspector Function powered by AI - Development completed & System test started - Apr 15, 2022
  • System test by developer completed & functions published on SwaggerHub - May 13, 2022
  • Test cases and sample datasets provided for Acceptance Test - May 20, 2022
  • Acceptance test on Swagger - May 27, 2022
  • Announcement on social media of the pre-beta release - June 30, 2022
  • Pre-Beta testers informed - July 1, 2022

Grant Deliverables 2: Qualitative Text Data Inspector Function powered by AI (Convolutional Neural Network (CNN))

  • Qualitative Data Inspector Function powered by AI - Development completed & System test started - July 30, 2022
  • System test by developer completed & functions published on SwaggerHub - Aug 13, 2022
  • Test cases and sample datasets provided for Acceptance Test - Aug 20, 2022
  • Acceptance test on Swagger - Aug 27, 2022
  • Announcement on social media of the MVP beta release - Aug 30, 2022
  • MVP Beta testers informed - Sep 1, 2022

Tech Stack:

  • Inspector Module Functions in Python
  • GitHub will be used for Code & Version Control
  • Inspector decision-making intelligence by Convolutional Neural Network (CNN)
  • Visual Studio Code will be used as IDE
  • Inspector results will be generated in JSON
  • Functions will be published on SwaggerHub

We will maintain and further develop this module and fix bugs/errors, since it will be part of our Datatera solution.

  • The dataset format will be CSV only at first; we can later support more formats, e.g. XML and XLS, and even medical images.
  • We will probably add more KPIs and metrics to better detect sensitive data and assess dataset quality.
  • We will add possible extensions to this work to provide more relevant AI insights in the metadata feature.

Project Description

Datatera is a global marketplace that connects HealthData Providers with HealthTech companies by making larger samples of high-quality real-world datasets available.


Final Product

HealthTech AI companies struggle to get access to high-quality healthcare datasets while building AI models. This results in bias and other errors, and in models that take a lot of time and money to maintain and manage. Datatera will provide a global data computing marketplace where Data Scientists can train their AI models on high-quality and diverse training datasets while preserving privacy.


Value Add Criteria

Usage of Ocean and Viability - We intend to enrich the metadata feature with valuable AI insights that help Data Consumers choose the right dataset for their needs.


We believe we can improve and develop the Compute-to-Data (C2D) concept with a richer metadata feature that provides full awareness of the data sensitivity and data quality of the datasets on our platform.


It is equally important to ensure that all datasets available on our platform have already been inspected and offer real value to Data Consumers who choose to train their AI models on them.


Core Team

Tugce Ozdeger

Role: Developer, CTO, Lead Developer, Architect

Relevant Credentials:

GitHub: https://github.com/TugceOzdeger

LinkedIn: https://www.linkedin.com/in/tugceozdeger

Other:

Background/Experience:

Founder at Datatera

10+ years of professional experience as a senior system developer

Tevfik Akin

Role: Senior Pharmacist & Data Analyst

Relevant Credentials:

GitHub: https://github.com/tevfikakin

LinkedIn: https://www.linkedin.com/in/tevfik-akin

Other:

Background/Experience:

Data Analyst at Datatera

Zeki Gultekin

Role: Senior Data Analyst

Relevant Credentials:

GitHub: https://github.com/Gltknzk

LinkedIn: https://www.linkedin.com/in/zeki-gultekin

Other:

Background/Experience:

Senior Data Analyst at Datatera


Advisors

Ruslan Gasimli

Role: Advisor

Other:

Background/Experience:

Data Advisor at Datatera

Senior BI Data Scientist

Patrick Masaba

Role: Advisor

Relevant Credentials:

LinkedIn: https://www.linkedin.com/in/patrick-daniel-masaba-18914360

Other:

Background/Experience:

Medical Data Advisor at Datatera

Medical Doctor in radiology and Ph.D. in Artificial intelligence for prostate cancer detection.


Funding Requested
2000


Minimum Funding Requested
1000


Wallet Address
0xEB023A03cfebd0a58214CA018c3f25F0c8b96000


Hi,

Thank you for applying for R19!

Your proposal has been registered into the system and everything looks great!

Your previous Grant Deliverables have been reviewed and look to be in good condition. I have also looked at your Project Standing; it looks to be in good condition, and you are ready to apply for another grant.

I would also recommend one (or all) of the following to increase support:

  1. Say hi to the community in #ocean-dao and share your proposal.
  2. Say hi to members of the #project-guiding WG and share your proposal.
  3. Meet with the Guides assigned to you by the #project-guiding WG.
  4. Attend a Town Hall or Project-Guiding WG meeting to talk about your project and proposal.

All the best!

-Christian Casazza

Project submitted deliverables:

We have built a model for sensitive text data analysis. It can detect sensitive data (name, password, credit card number, date of birth, etc.) and can be used for initial detection. It leverages a Convolutional Neural Network (ConvNet/CNN), which is a Deep Learning algorithm. We also built a function to evaluate the quality of textual data by applying 6 data quality metrics/KPIs.

The source code can be found at: https://github.com/DatateraTechnology/Inspector

Check app.py for the implementation. The test data was uploaded to a Storage Account on Azure, along with the model produced at the end of the job execution.

We have been working on Microsoft Azure to deal with the timeouts in the Alpha version, and we have successfully deployed the Alpha on a Virtual Machine without any timeout issues.

The Flask API was also built and prepared for querying Sensitive & Qualitative Data. The API can be found on SwaggerHub: https://app.swaggerhub.com/apis/DatateraTech/DatateraBeta/1.0

Here are some tweets where we announced what we are building: https://twitter.com/DatateraTech/status/1565572411814207489?s=20&t=iy9rvgkv6v8GtG-0Ms9vJQ

https://twitter.com/DatateraTech/status/1562461237170286594?s=20&t=iy9rvgkv6v8GtG-0Ms9vJQ

https://twitter.com/DatateraTech/status/1560584918974988288?s=20&t=iy9rvgkv6v8GtG-0Ms9vJQ

https://twitter.com/DatateraTech/status/1559433949004324864?s=20&t=iy9rvgkv6v8GtG-0Ms9vJQ

Stay tuned.

LI: https://www.linkedin.com/company/datatera

Twitter: https://twitter.com/DatateraTech

Admin:

Hi Tugce, I have reviewed your deliverables and they seem to match up with what was promised as part of your last grant. Congratulations on the progress! I am going to accept your deliverables so you can submit for the next round. Moving forward, I would like to understand more about when integration into the Ocean stack will take place, especially with regard to C2D, as this was one of the key areas of value-add for Datatera. Thank you!