Build & Integrate
Having explored the Ocean ecosystem broadly and begun looking at the API, I would first get the practical experience and code set up to pull data through the API. Given that data, I would then generate descriptive statistics (size, shape, counts of missing data, min/max of columns, data types, etc.). This would all be output in a web app format to start, with the code saved in a GitHub repo.
This is just step 1 of a multistep process toward the final product.
GitHub repo with code
- Code to access a dataset based on data ID
- Code to define metrics
- Code to create a Streamlit app and display metrics
- Allow user input for the data ID
- Show metrics about the data
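The metrics step above can be sketched in Python with pandas. This is a minimal illustration, not the final implementation; the function name `dataset_metrics` and the exact set of metrics are assumptions based on the statistics listed (size, shape, missing counts, min/max, data types).

```python
import pandas as pd

def dataset_metrics(df: pd.DataFrame) -> dict:
    """Summarize a dataset: shape, memory size, per-column missing
    counts, column dtypes, and min/max for numeric columns."""
    numeric = df.select_dtypes(include="number")
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "dtypes": {col: str(dt) for col, dt in df.dtypes.items()},
        "min": numeric.min().to_dict(),
        "max": numeric.max().to_dict(),
    }

# Example with a toy dataset
df = pd.DataFrame({"price": [1.0, 2.5, None], "label": ["a", "b", "c"]})
metrics = dataset_metrics(df)
```

In the Streamlit app, a dictionary like this could be rendered directly (e.g. via `st.json` or `st.table`) after the user supplies a data ID.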
This project creates a tool for data suppliers or buyers to clean their data. Ultimately it would pull in data given the data ID or location and clean it based on user options: at a minimum, it would handle outliers and fill in missing data using statistical/machine learning techniques. It would then provide the output in the desired format and location, supporting a range of formats and locations.
The backend would likely leverage existing packages such as PyCaret, so the value-add here is twofold: a UI that gives users easy access to their data, point-and-click cleaning options and a simple way to output the cleaned dataset; and extensions to existing cleaning packages/methods to support this use case.
The product itself would be a web interface and the backend code that lets the user access their data (pointing at it with a data ID or URL), choose cleaning options and see metrics about the data. The product would also include the new file(s) for the cleaned dataset and metrics around the cleaning.
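The minimum cleaning described above (outlier handling plus missing-data fill) can be sketched as a statistical baseline in pandas. The function name `clean_numeric` and the specific choices (IQR-fence clipping, median imputation) are illustrative assumptions; ML-based imputation via a package like PyCaret could replace them later.

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Baseline cleaning for numeric columns: clip outliers to the
    IQR fences, then fill missing values with the column median."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lo, hi = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
        # Clip extreme values into the [lo, hi] range
        out[col] = out[col].clip(lower=lo, upper=hi)
        # Fill remaining missing values with the (post-clip) median
        out[col] = out[col].fillna(out[col].median())
    return out

# Example: an outlier (100.0) gets clipped and the NaN gets filled
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0, None]})
cleaned = clean_numeric(df)
```

In the product, `iqr_factor` and the imputation strategy would be among the point-and-click options exposed in the UI.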
Adds Value to the Ocean Ecosystem:
Data users will get cleaner data for their analysis, which will save them time cleaning data and let them draw better insights. For sellers of data, this should increase the value of their datasets (or they could sell both cleaned and uncleaned versions to increase the likelihood of selling at different price points).
Usage of Ocean Protocol:
Ocean Protocol will be used to locate the input data and will be one of the destination options for the output data.
Given my history of similar projects and the fact that the work is in line with my skill set, this project should be entirely viable. I have done other work pulling data from APIs, cleaning data (both automated and manual), handling missing data, creating files and building web apps. I also have a broad enough data science background to help at the detailed level or with any unforeseen additional work.
I am currently on Discord, attended a MOBI talk with Bruce Pon and have attended one town hall so far. I am looking to get more active as this takes off, hoping to connect with other data science users/creators, the Project-Guiding WG and others, and would like to become more active on Discord and attend town halls more regularly.
Role: Data Scientist
GitHub: msquaredds
- Currently an independent data scientist - freelancing, working on personal FinTech projects and getting into Web3
- Data analyst at the world’s largest hedge fund (Bridgewater Associates)
- Quantitative analyst at a mutual fund group
- Prior independent data science projects across finance, FinTech and veterinary sciences
Minimum Funding Requested