OceanDAO Grant Proposal: Round 8 — SandLabs
Helping realize Ocean Protocol’s mission via blockchain, data science, and software development!
Part 1 - Proposal Submission
Which category best describes your project?
Outreach / community / spread awareness
Which Fundamental Metric best describes your project?
Data Consume Volume
Proposal in one sentence:
In order to boost the usage of the Ocean Market, as well as the overall Ocean Protocol, SandLabs proposes to create a blockchain/crypto-related dataset from a variety of data streaming sources for listing on the Ocean Market, as well as corresponding project blogging in order to share the extraction and market listing processes with broader
Description of the project and what problem is it solving:
Ocean Protocol is a foundational technology for blockchain data exchange, but it has yet to see mass adoption among data practitioners. For our first endeavor, SandLabs hopes to boost the widespread adoption of Ocean Protocol by sharing with the BlockchainxData communities several blog posts describing best practices for data extract, transform, load (ETL) pipelines, as well as how-to guides for serving data assets on the Ocean Marketplace. This project will culminate in the creation of a robust blockchain/crypto-related dataset from a diverse set of streaming data sources, which will be served as an asset on the Ocean Marketplace and will supply material for the blog posts.
Data collection is vital to SandLabs' operation, and we propose to begin by creating ETL pipelines for several APIs across a combination of topics including data science, AI/ML, and blockchain. So far we have considered using the Reddit and GitHub APIs, and once extraction from these succeeds we hope to expand the collection (possibly with more DeFi or social data). To perform the ETL, we propose to use a Google Cloud Platform architecture, which will allow serverless automation of the collection processes. Furthermore, our transformations will aggregate the data where possible to produce machine-learning-ready datasets. These datasets will be listed on the Ocean Marketplace and promoted through blog posts written by SandLabs. We plan to write relevant posts in the popular data science blog Towards Data Science (TDS), which has over 500,000 followers.
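As a rough illustration of the aggregation step described above, the sketch below rolls raw records up into one row per day and topic. The field names (`created_utc`, `topic`, `score`) and the `aggregate_daily` helper are assumptions made for this example, not a finalized SandLabs schema.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_daily(records):
    """Roll raw records up into one feature row per (day, topic).

    Each record is assumed (hypothetically) to be a dict with:
      created_utc: Unix timestamp in seconds
      topic:       short label such as "blockchain"
      score:       numeric engagement value
    """
    buckets = defaultdict(lambda: {"count": 0, "score_sum": 0})
    for rec in records:
        # Bucket by UTC calendar day so rows do not depend on the host time zone.
        day = datetime.fromtimestamp(rec["created_utc"], tz=timezone.utc).date().isoformat()
        bucket = buckets[(day, rec["topic"])]
        bucket["count"] += 1
        bucket["score_sum"] += rec["score"]

    rows = []
    for (day, topic), agg in sorted(buckets.items()):
        rows.append({
            "day": day,
            "topic": topic,
            "post_count": agg["count"],
            "mean_score": agg["score_sum"] / agg["count"],
        })
    return rows
```

Rows of this shape can be written out as CSV or loaded into a database table, which is one plausible form for a listed dataset.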
Apart from listing our collected data on the Ocean Marketplace, we also plan to create value by applying data science methods across the collection to surface insights about trends in blockchain technology. We hope to use these insights to develop software for the blockchain ecosystem down the road. This software could include data wallets, data dashboards, automated DeFi trading protocols, helpful developer utilities or frameworks, and more!
Grant Deliverable 1:
Create a large crypto/blockchain-related dataset from a variety of sources and list it on the Ocean Market.
We want to begin by extracting data from GitHub and Reddit via their APIs. This will provide project metadata (READMEs, tags, summaries, etc.) as well as some social data. For search trends, we will first target a combination of blockchain and artificial intelligence or machine learning. From there, we could potentially expand our queries to broader blockchain-related domains. We hope to use Google Cloud Platform resources to perform the ETL, carry out data analysis, and store data in databases and general object storage (a data lake).
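A minimal sketch of the GitHub extraction step, assuming the public repository search endpoint of the GitHub REST API. The helper names `build_search_url` and `flatten_repo`, the choice of output columns, and the page size are assumptions for illustration. No request is sent here; actually fetching results would need an HTTP client and, for practical rate limits, an access token.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(terms, per_page=50):
    """Build a repository-search URL whose query contains all given terms."""
    query = " ".join(terms)
    return GITHUB_SEARCH_URL + "?" + urlencode({"q": query, "per_page": per_page})


def flatten_repo(item):
    """Reduce one search-result item to a flat metadata record.

    The output column names are hypothetical; the input keys are fields
    that GitHub returns for each repository in a search response.
    """
    return {
        "full_name": item["full_name"],
        "summary": item.get("description") or "",  # description may be null
        "tags": ",".join(item.get("topics", [])),
        "stars": item.get("stargazers_count", 0),
        "url": item.get("html_url", ""),
    }
```

For example, `build_search_url(["blockchain", "machine-learning"])` yields a URL combining both terms, and each element of the response's `items` array can be passed through `flatten_repo` before loading.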
Grant Deliverable 2:
Blog post in Towards Data Science about Blockchain ETL and how to list on the Ocean Marketplace (assuming publication approval).
Here, we will explain a helpful portion of our ETL process and our Ocean Market listing in the popular TDS blog.
Grant Deliverable 3:
Blog post in Towards Data Science on exploratory data analysis performed on our collected data (assuming publication approval).
This post will cover some of the insights we will extract from the data we collect in the first deliverable. The post will also mention the Ocean Market as the data source.
How does this project drive value to the fundamental metric and the overall Ocean ecosystem?
By reaching broad communities of data practitioners with our blog posts, this project aims to increase Data Consume Volume as well as Weekly Active Users, among other metrics. Not only will we give Ocean exposure in broad data science communities, driving value to the Ocean ecosystem, but we can also provide this exposure through a technical lens. Aside from our blog posts, our data engineering and dataset generation are guided by the design priorities of informativeness, exoticness, and cleanliness. The resulting dataset should serve as a good asset on the marketplace, bringing in additional users who in turn may choose to consume it.
If we are chosen as recipients of an OceanDAO grant, the grant will also enable us to lay the bedrock for future analysis of our collected data (and possible bundling of this analysis for usage), in addition to the creation of new software for the blockchain ecosystem. Many ideas have come up for possible software projects, including predictive APIs, data dashboards with state-of-the-art visualizations, and data wallets.
What is the final product?
Creation of a robust foundation for data analysis and software development in addition to two blog postings and at least one Ocean Market listing.
Proposal Wallet Address:
Have you previously received an OceanDAO Grant?
|Project lead Contact Email||admin@SandLabs.co|
|Country of Residence||United States of America|
Part 2 - Proposal Details:
Project Deliverables - Category:
If Outreach / community, then:
(2) blog posts will be published at: Towards Data Science
If the project includes software:
- Google Cloud Platform
  - Compute Engine or Cloud Functions
  - Cloud Storage
  - Additional tools as needed
- Data Science Stack
  - Python & R
  - Scikit-Learn & TensorFlow
  - Matplotlib, Plotly, etc.
  - Jupyter/Google Colab
- Web Development
|July, 2021||SandLabs is founded|
|Mid August, 2021||Finish Initial Data ETL + Publish OceanDAO Grant Recipient Announcement (possibly)|
|End of August, 2021||Compile and Publish Initial Blog Posts|
|Beginning of September, 2021||Expand Data Extraction System|
Project Deliverables So Far
- Completed initial API queries
- Familiarized with GCP architectures
- Became TDS author
We plan to use this grant as a jumping-off point for a few possible long-term endeavors. In the short term we are going to collect and analyze data, but in the long term we hope to use the insights generated by our analysis to release software products for the BlockchainxData communities (such as data wallets, component libraries, and more).
For each team member, give their name, role and background.
|Role||Founder and Lead Developer|
|Relevant Credentials||GitHub, LinkedIn, Personal Website (WWalsh.io), Kaggle|
|Background/Experience||University of California, Berkeley: Industrial Engineering and Operations Research (Class of 2020), The Hotchkiss School (Class of 2014)|
Fully Automated Data Pipeline Using Free, Cloud-Based Solutions: Kaggle NBA Dataset
- Facilitated others’ sports-analytics data projects by creating the most robust, open-source, NBA-related database. Ensured $0 capital overhead requirements by using free cloud computing and dataset tools. Enabled better testing, deployment, and expansion by containerizing each pipeline segment’s Python scripts.
Machine Learning for NBA Game Attendance Prediction
- The goal of this project was to craft models in order to accurately predict the attendance of a future National Basketball Association (NBA) game. Game data, including attendance, was scraped from stats.nba.com and stadium capacity data collected from numerous online sources. This data was then cleaned, processed, explored through visualizations and statistical tests, and then modeled using many regression techniques including regularized methods, ensemble methods such as Random Forest and Boosting, and neural networks. Feature significance was also determined through techniques such as the Group Lasso and ensembling. The overall mean absolute error (MAE) in the best models was found to be around 750 people. A paper is included summarizing the goals and findings along with notions of future work that could be applied as well. The coding of this project was carried out in a combination of R and Python.
- Regularized Linear Regression Deep Dive
- Published 3 articles in Towards Data Science after a thorough investigation into the underlying model optimization mathematics. Open-sourced all project implementations, including Pathwise Coordinate Descent optimization and cross-validation. Researched efficient methods for solving machine learning problems and made the necessary derivations for model estimators.
|Role||Chief of Operations; Editor|
- Tufts University (Class of 2020)
- The Hotchkiss School (Class of 2014)
- Co-founder of Phase 5 Analytics
- Co-founder of Oursock.com