[Proposal Round 4] ResilientML - Web3.0, DeFi, and Metaverse Sentiment

ResilientML · April 1, 2021, 1:45pm

Key Project Data

Name of project:

ResilientML - Web3.0, DeFi, and Metaverse Sentiment

Team Website:

https://www.resilientml.com/

ResilientML will efficiently extract value from the corpus of crypto related financial news, blockchain technology, regulatory articles and reporting as well as other important text based data sources based on github, reddit, twitter, facebook and other social media platforms.

Focus on rapidly emerging projects including Web3.0: Ocean, Streamr, Big Data Protocol, Fetch AI; DeFi: Solana, Serum, Polkadot; Metaverse: Decentraland, Enjin, Red Fox, Axie Infinity etc.

Proposal Wallet Address:

0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

The proposal in one sentence

Data is the modern oil of the blockchain economy, ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored data sets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.

Which category best describes your project?

Unleash data

Project Overview

Mission:

The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:

We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized as it requires a non-standard set of data science processes to extract and curate.
In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. They will be able to utilize our high quality Natural Language Processing (NLP) text data features to develop apps that will interface directly with our data reservoirs through API interfaces that can extract relevant text data features from our JSON formatted and curated feature libraries to form tributaries to the wider Ocean machine learning applications that seek inputs from text based data features.

ResilientML has developed methods in python to produce these JSON formatted text feature libraries that will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized natural language processing NLP methods that ResilientML will bring to the Ocean community based on extensive academic and industry experience in developing such solutions.

In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.

NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:

Crypto News Sentiment
Social Media Sentiments
Technology: github, bitbucket, wire, …
Regulatory compliance reports
Legal documents

The team at ResilientML have dedicated quantitative analysts, machine learning experts and industry leading engineers to develop this suite of tools in both api, cloud solutions in azure and AWS in the languages of R, Python, MongoDB and others.

Description of the project:

Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).

Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it becomes created and transmitted (velocity), its heterogeneous nature (variety) which includes not only textual, but also financial, visual or verbal data, and the actionable information among “noise” (value), are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied on the data.

Furthermore, these operations are necessary to be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity is related to the quality and provenance of the data that are fed into the statistical models. The source of data determines their quality and inherent biases, that inevitably affect the output of the final statistical model.

A detailed synopsis of the project can be found in the appendix of this proposal to see specifically the stages of Machine Learning considered to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.

The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate this process. The relevancy and approaches to data collection require domain knowledge to identify the most relevant sources of data to extract value from to ensure the data is of highest integrity.

We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.

An important point of distinction to what we offer is that we move beyond the standard approach of bag-of-words and frequency of words based models which are ubiquitous in most NLP sentiment based frameworks, but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature which means that we can extract contribution to sentiment by individual articles, authors, and news sources explicitly.

Another important distinction of our contribution compared to other sentiment based models is that we don’t just focus on twitter and social media feeds, which have limited scope to express sentiment, instead our approach targets detailed analyst reports, news reports, and regulatory reports. Working with these enriched data sources is more complex than working with social media based models, and our framework is able to accommodate this.

What problem is your project solving?

Currently, there is a lack of high quality data on the Ocean marketplace – which is to be expected at this stage. In order to attract a snowball effect of high quality data providers, an initial kernel of high quality datasets need to be published, staked, and purchased on the marketplace. We will contribute to the provision of such high quality datasets to the Ocean marketplace to drive growth – critical to the success of the protocol.

What is the final product?

Datasets will be published on the Ocean marketplace – focus on rapidly emerging projects including Web3.0: Ocean, Streamr, Big Data Protocol, Fetch AI; DeFi: Solana, Serum, Polkadot; Metaverse: Decentraland, Enjin, Red Fox, Axie Infinity etc. (and expanding in the future to include other projects with high following). Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com. The fixed costs (man hours + compute cost) required to provide these datasets is 18,400 Ocean.

Figure 1: Text Analytics Value-Add Pipeline (VAP)

Expected ROI.

The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.

Here, we focus on the first of these value drivers – since it is the easiest to ballpark.

We make the following assumptions:

Probability of project success = 0.8
Ocean Community gets 0.2% of consume volume.

We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively (please see the attached Jupyter notebook for further details).

Note:

Let x_t = consumption for month t.

x_t = x_0 * (1 + rate)^t , where t=0, 1, …11 months.

Where x_0 = initial_num_users x num_datasets x datatoken_price

Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):

The man hours and computation necessary to code, scrape, clean, and process these datasets are substantial. Below we lay out our projected fixed costs (see attached Jupyter notebook for details) to provide for example 5 datasets. Variable costs should of course be considered but are omitted for simplicity here. Note: these projections are based on an analysis of preliminary processing using a smaller dataset.

| — | — | — | — |

|man_hours|fixed|4,000|10,000|

|compute|fixed|3,360|8,400|

|Total|fixed|7,360|18,400|

Project Deliverables – Category

Datasets will be published on the Ocean marketplace – focus on rapidly emerging projects including Web3.0: Ocean, Streamr, Fetch AI; DeFi: Solana, Serum, Polkadot; Metaverse: Decentraland, Enjin, Red Fox, Axie Infinity etc. (and expanding in the future to include other projects with high following). Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.

Project Deliverables – Roadmap:

Any prior work completed thus far?

The proposed project builds upon the contributions over the past two years of members of ResilientML in building the machine learning pipeline shown in Figure 1. As such, processing of data from 2017-2019 has been performed.

Part 1: Python code has been written to perform text data collection via Java-based screen scraping and document collection – this has been unit tested and validated.

Part 2: Python modules have been created with proprietary steps of text data de-noising based upon the concepts provided in the appendix – this has been unit tested and validated.

Part 3: Python modules have been created to extract time series of features related to:

Semantic bag-of-words frequency based features and their corresponding time series.
Grammar based parse trees and their corresponding time series.
Syntax based dependency graphs and their corresponding time series.

These have been unit tested and applied to crypto data. The next stage is to put these steps into a distributed production system and curate these in a json data format for an API feed.

Roadmap

Month 1:

Complete prototyping of stages 1 – 4.

Publish datasets to Ocean marketplace.