[Proposal Round 8] ResilientML – Expansion of Sentiment Data – FaaS + Online Dashboard UI

ResilientML · July 29, 2021, 3:09pm

Key Project Data

Name of project:

ResilientML – Expansion of Sentiment Data – FaaS + Online Dashboard UI

Team Website:

https://www.resilientml.com/

Proposal Wallet Address:

0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

Which category best describes your project?

Unleash data

Funding Requested

17,600 USD

Features-as-a-Service (FaaS) model – concept of a FaaS market place on Ocean

Looking at the market places currently available on Ocean, you will quickly realise that the bulk of the listings are rudimentary datasets taken from standard sources and put on Ocean for a price, even though most of them are freely available in standard libraries for R, Python and Julia. Furthermore, these datasets are not processed or curated in ways that make them easy to use for training ML models, and they offer no additional processing value that would save compute time in ML model training or fine-tuning. This is where Features-as-a-Service (FaaS) data markets become important to understand.

The ResilientML FaaS curated data libraries provide a significant advantage compared to the existing datasets provided on the Ocean market places to date.

We aim to provide professional-level curated datasets so that ML Compute-to-Data users can access data and feature libraries that are maintained, updated and extended weekly. These libraries come in standard professional data formats, with standardised feature templates in JSON and Python pickle, ready for training and fine-tuning time-series models, deep neural network models, Transformer models, LSTMs etc.

In general, low quality raw datasets on the markets tend to create a problem for buyers and users as firstly, this adds a lot of noise to buyers’ perception of quality data, and secondly, they lack formal curation principles for dataset preparation and dataset utility for machine learning modelling applications. The data is often not processed, not developed into consistent formats for ease of use in ML models to be trained in Compute-to-Data modules in the future, and it is often static and non-updated or non-maintained.

ResilientML is trying to bring professional grade data curation and feature libraries in the NLP context to the Ocean market, and we hope the ML practitioners can appreciate the difference between our professionally curated data library in the FaaS context versus low level raw data libraries.

Since the datasets have basically no demand side at present, we have currently offered our FaaS at a fixed cost with a monthly subscription to the service. This will progressively change if eventually Ocean is able to get demand side participation to match supply side, which is currently a serious problem for people curating datasets.

It is vital for ResilientML as a company that a demand side is developed for the Ocean markets so that curating, maintaining and updating the FaaS is able to continue. Until that point, ResilientML has to rely on grant funding to continue to offer the highest level of data FaaS quality, which distinguishes itself from low quality data sets currently available.

ResilientML as a company will be unable to keep curating, maintaining and updating the FaaS data markets on Ocean if there is neither a demand side to these markets nor grant funding. So if the community wants our service as part of the ecosystem, I ask you to seriously consider not downgrading our project as in the previous round, and to consider our service as a state-of-the-art FaaS for Ocean that we are sure can add increasingly valuable data in the midst of a lot of garbage, noisy datasets currently present.

Updates on Features-as-a-Service model

Using the previous rounds of funding, ResilientML has been producing a weekly growing automated data market for Natural Language Processing Feature Libraries (NLPFL) on Ocean.

The following process has been automated and executed in Cloud infrastructure on a weekly basis (see figures below):

Scan increasing numbers of news sources (currently 3+) automatically and extract news reports on targeted topics related to key areas in the crypto sphere;
Process text into cleaned format – as outlined in proposal stages of data munging;
Extract key features from text in n-gram formats, sentence format;
Construct the JSON data library update, and automate its transfer to box storage and the update of the Ocean market every week at midnight on Saturday.
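As a rough illustration of the last two steps in the list above (n-gram extraction and JSON packaging), here is a minimal sketch using only the Python standard library. The tokeniser, the record fields (`source`, `title`, `num_tokens`, `ngrams`) and the example article are assumptions made for illustration; they are not the actual ResilientML schema or pipeline.

```python
import json
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams (as space-joined strings) in token order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_record(source, title, text, max_n=2):
    """Assemble one hypothetical JSON-ready record for a single article."""
    tokens = tokenize(text)
    return {
        "source": source,
        "title": title,
        "num_tokens": len(tokens),
        # One frequency table per n-gram order, keyed by the order as a string.
        "ngrams": {
            str(n): dict(Counter(ngrams(tokens, n)))
            for n in range(1, max_n + 1)
        },
    }


# Made-up article, purely for demonstration.
record = build_record(
    source="example-news-site",
    title="Example headline",
    text="Bitcoin rallies as Bitcoin adoption grows.",
)
print(json.dumps(record, indent=2))
```

A weekly job could append such records to the library file before publishing; how the real pipeline stores and uploads them is not specified here.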

As of 28/07/2021 the market place has data for Features-as-a-Service summarised as follows:

3 major crypto news sources
18 crypto projects that are reported on
5 news categories relevant to the crypto space
27,072 articles
13,760,130 processed text tokens,

the processing of all of which has been the result of more than 3 years of research into statistical natural language processing.

Furthermore, the following processes have been in development and will be offered on the market place both as data, and as Compute-to-Data modules, once the latter is fully supported:

Segment text into topics and append to previous feature libraries,
Process higher order features sets – CFG trees, Dependency graphs, spectral features and sentiment indices,
Create summary overview of changes made, current special features that week, KPIs and total records to date (currently partially in effect).

The same data suite will also be available for COVID-19 news articles in the next stage of our FaaS project, and data from a range of other sectors will follow (e.g. climate).

The KPIs and data summaries will in addition be developed and illustrated via a dedicated Dashboard on the ResilientML website, which will be showing the dataset state in real time.

Finally, the following Compute-to-Data modules are scheduled to be released, all of which are based on research outputs published by the ResilientML team in peer-reviewed academic journals:

Mixed-frequency data modelling for forecasting investors’ sentiment in crypto markets
Epidemiological models for COVID-19 infected cases, enhanced with public sentiment

Round 8 Proposal

This proposal is to:

continue to extend, develop and further curate our initial development of a Crypto-specific Natural Language Processing Data Suite, and
produce an online dashboard that will show in real time the status of the Real News Crypto dataset.

The grant will go to help fund the coding, NLP model development, man hours and compute hours required to extend our current published dataset. We aim to continue to grow this dataset and expand its usability by further extracting value from the corpus of crypto related financial news, and making it more accessible to the Ocean community.

This round we will expand our offering to include new high level extracted sentence structure features for the currently available data and the additional following components:

Web3.0 coins (Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);
Layer1 coins (Solana);
DeFi coins (Terra, PancakeSwap, Maker, THORChain, Serum);
Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs);
Stablecoins (USDC, BUSD),

and including data from the following additional news sources:

Cointelegraph;
NewsBTC;
Bitcoinist;
Blockonomi;
Coinspeaker.

Specifically, in addition to the post-processed tokens, we will now include sentences processed to remove parts of text noise, whilst keeping their grammatical, syntactical and semantic structure intact. This is useful for NLP Compute-to-Data modules that take raw text sentences as their input (e.g. BERT-type, Transformer-based models). In addition, we will provide advanced features capturing the grammatical and syntactical structure of the sentences. These comprise Context-Free Grammar (CFG) parse trees, which express the grammatical structure of a sentence, as well as Dependency Graphs, which encode the sentence syntax.

We will expand the dataset going forwards and also back-fill the currently available data. This will effectively increase the current data offering we have developed by around a factor of three in terms of content and volume of processed corpus.

We will continue to grow the sophistication of the data being provided. This will be achieved by migrating from data munging to feature extraction and data curation for feature libraries.

The initial phase of this will encompass publishing a new dataset to provide extracted Semantic features, such as bag-of-words, matrix factorization - deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

The proposal in one sentence

Data is the modern oil of the blockchain economy. ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored datasets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.

Project Overview

Mission:

The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:

We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized as it requires a non-standard set of data science processes to extract and curate.
In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. They will be able to use our high quality Natural Language Processing (NLP) text data features to develop apps that interface directly with our data reservoirs through APIs. These APIs can extract relevant text data features from our JSON-formatted, curated feature libraries to form tributaries to the wider Ocean machine learning applications that seek inputs from text-based data features.

ResilientML has developed methods in Python to produce these JSON-formatted text feature collections that will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized NLP methods that ResilientML will bring to the Ocean community based on extensive academic and industry experience in developing such solutions.

In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.

NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:

Crypto News Sentiment
Social Media Sentiments
Technology: github, bitbucket, wire, …
Regulatory compliance reports
Legal documents

The team at ResilientML has dedicated quantitative analysts, machine learning experts and industry-leading engineers to develop this suite of tools, both as APIs and as cloud solutions on Azure and AWS, using R, Python, MongoDB and other technologies.

Description of the project:

Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).

Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among the “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity relates to the quality and provenance of the data that are fed into the statistical models. The source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

A detailed synopsis of the project can be found in the appendix of this proposal to see specifically the stages of Machine Learning considered to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.

The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate this process. The relevancy and approaches to data collection require domain knowledge to identify the most relevant sources of data to extract value from to ensure the data is of highest integrity.

We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.

An important point of distinction to what we offer is that we move beyond the standard approach of bag-of-words and frequency of words based models which are ubiquitous in most NLP sentiment based frameworks, but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature which means that we can extract contribution to sentiment by individual articles, authors, and news sources explicitly.

Another important distinction of our contribution compared to other sentiment-based models is that we don’t just focus on Twitter and social media feeds, which have limited scope to express sentiment. Instead, our approach targets detailed analyst reports, editor-processed news reports, and regulatory reports. Working with these richer, higher quality and more credible data sources is more complex than working with social media based models, and our framework is able to accommodate this.

What problem is your project solving?

Currently, there is a lack of high quality data on the Ocean marketplace – which is to be expected at this stage. In order to trigger a snowball effect that attracts high quality data providers, an initial kernel of high quality datasets needs to be published, staked, and purchased on the marketplace. We will contribute to the provision of such high quality datasets to the Ocean marketplace to drive growth – critical to the success of the protocol.

What is the final product?

ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace to include new high level extracted sentence structure features, such as Context-Free Grammar trees and Dependency Graphs, for the following sectors:

Web3.0 coins (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);
Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);
DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);
Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);
Stablecoins (Tether, USDC, BUSD etc.),

with news topics covering:

DeFi;
Exchanges;
Regulation;
NFT,

and including data from the following additional news sources:

Coindesk;
Cointelegraph;
NewsBTC;
Bitcoinist;
Blockonomi;
Coinspeaker.

Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in Stage 1 of the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.

Furthermore, we will publish a new dataset to provide extracted Semantic features, such as bag-of-words, matrix factorization - deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Figure 1: Text Analytics Value-Add Pipeline (VAP)

Expected ROI

The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.

Here, we focus on the first of these value drivers – since it is the easiest to ballpark.

We make the following assumptions:

Probability of project success = 0.8
Ocean Community gets 0.2% of consume volume.

We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively.

Note:

Let x_t = consumption for month t.

x_t = x_0 * (1 + rate)^t, where t = 0, 1, …, 11 months,

and x_0 = initial_num_users × num_datasets × datatoken_price.
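The growth formula above can be sketched in a few lines of Python. The ROI function below is an assumption about how the stated inputs combine (expected community fees divided by the grant, with consumption and grant expressed in the same currency unit); the proposal itself only reports the results in its figures. The numeric inputs are hypothetical placeholders, not values from the proposal.

```python
def monthly_consumption(x0, rate, months=12):
    """Return [x_0 * (1 + rate)**t for t = 0, ..., months - 1]."""
    return [x0 * (1 + rate) ** t for t in range(months)]


def expected_roi(x0, rate, grant, p_success=0.8, fee_share=0.002, months=12):
    """Assumed ROI: success-weighted community fees over the grant amount.

    p_success and fee_share default to the assumptions listed in the text
    (probability of success 0.8, community share 0.2% of consume volume).
    """
    total_consumption = sum(monthly_consumption(x0, rate, months))
    return p_success * fee_share * total_consumption / grant


# Hypothetical scenario inputs, chosen only to make the example concrete.
initial_num_users = 10
num_datasets = 2
datatoken_price = 100.0
x0 = initial_num_users * num_datasets * datatoken_price

for rate in (0.0, 0.1, 0.2):
    roi = expected_roi(x0, rate, grant=17_600)
    print(f"monthly growth {rate:.0%}: ROI = {roi:.4f}")
```

Changing `rate` reproduces the idea of several growth scenarios; the probability adjustment is applied once, as a multiplier on total fees.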

Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):

The man hours and computation necessary to code, scrape, clean, and process these datasets are substantial. Below we lay out our projected fixed costs to provide these datasets. Variable costs should of course be considered but are omitted for simplicity here. Note: these projections are based on an analysis of preliminary processing using a smaller dataset.

Cost Item	Cost USD
Man Hours	12,000
Computation	5,450
Data Storage	150
Total	17,600

Project Deliverables – Category

ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace – as detailed above.
Furthermore, we will publish a new dataset to provide the extracted Semantic features, such as bag-of-words, matrix factorization - deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Project Deliverables – Roadmap:

Any prior work completed thus far?

The proposed project builds upon the contributions over the past two years of members of ResilientML in building the machine learning pipeline shown in Figure 1. We have performed the processing of data for over 15 assets, and have already published this dataset to the Polygon Ocean marketplace – updating weekly.

Part 1: Python code has been written to perform text data collection via Java-based screen scraping and document collection – this has been unit tested and validated.

Part 2: Python modules have been created with proprietary steps of text data de-noising based upon the concepts provided in the appendix – this has been unit tested and validated.

Part 3: Python modules have been created to extract time series of features related to:

Semantic bag-of-words frequency based features and their corresponding time series.
Grammar based parse trees and their corresponding time series.
Syntax based dependency graphs and their corresponding time series.

These have been unit tested and applied to crypto data. The next stage is to put these steps into a distributed production system and curate the outputs in a JSON data format for an API feed.

Roadmap

Month 1:

Complete prototyping of stages 1 – 4.
Publish datasets to Ocean marketplace.

Month 2:

Maintenance and data support for data buyers.
Submit academic research paper to journal.

Project Details

Further details of the research prototype are provided in the following peer reviewed papers:

Chalkiadakis, Ioannis and Peters, Gareth W. and Chantler, Michael John and Konstas, Ioannis, A statistical analysis of text: embeddings, properties, and time-series modeling.

Available at SSRN (under review).

Chalkiadakis, Ioannis and Zaremba, Anna and Peters, Gareth W. and Chantler, Michael John, Sentiment-driven statistical causality in multimodal systems.

Available at SSRN under the title “On-chain Analytics for Sentiment-driven Statistical Causality in Cryptocurrencies”.

Zaremba, A. and Peters, G., 2020. Statistical Causality for Multivariate Non-Linear Time Series via Gaussian Processes.

Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3609497

Peters, Gareth, Statistical Machine Learning and Data Analytic Methods for Risk and Insurance

Available at SSRN.

Team members

ResilientML consists of 5 team members.

Chair Prof. Gareth W. Peters (CStat-RSS, FIOR, YAS-RSE) - Head of Research

Background:

Linkedin: https://www.linkedin.com/in/gareth-w-peters-3928b4139/
GoogleScholar: https://scholar.google.com/citations?hl=en&user=goDorpkAAAAJ
Affiliations/Prizes: https://researchportal.hw.ac.uk/en/persons/gareth-w-peters/prizes/
PhD University of NSW, Australia
MSc Cambridge University, England

Experience:

Co-founder of ResilientML
20+ years machine learning research
5 research books
200+ journal and conference papers
Successfully delivered projects from grants totalling more than 5 million GBP.

Short Bio

Prof. Gareth W. Peters is the ‘Chair Professor for Risk and Insurance’ in the Department of Actuarial Mathematics and Statistics, in Heriot-Watt University in Edinburgh. Previously he held tenured positions in the Department of Statistical Sciences, University College London, UK and the Department of Mathematics and Statistics in University of New South Wales, Sydney, Australia.

Prof. Peters is the Director of the Scottish Financial Risk Association.

Prof. Peters is also an elected member of the Young Academy of Scotland in the Royal Society of Edinburgh (YAS-RSE) and an elected Fellow of the Institute of Operational Risk (FIOR). He was also the Nachdiploma Lecturer in Machine Learning for Risk and Insurance at ETH Zurich in the Risk Laboratory.

He has given in excess of 150 international invited presentations and speaker engagements, including numerous keynote presentations. He has delivered numerous professional training courses to C-suite executive level industry professionals as well as to numerous central banks.

He has published in excess of 150 peer reviewed articles on risk and insurance modelling, 2 research text books on Operational Risk and Insurance as well as being the editor and contributor to 3 edited text books on spatial statistics and Monte Carlo methods.

He currently holds positions as:

Honorary Prof. of Statistics at University College London, 2018+
Affiliated Prof. of Statistics in University of New South Wales Australia 2015+
Affiliate Member of Systemic Risk Center, London School of Economics 2014+
Affiliate Member of Oxford Man Institute, Oxford University (OMI) 2013+
Honorary Prof. of Statistics in University of Sydney Australia 2018+
Honorary Prof. of Statistics in Macquarie University, Australia 2018+
Visiting Prof. in Institute of Statistical Mathematics, Tokyo, Japan 2009-2018+

He previously held positions as:

Honorary Prof. of Peking University, Beijing, China 2014-2016
Adjunct Scientist in the Mathematics, Informatics and Statistics, Commonwealth Scientific and Industrial Research Organisation (CSIRO) 2009-2017

Webpage: https://www.qrslab.com/

Gordon Gay – CEO

Background :

Linkedin: https://www.linkedin.com/in/gordon-gay-6323542/
MEngSc Monash University
MBA Melbourne University, Melbourne Business School, The University of Melbourne

Experience:

Co-founder of ResilientML
23 years R&D at NEC Australia, roles - GM of R&D, National Head of Innovation

Matthew Ames – CTO / Co-Head of Research

Background:

Linkedin: https://www.linkedin.com/in/matthewames87/
PhD Statistics, University College London
2 years Postdoctoral research

Experience

5 years industry experience - machine learning, finance

Phong Nguyen – Principal Engineer

Background:

Linkedin: https://www.linkedin.com/in/phong-nguyen-0456b912/
Masters Adelaide University

Experience:

20+ years industry experience - R&D, Wireless technologies, Systems Engineering – engineering solutions realisation
Lead systems engineering and technology development at NEC
Creator of the first-to-market 3.6 & 7.2 Mbps HSDPA SoC (System on Chip), prototype for LTE technological trial, LTE/LTE-A SoC, and Multi-RAT programmable SDR platform
Inventor of 57 SEPs (standard essential patents) and CEPs (commercial essential patents) on Bluetooth, 3G, 3.5G, 4G and 5G wireless technologies

Ioannis Chalkiadakis – Data Scientist / Natural Language Processing

Background:

Linkedin: https://www.linkedin.com/in/imchalkiadakis/
Masters of Engineering, National Technical University of Athens
PhD candidate, Quantitative Risk Solutions Lab, Heriot-Watt University, Edinburgh

Experience:

3 years Software Engineering

Appendix: Detailed Project Description

Extracting Value from Text Data

Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity.

The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among the “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity relates to the quality and provenance of the data that are fed into the statistical models. The source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

Importance of text pre-processing

With any type of data collected from real world processes, it is usually the case that a set of “clean-up” or pre-processing transformations are required before using them for the statistical processing.

The pre-processing procedures will remove the noise from the data which will allow us to operate on the actual information we want to process. In this way we will not only ensure the veracity we want to achieve, but will also obtain efficiency and computational benefits.

Statistical text processing: Pipeline

In general, we can identify three stages for the statistical analysis of text data:

• data import,

• data wrangling, and

• finally, development and evaluation of the statistical model.

The first step of importing the data consists of either loading an already existing dataset, or alternatively collecting one’s own set of data, for example via scraping web pages, scanning/optical character recognition (OCR) of printed documents or transcribing spoken text. The data import, however, does not guarantee that the dataset will be in a format that facilitates subsequent processing.

Therefore, we need to go through the process of “tidying” the data, where one constructs “data frames”, i.e. tabular structures, where each variable is stored in its own column and each observation occupies one row.
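To make the "tidy" layout concrete, here is a minimal sketch that turns raw articles into one row per token, with each variable in its own column. The column names (`article_id`, `source`, `position`, `token`) and the two sample articles are illustrative assumptions, not the project's actual schema.

```python
import csv
import io

# Hypothetical raw input: one dict per article.
articles = [
    {"article_id": 1, "source": "site-a", "text": "Markets rally today"},
    {"article_id": 2, "source": "site-b", "text": "Regulators meet"},
]

COLUMNS = ["article_id", "source", "position", "token"]


def to_tidy_rows(articles):
    """One observation (a token occurrence) per row, one variable per column."""
    rows = []
    for article in articles:
        for position, token in enumerate(article["text"].split()):
            rows.append({
                "article_id": article["article_id"],
                "source": article["source"],
                "position": position,
                "token": token,
            })
    return rows


rows = to_tidy_rows(articles)

# Write the tidy table as CSV to an in-memory buffer and show it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With the data in this shape, later filtering or aggregation steps operate on columns rather than on nested free text.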

This process will create a tidy dataset and will facilitate subsequent data transformations, visualization and processing. Creating a tidy dataset and applying the necessary transformations or visualization methods constitutes the process of “data wrangling”.

During modelling, it might be necessary to apply additional transformations on the data, hence there is a feedback loop between the data wrangling and modelling stages.

Noise in Text Data & its Removal During Data Wrangling.

What constitutes noise in raw text, and under what conditions may it be introduced into our data?

Obvious noise artefacts are:

• encoding scheme (representation),

• word mis-spellings,

• errors in the linguistic structure (grammar or syntax),

• missing spaces or punctuation symbols and

• wrong capitalization patterns.

These types of noise patterns are usually introduced at the creation stage of the raw text, and are challenges that are expected in natural language applications.

However, noise may appear in non-obvious forms as well. Users of communication services, for instance SMS, e-mails, instant messages, or social media posts, often use abbreviations, emoticons, or even omit certain words.

These patterns, depending on the application, could hinder the processing of the raw data. For example, when analyzing sentiment from Tweets, most researchers will want to consider emoticons as they can be very expressive about the feelings of the author of the Tweet.

On the other hand, if someone strictly wants to analyze the lexical or grammatical patterns that appear frequently among Twitter users, information based on emoticons is potentially irrelevant, in which case it is noise and has to be removed.

This domain specificity of noise patterns also applies to additional noise sources that are considered standard in NLP, namely “stopwords” and punctuation.

The term stopwords refers to words that are not considered useful for the intended analysis because they lack discriminative power (e.g. appear too often in the dataset) or lack significant semantics, namely terms such as “a”, or “the”.

Stopword removal is considered a standard part of the pre-processing pipeline, is usually performed early in the pre-processing stage, and most NLP software packages come with standard predefined stopword lists.
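A minimal sketch of stopword filtering follows. The stopword set here is a tiny hand-picked example chosen for illustration; real NLP packages ship much longer predefined lists, and the appropriate list depends on the domain.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercased form appears in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "The price of a token is rising".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Passing a different `stopwords` set is how a domain-specific list would be swapped in.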

Basic pre-processing steps

The list is not exhaustive:


1. Punctuation

Often punctuation marks (such as , . ! ? ; # “” ‘’ ~) are removed, for example when one aims to analyze counts of terms, and therefore punctuation becomes unnecessary. However, similar to stopwords, there are cases when all or a subset of punctuation marks are useful and are therefore desirable to maintain. For example, exclamation marks may reveal sentiment information, or some symbols may carry special meaning in certain contexts, such as the hashtag (#) symbol in Tweets where it relates to the Tweet semantic content. Finally, it is important to consider at which stage of the analysis one should remove punctuation. If we want to detect sentence boundaries, or perform syntax or grammar parsing, then it is important to maintain punctuation symbols before performing these stages. Once this type of analysis has finished, it may be safe to remove punctuation if it is required for further analysis steps.
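The ordering point above (detect sentence boundaries first, strip punctuation afterwards) can be sketched as follows. The sentence splitter is deliberately naive and the `keep` parameter is a hypothetical convenience for retaining symbols such as the hashtag; neither is taken from the ResilientML code.

```python
import re
import string


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def strip_punctuation(sentence, keep=""):
    """Remove ASCII punctuation, except characters listed in `keep`.

    Note: string.punctuation covers ASCII marks only, so curly quotes
    would need to be added explicitly.
    """
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return sentence.translate(str.maketrans("", "", drop))


text = "Prices jumped! Was #DeFi the cause? Analysts disagree."

# Step 1: sentence boundaries are found while punctuation is still present.
sentences = split_sentences(text)

# Step 2: punctuation is removed afterwards, keeping hashtags.
cleaned = [strip_punctuation(s, keep="#") for s in sentences]
print(cleaned)
```

Reversing the two steps would leave the splitter with nothing to split on, which is the failure mode the paragraph warns about.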


2. Numbers

Numbers are often removed because most of the time they do not contribute semantic information, but one should carefully consider the application context before doing so. If the domain requires the extraction of dates, or of case numbers when processing legal documents, then specific rules should dictate the conditions under which numbers are removed from the text.


3. Lowercase

Lowercasing all terms is applied to reduce the vocabulary space, i.e. the set of words we expect to come across. This is useful for reducing the computational and space complexity in applications where we work with large sparse matrices of word counts. However, there are cases where uppercase letters reveal structural information: they can help identify sentence boundaries, or proper names, and can also help reduce ambiguity; for example distinguishing the proper name Rose from the noun denoting the flower rose.


4. Stemming

Stemming is an additional technique that aims to reduce the vocabulary space. It consists of removing any inflections from a word and reducing it to its most basic form. For example, a stemmer (the program that does the stemming) will map “walked”, “walking” and “walks” to the stem “walk”. Note that in the case of e.g. “studies”, the stemmer will return a basic word form (“studi”) that is not itself a valid word. This is because stemming does not account for the grammatical or syntactical pattern behind the inflection - it only cuts it off. An alternative method is lemmatisation, where the root lexeme is returned, which means that lemmatisation returns terms that are in the language. For example, a lemmatiser will map “studying” and “studies” to “study”. This is achieved by considering the part-of-speech of each term (e.g. is it a noun, verb, adjective or adverb?) in order to determine the correct root form. As with punctuation, if we want to perform syntactic or grammatical analysis we have to postpone stemming and lemmatisation until after these stages.


5. Stopword removal

As we have discussed, this step should be treated very carefully, so as to minimize both information loss and the noise remaining in the dataset. It is therefore recommended that the standard stopword lists in software packages be checked, and modified accordingly, before being used.


6. Word compounds

Word compounds are groups of words, usually groups of two (bigrams) or three (trigrams) that frequently appear together and convey a different meaning than if we consider each one individually. For example, the trigram “Wall Street Journal” denotes the name of a popular newspaper and we would like to account for it as a single term, when we want to extract its appearance in the dataset. If we do not, we can still identify the valid terms of “wall”, “street” and “journal” individually, however we ignore the fact that they refer to a newspaper rather than carry their separate meanings.


7. Remove low-frequency words

It is also common practice to remove extremely rare words, namely words that constitute less than a small fixed percentage (< 0.5 - 1 %) of the document corpus, again to reduce the computational and space complexity.

At the end of these processes we will have tokenized, cleaned and wrangled text data that has been prepared for feature extraction and data analysis in NLP-based machine learning.
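The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the ResilientML pipeline: the stopword list, the compound table, the `toy_stem` suffix stripper (which is not the Porter algorithm) and the reading of "rare" as a share of all corpus tokens are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists should be reviewed per domain.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}
# Hypothetical compounds that we want to keep as single tokens.
COMPOUNDS = {("wall", "street", "journal"): "wall_street_journal"}


def toy_stem(token):
    """Crude suffix stripping for illustration only (not the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def merge_compounds(tokens):
    """Replace known multi-word compounds with a single joined token."""
    merged, i = [], 0
    while i < len(tokens):
        for words, joined in COMPOUNDS.items():
            if tuple(tokens[i : i + len(words)]) == words:
                merged.append(joined)
                i += len(words)
                break
        else:  # no compound starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(text, keep_numbers=False):
    """Apply the basic steps to one document and return its tokens."""
    text = text.lower()                                  # 3. lowercase
    if not keep_numbers:
        text = re.sub(r"\d+", " ", text)                 # 2. numbers
    text = re.sub(r"[^\w\s]", " ", text)                 # 1. punctuation
    tokens = merge_compounds(text.split())               # 6. word compounds
    tokens = [t for t in tokens if t not in STOPWORDS]   # 5. stopword removal
    return [toy_stem(t) for t in tokens]                 # 4. stemming


def drop_rare(docs, min_share=0.005):
    """7. Drop tokens whose share of all corpus tokens is below min_share."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return [[t for t in doc if counts[t] / total >= min_share] for doc in docs]


docs = [preprocess("The Wall Street Journal reported that Bitcoin walks higher in 2021!")]
print(drop_rare(docs))
```

The order of the steps is itself a design choice: here stemming runs last so that stopwords and compounds are matched on unstemmed tokens, and anything that needs sentence boundaries or part-of-speech information would have to run before `preprocess` is applied.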

Feature Extraction Methods – Time-Series of NLP Text Features

We identify three distinct categories:

Semantics : namely the meaning behind words and sentences and the coherence of a well-formed text - Bag-Of-Words (Frequency Based Features).

The way we capture semantics is based on the bag-of-words model (BoW), which has been widely applied in natural language processing (NLP) and information retrieval (Harris, 1954). The main concept behind BoW is to map a segment of text to an unordered collection, or “bag”, of words. As we have seen, this is the premise for the construction of document-term matrices for a corpus of documents; in its original formulation it is applied to each complete document in a collection of documents (a corpus) and ignores the sequence of words in the text. We are transferring BoW into a time-series context and present an “online” formulation. This allows us to overcome computational difficulties associated with BoW, namely the handling of sparse matrices whose size depends on the number of distinct document words and the corpus size, and may well be in the order of hundreds of thousands. In addition, this setting allows us to construct a text-based time-series that can be incorporated into a time-series based system for supervised or unsupervised learning.
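To make the idea of a text-based time-series concrete, here is a minimal sketch. It is not the online formulation referred to above, whose details are not given here; it simply assumes a fixed vocabulary known in advance and a stream of already tokenized documents tagged with a time bucket (both hypothetical), and accumulates one count vector per bucket as documents arrive.

```python
def bow_time_series(stream, vocabulary):
    """Aggregate bag-of-words counts per time bucket, one document at a time.

    stream: iterable of (bucket, tokens) pairs in arrival order (assumed input shape).
    vocabulary: fixed list of terms to track (assumed to be known in advance).
    Returns a dict mapping bucket -> list of counts aligned with vocabulary.
    """
    index = {term: j for j, term in enumerate(vocabulary)}
    series = {}
    for bucket, tokens in stream:
        # Create the bucket's vector on first sight, then update it in place.
        vector = series.setdefault(bucket, [0] * len(vocabulary))
        for token in tokens:
            j = index.get(token)
            if j is not None:  # tokens outside the vocabulary are ignored
                vector[j] += 1
    return series


stream = [
    ("2021-07-28", ["bitcoin", "price", "bitcoin"]),
    ("2021-07-28", ["ocean"]),
    ("2021-07-29", ["price", "hash"]),
]
series = bow_time_series(stream, ["bitcoin", "ocean", "price"])
print(series)
```

Because each document only updates the vector of its own bucket, memory grows with the number of buckets times the vocabulary size rather than with the number of documents, which avoids building the full document-term matrix.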

Grammar : i.e. the structural rules that dictate how words fit into the sentence and form groups such as clauses and phrases - parse tree or constituency tree for n-grams/sentences.

Consider the following example

The brown dog is running in the park.

which can also be written in an equivalent grammatical manner:

He is running in the park.

without destroying the grammar or meaning of the sentence.

The fact that a group of words can operate as a single unit - and therefore in our example we can replace the phrase “the brown dog” with “he” - is the linguistic property of constituent structure.

One can therefore extract features from the grammatical rules that tell us which words can be grouped into units, and study those units for their role in the sentence.

The formal system for studying this phenomenon, i.e. the grouping of words as in the above example, is the context-free grammar (CFG). Mathematically, a CFG is defined by a quadruple as follows: 𝐺=(𝑁,Σ,𝑅,𝑆) where

𝑁 is a set of non-terminal symbols

Σ is a set of terminal symbols, 𝑁∩Σ=∅

𝑅 is a set of rules, 𝑅={𝐴→𝛽:𝐴∈𝑁 and 𝛽∈(Σ∪𝑁)∗}

𝑆 is the designated start symbol, 𝑆∈𝑁

Such grammars are called “context-free” because the left-hand side of each rule consists of exactly one non-terminal symbol, so a rule can be applied regardless of the symbols surrounding that non-terminal.

A context-free grammar defines a formal language, which is the set of strings of terminal tokens that can be derived starting from 𝑆.

A sentence is called grammatical if the sequence of terminal tokens that comprises it can be derived by following the rules of the CFG; otherwise, the sentence is not valid according to the language of the CFG (ungrammatical).

The process of analyzing the constituent structure of a sentence is called constituency parsing, and the derivation of a sentence, i.e. the rules that we followed when building it, can be represented with a hierarchical structure, a tree, which is called the parse tree or constituency tree.
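As an illustration of the quadruple 𝐺=(𝑁,Σ,𝑅,𝑆) and of constituency parsing, the sketch below defines a toy grammar that covers only the two example sentences and parses them with the CYK algorithm. The grammar, the non-terminal names (`Nom`, `VG` and so on) and the choice of CYK are assumptions made for this example; CYK requires rules in Chomsky normal form, so every rule here has either two non-terminals or a single terminal on the right-hand side.

```python
# Toy grammar G = (N, SIGMA, R, S); all names are hypothetical and chosen for this example.
N = {"S", "NP", "VP", "VG", "PP", "Nom", "Det", "Adj", "Noun", "Aux", "V", "P"}
SIGMA = {"the", "brown", "dog", "he", "is", "running", "in", "park"}
R = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "Nom")), ("NP", ("he",)),
    ("Nom", ("Adj", "Noun")), ("Nom", ("dog",)), ("Nom", ("park",)),
    ("VP", ("Aux", "VG")),
    ("VG", ("V", "PP")),
    ("PP", ("P", "NP")),
    ("Det", ("the",)), ("Adj", ("brown",)),
    ("Noun", ("dog",)), ("Noun", ("park",)),
    ("Aux", ("is",)), ("V", ("running",)), ("P", ("in",)),
]
S = "S"


def parse(tokens):
    """CYK parsing: return one parse tree as nested tuples, or None if ungrammatical."""
    n = len(tokens)
    # chart[i][j] maps a non-terminal to a tree that derives tokens[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for lhs, rhs in R:
            if rhs == (tok,):  # terminal rule A -> tok
                chart[i][i + 1][lhs] = (lhs, tok)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for lhs, rhs in R:
                    if len(rhs) == 2 and rhs[0] in chart[i][k] and rhs[1] in chart[k][j]:
                        tree = (lhs, chart[i][k][rhs[0]], chart[k][j][rhs[1]])
                        chart[i][j].setdefault(lhs, tree)
    return chart[0][n].get(S)


tree_dog = parse("the brown dog is running in the park".split())
tree_he = parse("he is running in the park".split())
print(tree_dog)
print(tree_he)
```

In both trees the first child of `S` is an `NP` node, once spanning "the brown dog" and once spanning "he", which is the constituent structure property described above: the two phrases are interchangeable because they derive from the same non-terminal.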

Syntax : that is the principles that dictate the structure of sentences by specifying the order and role of each word in the text - Dependency Graphs.

The goal of syntactic analysis is to discover the pairs of words in which one word depends on the other, and the type of that dependence.

These dependency relations are binary and asymmetrical, and therefore we would like to know which of the two words acts as the head that is modified in some way, and which is the dependent that modifies or complements the head. This concept allows us to think of the dependency relations as inducing graph structures (dependency graphs) which we use to study the dependency relations between words, and therefore the syntax of a sentence. The syntactic analysis complements the grammatical, parse tree-based analysis, as now we aim to extract information on the functional role of each word in the sentence, rather than structural relations between them as we did with the context-free grammar.
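A dependency graph for the earlier example sentence can be represented as a list of labelled head-to-dependent arcs. The arcs below are hand-written for illustration, loosely following Universal Dependencies relation names; they are an assumption of this sketch and not the output of any parser or of the ResilientML pipeline.

```python
TOKENS = ["The", "brown", "dog", "is", "running", "in", "the", "park"]

# (head index, dependent index, relation); hand-annotated for illustration only.
ARCS = [
    (2, 0, "det"),    # dog -> The
    (2, 1, "amod"),   # dog -> brown
    (4, 2, "nsubj"),  # running -> dog
    (4, 3, "aux"),    # running -> is
    (7, 5, "case"),   # park -> in
    (7, 6, "det"),    # park -> the
    (4, 7, "obl"),    # running -> park
]


def build_graph(tokens, arcs):
    """Index the arcs by dependent and by head, and find the root word(s)."""
    head_of = {}
    children = {i: [] for i in range(len(tokens))}
    for head, dep, rel in arcs:
        if dep in head_of:
            raise ValueError("each word may have only one head")
        head_of[dep] = (head, rel)
        children[head].append((dep, rel))
    # A root is a word that is not the dependent of any other word.
    roots = [i for i in range(len(tokens)) if i not in head_of]
    return head_of, children, roots


head_of, children, roots = build_graph(TOKENS, ARCS)
for dep, (head, rel) in sorted(head_of.items()):
    print(f"{rel}({TOKENS[head]}, {TOKENS[dep]})")
print("root:", [TOKENS[i] for i in roots])
```

Storing each relation as a directed arc reflects the binary, asymmetric nature of dependencies: `head_of` answers "what does this word depend on?", while `children` answers "what modifies this word?", and simple features such as the relation labels attached to a word can be read off either index.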

ResilientML · July 29, 2021, 3:13pm

We have submitted our proposal for Round 8 for consideration by the community, and are keen to hear any feedback.

ResilientML as a company will be unable to keep curating, maintaining and updating the FaaS data markets on Ocean if there is no demand side to these markets and no grant funding. So if the community wants our service as part of the ecosystem, I ask you to seriously consider not downgrading our project as in the previous round, and to consider our service as a state-of-the-art FaaS service for Ocean that we are sure can add increasingly valuable data in the midst of a lot of garbage noisy data sets currently present.

realdatawhale · July 29, 2021, 7:16pm

Hello ResilientML team!

Hope you guys are doing great and thank you for briefly speaking about your work during the last OceanDAO Townhall.

First of all, your team has put a lot of effort in outlining a very detailed proposal! Browsing through your application, I’ve got a couple of observations that I wanted to share and I hope my feedback helps you to get to your desired outcome.

I found the proposal structure challenging to understand. There’s a lot of information and the key outcomes are difficult to spot, amidst often very detailed descriptions.
As your main objective, I understand that you are targeting to drive consumption of ResilientML’s quality data on the Marketplace. That’s great! I’m sure the data you are offering adds value to an audience. However, what are your objectives to encourage buyers? Who are your buyers? How are you planning to target them, make them aware about your datasets?
Following your last grant, ResilientML’s current dataset “TASPEL” is launched on Polygon as a fixed price asset. This means that the main incentive for most market participants is non-existent (i.e. staking, buying your Data Token with the objective to trade etc.). Right now, the OceanDAO is financing the curation of a dataset, which should be financed through Data Token sales. That’s where I currently see the biggest challenge for your value proposal to the Ocean eco-system and its current actors (stakers, traders, speculators). Consumers will only come into the picture later in my opinion.

I hope you don’t get us wrong here! I think the Ocean eco-system needs participants like ResilientML and surely your datasets will attract an important audience to the market in the long-run. Possibly you can make the proposal a little more relatable for potential voters that may not have an extensive knowledge in your field. The “what’s in it for us” question should always be answered, encouraging voters. A Data Token liquidity pool may do the trick, although I do believe that OceanDAO grants should always entail a more far-reaching benefit to the eco-system in addition to the curation of a dataset.

Let us know your thoughts and hope this helps to look at it from a different viewpoint.

Gareth · July 29, 2021, 10:50pm

Dear Realdatawhale

Thank you for your feedback - please see below our response to some of your raised comments. As always we appreciate an open discussion on different perspectives and whilst we don’t agree with your characterisations we will try to explain some different perspectives that may help you understand better what we offer to the community.

The proposal is intentionally detailed as we want to make sure that those who choose to vote for our proposal will fully understand the careful and considered framework of our FaaS service. We have been successful in writing grants for many years for millions of pounds in academia and we bring with us the skill set learnt in being detailed and precise to the Ocean proposal system. Whilst such precision and detail may seem unusual compared to perhaps other grants, we believe it reflects the considered effort we are making to explain exactly what Ocean Dao members would be voting for in our project to justify the funding.

I believe the key outcomes are clearly stated in the grant after we provide context to the FaaS concept and service we have already been developing in Ocean over the last 8 months.

For instance we stated the following:
" This proposal is to:

continue to extend, develop and further curate our initial development of a Crypto-specific Natural Language Processing Data Suite, and
produce an online dashboard that will show in real time the status of the Real News Crypto dataset.

The grant will go to help fund the coding, NLP model development, man hours and compute hours required to extend our current published dataset. We aim to continue to grow this dataset and expand its usability by further extracting value from the corpus of crypto related financial news, and making it more accessible to the Ocean community. … "

We believe if you take the time to read this carefully you will see our objectives clearly stated. If you still have further questions in this regard after a more considered review of the proposal we are happy to address any questions or concerns further.

We are not just offering “Data” but rather a framework of “Features-as-a-Service” which as we explain is fundamentally different to just providing raw data. This is the real value add in that we process and curate the data into feature libraries that are direct inputs into Machine Learning training and validation once compute-to-data services are active. We believe it will be very important for Ocean users to understand that they can access pre-processed curated data feature libraries rather than unprocessed, raw, poorly formatted data sets.

Regarding buyers, we discuss the use cases for our FaaS but let me reiterate this again: obviously the buyers of such a data set would be people actually wanting to train deep learning, transformer, LSTM, sentiment, Gaussian process or causality models for NLP tasks such as chat bots, sentiment models, classifications etc. There is a significant market for such data sets already and this has not yet reached the crypto space in a significant manner, which is why we believe our service in Ocean will be the first FaaS service of this kind for ML data scientists to purchase and use on an ongoing basis for data modelling.

Regarding targeting buyers - we have a Discord channel, a website and we have been very active in the community of Ocean at every Dao for the last 8 months or more to try to integrate our services into the Ocean ecosystem. Furthermore, being well connected to a variety of professional services we will eventually begin to roll out our markets to commercial advertising, marketing and ML companies, as well as generate adoption for our data FaaS services in academic circles.

Actually this is wrong, the ResilientML service FaaS has been launched on Ocean for more than 6 months now and we have maintained weekly updates for this product and automated much of the processes to produce increasing value and quality to this data set. The proposal already explains how much this has grown in volume of curated features over these 6 months.

Note - it is not the correct perspective to think of selling this data “only as an asset for pure speculative trading”. In our opinion you can think of these markets as with any commodity market - there will be speculators (as you describe) and there will be hedgers and real participants wanting to receive regular delivery of the physical good for processing and refinement into products. We see ourselves as primary producers of a FaaS commodity that data scientists will consume for regular modelling tasks and products built upon our FaaS data service, not purely for speculative trading as you describe. Personally I think what you describe is only a partially formed understanding of the possible market economics, so we hope you will now see better our perspective on this matter and why we have set a fixed price market to begin with: to encourage consumption from data scientists.

We hope this clarifies your questions raised and we truly thank you for your comments and time to consider our proposal. Please don’t hesitate to further reach out with any questions that remain and also feel free to engage on our Discord for ResilientML:

Vantagecrypto · July 31, 2021, 3:46pm

Great detailed proposal here. As mentioned on the previous townhalls and I think abundantly clear, our projects have a lot of potential to work together. We have a significant interest in social sentiment data and are excited to see you working to expand your data set. You can definitely count on our support for this initiative!

Gareth · July 31, 2021, 6:29pm

Thanks for the vote of confidence - we will definitely be looking to explore methods to engage with your platform at Vantage and your solution going forward and we strongly agree that there is a lot of synergies to explore regarding sentiment signals, trading signals based on causal ML methods incorporating sentiment for financial time-series, LOB modelling, volatility dynamics in crypto markets, NFT valuations markets, crypto derivatives markets and tokenised stocks. Look forward to seeing your success grow and discussing further.

Cipher · August 2, 2021, 8:26pm

Even the summary response was too detailed :smiley:
Congratulations on the project, it surely is an interesting submission to review.

Gareth · August 2, 2021, 8:43pm

Thanks for your vote of confidence - just feel free to skip any parts you don’t understand or wish to follow detail on. Coming from academia where detail and precision are important, I am really not used to such vagary and imprecision. Each of us has different backgrounds, so this makes it an interesting challenge to communicate.

Cipher · August 6, 2021, 10:50am

I went again through the descriptions and tried to understand, I get the feel of it while skipping detailed tech solution. I hope it brings value to few other projects also that are working in the space.
I am voting positive.
Is the solution useful for others? Is it open source or are you ready to work with others?

Gareth · August 30, 2021, 7:27pm

ResilientML - Proposal Checklist Complete for Round8.

[Deliverable Checklist]

ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace.

In this regard, we have added the following new sources of text and articles: Ocean Blog, Coin Telegraph to complement the existing sets of news information from Crypto daily, crypto slate and coindesk. These will be ongoing new sources of text data that we are continuing to grow our dataset with and provide feature libraries.

We will publish a new dataset to provide the extracted Semantic features, such as bag-of-words, matrix factorization - deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP)

In this regard, we have written the Python code to produce more advanced feature sets and we have scoped the JSON data format we will utilise to present these to the market. So far we have been creating spectral features and matrix factorisations for Document-Word, Document-Document and Word-Word matrices. These will be uploaded in our JSON market files as they are processed.

Maintenance and data support for data buyers.

The data is continuing to be updated on a weekly rolling basis. Furthermore, we have created a Python application that will be hosted and run from Streamlit and that will allow users to interact with the summary features of our data library and feature libraries. This will showcase a detailed analytics dashboard to provide clarity to the market on exactly the size of our data and feature library, and the content by topic, by token and by news source. We will continue to maintain and update this new hosted analytics tool to also provide sentiment and other analytics to do with authors and news source decompositions.

Submit academic research paper to journal

We have submitted a journal paper to a leading computational econometrics journal in which we study, using our data set and a new class of time-series and Deep learning models, the relationship between price, technology factors such as hash rate, and sentiment constructed from our market place data. This paper is currently in peer review and once accepted we will be able to also release the preprint version to the community along with a GitHub repository for the code, which we will enhance and turn into a compute-to-data module in future once peer review is complete. This is a quite advanced time-series model that includes an Autoregressive distributed lag structure, a Mixed data sampling (MIDAS) structure and transformed Deep Neural Networks integrated into a long-memory analysis of how price influences sentiment in the crypto setting.

@AlexN

Gareth · September 5, 2021, 4:57pm

Thanks for your feedback on this - just realised we didn’t respond to your last question - we are indeed highly collaborative in nature and we are hoping to be able to make the solution useful to the wider Ocean markets and OceanDao teams… please reach out if you want to discuss further.