[ResilientML] - Python Analytics Suite on Streamlit.io and Data Sets Extension and Expansion

ResilientML · October 2, 2021, 3:11pm

Key Project Data

Name of project:

ResilientML – Python Analytics Suite on Streamlit.io and Data Sets Extension and Expansion

Team Website:

https://www.resilientml.com/

Proposal Wallet Address:

0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

Which category best describes your project?

Unleash data

Funding Requested

$50,000

Extend the analytics suite and maintain three core NLP data sets for our Features-as-a-Service (FaaS) model

Development of a FaaS ResilientML market place on Ocean, Analytics toolbox and Streamlit.io user interface.

The ResilientML framework is providing progressively enhanced and maintained, professionally curated state-of-the-art text and NLP feature libraries to the Ocean market place under the Polygon markets as three distinct FaaS feature library data sets (all constantly updated and maintained weekly with automated data-collection, processing and production pipelines):

Tasty Pelican token – TASPEL-27. Real Crypto News Sentiment

Weekly Updated
Contains n-gram structures in curated JSON format data structures, as well as Python pickle files, for 10 million+ tokenized news articles on cryptocurrency projects. They are curated by date, source, author and topic, making this data source ideal for NLP tasks such as sentiment construction. A detailed description of the data structure and content is available at:
https://market.oceanprotocol.com/asset/did:op:55d4346C439A5f9Ccf2Cb1802D56Cd5A70Be82c7

Passionate Cormorant token – PASCOR-89. Real Crypto News Sentiment Sentence Features

Weekly Updated
Contains the noise-free, processed article sentences for all current news sources and assets for the complete period that the Real News Crypto dataset covers (2017-present). This dataset will be maintained and extended in the following weeks to include higher-level structural features (grammar trees and syntax graphs) of the provided sentences. The dataset is now available at a fixed price at:
https://market.oceanprotocol.com/asset/did:op:a62962De45BEE6C99cae7a685D4f8ee1EaB7825C

Invidious Penguin Token – INVPEN-41. Real Crypto News Sentiment Spectral Features

Weekly Updated
This dataset contains the top-3 tokens’ and articles’ spectral components (eigenvalues and eigenvectors of the word-word and document-document matrices) for the weekly article collections in the spanned period. Note that the dictionary tokens belong to a custom cryptocurrency-specific vocabulary, which has been sentiment-annotated and is therefore expressive of market sentiment for the covered period. This is particularly informative for market participants wanting to gain insights into the market dynamics from an investor-sentiment perspective. The dataset is now available at a fixed price at:
https://market.oceanprotocol.com/asset/did:op:E0AB818DB99dd89548512407A328A9D7Bf6f994a
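To make the idea of "spectral components" concrete, here is a minimal sketch of how leading eigenvalues and eigenvectors of a word-word matrix could be computed. It is an illustration only, not the INVPEN-41 production code: the toy documents, the vocabulary, the function names and the use of power iteration with deflation (instead of a full eigendecomposition from a numerical library) are all assumptions made for this example. The document-document matrix would be built the same way from X X^T instead of X^T X.

```python
import math
import random
from collections import Counter


def doc_term_matrix(docs, vocab):
    """Rows are documents, columns are vocabulary terms, entries are counts."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in vocab])
    return rows


def word_word_matrix(rows):
    """Word-word matrix X^T X (symmetric, positive semi-definite)."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def top_eigenpairs(matrix, k=3, iters=500, seed=0):
    """Leading k eigenpairs of a symmetric PSD matrix via power iteration."""
    rng = random.Random(seed)
    n = len(matrix)
    m = [row[:] for row in matrix]  # working copy that gets deflated
    pairs = []
    for _pair in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for unit vector v
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflation: remove the component just found before the next pass
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


# Hypothetical toy corpus and vocabulary
vocab = ["bitcoin", "rally", "crash"]
docs = ["bitcoin rally", "bitcoin crash"]
ww = word_word_matrix(doc_term_matrix(docs, vocab))
for eigenvalue, eigenvector in top_eigenpairs(ww, k=2):
    print(round(eigenvalue, 6), [round(x, 3) for x in eigenvector])
```

For corpora of realistic size a sparse eigensolver from a numerical library would be the practical choice; the pure-Python version is used here only to keep the sketch self-contained.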

We aim to provide professional-level curated datasets for ML Compute-to-Data users to be able to access data and feature libraries that are weekly maintained, updated and extended, and which are in standard professional data formats with standardised feature templates in JSON and Python pickle, ready for training and fine-tuning of time-series models, deep neural network models, Transformer models, LSTMs etc.

Analytics Tool: Python Streamlit.io User Data Interrogator

In addition, ResilientML have been developing a live interactive analytics tool that users can use to explore the attributes, quality, variety, veracity, complexity and completeness of our feature library on a real-time basis. This should drive the understanding and adoption of the market place for our FaaS as it will provide clear and detailed overviews for users of what they would be purchasing or holding when they buy the ResilientML data and FaaS product.

Updates on Features-as-a-Service model

Using the previous rounds of funding, ResilientML has been producing a weekly growing automated data market for Natural Language Processing Feature Libraries (NLPFL) on Ocean.

We now have three data sets produced and available on Ocean market places (Polygon).

The following process has been automated and executed in Cloud infrastructure on a weekly basis (see figures below):

Scan increasing numbers of news sources (currently 4+) automatically and extract news reports on targeted topics related to key areas in the crypto sphere;
Process text into cleaned format – as outlined in proposal stages of data munging;
Extract key features from text in n-gram formats, sentence format;
Construct the JSON data library update, automate its transfer to box storage, and update the Ocean market every week at midnight on Saturday.
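The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the ResilientML pipeline itself: the record fields (`date`, `source`, `author`, `topic`, `body`), the function names and the cleaning rules are hypothetical and do not describe the published JSON schema.

```python
import json
import re
from collections import Counter


def clean(text):
    """Strip leftover HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_record(article):
    """Turn one scraped article into a JSON-serialisable feature record."""
    body = clean(article["body"])
    tokens = re.findall(r"[a-z0-9']+", body.lower())
    return {
        "date": article["date"],
        "source": article["source"],
        "author": article["author"],
        "topic": article["topic"],
        "sentences": split_sentences(body),
        "unigrams": dict(Counter(tokens)),
        "bigrams": dict(Counter(ngrams(tokens, 2))),
    }


# Hypothetical scraped article
article = {
    "date": "2021-10-02",
    "source": "example.com",
    "author": "A. Writer",
    "topic": "DeFi",
    "body": "<p>Bitcoin rallied.  Bitcoin fell!</p>",
}
record = build_record(article)
print(json.dumps(record, indent=2))
```

In a weekly job, records like this would be appended to the existing library before the updated file is uploaded; the scheduling and upload steps are left out of the sketch.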

As of 02/10/2021 the market place has data for Features-as-a-Service summarised as follows:

4 major crypto news sources
19 crypto projects that are reported on
8 news categories relevant to the crypto space
200,00+ articles
14,163,060 processed text tokens,

The processing of which has been the result of more than 3 years of research into statistical natural language processing.

Furthermore, the following processes have been in development and will be offered on the market place both as data, and as Compute-to-Data modules, once the latter is fully supported:

Segment text into topics and append to previous feature libraries,
Process higher order features sets – CFG trees, Dependency graphs, spectral features and sentiment indices,
Create summary overview of changes made, current special features that week, KPIs and total records to date (currently partially in effect, and presented in prototype of the Analytics App).

The same data suite will also be available for COVID-19 news articles in the next stage of our FaaS project, and data from a range of other sectors will follow (e.g. climate).

Finally, the following Compute-to-Data modules are scheduled to be released, all of which are based on research outputs in peer-reviewed academic journals from the ResilientML team:

Mixed-frequency data modelling for forecasting investors’ sentiment in crypto markets;
Epidemiological models for COVID-19 infected cases, enhanced with public sentiment.

Analytics and content visualisation of TASPEL-27 Real News Crypto Dataset

A prototype of the Analytics App for TASPEL-27 has been developed, presenting KPIs (linked to the dataset volume, richness and update rate) and data summaries of Real News Crypto to the potential data buyer and consumer. The application will be hosted online once the prototype is completed and will be updated weekly to reflect the current dataset state.

This is a significant advantage of the Real News Crypto Dataset as it offers transparency to the buyer and the opportunity to explore the dataset content before buying. This is significantly different from merely offering a small sample of the data for free, as an Analytics App offers interactivity and is continuously updated to reflect the up-to-date state of the dataset.

We trust that this is a significant step towards attracting more data buyers to our dataset and the Ocean Market.

Round 10 Proposal

The grant will help fund the coding, NLP model development, man hours and compute hours required to extend our current published datasets. We aim to continue to grow these datasets and expand their usability by extracting further value from the corpus of crypto-related financial news and making it more accessible to the Ocean community.

Grant Deliverables

In this round we will expand our offering to include:

Continue to extend, develop and further curate our initial development of a Crypto specific Natural Language Processing Data Suite – this includes specifically continuing to expand the three data sets TASPEL-27, PASCOR-89, INVPEN-41.
Extend and deploy the online Analytics App dashboard that will show in real time the status of the Real News Crypto datasets TASPEL-27, PASCOR-89, INVPEN-41.

Specifically we will be adding additional analytics tools to demonstrate to the user:

Word clouds (searchable by topic, news source, author or date range)
Sentiment signals (to be implemented as proprietary sentiment entropy signals – positive, negative, neutral signals)
Spectral heat-map diagrams for document-document spectra.

Develop fourth data set in the ResilientML suite of FaaS focussed on sentiment signals decomposed into polarities (positive, negative, neutral) by crypto news topic.

The dataset expansion will span the additional following components:

Web3.0 coins: Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave;
Layer1 coins (Solana);
DeFi coins (Terra, PancakeSwap, Maker, THORChain, Serum);
Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs);
Stablecoins (USDC, BUSD),

and additional news sources:

NewsBTC;
Bitcoinist;
Blockonomi;
Coinspeaker.

Specifically, in addition to the post-processed tokens, and the text-noise-free sentences, we will also provide parsimonious spectral features extracted from the articles via matrix factorisation approaches.

As before, we will expand the dataset going forwards but also back-filling the currently available data. This will effectively increase the current data offering we have developed by around a factor of three in terms of content and volume of processed corpus.

We will continue to grow the sophistication of the data being provided. This will be achieved by progressively migrating from data munging to feature extraction and data curation for feature libraries.

The proposal in one sentence

Data is the modern oil of the blockchain economy. ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored datasets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.

Project Overview

Mission:

The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:

We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized as it requires a non-standard set of data science processes to extract and curate.
In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. They will be able to utilize our high quality Natural Language Processing (NLP) text data features to develop apps that will interface directly with our data reservoirs through API interfaces that can extract relevant text data features from our JSON formatted and curated feature libraries to form tributaries to the wider Ocean machine learning applications that seek inputs from text based data features.

ResilientML has developed methods in python to produce these JSON formatted text feature collections that will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized natural language processing NLP methods that ResilientML will bring to the Ocean community based on extensive academic and industry experience in developing such solutions.

In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.

NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:

Crypto News Sentiment
Social Media Sentiments
Technology: github, bitbucket, wire, …
Regulatory compliance reports
Legal documents

The team at ResilientML has dedicated quantitative analysts, machine learning experts and industry-leading engineers developing this suite of tools as both APIs and cloud solutions on Azure and AWS, using R, Python, MongoDB and other technologies.

Description of the project:

Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).

Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it becomes created and transmitted (velocity), its heterogeneous nature (variety) which includes not only textual, but also financial, visual or verbal data, and the actionable information among “noise” (value), are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied on the data.

Furthermore, these operations are necessary to be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity is related to the quality and provenance of the data that are fed into the statistical models. The source of data determines their quality and inherent biases, that inevitably affect the output of the final statistical model.

A detailed synopsis of the project is provided in the appendix of this proposal, which sets out the stages of machine learning used to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.

The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate this process. The relevancy and approaches to data collection require domain knowledge to identify the most relevant sources of data to extract value from to ensure the data is of highest integrity.

We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.

An important point of distinction to what we offer is that we move beyond the standard approach of bag-of-words and frequency of words based models which are ubiquitous in most NLP sentiment based frameworks, but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature which means that we can extract contribution to sentiment by individual articles, authors, and news sources explicitly.

Another important distinction of our contribution compared to other sentiment-based models is that we do not focus only on Twitter and social media feeds, which have limited scope to express sentiment; instead, our approach targets detailed analyst reports, editor-processed news reports, and regulatory reports. Working with these richer, higher-quality and more credible data sources is more complex than working with social media based models, and our framework is able to accommodate this.

What problem is your project solving?

Currently, there is a lack of high quality data on the Ocean marketplace, which is to be expected at this stage. In order to create a snowball effect that attracts high quality data providers, an initial kernel of high quality datasets needs to be published, staked, and purchased on the marketplace. We will contribute such high quality datasets to the Ocean marketplace to drive growth, which is critical to the success of the protocol.

What is the final product?

ResilientML will vastly expand our current datasets on the Ocean marketplace to include processed, noise-free sentences as well as new high level extracted sentence structure features, such as Context-Free Grammar trees and Dependency Graphs, for the following sectors:

Web3.0 coins (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);
Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);
DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);
Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);
Stablecoins (Tether, USDC, BUSD etc.),

with news topics covering:

DeFi;
Exchanges;
Regulation;
NFT;
Business,
Technology;
Markets,

and including data from the following additional news sources:

NewsBTC;
Bitcoinist;
Blockonomi;
Coinspeaker.

Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in Stage 1 of the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.

Furthermore, we will publish a new dataset providing extracted semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Figure 1: Text Analytics Value-Add Pipeline (VAP)

Expected ROI

The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.

Here, we focus on the first of these value drivers – since it is the easiest to ballpark.

We make the following assumptions:

Probability of project success = 0.8
Ocean Community gets 0.2% of consume volume.

We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively.

Note:

Let x_t = consumption for month t.

x_t = x_0 * (1 + rate)^t, where t = 0, 1, …, 11 months,

and x_0 = initial_num_users * num_datasets * datatoken_price.
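As a worked illustration of the growth formula above, the following sketch computes twelve months of consumption and the share that would flow to the Ocean community. The input values (10 users, 3 datasets, a datatoken price of 100, 10% monthly growth) are hypothetical placeholders, not figures from our scenarios, and the way the success probability and the 0.2% fee are combined is an assumption made for this example.

```python
def monthly_consumption(initial_num_users, num_datasets, datatoken_price,
                        rate, months=12):
    """x_t = x_0 * (1 + rate)^t for t = 0, ..., months - 1."""
    x0 = initial_num_users * num_datasets * datatoken_price
    return [x0 * (1 + rate) ** t for t in range(months)]


def expected_community_fees(consumption, p_success=0.8, fee_share=0.002):
    """Probability-weighted fees: p_success * fee_share * total consumption."""
    return p_success * fee_share * sum(consumption)


# Hypothetical scenario inputs, for illustration only
consumption = monthly_consumption(
    initial_num_users=10, num_datasets=3, datatoken_price=100, rate=0.10
)
fees = expected_community_fees(consumption)
print(f"total consumption over 12 months: {sum(consumption):.2f}")
print(f"expected community fees: {fees:.2f}")
```

Changing `rate` reproduces different growth scenarios; with `rate=0` consumption stays flat at x_0 each month.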

Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):

Project Details

Further details of the research prototype are provided in the following peer reviewed papers:

Chalkiadakis, Ioannis and Peters, Gareth W. and Chantler, Michael John and Konstas, Ioannis, A statistical analysis of text: embeddings, properties, and time-series modeling.

Available at SSRN (under review).

Chalkiadakis, Ioannis and Zaremba, Anna and Peters, Gareth W. and Chantler, Michael John, Sentiment-driven statistical causality in multimodal systems.

Available at SSRN under the title: On-chain Analytics for Sentiment-driven Statistical Causality in Cryptocurrencies.

Zaremba, A. and Peters, G., 2020. Statistical Causality for Multivariate Non-Linear Time Series via Gaussian Processes.

Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3609497

Peters, Gareth, Statistical Machine Learning and Data Analytic Methods for Risk and Insurance

Available at SSRN.

Team members

ResilientML consists of 5 team members.

Chair Prof. Gareth W. Peters (CStat-RSS, FIOR, YAS-RSE) - Head of Research

Background:

Linkedin: https://www.linkedin.com/in/gareth-w-peters-3928b4139/
GoogleScholar: https://scholar.google.com/citations?hl=en&user=goDorpkAAAAJ
Affiliations/Prizes: https://researchportal.hw.ac.uk/en/persons/gareth-w-peters/prizes/
PhD University of NSW, Australia
MSc Cambridge University, England

Experience:

Co-founder of ResilientML
20+ years machine learning research
5 research books
200+ journal and conference papers
Successfully delivered projects funded by grants totalling more than 5 million GBP.

Short Bio

Prof. Gareth W. Peters is the ‘Chair Professor for Risk and Insurance’ in the Department of Actuarial Mathematics and Statistics, in Heriot-Watt University in Edinburgh. Previously he held tenured positions in the Department of Statistical Sciences, University College London, UK and the Department of Mathematics and Statistics in University of New South Wales, Sydney, Australia.

Prof. Peters is the Director of the Scottish Financial Risk Association.

Prof. Peters is also an elected member of the Young Academy of Scotland in the Royal Society of Edinburgh (YAS-RSE) and an elected Fellow of the Institute of Operational Risk (FIOR). He was also the Nachdiploma Lecturer in Machine Learning for Risk and Insurance at ETH Zurich in the Risk Laboratory.

He has given in excess of 150 international invited presentations and speaker engagements, including numerous keynote presentations. He has delivered numerous professional training courses to C-suite executive level industry professionals as well as to numerous central banks.

He has published in excess of 150 peer reviewed articles on risk and insurance modelling, 2 research text books on Operational Risk and Insurance as well as being the editor and contributor to 3 edited text books on spatial statistics and Monte Carlo methods.

He currently holds positions as:

Honorary Prof. of Statistics at University College London, 2018+
Affiliated Prof. of Statistics in University of New South Wales Australia 2015+
Affiliate Member of Systemic Risk Center, London School of Economics 2014+
Affiliate Member of Oxford Man Institute, Oxford University (OMI) 2013+
Honorary Prof. of Statistics in University of Sydney Australia 2018+
Honorary Prof. of Statistics in Macquarie University, Australia 2018+
Visiting Prof. in Institute of Statistical Mathematics, Tokyo, Japan 2009-2018+

He previously held positions as:

Honorary Prof. of Peking University, Beijing, China 2014-2016
Adjunct Scientist in the Mathematics, Informatics and Statistics, Commonwealth Scientific and Industrial Research Organisation (CSIRO) 2009-2017

Webpage: https://www.qrslab.com/

Gordon Gay – CEO

Background:

Linkedin: https://www.linkedin.com/in/gordon-gay-6323542/
MEngSc Monash University
MBA Melbourne University, Melbourne Business School, The University of Melbourne

Experience:

Co-founder of ResilientML
23 years R&D at NEC Australia, roles - GM of R&D, National Head of Innovation

Matthew Ames – CTO / Co-Head of Research

Background:

Linkedin: https://www.linkedin.com/in/matthewames87/
PhD Statistics, University College London
2 years Postdoctoral research

Experience:

5 years industry experience - machine learning, finance

Ioannis Chalkiadakis – Data Scientist / Natural Language Processing

Background:

Linkedin: https://www.linkedin.com/in/imchalkiadakis/
Masters of Engineering, National Technical University of Athens
PhD candidate, Quantitative Risk Solutions Lab, Heriot-Watt University, Edinburgh

Experience:

3 years Software Engineering

Appendix: Detailed Project Description

Extracting Value from Text Data

Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity.

The amount of information (volume), the rate at which it becomes created and transmitted (velocity), its heterogeneous nature (variety) which includes not only textual, but also financial, visual or verbal data, and the actionable information among “noise” (value), are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied on the data.

Furthermore, these operations are necessary to be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity is related to the quality and provenance of the data that are fed into the statistical models. The source of data determines their quality and inherent biases, that inevitably affect the output of the final statistical model.

Importance of text pre-processing

With any type of data collected from real world processes, it is usually the case that a set of “clean-up” or pre-processing transformations are required before using them for the statistical processing.

The pre-processing procedures will remove the noise from the data which will allow us to operate on the actual information we want to process. In this way we will not only ensure the veracity we want to achieve, but will also obtain efficiency and computational benefits.

Statistical text processing: Pipeline

In general, we can identify three stages for the statistical analysis of text data:

• data import,

• data wrangling, and

• finally, development and evaluation of the statistical model.

The first step of importing the data consists of either loading an already existing dataset, or alternatively collecting one’s own set of data, for example via scraping web pages, scanning/optical character recognition (OCR) of printed documents or transcribing spoken text. The data import however, does not guarantee that the dataset will be in such a format that will facilitate subsequent processing.

Therefore, we need to go through the process of “tidying” the data, where one constructs “data frames”, i.e. tabular structures, where each variable is stored in its own column and each observation occupies one row.

This process will create a tidy dataset and will facilitate subsequent data transformations, visualization and processing. Creating a tidy dataset and applying the necessary transformations or visualization methods constitutes the process of “data wrangling”.
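As a small illustration of the tidying step, the sketch below flattens nested article records into a table with one variable per column and one observation per row (here, one row per article and token). The input fields and column names are hypothetical, and only the Python standard library is used so that the example is self-contained; a data-frame library would normally take its place.

```python
import csv
import io
from collections import Counter

COLUMNS = ["date", "source", "token", "count"]


def tidy_rows(articles):
    """One row per (article, token) observation, one column per variable."""
    rows = []
    for art in articles:
        counts = Counter(art["text"].lower().split())
        for token, count in sorted(counts.items()):
            rows.append({
                "date": art["date"],
                "source": art["source"],
                "token": token,
                "count": count,
            })
    return rows


def to_csv(rows):
    """Serialise tidy rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical imported articles
articles = [
    {"date": "2021-10-02", "source": "example.com", "text": "btc up btc"},
    {"date": "2021-10-02", "source": "example.org", "text": "eth down"},
]
rows = tidy_rows(articles)
print(to_csv(rows))
```

Once the data are in this shape, filtering by source or date and aggregating counts become simple row and column operations.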

During modelling, it might be necessary to apply additional transformations on the data, hence there is a feedback loop between the data wrangling and modelling stages.

Noise in Text Data & its Removal During Data Wrangling.

What constitutes noise in raw text, and under what conditions may it be introduced into our data?

Obvious noise artefacts are:

• encoding scheme (representation)

• word mis-spellings,

• errors in the linguistic structure (grammar or syntax),

• missing spaces or punctuation symbols and

• wrong capitalization patterns.

These types of noise patterns are usually introduced at the creation stage of the raw text, and are challenges that are expected in natural language applications.

However, noise may appear in non-obvious forms as well. Users of communication services, for instance SMS, e-mails, instant messages, or social media posts, often use abbreviations, emoticons, or even omit certain words.

These patterns, depending on the application, could hinder the processing of the raw data. For example, when analyzing sentiment from Tweets, most researchers will want to consider emoticons as they can be very expressive about the feelings of the author of the Tweet.

On the other hand, if someone strictly wants to analyze the lexical or grammatical patterns that appear frequently among Twitter users, information based on emoticons is potentially irrelevant, in which case it is noise and has to be removed.

The domain specificity of the noise patterns that this implies appears in additional noise sources that are considered standard in NLP, namely “stopwords” and punctuation.

The term stopwords refers to words that are not considered useful for the intended analysis because they lack discriminative power (e.g. appear too often in the dataset) or lack significant semantics, namely terms such as “a”, or “the”.

Stopword removal is considered a standard part of the pre-processing pipeline, is usually performed early in the pre-processing stage, and most NLP software packages come with standard predefined stopword lists.
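One way to check a predefined stopword list against a specific corpus is to look at document frequency: terms that appear in almost every document have little discriminative power. The following is a minimal sketch of that idea, assuming whitespace tokenization and a hypothetical threshold of 80% of documents; the function names and the toy corpus are illustrative only.

```python
from collections import Counter


def document_frequency(docs):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def candidate_stopwords(docs, max_doc_ratio=0.8):
    """Terms present in at least max_doc_ratio of all documents."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(w for w, c in df.items() if c / n_docs >= max_doc_ratio)


# Hypothetical toy corpus
corpus = ["the market fell", "the token rose", "the exchange halted"]
print(candidate_stopwords(corpus))
```

The output is a list of candidates for manual review rather than a final list: in a crypto news corpus a term such as "bitcoin" may appear in most articles and still carry meaning for the analysis.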

Basic pre-processing steps

The list is not exhaustive:


1. Punctuation

Often punctuation marks (such as , . ! ? ; # “” ‘’ ~) are removed, for example when one aims to analyze counts of terms, and therefore punctuation becomes unnecessary. However, similar to stopwords, there are cases when all or a subset of punctuation marks are useful and are therefore desirable to maintain. For example, exclamation marks may reveal sentiment information, or some symbols may carry special meaning in certain contexts, such as the hashtag (#) symbol in Tweets where it relates to the Tweet semantic content. Finally, it is important to consider at which stage of the analysis one should remove punctuation. If we want to detect sentence boundaries, or perform syntax or grammar parsing, then it is important to maintain punctuation symbols before performing these stages. Once this type of analysis has finished, it may be safe to remove punctuation if it is required for further analysis steps.


2. Numbers

For numbers, too, one should carefully consider the application context before deciding to remove them. They are often removed because they do not contribute semantic information most of the time. However, if the domain requires the extraction of dates, or of case numbers when processing legal documents, then specific rules should dictate the conditions under which numbers are removed from the text.


3. Lowercase

Lowercasing all terms is applied to reduce the vocabulary space, i.e. the set of words we expect to come across. This is useful for reducing the computational and space complexity in applications where we work with large sparse matrices of word counts. However, there are cases where uppercase letters reveal structural information: they can help identify sentence boundaries, or proper names, and can also help reduce ambiguity; for example distinguishing the proper name Rose from the noun denoting the flower rose.


4. Stemming

Stemming is an additional technique that aims to reduce the vocabulary space. It consists of removing any inflections from a word and reducing it to its most basic form. For example, a stemmer (the program that does the stemming) will map “walked”, “walking” and “walks” to the lexeme “walk”. Note that in the case of e.g. “studies”, the stemmer will return a basic word form (“studi”) that is an invalid word itself. This is because stemming does not account for the grammatical or syntactical pattern behind the inflection - it only cuts it off. An alternative method is lemmatisation, where the root lexeme is returned, which means that lemmatisation returns terms that are in the language. For example, the lemmatisation program will replace “studying” and “studies” with “study”. This is achieved by considering the part-of-speech of each term (e.g. is it a noun, verb, adjective or adverb?) in order to determine the suffix. As with punctuation, if we want to perform syntactic or grammatical analysis we have to postpone stemming and lemmatisation until after these stages.


5. Stopword removal

As we have discussed, this step should be treated very carefully, so as to minimize both information loss and the noise remaining in the dataset. It is therefore recommended that the standard stopword lists shipped with software packages be checked, and modified accordingly, before use.
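A minimal sketch of adapting a stopword list before applying it. The base list is a small made-up stand-in for a package default, and the choice to keep "not" and "down" is an assumed example of domain knowledge (negation and direction matter for sentiment); the function names are hypothetical.

```python
# Stand-in for a package's default list (illustrative, not a real default).
BASE_STOPWORDS = {"the", "a", "is", "to", "of", "not", "up", "down"}


def build_stopwords(base, keep=(), add=()):
    """Start from a standard list, drop terms we must keep, add domain noise."""
    return (set(base) - set(keep)) | set(add)


def remove_stopwords(tokens, stopwords):
    """Filter tokens case-insensitively against the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "bitcoin is not going down".split()

default_result = remove_stopwords(tokens, BASE_STOPWORDS)
custom_list = build_stopwords(BASE_STOPWORDS, keep={"not", "down"}, add={"going"})
custom_result = remove_stopwords(tokens, custom_list)

print(default_result)
print(custom_result)
```

With the unmodified list the negation and the direction of the move are lost, which is exactly the kind of information loss the check is meant to prevent.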


6. Word compounds

Word compounds are groups of words, usually groups of two (bigrams) or three (trigrams), that frequently appear together and convey a different meaning than the individual words do. For example, the trigram “Wall Street Journal” denotes the name of a popular newspaper, and we would like to treat it as a single term when we extract its appearances in the dataset. If we do not, we can still identify the valid terms “wall”, “street” and “journal” individually, but we ignore the fact that together they refer to a newspaper rather than carrying their separate meanings.
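One simple way to handle compounds, sketched under the assumption that a raw co-occurrence count is an adequate signal (real pipelines often use association scores instead): count n-grams, keep those seen at least min_count times, and merge them into single tokens. The function names and the underscore convention are illustrative choices.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count):
    """Return the set of n-grams that occur at least min_count times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram for gram, count in grams.items() if count >= min_count}


def merge_compounds(tokens, compounds):
    """Greedily replace known compounds with one underscore-joined token."""
    sizes = sorted({len(c) for c in compounds}, reverse=True)  # longest first
    merged, i = [], 0
    while i < len(tokens):
        for n in sizes:
            gram = tuple(tokens[i:i + n])
            if gram in compounds:
                merged.append("_".join(gram))
                i += n
                break
        else:
            # No compound starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "the wall street journal reported that wall street journal readers".split()
compounds = frequent_ngrams(tokens, n=3, min_count=2)
print(merge_compounds(tokens, compounds))
```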


7. Remove low-frequency words

It is also common practice to remove extremely rare words, namely words that account for less than a small fixed percentage of the document corpus (typically below 0.5-1%), again to reduce the computational and space complexity.

At the end of these processes we have tokenized, cleaned and wrangled text data that is ready for feature extraction and data analysis in NLP-based machine learning.
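A minimal sketch of this filter. It assumes the percentage is measured as a word's share of all tokens in the corpus; measuring the share of documents containing the word is an equally common reading. The name remove_rare_words and the default threshold are illustrative.

```python
from collections import Counter


def remove_rare_words(documents, min_share=0.005):
    """Drop words whose share of all corpus tokens is below min_share."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    keep = {word for word, count in counts.items() if count / total >= min_share}
    return [[token for token in doc if token in keep] for doc in documents]


# Toy corpus of 100 tokens in which "typo" appears once (1% of tokens).
corpus = [["btc"] * 60, ["eth"] * 39 + ["typo"]]
filtered = remove_rare_words(corpus, min_share=0.02)
print([len(doc) for doc in filtered])
```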

Feature Extraction Methods – Time-Series of NLP Text Features

We identify three distinct categories:

Semantics : namely the meaning behind words and sentences and the coherence of a well-formed text - Bag-Of-Words (Frequency Based Features).

The way we capture semantics is based on the bag-of-words model (BoW), which has been widely applied in natural language processing (NLP) and information retrieval (Harris, 1954). The main concept behind BoW is to map a segment of text to an unordered collection, or “bag”, of words. As we have seen, this is the premise for the construction of document-term matrices for a corpus of documents; in its original formulation BoW is applied to a complete document of a collection of documents (a corpus) and ignores the sequence of words in the text. We transfer BoW into a time-series context and present an “online” formulation. This allows us to overcome computational difficulties associated with BoW, namely the handling of sparse matrices whose size depends on the number of distinct document words and the corpus size, and may well be in the order of hundreds of thousands. In addition, this setting allows us to construct a text-based time-series that can be incorporated into a time-series based system for supervised or unsupervised learning.
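The general idea of a streaming bag-of-words can be illustrated as follows. This is a sketch under stated assumptions, not the ResilientML online formulation, which is not specified here: text arrives as (timestamp, tokens) pairs, only a small fixed dictionary of tracked terms is counted, and each time step yields the relative frequency of those terms. All names and data are hypothetical.

```python
from collections import Counter


def bow_time_series(stream, tracked_terms):
    """Yield (timestamp, {term: relative frequency}) one time step at a time.

    Only the current text segment is held in memory, so no corpus-wide
    document-term matrix is ever built.
    """
    for timestamp, tokens in stream:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty segments
        yield timestamp, {term: counts[term] / total for term in tracked_terms}


stream = [
    ("2021-10-01", ["btc", "rally", "btc"]),
    ("2021-10-02", ["eth", "crash"]),
]
series = list(bow_time_series(stream, tracked_terms=("btc", "crash")))
for timestamp, features in series:
    print(timestamp, features)
```

Because the output is one fixed-length feature vector per timestamp, it can be aligned with other time-series, such as prices, in a downstream model.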

Grammar : i.e. the structural rules that dictate how words fit into the sentence and form groups such as clauses and phrases - parse tree or constituency tree for n-grams/sentences.

Consider the following example

The brown dog is running in the park.

which can also be written in an equivalent grammatical manner:

He is running in the park.

without destroying the grammar or meaning of the sentence.

The fact that a group of words can operate as a single unit - and therefore in our example we can replace the phrase “the brown dog” with “he” - is the linguistic property of constituent structure.

One can therefore extract features based on the grammatical rules that tell us which words can be grouped into units, and study the role of those units in the sentence.

The formal system for studying this phenomenon, i.e. the grouping of words as in the above example, is the context-free grammar (CFG). Mathematically, a CFG is defined by a quadruple as follows: 𝐺=(𝑁,Σ,𝑅,𝑆) where

𝑁 is a set of non-terminal symbols

Σ is a set of terminal symbols, 𝑁∩Σ=∅

𝑅 is a set of rules, 𝑅={𝐴→𝛽:𝐴∈𝑁 and 𝛽∈(Σ∪𝑁)∗}

𝑆 is the designated start symbol, 𝑆∈𝑁

Such grammars are called “context-free” because the left-hand side of each rule contains exactly one non-terminal symbol, so a rule can be applied regardless of the symbols that surround it.

A context-free grammar defines a formal language: the set of strings of terminal symbols that can be derived starting from 𝑆.

A sentence is called grammatical if the sequence of tokens that comprises it can be derived by following the rules of the CFG; otherwise, the sentence is not valid according to the language of the CFG (ungrammatical).

The process of analyzing the constituent structure of a sentence is called constituency parsing, and the derivation of a sentence, i.e. the rules that we followed when building it, can be represented with a hierarchical structure, a tree, which is called the parse tree or constituency tree.
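To make the definitions concrete, here is a small sketch that tests grammaticality with the CYK algorithm. The grammar is a toy written for the two example sentences above and kept in Chomsky normal form (every rule is either 𝐴→𝐵𝐶 or 𝐴→word); the category names (Nom, VPing, and so on) are assumptions for illustration. The code only recognises sentences; it does not build the parse tree.

```python
from itertools import product

# Binary rules A -> B C, indexed by the right-hand side (B, C).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("Det", "Nom"): {"NP"},
    ("Adj", "N"): {"Nom"},
    ("Aux", "VPing"): {"VP"},
    ("V", "PP"): {"VPing"},
    ("P", "NP"): {"PP"},
}
# Lexical rules A -> word, indexed by the terminal.
LEXICAL_RULES = {
    "the": {"Det"}, "brown": {"Adj"}, "dog": {"N"}, "park": {"N"},
    "he": {"NP"}, "is": {"Aux"}, "running": {"V"}, "in": {"P"},
}


def is_grammatical(tokens, start="S"):
    """Return True if tokens can be derived from the start symbol (CYK)."""
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][j] holds the non-terminals that derive tokens[i..j] inclusive.
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        chart[i][i] = set(LEXICAL_RULES.get(token, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n - 1]


print(is_grammatical("the brown dog is running in the park".split()))
print(is_grammatical("he is running in the park".split()))
print(is_grammatical("dog the is park".split()))
```

Both example sentences are accepted because “the brown dog” and “he” each derive NP, which is the constituent-structure property described above.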

Syntax : that is the principles that dictate the structure of sentences by specifying the order and role of each word in the text - Dependency Graphs.

The goal of syntactic analysis is to discover the pairs of words in which one depends on the other, and the type of that dependence.

These dependency relations are binary and asymmetrical, and therefore we would like to know which of the two words acts as the head that is modified in some way, and which is the dependent that modifies or complements the head. This concept allows us to think of the dependency relations as inducing graph structures (dependency graphs) which we use to study the dependency relations between words, and therefore the syntax of a sentence. The syntactic analysis complements the grammatical, parse tree-based analysis, as now we aim to extract information on the functional role of each word in the sentence, rather than structural relations between them as we did with the context-free grammar.
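A dependency graph for the earlier example sentence can be sketched as a list of labelled, directed arcs. The arcs below are annotated by hand with Universal Dependencies-style labels as an assumption for illustration; they are not the output of a parser or of the ResilientML pipeline, and the helper names are hypothetical.

```python
from collections import defaultdict

# Tokens are indexed from 1; index 0 is an artificial ROOT node.
TOKENS = ["ROOT", "The", "brown", "dog", "is", "running", "in", "the", "park"]

# Hand-annotated arcs: (head index, relation, dependent index).
ARCS = [
    (0, "root", 5),
    (5, "nsubj", 3),
    (3, "det", 1),
    (3, "amod", 2),
    (5, "aux", 4),
    (5, "obl", 8),
    (8, "case", 6),
    (8, "det", 7),
]


def build_graph(arcs):
    """Map each head index to its list of (relation, dependent index)."""
    graph = defaultdict(list)
    for head, relation, dependent in arcs:
        graph[head].append((relation, dependent))
    return graph


def is_well_formed(arcs, n_tokens):
    """Check that every word has exactly one head and reaches ROOT."""
    heads = {}
    for head, _, dependent in arcs:
        if dependent in heads:  # a second head for the same word
            return False
        heads[dependent] = head
    if set(heads) != set(range(1, n_tokens)):
        return False
    for token in range(1, n_tokens):
        seen, node = set(), token
        while node != 0:
            if node in seen or node not in heads:  # cycle or dangling head
                return False
            seen.add(node)
            node = heads[node]
    return True


def dependents(graph, head_index):
    """Return (relation, word) pairs for the dependents of one head."""
    return [(relation, TOKENS[dep]) for relation, dep in graph[head_index]]


graph = build_graph(ARCS)
print(is_well_formed(ARCS, len(TOKENS)))
print(dependents(graph, 5))
```

Each arc is asymmetric (head to dependent) and typed, so features such as "subject of the main verb" can be read directly off the graph.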

realdatawhale · October 6, 2021, 8:38am

Hello there!

As highlighted before, your dataset surely adds quality data to the Ocean Market.

Nonetheless, it would have been great to include some deliverables regarding onboarding consumers for your data; otherwise it’ll be challenging to achieve positive network value in the long run. These could have included a goal of giving at least 3 presentations to your target audience, etc.

Let us know your thoughts!

Many thanks !

ResilientML · October 6, 2021, 1:24pm

Thank you for the interesting comment. I think this is a universal problem at present to attract buyers to the marketplace – not something specific to ResilientML. This has been heavily discussed in the DAO town halls on a weekly basis. All of us as a community are working to onboard data consumers. Scott Milat has some interesting insights into this, and has been working hard to develop a framework to drive data consumers to Ocean.

As detailed in our DAO town hall updates we have been making presentations via multiple channels of our research papers that are using the NLP datasets published on Ocean in ML solutions:

Publication of research papers
LinkedIn posts
Presentations at conferences

Currently, we are investing a lot of effort to build out our analytic visualization tool to interact with our datasets, and once completed we will be presenting this to our target audience.

We will be marketing on various social media channels (LinkedIn, …) once completed – we will make an announcement. Different channels are required to be targeted in order to reach different audiences. Our primary target audience is professional data scientists and machine learning engineers.

We have submitted an abstract of a paper to the International Conference on Computational and Financial Econometrics (CFE) which takes place in December 2021. It is necessary to wait for peer review and once accepted we can formally announce this as a deliverable.

We have released the following papers to the public domain – since we want scientists to engage with our data:

Many thanks for the comment

Vantagecrypto · October 10, 2021, 7:53pm

We at VantageCrypto like this project because we see a huge value for actionable streaming analytics on the Signal Syndication and Data Network (An Ocean Marketplace fork). We look forward to being able to run some analytics on ResilientML’s news sentiment engine outputs and hopefully seeing this data become available in a higher resolution than 24 hours.

Good luck team ResilientML!