Round 11: [ResilientML] - Python Analytics Suite on Streamlit.io and Data Sets Extension and Expansion

Key Project Data

Name of project:

ResilientML – Python Analytics Suite on Streamlit.io and Data Sets Extension and Expansion

Team Website:

https://www.resilientml.com/

Proposal Wallet Address:

0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

Which category best describes your project?

Unleash data

Funding Requested

25,000 US$

Extend analytics suite and maintain data sets for three NLP core data sets for our Features-as-a-Service (FaaS) model

Development of a FaaS ResilientML market place on Ocean and Analytics toolbox and Streamlit.io user interface.

Overview of ResilientML to date (after successfully shipping 5+ grants)

In Round 10, ResilientML requested $50k as an established team in the Ocean ecosystem, having delivered successfully on 5 grants in previous rounds. We were unsuccessful in that round; however, we believe that our product framework and offering are important components of an ML system in Ocean, and that our professional-grade features-as-a-service (FaaS) framework is worth continuing to grow for the Ocean market place. We are therefore resubmitting last round’s proposal, but requesting $25k instead of $50k. We plan to cover the resulting budget shortfall from other funding streams within ResilientML, so we should still be able to keep to our roadmap of deliverables.

As a short summary for voters on how ResilientML has progressed to date:

Round 4: received $6,500USD (data market place development for NLP Features-as-a-Service)

Round 5: received $7,500USD (data market place development for NLP Features-as-a-Service)

Round 5: received $7,500USD (second grant on education)

Round 8: received $33,208USD (data market place development for NLP Features-as-a-Service)

Round 9: received $22,764USD (data market place development for NLP Features-as-a-Service)

Monthly summary of dataset growth:

Start: 2017 – end of April 2021: 13,127,107 tokens

end of May 2021: 13,463,311 tokens

end of June 2021: 13,627,931 tokens

end of July 2021: 13,760,124 tokens

middle of August 2021: 13,830,419 tokens

In order to maintain this data set (weekly updates of the existing sentiment NLP news data sets) and to grow it further with new news sources, additional products and NLP sentiment solutions, we require support to fund the staff time and effort spent developing the system and tools and maintaining our data sets. We believe that once we provide the analytics suite (the development component of this round’s grant proposal), the community will have a visual interface to our FaaS market. If the Ocean community wants to see this continue to develop, please consider supporting our efforts.

The ResilientML framework provides progressively enhanced, professionally curated, state-of-the-art text and NLP feature libraries to the Ocean market place on the Polygon market as three distinct FaaS feature library data sets (all updated and maintained weekly with automated data-collection, processing and production pipelines):

Tasty Pelican token – TASPEL-27. Real Crypto News Sentiment

- weekly updated

  • contains n-gram structures in curated JSON data structures as well as Python pickle files for 10 million+ tokens derived from news articles on cryptocurrency projects. They are curated by date, source, author and topic, which makes this data source ideal for NLP tasks such as sentiment construction. A detailed description of the data structure and content is available at:

( https://market.oceanprotocol.com/asset/did:op:55d4346C439A5f9Ccf2Cb1802D56Cd5A70Be82c7 )

Passionate Cormorant token – PASCOR-89. Real Crypto News Sentiment Sentence Features

- weekly updated

  • contains the noise-free, processed article sentences for all current news sources and assets for the complete period that the Real News Crypto dataset covers (2017-present). This dataset will be maintained and extended in the following weeks to include higher-level structural features (grammar trees and syntax graphs) of the provided sentences. The dataset is now available at a fixed price at:

( https://market.oceanprotocol.com/asset/did:op:a62962De45BEE6C99cae7a685D4f8ee1EaB7825C )

Invidious Penguin token – INVPEN-41. Real Crypto News Sentiment Spectral Features

- weekly updated

  • contains the top-3 tokens’ and articles’ spectral components (eigenvalues and eigenvectors of the word-word and document-document matrices) for the weekly article collections in the spanned period. Note that the dictionary tokens belong to a custom cryptocurrency-specific vocabulary which has been sentiment-annotated and is hence expressive of market sentiment for the covered period. This is particularly informative for market participants wanting to gain insights into market dynamics from an investor-sentiment perspective. The dataset is now available at a fixed price at:

( https://market.oceanprotocol.com/asset/did:op:E0AB818DB99dd89548512407A328A9D7Bf6f994a )
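To make the notion of spectral components concrete, the following is a minimal sketch and not the ResilientML pipeline. It assumes a plain document-term count matrix X, forms the word-word matrix as X^T X and the document-document matrix as X X^T, and uses power iteration to recover only the leading eigenpair (the dataset itself provides the top three). The vocabulary, documents and function names are hypothetical.

```python
import math

def doc_term_matrix(docs, vocab):
    """Count how often each vocabulary word occurs in each tokenised document."""
    return [[doc.count(word) for word in vocab] for doc in docs]

def gram_matrix(rows):
    """Return A^T A for a matrix A given as a list of rows."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenpair(matrix, iters=1000, tol=1e-12):
    """Power iteration: leading eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iters):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm == 0.0:
            break
        vec = [x / norm for x in prod]
        # The Rayleigh quotient of the normalised vector estimates the eigenvalue.
        new_value = sum(
            vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
        )
        if abs(new_value - value) < tol:
            value = new_value
            break
        value = new_value
    return value, vec

# Hypothetical sentiment vocabulary and tokenised weekly articles.
vocab = ["bullish", "bearish", "rally"]
docs = [["bullish", "rally"], ["bearish"], ["bullish", "bullish", "rally"]]

X = doc_term_matrix(docs, vocab)
word_word = gram_matrix(X)                             # X^T X
doc_doc = gram_matrix([list(col) for col in zip(*X)])  # X X^T

word_val, word_vec = top_eigenpair(word_word)
doc_val, doc_vec = top_eigenpair(doc_doc)
print("leading word-word eigenvalue:", round(word_val, 4))
print("leading document-document eigenvalue:", round(doc_val, 4))
```

The two matrices share their non-zero eigenvalues, so the leading values agree; the eigenvectors differ and weight words and documents respectively.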

We aim to provide professional-level curated datasets for ML Compute-to-Data users to be able to access data and feature libraries that are weekly maintained, updated and extended, and which are in standard professional data formats with standardised feature templates in JSON and Python pickle, ready for training and fine-tuning of time-series models, deep neural network models, Transformer models, LSTMs etc.

Analytics Tool: Python-Streamlit.io User Data Interrogator

In addition, ResilientML has been developing a live interactive analytics tool with which users can explore the attributes, quality, variety, veracity, complexity and completeness of our feature library in real time. This should drive understanding and adoption of our FaaS market place, as it gives users a clear and detailed overview of what they would be purchasing or holding when they buy the ResilientML data and FaaS product.

Updates on Features-as-a-Service model

Using the previous rounds of funding, ResilientML has been producing a weekly growing automated data market for Natural Language Processing Feature Libraries (NLPFL) on Ocean.

We now have three data sets produced and available on the Ocean market place (Polygon).

The following process has been automated and executed in Cloud infrastructure on a weekly basis (see figures below):

  • Scan increasing numbers of news sources (currently 4+) automatically and extract news reports on targeted topics related to key areas in the crypto sphere;
  • Process text into cleaned format – as outlined in proposal stages of data munging;
  • Extract key features from text in n-gram formats, sentence format;
  • Construct the JSON data library update, and automate its upload to box storage and its update on the Ocean market every week at midnight on Saturday.
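The feature-extraction and JSON-construction steps above can be sketched as follows. This is an illustrative sketch only: the tokeniser, the record layout, the field names and the sample article are assumptions, not the published ResilientML schema.

```python
import json
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphanumeric word tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_record(article, max_n=2):
    """Build one JSON-serialisable feature record (hypothetical layout)."""
    tokens = tokenize(article["text"])
    features = {
        f"{n}-grams": dict(Counter(ngrams(tokens, n)))
        for n in range(1, max_n + 1)
    }
    return {
        "date": article["date"],
        "source": article["source"],
        "author": article["author"],
        "topic": article["topic"],
        "num_tokens": len(tokens),
        "features": features,
    }

# Hypothetical article; all metadata values are made up for illustration.
article = {
    "date": "2021-10-02",
    "source": "example-news-site",
    "author": "A. Writer",
    "topic": "DeFi",
    "text": "Bitcoin price rises as bitcoin adoption grows.",
}

record = build_record(article)
print(json.dumps(record, indent=2))
```

Keeping the date, source, author and topic alongside the n-gram counts mirrors the curation keys described for the data sets, so a consumer could filter records before aggregating features.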

As of 02/10/2021 the market place has data for Features-as-a-Service summarised as follows:

  • 4 major crypto news sources
  • 19 crypto projects that are reported on
  • 8 news categories relevant to the crypto space
  • 200,00+ articles
  • 14,163,060 processed text tokens,

the processing of all of which has been the result of more than 3 years of research into statistical natural language processing.

Furthermore, the following processes have been in development and will be offered on the market place both as data, and as Compute-to-Data modules, once the latter is fully supported:

  • Segment text into topics and append to previous feature libraries,
  • Process higher order features sets – CFG trees, Dependency graphs, spectral features and sentiment indices,
  • Create summary overview of changes made, current special features that week, KPIs and total records to date (currently partially in effect, and presented in prototype of the Analytics App).

The same data suite will also be available for COVID-19 news articles in the next stage of our FaaS project, and data from a range of other sectors will follow (e.g. climate).

Finally, the following Compute-to-Data modules are scheduled to be released, all of which are based on research outputs from the ResilientML team published in peer-reviewed academic journals:

  • Mixed-frequency data modelling for forecasting investors’ sentiment in crypto markets;

  • Epidemiological models for COVID-19 infected cases, enhanced with public sentiment.

Analytics and content visualisation of TASPEL-27 Real News Crypto Dataset

A prototype of the Analytics App for TASPEL-27 has been developed, presenting KPIs (linked to the dataset volume, richness and update rate) and data summaries of Real News Crypto to the potential data buyer and consumer. The application will be hosted online once the prototype is completed and will be weekly updated to reflect the current dataset state.

This is a significant advantage of the Real News Crypto Dataset as it offers transparency to the buyer and the opportunity to explore the dataset content before buying. This is significantly different from merely offering a small sample of the data for free, as an Analytics App offers interactivity and is continuously updated to reflect the up-to-date state of the dataset.

We trust that this is a significant step towards attracting more data buyers to our dataset and the Ocean Market.

Round 11 Proposal

This proposal is to

  1. continue to extend, develop and further curate our Crypto-specific Natural Language Processing Data Suite – specifically, continue to expand the three data sets TASPEL-27, PASCOR-89, INVPEN-41.

  2. extend and deploy the online Analytics App dashboard that will show in real time the status of the Real News Crypto datasets TASPEL-27, PASCOR-89, INVPEN-41.

Specifically, we will add the following analytics tools for the user:

  • Word clouds (searchable by topic, news source, author or date range)
  • Sentiment signals (to be implemented as proprietary sentiment entropy signals – positive, negative, neutral signals)
  • Spectral heat-map diagrams for document-document spectra

  3. develop a fourth data set in the ResilientML suite of FaaS focussed on sentiment signals decomposed into polarities (positive, negative, neutral) by crypto news topic.

The grant will help fund the coding, NLP model development, man hours and compute hours required to extend our current published datasets. We aim to continue to grow these datasets and expand their usability by extracting further value from the corpus of crypto-related financial news and making it more accessible to the Ocean community.

Grant Deliverables

In this round we will expand our offering to include:

  1. continue to extend, develop and further curate our Crypto-specific Natural Language Processing Data Suite – specifically, continue to expand the three data sets TASPEL-27, PASCOR-89, INVPEN-41.

  2. extend and deploy the online Analytics App dashboard that will show in real time the status of the Real News Crypto datasets TASPEL-27, PASCOR-89, INVPEN-41.

Specifically, we will add the following analytics tools for the user:

  • Word clouds (searchable by topic, news source, author or date range)
  • Sentiment signals (to be implemented as proprietary sentiment entropy signals – positive, negative, neutral signals)
  • Spectral heat-map diagrams for document-document spectra

  3. develop a fourth data set in the ResilientML suite of FaaS focussed on sentiment signals decomposed into polarities (positive, negative, neutral) by crypto news topic.

The dataset expansion will span the additional following components:

  • Web3.0 coins: Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave;

  • Layer1 coins (Solana);

  • DeFi coins (Terra, PancakeSwap, Maker, THORChain, Serum);

  • Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs);

  • Stablecoins (USDC, BUSD),

and additional news sources:

  • NewsBTC;

  • Bitcoinist;

  • Blockonomi;

  • Coinspeaker.

Specifically, in addition to the post-processed tokens, and the text-noise-free sentences, we will also provide parsimonious spectral features extracted from the articles via matrix factorisation approaches.

As before, we will expand the dataset going forward while also back-filling the currently available data. This will effectively increase our current data offering by around a factor of three in terms of content and volume of processed corpus.

We will further continue to grow the sophistication of the data being provided. This will be achieved by progressively migrating from data munging to feature extraction and data curation for feature libraries.

The proposal in one sentence

Data is the modern oil of the blockchain economy. ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored datasets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.

Project Overview

Mission:

The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:

  • We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized as it requires a non-standard set of data science processes to extract and curate.

  • In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. The community will be able to use our high-quality Natural Language Processing (NLP) text data features to develop apps that interface directly with our data reservoirs through APIs. These APIs can extract relevant text data features from our JSON-formatted, curated feature libraries, forming tributaries to the wider Ocean machine learning applications that seek inputs from text-based data features.

ResilientML has developed methods in Python to produce these JSON-formatted text feature collections that will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized NLP methods that ResilientML will bring to the Ocean community, based on extensive academic and industry experience in developing such solutions.

In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.

NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:

  • Crypto News Sentiment

  • Social Media Sentiments

  • Technology: github, bitbucket, wire, …

  • Regulatory compliance reports

  • Legal documents

The team at ResilientML includes dedicated quantitative analysts, machine learning experts and industry-leading engineers who develop this suite of tools as APIs and as cloud solutions on Azure and AWS, using R, Python, MongoDB and other technologies.

Description of the project:

Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).

Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity relates to the quality and provenance of the data that are fed into the statistical models. The source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

A detailed synopsis of the project can be found in the appendix of this proposal to see specifically the stages of Machine Learning considered to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.

The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate this process. The relevancy and approaches to data collection require domain knowledge to identify the most relevant sources of data to extract value from to ensure the data is of highest integrity.

We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.

An important point of distinction to what we offer is that we move beyond the standard approach of bag-of-words and frequency of words based models which are ubiquitous in most NLP sentiment based frameworks, but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature which means that we can extract contribution to sentiment by individual articles, authors, and news sources explicitly.

Another important distinction of our contribution compared to other sentiment-based models is that we don’t focus only on Twitter and social media feeds, which have limited scope to express sentiment. Instead, our approach targets detailed analyst reports, editor-processed news reports, and regulatory reports. Working with these richer, higher-quality and more credible data sources is more complex than working with social media data, and our framework is able to accommodate this.

What problem is your project solving?

Currently, there is a lack of high-quality data on the Ocean marketplace, which is to be expected at this stage. To trigger a snowball effect that attracts high-quality data providers, an initial kernel of high-quality datasets needs to be published, staked, and purchased on the marketplace. We will contribute such high-quality datasets to the Ocean marketplace to drive growth, which is critical to the success of the protocol.

What is the final product?

ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace to include processed, noise-free sentences as well as new high level extracted sentence structure features, such as Context-Free Grammar trees and Dependency Graphs, for the following sectors:

  • Web3.0 coins (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);

  • Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);

  • DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);

  • Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);

  • Stablecoins (Tether, USDC, BUSD etc.),

with news topics covering:

  • DeFi;

  • Exchanges;

  • Regulation;

  • NFT;

  • Business;

  • Technology;

  • Markets,

and including data from the following additional news sources:

  • NewsBTC;

  • Bitcoinist;

  • Blockonomi;

  • Coinspeaker.

Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in Stage 1 of the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.

Furthermore, we will publish a new dataset to provide extracted Semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Figure 1: Text Analytics Value-Add Pipeline (VAP)

Expected ROI

The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.

Here, we focus on the first of these value drivers – since it is the easiest to ballpark.

We make the following assumptions:

  • Probability of project success = 0.8

  • Ocean Community gets 0.2% of consume volume.

We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively.

Note:

Let x_t = consumption for month t.

x_t = x_0 * (1 + rate)^t, for t = 0, 1, …, 11 months,

where x_0 = initial_num_users × num_datasets × datatoken_price.
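As a worked illustration of the growth formula and the two stated assumptions (0.8 probability of success, 0.2% of consume volume going to the Ocean Community), the sketch below computes the 12-month consumption series and the probability-adjusted community fees. The input values (10 users, 3 datasets, a datatoken price of 100 OCEAN, 10% monthly growth) are hypothetical and are not the scenarios shown in the figures.

```python
def monthly_consumption(x0, rate, months=12):
    """Return x_t = x0 * (1 + rate)**t for t = 0, ..., months - 1."""
    return [x0 * (1 + rate) ** t for t in range(months)]

def expected_community_fees(x0, rate, fee_share=0.002, p_success=0.8, months=12):
    """Probability-adjusted fees accruing to the community over the horizon."""
    total_consumption = sum(monthly_consumption(x0, rate, months))
    return p_success * fee_share * total_consumption

# Hypothetical scenario inputs (not taken from the proposal's figures).
initial_num_users = 10
num_datasets = 3
datatoken_price = 100.0  # in OCEAN, assumed
rate = 0.10              # assumed 10% month-on-month growth

x0 = initial_num_users * num_datasets * datatoken_price
series = monthly_consumption(x0, rate)
fees = expected_community_fees(x0, rate)

print("initial monthly consumption:", x0)
print("consumption in month 11:", round(series[-1], 2))
print("expected community fees over 12 months:", round(fees, 2))
```

Because the series is geometric, the 12-month total also has the closed form x_0 * ((1 + rate)^12 - 1) / rate, which gives a quick check on any scenario.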

Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):

[Figure 2 image: OCEAN Datatoken consumption growth scenarios]

[Figure 3 image: ROI growth scenarios]

The man hours and computation necessary to code, scrape, clean, and process these datasets are substantial.

Project Deliverables – Category

  1. ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace – as detailed above.

  2. Furthermore, we will publish a new dataset to provide the extracted Semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Project Deliverables – Roadmap:

Any prior work completed thus far?

The proposed project builds upon the contributions over the past two years of members of ResilientML in building the machine learning pipeline shown in Figure 1. We have performed the processing of data for over 15 assets, and have already published this dataset to the Polygon Ocean marketplace – updating weekly.

Part 1: Python code has been written to perform text data collection via Java-based screen scraping and document collection – this has been unit tested and validated.

Part 2: Python modules have been created with proprietary steps of text data de-noising based upon the concepts provided in the appendix – this has been unit tested and validated.

Part 3: Python modules have been created to extract time series of features related to:

  1. Semantic bag-of-words frequency based features and their corresponding time series.

  2. Grammar based parse trees and their corresponding time series.

  3. Syntax based dependency graphs and their corresponding time series.

These have been unit tested and applied to crypto data. The next stage is to put these steps into a distributed production system and curate the outputs in a JSON data format for an API feed.

Roadmap

Month 1:

  • Complete prototyping of stages 1 – 4.

  • Publish datasets to Ocean marketplace.

Month 2:

  • Maintenance and data support for data buyers.

  • Submit academic research paper to journal

Project Details

Further details of the research prototype are provided in the following peer reviewed papers:

  1. Chalkiadakis, Ioannis and Peters, Gareth W. and Chantler, Michael John and Konstas, Ioannis, A statistical analysis of text: embeddings, properties, and time-series modeling.
  2. Chalkiadakis, Ioannis and Zaremba, Anna and Peters, Gareth W. and Chantler, Michael John, Sentiment-driven statistical causality in multimodal systems.
  3. Zaremba, A. and Peters, G., 2020. Statistical Causality for Multivariate Non-Linear Time Series via Gaussian Processes.
  4. Peters, Gareth, Statistical Machine Learning and Data Analytic Methods for Risk and Insurance.

Team members

ResilientML consists of 5 team members.

Chair Prof. Gareth W. Peters (CStat-RSS, FIOR, YAS-RSE) - Head of Research

Background:

Experience:

  • Co-founder of ResilientML

  • 20+ years machine learning research

  • 5 research books

  • 200+ journal and conference papers

  • Successfully delivered projects from grants totalling 5 million+ GBP.

Short Bio

Prof. Gareth W. Peters is the ‘Janet and Ian Duncan Endowed Chair of Actuarial Science’ Professor of Statistics for Risk and Insurance in the Department of Statistics & Applied Probability at the University of California, Santa Barbara (UCSB). Previously he held tenured positions in the Department of Actuarial Mathematics and Statistics at Heriot-Watt University in Edinburgh, the Department of Statistical Sciences at University College London, UK, and the Department of Mathematics and Statistics at the University of New South Wales, Sydney, Australia.

Prof. Peters is the Director of the Scottish Financial Risk Association.

Prof. Peters is also an elected member of the Young Academy of Scotland in the Royal Society of Edinburgh (YAS-RSE) and an elected Fellow of the Institute of Operational Risk (FIOR). He was also the Nachdiploma Lecturer in Machine Learning for Risk and Insurance at ETH Zurich in the Risk Laboratory.

He has given in excess of 150 international invited presentations and speaking engagements, including numerous keynote presentations. He has delivered numerous professional training courses to C-suite executive-level industry professionals as well as to numerous central banks.

He has published in excess of 150 peer reviewed articles on risk and insurance modelling, 2 research text books on Operational Risk and Insurance as well as being the editor and contributor to 3 edited text books on spatial statistics and Monte Carlo methods.

He currently holds positions as:

  • Honorary Prof. of Statistics at University College London, 2018+

  • Affiliated Prof. of Statistics in University of New South Wales Australia 2015+

  • Affiliate Member of Systemic Risk Center, London School of Economics 2014+

  • Affiliate Member of Oxford Man Institute, Oxford University (OMI) 2013+

  • Honorary Prof. of Statistics in University of Sydney Australia 2018+

  • Honorary Prof. of Statistics in Macquarie University, Australia 2018+

  • Visiting Prof. in Institute of Statistical Mathematics, Tokyo, Japan 2009-2018+

He previously held positions as:

  • Honorary Prof. of Peking University, Beijing, China 2014-2016

  • Adjunct Scientist in the Mathematics, Informatics and Statistics, Commonwealth Scientific and Industrial Research Organisation (CSIRO) 2009-2017

Webpage: https://www.qrslab.com/

Gordon Gay – CEO

Background:

Experience:

  • Co-founder of ResilientML

  • 23 years R&D at NEC Australia, roles - GM of R&D, National Head of Innovation

Matthew Ames – CTO / Co-Head of Research

Background:

Experience:

  • 5 years industry experience - machine learning, finance

Ioannis Chalkiadakis – Data Scientist / Natural Language Processing

Background:

Experience:

  • 3 years Software Engineering

Appendix: Detailed Project Description

Extracting Value from Text Data

Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity.

The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity relates to the quality and provenance of the data that are fed into the statistical models. The source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

Importance of text pre-processing

With any type of data collected from real world processes, it is usually the case that a set of “clean-up” or pre-processing transformations are required before using them for the statistical processing.

The pre-processing procedures will remove the noise from the data which will allow us to operate on the actual information we want to process. In this way we will not only ensure the veracity we want to achieve, but will also obtain efficiency and computational benefits.

Statistical text processing: Pipeline

In general, we can identify three stages for the statistical analysis of text data:

• data import,

• data wrangling, and

• finally, development and evaluation of the statistical model.

The first step of importing the data consists of either loading an already existing dataset, or alternatively collecting one’s own set of data, for example via scraping web pages, scanning/optical character recognition (OCR) of printed documents or transcribing spoken text. The data import however, does not guarantee that the dataset will be in such a format that will facilitate subsequent processing.

Therefore, we need to go through the process of “tidying” the data, where one constructs “data frames”, i.e. tabular structures, where each variable is stored in its own column and each observation occupies one row.

This process will create a tidy dataset and will facilitate subsequent data transformations, visualization and processing. Creating a tidy dataset and applying the necessary transformations or visualization methods constitutes the process of “data wrangling”.
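As a small illustration of tidying, the sketch below flattens nested per-article token counts into a tabular structure with one variable per column and one observation per row, then writes it out as CSV. The article identifiers, tokens and column names are hypothetical; in practice a data-frame library would typically play this role.

```python
import csv
import io

def to_tidy_rows(article_counts):
    """Flatten {article_id: {token: count}} into one row per (article, token)."""
    rows = []
    for article_id, counts in sorted(article_counts.items()):
        for token, count in sorted(counts.items()):
            rows.append({"article_id": article_id, "token": token, "count": count})
    return rows

# Hypothetical nested counts, e.g. as produced right after data import.
nested = {
    "article-001": {"bitcoin": 2, "rally": 1},
    "article-002": {"regulation": 3},
}

rows = to_tidy_rows(nested)

# Serialise the tidy table; each column is a variable, each row an observation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["article_id", "token", "count"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```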

During modelling, it might be necessary to apply additional transformations on the data, hence there is a feedback loop between the data wrangling and modelling stages.

Noise in Text Data & its Removal During Data Wrangling.

What constitutes noise in raw text, and under what conditions may it be introduced into our data?

Obvious noise artefacts are:

• encoding scheme (representation),

• word mis-spellings,

• errors in the linguistic structure (grammar or syntax),

• missing spaces or punctuation symbols and

• wrong capitalization patterns.

These types of noise patterns are usually introduced at the creation stage of the raw text, and are challenges that are expected in natural language applications.

However, noise may appear in non-obvious forms as well. Users of communication services, for instance SMS, e-mails, instant messages, or social media posts, often use abbreviations, emoticons, or even omit certain words.

These patterns, depending on the application, could hinder the processing of the raw data. For example, when analyzing sentiment from Tweets, most researchers will want to consider emoticons as they can be very expressive about the feelings of the author of the Tweet.

On the other hand, if someone strictly wants to analyze the lexical or grammatical patterns that appear frequently among Twitter users, information based on emoticons is potentially irrelevant, in which case it is noise and has to be removed.

This domain specificity of noise also applies to two further noise sources that are considered standard in NLP, namely “stopwords” and punctuation.

The term stopwords refers to words that are not considered useful for the intended analysis because they lack discriminative power (e.g. appear too often in the dataset) or lack significant semantics, namely terms such as “a”, or “the”.

Stopword removal is considered a standard part of the pre-processing pipeline, is usually performed early in the pre-processing stage, and most NLP software packages come with standard predefined stopword lists.

Basic pre-processing steps

The list is not exhaustive:


1. Punctuation

Often punctuation marks (such as , . ! ? ; # “” ‘’ ~) are removed, for example when one aims to analyze counts of terms, and therefore punctuation becomes unnecessary. However, similar to stopwords, there are cases when all or a subset of punctuation marks are useful and are therefore desirable to maintain. For example, exclamation marks may reveal sentiment information, or some symbols may carry special meaning in certain contexts, such as the hashtag (#) symbol in Tweets where it relates to the Tweet semantic content. Finally, it is important to consider at which stage of the analysis one should remove punctuation. If we want to detect sentence boundaries, or perform syntax or grammar parsing, then it is important to maintain punctuation symbols before performing these stages. Once this type of analysis has finished, it may be safe to remove punctuation if it is required for further analysis steps.
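The selective removal described above can be sketched as follows. This is a minimal illustration, assuming ASCII punctuation only (typographic quotes such as “ ” would need to be added to the removal set); the function name `strip_punctuation` and its `keep` parameter are our own illustrative choices.

```python
import string


def strip_punctuation(text, keep="#!"):
    """Remove ASCII punctuation, except for the characters listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", to_remove))


example = "Great results, again! #earnings (Q3)."

# Sentiment-style setting: keep exclamation marks and hashtags.
kept_some = strip_punctuation(example)
# Term-count setting: remove every punctuation mark.
kept_none = strip_punctuation(example, keep="")

print(kept_some)
print(kept_none)
```

If sentence boundary detection or parsing is part of the pipeline, a function like this would only be called after those stages have run.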


2. Numbers

Numbers are often removed because, most of the time, they do not contribute semantic information. As with punctuation, however, one should carefully consider the application context before deciding to remove them. If the domain requires the extraction of dates, or of case numbers when processing legal documents, then specific rules should be applied to dictate the conditions under which numbers are removed from the text.
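A rule of this kind could look like the sketch below, which drops purely numeric tokens but can protect dates. It assumes the text is already tokenized and that dates are written in ISO format (YYYY-MM-DD); both patterns and the function name `drop_numbers` are illustrative assumptions rather than rules taken from our pipeline.

```python
import re

# Any token made only of digits and common numeric separators, with at
# least one digit: 42, 3.14, 1,250.50, 2021-10-05, 10/05
NUMERIC = re.compile(r"[\d.,/\-]*\d[\d.,/\-]*")
# Assumed date format for this example: ISO dates such as 2021-10-05
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def drop_numbers(tokens, keep_dates=True):
    """Remove numeric tokens, optionally keeping ISO-formatted dates."""
    kept = []
    for tok in tokens:
        if keep_dates and ISO_DATE.fullmatch(tok):
            kept.append(tok)      # protected by the domain rule
        elif NUMERIC.fullmatch(tok):
            continue              # plain number: dropped
        else:
            kept.append(tok)      # ordinary word
    return kept


tokens = "Filed on 2021-10-05 for 1,250.50 USD in case 42".split()
with_dates = drop_numbers(tokens)
without_dates = drop_numbers(tokens, keep_dates=False)
print(with_dates)
print(without_dates)
```

A legal-document application would add a further protected pattern for case numbers in the same way as the date rule.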


3. Lowercase

Lowercasing all terms is applied to reduce the vocabulary space, i.e. the set of words we expect to come across. This is useful for reducing the computational and space complexity in applications where we work with large sparse matrices of word counts. However, there are cases where uppercase letters reveal structural information: they can help identify sentence boundaries, or proper names, and can also help reduce ambiguity; for example distinguishing the proper name Rose from the noun denoting the flower rose.
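The trade-off can be seen on a toy sentence. The sketch below simply compares vocabulary sizes with and without lowercasing; the sentence and the helper name `vocabulary` are made up for illustration.

```python
def vocabulary(tokens, lowercase=False):
    """Return the set of distinct terms, optionally after lowercasing."""
    return {t.lower() if lowercase else t for t in tokens}


tokens = "The rose that Rose bought was the best rose".split()

case_sensitive = vocabulary(tokens)
lowercased = vocabulary(tokens, lowercase=True)

print(len(case_sensitive), sorted(case_sensitive))
print(len(lowercased), sorted(lowercased))
```

Lowercasing shrinks the vocabulary from 8 to 6 terms here, but it also merges the proper name “Rose” with the flower “rose”, which is exactly the ambiguity described above.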


4. Stemming

Stemming is an additional technique that aims to reduce the vocabulary space. It consists of removing any inflections from a word and reducing it to its most basic form. For example, a stemmer (the program that does the stemming) will map “walked”, “walking” and “walks” to the stem “walk”. Note that in the case of e.g. “studies”, the stemmer will return a basic word form (“studi”) that is not itself a valid word. This is because stemming does not account for the grammatical or syntactical pattern behind the inflection; it only cuts it off. An alternative method is lemmatisation, where the lemma (the dictionary form of the word) is returned, which means that lemmatisation returns terms that are in the language. For example, the lemmatisation program will map “studying” and “studies” to “study”. This is achieved by considering the part-of-speech of each term (e.g. is it a noun, verb, adjective or adverb?) in order to determine the correct lemma. As with punctuation, if we want to perform syntactic or grammatical analysis we have to postpone stemming and lemmatisation until after these stages.
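The difference between the two techniques can be illustrated with a deliberately naive sketch. The suffix rules below are a toy rule set written for this example, not the Porter algorithm or any library implementation, and the lemma lookup is a hand-written table standing in for a real part-of-speech aware lemmatiser.

```python
# Toy suffix rules, tried in order: (suffix to cut, replacement).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hand-written lookup standing in for a real lemmatiser (assumption).
TOY_LEMMAS = {"studies": "study", "studying": "study",
              "walked": "walk", "walking": "walk", "walks": "walk"}


def toy_stem(word, min_stem=3):
    """Cut the first matching suffix, without checking that the result is a word."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: len(w) - len(suffix)] + replacement
    return w


def toy_lemma(word):
    """Return a dictionary form from the lookup table, or the word unchanged."""
    return TOY_LEMMAS.get(word.lower(), word.lower())


for word in ["walked", "walking", "walks", "studies"]:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer returns “studi” for “studies” because it only cuts characters, while the lookup returns the valid word “study”.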


5. Stopword removal

As we have discussed, this step should be treated very carefully, so as to minimize both information loss and remaining noise in the dataset. It is therefore recommended that the standard stopword lists in software packages be checked, and modified accordingly, before use.
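Checking and modifying a predefined list might look like the following sketch. The `BASE_STOPWORDS` set is a small hypothetical stand-in for a package-provided list, and the helper names are illustrative.

```python
# Hypothetical stand-in for a predefined stopword list from an NLP package.
BASE_STOPWORDS = {"a", "an", "the", "is", "of", "and", "not"}


def customise(base, keep=(), add=()):
    """Return a copy of `base` without the words in `keep`, plus those in `add`."""
    return (set(base) - set(keep)) | set(add)


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "the outlook is not good".split()

# Default list: the negation is lost, which would mislead a sentiment model.
default_result = remove_stopwords(tokens, BASE_STOPWORDS)

# Customised list: "not" is kept, and a domain-specific filler word is added.
sentiment_stopwords = customise(BASE_STOPWORDS, keep={"not"}, add={"outlook"})
custom_result = remove_stopwords(tokens, sentiment_stopwords)

print(default_result)
print(custom_result)
```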


6. Word compounds

Word compounds are groups of words, usually groups of two (bigrams) or three (trigrams) that frequently appear together and convey a different meaning than if we consider each one individually. For example, the trigram “Wall Street Journal” denotes the name of a popular newspaper and we would like to account for it as a single term, when we want to extract its appearance in the dataset. If we do not, we can still identify the valid terms of “wall”, “street” and “journal” individually, however we ignore the fact that they refer to a newspaper rather than carry their separate meanings.
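One simple way to treat known compounds as single terms is to merge them during tokenization. The sketch below assumes the compounds are supplied as a hand-picked list (in practice they could come from co-occurrence statistics); the function name `merge_compounds` and the underscore joining convention are illustrative choices.

```python
def merge_compounds(tokens, compounds):
    """Greedily merge known n-grams into single tokens, longest match first.

    `compounds` is a collection of lowercase tuples such as
    ("wall", "street", "journal").
    """
    by_length = sorted(compounds, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for compound in by_length:
            n = len(compound)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window == compound:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts at this position: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


compounds = {("wall", "street", "journal"), ("wall", "street")}
tokens = "The Wall Street Journal reported on Wall Street".split()
merged = merge_compounds(tokens, compounds)
print(merged)
```

Trying the longest compound first ensures that the trigram is not split into the bigram “Wall Street” followed by “Journal”.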


7. Remove low-frequency words

It is also common practice to remove extremely rare words, namely words that constitute less than a small fixed percentage (< 0.5 - 1 %) of the document corpus, again to reduce the computational and space complexity.
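The sketch below shows one possible reading of this rule, in which a word is dropped when its share of all tokens in the corpus falls below a threshold. Interpreting the percentage as a share of tokens (rather than, say, a share of documents containing the word) is our assumption for this example, as are the names `drop_rare` and `min_share`.

```python
from collections import Counter


def drop_rare(docs, min_share=0.01):
    """Remove words whose share of all corpus tokens is below `min_share`."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    keep = {tok for tok, c in counts.items() if c / total >= min_share}
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Toy corpus of 200 tokens: "zeitgeist" appears once (0.5% of tokens).
docs = [
    ["rates"] * 60 + ["zeitgeist"],
    ["rates"] * 60 + ["inflation"] * 79,
]
filtered = drop_rare(docs, min_share=0.01)
print(len(filtered[0]), len(filtered[1]))
```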

At the end of these processes we will have tokenized, cleaned and wrangled text data that is ready for feature extraction and data analysis in NLP-based machine learning.

Feature Extraction Methods – Time-Series of NLP Text Features

We identify three distinct categories:

Semantics : namely the meaning behind words and sentences and the coherence of a well-formed text - Bag-Of-Words (Frequency Based Features).

The way we capture semantics is based on the bag-of-words model (BoW), which has been widely applied in natural language processing (NLP) and information retrieval (Harris, 1954). The main concept behind BoW is to map a segment of text to an unordered collection, or “bag”, of words. As we have seen, this is the premise for the construction of document-term matrices for a corpus of documents; in its original formulation, BoW is applied to each complete document in a collection of documents (a corpus) and ignores the sequence of words in the text. We transfer BoW into a time-series context and present an “online” formulation. This allows us to overcome the computational difficulties associated with BoW, namely the handling of sparse matrices whose size depends on the number of distinct document words and the corpus size, and may well be in the order of hundreds of thousands. In addition, this setting allows us to construct a text-based time-series that can be incorporated into a time-series based system for supervised or unsupervised learning.
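To give a flavour of what a text-based time-series can look like, the sketch below processes documents one at a time and keeps only per-period counts for a small tracked vocabulary, instead of building a full document-term matrix. It is a simplified illustration under our own assumptions (daily periods, relative frequencies, a fixed tracked vocabulary) and is not the online formulation used in the ResilientML feature library; the name `bow_time_series` is hypothetical.

```python
from collections import Counter, defaultdict


def bow_time_series(stream, vocabulary):
    """Build {period: [relative frequency of each vocabulary term]}.

    `stream` yields (period, tokens) pairs one document at a time, so only
    small per-period counters are held in memory.
    """
    vocab = list(vocabulary)
    tracked = set(vocab)
    counts = defaultdict(Counter)   # tracked term counts per period
    totals = Counter()              # total token count per period
    for period, tokens in stream:
        totals[period] += len(tokens)
        counts[period].update(t for t in tokens if t in tracked)
    return {
        period: [counts[period][term] / max(totals[period], 1) for term in vocab]
        for period in sorted(totals)
    }


stream = [
    ("2021-10-01", ["rates", "rise", "rates", "fall"]),
    ("2021-10-01", ["inflation", "up", "rates", "again"]),
    ("2021-10-02", ["rates", "hold", "steady", "today"]),
]
series = bow_time_series(stream, ["rates", "inflation"])
for period, features in series.items():
    print(period, features)
```

Each period yields a fixed-length feature vector, so the output can be aligned with other time-series inputs to a supervised or unsupervised model.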

Grammar : i.e. the structural rules that dictate how words fit into the sentence and form groups such as clauses and phrases - parse tree or constituency tree for n-grams/sentences.

Consider the following example

The brown dog is running in the park.

which can also be written in an equivalent grammatical manner:

He is running in the park.

without destroying the grammar or meaning of the sentence.

The fact that a group of words can operate as a single unit - and therefore in our example we can replace the phrase “the brown dog” with “he” - is the linguistic property of constituent structure.

Therefore, one can extract features based on the grammatical rules that tell us which words can be grouped into units, and study these units for their role in the sentence.

The formal system for studying this phenomenon, i.e. the grouping of words as in the above example, is the context-free grammar (CFG). Mathematically, a CFG is defined by a quadruple as follows: 𝐺=(𝑁,Σ,𝑅,𝑆) where

𝑁 is a set of non-terminal symbols

Σ is a set of terminal symbols, 𝑁∩Σ=∅

𝑅 is a set of rules, 𝑅={𝐴→𝛽:𝐴∈𝑁 and 𝛽∈(Σ∪𝑁)∗}

𝑆 is the designated start symbol, 𝑆∈𝑁

Such grammars are called “context-free” because the left-hand side of each rule consists of exactly one non-terminal symbol, so a rule can be applied to that symbol regardless of the context in which it appears.

A context-free grammar defines a formal language, namely the set of strings of terminal tokens that can be derived starting from 𝑆.

A sentence is called grammatical if the sequence of terminal tokens that comprises it can be derived by following the rules of the CFG; otherwise, the sentence is not valid according to the language of the CFG (ungrammatical).

The process of analyzing the constituent structure of a sentence is called constituency parsing, and the derivation of a sentence, i.e. the rules that we followed when building it, can be represented with a hierarchical structure, a tree, which is called the parse tree or constituency tree.
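The definitions above can be made concrete with a small recognizer. The sketch below uses the CYK algorithm, which requires the grammar to be in Chomsky normal form (every rule is either 𝐴→𝐵𝐶 or 𝐴→terminal); this restriction, the toy grammar, and the non-terminal names are our own assumptions for the example and cover only the two sentences discussed earlier.

```python
# Toy grammar in Chomsky normal form (assumption made for this sketch).
BINARY_RULES = [            # A -> B C
    ("S", ("NP", "VP")),
    ("NP", ("Det", "Nom")),
    ("Nom", ("Adj", "Nom")),
    ("VP", ("Aux", "VG")),
    ("VG", ("V", "PP")),
    ("PP", ("P", "NP")),
]
LEXICAL_RULES = {           # terminal -> non-terminals that produce it
    "the": {"Det"}, "brown": {"Adj"}, "dog": {"Nom"}, "park": {"Nom"},
    "is": {"Aux"}, "running": {"V"}, "in": {"P"}, "he": {"NP"},
}


def is_grammatical(sentence, start="S"):
    """CYK recognition: can `start` derive the sentence under the toy grammar?"""
    tokens = sentence.lower().rstrip(".").split()
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(LEXICAL_RULES.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):           # split point between the two parts
                for lhs, (left, right) in BINARY_RULES:
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].add(lhs)
    return start in table[0][n - 1]


print(is_grammatical("The brown dog is running in the park."))
print(is_grammatical("He is running in the park."))
print(is_grammatical("Dog the is running."))
```

Both example sentences are accepted because “the brown dog” and “he” are each derived from the same non-terminal NP, which is the constituent structure property described above.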

Syntax : that is the principles that dictate the structure of sentences by specifying the order and role of each word in the text - Dependency Graphs.

The goal of syntactic analysis is to discover the pairs of words in which one word depends on the other, and the type of that dependence.

These dependency relations are binary and asymmetrical, and therefore we would like to know which of the two words acts as the head that is modified in some way, and which is the dependent that modifies or complements the head. This concept allows us to think of the dependency relations as inducing graph structures (dependency graphs) which we use to study the dependency relations between words, and therefore the syntax of a sentence. The syntactic analysis complements the grammatical, parse tree-based analysis, as now we aim to extract information on the functional role of each word in the sentence, rather than structural relations between them as we did with the context-free grammar.
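A dependency graph can be stored as a list of labelled head-to-dependent edges. In the sketch below the edges for the example sentence were written by hand, using relation labels in the style of Universal Dependencies; they are an illustrative annotation, not the output of a parser, and the helper names are hypothetical.

```python
from collections import defaultdict

tokens = ["The", "brown", "dog", "is", "running", "in", "the", "park"]

# (head, relation, dependent) with 1-based token positions; 0 is an
# artificial ROOT node. Hand-annotated for illustration only.
edges = [
    (0, "root", 5),    # running is the root of the sentence
    (5, "nsubj", 3),   # dog      <- running
    (3, "det", 1),     # The      <- dog
    (3, "amod", 2),    # brown    <- dog
    (5, "aux", 4),     # is       <- running
    (5, "obl", 8),     # park     <- running
    (8, "case", 6),    # in       <- park
    (8, "det", 7),     # the      <- park
]


def build_graph(edges):
    """Map each head position to its list of (relation, dependent) pairs."""
    graph = defaultdict(list)
    for head, relation, dependent in edges:
        graph[head].append((relation, dependent))
    return graph


def subtree(graph, node):
    """Sorted positions of `node` and everything that depends on it."""
    found = [node]
    for _, dependent in graph.get(node, []):
        found.extend(subtree(graph, dependent))
    return sorted(found)


graph = build_graph(edges)
subject_words = [tokens[i - 1] for i in subtree(graph, 3)]
location_words = [tokens[i - 1] for i in subtree(graph, 8)]
print(subject_words)
print(location_words)
```

Because each word has exactly one head, the edges form a tree rooted at the main verb, and features such as the relation label of a word or the size of its subtree can be read directly from the graph.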

Hi @AlexN

Please can you facilitate a Stewardship call w.r.t. the above Round 11 proposal?

Many thanks
Matt

Hi @AlexN

ResilientML requests your assistance to please withdraw this grant from Round 11 as we no longer wish to seek funding from Ocean community for this project. Our project roadmap is extensive and substantial in our view as outlined in our stewardship call today, and we will continue to build this product roadmap out, but we have decided to achieve this through private capital raises that we are currently undertaking. ResilientML will continue to grow the data set that has been scoped and developed, just not on the Ocean market at this stage. We will re-engage in future if the demand side of the ocean market place finally appears in a meaningful fashion to justify the ability to meet the communicated expectations by Ocean Stewards of short term ROI expectations currently placed on projects in the Ocean Dao seeking funding. We believe that at least in our specialised ML solutions for Natural Language Processing context this does not yet have any demand side. It was also clear that for projects such as ours that are focussed on data provision and Features-as-a-Service in addition to building functionality if we are still classified as data providers (even when we also build a lot of ML solutions) a cap of 3k is no longer worthwhile economically to participate for our time commitment. Given the communicated lack of flexibility to recognise projects that straddle both data provision/development and fundamental tools/functionality (rather than just a fancy UX/UI interface with not a lot behind technically) we will rather withdraw and reassess in future once the approach to funding projects becomes more stable and consistent. We thank the community for the support and it has been a pleasure to engage with all those who voted for us. 
We hope you will cross paths with the exciting products we are still developing as they are released on various Android and Apple app stores and integrated into major financial system platforms, as is currently being developed with the partners of ResilientML… Best of luck to the Ocean Dao; we will follow the progress of the many interesting projects that persevere through these turbulent times of setting the parameters and expectations of the funding frameworks and voting mechanisms…