Key Project Data
Name of project:
ResilientML – Expansion of Sentiment Data – FaaS + Online Dashboard UI
Team Website:
Proposal Wallet Address:
0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391
https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391
Which category best describes your project?
Unleash data
Funding Requested
17,600 USD
Features-as-a-Service (FaaS) model – concept of a FaaS market place on Ocean
A look at the marketplaces currently available on Ocean quickly shows that the bulk of the offerings are rudimentary datasets taken from standard sources and listed on Ocean for a price, despite the fact that most of these datasets are freely available in standard libraries for R, Python and Julia. Furthermore, these datasets are not processed or curated in ways that make them easy to use for training ML models, and they provide no additional processing value that would save compute time in ML model training or fine-tuning. This is where Features-as-a-Service (FaaS) data markets become important.
The ResilientML FaaS curated data libraries provide a significant advantage compared to the existing datasets provided on the Ocean market places to date.
We aim to provide professional-level curated datasets for ML Compute-to-Data users to be able to access data and feature libraries that are weekly maintained, updated and extended, and which are in standard professional data formats with standardised feature templates in JSON and Python pickle, ready for training and fine-tuning of time-series models, deep neural network models, Transformer models, LSTMs etc.
In general, low quality raw datasets on the markets tend to create a problem for buyers and users as firstly, this adds a lot of noise to buyers’ perception of quality data, and secondly, they lack formal curation principles for dataset preparation and dataset utility for machine learning modelling applications. The data is often not processed, not developed into consistent formats for ease of use in ML models to be trained in Compute-to-Data modules in the future, and it is often static and non-updated or non-maintained.
ResilientML is trying to bring professional grade data curation and feature libraries in the NLP context to the Ocean market, and we hope the ML practitioners can appreciate the difference between our professionally curated data library in the FaaS context versus low level raw data libraries.
Since the datasets have basically no demand side at present, we have currently offered our FaaS at a fixed cost with a monthly subscription to the service. This will progressively change if eventually Ocean is able to get demand side participation to match supply side, which is currently a serious problem for people curating datasets.
It is vital for ResilientML as a company that a demand side is developed for the Ocean markets so that curating, maintaining and updating the FaaS is able to continue. Until that point, ResilientML has to rely on grant funding to continue to offer the highest level of data FaaS quality, which distinguishes itself from low quality data sets currently available.
ResilientML as a company will be unable to keep curating, maintaining and updating the FaaS data markets on Ocean if there is neither a demand side to these markets nor grant funding. So if the community wants our service as part of the ecosystem, I ask you to seriously consider not downgrading our project as in the previous round, and to regard it as a state-of-the-art FaaS service for Ocean that we are sure can add increasingly valuable data amid the many noisy, low quality datasets currently present.
Updates on Features-as-a-Service model
Using the previous rounds of funding, ResilientML has been producing a weekly growing automated data market for Natural Language Processing Feature Libraries (NLPFL) on Ocean.
The following process has been automated and executed in Cloud infrastructure on a weekly basis (see figures below):
- Scan increasing numbers of news sources (currently 3+) automatically and extract news reports on targeted topics related to key areas in the crypto sphere;
- Process text into cleaned format – as outlined in proposal stages of data munging;
- Extract key features from text in n-gram formats, sentence format;
- Construct the JSON data library update, and automate its upload to box storage and the update of the Ocean market every week at midnight on Saturday.
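The cleaning, n-gram extraction and JSON construction steps listed above can be sketched roughly as follows. This is a minimal illustration only, not the ResilientML pipeline: the function names, the record fields (`article_id`, `source`, `tokens`, `bigrams`) and the sample text are all hypothetical.

```python
import json
import re


def clean_text(raw):
    """Lowercase the text and replace anything that is not a letter, digit or space."""
    lowered = raw.lower()
    no_symbols = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_symbols).strip()


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_record(article_id, source, raw):
    """Assemble one JSON-serialisable feature record (hypothetical schema)."""
    tokens = clean_text(raw).split()
    return {
        "article_id": article_id,
        "source": source,
        "tokens": tokens,
        "bigrams": extract_ngrams(tokens, 2),
    }


# Illustrative input, not taken from the real corpus.
record = build_record("a-001", "example-source", "Bitcoin rallies; Ether follows!")
print(json.dumps(record, indent=2))
```

A weekly job would append records of this kind to the library file before publishing the update.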
As of 28/07/2021 the market place has data for Features-as-a-Service summarised as follows:
- 3 major crypto news sources
- 18 crypto projects that are reported on
- 5 news categories relevant to the crypto space
- 27,072 articles
- 13,760,130 processed text tokens,
the processing of which is the result of more than 3 years of research into statistical natural language processing.
Furthermore, the following processes have been in development and will be offered on the marketplace both as data and as Compute-to-Data modules, once the latter are fully supported:
- Segment text into topics and append to previous feature libraries,
- Process higher order features sets – CFG trees, Dependency graphs, spectral features and sentiment indices,
- Create summary overview of changes made, current special features that week, KPIs and total records to date (currently partially in effect).
The same data suite will also be available for COVID-19 news articles in the next stage of our FaaS project, and data from a range of other sectors will follow (e.g. climate).
The KPIs and data summaries will in addition be developed and illustrated via a dedicated Dashboard on the ResilientML website, which will be showing the dataset state in real time.
Finally, the following Compute-to-Data modules are scheduled to be released, all of which are based on research outputs in peer-review academic journals from the ResilientML team:
- Mixed-frequency data modelling for forecasting investors’ sentiment in crypto markets
- Epidemiological models for COVID-19 infected cases, enhanced with public sentiment
Round 8 Proposal
This proposal is to:
- continue to extend, develop and further curate our initial development of a Crypto-specific Natural Language Processing Data Suite, and
- produce an online dashboard that will show in real time the status of the Real News Crypto dataset.
The grant will go to help fund the coding, NLP model development, man hours and compute hours required to extend our current published dataset. We aim to continue to grow this dataset and expand its usability by further extracting value from the corpus of crypto related financial news, and making it more accessible to the Ocean community.
This round we will expand our offering to include new high level extracted sentence structure features for the currently available data and the following additional components:
- Web3.0 coins: Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave;
- Layer1 coins (Solana);
- DeFi coins (Terra, PancakeSwap, Maker, THORChain, Serum);
- Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs);
- Stablecoins (USDC, BUSD),
and including data from the following additional news sources:
- Cointelegraph;
- NewsBTC;
- Bitcoinist;
- Blockonomi;
- Coinspeaker.
Specifically, in addition to the post-processed tokens, we will now include sentences processed to remove parts of text noise, whilst keeping their grammatical, syntactical and semantic structure intact. This is useful for NLP Compute-to-Data modules that utilise raw text sentences at their input (e.g. BERT-type, Transformer-based models). In addition, we will provide advanced features capturing the grammatical and syntactical structure of the sentences. These comprise Context-Free Grammar (CFG) parse trees which express the grammatical structure of a sentence, as well as Dependency Graphs which encode the sentence syntax.
We will expand the dataset not only going forwards but also by back-filling the currently available data. This will effectively increase the current data offering we have developed by around a factor of three in terms of content and volume of processed corpus.
We will continue to grow the sophistication of the data being provided. This will be achieved by migrating from data munging to feature extraction and data curation for feature libraries.
The initial phase of this will encompass publishing a new dataset that provides extracted semantic features, such as bag-of-words and word embeddings based on matrix factorization or deep learning, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.
The proposal in one sentence
Data is the modern oil of the blockchain economy. ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored datasets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.
Project Overview
Mission:
The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:
- We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy, one that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized, as it requires a non-standard set of data science processes to extract and curate.
- In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. Its members will be able to use our high quality Natural Language Processing (NLP) text data features to develop apps that interface directly with our data reservoirs through APIs. These APIs can extract relevant text data features from our JSON-formatted, curated feature libraries, forming tributaries to the wider Ocean machine learning applications that seek inputs from text-based data features.
ResilientML has developed methods in Python to produce these JSON-formatted text feature collections, which will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized NLP methods that ResilientML will bring to the Ocean community, based on extensive academic and industry experience in developing such solutions.
In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.
NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:
- Crypto News Sentiment
- Social Media Sentiments
- Technology: GitHub, Bitbucket, wire, …
- Regulatory compliance reports
- Legal documents
The team at ResilientML has dedicated quantitative analysts, machine learning experts and industry-leading engineers to develop this suite of tools as APIs and as cloud solutions on Azure and AWS, using R, Python, MongoDB and other technologies.
Description of the project:
Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).
Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.
“Big Data” (a term used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.
Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text. Veracity relates to the quality and provenance of the data that are fed into the statistical models: the source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.
The appendix of this proposal gives a detailed synopsis of the project, including the specific stages of machine learning used to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.
The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate this process. The relevancy and approaches to data collection require domain knowledge to identify the most relevant sources of data to extract value from to ensure the data is of highest integrity.
We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.
An important point of distinction to what we offer is that we move beyond the standard approach of bag-of-words and frequency of words based models which are ubiquitous in most NLP sentiment based frameworks, but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature which means that we can extract contribution to sentiment by individual articles, authors, and news sources explicitly.
Another important distinction of our contribution compared to other sentiment-based models is that we do not focus only on Twitter and social media feeds, which have limited scope to express sentiment. Instead, our approach targets detailed analyst reports, editor-processed news reports, and regulatory reports. Working with these richer, higher quality and more credible data sources is more complex than working with social media data, and our framework is able to accommodate this.
What problem is your project solving?
Currently, there is a lack of high quality data on the Ocean marketplace, which is to be expected at this stage. In order to trigger a snowball effect that attracts high quality data providers, an initial kernel of high quality datasets needs to be published, staked, and purchased on the marketplace. We will contribute to the provision of such high quality datasets to the Ocean marketplace to drive growth, which is critical to the success of the protocol.
What is the final product?
ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace to include new high level extracted sentence structure features, such as Context-Free Grammar trees and Dependency Graphs, for the following sectors:
- Web3.0 coins (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);
- Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);
- DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);
- Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);
- Stablecoins (Tether, USDC, BUSD etc.),
with news topics covering:
- DeFi;
- Exchanges;
- Regulation;
- NFT,
and including data from the following additional news sources:
- Coindesk;
- Cointelegraph;
- NewsBTC;
- Bitcoinist;
- Blockonomi;
- Coinspeaker.
Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in Stage 1 of the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.
Furthermore, we will publish a new dataset that provides extracted semantic features, such as bag-of-words and word embeddings based on matrix factorization or deep learning, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.
Figure 1: Text Analytics Value-Add Pipeline (VAP)
Expected ROI
The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.
Here, we focus on the first of these value drivers – since it is the easiest to ballpark.
We make the following assumptions:
- Probability of project success = 0.8
- Ocean Community gets 0.2% of consume volume.
We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively.
Note:
Let x_t be the consumption in month t. Then
x_t = x_0 * (1 + rate)^t, for t = 0, 1, …, 11,
where x_0 = initial_num_users * num_datasets * datatoken_price.
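The growth formula can be evaluated with a few lines of Python. The sketch below is illustrative only: the input values (10 users, 2 datasets, a datatoken price of 50, 10% monthly growth) are placeholders chosen for the example and are not the scenarios behind Figures 2 and 3. The 0.2% community share and the 0.8 probability of success are the two assumptions stated above; applying both to the total consumption is our reading of how they combine.

```python
def monthly_consumption(initial_num_users, num_datasets, datatoken_price, rate, months=12):
    """Return [x_0, ..., x_{months-1}] with x_t = x_0 * (1 + rate)**t."""
    x0 = initial_num_users * num_datasets * datatoken_price
    return [x0 * (1 + rate) ** t for t in range(months)]


def expected_community_fees(consumption, fee_share=0.002, p_success=0.8):
    """Community share of total consume volume, weighted by the probability of success."""
    return p_success * fee_share * sum(consumption)


# Placeholder scenario, for illustration only.
series = monthly_consumption(initial_num_users=10, num_datasets=2,
                             datatoken_price=50.0, rate=0.10)
fees = expected_community_fees(series)
print(f"x_0 = {series[0]:.2f}, x_11 = {series[-1]:.2f}, expected fees = {fees:.2f}")
```

Varying `rate` and `initial_num_users` gives one curve per growth scenario.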
Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):
The man hours and computation necessary to code, scrape, clean, and process these datasets are substantial. Below we lay out our projected fixed costs to provide these datasets. Variable costs should of course be considered but are omitted for simplicity here. Note: these projections are based on an analysis of preliminary processing using a smaller dataset.
Cost Item | Cost USD |
---|---|
Man Hours | 12,000 |
Computation | 5,450 |
Data Storage | 150 |
Total | 17,600 |
Project Deliverables – Category
- ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace, as detailed above.
- Furthermore, we will publish a new dataset that provides the extracted semantic features, such as bag-of-words and word embeddings based on matrix factorization or deep learning, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.
Project Deliverables – Roadmap:
Any prior work completed thus far?
The proposed project builds upon the contributions over the past two years of members of ResilientML in building the machine learning pipeline shown in Figure 1. We have performed the processing of data for over 15 assets, and have already published this dataset to the Polygon Ocean marketplace – updating weekly.
Part 1: Python code has been written to perform text data collection via Java-based screen scraping and document collection – this has been unit tested and validated.
Part 2: Python modules have been created with proprietary steps of text data de-noising based upon the concepts provided in the appendix – this has been unit tested and validated.
Part 3: Python modules have been created to extract time series of features related to:
- Semantic bag-of-words frequency based features and their corresponding time series.
- Grammar based parse trees and their corresponding time series.
- Syntax based dependency graphs and their corresponding time series.
These have been unit tested and applied to crypto data. The next stage is to put these steps into a distributed production system and curate the outputs in a JSON data format for an API feed.
Roadmap
Month 1:
- Complete prototyping of stages 1 – 4.
- Publish datasets to Ocean marketplace.
Month 2:
- Maintenance and data support for data buyers.
- Submit academic research paper to journal.
Project Details
Further details of the research prototype are provided in the following peer reviewed papers:
- Chalkiadakis, Ioannis and Peters, Gareth W. and Chantler, Michael John and Konstas, Ioannis, A statistical analysis of text: embeddings, properties, and time-series modeling.
- Available at SSRN (under review).
- Chalkiadakis, Ioannis and Zaremba, Anna and Peters, Gareth W. and Chantler, Michael John, Sentiment-driven statistical causality in multimodal systems.
- Available at SSRN under the title “On-chain Analytics for Sentiment-driven Statistical Causality in Cryptocurrencies”.
- Zaremba, A. and Peters, G., 2020. Statistical Causality for Multivariate Non-Linear Time Series via Gaussian Processes.
- Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3609497
- Peters, Gareth, Statistical Machine Learning and Data Analytic Methods for Risk and Insurance.
- Available at SSRN.
Team members
ResilientML consists of 5 team members.
Chair Prof. Gareth W. Peters (CStat-RSS, FIOR, YAS-RSE) - Head of Research
Background:
- Linkedin: https://www.linkedin.com/in/gareth-w-peters-3928b4139/
- GoogleScholar: https://scholar.google.com/citations?hl=en&user=goDorpkAAAAJ
- Affiliations/Prizes: https://researchportal.hw.ac.uk/en/persons/gareth-w-peters/prizes/
- PhD University of NSW, Australia
- MSc Cambridge University, England
Experience:
- Co-founder of ResilientML
- 20+ years machine learning research
- 5 research books
- 200+ journal and conference papers
- Successfully delivered projects from grants totalling more than 5 million GBP.
Short Bio
Prof. Gareth W. Peters is the ‘Chair Professor for Risk and Insurance’ in the Department of Actuarial Mathematics and Statistics, in Heriot-Watt University in Edinburgh. Previously he held tenured positions in the Department of Statistical Sciences, University College London, UK and the Department of Mathematics and Statistics in University of New South Wales, Sydney, Australia.
Prof. Peters is the Director of the Scottish Financial Risk Association.
Prof. Peters is also an elected member of the Young Academy of Scotland in the Royal Society of Edinburgh (YAS-RSE) and an elected Fellow of the Institute of Operational Risk (FIOR). He was also the Nachdiploma Lecturer in Machine Learning for Risk and Insurance at ETH Zurich in the Risk Laboratory.
He has given in excess of 150 international invited presentations and speaker engagements, including numerous keynote presentations. He has delivered numerous professional training courses to C-suite industry professionals as well as to numerous central banks.
He has published in excess of 150 peer reviewed articles on risk and insurance modelling, 2 research text books on Operational Risk and Insurance as well as being the editor and contributor to 3 edited text books on spatial statistics and Monte Carlo methods.
He currently holds positions as:
- Honorary Prof. of Statistics at University College London, 2018+
- Affiliated Prof. of Statistics in University of New South Wales Australia 2015+
- Affiliate Member of Systemic Risk Center, London School of Economics 2014+
- Affiliate Member of Oxford Man Institute, Oxford University (OMI) 2013+
- Honorary Prof. of Statistics in University of Sydney Australia 2018+
- Honorary Prof. of Statistics in Macquarie University, Australia 2018+
- Visiting Prof. in Institute of Statistical Mathematics, Tokyo, Japan 2009-2018+
He previously held positions as:
- Honorary Prof. of Peking University, Beijing, China 2014-2016
- Adjunct Scientist in the Mathematics, Informatics and Statistics, Commonwealth Scientific and Industrial Research Organisation (CSIRO) 2009-2017
Webpage: https://www.qrslab.com/
Gordon Gay – CEO
Background:
- MEngSc Monash University
- MBA Melbourne University, Melbourne Business School, The University of Melbourne
Experience:
- Co-founder of ResilientML
- 23 years R&D at NEC Australia, roles - GM of R&D, National Head of Innovation
Matthew Ames – CTO / Co-Head of Research
Background:
- PhD Statistics, University College London
- 2 years Postdoctoral research
Experience:
- 5 years industry experience - machine learning, finance
Phong Nguyen – Principal Engineer
Background:
- Linkedin: https://www.linkedin.com/in/phong-nguyen-0456b912/
- Masters Adelaide University
Experience:
- 20+ years industry experience - R&D, Wireless technologies, Systems Engineering – engineering solutions realisation
- Lead systems engineering and technology development at NEC
- Creator of the first-to-market 3.6 & 7.2 Mbps HSDPA SoC (System on Chip), prototype for LTE technological trial, LTE/LTE-A SoC, and Multi-RAT programmable SDR platform
- Inventor of 57 SEPs (standard essential patents) and CEPs (commercial essential patents) on Bluetooth, 3G, 3.5G, 4G and 5G wireless technologies
Ioannis Chalkiadakis – Data Scientist / Natural Language Processing
Background:
- Masters of Engineering, National Technical University of Athens
- PhD candidate, Quantitative Risk Solutions Lab, Heriot-Watt University, Edinburgh
Experience:
- 3 years Software Engineering
Appendix: Detailed Project Description
Extracting Value from Text Data
Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.
“Big Data” (a term used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity.
The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.
Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text. Veracity relates to the quality and provenance of the data that are fed into the statistical models: the source of the data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.
Importance of text pre-processing
With any type of data collected from real world processes, it is usually the case that a set of “clean-up” or pre-processing transformations is required before the data can be used for statistical processing.
The pre-processing procedures will remove the noise from the data which will allow us to operate on the actual information we want to process. In this way we will not only ensure the veracity we want to achieve, but will also obtain efficiency and computational benefits.
Statistical text processing: Pipeline
In general, we can identify three stages for the statistical analysis of text data:
• data import,
• data wrangling, and
• finally, development and evaluation of the statistical model.
The first step of importing the data consists of either loading an already existing dataset, or alternatively collecting one’s own set of data, for example via scraping web pages, scanning/optical character recognition (OCR) of printed documents or transcribing spoken text. The data import however, does not guarantee that the dataset will be in such a format that will facilitate subsequent processing.
Therefore, we need to go through the process of “tidying” the data, where one constructs “data frames”, i.e. tabular structures, where each variable is stored in its own column and each observation occupies one row.
This process will create a tidy dataset and will facilitate subsequent data transformations, visualization and processing. Creating a tidy dataset and applying the necessary transformations or visualization methods constitutes the process of “data wrangling”.
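As a small illustration of the “tidy” layout described above, the sketch below reshapes nested per-article token counts into rows with one observation per row and one variable per column. The input data and the column names (`article_id`, `token`, `count`) are hypothetical, and plain Python lists of dictionaries stand in for a data frame.

```python
# Hypothetical raw import: token counts nested per article.
raw_counts = {
    "article_1": {"bitcoin": 3, "etf": 1},
    "article_2": {"bitcoin": 1, "regulation": 2},
}


def to_tidy_rows(counts):
    """Flatten nested counts into rows: one (article, token) observation per row."""
    rows = []
    for article_id, token_counts in counts.items():
        for token, count in token_counts.items():
            rows.append({"article_id": article_id, "token": token, "count": count})
    return rows


tidy = to_tidy_rows(raw_counts)
for row in tidy:
    print(row)
```

Rows in this shape can be loaded directly into a tabular library for filtering, grouping or plotting.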
During modelling, it might be necessary to apply additional transformations on the data, hence there is a feedback loop between the data wrangling and modelling stages.
Noise in Text Data & its Removal During Data Wrangling.
What constitutes noise in raw text, and under what conditions may it be introduced into our data?
Obvious noise artefacts are:
• encoding scheme (representation) issues,
• word misspellings,
• errors in the linguistic structure (grammar or syntax),
• missing spaces or punctuation symbols, and
• wrong capitalization patterns.
These types of noise patterns are usually introduced at the creation stage of the raw text, and are challenges that are expected in natural language applications.
However, noise may appear in non-obvious forms as well. Users of communication services, for instance SMS, e-mails, instant messages, or social media posts, often use abbreviations, emoticons, or even omit certain words.
These patterns, depending on the application, could hinder the processing of the raw data. For example, when analyzing sentiment from Tweets, most researchers will want to consider emoticons as they can be very expressive about the feelings of the author of the Tweet.
On the other hand, if someone strictly wants to analyze the lexical or grammatical patterns that appear frequently among Twitter users, information based on emoticons is potentially irrelevant, in which case it is noise and has to be removed.
This domain specificity of noise patterns also applies to additional noise sources that are considered standard in NLP, namely “stopwords” and punctuation.
The term stopwords refers to words that are not considered useful for the intended analysis because they lack discriminative power (e.g. appear too often in the dataset) or lack significant semantics, namely terms such as “a”, or “the”.
Stopword removal is considered a standard part of the pre-processing pipeline, is usually performed early in the pre-processing stage, and most NLP software packages come with standard predefined stopword lists.
Basic pre-processing steps
The list is not exhaustive:
1. Punctuation
Often punctuation marks (such as , . ! ? ; # “” ‘’ ~) are removed, for example when one aims to analyze counts of terms, and therefore punctuation becomes unnecessary. However, similar to stopwords, there are cases when all or a subset of punctuation marks are useful and are therefore desirable to maintain. For example, exclamation marks may reveal sentiment information, or some symbols may carry special meaning in certain contexts, such as the hashtag (#) symbol in Tweets where it relates to the Tweet semantic content. Finally, it is important to consider at which stage of the analysis one should remove punctuation. If we want to detect sentence boundaries, or perform syntax or grammar parsing, then it is important to maintain punctuation symbols before performing these stages. Once this type of analysis has finished, it may be safe to remove punctuation if it is required for further analysis steps.
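A minimal sketch of selective punctuation removal, using only the Python standard library, is given below. The `keep` parameter is a hypothetical convenience for retaining symbols such as “!” or “#” when they carry information; note that `string.punctuation` covers ASCII symbols only, so typographic quotes would need to be added separately.

```python
import string


def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation from text, except the characters listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))


sample = "Wow! #Bitcoin is up, again."
print(strip_punctuation(sample))             # all punctuation removed
print(strip_punctuation(sample, keep="!#"))  # sentiment and hashtag symbols kept
```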
2. Numbers
For numbers, too, one should carefully consider the application context before deciding to remove them. Numbers are often removed because most of the time they do not contribute semantic information. However, if the domain requires the extraction of dates, or of case numbers when processing legal documents, then specific rules should be applied to dictate the conditions under which numbers are removed from the text.
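One possible rule of this kind is sketched below: plain numbers are dropped, while tokens that look like dates are kept. The two regular expressions are assumptions made for the example (a `dd/mm/yyyy` date and a simple digit pattern); a real application would need rules tuned to its own domain.

```python
import re

# Assumed formats, for illustration: dd/mm/yyyy dates and digit groups
# separated by ".", "," or "/".
DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")
NUMBER = re.compile(r"^\d+(?:[.,/]\d+)*$")


def drop_numbers(tokens, keep_dates=True):
    """Remove numeric tokens, optionally keeping those that match the date pattern."""
    kept = []
    for tok in tokens:
        if keep_dates and DATE.match(tok):
            kept.append(tok)
        elif NUMBER.match(tok):
            continue
        else:
            kept.append(tok)
    return kept


tokens = "On 28/07/2021 the library held 27,072 articles".split()
print(drop_numbers(tokens))
print(drop_numbers(tokens, keep_dates=False))
```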
3. Lowercase
Lowercasing all terms is applied to reduce the vocabulary space, i.e. the set of words we expect to come across. This is useful for reducing the computational and space complexity in applications where we work with large sparse matrices of word counts. However, there are cases where uppercase letters reveal structural information: they can help identify sentence boundaries, or proper names, and can also help reduce ambiguity; for example distinguishing the proper name Rose from the noun denoting the flower rose.
4. Stemming
Stemming is an additional technique that aims to reduce the vocabulary space. It consists of removing any inflections from a word and reducing it to its most basic form. For example, a stemmer (the program that does the stemming) will map “walked”, “walking” and “walks” to the stem “walk”. Note that in the case of e.g. “studies”, the stemmer will return a basic word form (“studi”) that is not itself a valid word. This is because stemming does not account for the grammatical or syntactical pattern behind the inflection; it only cuts it off. An alternative method is lemmatisation, where the root lexeme is returned, which means that lemmatisation returns terms that exist in the language. For example, a lemmatiser will replace “studying” and “studies” with “study”. This is achieved by considering the part-of-speech of each term (e.g. is it a noun, verb, adjective or adverb?) in order to determine how the inflection should be resolved. As with punctuation removal, if we want to perform syntactic or grammatical analysis we have to postpone stemming and lemmatisation until after these stages.
5. Stopword removal
As we have discussed, this step should be treated very carefully, so as to minimize both information loss and the noise remaining in the dataset. It is therefore recommended that the standard stopword lists in software packages be checked, and modified accordingly, before use.
6. Word compounds
Word compounds are groups of words, usually groups of two (bigrams) or three (trigrams), that frequently appear together and convey a different meaning than the individual words do when considered separately. For example, the trigram “Wall Street Journal” denotes the name of a popular newspaper, and we would like to account for it as a single term when we want to extract its appearances in the dataset. If we do not, we can still identify the valid terms “wall”, “street” and “journal” individually, but we lose the fact that together they refer to a newspaper rather than carrying their separate meanings.
7. Remove low-frequency words
It is also common practice to remove extremely rare words, namely words that constitute less than a small fixed percentage of the document corpus (typically a threshold in the range of 0.5% to 1%), again to reduce the computational and space complexity.
At the end of these processes we will have tokenized, cleaned and wrangled text data that is ready for feature extraction and data analysis in NLP-based machine learning.
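The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ResilientML pipeline: the stopword list, the compound list, the order of the steps, the naive_stem suffix stripper (a toy stand-in for a real stemmer) and all function names are hypothetical. Rare words are dropped here by their share of all corpus tokens, which is only one possible reading of the percentage threshold.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists should be reviewed per domain.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}
# Hypothetical compound list; each tuple of words is merged into a single token.
COMPOUNDS = {("wall", "street", "journal"): "wall_street_journal"}


def naive_stem(token):
    """Crude suffix stripping for illustration only (not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def merge_compounds(tokens):
    """Step 6: replace known word compounds with a single token."""
    out, i = [], 0
    while i < len(tokens):
        for words, merged in COMPOUNDS.items():
            if tuple(tokens[i : i + len(words)]) == words:
                out.append(merged)
                i += len(words)
                break
        else:  # no compound starts at position i
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text, keep_numbers=False):
    """Apply steps 1 to 6 to one document and return its list of tokens."""
    text = text.lower()                                  # 3. lowercase
    text = re.sub(r"[^\w\s]", " ", text)                 # 1. punctuation
    if not keep_numbers:
        text = re.sub(r"\d+", " ", text)                 # 2. numbers
    tokens = merge_compounds(text.split())               # 6. word compounds
    tokens = [t for t in tokens if t not in STOPWORDS]   # 5. stopwords
    return [naive_stem(t) for t in tokens]               # 4. stemming


def drop_rare(docs, min_share=0.005):
    """Step 7: drop tokens whose share of all corpus tokens is below min_share."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return [[t for t in doc if counts[t] / total >= min_share] for doc in docs]


print(preprocess("The dog walked in the park!"))
```

Setting keep_numbers=True keeps digits for domains where dates or case numbers matter. Compounds are merged before stopword removal so that a compound containing a stopword would not be broken up.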
Feature Extraction Methods – Time-Series of NLP Text Features
We identify three distinct categories:
Semantics : namely the meaning behind words and sentences and the coherence of a well-formed text - Bag-Of-Words (Frequency Based Features).
The way we capture semantics is based on the bag-of-words model (BoW), which has been widely applied in natural language processing (NLP) and information retrieval (Harris, 1954). The main concept behind BoW is to map a segment of text to an unordered collection, or “bag”, of words. As we have seen, this is the premise for the construction of document-term matrices for a corpus of documents. In its original formulation, BoW is applied to each complete document in a collection of documents (a corpus) and ignores the sequence of words in the text. We transfer BoW into a time-series context and present an “online” formulation. This allows us to overcome the computational difficulties associated with BoW, namely the handling of sparse matrices whose size depends on the number of distinct words and on the corpus size, and may well be in the order of hundreds of thousands. In addition, this setting allows us to construct a text-based time-series that can be incorporated into a time-series based system for supervised or unsupervised learning.
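As a rough illustration of the idea, the sketch below consumes time-stamped documents one at a time and keeps only a sparse word count per period, so that no full document-term matrix is ever built. It is one possible reading of an “online” bag-of-words, not the formulation used in the project: the fixed vocabulary, the daily time stamps, the relative-frequency feature and the function names bow_time_series and term_series are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def bow_time_series(stream, vocabulary):
    """Aggregate token counts per time stamp, one sparse Counter per period."""
    vocab = set(vocabulary)
    series = defaultdict(Counter)
    for timestamp, tokens in stream:  # documents are consumed one at a time
        series[timestamp].update(t for t in tokens if t in vocab)
    return dict(sorted(series.items()))


def term_series(series, term):
    """Relative frequency of `term` in each period (0.0 for an empty period)."""
    out = []
    for timestamp, counts in series.items():
        total = sum(counts.values())
        out.append((timestamp, counts[term] / total if total else 0.0))
    return out


# Hypothetical pre-processed documents with ISO date stamps.
stream = [
    ("2021-01-01", ["market", "rally", "market"]),
    ("2021-01-01", ["news", "rally"]),
    ("2021-01-02", ["market", "crash"]),
]
series = bow_time_series(stream, vocabulary=["market", "rally", "crash"])
print(term_series(series, "crash"))
```

Each period is stored as a Counter, so memory grows with the number of distinct words actually observed in that period rather than with the full vocabulary times the corpus size.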
Grammar : i.e. the structural rules that dictate how words fit into the sentence and form groups such as clauses and phrases - parse tree or constituency tree for n-grams/sentences.
Consider the following example
The brown dog is running in the park.
which can also be written in an equivalent grammatical manner:
He is running in the park.
without destroying the grammar or meaning of the sentence.
The fact that a group of words can operate as a single unit - and therefore in our example we can replace the phrase “the brown dog” with “he” - is the linguistic property of constituent structure.
One can therefore extract features based on the grammatical rules that tell us which words can be grouped into units, and study these units for their role in the sentence.
The formal system for studying this phenomenon, i.e. the grouping of words as in the above example, is the context-free grammar (CFG). Mathematically, a CFG is defined by a quadruple as follows: 𝐺=(𝑁,Σ,𝑅,𝑆) where
𝑁 is a set of non-terminal symbols
Σ is a set of terminal symbols, 𝑁∩Σ=∅
𝑅 is a set of rules, 𝑅={𝐴→𝛽:𝐴∈𝑁 and 𝛽∈(Σ∪𝑁)∗}
𝑆 is the designated start symbol, 𝑆∈𝑁
Such grammars are called “context-free” because the left hand side of each rule consists of exactly one non-terminal symbol, so a rule can be applied wherever that symbol appears, regardless of the surrounding context.
A context-free grammar defines a formal language, namely the set of strings of terminal symbols that can be derived starting from 𝑆.
A sentence is called grammatical if the sequence of tokens that comprises it can be derived by following the rules of the CFG; otherwise, the sentence is not valid according to the language of the CFG (ungrammatical).
The process of analyzing the constituent structure of a sentence is called constituency parsing, and the derivation of a sentence, i.e. the rules that we followed when building it, can be represented with a hierarchical structure, a tree, which is called the parse tree or constituency tree.
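To make the definitions concrete, the sketch below encodes a toy grammar that covers only the two example sentences and checks whether a sentence can be derived from the start symbol. The grammar, its non-terminal names (Det, Nom, VG and so on) and the function name is_grammatical are assumptions made for this illustration. The rules are written in Chomsky normal form so that the standard CYK algorithm can be used as the recogniser; this is a swapped-in textbook technique, not a method described in this document.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, covers only the example).
# Lexical rules A -> word, stored as word -> {A, ...}; the words are the terminals.
LEXICAL = {
    "the": {"Det"}, "brown": {"Adj"}, "dog": {"N"}, "park": {"N"},
    "he": {"NP"}, "is": {"Aux"}, "running": {"V"}, "in": {"P"},
}
# Binary rules A -> B C, stored as (B, C) -> {A, ...}.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "Nom"): {"NP"},
    ("Det", "N"): {"NP"},
    ("Adj", "N"): {"Nom"},
    ("Aux", "VG"): {"VP"},
    ("V", "PP"): {"VG"},
    ("P", "NP"): {"PP"},
}


def is_grammatical(sentence, start="S"):
    """CYK recognition: True if `sentence` can be derived from `start`."""
    tokens = sentence.lower().rstrip(".").split()
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(LEXICAL.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two constituents
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n - 1]


print(is_grammatical("The brown dog is running in the park."))
print(is_grammatical("He is running in the park."))
print(is_grammatical("Dog the running."))
```

In this toy grammar both “the brown dog” and “he” are derived from the same non-terminal NP, which is the constituent structure property described above. Recording which rule filled each table cell would additionally yield the parse tree.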
Syntax : that is the principles that dictate the structure of sentences by specifying the order and role of each word in the text - Dependency Graphs.
The goal of syntactic analysis is to discover the pairs of words in which one word depends on the other, and the type of that dependence.
These dependency relations are binary and asymmetrical, and therefore we would like to know which of the two words acts as the head that is modified in some way, and which is the dependent that modifies or complements the head. This concept allows us to think of the dependency relations as inducing graph structures (dependency graphs) which we use to study the dependency relations between words, and therefore the syntax of a sentence. The syntactic analysis complements the grammatical, parse tree-based analysis, as now we aim to extract information on the functional role of each word in the sentence, rather than structural relations between them as we did with the context-free grammar.
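A dependency graph of this kind can be represented as a list of (head, relation, dependent) arcs. The sketch below does this for the earlier example sentence. The arcs are hand-annotated for illustration and are not the output of a parser; the relation labels loosely follow the Universal Dependencies naming style, and the function names are assumptions.

```python
from collections import defaultdict

tokens = ["The", "brown", "dog", "is", "running", "in", "the", "park"]

# Hand-annotated arcs (head index, relation, dependent index); illustrative only.
arcs = [
    (2, "det", 0),    # dog -> The
    (2, "amod", 1),   # dog -> brown
    (4, "nsubj", 2),  # running -> dog
    (4, "aux", 3),    # running -> is
    (7, "case", 5),   # park -> in
    (7, "det", 6),    # park -> the
    (4, "obl", 7),    # running -> park
]


def build_graph(arcs):
    """Map each head index to its list of (relation, dependent index) pairs."""
    graph = defaultdict(list)
    for head, relation, dependent in arcs:
        graph[head].append((relation, dependent))
    return graph


def find_root(tokens, arcs):
    """The root is the only token that is not a dependent of any other token."""
    dependents = {dep for _, _, dep in arcs}
    roots = [i for i in range(len(tokens)) if i not in dependents]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


graph = build_graph(arcs)
root = find_root(tokens, arcs)
print(tokens[root])
print([(rel, tokens[dep]) for rel, dep in graph[root]])
```

Token indices are used instead of word strings because “the” occurs twice in the sentence. Each arc is asymmetrical, with the head listed first, which matches the binary head and dependent relation described above.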