[Proposal Round 7] ResilientML – Expansion of Sentiment Data – Adding Sentence Structure Features

Key Project Data

Name of project:

ResilientML – Expansion of Sentiment Data – Adding Sentence Structure Features

Team Website:

https://www.resilientml.com/

Proposal Wallet Address:

0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

https://etherscan.io/address/0x4D4290CBA904aBb4dFbc1568766bCD88e67Be391

Which category best describes your project?

Unleash data

Funding Requested

32,000 OCEAN

Updates on previous funding rounds

The offered dataset initially comprised text data for 5 assets (BTC, ETH, LTC, XRP, TRX) obtained from two news sources (Cryptodaily, Cryptoslate). Previously awarded funding has been used to maintain the existing data sources and content and update them weekly, and in addition to:

  • extend the dataset to over 15 different assets
  • extend the dataset to 5 different article topics, i.e. DeFi, Exchanges, NFT, Regulation, and Opinion articles
  • extend news sources (Coindesk)
  • utilise cloud-based servers and robustify data collection and curation infrastructure

Round 7 Proposal

This proposal is to continue to extend, develop and further curate our initial development of a crypto-specific Natural Language Processing Data Suite.

The grant will help fund the coding, NLP model development, man hours and compute hours required to extend our current published dataset. We aim to continue to grow this dataset and expand its usability by extracting further value from the corpus of crypto-related financial news.

This round we will expand our offering to include new high level extracted sentence structure features for the following sectors:

  • Web3.0 coins: (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);

  • Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);

  • DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);

  • Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);

  • Stablecoins (Tether, USDC, BUSD etc.),

with news topics covering:

  • DeFi;

  • Exchanges;

  • Regulation;

  • NFT,

and including data from the following additional news sources:

  • Cointelegraph;

  • NewsBTC;

  • Bitcoinist;

  • Blockonomi;

  • Coinspeaker.

Specifically, we will now include sentences processed to remove parts of text noise, whilst keeping their grammatical, syntactical and semantic structure intact. This is useful for NLP Compute-to-Data modules that take raw text sentences as their input (e.g. BERT-type, Transformer-based models). In addition, we will provide advanced features capturing the grammatical and syntactical structure of the sentences. These comprise Context-Free Grammar (CFG) parse trees, which express the grammatical structure of a sentence, as well as Dependency Graphs, which encode the sentence syntax.

We will expand the dataset going forwards and also back-fill the currently available data. This will effectively increase the current data offering we have developed by around a factor of three in terms of content and volume of processed corpus.

We will continue to grow the sophistication of the data being provided. This will be achieved by migrating from data munging to feature extraction and data curation for feature libraries.

The initial phase of this will encompass publishing a new dataset to provide extracted Semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

The proposal in one sentence

Data is the modern oil of the blockchain economy. ResilientML Semantic Reservoirs will bring a vast collection of carefully crafted semantic and linguistically tailored data sets curated by experts in Natural Language Processing for utilization directly in machine learning methods and sentiment models running in the Ocean environment and available through the Ocean marketplace via the ResilientML NLP data app.

Project Overview

Mission:

The outcome of the collaboration between ResilientML and the OceanDAO community is multi-fold:

  • We will help open the flood gates to the Ocean community for one of the key building blocks of a modern data economy that is of growing relevance to machine learning applications in a data economy like Ocean. This building block is particularly specialized as it requires a non-standard set of data science processes to extract and curate.

  • In this regard, we seek to unlock the power of text-based information and data characterization for the Ocean machine learning community. Community members will be able to use our high quality Natural Language Processing (NLP) text data features to develop apps that interface directly with our data reservoirs through APIs. These APIs extract relevant text data features from our JSON formatted, curated feature libraries, forming tributaries to the wider Ocean machine learning applications that seek inputs from text-based data features.

ResilientML has developed methods in Python to produce these JSON formatted text feature collections that will form the core of our Semantic Reservoirs. These text-based data features are processed using specialized natural language processing (NLP) methods that ResilientML will bring to the Ocean community, based on extensive academic and industry experience in developing such solutions.

In this regard, we aim to help make the Ocean marketplace the leader in the text processing, sentiment models, social media analytics, analyst report analytics, regulatory report analytics, topic models, chat-bot, text-to-speech, speech-to-text, labelling, context extraction tasks of the NLP data market by leveraging our expertise in cutting-edge, novel academic research and industry practice.

NLP is of prime importance in the crypto space due to the highly sentiment driven nature of crypto markets. Furthermore, we also plan to provide the following high value NLP datasets to the Ocean marketplace:

  • Crypto News Sentiment

  • Social Media Sentiments

  • Technology: github, bitbucket, wire, …

  • Regulatory compliance reports

  • Legal documents

The team at ResilientML has dedicated quantitative analysts, machine learning experts and industry-leading engineers developing this suite of tools as APIs and cloud solutions on Azure and AWS, using R, Python, MongoDB and other technologies.

Description of the project:

Here, we provide a high level overview of the project (a detailed description of the project is available in the appendix).

Significant value can be sourced to understand crypto markets, prices, developments, regulatory landscape, use cases etc., through harvesting information from written text. Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity. The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity is related to the quality and provenance of the data that are fed into the statistical models. The source of data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

A detailed synopsis of the project in the appendix of this proposal sets out the stages of Machine Learning used to extract and summarize the value in textual data and sentiment that we propose to provide to the Ocean environment.

The collection, wrangling and curation of this information extracted from text requires specialist machine learning knowledge to automate. Choosing what to collect, and how, requires domain knowledge to identify the most relevant sources of data and to ensure the data is of the highest integrity.

We will combine our machine learning skills and specialist domain knowledge in the crypto space and traditional financial and risk/insurance space to provide a high quality source of data for NLP tasks that is tailored specifically for crypto market understanding and analytics.

An important point of distinction of what we offer is that we move beyond the standard bag-of-words and word-frequency based models, which are ubiquitous in most NLP sentiment based frameworks but fail to capture semantics and syntax. These extra components are critical to infer sentiment accurately. Our proposed framework provides additional structure capturing these components for positive, negative, and neutral sentiment indices. In addition, our framework is hierarchical in nature, which means that we can explicitly extract the contribution to sentiment of individual articles, authors, and news sources.

Another important distinction of our contribution compared to other sentiment based models is that we don’t just focus on Twitter and social media feeds, which have limited scope to express sentiment. Instead, our approach targets detailed analyst reports, editor-processed news reports, and regulatory reports. Working with these richer, higher quality and more credible data sources is more complex than working with social media based models, and our framework is able to accommodate this.

What problem is your project solving?

Currently, there is a lack of high quality data on the Ocean marketplace – which is to be expected at this stage. In order to trigger a snowball effect of high quality data providers, an initial kernel of high quality datasets needs to be published, staked, and purchased on the marketplace. We will contribute to the provision of such high quality datasets to the Ocean marketplace to drive growth – critical to the success of the protocol.

What is the final product?

ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace to include new high level extracted sentence structure features, such as Context-Free Grammar trees and Dependency Graphs, for the following sectors:

  • Web3.0 coins: (Ocean, Chainlink, Filecoin, BitTorrent, Stacks, The Graph, Basic Attention Token, Siacoin, Helium, Arweave);

  • Layer1 coins (Cardano, Tezos, Polkadot, Solana etc.);

  • DeFi coins (Uniswap, Terra, Aave, PancakeSwap, Maker, THORChain, Serum etc.);

  • Metaverse/Gaming/NFT coins (Enjin, Axie Infinity, Red Fox Labs etc.);

  • Stablecoins (Tether, USDC, BUSD etc.),

with news topics covering:

  • DeFi;

  • Exchanges;

  • Regulation;

  • NFT,

and including data from the following additional news sources:

  • Coindesk;

  • Cointelegraph;

  • NewsBTC;

  • Bitcoinist;

  • Blockonomi;

  • Coinspeaker.

Each dataset will provide cleaned, pre-processed, and featurized text data (as shown in Stage 1 of the Value-Add Pipeline (VAP) in Figure 1) from every article, corresponding to 100,000s of n-grams and millions of tokens, from various news sources, e.g. cryptodaily.co.uk, cryptoslate.com.

Furthermore, we will publish a new dataset to provide extracted Semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Figure 1: Text Analytics Value-Add Pipeline (VAP)

Expected ROI

The publication of the datasets detailed above will drive value to the Ocean ecosystem through numerous channels, i.e. fee generation for Ocean community, Ocean token purchases by data publishers (ResilientML), Ocean token purchases by stakers attracted by high quality datasets to curate, network effects of attracting other data providers to the marketplace.

Here, we focus on the first of these value drivers – since it is the easiest to ballpark.

We make the following assumptions:

  • Probability of project success = 0.8

  • Ocean Community gets 0.2% of consume volume.

We provide OCEAN Datatoken Consumption and ROI calculations for a number of growth scenarios in Figures 2 and 3 respectively.

Note:

Let x_t = consumption for month t.

x_t = x_0 * (1 + rate)^t, where t = 0, 1, …, 11 months,

and x_0 = initial_num_users * num_datasets * datatoken_price.
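As a rough illustration of the growth formula above, the following Python sketch computes twelve months of consumption and the corresponding community fees. The input values (10 users, 5 datasets, a price of 100 OCEAN, 10% monthly growth) are hypothetical and are not the scenarios plotted in the figures. Multiplying by the probability of success to obtain an expected value is our reading of "adjusted for probability of success", so treat that step as an assumption too.

```python
def monthly_consumption(x0, rate, months=12):
    """Return [x_0, ..., x_{months-1}] with x_t = x0 * (1 + rate) ** t."""
    return [x0 * (1 + rate) ** t for t in range(months)]

# Hypothetical inputs, chosen only for illustration.
initial_num_users = 10
num_datasets = 5
datatoken_price = 100.0  # OCEAN per datatoken (assumed)
rate = 0.10              # monthly growth rate (assumed)

P_SUCCESS = 0.8          # probability of project success, from the proposal
COMMUNITY_FEE = 0.002    # 0.2% of consume volume, from the proposal

x0 = initial_num_users * num_datasets * datatoken_price
consumption = monthly_consumption(x0, rate)

# Expected-value adjustment (assumed interpretation of the adjustment).
expected_volume = P_SUCCESS * sum(consumption)
expected_community_fees = COMMUNITY_FEE * expected_volume

print(f"x_0 = {x0:.0f} OCEAN")
print(f"12-month expected volume = {expected_volume:.0f} OCEAN")
print(f"12-month expected community fees = {expected_community_fees:.1f} OCEAN")
```

Changing `rate` reproduces different growth scenarios; the other inputs only rescale the curve.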

Figure 2: OCEAN Datatoken Consumption Growth Scenarios (adjusted for probability of success = 0.8):

image

image

The man hours and computation necessary to code, scrape, clean, and process these datasets are substantial. Below we lay out our projected fixed costs to provide for example 5 datasets. Variable costs should of course be considered but are omitted for simplicity here. Note: these projections are based on an analysis of preliminary processing using a smaller dataset.

image

Project Deliverables – Category

  1. ResilientML will vastly expand our current dataset (TASPEL) on the Ocean marketplace – as detailed above.

  2. Furthermore, we will publish a new dataset to provide the extracted Semantic features, such as bag-of-words, matrix factorization and deep learning-based word embeddings, as shown in Stage 2 of the Value-Add Pipeline (VAP) in Figure 1.

Project Deliverables – Roadmap:

Any prior work completed thus far?

The proposed project builds upon the contributions over the past two years of members of ResilientML in building the machine learning pipeline shown in Figure 1. We have performed the processing of data for over 15 assets, and have already published this dataset to the Polygon Ocean marketplace – updating weekly.

Part 1: Python code has been written to perform text data collection via Java-based screen scraping and document collection – this has been unit tested and validated.

Part 2: Python modules have been created with proprietary steps of text data de-noising based upon the concepts provided in the appendix – this has been unit tested and validated.

Part 3: Python modules have been created to extract time series of features related to:

  1. Semantic bag-of-words frequency based features and their corresponding time series.

  2. Grammar based parse trees and their corresponding time series.

  3. Syntax based dependency graphs and their corresponding time series.

These have been unit tested and applied to crypto data. The next stage is to put these steps into a distributed production system and curate the outputs in a JSON data format for an API feed.

Roadmap

Month 1:

  • Complete prototyping of stages 1 – 4.

  • Publish datasets to Ocean marketplace.

Month 2:

  • Maintenance and data support for data buyers.

  • Submit academic research paper to journal

Project Details

Further details of the research prototype are provided in the following peer reviewed papers:

  1. Chalkiadakis, Ioannis and Peters, Gareth W. and Chantler, Michael John and Konstas, Ioannis, A statistical analysis of text: embeddings, properties, and time-series modeling.
  2. Chalkiadakis, Ioannis and Zaremba, Anna and Peters, Gareth W. and Chantler, Michael John, Sentiment-driven statistical causality in multimodal systems.
  3. Zaremba, A. and Peters, G., 2020. Statistical Causality for Multivariate Non-Linear Time Series via Gaussian Processes.
  4. Peters, Gareth, Statistical Machine Learning and Data Analytic Methods for Risk and Insurance.

Team members

ResilientML consists of 5 team members.

Chair Prof. Gareth W. Peters (CStat-RSS, FIOR, YAS-RSE) - Head of Research

Background:

Experience:

  • Co-founder of ResilientML

  • 20+ years machine learning research

  • 5 research books

  • 200+ journal and conference papers

  • Successfully delivered projects funded by grants totalling more than 5 million GBP.

Short Bio

Prof. Gareth W. Peters is the ‘Chair Professor for Risk and Insurance’ in the Department of Actuarial Mathematics and Statistics, in Heriot-Watt University in Edinburgh. Previously he held tenured positions in the Department of Statistical Sciences, University College London, UK and the Department of Mathematics and Statistics in University of New South Wales, Sydney, Australia.

Prof. Peters is the Director of the Scottish Financial Risk Association.

Prof. Peters is also an elected member of the Young Academy of Scotland in the Royal Society of Edinburgh (YAS-RSE) and an elected Fellow of the Institute of Operational Risk (FIOR). He was also the Nachdiploma Lecturer in Machine Learning for Risk and Insurance at ETH Zurich in the Risk Laboratory.

He has given in excess of 150 international invited presentations and speaker engagements, including numerous keynote presentations. He has delivered numerous professional training courses to C-suite executive level industry professionals as well as to numerous central banks.

He has published in excess of 150 peer reviewed articles on risk and insurance modelling, 2 research text books on Operational Risk and Insurance as well as being the editor and contributor to 3 edited text books on spatial statistics and Monte Carlo methods.

He currently holds positions as:

  • Honorary Prof. of Statistics at University College London, 2018+

  • Affiliated Prof. of Statistics in University of New South Wales Australia 2015+

  • Affiliate Member of Systemic Risk Center, London School of Economics 2014+

  • Affiliate Member of Oxford Man Institute, Oxford University (OMI) 2013+

  • Honorary Prof. of Statistics in University of Sydney Australia 2018+

  • Honorary Prof. of Statistics in Macquarie University, Australia 2018+

  • Visiting Prof. in Institute of Statistical Mathematics, Tokyo, Japan 2009-2018+

He previously held positions as:

  • Honorary Prof. of Peking University, Beijing, China 2014-2016

  • Adjunct Scientist in the Mathematics, Informatics and Statistics, Commonwealth Scientific and Industrial Research Organisation (CSIRO) 2009-2017

Webpage: https://www.qrslab.com/

Gordon Gay – CEO

Background:

Experience:

  • Co-founder of ResilientML

  • 23 years R&D at NEC Australia, roles - GM of R&D, National Head of Innovation

Matthew Ames – CTO / Co-Head of Research

Background:

Experience:

  • 5 years industry experience - machine learning, finance

Phong Nguyen – Principal Engineer

Background:

Experience:

  • 20+ years industry experience - R&D, Wireless technologies, Systems Engineering – engineering solutions realisation

  • Lead systems engineering and technology development at NEC

  • Creator of the first-to-market 3.6 & 7.2 Mbps HSDPA SoC (System on Chip), prototype for LTE technological trial, LTE/LTE-A SoC, and Multi-RAT programmable SDR platform

  • Inventor of 57 SEPs (standard essential patents) and CEPs (commercial essential patents) on Bluetooth, 3G, 3.5G, 4G and 5G wireless technologies

Ioannis Chalkiadakis – Data Scientist / Natural Language Processing

Background:

Experience:

  • 3 years Software Engineering

Appendix: Detailed Project Description

Extracting Value from Text Data

Natural Language Processing (NLP) refers to the set of methods and analytical tools used to analyze unstructured text data, namely text that was created in free form and has a natural linguistic flow, rather than text created based on templated and predefined rules.

“Big Data” (a term which is used to describe the vast amount of available information) are characterized by five properties (“the 5 Vs of Big Data”): volume, velocity, variety, value and veracity.

The amount of information (volume), the rate at which it is created and transmitted (velocity), its heterogeneous nature (variety), which includes not only textual but also financial, visual or verbal data, and the actionable information among “noise” (value) are the reasons why it would be infeasible to leverage such information without statistical algorithms that automate the collection, extraction and filtering processes applied to the data.

Furthermore, these operations must be applied in such a way that the fifth property (veracity) of Big Data is maintained at all stages of the statistical processing of text; veracity is related to the quality and provenance of the data that are fed into the statistical models. The source of data determines their quality and inherent biases, which inevitably affect the output of the final statistical model.

Importance of text pre-processing

With any type of data collected from real world processes, it is usually the case that a set of “clean-up” or pre-processing transformations is required before the data can be used for statistical processing.

The pre-processing procedures will remove the noise from the data which will allow us to operate on the actual information we want to process. In this way we will not only ensure the veracity we want to achieve, but will also obtain efficiency and computational benefits.

Statistical text processing: Pipeline

In general, we can identify three stages for the statistical analysis of text data:

• data import,

• data wrangling, and

• finally, development and evaluation of the statistical model.

The first step of importing the data consists of either loading an already existing dataset, or alternatively collecting one’s own set of data, for example via scraping web pages, scanning/optical character recognition (OCR) of printed documents or transcribing spoken text. Data import, however, does not guarantee that the dataset will be in a format that facilitates subsequent processing.

Therefore, we need to go through the process of “tidying” the data, where one constructs “data frames”, i.e. tabular structures, where each variable is stored in its own column and each observation occupies one row.

This process will create a tidy dataset and will facilitate subsequent data transformations, visualization and processing. Creating a tidy dataset and applying the necessary transformations or visualization methods constitutes the process of “data wrangling”.

During modelling, it might be necessary to apply additional transformations on the data, hence there is a feedback loop between the data wrangling and modelling stages.

Noise in Text Data & its Removal During Data Wrangling.

What constitutes noise in raw text, and under what conditions may it be introduced into our data?

Obvious noise artefacts are:

• encoding scheme (representation),

• word mis-spellings,

• errors in the linguistic structure (grammar or syntax),

• missing spaces or punctuation symbols and

• wrong capitalization patterns.

These types of noise patterns are usually introduced at the creation stage of the raw text, and are challenges that are expected in natural language applications.

However, noise may appear in non-obvious forms as well. Users of communication services, for instance SMS, e-mails, instant messages, or social media posts, often use abbreviations, emoticons, or even omit certain words.

These patterns, depending on the application, could hinder the processing of the raw data. For example, when analyzing sentiment from Tweets, most researchers will want to consider emoticons as they can be very expressive about the feelings of the author of the Tweet.

On the other hand, if someone strictly wants to analyze the lexical or grammatical patterns that appear frequently among Twitter users, information based on emoticons is potentially irrelevant, in which case it is noise and has to be removed.

This domain specificity of noise patterns also applies to additional noise sources that are considered standard in NLP, namely “stopwords” and punctuation.

The term stopwords refers to words that are not considered useful for the intended analysis because they lack discriminative power (e.g. appear too often in the dataset) or lack significant semantics, namely terms such as “a”, or “the”.

Stopword removal is considered a standard part of the pre-processing pipeline, is usually performed early in the pre-processing stage, and most NLP software packages come with standard predefined stopword lists.

Basic pre-processing steps

The list is not exhaustive:


1. Punctuation

Often punctuation marks (such as , . ! ? ; # “” ‘’ ~) are removed, for example when one aims to analyze counts of terms, and therefore punctuation becomes unnecessary. However, similar to stopwords, there are cases when all or a subset of punctuation marks are useful and are therefore desirable to maintain. For example, exclamation marks may reveal sentiment information, or some symbols may carry special meaning in certain contexts, such as the hashtag (#) symbol in Tweets where it relates to the Tweet semantic content. Finally, it is important to consider at which stage of the analysis one should remove punctuation. If we want to detect sentence boundaries, or perform syntax or grammar parsing, then it is important to maintain punctuation symbols before performing these stages. Once this type of analysis has finished, it may be safe to remove punctuation if it is required for further analysis steps.
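The ordering point above (detect sentence boundaries first, strip punctuation afterwards, and optionally keep informative symbols) can be sketched in a few lines of standard-library Python. The splitter below is deliberately naive and the function names are illustrative only; this is not the ResilientML pipeline.

```python
import re
import string

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def strip_punctuation(sentence, keep=""):
    """Remove ASCII punctuation except the characters listed in `keep`."""
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return sentence.translate(str.maketrans("", "", drop))

text = "Bitcoin rallies! Is #BTC back? Analysts are unsure."

# Step 1: use punctuation to find sentence boundaries.
sentences = split_sentences(text)

# Step 2: only then remove it, keeping '#' and '!' as sentiment/topic cues.
cleaned = [strip_punctuation(s, keep="#!") for s in sentences]

print(sentences)
print(cleaned)
```

Stripping punctuation before the split would have collapsed the three sentences into one, which is why the order matters.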


2. Numbers

For numbers too, one should carefully consider the application context before deciding to remove them on the grounds that they rarely contribute semantic information. If the domain requires the extraction of dates, however, or of case numbers when processing legal documents, then specific rules should be applied to dictate the conditions under which numbers will be removed from the text.
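A minimal sketch of such a rule, assuming that only ISO-formatted dates (YYYY-MM-DD) need to be protected; real corpora would need broader date patterns, and the regular expressions below are illustrative assumptions.

```python
import re

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")      # ISO dates only (assumed format)
NUMBER = re.compile(r"\d+(?:[.,/-]\d+)*")    # e.g. 3, 0.5, 1,500,000, 2021-06-01

def remove_numbers(text, keep_dates=True):
    """Drop purely numeric tokens, optionally protecting ISO dates."""
    kept = []
    for token in text.split():
        core = token.strip(".,;:!?()")       # ignore surrounding punctuation
        if keep_dates and DATE.fullmatch(core):
            kept.append(token)               # rule: dates are information
        elif NUMBER.fullmatch(core):
            continue                         # rule: other numbers are noise
        else:
            kept.append(token)
    return " ".join(kept)

text = "On 2021-06-01 the regulator fined 3 exchanges 1,500,000 dollars."
with_dates = remove_numbers(text)
without_dates = remove_numbers(text, keep_dates=False)
print(with_dates)
print(without_dates)
```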


3. Lowercase

Lowercasing all terms is applied to reduce the vocabulary space, i.e. the set of words we expect to come across. This is useful for reducing the computational and space complexity in applications where we work with large sparse matrices of word counts. However, there are cases where uppercase letters reveal structural information: they can help identify sentence boundaries, or proper names, and can also help reduce ambiguity; for example distinguishing the proper name Rose from the noun denoting the flower rose.
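The trade-off can be made concrete with a small sketch: lowercasing everything shrinks the vocabulary the most, while a protected list (here a hypothetical one containing only "Rose") keeps selected proper names distinct.

```python
def lowercase_tokens(tokens, protected=frozenset()):
    """Lowercase every token except those listed in `protected`."""
    return [t if t in protected else t.lower() for t in tokens]

tokens = ["The", "rally", "faded", "and", "the", "rose",
          "garden", "impressed", "Rose"]

lower_all = lowercase_tokens(tokens)
selective = lowercase_tokens(tokens, protected={"Rose"})

# Vocabulary sizes: raw, fully lowercased, selectively lowercased.
print(len(set(tokens)), len(set(lower_all)), len(set(selective)))
```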


4. Stemming

Stemming is an additional technique that aims to reduce the vocabulary space. It consists of removing any inflections from a word and reducing it to its most basic form. For example, a stemmer (the program that does the stemming) will map “walked”, “walking” and “walks” to the lexeme “walk”. Note that in the case of e.g. “studies”, the stemmer will return a basic word form (“studi”) that is not itself a valid word. This is because stemming does not account for the grammatical or syntactical pattern behind the inflection; it only cuts it off. An alternative method is lemmatisation, where the root lexeme is returned, which means that lemmatisation returns terms that are in the language. For example, a lemmatiser will replace “studying” and “studies” with “study”. This is achieved by considering the part-of-speech of each term (e.g. is it a noun, verb, adjective or adverb?) in order to determine how the suffix should be handled. As with punctuation, if we want to perform syntactic or grammatical analysis we have to postpone stemming and lemmatisation until after these stages.
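The difference between the two can be illustrated with a toy example. The suffix stripper below is a deliberately crude stand-in (it is not the Porter algorithm or any library stemmer), and the lemma table is a tiny hand-written dictionary; both are assumptions made for illustration only.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Cut off the first matching suffix, leaving at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written (word, part-of-speech) -> lemma table, for illustration.
LEMMAS = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("studying", "VERB"): "study",
}

def toy_lemmatise(word, pos):
    """Look up the lemma for a word given its part-of-speech tag."""
    return LEMMAS.get((word, pos), word)

stems = [toy_stem(w) for w in ["walked", "walking", "walks", "studies"]]
lemma = toy_lemmatise("studies", "NOUN")
print(stems, lemma)
```

The stemmer produces the non-word "studi", whereas the lemmatiser needs the part-of-speech tag but returns a valid dictionary form.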


5. Stopword removal

As we have discussed, this step should be treated very carefully, so as to minimize information loss and remaining noise in the dataset. It is therefore recommended that standard stopword lists in software packages be checked, and modified accordingly, before use.
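A short sketch of why the default list should be reviewed: the stand-in list below (a toy set, not taken from any particular package) contains "not" and "down", which carry sentiment in a market context.

```python
# Toy stand-in for a package-provided stopword list (assumed contents).
DEFAULT_STOPWORDS = {"a", "an", "the", "is", "of", "not", "no", "up", "down"}

def customise(stopwords, keep=(), add=()):
    """Return a copy of the list with `keep` removed and `add` included."""
    return (set(stopwords) - set(keep)) | set(add)

def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the price of bitcoin is not going down".split()

naive = remove_stopwords(tokens, DEFAULT_STOPWORDS)
domain_stopwords = customise(DEFAULT_STOPWORDS, keep={"not", "up", "down"})
careful = remove_stopwords(tokens, domain_stopwords)

print(naive)
print(careful)
```

With the unmodified list the negation and the direction of the price move are both lost.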


6. Word compounds

Word compounds are groups of words, usually groups of two (bigrams) or three (trigrams) that frequently appear together and convey a different meaning than if we consider each one individually. For example, the trigram “Wall Street Journal” denotes the name of a popular newspaper and we would like to account for it as a single term, when we want to extract its appearance in the dataset. If we do not, we can still identify the valid terms of “wall”, “street” and “journal” individually, however we ignore the fact that they refer to a newspaper rather than carry their separate meanings.
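One simple way to account for compounds, assuming a list of known compounds is already available, is a greedy longest-match merge over the token sequence. The compound list and the underscore separator below are illustrative choices.

```python
def merge_compounds(tokens, compounds, sep="_"):
    """Greedily replace known word n-grams with single joined tokens."""
    max_len = max((len(c) for c in compounds), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to bigrams.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in compounds:
                merged.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical compound list for illustration.
COMPOUNDS = {("wall", "street", "journal"), ("smart", "contract")}

tokens = "The Wall Street Journal covered a smart contract bug".split()
merged = merge_compounds(tokens, COMPOUNDS)
print(merged)
```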


7. Remove low-frequency words

It is also common practice to remove extremely rare words, namely words that constitute less than a small fixed percentage (< 0.5 - 1 %) of the document corpus, again to reduce the computational and space complexity.
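A minimal sketch of this filter, interpreting the percentage as a word's share of all tokens in the corpus (an assumption; a share of documents containing the word would be an equally plausible reading). The toy corpus uses a much larger threshold than 0.5% so that the effect is visible.

```python
from collections import Counter

def drop_rare_words(documents, min_share=0.005):
    """Remove words whose share of all corpus tokens is below `min_share`."""
    counts = Counter(tok for doc in documents for tok in doc)
    total = sum(counts.values())
    keep = {w for w, c in counts.items() if c / total >= min_share}
    return [[tok for tok in doc if tok in keep] for doc in documents]

documents = [
    ["btc", "rally", "btc"],
    ["btc", "dip", "eth"],
    ["eth", "btc", "rally"],
]

# "dip" accounts for 1 of 9 tokens, below the 20% toy threshold.
filtered = drop_rare_words(documents, min_share=0.2)
print(filtered)
```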

At the end of these processes we will have tokenized, cleaned and wrangled text data that has been prepared for feature extraction and data analysis in NLP based machine learning.

Feature Extraction Methods – Time-Series of NLP Text Features

We identify three distinct categories:

Semantics : namely the meaning behind words and sentences and the coherence of a well-formed text - Bag-Of-Words (Frequency Based Features).

The way we capture semantics is based on the bag-of-words model (BoW), which has been widely applied in natural language processing (NLP) and information retrieval (Harris, 1954). The main concept behind BoW is to map a segment of text to an unordered collection, or “bag”, of words. As we have seen, this is the premise for the construction of document-term matrices for a corpus of documents. In its original formulation, BoW is applied to each complete document in a collection of documents (a corpus) and ignores the sequence of words in the text. We are transferring BoW into a time-series context and present an “online” formulation. This allows us to overcome computational difficulties associated with BoW, namely the handling of sparse matrices whose size depends on the number of distinct document words and corpus size, and may well be in the order of hundreds of thousands. In addition, this setting allows us to construct a text-based time-series that can be incorporated into a time-series based system for supervised or unsupervised learning.
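To give a feel for what an "online" bag-of-words time series could look like, here is a minimal sketch that updates running word counts article by article and emits one feature vector per time stamp, without ever building a document-term matrix. It is only an illustration under our own assumptions (cumulative relative frequencies of a few tracked terms); it is not the formulation used in the ResilientML pipeline or in the cited papers.

```python
from collections import Counter

def bow_time_series(stream, tracked_terms):
    """Yield (timestamp, {term: cumulative relative frequency}) per article."""
    counts, total = Counter(), 0
    for timestamp, tokens in stream:
        counts.update(tokens)          # incremental update, no sparse matrix
        total += len(tokens)
        yield timestamp, {
            t: (counts[t] / total if total else 0.0) for t in tracked_terms
        }

# Hypothetical, already pre-processed articles in time order.
articles = [
    ("2021-06-01", ["btc", "rally", "btc"]),
    ("2021-06-02", ["eth", "upgrade"]),
    ("2021-06-03", ["btc", "dip", "eth"]),
]

series = list(bow_time_series(articles, ["btc", "eth"]))
for timestamp, features in series:
    print(timestamp, features)
```

Because the generator holds only the running counts, memory use depends on the vocabulary seen so far rather than on the number of documents.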

Grammar : i.e. the structural rules that dictate how words fit into the sentence and form groups such as clauses and phrases - parse tree or constituency tree for n-grams/sentences.

Consider the following example

The brown dog is running in the park.

which can also be written in an equivalent grammatical manner:

He is running in the park.

without destroying the grammar or meaning of the sentence.

The fact that a group of words can operate as a single unit - and therefore in our example we can replace the phrase “the brown dog” with “he” - is the linguistic property of constituent structure.

Therefore, one can extract features dictated by the grammatical rules, which tell us which words can be grouped into units and allow us to study the role of those units in the sentence.

The formal system for studying this phenomenon, i.e. the grouping of words as in the above example, is the context-free grammar (CFG). Mathematically, a CFG is defined by a quadruple as follows: 𝐺=(𝑁,Σ,𝑅,𝑆) where

𝑁 is a set of non-terminal symbols

Σ is a set of terminal symbols, 𝑁∩Σ=∅

𝑅 is a set of rules, 𝑅={𝐴→𝛽:𝐴∈𝑁 and 𝛽∈(Σ∪𝑁)∗}

𝑆 is the designated start symbol, 𝑆∈𝑁

Such grammars are called “context-free” because the left hand side of each rule contains exactly one non-terminal symbol, so a rule can be applied regardless of the context in which that symbol appears.

A context-free grammar defines a formal language, which is the set of strings of terminal tokens that can be derived starting from S.

A sentence is called grammatical if the sequence of tokens that comprises it can be derived by following the rules of the CFG; otherwise, the sentence is not valid according to the language of the CFG (ungrammatical).

The process of analyzing the constituent structure of a sentence is called constituency parsing, and the derivation of a sentence, i.e. the rules that we followed when building it, can be represented with a hierarchical structure, a tree, which is called the parse tree or constituency tree.
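The following sketch shows constituency parsing on the running example using a toy grammar and the textbook CYK algorithm. The grammar (written in Chomsky normal form, with made-up non-terminals such as `Nom` and `VG`) is hand-crafted for these two sentences and is an assumption for illustration; it says nothing about the grammar or parser used in the ResilientML pipeline.

```python
from collections import defaultdict

# Toy CFG in Chomsky normal form: (B, C) -> list of A for each rule A -> B C.
BINARY = {
    ("NP", "VP"): ["S"], ("Det", "Nom"): ["NP"], ("Adj", "N"): ["Nom"],
    ("Aux", "VG"): ["VP"], ("V", "PP"): ["VG"], ("P", "NP"): ["PP"],
}
# Lexical rules A -> word, indexed by word.
LEXICAL = {
    "the": ["Det"], "brown": ["Adj"], "dog": ["N", "Nom"],
    "park": ["N", "Nom"], "is": ["Aux"], "running": ["V"],
    "in": ["P"], "he": ["NP"],
}

def cyk_parse(tokens, start="S"):
    """Return one parse tree as nested tuples, or None if ungrammatical."""
    n = len(tokens)
    chart = defaultdict(dict)  # chart[(i, j)][symbol] = tree over tokens[i:j]
    for i, word in enumerate(tokens):
        for symbol in LEXICAL.get(word, []):
            chart[(i, i + 1)][symbol] = (symbol, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left, ltree in chart[(i, k)].items():
                    for right, rtree in chart[(k, j)].items():
                        for parent in BINARY.get((left, right), []):
                            chart[(i, j)].setdefault(parent, (parent, ltree, rtree))
    return chart[(0, n)].get(start)

def bracket(tree):
    """Render a parse tree in bracketed notation."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracket(c) for c in children) + ")"

tree = cyk_parse("the brown dog is running in the park".split())
pronoun_tree = cyk_parse("he is running in the park".split())
print(bracket(tree))
print(bracket(pronoun_tree))
```

In both trees the subject is a single `NP` constituent, which is exactly why "the brown dog" can be swapped for "he" without breaking the sentence.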

Syntax: that is, the principles that dictate the structure of sentences by specifying the order and role of each word in the text - Dependency Graphs.

The goal of syntactic analysis is to discover pairs of words in which one depends on the other, and the type of that dependence.

These dependency relations are binary and asymmetrical, and therefore we would like to know which of the two words acts as the head that is modified in some way, and which is the dependent that modifies or complements the head. This concept allows us to think of the dependency relations as inducing graph structures (dependency graphs) which we use to study the dependency relations between words, and therefore the syntax of a sentence. The syntactic analysis complements the grammatical, parse tree-based analysis, as now we aim to extract information on the functional role of each word in the sentence, rather than structural relations between them as we did with the context-free grammar.
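As an illustration of the head/dependent structure, the sketch below encodes a hand-annotated dependency analysis of the earlier example sentence and stores it as a graph. The arcs and the relation labels (in the style of Universal Dependencies) are our own illustrative annotation, not the output of a parser or of the ResilientML pipeline.

```python
from collections import defaultdict

tokens = ["The", "brown", "dog", "is", "running", "in", "the", "park"]

# (head index, dependent index, relation); head -1 is an artificial root.
ARCS = [
    (-1, 4, "root"),   # "running" is the main predicate
    (4, 2, "nsubj"),   # dog   <- running
    (2, 0, "det"),     # The   <- dog
    (2, 1, "amod"),    # brown <- dog
    (4, 3, "aux"),     # is    <- running
    (4, 7, "obl"),     # park  <- running
    (7, 5, "case"),    # in    <- park
    (7, 6, "det"),     # the   <- park
]

def build_graph(arcs):
    """Map each head to its list of (dependent, relation) pairs."""
    graph = defaultdict(list)
    for head, dep, rel in arcs:
        graph[head].append((dep, rel))
    return graph

def is_tree(arcs, n_tokens):
    """Check that every token has one head and reaches the root."""
    heads = {}
    for head, dep, _ in arcs:
        if dep in heads:
            return False               # two heads for one word
        heads[dep] = head
    if set(heads) != set(range(n_tokens)):
        return False                   # some token has no head
    for tok in range(n_tokens):
        seen, node = set(), tok
        while node != -1:
            if node in seen or node not in heads:
                return False           # cycle or dangling head
            seen.add(node)
            node = heads[node]
    return True

graph = build_graph(ARCS)
for dep, rel in graph[4]:
    print(f"{rel}({tokens[4]}, {tokens[dep]})")
print(is_tree(ARCS, len(tokens)))
```

Each arc is asymmetric (head first, dependent second), which mirrors the binary, directed relations described above.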

@ResilientML Please add the amount of OCEAN requested for this round to your proposal. Thanks!

Ah yes! Thanks, Alex

Resilient ML will have my support for this DAO round as you are creating a lot of value for the community and tech wise. Thanks for building this community and space.

Thank you Kai! Great to hear. We are enjoying being part of the Ocean community :slight_smile: