NgramPOS: a bigram-based linguistic and statistical feature process model for unstructured text classification

Research in the financial domain has shown that the sentiment of stock news has a profound impact on trading volume, volatility, stock prices and firm earnings. In-depth analysis of stock news is now sourced from financial reviews on various social networking and marketing sites to help improve decision making. Nonetheless, such reviews are in the form of unstructured text, which requires natural language processing (NLP) to extract the sentiments. Accordingly, in this study we investigate the use of NLP tasks in an effort to improve the performance of sentiment classification in evaluating the information content of financial news as an instrument in an investment decision support system. At present, feature extraction approaches are mainly based on the occurrence frequency of words; therefore, low-frequency linguistic features that could be critical to sentiment classification are typically ignored. In this research, we attempt to improve current sentiment analysis approaches for financial news classification by focusing on low-frequency but informative linguistic expressions. Our proposed combination of low- and high-frequency linguistic expressions contributes a novel set of features for sentiment classification. The experimental results show that an optimal Ngram feature selection (a combination of optimal unigram and bigram features) enhances sentiment classification accuracy compared to other types of feature sets.


Introduction
Financial news has a strong influence on stock markets and returns because it covers the breadth of financial developments among firms as well as local and global financial policies. In accordance with the Efficient Market Hypothesis [1], all available information about stocks is reflected in their market prices. [2] showed that qualitative text has a substantial impact on stock prices. Hence, information, particularly financial, market and economic news from traditional and new media, plays an essential role when investors estimate stock prices. Nonetheless, the rapid increase of financial news in new media over the past decades makes it very difficult for investors to fully investigate all available information before making financial decisions. Although attempts have been made to utilize text mining to extract meaningful pieces of information from the unstructured form of the news into a standardized format for classification, studies in automated classification of textual financial news are still in their infancy [3]. Challenges in the classification of financial news, such as feature extraction, feature selection, and the classification process, remain crucial as these tasks deal with unstructured text.
In the literature, only a few studies [4,5] have employed sophisticated feature extraction approaches, such as noun phrases, to extract low-frequency features. Most current studies [6-8] on sentiment-based financial news analysis rely on simple frequency-based textual representations, such as Bag-of-Words (BoW), where the news is represented by the occurrence frequencies of distinct words. Other research works [9,10], in contrast, have employed unigram text characterization, which shows similar characteristics to the BoW approach because both rely on occurrence frequency. However, given the size of the textual data, it is natural to encounter a high percentage of low-frequency bigrams that may serve as informative sentiment features. In practice, words are mainly extracted based on their high frequency, which leads to the exclusion of low-frequency linguistic features that may nevertheless be valuable for sentiment classification [11].
This research focuses closely on related studies that deal with different types of feature sets as input for text classification using the Support Vector Machine (SVM). In [9], the authors employed unigrams, stems, financial terms, health metaphors and agent metaphors along with a number of feature selection algorithms such as Information Gain (IG), Chi Square (CHI) and Document Frequency (DF). Two feature weighting methods were used, binary and Term Frequency (TF), which assign different values to each feature. For example, binary feature weighting assigns one or zero to each feature according to the presence or absence of that word in the content. The following sections will discuss feature weighting methods in detail. As previously mentioned, unigrams and single words, owing to their typically higher frequency, achieved higher accuracy (67.6%) than the other types of features. However, a unigram alone cannot convey the sentiment orientation of a document. Although the authors separately employed financial terms and health metaphors in determining text polarity, these led to the lowest accuracies, 59.2% and 52.4%, respectively. On the other hand, the highest accuracy was achieved by the TF feature weighting method due to its ability to capture the importance of term values in a document.
The work by [4] focused on different types of features, including dictionary-based single words retrieved from the corpus, bigrams, 2-word combinations, and noun phrases, and used CHI for feature selection. The results indicated that the worst accuracy, as expected, was obtained by bigrams with frequency-based feature reduction, while 2-word combinations achieved the highest accuracy under both feature selection methods (frequency-based and CHI), at 62.0% and 72.6%, respectively. This shows the strong impact of CHI on classification accuracy, although the CHI method is highly sensitive to sample size and small frequencies, and it is independent of the strength of the relationship between the features. In [12], the experiments used news articles to predict the stock market based on single positive and negative words in the news documents for sentiment analysis. As discussed, single words alone are unable to convey the sentiment and semantics of textual news. Observations showed that Random Forest and SVM performed well in all types of testing, from 86 to 92%, while Naïve Bayes (NB) performance was around 86%.
Researchers in [13] integrated and analyzed news articles and financial blogs to develop a dynamic prediction model based on the mood of stock markets. These authors performed text segmentation to split the input text into words and removed low-frequency vocabulary. In the stage of feature acquisition and extension, they utilized two approaches to extract words: the first does not consider the part of speech, and the second extends only nouns. Finally, the authors used PCA for dimensionality reduction of features and achieved the highest performance by applying the TF-IDF feature weighting method in one of the scenarios, with an accuracy of 65.81%. Another work proposed a sentiment analysis engine (SAE) [14], which performed linguistic analysis based on grammars by analyzing sentiments as word tokens and phrases for each sentence. The system was evaluated based on classification accuracy, whereby accuracies between 49.8 and 82.1% were reported for two different sets of word lists in rule-based, unsupervised sentiment classification.
As discussed earlier, the main focus of the existing works is to identify the polarity of news contents. However, the existing approaches have employed a variety of feature extraction and selection methods to classify news, which have led to different outcomes. This reveals a gap in the performance of feature extraction and feature selection algorithms in sentiment classification. The performance of the related work can be improved if these weaknesses are addressed:
• Identifying linguistically and statistically relevant features
• High dimensionality of the feature space
This research proposes a model for sentiment classification of financial news based on a combination of statistical-based and linguistic-based approaches. This model, presented as the NgramPOS model, is able to generate an optimal feature space based on high- and low-frequency features. A model that captures strongly sentimental low-frequency bigrams can inform the decision-making behavior of investors and stockholders. The contributions of this paper are two-fold:
• Demonstrating the strength of low-frequency, prominent statistical and linguistic features in classifying news polarity into positive and negative (bigram_POS).
• Using Uni_POS and Bi_POS phrases as sentiment features for supervised machine learning.
This paper sets out to enhance the performance of high-frequency features (Uni-POS) by incorporating a richer set of sentiment information in the form of Bi-POS phrases (low-frequency features) alongside the Uni-POS features. These features will be extracted using part-of-speech (POS) tagging based on fixed patterns and principal component analysis (PCA).
The rest of this paper is organized as follows. Section 2 describes the research methodology, which focuses on how to build a bigram-based linguistic and statistical feature process model to financial news sentiment classification. Section 3 describes the experimental design and results for this paper. A conclusion and further research directions are given in Sect. 4.

Framework of financial news classification
This paper proposes a new framework for financial news classification that consists of three phases: pre-processing, feature processing, and financial sentiment classification. The first phase includes all activities related to the preparation of the news text, the second phase is concerned with all processes related to feature extraction and selection, and the last phase covers the classification methods. The components of this framework and the requirements with respect to each individual component are discussed as follows.

Financial news preprocessing
This phase is composed of three components: extracting and collecting financial news, financial news labeling, and news preparation and cleaning, each of which is described as follows.

Financial news collection
The collection of financial news used in this work has been sourced from Google Finance and is part of Internet postings. Internet posts are a useful source of textual sentiment because many people spend a notable amount of time every day reading and writing them; however, they are potentially noisier than other sources because they contain more views from individual traders. Other potential sources include media articles as well as public disclosures.

Financial news labeling
After storing financial news items extracted from National Association of Securities Dealers Automated Quotations (NASDAQ) and New York Stock Exchange (NYSE) stock markets as text files, news labeling was performed on the news data using R packages, namely tm.plugin.tags and tm.plugin.sentiment. The label of each stored textual news file was reviewed and validated manually [15] against an existing financial news classification (i.e. news documents categorized by Lydia sentiment analysis) and then labeling errors were fixed.

News cleaning
As mentioned in Sect. 2.1.1, the financial news collected from Google Finance is in the form of HTML/XML, whereby the main content body of each news item is enclosed between XML tags and includes four elements: news headline, news body, stock ticker, and publication date and time [16]. This step deals with the variety of words and signs in financial news that are irrelevant and carry no sentiment, such as XML tags, email addresses, currency signs and so forth. This text cleaning process is specifically aimed at removing the XML tags so that the dataset consists only of the news pieces without the tags. Hence, text cleaning is applied before the feature processing phase.

Feature processing
As discussed in the introduction, most of the related work on financial news sentiment relies on fairly simple feature selection and extraction approaches such as bag-of-words (BoW) or unigrams. Hence, it is interesting to know whether the combination of statistical and linguistic methods can actually enhance the performance of financial news classification, and this is the main idea behind the design of the proposed models in this research.

Part-of-speech tagging and tokenization
Generally, algorithms make use of a Natural Language Processing (NLP) technique called Part-of-Speech (PoS) tagging, the task of labeling every single word in a sentence with a PoS tag. The tags correspond to the common PoS categories in English grammar such as noun, verb, adjective, adverb, preposition, pronoun, conjunction, and interjection. Since PoS tagging identifies sentiment expressions and the semantic relationships between them, we use it to distinguish candidate features that indicate sentiment orientation.
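As a toy illustration of this idea, the sketch below tags tokens with Penn Treebank-style labels from a tiny hand-made lexicon (the lexicon is hypothetical; a real pipeline would use a trained tagger such as the Stanford tagger used later in this paper) and keeps only the tag categories that typically carry sentiment:

```python
# Toy PoS-tagging sketch; TAG_LEXICON is a hypothetical mini-lexicon,
# not a real tagger model.
TAG_LEXICON = {
    "stock": "NN", "prices": "NNS", "fell": "VBD",
    "sharply": "RB", "negative": "JJ", "outlook": "NN", "of": "IN",
}

# Penn Treebank tag prefixes for candidate sentiment categories:
# adjectives (JJ*), adverbs (RB*), nouns (NN*), verbs (VB*).
CANDIDATE_PREFIXES = ("JJ", "RB", "NN", "VB")

def pos_tag(tokens):
    """Label each token with a PoS tag (unknown words default to NN)."""
    return [(t, TAG_LEXICON.get(t, "NN")) for t in tokens]

def candidate_features(tagged):
    """Keep only tokens whose tag marks a sentiment-candidate category."""
    return [tok for tok, tag in tagged if tag.startswith(CANDIDATE_PREFIXES)]

tagged = pos_tag("stock prices fell sharply".split())
```

A preposition such as "of" (tagged IN) would be filtered out by `candidate_features`, while adjectives, adverbs, nouns and verbs survive as candidates.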

N-gram model
An n-gram is a sequence of n consecutive symbols, which can be characters, words, bytes, or any other contiguous symbols. A 1-gram, also called a unigram, consists of only one symbol. Accordingly, a 2-gram or bigram consists of two consecutive symbols, and so forth [9,10]. The n-gram model uses the preceding n − 1 symbols to predict the next symbol, p(t_n | t_{n−1}). Research [17,18] in sentiment analysis has further shown that n-grams are effective features for word sense disambiguation. Hence, this study uses a word-based n-gram model to extract unigram, bigram and trigram features, which can reveal correlations between words and the importance of individual phrases [19]. Following the PoS tagging and n-gram approaches, processing then tokenizes the financial news documents into individual tokens, removes punctuation, and normalizes the whole text.
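A word-based n-gram extractor of the kind described above can be sketched in a few lines (illustrative only; the example sentence is invented):

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "stock prices fell sharply today".split()
unigrams = ngrams(tokens, 1)   # 5 unigrams
bigrams = ngrams(tokens, 2)    # 4 bigrams, e.g. ('stock', 'prices')
trigrams = ngrams(tokens, 3)   # 3 trigrams
```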

Unsupervised feature weighting methods
Term weighting methods can be categorized into two major groups: supervised and unsupervised methods [20,21]. Supervised weighting methods use prior information from the training documents, which are organized into a set of pre-defined categories. In contrast to supervised feature weighting methods, this study uses unsupervised weighting methods derived from the field of Information Retrieval (IR). Among the unsupervised term weighting methods are binary, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF) [22].
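The three unsupervised schemes can be sketched as follows (a minimal illustration using raw-count TF and the plain logarithmic IDF; real systems often use smoothed or length-normalized variants):

```python
import math
from collections import Counter

def term_weights(docs):
    """Return per-document binary, TF, and TF-IDF weights for each term."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    weighted = []
    for d in docs:
        tf = Counter(d)
        weighted.append({t: {"binary": 1,                        # presence/absence
                             "tf": tf[t],                        # raw count
                             "tfidf": tf[t] * math.log(n / df[t])}
                         for t in tf})
    return weighted

docs = [["stock", "rose", "stock"], ["stock", "fell"]]
w = term_weights(docs)
```

Note that a term appearing in every document, like "stock" here, receives a TF-IDF weight of zero, while rarer terms are boosted.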

Dimensionality reduction using principal component analysis
One of the most popular feature extraction methods is Principal Component Analysis (PCA), which has been broadly applied to a variety of data from social science and biology to finance [23]. In brief, PCA seeks to map data points, such as financial news documents, from a high- to a low-dimensional space while keeping the most significant linear structure intact. Given a dataset consisting of n rows (e.g. financial news documents) and m columns (e.g. their features), an n × m matrix X is derived. Let k be the dimensionality of the new space to which we seek to map the data, with k ≪ m. The covariance between two features is defined by Eq. (1).
In this study, two criteria are applied to determine the k most significant components in the data space through PCA: the proportion of variance and Kaiser's rule [24].
• Proportion of variance The proportion of variance criterion searches for the first k eigenvectors of the covariance matrix with the largest eigenvalues. After the eigenvalues λ_i are sorted in descending order, the cumulative proportion described by the k principal components is given in Eq. (2) [24].
where λ_i are the eigenvalues and m is the dimensionality of the feature space in this model. k ≪ m occurs when the dimensions are highly correlated; in that case only a small number of eigenvectors are needed, so a large reduction in dimensionality can be achieved. Normally, the threshold is placed in the range 80-90%; we consider 90% a reasonable criterion.
• Kaiser's rule Kaiser's rule [24] is another criterion for eliminating the eigenvectors with low eigenvalues. Since components whose eigenvalues fall below the average contribute relatively little variance, only the components whose eigenvalue exceeds the mean eigenvalue are retained.
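Both criteria can be sketched with a few lines of NumPy (illustrative; the function and variable names are our own, not from the paper):

```python
import numpy as np

def select_k(X, threshold=0.90):
    """Return k under (a) cumulative proportion of variance and (b) Kaiser's rule."""
    Xc = X - X.mean(axis=0)                        # center each feature
    cov = np.cov(Xc, rowvar=False)                 # m x m covariance matrix
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, descending
    eig = np.clip(eig, 0.0, None)                  # guard against round-off negatives
    cum = np.cumsum(eig) / eig.sum()               # cumulative proportion of variance
    k_variance = int(np.searchsorted(cum, threshold) + 1)  # first k reaching 90%
    k_kaiser = int((eig > eig.mean()).sum())       # keep above-average eigenvalues
    return k_variance, k_kaiser
```

On strongly correlated data both criteria collapse to a very small k, which is exactly the situation described above.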

Financial news classification
The Support Vector Machine (SVM) has been widely used as the classification algorithm for textual data, especially in sentiment classification of financial news [22,25]. The specific characteristics of textual data classification, such as a high-dimensional space, low-discriminative lexical features, irrelevant features, and feature sparsity, encourage the adoption of the SVM classifier [26,27] in this work.

Support vector machine
The Support Vector Machine (SVM) works by identifying a Maximum Margin Hyperplane (MMH) that creates the greatest separation between two classes. It is a simple and intuitive classifier developed by Vapnik [28]. When dealing with non-linear decision boundaries, however, the non-linear SVM applies kernel methods to map the data (news) into a higher-dimensional space in which an MMH can be obtained. In other words, the non-linear SVM maps each vector x (where x represents a news document) in an n-dimensional input space into a new m-dimensional feature space Φ(x), with the objective of separating the data in the feature space using a hyperplane. The formulation given by [29] is shown in Eq. (4), where f(x) is a linear classifier formed from the product of two vectors: w is the weight vector and b is the bias. [26,30] showed that the RBF kernel achieves high accuracy in a shorter convergence time only when the data are normalized and the kernel parameters are set properly. Hence, this research first uses a linear kernel as a baseline, and then uses the RBF kernel to determine the best classifier for sentiment classification.
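A minimal sketch of this setup follows, using scikit-learn (which the paper does not name as its toolkit) and toy two-class points standing in for document vectors:

```python
from sklearn.svm import SVC

# Toy two-class data standing in for document feature vectors.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

# Linear kernel as the baseline, RBF kernel for non-linear boundaries.
linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

pred = rbf_clf.predict([[0.5, 0.5], [5.5, 5.5]])
```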

Model tuning in SVM
The grid-search optimization method is used to find the best parameter values and has been found more computationally cost-effective than other, more sophisticated optimization methods. This is because only two parameters need to be optimized in the grid search, so the process can be performed in parallel. In addition, the two parameters, C and γ, are independent, unlike in other methods that involve iterative processes [31]. In line with the prior research presented in [31,32], we use a standard grid of (C, γ), with γ = 10^(−6:−1) and C = 10^(−3:2) for the RBF SVM, and C = 10^(−3:2) for the linear SVM [33].
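The grid above can be sketched with scikit-learn's GridSearchCV (an illustration of the search, not the paper's exact implementation; the toy data replaces the news feature vectors):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy separable data in place of the news feature vectors.
X = [[i / 10.0, (i % 3) / 3.0] for i in range(20)]
y = [0] * 10 + [1] * 10

param_grid = {
    "C": [10.0 ** e for e in range(-3, 3)],       # C = 10^-3 .. 10^2
    "gamma": [10.0 ** e for e in range(-6, 0)],   # gamma = 10^-6 .. 10^-1
}
# Each (C, gamma) cell is independent, so the search parallelizes trivially
# (n_jobs=-1 would use all cores).
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=2)
search.fit(X, y)
```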

Experimentation and results
The NgramPOS model is constructed using a linguistic-based model relying on the Ngram model and consists of five main stages. Figure 1 shows the experiment conducted to create the NgramPOS model.

Preprocessing in NgramPOS-based methodology
The financial news documents, after the collection, labeling, and cleaning steps described above, are prepared for preprocessing.
• POS tagging and Ngram In this step, a few basic Natural Language Processing (NLP) procedures are first applied to the cleaned news text using the standard Penn Treebank POS tag set and the Stanford tagger [34]. Figure 2 shows a sentence of financial news after applying the POS tagger.
To give a clear picture of the preprocessing step, let T be the set of all possible words in the context of the news documents; an input cleaned news text file is then represented as d_i = (t_1, t_2, …, t_{m−1}, t_m), where t_j ∈ T and d_i ∈ D, with D = {d_1, d_2, …, d_n}, where m and n are the size of the vocabulary extracted from the news sets and the total number of financial news documents, respectively. Table 1 gives the total number of financial news documents and depicts the distribution of the n-gram models (unigrams, bigrams and trigrams) collected from the clean financial news documents, where the ''number of terms'' column indicates the number of unigrams, bigrams, and trigrams with their frequencies (iterations) in the whole corpus. Meanwhile, the last column shows the richness of the n-gram terms over the entire set of news documents, computed by dividing the number of n-gram tokens by the size of their vocabulary:

Richness = (number of n-gram words) / (size of the n-gram vocabulary),
where the vocabulary is the list of n-gram terms without their frequencies.
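In code, this richness ratio is simply the total token count over the vocabulary size (a sketch; the example tokens are invented):

```python
from collections import Counter

def richness(ngram_tokens):
    """Total n-gram occurrences divided by the number of distinct n-grams."""
    return len(ngram_tokens) / len(Counter(ngram_tokens))

# 4 unigram tokens over a vocabulary of 2 distinct words -> richness 2.0
r = richness(["stock", "rose", "stock", "stock"])
```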
As can be seen, the unigrams have the highest frequency, which leads to high richness, while the bigrams and trigrams have much lower richness than the unigrams. Hence, we use unigrams and bigrams for feature extraction.

Uni-POS and Bi-POS feature extraction and selection
After tokenization has been performed and a POS has been assigned to each word, the feature process starts by choosing the most relevant and richest terms as features. Most research on sentiment analysis [35,37] has focused on the recognition of sentiment expressions and opinion words. Hence, POS is applied to identify the terms used for sentiment analysis in order to recognize the polarity of each news item for sentiment classification, such as adjectives, adverbs, nouns and verbs (unigram patterns), and collocations like adjective + noun (bigram patterns).
Besides adjectives, other content words such as adverbs, nouns, and verbs are also used to express sentiment and opinion. For instance, a sentiment expression that uses an adjective such as ''negative'' conveys the sentiment towards the noun it modifies, as in ''negative value'', and the whole noun phrase ''negative value'' itself becomes a sentiment expression with the same polarity as the sentiment adjective, in this case negative. Stop words (a list of 13,775 entries) and currency tokens are removed. As can be seen in Fig. 3, four Uni-POS and eleven Bi-POS feature sets are created during the process of feature extraction and feature selection. Based on Fig. 3, let D be the set of documents containing the financial news and F_Uni-POS be the union of the Uni-POS feature sets, each of which is defined as follows. Likewise for the Bi-POS feature sets, let F_Bi-POS be the union of the Bi-POS feature sets; since none of them alone covers all the news documents, the bigrams are combined into a unique Bi-POS feature set. Before combining these feature spaces, we remove features with frequency less than 2 (freq < 2); the final Bi-POS feature sets are shown in Eqs. (10)-(13).
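The Bi-POS extraction with the freq < 2 cut-off can be sketched as follows (the pattern set and example documents are invented for illustration; the paper's full set of eleven Bi-POS patterns is not reproduced here):

```python
from collections import Counter

# A few fixed bigram tag patterns, e.g. adjective + noun.
BIGRAM_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("RB", "JJ")}

def bi_pos_features(tagged_docs, min_freq=2):
    """Collect pattern-matching bigrams, dropping those with freq < min_freq."""
    counts = Counter()
    for doc in tagged_docs:
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if (t1, t2) in BIGRAM_PATTERNS:
                counts[(w1, w2)] += 1
    return {bigram for bigram, c in counts.items() if c >= min_freq}

docs = [
    [("negative", "JJ"), ("value", "NN"), ("reported", "VBD")],
    [("negative", "JJ"), ("value", "NN"), ("weak", "JJ"), ("sales", "NNS")],
]
features = bi_pos_features(docs)
```

Here ''negative value'' occurs twice and survives the cut-off, while ''weak sales'' occurs once and is dropped.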

NgramPOS-based representation
The third phase of our experiment is document representation. This research uses the Vector Space Model (VSM) to transform each news item in the text document collection into a vector in the desired feature space. According to [16], different feature spaces and feature weighting methods have different effects on the performance of sentiment classification; hence, at this stage, VSMs are created based on the three feature sets Uni-POS, Bi-POS, and UniBi-POS. As in the previous subsection, D = {d_1, d_2, …, d_n} is considered as the set of news documents. In this case, three document-feature matrices are created, where rows represent the documents and columns represent the features. To measure the importance of each feature in a news document or across the entire corpus, each feature must be associated with a value called a weight. The three unsupervised feature weighting methods, binary, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF), are applied to each feature. With regard to the nature of the contextual data (financial news), since PCA is computationally inexpensive and can handle sparse and skewed data, it is a good choice as a method for dimensionality reduction of the feature space [38].

Financial news classification using RBF SVM for Uni-POS, Bi-POS, and UniBi-POS feature spaces after applying PCA
In this step, PCA is applied along with the two criteria, proportion of variance and Kaiser's rule, in order to reduce the dimensions of the feature spaces. The experiment is therefore conducted in two ways with the two criteria: first, k prominent features are extracted from UniBi-POS as a new feature space; second, k optimal features are extracted individually from Uni-POS and Bi-POS and then merged into another new feature space, as defined in Eq. (14).
Note that the F_UniBi-POS feature space is the union of the Uni-POS and Bi-POS features, while the F*_UniBi-POS-PCA feature space includes new features extracted from F_UniBi-POS, in which it is not clear what percentage of F_Uni-POS and F_Bi-POS has been retained, since PCA is an unsupervised method and chooses optimal features based on a variance measure. The second approach is defined in Eqs. (15)-(18).
From the equations, the F*_Uni-POS-PCA and F*_Bi-POS-PCA feature spaces include the optimal features extracted using PCA from F_Uni-POS and F_Bi-POS, respectively. Unlike F*_UniBi-POS-PCA, the F*_Uni-Bi-POS-PCA feature space comprises a certain percentage of each of the feature spaces F*_Uni-POS-PCA and F*_Bi-POS-PCA, based on the two criteria (proportion of variance and Kaiser's rule).
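The second approach, reducing each space separately and then merging, can be sketched with NumPy (a simplified stand-in for the paper's pipeline; the matrix shapes and function names are invented):

```python
import numpy as np

def pca_reduce(X, k):
    """Project centered X onto its k leading principal components."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:k]     # largest eigenvalues first
    return Xc @ vecs[:, order]

def merge_reduced(X_uni, X_bi, k_uni, k_bi):
    """F*_Uni-Bi-POS-PCA: concatenate the two separately reduced spaces."""
    return np.hstack([pca_reduce(X_uni, k_uni), pca_reduce(X_bi, k_bi)])

rng = np.random.default_rng(1)
merged = merge_reduced(rng.normal(size=(50, 10)),   # Uni-POS matrix
                       rng.normal(size=(50, 8)),    # Bi-POS matrix
                       3, 2)                        # k chosen per criterion
```

Unlike applying PCA once to the union, this guarantees that a known number of components comes from each of the unigram and bigram spaces.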
We use ''PCACUM90'' and ''PCAKi'' to refer to the proportion-of-variance threshold of 90% and to Kaiser's rule, respectively. For instance, the Uni-POS-TF-PCACUM90 feature space contains the unigram features with TF weighting, transformed by PCA to retain 90% of the variance, and the UniBi-POS-B-PCAKi feature space refers to the UniBi features with binary weighting, transformed by PCA using Kaiser's rule as the criterion, and likewise for the other combinations in Tables 2 and 3. Table 2 shows the maximum accuracy obtained for the different feature spaces F*_UniBi-POS-PCA, F*_Uni-POS-PCA, and F*_Bi-POS-PCA. Table 3, in comparison with the classification accuracies of Uni-TF-PCACUM90 (66.93 ± 2.91) and UniBi-B-PCAKi (64.00 ± 2.24) in Table 2, implies that despite the low frequency of the Bi-POS features, these features are capable of increasing the accuracy of sentiment classification as effective features. F*_Uni-Bi-B-POS-PCA includes the combination of Uni-POS-B and Bi-POS-B after applying the PCA method to each separately, while the UniBi-B-PCAKi feature space is created by applying PCA over the combined unigram and bigram features. Therefore, merging the two transformed feature spaces (F*_Uni-POS-PCA, F*_Bi-POS-PCA) produced by the PCA model constructs a new feature space for the RBF SVM that provides a promising result, since it keeps the optimal features from the Uni-POS-B and Bi-POS-B feature spaces.

Conclusion
This study proposed an effective feature selection model for sentiment classification of financial news which is able to enhance the performance of feature processing in Ngram-based models. The NgramPOS model employs a combination of statistical and linguistic approaches to extract sentiment information as features in order to classify financial news as positive or negative.