Analyzing A South African Financial News Twitter Corpus using a Topic Model

An example of a word cloud from Fin24, BusinessDay and Moneyweb tweets.

Over the past decade there has been an increase in the amount of digital information that is available. In particular, there are now vast amounts of data available on social media platforms such as Twitter and Facebook that can be analysed to gain further insight and to establish sentiment about a particular topic. This information can then be used as additional input into a portfolio construction model, amongst other things.

A topic model is one method that can be used to analyse and summarize the vast amounts of digital data available on the web. Topic models are methods that extract the hidden topics or themes present in a large collection of documents.

In this blog post, we create a South African financial news corpus (i.e. a collection of documents) from Twitter, then analyse this corpus in R using the Latent Dirichlet Allocation (LDA) probabilistic topic model.

Creating the Financial News Corpus from Twitter

In this section we outline how we created a financial news corpus in R from the tweets of prominent South African financial news websites, namely Fin24, BusinessDay and Moneyweb.

Step 0: Create an App in Twitter

To create a Twitter app, go to apps.twitter.com/ and log in with your Twitter account. Next to your profile picture (in the upper right-hand corner) there is a drop-down menu. In this drop-down menu, there is “My Applications”. Click on it and then click “Create new application”. The screen below will appear:

Setting up a Twitter App

All you have to do now is enter the details in the above screen. You can name your application whatever you want and give it whatever description you want. Note that Twitter requires a valid URL for the website, but you can just put a placeholder here (e.g. https://www.wikipedia.org/). As the Callback URL, enter: http://127.0.0.1:1410.

Now click "Create your Twitter application" and you will be redirected to a screen that has all the Authorization information that you need to access tweets from an R session.

Note that you need to have installed the tm, twitteR and wordcloud packages before proceeding to the next steps.

Step 1: Set up the twitter authentication for your R session

The authorization screen has the details that you need to set up the authentication for your R session. Enter your authentication details (found in the Keys and Access Tokens tab of your Twitter app).
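A minimal sketch of this step, assuming the twitteR package; the placeholder strings stand in for your own keys and tokens:

library(twitteR)

# Authentication details from the Keys and Access Tokens tab of your Twitter app
# (replace the placeholder strings with your own values)
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"

# Authorise this R session to access the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)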

Step 2: Read in the tweets

Once your authentication is successful, we can then read in the tweets using the code below:
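A minimal sketch of this step using the twitteR package; the account handles and the number of tweets requested are illustrative choices:

# Pull recent tweets from each news account
fin24_tweets       <- userTimeline("Fin24", n = 1000)
businessday_tweets <- userTimeline("BDliveSA", n = 1000)
moneyweb_tweets    <- userTimeline("Moneyweb", n = 1000)

# Convert the lists of status objects into data frames of tweet text
fin24_df       <- twListToDF(fin24_tweets)
businessday_df <- twListToDF(businessday_tweets)
moneyweb_df    <- twListToDF(moneyweb_tweets)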

Step 3: Clean the tweets

We then need to clean the tweets to remove unnecessary items such as links and other characters (e.g. "@", "#") which won't help us in our analysis.
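A minimal sketch of the cleaning step using base R regular expressions, assuming the tweet data frames from the previous step (the exact patterns are illustrative):

# Remove links, @mentions, #hashtags and other non-alphanumeric characters
clean_tweets <- function(text) {
  text <- gsub("http\\S+", "", text)                 # remove links
  text <- gsub("[@#]\\w+", "", text)                 # remove "@" mentions and "#" hashtags
  text <- gsub("[^[:alnum:][:space:]]", " ", text)   # drop remaining punctuation
  text <- gsub("\\s+", " ", text)                    # collapse repeated whitespace
  tolower(trimws(text))
}

fin24_clean       <- clean_tweets(fin24_df$text)
businessday_clean <- clean_tweets(businessday_df$text)
moneyweb_clean    <- clean_tweets(moneyweb_df$text)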

Step 4: Visualize the tweets

After cleaning the tweets, we remove the stop-words (e.g. "to", "and", "go", etc.) and then visualize the words. In this blog post, we visualize the tweets using word clouds. The code below illustrates how a word cloud can be created in R using the wordcloud package.
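A minimal sketch for the Fin24 tweets, assuming the cleaned text from the previous step; the same calls can be repeated for the other two sites:

library(tm)
library(wordcloud)

# Build a corpus from the cleaned Fin24 tweets and remove English stop-words
fin24_corpus <- Corpus(VectorSource(fin24_clean))
fin24_corpus <- tm_map(fin24_corpus, removeWords, stopwords("english"))

# Plot the word cloud (word limits and colour palette are arbitrary choices)
wordcloud(fin24_corpus, max.words = 100, min.freq = 3,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))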

The word clouds for the three news sites are shown below:

Fin24 word cloud
BusinessDay word cloud
Moneyweb word cloud

The word clouds from the three financial news sites highlight more or less the same themes. In the next section, we will extract the key themes, using LDA, from the combined corpus from the three news sites to see which hidden topics are being implied by the tweets.

Analyzing the Corpus using a Topic Model

An example of using plate notation for LDA with Dirichlet-distributed topic-word distributions.
The Latent Dirichlet Allocation (LDA) model is one of the most popular probabilistic topic models. It is an unsupervised learning technique. The basic idea of LDA is that a document (in our case, a tweet) is represented as a random mixture over latent/hidden topics, and each topic is a distribution over the words in the vocabulary. The topics themselves are not observed; only the words in the documents/tweets are observed. This hidden topic structure makes the model well suited to inferring its parameters using Bayesian inference.
LDA assumes that the mixture of topics for each document originates from a Dirichlet distribution, i.e. it assigns a Dirichlet prior to the per-document topic mixture. The Dirichlet prior is chosen because of its conjugacy to the multinomial distribution, a property which is crucial in simplifying the Bayesian statistical inference problem. The LDA model is explained in more detail here.
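As a compact sketch in standard LDA notation (the hyperparameters α and β, the number of topics K and the number of documents D are not stated explicitly in this post), the generative process described above can be written as:

\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), && k = 1,\dots,K && \text{(topic-word distributions)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), && d = 1,\dots,D && \text{(per-document topic mixture)}\\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && && \text{(topic assignment of word } n \text{ in document } d)\\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && && \text{(observed word)}
\end{aligned}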
The LDA implementation that we use in this blog post is from the topicmodels package in R. The topicmodels package provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization (VEM) algorithm and an algorithm using Gibbs sampling. In this blog post we use Gibbs sampling, which is a Markov chain Monte Carlo algorithm.
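A minimal sketch of fitting the model with the topicmodels package, assuming the cleaned tweet vectors from the earlier steps and the 3 topics used in this post; the Gibbs control settings are illustrative:

library(tm)
library(topicmodels)

# Combine the cleaned tweets from the three sites into one corpus
all_tweets <- c(fin24_clean, businessday_clean, moneyweb_clean)
corpus     <- Corpus(VectorSource(all_tweets))
corpus     <- tm_map(corpus, removeWords, stopwords("english"))

# Build the document-term matrix and drop empty documents (LDA cannot handle them)
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Fit an LDA model with 3 topics via Gibbs sampling
lda_fit <- LDA(dtm, k = 3, method = "Gibbs",
               control = list(seed = 123, burnin = 1000, iter = 2000))

# Inspect the ten most probable words in each topic
terms(lda_fit, 10)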

 

In our implementation, we fixed the number of topics at 3, so each document/tweet is modelled as a mixture of up to 3 topics. The results from fitting the LDA model, using the topicmodels package in R, are shown below:

The logarithmized parameters of the word distribution for each topic.

The latent 3 themes/topics in the financial corpus appear to be the following:

  • Topic 1: Local (South Africa) and International Financial markets
  • Topic 2: South African Politics
  • Topic 3: KPMG saga and State capture

The results produced by the model appear to make intuitive sense, but the model does have some disadvantages.

The LDA model that we implemented above assumes that there is no correlation between the topics. In addition, we fixed the number of topics beforehand; a better approach would have been to allow the number of topics to be generated from some distribution (e.g. Poisson). This, however, would also increase the dimensionality of the problem.

Another disadvantage of the above model is that it allows for more than one topic to be present in each document/tweet. In a tweet, it is very unlikely that there is more than one topic, given the limited number of characters. A model that allows for only one topic per document would likely perform better than the above approach.

Conclusion

In this blog post we created a financial news corpus from Twitter and then analysed this corpus using the LDA model. In our implementation, we fixed the number of topics at 3, so each tweet could contain up to 3 topics. Given that a tweet only contains 140 characters, it is unlikely that a tweet will contain more than a single topic, so a topic model which only allows for one topic per document/tweet would be more suitable for analyzing this financial news corpus.
In addition, instead of fixing the number of topics as we did in this blog post, one could extend the analysis by defining the number of topics as a random variable; this would allow the model to infer the natural number of topics inherent in the text corpus. This, however, would also increase the dimensionality of the problem.

 

Code Used

This section contains the R code used in this post.
