Over the past decade, the amount of digital information available has grown considerably. In particular, there is now a vast amount of data on social media platforms such as Twitter and Facebook that can be analysed to gain further insight and to establish sentiment about a particular topic. This information can then be used as additional input into a portfolio construction model, amongst other things.
One method for analysing and summarizing the vast amount of digital data available on the web is a topic model. Topic models are methods that can be used to extract hidden topics or themes that are present in a large collection of documents.
In this blog post, we create a South African financial news corpus (i.e. a collection of documents) from Twitter, then analyse this news corpus with the Latent Dirichlet Allocation (LDA) probabilistic topic model in R.
Creating the Financial News Corpus from Twitter
In this section we outline how we created a financial news corpus in R from the tweets of three prominent South African financial news websites: Fin24, BusinessDay and Moneyweb.
Step 0: Create an App in Twitter
To create a Twitter app, go to apps.twitter.com/ and log in with your Twitter account. Next to your profile picture (in the upper right-hand corner) there is a drop-down menu. In this drop-down menu, there is “My Applications”. Click on it and then click “Create new application”. The screen below will appear:
All you have to do now is enter the details in the above screen. You can name your Application whatever you want and put whatever Description you want. Note that Twitter requires a valid URL for the website, but you can just put a placeholder here (e.g. https://www.wikipedia.org/). As the Callback URL enter: http://127.0.0.1:1410.
Now click "Create your Twitter application" and you will be redirected to a screen that has all the Authorization information that you need to access tweets from an R session.
Note that you need to have installed the tm, twitteR and wordcloud packages before proceeding to the next steps. The topic model section later in this post also uses the topicmodels package.
Step 1: Set up the Twitter authentication for your R session
The Authorization screen has the details that you need to set up the authentication for your R session. Enter your authentication details (found in the Keys and Access Tokens tab of your Twitter app).
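As a rough sketch of this step, the four values from the Keys and Access Tokens tab can be passed to setup_twitter_oauth() from the twitteR package. Reading them from environment variables is a choice made for this sketch, and the variable names are placeholders rather than anything Twitter or twitteR prescribes.

```r
library(twitteR)

# Credentials are read from environment variables so they stay out of the
# script. The variable names below are placeholders chosen for this sketch.
creds <- Sys.getenv(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                      "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"))

if (all(nzchar(creds))) {
  # Registers the OAuth credentials for the current R session.
  setup_twitter_oauth(consumer_key    = creds[["TWITTER_CONSUMER_KEY"]],
                      consumer_secret = creds[["TWITTER_CONSUMER_SECRET"]],
                      access_token    = creds[["TWITTER_ACCESS_TOKEN"]],
                      access_secret   = creds[["TWITTER_ACCESS_SECRET"]])
} else {
  message("Twitter credentials not set; skipping authentication.")
}
```

If you prefer, the four strings can be pasted directly into the call instead, as long as the script is not shared.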
Step 2: Read in the tweets
Once your authentication is successful, we can then read in the tweets using the code below:
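One possible way to do this with twitteR is sketched here. The account handles, the number of tweets requested and the helper name fetch_tweet_text are assumptions made for illustration, so check them against the accounts you actually want to read.

```r
library(twitteR)

# Assumed Twitter handles for the three news sites; verify before use.
handles <- c(fin24 = "Fin24", businessday = "BDliveSA", moneyweb = "Moneyweb")

# Hypothetical helper: pull up to n recent tweets from one account and
# return only the tweet text as a character vector.
fetch_tweet_text <- function(handle, n = 500) {
  tweets <- userTimeline(handle, n = n, includeRts = FALSE)
  twListToDF(tweets)$text
}

# Set to TRUE once setup_twitter_oauth() has succeeded in this session.
run_live <- FALSE
if (run_live) {
  tweet_text <- lapply(handles, fetch_tweet_text)
}
```

Keeping the tweets of each site in a separate list element makes it easy to draw one word cloud per site and to combine them into a single corpus afterwards.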
Step 3: Clean the tweets
We then need to clean the tweets to remove unnecessary items such as links and other characters (e.g. "@", "#") which won't help us in our analysis.
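A minimal cleaning function using only base R might look like the following. The function name clean_tweets is hypothetical, and dropping whole @mentions, digits and punctuation (rather than only the "@" and "#" symbols) is a choice made for this sketch.

```r
# Hypothetical helper: strip links, mentions and non-letter characters
# from a character vector of tweets.
clean_tweets <- function(x) {
  x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)  # remove links
  x <- gsub("@\\w+", " ", x, perl = TRUE)                # remove @mentions
  x <- gsub("#", "", x, fixed = TRUE)                    # keep hashtag text, drop "#"
  x <- gsub("[^[:alpha:][:space:]]", " ", x)             # remove digits and punctuation
  x <- tolower(x)
  x <- gsub("\\s+", " ", x, perl = TRUE)                 # collapse repeated whitespace
  trimws(x)
}

clean_tweets("Rand slumps 2% vs dollar @Fin24 #JSE https://t.co/abc123")
# returns "rand slumps vs dollar jse"
```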
Step 4: Visualize the tweets
After cleaning the tweets, we remove the stop-words (e.g. "to", "and", "go" etc.) and then visualize the remaining words. In this blog post, we visualize the tweets by using word clouds. The code below illustrates how the word cloud could be created in R using the wordcloud package.
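A sketch of this step using tm and wordcloud follows. The three example tweets are made-up stand-ins for the cleaned tweets of one news site, and the plotting parameters (min.freq, max.words, the colour palette) are illustrative choices.

```r
library(tm)
library(wordcloud)

# Made-up stand-in for the cleaned tweets of one news site.
cleaned <- c("rand weakens against the dollar",
             "jse closes higher as the rand firms",
             "kpmg to review its audit work")

# Build a corpus and remove English stop-words.
corpus <- VCorpus(VectorSource(cleaned))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Count how often each remaining word occurs across all tweets.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the word cloud; more frequent words are drawn larger.
set.seed(1)
wordcloud(names(freq), freq, min.freq = 1, max.words = 100,
          random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))
```

With real data, min.freq would normally be raised so that words appearing only once or twice do not clutter the plot.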
The word clouds for the three news sites are shown below:
The word clouds from the three financial news sites highlight more or less the same themes. In the next section, we will extract the key themes, using LDA, from the combined corpus from the three news sites to see which hidden topics are being implied by the tweets.
Analyzing the Corpus using a Topic Model
In our implementation, we assumed that the number of topics is fixed at 3. The results from fitting the LDA model, using the topicmodels package in R, are shown below:
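For readers who want to reproduce a fit of this kind, a minimal sketch with topicmodels is given here. The six example tweets are invented, the seed is arbitrary, and the default estimation method of LDA() is used because the post does not state which one was chosen, so the output will not match the topics reported in this post.

```r
library(tm)
library(topicmodels)

# Invented stand-in for the combined, cleaned corpus of tweets.
docs <- c("rand weakens against dollar as markets slide",
          "jse closes higher while global markets rally",
          "anc leadership race heats up before conference",
          "parliament debates motion against the president",
          "kpmg faces questions over state capture audit work",
          "state capture inquiry hears evidence on kpmg report")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm    <- DocumentTermMatrix(corpus)

# LDA() requires every document to contain at least one term.
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Fit LDA with the number of topics fixed at 3.
k   <- 3
fit <- LDA(dtm, k = k, control = list(seed = 2017))

terms(fit, 5)   # five most probable words for each topic
topics(fit)     # most likely topic for each tweet
```

The per-tweet topic proportions are available through posterior(fit)$topics, which is useful for checking how strongly each tweet is associated with each topic.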
The 3 latent themes/topics in the financial corpus appear to be the following:
- Topic 1: Local (South Africa) and International Financial markets
- Topic 2: South African Politics
- Topic 3: KPMG saga and State capture
The results produced by the model appear to make intuitive sense, but the model does have some disadvantages.
The LDA model that we implemented above assumes that there is no correlation between the topics. In addition, we fixed the number of topics beforehand; a better approach would have been to allow the number of topics to be generated from some distribution (e.g. Poisson). This, however, would also increase the dimensionality of the problem.
Another disadvantage of the above model is that it allows more than one topic to be present in each document/tweet. Given the limited number of characters in a tweet, it is very unlikely that a tweet contains more than one topic. A model that allows only one topic per tweet would likely perform better than the above approach.
This section contains the R code used in this post.