Stock market prices vary according to the court of public opinion. With this knowledge, my goal is to build a trading simulator that incorporates internet-generated sentiment to better forecast stock market returns, using a time-series model based on ARIMA and GARCH models.


Overall I found good results, and the model ran in about 29% of the time when the signals (Google Trends and news sentiment) were incorporated into the ARIMA-GARCH model, compared to without them.

To use the app, simply choose a company and the graphs will update. The top row shows the Google Trends index, which represents the popularity of searches for that company on Google; the Google News index, which uses text analysis to determine the sentiment of news headlines related to the company (positive values correspond to positive sentiment and vice versa); and the stock price. The bottom row shows the output of the ARIMA-GARCH model as a position: 1 represents holding the stock, 0 holding none, and -1 a short position. The total profit per share is also shown in comparison to a 'buy and hold' strategy. The model details are given below.

Model Details


Here I hypothesize that trends in stock market returns can be predicted by relative changes in how often terms related to the stock are searched on search engines, and by sentiment derived from news headlines related to publicly traded companies. I will use Google Trends, since it reports the number of searches for a term relative to the total search volume on Google, and text-mine Google News headlines for sentiment, as my "signals" to indicate buy-sell positions.


I test the hypothesis on Apple stock, since it is a very popular and frequently searched term on Google and news stories are plentiful. I will quantify these inputs and use them as signals to increase the performance of my model. The time series model I will use is an autoregressive integrated moving average (ARIMA) model, which takes \(x\) days of time series data and uses it to forecast a given number of days ahead.

To get a feel for what we are trying to predict, we can plot the adjusted stock price of Apple as a function of time.

We can also plot the log daily return, given as \(r_i = \log(P_i) - \log(P_{i-1})\), where \(P_{i}\) is the price on a given day, \(i\).
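As a minimal sketch of the log-return definition above, the following R snippet computes it with `diff(log(...))`. The price vector is made up for illustration and is not real AAPL data.

```r
# Hypothetical adjusted closing prices (illustrative values, not real AAPL data)
prices <- c(100, 102, 101, 105, 104)

# Log daily return: log(P_i) - log(P_{i-1}) for each consecutive pair of days
log_returns <- diff(log(prices))

# Log returns are additive: their sum is the log of the total price ratio
total_log_return <- sum(log_returns)

print(log_returns)
print(all.equal(total_log_return, log(104 / 100)))
```

One convenient property of log returns is that they add up over time, which makes cumulative performance easy to compute later on.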

We can also plot the autocorrelation function to see if there is any correlation between the daily returns.

In general there does not seem to be much correlation between daily returns, as there are few spikes beyond the 0.05 significance line. This indicates that the daily returns are akin to random fluctuations. For comparison, we can plot the autocorrelation for uniform random numbers, which should show no correlation.

We can see that the random numbers show larger spikes than the daily returns. This tells us that it should be quite difficult to forecast the stock market return by looking at previous returns alone. There is a negative spike of around 0.05 at a lag of 2 days and a positive spike above 0.05 at a lag of 4 days, telling us that there is a marginally significant negative correlation after 2 days and a significant positive correlation after 4 days in the stock market return.
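The comparison above can be sketched in R with `acf()` from the base `stats` package. This is an illustration on simulated series, not the data used in this post: `returns_sim` stands in for the daily log returns, and the bound used is the approximate 95% band that `acf()` draws on its plots.

```r
set.seed(42)                         # reproducible simulated data
n <- 500
returns_sim <- rnorm(n, sd = 0.01)   # stand-in for daily log returns (assumption)
uniform_sim <- runif(n)              # uniform random numbers for comparison

# Sample autocorrelations up to lag 10, without drawing the plots
acf_returns <- acf(returns_sim, lag.max = 10, plot = FALSE)
acf_uniform <- acf(uniform_sim, lag.max = 10, plot = FALSE)

# Approximate 95% bound for the autocorrelation of an uncorrelated series
bound <- qnorm(0.975) / sqrt(n)

# Lags (lag 0 excluded) whose autocorrelation falls outside the bound
lags_returns <- which(abs(acf_returns$acf[-1]) > bound)
lags_uniform <- which(abs(acf_uniform$acf[-1]) > bound)

print(bound)
print(lags_returns)
print(lags_uniform)
```

Even for purely random data, roughly one lag in twenty is expected to cross the 95% bound by chance, so an isolated spike is weak evidence of real structure.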

The trend is by no means clear though, which may lend credibility to the idea of looking at Google Trends and/or sentiment derived from Google News headlines as a more reliable indicator of stock market returns.

Using Google Trends and News Data

Google Trends

Google Trends quantifies the search popularity of a given set of terms in Google's search engine and outputs one value per week, normalized between 0 and 100. This will act as our signal coming from the Google Trends input. For example, the Google Trends output for the keyword 'Apple' in the United States is shown below.

We can see that there is a clear time-dependent trend in the data, with some seasonality: spikes occur roughly twice per year. But the important question is whether it is a good predictor of stock market returns. To assess this, we can run a t-test between the returns and the Google Trends index.

Welch Two Sample t-test

data:  tot_df$AAPL.Adjusted and tot_df$index
t = -93.263, df = 737, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -36.68956 -35.17677
sample estimates:
    mean of x     mean of y
-5.959825e-05  3.593311e+01

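The p-value is below machine precision (< 2.2e-16), indicating that the result is statistically very significant. Strictly, a Welch t-test compares the means of the two series rather than measuring how well one predicts the other, but I take this as support for using the Google Trends index as a signal for my model.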
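For readers who want to reproduce this kind of output, here is a small sketch on synthetic data. The variable names and values are hypothetical. It runs the same Welch test via `t.test()` and, as a complementary check that is not part of the analysis above, a Pearson correlation test via `cor.test()`, which asks whether the two series move together.

```r
set.seed(1)
n <- 200
trend_index <- runif(n, min = 0, max = 100)   # hypothetical 0-100 trend index
returns_sim <- rnorm(n, mean = 0, sd = 0.02)  # hypothetical daily log returns

# Welch two-sample t-test (the default for t.test): compares the two means
welch <- t.test(returns_sim, trend_index)

# Pearson correlation test: checks for a linear association between the series
pearson <- cor.test(returns_sim, trend_index)

print(welch$method)
print(welch$p.value)     # very small, since the two series live on different scales
print(pearson$estimate)  # sample correlation coefficient
print(pearson$p.value)
```

In this synthetic example the two series are generated independently, yet the Welch p-value is still tiny because their means differ. The correlation test is the one that speaks to co-movement.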

Sentiment Analysis of Google News

While Google Trends may be able to tell us when people are interested in a company, it does not tell us people's sentiment about the company. For example, people may be entering "How do I sell all of my Apple stock" or "How do I buy buckets and buckets of Apple stock". Google Trends may see these as equivalent, but obviously our buy-sell position may differ depending on the sentiment. To overcome this I will use the Google News API to gather news from various sources such as Yahoo Finance, Reuters, and Seeking Alpha, among others, and perform a sentiment analysis on the news titles that acts as an indicator of the stock market returns and a signal for my model.

The sentiment analysis iterates over the news titles and returns sentiment values corresponding to the date the news article was published. The method uses a sentiment dictionary developed in the Nebraska Literary Lab to determine the specific value corresponding to each word in the title. We can look at the sentiment analysis for Apple:

An example of some of the sentiment scores is shown below:

Date | Title | Sentiment
2016-07-27 22:30:00 | Cramer solves the great Apple mystery: How it faked out everyone on Wall Street | 1.40
2016-07-27 21:24:49 | Apple Inc. (AAPL) Pops 6.5% for July 27 | 1.15
2016-07-27 15:56:15 | What We Got Wrong About Apple | 0.40

We can see that the sentiment seems plausible.
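To illustrate the idea of dictionary-based scoring, here is a toy sketch in base R. The lexicon, its values, and the example titles are all invented for illustration. They are not the Nebraska Literary Lab dictionary and will not reproduce the scores in the table above.

```r
# Hypothetical mini-lexicon: word -> sentiment value (invented values)
lexicon <- c(great = 0.5, surge = 0.75, warn = -0.5, weak = -0.5)

score_title <- function(title, lexicon) {
  # Lower-case the title and split it into words on any non-letter characters
  words <- strsplit(tolower(title), "[^a-z]+")[[1]]
  # Sum the values of words found in the lexicon; unknown words contribute 0
  sum(lexicon[words], na.rm = TRUE)
}

# Hypothetical headlines
titles <- c("Great quarter as shares surge",
            "Analysts warn of weak demand")

scores <- vapply(titles, score_title, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(data.frame(title = titles, sentiment = scores))
```

A real lexicon covers thousands of words, and matching exact word forms as done here would miss variants such as "surges" or "warned".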

To see if it is a good predictor of the stock market returns, we can perform a t-test between the sentiment and the stock market returns.

Welch Two Sample t-test

data:  tot_df$AAPL.Adjusted and tot_df$score
t = -3.4467, df = 740.27, p-value = 0.0005996
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06442960 -0.01766792
sample estimates:
    mean of x     mean of y
-5.959825e-05  4.098916e-02

Again, the p-value is less than 0.05, though the result is not as strong as for the Google Trends index. We can still say that the result is statistically significant, and the sentiment score will complement the Google Trends index as a signal for the model.

Time-Series Model

We will attempt to predict the stock market return using the ARIMA model, in which the return is a weighted linear sum of the last \(n\) daily returns. The ARIMA model has \(p\) auto-regressive terms, \(d\) differencing operations, and \(q\) moving average terms. To select the best parameters we will run through all combinations in the parameter space and pick the best one.
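A grid search of this kind can be sketched with `arima()` from the base `stats` package. The selection criterion is not stated above, so this sketch assumes the lowest AIC wins. The search range (0 to 3 for both \(p\) and \(q\)), the fixed \(d = 0\), and the simulated input series are likewise assumptions made for illustration.

```r
set.seed(123)
# Simulated AR(1) series as a stand-in for a window of daily log returns
x <- as.numeric(arima.sim(model = list(ar = 0.5), n = 300))

best <- list(aic = Inf, order = NULL, fit = NULL)
for (p in 0:3) {
  for (q in 0:3) {
    # d = 0 here: log returns are already differences of log prices
    fit <- tryCatch(arima(x, order = c(p, 0, q)),
                    error = function(e) NULL)  # skip fits that fail
    if (!is.null(fit) && fit$aic < best$aic) {
      best <- list(aic = fit$aic, order = c(p, 0, q), fit = fit)
    }
  }
}

# One-step-ahead forecast from the selected model
next_return <- predict(best$fit, n.ahead = 1)$pred[1]

print(best$order)
print(next_return)
```

Wrapping each fit in `tryCatch` keeps the search going when a particular order fails to converge, which is common for higher-order models on short windows.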

We combine the ARIMA model with the generalized autoregressive conditional heteroskedasticity (GARCH) model, which models volatility and how it changes over time. Typically the volatility varies less over time than the daily return, so we use larger windows to predict the volatility, such as 100 days.
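To make the volatility model concrete, the sketch below writes out the conditional variance recursion of a GARCH(1,1) model, \(\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2\), in base R. The GARCH order is not stated above, so GARCH(1,1) is an assumption here. The parameter values and residuals are invented, and in practice the parameters would be estimated from the data rather than fixed by hand.

```r
garch11_variance <- function(eps, omega, alpha, beta) {
  stopifnot(alpha + beta < 1)  # needed for a finite unconditional variance
  n <- length(eps)
  sigma2 <- numeric(n)
  # Initialise with the unconditional variance omega / (1 - alpha - beta)
  sigma2[1] <- omega / (1 - alpha - beta)
  for (t in seq_len(n)[-1]) {
    # Today's variance depends on yesterday's squared shock and variance
    sigma2[t] <- omega + alpha * eps[t - 1]^2 + beta * sigma2[t - 1]
  }
  sigma2
}

set.seed(7)
eps <- rnorm(100, sd = 0.01)  # hypothetical ARIMA residuals over a 100-day window
sigma2 <- garch11_variance(eps, omega = 1e-6, alpha = 0.1, beta = 0.85)

print(sqrt(tail(sigma2, 1)))  # latest conditional volatility estimate
```

The large `beta` in this example is what makes volatility persistent: a big shock raises the variance for many days afterwards instead of just one.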

We run the model, and when it predicts that the return will be positive it outputs a \(1\), which represents a long position, or buy. If the predicted return is negative it outputs a \(-1\), representing a short position, or sell.

Finally the equity curve is produced, which displays the relative change in value of the asset over time. If the equity curve remained at zero there would be no change; if it climbed to 1, the value of the asset would have doubled. For comparison, a 'buy & hold' strategy is plotted alongside the traditional ARIMA-GARCH model with no signals.
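One way such an equity curve could be computed is sketched below. The returns and positions are invented, and multiplying a log return by a position of -1 is used as a simple approximation for the payoff of a short position. The curve is expressed as a relative change, so 0 means no change and 1 means the value has doubled.

```r
# Hypothetical daily log returns and the position held on each of those days
log_returns <- c(0.01, -0.02, 0.015, -0.005, 0.02)
positions   <- c(1, -1, 1, -1, 1)   # 1 = long, -1 = short

# Strategy return per day: the sign of the position flips the return when short
strategy_returns <- positions * log_returns

# Cumulative relative change in value for the strategy and for buy and hold
equity_strategy <- exp(cumsum(strategy_returns)) - 1
equity_buy_hold <- exp(cumsum(log_returns)) - 1

print(round(equity_strategy, 4))
print(round(equity_buy_hold, 4))
```

In this toy example every position matches the sign of the return, so the strategy curve ends above buy and hold. Real forecasts will of course be wrong some of the time.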

We can see that the ARIMA-GARCH model without signals does not perform very well; this may be because there is not much autocorrelation in the daily returns of Apple stock. This kind of model typically performs well when there is serial correlation, and for assets whose price and volatility do not fluctuate much over time, such as the S&P 500. When the signals are included, the model outperformed the buy and hold strategy, but only slightly. A possible reason is that the fluctuations in volatility and price are compensated by the information coming from Google News and Google Trends.

Furthermore, the inclusion of signals drastically decreased the run-time. Without the signals the model took 4838 seconds to run (~1 hour 20 minutes), whereas with the signals it took just 23 minutes. This may be because the threshold for deciding whether to buy or sell is reached faster with the signals than without.

We now have an ARIMA-GARCH model that works well with the generated signals and gives us reliable returns. To make this a complete, stand-alone data product that users can interact with, we can:

  • Make the model more flexible by adjusting the tolerance associated with the buy/sell action - this will appeal to a larger range of users, giving more value to the product.

  • Have this project run independently, by running the file as a scheduled task - I don't have the time, or the desire, to manually update the model every day.

  • Have the file send an email of the results and the associated action (buy, hold or sell the asset).

Adjusting the risk tolerance

The ARIMA-GARCH model predicts the expected return of the next day given a certain number of previous days. To remind ourselves: if the expected return is positive the price of the asset is expected to increase, and if the expected return is negative the price is expected to decrease.

In my model, a positive expected return triggered a buy/long condition and a negative one a sell/short condition; there were no hold conditions. I will update this by adding a threshold, such that we buy only if the expected return is greater than a given value and sell only if it is below a certain (negative) value. This can remove some of the uncertainty from the model, but may miss out on some profits. Overall it should add a degree of conservatism to the model.
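The thresholded rule can be sketched as a small R function. The function name `decide_action` and the forecast values are hypothetical, and the rule assumes a symmetric threshold: buy above `threshold`, sell below `-threshold`, hold otherwise.

```r
decide_action <- function(expected_return, threshold = 0) {
  # Buy above the threshold, sell below its negative, otherwise hold
  ifelse(expected_return > threshold, "buy",
         ifelse(expected_return < -threshold, "sell", "hold"))
}

# Hypothetical one-day-ahead expected returns (as fractions, 0.001 = 0.1%)
forecasts <- c(0.004, 0.0005, -0.0003, -0.002)

no_threshold   <- decide_action(forecasts)                     # original behaviour
with_threshold <- decide_action(forecasts, threshold = 0.001)  # 0.1% daily return

print(no_threshold)
print(with_threshold)
```

With the threshold set to zero the rule reduces to the original buy/sell behaviour, except that a forecast of exactly zero maps to a hold.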

If we plot the marginal change in stock price we can see how the two versions of the model compare. One has no threshold, equivalent to the model developed last week, and the other has a threshold of 0.1% daily return. We again compare to a buy and hold strategy.

With a threshold of 0.1% on the daily expected return, the performance of the two forms of the ARIMA-GARCH model differs. Both do worse at the beginning of May; after that, however, the less conservative model outperforms the conservative one. Where the expected returns predicted by the model are smaller than 0.1%, a hold action is executed, which misses out on some of the profits gained by the less conservative model.

Overall, the less conservative model outperformed the model with the 0.1% threshold on the expected return. In the app I use a slider input to vary the user's risk tolerance. However, since users are likely to be inexperienced in choosing a risk tolerance, this could be handled with a short questionnaire that might include the age of the user, how much the user is looking to invest, the annual income of the user, etc.

Emailing Results and Scheduling the Model

Emailing through R

To set up the emailing, I created Gmail API credentials which, when connected to the R package gmailr, can send emails, and set the R script to read the API client ID and secret. The complete setup process can be found here.

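library(gmailr)  # provides mime(); newer gmailr releases name it gm_mime()

# `ind` (the model's prediction) and `bsh` (the buy/sell threshold, in percent)
# are assumed to be defined earlier in the script.
complete_email <- mime(
  To = "moocarme@gmail.com",
  From = "moocarme@gmail.com",
  Subject = paste(toString(Sys.Date()),
                  "Stock prediction finished"),
  # Build the body from today's date, the prediction and the suggested actions
  body = paste0('The trading prediction has finished for ', toString(Sys.Date()),
                '. \n', 'The prediction for AAPL was ', toString(ind[1]),
                '. \n', 'You should ', toString(ifelse(ind[1] > 0, 'buy', 'sell')),
                '. \n',
                'A more conservative model at a daily buy/sell rate of ',
                toString(bsh), ' percent would suggest you ',
                # Conservative rule: buy above bsh, sell below -bsh, else hold
                ifelse(ind[1] > bsh, 'buy.',
                       ifelse(ind[1] < (-bsh), 'sell.', 'hold.'))))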

A screenshot of the email output from the above code is shown below. The dates, daily returns and actions are generalized so the same email body template can be used every day.