Executive Summary

Cryptocurrencies, Bitcoin in particular, have sparked the rise of decentralized payment systems built on blockchain technology. Today, large e-commerce companies already accept Bitcoin as a mode of payment for some of their products. While some traders and investors see it as the currency of the future, others remain skeptical yet intrigued, primarily because of its extremely volatile behavior. Can we then use Big Data and machine learning to forecast Bitcoin prices effectively? In this project, we address this question by estimating the Bitcoin price from daily global news and events in GDELT together with Bitcoin trading information. A total of 2.5TB of raw GDELT data was collected, which was reduced to 148GB after filtering. This data was then preprocessed to extract the average daily tone components of global events, and these features were included in model training. Forecasts were made 1 day, 3 days, and 7 days ahead of the present information. The team found that the Gradient Boosting Regressor was the best model for the 1-day forecast, with an RMSE of 251.63, outperforming the ARIMA baseline, whose reported MAE was 272.60. XGBoost Regressor, on the other hand, was the best predictor for the 3-day and 7-day forecasts. Our models became less accurate as the predictions reached further into the future, which may be attributed to the diminishing significance of a news article as time passes. Lastly, the news structure components (e.g., activity and self/group reference density) were found to be the most important sentiment features in our news data.
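The multi-horizon setup described above can be illustrated with a minimal sketch. It is not the team's pipeline: the function names, the toy tone values, and the toy prices are all hypothetical, and a naive "price stays the same" forecast stands in for the actual models. It only shows how each day's features can be paired with the price 1, 3, or 7 days later, and how RMSE and MAE are computed as two distinct error metrics.

```python
import math


def make_horizon_pairs(features, prices, horizon):
    """Pair each day's feature row with the price `horizon` days later."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1 day")
    if len(features) != len(prices):
        raise ValueError("features and prices must cover the same days")
    # The last `horizon` days have no known future price, so they are dropped.
    usable = len(prices) - horizon
    return [(features[t], prices[t + horizon]) for t in range(usable)]


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error; weighs all misses linearly."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Toy data (made up for illustration): one tone value and one closing
# price per day, in chronological order.
daily_tone = [[-1.2], [0.4], [0.9], [-0.3], [1.1],
              [0.2], [-0.8], [0.5], [0.0], [0.7]]
close_price = [100.0, 102.0, 101.0, 105.0, 107.0,
               106.0, 110.0, 108.0, 111.0, 115.0]

for horizon in (1, 3, 7):
    pairs = make_horizon_pairs(daily_tone, close_price, horizon)
    actual = [target for _, target in pairs]
    # Stand-in forecast (an assumption, not a model from the report):
    # predict that the price `horizon` days ahead equals today's price.
    predicted = close_price[:len(pairs)]
    print(horizon, len(pairs),
          round(rmse(actual, predicted), 2),
          round(mae(actual, predicted), 2))
```

In a real experiment, the stand-in forecast would be replaced by a fitted regressor, and the same metric would be reported for both the model and the baseline so that the two numbers are directly comparable.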

We believe that this project could serve as a guide for both new and experienced crypto traders. However, profitability could still be improved with further development. We encourage future researchers to expand the training data in order to capture more Bitcoin price movements, and to incorporate neural networks to further improve model performance. Lastly, we recommend considering more journalistic parameters from the news data. This could unlock more insights and may lead to more conclusive findings on the factors that correlate with the volatile movements of Bitcoin prices.