
Natural Language Processing for Financial Market Modelling and Forecasting


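Basic information

Grant number:
2094258
Principal investigator:
Amount:
--
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2018
Funding country:
United Kingdom
Duration:
2018 to (no data)
Project status:
Closed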

Source:
https://gtr.ukri.org/projects?ref=studentship-2094258
Keywords:
Natural Language Processing Financial Market

Project abstract

1. Augmenting topic-sentiment models for financial forecasting

In the recent past, language analysis in finance has been approached from several research directions. One important dimension of language, sentiment, as analysed by Antweiler and Frank (2004), Hu and Liu (2004), Bollen (2011), Si et al. (2013) and Levenberg et al. (2014), concerns the mood and emotion conveyed in text sources such as online stock message boards, social media posts or financial news. The other major linguistic dimension, the conveyed story or narrative, can for example be approximated by estimating probabilistic topic models such as Latent Dirichlet Allocation (LDA), introduced by Blei et al. (2003). However, focusing on only one of these dimensions can leave relevant linguistic information unused. Recently, more holistic approaches have attempted to model the full dimensionality of language for financial forecasting by combining sentiment and topic modelling (Nguyen and Shirai, 2015). While the authors report improved forecast performance of topic-sentiment models on financial-market indicators, I believe natural language processing for financial forecasting can be further adjusted to better match the actual time-series characteristics of financial and economic data. For instance, Latent Dirichlet Allocation assumes both that topics do not change over time and that topics are uncorrelated. These assumptions might turn out to be too strong when analysing textual time-series data in finance. I would be interested in extending such sentiment-topic models with features that allow for topic correlation (Blei and Lafferty, 2006a) or topic evolution (Blei and Lafferty, 2006b). Another potential limitation of the model in Nguyen and Shirai (2015) is its assumption of an exogenously determined number of topics. Teh et al. (2005) developed a hierarchical Dirichlet process, which endogenises this parameter. It would be interesting to test whether such specifications, alone or in combination, yield better financial time-series forecasting performance.

2. Application of topic-sentiment analysis to forecast monetary policy decisions

In financial and economic theory, market fluctuations are often explained by exogenous shocks to the economy or the financial system. One class of such shocks, monetary policy shocks, consists of central bank decisions to change the target interest rate that cannot be explained by contemporaneous and forecast values of the macroeconomic variables relevant for monetary policy decision making. I follow the methodology put forward by Romer and Romer (2004) to estimate such monetary policy shocks. That is, I first regress monetary policy decisions on contemporaneous and forecast macroeconomic data on inflation, real GDP growth and unemployment. The residuals of this regression represent movements in monetary policy that cannot be explained by the quantitative economic data underlying conventional monetary policy. I then assess whether narrative effects have explanatory power for these monetary policy shocks (the regression residuals). I use topic-sentiment models (as described earlier in my proposal) to identify whether changes in the narrative of a) central bank internal reports and b) newspaper articles on political, business, financial and economic events carry predictive power for these monetary policy shocks. Focusing on the period 2000-2011, I programme machine learning procedures in Python to estimate probabilistic topic models and topic-level sentiment scores over a dataset of more than 500,000 articles from leading US newspapers and more than 200 central bank reports. The central bank internal reports are prepared for each regularly held FOMC meeting of the US Federal Reserve board members. In each of these FOMC meetings, the members decide on the target interest rate.
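
The two-step procedure described in section 2 can be illustrated with a minimal sketch. It is a simplified toy version under stated assumptions, not the exact specification of Romer and Romer (2004) or of this project: the data are synthetic, and the names `ols_residuals`, `solve_linear_system` and `narrative_score` are hypothetical. Step 1 regresses rate changes on macroeconomic forecasts and keeps the residuals as "shocks"; step 2 regresses those shocks on a stand-in topic-sentiment score. Only the Python standard library is used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations act on both sides.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("regressors are (numerically) collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def ols_residuals(y, regressors):
    """Fit OLS with an intercept; return (coefficients, residuals)."""
    design = [[1.0] + list(row) for row in regressors]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return beta, [yi - fi for yi, fi in zip(y, fitted)]


# Synthetic data, one row per policy meeting (values are made up):
# (inflation forecast, real GDP growth forecast, unemployment forecast).
macro = [
    (2.1, 3.0, 4.5), (2.4, 2.8, 4.4), (2.9, 2.0, 4.7), (3.1, 1.5, 5.0),
    (2.0, 0.5, 6.0), (1.5, -0.5, 7.5), (1.2, 1.0, 9.0), (1.8, 2.5, 8.5),
]
rate_change = [0.25, 0.25, 0.0, -0.25, -0.5, -0.75, 0.0, 0.0]
# Hypothetical topic-sentiment score per meeting (assumption, not real data).
narrative_score = [0.3, 0.1, -0.2, -0.4, -0.6, -0.1, 0.2, 0.4]

# Step 1: residuals of the policy regression act as the shock series.
_, shocks = ols_residuals(rate_change, macro)

# Step 2: does the narrative score explain the shocks?
narrative_beta, _ = ols_residuals(shocks, [[s] for s in narrative_score])

print("shocks:", [round(s, 3) for s in shocks])
print("slope on narrative score:", round(narrative_beta[1], 3))
```

With real data, a statistics library would be the more natural choice for the regressions, since it also provides standard errors; the hand-written solver here only keeps the sketch dependency-free.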
