
RL in Portfolio Optimization

December 8th, 2022 by Hyosang Ahn, Yongling Li and Ziru Yan

  • Motivation
  • Background Information
  • Dataset
  • Project Timeline
  • Environment and Agent
  • Methodology
    • DQN
    • DDPG
  • Current Problems, Potential Solutions, and Future Plan
  • Conclusion
  • Video
  • References
  • Code

Motivation

Portfolio management (PM) is a fundamental financial planning task that aims to achieve investment goals such as maximal profits or minimal risks. Its decision process involves continuous derivation of valuable information from sequential data, which is a prospective research direction for reinforcement learning (RL).

In this project, we aim to demonstrate the potential of reinforcement learning in this area and, ideally, to show that our model can construct a portfolio that significantly outperforms a major market index.

Background Information

Reducing risk is a main objective of most investors. By combining portfolio management with reinforcement learning, we use the financial planning task as the environment and let the agent gradually improve its trading strategy by trial and error [1].

Because financial data is highly time-dependent, it is a natural fit for reinforcement learning, which is modeled as a Markov decision process (MDP), a discrete-time stochastic control process [2]. An MDP provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.

Reinforcement learning also maps naturally onto portfolio management through a discrete action space: at each step the agent makes one of three decisions per asset, namely buy, sell, or hold.

Dataset

  • Stocks in the S&P 500 Index: The Standard and Poor's 500, or simply the S&P 500, is a stock market index tracking the stock performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices.

  • Daily adjusted and intraday stock time series: time series of the specified equities, generated from an online API. The intraday data is derived from the Securities Information Processor (SIP) market-aggregated data.

Project Timeline

  1. Weeks 3-5: collect research material and datasets; understand traditional methods for portfolio management
  2. Weeks 5-7: build the environment and implement a naive policy to test it
  3. Weeks 7-10: implement multiple methods aiming to achieve superhuman performance
  4. Weeks 10-11: present results and summarize project progress

Environment and Agent

We chose to build our own environment and designed it to be scalable in both time span and portfolio size. This allows us to run tests quickly and add new stock time series to our portfolio. We decided not to use mock data; instead, we focused on large-cap stocks to reduce volatility risk and generated our time series from the S&P 500 companies with the largest market capitalization. This choice addresses environment uncertainty: the non-stationary nature of financial markets induces uncertainty and often causes a distribution shift between training and testing data.
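A minimal sketch of such an environment, assuming a NumPy price matrix; the class name `PortfolioEnv` and parameters such as `window` and `cash` are hypothetical placeholders rather than the exact interface in our repository:

```python
import numpy as np

class PortfolioEnv:
    """Sketch of a scalable portfolio environment (hypothetical names).

    prices: (T, n_assets) array of closing prices; the time span T and the
    number of assets control the size of the environment.
    """

    def __init__(self, prices, window=30, cash=1.0):
        self.prices = np.asarray(prices, dtype=float)
        self.window = window
        self.initial_cash = cash

    def reset(self):
        self.t = self.window
        self.cash = self.initial_cash
        self.shares = np.zeros(self.prices.shape[1])
        return self._observation()

    def _observation(self):
        # The agent sees a rolling window of past prices plus its current holdings.
        return np.concatenate([self.prices[self.t - self.window:self.t].ravel(),
                               self.shares, [self.cash]])

    def step(self, action):
        # action[i] > 0 buys one unit of asset i, < 0 sells one (if held), 0 holds.
        price = self.prices[self.t]
        buy = (np.asarray(action) > 0).astype(float)
        sell = ((np.asarray(action) < 0) & (self.shares > 0)).astype(float)
        self.shares += buy - sell
        self.cash += float(price @ sell) - float(price @ buy)  # transaction costs omitted
        value_before = self.cash + float(price @ self.shares)
        self.t += 1
        value_after = self.cash + float(self.prices[self.t] @ self.shares)
        reward = value_after - value_before                    # mark-to-market P&L
        done = self.t >= len(self.prices) - 1
        return self._observation(), reward, done
```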

The agent has access to all historical prices. It first observes the environment and then interacts with the market, rebalancing the portfolio by choosing among buy, sell, and hold. An agent that accurately predicts asset price changes would be ideal, but such an agent is hard to train in practice because of market uncertainty and the possible distribution shift between the training (past market) and testing (future market) environments.
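A naive policy rollout against the sketch above, useful as a smoke test of the environment (again hypothetical; the buy-once-then-hold rule is purely illustrative):

```python
import numpy as np

# Hypothetical smoke test: buy every asset on the first step, then hold.
prices = 100.0 + np.cumsum(np.random.randn(1000, 3), axis=0)  # fake price paths
env = PortfolioEnv(prices, window=30, cash=500.0)

obs, done, total_reward, first = env.reset(), False, 0.0, True
while not done:
    action = np.ones(3) if first else np.zeros(3)  # buy once, then hold
    first = False
    obs, reward, done = env.step(action)
    total_reward += reward
print("naive buy-and-hold P&L:", total_reward)
```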

Methodology

DQN

The first method we implemented is a network based on Deep Q-Learning (DQN for short). Developed by DeepMind in 2015 for Atari games [4], DQN builds on the classic Q-Learning algorithm, which applies the Bellman optimality equation as an iterative update:

$Q_{i+1}(s,a)=\mathbb{E}_{s'}\left[\,r+\gamma \max_{a'}Q_i(s',a')\mid s,a\right]$

where the optimal strategy selects the action $a^*$ that maximizes the expected value of $r+\gamma \max_{a'} Q_i(s',a')$. The algorithm uses experience replay for training stability: at each time step, the transition $(s,a,r,s')$ is recorded in a circular buffer called the replay buffer. Minibatches sampled from this buffer are later used to compute the loss and gradient during training, which improves data efficiency and stability. The Deep Q-Learning procedure from [3][4] is used to train the DQN agent.
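A minimal sketch of that training step, assuming PyTorch, a discrete buy/sell/hold action per step, and illustrative network sizes and hyperparameters (none of these values are taken from our exact configuration):

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

n_actions, obs_dim, gamma = 3, 94, 0.99  # buy / sell / hold; obs_dim matches the env sketch above
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)  # circular replay buffer of (s, a, r, s', done)

def train_step(batch_size=128):
    """One gradient step on a minibatch sampled from the replay buffer."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s = torch.tensor(np.stack([b[0] for b in batch]), dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.tensor(np.stack([b[3] for b in batch]), dtype=torch.float32)
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Bellman target: r + gamma * max_a' Q_target(s', a') for non-terminal s'.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

During interaction, the agent would act epsilon-greedily on `q_net`, append each transition to `replay`, and periodically copy `q_net`'s weights into `target_net`.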

After training the agent for 1000 episodes on the most recent 1000 daily closing prices of Apple stock, the resulting portfolio value ratio exceeds the market value. The agent makes only a few selling actions, as shown in its best-performance result below, where it sells the stock during the first third of the period.

The portfolio and market values in this result do not differ greatly, which makes the result easy to reproduce. However, the model's selling behavior is inconsistent across runs, sometimes including no selling at all, which shows that the model is not robust.

DDPG

DDPG improves on the aforementioned DQN by introducing a continuous action space. It also adopts the actor-critic architecture: the actor is a policy network that takes the state as input and outputs a continuous action, while the critic is a Q-value network that takes both the state and the action and outputs the corresponding Q-value. The DDPG training algorithm follows [6].
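A minimal sketch of that actor-critic update, again assuming PyTorch; the softmax actor head (producing portfolio weights), network sizes, and hyperparameters are our own illustrative assumptions rather than the exact configuration from [6] or our code:

```python
import copy

import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 75, 15, 0.98, 0.001  # illustrative sizes

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Softmax(dim=-1))  # outputs portfolio weights
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s2, done):
    """One gradient step on a sampled minibatch (tensors of shape [B, ...])."""
    # Critic: regress Q(s, a) toward the one-step bootstrapped target.
    with torch.no_grad():
        q_next = critic_target(torch.cat([s2, actor_target(s2)], dim=1)).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta_t <- tau*theta + (1 - tau)*theta_t.
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```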

Dataset

Fifteen randomly selected stocks from the S&P 500 Index. The training data consists of price-related data from 2017, and the test data from 2018. Samples are taken at one-minute intervals, with each minute's high, low, open, and close prices and volume included as features.

Training and Evaluation

  1. We used 2017 as the training set and performed a rolling test over 2018: each month is tested individually, then folded into the training set after testing, and the retrained model is used for the following month's test (a sketch of this procedure appears after the results below).
  2. Evaluation is simply the final value of the total portfolio compared to the market (buy-and-hold) value of the same portfolio.
  3. Best performance result:

The best performance achieved a 15% return, compared to an average buy-and-hold return of 9%.
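A sketch of the rolling procedure described above; `train` and `evaluate` are hypothetical placeholders for the actual fitting and backtesting routines, and the data is assumed to be a pandas DataFrame indexed by timestamp:

```python
import pandas as pd

def rolling_test(data, train, evaluate):
    """Walk-forward evaluation: test each 2018 month, then fold it into training."""
    train_set = data.loc["2017"]
    results = {}
    for month in pd.period_range("2018-01", "2018-12", freq="M"):
        test_set = data.loc[str(month)]
        model = train(train_set)                      # retrain on everything seen so far
        results[str(month)] = evaluate(model, test_set)
        train_set = pd.concat([train_set, test_set])  # fold the tested month back in
    return results
```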

Hyperparameters

Due to limited computational resources, we performed only a simple hyperparameter search: each hyperparameter was varied one at a time under a constant random seed, with all other hyperparameters held fixed (a sketch of this sweep follows the tables below).

  • Total Episodes

| Episode Number | Final Value |
| --- | ---: |
| 1000 | 0.98 |
| 1500 | 1.01 |
| 2000 | 1.14 |

  • Batch Size

| Batch Size | Final Value |
| --- | ---: |
| 64 | 0.90 |
| 128 | 1.14 |
| 256 | 0.87 |

  • Tau

| Tau | Final Value |
| --- | ---: |
| 0.005 | 1.07 |
| 0.001 | 1.14 |
| 0.0002 | 0.97 |

  • Gamma

| Gamma | Final Value |
| --- | ---: |
| 0.99 | 1.146 |
| 0.98 | 1.158 |
| 0.96 | 1.151 |

  • Learning Rate

| Learning Rate | Final Value |
| --- | ---: |
| 0.01 | 0.98 |
| 0.001 | 1.158 |
| 0.0001 | 0.79 |
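A sketch of the one-at-a-time sweep described above; `train_and_evaluate` is a hypothetical placeholder that trains DDPG with the given configuration and returns the final portfolio value:

```python
import torch

BASELINE = {"episodes": 2000, "batch_size": 128, "tau": 0.001,
            "gamma": 0.98, "lr": 0.001}
GRID = {"episodes": [1000, 1500, 2000], "batch_size": [64, 128, 256],
        "tau": [0.005, 0.001, 0.0002], "gamma": [0.99, 0.98, 0.96],
        "lr": [0.01, 0.001, 0.0001]}

def sweep(train_and_evaluate, seed=0):
    """Vary one hyperparameter at a time, holding the rest at the baseline."""
    results = {}
    for name, values in GRID.items():
        for value in values:
            torch.manual_seed(seed)  # constant random seed for every run
            config = {**BASELINE, name: value}
            results[(name, value)] = train_and_evaluate(**config)
    return results
```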

Analysis and Current Problems

  • Time/Data tradeoff
    • There is a consensus in financial markets that the shorter the sampling interval, the weaker the trend in the price series, because short-horizon price movements are close to random under the market's micro-level Nash equilibrium. Under this assumption, larger time intervals, such as one day, are preferred for producing a visible trend, rather than short intervals such as one minute. However, the modern stock market does not have a long history: with daily intervals, the total dataset is only around 10,000 samples, which is extremely limited for any deep learning or machine learning technique. This tradeoff between time span and data scale is discussed further in the next section.
  • Overly Concentrated Weight Distribution
    • Across the entire test set, the single stock that the model considers the best performer takes more than 95% of the portfolio weight. This does not change over the course of hyperparameter optimization and stems from the objective of maximizing total revenue. The model performs well with this strategy, but it does not match the goal of portfolio optimization. Potential remedies are discussed in the next section.
  • Baseline
    • Because the model's weights are so concentrated, we found that a baseline model would not meaningfully aid our development.

Current Problems, Potential Solutions, and Future Plan

For the time/data tradeoff, one potential solution is to add long-term features, such as the moving average of the past N days, to the feature set. However, this increases training complexity and computational cost.
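For example, such moving-average features could be appended along the following lines, assuming the raw data sits in a pandas DataFrame with a `close` column (the window lengths are illustrative):

```python
import pandas as pd

def add_moving_average_features(df, windows=(5, 20, 60)):
    """Append past-N-day moving averages of the closing price as extra features."""
    out = df.copy()
    for n in windows:
        # shift(1) so each feature only uses information available before day t
        out[f"ma_{n}"] = out["close"].shift(1).rolling(n).mean()
    return out.dropna()
```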

For the concentration problem, a direct solution is to add a regularization term to the reward function that discourages overly concentrated weights. However, this would likely reduce revenue, since no such regularization exists in the real market. Another approach is to add an extra punishment (a negative reward) for losing money, which is closer to how people actually think about their investments.
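One concrete way to phrase the regularization idea is to subtract an entropy-based concentration penalty from the raw return; the penalty form and the coefficient `lam` below are our own assumptions, not something taken from the references:

```python
import numpy as np

def regularized_reward(portfolio_return, weights, lam=0.01, eps=1e-8):
    """Reward = raw return minus a penalty for concentrated weights.

    A low-entropy (concentrated) weight vector is penalized; lam trades off
    diversification against raw revenue and is an illustrative value.
    """
    w = np.asarray(weights, dtype=float)
    w = w / (w.sum() + eps)
    entropy = -np.sum(w * np.log(w + eps))  # maximal when weights are uniform
    max_entropy = np.log(len(w))
    return portfolio_return - lam * (max_entropy - entropy)
```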

Once the above two problems are suitably solved, we will add more data, extending both the time span and the set of stocks, to bring the trading experience closer to the real world. We will also add more baseline models and top-trader portfolios once the model is more comparable.

Furthermore, many higher-capacity deep neural network architectures, such as RNNs, are now available. With more data and computational resources, replacing the internal networks of DDPG with such architectures could push DDPG to a higher level.

Conclusion

Deep RL shows potential for portfolio optimization and other financial trading problems. The two RL algorithms we tried, DQN and DDPG, show promising performance in generating revenue. However, many problems remain to be solved, and our current approach may be insufficient for financial applications. Further research will need more computational and data resources to produce a more practical solution. Furthermore, robustness and explainability are two important characteristics of any financial implementation, and further research on these two aspects is as necessary as improving the revenue of the strategy.

Video

Presentation Recording https://youtu.be/Uw391OhU5-4

References

[1] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Jun Xiao, Bo Li, “Reinforcement-Learning based Portfolio Management with Augmented Asset Movement Prediction States”. Submitted: 9 Feb 2020. Available: https://arxiv.org/abs/2002.05780

[2] Yuting An, Zikai Sun, “Portfolio Management with Reinforcement Learning”. Published: 2020. Available: https://openreview.net/pdf?id=YdJuGLgMo4H

[3] Mehran Taghian, Ahmad Asadi, Reza Safabakhsh, "Learning Financial Asset-Specific Trading Rules via Deep Reinforcement Learning". Published: 2020. Available: https://arxiv.org/abs/2010.14194

[4] Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236

[5] Fang Lin, Meiqing Wang, Rong Liu, Qianying Hong, "A DDPG Algorithm for Portfolio Management". Available: https://ieeexplore.ieee.org/document/9277817

[6] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, “Continuous control with deep reinforcement learning”. Submitted on 9 Sep 2015. Available: https://arxiv.org/abs/1509.02971

Code

https://github.com/ziruyan/CS_269_FinalProject_Code