This is an AutoTrader (for cryptocurrency, stocks, ...) that uses the PPO algorithm from reinforcement learning.
The model and hyperparameters in this project are only an example. If you want to improve performance, consider designing your own model.
If you are interested in the algorithm used in this project, please refer to the following papers:
Proximal Policy Optimization Algorithms
High-Dimensional Continuous Control Using Generalized Advantage Estimation
There are several ways to use deep learning for trading, but the most basic one is to predict the price after a certain period and buy the asset if a sufficiently large increase is expected. However, even if the prediction made this way is perfectly accurate, the following problems remain:
1. It is not always guaranteed that the asset can be purchased at the closing price.
Suppose, for example, that the closing price of stock A is 1000 at time $t$. Even if the model correctly predicts a rise, the order is placed after that close, so it may not be filled at 1000.
2. Fee issue
In actual stock trading, fees are incurred. Therefore, to actually earn a profit through trading, the expected return must be calculated considering the fees.
The above two problems can be handled naturally by a reinforcement learning algorithm: the agent learns to choose the ask and bid prices at which it places orders between time $t$ and $t+1$, so both execution and fees are reflected directly in its reward.
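As a rough illustration (assuming, for illustration only, a proportional fee rate $f$ on each executed order so that the proceeds are multiplied by $1-f$), a buy followed by a sell yields a net log return of

$\log\frac{\text{sell price}}{\text{buy price}} + 2\log(1-f) \approx \log\frac{\text{sell price}}{\text{buy price}} - 2f,$

which is consistent with the default `trade_penalty` of $-\log(1-5\times10^{-4})$ per executed order in the parameter table below.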
There are three actions, as follows ($c_t$ denotes the closing price at time $t$ and $\theta$ the continuous control parameter chosen together with the action):
Num | Action | Control Range | Description |
---|---|---|---|
0 | Buy | (-inf, inf) | Place a buy order at the price of $c_t e^\theta$ |
1 | Sell | (-inf, inf) | Place a sell order at the price of $c_t e^\theta$ |
2 | Do nothing | - | Take no action |
If the chosen action is Buy or Sell, the continuous trading parameter $\theta$ must be provided along with it. The environment is in one of the following three states (a minimal sketch of the action/state representation follows the state table):
State | Description |
---|---|
Wait-and-see | The state until the buy order is executed |
Hold | The state after the buy order is executed and until the sell order is executed |
Done | The state after both buy and sell orders are executed. The episode ends when this state is reached. |
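A minimal sketch of how the actions and states above might be represented (the names here are illustrative and not the actual classes in this repository):

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    BUY = 0      # place a buy order at c_t * exp(theta)
    SELL = 1     # place a sell order at c_t * exp(theta)
    NOTHING = 2  # do nothing


class State(IntEnum):
    WAIT_AND_SEE = 0  # before the buy order is executed
    HOLD = 1          # buy executed, sell not yet executed
    DONE = 2          # both orders executed; the episode ends


@dataclass
class AgentAction:
    """Discrete action plus the continuous order parameter theta."""
    action: Action
    theta: float = 0.0  # only meaningful for BUY / SELL

    def order_price(self, close: float) -> float:
        """Order price implied by theta at the current close c_t."""
        return close * math.exp(self.theta)
```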
Rewards are given according to the state and action as follows. Let $c_t$, $h_t$, and $l_t$ denote the closing, high, and low prices at time $t$, and let $p$ denote the trade penalty.

- Wait-and-see

  (1) When Buy is taken and executed

  $r_t = \log\frac{c_{t+1}}{\text{buy price}} - p$

  (2) When Buy is taken but not executed

  Since it is the same as not taking any action, a reward of 0 is obtained.

  (3) Do nothing

  Since there is no change in assets, a reward of 0 is obtained.

- Hold

  (1) When Sell is taken and executed

  $r_t = \log\frac{\text{sell price}}{c_t} - p$

  (2) When Sell is taken but not executed

  Since it is the same as not taking any action, the change in the value of the held asset is received as the reward.

  $r_t = \log\frac{c_{t+1}}{c_t}$

  (3) Do nothing

  $r_t = \log\frac{c_{t+1}}{c_t}$
If the rewards are given as above, it is easy to show that the sum of rewards over an episode in which both the buy and the sell are executed is $\log\frac{\text{sell price}}{\text{buy price}} - 2p$.

Let's briefly show this fact. Assume that the buy is executed at time $t$ and the sell at time $s$ (with $t < s$; steps before the buy contribute a reward of 0). Summing the rewards,

$\left(\log\frac{c_{t+1}}{\text{buy price}} - p\right) + \sum_{k=t+1}^{s-1}\log\frac{c_{k+1}}{c_k} + \left(\log\frac{\text{sell price}}{c_s} - p\right) = \log\frac{\text{sell price}}{\text{buy price}} - 2p,$

since the intermediate closing prices cancel, so the claim is immediate.
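The identity can also be checked numerically; the following self-contained sketch (not code from this repository) applies the reward rules above to random closing prices:

```python
import math
import random

random.seed(0)

# Random closing prices for one episode.
closes = [100.0]
for _ in range(20):
    closes.append(closes[-1] * math.exp(random.gauss(0.0, 0.01)))

p = -math.log(1 - 5e-4)   # trade penalty per executed order
t_buy, t_sell = 3, 15     # steps at which the buy / sell orders are executed
buy_price = closes[t_buy] * math.exp(-0.002)    # c_t * exp(theta_buy)
sell_price = closes[t_sell] * math.exp(0.002)   # c_s * exp(theta_sell)

rewards = [math.log(closes[t_buy + 1] / buy_price) - p]    # buy executed
for k in range(t_buy + 1, t_sell):                         # hold steps
    rewards.append(math.log(closes[k + 1] / closes[k]))
rewards.append(math.log(sell_price / closes[t_sell]) - p)  # sell executed

total = sum(rewards)
expected = math.log(sell_price / buy_price) - 2 * p
assert abs(total - expected) < 1e-9
print(total, expected)
```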
When chart data is given, an observation is built from the preprocessed chart dataframe as follows:

$h^*_t = \log h_t - \log c_t$

$l^*_t = \log l_t - \log c_t$

$c^*_t = \log c_t - \log c_{t-1}$

In other words, any feature is compatible with this project as long as it is supplied as a dataframe series preprocessed in this way.
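For instance, this preprocessing could be done with pandas roughly as follows (a sketch assuming an OHLC dataframe with `high`, `low`, and `close` columns; the column names are illustrative):

```python
import numpy as np
import pandas as pd


def preprocess(chart: pd.DataFrame) -> pd.DataFrame:
    """Convert raw high/low/close prices into the log-ratio features h*, l*, c*."""
    log_close = np.log(chart["close"])
    out = pd.DataFrame(index=chart.index)
    out["h_star"] = np.log(chart["high"]) - log_close  # h*_t = log h_t - log c_t
    out["l_star"] = np.log(chart["low"]) - log_close   # l*_t = log l_t - log c_t
    out["c_star"] = log_close.diff()                   # c*_t = log c_t - log c_{t-1}
    return out.dropna()                                # the first row has no c*_t
```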
Preprocessing in this way has the following advantages:
- It becomes easier to compute rewards and to check whether an order is executed.

  For example, suppose a buy with parameter $\theta$ is successfully executed at time $t$. The reward (ignoring the penalty $p$) is $\log\frac{c_{t+1}}{c_t e^\theta}$, which becomes $\log\frac{c_{t+1}}{c_t e^\theta} = \log c_{t+1} - \log c_t - \theta = c^*_{t+1} - \theta$, and the reward for the do-nothing action in the Hold state is simply $c^*_{t+1}$. Order execution is just as easy to check: for a buy the condition is $e^{\theta}c_t > l_{t+1}$; taking logarithms gives $\theta + \log c_t > \log l_{t+1}$, i.e. $\theta + \log c_t - \log c_{t+1} = \theta - c^*_{t+1} > \log l_{t+1} - \log c_{t+1} = l^*_{t+1}$ (see the sketch after this list).

- It makes the data stationary.

  Since everything is computed from price ratios, independent of the absolute price level, the chart data becomes stationary.
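In code, the execution checks and the executed-buy reward in these preprocessed coordinates might look like this (a sketch only; the sell condition is the symmetric counterpart of the buy condition stated above, and the `epsilon` margin from the table below is omitted):

```python
def buy_executed(theta: float, c_star_next: float, l_star_next: float) -> bool:
    """Buy at c_t * exp(theta) fills if it is above the next low:
    e^theta * c_t > l_{t+1}  <=>  theta - c*_{t+1} > l*_{t+1}."""
    return theta - c_star_next > l_star_next


def sell_executed(theta: float, c_star_next: float, h_star_next: float) -> bool:
    """Sell at c_t * exp(theta) fills if it is below the next high:
    e^theta * c_t < h_{t+1}  <=>  theta - c*_{t+1} < h*_{t+1}."""
    return theta - c_star_next < h_star_next


def executed_buy_reward(theta: float, c_star_next: float, penalty: float) -> float:
    """log(c_{t+1} / (c_t * e^theta)) - p = c*_{t+1} - theta - p."""
    return c_star_next - theta - penalty
```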
Parameter | Type | Default | Description |
---|---|---|---|
lr | float | 1e-5 | Learning rate |
batch_size | int | 1024 | Batch size |
discount_factor | float | 1.0 | Discount factor $\gamma$ |
gae_factor | float | 0.9 | GAE parameter $\lambda$ |
epoch | int | 5 | Number of optimization epochs per update |
clip | float | 1e-1 | PPO clipping range |
n_market | int | 128 | The number of trajectories |
max_episode_steps | int | 256 | Maximum number of steps per episode |
obs_length | int | 30 | Number of time steps included in one observation |
buy_incentive | float | 5e-3 | Additional reward given when a buy is executed |
sell_incentive | float | 5e-3 | Additional reward given when a sell is executed |
incentive_decay | float | 0.999 | Decay rate of the trading incentives |
epsilon | float | 0.0 | Execution margin: the larger it is, the higher a buy order must be relative to the next low (and the lower a sell order relative to the next high) to count as executed |
trade_penalty | float | -log(1-5e-4) | Penalty $p$ applied when an order is executed |
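For reference, these hyperparameters could be bundled into a config such as the following (a hypothetical sketch using the defaults from the table; the actual class and field names in the code may differ):

```python
import math
from dataclasses import dataclass


@dataclass
class PPOTraderConfig:
    lr: float = 1e-5                # learning rate
    batch_size: int = 1024
    discount_factor: float = 1.0    # gamma
    gae_factor: float = 0.9         # lambda used by GAE
    epoch: int = 5                  # optimization epochs per update
    clip: float = 1e-1              # PPO clipping range
    n_market: int = 128             # number of trajectories
    max_episode_steps: int = 256
    obs_length: int = 30            # time steps per observation
    buy_incentive: float = 5e-3
    sell_incentive: float = 5e-3
    incentive_decay: float = 0.999
    epsilon: float = 0.0            # execution margin
    trade_penalty: float = -math.log(1 - 5e-4)  # penalty p per executed order
```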
To eliminate time dependence, trajectories are obtained by selecting random time points and trading only for a certain period of time.
Also, unlike a typical policy, the agent first chooses a discrete action and then takes a continuous action (the parameter $\theta$) conditioned on that choice. Since the policy factorizes as $\pi(a, \theta \mid s) = \pi(a \mid s)\,\pi(\theta \mid a, s)$, the probability can be easily obtained.
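For example, the log-probability of such a factorized policy could be computed as in the following sketch (PyTorch, assuming a categorical head over the three actions and a Gaussian head over $\theta$ for each action; this is not necessarily how the repository implements it):

```python
import torch
from torch.distributions import Categorical, Normal


def log_prob(action_logits, theta_mean, theta_std, action, theta):
    """log pi(a, theta | s) = log pi(a | s) + log pi(theta | a, s).

    action_logits:         (B, 3) logits for Buy / Sell / Do nothing
    theta_mean, theta_std: (B, 3) Gaussian parameters of theta per action
    action: (B,) long tensor, theta: (B,) float tensor
    """
    logp = Categorical(logits=action_logits).log_prob(action)

    # The continuous part only matters for Buy (0) and Sell (1).
    mean = theta_mean.gather(1, action.unsqueeze(1)).squeeze(1)
    std = theta_std.gather(1, action.unsqueeze(1)).squeeze(1)
    cont_logp = Normal(mean, std).log_prob(theta)
    return logp + torch.where(action < 2, cont_logp, torch.zeros_like(cont_logp))
```

During the PPO update, this joint log-probability simply takes the place of the usual single-distribution log-probability in the clipped ratio.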
Please refer to the code for more details.