A typical RL task optimizes an objective function
$$U_T=U(R_1, R_2, ..., R_T)$$
where $R_i$ is the instantaneous reward at time $i$. For standard $Q$-learning or policy gradient algorithms, the objective function is
$$U_T=\sum_{i=1}^{T}\gamma^{i-1}\cdot R_i$$
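The discounted sum above can be sketched in a few lines; the function name `discounted_return` is my own, introduced only for illustration:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative sum: U_T = sum_i gamma^(i-1) * R_i."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# e.g. for rewards [1, 1, 1] and gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
```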
which is the discounted cumulative sum of all one-step rewards. However, this objective does not take market risk into consideration. The simplest benchmark that measures both risk and return is the Sharpe ratio, i.e.
$$U_T=\frac{E(R)}{\text{std}(R)}$$
with $E(R)=(1/T)\sum_{i=1}^{T}R_i$ and std the standard deviation of the $R_i$. However, this objective is not additive, which prevents us from applying the standard machinery of $Q$-learning and vanilla policy gradient methods.
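The non-additivity is easy to check numerically: the Sharpe ratio of a whole return series is not the sum of the Sharpe ratios of its pieces. A minimal sketch with synthetic returns (the series here is random, purely for illustration):

```python
import numpy as np

def sharpe(returns):
    """Sharpe ratio E(R) / std(R) of a return series."""
    returns = np.asarray(returns, dtype=float)
    return np.mean(returns) / np.std(returns)

rng = np.random.default_rng(0)
r = rng.normal(0.001, 0.01, size=400)

# Sharpe of the full series vs. the sum over its two halves:
whole = sharpe(r)
parts = sharpe(r[:200]) + sharpe(r[200:])
# whole != parts, so the objective cannot be decomposed
# into a sum of per-step rewards as Q-learning requires.
```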
J. Moody proposed the differential Sharpe ratio (DSR), which turns the $U_T$ above into an additive sum of single-step rewards.
First, for a given step $n$, the Sharpe ratio $U_n$ can be estimated by
$$U_n=\frac{A_n}{\sqrt{B_n-A_n^2}}$$
where $A_n$ and $B_n$ are moving estimates of the first and second moments of the returns, updated with adaptation rate $\eta$:
$$A_n=A_{n-1}+\eta\,\Delta A_n,\qquad \Delta A_n=R_n-A_{n-1}$$
$$B_n=B_{n-1}+\eta\,\Delta B_n,\qquad \Delta B_n=R_n^2-B_{n-1}$$
The differential Sharpe ratio $D_n$ is the derivative of $U_n$ with respect to $\eta$, evaluated at $\eta=0$:
$$D_n=\frac{dU_n}{d\eta}=\frac{B_{n-1}\,\Delta A_n-\frac{1}{2}A_{n-1}\,\Delta B_n}{\left(B_{n-1}-A_{n-1}^2\right)^{3/2}}$$
Now, expanding the whole $U_T$ to first order in $\eta$, we have
$$U_T\simeq \eta \sum_{i=1}^T D_i + O(\eta^2)$$
To first order, the original Sharpe-ratio objective thus becomes fully additive, with the one-step reward replaced by $D_t$. This means we can use $D_t$ as the reward and leave the rest of the RL framework unchanged.
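The recursive update above fits in one small function. A minimal sketch, assuming the moment estimates $A$ and $B$ are maintained by the caller; the function name `dsr_step` is my own:

```python
import numpy as np

def dsr_step(R, A, B, eta=0.004):
    """One differential-Sharpe-ratio update step.

    R      -- the newly observed one-step return R_n
    A, B   -- current moving estimates of E(R) and E(R^2)
    Returns (D, A_new, B_new): the reward D_n and updated moments.
    """
    dA = R - A                      # Delta A_n
    dB = R ** 2 - B                 # Delta B_n
    D = (B * dA - 0.5 * A * dB) / (B - A ** 2) ** 1.5
    A_new = A + eta * dA
    B_new = B + eta * dB
    return D, A_new, B_new
```

In an RL loop, `D` plays the role of the per-step reward while `A` and `B` are carried along as part of the environment state.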
Numerical simulation of DSR
Here we use the S&P 500 daily close price to illustrate the idea of DSR. Note that the initial condition $A(0)=B(0)=0$ is singular for the definition of DSR. In practice, we therefore use the first 200 days to compute a base SR as the initial value of $A$ and $B$, and only then begin to compute $D_t$ in the following iterations.
```python
# encoding: utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dt = pd.read_csv('SP500.csv', index_col=0)['close']
pct = dt.pct_change().ffill().fillna(0.0)
pct = pct.values

def sharpe(ls):
    return np.mean(ls) / np.std(ls)

# ls1 contains the true SR increments
ls1 = []
sr0 = sharpe(pct[:200])
for i in range(500):
    sr = sharpe(pct[:200 + i + 1])
    ls1.append(sr - sr0)
    sr0 = sr

# ls2 uses the cumulated DSR to approximate the SR
ls2 = []
eta = 0.004
# use the first 200 days to set an initial value of the SR
sr = sharpe(pct[:200])
for i in range(500):
    A = np.mean(pct[:200 + i])
    B = np.mean(pct[:200 + i] ** 2)
    # the newly arrived return is pct[200 + i], matching ls1 above
    delta_A = pct[200 + i] - A
    delta_B = pct[200 + i] ** 2 - B
    Dt = (B * delta_A - 0.5 * A * delta_B) / (B - A ** 2) ** (3 / 2)
    sr += eta * Dt
    ls2.append(Dt * eta)
```
The comparison between the true SR and the cumulated DSR is presented below.
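One way to render that comparison is to plot the cumulative sums of the two series side by side. A minimal sketch; the synthetic `ls1` and `ls2` here are hypothetical stand-ins so the snippet is self-contained, and the filename is arbitrary:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, runs headless
import matplotlib.pyplot as plt
import numpy as np

# hypothetical stand-ins for ls1 (true SR increments) and
# ls2 (eta * D_t) computed in the loop above
rng = np.random.default_rng(0)
ls1 = rng.normal(0.0, 1e-3, 500)
ls2 = ls1 + rng.normal(0.0, 2e-4, 500)

plt.plot(np.cumsum(ls1), label='true SR change')
plt.plot(np.cumsum(ls2), label='cumulated DSR')
plt.xlabel('day')
plt.ylabel('change in Sharpe ratio')
plt.legend()
plt.savefig('dsr_vs_sr.png')
```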