You have it right: the $V$ function gives you the value of a state, and $Q$ gives you the value of an action in a state (following a given policy $\pi$). I found the clearest explanation of Q-learning and how it works in Tom Mitchell's book "Machine Learning" (1997), ch. 13, which is downloadable. $V$ is defined as the sum of an infinite series, but that's not important here. What matters is that the $Q$ function is defined as
$$Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$$
where $V^*$ is the best value of a state if you could follow an optimal policy, which you don't know. However, it has a nice characterization in terms of $Q$:
$$V^*(s) = \max_{a'} Q(s,a')$$
Computing $Q$ is done by substituting for $V^*$ in the first equation, giving
$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$$
This may seem an odd recursion at first, because it expresses the Q value of an action in the current state in terms of the best Q value of a successor state, but it makes sense when you look at how the backup process uses it: the exploration process stops when it reaches a goal state and collects the reward, which becomes that final transition's Q value. Now, in a subsequent training episode, when the exploration process reaches that predecessor state, the backup process uses the above equality to update the predecessor state's current Q value. The next time its predecessor is visited, that state's Q value gets updated, and so on back down the line (Mitchell's book describes a more efficient way of doing this by storing all the computations and replaying them later). Provided every state-action pair is visited infinitely often, this process eventually computes the optimal Q.
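To make the backup concrete, here is a minimal sketch (my own toy example, not from Mitchell's book) of tabular Q-learning on a hypothetical 5-state deterministic chain with a reward of 1 at the goal. With purely random exploration you can watch the reward propagate backwards from the goal, one transition per episode at first:

```python
import random
from collections import defaultdict

# Toy deterministic chain: states 0..4, actions move left (-1) or right (+1),
# reward 1.0 for arriving at the goal state 4. Hypothetical example.
GOAL, GAMMA = 4, 0.9
ACTIONS = (-1, +1)

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

Q = defaultdict(float)  # Q[(state, action)], initialised to 0

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)                 # purely random exploration
        s_next, r = step(s, a)
        # Deterministic backup: Q(s,a) = r + gamma * max_a' Q(s',a'), no learning rate.
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS) if s_next != GOAL else 0.0
        Q[(s, a)] = r + GAMMA * best_next
        s = s_next

print({(s, a): round(Q[(s, a)], 3) for s in range(GOAL) for a in ACTIONS})
```

After enough episodes the "right" actions settle at 1, 0.9, 0.81, ... as the discounted reward flows back from the goal.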
Sometimes you will see a learning rate $\alpha$ applied to control how much Q actually gets updated:
$$\begin{align}
Q(s,a) &= (1-\alpha)\,Q(s,a) + \alpha\bigl(r(s,a) + \gamma \max_{a'} Q(s',a')\bigr) \\
       &= Q(s,a) + \alpha\bigl(r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr)
\end{align}$$
Notice now that the update to the Q value does depend on the current Q value. Mitchell's book also explains why that is and why you need $\alpha$: it is for stochastic MDPs. Without $\alpha$, every time a state-action pair was attempted there would be a different reward, so the $\hat{Q}$ function would bounce all over the place and not converge. $\alpha$ is there so that new knowledge is only accepted in part. Initially $\alpha$ is set high, so that the current (mostly random) values of Q are less influential; it is decreased as training progresses, so that new updates have less and less influence, and Q-learning converges.
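As a rough sketch (again my own toy code, with made-up names), the stochastic update with a per-pair learning rate that decays with the visit count might look like this; the $1/(1+\text{visits})$ schedule is one common choice, not the only one:

```python
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)], initialised to 0
visits = defaultdict(int)    # per-pair visit counts, used to decay alpha
GAMMA = 0.9

def q_update(s, a, r, s_next, actions):
    """One stochastic Q-learning update: move Q(s,a) part-way toward the target."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])           # shrinks each time the pair is revisited
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])      # same as (1-alpha)*Q + alpha*target

# Example: one update after observing reward 1.0 for moving "right" from state 3 to 4
q_update(3, "right", 1.0, 4, ["left", "right"])
```

Early on, $\alpha$ is close to 1, so the noisy targets mostly overwrite the (random) initial estimates; as the counts grow, each new sample only nudges the running estimate, which is what lets it settle down on a stochastic MDP.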