What are the Q function and V function in reinforcement learning?


30

It seems to me that the $V$ function can be easily expressed by the $Q$ function, and thus the $V$ function seems superfluous to me. However, I'm new to reinforcement learning so I guess I got something wrong.

Definitions

Q- and V-learning are in the context of Markov Decision Processes. An MDP is a 5-tuple $(S, A, P, R, \gamma)$ with

  • $S$ is a set of states (typically finite)
  • $A$ is a set of actions (typically finite)
  • $P(s, s', a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of getting from state $s$ to state $s'$ with action $a$.
  • $R(s, s', a) \in \mathbb{R}$ is the immediate reward after going from state $s$ to state $s'$ with action $a$. (It seems to me that usually only $s'$ matters.)
  • $\gamma \in [0, 1]$ is called the discount factor and determines whether one focuses on immediate rewards ($\gamma = 0$), the total reward ($\gamma = 1$), or some trade-off.

A policy $\pi$, according to Reinforcement Learning: An Introduction by Sutton and Barto, is a function $\pi: S \to A$ (this could be probabilistic).

According to Mario Martin's slides, the $V$ function is

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$

and the $Q$ function is

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$
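
To make these definitions concrete, here is a minimal Monte Carlo sketch (my own toy example, not from the question or the slides) that estimates $V^\pi$ and $Q^\pi$ by averaging sampled discounted returns; the two-state MDP, the transition table `P`, the reward table `R`, and the policy are all hypothetical:

```python
import random

# Toy 2-state, 2-action MDP (purely illustrative):
# P[s][a] is a list of (next_state, probability); R[s][a][s_next] is the reward.
P = {0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.6), (0, 0.4)]}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {1: 2.0}},
     1: {0: {0: 0.5},         1: {1: 0.0, 0: 1.5}}}
gamma = 0.9
policy = {0: 1, 1: 0}  # a deterministic policy pi: S -> A

def step(s, a):
    """Sample s' from P(s, ., a) and return (s', immediate reward)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, probs)[0]
    return s_next, R[s][a][s_next]

def sampled_return(s, first_action=None, horizon=200):
    """One sampled discounted return sum_k gamma^k r_{t+k+1}, following pi
    (optionally forcing the first action, as in the Q definition)."""
    ret, discount = 0.0, 1.0
    a = policy[s] if first_action is None else first_action
    for _ in range(horizon):
        s, r = step(s, a)
        ret += discount * r
        discount *= gamma
        a = policy[s]
    return ret

def estimate_v(s, n=20000):
    """Monte Carlo estimate of V^pi(s)."""
    return sum(sampled_return(s) for _ in range(n)) / n

def estimate_q(s, a, n=20000):
    """Monte Carlo estimate of Q^pi(s, a)."""
    return sum(sampled_return(s, first_action=a) for _ in range(n)) / n

# For this deterministic pi, estimate_v(0) and estimate_q(0, policy[0])
# estimate the same quantity; estimate_q(0, 0) forces the other action first.
print(estimate_v(0), estimate_q(0, policy[0]), estimate_q(0, 0))
```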

My thoughts

The $V$ function states what the expected overall value (not reward!) of a state $s$ under the policy $\pi$ is.

The $Q$ function states what the value of a state $s$ and an action $a$ under the policy $\pi$ is.

This means

$$Q^\pi(s, \pi(s)) = V^\pi(s)$$

Right? So why do we have the value function at all? (I guess I mixed something up.)

Answers:


15

Q-values are a great way to make actions explicit, so you can deal with problems where the transition function is not available (model-free). However, when your action space is large, things are not so nice and Q-values are not so convenient. Think of a huge number of actions or even continuous action spaces.

From a sampling perspective, the dimensionality of $Q(s, a)$ is higher than that of $V(s)$, so it might get harder to collect enough $(s, a)$ samples in comparison with $(s)$ samples. If you have access to the transition function, sometimes $V$ is good.

There are also other uses where both are combined. For instance, the advantage function, where $A(s, a) = Q(s, a) - V(s)$. If you are interested, you can find a recent example using advantage functions here:

Dueling Network Architectures for Deep Reinforcement Learning

by Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot and Nando de Freitas.
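
As a rough sketch of how the advantage decomposition is used there (my own minimal example, not code from the paper), a dueling head keeps a scalar value stream and a per-action advantage stream and recombines them, subtracting the mean advantage so the two streams stay identifiable:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, .)
    into Q(s, .) by subtracting the mean advantage, so that the value and
    advantage streams are identifiable (as in the dueling architecture)."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Hypothetical outputs of the two network streams for a single state:
q = dueling_q(value=3.0, advantages=[0.5, -0.2, -0.3])
print(q)         # [3.5 2.8 2.7]
print(q.mean())  # 3.0 -- the state value, since the advantages are centered
```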


19

$V^\pi(s)$ is the state-value function of an MDP (Markov Decision Process). It's the expected return starting from state $s$, following policy $\pi$.

In the expression

$$V^\pi(s) = E_\pi\{G_t \mid s_t = s\}$$

$G_t$ is the total DISCOUNTED reward from time step $t$, as opposed to $R_t$, which is an immediate reward. Here you are taking the expectation over ALL actions according to the policy $\pi$.

$Q^\pi(s, a)$ is the action-value function. It is the expected return starting from state $s$, following policy $\pi$, taking action $a$. It focuses on the particular action in the particular state.

$$Q^\pi(s, a) = E_\pi\{G_t \mid s_t = s, a_t = a\}$$

The relationship between $Q^\pi$ and $V^\pi$ (the value of being in that state) is

$$V^\pi(s) = \sum_{a \in A} \pi(a \mid s)\, Q^\pi(a, s)$$

You sum every action-value multiplied by the probability of taking that action (the policy $\pi(a \mid s)$).

If you think of the grid world example, you multiply the probability of each move (up/down/right/left) with the one-step-ahead state value of that move.
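
For instance, here is a tiny sketch with made-up numbers (my own, not from the answer): a uniform-random policy in one grid-world state weights the four action values equally, while a greedy policy would collapse the sum onto the best action:

```python
# Hypothetical Q-values for one grid-world state s under policy pi,
# and a uniform-random policy pi(a|s) = 0.25 for each move.
q_values = {"up": 1.0, "down": -0.5, "left": 0.2, "right": 0.8}
policy_probs = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
v = sum(policy_probs[a] * q_values[a] for a in q_values)
print(v)  # 0.375

# For a greedy policy all the weight sits on the argmax, so the weighted
# sum reduces to max_a Q(s, a) -- the form asked about in the comments below.
v_greedy = max(q_values.values())
print(v_greedy)  # 1.0
```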


5
This is the most concise answer.
Brett

I have a source that states that $V^\pi(s) = \max_{a \in A} Q^\pi(s, a)$. How do you relate this equation to the one you provide in your answer, $V^\pi(s) = \sum_{a \in A} \pi(a \mid s)\, Q^\pi(a, s)$? In your equation, you're defining $V$ in terms of a weighted sum of $Q$ values. This is different from the definition I have, which defines $V$ as the highest $Q$.
nbro

@nbro I believe it depends on what kind of policy you are following. For a purely greedy policy you are correct. But if it were a more exploratory policy, built to stochastically decide on an action, then the above would be correct.
deltaskelta

7

You have it right: the $V$ function gives you the value of a state, and $Q$ gives you the value of an action in a state (following a given policy $\pi$). I found the clearest explanation of Q-learning and how it works in Tom Mitchell's book "Machine Learning" (1997), ch. 13, which is downloadable. $V$ is defined as the sum of an infinite series, but that's not important here. What matters is that the $Q$ function is defined as

$$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$$
where $V^*$ is the best value of a state if you could follow the optimum policy, which you don't know. However, it has a nice characterization in terms of $Q$:

$$V^*(s) = \max_{a'} Q(s, a')$$

Computing $Q$ is done by replacing $V^*$ in the first equation, giving

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$$

This may seem an odd recursion at first, because it's expressing the Q value of an action in the current state in terms of the best Q value of a successor state, but it makes sense when you look at how the backup process uses it: the exploration process stops when it reaches a goal state and collects the reward, which becomes that final transition's Q value. Now, in a subsequent training episode, when the exploration process reaches that predecessor state, the backup process uses the above equality to update the current Q value of the predecessor state. The next time its predecessor is visited, that state's Q value gets updated, and so on back down the line (Mitchell's book describes a more efficient way of doing this by storing all the computations and replaying them later). Provided every state is visited infinitely often, this process eventually computes the optimal Q.
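
Here is a condensed sketch of that backup (my own illustration with hypothetical state names and rewards, not Mitchell's code), showing the goal reward propagating one step back per episode:

```python
def backup(Q, s, a, r, s_next, gamma=0.9):
    """Deterministic backup: Q(s, a) <- r + gamma * max_a' Q(s_next, a');
    unseen/terminal successor states contribute 0."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q.setdefault(s, {})[a] = r + gamma * best_next

Q = {}
backup(Q, s="B", a="right", r=10.0, s_next="GOAL")  # episode 1 reaches the goal
backup(Q, s="A", a="right", r=0.0,  s_next="B")     # a later episode passes through A
print(Q)  # {'B': {'right': 10.0}, 'A': {'right': 9.0}}
```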

Sometimes you will see a learning rate $\alpha$ applied to control how much Q actually gets updated:

$$\begin{aligned} Q(s, a) &= (1 - \alpha)\, Q(s, a) + \alpha\, \big(r(s, a) + \gamma \max_{a'} Q(s', a')\big) \\ &= Q(s, a) + \alpha\, \big(r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)\big) \end{aligned}$$
Notice now that the updated Q value does depend on the current Q value. Mitchell's book also explains why that is and why you need $\alpha$: it's for stochastic MDPs. Without $\alpha$, every time a state-action pair was attempted there would be a different reward, so the $\hat{Q}$ function would bounce all over the place and not converge. $\alpha$ is there so that the new knowledge is only accepted in part. Initially $\alpha$ is set high so that the current (mostly random) values of Q are less influential. $\alpha$ is decreased as training progresses, so that new updates have less and less influence, and now Q-learning converges.
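
As a minimal sketch of this update rule (my own illustration, not Mitchell's code; the state names, rewards, and the 1/visits decay schedule are made up), note how a decaying $\alpha$ makes the estimate settle on the average of noisy rewards instead of bouncing between them:

```python
def q_update(Q, s, a, r, s_next, alpha, gamma=0.9):
    """One stochastic Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    current = Q.setdefault(s, {}).setdefault(a, 0.0)
    Q[s][a] = current + alpha * (r + gamma * best_next - current)

# A simple decay schedule: alpha shrinks with the number of visits to (s, a),
# so early (mostly random) estimates get overwritten and later updates only
# nudge the value.
visits = {}
def decayed_alpha(s, a):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    return 1.0 / visits[(s, a)]

Q = {}
q_update(Q, "A", "right", r=1.0, s_next="B", alpha=decayed_alpha("A", "right"))
q_update(Q, "A", "right", r=0.0, s_next="B", alpha=decayed_alpha("A", "right"))
print(Q)  # {'A': {'right': 0.5}} -- the two noisy rewards are averaged
```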


0

Here is a more detailed explanation of the relationship between the state value and the action value in Aaron's answer. Let's first take a look at the definitions of the value function and the action value function under policy $\pi$:

$$v_\pi(s) = E[G_t \mid S_t = s], \qquad q_\pi(s, a) = E[G_t \mid S_t = s, A_t = a]$$

where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the return at time $t$. The relationship between these two value functions can be derived as

$$\begin{aligned}
v_\pi(s) &= E[G_t \mid S_t = s] \\
&= \sum_{g_t} p(g_t \mid S_t = s)\, g_t \\
&= \sum_{g_t} \sum_{a} p(g_t, a \mid S_t = s)\, g_t \\
&= \sum_{a} p(a \mid S_t = s) \sum_{g_t} p(g_t \mid S_t = s, A_t = a)\, g_t \\
&= \sum_{a} p(a \mid S_t = s)\, E[G_t \mid S_t = s, A_t = a] \\
&= \sum_{a} p(a \mid S_t = s)\, q_\pi(s, a)
\end{aligned}$$
The above equation is important. It describes the relationship between two fundamental value functions in reinforcement learning, and it is valid for any policy. Moreover, if we have a deterministic policy, then $v_\pi(s) = q_\pi(s, \pi(s))$. Hope this is helpful for you. (To see more about the Bellman optimality equation, see https://stats.stackexchange.com/questions/347268/proof-of-bellman-optimality-equation/370198#370198.)
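
Concretely, for a deterministic policy the weights $p(a \mid S_t = s)$ reduce to an indicator on $a = \pi(s)$, so the sum collapses to a single term:

$$v_\pi(s) = \sum_{a} p(a \mid S_t = s)\, q_\pi(s, a) = \sum_{a} \mathbf{1}[a = \pi(s)]\, q_\pi(s, a) = q_\pi(s, \pi(s))$$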


0

The value function is an abstract formulation of utility, and the Q-function is used for the Q-learning algorithm.


In the context of this question, $V$ and $Q$ are different.
Siong Thye Goh
Licensed under cc by-sa 3.0 with attribution required.