You have it right: the $V$ function gives you the value of a state, and $Q$ gives you the value of an action in a state (following a given policy $\pi$). I found the clearest explanation of Q-learning and how it works in Tom Mitchell's book "Machine Learning" (1997), ch. 13, which is downloadable. $V$ is defined as the sum of an infinite series, but that's not important here. What matters is that the $Q$ function is defined as
$$Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$$
where $V^*$ is the best value of a state if you could follow an optimal policy, which you don't know. However, it has a nice characterization in terms of $Q$:
$$V^*(s) = \max_{a'} Q(s,a')$$
Computing $Q$ is done by substituting for $V^*$ in the first equation, giving
$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$$
This may seem an odd recursion at first, because it expresses the Q value of an action in the current state in terms of the best Q value of a successor state, but it makes sense when you look at how the backup process uses it: the exploration process stops when it reaches a goal state and collects the reward, which becomes that final transition's Q value. Now, in a subsequent training episode, when the exploration process reaches that predecessor state, the backup process uses the above equality to update the predecessor state's current Q value. The next time its predecessor is visited, that state's Q value gets updated, and so on back down the line (Mitchell's book describes a more efficient way of doing this by storing all the computations and replaying them later). Provided every state-action pair is visited infinitely often, this process eventually computes the optimal Q.
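To make the backup concrete, here is a minimal sketch (my own toy example, not from Mitchell's book) of tabular Q-learning on a hypothetical 5-state deterministic chain with a reward of 1 at the goal. With purely random exploration you can watch the reward propagate backwards from the goal, one transition per episode at first:

```python
import random
from collections import defaultdict

# Toy deterministic chain: states 0..4, actions move left (-1) or right (+1),
# reward 1.0 for arriving at the goal state 4. Hypothetical example.
GOAL, GAMMA = 4, 0.9
ACTIONS = (-1, +1)

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

Q = defaultdict(float)  # Q[(state, action)], initialised to 0

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)                 # purely random exploration
        s_next, r = step(s, a)
        # Deterministic backup: Q(s,a) = r + gamma * max_a' Q(s',a'), no learning rate.
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS) if s_next != GOAL else 0.0
        Q[(s, a)] = r + GAMMA * best_next
        s = s_next

print({(s, a): round(Q[(s, a)], 3) for s in range(GOAL) for a in ACTIONS})
```

After enough episodes the "right" actions settle at 1, 0.9, 0.81, ... as the discounted reward flows back from the goal.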
Sometimes you will see a learning rate $\alpha$ applied to control how much Q actually gets updated:
$$\begin{align}
Q(s,a) &= (1-\alpha)\,Q(s,a) + \alpha\bigl(r(s,a) + \gamma \max_{a'} Q(s',a')\bigr) \\
       &= Q(s,a) + \alpha\bigl(r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr)
\end{align}$$
Notice now that the update to the Q value does depend on the current Q value. Mitchell's book also explains why that is and why you need $\alpha$: it is for stochastic MDPs. Without $\alpha$, every time a state-action pair was attempted there would be a different reward, so the $\hat{Q}$ function would bounce all over the place and not converge. $\alpha$ is there so that new knowledge is only accepted in part. Initially $\alpha$ is set high, so that the current (mostly random) values of Q are less influential; it is decreased as training progresses, so that new updates have less and less influence, and Q-learning converges.
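As a rough sketch (again my own toy code, with made-up names), the stochastic update with a per-pair learning rate that decays with the visit count might look like this; the $1/(1+\text{visits})$ schedule is one common choice, not the only one:

```python
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)], initialised to 0
visits = defaultdict(int)    # per-pair visit counts, used to decay alpha
GAMMA = 0.9

def q_update(s, a, r, s_next, actions):
    """One stochastic Q-learning update: move Q(s,a) part-way toward the target."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])           # shrinks each time the pair is revisited
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])      # same as (1-alpha)*Q + alpha*target

# Example: one update after observing reward 1.0 for moving "right" from state 3 to 4
q_update(3, "right", 1.0, 4, ["left", "right"])
```

Early on, $\alpha$ is close to 1, so the noisy targets mostly overwrite the (random) initial estimates; as the counts grow, each new sample only nudges the running estimate, which is what lets it settle down on a stochastic MDP.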