For each state and action there is a reward:

- Initially, all rewards are 0.
- After many random walks, taking actions and collecting the rewards associated with each, the agent becomes more likely to pick the more rewarding actions.

This is called Q-learning.

## Markov Decision Process

The mathematical framework for defining a solution to a reinforcement learning scenario is called a Markov Decision Process. It is defined by:

- Set of states, S
- Set of actions, A
- Reward function, R
- Policy, π
- Value, V
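To make the five components above tangible, here is a minimal sketch in plain Python. The two-state "tired/rested" problem, its rewards, and the sample policy are invented purely for illustration:

```python
# A tiny concrete MDP, sketched with plain Python containers.
# States, actions, rewards, and the policy below are made up
# solely to illustrate the five components (S, A, R, pi, V).
S = ["tired", "rested"]                     # set of states
A = ["study", "sleep"]                      # set of actions
R = {                                       # reward function R(s, a)
    ("tired", "study"): -1.0,
    ("tired", "sleep"): +1.0,
    ("rested", "study"): +2.0,
    ("rested", "sleep"): 0.0,
}
pi = {"tired": "sleep", "rested": "study"}  # a policy: state -> action
# One-step value of following pi from each state:
V = {s: R[(s, pi[s])] for s in S}
print(V)  # {'tired': 1.0, 'rested': 2.0}
```

In a full MDP the value would be the expected *cumulative* (discounted) reward over many steps, not just one; this sketch only shows how the pieces fit together.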
We have to take an action (A) to transition from our start state to our end state (S). The set of actions we take defines our policy (π), and the rewards we get in return define our value (V). Our task is to maximize our rewards by choosing the correct policy, i.e. maximize E[r_t | π, s_t] for all possible states s_t at time t.

One simple strategy is epsilon-greedy, which is essentially a greedy approach to solving the problem: first you take the greedy choices and see which one gets you to the destination. After that, if you (the salesman) want to go from place A to place F again, you always follow the same policy.

Reinforcement learning algorithms fall into three major categories:

- Policy based, where the focus is on finding the optimal policy
- Value based, where the focus is on finding the optimal value, i.e. the cumulative reward
- Action based, where the focus is on which optimal actions to take at each step

I will try to cover in-depth reinforcement learning algorithms in future articles. Till then, you can refer to this paper: Reinforcement Learning: A Survey, Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore, JAIR, 1996.
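The Q-learning idea described above, paired with an epsilon-greedy choice of actions, can be sketched on a toy environment. The 5-state chain below (walk right to reach a rewarding goal state) is invented for this example and is not from any particular library:

```python
import random

random.seed(0)

# Toy environment, invented for illustration: states 0..4 on a line,
# reaching state 4 yields reward +1; all other transitions yield 0.
N_STATES = 5
ACTIONS = [-1, +1]              # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# The Q-table starts as all zeros, as described at the top of this section.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best known action,
    # but explore a random one with probability EPSILON.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):       # many "random walks" through the chain
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Standard tabular Q-learning update.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                       - Q[(state, action)])
        state = nxt

# After training, the agent prefers the more rewarding action (+1)
# in every non-terminal state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```

After enough episodes the greedy policy read off the Q-table always moves right, which is exactly the "agent becomes more likely to pick the more rewarding actions" behavior described earlier.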