Thompson sampling

Thompson sampling, ^[1] ^[2] ^[3] named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

Description

Consider a set of contexts ${\mathcal {X}}$, a set of actions ${\mathcal {A}}$, and rewards in $\mathbb {R}$. The aim of the player is to play actions under the various contexts so as to maximize the cumulative reward. Specifically, in each round, the player obtains a context $x\in {\mathcal {X}}$, plays an action $a\in {\mathcal {A}}$, and receives a reward $r\in \mathbb {R}$ following a distribution that depends on the context and the issued action.

The elements of Thompson sampling are as follows:

- a likelihood function $P(r|\theta ,a,x)$;
- a set $\Theta$ of parameters $\theta$ of the distribution of $r$;
- a prior distribution $P(\theta )$ on these parameters;
- past observation triplets ${\mathcal {D}}=\{(x;a;r)\}$;
- a posterior distribution $P(\theta |{\mathcal {D}})\propto P({\mathcal {D}}|\theta )P(\theta )$, where $P({\mathcal {D}}|\theta )$ is the likelihood function.
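
As a concrete instance of the elements listed above, consider the context-free Bernoulli bandit. This is a standard illustrative case and is an assumption of this example, not part of the general setup: rewards are binary, each action $a$ has an unknown success probability $\theta _{a}$, and the prior is taken to be a product of Beta distributions with hypothetical hyperparameters $\alpha _{a}$ and $\beta _{a}$. Writing $s_{a}$ and $f_{a}$ for the number of rewards equal to 1 and 0 recorded in ${\mathcal {D}}$ after playing $a$, the likelihood, prior, and posterior are

$P(r|\theta ,a)=\theta _{a}^{r}(1-\theta _{a})^{1-r},\qquad r\in \{0,1\},$

$P(\theta )=\prod _{a}\mathrm {Beta} (\theta _{a};\alpha _{a},\beta _{a}),$

$P(\theta |{\mathcal {D}})=\prod _{a}\mathrm {Beta} (\theta _{a};\alpha _{a}+s_{a},\beta _{a}+f_{a}).$

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior stays in the same family and updating it only requires incrementing two counters per action.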

Thompson sampling consists in playing the action $a^{\ast }\in {\mathcal {A}}$ according to the probability that it maximizes the expected reward; action $a^{\ast }$ is chosen with probability

$\int \mathbb {I} \left[\mathbb {E} (r|a^{\ast },x,\theta )=\max _{a'}\mathbb {E} (r|a',x,\theta )\right]P(\theta |{\mathcal {D}})d\theta ,$

where $\mathbb {I}$ is the indicator function.
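
The selection probability above rarely has a closed form, but it can be approximated by averaging the indicator over posterior draws. The following is a minimal sketch, assuming binary rewards and independent Beta posteriors per action (so that the expected reward of action $a$ under $\theta$ is simply $\theta _{a}$); the function name `selection_probabilities` and its parameters are hypothetical.

```python
import numpy as np

def selection_probabilities(alpha, beta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the probability that each action is optimal.

    Assumes independent Beta(alpha[a], beta[a]) posteriors over Bernoulli
    success probabilities (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # Each row is one draw of theta from the posterior, one column per action.
    theta = rng.beta(alpha, beta, size=(n_samples, alpha.size))
    # With Bernoulli rewards E[r | a, theta] = theta_a, so the indicator
    # selects the column holding the largest sampled value in each row.
    best = np.argmax(theta, axis=1)
    # Averaging the indicator over the draws approximates the integral.
    return np.bincount(best, minlength=alpha.size) / n_samples

# Two actions: posterior means 0.6 and 0.4.
probs = selection_probabilities(alpha=[3.0, 2.0], beta=[2.0, 3.0])
```

The returned vector sums to one and gives the frequency with which Thompson sampling would pick each action under these posteriors.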

In practice, the rule is implemented by sampling. In each round, parameters $\theta ^{\ast }$ are sampled from the posterior $P(\theta |{\mathcal {D}})$, and an action $a^{\ast }$ is chosen that maximizes $\mathbb {E} [r|\theta ^{\ast },a^{\ast },x]$, i.e. the expected reward given the sampled parameters, the action, and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round according to the posterior distribution, and then acts optimally according to them. In most practical applications, it is computationally onerous to maintain and sample from a posterior distribution over models. As such, Thompson sampling is often used in conjunction with approximate sampling techniques. ^[3]
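
The sample-then-maximize loop described above can be sketched for the simplest case in which exact posterior sampling is cheap. This example assumes a context-free bandit with binary rewards and a uniform Beta(1, 1) prior on each action; the simulated environment `true_probs` and the function name are hypothetical and serve only to make the sketch runnable.

```python
import numpy as np

def thompson_sampling_bernoulli(true_probs, n_rounds=2000, seed=0):
    """Thompson sampling on a simulated Bernoulli bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha = np.ones(n_arms)  # Beta(1, 1) prior: an assumption of this sketch
    beta = np.ones(n_arms)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        # Sample parameters theta* from the current posterior.
        theta_star = rng.beta(alpha, beta)
        # Act optimally with respect to the sampled parameters.
        a_star = int(np.argmax(theta_star))
        # Simulated environment: reward is 1 with probability true_probs[a*].
        r = int(rng.random() < true_probs[a_star])
        # Conjugate posterior update for the played action only.
        alpha[a_star] += r
        beta[a_star] += 1 - r
        pulls[a_star] += 1
    return pulls, alpha, beta

pulls, alpha, beta = thompson_sampling_bernoulli([0.2, 0.5, 0.8])
```

Early rounds spread pulls across all actions because the posteriors are wide; as evidence accumulates, samples for the inferior actions rarely come out on top and most pulls go to the best one.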

History

Thompson sampling was originally described by Thompson in 1933. ^[1] It was subsequently rediscovered numerous times independently in the context of multi-armed bandit problems. ^[4] ^[5] ^[6] ^[7] ^[8] ^[9] A first proof of convergence for the bandit case was given in 1997. ^[4] The first application to Markov decision processes was in 2000. ^[6] A related approach (see Bayesian control rule) was published in 2010. ^[5] In 2010 it was also shown that Thompson sampling is instantaneously self-correcting. ^[9] Asymptotic convergence results for contextual bandits were published in 2011. ^[7] Thompson sampling has been widely used in many online learning problems, including A/B testing in website design and online advertising, ^[10] and accelerated learning in decentralized decision making. ^[11] A Double Thompson Sampling (D-TS) ^[12] algorithm has been proposed for dueling bandits, a variant of the traditional multi-armed bandit (MAB) in which feedback comes in the form of pairwise comparisons.

Relationship to other approaches

Probability matching

Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations. ^[5] In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let $a_{1},a_{2},\ldots ,a_{T}$ be the actions issued by an agent up to time $T$, and let $o_{1},o_{2},\ldots ,o_{T}$ be the observations gathered by the agent up to time $T$. Then, the agent issues the action $a_{T+1}$ with probability: ^[5]

$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T}),$

where the "hat" notation ${\hat {a}}_{t}$ denotes the fact that $a_{t}$ is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs $\theta \in \Theta$ over its behaviors, then the Bayesian control rule becomes

$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T})=\int _{\Theta }P(a_{T+1}|\theta ,{\hat {a}}_{1:T},o_{1:T})P(\theta |{\hat {a}}_{1:T},o_{1:T})\,d\theta ,$

where $P(\theta |{\hat {a}}_{1:T},o_{1:T})$ is the posterior distribution over the parameter $\theta$ given actions $a_{1:T}$ and observations $o_{1:T}$.

In practice, the Bayesian control rule amounts to sampling, at each time step, a parameter $\theta ^{\ast }$ from the posterior distribution $P(\theta |{\hat {a}}_{1:T},o_{1:T})$, where the posterior distribution is computed using Bayes' rule by only considering the (causal) likelihoods of the observations $o_{1},o_{2},\ldots ,o_{T}$ and ignoring the (causal) likelihoods of the actions $a_{1},a_{2},\ldots ,a_{T}$, and then sampling the action $a_{T+1}^{\ast }$ from the action distribution $P(a_{T+1}|\theta ^{\ast },{\hat {a}}_{1:T},o_{1:T})$.
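
The two sampling steps of the Bayesian control rule can be illustrated on a deliberately small problem. The sketch below assumes a finite hypothesis set in which hypothesis $k$ claims that arm $k$ pays off with probability `p_hi` and every other arm with `p_lo`, and in which the behaviour attached to hypothesis $k$ always plays arm $k$. The environment, the names `obs_prob` and `policy`, and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np

def bayesian_control_rule(good_arm, n_arms=3, p_hi=0.8, p_lo=0.3,
                          n_steps=300, seed=0):
    """Bayesian control rule over a finite set of behaviours (illustrative)."""
    rng = np.random.default_rng(seed)
    # obs_prob[k, a] = P(o = 1 | theta = k, action a)
    obs_prob = np.full((n_arms, n_arms), p_lo)
    np.fill_diagonal(obs_prob, p_hi)
    # policy[k, a] = P(a | theta = k): behaviour k always plays arm k.
    policy = np.eye(n_arms)
    log_post = np.log(np.full(n_arms, 1.0 / n_arms))  # uniform prior
    for _ in range(n_steps):
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        # Step 1: sample a behaviour theta* from the posterior.
        theta_star = rng.choice(n_arms, p=post)
        # Step 2: sample the next action from that behaviour's policy.
        a = rng.choice(n_arms, p=policy[theta_star])
        # Simulated environment: the true payoff structure is `good_arm`.
        o = int(rng.random() < obs_prob[good_arm, a])
        # Posterior update uses the observation likelihood only. The action
        # was an intervention, so P(a | theta) does not enter the update.
        lik = obs_prob[:, a] if o == 1 else 1.0 - obs_prob[:, a]
        log_post += np.log(lik)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

posterior = bayesian_control_rule(good_arm=1)
```

Leaving the action likelihoods out of the update is what distinguishes this from ordinary Bayesian conditioning on the full history: the agent does not treat its own choices as evidence about which behaviour is correct.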

Upper-Confidence-Bound (UCB) algorithms

Thompson sampling and upper-confidence bound algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic". Leveraging this property, one can translate regret bounds established for UCB algorithms to Bayesian regret bounds for Thompson sampling ^[13] or unify regret analysis across both these algorithms and many classes of problems. ^[14]
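
The shared "optimism" can be seen by comparing the per-action scores each method maximizes. The sketch below uses UCB1 as a representative upper-confidence-bound algorithm and a Beta(1, 1) prior for Thompson sampling; both choices, the function names, and the example counts are assumptions of this illustration and are not taken from the cited analyses.

```python
import math
import numpy as np

def ucb1_index(successes, pulls, t):
    """UCB1 score: empirical mean plus a width that shrinks with more pulls."""
    successes = np.asarray(successes, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    return successes / pulls + np.sqrt(2.0 * math.log(t) / pulls)

def thompson_index(successes, pulls, rng):
    """Thompson score: one posterior draw per action, Beta(1, 1) prior."""
    successes = np.asarray(successes, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    return rng.beta(1.0 + successes, 1.0 + pulls - successes)

# Action 0 is well explored (100 pulls); action 1 has only 5 pulls.
successes, pulls = [40, 3], [100, 5]
rng = np.random.default_rng(0)
ucb = ucb1_index(successes, pulls, t=105)
draws = np.array([thompson_index(successes, pulls, rng) for _ in range(2000)])
```

For the rarely played action the UCB1 width is large and the Thompson draws are widely spread, so in both cases an action with little data keeps a real chance of scoring highest and being explored.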

References

1. Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933.
2. Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450-456.
3. Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96. https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
4. J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
5. P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010.
6. M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1701
7. B. C. May, N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
8. Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling
9. O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3 (2), 2010, 207-234.
10. Ian Clarke. "Proportionate A/B testing", September 22nd, 2011, http://blog.locut.us/2011/09/22/proportionate-ab-testing/
11. Granmo, O. C.; Glimsdal, S. (2012). "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence. 38 (4): 479–488. doi: 10.1007/s10489-012-0346-z. hdl: 11250/137969. S2CID 8746483.
12. Wu, Huasen; Liu, Xin; Srikant, R (2016), Double Thompson Sampling for Dueling Bandits, arXiv: 1604.07101, Bibcode: 2016arXiv160407101W
13. Daniel J. Russo and Benjamin Van Roy (2014), "Learning to Optimize Via Posterior Sampling", Mathematics of Operations Research, Vol. 39, No. 4, pp. 1221-1243, 2014. https://pubsonline.informs.org/doi/abs/10.1287/moor.2014.0650
14. Daniel J. Russo and Benjamin Van Roy (2013), "Eluder Dimension and the Sample Complexity of Optimistic Exploration", Advances in Neural Information Processing Systems 26, pp. 2256-2264. https://proceedings.neurips.cc/paper/2013/file/41bfd20a38bb1b0bec75acf0845530a7-Paper.pdf
