# Dirichlet Distribution

Dirichlet Distribution(DD):
* — Symmetric DD can be considered a distribution of distributions
* — Each sample from Symmetric DD is a categorical distribution over K categories.
* — Generates samples that are similar discrete distributions
* — It is parameterized G0, a distribution over K categories and $\alpha$ a scale
factor
<code>
import numpy as np
from scipy.stats import dirichlet
np.set_printoptions(precision=2)

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
samples = dirichlet(alpha = scale_factor * np.array(G0)).rvs(N)
print ” alpha:”, scale_factor
print ” element-wise mean:”, samples.mean(axis=0)
print “element-wise standard deviation:”, samples.std(axis=0)
print

for scale in [0.1, 1, 10, 100, 1000]:
stats(scale)
</code>

## Dirichlet Process(DP)

• — A way to generalize Dirichlet Distribution
• — Generate samples that are distributions similar to the parameter $H_0$
• — Also has the parameter $\alpha$ determines how much will the samples
vary from H0

• — a sample H of $DP( \alpha, H_0)$ is constructed by drawing a countably
infinite number of samples $\theta k$ and then setting
$H = \sum_{k=1}^{\infty} \pi_k * \delta(x-\theta_k)$

where $\pi_k$ — is carefully chosen weights that sum to 1
$\delta$ — is the Dirac delta function
* — Since the samples from a DP are similar to a parameter $H_0$ one-way to
test if a DP is generating your dataset is to check if the distributions(of
different attributes/dimensions) you get are similar to each other. Something like
permutation test, but understand the assumptions and caveats.

• — The code for dirichlet sampling can be written as:
<code>
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
betas = []
pis = []
betas.append(beta(1, alpha).rvs())
pis.append(betas[0])
while sum(pis) < (1.-tol):
s = np.sum([np.log(1 – b) for b in betas])
new_beta = beta(1, alpha).rvs()
betas.append(new_beta)
pis.append(new_beta * np.exp(s))
pis = np.array(pis)
thetas = np.array([base_measure() for _ in pis])
return pis, thetas

def plot_normal_dp_approximation(alpha):
plt.figure()
plt.title(“Dirichlet Process Sample with N(0,1) Base Measure”)
plt.suptitle(“alpha: %s” % alpha)
pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
pis = pis * (norm.pdf(0) / pis.max())
plt.vlines(thetas, 0, pis, )
X = np.linspace(-4,4,100)
plt.plot(X, norm.pdf(X))

plot_normal_dp_approximation(.1)
plot_normal_dp_approximation(1)
plot_normal_dp_approximation(10)
plot_normal_dp_approximation(1000)
</code>

• — The code for Dirichlet process can be written as :
<code>
from numpy.random import choice

class DirichletProcessSample():
def init(self, base_measure, alpha):
self.base_measure = base_measure
self.alpha = alpha

self.cache = []
self.weights = []
self.total_stick_used = 0.

def call(self):
remaining = 1.0 – self.total_stick_used
i = DirichletProcessSample.roll_die(self.weights + [remaining])
if i is not None and i < len(self.weights) :
return self.cache[i]
else:
stick_piece = beta(1, self.alpha).rvs() * remaining
self.total_stick_used += stick_piece
self.weights.append(stick_piece)
new_value = self.base_measure()
self.cache.append(new_value)
return new_value

@staticmethod
def roll_die(weights):
if weights:
return choice(range(len(weights)), p=weights)
else:
return None
</code>