Introduction to Monte Carlo Methods

Introduction

At this point, we already know several techniques to simulate various types of random variables. We can use an idea similar to one of these methods, the \textbf{acceptance-rejection method}, to compute integrals of functions. The idea is the following.

In the acceptance-rejection method we wanted to generate piles with a certain height, and this height should agree with the value of the pdf at that point. This is a simplification, but it matches the general idea. The point was that the maximum height was 1, and so we could reject the excess of the piles by simulating a uniform random variable in the interval $[0,1]$.
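
To make the analogy concrete, here is a minimal sketch of the acceptance-rejection idea (not the code used later in these notes): the target density, a Beta(2,2) on $[0,1]$, and its bound $M$ are illustrative choices.

In [ ]:
import numpy as np

def target_pdf(x):
    # illustrative density on [0,1]: Beta(2,2), bounded above by M = 1.5
    return 6.0 * x * (1.0 - x)

M = 1.5          # upper bound of the density on [0,1]
nprop = 10000    # number of proposed points ("piles")

x = np.random.uniform(0.0, 1.0, nprop)   # candidate positions
u = np.random.uniform(0.0, 1.0, nprop)   # uniform heights used to trim the piles
accepted = x[u * M <= target_pdf(x)]     # keep each candidate with probability f(x)/M

print(len(accepted) / nprop)             # acceptance rate, close to 1/M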

Suppose for the moment that the function we want to integrate is no longer a density function, and that we want to compute its integral over a finite interval. We can still use random numbers to obtain a good approximation of the integral in question. All methods that use random numbers to compute some quantity are called \textbf{Monte Carlo methods}. There are some common features to all these methods that we will introduce along the way, but as a general reference, their effectiveness rests on two big ideas.

The first comes from the stability of the average: if we simulate a random variable a large number of times, the sample average is very close to the expected value. The second feature is dimensionality. Deterministic numerical methods degrade steadily as the dimension of the problem increases. For Monte Carlo methods this degradation is much slower.

We start by computing the same integral by two different MC methods. Consider the integral

\begin{equation}\label{eq:MC01.10} \int_{0.1}^2\frac{1}{x}dx. \end{equation}

Using the interpretation of the definite integral as the mean value of the function times the length of the interval, and defining $g(x)=1/x$, we get

\begin{equation}\label{eq:MC01.20} \int_{0.1}^2\frac{1}{x}dx = (2-0.1)\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^ng(x_k), \end{equation}

assuming that the maximum distance between consecutive elements of the partition of the interval goes to zero as the number of elements goes to infinity. So, basically, all we need to do is generate uniform points in the interval $[0.1,2]$, compute the mean of the function over these points, and multiply by the length of the interval. This is done in the following Python code.

In [18]:
import numpy as np
import matplotlib.pyplot as plt
import math
import random
In [19]:
plt.rcParams["figure.figsize"] = (12,8)
In [20]:
def integrand1(x):
    return 1.0 / x 
In [21]:
xmin = 0.1 
xmax = 2.0
amplitude = xmax - xmin

npts = 100

# uniform random points in [xmin, xmax]
pontos = xmin + amplitude * np.random.uniform(0,1,npts)
# images of the (sorted) points under the integrand
images = np.array(list(map(lambda x : integrand1(x) , np.sort(pontos))))

plt.plot(np.sort(pontos),images)
plt.scatter(np.sort(pontos),images,edgecolor = "r",s=15,alpha=0.8)
plt.show()

The integral (\ref{eq:MC01.10}) is easy to compute by direct integration. Indeed

\begin{equation}\label{eq:MC01.30} \int_{0.1}^2\frac{1}{x}dx = \ln(2) - \ln(0.1). \end{equation}

The approximation of the value of the integral is given simply as

In [22]:
approx = amplitude * images.mean()
print(approx)
2.611952075675351

While the exact value is

In [23]:
exact1 = math.log(xmax) - math.log(xmin)
print(exact1)
2.995732273553991

Using the same number of points with the trapezoidal rule, we get

In [24]:
trapts = np.linspace(xmin,xmax,npts,endpoint=True)
imgpts = np.array(list(map(lambda x : integrand1(x) , trapts)))

trapvalue = (2*imgpts.sum() - imgpts[0] - imgpts[npts - 1])*amplitude /(2.0 * (npts-1.0))

print(trapvalue)
2.998782900298261

It is clear that, with the same number of points, the approximation obtained with the trapezoidal rule is by far better than the one we got using MC. Where does this discrepancy come from? Before we address this issue, we will consider a second example that will help to clarify one of the points at stake.

In [25]:
def integrand2(x):
    return x * math.sin(x)

We want to compute the integral

\begin{equation}\label{eq:MC01.40} \int_0^1 x\sin(x) dx. \end{equation}

As before, we will generate random numbers and try to understand how close we got to the real value of the integral in (\ref{eq:MC01.40}).

In [26]:
xmin = 0 
xmax = 1

amplitude = xmax - xmin 

npts = 20

pontos2 = np.random.uniform(xmin , xmax , npts)
pontos2 = np.sort(pontos2)
images2 = np.array(list(map(lambda x : integrand2(x), pontos2)))

plt.plot(pontos2 , images2)
plt.scatter(pontos2 , images2 , s = 15 , edgecolors="r" , alpha=0.8)
plt.show()

One of the problems with generating random points to compute the average of a function over an interval is evident from the previous plot. The points may fall very close to each other in regions where the variation of the function is small, and may be scarce in regions where the variation is large. Even so, we can compare the MC method used here with the basic trapezoidal rule.

First of all, we define a general function that applies the trapezoidal rule to a given array of function values, so that we can choose the points at which the images are computed.

In [27]:
def trapRule(xmin , xmax , images_array):
    # composite trapezoidal rule over the given images; use the last element
    # of the array instead of relying on the global npts
    soma = (2*images_array.sum() - images_array[0] - images_array[-1])
    return  soma * (xmax - xmin) /(2.0 * (len(images_array) - 1.0))
In [28]:
exactb = math.sin(1) - math.cos(1)
print(exactb)
0.30116867893975674

Integrating by parts, the value of the integral (\ref{eq:MC01.40}) is

\begin{equation}\label{eq:MC01.50} \int_0^1 x\sin(x) dx = \sin(1)-\cos(1)\approx 0.30117 \end{equation}
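
For completeness, the integration by parts step is

\begin{equation*} \int_0^1 x\sin(x)\,dx = \Big[-x\cos(x)\Big]_0^1 + \int_0^1\cos(x)\,dx = -\cos(1)+\sin(1). \end{equation*}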

Using the points generated above to compute an approximation of the value of (\ref{eq:MC01.40}) with the trapezoidal rule, we get

In [29]:
print(trapRule(xmin , xmax , images2))
0.39750356223398137

Of course, the typical approach for the trapezoidal rule would be to sample equally spaced points in the interval.

In [30]:
trapts2 = np.linspace(xmin , xmax , npts , endpoint=True)
trapim2 = np.array(list(map(lambda x : integrand2(x) , trapts2)))
print(trapRule(xmin , xmax , trapim2))
0.3014876805087933

\textbf{Research Question 1:} One point that comes up is that when we compute an approximation of the value of the integral of the function $f(x) = x\sin(x)$ using the trapezoidal rule at randomly generated points, the result is a random variable. Can you characterize this random variable, simulating its pdf, cdf and common parameters? (A possible starting point is sketched after Research Question 2.)

\textbf{Research Question 2:} Can you do exactly the same thing for the function $f(x) = 1/x$?
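
As a possible starting point for Research Question 1 (easily adapted to Question 2), the sketch below treats the trapezoidal rule evaluated at sorted random points as a random variable and samples it repeatedly; the number of trials and the bin count are arbitrary choices.

In [ ]:
# sample the random variable "trapezoidal rule at sorted random points"
ntrials = 5000
npts_rq = 20

estimates = np.empty(ntrials)
for i in range(ntrials):
    pts = np.sort(np.random.uniform(0.0, 1.0, npts_rq))
    imgs = np.array([integrand2(x) for x in pts])
    estimates[i] = trapRule(0.0, 1.0, imgs)

print(estimates.mean(), estimates.std())      # common parameters

plt.hist(estimates, bins=60, density=True)    # empirical pdf
plt.show()

plt.plot(np.sort(estimates), np.linspace(0, 1, ntrials))  # empirical cdf
plt.show()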

Error Analysis of Monte Carlo Methods

So far, the computation of an integral of a function $g$ over the interval $[a,b]$ was based on the fundamental idea about integrals: the integral is the average of the function on the interval times the amplitude of the interval. If you stop for a moment, this is the common feature of all the integrals that you may find in your life (I should be careful here, because I am referring to the integrals of Riemann, Lebesgue and Itô, the ones I have computed so far in my professional life). This idea can be written as

\begin{equation}\label{eq:MC01.60} \int_a^bg(x) dx \approx (b-a)\frac{1}{n}\sum_{k=1}^{n}g(x_k), \end{equation}

where $(x_k)$ is a set of $n$ points in the interval $[a,b]$ (in the Monte Carlo approach, sampled uniformly at random). As we have already seen, to produce reasonable results we need many points in our simulation, many more than we would need to reach the same accuracy with traditional numerical methods. The question that naturally follows is: can we predict the error, knowing the integrand and the number of points that we will use in the simulation?
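
Before turning to that question, and for later reference, the estimator in (\ref{eq:MC01.60}) can be wrapped in a small helper function; this is just a convenience sketch, and the name mc_integrate does not appear elsewhere in these notes.

In [ ]:
def mc_integrate(g, a, b, n):
    # crude Monte Carlo: average of g at n uniform points in [a,b], times (b - a)
    xs = np.random.uniform(a, b, n)
    return (b - a) * np.mean([g(x) for x in xs])

# the two integrals considered above
print(mc_integrate(integrand1, 0.1, 2.0, 10**5))
print(mc_integrate(integrand2, 0.0, 1.0, 10**5))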

In order to answer this question, we need some results from Probability Theory that are sometimes overlooked, but here get a full demonstration of their usefulness.

Consider a non-decreasing and positive function $g$ and a random variable $\eta$ whose density is $f$. It is straightforward to show that

\begin{equation}\label{eq:MC01.70} \mathbb{E}(g(\eta)) = \int_{-\infty}^{+\infty}g(x)f(x)dx\geq g(a)\int_{a}^{+\infty}f(x)dx\Leftrightarrow \mathbb{P}(\eta\geq a)\leq \frac{\mathbb{E}(g(\eta))}{g(a)}. \end{equation}

For example, when $g(x)= x^2$, the final inequality in (\ref{eq:MC01.70}) becomes (\textbf{Markov's inequality})

\begin{equation}\label{eq:MC01.80} \mathbb{P}(\eta\geq a)\leq \frac{\mathbb{E}(\eta^2)}{a^2}. \end{equation}

Replacing $\eta$ with $\xi = |\eta-\mathbb{E}(\eta)|$ and $a=k\sigma$, with $\sigma$ representing the standard deviation of $\eta$, we reach the \textbf{Tchebycheff inequality}

\begin{equation}\label{eq:MC01.90} \mathbb{P}\left(|\eta-\mathbb{E}(\eta)|\geq k\sigma\right)\leq\frac{1}{k^2}. \end{equation}
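
Inequality (\ref{eq:MC01.90}) is easy to check numerically; in the sketch below the exponential distribution and the sample size are arbitrary choices.

In [ ]:
# empirical check of the Tchebycheff inequality for an exponential random variable
nsamples = 10**6
eta = np.random.exponential(scale=1.0, size=nsamples)   # mean 1, standard deviation 1
mu, sigma = eta.mean(), eta.std()

for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(eta - mu) >= k * sigma)   # estimated probability
    print(k, empirical, 1.0 / k**2)                      # stays below the bound 1/k^2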

Implicit in our discussion is that both the mean and the variance of the random variable $\eta$ are well defined. To estimate error bounds for the MC method that we are studying, we need to translate inequality (\ref{eq:MC01.90}) into a more workable form.

Consider a sequence of i.i.d. random variables $(\eta_k)$, $k\in\mathbb{N}$, with $\mathbb{E}(\eta_k) = \mu$ and $\mathbb{E}(\eta_k-\mu)^2 = \sigma^2$. Then, due to independence, if

\begin{equation}\label{eq:MC01.100} \eta = \frac{1}{n}\sum_{k=1}^{n}\eta_k \Rightarrow \mathbb{E}(\eta)=\mu\text{ and }\sigma(\eta)=\frac{\sigma}{\sqrt{n}}. \end{equation}
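
The statement about the standard deviation follows directly from the independence of the $\eta_k$:

\begin{equation*} \mathrm{Var}(\eta)=\mathrm{Var}\left(\frac{1}{n}\sum_{k=1}^{n}\eta_k\right)=\frac{1}{n^2}\sum_{k=1}^{n}\mathrm{Var}(\eta_k)=\frac{\sigma^2}{n}, \qquad\text{hence}\qquad \sigma(\eta)=\frac{\sigma}{\sqrt{n}}. \end{equation*}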

Then, the conjugation of (\ref{eq:MC01.90}) and (\ref{eq:MC01.100}) results in

\begin{equation}\label{eq:MC01.110} \mathbb{P}\left(|\eta-\mathbb{E}(\eta)|\geq \frac{k}{\sqrt{n}}\sigma\right)\leq\frac{1}{k^2}. \end{equation}

It becomes clearer if we apply Markov's inequality directly (or choose $k=n^{1/4}/\sigma$ above), getting

\begin{equation}\label{eq:MC01.120} \mathbb{P}\left(|\eta-\mathbb{E}(\eta)|\geq n^{-1/4}\right)\leq\frac{\sigma^2}{\sqrt{n}}. \end{equation}

In equation (\ref{eq:MC01.120}), letting $n\to\infty$ we recover the \textbf{Weak Law of Large Numbers}, which states that for any $\varepsilon>0$,

\begin{equation}\label{eq:MC01.130} \mathbb{P}\left(|\eta-\mathbb{E}(\eta)|\geq \varepsilon\right)\to 0, \text{ as } n\to\infty. \end{equation}

So, while the result (\ref{eq:MC01.130}) guarantees that the error goes to zero as the number of points increases, inequality (\ref{eq:MC01.120}) gives a general bound for that error as a function of $n$. More precisely, the error is of order $O\left(n^{-1/2}\right)$. The same inequality also shows that this error bound scales with the variance of the variables whose mean we are computing. This immediately explains why we get much worse results when computing the integral of the function $1/x$ on an interval close to the origin than for the integral of the function $x\sin(x)$: the variance in the first case is much bigger than in the second. When the number of points is small, the size of the variance matters a lot.
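
This effect is easy to see numerically: the sketch below compares the sample variance of the two integrands and the corresponding crude MC errors at the same sample sizes (the sample sizes are arbitrary choices).

In [ ]:
# the variance of g(U) drives the MC error: compare the two integrands
u1 = np.random.uniform(0.1, 2.0, 10**5)
u2 = np.random.uniform(0.0, 1.0, 10**5)
print("sample variance of 1/U on [0.1,2]    :", (1.0 / u1).var())
print("sample variance of U*sin(U) on [0,1] :", (u2 * np.sin(u2)).var())

# crude MC errors at the same sample sizes
for n in [10**3, 10**4, 10**5]:
    x1 = np.random.uniform(0.1, 2.0, n)
    x2 = np.random.uniform(0.0, 1.0, n)
    print(n,
          abs(1.9 * np.mean(1.0 / x1) - exact1),
          abs(np.mean(x2 * np.sin(x2)) - exactb))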

These powerful ideas will become vivid in the analysis of the examples.

Second part of the story: the computation of the exact values of both $\mathbb{E}(\eta)$ and $\sigma^2$ in (\ref{eq:MC01.120}) involves the computation of the integral (\ref{eq:MC01.60}). Well, that was our goal in the first place! So, instead of having a closed formula like (\ref{eq:MC01.120}), we have a dynamic process

\begin{equation}\label{eq:MC01.140} \mathbb{P}\left(|\eta-\mu_n|\geq n^{-1/4}\right)\leq\frac{s_n^2}{\sqrt{n}}, \end{equation}

where $\mu_n$ and $s_n^2$ are the incremental mean and variance. So indeed, the whole process is useless unless we can stabilize the incremental variance. When we reach that point, as a side effect, the mean is already stabilized. From that point on, we just keep increasing the number of points, working with the stabilized incremental variance.

Recall that the incremental variance is computed via a recurrence relation over the sums of squares $S_n = \sum_{k=1}^n(x_k-\mu_n)^2$, given by

\begin{equation}\label{eq:MC01.150} S_n = S_{n-1} + (x_n-\mu_n)(x_n -\mu_{n-1}). \end{equation}
In [31]:
# we define the incremental mean and incremental variance by the usual method
# npts are the number of points in the previous computation of the mean

def incMean(oldmean , newvalue , npts):
    return (oldmean*npts + newvalue)/(npts + 1.0)

# the incremental sum of squares
def incSOS(oldsum , oldmean , newmean , newvalue):
    return oldsum + (newvalue - newmean) * (newvalue - oldmean)
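
As a quick sanity check (not part of the original computation), the incremental formulas can be compared against the direct numpy results on arbitrary data.

In [ ]:
# sanity check: the incremental formulas against direct numpy computations
data = np.random.uniform(0.0, 1.0, 1000)

mean_inc, sos_inc, count = data[0], 0.0, 1
for value in data[1:]:
    new_mean = incMean(mean_inc, value, count)
    sos_inc = incSOS(sos_inc, mean_inc, new_mean, value)
    mean_inc = new_mean
    count += 1

print(mean_inc - data.mean())                       # should be ~ 0
print(sos_inc / (count - 1.0) - data.var(ddof=1))   # should be ~ 0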
In [32]:
# recall that integrand2(x) = x * sin(x)
var_error = 0.5 * math.pow(10,-3)
# a large initial value just to prevent a premature stop
run_error = 10


# initialization of the variables
# the image of a random number in the interval (0,1)
valor = integrand2(random.uniform(0,1))
oldmean = valor
somavelha = 0

# rolling window of recent values of the running sum of squares
somalista = []
somalista.append(somavelha)

# just counts the number of points used in the simulation
numpts = 1

# the number of points used in the rolling mean 
rollpts = 200

# a preliminary run just to initialize the list of reference values for the rolling window
for i in range(rollpts):
    valor = integrand2(random.uniform(0,1))
    newmean = incMean(oldmean , valor , numpts)
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    oldmean = newmean
    somavelha = somanova 
    
    numpts += 1
    
    somalista.append(somavelha) 

# a second run to stabilize the variance 
while (run_error > var_error):
    
    valor = integrand2(random.uniform(0,1))
    newmean = incMean(oldmean , valor , numpts)
    
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    referencia = np.mean(somalista)
    run_error = math.fabs((somanova - referencia)/(numpts - 1.0))

    # update the variables
    oldmean = newmean
    somavelha = somanova
    numpts += 1
    
    somalista.pop(0)
    somalista.append(somavelha)

print(numpts)
print(newmean , somanova/(numpts - 1.0))

# a third run, now that the variance was stabilized, we make the var/sqrt(n) converge to zero.
# the convergence is the result of assuming that the variance exists!

run_error2 = somanova/(numpts-1.0)

while (run_error2 > var_error):
    valor = integrand2(random.uniform(0,1))
    newmean = incMean(oldmean , valor , numpts)
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    numpts += 1 
    
    run_error2 = somanova/(math.sqrt(numpts)*(numpts - 1.0))

    somavelha = somanova
    oldmean = newmean

print("\n")
print("The number of points used in the simulation: %d " % numpts)
print("The approximate value for the integral: %f" % newmean)
print("The error in relation to the exact value: %f" % math.fabs(newmean - exactb))
11677
0.2968466685119233 0.06492885513422676


The number of points used in the simulation: 17021 
The approximate value for the integral: 0.299523
The error in relation to the exact value: 0.001646

So we can now do the same thing for the function $1/x$ in the interval $[0.1,2]$.

In [33]:
xmin = 0.1
xmax = 2.0

# recall that integrand1(x) = 1 / x
var_error = 0.5 * math.pow(10,-3)
# a large initial value just to prevent a premature stop
run_error = 10

# initialization of the variables
# the image of a random number in the interval (xmin, xmax)
valor = integrand1(random.uniform(xmin , xmax))
oldmean = valor
somavelha = 0

# rolling window of recent values of the running sum of squares
somalista = []
somalista.append(somavelha)

# just counts the number of points used in the simulation
numpts = 1

# the number of points used in the rolling mean 
rollpts = 20

# a preliminary run just to initialize the list of reference values for the rolling window
for i in range(rollpts):
    valor = integrand1(random.uniform(xmin , xmax))
    newmean = incMean(oldmean , valor , numpts)
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    oldmean = newmean
    somavelha = somanova 
    
    numpts += 1
    
    somalista.append(somavelha) 

# a second run to stabilize the variance 
while (run_error > var_error):
    
    valor = integrand1(random.uniform(xmin , xmax))
    newmean = incMean(oldmean , valor , numpts)
    
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    referencia = np.mean(somalista)
    run_error = math.fabs((somanova - referencia)/(numpts - 1.0))

    # update the variables
    oldmean = newmean
    somavelha = somanova
    numpts += 1
    
    somalista.pop(0)
    somalista.append(somavelha)

print(numpts)
print(newmean *(xmax - xmin) , math.fabs(newmean *(xmax - xmin) - exact1))

# a third run, now that the variance was stabilized, we make the var/sqrt(n) converge to zero.
# the convergence is the result of assuming that the variance exists!

run_error2 = somanova/(numpts-1.0)

while (run_error2 > var_error):
    valor = integrand1(random.uniform(xmin , xmax))
    newmean = incMean(oldmean , valor , numpts)
    somanova = incSOS(somavelha , oldmean , newmean , valor)
    
    numpts += 1 
    
    run_error2 = somanova/(math.sqrt(numpts)*(numpts - 1.0))

    somavelha = somanova
    oldmean = newmean

integral1 = (xmax - xmin) * newmean
    
print("\n")
print("The number of points used in the simulation: %d " % numpts)
print("The approximate value for the integral: %f" % integral1)
print("The error in relation to the exact value: %f" % math.fabs(integral1 - exact1))
8716
3.003748358469428 0.008016084915436927


The number of points used in the simulation: 25265925 
The approximate value for the integral: 2.995417
The error in relation to the exact value: 0.000316

Notice that at the second step we already have a very nice approximation. The (huge) cost of the last step reflects the fact that the approximation is of order $O(1/\sqrt{n})$. Notice also that the final error is much smaller than the one imposed to stop the run. The generality of the Tchebycheff inequality comes with a price!

We now compare this with the solution of the same problem using the more traditional trapezoidal rule. The \textbf{Composite Trapezoidal Rule} states (see \cite{burden2010}) that for a function $g\in C^2[a,b]$, with $h=(b-a)/n$ and $x_k = a +kh$, for $k\in\{0,\dots,n\}$, there is a value $c\in(a,b)$ such that

\begin{equation}\label{eq:MC01.160} \int_a^bg(x)dx = \frac{h}{2}\left[g(a)+2\sum_{k=1}^{n-1}g(x_k)+g(b)\right]-\frac{(b-a)}{12}h^2g''(c). \end{equation}

A brute force application of (\ref{eq:MC01.160}) to the integral in (\ref{eq:MC01.10}) results in

\begin{equation}\label{eq:170} \int_{0.1}^2\frac{1}{x}dx = \frac{h}{2}\left[g(a)+2\sum_{k=1}^{n-1}g(x_k)+g(b)\right] + \varepsilon(n), \end{equation}

where $|\varepsilon(n)| \leq \frac{(1.9)^3}{12 n^2}\max_{x\in[0.1,2]}|g''(x)| = \frac{(1.9)^3\times 2000}{12 n^2}$, since $g''(x)=2/x^3$ attains its maximum over $[0.1,2]$ at $x=0.1$. If we want, as before, that $\varepsilon(n)<0.5\times 10^{-3}$, it is enough (with this rather pessimistic bound on $g''$) to have

\begin{equation}\label{eq:MC01.180} n\geq\sqrt{\frac{(1.9)^3\times 2000}{12\times 0.5\times 10^{-3}}}\Leftrightarrow n \geq 1513. \end{equation}

We test this theoretical result below, using 1518 points, slightly more than required.

In [34]:
xmin = 0.1 
xmax = 2.0

ctnumpts = 1518

comptrapts = np.linspace(xmin , xmax , ctnumpts , endpoint=True)
comptraptsimg = np.array(list(map(lambda x : integrand1(x) , comptrapts)))

cptrap1 = trapRule(xmin , xmax , comptraptsimg)

print(cptrap1,exact1,math.fabs(cptrap1 - exact1))
2.990999858098007 2.995732273553991 0.004732415455984018

So we made our case in the worst possible manner. We started by presenting a very elegant algorithm, we called it Monte Carlo, with all the glamour that the name involves, and we saw that the simple trapezoidal rule beats it badly. Indeed, for the integral (\ref{eq:MC01.10}), for each point used in the trapezoidal rule we needed to generate more than 15000 random points to guarantee the same precision. \textbf{So why all this excitement with Monte Carlo?}

Well, there is an important phenomenon called the \textbf{curse of dimensionality}. The basic idea is that most algorithms tend to deteriorate as the dimension increases. For example, in the case of the Composite Trapezoidal Rule, the order of approximation changes with the dimension $d$ as $O(n^{-2/d})$. At the same time Monte Carlo, depending only on the variance, stays at $O(1/\sqrt{n})$. This explains why Monte Carlo methods are so useful in practice when the dimension of the problem is very high. This is particularly true when we simulate paths through some discretization scheme. The idea is the following: if we discretize a path in $n$ steps, we are in fact considering a problem that lives in an $n$-dimensional space. Since $n$ can be arbitrarily large, Monte Carlo becomes the ruling paradigm.
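
A small illustration of this point, with all choices below (the integrand $(x_1+\dots+x_d)^2$ on $[0,1]^d$, the dimension, the number of nodes) being arbitrary: the tensor-product trapezoidal grid and crude MC use the same total number of points, and in a typical run the MC estimate is already the more accurate of the two.

In [ ]:
from itertools import product

# integrate (x_1 + ... + x_d)^2 over [0,1]^d; exact value is d/3 + d(d-1)/4
d = 8
exact_d = d / 3.0 + d * (d - 1) / 4.0

def f_d(x):
    return x.sum(axis=-1) ** 2

# tensor-product trapezoidal rule with m nodes per axis => m**d nodes in total
m = 3
nodes = np.linspace(0.0, 1.0, m)
weights = np.full(m, 1.0 / (m - 1))
weights[0] *= 0.5
weights[-1] *= 0.5

trap_val = 0.0
for multi_index in product(range(m), repeat=d):
    idx = list(multi_index)
    trap_val += np.prod(weights[idx]) * f_d(nodes[idx])

# crude Monte Carlo with the same total number of points
samples = np.random.uniform(0.0, 1.0, size=(m**d, d))
mc_val = f_d(samples).mean()

print("exact value        :", exact_d)
print("trapezoid error    :", abs(trap_val - exact_d))
print("Monte Carlo error  :", abs(mc_val - exact_d))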

References

(Burden and Faires, 2010) R.L. Burden and J.D. Faires, ``Numerical Analysis'', 2010.
