Stochastic Gradient Descent

Online Algorithm to Compute the mean

Suppose that we have a sample, call it $(x_i)$ with $i\in\{0,\dots,N-1\}$, and we want to estimate its mean value, call it mu. The usual method to do it is to use the sample mean from which we obtain\

\begin{equation}\label{eq:sgd.10} \hat\mu_N = \frac{1}{N}\sum_{i=0}^{N-1} x_i. \end{equation}

Plain and simple. Now suppose that $N$ is very large, really, really large and you have a old, really, really old computer. Your computer is so old that it does not have enough memmory to load the array on it (and you do not know how to do incremental mean). Because your sample is really big, it comes to you that perhaps you do not need to compute the sum of all elements. The idea is that under a certain degree of accuracy

$$ \hat\mu_N = \hat\mu_{N-1} = \dots = \hat\mu_{N-p}. $$

If $N$ is big then $p$ may be also big. Indeed, for $N=0,\dots,k$ we easily derive the incremental mean function as

\begin{equation}\label{eq:sgd.30} \hat\mu_N = \left(\frac{N}{N+1}\right) \hat\mu_{N-1} + \left(\frac{1}{N+1}\right)x_N. \end{equation}
In [1]:
import numpy as np
import matplotlib.pyplot as plt

import math

from statsmodels.distributions.empirical_distribution import ECDF
from scipy.stats import norm
In [2]:
plt.rcParams['figure.figsize'] = (10.0, 6.0)

The next function is just the translation in Python of the expression (\ref{eq:sgd.30}).

In [3]:
def incremental_mean(oldmean , newvalue , npts):
    
    #out = ((npts - 1.0)/npts) * oldmean + 1.0 * newvalue / npts
    out = (npts/(npts + 1.0)) * oldmean + newvalue / (npts + 1.0)
    return out  
In [4]:
data_mean = 2 
data_std = 3
data_size = 5000

data = data_mean + data_std * np.random.randn(data_size)
print(data.mean())

print(data[0:5])
1.9081369088188072
[ 2.05771157  0.27681751 -1.16699371  4.81914887  6.89339327]
In [5]:
media = 0
comprimento = data.size
cummedia = np.empty(comprimento,float)

for i in range(cummedia.size):
    newvalue = data[i]
    # the value of npts must be in accordance with the expression in incremental_mean
    npts = i 
    
    cummedia[i] = incremental_mean(media , newvalue , npts)
    media = cummedia[i]
    
print('The mean value of the array computed by an online algorithm is %.14f.\n' %cummedia[cummedia.size - 1])
The mean value of the array computed by an online algorithm is 1.90813690881881.

We may ask how the online algorithm to compute the mean value of the array is moving toward the exact value.

In [6]:
steps = np.arange(data.size)
exact_mean = data.mean() * np.ones_like(steps)

plt.plot(steps, cummedia,label='Incremental')
plt.plot(steps , exact_mean , label = 'Exact Value')
plt.legend(('Incremental', 'Exact Value'),loc='upper right')
plt.show()

The natural question to ask is: given $\varepsilon>0$, what would be the magnitude of $x_N$ such that $|\hat\mu_N-\hat\mu_{N-1}|>\varepsilon$? Straightforward computations show that

\begin{equation}\label{eq:sgd.40} |x_N-\hat\mu_{N-1}|>(N+1)\cdot\varepsilon. \end{equation}

Suppose that you computed the mean value of the array. From the observed values of the array you can forecast if it is probable that the first element that you have left will verify equation (\ref{eq:sgd.40}). For that we will compute the \textbf{Empirical Cumulative Distribution Function}.

In [7]:
# the code of an empirical cumulative distributioin function
def my_ecdf(in_array, sort = "1"):
    
    if (sort != '1'):
        ordem = np.sort(in_array)
    else :
        ordem = in_array

    cumulation = np.cumsum(np.ones_like(ordem))/ordem.size
    
    return cumulation

We will not use this code to compute the value of the ecdf because statsmodels provides us one full optimize.

In [8]:
lixo = np.random.uniform(0,1,5)
ecdf_lixo = ECDF(lixo)
print(lixo)
print(ecdf_lixo.x)
print(ecdf_lixo.y)
[0.04664249 0.4887789  0.28969033 0.97421254 0.20838077]
[      -inf 0.04664249 0.20838077 0.28969033 0.4887789  0.97421254]
[0.  0.2 0.4 0.6 0.8 1. ]
In [9]:
# compute the empirical CDF of data
ecdf = ECDF(data,side = 'left')

teste_x = np.linspace(data.min() - 1 , data.max() + 1 , 100)
teste_y = norm.cdf(teste_x, loc = data.mean(), scale = data.std())

my_ecdf.y = my_ecdf(ecdf.x ,1)

plt.plot(teste_x, teste_y,linewidth = 1)
plt.plot(ecdf.x , ecdf.y,linewidth=1.5)
plt.plot(ecdf.x , my_ecdf.y)
plt.legend(('Exact','ECDF','MyECDF'),loc = 'upper left')
plt.title('Comparison between ECDF and exact value of the CF')
plt.show()

From expression (\ref{eq:sgd.40}), we build a function that at each step of the computation of incremental mean it states the bounds of the next value of the array, in order to respect the degree of accuracy imposed in first place.

In [10]:
def control_bounds(oldmean , max_error , npts):
    
    lower_bound = oldmean - (npts + 1.0) * max_error
    upper_bound = oldmean + (npts + 1.0) * max_error
    
    return lower_bound , upper_bound
In [11]:
epsilon = 0.1

liminf , limsup = control_bounds(cummedia[cummedia.size - 1] , epsilon , cummedia.size)

print('To respect the degree of accuracy, the next element must be in the interval [%f,%f]' %(liminf , limsup))
To respect the degree of accuracy, the next element must be in the interval [-498.191863,502.008137]
In [12]:
# this function gives the P(X<x) from a ECDF of an array.

def ecdf_value_func(x , ecdf_array):
         
    if (x > np.max(ecdf_array.x)) :
        out = 1  
    else :
        limsup = np.max(np.where(ecdf_array.x < x))
        out = ecdf_array.y[limsup]
        
    return out
In [13]:
# test this function over a very simple array
ecdf_value_func(0.5,ecdf_lixo)
Out[13]:
0.8

We can now compute the empirical probability that $x_{N}$ is outside the interval $\left]\hat\mu_{N-1}-(N+1)\cdot\varepsilon,\hat\mu_{N-1}+(N+1)\cdot\varepsilon\right[$.

In [14]:
print('The empirical probability of stable mean is %f.\n' %(ecdf_value_func(limsup,ecdf)-ecdf_value_func(liminf,ecdf)))
The empirical probability of stable mean is 1.000000.

In [15]:
tmp_data = np.array([])
# the array that will keep record of the probabilities
eprob = np.zeros_like(data)

epsilon = 0.01
inc_media = 0

for i in range(data.size):
    tmp_data = np.append(tmp_data,data[i])
    
    # it would be a good idea to find an online algorithm for ECDF
    tmp_ecdf = ECDF(tmp_data)
    
    inc_media = incremental_mean(inc_media , tmp_data[i],tmp_data.size)
    
    tmpinf , tmpsup = control_bounds(inc_media,epsilon,tmp_data.size)
    
    eprob[i] = ecdf_value_func(tmpsup,tmp_ecdf)-ecdf_value_func(tmpinf,tmp_ecdf)
    
indices = np.arange(data.size)
plt.plot(indices,eprob)
plt.title('The probability of having a stable mean')
plt.show()

Online algorithm for the variance

Let $S_N$ be the sum of the square errors of $x_k$ in relation to the mean ${\hat\mu}_N$,

$$%\label{eq:sgd.50} S_N=\sum_{k=0}^{N-1}(x_k-\hat\mu_N)^2. $$

Straightforward computations show that $S_N$ is given by

$$\label{eq:sgd.60} S_N=\sum_{k=0}^{N-1}x_k^2 - N\cdot\hat\mu_N^2. $$

This leads to the recurrent relation over $S_N$

\begin{equation}\label{eq:sgd.70} S_N - S_{N-1} = x_N^2-\hat\mu_{N-1}^2-N(\hat\mu_{N}^2-\hat\mu_{N-1}^2) \end{equation}

From equations (\ref{eq:sgd.30}) and (\ref{eq:sgd.70}) we can deduce a much more simple expression for the recurrence

\begin{equation}\label{eq:sgd.80} S_N = S_{N-1} + (x_N-\hat\mu_{N-1})(x_N-\hat\mu_{N}). \end{equation}

The expression (\ref{eq:sgd.70}) would be useful to compute a \textbf{stability interval} for the next value of the time series, while expression (\ref{eq:sgd.80}) the essential tool the \textit{online computation} of the variance. Because the former is more challenging, we start with the latter. Keep in mind that to compute $S_N$ we need three values: $x_N$, $\mu_{N-1}$ and $\mu_N$.

In [16]:
def incremental_sn (oldmu , newmu , olds , newvalue):
    
    out = olds + (newvalue - newmu) * (newvalue - oldmu)
    return out

With this function, we can compute the variance of our generated data.

In [17]:
mu_first = data[0]
mu_second = incremental_mean(mu_first, data[1] , 1)

s_old = 0
s_new = incremental_sn(mu_first , mu_second , s_old , data[1])

for i in range(1 , data.size):
    
    mu_second = incremental_mean(mu_first , data[i] , i)
    s_new = incremental_sn(mu_first , mu_second , s_old , data[i])

    s_old = s_new
    mu_first = mu_second

print("The comparison of the variance computed by both methods:")
print(s_new/(data.size),data.var())
The comparison of the variance computed by both methods:
9.175362794106347 9.175362794106348

Moreover, we can compute the evolution of this incremental series in a similar fashion of what we did in the mean case above

In [18]:
mu_first = data[0]
incremental_s = np.zeros(data.size)

for i in range(1 , data.size):
    
    mu_second = incremental_mean(mu_first , data[i] , i)
    s_new = incremental_sn(mu_first , mu_second , incremental_s[i-1] , data[i])

    incremental_s[i] = s_new
    mu_first = mu_second
    

sx = np.arange(data.size)
exact_var = data.var() * np.ones(data.size)

plt.plot(sx , incremental_s/data.size)
plt.plot(sx, exact_var)
plt.show()

As we can see from the analysis of the plot above, it does not make sense to compute the stable interval for the variance. For data that does not have \textbf{heavy tails}, we have an almost monotonous convergence to the variance.

Gradient descent for the mean

One different idea is to compute the mean value of the array by gradient descent. We need a function to converge to the value of the sample mean. One idea is to use the log likelihood function of the exponential distribution. This will become clear bellow.

Consider $p(x,\theta)=\theta e^{-\theta x}$ and a sample $(x_0,\dots,x_{N-1})$. The log likelihood function is given by

\begin{equation}\label{eq:sgd.90} L(\theta;x_0,\dots,x_{N-1}) = \sum_{k=0}^{N-1}\log(p(x_k,\theta)) = N\cdot \log\theta -\theta \cdot s, \end{equation}

where $s = \sum_{k=0}^{N-1}x_k$. The maximum value of this function is attained when $\theta$ is equal the value of the sample mean. This was the reason to choose this function as \textit{loss} function. The derivative is given by

\begin{equation}\label{eq:sgd.100} \frac{\partial L}{\partial\theta}=\frac{N}{\theta}-s. \end{equation}
In [23]:
data2_mean = 3.0
data2_std = 1.0
npts2 = 100000

data2 = data2_mean + data2_std * np.random.randn(npts2)
data2.mean()
Out[23]:
2.9971150528410657
In [20]:
def func2(theta , size , soma):
    
    return size * np.log(theta) - theta * soma

def gradfunc2(theta , size , soma):
    
    out = (size + 0.0)/theta - soma
    return out
In [63]:
batch_size = 1000
nbatchs = int(np.floor(data2.size)/batch_size)
learning_rate = 0.0001

nepochs = 30

theta_init = 0.1
theta_finit = theta_init

# the number of steps in the gradient descent
ngrad_jumps = 100
for j in range(nepochs):
    learning_rate *= 0.999999
    for i in range(nbatchs):
        batch = data2[i*batch_size:(i+1)*batch_size]
        size = batch.size 
        
        # this is the fundamental step: how to update the sum
        soma = batch.sum()
        
        for j in range(ngrad_jumps):
            theta_finit = theta_init + learning_rate * gradfunc2(theta_init , size , soma)
            theta_init = theta_finit
    
1/theta_init   
Out[63]:
3.018981007235945
In [64]:
batch_array_size = 2000
batch_array = np.zeros(batch_array_size)

learning_rate = 0.0001
nepochs = 30

# the number of steps in the gradient descent
ngrad_jumps = 100

for k in range(1,batch_array_size+1):
    batch_size = k
    nbatchs = int(np.floor(data2.size)/batch_size)
    
    theta_init = 0.1
    theta_finit = theta_init

    for j in range(nepochs):
        learning_rate *= 0.999999
        for i in range(nbatchs):
            batch = data2[i*batch_size:(i+1)*batch_size]
            size = batch.size 
        
            # this is the fundamental step: how to update the sum
            soma = batch.sum()
        
            for j in range(ngrad_jumps):
                theta_finit = theta_init + learning_rate * gradfunc2(theta_init , size , soma)
                theta_init = theta_finit
    
    batch_array[k-1] = theta_finit
In [66]:
bx = np.arange(batch_array_size)

plt.plot(bx,1/batch_array)
plt.show()
In [84]:
# the next variables define the min and max number of elements in batch
# and the difference between to consecutive sizes in batch_positions bellow
min_batch_size = 750
max_batch_size = 1250
dif_batch_size = 1

batch_positions = np.arange(min_batch_size , max_batch_size , step = dif_batch_size)

# this will retain the value of the gradient descent for each batch_size in batch_positions
batch_array_size = batch_positions.size
batch_array = np.zeros(batch_array_size)

# where are we in batch_array
ba_locator = 0

learning_rate = 0.0001
nepochs = 30

# the number of steps in the gradient descent
ngrad_jumps = 100

for k in batch_positions:
    batch_size = k
    nbatchs = int(np.floor(data2.size)/batch_size)
    
    theta_init = 0.1
    theta_finit = theta_init

    for j in range(nepochs):
        learning_rate *= 0.999999
        for i in range(nbatchs):
            batch = data2[i*batch_size:(i+1)*batch_size]
            size = batch.size 
        
            # this is the fundamental step: how to update the sum
            soma = batch.sum()
        
            for j in range(ngrad_jumps):
                theta_finit = theta_init + learning_rate * gradfunc2(theta_init , size , soma)
                theta_init = theta_finit
    
    batch_array[ba_locator] = 1.0/theta_finit
    ba_locator += 1
In [83]:
bxx = np.arange(min_batch_size , max_batch_size , step = dif_batch_size)

plt.plot(bxx , batch_array)
plt.show()

Gradient Descent for the variance

As in the case of the sample mean, we can use a likelihood function whose solution of the normal equations contains the sample variance. The obvious choice is to use the log-likelihood of the gaussian distribution

\begin{equation}\label{eq:sgd.110} L(\mu,\sigma;x_0,\dots,x_{N-1}) = -N\log(\sigma)-\frac{1}{2\sigma^2}\sum_{k=0}^{N-1}(x_k-\mu)^2. \end{equation}

The correspondent partial derivatives are given by

\begin{equation}\label{eq:sgd.120} \frac{\partial L}{\partial \sigma} = \frac{\sum_{k=0}^{N-1}(x_k-\mu)^2-N\sigma^2}{\sigma^3}\text{ and } \frac{\partial L}{\partial \mu} = \frac{\sum_{k=0}^{N-1}(x_k-\mu)}{\sigma^2}. \end{equation}

From these equations we could derive a similar algorithm for gradient descent to reach the values of the sample mean and sample variance.