Rohan Kotwani

The original problem statement was given here

There was also a verbal solution given in the members-only section. I'm not sure if it's legal to share the whole thing, but here is an excerpt of the solution: "The time series has a weekly periodicity with two peaks: Monday and Thursday, corresponding respectively to the publication of the Monday and Thursday digests. The impact of the Monday and Thursday email blasts extends over the next day; this makes measuring the yield more difficult, unless you use additional data, e.g. from our newsletter vendor. However, the bulk of the impact is really on Monday and Thursday."

There are a variety of signal processing techniques that can be applied to time series data. For example, trend components can be extracted from a time series by transforming the signal into the frequency domain, filtering out non-dominant frequencies, and then recombining the signal. This tends to work well for finding the overall trend, but it violates the assumption that the signal is a stationary process. In addition, this method is limited to the shapes that a sum of orthogonal sinusoids can make. This post considers another approach in which dominant frequencies are still extracted, but the signals are recombined in a linear model. In this linear model, the seasonality of the time series is accounted for by including a set of orthogonal sinusoidal waves with different frequencies and different time components. However, finding the appropriate frequency and time component for each sinusoid requires a little more digging. This post shows how to use the Discrete Fourier Transform to find these values automatically.

Signal Processing Basics

  • k ∈ [0,N-1] or k ∈ [-N/2, N/2-1] or k ∈ [-(N-1)/2, (N-1)/2]
  • N = # of data points
  • n = rank order of input data points
  • k = rank order of output data points
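
These quantities are the pieces of the standard discrete Fourier transform, which for an input series x[n] of length N is

X[k] = Σ_{n=0}^{N-1} x[n]·e^(−2πi·k·n/N)

where each output X[k] is the complex coefficient of the sinusoid with frequency k/N cycles per sample.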

Assumptions

"The Fourier transform assumes that the signal is stationary and that the signals in the sample continue into infinity. The Fourier transform performs poorly when this is not the case."

A stationary process has a constant mean and variance over time, a property that most time series don't have. A stationary signal can be extracted from the time series by creating a time-differenced variable. However, the complex magnitudes computed from a time-differenced variable are not representative of the actual signal. A regression model can then be used to reconstruct the magnitudes of the original signal.

Model Definition

y = T(t) + S(t) + R(t)

  • T(t)~Trend component
  • S(t)~Seasonal component
  • R(t)~Residual error

Objective function: minimize sum of squared errors
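
With X denoting the matrix of predictor columns (including the intercept) and z the coefficient vector, the objective and its closed-form solution are

minimize SSE(z) = Σᵢ (yᵢ − xᵢᵀz)²   ⇒   (XᵀX)·z = Xᵀy   ⇒   z = (XᵀX)⁻¹·Xᵀy

which is exactly the linear system solved in the snippet below.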

Closed-form, least-squares solution (Python code):

import numpy as np
A=x.T.dot(x) # x = intercept + predictor columns
b=x.T.dot(y) # y = target variable
z = np.linalg.solve(A,b)

This works well for a dataset with a small number of non-collinear dimensions. However, np.linalg.solve can break down on a dataset with many collinear dimensions, because the matrix XᵀX becomes singular or nearly singular.
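
When collinearity is a concern, one option is NumPy's SVD-based least-squares solver, which does not form the normal equations at all (a minimal sketch, using the same x and y as above):

import numpy as np

# np.linalg.lstsq factors x with an SVD, so it still returns a
# (minimum-norm) solution when some columns of x are collinear
z, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)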

For the purposes of this post, we will focus only on the T(t) and S(t) components. Roughly 80% of the data, the first 580 observations, was used as the training set; the model was validated on the remaining observations.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

# note: the 'Day ' column name contains a trailing space in the source CSV
full=pd.read_csv("DATA/DSC_Time_Series_Challenge.csv",dtype = {'Day ':str,'Sessions':int,'Pageviews':int})
time=[datetime.datetime.strptime(t[0],"%m/%d/%y") for t in full[['Day ']].values]
weekday=[t.isocalendar()[2] for t in time]  # ISO weekday: 1 = Monday, ..., 7 = Sunday

full['time']=time
full['index']=np.arange(1,len(full)+1)
full['weekday']=weekday
full=full.sort_values(by=['time'])

plt.plot(full['index'],full['Pageviews'],'-')
plt.xlabel('time')
plt.ylabel('Pageviews')
plt.title('Pageviews')
plt.show()

train=full[:580].copy()
valid=full[580:].copy()
train.head(n=5)
Day Sessions Pageviews time index weekday
0 9/25/13 4370 7177 2013-09-25 1 3
1 9/26/13 5568 10760 2013-09-26 2 4
2 9/27/13 4321 8171 2013-09-27 3 5
3 9/28/13 2378 4093 2013-09-28 4 6
4 9/29/13 2612 4881 2013-09-29 5 7

The weekday variable will be dummy coded and used later to compare results.

weekday_dummies = pd.get_dummies(train.weekday, prefix='W').iloc[:, 1:]
train = pd.concat([train, weekday_dummies], axis=1)
weekday_dummies = pd.get_dummies(valid.weekday, prefix='W').iloc[:, 1:]
valid = pd.concat([valid, weekday_dummies], axis=1)
# Regression and DSP are the author's own helper modules (not standard packages)
import Regression
import DSP

Finding the overall trend:

Polynomial terms, created from the index variable, can be used to represent the trend component T(t). The complexity of the trend can be tuned by fitting on a training set and choosing the degree with the lowest sum of squared errors on a validation set. A polynomial of degree 2 had the lowest SSE on the validation set, so those terms are included in the model.
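
The loop below uses the author's Regression helper module. As a point of reference, a rough scikit-learn based sketch of the same degree search (the function name degree_search is hypothetical, not part of the post's code) could be:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def degree_search(x_train, y_train, x_valid, y_valid, max_degree=14):
    # fit a polynomial trend of each degree on the training set and
    # score it by its sum of squared errors on the validation set
    scores = []
    for degree in range(1, max_degree + 1):
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
        pred = model.predict(poly.transform(x_valid))
        sse = float(np.sum((np.asarray(y_valid) - pred) ** 2))
        scores.append((sse, degree))
    return sorted(scores)  # lowest validation SSE first

Calling degree_search(train[['index']], train[['Pageviews']], valid[['index']], valid[['Pageviews']]) would produce the same kind of (SSE, degree) list shown below.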

heap = []
# fit polynomial trends of degree 1 through 14 on the training set and
# score each by its sum of squared errors on the validation set
for i in range(1,15):
    z=Regression.sklearn_poly_regression(train[['index']],train[['Pageviews']],i)
    SSE = Regression.numpy_poly_regression_SSE(valid[['index']],valid[['Pageviews']],i,z)
    heap.append((SSE,i))
Regression.heapsort(heap)
    [(2501666772.2010365, 2),
     (2699152210.1621375, 9),
     (2699503808.6114817, 14),
     (2701796858.0785437, 10),
     (2701846050.0146337, 13),
     (2702552306.8232799, 12),
     (2706108691.5368271, 11),
     (2990342895.3898125, 8),
     (4491840385.0665913, 1),
     (4868690323.6375523, 4),
     (5963871838.3020554, 3),
     (11208662852.721825, 5),
     (317581251234.68689, 7),
     (477442016671.72321, 6)]
 
T_train = Regression.pandas_poly_feature(train[['index']],2)
T_valid = Regression.pandas_poly_feature(valid[['index']],2)
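
Regression.pandas_poly_feature is another of the author's helpers. A minimal stand-in, assuming it simply returns the powers of the index column up to the requested degree (the name poly_feature is hypothetical), might look like:

import pandas as pd

def poly_feature(df, degree):
    # build columns x, x^2, ..., x^degree from the single column of df
    col = df.columns[0]
    return pd.DataFrame({col + "^" + str(d): df[col] ** d
                         for d in range(1, degree + 1)}, index=df.index)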

Removing the overall trend:

First, the overall trend needs to be removed from the signal in order to satisfy the stationarity assumption. This can be accomplished by creating a lag-1 differenced variable for Pageviews. This differenced variable also makes the seasonal components easier to find.
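
DSP.time_diff_variable is one of the author's helper functions; a lag-k difference can be written directly with NumPy (a sketch under that assumption, not the author's implementation):

import numpy as np

def time_diff_variable(series, lag=1):
    # lag-`lag` difference x[t] - x[t-lag]; the first `lag` points are dropped,
    # which is why the plot below starts at index 1
    x = np.asarray(series, dtype=float)
    return x[lag:] - x[:-lag]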

time_diff_signal = DSP.time_diff_variable(train['Pageviews'],1)
plt.plot(train['index'][1:],time_diff_signal,'-')
plt.xlabel('time')
plt.ylabel('Pageviews[t] - Pageviews[t-1]')
plt.title('Pageviews[t] - Pageviews[t-1]')
plt.show()

The FFT transformation:

Taking the FFT of the time-differenced signal gives a complex coefficient for each sinusoidal frequency component in the signal.

The data is then filtered to include only the dominant sinusoidal frequency components. Usually, the complex coefficients can be used to find the phase and absolute magnitude of the sinusoid at each frequency. In this case, we are working with a time-differenced variable, so the magnitudes and phases are not representative of the actual variable.

This process can be semi-automated with simple statistics. The center frequency is found by averaging across all frequencies, and the filter bandwidth is set to one standard deviation of the frequencies around that center; a separate magnitude threshold, one standard deviation above the mean absolute magnitude, keeps only the dominant components. The purpose of the band is to avoid periods that are too large.
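
DSP.get_frequency_domain and DSP.filter_freq_domain are also the author's helpers. A rough sketch of what they might do, assuming one observation per sampling period and a two-column output of frequency and complex coefficient, is:

import numpy as np

def get_frequency_domain(signal):
    # column 0: frequency in cycles per sample (0 ... 0.5)
    # column 1: complex FFT coefficient at that frequency
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.rfftfreq(len(signal), d=1.0)
    return np.column_stack((freqs, np.fft.rfft(signal)))

def filter_freq_domain(freq_domain, center, band, threshold):
    # keep components whose frequency lies within `band` of `center` and whose
    # scaled magnitude (2/N * |coefficient|) exceeds `threshold`
    n = 2 * (len(freq_domain) - 1)
    freqs = freq_domain[:, 0].real
    mags = 2.0 / n * np.abs(freq_domain[:, 1])
    keep = (np.abs(freqs - center) <= band) & (mags > threshold)
    return freq_domain[keep]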

N = len(time_diff_signal)
unfiltered  = DSP.get_frequency_domain(time_diff_signal)
y_abs=( 2.0/N * np.abs(unfiltered[1:,1]))

center = np.mean(unfiltered[1:,0])
band = 1*np.std(unfiltered[1:,0])
threshold = np.mean(y_abs)+1*np.std(y_abs)

print("Center frequency: ",np.abs(center),"Band: ",np.absolute(band),"Threshold: ",threshold)

plt.plot(unfiltered[1:,0],y_abs)
plt.xlabel('frequency')
plt.ylabel('absolute magnitude')
plt.title("Absolute Magnitude of Complex Frequencies")
plt.show()

print("Frequency, Magnitude")
abs_filtered = np.absolute(DSP.filter_freq_domain(unfiltered, center,band,threshold))
print(abs_filtered)
Center frequency:  0.249568221071 Band:  0.14358883867 Threshold:  408.870795735

    Frequency, Magnitude
    [[  1.41623489e-01   1.33800535e+05]
     [  1.43350604e-01   4.90313504e+05]
     [  2.78065630e-01   1.31154698e+05]
     [  2.81519862e-01   1.23147724e+05]
     [  2.83246978e-01   2.54959357e+05]
     [  2.84974093e-01   6.93085781e+05]
     [  2.86701209e-01   5.95979105e+05]
     [  2.88428325e-01   1.80149377e+05]
     [  2.90155440e-01   1.44095857e+05]
     [  2.91882556e-01   1.36036705e+05]]

Creating seasonal predictor variables

After filtering, the dominant frequencies were found near 0.143, 0.28, and 0.29 cycles per day. Since the period is the reciprocal of the frequency, these correspond to T = 7, 4, and 3, respectively (1/0.143 ≈ 7, 1/0.28 ≈ 3.6 which rounds to 4, and 1/0.29 ≈ 3.4 which rounds to 3). Sequences with these periods can be created.

Notice that the frequency 0.428, which corresponds to a period of roughly 2 (about 2.3, in fact), was not included in our predictors. This is because of the Nyquist theorem:

"When a signal is bandlimited to ½·fs cycles/second (hertz), its Fourier transform, X(f), is 0 for all |f| > ½·(1/T)."

In this case the signal has a sampling period of 1 day, so the sampling frequency is 1 cycle per day. By the Nyquist theorem, the highest frequency we can "see" is 0.5 cycles per day; in other words, only components with a period greater than 2 days can be represented reliably.

period_list = set([round(1/(ft)) for ft,ht in abs_filtered if round(1/(ft))>2])
print("Sequence list: ",period_list)

train = DSP.get_sequences(train,period_list,start_index=1,plot=True)
valid = DSP.get_sequences(valid,period_list,start_index=len(train)+1,plot=False)
valid.head()
Sequence list:  {3.0, 4.0, 7.0}

Day Sessions Pageviews time index weekday W_2 W_3 W_4 W_5 W_6 W_7 seq_3.0 seq_4.0 seq_7.0
580 4/28/15 13162 19757 2015-04-28 581 2 1.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
581 4/29/15 12395 18568 2015-04-29 582 3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0
582 4/30/15 12351 19122 2015-04-30 583 4 0.0 0.0 1.0 0.0 0.0 0.0 1.0 3.0 2.0
583 5/1/15 12301 19830 2015-05-01 584 5 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 3.0
584 5/2/15 9345 14110 2015-05-02 585 6 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 4.0

Eureka! Weekday (Sequence period = 7) shares the same frequency components as Pageviews!

The index and weekday columns help illustrate these periods, or cycles. The index starts at 1, which corresponds to weekday 3 (Wednesday); so index 6 corresponds to Monday and index 2 corresponds to Thursday.

Fitting the Model

Since T(t) is already created, this part of the blog will focus on S(t).

Each generated sequence will be turned into pairs of seasonal components, with at least two components per sequence, i.e. two sinusoidal waves β1sin(2*π*t/T) + β2cos(2*π*t/T). In fact, the sum of weighted cosine and sine terms for each dominant frequency in the signal can be thought of as a Fourier series. The difference is that the β values are only estimated for the most dominant frequencies, instead of an infinite, or large, number of periodic waves.

Fourier series formula: s(t) = a0/2 + Σ_{n=1}^{∞} [ an·cos(2*π*n*t/P) + bn·sin(2*π*n*t/P) ], where P is the period.

How is this done automatically?

Consider an example where there exists a periodic signal of period 13 inside of the main signal.

Seasonal components are generated automatically by first creating a saw tooth wave with a period of 13. This series represents the time component of a set of sinusoids with various periods. But how do we choose these periods? The saw tooth wave is unique in that each level appears only once per period. By taking the FFT of this wave, we can see how many sinusoids are needed to represent the wave at each unique level. By selecting the dominant frequencies of this wave, a variety of shapes can be created for the seasonal component of period 13. This is accomplished by summing a set of orthogonal sine and cosine waves, with different periods, which use the saw tooth wave as their time component. Using the saw tooth wave as the time component forces the resultant wave to have a period of 13.

Sum of sinusoids = β1sin(2*π*x/13) + β2cos(2*π*x/7) + β3cos(2*π*x/4) + β4cos(2*π*x/3) + β5cos(2*π*x/2) , 0 <= x < 13
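
A rough sketch of how these predictors can be generated (hypothetical stand-ins for the author's DSP.get_sequences and DSP.get_seasonal_predictors helpers):

import numpy as np

def sawtooth_sequence(n_obs, period, start_index=1):
    # repeating 0, 1, ..., period-1 ramp used as the time component
    t = np.arange(start_index, start_index + n_obs)
    return t % period

def seasonal_predictors(seq, freqs):
    # for each selected frequency f, add the pair sin(2*pi*f*seq), cos(2*pi*f*seq)
    seq = np.asarray(seq, dtype=float).ravel()
    cols = []
    for f in freqs:
        cols.append(np.sin(2 * np.pi * f * seq))
        cols.append(np.cos(2 * np.pi * f * seq))
    return np.column_stack(cols)

Summing the resulting columns with regression weights produces a wave whose overall period matches the saw tooth, which is how the seasonal component S(t) is assembled below.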

# target variable
output=train[['Pageviews']]
y=output.values

# start each design matrix with an intercept column of ones
S_train = np.ones((len(train),1))
S_valid = np.ones((len(valid),1))

# for each saw tooth sequence, find its dominant frequencies on the training
# set and append the corresponding sine/cosine predictors to both sets
for component in period_list:
    f=DSP.get_sequence_freq(var=train[["seq_"+str(component)]],std_thresh=3,plot=True)
    x=DSP.get_seasonal_predictors(train[["seq_"+str(component)]],f)
    S_train=np.column_stack((S_train,x))
    x=DSP.get_seasonal_predictors(valid[["seq_"+str(component)]],f)
    S_valid=np.column_stack((S_valid,x))
plt.plot(train['index'][:60],train['Pageviews'][:60],'-')
plt.xlabel("time")
plt.ylabel("Pageviews")
plt.title("Pageviews")
plt.show()

x=S_train

A=x.T.dot(x)
b=x.T.dot(y)
z = np.linalg.solve(A,b)

SSE=np.sum((y-x.dot(z))**2)
print("SSE :",SSE)
print("Baseline: ",sum((y-np.mean(y))**2)[0])

# plt.plot(train['index'],y,'.') 
plt.plot(train['index'][:60],x.dot(z)[:60],'-')
plt.title("Weighted Addition of Seasonal Components")
plt.xlabel("time")
plt.ylabel("predicted value")
plt.show()
print("coefficients: ")
print(z)
    threshold 0.213708597747
    frequencies  [[  0.00000000e+00   5.80000000e+02]
     [  3.31034483e-01   6.86580342e+01]
     [  3.32758621e-01   2.76353001e+02]
     [  3.34482759e-01   1.39043662e+02]]

     
    frequencies used in  seq_3.0  :  [0.33333333333333331]
    frequencies used in  seq_3.0  :  [0.33333333333333331]
    threshold 0.254028420631
    frequencies  [[  0.00000000e+00   8.70000000e+02]
     [  2.50000000e-01   4.10121933e+02]]

      
    frequencies used in  seq_4.0  :  [0.25]
    frequencies used in  seq_4.0  :  [0.25]
    threshold 0.525113865983
    frequencies  [[  0.00000000e+00   1.74300000e+03]
     [  1.43103448e-01   6.45133516e+02]
     [  2.86206897e-01   3.22688369e+02]
     [  4.27586207e-01   1.61660579e+02]
     [  4.29310345e-01   2.15278734e+02]]

    
    frequencies used in  seq_7.0  :  [0.14285714285714285, 0.5, 0.33333333333333331]
    frequencies used in  seq_7.0  :  [0.14285714285714285, 0.5, 0.33333333333333331]

   
    SSE : 5765825922.99
    Baseline:  8530795814.65

    
        coefficients: 
    [[  1.06306549e+04]
     [ -1.63450638e+00]
     [  5.94252422e+01]
     [ -6.20595993e+01]
     [ -3.09144366e+00]
     [  1.29084218e+03]
     [ -5.98934835e+18]
     [ -2.34003768e+03]
     [  1.42249877e+03]
     [ -1.72123723e+03]
     [  1.13494916e+03]]

Weighted Addition of Seasonal Components

In the regression model, each of these components gets its own weight, and the weighted components are added together to create a single weighted seasonal component. This weighted seasonal component has a pattern similar to that of Pageviews.
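
In other words, if s1(t), ..., s10(t) are the sine and cosine columns generated above, the fitted seasonal component is their weighted sum

S(t) = β0 + β1·s1(t) + β2·s2(t) + ... + β10·s10(t)

where β0, ..., β10 are the eleven regression coefficients printed above.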

Seasonal Predictors

There are three saw tooth sequences included in the model, each with its own time index t. The saw tooth waves of period 3 and 4 each contribute only two sinusoids (a sine and a cosine), with frequencies 0.33 and 0.25, respectively. Finally, the saw tooth wave of period 7 is used inside six sinusoids, with frequencies 0.143, 0.333, and 0.5.

These frequencies can be found using the same semi-automated technique used to generate the sequences, as described earlier. The threshold can be adjusted by the number of standard deviations from the mean of the absolute complex magnitude.

Finding peaks

There are two peaks which correspond to Monday and Thursday. This is consistent with the problem solution given above.

yhat = x.dot(z).ravel()  # fitted values from the seasonal model
peaks=[]
for i in range(1, len(yhat) - 1):
    # record the ISO weekday of each local maximum in the fitted curve
    if yhat[i] > yhat[i-1] and yhat[i] > yhat[i+1]:
        peaks.append(train['time'][i].isocalendar()[2])
print(set(peaks))
    {1, 4}

Training & Validation

The following models will be trained and validated with different seasonal variables:

1. Model fitted with a dummy coded weekday variable

2. Model fitted with extracted sinusoidal components

y=train[['Pageviews']].values

m =len(train)

x=np.ones((m,9))  # intercept + 6 weekday dummies + 2 polynomial trend terms
x[:,1:]=np.column_stack((train['W_2'],train['W_3'],
                         train['W_4'],train['W_5'],train['W_6'],
                         train['W_7'],T_train))


A=x.T.dot(x)
b=x.T.dot(y)
z = np.linalg.solve(A,b)

SSE=np.sum((y-x.dot(z))**2)
SST=sum((y-np.mean(y))**2)[0]
print("Training")
print("SSE: ",SSE)
print("SST: ",SST)
print("R^2: ",1-SSE/SST)
print("Baseline: ",sum((y-np.mean(y))**2)[0])

plt.plot(train['index'],y,'.')
plt.plot(train['index'],x.dot(z),'-')
plt.title("Dummy coded weekday variable: Training")
plt.show()

plt.scatter(train['index'],y-x.dot(z))
plt.title("Residuals")
plt.show()
print("coefficients:")
print(z)
print()

y=valid[['Pageviews']].values

m =len(valid)

x=np.ones((m,9))
x[:,1:]=np.column_stack((valid['W_2'],valid['W_3'],
                         valid['W_4'],valid['W_5'],valid['W_6'],
                         valid['W_7'],T_valid))



SSE=np.sum((y-x.dot(z))**2)
SST=sum((y-np.mean(y))**2)[0]
print("Validation")
print("SSE: ",SSE)
print("SST: ",SST)
print("R^2: ",1-SSE/SST)
print("Baseline: ",sum((y-np.mean(y))**2)[0])

plt.plot(valid['index'],valid[['Pageviews']],'.')
plt.plot(valid['index'],x.dot(z),'-')
plt.title("Dummy coded weekday variable: Validation")
plt.show()

plt.scatter(valid['index'],y-x.dot(z))
plt.title("Residuals")
plt.show()
    Training
    SSE:  1912154025.09
    SST:  8530795814.65
    R^2:  0.775852796546
    Baseline:  8530795814.65
    

    coefficients:
    [[  1.15385953e+04]
     [ -2.79933336e+03]
     [ -3.32862564e+03]
     [ -1.48611482e+03]
     [ -3.71268360e+03]
     [ -6.88505486e+03]
     [ -5.73875872e+03]
     [ -1.29164524e+00]
     [  2.77465319e-02]]
    
    Validation
    SSE:  1159104596.32
    SST:  2424981317.62
    R^2:  0.522015040734
    Baseline:  2424981317.62

The regression model on the training dataset had an R-squared value of 0.78, using nine coefficients (an intercept, six weekday dummies, and two polynomial trend terms). The residuals are evenly distributed around zero, with two large outliers.

The validation dataset had an R-squared of 0.52. The validation residuals have more outliers than the training residuals. In addition, there is a slight downward trend, which suggests that a variable might be missing from the model.

y=train[['Pageviews']].values

x=np.column_stack((S_train,T_train))


A=x.T.dot(x)
b=x.T.dot(y)
z = np.linalg.solve(A,b)

SSE=np.sum((y-x.dot(z))**2)
print("SSE :",SSE)
print("Baseline: ",sum((y-np.mean(y))**2))


SSE=np.sum((y-x.dot(z))**2)
SST=sum((y-np.mean(y))**2)[0]
print("SSE: ",SSE)
print("SST: ",SST)
print("R^2: ",1-SSE/SST)


plt.plot(train['index'],y,'.')
plt.plot(train['index'],x.dot(z),'-')
plt.title("Seasonally predicted Values over time: Training")
plt.show()

plt.scatter(train['index'],y-x.dot(z))
plt.title("Residuals")
plt.show()


print(z)
print()


output=valid[['Pageviews']]
y=output.values


x=np.column_stack((S_valid,T_valid))

SSE=np.sum((y-x.dot(z))**2)
SST=sum((y-np.mean(y))**2)[0]

print("SSE: ",SSE)
print("SST: ",SST)
print("R^2: ",1-SSE/SST)
print("Baseline: ",sum((y-np.mean(y))**2)[0])

plt.plot(valid['index'],y,'.')
plt.plot(valid['index'],x.dot(z),'-')
plt.title("Seasonally predicted Values over time")
plt.show()
plt.scatter(valid['index'],y-x.dot(z))
plt.title("Residuals")
plt.show()

    SSE : 1910499966.56
    Baseline:  [  8.53079581e+09]
    SSE:  1910499966.56
    SST:  8530795814.65
    R^2:  0.776046689187

    [[  7.88822225e+03]
     [ -1.04772341e+01]
     [  5.49537731e+01]
     [ -4.71633832e+01]
     [ -1.78556064e+01]
     [  1.31883444e+03]
     [ -5.94631183e+18]
     [ -2.32539453e+03]
     [  1.42930303e+03]
     [ -1.70801754e+03]
     [  1.13000291e+03]
     [ -1.29585954e+00]
     [  2.77521502e-02]]
    SSE:  1160469260.63
    SST:  2424981317.62
    R^2:  0.521452288231
    Baseline:  2424981317.62

Results

Both models had a training R-squared of about 0.78: the dummy coded model with nine coefficients and the sinusoidal model with thirteen (an intercept, ten sinusoidal terms, and two trend terms). In both cases the residuals are evenly distributed around zero, with two large outliers.

Both validation R-squared values were about 0.52. The validation residuals have more outliers than the training residuals, and there is a slight downward trend, which suggests that a variable might be missing from the model.

The results are almost (if not) exactly the same, so which method should you choose?

Dummy coding can be automated, but it is not always easy to type out every variable to include in a linear model. In addition, it requires some inspection and trial and error to find the seasonal variable. The advantage of dummy coding is that each level is accounted for, and it is relatively easy to compare levels.

Signal processing can be used to create seasonal components almost automatically, but it is more computationally expensive. In addition, it is not obvious from the coefficients what the seasonal components are. The advantage is that more predictive variables can be included in the model, and the number of sinusoidal components can be increased or decreased by adjusting the threshold or bandwidth.