Select Parameters with Cross Validation¶

This notebook demonstrates how to select parameters of best fit for a sparse tensor decomposition model using a cross-validation strategy.

Fitting a sparse tensor decomposition model requires tuning two key parameters: the number of components (rank) of the model, and the sparsity coefficient (lambda) applied to each mode of the model. These parameters are rarely known a priori, and instead must be inferred. We’ve developed a cross-validation protocol for selecting parameters of best fit. This protocol relies on measurement replicates to generate three replicate data tensors of equivalent shapes. These replicate data tensors can be used to calculate cross-validated sum of squared error (SSE) and factor match score (FMS) metrics that can be used to identify the parameters of best fit.

We will first simulate data replicates by generating three identical simulated data tensors – rep_a, rep_b, and rep_c – with independent noise added to each tensor to simulate variation between replicates. Next we will identify the best fit number of components using cross validated SSE. Locking in this rank, we’ll fine tune the best fit sparsity coefficient using cross validated FMS.

[1]:

# imports

import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import tensorly as tl
import tlviz
from barnacle import (
    SparseCP,
    visualize_3d_tensor,
    simulated_sparse_tensor,
)

[2]:

# generate simulated data

true_rank = 5
true_shape = [20, 15, 10]
true_densities = [.2, .4, .6]

# re-seed simulated data until all factor matrices are full rank
full_rank = False
while not full_rank:
    # generate simulated tensor
    sim_tensor = simulated_sparse_tensor(
        shape=true_shape,
        rank=true_rank,
        densities=true_densities,
        factor_dist_list=[
            scipy.stats.uniform(loc=-1, scale=2),
            scipy.stats.uniform(),
            scipy.stats.uniform()
        ],
        random_state=9481
    )
    # check that all factors are full rank
    full_rank = np.all([np.linalg.matrix_rank(factor) == true_rank for factor in sim_tensor.factors])

# create three replicates with the same core signal data tensor, and 100% independent noise added to each
rep_a = sim_tensor.to_tensor(noise_level=1, sparse_noise=True, random_state=91891)
rep_b = sim_tensor.to_tensor(noise_level=1, sparse_noise=True, random_state=7394)
rep_c = sim_tensor.to_tensor(noise_level=1, sparse_noise=True, random_state=1597)

# visualize replicates
for i, rep in enumerate([rep_a, rep_b, rep_c]):
    fig = visualize_3d_tensor(
        rep,
        shell=False,
        midpoint=0,
        show_colorbar=True,
        label_axes=True,
        range_color=[-1, 1],
        opacity=1,
    )
    print('Replicate {}'.format({0:'A', 1:'B', 2:'C'}[i]))
    fig.show()

Replicate A

Replicate B

Replicate C

[3]:

# perform a cross-validated parameter search for the number of components (rank)
# NOTE: this block of code takes a minute or two to run

results = []
for rank in np.arange(1, 9, 1):
    print(f'Fitting rank-{rank} models')
    # instantiate sparse cp model
    model = SparseCP(
        rank=rank,
        lambdas=[0, 0, 0],
        nonneg_modes=[1, 2],
        tol=1e-5,
        random_state=21158,
        n_initializations=5
    )
    # fit model to each replicate data tensor
    for i, fit_rep in enumerate([rep_a, rep_b, rep_c]):
        cp = model.fit_transform(fit_rep, verbose=0)
        # calculate cross-validated SSE against each held out replicate
        for j, comparison_rep in enumerate([rep_a, rep_b, rep_c]):
            if i == j:
                comparison = 'fitting'
            else:
                comparison = 'cross-validation'
            sse = tlviz.model_evaluation.relative_sse(cp, comparison_rep)
            results.append({
                'Rank': rank,
                'Fitting Replicate': {0:'A', 1:'B', 2:'C'}[i],
                'Comparison Replicate': {0:'A', 1:'B', 2:'C'}[j],
                'Comparison': comparison,
                'SSE': sse
            })
# compile results into DataFrame
rank_results_df = pd.DataFrame(results)
rank_results_df

Fitting rank-1 models
Fitting rank-2 models
Fitting rank-3 models
Fitting rank-4 models
Fitting rank-5 models
Fitting rank-6 models
Fitting rank-7 models
Fitting rank-8 models

[3]:

	Rank	Fitting Replicate	Comparison Replicate	Comparison	SSE
0	1	A	A	fitting	0.657666
1	1	A	B	cross-validation	0.707090
2	1	A	C	cross-validation	0.714235
3	1	B	A	cross-validation	0.674137
4	1	B	B	fitting	0.691348
...	...	...	...	...	...
67	8	B	B	fitting	0.352665
68	8	B	C	cross-validation	0.648890
69	8	C	A	cross-validation	0.645520
70	8	C	B	cross-validation	0.646472
71	8	C	C	fitting	0.341968

72 rows × 5 columns

[4]:

# identify the rank of best fit

# find the rank that corresponds to the minimum mean cross-validated SSE
cv_sse_df = rank_results_df[rank_results_df.Comparison == 'cross-validation'].groupby('Rank').mean('SSE')
minimum = cv_sse_df.reset_index().sort_values('SSE').iloc[0, :]
print(f'The minimum cross-validated SSE ({minimum.SSE:.2f}) corresponds to a rank of {int(minimum.Rank)}.')

# visualize cross-validation data
ax = sns.lineplot(
    rank_results_df,
    x='Rank',
    y='SSE',
    hue='Comparison',
    errorbar='se',
    err_style='bars'
)

The minimum cross-validated SSE (0.56) corresponds to a rank of 5.

../_images/notebooks_cross-validation_4_1.png

The cross-validation result reveal a clear minimum in cross-validated SSE that corresponds to a rank of 5. This indicates that the model of best fit should be parameterized with 5 components. However, this parameterization was done in the absence of any sparsity parameter (lambda=0). So fixing the number of components at 5, we next perform a similar cross-validated parameter search of sparsity coefficients to identify the ideal sparsity level for the model.

[5]:

# perform a cross-validated parameter search for the sparsity coefficient
# NOTE: this block of code takes a minute or two to run

# define range of lambda values to test
lambdas = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.]

results = []
for lamb in lambdas:
    print(f'Fitting models with sparsity coefficient of {lamb}')
    # instantiate sparse cp model
    model = SparseCP(
        rank=5,
        lambdas=[lamb, 0, 0],
        nonneg_modes=[1, 2],
        tol=1e-5,
        random_state=21158,
        n_initializations=5
    )
    # fit model to each replicate data tensor
    fit_models = []
    for i, fit_rep in enumerate([rep_a, rep_b, rep_c]):
        cp = model.fit_transform(fit_rep, verbose=0)
        fit_models.append(cp)
    # calculate cross-validated FMS and SSE between each pair of replicates
    for i, fit_rep in enumerate([rep_a, rep_b, rep_c]):
        for j, comparison_rep in enumerate([rep_a, rep_b, rep_c]):
            if i == j:
                continue # only compute cross-validated metrics
            elif i < j:
                fms = tlviz.factor_tools.factor_match_score(fit_models[i], fit_models[j])
            else:
                fms = np.nan # FMS is symmetrical, so FMS(A, B) == FMS(B, A)
            sse = tlviz.model_evaluation.relative_sse(fit_models[i], comparison_rep)
            results.append({
                'Sparsity Coefficient': lamb,
                'Fitting Replicate': {0:'A', 1:'B', 2:'C'}[i],
                'Comparison Replicate': {0:'A', 1:'B', 2:'C'}[j],
                'SSE': sse,
                'FMS': fms
            })
# compile results into DataFrame
lambda_results_df = pd.DataFrame(results)
lambda_results_df

Fitting models with sparsity coefficient of 0.05
Fitting models with sparsity coefficient of 0.1
Fitting models with sparsity coefficient of 0.2
Fitting models with sparsity coefficient of 0.3
Fitting models with sparsity coefficient of 0.4
Fitting models with sparsity coefficient of 0.5
Fitting models with sparsity coefficient of 0.6
Fitting models with sparsity coefficient of 0.8
Fitting models with sparsity coefficient of 1.0

[5]:

	Sparsity Coefficient	Fitting Replicate	Comparison Replicate	SSE	FMS
0	0.05	A	B	0.563678	0.621984
1	0.05	A	C	0.575991	0.587359
2	0.05	B	A	0.573267	NaN
3	0.05	B	C	0.561255	0.665616
4	0.05	C	A	0.549828	NaN
5	0.05	C	B	0.543003	NaN
6	0.10	A	B	0.569468	0.621996
7	0.10	A	C	0.563350	0.559078
8	0.10	B	A	0.553518	NaN
9	0.10	B	C	0.568633	0.580420
10	0.10	C	A	0.546041	NaN
11	0.10	C	B	0.538591	NaN
12	0.20	A	B	0.565581	0.601660
13	0.20	A	C	0.558382	0.559783
14	0.20	B	A	0.542834	NaN
15	0.20	B	C	0.533984	0.773011
16	0.20	C	A	0.544073	NaN
17	0.20	C	B	0.534930	NaN
18	0.30	A	B	0.564586	0.603628
19	0.30	A	C	0.558994	0.558960
20	0.30	B	A	0.540832	NaN
21	0.30	B	C	0.532331	0.783579
22	0.30	C	A	0.544599	NaN
23	0.30	C	B	0.534873	NaN
24	0.40	A	B	0.567422	0.601039
25	0.40	A	C	0.562392	0.555470
26	0.40	B	A	0.541534	NaN
27	0.40	B	C	0.533658	0.789679
28	0.40	C	A	0.548292	NaN
29	0.40	C	B	0.537406	NaN
30	0.50	A	B	0.571722	0.590569
31	0.50	A	C	0.567495	0.550812
32	0.50	B	A	0.549326	NaN
33	0.50	B	C	0.565389	0.619288
34	0.50	C	A	0.554570	NaN
35	0.50	C	B	0.542781	NaN
36	0.60	A	B	0.577466	0.577193
37	0.60	A	C	0.574194	0.542921
38	0.60	B	A	0.555087	NaN
39	0.60	B	C	0.570769	0.620740
40	0.60	C	A	0.561618	NaN
41	0.60	C	B	0.549689	NaN
42	0.80	A	B	0.594029	0.437480
43	0.80	A	C	0.592781	0.523104
44	0.80	B	A	0.606608	NaN
45	0.80	B	C	0.617170	0.465549
46	0.80	C	A	0.577298	NaN
47	0.80	C	B	0.566667	NaN
48	1.00	A	B	0.635231	0.313387
49	1.00	A	C	0.636494	0.521444
50	1.00	B	A	0.627666	NaN
51	1.00	B	C	0.639601	0.454012
52	1.00	C	A	0.597917	NaN
53	1.00	C	B	0.590343	NaN

[6]:

# identify the sparsity coefficient of best fit

# find the lambda that corresponds to the minimum mean cross-validated SSE
cv_df = lambda_results_df.groupby('Sparsity Coefficient').mean(['SSE', 'FMS']).reset_index()
min_sse = cv_df.sort_values('SSE').iloc[0, :]
print(f"The minimum cross-validated SSE ({min_sse.SSE:.2f}) " +
      f"corresponds to a sparsity coefficient of {min_sse['Sparsity Coefficient']}.")

# find the lambda that corresponds to the maximum mean cross-validated FMS
cv_df = lambda_results_df.groupby('Sparsity Coefficient').mean(['SSE', 'FMS']).reset_index()
max_fms = cv_df.sort_values('FMS', ascending=False).iloc[0, :]
print(f"The maximum cross-validated FMS ({max_fms.FMS:.2f}) " +
      f"corresponds to a sparsity coefficient of {max_fms['Sparsity Coefficient']}.")

# arrange data in tidy format for plotting
plot_df = lambda_results_df.melt(
    id_vars=['Sparsity Coefficient', 'Fitting Replicate', 'Comparison Replicate'],
    value_vars=['SSE', 'FMS'],
    var_name='Metric',
    value_name='Score'
)

# visualize cross-validation data
ax = sns.lineplot(
    plot_df,
    x='Sparsity Coefficient',
    y='Score',
    hue='Metric',
    errorbar='se',
    err_style='bars'
)
ax.set(xscale='log');

The minimum cross-validated SSE (0.55) corresponds to a sparsity coefficient of 0.3.
The maximum cross-validated FMS (0.65) corresponds to a sparsity coefficient of 0.4.

../_images/notebooks_cross-validation_7_1.png

Evidenced by these results, selecting the best sparsity coefficient is not always a straightforward task. However, in this example it is reassuring to see that the cross-validated SSE and FMS scores point to a similar range of sparsity coefficient values between 0.2 and 0.4. Above this range, there is a clear drop off in FMS and an increase in SSE, and below this range there is a similar trend that is subtler but nevertheless distinct.

Within the indicated range, selecting the most appropriate sparsity coefficient probably depends on the intended application of the model. There are a number of hueristics that can help, such as the “one standard error rule” in which we select the sparsest model for which the cross-validated score remains within one standard error of the best overall score. In this example, this heuristic might lead to a selection of 0.4 as the sparsest model of best fit.

Select Parameters with Cross Validation¶

Barnacle

Navigation

Related Topics