Analyzing GRU Training Dynamics on the Adding Problem - Part 2
programming
web development
research
diary
R&D
Author
Luca Simonetti
Published
February 20, 2025
1 Introduction
In this blog post, I’ll complete the journey started in the previous post, where I introduced the problem and showed some interesting plots depicting the weights and the learning dynamics behind them. There I tried to explain how the GRU works under the hood, step by step, with a full example of a sequence going through the whole GRU: we saw each and every transformation the input is subjected to. In this blog post we will do something in my opinion more interesting: analysing the weights. Now, this can actually mean a lot of things, so I want to split the discussion in two parts:
What do the weights actually mean? We saw a complete forward pass that showed more or less how the initial input is transformed into the hidden layer. But why those specific weights rather than other values?
How did we get there? It’s well known that weights are usually randomly initialized in networks, so how did we get from those random values to our values? Are the learning dynamics responsible for these specific values? Could this have been done differently? Could the weights have followed different trajectories? In either case, can it be proved?
Let’s begin, this is gonna be a lot of fun! (And a lot of work!!)
2 What do those parameters mean?
Note
I will use weights and parameters interchangeably. One could argue that they have different meanings (and it might even be true), but in our case we can loosen this bit of terminology and live happily anyway.
So: what do those parameters mean?
I tried to figure this out because I noticed one really weird thing looking at the last hidden layer of our walkthrough.
Let’s quickly refresh the example; remember that the input sequence was:
My architecture was given the smallest hidden layer possible on purpose. What I was trying to do was push the GRU cell to make the best out of what it had, and my expectation was that at some point it would learn to simply add to its hidden layer the value in position \(0\) iff the flag in position \(1\) was equal to \(true\) (or \(1\), ok…). This didn’t happen, and looking back at my original thought I feel a bit stupid about it. Let’s recall what a GRU cell is. Initially, for \(t = 0\), the hidden layer is \(h_0 = 0\).
Then the cell uses this series of transformations to get the hidden layer out:
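\[\eqalign{
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z)\cr
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)\cr
\hat{h}_t &= \phi\bigl(W_h x_t + b_{ih_n} + r_t \odot (U_h h_{t-1} + b_{hh_n})\bigr)\cr
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t\cr
}\]
(These are the standard GRU equations, restated here in the PyTorch convention used throughout this series: \(\sigma\) is the sigmoid, \(\phi\) the hyperbolic tangent, and \(\odot\) the element-wise product. See the previous post for the full walkthrough.)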
If you look at the activation functions this should already ring a bell, but let’s walk through it step by step. Let’s assume we always have four elements and \(2\) of them are summed into the total. Let’s also assume that we normalize by the maximum each single value can reach (in our case \(100\)). Let’s have an extreme example now:
Now: if my first hypothesis was right, the hidden layer should have contained something like \(1.0 + 1.0 = 2.0\). But this could never have happened. Why? Look at the activation functions and also at how the new hidden cell is calculated \[
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t
\]
Basically it’s a weighted sum of the old value of the cell and the new value of the cell. If the weight was either \(0\) or \(1\) in the extreme cases, this would have meant that either the old value was kept as it was (ignoring the new value entirely) or the converse: the old value was forgotten and the new one took its place. So we might expect the truth to be in the middle, meaning that \(z_t\) was around \(.5\), so that it took half the information from the \(t-1\) step and half from the new value. But does this make sense? Why \(\frac{1}{2}\)?
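As a tiny worked example of what a “middle” value would do (numbers picked arbitrarily, just for illustration): with \(z_t = 0.5\), \(h_{t-1} = 0.12\) and \(\hat{h}_t = 0.37\),
\[
h_t = 0.5 \cdot 0.12 + (1 - 0.5) \cdot 0.37 = 0.245,
\]
i.e. a plain average: half old memory, half new candidate.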
If we look at what the network is doing instead, we observe a cool and symmetrical (I LOVE SYMMETRIES!) behaviour. What is that? If you check the previous post you’ll see that \(z_t\) takes on some nice values:
Basically when the flag is \(0\), then \(z_t \ge .825\) otherwise \(z_t \le .166\). So we already observe a pattern. I love this so much! Let’s recall then how \(z_t\) is computed:
\[
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
\]
where \(\sigma\) is the sigmoid function. Recall that the sigmoid function squeezes its input into the range \([0,1]\). In the plot below we map \(100\) points through the sigmoid function (did you know that function, map and application are pretty much the same thing in math?).
Code
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)
y = 1 / (1 + np.exp(-x))

sns.set_theme(style="darkgrid")  # This gives the typical Seaborn look
plt.figure(figsize=(10, 6))
sns.lineplot(x=x, y=y, linewidth=2)
plt.title('Sigmoid Function')
plt.xlabel('x')
plt.ylabel(r'$\sigma(x)$')
plt.grid(True)
plt.show()
So now, let’s do something nice and load our weights back; they’re gonna be useful for the next section.
Code
import json
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm.notebook import trange, tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os

class AddingProblemGRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(AddingProblemGRU, self).__init__()
        self.gru = nn.GRU(
            input_size, hidden_size, num_layers=1, batch_first=True
        )
        self.linear = nn.Linear(hidden_size, output_size)
        self.init_weights()

    def init_weights(self):
        for name, param in self.gru.named_parameters():
            if "weight" in name:
                nn.init.orthogonal_(param)
            elif "bias" in name:
                nn.init.constant_(param, 0)
        nn.init.xavier_uniform_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        out, hn = self.gru(x)
        output = self.linear(out[:, -1, :])
        return output, out

# Reproducibility
RANDOM_SEED = 37
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed_all(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Hyperparameters
DELTA = 0
SEQ_LEN = 4
HIGH = 100
N_SAMPLES = 10000
TRAIN_SPLIT = 0.8
BATCH_SIZE = 256
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5
CLIP_VALUE = 2.0
NUM_EPOCHS = 3000
HIDDEN_SIZE = 1
OUTPUT_SIZE = 1
INPUT_SIZE = 2

def adding_problem_generator(N, seq_len=6, high=1, delta=0.6):
    actual_seq_len = np.random.randint(
        int(seq_len * (1 - delta)), int(seq_len * (1 + delta))
    ) if delta > 0 else seq_len
    num_ones = np.random.randint(2, min(actual_seq_len - 1, 4))
    X_num = np.random.randint(low=0, high=high, size=(N, actual_seq_len, 1))
    X_mask = np.zeros((N, actual_seq_len, 1))
    Y = np.ones((N, 1))
    for i in range(N):
        positions = np.random.choice(actual_seq_len, size=num_ones, replace=False)
        X_mask[i, positions] = 1
        Y[i, 0] = np.sum(X_num[i, positions])
    X = np.append(X_num, X_mask, axis=2)
    return X, Y

X, Y = adding_problem_generator(N_SAMPLES, seq_len=SEQ_LEN, high=HIGH, delta=DELTA)
training_len = int(TRAIN_SPLIT * N_SAMPLES)
train_X = X[:training_len]
test_X = X[training_len:]
train_Y = Y[:training_len]
test_Y = Y[training_len:]

train_dataset = TensorDataset(
    torch.tensor(train_X).float(), torch.tensor(train_Y).float()
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataset = TensorDataset(
    torch.tensor(test_X).float(), torch.tensor(test_Y).float()
)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# File paths for saved data
train_losses_path = "train_losses.json"
test_losses_path = "test_losses.json"
all_weights_path = "all_weights.json"
model_save_path = (
    f"gru_adding_problem_model_epochs_{NUM_EPOCHS}_hidden_{HIDDEN_SIZE}.pth"
)
FORCE_TRAIN = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.MSELoss()

def evaluate(model, data_loader, criterion, high):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            inputs[:, :, 0] /= high
            outputs, _ = model(inputs)
            outputs = outputs * high
            loss = criterion(outputs, labels)
            total_loss += loss.item() * inputs.size(0)
    return total_loss / len(data_loader.dataset)

# Try to load data from files
if (
    FORCE_TRAIN == False
    and os.path.exists(train_losses_path)
    and os.path.exists(test_losses_path)
    and os.path.exists(all_weights_path)
    and os.path.exists(model_save_path)
):
    with open(train_losses_path, "r") as f:
        train_losses = json.load(f)
    with open(test_losses_path, "r") as f:
        test_losses = json.load(f)
    with open(all_weights_path, "r") as f:
        all_weights_loaded = json.load(f)
    # Convert loaded weights (which are lists) back to numpy arrays
    all_weights = []
    for epoch_weights_list in all_weights_loaded:
        epoch_weights_dict = {}
        for name, weights_list in epoch_weights_list.items():
            epoch_weights_dict[name] = np.array(weights_list)
        all_weights.append(epoch_weights_dict)
    # Load model
    model = AddingProblemGRU(
        input_size=INPUT_SIZE, hidden_size=HIDDEN_SIZE, output_size=OUTPUT_SIZE
    )
    model.load_state_dict(torch.load(model_save_path))
    model.to(device)
else:
    model = AddingProblemGRU(
        input_size=INPUT_SIZE, hidden_size=HIDDEN_SIZE, output_size=OUTPUT_SIZE
    )
    model.to(device)
    optimizer = torch.optim.Adam(
        model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
    )
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5, min_lr=1e-6, verbose=False
    )
    train_losses = []
    test_losses = []
    all_weights = []
    for epoch in trange(NUM_EPOCHS, desc="Epoch"):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            inputs[:, :, 0] /= HIGH
            labels_scaled = labels / HIGH
            optimizer.zero_grad()
            outputs, _ = model(inputs)
            loss = criterion(outputs, labels_scaled)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_VALUE)
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
        epoch_loss = running_loss / len(train_loader.dataset)
        train_losses.append(epoch_loss)
        if epoch % 49 == 0:
            test_loss = evaluate(model, test_loader, criterion, HIGH)
            test_losses.append(test_loss)
            scheduler.step(test_loss)
            weights_dict = {}
            for name, param in model.named_parameters():
                weights_dict[name] = param.data.cpu().numpy().copy()
            all_weights.append(weights_dict)
        else:
            test_losses.append(None)
    # Save data to files
    with open(train_losses_path, "w") as f:
        json.dump(train_losses, f)
    with open(test_losses_path, "w") as f:
        json.dump(test_losses, f)
    # Convert weights to lists for JSON serialization
    all_weights_serializable = [
        {k: v.tolist() for k, v in epoch_weights.items()}
        for epoch_weights in all_weights
    ]
    with open(all_weights_path, "w") as f:
        json.dump(all_weights_serializable, f)
    # Save model
    model_save_path = (
        f"gru_adding_problem_model_epochs_{NUM_EPOCHS}_hidden_{HIDDEN_SIZE}.pth"
    )
    torch.save(model.state_dict(), model_save_path)

input_sequence = torch.tensor([
    [12, 0],
    [37, 1],
    [12, 0],
    [21, 1],
]).float().unsqueeze(0)
input_sequence[:, :, 0] /= HIGH

# Get weights from the last training epoch. 'all_weights' is populated by the loading/training section.
last_epoch_weights = model.state_dict()  # all_weights[-1]

# Extract the relevant weight matrices
W_ih = torch.tensor(last_epoch_weights['gru.weight_ih_l0']).float()       # Input-to-hidden
W_hh = torch.tensor(last_epoch_weights['gru.weight_hh_l0']).float()       # Hidden-to-hidden
b_ih = torch.tensor(last_epoch_weights['gru.bias_ih_l0']).float()         # Input-to-hidden bias
b_hh = torch.tensor(last_epoch_weights['gru.bias_hh_l0']).float()         # Hidden-to-hidden bias
W_linear = torch.tensor(last_epoch_weights['linear.weight']).float()      # Linear layer weights
b_linear = torch.tensor(last_epoch_weights['linear.bias']).float()        # Linear layer bias
3 The Update Gate at \(t=0\)
Let’s start our analysis from the bottom up in the calculation of the updated value of the hidden state. Given how PyTorch implements the GRU cell under the hood, the \(z_t\) value is used in this way:
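Written in the naming used below (where \(\texttt{updategate} = z_t\) and \(\texttt{newgate} = \hat{h}_t\)), the form PyTorch’s reference GRU cell effectively computes is
\[
h_t = \texttt{newgate} + \texttt{updategate} \odot (h_{t-1} - \texttt{newgate}),
\]
which is just a rearrangement of the \(h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t\) we saw before.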
This basically means that the higher the value of \(\texttt{updategate}\), the more we keep in memory our previous value (the running sum?) and ignore the newgate (the update to memory, the term to be added to the sum?). This lines up with our previous observation: when the flag was \(1\), the update gate was pretty low, and when the flag was \(0\), \(z_t\) was pretty high. But how much is each component involved in this behaviour?
This is interesting. So let’s break this down a little bit.
Probably it’s a good idea to verify how each term here is calculated and what happens. Let’s begin by seeing how \(i_z\) is calculated, using the original weight matrix.
What we do to obtain \(i_z\) is multiply our input row vector \(x_0^T = \left[\begin{smallmatrix}0.12 & 0.0\end{smallmatrix}\right]\) by the column vector \(W_{ih_z}^T\).
Now this conveys a super important piece of information: when \(\texttt{flag} = 0\), only the number has some importance in the final calculation, because the flag cancels the second term of the sum, as we saw before. Now remember, this is only one of the three terms that go through the sigmoid at the end to obtain the final \(z_t\) term.
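Just to make this concrete, here’s a minimal sketch that reproduces \(i_z\) for the first time step, reusing the W_ih tensor and input_sequence loaded above. I’m assuming PyTorch’s stacked row ordering (reset, update, new) for the gate weights, which is why row 1 is the update-gate row (the same index used in the plotting snippets below).

Code

x0 = input_sequence[0, 0]        # first item, already normalized: [0.12, 0.0]
W_ih_z = W_ih[1]                 # update-gate row of the input-to-hidden weights
i_z = torch.dot(x0, W_ih_z)      # number * w_number + flag * w_flag
print(i_z.item())                # with flag = 0 only the number term survives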
So let’s carry on with the second part of it, \(h_z\): basically it’s \(0\), since when our GRU cell sees the input for the first time its hidden state is \(h=0\), and whatever we multiply it by stays zero.
\[
h_z = -0.0
\]
Now the last portion of our sum, the bias. You should know that many ML scientists avoid using the bias term when the data is already centered or when the model inherently accounts for offsets, as it can be redundant and complicate interpretation. But in my case the bias was left there. And in the first calculation you can observe that the bias term weighs super heavily on the final sum:
I’ve always believed that a plot tells more than a hundred numbers, so let’s plot a cumulative sum of the four terms to check how much each of them accounts for in the grand total:
Code
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["text.usetex"] = True

items = [r"$i_z$", r"$h_z$", r"$b_{ih_z}$", r"$b_{hh_z}$"]
values = [
    (input_sequence[0][0][0] * W_ih[1][0]).item(),
    0,
    (b_ih[1]).item(),
    (b_hh[1]).item(),
]
cumsum_values = np.cumsum(values)
df = pd.DataFrame({"Term": items, "Cumulative Sum": cumsum_values})

plt.figure(figsize=(8, 5))
sns.barplot(x="Term", y="Cumulative Sum", data=df, color="skyblue", alpha=0.7)
for i, (item, value) in enumerate(zip(items, values)):
    plt.text(
        i,
        cumsum_values[i] - value / 2,
        f"$+{value}$",
        ha="center",
        fontsize=12,
        color="black",
    )
plt.ylabel(r"\textbf{Cumulative Sum}", fontsize=12)
plt.xlabel(r"\textbf{Terms}", fontsize=12)
plt.title(r"\textbf{Cumulative Sum Contribution}", fontsize=14)
plt.ylim(0, max(cumsum_values) + .5)  # Adjust y-limit
plt.show()
Wow, so basically during the first iteration, we’re only dealing with bias terms. An important distinction to remember is that bias terms don’t depend on or connect to any individual data point being analyzed. Instead, they capture and convey overall patterns or information from the entire dataset as a whole.
To complete this first step, let’s do a sanity check. I plot here what would have happened after the first iteration to a number in the range \([0, 100]\) (remember, normalized by a factor of \(100\), so in the final range \([0,1]\)). There are 4 plots:
\(y_0\) with bias: In our trained GRU cell (green line), when \(\sigma\) is applied to an input flagged with \(0\), the values oscillate between \([0.77, 0.83]\).
\(y_1\) with bias: In our trained GRU cell (red line), when \(\sigma\) is applied to an input flagged with \(1\), the values oscillate between \([0.14, 0.20]\).
\(y_0\) no bias: If we removed the bias term \(b_z\) from the GRU cell (blue line) and applied \(\sigma\) to an input flagged with \(0\), the value stays around \(0.5\). The reason for the cell choosing such a strong bias remains unclear - it seems significant since it pushes \(z_t\) considerably higher.
\(y_1\) no bias: If we removed the bias term \(b_z\) from the GRU cell (orange line) and applied \(\sigma\) to an input flagged with \(1\), the values stay around \(0.05\). Again, the strong bias choice is super cool. One (crazy!) hypothesis: it pushes \(z_t\) up to act as a factor of \(\frac{1}{4}\), possibly because the network expects 4 items where 2 flags are in unknown positions and maintains this factor to account for missing items by adding \(\frac{1}{4}\) of the current candidate.
The takeaway here is how the bias term strongly affects the outputs, particularly in pushing up \(z_t\). While we can hypothesize about the network’s strategy (especially regarding the \(\frac{1}{4}\) factor), the exact reason for such a strong bias will be hopefully uncovered in the next sections.
Code
bz = b_ih[1].item() + b_hh[1].item()

x = np.linspace(0, 1, 1000)
y0 = x * W_ih[1][0].item()
y1 = x * W_ih[1][0].item() + W_ih[1][1].item()

y0_nb = 1 / (1 + np.exp(-(y0)))
y1_nb = 1 / (1 + np.exp(-(y1)))
y0_b = 1 / (1 + np.exp(-(y0 + bz)))
y1_b = 1 / (1 + np.exp(-(y1 + bz)))

sns.set_theme(style="darkgrid")  # This gives the typical Seaborn look
plt.figure(figsize=(10, 6))
sns.lineplot(x=x, y=y0_nb, linewidth=2, label='$y_0$ no bias')
sns.lineplot(x=x, y=y1_nb, linewidth=2, label='$y_1$ no bias')
sns.lineplot(x=x, y=y0_b, linewidth=2, label='$y_0$ with bias')
sns.lineplot(x=x, y=y1_b, linewidth=2, label='$y_1$ with bias')
plt.title('Update gate $z_0$ over the input range')
plt.xlabel('x')
plt.ylabel('$z_0$')
plt.grid(True)
plt.legend()
plt.show()
Figure 1
4 The Reset Gate at \(t=0\)
I lied earlier. I said we were going bottom up, but we’re actually approaching this top down, because next in our discussion is the reset gate. The reason is easy to tell: the update gate value \(z_t\) acts on the previous \(h_{t-1}\) as well as on \(\hat{h}_t\) (the candidate memory update). But to have a discussion about \(\hat{h}_t\) we need to discuss the reset gate \(r_t\) first.
Let’s recall again how both the reset gate and the newgate are computed:
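In the same notation as before (and following PyTorch’s placement of the hidden-side bias inside the reset product, which will matter shortly):
\[\eqalign{
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)\cr
\hat{h}_t &= \phi\bigl(W_h x_t + b_{ih_n} + r_t \odot (U_h h_{t-1} + b_{hh_n})\bigr)\cr
}\]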
First of all, you might have noticed already in the previous post that there was a different activation function in one step, which is \(\phi\). But what is \(\phi\)? It is the \(\tanh\), which yields a plot like this:
Code
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)
y = np.tanh(x)

sns.set_theme(style="darkgrid")  # This gives the typical Seaborn look
plt.figure(figsize=(10, 6))
sns.lineplot(x=x, y=y, linewidth=2)
plt.title('Hyperbolic Tangent Function')
plt.xlabel('x')
plt.ylabel(r'$\phi(x) = \tanh(x)$')
plt.grid(True)
plt.show()
Previously \(\sigma\) was squeezing its input into the range \([0, 1]\), whereas \(\phi\) maps its input to a value in the range \([-1, 1]\). The reason why the candidate (the new gate) employs this activation function is that the range \([-1, 1]\) allows it to dynamically amplify, suppress, or invert features from the previous hidden state, enabling more nuanced control over the update mechanism compared to the purely additive or multiplicative behavior of \([0, 1]\). Moreover: this symmetry and bidirectional scaling can improve gradient flow during training and help the model learn richer representations by incorporating both positive and negative adjustments to the hidden state. In other words, \(\sigma\) was saying: “How much of this should I add to this other thing?”, whereas \(\phi\) is saying “How much of this should I add to or remove from this other thing?”
Now, why is this important? Remember that the GRU cell during its computation tries to build an internal hidden state that conveys some information. Think of the hidden state as some sort of scratchpad, where you take notes as you read more data. Sometimes some of the new information should be added to the previous state, sometimes new data should be ignored. Sometimes new data instead needs to kinda remove information from the state, in order to have fresher information.
Imagine, for the sake of example, being a detective who is trying to solve a mystery. You might get new information as time flows. At some point you might even have a lead on a suspect and build your knowledge on that. But then at some point you find out that your suspect was just cheating on his wife, and as such you need to forget about him, otherwise you’ll focus on something that’s not needed in your investigation.
How our network knows what to forget, what to keep, what to update and so on is the task of machine learning. But we can surely observe what happened here!
First we need to study the reset gate \(r_t\), similarly to how we did before!
Again, there are three components that go through the sigmoid for the reset gate:
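Here’s a minimal sketch of those three pieces at \(t=0\), again assuming PyTorch’s (reset, update, new) row ordering, so the reset gate lives in row 0:

Code

x0 = input_sequence[0, 0]                 # [0.12, 0.0]
i_r = torch.dot(x0, W_ih[0])              # input contribution to the reset gate
h_r = 0.0                                 # hidden contribution: h_0 = 0
b_r = b_ih[0] + b_hh[0]                   # combined bias
r_0 = torch.sigmoid(i_r + h_r + b_r)      # reset gate at t = 0
print(i_r.item(), b_r.item(), r_0.item())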
As we observed before, the bias is king again, because it accounts for the largest part of the activation. But we can observe something else here. Let’s compare how the flag affects both the update (\(z_t\)) and reset (\(r_t\)) gates.
In Figure 1 we saw how \(z_t\) was affected pretty heavily by the flag at position \(1\) of the input item, whereas in this case we observe that the reset gate is not affected that much by the flag. Basically it looks like, whether we have \(\texttt{flag} = 0\) or \(\texttt{flag} = 1\), the reset gate \(r_t\) keeps taking a pretty large value (remember that the sigmoid output is in the range \([0,1]\), so you can read it as a percentage, meaning that the reset gate is always above \(80\%\)).
Why is this? Let’s take a look at the weights involved. Again, since this is the first step in the sequence, the hidden state is not yet involved and we can safely (FOR NOW!) ignore it.
Let’s compare it with the update gate weights: \[\eqalign{
i_z &= x_t \cdot W_{ih_z}^T\cr
W_{ih_z}^T &=
\begin{bmatrix}
-0.4197608530521393 \\ -3.02486252784729
\end{bmatrix} \cr
}\]
We can make two key observations:
The update weights are both negative while the reset weights are both positive.
The difference in magnitude is noteworthy: the update gate is heavily affected by the flag input, moving the activation along the \(\hat{y}\) axis, whereas the reset gate is not so strongly affected by it.
Let’s plot them both here where \(y_{0z}\) and \(y_{1z}\) are the update gates when \(\texttt{flag} = 0\) and \(\texttt{flag} = 1\), and \(y_{0r}\) and \(y_{1r}\) are the reset gates in the same cases, respectively.
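A minimal sketch of that plot, in the same style as the earlier one (the reset gate living in row 0, under the same ordering assumption as before):

Code

x = np.linspace(0, 1, 1000)
bz = b_ih[1].item() + b_hh[1].item()   # update-gate bias
br = b_ih[0].item() + b_hh[0].item()   # reset-gate bias
y0z = 1 / (1 + np.exp(-(x * W_ih[1][0].item() + bz)))                      # update gate, flag = 0
y1z = 1 / (1 + np.exp(-(x * W_ih[1][0].item() + W_ih[1][1].item() + bz)))  # update gate, flag = 1
y0r = 1 / (1 + np.exp(-(x * W_ih[0][0].item() + br)))                      # reset gate, flag = 0
y1r = 1 / (1 + np.exp(-(x * W_ih[0][0].item() + W_ih[0][1].item() + br)))  # reset gate, flag = 1
plt.figure(figsize=(10, 6))
sns.lineplot(x=x, y=y0z, linewidth=2, label='$y_{0z}$')
sns.lineplot(x=x, y=y1z, linewidth=2, label='$y_{1z}$')
sns.lineplot(x=x, y=y0r, linewidth=2, label='$y_{0r}$')
sns.lineplot(x=x, y=y1r, linewidth=2, label='$y_{1r}$')
plt.xlabel('x')
plt.ylabel('gate activation at $t=0$')
plt.legend()
plt.show()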
This is all cool and stuff… But I forgot to mention that it is also useless. Just kidding. But there’s an actual catch here: at \(t=0\) this is acting only on the bias term, because \(r_t\) is applied via a Hadamard product (element-wise multiplication) to the previous hidden state plus the bias. Its purpose is to control how much of it we want to remember. But at \(t=0\), \(h_t = h_0 = 0\), meaning \(r_0\) only acts on the bias term!
5 The New Gate at \(t=0\)
You will find multiple ways to refer to it in the literature, but I think the most useful is: candidate \(\hat{h}_t\). Basically, it weighs some of the input \(\mathbf{x}_t\) and some of the \(h_{t-1}\) value, using the reset gate \(r_t\) to decide how much of the latter should be brought into the activation. In other words, the candidate \(\hat{h}_t\) represents a proposed new hidden state that combines the current input \(\mathbf{x}_t\) and the previous hidden state \(h_{t-1}\), modulated by the reset gate \(r_t\).
Let’s recall how the candidate hidden state \(\hat{h}_t\) is computed:
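In PyTorch’s formulation (with the hidden-side bias sitting inside the reset product):
\[
\hat{h}_t = \phi\bigl(W_h x_t + b_{ih_n} + r_t \odot (U_h h_{t-1} + b_{hh_n})\bigr)
\]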
Up until now we have always treated the bias terms as a single value, summing them up. But now we need to be extra careful! Previously we had 2 terms added up, each with its own bias, so we could safely add the biases either during the matrix product or later, inside the sigmoid \(\sigma\). But now the hidden-side bias term is multiplied by the reset gate, just like the projected hidden state.
One term cancels out, because at \(t=0\) we have \(h_0 = 0\):
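\[
\hat{h}_0 = \phi\bigl(W_h x_0 + b_{ih_n} + r_0 \odot b_{hh_n}\bigr)
\]
(just the formula above with the \(U_h h_{t-1}\) term dropped, since \(h_0 = 0\)).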
From the previous section we learnt that \(r_t\) always takes values above \(.8\), which at \(t=0\) means that we carry at least \(80\%\) of the hidden-side bias into the activation \(\phi\).
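A minimal sketch of that computation, with the new gate in row 2 under the same ordering assumption as before (and r_0 taken from the reset-gate sketch above):

Code

x0 = input_sequence[0, 0]
i_n = torch.dot(x0, W_ih[2]) + b_ih[2]   # input part of the new gate, plus its own bias
h_n = b_hh[2]                            # hidden part collapses to its bias, since h_0 = 0
h_hat_0 = torch.tanh(i_n + r_0 * h_n)    # candidate hidden state at t = 0
print(h_hat_0.item())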
Again, bias is king! It accounts for basically the whole activation. And again a reminder about it:
Important
Bias is not about a single data point. Instead, it reflects a systematic shift in the entire dataset or even during the learning process. It tells us something about the data as a whole, rather than just the point we’re observing. This is so important: no matter what data we input, the bias would remain the same!
6 Final step
So we’re finally ready for our final hidden state and, from it, the network’s prediction:
\[
\hat{y} = 112.553955078125 \approxeq 113 = y
\]
Mmmmh! That works much better. But why? It seems the network has learned three key things:
It’s a sum: The network has learned to sum only the numbers flagged with \(1\). Well, that was actually the main task.
Fixed length \(n = 4\): It expects an input sequence of length \(n = 4\).
Exactly two flagged items: It assumes that exactly two elements in the input are flagged with \(1\). (We can poke at the last two assumptions directly; see the sketch right after this list.)
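The last two points are hypotheses rather than certainties at this stage. A quick, hedged way to poke at them is to feed the trained model sequences it was never trained on and compare its prediction with the true flagged sum (no results asserted here; run it and see how far off it gets):

Code

def predict(seq):
    """Run one raw sequence through the trained model and undo the /HIGH scaling."""
    x = torch.tensor(seq).float().unsqueeze(0)
    x[:, :, 0] /= HIGH
    with torch.no_grad():
        out, _ = model(x.to(device))
    return out.item() * HIGH

three_flags = [[10, 1], [20, 1], [30, 1], [40, 0]]            # true flagged sum = 60
longer_seq = [[10, 0], [20, 1], [30, 0], [40, 1], [50, 0]]    # length 5, true flagged sum = 60
print(predict(three_flags), predict(longer_seq))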
First Step Insights: What Have We Learned?
Alright, let’s pause and take stock. After dissecting the GRU’s very first move, some things are becoming clearer. We’re seeing how much those bias terms are driving the initial behavior – they’re the unsung heroes at \(t=0\)! And it’s fascinating how the flag in the input is already so specifically wired to control the update gate (but less so the reset gate, interesting!). Plus, we’re starting to suspect the network is already ‘assuming’ a certain kind of input – fixed length, maybe even expecting those two flagged numbers.
But remember our starting questions? We’re just scratching the surface of “what these weights mean.” And the big “WHY?” – “how did we even get these weights?” – is still a complete mystery! This first step analysis is cool, but it’s just the beginning. To really understand this GRU, we gotta dig deeper into how these weights learned to be this way. And that’s where the real fun begins…
7 Coming next…
In this part, we took a thorough look into the inner workings of our GRU cell, dissecting its very first iteration piece by piece. We carefully traced how the input and weights influence \(z_t\), \(r_t\), and \(\hat{h}_t\), step by step, uncovering how the activations evolve and what they actually mean. Along the way, we stumbled upon three key insights about what the network had learned.
But that is still just the what—now it’s time to ask why.
Why did the model learn to do this seemingly strange thing? Was it always heading in this direction, or did it explore different strategies earlier in training? Were the weights trying to do something entirely different at first?
To answer these questions, we’ll rewind the clock and analyze how the model’s weights evolved over time. Did they start off chaotic before settling into a structured pattern? Were different strategies competing before the final approach emerged?
And then, we’ll take things a step further. Instead of just observing the learned weights, we’ll craft our own by hand—designing a set that does precisely what we expect. Then, we’ll compare our manually created weights with the network’s chosen ones. Did the network find a more efficient solution? Did it take shortcuts we wouldn’t have thought of? Or did it stumble upon an elegant trick that we can learn from?
Next up: reverse-engineering learning itself. Let’s crack this thing open!