Lecture Notes on Graph Representation Learning

Author

Your Name

Published

February 3, 2025

Introduction

This lecture transitions from traditional graph analysis methods to modern graph representation learning techniques. Traditional approaches often depend on hand-engineered features, which are inflexible and necessitate substantial manual effort. We will examine the limitations of these traditional methods and introduce graph representation learning as a powerful alternative. The central idea is to automatically learn node embeddings that capture the structural information of a graph. We will discuss the encoder-decoder framework for learning these embeddings, beginning with shallow embedding methods and progressing towards more complex Graph Neural Networks (GNNs). Key concepts include latent spaces, neighborhood aggregation, message passing, and different types of aggregation and update functions used in GNNs.

Traditional Graph Analysis

Limitations of Hand-Engineered Features

Traditional graph analysis often relies on hand-engineered features, such as node and graph-level statistics. These approaches are limited because they require carefully designed, specific statistics and measures. Designing these features is a time-consuming and expensive process.

Inflexibility and Lack of Adaptability

Hand-engineered features are inflexible and cannot adapt or learn from data dynamically. Once designed, they remain static and may not generalize well to new data or different graph structures.

Example: Spectral Methods

Spectral methods, briefly mentioned earlier, exemplify traditional approaches that often rely on hand-engineered features. While powerful, they can be limited by their reliance on fixed feature representations.

Graph Representation Learning

To overcome the limitations of traditional methods, we introduce graph representation learning. This approach aims to learn representations that automatically encode the structural information of the graph, moving from static, hand-engineered features to dynamic, learning-based approaches.

Learning Embeddings

In graph representation learning, we learn vectors of features and adjust their coordinates in a meaningful latent space. This involves embedding nodes into a vector space where the geometry of the space reflects the graph structure.

Encoding and Decoding Neighborhoods

A crucial concept in graph representation learning is encoding and decoding neighborhoods. We aim to embed graph nodes into a latent space by considering their neighborhoods.

Latent Space

We embed nodes and their neighborhoods into a latent space. This space is designed to capture the essential structural information of the graph in a compressed form. As illustrated in Figure 3.1 (not reproduced here), we map nodes from the original network into this embedding space.

Low-Dimensionality

The latent space must be low-dimensional to simplify and compress the information, retaining only the essential structural aspects of the graph.

Desiderata for Latent Space

The desired properties for a latent space are:

  1. Significant and interesting properties of nodes and their neighborhoods should be transformed into geometric or vectorial properties in this space.

  2. The latent space must be as low-dimensional as possible, providing a distilled and efficient representation.

Encoder-Decoder Framework

We organize our discussion of node embeddings based on the encoder-decoder framework. This approach is a recurring theme in graph representation learning. As shown in Figure 3.2 (not reproduced here), the encoder maps nodes to a low-dimensional embedding, and the decoder reconstructs the local neighborhood information from these embeddings.

Encoding Step

Definition 1 (Encoder). The encoder, denoted as \(\ENC\), takes a node \(u\) and maps it to a low-dimensional embedding \(\mathbf{z}_u \in \mathbb{R}^d\). For shallow embedding approaches, this mapping is often a direct mapping. Mathematically, \(\ENC : \mathcal{V} \rightarrow \mathbb{R}^d\).

Decoding Step

Definition 2 (Decoder). The decoder, denoted as \(\DEC\), takes an embedding \(\mathbf{z}_u\) (and potentially \(\mathbf{z}_v\)) and reconstructs the local neighborhood information of node \(u\). For pairwise decoders, \(\DEC\) takes two embeddings \(\mathbf{z}_u\) and \(\mathbf{z}_v\) and outputs a real number that approximates the similarity \(S_{u,v}\) between nodes \(u\) and \(v\). Mathematically, \(\DEC(\ENC(u), \ENC(v)) = \DEC(\mathbf{z}_u, \mathbf{z}_v) \approx S_{u,v}\). We focus on reconstructing the local neighborhood, not the entire graph structure at once.

Shallow Embedding Approaches

Shallow embedding approaches use a direct mapping as the encoder. The embedding \(\mathbf{z}_u\) is a \(d\)-dimensional vector in \(\mathbb{R}^d\). Let \(Z \in \mathbb{R}^{|\mathcal{V}| \times d}\) be the matrix of embeddings for all nodes.

Encoder: Direct Mapping

In shallow embedding, the encoder \(\ENC(u)\) is a direct mapping to \(\mathbb{R}^d\): it takes only the node identity as its argument (no node features) and simply looks up that node's embedding vector, without any further transformation.
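To make the direct mapping concrete, the following is a minimal NumPy sketch of a shallow encoder: the trainable parameters are simply the rows of an embedding matrix, and \(\ENC(u)\) is a row lookup. The names `Z` and `encode` are illustrative, not tied to any particular library.

```python
import numpy as np

num_nodes, d = 5, 3                    # |V| = 5 nodes, d = 3 dimensions (toy sizes)
rng = np.random.default_rng(0)

# Shallow encoder: a free embedding matrix Z with one trainable row per node.
Z = rng.normal(size=(num_nodes, d))

def encode(u: int) -> np.ndarray:
    """ENC(u): direct mapping -- simply look up row u of Z."""
    return Z[u]

z_u = encode(2)                        # embedding of node 2, shape (d,)
```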

Pairwise Decoder

The decoder \(\DEC\) predicts relationships or similarities between pairs of nodes based on their embeddings. It is a pairwise decoder that operates on pairs of embeddings \(\mathbf{z}_u\) and \(\mathbf{z}_v\), mapping them to a similarity measure \(S_{u,v}\).

Loss Function

To learn the parameters of the encoder and decoder, we define a loss function \(\mathcal{L}\). This function measures the discrepancy between the decoder output and the actual similarity measure.

Empirical Reconstruction Loss

Definition 3 (Empirical Reconstruction Loss). The loss function \(\mathcal{L}\) is defined as the sum over pairs of nodes \((u, v)\) in a training set \(D\):

\[\mathcal{L} = \sum_{(u,v) \in D} \ell(\DEC(\mathbf{z}_u, \mathbf{z}_v), S_{u,v})\]

where \(\ell\) is a loss function (e.g., mean-squared error or cross-entropy), and \(D\) is the training set of node pairs. The goal is to minimize this empirical reconstruction loss to find the optimal encoder and decoder parameters, which is an instance of empirical risk minimization (ERM):

\[\text{ERM}(S) \in \text{argmin}_{h \in \mathcal{H}} \mathcal{L}_S(h)\]

Table 3.1 (not reproduced here) summarizes well-known shallow embedding algorithms, which differ in their choices of decoder, similarity measure, and loss function.

Factorization-Based Approaches

Factorization-based approaches view the encoder-decoder framework from the perspective of matrix factorization. The challenge of decoding local neighborhood structure is related to reconstructing entries in a node-node similarity matrix \(S\), which generalizes the adjacency matrix.

Matrix Factorization

Matrix factorization is used to learn a low-dimensional approximation of the similarity matrix \(S\).

Euclidean Distance Decoder

Example 4 (Euclidean Distance Decoder). A common decoder is the squared Euclidean distance:

\[\DEC(\mathbf{z}_u, \mathbf{z}_v) = ||\mathbf{z}_u - \mathbf{z}_v||^2\]

The loss function in this case is:

\[\mathcal{L} = \sum_{(u,v) \in D} \DEC(\mathbf{z}_u, \mathbf{z}_v) \cdot S_{u,v}\]

Minimizing this loss aims to bring embeddings of similar nodes closer in Euclidean space. If \(S\) satisfies Laplacian matrix properties, minimizing this loss is equivalent to spectral clustering.
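As a minimal NumPy sketch (assuming the similarity matrix \(S\) and the training pairs \(D\) are given; the path-graph adjacency below is just a toy choice):

```python
import numpy as np

def dec_euclidean(z_u: np.ndarray, z_v: np.ndarray) -> float:
    """DEC(z_u, z_v) = ||z_u - z_v||^2, the squared Euclidean distance."""
    diff = z_u - z_v
    return float(diff @ diff)

def euclidean_loss(Z: np.ndarray, S: np.ndarray, pairs) -> float:
    """L = sum over training pairs (u, v) of DEC(z_u, z_v) * S[u, v]."""
    return sum(dec_euclidean(Z[u], Z[v]) * S[u, v] for u, v in pairs)

# Toy setup: 4 nodes on a path graph, similarity = adjacency matrix.
Z = np.random.default_rng(0).normal(size=(4, 2))
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pairs = [(u, v) for u in range(4) for v in range(4) if u != v]
print(euclidean_loss(Z, S, pairs))
```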

Inner Product Decoder

Example 5 (Inner Product Decoder). An alternative decoder is the inner product:

\[\DEC(\mathbf{z}_u, \mathbf{z}_v) = \mathbf{z}_u^T \mathbf{z}_v\]

The loss function becomes:

\[\mathcal{L} = \sum_{(u,v) \in D} ||\DEC(\mathbf{z}_u, \mathbf{z}_v) - S_{u,v}||^2\]

The inner product decoder focuses on the alignment of embedding vectors. Different similarity measures, such as the adjacency matrix \(A\) or powers of \(A\) (e.g., \(A^k\)), lead to different loss functions.
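The corresponding sketch for the inner-product decoder with a squared-error reconstruction loss, under the same toy setup as before:

```python
import numpy as np

def dec_inner(z_u: np.ndarray, z_v: np.ndarray) -> float:
    """DEC(z_u, z_v) = z_u^T z_v."""
    return float(z_u @ z_v)

def inner_product_loss(Z: np.ndarray, S: np.ndarray, pairs) -> float:
    """L = sum over training pairs of (DEC(z_u, z_v) - S[u, v])^2."""
    return sum((dec_inner(Z[u], Z[v]) - S[u, v]) ** 2 for u, v in pairs)
```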

Connection to Spectral Methods

Remark 6. If \(S\) is constructed to satisfy Laplacian matrix properties, minimizing the loss with a Euclidean distance decoder is identical to the spectral clustering solution. The optimal embeddings are given by the eigenvectors corresponding to the \(d\) smallest eigenvalues of the Laplacian, excluding the all-ones eigenvector (which has eigenvalue zero).

Singular Value Decomposition (SVD)

Remark 7. Minimization of the loss function in factorization-based approaches is often done using Singular Value Decomposition (SVD). By stacking node embeddings into a matrix \(Z \in \mathbb{R}^{|\mathcal{V}|\times d}\), the reconstruction objective can be written as:

\[\mathcal{L} \approx ||ZZ^T - S||^2_F\]

This corresponds to a low-dimensional factorization of the similarity matrix \(S\). SVD provides a way to find a low-rank approximation of \(S\).
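A short NumPy sketch of this route, assuming a symmetric similarity matrix \(S\): for a positive semidefinite \(S\), the truncated SVD yields the best rank-\(d\) approximation in Frobenius norm (Eckart-Young), while for a general symmetric \(S\) it is the usual heuristic.

```python
import numpy as np

def svd_embeddings(S: np.ndarray, d: int) -> np.ndarray:
    """Rank-d factorization of a symmetric similarity matrix S via SVD.

    Returns Z such that Z @ Z.T approximates S (exactly the best rank-d
    approximation when S is positive semidefinite)."""
    U, sigma, _ = np.linalg.svd(S)
    return U[:, :d] * np.sqrt(sigma[:d])      # Z = U_d Sigma_d^{1/2}

S = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = svd_embeddings(S, d=2)
print(np.linalg.norm(Z @ Z.T - S, "fro"))     # reconstruction error
```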

Stochastic Decoders and Random Walks

When a deterministic similarity measure \(S_{u,v}\) is not available, stochastic decoders can be used. These approaches optimize embeddings to encode the statistics of random walks.

Probability of Visiting Nodes

Example 8 (Stochastic Decoder). Consider a stochastic decoder based on the probability of visiting node \(v\) starting from node \(u\) in a random walk of length \(T\), denoted as \(p_{G,T}(v|u)\). A possible decoder is:

\[\DEC(\mathbf{z}_u, \mathbf{z}_v) \approx \frac{\exp(\mathbf{z}_u^T \mathbf{z}_v)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{z}_u^T \mathbf{z}_k)} \approx p_{G,T}(v|u)\]

This decoder is stochastic and asymmetric, capturing directed relationships in graphs.
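Computed naively, this decoder is a softmax over the inner products of \(\mathbf{z}_u\) with every node's embedding. A minimal NumPy sketch (`Z` is the full embedding matrix; the function name is illustrative):

```python
import numpy as np

def dec_softmax(Z: np.ndarray, u: int, v: int) -> float:
    """DEC(z_u, z_v) = exp(z_u^T z_v) / sum_k exp(z_u^T z_k),
    a stochastic, asymmetric decoder approximating p_{G,T}(v | u)."""
    scores = Z @ Z[u]                      # z_u^T z_k for every node k
    scores = scores - scores.max()         # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[v])
```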

Negative Log-Likelihood Loss

Example 9 (Negative Log-Likelihood Loss). A common loss function for stochastic decoders is the negative log-likelihood:

\[\mathcal{L} = \sum_{(u,v) \in D} -\log(\DEC(\mathbf{z}_u, \mathbf{z}_v))\]

Here, \(D\) is a training set of node pairs \((u, v)\) that co-occur on random walks, generated by sampling walks of length \(T\) starting from each node.

Computational Challenges

Remark 10. Evaluating the denominator \(\sum_{v' \in \mathcal{V}} \exp(\mathbf{z}_u^T \mathbf{z}_{v'})\) in the decoder has a computational cost of \(O(|\mathcal{V}|)\), so every training pair incurs this cost and training becomes expensive on large graphs. Techniques like negative sampling are used to address this computational bottleneck.
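As a hedged sketch of the idea, in the style of DeepWalk/node2vec-type objectives, negative sampling replaces the full softmax denominator with a handful of sampled "negative" nodes, so each training pair costs only a constant number of evaluations instead of \(O(|\mathcal{V}|)\). The uniform noise distribution and the helper names below are simplifications for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, rng, num_neg=5):
    """Negative-sampling surrogate for the softmax loss on the pair (u, v):
    -log sigma(z_u . z_v) - sum over sampled negatives vn of log sigma(-z_u . z_vn).
    Cost per pair is O(num_neg) rather than O(|V|)."""
    loss = -np.log(sigmoid(Z[u] @ Z[v]))
    negatives = rng.integers(0, Z.shape[0], size=num_neg)   # uniform noise (simplification)
    loss -= np.log(sigmoid(-(Z[u] @ Z[negatives].T))).sum()
    return float(loss)
```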

Limitations of Shallow Embeddings

Shallow embedding methods have several limitations:

Lack of Parameter Sharing

Shallow embeddings do not share parameters between nodes in the encoder. Each node has a unique embedding vector optimized independently. This is statistically and computationally inefficient, especially for large graphs where the number of parameters grows as \(O(|\mathcal{V}|)\).

Ignoring Node Features

Shallow embedding approaches do not leverage node features in the encoder. Many real-world graphs have rich node features that could be informative for encoding.

Transductive Nature

Shallow embedding methods are inherently transductive. They can only generate embeddings for nodes present during training. Generalizing to new, unseen nodes is not possible without additional optimizations. This limits their applicability in inductive settings.

Graph Neural Networks (GNNs)

To overcome the limitations of shallow embeddings, we move to more complex encoder models, specifically Graph Neural Networks (GNNs). GNNs provide a general framework for defining deep neural networks on graph data.

Moving Beyond Shallow Embeddings

GNNs address the limitations of shallow embeddings by generating node representations that depend on the graph structure and node features.

GNN Formalism

The key idea of GNNs is to learn node representations that are functions of both the graph structure and node features. This is achieved through neural message passing.

Challenges with CNNs and RNNs on Graphs

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), while powerful, are not directly applicable to general graphs: CNNs are designed for regular grids and RNNs for sequences, whereas graphs have an irregular structure. A naive approach that linearizes the graph and uses a Multilayer Perceptron (MLP) is order-dependent. For example, representing the graph by the rows \(A[1], A[2], \ldots, A[|\mathcal{V}|]\) of its adjacency matrix and feeding their concatenation into an MLP, \(\mathbf{z}_G = \text{MLP}(A[1] \oplus A[2] \oplus \ldots \oplus A[|\mathcal{V}|])\), depends on the chosen node ordering, which is undesirable.

We need permutation invariant or equivariant functions for graph representations.

  • Permutation Invariance: \(f(PAP^T) = f(A)\)

  • Permutation Equivariance: \(f(PAP^T) = Pf(A)\)

where \(P\) is a permutation matrix applied to the adjacency matrix \(A\).
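These properties are easy to check numerically. In the toy NumPy check below, the total edge sum is a permutation-invariant function of \(A\), while the vector of node degrees is permutation-equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 5))
A = np.triu(A, 1)
A = A + A.T                                   # random undirected adjacency matrix

P = np.eye(5)[rng.permutation(5)]             # random permutation matrix

f_inv = lambda A: A.sum()                     # invariant readout: twice the edge count
f_equ = lambda A: A.sum(axis=1)               # equivariant readout: node degrees

print(np.allclose(f_inv(P @ A @ P.T), f_inv(A)))        # f(PAP^T) = f(A)
print(np.allclose(f_equ(P @ A @ P.T), P @ f_equ(A)))    # f(PAP^T) = P f(A)
```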

Neural Message Passing

Neural Message Passing is a framework for performing computations directly on the graph structure, forming the foundation of GNNs.

Convolution on Manifolds

GNNs can be viewed as generalizing convolution beyond Euclidean grids to manifolds, i.e., topological spaces that are locally homeomorphic to Euclidean space, and more generally to graphs. This perspective connects GNNs to geometric deep learning.

Differentiable Belief Propagation

GNNs can also be seen as differentiable variants of belief propagation algorithms, highlighting their connection to probabilistic graphical models.

Connection to the Weisfeiler-Lehman Test

GNNs have connections to the Weisfeiler-Lehman (WL) graph isomorphism test, which is crucial for understanding their representational power.

Ingredients of Message Passing

The core ingredients of the message passing framework are:

Iterations (Layers)

GNNs operate in layers, similar to deep neural networks.

Hidden Embeddings

Each node \(u\) has a hidden embedding \(\mathbf{h}^{(k)}_u\) at the \(k\)-th iteration (layer).

Messages

Nodes exchange messages with their neighbors.

AGGREGATE and UPDATE Functions

Two key functions in each layer are:

  • \(\text{AGGREGATE}^{(k)}\): Aggregates messages from neighbors.

  • \(\text{UPDATE}^{(k)}\): Updates a node’s embedding based on the aggregated message.

These functions are differentiable, allowing GNN training via backpropagation. Figure 5.1 (not reproduced here) illustrates how a node aggregates messages from its local neighborhood.

Mathematical Formulation of Message Passing

Definition 11 (Message Passing Process). The message passing process can be mathematically formulated as:

\[\mathbf{h}^{(k+1)}_u = \text{UPDATE}^{(k)} \left( \mathbf{h}^{(k)}_u, \text{AGGREGATE}^{(k)} \left( \{ \mathbf{h}^{(k)}_v, \forall v \in \mathcal{N}(u) \} \right) \right)\]

Let \(\mathbf{m}^{(k)}_{\mathcal{N}(u)} = \text{AGGREGATE}^{(k)} \left( \{ \mathbf{h}^{(k)}_v, \forall v \in \mathcal{N}(u) \} \right)\) be the aggregated message. Then,

\[\mathbf{h}^{(k+1)}_u = \text{UPDATE}^{(k)} \left( \mathbf{h}^{(k)}_u, \mathbf{m}^{(k)}_{\mathcal{N}(u)} \right)\]

Starting with input representations at layer \(k=0\), after \(K\) iterations, we obtain the final node embeddings: \(\mathbf{z}_u = \mathbf{h}^{(K)}_u, \forall u \in \mathcal{V}\). GNNs defined this way are permutation equivariant because the \(\text{AGGREGATE}\) function operates on a set of neighbors.
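The following is a minimal, framework-free sketch of one such iteration, with \(\text{AGGREGATE}\) and \(\text{UPDATE}\) passed in as plain functions; the concrete choices at the bottom (sum aggregation, an additive tanh update) are placeholders, not a specific published architecture.

```python
import numpy as np

def message_passing(H, neighbors, aggregate, update):
    """One iteration: h_u^(k+1) = UPDATE(h_u^(k), AGGREGATE({h_v^(k) : v in N(u)})).

    H:         (|V|, d) array of current hidden embeddings
    neighbors: dict mapping each node u to a list of its neighbors
    aggregate: maps a list of neighbor embeddings to a message vector
    update:    maps (h_u, m_u) to the new embedding of node u
    """
    return np.stack([
        update(H[u], aggregate([H[v] for v in neighbors[u]]))
        for u in range(H.shape[0])
    ])

# Placeholder choices: sum aggregation and a simple additive tanh update.
sum_aggregate = lambda msgs: np.sum(msgs, axis=0) if msgs else 0.0
tanh_update = lambda h_u, m_u: np.tanh(h_u + m_u)
```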

Node Features

Unlike shallow embedding methods, GNNs typically require node features \(\mathbf{x}_u, \forall u \in \mathcal{V}\) as input.

Options When Node Features Are Unavailable

If node features are not available, options include:

  1. Using node statistics (e.g., from Section 2.1) to define features.

  2. Using identity features: associating each node with a one-hot indicator feature. However, using identity features makes the model transductive.

Types of Information Passed

Two main types of information are passed in messages:

Structural Information

Graph topological information is inherently passed through the message aggregation process.

Feature-Based Information

Feature information is also aggregated from the neighborhoods. After \(k\) iterations, embeddings encode information about features in the \(k\)-hop neighborhood. This is analogous to convolutional kernels in CNNs, but GNNs aggregate information based on graph neighborhoods instead of spatial patches.

Implementation of AGGREGATE and UPDATE

Common choices for implementing \(\text{AGGREGATE}\) and \(\text{UPDATE}\) functions include vector/matrix operations.

Vector/Matrix Operations

A common formulation is:

\[\mathbf{h}^{(k)}_u = \sigma \left( \mathbf{W}^{(k)}_{\text{self}} \mathbf{h}^{(k-1)}_u + \mathbf{W}^{(k)}_{\text{neigh}} \sum_{v \in \mathcal{N}(u)} \mathbf{h}^{(k-1)}_v + \mathbf{b}^{(k)} \right)\]

where \(\mathbf{W}^{(k)}_{\text{self}}, \mathbf{W}^{(k)}_{\text{neigh}} \in \mathbb{R}^{d^{(k)} \times d^{(k-1)}}\) are weight matrices, \(\mathbf{b}^{(k)}\) is a bias vector, and \(\sigma\) is a non-linear activation function. This is a node-based formulation using an Elman-style update.

Elman-Style Update

This update resembles the recurrence of a simple Elman-style recurrent network: a linear transformation of the node's previous embedding and of the incoming message, followed by a non-linear activation. We can express it using explicit message aggregation: \[\begin{aligned} \mathbf{m}^{(k-1)}_{\mathcal{N}(u)} &= \sum_{v \in \mathcal{N}(u)} \mathbf{h}^{(k-1)}_v \\ \mathbf{h}^{(k)}_u &= \sigma \left( \mathbf{W}^{(k)}_{\text{self}} \mathbf{h}^{(k-1)}_u + \mathbf{W}^{(k)}_{\text{neigh}} \mathbf{m}^{(k-1)}_{\mathcal{N}(u)} + \mathbf{b}^{(k)} \right) \end{aligned}\]

Graph-Level Formulation

In matrix notation, the graph-level formulation is: \[H^{(k)} = \sigma \left( AH^{(k-1)} \mathbf{W}^{(k)}_{\text{neigh}} + H^{(k-1)} \mathbf{W}^{(k)}_{\text{self}} \right)\] where \(A\) is the adjacency matrix and \(H^{(k)}\) is the matrix of hidden embeddings at layer \(k\). By adding self-loops (i.e., considering \(\mathcal{N}(u) \cup \{u\}\)), we can replace \(\mathbf{W}^{(k)}_{\text{self}}\) by modifying the aggregation: \[H^{(k)} = \sigma \left( (A + I)H^{(k-1)} \mathbf{W}^{(k)} \right)\] In any case, the \(\text{AGGREGATE}\) function is essentially a summation (\(\sum\)).
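A NumPy sketch of the self-loop matrix form \(H^{(k)} = \sigma\left((A + I)H^{(k-1)} \mathbf{W}^{(k)}\right)\), with \(\sigma = \tanh\) and toy shapes chosen purely for illustration:

```python
import numpy as np

def gnn_layer_matrix(H, A, W):
    """Graph-level update with self-loops: H^(k) = sigma((A + I) H^(k-1) W),
    with sigma = tanh here purely for illustration."""
    return np.tanh((A + np.eye(A.shape[0])) @ H @ W)

# Toy run: 4-node path graph, 3-dimensional inputs, 2-dimensional outputs.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))        # input node features X
W1 = rng.normal(size=(3, 2))
H1 = gnn_layer_matrix(H0, A, W1)    # hidden embeddings after one layer, shape (4, 2)
```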

Neighborhood Normalization

To improve aggregation, especially when dealing with nodes with varying degrees, neighborhood normalization can be used.

Average Aggregation

Instead of simple summation, average aggregation normalizes by the degree of the node: \[\mathbf{m}^{(k-1)}_{\mathcal{N}(u)} = \frac{\sum_{v \in \mathcal{N}(u)} \mathbf{h}^{(k-1)}_v}{|\mathcal{N}(u)|}\]

Symmetric Normalization

Symmetric normalization accounts for degree differences between nodes \(u\) and \(v\): \[\mathbf{m}^{(k-1)}_{\mathcal{N}(u)} = \sum_{v \in \mathcal{N}(u)} \frac{\mathbf{h}^{(k-1)}_v}{\sqrt{|\mathcal{N}(u)| |\mathcal{N}(v)|}}\]

Graph Convolutional Networks (GCN) Example

Example 12 (Graph Convolutional Networks (GCNs)). Graph Convolutional Networks (GCNs) use symmetric normalization and self-loops:

\[\mathbf{h}^{(k)}_u = \sigma \left( \mathbf{W}^{(k)} \sum_{v \in \mathcal{N}(u) \cup \{u\}} \frac{\mathbf{h}^{(k-1)}_v}{\sqrt{|\mathcal{N}(u)| |\mathcal{N}(v)|}} \right)\]

Here, \(|\mathcal{N}(v)|\) should be interpreted as \(|\mathcal{N}(v) \cup \{v\}|\) when \(v=u\).
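In matrix form, this update is commonly implemented by symmetrically normalizing \(A + I\) with the degrees of the self-loop-augmented graph, which is consistent with the self-loop interpretation noted above. A minimal NumPy sketch (tanh stands in for the activation \(\sigma\)):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: add self-loops, normalize symmetrically by the
    resulting degrees, aggregate, then transform:
    H^(k) = tanh( D~^{-1/2} (A + I) D~^{-1/2} H^(k-1) W )."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))              # D~^{-1/2} as a vector
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.tanh(A_hat @ H @ W)
```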

Set Aggregators

The neighborhood aggregation is fundamentally a set function, operating on the set of neighbor embeddings \(\{ \mathbf{h}^{(k)}_v, \forall v \in \mathcal{N}(u) \}\). Aggregation functions must be permutation invariant.

Set Pooling

Set pooling replaces simple summation with more complex functions, often using MLPs: \[\mathbf{m}^{(k-1)}_{\mathcal{N}(u)} = \text{MLP}_\theta \left( \sum_{v \in \mathcal{N}(u)} \text{MLP}_\phi(\mathbf{h}^{(k-1)}_v) \right)\] This adds extra layers and allows for more complex aggregation patterns, acting as a universal function approximator. Heuristic max/min/mean operations can also be used.
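A sketch of this MLP-sum-MLP pattern, with the inner and outer networks passed as callables; the tiny `mlp` helper and its weights are hypothetical stand-ins for \(\text{MLP}_\phi\) and \(\text{MLP}_\theta\).

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Tiny two-layer MLP, a stand-in for MLP_phi and MLP_theta."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def set_pool(neighbor_embs, inner, outer):
    """m_N(u) = MLP_theta( sum_v MLP_phi(h_v) ): transform every neighbor
    embedding, sum them (the permutation-invariant step), then transform
    the pooled vector."""
    pooled = np.sum([inner(h) for h in neighbor_embs], axis=0)
    return outer(pooled)

# Usage sketch: inner = lambda h: mlp(h, W1, b1, W2, b2), and similarly for outer.
```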

Janossy Pooling

Janossy pooling is another set aggregation approach. It addresses permutation invariance by using permutation-sensitive functions combined with averaging over permutations: \[\mathbf{m}^{(k-1)}_{\mathcal{N}(u)} = \text{MLP}_\theta \left( \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \rho_\phi(\mathbf{h}^{(k-1)}_{\pi_1}, \mathbf{h}^{(k-1)}_{\pi_2}, ..., \mathbf{h}^{(k-1)}_{\pi_{|\mathcal{N}(u)|}}) \right)\] where \(\Pi\) is the collection of permutations of neighbors and \(\rho_\phi\) is a permutation-sensitive function (e.g., an MLP). In practice, approximations are used to handle the computational cost of summing over all permutations, such as sampling permutations or using canonical orderings.
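A sketch of the sampled-permutation approximation: average a permutation-sensitive function \(\rho_\phi\) over a few random orderings of the neighbors, then apply \(\text{MLP}_\theta\). The position-weighted `rho` below is only a toy example of a permutation-sensitive function.

```python
import numpy as np

def janossy_pool(neighbor_embs, rho, outer_mlp, num_samples=5, rng=None):
    """Approximate Janossy pooling: average rho over randomly sampled
    orderings of the neighbors (avoiding the factorial cost of averaging
    over all permutations), then apply the outer MLP."""
    rng = rng or np.random.default_rng()
    n = len(neighbor_embs)
    samples = [rho([neighbor_embs[i] for i in rng.permutation(n)])
               for _ in range(num_samples)]
    return outer_mlp(np.mean(samples, axis=0))

# Toy permutation-sensitive rho: weight each neighbor by its position in the ordering.
rho = lambda seq: np.sum([(i + 1) * h for i, h in enumerate(seq)], axis=0)
```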

The time complexity of a single message passing step (i.e., updating the embedding of a single node) is typically \(O(|\mathcal{N}(u)| d^{(k)} d^{(k-1)})\), where \(|\mathcal{N}(u)|\) is the degree of node \(u\), and \(d^{(k)}\) and \(d^{(k-1)}\) are the dimensions of the embeddings at layers \(k\) and \(k-1\), respectively. This complexity arises from the matrix multiplications involved in the \(\text{UPDATE}\) and \(\text{AGGREGATE}\) functions.

The space complexity for storing the embeddings at each layer is \(O(|\mathcal{V}| d^{(k)})\), where \(|\mathcal{V}|\) is the number of nodes in the graph. The space complexity for storing the weight matrices is \(O(L d^2)\), where \(L\) is the number of layers and \(d\) is the maximum embedding dimension across all layers.

Conclusion

In this lecture, we transitioned from traditional graph analysis to graph representation learning, highlighting the limitations of hand-engineered features and introducing the encoder-decoder framework for learning node embeddings. We explored shallow embedding approaches and their limitations, which motivated the need for more powerful models like Graph Neural Networks (GNNs). We delved into the GNN formalism, focusing on neural message passing, its ingredients, and mathematical formulation. We discussed different implementations of \(\text{AGGREGATE}\) and \(\text{UPDATE}\) functions, including normalization techniques and set aggregators.

Key takeaways include the importance of learning representations directly from graph structure and node features, the power of message passing in GNNs, and the flexibility offered by different aggregation and update mechanisms.

Further topics to explore include different GNN architectures, applications of GNNs in various domains, and advanced techniques for training and scaling GNNs. Consider how the concepts discussed in this lecture can be applied to real-world graph datasets and the challenges that might arise.