Lecture Notes on Analytic Geometry, Norms, and Inner Products
Introduction
This lecture introduces the fundamental concepts of analytic geometry, focusing on norms and inner products in vector spaces. We begin by generalizing the notion of length using norms, then extend this to inner products, which allow us to define orthogonality and angles in abstract vector spaces. The lecture further explores the relationship between inner products and symmetric positive definite matrices, and culminates in an introduction to applications in deep learning, particularly matrix operations and the TensorFlow library. Key concepts include norms, inner products, orthogonality, orthonormal bases, projections, and their relevance in the context of neural networks and TensorFlow.
Norms
Definition of a Norm
We start by formalizing the concept of length in a vector space \(\mathcal{V}\).
Definition 1 (Norm). A norm on a vector space \(\mathcal{V}\) is a function \(\left\|\cdot\right\| : \mathcal{V}\to \mathbb{R}\) that assigns each vector \(\mathbf{x} \in \mathcal{V}\) a real number \(\left\|\mathbf{x}\right\|\), called its norm or length, satisfying the following properties for all scalars \(\lambda \in \mathbb{R}\) and vectors \(\mathbf{x}, \mathbf{y} \in \mathcal{V}\):
Absolute Homogeneity: \(\left\|\lambda \mathbf{x}\right\| = |\lambda| \left\|\mathbf{x}\right\|\).
Triangle Inequality: \(\left\|\mathbf{x} + \mathbf{y}\right\| \leq \left\|\mathbf{x}\right\| + \left\|\mathbf{y}\right\|\).
Positive Definite: \(\left\|\mathbf{x}\right\| \geq 0\), and \(\left\|\mathbf{x}\right\| = 0\) if and only if \(\mathbf{x} = \mathbf{0}\).
These properties are essential for defining a meaningful measure of length in abstract vector spaces.
Examples of Norms
Manhattan Norm (L1 Norm)
Definition 2 (Manhattan Norm). For a vector \(\mathbf{x} = (x_1, x_2, \dots, x_m) \in \mathbb{R}^m\), the Manhattan norm, or L1 norm, is defined as: \[\left\|\mathbf{x}\right\|_1 = \sum_{i=1}^{m} |x_i|\]
In \(\mathbb{R}^2\), the set of vectors with a Manhattan norm of 1, i.e., \(\{\mathbf{x} \in \mathbb{R}^2 : \left\|\mathbf{x}\right\|_1 = 1\}\), forms a diamond shape centered at the origin.
Euclidean Norm (L2 Norm)
Definition 3 (Euclidean Norm). For a vector \(\mathbf{x} = (x_1, x_2, \dots, x_m) \in \mathbb{R}^m\), the Euclidean norm, or L2 norm, is defined as: \[\left\|\mathbf{x}\right\|_2 = \sqrt{\sum_{i=1}^{m} x_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}}\]
The Euclidean norm corresponds to the standard geometric length in Euclidean space. The set of vectors with a Euclidean norm of 1, i.e., \(\{\mathbf{x} \in \mathbb{R}^2 : \left\|\mathbf{x}\right\|_2 = 1\}\), forms the familiar unit circle in \(\mathbb{R}^2\). The expression \(\left\|\mathbf{x}\right\|_2 = \sqrt{\mathbf{x}^T \mathbf{x}}\) highlights the relationship between the Euclidean norm and the dot product.
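As a quick computational check, both norms are available in NumPy through np.linalg.norm. The following is a minimal sketch, assuming NumPy is installed (the vector x is illustrative):

import numpy as np

# A sample vector in R^3
x = np.array([3.0, -4.0, 0.0])

# Manhattan (L1) norm: sum of absolute values -> 7.0
print(np.linalg.norm(x, ord=1))

# Euclidean (L2) norm: square root of the sum of squares -> 5.0
print(np.linalg.norm(x, ord=2))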
Inner Products
General Inner Products
The dot product is a specific instance of a more general concept known as the inner product. Inner products are essential for determining orthogonality and angles between vectors in abstract vector spaces.
Bilinear Mappings
To generalize the dot product, we first define bilinear mappings, which are linear in each argument separately.
Definition 4 (Bilinear Mapping). A bilinear mapping on a vector space \(\mathcal{V}\) is a function \(\Omega: \mathcal{V}\times \mathcal{V}\to \mathbb{R}\) that is linear in each argument separately. That is, for all vectors \(\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathcal{V}\) and scalars \(\lambda, \psi \in \mathbb{R}\):
\(\Omega(\lambda\mathbf{x} + \psi\mathbf{y}, \mathbf{z}) = \lambda\Omega(\mathbf{x}, \mathbf{z}) + \psi\Omega(\mathbf{y}, \mathbf{z})\) (Linearity in the first argument)
\(\Omega(\mathbf{x}, \lambda\mathbf{y} + \psi\mathbf{z}) = \lambda\Omega(\mathbf{x}, \mathbf{y}) + \psi\Omega(\mathbf{x}, \mathbf{z})\) (Linearity in the second argument)
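A standard concrete example (stated here for reference, not taken from the lecture itself): for any matrix \(A \in \mathbb{R}^{n \times n}\), the mapping \[\Omega(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T A \mathbf{y} = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}\, x_i y_j\] is bilinear on \(\mathbb{R}^n\); the dot product is the special case \(A = I\). In fact, every bilinear mapping on \(\mathbb{R}^n\) can be written in this form.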
Complexity Analysis for Bilinear Mapping: The cost of evaluating a bilinear mapping on two vectors depends on how \(\Omega\) is represented and on the dimension of the vector space \(\mathcal{V}\). For vectors in \(\mathbb{R}^n\), a bilinear mapping written in the matrix form \(\Omega(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T A \mathbf{y}\) takes \(O(n^2)\) arithmetic operations to evaluate directly, since every pair of components contributes one term.
Symmetric and Positive Definite Mappings
Among bilinear mappings, we are particularly interested in those that are symmetric and positive definite. These properties are crucial for defining inner products.
Definition 5 (Symmetric Bilinear Mapping). A bilinear mapping \(\Omega: \mathcal{V}\times \mathcal{V}\to \mathbb{R}\) is symmetric if for all \(\mathbf{x}, \mathbf{y} \in \mathcal{V}\), \[\Omega(\mathbf{x}, \mathbf{y}) = \Omega(\mathbf{y}, \mathbf{x})\] This means the order of the arguments does not affect the result.
Definition 6 (Positive Definite Bilinear Mapping). A bilinear mapping \(\Omega: \mathcal{V}\times \mathcal{V}\to \mathbb{R}\) is positive definite if for all \(\mathbf{x} \in \mathcal{V}\):
\(\Omega(\mathbf{x}, \mathbf{x}) \geq 0\)
\(\Omega(\mathbf{x}, \mathbf{x}) = 0\) if and only if \(\mathbf{x} = \mathbf{0}\)
Definition of Inner Product
A positive definite, symmetric bilinear mapping is defined as an inner product.
Definition 7 (Inner Product). An inner product on a vector space \(\mathcal{V}\) is a bilinear mapping \(\Omega: \mathcal{V}\times \mathcal{V}\to \mathbb{R}\) that is symmetric and positive definite. We denote the inner product of \(\mathbf{x}\) and \(\mathbf{y}\) as \(\left(\mathbf{x}, \mathbf{y}\right)\) instead of \(\Omega(\mathbf{x}, \mathbf{y})\). A vector space \(\mathcal{V}\) equipped with an inner product is called an inner product space. If the inner product is the dot product, \(\mathcal{V}\) is called a Euclidean vector space.
Example: Non-Dot Product Inner Product
The following example demonstrates an inner product on \(\mathbb{R}^2\) that differs from the standard dot product.
Example 8 (Non-Dot Product Inner Product). Consider \(\mathcal{V}= \mathbb{R}^2\). Define a mapping \(\left(\mathbf{x}, \mathbf{y}\right) = x_1y_1 - (x_1y_2 + x_2y_1) + 2x_2y_2\) for \(\mathbf{x} = (x_1, x_2)\) and \(\mathbf{y} = (y_1, y_2)\). This mapping is an inner product on \(\mathbb{R}^2\), but it is different from the standard dot product. Verifying that this mapping satisfies the properties of an inner product (bilinearity, symmetry, and positive definiteness) is left as an exercise.
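As a sketch of the positive-definiteness part of that exercise, completing the square gives \[\left(\mathbf{x}, \mathbf{x}\right) = x_1^2 - 2x_1x_2 + 2x_2^2 = (x_1 - x_2)^2 + x_2^2 \geq 0,\] with equality only when \(x_2 = 0\) and \(x_1 = x_2\), i.e., only for \(\mathbf{x} = \mathbf{0}\). Symmetry is immediate, since swapping \(\mathbf{x}\) and \(\mathbf{y}\) leaves each of the terms \(x_1y_1\), \(x_1y_2 + x_2y_1\), and \(2x_2y_2\) unchanged.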
Inner Products and Bases
Matrix Representation of Inner Products
Given a basis \(B = \{b_1, \dots, b_n\}\) for a vector space \(\mathcal{V}\), we can represent the inner product using a matrix. Let \(\mathbf{x}, \mathbf{y} \in \mathcal{V}\) be expressed in terms of the basis \(B\) as \(\mathbf{x} = \sum_{i=1}^{n} \psi_i b_i\) and \(\mathbf{y} = \sum_{j=1}^{n} \lambda_j b_j\). Due to the bilinearity of the inner product, we have: \[\begin{aligned} \left(\mathbf{x}, \mathbf{y}\right) &= \left(\sum_{i=1}^{n} \psi_i b_i, \sum_{j=1}^{n} \lambda_j b_j\right) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_i \lambda_j \left(b_i, b_j\right) \end{aligned}\] Define a matrix \(A \in \mathbb{R}^{n \times n}\) where \(A_{ij} = \left(b_i, b_j\right)\). If \(\mathbf{\psi} = (\psi_1, \dots, \psi_n)^T\) and \(\mathbf{\lambda} = (\lambda_1, \dots, \lambda_n)^T\) are the coordinate vectors of \(\mathbf{x}\) and \(\mathbf{y}\) with respect to the basis \(B\), then the inner product can be written in matrix form: \[\left(\mathbf{x}, \mathbf{y}\right) = \mathbf{\psi}^T A \mathbf{\lambda}\] This shows that the inner product is completely determined by the coordinates of the vectors in a chosen basis and the matrix \(A\).
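To connect this with Example 8: in the standard basis \(B = \{\mathbf{e}_1, \mathbf{e}_2\}\) of \(\mathbb{R}^2\), the coordinate vectors coincide with the vectors themselves, and evaluating \(A_{ij} = \left(\mathbf{e}_i, \mathbf{e}_j\right)\) for that inner product gives \[A = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix},\] and indeed \(\mathbf{x}^T A \mathbf{y} = x_1y_1 - (x_1y_2 + x_2y_1) + 2x_2y_2\).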
Symmetric Positive Definite Matrix and Inner Products
The matrix \(A\) representing the inner product in a basis has specific properties, namely, it is symmetric and positive definite.
Definition 9 (Symmetric Positive Definite Matrix). A symmetric matrix \(A \in \mathbb{R}^{n \times n}\) is positive definite if for all non-zero vectors \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{x}^T A \mathbf{x} > 0\). If \(\mathbf{x}^T A \mathbf{x} \geq 0\) for all \(\mathbf{x} \in \mathbb{R}^n\), \(A\) is positive semi-definite.
Theorem 10 (Inner Products and SPD Matrices). For a real-valued, finite-dimensional vector space \(\mathcal{V}\) and an ordered basis \(B\), a mapping \(\left(\cdot, \cdot\right) : \mathcal{V}\times \mathcal{V}\to \mathbb{R}\) is an inner product if and only if there exists a symmetric, positive definite matrix \(A \in \mathbb{R}^{n \times n}\) such that for any \(\mathbf{x}, \mathbf{y} \in \mathcal{V}\) with coordinate vectors \(\mathbf{\psi}\) and \(\mathbf{\lambda}\) in basis \(B\), \[\left(\mathbf{x}, \mathbf{y}\right) = \mathbf{\psi}^T A \mathbf{\lambda}\] Description: This theorem establishes a fundamental connection between inner products on vector spaces and symmetric positive definite (SPD) matrices. It states that for any inner product in a finite-dimensional vector space, there exists a corresponding SPD matrix that can represent this inner product in a chosen basis. Conversely, any SPD matrix defines an inner product on the vector space.
Complexity Analysis for Inner Product using Matrix Representation: Given coordinate vectors \(\mathbf{\psi}\) and \(\mathbf{\lambda}\) of size \(n \times 1\), and a matrix \(A\) of size \(n \times n\), calculating \(\left(\mathbf{x}, \mathbf{y}\right) = \mathbf{\psi}^T A \mathbf{\lambda}\) involves:
Matrix-vector multiplication \(A\mathbf{\lambda}\), which is \(O(n^2)\).
Dot product of \(\mathbf{\psi}^T\) and the resulting vector, which is \(O(n)\).
The overall complexity is dominated by the matrix-vector multiplication, resulting in a time complexity of \(O(n^2)\).
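A minimal NumPy sketch of this computation, using the matrix that represents the inner product of Example 8 in the standard basis (the variable names psi, lam, and A are illustrative):

import numpy as np

# SPD matrix representing the inner product of Example 8 in the standard basis
A = np.array([[1.0, -1.0],
              [-1.0, 2.0]])

# Coordinate vectors of x and y with respect to the chosen basis
psi = np.array([1.0, 2.0])
lam = np.array([3.0, 1.0])

# Inner product (x, y) = psi^T A lam: O(n^2) for the matrix-vector product A lam,
# followed by an O(n) dot product with psi
inner = psi @ A @ lam
print(inner) # -> 0.0: these particular vectors are orthogonal w.r.t. this inner product

# Sanity check of positive definiteness: for a symmetric matrix, the Cholesky
# factorization exists if and only if the matrix is positive definite
np.linalg.cholesky(A)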
Remark 11. The matrix \(A\) defines a linear map with trivial kernel \(\{\mathbf{0}\}\): since \(\mathbf{x}^T A \mathbf{x} > 0\) for any non-zero \(\mathbf{x}\), we must have \(A\mathbf{x} \neq \mathbf{0}\) whenever \(\mathbf{x} \neq \mathbf{0}\). Furthermore, the diagonal elements \(a_{ii}\) of \(A\) are positive, as \(a_{ii} = \mathbf{e}_i^T A \mathbf{e}_i > 0\), where \(\mathbf{e}_i\) is the \(i\)-th standard basis vector in \(\mathbb{R}^n\).
Applications of Inner Products
Vector Length and Induced Norm
Inner products provide a way to generalize the concept of length.
Definition 12 (Induced Norm). Given an inner product \(\left(\cdot, \cdot\right)\) on a vector space \(\mathcal{V}\), the induced norm (or canonical norm) of a vector \(\mathbf{x} \in \mathcal{V}\) is defined as: \[\left\|\mathbf{x}\right\| = \sqrt{\left(\mathbf{x}, \mathbf{x}\right)}\]
Since \(\left(\mathbf{x}, \mathbf{x}\right) \geq 0\) for a positive definite inner product, the square root is well-defined and non-negative. In matrix form, \(\left(\mathbf{x}, \mathbf{x}\right) = \mathbf{\psi}^T A \mathbf{\psi} \geq 0\), guaranteeing the square root is real.
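For example, under the inner product of Example 8, the induced norm is \(\left\|\mathbf{x}\right\| = \sqrt{x_1^2 - 2x_1x_2 + 2x_2^2}\), so the vector \((1, 1)\) has length \(1\) in that geometry, whereas its Euclidean (dot-product) length is \(\sqrt{2}\). Different inner products therefore induce genuinely different notions of length.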
Cauchy-Schwarz Inequality
A fundamental inequality in inner product spaces is the Cauchy-Schwarz inequality.
Theorem 13 (Cauchy-Schwarz Inequality). For any vectors \(\mathbf{x}, \mathbf{y}\) in an inner product space, the Cauchy-Schwarz inequality states that: \[|\left(\mathbf{x}, \mathbf{y}\right)| \leq \left\|\mathbf{x}\right\| \left\|\mathbf{y}\right\|\] Description: The Cauchy-Schwarz inequality is a cornerstone result in the study of inner product spaces. It provides an upper bound for the absolute value of the inner product of two vectors in terms of the product of their norms. This inequality has wide-ranging applications across mathematics and physics, particularly in areas involving vector spaces and norms.
Remark 14. For the dot product in \(\mathbb{R}^n\), we know that \(\left(\mathbf{x}, \mathbf{y}\right) = \left\|\mathbf{x}\right\| \left\|\mathbf{y}\right\| \cos(\theta)\), where \(\theta\) is the angle between \(\mathbf{x}\) and \(\mathbf{y}\). Therefore, \(|\left(\mathbf{x}, \mathbf{y}\right)| = \left\|\mathbf{x}\right\| \left\|\mathbf{y}\right\| |\cos(\theta)|\). Since \(|\cos(\theta)| \leq 1\), the Cauchy-Schwarz inequality holds for the dot product. The general case can be proven by considering the non-negative function \(f(\lambda) = \left(\mathbf{x} - \lambda\mathbf{y}, \mathbf{x} - \lambda\mathbf{y}\right) \geq 0\) for \(\lambda \in \mathbb{R}\) and analyzing the discriminant of the resulting quadratic in \(\lambda\).
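To spell out that last step: expanding by bilinearity gives \(f(\lambda) = \left\|\mathbf{x}\right\|^2 - 2\lambda\left(\mathbf{x}, \mathbf{y}\right) + \lambda^2\left\|\mathbf{y}\right\|^2 \geq 0\) for all \(\lambda \in \mathbb{R}\). For \(\mathbf{y} \neq \mathbf{0}\) this is a quadratic in \(\lambda\) that never becomes negative, so its discriminant satisfies \(4\left(\mathbf{x}, \mathbf{y}\right)^2 - 4\left\|\mathbf{x}\right\|^2\left\|\mathbf{y}\right\|^2 \leq 0\), which is exactly the Cauchy-Schwarz inequality. The case \(\mathbf{y} = \mathbf{0}\) is immediate, since both sides are then zero.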
Distance and Metric
Using the induced norm, we can define a distance function in inner product spaces.
Definition 15 (Distance and Metric). In an inner product space \((\mathcal{V}, \left(\cdot, \cdot\right))\), the distance between two vectors \(\mathbf{x}, \mathbf{y} \in \mathcal{V}\) is defined as: \[d(\mathbf{x}, \mathbf{y}) = \left\|\mathbf{x} - \mathbf{y}\right\| = \sqrt{\left(\mathbf{x} - \mathbf{y}, \mathbf{x} - \mathbf{y}\right)}\] This distance function \(d(\cdot, \cdot)\) is a metric on \(\mathcal{V}\), satisfying the following properties for all \(\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathcal{V}\):
Positive Definite: \(d(\mathbf{x}, \mathbf{y}) \geq 0\), and \(d(\mathbf{x}, \mathbf{y}) = 0\) if and only if \(\mathbf{x} = \mathbf{y}\).
Symmetric: \(d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})\).
Triangle Inequality: \(d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})\).
When the inner product is the dot product, \(d(\mathbf{x}, \mathbf{y})\) is the Euclidean distance.
Angles and Orthogonality
Defining Angles
The inner product allows us to generalize the concept of angles between vectors. From the Cauchy-Schwarz inequality, we know that for non-zero vectors \(\mathbf{x}\) and \(\mathbf{y}\), \(-1 \leq \frac{\left(\mathbf{x}, \mathbf{y}\right)}{\left\|\mathbf{x}\right\| \left\|\mathbf{y}\right\|} \leq 1\). Thus, we can define the angle \(\omega\) between \(\mathbf{x}\) and \(\mathbf{y}\) using the cosine: \[\cos(\omega) = \frac{\left(\mathbf{x}, \mathbf{y}\right)}{\left\|\mathbf{x}\right\| \left\|\mathbf{y}\right\|}\]
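For instance, with the dot product on \(\mathbb{R}^2\), the vectors \(\mathbf{x} = (1, 1)\) and \(\mathbf{y} = (1, 0)\) give \(\cos(\omega) = \frac{1}{\sqrt{2} \cdot 1} = \frac{1}{\sqrt{2}}\), so \(\omega = \pi/4\), matching the familiar geometric picture.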
Orthogonality
A particularly important concept is orthogonality, which generalizes the notion of perpendicularity.
Definition 16 (Orthogonality). Two vectors \(\mathbf{x}\) and \(\mathbf{y}\) are orthogonal if and only if their inner product is zero: \(\left(\mathbf{x}, \mathbf{y}\right) = 0\). In this case, we write \(\mathbf{x} \perp \mathbf{y}\).
Orthonormality
Definition 17 (Orthonormality). If two vectors \(\mathbf{x}\) and \(\mathbf{y}\) are orthogonal and both are unit vectors (i.e., \(\left\|\mathbf{x}\right\| = 1\) and \(\left\|\mathbf{y}\right\| = 1\)), they are called orthonormal.
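Note that orthogonality depends on the choice of inner product: under the dot product, the standard basis vectors \(\mathbf{e}_1\) and \(\mathbf{e}_2\) of \(\mathbb{R}^2\) are orthonormal, but under the inner product of Example 8 we have \(\left(\mathbf{e}_1, \mathbf{e}_2\right) = -1 \neq 0\), so the same two vectors are not orthogonal with respect to that inner product.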
Deep Learning and Matrix Operations
Neural Networks and Matrices
Matrix operations are fundamental to the implementation and efficient computation of neural networks in deep learning. Libraries like TensorFlow are designed to optimize these operations, often utilizing GPUs to accelerate computations. The use of matrices allows for parallel processing and concise representation of neural network layers.
Matrix Multiplication in Neural Networks
The core operation in a single-layer neural network can be represented using matrix notation as: \[\mathbf{L} = \mathbf{XW} + \mathbf{B}\] where:
\(\mathbf{X}\) is the input matrix, with each row representing an input sample.
\(\mathbf{W}\) is the weight matrix, representing the connection weights of the layer.
\(\mathbf{B}\) is the bias vector, added to each output unit.
\(\mathbf{L}\) is the output matrix, containing the logits or pre-activation values.
For instance, in the context of MNIST digit classification, each input image is flattened into a vector of 784 pixels. If we consider a single-layer network for classifying these images into 10 classes, the dimensions of the matrices and vectors are as follows:
Input vector \(\mathbf{x}\) for a single image: \(1 \times 784\).
Weight matrix \(\mathbf{W}\): \(784 \times 10\).
Bias vector \(\mathbf{B}\): \(1 \times 10\).
Output vector \(\mathbf{l}\) (logits) for a single image: \(1 \times 10\).
The matrix multiplication \(\mathbf{xW}\) computes the weighted sum of inputs efficiently, and adding the bias \(\mathbf{B}\) shifts these sums.
Batch Processing for Efficiency
To improve computational efficiency, especially during training, neural networks typically process data in batches. Instead of feeding one input at a time, a batch of \(m\) input samples is processed simultaneously. In this case, the input matrix \(\mathbf{X}\) becomes an \(m \times 784\) matrix, where each row corresponds to a different input example. The operation \(\mathbf{L} = \mathbf{XW} + \mathbf{B}\) then becomes a matrix operation where:
Input matrix \(\mathbf{X}\) for a batch of \(m\) images: \(m \times 784\).
Weight matrix \(\mathbf{W}\): \(784 \times 10\).
Bias vector \(\mathbf{B}\): \(1 \times 10\).
Output matrix \(\mathbf{L}\) (logits) for a batch of \(m\) images: \(m \times 10\).
The matrix multiplication \(\mathbf{XW}\) now processes the entire batch in parallel. Each row of the output matrix \(\mathbf{L}\) corresponds to the logits for the respective input image in the batch.
Broadcasting of Biases
When adding the bias term \(\mathbf{B}\) (dimension \(1 \times 10\)) to the result of matrix multiplication \(\mathbf{XW}\) (dimension \(m \times 10\)), broadcasting is employed. Broadcasting is a feature in libraries like NumPy and TensorFlow that automatically expands the dimensions of arrays to make operations compatible. In this context, the \(1 \times 10\) bias vector \(\mathbf{B}\) is effectively "broadcast" to an \(m \times 10\) matrix by virtually replicating it \(m\) times along the rows. This allows for element-wise addition of the bias to each row of \(\mathbf{XW}\), ensuring that each input sample in the batch is correctly biased.
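A minimal NumPy sketch of this broadcasting behavior, with shapes matching the MNIST example above (the values are placeholders):

import numpy as np

m = 4                               # batch size
Z = np.zeros((m, 10))               # stand-in for the product XW, shape (m, 10)
B = np.arange(10.0).reshape(1, 10)  # bias vector, shape (1, 10)

# The (1, 10) bias is broadcast across all m rows of Z
L = Z + B
print(L.shape)  # -> (4, 10); every row of L now contains the bias values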
Algorithm 23 (Forward Pass for a Single Layer Neural Network with Batch Processing). Input: Input batch matrix \(\mathbf{X} \in \mathbb{R}^{m \times 784}\), Weight matrix \(\mathbf{W} \in \mathbb{R}^{784 \times 10}\), Bias vector \(\mathbf{B} \in \mathbb{R}^{1 \times 10}\)
Output: Logits matrix \(\mathbf{L} \in \mathbb{R}^{m \times 10}\)
// Matrix multiplication of input batch and weights
\(\mathbf{Z} \leftarrow \mathbf{XW}\)
// Broadcast bias vector to match dimensions and add
\(\mathbf{L} \leftarrow \mathbf{Z} + \mathbf{B}\)
return \(\mathbf{L}\)
Complexity Analysis: The matrix multiplication \(\mathbf{XW}\) of an \(m \times 784\) matrix with a \(784 \times 10\) matrix requires \(O(m \cdot 784 \cdot 10)\) multiply-add operations, and the broadcast bias addition requires a further \(O(m \cdot 10)\) additions, so the forward pass is dominated by the matrix multiplication.
The use of matrix operations and batch processing is crucial for the efficiency of neural networks, enabling faster training and inference, especially when combined with hardware acceleration such as GPUs.
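A NumPy sketch of Algorithm 23 on random data (the random inputs and weights are purely illustrative; in practice they come from the dataset and from training):

import numpy as np

m = 32                                   # batch size
rng = np.random.default_rng(0)

X = rng.normal(size=(m, 784))            # batch of flattened MNIST-sized inputs
W = rng.normal(size=(784, 10)) * 0.01    # weight matrix
B = np.zeros((1, 10))                    # bias vector

# Forward pass: matrix multiplication, then broadcast bias addition
Z = X @ W       # shape (m, 10), cost O(m * 784 * 10)
L = Z + B       # bias broadcast over the m rows, cost O(m * 10)
print(L.shape)  # -> (32, 10)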
TensorFlow Introduction
TensorFlow as a Computational Library
TensorFlow is an open-source library developed by the Google Brain team, designed for high-performance numerical computation and large-scale machine learning. It acts as a comprehensive ecosystem for implementing and deploying machine learning models. Python is used as the primary interface to define and initiate computations in TensorFlow, while the core computational operations are executed in an optimized C++ backend for performance.
Computation Graphs and Sessions
TensorFlow employs the concept of computation graphs to represent mathematical computations. In TensorFlow, you first define a computation graph, which is a symbolic representation of the operations and data flow. This graph specifies the series of operations to be performed but does not execute them immediately. Actual computation happens within a TensorFlow session. A session is an environment where the graph is executed.
Example 24 (TensorFlow Example: Hello World).
import tensorflow as tf
# Define a constant tensor in the computation graph
x = tf.constant("Hello World")
# Create a TensorFlow session to execute the graph
sess = tf.Session()
# Run the session to evaluate the tensor x and print the result
print(sess.run(x)) # Output: b'Hello World'
In this example, 'tf.constant("Hello World")' defines a constant tensor within the graph. The session 'sess = tf.Session()' is then created to allow graph execution. 'sess.run(x)' triggers the computation to evaluate the tensor 'x', and the result is printed (as the byte string b'Hello World'). No computation occurs until 'sess.run()' is called.
Tensors: Multi-dimensional Data Arrays
The fundamental data unit in TensorFlow is the tensor. Tensors are multi-dimensional arrays, generalizing scalars, vectors, and matrices to higher dimensions. Each tensor has a data type (e.g., float32, int32, string) and a shape (dimensions). TensorFlow is designed to manipulate tensors, performing operations like element-wise arithmetic, matrix multiplication, and more complex transformations as defined in the computation graph.
Python Environment and TensorFlow Functions
Python serves as the user-friendly front-end for TensorFlow, providing an intuitive API to construct computation graphs. TensorFlow functions are used in Python to define tensors, operations, and neural network layers. Placeholders are symbolic variables that allow feeding external data into the graph at runtime, making the graph reusable with different inputs. Variables, on the other hand, are tensors that hold mutable state, such as model parameters (weights and biases) that are updated during training.
Example 25 (TensorFlow Example: Constants and Placeholders).
import tensorflow as tf
# Define a constant tensor with value 2.0
x_const = tf.constant(2.0)
# Define a placeholder tensor of type float32, to be fed data later
z_placeholder = tf.placeholder(tf.float32)
# Define a computation: add constant and placeholder tensors
computation = tf.add(x_const, z_placeholder)
# Create a TensorFlow session
sess = tf.Session()
# Execute the computation graph, feeding a value of 3.0 to the placeholder z_placeholder
result1 = sess.run(computation, feed_dict={z_placeholder: 3.0})
print(result1) # Output: 5.0
# Execute the same computation graph, now feeding a value of 16.0 to z_placeholder
result2 = sess.run(computation, feed_dict={z_placeholder: 16.0})
print(result2) # Output: 18.0
# Evaluate and print the constant tensor x_const
print(sess.run(x_const)) # Output: 2.0
In this example, 'z_placeholder' is used to represent input data that will be provided when the session is run. The 'feed_dict' argument in 'sess.run()' is used to pass values to placeholders. TensorFlow's design facilitates the optimization of model parameters in machine learning by automatically computing gradients and providing tools for efficient optimization algorithms.
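For completeness, here is a minimal sketch of a variable in the same TensorFlow 1.x style as the examples above (the variable name w and its initial value are illustrative; variables hold mutable state such as weights and must be explicitly initialized in a session before use):

import tensorflow as tf

# Define a variable tensor holding mutable state (e.g., a model weight)
w = tf.Variable(2.0)

# An operation in the graph that uses the variable
tripled = tf.multiply(w, 3.0)

sess = tf.Session()
# Variables must be initialized before they can be evaluated
sess.run(tf.global_variables_initializer())
print(sess.run(tripled)) # Output: 6.0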
Conclusion
This lecture provided an introduction to analytic geometry, focusing on norms and inner products as generalizations of length and the dot product. We explored the properties of norms and inner products, their relationship with symmetric positive definite matrices, and their applications in defining distances, angles, and orthogonality. Furthermore, we transitioned to deep learning, highlighting the importance of matrix operations and introducing TensorFlow as a key library for implementing neural networks. Key takeaways include the understanding of norms and inner products as fundamental mathematical tools, their connection to geometric concepts, and their practical relevance in modern machine learning frameworks like TensorFlow.
Further study could include exploring different types of norms and inner products, delving deeper into the properties of orthogonal matrices and projections, and practicing with TensorFlow to build and train simple neural networks.
Follow-up questions for the next lecture might include:
How are orthonormal bases constructed in practice (e.g., Gram-Schmidt process)?
What are different types of projections and their applications?
How are gradients calculated and used in TensorFlow for training neural networks?
How do activation functions introduce non-linearity in neural networks, and why is non-linearity important?