Unsupervised Methods: Principal Component Analysis and Cluster Analysis
Introduction
Supervised vs. Unsupervised Learning
Most statistical learning problems can be categorized into two main types: supervised and unsupervised learning. In supervised learning, for each observation \(\mathbf{x}_i\) (where \(i = 1, \dots, n\)) of \(p\) predictor variables, we also have a corresponding response variable \(y_i\). The goal is to build a model that relates the response to the predictors. This model can then be used for both interpretation and prediction. Linear regression and logistic regression are typical examples of supervised learning techniques. More advanced methods include decision trees, support vector machines, and neural networks.
Supervised learning involves building a statistical model to predict or estimate an output variable based on one or more input variables. The defining characteristic of supervised learning is the use of a labeled dataset, where each data point is associated with a known output or response. This allows the learning process to be guided by the ‘supervisor’—the known response—to optimize model parameters for accurate predictions on new, unseen data.
In contrast, unsupervised learning is applied when we have input variables \(\mathbf{x}_i\) but no corresponding response variable \(y_i\). For each observation \(i = 1, \dots, n\), we only observe a vector of measurements \(\mathbf{x}_i\). This scenario is more challenging because there is no response variable to supervise the analysis or to evaluate the predictive performance of a model directly.
Unsupervised learning involves building statistical models to infer patterns from input data without explicit output variables or labels. The primary goal is to discover the inherent structure in the data. This can include identifying groupings of data points (clustering), reducing the dimensionality of the data while preserving essential information (dimensionality reduction), or finding associations between variables, all without prior knowledge of the outcomes or a predefined response variable.
Focus of Unsupervised Learning
In unsupervised learning, the focus is on the joint study of a set of \(p\) variables, \(X_1, \dots, X_p\), observed for a sample of size \(n\). The data is structured as an \(n \times p\) matrix \(\mathbf{X}\), where \(x_{ij}\) represents the \(i\)-th observation of the \(j\)-th variable. A key aspect is the absence of a response variable; all \(p\) variables are treated symmetrically, and the analysis aims to understand their collective behavior and relationships.
The primary goal of unsupervised learning is exploratory data analysis, seeking to uncover interesting and potentially hidden structures within the measurements of the variables \(X_1, \dots, X_p\). Specifically, the objectives include:
Data Summarization and Visualization: To find effective ways to summarize and visualize complex, high-dimensional data in a lower-dimensional space, making it easier to understand and interpret. For instance, reducing data to two or three dimensions for plotting and visual inspection.
Subgroup Identification: To identify subgroups or clusters within the observations or variables. This could mean finding segments of customers with similar purchasing behaviors in market data or identifying groups of genes with correlated expression patterns in biological data.
Techniques used in unsupervised learning are fundamentally explorative. Historically, these methods were often grouped under the term ‘multivariate analysis techniques,’ a broad descriptor for statistical methods involving multiple variables. Unsupervised learning is inherently more subjective than supervised learning because there is no objective metric like prediction accuracy to guide the process. The interpretation of results often relies on domain knowledge and qualitative assessments of the patterns discovered.
Exploratory Data Analysis in High Dimensions
As with any statistical investigation, the initial step in unsupervised learning is to describe and visualize the data. However, this becomes increasingly challenging as the number of variables (\(p\)) grows. While basic tools like scatterplot matrices and low-dimensional projections are useful for initial exploration, they may not capture complex relationships in high-dimensional spaces. Relying solely on a series of pairwise plots can lead to overlooking important multivariate structures present in the data.
To overcome these limitations, more advanced exploratory techniques are valuable. Dynamic graphics, for example, allow for interactive manipulation of data visualizations, such as rotating three-dimensional point clouds or projecting data onto different lower-dimensional subspaces. These dynamic approaches can reveal structures that are not apparent in static, low-dimensional plots.
In this lecture, we will delve into two prominent classes of unsupervised learning methods that are particularly effective for exploratory data analysis and dimension reduction:
Principal Component Analysis (PCA)
Cluster Analysis
Principal Component Analysis (PCA)
Introduction to Principal Components Analysis
Principal Components Analysis (PCA) is a dimension reduction technique. It replaces the original input variables with a new set of derived variables called principal components. These principal components are linear combinations of the original variables. PCA is particularly useful when dealing with high-dimensional datasets, where it can simplify the data structure and extract the most relevant information.
Principal Component Analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each subsequent component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
The key features of PCA are:
Dimension Reduction: PCA aims to reduce the dimensionality of the data by finding a smaller number of principal components that capture most of the variance in the original data. By focusing on the components with the highest variance, PCA effectively compresses the data, discarding less informative dimensions.
Ordered Components: The principal components are ordered based on the amount of variance in the original variables that they explain. The first principal component explains the most variance, the second principal component explains the second most variance, and so on. This ordering allows for selecting a subset of the first few principal components that retain a significant portion of the original data’s variability.
Data Visualization: PCA is useful for understanding multivariate data and serves as a tool for data visualization. Plots of the first few principal components can often reveal insightful patterns and structures in the data, especially when the number of variables \(p\) is large. For instance, plotting the first two principal components allows for visualizing high-dimensional data in a 2D scatter plot, revealing clusters or trends.
Low-Dimensional Representation: PCA provides a low-dimensional representation of the data, based on a small number of ‘interesting’ dimensions, that captures as much information as possible from the original data. This reduced representation simplifies subsequent analysis and modeling, making computations more efficient and models more interpretable.
Regression Applications: PCA can be useful in regression problems where there are a large number of candidate explanatory variables, especially when multicollinearity is present. The first few principal components, which are uncorrelated, can replace these variables, provided they adequately synthesize the information contained in the original candidates. This can stabilize regression models and improve prediction accuracy.
The Principal Components
Given a set of \(p\) variables \(X_1, \dots, X_p\), the first principal component \(Z_1\) is defined as the normalized linear combination of these variables that has the largest variance. Mathematically, it is expressed as: \[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p = \sum_{j=1}^{p} \phi_{j1} X_j\] where \(\boldsymbol{\phi}_{1} = (\phi_{11}, \dots, \phi_{p1})^T\) is the loading vector for the first principal component. The loadings represent the weights of each original variable in the linear combination that forms the principal component. The loadings are constrained to be normalized, meaning: \[\sum_{j=1}^{p} \phi_{j1}^2 = 1\] This normalization is crucial. Without it, we could arbitrarily increase the variance of \(Z_1\) by simply scaling up the loadings, making the maximization problem ill-defined. The constraint ensures that the direction in the \(p\)-dimensional space is uniquely defined, focusing on the direction of maximal variance rather than the magnitude of the coefficients.
The principal components are sensitive to the scaling of the variables. If variables are measured in different units or have vastly different variances, variables with larger variances might disproportionately influence the PCA results. Therefore, it is generally recommended to standardize the variables before applying PCA. Standardization typically involves centering the variables to have mean zero and scaling them to have unit variance. This ensures that all variables contribute equally to the analysis, regardless of their original scale.
Standardizing the variables before applying PCA is crucial when the original variables are measured in different units or have vastly different variances. Standardization ensures that each variable contributes equally to the analysis, preventing variables with larger variances from dominating the principal components. Common standardization methods include Z-score standardization, where each variable is transformed to have a mean of zero and a standard deviation of one. For a variable \(X_j\), the standardized variable \(X_j'\) is calculated as: \[X_j' = \frac{X_j - \text{mean}(X_j)}{\text{sd}(X_j)}\] where \(\text{mean}(X_j)\) and \(\text{sd}(X_j)\) are the sample mean and sample standard deviation of \(X_j\), respectively.
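As a minimal sketch in R (assuming a generic numeric data frame or matrix `X`), this standardization can be carried out with `scale()`, or delegated to the PCA routine itself:

```r
# Z-score standardization: center each column to mean 0 and scale to sd 1
X_std <- scale(X, center = TRUE, scale = TRUE)

# Equivalent computation for a single variable X[, j]:
# (X[, j] - mean(X[, j])) / sd(X[, j])

# Alternatively, prcomp() can standardize internally:
# prcomp(X, center = TRUE, scale. = TRUE)
```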
Calculating Principal Components
Obtaining the loadings for the first principal component is a linear algebra problem that involves finding the direction in \(p\)-dimensional space that maximizes the variance of the projected data. Given an \(n \times p\) data matrix \(\mathbf{X}\), we aim to find the loadings \(\phi_{11}, \dots, \phi_{p1}\) such that the linear combination of the \(n\) sample values: \[z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip} = \sum_{j=1}^{p} \phi_{j1} x_{ij}\] has the largest possible sample variance, subject to the normalization constraint \(\sum_{j=1}^{p} \phi_{j1}^2 = 1\).
Assuming the variables are centered (column means of \(\mathbf{X}\) are zero), the sample variance of the first principal component scores is given by: \[\text{Var}(Z_1) = \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2\] The loading vector \(\boldsymbol{\phi}_1\) that maximizes this variance is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix \(\mathbf{S} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}\) (or the correlation matrix if the data is standardized to have unit variance); the factor \(1/(n-1)\) versus \(1/n\) only rescales the eigenvalues and does not change the eigenvectors. This loading vector defines the direction of the first principal component. Alternatively, the singular value decomposition (SVD) of the data matrix \(\mathbf{X}\) can be used to find the principal components.
Once the loading vector \(\boldsymbol{\phi}_1\) is determined, we can calculate the observed values for the first principal component \(Z_1\). These values, \(z_{i1}\) for \(i = 1, \dots, n\), are called the scores of the first principal component. They represent the projection of each data point onto the direction defined by \(\boldsymbol{\phi}_1\).
The principal component scores are the projections of the original data points onto the principal component directions. For the \(k\)-th principal component, the scores are given by \(z_{ik} = \sum_{j=1}^{p} \phi_{jk} x_{ij}\) for \(i = 1, \dots, n\), where \(\boldsymbol{\phi}_k = (\phi_{1k}, \dots, \phi_{pk})^T\) is the \(k\)-th loading vector. The scores represent the coordinates of the data points in the new principal component space.
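A sketch of this computation in R, assuming a column-centered numeric matrix `X`; the eigen-decomposition route and `prcomp()` (which uses the SVD) return the same directions up to a sign flip:

```r
# Route 1: eigen-decomposition of the sample covariance matrix
S    <- cov(X)                    # p x p sample covariance matrix
eig  <- eigen(S)                  # eigenvalues/eigenvectors, sorted by decreasing eigenvalue
phi1 <- eig$vectors[, 1]          # first loading vector (unit length)
z1   <- as.matrix(X) %*% phi1     # scores of the first principal component

# Route 2: SVD-based computation via prcomp()
pc <- prcomp(X, center = TRUE, scale. = FALSE)
pc$rotation[, 1]                  # first loading vector (possibly sign-flipped)
head(pc$x[, 1])                   # first few PC1 scores
```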
Subsequent Principal Components
The second principal component \(Z_2\) is defined similarly, but with an additional constraint. It is the normalized linear combination of \(X_1, \dots, X_p\) that has maximal variance among all linear combinations that are uncorrelated with \(Z_1\). This uncorrelation constraint ensures that the second principal component captures variance in the data that is not already explained by the first principal component. The second principal component scores \(z_{12}, \dots, z_{n2}\) are given by: \[z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip} = \sum_{j=1}^{p} \phi_{j2} x_{ij}\] where the second principal component loading vector \(\boldsymbol{\phi}_2 = (\phi_{12}, \dots, \phi_{p2})^T\) satisfies the normalization constraint \(\sum_{j=1}^{p} \phi_{j2}^2 = 1\) and the uncorrelation constraint with the first principal component, which is equivalent to orthogonality of the loading vectors: \[\text{Cov}(Z_1, Z_2) = 0 \quad \Longleftrightarrow \quad \sum_{j=1}^{p} \phi_{j1} \phi_{j2} = 0\] Geometrically, this orthogonality condition means that the directions of the loading vectors \(\boldsymbol{\phi}_1\) and \(\boldsymbol{\phi}_2\) are perpendicular in the \(p\)-dimensional variable space.
This procedure can be iterated to find subsequent principal components. At each step \(k\), the \(k\)-th principal component \(Z_k\) is the linear combination with maximal variance that is uncorrelated with all previous principal components \(Z_1, \dots, Z_{k-1}\). We can extract up to \(m = \min(n-1, p)\) principal components. Since principal components are derived from the eigenvalue decomposition of the covariance matrix (or SVD of the data matrix), the number of non-zero eigenvalues (or singular values) is at most \(\min(n-1, p)\).
Geometric Interpretation
PCA can be understood geometrically as a rotation of the original coordinate axes to align with the directions of maximal variance in the data.
The first loading vector \(\boldsymbol{\phi}_1\) defines a direction in the \(p\)-dimensional variable space along which the data varies the most. This direction is the first principal component direction. Projecting the \(n\) data points onto this direction gives the first principal component scores. In essence, PC1 captures the primary axis of variation in the data cloud.
The second loading vector \(\boldsymbol{\phi}_2\) defines a direction that is orthogonal (perpendicular) to the direction of \(\boldsymbol{\phi}_1\). Along this direction, the variability is maximized, subject to the orthogonality constraint. Projecting the data points onto this second direction yields the second principal component scores. PC2 captures the second most significant axis of variation, orthogonal to the first.
In general, the \(k\)-th loading vector \(\boldsymbol{\phi}_k\) defines a direction orthogonal to the directions defined by the first \(k-1\) loading vectors, along which the remaining variability is maximized. This sequential approach ensures that each principal component captures a unique aspect of the data’s variance, without redundancy.
Visually, if you imagine the data points forming an ellipsoid in \(p\)-dimensional space, the principal components correspond to the axes of this ellipsoid, ordered from longest to shortest. The first principal component is along the longest axis, the second along the second longest axis orthogonal to the first, and so on.
Example: US Arrest Data
Consider the ‘USArrests’ dataset, which contains statistics on arrests per 100,000 residents for Assault, Murder, and Rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas (UrbanPop). We have \(p=4\) variables and \(n=50\) observations. This dataset is commonly used to illustrate PCA because it is low-dimensional and interpretable, yet exhibits correlations between variables that PCA can effectively summarize.
Note: Diagram description (Figure 1): Scatterplot matrix showing relationships between Murder, Assault, UrbanPop, and Rape. The diagonal shows variable names; off-diagonal panels show scatterplots between pairs of variables with a red smoother line.
Analysis: The scatterplot matrix (Figure 1) provides an initial view of the relationships between pairs of variables. For example, we can observe a positive correlation between Murder and Assault, suggesting that states with higher murder rates also tend to have higher assault rates. UrbanPop appears to have a weaker relationship with the crime variables compared to the crime variables among themselves.
Note: Diagram description (Figure 2): Boxplots showing the distributions of the standardized Murder, Assault, UrbanPop, and Rape variables and of the PC1 scores. The PC1 boxplot shows larger variance than any individual variable.
A boxplot (Figure 2) comparing the four standardized variables with the scores from the first principal component (PC1) demonstrates that PC1 exhibits greater variability than any of the original standardized variables. This is a key characteristic of the first principal component – it is constructed to maximize variance, thus capturing a significant amount of the overall variability present in the dataset.
The first two principal component loading vectors, \(\boldsymbol{\phi}_1\) and \(\boldsymbol{\phi}_2\), derived from PCA on the standardized USArrests data, are shown in the table below:
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| PC1 | 0.5359 | 0.5832 | 0.2782 | 0.5434 |
| PC2 | -0.4182 | -0.1880 | 0.8728 | 0.1673 |
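These loadings can be reproduced in R with `prcomp()` applied to the standardized USArrests data (a sketch; the signs of entire columns may be flipped, which does not change the interpretation):

```r
pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)
round(t(pr$rotation[, 1:2]), 4)   # PC1 and PC2 loadings, as in the table above
head(pr$x[, 1:2])                 # PC1 and PC2 scores for the first few states
```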
Interpretation of PC1 Loadings
The first loading vector \(\boldsymbol{\phi}_1 = (0.5359, 0.5832, 0.2782, 0.5434)^T\) shows the weights of each original variable in forming PC1. We observe that Murder (0.5359), Assault (0.5832), and Rape (0.5434) have relatively large and similar positive loadings, while UrbanPop (0.2782) has a smaller positive loading. This indicates that PC1 is primarily a weighted average of these crime rates, with a lesser contribution from urbanization. Thus, PC1 roughly corresponds to a measure of overall rates of serious crimes. States with high scores on PC1 are those with generally high crime rates across murder, assault, and rape.
Interpretation of PC2 Loadings
The second loading vector \(\boldsymbol{\phi}_2 = (-0.4182, -0.1880, 0.8728, 0.1673)^T\) exhibits a different pattern. UrbanPop has a very large positive loading (0.8728), while Murder (-0.4182) and Assault (-0.1880) have negative loadings, and Rape (0.1673) has a small positive loading. This suggests that PC2 contrasts UrbanPop with Murder and, to a lesser extent, Assault. Hence, PC2 roughly corresponds to the level of urbanization of the state, with states having high UrbanPop and lower murder and assault rates scoring high on PC2.
Note: Diagram description (Figure 3): Biplot showing states plotted by their PC1 and PC2 scores. Red arrows represent the loadings for Murder, Assault, UrbanPop, and Rape.
Biplot Visualization
The biplot (Figure 3) is a powerful graphical tool that simultaneously displays both the principal component scores for each observation (state names in this case) and the loading vectors for each variable (red arrows). It provides a low-dimensional visualization of both the data points and the variable contributions in the principal component space, typically spanned by the first two PCs.
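In R, such a biplot can be drawn directly from the fitted PCA object (a sketch, assuming `pr` is the standardized USArrests fit from the earlier sketch):

```r
# scale = 0 ensures the arrows represent the loadings on their original scale
biplot(pr, scale = 0, cex = 0.6)
```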
Interpreting Variable Relationships from Biplot: In the biplot, the angles between the variable arrows reflect the correlations between the original variables in the PCA space. Variables that are highly correlated in the original data will have arrows that are close to each other in the biplot. Conversely, variables that are uncorrelated will have arrows that are approximately orthogonal. In Figure 3, the arrows for Murder, Assault, and Rape are clustered together and point in roughly the same direction, indicating that these crime-related variables are positively correlated. The arrow for UrbanPop is at a larger angle from the crime variables, suggesting it is less correlated with them and captures a different dimension of variation in the data.
Interpreting State Positions from Biplot: The position of each state in the biplot is determined by its scores on PC1 (horizontal axis) and PC2 (vertical axis).
States located to the right side of the biplot (positive PC1 scores), such as California, Nevada, and Florida, have high crime rates, consistent with the interpretation of PC1 as an overall crime rate index.
States on the left side (negative PC1 scores), like North Dakota, have low crime rates.
States in the upper part of the biplot (positive PC2 scores), such as California, have high levels of urbanization, aligning with PC2 representing urbanization.
States in the lower part (negative PC2 scores) are less urbanized.
States located near the origin (e.g., Indiana) have approximately average levels of both crime and urbanization, as they have scores close to zero on both PC1 and PC2.
Further Comments on PCA
Uniqueness of Principal Components: Each principal component is unique up to a sign flip. If we multiply all loadings in a loading vector by -1, the direction in \(p\)-dimensional space defined by the principal component remains the same. Consequently, the scores are also flipped in sign, but the explained variance and the overall interpretation of the component are unchanged. Therefore, when interpreting loadings, the sign is relative and should be considered in the context of the other loadings within the same component and the overall direction being captured.
Alternative Geometric Interpretation: The first \(r\) principal component score vectors and loading vectors provide the best \(r\)-dimensional approximation (in terms of Euclidean distance in \(\mathbb{R}^p\)) to the observations. Specifically, if we reconstruct the original (centered) data using the first \(r\) principal components, the approximation for the \(j\)-th variable of the \(i\)-th observation is given by \(x_{ij} \approx \sum_{s=1}^{r} z_{is} \phi_{js}\). This approximation minimizes the sum of squared errors between the original data and its rank-\(r\) reconstruction. The approximation becomes exact if we use all \(r = \min(n-1, p)\) principal components.
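This reconstruction can be checked numerically in R (a sketch, assuming `pr` is the standardized USArrests fit from the earlier sketch and `r` the number of retained components):

```r
r <- 2
X_approx <- pr$x[, 1:r] %*% t(pr$rotation[, 1:r])   # best rank-r approximation of the scaled data

# Using all components, the reconstruction is exact up to rounding error:
max(abs(pr$x %*% t(pr$rotation) - scale(USArrests)))
```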
Number of Principal Components: The first principal component is the most informative one-dimensional linear summary of the data, capturing the largest possible variance. The first two principal components together provide the most informative two-dimensional graphical summary. A crucial decision in PCA is determining how many principal components are sufficient to capture a satisfactory amount of variance in the data for dimension reduction and visualization purposes. This is often addressed by examining the proportion of variance explained by each component.
The Proportion of Variance Explained
To decide on the number of principal components to retain, we examine the variance explained by each component. Assuming the observed variables have been centered to have mean zero, the proportion of variance explained (PVE) by the \(s\)-th principal component is: \[\text{PVE}_s = \frac{(1/n) \sum_{i=1}^{n} z_{is}^2}{\sum_{j=1}^{p} (1/n) \sum_{i=1}^{n} x_{ij}^2} = \frac{\text{Variance}(Z_s)}{\sum_{j=1}^{p} \text{Variance}(X_j)}\] The numerator is the sample variance of the \(s\)-th principal component scores, representing the variance captured by this component. The denominator is the total variance of all original (centered) variables, representing the total variance in the dataset. Thus, PVE\(_s\) quantifies the fraction of the total variance that is attributable to the \(s\)-th principal component.
The cumulative proportion of variance explained for the first \(r\) components is the sum of the PVEs for the first \(r\) components: \(\sum_{s=1}^{r} \text{PVE}_s\). This cumulative PVE indicates the total variance captured by the first \(r\) principal components. If we use all \(r = \min(n-1, p)\) principal components, the cumulative PVE is 1, meaning 100% of the variance is explained, as we have simply rotated the data without losing any information.
A useful graphical tool for deciding the number of components is the scree plot. The scree plot displays the PVE\(_s\) values for \(s = 1, \dots, \min(n-1, p)\) typically as a line plot or bar chart. The term "scree" refers to the rubble at the base of a mountain slope; in analogy, the scree plot often shows a steep drop initially, followed by a leveling off, resembling a mountain slope transitioning to scree. There is no universally accepted objective method to definitively determine the optimal number of principal components. However, a common heuristic is to examine the scree plot and look for an "elbow" point. The elbow point is where the proportion of variance explained drops off significantly, suggesting that components beyond this point contribute relatively little additional information. Components before the elbow are typically retained as they capture the majority of the variance.
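A sketch of how the PVE, the cumulative PVE, and a scree plot can be obtained in R from a `prcomp` fit (here the object `pr` from the earlier sketch):

```r
pve <- pr$sdev^2 / sum(pr$sdev^2)   # proportion of variance explained by each component

par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained", ylim = c(0, 1))
plot(cumsum(pve), type = "b", xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained", ylim = c(0, 1))
```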
Example: US Arrest Data (Variance Explained)
For the US Arrest data, the scree plot and cumulative proportion of variance explained are shown below:
Note: Diagram description (Figure 4): Left panel shows the PVE for PC1 through PC4. Right panel shows the cumulative PVE for PC1, PC1+PC2, PC1+PC2+PC3, and PC1+PC2+PC3+PC4.
Scree Plot Analysis
Interpretation: Examining the scree plot (left panel of Figure 4), we observe a sharp drop in PVE from PC1 to PC2, and then a much gentler decline for PC3 and PC4.
The first principal component explains a substantial portion of the total variance, over 60%. This indicates that PC1 is highly effective in summarizing the primary axis of variation in the US Arrests data.
The second principal component explains a significant but smaller proportion, around 25% of the variance. While less than PC1, PC2 still captures a considerable amount of additional, independent variance.
Cumulative Proportion of Variance Explained
Interpretation: The cumulative proportion of variance explained plot (right panel of Figure 4) shows the accumulated variance as we include more principal components.
Together, the first two components explain almost 87% of the variance in the data (approximately 62% + 25%). This suggests that by using just two principal components, we can retain a large majority of the information present in the original four variables, achieving significant dimension reduction with minimal information loss. This two-dimensional representation is highly useful for visualization and further analysis.
The scree plot exhibits an "elbow" after the second principal component, where the slope of the curve noticeably decreases. This elbow suggests that retaining the first two principal components might be a reasonable choice for dimension reduction and visualization, as adding more components beyond the second contributes diminishingly to the explained variance. In this case, reducing the dimensionality from 4 to 2 using PCA seems justified, as it retains most of the variance while simplifying the data structure considerably.
Cluster Analysis
Clustering Methods: Overview
Clustering methods are a broad class of unsupervised techniques aimed at partitioning a dataset into subgroups, or clusters. The fundamental objective of cluster analysis is to group observations such that those within the same cluster are more similar to each other than to those in other clusters. This ‘similarity’ is typically defined based on a chosen distance or dissimilarity measure.
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a principal task of exploratory data mining, and a common technique for statistical data analysis, used in many fields.
Examples of Applications:
Market Segmentation: In market research, clustering is extensively used to segment customers into distinct groups based on purchasing behavior, demographics, or preferences. For example, clustering can identify groups like ‘high-value spenders,’ ‘budget-conscious shoppers,’ or ‘tech-early adopters,’ enabling targeted marketing strategies. This is done without prior knowledge of customer segments, purely based on observed purchasing patterns and customer attributes.
Biological Data Analysis: In genomics and bioinformatics, clustering is crucial for analyzing gene expression data, protein interactions, and patient data. For instance, in cancer research, clustering gene expression profiles of tumor samples can reveal previously unknown subtypes of cancer, each potentially requiring different treatment approaches. Similarly, in ecological studies, clustering can group species based on their traits or habitats to understand ecological niches and biodiversity patterns.
Cluster analysis is primarily concerned with grouping the rows (observations) of the \(n \times p\) data matrix \(\mathbf{X}\). While the typical application is to cluster observations based on their features (variables), the approach can also be reversed to cluster features based on observations. This is achieved by transposing the data matrix and applying clustering algorithms to the transposed data. For example, in gene expression analysis, one might cluster genes based on their expression patterns across different experimental conditions or patients.
While both clustering and PCA are unsupervised methods aimed at simplifying data, they serve different purposes. PCA is a dimension reduction technique focused on variance, whereas cluster analysis is a partitioning technique focused on grouping similar observations. Although clustering is a widely used and versatile tool, it is important to recognize that visualization techniques, especially when combined with interactive and dynamic graphics software, can sometimes offer equally compelling or even more insightful alternatives for exploratory data analysis.
Clustering Algorithms
Clustering algorithms are diverse, but most rely on quantifying the dissimilarity or similarity between data points. A broad classification of clustering methods includes:
Hierarchical Clustering: This approach builds a hierarchy of clusters, represented as a tree-like structure called a dendrogram. Hierarchical clustering does not require pre-specifying the number of clusters.
Agglomerative (bottom-up) Approach: This is the most common type of hierarchical clustering. It starts with each observation in its own cluster and iteratively merges the most similar clusters until all observations belong to a single cluster, or a predetermined stopping criterion is met.
Divisive (top-down) Approach: This approach is less common due to its computational intensity. It begins with all observations in one cluster and recursively splits clusters into smaller, less similar clusters until each observation forms its own cluster, or a stopping criterion is satisfied.
Hierarchical clustering is valuable for exploring data at multiple levels of granularity and visualizing cluster relationships through dendrograms.
Partitioning Clustering: Partitioning methods divide the dataset into a set of non-overlapping clusters. These methods typically require the user to pre-specify the number of clusters, \(K\).
- \(K\)-means Clustering: A widely used partitioning method, particularly suitable for continuous data. \(K\)-means aims to partition the data into \(K\) clusters by minimizing the within-cluster sum of squares. It iteratively assigns observations to the nearest cluster centroid and updates the centroids based on the current cluster members.
Model-based Clustering: These methods assume that the data is generated from a mixture of probability distributions, with each distribution representing a cluster. Model-based clustering offers a statistical framework for clustering, providing probabilistic cluster assignments and measures of uncertainty.
- Gaussian Mixture Models (GMMs): A common model-based approach that assumes each cluster follows a Gaussian distribution. GMMs use the Expectation-Maximization (EM) algorithm to estimate model parameters and cluster assignments.
Model-based clustering can be computationally demanding but offers advantages in terms of statistical rigor and flexibility, especially when the underlying data distribution can be reasonably modeled.
Different clustering methods can lead to varying results depending on the data structure and algorithm characteristics. It is crucial to be aware of the potential for over-interpretation of clustering results. To enhance the robustness and interpretability of cluster analysis, it is often beneficial to use clustering in conjunction with data visualization techniques. Furthermore, dimension reduction methods like PCA can be a valuable preprocessing step, especially for high-dimensional data, to improve clustering performance and visualization.
Measure of Dissimilarity
Many cluster analysis methods begin with a measure of dissimilarity (or similarity) between observations (rows of the data matrix \(\mathbf{X}\)). A dissimilarity coefficient \(d\) typically has the following properties for observations \(\mathbf{a} = (a_1, \dots, a_p)\), \(\mathbf{b} = (b_1, \dots, b_p)\), and \(\mathbf{c} = (c_1, \dots, c_p)\):
Non-negativity: \(d(\mathbf{a}, \mathbf{b}) \geq 0\)
Identity of indiscernibles: \(d(\mathbf{a}, \mathbf{b}) = 0\) if and only if \(\mathbf{a} = \mathbf{b}\)
Symmetry: \(d(\mathbf{a}, \mathbf{b}) = d(\mathbf{b}, \mathbf{a})\)
For a metric dissimilarity (distance), the triangle inequality must also hold:
\[d(\mathbf{a}, \mathbf{c}) \leq d(\mathbf{a}, \mathbf{b}) + d(\mathbf{b}, \mathbf{c})\] Description: For a dissimilarity measure to be considered a metric (or distance), it must satisfy the triangle inequality. This property ensures that the dissimilarity between two points is always less than or equal to the sum of the dissimilarities from each point to a third point.
For an ultrametric dissimilarity:
\[d(\mathbf{a}, \mathbf{c}) \leq \max\{d(\mathbf{a}, \mathbf{b}), d(\mathbf{b}, \mathbf{c})\}\] Description: An ultrametric dissimilarity is a stronger condition than the triangle inequality. It states that the dissimilarity between two points is less than or equal to the maximum of the dissimilarities from each point to a third point. Ultrametric dissimilarities are often found in hierarchical clustering and phylogenetic analysis.
It’s important to note that a dissimilarity measure does not necessarily need to be a distance (i.e., it may not satisfy the triangle inequality).
Common Distance Measures
For Numeric Vectors
For numeric vectors \(\mathbf{a} = (a_1, \dots, a_p)\) and \(\mathbf{b} = (b_1, \dots, b_p)\), common distance measures include:
Euclidean Distance: \[d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2}\] Description: The Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the points.
Manhattan (Taxicab) Distance: \[d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{p} |a_i - b_i|\] Description: The Manhattan distance, also known as taxicab distance or L1 distance, is the sum of the absolute differences of their Cartesian coordinates. It measures the distance traveled along axes at right angles, like moving along city blocks.
Maximum Distance (Chebyshev Distance): \[d(\mathbf{a}, \mathbf{b}) = \max_{i} |a_i - b_i|\] Description: The Chebyshev distance, also known as L-infinity distance or maximum value distance, is the maximum of the absolute differences between the coordinates of the vectors. It is the distance that a king would have to travel on a chessboard to move between squares.
For Non-Numeric or Mixed Data
For non-numeric vectors, or vectors with mixed data types, other dissimilarity measures are used:
Binary Distance: For binary vectors (0s and 1s), it is the proportion of coordinates that differ, among those coordinates where at least one of the two vectors is nonzero (i.e., (0,0) matches are ignored). Description: Binary distance is used for binary data. It calculates dissimilarity based on the proportion of differing coordinates among the nonzero ones, ignoring positions where both vectors are zero.
Jaccard Distance: For binary or categorical variables with a 'preferred' level (e.g., presence of a feature), it is the proportion of variables where the two cases differ, among those variables where at least one of the cases shows the preferred level. For sets \(A\) and \(B\): \[d(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = \frac{|A \triangle B|}{|A \cup B|}\] Description: The Jaccard distance measures dissimilarity between sets. It is calculated as the size of the symmetric difference of the sets divided by the size of their union. For binary or categorical variables, it can be adapted by treating each case as the set of variables at the preferred level.
where \(|\cdot|\) denotes the size of a set, and \(A \triangle B\) is the symmetric difference.
Hamming Distance: For strings or categorical data, it’s the number of coordinates where the two strings differ. Description: Hamming distance is used to measure the number of positions at which the corresponding symbols are different in two strings of equal length. It is commonly used for error detection or correction in data transmission and for comparing categorical data.
Gower Dissimilarity: For mixed numeric-categorical data, it uses a more complex formula and is a non-metric dissimilarity useful for real-world datasets. Description: Gower dissimilarity is a versatile measure for mixed data types (numeric and categorical). It uses different distance metrics for different variable types and then combines them into a single dissimilarity value. It is non-metric but widely used for its ability to handle diverse data.
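In R, the numeric and binary measures above are available through `dist()`, and Gower dissimilarity through `daisy()` in the cluster package (a sketch, assuming a numeric matrix `X` and a mixed-type data frame `df`):

```r
library(cluster)

d_euc <- dist(X, method = "euclidean")    # Euclidean distance
d_man <- dist(X, method = "manhattan")    # Manhattan (taxicab) distance
d_max <- dist(X, method = "maximum")      # Chebyshev (maximum) distance
d_bin <- dist(X > 0, method = "binary")   # binary distance on a 0/1 coding of X

d_gow <- daisy(df, metric = "gower")      # Gower dissimilarity for mixed data types
```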
Hierarchical Clustering in Detail
Agglomerative vs. Divisive Methods: A Comparison
Hierarchical clustering algorithms connect "objects" (observations) to form "clusters" based on their distance. This results in a tree-based representation called a dendrogram.
Agglomerative Hierarchical Methods (Bottom-up): Start with each observation as a separate cluster and iteratively merge the closest pairs of clusters until all observations are in a single cluster or a desired number of clusters is reached. Agglomerative hierarchical clustering is more common.
Divisive Hierarchical Methods (Top-down): Begin with all observations in one cluster and recursively split clusters into smaller ones. Divisive methods are computationally more intensive and are less commonly used than agglomerative methods. Divisive methods can be useful when grouping into a few large clusters is of interest.
In some situations, the assumption of a hierarchical structure might be unrealistic (e.g., clustering people by gender or nationality).
Linkage Criteria in Hierarchical Clustering
Linkage criteria define how the dissimilarity between two groups of observations (clusters) is calculated, based on pairwise dissimilarities between individual observations. Common linkage criteria include:
Complete Linkage (Maximum Linkage): The dissimilarity between two clusters is the maximum dissimilarity between any pair of observations, one from each cluster. Description: Complete linkage, also known as maximum linkage, defines the distance between two clusters as the largest distance between any two points in the two clusters. It tends to produce compact clusters.
Single Linkage (Minimum Linkage): The dissimilarity between two clusters is the minimum dissimilarity between any pair of observations, one from each cluster. Description: Single linkage, also known as minimum linkage, defines the distance between two clusters as the smallest distance between any two points in the two clusters. It can lead to chaining, where clusters are formed by linking observations that are close, even if the clusters are elongated or spread out.
Average Linkage: The dissimilarity between two clusters is the average of all pairwise dissimilarities between observations from the two clusters. Description: Average linkage defines the distance between two clusters as the average of all pairwise distances between points in the two clusters. It is a compromise between single and complete linkage, less prone to chaining than single linkage and less sensitive to outliers than complete linkage.
Centroid Linkage: The dissimilarity between two clusters is the dissimilarity between their centroids (mean vectors). Description: Centroid linkage defines the distance between two clusters as the distance between their centroids (means). It is computationally efficient but can sometimes lead to inversions, where clusters are merged at distances less than the distances at which their subclusters were formed.
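In R, the linkage criterion is selected through the `method` argument of `hclust()` (a sketch, assuming a precomputed dissimilarity object `d`, e.g. `d <- dist(X)`):

```r
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")
hc_centroid <- hclust(d, method = "centroid")   # usually applied to squared Euclidean distances
```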
Dendrogram Interpretation
Dendrograms visually represent the hierarchy of clusters formed during hierarchical clustering.
The y-axis of a dendrogram represents the distance or dissimilarity at which clusters merge.
The x-axis represents the observations, ordered in a way that visually groups similar observations together.
The hierarchy of clusters is obtained by "cutting" the dendrogram at different heights (dissimilarity levels).
Hierarchical methods do not produce a unique partitioning but rather a hierarchy. The user must choose an appropriate level to cut the dendrogram to obtain specific clusters. Dendrograms are sensitive to outliers, which can appear as singleton clusters or distort the heights at which other clusters merge. Single linkage in particular is prone to the "chaining phenomenon", in which clusters are strung together through sequences of close intermediate observations.
There are no universal guidelines for choosing the optimal cut height. Cluster analysis is exploratory, and the optimal number of clusters is often context-specific.
Example: Three Clusters Simulated Data
Consider a simulated dataset with \(n=75\) observations of two variables \(X_1\) and \(X_2\), with 25 observations in each of three clusters (black, red, and blue).
Note: Diagram description (Figure 5): Scatterplot of the simulated data points colored black, red, and blue, representing the three true clusters.
Hierarchical clustering with complete linkage and Euclidean distance is applied to this data.
Note: Diagram description (Figure 6): Dendrogram showing hierarchical clustering of the simulated data. The height axis represents dissimilarity; a horizontal line is drawn at height 5.
Applying Hierarchical Clustering and Dendrogram Cut
Each leaf of the dendrogram (Figure 6) represents one of the 75 observations. As we move up the tree, leaves fuse into branches, indicating groups of similar observations. The height of fusion on the y-axis indicates the dissimilarity between the merged clusters. Observations that fuse at the very bottom are most similar.
Cutting the dendrogram at a height of 5, for example, results in three distinct clusters. Different cut heights can lead to different clustering results.
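A minimal sketch in R of generating three-group data of this kind (with made-up cluster centers) and cutting the complete-linkage dendrogram; the exact fusion heights will differ from Figure 6:

```r
set.seed(1)
x <- rbind(cbind(rnorm(25, mean = 0), rnorm(25, mean = 0)),   # cluster 1
           cbind(rnorm(25, mean = 3), rnorm(25, mean = 3)),   # cluster 2
           cbind(rnorm(25, mean = 0), rnorm(25, mean = 6)))   # cluster 3

hc <- hclust(dist(x), method = "complete")   # complete linkage, Euclidean distance
plot(hc)                                     # dendrogram
clusters <- cutree(hc, k = 3)                # request three clusters directly
# cutree(hc, h = 5)                          # or cut at a chosen height, as in Figure 6
table(clusters)
```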
Visualizing Cluster Identification on Dendrogram
In Figure 6, a horizontal line at height 5 visually separates the dendrogram into three clusters. This cut-off point is subjective and chosen for illustrative purposes.
Example: Swiss Socioeconomic Indicators
The ‘swiss’ dataset from the ‘datasets’ package in R contains socioeconomic indicators for 47 French-speaking provinces of Switzerland around 1888. There are \(n=47\) observations and \(p=5\) variables, each in percent. The variables are Agriculture, Examination, Education, Catholic, and Infant.Mortality.
Note: Diagram description (Figure 7): Scatterplot matrix showing relationships between Agriculture, Examination, Education, Catholic, and Infant.Mortality. The diagonal shows variable names; off-diagonal panels show scatterplots between pairs of variables with a red smoother line.
Since the data are percentages, Euclidean distance is a reasonable choice for dissimilarity. Single linkage agglomerative clustering is performed and the tree is cut into three clusters.
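A sketch of this analysis in R; the restriction to the five variables listed above is an assumption about how the analysis was set up:

```r
data(swiss)
X_swiss <- swiss[, c("Agriculture", "Examination", "Education",
                     "Catholic", "Infant.Mortality")]

hc_swiss <- hclust(dist(X_swiss), method = "single")   # single linkage, Euclidean distance
plot(hc_swiss)
rect.hclust(hc_swiss, k = 3, border = "red")           # highlight a three-cluster cut
groups <- cutree(hc_swiss, k = 3)
```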
Note: Diagram description (Figure 8): Dendrogram showing agglomerative hierarchical clustering of the Swiss data. Red rectangles highlight three clusters.
Applying Agglomerative Hierarchical Clustering
Figure 8 shows the dendrogram obtained by applying agglomerative hierarchical clustering to the Swiss dataset. The red rectangles highlight a possible cut for three clusters.
Divisive clustering is also performed, yielding a similar three-cluster solution.
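Divisive hierarchical clustering is not provided by `hclust()`; a common choice is `diana()` from the cluster package (a sketch, reusing `X_swiss` from the agglomerative example):

```r
library(cluster)

dv <- diana(X_swiss, metric = "euclidean")   # divisive hierarchical clustering
plot(dv, which.plots = 2)                    # plot the dendrogram only
groups_div <- cutree(as.hclust(dv), k = 3)   # cut into three clusters
```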
Note: Diagram description (Figure 9): Dendrogram showing divisive hierarchical clustering of the Swiss data. Red rectangles highlight three clusters.
Applying Divisive Hierarchical Clustering
Figure 9 displays the dendrogram from divisive hierarchical clustering on the Swiss dataset. Similar to the agglomerative approach, three clusters are highlighted with red rectangles.
Comparison of Agglomerative and Divisive Results
Visually comparing Figure 8 and Figure 9 shows that both agglomerative and divisive hierarchical clustering methods yield similar cluster structures for the Swiss dataset, particularly when cutting the dendrogram to obtain three clusters.
Partitioning Clustering
Introduction to Partitioning Clustering Methods
Partitioning methods aim to classify observations into \(K > 0\) distinct, non-overlapping clusters. These methods require pre-specifying the number of clusters \(K\). While criteria exist for choosing \(K\) iteratively, this selection is often not straightforward.
In partitioning clustering, deciding on the number of clusters is crucial, similar to choosing a cut point in hierarchical clustering. An initial cluster assignment is required for the observations. There are various partitioning algorithms and several alternative optimality criteria, some of which are based on probabilistic models.
Considerations for Choosing K (Number of Clusters)
Choosing the number of clusters, \(K\), in partitioning methods is a critical step. Several approaches can be used:
Elbow Method: Plot the within-cluster sum of squares (WSS) against different values of \(K\). Look for an "elbow" in the plot, where the rate of decrease in WSS sharply changes. This elbow point can be a reasonable estimate for \(K\).
Silhouette Analysis: Measures how similar an object is to its own cluster compared to other clusters. The silhouette value ranges from -1 to +1, with higher values indicating better clustering. Average silhouette width can be used to assess the optimal \(K\).
Domain Knowledge: Prior knowledge about the data and the problem domain can guide the choice of \(K\). For example, in market segmentation, business requirements might dictate a specific number of target customer segments.
Information Criteria: For model-based partitioning methods, information criteria like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) can be used to compare models with different numbers of clusters and select the \(K\) that minimizes these criteria.
It is important to note that there is no universally optimal method for choosing \(K\), and the best approach often involves a combination of these techniques along with visual inspection of the clusters.
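A sketch of the elbow and silhouette heuristics in R, assuming a standardized numeric matrix `X`:

```r
library(cluster)

# Elbow method: total within-cluster sum of squares for K = 1, ..., 10
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

# Average silhouette width for K = 2, ..., 10 (a silhouette needs at least two clusters)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(X))[, "sil_width"])
})
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters K",
     ylab = "Average silhouette width")
```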
Brief Mention of Partitioning Algorithms
Common partitioning algorithms include:
\(K\)-means: A widely used algorithm that aims to partition \(n\) observations into \(K\) clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), serving as a prototype of the cluster.
\(K\)-medoids (PAM - Partitioning Around Medoids): Similar to \(K\)-means but uses medoids instead of centroids. Medoids are actual data points within the cluster, making \(K\)-medoids more robust to outliers.
CLARA (Clustering Large Applications): An extension of \(K\)-medoids designed to handle large datasets. CLARA works by sampling a subset of the data, applying PAM on the sample, and then assigning the rest of the data points to the clusters formed in the sample.
FANNY (Fuzzy Analysis Clustering): Allows for fuzzy or soft clustering, where each observation can belong to multiple clusters with different membership degrees, rather than being assigned exclusively to one cluster.
These algorithms differ in their approach to partitioning, sensitivity to outliers, computational complexity, and assumptions about the data. The choice of algorithm depends on the specific characteristics of the dataset and the goals of the analysis.
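A sketch of two of these algorithms in R, assuming a numeric matrix `X` and a pre-chosen number of clusters `K`:

```r
library(cluster)
K <- 3

km <- kmeans(X, centers = K, nstart = 25)   # K-means with 25 random restarts
km$cluster                                  # hard cluster assignments
km$centers                                  # cluster centroids

pm <- pam(X, k = K)                         # K-medoids (PAM), more robust to outliers
pm$medoids                                  # medoids are actual observations
```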
Conclusion
In this lecture, we explored unsupervised learning methods, focusing on Principal Component Analysis (PCA) and Cluster Analysis. These techniques are essential for exploratory data analysis, aiming to uncover hidden structures and patterns in data without labeled responses.
Key Takeaways:
PCA for Dimension Reduction: PCA is established as a powerful and versatile technique for reducing the dimensionality of multivariate data while effectively preserving most of its variance. It is invaluable for simplifying complex datasets, facilitating data visualization in lower dimensions, and identifying the most salient features or components that capture the essential variability in the data. By transforming original variables into a smaller set of uncorrelated principal components, PCA aids in both data interpretation and computational efficiency.
Cluster Analysis for Grouping: Cluster analysis provides a rich toolkit of methods for grouping observations into clusters based on measures of similarity or dissimilarity. We discussed hierarchical and partitioning methods, each offering distinct approaches to clustering. Hierarchical methods, with their dendrogram representation, are useful for exploring hierarchical data structures and relationships at different levels of granularity. Partitioning methods, such as \(K\)-means, are efficient for dividing data into a pre-specified number of clusters, optimizing for within-cluster homogeneity and between-cluster heterogeneity. The choice between these methods, and among various linkage criteria or partitioning algorithms, depends on the specific goals of the analysis and the characteristics of the dataset.
Exploratory Nature: It is crucial to emphasize that unsupervised learning methods, including PCA and cluster analysis, are primarily exploratory tools. They are most effective in the initial stages of data analysis for gaining insights, generating hypotheses, and understanding the inherent structure of data. The results obtained from these methods are not definitive answers but rather starting points for further investigation. Interpretation of results often requires subjective judgment, domain expertise, and careful validation using visualization techniques and contextual understanding.
Practical Considerations: Successful application of both PCA and cluster analysis requires careful attention to several practical considerations. Data preprocessing, including standardization or normalization, is often essential to ensure that variables are on a comparable scale and to prevent variables with larger variances from dominating the results. Parameter selection, such as the number of principal components in PCA, the number of clusters in partitioning methods, and the choice of linkage methods or distance measures in cluster analysis, significantly impacts the outcomes and requires informed decisions. Finally, the interpretation of results must be grounded in the context of the data and the problem at hand, avoiding over-interpretation and acknowledging the exploratory nature of these techniques.
Further Remarks:
Unsupervised learning is inherently an iterative and interactive process. It typically involves experimenting with different methods, adjusting parameters, and exploring various visualizations to progressively refine insights and understanding of the data. This iterative nature is key to effectively uncovering hidden patterns and structures.
The effectiveness and reliability of unsupervised methods are strongly contingent on the quality and nature of the input data. Data quality issues, such as missing values, noise, and outliers, can significantly distort the results of PCA and cluster analysis. Therefore, careful data cleaning and preprocessing steps are paramount to ensure meaningful and robust outcomes.
Combining PCA and cluster analysis can be a highly effective strategy, particularly for high-dimensional data. PCA can be used as a preprocessing step to reduce data dimensionality, remove noise, and mitigate multicollinearity before applying clustering algorithms. Dimension reduction via PCA can improve the computational efficiency of clustering, enhance cluster quality by focusing on the most informative dimensions, and facilitate visualization of clusters in a reduced dimensional space.
Algorithm: Summary of Steps for Applying Unsupervised Learning
Data Preprocessing: Clean and preprocess the data. This may include handling missing values, normalizing or standardizing variables, and addressing outliers.
Method Selection: Choose appropriate unsupervised learning methods based on the goals of the analysis (dimension reduction, clustering, etc.) and the characteristics of the data. For dimension reduction, consider PCA. For clustering, choose between hierarchical, partitioning, or model-based methods.
Parameter Tuning: Determine and tune relevant parameters for the chosen methods. For PCA, decide on the number of principal components to retain (e.g., using scree plot). For clustering, select the number of clusters (e.g., using elbow method, silhouette analysis, or domain knowledge) and choose appropriate distance measures and linkage criteria (for hierarchical clustering) or algorithms (for partitioning clustering).
Model Application: Apply the chosen unsupervised learning methods with the selected parameters to the preprocessed data.
Result Interpretation and Validation: Interpret the results in the context of the problem domain. Visualize the results (e.g., biplots for PCA, dendrograms or cluster plots for cluster analysis). Validate the findings using domain knowledge and, if possible, external data or metrics. Assess the robustness and stability of the results.
Iterate and Refine: Unsupervised learning is often iterative. Based on the initial results and interpretations, revisit previous steps. Consider alternative methods, parameter settings, or preprocessing techniques to refine the analysis and gain deeper insights.
Description: This algorithm outlines the general steps for applying unsupervised learning techniques. It emphasizes the iterative nature of the process, from data preparation to interpretation and refinement of results.
Follow-up Questions and Topics for Next Lecture:
How do we validate the results of unsupervised learning methods? Are the identified principal components or clusters meaningful and robust?
What are the limitations of PCA and cluster analysis? When are these methods not appropriate or effective?
Are there other unsupervised learning methods beyond PCA and cluster analysis that are useful in practice? (e.g., dimensionality reduction techniques like t-SNE and UMAP, anomaly detection methods)
How can we apply these unsupervised learning techniques in specific domains or applications, such as image processing, natural language processing, or bioinformatics?