Information-theoretic Generalization Analysis for Randomized Learning Algorithms
PhD Thesis. 2024.
Yuxin Dong.
Abstract
In recent years, machine learning models, especially deep neural networks, have achieved astonishing success in various fields, e.g., finance, education, and biology. Along with the development of these technologies comes an increasing need for a deeper understanding of their underlying mechanisms, capabilities, and limitations. A key question in statistical learning theory is how to accurately characterize the generalization ability of randomized learning algorithms. Classical learning theory derives uniform-convergence generalization bounds based on measures of hypothesis space complexity. An alternative approach assesses generalization ability from the perspective of stability, measured as the sensitivity of the algorithm to perturbations of the training data. However, these methods depend on the complexity of the hypothesis space or on strong assumptions, making them unsuitable or overly pessimistic for modern deep neural networks.
Recent research has discovered that leveraging the mutual information between the training dataset and the hypothesis yields generalization bounds under weaker assumptions, which can also characterize properties of the data distribution or the learning algorithm. Furthermore, by incorporating conditional mutual information analysis within the supersample framework, the issue of unbounded mutual information caused by continuous and invertible forward propagation functions can be addressed, resulting in computationally tractable generalization bounds based on network predictions. Information-theoretic generalization analysis has attracted significant attention in the research community and has become one of the primary approaches for analyzing the generalization ability of modern deep learning models. However, existing works still have various limitations in terms of computational tractability, tightness of generalization bounds, and applicability of theoretical results. This dissertation focuses on information-theoretic generalization analysis for randomized learning algorithms, exploring generalization properties in various classical learning scenarios, overcoming major limitations of current theoretical results, and constructing a computable, high-precision, and broadly applicable information-theoretic generalization analysis framework. The major contributions of this dissertation are as follows:
(1) Computationally Tractable Generalization Bounds via Kernelized Rényi’s Entropy. To address the issue that existing generalization bounds based on traditional Shannon information measures are computationally intractable in practice, a novel information measure, named kernelized Rényi’s entropy, is proposed. It can be directly approximated with finitely many data points, regardless of the dimensionality of the corresponding random variables, while remaining compatible with the existing generalization analysis framework based on Shannon’s entropy. Existing generalization results for iterative and noisy learning algorithms, e.g., SGLD and SGD, are extended with kernelized Rényi’s entropy, resulting in tighter and computationally tractable generalization bounds based on gradient covariance metrics. Additionally, to address the computational burden of the eigenvalue decomposition required by matrix-based Rényi’s entropy, fast approximations with theoretically optimal convergence rates, based on trace estimation and polynomial approximation techniques, are proposed, ensuring the computability of the related generalization results.
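As a point of reference for this contribution, the following is a minimal NumPy sketch of the matrix-based Rényi’s entropy of order α, the Gram-matrix quantity that the proposed kernelized Rényi’s entropy builds upon. The RBF kernel, the bandwidth sigma, and the function names are illustrative assumptions, and the trace-estimation and polynomial-approximation accelerations mentioned above are not shown.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """RBF (Gaussian) Gram matrix for n samples X of shape (n, d)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def matrix_renyi_entropy(X, alpha=2.0, sigma=1.0):
    """Matrix-based Renyi alpha-entropy estimated directly from samples."""
    K = gram_matrix(X, sigma)
    A = K / np.trace(K)                  # normalize so that tr(A) = 1
    eigvals = np.linalg.eigvalsh(A)
    eigvals = eigvals[eigvals > 1e-12]   # drop numerically zero eigenvalues
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

# Usage: entropy estimate from 200 samples of a 50-dimensional Gaussian,
# illustrating that the estimate needs no explicit density in high dimensions.
rng = np.random.default_rng(0)
print(matrix_renyi_entropy(rng.standard_normal((200, 50)), alpha=2.0))
```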
(2) Loss Entropy Induced High-probability Generalization Bounds. To address the issue that existing generalization error bounds based on high-dimensional mutual information measures are computationally intractable and insufficiently tight, a novel low-dimensional information-theoretic generalization measure, named loss entropy, is proposed. It involves only one-dimensional random variables and thus can be directly estimated using kernel density estimation or binning methods, completely resolving the computational intractability of related works. Building upon loss entropy, related theoretical results are extended to data-independent generalization scenarios, significantly improving the tightness and computational tractability of generalization bounds based on information bottleneck measures and providing new theoretical insights into the generalization behavior of the minimum error entropy criterion. Furthermore, existing results for data-dependent generalization scenarios are improved under the leave-one-out and supersample frameworks, yielding high-probability generalization bounds based on loss entropy. This is the first computationally tractable high-probability information-theoretic generalization bound, and it is also significantly tighter than existing bounds in various classical learning scenarios.
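The computational advantage of a one-dimensional measure can be illustrated with a minimal sketch of a binning-based estimate of the entropy of per-sample loss values. The estimator, the bin count, and the synthetic loss values below are illustrative assumptions and do not reproduce the dissertation’s exact construction or the way the resulting quantity enters its bounds.

```python
import numpy as np

def loss_entropy_binning(losses, n_bins=30):
    """Plug-in (histogram) entropy estimate of 1-D loss values, in nats."""
    counts, _ = np.histogram(losses, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# Usage: entropy of per-sample losses collected during training
# (synthetic, gamma-distributed losses stand in for real training losses).
rng = np.random.default_rng(0)
losses = rng.gamma(shape=2.0, scale=0.5, size=10_000)
print(loss_entropy_binning(losses))
```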
(3) A Unified Generalization Framework for Non-pointwise Learning. To address the issue that existing information-theoretic generalization bounds for traditional supervised pointwise learning are no longer applicable to non-pointwise learning scenarios, e.g., contrastive learning, the first information-theoretic generalization framework for non-pointwise learning paradigms is proposed. Firstly, by utilizing the superadditivity of mutual information to decompose the expected generalization error and adopting a bottom-up reduction, the challenge posed by non-i.i.d. loss terms in non-pointwise learning is overcome. Next, by adopting an independent decomposition of the supersample variables, generalization bounds based on low-dimensional mutual information measures are derived, completely resolving the dimensionality explosion encountered when extending existing results within the supersample framework. The results apply to any bounded non-pointwise loss function, encompassing pointwise, pairwise, triplet, and higher-order learning paradigms within a unified framework.
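For concreteness, the distinction between pointwise and pairwise (the simplest non-pointwise) empirical risks can be written as follows; the notation is generic rather than taken from the dissertation, with hypothesis $w$, samples $z_i$, and loss $\ell$:
\[
\hat{R}^{\mathrm{point}}_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i),
\qquad
\hat{R}^{\mathrm{pair}}_S(w) = \frac{1}{n(n-1)}\sum_{i \ne j} \ell(w, z_i, z_j).
\]
Even when the samples $z_i$ are i.i.d., the terms of the pairwise sum share samples and are therefore not independent, which is precisely the difficulty addressed by the decomposition described above.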
(4) Information-theoretic Analysis and Algorithm Design for Domain Generalization. To address the issue that existing generalization bounds based on the traditional i.i.d. sample assumption are no longer applicable to domain generalization or other out-of-distribution scenarios, a high-probability domain generalization analysis framework is proposed through the lens of information theory. Firstly, using the global population risk as a bridge, the domain generalization error is decomposed into source-domain and target-domain generalization errors. Information-theoretic generalization bounds are then derived for each, identifying the key mutual information measures that affect the domain generalization performance of learning algorithms. It is further shown that jointly aligning inter-domain gradients and representations constitutes a sufficient condition for domain generalization. Based on these findings, the IDM algorithm is proposed for domain generalization, and the PDM method improves over existing distribution matching methods based on low-order moments; together they improve the overall accuracy on the DomainBed benchmark from 68.7% with the traditional ERM method to 71.2%.
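As a purely illustrative PyTorch sketch of the gradient-alignment idea, and not of the IDM or PDM algorithms themselves, one could penalize the spread of per-domain gradients around their mean as follows. The penalty form, the weight 0.1, the linear model, and the synthetic two-domain setup are all hypothetical assumptions introduced only for this example.

```python
import torch

def gradient_alignment_penalty(per_domain_losses, params):
    """Mean squared deviation of per-domain gradients from their average:
    a generic inter-domain gradient-alignment penalty (illustrative only)."""
    grads = []
    for loss in per_domain_losses:
        g = torch.autograd.grad(loss, params, create_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = torch.stack(grads)                       # (n_domains, n_params)
    return ((G - G.mean(dim=0)) ** 2).sum(dim=1).mean()

# Usage (illustrative): a linear model trained on two synthetic "domains".
model = torch.nn.Linear(10, 1)
params = [p for p in model.parameters() if p.requires_grad]
losses = []
for _ in range(2):                               # two domains
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    losses.append(torch.nn.functional.mse_loss(model(x), y))
total = sum(losses) / len(losses) + 0.1 * gradient_alignment_penalty(losses, params)
total.backward()                                 # gradients of risk + alignment penalty
```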