1. Duality
1.1 Lagrangian and Dual Function
- Lagrangian for a primal problem:
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) $$
- Dual function:
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) $$
- Always concave (regardless of convexity of primal).
- Lower bound property: If $\lambda \geq 0$, then $g(\lambda, \nu) \leq p^*$.
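A quick numeric sanity check of the lower-bound property, on a toy problem of my own choosing (minimize $x^2$ subject to $x \geq 1$, i.e. $f_1(x) = 1 - x \leq 0$, so $p^* = 1$); the dual function works out to $g(\lambda) = \lambda - \lambda^2/4$:

```python
# Toy problem: minimize x^2 subject to x >= 1  (f_1(x) = 1 - x <= 0).
# Primal optimum: p* = 1 at x* = 1.
# Lagrangian: L(x, lam) = x^2 + lam*(1 - x); the inner minimization over x
# gives x = lam/2, so the dual function is g(lam) = lam - lam^2 / 4.

def g(lam):
    return lam - lam**2 / 4.0

p_star = 1.0

# Lower-bound property: g(lam) <= p* for every lam >= 0.
lams = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
assert all(g(lam) <= p_star + 1e-12 for lam in lams)

# Strong duality holds here (convex problem, Slater's condition):
# g is maximized at lam = 2, where g(2) = 1 = p*.
assert abs(g(2.0) - p_star) < 1e-12
```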
1.2 Dual Problem
- Dual problem:
$$\begin{aligned} &\text{maximize } g(\lambda, \nu) \\ &\text{subject to } \lambda \geq 0 \end{aligned} $$
- The dual is always a convex optimization problem (maximizing a concave function over $\lambda \geq 0$), even if the primal is non-convex.
- Optimal dual value: $d^*$.
1.3 Weak and Strong Duality
- Weak duality: $d^* \leq p^*$ (always holds).
- Strong duality: $d^* = p^*$ (holds for convex problems under Slater’s condition, i.e. a strictly feasible point exists).
1.4 KKT Conditions
For optimal $(x^*, \lambda^*, \nu^*)$ under strong duality:
- Primal feasibility: $f_i(x^*) \leq 0, \quad h_i(x^*) = 0$.
- Dual feasibility: $\lambda_i^* \geq 0$.
- Complementary slackness: $\lambda_i^* f_i(x^*) = 0$.
- Stationarity: $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$.
- For convex problems, KKT is necessary and sufficient for optimality.
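The four conditions can be checked by hand on a small example of my own choosing: minimize $(x-2)^2$ subject to $x \leq 1$. The constraint is active at $x^* = 1$, and stationarity fixes $\lambda^* = 2$:

```python
# KKT check for: minimize (x - 2)^2 subject to x <= 1  (f_1(x) = x - 1 <= 0).
# The constraint is active, so x* = 1; stationarity 2(x* - 2) + lam* = 0
# gives lam* = 2.

x_star, lam_star = 1.0, 2.0

f1 = x_star - 1.0                         # constraint value f_1(x*)
grad_L = 2.0 * (x_star - 2.0) + lam_star  # d/dx [ (x - 2)^2 + lam*(x - 1) ]

assert f1 <= 0.0             # primal feasibility
assert lam_star >= 0.0       # dual feasibility
assert lam_star * f1 == 0.0  # complementary slackness
assert grad_L == 0.0         # stationarity
```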
2. Applications in Machine Learning
2.1 Logistic Regression
- Convex optimization formulation:
$$\min_w \left[ -\sum_{i=1}^n y_i w^T x_i + \sum_{i=1}^n \log(1 + e^{w^T x_i}) \right] $$
- Unconstrained convex problem (the negative log-likelihood for labels $y_i \in \{0, 1\}$).
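A minimal gradient-descent sketch for this objective (synthetic data and step size of my choosing; no intercept or regularization, purely illustrative):

```python
import numpy as np

# Gradient descent on the logistic-regression NLL above, labels y_i in {0, 1}.
# Gradient of the objective: sum_i (sigmoid(w^T x_i) - y_i) x_i.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def loss(w):
    z = X @ w
    return -y @ z + np.logaddexp(0.0, z).sum()  # log(1 + e^z), numerically stable

w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid(w^T x_i)
    w -= 0.01 * X.T @ (p - y)           # gradient step

assert loss(w) < loss(np.zeros(2))  # descent reduced the objective
```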
2.2 Support Vector Machines (SVM)
- Primal QP:
$$\begin{aligned} &\min_{w,w_0} \frac{1}{2}\|w\|^2 \\ &\text{s.t. } y_i(w^T x_i + w_0) \geq 1, \quad i=1,\ldots,n \end{aligned} $$
- Dual problem:
$$\begin{aligned} &\max_{\lambda} \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j \\ &\text{s.t. } \sum_{i=1}^n \lambda_i y_i = 0, \quad \lambda_i \geq 0 \end{aligned} $$
- Support vectors: points with $\lambda_i > 0$.
- Kernel trick: Replace $x_i^T x_j$ with $K(x_i, x_j)$.
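For a two-point toy dataset of my choosing ($x_1 = (1,0)$ with $y_1 = +1$, $x_2 = (-1,0)$ with $y_2 = -1$), the dual can be solved almost by inspection, which makes it easy to verify strong duality and the recovery of $w = \sum_i \lambda_i y_i x_i$:

```python
import numpy as np

# Tiny separable SVM: x1 = (1, 0), y1 = +1; x2 = (-1, 0), y2 = -1.
# The constraint sum_i lam_i y_i = 0 forces lam1 = lam2 = lam, and the dual
# objective reduces to 2*lam - 2*lam^2, maximized at lam = 1/2.

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i^T x_j

def dual(lam):
    lams = np.array([lam, lam])  # feasible direction: lam1 = lam2
    return lams.sum() - 0.5 * lams @ Q @ lams

grid = np.linspace(0.0, 2.0, 2001)
lam_star = grid[np.argmax([dual(l) for l in grid])]

w = sum(lam_star * y[i] * X[i] for i in range(2))  # w = sum_i lam_i y_i x_i

assert abs(lam_star - 0.5) < 1e-3
assert np.allclose(w, [1.0, 0.0], atol=1e-3)
# Strong duality: dual optimum equals the primal value (1/2)*||w||^2.
assert abs(dual(lam_star) - 0.5 * w @ w) < 1e-3
```

Both points have $\lambda_i > 0$, so both are support vectors, consistent with the bullet above.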
2.3 Non-Separable Case (Soft Margin)
- Introduce slack variables $\xi_i$:
$$\begin{aligned} &\min_{w,w_0,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \\ &\text{s.t. } y_i(w^T x_i + w_0) \geq 1 - \xi_i, \quad \xi_i \geq 0 \end{aligned} $$
- Dual constraints become: $0 \leq \lambda_i \leq C$.
3. Neural Network Training & Optimization
3.1 Backpropagation
- Chain rule applied layer-by-layer.
- Local gradients for operations:
- Add gate: gradient distributor.
- Multiply gate: gradient swapper.
- Max gate: gradient router.
- Copy gate: gradient adder.
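The four gate intuitions can be checked numerically on a small expression of my choosing, $f(x, y) = \max(x, y) \cdot (x + y)$, where $x$ feeds two paths (copy gate):

```python
# Backprop through f(x, y) = max(x, y) * (x + y) with upstream gradient 1.

x, y = 3.0, 2.0

m = max(x, y)   # max gate
s = x + y       # add gate
f = m * s       # multiply gate

# Backward pass:
df_dm = s       # multiply gate: "swaps" in the other input's value
df_ds = m
ds_dx = df_ds   # add gate: distributes the gradient unchanged
ds_dy = df_ds
dm_dx = df_dm if x > y else 0.0  # max gate: routes the gradient to the winner
dm_dy = df_dm if y > x else 0.0

dx = dm_dx + ds_dx  # copy gate: x feeds two paths, so their gradients add
dy = dm_dy + ds_dy

# Analytic check: for x > y, f = x*(x + y), so df/dx = 2x + y, df/dy = x.
assert dx == 2 * x + y
assert dy == x
```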
3.2 Stochastic Gradient Descent (SGD)
- Update rule:
$$w_{k+1} = w_k - \eta_k \nabla f_{i(k)}(w_k) $$
- Mini-batch SGD:
$$w_{k+1} = w_k - \eta_k \frac{1}{|B|} \sum_{j \in B} \nabla f_j(w_k) $$
- Compared with single-sample SGD, mini-batches improve parallelism (one matrix multiply per batch) and reduce gradient variance, stabilizing training.
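A self-contained mini-batch SGD sketch on a least-squares problem (synthetic data, batch size and step size of my choosing), matching the update rule above:

```python
import numpy as np

# Mini-batch SGD on f(w) = (1/n) * sum_i (x_i^T w - y_i)^2 with |B| = 16.

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta, batch = 0.05, 16
for k in range(2000):
    B = rng.choice(500, size=batch, replace=False)      # sample a mini-batch
    grad = 2.0 * X[B].T @ (X[B] @ w - y[B]) / batch     # averaged batch gradient
    w -= eta * grad

assert np.allclose(w, w_true, atol=0.1)  # recovers the true weights
```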
3.3 Advanced Optimizers
- Momentum:
$$m_k = \beta m_{k-1} + \nabla f(w_k), \quad w_{k+1} = w_k - \eta m_k $$
- Adam: Combines momentum and adaptive learning rates.
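A sketch of the momentum update above versus plain gradient descent on an ill-conditioned quadratic of my choosing (step size and $\beta$ are illustrative values):

```python
import numpy as np

# Heavy-ball momentum vs. plain GD on f(w) = 0.5 * w^T diag(1, 100) w.

H = np.array([1.0, 100.0])  # diagonal Hessian: condition number 100
grad = lambda w: H * w

def run(eta, beta, steps=100):
    w, m = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        m = beta * m + grad(w)  # m_k = beta * m_{k-1} + grad f(w_k)
        w = w - eta * m         # w_{k+1} = w_k - eta * m_k
    return np.abs(w).max()

plain = run(eta=0.015, beta=0.0)  # beta = 0 recovers plain GD
heavy = run(eta=0.015, beta=0.9)

assert heavy < plain  # momentum converges faster on this problem
```

With $\beta = 0.9$ the slow direction (curvature 1) contracts much faster than plain GD at the same step size, which is the usual motivation for momentum on badly conditioned losses.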
3.4 Activation Functions
- ReLU: Most common, but can cause “dead neurons”.
- Sigmoid/Tanh: Suffer from vanishing gradients.
- Leaky ReLU/Maxout: Alternatives to mitigate issues.
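The failure modes above are visible directly in the local gradients at a negative pre-activation (slope 0.01 for leaky ReLU is a common but arbitrary choice):

```python
import math

# Gradients of the activations above at a negative pre-activation z = -2.

z = -2.0
relu_grad = 1.0 if z > 0 else 0.0        # dead unit: zero gradient flows back
leaky_grad = 1.0 if z > 0 else 0.01      # small slope keeps the unit trainable
sig = 1.0 / (1.0 + math.exp(-z))
sigmoid_grad = sig * (1.0 - sig)         # shrinks toward 0 for large |z|

assert relu_grad == 0.0
assert leaky_grad == 0.01
assert sigmoid_grad < 0.25  # sigmoid's gradient peaks at 0.25 (at z = 0)
```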
3.5 Initialization
- He initialization (for ReLU): $\sigma^2 = 2 / \text{fan-in}$.
- Xavier initialization (for sigmoid/tanh): $\sigma^2 = 1 / \text{fan-in}$ (or $2 / (\text{fan-in} + \text{fan-out})$ in the original symmetric form).
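A quick check of the He scaling (layer width and sample count are illustrative): the factor 2 compensates for ReLU halving the second moment of the activations, so $\mathbb{E}[a^2]$ stays near 1 across the layer:

```python
import numpy as np

# He initialization check: unit-variance inputs through a linear layer with
# sigma^2 = 2 / fan_in weights, then ReLU. ReLU halves the second moment,
# which the factor 2 compensates for.

rng = np.random.default_rng(2)
fan_in = 512
x = rng.normal(size=(10_000, fan_in))  # unit-variance inputs

W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_in))
pre = x @ W_he                 # pre-activations: E[pre^2] ~ 2
post = np.maximum(pre, 0.0)    # ReLU: E[post^2] ~ 1

assert abs((post**2).mean() - 1.0) < 0.1  # second moment preserved
```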
3.6 Regularization & Techniques
- Dropout: Randomly deactivate neurons during training.
- Batch Normalization: Normalize layer inputs to speed up training.
- Gradient Clipping: Prevent exploding gradients.
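Of the three, gradient clipping is the simplest to state in code. A common variant (norm clipping, threshold of my choosing) rescales the gradient without changing its direction:

```python
import numpy as np

# Global-norm gradient clipping: rescale if the norm exceeds a threshold,
# leaving the gradient's direction unchanged.

def clip_by_norm(grad, max_norm):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])       # norm 50: would be an "exploding" step
clipped = clip_by_norm(g, 5.0)

assert np.isclose(np.linalg.norm(clipped), 5.0)
assert np.allclose(clipped, [3.0, 4.0])  # same direction, smaller magnitude
```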
4. Model Selection
4.1 Cross-Validation
- K-fold CV: Partition data into $K$ subsets; train on $K-1$ of them, validate on the remaining one, and average the validation error over all $K$ folds.
- Choose model with lowest cross-validation error.
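A minimal K-fold splitter (index bookkeeping only; the model-fitting step is left out):

```python
import numpy as np

# K-fold split: shuffle n indices, cut into K folds; fold k is the validation
# set while the other K-1 folds form the training set.

def kfold_indices(n, K, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, val

splits = list(kfold_indices(20, 5))
assert len(splits) == 5
for train, val in splits:
    assert len(val) == 4 and len(train) == 16
    assert set(train) | set(val) == set(range(20))  # every point used once
```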
4.2 Bayesian Model Selection
- BIC:
$$J_{\text{BIC}}(m) = \log p(D|\hat{\theta}_m) - \frac{d}{2} \log n $$
- AIC:
$$J_{\text{AIC}}(m) = \log p(D|\hat{\theta}_m) - d $$
- Occam’s Razor: Prefer simpler models if performance is similar.
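The scores above can be computed directly for a toy comparison of my choosing: fitting a Gaussian where model A estimates only the mean ($d = 1$, variance fixed at 1) and model B also estimates the variance ($d = 2$). The richer model never fits worse, and BIC charges it $\tfrac{1}{2}\log n$ for the extra parameter:

```python
import numpy as np

# BIC/AIC scores (as defined above, larger is better) for two Gaussian models.

rng = np.random.default_rng(3)
x = rng.normal(size=200)  # data actually drawn from N(0, 1)
n = len(x)

def gauss_loglik(x, mu, var):
    return (-0.5 * len(x) * np.log(2 * np.pi * var)
            - 0.5 * ((x - mu) ** 2).sum() / var)

ll_A = gauss_loglik(x, x.mean(), 1.0)      # d = 1: mean only
ll_B = gauss_loglik(x, x.mean(), x.var())  # d = 2: mean and variance (MLE)

bic = lambda ll, d: ll - 0.5 * d * np.log(n)
aic = lambda ll, d: ll - d

assert ll_B >= ll_A  # the extra parameter never hurts the raw fit
# The BIC gap between B and A is the likelihood gain minus the (log n)/2 penalty:
assert np.isclose(bic(ll_B, 2) - bic(ll_A, 1), (ll_B - ll_A) - 0.5 * np.log(n))
```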
5. Statistical Learning Theory
5.1 Generalization Error
- Generalization gap: $\Delta(D, h) = R(h) - \hat{R}(D, h)$, where $R(h)$ is the true risk and $\hat{R}(D, h)$ the empirical (training) risk on dataset $D$.
- Overfitting: Small training error, large generalization error.
- Underfitting: Large training and generalization error.
5.2 PAC Learnability
- A hypothesis class $\mathcal{H}$ is PAC-learnable if:
$$\mathbb{P}\{R(\hat{h}) < R(h^*) + \epsilon\} \geq 1 - \delta $$
for sufficiently large $n$, where $\hat{h}$ is the hypothesis learned from the sample and $h^*$ the best hypothesis in $\mathcal{H}$.
- Finite $\mathcal{H}$: Always PAC-learnable.
- Infinite $\mathcal{H}$: Use VC dimension to characterize learnability.
5.3 VC Dimension
- Definition: Largest set size that can be shattered by $\mathcal{H}$.
- Examples:
- Linear classifiers in $\mathbb{R}^d$: $VC = d+1$.
- Axis-aligned rectangles in $\mathbb{R}^2$: $VC = 4$.
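The $VC = d + 1$ claim for linear classifiers can be verified in the shattering direction for $d = 2$: three non-collinear points admit every one of the $2^3$ labelings, because the affine system $[x_i, 1]\,\theta = y_i$ is invertible (a sketch with points of my choosing):

```python
import numpy as np
from itertools import product

# Shattering check for affine classifiers in R^2 (VC = d + 1 = 3): for three
# non-collinear points, [x_i, 1] theta = y_i is an invertible 3x3 system, so
# every labeling in {-1, +1}^3 is realized exactly.

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A = np.hstack([P, np.ones((3, 1))])  # rows [x_i, 1]; invertible here

for labels in product([-1.0, 1.0], repeat=3):
    theta = np.linalg.solve(A, np.array(labels))
    assert np.all(np.sign(A @ theta) == np.array(labels))  # labeling realized
```

The other direction (no 4 points can be shattered, e.g. the XOR configuration) is a short contradiction argument rather than a computation.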
6. Summary
- Duality provides lower bounds and optimality conditions.
- Convex optimization underpins many ML models (SVM, logistic regression).
- Neural networks rely on backpropagation and advanced optimizers.
- Model selection balances complexity and performance.
- Learning theory formalizes generalization and model capacity.