1. Duality
1.1 Lagrangian and Dual Function
- Lagrangian for a primal problem:
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) $$
- Dual function:
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) $$
- Always concave (regardless of convexity of primal).
- Lower bound property: If $\lambda \geq 0$, then $g(\lambda, \nu) \leq p^*$.
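A quick numeric sanity check of the lower-bound property, on a toy problem of my own choosing (minimize $x^2$ subject to $x \geq 1$, i.e. $f_1(x) = 1 - x \leq 0$, so $p^* = 1$); the dual function works out to $g(\lambda) = \lambda - \lambda^2/4$:

```python
# Toy problem: minimize x^2 subject to x >= 1  (f_1(x) = 1 - x <= 0).
# Primal optimum: p* = 1 at x* = 1.
# Lagrangian: L(x, lam) = x^2 + lam*(1 - x); the inner minimization over x
# gives x = lam/2, so the dual function is g(lam) = lam - lam^2 / 4.

def g(lam):
    return lam - lam**2 / 4.0

p_star = 1.0

# Lower-bound property: g(lam) <= p* for every lam >= 0.
lams = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
assert all(g(lam) <= p_star + 1e-12 for lam in lams)

# Strong duality holds here (convex problem, Slater's condition):
# g is maximized at lam = 2, where g(2) = 1 = p*.
assert abs(g(2.0) - p_star) < 1e-12
```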
1.2 Dual Problem
- Dual problem:
$$\begin{aligned} &\text{maximize } g(\lambda, \nu) \\ &\text{subject to } \lambda \geq 0 \end{aligned} $$
- The dual is always a convex optimization problem (maximizing a concave function over $\lambda \geq 0$), even if the primal is non-convex.
- Optimal dual value: $d^*$.
1.3 Weak and Strong Duality
- Weak duality: $d^* \leq p^*$ (always holds).
- Strong duality: $d^* = p^*$ (holds for convex problems under Slater’s condition, i.e. a strictly feasible point exists).
1.4 KKT Conditions
For optimal $(x^*, \lambda^*, \nu^*)$ under strong duality:
- Primal feasibility: $f_i(x^*) \leq 0, \quad h_i(x^*) = 0$.
- Dual feasibility: $\lambda_i^* \geq 0$.
- Complementary slackness: $\lambda_i^* f_i(x^*) = 0$.
- Stationarity: $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$.
- For convex problems, KKT is necessary and sufficient for optimality.
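The four conditions can be checked by hand on a small example of my own choosing: minimize $(x-2)^2$ subject to $x \leq 1$. The constraint is active at $x^* = 1$, and stationarity fixes $\lambda^* = 2$:

```python
# KKT check for: minimize (x - 2)^2 subject to x <= 1  (f_1(x) = x - 1 <= 0).
# The constraint is active, so x* = 1; stationarity 2(x* - 2) + lam* = 0
# gives lam* = 2.

x_star, lam_star = 1.0, 2.0

f1 = x_star - 1.0                         # constraint value f_1(x*)
grad_L = 2.0 * (x_star - 2.0) + lam_star  # d/dx [ (x - 2)^2 + lam*(x - 1) ]

assert f1 <= 0.0             # primal feasibility
assert lam_star >= 0.0       # dual feasibility
assert lam_star * f1 == 0.0  # complementary slackness
assert grad_L == 0.0         # stationarity
```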
2. Applications in Machine Learning
2.1 Logistic Regression
- Convex optimization formulation:
$$\min_w \left[ -\sum_{i=1}^n y_i w^T x_i + \sum_{i=1}^n \log(1 + e^{w^T x_i}) \right] $$
- Unconstrained convex problem (the negative log-likelihood for labels $y_i \in \{0, 1\}$).
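A minimal gradient-descent sketch for this objective (synthetic data and step size of my choosing; no intercept or regularization, purely illustrative):

```python
import numpy as np

# Gradient descent on the logistic-regression NLL above, labels y_i in {0, 1}.
# Gradient of the objective: sum_i (sigmoid(w^T x_i) - y_i) x_i.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def loss(w):
    z = X @ w
    return -y @ z + np.logaddexp(0.0, z).sum()  # log(1 + e^z), numerically stable

w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid(w^T x_i)
    w -= 0.01 * X.T @ (p - y)           # gradient step

assert loss(w) < loss(np.zeros(2))  # descent reduced the objective
```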
2.2 Support Vector Machines (SVM)
- Primal QP:
$$\begin{aligned} &\min_{w,w_0} \frac{1}{2}\|w\|^2 \\ &\text{s.t. } y_i(w^T x_i + w_0) \geq 1, \quad i=1,\ldots,n \end{aligned} $$
- Dual problem:
$$\begin{aligned} &\max_{\lambda} \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j \\ &\text{s.t. } \sum_{i=1}^n \lambda_i y_i = 0, \quad \lambda_i \geq 0 \end{aligned} $$
- Support vectors: points with $\lambda_i > 0$.
- Kernel trick: Replace $x_i^T x_j$ with $K(x_i, x_j)$.
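For a two-point toy dataset of my choosing ($x_1 = (1,0)$ with $y_1 = +1$, $x_2 = (-1,0)$ with $y_2 = -1$), the dual can be solved almost by inspection, which makes it easy to verify strong duality and the recovery of $w = \sum_i \lambda_i y_i x_i$:

```python
import numpy as np

# Tiny separable SVM: x1 = (1, 0), y1 = +1; x2 = (-1, 0), y2 = -1.
# The constraint sum_i lam_i y_i = 0 forces lam1 = lam2 = lam, and the dual
# objective reduces to 2*lam - 2*lam^2, maximized at lam = 1/2.

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i^T x_j

def dual(lam):
    lams = np.array([lam, lam])  # feasible direction: lam1 = lam2
    return lams.sum() - 0.5 * lams @ Q @ lams

grid = np.linspace(0.0, 2.0, 2001)
lam_star = grid[np.argmax([dual(l) for l in grid])]

w = sum(lam_star * y[i] * X[i] for i in range(2))  # w = sum_i lam_i y_i x_i

assert abs(lam_star - 0.5) < 1e-3
assert np.allclose(w, [1.0, 0.0], atol=1e-3)
# Strong duality: dual optimum equals the primal value (1/2)*||w||^2.
assert abs(dual(lam_star) - 0.5 * w @ w) < 1e-3
```

Both points have $\lambda_i > 0$, so both are support vectors, consistent with the bullet above.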
2.3 Non-Separable Case (Soft Margin)
- Introduce slack variables $\xi_i$:
$$\begin{aligned} &\min_{w,w_0,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \\ &\text{s.t. } y_i(w^T x_i + w_0) \geq 1 - \xi_i, \quad \xi_i \geq 0 \end{aligned} $$
- Dual constraints become: $0 \leq \lambda_i \leq C$.
3. Neural Network Training & Optimization
3.1 Backpropagation
- Chain rule applied layer-by-layer.
- Local gradients for operations:
- Add gate: gradient distributor.
- Multiply gate: gradient swapper.
- Max gate: gradient router.
- Copy gate: gradient adder.
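The four gate intuitions can be checked numerically on a small expression of my choosing, $f(x, y) = \max(x, y) \cdot (x + y)$, where $x$ feeds two paths (copy gate):

```python
# Backprop through f(x, y) = max(x, y) * (x + y) with upstream gradient 1.

x, y = 3.0, 2.0

m = max(x, y)   # max gate
s = x + y       # add gate
f = m * s       # multiply gate

# Backward pass:
df_dm = s       # multiply gate: "swaps" in the other input's value
df_ds = m
ds_dx = df_ds   # add gate: distributes the gradient unchanged
ds_dy = df_ds
dm_dx = df_dm if x > y else 0.0  # max gate: routes the gradient to the winner
dm_dy = df_dm if y > x else 0.0

dx = dm_dx + ds_dx  # copy gate: x feeds two paths, so their gradients add
dy = dm_dy + ds_dy

# Analytic check: for x > y, f = x*(x + y), so df/dx = 2x + y, df/dy = x.
assert dx == 2 * x + y
assert dy == x
```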
3.2 Stochastic Gradient Descent (SGD)
- Update rule:
$$w_{k+1} = w_k - \eta_k \nabla f_{i(k)}(w_k) $$
- Mini-batch SGD:
$$w_{k+1} = w_k - \eta_k \frac{1}{|B|} \sum_{j \in B} \nabla f_j(w_k) $$
- Compared with single-sample SGD, mini-batches improve parallelism (one matrix multiply per batch) and reduce gradient variance, stabilizing training.
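A self-contained mini-batch SGD sketch on a least-squares problem (synthetic data, batch size and step size of my choosing), matching the update rule above:

```python
import numpy as np

# Mini-batch SGD on f(w) = (1/n) * sum_i (x_i^T w - y_i)^2 with |B| = 16.

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta, batch = 0.05, 16
for k in range(2000):
    B = rng.choice(500, size=batch, replace=False)      # sample a mini-batch
    grad = 2.0 * X[B].T @ (X[B] @ w - y[B]) / batch     # averaged batch gradient
    w -= eta * grad

assert np.allclose(w, w_true, atol=0.1)  # recovers the true weights
```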
3.3 Advanced Optimizers
- Momentum:
$$m_k = \beta m_{k-1} + \nabla f(w_k), \quad w_{k+1} = w_k - \eta m_k $$
- Adam: Combines momentum and adaptive learning rates.
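A sketch of the momentum update above versus plain gradient descent on an ill-conditioned quadratic of my choosing (step size and $\beta$ are illustrative values):

```python
import numpy as np

# Heavy-ball momentum vs. plain GD on f(w) = 0.5 * w^T diag(1, 100) w.

H = np.array([1.0, 100.0])  # diagonal Hessian: condition number 100
grad = lambda w: H * w

def run(eta, beta, steps=100):
    w, m = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        m = beta * m + grad(w)  # m_k = beta * m_{k-1} + grad f(w_k)
        w = w - eta * m         # w_{k+1} = w_k - eta * m_k
    return np.abs(w).max()

plain = run(eta=0.015, beta=0.0)  # beta = 0 recovers plain GD
heavy = run(eta=0.015, beta=0.9)

assert heavy < plain  # momentum converges faster on this problem
```

With $\beta = 0.9$ the slow direction (curvature 1) contracts much faster than plain GD at the same step size, which is the usual motivation for momentum on badly conditioned losses.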
3.4 Activation Functions
- ReLU: Most common, but can cause “dead neurons”.
- Sigmoid/Tanh: Suffer from vanishing gradients.
- Leaky ReLU/Maxout: Alternatives to mitigate issues.
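The failure modes above are visible directly in the local gradients at a negative pre-activation (slope 0.01 for leaky ReLU is a common but arbitrary choice):

```python
import math

# Gradients of the activations above at a negative pre-activation z = -2.

z = -2.0
relu_grad = 1.0 if z > 0 else 0.0        # dead unit: zero gradient flows back
leaky_grad = 1.0 if z > 0 else 0.01      # small slope keeps the unit trainable
sig = 1.0 / (1.0 + math.exp(-z))
sigmoid_grad = sig * (1.0 - sig)         # shrinks toward 0 for large |z|

assert relu_grad == 0.0
assert leaky_grad == 0.01
assert sigmoid_grad < 0.25  # sigmoid's gradient peaks at 0.25 (at z = 0)
```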
3.5 Initialization
- He initialization (for ReLU): $\sigma^2 = 2 / \text{fan-in}$.
- Xavier initialization (for sigmoid/tanh): $\sigma^2 = 1 / \text{fan-in}$ (or $2 / (\text{fan-in} + \text{fan-out})$ in the original symmetric form).
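A quick check of the He scaling (layer width and sample count are illustrative): the factor 2 compensates for ReLU halving the second moment of the activations, so $\mathbb{E}[a^2]$ stays near 1 across the layer:

```python
import numpy as np

# He initialization check: unit-variance inputs through a linear layer with
# sigma^2 = 2 / fan_in weights, then ReLU. ReLU halves the second moment,
# which the factor 2 compensates for.

rng = np.random.default_rng(2)
fan_in = 512
x = rng.normal(size=(10_000, fan_in))  # unit-variance inputs

W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_in))
pre = x @ W_he                 # pre-activations: E[pre^2] ~ 2
post = np.maximum(pre, 0.0)    # ReLU: E[post^2] ~ 1

assert abs((post**2).mean() - 1.0) < 0.1  # second moment preserved
```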
3.6 Regularization & Techniques
- Dropout: Randomly deactivate neurons during training.
- Batch Normalization: Normalize layer inputs to speed up training.
- Gradient Clipping: Prevent exploding gradients.
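Of the three, gradient clipping is the simplest to state in code. A common variant (norm clipping, threshold of my choosing) rescales the gradient without changing its direction:

```python
import numpy as np

# Global-norm gradient clipping: rescale if the norm exceeds a threshold,
# leaving the gradient's direction unchanged.

def clip_by_norm(grad, max_norm):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])       # norm 50: would be an "exploding" step
clipped = clip_by_norm(g, 5.0)

assert np.isclose(np.linalg.norm(clipped), 5.0)
assert np.allclose(clipped, [3.0, 4.0])  # same direction, smaller magnitude
```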
4. Model Selection
4.1 Cross-Validation
- K-fold CV: Partition data into $K$ subsets; train on $K-1$ of them, validate on the remaining one, and average the validation error over all $K$ folds.
- Choose model with lowest cross-validation error.
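A minimal K-fold splitter (index bookkeeping only; the model-fitting step is left out):

```python
import numpy as np

# K-fold split: shuffle n indices, cut into K folds; fold k is the validation
# set while the other K-1 folds form the training set.

def kfold_indices(n, K, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, val

splits = list(kfold_indices(20, 5))
assert len(splits) == 5
for train, val in splits:
    assert len(val) == 4 and len(train) == 16
    assert set(train) | set(val) == set(range(20))  # every point used once
```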
4.2 Bayesian Model Selection
- BIC:
$$J_{\text{BIC}}(m) = \log p(D|\hat{\theta}_m) - \frac{d}{2} \log n $$
- AIC:
$$J_{\text{AIC}}(m) = \log p(D|\hat{\theta}_m) - d $$
- Occam’s Razor: Prefer simpler models if performance is similar.
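The scores above can be computed directly for a toy comparison of my choosing: fitting a Gaussian where model A estimates only the mean ($d = 1$, variance fixed at 1) and model B also estimates the variance ($d = 2$). The richer model never fits worse, and BIC charges it $\tfrac{1}{2}\log n$ for the extra parameter:

```python
import numpy as np

# BIC/AIC scores (as defined above, larger is better) for two Gaussian models.

rng = np.random.default_rng(3)
x = rng.normal(size=200)  # data actually drawn from N(0, 1)
n = len(x)

def gauss_loglik(x, mu, var):
    return (-0.5 * len(x) * np.log(2 * np.pi * var)
            - 0.5 * ((x - mu) ** 2).sum() / var)

ll_A = gauss_loglik(x, x.mean(), 1.0)      # d = 1: mean only
ll_B = gauss_loglik(x, x.mean(), x.var())  # d = 2: mean and variance (MLE)

bic = lambda ll, d: ll - 0.5 * d * np.log(n)
aic = lambda ll, d: ll - d

assert ll_B >= ll_A  # the extra parameter never hurts the raw fit
# The BIC gap between B and A is the likelihood gain minus the (log n)/2 penalty:
assert np.isclose(bic(ll_B, 2) - bic(ll_A, 1), (ll_B - ll_A) - 0.5 * np.log(n))
```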
5. Statistical Learning Theory
5.1 Generalization Error
- Generalization gap: $\Delta(D, h) = R(h) - \hat{R}(D, h)$, where $R(h)$ is the true risk and $\hat{R}(D, h)$ the empirical (training) risk on dataset $D$.
- Overfitting: Small training error, large generalization error.
- Underfitting: Large training and generalization error.
5.2 PAC Learnability
- A hypothesis class $\mathcal{H}$ is PAC-learnable if:
$$\mathbb{P}\{R(\hat{h}) < R(h^*) + \epsilon\} \geq 1 - \delta $$
for sufficiently large $n$, where $\hat{h}$ is the hypothesis learned from the sample and $h^*$ the best hypothesis in $\mathcal{H}$.
- Finite $\mathcal{H}$: Always PAC-learnable.
- Infinite $\mathcal{H}$: Use VC dimension to characterize learnability.
5.3 VC Dimension
- Definition: Largest set size that can be shattered by $\mathcal{H}$.
- Examples:
- Linear classifiers in $\mathbb{R}^d$: $VC = d+1$.
- Axis-aligned rectangles in $\mathbb{R}^2$: $VC = 4$.
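The $VC = d + 1$ claim for linear classifiers can be verified in the shattering direction for $d = 2$: three non-collinear points admit every one of the $2^3$ labelings, because the affine system $[x_i, 1]\,\theta = y_i$ is invertible (a sketch with points of my choosing):

```python
import numpy as np
from itertools import product

# Shattering check for affine classifiers in R^2 (VC = d + 1 = 3): for three
# non-collinear points, [x_i, 1] theta = y_i is an invertible 3x3 system, so
# every labeling in {-1, +1}^3 is realized exactly.

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A = np.hstack([P, np.ones((3, 1))])  # rows [x_i, 1]; invertible here

for labels in product([-1.0, 1.0], repeat=3):
    theta = np.linalg.solve(A, np.array(labels))
    assert np.all(np.sign(A @ theta) == np.array(labels))  # labeling realized
```

The other direction (no 4 points can be shattered, e.g. the XOR configuration) is a short contradiction argument rather than a computation.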
6. Summary
- Duality provides lower bounds and optimality conditions.
- Convex optimization underpins many ML models (SVM, logistic regression).
- Neural networks rely on backpropagation and advanced optimizers.
- Model selection balances complexity and performance.
- Learning theory formalizes generalization and model capacity.