
Articles in statistical learning series

Constrained Optimization

Constrained Optimization $\min_{x \in \mathbb{R}^n} f(x)$ subject to: $c_i(x) = 0,\ i \in E$ (equality constraints); $c_i(x) \geq 0,\ i \in I$ (inequality constraints). Feasible set $\Omega = \{ x \mid c_i(x) = 0, i \in E \text{ and } c_i(x) \geq 0, i \in I \}$. Case 1: $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c_1(x) = 0$. $x$ is a local minimum if, for every small step $s$, either $x + s \notin \Omega$ or $f(x+s) \geq f(x)$.……
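As a concrete illustration, here is a minimal quadratic-penalty sketch for an equality-constrained problem. The objective, constraint, step size, and penalty schedule are all my own choices for illustration, not from the article; the penalty method itself is just one standard way to handle $c_1(x) = 0$.

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2 subject to c1(x) = x1 + x2 - 1 = 0
# by adding a quadratic penalty (mu/2) * c1(x)^2 and increasing mu.
# The known solution is x = (0.5, 0.5).
def f(x):
    return x[0]**2 + x[1]**2

def c1(x):
    return x[0] + x[1] - 1.0

def penalty_grad(x, mu):
    # gradient of f(x) + (mu/2) * c1(x)^2
    g_f = 2.0 * x
    g_c = mu * c1(x) * np.ones(2)   # d/dx of the penalty term
    return g_f + g_c

x = np.zeros(2)
for mu in [1.0, 10.0, 100.0, 1000.0]:
    for _ in range(2000):
        x -= 5e-4 * penalty_grad(x, mu)
print(x)  # approaches (0.5, 0.5) as mu grows
```

As $\mu \to \infty$, the unconstrained penalized minimizer $\mu/(2 + 2\mu) \cdot (1, 1)$ converges to the constrained solution.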

Read More

Gradient Descent

Gradient Descent A simple algorithm that moves "downhill" against the gradient of the function. Algebraically: $w_{t+1} = w_t - \eta \nabla f(w_t)$, where $\eta$ is called the learning rate or step size. Step Size If $\eta$ is too small, convergence is slow; if $\eta$ is too large, the solution will bounce around. In practice: Set $\eta$ to a small constant Backtracking line search (works when $\nabla f$ is continuous) Parameters $\bar{\alpha} > 0$, $c \in (0,1)$, $\rho \in (0,1)$.……
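A minimal sketch of the update $w_{t+1} = w_t - \eta \nabla f(w_t)$ with backtracking line search (Armijo sufficient decrease), using the parameters $\bar{\alpha}$, $c$, $\rho$ mentioned above. The quadratic objective is my own illustrative choice, not from the article.

```python
import numpy as np

# Example objective: a quadratic with minimizer at (3, -1).
def f(w):
    return (w[0] - 3.0)**2 + 2.0 * (w[1] + 1.0)**2

def grad_f(w):
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

def backtracking_step(w, alpha_bar=1.0, c=1e-4, rho=0.5):
    g = grad_f(w)
    alpha = alpha_bar
    # shrink alpha until the Armijo condition f(w - a*g) <= f(w) - c*a*||g||^2
    while f(w - alpha * g) > f(w) - c * alpha * (g @ g):
        alpha *= rho
    return w - alpha * g

w = np.zeros(2)
for _ in range(100):
    w = backtracking_step(w)
print(w)  # close to the minimizer (3, -1)
```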

Read More

Convex Optimization

Gradient & Hessian The gradient of a function $f$, a $d \times 1$ vector, can be represented as follows: $\nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_d} \end{bmatrix}$ and the Hessian, a $d \times d$ matrix, can be represented as $\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_d^2} \end{bmatrix}$……
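These definitions can be checked numerically. The sketch below (my own example, not from the article) approximates the gradient and Hessian of $f(x) = x_1^2 + 3x_1x_2 + 2x_2^2$ by finite differences and compares them with the exact values.

```python
import numpy as np

def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

def numerical_gradient(f, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)  # central difference
    return g

def numerical_hessian(f, x, h=1e-4):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            # second-order central difference for d^2 f / dx_i dx_j
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))  # exact: [2*1 + 3*2, 3*1 + 4*2] = [8, 11]
print(numerical_hessian(f, x))   # exact: [[2, 3], [3, 4]]
```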

Read More

Logistic Regression

Uncertainty in Prediction Related to linear regression. The available features $x$ do not contain enough information to perfectly predict $y$; for example, $x$ = medical record for a patient at risk for a disease, $y$ = whether he will contract the disease in the next 5 years. Model We are still going to use a linear model for conditional probability estimation: $w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = w \cdot x + b$……
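Logistic regression squashes the linear score $w \cdot x + b$ through the sigmoid to get a probability. A minimal sketch, with weights and features chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # conditional probability estimate P(y = 1 | x)
    return sigmoid(w @ x + b)

w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])
p = predict_proba(x, w, b)  # score = 0.5*2 - 0.25*4 + 0.1 = 0.1
print(p)  # sigmoid(0.1) ≈ 0.525
```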

Read More

Linear Regression

Basic Idea Fit a line to a bunch of points. Example Without extra information, we will predict the mean, 2.47. Average squared error = $\mathbb{E} [(\text{studentGPA} - \text{predictedGPA})^2]$ = Variance. If we have SAT scores, then we can fit a line. Now if we predict based on this line, the MSE drops to 0.43. This is a regression problem with: Predictor variable: SAT score Response variable: College GPA Formula For $x \in \mathbb{R}$: $y = ax + b$……
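The fit $y = ax + b$ can be computed by least squares. The SAT/GPA numbers below are made up for illustration (they are not the article's data set); the point is that the fitted line's MSE is below the variance you get by always predicting the mean.

```python
import numpy as np

sat = np.array([1000.0, 1100.0, 1200.0, 1300.0, 1400.0])
gpa = np.array([2.0, 2.3, 2.5, 2.9, 3.1])

# design matrix with a column of ones for the intercept b
A = np.column_stack([sat, np.ones_like(sat)])
(a, b), *_ = np.linalg.lstsq(A, gpa, rcond=None)

pred = a * sat + b
mse = np.mean((gpa - pred)**2)
print(a, b, mse)
print(np.var(gpa))  # predicting the mean gives MSE = Var(gpa), which is larger
```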

Read More

Bayes Optimal Classifier

Background Marginal Distribution Three ways to sample from $P$: Draw $(x, y)$ directly Draw $y$ according to its marginal distribution, then $x$ according to the conditional distribution of $x \mid y$ Draw $x$ according to its marginal distribution, then $y$ according to the conditional distribution of $y \mid x$ Define: $\mu$: distribution on $X$ $\eta$: conditional distribution of $y \mid x$ Classifier Normal Classifier $h : X \rightarrow Y$ $R(h) = \Pr_{(x,y) \sim P} (h(x) \neq y)$, where $R$ = risk……
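The risk $R(h) = \Pr_{(x,y) \sim P}(h(x) \neq y)$ can be estimated by Monte Carlo using the second sampling scheme: draw $y$ from its marginal, then $x$ from $x \mid y$. The distribution below (equal class priors, unit Gaussians at $\pm 1$) is my own toy choice, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    y = rng.random(n) < 0.5                    # marginal: y = 1 w.p. 1/2
    x = rng.normal(np.where(y, 1.0, -1.0), 1.0)  # conditional x | y
    return x, y

def h(x):
    # threshold classifier: predict y = 1 when x > 0
    # (Bayes optimal here, since the priors are equal and the Gaussians symmetric)
    return x > 0.0

x, y = sample(100_000)
risk = np.mean(h(x) != y)
print(risk)  # ≈ Phi(-1) ≈ 0.159 for this setup
```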

Read More

lp norm

Families of Distance Functions The $l_p$ norm The most common one is the $l_2$ norm (Euclidean distance): $\|x - z\|_2 = \sqrt{\sum_{i=1}^m (x_i - z_i)^2}$ Note: sometimes the subscript 2 is dropped. For $p \geq 1$, the $l_p$ distance: $\|x - z\|_p = \left( \sum_{i=1}^m |x_i - z_i|^p \right)^{1/p}$ Special cases: $l_1$ distance: $\|x - z\|_1 = \sum_{i=1}^m |x_i - z_i|$ $l_\infty$ distance: $\|x - z\|_\infty = \max_i |x_i - z_i|$ Metric space Let $X$ be the space in which the data lie.……
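A quick sketch checking the three special cases on a concrete pair of vectors (my own numbers, for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 2.0, -1.0])
d = np.abs(x - z)  # coordinate-wise |x_i - z_i| = [3, 0, 4]

l1 = d.sum()                 # 3 + 0 + 4 = 7
l2 = np.sqrt((d**2).sum())   # sqrt(9 + 16) = 5
linf = d.max()               # max(3, 0, 4) = 4
print(l1, l2, linf)

# np.linalg.norm computes the same quantities
assert np.linalg.norm(x - z, 1) == l1
assert np.linalg.norm(x - z, 2) == l2
assert np.linalg.norm(x - z, np.inf) == linf
```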

Read More

Nearest Neighbor Classification

Nearest Neighbor Classification Procedures Assemble a data set (the training set). How to classify a new image $x$? Find its closest neighbor $y$ in the training set and give $x$ the same label. Notes: training set of 60000 images, test set of 10000 images. How do we determine whether two data points (images) are closest? With a 28 x 28 image, we can stretch it into a vector of 784 pixels.……
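The procedure above can be sketched in a few lines. The toy 2-D vectors stand in for the flattened 784-dimensional image vectors (the 60000-image training set itself is not reproduced here); the query is labeled with the label of its closest training point in $l_2$ distance.

```python
import numpy as np

# tiny stand-in training set: two points per class
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])

def nearest_neighbor(x):
    # l2 distance from the query to every training point
    dists = np.linalg.norm(train_x - x, axis=1)
    return train_y[np.argmin(dists)]

print(nearest_neighbor(np.array([0.5, 0.2])))  # → 0
print(nearest_neighbor(np.array([5.4, 4.9])))  # → 1
```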

Read More

Margin of Error

Z-Score vs T-Score Z-Score Link Z-Score's formula: $z = \frac{X - \mu}{\sigma}$ where $X$ = sample mean, $\mu$ = population mean, $\sigma$ = population standard deviation. Also, we use the Z-score when the sample size is >= 30 or we know the population's mean and SD. Z Table T-Score T-Score's formula: $T = \frac{X - \mu}{s / \sqrt{n}}$ where $X$ = sample mean, $\mu$ = population mean, $s$ = sample standard deviation, and $n$ = sample size.……
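Both formulas in one short sketch, with made-up numbers: a sample mean of 105 against a population mean of 100, population SD 15 (for the z-score) and sample SD 14 with $n = 36$ (for the t-score).

```python
import math

X_bar, mu = 105.0, 100.0

# z-score: population standard deviation is known
sigma = 15.0
z = (X_bar - mu) / sigma
print(z)  # (105 - 100) / 15 ≈ 0.333

# t-score: only the sample standard deviation s is available
s, n = 14.0, 36
t = (X_bar - mu) / (s / math.sqrt(n))
print(t)  # 5 / (14 / 6) ≈ 2.143
```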

Read More