Articles in the machine learning category

Constrained Optimization

Constrained Optimization $$\min_{x \in \mathbb{R}^n} f(x)$$ subject to: $$c_i (x) = 0, i \in E \text{ (equality constraints)}$$ $$c_i (x) \geq 0, i \in I \text{ (inequality constraints)}$$ Feasible Set $\Omega = \{ x \mid c_i(x) = 0, i \in E \text{ and } c_i(x) \geq 0, i \in I \}$. Case 1 $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c_1 (x) = 0$. A point x is a local minimum if, for every small step s, either $x + s \notin \Omega$ or $f(x+s) \geq f(x)$.……
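A minimal sketch of this setup solved numerically; the objective and the two constraints below are illustrative choices (not the article's example), and the solver is SciPy's SLSQP.

```python
# Sketch: minimize f(x) = x1^2 + x2^2 subject to one equality and one
# inequality constraint, using scipy.optimize.minimize (SLSQP).
import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] ** 2 + x[1] ** 2          # objective

eq_con = {"type": "eq", "fun": lambda x: x[0] + x[1] - 1.0}   # c(x) = 0, i in E
ineq_con = {"type": "ineq", "fun": lambda x: x[0]}            # c(x) >= 0, i in I

res = minimize(f, x0=np.array([2.0, -1.0]), method="SLSQP",
               constraints=[eq_con, ineq_con])
print(res.x)   # ~ [0.5, 0.5], the feasible minimizer
```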

Read More

Gradient Descent

Gradient Descent A simple algorithm that goes “downhill” against the gradient of the function. Algebraically: $$w_{t+1} = w_t - \eta \nabla f(w_t)$$ where $\eta$ is called the learning rate or step size. Step Size If $\eta$ is too small, convergence is slow; if $\eta$ is too large, the solution will bounce around. In practice: set $\eta$ to be a small constant, or use backtracking line search (works when $\nabla f$ is continuous). Parameters $\bar{\alpha}$, $c \in (0,1)$, $\rho \in (0,1)$.……
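A short sketch of gradient descent with backtracking line search using the parameters named above; the toy quadratic and the default parameter values are illustrative assumptions.

```python
# Sketch: gradient descent where the step size is found by backtracking
# (shrink eta until the Armijo sufficient-decrease condition holds).
import numpy as np

def grad_descent_backtracking(f, grad, w0, alpha_bar=1.0, c=1e-4, rho=0.5,
                              max_iter=200, tol=1e-8):
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        eta = alpha_bar
        # backtrack: require f(w - eta*g) <= f(w) - c*eta*||g||^2
        while f(w - eta * g) > f(w) - c * eta * (g @ g):
            eta *= rho
        w = w - eta * g
    return w

# toy example: f(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, minimizer at (3, -1)
f = lambda w: (w[0] - 3) ** 2 + 2 * (w[1] + 1) ** 2
grad = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
print(grad_descent_backtracking(f, grad, np.array([0.0, 0.0])))
```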

Read More

Convex Optimization

Gradient && Hessian The gradient of a function f, a d x 1 vector, can be represented as follows $$ \nabla f(x) = \begin{bmatrix} \frac {\partial f(x)} {\partial x_1} \newline …\newline \frac {\partial f(x)} {\partial x_d} \end{bmatrix} $$ and the Hessian, d x d, can be represented as $$ \nabla^2 f(x) = \begin{bmatrix} \frac {\partial^2 f(x)} {\partial x_1^2} & \frac {\partial^2 f(x)} {\partial x_1 \partial x_2} & … & \frac {\partial^2 f(x)} {\partial x_1 \partial x_d} \newline … & … & … & … \newline \frac {\partial^2 f(x)} {\partial x_d \partial x_1} & \frac {\partial^2 f(x)} {\partial x_d \partial x_2} & … & \frac {\partial^2 f(x)} {\partial x_d^2} \end{bmatrix} $$……
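A quick numerical check of these definitions on a small quadratic; the test function and the finite-difference approach are illustrative assumptions, not part of the article.

```python
# Sketch: approximate the d x 1 gradient and d x d Hessian by central
# differences and compare with the closed form for f(x) = 1/2 x^T A x + b^T x.
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def numerical_hessian(f, x, h=1e-4):
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 0.5])
f = lambda x: 0.5 * x @ A @ x + b @ x     # gradient = A x + b, Hessian = A
x0 = np.array([0.7, -0.3])
print(numerical_gradient(f, x0), A @ x0 + b)   # should agree
print(numerical_hessian(f, x0))                # ~ A
```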

Read More

Logistic Regression

Uncertainty in Prediction Related to Linear Regression. The available features x do not contain enough information to perfectly predict y, for example: x = medical record for a patient at risk for a disease, y = will they contract the disease in the next 5 years. Model We are still going to use a linear model for conditional probability estimation $$w_1x_1 + w_2x_2 + … + w_dx_d + b = w \cdot x + b$$……
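A minimal sketch of how the linear score $w \cdot x + b$ can be turned into a conditional probability; mapping it through the sigmoid is the standard logistic-regression choice, and the weights and feature vector below are made up for illustration.

```python
# Sketch: model Pr(y = 1 | x) as sigmoid(w . x + b) and threshold at 0.5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(w @ x + b)      # estimated Pr(y = 1 | x)

w = np.array([0.8, -1.2, 0.3])     # illustrative weights
b = 0.1
x = np.array([1.0, 0.5, 2.0])      # one feature vector
p = predict_proba(w, b, x)
print(p, "-> predict y = 1" if p >= 0.5 else "-> predict y = 0")
```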

Read More

Linear Regression

Basic Idea Fit a line to a bunch of points. Example Without extra information, we would predict the mean, 2.47. Average squared error = $\mathbb{E} [(\text{student GPA} - \text{predicted GPA})^2]$ = variance. If we have SAT scores, then we can fit a line. Now if we predict based on this line, the MSE drops to 0.43. This is a regression problem with: Predictor variable: SAT score Response variable: College GPA Formula For $\mathbb{R}$ $$y = ax + b$$……
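A small sketch of the comparison above: predicting the mean versus fitting the line $y = ax + b$ by least squares. The synthetic SAT/GPA-like data are invented for illustration and do not reproduce the article's numbers.

```python
# Sketch: baseline (predict the mean) vs a least-squares line fit.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(900, 1500, size=100)             # predictor (e.g. SAT score)
y = 0.002 * x - 0.2 + rng.normal(0, 0.3, 100)    # response (e.g. college GPA)

mse_mean = np.mean((y - y.mean()) ** 2)          # = variance of y

a, b = np.polyfit(x, y, deg=1)                   # least-squares fit of y = a*x + b
mse_line = np.mean((y - (a * x + b)) ** 2)

print(f"MSE of mean predictor: {mse_mean:.3f}")
print(f"MSE of fitted line:    {mse_line:.3f}")  # smaller, as in the example
```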

Read More

Bayes Optimal Classifier

Background Marginal Distribution Three ways to sample from P: Draw (x, y) jointly; Draw y according to its marginal distribution, then x according to the conditional distribution of x | y; Draw x according to its marginal distribution, then y according to the conditional distribution of y | x. Define: $\mu$: distribution on $X$, $\eta$: conditional distribution of y | x. Classifier Normal Classifier $h : x \rightarrow y$, $R(h) = \Pr_{(x,y) \sim P} (h(x) \neq y)$, where R = risk……
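A sketch of the sampling view above: draw x from the marginal $\mu$, then y from $\eta(x) = \Pr(y = 1 \mid x)$, and estimate the risk $R(h)$ of a classifier by Monte Carlo. The particular $\mu$ and $\eta$ below are illustrative assumptions.

```python
# Sketch: estimate R(h) = Pr(h(x) != y) by sampling, and compare the
# threshold-at-1/2 rule (Bayes optimal) with a constant classifier.
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Pr(y = 1 | x), an arbitrary smooth choice for illustration
    return 1.0 / (1.0 + np.exp(-4 * (x - 0.5)))

def sample(n):
    x = rng.uniform(0, 1, n)                            # x ~ mu
    y = (rng.uniform(0, 1, n) < eta(x)).astype(int)     # y | x ~ eta(x)
    return x, y

def risk(h, n=200_000):
    x, y = sample(n)
    return np.mean(h(x) != y)

bayes = lambda x: (eta(x) >= 0.5).astype(int)           # Bayes-optimal rule
constant = lambda x: np.ones_like(x, dtype=int)         # always predict 1

print("risk of Bayes classifier:   ", risk(bayes))
print("risk of constant classifier:", risk(constant))
```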

Read More

lp norm

Families of Distance Functions $l_p$ norm The most common one is the $l_2$ norm (Euclidean distance): $$||x - z||_2 = \sqrt{\sum_{i=1}^{m}(x_i - z_i)^2}$$ Note: sometimes the subscript 2 is dropped. For $p \geq 1$, the $l_p$ distance is: $$||x - z||_p = \left(\sum_{i=1}^{m}|x_i - z_i|^p\right)^{1/p}$$ Special cases: $l_1$ distance: $$||x - z||_1 = \sum_{i=1}^{m}|x_i - z_i|$$ $l_\infty$ distance: $$||x - z||_\infty = \max_i |x_i - z_i|$$ Metric space Let $X$ be the space in which the data lie.……
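A quick sketch of the $l_p$ distances defined above, checked against NumPy's built-in norm; the two vectors are arbitrary examples.

```python
# Sketch: l_p distance for general p, with l_1, l_2, l_inf as special cases.
import numpy as np

def lp_distance(x, z, p):
    if np.isinf(p):
        return np.max(np.abs(x - z))                  # l_inf: largest coordinate gap
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x = np.array([1.0, -2.0, 3.0])
z = np.array([0.0,  1.0, 1.0])

print(lp_distance(x, z, 1),      np.linalg.norm(x - z, 1))        # l_1
print(lp_distance(x, z, 2),      np.linalg.norm(x - z, 2))        # l_2 (Euclidean)
print(lp_distance(x, z, np.inf), np.linalg.norm(x - z, np.inf))   # l_inf
```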

Read More

Nearest Neighbor Classification

Nearest Neighbor Classification Procedures Assemble a data set (training set). How to classify a new image x? Find its closest neighbor y in the training set and give x the same label. Notes: training set of 60,000 images, test set of 10,000 images. How do we determine how close two images are? With a 28 x 28 image, we can stretch it into a vector of 784 entries.……
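A bare-bones sketch of the 1-NN rule described above, on random vectors standing in for the flattened 28 x 28 (= 784-dimensional) images; the data are placeholders, not the actual image set.

```python
# Sketch: classify a point by copying the label of its nearest training point
# under Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(size=(1000, 784))      # training images, flattened
train_y = rng.integers(0, 10, size=1000)    # their digit labels

def nn_classify(x):
    dists = np.linalg.norm(train_x - x, axis=1)   # distance to every training image
    return train_y[np.argmin(dists)]              # label of the closest one

test_image = rng.normal(size=784)
print("predicted label:", nn_classify(test_image))
```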

Read More

Prototype Selection

Background kNN prototype selection Summary List There are a couple of drawbacks to kNN: high storage for the data, computation of the decision boundary, and intolerance to noise. There are a couple of methods that address the above issues: a better similarity metric or distance function, k-d trees or R-trees for storage, and reduction techniques (prototype selection). Prototype Selection 1. edition methods - remove noise 2. condensation methods - remove superfluous data 3. hybrid methods - achieve elimination of noise and superfluous data at the same time……
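A rough sketch of one condensation-style method, Hart's condensed nearest neighbor, which keeps only the prototypes needed to classify the rest of the training set correctly with 1-NN; whether the article covers this particular algorithm is an assumption, and the toy data are invented.

```python
# Sketch: condensed nearest neighbor (Hart). Points misclassified by the
# current prototype set are added to it; correctly classified points are dropped.
import numpy as np

def condense(X, y, passes=5):
    keep = [0]                                        # start with one prototype
    for _ in range(passes):
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)   # 1-NN over kept prototypes
            if y[keep][np.argmin(d)] != y[i]:
                keep.append(i)                           # misclassified -> keep it
                changed = True
        if not changed:
            break
    return np.array(keep)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = condense(X, y)
print(f"kept {len(idx)} of {len(X)} points as prototypes")
```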

Read More