# Logistic Regression

## Classification and Representation

### Classification examples

• Email: spam or not
• Online transactions: fraudulent or not
• Tumor: malignant or not

### Binary classification problem

$y \in \{0, 1\}$, where 0 is the negative class and 1 is the positive class

### Using Linear regression for Classification problem

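Linear regression is a poor fit for classification problems:

• Classification is not actually a linear function
• A skewed (outlying) sample changes the slope of the hypothesis, so a fixed threshold such as 0.5 no longer separates the classes well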

### Logistic Regression Model

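Want $0 \le h_\theta(x) \le 1$.

The hypothesis applies the logistic function (also called the sigmoid function) $g$ to $\theta^T x$:

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

Interpretation of the hypothesis output:

• $h_\theta(x) = 0.7$ means a 70% chance of the tumor being malignant
• $h_\theta(x) = P(y = 1 \mid x; \theta)$: the probability that $y = 1$, given $x$, parameterized by $\theta$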
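As a minimal sketch of the hypothesis, the sigmoid can be written as a one-line anonymous function in Octave. The parameter vector `theta` and the input `x` below are made-up values for illustration, not numbers from the lecture.

```octave
% Sigmoid (logistic) function, applied element-wise so z may be a
% scalar, vector, or matrix.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters and input, with x(1) = 1 as the intercept term.
theta = [-3; 1; 1];
x = [1; 2; 2];

h = sigmoid(theta' * x);   % estimated P(y = 1 | x; theta)
printf('h = %.4f\n', h);
```

Here $\theta^T x = 1$, so the output is a probability a little above 0.73.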

### Decision Boundary

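Because $g(z) \ge 0.5$ exactly when $z \ge 0$, predict $y = 1$ when the argument of the sigmoid function $g$ is at least 0, and $y = 0$ when it is less than 0.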
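This rule can be sketched in a few lines of Octave. The parameters $\theta = [-3; 1; 1]$ are assumed for illustration, and the three test points are made up; none of the variable names come from the lecture.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [-3; 1; 1];        % assumed parameters: boundary is x1 + x2 = 3
X = [1 1 1;                % each row is [x0 x1 x2], with x0 = 1
     1 2 2;
     1 0 3];

z = X * theta;             % argument of g for every row
predictions = z >= 0;      % same decision as sigmoid(z) >= 0.5

% Columns: z, sigmoid(z), predicted class.
disp([z, sigmoid(z), predictions]);
```

Thresholding `z` at 0 and thresholding `sigmoid(z)` at 0.5 give the same predictions, so the sigmoid does not need to be evaluated just to classify.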

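When $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$, predict $y = 1$ if the argument of $g$ is at least 0, that is, if $-3 + x_1 + x_2 \ge 0$. The decision boundary is the line $x_1 + x_2 = 3$.

## Logistic Regression Model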

### Cost function

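Training set of $m$ examples:

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$$

with $x \in \mathbb{R}^{n+1}$, $x_0 = 1$, and $y \in \{0, 1\}$.

Hypothesis:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Generalized cost function:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$

Linear regression cost function:

$$\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2} (h_\theta(x) - y)^2$$

Using the linear regression cost function with the logistic regression hypothesis makes $J(\theta)$ a non-convex function. It then has many local optima, so gradient descent is not guaranteed to reach the global minimum.

Logistic regression cost function:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

(Plots: cost vs. $h_\theta(x)$ when $y = 1$, and cost vs. $h_\theta(x)$ when $y = 0$.)

With this cost function, the cost goes to infinity when $y = 0$ but the hypothesis outputs 1, and likewise when $y = 1$ but the hypothesis outputs 0.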
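The behavior at the extremes can be checked numerically. The sketch below evaluates the two branches, $-\log(h)$ for $y = 1$ and $-\log(1 - h)$ for $y = 0$, at a few made-up hypothesis outputs.

```octave
h = [0.01 0.5 0.99];       % hypothetical hypothesis outputs

cost_if_y1 = -log(h);      % cost when the true label is y = 1
cost_if_y0 = -log(1 - h);  % cost when the true label is y = 0

% Rows: h, cost for y = 1, cost for y = 0.
disp([h; cost_if_y1; cost_if_y0]);
```

A confident wrong answer (for example $h = 0.01$ when $y = 1$) is penalized far more heavily than an undecided one ($h = 0.5$), and the penalty grows without bound as $h$ approaches the wrong extreme.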

### Simplified Cost Function and Gradient Descent

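Note: because $y$ is always 0 or 1, the two cases of the cost combine into a single expression:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

Plugging this definition into the full cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$

Vectorized implementation:

$$h = g(X\theta), \qquad J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$$

To fit parameters $\theta$:

$$\min_\theta J(\theta)$$

To make a prediction given a new $x$ (inference):

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$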
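A minimal sketch of the vectorized cost and of inference, assuming a tiny made-up training set. `X` carries a leading column of ones for the intercept, and the 0.5 threshold used for the class prediction is the usual convention rather than something fixed by the formula.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical training set: m = 4 examples, intercept column plus one feature.
X = [1 1; 1 2; 1 3; 1 4];
y = [0; 0; 1; 1];
m = length(y);

theta = [0; 0];            % arbitrary starting parameters
h = sigmoid(X * theta);    % m x 1 vector of hypothesis outputs
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
printf('J = %.4f\n', J);   % with theta = 0 every h is 0.5, so J = log(2)

% Prediction for a new input, thresholding the probability at 0.5.
x_new = [1; 5];
p = sigmoid(theta' * x_new) >= 0.5;
```

Printing `size(h)` and `size(y)` before forming the products is a quick way to confirm that the row and column vectors line up.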

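Gradient descent, repeat (simultaneously update all $\theta_j$):

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

After calculating the derivative term, repeat (simultaneously update all $\theta_j$):

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

Vectorized implementation, covering all $j$ features at once:

XXX: I have not yet worked out how the expression below is derived

$$\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - y)$$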
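The vectorized update $\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - y)$ can be sketched as a plain loop. The data set, learning rate `alpha`, and iteration count `num_iters` are assumptions chosen for illustration; `costJ` is a hypothetical helper used only to report progress.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
costJ = @(theta, X, y) (1 / length(y)) * ...
    (-y' * log(sigmoid(X * theta)) - (1 - y)' * log(1 - sigmoid(X * theta)));

% Hypothetical, non-separable training set (intercept column plus one feature).
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];
m = length(y);

alpha = 0.1;               % assumed learning rate
num_iters = 1000;          % assumed iteration count
theta = zeros(2, 1);
J_start = costJ(theta, X, y);

for iter = 1:num_iters
  % Vectorized update: every theta_j changes simultaneously.
  theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end

J_end = costJ(theta, X, y);
printf('J: %.4f -> %.4f\n', J_start, J_end);
```

Row $j$ of `X' * (sigmoid(X * theta) - y)` is exactly the sum $\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$, which is why one matrix product replaces the per-feature loop.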

Sophisticated optimization algorithms

• Conjugate gradient, BFGS, L-BFGS
• Advantage: no need to manually pick $\alpha$, and often faster than gradient descent
• Andrew Ng recommends not trying to understand them at first and simply using a library
• XXX: there seems to be a classification like the one below

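Simple example in Octave (a quadratic cost whose minimum is at $\theta = [5; 5]$)

    % costFunction: returns the cost and its gradient at theta
    function [jVal, gradient] = costFunction(theta)
      jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
      gradient = zeros(2,1);
      gradient(1) = 2*(theta(1)-5);   % partial derivative w.r.t. theta(1)
      gradient(2) = 2*(theta(2)-5);   % partial derivative w.r.t. theta(2)
    end

    % Main
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);

    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)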


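Logistic regression example in Octave

    % costFunction (template; replace each [...] with real code)
    function [jVal, gradient] = costFunction(theta)
      jVal = [...];          % code to compute J(theta)
      gradient(1) = [...];   % derivative for theta_0, CODE#1
      gradient(2) = [...];   % derivative for theta_1, CODE#2
    end

CODE#1:

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$

CODE#2:

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_1^{(i)}$$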
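One way to fill in the template is sketched below. It assumes the training data `X` and `y` are passed as extra arguments (the template above takes only `theta`), and it uses a made-up data set. The leading `1;` is an Octave idiom that lets a function be defined inside a script file.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Cost and gradient of (unregularized) logistic regression.
% X and y are extra arguments compared with the template above.
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  gradient = (1 / m) * X' * (h - y);   % gradient(j + 1) is dJ/dtheta_j
end

% Hypothetical, non-separable training set.
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];

% Cost and gradient at the starting point theta = 0.
[J0, grad0] = costFunction(zeros(2, 1), X, y);

options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) costFunction(t, X, y), zeros(2, 1), options);
```

Because Octave indexes from 1, `gradient(1)` holds the derivative for $\theta_0$ and `gradient(2)` the derivative for $\theta_1$, matching CODE#1 and CODE#2.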

### Multi-class Classification(one-vs-all)

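• Email tagging: work, friends, family, hobby
• Weather: Sunny, Cloudy, Rain, Snow

One-vs-all: train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$:

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$$

Inference: for a new input $x$, pick the class $i$ that maximizes:

$$\max_i h_\theta^{(i)}(x)$$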
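A minimal sketch of one-vs-all inference. The matrix `all_theta` is a hypothetical set of already-trained parameters, one column per class; in practice each column would come from fitting a separate binary classifier.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters for K = 3 classes; column i holds the theta of
% classifier i.
all_theta = [ 1 -1  0;
              2  0 -1;
             -1  1  2];

x = [1; 0.5; 2];           % new input with x(1) = 1 as the intercept term

probs = sigmoid(all_theta' * x);           % h^(i)(x) for every class i
[best_prob, predicted_class] = max(probs); % index of the largest probability
printf('predicted class: %d\n', predicted_class);
```

The per-class probabilities do not need to sum to 1, since each classifier is trained independently; only their ranking matters for the prediction.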

## Regularization

### The Problem of Overfitting

• Under-fitting, High bias
  • The algorithm has a very strong preconception
  • The hypothesis function maps poorly to the trend of the data
  • It is usually caused by a function that is too simple or uses too few features
• Just-right
• Over-fitting, High variance: we don’t have enough data to constrain it
  • With too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples

Addressing overfitting:

• Plotting the data: hard to do visually when there are many features
• 1) Reduce the number of features
  • Manually select which features to keep
  • Model selection algorithm (covered later)
• 2) Regularization
  • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$

### Cost function

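With a high-order polynomial hypothesis, suppose we penalize $\theta_3$ and $\theta_4$ and make them really small, for example by adding large penalty terms to the objective:

$$\min_\theta \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

Regularization: penalize large parameter values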

• “Simpler” hypothesis
• Less prone to overfitting
• Don’t penalize $\theta_0$

Regularization example: Housing problem with Linear regression

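Add a regularization term at the end of the cost function:

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

• Regularization parameter: $\lambda$
• It determines how much the costs of our $\theta$ parameters are inflated
• If $\lambda$ is set to an extremely large value, $\theta_1, \dots, \theta_n$ are all pushed toward 0 and the algorithm under-fits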
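A minimal sketch of the regularized linear regression cost, $J(\theta) = \frac{1}{2m}\left[\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j \ge 1} \theta_j^2\right]$. The data, `theta`, and `lambda` are made-up values; the point to notice is that the penalty skips $\theta_0$.

```octave
% Hypothetical data: m = 3 examples, intercept column plus two features.
X = [1 1 2; 1 2 1; 1 3 3];
y = [1; 2; 3];
m = length(y);
theta = [0; 1; 1];
lambda = 1;                % assumed regularization parameter

err = X * theta - y;       % h_theta(x) - y for every example
penalty = lambda * sum(theta(2:end) .^ 2);   % skips theta_0, stored in theta(1)
J = (1 / (2 * m)) * (err' * err + penalty);
printf('J = %.4f\n', J);
```

With these numbers the squared errors sum to 14 and the penalty is 2, so $J = 16/6$.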

### Regularized linear regression

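Gradient descent, repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \qquad (j = 1, \dots, n)$$

The update for $\theta_j$ can be rearranged as:

$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

First term: because $\alpha \frac{\lambda}{m}$ is a small positive number, the factor $1 - \alpha \frac{\lambda}{m}$ is slightly less than 1 (e.g., 0.99), so every update shrinks $\theta_j$ a little before the usual gradient step.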
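The shrinking effect can be checked numerically. The sketch below, with made-up data and deliberately large `alpha` and `lambda` so the effect is visible, applies one regularized update in two ways: adding $\frac{\lambda}{m}\theta_j$ to the gradient, and multiplying $\theta_j$ by $1 - \alpha\frac{\lambda}{m}$ first.

```octave
% Hypothetical data and settings.
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
m = length(y);
alpha = 0.1;
lambda = 3;
theta = [0; 2];

grad = (1 / m) * X' * (X * theta - y);   % unregularized gradient

% Form 1: add (lambda / m) * theta_j to the gradient, except for theta_0.
reg = (lambda / m) * theta;
reg(1) = 0;
theta_a = theta - alpha * (grad + reg);

% Form 2: shrink theta_j by (1 - alpha * lambda / m), then take the usual step.
shrink = [1; (1 - alpha * lambda / m) * ones(length(theta) - 1, 1)];
theta_b = shrink .* theta - alpha * grad;

disp([theta_a, theta_b]);  % the two columns should match
```

The first entry of `shrink` is 1 because $\theta_0$ is not regularized.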

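Normal equation

$$\theta = (X^T X + \lambda L)^{-1} X^T y, \qquad L = \mathrm{diag}(0, 1, 1, \dots, 1) \in \mathbb{R}^{(n+1) \times (n+1)}$$

If $\lambda > 0$, the matrix $X^T X + \lambda L$ is invertible.

Regularization therefore also takes care of any non-invertibility of $X^T X$, for example when there are fewer examples than features.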
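A minimal sketch of the regularized normal equation on a deliberately under-determined, made-up data set (fewer examples than parameters), where $X^T X$ alone is singular. The matrix `L` is the identity with its top-left entry zeroed so that $\theta_0$ is not penalized.

```octave
% Hypothetical data with m = 2 examples and n = 2 features, so X' * X
% (3 x 3) is singular.
X = [1 1 2; 1 2 4];
y = [1; 2];
lambda = 1;                % assumed regularization parameter

L = eye(size(X, 2));
L(1, 1) = 0;               % do not regularize theta_0

A = X' * X + lambda * L;
theta = A \ (X' * y);      % solves A * theta = X' * y without forming inv(A)

printf('rank of XtX = %d, rank of A = %d\n', rank(X' * X), rank(A));
```

Using the backslash operator instead of `inv` is a common numerical choice in Octave; it computes the same solution as the formula.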

### Regularized logistic regression

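Logistic regression can over-fit too, for example with a high-order polynomial hypothesis. The fix is the same: add the regularization term $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$ to the cost function.

Gradient descent, repeat (the update looks like the one for linear regression, but the hypothesis is different: $h_\theta(x) = g(\theta^T x)$):

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \qquad (j = 1, \dots, n)$$

    % costFunction (template; replace each [...] with real code)
    function [jVal, gradient] = costFunction(theta)
      jVal = [...];             % code to compute J(theta)
      gradient(1) = [...];      % code to compute partial theta_0 of J(theta)
      % ...
      gradient(n + 1) = [...];  % code to compute partial theta_n of J(theta)
    end

jVal:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$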
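One possible way to fill in this template is sketched below. The name `costFunctionReg` and the extra arguments `X`, `y`, and `lambda` are assumptions for illustration (the template above takes only `theta`), and the data is made up. The leading `1;` lets the function be defined inside an Octave script file.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Regularized logistic regression cost and gradient.
% X, y, and lambda are extra arguments compared with the template above.
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  reg_theta = [0; theta(2:end)];          % theta_0 is not penalized
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda / (2 * m)) * (reg_theta' * reg_theta);
  gradient = (1 / m) * X' * (h - y) + (lambda / m) * reg_theta;
end

X = [1 1; 1 2; 1 3; 1 4];  % hypothetical training set
y = [0; 0; 1; 1];
[J, grad] = costFunctionReg([0; 1], X, y, 1);
printf('J = %.4f\n', J);
```

Zeroing the first entry of `reg_theta` handles both the cost and the gradient at once, so $\theta_0$ gets no special-case branch.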

You probably know quite a lot more machine learning right now than frankly, many of the Silicon Valley engineers out there having very successful careers. You know, making tons of money for the companies. Or building products using machine learning algorithms. So, congratulations. - Andrew Ng

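XXX: Andrew Ng said this with a peculiar smile in the lecture. What did he mean by it?

XXX: In the Week 3 programming assignment, I implemented the sigmoid function, the cost function, and predict. Unlike the previous assignment, I did not implement gradient descent by hand and used Octave's fminunc instead. The important part was assembling the vectors and matrices so that they match the formulas; I had to print their dimensions one at a time with the size function to get them to line up.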

    %  Set options for fminunc
    options = optimset('GradObj', 'on', 'MaxIter', 400);

    %  Run fminunc to obtain the optimal theta
    %  This function will return theta and the cost
    [theta, cost] = ...
        fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);