Andrew Ng Machine Learning Notes


Author: pstrike | Published 2018-12-19 22:20

    Notation

    n = number of features
    m = number of training examples
    x^(i) = input of the i-th training example
    x_j^(i) = value of feature j in the i-th training example

    Model

    x^(i) denotes the “input” variables (living area in this example), also called input features
    y^(i) denotes the “output” or target variable
    A pair (x^(i), y^(i)) is called a training example
    A list of m training examples (x^(i), y^(i)), i = 1, ..., m, is called a training set

    Cost Function
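
    For linear regression with hypothesis h_θ(x) = θ_0 + θ_1·x, the course uses the squared-error cost:

        J(θ_0, θ_1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2

    The goal is to choose the θ values that minimize J.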

    Gradient Descent

    The gradient descent algorithm is:
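
        repeat until convergence:
            θ_j := θ_j - α · ∂/∂θ_j J(θ_0, θ_1)    (update all θ_j simultaneously)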


    where j=0,1 represents the feature index number.

    Feature Scaling
    We can speed up gradient descent by keeping each of our input values in roughly the same range. Two techniques to help with this are feature scaling and mean normalization.

    • Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
    • Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
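
    Combined, both adjustments can be written as:

        x_i := (x_i - μ_i) / s_i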

    where μ_i is the average of all the values for feature i and s_i is the range of values (max - min), or s_i is the standard deviation.

    Learning Rate

    • If α is too small: slow convergence.
    • If α is too large: J(θ) may not decrease on every iteration and thus may not converge (it may even diverge).
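
    A minimal sketch of batch gradient descent in NumPy (assuming X is an m×(n+1) design matrix whose first column is all ones, and y is the m-vector of targets); the choice of alpha follows the trade-off above:

        import numpy as np

        def gradient_descent(X, y, alpha=0.01, num_iters=1500):
            """Batch gradient descent for linear regression (illustrative sketch)."""
            m, n = X.shape
            theta = np.zeros(n)
            for _ in range(num_iters):
                predictions = X @ theta                     # h_theta(x) for every example
                gradient = (X.T @ (predictions - y)) / m    # partial derivatives of J
                theta -= alpha * gradient                   # simultaneous update of all theta_j
            return theta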

    Polynomial Regression
    Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
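
    For instance, the hypothesis h_θ(x) = θ_0 + θ_1·x + θ_2·x^2 + θ_3·x^3 is still linear in the parameters, so the same machinery applies once the extra features are created. A small illustrative sketch (the feature values here are made up):

        import numpy as np

        size = np.array([1.0, 2.0, 3.0])                       # original feature x
        X_poly = np.column_stack([size, size**2, size**3])     # x, x^2, x^3 as new features
        # Feature scaling matters here, since x^3 covers a much larger range than x.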

    Normal Equation

    In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:
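
        θ = (X^T X)^(-1) X^T y

    where X is the m×(n+1) design matrix of training inputs (with a leading column of ones) and y is the m-vector of target values.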


    In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
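
    A one-line NumPy sketch (assuming the same X and y as above); the pseudo-inverse is used so that a non-invertible X^T·X, e.g. from redundant features, still yields a solution:

        import numpy as np

        def normal_equation(X, y):
            # theta = (X^T X)^(-1) X^T y, computed with the pseudo-inverse
            return np.linalg.pinv(X.T @ X) @ X.T @ y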

    Logistic Regression
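
    The logistic regression hypothesis passes θ^T·x through the sigmoid (logistic) function, so its output can be read as a probability:

        h_θ(x) = g(θ^T x),   g(z) = 1 / (1 + e^(-z))

    h_θ(x) is interpreted as the probability that y = 1 given input x, and we predict y = 1 when h_θ(x) ≥ 0.5 (i.e. when θ^T x ≥ 0).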





    For multiclass classification (one-vs-all), we divide the problem into binary classification problems; in each one, we predict the probability that y is a member of one of our classes.


    We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
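
    A prediction-side sketch of one-vs-all (assuming all_theta holds one already-trained theta vector per class, and X is the design matrix with a leading column of ones):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def predict_one_vs_all(all_theta, X):
            probabilities = sigmoid(X @ all_theta.T)   # (m, num_classes): h_theta(x) per class
            return np.argmax(probabilities, axis=1)    # pick the class with the highest value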

    Overfitting




    Overfitting can be addressed by reducing the number of features or by regularization, which keeps all the features but shrinks the parameter values θ_j. The λ, or lambda, is the regularization parameter: it determines how much the costs of our theta parameters are inflated.
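
    For regularized linear regression, for example, the cost becomes:

        J(θ) = (1/2m) · [ Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2 + λ · Σ_{j=1..n} θ_j^2 ]

    Note that the bias term θ_0 is not regularized. A larger λ smooths the hypothesis more (risking underfitting); a very small λ leaves overfitting in place.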



    Neural Networks



    Reference

    • Coursera: Machine Learning, Stanford University (Andrew Ng)
