Training Models

Linear Regression

A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term). The equation for a linear model called be written as:

$$ \begin{equation} \hat{y}= \theta_0 +\theta_1x_1+\theta_2x_2+…+\theta_nx \end{equation} $$

  • $\hat{y}$ is the predicted value
  • $n$ is the number of features
  • $x_i$ is the $i$th feature value
  • $\theta_j$ is the $j$th model parameter (including the bias term $\theta_0$ and the feature weights $\theta_1, \theta_2,…, \theta_n$)

Or it can be written more concisely in the vectorized form:

$$ \begin{equation} \hat{y}=h_0(x)=\theta \cdot x \end{equation} $$

  • $\theta$ is the model’s parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$
  • x is the instance’s feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to $1$
  • $\theta \cdot x$ is the dot product of the vectors $\theta$ and $x$, which is equal to $\theta_0x_0+\theta_1x_1+\theta_2x_2+…+\theta_nx_n$
  • $h_0$ is the hypothesis function, using the model parameters $\theta$

In Machine Learning, vectors are represented as column vectors, which are 2D arrays with a single column. If $\theta$ and $x$ are column vectors, then the prediction $\hat{y}=\theta^Tx$, where $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector) and $\theta^Tx$ is the matrix multiplication of $\theta^T$ and $x$. It is the same prediction, but now it is represented as a single cell matrix rather than a scalar value.

The best way to measure the model accuracy for a linear regression model is to use the Root Mean Square Error. We need to find a value of $\theta$ that minimizes the RMSE. The MSE of a Linear Regression hypothesis $h_\theta$ on a training set $X$ is calculated using:

$$ \begin{equation} MSE(X, h_\theta)=\frac{1}{m}\sum_{i=1}^m\big(\theta^Tx^{(i)}-y^{(i)}\big)^2 \end{equation} $$

A difference is that we write $h_\theta$ instead of $h$ because the model is parameterized by the vector $\theta$, but to simplify it will be shown as $MSE(\theta)$ instead of $MSE(X, h_\theta)$.

Normal Equation

In order to find the value of $\theta$ that minimizes the cost function, there is a closed-form solution, i.e. a mathematical equation that gives the result directly. This is called the normal equation:

$$ \begin{equation} \hat{\theta}=\big(X^T \cdot X\big)^{-1} \cdot X^T \cdot y \end{equation} $$

  • $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function
  • $y$ is the vector of target values containing $y^{(1)}$ to $y^{(m)}$
### We can test the normal function above by creating some linear data

import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.rand(100, 1)

plt.scatter(X, y, c="lightgreen")

We can now compute $\hat{\theta}$ using the Normal Equation. We can use the inv() function from NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for the matrix multiplication.

### Add X\theta = 1 to each instance
X_b = np.c_[np.ones((100, 1)), X]
print("Regular X: \n", X[:5])
print("X with 1 added to it (do do matrix multiplication): \n", X_b[:5])
### Our Normal Function
theta_best = np.linalg.inv(
print("Theta best: \n", theta_best)
Regular X: 
X with 1 added to it (do do matrix multiplication): 
 [[1.         0.01007283]
 [1.         0.99026667]
 [1.         1.66275616]
 [1.         1.58994263]
 [1.         1.39956948]]
Theta best: 

Considering our original equation was $y=4+3x_1+Gaussian\hspace{.25cm}Noise$, our model prediction of $\hat{y}=4.467 + 3.05x_n$ is pretty close to the actual equation. remember that we will never be able to get it perfect because of the concept of irreducible error.

We can now make predictions using $\hat{\theta}$:

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict =
print("Prediction for X_new: \n", y_predict)
Prediction for X_new: 
 [[ 4.47949433]
plt.plot(X_new, y_predict, "r-", c="red")
plt.plot(X, y, "b.", c="lightgreen")

It is even easier to perform a Linear Regression in Scikit-Learn

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression(), y)
print("Linear Regression Intercept: \n", lin_reg.intercept_,"\n", "Linear Regression Coefficient: \n", lin_reg.coef_)
Linear Regression Intercept: 
 Linear Regression Coefficient: 

The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for “least squares”), which you could call directly:

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
print("Theta Best Single Value Decomposition: \n", theta_best_svd)
Theta Best Single Value Decomposition: 

This function computes $\hat{\theta}=X^+y$ where $X^+$ is the pseudoinverse of $X$ (specifically the Moore-Pearson inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly:

print("Pseudoinverse:\n", np.linalg.pinv(X_b).dot(y))

The pseudoinverse itself is computed using a standard matrix factorization technique called Singular Value Decompisition (SVD) that can decompose the training set matrix $X$ into the matrix multiplication of three matrices $U \hspace{.5cm} \Sigma \hspace{.5cm} V^T$. The pseudoinverse is computed as $X^+=V\Sigma^+U^T$. To compute the matrix $\Sigma^+$ the algorithm takes $\Sigma$ and sets to zero all values smaleer than a tiny threshold value, then it replaces all the non-zero values with their inverse, and finally it transposes the resulting matrix. This approach is more efficient than computing the Normal Equation.

Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. Gradient Descent is similar to being lost on a mountain in dense fog and can only feel the ground that is a couple feet in front of you. A good strategy to get to the bottom is to go down the steepest slope until you hit the valley. We are measuring the local gradient of the error function with regards to the parameter vector $\theta$, and it goes in the direction of descending gradient.

You start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually , taking one baby step at a time, with each step attempting to decrease the cost function, until the algorithm converges to a minimum. The learning rate is a parameter used to change the size of the steps in Gradient Descent model. Set it too small, and the algorithm will have to go through many iterations to converge, which will take a long time and a lot of your compute resources.

If you make the learning rate too large, you can “jump across the valley” and end up farther up on the curve than you were before.

These charts can be misleading when dealing with more complex models, as some cost functions have ridges, plateaus, holes, or other types of odd shapes, making it very difficult to converge to the minimum. You can end up converging on a local minimum, which will not be as good as a global minimum.

When dealing with Linear Regression we will not have to worry about irregular shapes, because we use MSE as the cost function, and MSE is a convex function, which means no two points have a line segment crossing the curve. We have to be careful in our search for parameters that minimize the cost function. It is a search for the model’s parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is. It is like searching for a needle in a 500 dimensional haystack.

Batch Gradient Descent

When implementing Gradient Descent you need to calculate the gradient of the cost function with regards to each model parameter $\theta_j$. Which means, you will need to calculate how much the cost function will change if you change $\theta_j$ a small bit. This is referred to as a partial derivative. In order to compute the partial derivative of a cost function with regards to parameter $\theta_j$, noted as $\frac{\partial}{\partial\theta_j}MSE(\theta)$:

$$ \begin{equation} \frac{\partial}{\partial\theta_j}MSE(\theta)=\frac{2}{m}\sum^m_{i=1}\big(\theta^Tx^{(i)}-y^{(i)}\big)x^{(i)}_j \end{equation} $$

There is a way to calculate all the partial derivatives together instead of individually. You can use the gradient vector $\nabla_\theta MSE(\theta)$, which contains all the partial derivatives of a cost function. It can be viewed as:

$$ \begin{equation} \nabla_\theta MSE(\theta)= \begin{pmatrix} \frac{\partial}{\partial\theta_0}MSE(\theta)\
\frac{\partial}{\partial\theta_n}MSE(\theta) \end{pmatrix} = \frac{2}{m}X^T(X\theta-y) \end{equation} $$

Once you have the gradient vector that points uphil, just go in the opposite direction, which means you need to subtract $\nabla_\theta MSE(\theta)$ from $\theta$. Remember that we need to establish a _learning rate_ ($\eta$). We need to multiply the gradient vector by $\eta$ to determine the size of the downhill slope. Gradient Descent Step can be viewed as:

$$ \begin{equation} \theta^{(next\hspace{.1cm}step)}=\theta-\eta \nabla_\theta MSE(\theta) \end{equation} $$

### Set our learning rate to 0.1
eta = 0.1
n_iterations = 1000
m = 100

### Random initialization
theta = np.random.randn(2,1)

for iteration in range(n_iterations):
    gradients = 2/m * - y)
    theta = theta - eta * gradients
print("Our Theta is:\n", theta)
Our Theta is:

Our theta at different learning rates:

  • Left - Learning rate is too low, it will take forever to reach the solution
  • Center - Learning rate is just right and was able to converge on the solution
  • Right - Learning rate is too high and blew past the solution, and is continuing to grow past it

You can use Grid Search to find a good learning rate.

You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops, but if it is too high, you will waste time while the model parameters do not change anymore. A simple solu‐ tion is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, when its norm becomes smaller than a tiny number $\epsilon$ (called the tolerance)—because this happens when Gradient Descent has (almost) reached the minimum

Stochastic Gradient Descent

The main issue with batch gradient descent, is that it uses the whol training set to compute the the gradients at every step, in turn, making it very slow when training large datasets. In that case, we can use Stochastic Gradient Descent which takes the opposite approach. Stochastic Gradient Descent picks a single instance at each step and then computes the gradient at that single instance. This makes the algorithm much faster, but also also more irregular at each step.

If the cost function is very irregular this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does. The randomness is good because it means that it can escape from local optima, but it can never settle at the minimum. You can try to reduce the learning rate gradually to try to avoid not being able to settle at the minimum. You can set a learning schedule to slow down the learning rate. You have to be careful because if you reduce the learning rate too quickly you can get stuck at the local minima. On the other hand, if it is reduced too slowly, you can end up jumping around the minima for a long time and end up with a poor solution.

### Each round will be called an epoch
n_epochs = 50
### Learning schedule hyperparameters
t0, t1 = 5, 50

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
print("Stochastic GD theta: \n", theta)
Stochastic GD theta: 
### Now lets use the Scikit-Learn Package
from sklearn.linear_model import SGDRegressor

### max_iter = epochs, tolerance = loss drops, eta0 = learning rate
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1), y.ravel())

print("SGD Intercept: \n", sgd_reg.intercept_, "\nSGD Coefficient: \n", sgd_reg.coef_)
SGD Intercept: 
SGD Coefficient: 

Mini-batch Gradient Descent

As you could probably guess by our previous sections, Mini-batch Gradient Descent is a combination of batch and stochastic gradient descent. It uses small samples at each step to calculate the gradient descent. We can compare the previous models as follows, keep in mind $m$ is the number of training instances, and $n$ is the number of features.

Polynomial Regression

If your data is more complex than a straight line, you can still use a linear model to fit non-linear data, by adding powers to each feature as new features, then train a model on these extended sets of features. This technique is called Polynomial Regression.

Let’s use a quadratic equation as an example:

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.rand(m, 1)

plt.scatter(X, y, c="lightgreen")

from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print("Value at index 0 of X: \n", X[0])
print("\nPolynomial Values: \n", X_poly[0])
Value at index 0 of X: 

Polynomial Values: 
 [-2.55883025  6.54761225]
### X_poly now contains the original feature of X plus the square of this feature. 
### Now you can fit a LinearRegression model to this extended training data.

lin_reg = LinearRegression(), y)
y_pred = lin_reg.predict(X_poly)
print("Intercept: \n", lin_reg.intercept_, "\nCoefficient: \n", lin_reg.coef_)

plt.scatter(X, y)
plt.scatter(X, y_pred)
 [[1.00320431 0.52111015]]

<matplotlib.collections.PathCollection at 0x1a1a59eed0>

Learning Curve

High-degree polynomials can do a much better job at fitting training data, but you have to be careful about overfitting the data. So how do you decide how complex your model is going to be? We have used cross-validation already, in the example housing project post, to see how well our model generalizes. Another way to determine model performance is by using a learning curve. A learning curve is a plot of the model’s performance on the training set and the validation set as a function of the training set size. To generate the plots, train the data several times on different sized subset of the training set.

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

When there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. So the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn’t make the average error much better or worse. Now let’s look at the performance of the model on the validation data. When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve.

### Train a 10th degree polynomial

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),

plot_learning_curves(polynomial_regression, X, y)

Regularized Linear Models

As mentioned in other posts, one way we can reduce over fitting is by regularizing the model, by constraining the degrees of freedom it has. We just showed an example of this when we added more polynomial degrees, we ended up overfitting the model.

Ridge Regression

Ridge Regression (Tikhonov Regression) is a regularized version of Linear Regression: a regularization term equal to $\alpha\sum_{i=1}^n=\theta_i^2$ is added to the cost function. This causes the model to fit the data and keep the weights as small as possible. The regularization term should only be added to the model during training. The hyperparameter $\alpha$ controls how much you want to regularize the model. For example, if $\alpha=0$ then _Ridge Regression_ is equal to _Linear Regression_, but if $\alpha$ is very large, then all weights will end up close to 0 and the result is a flatline running through the mean of the dataset. **_Ridge Regression Cost Function:_**

$$ \begin{equation} J(\theta)=MSE(\theta)+\alpha\frac{1}{2}\sum_{i=1}^n\theta_i^2 \end{equation} $$

The bias term $\theta_0$ is not regularized (the sum ($\sum$) starts at i = 1, not 0). If we define $w$ as the vector of feature weights ($\theta_1$ to $\theta_n$), then the regularization term is simply equal to $\frac{1}{2}( \lVert {w} \rVert _2)^2$, where $ \lVert w \rVert_2$ represents the $l_2$ norm of the weight vector.

It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features.

Ridge Regression closed-form solution:

$$ \begin{equation} \hat\theta = \big( X^TX+ \alpha A \big)^{-1} \hspace{.25cm} X^T \hspace{.25cm} y \end{equation} $$

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky"), y)

print("Ridge Regression: ", ridge_reg.predict([[1.5]]))

### Stochastic Gradient Descent
sgd_reg = SGDRegressor(penalty="l2"), y.ravel())

print("SGD: ", sgd_reg.predict([[1.5]]))
Ridge Regression:  [[5.72922032]]
SGD:  [5.70922821]

The penalty hyperparameter set the regularization that will be used, we set this as the “l2” norm, meaning the cost function equal to half the square of the “l2” norm of the weight vector.

Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (Lasso) is similar to Ridge Regression in that it uses a norm of the weight vector. Where Ridge uses the “l2” norm, Lasso uses the “l1” norm. The Lasso function is:

$$ \begin{equation} J(\theta)=MSE(\theta)+\alpha\sum_{i=1}^n\big\lvert{\theta_i}\big\rvert \end{equation} $$

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1), y)
print("Lasso: ", lasso_reg.predict([[1.5]]))
Lasso:  [5.68930416]

Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio $r$. When $r = 0$, Elastic Net is equivalent to Ridge Regression, and when $r = 1$, it is equivalent to Lasso Regression.

$$ \begin{equation} J(\theta)=MSE(\theta)+ r\alpha\sum_{i=1}^n\big\lvert\theta_i\big\rvert+\frac{1-r}{2}\alpha\sum_{i=1}^n\theta_i^2 \end{equation} $$

So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should pre‐ fer Lasso or Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5), y)
print("Elastic Net: ", elastic_net.predict([[1.5]]))
Elastic Net:  [5.68874853]

Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.

from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                      penalty=None, learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):, y_train)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)

Logistic Regression

Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.

Estimating Probabilities

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result. Logistic Regression function:

$$ \begin{equation} \hat{p}=h_\theta(x) = \sigma\big(x^T\theta\big) \end{equation} $$

The logistic—noted $\sigma(\cdot)$—is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. Logistic function:

$$ \begin{equation} \sigma(t)=\frac{1}{1+exp(-t)} \end{equation} $$

Model Prediction:

$$ \begin{equation} \hat{y}=\begin{cases} 0 \hspace{.2cm} if \hspace{.2cm} \hat{p} < 0.5\
1 \hspace{.2cm} if \hspace{.2cm} \hat{p} >= 0.5\
\end{cases} \end{equation} $$

The objective of training is to set the parameter vector $\theta$ so that the model estimates high probabilities for positive instances $(y = 1)$ and low probabilities for negative instances $(y = 0)$. The cost function is:

$$ \begin{equation} c(\theta)=\begin{cases} -log(\hat{p}) \hspace{.2cm} if \hspace{.2cm} y=1 \
-log(1-\hat{p}) \hspace{.2cm} if \hspace{.2cm} y=0 \end{cases} \end{equation} $$

This cost function makes sense because $–log(t)$ grows very large when $t$ approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. On the other hand, $– log(t)$ is close to 0 when $t$ is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.

Decision Boundaries

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature.

from sklearn import datasets

iris = datasets.load_iris()
print("Iris Keys: ", list(iris.keys()))

X = iris["data"][:, 3:] ### petal width
y = (iris["target"] == 2).astype(

### Training a Logistic Regression Model

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(), y)

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginca")
Iris Keys:  ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

<matplotlib.legend.Legend at 0x1a1cc1f190>

The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that there is a bit of overlap. Above about 2 cm the classifier is highly confident that the flower is an Iris- Virginica (it outputs a high probability to that class), while below 1 cm it is highly confident that it is not an Iris-Virginica (high probability for the “Not Iris-Virginica” class).

log_reg.predict([[1.7], [1.5]])
array([1, 0])

Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers, this is called Softmax Regression, or Multinomial Logistic Regression. When given an instance of $x$, the softmax model first computes a score of $s_k(x)$ for each class of $k$, then estimates the probability of each class by applying the softmax function (aka normalized exponential) to the scores. Softmax scores for class of $k$:

$$ \begin{equation} s_k(x)=x^T\theta^{(k)} \end{equation} $$

Each class has its own parameter vector $\theta^{(k)}$ and all these vectors are typically stored as rows in a parameter matrix $\Theta$. Once you completed the score of each you can estimate each $\hat{P}_k$ by running it through the softmax function:

$$ \begin{equation} \hat{P}_k = \sigma(s(x))k=\frac{exp\big(s_k(x)\big)}{\sum{j=1}^Kexp\big(s_j(x)\big)} \end{equation} $$

  • $K$ is the number of classes
  • $s(x)$ is a vector containing the scores of each class for the instance $x$
  • $\sigma(s(x))_k$ is the estimated probability that the instance $x$ belongs to class $k$ given the scores of each class for that instance
X = iris["data"][:, (2, 3)]
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10), y)

print("Softmax Prediction: ", softmax_reg.predict([[5, 2]]))
print("softmax Prediciton: ", softmax_reg.predict_proba([[5, 2]]))
Softmax Prediction:  [2]
softmax Prediciton:  [[6.38014896e-07 5.74929995e-02 9.42506362e-01]]