Reformat markdown.
Change: 138541907 (cherry picked)
parent f413209512
commit a9e21bc20a
@@ -436,35 +436,35 @@ you a desirable model size.

Finally, let's take a minute to talk about what the Logistic Regression model
actually looks like in case you're not already familiar with it. We'll denote
-the label as $$Y$$, and the set of observed features as a feature vector
-$$\mathbf{x}=[x_1, x_2, ..., x_d]$$. We define $$Y=1$$ if an individual earned >
-50,000 dollars and $$Y=0$$ otherwise. In Logistic Regression, the probability of
-the label being positive ($$Y=1$$) given the features $$\mathbf{x}$$ is given
+the label as \\(Y\\), and the set of observed features as a feature vector
+\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual earned >
+50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the probability of
+the label being positive (\\(Y=1\\)) given the features \\(\mathbf{x}\\) is given
as:

$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$

-where $$\mathbf{w}=[w_1, w_2, ..., w_d]$$ are the model weights for the features
-$$\mathbf{x}=[x_1, x_2, ..., x_d]$$. $$b$$ is a constant that is often called
+where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the features
+\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is often called
the **bias** of the model. The equation consists of two parts—A linear model and
a logistic function:

-* **Linear Model**: First, we can see that $$\mathbf{w}^T\mathbf{x}+b = b +
-w_1x_1 + ... +w_dx_d$$ is a linear model where the output is a linear
-function of the input features $$\mathbf{x}$$. The bias $$b$$ is the
+* **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
+w_1x_1 + ... +w_dx_d\\) is a linear model where the output is a linear
+function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
prediction one would make without observing any features. The model weight
-$$w_i$$ reflects how the feature $$x_i$$ is correlated with the positive
-label. If $$x_i$$ is positively correlated with the positive label, the
-weight $$w_i$$ increases, and the probability $$P(Y=1|\mathbf{x})$$ will be
-closer to 1. On the other hand, if $$x_i$$ is negatively correlated with the
-positive label, then the weight $$w_i$$ decreases and the probability
-$$P(Y=1|\mathbf{x})$$ will be closer to 0.
+\\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
+label. If \\(x_i\\) is positively correlated with the positive label, the
+weight \\(w_i\\) increases, and the probability \\(P(Y=1|\mathbf{x})\\) will be
+closer to 1. On the other hand, if \\(x_i\\) is negatively correlated with the
+positive label, then the weight \\(w_i\\) decreases and the probability
+\\(P(Y=1|\mathbf{x})\\) will be closer to 0.

* **Logistic Function**: Second, we can see that there's a logistic function
-(also known as the sigmoid function) $$S(t) = 1/(1+\exp(-t))$$ being applied
+(also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being applied
to the linear model. The logistic function is used to convert the output of
-the linear model $$\mathbf{w}^T\mathbf{x}+b$$ from any real number into the
-range of $$[0, 1]$$, which can be interpreted as a probability.
+the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real number into the
+range of \\([0, 1]\\), which can be interpreted as a probability.

Model training is an optimization problem: The goal is to find a set of model
weights (i.e. model parameters) to minimize a **loss function** defined over the
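(Editorial aside, not part of the diff above: a minimal NumPy sketch of the probability computation this hunk describes. The helper name `predict_proba` and the toy weights, bias, and feature values are invented for illustration.)

```python
import numpy as np

def predict_proba(x, w, b):
    """Return P(Y=1 | x) = 1 / (1 + exp(-(w^T x + b))) for one example."""
    logit = np.dot(w, x) + b              # linear model: b + w_1*x_1 + ... + w_d*x_d
    return 1.0 / (1.0 + np.exp(-logit))   # logistic (sigmoid) function maps it into [0, 1]

# Toy example with d = 3 features (all values hypothetical).
x = np.array([1.0, 0.5, -2.0])
w = np.array([0.8, -0.3, 0.1])
b = -0.2

print(predict_proba(x, w, b))             # a probability between 0 and 1
print(predict_proba(np.zeros(3), w, b))   # bias-only prediction when no features are observed
```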
@@ -157,8 +157,8 @@ The higher the `dimension` of the embedding is, the more degrees of freedom the
model will have to learn the representations of the features. For simplicity, we
set the dimension to 8 for all feature columns here. Empirically, a more
informed decision for the number of dimensions is to start with a value on the
-order of $$k\log_2(n)$$ or $$k\sqrt[4]n$$, where $$n$$ is the number of unique
-features in a feature column and $$k$$ is a small constant (usually smaller than
+order of \\(k\log_2(n)\\) or \\(k\sqrt[4]n\\), where \\(n\\) is the number of unique
+features in a feature column and \\(k\\) is a small constant (usually smaller than
10).

Through dense embeddings, deep models can generalize better and make predictions
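(Editorial aside, not part of the diff: a small Python sketch of the embedding-dimension rule of thumb from this hunk. The helper name `suggested_embedding_dims` and the choice k=4 are hypothetical.)

```python
import math

def suggested_embedding_dims(n_unique, k=4):
    """Two rule-of-thumb starting points for an embedding dimension,
    given n_unique distinct feature values and a small constant k (< 10):
    k * log2(n) and k * n ** (1/4)."""
    return math.ceil(k * math.log2(n_unique)), math.ceil(k * n_unique ** 0.25)

# E.g., for a categorical column with 1000 unique values and k = 4:
print(suggested_embedding_dims(1000))   # roughly (40, 23)
```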