Added a "Getting Started with TensorFlow for ML Beginners" chapter to Get
Started section. PiperOrigin-RevId: 181396430
This commit is contained in: parent 411f8bcff6, commit cf3fb6bc1d

tensorflow/docs_src/get_started/get_started_for_beginners.md (new file, 732 lines)

# Getting Started for ML Beginners

This document explains how to use machine learning to classify (categorize)
Iris flowers by species. It dives into the TensorFlow code that does
exactly that, explaining ML fundamentals along the way.

If the following list describes you, then you are in the right place:

* You know little to nothing about machine learning.
* You want to learn how to write TensorFlow programs.
* You can code (at least a little) in Python.

If you are already familiar with basic machine learning concepts
but are new to TensorFlow, read
@{$premade_estimators$Getting Started with TensorFlow: for ML Experts}.

## The Iris classification problem

Imagine you are a botanist seeking an automated way to classify each
Iris flower you find. Machine learning provides many ways to classify flowers.
For instance, a sophisticated machine learning program could classify flowers
based on photographs. Our ambitions are more modest--we're going to classify
Iris flowers based solely on the length and width of their
[sepals](https://en.wikipedia.org/wiki/Sepal) and
[petals](https://en.wikipedia.org/wiki/Petal).

The Iris genus comprises about 300 species, but our program will classify only
the following three:

* Iris setosa
* Iris virginica
* Iris versicolor

<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%"
  alt="Petal geometry compared for three iris species: Iris setosa, Iris virginica, and Iris versicolor"
  src="../images/iris_three_species.jpg">
</div>

**From left to right,
[*Iris setosa*](https://commons.wikimedia.org/w/index.php?curid=170298) (by
[Radomil](https://commons.wikimedia.org/wiki/User:Radomil), CC BY-SA 3.0),
[*Iris versicolor*](https://commons.wikimedia.org/w/index.php?curid=248095) (by
[Dlanglois](https://commons.wikimedia.org/wiki/User:Dlanglois), CC BY-SA 3.0),
and [*Iris virginica*](https://www.flickr.com/photos/33397993@N05/3352169862)
(by [Frank Mayfield](https://www.flickr.com/photos/33397993@N05), CC BY-SA
2.0).**
<p> </p>

Fortunately, someone has already created [a data set of 120 Iris
flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set)
with the sepal and petal measurements. This data set has become
one of the canonical introductions to machine learning classification problems.
(The [MNIST database](https://en.wikipedia.org/wiki/MNIST_database),
which contains handwritten digits, is another popular classification
problem.) The first 5 entries of the Iris data set
look as follows:

| Sepal length | Sepal width | Petal length | Petal width | Species |
| ---          | ---         | ---          | ---         | ---     |
| 6.4          | 2.8         | 5.6          | 2.2         | 2       |
| 5.0          | 2.3         | 3.3          | 1.0         | 1       |
| 4.9          | 2.5         | 4.5          | 1.7         | 2       |
| 4.9          | 3.1         | 1.5          | 0.1         | 0       |
| 5.7          | 3.8         | 1.7          | 0.3         | 0       |

Let's introduce some terms:

* The last column (species) is called the
  [**label**](https://developers.google.com/machine-learning/glossary/#label);
  the first four columns are called
  [**features**](https://developers.google.com/machine-learning/glossary/#feature).
  Features are characteristics of an example, while the label is
  the thing we're trying to predict.

* An [**example**](https://developers.google.com/machine-learning/glossary/#example)
  consists of the set of features and the label for one sample
  flower. The preceding table shows 5 examples from a data set of
  120 examples.

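To make these terms concrete, here is the first row of the preceding table
expressed as a single example; the snippet is purely illustrative and is not
part of `premade_estimator.py`:

```python
# One example: a dictionary of four features plus one label.
features = {'SepalLength': 6.4,
            'SepalWidth': 2.8,
            'PetalLength': 5.6,
            'PetalWidth': 2.2}
label = 2  # the numeric species code, explained next
```
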
Each label is naturally a string (for example, "setosa"), but machine learning
typically relies on numeric values. Therefore, someone mapped each string to
a number. Here's the representation scheme:

* 0 represents setosa
* 1 represents versicolor
* 2 represents virginica

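The sample keeps this mapping in a plain list so that a numeric label can be
turned back into a species name when printing results. A minimal sketch,
assuming the ordering shown above (the sample's `iris_data.py` defines a
similar list):

```python
# Index i holds the species name for numeric label i.
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

print(SPECIES[2])  # prints: Virginica
```
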
## Models and training

A **model** is the relationship between features
and the label. For the Iris problem, the model defines the relationship
between the sepal and petal measurements and the Iris species.
Some simple models can be described with a few lines of algebra;
more complex machine learning models
contain such a large number of interlacing mathematical functions and
parameters that they become hard to summarize mathematically.

Could you determine the relationship between the four features and the
Iris species *without* using machine learning? That is, could you use
traditional programming techniques (for example, a lot of conditional
statements) to create a model? Maybe. You could play with the data set
long enough to determine the right relationships of petal and sepal
measurements to particular species. However, a good machine learning
approach *determines the model for you*. That is, if you feed enough
representative examples into the right machine learning model type, the program
will determine the relationship between sepals, petals, and species.

**Training** is the stage of machine learning in which the model is
gradually optimized (learned). The Iris problem is an example
of [**supervised machine
learning**](https://developers.google.com/machine-learning/glossary/#supervised_machine_learning)
in which a model is trained from examples that contain labels. (In
[**unsupervised machine
learning**](https://developers.google.com/machine-learning/glossary/#unsupervised_machine_learning),
the examples don't contain labels. Instead, the model typically finds
patterns among the features.)

## Get the sample program

Prior to playing with the sample code in this document, do the following:

1. @{$install$Install TensorFlow}.
2. If you installed TensorFlow with virtualenv or Anaconda, activate your
   TensorFlow environment.
3. Install or upgrade pandas by issuing the following command:

   `pip install pandas`

Take the following steps to get the sample program:

1. Clone the TensorFlow Models repository from GitHub by entering the following
   command:

   `git clone https://github.com/tensorflow/models`

2. Change directory within that repository to the location containing the
   examples used in this document:

   `cd models/samples/core/get_started/`

In that `get_started` directory, you'll find a program
named `premade_estimator.py`.

## Run the sample program

You run TensorFlow programs as you would run any Python program. Therefore,
issue the following command from a command line to
run `premade_estimator.py`:

``` bash
python premade_estimator.py
```

Running the program should output a whole bunch of information ending with
three prediction lines like the following:

```None
...
Prediction is "Setosa" (99.6%), expected "Setosa"

Prediction is "Versicolor" (99.8%), expected "Versicolor"

Prediction is "Virginica" (97.9%), expected "Virginica"
```

If the program generates errors instead of predictions, ask yourself the
following questions:

* Did you install TensorFlow properly?
* Are you using the correct version of TensorFlow? The `premade_estimator.py`
  program requires at least TensorFlow v1.4. (A quick way to check the
  installed version is shown below.)
* If you installed TensorFlow with virtualenv or Anaconda, did you activate
  the environment?

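To check which version of TensorFlow is active, you can run a quick one-liner
with the same Python interpreter you use for the sample:

```python
import tensorflow as tf

# Prints the installed TensorFlow version, for example "1.4.0".
print(tf.__version__)
```
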
## The TensorFlow programming stack

As the following illustration shows, TensorFlow
provides a programming stack consisting of multiple API layers:

<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/tensorflow_programming_environment.png">
</div>

**The TensorFlow Programming Environment.**
<p> </p>

As you start writing TensorFlow programs, we strongly recommend focusing on
the following two high-level APIs:

* Estimators
* Datasets

Although we'll grab an occasional convenience function from other APIs,
this document focuses on the preceding two APIs.

## The program itself

Thanks for your patience; let's dig into the code.
The general outline of `premade_estimator.py`--and many other TensorFlow
programs--is as follows:

* Import and parse the data sets.
* Create feature columns to describe the data.
* Select the type of model.
* Train the model.
* Evaluate the model's effectiveness.
* Let the trained model make predictions.

The following subsections detail each part.

### Import and parse the data sets

The Iris program requires the data from the following two .csv files:

* `http://download.tensorflow.org/data/iris_training.csv`, which contains
  the training set.
* `http://download.tensorflow.org/data/iris_test.csv`, which contains
  the test set.

The **training set** contains the examples that we'll use to train the model;
the **test set** contains the examples that we'll use to evaluate the trained
model's effectiveness.

The training set and test set started out as a
single data set. Then, someone split the examples, with the majority going into
the training set and the remainder going into the test set. Adding
examples to the training set usually builds a better model; however, adding
more examples to the test set enables us to better gauge the model's
effectiveness. Regardless of the split, the examples in the test set
must be separate from the examples in the training set. Otherwise, you can't
accurately determine the model's effectiveness.

The `premade_estimator.py` program relies on the `load_data` function
in the adjacent [`iris_data.py`](
https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py)
file to read in and parse the training set and test set.
Here is a heavily commented version of the function:

```python
TRAIN_URL = "http://download.tensorflow.org/data/iris_training.csv"
TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
                    'PetalLength', 'PetalWidth', 'Species']

...

def load_data(label_name='Species'):
    """Parses the csv file in TRAIN_URL and TEST_URL."""

    # Create a local copy of the training set.
    train_path = tf.keras.utils.get_file(fname=TRAIN_URL.split('/')[-1],
                                         origin=TRAIN_URL)
    # train_path now holds the pathname: ~/.keras/datasets/iris_training.csv

    # Parse the local CSV file.
    train = pd.read_csv(filepath_or_buffer=train_path,
                        names=CSV_COLUMN_NAMES,  # list of column names
                        header=0  # ignore the first row of the CSV file.
                       )
    # train now holds a pandas DataFrame, which is a data structure
    # analogous to a table.

    # 1. Assign the DataFrame's labels (the right-most column) to train_label.
    # 2. Delete (pop) the labels from the DataFrame.
    # 3. Assign the remainder of the DataFrame to train_features.
    train_features, train_label = train, train.pop(label_name)

    # Apply the preceding logic to the test set.
    test_path = tf.keras.utils.get_file(TEST_URL.split('/')[-1], TEST_URL)
    test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
    test_features, test_label = test, test.pop(label_name)

    # Return four DataFrames.
    return (train_features, train_label), (test_features, test_label)
```

Keras is an open-source machine learning library; `tf.keras` is a TensorFlow
implementation of Keras. The `premade_estimator.py` program only accesses
one `tf.keras` function; namely, the `tf.keras.utils.get_file` convenience
function, which copies a remote CSV file to a local file system.

The call to `load_data` returns two `(feature, label)` pairs, for the training
and test sets respectively:

```python
# Call load_data() to parse the CSV file.
(train_feature, train_label), (test_feature, test_label) = load_data()
```

Pandas is an open-source Python library leveraged by several
TensorFlow functions. A pandas
[**DataFrame**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
is a table with named column headers and numbered rows.
The features returned by `load_data` are packed in `DataFrames`.
For example, the `test_feature` DataFrame looks as follows:

```none
    SepalLength  SepalWidth  PetalLength  PetalWidth
0           5.9         3.0          4.2         1.5
1           6.9         3.1          5.4         2.1
2           5.1         3.3          1.7         0.5
...
27          6.7         3.1          4.7         1.5
28          6.7         3.3          5.7         2.5
29          6.4         2.9          4.3         1.3
```

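The `load_data` function shown earlier leans on `DataFrame.pop` to split the
label column off from the features. A tiny standalone example, purely
illustrative and not part of the sample code:

```python
import pandas as pd

# A two-example DataFrame with one feature column and a 'Species' label column.
df = pd.DataFrame({'SepalLength': [6.4, 5.0], 'Species': [2, 1]})

# pop() removes the 'Species' column from df and returns it as a Series.
labels = df.pop('Species')

print(list(df.columns))   # ['SepalLength'] -- the label column is gone
print(list(labels))       # [2, 1]
```
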
### Describe the data

A **feature column** is a data structure that tells your model
how to interpret the data in each feature. In the Iris problem,
we want the model to interpret the data in each
feature as its literal floating-point value; that is, we want the
model to interpret an input value like 5.4 as, well, 5.4. However,
in other machine learning problems, it is often desirable to interpret
data less literally. Using feature columns to
interpret data is such a rich topic that we devote an entire
@{$feature_columns$document} to it.

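To give one taste of interpreting data "less literally": a feature column can
bucketize a number into ranges rather than feeding the raw value to the model.
The following sketch is not used by `premade_estimator.py`; it only hints at
what feature columns can do:

```python
import tensorflow as tf

# Start from a plain numeric column...
petal_length = tf.feature_column.numeric_column(key='PetalLength')

# ...then bucketize it into four ranges: (-inf, 2), [2, 4), [4, 6), [6, +inf).
petal_length_buckets = tf.feature_column.bucketized_column(
    source_column=petal_length,
    boundaries=[2.0, 4.0, 6.0])
```
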
From a code perspective, you build a list of `feature_column` objects by calling
functions from the @{tf.feature_column} module. Each object describes an input
to the model. To tell the model to interpret data as a floating-point value,
call @{tf.feature_column.numeric_column}. In `premade_estimator.py`, all
four features should be interpreted as literal floating-point values, so
the code to create a feature column looks as follows:

```python
# Create feature columns for all features.
my_feature_columns = []
for key in train_x.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
```

Here is a less elegant, but possibly clearer, alternative way to
encode the preceding block:

```python
my_feature_columns = [
    tf.feature_column.numeric_column(key='SepalLength'),
    tf.feature_column.numeric_column(key='SepalWidth'),
    tf.feature_column.numeric_column(key='PetalLength'),
    tf.feature_column.numeric_column(key='PetalWidth')
]
```

### Select the type of model

We need to select the kind of model that will be trained.
Lots of model types exist; picking the ideal type takes experience.
We've selected a neural network to solve the Iris problem. [**Neural
networks**](https://developers.google.com/machine-learning/glossary/#neural_network)
can find complex relationships between features and the label.
A neural network is a highly-structured graph, organized into one or more
[**hidden layers**](https://developers.google.com/machine-learning/glossary/#hidden_layer).
Each hidden layer consists of one or more
[**neurons**](https://developers.google.com/machine-learning/glossary/#neuron).
There are several categories of neural networks.
We'll be using a [**fully connected neural
network**](https://developers.google.com/machine-learning/glossary/#fully_connected_layer),
which means that the neurons in one layer take inputs from *every* neuron in
the previous layer. For example, the following figure illustrates a
fully connected neural network consisting of three hidden layers:

* The first hidden layer contains four neurons.
* The second hidden layer contains three neurons.
* The third hidden layer contains two neurons.

<div style="margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/simple_dnn.svg">
</div>

**A neural network with three hidden layers.**
<p> </p>

To specify a model type, instantiate an
[**Estimator**](https://developers.google.com/machine-learning/glossary/#Estimators)
class. TensorFlow provides two categories of Estimators:

* [**pre-made
  Estimators**](https://developers.google.com/machine-learning/glossary/#pre-made_Estimator),
  which someone else has already written for you.
* [**custom
  Estimators**](https://developers.google.com/machine-learning/glossary/#custom_estimator),
  which you must code yourself, at least partially.

To implement a neural network, the `premade_estimator.py` program uses
a pre-made Estimator named @{tf.estimator.DNNClassifier}. This Estimator
builds a neural network that classifies examples. The following call
instantiates `DNNClassifier`:

```python
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3)
```

Use the `hidden_units` parameter to define the number of neurons
in each hidden layer of the neural network. Assign this parameter
a list. For example:

```python
hidden_units=[10, 10],
```

The length of the list assigned to `hidden_units` identifies the number of
hidden layers (2, in this case).
Each value in the list represents the number of neurons in a particular
hidden layer (10 in the first hidden layer and 10 in the second hidden layer).
To change the number of hidden layers or neurons, simply assign a different
list to the `hidden_units` parameter.

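For instance, a hypothetical three-layer configuration (not used by the
sample) would look like this:

```python
hidden_units=[20, 30, 10],  # three hidden layers with 20, 30, and 10 neurons
```
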
The ideal number of hidden layers and neurons depends on the problem
and the data set. Like many aspects of machine learning,
picking the ideal shape of the neural network requires some mixture
of knowledge and experimentation.
As a rule of thumb, increasing the number of hidden layers and neurons
*typically* creates a more powerful model, which requires more data to
train effectively.

The `n_classes` parameter specifies the number of possible values that the
neural network can predict. Since the Iris problem classifies 3 Iris species,
we set `n_classes` to 3.

The constructor for `tf.estimator.DNNClassifier` takes an optional argument
named `optimizer`, which our sample code chose not to specify. The
[**optimizer**](https://developers.google.com/machine-learning/glossary/#optimizer)
controls how the model will train. As you develop more expertise in machine
learning, optimizers and
[**learning
rate**](https://developers.google.com/machine-learning/glossary/#learning_rate)
will become very important.

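If you did want to control the optimizer yourself, you could pass one in
explicitly. A minimal sketch, assuming Adagrad with a hand-picked learning
rate (these particular choices are illustrative, not the sample's defaults):

```python
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    # Explicitly choose an optimizer and learning rate instead of the default.
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.1))
```
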
### Train the model

Instantiating a `tf.estimator.DNNClassifier` creates a framework for learning
the model. Basically, we've wired a network but haven't yet let data flow
through it. To train the neural network, call the Estimator object's `train`
method. For example:

```python
classifier.train(
    input_fn=lambda:train_input_fn(train_feature, train_label, args.batch_size),
    steps=args.train_steps)
```

The `steps` argument tells `train` to stop training after the specified
number of iterations. Increasing `steps` increases the amount of time
the model will train. Counter-intuitively, training a model longer
does not guarantee a better model. The default value of `args.train_steps`
is 1000. The number of steps to train is a
[**hyperparameter**](https://developers.google.com/machine-learning/glossary/#hyperparameter)
you can tune. Choosing the right number of steps usually
requires both experience and experimentation.

The `input_fn` parameter identifies the function that supplies the
training data. The call to the `train` method indicates that the
`train_input_fn` function will supply the training data. Here's that
function's signature:

```python
def train_input_fn(features, labels, batch_size):
```

We're passing the following arguments to `train_input_fn`:

* `train_feature` is a Python dictionary in which:
    * Each key is the name of a feature.
    * Each value is an array containing the values for each example in the
      training set.
* `train_label` is an array containing the values of the label for every
  example in the training set.
* `args.batch_size` is an integer defining the [**batch
  size**](https://developers.google.com/machine-learning/glossary/#batch_size).

The `train_input_fn` function relies on the **Dataset API**. This is a
high-level TensorFlow API for reading data and transforming it into a form
that the `train` method requires. The following call converts the
input features and labels into a `tf.data.Dataset` object, which is the base
class of the Dataset API:

```python
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
```

The `tf.data.Dataset` class provides many useful methods for preparing examples
for training. The following line calls three of those methods:

```python
dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size)
```

Training works best if the training examples are in
random order. To randomize the examples, call
`tf.data.Dataset.shuffle`. Setting the `buffer_size` to a value
larger than the number of examples (120) ensures that the data will
be well shuffled.

During training, the `train` method typically processes the
examples multiple times. Calling the
`tf.data.Dataset.repeat` method without any arguments ensures
that the `train` method has an infinite supply of (now shuffled)
training set examples.

The `train` method processes a
[**batch**](https://developers.google.com/machine-learning/glossary/#batch)
of examples at a time.
The `tf.data.Dataset.batch` method creates a batch by
concatenating multiple examples.
This program sets the default [**batch
size**](https://developers.google.com/machine-learning/glossary/#batch_size)
to 100, meaning that the `batch` method will concatenate groups of
100 examples. The ideal batch size depends on the problem. As a rule
of thumb, smaller batch sizes usually enable the `train` method to train
the model faster at the expense (sometimes) of accuracy.

The following `return` statement passes a batch of examples back to
the caller (the `train` method).

```python
return dataset.make_one_shot_iterator().get_next()
```

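Putting those pieces together, `train_input_fn` looks roughly like the
following; this is a condensed sketch of the function in `iris_data.py`, not a
verbatim copy:

```python
def train_input_fn(features, labels, batch_size):
    """An input function for training."""
    # Convert the pandas features/labels into a tf.data.Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat forever, and batch the examples.
    dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size)

    # Return one batch of (features, labels) at a time to the caller.
    return dataset.make_one_shot_iterator().get_next()
```
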
### Evaluate the model

**Evaluating** means determining how effectively the model makes
predictions. To determine the Iris classification model's effectiveness,
pass some sepal and petal measurements to the model and ask the model
to predict what Iris species they represent. Then compare the model's
prediction against the actual label. For example, a model that picked
the correct species on half the input examples would have an
[accuracy](https://developers.google.com/machine-learning/glossary/#accuracy)
of 0.5. The following suggests a more effective model:

<table>
  <tr>
    <th style="background-color:darkblue" colspan="6">Test Set</th>
  </tr>
  <tr>
    <th colspan="4">Features</th>
    <th colspan="1">Label</th>
    <th colspan="1">Prediction</th>
  </tr>
  <tr> <td>5.9</td> <td>3.0</td> <td>4.3</td> <td>1.5</td> <td>1</td>
       <td style="background-color:green">1</td></tr>
  <tr> <td>6.9</td> <td>3.1</td> <td>5.4</td> <td>2.1</td> <td>2</td>
       <td style="background-color:green">2</td></tr>
  <tr> <td>5.1</td> <td>3.3</td> <td>1.7</td> <td>0.5</td> <td>0</td>
       <td style="background-color:green">0</td></tr>
  <tr> <td>6.0</td> <td>3.4</td> <td>4.5</td> <td>1.6</td> <td>1</td>
       <td style="background-color:red">2</td></tr>
  <tr> <td>5.5</td> <td>2.5</td> <td>4.0</td> <td>1.3</td> <td>1</td>
       <td style="background-color:green">1</td></tr>
</table>

**A model that is 80% accurate.**
<p> </p>

To evaluate a model's effectiveness, each Estimator provides an `evaluate`
method. The `premade_estimator.py` program calls `evaluate` as follows:

```python
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(test_x, test_y, args.batch_size))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
```

The call to `classifier.evaluate` is similar to the call to `classifier.train`.
The biggest difference is that `classifier.evaluate` must get its examples
from the test set rather than the training set. In other words, to
fairly assess a model's effectiveness, the examples used to
*evaluate* a model must be different from the examples used to *train*
the model. The `eval_input_fn` function serves a batch of examples from
the test set. Here's the `eval_input_fn` function:

```python
def eval_input_fn(features, labels=None, batch_size=None):
    """An input function for evaluation or prediction"""
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a tf.data.Dataset object.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples.
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the read end of the pipeline.
    return dataset.make_one_shot_iterator().get_next()
```

In brief, `eval_input_fn` does the following when called by
`classifier.evaluate`:

1. Converts the features and labels from the test set to a `tf.data.Dataset`
   object.
2. Creates a batch of test set examples. (There's no need to shuffle
   or repeat the test set examples.)
3. Returns that batch of test set examples to `classifier.evaluate`.

Running this code yields the following output (or something close to it):

```none
Test set accuracy: 0.967
```

An accuracy of 0.967 implies that our trained model correctly classified 29
out of the 30 examples in the test set.

### Predicting

We've now trained a model and "proven" that it is good--but not
perfect--at classifying Iris species. Now let's use the trained
model to make some predictions on [**unlabeled
examples**](https://developers.google.com/machine-learning/glossary/#unlabeled_example);
that is, on examples that contain features but not a label.

In real life, the unlabeled examples could come from lots of different
sources including apps, CSV files, and data feeds. For now, we're simply
going to manually provide the following three unlabeled examples:

```python
predict_x = {
    'SepalLength': [5.1, 5.9, 6.9],
    'SepalWidth': [3.3, 3.0, 3.1],
    'PetalLength': [1.7, 4.2, 5.4],
    'PetalWidth': [0.5, 1.5, 2.1],
}
```

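The reporting loop shown later compares each prediction against a list of
expected species for these three examples; in the sample it is roughly:

```python
# The species we expect the model to predict for the three examples above.
expected = ['Setosa', 'Versicolor', 'Virginica']
```
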
Every Estimator provides a `predict` method, which `premade_estimator.py`
calls as follows:

```python
predictions = classifier.predict(
    input_fn=lambda:eval_input_fn(predict_x, batch_size=args.batch_size))
```

As with the `evaluate` method, our `predict` call also gathers its examples
from the `eval_input_fn` function.

When doing predictions, we're *not* passing labels to `eval_input_fn`.
Therefore, `eval_input_fn` does the following:

1. Converts the features from the 3-element manual set we just created
   to a `tf.data.Dataset` object.
2. Creates a batch of 3 examples from that manual set.
3. Returns that batch of examples to `classifier.predict`.

The `predict` method returns a Python iterable, yielding a dictionary of
prediction results for each example. This dictionary contains several keys.
The `probabilities` key holds a list of three floating-point values,
each representing the probability that the input example is a particular
Iris species. For example, consider the following `probabilities` list:

```none
'probabilities': array([ 1.19127117e-08, 3.97069454e-02, 9.60292995e-01])
```

The preceding list indicates:

* A negligible chance of the Iris being Setosa.
* A 3.97% chance of the Iris being Versicolor.
* A 96.0% chance of the Iris being Virginica.

The `class_ids` key holds a one-element array that identifies the most
probable species. For example:

```none
'class_ids': array([2])
```

The number `2` corresponds to Virginica. The following code iterates
through the returned `predictions` to report on each prediction:

``` python
for pred_dict, expec in zip(predictions, expected):
    template = ('\nPrediction is "{}" ({:.1f}%), expected "{}"')

    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]
    print(template.format(SPECIES[class_id], 100 * probability, expec))
```

Running the program yields the following output:

``` None
...
Prediction is "Setosa" (99.6%), expected "Setosa"

Prediction is "Versicolor" (99.8%), expected "Versicolor"

Prediction is "Virginica" (97.9%), expected "Virginica"
```

## Summary

<!--TODO(barryr): When MLCC is released, add pointers to relevant sections.-->
This document provides a short introduction to machine learning.

Because `premade_estimator.py` relies on high-level APIs, much of the
mathematical complexity in machine learning is hidden.
If you intend to become more proficient in machine learning, we recommend
ultimately learning more about [**gradient
descent**](https://developers.google.com/machine-learning/glossary/#gradient_descent),
batching, and neural networks.

We recommend reading the @{$feature_columns$Feature Columns} document next,
which explains how to represent different kinds of data in machine learning.