From f37d0ea47b0b4cf0e66676702f48800e03d18655 Mon Sep 17 00:00:00 2001
From: "A. Unique TensorFlower"
Date: Fri, 2 Jun 2017 16:53:15 -0700
Subject: [PATCH] Internal change -- first draft docs

PiperOrigin-RevId: 157891937
---
 .../docs_src/programmers_guide/embedding.md | 352 ++++++++++++++++++
 1 file changed, 352 insertions(+)
 create mode 100644 tensorflow/docs_src/programmers_guide/embedding.md

diff --git a/tensorflow/docs_src/programmers_guide/embedding.md b/tensorflow/docs_src/programmers_guide/embedding.md
new file mode 100644
index 00000000000..975850349f0
--- /dev/null
+++ b/tensorflow/docs_src/programmers_guide/embedding.md
@@ -0,0 +1,352 @@
+# Embeddings
+
+[TOC]
+
+## Introduction
+
+An embedding is a mapping from discrete objects, such as words, to vectors of
+real numbers. For example, a 300-dimensional embedding for English words could
+include:
+
+```
+blue: (0.01359, 0.00075997, 0.24608, ..., -0.2524, 1.0048, 0.06259)
+blues: (0.01396, 0.11887, -0.48963, ..., 0.033483, -0.10007, 0.1158)
+orange: (-0.24776, -0.12359, 0.20986, ..., 0.079717, 0.23865, -0.014213)
+oranges: (-0.35609, 0.21854, 0.080944, ..., -0.35413, 0.38511, -0.070976)
+```
+
+Embeddings let you apply machine learning to discrete inputs. Classifiers, and
+neural networks more generally, are designed to work with dense continuous
+vectors, where all values contribute to defining what an object is. If discrete
+objects are naively encoded as discrete atoms, e.g., unique id numbers, they
+hinder learning and generalization. One way to think of embeddings is as a way
+to transform non-vector objects into useful inputs for machine learning.
+
+Embeddings are also useful as outputs of machine learning. Because embeddings
+map objects to vectors, applications can use similarity in vector space (e.g.,
+Euclidean distance or the angle between vectors) as a robust and flexible
+measure of object similarity. One common use is to find nearest neighbors.
+Using the same word embeddings above, for instance, here are the three nearest
+neighbors for each word and the corresponding angles (in degrees):
+
+```
+blue: (red, 47.6°), (yellow, 51.9°), (purple, 52.4°)
+blues: (jazz, 53.3°), (folk, 59.1°), (bluegrass, 60.6°)
+orange: (yellow, 53.5°), (colored, 58.0°), (bright, 59.9°)
+oranges: (apples, 45.3°), (lemons, 48.3°), (mangoes, 50.4°)
+```
+
+This would tell an application that apples and oranges are in some way more
+similar (45.3° apart) than lemons and oranges (48.3° apart).
+
+## Training an Embedding
+
+To train word embeddings in TensorFlow, we first need to split the text into
+words and assign an integer to every word in the vocabulary. Let us assume that
+this has already been done, and that `word_ids` is a vector of these integers.
+For example, the sentence “I have a cat.” could be split into
+`["I", "have", "a", "cat", "."]`, and then the corresponding `word_ids` tensor
+would have shape `[5]` and consist of 5 integers. To embed these word ids, we
+create the embedding variable and use the `tf.gather` function as follows:
+
+```
+word_embeddings = tf.get_variable("word_embeddings",
+                                  [vocabulary_size, embedding_size])
+embedded_word_ids = tf.gather(word_embeddings, word_ids)
+```
+
+After this, the tensor `embedded_word_ids` will have shape `[5, embedding_size]`
+in our example and contain the embeddings (dense vectors) for each of the 5
+words. The variable `word_embeddings` will be learned and at the end of the
+training it will contain the embeddings for all words in the vocabulary.
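+
+The `word_ids` vector itself typically comes from a vocabulary lookup table.
+As a minimal sketch (the `vocabulary` dict and `sentence` list here are
+hypothetical stand-ins for whatever tokenization pipeline you use):
+
+```python
+import tensorflow as tf
+
+# Hypothetical vocabulary mapping each known word to a unique integer id.
+vocabulary = {"I": 0, "have": 1, "a": 2, "cat": 3, ".": 4}
+sentence = ["I", "have", "a", "cat", "."]
+
+# Convert the tokenized sentence to integer ids; shape [5].
+word_ids = tf.constant([vocabulary[word] for word in sentence])
+
+vocabulary_size = len(vocabulary)
+embedding_size = 300
+word_embeddings = tf.get_variable("word_embeddings",
+                                  [vocabulary_size, embedding_size])
+# Shape [5, embedding_size]: one dense vector per word in the sentence.
+embedded_word_ids = tf.gather(word_embeddings, word_ids)
+```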
+The embeddings can be trained in many ways, depending on the data available.
+For example, one could use a recurrent neural network to predict the next word
+from the previous one given a large corpus of sentences, or one could train
+two networks to do multi-lingual translation. These methods are described in
+the [Vector Representations of Words](../tutorials/word2vec.md) tutorial, but
+in all cases there is an embedding variable like the one above, and words are
+embedded using `tf.gather`, as shown.
+
+## Visualizing Embeddings
+
+TensorBoard has a built-in visualizer, called the Embedding Projector,
+for interactive visualization of embeddings. The Embedding Projector will read
+the embeddings from your checkpoint file and project them into 3 dimensions using
+[principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+For a visual explanation of PCA, see
+[this article](http://setosa.io/ev/principal-component-analysis/). Another
+very useful projection you can use is
+[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).
+
+If you are working with an embedding, you'll probably want to attach
+labels/images to the data points. You can do this by generating a
+[metadata file](#metadata) containing the labels for each point and configuring
+the projector either by using our Python API, or by manually constructing and
+saving a
+[projector_config.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/plugins/projector/projector_config.proto)
+in the same directory as your checkpoint file.
+
+### Setup
+
+For in-depth information on how to run TensorBoard and make sure you are
+logging all the necessary information, see
+[TensorBoard: Visualizing Learning](../get_started/summaries_and_tensorboard.md).
+
+To visualize your embeddings, there are 3 things you need to do:
+
+1) Set up a 2D tensor that holds your embedding(s).
+
+```python
+embedding_var = tf.get_variable(....)
+```
+
+2) Periodically save your model variables in a checkpoint in
+LOG_DIR.
+
+```python
+saver = tf.train.Saver()
+saver.save(session, os.path.join(LOG_DIR, "model.ckpt"), global_step=step)
+```
+
+3) (Optional) Associate metadata with your embedding.
+
+If you have any metadata (labels, images) associated with your embedding, you
+can tell TensorBoard about it either by directly storing a
+[projector_config.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/plugins/projector/projector_config.proto)
+in the LOG_DIR, or by using our Python API.
+
+For instance, the following projector_config.pbtxt associates the
+`word_embedding` tensor with metadata stored in `$LOG_DIR/metadata.tsv`:
+
+```
+embeddings {
+  tensor_name: 'word_embedding'
+  metadata_path: '$LOG_DIR/metadata.tsv'
+}
+```
+
+The same config can be produced programmatically using the following code snippet:
+
+```python
+import os
+import tensorflow as tf
+from tensorflow.contrib.tensorboard.plugins import projector
+
+# LOG_DIR is the same log directory used above, defined elsewhere.
+
+# Create randomly initialized embedding weights which will be trained.
+vocabulary_size = 10000
+embedding_size = 200
+embedding_var = tf.get_variable('word_embedding', [vocabulary_size, embedding_size])
+
+# Format: tensorflow/tensorboard/plugins/projector/projector_config.proto
+config = projector.ProjectorConfig()
+
+# You can add multiple embeddings. Here we add only one.
+embedding = config.embeddings.add()
+embedding.tensor_name = embedding_var.name
+# Link this tensor to its metadata file (e.g. labels).
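+# (The expected TSV layout of this metadata file is described in the
+# Metadata section below.)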
+embedding.metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')
+
+# Use the same LOG_DIR where you stored your checkpoint.
+summary_writer = tf.summary.FileWriter(LOG_DIR)
+
+# The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will
+# read this file during startup.
+projector.visualize_embeddings(summary_writer, config)
+```
+
+After running your model and training your embeddings, run TensorBoard and point
+it to the LOG_DIR of the job:
+
+```
+tensorboard --logdir=LOG_DIR
+```
+
+Then click on the *Embeddings* tab on the top pane
+and select the appropriate run (if there is more than one run).
+
+### Metadata
+
+Usually embeddings have metadata associated with them (e.g. labels, images). The
+metadata should be stored in a separate file outside of the model checkpoint,
+since the metadata is not a trainable parameter of the model. The format should
+be a [TSV file](https://en.wikipedia.org/wiki/Tab-separated_values) with the
+first line containing column headers and subsequent lines containing the
+metadata values (`\t` denotes a literal tab character):
+
+```
+Word\tFrequency
+Airplane\t345
+Car\t241
+...
+```
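+
+A minimal sketch of writing such a file (the `labels` and `frequencies` lists
+are hypothetical stand-ins for your own vocabulary data, and `LOG_DIR` is the
+log directory used above):
+
+```python
+import os
+
+labels = ["Airplane", "Car"]   # hypothetical per-row labels
+frequencies = [345, 241]       # hypothetical per-row metadata values
+
+with open(os.path.join(LOG_DIR, "metadata.tsv"), "w") as f:
+    f.write("Word\tFrequency\n")             # header row
+    for label, freq in zip(labels, frequencies):
+        f.write("%s\t%d\n" % (label, freq))  # one data point per line
+```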
+
+There is no explicit key shared with the main data file; instead, the order in
+the metadata file is assumed to match the order in the embedding tensor. In
+other words, the first line is the header information and the (i+1)-th line in
+the metadata file corresponds to the i-th row of the embedding tensor stored in
+the checkpoint.
+
+Note: If the TSV metadata file has only a single column, then we don't expect a
+header row, and assume each row is the label of the embedding. We include this
+exception because it matches the commonly-used "vocab file" format.
+
+### Images
+
+If you have images associated with your embeddings, you will need to
+produce a single image consisting of small thumbnails of each data point.
+This is known as the
+[sprite image](https://www.google.com/webhp#q=what+is+a+sprite+image).
+The sprite should have the same number of rows and columns, with thumbnails
+stored in row-first order: the first data point is placed in the top left and
+the last data point in the bottom right. For example, eight data points in a
+3 × 3 sprite would be laid out as follows:
+
+```
+0  1  2
+3  4  5
+6  7
+```
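+
+A minimal sketch of constructing such a sprite with NumPy (the `thumbnails`
+array of shape `[n, h, w]` is a hypothetical stand-in for your grayscale image
+data; color images would add a channel dimension):
+
+```python
+import numpy as np
+
+def create_sprite(thumbnails):
+    """Tile [n, h, w] thumbnails into a square sprite in row-first order."""
+    n, h, w = thumbnails.shape
+    side = int(np.ceil(np.sqrt(n)))  # same number of rows and columns
+    sprite = np.zeros((side * h, side * w), dtype=thumbnails.dtype)
+    for i in range(n):
+        row, col = divmod(i, side)   # row-first: left to right, top to bottom
+        sprite[row * h:(row + 1) * h, col * w:(col + 1) * w] = thumbnails[i]
+    return sprite
+```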
+
+Note in the grid example above that the last row doesn't have to be filled. For
+a concrete example of a sprite, see
+[this sprite image](https://www.tensorflow.org/images/mnist_10k_sprite.png) of
+10,000 MNIST digits (a 100 × 100 grid of thumbnails).
+
+Note: We currently support sprites up to 8192 × 8192 pixels.
+
+After constructing the sprite, you need to tell the Embedding Projector where
+to find it:
+
+```python
+embedding.sprite.image_path = PATH_TO_SPRITE_IMAGE
+# Specify the width and height of a single thumbnail.
+embedding.sprite.single_image_dim.extend([w, h])
+```
+
+### Interaction
+
+The Embedding Projector has three panels:
+
+1. *Data panel* on the top left, where you can choose the run, the embedding
+   tensor, and the data columns used to color and label points.
+2. *Projections panel* on the bottom left, where you choose the type of
+   projection (e.g. PCA, t-SNE).
+3. *Inspector panel* on the right side, where you can search for particular
+   points and see a list of nearest neighbors.
+
+### Projections
+
+The Embedding Projector has three methods of reducing the dimensionality of a
+data set: two linear and one nonlinear. Each method can be used to create either
+a two- or three-dimensional view.
+
+**Principal Component Analysis** A straightforward technique for reducing
+dimensions is Principal Component Analysis (PCA). The Embedding Projector
+computes the top 10 principal components. The menu lets you project those
+components onto any combination of two or three. PCA is a linear projection,
+often effective at examining global geometry.
+
+**t-SNE** A popular non-linear dimensionality reduction technique is t-SNE.
+The Embedding Projector offers both two- and three-dimensional t-SNE views.
+Layout is performed client-side, animating every step of the algorithm. Because
+t-SNE often preserves some local structure, it is useful for exploring local
+neighborhoods and finding clusters. Although extremely useful for visualizing
+high-dimensional data, t-SNE plots can sometimes be mysterious or misleading.
+See this [great article](http://distill.pub/2016/misread-tsne/) for how to use
+t-SNE effectively.
+
+**Custom** You can also construct specialized linear projections based on text
+searches for finding meaningful directions in space. To define a projection
+axis, enter two search strings or regular expressions. The program computes the
+centroids of the sets of points whose labels match these searches, and uses the
+difference vector between centroids as a projection axis.
+
+### Navigation
+
+To explore a data set, you can navigate the views in either a 2D or a 3D mode,
+zooming, rotating, and panning using natural click-and-drag gestures.
+Clicking on a point causes the right pane to show an explicit textual list of
+nearest neighbors, along with distances to the current point. The
+nearest-neighbor points themselves are highlighted on the projection.
+
+Zooming into the cluster gives some information, but it is sometimes more
+helpful to restrict the view to a subset of points and perform projections only
+on those points. To do so, you can select points in multiple ways:
+
+1. After clicking on a point, its nearest neighbors are also selected.
+2. After a search, the points matching the query are selected.
+3. Enabling selection, clicking on a point and dragging defines a selection
+   sphere.
+
+After selecting a set of points, you can isolate those points for
+further analysis on their own with the "Isolate Points" button in the Inspector
+pane on the right hand side.
+
+![Selection of nearest neighbors](https://www.tensorflow.org/images/embedding-nearest-points.png "Selection of nearest neighbors")
+*Selection of the nearest neighbors of “important” in a word embedding dataset.*
+
+The combination of filtering with custom projection can be powerful. Below, we
+filtered the 100 nearest neighbors of “politics” and projected them onto the
+“best” - “worst” vector as the x axis. The y axis is random.
+
+You can see that on the right side we have “ideas”, “science”, “perspective”,
+“journalism”, while on the left we have “crisis”, “violence” and “conflict”.
+
+*Custom projection controls (left), and the custom projection of the neighbors
+of “politics” onto the “best” - “worst” vector (right).*
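+
+To make the custom-projection arithmetic concrete, here is a minimal sketch of
+the centroid-difference computation described above (the `vecs` dict of
+label-to-vector mappings and the two label sets are hypothetical stand-ins for
+the points matched by your searches):
+
+```python
+import numpy as np
+
+def custom_axis(vecs, pos_labels, neg_labels):
+    """Difference between the centroids of two labeled point sets."""
+    pos_centroid = np.mean([vecs[label] for label in pos_labels], axis=0)
+    neg_centroid = np.mean([vecs[label] for label in neg_labels], axis=0)
+    return pos_centroid - neg_centroid
+
+# Hypothetical usage: the x coordinate of each point is its dot product
+# with the "best" - "worst" axis.
+# axis = custom_axis(vecs, pos_labels=["best"], neg_labels=["worst"])
+# x = {label: np.dot(v, axis) for label, v in vecs.items()}
+```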
+
+### Collaborative Features
+
+To share your findings, you can use the bookmark panel in the bottom right
+corner and save the current state (including computed coordinates of any
+projection) as a small file. The Projector can then be pointed to a set of one
+or more of these files, producing the panel below. Other users can then walk
+through a sequence of bookmarks.
+
+*The bookmark panel.*
+
+## Mini-FAQ
+
+**Is "embedding" an action or a thing?**
+Both. People talk about embedding words in a vector space (action) and about
+producing word embeddings (things). Common to both is the notion of embedding
+as a mapping from discrete objects to vectors. Creating or applying that
+mapping is an action, but the mapping itself is a thing.
+
+**Are embeddings high-dimensional or low-dimensional?**
+It depends. A 300-dimensional vector space of words and phrases, for instance,
+is often called low-dimensional (and dense) when compared to the millions of
+words and phrases it can contain. But mathematically it is high-dimensional,
+displaying many properties that are dramatically different from what our human
+intuition has learned about 2- and 3-dimensional spaces.
+
+**Is an embedding the same as an embedding layer?**
+No; an embedding layer is a part of a neural network, but an embedding is a
+more general concept.