TensorFlow: Initial commit of TensorFlow library.
TensorFlow is an open source software library for numerical computation using data flow graphs.

Base CL: 107276108

commit f41959ccb2
.gitmodules (vendored, new file, 3 lines)
@@ -0,0 +1,3 @@
[submodule "google/protobuf"]
    path = google/protobuf
    url = https://github.googlesource.com/google/protobuf.git
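(The submodule entry above vendors Google's protobuf sources under `google/protobuf`; after cloning, running `git submodule update --init` fetches them into that path.)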
ACKNOWLEDGMENTS (new file, 50 lines)
@@ -0,0 +1,50 @@
Some of TensorFlow's code is derived from Caffe, which is subject to the following copyright notice:

COPYRIGHT

All contributions by the University of California:

Copyright (c) 2014, The Regents of the University of California (Regents)
All rights reserved.

All other contributions:

Copyright (c) 2014, the respective contributors
All rights reserved.

Caffe uses a shared copyright model: each contributor holds copyright over
their contributions to Caffe. The project versioning records all such
contribution and copyright details. If a contributor wants to further mark
their specific copyright on a particular contribution, they should indicate
their copyright solely in the commit message of the change when it is
committed.

LICENSE

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CONTRIBUTION AGREEMENT

By contributing to the BVLC/caffe repository through pull-request, comment,
or otherwise, the contributor releases their content to the
license and copyright terms herein.
AUTHORS (new file, 9 lines)
@@ -0,0 +1,9 @@
# This is the official list of TensorFlow authors for copyright purposes.
# This file is distinct from the CONTRIBUTORS files.
# See the latter for an explanation.

# Names should be added to this file as:
# Name or Organization <email address>
# The email address is not required for organizations.

Google Inc.
CONTRIBUTING.md (new file, 17 lines)
@@ -0,0 +1,17 @@
# Contributing guidelines

## How to become a contributor and submit your own code

### Contributor License Agreements

We'd love to accept your patches! Before we can take them, we have to jump a couple of legal hurdles.

Please fill out either the individual or corporate Contributor License Agreement (CLA).

* If you are an individual writing original source code and you're sure you own the intellectual property, then you'll need to sign an [individual CLA](http://code.google.com/legal/individual-cla-v1.0.html).
* If you work for a company that wants to allow you to contribute your work, then you'll need to sign a [corporate CLA](http://code.google.com/legal/corporate-cla-v1.0.html).

Follow either of the two links above to access the appropriate CLA and instructions for how to sign and return it. Once we receive it, we'll be able to accept your pull requests.

***NOTE***: Only original source code from you and other people that have signed the CLA can be accepted into the main repository.
LICENSE (new file, 203 lines)
@@ -0,0 +1,203 @@
Copyright 2015 The TensorFlow Authors. All rights reserved.

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2015, The TensorFlow Authors.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
README.md (new file, 17 lines)
@@ -0,0 +1,17 @@
# TensorFlow

TensorFlow is an open source software library for numerical computation using
data flow graphs. Nodes in the graph represent mathematical operations, while
the graph edges represent the multidimensional data arrays (tensors) that flow
between them. This flexible architecture lets you deploy computation to one
or more CPUs or GPUs in a desktop, server, or mobile device without rewriting
code. TensorFlow was originally developed by researchers and engineers
working on the Google Brain team within Google's Machine Intelligence research
organization for the purposes of conducting machine learning and deep neural
networks research. The system is general enough to be applicable in a wide
variety of other domains, as well.

## For more information

* [Installation and setup instructions](/tensorflow/g3doc/get_started/os_setup.md)
* [TensorFlow website](http://tensorflow.org)
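To make the graph-and-session model described above concrete, here is a minimal sketch using the Python API from this initial release (`tf.constant`, `tf.matmul`, `tf.Session`). It is an illustration only, not code contained in the commit:

```python
import tensorflow as tf

# Two constant nodes produce the tensors that flow along graph edges...
a = tf.constant([[1.0, 2.0]])      # shape 1x2
b = tf.constant([[3.0], [4.0]])    # shape 2x1

# ...and one operation node consumes them.
product = tf.matmul(a, b)

# Launching the graph in a session runs the computation on CPU or GPU.
with tf.Session() as sess:
    print(sess.run(product))       # [[ 11.]]
```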
WORKSPACE (new file, 551 lines)
@@ -0,0 +1,551 @@
# Uncomment and update the paths in these entries to build the Android demo.
#android_sdk_repository(
#    name = "androidsdk",
#    api_level = 23,
#    build_tools_version = "23.0.1",
#    # Replace with path to Android SDK on your system
#    path = "<PATH_TO_SDK>",
#)
#
#android_ndk_repository(
#    name="androidndk",
#    path="<PATH_TO_NDK>",
#    api_level=21)

new_http_archive(
    name = "gmock_archive",
    url = "https://googlemock.googlecode.com/files/gmock-1.7.0.zip",
    sha256 = "26fcbb5925b74ad5fc8c26b0495dfc96353f4d553492eb97e85a8a6d2f43095b",
    build_file = "gmock.BUILD",
)

bind(
    name = "gtest",
    actual = "@gmock_archive//:gtest",
)

bind(
    name = "gtest_main",
    actual = "@gmock_archive//:gtest_main",
)
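The `bind()` rules above publish the GoogleMock targets under the conventional `//external:` labels, so BUILD files elsewhere in the tree can depend on them without naming the archive directly. A hypothetical consumer (target and file names made up for illustration; not part of this commit) would look roughly like:

```python
# Illustrative BUILD target only.
cc_test(
    name = "example_test",
    srcs = ["example_test.cc"],
    deps = [
        "//external:gtest_main",  # resolved to @gmock_archive//:gtest_main via bind()
    ],
)
```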
git_repository(
    name = "re2",
    remote = "https://github.com/google/re2.git",
    tag = "2015-07-01",
)

new_http_archive(
    name = "jpeg_archive",
    url = "http://www.ijg.org/files/jpegsrc.v9a.tar.gz",
    sha256 = "3a753ea48d917945dd54a2d97de388aa06ca2eb1066cbfdc6652036349fe05a7",
    build_file = "jpeg.BUILD",
)

git_repository(
    name = "gemmlowp",
    remote = "https://github.com/google/gemmlowp.git",
    commit = "cc5d3a0",
)

new_http_archive(
    name = "png_archive",
    url = "https://storage.googleapis.com/libpng-public-archive/libpng-1.2.53.tar.gz",
    sha256 = "e05c9056d7f323088fd7824d8c6acc03a4a758c4b4916715924edc5dd3223a72",
    build_file = "png.BUILD",
)

new_http_archive(
    name = "six_archive",
    url = "https://pypi.python.org/packages/source/s/six/six-1.10.0.tar.gz#md5=34eed507548117b2ab523ab14b2f8b55",
    sha256 = "105f8d68616f8248e24bf0e9372ef04d3cc10104f1980f54d57b2ce73a5ad56a",
    build_file = "six.BUILD",
)

bind(
    name = "six",
    actual = "@six_archive//:six",
)
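Each `new_http_archive` and `new_git_repository` entry pairs downloaded sources with a `build_file` (`gmock.BUILD`, `six.BUILD`, `bower.BUILD`, ...) that declares Bazel targets for the unpacked tree; those files live in the repository root. As a rough, hypothetical sketch of the pattern (not the actual contents of `six.BUILD`), such a file can be as small as:

```python
# Hypothetical sketch of a build_file for an external archive.
py_library(
    name = "six",
    srcs = ["six.py"],
    visibility = ["//visibility:public"],
)
```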
new_git_repository(
    name = "iron-ajax",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-ajax.git",
    tag = "v1.0.8",
)

new_git_repository(
    name = "iron-dropdown",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-dropdown.git",
    tag = "v1.0.6",
)

new_git_repository(
    name = "accessibility-developer-tools",
    build_file = "bower.BUILD",
    remote = "https://github.com/GoogleChrome/accessibility-developer-tools.git",
    tag = "v2.10.0",
)

new_git_repository(
    name = "iron-doc-viewer",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-doc-viewer.git",
    tag = "v1.0.6",
)

new_git_repository(
    name = "iron-icons",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/iron-icons.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "paper-icon-button",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-icon-button.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "sinonjs",
    build_file = "bower.BUILD",
    remote = "https://github.com/blittle/sinon.js.git",
    tag = "v1.17.1",
)

new_git_repository(
    name = "paper-dropdown-menu",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-dropdown-menu.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "iron-flex-layout",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/iron-flex-layout.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "iron-autogrow-textarea",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-autogrow-textarea.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "d3",
    build_file = "bower.BUILD",
    remote = "https://github.com/mbostock/d3.git",
    tag = "v3.5.6",
)

new_git_repository(
    name = "iron-component-page",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-component-page.git",
    tag = "v1.0.8",
)

new_git_repository(
    name = "stacky",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerLabs/stacky.git",
    tag = "v1.2.4",
)

new_git_repository(
    name = "paper-styles",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-styles.git",
    tag = "v1.0.12",
)

new_git_repository(
    name = "paper-input",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-input.git",
    tag = "v1.0.16",
)

new_git_repository(
    name = "paper-item",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-item.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "marked-element",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/marked-element.git",
    tag = "v1.1.1",
)

new_git_repository(
    name = "prism",
    build_file = "bower.BUILD",
    remote = "https://github.com/LeaVerou/prism.git",
    tag = "v1.3.0",
)

new_git_repository(
    name = "paper-progress",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-progress.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-checked-element-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-checked-element-behavior.git",
    tag = "v1.0.2",
)

new_git_repository(
    name = "paper-toolbar",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-toolbar.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "async",
    build_file = "bower.BUILD",
    remote = "https://github.com/caolan/async.git",
    tag = "0.9.2",
)

new_git_repository(
    name = "es6-promise",
    build_file = "bower.BUILD",
    remote = "https://github.com/components/es6-promise.git",
    tag = "v3.0.2",
)

new_git_repository(
    name = "promise-polyfill",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerlabs/promise-polyfill.git",
    tag = "v1.0.0",
)

new_git_repository(
    name = "font-roboto",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/font-roboto.git",
    tag = "v1.0.1",
)

new_git_repository(
    name = "paper-menu",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-menu.git",
    tag = "v1.1.1",
)

new_git_repository(
    name = "iron-icon",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/iron-icon.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-meta",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-meta.git",
    tag = "v1.1.0",
)

new_git_repository(
    name = "lodash",
    build_file = "bower.BUILD",
    remote = "https://github.com/lodash/lodash.git",
    tag = "3.10.1",
)

new_git_repository(
    name = "iron-resizable-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-resizable-behavior.git",
    tag = "v1.0.2",
)

new_git_repository(
    name = "iron-fit-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-fit-behavior.git",
    tag = "v1.0.3",
)

new_git_repository(
    name = "iron-overlay-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-overlay-behavior.git",
    tag = "v1.0.9",
)

new_git_repository(
    name = "neon-animation",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/neon-animation.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-a11y-keys-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/iron-a11y-keys-behavior.git",
    tag = "v1.0.7",
)
new_git_repository(
    name = "plottable",
    build_file = "bower.BUILD",
    remote = "https://github.com/palantir/plottable.git",
    tag = "v1.16.1",
)

new_git_repository(
    name = "webcomponentsjs",
    build_file = "bower.BUILD",
    remote = "https://github.com/Polymer/webcomponentsjs.git",
    tag = "v0.7.15",
)

new_git_repository(
    name = "iron-validatable-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-validatable-behavior.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "sinon-chai",
    build_file = "bower.BUILD",
    remote = "https://github.com/domenic/sinon-chai.git",
    tag = "2.8.0",
)

new_git_repository(
    name = "paper-button",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-button.git",
    tag = "v1.0.8",
)

new_git_repository(
    name = "iron-input",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-input.git",
    tag = "v1.0.6",
)

new_git_repository(
    name = "iron-menu-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-menu-behavior.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "paper-slider",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-slider.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-list",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-list.git",
    tag = "v1.1.5",
)

new_git_repository(
    name = "marked",
    build_file = "bower.BUILD",
    remote = "https://github.com/chjj/marked.git",
    tag = "v0.3.5",
)

new_git_repository(
    name = "paper-material",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/paper-material.git",
    tag = "v1.0.3",
)

new_git_repository(
    name = "iron-range-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-range-behavior.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "svg-typewriter",
    build_file = "bower.BUILD",
    remote = "https://github.com/palantir/svg-typewriter.git",
    tag = "v0.3.0",
)

new_git_repository(
    name = "web-animations-js",
    build_file = "bower.BUILD",
    remote = "https://github.com/web-animations/web-animations-js.git",
    tag = "2.1.2",
)

new_git_repository(
    name = "hydrolysis",
    build_file = "bower.BUILD",
    remote = "https://github.com/Polymer/hydrolysis.git",
    tag = "v1.19.3",
)

new_git_repository(
    name = "web-component-tester",
    build_file = "bower.BUILD",
    remote = "https://github.com/Polymer/web-component-tester.git",
    tag = "v3.3.29",
)

new_git_repository(
    name = "paper-toggle-button",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-toggle-button.git",
    tag = "v1.0.11",
)

new_git_repository(
    name = "paper-behaviors",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/paper-behaviors.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "paper-radio-group",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-radio-group.git",
    tag = "v1.0.6",
)

new_git_repository(
    name = "iron-selector",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-selector.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-form-element-behavior",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-form-element-behavior.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "mocha",
    build_file = "bower.BUILD",
    remote = "https://github.com/mochajs/mocha.git",
    tag = "v2.3.3",
)

new_git_repository(
    name = "dagre",
    build_file = "bower.BUILD",
    remote = "https://github.com/cpettitt/dagre.git",
    tag = "v0.7.4",
)

new_git_repository(
    name = "iron-behaviors",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-behaviors.git",
    tag = "v1.0.9",
)

new_git_repository(
    name = "graphlib",
    build_file = "bower.BUILD",
    remote = "https://github.com/cpettitt/graphlib.git",
    tag = "v1.0.7",
)

new_git_repository(
    name = "iron-collapse",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-collapse.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "paper-checkbox",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-checkbox.git",
    tag = "v1.0.13",
)

new_git_repository(
    name = "paper-radio-button",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-radio-button.git",
    tag = "v1.0.10",
)

new_git_repository(
    name = "paper-header-panel",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/paper-header-panel.git",
    tag = "v1.0.5",
)

new_git_repository(
    name = "prism-element",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/prism-element.git",
    tag = "v1.0.2",
)

new_git_repository(
    name = "chai",
    build_file = "bower.BUILD",
    remote = "https://github.com/chaijs/chai.git",
    tag = "2.3.0",
)

new_git_repository(
    name = "paper-menu-button",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/paper-menu-button.git",
    tag = "v1.0.3",
)

new_git_repository(
    name = "polymer",
    build_file = "bower.BUILD",
    remote = "https://github.com/Polymer/polymer.git",
    tag = "v1.2.1",
)

new_git_repository(
    name = "paper-ripple",
    build_file = "bower.BUILD",
    remote = "https://github.com/polymerelements/paper-ripple.git",
    tag = "v1.0.4",
)

new_git_repository(
    name = "iron-iconset-svg",
    build_file = "bower.BUILD",
    remote = "https://github.com/PolymerElements/iron-iconset-svg.git",
    tag = "v1.0.8",
)
bower.BUILD (new file, 971 lines)
@@ -0,0 +1,971 @@
package(default_visibility = ["//tensorflow:internal"])

filegroup(
    name = "iron-ajax",
    srcs = [
        "index.html",
        "iron-ajax.html",
        "iron-request.html",
    ],
)
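Each `filegroup` in this file simply re-exports the files of one vendored Bower component so that other packages (e.g. the TensorBoard frontend) can depend on them through the external repositories declared in WORKSPACE. A hypothetical consumer of the group above (target name invented for illustration) might be:

```python
# Illustrative only; not a target from this commit.
filegroup(
    name = "frontend_deps",
    srcs = ["@iron-ajax//:iron-ajax"],  # the filegroup defined above
)
```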
|
||||
|
||||
filegroup(
|
||||
name = "iron-dropdown",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-dropdown.html",
|
||||
"iron-dropdown-scroll-manager.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "accessibility-developer-tools",
|
||||
srcs = ["dist/js/axs_testing.js"],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-doc-viewer",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-doc-property.css",
|
||||
"iron-doc-property.html",
|
||||
"iron-doc-viewer.css",
|
||||
"iron-doc-viewer.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-icons",
|
||||
srcs = [
|
||||
"av-icons.html",
|
||||
"communication-icons.html",
|
||||
"device-icons.html",
|
||||
"editor-icons.html",
|
||||
"hardware-icons.html",
|
||||
"image-icons.html",
|
||||
"index.html",
|
||||
"iron-icons.html",
|
||||
"maps-icons.html",
|
||||
"notification-icons.html",
|
||||
"social-icons.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-icon-button",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-icon-button.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "sinonjs",
|
||||
srcs = ["sinon.js"],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-dropdown-menu",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-dropdown-menu.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-flex-layout",
|
||||
srcs = [
|
||||
"classes/iron-flex-layout.html",
|
||||
"classes/iron-shadow-flex-layout.html",
|
||||
"index.html",
|
||||
"iron-flex-layout.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-autogrow-textarea",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-autogrow-textarea.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "d3",
|
||||
srcs = [
|
||||
"d3.js",
|
||||
"d3.min.js",
|
||||
"package.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-component-page",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-component-page.css",
|
||||
"iron-component-page.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "stacky",
|
||||
srcs = [
|
||||
"gulpfile.js",
|
||||
"lib/formatting.js",
|
||||
"lib/index.js",
|
||||
"lib/normalization.js",
|
||||
"lib/parsing.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-styles",
|
||||
srcs = [
|
||||
"classes/global.html",
|
||||
"classes/shadow.html",
|
||||
"classes/shadow-layout.html",
|
||||
"classes/typography.html",
|
||||
"color.html",
|
||||
"default-theme.html",
|
||||
"demo.css",
|
||||
"demo-pages.html",
|
||||
"index.html",
|
||||
"paper-styles.html",
|
||||
"paper-styles-classes.html",
|
||||
"shadow.html",
|
||||
"typography.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-input",
|
||||
srcs = [
|
||||
"all-imports.html",
|
||||
"index.html",
|
||||
"paper-input.html",
|
||||
"paper-input-addon-behavior.html",
|
||||
"paper-input-behavior.html",
|
||||
"paper-input-char-counter.html",
|
||||
"paper-input-container.html",
|
||||
"paper-input-error.html",
|
||||
"paper-textarea.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-item",
|
||||
srcs = [
|
||||
"all-imports.html",
|
||||
"index.html",
|
||||
"paper-icon-item.html",
|
||||
"paper-item.html",
|
||||
"paper-item-body.html",
|
||||
"paper-item-shared-styles.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "marked-element",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"marked-element.html",
|
||||
"marked-import.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "prism",
|
||||
srcs = [
|
||||
"components.js",
|
||||
"components/prism-abap.js",
|
||||
"components/prism-abap.min.js",
|
||||
"components/prism-actionscript.js",
|
||||
"components/prism-actionscript.min.js",
|
||||
"components/prism-apacheconf.js",
|
||||
"components/prism-apacheconf.min.js",
|
||||
"components/prism-apl.js",
|
||||
"components/prism-apl.min.js",
|
||||
"components/prism-applescript.js",
|
||||
"components/prism-applescript.min.js",
|
||||
"components/prism-asciidoc.js",
|
||||
"components/prism-asciidoc.min.js",
|
||||
"components/prism-aspnet.js",
|
||||
"components/prism-aspnet.min.js",
|
||||
"components/prism-autohotkey.js",
|
||||
"components/prism-autohotkey.min.js",
|
||||
"components/prism-autoit.js",
|
||||
"components/prism-autoit.min.js",
|
||||
"components/prism-bash.js",
|
||||
"components/prism-bash.min.js",
|
||||
"components/prism-basic.js",
|
||||
"components/prism-basic.min.js",
|
||||
"components/prism-batch.js",
|
||||
"components/prism-batch.min.js",
|
||||
"components/prism-bison.js",
|
||||
"components/prism-bison.min.js",
|
||||
"components/prism-brainfuck.js",
|
||||
"components/prism-brainfuck.min.js",
|
||||
"components/prism-c.js",
|
||||
"components/prism-c.min.js",
|
||||
"components/prism-clike.js",
|
||||
"components/prism-clike.min.js",
|
||||
"components/prism-coffeescript.js",
|
||||
"components/prism-coffeescript.min.js",
|
||||
"components/prism-core.js",
|
||||
"components/prism-core.min.js",
|
||||
"components/prism-cpp.js",
|
||||
"components/prism-cpp.min.js",
|
||||
"components/prism-crystal.js",
|
||||
"components/prism-crystal.min.js",
|
||||
"components/prism-csharp.js",
|
||||
"components/prism-csharp.min.js",
|
||||
"components/prism-css.js",
|
||||
"components/prism-css.min.js",
|
||||
"components/prism-css-extras.js",
|
||||
"components/prism-css-extras.min.js",
|
||||
"components/prism-d.js",
|
||||
"components/prism-d.min.js",
|
||||
"components/prism-dart.js",
|
||||
"components/prism-dart.min.js",
|
||||
"components/prism-diff.js",
|
||||
"components/prism-diff.min.js",
|
||||
"components/prism-docker.js",
|
||||
"components/prism-docker.min.js",
|
||||
"components/prism-eiffel.js",
|
||||
"components/prism-eiffel.min.js",
|
||||
"components/prism-elixir.js",
|
||||
"components/prism-elixir.min.js",
|
||||
"components/prism-erlang.js",
|
||||
"components/prism-erlang.min.js",
|
||||
"components/prism-fortran.js",
|
||||
"components/prism-fortran.min.js",
|
||||
"components/prism-fsharp.js",
|
||||
"components/prism-fsharp.min.js",
|
||||
"components/prism-gherkin.js",
|
||||
"components/prism-gherkin.min.js",
|
||||
"components/prism-git.js",
|
||||
"components/prism-git.min.js",
|
||||
"components/prism-glsl.js",
|
||||
"components/prism-glsl.min.js",
|
||||
"components/prism-go.js",
|
||||
"components/prism-go.min.js",
|
||||
"components/prism-groovy.js",
|
||||
"components/prism-groovy.min.js",
|
||||
"components/prism-haml.js",
|
||||
"components/prism-haml.min.js",
|
||||
"components/prism-handlebars.js",
|
||||
"components/prism-handlebars.min.js",
|
||||
"components/prism-haskell.js",
|
||||
"components/prism-haskell.min.js",
|
||||
"components/prism-haxe.js",
|
||||
"components/prism-haxe.min.js",
|
||||
"components/prism-http.js",
|
||||
"components/prism-http.min.js",
|
||||
"components/prism-icon.js",
|
||||
"components/prism-icon.min.js",
|
||||
"components/prism-inform7.js",
|
||||
"components/prism-inform7.min.js",
|
||||
"components/prism-ini.js",
|
||||
"components/prism-ini.min.js",
|
||||
"components/prism-j.js",
|
||||
"components/prism-j.min.js",
|
||||
"components/prism-jade.js",
|
||||
"components/prism-jade.min.js",
|
||||
"components/prism-java.js",
|
||||
"components/prism-java.min.js",
|
||||
"components/prism-javascript.js",
|
||||
"components/prism-javascript.min.js",
|
||||
"components/prism-jsx.js",
|
||||
"components/prism-jsx.min.js",
|
||||
"components/prism-julia.js",
|
||||
"components/prism-julia.min.js",
|
||||
"components/prism-keyman.js",
|
||||
"components/prism-keyman.min.js",
|
||||
"components/prism-kotlin.js",
|
||||
"components/prism-kotlin.min.js",
|
||||
"components/prism-latex.js",
|
||||
"components/prism-latex.min.js",
|
||||
"components/prism-less.js",
|
||||
"components/prism-less.min.js",
|
||||
"components/prism-lolcode.js",
|
||||
"components/prism-lolcode.min.js",
|
||||
"components/prism-lua.js",
|
||||
"components/prism-lua.min.js",
|
||||
"components/prism-makefile.js",
|
||||
"components/prism-makefile.min.js",
|
||||
"components/prism-markdown.js",
|
||||
"components/prism-markdown.min.js",
|
||||
"components/prism-markup.js",
|
||||
"components/prism-markup.min.js",
|
||||
"components/prism-matlab.js",
|
||||
"components/prism-matlab.min.js",
|
||||
"components/prism-mel.js",
|
||||
"components/prism-mel.min.js",
|
||||
"components/prism-mizar.js",
|
||||
"components/prism-mizar.min.js",
|
||||
"components/prism-monkey.js",
|
||||
"components/prism-monkey.min.js",
|
||||
"components/prism-nasm.js",
|
||||
"components/prism-nasm.min.js",
|
||||
"components/prism-nginx.js",
|
||||
"components/prism-nginx.min.js",
|
||||
"components/prism-nim.js",
|
||||
"components/prism-nim.min.js",
|
||||
"components/prism-nix.js",
|
||||
"components/prism-nix.min.js",
|
||||
"components/prism-nsis.js",
|
||||
"components/prism-nsis.min.js",
|
||||
"components/prism-objectivec.js",
|
||||
"components/prism-objectivec.min.js",
|
||||
"components/prism-ocaml.js",
|
||||
"components/prism-ocaml.min.js",
|
||||
"components/prism-oz.js",
|
||||
"components/prism-oz.min.js",
|
||||
"components/prism-parigp.js",
|
||||
"components/prism-parigp.min.js",
|
||||
"components/prism-parser.js",
|
||||
"components/prism-parser.min.js",
|
||||
"components/prism-pascal.js",
|
||||
"components/prism-pascal.min.js",
|
||||
"components/prism-perl.js",
|
||||
"components/prism-perl.min.js",
|
||||
"components/prism-php.js",
|
||||
"components/prism-php.min.js",
|
||||
"components/prism-php-extras.js",
|
||||
"components/prism-php-extras.min.js",
|
||||
"components/prism-powershell.js",
|
||||
"components/prism-powershell.min.js",
|
||||
"components/prism-processing.js",
|
||||
"components/prism-processing.min.js",
|
||||
"components/prism-prolog.js",
|
||||
"components/prism-prolog.min.js",
|
||||
"components/prism-puppet.js",
|
||||
"components/prism-puppet.min.js",
|
||||
"components/prism-pure.js",
|
||||
"components/prism-pure.min.js",
|
||||
"components/prism-python.js",
|
||||
"components/prism-python.min.js",
|
||||
"components/prism-q.js",
|
||||
"components/prism-q.min.js",
|
||||
"components/prism-qore.js",
|
||||
"components/prism-qore.min.js",
|
||||
"components/prism-r.js",
|
||||
"components/prism-r.min.js",
|
||||
"components/prism-rest.js",
|
||||
"components/prism-rest.min.js",
|
||||
"components/prism-rip.js",
|
||||
"components/prism-rip.min.js",
|
||||
"components/prism-roboconf.js",
|
||||
"components/prism-roboconf.min.js",
|
||||
"components/prism-ruby.js",
|
||||
"components/prism-ruby.min.js",
|
||||
"components/prism-rust.js",
|
||||
"components/prism-rust.min.js",
|
||||
"components/prism-sas.js",
|
||||
"components/prism-sas.min.js",
|
||||
"components/prism-sass.js",
|
||||
"components/prism-sass.min.js",
|
||||
"components/prism-scala.js",
|
||||
"components/prism-scala.min.js",
|
||||
"components/prism-scheme.js",
|
||||
"components/prism-scheme.min.js",
|
||||
"components/prism-scss.js",
|
||||
"components/prism-scss.min.js",
|
||||
"components/prism-smalltalk.js",
|
||||
"components/prism-smalltalk.min.js",
|
||||
"components/prism-smarty.js",
|
||||
"components/prism-smarty.min.js",
|
||||
"components/prism-sql.js",
|
||||
"components/prism-sql.min.js",
|
||||
"components/prism-stylus.js",
|
||||
"components/prism-stylus.min.js",
|
||||
"components/prism-swift.js",
|
||||
"components/prism-swift.min.js",
|
||||
"components/prism-tcl.js",
|
||||
"components/prism-tcl.min.js",
|
||||
"components/prism-textile.js",
|
||||
"components/prism-textile.min.js",
|
||||
"components/prism-twig.js",
|
||||
"components/prism-twig.min.js",
|
||||
"components/prism-typescript.js",
|
||||
"components/prism-typescript.min.js",
|
||||
"components/prism-verilog.js",
|
||||
"components/prism-verilog.min.js",
|
||||
"components/prism-vhdl.js",
|
||||
"components/prism-vhdl.min.js",
|
||||
"components/prism-vim.js",
|
||||
"components/prism-vim.min.js",
|
||||
"components/prism-wiki.js",
|
||||
"components/prism-wiki.min.js",
|
||||
"components/prism-yaml.js",
|
||||
"components/prism-yaml.min.js",
|
||||
"examples.js",
|
||||
"gulpfile.js",
|
||||
"plugins/autolinker/prism-autolinker.css",
|
||||
"plugins/autolinker/prism-autolinker.js",
|
||||
"plugins/autolinker/prism-autolinker.min.js",
|
||||
"plugins/autoloader/prism-autoloader.js",
|
||||
"plugins/autoloader/prism-autoloader.min.js",
|
||||
"plugins/file-highlight/prism-file-highlight.js",
|
||||
"plugins/file-highlight/prism-file-highlight.min.js",
|
||||
"plugins/highlight-keywords/prism-highlight-keywords.js",
|
||||
"plugins/highlight-keywords/prism-highlight-keywords.min.js",
|
||||
"plugins/ie8/prism-ie8.css",
|
||||
"plugins/ie8/prism-ie8.js",
|
||||
"plugins/ie8/prism-ie8.min.js",
|
||||
"plugins/jsonp-highlight/prism-jsonp-highlight.js",
|
||||
"plugins/jsonp-highlight/prism-jsonp-highlight.min.js",
|
||||
"plugins/keep-markup/prism-keep-markup.js",
|
||||
"plugins/keep-markup/prism-keep-markup.min.js",
|
||||
"plugins/line-highlight/prism-line-highlight.css",
|
||||
"plugins/line-highlight/prism-line-highlight.js",
|
||||
"plugins/line-highlight/prism-line-highlight.min.js",
|
||||
"plugins/line-numbers/prism-line-numbers.css",
|
||||
"plugins/line-numbers/prism-line-numbers.js",
|
||||
"plugins/line-numbers/prism-line-numbers.min.js",
|
||||
"plugins/previewer-angle/prism-previewer-angle.css",
|
||||
"plugins/previewer-angle/prism-previewer-angle.js",
|
||||
"plugins/previewer-angle/prism-previewer-angle.min.js",
|
||||
"plugins/previewer-base/prism-previewer-base.css",
|
||||
"plugins/previewer-base/prism-previewer-base.js",
|
||||
"plugins/previewer-base/prism-previewer-base.min.js",
|
||||
"plugins/previewer-color/prism-previewer-color.css",
|
||||
"plugins/previewer-color/prism-previewer-color.js",
|
||||
"plugins/previewer-color/prism-previewer-color.min.js",
|
||||
"plugins/previewer-easing/prism-previewer-easing.css",
|
||||
"plugins/previewer-easing/prism-previewer-easing.js",
|
||||
"plugins/previewer-easing/prism-previewer-easing.min.js",
|
||||
"plugins/previewer-gradient/prism-previewer-gradient.css",
|
||||
"plugins/previewer-gradient/prism-previewer-gradient.js",
|
||||
"plugins/previewer-gradient/prism-previewer-gradient.min.js",
|
||||
"plugins/previewer-time/prism-previewer-time.css",
|
||||
"plugins/previewer-time/prism-previewer-time.js",
|
||||
"plugins/previewer-time/prism-previewer-time.min.js",
|
||||
"plugins/remove-initial-line-feed/prism-remove-initial-line-feed.js",
|
||||
"plugins/remove-initial-line-feed/prism-remove-initial-line-feed.min.js",
|
||||
"plugins/show-invisibles/prism-show-invisibles.css",
|
||||
"plugins/show-invisibles/prism-show-invisibles.js",
|
||||
"plugins/show-invisibles/prism-show-invisibles.min.js",
|
||||
"plugins/show-language/prism-show-language.css",
|
||||
"plugins/show-language/prism-show-language.js",
|
||||
"plugins/show-language/prism-show-language.min.js",
|
||||
"plugins/wpd/prism-wpd.css",
|
||||
"plugins/wpd/prism-wpd.js",
|
||||
"plugins/wpd/prism-wpd.min.js",
|
||||
"prism.js",
|
||||
"tests/helper/components.js",
|
||||
"tests/helper/prism-loader.js",
|
||||
"tests/helper/test-case.js",
|
||||
"tests/helper/test-discovery.js",
|
||||
"tests/helper/token-stream-transformer.js",
|
||||
"tests/run.js",
|
||||
"tests/run-child.js",
|
||||
"tests/testrunner-tests.js",
|
||||
"themes/prism.css",
|
||||
"themes/prism-coy.css",
|
||||
"themes/prism-dark.css",
|
||||
"themes/prism-funky.css",
|
||||
"themes/prism-okaidia.css",
|
||||
"themes/prism-tomorrow.css",
|
||||
"themes/prism-twilight.css",
|
||||
"vendor/promise.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-progress",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-progress.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-checked-element-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-checked-element-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-toolbar",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-toolbar.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "async",
|
||||
srcs = [
|
||||
"deps/nodeunit.css",
|
||||
"deps/nodeunit.js",
|
||||
"lib/async.js",
|
||||
"support/sync-package-managers.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "es6-promise",
|
||||
srcs = [
|
||||
"promise.js",
|
||||
"promise.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "promise-polyfill",
|
||||
srcs = [
|
||||
"Gruntfile.js",
|
||||
"Promise.js",
|
||||
"Promise.min.js",
|
||||
"Promise-Statics.js",
|
||||
"promise-polyfill.html",
|
||||
"promise-polyfill-lite.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "font-roboto",
|
||||
srcs = ["roboto.html"],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-menu",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-menu.html",
|
||||
"paper-menu-shared.css",
|
||||
"paper-submenu.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-icon",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-icon.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-meta",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-meta.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "lodash",
|
||||
srcs = [
|
||||
"lodash.js",
|
||||
"lodash.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-resizable-behavior",
|
||||
srcs = [
|
||||
"demo/src/x-app.html",
|
||||
"index.html",
|
||||
"iron-resizable-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-fit-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-fit-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-overlay-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-overlay-backdrop.html",
|
||||
"iron-overlay-behavior.html",
|
||||
"iron-overlay-manager.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "neon-animation",
|
||||
srcs = [
|
||||
"animations/cascaded-animation.html",
|
||||
"animations/fade-in-animation.html",
|
||||
"animations/fade-out-animation.html",
|
||||
"animations/hero-animation.html",
|
||||
"animations/opaque-animation.html",
|
||||
"animations/reverse-ripple-animation.html",
|
||||
"animations/ripple-animation.html",
|
||||
"animations/scale-down-animation.html",
|
||||
"animations/scale-up-animation.html",
|
||||
"animations/slide-down-animation.html",
|
||||
"animations/slide-from-left-animation.html",
|
||||
"animations/slide-from-right-animation.html",
|
||||
"animations/slide-left-animation.html",
|
||||
"animations/slide-right-animation.html",
|
||||
"animations/slide-up-animation.html",
|
||||
"animations/transform-animation.html",
|
||||
"demo/card/index.html",
|
||||
"demo/card/x-card.html",
|
||||
"demo/card/x-cards-list.html",
|
||||
"demo/declarative/index.html",
|
||||
"demo/doc/basic.html",
|
||||
"demo/doc/my-animatable.html",
|
||||
"demo/doc/my-dialog.html",
|
||||
"demo/doc/types.html",
|
||||
"demo/dropdown/animated-dropdown.html",
|
||||
"demo/dropdown/index.html",
|
||||
"demo/grid/animated-grid.html",
|
||||
"demo/grid/fullsize-page-with-card.html",
|
||||
"demo/grid/index.html",
|
||||
"demo/list/full-view.html",
|
||||
"demo/list/index.html",
|
||||
"demo/list/list-demo.html",
|
||||
"demo/list/list-view.html",
|
||||
"demo/load/animated-grid.html",
|
||||
"demo/load/full-page.html",
|
||||
"demo/load/index.html",
|
||||
"demo/reprojection/animated-grid.html",
|
||||
"demo/reprojection/fullsize-page-with-card.html",
|
||||
"demo/reprojection/index.html",
|
||||
"demo/reprojection/reprojected-pages.html",
|
||||
"demo/tiles/circles-page.html",
|
||||
"demo/tiles/index.html",
|
||||
"demo/tiles/squares-page.html",
|
||||
"index.html",
|
||||
"neon-animatable.html",
|
||||
"neon-animatable-behavior.html",
|
||||
"neon-animated-pages.html",
|
||||
"neon-animation.html",
|
||||
"neon-animation-behavior.html",
|
||||
"neon-animation-runner-behavior.html",
|
||||
"neon-animations.html",
|
||||
"neon-shared-element-animatable-behavior.html",
|
||||
"neon-shared-element-animation-behavior.html",
|
||||
"web-animations.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-a11y-keys-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-a11y-keys-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "plottable",
|
||||
srcs = [
|
||||
"plottable.css",
|
||||
"plottable.js",
|
||||
"plottable.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "webcomponentsjs",
|
||||
srcs = [
|
||||
"CustomElements.js",
|
||||
"CustomElements.min.js",
|
||||
"HTMLImports.js",
|
||||
"HTMLImports.min.js",
|
||||
"MutationObserver.js",
|
||||
"MutationObserver.min.js",
|
||||
"ShadowDOM.js",
|
||||
"ShadowDOM.min.js",
|
||||
"webcomponents.js",
|
||||
"webcomponents.min.js",
|
||||
"webcomponents-lite.js",
|
||||
"webcomponents-lite.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-validatable-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-validatable-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "sinon-chai",
|
||||
srcs = ["lib/sinon-chai.js"],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-button",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-button.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-input",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-input.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-menu-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-menu-behavior.html",
|
||||
"iron-menubar-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-slider",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-slider.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-list",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-list.html",
|
||||
"test/smoke/avg-worst-case.html",
|
||||
"test/smoke/dummy-data.html",
|
||||
"test/smoke/index.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "marked",
|
||||
srcs = [
|
||||
"Gulpfile.js",
|
||||
"index.js",
|
||||
"lib/marked.js",
|
||||
"marked.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-material",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-material.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-range-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-range-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "svg-typewriter",
|
||||
srcs = ["svgtypewriter.js"],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "web-animations-js",
|
||||
srcs = [
|
||||
"web-animations.html",
|
||||
"web-animations.min.js",
|
||||
"web-animations-next.min.js",
|
||||
"web-animations-next-lite.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "hydrolysis",
|
||||
srcs = [
|
||||
"hydrolysis.html",
|
||||
"hydrolysis.js",
|
||||
"hydrolysis-analyzer.html",
|
||||
"index.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "web-component-tester",
|
||||
srcs = [
|
||||
"browser.js",
|
||||
"data/a11ySuite.js",
|
||||
"data/index.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-toggle-button",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-toggle-button.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-behaviors",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-button-behavior.html",
|
||||
"paper-checked-element-behavior.html",
|
||||
"paper-inky-focus-behavior.html",
|
||||
"paper-ripple-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-radio-group",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-radio-group.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-selector",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-multi-selectable.html",
|
||||
"iron-selectable.html",
|
||||
"iron-selection.html",
|
||||
"iron-selector.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-form-element-behavior",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-form-element-behavior.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "mocha",
|
||||
srcs = [
|
||||
"mocha.css",
|
||||
"mocha.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "dagre",
|
||||
srcs = [
|
||||
"dist/dagre.core.js",
|
||||
"dist/dagre.core.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-behaviors",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-button-state.html",
|
||||
"iron-control-state.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "graphlib",
|
||||
srcs = [
|
||||
"dist/graphlib.core.js",
|
||||
"dist/graphlib.core.min.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-collapse",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-collapse.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-checkbox",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"metadata.html",
|
||||
"paper-checkbox.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-radio-button",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-radio-button.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-header-panel",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-header-panel.css",
|
||||
"paper-header-panel.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "prism-element",
|
||||
srcs = [
|
||||
"prism-highlighter.html",
|
||||
"prism-import.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "chai",
|
||||
srcs = [
|
||||
"chai.js",
|
||||
"karma.conf.js",
|
||||
"karma.sauce.js",
|
||||
"sauce.browsers.js",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-menu-button",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-menu-button.html",
|
||||
"paper-menu-button-animations.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "polymer",
|
||||
srcs = [
|
||||
"polymer.html",
|
||||
"polymer-micro.html",
|
||||
"polymer-mini.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "paper-ripple",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"paper-ripple.html",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "iron-iconset-svg",
|
||||
srcs = [
|
||||
"index.html",
|
||||
"iron-iconset-svg.html",
|
||||
],
|
||||
)
|
82
configure
vendored
Executable file
@ -0,0 +1,82 @@
|
||||
#!/bin/bash
|
||||
|
||||
## Set up CUDA-related environment settings
|
||||
|
||||
while [ "$TF_NEED_CUDA" == "" ]; do
|
||||
read -p "Do you wish to bulid TensorFlow with GPU support? [y/n] " INPUT
|
||||
case $INPUT in
|
||||
[Yy]* ) echo -e "GPU support will be enabled for TensorFlow\n"; TF_NEED_CUDA=1;;
|
||||
[Nn]* ) echo -e "No GPU support will be enabled for TensorFlow\n"; TF_NEED_CUDA=0;;
|
||||
* ) echo "Invalid selection: " $INPUT;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [ "$TF_NEED_CUDA" == "0" ]; then
|
||||
echo "Configuration finished"
|
||||
exit
|
||||
fi
|
||||
|
||||
# Find out where the CUDA toolkit is installed
|
||||
while true; do
|
||||
fromuser=""
|
||||
if [ -z "$CUDA_TOOLKIT_PATH" ]; then
|
||||
default_cuda_path=/usr/local/cuda
|
||||
read -p "Please specify the location where CUDA 7.0 toolkit is installed. Refer to README.md for more details. [Default is $default_cuda_path]: " CUDA_TOOLKIT_PATH
|
||||
fromuser="1"
|
||||
if [ -z "$CUDA_TOOLKIT_PATH" ]; then
|
||||
CUDA_TOOLKIT_PATH=$default_cuda_path
|
||||
fi
|
||||
fi
|
||||
if [ -e "$CUDA_TOOLKIT_PATH/lib64/libcudart.so.7.0" ]; then
|
||||
break
|
||||
fi
|
||||
echo "Invalid path to CUDA 7.0 toolkit. ${CUDA_TOOLKIT_PATH}/lib64/libcudart.so.7.0 cannot be found"
|
||||
if [ -z "$fromuser" ]; then
|
||||
exit 1
|
||||
fi
|
||||
CUDA_TOOLKIT_PATH=""
|
||||
# Retry
|
||||
done
|
||||
|
||||
# Find out where the CUDNN library is installed
|
||||
while true; do
|
||||
fromuser=""
|
||||
if [ -z "$CUDNN_INSTALL_PATH" ]; then
|
||||
default_cudnn_path=${CUDA_TOOLKIT_PATH}
|
||||
read -p "Please specify the location where CUDNN 6.5 V2 library is installed. Refer to README.md for more details. [Default is $default_cudnn_path]: " CUDNN_INSTALL_PATH
|
||||
fromuser="1"
|
||||
if [ -z "$CUDNN_INSTALL_PATH" ]; then
|
||||
CUDNN_INSTALL_PATH=$default_cudnn_path
|
||||
fi
|
||||
# The result returned from "read" is used unexpanded, which makes "~" unusable.
|
||||
# We go through one more level of expansion to handle that.
|
||||
CUDNN_INSTALL_PATH=$(bash -c "readlink -f $CUDNN_INSTALL_PATH")
|
||||
fi
|
||||
if [ -e "$CUDNN_INSTALL_PATH/libcudnn.so.6.5" -o -e "$CUDNN_INSTALL_PATH/lib64/libcudnn.so.6.5" ]; then
|
||||
break
|
||||
fi
|
||||
echo "Invalid path to CUDNN 6.5 V2 toolkit. Neither of the following two files can be found:"
|
||||
echo "$CUDNN_INSTALL_PATH/lib64/libcudnn.so.6.5"
|
||||
echo "$CUDNN_INSTALL_PATH/libcudnn.so.6.5"
|
||||
if [ -z "$fromuser" ]; then
|
||||
exit 1
|
||||
fi
|
||||
CUDNN_INSTALL_PATH=""
|
||||
# Retry
|
||||
done
|
||||
|
||||
cat > third_party/gpus/cuda/cuda.config <<EOF
|
||||
# CUDA_TOOLKIT_PATH refers to the CUDA toolkit. TensorFlow requires CUDA 7.0
|
||||
# at the moment.
|
||||
CUDA_TOOLKIT_PATH="$CUDA_TOOLKIT_PATH"
|
||||
|
||||
# CUDNN_INSTALL_PATH refers to the CUDNN installation. The CUDNN header and library
|
||||
# files can either be in this directory, or under include/ and lib64/
|
||||
# directories separately.
|
||||
CUDNN_INSTALL_PATH="$CUDNN_INSTALL_PATH"
|
||||
EOF
|
||||
|
||||
# Invoke cuda_config.sh and set up TensorFlow's canonical view of the CUDA libraries
|
||||
(cd third_party/gpus/cuda; ./cuda_config.sh;) || exit -1
|
||||
|
||||
echo "Configuration finished"
|
23
gmock.BUILD
Normal file
@ -0,0 +1,23 @@
|
||||
cc_library(
|
||||
name = "gtest",
|
||||
srcs = [
|
||||
"gmock-1.7.0/gtest/src/gtest-all.cc",
|
||||
"gmock-1.7.0/src/gmock-all.cc",
|
||||
],
|
||||
includes = [
|
||||
"gmock-1.7.0",
|
||||
"gmock-1.7.0/gtest",
|
||||
"gmock-1.7.0/gtest/include",
|
||||
"gmock-1.7.0/include",
|
||||
],
|
||||
linkopts = ["-pthread"],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "gtest_main",
|
||||
srcs = ["gmock-1.7.0/src/gmock_main.cc"],
|
||||
linkopts = ["-pthread"],
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [":gtest"],
|
||||
)
|
1
google/protobuf
Submodule
@ -0,0 +1 @@
|
||||
Subproject commit 55ad57a235c009d0414aed1781072adda0c89137
|
83
jpeg.BUILD
Normal file
@ -0,0 +1,83 @@
|
||||
SOURCES = [
|
||||
"jaricom.c",
|
||||
"jcapimin.c",
|
||||
"jcapistd.c",
|
||||
"jcarith.c",
|
||||
"jccoefct.c",
|
||||
"jccolor.c",
|
||||
"jcdctmgr.c",
|
||||
"jchuff.c",
|
||||
"jcinit.c",
|
||||
"jcmainct.c",
|
||||
"jcmarker.c",
|
||||
"jcmaster.c",
|
||||
"jcomapi.c",
|
||||
"jcparam.c",
|
||||
"jcprepct.c",
|
||||
"jcsample.c",
|
||||
"jctrans.c",
|
||||
"jdarith.c",
|
||||
"jdapimin.c",
|
||||
"jdapistd.c",
|
||||
"jdatadst.c",
|
||||
"jdatasrc.c",
|
||||
"jdcoefct.c",
|
||||
"jdcolor.c",
|
||||
"jddctmgr.c",
|
||||
"jdhuff.c",
|
||||
"jdinput.c",
|
||||
"jdmainct.c",
|
||||
"jdmarker.c",
|
||||
"jdmaster.c",
|
||||
"jdmerge.c",
|
||||
"jdpostct.c",
|
||||
"jdsample.c",
|
||||
"jdtrans.c",
|
||||
"jerror.c",
|
||||
"jfdctflt.c",
|
||||
"jfdctfst.c",
|
||||
"jfdctint.c",
|
||||
"jidctflt.c",
|
||||
"jidctfst.c",
|
||||
"jidctint.c",
|
||||
"jmemmgr.c",
|
||||
"jmemnobs.c",
|
||||
"jquant1.c",
|
||||
"jquant2.c",
|
||||
"jutils.c",
|
||||
]
|
||||
|
||||
HEADERS = [
|
||||
"cderror.h",
|
||||
"cdjpeg.h",
|
||||
"jconfig.h",
|
||||
"jdct.h",
|
||||
"jerror.h",
|
||||
"jinclude.h",
|
||||
"jmemsys.h",
|
||||
"jmorecfg.h",
|
||||
"jpegint.h",
|
||||
"jpeglib.h",
|
||||
"jversion.h",
|
||||
"transupp.h",
|
||||
]
|
||||
|
||||
prefix_dir = "jpeg-9a"
|
||||
|
||||
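# Runs libjpeg's ./configure in a throwaway temp directory and copies back only
# the generated jconfig.h, which the cc_library below then includes.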
genrule(
|
||||
name = "configure",
|
||||
srcs = glob(
|
||||
["**/*"],
|
||||
exclude = [prefix_dir + "/jconfig.h"],
|
||||
),
|
||||
outs = [prefix_dir + "/jconfig.h"],
|
||||
cmd = "pushd external/jpeg_archive/%s; workdir=$$(mktemp -d -t tmp.XXXXXXXXXX); cp -a * $$workdir; pushd $$workdir; ./configure; popd; popd; cp $$workdir/jconfig.h $(@D); rm -rf $$workdir;" % prefix_dir,
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "jpeg",
|
||||
srcs = [prefix_dir + "/" + source for source in SOURCES],
|
||||
hdrs = glob(["**/*.h"]) + [":configure"],
|
||||
includes = [prefix_dir],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
40
png.BUILD
Normal file
@ -0,0 +1,40 @@
|
||||
package(default_visibility = ["//visibility:public"])
|
||||
|
||||
prefix_dir = "libpng-1.2.53"
|
||||
|
||||
PNG_SOURCES = [
|
||||
"png.c",
|
||||
"pngerror.c",
|
||||
"pngget.c",
|
||||
"pngmem.c",
|
||||
"pngpread.c",
|
||||
"pngread.c",
|
||||
"pngrio.c",
|
||||
"pngrtran.c",
|
||||
"pngrutil.c",
|
||||
"pngset.c",
|
||||
"pngtrans.c",
|
||||
"pngwio.c",
|
||||
"pngwrite.c",
|
||||
"pngwtran.c",
|
||||
"pngwutil.c",
|
||||
]
|
||||
|
||||
genrule(
|
||||
name = "configure",
|
||||
srcs = glob(
|
||||
["**/*"],
|
||||
exclude = [prefix_dir + "/config.h"],
|
||||
),
|
||||
outs = [prefix_dir + "/config.h"],
|
||||
cmd = "pushd external/png_archive/%s; workdir=$$(mktemp -d -t tmp.XXXXXXXXXX); cp -a * $$workdir; pushd $$workdir; ./configure --enable-shared=no --with-pic=no; popd; popd; cp $$workdir/config.h $(@D); rm -rf $$workdir;" % prefix_dir,
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "png",
|
||||
srcs = [prefix_dir + "/" + source for source in PNG_SOURCES],
|
||||
hdrs = glob(["**/*.h"]) + [":configure"],
|
||||
includes = [prefix_dir],
|
||||
linkopts = ["-lz"],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
12
six.BUILD
Normal file
@ -0,0 +1,12 @@
|
||||
genrule(
|
||||
name = "copy_six",
|
||||
srcs = ["six-1.10.0/six.py"],
|
||||
outs = ["six.py"],
|
||||
cmd = "cp $< $(@)",
|
||||
)
|
||||
|
||||
py_library(
|
||||
name = "six",
|
||||
srcs = ["six.py"],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
43
tensorflow/BUILD
Normal file
@ -0,0 +1,43 @@
|
||||
# Description:
|
||||
# TensorFlow is a computational framework, primarily for use in machine
|
||||
# learning applications.
|
||||
|
||||
package(default_visibility = [":internal"])
|
||||
|
||||
licenses(["notice"]) # Apache 2.0
|
||||
|
||||
exports_files([
|
||||
"LICENSE",
|
||||
"ACKNOWLEDGMENTS",
|
||||
])
|
||||
|
||||
package_group(
|
||||
name = "internal",
|
||||
packages = ["//tensorflow/..."],
|
||||
)
|
||||
|
||||
sh_binary(
|
||||
name = "swig",
|
||||
srcs = ["tools/swig/swig.sh"],
|
||||
data = glob(["tools/swig/**"]),
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "all_files",
|
||||
srcs = glob(
|
||||
["**/*"],
|
||||
exclude = [
|
||||
"**/METADATA",
|
||||
"**/OWNERS",
|
||||
"g3doc/sitemap.md",
|
||||
],
|
||||
),
|
||||
visibility = ["//tensorflow:__subpackages__"],
|
||||
)
|
||||
|
||||
py_library(
|
||||
name = "tensorflow_py",
|
||||
srcs = ["__init__.py"],
|
||||
visibility = ["//visibility:public"],
|
||||
deps = ["//tensorflow/python"],
|
||||
)
|
4
tensorflow/__init__.py
Normal file
@ -0,0 +1,4 @@
|
||||
# Bring in all of the public TensorFlow interface into this
|
||||
# module.
|
||||
# pylint: disable=wildcard-import
|
||||
from tensorflow.python import *
|
89
tensorflow/cc/BUILD
Normal file
@ -0,0 +1,89 @@
|
||||
# Description:
|
||||
# TensorFlow is a computational framework, primarily for use in machine
|
||||
# learning applications.
|
||||
|
||||
package(default_visibility = ["//tensorflow:internal"])
|
||||
|
||||
licenses(["notice"]) # Apache 2.0
|
||||
|
||||
exports_files(["LICENSE"])
|
||||
|
||||
load("/tensorflow/tensorflow", "tf_copts")
|
||||
load("/tensorflow/tensorflow", "tf_gen_op_wrappers_cc")
|
||||
|
||||
cc_library(
|
||||
name = "cc_op_gen_main",
|
||||
srcs = [
|
||||
"ops/cc_op_gen.cc",
|
||||
"ops/cc_op_gen_main.cc",
|
||||
],
|
||||
hdrs = ["ops/cc_op_gen.h"],
|
||||
copts = tf_copts(),
|
||||
deps = [
|
||||
"//tensorflow/core:framework",
|
||||
],
|
||||
)
|
||||
|
||||
# Generates a library that contains C++ wrappers for ops.
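# Each name in op_lib_names is expected to yield a generated ops/<name>.h and
# ops/<name>.cc wrapper pair, produced by the cc_op_gen tooling above.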
|
||||
tf_gen_op_wrappers_cc(
|
||||
name = "cc_ops",
|
||||
op_lib_names = [
|
||||
"array_ops",
|
||||
"attention_ops",
|
||||
"candidate_sampling_ops",
|
||||
"control_flow_ops",
|
||||
"data_flow_ops",
|
||||
"image_ops",
|
||||
"io_ops",
|
||||
"linalg_ops",
|
||||
"logging_ops",
|
||||
"math_ops",
|
||||
"nn_ops",
|
||||
"no_op",
|
||||
"parsing_ops",
|
||||
"random_ops",
|
||||
"sendrecv_ops",
|
||||
"sparse_ops",
|
||||
"state_ops",
|
||||
"string_ops",
|
||||
"summary_ops",
|
||||
"training_ops",
|
||||
"user_ops",
|
||||
],
|
||||
other_hdrs = [
|
||||
"ops/const_op.h",
|
||||
"ops/standard_ops.h",
|
||||
],
|
||||
other_srcs = [
|
||||
"ops/const_op.cc",
|
||||
] + glob(["ops/*_grad.cc"]),
|
||||
pkg = "//tensorflow/core",
|
||||
)
|
||||
|
||||
cc_binary(
|
||||
name = "tutorials_example_trainer",
|
||||
srcs = ["tutorials/example_trainer.cc"],
|
||||
copts = tf_copts(),
|
||||
linkopts = [
|
||||
"-lpthread",
|
||||
"-lm",
|
||||
],
|
||||
deps = [
|
||||
":cc_ops",
|
||||
"//tensorflow/core:kernels",
|
||||
"//tensorflow/core:local",
|
||||
"//tensorflow/core:tensorflow",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "all_files",
|
||||
srcs = glob(
|
||||
["**/*"],
|
||||
exclude = [
|
||||
"**/METADATA",
|
||||
"**/OWNERS",
|
||||
],
|
||||
),
|
||||
visibility = ["//tensorflow:__subpackages__"],
|
||||
)
|
32
tensorflow/cc/ops/array_grad.cc
Normal file
@ -0,0 +1,32 @@
|
||||
#include "tensorflow/core/framework/function.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
typedef FunctionDefHelper FDH;
|
||||
|
||||
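// Shape, Rank, and Size return integer metadata about their input rather than
// values computed from it, so no gradient is defined for them.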
REGISTER_OP_NO_GRADIENT("Shape");
|
||||
REGISTER_OP_NO_GRADIENT("Rank");
|
||||
REGISTER_OP_NO_GRADIENT("Size");
|
||||
|
||||
Status ReshapeGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x: T", "shape: int32", "dy: T"},
|
||||
// Ret val defs
|
||||
{"dx: T", "dshape: int32"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
{
|
||||
{{"x_shape"}, "Shape", {"x"}, {{"T", "$T"}}},
|
||||
{{"dx"}, "Reshape", {"dy", "x_shape"}, {{"T", "$T"}}},
|
||||
{{"dshape"}, "ZerosLike", {"shape"}, {{"T", DT_INT32}}},
|
||||
});
|
||||
// clang-format on
|
||||
return Status::OK();
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Reshape", ReshapeGrad);
|
||||
|
||||
} // end namespace tensorflow
|
350
tensorflow/cc/ops/cc_op_gen.cc
Normal file
@ -0,0 +1,350 @@
|
||||
// TODO(josh11b): Rewrite function parameter names to avoid C++ keywords
|
||||
// or "opts".
|
||||
|
||||
#include "tensorflow/cc/ops/cc_op_gen.h"
|
||||
|
||||
#include <unordered_map>
|
||||
#include "tensorflow/core/framework/attr_value_util.h"
|
||||
#include "tensorflow/core/framework/op_def.pb.h"
|
||||
#include "tensorflow/core/framework/op_def_util.h"
|
||||
#include "tensorflow/core/framework/op_gen_lib.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/lib/gtl/map_util.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/env.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
const int kRightMargin = 79;
|
||||
|
||||
const char* AttrTypeName(StringPiece attr_type) {
|
||||
static const char* kAttrTypeName[][2] = {
|
||||
{"string", "StringPiece"},
|
||||
{"list(string)", "gtl::ArraySlice<string>"},
|
||||
{"int", "int64"},
|
||||
{"list(int)", "gtl::ArraySlice<int>"},
|
||||
{"float", "float"},
|
||||
{"list(float)", "gtl::ArraySlice<float>"},
|
||||
{"bool", "bool"},
|
||||
{"list(bool)", "gtl::ArraySlice<bool>"},
|
||||
{"type", "DataType"},
|
||||
{"list(type)", "DataTypeSlice"},
|
||||
{"shape", "TensorShape"},
|
||||
{"list(shape)", "gtl::ArraySlice<TensorShape>"},
|
||||
{"tensor", "const Tensor&"},
|
||||
{"list(tensor)", "gtl::ArraySlice<Tensor>"},
|
||||
{"func", "const NameAttrList&"},
|
||||
};
|
||||
for (size_t i = 0; i < TF_ARRAYSIZE(kAttrTypeName); ++i) {
|
||||
if (attr_type == kAttrTypeName[i][0]) {
|
||||
return kAttrTypeName[i][1];
|
||||
}
|
||||
}
|
||||
LOG(FATAL) << "Unsupported Attr type: " << attr_type;
|
||||
return "";
|
||||
}
|
||||
|
||||
// Change: Into:
|
||||
// ABC // ABC
|
||||
// //
|
||||
// DEF // DEF
|
||||
string MakeComment(StringPiece text) {
|
||||
string ret;
|
||||
while (!text.empty()) {
|
||||
int last_non_space = -1;
|
||||
int newline;
|
||||
for (newline = 0; newline < static_cast<int>(text.size()); ++newline) {
|
||||
if (text[newline] == '\n') break;
|
||||
if (text[newline] != ' ') last_non_space = newline;
|
||||
}
|
||||
if (last_non_space == -1) {
|
||||
strings::StrAppend(&ret, "//\n");
|
||||
} else {
|
||||
strings::StrAppend(&ret, "// ", text.substr(0, last_non_space + 1), "\n");
|
||||
}
|
||||
text.remove_prefix(newline + 1);
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
void WriteCCOp(const OpDef& op_def, WritableFile* h, WritableFile* cc) {
|
||||
// TODO(josh11b): Better wrapping of comments.
|
||||
string comment;
|
||||
if (op_def.summary().empty()) {
|
||||
comment = "TODO: add doc.\n";
|
||||
} else {
|
||||
comment = strings::StrCat(op_def.summary(), "\n");
|
||||
if (!op_def.description().empty()) {
|
||||
strings::StrAppend(&comment, "\n", op_def.description(), "\n");
|
||||
}
|
||||
}
|
||||
|
||||
static const string kSingleInputType = "NodeOut";
|
||||
static const string kListInputType = "gtl::ArraySlice<NodeOut>";
|
||||
|
||||
std::vector<string> arg_types;
|
||||
std::vector<string> arg_names;
|
||||
|
||||
strings::StrAppend(&comment, "\nArguments:\n");
|
||||
|
||||
// Map from attr name to the first input arg it is inferred from.
|
||||
std::unordered_map<string, string> inferred_attrs;
|
||||
for (int i = 0; i < op_def.input_arg_size(); ++i) {
|
||||
const auto& arg(op_def.input_arg(i));
|
||||
arg_names.emplace_back(arg.name());
|
||||
bool is_list = false;
|
||||
|
||||
if (!arg.type_attr().empty()) {
|
||||
gtl::InsertIfNotPresent(&inferred_attrs, arg.type_attr(), arg.name());
|
||||
} else if (!arg.type_list_attr().empty()) {
|
||||
gtl::InsertIfNotPresent(&inferred_attrs, arg.type_list_attr(),
|
||||
arg.name());
|
||||
is_list = true;
|
||||
}
|
||||
if (!arg.number_attr().empty()) {
|
||||
gtl::InsertIfNotPresent(&inferred_attrs, arg.number_attr(), arg.name());
|
||||
is_list = true;
|
||||
}
|
||||
if (is_list) {
|
||||
arg_types.emplace_back(kListInputType);
|
||||
} else {
|
||||
arg_types.emplace_back(kSingleInputType);
|
||||
}
|
||||
|
||||
// TODO(josh11b): Include input type information.
|
||||
StringPiece description = arg.description();
|
||||
if (!description.empty()) {
|
||||
ConsumeEquals(&description);
|
||||
strings::StrAppend(&comment, "* ", arg_names.back(), ": ",
|
||||
arg.description(), "\n");
|
||||
}
|
||||
}
|
||||
|
||||
string options_comment;
|
||||
for (int i = 0; i < op_def.attr_size(); ++i) {
|
||||
const auto& attr(op_def.attr(i));
|
||||
// Do not add inferred attrs or attrs with defaults to the C++
|
||||
// function signature.
|
||||
if (inferred_attrs.find(attr.name()) == inferred_attrs.end()) {
|
||||
if (!attr.has_default_value()) {
|
||||
arg_names.emplace_back(attr.name());
|
||||
arg_types.emplace_back(AttrTypeName(attr.type()));
|
||||
if (!attr.description().empty()) {
|
||||
strings::StrAppend(&comment, "* ", arg_names.back(), ": ",
|
||||
attr.description(), "\n");
|
||||
}
|
||||
} else {
|
||||
strings::StrAppend(&options_comment, " .WithAttr(\"", attr.name(),
|
||||
"\", ", AttrTypeName(attr.type()), "): Defaults to ",
|
||||
SummarizeAttrValue(attr.default_value()), ".\n");
|
||||
if (!attr.description().empty()) {
|
||||
strings::StrAppend(&options_comment, " ", attr.description(),
|
||||
"\n");
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
CHECK_EQ(arg_names.size(), arg_types.size());
|
||||
strings::StrAppend(&comment, "* opts:\n", options_comment,
|
||||
R"comment( .WithName(StringPiece): Set the Node's name
|
||||
.WithDevice(StringPiece): Set the Node's requested device
|
||||
.WithControlInput(Node*) / .WithControlInputs({Node*, ...}):
|
||||
Add control dependencies on the specified Node(s).
|
||||
|
||||
Returns a pointer to the created Node)comment");
|
||||
|
||||
// TODO(josh11b): Include output type information.
|
||||
if (op_def.output_arg_size() == 0) {
|
||||
strings::StrAppend(&comment, ".\n");
|
||||
} else if (op_def.output_arg_size() == 1) {
|
||||
StringPiece description = op_def.output_arg(0).description();
|
||||
ConsumeEquals(&description);
|
||||
if (description.empty()) {
|
||||
strings::StrAppend(&comment, ".\n");
|
||||
} else {
|
||||
strings::StrAppend(&comment, ", with output:\n", description, "\n");
|
||||
}
|
||||
} else {
|
||||
strings::StrAppend(&comment, ", with outputs:\n");
|
||||
for (int o = 0; o < op_def.output_arg_size(); ++o) {
|
||||
StringPiece description = op_def.output_arg(o).description();
|
||||
ConsumeEquals(&description);
|
||||
if (description.empty()) {
|
||||
strings::StrAppend(&comment, "* ", op_def.output_arg(o).name(), "\n");
|
||||
} else {
|
||||
strings::StrAppend(&comment, "* ", op_def.output_arg(o).name(), ": ",
|
||||
description, "\n");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Write the header comment.
|
||||
TF_CHECK_OK(h->Append(MakeComment(comment)));
|
||||
|
||||
// Declare the function wrapper.
|
||||
const string prefix = strings::StrCat("Node* ", op_def.name(), "(");
|
||||
string h_rest;
|
||||
for (size_t i = 0; i < arg_names.size(); ++i) {
|
||||
strings::StrAppend(&h_rest, arg_types[i], " ", arg_names[i], ", ");
|
||||
}
|
||||
strings::StrAppend(&h_rest, "const GraphDefBuilder::Options& opts");
|
||||
string cc_decl = h_rest;
|
||||
strings::StrAppend(&h_rest, ");");
|
||||
TF_CHECK_OK(h->Append(WordWrap(prefix, h_rest, kRightMargin) + "\n\n"));
|
||||
|
||||
// Define the function wrapper.
|
||||
strings::StrAppend(&cc_decl, ") {");
|
||||
TF_CHECK_OK(cc->Append(WordWrap(prefix, cc_decl, kRightMargin) + "\n"));
|
||||
const string op_name = strings::StrCat(" static const string kOpName = \"",
|
||||
op_def.name(), "\";\n");
|
||||
|
||||
if (arg_types.empty()) {
|
||||
TF_CHECK_OK(cc->Append(op_name));
|
||||
TF_CHECK_OK(cc->Append(" return SourceOp(kOpName, opts);\n}\n\n"));
|
||||
} else if (arg_types == std::vector<string>({kSingleInputType})) {
|
||||
TF_CHECK_OK(cc->Append(op_name));
|
||||
TF_CHECK_OK(cc->Append(strings::StrCat(" return UnaryOp(kOpName, ",
|
||||
arg_names[0], ", opts);\n}\n\n")));
|
||||
} else if (arg_types ==
|
||||
std::vector<string>({kSingleInputType, kSingleInputType})) {
|
||||
TF_CHECK_OK(cc->Append(op_name));
|
||||
// TODO(josh11b): Word wrap this if it ever becomes necessary.
|
||||
TF_CHECK_OK(
|
||||
cc->Append(strings::StrCat(" return BinaryOp(kOpName, ", arg_names[0],
|
||||
", ", arg_names[1], ", opts);\n}\n\n")));
|
||||
} else {
|
||||
TF_CHECK_OK(cc->Append(" if (opts.HaveError()) return nullptr;\n"));
|
||||
TF_CHECK_OK(cc->Append(op_name));
|
||||
TF_CHECK_OK(cc->Append(
|
||||
" NodeBuilder node_builder(opts.GetNameForOp(kOpName), kOpName,\n"
|
||||
" opts.op_registry());\n"));
|
||||
for (size_t i = 0; i < arg_names.size(); ++i) {
|
||||
if (i < static_cast<size_t>(op_def.input_arg_size())) {
|
||||
TF_CHECK_OK(cc->Append(
|
||||
strings::StrCat(" node_builder.Input(", arg_names[i], ");\n")));
|
||||
} else {
|
||||
TF_CHECK_OK(
|
||||
cc->Append(strings::StrCat(" node_builder.Attr(\"", arg_names[i],
|
||||
"\", ", arg_names[i], ");\n")));
|
||||
}
|
||||
}
|
||||
TF_CHECK_OK(
|
||||
cc->Append(" return opts.FinalizeBuilder(&node_builder);\n"
|
||||
"}\n\n"));
|
||||
}
|
||||
}
|
||||
|
||||
// Converts:
|
||||
// bazel-out/.../genfiles/XX
|
||||
// to: XX.
|
||||
string GetPath(const std::string& dot_h_fname) {
|
||||
auto pos = dot_h_fname.find("/genfiles/");
|
||||
if (pos == string::npos) return dot_h_fname;
|
||||
// - 1 accounts for the terminating null character (\0) in "/genfiles/".
|
||||
return dot_h_fname.substr(pos + sizeof("/genfiles/") - 1);
|
||||
}
|
||||
|
||||
// Converts:
|
||||
// cc/ops/gen_foo_ops.h
|
||||
// to:
|
||||
// CC_OPS_GEN_FOO_OPS_H_
|
||||
string ToGuard(const std::string& path) {
|
||||
string guard;
|
||||
guard.reserve(path.size() + 1); // + 1 -> trailing _
|
||||
for (const char c : path) {
|
||||
if (c >= 'A' && c <= 'Z') {
|
||||
guard += c;
|
||||
} else if (c >= 'a' && c <= 'z') {
|
||||
guard += c + 'A' - 'a';
|
||||
} else {
|
||||
guard += '_';
|
||||
}
|
||||
}
|
||||
guard += '_';
|
||||
return guard;
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
void WriteCCOps(const OpList& ops, const std::string& dot_h_fname,
|
||||
const std::string& dot_cc_fname) {
|
||||
Env* env = Env::Default();
|
||||
WritableFile* h = nullptr;
|
||||
WritableFile* cc = nullptr;
|
||||
TF_CHECK_OK(env->NewWritableFile(dot_h_fname, &h));
|
||||
TF_CHECK_OK(env->NewWritableFile(dot_cc_fname, &cc));
|
||||
|
||||
// .h Header
|
||||
const string include = GetPath(dot_h_fname);
|
||||
const string guard = ToGuard(include);
|
||||
// TODO(josh11b): Mention the library for which wrappers are being generated.
|
||||
Status s;
|
||||
s = h->Append(
|
||||
strings::StrCat("// This file is MACHINE GENERATED! Do not edit.\n\n"
|
||||
"#ifndef ",
|
||||
guard,
|
||||
"\n"
|
||||
"#define ",
|
||||
guard, R"header(
|
||||
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/graph/graph_def_builder.h"
|
||||
#include "tensorflow/core/lib/gtl/array_slice.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/public/tensor_shape.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace ops {
|
||||
|
||||
// These add a node to the graph from opts.
|
||||
//
|
||||
// Note for "NodeOut" inputs, you will typically either pass
|
||||
// * a {Node*, int index} (to pass the index-th output of that node), or
|
||||
// * a Node* (to pass the first output of that node).
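// Illustrative only (the op and variable names here are hypothetical):
//   Node* a = ...; Node* b = ...;
//   ops::Add(a, {b, 1}, opts);  // uses a's output 0 and b's output 1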
|
||||
|
||||
|
||||
)header"));
|
||||
TF_CHECK_OK(s);
|
||||
// .cc Header
|
||||
s = cc->Append(
|
||||
strings::StrCat("// This file is MACHINE GENERATED! Do not edit.\n\n"
|
||||
"#include \"",
|
||||
include, R"header("
|
||||
|
||||
#include "tensorflow/core/graph/node_builder.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace ops {
|
||||
|
||||
)header"));
|
||||
TF_CHECK_OK(s);
|
||||
|
||||
for (const auto& op_def : ops.op()) {
|
||||
WriteCCOp(op_def, h, cc);
|
||||
}
|
||||
|
||||
// .h Footer
|
||||
|
||||
s = h->Append(strings::StrCat(R"footer(} // namespace ops
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // )footer",
|
||||
guard, "\n"));
|
||||
TF_CHECK_OK(s);
|
||||
|
||||
// .cc Footer
|
||||
|
||||
s = cc->Append(R"footer(} // namespace ops
|
||||
} // namespace tensorflow
|
||||
)footer");
|
||||
TF_CHECK_OK(s);
|
||||
|
||||
TF_CHECK_OK(cc->Close());
|
||||
TF_CHECK_OK(h->Close());
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
14
tensorflow/cc/ops/cc_op_gen.h
Normal file
@ -0,0 +1,14 @@
|
||||
#ifndef TENSORFLOW_CC_OPS_CC_OP_GEN_H_
|
||||
#define TENSORFLOW_CC_OPS_CC_OP_GEN_H_
|
||||
|
||||
#include "tensorflow/core/framework/op_def.pb.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Result is written to files dot_h and dot_cc.
|
||||
void WriteCCOps(const OpList& ops, const std::string& dot_h_fname,
|
||||
const std::string& dot_cc_fname);
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_CC_OPS_CC_OP_GEN_H_
|
34
tensorflow/cc/ops/cc_op_gen_main.cc
Normal file
@ -0,0 +1,34 @@
|
||||
#include "tensorflow/cc/ops/cc_op_gen.h"
|
||||
#include "tensorflow/core/framework/op.h"
|
||||
#include "tensorflow/core/framework/op_def.pb.h"
|
||||
#include "tensorflow/core/lib/core/stringpiece.h"
|
||||
#include "tensorflow/core/platform/init_main.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
void PrintAllCCOps(const std::string& dot_h, const std::string& dot_cc,
|
||||
bool include_internal) {
|
||||
OpList ops;
|
||||
OpRegistry::Global()->Export(include_internal, &ops);
|
||||
WriteCCOps(ops, dot_h, dot_cc);
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
||||
|
||||
int main(int argc, char* argv[]) {
|
||||
tensorflow::port::InitMain(argv[0], &argc, &argv);
|
||||
if (argc != 4) {
|
||||
fprintf(stderr,
|
||||
"Usage: %s out.h out.cc include_internal\n"
|
||||
" include_internal: 1 means include internal ops\n",
|
||||
argv[0]);
|
||||
exit(1);
|
||||
}
|
||||
|
||||
bool include_internal = tensorflow::StringPiece("1") == argv[3];
|
||||
tensorflow::PrintAllCCOps(argv[1], argv[2], include_internal);
|
||||
return 0;
|
||||
}
|
113
tensorflow/cc/ops/const_op.cc
Normal file
@ -0,0 +1,113 @@
|
||||
#include "tensorflow/cc/ops/const_op.h"
|
||||
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/graph/node_builder.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace ops {
|
||||
|
||||
namespace {
|
||||
const string& OpName() {
|
||||
static const string kOpName = "Const";
|
||||
return kOpName;
|
||||
}
|
||||
} // namespace
|
||||
|
||||
#define DEFINE_CONST_SCALAR(TYPE) \
|
||||
Node* Const(TYPE s, const GraphDefBuilder::Options& options) { \
|
||||
return Const(gtl::ArraySlice<TYPE>(&s, 1), TensorShape({}), options); \
|
||||
}
|
||||
|
||||
#define DEFINE_CONST_VECTOR(TYPE) \
|
||||
Node* Const(gtl::ArraySlice<TYPE> v, \
|
||||
const GraphDefBuilder::Options& options) { \
|
||||
return Const(v, TensorShape({static_cast<int64>(v.size())}), options); \
|
||||
}
|
||||
|
||||
#define DEFINE_CONST_TENSOR(TYPE, ...) \
|
||||
Node* Const(gtl::ArraySlice<TYPE> t, const TensorShape& shape, \
|
||||
const GraphDefBuilder::Options& options) { \
|
||||
if (options.HaveError()) return nullptr; \
|
||||
NodeBuilder node_builder(options.GetNameForOp(OpName()), OpName(), \
|
||||
options.op_registry()); \
|
||||
const DataType dt = DataTypeToEnum<TYPE>::v(); \
|
||||
if (t.size() == 1) { \
|
||||
TensorProto proto; \
|
||||
proto.set_dtype(dt); \
|
||||
shape.AsProto(proto.mutable_tensor_shape()); \
|
||||
__VA_ARGS__; \
|
||||
node_builder.Attr("dtype", dt).Attr("value", proto); \
|
||||
} else { \
|
||||
Tensor tensor(dt, shape); \
|
||||
if (tensor.NumElements() != static_cast<int64>(t.size())) { \
|
||||
options.UpdateStatus(errors::InvalidArgument( \
|
||||
t.size(), " values provided to Const() != ", tensor.NumElements(), \
|
||||
" elements for shape ", shape.ShortDebugString())); \
|
||||
} else { \
|
||||
std::copy_n(t.data(), t.size(), tensor.flat<TYPE>().data()); \
|
||||
node_builder.Attr("dtype", dt).Attr("value", tensor); \
|
||||
} \
|
||||
} \
|
||||
return options.FinalizeBuilder(&node_builder); \
|
||||
}
|
||||
|
||||
#define DEFINE_CONST_IMPL(TYPE, ...) \
|
||||
DEFINE_CONST_SCALAR(TYPE) \
|
||||
DEFINE_CONST_VECTOR(TYPE) \
|
||||
DEFINE_CONST_TENSOR(TYPE, __VA_ARGS__)
|
||||
|
||||
#define DEFINE_CONST(TYPE, FIELD) \
|
||||
DEFINE_CONST_IMPL(TYPE, proto.add_##FIELD(*t.begin());)
|
||||
|
||||
DEFINE_CONST(float, float_val);
|
||||
DEFINE_CONST(double, double_val);
|
||||
DEFINE_CONST(int32, int_val);
|
||||
DEFINE_CONST(uint8, int_val);
|
||||
DEFINE_CONST(int16, int_val);
|
||||
DEFINE_CONST(int8, int_val);
|
||||
DEFINE_CONST(int64, int64_val);
|
||||
DEFINE_CONST(bool, bool_val);
|
||||
|
||||
DEFINE_CONST_IMPL(complex64, proto.add_scomplex_val(t.begin()->real());
|
||||
proto.add_scomplex_val(t.begin()->imag()););
|
||||
|
||||
Node* Const(StringPiece s, const GraphDefBuilder::Options& options) {
|
||||
if (options.HaveError()) return nullptr;
|
||||
NodeBuilder node_builder(options.GetNameForOp(OpName()), OpName(),
|
||||
options.op_registry());
|
||||
TensorProto proto;
|
||||
proto.set_dtype(DT_STRING);
|
||||
TensorShape({}).AsProto(proto.mutable_tensor_shape());
|
||||
proto.add_string_val(s.data(), s.size());
|
||||
node_builder.Attr("dtype", DT_STRING).Attr("value", proto);
|
||||
return options.FinalizeBuilder(&node_builder);
|
||||
}
|
||||
|
||||
DEFINE_CONST_VECTOR(string)
|
||||
DEFINE_CONST_TENSOR(string, proto.add_string_val(*t.begin());)
|
||||
|
||||
#undef DEFINE_CONST
|
||||
#undef DEFINE_CONST_IMPL
|
||||
#undef DEFINE_CONST_TENSOR
|
||||
#undef DEFINE_CONST_VECTOR
|
||||
#undef DEFINE_CONST_SCALAR
|
||||
|
||||
Node* Const(const Tensor& t, const GraphDefBuilder::Options& options) {
|
||||
if (options.HaveError()) return nullptr;
|
||||
NodeBuilder node_builder(options.GetNameForOp(OpName()), OpName(),
|
||||
options.op_registry());
|
||||
node_builder.Attr("dtype", t.dtype()).Attr("value", t);
|
||||
return options.FinalizeBuilder(&node_builder);
|
||||
}
|
||||
|
||||
Node* Const(const TensorProto& proto, const GraphDefBuilder::Options& options) {
|
||||
if (options.HaveError()) return nullptr;
|
||||
NodeBuilder node_builder(options.GetNameForOp(OpName()), OpName(),
|
||||
options.op_registry());
|
||||
node_builder.Attr("dtype", proto.dtype()).Attr("value", proto);
|
||||
return options.FinalizeBuilder(&node_builder);
|
||||
}
|
||||
|
||||
} // namespace ops
|
||||
} // namespace tensorflow
|
70
tensorflow/cc/ops/const_op.h
Normal file
@ -0,0 +1,70 @@
|
||||
#ifndef TENSORFLOW_CC_OPS_CONST_OP_H_
|
||||
#define TENSORFLOW_CC_OPS_CONST_OP_H_
|
||||
|
||||
#include "tensorflow/core/framework/tensor.pb.h"
|
||||
#include "tensorflow/core/graph/graph_def_builder.h"
|
||||
#include "tensorflow/core/lib/gtl/array_slice.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace ops {
|
||||
|
||||
// If a shape is specified, you may either provide the same number of values,
|
||||
// or a single value and that value will be duplicated to fill out the Tensor.
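// Illustrative sketch: Const({1.f}, TensorShape({2, 2}), opts) is expected to
// yield a 2x2 float tensor filled with 1.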
|
||||
#define DECLARE_CONST(TYPE) \
|
||||
Node* Const(TYPE s, const GraphDefBuilder::Options& options); /* Scalar */ \
|
||||
Node* Const(gtl::ArraySlice<TYPE> v, \
|
||||
const GraphDefBuilder::Options& options); /* Vector */ \
|
||||
Node* Const(gtl::ArraySlice<TYPE> t, const TensorShape& shape, \
|
||||
const GraphDefBuilder::Options& options); /* Tensor */ \
|
||||
inline Node* Const(std::initializer_list<TYPE> v, /* Vector using {...} */ \
|
||||
const GraphDefBuilder::Options& options) { \
|
||||
return Const(gtl::ArraySlice<TYPE>(v), options); \
|
||||
} \
|
||||
inline Node* Const(std::initializer_list<TYPE> t, /* Tensor using {...} */ \
|
||||
const TensorShape& shape, \
|
||||
const GraphDefBuilder::Options& options) { \
|
||||
return Const(gtl::ArraySlice<TYPE>(t), shape, options); \
|
||||
}
|
||||
|
||||
DECLARE_CONST(float);
|
||||
DECLARE_CONST(double);
|
||||
DECLARE_CONST(int32);
|
||||
DECLARE_CONST(uint8);
|
||||
DECLARE_CONST(int16);
|
||||
DECLARE_CONST(int8);
|
||||
DECLARE_CONST(complex64);
|
||||
DECLARE_CONST(int64);
|
||||
DECLARE_CONST(bool);
|
||||
|
||||
#undef DECLARE_CONST
|
||||
|
||||
// String
|
||||
Node* Const(StringPiece s, const GraphDefBuilder::Options& options);
|
||||
Node* Const(gtl::ArraySlice<string> v, const GraphDefBuilder::Options& options);
|
||||
Node* Const(gtl::ArraySlice<string> t, const TensorShape& shape,
|
||||
const GraphDefBuilder::Options& options);
|
||||
inline Node* Const(std::initializer_list<string> v,
|
||||
const GraphDefBuilder::Options& options) {
|
||||
return Const(gtl::ArraySlice<string>(v), options);
|
||||
}
|
||||
inline Node* Const(std::initializer_list<string> t, const TensorShape& shape,
|
||||
const GraphDefBuilder::Options& options) {
|
||||
return Const(gtl::ArraySlice<string>(t), shape, options);
|
||||
}
|
||||
|
||||
// A Tensor of any type.
|
||||
Node* Const(const Tensor& t, const GraphDefBuilder::Options& options);
|
||||
Node* Const(const TensorProto& proto, const GraphDefBuilder::Options& options);
|
||||
|
||||
template <class T>
|
||||
Node* EmptyConst(const GraphDefBuilder::Options& options) {
|
||||
return Const(gtl::ArraySlice<T>(), options);
|
||||
}
|
||||
|
||||
// TODO(josh11b): Support other types (e.g. quantized ints, float16).
|
||||
|
||||
} // namespace ops
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_CC_OPS_CONST_OP_H_
|
42
tensorflow/cc/ops/functional_grad.cc
Normal file
@ -0,0 +1,42 @@
|
||||
#include "tensorflow/core/framework/function.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
typedef FunctionDefHelper FDH;
|
||||
|
||||
Status MapAccumulateGrad(const AttrSlice& attrs, FunctionDef* ret) {
|
||||
const NameAttrList* func;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "f", &func));
|
||||
DataType T;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "T", &T));
|
||||
int k;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "K", &k));
|
||||
// The gradient function of f.
|
||||
// f : (K*T, T, T) -> T
|
||||
// g : (K*T, T, T, T) -> (K*T, T, T)
|
||||
auto grad = FDH::FunctionRef("SymbolicGradient",
|
||||
{{"f", *func},
|
||||
{"Tin", std::vector<DataType>(k + 3, T)},
|
||||
{"Tout", std::vector<DataType>(k + 2, T)}});
|
||||
*ret = FDH::Define(
|
||||
// Arg defs
|
||||
{"theta: K*T", "x: T", "u: T", "dy: T"},
|
||||
// Ret val defs
|
||||
{"dtheta: K*T", "dx: T", "du: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// nodes.
|
||||
{{{"y"},
|
||||
"MapAccumulate",
|
||||
{"theta", "x", "u"},
|
||||
{{"f", *func}, {"T", "$T"}, {"K", k}}},
|
||||
{{"dtheta", "dx", "du"},
|
||||
"MapAccumulateGrad",
|
||||
{"theta", "x", "u", "y", "dy"},
|
||||
{{"g", grad}, {"T", "$T"}, {"K", k}}}});
|
||||
return Status::OK();
|
||||
}
|
||||
REGISTER_OP_GRADIENT("MapAccumulate", MapAccumulateGrad);
|
||||
|
||||
} // end namespace tensorflow
|
566
tensorflow/cc/ops/math_grad.cc
Normal file
@ -0,0 +1,566 @@
|
||||
#include "tensorflow/core/framework/function.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
typedef FunctionDefHelper FDH;
|
||||
|
||||
// Cwise unary ops
|
||||
Status GradForUnaryCwise(FunctionDef* g, std::vector<FDH::Node> nodes) {
|
||||
for (auto& n : nodes) {
|
||||
if (n.attr.empty()) {
|
||||
n.attr = {{"T", "$T"}};
|
||||
}
|
||||
}
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x: T", "dy: T"},
|
||||
// Ret val defs
|
||||
{"dx: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
nodes);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status AbsGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"sign"}, "Sign", {"x"}},
|
||||
{{"dx"}, "Mul", {"dy", "sign"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Abs", AbsGrad);
|
||||
|
||||
Status NegGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"dx"}, "Neg", {"dy"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Neg", NegGrad);
|
||||
|
||||
Status InvGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"y"}, "Inv", {"x"}},
|
||||
{{"y2"}, "Square", {"y"}},
|
||||
{{"y2_neg"}, "Neg", {"y2"}},
|
||||
{{"dx"}, "Mul", {"dy", "y2_neg"}}
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Inv", InvGrad);
|
||||
|
||||
Status SquareGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
FDH::Const("c", 2LL),
|
||||
{{"two"}, "Cast", {"c"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}},
|
||||
{{"x2"}, "Mul", {"x", "two"}}, // x * 2
|
||||
{{"dx"}, "Mul", {"dy", "x2"}}, // dy * (x * 2)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Square", SquareGrad);
|
||||
|
||||
Status SqrtGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"y"}, "Sqrt", {"x"}},
|
||||
{{"y_inv"}, "Inv", {"y"}},
|
||||
FDH::Const("const", 0.5f),
|
||||
{{"half"}, "Cast", {"const"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"a"}, "Mul", {"half", "y_inv"}}, // .5 * 1/y
|
||||
{{"dx"}, "Mul", {"dy", "a"}}, // dy * (.5 * 1/y)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sqrt", SqrtGrad);
|
||||
|
||||
Status RsqrtGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"x_inv"}, "Inv", {"x"}},
|
||||
{{"y"}, "Rsqrt", {"x"}},
|
||||
FDH::Const("const", -.5f),
|
||||
{{"neghalf"}, "Cast", {"const"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"a"}, "Mul", {"neghalf", "x_inv"}}, // -0.5 * 1/x
|
||||
{{"b"}, "Mul", {"a", "y"}}, // -0.5 * 1/x * y
|
||||
{{"dx"}, "Mul", {"dy", "b"}}, // dy * (1/y * .5)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Rsqrt", RsqrtGrad);
|
||||
|
||||
Status ExpGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"y"}, "Exp", {"x"}},
|
||||
{{"dx"}, "Mul", {"dy", "y"}}, // dy * y
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Exp", ExpGrad);
|
||||
|
||||
Status LogGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"x_inv"}, "Inv", {"x"}},
|
||||
{{"dx"}, "Mul", {"dy", "x_inv"}}, // dy * 1/x
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Log", LogGrad);
|
||||
|
||||
Status TanhGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"y"}, "Tanh", {"x"}},
|
||||
{{"y2"}, "Square", {"y"}},
|
||||
FDH::Const("const", 1.0f),
|
||||
{{"one"}, "Cast", {"const"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"a"}, "Sub", {"one", "y2"}},
|
||||
{{"dx"}, "Mul", {"dy", "a"}}, // dy * (1 - y*y)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Tanh", TanhGrad);
|
||||
|
||||
Status SigmoidGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"y"}, "Sigmoid", {"x"}},
|
||||
FDH::Const("const", 1.0f),
|
||||
{{"one"}, "Cast", {"const"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"a"}, "Sub", {"one", "y"}},
|
||||
{{"b"}, "Mul", {"y", "a"}}, // y * (1 - y)
|
||||
{{"dx"}, "Mul", {"dy", "b"}}, // dy * y * (1 - y)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sigmoid", SigmoidGrad);
|
||||
|
||||
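// Sign is piecewise constant, so its gradient is identically zero: dx is a
// zero-filled tensor with x's shape.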
Status SignGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"s"}, "Shape", {"x"}},
|
||||
FDH::Const("zero", 0.f),
|
||||
{{"val"}, "Cast", {"zero"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"dx"}, "Fill", {"s", "val"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sign", SignGrad);
|
||||
|
||||
Status SinGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"cos"}, "Cos", {"x"}},
|
||||
{{"dx"}, "Mul", {"dy", "cos"}}, // dy * cos(x)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sin", SinGrad);
|
||||
|
||||
Status CosGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"sin"}, "Sin", {"x"}},
|
||||
{{"neg"}, "Neg", {"sin"}},
|
||||
{{"dx"}, "Mul", {"dy", "neg"}}, // dy * (-sin(x))
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Cos", CosGrad);
|
||||
|
||||
Status RealGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
FDH::Const("zero", 0.f),
|
||||
{{"dx"}, "Complex", {"dy", "zero"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Real", RealGrad);
|
||||
|
||||
Status ImagGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
FDH::Const("zero", 0.f),
|
||||
{{"dx"}, "Complex", {"zero", "dy"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Imag", ImagGrad);
|
||||
|
||||
Status ConjGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForUnaryCwise(g, {
|
||||
{{"dx"}, "Conj", {"dy"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Conj", ConjGrad);
|
||||
|
||||
// Cwise binary ops
|
||||
//
|
||||
// TODO(zhifengc): This can be arranged as a function in the standard
|
||||
// library.
|
||||
Status GradForBinaryCwise(FunctionDef* g, std::vector<FDH::Node> body) {
|
||||
// clang-format off
|
||||
std::vector<FDH::Node> nodes = {
|
||||
{{"sx"}, "Shape", {"x"}},
|
||||
{{"sy"}, "Shape", {"y"}},
|
||||
};
|
||||
nodes.insert(nodes.end(), body.begin(), body.end());
|
||||
std::vector<FDH::Node> reshapes = {
|
||||
{{"sum_gx"}, "Sum", {"gx", "rx"}},
|
||||
{{"dx"}, "Reshape", {"sum_gx", "sx"}},
|
||||
{{"sum_gy"}, "Sum", {"gy", "ry"}},
|
||||
{{"dy"}, "Reshape", {"sum_gy", "sy"}},
|
||||
};
|
||||
nodes.insert(nodes.end(), reshapes.begin(), reshapes.end());
|
||||
|
||||
// clang-format on
|
||||
for (auto& n : nodes) {
|
||||
if (n.attr.empty()) {
|
||||
n.attr = {{"T", "$T"}};
|
||||
}
|
||||
}
|
||||
// "BroadcastGradientArgs" doesn't need any attrs.
|
||||
nodes.push_back({{"rx", "ry"}, "BroadcastGradientArgs", {"sx", "sy"}});
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x: T", "y: T", "dz: T"},
|
||||
// Ret val defs
|
||||
{"dx: T", "dy: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
nodes);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status AddGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"gx"}, "Identity", {"dz"}},
|
||||
{{"gy"}, "Identity", {"dz"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Add", AddGrad);
|
||||
|
||||
Status SubGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"gx"}, "Identity", {"dz"}},
|
||||
{{"gy"}, "Neg", {"dz"}}, // -dz
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sub", SubGrad);
|
||||
|
||||
Status MulGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
DataType T;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "T", &T));
|
||||
if (T == DT_COMPLEX64) {
|
||||
return GradForBinaryCwise(
|
||||
g, {
|
||||
{{"cy"}, "Conj", {"y"}},
|
||||
{{"gx"}, "Mul", {"dz", "cy"}}, // dz * Conj(y)
|
||||
{{"cx"}, "Conj", {"x"}},
|
||||
{{"gy"}, "Mul", {"cx", "dz"}}, // Conj(x) * dz
|
||||
});
|
||||
} else {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"gx"}, "Mul", {"dz", "y"}}, // dz * y
|
||||
{{"gy"}, "Mul", {"x", "dz"}}, // x * dz
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Mul", MulGrad);
|
||||
|
||||
Status DivGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"gx"}, "Div", {"dz", "y"}},
|
||||
{{"nx"}, "Neg", {"x"}},
|
||||
{{"y2"}, "Square", {"y"}},
|
||||
{{"nx_y2"}, "Div", {"nx", "y2"}},
|
||||
{{"gy"}, "Mul", {"dz", "nx_y2"}}, // dz * (- x / y^2)
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Div", DivGrad);
|
||||
|
||||
Status PowGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"z"}, "Pow", {"x", "y"}},
|
||||
// dz * y * Pow(x, y - 1)
|
||||
FDH::Const("const", 1.0f),
|
||||
{{"one"}, "Cast", {"const"}, {{"SrcT", DT_FLOAT}, {"DstT", "$T"}}},
|
||||
{{"t0"}, "Sub", {"y", "one"}},
|
||||
{{"t1"}, "Pow", {"x", "t0"}},
|
||||
{{"t2"}, "Mul", {"dz", "y"}},
|
||||
{{"gx"}, "Mul", {"t1", "t2"}},
|
||||
// dz * z * Log(x)
|
||||
{{"t3"}, "Log", {"x"}},
|
||||
{{"t4"}, "Mul", {"dz", "z"}},
|
||||
{{"gy"}, "Mul", {"t3", "t4"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Pow", PowGrad);
|
||||
|
||||
Status MaximumMinimumGradHelper(const string& comparator,
|
||||
const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"c"}, comparator, {"x", "y"}},
|
||||
{{"mask"}, "Cast", {"c"}, {{"SrcT", DT_BOOL}, {"DstT", "$T"}}},
|
||||
{{"gx"}, "Mul", {"dz", "mask"}},
|
||||
{{"gy"}, "Sub", {"dz", "gx"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
|
||||
Status MaximumGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
return MaximumMinimumGradHelper("GreaterEqual", attrs, g);
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Maximum", MaximumGrad);
|
||||
|
||||
Status MinimumGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
return MaximumMinimumGradHelper("LessEqual", attrs, g);
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Minimum", MinimumGrad);
|
||||
|
||||
Status ComplexGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForBinaryCwise(g, {
|
||||
{{"gx"}, "Real", {"dz"}},
|
||||
{{"gy"}, "Imag", {"dz"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Complex", ComplexGrad);
|
||||
|
||||
// Cwise ternary ops.
|
||||
Status SelectGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
*g = FDH::Define(
|
||||
{"c:bool", "x:T", "y:T", "dz:T"},
|
||||
{"dc:bool", "dx:T", "dy:T"},
|
||||
{{"T: {float, double}"}},
|
||||
{
|
||||
{{"dc"}, "ZerosLike", {"c"}, {{"T", DT_BOOL}}},
|
||||
{{"zeros"}, "ZerosLike", {"x"}, {{"T", "$T"}}},
|
||||
{{"dx"}, "Select", {"c", "dz", "zeros"}, {{"T", "$T"}}},
|
||||
{{"dy"}, "Select", {"c", "zeros", "dz"}, {{"T", "$T"}}},
|
||||
});
|
||||
// clang-format on
|
||||
return Status::OK();
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Select", SelectGrad);
|
||||
|
||||
// N-ry ops
|
||||
// REGISTER_OP_GRADIENT("AddN", AddNGrad);
|
||||
|
||||
// Reduction ops
|
||||
//
|
||||
// TODO(zhifengc): This helper is pretty ugly. Do something better.
|
||||
// TODO(zhifengc): This can be arranged as a function in the standard library.
|
||||
Status GradForReductionOp(FunctionDef* g, std::vector<FDH::Node> body) {
|
||||
// Shape manipulation nodes.
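// DynamicStitch rebuilds x's shape with 1s in the reduced dimensions
// ("y_shape"), so dy can be reshaped to y_shape and then tiled by
// tile_scaling = x_shape / y_shape back up to x's full shape.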
|
||||
|
||||
// clang-format off
|
||||
std::vector<FDH::Node> nodes = {
|
||||
{{"x_shape"}, "Shape", {"x"}},
|
||||
{{"x_rank"}, "Rank", {"x"}},
|
||||
{{"i_shape"}, "Shape", {"i"}, {{"T", DT_INT32}}},
|
||||
FDH::Const("zero", 0),
|
||||
FDH::Const("one", 1),
|
||||
// stitch_idx0 = Range(0, x_rank, 1)
|
||||
{{"stitch_idx1"}, "Identity", {"i"}, {{"T", DT_INT32}}},
|
||||
{{"stitch_idx"}, "_ListToArray", {"stitch_idx0", "stitch_idx1"},
|
||||
{{"Tin", DataTypeSlice{DT_INT32, DT_INT32}},
|
||||
{"T", DT_INT32}, {"N", 2}}},
|
||||
{{"stitch_val0"}, "Identity", {"x_shape"}, {{"T", DT_INT32}}},
|
||||
{{"stitch_val1"}, "Fill", {"i_shape", "one"}, {{"T", DT_INT32}}},
|
||||
{{"stitch_val"}, "_ListToArray", {"stitch_val0", "stitch_val1"},
|
||||
{{"Tin", DataTypeSlice{DT_INT32, DT_INT32}},
|
||||
{"T", DT_INT32}, {"N", 2}}},
|
||||
{{"y_shape"}, "DynamicStitch", {"stitch_idx", "stitch_val"},
|
||||
{{"N", 2}, {"T", DT_INT32}}},
|
||||
{{"tile_scaling"}, "Div", {"x_shape", "y_shape"}, {{"T", DT_INT32}}},
|
||||
{{"di"}, "ZerosLike", {"i"}, {{"T", DT_INT32}}}
|
||||
};
|
||||
// clang-format on
|
||||
nodes.insert(nodes.end(), body.begin(), body.end());
|
||||
for (auto& n : nodes) {
|
||||
if (n.attr.empty()) {
|
||||
n.attr = {{"T", "$T"}};
|
||||
}
|
||||
}
|
||||
// "Range" doesn't need any attr.
|
||||
nodes.push_back({{"stitch_idx0"}, "Range", {"zero", "x_rank", "one"}, {}});
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x:T", "i:int32", "dy:T"},
|
||||
// Ret val defs
|
||||
{"dx:T", "di:int32"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
nodes);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status SumGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForReductionOp(g, {
|
||||
{{"dy_reshaped"}, "Reshape", {"dy", "y_shape"}},
|
||||
{{"dx"}, "Tile", {"dy_reshaped", "tile_scaling"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Sum", SumGrad);
|
||||
|
||||
Status MeanGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
return GradForReductionOp(g, {
|
||||
{{"factor"}, "Prod", {"tile_scaling", "zero"}, {{"T", DT_INT32}}},
|
||||
{{"factor_T"}, "Cast", {"factor"}, {{"SrcT", DT_INT32}, {"DstT", "$T"}}},
|
||||
{{"dy_scaled"}, "Div", {"dy", "factor_T"}},
|
||||
{{"dy_reshaped"}, "Reshape", {"dy_scaled", "y_shape"}},
|
||||
{{"dx"}, "Tile", {"dy_reshaped", "tile_scaling"}},
|
||||
});
|
||||
// clang-format on
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Mean", MeanGrad);
|
||||
|
||||
// REGISTER_OP_GRADIENT("Prod", ProdGrad);
|
||||
// REGISTER_OP_GRADIENT("SegmentSum", SegmentSumGrad);
|
||||
// REGISTER_OP_GRADIENT("SegmentMean", SegmentMeanGrad);
|
||||
// REGISTER_OP_GRADIENT("SparseSegmentSum", SparseSegmentSumGrad);
|
||||
// REGISTER_OP_GRADIENT("SparseSegmentMean", SparseSegmentMeanGrad);
|
||||
// REGISTER_OP_GRADIENT("SegmentMin", SegmentMinGrad);
|
||||
// REGISTER_OP_GRADIENT("SegmentMax", SegmentMaxGrad);
|
||||
// REGISTER_OP_GRADIENT("UnsortedSegmentSum", UnsortedSegmentSumGrad);
|
||||
|
||||
Status MinMaxGradHelper(const string& op, const AttrSlice& attrs,
|
||||
FunctionDef* g) {
|
||||
// clang-format off
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x:T", "i:int32", "dy:T"},
|
||||
// Ret val defs
|
||||
{"dx:T", "di:int32"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
{
|
||||
// keep_dims because we need to do x == y, which requires x
|
||||
// and y to be broadcastable.
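// The incoming gradient dy is split evenly among the elements that attain the
// max/min: "mask" marks them and "mask_sum" counts them per reduction group.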
|
||||
{{"y"}, op, {"x", "i"}, {{"T", "$T"}, {"keep_dims", true}}},
|
||||
{{"mask"}, "Equal", {"x", "y"}, {{"T", "$T"}}},
|
||||
{{"mask_cast"}, "Cast", {"mask"}, {{"SrcT", DT_BOOL}, {"DstT", "$T"}}},
|
||||
{{"mask_sum"}, "Sum", {"mask_cast", "i"}, {{"T", "$T"}}},
|
||||
{{"norm_dy"}, "Div", {"dy", "mask_sum"}, {{"T", "$T"}}},
|
||||
{{"sy"}, "Shape", {"y"}, {{"T", "$T"}}},
|
||||
{{"norm_dy_reshaped"}, "Reshape", {"norm_dy", "sy"}, {{"T", "$T"}}},
|
||||
{{"dx"}, "Mul", {"mask_cast", "norm_dy_reshaped"}, {{"T", "$T"}}},
|
||||
{{"di"}, "ZerosLike", {"i"}, {{"T", DT_INT32}}}
|
||||
});
|
||||
// clang-format on
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status MaxGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
return MinMaxGradHelper("Max", attrs, g);
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Max", MaxGrad);
|
||||
|
||||
Status MinGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
return MinMaxGradHelper("Min", attrs, g);
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Min", MinGrad);
|
||||
|
||||
static Status MatMulGradHelper(FunctionDef* g, const string& x0, bool tx0,
|
||||
const string& x1, bool tx1, const string& y0,
|
||||
bool ty0, const string& y1, bool ty1) {
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x: T", "y: T", "dz: T"},
|
||||
// Ret val defs
|
||||
{"dx: T", "dy: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
{
|
||||
{{"dx"},
|
||||
"MatMul",
|
||||
{x0, x1},
|
||||
{{"T", "$T"}, {"transpose_a", tx0}, {"transpose_b", tx1}}},
|
||||
{{"dy"},
|
||||
"MatMul",
|
||||
{y0, y1},
|
||||
{{"T", "$T"}, {"transpose_a", ty0}, {"transpose_b", ty1}}},
|
||||
});
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status MatMulGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
DataType T;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "T", &T));
|
||||
if (T == DT_COMPLEX64) {
|
||||
return errors::Unimplemented(
|
||||
"MatMul gradient for complex is not supported yet.");
|
||||
}
|
||||
bool ta;
|
||||
bool tb;
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "transpose_a", &ta));
|
||||
TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "transpose_b", &tb));
|
||||
if (!ta && !tb) {
|
||||
return MatMulGradHelper(g, "dz", false, "y", true, "x", true, "dz", false);
|
||||
}
|
||||
if (!ta && tb) {
|
||||
return MatMulGradHelper(g, "dz", false, "y", false, "dz", true, "x", false);
|
||||
}
|
||||
if (ta && !tb) {
|
||||
return MatMulGradHelper(g, "y", false, "dz", true, "x", false, "dz", false);
|
||||
}
|
||||
CHECK(ta && tb);
|
||||
return MatMulGradHelper(g, "y", true, "dz", true, "dz", true, "x", true);
|
||||
}
|
||||
REGISTER_OP_GRADIENT("MatMul", MatMulGrad);
|
||||
|
||||
// REGISTER_OP_GRADIENT("SparseMatMul", SparseMatMulGrad);
|
||||
// REGISTER_OP_GRADIENT("BatchMatMul", BatchMatMulGrad);
|
||||
|
||||
// Comparison ops.
|
||||
REGISTER_OP_NO_GRADIENT("Less");
|
||||
REGISTER_OP_NO_GRADIENT("LessEqual");
|
||||
REGISTER_OP_NO_GRADIENT("Greater");
|
||||
REGISTER_OP_NO_GRADIENT("GreaterEqual");
|
||||
REGISTER_OP_NO_GRADIENT("Equal");
|
||||
REGISTER_OP_NO_GRADIENT("NotEqual");
|
||||
|
||||
// Logical ops.
|
||||
REGISTER_OP_NO_GRADIENT("LogicalAnd");
|
||||
REGISTER_OP_NO_GRADIENT("LogicalOr");
|
||||
REGISTER_OP_NO_GRADIENT("LogicalNot");
|
||||
|
||||
// Sequence generation ops.
|
||||
REGISTER_OP_NO_GRADIENT("Range");
|
||||
REGISTER_OP_NO_GRADIENT("LinSpace");
|
||||
|
||||
} // end namespace tensorflow
|
55
tensorflow/cc/ops/nn_grad.cc
Normal file
@ -0,0 +1,55 @@
|
||||
#include "tensorflow/core/framework/function.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
typedef FunctionDefHelper FDH;
|
||||
|
||||
Status ReluGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"x: T", "dy: T"},
|
||||
// Ret val defs
|
||||
{"dx: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
{
|
||||
{{"dx"}, "ReluGrad", {"dy", "x"}, {{"T", "$T"}}}
|
||||
});
|
||||
// clang-format on
|
||||
return Status::OK();
|
||||
}
|
||||
REGISTER_OP_GRADIENT("Relu", ReluGrad);
|
||||
|
||||
Status CrossEntropyGrad(const AttrSlice& attrs, FunctionDef* g) {
|
||||
// clang-format off
|
||||
*g = FDH::Define(
|
||||
// Arg defs
|
||||
{"features: T", "labels: T", "dcost_dloss: T", "donotcare: T"},
|
||||
// Ret val defs
|
||||
{"dcost_dfeatures: T", "dcost_dlabels: T"},
|
||||
// Attr defs
|
||||
{{"T: {float, double}"}},
|
||||
// Nodes
|
||||
{
|
||||
// _, dloss_dfeatures = CrossEntropy(features, labels)
|
||||
{{"donotcare_loss", "dloss_dfeatures"}, "CrossEntropy",
|
||||
{"features", "labels"}, {{"T", "$T"}}},
|
||||
// dcost_dloss is of shape [batch_size].
|
||||
// dcost_dloss_mat is of shape [batch_size, 1].
|
||||
FDH::Const("neg1", -1),
|
||||
{{"dcost_dloss_mat"}, "ExpandDims", {"dcost_dloss", "neg1"},
|
||||
{{"T", "$T"}}},
|
||||
// chain rule: dcost/dfeatures = dcost/dloss * dloss/dfeatures
|
||||
{{"dcost_dfeatures"}, "Mul", {"dcost_dloss_mat", "dloss_dfeatures"},
|
||||
{{"T", "$T"}}},
|
||||
{{"dcost_dlabels"}, "ZerosLike", {"labels"}, {{"T", "$T"}}},
|
||||
});
|
||||
// clang-format on
|
||||
return Status::OK();
|
||||
}
|
||||
REGISTER_OP_GRADIENT("CrossEntropy", CrossEntropyGrad);
|
||||
|
||||
} // end namespace tensorflow
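Editorial note: new gradient functions follow the same FunctionDefHelper pattern shown above. As a hedged illustration only (not part of this commit), a Softplus gradient could be expressed as dx = dy * sigmoid(x), reusing the existing Sigmoid and Mul ops:

Status SoftplusGradSketch(const AttrSlice& attrs, FunctionDef* g) {
  // clang-format off
  *g = FDH::Define(
      // Arg defs
      {"x: T", "dy: T"},
      // Ret val defs
      {"dx: T"},
      // Attr defs
      {{"T: {float, double}"}},
      // Nodes: softplus'(x) == sigmoid(x)
      {
        {{"sig"}, "Sigmoid", {"x"}, {{"T", "$T"}}},
        {{"dx"}, "Mul", {"dy", "sig"}, {{"T", "$T"}}},
      });
  // clang-format on
  return Status::OK();
}
// REGISTER_OP_GRADIENT("Softplus", SoftplusGradSketch);  // illustrative only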
|
26
tensorflow/cc/ops/standard_ops.h
Normal file
@ -0,0 +1,26 @@
|
||||
// #include this file to get access to the standard set of C++ graph
|
||||
// definition libraries.
|
||||
|
||||
#ifndef TENSORFLOW_CC_OPS_STANDARD_OPS_H_
|
||||
#define TENSORFLOW_CC_OPS_STANDARD_OPS_H_
|
||||
|
||||
#include "tensorflow/cc/ops/array_ops.h"
|
||||
#include "tensorflow/cc/ops/attention_ops.h"
|
||||
#include "tensorflow/cc/ops/const_op.h"
|
||||
#include "tensorflow/cc/ops/data_flow_ops.h"
|
||||
#include "tensorflow/cc/ops/image_ops.h"
|
||||
#include "tensorflow/cc/ops/io_ops.h"
|
||||
#include "tensorflow/cc/ops/linalg_ops.h"
|
||||
#include "tensorflow/cc/ops/logging_ops.h"
|
||||
#include "tensorflow/cc/ops/math_ops.h"
|
||||
#include "tensorflow/cc/ops/nn_ops.h"
|
||||
#include "tensorflow/cc/ops/parsing_ops.h"
|
||||
#include "tensorflow/cc/ops/random_ops.h"
|
||||
#include "tensorflow/cc/ops/sparse_ops.h"
|
||||
#include "tensorflow/cc/ops/state_ops.h"
|
||||
#include "tensorflow/cc/ops/string_ops.h"
|
||||
#include "tensorflow/cc/ops/summary_ops.h"
|
||||
#include "tensorflow/cc/ops/training_ops.h"
|
||||
#include "tensorflow/cc/ops/user_ops.h"
|
||||
|
||||
#endif // TENSORFLOW_CC_OPS_STANDARD_OPS_H_
|
146
tensorflow/cc/tutorials/example_trainer.cc
Normal file
@ -0,0 +1,146 @@
|
||||
#include <cstdio>
|
||||
#include <functional>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/cc/ops/standard_ops.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/graph/default_device.h"
|
||||
#include "tensorflow/core/graph/graph_def_builder.h"
|
||||
#include "tensorflow/core/lib/core/command_line_flags.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/lib/strings/stringprintf.h"
|
||||
#include "tensorflow/core/platform/init_main.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/public/session.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace example {
|
||||
|
||||
struct Options {
|
||||
int num_concurrent_sessions = 10; // The number of concurrent sessions
|
||||
int num_concurrent_steps = 10; // The number of concurrent steps
|
||||
int num_iterations = 100; // Each step repeats this many times
|
||||
bool use_gpu = false; // Whether to use gpu in the training
|
||||
};
|
||||
|
||||
TF_DEFINE_int32(num_concurrent_sessions, 10, "Number of concurrent sessions");
|
||||
TF_DEFINE_int32(num_concurrent_steps, 10, "Number of concurrent steps");
|
||||
TF_DEFINE_int32(num_iterations, 100, "Number of iterations");
|
||||
TF_DEFINE_bool(use_gpu, false, "Whether to use gpu in the training");
|
||||
|
||||
// A = [3 2; -1 0]; x = rand(2, 1);
|
||||
// We want to compute the largest eigenvalue for A.
|
||||
// repeat x = y / y.norm(); y = A * x; end
|
||||
GraphDef CreateGraphDef() {
|
||||
// TODO(jeff,opensource): This should really be a more interesting
|
||||
// computation. Maybe turn this into an mnist model instead?
|
||||
GraphDefBuilder b;
|
||||
using namespace ::tensorflow::ops; // NOLINT(build/namespaces)
|
||||
// Store rows [3, 2] and [-1, 0] in row major format.
|
||||
Node* a = Const({3.f, 2.f, -1.f, 0.f}, {2, 2}, b.opts());
|
||||
|
||||
// x is from the feed.
|
||||
Node* x = Const({0.f}, {2, 1}, b.opts().WithName("x"));
|
||||
|
||||
// y = A * x
|
||||
Node* y = MatMul(a, x, b.opts().WithName("y"));
|
||||
|
||||
// y2 = y.^2
|
||||
Node* y2 = Square(y, b.opts());
|
||||
|
||||
// y2_sum = sum(y2)
|
||||
Node* y2_sum = Sum(y2, Const(0, b.opts()), b.opts());
|
||||
|
||||
// y_norm = sqrt(y2_sum)
|
||||
Node* y_norm = Sqrt(y2_sum, b.opts());
|
||||
|
||||
// y_normalized = y ./ y_norm
|
||||
Div(y, y_norm, b.opts().WithName("y_normalized"));
|
||||
|
||||
GraphDef def;
|
||||
TF_CHECK_OK(b.ToGraphDef(&def));
|
||||
return def;
|
||||
}
|
||||
|
||||
string DebugString(const Tensor& x, const Tensor& y) {
|
||||
CHECK_EQ(x.NumElements(), 2);
|
||||
CHECK_EQ(y.NumElements(), 2);
|
||||
auto x_flat = x.flat<float>();
|
||||
auto y_flat = y.flat<float>();
|
||||
const float lambda = y_flat(0) / x_flat(0);
|
||||
return strings::Printf("lambda = %8.6f x = [%8.6f %8.6f] y = [%8.6f %8.6f]",
|
||||
lambda, x_flat(0), x_flat(1), y_flat(0), y_flat(1));
|
||||
}
|
||||
|
||||
void ConcurrentSteps(const Options* opts, int session_index) {
|
||||
// Creates a session.
|
||||
SessionOptions options;
|
||||
std::unique_ptr<Session> session(NewSession(options));
|
||||
GraphDef def = CreateGraphDef();
|
||||
if (options.target.empty()) {
|
||||
graph::SetDefaultDevice(opts->use_gpu ? "/gpu:0" : "/cpu:0", &def);
|
||||
}
|
||||
|
||||
TF_CHECK_OK(session->Create(def));
|
||||
|
||||
// Spawn M threads for M concurrent steps.
|
||||
const int M = opts->num_concurrent_steps;
|
||||
thread::ThreadPool step_threads(Env::Default(), "trainer", M);
|
||||
|
||||
for (int step = 0; step < M; ++step) {
|
||||
step_threads.Schedule([&session, opts, session_index, step]() {
|
||||
// Randomly initialize the input.
|
||||
Tensor x(DT_FLOAT, TensorShape({2, 1}));
|
||||
x.flat<float>().setRandom();
|
||||
|
||||
// Iterations.
|
||||
std::vector<Tensor> outputs;
|
||||
for (int iter = 0; iter < opts->num_iterations; ++iter) {
|
||||
outputs.clear();
|
||||
TF_CHECK_OK(
|
||||
session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs));
|
||||
CHECK_EQ(2, outputs.size());
|
||||
|
||||
const Tensor& y = outputs[0];
|
||||
const Tensor& y_norm = outputs[1];
|
||||
// Print out lambda, x, and y.
|
||||
std::printf("%06d/%06d %s\n", session_index, step,
|
||||
DebugString(x, y).c_str());
|
||||
// Copies y_normalized to x.
|
||||
x = y_norm;
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
TF_CHECK_OK(session->Close());
|
||||
}
|
||||
|
||||
void ConcurrentSessions(const Options& opts) {
|
||||
// Spawn N threads for N concurrent sessions.
|
||||
const int N = opts.num_concurrent_sessions;
|
||||
thread::ThreadPool session_threads(Env::Default(), "trainer", N);
|
||||
for (int i = 0; i < N; ++i) {
|
||||
session_threads.Schedule(std::bind(&ConcurrentSteps, &opts, i));
|
||||
}
|
||||
}
|
||||
|
||||
} // end namespace example
|
||||
} // end namespace tensorflow
|
||||
|
||||
int main(int argc, char* argv[]) {
|
||||
tensorflow::example::Options opts;
|
||||
tensorflow::Status s = tensorflow::ParseCommandLineFlags(&argc, argv);
|
||||
if (!s.ok()) {
|
||||
LOG(FATAL) << "Error parsing command line flags: " << s.ToString();
|
||||
}
|
||||
tensorflow::port::InitMain(argv[0], &argc, &argv);
|
||||
|
||||
opts.num_concurrent_sessions =
|
||||
tensorflow::example::FLAGS_num_concurrent_sessions;
|
||||
opts.num_concurrent_steps = tensorflow::example::FLAGS_num_concurrent_steps;
|
||||
opts.num_iterations = tensorflow::example::FLAGS_num_iterations;
|
||||
opts.use_gpu = tensorflow::example::FLAGS_use_gpu;
|
||||
tensorflow::example::ConcurrentSessions(opts);
|
||||
}
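Editorial note: once the workspace is configured, this trainer is typically run through Bazel, for example "bazel run -c opt //tensorflow/cc:tutorials_example_trainer -- --use_gpu=true" (the target name is assumed from the accompanying tensorflow/cc BUILD file, which is not shown in this hunk). Each output line reports the power-iteration estimate lambda, which converges to 2, the dominant eigenvalue of the matrix A = [3 2; -1 0] defined in CreateGraphDef().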
|
695
tensorflow/core/BUILD
Normal file
@ -0,0 +1,695 @@
|
||||
# Description:
|
||||
# TensorFlow is a computational framework, primarily for use in machine
|
||||
# learning applications.
|
||||
|
||||
package(default_visibility = ["//tensorflow:internal"])
|
||||
|
||||
package_group(name = "friends")
|
||||
|
||||
licenses(["notice"]) # Apache 2.0
|
||||
|
||||
exports_files(["LICENSE"])
|
||||
|
||||
load("/tensorflow/tensorflow", "tf_copts")
|
||||
load("/tensorflow/tensorflow", "tf_cc_tests")
|
||||
load("/tensorflow/tensorflow", "tf_cuda_library")
|
||||
load("/tensorflow/tensorflow", "tf_gen_op_libs")
|
||||
load("/tensorflow/tensorflow", "tf_gpu_kernel_library")
|
||||
|
||||
# For platform specific build config
|
||||
load(
|
||||
"/tensorflow/core/platform/default/build_config",
|
||||
"tf_proto_library",
|
||||
"tf_additional_lib_srcs",
|
||||
"tf_additional_test_srcs",
|
||||
"tf_kernel_tests_linkstatic",
|
||||
)
|
||||
load(
|
||||
"/tensorflow/core/platform/default/build_config_root",
|
||||
"tf_cuda_tests_tags",
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "lib",
|
||||
srcs = glob(
|
||||
[
|
||||
"lib/**/*.h",
|
||||
"lib/**/*.cc",
|
||||
"platform/*.h",
|
||||
"platform/*.cc",
|
||||
"public/*.h",
|
||||
] + tf_additional_lib_srcs(),
|
||||
exclude = [
|
||||
"**/*test*",
|
||||
],
|
||||
),
|
||||
copts = tf_copts(),
|
||||
visibility = [
|
||||
":friends",
|
||||
"//tensorflow:internal",
|
||||
],
|
||||
deps = [
|
||||
":protos_cc",
|
||||
"//tensorflow/core/platform/default/build_config:platformlib",
|
||||
],
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "core_cpu",
|
||||
srcs = glob(
|
||||
[
|
||||
"common_runtime/**/*.h",
|
||||
"client/**/*.cc",
|
||||
"common_runtime/**/*.cc",
|
||||
"graph/**/*.h",
|
||||
"graph/**/*.cc",
|
||||
],
|
||||
exclude = [
|
||||
"**/*test*",
|
||||
"**/*main.cc",
|
||||
"common_runtime/gpu/*.cc",
|
||||
"common_runtime/copy_tensor.cc",
|
||||
"common_runtime/gpu_device_factory.cc",
|
||||
"common_runtime/local_session.cc",
|
||||
"common_runtime/local_session.h",
|
||||
],
|
||||
),
|
||||
hdrs = glob(["public/**/*.h"]),
|
||||
copts = tf_copts(),
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
":copy_tensor",
|
||||
":framework",
|
||||
":lib",
|
||||
":protos_cc",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "framework",
|
||||
srcs = glob(
|
||||
[
|
||||
"framework/**/*.h",
|
||||
"framework/**/*.cc",
|
||||
"util/**/*.h",
|
||||
"util/**/*.cc",
|
||||
],
|
||||
exclude = [
|
||||
"**/*test*",
|
||||
"**/*main.cc",
|
||||
],
|
||||
),
|
||||
hdrs = glob(["public/**/*.h"]),
|
||||
copts = tf_copts(),
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
":lib",
|
||||
":protos_cc",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "local",
|
||||
srcs = [
|
||||
"common_runtime/local_session.cc",
|
||||
"common_runtime/local_session.h",
|
||||
],
|
||||
copts = tf_copts(),
|
||||
cuda_deps = [
|
||||
":cuda",
|
||||
],
|
||||
linkstatic = 1,
|
||||
deps = [
|
||||
":core",
|
||||
":lib",
|
||||
],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "copy_tensor",
|
||||
deps = [
|
||||
":lib",
|
||||
":protos_cc",
|
||||
":stream_executor",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "gpu_runtime",
|
||||
srcs = glob(
|
||||
[
|
||||
"common_runtime/gpu/**/*.h",
|
||||
"common_runtime/gpu/**/*.cc",
|
||||
],
|
||||
exclude = [
|
||||
"**/*main.cc",
|
||||
"**/*test.cc",
|
||||
],
|
||||
),
|
||||
copts = tf_copts(),
|
||||
cuda_deps = [
|
||||
":cuda",
|
||||
],
|
||||
linkstatic = 1,
|
||||
deps = [
|
||||
":core_cpu",
|
||||
":lib",
|
||||
":protos_cc",
|
||||
":stream_executor",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
# Test support library needed for higher-level tests
|
||||
cc_library(
|
||||
name = "testlib",
|
||||
testonly = 1,
|
||||
srcs = [
|
||||
"common_runtime/kernel_benchmark_testlib.cc",
|
||||
"common_runtime/kernel_benchmark_testlib.h",
|
||||
"framework/function_testlib.cc",
|
||||
"framework/function_testlib.h",
|
||||
"framework/tensor_testutil.cc",
|
||||
"framework/tensor_testutil.h",
|
||||
"graph/testlib.cc",
|
||||
"graph/testlib.h",
|
||||
],
|
||||
copts = tf_copts(),
|
||||
visibility = [
|
||||
":friends",
|
||||
"//tensorflow:internal",
|
||||
],
|
||||
deps = [
|
||||
":core_cpu",
|
||||
":tensorflow",
|
||||
":test",
|
||||
"//tensorflow/core/platform/default/build_config:gtest",
|
||||
],
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "tensorflow_opensource",
|
||||
copts = tf_copts(),
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
":core",
|
||||
":gpu_runtime",
|
||||
":kernels",
|
||||
":lib",
|
||||
":local",
|
||||
],
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "kernels",
|
||||
srcs = glob(
|
||||
[
|
||||
"kernels/**/*.h",
|
||||
"kernels/**/*.cc",
|
||||
"ops/**/*.h",
|
||||
"ops/**/*.cc",
|
||||
"user_ops/**/*.h",
|
||||
"user_ops/**/*.cc",
|
||||
],
|
||||
exclude = [
|
||||
"**/*test*",
|
||||
"**/*main.cc",
|
||||
"kernels/**/*.cu.cc",
|
||||
"user_ops/**/*.cu.cc",
|
||||
],
|
||||
),
|
||||
copts = tf_copts(),
|
||||
cuda_deps = [
|
||||
":gpu_kernels",
|
||||
":cuda",
|
||||
],
|
||||
linkstatic = 0,
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
"@gemmlowp//:eight_bit_int_gemm",
|
||||
":core",
|
||||
":lib",
|
||||
":protos_cc",
|
||||
":stream_executor",
|
||||
"//tensorflow/models/embedding:word2vec_kernels",
|
||||
"//tensorflow/models/embedding:word2vec_ops",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
tf_gpu_kernel_library(
|
||||
name = "gpu_kernels",
|
||||
srcs = glob(
|
||||
[
|
||||
"kernels/**/*.h",
|
||||
"kernels/*.cu.cc",
|
||||
"user_ops/**/*.h",
|
||||
"user_ops/*.cu.cc",
|
||||
],
|
||||
),
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
)
|
||||
|
||||
# Test support library needed for all tests
|
||||
cc_library(
|
||||
name = "test",
|
||||
testonly = 1,
|
||||
srcs = [
|
||||
"platform/test.cc",
|
||||
] + tf_additional_test_srcs(),
|
||||
hdrs = [
|
||||
"platform/test.h",
|
||||
"platform/test_benchmark.h",
|
||||
],
|
||||
copts = tf_copts(),
|
||||
linkopts = ["-lm"],
|
||||
deps = [
|
||||
":lib",
|
||||
"//tensorflow/core/platform/default/build_config:gtest",
|
||||
],
|
||||
)
|
||||
|
||||
# Main program for tests
|
||||
cc_library(
|
||||
name = "test_main",
|
||||
testonly = 1,
|
||||
srcs = ["platform/test_main.cc"],
|
||||
copts = tf_copts(),
|
||||
linkopts = ["-lm"],
|
||||
deps = [
|
||||
":test",
|
||||
"//tensorflow/core/platform/default/build_config:test_main",
|
||||
],
|
||||
)
|
||||
|
||||
# TODO(opensource): Make it work externally
|
||||
tf_proto_library(
|
||||
name = "protos_all",
|
||||
srcs = glob(["**/*.proto"]),
|
||||
cc_api_version = 2,
|
||||
go_api_version = 2,
|
||||
java_api_version = 2,
|
||||
py_api_version = 2,
|
||||
visibility = ["//tensorflow:internal"],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "protos_cc",
|
||||
deps = ["//tensorflow/core/platform/default/build_config:protos_cc"],
|
||||
)
|
||||
|
||||
# Generates library per group of ops.
|
||||
tf_gen_op_libs(
|
||||
op_lib_names = [
|
||||
"array_ops",
|
||||
"attention_ops",
|
||||
"candidate_sampling_ops",
|
||||
"control_flow_ops",
|
||||
"data_flow_ops",
|
||||
"image_ops",
|
||||
"io_ops",
|
||||
"linalg_ops",
|
||||
"logging_ops",
|
||||
"math_ops",
|
||||
"nn_ops",
|
||||
"no_op",
|
||||
"parsing_ops",
|
||||
"random_ops",
|
||||
"sendrecv_ops",
|
||||
"sparse_ops",
|
||||
"state_ops",
|
||||
"string_ops",
|
||||
"summary_ops",
|
||||
"training_ops",
|
||||
],
|
||||
)
|
||||
|
||||
# And one for all user ops
|
||||
cc_library(
|
||||
name = "user_ops_op_lib",
|
||||
srcs = glob(["user_ops/**/*.cc"]),
|
||||
copts = tf_copts(),
|
||||
linkstatic = 1,
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [":framework"],
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
# Low level library tests
|
||||
tf_cc_tests(
|
||||
tests = glob(
|
||||
[
|
||||
"lib/**/*_test.cc",
|
||||
"platform/**/*_test.cc",
|
||||
],
|
||||
exclude = ["lib/strings/ordered_code_test.cc"],
|
||||
),
|
||||
deps = [
|
||||
":lib",
|
||||
":test_main",
|
||||
],
|
||||
)
|
||||
|
||||
cc_test(
|
||||
name = "lib_jpeg_jpeg_mem_unittest",
|
||||
srcs = ["lib/jpeg/jpeg_mem_unittest.cc"],
|
||||
data = glob(["lib/jpeg/testdata/*.jpg"]),
|
||||
deps = [
|
||||
":lib",
|
||||
":test_main",
|
||||
],
|
||||
)
|
||||
|
||||
cc_test(
|
||||
name = "lib_strings_ordered_code_test",
|
||||
srcs = ["lib/strings/ordered_code_test.cc"],
|
||||
copts = ["$(STACK_FRAME_UNLIMITED)"], # Tests initialize large vectors
|
||||
deps = [
|
||||
":lib",
|
||||
":test_main",
|
||||
],
|
||||
)
|
||||
|
||||
# higher level tests
|
||||
tf_cc_tests(
|
||||
linkstatic = tf_kernel_tests_linkstatic(),
|
||||
tests = glob(
|
||||
[
|
||||
"client/**/*_test.cc",
|
||||
"common_runtime/**/*_test.cc",
|
||||
"framework/**/*_test.cc",
|
||||
"graph/**/*_test.cc",
|
||||
"util/**/*_test.cc",
|
||||
],
|
||||
exclude = [
|
||||
# TODO(opensource): fix
|
||||
"common_runtime/gpu/*_test.cc",
|
||||
# Run by tests below
|
||||
"common_runtime/gpu/gpu_region_allocator_test.cc",
|
||||
"common_runtime/gpu/gpu_bfc_allocator_test.cc",
|
||||
],
|
||||
),
|
||||
deps = [
|
||||
":core",
|
||||
":kernels",
|
||||
":lib",
|
||||
":local",
|
||||
":test_main",
|
||||
":testlib",
|
||||
"//tensorflow/cc:cc_ops",
|
||||
],
|
||||
)
|
||||
|
||||
# GPU-related tests
|
||||
tf_cc_tests(
|
||||
linkstatic = tf_kernel_tests_linkstatic(),
|
||||
tags = tf_cuda_tests_tags(),
|
||||
tests = glob(
|
||||
[
|
||||
"kernels/**/*_test.cc",
|
||||
"user_ops/**/*_test.cc",
|
||||
"common_runtime/gpu/*_test.cc",
|
||||
],
|
||||
),
|
||||
deps = [
|
||||
":kernels",
|
||||
":local",
|
||||
":test_main",
|
||||
":testlib",
|
||||
"//tensorflow/cc:cc_ops",
|
||||
],
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
name = "stream_executor",
|
||||
deps = [
|
||||
"//tensorflow/core/platform/default/build_config:stream_executor",
|
||||
],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "cuda",
|
||||
visibility = [
|
||||
":friends",
|
||||
"//tensorflow:internal",
|
||||
],
|
||||
deps = [
|
||||
"//tensorflow/core/platform/default/build_config:cuda",
|
||||
],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "tensorflow",
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
"tensorflow_opensource",
|
||||
"//tensorflow/core/platform/default/build_config:tensorflow_platform_specific",
|
||||
],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "core",
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
":core_cpu",
|
||||
":gpu_runtime",
|
||||
],
|
||||
)
|
||||
|
||||
# Android-specific BUILD targets
|
||||
load("/tensorflow/tensorflow", "tf_android_core_proto_sources")
|
||||
|
||||
# List of protos we want on android
|
||||
filegroup(
|
||||
name = "android_proto_srcs",
|
||||
srcs = tf_android_core_proto_sources(),
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
# Core sources. Should eventually become identical to open source
|
||||
# sources.
|
||||
filegroup(
|
||||
name = "android_srcs",
|
||||
srcs = glob(
|
||||
[
|
||||
"client/**/*.cc",
|
||||
"common_runtime/**/*.h",
|
||||
"common_runtime/**/*.cc",
|
||||
"framework/**/*.h",
|
||||
"framework/**/*.cc",
|
||||
"graph/**/*.h",
|
||||
"graph/**/*.cc",
|
||||
"lib/**/*.h",
|
||||
"lib/**/*.cc",
|
||||
"ops/**/*.cc",
|
||||
"ops/**/*.h",
|
||||
"platform/*.h",
|
||||
"platform/*.cc",
|
||||
"platform/**/*.h",
|
||||
"platform/**/*.cc",
|
||||
"public/**/*.h",
|
||||
"util/**/*.h",
|
||||
"util/**/*.cc",
|
||||
"kernels/ops_util.cc",
|
||||
"kernels/ops_util.h",
|
||||
"kernels/avgpooling_op.h",
|
||||
"kernels/maxpooling_op.h",
|
||||
"kernels/pooling_ops_common.h",
|
||||
"kernels/pooling_ops_common.cc",
|
||||
"kernels/reference_gemm.h",
|
||||
],
|
||||
exclude = [
|
||||
"**/*test.cc",
|
||||
"**/*testutil*",
|
||||
"**/*testlib*",
|
||||
"**/*main.cc",
|
||||
"lib/jpeg/*.h",
|
||||
"lib/jpeg/*.cc",
|
||||
"lib/png/*.h",
|
||||
"lib/png/*.cc",
|
||||
"util/events_writer.cc",
|
||||
"util/events_writer.h",
|
||||
# Exclude all protobuf/google headers except protobuf_android.h
|
||||
"platform/google/cord_coding.h",
|
||||
"platform/google/dynamic_annotations.h",
|
||||
"platform/google/integral_types.h",
|
||||
"platform/google/mutex.h",
|
||||
"platform/google/protobuf.h",
|
||||
"platform/google/stream_executor_util.h",
|
||||
"platform/google/tracing_impl.h",
|
||||
"platform/google/*.cc",
|
||||
"platform/google/test_benchmark.cc",
|
||||
"platform/google/test_benchmark.h",
|
||||
"kernels/**/*.cu.cc",
|
||||
"user_ops/**/*.cu.cc",
|
||||
"common_runtime/gpu/*.cc",
|
||||
"common_runtime/gpu_device_factory.cc",
|
||||
],
|
||||
),
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
# Core kernels we want on Android. Only a subset of kernels to keep
|
||||
# base library small.
|
||||
filegroup(
|
||||
name = "android_core_ops",
|
||||
srcs = [
|
||||
"//tensorflow/core:kernels/aggregate_ops.cc",
|
||||
"//tensorflow/core:kernels/aggregate_ops.h",
|
||||
"//tensorflow/core:kernels/assign_op.h",
|
||||
"//tensorflow/core:kernels/bias_op.cc",
|
||||
"//tensorflow/core:kernels/bias_op.h",
|
||||
"//tensorflow/core:kernels/cast_op.cc",
|
||||
"//tensorflow/core:kernels/cast_op.h",
|
||||
"//tensorflow/core:kernels/concat_op.cc",
|
||||
"//tensorflow/core:kernels/concat_op.h",
|
||||
"//tensorflow/core:kernels/concat_op_cpu.cc",
|
||||
"//tensorflow/core:kernels/constant_op.cc",
|
||||
"//tensorflow/core:kernels/constant_op.h",
|
||||
"//tensorflow/core:kernels/cwise_ops.h",
|
||||
"//tensorflow/core:kernels/cwise_ops_common.cc",
|
||||
"//tensorflow/core:kernels/cwise_ops_common.h",
|
||||
"//tensorflow/core:kernels/dense_update_ops.cc",
|
||||
"//tensorflow/core:kernels/dense_update_ops.h",
|
||||
"//tensorflow/core:kernels/fill_functor.h",
|
||||
"//tensorflow/core:kernels/gather_op.cc",
|
||||
"//tensorflow/core:kernels/identity_op.cc",
|
||||
"//tensorflow/core:kernels/identity_op.h",
|
||||
"//tensorflow/core:kernels/matmul_op.cc",
|
||||
"//tensorflow/core:kernels/matmul_op.h",
|
||||
"//tensorflow/core:kernels/no_op.cc",
|
||||
"//tensorflow/core:kernels/no_op.h",
|
||||
"//tensorflow/core:kernels/pack_op.cc",
|
||||
"//tensorflow/core:kernels/reference_gemm.h",
|
||||
"//tensorflow/core:kernels/reshape_op.cc",
|
||||
"//tensorflow/core:kernels/reshape_op.h",
|
||||
"//tensorflow/core:kernels/reverse_sequence_op.cc",
|
||||
"//tensorflow/core:kernels/reverse_sequence_op.h",
|
||||
"//tensorflow/core:kernels/sendrecv_ops.cc",
|
||||
"//tensorflow/core:kernels/sendrecv_ops.h",
|
||||
"//tensorflow/core:kernels/sequence_ops.cc",
|
||||
"//tensorflow/core:kernels/shape_ops.cc",
|
||||
"//tensorflow/core:kernels/slice_op.cc",
|
||||
"//tensorflow/core:kernels/slice_op.h",
|
||||
"//tensorflow/core:kernels/softmax_op.cc",
|
||||
"//tensorflow/core:kernels/softmax_op.h",
|
||||
"//tensorflow/core:kernels/split_op.cc",
|
||||
"//tensorflow/core:kernels/split_op.h",
|
||||
"//tensorflow/core:kernels/split_op_cpu.cc",
|
||||
"//tensorflow/core:kernels/unpack_op.cc",
|
||||
"//tensorflow/core:kernels/variable_ops.cc",
|
||||
"//tensorflow/core:kernels/variable_ops.h",
|
||||
],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
# Other kernels we may want on Android.
|
||||
filegroup(
|
||||
name = "android_extended_ops",
|
||||
srcs = [
|
||||
"//tensorflow/core:kernels/avgpooling_op.cc",
|
||||
"//tensorflow/core:kernels/avgpooling_op.h",
|
||||
"//tensorflow/core:kernels/control_flow_ops.cc",
|
||||
"//tensorflow/core:kernels/control_flow_ops.h",
|
||||
"//tensorflow/core:kernels/conv_2d.h",
|
||||
"//tensorflow/core:kernels/conv_ops.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_add.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_div.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_exp.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_log.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_mul.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_sigmoid.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_sqrt.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_square.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_sub.cc",
|
||||
"//tensorflow/core:kernels/cwise_op_tanh.cc",
|
||||
"//tensorflow/core:kernels/lrn_op.cc",
|
||||
"//tensorflow/core:kernels/maxpooling_op.cc",
|
||||
"//tensorflow/core:kernels/maxpooling_op.h",
|
||||
"//tensorflow/core:kernels/reduction_ops.h",
|
||||
"//tensorflow/core:kernels/reduction_ops_common.h",
|
||||
"//tensorflow/core:kernels/reduction_ops_max.cc",
|
||||
"//tensorflow/core:kernels/reduction_ops_min.cc",
|
||||
"//tensorflow/core:kernels/reduction_ops_sum.cc",
|
||||
"//tensorflow/core:kernels/relu_op.cc",
|
||||
"//tensorflow/core:kernels/relu_op.h",
|
||||
"//tensorflow/core:kernels/softplus_op.cc",
|
||||
"//tensorflow/core:kernels/softplus_op.h",
|
||||
"//tensorflow/core:kernels/transpose_op.cc",
|
||||
"//tensorflow/core:kernels/transpose_op.h",
|
||||
"//tensorflow/core:kernels/transpose_op_functor.h",
|
||||
],
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
# Test data
|
||||
filegroup(
|
||||
name = "image_testdata",
|
||||
srcs = [
|
||||
# PNG data
|
||||
"lib/png/testdata/lena_gray.png",
|
||||
"lib/png/testdata/lena_rgba.png",
|
||||
# JPEG data
|
||||
"lib/jpeg/testdata/jpeg_merge_test1.jpg",
|
||||
"lib/jpeg/testdata/jpeg_merge_test1_cmyk.jpg",
|
||||
# Corrupted JPEG files for tests
|
||||
"lib/jpeg/testdata/bad_huffman.jpg",
|
||||
"lib/jpeg/testdata/corrupt.jpg",
|
||||
# -- hand-edited variant: stops at line 0
|
||||
"lib/jpeg/testdata/corrupt34_2.jpg",
|
||||
# -- hand-edited variant: stops at line 4
|
||||
"lib/jpeg/testdata/corrupt34_3.jpg",
|
||||
# -- hand-edited variant: stops after a restart marker
|
||||
"lib/jpeg/testdata/corrupt34_4.jpg",
|
||||
],
|
||||
)
|
||||
|
||||
# For portable_proto_library
|
||||
|
||||
# Native library support for Android applications.
|
||||
# Should be built to target Android with flag --copt=-mfpu=neon.
|
||||
cc_library(
|
||||
name = "android_tensorflow_lib",
|
||||
srcs = [
|
||||
"//tensorflow/core:android_core_ops",
|
||||
"//tensorflow/core:android_extended_ops",
|
||||
"//tensorflow/core:android_srcs",
|
||||
],
|
||||
copts = [
|
||||
"-mfpu=neon",
|
||||
"-std=c++11",
|
||||
],
|
||||
tags = [
|
||||
"manual",
|
||||
"notap",
|
||||
],
|
||||
visibility = ["//visibility:public"],
|
||||
deps = [
|
||||
"@re2//:re2",
|
||||
":protos_cc",
|
||||
"//third_party/eigen3",
|
||||
],
|
||||
)
|
||||
|
||||
filegroup(
|
||||
name = "all_files",
|
||||
srcs = glob(
|
||||
["**/*"],
|
||||
exclude = [
|
||||
"**/METADATA",
|
||||
"**/OWNERS",
|
||||
],
|
||||
),
|
||||
visibility = ["//tensorflow:__subpackages__"],
|
||||
)
|
370
tensorflow/core/client/tensor_c_api.cc
Normal file
@ -0,0 +1,370 @@
|
||||
#include "tensorflow/core/public/tensor_c_api.h"
|
||||
|
||||
#include <memory>
|
||||
|
||||
#include "tensorflow/core/lib/core/coding.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/stringpiece.h"
|
||||
#include "tensorflow/core/lib/gtl/array_slice.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/protobuf.h"
|
||||
#include "tensorflow/core/public/session.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/public/tensor_shape.h"
|
||||
|
||||
// The implementation below is at the top level instead of the
|
||||
// brain namespace because we are defining 'extern "C"' functions.
|
||||
using tensorflow::error::Code;
|
||||
using tensorflow::errors::InvalidArgument;
|
||||
using tensorflow::gtl::ArraySlice;
|
||||
using tensorflow::AllocationDescription;
|
||||
using tensorflow::Status;
|
||||
using tensorflow::DataType;
|
||||
using tensorflow::Env;
|
||||
using tensorflow::GraphDef;
|
||||
using tensorflow::NewSession;
|
||||
using tensorflow::Session;
|
||||
using tensorflow::Tensor;
|
||||
using tensorflow::TensorBuffer;
|
||||
using tensorflow::SessionOptions;
|
||||
using tensorflow::TensorShape;
|
||||
|
||||
extern "C" {
|
||||
|
||||
// --------------------------------------------------------------------------
|
||||
struct TF_Status {
|
||||
Status status;
|
||||
};
|
||||
|
||||
TF_Status* TF_NewStatus() { return new TF_Status; }
|
||||
|
||||
void TF_DeleteStatus(TF_Status* s) { delete s; }
|
||||
|
||||
void TF_SetStatus(TF_Status* s, TF_Code code, const char* msg) {
|
||||
s->status = Status(static_cast<Code>(code), tensorflow::StringPiece(msg));
|
||||
}
|
||||
|
||||
TF_Code TF_GetCode(const TF_Status* s) {
|
||||
return static_cast<TF_Code>(s->status.code());
|
||||
}
|
||||
|
||||
const char* TF_Message(const TF_Status* s) {
|
||||
return s->status.error_message().c_str();
|
||||
}
|
||||
|
||||
// --------------------------------------------------------------------------
|
||||
|
||||
namespace {
|
||||
class TF_ManagedBuffer : public TensorBuffer {
|
||||
public:
|
||||
void* data_;
|
||||
size_t len_;
|
||||
void (*deallocator_)(void* data, size_t len, void* arg);
|
||||
void* deallocator_arg_;
|
||||
|
||||
~TF_ManagedBuffer() override {
|
||||
(*deallocator_)(data_, len_, deallocator_arg_);
|
||||
}
|
||||
|
||||
void* data() const override { return data_; }
|
||||
size_t size() const override { return len_; }
|
||||
TensorBuffer* root_buffer() override { return this; }
|
||||
void FillAllocationDescription(AllocationDescription* proto) const override {
|
||||
tensorflow::int64 rb = size();
|
||||
proto->set_requested_bytes(rb);
|
||||
proto->set_allocator_name(tensorflow::cpu_allocator()->Name());
|
||||
}
|
||||
};
|
||||
|
||||
void deallocate_realigned_buffer(void* data, size_t len, void* arg) {
|
||||
tensorflow::cpu_allocator()->DeallocateRaw(data);
|
||||
}
|
||||
} // namespace
|
||||
|
||||
struct TF_Tensor {
|
||||
TF_DataType dtype;
|
||||
TensorShape shape;
|
||||
TensorBuffer* buffer;
|
||||
};
|
||||
|
||||
TF_Tensor* TF_NewTensor(TF_DataType dtype, tensorflow::int64* dims,
|
||||
int num_dims, void* data, size_t len,
|
||||
void (*deallocator)(void* data, size_t len, void* arg),
|
||||
void* deallocator_arg) {
|
||||
std::vector<tensorflow::int64> dimvec(num_dims);
|
||||
for (int i = 0; i < num_dims; i++) {
|
||||
dimvec[i] = dims[i];
|
||||
}
|
||||
|
||||
TF_ManagedBuffer* buf = new TF_ManagedBuffer;
|
||||
buf->len_ = len;
|
||||
if (reinterpret_cast<intptr_t>(data) % EIGEN_MAX_ALIGN_BYTES != 0) {
|
||||
// Copy the data into a buffer that satisfies Eigen's alignment
|
||||
// requirements.
|
||||
buf->data_ =
|
||||
tensorflow::cpu_allocator()->AllocateRaw(EIGEN_MAX_ALIGN_BYTES, len);
|
||||
std::memcpy(buf->data_, data, len);
|
||||
buf->deallocator_ = deallocate_realigned_buffer;
|
||||
buf->deallocator_arg_ = nullptr;
|
||||
// Free the original buffer.
|
||||
deallocator(data, len, deallocator_arg);
|
||||
} else {
|
||||
buf->data_ = data;
|
||||
buf->deallocator_ = deallocator;
|
||||
buf->deallocator_arg_ = deallocator_arg;
|
||||
}
|
||||
return new TF_Tensor{dtype, TensorShape(dimvec), buf};
|
||||
}
|
||||
|
||||
void TF_DeleteTensor(TF_Tensor* t) {
|
||||
t->buffer->Unref();
|
||||
delete t;
|
||||
}
|
||||
|
||||
TF_DataType TF_TensorType(const TF_Tensor* t) { return t->dtype; }
|
||||
int TF_NumDims(const TF_Tensor* t) { return t->shape.dims(); }
|
||||
tensorflow::int64 TF_Dim(const TF_Tensor* t, int dim_index) {
|
||||
return t->shape.dim_size(dim_index);
|
||||
}
|
||||
size_t TF_TensorByteSize(const TF_Tensor* t) { return t->buffer->size(); }
|
||||
void* TF_TensorData(const TF_Tensor* t) { return t->buffer->data(); }
|
||||
|
||||
// --------------------------------------------------------------------------
|
||||
struct TF_SessionOptions {
|
||||
SessionOptions options;
|
||||
};
|
||||
TF_SessionOptions* TF_NewSessionOptions() { return new TF_SessionOptions; }
|
||||
void TF_DeleteSessionOptions(TF_SessionOptions* opt) { delete opt; }
|
||||
|
||||
void TF_SetTarget(TF_SessionOptions* options, const char* target) {
|
||||
options->options.target = target;
|
||||
}
|
||||
|
||||
void TF_SetConfig(TF_SessionOptions* options, const char* config,
|
||||
size_t config_len, TF_Status* status) {
|
||||
if (!options->options.config.ParseFromArray(config, config_len)) {
|
||||
status->status =
|
||||
tensorflow::errors::InvalidArgument("Unparseable ConfigProto");
|
||||
}
|
||||
}
|
||||
|
||||
// --------------------------------------------------------------------------
|
||||
struct TF_Session {
|
||||
Session* session;
|
||||
};
|
||||
|
||||
TF_Session* TF_NewSession(const TF_SessionOptions* opt, TF_Status* status) {
|
||||
Session* session;
|
||||
status->status = NewSession(opt->options, &session);
|
||||
if (status->status.ok()) {
|
||||
return new TF_Session({session});
|
||||
} else {
|
||||
DCHECK_EQ(nullptr, session);
|
||||
return NULL;
|
||||
}
|
||||
}
|
||||
|
||||
void TF_CloseSession(TF_Session* s, TF_Status* status) {
|
||||
status->status = s->session->Close();
|
||||
}
|
||||
|
||||
void TF_DeleteSession(TF_Session* s, TF_Status* status) {
|
||||
status->status = Status::OK();
|
||||
delete s->session;
|
||||
delete s;
|
||||
}
|
||||
|
||||
void TF_ExtendGraph(TF_Session* s, const void* proto, size_t proto_len,
|
||||
TF_Status* status) {
|
||||
GraphDef g;
|
||||
if (!tensorflow::ParseProtoUnlimited(&g, proto, proto_len)) {
|
||||
status->status = tensorflow::errors::InvalidArgument("Invalid GraphDef");
|
||||
return;
|
||||
}
|
||||
status->status = s->session->Extend(g);
|
||||
}
|
||||
|
||||
static void DeleteArray(void* data, size_t size, void* arg) {
|
||||
DCHECK_EQ(data, arg);
|
||||
delete[] reinterpret_cast<char*>(arg);
|
||||
}
|
||||
|
||||
} // end extern "C"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Non-static for testing.
|
||||
bool TF_Tensor_DecodeStrings(TF_Tensor* src, Tensor* dst, TF_Status* status) {
|
||||
const tensorflow::int64 num_elements = src->shape.num_elements();
|
||||
const char* input = reinterpret_cast<const char*>(TF_TensorData(src));
|
||||
const size_t src_size = TF_TensorByteSize(src);
|
||||
if (static_cast<tensorflow::int64>(src_size / sizeof(tensorflow::uint64)) <
|
||||
num_elements) {
|
||||
status->status = InvalidArgument(
|
||||
"Malformed TF_STRING tensor; too short to hold number of elements");
|
||||
return false;
|
||||
}
|
||||
const char* data_start = input + sizeof(tensorflow::uint64) * num_elements;
|
||||
const char* limit = input + src_size;
|
||||
|
||||
*dst = Tensor(static_cast<DataType>(src->dtype), src->shape);
|
||||
auto dstarray = dst->flat<tensorflow::string>();
|
||||
for (tensorflow::int64 i = 0; i < num_elements; i++) {
|
||||
tensorflow::uint64 offset =
|
||||
reinterpret_cast<const tensorflow::uint64*>(input)[i];
|
||||
tensorflow::uint64 len;
|
||||
const char* p;
|
||||
if (static_cast<ptrdiff_t>(offset) >= (limit - data_start) ||
|
||||
!(p = tensorflow::core::GetVarint64Ptr(data_start + offset, limit,
|
||||
&len)) ||
|
||||
(static_cast<ptrdiff_t>(len) > (limit - p))) {
|
||||
status->status = InvalidArgument("Malformed TF_STRING tensor; element ",
|
||||
i, " out of range");
|
||||
return false;
|
||||
}
|
||||
dstarray(i).assign(p, len);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// Non-static for testing.
|
||||
TF_Tensor* TF_Tensor_EncodeStrings(const Tensor& src) {
|
||||
// Compute bytes needed for encoding.
|
||||
size_t size = 0;
|
||||
const auto& srcarray = src.flat<tensorflow::string>();
|
||||
for (int i = 0; i < srcarray.size(); i++) {
|
||||
const tensorflow::string& s = srcarray(i);
|
||||
// uint64 starting_offset, varint64 length, string contents
|
||||
size += sizeof(tensorflow::uint64) +
|
||||
tensorflow::core::VarintLength(s.size()) + s.size();
|
||||
}
|
||||
|
||||
// Encode all strings.
|
||||
char* base = new char[size];
|
||||
char* data_start = base + sizeof(tensorflow::uint64) * srcarray.size();
|
||||
char* dst = data_start; // Where next string is encoded.
|
||||
tensorflow::uint64* offsets = reinterpret_cast<tensorflow::uint64*>(base);
|
||||
for (int i = 0; i < srcarray.size(); i++) {
|
||||
const tensorflow::string& s = srcarray(i);
|
||||
*offsets = (dst - data_start);
|
||||
offsets++;
|
||||
dst = tensorflow::core::EncodeVarint64(dst, s.size());
|
||||
memcpy(dst, s.data(), s.size());
|
||||
dst += s.size();
|
||||
}
|
||||
CHECK_EQ(dst, base + size);
|
||||
|
||||
auto dims = src.shape().dim_sizes();
|
||||
std::vector<tensorflow::int64> dimvec(dims.size());
|
||||
for (size_t i = 0; i < dims.size(); i++) {
|
||||
dimvec[i] = dims[i];
|
||||
}
|
||||
return TF_NewTensor(TF_STRING, dimvec.data(), dimvec.size(), base, size,
|
||||
DeleteArray, base);
|
||||
}
|
||||
|
||||
class TensorCApi {
|
||||
public:
|
||||
static TensorBuffer* Buffer(const Tensor& tensor) { return tensor.buf_; }
|
||||
static Tensor MakeTensor(TF_DataType type, const TensorShape& shape,
|
||||
TensorBuffer* buf) {
|
||||
return Tensor(static_cast<DataType>(type), shape, buf);
|
||||
}
|
||||
};
|
||||
|
||||
// Create an empty tensor of type 'dtype'. 'shape' can be arbitrary, but has to
|
||||
// result in a zero-sized tensor.
|
||||
static TF_Tensor* EmptyTensor(TF_DataType dtype, const TensorShape& shape) {
|
||||
static char empty;
|
||||
tensorflow::int64 nelems = 1;
|
||||
std::vector<tensorflow::int64> dims;
|
||||
for (int i = 0; i < shape.dims(); ++i) {
|
||||
dims.push_back(shape.dim_size(i));
|
||||
nelems *= shape.dim_size(i);
|
||||
}
|
||||
CHECK_EQ(nelems, 0);
|
||||
return TF_NewTensor(dtype, dims.data(), shape.dims(),
|
||||
reinterpret_cast<void*>(&empty), 0,
|
||||
[](void*, size_t, void*) {}, nullptr);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
extern "C" {
|
||||
|
||||
void TF_Run(TF_Session* s,
|
||||
// Input tensors
|
||||
const char** c_input_names, TF_Tensor** c_inputs, int ninputs,
|
||||
// Output tensors
|
||||
const char** c_output_tensor_names, TF_Tensor** c_outputs,
|
||||
int noutputs,
|
||||
// Target nodes
|
||||
const char** c_target_node_names, int ntargets, TF_Status* status) {
|
||||
status->status = Status::OK();
|
||||
for (int i = 0; i < noutputs; i++) {
|
||||
c_outputs[i] = NULL;
|
||||
}
|
||||
|
||||
// Initialize inputs.
|
||||
std::vector<std::pair<tensorflow::string, Tensor>> inputs(ninputs);
|
||||
bool ok = true;
|
||||
for (int i = 0; i < ninputs; i++) {
|
||||
TF_Tensor* src = c_inputs[i];
|
||||
if (ok) {
|
||||
inputs[i].first = c_input_names[i];
|
||||
if (c_inputs[i]->dtype != TF_STRING) {
|
||||
inputs[i].second = tensorflow::TensorCApi::MakeTensor(
|
||||
src->dtype, src->shape, src->buffer);
|
||||
} else {
|
||||
// TF_STRING tensors require copying since Tensor class expects
|
||||
// a sequence of string objects.
|
||||
ok =
|
||||
tensorflow::TF_Tensor_DecodeStrings(src, &inputs[i].second, status);
|
||||
// Must keep looping through all inputs even if there is an error
|
||||
// so that TF_DeleteTensor() is called unconditionally on all inputs.
|
||||
}
|
||||
}
|
||||
TF_DeleteTensor(src);
|
||||
}
|
||||
if (!ok) {
|
||||
return;
|
||||
}
|
||||
|
||||
std::vector<tensorflow::string> output_tensor_names(noutputs);
|
||||
std::vector<Tensor> outputs(noutputs);
|
||||
std::vector<tensorflow::string> target_node_names(ntargets);
|
||||
for (int i = 0; i < noutputs; i++) {
|
||||
output_tensor_names[i] = c_output_tensor_names[i];
|
||||
}
|
||||
for (int i = 0; i < ntargets; i++) {
|
||||
target_node_names[i] = c_target_node_names[i];
|
||||
}
|
||||
Status result =
|
||||
s->session->Run(inputs, output_tensor_names, target_node_names, &outputs);
|
||||
if (!result.ok()) {
|
||||
status->status = result;
|
||||
return;
|
||||
}
|
||||
|
||||
// Store results in c_outputs[]
|
||||
for (int i = 0; i < noutputs; i++) {
|
||||
const Tensor& src = outputs[i];
|
||||
if (!src.IsInitialized()) {
|
||||
c_outputs[i] = tensorflow::EmptyTensor(
|
||||
static_cast<TF_DataType>(src.dtype()), src.shape());
|
||||
continue;
|
||||
}
|
||||
if (src.dtype() != tensorflow::DT_STRING) {
|
||||
// Share the underlying buffer.
|
||||
TensorBuffer* buf = tensorflow::TensorCApi::Buffer(src);
|
||||
buf->Ref();
|
||||
c_outputs[i] = new TF_Tensor{static_cast<TF_DataType>(src.dtype()),
|
||||
src.shape(), buf};
|
||||
} else {
|
||||
c_outputs[i] = tensorflow::TF_Tensor_EncodeStrings(src);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
} // end extern "C"
|
94
tensorflow/core/client/tensor_c_api_test.cc
Normal file
@ -0,0 +1,94 @@
|
||||
#include "tensorflow/core/public/tensor_c_api.h"
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
using tensorflow::Tensor;
|
||||
using tensorflow::TensorShape;
|
||||
|
||||
namespace tensorflow {
|
||||
bool TF_Tensor_DecodeStrings(TF_Tensor* src, Tensor* dst, TF_Status* status);
|
||||
TF_Tensor* TF_Tensor_EncodeStrings(const Tensor& src);
|
||||
} // namespace tensorflow
|
||||
|
||||
TEST(CApi, Status) {
|
||||
TF_Status* s = TF_NewStatus();
|
||||
EXPECT_EQ(TF_OK, TF_GetCode(s));
|
||||
EXPECT_EQ(tensorflow::string(), TF_Message(s));
|
||||
TF_SetStatus(s, TF_CANCELLED, "cancel");
|
||||
EXPECT_EQ(TF_CANCELLED, TF_GetCode(s));
|
||||
EXPECT_EQ(tensorflow::string("cancel"), TF_Message(s));
|
||||
TF_DeleteStatus(s);
|
||||
}
|
||||
|
||||
static void Deallocator(void* data, size_t, void* arg) {
|
||||
tensorflow::cpu_allocator()->DeallocateRaw(data);
|
||||
*reinterpret_cast<bool*>(arg) = true;
|
||||
}
|
||||
|
||||
TEST(CApi, Tensor) {
|
||||
float* values =
|
||||
reinterpret_cast<float*>(tensorflow::cpu_allocator()->AllocateRaw(
|
||||
EIGEN_MAX_ALIGN_BYTES, 6 * sizeof(float)));
|
||||
tensorflow::int64 dims[] = {2, 3};
|
||||
bool deallocator_called = false;
|
||||
TF_Tensor* t = TF_NewTensor(TF_FLOAT, dims, 2, values, sizeof(values),
|
||||
&Deallocator, &deallocator_called);
|
||||
EXPECT_FALSE(deallocator_called);
|
||||
EXPECT_EQ(TF_FLOAT, TF_TensorType(t));
|
||||
EXPECT_EQ(2, TF_NumDims(t));
|
||||
EXPECT_EQ(dims[0], TF_Dim(t, 0));
|
||||
EXPECT_EQ(dims[1], TF_Dim(t, 1));
|
||||
EXPECT_EQ(sizeof(values), TF_TensorByteSize(t));
|
||||
EXPECT_EQ(static_cast<void*>(values), TF_TensorData(t));
|
||||
TF_DeleteTensor(t);
|
||||
EXPECT_TRUE(deallocator_called);
|
||||
}
|
||||
|
||||
static void TestEncodeDecode(int line,
|
||||
const std::vector<tensorflow::string>& data) {
|
||||
const tensorflow::int64 n = data.size();
|
||||
for (std::vector<tensorflow::int64> dims :
|
||||
std::vector<std::vector<tensorflow::int64>>{
|
||||
{n}, {1, n}, {n, 1}, {n / 2, 2}}) {
|
||||
// Create C++ Tensor
|
||||
Tensor src(tensorflow::DT_STRING, TensorShape(dims));
|
||||
for (tensorflow::int64 i = 0; i < src.NumElements(); i++) {
|
||||
src.flat<tensorflow::string>()(i) = data[i];
|
||||
}
|
||||
TF_Tensor* dst = TF_Tensor_EncodeStrings(src);
|
||||
|
||||
// Convert back to a C++ Tensor and ensure we get expected output.
|
||||
TF_Status* status = TF_NewStatus();
|
||||
Tensor output;
|
||||
ASSERT_TRUE(TF_Tensor_DecodeStrings(dst, &output, status)) << line;
|
||||
ASSERT_EQ(TF_OK, TF_GetCode(status)) << line;
|
||||
ASSERT_EQ(src.NumElements(), output.NumElements()) << line;
|
||||
for (tensorflow::int64 i = 0; i < src.NumElements(); i++) {
|
||||
ASSERT_EQ(data[i], output.flat<tensorflow::string>()(i)) << line;
|
||||
}
|
||||
|
||||
TF_DeleteStatus(status);
|
||||
TF_DeleteTensor(dst);
|
||||
}
|
||||
}
|
||||
|
||||
TEST(CApi, TensorEncodeDecodeStrings) {
|
||||
TestEncodeDecode(__LINE__, {});
|
||||
TestEncodeDecode(__LINE__, {"hello"});
|
||||
TestEncodeDecode(__LINE__,
|
||||
{"the", "quick", "brown", "fox", "jumped", "over"});
|
||||
|
||||
tensorflow::string big(1000, 'a');
|
||||
TestEncodeDecode(__LINE__, {"small", big, "small2"});
|
||||
}
|
||||
|
||||
TEST(CApi, SessionOptions) {
|
||||
TF_SessionOptions* opt = TF_NewSessionOptions();
|
||||
TF_DeleteSessionOptions(opt);
|
||||
}
|
||||
|
||||
// TODO(jeff,sanjay): Session tests
|
||||
// . Create and delete
|
||||
// . Extend graph
|
||||
// . Run
|
37
tensorflow/core/common_runtime/device.cc
Normal file
@ -0,0 +1,37 @@
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
|
||||
#include "tensorflow/core/framework/op_segment.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/random/random.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
Device::Device(Env* env, const DeviceAttributes& device_attributes,
|
||||
Allocator* device_allocator)
|
||||
: DeviceBase(env), device_attributes_(device_attributes) {
|
||||
CHECK(DeviceNameUtils::ParseFullName(name(), &parsed_name_))
|
||||
<< "Invalid device name: " << name();
|
||||
rmgr_ = new ResourceMgr(parsed_name_.job);
|
||||
}
|
||||
|
||||
Device::~Device() { delete rmgr_; }
|
||||
|
||||
// static
|
||||
DeviceAttributes Device::BuildDeviceAttributes(
|
||||
const string& name, DeviceType device, Bytes memory_limit,
|
||||
BusAdjacency bus_adjacency, const string& physical_device_desc) {
|
||||
DeviceAttributes da;
|
||||
da.set_name(name);
|
||||
do {
|
||||
da.set_incarnation(random::New64());
|
||||
} while (da.incarnation() == 0); // This proto field must not be zero
|
||||
da.set_device_type(device.type());
|
||||
da.set_memory_limit(memory_limit.value());
|
||||
da.set_bus_adjacency(bus_adjacency);
|
||||
da.set_physical_device_desc(physical_device_desc);
|
||||
return da;
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
128
tensorflow/core/common_runtime/device.h
Normal file
@ -0,0 +1,128 @@
|
||||
// A Device is something that can perform computations as part of a
|
||||
// model. Devices can be local (runs computation on this machine), or
|
||||
// remote (contacts a device local to another machine using an RPC to
|
||||
// do the work). Devices are registered in a DeviceSet, which is also
|
||||
// responsible for the Device <-> id mapping.
|
||||
//
|
||||
// Device names
|
||||
// * Every Device should have a unique name with the format:
|
||||
// /job:___/replica:___/task:___/(gpu|cpu):___
|
||||
// An example name would be "/job:train/replica:0/task:3/gpu:2".
|
||||
// * Task numbers are within the specified replica, so there are as
|
||||
// many "task zeros" as replicas.
|
||||
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_DEVICE_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_DEVICE_H_
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/framework/control_flow.h"
|
||||
#include "tensorflow/core/framework/device_attributes.pb.h"
|
||||
#include "tensorflow/core/framework/device_base.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/framework/op_kernel.h"
|
||||
#include "tensorflow/core/framework/op_segment.h"
|
||||
#include "tensorflow/core/framework/resource_mgr.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/graph/types.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class Device : public DeviceBase {
|
||||
public:
|
||||
Device(Env* env, const DeviceAttributes& device_attributes,
|
||||
Allocator* device_allocator);
|
||||
~Device() override;
|
||||
|
||||
// Full name of this device (see top comment).
|
||||
const string& name() const { return device_attributes_.name(); }
|
||||
|
||||
// Parsed name of this device
|
||||
const DeviceNameUtils::ParsedName parsed_name() const { return parsed_name_; }
|
||||
|
||||
// Describes what kind of device this is. This is intended to be
|
||||
// human-readable and not computer-parsed, except that two devices
|
||||
// with the same device_type() are expected to perform similarly
|
||||
// (both from a computation and communication perspective).
|
||||
const string& device_type() const { return device_attributes_.device_type(); }
|
||||
|
||||
// Returns an aggregation of device attributes.
|
||||
const DeviceAttributes& attributes() const override {
|
||||
return device_attributes_;
|
||||
}
|
||||
|
||||
// Performs the actual compute function.
|
||||
//
|
||||
// Subclasses may override this function if they wish to perform
|
||||
// some initialization before each compute.
|
||||
virtual void Compute(OpKernel* op_kernel, OpKernelContext* context) {
|
||||
op_kernel->Compute(context);
|
||||
}
|
||||
|
||||
// Asynchronous kernel's compute.
|
||||
virtual void ComputeAsync(AsyncOpKernel* op_kernel, OpKernelContext* context,
|
||||
AsyncOpKernel::DoneCallback done) {
|
||||
op_kernel->ComputeAsync(context, done);
|
||||
}
|
||||
|
||||
// Blocks until all operations queued on the device at the time of
|
||||
// the call have completed. Returns any error pending on the device
|
||||
// at completion.
|
||||
virtual Status Sync() = 0;
|
||||
|
||||
// Fill in the context map for the graph. Default behavior is to do
|
||||
// nothing.
|
||||
//
|
||||
// The caller takes ownership over the DeviceContext objects given
|
||||
// by the device.
|
||||
virtual Status FillContextMap(const Graph* graph,
|
||||
DeviceContextMap* device_context_map) {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Returns the op segment of this device. The caller can reuse op
|
||||
// kernels registered for the same session running on this device.
|
||||
OpSegment* op_segment() { return &op_seg_; }
|
||||
|
||||
// Returns the resource manager associated w/ this device.
|
||||
ResourceMgr* resource_manager() { return rmgr_; }
|
||||
|
||||
// Summarizes the status of this Device, for debugging.
|
||||
string DebugString() const { return device_attributes_.DebugString(); }
|
||||
|
||||
// Assembles the parameter components into a complete DeviceAttributes value.
|
||||
static DeviceAttributes BuildDeviceAttributes(
|
||||
const string& name, DeviceType device, Bytes memory_limit,
|
||||
BusAdjacency bus_adjacency, const string& physical_device_desc);
|
||||
|
||||
static DeviceAttributes BuildDeviceAttributes(const string& name,
|
||||
DeviceType device,
|
||||
Bytes memory_limit,
|
||||
BusAdjacency bus_adjacency) {
|
||||
// Pass in an empty string as physical device name.
|
||||
return BuildDeviceAttributes(name, device, memory_limit, bus_adjacency, "");
|
||||
}
|
||||
|
||||
private:
|
||||
const DeviceAttributes device_attributes_;
|
||||
DeviceNameUtils::ParsedName parsed_name_;
|
||||
|
||||
// op_seg_ maps session handle and op name to OpKernel objects.
|
||||
OpSegment op_seg_;
|
||||
|
||||
// Resources associated w/ this device. E.g., shared variables, etc.
|
||||
ResourceMgr* rmgr_ = nullptr;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(Device);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_DEVICE_H_
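Editorial note: a hedged sketch (not part of this commit) of how a concrete device might plug into the Device base class above; "ToyDevice", the memory limit, and the bus adjacency value are illustrative assumptions:

class ToyDevice : public Device {
 public:
  ToyDevice(Env* env, const string& name, Allocator* allocator)
      : Device(env,
               BuildDeviceAttributes(name, DeviceType("CPU"),
                                     Bytes(256 << 20), BUS_ANY),
               allocator) {}
  // Nothing is queued asynchronously in this toy device, so Sync is a no-op.
  Status Sync() override { return Status::OK(); }
};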
|
106
tensorflow/core/common_runtime/device_factory.cc
Normal file
@ -0,0 +1,106 @@
|
||||
#include "tensorflow/core/common_runtime/device_factory.h"
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
static mutex* get_device_factory_lock() {
|
||||
static mutex device_factory_lock;
|
||||
return &device_factory_lock;
|
||||
}
|
||||
|
||||
struct FactoryItem {
|
||||
std::unique_ptr<DeviceFactory> factory;
|
||||
int priority;
|
||||
};
|
||||
|
||||
std::unordered_map<string, FactoryItem>& device_factories() {
|
||||
static std::unordered_map<string, FactoryItem>* factories =
|
||||
new std::unordered_map<string, FactoryItem>;
|
||||
return *factories;
|
||||
}
|
||||
} // namespace
|
||||
|
||||
void DeviceFactory::Register(const string& device_type, DeviceFactory* factory,
|
||||
int priority) {
|
||||
mutex_lock l(*get_device_factory_lock());
|
||||
std::unique_ptr<DeviceFactory> factory_ptr(factory);
|
||||
std::unordered_map<string, FactoryItem>& factories = device_factories();
|
||||
auto iter = factories.find(device_type);
|
||||
if (iter == factories.end()) {
|
||||
factories[device_type] = {std::move(factory_ptr), priority};
|
||||
} else {
|
||||
if (iter->second.priority < priority) {
|
||||
iter->second = {std::move(factory_ptr), priority};
|
||||
} else if (iter->second.priority == priority) {
|
||||
LOG(FATAL) << "Duplicate registration of device factory for type "
|
||||
<< device_type << " with the same priority " << priority;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DeviceFactory* DeviceFactory::GetFactory(const string& device_type) {
|
||||
mutex_lock l(*get_device_factory_lock()); // could use reader lock
|
||||
auto it = device_factories().find(device_type);
|
||||
if (it == device_factories().end()) {
|
||||
return nullptr;
|
||||
}
|
||||
return it->second.factory.get();
|
||||
}
|
||||
|
||||
void DeviceFactory::AddDevices(const SessionOptions& options,
|
||||
const string& name_prefix,
|
||||
std::vector<Device*>* devices) {
|
||||
// CPU first.
|
||||
auto cpu_factory = GetFactory("CPU");
|
||||
if (!cpu_factory) {
|
||||
LOG(FATAL)
|
||||
<< "CPU Factory not registered. Did you link in threadpool_device?";
|
||||
}
|
||||
size_t init_size = devices->size();
|
||||
cpu_factory->CreateDevices(options, name_prefix, devices);
|
||||
if (devices->size() == init_size) {
|
||||
LOG(FATAL) << "No CPU devices are available in this process";
|
||||
}
|
||||
|
||||
// Then GPU.
|
||||
auto gpu_factory = GetFactory("GPU");
|
||||
if (gpu_factory) {
|
||||
gpu_factory->CreateDevices(options, name_prefix, devices);
|
||||
}
|
||||
|
||||
// Then the rest.
|
||||
mutex_lock l(*get_device_factory_lock());
|
||||
for (auto& p : device_factories()) {
|
||||
auto factory = p.second.factory.get();
|
||||
if (factory != cpu_factory && factory != gpu_factory) {
|
||||
factory->CreateDevices(options, name_prefix, devices);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Device* DeviceFactory::NewDevice(const string& type,
|
||||
const SessionOptions& options,
|
||||
const string& name_prefix) {
|
||||
auto device_factory = GetFactory(type);
|
||||
if (!device_factory) {
|
||||
return nullptr;
|
||||
}
|
||||
SessionOptions opt = options;
|
||||
(*opt.config.mutable_device_count())[type] = 1;
|
||||
std::vector<Device*> devices;
|
||||
device_factory->CreateDevices(opt, name_prefix, &devices);
|
||||
CHECK_EQ(devices.size(), 1);
|
||||
return devices[0];
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
69
tensorflow/core/common_runtime/device_factory.h
Normal file
@ -0,0 +1,69 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_DEVICE_FACTORY_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_DEVICE_FACTORY_H_
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class Device;
|
||||
struct SessionOptions;
|
||||
|
||||
class DeviceFactory {
|
||||
public:
|
||||
virtual ~DeviceFactory() {}
|
||||
static void Register(const string& device_type, DeviceFactory* factory,
|
||||
int priority);
|
||||
static DeviceFactory* GetFactory(const string& device_type);
|
||||
|
||||
// Append to "*devices" all suitable devices, respecting
|
||||
// any device type specific properties/counts listed in "options".
|
||||
//
|
||||
// CPU devices are added first.
|
||||
static void AddDevices(const SessionOptions& options,
|
||||
const string& name_prefix,
|
||||
std::vector<Device*>* devices);
|
||||
|
||||
// Helper for tests. Create a single device of type "type". The
|
||||
// returned device is always numbered zero, so if creating multiple
|
||||
// devices of the same type, supply distinct name_prefix arguments.
|
||||
static Device* NewDevice(const string& type, const SessionOptions& options,
|
||||
const string& name_prefix);
|
||||
|
||||
// Most clients should call AddDevices() instead.
|
||||
virtual void CreateDevices(const SessionOptions& options,
|
||||
const string& name_prefix,
|
||||
std::vector<Device*>* devices) = 0;
|
||||
};
|
||||
|
||||
namespace dfactory {
|
||||
|
||||
template <class Factory>
|
||||
class Registrar {
|
||||
public:
|
||||
// Multiple registrations for the same device type with different priorities
|
||||
// are allowed. The registration with the highest priority will be used.
|
||||
explicit Registrar(const string& device_type, int priority = 0) {
|
||||
DeviceFactory::Register(device_type, new Factory(), priority);
|
||||
}
|
||||
};
|
||||
|
||||
} // namespace dfactory
|
||||
|
||||
#define REGISTER_LOCAL_DEVICE_FACTORY(device_type, device_factory, ...) \
|
||||
INTERNAL_REGISTER_LOCAL_DEVICE_FACTORY(device_type, device_factory, \
|
||||
__COUNTER__, ##__VA_ARGS__)
|
||||
|
||||
#define INTERNAL_REGISTER_LOCAL_DEVICE_FACTORY(device_type, device_factory, \
|
||||
ctr, ...) \
|
||||
static ::tensorflow::dfactory::Registrar<device_factory> \
|
||||
INTERNAL_REGISTER_LOCAL_DEVICE_FACTORY_NAME(ctr)(device_type, \
|
||||
##__VA_ARGS__)
|
||||
|
||||
// __COUNTER__ must go through another macro to be properly expanded
|
||||
#define INTERNAL_REGISTER_LOCAL_DEVICE_FACTORY_NAME(ctr) ___##ctr##__object_
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_DEVICE_FACTORY_H_
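A hedged sketch of registering a new device type through this machinery; ExampleDeviceFactory and the "EXAMPLE" type string are hypothetical names used only for illustration.
// Hypothetical factory for an "EXAMPLE" device type.
class ExampleDeviceFactory : public DeviceFactory {
 public:
  void CreateDevices(const SessionOptions& options, const string& name_prefix,
                     std::vector<Device*>* devices) override {
    // A real factory would append newly constructed Device* instances here,
    // typically one per count requested in options.config.device_count().
  }
};

// Registered at priority 50; a later registration for "EXAMPLE" with a higher
// priority would replace this factory, and an equal priority is a fatal error.
REGISTER_LOCAL_DEVICE_FACTORY("EXAMPLE", ExampleDeviceFactory, 50);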
|
90
tensorflow/core/common_runtime/device_mgr.cc
Normal file
@ -0,0 +1,90 @@
|
||||
#include "tensorflow/core/common_runtime/device_mgr.h"
|
||||
|
||||
#include "tensorflow/core/common_runtime/local_device.h"
|
||||
#include "tensorflow/core/framework/device_attributes.pb.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
DeviceMgr::DeviceMgr(const std::vector<Device*>& devices) {
|
||||
for (Device* d : devices) {
|
||||
devices_.push_back(d);
|
||||
|
||||
// Register under both the full name and the local name.
|
||||
device_map_[d->name()] = d;
|
||||
device_map_[DeviceNameUtils::LocalName(d->name())] = d;
|
||||
device_type_counts_[d->device_type()]++;
|
||||
}
|
||||
}
|
||||
|
||||
DeviceMgr::~DeviceMgr() {
|
||||
for (auto p : devices_) delete p;
|
||||
}
|
||||
|
||||
void DeviceMgr::ListDeviceAttributes(
|
||||
std::vector<DeviceAttributes>* devices) const {
|
||||
devices->reserve(devices_.size());
|
||||
for (Device* dev : devices_) {
|
||||
devices->emplace_back(dev->attributes());
|
||||
}
|
||||
}
|
||||
|
||||
std::vector<Device*> DeviceMgr::ListDevices() const {
|
||||
return std::vector<Device*>(devices_.begin(), devices_.end());
|
||||
}
|
||||
|
||||
string DeviceMgr::DebugString() const {
|
||||
string out;
|
||||
for (Device* dev : devices_) {
|
||||
strings::StrAppend(&out, dev->name(), "\n");
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
string DeviceMgr::DeviceMappingString() const {
|
||||
string out;
|
||||
for (Device* dev : devices_) {
|
||||
if (!dev->attributes().physical_device_desc().empty()) {
|
||||
strings::StrAppend(&out, dev->name(), " -> ",
|
||||
dev->attributes().physical_device_desc(), "\n");
|
||||
}
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
Status DeviceMgr::LookupDevice(const string& name, Device** device) const {
|
||||
Status s;
|
||||
auto iter = device_map_.find(name);
|
||||
if (iter == device_map_.end()) {
|
||||
return errors::InvalidArgument(name, " unknown device.");
|
||||
}
|
||||
*device = iter->second;
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
void DeviceMgr::ClearContainers(gtl::ArraySlice<string> containers) const {
|
||||
Status s;
|
||||
for (Device* dev : devices_) {
|
||||
if (containers.empty()) {
|
||||
s.Update(dev->resource_manager()->Cleanup(
|
||||
dev->resource_manager()->default_container()));
|
||||
} else {
|
||||
for (const string& c : containers) {
|
||||
s.Update(dev->resource_manager()->Cleanup(c));
|
||||
}
|
||||
}
|
||||
if (!s.ok()) {
|
||||
LOG(WARNING) << s;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
int DeviceMgr::NumDeviceType(const string& type) const {
|
||||
auto iter = device_type_counts_.find(type);
|
||||
if (iter != device_type_counts_.end()) return iter->second;
|
||||
return 0;
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
55
tensorflow/core/common_runtime/device_mgr.h
Normal file
@ -0,0 +1,55 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_DEVICE_MGR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_DEVICE_MGR_H_
|
||||
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <unordered_set>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/lib/gtl/inlined_vector.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class DeviceAttributes;
|
||||
|
||||
class DeviceMgr {
|
||||
public:
|
||||
// TODO(zhifengc): Other initialization information.
|
||||
explicit DeviceMgr(const std::vector<Device*>& devices);
|
||||
~DeviceMgr();
|
||||
|
||||
// Returns attributes of all devices.
|
||||
void ListDeviceAttributes(std::vector<DeviceAttributes>* devices) const;
|
||||
|
||||
std::vector<Device*> ListDevices() const;
|
||||
|
||||
// Returns a string listing all devices.
|
||||
string DebugString() const;
|
||||
|
||||
// Returns a string describing the device-to-physical-device mapping.
|
||||
string DeviceMappingString() const;
|
||||
|
||||
// Assigns *device with pointer to Device of the given name.
|
||||
// Accepts either a full device name, or just the replica-local suffix.
|
||||
Status LookupDevice(const string& name, Device** device) const;
|
||||
|
||||
// Clears the given containers on all devices if 'containers' is
|
||||
// non-empty. Otherwise, clears default containers of all devices.
|
||||
void ClearContainers(gtl::ArraySlice<string> containers) const;
|
||||
|
||||
int NumDeviceType(const string& type) const;
|
||||
|
||||
private:
|
||||
typedef gtl::InlinedVector<Device*, 8> DeviceVec;
|
||||
DeviceVec devices_;
|
||||
std::unordered_map<string, Device*> device_map_;
|
||||
std::unordered_map<string, int> device_type_counts_;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(DeviceMgr);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_DEVICE_MGR_H_
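A minimal sketch, under assumed SessionOptions, of feeding DeviceFactory::AddDevices into a DeviceMgr and resolving a device by name; the name prefix and lookup name below are illustrative assumptions.
// Illustrative only; the prefix and lookup name are assumptions.
void ExampleBuildDeviceMgr() {
  SessionOptions options;
  std::vector<Device*> devices;
  DeviceFactory::AddDevices(options, "/job:localhost/replica:0/task:0",
                            &devices);

  DeviceMgr device_mgr(devices);  // The DeviceMgr deletes the devices later.
  LOG(INFO) << device_mgr.DeviceMappingString();

  Device* cpu = nullptr;
  // Both full names and replica-local suffixes (e.g. "cpu:0") are accepted.
  Status s = device_mgr.LookupDevice(
      "/job:localhost/replica:0/task:0/cpu:0", &cpu);
  if (s.ok()) LOG(INFO) << "Found " << cpu->name();
}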
|
68
tensorflow/core/common_runtime/device_set.cc
Normal file
@ -0,0 +1,68 @@
|
||||
#include "tensorflow/core/common_runtime/device_set.h"
|
||||
|
||||
#include <set>
|
||||
#include <utility>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/lib/core/stringpiece.h"
|
||||
#include "tensorflow/core/lib/gtl/map_util.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
DeviceSet::DeviceSet() {}
|
||||
|
||||
DeviceSet::~DeviceSet() {}
|
||||
|
||||
void DeviceSet::AddDevice(Device* device) {
|
||||
devices_.push_back(device);
|
||||
device_by_name_.insert({device->name(), device});
|
||||
}
|
||||
|
||||
void DeviceSet::FindMatchingDevices(const DeviceNameUtils::ParsedName& spec,
|
||||
std::vector<Device*>* devices) const {
|
||||
// TODO(jeff): If we are going to repeatedly lookup the set of devices
|
||||
// for the same spec, maybe we should have a cache of some sort
|
||||
devices->clear();
|
||||
for (Device* d : devices_) {
|
||||
if (DeviceNameUtils::IsCompleteSpecification(spec, d->parsed_name())) {
|
||||
devices->push_back(d);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Device* DeviceSet::FindDeviceByName(const string& name) const {
|
||||
return gtl::FindPtrOrNull(device_by_name_, name);
|
||||
}
|
||||
|
||||
// Higher result implies lower priority.
|
||||
static int Order(const DeviceType& d) {
|
||||
if (StringPiece(d.type()) == DEVICE_CPU) {
|
||||
return 3;
|
||||
} else if (StringPiece(d.type()) == DEVICE_GPU) {
|
||||
return 2;
|
||||
} else {
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
|
||||
static bool ByPriority(const DeviceType& a, const DeviceType& b) {
|
||||
// Order by "order number"; break ties lexicographically.
|
||||
return std::make_pair(Order(a), StringPiece(a.type())) <
|
||||
std::make_pair(Order(b), StringPiece(b.type()));
|
||||
}
|
||||
|
||||
std::vector<DeviceType> DeviceSet::PrioritizedDeviceTypeList() const {
|
||||
std::vector<DeviceType> result;
|
||||
std::set<string> seen;
|
||||
for (Device* d : devices_) {
|
||||
auto t = d->device_type();
|
||||
if (seen.insert(t).second) {
|
||||
result.emplace_back(DeviceType(t));
|
||||
}
|
||||
}
|
||||
std::sort(result.begin(), result.end(), ByPriority);
|
||||
return result;
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
64
tensorflow/core/common_runtime/device_set.h
Normal file
@ -0,0 +1,64 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_DEVICE_SET_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_DEVICE_SET_H_
|
||||
|
||||
#include <memory>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// DeviceSet is a container class for managing the various types of
|
||||
// devices used by a model.
|
||||
class DeviceSet {
|
||||
public:
|
||||
DeviceSet();
|
||||
~DeviceSet();
|
||||
|
||||
// Does not take ownership of 'device'.
|
||||
void AddDevice(Device* device);
|
||||
|
||||
// Set the device designated as the "client". This device
|
||||
// must also be registered via AddDevice().
|
||||
void set_client_device(Device* device) { client_device_ = device; }
|
||||
|
||||
// Returns a pointer to the device designated as the "client".
|
||||
Device* client_device() const { return client_device_; }
|
||||
|
||||
// Return the list of devices in this set.
|
||||
const std::vector<Device*>& devices() const { return devices_; }
|
||||
|
||||
// Given a DeviceNameUtils::ParsedName (which may have some
|
||||
// wildcards for different components), fills "*devices" with all
|
||||
// devices in "*this" that match "spec".
|
||||
void FindMatchingDevices(const DeviceNameUtils::ParsedName& spec,
|
||||
std::vector<Device*>* devices) const;
|
||||
|
||||
// Finds the device with the given "fullname". Returns nullptr if
|
||||
// not found.
|
||||
Device* FindDeviceByName(const string& fullname) const;
|
||||
|
||||
// Return the list of unique device types in this set, ordered
|
||||
// with more preferable devices earlier.
|
||||
std::vector<DeviceType> PrioritizedDeviceTypeList() const;
|
||||
|
||||
private:
|
||||
// Not owned.
|
||||
std::vector<Device*> devices_;
|
||||
|
||||
// Fullname -> device* for device in devices_.
|
||||
std::unordered_map<string, Device*> device_by_name_;
|
||||
|
||||
// client_device_ points to an element of devices_ that we consider
|
||||
// to be the client device (in this local process).
|
||||
Device* client_device_ = nullptr;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(DeviceSet);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_DEVICE_SET_H_
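A short sketch of querying a DeviceSet with a partially specified name, assuming DeviceNameUtils::ParseFullName accepts the wildcard components mentioned above; the "/gpu:*" spec is an illustrative choice.
// Illustrative: collect every GPU in the set, regardless of job/replica/task.
void ExampleFindGpus(const DeviceSet& device_set) {
  DeviceNameUtils::ParsedName spec;
  CHECK(DeviceNameUtils::ParseFullName("/gpu:*", &spec));
  std::vector<Device*> gpus;
  device_set.FindMatchingDevices(spec, &gpus);
  for (Device* d : gpus) {
    LOG(INFO) << "Matched device: " << d->name();
  }
}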
|
65
tensorflow/core/common_runtime/device_set_test.cc
Normal file
@ -0,0 +1,65 @@
|
||||
#include "tensorflow/core/common_runtime/device_set.h"
|
||||
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
// Return a fake device with the specified type and name.
|
||||
static Device* Dev(const char* type, const char* name) {
|
||||
class FakeDevice : public Device {
|
||||
public:
|
||||
explicit FakeDevice(const DeviceAttributes& attr)
|
||||
: Device(nullptr, attr, nullptr) {}
|
||||
Status Sync() override { return Status::OK(); }
|
||||
Allocator* GetAllocator(AllocatorAttributes) override { return nullptr; }
|
||||
};
|
||||
DeviceAttributes attr;
|
||||
attr.set_name(name);
|
||||
attr.set_device_type(type);
|
||||
return new FakeDevice(attr);
|
||||
}
|
||||
|
||||
class DeviceSetTest : public testing::Test {
|
||||
public:
|
||||
void AddDevice(const char* type, const char* name) {
|
||||
Device* d = Dev(type, name);
|
||||
owned_.emplace_back(d);
|
||||
devices_.AddDevice(d);
|
||||
}
|
||||
|
||||
std::vector<DeviceType> types() const {
|
||||
return devices_.PrioritizedDeviceTypeList();
|
||||
}
|
||||
|
||||
private:
|
||||
DeviceSet devices_;
|
||||
std::vector<std::unique_ptr<Device>> owned_;
|
||||
};
|
||||
|
||||
TEST_F(DeviceSetTest, PrioritizedDeviceTypeList) {
|
||||
EXPECT_EQ(std::vector<DeviceType>{}, types());
|
||||
|
||||
AddDevice("CPU", "/job:a/replica:0/task:0/cpu:0");
|
||||
EXPECT_EQ(std::vector<DeviceType>{DeviceType(DEVICE_CPU)}, types());
|
||||
|
||||
AddDevice("CPU", "/job:a/replica:0/task:0/cpu:1");
|
||||
EXPECT_EQ(std::vector<DeviceType>{DeviceType(DEVICE_CPU)}, types());
|
||||
|
||||
AddDevice("GPU", "/job:a/replica:0/task:0/gpu:0");
|
||||
EXPECT_EQ(
|
||||
(std::vector<DeviceType>{DeviceType(DEVICE_GPU), DeviceType(DEVICE_CPU)}),
|
||||
types());
|
||||
|
||||
AddDevice("T1", "/job:a/replica:0/task:0/device:T1:0");
|
||||
AddDevice("T1", "/job:a/replica:0/task:0/device:T1:1");
|
||||
AddDevice("T2", "/job:a/replica:0/task:0/device:T2:0");
|
||||
EXPECT_EQ(
|
||||
(std::vector<DeviceType>{DeviceType("T1"), DeviceType("T2"),
|
||||
DeviceType(DEVICE_GPU), DeviceType(DEVICE_CPU)}),
|
||||
types());
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
22
tensorflow/core/common_runtime/eigen_thread_pool.h
Normal file
@ -0,0 +1,22 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_EIGEN_THREAD_POOL_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_EIGEN_THREAD_POOL_H_
|
||||
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
|
||||
public:
|
||||
explicit EigenThreadPoolWrapper(thread::ThreadPool* pool) : pool_(pool) {}
|
||||
~EigenThreadPoolWrapper() override {}
|
||||
|
||||
void Schedule(std::function<void()> fn) override { pool_->Schedule(fn); }
|
||||
|
||||
private:
|
||||
thread::ThreadPool* pool_ = nullptr;
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_EIGEN_THREAD_POOL_H_
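A small sketch of the intended wiring: adapting a tensorflow::thread::ThreadPool so Eigen tensor expressions can be evaluated on it; the pool name and thread count are arbitrary assumptions.
// Illustrative wiring of a ThreadPool into an Eigen device.
void ExampleEigenDevice(Env* env) {
  thread::ThreadPool pool(env, "example_pool", 4 /* assumed thread count */);
  EigenThreadPoolWrapper wrapper(&pool);
  Eigen::ThreadPoolDevice device(&wrapper, 4 /* cores reported to Eigen */);
  // Tensor expressions evaluated with .device(device) now schedule their
  // work onto "pool" through the wrapper's Schedule() override.
}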
|
2118
tensorflow/core/common_runtime/executor.cc
Normal file
File diff suppressed because it is too large
209
tensorflow/core/common_runtime/executor.h
Normal file
@ -0,0 +1,209 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_EXECUTOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_EXECUTOR_H_
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/framework/rendezvous.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class StepStatsCollector;
|
||||
|
||||
// Executor runs a graph computation.
|
||||
// Example:
|
||||
// Graph* graph = ...;
|
||||
// ... construct graph ...
|
||||
// Executor* executor;
|
||||
// TF_CHECK_OK(NewSimpleExecutor(my_device, graph, &executor));
|
||||
// Rendezvous* rendezvous = NewNaiveRendezvous();
|
||||
// TF_CHECK_OK(rendezvous->Send("input", some_input_tensor));
|
||||
// TF_CHECK_OK(executor->Run({ExecutorOpts, rendezvous, nullptr}));
|
||||
// TF_CHECK_OK(rendezvous->Recv("input", &output_tensor));
|
||||
// ... ...
|
||||
//
|
||||
// Multiple threads can call Executor::Run concurrently.
|
||||
class Executor {
|
||||
public:
|
||||
virtual ~Executor() {}
|
||||
|
||||
// RunAsync() executes the graph computation. "done" is run when the
|
||||
// graph computation completes. If any error happens during the
|
||||
// computation, "done" is run and the error is passed to "done".
|
||||
//
|
||||
// RunAsync() is given a few arguments in Args. The caller must
|
||||
// ensure objects passed in Args (rendezvous, stats_collector, etc.)
|
||||
// are alive at least until done is invoked. All pointers to the
|
||||
// argument objects can be nullptr.
|
||||
//
|
||||
// RunAsync() uses the given "rendezvous", if not null, as the
|
||||
// mechanism to communicate inputs and outputs of the underlying
|
||||
// graph computation.
|
||||
//
|
||||
// RunAsync() calls "stats_collector", if not null, to keep track of
|
||||
// stats. This allows us to collect statistics and traces on demand.
|
||||
//
|
||||
// RunAsync() is provided a "call_frame", which, if the executor is used
// for executing a function, is used to pass arguments and return
// values between the caller and the callee.
|
||||
//
|
||||
// RunAsync() uses "cancellation_manager", if not nullptr, to
|
||||
// register callbacks that should be called if the graph computation
|
||||
// is cancelled. Note that the callbacks merely unblock any
|
||||
// long-running computation, and a cancelled step will terminate by
|
||||
// returning/calling the DoneCallback as usual.
|
||||
//
|
||||
// RunAsync() dispatches closures to "runner". Typically, "runner"
// is backed by a bounded threadpool.
|
||||
struct Args {
|
||||
Rendezvous* rendezvous = nullptr;
|
||||
StepStatsCollector* stats_collector = nullptr;
|
||||
FunctionCallFrame* call_frame = nullptr;
|
||||
CancellationManager* cancellation_manager = nullptr;
|
||||
|
||||
typedef std::function<void()> Closure;
|
||||
typedef std::function<void(Closure)> Runner;
|
||||
Runner runner = nullptr;
|
||||
};
|
||||
typedef std::function<void(const Status&)> DoneCallback;
|
||||
virtual void RunAsync(const Args& args, DoneCallback done) = 0;
|
||||
|
||||
// Synchronous wrapper for RunAsync().
|
||||
Status Run(const Args& args) {
|
||||
Status ret;
|
||||
Notification n;
|
||||
RunAsync(args, [&ret, &n](const Status& s) {
|
||||
ret = s;
|
||||
n.Notify();
|
||||
});
|
||||
n.WaitForNotification();
|
||||
return ret;
|
||||
}
|
||||
};
|
||||
|
||||
// Creates an Executor that computes the given "graph".
|
||||
//
|
||||
// If successful, returns the constructed executor in "*executor". The
|
||||
// caller keeps the ownership of "device". The returned executor takes
|
||||
// the ownership of "graph". Otherwise, returns an error status.
|
||||
//
|
||||
// "params" provides a set of context for the executor. We expect that
|
||||
// different context would provide different implementations.
|
||||
struct LocalExecutorParams {
|
||||
Device* device;
|
||||
|
||||
// The library runtime support.
|
||||
FunctionLibraryRuntime* function_library;
|
||||
|
||||
// True iff the computation contains control flow nodes.
|
||||
bool has_control_flow;
|
||||
|
||||
// create_kernel returns an instance of op kernel based on NodeDef.
|
||||
// delete_kernel is called for every kernel used by the executor
|
||||
// when the executor is deleted.
|
||||
std::function<Status(const NodeDef&, OpKernel**)> create_kernel;
|
||||
std::function<void(OpKernel*)> delete_kernel;
|
||||
};
|
||||
::tensorflow::Status NewLocalExecutor(const LocalExecutorParams& params,
|
||||
const Graph* graph, Executor** executor);
|
||||
|
||||
// A class to help run multiple executors in parallel and wait until
|
||||
// all of them are complete.
|
||||
//
|
||||
// ExecutorBarrier deletes itself after the function returned by Get()
|
||||
// is called.
|
||||
class ExecutorBarrier {
|
||||
public:
|
||||
typedef std::function<void(const Status&)> StatusCallback;
|
||||
|
||||
// Create an ExecutorBarrier for 'num' different executors.
|
||||
//
|
||||
// 'r' is the shared Rendezvous object that is used to communicate
|
||||
// state. If any of the executors experiences an error, the
|
||||
// rendezvous object will be aborted exactly once.
|
||||
//
|
||||
// 'done' is called after the last executor completes, and
|
||||
// ExecutorBarrier is deleted.
|
||||
ExecutorBarrier(int num, Rendezvous* r, StatusCallback done)
|
||||
: rendez_(r), done_cb_(done), pending_(num) {}
|
||||
|
||||
~ExecutorBarrier() {}
|
||||
|
||||
// Returns a closure that Executors must call when they are done
|
||||
// computing, passing the status of their execution as an argument.
|
||||
StatusCallback Get() {
|
||||
return std::bind(&ExecutorBarrier::WhenDone, this, std::placeholders::_1);
|
||||
}
|
||||
|
||||
private:
|
||||
Rendezvous* rendez_ = nullptr;
|
||||
StatusCallback done_cb_ = nullptr;
|
||||
|
||||
mutable mutex mu_;
|
||||
int pending_ GUARDED_BY(mu_) = 0;
|
||||
Status status_ GUARDED_BY(mu_);
|
||||
|
||||
void WhenDone(const Status& s) {
|
||||
bool error = false;
|
||||
StatusCallback done = nullptr;
|
||||
Status status;
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
// If we are the first error encountered, mark the status
|
||||
// appropriately and later trigger an abort of the Rendezvous
|
||||
// object by this thread only.
|
||||
if (status_.ok() && !s.ok()) {
|
||||
error = true;
|
||||
status_ = s;
|
||||
}
|
||||
|
||||
// If this is the last call to WhenDone, call the final callback
|
||||
// below.
|
||||
if (--pending_ == 0) {
|
||||
CHECK(done_cb_ != nullptr);
|
||||
done = done_cb_;
|
||||
done_cb_ = nullptr;
|
||||
}
|
||||
status = status_;
|
||||
}
|
||||
if (error) {
|
||||
rendez_->StartAbort(status);
|
||||
}
|
||||
if (done != nullptr) {
|
||||
delete this;
|
||||
done(status);
|
||||
}
|
||||
}
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(ExecutorBarrier);
|
||||
};
|
||||
|
||||
// A few helpers to facilitate create/delete kernels.
|
||||
|
||||
// Creates a kernel based on "ndef" on device "device". The kernel can
|
||||
// access the functions in the "flib". The caller takes ownership of
|
||||
// returned "*kernel".
|
||||
Status CreateNonCachedKernel(Device* device, FunctionLibraryRuntime* flib,
|
||||
const NodeDef& ndef, OpKernel** kernel);
|
||||
|
||||
// Deletes "kernel" returned by CreateKernel.
|
||||
void DeleteNonCachedKernel(OpKernel* kernel);
|
||||
|
||||
// Creates a kernel based on "ndef" on device "device". The kernel can
|
||||
// access the functions in the "flib". The caller does not take
|
||||
// ownership of returned "*kernel". If a kernel has been created for
|
||||
// ndef.name(), returns the same kernel instance.
|
||||
Status CreateCachedKernel(Device* device, const string& session,
|
||||
FunctionLibraryRuntime* flib, const NodeDef& ndef,
|
||||
OpKernel** kernel);
|
||||
|
||||
// Deletes "kernel" returned by CreateCachedKernel.
|
||||
void DeleteCachedKernel(Device* device, const string& session,
|
||||
OpKernel* kernel);
|
||||
|
||||
} // end namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_EXECUTOR_H_
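A hedged sketch of running several executors in parallel behind an ExecutorBarrier; the executors, rendezvous, and thread pool are assumed to have been constructed elsewhere.
// Illustrative: "executors", "rendezvous" and "pool" are assumed to exist.
void ExampleRunInParallel(const std::vector<Executor*>& executors,
                          Rendezvous* rendezvous, thread::ThreadPool* pool,
                          Executor::DoneCallback all_done) {
  // The barrier calls all_done once and deletes itself after the last
  // executor reports in; errors abort the shared rendezvous exactly once.
  ExecutorBarrier* barrier =
      new ExecutorBarrier(executors.size(), rendezvous, all_done);
  for (Executor* exec : executors) {
    Executor::Args args;
    args.rendezvous = rendezvous;
    args.runner = [pool](Executor::Args::Closure c) { pool->Schedule(c); };
    exec->RunAsync(args, barrier->Get());
  }
}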
|
1335
tensorflow/core/common_runtime/function.cc
Normal file
File diff suppressed because it is too large
100
tensorflow/core/common_runtime/function.h
Normal file
@ -0,0 +1,100 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_FUNCTION_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_FUNCTION_H_
|
||||
|
||||
#include <functional>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/framework/function.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Creates a FunctionLibraryRuntime, which instantiates functions
|
||||
// defined in "lib_def" and executes functions on the "device".
|
||||
//
|
||||
// The returned object does not take ownership of "device" or
// "lib_def". The caller must ensure "device" and "lib_def" outlive
// the returned object.
|
||||
typedef std::function<void()> Closure;
|
||||
typedef std::function<void(Closure)> Runner;
|
||||
FunctionLibraryRuntime* NewFunctionLibraryRuntime(
|
||||
Device* device, Runner runner, const FunctionLibraryDefinition* lib_def);
|
||||
|
||||
// FunctionLibraryRuntime::GetFunctionBody returns a description of an
|
||||
// instantiated function that is represented as a Graph with arg/ret
|
||||
// nodes annotated.
|
||||
struct FunctionBody {
|
||||
FunctionDef fdef;
|
||||
Graph* graph = nullptr; // owned.
|
||||
DataTypeVector arg_types;
|
||||
DataTypeVector ret_types;
|
||||
gtl::InlinedVector<Node*, 4> arg_nodes;
|
||||
gtl::InlinedVector<Node*, 4> ret_nodes;
|
||||
|
||||
FunctionBody() {}
|
||||
FunctionBody(const FunctionDef& f, DataTypeSlice arg_types,
|
||||
DataTypeSlice ret_types, Graph* g);
|
||||
~FunctionBody();
|
||||
};
|
||||
|
||||
// Debugging facility. Returns a debug string for a graph
|
||||
// representing an instantiated function.
|
||||
string DebugString(const Graph* instantiated_func_graph);
|
||||
|
||||
// A few hand-crafted optimization on the instantiated function body
|
||||
// (a Graph*).
|
||||
|
||||
// Removes nodes that are
|
||||
// 1. not stateful; and
|
||||
// 2. not _Arg; and
|
||||
// 3. not reachable from _Retval.
|
||||
// Returns true iff any node is removed from "g".
|
||||
bool RemoveDeadNodes(Graph* g);
|
||||
|
||||
// Find a pattern:
|
||||
// src -(in)-> node -(out)-> dst, where
|
||||
// 1) node is an identity node;
|
||||
// 2) in is the only incoming data edge;
|
||||
// 3) out is the only outgoing data edge;
|
||||
//
|
||||
// Rewrites the above pattern with src->dst and relevant data
|
||||
// dependencies updated. Repeat the process until no such pattern
|
||||
// left.
|
||||
bool RemoveIdentityNodes(Graph* g);
|
||||
|
||||
// Rewrites _ListToArray and _ArrayToList to a set of Identity nodes.
|
||||
bool RemoveListArrayConverter(Graph* g);
|
||||
|
||||
// For each node in "graph", if "lib" indicates that the node is a
|
||||
// function call, inline the function body. Returns true if at least
|
||||
// one node is inlined.
|
||||
//
|
||||
// This routine goes through "graph" nodes once and applies the
|
||||
// inlining. The caller may decide to apply the inlining on "graph"
|
||||
// multiple times by calling ExpandInlineFunctions a few times.
|
||||
bool ExpandInlineFunctions(FunctionLibraryRuntime* lib, Graph* graph);
|
||||
|
||||
// Applies graph rewrite optimizations such as inlining, dead code
// removal, etc.
|
||||
//
|
||||
// **g is a graph constructed based on the runtime library 'lib'.
|
||||
// OptimizeGraph mutates **g extensively and replaces '*g' with a
|
||||
// complete copy. Therefore, the caller should not keep any references
|
||||
// to nodes in *g.
|
||||
void OptimizeGraph(FunctionLibraryRuntime* lib, Graph** g);
|
||||
|
||||
// Given a numerical function "f", returns another numerical function
|
||||
// "g", such that if "f" takes N inputs and produces M outputs, "g"
|
||||
// takes N + M inputs and produces N outputs. I.e., if
|
||||
// (y1, y2, ..., y_M) = f(x1, x2, ..., x_N),
|
||||
// g is a function which is
|
||||
// (dL/dx1, dL/dx2, ..., dL/dx_N) = g(x1, x2, ..., x_N,
|
||||
// dL/dy1, dL/dy2, ..., dL/dy_M),
|
||||
// where L is a scalar-valued function of (...x_i...).
|
||||
//
|
||||
// TODO(zhifengc): Ask a math expert to restate the comment above more precisely.
|
||||
FunctionBody* SymbolicGradient(const FunctionBody& f);
|
||||
|
||||
} // end namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_FUNCTION_H_
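A concrete instance of the convention above, for an assumed one-input, one-output function:
// Worked example (illustrative): if f(x) = x * x, then N = 1 and M = 1, so
// SymbolicGradient produces g with signature
//   (dL/dx) = g(x, dL/dy)
// and, by the chain rule, dL/dx = dL/dy * 2 * x.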
|
18
tensorflow/core/common_runtime/gpu/dma_helper.h
Normal file
@ -0,0 +1,18 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_DMA_HELPER_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_DMA_HELPER_H_
|
||||
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
// For internal use only. Visibility should be limited to brain/framework.
|
||||
|
||||
namespace tensorflow {
|
||||
class DMAHelper {
|
||||
public:
|
||||
static bool CanUseDMA(const Tensor* t) { return t->CanUseDMA(); }
|
||||
static const void* base(const Tensor* t) { return t->base<const void>(); }
|
||||
static void* base(Tensor* t) { return t->base<void>(); }
|
||||
static TensorBuffer* buffer(Tensor* t) { return t->buf_; }
|
||||
static const TensorBuffer* buffer(const Tensor* t) { return t->buf_; }
|
||||
};
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_DMA_HELPER_H_
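A brief sketch of the intended use, assuming a hypothetical CopyToDevice transfer routine; DMAHelper only exposes the raw buffer, it does not perform the copy itself.
// Illustrative only; CopyToDevice is a hypothetical transfer routine.
void ExampleDmaCopy(const Tensor& t) {
  if (DMAHelper::CanUseDMA(&t)) {
    const void* src = DMAHelper::base(&t);
    // CopyToDevice(device_dst, src, t.TotalBytes());
  } else {
    // Fall back to an element-wise or serialized copy.
  }
}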
|
49
tensorflow/core/common_runtime/gpu/gpu_allocator_retry.cc
Normal file
@ -0,0 +1,49 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
#include "tensorflow/core/public/env.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
GPUAllocatorRetry::GPUAllocatorRetry() : env_(Env::Default()) {}
|
||||
|
||||
void* GPUAllocatorRetry::AllocateRaw(
|
||||
std::function<void*(size_t alignment, size_t num_bytes,
|
||||
bool verbose_failure)> alloc_func,
|
||||
int max_millis_to_wait, size_t alignment, size_t num_bytes) {
|
||||
if (num_bytes == 0) {
|
||||
LOG(WARNING) << "Request to allocate 0 bytes";
|
||||
return nullptr;
|
||||
}
|
||||
uint64 deadline_micros = env_->NowMicros() + max_millis_to_wait * 1000;
|
||||
void* ptr = nullptr;
|
||||
while (ptr == nullptr) {
|
||||
ptr = alloc_func(alignment, num_bytes, false);
|
||||
if (ptr == nullptr) {
|
||||
uint64 now = env_->NowMicros();
|
||||
if (now < deadline_micros) {
|
||||
mutex_lock l(mu_);
|
||||
WaitForMilliseconds(&l, &memory_returned_,
|
||||
(deadline_micros - now) / 1000);
|
||||
} else {
|
||||
return alloc_func(alignment, num_bytes, true);
|
||||
}
|
||||
}
|
||||
}
|
||||
return ptr;
|
||||
}
|
||||
|
||||
void GPUAllocatorRetry::DeallocateRaw(std::function<void(void*)> dealloc_func,
|
||||
void* ptr) {
|
||||
if (ptr == nullptr) {
|
||||
LOG(ERROR) << "Request to free nullptr";
|
||||
return;
|
||||
}
|
||||
dealloc_func(ptr);
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
memory_returned_.notify_all();
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
36
tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h
Normal file
@ -0,0 +1,36 @@
|
||||
#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_ALLOCATOR_RETRY_H_
|
||||
#define TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_ALLOCATOR_RETRY_H_
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/env.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// A retrying wrapper for a memory allocator.
|
||||
class GPUAllocatorRetry {
|
||||
public:
|
||||
GPUAllocatorRetry();
|
||||
|
||||
// Call 'alloc_func' to obtain memory. On first call,
|
||||
// 'verbose_failure' will be false. If return value is nullptr,
|
||||
// then wait up to 'max_millis_to_wait' milliseconds, retrying each
|
||||
// time a call to DeallocateRaw() is detected, until either a good
|
||||
// pointer is returned or the deadline is exhausted. If the
|
||||
// deadline is exhausted, try one more time with 'verbose_failure'
|
||||
// set to true. The value returned is either the first good pointer
|
||||
// obtained from 'alloc_func' or nullptr.
|
||||
void* AllocateRaw(std::function<void*(size_t alignment, size_t num_bytes,
|
||||
bool verbose_failure)> alloc_func,
|
||||
int max_millis_to_wait, size_t alignment, size_t bytes);
|
||||
|
||||
// Calls dealloc_func(ptr) and then notifies any threads blocked in
|
||||
// AllocateRaw() that would like to retry.
|
||||
void DeallocateRaw(std::function<void(void* ptr)> dealloc_func, void* ptr);
|
||||
|
||||
private:
|
||||
Env* env_;
|
||||
mutex mu_;
|
||||
condition_variable memory_returned_;
|
||||
};
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_ALLOCATOR_RETRY_H_
|
175
tensorflow/core/common_runtime/gpu/gpu_allocator_retry_test.cc
Normal file
@ -0,0 +1,175 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
#include "tensorflow/core/public/env.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
class FakeAllocator {
|
||||
public:
|
||||
FakeAllocator(size_t cap, int millis_to_wait)
|
||||
: memory_capacity_(cap), millis_to_wait_(millis_to_wait) {}
|
||||
|
||||
// Allocate just keeps track of the number of outstanding allocations,
|
||||
// not their sizes. Assume a constant size for each.
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) {
|
||||
return retry_.AllocateRaw(
|
||||
[this](size_t a, size_t nb, bool v) {
|
||||
mutex_lock l(mu_);
|
||||
if (memory_capacity_ > 0) {
|
||||
--memory_capacity_;
|
||||
return good_ptr_;
|
||||
} else {
|
||||
return static_cast<void*>(nullptr);
|
||||
}
|
||||
},
|
||||
millis_to_wait_, alignment, num_bytes);
|
||||
}
|
||||
|
||||
void DeallocateRaw(void* ptr) {
|
||||
retry_.DeallocateRaw(
|
||||
[this](void* p) {
|
||||
mutex_lock l(mu_);
|
||||
++memory_capacity_;
|
||||
},
|
||||
ptr);
|
||||
}
|
||||
|
||||
private:
|
||||
GPUAllocatorRetry retry_;
|
||||
void* good_ptr_ = reinterpret_cast<void*>(0xdeadbeef);
|
||||
mutex mu_;
|
||||
size_t memory_capacity_ GUARDED_BY(mu_);
|
||||
int millis_to_wait_;
|
||||
};
|
||||
|
||||
class GPUAllocatorRetryTest : public ::testing::Test {
|
||||
protected:
|
||||
GPUAllocatorRetryTest() {}
|
||||
|
||||
void LaunchConsumerThreads(int num_consumers, int cap_needed) {
|
||||
consumer_count_.resize(num_consumers, 0);
|
||||
for (int i = 0; i < num_consumers; ++i) {
|
||||
consumers_.push_back(Env::Default()->StartThread(
|
||||
ThreadOptions(), "anon_thread", [this, i, cap_needed]() {
|
||||
do {
|
||||
void* ptr = nullptr;
|
||||
for (int j = 0; j < cap_needed; ++j) {
|
||||
ptr = alloc_->AllocateRaw(16, 1);
|
||||
if (ptr == nullptr) {
|
||||
mutex_lock l(mu_);
|
||||
has_failed_ = true;
|
||||
return;
|
||||
}
|
||||
}
|
||||
++consumer_count_[i];
|
||||
for (int j = 0; j < cap_needed; ++j) {
|
||||
alloc_->DeallocateRaw(ptr);
|
||||
}
|
||||
} while (!notifier_.HasBeenNotified());
|
||||
}));
|
||||
}
|
||||
}
|
||||
|
||||
// Wait up to wait_micros microseconds for has_failed_ to equal expected,
|
||||
// then terminate all threads.
|
||||
void JoinConsumerThreads(bool expected, int wait_micros) {
|
||||
while (wait_micros > 0) {
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
if (has_failed_ == expected) break;
|
||||
}
|
||||
int interval_micros = std::min(1000, wait_micros);
|
||||
Env::Default()->SleepForMicroseconds(interval_micros);
|
||||
wait_micros -= interval_micros;
|
||||
}
|
||||
notifier_.Notify();
|
||||
for (auto c : consumers_) {
|
||||
// Blocks until thread terminates.
|
||||
delete c;
|
||||
}
|
||||
}
|
||||
|
||||
std::unique_ptr<FakeAllocator> alloc_;
|
||||
std::vector<Thread*> consumers_;
|
||||
std::vector<int> consumer_count_;
|
||||
Notification notifier_;
|
||||
mutex mu_;
|
||||
bool has_failed_ GUARDED_BY(mu_) = false;
|
||||
int count_ GUARDED_BY(mu_) = 0;
|
||||
};
|
||||
|
||||
// Verifies correct retrying when memory is slightly overcommitted but
|
||||
// we allow retry.
|
||||
TEST_F(GPUAllocatorRetryTest, RetrySuccess) {
|
||||
// Support up to 2 allocations simultaneously, waits up to 10 msec for
|
||||
// a chance to alloc.
|
||||
alloc_.reset(new FakeAllocator(2, 10000));
|
||||
// Launch 3 consumers, each of whom needs 1 unit at a time.
|
||||
LaunchConsumerThreads(3, 1);
|
||||
// This should be enough time for each consumer to be satisfied many times.
|
||||
Env::Default()->SleepForMicroseconds(50000);
|
||||
JoinConsumerThreads(false, 0);
|
||||
for (int i = 0; i < 3; ++i) {
|
||||
LOG(INFO) << "Consumer " << i << " is " << consumer_count_[i];
|
||||
}
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
EXPECT_FALSE(has_failed_);
|
||||
}
|
||||
EXPECT_GT(consumer_count_[0], 0);
|
||||
EXPECT_GT(consumer_count_[1], 0);
|
||||
EXPECT_GT(consumer_count_[2], 0);
|
||||
}
|
||||
|
||||
// Verifies OutOfMemory failure when memory is slightly overcommitted
|
||||
// and retry is not allowed.
|
||||
TEST_F(GPUAllocatorRetryTest, NoRetryFail) {
|
||||
// Support up to 2 allocations simultaneously, waits up to 0 msec for
|
||||
// a chance to alloc.
|
||||
alloc_.reset(new FakeAllocator(2, 0));
|
||||
// Launch 3 consumers, each of whom needs 1 unit at a time.
|
||||
LaunchConsumerThreads(3, 1);
|
||||
Env::Default()->SleepForMicroseconds(50000);
|
||||
// Will wait up to 10 seconds for proper race condition to occur, resulting
|
||||
// in failure.
|
||||
JoinConsumerThreads(true, 10000000);
|
||||
for (int i = 0; i < 3; ++i) {
|
||||
LOG(INFO) << "Consumer " << i << " is " << consumer_count_[i];
|
||||
}
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
EXPECT_TRUE(has_failed_);
|
||||
}
|
||||
}
|
||||
|
||||
// Verifies OutOfMemory failure when retry is allowed but memory capacity
|
||||
// is too low even for retry.
|
||||
TEST_F(GPUAllocatorRetryTest, RetryInsufficientFail) {
|
||||
// Support up to 2 allocations simultaneously, waits up to 10 msec for
|
||||
// a chance to alloc.
|
||||
alloc_.reset(new FakeAllocator(2, 10000));
|
||||
// Launch 3 consumers, each of whom needs 2 units at a time. We expect
|
||||
// deadlock where 2 consumers each hold 1 unit, and timeout trying to
|
||||
// get the second.
|
||||
LaunchConsumerThreads(3, 2);
|
||||
Env::Default()->SleepForMicroseconds(50000);
|
||||
// Will wait up to 10 seconds for proper race condition to occur, resulting
|
||||
// in failure.
|
||||
JoinConsumerThreads(true, 10000000);
|
||||
for (int i = 0; i < 3; ++i) {
|
||||
LOG(INFO) << "Consumer " << i << " is " << consumer_count_[i];
|
||||
}
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
EXPECT_TRUE(has_failed_);
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
397
tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc
Normal file
@ -0,0 +1,397 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"
|
||||
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/lib/core/bits.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
GPUBFCAllocator::GPUBFCAllocator(int device_id, size_t total_memory)
|
||||
: device_id_(device_id) {
|
||||
// Get a pointer to the stream_executor for this device
|
||||
stream_exec_ = GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
// Allocate the requested amount of memory.
|
||||
gpu_memory_size_ = total_memory;
|
||||
|
||||
LOG(INFO) << "Allocating " << strings::HumanReadableNumBytes(gpu_memory_size_)
|
||||
<< " bytes.";
|
||||
gpu::DeviceMemory<char> gpu_mem =
|
||||
stream_exec_->AllocateArray<char>(gpu_memory_size_);
|
||||
|
||||
QCHECK(gpu_mem != nullptr)
|
||||
<< " Could not allocate GPU device memory for device " << device_id
|
||||
<< ". Tried to allocate "
|
||||
<< strings::HumanReadableNumBytes(gpu_memory_size_);
|
||||
base_ptr_ = gpu_mem.opaque();
|
||||
LOG(INFO) << "GPU " << device_id << " memory begins at " << base_ptr_
|
||||
<< " extends to "
|
||||
<< static_cast<void*>(
|
||||
(static_cast<char*>(base_ptr_) + gpu_memory_size_));
|
||||
|
||||
// Create a bunch of bins of various good sizes.
|
||||
|
||||
// Covers allocations of exactly 256 bytes (the minimum size).
|
||||
bins_.insert(std::make_pair(256, new Bin(256)));
|
||||
|
||||
// We create bins to fit all possible ranges that cover the
|
||||
// gpu_memory_size_ starting from allocations up to 1024 bytes to
|
||||
// allocations up to (and including) the memory limit.
|
||||
for (size_t bin_size = 1024; bin_size < gpu_memory_size_ * 2; bin_size *= 2) {
|
||||
LOG(INFO) << "Creating bin of max chunk size "
|
||||
<< strings::HumanReadableNumBytes(bin_size);
|
||||
bins_.insert(std::make_pair(bin_size, new Bin(bin_size)));
|
||||
}
|
||||
|
||||
// Create one large chunk for the whole memory space that will
|
||||
// be chunked later.
|
||||
GPUBFCAllocator::Chunk* c = new GPUBFCAllocator::Chunk();
|
||||
c->ptr = gpu_mem.opaque();
|
||||
c->size = gpu_memory_size_;
|
||||
c->in_use = false;
|
||||
c->prev = nullptr;
|
||||
c->next = nullptr;
|
||||
|
||||
ptr_to_chunk_map_.insert(std::make_pair(c->ptr, c));
|
||||
|
||||
// Insert the chunk into the right bin.
|
||||
ReassignChunkToBin(c);
|
||||
}
|
||||
|
||||
GPUBFCAllocator::~GPUBFCAllocator() {
|
||||
// Return memory back.
|
||||
if (base_ptr_) {
|
||||
gpu::DeviceMemoryBase gpu_ptr{base_ptr_};
|
||||
stream_exec_->Deallocate(&gpu_ptr);
|
||||
}
|
||||
|
||||
gtl::STLDeleteValues(&bins_);
|
||||
}
|
||||
|
||||
void* GPUBFCAllocator::AllocateRaw(size_t unused_alignment, size_t num_bytes) {
|
||||
static const int64 kMaxMillisToWait = 10000; // 10 seconds
|
||||
return retry_helper_.AllocateRaw(
|
||||
[this](size_t a, size_t nb, bool v) {
|
||||
return AllocateRawInternal(a, nb, v);
|
||||
},
|
||||
kMaxMillisToWait, unused_alignment, num_bytes);
|
||||
}
|
||||
|
||||
void* GPUBFCAllocator::AllocateRawInternal(size_t unused_alignment,
|
||||
size_t num_bytes,
|
||||
bool dump_log_on_failure) {
|
||||
if (num_bytes == 0) {
|
||||
LOG(ERROR) << "tried to allocate 0 bytes";
|
||||
return nullptr;
|
||||
}
|
||||
// First, always allocate memory of at least 256 bytes, and always
|
||||
// allocate multiples of 256 bytes so all memory addresses are
|
||||
// nicely byte aligned.
|
||||
size_t rounded_bytes = (256 * ((num_bytes + 255) / 256));
|
||||
DCHECK_EQ(0, rounded_bytes % 256);
|
||||
|
||||
// The BFC allocator tries to find the best fit first.
|
||||
//
|
||||
// First identify the first bin that could satisfy rounded_bytes.
|
||||
auto it = bins_.lower_bound(rounded_bytes);
|
||||
if (it == bins_.end()) {
|
||||
LOG(ERROR) << " Asked for " << rounded_bytes << " but largest bin was "
|
||||
<< bins_.rbegin()->first;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
mutex_lock l(lock_);
|
||||
for (; it != bins_.end(); ++it) {
|
||||
// Start searching from the first bin for the smallest chunk that fits
|
||||
// rounded_bytes.
|
||||
Bin* b = it->second;
|
||||
for (GPUBFCAllocator::Chunk* chunk : b->chunks) {
|
||||
if (!chunk->in_use && chunk->size > rounded_bytes) {
|
||||
// We found an existing chunk that fits us that wasn't in use.
|
||||
chunk->in_use = true;
|
||||
|
||||
// If we can break the size of the chunk into two reasonably
|
||||
// large pieces, do so.
|
||||
//
|
||||
// TODO(vrv): What should be the criteria when deciding when
|
||||
// to split?
|
||||
if (chunk->size >= rounded_bytes * 2) {
|
||||
SplitChunk(chunk, rounded_bytes);
|
||||
}
|
||||
|
||||
// The requested size of the returned chunk is what the user
|
||||
// has allocated.
|
||||
chunk->requested_size = num_bytes;
|
||||
|
||||
VLOG(4) << "Returning: " << chunk->ptr;
|
||||
return chunk->ptr;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// We searched all bins for an existing free chunk to use and
|
||||
// couldn't find one. This means we must have run out of memory.
|
||||
// Dump the memory log for analysis.
|
||||
if (dump_log_on_failure) {
|
||||
DumpMemoryLog(rounded_bytes);
|
||||
LOG(WARNING) << "Ran out of memory trying to allocate "
|
||||
<< strings::HumanReadableNumBytes(num_bytes)
|
||||
<< ". See logs for memory state";
|
||||
}
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::SplitChunk(GPUBFCAllocator::Chunk* c, size_t num_bytes) {
|
||||
// Create a new chunk starting num_bytes after c
|
||||
GPUBFCAllocator::Chunk* new_chunk = new GPUBFCAllocator::Chunk();
|
||||
new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);
|
||||
VLOG(6) << "Adding to chunk map: " << new_chunk->ptr;
|
||||
ptr_to_chunk_map_.insert(std::make_pair(new_chunk->ptr, new_chunk));
|
||||
|
||||
// Set the new sizes of the chunks.
|
||||
new_chunk->size = c->size - num_bytes;
|
||||
c->size = num_bytes;
|
||||
|
||||
// The new chunk is not in use.
|
||||
new_chunk->in_use = false;
|
||||
|
||||
// Maintain the pointers.
|
||||
// c <-> c_neighbor becomes
|
||||
// c <-> new_chunk <-> c_neighbor
|
||||
GPUBFCAllocator::Chunk* c_neighbor = c->next;
|
||||
new_chunk->prev = c;
|
||||
new_chunk->next = c_neighbor;
|
||||
c->next = new_chunk;
|
||||
if (c_neighbor) {
|
||||
c_neighbor->prev = new_chunk;
|
||||
}
|
||||
|
||||
// Maintain the bins
|
||||
ReassignChunkToBin(new_chunk);
|
||||
ReassignChunkToBin(c);
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::DeallocateRaw(void* ptr) {
|
||||
retry_helper_.DeallocateRaw([this](void* p) { DeallocateRawInternal(p); },
|
||||
ptr);
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::DeallocateRawInternal(void* ptr) {
|
||||
if (ptr == nullptr) {
|
||||
LOG(ERROR) << "tried to deallocate nullptr";
|
||||
return;
|
||||
}
|
||||
mutex_lock l(lock_);
|
||||
|
||||
// Find the chunk from the ptr.
|
||||
auto it = ptr_to_chunk_map_.find(ptr);
|
||||
CHECK(it != ptr_to_chunk_map_.end())
|
||||
<< "Asked to deallocate a pointer we never allocated: " << ptr;
|
||||
|
||||
GPUBFCAllocator::Chunk* c = it->second;
|
||||
VLOG(6) << "Chunk at " << c->ptr << " no longer in use";
|
||||
// Mark the chunk as no longer in use
|
||||
c->in_use = false;
|
||||
|
||||
// Consider coalescing it.
|
||||
MaybeCoalesce(c);
|
||||
}
|
||||
|
||||
// Merges c1 and c2 when c1->next is c2 and c2->prev is c1.
|
||||
// We merge c2 into c1.
|
||||
void GPUBFCAllocator::Merge(GPUBFCAllocator::Chunk* c1,
|
||||
GPUBFCAllocator::Chunk* c2) {
|
||||
// We can only merge chunks that are not in use.
|
||||
DCHECK(!c1->in_use && !c2->in_use);
|
||||
|
||||
// c1's prev doesn't change, still points to the same ptr, and is
|
||||
// still not in use.
|
||||
|
||||
// Fix up neighbor pointers
|
||||
//
|
||||
// c1 <-> c2 <-> c3 should become
|
||||
// c1 <-> c3
|
||||
GPUBFCAllocator::Chunk* c3 = c2->next;
|
||||
c1->next = c3;
|
||||
CHECK(c2->prev == c1);
|
||||
if (c3 != nullptr) {
|
||||
c3->prev = c1;
|
||||
}
|
||||
|
||||
// Set the new size
|
||||
c1->size += c2->size;
|
||||
|
||||
// Delete c2 and cleanup all state
|
||||
RemoveChunkFromBin(c2);
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::ReassignChunkToBin(GPUBFCAllocator::Chunk* c) {
|
||||
auto it = bins_.lower_bound(c->size);
|
||||
CHECK(it != bins_.end()) << " Tried to reassign to non-existent bin for size "
|
||||
<< c->size;
|
||||
|
||||
Bin* new_bin = it->second;
|
||||
|
||||
// If the bin has not changed, do nothing.
|
||||
Bin* old_bin = c->bin;
|
||||
if (old_bin != nullptr && new_bin == old_bin) {
|
||||
return;
|
||||
}
|
||||
|
||||
// The bin has changed. Add the chunk to the new bin and remove
|
||||
// the chunk from the old bin.
|
||||
new_bin->chunks.insert(c);
|
||||
c->bin = new_bin;
|
||||
|
||||
if (old_bin == nullptr) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Remove chunk from old bin
|
||||
for (auto it = old_bin->chunks.begin(); it != old_bin->chunks.end(); ++it) {
|
||||
if (*it == c) {
|
||||
old_bin->chunks.erase(it);
|
||||
return;
|
||||
}
|
||||
}
|
||||
CHECK(false) << "Could not find chunk in old bin";
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::RemoveChunkFromBin(GPUBFCAllocator::Chunk* c) {
|
||||
Bin* b = c->bin;
|
||||
for (auto it = b->chunks.begin(); it != b->chunks.end(); ++it) {
|
||||
Chunk* other_c = *it;
|
||||
if (other_c->ptr == c->ptr) {
|
||||
b->chunks.erase(it);
|
||||
VLOG(4) << "Removing: " << c->ptr;
|
||||
ptr_to_chunk_map_.erase(c->ptr);
|
||||
delete c;
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
CHECK(false) << "Could not find chunk in bin";
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::MaybeCoalesce(GPUBFCAllocator::Chunk* c) {
|
||||
// This chunk is no longer in-use, consider coalescing the chunk
|
||||
// with adjacent chunks.
|
||||
Chunk* chunk_to_reassign = nullptr;
|
||||
|
||||
// If the next chunk is free, coalesce the two, if the result would
|
||||
// fit in an existing bin.
|
||||
if (c->next && !c->next->in_use) {
|
||||
VLOG(8) << "Chunk at " << c->next->ptr << " merging with c " << c->ptr;
|
||||
|
||||
chunk_to_reassign = c;
|
||||
|
||||
// Deletes c->next
|
||||
Merge(c, c->next);
|
||||
}
|
||||
|
||||
// If the previous chunk is free, coalesce the two
|
||||
if (c->prev && !c->prev->in_use) {
|
||||
VLOG(8) << "Chunk at " << c->ptr << " merging into c->prev "
|
||||
<< c->prev->ptr;
|
||||
|
||||
chunk_to_reassign = c->prev;
|
||||
|
||||
// Deletes c
|
||||
Merge(c->prev, c);
|
||||
}
|
||||
|
||||
// Reassign the final merged chunk into the right bin.
|
||||
if (chunk_to_reassign) {
|
||||
ReassignChunkToBin(chunk_to_reassign);
|
||||
}
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::AddAllocVisitor(Visitor visitor) {
|
||||
VLOG(1) << "AddVisitor";
|
||||
mutex_lock l(lock_);
|
||||
region_visitors_.push_back(visitor);
|
||||
visitor(base_ptr_, gpu_memory_size_);
|
||||
}
|
||||
|
||||
bool GPUBFCAllocator::TracksAllocationSizes() { return true; }
|
||||
|
||||
size_t GPUBFCAllocator::RequestedSize(void* ptr) {
|
||||
mutex_lock l(lock_);
|
||||
auto it = ptr_to_chunk_map_.find(ptr);
|
||||
CHECK(it != ptr_to_chunk_map_.end())
|
||||
<< "Asked for requested size of pointer we never allocated: " << ptr;
|
||||
GPUBFCAllocator::Chunk* c = it->second;
|
||||
return c->requested_size;
|
||||
}
|
||||
|
||||
size_t GPUBFCAllocator::AllocatedSize(void* ptr) {
|
||||
mutex_lock l(lock_);
|
||||
auto it = ptr_to_chunk_map_.find(ptr);
|
||||
CHECK(it != ptr_to_chunk_map_.end())
|
||||
<< "Asked for allocated size of pointer we never allocated: " << ptr;
|
||||
GPUBFCAllocator::Chunk* c = it->second;
|
||||
return c->size;
|
||||
}
|
||||
|
||||
void GPUBFCAllocator::DumpMemoryLog(size_t num_bytes) {
|
||||
// For each bin: tally up the total number of chunks and bytes.
|
||||
for (auto bit : bins_) {
|
||||
Bin* b = bit.second;
|
||||
|
||||
size_t total_bytes_in_use = 0;
|
||||
size_t total_bytes_in_bin = 0;
|
||||
size_t total_requested_bytes_in_use = 0;
|
||||
size_t total_requested_bytes_in_bin = 0;
|
||||
size_t total_chunks_in_use = 0;
|
||||
size_t total_chunks_in_bin = 0;
|
||||
for (Chunk* c : b->chunks) {
|
||||
total_bytes_in_bin += c->size;
|
||||
total_requested_bytes_in_bin += c->requested_size;
|
||||
++total_chunks_in_bin;
|
||||
if (c->in_use) {
|
||||
total_bytes_in_use += c->size;
|
||||
total_requested_bytes_in_use += c->requested_size;
|
||||
++total_chunks_in_use;
|
||||
}
|
||||
}
|
||||
|
||||
LOG(INFO) << "Bin (" << b->bin_size
|
||||
<< "): \tTotal Chunks: " << total_chunks_in_bin
|
||||
<< ", Chunks in use: " << total_chunks_in_use << " "
|
||||
<< strings::HumanReadableNumBytes(total_bytes_in_bin)
|
||||
<< " allocated for chunks. "
|
||||
<< strings::HumanReadableNumBytes(total_requested_bytes_in_bin)
|
||||
<< " client-requested for chunks. "
|
||||
<< strings::HumanReadableNumBytes(total_bytes_in_use)
|
||||
<< " in use in bin. "
|
||||
<< strings::HumanReadableNumBytes(total_requested_bytes_in_use)
|
||||
<< " client-requested in use in bin.";
|
||||
}
|
||||
|
||||
// Find the bin that we would have liked to allocate in, so we
|
||||
// can get some further analysis about fragmentation.
|
||||
auto it = bins_.lower_bound(num_bytes);
|
||||
if (it != bins_.end()) {
|
||||
Bin* b = it->second;
|
||||
|
||||
LOG(INFO) << "Bin for " << strings::HumanReadableNumBytes(num_bytes)
|
||||
<< " was " << strings::HumanReadableNumBytes(b->bin_size)
|
||||
<< ", Chunk State: ";
|
||||
|
||||
for (Chunk* c : b->chunks) {
|
||||
LOG(INFO) << c->DebugString(true);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace tensorflow
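A worked example of the rounding and bin selection in AllocateRawInternal above, assuming the 256-byte granularity and the bin sizes created in the constructor (256, 1024, 2048, ...):
//   num_bytes     = 300
//   rounded_bytes = 256 * ((300 + 255) / 256) = 256 * 2 = 512
//   bins_.lower_bound(512) -> the 1024-byte bin, the smallest bin size >= 512
// The search scans that bin (and larger ones if needed) for a free chunk
// larger than 512 bytes, and splits the chunk when it is at least 1024 bytes,
// i.e. twice the rounded request.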
|
156
tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h
Normal file
@ -0,0 +1,156 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_BFC_ALLOCATOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_BFC_ALLOCATOR_H_
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/visitable_allocator.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// A GPU memory allocator that implements a 'best-fit with coalescing'
|
||||
// algorithm. This is essentially a very simple version of Doug Lea's
|
||||
// malloc (dlmalloc).
|
||||
//
|
||||
// The goal of this allocator is to support defragmentation via
|
||||
// coalescing. One assumption we make is that the process using this
|
||||
// allocator owns pretty much all of the GPU memory, and that nearly
|
||||
// all requests to allocate GPU memory go through this interface.
|
||||
class GPUBFCAllocator : public VisitableAllocator {
|
||||
public:
|
||||
// 'device_id' refers to the StreamExecutor ID of the device within
|
||||
// the process and must reference a valid ID in the process.
|
||||
explicit GPUBFCAllocator(int device_id, size_t total_memory);
|
||||
~GPUBFCAllocator() override;
|
||||
|
||||
string Name() override { return "gpu_bfc"; }
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
|
||||
void DeallocateRaw(void* ptr) override;
|
||||
|
||||
void AddAllocVisitor(Visitor visitor) override;
|
||||
|
||||
// Does nothing, because gpu memory is never freed.
|
||||
void AddFreeVisitor(Visitor visitor) override {}
|
||||
|
||||
bool TracksAllocationSizes() override;
|
||||
|
||||
size_t RequestedSize(void* ptr) override;
|
||||
|
||||
size_t AllocatedSize(void* ptr) override;
|
||||
|
||||
private:
|
||||
struct Bin;
|
||||
|
||||
void* AllocateRawInternal(size_t alignment, size_t num_bytes,
|
||||
bool dump_log_on_failure);
|
||||
void DeallocateRawInternal(void* ptr);
|
||||
|
||||
// Chunks point to GPU memory. Their prev/next pointers form a
|
||||
// doubly-linked list of addresses sorted by GPU base address that
|
||||
// must be contiguous. Chunks contain information about whether
|
||||
// they are in use or whether they are free, and contain a pointer
|
||||
// to the bin they are in.
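  // Illustrative view (not in the original comment) of three contiguous
  // chunks linked through prev/next:
  //
  //   [ chunk A ][ chunk B ][ chunk C ]
  //
  //   A.next == B, B.prev == A, B.next == C, and B.ptr == A.ptr + A.size.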
|
||||
struct Chunk {
|
||||
size_t size = 0; // Full size of GPU buffer.
|
||||
|
||||
// We sometimes give chunks that are larger than needed to reduce
|
||||
// fragmentation. requested_size keeps track of what the client
|
||||
// actually wanted so we can understand whether our splitting
|
||||
// strategy is efficient.
|
||||
size_t requested_size = 0;
|
||||
|
||||
bool in_use = false;
|
||||
void* ptr = nullptr; // pointer to granted GPU subbuffer.
|
||||
|
||||
// If not null, the memory referred to by 'prev' is directly
|
||||
// preceding the memory used by this chunk. E.g., It should start
|
||||
// at 'ptr - prev->size'
|
||||
Chunk* prev = nullptr;
|
||||
|
||||
// If not null, the memory referred to by 'next' is directly
|
||||
// following the memory used by this chunk. E.g., It should be at
|
||||
// 'ptr + size'
|
||||
Chunk* next = nullptr;
|
||||
|
||||
// What bin are we in?
|
||||
Bin* bin = nullptr;
|
||||
|
||||
string DebugString(bool recurse) {
|
||||
string dbg;
|
||||
strings::StrAppend(&dbg, " Size: ", strings::HumanReadableNumBytes(size),
|
||||
" | Requested Size: ",
|
||||
strings::HumanReadableNumBytes(requested_size),
|
||||
" | in_use: ", in_use);
|
||||
if (recurse && prev) {
|
||||
strings::StrAppend(&dbg, ", prev: ", prev->DebugString(false));
|
||||
}
|
||||
if (recurse && next) {
|
||||
strings::StrAppend(&dbg, ", next: ", next->DebugString(false));
|
||||
}
|
||||
return dbg;
|
||||
}
|
||||
};
|
||||
|
||||
Chunk* AllocateNewChunk(size_t num_bytes);
|
||||
void SplitChunk(Chunk* c, size_t num_bytes);
|
||||
void Merge(Chunk* c1, Chunk* c2);
|
||||
void MaybeCoalesce(Chunk* c);
|
||||
|
||||
void ReassignChunkToBin(Chunk* c);
|
||||
void RemoveChunkFromBin(Chunk* c);
|
||||
|
||||
void DumpMemoryLog(size_t num_bytes);
|
||||
|
||||
// A Bin is a collection of similar-sized Chunks.
|
||||
struct Bin {
|
||||
// All chunks in this bin have >= bin_size memory.
|
||||
size_t bin_size = 0;
|
||||
|
||||
struct ChunkComparator {
|
||||
bool operator()(Chunk* a, Chunk* b) { return a->size < b->size; }
|
||||
};
|
||||
|
||||
// List of chunks within the bin, sorted by chunk size.
|
||||
std::multiset<Chunk*, ChunkComparator> chunks;
|
||||
|
||||
explicit Bin(size_t bs) : bin_size(bs) {}
|
||||
|
||||
~Bin() { gtl::STLDeleteElements(&chunks); }
|
||||
};
|
||||
|
||||
GPUAllocatorRetry retry_helper_;
|
||||
|
||||
// Structures immutable after construction
|
||||
const int device_id_;
|
||||
// The base pointer where all the GPU memory begins.
|
||||
void* base_ptr_ = nullptr;
|
||||
size_t gpu_memory_size_ = 0;
|
||||
|
||||
// Map from bin size to Bin
|
||||
// After construction, the bin map is never resized.
|
||||
std::map<size_t, Bin*> bins_;
|
||||
|
||||
perftools::gputools::StreamExecutor* stream_exec_; // Not owned.
|
||||
|
||||
// Structures mutable after construction
|
||||
mutable mutex lock_;
|
||||
// Not owned.
|
||||
std::unordered_map<void*, Chunk*> ptr_to_chunk_map_;
|
||||
|
||||
// Called once on each region, ASAP.
|
||||
std::vector<Visitor> region_visitors_;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(GPUBFCAllocator);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_BFC_ALLOCATOR_H_
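A minimal usage sketch (not part of the commit, assuming a CUDA-capable device 0) that mirrors how the unit tests below drive the allocator:

// Sketch only: exercise the best-fit-with-coalescing allocator directly.
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"

void ExampleBFCUsage() {
  tensorflow::GPUBFCAllocator a(0 /* device_id */, 1 << 30 /* 1 GiB arena */);
  float* buf = a.Allocate<float>(1024);     // best-fit chunk taken from a bin
  size_t requested = a.RequestedSize(buf);  // 1024 * sizeof(float)
  size_t allocated = a.AllocatedSize(buf);  // chunk size; may be larger than requested
  a.Deallocate(buf);                        // freed chunk is coalesced with free neighbors
  (void)requested;
  (void)allocated;
}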
|
166
tensorflow/core/common_runtime/gpu/gpu_bfc_allocator_test.cc
Normal file
@ -0,0 +1,166 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include <gtest/gtest.h>
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/lib/gtl/inlined_vector.h"
|
||||
#include "tensorflow/core/lib/random/simple_philox.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
TEST(GPUBFCAllocatorTest, NoDups) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
// Allocate a lot of raw pointers
|
||||
std::vector<void*> ptrs;
|
||||
for (int s = 1; s < 1024; s++) {
|
||||
void* raw = a.AllocateRaw(1, s);
|
||||
ptrs.push_back(raw);
|
||||
}
|
||||
|
||||
std::sort(ptrs.begin(), ptrs.end());
|
||||
|
||||
// Make sure none of them are equal, and that none of them overlap.
|
||||
for (int i = 0; i < ptrs.size(); i++) {
|
||||
if (i > 0) {
|
||||
ASSERT_NE(ptrs[i], ptrs[i - 1]); // No dups
|
||||
size_t req_size = a.RequestedSize(ptrs[i - 1]);
|
||||
ASSERT_GT(req_size, 0);
|
||||
ASSERT_GE(static_cast<char*>(ptrs[i]) - static_cast<char*>(ptrs[i - 1]),
|
||||
req_size);
|
||||
}
|
||||
}
|
||||
|
||||
for (int i = 0; i < ptrs.size(); i++) {
|
||||
a.DeallocateRaw(ptrs[i]);
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, AllocationsAndDeallocations) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
// Allocate 255 raw pointers of sizes between 100 bytes and about
|
||||
// a meg
|
||||
random::PhiloxRandom philox(123, 17);
|
||||
random::SimplePhilox rand(&philox);
|
||||
|
||||
std::vector<void*> initial_ptrs;
|
||||
for (int s = 1; s < 256; s++) {
|
||||
size_t size = std::min<size_t>(
|
||||
std::max<size_t>(rand.Rand32() % 1048576, 100), 1048576);
|
||||
void* raw = a.AllocateRaw(1, size);
|
||||
|
||||
initial_ptrs.push_back(raw);
|
||||
}
|
||||
|
||||
// Deallocate half of the memory, and keep track of the others.
|
||||
std::vector<void*> existing_ptrs;
|
||||
for (int i = 0; i < initial_ptrs.size(); i++) {
|
||||
if (i % 2 == 1) {
|
||||
a.DeallocateRaw(initial_ptrs[i]);
|
||||
} else {
|
||||
existing_ptrs.push_back(initial_ptrs[i]);
|
||||
}
|
||||
}
|
||||
|
||||
// Allocate a lot of raw pointers
|
||||
for (int s = 1; s < 256; s++) {
|
||||
size_t size = std::min<size_t>(
|
||||
std::max<size_t>(rand.Rand32() % 1048576, 100), 1048576);
|
||||
void* raw = a.AllocateRaw(1, size);
|
||||
existing_ptrs.push_back(raw);
|
||||
}
|
||||
|
||||
std::sort(existing_ptrs.begin(), existing_ptrs.end());
|
||||
// Make sure none of them are equal
|
||||
for (int i = 0; i < existing_ptrs.size(); i++) {
|
||||
if (i > 0) {
|
||||
CHECK_NE(existing_ptrs[i], existing_ptrs[i - 1]); // No dups
|
||||
|
||||
size_t req_size = a.RequestedSize(existing_ptrs[i - 1]);
|
||||
ASSERT_GT(req_size, 0);
|
||||
|
||||
// Check that they don't overlap.
|
||||
ASSERT_GE(static_cast<char*>(existing_ptrs[i]) -
|
||||
static_cast<char*>(existing_ptrs[i - 1]),
|
||||
req_size);
|
||||
}
|
||||
}
|
||||
|
||||
for (int i = 0; i < existing_ptrs.size(); i++) {
|
||||
a.DeallocateRaw(existing_ptrs[i]);
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, ExerciseCoalescing) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
|
||||
float* first_ptr = a.Allocate<float>(1024);
|
||||
a.Deallocate(first_ptr);
|
||||
for (int i = 0; i < 1024; ++i) {
|
||||
// Allocate several buffers of different sizes, and then clean them
|
||||
// all up. We should be able to repeat this endlessly without
|
||||
// causing fragmentation and growth.
|
||||
float* t1 = a.Allocate<float>(1024);
|
||||
|
||||
int64* t2 = a.Allocate<int64>(1048576);
|
||||
double* t3 = a.Allocate<double>(2048);
|
||||
float* t4 = a.Allocate<float>(10485760);
|
||||
|
||||
a.Deallocate(t1);
|
||||
a.Deallocate(t2);
|
||||
a.Deallocate(t3);
|
||||
a.Deallocate(t4);
|
||||
}
|
||||
|
||||
// At the end, we should have coalesced all memory into one region
|
||||
// starting at the beginning, so validate that allocating a pointer
|
||||
// starts from this region.
|
||||
float* first_ptr_after = a.Allocate<float>(1024);
|
||||
EXPECT_EQ(first_ptr, first_ptr_after);
|
||||
a.Deallocate(first_ptr_after);
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, AllocateZeroBufSize) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
float* ptr = a.Allocate<float>(0);
|
||||
EXPECT_EQ(nullptr, ptr);
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, TracksSizes) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
EXPECT_EQ(true, a.TracksAllocationSizes());
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, AllocatedVsRequested) {
|
||||
GPUBFCAllocator a(0, 1 << 30);
|
||||
float* t1 = a.Allocate<float>(1);
|
||||
EXPECT_EQ(4, a.RequestedSize(t1));
|
||||
EXPECT_EQ(256, a.AllocatedSize(t1));
|
||||
a.Deallocate(t1);
|
||||
}
|
||||
|
||||
TEST(GPUBFCAllocatorTest, TestCustomMemoryLimit) {
|
||||
// Configure a 1MiB limit
|
||||
GPUBFCAllocator a(0, 1 << 20);
|
||||
|
||||
float* first_ptr = a.Allocate<float>(1 << 6);
|
||||
float* second_ptr = a.Allocate<float>(1 << 20);
|
||||
|
||||
EXPECT_NE(nullptr, first_ptr);
|
||||
EXPECT_EQ(nullptr, second_ptr);
|
||||
a.Deallocate(first_ptr);
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
186
tensorflow/core/common_runtime/gpu/gpu_debug_allocator.cc
Normal file
@ -0,0 +1,186 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h"
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
#define MASK_WORDS 2
|
||||
#define MASK_BYTES (MASK_WORDS * sizeof(int64))
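// Illustrative layout of a debug allocation (not part of the original code):
//
//   [ before_mask : MASK_BYTES ][ client bytes ... ][ after_mask : MASK_BYTES ]
//   ^ allocated_ptr              ^ pointer returned to the client
//
// CheckHeader/CheckFooter below verify that both masks are still intact.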
|
||||
|
||||
namespace {
|
||||
|
||||
static int64* NewMask(int64 word) {
|
||||
int64* m = new int64[MASK_WORDS];
|
||||
for (int i = 0; i < MASK_WORDS; ++i) {
|
||||
m[i] = word;
|
||||
}
|
||||
return m;
|
||||
}
|
||||
|
||||
static int64* before_mask = NewMask(0xabababababababab);
|
||||
static int64* after_mask = NewMask(0xcdcdcdcdcdcdcdcd);
|
||||
|
||||
bool CheckMask(perftools::gputools::StreamExecutor* exec, void* ptr,
|
||||
int64* mask) {
|
||||
gpu::DeviceMemory<int64> gpu_ptr{gpu::DeviceMemoryBase{ptr, MASK_BYTES}};
|
||||
int64 tmp[MASK_WORDS];
|
||||
|
||||
if (!exec->SynchronousMemcpy(&tmp, gpu_ptr, MASK_BYTES)) {
|
||||
LOG(FATAL) << "Could not copy debug mask";
|
||||
}
|
||||
|
||||
bool ok = true;
|
||||
for (int i = 0; i < MASK_WORDS; ++i) {
|
||||
ok &= (mask[i] == tmp[i]);
|
||||
if (!ok) {
|
||||
LOG(ERROR) << "i=" << i
|
||||
<< " mask=" << reinterpret_cast<const void*>(mask[i])
|
||||
<< " field=" << reinterpret_cast<const void*>(tmp[i]);
|
||||
}
|
||||
}
|
||||
|
||||
return ok;
|
||||
}
|
||||
|
||||
void InitMask(perftools::gputools::StreamExecutor* exec, void* ptr,
|
||||
int64* mask) {
|
||||
gpu::DeviceMemory<int64> gpu_ptr{gpu::DeviceMemoryBase{ptr, MASK_BYTES}};
|
||||
if (!exec->SynchronousMemcpy(&gpu_ptr, mask, MASK_BYTES)) {
|
||||
LOG(FATAL) << "Could not copy debug mask";
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
// -----------------------------------------------------------------------------
|
||||
// GPUDebugAllocator
|
||||
// -----------------------------------------------------------------------------
|
||||
GPUDebugAllocator::GPUDebugAllocator(VisitableAllocator* allocator,
|
||||
int device_id)
|
||||
: base_allocator_(allocator) {
|
||||
stream_exec_ = GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
}
|
||||
|
||||
GPUDebugAllocator::~GPUDebugAllocator() { delete base_allocator_; }
|
||||
|
||||
void* GPUDebugAllocator::AllocateRaw(size_t alignment, size_t num_bytes) {
|
||||
num_bytes += (2 * MASK_BYTES);
|
||||
|
||||
void* allocated_ptr = base_allocator_->AllocateRaw(alignment, num_bytes);
|
||||
|
||||
// Return the pointer after the header
|
||||
void* rv = static_cast<char*>(allocated_ptr) + MASK_BYTES;
|
||||
|
||||
// Write the header at allocated_ptr
|
||||
InitMask(stream_exec_, allocated_ptr, before_mask);
|
||||
|
||||
// Write the footer at the end.
|
||||
size_t req_size = base_allocator_->RequestedSize(allocated_ptr);
|
||||
InitMask(stream_exec_,
|
||||
static_cast<char*>(allocated_ptr) + req_size - MASK_BYTES,
|
||||
after_mask);
|
||||
return rv;
|
||||
}
|
||||
void GPUDebugAllocator::DeallocateRaw(void* ptr) {
|
||||
CHECK(CheckHeader(ptr)) << "before_mask has been overwritten";
|
||||
CHECK(CheckFooter(ptr)) << "after_mask has been overwritten";
|
||||
|
||||
// Backtrack to the beginning of the header.
|
||||
ptr = static_cast<void*>(static_cast<char*>(ptr) - MASK_BYTES);
|
||||
// Deallocate the memory
|
||||
base_allocator_->DeallocateRaw(ptr);
|
||||
}
|
||||
|
||||
void GPUDebugAllocator::AddAllocVisitor(Visitor visitor) {
|
||||
return base_allocator_->AddAllocVisitor(visitor);
|
||||
}
|
||||
|
||||
void GPUDebugAllocator::AddFreeVisitor(Visitor visitor) {
|
||||
return base_allocator_->AddFreeVisitor(visitor);
|
||||
}
|
||||
|
||||
bool GPUDebugAllocator::TracksAllocationSizes() { return true; }
|
||||
|
||||
size_t GPUDebugAllocator::RequestedSize(void* ptr) {
|
||||
auto req_size =
|
||||
base_allocator_->RequestedSize(static_cast<char*>(ptr) - MASK_BYTES);
|
||||
return req_size - 2 * MASK_BYTES;
|
||||
}
|
||||
|
||||
size_t GPUDebugAllocator::AllocatedSize(void* ptr) {
|
||||
return base_allocator_->AllocatedSize(static_cast<char*>(ptr) - MASK_BYTES);
|
||||
}
|
||||
|
||||
bool GPUDebugAllocator::CheckHeader(void* ptr) {
|
||||
return CheckMask(stream_exec_, static_cast<char*>(ptr) - MASK_BYTES,
|
||||
before_mask);
|
||||
}
|
||||
|
||||
bool GPUDebugAllocator::CheckFooter(void* ptr) {
|
||||
char* original_ptr = static_cast<char*>(ptr) - MASK_BYTES;
|
||||
size_t req_size = base_allocator_->RequestedSize(original_ptr);
|
||||
return CheckMask(stream_exec_, original_ptr + req_size - MASK_BYTES,
|
||||
after_mask);
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------------
|
||||
// GPUNanResetAllocator
|
||||
// -----------------------------------------------------------------------------
|
||||
GPUNanResetAllocator::GPUNanResetAllocator(VisitableAllocator* allocator,
|
||||
int device_id)
|
||||
: base_allocator_(allocator) {
|
||||
stream_exec_ = GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
}
|
||||
|
||||
GPUNanResetAllocator::~GPUNanResetAllocator() { delete base_allocator_; }
|
||||
|
||||
void* GPUNanResetAllocator::AllocateRaw(size_t alignment, size_t num_bytes) {
|
||||
void* allocated_ptr = base_allocator_->AllocateRaw(alignment, num_bytes);
|
||||
|
||||
// Initialize the buffer to NaNs
|
||||
size_t req_size = base_allocator_->RequestedSize(allocated_ptr);
|
||||
std::vector<float> nans(req_size / sizeof(float), std::nanf(""));
|
||||
gpu::DeviceMemory<float> nan_ptr{
|
||||
gpu::DeviceMemoryBase{static_cast<float*>(allocated_ptr), req_size}};
|
||||
|
||||
if (!stream_exec_->SynchronousMemcpy(&nan_ptr, &nans[0], req_size)) {
|
||||
LOG(ERROR) << "Could not initialize to NaNs";
|
||||
}
|
||||
|
||||
return allocated_ptr;
|
||||
}
|
||||
void GPUNanResetAllocator::DeallocateRaw(void* ptr) {
|
||||
// Reset the buffer to NaNs
|
||||
size_t req_size = base_allocator_->RequestedSize(ptr);
|
||||
std::vector<float> nans(req_size / sizeof(float), std::nanf(""));
|
||||
gpu::DeviceMemory<float> nan_ptr{
|
||||
gpu::DeviceMemoryBase{static_cast<float*>(ptr), req_size}};
|
||||
if (!stream_exec_->SynchronousMemcpy(&nan_ptr, &nans[0], req_size)) {
|
||||
LOG(ERROR) << "Could not initialize to NaNs";
|
||||
}
|
||||
|
||||
// Deallocate the memory
|
||||
base_allocator_->DeallocateRaw(ptr);
|
||||
}
|
||||
|
||||
void GPUNanResetAllocator::AddAllocVisitor(Visitor visitor) {
|
||||
return base_allocator_->AddAllocVisitor(visitor);
|
||||
}
|
||||
|
||||
void GPUNanResetAllocator::AddFreeVisitor(Visitor visitor) {
|
||||
return base_allocator_->AddFreeVisitor(visitor);
|
||||
}
|
||||
|
||||
size_t GPUNanResetAllocator::RequestedSize(void* ptr) {
|
||||
return base_allocator_->RequestedSize(ptr);
|
||||
}
|
||||
|
||||
size_t GPUNanResetAllocator::AllocatedSize(void* ptr) {
|
||||
return base_allocator_->AllocatedSize(ptr);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
68
tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h
Normal file
@ -0,0 +1,68 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEBUG_ALLOCATOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEBUG_ALLOCATOR_H_
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/visitable_allocator.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// An allocator that wraps a GPU allocator and adds debugging
|
||||
// functionality that verifies that users do not write outside their
|
||||
// allocated memory.
|
||||
class GPUDebugAllocator : public VisitableAllocator {
|
||||
public:
|
||||
explicit GPUDebugAllocator(VisitableAllocator* allocator, int device_id);
|
||||
~GPUDebugAllocator() override;
|
||||
string Name() override { return "gpu_debug"; }
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
|
||||
void DeallocateRaw(void* ptr) override;
|
||||
void AddAllocVisitor(Visitor visitor) override;
|
||||
void AddFreeVisitor(Visitor visitor) override;
|
||||
bool TracksAllocationSizes() override;
|
||||
size_t RequestedSize(void* ptr) override;
|
||||
size_t AllocatedSize(void* ptr) override;
|
||||
|
||||
// For testing.
|
||||
bool CheckHeader(void* ptr);
|
||||
bool CheckFooter(void* ptr);
|
||||
|
||||
private:
|
||||
VisitableAllocator* base_allocator_ = nullptr; // owned
|
||||
|
||||
perftools::gputools::StreamExecutor* stream_exec_; // Not owned.
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(GPUDebugAllocator);
|
||||
};
|
||||
|
||||
// An allocator that wraps a GPU allocator and resets the memory on
|
||||
// allocation and free to 'NaN', helping to identify cases where the
|
||||
// user forgets to initialize the memory.
|
||||
class GPUNanResetAllocator : public VisitableAllocator {
|
||||
public:
|
||||
explicit GPUNanResetAllocator(VisitableAllocator* allocator, int device_id);
|
||||
~GPUNanResetAllocator() override;
|
||||
string Name() override { return "gpu_nan_reset"; }
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
|
||||
void DeallocateRaw(void* ptr) override;
|
||||
void AddAllocVisitor(Visitor visitor) override;
|
||||
void AddFreeVisitor(Visitor visitor) override;
|
||||
size_t RequestedSize(void* ptr) override;
|
||||
size_t AllocatedSize(void* ptr) override;
|
||||
|
||||
private:
|
||||
VisitableAllocator* base_allocator_ = nullptr; // owned
|
||||
|
||||
perftools::gputools::StreamExecutor* stream_exec_; // Not owned.
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(GPUNanResetAllocator);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEBUG_ALLOCATOR_H_
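A sketch (mirroring gpu_debug_allocator_test.cc below) of how the two wrappers compose; per the test comment, the NaN-reset wrapper goes on the outside:

// Sketch only: stack the NaN-reset and overwrite-detection wrappers around
// the BFC allocator for CUDA device 0.
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"
#include "tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h"

void ExampleDebugStack() {
  const int device_id = 0;
  tensorflow::GPUNanResetAllocator a(
      new tensorflow::GPUDebugAllocator(
          new tensorflow::GPUBFCAllocator(device_id, 1 << 30), device_id),
      device_id);
  float* buf = a.Allocate<float>(1024);  // buffer arrives pre-filled with NaNs
  // ... fill buf on the GPU via StreamExecutor memcpy ...
  a.Deallocate(buf);  // debug masks are verified, then the buffer is re-NaN'd
}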
|
207
tensorflow/core/common_runtime/gpu/gpu_debug_allocator_test.cc
Normal file
@ -0,0 +1,207 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/lib/gtl/inlined_vector.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
TEST(GPUDebugAllocatorTest, OverwriteDetection_None) {
|
||||
const int device_id = 0;
|
||||
GPUDebugAllocator a(new GPUBFCAllocator(device_id, 1 << 30), device_id);
|
||||
auto stream_exec =
|
||||
GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
for (int s : {8}) {
|
||||
std::vector<int64> cpu_array(s);
|
||||
memset(&cpu_array[0], 0, cpu_array.size() * sizeof(int64));
|
||||
int64* gpu_array = a.Allocate<int64>(cpu_array.size());
|
||||
gpu::DeviceMemory<int64> gpu_array_ptr{gpu::DeviceMemoryBase{gpu_array}};
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(&gpu_array_ptr, &cpu_array[0],
|
||||
s * sizeof(int64)));
|
||||
EXPECT_TRUE(a.CheckHeader(gpu_array));
|
||||
EXPECT_TRUE(a.CheckFooter(gpu_array));
|
||||
|
||||
// Confirm no error on free.
|
||||
a.DeallocateRaw(gpu_array);
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, OverwriteDetection_Header) {
|
||||
for (int s : {8, 211}) {
|
||||
EXPECT_DEATH(
|
||||
{
|
||||
const int device_id = 0;
|
||||
GPUDebugAllocator a(new GPUBFCAllocator(device_id, 1 << 30),
|
||||
device_id);
|
||||
auto stream_exec =
|
||||
GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
std::vector<int64> cpu_array(s);
|
||||
memset(&cpu_array[0], 0, cpu_array.size() * sizeof(int64));
|
||||
int64* gpu_array = a.Allocate<int64>(cpu_array.size());
|
||||
|
||||
gpu::DeviceMemory<int64> gpu_array_ptr{
|
||||
gpu::DeviceMemoryBase{gpu_array}};
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(
|
||||
&gpu_array_ptr, &cpu_array[0], cpu_array.size() * sizeof(int64)));
|
||||
|
||||
gpu::DeviceMemory<int64> gpu_hdr_ptr{
|
||||
gpu::DeviceMemoryBase{gpu_array - 1}};
|
||||
// Clobber first word of the header.
|
||||
float pi = 3.1417;
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&gpu_hdr_ptr, &pi, sizeof(float)));
|
||||
|
||||
// Expect error on free.
|
||||
a.Deallocate(gpu_array);
|
||||
},
|
||||
"");
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, OverwriteDetection_Footer) {
|
||||
for (int s : {8, 22}) {
|
||||
EXPECT_DEATH(
|
||||
{
|
||||
const int device_id = 0;
|
||||
GPUDebugAllocator a(new GPUBFCAllocator(device_id, 1 << 30),
|
||||
device_id);
|
||||
auto stream_exec =
|
||||
GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
std::vector<int64> cpu_array(s);
|
||||
memset(&cpu_array[0], 0, cpu_array.size() * sizeof(int64));
|
||||
int64* gpu_array = a.Allocate<int64>(cpu_array.size());
|
||||
|
||||
gpu::DeviceMemory<int64> gpu_array_ptr{
|
||||
gpu::DeviceMemoryBase{gpu_array}};
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(
|
||||
&gpu_array_ptr, &cpu_array[0], cpu_array.size() * sizeof(int64)));
|
||||
|
||||
// Clobber word of the footer.
|
||||
gpu::DeviceMemory<int64> gpu_ftr_ptr{
|
||||
gpu::DeviceMemoryBase{gpu_array + s}};
|
||||
float pi = 3.1417;
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&gpu_ftr_ptr, &pi, sizeof(float)));
|
||||
|
||||
// Expect error on free.
|
||||
a.Deallocate(gpu_array);
|
||||
},
|
||||
"");
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, ResetToNan) {
|
||||
const int device_id = 0;
|
||||
GPUNanResetAllocator a(new GPUBFCAllocator(device_id, 1 << 30), device_id);
|
||||
auto stream_exec =
|
||||
GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
std::vector<float> cpu_array(1024);
|
||||
std::vector<float> cpu_array_result(1024);
|
||||
|
||||
// Allocate 1024 floats
|
||||
float* gpu_array = a.Allocate<float>(cpu_array.size());
|
||||
gpu::DeviceMemory<float> gpu_array_ptr{gpu::DeviceMemoryBase{gpu_array}};
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(&cpu_array[0], gpu_array_ptr,
|
||||
cpu_array.size() * sizeof(float)));
|
||||
for (float f : cpu_array) {
|
||||
ASSERT_FALSE(std::isfinite(f));
|
||||
}
|
||||
|
||||
// Set one of the fields to 1.0.
|
||||
cpu_array[0] = 1.0;
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(&gpu_array_ptr, &cpu_array[0],
|
||||
cpu_array.size() * sizeof(float)));
|
||||
// Copy the data back and verify.
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&cpu_array_result[0], gpu_array_ptr,
|
||||
cpu_array_result.size() * sizeof(float)));
|
||||
ASSERT_EQ(1.0, cpu_array_result[0]);
|
||||
|
||||
// Free the array
|
||||
a.Deallocate(gpu_array);
|
||||
|
||||
// All values should be reset to nan.
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&cpu_array_result[0], gpu_array_ptr,
|
||||
cpu_array_result.size() * sizeof(float)));
|
||||
for (float f : cpu_array_result) {
|
||||
ASSERT_FALSE(std::isfinite(f));
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, ResetToNanWithHeaderFooter) {
|
||||
const int device_id = 0;
|
||||
// NaN reset must be the outer-most allocator.
|
||||
GPUNanResetAllocator a(
|
||||
new GPUDebugAllocator(new GPUBFCAllocator(device_id, 1 << 30), device_id),
|
||||
device_id);
|
||||
auto stream_exec =
|
||||
GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
std::vector<float> cpu_array(1024);
|
||||
std::vector<float> cpu_array_result(1024);
|
||||
|
||||
// Allocate 1024 floats
|
||||
float* gpu_array = a.Allocate<float>(cpu_array.size());
|
||||
gpu::DeviceMemory<float> gpu_array_ptr{gpu::DeviceMemoryBase{gpu_array}};
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(&cpu_array[0], gpu_array_ptr,
|
||||
cpu_array.size() * sizeof(float)));
|
||||
for (float f : cpu_array) {
|
||||
ASSERT_FALSE(std::isfinite(f));
|
||||
}
|
||||
|
||||
// Set one of the fields to 1.0.
|
||||
cpu_array[0] = 1.0;
|
||||
ASSERT_TRUE(stream_exec->SynchronousMemcpy(&gpu_array_ptr, &cpu_array[0],
|
||||
cpu_array.size() * sizeof(float)));
|
||||
// Copy the data back and verify.
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&cpu_array_result[0], gpu_array_ptr,
|
||||
cpu_array_result.size() * sizeof(float)));
|
||||
ASSERT_EQ(1.0, cpu_array_result[0]);
|
||||
|
||||
// Free the array
|
||||
a.Deallocate(gpu_array);
|
||||
|
||||
// All values should be reset to nan.
|
||||
ASSERT_TRUE(
|
||||
stream_exec->SynchronousMemcpy(&cpu_array_result[0], gpu_array_ptr,
|
||||
cpu_array_result.size() * sizeof(float)));
|
||||
for (float f : cpu_array_result) {
|
||||
ASSERT_FALSE(std::isfinite(f));
|
||||
}
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, TracksSizes) {
|
||||
GPUDebugAllocator a(new GPUBFCAllocator(0, 1 << 30), 0);
|
||||
EXPECT_EQ(true, a.TracksAllocationSizes());
|
||||
}
|
||||
|
||||
TEST(GPUDebugAllocatorTest, AllocatedVsRequested) {
|
||||
GPUNanResetAllocator a(
|
||||
new GPUDebugAllocator(new GPUBFCAllocator(0, 1 << 30), 0), 0);
|
||||
float* t1 = a.Allocate<float>(1);
|
||||
EXPECT_EQ(4, a.RequestedSize(t1));
|
||||
EXPECT_EQ(256, a.AllocatedSize(t1));
|
||||
a.Deallocate(t1);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
651
tensorflow/core/common_runtime/gpu/gpu_device.cc
Normal file
@ -0,0 +1,651 @@
|
||||
// TODO(opensource): Use a more generic sounding preprocessor name than
|
||||
// GOOGLE_CUDA
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#define EIGEN_USE_GPU
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_device.h"
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
|
||||
//#include "base/commandlineflags.h"
|
||||
#include "tensorflow/stream_executor/cuda/cuda_activation.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
|
||||
#include "tensorflow/core/common_runtime/device_factory.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_stream_util.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
#include "tensorflow/core/common_runtime/gpu_device_context.h"
|
||||
#include "tensorflow/core/common_runtime/local_device.h"
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/framework/device_base.h"
|
||||
#include "tensorflow/core/framework/op_kernel.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/graph/types.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/tracing.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
|
||||
#if defined(PLATFORM_GOOGLE)
|
||||
DEFINE_bool(brain_gpu_sync_every_op, false,
|
||||
"If true, call GPUUtil::Sync() between every dispatched opkernel.");
|
||||
|
||||
DEFINE_int32(brain_gpu_max_streams, 1,
|
||||
"Max number of GPU streams to use for computation.");
|
||||
#else
|
||||
// TODO(opensource): These should be made options in some options struct,
|
||||
// rather than flags.
|
||||
bool FLAGS_brain_gpu_sync_every_op = false;
|
||||
tensorflow::int32 FLAGS_brain_gpu_max_streams = 1;
|
||||
#endif
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Eigen Ops directly allocate memory only for temporary buffers used
|
||||
// during OpKernel::Compute(). The recommended way of allocating such
|
||||
// memory is via OpKernelContext::allocate_temp(). However, Eigen Ops
|
||||
// don't have access to OpKernelContext, instead they get access to
|
||||
// memory directly through the device allocator. As an Open Source
|
||||
// project, Eigen assumes allocator semantics similar to those of the
|
||||
// CUDA memory allocator, and may not work correctly due to race
|
||||
// conditions if used with some other allocator. For safety, we need
|
||||
// to delay deallocation calls out of Eigen until all events on the
|
||||
// corresponding stream have completed. The following two classes
|
||||
// serve this purpose in two different compilation environments.
|
||||
|
||||
#if defined(__GCUDACC__) || defined(__GCUDACC_HOST__)
|
||||
class EigenAllocator : public ::Eigen::Allocator {
|
||||
public:
|
||||
explicit EigenAllocator(gpu::Stream* stream, ::tensorflow::Allocator* alloc,
|
||||
EventMgr* em)
|
||||
: stream_(stream), allocator_(alloc), em_(em) {}
|
||||
|
||||
void* allocate(size_t num_bytes) const override {
|
||||
void* ret = allocator_->AllocateRaw(32 /* alignment */, num_bytes);
|
||||
// Eigen doesn't typically check the return pointer from allocate,
|
||||
// so we do it here and die with a more helpful error message.
|
||||
if (ret == nullptr) {
|
||||
LOG(FATAL) << "EigenAllocator for GPU ran out of memory when allocating "
|
||||
<< num_bytes << ". See error logs for more detailed info.";
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
void deallocate(void* buffer) const override {
|
||||
em_->ThenDeleteBuffer(stream_, {allocator_, buffer});
|
||||
}
|
||||
|
||||
private:
|
||||
gpu::Stream* stream_; // Not owned.
|
||||
::tensorflow::Allocator* allocator_; // Not owned.
|
||||
::tensorflow::EventMgr* em_; // Not owned.
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(EigenAllocator);
|
||||
};
|
||||
|
||||
#else
|
||||
class EigenCudaStreamDevice : public ::Eigen::StreamInterface {
|
||||
public:
|
||||
EigenCudaStreamDevice(const cudaStream_t* cuda_stream, int gpu_id,
|
||||
::tensorflow::Allocator* alloc)
|
||||
: stream_(cuda_stream), allocator_(alloc) {
|
||||
Eigen::initializeDeviceProp();
|
||||
device_prop_ = &Eigen::m_deviceProperties[gpu_id];
|
||||
}
|
||||
|
||||
const cudaStream_t& stream() const override { return *stream_; }
|
||||
const cudaDeviceProp& deviceProperties() const override {
|
||||
return *device_prop_;
|
||||
}
|
||||
|
||||
void* allocate(size_t num_bytes) const override {
|
||||
void* ret = allocator_->AllocateRaw(32 /* alignment */, num_bytes);
|
||||
if (ret == nullptr) {
|
||||
LOG(FATAL) << "EigenAllocator for GPU ran out of memory when allocating "
|
||||
<< num_bytes << ". See error logs for more detailed info.";
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
void deallocate(void* buffer) const override {
|
||||
AsyncFreeData* afData = new AsyncFreeData(allocator_, buffer);
|
||||
cudaError_t err = cudaStreamAddCallback(*stream_, asyncFree, afData, 0);
|
||||
CHECK_EQ(err, cudaSuccess);
|
||||
}
|
||||
|
||||
private:
|
||||
struct AsyncFreeData {
|
||||
AsyncFreeData(::tensorflow::Allocator* a, void* p)
|
||||
: allocator_(a), address_(p) {}
|
||||
::tensorflow::Allocator* allocator_;
|
||||
void* address_;
|
||||
};
|
||||
|
||||
static void CUDART_CB asyncFree(cudaStream_t stream, cudaError_t status,
|
||||
void* userData) {
|
||||
AsyncFreeData* data = static_cast<AsyncFreeData*>(userData);
|
||||
data->allocator_->DeallocateRaw(data->address_);
|
||||
delete data;
|
||||
}
|
||||
|
||||
const cudaStream_t* stream_; // Not owned.
|
||||
const cudaDeviceProp* device_prop_; // Not owned.
|
||||
::tensorflow::Allocator* allocator_; // Not owned.
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(EigenCudaStreamDevice);
|
||||
};
|
||||
|
||||
#endif
|
||||
|
||||
BaseGPUDevice::BaseGPUDevice(const SessionOptions& options, const string& name,
|
||||
Bytes memory_limit, BusAdjacency bus_adjacency,
|
||||
int gpu_id, const string& physical_device_desc,
|
||||
Allocator* gpu_allocator, Allocator* cpu_allocator)
|
||||
: LocalDevice(options, Device::BuildDeviceAttributes(
|
||||
name, DEVICE_GPU, memory_limit, bus_adjacency,
|
||||
physical_device_desc),
|
||||
gpu_allocator),
|
||||
gpu_allocator_(gpu_allocator),
|
||||
cpu_allocator_(cpu_allocator),
|
||||
gpu_id_(gpu_id) {
|
||||
gpu::StreamExecutor* executor =
|
||||
GPUMachineManager()->ExecutorForDevice(gpu_id_).ValueOrDie();
|
||||
if (!executor) {
|
||||
LOG(ERROR) << "Failed to get StreamExecutor for device " << gpu_id_;
|
||||
return;
|
||||
}
|
||||
em_.reset(new EventMgr(executor));
|
||||
|
||||
if (FLAGS_brain_gpu_max_streams < 1) {
|
||||
LOG(FATAL) << "Invalid value for brain_gpu_max_streams.";
|
||||
}
|
||||
|
||||
// Create the specified number of GPU streams
|
||||
for (int i = 0; i < FLAGS_brain_gpu_max_streams; i++) {
|
||||
auto stream = new gpu::Stream(executor);
|
||||
stream->Init();
|
||||
VLOG(2) << "Created stream[" << i << "] = " << stream;
|
||||
streams_.push_back(stream);
|
||||
device_contexts_.push_back(new GPUDeviceContext(i, stream));
|
||||
}
|
||||
gpu_device_info_ = new GpuDeviceInfo;
|
||||
gpu_device_info_->stream = streams_[0];
|
||||
gpu_device_info_->default_context = device_contexts_[0];
|
||||
gpu_device_info_->event_mgr = em_.get();
|
||||
set_tensorflow_gpu_device_info(gpu_device_info_);
|
||||
}
|
||||
|
||||
BaseGPUDevice::~BaseGPUDevice() {
|
||||
delete gpu_device_info_;
|
||||
for (auto ctx : device_contexts_) ctx->Unref();
|
||||
gtl::STLDeleteElements(&streams_);
|
||||
}
|
||||
|
||||
Status BaseGPUDevice::FillContextMap(const Graph* graph,
|
||||
DeviceContextMap* device_context_map) {
|
||||
VLOG(2) << "FillContextMap";
|
||||
|
||||
const auto num_streams = streams_.size();
|
||||
// Special case for single stream.
|
||||
if (num_streams == 1) {
|
||||
return Status::OK();
|
||||
}
|
||||
const int64 before = Env::Default()->NowMicros();
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
opts.max_streams = num_streams;
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
TF_RETURN_IF_ERROR(
|
||||
gpu_stream_util::AssignStreams(graph, opts, &node_to_stream_id));
|
||||
int64 elapsed = Env::Default()->NowMicros() - before;
|
||||
VLOG(3) << "AssignStreams took " << elapsed << "us";
|
||||
|
||||
// Fill in the context map. It is OK for this map to contain
|
||||
// duplicate DeviceContexts so long as we increment the refcount.
|
||||
for (Node* n : graph->nodes()) {
|
||||
auto mapped_stream = node_to_stream_id[n->id()];
|
||||
CHECK_LE(mapped_stream, num_streams);
|
||||
auto ctx = device_contexts_[mapped_stream];
|
||||
VLOG(3) << "Assigned stream " << node_to_stream_id[n->id()]
|
||||
<< " ==> stream[" << ctx->stream_id() << "] for node id " << n->id()
|
||||
<< " " << n->type_string() << " " << n->name();
|
||||
ctx->Ref();
|
||||
device_context_map->insert(std::make_pair(n->id(), ctx));
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
void BaseGPUDevice::Compute(OpKernel* op_kernel, OpKernelContext* context) {
|
||||
// ScopedActivity is cheap when tracing is not active, but we
|
||||
// can avoid computing the Hash64.
|
||||
// TODO(pbar) This would no longer be needed if Ops have a unique id.
|
||||
const uint64 id = port::Tracing::IsActive() ? Hash64(op_kernel->name()) : 0;
|
||||
port::Tracing::ScopedActivity region(port::Tracing::EventCategory::kCompute,
|
||||
id);
|
||||
|
||||
GPUDeviceContext* gpu_device_context = device_contexts_[0];
|
||||
if (context->op_device_context() != nullptr) {
|
||||
gpu_device_context =
|
||||
static_cast<GPUDeviceContext*>(context->op_device_context());
|
||||
}
|
||||
gpu::Stream* stream = gpu_device_context->stream();
|
||||
const auto stream_id = gpu_device_context->stream_id();
|
||||
|
||||
VLOG(1) << "GpuDevice::Compute " << op_kernel->name() << " op "
|
||||
<< op_kernel->def().op() << " on GPU" << gpu_id_ << " stream["
|
||||
<< stream_id << "]";
|
||||
|
||||
// NOTE(tucker): We need to discriminate between Eigen GPU
|
||||
// operations and all others. If an operation is Eigen
|
||||
// implemented (or otherwise tries to launch a cuda kernel
|
||||
// directly), we need to establish a stacked-scoped environment
|
||||
// that directs it to execute on the proper device. Otherwise we
|
||||
// expect the Op to use StreamExecutor directly and correctly. The
|
||||
// way we make this discrimination is quite hacky: At the moment
|
||||
// the only non-Eigen GPU Op is the recv-op, which is known to be
|
||||
// asynchronous.
|
||||
if (op_kernel->type_string() == "_Recv") {
|
||||
context->SetStatus(errors::Internal(
|
||||
"Invalid synchronous 'Compute' on GPU for '_Recv' op"));
|
||||
} else {
|
||||
const string label =
|
||||
strings::StrCat(op_kernel->name(), ":", op_kernel->type_string());
|
||||
port::Tracing::ScopedAnnotation annotation(label);
|
||||
|
||||
const auto num_streams = streams_.size();
|
||||
if (num_streams > 1) {
|
||||
// If this op's device context is different from the other contexts,
|
||||
// we must wait on the stream.
|
||||
for (int i = 0; i < context->num_inputs(); ++i) {
|
||||
const GPUDeviceContext* idc =
|
||||
static_cast<GPUDeviceContext*>(context->input_device_context(i));
|
||||
OP_REQUIRES(context, idc != nullptr,
|
||||
errors::Internal("Input device context ", i,
|
||||
" was not set properly."));
|
||||
if (VLOG_IS_ON(2)) {
|
||||
const void* base;
|
||||
size_t len;
|
||||
if (context->has_input(i)) {
|
||||
if (IsRefType(context->input_dtype(i))) {
|
||||
Tensor tensor = context->mutable_input(i, false);
|
||||
base = DMAHelper::base(&tensor);
|
||||
len = tensor.TotalBytes();
|
||||
} else {
|
||||
const Tensor& tensor = context->input(i);
|
||||
base = DMAHelper::base(&tensor);
|
||||
len = tensor.TotalBytes();
|
||||
}
|
||||
VLOG(2) << "Input " << i << " " << base << " " << len;
|
||||
VLOG(2) << " stream[" << stream_id << "].ThenWaitFor(stream["
|
||||
<< idc->stream_id() << "])"
|
||||
<< ((idc->stream() == stream) ? " not needed" : "");
|
||||
}
|
||||
}
|
||||
if (idc->stream() != stream) stream->ThenWaitFor(idc->stream());
|
||||
}
|
||||
}
|
||||
gpu::cuda::ScopedActivateExecutorContext scoped_activation{
|
||||
stream->parent(), gpu::cuda::MultiOpActivation::kYes};
|
||||
// Keep a copy of the inputs before Compute runs, in case they get
|
||||
// deleted. TODO(misard) this will be fixed when the tracking is
|
||||
// done right.
|
||||
std::vector<Tensor>* tensor_refs = nullptr;
|
||||
if (!FLAGS_brain_gpu_sync_every_op) {
|
||||
tensor_refs = new std::vector<Tensor>;
|
||||
tensor_refs->reserve(context->num_inputs() + context->num_outputs());
|
||||
for (int ii = 0; ii < context->num_inputs(); ++ii) {
|
||||
if (context->has_input(ii)) {
|
||||
if (IsRefType(context->input_dtype(ii))) {
|
||||
Tensor in = context->mutable_input(ii, false);
|
||||
tensor_refs->push_back(in);
|
||||
} else {
|
||||
const Tensor& in = context->input(ii);
|
||||
tensor_refs->push_back(in);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
op_kernel->Compute(context);
|
||||
if (context->status().ok()) {
|
||||
if (FLAGS_brain_gpu_sync_every_op) {
|
||||
// Note: GPUUtil::Sync() only syncs the default stream.
|
||||
// We need to either sync the stream used by this op, or
|
||||
// all streams. Given that this flag is typically used for
|
||||
// debugging it makes more sense to sync all GPU activity.
|
||||
context->SetStatus(GPUUtil::SyncAll(this));
|
||||
} else {
|
||||
// The GPU kernel has been queued, but may not complete for some
|
||||
// time. As soon as this function completes, the caller will
|
||||
// discard its refs on the inputs, outputs and any scratch
|
||||
// tensors it created. Create additional refs here that will be
|
||||
// held until the kernel completes.
|
||||
for (int ii = 0; ii < context->num_temps(); ++ii) {
|
||||
Tensor* temp = context->temp(ii);
|
||||
VLOG(2) << "Saving ref to temp Tensor @ " << DMAHelper::base(temp);
|
||||
tensor_refs->push_back(*temp);
|
||||
}
|
||||
for (int ii = 0; ii < context->num_outputs(); ++ii) {
|
||||
Tensor* temp = context->mutable_output(ii);
|
||||
if (nullptr != temp) {
|
||||
tensor_refs->push_back(*temp);
|
||||
}
|
||||
}
|
||||
em_->ThenDeleteTensors(stream, tensor_refs);
|
||||
}
|
||||
} else {
|
||||
if (!FLAGS_brain_gpu_sync_every_op) {
|
||||
delete tensor_refs;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Status BaseGPUDevice::Sync() { return GPUUtil::Sync(this); }
|
||||
|
||||
void BaseGPUDevice::ComputeAsync(AsyncOpKernel* op_kernel,
|
||||
OpKernelContext* context,
|
||||
AsyncOpKernel::DoneCallback done) {
|
||||
GPUDeviceContext* gpu_device_context = device_contexts_[0];
|
||||
if (context->op_device_context() != nullptr) {
|
||||
gpu_device_context =
|
||||
static_cast<GPUDeviceContext*>(context->op_device_context());
|
||||
}
|
||||
const auto stream_id = gpu_device_context->stream_id();
|
||||
|
||||
VLOG(1) << "GpuDevice::ComputeAsync " << op_kernel->name() << " op "
|
||||
<< op_kernel->def().op() << " on GPU" << gpu_id_ << " stream["
|
||||
<< stream_id << "]";
|
||||
|
||||
port::Tracing::TraceMe activity(
|
||||
strings::StrCat(op_kernel->name(), ":", op_kernel->type_string()));
|
||||
op_kernel->ComputeAsync(context, done);
|
||||
}
|
||||
|
||||
Status BaseGPUDevice::MakeTensorFromProto(const TensorProto& tensor_proto,
|
||||
const AllocatorAttributes alloc_attrs,
|
||||
Tensor* tensor) {
|
||||
AllocatorAttributes attr;
|
||||
attr.set_on_host(true);
|
||||
attr.set_gpu_compatible(true);
|
||||
Allocator* host_alloc = GetAllocator(attr);
|
||||
Tensor parsed(tensor_proto.dtype());
|
||||
if (!parsed.FromProto(host_alloc, tensor_proto)) {
|
||||
return errors::InvalidArgument("Cannot parse tensor from proto: ",
|
||||
tensor_proto.DebugString());
|
||||
}
|
||||
Status status;
|
||||
if (alloc_attrs.on_host()) {
|
||||
*tensor = parsed;
|
||||
} else {
|
||||
if (!DMAHelper::CanUseDMA(&parsed)) {
|
||||
return errors::Internal("GPU copy from non-DMA ",
|
||||
DataTypeString(parsed.dtype()), " tensor");
|
||||
}
|
||||
Tensor copy(GetAllocator(alloc_attrs), parsed.dtype(), parsed.shape());
|
||||
port::Tracing::ScopedAnnotation annotation("MakeTensorFromProto");
|
||||
Notification n;
|
||||
device_contexts_[0]->CopyCPUTensorToDevice(&parsed, this, ©,
|
||||
[&n, &status](const Status& s) {
|
||||
status = s;
|
||||
n.Notify();
|
||||
});
|
||||
n.WaitForNotification();
|
||||
*tensor = copy;
|
||||
}
|
||||
return status;
|
||||
}
|
||||
|
||||
namespace {
|
||||
#if defined(__GCUDACC__) || defined(__GCUDACC_HOST__)
|
||||
class ConcretePerOpGpuDevice : public PerOpGpuDevice {
|
||||
public:
|
||||
explicit ConcretePerOpGpuDevice(gpu::Stream* stream,
|
||||
EigenAllocator* allocator)
|
||||
: device_(stream, allocator), allocator_(allocator) {}
|
||||
~ConcretePerOpGpuDevice() { delete allocator_; }
|
||||
|
||||
const Eigen::GpuDevice& device() const override { return device_; }
|
||||
|
||||
private:
|
||||
Eigen::GpuDevice device_;
|
||||
EigenAllocator* allocator_;
|
||||
};
|
||||
#else
|
||||
class ConcretePerOpGpuDevice : public PerOpGpuDevice {
|
||||
public:
|
||||
explicit ConcretePerOpGpuDevice(EigenCudaStreamDevice* stream_device)
|
||||
: device_(stream_device), stream_device_(stream_device) {}
|
||||
~ConcretePerOpGpuDevice() { delete stream_device_; }
|
||||
|
||||
const Eigen::GpuDevice& device() const override { return device_; }
|
||||
|
||||
private:
|
||||
Eigen::GpuDevice device_;
|
||||
EigenCudaStreamDevice* stream_device_;
|
||||
};
|
||||
#endif
|
||||
} // namespace
|
||||
|
||||
const PerOpGpuDevice* BaseGPUDevice::NewDevice(int stream_id,
|
||||
Allocator* allocator) {
|
||||
#if defined(__GCUDACC__) || defined(__GCUDACC_HOST__)
|
||||
auto ea = new EigenAllocator(streams_[stream_id], allocator, em_.get());
|
||||
return new ConcretePerOpGpuDevice(streams_[stream_id], ea);
|
||||
#else
|
||||
const cudaStream_t* cuda_stream = reinterpret_cast<const cudaStream_t*>(
|
||||
streams_[stream_id]->implementation()->CudaStreamMemberHack());
|
||||
auto es = new EigenCudaStreamDevice(cuda_stream, gpu_id_, allocator);
|
||||
return new ConcretePerOpGpuDevice(es);
|
||||
#endif
|
||||
}
|
||||
|
||||
const PerOpGpuDevice* BaseGPUDevice::MakeGpuDevice(DeviceContext* dc,
|
||||
Allocator* allocator) {
|
||||
if (dc) {
|
||||
const GPUDeviceContext* gpu_dc = static_cast<GPUDeviceContext*>(dc);
|
||||
const int stream_id = gpu_dc->stream_id();
|
||||
VLOG(1) << " eigen_gpu_device(" << dc << ") => stream[" << stream_id
|
||||
<< "]";
|
||||
CHECK_LT(stream_id, streams_.size());
|
||||
return NewDevice(stream_id, allocator);
|
||||
} else {
|
||||
return NewDevice(0, allocator);
|
||||
}
|
||||
}
|
||||
|
||||
void BaseGPUDeviceFactory::CreateDevices(const SessionOptions& options,
|
||||
const string& name_prefix,
|
||||
std::vector<Device*>* devices) {
|
||||
int n = INT_MAX;
|
||||
auto iter = options.config.device_count().find("GPU");
|
||||
if (iter != options.config.device_count().end()) {
|
||||
n = iter->second;
|
||||
}
|
||||
std::vector<int> valid_gpu_ids;
|
||||
GetValidDeviceIds(&valid_gpu_ids);
|
||||
if (static_cast<size_t>(n) > valid_gpu_ids.size()) {
|
||||
n = valid_gpu_ids.size();
|
||||
}
|
||||
for (int i = 0; i < n; i++) {
|
||||
devices->push_back(CreateGPUDevice(
|
||||
options, strings::StrCat(name_prefix, "/gpu:", i), valid_gpu_ids[i]));
|
||||
}
|
||||
}
|
||||
|
||||
namespace {
|
||||
int64 MinSystemMemory(int64 available_memory) {
|
||||
// We use the following heuristic for now:
|
||||
//
|
||||
// If the available_memory is < 2GiB, we allocate 200MiB to system memory.
|
||||
// Otherwise, allocate the larger of 300MiB or 5% of available_memory to
// system memory.
|
||||
//
|
||||
// In the future we could be more sophisticated by using a table of
|
||||
// devices.
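// For example (illustrative only): a hypothetical card reporting 4GiB
// available would reserve max(300MiB, 0.05 * 4GiB ~= 205MiB) = 300MiB for
// the system, leaving the rest for the TensorFlow GPU allocator.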
|
||||
if (available_memory < (1LL << 31)) {
|
||||
// 200MiB
|
||||
return 209715200LL;
|
||||
} else {
|
||||
// max(300 MiB, 0.05 * available_memory)
|
||||
return std::max(314572800LL, static_cast<int64>(available_memory * 0.05));
|
||||
}
|
||||
}
|
||||
} // namespace
|
||||
|
||||
static string GetShortDeviceDescription(int device_id,
|
||||
const gpu::DeviceDescription& desc) {
|
||||
return strings::StrCat("device: ", device_id, ", name: ", desc.name(),
|
||||
", pci bus id: ", desc.pci_bus_id());
|
||||
}
|
||||
|
||||
LocalDevice* BaseGPUDeviceFactory::CreateGPUDevice(
|
||||
const SessionOptions& options, const string& name, int gpu_id) {
|
||||
CHECK_GE(gpu_id, 0);
|
||||
|
||||
// Look up the device, to see its attributes.
|
||||
gpu::Platform* gpu_platform = GPUMachineManager();
|
||||
CHECK_LT(gpu_id, gpu_platform->VisibleDeviceCount());
|
||||
gpu::StreamExecutor* se =
|
||||
gpu_platform->ExecutorForDevice(gpu_id).ValueOrDie();
|
||||
const gpu::DeviceDescription& desc = se->GetDeviceDescription();
|
||||
|
||||
int64 total_memory, available_memory;
|
||||
CHECK(se->DeviceMemoryUsage(&available_memory, &total_memory));
|
||||
|
||||
int64 allocated_memory = available_memory;
|
||||
double config_memory_fraction =
|
||||
options.config.gpu_options().per_process_gpu_memory_fraction();
|
||||
if (config_memory_fraction == 0) {
|
||||
const int64 min_system_memory = MinSystemMemory(available_memory);
|
||||
if (min_system_memory < allocated_memory) {
|
||||
allocated_memory -= min_system_memory;
|
||||
}
|
||||
} else {
|
||||
allocated_memory *= config_memory_fraction;
|
||||
}
|
||||
|
||||
Bytes allocated_bytes = static_cast<Bytes>(allocated_memory);
|
||||
|
||||
// Get GPU BusAdjacency from its reported NUMA affinity.
|
||||
// Because GPUs are virtualized in some environments, we can't just
|
||||
// use the GPU id.
|
||||
BusAdjacency bus_adjacency = BUS_ANY;
|
||||
switch (desc.numa_node()) {
|
||||
case 0:
|
||||
bus_adjacency = BUS_0;
|
||||
break;
|
||||
case 1:
|
||||
bus_adjacency = BUS_1;
|
||||
break;
|
||||
default:
|
||||
bus_adjacency = BUS_ANY;
|
||||
}
|
||||
VLOG(1) << "GPUDevice id " << gpu_id << " on bus " << bus_adjacency
|
||||
<< " numa: " << desc.numa_node() << " pci: " << desc.pci_bus_id();
|
||||
|
||||
ProcessState* process_state = ProcessState::singleton();
|
||||
return CreateGPUDevice(
|
||||
options, name, allocated_bytes, bus_adjacency, gpu_id,
|
||||
GetShortDeviceDescription(gpu_id, desc),
|
||||
process_state->GetGPUAllocator(gpu_id, allocated_memory),
|
||||
process_state->GetCPUAllocator(desc.numa_node()));
|
||||
}
|
||||
|
||||
static int GetMinGPUMultiprocessorCount() {
|
||||
static const int kDefaultMinGPUMultiprocessorCount = 8;
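  // The default can be overridden at process start, e.g. (illustrative
  // command line, binary name hypothetical):
  //   TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 ./my_training_binary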
|
||||
|
||||
const char* tf_min_gpu_core_count = getenv("TF_MIN_GPU_MULTIPROCESSOR_COUNT");
|
||||
|
||||
if (tf_min_gpu_core_count == nullptr ||
|
||||
strcmp(tf_min_gpu_core_count, "") == 0) {
|
||||
return kDefaultMinGPUMultiprocessorCount;
|
||||
}
|
||||
|
||||
int min_gpu_core_count = -1;
|
||||
if (strings::safe_strto32(tf_min_gpu_core_count, &min_gpu_core_count)) {
|
||||
if (min_gpu_core_count >= 0) {
|
||||
return min_gpu_core_count;
|
||||
}
|
||||
}
|
||||
|
||||
LOG(ERROR) << "Invalid minimum GPU multiprocessor count: ["
|
||||
<< tf_min_gpu_core_count << "]. "
|
||||
<< "Using the default value: "
|
||||
<< kDefaultMinGPUMultiprocessorCount;
|
||||
return kDefaultMinGPUMultiprocessorCount;
|
||||
}
|
||||
|
||||
void BaseGPUDeviceFactory::GetValidDeviceIds(std::vector<int>* ids) {
|
||||
auto gpu_manager = GPUMachineManager();
|
||||
int min_gpu_core_count = GetMinGPUMultiprocessorCount();
|
||||
if (gpu_manager) {
|
||||
auto visible_device_count = gpu_manager->VisibleDeviceCount();
|
||||
for (int i = 0; i < gpu_manager->VisibleDeviceCount(); ++i) {
|
||||
auto exec_status = gpu_manager->ExecutorForDevice(i);
|
||||
if (!exec_status.ok()) {
|
||||
continue;
|
||||
}
|
||||
gpu::StreamExecutor* se = exec_status.ValueOrDie();
|
||||
const gpu::DeviceDescription& desc = se->GetDeviceDescription();
|
||||
int major, minor;
|
||||
if (!desc.cuda_compute_capability(&major, &minor)) {
|
||||
continue;
|
||||
}
|
||||
// Only consider GPUs with compute capability >= 3.5 (Kepler or
|
||||
// higher)
|
||||
if (major < 3 || (major == 3 && minor < 5)) {
|
||||
LOG(INFO) << "Ignoring gpu device "
|
||||
<< "(" << GetShortDeviceDescription(i, desc) << ") "
|
||||
<< "with Cuda compute capability " << major << "." << minor
|
||||
<< ". The minimum required Cuda capability is 3.5.";
|
||||
continue;
|
||||
}
|
||||
|
||||
// TensorFlow currently places computation on devices assuming
|
||||
// they have similar capability.
|
||||
//
|
||||
// If there are multiple GPUs available on the machine, only
|
||||
// consider GPUs with 8 or more multiprocessors.
|
||||
//
|
||||
// TODO(vrv): In the medium term: we should only filter out GPUs
|
||||
// that are slow relative to the fastest GPU. In the long term,
|
||||
// TensorFlow should support automatic placement based on
|
||||
// capability.
|
||||
if (visible_device_count > 1) {
|
||||
if (desc.core_count() < min_gpu_core_count) {
|
||||
LOG(INFO) << "Ignoring gpu device "
|
||||
<< "(" << GetShortDeviceDescription(i, desc) << ") "
|
||||
<< "with Cuda multiprocessor count: " << desc.core_count()
|
||||
<< ". The minimum required count is " << min_gpu_core_count
|
||||
<< ". You can adjust this requirement with the env var "
|
||||
"TF_MIN_GPU_MULTIPROCESSOR_COUNT.";
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
int new_id = ids->size();
|
||||
ids->push_back(i);
|
||||
|
||||
LOG(INFO) << "Creating TensorFlow device (/gpu:" << new_id << ") -> "
|
||||
<< "(" << GetShortDeviceDescription(i, desc) << ")";
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
94
tensorflow/core/common_runtime/gpu/gpu_device.h
Normal file
@ -0,0 +1,94 @@
|
||||
#if !GOOGLE_CUDA
|
||||
#error This file must only be included when building with Cuda support
|
||||
#endif
|
||||
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
|
||||
|
||||
#include "tensorflow/core/common_runtime/device_factory.h"
|
||||
#include "tensorflow/core/common_runtime/gpu_device_context.h"
|
||||
#include "tensorflow/core/common_runtime/local_device.h"
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/framework/device_base.h"
|
||||
#include "tensorflow/core/framework/op_kernel.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class EigenAllocator;
|
||||
|
||||
class BaseGPUDevice : public LocalDevice {
|
||||
public:
|
||||
BaseGPUDevice(const SessionOptions& options, const string& name,
|
||||
Bytes memory_limit, BusAdjacency bus_adjacency, int gpu_id,
|
||||
const string& physical_device_desc, Allocator* gpu_allocator,
|
||||
Allocator* cpu_allocator);
|
||||
|
||||
~BaseGPUDevice() override;
|
||||
|
||||
// GPU devices require the Op Compute method to save a reference to
|
||||
// any temporary tensors that are allocated until the Op execution
|
||||
// completes.
|
||||
bool SaveTemporaryTensors() const override { return true; }
|
||||
|
||||
Status FillContextMap(const Graph* graph,
|
||||
DeviceContextMap* device_context_map);
|
||||
|
||||
void Compute(OpKernel* op_kernel, OpKernelContext* context) override;
|
||||
|
||||
Status Sync() override;
|
||||
|
||||
void ComputeAsync(AsyncOpKernel* op_kernel, OpKernelContext* context,
|
||||
AsyncOpKernel::DoneCallback done) override;
|
||||
|
||||
Status MakeTensorFromProto(const TensorProto& tensor_proto,
|
||||
const AllocatorAttributes alloc_attrs,
|
||||
Tensor* tensor) override;
|
||||
|
||||
// The caller owns the returned device.
|
||||
const PerOpGpuDevice* MakeGpuDevice(DeviceContext* dc,
|
||||
Allocator* allocator) override;
|
||||
|
||||
protected:
|
||||
Allocator* gpu_allocator_; // not owned
|
||||
Allocator* cpu_allocator_; // not owned
|
||||
|
||||
private:
|
||||
std::vector<gpu::Stream*> streams_;
|
||||
std::vector<GPUDeviceContext*> device_contexts_;
|
||||
GpuDeviceInfo* gpu_device_info_ = nullptr;
|
||||
mutex trace_mu_;
|
||||
int gpu_id_ = -1;
|
||||
std::unique_ptr<EventMgr> em_;
|
||||
|
||||
const PerOpGpuDevice* NewDevice(int stream_id, Allocator* allocator);
|
||||
};
|
||||
|
||||
class BaseGPUDeviceFactory : public DeviceFactory {
|
||||
public:
|
||||
void CreateDevices(const SessionOptions& options, const string& name_prefix,
|
||||
std::vector<Device*>* devices) override;
|
||||
|
||||
private:
|
||||
LocalDevice* CreateGPUDevice(const SessionOptions& options,
|
||||
const string& name, int gpu_id);
|
||||
|
||||
virtual LocalDevice* CreateGPUDevice(const SessionOptions& options,
|
||||
const string& name, Bytes memory_limit,
|
||||
BusAdjacency bus_adjacency, int gpu_id,
|
||||
const string& physical_device_desc,
|
||||
Allocator* gpu_allocator,
|
||||
Allocator* cpu_allocator) = 0;
|
||||
|
||||
void GetValidDeviceIds(std::vector<int>* ids);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_DEVICE_H_
|
52
tensorflow/core/common_runtime/gpu/gpu_device_factory.cc
Normal file
@ -0,0 +1,52 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#define EIGEN_USE_GPU
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_device.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
void RequireGPUDevice() {}
|
||||
|
||||
class GPUDevice : public BaseGPUDevice {
|
||||
public:
|
||||
GPUDevice(const SessionOptions& options, const string& name,
|
||||
Bytes memory_limit, BusAdjacency bus_adjacency, int gpu_id,
|
||||
const string& physical_device_desc, Allocator* gpu_allocator,
|
||||
Allocator* cpu_allocator)
|
||||
: BaseGPUDevice(options, name, memory_limit, bus_adjacency, gpu_id,
|
||||
physical_device_desc, gpu_allocator, cpu_allocator) {}
|
||||
|
||||
Allocator* GetAllocator(AllocatorAttributes attr) override {
|
||||
if (attr.on_host()) {
|
||||
ProcessState* ps = ProcessState::singleton();
|
||||
if (attr.gpu_compatible()) {
|
||||
return ps->GetCUDAHostAllocator(0);
|
||||
} else {
|
||||
return cpu_allocator_;
|
||||
}
|
||||
} else {
|
||||
return gpu_allocator_;
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
class GPUDeviceFactory : public BaseGPUDeviceFactory {
|
||||
private:
|
||||
LocalDevice* CreateGPUDevice(const SessionOptions& options,
|
||||
const string& name, Bytes memory_limit,
|
||||
BusAdjacency bus_adjacency, int gpu_id,
|
||||
const string& physical_device_desc,
|
||||
Allocator* gpu_allocator,
|
||||
Allocator* cpu_allocator) override {
|
||||
return new GPUDevice(options, name, memory_limit, bus_adjacency, gpu_id,
|
||||
physical_device_desc, gpu_allocator, cpu_allocator);
|
||||
}
|
||||
};
|
||||
|
||||
REGISTER_LOCAL_DEVICE_FACTORY("GPU", GPUDeviceFactory);
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
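As a summary of the allocator routing implemented by GPUDevice::GetAllocator() above (restated here for clarity; not additional code from the commit):

// attr.on_host()   attr.gpu_compatible()   allocator returned
// false            (ignored)               gpu_allocator_
// true             true                    ProcessState::GetCUDAHostAllocator(0)
// true             false                   cpu_allocator_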
132
tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
Normal file
@ -0,0 +1,132 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
|
||||
|
||||
#include "tensorflow/stream_executor/event.h"
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
EventMgr::EventMgr(gpu::StreamExecutor* se)
|
||||
: exec_(se),
|
||||
// threadpool_ has 1 thread for the polling loop, and one to execute
|
||||
// event callback functions. Maybe we should have more?
|
||||
threadpool_(Env::Default(), "GPU_Event_Manager", 2) {
|
||||
threadpool_.Schedule([this]() { PollLoop(); });
|
||||
}
|
||||
|
||||
EventMgr::~EventMgr() {
|
||||
stop_polling_.Notify();
|
||||
// Shut down the backup polling loop.
|
||||
polling_stopped_.WaitForNotification();
|
||||
|
||||
// Events are owned by this object.
|
||||
for (auto& e : free_events_) {
|
||||
delete e;
|
||||
}
|
||||
while (!used_events_.empty()) {
|
||||
delete used_events_[0].event;
|
||||
delete used_events_[0].mem;
|
||||
if (used_events_[0].bufrec.buf) {
|
||||
used_events_[0].bufrec.alloc->DeallocateRaw(used_events_[0].bufrec.buf);
|
||||
}
|
||||
if (used_events_[0].func != nullptr)
|
||||
threadpool_.Schedule(used_events_[0].func);
|
||||
used_events_.pop_front();
|
||||
}
|
||||
}
|
||||
|
||||
// This polling loop runs at a relatively low frequency. Most calls to
|
||||
// PollEvents() should come directly from Compute() via
|
||||
// ThenDeleteTensors(). This function's purpose is to ensure that
|
||||
// even if no more GPU operations are being requested, we still
|
||||
// eventually clear the queue. It seems to prevent some tensorflow
|
||||
// programs from stalling for reasons not yet understood.
|
||||
void EventMgr::PollLoop() {
|
||||
while (!stop_polling_.HasBeenNotified()) {
|
||||
Env::Default()->SleepForMicroseconds(1 * 1000);
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
PollEvents(true);
|
||||
}
|
||||
}
|
||||
polling_stopped_.Notify();
|
||||
}
|
||||
|
||||
void EventMgr::QueueInUse(gpu::Stream* stream, InUse iu) {
|
||||
VLOG(2) << "QueueInUse free_events_ " << free_events_.size()
|
||||
<< " used_events_ " << used_events_.size();
|
||||
// Events are created on demand, and repeatedly reused. There is no
|
||||
// limit placed here on the number of allocated Events.
|
||||
if (free_events_.empty()) {
|
||||
free_events_.push_back(new gpu::Event(exec_));
|
||||
free_events_.back()->Init();
|
||||
}
|
||||
gpu::Event* e = free_events_.back();
|
||||
free_events_.pop_back();
|
||||
stream->ThenRecordEvent(e);
|
||||
iu.event = e;
|
||||
used_events_.push_back(iu);
|
||||
}
|
||||
|
||||
// This function must be called periodically to check whether pending
|
||||
// events have recorded, and then retire them. Initial observations
|
||||
// suggest that typical behavior in a TensorFlow program is to have
|
||||
// 0-3 events pending most of the time, but there are occasionally
|
||||
// spikes of up to several hundred outstanding.
|
||||
//
|
||||
// NOTE: If all events are on the same stream, no later event will
|
||||
// complete before an earlier event, except possibly if the earlier
|
||||
// event transitions to an error state, so there's no advantage in
|
||||
// looking past the first kPending event. However, if we're using
|
||||
// multiple streams there may be some gain in looking deeper.
|
||||
// As a compromise, PollEvent() calls that are triggered by the queueing
|
||||
// of a single event never look past the first kPending event. Calls
|
||||
// coming from the dedicated polling thread always sweep the full queue.
|
||||
//
|
||||
// Note that allowing the queue to grow very long could cause overall
|
||||
// GPU memory use to spike needlessly. An alternative strategy would
|
||||
// be to throttle new Op execution until the pending event queue
|
||||
// clears.
|
||||
void EventMgr::PollEvents(bool is_dedicated_poller) {
|
||||
VLOG(2) << "PollEvents free_events_ " << free_events_.size()
|
||||
<< " used_events_ " << used_events_.size();
|
||||
// Sweep the remaining events in order. If this is the dedicated
|
||||
// polling thread, check the entire set. Otherwise, just sweep up to
|
||||
// the first non-complete record that is still pending.
|
||||
for (auto& iu : used_events_) {
|
||||
if (iu.event == nullptr) continue;
|
||||
gpu::Event::Status s = iu.event->PollForStatus();
|
||||
switch (s) {
|
||||
case gpu::Event::Status::kUnknown:
|
||||
case gpu::Event::Status::kError:
|
||||
// We don't expect to see these. Someday maybe propagate
|
||||
// a Status error, but for now fail hard.
|
||||
LOG(FATAL) << "Unexpected Event status: " << static_cast<int>(s);
|
||||
break;
|
||||
case gpu::Event::Status::kPending:
|
||||
if (!is_dedicated_poller) return; // quit processing queue
|
||||
break;
|
||||
case gpu::Event::Status::kComplete:
|
||||
delete iu.mem;
|
||||
if (iu.bufrec.buf) iu.bufrec.alloc->DeallocateRaw(iu.bufrec.buf);
|
||||
// The function must be called in another thread, outside of
|
||||
// the mutex held here.
|
||||
if (iu.func != nullptr) threadpool_.Schedule(iu.func);
|
||||
free_events_.push_back(iu.event);
|
||||
// Mark this InUse record as completed.
|
||||
iu.event = nullptr;
|
||||
}
|
||||
}
|
||||
// Then clear any completed InUse records from the front of the queue.
|
||||
while (!used_events_.empty()) {
|
||||
InUse& iu = used_events_.front();
|
||||
if (iu.event == nullptr) {
|
||||
used_events_.pop_front();
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
118
tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
Normal file
@ -0,0 +1,118 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_EVENT_MGR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_EVENT_MGR_H_
|
||||
|
||||
#include <deque>
|
||||
#include <vector>
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
namespace perftools {
|
||||
namespace gputools {
|
||||
class Event;
|
||||
class Stream;
|
||||
class StreamExecutor;
|
||||
} // namespace gputools
|
||||
} // namespace perftools
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// An object to keep track of pending Events in the StreamExecutor streams
|
||||
// and associated Tensors that cannot safely be deleted until the associated
|
||||
// Events are recorded.
|
||||
class EventMgr {
|
||||
public:
|
||||
explicit EventMgr(perftools::gputools::StreamExecutor* se);
|
||||
|
||||
~EventMgr();
|
||||
|
||||
// Takes ownership of *tensors and deletes it as soon as all events
|
||||
// currently enqueued on *stream have completed.
|
||||
inline void ThenDeleteTensors(perftools::gputools::Stream* stream,
|
||||
std::vector<Tensor>* tensors) {
|
||||
mutex_lock l(mu_);
|
||||
QueueTensors(stream, tensors);
|
||||
PollEvents(false);
|
||||
}
|
||||
|
||||
struct BufRec {
|
||||
Allocator* alloc;
|
||||
void* buf;
|
||||
};
|
||||
|
||||
// Takes ownership of *bufrec.buf and calls bufrec.alloc->DeallocateRaw()
|
||||
// on it as soon as all events currently enqueued on *stream have completed.
|
||||
inline void ThenDeleteBuffer(perftools::gputools::Stream* stream,
|
||||
BufRec bufrec) {
|
||||
mutex_lock l(mu_);
|
||||
QueueBuffer(stream, bufrec);
|
||||
PollEvents(false);
|
||||
}
|
||||
|
||||
inline void ThenExecute(perftools::gputools::Stream* stream,
|
||||
std::function<void()> func) {
|
||||
mutex_lock l(mu_);
|
||||
QueueFunc(stream, func);
|
||||
PollEvents(false);
|
||||
}
|
||||
|
||||
private:
|
||||
friend class TEST_EventMgrHelper;
|
||||
mutex mu_;
|
||||
perftools::gputools::StreamExecutor* exec_;
|
||||
|
||||
struct InUse {
|
||||
perftools::gputools::Event* event;
|
||||
std::vector<Tensor>* mem;
|
||||
BufRec bufrec;
|
||||
std::function<void()> func;
|
||||
};
|
||||
|
||||
// Stream-enqueue an unused Event and save with it a collection of
|
||||
// Tensors and/or a BufRec to be deleted only after the Event
|
||||
// records.
|
||||
void QueueInUse(perftools::gputools::Stream* stream, InUse in_use)
|
||||
EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
|
||||
void QueueTensors(perftools::gputools::Stream* stream,
|
||||
std::vector<Tensor>* tensors)
|
||||
EXCLUSIVE_LOCKS_REQUIRED(mu_) {
|
||||
QueueInUse(stream, {nullptr, tensors, BufRec(), nullptr});
|
||||
}
|
||||
|
||||
void QueueBuffer(perftools::gputools::Stream* stream, BufRec bufrec)
|
||||
EXCLUSIVE_LOCKS_REQUIRED(mu_) {
|
||||
QueueInUse(stream, {nullptr, nullptr, bufrec, nullptr});
|
||||
}
|
||||
|
||||
void QueueFunc(perftools::gputools::Stream* stream,
|
||||
std::function<void()> func) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
|
||||
QueueInUse(stream, {nullptr, nullptr, BufRec(), func});
|
||||
}
|
||||
|
||||
// This function should be called at roughly the same tempo as
|
||||
// QueueTensors() to check whether pending events have recorded,
|
||||
// and then retire them.
|
||||
void PollEvents(bool is_dedicated_poller) EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
|
||||
// An internal polling loop that runs at a low frequency to clear
|
||||
// straggler Events.
|
||||
void PollLoop();
|
||||
|
||||
// A stack of unused events
|
||||
std::vector<perftools::gputools::Event*> free_events_ GUARDED_BY(mu_);
|
||||
|
||||
// A FIFO queue of InUse events and associated tensors.
|
||||
std::deque<InUse> used_events_ GUARDED_BY(mu_);
|
||||
|
||||
Notification stop_polling_;
|
||||
Notification polling_stopped_;
|
||||
|
||||
// The main PollLoop for the event manager runs in this threadpool.
|
||||
thread::ThreadPool threadpool_;
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_EVENT_MGR_H_
|
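A brief usage sketch for this interface, assuming an already-initialized gpu::Stream* stream and an EventMgr em; the tensor and the log message are placeholders:

// Defer deletion of temporaries until everything already enqueued on the
// stream has completed; em takes ownership of the vector.
std::vector<Tensor>* temporaries = new std::vector<Tensor>;
temporaries->push_back(some_gpu_tensor);  // placeholder Tensor
em.ThenDeleteTensors(stream, temporaries);

// Run an arbitrary callback once the stream's pending work has finished.
em.ThenExecute(stream, []() { LOG(INFO) << "stream work complete"; });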
152
tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc
Normal file
@ -0,0 +1,152 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class TEST_EventMgrHelper {
|
||||
public:
|
||||
explicit TEST_EventMgrHelper(EventMgr* em) : em_(em) {}
|
||||
|
||||
int queue_size() {
|
||||
mutex_lock l(em_->mu_);
|
||||
return em_->used_events_.size();
|
||||
}
|
||||
|
||||
int free_size() {
|
||||
mutex_lock l(em_->mu_);
|
||||
return em_->free_events_.size();
|
||||
}
|
||||
|
||||
void QueueTensors(perftools::gputools::Stream* stream,
|
||||
std::vector<Tensor>* tensors) {
|
||||
mutex_lock l(em_->mu_);
|
||||
em_->QueueTensors(stream, tensors);
|
||||
}
|
||||
|
||||
void PollEvents(bool is_dedicated_poller) {
|
||||
mutex_lock l(em_->mu_);
|
||||
em_->PollEvents(is_dedicated_poller);
|
||||
}
|
||||
|
||||
private:
|
||||
EventMgr* em_;
|
||||
};
|
||||
|
||||
namespace {
|
||||
|
||||
TEST(EventMgr, Empty) {
|
||||
auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
|
||||
EventMgr em(stream_exec);
|
||||
TEST_EventMgrHelper th(&em);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
}
|
||||
|
||||
// Delaying polling until after several enqueuings should grow the
|
||||
// total number of allocated events. Once we have enough events for
|
||||
// the max simultaneously pending, we should not allocate any more.
|
||||
TEST(EventMgr, DelayedPolling) {
|
||||
auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
|
||||
EventMgr em(stream_exec);
|
||||
TEST_EventMgrHelper th(&em);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
std::vector<Tensor>* v = nullptr;
|
||||
std::unique_ptr<gpu::Stream> stream(new gpu::Stream(stream_exec));
|
||||
CHECK(stream.get());
|
||||
stream->Init();
|
||||
for (int i = 0; i < 5; ++i) {
|
||||
v = new std::vector<Tensor>;
|
||||
th.QueueTensors(stream.get(), v);
|
||||
EXPECT_EQ(i + 1, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
}
|
||||
th.PollEvents(false);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(5, th.free_size());
|
||||
for (int j = 0; j < 2; ++j) {
|
||||
for (int i = 0; i < 5; ++i) {
|
||||
v = new std::vector<Tensor>;
|
||||
th.QueueTensors(stream.get(), v);
|
||||
EXPECT_EQ(i + 1, th.queue_size());
|
||||
EXPECT_EQ(4 - i, th.free_size());
|
||||
}
|
||||
th.PollEvents(false);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(5, th.free_size());
|
||||
}
|
||||
}
|
||||
|
||||
// Immediate polling should require only one event to be allocated.
|
||||
TEST(EventMgr, ImmediatePolling) {
|
||||
auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
|
||||
EventMgr em(stream_exec);
|
||||
TEST_EventMgrHelper th(&em);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
std::vector<Tensor>* v = nullptr;
|
||||
std::unique_ptr<gpu::Stream> stream(new gpu::Stream(stream_exec));
|
||||
CHECK(stream.get());
|
||||
stream->Init();
|
||||
for (int i = 0; i < 5; ++i) {
|
||||
v = new std::vector<Tensor>;
|
||||
em.ThenDeleteTensors(stream.get(), v);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(1, th.free_size());
|
||||
}
|
||||
}
|
||||
|
||||
// If we delay polling by more than 1 second, the backup polling loop
|
||||
// should clear the queue.
|
||||
TEST(EventMgr, LongDelayedPolling) {
|
||||
auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
|
||||
EventMgr em(stream_exec);
|
||||
TEST_EventMgrHelper th(&em);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
std::vector<Tensor>* v = nullptr;
|
||||
std::unique_ptr<gpu::Stream> stream(new gpu::Stream(stream_exec));
|
||||
CHECK(stream.get());
|
||||
stream->Init();
|
||||
for (int i = 0; i < 5; ++i) {
|
||||
v = new std::vector<Tensor>;
|
||||
th.QueueTensors(stream.get(), v);
|
||||
EXPECT_EQ(1 + i, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
}
|
||||
sleep(1);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(5, th.free_size());
|
||||
}
|
||||
|
||||
// Deleting the EventMgr when events are still pending should shut
|
||||
// down gracefully.
|
||||
TEST(EventMgr, NonEmptyShutdown) {
|
||||
auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
|
||||
EventMgr em(stream_exec);
|
||||
TEST_EventMgrHelper th(&em);
|
||||
EXPECT_EQ(0, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
std::vector<Tensor>* v = nullptr;
|
||||
std::unique_ptr<gpu::Stream> stream(new gpu::Stream(stream_exec));
|
||||
CHECK(stream.get());
|
||||
stream->Init();
|
||||
for (int i = 0; i < 5; ++i) {
|
||||
v = new std::vector<Tensor>;
|
||||
th.QueueTensors(stream.get(), v);
|
||||
EXPECT_EQ(1 + i, th.queue_size());
|
||||
EXPECT_EQ(0, th.free_size());
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
147
tensorflow/core/common_runtime/gpu/gpu_init.cc
Normal file
@ -0,0 +1,147 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
|
||||
#include <string>
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
std::unique_ptr<std::map<std::pair<int, int>, bool>> GetPeerAccessMap(
|
||||
gpu::Platform* platform, int device_count) {
|
||||
auto* map = new std::map<std::pair<int, int>, bool>;
|
||||
for (int i = 0; i < device_count; ++i) {
|
||||
for (int j = 0; j < device_count; ++j) {
|
||||
gpu::StreamExecutor* from = platform->ExecutorForDevice(i).ValueOrDie();
|
||||
gpu::StreamExecutor* to = platform->ExecutorForDevice(j).ValueOrDie();
|
||||
(*map)[{i, j}] = from->CanEnablePeerAccessTo(to);
|
||||
}
|
||||
}
|
||||
|
||||
return std::unique_ptr<std::map<std::pair<int, int>, bool>>{map};
|
||||
}
|
||||
|
||||
Status EnablePeerAccess(gpu::Platform* platform, int device_count) {
|
||||
for (int i = 0; i < device_count; ++i) {
|
||||
for (int j = 0; j < device_count; ++j) {
|
||||
gpu::StreamExecutor* from = platform->ExecutorForDevice(i).ValueOrDie();
|
||||
gpu::StreamExecutor* to = platform->ExecutorForDevice(j).ValueOrDie();
|
||||
|
||||
if (from->CanEnablePeerAccessTo(to)) {
|
||||
auto status = from->EnablePeerAccessTo(to);
|
||||
if (!status.ok()) {
|
||||
return errors::Internal(status.ToString());
|
||||
}
|
||||
} else {
|
||||
LOG(INFO) << "cannot enable peer access from device ordinal " << i
|
||||
<< " to device ordinal " << j;
|
||||
}
|
||||
}
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
static void InitGPU() {
|
||||
auto result = gpu::MultiPlatformManager::PlatformWithName("CUDA");
|
||||
if (!result.ok()) {
|
||||
LOG(WARNING)
|
||||
<< "Not initializing the GPU, could not create GPU MachineManager. "
|
||||
<< "Error: " << result.status();
|
||||
return;
|
||||
}
|
||||
|
||||
gpu::Platform* platform = result.ValueOrDie();
|
||||
|
||||
int dev_count = platform->VisibleDeviceCount();
|
||||
|
||||
if (dev_count == 0) {
|
||||
LOG(INFO) << "No GPU devices available on machine.";
|
||||
return;
|
||||
}
|
||||
|
||||
for (int i = 0; i < dev_count; ++i) {
|
||||
auto stream_exec = platform->ExecutorForDevice(i).ValueOrDie();
|
||||
int64 free_bytes;
|
||||
int64 total_bytes;
|
||||
if (!stream_exec->DeviceMemoryUsage(&free_bytes, &total_bytes)) {
|
||||
// Logs internally on failure.
|
||||
free_bytes = 0;
|
||||
total_bytes = 0;
|
||||
}
|
||||
const auto& description = stream_exec->GetDeviceDescription();
|
||||
int cc_major;
|
||||
int cc_minor;
|
||||
if (!description.cuda_compute_capability(&cc_major, &cc_minor)) {
|
||||
// Logs internally on failure.
|
||||
cc_major = 0;
|
||||
cc_minor = 0;
|
||||
}
|
||||
LOG(INFO) << "Found device " << i << " with properties: "
|
||||
<< "\nname: " << description.name() << "\nmajor: " << cc_major
|
||||
<< " minor: " << cc_minor << " memoryClockRate (GHz) "
|
||||
<< description.clock_rate_ghz() << "\npciBusID "
|
||||
<< description.pci_bus_id() << "\nTotal memory: "
|
||||
<< strings::HumanReadableNumBytes(total_bytes)
|
||||
<< "\nFree memory: "
|
||||
<< strings::HumanReadableNumBytes(free_bytes);
|
||||
}
|
||||
|
||||
// Enable peer access
|
||||
|
||||
auto status = EnablePeerAccess(platform, dev_count);
|
||||
if (!status.ok()) {
|
||||
LOG(FATAL) << "could not enable peer access for GPU devices: " << status;
|
||||
}
|
||||
|
||||
// Print out a matrix showing which devices can DMA to one
|
||||
// another.
|
||||
auto access_map = GetPeerAccessMap(platform, dev_count);
|
||||
string line_buf = "DMA: ";
|
||||
for (int i = 0; i < dev_count; ++i) {
|
||||
strings::StrAppend(&line_buf, i, " ");
|
||||
}
|
||||
LOG(INFO) << line_buf;
|
||||
for (int i = 0; i < dev_count; ++i) {
|
||||
line_buf = strings::StrCat(i, ": ");
|
||||
for (int j = 0; j < dev_count; ++j) {
|
||||
if ((*access_map)[{i, j}]) {
|
||||
line_buf.append("Y ");
|
||||
} else {
|
||||
line_buf.append("N ");
|
||||
}
|
||||
}
|
||||
LOG(INFO) << line_buf;
|
||||
}
|
||||
}
|
||||
|
||||
static bool InitModule() {
|
||||
InitGPU();
|
||||
return true;
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
gpu::Platform* GPUMachineManager() {
|
||||
// Create the machine manager singleton and initialize the GPUs only
|
||||
// once.
|
||||
static bool init = InitModule();
|
||||
CHECK(init); // Avoids compiler warning that init is unused.
|
||||
|
||||
auto result = gpu::MultiPlatformManager::PlatformWithName("CUDA");
|
||||
if (!result.ok()) {
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
return result.ValueOrDie();
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
19
tensorflow/core/common_runtime/gpu/gpu_init.h
Normal file
@ -0,0 +1,19 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_INIT_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_INIT_H_
|
||||
|
||||
namespace perftools {
|
||||
namespace gputools {
|
||||
class Platform;
|
||||
} // namespace gputools
|
||||
} // namespace perftools
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Returns the GPU machine manager singleton, creating it and
|
||||
// initializing the GPUs on the machine if needed the first time it is
|
||||
// called.
|
||||
perftools::gputools::Platform* GPUMachineManager();
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_INIT_H_
|
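A minimal usage sketch of this header; the nullptr check and device index 0 are illustrative, and ValueOrDie() follows the same pattern used elsewhere in this commit:

namespace gpu = ::perftools::gputools;

gpu::Platform* platform = GPUMachineManager();  // may be nullptr if no CUDA platform
if (platform != nullptr && platform->VisibleDeviceCount() > 0) {
  gpu::StreamExecutor* se = platform->ExecutorForDevice(0).ValueOrDie();
  // ... create streams, query device properties, etc.
}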
371
tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc
Normal file
@ -0,0 +1,371 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_region_allocator.h"
|
||||
|
||||
//#include "base/commandlineflags.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/lib/core/bits.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
#if defined(PLATFORM_GOOGLE)
|
||||
DEFINE_bool(brain_gpu_region_allocator_heap_check_on_destruction, true,
|
||||
"If true, the CUDA gpu manager checks that all allocated "
|
||||
"memory through the GPU memory pool implementation has been "
|
||||
"freed.");
|
||||
|
||||
DEFINE_int64(brain_gpu_region_allocator_region_size, 0,
|
||||
"If > 0, sets the default chunk-size allocatable from GPU memory. "
|
||||
"Else defaults to entire GPU memory.");
|
||||
|
||||
#else
|
||||
bool FLAGS_brain_gpu_region_allocator_heap_check_on_destruction = true;
|
||||
tensorflow::int64 FLAGS_brain_gpu_region_allocator_region_size = 0;
|
||||
#endif
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
GPURegionAllocator::GPURegionAllocator(int device_id, size_t total_bytes)
|
||||
: device_id_(device_id), total_bytes_(total_bytes) {
|
||||
// Get a pointer to the stream_executor for this device
|
||||
stream_exec_ = GPUMachineManager()->ExecutorForDevice(device_id).ValueOrDie();
|
||||
|
||||
// Set the region size based on explicit user request, or based on
|
||||
// total GPU capacity.
|
||||
if (FLAGS_brain_gpu_region_allocator_region_size > 0) {
|
||||
region_size_ = FLAGS_brain_gpu_region_allocator_region_size;
|
||||
} else {
|
||||
region_size_ = static_cast<size_t>(total_bytes_);
|
||||
}
|
||||
|
||||
LOG(INFO) << "Setting region size to " << region_size_;
|
||||
}
|
||||
|
||||
GPURegionAllocator::~GPURegionAllocator() {
|
||||
if (FLAGS_brain_gpu_region_allocator_heap_check_on_destruction) {
|
||||
CheckForMemoryLeaks();
|
||||
}
|
||||
|
||||
gtl::STLDeleteValues(&chunk_map_);
|
||||
|
||||
for (auto r : regions_) {
|
||||
gpu::DeviceMemoryBase gpu_ptr{r->ptr};
|
||||
stream_exec_->Deallocate(&gpu_ptr);
|
||||
delete r;
|
||||
}
|
||||
}
|
||||
|
||||
void* GPURegionAllocator::AllocateRaw(size_t alignment, size_t num_bytes) {
|
||||
static const int64 kMaxMillisToWait = 10000; // 10 seconds
|
||||
return retry_helper_.AllocateRaw(
|
||||
[this](size_t a, size_t nb, bool v) {
|
||||
return AllocateRawInternal(a, nb, v);
|
||||
},
|
||||
kMaxMillisToWait, alignment, num_bytes);
|
||||
}
|
||||
|
||||
void* GPURegionAllocator::AllocateRawInternal(size_t alignment,
|
||||
size_t num_bytes,
|
||||
bool dump_log_on_failure) {
|
||||
if (num_bytes == 0) {
|
||||
LOG(ERROR) << "tried to allocate 0 bytes";
|
||||
return nullptr;
|
||||
}
|
||||
size_t chunk_size = ChunkSize(num_bytes);
|
||||
|
||||
VLOG(2) << "chunk_size " << chunk_size << " from num_bytes "
|
||||
<< strings::HumanReadableNumBytes(num_bytes);
|
||||
mutex_lock l(lock_);
|
||||
Pool* pool = &pools_[chunk_size];
|
||||
if (pool->num_free == 0) {
|
||||
if (!ExpandPool(pool, chunk_size, num_bytes, dump_log_on_failure)) {
|
||||
if (dump_log_on_failure) {
|
||||
LOG(WARNING) << "Out of GPU memory, see memory state dump above";
|
||||
}
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
CHECK_LT(0, pool->num_free);
|
||||
CHECK(pool->first);
|
||||
CHECK(pool->last);
|
||||
Chunk* c = pool->first;
|
||||
CHECK(c);
|
||||
CHECK(!c->in_use);
|
||||
|
||||
c->in_use = true;
|
||||
// Move c to the back of the queue.
|
||||
if (c->next != nullptr) {
|
||||
pool->first = c->next;
|
||||
pool->first->prev = nullptr;
|
||||
c->next = nullptr;
|
||||
}
|
||||
|
||||
if (pool->last != c) {
|
||||
pool->last->next = c;
|
||||
c->prev = pool->last;
|
||||
pool->last = c;
|
||||
}
|
||||
pool->num_free--;
|
||||
pool->cumulative_malloced++;
|
||||
|
||||
void* rv = c->ptr;
|
||||
c->bytes_allocated = num_bytes;
|
||||
|
||||
VLOG(2) << "new ptr " << rv;
|
||||
return rv;
|
||||
}
|
||||
|
||||
void GPURegionAllocator::DeallocateRaw(void* ptr) {
|
||||
retry_helper_.DeallocateRaw([this](void* p) { DeallocateRawInternal(p); },
|
||||
ptr);
|
||||
}
|
||||
|
||||
void GPURegionAllocator::DeallocateRawInternal(void* ptr) {
|
||||
VLOG(2) << "DeallocateRaw: " << ptr;
|
||||
if (ptr == nullptr) {
|
||||
LOG(ERROR) << "tried to deallocate nullptr";
|
||||
return;
|
||||
}
|
||||
|
||||
mutex_lock l(lock_);
|
||||
ChunkMap::const_iterator iter = chunk_map_.find(ptr);
|
||||
CHECK(iter != chunk_map_.end());
|
||||
|
||||
Chunk* c = iter->second;
|
||||
VLOG(2) << "chunk of size " << c->size << " at " << c;
|
||||
|
||||
Pool* pool = &(pools_[c->size]);
|
||||
// Move chunk to head of queue, and mark free.
|
||||
DCHECK(c->in_use);
|
||||
c->in_use = false;
|
||||
if (c->prev) c->prev->next = c->next;
|
||||
if (c->next) c->next->prev = c->prev;
|
||||
if (pool->first == c) pool->first = c->next;
|
||||
if (pool->last == c) pool->last = c->prev;
|
||||
c->next = pool->first;
|
||||
c->prev = nullptr;
|
||||
if (c->next) c->next->prev = c;
|
||||
pool->first = c;
|
||||
if (pool->last == nullptr) pool->last = c;
|
||||
pool->num_free++;
|
||||
pool->cumulative_freed++;
|
||||
}
|
||||
|
||||
bool GPURegionAllocator::ExpandPool(Pool* pool, size_t chunk_size,
|
||||
size_t requested_size,
|
||||
bool dump_log_on_failure) {
|
||||
VLOG(1) << "ExpandPool of " << chunk_size << " from " << pool->num_chunks
|
||||
<< " current members";
|
||||
DCHECK_NE(0, chunk_size);
|
||||
// If chunk_size is < 4096, double the pool size. Otherwise
|
||||
// just increase by one.
|
||||
int num_chunks = pool->num_chunks;
|
||||
if (num_chunks == 0) {
|
||||
if (chunk_size > 4096) {
|
||||
num_chunks = 1;
|
||||
} else {
|
||||
num_chunks = 4096 / chunk_size;
|
||||
}
|
||||
}
|
||||
// For larger chunks, limit the amount of expansion.
|
||||
size_t aggregate_size = num_chunks * chunk_size;
|
||||
if (aggregate_size > (1 << 20)) {
|
||||
num_chunks = static_cast<int>(
|
||||
std::max(static_cast<size_t>(1), (1 << 20) / chunk_size));
|
||||
}
|
||||
while (num_chunks > 0) {
|
||||
Region* r = (regions_.empty() ? nullptr : regions_.back());
|
||||
if (r == nullptr ||
|
||||
(((r->ptr + r->size) - r->next) < static_cast<int64>(chunk_size))) {
|
||||
// Current region is not large enough to accommodate another chunk.
|
||||
while (r == nullptr || (((r->ptr + r->size) - r->next) <
|
||||
static_cast<int64>(chunk_size))) {
|
||||
// Get another region.
|
||||
size_t this_region_size = std::max(region_size_, chunk_size);
|
||||
|
||||
// Check if we would exceed our limit.
|
||||
if (allocated_memory_ + this_region_size > total_bytes_) {
|
||||
if (dump_log_on_failure) DumpMemoryLog();
|
||||
return false;
|
||||
}
|
||||
|
||||
// Perform the allocation, still checking that the allocator
|
||||
// has not run out of memory.
|
||||
gpu::DeviceMemory<char> gpu_mem =
|
||||
stream_exec_->AllocateArray<char>(this_region_size);
|
||||
if (gpu_mem == nullptr) {
|
||||
if (dump_log_on_failure) DumpMemoryLog();
|
||||
return false;
|
||||
}
|
||||
|
||||
// We never release memory once expanded.
|
||||
allocated_memory_ += this_region_size;
|
||||
|
||||
Region* nr = new Region;
|
||||
nr->ptr = static_cast<char*>(gpu_mem.opaque());
|
||||
|
||||
if (VLOG_IS_ON(2)) {
|
||||
int64 free_bytes;
|
||||
int64 total_bytes;
|
||||
if (stream_exec_->DeviceMemoryUsage(&free_bytes, &total_bytes)) {
|
||||
VLOG(2) << "free " << free_bytes << " total " << total_bytes;
|
||||
} else {
|
||||
// Note: stream_exec call also logs internally on failure.
|
||||
VLOG(2) << "could not retrieve memory usage";
|
||||
}
|
||||
}
|
||||
VLOG(1) << "new Region of size " << this_region_size << " at "
|
||||
<< static_cast<void*>(nr->ptr) << " on device " << device_id_;
|
||||
r = nr;
|
||||
r->size = this_region_size;
|
||||
r->next = r->ptr;
|
||||
regions_.push_back(r);
|
||||
|
||||
for (auto visitor : region_visitors_) {
|
||||
visitor(r->ptr, r->size);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Allocate a new chunk and push on front of Pool.
|
||||
Chunk* c = new Chunk;
|
||||
c->ptr = r->next;
|
||||
chunk_map_[c->ptr] = c;
|
||||
c->size = chunk_size;
|
||||
r->next += chunk_size;
|
||||
c->next = pool->first;
|
||||
if (c->next != nullptr) c->next->prev = c;
|
||||
pool->first = c;
|
||||
if (pool->last == nullptr) pool->last = c;
|
||||
pool->num_chunks++;
|
||||
pool->num_free++;
|
||||
--num_chunks;
|
||||
}
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
void GPURegionAllocator::CheckForMemoryLeaks() {
|
||||
std::vector<string> errors;
|
||||
mutex_lock l(lock_); // could use reader lock
|
||||
for (auto pool_map : pools_) {
|
||||
const Pool& p = pool_map.second;
|
||||
Chunk* curr_chunk = p.first;
|
||||
while (curr_chunk != nullptr) {
|
||||
if (curr_chunk->in_use) {
|
||||
errors.push_back(
|
||||
strings::StrCat("Unfreed chunk of size ", curr_chunk->size));
|
||||
}
|
||||
curr_chunk = curr_chunk->next;
|
||||
}
|
||||
}
|
||||
if (!errors.empty()) {
|
||||
LOG(FATAL) << "GPU Memory leaks:\n" << str_util::Join(errors, "\n");
|
||||
}
|
||||
}
|
||||
|
||||
// Since there's no merging of chunks once allocated, we want to
|
||||
// maximize their reusability (which argues for fewer, larger sizes),
|
||||
// while minimizing waste (which argues for tight-fitting sizes).
|
||||
//
|
||||
// The smallest unit of allocation is 256 bytes.
|
||||
// NOTE(tucker): akrizhevsky says that nvidia's memory manager always
|
||||
// aligns to 256 bytes, and doing so results in significant speedup.
|
||||
//
|
||||
// Up to 2^16 bytes we only allocate in powers of 2.
|
||||
//
|
||||
// Above that, we pick a max-waste which is the largest power
|
||||
// of 2 <= 1/16 of the requested size, then round up to the nearest
|
||||
// multiple of max_waste.
|
||||
//
|
||||
// static
|
||||
size_t GPURegionAllocator::ChunkSize(size_t bytes) {
|
||||
if (bytes <= 256) {
|
||||
return 256;
|
||||
} else if (bytes <= (1 << 16)) {
|
||||
return 1uLL << Log2Ceiling64(bytes);
|
||||
} else {
|
||||
// 1/16th of requested size
|
||||
size_t max_waste = 1uLL << (Log2Ceiling64(bytes) - 4);
|
||||
return (bytes + max_waste) & (~(max_waste - 1));
|
||||
}
|
||||
}
|
||||
|
||||
void GPURegionAllocator::AddAllocVisitor(Visitor visitor) {
|
||||
VLOG(1) << "AddVisitor";
|
||||
mutex_lock l(lock_);
|
||||
region_visitors_.push_back(visitor);
|
||||
for (auto region : regions_) {
|
||||
visitor(region->ptr, region->size);
|
||||
}
|
||||
}
|
||||
|
||||
void GPURegionAllocator::DumpMemoryLog() {
|
||||
size_t region_bytes = 0;
|
||||
for (auto r : regions_) {
|
||||
region_bytes += r->size;
|
||||
}
|
||||
size_t chunk_bytes = 0;
|
||||
std::vector<size_t> chunk_sizes;
|
||||
for (auto i : pools_) {
|
||||
chunk_sizes.push_back(i.first);
|
||||
}
|
||||
std::sort(chunk_sizes.begin(), chunk_sizes.end());
|
||||
for (auto i : chunk_sizes) {
|
||||
int32 chunks_in_use = 0;
|
||||
const Pool& p = pools_[i];
|
||||
chunk_bytes += i * p.num_chunks;
|
||||
|
||||
if (p.num_chunks > 0) {
|
||||
// Iterate backwards (allocated chunks are last).
|
||||
Chunk* curr_chunk = p.last;
|
||||
while (curr_chunk != nullptr) {
|
||||
if (curr_chunk->in_use) {
|
||||
++chunks_in_use;
|
||||
}
|
||||
curr_chunk = curr_chunk->prev;
|
||||
if (curr_chunk == p.first) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
LOG(INFO) << "Chunk size: " << i << " ("
|
||||
<< strings::HumanReadableNumBytes(i) << ") Pool: " << p.ToString()
|
||||
<< "\nNumber of chunks: " << p.num_chunks
|
||||
<< ", in_use chunks: " << chunks_in_use;
|
||||
}
|
||||
|
||||
LOG(INFO) << "Aggregate Region Memory: " << region_bytes << " ("
|
||||
<< strings::HumanReadableNumBytes(region_bytes) << ")";
|
||||
LOG(INFO) << "Aggregate Chunk Memory: " << chunk_bytes << " ("
|
||||
<< strings::HumanReadableNumBytes(chunk_bytes) << ")";
|
||||
}
|
||||
|
||||
bool GPURegionAllocator::TracksAllocationSizes() { return true; }
|
||||
|
||||
size_t GPURegionAllocator::RequestedSize(void* ptr) {
|
||||
mutex_lock l(lock_);
|
||||
auto it = chunk_map_.find(ptr);
|
||||
CHECK(it != chunk_map_.end())
|
||||
<< "Asked for requested size of pointer we never allocated: " << ptr;
|
||||
auto c = it->second;
|
||||
return c->bytes_allocated;
|
||||
}
|
||||
|
||||
size_t GPURegionAllocator::AllocatedSize(void* ptr) {
|
||||
mutex_lock l(lock_);
|
||||
auto it = chunk_map_.find(ptr);
|
||||
CHECK(it != chunk_map_.end())
|
||||
<< "Asked for allocated size of pointer we never allocated: " << ptr;
|
||||
auto c = it->second;
|
||||
return c->size;
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
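A few hand-worked examples of the ChunkSize() rounding rules described above (values computed from the code as written; they are illustrative, not test vectors from the commit):

// ChunkSize(100)    -> 256     (256-byte minimum)
// ChunkSize(5000)   -> 8192    (<= 2^16, so round up to the next power of 2)
// ChunkSize(100000) -> 106496  (max_waste = 1 << (17 - 4) = 8192;
//                               round 100000 up to a multiple of 8192 = 13 * 8192)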
146
tensorflow/core/common_runtime/gpu/gpu_region_allocator.h
Normal file
@ -0,0 +1,146 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_REGION_ALLOCATOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_REGION_ALLOCATOR_H_
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_allocator_retry.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/visitable_allocator.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class GPURegionAllocator : public VisitableAllocator {
|
||||
public:
|
||||
// 'device_id' must be a valid device on the machine.
|
||||
//
|
||||
// total_bytes is how many bytes this allocator should allocate up
|
||||
// to. This may be less than the total available.
|
||||
explicit GPURegionAllocator(int device_id, size_t total_bytes);
|
||||
~GPURegionAllocator() override;
|
||||
|
||||
string Name() override { return "gpu_region"; }
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
|
||||
void DeallocateRaw(void* ptr) override;
|
||||
void AddAllocVisitor(Visitor visitor) override;
|
||||
// Does nothing, because regions are never freed.
|
||||
void AddFreeVisitor(Visitor visitor) override {}
|
||||
|
||||
bool TracksAllocationSizes() override;
|
||||
size_t RequestedSize(void* ptr) override;
|
||||
size_t AllocatedSize(void* ptr) override;
|
||||
|
||||
private:
|
||||
// A Chunk is the header on a single piece of memory given back
|
||||
// in response to an AllocateRaw() call.
|
||||
struct Chunk {
|
||||
char* ptr; // pointer to granted GPU buffer.
|
||||
size_t size; // Full size of GPU buffer.
|
||||
size_t bytes_allocated; // Bytes asked for by client.
|
||||
bool in_use;
|
||||
Chunk* prev; // Used for chaining in pool.
|
||||
Chunk* next;
|
||||
Chunk()
|
||||
: ptr(nullptr),
|
||||
size(0),
|
||||
bytes_allocated(0),
|
||||
in_use(false),
|
||||
prev(nullptr),
|
||||
next(nullptr) {}
|
||||
};
|
||||
|
||||
// A Pool is a collection of same-sized Chunks.
|
||||
struct Pool {
|
||||
int num_chunks; // total chunks in this pool
|
||||
int num_free; // total free chunks in this pool
|
||||
int64 cumulative_malloced; // number of chunks malloced so far
|
||||
int64 cumulative_freed; // number of chunks freed so far
|
||||
|
||||
// double-linked ring of chunks; all free chunks precede all
|
||||
// granted chunks
|
||||
Chunk* first;
|
||||
Chunk* last;
|
||||
Pool()
|
||||
: num_chunks(0),
|
||||
num_free(0),
|
||||
cumulative_malloced(0),
|
||||
cumulative_freed(0),
|
||||
first(nullptr),
|
||||
last(nullptr) {}
|
||||
|
||||
string ToString() const {
|
||||
return strings::StrCat("chunks: ", num_chunks, " free: ", num_free,
|
||||
" cumulative malloc: ", cumulative_malloced,
|
||||
" cumulative freed: ", cumulative_freed);
|
||||
}
|
||||
};
|
||||
|
||||
// A Region is a single area of GPU memory that has been
|
||||
// reserved by this class and carved up into Chunks.
|
||||
struct Region {
|
||||
char* ptr; // base GPU ptr
|
||||
char* next; // frontier of unused part of region
|
||||
size_t size;
|
||||
Region() : ptr(nullptr), size(0) {}
|
||||
};
|
||||
|
||||
// Calculate size of chunk for an allocation of this size.
|
||||
// Min chunk size is 256, for alignment.
|
||||
// For larger sizes, we round up somewhat so there are fewer
|
||||
// size-specific pools.
|
||||
static size_t ChunkSize(size_t bytes);
|
||||
|
||||
void* AllocateRawInternal(size_t alignment, size_t num_bytes,
|
||||
bool dump_log_on_failure);
|
||||
void DeallocateRawInternal(void* ptr);
|
||||
|
||||
bool ExpandPool(Pool* p, size_t chunk_size, size_t requested_size,
|
||||
bool dump_log_on_failure) EXCLUSIVE_LOCKS_REQUIRED(lock_);
|
||||
|
||||
// Inspects region maps and crashes with debug information if there
|
||||
// are any memory leaks as detected by the region allocator.
|
||||
void CheckForMemoryLeaks() LOCKS_EXCLUDED(lock_);
|
||||
|
||||
void DumpMemoryLog() EXCLUSIVE_LOCKS_REQUIRED(lock_);
|
||||
|
||||
perftools::gputools::StreamExecutor* stream_exec_; // Not owned.
|
||||
|
||||
typedef std::unordered_map<size_t, Pool> PoolMap;
|
||||
typedef std::unordered_map<void*, Chunk*> ChunkMap;
|
||||
|
||||
GPUAllocatorRetry retry_helper_;
|
||||
mutable mutex lock_;
|
||||
PoolMap pools_ GUARDED_BY(lock_);
|
||||
|
||||
// Owns regions.
|
||||
std::vector<Region*> regions_ GUARDED_BY(lock_);
|
||||
|
||||
// Maps from GPU ptr to Chunk owning it.
|
||||
//
|
||||
// Owns chunks.
|
||||
ChunkMap chunk_map_ GUARDED_BY(lock_);
|
||||
|
||||
// Called once on each region, ASAP.
|
||||
std::vector<Visitor> region_visitors_ GUARDED_BY(lock_);
|
||||
|
||||
const int device_id_;
|
||||
|
||||
// Total amount of memory (in bytes) available to this Allocator
|
||||
const size_t total_bytes_;
|
||||
|
||||
// Total amount of memory allocated to regions.
|
||||
size_t allocated_memory_ = 0;
|
||||
|
||||
size_t region_size_ = 0;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(GPURegionAllocator);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_REGION_ALLOCATOR_H_
|
71
tensorflow/core/common_runtime/gpu/gpu_region_allocator_test.cc
Normal file
@ -0,0 +1,71 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_region_allocator.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/lib/gtl/inlined_vector.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
TEST(GPURegionAllocatorTest, Simple) {
|
||||
GPURegionAllocator a(0, 1 << 26);
|
||||
std::vector<void*> ptrs;
|
||||
for (int s = 1; s < 1024; s++) {
|
||||
void* raw = a.AllocateRaw(1, s);
|
||||
ptrs.push_back(raw);
|
||||
}
|
||||
std::sort(ptrs.begin(), ptrs.end());
|
||||
for (int i = 0; i < ptrs.size(); i++) {
|
||||
if (i > 0) {
|
||||
CHECK_NE(ptrs[i], ptrs[i - 1]); // No dups
|
||||
}
|
||||
a.DeallocateRaw(ptrs[i]);
|
||||
}
|
||||
float* t1 = a.Allocate<float>(1024);
|
||||
double* t2 = a.Allocate<double>(1048576);
|
||||
a.Deallocate(t1);
|
||||
a.Deallocate(t2);
|
||||
}
|
||||
|
||||
TEST(GPURegionAllocatorTest, CheckMemLeak) {
|
||||
EXPECT_DEATH(
|
||||
{
|
||||
GPURegionAllocator a(0, 1 << 26);
|
||||
float* t1 = a.Allocate<float>(1024);
|
||||
if (t1) {
|
||||
LOG(INFO) << "Not deallocating";
|
||||
}
|
||||
},
|
||||
"");
|
||||
}
|
||||
|
||||
TEST(GPURegionAllocatorTest, TracksSizes) {
|
||||
GPURegionAllocator a(0, 1 << 26);
|
||||
EXPECT_EQ(true, a.TracksAllocationSizes());
|
||||
}
|
||||
|
||||
TEST(GPURegionAllocatorTest, AllocatedVsRequested) {
|
||||
GPURegionAllocator a(0, 1 << 26);
|
||||
float* t1 = a.Allocate<float>(1);
|
||||
EXPECT_EQ(sizeof(float), a.RequestedSize(t1));
|
||||
|
||||
// Minimum allocation size is 256
|
||||
EXPECT_EQ(256, a.AllocatedSize(t1));
|
||||
|
||||
a.Deallocate(t1);
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
97
tensorflow/core/common_runtime/gpu/gpu_stream_util.cc
Normal file
@ -0,0 +1,97 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_stream_util.h"
|
||||
|
||||
#include <set>
|
||||
#include <string>
|
||||
#include <unordered_set>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/graph/algorithm.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace gpu_stream_util {
|
||||
|
||||
Status AssignStreams(const Graph* graph, const AssignStreamsOpts& opts,
|
||||
std::unordered_map<int, int>* node_to_stream_id) {
|
||||
VLOG(1) << "AssignStreams";
|
||||
Status status;
|
||||
|
||||
// Sanity check arguments.
|
||||
if (graph == nullptr)
|
||||
status.Update(errors::InvalidArgument("Bad graph argument supplied."));
|
||||
if (node_to_stream_id == nullptr) {
|
||||
status.Update(
|
||||
errors::InvalidArgument("Bad node_to_stream_id argument supplied."));
|
||||
}
|
||||
if ((opts.max_streams < 1) || (opts.send_stream >= opts.max_streams) ||
|
||||
(opts.recv_stream >= opts.max_streams) ||
|
||||
(opts.const_stream >= opts.max_streams) ||
|
||||
(opts.compute_stream >= opts.max_streams)) {
|
||||
status.Update(errors::InvalidArgument("Bad stream assignment options supplied."));
|
||||
}
|
||||
TF_RETURN_IF_ERROR(status);
|
||||
|
||||
// Topologically sort the nodes.
|
||||
std::vector<Node*> order;
|
||||
GetReversePostOrder(*graph, &order);
|
||||
if (VLOG_IS_ON(2)) {
|
||||
for (Node* n : order) {
|
||||
const int node_id = n->id();
|
||||
VLOG(2) << "Node " << node_id << " " << n->type_string() << " "
|
||||
<< n->name() << " " << n->in_edges().size() << " inputs";
|
||||
for (const Edge* e : n->in_edges()) {
|
||||
VLOG(2) << " Edge from " << e->src()->id() << " " << e->src()->name()
|
||||
<< " fanout " << e->src()->out_edges().size();
|
||||
}
|
||||
}
|
||||
}
|
||||
// We perform stream assignment assuming a large number of
|
||||
// stream IDs and then map these down to the required number of streams
|
||||
// using simple round-robin.
|
||||
// Stream Assignment strategy:
|
||||
// 1. Nodes with zero inputs are always executed on a
|
||||
// fresh stream.
|
||||
// 2. Try to execute a node on the same stream as one of its
|
||||
// inputs to avoid inter-stream dependencies.
|
||||
// 3. If any input comes from a node with a large fanout, that is
|
||||
// perhaps an indication that it is shared between parallel
|
||||
// streams of work. We choose a new stream here so that all consumers
|
||||
// of the tensor are likely to run in parallel.
|
||||
int highest_stream_id = -1;
|
||||
for (Node* n : order) {
|
||||
VLOG(3) << "Inspecting node " << n->DebugString();
|
||||
const int node_id = n->id();
|
||||
const string& op = n->type_string();
|
||||
|
||||
// Determine a suitable stream to use.
|
||||
int stream_id = highest_stream_id + 1;
|
||||
for (const Edge* e : n->in_edges()) {
|
||||
const int fanout = e->src()->out_edges().size();
|
||||
if (fanout == 1) {
|
||||
stream_id = (*node_to_stream_id)[e->src()->id()];
|
||||
break;
|
||||
}
|
||||
}
|
||||
// Override stream for specific op types.
|
||||
if (op == "_Send") {
|
||||
if (opts.send_stream >= 0) stream_id = opts.send_stream;
|
||||
} else if (op == "_Recv") {
|
||||
if (opts.recv_stream >= 0) stream_id = opts.recv_stream;
|
||||
} else if (op == "Const") {
|
||||
if (opts.const_stream >= 0) stream_id = opts.const_stream;
|
||||
} else {
|
||||
if (opts.compute_stream >= 0) stream_id = opts.compute_stream;
|
||||
}
|
||||
|
||||
(*node_to_stream_id)[node_id] = stream_id % opts.max_streams;
|
||||
highest_stream_id = std::max(stream_id, highest_stream_id);
|
||||
}
|
||||
VLOG(1) << "Identified " << highest_stream_id << " candidate streams for "
|
||||
<< order.size() << " nodes.";
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
} // namespace gpu_stream_util
|
||||
} // namespace tensorflow
|
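To make the final round-robin step concrete, here is how the candidate stream ids chosen above would map down with opts.max_streams = 3 (an illustrative value):

// candidate stream id : 0 1 2 3 4 5 6
// assigned (id % 3)   : 0 1 2 0 1 2 0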
30
tensorflow/core/common_runtime/gpu/gpu_stream_util.h
Normal file
@ -0,0 +1,30 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_STREAM_UTIL_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_STREAM_UTIL_H_
|
||||
|
||||
#include <unordered_map>
|
||||
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace gpu_stream_util {
|
||||
|
||||
struct AssignStreamsOpts {
|
||||
int32 max_streams = 1;
|
||||
// The following options specify a stream to use for specific op
|
||||
// types. The value -1 allows ops to be assigned to any stream.
|
||||
int32 send_stream = -1;
|
||||
int32 recv_stream = -1;
|
||||
int32 const_stream = -1;
|
||||
int32 compute_stream = -1;
|
||||
};
|
||||
|
||||
// Given the input graph, assigns every node in the graph with a
|
||||
// stream_id that should be used.
|
||||
Status AssignStreams(const Graph* graph, const AssignStreamsOpts& opts,
|
||||
std::unordered_map<int, int>* node_to_stream_id);
|
||||
|
||||
} // namespace gpu_stream_util
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_STREAM_UTIL_H_
|
137
tensorflow/core/common_runtime/gpu/gpu_stream_util_test.cc
Normal file
@ -0,0 +1,137 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_stream_util.h"
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
#include "tensorflow/cc/ops/array_ops.h"
|
||||
#include "tensorflow/cc/ops/sendrecv_ops.h"
|
||||
#include "tensorflow/cc/ops/standard_ops.h"
|
||||
#include "tensorflow/core/framework/op.h"
|
||||
#include "tensorflow/core/framework/types.pb.h"
|
||||
#include "tensorflow/core/graph/graph_def_builder.h"
|
||||
#include "tensorflow/core/graph/node_builder.h"
|
||||
#include "tensorflow/core/kernels/ops_testutil.h"
|
||||
#include "tensorflow/core/kernels/ops_util.h"
|
||||
#include "tensorflow/core/lib/core/status_test_util.h"
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
class GpuStreamUtilTest : public OpsTestBase {
|
||||
protected:
|
||||
void SetUp() override { RequireDefaultOps(); }
|
||||
};
|
||||
|
||||
TEST_F(GpuStreamUtilTest, BogusOpts) {
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Graph g(OpRegistry::Global());
|
||||
ASSERT_OK(b.ToGraph(&g));
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
Status status;
|
||||
status = gpu_stream_util::AssignStreams(nullptr, opts, &node_to_stream_id);
|
||||
EXPECT_FALSE(status.ok());
|
||||
status = gpu_stream_util::AssignStreams(&g, opts, nullptr);
|
||||
EXPECT_FALSE(status.ok());
|
||||
opts.max_streams = 0;
|
||||
status = gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id);
|
||||
EXPECT_FALSE(status.ok());
|
||||
opts.max_streams = 1;
|
||||
opts.compute_stream = 5;
|
||||
status = gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id);
|
||||
EXPECT_FALSE(status.ok());
|
||||
}
|
||||
|
||||
TEST_F(GpuStreamUtilTest, EmptyGraph) {
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Graph g(OpRegistry::Global());
|
||||
ASSERT_OK(b.ToGraph(&g));
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
ASSERT_OK(gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id));
|
||||
EXPECT_EQ(2, node_to_stream_id.size()); // _SOURCE and _SINK
|
||||
}
|
||||
|
||||
TEST_F(GpuStreamUtilTest, SimpleGraphOneStream) {
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::MatMul(ops::Const(Tensor(DT_FLOAT), b.opts()),
|
||||
ops::Const(Tensor(DT_FLOAT), b.opts()), b.opts());
|
||||
Graph g(OpRegistry::Global());
|
||||
ASSERT_OK(b.ToGraph(&g));
|
||||
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
ASSERT_OK(gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id));
|
||||
|
||||
// There should be 5 nodes assigned.
|
||||
EXPECT_EQ(5, node_to_stream_id.size());
|
||||
|
||||
// All of them should have stream 0.
|
||||
for (const auto& it : node_to_stream_id) {
|
||||
EXPECT_EQ(0, it.second);
|
||||
}
|
||||
}
|
||||
|
||||
TEST_F(GpuStreamUtilTest, SimpleGraphManyStreams) {
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::MatMul(ops::Const(Tensor(DT_FLOAT), b.opts()),
|
||||
ops::Const(Tensor(DT_FLOAT), b.opts()), b.opts());
|
||||
Graph g(OpRegistry::Global());
|
||||
ASSERT_OK(b.ToGraph(&g));
|
||||
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
opts.max_streams = 3;
|
||||
ASSERT_OK(gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id));
|
||||
|
||||
// There should be 5 nodes assigned.
|
||||
EXPECT_EQ(5, node_to_stream_id.size());
|
||||
|
||||
// All of them should have a stream in the range [0..max_streams).
|
||||
for (const auto& it : node_to_stream_id) {
|
||||
EXPECT_GE(it.second, 0);
|
||||
EXPECT_LT(it.second, opts.max_streams);
|
||||
}
|
||||
}
|
||||
|
||||
TEST_F(GpuStreamUtilTest, StreamOverrides) {
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::_Recv(DT_FLOAT, "input", "/cpu:0", 0, "/gpu:0",
|
||||
b.opts().WithName("input"));
|
||||
auto n = ops::MatMul(ops::Const(Tensor(DT_FLOAT), b.opts()),
|
||||
ops::Const(Tensor(DT_FLOAT), b.opts()), b.opts());
|
||||
ops::_Send(n, "output", "/gpu:0", 0, "/cpu:0", b.opts().WithName("output"));
|
||||
Graph g(OpRegistry::Global());
|
||||
ASSERT_OK(b.ToGraph(&g));
|
||||
|
||||
// Perform stream assignment using a large number of streams, but with
|
||||
// op types constrained to specific streams.
|
||||
std::unordered_map<int, int> node_to_stream_id;
|
||||
gpu_stream_util::AssignStreamsOpts opts;
|
||||
opts.max_streams = 100;
|
||||
opts.const_stream = 90;
|
||||
opts.send_stream = 91;
|
||||
opts.recv_stream = 92;
|
||||
opts.compute_stream = 93;
|
||||
ASSERT_OK(gpu_stream_util::AssignStreams(&g, opts, &node_to_stream_id));
|
||||
|
||||
// There should be 7 nodes assigned.
|
||||
EXPECT_EQ(7, node_to_stream_id.size()); // including _SOURCE and _SINK
|
||||
|
||||
// Nodes should be assigned to streams by op type.
|
||||
for (const auto& it : node_to_stream_id) {
|
||||
Node* n = g.FindNodeId(it.first);
|
||||
const string op = n->type_string();
|
||||
const int stream = it.second;
|
||||
if (op == "Const") {
|
||||
EXPECT_EQ(stream, 90);
|
||||
} else if (op == "_Send") {
|
||||
EXPECT_EQ(stream, 91);
|
||||
} else if (op == "_Recv") {
|
||||
EXPECT_EQ(stream, 92);
|
||||
} else { // Compute.
|
||||
EXPECT_EQ(stream, 93);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
345
tensorflow/core/common_runtime/gpu/gpu_util.cc
Normal file
@ -0,0 +1,345 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
|
||||
//#include "base/commandlineflags.h"
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/gpu_device_context.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/refcount.h"
|
||||
#include "tensorflow/core/lib/gtl/array_slice.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/hash/hash.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/lib/strings/stringprintf.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/tensor_coding.h"
|
||||
#include "tensorflow/core/platform/tracing.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/dma_helper.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
#include "tensorflow/core/util/util.h"
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
|
||||
#include "tensorflow/core/platform/stream_executor_util.h"
|
||||
|
||||
#if defined(PLATFORM_GOOGLE)
|
||||
DEFINE_int64(brain_gpu_util_debug_string_maxlen, 128,
|
||||
"When dumping gpu memory, prints up to this many bytes.");
|
||||
|
||||
DECLARE_bool(record_mem_types);
|
||||
#else
|
||||
tensorflow::int64 FLAGS_brain_gpu_util_debug_string_maxlen = 128;
|
||||
bool FLAGS_EXPERIMENTAL_brain_gpu_multi_stream = false;
|
||||
extern bool FLAGS_record_mem_types;
|
||||
#endif
|
||||
|
||||
using perftools::gputools::DeviceMemoryBase;
|
||||
using perftools::gputools::DeviceMemory;
|
||||
using perftools::gputools::Stream;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
/*static*/
|
||||
void GPUUtil::SetProtoFromGPU(const Tensor& tensor, Device* dev,
|
||||
const DeviceContext* device_context,
|
||||
TensorProto* proto, bool is_dead,
|
||||
StatusCallback done) {
|
||||
VLOG(1) << "SetProtoFromGPU device_context " << device_context;
|
||||
// Tensor values need to be copied from GPU to CPU ram so that
|
||||
// we can build the protobuf response for a RecvTensor RPC.
|
||||
// "device context" identifies the stream where the _Send op executed.
|
||||
CHECK(device_context);
|
||||
gpu::Stream* stream =
|
||||
static_cast<const GPUDeviceContext*>(device_context)->stream();
|
||||
|
||||
if (!DMAHelper::CanUseDMA(&tensor)) {
|
||||
done(errors::Internal(strings::StrCat(
|
||||
"GPU copy from non-DMA ", DataTypeString(tensor.dtype()), "tensor")));
|
||||
return;
|
||||
}
|
||||
proto->set_dtype(tensor.dtype());
|
||||
tensor.shape().AsProto(proto->mutable_tensor_shape());
|
||||
// Prepare a Cord with the right data buf size, and DMA the
|
||||
// data over from the GPU buffer. Note that 0-size tensors
|
||||
// do not have a backing buffer.
|
||||
const size_t num_bytes = is_dead ? 0 : tensor.TotalBytes();
|
||||
if (num_bytes > 0) {
|
||||
port::Tracing::ScopedAnnotation annotation("SetProtoFromGPU");
|
||||
Allocator* alloc = ProcessState::singleton()->GetCUDAHostAllocator(0);
|
||||
char* mb = alloc->Allocate<char>(num_bytes);
|
||||
const char* src_ptr =
|
||||
reinterpret_cast<const char*>(DMAHelper::base(&tensor));
|
||||
DeviceMemoryBase gpu_src_ptr(const_cast<char*>(src_ptr), num_bytes);
|
||||
stream->ThenMemcpy(mb, gpu_src_ptr, num_bytes);
|
||||
// Use of tensor may outlive stack scope, so keep a ref.
|
||||
Tensor* tensor_ref = new Tensor(tensor);
|
||||
dev->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
|
||||
stream, [stream, done, proto, mb, num_bytes, alloc, tensor_ref]() {
|
||||
if (!stream->ok()) {
|
||||
done(errors::Internal("SetProtoFromGPU: GPU Memcpy failed"));
|
||||
// TODO(pbar) We currently have no way to recover the
|
||||
// worker from a GPU stream in the error state. Until
|
||||
// there is a way to reset the CUDA driver, it is
|
||||
// preferable to crash the process and restart. Tracked
|
||||
// under b/23717097
|
||||
LOG(FATAL) << "SetProtoFromGPU: GPU Memcpy failed";
|
||||
return;
|
||||
}
|
||||
delete tensor_ref;
|
||||
port::CopyFromArray(proto->mutable_tensor_content(), mb, num_bytes);
|
||||
alloc->Deallocate<char>(mb);
|
||||
done(Status::OK());
|
||||
});
|
||||
} else {
|
||||
done(Status::OK());
|
||||
}
|
||||
}
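A minimal caller sketch (not part of this file) of the asynchronous contract above. The names t, dev, and ctx are placeholders for a GPU-resident tensor, its hosting device, and the _Send op's device context; the wait-on-Notification idiom mirrors the Checksum helper later in this file.
TensorProto proto;
Notification note;
Status copy_status;
GPUUtil::SetProtoFromGPU(t, dev, ctx, &proto, false /*is_dead*/,
                         [&copy_status, &note](Status s) {
                           copy_status.Update(s);
                           note.Notify();
                         });
note.WaitForNotification();  // block until the DMA and proto fill-in finish
CHECK(copy_status.ok()) << copy_status;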
|
||||
|
||||
typedef ProcessState::MemDesc PMD;
|
||||
|
||||
/*static*/
|
||||
void GPUUtil::CopyViaDMA(const string& edge_name,
|
||||
DeviceContext* send_dev_context,
|
||||
DeviceContext* recv_dev_context, Device* src,
|
||||
Device* dst, AllocatorAttributes src_alloc_attr,
|
||||
AllocatorAttributes dst_alloc_attr,
|
||||
const Tensor* input, Tensor* output,
|
||||
StatusCallback done) {
|
||||
port::Tracing::ScopedAnnotation annotation(edge_name);
|
||||
VLOG(1) << "CopyViaDMA " << edge_name;
|
||||
size_t total_bytes = input->TotalBytes();
|
||||
// Note that 0-size tensors have no backing buffer.
|
||||
if (total_bytes > 0) {
|
||||
const void* src_ptr = DMAHelper::base(input);
|
||||
void* dst_ptr = DMAHelper::base(output);
|
||||
VLOG(2) << "src_ptr " << src_ptr << " dst_ptr " << dst_ptr;
|
||||
if (FLAGS_record_mem_types) {
|
||||
ProcessState::MemDesc smd = ProcessState::singleton()->PtrType(src_ptr);
|
||||
ProcessState::MemDesc dmd = ProcessState::singleton()->PtrType(dst_ptr);
|
||||
VLOG(0) << "Src " << smd.DebugString() << " Dst " << dmd.DebugString();
|
||||
if (smd.loc == PMD::CPU && dmd.loc == PMD::GPU && (!smd.gpu_registered)) {
|
||||
LOG(WARNING) << "CPU -> GPU no reg for " << edge_name;
|
||||
}
|
||||
if (dmd.loc == PMD::CPU && smd.loc == PMD::GPU && (!dmd.gpu_registered)) {
|
||||
LOG(WARNING) << "GPU -> CPU no reg for " << edge_name;
|
||||
}
|
||||
}
|
||||
|
||||
auto src_device_type = src->attributes().device_type();
|
||||
auto dst_device_type = dst->attributes().device_type();
|
||||
|
||||
bool non_cpu_src = (!src_alloc_attr.on_host() &&
|
||||
src_device_type != DeviceType(DEVICE_CPU).type());
|
||||
bool non_cpu_dst = (!dst_alloc_attr.on_host() &&
|
||||
dst_device_type != DeviceType(DEVICE_CPU).type());
|
||||
if (non_cpu_src) {
|
||||
gpu::Stream* stream = send_dev_context->stream();
|
||||
if (stream == nullptr) {
|
||||
done(errors::Internal("Failed to find device stream"));
|
||||
return;
|
||||
}
|
||||
auto* src_dev_info = src->tensorflow_gpu_device_info();
|
||||
CHECK(src_dev_info);
|
||||
|
||||
if (non_cpu_dst) {
|
||||
// Device to device copy
|
||||
DeviceMemoryBase gpu_dst_ptr(dst_ptr, total_bytes);
|
||||
stream->ThenMemcpy(
|
||||
&gpu_dst_ptr,
|
||||
DeviceMemoryBase{const_cast<void*>(src_ptr), total_bytes},
|
||||
total_bytes);
|
||||
if (dst_device_type == DeviceType(DEVICE_GPU).type()) {
|
||||
// Use of input may outlive stack scope, so keep a ref.
|
||||
Tensor* input_ref = new Tensor(*input);
|
||||
src_dev_info->event_mgr->ThenExecute(
|
||||
stream, [done, stream, input_ref]() {
|
||||
delete input_ref;
|
||||
if (!stream->ok()) {
|
||||
done(errors::Internal("GPU->GPU Memcpy failed"));
|
||||
} else {
|
||||
done(Status::OK());
|
||||
}
|
||||
});
|
||||
}
|
||||
send_dev_context->MaintainLifetimeOnStream(input, stream);
|
||||
} else {
|
||||
// Device to host copy.
|
||||
return send_dev_context->CopyDeviceTensorToCPU(input, edge_name, src,
|
||||
output, done);
|
||||
}
|
||||
} else if (non_cpu_dst) {
|
||||
// Host to Device copy.
|
||||
// Note that this is already an async copy.
|
||||
recv_dev_context->CopyCPUTensorToDevice(input, dst, output, done);
|
||||
} else {
|
||||
memcpy(dst_ptr, src_ptr, total_bytes);
|
||||
done(Status::OK());
|
||||
}
|
||||
} else {
|
||||
// buffer is empty
|
||||
done(Status::OK());
|
||||
}
|
||||
}
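To summarize the dispatch above (a reading aid, not code from this file), the path chosen depends only on where the source and destination buffers live:
  src GPU, dst GPU : stream->ThenMemcpy on the send stream; completion signalled via the EventMgr
  src GPU, dst CPU : send_dev_context->CopyDeviceTensorToCPU(...)
  src CPU, dst GPU : recv_dev_context->CopyCPUTensorToDevice(...) (already asynchronous)
  src CPU, dst CPU : plain memcpy, then done(Status::OK())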
|
||||
|
||||
void GPUUtil::CopyGPUTensorToCPU(Device* gpu_device,
|
||||
const DeviceContext* device_context,
|
||||
const Tensor* gpu_tensor, Tensor* cpu_tensor,
|
||||
StatusCallback done) {
|
||||
VLOG(1) << "CopyGPUTensorToCPU";
|
||||
size_t total_bytes = gpu_tensor->TotalBytes();
|
||||
// Note that 0-size tensors have no backing buffer.
|
||||
if (total_bytes > 0) {
|
||||
const void* src_ptr = DMAHelper::base(gpu_tensor);
|
||||
void* dst_ptr = DMAHelper::base(cpu_tensor);
|
||||
CHECK(dst_ptr);
|
||||
auto* stream = gpu_device->tensorflow_gpu_device_info()->stream;
|
||||
if (device_context) {
|
||||
stream = static_cast<const GPUDeviceContext*>(device_context)->stream();
|
||||
}
|
||||
stream->ThenMemcpy(
|
||||
dst_ptr, DeviceMemoryBase{const_cast<void*>(src_ptr), total_bytes},
|
||||
total_bytes);
|
||||
stream->BlockHostUntilDone();
|
||||
if (!stream->ok()) {
|
||||
done(errors::Internal("CopyGPUTensorToCPU: GPU->CPU Memcpy failed"));
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
done(Status::OK());
|
||||
}
|
||||
|
||||
/* static */
|
||||
void GPUUtil::CopyCPUTensorToGPU(const Tensor* cpu_tensor,
|
||||
const DeviceContext* device_context,
|
||||
Device* gpu_device, Tensor* gpu_tensor,
|
||||
StatusCallback done) {
|
||||
VLOG(1) << "CopyCPUTensorToGPU";
|
||||
CHECK(DeviceType(gpu_device->attributes().device_type()) ==
|
||||
DeviceType(DEVICE_GPU));
|
||||
|
||||
auto* dev_info = gpu_device->tensorflow_gpu_device_info();
|
||||
if (!dev_info) {
|
||||
done(errors::Internal("Failed to find dest device GPUDeviceInfo"));
|
||||
return;
|
||||
}
|
||||
if (cpu_tensor->TotalBytes() != gpu_tensor->TotalBytes()) {
|
||||
done(errors::Internal(
|
||||
strings::StrCat("Can't copy ", cpu_tensor->TotalBytes(),
|
||||
" bytes of a tensor into another with ",
|
||||
gpu_tensor->TotalBytes(), " bytes buffer.")));
|
||||
return;
|
||||
}
|
||||
const int64 total_bytes = cpu_tensor->TotalBytes();
|
||||
// Note that 0-size tensors have no backing buffer.
|
||||
if (total_bytes > 0) {
|
||||
const void* src_ptr = DMAHelper::base(cpu_tensor);
|
||||
void* dst_ptr = DMAHelper::base(gpu_tensor);
|
||||
DeviceMemoryBase gpu_dst_ptr(dst_ptr, total_bytes);
|
||||
|
||||
CHECK(device_context);
|
||||
auto* stream =
|
||||
static_cast<const GPUDeviceContext*>(device_context)->stream();
|
||||
stream->ThenMemcpy(&gpu_dst_ptr, src_ptr, total_bytes);
|
||||
auto* dev_info = gpu_device->tensorflow_gpu_device_info();
|
||||
// Use of cpu_tensor may outlive stack scope, so keep a ref.
|
||||
Tensor* input_ref = new Tensor(*cpu_tensor);
|
||||
dev_info->event_mgr->ThenExecute(stream, [stream, done, input_ref]() {
|
||||
delete input_ref;
|
||||
if (!stream->ok()) {
|
||||
done(errors::Internal("CopyCPUTensorToGPU: GPU Memcpy failed"));
|
||||
} else {
|
||||
done(Status::OK());
|
||||
}
|
||||
});
|
||||
} else {
|
||||
// empty tensor case
|
||||
done(Status::OK());
|
||||
}
|
||||
}
|
||||
|
||||
Status GPUUtil::Sync(Device* gpu_device) {
|
||||
VLOG(1) << "GPUUtil::Sync";
|
||||
auto* dev_info = gpu_device->tensorflow_gpu_device_info();
|
||||
if (!dev_info) {
|
||||
return errors::Internal("Failed to find dest device GPUDeviceInfo");
|
||||
}
|
||||
dev_info->stream->BlockHostUntilDone();
|
||||
if (!dev_info->stream->ok()) {
|
||||
LOG(FATAL) << "GPU sync failed";
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status GPUUtil::SyncAll(Device* gpu_device) {
|
||||
VLOG(1) << "GPUUtil::SyncAll";
|
||||
auto* dev_info = gpu_device->tensorflow_gpu_device_info();
|
||||
if (!dev_info) {
|
||||
return errors::Internal("Failed to find dest device GPUDeviceInfo");
|
||||
}
|
||||
if (!dev_info->stream->parent()->SynchronizeAllActivity() ||
|
||||
!dev_info->stream->ok()) {
|
||||
LOG(FATAL) << "GPU sync failed";
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
string GPUUtil::MemoryDebugString(const Device* device, Tensor* tensor) {
|
||||
string ret;
|
||||
CHECK(tensor);
|
||||
const int64 num_bytes = std::min<int64>(
|
||||
FLAGS_brain_gpu_util_debug_string_maxlen, tensor->TotalBytes());
|
||||
void* ptr = (num_bytes > 0) ? DMAHelper::base(tensor) : nullptr;
|
||||
strings::Appendf(&ret, "%p:", ptr);
|
||||
if (num_bytes > 0) {
|
||||
auto* dev_info = device->tensorflow_gpu_device_info();
|
||||
if (!dev_info) {
|
||||
strings::StrAppend(
|
||||
&ret, PrintMemory(reinterpret_cast<const char*>(ptr), num_bytes));
|
||||
} else {
|
||||
string buf;
|
||||
buf.resize(num_bytes);
|
||||
DeviceMemoryBase gpu_ptr(ptr, num_bytes);
|
||||
Status s = dev_info->stream->parent()->SynchronousMemcpyD2H(
|
||||
gpu_ptr, num_bytes, gtl::string_as_array(&buf));
|
||||
strings::StrAppend(&ret,
|
||||
PrintMemory(gtl::string_as_array(&buf), num_bytes));
|
||||
}
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
// TODO(pbar) Checksum is called from places without a valid device context.
|
||||
uint64 GPUUtil::Checksum(Device* gpu_device,
|
||||
const DeviceContext* device_context,
|
||||
const Tensor& tensor) {
|
||||
Tensor copy(tensor.dtype(), tensor.shape());
|
||||
Status s;
|
||||
Notification n;
|
||||
CopyGPUTensorToCPU(gpu_device, device_context, &tensor, ©,
|
||||
[&s, &n](Status status) {
|
||||
s.Update(status);
|
||||
n.Notify();
|
||||
});
|
||||
n.WaitForNotification();
|
||||
CHECK(s.ok()) << s;
|
||||
return Checksum(copy);
|
||||
}
|
||||
|
||||
uint64 GPUUtil::Checksum(const Tensor& tensor) {
|
||||
const float* fptr = reinterpret_cast<const float*>(DMAHelper::base(&tensor));
|
||||
size_t num_bytes = tensor.TotalBytes();
|
||||
size_t num_floats = num_bytes / sizeof(float);
|
||||
for (size_t i = 0; i < num_floats; ++i) {
|
||||
CHECK(!std::isnan(fptr[i])) << " i " << i;
|
||||
}
|
||||
// TODO(tucker): consider using crc32c instead.
|
||||
return Hash64(reinterpret_cast<const char*>(DMAHelper::base(&tensor)),
|
||||
tensor.TotalBytes(), 0);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
89
tensorflow/core/common_runtime/gpu/gpu_util.h
Normal file
@ -0,0 +1,89 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_GPU_UTIL_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_GPU_UTIL_H_
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/dma_helper.h"
|
||||
#include "tensorflow/stream_executor/device_memory.h"
|
||||
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class RecvTensorResponse;
|
||||
class TensorProto;
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
class GPUUtil {
|
||||
public:
|
||||
// "tensor" is GPU-local. "dev" is the hosting GPU.
|
||||
// "device_context" should be the context of the GPU "_Send" op
|
||||
// which provides the Tensor.
|
||||
// Sets all necessary fields of "proto" by transferring value
|
||||
// bytes from GPU to CPU RAM. "is_dead" indicates that the
|
||||
// tensor is dead with an uninitialized value.
|
||||
static void SetProtoFromGPU(const Tensor& tensor, Device* dev,
|
||||
const DeviceContext* device_context,
|
||||
TensorProto* proto, bool is_dead,
|
||||
StatusCallback done);
|
||||
|
||||
// Copies "input" to "output" between devices accessible to the
|
||||
// local process via some DMA-like method. "edge_name" is the name
|
||||
// of the tensor being copied, for debugging purposes. Depending on
|
||||
// the type of devices and memory in use, the copy may be performed
|
||||
// synchronously or asynchronously. 'done' will be invoked only
|
||||
// after the copy is actually complete.
|
||||
static void CopyViaDMA(const string& edge_name,
|
||||
DeviceContext* send_dev_context,
|
||||
DeviceContext* recv_dev_context, Device* src,
|
||||
Device* dst, const AllocatorAttributes src_alloc_attr,
|
||||
const AllocatorAttributes dst_alloc_attr,
|
||||
const Tensor* input, Tensor* output,
|
||||
StatusCallback done);
|
||||
|
||||
// Copies the data in 'gpu_tensor' into 'cpu_tensor'.
|
||||
// 'gpu_tensor''s backing memory must be on 'gpu_device' and
|
||||
// 'cpu_tensor' must be allocated to be of the same size as
|
||||
// 'gpu_tensor'. Synchronous: may block.
|
||||
static void CopyGPUTensorToCPU(Device* gpu_device,
|
||||
const DeviceContext* device_context,
|
||||
const Tensor* gpu_tensor, Tensor* cpu_tensor,
|
||||
StatusCallback done);
|
||||
|
||||
// Blocks until all operations queued on the stream associated with
|
||||
// "gpu_device" at the time of the call have completed. Returns any
|
||||
// error pending on the stream at completion.
|
||||
static Status Sync(Device* gpu_device);
|
||||
|
||||
// Blocks until all operations queued on all streams associated with the
|
||||
// corresponding GPU device at the time of call have completed.
|
||||
// Returns any error pending on the stream at completion.
|
||||
static Status SyncAll(Device* gpu_device);
|
||||
|
||||
// For debugging purpose, given a "device" and a "tensor" allocated
|
||||
// on the device, return a string printing each byte in the tensor
|
||||
// (up to a limit). "device" can be either a CPU or a GPU device.
|
||||
static string MemoryDebugString(const Device* device, Tensor* tensor);
|
||||
|
||||
static perftools::gputools::DeviceMemory<float> AsGPUFloat(const Tensor& t);
|
||||
|
||||
// Computes a checksum over the contents of "tensor", which is allocated
|
||||
// on "gpu_device".
|
||||
static uint64 Checksum(Device* gpu_device,
|
||||
const DeviceContext* device_context,
|
||||
const Tensor& tensor);
|
||||
|
||||
// Computes a checksum over the contents of "tensor", which is allocated
|
||||
// in local CPU RAM.
|
||||
static uint64 Checksum(const Tensor& tensor);
|
||||
|
||||
static void CopyCPUTensorToGPU(const Tensor* cpu_tensor,
|
||||
const DeviceContext* device_context,
|
||||
Device* gpu_device, Tensor* gpu_tensor,
|
||||
StatusCallback done);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_GPU_UTIL_H_
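A hedged usage sketch of the debugging helpers declared above; gpu_dev, ctx, and gpu_t are placeholders for a GPU device, its device context, and a tensor resident on that device.
uint64 sum = GPUUtil::Checksum(gpu_dev, ctx, gpu_t);
VLOG(1) << "checksum=" << sum << " "
        << GPUUtil::MemoryDebugString(gpu_dev, &gpu_t);
Status s = GPUUtil::Sync(gpu_dev);  // drain the device's compute stream
CHECK(s.ok()) << s;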
|
@ -0,0 +1,24 @@
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/gpu_device_context.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
#include "tensorflow/stream_executor/stream.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
void GPUDeviceContext::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
|
||||
Device* device,
|
||||
Tensor* device_tensor,
|
||||
StatusCallback done) const {
|
||||
GPUUtil::CopyCPUTensorToGPU(cpu_tensor, this, device, device_tensor, done);
|
||||
}
|
||||
|
||||
void GPUDeviceContext::CopyDeviceTensorToCPU(const Tensor* device_tensor,
|
||||
const string& tensor_name,
|
||||
Device* device, Tensor* cpu_tensor,
|
||||
StatusCallback done) {
|
||||
GPUUtil::CopyGPUTensorToCPU(device, this, device_tensor, cpu_tensor, done);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
269
tensorflow/core/common_runtime/gpu/pool_allocator.cc
Normal file
@ -0,0 +1,269 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/pool_allocator.h"
|
||||
|
||||
#include <errno.h>
|
||||
#include <strings.h>
|
||||
#include <sys/mman.h> // for munmap
|
||||
|
||||
#include <map>
|
||||
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
//#include "prodkernel/api/base/numa.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
PoolAllocator::PoolAllocator(size_t pool_size_limit, bool auto_resize,
|
||||
SubAllocator* allocator,
|
||||
RoundUpInterface* size_rounder, string name)
|
||||
: name_(name),
|
||||
has_size_limit_(pool_size_limit > 0),
|
||||
auto_resize_(auto_resize),
|
||||
pool_size_limit_(pool_size_limit),
|
||||
allocator_(allocator),
|
||||
size_rounder_(size_rounder),
|
||||
allocation_begun_(false) {
|
||||
if (auto_resize) {
|
||||
CHECK_LT(0, pool_size_limit)
|
||||
<< "size limit must be > 0 if auto_resize is true.";
|
||||
}
|
||||
}
|
||||
|
||||
PoolAllocator::~PoolAllocator() { Clear(); }
|
||||
|
||||
namespace {
|
||||
// Pools contain Chunks allocated from the underlying Allocator.
|
||||
// Chunk alignment is always on kPoolAlignment boundaries. Each Chunk
|
||||
// begins with a descriptor (ChunkPrefix) that gives its size and a
|
||||
// pointer to itself. The pointer returned to the user is just past
|
||||
// the ChunkPrefix. If the user asks for a larger alignment, we will
|
||||
// increase the size of the chunk, then adjust the returned user
|
||||
// pointer and also re-write the ChunkPrefix.chunk_ptr value
|
||||
// immediately before it. This way the Chunk address and size can be
|
||||
// recovered from the returned user pointer, regardless of alignment.
|
||||
// Note that this dereferencing of the pointers means that we cannot
|
||||
// handle GPU memory, only CPU memory.
|
||||
struct ChunkPrefix {
|
||||
size_t num_bytes;
|
||||
void* chunk_ptr;
|
||||
};
|
||||
// kPoolAlignment cannot be less than the size of ChunkPrefix.
|
||||
static const int kPoolAlignment = sizeof(ChunkPrefix);
|
||||
|
||||
void* PrepareChunk(void* chunk, size_t alignment, size_t num_bytes) {
|
||||
ChunkPrefix* cp = reinterpret_cast<ChunkPrefix*>(chunk);
|
||||
cp->num_bytes = num_bytes;
|
||||
cp->chunk_ptr = chunk;
|
||||
void* user_ptr = reinterpret_cast<void*>(cp + 1);
|
||||
if (alignment > kPoolAlignment) {
|
||||
// Move user_ptr forward to the first satisfying offset, and write
|
||||
// chunk_ptr just before it.
|
||||
size_t aligned_ptr = reinterpret_cast<size_t>(user_ptr) + alignment;
|
||||
user_ptr = reinterpret_cast<void*>(aligned_ptr & ~(alignment - 1));
|
||||
(reinterpret_cast<ChunkPrefix*>(user_ptr) - 1)->chunk_ptr = chunk;
|
||||
}
|
||||
// Safety check that user_ptr is always past the ChunkPrefix.
|
||||
CHECK_GE(user_ptr, reinterpret_cast<ChunkPrefix*>(chunk) + 1);
|
||||
return user_ptr;
|
||||
}
|
||||
|
||||
ChunkPrefix* FindPrefix(void* user_ptr) {
|
||||
ChunkPrefix* cp = reinterpret_cast<ChunkPrefix*>(user_ptr) - 1;
|
||||
return reinterpret_cast<ChunkPrefix*>(cp->chunk_ptr);
|
||||
}
|
||||
} // namespace
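A worked example of the layout described above (illustrative only; assumes a 64-bit build where sizeof(ChunkPrefix) == kPoolAlignment == 16). A request for 100 bytes at alignment 64 is grown by AllocateRaw to 100 + 64 + 16 = 180 bytes before rounding; the returned user pointer is advanced to the first 64-byte boundary past the prefix, and the word just before it is rewritten so the chunk can always be recovered:
// Recovering the underlying chunk from a user pointer, exactly as FindPrefix does.
// 'pool' is a placeholder PoolAllocator.
void* user_ptr = pool.AllocateRaw(64 /*alignment*/, 100 /*num_bytes*/);
ChunkPrefix* prefix = reinterpret_cast<ChunkPrefix*>(user_ptr) - 1;
void* chunk_start = prefix->chunk_ptr;  // start of the SubAllocator allocation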
|
||||
|
||||
void* PoolAllocator::AllocateRaw(size_t alignment, size_t num_bytes) {
|
||||
if (!allocation_begun_) allocation_begun_ = true;
|
||||
if (num_bytes == 0) return nullptr;
|
||||
|
||||
// If alignment is larger than kPoolAlignment, increase num_bytes so that we
|
||||
// are guaranteed to be able to return an aligned ptr by advancing user_ptr
|
||||
// without overrunning the end of the chunk.
|
||||
if (alignment > kPoolAlignment) {
|
||||
num_bytes += alignment;
|
||||
}
|
||||
num_bytes += sizeof(ChunkPrefix);
|
||||
num_bytes = size_rounder_->RoundUp(num_bytes);
|
||||
PtrRecord* pr = nullptr;
|
||||
if (has_size_limit_) {
|
||||
{
|
||||
mutex_lock lock(mutex_);
|
||||
auto iter = pool_.find(num_bytes);
|
||||
if (iter == pool_.end()) {
|
||||
allocated_count_++;
|
||||
// Deliberately fall out of lock scope before
|
||||
// calling the allocator. No further modification
|
||||
// to the pool will be performed.
|
||||
} else {
|
||||
get_from_pool_count_++;
|
||||
pr = iter->second;
|
||||
RemoveFromList(pr);
|
||||
pool_.erase(iter);
|
||||
// Fall out of lock scope and prepare the result without the lock held.
|
||||
}
|
||||
}
|
||||
}
|
||||
if (pr != nullptr) {
|
||||
void* r = pr->ptr;
|
||||
delete pr;
|
||||
return PrepareChunk(r, alignment, num_bytes);
|
||||
} else {
|
||||
void* ptr = allocator_->Alloc(kPoolAlignment, num_bytes);
|
||||
for (auto v : alloc_visitors_) {
|
||||
v(ptr, num_bytes);
|
||||
}
|
||||
return PrepareChunk(ptr, alignment, num_bytes);
|
||||
}
|
||||
}
|
||||
|
||||
void PoolAllocator::DeallocateRaw(void* ptr) {
|
||||
if (ptr == nullptr) return;
|
||||
ChunkPrefix* cp = FindPrefix(ptr);
|
||||
CHECK_LE((void*)cp, (void*)ptr);
|
||||
if (!has_size_limit_ && !auto_resize_) {
|
||||
for (auto v : free_visitors_) {
|
||||
v(cp, cp->num_bytes);
|
||||
}
|
||||
allocator_->Free(cp, cp->num_bytes);
|
||||
} else {
|
||||
mutex_lock lock(mutex_);
|
||||
++put_count_;
|
||||
while (pool_.size() >= pool_size_limit_) {
|
||||
EvictOne();
|
||||
}
|
||||
PtrRecord* pr = new PtrRecord;
|
||||
pr->num_bytes = cp->num_bytes;
|
||||
pr->ptr = cp;
|
||||
AddToList(pr);
|
||||
pool_.insert(std::make_pair(cp->num_bytes, pr));
|
||||
}
|
||||
}
|
||||
|
||||
void PoolAllocator::Clear() {
|
||||
if (has_size_limit_) {
|
||||
mutex_lock lock(mutex_);
|
||||
for (auto iter : pool_) {
|
||||
PtrRecord* pr = iter.second;
|
||||
for (auto v : free_visitors_) {
|
||||
v(pr->ptr, pr->num_bytes);
|
||||
}
|
||||
allocator_->Free(pr->ptr, pr->num_bytes);
|
||||
delete pr;
|
||||
}
|
||||
pool_.clear();
|
||||
get_from_pool_count_ = 0;
|
||||
put_count_ = 0;
|
||||
allocated_count_ = 0;
|
||||
evicted_count_ = 0;
|
||||
lru_head_ = nullptr;
|
||||
lru_tail_ = nullptr;
|
||||
}
|
||||
}
|
||||
|
||||
void PoolAllocator::RemoveFromList(PtrRecord* pr) {
|
||||
if (pr->prev == nullptr) {
|
||||
DCHECK_EQ(lru_head_, pr);
|
||||
lru_head_ = nullptr;
|
||||
} else {
|
||||
pr->prev->next = pr->next;
|
||||
}
|
||||
if (pr->next == nullptr) {
|
||||
DCHECK_EQ(lru_tail_, pr);
|
||||
lru_tail_ = pr->prev;
|
||||
} else {
|
||||
pr->next->prev = pr->prev;
|
||||
if (lru_head_ == nullptr) {
|
||||
lru_head_ = pr->next;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void PoolAllocator::AddToList(PtrRecord* pr) {
|
||||
pr->prev = nullptr;
|
||||
if (lru_head_ == nullptr) {
|
||||
CHECK(lru_tail_ == nullptr);
|
||||
lru_tail_ = pr;
|
||||
pr->next = nullptr;
|
||||
} else {
|
||||
pr->next = lru_head_;
|
||||
pr->next->prev = pr;
|
||||
}
|
||||
lru_head_ = pr;
|
||||
}
|
||||
|
||||
void PoolAllocator::EvictOne() {
|
||||
DCHECK(lru_tail_ != nullptr);
|
||||
PtrRecord* prec = lru_tail_;
|
||||
RemoveFromList(prec);
|
||||
auto iter = pool_.find(prec->num_bytes);
|
||||
while (iter->second != prec) {
|
||||
++iter;
|
||||
DCHECK(iter != pool_.end());
|
||||
}
|
||||
pool_.erase(iter);
|
||||
for (auto v : free_visitors_) {
|
||||
v(prec->ptr, prec->num_bytes);
|
||||
}
|
||||
allocator_->Free(prec->ptr, prec->num_bytes);
|
||||
delete prec;
|
||||
++evicted_count_;
|
||||
// Auto-resizing, and warning messages.
|
||||
static const double kTolerable = 2e-3;
|
||||
static const int kCheckInterval = 1000;
|
||||
static const double kIncreaseFactor = 1.1;
|
||||
static const int kMinPoolSize = 100;
|
||||
if (0 == evicted_count_ % kCheckInterval) {
|
||||
const double eviction_rate =
|
||||
evicted_count_ / static_cast<double>(put_count_);
|
||||
const int64 alloc_request_count = allocated_count_ + get_from_pool_count_;
|
||||
const double alloc_rate =
|
||||
allocated_count_ / static_cast<double>(alloc_request_count);
|
||||
static int log_counter = 0;
|
||||
// (counter increment not thread safe but it's just for logging, so we
|
||||
// don't care).
|
||||
bool should_log = ((log_counter++ % 10) == 0);
|
||||
if (should_log) {
|
||||
LOG(WARNING) << "PoolAllocator: After " << alloc_request_count
|
||||
<< " get requests, put_count=" << put_count_
|
||||
<< " evicted_count=" << evicted_count_
|
||||
<< " eviction_rate=" << eviction_rate
|
||||
<< " and unsatisfied allocation rate=" << alloc_rate;
|
||||
}
|
||||
if (auto_resize_ && (eviction_rate > kTolerable) &&
|
||||
(alloc_rate > kTolerable)) {
|
||||
size_t new_size_limit = (pool_size_limit_ < kMinPoolSize)
|
||||
? kMinPoolSize
|
||||
: (kIncreaseFactor * pool_size_limit_);
|
||||
if (should_log) {
|
||||
LOG(INFO) << "Raising pool_size_limit_ from " << pool_size_limit_
|
||||
<< " to " << new_size_limit;
|
||||
}
|
||||
pool_size_limit_ = new_size_limit;
|
||||
// Reset all the counters so that ratios are relative to new sizes
|
||||
// at next test interval.
|
||||
put_count_ = 0;
|
||||
allocated_count_ = 0;
|
||||
evicted_count_ = 0;
|
||||
get_from_pool_count_ = 0;
|
||||
}
|
||||
}
|
||||
}
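To make the auto-resize heuristic above concrete (a hedged reading of the constants): the check runs once per 1000 evictions. If, say, 500,000 Put() calls have produced 2,000 evictions, the eviction rate is 0.004, above kTolerable (0.002); if fresh allocations also account for more than 0.2% of Get() requests, an auto-resizing pool raises its limit to 100 when the current limit is below 100 and otherwise multiplies it by 1.1, then resets the counters so the next interval is measured against the new size.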
|
||||
|
||||
void PoolAllocator::AddAllocVisitor(Visitor visitor) {
|
||||
mutex_lock lock(mutex_);
|
||||
CHECK(!allocation_begun_)
|
||||
<< "AddAllocVisitor may not be called after pool allocation "
|
||||
<< "has begun.";
|
||||
alloc_visitors_.push_back(visitor);
|
||||
}
|
||||
|
||||
void PoolAllocator::AddFreeVisitor(Visitor visitor) {
|
||||
mutex_lock lock(mutex_);
|
||||
CHECK(!allocation_begun_)
|
||||
<< "AddFreeVisitor may not be called after pool allocation "
|
||||
<< "has begun.";
|
||||
free_visitors_.push_back(visitor);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
202
tensorflow/core/common_runtime/gpu/pool_allocator.h
Normal file
@ -0,0 +1,202 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_POOL_ALLOCATOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_POOL_ALLOCATOR_H_
|
||||
|
||||
// Simple LRU pool allocators for various flavors of CPU RAM that
|
||||
// implement the VisitableAllocator interface. GPU memory is managed
|
||||
// by GPURegionAllocator.
|
||||
|
||||
#include <atomic>
|
||||
#include <map>
|
||||
#include <memory>
|
||||
#include "tensorflow/core/lib/core/bits.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/visitable_allocator.h"
|
||||
#include "tensorflow/stream_executor/stream_executor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Interface of an object that does the underlying alloc/free of memory.
|
||||
class SubAllocator {
|
||||
public:
|
||||
virtual ~SubAllocator() {}
|
||||
virtual void* Alloc(size_t alignment, size_t num_bytes) = 0;
|
||||
virtual void Free(void* ptr, size_t num_bytes) = 0;
|
||||
};
|
||||
|
||||
// Interface of an object that rounds up integers.
|
||||
class RoundUpInterface {
|
||||
public:
|
||||
virtual ~RoundUpInterface() {}
|
||||
virtual size_t RoundUp(size_t num_bytes) = 0;
|
||||
};
|
||||
|
||||
// Size-limited pool of memory buffers obtained from a SubAllocator
|
||||
// instance. Pool eviction policy is LRU.
|
||||
class PoolAllocator : public VisitableAllocator {
|
||||
public:
|
||||
// "pool_size_limit" is the maximum number of returned, re-usable
|
||||
// memory buffers to keep in the pool. If pool_size_limit == 0, the
|
||||
// pool is effectively a thin wrapper around the allocator.
|
||||
// If "auto_resize" is true, then the pool_size_limit will gradually
|
||||
// be raised so that deallocations happen very rarely, if at all.
|
||||
// Transitory start-up objects may deallocate, but the long-term
|
||||
// working-set should not. Auto-resizing can raise pool_size_limit
|
||||
// but will never lower it.
|
||||
// "allocator" is the object that performs the underlying memory
|
||||
// malloc/free operations. This object takes ownership of allocator.
|
||||
PoolAllocator(size_t pool_size_limit, bool auto_resize,
|
||||
SubAllocator* allocator, RoundUpInterface* size_rounder,
|
||||
string name);
|
||||
~PoolAllocator() override;
|
||||
|
||||
string Name() override { return name_; }
|
||||
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override;
|
||||
|
||||
void DeallocateRaw(void* ptr) override;
|
||||
|
||||
// REQUIRES: The following functions may only be called prior
|
||||
// to the first Allocate*() call. Once allocation has begun, it is
|
||||
// illegal to register another visitor.
|
||||
|
||||
void AddAllocVisitor(Visitor visitor) override;
|
||||
|
||||
void AddFreeVisitor(Visitor visitor) override;
|
||||
|
||||
// Allocate an unused memory region of size "num_bytes". Fetch from
|
||||
// the pool if available, otherwise call allocator_.
|
||||
void* Get(size_t num_bytes);
|
||||
|
||||
// Return a no-longer needed memory region to the pool. It is an error
|
||||
// to dereference "ptr" after this call. If the pool is full, the least
|
||||
// recently used region will be deallocated.
|
||||
void Put(void* ptr, size_t num_bytes);
|
||||
|
||||
// Reset the pool to empty.
|
||||
void Clear();
|
||||
|
||||
// The following accessors permit monitoring the effectiveness of
|
||||
// the pool at avoiding repeated malloc/frees on the underlying
|
||||
// allocator. Read locks are not taken on the theory that value
|
||||
// consistency with other threads is not important.
|
||||
|
||||
// Number of Get() requests satisfied from pool.
|
||||
int64 get_from_pool_count() const NO_THREAD_SAFETY_ANALYSIS {
|
||||
return get_from_pool_count_;
|
||||
}
|
||||
// Number of Put() requests.
|
||||
int64 put_count() const NO_THREAD_SAFETY_ANALYSIS { return put_count_; }
|
||||
// Number of Get() requests requiring a fresh allocation.
|
||||
int64 allocated_count() const NO_THREAD_SAFETY_ANALYSIS {
|
||||
return allocated_count_;
|
||||
}
|
||||
// Number of pool evictions.
|
||||
int64 evicted_count() const NO_THREAD_SAFETY_ANALYSIS {
|
||||
return evicted_count_;
|
||||
}
|
||||
// Current size limit.
|
||||
size_t size_limit() const NO_THREAD_SAFETY_ANALYSIS {
|
||||
return pool_size_limit_;
|
||||
}
|
||||
|
||||
private:
|
||||
struct PtrRecord {
|
||||
void* ptr;
|
||||
size_t num_bytes;
|
||||
PtrRecord* prev;
|
||||
PtrRecord* next;
|
||||
};
|
||||
|
||||
// Remove "pr" from the double-linked LRU list.
|
||||
void RemoveFromList(PtrRecord* pr) EXCLUSIVE_LOCKS_REQUIRED(mutex_);
|
||||
|
||||
// Add "pr" to the head of the double-linked LRU list.
|
||||
void AddToList(PtrRecord* pr) EXCLUSIVE_LOCKS_REQUIRED(mutex_);
|
||||
|
||||
// Delete the least recently used record.
|
||||
void EvictOne() EXCLUSIVE_LOCKS_REQUIRED(mutex_);
|
||||
|
||||
const string name_;
|
||||
const bool has_size_limit_;
|
||||
const bool auto_resize_;
|
||||
size_t pool_size_limit_;
|
||||
std::unique_ptr<SubAllocator> allocator_;
|
||||
std::unique_ptr<RoundUpInterface> size_rounder_;
|
||||
mutex mutex_;
|
||||
std::multimap<const size_t, PtrRecord*> pool_ GUARDED_BY(mutex_);
|
||||
PtrRecord* lru_head_ GUARDED_BY(mutex_) = nullptr;
|
||||
PtrRecord* lru_tail_ GUARDED_BY(mutex_) = nullptr;
|
||||
int64 get_from_pool_count_ GUARDED_BY(mutex_) = 0;
|
||||
int64 put_count_ GUARDED_BY(mutex_) = 0;
|
||||
int64 allocated_count_ GUARDED_BY(mutex_) = 0;
|
||||
int64 evicted_count_ GUARDED_BY(mutex_) = 0;
|
||||
// Write access to these is guarded by mutex_, but not read
|
||||
// access. They may only be modified prior to the first
|
||||
// allocation. Later attempts to modify will fail.
|
||||
std::vector<Visitor> alloc_visitors_;
|
||||
std::vector<Visitor> free_visitors_;
|
||||
std::atomic<bool> allocation_begun_;
|
||||
};
|
||||
|
||||
// Do-nothing rounder. Passes through sizes unchanged.
|
||||
class NoopRounder : public RoundUpInterface {
|
||||
public:
|
||||
size_t RoundUp(size_t num_bytes) override { return num_bytes; }
|
||||
};
|
||||
|
||||
// Power of 2 rounder: rounds up to nearest power of 2 size.
|
||||
class Pow2Rounder : public RoundUpInterface {
|
||||
public:
|
||||
size_t RoundUp(size_t num_bytes) override {
|
||||
return 1uLL << Log2Ceiling64(num_bytes);
|
||||
}
|
||||
};
|
||||
|
||||
class BasicCPUAllocator : public SubAllocator {
|
||||
public:
|
||||
~BasicCPUAllocator() override {}
|
||||
|
||||
void* Alloc(size_t alignment, size_t num_bytes) override {
|
||||
return port::aligned_malloc(num_bytes, alignment);
|
||||
}
|
||||
// Pair port::aligned_malloc with its matching aligned free rather than
// plain free(), which is not portable for aligned allocations.
void Free(void* ptr, size_t num_bytes) override { port::aligned_free(ptr); }
|
||||
};
|
||||
|
||||
// Allocator for pinned CPU RAM that is made known to CUDA for the
|
||||
// purpose of efficient DMA with a GPU.
|
||||
class CUDAHostAllocator : public SubAllocator {
|
||||
public:
|
||||
// Note: stream_exec cannot be null.
|
||||
explicit CUDAHostAllocator(perftools::gputools::StreamExecutor* stream_exec)
|
||||
: stream_exec_(stream_exec) {
|
||||
CHECK(stream_exec_ != nullptr);
|
||||
}
|
||||
~CUDAHostAllocator() override {}
|
||||
|
||||
void* Alloc(size_t alignment, size_t num_bytes) override {
|
||||
void* ptr = nullptr;
|
||||
if (num_bytes > 0) {
|
||||
ptr = stream_exec_->HostMemoryAllocate(num_bytes);
|
||||
if (ptr == nullptr) {
|
||||
LOG(FATAL) << "could not allocate pinned host memory of size: "
|
||||
<< num_bytes;
|
||||
}
|
||||
}
|
||||
return ptr;
|
||||
}
|
||||
|
||||
void Free(void* ptr, size_t num_bytes) override {
|
||||
if (ptr != nullptr) {
|
||||
stream_exec_->HostMemoryDeallocate(ptr);
|
||||
}
|
||||
}
|
||||
|
||||
private:
|
||||
perftools::gputools::StreamExecutor* stream_exec_; // not owned, non-null
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(CUDAHostAllocator);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_POOL_ALLOCATOR_H_
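A minimal construction sketch (host-only, no CUDA; the numbers are arbitrary), mirroring how the tests in the next file exercise the class:
// Pool of up to 16 reusable host buffers, with power-of-2 size classes.
PoolAllocator host_pool(16 /*pool_size_limit*/, true /*auto_resize*/,
                        new BasicCPUAllocator, new Pow2Rounder, "host_pool");
void* buf = host_pool.AllocateRaw(64 /*alignment*/, 1000 /*num_bytes*/);
// ... use buf ...
host_pool.DeallocateRaw(buf);  // returned to the pool rather than freed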
|
203
tensorflow/core/common_runtime/gpu/pool_allocator_test.cc
Normal file
@ -0,0 +1,203 @@
|
||||
#if GOOGLE_CUDA
|
||||
|
||||
#include "tensorflow/core/common_runtime/gpu/pool_allocator.h"
|
||||
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
#include "tensorflow/stream_executor/platform.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
TEST(PoolAllocatorTest, ZeroSizeBuffers) {
|
||||
gpu::Platform* platform =
|
||||
gpu::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
|
||||
PoolAllocator pool(
|
||||
2 /*pool_size_limit*/, false /*auto_resize*/,
|
||||
new CUDAHostAllocator(
|
||||
platform->GetExecutor(gpu::StreamExecutorConfig(/*ordinal=*/0))
|
||||
.ValueOrDie()),
|
||||
new NoopRounder, "pool");
|
||||
|
||||
EXPECT_EQ(nullptr, pool.AllocateRaw(4 /*alignment*/, 0 /*num_bytes*/));
|
||||
pool.DeallocateRaw(nullptr); // Should not crash.
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(0, pool.put_count());
|
||||
EXPECT_EQ(0, pool.allocated_count());
|
||||
EXPECT_EQ(0, pool.evicted_count());
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, ZeroSizePool) {
|
||||
gpu::Platform* platform =
|
||||
gpu::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
|
||||
PoolAllocator pool(
|
||||
0 /*pool_size_limit*/, false /*auto_resize*/,
|
||||
new CUDAHostAllocator(
|
||||
platform->GetExecutor(gpu::StreamExecutorConfig(/*ordinal=*/0))
|
||||
.ValueOrDie()),
|
||||
new NoopRounder, "pool");
|
||||
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(0, pool.put_count());
|
||||
EXPECT_EQ(0, pool.allocated_count());
|
||||
EXPECT_EQ(0, pool.evicted_count());
|
||||
|
||||
// All allocations should bypass the pool and return valid pointers.
|
||||
for (int i = 0; i < 3; ++i) {
|
||||
void* p0 = pool.AllocateRaw(4, 0);
|
||||
void* p4 = pool.AllocateRaw(4, 4);
|
||||
void* p12 = pool.AllocateRaw(4, 12);
|
||||
EXPECT_EQ(nullptr, p0);
|
||||
EXPECT_NE(nullptr, p4);
|
||||
EXPECT_NE(nullptr, p12);
|
||||
pool.DeallocateRaw(p0);
|
||||
pool.DeallocateRaw(p4);
|
||||
pool.DeallocateRaw(p12);
|
||||
}
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(0, pool.put_count());
|
||||
EXPECT_EQ(0, pool.allocated_count());
|
||||
EXPECT_EQ(0, pool.evicted_count());
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, Alignment) {
|
||||
gpu::Platform* platform =
|
||||
gpu::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
|
||||
PoolAllocator pool(
|
||||
0 /*pool_size_limit*/, false /*auto_resize*/,
|
||||
new CUDAHostAllocator(
|
||||
platform->GetExecutor(gpu::StreamExecutorConfig(/*ordinal=*/0))
|
||||
.ValueOrDie()),
|
||||
new NoopRounder, "pool");
|
||||
for (int i = 0; i < 16; ++i) {
|
||||
size_t alignment = 1 << i;
|
||||
void* p = pool.AllocateRaw(alignment, 111);
|
||||
EXPECT_TRUE(p != nullptr);
|
||||
EXPECT_EQ(0, reinterpret_cast<int64>(p) & (alignment - 1))
|
||||
<< "ptr: " << p << " alignment " << alignment;
|
||||
// Intentionally don't deallocate, to test that destruction of
|
||||
// the PoolAllocator frees all pending memory.
|
||||
}
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, AutoResize) {
|
||||
PoolAllocator pool(2 /*pool_size_limit*/, true /*auto_resize*/,
|
||||
new BasicCPUAllocator, new NoopRounder, "pool");
|
||||
|
||||
// Alloc/dealloc 10 sizes just a few times, confirming pool size
|
||||
// stays at 2.
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
void* p = pool.AllocateRaw(4, 64 << i);
|
||||
pool.DeallocateRaw(p);
|
||||
}
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(10, pool.allocated_count());
|
||||
EXPECT_EQ(10, pool.put_count());
|
||||
EXPECT_EQ(8, pool.evicted_count());
|
||||
EXPECT_EQ(2, pool.size_limit());
|
||||
|
||||
// Then repeat 1200 times. Pool size limit should jump to 100.
|
||||
for (int j = 0; j < 120; ++j) {
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
void* p = pool.AllocateRaw(4, 64 << i);
|
||||
pool.DeallocateRaw(p);
|
||||
}
|
||||
}
|
||||
EXPECT_EQ(100, pool.size_limit());
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, CudaHostAllocator) {
|
||||
gpu::Platform* platform =
|
||||
gpu::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
|
||||
PoolAllocator pool(
|
||||
2 /*pool_size_limit*/, false /*auto_resize*/,
|
||||
new CUDAHostAllocator(
|
||||
platform->GetExecutor(gpu::StreamExecutorConfig(/*ordinal=*/0))
|
||||
.ValueOrDie()),
|
||||
new NoopRounder, "pool");
|
||||
|
||||
// Repeatedly Get a 16-byte value, confirming that there's only
|
||||
// one real allocation.
|
||||
void* p1_16 = pool.AllocateRaw(4, 16);
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(1, pool.allocated_count());
|
||||
EXPECT_NE(nullptr, p1_16);
|
||||
pool.DeallocateRaw(p1_16);
|
||||
// Pool contents {16}
|
||||
EXPECT_EQ(1, pool.put_count());
|
||||
void* p2_16 = pool.AllocateRaw(4, 16); // Get it again.
|
||||
EXPECT_EQ(1, pool.get_from_pool_count());
|
||||
EXPECT_EQ(1, pool.allocated_count());
|
||||
EXPECT_EQ(p1_16, p2_16); // Same pointer value
|
||||
pool.DeallocateRaw(p2_16); // Put it back.
|
||||
// Pool contents {16}
|
||||
EXPECT_EQ(2, pool.put_count());
|
||||
|
||||
// Get two more values of different sizes.
|
||||
void* p3_4 = pool.AllocateRaw(4, 4);
|
||||
EXPECT_EQ(2, pool.allocated_count());
|
||||
EXPECT_NE(p1_16, p3_4); // Different pointer value
|
||||
EXPECT_NE(nullptr, p3_4);
|
||||
pool.DeallocateRaw(p3_4); // Put it back. Pool is now full.
|
||||
// Pool contents {4, 16}
|
||||
EXPECT_EQ(3, pool.put_count());
|
||||
void* p4_2 = pool.AllocateRaw(4, 2); // Get a third size buffer.
|
||||
EXPECT_NE(nullptr, p4_2);
|
||||
EXPECT_EQ(0, pool.evicted_count());
|
||||
|
||||
// The pool is full: when we put back p4_2, the 16-byte buffer
|
||||
// should be evicted since it was least recently inserted.
|
||||
pool.DeallocateRaw(p4_2);
|
||||
// Pool contents {2, 4}
|
||||
EXPECT_EQ(4, pool.put_count());
|
||||
EXPECT_EQ(1, pool.evicted_count());
|
||||
|
||||
// Re-getting and putting size 2 or 4 should not alter pool size or
|
||||
// num-evicted.
|
||||
void* p5_4 = pool.AllocateRaw(4, 4);
|
||||
EXPECT_NE(nullptr, p5_4);
|
||||
pool.DeallocateRaw(p5_4);
|
||||
void* p6_2 = pool.AllocateRaw(4, 2);
|
||||
EXPECT_NE(nullptr, p6_2);
|
||||
pool.DeallocateRaw(p6_2);
|
||||
EXPECT_EQ(3, pool.get_from_pool_count());
|
||||
EXPECT_EQ(6, pool.put_count());
|
||||
EXPECT_EQ(3, pool.allocated_count());
|
||||
EXPECT_EQ(1, pool.evicted_count());
|
||||
|
||||
pool.Clear();
|
||||
EXPECT_EQ(0, pool.get_from_pool_count());
|
||||
EXPECT_EQ(0, pool.put_count());
|
||||
EXPECT_EQ(0, pool.allocated_count());
|
||||
EXPECT_EQ(0, pool.evicted_count());
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, Pow2Rounder) {
|
||||
Pow2Rounder rounder;
|
||||
EXPECT_EQ(1, rounder.RoundUp(1));
|
||||
EXPECT_EQ(2, rounder.RoundUp(2));
|
||||
EXPECT_EQ(16, rounder.RoundUp(9));
|
||||
EXPECT_EQ(16, rounder.RoundUp(16));
|
||||
EXPECT_EQ(65536, rounder.RoundUp(41234));
|
||||
EXPECT_EQ(65536, rounder.RoundUp(65535));
|
||||
EXPECT_EQ(65536, rounder.RoundUp(65536));
|
||||
}
|
||||
|
||||
TEST(PoolAllocatorTest, Name) {
|
||||
gpu::Platform* platform =
|
||||
gpu::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
|
||||
PoolAllocator pool(
|
||||
2 /*pool_size_limit*/, false /*auto_resize*/,
|
||||
new CUDAHostAllocator(
|
||||
platform->GetExecutor(gpu::StreamExecutorConfig(/*ordinal=*/0))
|
||||
.ValueOrDie()),
|
||||
new NoopRounder, "pool");
|
||||
EXPECT_EQ("pool", pool.Name());
|
||||
}
|
||||
|
||||
} // namespace
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // GOOGLE_CUDA
|
220
tensorflow/core/common_runtime/gpu/process_state.cc
Normal file
@ -0,0 +1,220 @@
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_region_allocator.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/pool_allocator.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/stream_executor/multi_platform_manager.h"
|
||||
|
||||
#if defined(PLATFORM_GOOGLE)
|
||||
DEFINE_bool(record_mem_types, false,
|
||||
"If true, record attributes of memory allocations and "
|
||||
"dyanmically check for appropriate use of registered memory."
|
||||
"Should only be true for debugging or diagnosis of "
|
||||
"performance issues.");
|
||||
DEFINE_bool(brain_mem_reg_cuda_dma, true,
|
||||
"If true, register CPU RAM used to copy to/from GPU RAM "
|
||||
"with the CUDA driver.");
|
||||
DEFINE_bool(brain_gpu_use_bfc_allocator, false,
|
||||
"If true, uses the Best-Fit GPU allocator.");
|
||||
DEFINE_bool(brain_gpu_region_allocator_debug, false,
|
||||
"If true, checks for memory overwrites by writing "
|
||||
"distinctive patterns on both ends of allocated memory.");
|
||||
DEFINE_bool(brain_gpu_region_allocator_reset_to_nan, false,
|
||||
"If true, initializes all new Malloc buffers to NaN, "
|
||||
"and resets the buffer to NaN upon Free.");
|
||||
|
||||
#else
|
||||
bool FLAGS_record_mem_types = false;
|
||||
bool FLAGS_brain_mem_reg_cuda_dma = true;
|
||||
bool FLAGS_brain_gpu_region_allocator_debug = false;
|
||||
bool FLAGS_brain_gpu_region_allocator_reset_to_nan = false;
|
||||
bool FLAGS_brain_gpu_use_bfc_allocator = false;
|
||||
#endif
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
ProcessState* ProcessState::instance_ = nullptr;
|
||||
|
||||
/*static*/ ProcessState* ProcessState::singleton() {
|
||||
if (instance_ == nullptr) {
|
||||
instance_ = new ProcessState;
|
||||
}
|
||||
|
||||
return instance_;
|
||||
}
|
||||
|
||||
ProcessState::ProcessState() : gpu_count_(0) {
|
||||
CHECK(instance_ == nullptr);
|
||||
instance_ = this;
|
||||
}
|
||||
|
||||
ProcessState::~ProcessState() {
|
||||
for (auto p : gpu_allocators_) {
|
||||
delete p;
|
||||
}
|
||||
instance_ = nullptr;
|
||||
}
|
||||
|
||||
string ProcessState::MemDesc::DebugString() {
|
||||
return strings::StrCat((loc == CPU ? "CPU " : "GPU "), dev_index, ", dma: ",
|
||||
gpu_registered, ", nic: ", nic_registered);
|
||||
}
|
||||
|
||||
ProcessState::MemDesc ProcessState::PtrType(const void* ptr) {
|
||||
if (FLAGS_record_mem_types) {
|
||||
auto iter = mem_desc_map_.find(ptr);
|
||||
if (iter != mem_desc_map_.end()) {
|
||||
return iter->second;
|
||||
}
|
||||
}
|
||||
return MemDesc();
|
||||
}
|
||||
|
||||
void ProcessState::SetGPUCount(int c) {
|
||||
CHECK(gpu_count_ == 0 || gpu_count_ == c)
|
||||
<< "Cannot call SetGPUCount with a non-zero value "
|
||||
<< "not equal to prior set value.";
|
||||
gpu_count_ = c;
|
||||
}
|
||||
|
||||
int ProcessState::GPUCount() const { return gpu_count_; }
|
||||
|
||||
Allocator* ProcessState::GetGPUAllocator(int gpu_id, size_t total_bytes) {
|
||||
#if GOOGLE_CUDA
|
||||
mutex_lock lock(mu_);
|
||||
gpu::Platform* gpu_platform = GPUMachineManager();
|
||||
|
||||
// Verify that gpu_id is legitimate.
|
||||
CHECK_LT(gpu_id, gpu_platform->VisibleDeviceCount())
|
||||
<< "gpu_id is outside discovered device range";
|
||||
|
||||
if (gpu_id >= static_cast<int64>(gpu_allocators_.size())) {
|
||||
gpu_allocators_.resize(gpu_id + 1);
|
||||
if (FLAGS_record_mem_types) gpu_al_.resize(gpu_id + 1);
|
||||
}
|
||||
|
||||
if (gpu_allocators_[gpu_id] == nullptr) {
|
||||
VisitableAllocator* gpu_allocator;
|
||||
|
||||
if (FLAGS_brain_gpu_use_bfc_allocator) {
|
||||
gpu_allocator = new GPUBFCAllocator(gpu_id, total_bytes);
|
||||
} else {
|
||||
gpu_allocator = new GPURegionAllocator(gpu_id, total_bytes);
|
||||
}
|
||||
|
||||
if (FLAGS_brain_gpu_region_allocator_debug) {
|
||||
gpu_allocator = new GPUDebugAllocator(gpu_allocator, gpu_id);
|
||||
}
|
||||
if (FLAGS_brain_gpu_region_allocator_reset_to_nan) {
|
||||
gpu_allocator = new GPUNanResetAllocator(gpu_allocator, gpu_id);
|
||||
}
|
||||
|
||||
gpu_allocators_[gpu_id] = gpu_allocator;
|
||||
|
||||
// If there are any pending AllocVisitors for this bus, add
|
||||
// them now.
|
||||
gpu::StreamExecutor* se =
|
||||
gpu_platform->ExecutorForDevice(gpu_id).ValueOrDie();
|
||||
int bus_id = se->GetDeviceDescription().numa_node();
|
||||
if (bus_id < static_cast<int64>(gpu_visitors_.size())) {
|
||||
for (auto v : gpu_visitors_[bus_id]) {
|
||||
gpu_allocators_[gpu_id]->AddAllocVisitor(v);
|
||||
}
|
||||
}
|
||||
if (FLAGS_record_mem_types) {
|
||||
MemDesc md;
|
||||
md.loc = MemDesc::GPU;
|
||||
md.dev_index = gpu_id;
|
||||
md.gpu_registered = false;
|
||||
md.nic_registered = true;
|
||||
if (static_cast<int64>(gpu_al_.size()) <= gpu_id)
|
||||
gpu_al_.resize(gpu_id + 1);
|
||||
gpu_al_[gpu_id] = new internal::RecordingAllocator(
|
||||
&mem_desc_map_, gpu_allocators_[gpu_id], md, &mu_);
|
||||
}
|
||||
}
|
||||
if (FLAGS_record_mem_types) return gpu_al_[gpu_id];
|
||||
return gpu_allocators_[gpu_id];
|
||||
#else
|
||||
LOG(FATAL) << "GPUAllocator unavailable. Not compiled with --config=cuda.";
|
||||
return nullptr;
|
||||
#endif // GOOGLE_CUDA
|
||||
}
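In summary, the allocator returned above is assembled in layers controlled by the flags at the top of this file (a reading aid, not code from this file):
  GPURegionAllocator, or GPUBFCAllocator when FLAGS_brain_gpu_use_bfc_allocator is set
    wrapped by GPUDebugAllocator            if FLAGS_brain_gpu_region_allocator_debug
    wrapped by GPUNanResetAllocator         if FLAGS_brain_gpu_region_allocator_reset_to_nan
    wrapped by internal::RecordingAllocator if FLAGS_record_mem_types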
|
||||
|
||||
Allocator* ProcessState::GetCPUAllocator(int numa_node) {
|
||||
// Although we're temporarily ignoring numa_node, check for legality.
|
||||
CHECK_GE(numa_node, 0);
|
||||
// TODO(tucker): actually maintain separate CPUAllocators for
|
||||
// different numa_nodes. For now, just one.
|
||||
numa_node = 0;
|
||||
mutex_lock lock(mu_);
|
||||
while (cpu_allocators_.size() <= static_cast<size_t>(numa_node)) {
|
||||
cpu_allocators_.push_back(new PoolAllocator(
|
||||
100 /*pool_size_limit*/, true /*auto_resize*/, new BasicCPUAllocator(),
|
||||
new NoopRounder, "cpu_pool"));
|
||||
}
|
||||
return cpu_allocators_[0];
|
||||
}
|
||||
|
||||
Allocator* ProcessState::GetCUDAHostAllocator(int numa_node) {
|
||||
if (gpu_count_ == 0 || !FLAGS_brain_mem_reg_cuda_dma) {
|
||||
return GetCPUAllocator(numa_node);
|
||||
}
|
||||
// Although we're temporarily ignoring numa_node, check for legality.
|
||||
CHECK_GE(numa_node, 0);
|
||||
// TODO(tucker): actually maintain separate CPUAllocators for
|
||||
// different numa_nodes. For now, just one.
|
||||
numa_node = 0;
|
||||
mutex_lock lock(mu_);
|
||||
while (static_cast<int>(cuda_host_allocators_.size()) <= numa_node) {
|
||||
// CUDAHost allocation is the same across all GPUs, so just get the
|
||||
// executor for the first device.
|
||||
gpu::Platform* gpu_platform = GPUMachineManager();
|
||||
gpu::StreamExecutor* se = gpu_platform->ExecutorForDevice(0).ValueOrDie();
|
||||
CHECK(se);
|
||||
cuda_host_allocators_.push_back(new PoolAllocator(
|
||||
100 /*pool_size_limit*/, true /*auto_resize*/,
|
||||
new CUDAHostAllocator(se), new Pow2Rounder, "cuda_host"));
|
||||
if (FLAGS_record_mem_types) {
|
||||
MemDesc md;
|
||||
md.loc = MemDesc::CPU;
|
||||
md.dev_index = 0;
|
||||
md.gpu_registered = true;
|
||||
md.nic_registered = false;
|
||||
cuda_al_.push_back(new internal::RecordingAllocator(
|
||||
&mem_desc_map_, cuda_host_allocators_.back(), md, &mu_));
|
||||
}
|
||||
}
|
||||
if (FLAGS_record_mem_types) return cuda_al_[0];
|
||||
return cuda_host_allocators_[0];
|
||||
}
|
||||
|
||||
void ProcessState::AddGPUAllocVisitor(int bus_id, AllocVisitor visitor) {
|
||||
#if GOOGLE_CUDA
|
||||
mutex_lock lock(mu_);
|
||||
gpu::Platform* gpu_platform = GPUMachineManager();
|
||||
for (int gpu_id = 0; gpu_id < static_cast<int64>(gpu_allocators_.size());
|
||||
++gpu_id) {
|
||||
gpu::StreamExecutor* se =
|
||||
gpu_platform->ExecutorForDevice(gpu_id).ValueOrDie();
|
||||
if (gpu_allocators_[gpu_id] &&
|
||||
se->GetDeviceDescription().numa_node() == bus_id) {
|
||||
gpu_allocators_[gpu_id]->AddAllocVisitor(visitor);
|
||||
}
|
||||
}
|
||||
while (bus_id >= static_cast<int64>(gpu_visitors_.size())) {
|
||||
gpu_visitors_.push_back(std::vector<AllocVisitor>());
|
||||
}
|
||||
gpu_visitors_[bus_id].push_back(visitor);
|
||||
#endif // GOOGLE_CUDA
|
||||
}
|
||||
|
||||
} // namespace tensorflow
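A hedged usage sketch of the singleton defined above (the 1 GiB budget and the device/NUMA indices are arbitrary illustrations):
ProcessState* ps = ProcessState::singleton();
ps->SetGPUCount(1);
Allocator* gpu_alloc = ps->GetGPUAllocator(0 /*gpu_id*/, 1LL << 30 /*total_bytes*/);
Allocator* pinned = ps->GetCUDAHostAllocator(0 /*numa_node*/);  // pinned CPU RAM for DMA
CHECK(gpu_alloc != nullptr && pinned != nullptr);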
|
140
tensorflow/core/common_runtime/gpu/process_state.h
Normal file
@ -0,0 +1,140 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_PROCESS_STATE_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_PROCESS_STATE_H_
|
||||
|
||||
#include <functional>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class Allocator;
|
||||
class VisitableAllocator;
|
||||
class PoolAllocator;
|
||||
|
||||
// Singleton that manages per-process state, e.g. allocation
|
||||
// of shared resources.
|
||||
class ProcessState {
|
||||
public:
|
||||
static ProcessState* singleton();
|
||||
|
||||
// Descriptor for memory allocation attributes, used by optional
|
||||
// runtime correctness analysis logic.
|
||||
struct MemDesc {
|
||||
enum MemLoc { CPU, GPU };
|
||||
MemLoc loc;
|
||||
int dev_index;
|
||||
bool gpu_registered;
|
||||
bool nic_registered;
|
||||
MemDesc()
|
||||
: loc(CPU),
|
||||
dev_index(0),
|
||||
gpu_registered(false),
|
||||
nic_registered(false) {}
|
||||
string DebugString();
|
||||
};
|
||||
|
||||
// Records the number of GPUs available in the local process.
|
||||
// It is a fatal error to call this with a value != the value
|
||||
// in a prior call.
|
||||
void SetGPUCount(int c);
|
||||
|
||||
// Returns number of GPUs available in local process, as set by
|
||||
// SetGPUCount(). Returns 0 if SetGPUCount has not been called.
|
||||
int GPUCount() const;
|
||||
|
||||
// Returns what we know about the memory at ptr.
|
||||
// If we know nothing, it's called CPU 0 with no other attributes.
|
||||
MemDesc PtrType(const void* ptr);
|
||||
|
||||
// Returns the one CPUAllocator used for the given numa_node.
|
||||
// TEMPORARY: ignores numa_node.
|
||||
Allocator* GetCPUAllocator(int numa_node);
|
||||
|
||||
// Returns the one GPU allocator used for the indexed GPU.
|
||||
// Note that this is a system GPU index, not (necessarily) a brain
|
||||
// device index.
|
||||
//
|
||||
// 'total_bytes' is the total number of bytes that should be made
|
||||
// available to the allocator. The first call to this function for
|
||||
// a given gpu_id creates the allocator, so only the total_bytes
|
||||
// used on that first call is used.
|
||||
//
|
||||
// REQUIRES: gpu_id must be a valid ordinal for a GPU available in the
|
||||
// current system environment. Otherwise returns nullptr.
|
||||
Allocator* GetGPUAllocator(int gpu_id, size_t total_bytes);
|
||||
|
||||
Allocator* GetCUDAHostAllocator(int numa_node);
|
||||
|
||||
// Registers a function to be called once on every new Region
|
||||
// allocated by every GPURegionAllocator proximate to the specified
|
||||
// bus. The AllocVisitor is provided with a memory pointer and the
|
||||
// size of the area it identifies. The pointer is not guaranteed to
|
||||
// be valid after the call terminates. The intention is for this
|
||||
// interface to be used for network device memory registration.
|
||||
// "bus_id" is platform-specific. On many platforms it
|
||||
// should be 0. On machines with multiple PCIe buses, it should be
|
||||
// the index of one of the PCIe buses. If the bus_id is invalid,
|
||||
// results are undefined.
|
||||
typedef std::function<void(void*, size_t)> AllocVisitor;
|
||||
void AddGPUAllocVisitor(int bus_id, AllocVisitor visitor);
|
||||
|
||||
typedef std::unordered_map<const void*, MemDesc> MDMap;
|
||||
|
||||
protected:
|
||||
ProcessState();
|
||||
|
||||
static ProcessState* instance_;
|
||||
|
||||
mutex mu_;
|
||||
int gpu_count_;
|
||||
|
||||
std::vector<PoolAllocator*> cpu_allocators_ GUARDED_BY(mu_);
|
||||
std::vector<VisitableAllocator*> gpu_allocators_ GUARDED_BY(mu_);
|
||||
std::vector<std::vector<AllocVisitor>> gpu_visitors_ GUARDED_BY(mu_);
|
||||
std::vector<PoolAllocator*> cuda_host_allocators_ GUARDED_BY(mu_);
|
||||
|
||||
virtual ~ProcessState();
|
||||
|
||||
// Optional RecordingAllocators that wrap the corresponding
|
||||
// Allocators for runtime attribute use analysis.
|
||||
MDMap mem_desc_map_;
|
||||
std::vector<Allocator*> cpu_al_ GUARDED_BY(mu_);
|
||||
std::vector<Allocator*> gpu_al_ GUARDED_BY(mu_);
|
||||
std::vector<Allocator*> cuda_al_ GUARDED_BY(mu_);
|
||||
};
|
||||
|
||||
namespace internal {
|
||||
class RecordingAllocator : public Allocator {
|
||||
public:
|
||||
RecordingAllocator(ProcessState::MDMap* mm, Allocator* a,
|
||||
ProcessState::MemDesc md, mutex* mu)
|
||||
: mm_(mm), a_(a), md_(md), mu_(mu) {}
|
||||
|
||||
string Name() override { return a_->Name(); }
|
||||
void* AllocateRaw(size_t alignment, size_t num_bytes) override {
|
||||
void* p = a_->AllocateRaw(alignment, num_bytes);
|
||||
mutex_lock l(*mu_);
|
||||
(*mm_)[p] = md_;
|
||||
return p;
|
||||
}
|
||||
void DeallocateRaw(void* p) override {
|
||||
mutex_lock l(*mu_);
|
||||
auto iter = mm_->find(p);
|
||||
mm_->erase(iter);
|
||||
a_->DeallocateRaw(p);
|
||||
}
|
||||
bool TracksAllocationSizes() override { return a_->TracksAllocationSizes(); }
|
||||
size_t RequestedSize(void* p) override { return a_->RequestedSize(p); }
|
||||
size_t AllocatedSize(void* p) override { return a_->AllocatedSize(p); }
|
||||
ProcessState::MDMap* mm_; // not owned
|
||||
Allocator* a_; // not owned
|
||||
ProcessState::MemDesc md_;
|
||||
mutex* mu_;
|
||||
};
|
||||
} // namespace internal
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_PROCESS_STATE_H_
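A brief, hedged usage sketch of the interface above (written as if inside the tensorflow namespace); the GPU count, byte budget, bus id, and logging visitor are illustrative assumptions, not values taken from this commit.

// Sketch only: obtain the per-process singleton and wire up GPU allocation.
ProcessState* ps = ProcessState::singleton();
ps->SetGPUCount(1);  // assumed: one visible GPU in this process
// Request ~2 GiB for GPU 0; only the first call's byte count takes effect.
Allocator* gpu_allocator = ps->GetGPUAllocator(0 /* gpu_id */, 2ULL << 30);
CHECK(gpu_allocator != nullptr);
// Log every new region handed out by allocators proximate to bus 0.
ps->AddGPUAllocVisitor(0 /* bus_id */, [](void* ptr, size_t num_bytes) {
  LOG(INFO) << "new GPU region at " << ptr << " (" << num_bytes << " bytes)";
});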
|
30
tensorflow/core/common_runtime/gpu/visitable_allocator.h
Normal file
@ -0,0 +1,30 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_VISITABLE_ALLOCATOR_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_VISITABLE_ALLOCATOR_H_
|
||||
|
||||
#include <functional>
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
// Subclass VisitableAllocator instead of Allocator when a memory
|
||||
// allocator needs to enable some kind of registration/deregistration
|
||||
// of memory areas.
|
||||
class VisitableAllocator : public Allocator {
|
||||
public:
|
||||
// Visitor gets called with a pointer to a memory area and its
|
||||
// size in bytes.
|
||||
typedef std::function<void(void*, size_t)> Visitor;
|
||||
|
||||
// Register a visitor guaranteed to be called exactly once on each
|
||||
// chunk of memory newly allocated from the underlying device.
|
||||
// Typically, chunks will be reused and possibly sub-divided by a
|
||||
// pool manager, so the calls will happen only once per process
|
||||
// execution, not once per tensor (re)allocation.
|
||||
virtual void AddAllocVisitor(Visitor visitor) = 0;
|
||||
|
||||
// Register a visitor guaranteed to be called on each chunk of
|
||||
// memory returned to the underlying device.
|
||||
virtual void AddFreeVisitor(Visitor visitor) = 0;
|
||||
};
|
||||
} // namespace tensorflow
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_VISITABLE_ALLOCATOR_H_
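For illustration, a minimal hypothetical VisitableAllocator subclass that wraps another Allocator and replays the registered visitors; it is a sketch of the contract above, not code from this commit (again assuming the tensorflow namespace).

// Hypothetical sketch: forwards allocation to a base Allocator and invokes
// the registered visitors on every allocation and deallocation.
class ForwardingVisitableAllocator : public VisitableAllocator {
 public:
  explicit ForwardingVisitableAllocator(Allocator* base) : base_(base) {}
  string Name() override { return "forwarding_visitable"; }
  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    void* ptr = base_->AllocateRaw(alignment, num_bytes);
    for (const auto& v : alloc_visitors_) v(ptr, num_bytes);
    return ptr;
  }
  void DeallocateRaw(void* ptr) override {
    for (const auto& v : free_visitors_) v(ptr, 0 /* size not tracked here */);
    base_->DeallocateRaw(ptr);
  }
  void AddAllocVisitor(Visitor visitor) override {
    alloc_visitors_.push_back(visitor);
  }
  void AddFreeVisitor(Visitor visitor) override {
    free_visitors_.push_back(visitor);
  }

 private:
  Allocator* base_;  // not owned
  std::vector<Visitor> alloc_visitors_;
  std::vector<Visitor> free_visitors_;
};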
|
45
tensorflow/core/common_runtime/gpu_device_context.h
Normal file
@ -0,0 +1,45 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_GPU_DEVICE_CONTEXT_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_GPU_DEVICE_CONTEXT_H_
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/framework/device_base.h"
|
||||
|
||||
namespace perftools {
|
||||
namespace gputools {
|
||||
class Stream;
|
||||
} // namespace gputools
|
||||
} // namespace perftools
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace gpu = ::perftools::gputools;
|
||||
|
||||
class GPUDeviceContext : public DeviceContext {
|
||||
public:
|
||||
GPUDeviceContext(int stream_id, gpu::Stream* stream)
|
||||
: stream_id_(stream_id), stream_(stream) {}
|
||||
|
||||
~GPUDeviceContext() override {}
|
||||
|
||||
gpu::Stream* stream() const override { return stream_; }
|
||||
int stream_id() const { return stream_id_; }
|
||||
|
||||
void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
|
||||
Tensor* device_tensor,
|
||||
StatusCallback done) const override;
|
||||
|
||||
void CopyDeviceTensorToCPU(const Tensor* device_tensor,
|
||||
const string& edge_name, Device* device,
|
||||
Tensor* cpu_tensor, StatusCallback done) override;
|
||||
|
||||
void MaintainLifetimeOnStream(
|
||||
const Tensor* t, perftools::gputools::Stream* stream) const override {}
|
||||
|
||||
private:
|
||||
int stream_id_;
|
||||
gpu::Stream* stream_;
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_GPU_DEVICE_CONTEXT_H_
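A hedged sketch of how this context might be used to stage a host tensor onto the device; stream, gpu_device, cpu_tensor, and device_tensor are assumed to exist elsewhere and are not defined in this commit.

// Sketch only: copy a CPU tensor to the GPU through the device context.
GPUDeviceContext* ctx = new GPUDeviceContext(0 /* stream_id */, stream);
ctx->CopyCPUTensorToDevice(&cpu_tensor, gpu_device, &device_tensor,
                           [](const Status& s) { TF_CHECK_OK(s); });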
|
160
tensorflow/core/common_runtime/kernel_benchmark_testlib.cc
Normal file
@ -0,0 +1,160 @@
|
||||
#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/device_factory.h"
|
||||
#include "tensorflow/core/framework/op.h"
|
||||
#include "tensorflow/core/framework/op_kernel.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/framework/op_segment.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/kernels/ops_util.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/platform/test_benchmark.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
|
||||
#if defined(PLATFORM_GOOGLE)
|
||||
DECLARE_bool(brain_gpu_use_bfc_allocator);
|
||||
#else
|
||||
extern bool FLAGS_brain_gpu_use_bfc_allocator;
|
||||
#endif
|
||||
|
||||
namespace tensorflow {
|
||||
namespace test {
|
||||
|
||||
Benchmark::Benchmark(const string& device, Graph* g,
|
||||
const SessionOptions* options, Graph* init) {
|
||||
RequireDefaultOps();
|
||||
|
||||
FLAGS_brain_gpu_use_bfc_allocator = true;
|
||||
|
||||
SessionOptions default_options;
|
||||
if (!options) {
|
||||
options = &default_options;
|
||||
}
|
||||
|
||||
testing::StopTiming();
|
||||
string t = str_util::Uppercase(device);
|
||||
device_ =
|
||||
DeviceFactory::NewDevice(t, *options, "/job:localhost/replica:0/task:0");
|
||||
CHECK(device_) << "Could not create a " << device << " device";
|
||||
|
||||
pool_ = new thread::ThreadPool(options->env, "blocking",
|
||||
port::NumSchedulableCPUs());
|
||||
|
||||
auto runner = [this](std::function<void()> closure) {
|
||||
pool_->Schedule(closure);
|
||||
};
|
||||
|
||||
rendez_ = NewLocalRendezvous();
|
||||
|
||||
if (init) {
|
||||
Executor* init_exec;
|
||||
TF_CHECK_OK(NewLocalExecutor(
|
||||
{
|
||||
device_, nullptr, false,
|
||||
[this](const NodeDef& ndef, OpKernel** kernel) {
|
||||
return CreateNonCachedKernel(device_, nullptr, ndef, kernel);
|
||||
},
|
||||
[](OpKernel* kernel) { DeleteNonCachedKernel(kernel); },
|
||||
},
|
||||
init, &init_exec));
|
||||
Executor::Args args;
|
||||
args.rendezvous = rendez_;
|
||||
args.runner = runner;
|
||||
TF_CHECK_OK(init_exec->Run(args));
|
||||
delete init_exec;
|
||||
}
|
||||
|
||||
TF_CHECK_OK(NewLocalExecutor(
|
||||
{
|
||||
device_,
|
||||
nullptr,
|
||||
false,
|
||||
[this](const NodeDef& ndef, OpKernel** kernel) {
|
||||
return CreateNonCachedKernel(device_, nullptr, ndef, kernel);
|
||||
},
|
||||
[](OpKernel* kernel) { DeleteNonCachedKernel(kernel); },
|
||||
},
|
||||
g, &exec_));
|
||||
}
|
||||
|
||||
Benchmark::~Benchmark() {
|
||||
if (device_) {
|
||||
rendez_->Unref();
|
||||
delete exec_;
|
||||
delete device_;
|
||||
delete pool_;
|
||||
}
|
||||
}
|
||||
|
||||
void Benchmark::Run(int iters) { RunWithArgs({}, {}, iters); }
|
||||
|
||||
string GetRendezvousKey(const Node* node) {
|
||||
string send_device;
|
||||
TF_CHECK_OK(GetNodeAttr(node->def(), "send_device", &send_device));
|
||||
string recv_device;
|
||||
TF_CHECK_OK(GetNodeAttr(node->def(), "recv_device", &recv_device));
|
||||
string tensor_name;
|
||||
TF_CHECK_OK(GetNodeAttr(node->def(), "tensor_name", &tensor_name));
|
||||
uint64 send_device_incarnation;
|
||||
TF_CHECK_OK(GetNodeAttr(node->def(), "send_device_incarnation",
|
||||
reinterpret_cast<int64*>(&send_device_incarnation)));
|
||||
return Rendezvous::CreateKey(send_device, send_device_incarnation,
|
||||
recv_device, tensor_name, FrameAndIter(0, 0));
|
||||
}
|
||||
|
||||
void Benchmark::RunWithArgs(
|
||||
const std::vector<std::pair<const Node*, Tensor>>& inputs,
|
||||
const std::vector<const Node*>& outputs, int iters) {
|
||||
if (device_) {
|
||||
// Gets inputs' and outputs' rendezvous keys.
|
||||
std::vector<std::pair<string, Tensor>> in;
|
||||
for (const auto& p : inputs) {
|
||||
in.push_back({GetRendezvousKey(p.first), p.second});
|
||||
}
|
||||
std::vector<string> out;
|
||||
for (const auto& n : outputs) {
|
||||
out.push_back(GetRendezvousKey(n));
|
||||
}
|
||||
Tensor unused; // In the benchmark, we don't care about the return value.
|
||||
bool is_dead;
|
||||
|
||||
// Warm up
|
||||
Executor::Args args;
|
||||
args.rendezvous = rendez_;
|
||||
args.runner = [this](std::function<void()> closure) {
|
||||
pool_->Schedule(closure);
|
||||
};
|
||||
for (int i = 0; i < 3; ++i) {
|
||||
for (const auto& p : in) {
|
||||
rendez_->Send(p.first, Rendezvous::Args(), p.second, false);
|
||||
}
|
||||
TF_CHECK_OK(exec_->Run(args));
|
||||
for (const string& key : out) {
|
||||
rendez_->Recv(key, Rendezvous::Args(), &unused, &is_dead);
|
||||
}
|
||||
}
|
||||
TF_CHECK_OK(device_->Sync());
|
||||
|
||||
testing::StartTiming();
|
||||
while (iters-- > 0) {
|
||||
for (const auto& p : in) {
|
||||
rendez_->Send(p.first, Rendezvous::Args(), p.second, false);
|
||||
}
|
||||
TF_CHECK_OK(exec_->Run(args));
|
||||
for (const string& key : out) {
|
||||
rendez_->Recv(key, Rendezvous::Args(), &unused, &is_dead);
|
||||
}
|
||||
}
|
||||
|
||||
TF_CHECK_OK(device_->Sync());
|
||||
testing::StopTiming();
|
||||
}
|
||||
}
|
||||
|
||||
} // end namespace test
|
||||
} // end namespace tensorflow
|
52
tensorflow/core/common_runtime/kernel_benchmark_testlib.h
Normal file
@ -0,0 +1,52 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_KERNEL_BENCHMARK_TESTLIB_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_KERNEL_BENCHMARK_TESTLIB_H_
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/common_runtime/executor.h"
|
||||
#include "tensorflow/core/graph/testlib.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class Device;
|
||||
class SessionOptions;
|
||||
|
||||
namespace test {
|
||||
|
||||
class Benchmark {
|
||||
public:
|
||||
// "device" must be either "cpu" or "gpu". Takes ownership of "g"
|
||||
// and "init".
|
||||
Benchmark(const string& device, Graph* g,
|
||||
const SessionOptions* options = nullptr, Graph* init = nullptr);
|
||||
~Benchmark();
|
||||
|
||||
// Executes the graph for "iters" times.
|
||||
void Run(int iters);
|
||||
|
||||
// If "g" contains send/recv nodes, before each execution, we send
|
||||
// inputs to the corresponding recv nodes in the graph; after each
|
||||
// execution, we recv outputs from the corresponding send nodes in
|
||||
// the graph. In the benchmark, we throw away values returned by the
|
||||
// graph.
|
||||
void RunWithArgs(const std::vector<std::pair<const Node*, Tensor>>& inputs,
|
||||
const std::vector<const Node*>& outputs, int iters);
|
||||
|
||||
private:
|
||||
thread::ThreadPool* pool_ = nullptr;
|
||||
thread::ThreadPool* non_blocking_pool_ = nullptr;
|
||||
Device* device_ = nullptr;
|
||||
Rendezvous* rendez_ = nullptr;
|
||||
Executor* exec_ = nullptr;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(Benchmark);
|
||||
};
|
||||
|
||||
} // end namespace test
|
||||
} // end namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_KERNEL_BENCHMARK_TESTLIB_H_
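A hedged sketch of driving the Benchmark helper declared above; the graph construction mirrors test::graph usage elsewhere in this commit, but the specific op, shapes, and benchmark macro wiring are illustrative assumptions.

// Sketch: benchmark a tiny CPU graph. Benchmark takes ownership of "g".
static void BM_IdentitySketch(int iters) {
  Graph* g = new Graph(OpRegistry::Global());
  Tensor data(DT_FLOAT, TensorShape({8, 8}));
  data.flat<float>().setRandom();
  Node* c = test::graph::Constant(g, data);
  test::graph::Unary(g, "Identity", c);
  test::Benchmark("cpu", g).Run(iters);
}
BENCHMARK(BM_IdentitySketch);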
|
51
tensorflow/core/common_runtime/local_device.cc
Normal file
@ -0,0 +1,51 @@
|
||||
#define EIGEN_USE_THREADS
|
||||
|
||||
#include "tensorflow/core/common_runtime/eigen_thread_pool.h"
|
||||
#include "tensorflow/core/common_runtime/local_device.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
DeviceBase::CpuWorkerThreads eigen_worker_threads;
|
||||
Eigen::ThreadPoolInterface* eigen_thread_pool = nullptr;
|
||||
Eigen::ThreadPoolDevice* eigen_device = nullptr;
|
||||
|
||||
static bool InitModule(const SessionOptions& options) {
|
||||
int32 intra_op_parallelism_threads =
|
||||
options.config.intra_op_parallelism_threads();
|
||||
if (intra_op_parallelism_threads == 0) {
|
||||
intra_op_parallelism_threads = port::NumSchedulableCPUs();
|
||||
}
|
||||
LOG(INFO) << "Local device intra op parallelism threads: "
|
||||
<< intra_op_parallelism_threads;
|
||||
eigen_worker_threads.num_threads = intra_op_parallelism_threads;
|
||||
eigen_worker_threads.workers = new thread::ThreadPool(
|
||||
options.env, "Eigen", intra_op_parallelism_threads);
|
||||
eigen_thread_pool = new EigenThreadPoolWrapper(eigen_worker_threads.workers);
|
||||
eigen_device = new Eigen::ThreadPoolDevice(eigen_thread_pool,
|
||||
eigen_worker_threads.num_threads);
|
||||
return true;
|
||||
}
|
||||
} // end namespace
|
||||
|
||||
// LocalDevice ----------------------------------------------------------------
|
||||
|
||||
LocalDevice::LocalDevice(const SessionOptions& options,
|
||||
const DeviceAttributes& attributes,
|
||||
Allocator* device_allocator)
|
||||
: Device(options.env, attributes, device_allocator) {
|
||||
// All ThreadPoolDevices in the process will use this single fixed
|
||||
// sized threadpool for numerical computations.
|
||||
static bool init = InitModule(options);
|
||||
CHECK(init); // Avoids compiler warning that init is unused.
|
||||
set_tensorflow_cpu_worker_threads(&eigen_worker_threads);
|
||||
set_eigen_cpu_device(eigen_device);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
27
tensorflow/core/common_runtime/local_device.h
Normal file
@ -0,0 +1,27 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_LOCAL_DEVICE_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_LOCAL_DEVICE_H_
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/framework/device_attributes.pb.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class SessionOptions;
|
||||
|
||||
// This class is shared by ThreadPoolDevice and GPUDevice and
|
||||
// initializes a shared Eigen compute device used by both. This
|
||||
// should eventually be removed once we refactor ThreadPoolDevice and
|
||||
// GPUDevice into more 'process-wide' abstractions.
|
||||
class LocalDevice : public Device {
|
||||
public:
|
||||
LocalDevice(const SessionOptions& options, const DeviceAttributes& attributes,
|
||||
Allocator* device_allocator);
|
||||
~LocalDevice() override {}
|
||||
|
||||
private:
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(LocalDevice);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_LOCAL_DEVICE_H_
|
500
tensorflow/core/common_runtime/local_session.cc
Normal file
@ -0,0 +1,500 @@
|
||||
#include "tensorflow/core/common_runtime/local_session.h"
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device_factory.h"
|
||||
#include "tensorflow/core/common_runtime/executor.h"
|
||||
#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
|
||||
#include "tensorflow/core/common_runtime/session_factory.h"
|
||||
#include "tensorflow/core/common_runtime/simple_placer.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/graph/algorithm.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/graph/graph_constructor.h"
|
||||
#include "tensorflow/core/graph/graph_partition.h"
|
||||
#include "tensorflow/core/graph/subgraph.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/lib/core/refcount.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/lib/gtl/array_slice.h"
|
||||
#include "tensorflow/core/lib/gtl/stl_util.h"
|
||||
#include "tensorflow/core/lib/random/random.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
thread::ThreadPool* kernel_thread_pool_ = nullptr;
|
||||
static bool InitModule(const SessionOptions& options) {
|
||||
int32 inter_op_parallelism_threads =
|
||||
options.config.inter_op_parallelism_threads();
|
||||
if (inter_op_parallelism_threads == 0) {
|
||||
// Default to using the number of cores available in the process.
|
||||
inter_op_parallelism_threads = port::NumSchedulableCPUs();
|
||||
}
|
||||
LOG(INFO) << "Local session inter op parallelism threads: "
|
||||
<< inter_op_parallelism_threads;
|
||||
kernel_thread_pool_ = new thread::ThreadPool(options.env, "Compute",
|
||||
inter_op_parallelism_threads);
|
||||
return true;
|
||||
}
|
||||
|
||||
// TODO(vrv): Figure out how to unify the many different functions
|
||||
// that generate RendezvousKey, since many of them have to be
|
||||
// consistent with each other.
|
||||
string GetRendezvousKey(const string& tensor_name,
|
||||
const DeviceAttributes& device_info,
|
||||
const FrameAndIter& frame_iter) {
|
||||
return strings::StrCat(device_info.name(), ";",
|
||||
strings::FpToString(device_info.incarnation()), ";",
|
||||
device_info.name(), ";", tensor_name, ";",
|
||||
frame_iter.frame_id, ":", frame_iter.iter_id);
|
||||
}
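For concreteness, a hedged illustration of the key this helper produces: fetching tensor "y:0" on a client device named "/job:localhost/replica:0/task:0/cpu:0" with an incarnation fingerprint of 0x1 at FrameAndIter(0, 0) would yield roughly

// /job:localhost/replica:0/task:0/cpu:0;0000000000000001;/job:localhost/replica:0/task:0/cpu:0;y:0;0:0

where the exact incarnation formatting depends on strings::FpToString.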
|
||||
|
||||
// NOTE: On Android with a single device, there is never
|
||||
// a risk of an OpKernel blocking indefinitely:
|
||||
//
|
||||
// 1) No operations do I/O that depends on other simultaneous kernels,
|
||||
//
|
||||
// 2) Recv nodes always complete immediately: The inputs are sent into
|
||||
// the local rendezvous before we start the executor, so the
|
||||
// corresponding recvs will not block.
|
||||
//
|
||||
// Based on these assumptions, we can use the same thread pool for
|
||||
// both "non-blocking" and "blocking" OpKernels on Android.
|
||||
//
|
||||
// This may change down the road when we add support for multiple
|
||||
// devices that run concurrently, in which case we will need to
|
||||
// revisit this decision.
|
||||
void SchedClosure(std::function<void()> c) {
|
||||
// TODO(sanjay): Get rid of __ANDROID__ path
|
||||
#ifdef __ANDROID__
|
||||
// On Android, there is no implementation of ThreadPool that takes
|
||||
// std::function, only Closure, which we cannot easily convert.
|
||||
//
|
||||
// Instead, we just run the function in-line, which is currently
|
||||
// safe given the reasoning above.
|
||||
c();
|
||||
#else
|
||||
kernel_thread_pool_->Schedule(c);
|
||||
#endif // __ANDROID__
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
LocalSession::LocalSession(const SessionOptions& options,
|
||||
const DeviceMgr* device_mgr)
|
||||
: options_(options),
|
||||
device_mgr_(device_mgr),
|
||||
cancellation_manager_(new CancellationManager()) {
|
||||
static bool init = InitModule(options);
|
||||
CHECK(init); // Avoids compiler warning that init is unused.
|
||||
session_handle_ = strings::FpToString(random::New64());
|
||||
int devices_added = 0;
|
||||
if (options.config.log_device_placement()) {
|
||||
const string mapping_str = device_mgr_->DeviceMappingString();
|
||||
printf("Device mapping:\n%s", mapping_str.c_str());
|
||||
LOG(INFO) << "Device mapping:\n" << mapping_str;
|
||||
}
|
||||
for (auto d : device_mgr_->ListDevices()) {
|
||||
devices_.push_back(d);
|
||||
device_set_.AddDevice(d);
|
||||
d->op_segment()->AddHold(session_handle_);
|
||||
|
||||
// The first device added is special: it is the 'client device' (a
|
||||
// CPU device) from which we feed and fetch Tensors.
|
||||
if (devices_added == 0) {
|
||||
device_set_.set_client_device(d);
|
||||
}
|
||||
++devices_added;
|
||||
}
|
||||
}
|
||||
|
||||
LocalSession::~LocalSession() {
|
||||
for (auto d : device_mgr_->ListDevices()) {
|
||||
d->op_segment()->RemoveHold(session_handle_);
|
||||
}
|
||||
for (auto it : executors_) {
|
||||
delete it.second;
|
||||
}
|
||||
delete cancellation_manager_;
|
||||
}
|
||||
|
||||
Status LocalSession::Create(const GraphDef& graph) {
|
||||
mutex_lock l(graph_def_lock_);
|
||||
if (graph_created_) {
|
||||
return errors::AlreadyExists(
|
||||
"A Graph has already been created for this session.");
|
||||
}
|
||||
return ExtendLocked(graph);
|
||||
}
|
||||
|
||||
Status LocalSession::Extend(const GraphDef& graph) {
|
||||
mutex_lock l(graph_def_lock_);
|
||||
return ExtendLocked(graph);
|
||||
}
|
||||
|
||||
Status LocalSession::ExtendLocked(const GraphDef& graph) {
|
||||
graph_created_ = true; // In case this is the first call
|
||||
graph_def_.MergeFrom(graph);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status LocalSession::Run(const std::vector<std::pair<string, Tensor>>& inputs,
|
||||
const std::vector<string>& output_names,
|
||||
const std::vector<string>& target_nodes,
|
||||
std::vector<Tensor>* outputs) {
|
||||
{
|
||||
mutex_lock l(graph_def_lock_);
|
||||
if (!graph_created_) {
|
||||
return errors::InvalidArgument(
|
||||
"Session was not created with a graph before Run()!");
|
||||
}
|
||||
}
|
||||
|
||||
// Extract the input names for this run of the session.
|
||||
std::vector<string> input_tensor_names;
|
||||
input_tensor_names.reserve(inputs.size());
|
||||
for (const auto& it : inputs) {
|
||||
input_tensor_names.push_back(it.first);
|
||||
}
|
||||
|
||||
// Check if we already have an executor for these arguments.
|
||||
ExecutorsAndKeys* executors_and_keys;
|
||||
Status s = GetOrCreateExecutors(input_tensor_names, output_names,
|
||||
target_nodes, &executors_and_keys);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
IntraProcessRendezvous* rendez =
|
||||
new IntraProcessRendezvous(device_mgr_.get());
|
||||
core::ScopedUnref rendez_unref(rendez);
|
||||
|
||||
// Insert the input tensors into the local rendezvous by their
|
||||
// rendezvous key.
|
||||
for (const auto& input : inputs) {
|
||||
const string& input_key = executors_and_keys->input_keys[input.first];
|
||||
s = rendez->Send(input_key, Rendezvous::Args(), input.second, false);
|
||||
if (!s.ok()) {
|
||||
rendez->StartAbort(s);
|
||||
return s;
|
||||
}
|
||||
}
|
||||
|
||||
// Start parallel Executors.
|
||||
Notification executors_done;
|
||||
const int num_executors = executors_and_keys->device_executors.size();
|
||||
ExecutorBarrier* barrier = new ExecutorBarrier(
|
||||
num_executors, rendez, [&executors_done, &s](const Status& ret) {
|
||||
s = ret;
|
||||
executors_done.Notify();
|
||||
});
|
||||
|
||||
Executor::Args args;
|
||||
args.rendezvous = rendez;
|
||||
args.cancellation_manager = cancellation_manager_;
|
||||
args.runner = SchedClosure;
|
||||
|
||||
for (auto device_executor : executors_and_keys->device_executors) {
|
||||
Executor* exec = device_executor.second;
|
||||
exec->RunAsync(args, barrier->Get());
|
||||
}
|
||||
|
||||
executors_done.WaitForNotification();
|
||||
|
||||
TF_RETURN_IF_ERROR(s);
|
||||
|
||||
if (!output_names.empty()) {
|
||||
outputs->resize(output_names.size());
|
||||
}
|
||||
|
||||
// Get the outputs from the rendezvous
|
||||
for (size_t output_offset = 0; output_offset < output_names.size();
|
||||
++output_offset) {
|
||||
const string& output_key =
|
||||
executors_and_keys->output_keys[output_names[output_offset]];
|
||||
Tensor output_tensor;
|
||||
bool is_dead;
|
||||
|
||||
// Fetch data from the Rendezvous.
|
||||
s = rendez->Recv(output_key, Rendezvous::Args(), &output_tensor, &is_dead);
|
||||
if (is_dead) {
|
||||
s = errors::InvalidArgument("The tensor returned for ",
|
||||
output_names[output_offset],
|
||||
" was not valid.");
|
||||
}
|
||||
if (!s.ok()) {
|
||||
rendez->StartAbort(s);
|
||||
outputs->clear();
|
||||
return s;
|
||||
}
|
||||
|
||||
(*outputs)[output_offset] = output_tensor;
|
||||
}
|
||||
|
||||
return s;
|
||||
}
|
||||
|
||||
Status LocalSession::GetOrCreateExecutors(
|
||||
gtl::ArraySlice<string> inputs, gtl::ArraySlice<string> outputs,
|
||||
gtl::ArraySlice<string> target_nodes,
|
||||
ExecutorsAndKeys** executors_and_keys) {
|
||||
// Sort the inputs and outputs, so we don't create separate
|
||||
// executors when a user passes in the same inputs/outputs in
|
||||
// different orders.
|
||||
//
|
||||
// We could consider some other signature instead of sorting that
|
||||
// preserves the same property to avoid the sort in the future.
|
||||
std::vector<string> inputs_sorted(inputs.begin(), inputs.end());
|
||||
std::vector<string> outputs_sorted(outputs.begin(), outputs.end());
|
||||
std::vector<string> tn_sorted(target_nodes.begin(), target_nodes.end());
|
||||
std::sort(inputs_sorted.begin(), inputs_sorted.end());
|
||||
std::sort(outputs_sorted.begin(), outputs_sorted.end());
|
||||
std::sort(tn_sorted.begin(), tn_sorted.end());
|
||||
|
||||
const string key = strings::StrCat(str_util::Join(inputs_sorted, ","), "->",
|
||||
str_util::Join(outputs_sorted, ","), "/",
|
||||
str_util::Join(tn_sorted, ","));
|
||||
|
||||
// See if we already have the executors for this run.
|
||||
{
|
||||
mutex_lock l(executor_lock_); // could use reader lock
|
||||
auto it = executors_.find(key);
|
||||
if (it != executors_.end()) {
|
||||
*executors_and_keys = it->second;
|
||||
return Status::OK();
|
||||
}
|
||||
}
|
||||
|
||||
// The executor_lock_ is intentionally released while the executors are
|
||||
// being created.
|
||||
std::unordered_map<string, Graph*> graphs;
|
||||
Status s = CreateGraphs(inputs, outputs, target_nodes, &graphs);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
bool has_control_flow = false;
|
||||
for (const auto& graph : graphs) {
|
||||
for (const Node* n : graph.second->nodes()) {
|
||||
if (IsControlFlow(n)) {
|
||||
has_control_flow = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (has_control_flow) break;
|
||||
}
|
||||
|
||||
std::unique_ptr<ExecutorsAndKeys> ek(new ExecutorsAndKeys);
|
||||
|
||||
for (const auto& graph : graphs) {
|
||||
const string& partition_name = graph.first;
|
||||
Graph* partition_graph = graph.second;
|
||||
|
||||
Device* d;
|
||||
s = device_mgr_->LookupDevice(partition_name, &d);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
LocalExecutorParams params;
|
||||
params.has_control_flow = has_control_flow;
|
||||
params.device = d;
|
||||
params.create_kernel = [this, d](const NodeDef& ndef, OpKernel** kernel) {
|
||||
return CreateCachedKernel(d, session_handle_, nullptr, ndef, kernel);
|
||||
};
|
||||
params.delete_kernel = [this, d](OpKernel* kernel) {
|
||||
DeleteCachedKernel(d, session_handle_, kernel);
|
||||
};
|
||||
|
||||
Executor* tmp_exec;
|
||||
s = NewLocalExecutor(params, partition_graph, &tmp_exec);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
ek->device_executors.insert(std::make_pair(graph.first, tmp_exec));
|
||||
}
|
||||
|
||||
// Compute the rendezvous keys to avoid recomputing them every time.
|
||||
//
|
||||
// We always use the first device as the device name portion of the
|
||||
// key, even if we're feeding another graph.
|
||||
for (const string& input : inputs) {
|
||||
ek->input_keys[input] = GetRendezvousKey(
|
||||
input, device_set_.client_device()->attributes(), FrameAndIter(0, 0));
|
||||
}
|
||||
for (const string& output : outputs) {
|
||||
ek->output_keys[output] = GetRendezvousKey(
|
||||
output, device_set_.client_device()->attributes(), FrameAndIter(0, 0));
|
||||
}
|
||||
|
||||
// Reacquire the lock, try to insert into the map.
|
||||
mutex_lock l(executor_lock_);
|
||||
const bool inserted = executors_.insert(std::make_pair(key, ek.get())).second;
|
||||
if (!inserted) {
|
||||
// Another thread created the entry before us, so delete the
|
||||
// one we created and return the already created one.
|
||||
auto it = executors_.find(key);
|
||||
*executors_and_keys = it->second;
|
||||
} else {
|
||||
*executors_and_keys = ek.release();
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
void LocalSession::SaveStatefulNodes(Graph* graph) {
|
||||
for (Node* n : graph->nodes()) {
|
||||
if (n->op_def().is_stateful()) {
|
||||
VLOG(2) << "Saving " << n->DebugString();
|
||||
stateful_placements_[n->name()] = n->assigned_device_name();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void LocalSession::RestoreStatefulNodes(Graph* graph) {
|
||||
for (Node* n : graph->nodes()) {
|
||||
if (n->op_def().is_stateful()) {
|
||||
auto iter = stateful_placements_.find(n->name());
|
||||
if (iter != stateful_placements_.end()) {
|
||||
n->set_assigned_device_name(iter->second);
|
||||
VLOG(2) << "Restored " << n->DebugString();
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Status LocalSession::CreateGraphs(gtl::ArraySlice<string> feeds,
|
||||
gtl::ArraySlice<string> fetches,
|
||||
gtl::ArraySlice<string> target_nodes,
|
||||
std::unordered_map<string, Graph*>* outputs) {
|
||||
Graph graph(OpRegistry::Global());
|
||||
GraphConstructorOptions opts;
|
||||
|
||||
{
|
||||
mutex_lock l(graph_def_lock_);
|
||||
TF_RETURN_IF_ERROR(ConvertGraphDefToGraph(opts, graph_def_, &graph));
|
||||
}
|
||||
|
||||
TF_RETURN_IF_ERROR(subgraph::RewriteGraphForExecution(
|
||||
&graph, feeds, fetches, target_nodes,
|
||||
device_set_.client_device()->attributes()));
|
||||
|
||||
// Run the simple placer after rewriting the graph.
|
||||
std::unordered_map<string, int32> node_name_to_cost_map;
|
||||
for (Node* n : graph.nodes()) {
|
||||
node_name_to_cost_map[n->name()] = n->cost_id();
|
||||
}
|
||||
SimplePlacer placer(&graph, &device_set_, &node_name_to_cost_map, &options_);
|
||||
|
||||
{
|
||||
mutex_lock l(mu_);
|
||||
// Restore stateful nodes.
|
||||
RestoreStatefulNodes(&graph);
|
||||
TF_RETURN_IF_ERROR(placer.Run());
|
||||
// Save stateful nodes.
|
||||
SaveStatefulNodes(&graph);
|
||||
}
|
||||
|
||||
// Partition the graph across devices.
|
||||
std::unordered_map<string, GraphDef> partitions;
|
||||
PartitionOptions popts;
|
||||
popts.node_to_loc = [](const Node* node) {
|
||||
return node->assigned_device_name();
|
||||
};
|
||||
popts.new_name = [this](const string& prefix) {
|
||||
mutex_lock l(mu_);
|
||||
return strings::StrCat(prefix, "/_", name_counter_++);
|
||||
};
|
||||
popts.get_incarnation = [](const string& name) {
|
||||
// The local session does not have changing incarnation numbers.
|
||||
// Just return '1'.
|
||||
return 1;
|
||||
};
|
||||
popts.control_flow_added = false;
|
||||
TF_RETURN_IF_ERROR(Partition(popts, &graph, &partitions));
|
||||
|
||||
std::vector<string> device_names;
|
||||
for (auto device : devices_) {
|
||||
// Extract the LocalName from the device.
|
||||
device_names.push_back(DeviceNameUtils::LocalName(device->name()));
|
||||
}
|
||||
|
||||
// Check for valid partitions.
|
||||
for (const auto& partition : partitions) {
|
||||
const string& local_partition_name =
|
||||
DeviceNameUtils::LocalName(partition.first);
|
||||
if (std::count(device_names.begin(), device_names.end(),
|
||||
local_partition_name) == 0) {
|
||||
return errors::InvalidArgument(
|
||||
"Creating a partition for ", local_partition_name,
|
||||
" which doesn't exist in the list of available devices. Available "
|
||||
"devices: ",
|
||||
str_util::Join(device_names, ","));
|
||||
}
|
||||
}
|
||||
|
||||
for (const auto& partition : partitions) {
|
||||
const string& partition_name = partition.first;
|
||||
|
||||
const GraphDef& graph_def = partition.second;
|
||||
VLOG(2) << "Created " << graph_def.DebugString() << " for "
|
||||
<< partition_name;
|
||||
|
||||
Graph* device_graph = new Graph(OpRegistry::Global());
|
||||
GraphConstructorOptions device_opts;
|
||||
// There are internal operations (e.g., send/recv) that we now
|
||||
// allow.
|
||||
device_opts.allow_internal_ops = true;
|
||||
device_opts.expect_device_spec = true;
|
||||
Status s =
|
||||
ConvertGraphDefToGraph(device_opts, graph_def, device_graph);
|
||||
if (!s.ok()) {
|
||||
delete device_graph;
|
||||
// Also delete other graphs created during the loop.
|
||||
gtl::STLDeleteValues(outputs);
|
||||
return s;
|
||||
}
|
||||
outputs->insert(std::make_pair(partition_name, device_graph));
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
::tensorflow::Status LocalSession::Close() {
|
||||
cancellation_manager_->StartCancel();
|
||||
return ::tensorflow::Status::OK();
|
||||
}
|
||||
|
||||
class LocalSessionFactory : public SessionFactory {
|
||||
public:
|
||||
LocalSessionFactory() {}
|
||||
|
||||
Session* NewSession(const SessionOptions& options) override {
|
||||
std::vector<Device*> devices;
|
||||
DeviceFactory::AddDevices(options, "/job:localhost/replica:0/task:0",
|
||||
&devices);
|
||||
return new LocalSession(options, new DeviceMgr(devices));
|
||||
}
|
||||
};
|
||||
|
||||
class LocalSessionRegistrar {
|
||||
public:
|
||||
LocalSessionRegistrar() {
|
||||
SessionFactory::Register("LOCAL_SESSION", new LocalSessionFactory());
|
||||
}
|
||||
};
|
||||
static LocalSessionRegistrar registrar;
|
||||
|
||||
} // namespace tensorflow
|
109
tensorflow/core/common_runtime/local_session.h
Normal file
@ -0,0 +1,109 @@
|
||||
#ifndef TENSORFLOW_COMMON_RUNTIME_LOCAL_SESSION_H_
|
||||
#define TENSORFLOW_COMMON_RUNTIME_LOCAL_SESSION_H_
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/platform/thread_annotations.h"
|
||||
#include "tensorflow/core/common_runtime/device_mgr.h"
|
||||
#include "tensorflow/core/common_runtime/device_set.h"
|
||||
#include "tensorflow/core/common_runtime/executor.h"
|
||||
#include "tensorflow/core/framework/cancellation.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
#include "tensorflow/core/public/session.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
class Device;
|
||||
|
||||
class LocalSession : public Session {
|
||||
public:
|
||||
// Takes ownership of 'device_mgr'.
|
||||
LocalSession(const SessionOptions& options, const DeviceMgr* device_mgr);
|
||||
~LocalSession() override;
|
||||
|
||||
::tensorflow::Status Create(const GraphDef& graph) override;
|
||||
::tensorflow::Status Extend(const GraphDef& graph) override;
|
||||
::tensorflow::Status Run(const std::vector<std::pair<string, Tensor>>& inputs,
|
||||
const std::vector<string>& output_names,
|
||||
const std::vector<string>& target_nodes,
|
||||
std::vector<Tensor>* outputs) override;
|
||||
::tensorflow::Status Close() override;
|
||||
|
||||
private:
|
||||
struct ExecutorsAndKeys {
|
||||
std::unordered_map<string, Executor*> device_executors;
|
||||
std::unordered_map<string, string> input_keys;
|
||||
std::unordered_map<string, string> output_keys;
|
||||
|
||||
~ExecutorsAndKeys() {
|
||||
for (auto it : device_executors) {
|
||||
delete it.second;
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
// Retrieves an already existing set of executors to run 'inputs' and
|
||||
// 'outputs', or creates and caches them for future use.
|
||||
::tensorflow::Status GetOrCreateExecutors(
|
||||
gtl::ArraySlice<string> inputs, gtl::ArraySlice<string> outputs,
|
||||
gtl::ArraySlice<string> target_nodes,
|
||||
ExecutorsAndKeys** executors_and_keys);
|
||||
|
||||
// Creates several graphs given the existing graph_def_ and the
|
||||
// input feeds and fetches, across the given 'devices'.
|
||||
::tensorflow::Status CreateGraphs(
|
||||
gtl::ArraySlice<string> feeds, gtl::ArraySlice<string> fetches,
|
||||
gtl::ArraySlice<string> target_nodes,
|
||||
std::unordered_map<string, Graph*>* outputs);
|
||||
|
||||
::tensorflow::Status ExtendLocked(const GraphDef& graph)
|
||||
EXCLUSIVE_LOCKS_REQUIRED(graph_def_lock_);
|
||||
|
||||
const SessionOptions options_;
|
||||
|
||||
// Device structures.
|
||||
const std::unique_ptr<const DeviceMgr> device_mgr_;
|
||||
std::vector<Device*> devices_; // not owned
|
||||
DeviceSet device_set_;
|
||||
|
||||
string session_handle_;
|
||||
bool graph_created_ GUARDED_BY(graph_def_lock_) = false;
|
||||
|
||||
mutex graph_def_lock_;
|
||||
GraphDef graph_def_ GUARDED_BY(graph_def_lock_);
|
||||
|
||||
mutex executor_lock_; // protects executors_
|
||||
// Holds mappings from signature to the executors that process
|
||||
// it. The reason for a level of indirection around mapped_type is
|
||||
// to guarantee address stability.
|
||||
std::unordered_map<string, ExecutorsAndKeys*> executors_
|
||||
GUARDED_BY(executor_lock_);
|
||||
|
||||
CancellationManager* cancellation_manager_;
|
||||
|
||||
// Saves and restores device placements for stateful nodes.
|
||||
mutex mu_;
|
||||
void SaveStatefulNodes(Graph* graph) EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
void RestoreStatefulNodes(Graph* graph) EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
// Map of placed stateful nodes, i.e. nodes for which is_stateful()
|
||||
// is true, such as "params" and "queue" nodes. Once placed these
|
||||
// nodes cannot be moved to a different device. Maps node names to
|
||||
// device names.
|
||||
std::unordered_map<string, string> stateful_placements_ GUARDED_BY(mu_);
|
||||
|
||||
// For generating unique names.
|
||||
int64 name_counter_ GUARDED_BY(mu_) = 0;
|
||||
|
||||
TF_DISALLOW_COPY_AND_ASSIGN(LocalSession);
|
||||
};
|
||||
|
||||
} // end namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_COMMON_RUNTIME_LOCAL_SESSION_H_
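A short client-side sketch of driving this implementation through the public Session API (compare with the tests that follow); the GraphDef contents and the "y:0" fetch name are assumed for illustration.

// Sketch: create a session, register a graph once, then run it repeatedly.
SessionOptions opts;
std::unique_ptr<Session> sess(NewSession(opts));
GraphDef def;  // assumed to be populated elsewhere
TF_CHECK_OK(sess->Create(def));  // a second Create() returns AlreadyExists
std::vector<Tensor> outputs;
// Repeated Run() calls with the same feeds/fetches/targets reuse cached
// executors, since the lookup key is built from the sorted name lists.
TF_CHECK_OK(sess->Run({}, {"y:0"}, {}, &outputs));
TF_CHECK_OK(sess->Close());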
|
314
tensorflow/core/common_runtime/local_session_test.cc
Normal file
@ -0,0 +1,314 @@
|
||||
#include "tensorflow/core/common_runtime/local_session.h"
|
||||
|
||||
#include <map>
|
||||
#include <string>
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/framework/allocator.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/framework/tensor_testutil.h"
|
||||
#include "tensorflow/core/framework/types.pb.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/graph/testlib.h"
|
||||
#include "tensorflow/core/kernels/ops_util.h"
|
||||
#include "tensorflow/core/lib/core/status_test_util.h"
|
||||
#include "tensorflow/core/lib/core/threadpool.h"
|
||||
#include "tensorflow/core/public/session_options.h"
|
||||
#include "tensorflow/core/public/status.h"
|
||||
#include "tensorflow/core/public/tensor.h"
|
||||
#include "tensorflow/core/util/device_name_utils.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace tensorflow {
|
||||
namespace {
|
||||
|
||||
Session* CreateSession() {
|
||||
SessionOptions options;
|
||||
(*options.config.mutable_device_count())["CPU"] = 2;
|
||||
return NewSession(options);
|
||||
}
|
||||
|
||||
class LocalSessionMinusAXTest : public ::testing::Test {
|
||||
public:
|
||||
void Initialize(std::initializer_list<float> a_values) {
|
||||
RequireDefaultOps();
|
||||
Graph graph(OpRegistry::Global());
|
||||
|
||||
Tensor a_tensor(DT_FLOAT, TensorShape({2, 2}));
|
||||
test::FillValues<float>(&a_tensor, a_values);
|
||||
Node* a = test::graph::Constant(&graph, a_tensor);
|
||||
a->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
|
||||
Tensor x_tensor(DT_FLOAT, TensorShape({2, 1}));
|
||||
test::FillValues<float>(&x_tensor, {1, 1});
|
||||
Node* x = test::graph::Constant(&graph, x_tensor);
|
||||
x->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:1");
|
||||
x_ = x->name();
|
||||
|
||||
// y = A * x
|
||||
Node* y = test::graph::Matmul(&graph, a, x, false, false);
|
||||
y->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
y_ = y->name();
|
||||
|
||||
Node* y_neg = test::graph::Unary(&graph, "Neg", y);
|
||||
y_neg_ = y_neg->name();
|
||||
y_neg->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:1");
|
||||
|
||||
test::graph::ToGraphDef(&graph, &def_);
|
||||
}
|
||||
|
||||
string x_;
|
||||
string y_;
|
||||
string y_neg_;
|
||||
GraphDef def_;
|
||||
};
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, RunSimpleNetwork) {
|
||||
Initialize({3, 2, -1, 0});
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def_));
|
||||
std::vector<std::pair<string, Tensor>> inputs;
|
||||
|
||||
// Request two targets: one fetch output and one non-fetched output.
|
||||
std::vector<string> output_names = {y_ + ":0"};
|
||||
std::vector<string> target_nodes = {y_neg_};
|
||||
std::vector<Tensor> outputs;
|
||||
Status s = session->Run(inputs, output_names, target_nodes, &outputs);
|
||||
ASSERT_OK(s);
|
||||
|
||||
ASSERT_EQ(1, outputs.size());
|
||||
// The first output should be initialized and have the correct
|
||||
// output.
|
||||
auto mat = outputs[0].matrix<float>();
|
||||
ASSERT_TRUE(outputs[0].IsInitialized());
|
||||
EXPECT_FLOAT_EQ(5.0, mat(0, 0));
|
||||
}
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, TestFeed) {
|
||||
Initialize({1, 2, 3, 4});
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
|
||||
ASSERT_OK(session->Create(def_));
|
||||
|
||||
// Fill in the input and ask for the output
|
||||
//
|
||||
// Note that the input being fed is on the second device.
|
||||
Tensor t(DT_FLOAT, TensorShape({2, 1}));
|
||||
t.matrix<float>()(0, 0) = 5;
|
||||
t.matrix<float>()(1, 0) = 6;
|
||||
std::vector<std::pair<string, Tensor>> inputs = {{x_, t}};
|
||||
std::vector<string> output_names = {y_ + ":0"};
|
||||
std::vector<Tensor> outputs;
|
||||
|
||||
// Run the graph
|
||||
Status s = session->Run(inputs, output_names, {}, &outputs);
|
||||
ASSERT_OK(s);
|
||||
|
||||
ASSERT_EQ(1, outputs.size());
|
||||
auto mat = outputs[0].matrix<float>();
|
||||
|
||||
// Expect outputs to be: 1*5 + 2*6, 3*5 + 4*6
|
||||
EXPECT_FLOAT_EQ(17.0, mat(0, 0));
|
||||
EXPECT_FLOAT_EQ(39.0, mat(1, 0));
|
||||
}
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, TestConcurrency) {
|
||||
Initialize({1, 2, 3, 4});
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def_));
|
||||
|
||||
// Set up a thread pool to run the graph concurrently.
|
||||
thread::ThreadPool* tp = new thread::ThreadPool(Env::Default(), "test", 4);
|
||||
|
||||
// Run the graph 1000 times in 4 different threads concurrently.
|
||||
std::vector<string> output_names = {y_ + ":0"};
|
||||
auto fn = [&session, output_names]() {
|
||||
for (int i = 0; i < 1000; ++i) {
|
||||
std::vector<std::pair<string, Tensor>> inputs;
|
||||
std::vector<Tensor> outputs;
|
||||
// Run the graph
|
||||
Status s = session->Run(inputs, output_names, {}, &outputs);
|
||||
ASSERT_TRUE(s.ok());
|
||||
ASSERT_EQ(1, outputs.size());
|
||||
auto mat = outputs[0].matrix<float>();
|
||||
EXPECT_FLOAT_EQ(3.0, mat(0, 0));
|
||||
}
|
||||
};
|
||||
|
||||
for (int i = 0; i < 4; ++i) {
|
||||
tp->Schedule(fn);
|
||||
}
|
||||
|
||||
// Wait for the functions to finish.
|
||||
delete tp;
|
||||
}
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, TwoCreateCallsFails) {
|
||||
Initialize({1, 2, 3, 4});
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def_));
|
||||
|
||||
// The first Create call succeeded; a second call must fail.
|
||||
ASSERT_FALSE(session->Create(def_).ok());
|
||||
}
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, ForgetToCreate) {
|
||||
Initialize({1, 2, 3, 4});
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
std::vector<std::pair<string, Tensor>> inputs;
|
||||
std::vector<Tensor> outputs;
|
||||
ASSERT_FALSE(session->Run(inputs, {y_ + ":0"}, {y_neg_}, &outputs).ok());
|
||||
}
|
||||
|
||||
TEST_F(LocalSessionMinusAXTest, InvalidDevice) {
|
||||
GraphDef def;
|
||||
Graph graph(OpRegistry::Global());
|
||||
|
||||
Tensor a_tensor(DT_FLOAT, TensorShape({2, 2}));
|
||||
a_tensor.flat<float>().setRandom();
|
||||
Node* a = test::graph::Constant(&graph, a_tensor);
|
||||
a->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
Tensor x_tensor(DT_FLOAT, TensorShape({2, 1}));
|
||||
x_tensor.flat<float>().setRandom();
|
||||
Node* x = test::graph::Constant(&graph, x_tensor);
|
||||
x->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:1");
|
||||
// Place y on a device that does not exist (only cpu:0 and cpu:1 are available).
|
||||
Node* y = test::graph::Matmul(&graph, a, x, false, false);
|
||||
y->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:2");
|
||||
|
||||
test::graph::ToGraphDef(&graph, &def);
|
||||
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def));
|
||||
std::vector<std::pair<string, Tensor>> inputs;
|
||||
std::vector<string> output_names = {y->name() + ":0"};
|
||||
std::vector<Tensor> outputs;
|
||||
|
||||
// Should return an error.
|
||||
ASSERT_FALSE(session->Run(inputs, output_names, {}, &outputs).ok());
|
||||
|
||||
// Fix placement and run again
|
||||
def.Clear();
|
||||
y->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:1");
|
||||
test::graph::ToGraphDef(&graph, &def);
|
||||
session.reset(CreateSession());
|
||||
ASSERT_OK(session->Create(def));
|
||||
ASSERT_OK(session->Run(inputs, output_names, {}, &outputs));
|
||||
}
|
||||
|
||||
TEST(LocalSessionTest, KeepsStateAcrossRunsOfSession) {
|
||||
GraphDef def;
|
||||
Graph g(OpRegistry::Global());
|
||||
Node* var = test::graph::Var(&g, DT_FLOAT, TensorShape({10}));
|
||||
var->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
|
||||
Tensor twenty(DT_FLOAT, TensorShape({10}));
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
twenty.flat<float>()(i) = 20.0;
|
||||
}
|
||||
|
||||
Node* twenty_node = test::graph::Constant(&g, twenty);
|
||||
twenty_node->set_assigned_device_name(
|
||||
"/job:localhost/replica:0/task:0/cpu:0");
|
||||
|
||||
Node* init = test::graph::Assign(&g, var, twenty_node);
|
||||
init->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
|
||||
test::graph::ToGraphDef(&g, &def);
|
||||
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def));
|
||||
|
||||
std::vector<std::pair<string, Tensor>> inputs;
|
||||
std::vector<Tensor> outputs;
|
||||
|
||||
// Initialize the variable
|
||||
Status s = session->Run(inputs, {init->name()}, {}, &outputs);
|
||||
ASSERT_OK(s);
|
||||
|
||||
// Get the variable's data
|
||||
s = session->Run(inputs, {var->name() + ":0"}, {}, &outputs);
|
||||
ASSERT_OK(s);
|
||||
ASSERT_EQ(1, outputs.size());
|
||||
ASSERT_TRUE(outputs[0].IsInitialized());
|
||||
EXPECT_EQ(20.0, outputs[0].flat<float>()(0));
|
||||
}
|
||||
|
||||
TEST(LocalSessionTest, MultipleFeedTest) {
|
||||
GraphDef def;
|
||||
Graph g(OpRegistry::Global());
|
||||
Node* var = test::graph::Var(&g, DT_FLOAT, TensorShape({10}));
|
||||
var->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
|
||||
|
||||
Tensor first_value(DT_FLOAT, TensorShape({}));
|
||||
first_value.scalar<float>()() = 1.0;
|
||||
Node* first_const = test::graph::Constant(&g, first_value);
|
||||
Node* first_identity = test::graph::Identity(&g, first_const);
|
||||
|
||||
Tensor second_value(DT_FLOAT, TensorShape({}));
|
||||
second_value.scalar<float>()() = 2.0;
|
||||
Node* second_const = test::graph::Constant(&g, second_value);
|
||||
Node* second_identity = test::graph::Identity(&g, second_const);
|
||||
|
||||
test::graph::ToGraphDef(&g, &def);
|
||||
|
||||
std::unique_ptr<Session> session(CreateSession());
|
||||
ASSERT_TRUE(session != nullptr);
|
||||
ASSERT_OK(session->Create(def));
|
||||
|
||||
std::vector<Tensor> outputs;
|
||||
|
||||
// Fetch without feeding.
|
||||
Status s = session->Run(
|
||||
{}, {first_identity->name() + ":0", second_identity->name() + ":0"}, {},
|
||||
&outputs);
|
||||
ASSERT_TRUE(s.ok());
|
||||
ASSERT_EQ(2, outputs.size());
|
||||
ASSERT_EQ(1.0, outputs[0].flat<float>()(0));
|
||||
ASSERT_EQ(2.0, outputs[1].flat<float>()(0));
|
||||
|
||||
s = session->Run(
|
||||
{}, {second_identity->name() + ":0", first_identity->name() + ":0"}, {},
|
||||
&outputs);
|
||||
ASSERT_TRUE(s.ok());
|
||||
ASSERT_EQ(2, outputs.size());
|
||||
ASSERT_EQ(2.0, outputs[0].flat<float>()(0));
|
||||
ASSERT_EQ(1.0, outputs[1].flat<float>()(0));
|
||||
|
||||
Tensor value_11(DT_FLOAT, TensorShape({}));
|
||||
value_11.scalar<float>()() = 11.0;
|
||||
Tensor value_22(DT_FLOAT, TensorShape({}));
|
||||
value_22.scalar<float>()() = 22.0;
|
||||
|
||||
// Feed [first_const, second_const]
|
||||
s = session->Run(
|
||||
{{first_const->name(), value_11}, {second_const->name(), value_22}},
|
||||
{first_identity->name() + ":0", second_identity->name() + ":0"}, {},
|
||||
&outputs);
|
||||
ASSERT_TRUE(s.ok());
|
||||
ASSERT_EQ(2, outputs.size());
|
||||
ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
|
||||
ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
|
||||
|
||||
// Feed [second_const, first_const]
|
||||
s = session->Run(
|
||||
{{second_const->name(), value_22}, {first_const->name(), value_11}},
|
||||
{first_identity->name() + ":0", second_identity->name() + ":0"}, {},
|
||||
&outputs);
|
||||
ASSERT_TRUE(s.ok());
|
||||
ASSERT_EQ(2, outputs.size());
|
||||
ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
|
||||
ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
} // namespace tensorflow
|
170
tensorflow/core/common_runtime/rendezvous_mgr.cc
Normal file
@ -0,0 +1,170 @@
|
||||
#include "tensorflow/core/common_runtime/rendezvous_mgr.h"
|
||||
|
||||
#include <unordered_set>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/device_mgr.h"
|
||||
#if (!defined(PLATFORM_POSIX_ANDROID) && !defined(PLATFORM_GOOGLE_ANDROID)) && \
|
||||
(defined(PLATFORM_GOOGLE) || GOOGLE_CUDA)
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
#endif
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/notification.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
#include "tensorflow/core/platform/logging.h"
|
||||
#include "tensorflow/core/platform/port.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
void CopyTensorBetweenDevices(const string& id, DeviceContext* send_dev_context,
|
||||
DeviceContext* recv_dev_context, Device* src,
|
||||
Device* dst,
|
||||
const AllocatorAttributes src_alloc_attr,
|
||||
const AllocatorAttributes dst_alloc_attr,
|
||||
const Tensor* input, Tensor* output,
|
||||
std::function<void(const Status&)> done) {
|
||||
if (src->attributes().device_type() != dst->attributes().device_type()) {
|
||||
done(errors::Unimplemented(
|
||||
"Copy between device types not yet implemented: src=", src->name(),
|
||||
" dst=", dst->name()));
|
||||
} else if (src->attributes().device_type() != "CPU") {
|
||||
done(errors::Unimplemented(
|
||||
"Copy between non-CPU devices not yet implemented"));
|
||||
}
|
||||
*output = *input;
|
||||
done(Status::OK());
|
||||
}
|
||||
|
||||
#if (!defined(PLATFORM_POSIX_ANDROID) && !defined(PLATFORM_GOOGLE_ANDROID)) && \
    (defined(PLATFORM_GOOGLE) || GOOGLE_CUDA)
constexpr auto CopyTensorBetweenDevicesFunc = &GPUUtil::CopyViaDMA;
#else
constexpr auto CopyTensorBetweenDevicesFunc = &CopyTensorBetweenDevices;
#endif

}  // end namespace

IntraProcessRendezvous::IntraProcessRendezvous(const DeviceMgr* device_mgr)
    : device_mgr_(device_mgr), local_(NewLocalRendezvous()) {}

IntraProcessRendezvous::~IntraProcessRendezvous() { local_->Unref(); }

Status IntraProcessRendezvous::Send(const string& key,
                                    const Rendezvous::Args& args,
                                    const Tensor& val, const bool is_dead) {
  VLOG(1) << "IntraProcessRendezvous Send " << this << " " << key;
  {
    mutex_lock l(mu_);
    if (!status_.ok()) return status_;
  }
  Rendezvous::ParsedKey parsed;
  TF_RETURN_IF_ERROR(Rendezvous::ParseKey(key, &parsed));

  // Buffers "val" and "device_context" in local_.
  return local_->Send(key, args, val, is_dead);
}

Status IntraProcessRendezvous::ParseKey(const string& key, bool is_src,
                                        Rendezvous::ParsedKey* parsed) {
  {
    mutex_lock l(mu_);
    if (!status_.ok()) return status_;
  }
  TF_RETURN_IF_ERROR(Rendezvous::ParseKey(key, parsed));
  return Status::OK();
}

void IntraProcessRendezvous::SameWorkerRecvDone(
    const Rendezvous::ParsedKey& parsed, const Rendezvous::Args& send_args,
    const Rendezvous::Args& recv_args, const Tensor& in, Tensor* out,
    StatusCallback done) {
  // Do a quick copy (sharing the underlying buffer) if both tensors
  // are on host memory.
  const bool src_host =
      (send_args.alloc_attrs.on_host() || parsed.src.type == "CPU");
  const bool dst_host =
      (recv_args.alloc_attrs.on_host() || parsed.dst.type == "CPU");
  if (src_host && dst_host) {
    *out = in;
    done(Status::OK());
    return;
  }

  // This copy must involve a non-CPU device. Hence, "in" must support DMA
  // (e.g., string tensors do not work on GPU).
  if (!DataTypeCanUseMemcpy(in.dtype())) {
    done(errors::InvalidArgument("Non-DMA-safe ", DataTypeString(in.dtype()),
                                 " tensor may not be copied from/to a GPU."));
    return;
  }

  Device* src_device;
  Status s = device_mgr_->LookupDevice(parsed.src_device, &src_device);
  if (!s.ok()) {
    done(s);
    return;
  }
  Device* dst_device;
  s = device_mgr_->LookupDevice(parsed.dst_device, &dst_device);
  if (!s.ok()) {
    done(s);
    return;
  }

  AllocatorAttributes attr = recv_args.alloc_attrs;
  attr.set_gpu_compatible(send_args.alloc_attrs.gpu_compatible() ||
                          recv_args.alloc_attrs.gpu_compatible());
  Allocator* out_allocator = dst_device->GetAllocator(attr);
  Tensor copy(out_allocator, in.dtype(), in.shape());
  *out = copy;

  CopyTensorBetweenDevicesFunc(parsed.edge_name, send_args.device_context,
                               recv_args.device_context, src_device, dst_device,
                               send_args.alloc_attrs, recv_args.alloc_attrs,
                               &in, out, done);
}

void IntraProcessRendezvous::RecvAsync(const string& key,
                                       const Rendezvous::Args& recv_args,
                                       DoneCallback done) {
  VLOG(1) << "IntraProcessRendezvous Recv " << this << " " << key;

  Rendezvous::ParsedKey parsed;
  Status s = ParseKey(key, false /*!is_src*/, &parsed);
  if (!s.ok()) {
    done(s, Args(), recv_args, Tensor(), false);
    return;
  }

  // Recv the tensor from local_.
  local_->RecvAsync(key, recv_args, [this, parsed, done](
                                        const Status& status,
                                        const Rendezvous::Args& send_args,
                                        const Rendezvous::Args& recv_args,
                                        const Tensor& in, bool is_dead) {
    Status s = status;
    Tensor* out = new Tensor;
    StatusCallback final_callback = [done, send_args, recv_args, out,
                                     is_dead](const Status& s) {
      done(s, send_args, recv_args, *out, is_dead);
      delete out;
    };

    if (s.ok()) {
      SameWorkerRecvDone(parsed, send_args, recv_args, in, out, final_callback);
    } else {
      final_callback(s);
    }
  });
}

void IntraProcessRendezvous::StartAbort(const Status& s) {
  CHECK(!s.ok());
  local_->StartAbort(s);
}

}  // end namespace tensorflow
73
tensorflow/core/common_runtime/rendezvous_mgr.h
Normal file
73
tensorflow/core/common_runtime/rendezvous_mgr.h
Normal file
@ -0,0 +1,73 @@
#ifndef TENSORFLOW_COMMON_RUNTIME_RENDEZVOUS_MGR_H_
#define TENSORFLOW_COMMON_RUNTIME_RENDEZVOUS_MGR_H_

#include <string>
#include <unordered_map>

#include "tensorflow/core/common_runtime/device_mgr.h"
#include "tensorflow/core/framework/rendezvous.h"
#include "tensorflow/core/platform/port.h"
#include "tensorflow/core/public/status.h"
#include "tensorflow/core/public/tensor.h"

namespace tensorflow {

// IntraProcessRendezvous is a Rendezvous which expects all producers
// and consumers to be devices immediately accessible within the
// process. That is, it will never be necessary to perform an RPC to
// communicate with either.
//
// Buffering of Tensor values is delegated to a "local" Rendezvous
// obtained from NewLocalRendezvous(). This class just adds
// functionality to coordinate multiple process-local devices.
class IntraProcessRendezvous : public Rendezvous {
 public:
  explicit IntraProcessRendezvous(const DeviceMgr* device_mgr);

  // Forwards to local_, where the Tensor "val" will be buffered and
  // any waiting callback stored.
  Status Send(const string& key, const Rendezvous::Args& args,
              const Tensor& val, const bool is_dead) override;

  // This method is called only by the RecvOp. It tests to see
  // whether the value will be produced by a local or remote device
  // and handles accordingly. In the local case it forwards to
  // local_, in the remote case it initiates an RPC request.
  void RecvAsync(const string& key, const Rendezvous::Args& args,
                 DoneCallback done) override;

  void StartAbort(const Status& status) override;

 private:
  const DeviceMgr* device_mgr_;
  Rendezvous* local_;  // Owns a Ref on this object.

  mutable mutex mu_;

  // Status given by StartAbort() if any.
  Status status_ GUARDED_BY(mu_);

  ~IntraProcessRendezvous() override;

  // Parses "key" into "parsed". If "is_src" is true, checks that the
  // rendezvous key's source is in this process. If "is_src" is false,
  // checks that the rendezvous key's destination is in this process.
  Status ParseKey(const string& key, bool is_src,
                  Rendezvous::ParsedKey* parsed);

  // Callback handling the case when a rendezvous has been
  // accomplished in local_ and the consumer is local to this process.
  // Tensor "in" will be copied into "out". The key "parsed" encodes
  // the src and dst devices.
  typedef std::function<void(const Status&)> StatusCallback;
  void SameWorkerRecvDone(const Rendezvous::ParsedKey& parsed,
                          const Rendezvous::Args& send_args,
                          const Rendezvous::Args& recv_args, const Tensor& in,
                          Tensor* out, StatusCallback done);

  TF_DISALLOW_COPY_AND_ASSIGN(IntraProcessRendezvous);
};

}  // end namespace tensorflow

#endif  // TENSORFLOW_COMMON_RUNTIME_RENDEZVOUS_MGR_H_
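The header above is the whole public surface of IntraProcessRendezvous. As a rough usage sketch (not part of this commit), assuming "device_mgr" is a live DeviceMgr* and "key" is a rendezvous key string built by the framework's key conventions, a matching send/receive pair looks like:

// Illustrative sketch only; "device_mgr" and "key" are assumed to be supplied
// by the surrounding runtime, and error handling is reduced to TF_CHECK_OK.
IntraProcessRendezvous* rendez = new IntraProcessRendezvous(device_mgr);

Rendezvous::Args args;
args.device_context = nullptr;  // Host-to-host transfers need no device context.

Tensor to_send(DT_FLOAT, TensorShape({2, 2}));
TF_CHECK_OK(rendez->Send(key, args, to_send, false /* is_dead */));

rendez->RecvAsync(key, args,
                  [](const Status& s, const Rendezvous::Args& send_args,
                     const Rendezvous::Args& recv_args, const Tensor& val,
                     bool is_dead) {
                    // When both endpoints are host memory, "val" shares the
                    // sent buffer rather than copying it.
                    TF_CHECK_OK(s);
                  });

rendez->Unref();  // Rendezvous objects are reference counted.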
51
tensorflow/core/common_runtime/session.cc
Normal file
51
tensorflow/core/common_runtime/session.cc
Normal file
@ -0,0 +1,51 @@
#include <string>

#include "tensorflow/core/common_runtime/session_factory.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/public/session.h"

namespace tensorflow {

namespace {
Status GetFactory(const SessionOptions& options, SessionFactory** ret) {
  string runtime_type = "LOCAL_SESSION";
  if (!options.target.empty()) {
    // Use the service based session.
    runtime_type = "REMOTE_SESSION";
  }
  *ret = SessionFactory::GetFactory(runtime_type);
  if (!*ret) {
    return errors::NotFound("Could not find session factory for ",
                            runtime_type);
  }
  return Status::OK();
}
}  // end namespace

Session* NewSession(const SessionOptions& options) {
  SessionFactory* factory;
  Status s = GetFactory(options, &factory);
  if (!s.ok()) {
    LOG(ERROR) << s;
    return nullptr;
  }
  return factory->NewSession(options);
}

Status NewSession(const SessionOptions& options, Session** out_session) {
  SessionFactory* factory;
  Status s = GetFactory(options, &factory);
  if (!s.ok()) {
    *out_session = nullptr;
    LOG(ERROR) << s;
    return s;
  }
  *out_session = factory->NewSession(options);
  if (!*out_session) {
    return errors::Internal("Failed to create session.");
  }
  return Status::OK();
}

}  // namespace tensorflow
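For reference, the two NewSession overloads above are driven like this from client code (a minimal sketch; the GraphDef "def" and the tensor name "out:0" are placeholders, not names from this commit):

// Minimal sketch of the factory-based session API defined above.
tensorflow::SessionOptions options;   // Empty target selects LOCAL_SESSION.
tensorflow::Session* session = nullptr;
TF_CHECK_OK(tensorflow::NewSession(options, &session));
TF_CHECK_OK(session->Create(def));    // "def" is a GraphDef built elsewhere.
std::vector<tensorflow::Tensor> outputs;
TF_CHECK_OK(session->Run({}, {"out:0"}, {}, &outputs));
delete session;                       // The caller owns the returned session.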
41
tensorflow/core/common_runtime/session_factory.cc
Normal file
41
tensorflow/core/common_runtime/session_factory.cc
Normal file
@ -0,0 +1,41 @@
#include "tensorflow/core/common_runtime/session_factory.h"

#include <unordered_map>

#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/port.h"
namespace tensorflow {
namespace {

static mutex* get_session_factory_lock() {
  static mutex session_factory_lock;
  return &session_factory_lock;
}

typedef std::unordered_map<string, SessionFactory*> SessionFactories;
SessionFactories* session_factories() {
  static SessionFactories* factories = new SessionFactories;
  return factories;
}

}  // namespace

void SessionFactory::Register(const string& runtime_type,
                              SessionFactory* factory) {
  mutex_lock l(*get_session_factory_lock());
  if (!session_factories()->insert({runtime_type, factory}).second) {
    LOG(ERROR) << "Two session factories are being registered "
               << "under " << runtime_type;
  }
}

SessionFactory* SessionFactory::GetFactory(const string& runtime_type) {
  mutex_lock l(*get_session_factory_lock());  // could use reader lock
  auto it = session_factories()->find(runtime_type);
  if (it == session_factories()->end()) {
    return nullptr;
  }
  return it->second;
}

}  // namespace tensorflow
25
tensorflow/core/common_runtime/session_factory.h
Normal file
25
tensorflow/core/common_runtime/session_factory.h
Normal file
@ -0,0 +1,25 @@
#ifndef TENSORFLOW_COMMON_RUNTIME_SESSION_FACTORY_H_
#define TENSORFLOW_COMMON_RUNTIME_SESSION_FACTORY_H_

#include <string>

#include "tensorflow/core/lib/gtl/array_slice.h"
#include "tensorflow/core/platform/port.h"
#include "tensorflow/core/public/status.h"

namespace tensorflow {

class Session;
class SessionOptions;

class SessionFactory {
 public:
  virtual Session* NewSession(const SessionOptions& options) = 0;
  virtual ~SessionFactory() {}
  static void Register(const string& runtime_type, SessionFactory* factory);
  static SessionFactory* GetFactory(const string& runtime_type);
};

}  // namespace tensorflow

#endif  // TENSORFLOW_COMMON_RUNTIME_SESSION_FACTORY_H_
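A concrete backend hooks into this registry by defining a factory and registering it under a runtime type string at static-initialization time. A hedged sketch (the "MySession"/"MY_SESSION" names are placeholders, not part of this commit):

// Sketch of registering a hypothetical session implementation.
class MySessionFactory : public SessionFactory {
 public:
  Session* NewSession(const SessionOptions& options) override {
    return new MySession(options);  // MySession: an assumed Session subclass.
  }
};

class MySessionRegistrar {
 public:
  MySessionRegistrar() {
    SessionFactory::Register("MY_SESSION", new MySessionFactory);
  }
};
static MySessionRegistrar my_session_registrar;

GetFactory("MY_SESSION") then returns the registered instance; this is how NewSession() in session.cc locates the LOCAL_SESSION and REMOTE_SESSION backends.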
9
tensorflow/core/common_runtime/session_options.cc
Normal file
9
tensorflow/core/common_runtime/session_options.cc
Normal file
@ -0,0 +1,9 @@
#include "tensorflow/core/public/session_options.h"

#include "tensorflow/core/public/env.h"

namespace tensorflow {

SessionOptions::SessionOptions() : env(Env::Default()) {}

}  // namespace tensorflow
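SessionOptions itself is a small struct; the fields exercised elsewhere in this commit are target, env, and the config proto. For example (a sketch; the values are illustrative):

tensorflow::SessionOptions options;              // env defaults to Env::Default().
options.target = "";                             // Empty target -> in-process LOCAL_SESSION.
options.config.set_allow_soft_placement(true);   // Consulted by SimplePlacer below.
options.config.set_log_device_placement(true);   // Prints each node's chosen device.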
17
tensorflow/core/common_runtime/session_test.cc
Normal file
17
tensorflow/core/common_runtime/session_test.cc
Normal file
@ -0,0 +1,17 @@
#include "tensorflow/core/public/session.h"

#include "tensorflow/core/public/session_options.h"
#include <gtest/gtest.h>

namespace tensorflow {
namespace {

TEST(SessionTest, InvalidTargetReturnsNull) {
  SessionOptions options;
  options.target = "invalid target";

  EXPECT_EQ(nullptr, tensorflow::NewSession(options));
}

}  // namespace
}  // namespace tensorflow
559
tensorflow/core/common_runtime/simple_placer.cc
Normal file
559
tensorflow/core/common_runtime/simple_placer.cc
Normal file
@ -0,0 +1,559 @@
|
||||
#include "tensorflow/core/common_runtime/simple_placer.h"
|
||||
|
||||
#include <memory>
|
||||
#include <utility>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/framework/device_attributes.pb.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/framework/node_def_util.h"
|
||||
#include "tensorflow/core/framework/types.h"
|
||||
#include "tensorflow/core/framework/types.pb.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/stringpiece.h"
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
// Returns a list of devices sorted by name from 'devices' whose type is in
|
||||
// 'supported_device_types'. This function searches in order of the device
|
||||
// types in 'supported_device_types' and returns the *first* subset of devices
|
||||
// that match.
|
||||
//
|
||||
// For example, if supported_device_types contains {GPU, CPU} and
|
||||
// 'devices' contains CPU and GPU devices, the returned vector will
|
||||
// include *only* GPU devices, since that is higher in the priority
|
||||
// order in 'supported_device_types'.
|
||||
std::vector<Device*> FilterSupportedDevices(
|
||||
const std::vector<Device*>& devices,
|
||||
const DeviceTypeVector& supported_device_types) {
|
||||
std::vector<Device*> filtered_devices;
|
||||
auto device_sort = [](const Device* a, const Device* b) {
|
||||
return a->name() < b->name();
|
||||
};
|
||||
for (DeviceType d : supported_device_types) {
|
||||
for (Device* device : devices) {
|
||||
if (DeviceType(device->attributes().device_type()) == d) {
|
||||
filtered_devices.emplace_back(device);
|
||||
}
|
||||
}
|
||||
|
||||
// If there are any devices under this device type, return this
|
||||
// subset.
|
||||
if (!filtered_devices.empty()) {
|
||||
std::sort(filtered_devices.begin(), filtered_devices.end(), device_sort);
|
||||
return filtered_devices;
|
||||
}
|
||||
}
|
||||
|
||||
std::sort(filtered_devices.begin(), filtered_devices.end(), device_sort);
|
||||
return filtered_devices;
|
||||
}
|
||||
|
||||
bool HasColocatedNodeName(const Node& node) {
|
||||
return StringPiece(node.def().device()).starts_with("@");
|
||||
}
|
||||
|
||||
Status ParseColocatedNodeName(const Node& node,
|
||||
string* out_colocated_node_name) {
|
||||
StringPiece device(node.def().device());
|
||||
if (!device.Consume("@")) {
|
||||
return errors::InvalidArgument("Malformed colocated node name: '", device,
|
||||
"'");
|
||||
}
|
||||
// TODO(mrry): Validate that the node name is a valid node name.
|
||||
*out_colocated_node_name = device.ToString();
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// This class maintains the connected components of a colocation
|
||||
// constraint graph, and uses this information to assign a satisfying
|
||||
// device placement to the nodes of the graph.
|
||||
//
|
||||
// The typical usage pattern is:
|
||||
//
|
||||
// Graph graph = ...;
|
||||
// DeviceSet device_set = ...;
|
||||
// ColocationGraph colocation_graph(graph, device_set);
|
||||
//
|
||||
// // Add all the nodes of graph to colocation_graph.
|
||||
// for (Node* node : graph.nodes()) {
|
||||
// TF_RETURN_IF_ERROR(colocation_graph.AddNode(*node));
|
||||
// }
|
||||
//
|
||||
// // Add one or more colocation constraint.
|
||||
// Node node_1 = *graph.FindNodeId(...);
|
||||
// Node node_2 = *graph.FindNodeId(...);
|
||||
// TF_RETURN_IF_ERROR(colocation_graph.ColocateNodes(node_1, node_2));
|
||||
//
|
||||
// // Assign devices based on the accumulated constraints.
|
||||
// for (Node* node : graph.nodes()) {
|
||||
// TF_RETURN_IF_ERROR(colocation_graph.AssignDevice(node));
|
||||
// }
|
||||
//
|
||||
// The implementation uses the union-find algorithm to maintain the
|
||||
// connected components efficiently and incrementally as edges
|
||||
// (implied by ColocationGraph::ColocateNodes() invocations) are added.
|
||||
class ColocationGraph {
|
||||
public:
|
||||
ColocationGraph(Graph* graph, const DeviceSet* device_set,
|
||||
const SessionOptions* options)
|
||||
: device_set_(device_set),
|
||||
device_types_(device_set->PrioritizedDeviceTypeList()),
|
||||
options_(options) {
|
||||
members_.reserve(graph->num_node_ids());
|
||||
}
|
||||
|
||||
// Adds the given node to this ColocationGraph as a singleton.
|
||||
//
|
||||
// NOTE: The implementation assumes that the ids of nodes passed to
|
||||
// this method are dense and zero-based; the memory used will be linear in
|
||||
// the largest node ID.
|
||||
// NOTE: If this method returns an error, *this is left in an undefined
|
||||
// state.
|
||||
Status AddNode(const Node& node) {
|
||||
Member member;
|
||||
TF_RETURN_IF_ERROR(InitializeMember(node, &member));
|
||||
CHECK_GE(member.parent, 0);
|
||||
members_.resize(member.parent + 1);
|
||||
members_[member.parent] = std::move(member);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Merge the (possibly disjoint) sets containing nodes "x" and
|
||||
// "y". Returns OK if the all nodes in the union of these sets can
|
||||
// be placed on the same device type.
|
||||
//
|
||||
// NOTE: If this method returns an error, *this is left in an undefined
|
||||
// state.
|
||||
Status ColocateNodes(const Node& x, const Node& y) {
|
||||
int x_root = FindRoot(x.id());
|
||||
int y_root = FindRoot(y.id());
|
||||
if (x_root != y_root) {
|
||||
// Merge the sets by swinging the parent pointer of the smaller
|
||||
// tree to point to the root of the larger tree. Together with
|
||||
// path compression in ColocationGraph::FindRoot, this ensures
|
||||
// that we do not experience pathological performance on graphs
|
||||
// such as chains.
|
||||
int new_root, old_root;
|
||||
if (members_[x_root].rank < members_[y_root].rank) {
|
||||
// The tree rooted at x_root is shallower, so connect it to
|
||||
// y_root. The rank of y_root is unchanged because its new
|
||||
// child has strictly less rank.
|
||||
members_[x_root].parent = y_root;
|
||||
new_root = y_root;
|
||||
old_root = x_root;
|
||||
} else if (members_[x_root].rank > members_[y_root].rank) {
|
||||
// The tree rooted at y_root is shallower, so connect it to
|
||||
// x_root. The rank of x_root is unchanged because its new
|
||||
// child has strictly less rank.
|
||||
members_[y_root].parent = x_root;
|
||||
new_root = x_root;
|
||||
old_root = y_root;
|
||||
} else {
|
||||
// Both trees have the same rank, so break the tie by choosing
|
||||
// x_root as the new root.
|
||||
members_[y_root].parent = x_root;
|
||||
// Increment the rank of the tree rooted at x_root, because it
|
||||
// is now strictly deeper than before.
|
||||
++members_[x_root].rank;
|
||||
new_root = x_root;
|
||||
old_root = y_root;
|
||||
}
|
||||
|
||||
// Merge the partial device specifications, and ensure that they are
|
||||
// compatible. NULL options_ is treated as allowing soft placement.
|
||||
// TODO(mrry): Consider enriching the error message by pointing
|
||||
// out which nodes have the explicit partial device
|
||||
// specifications that caused this conflict.
|
||||
TF_RETURN_IF_ERROR(DeviceNameUtils::MergeDevNames(
|
||||
&members_[new_root].device_name, members_[old_root].device_name,
|
||||
options_ == nullptr || options_->config.allow_soft_placement()));
|
||||
|
||||
// Ensure that the common root has at least one supported device
|
||||
// type, by computing the intersection of
|
||||
// members_[new_root].supported_device_types and
|
||||
// members_[old_root].supported_device_types.
|
||||
MergeSupportedDevices(&members_[new_root].supported_device_types,
|
||||
members_[old_root].supported_device_types);
|
||||
if (members_[new_root].supported_device_types.size() == 0) {
|
||||
return errors::InvalidArgument(
|
||||
"Cannot colocate nodes '", x.name(), "' and '", y.name(),
|
||||
"' because no device type supports both of those nodes and the "
|
||||
"other nodes colocated with them");
|
||||
}
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// For the given node, subject to the constraints previously given
|
||||
// to this ColocationGraph, set its assigned_device_name. Returns OK
|
||||
// if a satisfying device can be found, otherwise an error.
|
||||
Status AssignDevice(Node* node) {
|
||||
int node_root = FindRoot(node->id());
|
||||
if (members_[node_root].assigned_device == nullptr) {
|
||||
// We have not yet assigned a device for the colocated node set containing
|
||||
// n, so we do so now using the constraints on the root node.
|
||||
|
||||
// "devices" will contain the set of feasible placements for the
|
||||
// colocated node set containing n.
|
||||
std::vector<Device*> devices;
|
||||
if (DeviceNameUtils::HasSomeDetails(members_[node_root].device_name)) {
|
||||
// The root node has a (possibly partial) device
|
||||
// specification, so enumerate the physical devices that
|
||||
// conform to it.
|
||||
device_set_->FindMatchingDevices(members_[node_root].device_name,
|
||||
&devices);
|
||||
|
||||
if (!devices.empty()) {
|
||||
// Filter devices into those that are compatible with the root
|
||||
// node (and its children).
|
||||
devices = FilterSupportedDevices(
|
||||
devices, members_[node_root].supported_device_types);
|
||||
}
|
||||
|
||||
// Perform soft placement if allow_soft_placement is set. options_
|
||||
// being NULL is treated as allowing soft placement.
|
||||
if (devices.empty() &&
|
||||
(options_ == nullptr || options_->config.allow_soft_placement())) {
|
||||
// The soft_device_name is the same as the node's device name
|
||||
// without specifying the device type or ID.
|
||||
DeviceNameUtils::ParsedName soft_device_name =
|
||||
members_[node_root].device_name;
|
||||
soft_device_name.type.clear();
|
||||
soft_device_name.has_type = false;
|
||||
soft_device_name.has_id = false;
|
||||
device_set_->FindMatchingDevices(soft_device_name, &devices);
|
||||
if (!devices.empty()) {
|
||||
devices = FilterSupportedDevices(
|
||||
devices, members_[node_root].supported_device_types);
|
||||
}
|
||||
}
|
||||
|
||||
if (devices.empty()) {
|
||||
// Return an error when a physical device that matches an explicit
|
||||
// device specification is not found. This ensures that we don't
|
||||
// assign a node to GPU when the user wanted to force it on CPU.
|
||||
DeviceNameUtils::ParsedName specified_device_name;
|
||||
if (DeviceNameUtils::ParseFullName(node->def().device(),
|
||||
&specified_device_name) &&
|
||||
specified_device_name == members_[node_root].device_name) {
|
||||
// The specified device and merged set device match, and
|
||||
// will appear in the GraphDef (for debugging), so just
|
||||
// print the specified device.
|
||||
return errors::InvalidArgument(
|
||||
"Could not satisfy explicit device specification '",
|
||||
node->def().device(), "'");
|
||||
} else {
|
||||
// The specified device may be a valid device but the
|
||||
// merged set device is different, so print both.
|
||||
return errors::InvalidArgument(
|
||||
"Could not satisfy explicit device specification '",
|
||||
node->def().device(),
|
||||
"' because the node was colocated with a group of nodes that "
|
||||
"required incompatible device '",
|
||||
DeviceNameUtils::ParsedNameToString(
|
||||
members_[node_root].device_name),
|
||||
"'");
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// The device is completely unspecified, so enumerate the devices that
|
||||
// support all of the nodes in the set.
|
||||
if (device_set_->devices().empty()) {
|
||||
return errors::Internal("No devices are registered");
|
||||
}
|
||||
devices = FilterSupportedDevices(
|
||||
device_set_->devices(), members_[node_root].supported_device_types);
|
||||
|
||||
if (devices.empty()) {
|
||||
return errors::InvalidArgument(
|
||||
"Node had no OpKernel registered to support this operation: ",
|
||||
"Operation was ", node->type_string(), " and inputs were ",
|
||||
DataTypeVectorString(node->input_types()));
|
||||
}
|
||||
}
|
||||
|
||||
// Returns the first device in sorted devices list so we will always
|
||||
// choose the same device.
|
||||
members_[node_root].assigned_device = devices[0];
|
||||
}
|
||||
node->set_assigned_device_name(members_[node_root].assigned_device->name());
|
||||
|
||||
// Log placement if log_device_placement is set.
|
||||
if (options_ && options_->config.log_device_placement()) {
|
||||
printf("%s: %s\n", node->name().c_str(),
|
||||
node->assigned_device_name().c_str());
|
||||
LOG(INFO) << node->name() << ": " << node->assigned_device_name();
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
private:
|
||||
// Represents a node in the disjoint node set forest, and the
|
||||
// accumulated constraints on the device used by that node.
|
||||
struct Member {
|
||||
Member() = default;
|
||||
// The id of the node that is the parent of this one, or its own
|
||||
// id if it is a root. parent < 0 indicates that this member is invalid.
|
||||
int parent = -1;
|
||||
// A proxy for the depth of the tree that is used to prefer
|
||||
// connecting smaller trees to larger trees when merging disjoint
|
||||
// sets.
|
||||
int rank = 0;
|
||||
// The intersection of all device types supported by this node,
|
||||
// and those of all of its children, in priority order
|
||||
// of the preferred device.
|
||||
DeviceTypeVector supported_device_types;
|
||||
// The merged form of the device requested for this node, with
|
||||
// those of all of its children.
|
||||
DeviceNameUtils::ParsedName device_name;
|
||||
// If this node is a root, stores the Device to which this node
|
||||
// and all of its children have been assigned, or nullptr if this
|
||||
// has not yet been computed by GetAssignedDevice().
|
||||
Device* assigned_device = nullptr;
|
||||
};
|
||||
|
||||
Status InitializeMember(const Node& node, Member* member) {
|
||||
const int id = node.id();
|
||||
if (id < 0) {
|
||||
return errors::InvalidArgument("Node id was not positive: ", id);
|
||||
}
|
||||
member->parent = id;
|
||||
TF_RETURN_IF_ERROR(SupportedDeviceTypesForNode(
|
||||
device_types_, node.def(), &member->supported_device_types));
|
||||
|
||||
if (!node.assigned_device_name().empty()) {
|
||||
// This node has already been assigned to a device, so we
|
||||
// respect this placement, after sanity-checking it. The
|
||||
// device_name and supported_device_types for this node reflect
|
||||
// the assigned device, so any nodes colocated with this node
|
||||
// will be assigned to the same device (assuming this is
|
||||
// possible).
|
||||
// NOTE: Since any assignment must have been performed by
|
||||
// the TensorFlow runtime, we consider errors in this branch to
|
||||
// be INTERNAL.
|
||||
if (!DeviceNameUtils::ParseFullName(node.assigned_device_name(),
|
||||
&member->device_name)) {
|
||||
return errors::Internal("Malformed assigned device '",
|
||||
node.assigned_device_name(), "'");
|
||||
}
|
||||
std::vector<Device*> devices;
|
||||
const Device* assigned_device =
|
||||
device_set_->FindDeviceByName(node.assigned_device_name());
|
||||
if (assigned_device == nullptr) {
|
||||
return errors::Internal("Assigned device '",
|
||||
node.assigned_device_name(),
|
||||
"' does not match any device");
|
||||
}
|
||||
|
||||
for (DeviceType d : member->supported_device_types) {
|
||||
if (DeviceType(assigned_device->attributes().device_type()) == d) {
|
||||
return Status::OK();
|
||||
}
|
||||
}
|
||||
|
||||
return errors::Internal("Assigned device '", node.assigned_device_name(),
|
||||
"' does not have registered OpKernel support "
|
||||
"for ",
|
||||
node.def().op());
|
||||
} else {
|
||||
// This node has not yet been assigned to a device, so we
|
||||
// calculate any constraints due to the set of registered
|
||||
// kernels and any (partial) user-provided device specification
|
||||
// in the NodeDef.
|
||||
|
||||
// If no kernels are registered for this op type, fail with an error.
|
||||
if (member->supported_device_types.empty()) {
|
||||
return errors::InvalidArgument(
|
||||
"No OpKernel was registered to support "
|
||||
"Op '",
|
||||
node.def().op(), "' with these attrs");
|
||||
}
|
||||
|
||||
// If the NodeDef contains a device that is *not* a colocated node name
|
||||
// (i.e. it does not begin with '@') then we interpret it as a (partial)
|
||||
// device specification.
|
||||
string colocated_node_name;
|
||||
if (!node.def().device().empty() && !HasColocatedNodeName(node)) {
|
||||
// The user has specified a device in the NodeDef, try to find a
|
||||
// valid device matching their specification in the set of
|
||||
// devices.
|
||||
// NOTE: The full name may specify a device that is not in
|
||||
// n.supported_device_types(), but we check that in AssignDevice().
|
||||
if (!DeviceNameUtils::ParseFullName(node.def().device(),
|
||||
&member->device_name)) {
|
||||
return errors::InvalidArgument("Malformed device specification '",
|
||||
node.def().device(), "'");
|
||||
}
|
||||
}
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Updates target to contain the intersection of the device types in
|
||||
// "target" and "other".
|
||||
static void MergeSupportedDevices(DeviceTypeVector* target,
|
||||
const DeviceTypeVector& other) {
|
||||
DeviceTypeVector temp = *target;
|
||||
target->clear();
|
||||
|
||||
// Iterate in priority order.
|
||||
for (DeviceType device_type : temp) {
|
||||
bool found = false;
|
||||
for (DeviceType other_device_type : other) {
|
||||
if (device_type == other_device_type) {
|
||||
found = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (found) {
|
||||
target->push_back(device_type);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Returns the root node of the disjoint tree to which the node with the
|
||||
// given id is connected.
|
||||
int FindRoot(int node_id) {
|
||||
DCHECK_GE(members_[node_id].parent, 0);
|
||||
if (members_[node_id].parent != node_id) {
|
||||
// NOTE: Compress paths from node_id to its root, so that future
|
||||
// calls to FindRoot and ColocateNodes are more efficient.
|
||||
members_[node_id].parent = FindRoot(members_[node_id].parent);
|
||||
}
|
||||
return members_[node_id].parent;
|
||||
}
|
||||
|
||||
std::vector<Member> members_;
|
||||
const DeviceSet* device_set_; // Not owned.
|
||||
const std::vector<DeviceType> device_types_;
|
||||
const SessionOptions* options_; // Not owned;
|
||||
};
|
||||
|
||||
} // namespace
|
||||
|
||||
SimplePlacer::SimplePlacer(Graph* graph, const DeviceSet* devices,
|
||||
const NodeNameToIdMap* name_to_id_map,
|
||||
const SessionOptions* options)
|
||||
: graph_(graph),
|
||||
devices_(devices),
|
||||
name_to_id_map_(name_to_id_map),
|
||||
options_(options) {}
|
||||
|
||||
SimplePlacer::SimplePlacer(Graph* graph, const DeviceSet* devices,
|
||||
const NodeNameToIdMap* name_to_id_map)
|
||||
: graph_(graph), devices_(devices), name_to_id_map_(name_to_id_map) {
|
||||
options_ = nullptr;
|
||||
}
|
||||
|
||||
SimplePlacer::~SimplePlacer() {}
|
||||
|
||||
Status SimplePlacer::Run() {
|
||||
if (devices_->devices().empty()) {
|
||||
return errors::FailedPrecondition("No devices are registered");
|
||||
}
|
||||
|
||||
ColocationGraph colocation_graph(graph_, devices_, options_);
|
||||
Status status;
|
||||
|
||||
// 1. First add all of the nodes. Note that steps (1) and (2)
|
||||
// requires two passes over the nodes because the graph (and hence
|
||||
// the constraints) may not be acyclic.
|
||||
for (Node* node : graph_->nodes()) {
|
||||
// Skip the source and sink nodes.
|
||||
if (!node->IsOp()) {
|
||||
continue;
|
||||
}
|
||||
status = colocation_graph.AddNode(*node);
|
||||
if (!status.ok()) return AttachDef(status, node->def());
|
||||
}
|
||||
|
||||
// 2. Enumerate the constraint edges, and use them to update the disjoint
|
||||
// node set.
|
||||
for (Node* node : graph_->nodes()) {
|
||||
if (!node->IsOp()) {
|
||||
continue;
|
||||
}
|
||||
|
||||
// 2(a). If node n specifies a colocation constraint as its device name,
|
||||
// add an edge from the colocated node to n.
|
||||
if (HasColocatedNodeName(*node)) {
|
||||
string colocated_node_name;
|
||||
status = ParseColocatedNodeName(*node, &colocated_node_name);
|
||||
if (!status.ok()) {
|
||||
return AttachDef(status, node->def());
|
||||
}
|
||||
Node* colocated_node;
|
||||
status = GetNodeByName(colocated_node_name, &colocated_node);
|
||||
if (!status.ok()) {
|
||||
return AttachDef(
|
||||
errors::InvalidArgument("Colocated node named in device '",
|
||||
colocated_node_name, "' does not exist"),
|
||||
node->def());
|
||||
}
|
||||
status = colocation_graph.ColocateNodes(*colocated_node, *node);
|
||||
if (!status.ok()) {
|
||||
return AttachDef(
|
||||
errors::InvalidArgument(
|
||||
"Cannot satisfy colocation constraint named in device '",
|
||||
colocated_node_name, "': ", status.error_message()),
|
||||
node->def());
|
||||
}
|
||||
}
|
||||
|
||||
// 2(b). If `node` has an input edge with reference type, add an
|
||||
// edge from the source of that edge to `node`.
|
||||
for (const auto& edge : node->in_edges()) {
|
||||
if (!edge->IsControlEdge() &&
|
||||
IsRefType(node->input_type(edge->dst_input()))) {
|
||||
status = colocation_graph.ColocateNodes(*edge->src(), *node);
|
||||
if (!status.ok()) {
|
||||
return AttachDef(
|
||||
errors::InvalidArgument("Cannot satisfy colocation constraint "
|
||||
"implied by reference connection: ",
|
||||
status.error_message()),
|
||||
node->def());
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 3. For each node, assign a device based on the constraints in the
|
||||
// disjoint node set.
|
||||
for (Node* node : graph_->nodes()) {
|
||||
// Skip the source and sink nodes.
|
||||
if (!node->IsOp()) {
|
||||
continue;
|
||||
}
|
||||
// Skip nodes that already have an assigned name.
|
||||
if (!node->assigned_device_name().empty()) {
|
||||
continue;
|
||||
}
|
||||
|
||||
status = colocation_graph.AssignDevice(node);
|
||||
if (!status.ok()) {
|
||||
return AttachDef(
|
||||
errors::InvalidArgument("Cannot assign a device to node '",
|
||||
node->name(), "': ", status.error_message()),
|
||||
node->def());
|
||||
}
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status SimplePlacer::GetNodeByName(const string& name, Node** out_node) const {
|
||||
NodeNameToIdMap::const_iterator iter = name_to_id_map_->find(name);
|
||||
if (iter != name_to_id_map_->end()) {
|
||||
*out_node = graph_->FindNodeId(iter->second);
|
||||
if (*out_node) {
|
||||
return Status::OK();
|
||||
}
|
||||
}
|
||||
return errors::NotFound(name);
|
||||
}
|
||||
|
||||
} // namespace tensorflow
|
81
tensorflow/core/common_runtime/simple_placer.h
Normal file
81
tensorflow/core/common_runtime/simple_placer.h
Normal file
@ -0,0 +1,81 @@
#ifndef TENSORFLOW_COMMON_RUNTIME_SIMPLE_PLACER_H_
#define TENSORFLOW_COMMON_RUNTIME_SIMPLE_PLACER_H_

#include <string>
#include <unordered_map>

#include "tensorflow/core/common_runtime/device_set.h"
#include "tensorflow/core/graph/graph.h"
#include "tensorflow/core/platform/port.h"
#include "tensorflow/core/public/status.h"
#include "tensorflow/core/util/device_name_utils.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

// A placement algorithm that assigns the nodes of the given Graph to
// devices in the given DeviceSet, respecting the following constraints:
//
// 1. Existing device assignments remain unchanged.
// 2. Requested (partial or complete) device specifications in the
//    GraphDef are granted.
// 3. Nodes connected by edges of a reference type are colocated on
//    the same device.
// 4. Given nodes "A" and "B", if node "B" has the device specification
//    "@A", nodes "A" and "B" will be colocated on the same device.
//
// The implementation builds a constraint graph with the same set of
// nodes, and edges that represent colocation constraints between
// nodes. Each connected component in the resulting constraint graph
// is then assigned to a single device.
//
// TODO(mrry): "Soft" constraints, such as "place node 'x' as close as
// possible to node 'y' while respecting the other constraints"?
// TODO(mrry): Create a common interface for this and the other
// placement algorithms so that they may be injected into the graph
// builder.
class SimplePlacer {
 public:
  // A map from graph node names to numerical IDs (in a Graph object).
  typedef std::unordered_map<string, int> NodeNameToIdMap;

  // Creates an instance of the SimplePlacer algorithm for the given
  // Graph "graph" (nodes in which may or may not be assigned) on the
  // given DeviceSet "devices". The "name_to_id_map" maps the names of
  // nodes in "g" to their numerical ID.
  //
  // REQUIRES: for all mappings (k, v) in "name_to_id_map",
  //           graph.FindNodeId(v)->name() == k.
  //
  // The "graph", "devices", and "name_to_id_map" pointer arguments
  // are borrowed by this SimplePlacer, and must outlive it.
  SimplePlacer(Graph* graph, const DeviceSet* devices,
               const NodeNameToIdMap* name_to_id_map,
               const SessionOptions* options);

  SimplePlacer(Graph* graph, const DeviceSet* devices,
               const NodeNameToIdMap* name_to_id_map);

  ~SimplePlacer();

  // Assigns each node in this SimplePlacer's graph to a device in its
  // set of devices.
  //
  // This method is not thread-safe.
  // Run() may be invoked at most once.
  Status Run();

 private:
  Status GetNodeByName(const string& name, Node** out_node) const;

  Graph* const graph_;                           // Not owned.
  const DeviceSet* const devices_;               // Not owned.
  const NodeNameToIdMap* const name_to_id_map_;  // Not owned.
  const SessionOptions* options_;                // Not owned.

  TF_DISALLOW_COPY_AND_ASSIGN(SimplePlacer);
};

}  // namespace tensorflow

#endif  // TENSORFLOW_COMMON_RUNTIME_SIMPLE_PLACER_H_
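A hedged sketch of driving SimplePlacer, mirroring the test fixture later in this commit; "graph" (a Graph*) and "devices" (a DeviceSet*) are assumed to exist already:

// Index node names, as required by the SimplePlacer constructor.
tensorflow::SimplePlacer::NodeNameToIdMap nodes_by_name;
for (tensorflow::Node* node : graph->nodes()) {
  nodes_by_name[node->name()] = node->id();
}

tensorflow::SessionOptions options;
options.config.set_allow_soft_placement(true);

tensorflow::SimplePlacer placer(graph, devices, &nodes_by_name, &options);
TF_CHECK_OK(placer.Run());  // Afterwards every op node has an assigned_device_name().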
863
tensorflow/core/common_runtime/simple_placer_test.cc
Normal file
863
tensorflow/core/common_runtime/simple_placer_test.cc
Normal file
@ -0,0 +1,863 @@
|
||||
#include "tensorflow/core/common_runtime/simple_placer.h"
|
||||
|
||||
#include <memory>
|
||||
#include <string>
|
||||
#include <utility>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/device_set.h"
|
||||
#include "tensorflow/core/framework/device_attributes.pb.h"
|
||||
#include "tensorflow/core/framework/graph.pb.h"
|
||||
#include "tensorflow/core/framework/kernel_def_builder.h"
|
||||
#include "tensorflow/core/framework/op.h"
|
||||
#include "tensorflow/core/framework/op_def_builder.h"
|
||||
#include "tensorflow/core/framework/op_kernel.h"
|
||||
#include "tensorflow/core/graph/graph.h"
|
||||
#include "tensorflow/core/graph/graph_def_builder.h"
|
||||
#include "tensorflow/core/kernels/ops_util.h"
|
||||
#include "tensorflow/core/lib/core/error_codes.pb.h"
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/core/status_test_util.h"
|
||||
#include "tensorflow/core/lib/strings/strcat.h"
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
namespace tensorflow {
|
||||
|
||||
namespace {
|
||||
|
||||
////////////////////////////////////////////////////////////////////////////////
|
||||
//
|
||||
// Op, kernel, and device registrations to set up the environment.
|
||||
//
|
||||
// The SimplePlacer uses information about the op (input types),
|
||||
// kernel (device constraints), and available devices to make
|
||||
// placement decisions. To avoid depending on the full runtime, we
|
||||
// define dummy implementations of these, and register them with the
|
||||
// runtime.
|
||||
//
|
||||
////////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// A dummy OpKernel that is used to register ops on different devices.
|
||||
class DummyOp : public OpKernel {
|
||||
public:
|
||||
explicit DummyOp(OpKernelConstruction* context) : OpKernel(context) {}
|
||||
void Compute(OpKernelContext* context) override {}
|
||||
};
|
||||
|
||||
// A fake device that has specific device attributes, used to simulate
|
||||
// the presence of a CPU or a GPU (without depending on that part of
|
||||
// the runtime).
|
||||
class FakeDevice : public Device {
|
||||
private:
|
||||
explicit FakeDevice(const DeviceAttributes& device_attributes)
|
||||
: Device(nullptr, device_attributes, nullptr) {}
|
||||
|
||||
public:
|
||||
Status Sync() override { return errors::Unimplemented("FakeDevice::Sync()"); }
|
||||
|
||||
Allocator* GetAllocator(AllocatorAttributes attr) override { return nullptr; }
|
||||
|
||||
static std::unique_ptr<Device> MakeCPU(const string& name) {
|
||||
DeviceAttributes device_attributes;
|
||||
device_attributes.set_name(name);
|
||||
device_attributes.set_device_type(DeviceType(DEVICE_CPU).type());
|
||||
return std::unique_ptr<Device>(new FakeDevice(device_attributes));
|
||||
}
|
||||
|
||||
static std::unique_ptr<Device> MakeGPU(const string& name) {
|
||||
DeviceAttributes device_attributes;
|
||||
device_attributes.set_name(name);
|
||||
device_attributes.set_device_type(DeviceType(DEVICE_GPU).type());
|
||||
return std::unique_ptr<Device>(new FakeDevice(device_attributes));
|
||||
}
|
||||
};
|
||||
|
||||
// Register the following ops so they can be added to a Graph, and
|
||||
// kernels so that they can be placed on particular device types.
|
||||
REGISTER_OP("TestVariable").Output("o: Ref(float)");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestVariable").Device(DEVICE_CPU), DummyOp);
|
||||
REGISTER_KERNEL_BUILDER(Name("TestVariable").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("VariableCPU").Output("o: Ref(float)");
|
||||
REGISTER_KERNEL_BUILDER(Name("VariableCPU").Device(DEVICE_CPU), DummyOp);
|
||||
|
||||
REGISTER_OP("VariableGPU").Output("o: Ref(float)");
|
||||
REGISTER_KERNEL_BUILDER(Name("VariableGPU").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("VariableNoKernels").Output("o: Ref(float)");
|
||||
|
||||
REGISTER_OP("TestAdd").Input("a: float").Input("b: float").Output("o: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestAdd").Device(DEVICE_CPU), DummyOp);
|
||||
REGISTER_KERNEL_BUILDER(Name("TestAdd").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("TestRelu").Input("i: float").Output("o: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestRelu").Device(DEVICE_CPU), DummyOp);
|
||||
REGISTER_KERNEL_BUILDER(Name("TestRelu").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("ReluGPU").Input("i: float").Output("o: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("ReluGPU").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("TestAssign").Input("i: Ref(float)").Input("v: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestAssign").Device(DEVICE_CPU), DummyOp);
|
||||
REGISTER_KERNEL_BUILDER(Name("TestAssign").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("AssignCPU").Input("i: Ref(float)").Input("v: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("AssignCPU").Device(DEVICE_CPU), DummyOp);
|
||||
|
||||
REGISTER_OP("AssignGPU").Input("i: Ref(float)").Input("v: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("AssignGPU").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("TestInput").Output("a: float").Output("b: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestInput").Device(DEVICE_CPU), DummyOp);
|
||||
|
||||
REGISTER_OP("TestDevice").Output("a: float").Output("b: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestDevice").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
REGISTER_OP("TestDeviceEnforce").Input("a: Ref(float)").Output("b: float");
|
||||
REGISTER_KERNEL_BUILDER(Name("TestDeviceEnforce").Device(DEVICE_CPU), DummyOp);
|
||||
REGISTER_KERNEL_BUILDER(Name("TestDeviceEnforce").Device(DEVICE_GPU), DummyOp);
|
||||
|
||||
////////////////////////////////////////////////////////////////////////////////
|
||||
//
|
||||
// A SimplePlacerTest method has three phases:
|
||||
//
|
||||
// 1. Build a TensorFlow graph, with no (or partial) device assignments.
|
||||
// 2. Attempt to compute a placement using the SimplePlacer.
|
||||
// 3. EITHER: test that the constraints implied by the graph are respected;
|
||||
// or that an appropriate error was reported.
|
||||
//
|
||||
////////////////////////////////////////////////////////////////////////////////
|
||||
class SimplePlacerTest : public ::testing::Test {
|
||||
protected:
|
||||
SimplePlacerTest() {
|
||||
RequireDefaultOps();
|
||||
// Build a set of 10 GPU and 10 CPU devices.
|
||||
// NOTE: this->local_devices_ owns the device objects;
|
||||
// this->devices_ contains borrowed pointers to the device
|
||||
// objects.
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
local_devices_.emplace_back(FakeDevice::MakeCPU(
|
||||
strings::StrCat("/job:a/replica:0/task:0/cpu:", i)));
|
||||
devices_.AddDevice(local_devices_.back().get());
|
||||
// Insert the GPUs in reverse order.
|
||||
local_devices_.emplace_back(FakeDevice::MakeGPU(
|
||||
strings::StrCat("/job:a/replica:0/task:0/gpu:", 9 - i)));
|
||||
devices_.AddDevice(local_devices_.back().get());
|
||||
}
|
||||
}
|
||||
|
||||
// Builds the given graph, and (if successful) indexes the node
|
||||
// names for use in placement, and later lookup.
|
||||
Status BuildGraph(const GraphDefBuilder& builder, Graph* out_graph) {
|
||||
TF_RETURN_IF_ERROR(builder.ToGraph(out_graph));
|
||||
nodes_by_name_.clear();
|
||||
for (Node* node : out_graph->nodes()) {
|
||||
nodes_by_name_[node->name()] = node->id();
|
||||
}
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Invokes the SimplePlacer on "graph". If no DeviceSet is specified, the
|
||||
// placement will use the default DeviceSet (of 10 CPU and 10 GPU devices).
|
||||
//
|
||||
// REQUIRES: "*graph" was produced by the most recent call to BuildGraph.
|
||||
Status Place(Graph* graph, DeviceSet* devices, SessionOptions* options) {
|
||||
SimplePlacer placer(graph, devices, &nodes_by_name_, options);
|
||||
return placer.Run();
|
||||
}
|
||||
|
||||
Status Place(Graph* graph, DeviceSet* devices) {
|
||||
return Place(graph, devices, nullptr);
|
||||
}
|
||||
|
||||
Status Place(Graph* graph, SessionOptions* options) {
|
||||
return Place(graph, &devices_, options);
|
||||
}
|
||||
|
||||
Status Place(Graph* graph) { return Place(graph, &devices_, nullptr); }
|
||||
|
||||
// Returns the node in "graph" with the given name.
|
||||
//
|
||||
// REQUIRES: "graph" was produced by the most recent call to BuildGraph.
|
||||
Node* GetNodeByName(const Graph& graph, const string& name) {
|
||||
const auto search = nodes_by_name_.find(name);
|
||||
CHECK(search != nodes_by_name_.end()) << "Unknown node name: " << name;
|
||||
return graph.FindNodeId(search->second);
|
||||
}
|
||||
|
||||
protected:
|
||||
std::vector<std::unique_ptr<Device>> local_devices_;
|
||||
DeviceSet devices_;
|
||||
SimplePlacer::NodeNameToIdMap nodes_by_name_;
|
||||
|
||||
Status ReferenceTestHelper(const string& variable_op_type,
|
||||
const string& assign_op_type,
|
||||
DeviceType expected_device_type);
|
||||
};
|
||||
|
||||
#define EXPECT_COLOCATED(g, name_a, name_b) \
|
||||
do { \
|
||||
Graph& g_ = (g); \
|
||||
EXPECT_EQ(GetNodeByName(g_, (name_a))->assigned_device_name(), \
|
||||
GetNodeByName(g_, (name_b))->assigned_device_name()); \
|
||||
} while (0)
|
||||
|
||||
#define EXPECT_DEVICE_TYPE(g, name, expected_device_type) \
|
||||
EXPECT_EQ(DeviceType(expected_device_type).type(), \
|
||||
devices_.FindDeviceByName( \
|
||||
GetNodeByName((g), (name))->assigned_device_name()) \
|
||||
->attributes() \
|
||||
.device_type())
|
||||
|
||||
#define EXPECT_DEVICE_CONTAINS(g, name, device_substr) \
|
||||
EXPECT_TRUE(StringPiece(GetNodeByName((g), (name))->assigned_device_name()) \
|
||||
.contains(device_substr))
|
||||
|
||||
// Test that a graph with no constraints will successfully assign nodes to the
|
||||
// "best available" device (i.e. prefer GPU over CPU).
|
||||
TEST_F(SimplePlacerTest, TestNoConstraints) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
ops::UnaryOp("TestRelu", ops::NodeOut(input, 0), b.opts().WithName("n1"));
|
||||
ops::UnaryOp("TestRelu", ops::NodeOut(input, 1), b.opts().WithName("n2"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
EXPECT_OK(Place(&g));
|
||||
EXPECT_DEVICE_TYPE(g, "in", DEVICE_CPU);
|
||||
EXPECT_DEVICE_TYPE(g, "n1", DEVICE_GPU);
|
||||
EXPECT_DEVICE_TYPE(g, "n2", DEVICE_GPU);
|
||||
}
|
||||
|
||||
// Test that a graph with device type and reference constraints on
|
||||
// some of the ops will successfully assign nodes to the constrained
|
||||
// device, and colocate nodes with reference connections.
|
||||
TEST_F(SimplePlacerTest, TestDeviceTypeConstraints) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
Node* var_cpu = ops::SourceOp("VariableCPU", b.opts().WithName("var_cpu"));
|
||||
ops::BinaryOp("AssignCPU", var_cpu, input, b.opts().WithName("assign_cpu"));
|
||||
Node* var_gpu = ops::SourceOp("VariableGPU", b.opts().WithName("var_gpu"));
|
||||
ops::BinaryOp("AssignGPU", var_gpu, input, b.opts().WithName("assign_gpu"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
EXPECT_OK(Place(&g));
|
||||
EXPECT_DEVICE_TYPE(g, "in", DEVICE_CPU);
|
||||
EXPECT_DEVICE_TYPE(g, "var_cpu", DEVICE_CPU);
|
||||
EXPECT_DEVICE_TYPE(g, "assign_cpu", DEVICE_CPU);
|
||||
EXPECT_COLOCATED(g, "var_cpu", "assign_cpu");
|
||||
EXPECT_DEVICE_TYPE(g, "var_gpu", DEVICE_GPU);
|
||||
EXPECT_DEVICE_TYPE(g, "assign_gpu", DEVICE_GPU);
|
||||
EXPECT_COLOCATED(g, "var_gpu", "assign_gpu");
|
||||
}
|
||||
|
||||
// Test that a graph with partial device specifications on the ops
|
||||
// will successfully be placed, respecting those partial specifications.
|
||||
TEST_F(SimplePlacerTest, TestPartialSpec) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("/job:a"));
|
||||
ops::SourceOp("TestVariable",
|
||||
b.opts().WithName("var").WithDevice("/job:a"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
EXPECT_OK(Place(&g));
|
||||
EXPECT_DEVICE_TYPE(g, "in", DEVICE_CPU);
|
||||
EXPECT_DEVICE_CONTAINS(g, "in", "/job:a");
|
||||
EXPECT_DEVICE_TYPE(g, "var", DEVICE_GPU);
|
||||
EXPECT_DEVICE_CONTAINS(g, "var", "/job:a");
|
||||
}
|
||||
|
||||
// Test that a node with an assigned device is not relocated.
|
||||
TEST_F(SimplePlacerTest, TestAssignedDevicePreserved) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
GetNodeByName(g, "in")
|
||||
->set_assigned_device_name("/job:a/replica:0/task:0/cpu:7");
|
||||
|
||||
EXPECT_OK(Place(&g));
|
||||
EXPECT_EQ("/job:a/replica:0/task:0/cpu:7",
|
||||
GetNodeByName(g, "in")->assigned_device_name());
|
||||
}
|
||||
|
||||
// Test that a graph with partial device specifications for CPU-only ops
|
||||
// will be relocated to CPU.
|
||||
TEST_F(SimplePlacerTest, TestPartialSpecGpuToCpu) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("/gpu:0"));
|
||||
ops::SourceOp("TestVariable",
|
||||
b.opts().WithName("var").WithDevice("/gpu:0"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
SessionOptions options;
|
||||
options.config.set_allow_soft_placement(true);
|
||||
EXPECT_OK(Place(&g, &options));
|
||||
EXPECT_DEVICE_TYPE(g, "in", DEVICE_CPU);
|
||||
EXPECT_DEVICE_CONTAINS(g, "in", "/cpu");
|
||||
EXPECT_DEVICE_TYPE(g, "var", DEVICE_GPU);
|
||||
EXPECT_DEVICE_CONTAINS(g, "var", "/gpu:0");
|
||||
}
|
||||
|
||||
// Test that a node with an assigned GPU device but has not registered
|
||||
// OpKernel will fail.
|
||||
TEST_F(SimplePlacerTest, TestAssignedGpuDeviceToCpuDevice) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
GetNodeByName(g, "in")
|
||||
->set_assigned_device_name("/job:a/replica:0/task:0/gpu:0");
|
||||
|
||||
Status s = Place(&g);
|
||||
EXPECT_EQ(error::INTERNAL, s.code());
|
||||
EXPECT_TRUE(
|
||||
StringPiece(s.error_message())
|
||||
.contains("Assigned device '/job:a/replica:0/task:0/gpu:0' "
|
||||
"does not have registered OpKernel support for TestInput"));
|
||||
}
|
||||
|
||||
// Test that graphs with reference connections are correctly placed.
|
||||
|
||||
// Build a graph containing a Variable op of "variable_op_type" and an
|
||||
// Assign op of "assign_op_type", and expect all of the ops to be
|
||||
// placed on a device of type "expected_device_type".
|
||||
Status SimplePlacerTest::ReferenceTestHelper(const string& variable_op_type,
|
||||
const string& assign_op_type,
|
||||
DeviceType expected_device_type) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
// Build ten variable-and-assignment pairs.
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
Node* var = ops::SourceOp(variable_op_type,
|
||||
b.opts().WithName(strings::StrCat("var_", i)));
|
||||
ops::BinaryOp(assign_op_type, var, input,
|
||||
b.opts().WithName(strings::StrCat("assign_", i)));
|
||||
}
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
TF_RETURN_IF_ERROR(Place(&g));
|
||||
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
EXPECT_COLOCATED(g, strings::StrCat("var_", i),
|
||||
strings::StrCat("assign_", i));
|
||||
EXPECT_DEVICE_TYPE(g, strings::StrCat("var_", i), expected_device_type);
|
||||
EXPECT_DEVICE_TYPE(g, strings::StrCat("assign_", i), expected_device_type);
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Test all 3^2 = 9 combinations of Variable and Assignment op types
|
||||
// (unconstrained, CPU-only, and GPU-only).
|
||||
TEST_F(SimplePlacerTest, TestReferenceConnection) {
|
||||
Status s;
|
||||
EXPECT_OK(ReferenceTestHelper("TestVariable", "TestAssign", DEVICE_GPU));
|
||||
EXPECT_OK(ReferenceTestHelper("TestVariable", "AssignCPU", DEVICE_CPU));
|
||||
EXPECT_OK(ReferenceTestHelper("TestVariable", "AssignGPU", DEVICE_GPU));
|
||||
EXPECT_OK(ReferenceTestHelper("VariableCPU", "TestAssign", DEVICE_CPU));
|
||||
EXPECT_OK(ReferenceTestHelper("VariableCPU", "AssignCPU", DEVICE_CPU));
|
||||
{
|
||||
Status s = ReferenceTestHelper("VariableCPU", "AssignGPU", DEVICE_CPU);
|
||||
EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
|
||||
EXPECT_TRUE(StringPiece(s.error_message())
|
||||
.contains("no device type supports both of those nodes"));
|
||||
}
|
||||
EXPECT_OK(ReferenceTestHelper("VariableGPU", "TestAssign", DEVICE_GPU));
|
||||
{
|
||||
Status s = ReferenceTestHelper("VariableGPU", "AssignCPU", DEVICE_CPU);
|
||||
EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
|
||||
EXPECT_TRUE(StringPiece(s.error_message())
|
||||
.contains("no device type supports both of those nodes"));
|
||||
}
|
||||
EXPECT_OK(ReferenceTestHelper("VariableGPU", "AssignGPU", DEVICE_GPU));
|
||||
}
|
||||
|
||||
// Test the handling of '@node_name' colocation constraints, when
|
||||
// these are arranged in multiple chains.
|
||||
TEST_F(SimplePlacerTest, TestColocatedChain) {
|
||||
Graph g(OpRegistry::Global());
|
||||
{ // Scope for temporary variables used to construct g.
|
||||
GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
|
||||
Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
|
||||
Node* last_node = input;
|
||||
for (int i = 0; i < 100; ++i) {
|
||||
if (i % 10 == 0) {
|
||||
// Every ten nodes, start a new chain.
|
||||
last_node = ops::UnaryOp("TestRelu", last_node,
|
||||
b.opts().WithName(strings::StrCat("n_", i)));
|
||||
} else {
|
||||
// Chain each successive node to the previous one.
|
||||
last_node =
|
||||
ops::UnaryOp("TestRelu", last_node,
|
||||
b.opts()
|
||||
.WithName(strings::StrCat("n_", i))
|
||||
.WithDevice(strings::StrCat("@n_", i - 1)));
|
||||
}
|
||||
}
|
||||
EXPECT_OK(BuildGraph(b, &g));
|
||||
}
|
||||
|
||||
EXPECT_OK(Place(&g));
|
||||
for (int i = 0; i < 100; ++i) {
|
||||
if (i % 10 != 0) {
|
||||
EXPECT_COLOCATED(g, strings::StrCat("n_", i - (i % 1)),
|
||||
strings::StrCat("n_", i));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Test the handling of '@node_name' colocation constraints, when the
// chains are shuffled.
TEST_F(SimplePlacerTest, TestColocatedChainWithLongRangeColocations) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
    Node* last_node = input;
    for (int i = 0; i < 10; ++i) {
      // Start ten chains.
      last_node = ops::UnaryOp("TestRelu", last_node,
                               b.opts().WithName(strings::StrCat("n_", i)));
    }
    for (int i = 10; i < 100; ++i) {
      // Add each node to the (i % 10)^th chain.
      last_node = ops::UnaryOp("TestRelu", last_node,
                               b.opts()
                                   .WithName(strings::StrCat("n_", i))
                                   .WithDevice(strings::StrCat("@n_", i % 10)));
    }
    EXPECT_OK(BuildGraph(b, &g));
  }

  EXPECT_OK(Place(&g));
  for (int i = 10; i < 100; ++i) {
    EXPECT_COLOCATED(g, strings::StrCat("n_", i % 10),
                     strings::StrCat("n_", i));
  }
}

TEST_F(SimplePlacerTest, TestColocationAndReferenceConnections) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
    for (int i = 0; i < 10; ++i) {
      // Declare ten variable and assignment pairs.
      Node* var = ops::SourceOp("TestVariable",
                                b.opts().WithName(strings::StrCat("var_", i)));
      ops::BinaryOp("TestAssign", var, input,
                    b.opts().WithName(strings::StrCat("assign_", i)));
    }
    for (int i = 10; i < 100; ++i) {
      // Create a variable colocated with some existing variable, and
      // an assignment colocated with a possibly-different variable.
      Node* var = ops::SourceOp(
          "TestVariable", b.opts()
                              .WithName(strings::StrCat("var_", i))
                              .WithDevice(strings::StrCat("@var_", i % 6)));
      ops::BinaryOp("TestAssign", var, input,
                    b.opts()
                        .WithName(strings::StrCat("assign_", i))
                        .WithDevice(strings::StrCat("@assign_", i % 3)));
    }
    EXPECT_OK(BuildGraph(b, &g));
  }

  EXPECT_OK(Place(&g));
  for (int i = 0; i < 10; ++i) {
    EXPECT_COLOCATED(g, strings::StrCat("var_", i),
                     strings::StrCat("assign_", i));
  }
  for (int i = 10; i < 100; ++i) {
    EXPECT_COLOCATED(g, strings::StrCat("var_", i),
                     strings::StrCat("assign_", i));
    EXPECT_COLOCATED(g, strings::StrCat("var_", i),
                     strings::StrCat("var_", i % 6));
    EXPECT_COLOCATED(g, strings::StrCat("assign_", i),
                     strings::StrCat("assign_", i % 3));
  }
}

// Test that placement fails when no devices are registered.
TEST_F(SimplePlacerTest, TestEmptyDeviceSet) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  DeviceSet empty;

  Status s = Place(&g, &empty);
  EXPECT_TRUE(
      StringPiece(s.error_message()).contains("No devices are registered"));
}

// Test that placement fails when the requested device forces an
// indirect constraint to be violated.
TEST_F(SimplePlacerTest, TestHeterogeneousDeviceSetFailure) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    Node* in = ops::SourceOp("TestInput", b.opts().WithName("in"));
    Node* var = ops::SourceOp("VariableGPU", b.opts().WithName("var"));
    ops::BinaryOp("TestAssign", var, in,
                  b.opts().WithName("assign").WithDevice("/job:b/task:1"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  DeviceSet heterogeneous;
  std::unique_ptr<Device> gpu(
      FakeDevice::MakeGPU("/job:b/replica:0/task:0/gpu:0"));
  heterogeneous.AddDevice(gpu.get());
  std::unique_ptr<Device> cpu(
      FakeDevice::MakeCPU("/job:b/replica:0/task:1/cpu:0"));
  heterogeneous.AddDevice(cpu.get());
  Status s = Place(&g, &heterogeneous);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("colocated with a group of nodes that required "
                            "incompatible device"));
}

// Test that placement fails when an unknown device is requested.
TEST_F(SimplePlacerTest, TestUnknownDevice) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("/job:foo"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains(
              "Could not satisfy explicit device specification '/job:foo'"));
}

// Test that placement fails when the combination of partial
// constraints leads to an unknown device.
TEST_F(SimplePlacerTest, TestUnknownMergedDevice) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("/job:foo"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains(
              "Could not satisfy explicit device specification '/job:foo'"));
}

// Test that placement fails when the previously-assigned device for a
// node is unknown.
TEST_F(SimplePlacerTest, TestUnknownAssignedDevice) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  GetNodeByName(g, "in")->set_assigned_device_name("/job:foo");

  Status s = Place(&g);
  EXPECT_EQ(error::INTERNAL, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains("Assigned device '/job:foo' does not match any device"));
}

// Test that placement fails when an op with no registered kernels is
// requested.
TEST_F(SimplePlacerTest, TestNoKernelsRegistered) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("VariableNoKernels", b.opts().WithName("var"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains(
              "No OpKernel was registered to support Op 'VariableNoKernels'"));
}

// Test that placement fails when a kernel is registered but no known
// device supports it.
TEST_F(SimplePlacerTest, TestNoDevicesRegistered) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("VariableGPU", b.opts().WithName("var"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  DeviceSet cpu_only;
  std::unique_ptr<Device> cpu(
      FakeDevice::MakeCPU("/job:a/replica:0/task:0/cpu:0"));
  cpu_only.AddDevice(cpu.get());

  Status s = Place(&g, &cpu_only);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("No OpKernel was registered to support "
                            "Op 'VariableGPU'"));
}

// Test that placement fails when a requested device is malformed.
TEST_F(SimplePlacerTest, TestMalformedDeviceSpecification) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("/foo:bar"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("Malformed device specification '/foo:bar'"));
}

// Test that placement fails when a previously-assigned device is malformed.
TEST_F(SimplePlacerTest, TestMalformedAssignedDevice) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  GetNodeByName(g, "in")->set_assigned_device_name("/foo:bar");

  Status s = Place(&g);
  EXPECT_EQ(error::INTERNAL, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("Malformed assigned device '/foo:bar'"));
}

// Test that placement fails when a device was previously assigned to
// a node, but it does not uniquely identify a particular device.
TEST_F(SimplePlacerTest, TestNonUniqueAssignedDevice) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  GetNodeByName(g, "in")->set_assigned_device_name("/job:a");

  Status s = Place(&g);
  EXPECT_EQ(error::INTERNAL, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains("Assigned device '/job:a' does not match any device"));
}

// Test that placement fails when a node requests colocation with another
// node that does not exist.
TEST_F(SimplePlacerTest, TestUnknownColocatedNode) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("@foo"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message()).contains("'foo' does not exist"));
}

// Test that placement fails when a node requests colocation with a
// malformed node name.
TEST_F(SimplePlacerTest, TestMalformedColocatedNode) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestInput", b.opts().WithName("in").WithDevice("@"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("node named in device '' does not exist"));
}

// Test that an op requesting placement on a non-existent device will be
// relocated to an existing device of the same type if allow_soft_placement
// is set.
TEST_F(SimplePlacerTest, TestNonexistentGpuAllowSoftPlacement) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestDevice", b.opts().WithName("in").WithDevice("/gpu:11"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  SessionOptions options;
  options.config.set_allow_soft_placement(true);
  EXPECT_OK(Place(&g, &options));
  EXPECT_DEVICE_CONTAINS(g, "in", "/gpu:0");
}

// Test that an op requesting placement on a non-existent device will fail if
// allow_soft_placement is not set.
TEST_F(SimplePlacerTest, TestNonexistentGpuNoAllowSoftPlacement) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("TestDevice", b.opts().WithName("in").WithDevice("/gpu:11"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  SessionOptions options;
  Status s = Place(&g, &options);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains(
              "Could not satisfy explicit device specification '/gpu:11'"));
}

// Test that placement fails when a node requests an explicit device that is
// not supported by the registered kernels and allow_soft_placement is not set.
TEST_F(SimplePlacerTest, TestUnsupportedDeviceNoAllowSoftPlacement) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("VariableGPU", b.opts().WithName("var").WithDevice("/cpu:0"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  SessionOptions options;
  Status s = Place(&g, &options);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(
      StringPiece(s.error_message())
          .contains(
              "Could not satisfy explicit device specification '/cpu:0'"));
}

TEST_F(SimplePlacerTest, TestUnsupportedDeviceAllowSoftPlacement) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    ops::SourceOp("VariableGPU", b.opts().WithName("var").WithDevice("/cpu:0"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  SessionOptions options;
  options.config.set_allow_soft_placement(true);
  EXPECT_OK(Place(&g, &options));
}

// Test that a graph with device type and reference constraints on
// some of the ops will successfully assign nodes to the constrained
// device, and colocate nodes with reference connections.
TEST_F(SimplePlacerTest, TestDeviceTypeConstraintsAllowSoftPlacement) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    // var_gpu has a ref output and runs on GPU.
    // force_gpu takes var_gpu and requests placement on CPU.
    // Verify that both are placed on GPU.
    Node* var_gpu = ops::SourceOp("VariableGPU", b.opts().WithName("var_gpu"));
    ops::UnaryOp("TestDeviceEnforce", var_gpu,
                 b.opts().WithName("force_gpu").WithDevice("/cpu:0"));
    // var_cpu has a ref output and runs on CPU.
    // force_cpu takes var_cpu and requests placement on GPU.
    // Verify that both are placed on CPU.
    Node* var_cpu = ops::SourceOp("VariableCPU", b.opts().WithName("var_cpu"));
    ops::UnaryOp("TestDeviceEnforce", var_cpu,
                 b.opts().WithName("force_cpu").WithDevice("/gpu:0"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  SessionOptions options;
  options.config.set_allow_soft_placement(true);
  EXPECT_OK(Place(&g, &options));
  EXPECT_DEVICE_TYPE(g, "var_gpu", DEVICE_GPU);
  EXPECT_DEVICE_TYPE(g, "force_gpu", DEVICE_GPU);
  EXPECT_COLOCATED(g, "var_gpu", "force_gpu");
  EXPECT_DEVICE_TYPE(g, "var_cpu", DEVICE_CPU);
  EXPECT_DEVICE_TYPE(g, "force_cpu", DEVICE_CPU);
  EXPECT_COLOCATED(g, "var_cpu", "force_cpu");
}

// Test that placement fails when two nodes have a reference connection
// constraint, and each node requires a mutually incompatible device.
TEST_F(SimplePlacerTest, TestUnsatisfiableConstraintWithReferenceConnections) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    Node* var = ops::SourceOp("VariableGPU", b.opts().WithName("var"));
    Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));
    ops::BinaryOp("AssignCPU", var, input, b.opts().WithName("assign"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("Cannot colocate nodes 'var' and 'assign'"));
}

// Test that placement fails when two nodes have an explicit
// colocation constraint, and each node requires a mutually
// incompatible device.
TEST_F(SimplePlacerTest, TestUnsatisfiableConstraintWithColocatedNodes) {
  Graph g(OpRegistry::Global());
  {  // Scope for temporary variables used to construct g.
    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
    Node* input = ops::SourceOp("TestInput",
                                b.opts().WithName("in").WithDevice("/gpu:0"));
    Node* relu_1 = ops::UnaryOp("TestRelu", input,
                                b.opts().WithName("relu_1").WithDevice("@in"));
    ops::UnaryOp("ReluGPU", relu_1,
                 b.opts().WithName("relu_2").WithDevice("@relu_1"));
    EXPECT_OK(BuildGraph(b, &g));
  }

  Status s = Place(&g);
  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
  EXPECT_TRUE(StringPiece(s.error_message())
                  .contains("Cannot colocate nodes 'relu_1' and 'relu_2'"));
}

}  // namespace
}  // namespace tensorflow
55
tensorflow/core/common_runtime/threadpool_device.cc
Normal file
@ -0,0 +1,55 @@
#include "tensorflow/core/common_runtime/threadpool_device.h"

#include "tensorflow/core/common_runtime/local_device.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/device_base.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/graph/types.h"
#include "tensorflow/core/lib/hash/hash.h"
#include "tensorflow/core/platform/port.h"
#include "tensorflow/core/platform/tracing.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

ThreadPoolDevice::ThreadPoolDevice(const SessionOptions& options,
                                   const string& name, Bytes memory_limit,
                                   BusAdjacency bus_adjacency,
                                   Allocator* allocator)
    : LocalDevice(options, Device::BuildDeviceAttributes(
                               name, DEVICE_CPU, memory_limit, bus_adjacency),
                  allocator),
      allocator_(allocator) {}

ThreadPoolDevice::~ThreadPoolDevice() {}

void ThreadPoolDevice::Compute(OpKernel* op_kernel, OpKernelContext* context) {
  if (port::Tracing::IsActive()) {
    // TODO(pbar) We really need a useful identifier of the graph node.
    const uint64 id = Hash64(op_kernel->name());
    port::Tracing::ScopedActivity region(port::Tracing::EventCategory::kCompute,
                                         id);
    op_kernel->Compute(context);
  } else {
    op_kernel->Compute(context);
  }
}

Allocator* ThreadPoolDevice::GetAllocator(AllocatorAttributes attr) {
  return allocator_;
}

Status ThreadPoolDevice::MakeTensorFromProto(
    const TensorProto& tensor_proto, const AllocatorAttributes alloc_attrs,
    Tensor* tensor) {
  Tensor parsed(tensor_proto.dtype());
  if (!parsed.FromProto(cpu_allocator(), tensor_proto)) {
    return errors::InvalidArgument("Cannot parse tensor from proto: ",
                                   tensor_proto.DebugString());
  }
  *tensor = parsed;
  return Status::OK();
}

}  // namespace tensorflow
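For illustration only, a minimal caller-side sketch of MakeTensorFromProto (hypothetical, not part of this commit; it assumes a constructed ThreadPoolDevice reachable through the pointer `device` and the TensorProto/Tensor types used above):

  // Hypothetical sketch: build a 1-D float TensorProto and ask the device to
  // materialize it as a Tensor backed by the CPU allocator.
  TensorProto proto;
  proto.set_dtype(DT_FLOAT);
  proto.mutable_tensor_shape()->add_dim()->set_size(2);
  proto.add_float_val(1.0f);
  proto.add_float_val(2.0f);
  Tensor t;
  Status s = device->MakeTensorFromProto(proto, AllocatorAttributes(), &t);
  CHECK(s.ok());  // On success, t is a float Tensor of shape {2}.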
31
tensorflow/core/common_runtime/threadpool_device.h
Normal file
@ -0,0 +1,31 @@
#ifndef TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
#define TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_

#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/common_runtime/local_device.h"

namespace tensorflow {

// CPU device implementation.
class ThreadPoolDevice : public LocalDevice {
 public:
  ThreadPoolDevice(const SessionOptions& options, const string& name,
                   Bytes memory_limit, BusAdjacency bus_adjacency,
                   Allocator* allocator);
  ~ThreadPoolDevice() override;

  void Compute(OpKernel* op_kernel, OpKernelContext* context) override;
  Allocator* GetAllocator(AllocatorAttributes attr) override;
  Status MakeTensorFromProto(const TensorProto& tensor_proto,
                             const AllocatorAttributes alloc_attrs,
                             Tensor* tensor) override;

  Status Sync() override { return Status::OK(); }

 private:
  Allocator* allocator_;  // Not owned
};

}  // namespace tensorflow

#endif  // TENSORFLOW_COMMON_RUNTIME_THREADPOOL_DEVICE_H_
31
tensorflow/core/common_runtime/threadpool_device_factory.cc
Normal file
@ -0,0 +1,31 @@
// Register a factory that provides CPU devices.
#include "tensorflow/core/common_runtime/threadpool_device.h"

#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

// TODO(zhifengc/tucker): Figure out the bytes of available RAM.
class ThreadPoolDeviceFactory : public DeviceFactory {
 public:
  void CreateDevices(const SessionOptions& options, const string& name_prefix,
                     std::vector<Device*>* devices) override {
    // TODO(zhifengc/tucker): Figure out the number of available CPUs
    // and/or NUMA configuration.
    int n = 1;
    auto iter = options.config.device_count().find("CPU");
    if (iter != options.config.device_count().end()) {
      n = iter->second;
    }
    for (int i = 0; i < n; i++) {
      string name = strings::StrCat(name_prefix, "/cpu:", i);
      devices->push_back(new ThreadPoolDevice(options, name, Bytes(256 << 20),
                                              BUS_ANY, cpu_allocator()));
    }
  }
};
REGISTER_LOCAL_DEVICE_FACTORY("CPU", ThreadPoolDeviceFactory);

}  // namespace tensorflow
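Usage note (a hedged sketch, not part of this commit): because CreateDevices reads the ConfigProto device_count map, a caller can ask for more than one CPU device through SessionOptions, roughly as follows (assuming the standard proto3 map accessor on the generated ConfigProto class):

  // Hypothetical caller-side sketch: request two CPU devices; the factory
  // above would then create "<name_prefix>/cpu:0" and "<name_prefix>/cpu:1".
  SessionOptions options;
  (*options.config.mutable_device_count())["CPU"] = 2;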
95
tensorflow/core/example/example.proto
Normal file
@ -0,0 +1,95 @@
// Protocol messages for describing input data Examples for machine learning
// model training or inference.
syntax = "proto3";

import "tensorflow/core/example/feature.proto";
// option cc_enable_arenas = true;

package tensorflow;

// Example for a movie recommendation application:
//   features {
//     feature {
//       key: "age"
//       float_list {
//         value: 29.0
//       }
//     }
//     feature {
//       key: "movie"
//       bytes_list {
//         value: "The Shawshank Redemption"
//         value: "Fight Club"
//       }
//     }
//     feature {
//       key: "movie_ratings"
//       float_list {
//         value: 9.0
//         value: 9.7
//       }
//     }
//     feature {
//       key: "suggestion"
//       bytes_list {
//         value: "Inception"
//       }
//     }
//     # Note that this feature exists to be used as a label in training.
//     # E.g., if training a logistic regression model to predict purchase
//     # probability in our learning tool we would set the label feature to
//     # "suggestion_purchased".
//     feature {
//       key: "suggestion_purchased"
//       float_list {
//         value: 1.0
//       }
//     }
//     # Similar to "suggestion_purchased" above, this feature exists to be
//     # used as a label in training.
//     # E.g., if training a linear regression model to predict purchase
//     # price in our learning tool we would set the label feature to
//     # "purchase_price".
//     feature {
//       key: "purchase_price"
//       float_list {
//         value: 9.99
//       }
//     }
//   }
//
// A conformant data set obeys the following conventions:
//   - If a Feature K exists in one example with data type T, it must be of
//       type T in all other examples when present. It may be omitted.
//   - The number of instances of Feature K list data may vary across examples,
//       depending on the requirements of the model.
//   - If a Feature K doesn't exist in an example, a K-specific default will be
//       used, if configured.
//   - If a Feature K exists in an example but contains no items, the intent
//       is considered to be an empty tensor and no default will be used.

message Example {
  Features features = 1;
};

// Example representing a ranking instance.
message RankingExample {
  Features context = 1;
  repeated Features positive = 2;
  repeated Features negative = 3;
};

// Example representing a sequence.
// The context contains features which apply to the entire sequence.
// Each element in example represents an entry in the sequence.
message SequenceExample {
  Features context = 1;
  repeated Features features = 2;
};

// Example representing a list of feature maps.
// The context contains features which apply to all feature maps.
message InferenceExample {
  Features context = 1;
  repeated Features features = 2;
};
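To make the movie-recommendation comment concrete, here is a hedged C++ sketch of populating an Example. It assumes the classes generated from this file and from feature.proto (i.e. that Features holds a map<string, Feature> field named `feature`, as the comment above implies but this diff does not show):

  // Hypothetical sketch, using the generated classes in namespace tensorflow:
  // build the "age" and "movie" features from the comment above.
  Example example;
  auto* feature_map = example.mutable_features()->mutable_feature();
  (*feature_map)["age"].mutable_float_list()->add_value(29.0f);
  auto* movies = (*feature_map)["movie"].mutable_bytes_list();
  movies->add_value("The Shawshank Redemption");
  movies->add_value("Fight Club");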