Commit Graph

80 Commits

Author SHA1 Message Date
Peter Hawkins
e7f6b0c7ee [XLA] Don't pass on_host_shape to ShapedBuffer/ScopedShapedBuffer inside XLA.
PiperOrigin-RevId: 336133292
Change-Id: I47a6fa5a5f2c6a460bdaeb1acc5125ff20710230
2020-10-08 12:00:12 -07:00
Taré Gaskin
939db02ff5 xla directory resolutions
2020-07-26 22:14:33 +00:00
George Karpenkov
1a9b57d729 [XLA] Store host shape in ExecutionInput
Simplify the APIs explicitly passing the host shape

PiperOrigin-RevId: 321083080
Change-Id: I9e124dd4465ee4037f2d0cdbd33f04a43f35abc2
2020-07-13 20:03:34 -07:00
George Karpenkov
6116b7f911 [XLA] [client] Implement a RunAsync overload which does not need a vector of shapes
PiperOrigin-RevId: 317406952
Change-Id: I69d8cc8a68ffdfbf70e2969f5df5e6adba7d2e1d
2020-06-19 17:09:13 -07:00
George Karpenkov
62683d061c [XLA] Rollback of rollback of "Implement LocalClient::Run which supports buffer donation"
PiperOrigin-RevId: 317400695
Change-Id: I56f1f8df347d5a3b2bad9526c7315c63ad6ddadb
2020-06-19 16:33:47 -07:00
A. Unique TensorFlower
a82b75c82b [XLA/Client] Implement LocalClient::Run which supports buffer donation
PiperOrigin-RevId: 317195199
Change-Id: If4d35d0627fa068a0c2b522fdae52466abd21f51
2020-06-18 15:40:54 -07:00
George Karpenkov
ef52b4e088 [XLA/Client] Implement LocalClient::Run which supports buffer donation
PiperOrigin-RevId: 317173848
Change-Id: If92955ac5051376fbf0932b773b675459497c0c4
2020-06-18 14:03:29 -07:00
Peter Hawkins
a056967714 [XLA:Python] Move Compile() options into a CompileOptions struct.
When there are many options to a function, it's cleaner to use an options struct. This change is in preparation for adding another option.

Since ExecutableBuildOptions already has a DeviceAssignment field, we can use it rather than having our own. We can also sink the DeviceAssignment legality check into XLA proper, given it's now just testing that two parts of ExecutableBuildOptions agree.

PiperOrigin-RevId: 300407020
Change-Id: Idcd0acae6bb89eda34085ae2471b0eafd5cdef95
2020-03-11 14:31:35 -07:00
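The options-struct pattern this commit describes can be sketched in plain C++. All names below are illustrative stand-ins, not XLA's actual definitions:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch: instead of a long parameter list, Compile() takes a
// single struct whose fields all have sensible defaults.
struct DeviceAssignmentSketch {
  int replica_count = 1;
  int computation_count = 1;
};

struct CompileOptionsSketch {
  std::vector<std::string> argument_layouts;           // per-argument layouts
  std::optional<DeviceAssignmentSketch> device_assignment;  // where to run
  bool tuple_arguments = false;                        // pass args as one tuple
};

// Adding a new option later only touches the struct, not every call site.
inline std::string Compile(const std::string& computation,
                           const CompileOptionsSketch& options = {}) {
  return computation + (options.tuple_arguments ? "/tupled" : "");
}

// A caller that needs a non-default option sets just that field.
inline std::string CompileTupled(const std::string& computation) {
  CompileOptionsSketch options;
  options.tuple_arguments = true;
  return Compile(computation, options);
}
```

The payoff is exactly the one the commit message names: a new field can be added to the struct without changing any existing caller.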
Davide Libenzi
d1085a6e00 Make XLA's kUserAlias work together with XRT's swap/compaction.
The XRT tuple allocation owns the device memory, which, in order for the lower-level
aliasing to work, needs to be handed out as "owning" within the parameter's shape tree.
But if the parameter's shape tree gets destroyed (for an intermediate error before execute),
the memory will be released and the tuple allocation will be left pointing to freed memory.
This CL introduces an ExecutionInput data structure which wraps a maybe-owning shape tree
together with the indices which should be released before the shape tree gets destroyed.
This allows the data structure to travel down until the point where the buffers land
inside the ExecutionOutput, which uses similar logic (until finally the result gets consumed).
Unfortunately, the situation with the device memory data structures has gotten a bit messy:
Owning, MaybeOwning, ShapedBuffer, ScopedShapedBuffer, ... none of which works nicely
with buffer sharing.
Ideally we should have something like std::shared_ptr<OwningDeviceMemory> and
ShapeTree<std::shared_ptr<OwningDeviceMemory>> and be done with it.
Unfortunately, the change toward that goal (I started down that route first) is pretty major.

PiperOrigin-RevId: 298498866
Change-Id: I2e27c11b7187fa2992ae3b606ea95c18f312cb5a
2020-03-02 18:39:10 -08:00
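The "maybe-owning shape tree plus to-be-released indices" idea from the commit above can be sketched in isolation. This is an illustrative toy, not XLA's real ExecutionInput; "freeing" is modeled with a counter standing in for device-memory deallocation:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Counter standing in for device-memory deallocation in this sketch.
inline int& FreeCount() { static int n = 0; return n; }

// The input records which buffer indices it still owns, and on destruction
// frees exactly those -- donated/aliased indices are left alone because the
// callee has taken ownership of them.
class ExecutionInputSketch {
 public:
  explicit ExecutionInputSketch(std::size_t n) : buffers_(n) {}
  ~ExecutionInputSketch() {
    for (std::size_t i = 0; i < buffers_.size(); ++i)
      if (!unowned_.count(i)) ++FreeCount();  // free only still-owned slots
  }
  // Mark an index as donated: ownership has moved to the callee.
  void SetUnownedIndex(std::size_t i) { unowned_.insert(i); }

 private:
  std::vector<int> buffers_;
  std::set<std::size_t> unowned_;
};

// Returns how many buffers a 3-slot input frees when one index is donated.
inline int FreedWhenOneDonated() {
  FreeCount() = 0;
  {
    ExecutionInputSketch in(3);
    in.SetUnownedIndex(1);  // slot 1 donated; slots 0 and 2 still owned
  }
  return FreeCount();
}
```

On an early error before execute, the destructor releases only the buffers the input still owns, which is precisely the dangling-pointer hazard the CL is closing.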
Chris Jones
9e944aa4fc Change LocalClient::Compile to support returning multiple executables (one per partition).
PiperOrigin-RevId: 292094485
Change-Id: Idaa4d14246478e5ec9b45d1d17d5610f35d35611
2020-01-29 01:03:57 -08:00
A. Unique TensorFlower
a3edaa7235 Change LocalClient::Compile to support returning multiple executables (one per partition).
PiperOrigin-RevId: 291343358
Change-Id: I0550040ddbb67e78e9e4078185e0af6b11b96e35
2020-01-24 03:35:03 -08:00
Chris Jones
c42a05f658 Change LocalClient::Compile to support returning multiple executables (one per partition).
PiperOrigin-RevId: 291341897
Change-Id: I8107c03ebb08b40e5e5cabed1002fecee38767ed
2020-01-24 03:20:38 -08:00
Peter Hawkins
4a08e00f6e [XLA] Add support for buffer donation to the XLA local client API.
PiperOrigin-RevId: 289847802
Change-Id: Ic25df197d6cdcea4ef08840ab2ac16d0c986cd06
2020-01-15 06:36:47 -08:00
TensorFlower Gardener
240497c2b2 Merge pull request from kiszk:spelling_tweaks_compiler
PiperOrigin-RevId: 285793343
Change-Id: I20bc2b8ca24d5fee4613f12abeba48957f80221f
2019-12-16 10:53:40 -08:00
Kazuaki Ishizaki
e664420b79 minor spelling tweaks
2019-12-09 18:21:12 +09:00
A. Unique TensorFlower
e4159c833e Update paths to LLVM.
PiperOrigin-RevId: 278455338
Change-Id: I4422a2de8365a8fa8cd5bcfe87b0f70e7dd352d6
2019-11-04 13:53:39 -08:00
Chen Chen
001c155f41 Internal change.
PiperOrigin-RevId: 278446480
Change-Id: I105e8b3841bba3cb6513a7840bac2bb70d1536d9
2019-11-04 13:49:21 -08:00
Peter Hawkins
efd304f40e [XLA] Add an Executable::ExecuteAsyncOnStreamWrapper.
Move logic from ExecuteOnStreamWrapper, make it callback-based, and make ExecuteOnStreamWrapper a thin wrapper around ExecuteAsyncOnStreamWrapper that blocks.

Helps solve https://github.com/google/jax/issues/774 by making the XLA profiling code work on the async path used by JAX.

PiperOrigin-RevId: 262416495
2019-08-08 14:06:31 -07:00
Peter Hawkins
1762bef938 [XLA] Consolidate Executable::ExecuteOnStream and ExecuteAsyncOnStream.
Remove ExecuteOnStream virtual method, make ExecuteOnStream a non-virtual wrapper around ExecuteAsyncOnStream.

This means that backend authors have one method to implement (ExecuteAsyncOnStream) rather than two, and reduces the number of code paths to running an executable.

Comment that ExecuteAsyncOnStream may in fact not be async. While undesirable, this is a quality-of-implementation issue, not a bug. Future changes can make implementations of ExecuteAsyncOnStream truly async.

PiperOrigin-RevId: 261922907
2019-08-06 08:56:57 -07:00
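The consolidation above follows a common pattern: one virtual async method for backends to implement, and a non-virtual sync method that is just "enqueue, then block". A minimal sketch (names are illustrative, not XLA's actual Executable interface):

```cpp
#include <future>

// The one method backends implement is the async one; the sync entry point
// is a thin non-virtual wrapper that blocks on its result.
class ExecutableSketch {
 public:
  virtual ~ExecutableSketch() = default;

  // Backend-provided: enqueue work, return a handle to the eventual result.
  virtual std::future<int> ExecuteAsyncOnStream(int input) = 0;

  // Non-virtual convenience wrapper: launch, then block.
  int ExecuteOnStream(int input) { return ExecuteAsyncOnStream(input).get(); }
};

// A toy backend whose "computation" doubles its input. std::launch::deferred
// keeps the sketch single-threaded; get() runs the work.
class DoublingExecutable : public ExecutableSketch {
 public:
  std::future<int> ExecuteAsyncOnStream(int input) override {
    return std::async(std::launch::deferred, [input] { return input * 2; });
  }
};
```

This matches the stated benefit: backend authors implement one method, and the sync path cannot drift out of sync with the async one.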
Peter Hawkins
ab8b627e0b [XLA] Clean up execution_profile usage and make it thread-safe.
Currently a mutable execution_profile is attached to xla::Executable. This isn't thread safe, since the same Executable may be invoked concurrently. Instead, clients already have the ability to pass in their own ExecutionProfile via the ExecutableRunOptions; update that one instead.

Simplify APIs that accepted both an ExecutionProfile and an ExecutableRunOptions given the latter contains a pointer to an instance of the former.

PiperOrigin-RevId: 261570194
2019-08-04 08:57:47 -07:00
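The thread-safety fix above amounts to moving mutable profiling state from the shared executable into per-call run options. A sketch of that pattern, with illustrative names and a fixed stand-in timing value:

```cpp
// Each caller supplies its own profile object through per-call run options,
// so concurrent runs with distinct profiles never race on shared state.
// (Names and the constant timing value here are illustrative.)
struct ExecutionProfileSketch {
  long compute_time_ns = 0;
};

struct RunOptionsSketch {
  ExecutionProfileSketch* execution_profile = nullptr;  // optional, per call
};

// The executable writes only through the pointer the caller passed in.
inline int Run(int input, const RunOptionsSketch& opts) {
  int result = input + 1;  // stand-in for the real computation
  if (opts.execution_profile) opts.execution_profile->compute_time_ns = 1000;
  return result;
}

// A caller that wants profiling passes its own profile and reads it back.
inline long ProfiledRunTime() {
  ExecutionProfileSketch profile;
  RunOptionsSketch opts{&profile};
  Run(7, opts);
  return profile.compute_time_ns;
}
```

Callers that do not care about profiling simply leave the pointer null, which is why the redundant separate ExecutionProfile parameters could be dropped.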
Peter Hawkins
18f1467496 [XLA] Make HLO snapshot dumping work on the LocalClient::RunAsync path.
PiperOrigin-RevId: 259956061
2019-07-25 09:07:24 -07:00
Peter Hawkins
3f266b1c8d Automated rollback of commit 53da0bc5ce
PiperOrigin-RevId: 259945774
2019-07-25 07:58:19 -07:00
A. Unique TensorFlower
53da0bc5ce Automated rollback of commit e8510ab01d
PiperOrigin-RevId: 259937937
2019-07-25 06:58:45 -07:00
Peter Hawkins
e8510ab01d [XLA] Improve thread-safety of HLO snapshot dumping.
Currently the code keeps a mutable HloSnapshot attached to the xla::Executable object. This cannot work correctly in the presence of concurrent executions. Instead, keep only an immutable HloProto attached to xla::Executable and construct ephemeral HloSnapshots during dumping.

This has the minor downside that it requires copying the HloProto each time we dump, but presumably if you are dumping HLO snapshots you don't particularly care about performance.

PiperOrigin-RevId: 259934176
2019-07-25 06:26:40 -07:00
George Karpenkov
0410cff073 Move DeviceMemoryAllocator and OwningDeviceMemory from XLA to StreamExecutor.
This change achieves three goals:

1. There are currently three different allocator abstractions in three different
places: XLA, stream executor, and tensorflow.
This change shrinks down the number of packages with allocator abstraction to
two.

2. Moving the allocator enables unifying ScopedDeviceMemory
and OwningDeviceMemory, which both have "owning pointer" semantics
but slightly different APIs.

3. Moving the allocator enables moving RedzoneAllocator into stream executor,
which we would like to use in tensorflow to catch out-of-bounds writes in
CUDNN convolutions during autotuning.

PiperOrigin-RevId: 247211996
2019-05-08 10:11:22 -07:00
Peter Hawkins
c7b255ae35 [JAX] Add support for asynchronous execution, but leave it disabled by default for now.
[XLA:Python] Add support for asynchronous execution in the Python client.

Python isn't famous for being the world's fastest language, so for high performance eager-style dispatch it is helpful to be able to hide Python latency behind device computations by having the Python code dispatch device operations asynchronously.

The design here closely follows the design of asynchronous execution in TensorFlow and the TensorFlow/XLA client. We use three main streams:
* a compute stream, for running XLA computations;
* a host-to-device stream, for transferring data onto the device; and
* a device-to-host stream, for transferring data off the device.

Both host-to-device transfers and compute are asynchronous, that is, they return control to Python as soon as any necessary error checking is complete, but before the operation completes. This allows the Python code to enqueue any subsequent operations while the previously enqueued operations complete. Device-to-host transfers are still blocking, in the sense that they stall the host until the host-side data is ready.

[XLA] Add LocalExecutable::RunAsync() to obtain async execution on a stream. There is currently no way to achieve this via the LocalClient API, only by using internal XLA APIs.
[XLA:GPU] Implement ExecuteAsyncOnStream. It turns out that ExecuteOnStream is already more or less async anyway.

PiperOrigin-RevId: 246650968
2019-05-04 07:36:09 -07:00
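The three-stream dispatch described above can be sketched with in-order FIFOs: enqueueing returns immediately, and only the device-to-host read drains its dependencies first. This is an illustrative single-threaded model, not the real client:

```cpp
#include <functional>
#include <queue>

// A "stream" here is just a FIFO of operations run in order when drained.
class FakeStream {
 public:
  void Enqueue(std::function<void()> op) { ops_.push(std::move(op)); }
  void Drain() {
    while (!ops_.empty()) {
      ops_.front()();
      ops_.pop();
    }
  }

 private:
  std::queue<std::function<void()>> ops_;
};

// Host-to-device transfer and compute are async (enqueue and return);
// only the device-to-host read blocks, by draining prior work first.
inline int RunDoublePlusOne(int host_value) {
  FakeStream h2d, compute;
  int device_buf = 0, result_buf = 0;
  h2d.Enqueue([&] { device_buf = host_value; });             // async copy-in
  compute.Enqueue([&] { result_buf = device_buf * 2 + 1; }); // async compute
  // Device-to-host: stall the "host" until enqueued work completes.
  h2d.Drain();
  compute.Drain();
  return result_buf;
}
```

The host-side code after the Enqueue calls could keep queuing further operations; that gap is exactly the Python latency the commit is hiding.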
Peter Hawkins
4fc7f70df7 [XLA] Change some xla::Literal arguments to xla::LiteralSlice so the functions accept other kinds of literals.
Update a stale comment in xla::ShapeUtil.

PiperOrigin-RevId: 240606826
2019-03-27 11:41:24 -07:00
Justin Lebar
39587aaeb7 [XLA] Rework debug flags for dumping HLO.
The following flags (usually passed via the XLA_FLAGS envvar) are removed:

  xla_dump_computations_to
  xla_dump_executions_to
  xla_dump_ir_to
  xla_dump_optimized_hlo_proto_to
  xla_dump_per_pass_hlo_proto_to
  xla_dump_unoptimized_hlo_proto_to
  xla_generate_hlo_graph
  xla_generate_hlo_text_to
  xla_hlo_dump_as_html
  xla_hlo_graph_path
  xla_log_hlo_text

The following new flags are added:

  xla_dump_to
  xla_dump_hlo_module_re
  xla_dump_hlo_pass_re
  xla_dump_hlo_as_text
  xla_dump_hlo_as_proto
  xla_dump_hlo_as_dot
  xla_dump_hlo_as_url
  xla_dump_hlo_as_html
  xla_dump_ir
  xla_dump_hlo_snapshots

The default is not to dump anything at all, but as soon as some dumping flag is
specified, we enable the following defaults (most of which can be overridden).

 * dump to stdout (overridden by --xla_dump_to)
 * dump HLO modules at the very beginning and end of the optimization pipeline
 * don't dump between any HLO passes (overridden by --xla_dump_hlo_pass_re)
 * dump all HLO modules (overridden by --xla_dump_hlo_module_re)
 * dump in textual format (overridden by
   --xla_dump_hlo_as_{text,proto,dot,url,html}).

For example, to dump optimized and unoptimized HLO text and protos to /tmp/foo,
pass

  --xla_dump_to=/tmp/foo --xla_dump_hlo_as_text --xla_dump_hlo_as_proto

For details on these flags' meanings, see xla.proto.

The intent of this change is to make dumping both simpler to use and more
powerful.

For example:

 * Previously there was no way to dump the HLO module during the pass pipeline
   in HLO text format; the only option was --xla_dump_per_pass_hlo_proto_to,
   which dumped in proto format.

   Now this is --xla_dump_hlo_pass_re=.* --xla_dump_hlo_as_text.  (In fact, the
   second flag is not necessary in this case, as dumping as text is the
   default.)

 * Previously there was no way to dump HLO as a graph before and after
   compilation; the only option was --xla_generate_hlo_graph, which would dump
   before/after every pass.

   Now this is --xla_dump_hlo_as_{dot,url,html} (depending on what format you
   want the graph in).

 * Previously, there was no coordination between the filenames written by the
   various flags, so info about one module might be dumped with various
   filename prefixes.  Now the filenames are consistent and all dumps from a
   particular module are next to each other.

If you only specify some of these flags, we try to figure out what you wanted.
For example:

 * --xla_dump_to implies --xla_dump_hlo_as_text unless you specify some
   other --xla_dump_hlo_as_* flag.

 * --xla_dump_hlo_as_text or --xla_dump_ir implies dumping to stdout unless you
   specify a different --xla_dump_to directory.  You can explicitly dump to
   stdout with --xla_dump_to=-.

As part of this change, I simplified the debugging code in the HLO passes for
dumping HLO modules.  Previously, many tests explicitly VLOG'ed the HLO module
before, after, and sometimes during the pass.  I removed these VLOGs.  If you
want dumps before/during/after an HLO pass, use --xla_dump_hlo_pass_re=<pass_name>.

PiperOrigin-RevId: 237920279
2019-03-11 17:35:39 -07:00
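As the commit notes, these flags are usually passed through the XLA_FLAGS environment variable rather than on the command line. A hypothetical invocation (the program name is illustrative):

```shell
# Dump optimized and unoptimized HLO as both text and proto to /tmp/foo.
# 'train.py' stands in for any XLA-using binary or script.
XLA_FLAGS="--xla_dump_to=/tmp/foo --xla_dump_hlo_as_text --xla_dump_hlo_as_proto" \
  python train.py
```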
Sanjoy Das
767a1fe746 Delete unused ServiceExecutableRunOptions::xla_intra_op_thread_pool; NFC
PiperOrigin-RevId: 228627868
2019-01-09 18:37:15 -08:00
Justin Lebar
7c7ed2b64c [XLA] Print out shapes' layouts in local_client's "shape/layout doesn't match" error.
Previously we'd say "shape/layout doesn't match" and then print out the shapes
without their layouts!

PiperOrigin-RevId: 224105237
2018-12-04 23:22:57 -08:00
A. Unique TensorFlower
0cd83fc0d8 Use low level API to transfer literal to device in local client.
PiperOrigin-RevId: 221471132
2018-11-14 10:52:09 -08:00
A. Unique TensorFlower
dd6d7c5c58 Global de-std::unique_ptr cleanup for xla::Literal.
PiperOrigin-RevId: 212313258
2018-09-10 12:38:19 -07:00
Tim Shen
6f879f891a [XLA] Rename all (Mutable)ArraySlice to absl::Span.
PiperOrigin-RevId: 210998142
2018-08-30 16:07:27 -07:00
Justin Lebar
d57f5a8202 [XLA] Switch to absl::StrFormat.
Unlike Printf, StrFormat does not require type-length qualifiers, e.g.
%z or %ll.  Nor does it require that you call c_str() to print strings.
So these are fixed up here as well.

PiperOrigin-RevId: 210435915
2018-08-27 14:55:29 -07:00
Justin Lebar
e924d67bff [XLA] Use absl::make_unique instead of xla::MakeUnique.
Same for WrapUnique.

PiperOrigin-RevId: 209531124
2018-08-20 20:23:24 -07:00
Kay Zhu
b67b3927a1 [XLA] Fix use-of-uninitialized-value msan failure in local_client as well.
PiperOrigin-RevId: 208004791
2018-08-09 00:24:49 -07:00
Kay Zhu
963ef37203 [TF:XLA] Introduce MutableBorrowingLiteral to enable interacting with a (tensor) buffer not owned by XLA/Literal class directly, without having to memcpy the Literal to a (Host)Tensor.
PiperOrigin-RevId: 207972410
2018-08-08 17:24:39 -07:00
A. Unique TensorFlower
abd645085b Use the correct device ordinal to check whether the device the executable was
built for is equivalent to the device it will run on.

Before this patch, if the device to run on was provided via a stream without
setting the device ordinal in the ExecutableRunOptions, we would check the
default device against the device the executable was built for.

PiperOrigin-RevId: 206892902
2018-08-01 01:06:22 -07:00
Todd Wang
15b155e929 Replace generic Pool with StreamPool, and discard failed streams.
We have a Pool in XLA that maintains a freelist of Streams, to avoid
the overhead of repeatedly allocating new Streams. Streams have a
monotonic state machine; if a stream encounters any error, it will
remain in an error state forever.

The functional change in this CL is to ensure that streams which have
encountered an error are deleted, rather than being put back on the
pool. Without this change, a previously failed stream will be put back
on the pool, only to cause the next usage of the stream to trivially
fail.

I've chosen to replace the generic templatized Pool with a concrete
StreamPool, since this makes the logic more straightforward to reason
about. Also note that the only existing usage of Pool is to hold
streams.

The functional change is in stream_pool.cc; nearly everything else is
mechanical updates.

PiperOrigin-RevId: 206100631
2018-07-25 21:02:08 -07:00
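The discard-on-error behavior described above is easy to sketch: a freelist-backed pool whose return path drops any object whose sticky error bit is set. Names here are illustrative, not the real StreamPool:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Streams have a monotonic state machine: once the error bit flips, it
// never recovers. (A bool stands in for the real state here.)
struct StreamSketch {
  bool ok = true;
};

class StreamPoolSketch {
 public:
  // Reuse a cached stream if one exists, otherwise allocate a fresh one.
  std::unique_ptr<StreamSketch> BorrowStream() {
    if (!freelist_.empty()) {
      auto s = std::move(freelist_.back());
      freelist_.pop_back();
      return s;
    }
    return std::make_unique<StreamSketch>();
  }

  // The fix from this CL: only healthy streams go back on the freelist;
  // an errored stream is destroyed instead of poisoning the next borrower.
  void ReturnStream(std::unique_ptr<StreamSketch> s) {
    if (s->ok) freelist_.push_back(std::move(s));
  }

  std::size_t size() const { return freelist_.size(); }

 private:
  std::vector<std::unique_ptr<StreamSketch>> freelist_;
};

// Freelist size after returning one healthy and one failed stream.
inline std::size_t SizeAfterMixedReturns() {
  StreamPoolSketch pool;
  auto good = pool.BorrowStream();
  auto bad = pool.BorrowStream();
  bad->ok = false;  // simulate a stream error
  pool.ReturnStream(std::move(good));
  pool.ReturnStream(std::move(bad));  // silently dropped
  return pool.size();
}
```

Without the error check in ReturnStream, the failed stream would be handed out again and the next usage would trivially fail, which is the bug the CL fixes.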
Mark Heffernan
438c8e2b0a Move xla_computation.* from xla/client/xla_client up to xla/client.
Plan is to move everything in xla/client/xla_client up to xla/client and remove
the directory.

No functional change.

PiperOrigin-RevId: 206055680
2018-07-25 14:33:14 -07:00
A. Unique TensorFlower
9ab04addfb Remove the ambiguity of device/host computation layouts within the HloModuleConfig.
PiperOrigin-RevId: 201284741
2018-06-19 19:42:57 -07:00
A. Unique TensorFlower
bae4a271c0 Internal change
PiperOrigin-RevId: 201161803
2018-06-19 06:04:39 -07:00
A. Unique TensorFlower
f0230735d1 [XLA] Redesign: delete SessionModule.
PiperOrigin-RevId: 199361402
2018-06-05 14:51:27 -07:00
Justin Lebar
49535c9da6 [XLA] Switch replay_computation to use LocalClient.
This lets replay_computation build an executable once and run it
multiple times.  This is particularly important because in XLA:GPU, the
first run of an executable does some autotuning and therefore is
unrepresentative.

This change removes --xla_hlo_profile_last_run, because I don't see how
to support it in LocalClient -- LocalClient wants the do-profile bit to
be set when we *compile*.  (There may not be an easy fix for this; it
worked with regular Client because we were recompiling every time we
ran.)

PiperOrigin-RevId: 198643577
2018-05-30 17:03:41 -07:00
A. Unique TensorFlower
f36c93505f [XLA] Redesign: delete xla::Computation.
PiperOrigin-RevId: 197069851
2018-05-17 16:26:35 -07:00
Justin Lebar
2f5f2cb425 [XLA] s/tensorflow::Status/Status/.
These are type aliases of one another; we'd like to be consistent and
use the shorter one.

PiperOrigin-RevId: 196322955
2018-05-11 16:07:35 -07:00
A. Unique TensorFlower
210abebd3f [TF:XLA] Separate on-host and on-device shape and layout in HloModule.
Previously, only one layout was stored with an HLO module. This CL allows
HLO passes to modify the on-device layouts without affecting the on-host
layout (provided by the client).

PiperOrigin-RevId: 195014875
2018-05-01 16:19:55 -07:00
Justin Lebar
56fd856425 [XLA] Make Executable return a ScopedShapedBuffer.
Previously, we returned a plain ShapedBuffer.  But this doesn't capture
our semantics: it's up to the caller to free this ShapedBuffer.

PiperOrigin-RevId: 193854051
2018-04-22 14:50:48 -07:00
Justin Lebar
4e17a3f149 [XLA] De-unique_ptr-ify ShapedBuffer and ScopedShapedBuffer.
These are already notionally equivalent to T* and unique_ptr<T>, so
having a unique_ptr of a {Scoped,}ShapedBuffer is pretty redundant.

Also clean up the ScopedShapedBuffer API a bit.

PiperOrigin-RevId: 193599773
2018-04-19 17:22:23 -07:00
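The T* vs unique_ptr<T> analogy in the commit above can be made concrete with a toy owning/non-owning pair; wrapping the owning type in yet another unique_ptr would be as redundant as unique_ptr<unique_ptr<T>>. All names are illustrative, not XLA's real buffer types:

```cpp
// Counter standing in for live device allocations in this sketch.
inline int& LiveBuffers() { static int n = 0; return n; }

// Like a T*: a non-owning view of the underlying memory.
struct ShapedBufferSketch {
  int* data = nullptr;
};

// Like a unique_ptr<T>: owns the memory and frees it on destruction.
class ScopedShapedBufferSketch {
 public:
  ScopedShapedBufferSketch() : data_(new int(0)) { ++LiveBuffers(); }
  ~ScopedShapedBufferSketch() {
    delete data_;
    --LiveBuffers();
  }
  ScopedShapedBufferSketch(const ScopedShapedBufferSketch&) = delete;
  ScopedShapedBufferSketch& operator=(const ScopedShapedBufferSketch&) = delete;

  // Hand out a non-owning view, like ShapedBuffer over ScopedShapedBuffer.
  ShapedBufferSketch view() const { return ShapedBufferSketch{data_}; }

 private:
  int* data_;
};

// Live allocations after a scoped buffer goes out of scope: zero, because
// the scoped type itself manages the lifetime -- no extra unique_ptr needed.
inline int LiveAfterScopeExit() {
  {
    ScopedShapedBufferSketch b;
    (void)b.view();
  }
  return LiveBuffers();
}
```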
Justin Lebar
d77a621a57 [XLA] Convert XLA to use xla::se as a namespace alias for ::stream_executor.
PiperOrigin-RevId: 193301997
2018-04-17 21:07:05 -07:00