# XLA Architecture

<div style="width:50%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:50%" src="./images/xlalogo.png">
</div>

## Why did we build XLA?

We had several objectives for XLA to work with TensorFlow:

* *Improve execution speed.* Compile subgraphs to reduce the execution time of
  short-lived Ops to eliminate overhead from the TensorFlow runtime, fuse
  pipelined operations to reduce memory overhead, and specialize to known
  tensor shapes to allow for more aggressive constant propagation (see the
  JIT-compilation sketch after this list).

* *Improve memory usage.* Analyze and schedule memory usage, in principle
  eliminating many intermediate storage buffers.

* *Reduce reliance on custom Ops.* Remove the need for many custom Ops by
  improving the performance of automatically fused low-level Ops to match the
  performance of custom Ops that were fused by hand.

* *Reduce mobile footprint.* Eliminate the TensorFlow runtime by ahead-of-time
  compiling the subgraph and emitting an object/header file pair that can be
  linked directly into another application. The results can reduce the
  footprint for mobile inference by several orders of magnitude.

* *Improve portability.* Make it relatively easy to write a new backend for
  novel hardware, at which point a large fraction of TensorFlow programs will
  run unmodified on that hardware. This is in contrast with the approach of
  specializing individual monolithic Ops for new hardware, which requires
  TensorFlow programs to be rewritten to make use of those Ops.

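
As a concrete illustration of the first objective above, the sketch below marks
a small TensorFlow function for XLA JIT compilation. It is a minimal sketch,
assuming a TensorFlow 2.x build in which `tf.function` accepts the
`jit_compile` argument; the function body, names, and shapes are illustrative
only.

```python
import tensorflow as tf

# Minimal sketch (assumes TF 2.x with jit_compile support): the ops inside
# model_fn are handed to XLA as a single cluster instead of being dispatched
# one-by-one by the TensorFlow runtime.
@tf.function(jit_compile=True)
def model_fn(x, w, b):
  # A few short-lived ops that XLA can fuse and specialize to known shapes.
  y = tf.matmul(x, w) + b
  return tf.nn.relu(y)

x = tf.random.normal([8, 16])
w = tf.random.normal([16, 4])
b = tf.zeros([4])
print(model_fn(x, w, b).shape)  # (8, 4)
```
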
## How does XLA work?

The input language to XLA is called "HLO IR", or just HLO (High Level
Optimizer). The semantics of HLO are described on the
[Operation Semantics](./operation_semantics.md) page. It is most convenient to
think of HLO as a
[compiler IR](https://en.wikipedia.org/wiki/Intermediate_representation).

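
To make HLO concrete, the sketch below prints the HLO that XLA builds for a
small JIT-compiled function. This assumes TensorFlow 2.x's
`experimental_get_compiler_ir` API on a `tf.function`; the exact textual output
varies between versions.

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def scaled_add(x, y):
  return 2.0 * x + y

x = tf.ones([4], dtype=tf.float32)
y = tf.ones([4], dtype=tf.float32)

# experimental_get_compiler_ir(...) returns a callable that can emit the
# program at several stages; stage="hlo" yields the HLO text for this
# computation (assumed API, available in recent TF 2.x releases).
print(scaled_add.experimental_get_compiler_ir(x, y)(stage="hlo"))
```
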
XLA takes graphs ("computations") defined in HLO and compiles them into machine
instructions for various architectures. XLA is modular in the sense that it is
easy to slot in an alternative backend to
[target some novel HW architecture](./developing_new_backend.md). The CPU
backend for x64 and ARM64 as well as the NVIDIA GPU backend are in the
TensorFlow source tree.

The following diagram shows the compilation process in XLA:

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img src="./images/how-does-xla-work.png">
</div>

XLA comes with several optimizations and analysis passes that are
target-independent, such as
[CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
target-independent operation fusion, and buffer analysis for allocating runtime
memory for the computation.

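
As a conceptual illustration of what fusion and buffer analysis buy (an assumed
example, not XLA's internal pass code), consider the chain of elementwise ops
below: executed op-by-op, each one launches its own kernel and materializes an
intermediate tensor, whereas a fused XLA computation can avoid allocating those
intermediates altogether.

```python
import tensorflow as tf

# Assumed example: a chain of elementwise ops. Compiled with XLA, fusion can
# combine the chain into a single computation so the intermediate results
# a, b, and c never need their own runtime buffers.
@tf.function(jit_compile=True)
def fused_chain(x):
  a = x * 2.0        # elementwise multiply
  b = tf.exp(a)      # elementwise exp
  c = b + 1.0        # elementwise add
  return tf.tanh(c)  # elementwise tanh

print(fused_chain(tf.ones([1024])).shape)  # (1024,)
```
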
After the target-independent step, XLA sends the HLO computation to a backend.
The backend can perform further HLO-level optimizations, this time with
target-specific information and needs in mind. For example, the XLA GPU backend
may perform operation fusion beneficial specifically for the GPU programming
model and determine how to partition the computation into streams. At this
stage, backends may also pattern-match certain operations or combinations
thereof to optimized library calls.

The next step is target-specific code generation. The CPU and GPU backends
included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization,
and code-generation. These backends emit the LLVM IR necessary to represent the
XLA HLO computation in an efficient manner, and then invoke LLVM to emit native
code from this LLVM IR.

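
If you want to look at what the compiler emits at these stages, one hedged
option is the `--xla_dump_to` flag passed through the `XLA_FLAGS` environment
variable; depending on the version and backend, the dump directory contains the
HLO before and after optimization and may also include the LLVM IR and
generated machine code. The sketch below assumes a TensorFlow 2.x build and an
arbitrary dump path.

```python
import os

# Assumption: XLA_FLAGS is read when XLA initializes, so set it before
# importing TensorFlow. /tmp/xla_dump is an arbitrary example path.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import tensorflow as tf

@tf.function(jit_compile=True)
def f(x):
  return tf.tanh(x * 3.0)

f(tf.ones([8]))                      # trigger compilation
print(os.listdir("/tmp/xla_dump"))   # inspect the dumped compiler artifacts
```
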
The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the
CPU backend supports multiple CPU ISAs.