Benoit Jacob 1790e093de Don't round the allocator's storage size to the next power of two. This is typically a huge buffer. We're going to reach a steady state where we have only a few such buffers and they won't get frequently reallocated, anyway. PiperOrigin-RevId: 264669851		2019-08-21 13:07:51 -07:00
allocator_test.cc	Make Allocator's RAII behavior more sensible.	2019-05-08 14:01:29 -07:00
allocator.cc	Fix allocator in cases of sizes overflowing 32bit integer arithmetic	2019-08-21 13:07:37 -07:00
allocator.h	Don't round the allocator's storage size to the next power of two. This is typically a huge buffer. We're going to reach a steady state where we have only a few such buffers and they won't get frequently reallocated, anyway.	2019-08-21 13:07:51 -07:00
benchmark.cc	Ruy - restore benchmark parameters	2019-07-30 13:42:59 -07:00
block_map.cc	Avoid divisions when the divisor is a power of two.	2019-07-30 10:45:00 -07:00
block_map.h	Rewrite/simplify tracing.	2019-07-25 13:30:53 -07:00
blocking_counter.cc	Refactor WaitForVariableChange: abstract away the atomic operations from it	2019-06-20 06:37:12 -07:00
blocking_counter.h	Refactor WaitForVariableChange: abstract away the atomic operations from it	2019-06-20 06:37:12 -07:00
BUILD	Fix allocator in cases of sizes overflowing 32bit integer arithmetic	2019-08-21 13:07:37 -07:00
check_macros.h
common.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
context.cc
context.h	Fix the type to avoid comparison of integers of different signs.	2019-07-02 21:28:48 -07:00
detect_dotprod.cc	Use auxv to detect dotprod on linux >= 4.14.111.	2019-06-19 11:07:47 -07:00
detect_dotprod.h
dispatch.h	Introduce a SidePair concept allowing us to rewrite much internal	2019-07-25 11:07:47 -07:00
example_advanced.cc	Add low-level pre-packing API in ruy_advanced.h	2019-05-10 17:29:44 -07:00
example.cc	Unbreak example.cc: it was running into the recently added assertion against both zero points being the minimum representable value.	2019-05-29 12:35:43 -07:00
internal_matrix.h	Make the kStandardCpp kernel layout and the cache-friendly traversal	2019-06-21 14:08:08 -07:00
kernel_arm32.cc	Ruy - ARM32 asm optimizations	2019-07-30 11:12:29 -07:00
kernel_arm64.cc	Fix performance regression (b/137615815) introduced by new platform	2019-07-16 08:43:21 -07:00
kernel_avx512.cc	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
kernel.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
matrix.h	Make the kStandardCpp kernel layout and the cache-friendly traversal	2019-06-21 14:08:08 -07:00
opt_set.h	While waiting for a block to be packed by another thread, do some other	2019-07-30 10:24:07 -07:00
pack_arm.cc	Ruy - ARM32 asm optimizations	2019-07-30 11:12:29 -07:00
pack_avx512.cc	Fix compile error (use of avx512 intrinsics without including header)	2019-08-02 07:03:30 -07:00
pack.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
path.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
platform.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
pmu.cc	Improvements to PMU stats in ruy benchmark:	2019-06-19 10:20:48 -07:00
pmu.h	Improvements to PMU stats in ruy benchmark:	2019-06-19 10:20:48 -07:00
prepack.h	Changing the packing strategy from being non-blocking but potentially redundant, to being non-redundant but potentially blocking.	2019-07-25 13:22:44 -07:00
README.md	Remove links to benchmark spreadsheets not ready for public release.	2019-04-18 10:24:02 -07:00
ruy_advanced.h	Introduce a SidePair concept allowing us to rewrite much internal	2019-07-25 11:07:47 -07:00
ruy_test_ext.bzl
ruy_test.bzl	Ruy testing.	2019-05-10 22:22:43 -07:00
ruy_visibility.bzl	Introduce a :cpu_backend_gemm library allowing to perform all	2019-04-26 11:36:49 -07:00
ruy.h	Remove some stale comments.	2019-06-10 11:34:37 -07:00
side_pair.h	Introduce a SidePair concept allowing us to rewrite much internal	2019-07-25 11:07:47 -07:00
size_util_test.cc	Fix allocator in cases of sizes overflowing 32bit integer arithmetic	2019-08-21 13:07:37 -07:00
size_util.h	Fix allocator in cases of sizes overflowing 32bit integer arithmetic	2019-08-21 13:07:37 -07:00
spec.h	Make the kStandardCpp kernel layout and the cache-friendly traversal	2019-06-21 14:08:08 -07:00
test_fast.cc	Fix UBSan error in test that was testing unrealistic accumulation depth	2019-07-16 10:13:22 -07:00
test_slow.cc	Detemplatize TrMul and introduce type-erased TrMulParams.	2019-04-30 12:20:02 -07:00
test_special_specs.cc	Generalize MakeBlockMap a little to allow rectangular kernels,	2019-06-21 14:51:10 -07:00
test.h	Ruy: Introduce x86 (AVX-512) code.	2019-07-30 11:49:00 -07:00
thread_pool.cc	ruy::ThreadPool: when there is only 1 task, don't even touch atomic counters.	2019-08-02 08:12:40 -07:00
thread_pool.h	Change existing call sites of the old deprecated gemmlowp WorkersPool::Execute method, which is a footgun because it destroys the Task object that it takes, to the new more explicit name LegacyExecuteAndDestroyTasks for the same behavior.	2019-04-30 13:37:11 -07:00
time.h	Overhaul time.h: don't force everything to go through floating-point	2019-07-30 11:03:32 -07:00
trace.cc	Overhaul time.h: don't force everything to go through floating-point	2019-07-30 11:03:32 -07:00
trace.h	Rewrite/simplify tracing.	2019-07-25 13:30:53 -07:00
trmul_params.h	Changing the packing strategy from being non-blocking but potentially redundant, to being non-redundant but potentially blocking.	2019-07-25 13:22:44 -07:00
trmul.cc	Avoid expensive atomics altogether, including the allocation of arrays,	2019-08-02 07:57:05 -07:00
trmul.h	Introduce a SidePair concept allowing us to rewrite much internal	2019-07-25 11:07:47 -07:00
tune_test.cc	- Disable tuning on Apple - we don't want to use an in-order-tuned	2019-07-19 14:18:25 -07:00
tune_tool.cc
tune.cc	Overhaul time.h: don't force everything to go through floating-point	2019-07-30 11:03:32 -07:00
tune.h	- Disable tuning on Apple - we don't want to use an in-order-tuned	2019-07-19 14:18:25 -07:00
wait_test.cc	Refactor WaitForVariableChange: abstract away the atomic operations from it	2019-06-20 06:37:12 -07:00
wait.cc	Overhaul time.h: don't force everything to go through floating-point	2019-07-30 11:03:32 -07:00
wait.h	Refactor WaitForVariableChange: abstract away the atomic operations from it	2019-06-20 06:37:12 -07:00

README.md

ruy is not BLAS

ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of TensorFlow Lite.

ruy supports both floating-point (like Eigen) and quantized (like gemmlowp) matrix multiplication.
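
To make the floating-point versus quantized distinction concrete, here is a minimal sketch of how one destination entry might be accumulated in each mode. It assumes the gemmlowp-style convention in which a quantized value `q` stands for a real value proportional to `q - zero_point`. The function names `FloatDot` and `QuantizedDot` are hypothetical and are not part of ruy's API; the final rescaling of the int32 accumulator back to 8 bits is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not ruy's API): plain floating-point inner product.
float FloatDot(const std::vector<float>& lhs_row,
               const std::vector<float>& rhs_col) {
  assert(lhs_row.size() == rhs_col.size());
  float acc = 0.0f;
  for (std::size_t k = 0; k < lhs_row.size(); ++k) {
    acc += lhs_row[k] * rhs_col[k];
  }
  return acc;
}

// Hypothetical helper (not ruy's API): quantized inner product.
// Each uint8 value is offset by its zero point before multiplying, and the
// products are summed in a 32-bit integer accumulator.
std::int32_t QuantizedDot(const std::vector<std::uint8_t>& lhs_row,
                          const std::vector<std::uint8_t>& rhs_col,
                          std::int32_t lhs_zero_point,
                          std::int32_t rhs_zero_point) {
  assert(lhs_row.size() == rhs_col.size());
  std::int32_t acc = 0;
  for (std::size_t k = 0; k < lhs_row.size(); ++k) {
    const std::int32_t a = static_cast<std::int32_t>(lhs_row[k]) - lhs_zero_point;
    const std::int32_t b = static_cast<std::int32_t>(rhs_col[k]) - rhs_zero_point;
    acc += a * b;
  }
  return acc;
}
```

In a real quantized pipeline the int32 accumulator would then be scaled and clamped back to the destination type; that step depends on parameters this README does not describe, so it is omitted here.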

Status

ruy is very new, immature code. It has quite good test coverage, but the code is in flux, lacks comments, needs more cleanup, and there are no design docs at the moment.

We hope to improve on all that and integrate ruy into TensorFlow Lite, at first as a non-default path for ARM A64 only, over the next few weeks [April 2019].

Efficiency

ruy is designed to achieve maximal performance not just on very large sizes, as is the focus of many established libraries, but on the actual sizes and shapes of matrices most critical in current TensorFlow Lite applications. These are often quite small, e.g. 100x100 or even 50x50, and come in all sorts of rectangular shapes.

ruy is currently only optimized for ARM A64; other architectures have only slow reference code at the moment.

ruy is currently optimized only for the following combination of storage orders: LHS = row-major, RHS = column-major, destination = column-major. All other combinations of storage orders fall back to slow reference code at the moment.
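
The storage-order combination above can be illustrated with a plain reference multiplication. This is only a sketch of the indexing, not ruy's implementation or API; the name `ReferenceMul` and its signature are assumptions made for illustration. With a row-major LHS and a column-major RHS, each inner product reads two contiguous runs of memory, and a column-major destination is written one column at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not ruy's API): computes dst = lhs * rhs where
//   lhs is rows x depth, row-major:     lhs[r * depth + k]
//   rhs is depth x cols, column-major:  rhs[c * depth + k]
//   dst is rows x cols, column-major:   dst[c * rows + r]
std::vector<float> ReferenceMul(const std::vector<float>& lhs,
                                const std::vector<float>& rhs,
                                std::size_t rows, std::size_t depth,
                                std::size_t cols) {
  assert(lhs.size() == rows * depth);
  assert(rhs.size() == depth * cols);
  std::vector<float> dst(rows * cols, 0.0f);
  for (std::size_t c = 0; c < cols; ++c) {
    for (std::size_t r = 0; r < rows; ++r) {
      float acc = 0.0f;
      // Row r of lhs and column c of rhs are both contiguous in memory.
      for (std::size_t k = 0; k < depth; ++k) {
        acc += lhs[r * depth + k] * rhs[c * depth + k];
      }
      dst[c * rows + r] = acc;
    }
  }
  return dst;
}
```

For example, multiplying the 2x3 matrix with rows (1, 2, 3) and (4, 5, 6) by the 3x2 matrix with columns (7, 9, 11) and (8, 10, 12) means passing `{1, 2, 3, 4, 5, 6}` as `lhs` and `{7, 9, 11, 8, 10, 12}` as `rhs`; the result comes back column by column.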