Keep only the simple auxv method for detecting dotprod instructions.

The auxv method is available on Linux >= 4.15 in general; on Android, it needs at least Linux 4.14.111, thanks to a late backport. That backport landed just before the Android 10 release, so this change leaves out pre-release Android 10 builds as well as earlier Android versions.

Part of the rationale for submitting this now is an understanding that most devices with new hardware supporting dotprod instructions either shipped with Android 10 in the first place (Pixel 4) or have already received an Android 10 update (LG G8, Samsung Galaxy S10, Note 10). We are probably leaving some devices unsupported, but the signal-handler detection method that this change removes was reported to cause crashes on other devices, so this is a compromise. At least now we won't crash anywhere; at worst, some devices won't get the speedup from dotprod instructions.

PiperOrigin-RevId: 288340160
Change-Id: I9f1b934f4e3996456af0489b780d567596f6db92
Benoit Jacob 2020-01-06 11:27:49 -08:00 committed by TensorFlower Gardener
parent 87cab44823
commit c6c81bc173
1 changed files with 30 additions and 181 deletions


@@ -13,189 +13,44 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-/* Temporary dotprod-detection until we can rely on proper feature-detection
-such as getauxval on Linux (requires a newer Linux kernel than we can
-currently rely on on Android).
-
-There are two main ways that this could be implemented: using a signal
-handler or a fork. The current implementation uses a signal handler.
-This is because on current Android, an uncaught signal gives a latency
-of over 100 ms. In order for the fork approach to be worthwhile, it would
-have to save us the hassle of handling signals, and such an approach thus
-has an unavoidable 100ms latency. By contrast, the present signal-handling
-approach has low latency.
-
-Downsides of the current signal-handling approach include:
-1. Setting and restoring signal handlers is not thread-safe: we can't
-prevent another thread from interfering with us. We at least prevent
-other threads from calling our present code concurrently by using a lock,
-but we can't do anything about other threads using their own code to
-set signal handlers.
-2. Signal handlers are not entirely portable, e.g. b/132973173 showed that
-on Apple platform the EXC_BAD_INSTRUCTION signal is not always caught
-by a SIGILL handler (difference between Release and Debug builds).
-3. The signal handler approach looks confusing in a debugger (has to
-tell the debugger to 'continue' past the signal every time). Fix:
-```
-(gdb) handle SIGILL nostop noprint pass
-```
-Here is what the nicer fork-based alternative would look like.
-Its only downside, as discussed above, is high latency, 100 ms on Android.
-```
-bool try_asm_snippet(bool (*asm_snippet)()) {
-  int child_pid = fork();
-  if (child_pid == -1) {
-    // Fork failed.
-    return false;
-  }
-  if (child_pid == 0) {
-    // Child process code path. Pass the raw boolean return value of
-    // asm_snippet as exit code (unconventional: 1 means true == success).
-    _exit(asm_snippet());
-  }
-  int child_status;
-  waitpid(child_pid, &child_status, 0);
-  if (WIFSIGNALED(child_status)) {
-    // Child process terminated by signal, meaning the instruction was
-    // not supported.
-    return false;
-  }
-  // Return the exit code of the child, which per child code above was
-  // the return value of asm_snippet().
-  return WEXITSTATUS(child_status);
-}
-```
-*/
+/* Detection of dotprod instructions on ARM.
+ * The current Linux-specific code relies on sufficiently new Linux kernels:
+ * At least Linux 4.15 in general; on Android, at least Linux 4.14.111 thanks to
+ * a late backport. This was backported just before the Android 10 release, so
+ * this is leaving out pre-release Android 10 builds as well as earlier Android
+ * versions.
+ *
+ * It is possible to detect instructions in other ways that don't rely on
+ * an OS-provided feature identification mechanism:
+ *
+ * (A) We used to have a SIGILL-handler-based method that worked at least
+ * on Linux. Its downsides were (1) crashes on a few devices where
+ * signal handler installation didn't work as intended; (2) additional
+ * complexity to generalize to other Unix-ish operating systems including
+ * iOS; (3) source code complexity and fragility of anything installing
+ * and restoring signal handlers; (4) confusing behavior under a debugger.
+ *
+ * (B) We also experimented with a fork-ing approach where a subprocess
+ * tries the instruction. Compared to (A), this is much simpler and more
+ * reliable and portable, but also much higher latency on Android where
+ * an uncaught signal typically causes a 100 ms latency.
+ *
+ * Should there be interest in either technique again in the future,
+ * code implementing both (A) and (B) can be found in earlier revisions of this
+ * file - in actual code for (A) and in a comment for (B).
+ */
 #include "tensorflow/lite/experimental/ruy/detect_arm.h"
-#if defined __aarch64__ && defined __linux__
-#define RUY_IMPLEMENT_DETECT_DOTPROD
-#endif
-#ifdef RUY_IMPLEMENT_DETECT_DOTPROD
-#include <setjmp.h>
-#include <signal.h>
-#include <unistd.h>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <mutex> // NOLINT(build/c++11)
-// Intentionally keep checking for __linux__ here in case we want to
-// extend RUY_IMPLEMENT_DETECT_DOTPROD outside of linux in the future.
 #ifdef __linux__
 #include <sys/auxv.h>
-#include <sys/utsname.h>
-#endif
 #endif
 namespace ruy {
-#ifdef RUY_IMPLEMENT_DETECT_DOTPROD
 namespace {
-// Waits until there are no pending SIGILL's.
-void wait_until_no_pending_sigill() {
-  sigset_t pending;
-  do {
-    sigemptyset(&pending);
-    sigpending(&pending);
-  } while (sigismember(&pending, SIGILL));
-}
-// long-jump buffer used to continue execution after a caught SIGILL.
-sigjmp_buf& global_sigjmp_buf_just_before_trying_snippet() {
-  static sigjmp_buf g;
-  return g;
-}
-// SIGILL signal handler. Long-jumps to just before
-// we ran the snippet that we know is the only thing that could have generated
-// the SIGILL.
-void sigill_handler(int) {
-  siglongjmp(global_sigjmp_buf_just_before_trying_snippet(), 1);
-}
-// Try an asm snippet. Returns true if it passed i.e. ran without generating
-// a SIGILL and returned true. Returns false if a SIGILL was generated, or
-// if it returned false.
-// Other signals are not handled.
-bool try_asm_snippet(bool (*asm_snippet)()) {
-  // This function installs and restores signal handlers. The only way it's ever
-  // going to be reentrant is with a big lock around it.
-  static std::mutex mutex;
-  std::lock_guard<std::mutex> lock(mutex);
-  // Install the SIGILL signal handler. Save any existing signal handler for
-  // restoring later.
-  struct sigaction sigill_action;
-  memset(&sigill_action, 0, sizeof(sigill_action));
-  sigill_action.sa_handler = sigill_handler;
-  sigemptyset(&sigill_action.sa_mask);
-  struct sigaction old_action;
-  sigaction(SIGILL, &sigill_action, &old_action);
-  // Try the snippet.
-  bool got_sigill =
-      sigsetjmp(global_sigjmp_buf_just_before_trying_snippet(), true);
-  bool snippet_retval = false;
-  if (!got_sigill) {
-    snippet_retval = asm_snippet();
-    wait_until_no_pending_sigill();
-  }
-  // Restore the old signal handler.
-  sigaction(SIGILL, &old_action, nullptr);
-  return snippet_retval && !got_sigill;
-}
-bool dotprod_asm_snippet() {
-  // maratek@ mentioned that for some other ISA extensions (fp16)
-  // there have been implementations that failed to generate SIGILL even
-  // though they did not correctly implement the instruction. Just in case
-  // a similar situation might exist here, we do a simple correctness test.
-  int result = 0;
-  asm volatile(
-      "mov w0, #100\n"
-      "dup v0.16b, w0\n"
-      "dup v1.4s, w0\n"
-      ".word 0x6e809401 // udot v1.4s, v0.16b, v0.16b\n"
-      "mov %w[result], v1.s[0]\n"
-      : [ result ] "=r"(result)
-      :
-      : "x0", "v0", "v1");
-  // Expecting 100 (input accumulator value) + 100 * 100 + ... (repeat 4 times)
-  return result == 40100;
-}
-bool DetectDotprodBySigIllMethod() {
-  return try_asm_snippet(dotprod_asm_snippet);
-}
-// Intentionally keep checking for __linux__ here in case we want to
-// extend RUY_IMPLEMENT_DETECT_DOTPROD outside of linux in the future.
-#ifdef __linux__
-bool IsLinuxAuxvMethodAvailable() {
-  struct utsname utsbuf;
-  uname(&utsbuf);
-  int major, minor, patch;
-  if (3 != sscanf(utsbuf.release, "%d.%d.%d", &major, &minor, &patch)) {
-    return false;
-  }
-  // This is implemented in linux 4.14.111, not in 4.14.105.
-  return major > 4 ||
-         (major == 4 && (minor > 14 || (minor == 14 && patch >= 111)));
-}
+#if defined __linux__ && defined __aarch64__
 bool DetectDotprodByLinuxAuxvMethod() {
   // This is the value of HWCAP_ASIMDDP in sufficiently recent Linux headers,
   // however we need to support building against older headers for the time
@@ -208,17 +63,11 @@ bool DetectDotprodByLinuxAuxvMethod() {
 } // namespace
 bool DetectDotprod() {
-#ifdef __linux__
-  if (IsLinuxAuxvMethodAvailable()) {
-    return DetectDotprodByLinuxAuxvMethod();
-  }
+#if defined __linux__ && defined __aarch64__
+  return DetectDotprodByLinuxAuxvMethod();
 #endif
-  return DetectDotprodBySigIllMethod();
+  return false;
 }
-#else // RUY_IMPLEMENT_DETECT_DOTPROD
-bool DetectDotprod() { return false; }
-#endif // RUY_IMPLEMENT_DETECT_DOTPROD
 } // namespace ruy
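
For readers who want to see the auxv method in isolation, here is a minimal standalone sketch, not the code in this commit. It assumes that HWCAP_ASIMDDP is bit 20 of AT_HWCAP on arm64 Linux. The names HwcapHasDotprod, KernelCanReportDotprod and DetectDotprodSketch are hypothetical and exist only for illustration. KernelCanReportDotprod mirrors the kernel-version gate that this commit removes, and is shown only to make the 4.14.111 threshold concrete.

```cpp
#include <cstdio>

#if defined(__linux__)
#include <sys/auxv.h>
#endif

// Assumption: HWCAP_ASIMDDP is bit 20 of AT_HWCAP in recent arm64 Linux
// headers. It is spelled out here so the sketch builds against older headers.
constexpr unsigned long kHwcapAsimddpSketch = 1UL << 20;

// Hypothetical helper: a pure bit test, kept separate so it can be exercised
// on any platform.
bool HwcapHasDotprod(unsigned long hwcap) {
  return (hwcap & kHwcapAsimddpSketch) != 0;
}

// Hypothetical helper mirroring the removed version gate: returns true if a
// kernel release string such as "4.14.111" is at least 4.14.111. A release
// string that does not start with three dot-separated integers yields false.
bool KernelCanReportDotprod(const char* release) {
  int major = 0, minor = 0, patch = 0;
  if (std::sscanf(release, "%d.%d.%d", &major, &minor, &patch) != 3) {
    return false;
  }
  return major > 4 ||
         (major == 4 && (minor > 14 || (minor == 14 && patch >= 111)));
}

// Sketch of the detection itself: ask the kernel for the hardware capability
// bits on arm64 Linux, and report "no dotprod" everywhere else.
bool DetectDotprodSketch() {
#if defined(__linux__) && defined(__aarch64__)
  return HwcapHasDotprod(getauxval(AT_HWCAP));
#else
  return false;
#endif
}
```

On a kernel too old to report the bit, DetectDotprodSketch() simply returns false, which matches the trade-off described in the commit message: no crash, and at worst no speedup.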