Intro

mubench is an in-depth, low-level benchmark for x86 processors. Its primary goal is to provide useful information for people who optimize assembly code and for people who write compilers. It measures latency and throughput for each individual instruction (sometimes several forms of the same instruction), as well as the throughput of arbitrary instruction mixes. The results produced by mubench are typically an order of magnitude more detailed than those found in AMD or Intel manuals.

mubench results for a variety of processors are available. If you find this information useful, please run mubench on your processor and upload the results.

mubench fully supports all SIMD instruction sets for the x86, including SSSE3, SSE3, SSE2, SSE, MMX, MMX Ext, 3DNow! and 3DNow! Ext. Support for non-SIMD instructions is partial: most data move, binary arithmetic, logical, shift/rotate and bit/byte instructions are supported, but other instructions, particularly branch and function call instructions or instructions manipulating the stack, are not supported. Floating-point instructions for the x87 are not supported. mubench only uses register-to-register (or immediate) forms of the instructions; memory operands are not supported. These limitations will be gradually removed in later releases.

Running

perl mubench.pl [options]

Options:

 --(no-)accurate           runs tests several times (default on)
 --mhz=2500                processor speed in MHz (normally autodetected from /proc/cpuinfo, set here if that 
                           is wrong, for example if you have SpeedStep enabled)
 --(no-)64bit              benchmark 64-bit (amd64, EM64T, x86-64) instructions (default autodetected)
 --(no-)32bit              benchmark 32-bit instructions
 --(no-)pairs              benchmark instruction mixes (default on, very slow; use --no-pairs for a very fast benchmark 
                           that runs in minutes)
 --include=add,sub         benchmark only instructions matching the given list of patterns (regular expressions ok)
 --output=xml|csv|text     select output format
 --outfile=file.xml        output file to save results to (default mubench-results-<date>.xml if xml, 
                           standard output otherwise)
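
The --mhz option matters because results are reported in clock cycles, and turning a wall-clock timing into cycles requires the clock frequency. The following is a minimal sketch of that arithmetic only. It assumes a timing-based measurement and is not taken from the mubench source; the function name and the sample numbers are hypothetical.

```python
def seconds_to_cycles(elapsed_seconds, instruction_count, mhz):
    """Average clock cycles per instruction for a timed run.

    Assumption: the run executed `instruction_count` instructions in
    `elapsed_seconds` of wall-clock time at a constant `mhz` clock.
    """
    cycles_per_second = mhz * 1_000_000
    return elapsed_seconds * cycles_per_second / instruction_count


# Hypothetical run: one billion instructions timed at 0.4 seconds.
# With the correct clock speed of 2500 MHz this is about 1 cycle each.
print(seconds_to_cycles(0.4, 1_000_000_000, 2500))

# If a throttled 1000 MHz were detected instead, the same timing would
# be reported as about 0.4 cycles each, which is why --mhz exists.
print(seconds_to_cycles(0.4, 1_000_000_000, 1000))
```

Under this assumption every reported number scales linearly with the MHz value, so a wrong frequency skews all results by the same factor.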

Run this benchmark on an otherwise idle system, or as close to idle as possible (the benchmark will try to compensate for occasional CPU usage).

The full benchmark takes 6-9 hours to complete on an x86-64 system, or 2-3 hours on an x86 system, since there are fewer instructions to try.

Some errors are normal when running the benchmark, because it tries to compile and run instructions from instruction sets your processor may not support (just in case).

Contribute results

Run perl mubench.pl with no options. It will produce a file "mubench-results-<date>.xml.bz2". A full run takes 6-9 hours on an x86-64 system (2-3 hours on an x86 system). If you would like to run a quick benchmark, run perl mubench.pl --no-pairs, which takes 5-10 minutes and produces a limited set of results. Both forms are extremely helpful, and will be used to expand this site.

To upload your results, please go to the Support Requests > Submit New part of the SourceForge project page of mubench. Under "Upload and Attach a File:" click "Browse..." and select the "mubench-results-<date>.xml.bz2" file produced by mubench. In the Summary field, write "RESULT" and a description of your processor, for example "RESULT: Pentium M 1.4GHz". Click "UPLOAD".

Thanks!

Output

When running with --output=text, the output looks like this:

instruction 1        instruction 2        latency    throughput
---------------------------------------------------------------
add r64, r64                              1.0047     1.0076
add r32, r32                              1.0043     0.47108
...

All numbers are measured in clock cycles.
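
To post-process the text output, a small parser along the following lines may be enough. It is a sketch based only on the sample above: it assumes columns are separated by two or more spaces and that the last two columns are latency and throughput. The function name and the dictionary keys are hypothetical, and rows that leave the latency column empty would be skipped.

```python
import re


def parse_text_results(text):
    """Parse mubench-style text output into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        # Columns are assumed to be separated by runs of 2+ spaces, so
        # single spaces inside "add r64, r64" stay within one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # blank line, separator line, or "..."
        try:
            latency = float(fields[-2])
            throughput = float(fields[-1])
        except ValueError:
            continue  # header row or any row without two numbers
        rows.append({
            "instructions": fields[:-2],  # one entry, or two for a mix
            "latency": latency,
            "throughput": throughput,
        })
    return rows


sample = """\
instruction 1        instruction 2        latency    throughput
---------------------------------------------------------------
add r64, r64                              1.0047     1.0076
add r32, r32                              1.0043     0.47108
"""

for row in parse_text_results(sample):
    print(row["instructions"], row["latency"], row["throughput"])
```

For anything beyond quick inspection, --output=csv or --output=xml is likely a more robust starting point than parsing aligned text.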

Latency = 2 means it takes two clock cycles for the result to be available. Throughput = 2 means a new instruction of the same kind can only be started once every two clock cycles (this is actually the reciprocal throughput, which is the form commonly used when talking about assembly code). Note that smaller latency and smaller throughput are faster. Many instructions on recent processors have throughput < 1, meaning more than one of the same instruction can run in the same clock cycle. It is normal to have some non-integer values, although a lot of instructions will typically have throughput = 1. The same instruction with different operands may have different performance.
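
As a worked illustration of how the two numbers are used, the sketch below converts a reciprocal throughput into instructions per cycle and gives a rough cycle estimate for a run of identical instructions. It is a simplified model that ignores pipeline fill, decoding limits and every other bottleneck; the function names are hypothetical and the inputs are the add r32, r32 values from the sample output above.

```python
def instructions_per_cycle(reciprocal_throughput):
    """Convert reciprocal throughput (cycles per instruction) to IPC."""
    return 1.0 / reciprocal_throughput


def estimate_cycles(count, latency, reciprocal_throughput, dependent):
    """Rough cycle estimate for `count` copies of one instruction.

    dependent=True:  each instruction needs the previous result, so the
                     chain is bound by latency.
    dependent=False: the instructions are independent, so the run is
                     bound by reciprocal throughput.
    """
    per_instruction = latency if dependent else reciprocal_throughput
    return count * per_instruction


latency, throughput = 1.0043, 0.47108  # add r32, r32 from the sample

# Roughly 2.1 independent adds can start per clock cycle.
print(instructions_per_cycle(throughput))

# 1000 chained adds cost about 1000 cycles, while 1000 independent
# adds cost about 470 cycles under this simplified model.
print(estimate_cycles(1000, latency, throughput, dependent=True))
print(estimate_cycles(1000, latency, throughput, dependent=False))
```

The gap between the two estimates is the reason both numbers are reported: a dependency chain is limited by latency, while independent work is limited by throughput.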

Requires

Perl modules: IPC::Run >= 0.80 (bundled with mubench since mubench-0.2.1)
Recent versions of gcc and binutils, which must be in your PATH (gcc >= 3.3; binutils >= 2.16.92 for SSSE3/MNI support)
Other utilities in the path: bzip2, md5sum, uname

Files used

Creates test.c and test in the current working directory.

Tries to read /proc/cpuinfo on startup.
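
If autodetection gives a suspicious value, you can check what /proc/cpuinfo reports yourself. The sketch below reads the "cpu MHz" field as it appears on x86 Linux; that mubench uses this exact field is an assumption, and the function names are hypothetical.

```python
import re


def detect_mhz(cpuinfo_text):
    """Return the first "cpu MHz" value in cpuinfo text, or None."""
    match = re.search(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text, re.MULTILINE)
    return float(match.group(1)) if match else None


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the file contents, or an empty string if it is unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


# Prints the current clock of the first processor, or None when the
# file or the field is missing (for example on non-Linux systems).
print(detect_mhz(read_cpuinfo()))
```

With frequency scaling such as SpeedStep enabled, this value reflects the current clock rather than the maximum, which is the case where passing --mhz by hand helps.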

See also

Software Optimization Guide for the AMD64 Processors, AMD (publication 25112)

IA-32 Intel Architecture Optimization Reference Manual, Intel (publication 248966)

Software optimization resources, Agner Fog

SSE/MMX docs, Stefano Tommesani

Linux Assembly resources

Copyright and license

Copyright 2006 by Alex Izvorski. mubench is licensed under the terms of the GNU General Public License.