4. Benchmarking the Core

The maximum Dhrystone performance of the Chromite core is 1.72 DMIPS/MHz.

The maximum CoreMark performance of the Chromite core is 2.9 CoreMarks/MHz.

The Chromite core is highly configurable and allows workload-specific tuning to achieve maximum performance. This document highlights some of the settings and their respective benchmark numbers. For the following benchmarks the core has been configured using the default.yaml available in the samples/ folder.

Note

Make sure you are using gcc 9.2.0 or above to replicate the following results.

4.1. Benchmarking Dhrystone

The following numbers have been obtained via simulation where the number of ITERATIONS was fixed at 5000.

Flags used for compilation:

-mcmodel=medany -static -std=gnu99 -O2 -ffast-math \
-fno-common -fno-builtin-printf -march=rv64$(march) -mabi=lp64d \
-w -static -nostartfiles -lgcc

When $march is rv64imac the DMIPS/MHz is 1.68:

Microseconds for one run through Dhrystone:     10.0
Dhrystones per Second:                       94652.0

When $march is rv64ima the DMIPS/MHz is 1.72:

Microseconds for one run through Dhrystone:     10.0
Dhrystones per Second:                       96216.0
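The raw Dhrystone scores above can be converted to DMIPS with the standard divisor of 1757, the Dhrystones-per-second score of the VAX 11/780 reference machine. A minimal sketch of that conversion follows; the simulated clock frequency is not stated in the logs above, so `clock_mhz` is left as a caller-supplied assumption rather than hard-coded:

```python
# Convert a raw Dhrystone score to DMIPS. The divisor 1757 is the
# Dhrystones/second score of the VAX 11/780, the 1-MIPS reference machine.
VAX_DHRYSTONES_PER_SEC = 1757.0

def dmips(dhrystones_per_sec: float) -> float:
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC

def dmips_per_mhz(dhrystones_per_sec: float, clock_mhz: float) -> float:
    # Normalise by the (simulated) clock frequency, which must be known
    # to the person running the benchmark; it is an input here, not a fact
    # taken from the logs above.
    return dmips(dhrystones_per_sec) / clock_mhz

dmips(96216.0)   # rv64ima run, approximately 54.76 DMIPS
dmips(94652.0)   # rv64imac run, approximately 53.87 DMIPS
```

Dividing these absolute DMIPS figures by the clock frequency used in simulation yields the per-MHz numbers quoted above.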

4.2. Benchmarking CoreMarks

The following numbers have been obtained via simulation where the number of ITERATIONS was fixed at 100.

Flags used for compilation are available in the logs below:

When $march is rv64imac the CoreMarks/MHz is 2.84:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 35205197
Total time (secs): 35
Iterations/Sec   : 2
Iterations       : 100
Compiler version : riscv64-unknown-elf-9.2.0
Compiler flags   : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
                   -DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
                   -DITERATIONS=10 -O3 -fno-common -funroll-loops -finline-functions \
                   -fselective-scheduling -falign-functions=16 -falign-jumps=4 \
                   -falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
                   -fno-builtin-printf -march=rv64imac -mexplicit-relocs
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See README.md for run and reporting rules.

When $march is rv64ima the CoreMarks/MHz is 2.897:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 34516277
Total time (secs): 34
Iterations/Sec   : 2
Iterations       : 100
Compiler version : riscv64-unknown-elf-9.2.0
Compiler flags   : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 \
                   -DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 \
                   -DITERATIONS=100 -O3 -fno-common -funroll-loops -finline-functions \
                   -fselective-scheduling -falign-functions=16 -falign-jumps=4 \
                   -falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math \
                   -fno-builtin-printf -march=rv64ima -mexplicit-relocs
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See README.md for run and reporting rules.
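The CoreMarks/MHz figures quoted above can be reproduced directly from the log fields, under the assumption that one tick in the log is one clock cycle (the usual convention when the timer is driven by the cycle counter):

```python
def coremarks_per_mhz(iterations: int, total_ticks: int) -> float:
    # With one tick per clock cycle, "iterations per million cycles" is
    # exactly the CoreMark score at a 1 MHz clock, so the ratio is
    # independent of the actual simulated frequency.
    return iterations * 1_000_000 / total_ticks

round(coremarks_per_mhz(100, 35205197), 2)   # rv64imac run -> 2.84
round(coremarks_per_mhz(100, 34516277), 3)   # rv64ima run  -> 2.897
```

Both results match the numbers reported above, which confirms the logs and the headline scores are consistent.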

4.3. Why do Compressed Binaries have Reduced Performance?

As the numbers above show, for the same branch-predictor configuration, enabling compressed-instruction support causes a slight reduction in DMIPS. This is because of the way the fetch stage (stage1) has been designed.

The fetch stage always expects the I$ to respond with a 32-bit word that is 4-byte aligned. Since a 32-bit word can hold up to two 16-bit compressed instructions, the predictor always presents two predictions: one for pc and one for pc+2. While analysing the 32-bit word from the I$, the following scenarios can occur:

  • Case-1: the entire word is a single 32-bit instruction. In this case the entire word and the prediction for pc are sent to the decode stage.

  • Case-2: the word contains two 16-bit instructions. In this case, in the first cycle the lower 16 bits of the word and the prediction for pc are sent to the decode stage. In the next cycle the upper 16 bits and the prediction for pc+2 are sent to the decode stage.

  • Case-3: the lower 16 bits need to be concatenated with the upper 16 bits of the previous I$ response. In this case a new 32-bit instruction is formed, and the prediction from the previous response is sent to the decode stage.

  • Case-4: only the upper 16 bits of the I$ response need to be analysed. If the upper 16 bits form a compressed instruction, they and the prediction for pc+2 are sent to the decode stage. If, however, the upper 16 bits are the lower half of a 32-bit instruction, the fetch stage must wait for the next I$ response and then apply the Case-3 scheme. One lands in this case when there is a jump to a 32-bit instruction placed at a 2-byte boundary.
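The case selection above can be sketched as a small classifier. This is an illustrative Python model, not the actual hardware implementation; it relies on the standard RISC-V encoding rule that a 16-bit parcel begins a compressed instruction unless its two least-significant bits are both 1, and the `pending_upper_half` state flag is a hypothetical name for "the previous response ended with the lower half of a 32-bit instruction":

```python
def is_compressed(parcel16: int) -> bool:
    # RISC-V encoding rule: a 16-bit parcel starts a compressed instruction
    # unless its two least-significant bits are 0b11.
    return (parcel16 & 0b11) != 0b11

def classify_fetch_word(word32: int, pc: int, pending_upper_half: bool) -> str:
    """Classify a 4-byte-aligned I$ response word into the cases above.

    word32             -- the 32-bit word returned by the I$
    pc                 -- the fetch address within that word
    pending_upper_half -- True when the previous response ended with the
                          lower half of a 32-bit instruction
    """
    lower = word32 & 0xFFFF
    upper = (word32 >> 16) & 0xFFFF
    if pending_upper_half:
        return "case-3"          # join lower half with the previous upper half
    if pc & 0b10:
        return "case-4"          # jumped to a 2-byte boundary: only the
                                 # upper parcel of this word is relevant
    if not is_compressed(lower):
        return "case-1"          # one full 32-bit instruction
    if is_compressed(upper):
        return "case-2"          # two 16-bit instructions
    # Lower parcel is compressed but the upper parcel starts a 32-bit
    # instruction whose other half arrives with the next response
    # (which will then be handled as case-3).
    return "case-2-then-pending"

classify_fetch_word(0x00001797, 0x8000106C, False)  # auipc word -> "case-1"
classify_fetch_word(0x00001797, 0x8000106E, False)  # jump to pc+2 -> "case-4"
```

The final branch, where a compressed instruction is followed by the start of a 32-bit instruction in the same word, is the hand-off into Case-3 on the next response.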

Now that we understand how the fetch stage works, assume that all the Dhrystone code fits within the I$ (i.e. no misses) and that the predictor is well trained enough to provide correct predictions throughout. Consider the following sequence from Dhrystone:

...
8000106e: 0x00001797            auipc a5,0x1
...
...
...
800010d8: 0xf97ff0ef            jal ra,8000106e
...

Now each time the jal instruction is executed, the fetch stage enters Case-4: the upper 16 bits of the 32-bit word at 8000106c are the lower half of the 32-bit instruction starting at 0x8000106e, which leads to a single-cycle stall before the auipc instruction can be sent to the decode stage.
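The alignment arithmetic behind this stall can be checked directly: the jal target 0x8000106e sits at a 2-byte boundary inside the 4-byte-aligned word that starts at 0x8000106c.

```python
target = 0x8000106E          # jal destination (the auipc above)
word_addr = target & ~0x3    # 4-byte-aligned address of the I$ fetch word
word_addr == 0x8000106C      # True: the word holding the target
bool(target & 0x2)           # True: target is at a 2-byte boundary,
                             # so only the upper parcel of the word is
                             # useful and the fetch stage is in Case-4
```

Because the auipc straddles two fetch words, its upper half only arrives with the next I$ response, costing one cycle.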

In Dhrystone this kind of sequence occurs in three places within each iteration, so every iteration incurs three single-cycle stalls, which accounts for the reduced performance with compressed support.