4. Benchmarking the Core

The maximum DMIPS rating of the Chromite core is 1.772 DMIPS/MHz.

The maximum CoreMark rating of the Chromite core is 3.2 CoreMarks/MHz.

The Chromite core is highly configurable and allows workload-specific tuning to achieve maximum performance. This document highlights some of the settings and their respective benchmark numbers. For the following benchmarks, the core has been configured using the default.yaml available in the samples/ folder.

Note

Make sure you are using gcc 11.1.0 or above to replicate the following results.

4.1. Benchmarking Dhrystone

The following numbers have been obtained via simulation where the number of ITERATIONS was fixed at 10000. The riscv-gnu-toolchain was used to compile the program. The versions used have been populated in the table.

Flags used for compilation:

-mcmodel=medany -static -std=gnu99 -O2 -ffast-math \
-fno-common -fno-builtin-printf -march=rv64$(march) -mabi=lp64d \
-w -static -nostartfiles -lgcc

The following table provides the DMIPS/MHz numbers for various configurations for 10K iterations of Dhrystone:

HW ISA Config            march      mabi   uArch Configs               gcc version              DMIPS/MHz
-----------------------  ---------  -----  --------------------------  -----------------------  ---------
RV64IMACSU               rv64imac   lp64   Default                     11.1.0 (g5964b5cd727)    1.750
RV64IMACSU / RV64IMASU   rv64ima    lp64   Default                     11.1.0 (g5964b5cd727)    1.767
RV64IMACSU               rv64imac   lp64   overlap_redirections=True   11.1.0 (g5964b5cd727)    1.756
RV64IMACSU / RV64IMASU   rv64ima    lp64   overlap_redirections=True   11.1.0 (g5964b5cd727)    1.772

Note

Enabling overlap_redirections can affect frequency closure on certain nodes, since the redirected PC obtained from a misprediction is muxed to the Instruction Memory Subsystem (IMS) in the same cycle (as opposed to being registered before being sent to the IMS). This reduces the misprediction penalty by 1 cycle. The performance gain obtained from this is visible in the table above.
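The size of this gain can be cross-checked from the table itself. Converting DMIPS/MHz back into cycles per Dhrystone iteration (using the standard 1757 Dhrystones/s VAX 11/780 baseline) shows roughly one cycle saved per iteration for both ISA configurations; a minimal sketch of that arithmetic:

```python
def cycles_per_iter(dmips_per_mhz):
    # DMIPS = Dhrystones/s / 1757 (VAX 11/780 baseline), so at 1 MHz:
    # cycles per iteration = 1e6 / (DMIPS/MHz * 1757)
    return 1e6 / (dmips_per_mhz * 1757)

# Cycles saved by overlap_redirections=True, per the table above
gain_imac = cycles_per_iter(1.750) - cycles_per_iter(1.756)  # rv64imac
gain_ima  = cycles_per_iter(1.767) - cycles_per_iter(1.772)  # rv64ima
print(round(gain_imac), round(gain_ima))  # -> 1 1
```

This is only a back-of-the-envelope consistency check; the exact number of redirects per iteration depends on the predictor's behaviour.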

The reason for the lower performance when compressed support is enabled is explained below in Section 4.3.

4.2. Benchmarking CoreMarks

The following numbers have been obtained via simulation where the number of ITERATIONS was fixed at 100.

When $march is rv64ima, the CoreMark/MHz is 3.2:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 31256616
Total time (secs): 31
Iterations/Sec   : 3
Iterations       : 100
Compiler version : riscv64-unknown-elf-11.1.0
Compiler flags   : -mcmodel=medany -DCUSTOM -DPERFORMANCE_RUN=1 -DMAIN_HAS_NOARGC=1 -DHAS_STDIO -DHAS_PRINTF -DHAS_TIME_H -DUSE_CLOCK -DHAS_FLOAT=0 -DITERATIONS=100 -O3 -fno-common -funroll-loops -finline-functions -fselective-scheduling -falign-functions=16 -falign-jumps=4 -falign-loops=4 -finline-limit=1000 -nostartfiles -nostdlib -ffast-math -fno-builtin-printf -march=rv64imfd -mexplicit-relocs -ffreestanding -fno-builtin -mtune=rocket
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x988c
Correct operation validated. See README.md for run and reporting rules.
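The 3.2 figure can be derived directly from the log above: CoreMark/MHz is the number of iterations scaled to a 1 MHz clock, assuming the simulation counts one tick per clock cycle.

```python
# Values taken from the CoreMark log above
iterations  = 100
total_ticks = 31256616

# Iterations per "MHz-second": iterations * 1e6 cycles / total cycles
coremark_per_mhz = iterations * 1e6 / total_ticks
print(round(coremark_per_mhz, 1))  # -> 3.2
```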

4.3. Why Do Compressed Binaries Have Reduced Performance?

As the numbers above show, for the same configuration of the branch predictor, enabling compressed support causes a slight reduction in DMIPS. This is because of the way the fetch stage (stage1) has been designed.

The fetch stage always expects the I$ to respond with a cacheline-sized, 4-byte-aligned response, and the instruction packet in the pipeline normally carries a 4-byte-aligned address. When the target of a redirection is 2-byte aligned, henceforth referred to as a c-redirect, the instruction packet contains a 2-byte-aligned address. Since the 32-bit word pointed to by the address can hold up to two 16-bit compressed instructions, the predictor always presents two predictions: one for pc and one for pc+2 (pc-2 and pc in case of a c-redirect).

For every instruction packet, 4 bytes of instruction are extracted from the cacheline, starting from the address pointed to by the packet. The instruction packet is dequeued only after all 4 bytes (at least) have been consumed. These 4 bytes can fall into any of the following cases:

  • Case-1: The entire word is a 32-bit instruction. In this case the entire word and the prediction for pc are sent to the decode stage.

  • Case-2: The word contains two 16-bit instructions. In this case, in the first cycle the lower 16 bits of the word and the prediction for pc are sent to the decode stage. In the next cycle, the upper 16 bits and the prediction for pc+2 are sent to the decode stage.

  • Case-3: The lower 16 bits belong to a compressed instruction and the upper 16 bits belong to an uncompressed instruction. In this case, in the first cycle the lower 16 bits of the word and the prediction for pc are sent to the decode stage. In the next cycle, the upper 16 bits, along with the lower 16 bits of the next word, are sent to the decode stage with the prediction for pc+2.

  • Case-4: The address in the instruction packet from pcgen is 2-byte aligned (a c-redirect), or the lower 16 bits have been consumed as part of the previous packet. In this case, if the upper 16 bits belong to a compressed instruction, they are sent to the decode stage with the prediction for pc+2. If the upper 16 bits of the word are the lower 2 bytes of an uncompressed instruction, the lower 16 bits of the next word are sent to the decode stage too.
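The four cases above can be sketched as a small classifier. In the RISC-V encoding, a halfword begins a compressed (16-bit) instruction iff its two least-significant bits are not 0b11; the helper below is a simplification of the actual stage1 logic, with hypothetical names:

```python
def is_compressed(halfword):
    # RISC-V: the two LSBs of every 32-bit instruction are 0b11;
    # anything else marks a 16-bit compressed instruction.
    return (halfword & 0b11) != 0b11

def classify(word, pc_2byte_aligned=False):
    """Classify a 4-byte fetch word into the cases above (simplified)."""
    lo = word & 0xFFFF
    hi = (word >> 16) & 0xFFFF
    if pc_2byte_aligned:
        return "case-4"   # c-redirect: only the upper halfword is in play
    if not is_compressed(lo):
        return "case-1"   # whole word is one 32-bit instruction
    if is_compressed(hi):
        return "case-2"   # two 16-bit instructions
    return "case-3"       # 16-bit instruction + start of a 32-bit one

print(classify(0x00000013))  # addi x0,x0,0 -> case-1
print(classify(0x45014501))  # two c.li     -> case-2
```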

In all of the aforementioned cases, the required bytes are readily available to stage1. However, problems arise when a case-4 type scenario occurs at a cacheline boundary, i.e. the lowermost 16 bits of the next cacheline are required to complete the word and begin analysis. In this particular case, the fetch stage stalls until the cache responses for both lines are available. The same occurs whenever there is a c-redirect to a target which points to the uppermost 16 bits of a cacheline (LSBs are 0x3e for a cacheline size of 64 bytes): 2 responses are needed. Assuming both are hits in the cache (the Dhrystone code fits within the I$, i.e. no misses) and the predictor is well trained enough to provide all correct predictions, a single pipeline bubble is inserted per such sequence. The example below illustrates the case-4 scenario:

...
80001a36:   2be40413                addi    s0,s0,702 # 80001cf0 <main+0x440>
80001a3a:   00000997                auipc   s3,0x0
80001a3e:   4e298993                addi    s3,s3,1250 # 80001f1c <Int_Glob>
...

After issuing the auipc instruction (say in cycle x) in the scenario above, in cycle x+1 the instruction packet in the fetch stage will point to 80001a3c. Since the lower 16 bits have already been consumed in cycle x (as part of the auipc instruction), the word under consideration starts at 80001a3e; however, the response for the next cacheline (i.e. the one starting at 80001a40) will only be available to the fetch stage in cycle x+2 (the cache receives the request for 80001a40 in cycle x+1). Hence the fetch stage stalls for 1 cycle.

Note: This problem does not arise in case-3 because in cycle x+1 the compressed instruction pointed to by the lower 16 bits can be issued, and hence there is no pipeline bubble.
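The stall condition in the example can be checked with simple address arithmetic, assuming 64-byte cachelines as in the text:

```python
LINE = 64  # cacheline size in bytes, per the text

def spans_two_lines(addr):
    # A 4-byte fetch window starting at addr needs a second cacheline
    # whenever fewer than 4 bytes remain in the current line.
    return (addr % LINE) > LINE - 4

print(spans_two_lines(0x80001a3e))  # True: offset 0x3e, only 2 bytes left
print(spans_two_lines(0x80001a38))  # False: well inside the line
```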

In Dhrystone, the above kind of sequence occurs in 3 places in each iteration, and each occurrence incurs a single-cycle stall; hence the reduced performance with compressed support enabled.
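This claim is consistent with the table in Section 4.1: converting the rv64imac and rv64ima DMIPS/MHz figures (with overlap_redirections=True) into cycles per iteration, using the standard 1757 Dhrystones/s VAX 11/780 baseline, yields a difference of about 3 cycles per iteration.

```python
def cycles_per_iter(dmips_per_mhz):
    # DMIPS = Dhrystones/s / 1757 (VAX 11/780 baseline), so at 1 MHz:
    # cycles per iteration = 1e6 / (DMIPS/MHz * 1757)
    return 1e6 / (dmips_per_mhz * 1757)

# rv64imac (1.756) vs rv64ima (1.772), overlap_redirections=True
delta = cycles_per_iter(1.756) - cycles_per_iter(1.772)
print(round(delta))  # -> 3
```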