C++ vs Assembler performance on AVR

As you know, you can always start a good flame war by mentioning performance comparisons, and I'm sure this post will qualify for that treatment. The comparison is between the UTFT library, a fantastic cross-platform AVR/PIC library that supports a whole list of TFT driver chips, and a version of UTFT that has almost all controller conditionals removed and some choice routines replaced by hand-crafted assembler. The performance difference is staggering.

YouTube video

As you can see the optimized library is 15 times as fast. Now how did we get there?

1. Fast Fill

Replace _fast_fill_16 with something more appropriate to the name. Looking at the disassembly for this piece of code, I was horrified to find oodles of code for such a simple thing. All this routine needs to do is toggle the WR line LOW and then HIGH once for each pixel; the controller automatically advances to the next write address. On AVR, writing to a port with the out instruction takes one clock cycle, so that basic element only takes two clock cycles.

.macro TOGGLE_WR_FAST value1, value2
  out _SFR_IO_ADDR(WR_PORT), \value1
  out _SFR_IO_ADDR(WR_PORT), \value2
.endm

Load two registers of your choice with the two values to write to the port and invoke this macro. What I liked about the original _fast_fill_16 is that it split the big loop into two loops: the first does 16 pixels at a time and the second finishes whatever is left over. This avoids a lot of branching, so I stuck with that in assembler. Assume that the number of 16-pixel blocks to write is in r24:r25, the number of single pixels is in r18, and r31 and r30 hold the port values with WR low and WR high respectively.

sbiw r24,0 // subtract zero and test if zero
breq exitloop16
loop16:

TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30

TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30

TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30

TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30
TOGGLE_WR_FAST r31,r30

sbiw r24,1

brne loop16

exitloop16:

cpi r18,0
breq exitsingleloop
singleloop:
TOGGLE_WR_FAST r31,r30
dec r18
brne singleloop

exitsingleloop:
ret

This alone is a tremendous speedup and takes care of fillRect, clrScr, and horizontal and vertical lines.
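To feed the routine above, the caller has to split the pixel count into whole 16-pixel blocks and a remainder. Here is a minimal C sketch of that split, together with a rough cycle estimate for the two loops. The names fill_counts, split_fill and estimate_fill_cycles are made up for illustration and are not part of UTFT. The estimate assumes the usual classic-AVR timings (out 1, dec 1, sbiw 2, brne 2 when taken and 1 when not) and ignores the call overhead and the two initial zero tests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of UTFT. */
typedef struct {
    uint16_t blocks16;  /* whole 16-pixel blocks, would be passed in r25:r24 */
    uint8_t  singles;   /* leftover pixels (0..15), would be passed in r18   */
} fill_counts;

/* Split a pixel count into 16-pixel blocks plus a remainder. */
fill_counts split_fill(uint32_t pixels)
{
    fill_counts c;

    /* The block counter in the assembler loop is only 16 bits wide. */
    assert(pixels / 16u <= UINT16_MAX);

    c.blocks16 = (uint16_t)(pixels / 16u);
    c.singles  = (uint8_t)(pixels % 16u);
    return c;
}

/* Rough cycle estimate for the unrolled loop and the single-pixel loop. */
uint32_t estimate_fill_cycles(fill_counts c)
{
    uint32_t cycles = 0;

    if (c.blocks16 > 0) {
        /* 16 x 2 out + sbiw (2) + brne taken (2) = 36 cycles per block.
           The last brne falls through and costs one cycle less. */
        cycles += 36u * c.blocks16 - 1u;
    }
    if (c.singles > 0) {
        /* 2 out + dec (1) + brne taken (2) = 5 cycles per pixel.
           Again the last brne falls through. */
        cycles += 5u * c.singles - 1u;
    }
    return cycles;
}
```

Taking a 320×240 screen purely as an example size (76,800 pixels), the split gives 4,800 blocks and no remainder, which works out to about 2.25 cycles per pixel in the unrolled loop versus 5 cycles per pixel in the single-pixel loop.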

2. Arbitrary lines

Arbitrary lines were very slow in C code. So I rewrote the Bresenham algorithm in assembler as well. It’s a bit more code so I won’t bother copying it here. I will have the source code available soon though.
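For reference, the standard integer-only Bresenham algorithm looks roughly like this in C. This is a textbook sketch, not the assembler version described above. The names point and bresenham_line are illustrative, and the sketch collects the points in an array where the real routine would write each pixel to the display.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type used only for this illustration. */
typedef struct { int x, y; } point;

/* Walks the line from (x0,y0) to (x1,y1), both endpoints included.
   Stores at most max points in out[] and returns the total number of
   points on the line. Works in all octants using only integer math. */
size_t bresenham_line(int x0, int y0, int x1, int y1, point *out, size_t max)
{
    int dx  = abs(x1 - x0);
    int dy  = -abs(y1 - y0);
    int sx  = (x0 < x1) ? 1 : -1;   /* step direction in x */
    int sy  = (y0 < y1) ? 1 : -1;   /* step direction in y */
    int err = dx + dy;              /* running error term  */
    size_t n = 0;

    assert(out != NULL || max == 0);

    for (;;) {
        if (n < max) {
            out[n].x = x0;
            out[n].y = y0;
        }
        n++;
        if (x0 == x1 && y0 == y1)
            break;

        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }  /* step in y */
    }
    return n;
}
```

The inner loop needs only additions, comparisons and sign tests, which is why it maps well onto a handful of AVR registers.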

3. Bitmaps

Bitmaps could also use the assembler rewrite and benefited tremendously. However, when working with the Hack a Day logo I noticed that it has a lot of repeated pixels. Getting the bitmap data from flash memory is slow: 6 clock ticks per pixel. What if I used some rudimentary compression? RLE seemed to fit the bill, so I went with something similar to PackBits compression. This reduced the storage for a 16-bit 83×76 pixel bitmap from about 12 kB (83 × 76 × 2 = 12,616 bytes) to only 2944 bytes. Lossless compression! It sped up bitmap drawing considerably and cut the flash usage to less than a quarter.

The source code for the C image converter can be found here. The assembler to actually display this bitmap is very very simple and very fast:

.global fastbitmap_pb565
fastbitmap_pb565:

	/*
		r25:r24 = pointer to the compressed bitmap data in flash
	*/

	/* this block sets up the TOGGLE_WR_FAST registers: r26 = WR high, r27 = WR low */
	in r26, _SFR_IO_ADDR(WR_PORT)
	mov r27, r26
	set
	bld r26,WR_PIN
	clt
	bld r27,WR_PIN

	movw r30, r24

	clr r1

PB565BIT_LOOP:

	LPM r18, Z+	
	cpi r18,0
	breq PB565BIT_DONE

	bst r18,7
	brtc PB565PLAIN

	// compressed loop.
	andi r18,0x7F
	LPM r0, Z+
	out DPHIO, r0
	LPM r0, Z+
	out DPLIO, r0

PB565COMPRESSED:
	TOGGLE_WR_FAST r27,r26
	dec r18
	brne PB565COMPRESSED
	rjmp PB565BIT_LOOP

PB565PLAIN:

	LPM r0, Z+
	out DPHIO, r0
	LPM r0, Z+
	out DPLIO, r0
	TOGGLE_WR_FAST r27,r26
	dec r18
	brne PB565PLAIN
	rjmp PB565BIT_LOOP

PB565BIT_DONE:
	clr r0
	ret;
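The stream format can be read off the loop above: a control byte of 0 ends the bitmap, a control byte with bit 7 set means "repeat the next pixel (count & 0x7F) times", and any other control byte means "count literal pixels follow". Below is a minimal C sketch of an encoder that produces such a stream, plus a C model of the decoder for checking it. This is not the converter linked above; the names pb565_encode and pb565_decode are made up, and the sketch assumes that the byte sent to DPHIO is the upper half of the RGB565 pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PB_MAX_COUNT 127u  /* the count lives in the low 7 bits */

/* Append one pixel, high byte first (the order the decoder reads). */
static size_t put_pixel(uint8_t *out, size_t o, uint16_t px)
{
    out[o++] = (uint8_t)(px >> 8);
    out[o++] = (uint8_t)(px & 0xFFu);
    return o;
}

/* Encode n pixels and return the number of bytes written, including
   the terminating 0. out must hold 2*n + n/127 + 2 bytes worst case. */
size_t pb565_encode(const uint16_t *px, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;

    assert(px != NULL && out != NULL);
    while (i < n) {
        size_t run = 1;  /* length of the run of equal pixels at i */
        while (i + run < n && run < PB_MAX_COUNT && px[i + run] == px[i])
            run++;

        if (run >= 2) {
            /* Run packet: 0x80 | count, then one pixel. Never 0x80. */
            out[o++] = (uint8_t)(0x80u | run);
            o = put_pixel(out, o, px[i]);
            i += run;
        } else {
            /* Literal packet: gather pixels until a run starts. */
            size_t start = i, len = 0;
            while (i < n && len < PB_MAX_COUNT &&
                   !(i + 1 < n && px[i + 1] == px[i])) {
                i++;
                len++;
            }
            out[o++] = (uint8_t)len;  /* always 1..127, never 0 */
            for (size_t k = 0; k < len; k++)
                o = put_pixel(out, o, px[start + k]);
        }
    }
    out[o++] = 0;  /* terminator checked by the decoder */
    return o;
}

/* C model of the assembler loop. Stores at most max pixels and returns
   the pixel count. Unlike the assembler, a control byte of exactly 0x80
   expands to nothing here; the encoder above never emits it. */
size_t pb565_decode(const uint8_t *in, uint16_t *px, size_t max)
{
    size_t n = 0;

    for (;;) {
        uint8_t c = *in++;
        if (c == 0)
            break;
        uint8_t count = (uint8_t)(c & 0x7Fu);
        if (c & 0x80u) {
            uint16_t v = (uint16_t)((in[0] << 8) | in[1]);
            in += 2;
            while (count--) { if (n < max) px[n] = v; n++; }
        } else {
            while (count--) {
                uint16_t v = (uint16_t)((in[0] << 8) | in[1]);
                in += 2;
                if (n < max) px[n] = v;
                n++;
            }
        }
    }
    return n;
}
```

A run packet costs 3 bytes regardless of its length, so any run of two or more equal pixels is worth encoding as a run. Large flat areas such as a logo background shrink the most.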

Those three changes really took care of the low-hanging fruit. UTFT is written as a C++ class because it is an Arduino library. I wonder how much time the compiled code spends keeping track of the ‘this’ pointer and dereferencing member variables.

I’m very pleased with the speedup so far. The boards have come in. Next blog post will detail the assembly of the board using mostly SMT components.

Update: Full source code is posted here
