So, as you know, you can always start a good flame war by mentioning performance comparisons. I’m sure this post will qualify for that treatment. The comparison is between the UTFT library, a fantastic cross-platform AVR/PIC TFT library that supports a whole list of TFT driver chips, and a version of UTFT with almost all controller conditionals removed and some choice routines replaced by hand-crafted assembler. The performance difference is staggering.
As you can see the optimized library is 15 times as fast. Now how did we get there?
1. Fast Fill
Replace _fast_fill_16 with something more appropriate to the name. Looking at the disassembly for this piece of code, I was horrified to find oodles of code for such a simple thing. All this routine needs to do is toggle the WR line LOW and then HIGH once for each pixel. The controller automatically advances to the next write address. On AVR, writing to a port can be done in one clock cycle, so that basic element takes only two clock cycles.
.macro TOGGLE_WR_FAST value1, value2
    out _SFR_IO_ADDR(WR_PORT), \value1
    out _SFR_IO_ADDR(WR_PORT), \value2
.endm
Load two registers of your choice with the values to write to the port and invoke this macro. What I liked about the original fast_fill_16 is that it unrolled the big loop into two loops: one writes 16 pixels at a time and the second finishes whatever is left over. This avoids a lot of branching, so I stuck with that in assembler. Assume that the number of 16-pixel blocks to write is in r24:r25 and the number of remaining single pixels is in r18.
    sbiw r24,0 // subtract zero and test if zero
    breq exitloop16
loop16:
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    TOGGLE_WR_FAST r31,r30
    sbiw r24,1
    brne loop16
exitloop16:
    cpi r18,0
    breq exitsingleloop
singleloop:
    TOGGLE_WR_FAST r31,r30
    dec r18
    brne singleloop
exitsingleloop:
    ret
This alone is a tremendous speedup and takes care of fillRect, clrScr, and horizontal and vertical lines.
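To make the block/remainder split easier to follow, here is a minimal C sketch of the same loop structure that can be checked on a PC. It is not the library's code: toggle_wr, wr_pulses and fast_fill_sketch are hypothetical names, and toggle_wr only counts pulses instead of writing to a port.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TOGGLE_WR_FAST: on the AVR this would be the
   two `out` instructions; here it only counts pulses. */
static uint32_t wr_pulses = 0;

static void toggle_wr(void)
{
    wr_pulses++;
}

/* Pulse WR once per pixel: full blocks of 16 first, then the leftovers. */
static void fast_fill_sketch(uint32_t pixels)
{
    /* The block count has to fit the 16-bit pair r24:r25 of the listing. */
    assert(pixels / 16 <= UINT16_MAX);

    uint16_t blocks = (uint16_t)(pixels / 16); /* r24:r25 in the assembler */
    uint8_t rest = (uint8_t)(pixels % 16);     /* r18 in the assembler */

    while (blocks--) {
        /* Unrolled 16 times so the loop overhead is paid once per block. */
        toggle_wr(); toggle_wr(); toggle_wr(); toggle_wr();
        toggle_wr(); toggle_wr(); toggle_wr(); toggle_wr();
        toggle_wr(); toggle_wr(); toggle_wr(); toggle_wr();
        toggle_wr(); toggle_wr(); toggle_wr(); toggle_wr();
    }
    while (rest--) {
        toggle_wr();
    }
}
```

For a 320×240 screen (an assumed size, used only as an example) this gives 4800 blocks and no leftover pixels, so the branch at the end of the loop is taken 4800 times instead of 76800 times.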
2. Arbitrary lines
Arbitrary lines were very slow in C code. So I rewrote the Bresenham algorithm in assembler as well. It’s a bit more code so I won’t bother copying it here. I will have the source code available soon though.
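For readers who have not seen it, this is roughly what the Bresenham algorithm looks like in plain C. It is a textbook integer version and not the assembler from this project; plot_fn, record_pixel, line_trace and draw_line_sketch are hypothetical names, and record_pixel only records what was plotted instead of driving a display.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pixel callback; a real driver would set the address and
   write the colour here. */
typedef void (*plot_fn)(int16_t x, int16_t y, void *ctx);

/* Test helper: remembers how many pixels were plotted and the last one. */
struct line_trace { int count; int16_t last_x; int16_t last_y; };

static void record_pixel(int16_t x, int16_t y, void *ctx)
{
    struct line_trace *t = (struct line_trace *)ctx;
    t->count++;
    t->last_x = x;
    t->last_y = y;
}

/* Integer-only Bresenham line for all octants, both end points included. */
static void draw_line_sketch(int16_t x0, int16_t y0, int16_t x1, int16_t y1,
                             plot_fn plot, void *ctx)
{
    assert(plot != NULL);

    int x = x0, y = y0;
    int dx = abs(x1 - x0);
    int dy = -abs(y1 - y0);
    int sx = (x0 < x1) ? 1 : -1;
    int sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy; /* combined error term for both axes */

    for (;;) {
        plot((int16_t)x, (int16_t)y, ctx);
        if (x == x1 && y == y1) {
            break;
        }
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x += sx; } /* step along x */
        if (e2 <= dx) { err += dx; y += sy; } /* step along y */
    }
}
```

The inner loop uses only additions, comparisons and a shift, which is why it maps well to a handful of AVR registers.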
3. Bitmaps
Bitmaps could also use the assembler rewrite and benefited tremendously. However, when working with the Hack a Day logo I noticed that it has a lot of repeated pixels. Getting the bitmap data from flash memory is slow: 6 clock ticks per pixel. What if I could do some rudimentary compression? RLE seemed to fit the bill, so I went with something similar to PackBits compression. This reduced the storage for a 16-bit 83×76 pixel bitmap from about 12 kB to only 2944 bytes. Lossless compression! This sped up bitmap drawing considerably and reduced flash usage quite a lot.
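As an illustration of the idea, here is a minimal C encoder sketch. It is not the converter linked from this post. The byte layout is my reading of what the display routine in this post consumes, so treat it as an assumption: a header byte with bit 7 set means "repeat the next pixel (header & 0x7F) times", a header with bit 7 clear means "that many literal pixels follow", pixels are stored high byte first, and a 0 header ends the stream. The name pb565_encode_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n RGB565 pixels into out (capacity cap bytes).
   Returns the number of bytes written, including the terminating 0. */
static size_t pb565_encode_sketch(const uint16_t *px, size_t n,
                                  uint8_t *out, size_t cap)
{
    size_t o = 0, i = 0;

    while (i < n) {
        /* Length of the run of identical pixels starting at i (max 127). */
        size_t run = 1;
        while (i + run < n && run < 127 && px[i + run] == px[i]) {
            run++;
        }

        if (run >= 2) {
            assert(o + 3 <= cap);
            out[o++] = (uint8_t)(0x80 | run);   /* bit 7 set: repeat */
            out[o++] = (uint8_t)(px[i] >> 8);
            out[o++] = (uint8_t)(px[i] & 0xFF);
            i += run;
        } else {
            /* Collect literals until the next run starts (max 127). */
            size_t lit = 1;
            while (i + lit < n && lit < 127 &&
                   !(i + lit + 1 < n && px[i + lit] == px[i + lit + 1])) {
                lit++;
            }
            assert(o + 1 + 2 * lit <= cap);
            out[o++] = (uint8_t)lit;            /* bit 7 clear: literals */
            for (size_t k = 0; k < lit; k++) {
                out[o++] = (uint8_t)(px[i + k] >> 8);
                out[o++] = (uint8_t)(px[i + k] & 0xFF);
            }
            i += lit;
        }
    }

    assert(o + 1 <= cap);
    out[o++] = 0;                               /* end-of-stream marker */
    return o;
}
```

A run costs 3 bytes no matter how long it is (up to 127 pixels), while literals cost 2 bytes each plus one header byte per group, which is where the savings on a logo with large flat areas come from.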
The source code for the C image converter can be found here. The assembler routine that actually displays this bitmap is very simple and very fast:
.global fastbitmap_pb565
fastbitmap_pb565:
    /* r24:r25 data */
    /* this block sets up the TOGGLE_WR_FAST registers r26 (WR high) and r27 (WR low) */
    in r26, _SFR_IO_ADDR(WR_PORT)
    mov r27, r26
    set
    bld r26,WR_PIN
    clt
    bld r27,WR_PIN
    movw r30, r24        // Z points at the bitmap data in flash
    clr r1
PB565BIT_LOOP:
    LPM r18, Z+          // header byte, 0 ends the bitmap
    cpi r18,0
    breq PB565BIT_DONE
    bst r18,7            // bit 7 set means a run of identical pixels
    brtc PB565PLAIN
    // compressed loop.
    andi r18,0x7F        // run length
    LPM r0, Z+
    out DPHIO, r0
    LPM r0, Z+
    out DPLIO, r0
PB565COMPRESSED:
    TOGGLE_WR_FAST r27,r26
    dec r18
    brne PB565COMPRESSED
    rjmp PB565BIT_LOOP
PB565PLAIN:
    LPM r0, Z+           // literal pixel: high byte, then low byte
    out DPHIO, r0
    LPM r0, Z+
    out DPLIO, r0
    TOGGLE_WR_FAST r27,r26
    dec r18
    brne PB565PLAIN
    rjmp PB565BIT_LOOP
PB565BIT_DONE:
    clr r0
    ret
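If the assembler is hard to read, the following C function is a rough model of the same control flow that writes pixels into an array instead of the display. It is a sketch for checking encoded data on a PC, not part of the library, and pb565_decode_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a stream into out (capacity cap pixels). Returns the pixel count. */
static size_t pb565_decode_sketch(const uint8_t *data, uint16_t *out, size_t cap)
{
    size_t n = 0;

    for (;;) {
        uint8_t header = *data++;
        if (header == 0) {                      /* PB565BIT_DONE */
            break;
        }
        uint8_t count = header & 0x7F;

        if (header & 0x80) {
            /* Run: load the pixel once, then pulse WR `count` times. */
            uint16_t pixel = (uint16_t)((data[0] << 8) | data[1]);
            data += 2;
            /* do/while mirrors dec/brne; like the assembler, a count of 0
               would wrap around and repeat 256 times. */
            do {
                assert(n < cap);
                out[n++] = pixel;
            } while (--count != 0);
        } else {
            /* Literals: load and write one pixel per iteration. */
            do {
                assert(n < cap);
                out[n++] = (uint16_t)((data[0] << 8) | data[1]);
                data += 2;
            } while (--count != 0);
        }
    }
    return n;
}
```

In the run branch the two flash reads happen once per run rather than once per pixel, so repeated pixels cost little more than the two-cycle WR toggle plus the loop overhead.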
Those three really took care of the low-hanging fruit. UTFT is written as a C++ class, as it is an Arduino library. I wonder how much time the generated code spends keeping track of the ‘this’ pointer and dereferencing variables.
I’m very pleased with the speedup so far. The boards have come in. The next blog post will detail the assembly of the board using mostly SMT components.
Update: Full source code is posted here