mMm... Optimization (c) Homer

Got the USB network working for my GP2X, so I decided to fiddle with it a bit. I can now code in both Pascal and C, but I came across this 32bpp to 16bpp blit, and saw it as the perfect asm optimization challenge to warm up my old ARM assembler skills... :)

I went through one 10-instruction and two 9-instruction versions before I finally got from the original 12 down to 8 instructions per pixel-pair. Why do I spend 4 hours shoveling bits? I guess I think it's time well spent <3
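For reference, here is a plain C sketch of what the blit has to compute, assuming 0x00RRGGBB source pixels with a zero alpha byte and two RGB565 pixels packed per 32-bit word (first pixel in the low half). The function names are made up for illustration; this is not the SDL or Senor Quack code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one 0x00RRGGBB pixel to RGB565 by keeping the top 5/6/5 bits. */
static uint16_t rgb888_to_565(uint32_t px)
{
    uint32_t r = (px >> 19) & 0x1F;  /* top 5 bits of red   */
    uint32_t g = (px >> 10) & 0x3F;  /* top 6 bits of green */
    uint32_t b = (px >> 3)  & 0x1F;  /* top 5 bits of blue  */
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Pack two converted pixels into one word: px1 in the low half, px2 in the high half. */
static uint32_t pack_pair(uint32_t px1, uint32_t px2)
{
    return (uint32_t)rgb888_to_565(px1) | ((uint32_t)rgb888_to_565(px2) << 16);
}

/* Reference blit: one pixel at a time, no tricks. */
static void fb_write_ref(const uint32_t *src, uint16_t *dst, size_t npixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < npixels; i++)
        dst[i] = rgb888_to_565(src[i]);
}
```

The asm versions do the same job, but work on two pixels at once so that the green mask and the final store handle a whole 32-bit word.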

.globl fb_write
.globl SaveSP
.data
@ C prototype : extern void fb_write(uint64_t *screen, uint32_t *fb);

@ Converts 32-bpp RGB buffer to 16-bpp (RGB565) buffer. NOTE: alpha channel must be 0 !

@ assemble with: arm-linux-as.exe -mcpu=arm920 -o fb_write.o fb_write.s
@ link to your main.c with : arm-linux-ld -e main fb_write.o main.o -static -s -o <your gpe>

@ Credits to A_SN and senorquack from gp32x.com for the previous asm versions
@ FASTBLIT v1 by Henrik Erlandsson aka Photon of Scoopex    
@ This version uses only 8 pixels per loop, 8 instructions per pixel-pair. Also, only
@ a load and a store per 8 pixels. Derived from Senor Quack's (Dan Silsby's) version:

@ http://wiki.gp2x.org/wiki/Fast_32-bit_to_16-bit_framebuffer_blit

@ which uses two loads and a store plus 12 instructions per pixel-pair.
@ Use freely for anything you want, credit me if you feel like it's worth mentioning ;)
@ but if you put it in a commercial engine or lib, contact me (photon.AT.coppershade.org) first.

@ v2 will just use a lower loop count (code repeated inside the loop) plus recycle the
@ registers to do 12 pixels per store.

@ ! Remember to keep the whole shebang inside a .data section: r13 (sp) is stored to SaveSP
@   with a PC-relative str, so the section must be writable (r13 is mangled, briefly).

.balign 4

fb_write:    @start addr of 32bpp buf, start addr of 16bpp buf
    stmfd    sp!, {r4-r12,lr}                @store registers
    str r13,SaveSP
    ldr    r12, .L245            @ip = 9599            init of the counter
    mov lr,r0                @source
    mov r13,r1                @destination
    
    mov    r10,#0x1F<<5        @for masking 32bpp blue
    ldr r11,Gmask            @for masking both greens of 2 16bpp pixels
.L240:
    ldmia lr!, {r2-r9}        @load 8 32-bit pixels       

    and r0,r10,r2,lsl #2    @r0[9..5] = b1 ("b1"=blue of pixel 1, "r2"=red of px2, etc)
    and r1,r10,r3,lsl #2    @r1[9..5] = b2, leaving [4..0] free for a red
    orr r0,r0,r3,lsr #19    @r0 = [..b1r2]: blue of px1 + red of px2
    orr r1,r1,r2,lsr #19    @r1 = [..b2r1]: blue of px2 + red of px1
    orr r2,r2,r3,lsl #16    @r2[31..24]=0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
    and r2,r11,r2,ror #5    @rotate greens into position, mask off red/blue trash
    orr r2,r2,r1,lsl #11    @put b2r1 in place: blue2 -> [20..16], red1 -> [15..11]
    orr r2,r2,r0,ror #5        @put b1r2 in place (wrapping): blue1 -> [4..0], red2 -> [31..27]. Done! 2 extra regs used.
   
@ OK, r2 contains both converted R2 and R3 colors: R3 in 0xFFFF0000 and R2 in 0x0000FFFF

    and r0,r10,r4,lsl #2    @r0[9..5] = b1 (px1 = r4, px2 = r5 in this block)
    and r1,r10,r5,lsl #2    @r1[9..5] = b2, leaving [4..0] free for a red
    orr r0,r0,r5,lsr #19    @r0 = [..b1r2]: blue of px1 + red of px2
    orr r1,r1,r4,lsr #19    @r1 = [..b2r1]: blue of px2 + red of px1
    orr r4,r4,r5,lsl #16    @r4[31..24]=0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
    and r4,r11,r4,ror #5    @rotate greens into position, mask off red/blue trash
    orr r4,r4,r1,lsl #11    @put b2r1 in place: blue2 -> [20..16], red1 -> [15..11]
    orr r3,r4,r0,ror #5        @put b1r2 in place (wrapping): blue1 -> [4..0], red2 -> [31..27]. Done!
   
@ OK, r3 contains both converted R4 and R5 colors: R5 in 0xFFFF0000 and R4 in 0x0000FFFF

    and r0,r10,r6,lsl #2    @r0[9..5] = b1 (px1 = r6, px2 = r7 in this block)
    and r1,r10,r7,lsl #2    @r1[9..5] = b2, leaving [4..0] free for a red
    orr r0,r0,r7,lsr #19    @r0 = [..b1r2]: blue of px1 + red of px2
    orr r1,r1,r6,lsr #19    @r1 = [..b2r1]: blue of px2 + red of px1
    orr r6,r6,r7,lsl #16    @r6[31..24]=0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
    and r6,r11,r6,ror #5    @rotate greens into position, mask off red/blue trash
    orr r6,r6,r1,lsl #11    @put b2r1 in place: blue2 -> [20..16], red1 -> [15..11]
    orr r4,r6,r0,ror #5        @put b1r2 in place (wrapping): blue1 -> [4..0], red2 -> [31..27]. Done!
   
@ OK, r4 contains both converted R6 and R7 colors: R7 in 0xFFFF0000 and R6 in 0x0000FFFF

    and r0,r10,r8,lsl #2    @r0[9..5] = b1 (px1 = r8, px2 = r9 in this block)
    and r1,r10,r9,lsl #2    @r1[9..5] = b2, leaving [4..0] free for a red
    orr r0,r0,r9,lsr #19    @r0 = [..b1r2]: blue of px1 + red of px2
    orr r1,r1,r8,lsr #19    @r1 = [..b2r1]: blue of px2 + red of px1
    orr r8,r8,r9,lsl #16    @r8[31..24]=0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
    and r8,r11,r8,ror #5    @rotate greens into position, mask off red/blue trash
    orr r8,r8,r1,lsl #11    @put b2r1 in place: blue2 -> [20..16], red1 -> [15..11]
    orr r5,r8,r0,ror #5        @put b1r2 in place (wrapping): blue1 -> [4..0], red2 -> [31..27]. Done!
   
@ OK, r5 contains both converted R8 and R9 colors: R9 in 0xFFFF0000 and R8 in 0x0000FFFF

@ Now, r2-r5 contain our 8 converted pixels (in 16-bit format)

    stmia r13!,{r2-r5} 

    subs    r12, r12, #1        @ip--                decrementation of ip counter

    bpl    .L240                    @if (ip>=0) go back to .L240    counts 9599..0 = 9600 loops = 320*240/8
.Ldone:
    ldr r13,SaveSP
    ldmfd    sp!, {r4-r12,pc}    @restore registers & return

.L245:
    .long    9599                 @ total loops
   
Gmask:
    .long 0x07e007e0

SaveSP:
    .long 0
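The eight-instruction sequence above can be modeled in C to check the bit shuffling on a PC. This is only a sketch: `pixelpair_v1`, `ref_pair` and `count_mismatches_v1` are hypothetical names, `ror32` stands in for ARM's `ror` operand, and inputs are assumed to have a zero alpha byte, as the routine requires.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right, like ARM's "ror #n" operand (n must be 1..31 here). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >> n) | (x << (32 - n));
}

/* Straightforward per-channel conversion, used only as a cross-check. */
static uint32_t ref_pair(uint32_t px1, uint32_t px2)
{
    uint32_t lo = ((px1 >> 8) & 0xF800) | ((px1 >> 5) & 0x07E0) | ((px1 >> 3) & 0x001F);
    uint32_t hi = ((px2 >> 8) & 0xF800) | ((px2 >> 5) & 0x07E0) | ((px2 >> 3) & 0x001F);
    return lo | (hi << 16);
}

/* One C statement per instruction of the first v1 block (px1 = r2, px2 = r3). */
static uint32_t pixelpair_v1(uint32_t px1, uint32_t px2)
{
    const uint32_t bmask = 0x1Fu << 5;     /* r10 */
    const uint32_t gmask = 0x07e007e0u;    /* r11 */
    uint32_t t0, t1, out;

    t0  = bmask & (px1 << 2);      /* and r0,r10,r2,lsl #2 */
    t1  = bmask & (px2 << 2);      /* and r1,r10,r3,lsl #2 */
    t0 |= px2 >> 19;               /* orr r0,r0,r3,lsr #19 */
    t1 |= px1 >> 19;               /* orr r1,r1,r2,lsr #19 */
    out = px1 | (px2 << 16);       /* orr r2,r2,r3,lsl #16 */
    out = gmask & ror32(out, 5);   /* and r2,r11,r2,ror #5 */
    out |= t1 << 11;               /* orr r2,r2,r1,lsl #11 */
    out |= ror32(t0, 5);           /* orr r2,r2,r0,ror #5  */
    return out;
}

/* Compare against the reference on pseudo-random pixels with alpha = 0. */
static unsigned count_mismatches_v1(unsigned samples)
{
    uint32_t seed = 12345u;
    unsigned bad = 0;
    for (unsigned i = 0; i < samples; i++) {
        seed = seed * 1664525u + 1013904223u;
        uint32_t px1 = seed >> 8;          /* 24 bits, so alpha is 0 */
        seed = seed * 1664525u + 1013904223u;
        uint32_t px2 = seed >> 8;
        if (pixelpair_v1(px1, px2) != ref_pair(px1, px2))
            bad++;
    }
    return bad;
}
```

A mismatch count of zero only shows that the shifts and masks line up; it says nothing about speed on the ARM920.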

I called it 10000 times today to bench the above version 1, and this is the result.

PAERYN 16MB STOCK SDL 1.2.9:
----------------------------
 
[root@gp2x benchmark]$./benchmark.gpe
Waiting 5 seconds to begin test...
SDL test: 76694697 usec (10,000 calls)
        AVG ms/CALL:  7.669

SENOR QUACK'S FB_WRITE:
-----------------------
 
Beginning Senor Quack test:
(waiting 2 seconds..)
Senor Quack's fb_write: 73051858 usec (10,000 calls)
        AVG ms/CALL:  7.305

FASTBLITv1:
-----------

Photon's fb_write: 64409531 usec (10,000 calls)
        AVG ms/CALL:  6.441

Result for v1: takes 84.0% of the SDL routine's time, i.e. a "16.0% speed gain" (compared with 4.75% for Senor Quack's version).
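The percentages follow directly from the usec totals in the logs above. A small sketch of the arithmetic, with hypothetical helper names (the real benchmark.gpe source is not shown here):

```c
#include <assert.h>
#include <stdio.h>

/* Average milliseconds per call from a total in microseconds. */
static double ms_per_call(double total_usec, int calls)
{
    assert(calls > 0);
    return total_usec / calls / 1000.0;
}

/* Run time t as a percentage of a baseline run time. */
static double percent_of(double t, double base)
{
    assert(base > 0.0);
    return 100.0 * t / base;
}

/* Print one line of the comparison against the SDL total. */
static void report(const char *name, double usec, double sdl_usec, int calls)
{
    double pct = percent_of(usec, sdl_usec);
    printf("%-14s %.3f ms/call, %.1f%% of SDL, %.2f%% gain\n",
           name, ms_per_call(usec, calls), pct, 100.0 - pct);
}
```

Calling `report("FastBlit v1", 64409531.0, 76694697.0, 10000)` and the same with 73051858.0 for Senor Quack's total reproduces the figures quoted above.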


Played with it a bit again. 7 instructions per pixel-pair is impossible, but I sketched eight 8-instruction variations and picked one that used one register less and allowed the counter to be put in a mask register, so I could do 16 pixels at a time. Then I applied the usual loop repetition and made use of the 'interlock' timeslots. Read below or download the source.

Photon's FastBlitV2 - Final: 5.684 ms/call

(Comparison: 74.1% of A_SN'S version, or 22.2% down from Senor Quack's version.)
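Putting the counter in a mask register works because the blue mask lives in bits 25..21 while the counter only occupies the low byte, and a pixel shifted left by 18 has nothing in its low 18 bits for the counter to pick up. A rough C model of that idea, using the constants from the v2 source (160 loops, mask 0x1F<<21); the function name is made up:

```c
#include <assert.h>
#include <stdint.h>

#define BLUE_MASK (0x1Fu << 21)   /* blue mask used by v2, bits 25..21 */

/* Model r12 as "loop counter in the low byte + blue mask above it".
   Returns how many times the loop body runs. */
static unsigned count_iterations(unsigned loops, uint32_t sample_px)
{
    assert(loops >= 1 && loops <= 255);   /* counter must fit in the low byte */
    uint32_t r12 = loops | BLUE_MASK;     /* mov r12,#160 ; orr r12,r12,#0x1F<<21 */
    unsigned n = 0;
    do {
        /* Masking with r12 equals masking with the pure blue mask,
           whatever the counter currently holds. */
        assert((r12 & (sample_px << 18)) == (BLUE_MASK & (sample_px << 18)));
        r12 -= 1;                         /* sub r12,r12,#1 */
        n++;
    } while (r12 & 0xFF);                 /* tst r12,#0xFF ; bne .loop */
    return n;
}
```

With 160 loops of 30 blocks of 16 pixels each, the routine covers 160*30*16 = 76800 pixels, which is one 320x240 frame.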

.globl fb_write
.data
.balign 4
@ C prototype : extern void fb_write(uint64_t *screen, uint32_t *fb);

@ FASTBLIT v2 by Henrik Erlandsson aka Photon of Scoopex    
@ - Converts 32-bpp RGB buffer to 16-bpp (RGB565) buffer. NOTE: alpha channel must be 0 !
@ - Assemble with: arm-linux-as.exe -mcpu=arm920 -o fb_write.o fb_write.s
@ - Link to your main.c with : arm-linux-ld -e main fb_write.o main.o -static -s -o <your gpe>
@ ! Remember to keep the whole shebang inside a .data section: r13 (sp) is stored to .SaveSP
@   with a PC-relative str, so the section must be writable (r13 is mangled, briefly).

@ Credits to A_SN and SenorQuack from gp32x.com for the previous asm versions
@ This version blits 30x16 pixels per loop, 8 instructions per pixel-pair.
@ Derived from Senor Quack's (Dan Silsby's) version:
@ http://wiki.gp2x.org/wiki/Fast_32-bit_to_16-bit_framebuffer_blit
@ Benchmarks: 5.684 ms / call, 77.8% of SenorQuack's version, 74.1% of A_SN's version.
@ (if you blit to screen buffer, use it as noncached, buffered for performance.)
@ (Manuals suggest aligning buffers to 8 word boundaries would remove the occasional
@  split burst, but if so, it's smaller than the small fluctuations in timing.)

@ Use freely for anything you want, credit me if you feel like it... if you use it in
@ commercial software / engines / libs, ask me (photon.AT.coppershade.org) first.


@ Convert 2 32bpp pixels to a register containing 2 RGB565 pixels
.macro PIXELPAIR PX1,PX2,DEST
    and r10,r12,\PX1,lsl #18    @ mask first blue into bits 25..21 (just below the top 6)
    orr r10,r10,\PX1,lsr #19    @ shift first red to bottom (bits 4..0)

    eor \PX1,\PX1,\PX2,lsl #16    @ put second green in the empty top byte (eor acts as orr there)
    and \PX1,r11,\PX1,lsr #5    @ shift greens into place and mask them out

    orr \PX1,\PX1,r10,ror #5+16    @ first 16bpp color + second green done, r10 free

    and r10,r12,\PX2,lsl #18    @ mask second blue into bits 25..21
    orr r10,r10,\PX2,lsr #19    @ shift second red to bottom (bits 4..0)

    orr \DEST,\PX1,r10,ror #5    @ or second red + blue into top half of the pixelpair word. Done!
.endm
    
fb_writeDone:                    @ reorder so SaveSP is within reach in init _and_ exit code
    ldr r13,.SaveSP
    ldmfd sp!,{r0-r1,r4-r12,pc}    @ restore registers & return

.Gmask:
    .long 0x07e007e0
.SaveSP:
    .long 0

fb_write:                    @ args: r0,r1 = start addr of 32bpp buf, start addr of 16bpp buf
    stmfd sp!,{r0-r1,r4-r12,lr}    @ store registers
    mov lr,r0                @ source
    mov r12,#160            @ loop count
    orr r12,r12,#0x1F<<21    @ use count register as mask also
    ldr r11,.Gmask            @ for masking both greens of 2 16bpp pixels
    str r13,.SaveSP
    mov r13,r1                @ destination
.loop:
    ldmia lr!,{r0-r7}        @ load 8 32-bit pixels       
    PIXELPAIR r0,r1,r0
    PIXELPAIR r2,r3,r1
    PIXELPAIR r4,r5,r2
    PIXELPAIR r6,r7,r3
    ldmia lr!,{r4-r9}        @ load 6 32-bit pixels       
    sub r12,r12,#1            @ decrease counter during load-use interlock
    PIXELPAIR r4,r5,r4
    PIXELPAIR r6,r7,r5
    PIXELPAIR r8,r9,r6
    ldmia lr!,{r7-r8}        @ load 2 32-bit pixels       
    tst r12,#0xFF            @ test during load-use interlock: counter bits 0?
    PIXELPAIR r7,r8,r7
    stmia r13!,{r0-r7}        @ store 16 pixels (8 words)
.rept 29
    ldmia lr!,{r0-r7}        @ load 8 32-bit pixels       
    PIXELPAIR r0,r1,r0
    PIXELPAIR r2,r3,r1
    PIXELPAIR r4,r5,r2
    PIXELPAIR r6,r7,r3
    ldmia lr!,{r4-r9}        @ load 6 32-bit pixels       
    PIXELPAIR r4,r5,r4
    PIXELPAIR r6,r7,r5
    PIXELPAIR r8,r9,r6
    ldmia lr!,{r7-r8}        @ load 2 32-bit pixels       
    PIXELPAIR r7,r8,r7
    stmia r13!,{r0-r7}        @ store 16 pixels (8 words)
.endr
    bne .loop                @ if Count<>0, repeat
    b fb_writeDone
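As a sanity check of the PIXELPAIR macro, here is a C model with one statement per instruction, compared against a plain per-channel conversion for every counter value the loop can hold. It is a sketch under the same assumptions as the asm (alpha byte is 0, r12 holds the counter in its low byte plus the 0x1F<<21 mask); `pixelpair_v2`, `ref_pair` and `count_mismatches_v2` are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right, like ARM's "ror #n" operand (n must be 1..31 here). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >> n) | (x << (32 - n));
}

/* Straightforward per-channel conversion, used only as a cross-check. */
static uint32_t ref_pair(uint32_t px1, uint32_t px2)
{
    uint32_t lo = ((px1 >> 8) & 0xF800) | ((px1 >> 5) & 0x07E0) | ((px1 >> 3) & 0x001F);
    uint32_t hi = ((px2 >> 8) & 0xF800) | ((px2 >> 5) & 0x07E0) | ((px2 >> 3) & 0x001F);
    return lo | (hi << 16);
}

/* C model of PIXELPAIR PX1,PX2,DEST; the return value is DEST. */
static uint32_t pixelpair_v2(uint32_t px1, uint32_t px2, uint32_t r12)
{
    const uint32_t r11 = 0x07e007e0u;   /* .Gmask */
    uint32_t r10;

    r10  = r12 & (px1 << 18);      /* and r10,r12,PX1,lsl #18   */
    r10 |= px1 >> 19;              /* orr r10,r10,PX1,lsr #19   */
    px1 ^= px2 << 16;              /* eor PX1,PX1,PX2,lsl #16   */
    px1  = r11 & (px1 >> 5);       /* and PX1,r11,PX1,lsr #5    */
    px1 |= ror32(r10, 5 + 16);     /* orr PX1,PX1,r10,ror #5+16 */
    r10  = r12 & (px2 << 18);      /* and r10,r12,PX2,lsl #18   */
    r10 |= px2 >> 19;              /* orr r10,r10,PX2,lsr #19   */
    return px1 | ror32(r10, 5);    /* orr DEST,PX1,r10,ror #5   */
}

/* Sweep pseudo-random pixel pairs over all counter values 1..160. */
static unsigned count_mismatches_v2(unsigned samples)
{
    uint32_t seed = 2008u;
    unsigned bad = 0;
    for (unsigned i = 0; i < samples; i++) {
        seed = seed * 1664525u + 1013904223u;
        uint32_t px1 = seed >> 8;                 /* 24 bits, so alpha is 0 */
        seed = seed * 1664525u + 1013904223u;
        uint32_t px2 = seed >> 8;
        uint32_t r12 = (1u + i % 160u) | (0x1Fu << 21);
        if (pixelpair_v2(px1, px2, r12) != ref_pair(px1, px2))
            bad++;
    }
    return bad;
}
```

Because the result does not depend on the counter bits in r12, the macro can be repeated freely inside the unrolled loop.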
 

 

senquack
Nice work
Reply #1 on : Mon September 29, 2008, 17:58:49
Nice speedup, I think I've learned a few things ;) I will try to adapt this to be used in UQM2X for a further speedup soon.