mMm... Optimization (c) Homer
Got the USB network working for my GP2X, so I decided to fiddle with it a bit. I can now code in both Pascal and C, but I came across this 32bpp to 16bpp blit, and saw it as the perfect asm optimization challenge to warm up my old ARM assembler skills... :)
I went through one 10-instruction and two 9-instruction versions before I finally got from the original 12 down to 8 instructions per pixel-pair. Why do I spend 4 hours shoveling bits? I guess I think it's time well spent <3
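For reference, here is a minimal plain-C sketch of what a 32bpp to 16bpp (RGB565) blit computes. It assumes pixels are laid out as 0x00RRGGBB, which is what the shifts in the asm imply. The names rgb888_to_rgb565 and fb_write_ref are made up for illustration and are not part of any of the asm versions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one 0x00RRGGBB pixel to RGB565 by keeping the top 5/6/5 bits
   of red/green/blue. Hypothetical helper, for illustration only. */
static uint16_t rgb888_to_rgb565(uint32_t px)
{
    assert((px >> 24) == 0);            /* the asm versions require alpha == 0 */
    uint32_t r = (px >> 19) & 0x1F;     /* red   bits 23..19 */
    uint32_t g = (px >> 10) & 0x3F;     /* green bits 15..10 */
    uint32_t b = (px >>  3) & 0x1F;     /* blue  bits  7..3  */
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Straightforward, unoptimized blit of `count` pixels. */
static void fb_write_ref(const uint32_t *src, uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = rgb888_to_rgb565(src[i]);
}
```

The asm versions produce the same result, but handle two pixels per output word so that masks and shifts can be shared.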
.globl fb_write
.globl SaveSP
.data
@ C prototype : extern void fb_write(uint32_t *screen, uint16_t *fb);
@ Converts 32-bpp RGB buffer to 16-bpp (RGB565) buffer. NOTE: alpha channel must be 0 !
@ assemble with: arm-linux-as.exe -mcpu=arm920 -o fb_write.o fb_write.s
@ link to your main.c with : arm-linux-ld -e main fb_write.o main.o -static -s -o <your gpe>
@ Credits to A_SN and senorquack from gp32x.com for the previous asm versions
@ FASTBLIT v1 by Henrik Erlandsson aka Photon of Scoopex
@ This version uses only 8 pixels per loop, 8 instructions per pixel-pair. Also, only
@ a load and a store per 8 pixels. Derived from Senor Quack's (Dan Silsby's) version:
@ http://wiki.gp2x.org/wiki/Fast_32-bit_to_16-bit_framebuffer_blit
@ which uses two loads and a store plus 12 instructions per pixel-pair.
@ Use freely for anything you want, credit me if you feel like it's worth mentioning ;)
@ but if you put it in a commercial engine or lib, contact me (photon.AT.coppershade.org) first.
@ v2 will just use a lower loop count (code repeated inside the loop) plus recycle the
@ registers to do 12 pixels per store.
@ ! Remember to put the shebang inside a .data section as r13 is saved (and mangled, briefly).
.balign 4
fb_write: @start addr of 32bpp buf, start addr of 16bpp buf
stmfd sp!, {r4-r12,lr} @store registers
str r13,SaveSP
ldr r12, .L245 @ip = loop counter, loaded from .L245
mov lr,r0 @source
mov r13,r1 @destination
mov r10,#0x1F<<5 @for masking 32bpp blue
ldr r11,Gmask @for masking both greens of 2 16bpp pixels
.L240:
ldmia lr!, {r2-r9} @load 8 32-bit pixels
and r0,r10,r2,lsl #2 @mask blue 1 into r0:9..5 ("b1"=blue of pixel 1, "r2"=red of px2, etc)
and r1,r10,r3,lsl #2 @mask blue 2 into r1:9..5, leaving 4..0 free for a red
orr r0,r0,r3,lsr #19 @r0:[..b1r2] (append the other pixel's red below the blue)
orr r1,r1,r2,lsr #19 @r1:[..b2r1]
orr r2,r2,r3,lsl #16 @r2:[31..24] was 0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
and r2,r11,r2,ror #5 @mask off red/blue trash, greens now in their final positions
orr r2,r2,r1,lsl #11 @place b2r1: red 1 above g1, blue 2 below g2
orr r2,r2,r0,ror #5 @place b1r2: blue 1 below g1, red 2 above g2 (wrapping). Done! 2 extra regs used.
@ OK, r2 contains both converted R2 and R3 colors: R3 in 0xFFFF0000 and R2 in 0x0000FFFF
and r0,r10,r4,lsl #2 @mask blue 1 into r0:9..5 (pixel 1 = r4, pixel 2 = r5)
and r1,r10,r5,lsl #2 @mask blue 2 into r1:9..5, leaving 4..0 free for a red
orr r0,r0,r5,lsr #19 @r0:[..b1r2] (append the other pixel's red below the blue)
orr r1,r1,r4,lsr #19 @r1:[..b2r1]
orr r4,r4,r5,lsl #16 @r4:[31..24] was 0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
and r4,r11,r4,ror #5 @mask off red/blue trash, greens now in their final positions
orr r4,r4,r1,lsl #11 @place b2r1: red 1 above g1, blue 2 below g2
orr r3,r4,r0,ror #5 @place b1r2: blue 1 below g1, red 2 above g2 (wrapping). Done! 2 extra regs used.
@ OK, r3 contains both converted R4 and R5 colors: R5 in 0xFFFF0000 and R4 in 0x0000FFFF
and r0,r10,r6,lsl #2 @mask blue 1 into r0:9..5 (pixel 1 = r6, pixel 2 = r7)
and r1,r10,r7,lsl #2 @mask blue 2 into r1:9..5, leaving 4..0 free for a red
orr r0,r0,r7,lsr #19 @r0:[..b1r2] (append the other pixel's red below the blue)
orr r1,r1,r6,lsr #19 @r1:[..b2r1]
orr r6,r6,r7,lsl #16 @r6:[31..24] was 0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
and r6,r11,r6,ror #5 @mask off red/blue trash, greens now in their final positions
orr r6,r6,r1,lsl #11 @place b2r1: red 1 above g1, blue 2 below g2
orr r4,r6,r0,ror #5 @place b1r2: blue 1 below g1, red 2 above g2 (wrapping). Done! 2 extra regs used.
@ OK, r4 contains both converted R6 and R7 colors: R7 in 0xFFFF0000 and R6 in 0x0000FFFF
and r0,r10,r8,lsl #2 @mask blue 1 into r0:9..5 (pixel 1 = r8, pixel 2 = r9)
and r1,r10,r9,lsl #2 @mask blue 2 into r1:9..5, leaving 4..0 free for a red
orr r0,r0,r9,lsr #19 @r0:[..b1r2] (append the other pixel's red below the blue)
orr r1,r1,r8,lsr #19 @r1:[..b2r1]
orr r8,r8,r9,lsl #16 @r8:[31..24] was 0 (alpha), so g2 lands in [31..24]; g1 stays in [15..8]
and r8,r11,r8,ror #5 @mask off red/blue trash, greens now in their final positions
orr r8,r8,r1,lsl #11 @place b2r1: red 1 above g1, blue 2 below g2
orr r5,r8,r0,ror #5 @place b1r2: blue 1 below g1, red 2 above g2 (wrapping). Done! 2 extra regs used.
@ OK, r5 contains both converted R8 and R9 colors: R9 in 0xFFFF0000 and R8 in 0x0000FFFF
@ Now, r2-r5 contain our 8 converted pixels (in 16-bit format)
stmia r13!,{r2-r5}
subs r12, r12, #1 @ip-- decrementation of ip counter
bne .L240 @if (ip!=0) go back to .L240 loop condition
.Ldone:
ldr r13,SaveSP
ldmfd sp!, {r4-r12,pc} @restore registers & return
.L245:
.long 9600 @ total loops: 320x240 = 76800 pixels, 8 per loop
Gmask:
.long 0x07e007e0
SaveSP:
.long 0
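To see why the eight instructions per pixel-pair work, the sequence can be mirrored in C, one statement per instruction. This is only a sketch for checking the bit positions on a desktop machine, assuming 0x00RRGGBB input. The names ror32 and pixelpair_v1 are invented here and do not appear in the asm source.

```c
#include <assert.h>
#include <stdint.h>

/* ARM "ror": rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Mirrors the eight v1 instructions for one pixel pair (first block of
   the loop). p1 and p2 are 0x00RRGGBB. The result holds pixel 1 in the
   low halfword and pixel 2 in the high halfword, both as RGB565. */
static uint32_t pixelpair_v1(uint32_t p1, uint32_t p2)
{
    const uint32_t bmask = 0x1Fu << 5;      /* r10: blue mask, bits 9..5 */
    const uint32_t gmask = 0x07E007E0u;     /* r11: both green fields    */
    assert((p1 >> 24) == 0 && (p2 >> 24) == 0);  /* alpha must be 0 */

    uint32_t t0 = bmask & (p1 << 2);        /* and r0,r10,r2,lsl #2  */
    uint32_t t1 = bmask & (p2 << 2);        /* and r1,r10,r3,lsl #2  */
    t0 |= p2 >> 19;                         /* orr r0,r0,r3,lsr #19  */
    t1 |= p1 >> 19;                         /* orr r1,r1,r2,lsr #19  */
    p1 |= p2 << 16;                         /* orr r2,r2,r3,lsl #16  */
    p1 = gmask & ror32(p1, 5);              /* and r2,r11,r2,ror #5  */
    p1 |= t1 << 11;                         /* orr r2,r2,r1,lsl #11  */
    p1 |= ror32(t0, 5);                     /* orr r2,r2,r0,ror #5   */
    return p1;
}
```

For example, pixelpair_v1(0x00123456, 0x00FFFFFF) gives 0xFFFF11AA: 0x11AA is the RGB565 value of 0x123456 and 0xFFFF is white.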
I called it 10000 times today to bench the above version 1, and this is the result.
PAERYN 16MB STOCK SDL 1.2.9:
----------------------------
[root@gp2x benchmark]$./benchmark.gpe
Waiting 5 seconds to begin test...
SDL test: 76694697 usec (10,000 calls)
AVG ms/CALL: 7.669
SENOR QUACK'S FB_WRITE:
-----------------------
Beginning Senor Quack test:
(waiting 2 seconds..)
Senor Quack's fb_write: 73051858 usec (10,000 calls)
AVG ms/CALL: 7.305
FASTBLITv1:
Photon's fb_write: 64409531 usec (10,000 calls)
AVG ms/CALL: 6.441
Result for v1: Takes 84.0% of the SDL routine, or a "16.0% speed gain" (to compare with 4.75% for Senor Quack's version)
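The percentages above come straight from the logged totals. A small C sketch of the arithmetic, with hypothetical helper names (avg_ms_per_call and percent_saved are not part of the benchmark program):

```c
#include <assert.h>
#include <stdint.h>

/* Average time per call in milliseconds, given a total in microseconds. */
static double avg_ms_per_call(uint64_t total_usec, uint32_t calls)
{
    assert(calls > 0);
    return (double)total_usec / (double)calls / 1000.0;
}

/* Percentage of time saved relative to a baseline run. */
static double percent_saved(uint64_t usec, uint64_t baseline_usec)
{
    assert(baseline_usec > 0);
    return 100.0 * (1.0 - (double)usec / (double)baseline_usec);
}
```

With the totals from the log, percent_saved(64409531, 76694697) comes out at roughly 16.0 and percent_saved(73051858, 76694697) at roughly 4.75, matching the figures quoted.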
Played with it a bit again. 7 instructions per pixel-pair is impossible, but I sketched eight 8-instruction variations and picked one that used one register less and allowed the counter to share a register with a mask, so I could go 16 pixels at a time. Then I applied the usual loop repetition and made use of the 'interlock' timeslots. Read below or download the source.
Photon's FastBlitV2 - Final: 5.684 ms/call
(Comparison: 74.1% of A_SN'S version, or 22.2% down from Senor Quack's version.)
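The counter-in-a-mask-register trick is easier to see in C. The sketch below assumes the layout used by the v2 listing: loop count in the low byte, blue mask 0x1F<<21 above it. The function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLUE_MASK (0x1Fu << 21)   /* bits 25..21, as in "orr r12,r12,#0x1F<<21" */

/* Build the shared register: loop count in the low byte, blue mask above it. */
static uint32_t make_count_and_mask(uint32_t count)
{
    assert(count > 0 && count <= 0xFF);   /* must fit under "tst r12,#0xFF" */
    return count | BLUE_MASK;
}

/* "and r10,r12,PX,lsl #18": the shifted pixel has zeros in bits 17..0,
   so the counter bits in the low byte can never leak into the result. */
static uint32_t masked_blue(uint32_t count_and_mask, uint32_t px)
{
    return count_and_mask & (px << 18);
}

/* "sub r12,r12,#1" followed by "tst r12,#0xFF":
   returns nonzero while the loop should repeat. */
static int step_counter(uint32_t *count_and_mask)
{
    *count_and_mask -= 1;
    return (*count_and_mask & 0xFF) != 0;
}
```

Because the count never drops below zero, the decrement never borrows into the mask bits, so the mask stays valid for the whole blit. With a count of 160 and 30 x 16 pixels per pass, the loop covers 160 x 480 = 76800 pixels, i.e. 320x240.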
.globl fb_write
.data
.balign 4
@ C prototype : extern void fb_write(uint32_t *screen, uint16_t *fb);
@ FASTBLIT v2 by Henrik Erlandsson aka Photon of Scoopex
@ - Converts 32-bpp RGB buffer to 16-bpp (RGB565) buffer. NOTE: alpha channel must be 0 !
@ - Assemble with: arm-linux-as.exe -mcpu=arm920 -o fb_write.o fb_write.s
@ - Link to your main.c with : arm-linux-ld -e main fb_write.o main.o -static -s -o <your gpe>
@ ! Remember to put the shebang inside a .data section as r13 is saved (and mangled, briefly).
@ Credits to A_SN and SenorQuack from gp32x.com for the previous asm versions
@ This version blits 30x16 pixels per loop, 8 instructions per pixel-pair.
@ Derived from Senor Quack's (Dan Silsby's) version:
@ http://wiki.gp2x.org/wiki/Fast_32-bit_to_16-bit_framebuffer_blit
@ Benchmarks: 5.684 ms / call, 77.8% of SenorQuack's version, 74.1% of A_SN's version.
@ (if you blit to screen buffer, use it as noncached, buffered for performance.)
@ (Manuals suggest aligning buffers to 8 word boundaries would remove the occasional
@ split burst, but if so, it's smaller than the small fluctuations in timing.)
@ Use freely for anything you want, credit me if you feel like it... if you use it in
@ commercial software / engines / libs, ask me (photon.AT.coppershade.org) first.
@ Convert 2 32bpp pixels to a register containing 2 RGB565 pixels
.macro PIXELPAIR PX1,PX2,DEST
and r10,r12,\PX1,lsl #18 @ mask first blue to top-6
orr r10,r10,\PX1,lsr #19 @ shiftmask first red to bottom
eor \PX1,\PX1,\PX2,lsl #16 @ merge second green into empty top byte (eor acts as or here: green bits never overlap)
and \PX1,r11,\PX1,lsr #5 @ mask out greens and shift them into place
orr \PX1,\PX1,r10,ror #5+16 @ first 16bpp color + second green done, r10 free
and r10,r12,\PX2,lsl #18 @ mask second blue to top-6
orr r10,r10,\PX2,lsr #19 @ shiftmask second red to bottom
orr \DEST,\PX1,r10,ror #5 @ or second red + blue into top half of the pixelpair word. Done!
.endm
fb_writeDone: @ reorder so SaveSP is within reach in init _and_ exit code
ldr r13,.SaveSP
ldmfd sp!,{r0-r1,r4-r12,pc} @ restore registers & return
.Gmask:
.long 0x07e007e0
.SaveSP:
.long 0
fb_write: @ args: r0,r1 = start addr of 32bpp buf, start addr of 16bpp buf
stmfd sp!,{r0-r1,r4-r12,lr} @ store registers
mov lr,r0 @ source
mov r12,#160 @ loop count
orr r12,r12,#0x1F<<21 @ use count register as mask also
ldr r11,.Gmask @ for masking both greens of 2 16bpp pixels
str r13,.SaveSP
mov r13,r1 @ destination
.loop:
ldmia lr!,{r0-r7} @ load 8 32-bit pixels
PIXELPAIR r0,r1,r0
PIXELPAIR r2,r3,r1
PIXELPAIR r4,r5,r2
PIXELPAIR r6,r7,r3
ldmia lr!,{r4-r9} @ load 6 32-bit pixels
sub r12,r12,#1 @ decrease counter during load-use interlock
PIXELPAIR r4,r5,r4
PIXELPAIR r6,r7,r5
PIXELPAIR r8,r9,r6
ldmia lr!,{r7-r8} @ load 2 32-bit pixels
tst r12,#0xFF @ test during load-use interlock: counter bits 0?
PIXELPAIR r7,r8,r7
stmia r13!,{r0-r7} @ store 16 pixels (8 words)
.rept 29
ldmia lr!,{r0-r7} @ load 8 32-bit pixels
PIXELPAIR r0,r1,r0
PIXELPAIR r2,r3,r1
PIXELPAIR r4,r5,r2
PIXELPAIR r6,r7,r3
ldmia lr!,{r4-r9} @ load 6 32-bit pixels
PIXELPAIR r4,r5,r4
PIXELPAIR r6,r7,r5
PIXELPAIR r8,r9,r6
ldmia lr!,{r7-r8} @ load 2 32-bit pixels
PIXELPAIR r7,r8,r7
stmia r13!,{r0-r7} @ store 16 pixels (8 words)
.endr
bne .loop @ if Count<>0, repeat
b fb_writeDone
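The PIXELPAIR macro can be checked the same way as v1, by mirroring it in C one statement per instruction. This is a sketch under the same assumptions (0x00RRGGBB input, alpha 0). ror32 and pixelpair_v2 are hypothetical names, and the r12 parameter stands for the combined counter/mask register.

```c
#include <assert.h>
#include <stdint.h>

/* ARM "ror": rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Mirrors the eight PIXELPAIR instructions. r12 carries the blue mask
   (0x1F<<21) plus the loop counter in its low byte. Returns pixel 1 in
   the low halfword and pixel 2 in the high halfword, both as RGB565. */
static uint32_t pixelpair_v2(uint32_t p1, uint32_t p2, uint32_t r12)
{
    const uint32_t gmask = 0x07E007E0u;          /* r11 */
    assert((p1 >> 24) == 0 && (p2 >> 24) == 0);  /* alpha must be 0 */

    uint32_t t = r12 & (p1 << 18);   /* and r10,r12,PX1,lsl #18   */
    t |= p1 >> 19;                   /* orr r10,r10,PX1,lsr #19   */
    p1 ^= p2 << 16;                  /* eor PX1,PX1,PX2,lsl #16   */
    p1 = gmask & (p1 >> 5);          /* and PX1,r11,PX1,lsr #5    */
    p1 |= ror32(t, 5 + 16);          /* orr PX1,PX1,r10,ror #5+16 */
    t = r12 & (p2 << 18);            /* and r10,r12,PX2,lsl #18   */
    t |= p2 >> 19;                   /* orr r10,r10,PX2,lsr #19   */
    return p1 | ror32(t, 5);         /* orr DEST,PX1,r10,ror #5   */
}
```

The counter value in the low byte of r12 does not affect the result, so pixelpair_v2(0x00123456, 0x00FFFFFF, r12) gives 0xFFFF11AA for any count from 1 to 160.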
Reply #1 on : Mon September 29, 2008, 17:58:49