Dirty tricks 6502 programmers use
This post recaps some of the C64 coding tricks used in my little Commodore 64 coding competition. The competition rules were simple: make a C64 executable (PRG) that draws two lines to form the below image. The objective was to do this in as few bytes as possible.
Entries were posted as Twitter replies and DMs, containing only the PRG byte-length and an MD5 hash of the PRG file.
Here’s a list of participants with source code links to their submissions:
- Philip Heron (code - 34 bytes - compo winner)
- Geir Straume (code - 34 bytes)
- Petri Häkkinen (code - 37 bytes)
- Mathlev Raxenblatz (code - 38 bytes)
- Jan Achrenius (code - 48 bytes)
- Jamie Fuller (code - 50 bytes)
- David A. Gershman (code - 53 bytes)
- Janne Hellsten (code - 56 bytes)
(If I missed someone, please let me know and I’ll update the post.)
The rest of this post focuses on some of the assembly coding tricks used in the compo submissions.
Basics
The C64 default graphics mode is the 40x25 charset mode. The framebuffer is split into two arrays in RAM:
$0400
(Screen RAM, 40x25 bytes)$d800
(Color RAM, 40x25 bytes)
To set a character, you store a byte into screen RAM at $0400
(e.g., $0400+y*40+x
). Color RAM is by default initialized to light blue (color 14) which happens to be the same color we use for the lines – meaning we can leave color RAM untouched.
You can control the border and background colors with memory mapped I/O registers at $d020
(border) and $d021
(background).
Drawing the two lines is pretty easy as we can hardcode for the fixed line slope. Here’s a C implementation that draws the lines and dumps screen contents on stdout (register writes stubbed out and screen RAM is malloc()
’d to make it run on PC):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
void dump(const uint8_t* screen) {
const uint8_t* s = screen;
for (int y = 0; y < 25; y++) {
for (int x = 0; x < 40; x++, s++) {
("%c", *s == 0xa0 ? '#' : '.');
printf}
("\n");
printf}
}
void setreg(uintptr_t dst, uint8_t v) {
// *((uint8_t *)dst) = v;
}
int main() {
// uint8_t* screenRAM = (uint_8*)0x0400;
uint8_t* screenRAM = (uint8_t *)calloc(40*25, 0x20);
(0xd020, 0); // Set border color
setreg(0xd021, 0); // Set background color
setreg
int yslope = (25<<8)/40;
int yf = yslope/2;
for (int x = 0; x < 40; x++) {
int yi = yf >> 8;
// First line
[x + yi*40] = 0xa0;
screenRAM// Second line (X-mirrored)
[(39-x) + yi*40] = 0xa0;
screenRAM+= yslope;
yf }
(screenRAM);
dump}
The screen codes used above are: $20
(blank) and $a0
(8x8 filled block). If you run it, you should see ASCII art for the two lines:
##....................................##
..#..................................#..
...##..............................##...
.....#............................#.....
......##........................##......
........##....................##........
..........#..................#..........
...........##..............##...........
.............#............#.............
..............##........##..............
................##....##................
..................#..#..................
...................##...................
..................#..#..................
................##....##................
..............##........##..............
.............#............#.............
...........##..............##...........
..........#..................#..........
........##....................##........
......##........................##......
.....#............................#.....
...##..............................##...
..#..................................#..
##....................................##
Using 6502 assembly and assembly pseudos, we can trivially implement the same in assembly:
include "c64.asm"
!
::basic_start(entry)
+c64
entry: {
#0 ; black color
lda $d020 ; set border to 0
sta $d021 ; set background to 0
sta
; clear the screen
#0
ldx #$20
lda clrscr:
[0, $100, $200, $300] {
!for i in $0400 + i, x
sta }
inx
bne clrscr
; line drawing, completely unrolled
; with assembly pseudos
#$a0
lda
(40) {
!for i in range= Math.floor(25/40*(i+0.5))
!let y0 $0400 + y0*40 + i
sta $0400 + (24-y0)*40 + i
sta }
inf: jmp inf ; halt
}
This completely unrolls the line drawing part resulting in a fairly large 286 byte PRG.
Before diving into optimized variants, let’s make a couple of observations:
First, we’re running on the C64 with the ROM routines banked in. There’s a bunch of subroutines in ROM that may be useful for our little program. For example, you can clear the screen with JSR $E544
.
Second, address calculations on an 8-bit CPU like the 6502 can be cumbersome and cost a lot of bytes. This CPU also doesn’t have a multiplier, so computing something like y*40+i
usually involves either a bunch of logical shifts or a lookup table, again costing bytes. To avoid multiplying by 40, we can instead advance the screen pointer incrementally:
int yslope = (25<<8)/40;
int yf = yslope/2;
uint8_t* dst = screenRAM;
for (int x = 0; x < 40; x++) {
[x] = 0xa0;
dst[(39-x)] = 0xa0;
dst+= yslope;
yf if (yf & 256) { // Carry set?
+= 40;
dst &= 255;
yf }
}
We keep adding the line slope to a fixed point counter yf
and when the 8-bit addition sets the carry flag, add 40.
Here’s the incremental approach implemented in assembly:
include "c64.asm"
!
::basic_start(entry)
+c64
= $20
!let screenptr = $40
!let x0 = $41
!let x1 = $60
!let yf
entry: {
#0
lda
sta x0$d020
sta $d021
sta
; kernal clear screen
$e544
jsr
; set screenptr = $0400
#<$0400
lda +0
sta screenptr#>$0400
lda +1
sta screenptr
#80
lda
sta yf
#39
lda
sta x1xloop:
#$a0
lda
ldy x0; screenRAM[x] = 0xA0
(screenptr), y
sta
ldy x1; screenRAM[39-x] = 0xA0
(screenptr), y
sta
clc
#160 ; line slope
lda adc yf
sta yf
bcc no_add
; advance screen ptr by 40
clc
lda screenptradc #40
sta screenptr+1
lda screenptradc #0
+1
sta screenptr
no_add:
inc x0
dec x1
bpl xloop
inf: jmp inf
}
At 82 bytes, this is still pretty hefty. A couple of obvious size problems arise from 16-bit address computations:
Setting up the screenptr
value for indirect-indexed addressing:
; set screenptr = $0400
#<$0400
lda +0
sta screenptr#>$0400
lda +1 sta screenptr
Advancing screenptr
to the next row by adding 40:
; advance screen ptr by 40
clc
lda screenptradc #40
sta screenptr+1
lda screenptradc #0
+1 sta screenptr
Sure this code could probably be made smaller but what if we didn’t need manipulate 16-bit addresses in the first place? Let’s see this can be avoided.
Trick 1: Scrolling!
Instead of plotting the line across the screen RAM, we draw only on the last Y=24 screen row, and scroll the whole screen up by calling a “scroll up” ROM function with JSR $E8EA
!
The x-loop becomes:
#0
lda
sta x0#39
lda
sta x1xloop:
#$a0
lda
ldx x0; hardcoded absolute address to last screen line
$0400 + 24*40, x
sta
ldx x1$0400 + 24*40, x
sta
adc yf
sta yf
bcc no_scroll; scroll screen up!
$e8ea
jsr no_scroll:
inc x0
dec x1
bpl xloop
Here’s how the line renderer progresses with this trick:
This trick was one of my favorites in this compo. It was also independently discovered by pretty much every participant.
Trick 2: Self-modifying code
The code to store the pixel values ends up being roughly:
ldx x1; hardcoded absolute address to last screen line
$0400 + 24*40, x
sta
ldx x0$0400 + 24*40, x
sta inc x0
dec x1
This encodes into the following 14 byte sequence:
0803: A6 22 LDX $22
0805: 9D C0 07 STA $07C0,X
0808: A6 20 LDX $20
080A: 9D C0 07 STA $07C0,X
080D: E6 22 INC $22
080F: C6 20 DEC $20
There’s a more compact way to write this using self-modifying code (SMC)..
ldx x1$0400 + 24*40, x
sta addr0: sta $0400 + 24*40
; advance the second x-coord with SMC
inc addr0+1
dec x1
..which encodes to 13 bytes:
0803: A6 22 LDX $22
0805: 9D C0 07 STA $07C0,X
0808: 8D C0 07 STA $07C0
080B: EE 09 08 INC $0809
080E: C6 22 DEC $22
Trick 3: Exploiting the power on state
Making wild assumptions about the running environment was considered OK in this compo: the line drawing PRG is the first thing that’s run after C64 power on, and there was no requirement to exit cleanly back to the BASIC prompt. So anything you find from the initial environment upon entry to your PRG, you can and should use to your advantage. Here are some of the things that were considered “constant” upon entry to the PRG:
- A, X, Y registers were assumed to be all zeros
- All CPU flags cleared
- Zeropage (addresses
$00
-$ff
) contents
Similarly, if you called any KERNAL ROM routines, you could totally take advantage of any side-effects they might have: returned CPU flags, temporary values set into zeropage, etc.
After the first few size-optimization passes, everyone turned their eyes on this machine monitor view to look for any interesting values:
The zeropage indeed contains some useful values for our purposes:
$d5
: 39/$27 == line length - 1$22
: 64/$40 == initial value for line slope counter
You can use these to shave off a few bytes at init time. For example:
= $20
!let x0 #39 ; 0801: A9 27 LDA #$27
lda ; 0803: 85 20 STA $20
sta x0 xloop:
dec x0 ; 0805: C6 20 DEC $20
; 0807: 10 FC BPL $0805 bpl xloop
As $d5
contains a value 39, you can map your x0
counter to point to $d5
and skip the LDA/STA pair:
= $d5
!let x0 ; nothing here!
xloop:
dec x0 ; 0801: C6 D5 DEC $D5
; 0803: 10 FC BPL $0801 bpl xloop
Philip’s winning entry takes this to the extreme. Recall the address of the last char row $07C0
(==$0400+24*40
). This value does not exist in the zeropage on init. However, as a side-effect of how the ROM “scroll up” subroutine uses zeropage temporaries, addresses $D1-$D2
will contain $07C0
on return from this function. So instead of STA $07C0,x
to store a pixel, you can use the one byte shorter indirect-indexed addressing mode store STA ($D1),y
.
Trick 4: Smaller startup
A typical C64 PRG binary file contains the following:
- First 2 bytes: loading address (usually
$0801
) - 12 bytes of BASIC startup sequence
The BASIC startup sequence looks like this (addresses $801-$80C
):
0801: 0B 08 0A 00 9E 32 30 36 31 00 00 00
080D: 8D 20 D0 STA $D020
Without going into details about tokenized BASIC memory layout, this sequence more or less amounts to “10 SYS 2061”. Address 2061
($080D
) is where our actual machine code program starts when the BASIC interpreter executes the SYS command.
14 bytes just to get going feels excessive. Philip, Mathlev and Geir had used some clever tricks to get rid of the BASIC sequence altogether. This requires that the PRG is loaded with LOAD "*",8,1
as LOAD "*",8
ignores the PRG loading address (the first two bytes) and always loads to $0801
.
Two methods were used:
- The stack trick
- The BASIC warm reset vector trick
The stack trick
The trick is to stomp the CPU stack at $01F8
with a value that points to our desired entry point. This is done by crafting a PRG that starts with a 16-bit pointer pointing to our code and loading the PRG into $01F8
:
* = $01F8word scroll - 1 ; overwrite stack
!
scroll: jsr $E8EA
Once the BASIC loader (see disassembly) has finished loading and returns to its caller with RTS
, instead of returning to whoever called LOAD, it returns right into our PRG.
The BASIC warm reset vector trick
This is a little easier to explain by just looking at the PRG disassembly.
02E6: 20 EA E8 JSR $E8EA
02E9: A4 D5 LDY $D5
02EB: A9 A0 LDA #$A0
02ED: 99 20 D0 STA $D020,Y
02F0: 91 D1 STA ($D1),Y
02F2: 9D B5 07 STA $07B5,X
02F5: E6 D6 INC $D6
02F7: 65 90 ADC $90
02F9: 85 90 STA $90
02FB: C6 D5 DEC $D5
02FD: 30 FE BMI $02FD
02FF: 90 E7 BCC $02E8
0301: 4C E6 02 JMP $02E6
Notice the last line (JMP $02E6
). The JMP instruction starts at address $0301
with the branch target stored in addresses $0302-$0303
.
When this code is loaded into memory starting at address $02E6
, a value of $02E6
is written to addresses $0302-$0303
. Well, location $0302-$0303
has a special meaning: it contains a pointer to the “BASIC idle loop” (see C64 memory map for details). Loading the PRG overwrote this location with $02E6
and so when the BASIC interpreter tries to jump to the idle loop after warm reset, it never enters the idle loop but instead ends up in the line renderer!
Other BASIC startup related tricks
Petri had discovered another BASIC start trick which allows injecting your own constants into the zeropage. In this method, you hand-craft your own tokenized BASIC start sequence and encode your constants into the BASIC program line number. The BASIC line number, ahem, your constants, will be stored in addresses $39-$3A
upon entry. Very clever!
Trick 5: Unconventional control flow
Here’s a somewhat simplified version of the x-loop that draws only a single line and then halts execution once the line is done:
#39
lda
sta x1xloop:
#$a0
lda
ldx x1$0400 + 24*40, x
sta
adc yf
sta yf
bcc no_scroll; scroll screen up!
$e8ea
jsr no_scroll:
dec x1
bpl xloop
; intentionally halt at the end
inf: jmp inf
This has a bug in it, though. When we’ve drawn the last pixel of a line, we should NOT scroll the screen up anymore. Thus we need more branching to skip scrolling on the last pixel write:
#39
lda
sta x1xloop:
#$a0
lda
ldx x1$0400 + 24*40, x
sta
dec x1
; skip scrolling if last pixel
bmi done
adc yf
sta yf
bcc no_scroll; scroll screen up!
$e8ea
jsr no_scroll:
jmp xloop
done:
; intentionally halt at the end
inf: jmp inf
The control flow looks a lot like what a C compiler would output from a structured program. The code to skip the last scroll introduced a new JMP abs
instruction that takes up 3 bytes. Conditional branches are only two bytes as they encode the branch target using a relative 8-bit immediate.
The “skip last scroll” JMP can be avoided by moving the scroll up call to the top of the loop, and restructuring the control flow a bit. This is the pattern Philip had come up with:
#39
lda
sta x1scroll: jsr $e8ea
xloop:
#$a0
lda
ldx x1$0400 + 24*40, x
sta
adc yf
sta yfdec x1 ; doesn't set carry!
inf: bmi inf ; hang here if last pixel!
; next pixel if no scroll
bcc xloop ; scroll up and continue bcs scroll
This completely eliminates one 3 byte JMP and converts another JMP to a 2 byte conditional branch, saving 4 bytes in total.
Trick 6: Bitpacked line drawing
Some of the entries didn’t use a line slope counter but rather they had bit-packed the line pattern into an 8-bit constant. This packing comes out of a realisation that the pixel position along the line follows a repeating 8 pixel pattern:
int mask = 0xB6; // 10110110
uint8_t* dst = screenRAM;
for (int x = 0; x < 40; x++) {
[x] = 0xA0;
dstif (mask & (1 << (x&7))) {
+= 40; // go down a row
dst }
}
This translates to pretty compact assembly. The slope counter variants tended to be even smaller, though.
Winner entry
This is the winning 34 byte entry from Philip. Most of the above really comes together nicely in his code:
= $22 ; == $40, initial value for the overflow counter
ov = $D5 ; == $27 / 39, number of passes. Decrementing, finished at -1
ct = $D1 ; == $07C0, pointer to bottom line. Set by the kernal scroller
lp
; Overwrite the return address of the kernal loader on the stack
; with a pointer to our own code
* = $01F8word scroll - 1
.
scroll: jsr $E8EA ; Kernal scroll up, also sets lp pointer to $07C0
loop: ldy ct ; Load the decrementing counter into Y (39 > -1)
#$A0 ; Load the PETSCII block / black col / ov step value
lda $D020, y ; On the last two passes, sets the background black
sta p1: sta $07C0 ; Draw first block (left > right line)
(lp), y ; Draw second block (right > left line)
sta inc p1 + 1 ; Increment pointer for the left > right line
adc ov ; Add step value $A0 to ov
sta ovdec ct ; Decrement the Y counter
* ; If it goes negative, we're finished
bmi ; Repeat. If ov didn't overflow, don't scroll
bcc loop ; Repeat. If ov overflowed, scroll bcs scroll
Why stop at 34 bytes, though?
Once the competition was over, everyone shared code and notes, and a number of lively conversations took place on how to do even better. Several smaller variants were posted after the deadline:
You should check them out – there are some real gems to be found.
…
Thanks for reading. And most of all, thanks Mathlev, Phil, Geir, Petri, Jamie, Jan and David for your participation. (I hope I didn’t miss anyone – it was really difficult to keep track of these in Twitter mentions!)
PS. Petri had named my compo “@nurpax’s annual C64 size optimization compo”, so uhm, see you next year, I guess.