Wednesday, 19 July 2017

A Fast Z80 Integer Square Root

A project I'm working on needs a fast square root but I couldn't find anything suitable online. After implementing several versions of the bit-by-bit algorithm I discovered the following code is particularly efficient when unrolled:

/* Return the square root of numb */

int isqrt( numb )
{
    int root = 0, bit = 04000h;
    while ( bit != 0 )
    {
        if ( numb >= root + bit )
        {
            numb = numb - root - bit;
            root = root / 2 + bit;
        }
        else
            root = root / 2;
        bit = bit / 4;
    }
    return root;
}

First Make It Work

The looping version is small but clunky. It spends almost 400 t-states shifting bits. We'll be able to eliminate most of the shifting by hard-coding the bit positions when the loop is unrolled:

; 16-bit integer square root
; 34 bytes, 1005-1101 cycles (average 1053)

; call with de = number to square root
; returns   hl = square root
; corrupts  bc, de

  ld bc,08000h
  ld h,c
  ld l,c
sqrloop:
  srl b
  rr c
  add hl,bc
  ex de,hl
  sbc hl,de
  jr c,sqrbit
  ex de,hl
  add hl,bc
  jr sqrfi
sqrbit:
  add hl,de
  ex de,hl
  or a
  sbc hl,bc
sqrfi:
  srl h
  rr l
  srl b
  rr c
  jr nc,sqrloop

Then Make It Work Faster

First the loop is unrolled. The first 4 iterations are optimized to work on 8-bit values and bit positions are hard-coded. The first and last iteration are optimized as a special case, we work with the bitwise complement of the root instead of the root and small jumps are replace with overlapping code. The final code finds the root in an average of 362 t-states:

; fast 16-bit integer square root
; 92 bytes, 344-379 cycles (average 362)
; v2 - 3 t-state optimization spotted by Russ McNulty

; call with hl = number to square root
; returns    a = square root
; corrupts  hl, de

  ld a,h
  ld de,0B0C0h
  add a,e
  jr c,sq7
  ld a,h
  ld d,0F0h
sq7:

; ----------

  add a,d
  jr nc,sq6
  res 5,d
  db 254
sq6:
  sub d
  sra d

; ----------

  set 2,d
  add a,d
  jr nc,sq5
  res 3,d
  db 254
sq5:
  sub d
  sra d

; ----------

  inc d
  add a,d
  jr nc,sq4
  res 1,d
  db 254
sq4:
  sub d
  sra d
  ld h,a

; ----------

  add hl,de
  jr nc,sq3
  ld e,040h
  db 210
sq3:
  sbc hl,de
  sra d
  ld a,e
  rra

; ----------

  or 010h
  ld e,a
  add hl,de
  jr nc,sq2
  and 0DFh
  db 218
sq2:
  sbc hl,de
  sra d
  rra

; ----------

  or 04h
  ld e,a
  add hl,de
  jr nc,sq1
  and 0F7h
  db 218
sq1:
  sbc hl,de
  sra d
  rra

; ----------

  inc a
  ld e,a
  add hl,de
  jr nc,sq0
  and 0FDh
sq0:
  sra d
  rra
  cpl

3 comments:

  1. This is clever and elegant! It was bothering me all night, so I'm glad I stumbled upon this-- it's easily 30% faster than my fastest. When I actually get some time, I plan to dissect this to extend the precision by 8 bits. I'm using a 16→8.8 square root routine as a first approximation for 80-bit floats, then an iteration of Newton's method with 32-bit result, then a final iteration with 64-bit result to obtain the mantissa. Currently it's roughly 28000cc.

    I can't believe I never thought to use a conditional jump to skip over the SBC instructions like that! I've abused tricks like that before in division to skip parts of an iteration (and I didn't miss that CP, either-- that's snazzy).

    ReplyDelete
  2. Really very well written and executed.

    ReplyDelete
  3. The sample code in the 'First Make It Work' section shows

    ld bc,08000h

    Should this be

    ld bc,04000h

    To match the psudo code?

    ReplyDelete

Note: only a member of this blog may post a comment.