C#•14mo ago

Micro-optimizing a Z80 emulator's pipeline. Unsafe code

so, i'm writing an emulator, and i'm trying to squeeze as much perf as i can out of hot paths. this seemingly simple fetch operation consumes about a third of the CPU time:

private byte Fetch()
{
    return _memory.Read(Registers.PC++);
}

my memory class looks like this:

private GCHandle _memHandle;
private byte* pMem;
private byte[] _memory;

public MainMemory(int size) 
{
    // pin array and get GC ptr. omitted for brevity.
    pMem = (byte*)_memHandle.AddrOfPinnedObject();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public byte Read(ushort address) => pMem[address];

i know, it's a bit messy, and it's not really safe either, but boy does it give some perf gains. it's worth it. also, the array won't change size so it should be alright.

my registers are in a similar situation. the register set is actually an array of bytes that are accessed using constant indices into said array, like this:

private GCHandle _regSetHandle;
private byte* pRegSet;
public byte[] RegisterSet;
public ProcessorRegisters()
{
    RegisterSet = new byte[26];
    _regSetHandle = GCHandle.Alloc(RegisterSet, GCHandleType.Pinned);
    pRegSet = (byte*)_regSetHandle.AddrOfPinnedObject();
}
// example
byte regA = pRegSet[Registers.A]; // A is an index into the array; i tried to follow the Z80's convention, so A = 7

but this is a Z80, so it also has 16-bit register pairs. each pair can be accessed either as its high and low bytes or as a whole 16-bit value, so the exposed pairs are backed by this same array. i implemented them using properties:

public ushort PC
{
    get => (ushort)((pRegSet[PCi] << 8) | pRegSet[PCiL]);
    set
    {
        pRegSet[PCi] = (byte)(value >> 8);
        pRegSet[PCiL] = (byte)value;
    }
}

with all of this in mind, how can i make that fetch instruction faster and use less CPU time?

Buddy•11/17/24, 1:39 AM

This might be something for #allow-unsafe-blocks

PdawgOP•11/17/24, 1:39 AM

should i forward a link to this thread, or repost there?

Buddy•11/17/24, 1:40 AM

Probably forward it maybe?
just say that you were redirected there

PdawgOP•11/17/24, 1:40 AM

okay

Aaron•11/17/24, 1:45 AM

if that's your Fetch? there really isn't much you can do

PdawgOP•11/17/24, 1:45 AM

there's no crazy pointer magic left for me to do?

Aaron•11/17/24, 1:45 AM

they're not magic, sadly

PdawgOP•11/17/24, 1:46 AM

lol

PdawgOP•11/17/24, 1:46 AM

what about interfaces? is there anything i can do to optimize properties in interfaces?

PdawgOP•11/17/24, 1:46 AM

the interrupt handler is another hot path that also is unavoidable

Aaron•11/17/24, 1:46 AM

interfaces?

PdawgOP•11/17/24, 1:47 AM

the databus is represented by an interface

PdawgOP•11/17/24, 1:47 AM

it has to be because this emulator is (going to be) a library

Aaron•11/17/24, 1:48 AM

Fetch/Read wouldn't happen to be interface methods, would they

PdawgOP•11/17/24, 1:48 AM

no no lol

PdawgOP•11/17/24, 1:48 AM

im talking about properties here

Aaron•11/17/24, 1:48 AM

:j_phew:

(replying to "the databus is represented by an interface")

PdawgOP•11/17/24, 1:49 AM

^ this

Buddy•11/17/24, 1:49 AM

You can use $paste to send full code snippets

MODiX•11/17/24, 1:49 AM

If your code is too long, you can post to https://paste.mod.gg/, save, and copy the link into chat for others to see your shared code!

Aaron•11/17/24, 1:49 AM

I mean interfaces aren't fast, but there's not much you can do about that

PdawgOP•11/17/24, 1:49 AM

simply checking both every clock cycle takes a serious amount of CPU time

PdawgOP•11/17/24, 1:50 AM

it has to be done tho cause, ya know, thats how cpus work

PdawgOP•11/17/24, 1:50 AM

15%

PdawgOP•11/17/24, 1:51 AM

hmm - i guess ill pack em into a byte or something so that its only reading once per instruction
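
A rough sketch of the packing idea, assuming the two things being checked are the INT and NMI lines. `BusSignals`, `IDataBus`, `TestBus`, and `InterruptPoller` are made-up names for illustration, not types from the emulator. The point is one interface call per instruction, followed by cheap bit tests on a local value.

```csharp
using System;

// Hypothetical flag layout; the names are illustrative only.
[Flags]
public enum BusSignals : byte
{
    None = 0,
    Int  = 1 << 0, // maskable interrupt request
    Nmi  = 1 << 1, // non-maskable interrupt request
}

// Hypothetical bus interface exposing both lines through a single property.
public interface IDataBus
{
    BusSignals Signals { get; }
}

// Trivial bus implementation used to drive the sketch.
public sealed class TestBus : IDataBus
{
    public BusSignals Signals { get; set; }
}

public sealed class InterruptPoller
{
    private readonly IDataBus _bus;
    public int NmiCount { get; private set; }
    public int IntCount { get; private set; }

    public InterruptPoller(IDataBus bus) => _bus = bus;

    // Called once per instruction: one interface call, then local bit tests.
    public void Poll()
    {
        BusSignals s = _bus.Signals;
        if (s == BusSignals.None) return;           // common case: nothing pending
        if ((s & BusSignals.Nmi) != 0) NmiCount++;  // NMI wins over INT
        else if ((s & BusSignals.Int) != 0) IntCount++;
    }
}
```

The early return keeps the common no-interrupt case down to a single comparison after the read.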

PdawgOP•11/17/24, 2:04 AM

marginal improvement but it gave me another ~600k instructions per second

PdawgOP•11/17/24, 2:04 AM

so ill take it

PdawgOP•11/17/24, 4:43 AM

this is fast enough righttt?

PdawgOP•11/17/24, 4:43 AM

this is in release mode

PdawgOP•11/17/24, 4:44 AM

100 million instructions a second, it's good enough. way better than anything else i've tried in C#. the fastest one behind Z80Sharp was 30 million/s

Klarth•11/17/24, 6:36 AM

I'd say that anything around an order of magnitude loss is expected for CPU emulation. Pushing much further than you are will probably require codegen to native.

PdawgOP•11/17/24, 6:36 AM

i actually haven't tried AoT yet, i wonder how fast it'd go

PdawgOP•11/17/24, 6:39 AM

as expected, its a bit slower. i could work towards AoT perf more but it's whatever

Klarth•11/17/24, 6:42 AM

I don't mean AOT. That's going to run slower at steady state than a very warm JIT.

Klarth•11/17/24, 6:43 AM

I mean to codegen the z80 into x64 assembly and then run x64 assembly. Sometimes called dynamic recompilation.

PdawgOP•11/17/24, 6:43 AM

ah that’s what I was about to ask

PdawgOP•11/17/24, 6:44 AM

I am terrible at modern x86(_64) assembly so I’ll steer clear lol. z80 and ARMv6 ASM is where it’s at

Klarth•11/17/24, 6:44 AM

You can't do this across the entire software being emulated, but you can for stretches if you have good detection. Software also tends to codegen in RAM, so you need to ensure that it's the same when you rerun cached x64 output.
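
To make the caching concern concrete, here is a minimal sketch of a translated-block cache that drops a block whenever the emulated program writes into the bytes it was built from. Everything here is hypothetical (`TranslatedBlock`, `BlockCache`, `OnWrite`), and an `Action` delegate stands in for real generated x64 code.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for a compiled block; a real recompiler would hold native code here.
public sealed class TranslatedBlock
{
    public ushort Start;           // Z80 address the block begins at
    public ushort Length;          // number of Z80 code bytes it was built from
    public Action Run = () => { };
}

public sealed class BlockCache
{
    private readonly Dictionary<ushort, TranslatedBlock> _blocks =
        new Dictionary<ushort, TranslatedBlock>();

    public int Count => _blocks.Count;

    public void Add(TranslatedBlock block) => _blocks[block.Start] = block;

    public bool TryGet(ushort pc, out TranslatedBlock block) =>
        _blocks.TryGetValue(pc, out block);

    // Call on every emulated memory write: drop blocks whose source bytes changed.
    public void OnWrite(ushort address)
    {
        var stale = new List<ushort>();
        foreach (var (start, block) in _blocks)
        {
            if (address >= start && address < start + block.Length)
                stale.Add(start);
        }
        foreach (ushort start in stale)
            _blocks.Remove(start);
    }
}
```

The linear scan in `OnWrite` is only there to keep the sketch short; a per-page dirty flag would be a more realistic way to find affected blocks.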

PdawgOP•11/17/24, 6:47 AM

earlier today I was doing some bug fixing and finally got BASIC to actually execute though! RLCA and inc ix/iy were messed up (idk how I even messed that up, it’s so simple. wasn’t paying attention I guess.)

cap5lut•11/17/24, 7:15 AM

more guessing here, but if u store the register array internally as `ushort[]` instead of `byte[]`,
it will be aligned to 2 bytes, then for the 16 bit registers u can do just one aligned read/write, instead of reading/writing/bit shifting the bytes
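
A minimal sketch of that layout, using safe spans rather than the thread's pinned pointers. The `RegisterFile` type and its slot numbering are assumptions for illustration. Pairs live in 16-bit slots, and the 8-bit halves are read through a byte view over the same storage, with the host byte order checked through `BitConverter.IsLittleEndian`.

```csharp
using System;
using System.Runtime.InteropServices;

public sealed class RegisterFile
{
    // Hypothetical layout: one 16-bit slot per register pair.
    public const int BC = 0, DE = 1, HL = 2, PC = 3;

    private readonly ushort[] _pairs = new ushort[4];

    // Whole pair: a single 16-bit load/store, no shifting or masking.
    public ushort this[int pair]
    {
        get => _pairs[pair];
        set => _pairs[pair] = value;
    }

    // Byte view over the same storage, used for the 8-bit halves.
    private Span<byte> Bytes => MemoryMarshal.AsBytes(_pairs.AsSpan());

    // On a little-endian host the low half sits at the lower address.
    private static int LowOffset  => BitConverter.IsLittleEndian ? 0 : 1;
    private static int HighOffset => BitConverter.IsLittleEndian ? 1 : 0;

    public byte Low(int pair)  => Bytes[pair * 2 + LowOffset];
    public byte High(int pair) => Bytes[pair * 2 + HighOffset];

    public void SetLow(int pair, byte value)  => Bytes[pair * 2 + LowOffset] = value;
    public void SetHigh(int pair, byte value) => Bytes[pair * 2 + HighOffset] = value;
}
```

Span indexing is bounds checked, so this is slower than raw pointer access; the same two views could be taken over a pinned buffer instead.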

cap5lut•11/17/24, 7:18 AM

and if u use an inline array for the registers and pin , u can probably also save one indirection, which might improve performance
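
A sketch of what an inline array could look like here, assuming C# 12 / .NET 8 or later. `RegisterBuffer` and `InlineRegisters` are hypothetical names; the size 26 and the index `A = 7` come from the original post. The sketch leaves pinning out and uses the compiler's bounds-checked element access.

```csharp
using System;
using System.Runtime.CompilerServices;

// Requires C# 12 / .NET 8 or later.
// The runtime lays this struct out as 26 consecutive bytes.
[InlineArray(26)]
public struct RegisterBuffer
{
    private byte _element0;
}

public sealed class InlineRegisters
{
    public const int A = 7; // index taken from the original post

    // Stored inside this object itself, so there is no separate byte[]
    // reference to follow before reaching the register bytes.
    private RegisterBuffer _regs;

    public byte this[int index]
    {
        get => _regs[index];
        set => _regs[index] = value;
    }
}
```

Whether this beats a pinned `byte[]` plus cached pointer in practice would need measuring.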

Klarth•11/17/24, 7:20 AM

I assume it's that way because they didn't want to worry about endianness on big endian systems.

cap5lut•11/17/24, 7:24 AM

nothing a bit of usage cant fix ;p

PdawgOP•11/17/24, 8:01 AM

I like where you’re going, but most registers are used in their 8-bit forms, so it makes sense to leave them as such. The reason I specifically optimized the pairs was because the PC register is constantly having to be read from/written to. I may actually move it out to its own ushort tho, as i don’t think I really need to address it as individual 8-bit chunks
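
For illustration, a minimal sketch of PC living in its own `ushort` field, so a fetch needs no byte splitting or recombining. `Z80Core` and `Load` are hypothetical, and a plain `byte[]` stands in for the pinned memory from the original post.

```csharp
using System;
using System.Runtime.CompilerServices;

public sealed class Z80Core
{
    private readonly byte[] _memory = new byte[0x10000]; // full 64 KiB address space
    private ushort _pc;                                   // PC kept outside the register array

    public ushort PC
    {
        get => _pc;
        set => _pc = value;
    }

    // Hypothetical helper that copies a program into memory for the sketch.
    public void Load(ushort address, params byte[] program) =>
        program.CopyTo(_memory, address);

    // One array read and one 16-bit increment; in the default unchecked
    // context the ushort wraps from 0xFFFF back to 0x0000.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public byte Fetch() => _memory[_pc++];
}
```

If some instruction ever needs PC's high or low byte separately, it can be computed on demand with a shift or cast, which should be much rarer than a fetch.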

cap5lut•11/17/24, 8:05 AM

well, u can still keep the byte pointer to read the 8 bit registers individually, it would only benefit that ushort, but has no effects on the others.
having them in the same contiguous memory will also help with caching, i would assume, so having them all on the same cache line might be important

PdawgOP•11/17/24, 8:05 AM

I pinned the byte array. is it inline by default? or does indexing into it like still require some lookup
