I think I might have some ideas to try. Unfortunately, I'm really busy with work and HoloLens-related stuff (attending a couple conferences in May), so it's hard for me to dig into core emulation these days. I do want to see this release make it out in the next couple of months. I'm okay with deferring frame timing until after that, if it's not ready by then. For me, I think having specular lighting working in the new engine and making it the default engine is the top priority for a new release. Then we can go from there incrementally -- hopefully that sounds workable to everyone.
Anyway, looking at the code now, the Real3D status bit is driven by VBlank. Nik's idea was that the bit is some sort of a busy bit and that either the Real3D starts rendering at VBL (almost certainly wrong) or that the bit itself is linked to VBL. So at the start of the VBL period, he takes the current PowerPC cycle count, adds some time to it (to simulate the busy time), and then each time the status bit is read, checks to see whether the PowerPC cycle count has exceeded that new target value. Only then does the bit flip.
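As a rough sketch of the VBL-driven scheme just described: the class name, method names, and delay value below are all hypothetical and not taken from the actual code, and the real polarity of the status bit is not modeled (this only tracks "busy" versus "not busy").

```cpp
#include <cstdint>

// Hypothetical model of the VBL-driven status bit. Names and the busy
// duration are illustrative assumptions, not the emulator's real code.
class StatusBitTimer
{
public:
    explicit StatusBitTimer(uint64_t busyCycles) : m_busyCycles(busyCycles) {}

    // Call at the start of the VBL period with the current PowerPC cycle count
    void BeginVBlank(uint64_t ppcCycles)
    {
        m_targetCycles = ppcCycles + m_busyCycles;  // simulated busy time
        m_busy = true;
    }

    // Call on every read of the status bit. The bit only flips once the
    // PowerPC cycle count has exceeded the target value.
    bool IsBusy(uint64_t ppcCycles)
    {
        if (m_busy && ppcCycles > m_targetCycles)
            m_busy = false;
        return m_busy;
    }

private:
    uint64_t m_busyCycles;
    uint64_t m_targetCycles = 0;
    bool m_busy = false;
};
```

Note that nothing here depends on when rendering actually starts; the bit is purely a function of VBL time plus a fixed delay.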
I think 0x88000000 drives the rendering process. And if so, the busy bit should be synchronized to 0x88000000 being written, not to VBL starting! Need to double check this.
What I may try first is experimenting with GPU threading disabled, since the threading code adds some complications. We probably want to call SyncGPUs() when 0x88000000 is written, then render the frame. Meanwhile, the PowerPC should let some cycles elapse after the write.
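A minimal single-threaded sketch of that experiment, assuming the busy period is armed by the write to 0x88000000 rather than by VBL. The struct, its members, and the two callbacks (stand-ins for SyncGPUs() and the frame render) are hypothetical, and the busy duration is a placeholder that would need tuning.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch: sync and render when 0x88000000 is written, then
// keep the status bit "busy" for an assumed number of PowerPC cycles.
struct Real3DFlushModel
{
    static constexpr uint32_t kFlushAddr = 0x88000000;

    uint64_t busyCycles = 0;            // assumed busy duration (needs tuning)
    std::function<void()> syncGPUs;     // stand-in for SyncGPUs()
    std::function<void()> renderFrame;  // stand-in for rendering the frame

    uint64_t busyUntil = 0;
    bool busy = false;

    // Called for 32-bit writes, with the PowerPC cycle count at the write
    void Write32(uint32_t addr, uint64_t ppcCycles)
    {
        if (addr != kFlushAddr)
            return;
        if (syncGPUs)
            syncGPUs();
        if (renderFrame)
            renderFrame();
        busyUntil = ppcCycles + busyCycles;  // busy bit is tied to the write
        busy = true;
    }

    // Called when the status bit is read
    bool ReadBusy(uint64_t ppcCycles)
    {
        if (busy && ppcCycles > busyUntil)
            busy = false;
        return busy;
    }
};
```

The only real difference from the VBL-driven version is where the timer gets armed, which should make the two approaches easy to compare.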
In the long run, SyncGPUs() needs to be split up because it also syncs the tilegen, which is not correct. The tilegen should probably wait until the beginning of the next frame to render. We want to do this in the GPU thread as well, so maybe it doesn't matter too much. Eventually, the threading code needs to be rewritten so that the GPU thread is synchronized to the PowerPC writing 0x88000000. If the GPU is still busy rendering the previous frame (extremely unlikely), we can either drop the frame (meaning we would not swap the ping pong memory) or wait for the GPU to free up.
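The "drop the frame" option could look something like the sketch below. Everything in it is hypothetical (class and method names, and the use of plain vectors for the two sides of the ping pong memory), and the atomic flag is only a placeholder for whatever synchronization the rewritten GPU thread would really need.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical ping pong pair: the PowerPC writes one side while the GPU
// reads the other. Not the emulator's actual memory layout.
class PingPongFrames
{
public:
    // Called when the PowerPC writes 0x88000000. Returns true if the frame
    // was handed to the GPU, false if it was dropped because the GPU was
    // still busy with the previous frame.
    bool OnFlush()
    {
        if (m_gpuBusy.load())
            return false;       // drop: do not swap the ping pong memory
        m_cpuSide ^= 1;         // swap sides
        m_gpuBusy.store(true);  // GPU now owns the side just written
        return true;
    }

    // Called by the GPU thread when it finishes rendering
    void OnRenderDone() { m_gpuBusy.store(false); }

    std::vector<uint32_t> &CpuBuffer() { return m_buffers[m_cpuSide]; }
    const std::vector<uint32_t> &GpuBuffer() const { return m_buffers[m_cpuSide ^ 1]; }

private:
    std::array<std::vector<uint32_t>, 2> m_buffers;
    int m_cpuSide = 0;
    std::atomic<bool> m_gpuBusy{false};
};
```

The "wait for the GPU to free up" alternative would replace the early return with a blocking wait on the same flag.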
One remaining question I have is about the non-ping pong memory regions. We probably need to treat them as ping pong memory, too.