Long term dev plan

Post by **Ian** » Sat Aug 03, 2024 7:03 pm

I've had some ideas in my head for a while now to make emulation... better.

I thought I'd put them down on paper. I've made some progress towards these.

1. The ability to drop frames. This is required for VF3. Currently, during the level-transition loading phases, we draw crazy amounts of garbage. There are some hacks in the renderer to stop it blowing up during these stages; without them it would probably crash. A few games will crash in rare circumstances if you underclock the CPU, because the renderer tries to render incomplete frames. Other games try to draw things they shouldn't during boot; I think Virtual-On does this.

- To make this work I rewrote the renderer so that the 3D graphics are independent of the 2D. Previously the two were composited together, which made independent drawing basically impossible.

What's left?

- Just need to plug in the correct logic to make this work. Frames seem to be terminated when tilegen register 0xC is written. For the 3D to kick off, 0x88 has to have been written to the 3D hardware. Some games write this a few times per frame; I *think* it just marks the transfer as complete. Some games also write extra data after 0x88. I'm not 100% sure whether that happens after the ping_pong bit flips, or whether these writes are double buffered at all.
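A minimal sketch of the frame-drop decision, assuming (as above, but unconfirmed) that a write into the 0x88xxxxxx range kicks off 3D rendering and that a write to tilegen register 0xC ends the frame. `FrameTracker` and its methods are hypothetical names, not Supermodel code, and the open question about data written after 0x88 is not modelled.

```cpp
#include <cstdint>

// Hypothetical frame-drop tracker. Assumes a write into the 0x88xxxxxx range
// kicks off 3D rendering and a write to tilegen register 0xC ends the frame.
class FrameTracker
{
public:
    // Called for every write to the 3D hardware.
    void OnReal3DWrite(uint32_t addr)
    {
        if ((addr >> 24) == 0x88) {
            m_renderRequested = true;   // may happen several times per frame
        }
    }

    // Called for every tilegen register write. Returns true when the frame ends
    // and should be rendered, false otherwise (including when it is dropped).
    bool OnTileGenWrite(uint32_t reg)
    {
        if (reg != 0xC) {
            return false;               // not the end of the frame
        }
        bool render = m_renderRequested;
        m_renderRequested = false;      // start tracking the next frame
        if (render) {
            ++m_framesRendered;
        } else {
            ++m_framesDropped;          // nothing complete to draw: keep the last image
        }
        return render;
    }

    unsigned FramesRendered() const { return m_framesRendered; }
    unsigned FramesDropped() const { return m_framesDropped; }

private:
    bool     m_renderRequested = false;
    unsigned m_framesRendered  = 0;
    unsigned m_framesDropped   = 0;
};
```

A frame that ends without any 0x88 write is simply skipped, which is the behaviour wanted for the garbage drawn during loading screens.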

2. Move the start of the frame back to when v-blank fires. The hardware starts writing data for a new frame roughly two-thirds of the way through the current frame, a sort of triple-buffered approach. To fix data straddling two frames, I shifted the end of the frame to where the ping_pong bit flips, but this is a bit of a kludge.

- To make this work we need to figure out exactly how the double buffered memory writes happen.

3. Some games, e.g. Scud Race and possibly The Lost World, update the tilegen mid-frame, which is perfectly legal. Scud has an animated car which is currently completely missing in emulation. Quite often the updates happen after the ping_pong bit flips, but in theory a game could write at any point in the frame.

- To fix this we should draw line by line whenever the tilegen hardware updates the image. The 3D and 2D hardware cannot draw at the same time with OpenGL, so to make this work we'd have to buffer all the writes to the tilegen. All the efficiency of the GPU tilegen shader implementation basically goes out the window as soon as you have to draw line by line or buffer memory writes.

- I've actually nearly finished this. I ported the GLSL shader code (itself a port of Bart's original tilegen code) back to software. It just needs a bit of cleanup really.
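One possible way to buffer the tilegen writes is to log each one with the first scanline it should affect, then replay the log while drawing, so every line sees the register state the game had set for it. This is only a sketch: `TileGenWriteLog`, `LoggedWrite` and the 64-entry register file are hypothetical, and the per-line drawing is stubbed out as a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// One tilegen register write, tagged with the first scanline it should affect.
struct LoggedWrite
{
    int      line;
    uint32_t reg;
    uint32_t value;
};

// Hypothetical log of tilegen writes for one frame, replayed while drawing.
class TileGenWriteLog
{
public:
    static constexpr uint32_t kNumRegs = 64;   // placeholder register count

    // Writes are expected to arrive in scanline order, as they do during emulation.
    void Record(int line, uint32_t reg, uint32_t value)
    {
        assert(reg < kNumRegs);
        m_writes.push_back({ line, reg, value });
    }

    // Draw every line, applying each logged write just before the first line it
    // affects, so drawLine always sees the register state set for that line.
    void DrawFrame(int numLines, uint32_t* regs,
                   const std::function<void(int, const uint32_t*)>& drawLine)
    {
        std::size_t next = 0;
        for (int y = 0; y < numLines; y++) {
            while (next < m_writes.size() && m_writes[next].line <= y) {
                regs[m_writes[next].reg] = m_writes[next].value;
                next++;
            }
            drawLine(y, regs);
        }
        for (; next < m_writes.size(); next++) {   // writes made after the last visible line
            regs[m_writes[next].reg] = m_writes[next].value;
        }
        m_writes.clear();
    }

private:
    std::vector<LoggedWrite> m_writes;
};
```

Because the drawing happens after the frame, the 3D and 2D never need to be drawn at the same moment; the cost is that the tilegen has to run on the CPU line by line.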

4. Code cleanup. I assume the project was originally developed mostly in Visual Studio? Bart later changed the project to use 2 spaces for indentation instead of 4. For us Visual Studio peasants this is really painful lol. I'd personally like to convert it back to 4 spaces...

I'm sure there are plenty of other bits in the code to clean up.

5. What's missing?

Post by **Bart** » Sun Aug 04, 2024 12:52 am

Converting to 4 spaces is fine (just please not tab characters lol). The project was always built with GCC originally, and I used UltraEdit to write it.

Nik added MSVC support.

I've forgotten quite a bit about the frame logic. I wish I had written up those emails we exchanged several years back when I ran a series of experiments on the VF3 board. Are we sure that 0xC terminates rendering? I seem to recall it is written around the time that 0x88000000 is, but it wasn't clear what these two mean.

I'm still pretty sure that ping pong memory is actually double-buffered and the other culling RAM region is not. Should this be emulated?

Lastly, were there changes to the tile renderer logic in GLSL or can we just revert back to the original CPU rendering code?

Post by **Ian** » Sun Aug 04, 2024 10:04 am

I'm fairly sure 0xC is the last thing written in a frame. It makes sense that the tilegen controls the rendering, since it generates IRQ2. The ping_pong flip time is also written to the tilegen when games boot up, and that is really only relevant to the 3D hardware.

Post by **gm_matthew** » Mon Aug 05, 2024 4:34 pm

We made one change to the tilegen code since it was implemented in GLSL, basically just to swap the scroll values for the odd/even lines since we had them the wrong way round.

Writing to tilegen register 0xC tells the tilegen to signal the video board to flip the ping-pong buffers when it reaches the active video line number specified in register 0x8. Once that's done, there's nothing left to do except wait for the decrementer exception, which is timed to fire approximately when the ping-pong buffers flip.
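The mechanism above can be modelled roughly like this. It is a sketch that assumes register 0x8 holds the raw active-line number and that the flip is a one-shot event armed by each 0xC write; `PingPongFlipper` and its methods are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model: writing tilegen register 0xC arms a ping-pong flip,
// which happens when video output reaches the active line held in register 0x8.
class PingPongFlipper
{
public:
    void WriteTileGenReg(uint32_t reg, uint32_t value)
    {
        if (reg == 0x8) {
            m_flipLine = value;          // assumption: raw active-line number
        } else if (reg == 0xC) {
            m_flipArmed = true;          // flip when the beam reaches m_flipLine
        }
    }

    // Called once per active video line.
    void OnScanline(uint32_t line)
    {
        if (m_flipArmed && line == m_flipLine) {
            m_pingPong ^= 1;             // back buffer becomes front and vice versa
            m_flipArmed = false;         // one flip per 0xC write
        }
    }

    int FrontBuffer() const { return m_pingPong; }

private:
    uint32_t m_flipLine  = 0;
    bool     m_flipArmed = false;
    int      m_pingPong  = 0;
};
```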

Ping-pong memory is double-buffered, the rest of culling RAM is not. Writing to 0x8Cxxxxxx accesses culling RAM directly, while writing to 0x8Exxxxxx accesses the ping-pong buffer within culling RAM that is not being actively used to generate the next frame. Writing to 0x90xxxxxx, 0x94xxxxxx or 0x98xxxxxx only writes to texture RAM or polygon RAM if rendering has finished, otherwise it writes to the update buffer first. There are two update buffers, also stored within culling RAM, which are swapped at the same time as the ping-pong buffers to allow the video board to update polygon RAM and/or texture RAM for the current frame while also receiving updates for the next frame.
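A rough sketch of the write routing described above. Everything here is an assumption for illustration: memory sizes are tiny placeholders, all addresses are folded into them, texture and polygon RAM share one array, and queued updates are applied all at once at the flip rather than progressively during the frame.

```cpp
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of how Real3D writes are routed. kWords is a placeholder
// size, and every address is folded into it.
class Real3DWriteRouter
{
public:
    static constexpr uint32_t kWords = 1024;
    enum class Target { CullingDirect, PingPongBack, VideoRamDirect, UpdateBuffer, Ignored };

    void SetRenderingFinished(bool finished) { m_renderingFinished = finished; }

    Target Write32(uint32_t addr, uint32_t data)
    {
        uint32_t offset = ((addr & 0x00FFFFFF) / 4) % kWords;   // placeholder decode
        switch (addr >> 24) {
        case 0x8C:                                   // culling RAM, not double-buffered
            m_culling[offset] = data;
            return Target::CullingDirect;
        case 0x8E:                                   // ping-pong half not used by the renderer
            m_pingPong[m_front ^ 1][offset] = data;
            return Target::PingPongBack;
        case 0x90: case 0x94: case 0x98:             // texture or polygon RAM
            if (m_renderingFinished) {
                m_videoRam[offset] = data;
                return Target::VideoRamDirect;
            }
            m_update[m_front ^ 1].push_back({ offset, data });   // queue until the flip
            return Target::UpdateBuffer;
        default:
            return Target::Ignored;
        }
    }

    // Ping-pong and update buffers swap together; the updates queued before the
    // flip are then applied (simplification: immediately and all at once).
    void Flip()
    {
        m_front ^= 1;
        auto& pending = m_update[m_front];
        for (const auto& [offset, data] : pending) {
            m_videoRam[offset] = data;
        }
        pending.clear();
    }

    uint32_t VideoRam(uint32_t offset) const { return m_videoRam[offset]; }
    uint32_t PingPong(int half, uint32_t offset) const { return m_pingPong[half][offset]; }

private:
    std::array<uint32_t, kWords> m_culling{};
    std::array<std::array<uint32_t, kWords>, 2> m_pingPong{};
    std::array<uint32_t, kWords> m_videoRam{};
    std::array<std::vector<std::pair<uint32_t, uint32_t>>, 2> m_update;
    int  m_front = 0;
    bool m_renderingFinished = false;
};
```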

I think games only update culling RAM directly (via 0x8Cxxxxxx) when nothing is being rendered. I have a hunch that this only happens when rendering is disabled via JTAG; I'll have to check.

Post by **Bart** » Fri Aug 09, 2024 11:06 pm

Some additional dev priorities that I'd like to get to:

- Better racing controls for both analog sticks and digital pads. I have some ideas, just need to find time to start playing around.
- Spruce up documentation and web page.
- Add analytics. Given that the user base is much larger than I realized, it would be interesting to know how frequently they upgrade, what input settings they tend to use, as well as basic PC configuration. I don't think this would be too intrusive because it won't fingerprint users and at any rate, as a fully open source project, there would be a public dashboard to view these stats.
- Get threading working in netplay mode. I need to modify Musashi to use a context object to store its state rather than a global context as it does now.
- Integrated GUI -- imgui is probably the leading candidate.
- Rewrite ROM loading one more time to better conform to the way MAME ROM sets are actually distributed. I do still want it to identify ROMs by CRC (I find this feature super handy myself because I don't like upgrading my ROMs).
- Metal renderer for iOS support.
- Real3D Pro-1000 support.
- Scripting support. I think a lot of enhancements could be controlled via scripts. Lua would be easy but I recently worked on a project that used Lua and absolutely hated it. Might need to suck it up and use it but I'm going to be on the lookout for something that's similarly lightweight but with a better language.
- Enhancements: model dumping would be fun and is something I might actually do sooner rather than later. Texture and model packs are frequently requested but this would be too much of a burden on the rendering engine in terms of maintainability. I do wonder if at some point, when the engine stabilizes, it could be supported by defining special case code paths for texture mapping (supporting ARGB8 and RGB8 textures). *Replacing* textures and even VROM models would be easy and could be done outside of the engine by intercepting texture transfers in Real3D.cpp or via a scripting interface, but they'd have to conform to the Real3D format and I'm not sure how willing anyone would be to develop a workflow and custom exporters to support this.
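To illustrate the Musashi item in the list above: the idea is to move CPU state out of a single global into a struct that every entry point receives explicitly, so two instances never share state and can be driven from separate threads. `CpuContext` and `Execute` are hypothetical stand-ins, not Musashi's actual API.

```cpp
#include <cstdint>

// Hypothetical CPU state that would otherwise live in a single global.
struct CpuContext
{
    uint32_t pc = 0;
    uint32_t d[8] = {};       // data registers
    uint64_t cyclesRun = 0;
};

// Every entry point takes the context explicitly, so nothing is shared
// between instances and each one can be driven from its own thread.
int Execute(CpuContext& ctx, int cycles)
{
    while (cycles > 0) {
        ctx.d[0] += 1;        // stand-in for fetching and executing one instruction
        ctx.pc += 2;
        ctx.cyclesRun += 4;   // pretend every instruction takes 4 cycles
        cycles -= 4;
    }
    return cycles;            // <= 0: how far the last instruction overshot
}
```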

Post by **Chine** » Sat Aug 10, 2024 8:25 pm

Any plans for Model 1 & 2?

Thanks for your great work !

Post by **gm_matthew** » Sat Aug 10, 2024 10:11 pm

I ought to work on getting the simulated netboard to run in its own thread at some point. The multithreading code could probably do with a rewrite; it currently uses the CThread class, which is just a wrapper around the SDL thread functions, whereas C++11 onwards has built-in multithreading support.
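As a sketch of what the built-in C++11 primitives could look like here, this is a per-frame worker that runs one job (for example, one frame of the simulated netboard) on its own `std::thread`, synchronised with a mutex and condition variable. `FrameWorker` and its methods are hypothetical; how the netboard would actually be stepped is not shown.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical per-frame worker using only C++11 threading primitives.
// The main thread calls RunFrameAsync(), does its own work, then WaitForFrame().
class FrameWorker
{
public:
    explicit FrameWorker(std::function<void()> job)
        : m_job(std::move(job)), m_thread([this] { Loop(); }) {}

    ~FrameWorker()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_cv.notify_all();
        m_thread.join();
    }

    void RunFrameAsync()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_pending = true;
        }
        m_cv.notify_all();
    }

    void WaitForFrame()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_pending; });
    }

private:
    void Loop()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;) {
            m_cv.wait(lock, [this] { return m_pending || m_quit; });
            if (m_quit) {
                return;
            }
            lock.unlock();
            m_job();                     // run one frame outside the lock
            lock.lock();
            m_pending = false;
            m_cv.notify_all();
        }
    }

    std::function<void()>   m_job;
    std::mutex              m_mutex;
    std::condition_variable m_cv;
    bool                    m_pending = false;
    bool                    m_quit    = false;
    std::thread             m_thread;    // declared last so it starts after the members above
};
```

Declaring `m_thread` last matters: members are constructed in declaration order, so the worker thread only starts once the mutex, condition variable and flags exist.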

If a Metal renderer is implemented, it could also be used for macOS, which would mean the existing OpenGL renderer would no longer be restricted to version 4.1 (the last version supported by Apple).

Post by **Ian** » Wed Aug 28, 2024 10:52 pm

Okay, so I've ported the shader code back to software. The basic shader code is not as efficient as Bart's original code. However, we only iterate over the 'active' layer according to the mask, so if the mask is the same for the whole layer we never even have to look at the alt layer, which gives us quite a speed-up. In benchmarks it's getting on for 2x as fast. I used Bart's idea of caching the palettes, which helps performance a lot. I also tried to keep the code simple; the original tilegen code was hard to read, and I actually had to look back at the earliest versions of Supermodel to really get my head around how it worked. There is no attempt to be clever, such as writing each pixel only once; the buffers are simply cleared every frame. There isn't even code to check for pixels out of bounds, which might happen if half a tile overlaps the right-hand side. I just made the buffers wider (512 pixels).
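One reading of "caching the palettes" is to decode each palette entry into a ready-to-use 32-bit colour when the game writes it, so the per-pixel loop becomes a plain table lookup. The sketch below assumes an entry format of bit 15 = transparent with 5-bit blue, green and red in bits 14-10, 9-5 and 4-0; check that against TileGen.cpp before relying on it. `PaletteCache` and the 8192-entry size are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical palette cache: each entry is decoded once, when the game writes it.
// Assumed entry format: bit 15 = transparent, bits 14-10 = blue,
// bits 9-5 = green, bits 4-0 = red.
class PaletteCache
{
public:
    static constexpr std::size_t kEntries = 8192;   // placeholder size

    void Write(std::size_t index, uint32_t data)
    {
        assert(index < kEntries);
        if (data & 0x8000) {
            m_rgba[index] = 0;                       // transparent pixel
            return;
        }
        uint32_t r = (data << 3) & 0xF8;             // move each 5-bit channel to the top of a byte
        uint32_t g = (data >> 2) & 0xF8;
        uint32_t b = (data >> 7) & 0xF8;
        m_rgba[index] = 0xFF000000u | (b << 16) | (g << 8) | r;   // packed as 0xAABBGGRR
    }

    // Hot path: one table lookup per pixel.
    uint32_t Lookup(std::size_t index) const { return m_rgba[index]; }

private:
    std::array<uint32_t, kEntries> m_rgba{};
};
```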

There are a few minor differences. I only check the mask once per tile, instead of at every pixel. This should hopefully replicate the following quirk:

Display quirks

The same hardware that is used to render the background layer is used for
the window layer. This causes a problem when the lower 3 bits of the
horizontal scroll are nonzero between tiles that are adjacent to each other
and come from different layers as specified by the mask bit.

The 8-pixel shift registers are only loaded once every 8 pixels. So in this
case the shift register will still contain data from the left tile, which
is shifted out into the right tile.

If the layer to the left has a scroll value of zero, then the shift
register is empty. Pixels 1-7 of the layer to the right are filled in
with the background color of the left tile instead.
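A simplified sketch of choosing the layer once per tile, rather than per pixel, together with the padded 512-pixel line buffer. It assumes a 496-pixel visible width and that the mask bits index tile columns after the fine scroll is applied; it does not reproduce the shift-register background-colour detail quoted above. `ComposeLine` and the constants are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr int kVisibleWidth = 496;   // assumed visible width of the display
constexpr int kBufferWidth  = 512;   // padded line buffer, as in the post
constexpr int kTileWidth    = 8;

using Line = std::array<uint32_t, kBufferWidth>;

// maskBits:    one bit per 8-pixel tile column, set = take the alternate layer
// primary/alt: this scanline's pixels for the two layers (screen coordinates)
// fineScroll:  low 3 bits of the horizontal scroll (0..7)
void ComposeLine(uint64_t maskBits, const Line& primary, const Line& alt,
                 int fineScroll, Line& out)
{
    // Up to 63 tile columns (496 / 8 plus one partial), so the last tile can end at
    // x = 503: still inside the 512-pixel buffer, so no right-edge check is needed.
    for (int tile = 0; tile * kTileWidth < kVisibleWidth + kTileWidth; tile++) {
        const Line& src = ((maskBits >> tile) & 1) ? alt : primary;   // one mask test per tile
        int x0 = tile * kTileWidth - fineScroll;
        for (int i = 0; i < kTileWidth; i++) {
            int x = x0 + i;
            if (x >= 0) {        // only the first tile can start left of the screen
                out[x] = src[x];
            }
        }
    }
}
```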

Lastly, I made the code a drop-in replacement for the original. Currently it is all driven from here:

```cpp
UINT32 CTileGen::SyncSnapshots(void)
{
	// clear surfaces
	for (auto& s : m_drawSurface) {
		s->Clear();
	}

	// draw buffers (this should be called elsewhere later)
	for (int i = 0; i < 384; i++) {
		DrawLine(i);
	}

	// swap buffers
	for (int i = 0; i < 2; i++) {
		std::swap(m_drawSurface[i], m_drawSurfaceRO[i]);
	}

	Render2D->AttachDrawBuffers(m_drawSurfaceRO[0], m_drawSurfaceRO[1]);

	return UINT32(0);
}
```

That will need to be moved around a bit in the future. Clearing the surfaces should move to the start of v-blank, line drawing should be called from the main thread at the right time, and at the end of the frame OpenGL should be updated with the finished display surfaces. The surfaces are double buffered, but that might not be needed in the future.

But anyway, all the pieces of the puzzle are now there to correctly emulate how both of these chips drew simultaneously. I just need to finish the puzzle.
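The restructuring Ian describes (clear at v-blank start, draw each line when the beam reaches it, hand the finished surface over at the end of the frame) might look something like this. `TileGenFrame` and its hook names are hypothetical; the per-line drawing is a callback, and returning the finished surface stands in for attaching it to the 2D renderer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical frame lifecycle for the software tilegen.
class TileGenFrame
{
public:
    static constexpr int kWidth  = 512;
    static constexpr int kHeight = 384;

    explicit TileGenFrame(std::function<void(int, uint32_t*)> drawLine)
        : m_drawLine(std::move(drawLine)),
          m_draw(std::size_t(kWidth) * kHeight),
          m_ready(std::size_t(kWidth) * kHeight) {}

    // Start of v-blank: clear the surface the next frame will be drawn into.
    void OnVBlankStart() { std::fill(m_draw.begin(), m_draw.end(), 0u); }

    // Called from the main emulation loop when the beam reaches line y.
    void OnLine(int y)
    {
        if (y >= 0 && y < kHeight) {
            m_drawLine(y, &m_draw[std::size_t(y) * kWidth]);
        }
    }

    // End of frame: swap and return the finished surface for upload
    // (stand-in for handing it to the OpenGL 2D renderer).
    const std::vector<uint32_t>& OnFrameEnd()
    {
        std::swap(m_draw, m_ready);
        return m_ready;
    }

private:
    std::function<void(int, uint32_t*)> m_drawLine;
    std::vector<uint32_t> m_draw;
    std::vector<uint32_t> m_ready;
};
```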

Post by **Ian** » Mon Apr 07, 2025 11:23 pm

gm_matthew wrote (Mon Aug 05, 2024 4:34 pm):

Ping-pong memory is double-buffered, the rest of culling RAM is not. Writing to 0x8Cxxxxxx accesses culling RAM directly, while writing to 0x8Exxxxxx accesses the ping-pong buffer within culling RAM that is not being actively used to generate the next frame. Writing to 0x90xxxxxx, 0x94xxxxxx or 0x98xxxxxx only writes to texture RAM or polygon RAM if rendering has finished, otherwise it writes to the update buffer first. There are two update buffers, also stored within culling RAM, which are swapped at the same time as the ping-pong buffers to allow the video board to update polygon RAM and/or texture RAM for the current frame while also receiving updates for the next frame.

How do you think the double buffering would actually work?
The memory writes to the Real3D are all 32-bit writes, so the incoming data has to be stored somewhere, in some kind of data structure that can be unpacked later. If I had to guess, maybe something like this: basically an array of MemBlocks, perhaps with a write pointer to the current block.

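```cpp
#include <cstdint>

struct MemBlock
{
    uint32_t startAddress;
    uint32_t lastAddress;
    uint32_t data[0]; // zero-sized arrays are a GNU extension; C99's standard form is a flexible array member, data[]
};

MemBlock* currentMemBlock = nullptr;

void Write32GPU(uint32_t addr, uint32_t data)
{
    if (currentMemBlock && currentMemBlock->lastAddress + 4 == addr) {
        // data is contiguous, so append it to the current block (index in words, not bytes)
        currentMemBlock->data[(addr - currentMemBlock->startAddress) / 4] = data;
        currentMemBlock->lastAddress = addr;
    }
    else {
        // start a new mem block at addr
    }
}
```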

Post by **gm_matthew** » Tue Apr 08, 2025 2:37 pm

The Real3D firmware suggests that the total amount of data written to the update buffer in "burst mode" is the size of the block plus two more 32-bit words; presumably these two extra words are the header containing the destination address and block size. (If "burst mode" is not active, the amount of data written is three times the block size; every word of data has its own header).
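The sizes above reduce to simple arithmetic. This helper assumes, as suggested, that the two extra words in burst mode are a header (destination and size) and that non-burst mode stores a header plus one data word per word; `UpdateBufferWords` is a hypothetical name. The example assumes 16-bit texels, so a 256x256 texture update is 32768 words.

```cpp
#include <cstdint>

// Words a block of blockWords data words occupies in the update buffer,
// per the firmware behaviour described above.
constexpr uint32_t UpdateBufferWords(uint32_t blockWords, bool burstMode)
{
    return burstMode ? blockWords + 2      // one 2-word header for the whole block
                     : blockWords * 3;     // 2-word header + 1 data word, per word
}

// 256x256 texels at 16 bits each = 131072 bytes = 32768 words.
static_assert(UpdateBufferWords(32768, true)  == 32770, "burst: 2-word overhead");
static_assert(UpdateBufferWords(32768, false) == 98304, "non-burst: 3x the data");
```

Burst mode is clearly the only practical option for large transfers: without it the update buffer would need three times the space.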

The way I'd be inclined to do it would be to define an array of update headers:

```cpp
#include <cstdint>

struct UpdateBlock
{
    uint32_t dest;
    uint32_t size;
    uint32_t *pData;
};
```

where pData points to the starting location of the block in the update buffer. I don't like the idea of using a zero-sized array for the data block; this is an extension in GNU C but I'm not sure it would necessarily work in Visual C++.

We don't even necessarily need to double-buffer the update RAM; the real hardware does it so the game can write to one update buffer while the video board copies data from the other. Updating a 256x256 region of the texture sheet, for example, requires copying 128 kB, which takes a bit of time on real hardware, but on a modern system it takes pretty much no time at all.
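With a single update buffer, applying the queued blocks could be as simple as one `memcpy` per `UpdateBlock` at the point where the buffers would have flipped. This sketch assumes `dest` is a word offset into a flat target RAM (the real destination addresses would need decoding), and `ApplyUpdates` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct UpdateBlock
{
    uint32_t  dest;    // assumption: word offset into the target RAM
    uint32_t  size;    // number of 32-bit words
    uint32_t* pData;   // start of the block's data in the update buffer
};

// Apply all queued updates at once and return the number of bytes copied.
// Blocks that would run past the end of the target RAM are skipped.
std::size_t ApplyUpdates(const std::vector<UpdateBlock>& blocks, std::vector<uint32_t>& ram)
{
    std::size_t copied = 0;
    for (const UpdateBlock& b : blocks) {
        if (std::size_t(b.dest) + b.size > ram.size()) {
            continue;                                   // ignore malformed blocks
        }
        std::memcpy(ram.data() + b.dest, b.pData, std::size_t(b.size) * sizeof(uint32_t));
        copied += std::size_t(b.size) * sizeof(uint32_t);
    }
    return copied;
}

// The 256x256 example: 16-bit texels give 256 * 256 * 2 bytes = 128 KiB.
static_assert(256 * 256 * 2 == 128 * 1024, "256x256 16-bit region is 128 KiB");
```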

Supermodel Forum
