Supermodel Forum

by **Ian** » Wed Sep 06, 2023 6:17 am

Why?
Because drawing is kind of slow on the CPU, and a GPU is a massive SIMD monster that can probably eat more data than you can throw at it. Originally I was thinking about writing a compute shader to do this, because with compute shaders you can write to arbitrary memory locations, but in the end I figured we could do it in a fragment shader.

With older OpenGL (2.1) you could really only do floating-point math in the shader. There was very little integer math other than maybe a loop counter, which might get unrolled anyway.
With GL 3+ you can do full integer maths (%, ^, |, & etc.) with both signed and unsigned data types.

How do we map the memory to the GPU?
In OpenGL we either get uniform memory or texture memory. The VRAM is way too big to pass to the shader as uniform memory, so it must be mapped as a texture. How do we map VRAM as a texture? The bottom part of the memory is exactly 1 MB, which we can map to a 512x512 texture at 4 bytes per pixel (512 x 512 x 4 = 1,048,576 bytes = 1 MB). To index the VRAM texture, we have to generate texture coordinates from the memory locations. I use this simple function:

```glsl
ivec2 GetVRamCoords(int offset)
{
    return ivec2(offset % 512, offset / 512);
}
```

That essentially maps a 1D index onto a 2D array of fixed width. The palette data works the same way, but with a different texture. To read the texture I use the texelFetch function, which takes integer coordinates and skips the bilinear/trilinear filtering and any other filtering modes normally applied to texture reads.
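For a CPU-side sanity check of the mapping, here is a small C++ sketch that mirrors `GetVRamCoords`. It assumes (as the shader's `texelFetch(vram, ...).r` on a `usampler2D` implies) that VRAM is uploaded as one 32-bit word per texel, e.g. a single-channel unsigned-integer format such as `GL_R32UI`; that upload format, `VRamTexel`, `WordIndexToTexel` and `ByteAddressToTexel` are my assumptions for illustration, not code from the post.

```cpp
#include <cstdint>

// Hypothetical CPU-side mirror of the shader's GetVRamCoords().
// VRAM is viewed as 512x512 texels holding one 32-bit word each
// (512 * 512 * 4 bytes = 1 MB).
constexpr int kVRamWidth = 512;

struct VRamTexel {
    int x;
    int y;
};

// Word index -> texel coordinates: same maths as GetVRamCoords().
constexpr VRamTexel WordIndexToTexel(int wordIndex) {
    return VRamTexel{wordIndex % kVRamWidth, wordIndex / kVRamWidth};
}

// Byte address in the 1 MB window -> texel. The shader does this divide
// by 4 at each call site, e.g. (0xF7000 / 4) + yCoord.
constexpr VRamTexel ByteAddressToTexel(std::uint32_t byteAddress) {
    return WordIndexToTexel(static_cast<int>(byteAddress / 4u));
}
```

For example, the line mask table at byte address 0xF7000 lands on texel (0, 494), and the last word at 0xFFFFC lands on (511, 511), so the whole 1 MB fits exactly.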

The shader inputs can map to an array of integer uniforms. Uniforms are great because every pixel in the draw call sees the same values, so any conditional logic that depends only on them evaluates the same way for every pixel and can be optimised away.
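To make the `regs[32]` layout concrete, here is a C++ sketch that decodes one layer's scroll/control word with the same bit fields the fragment shader in this post uses: enable in bit 31, vertical scroll in bits 16-24, line scroll mode in bit 15, horizontal scroll in bits 0-9, one word per layer starting at byte offset 0x60, plus the 4-bit flag in bit (12 + layer) of the word at 0x20. The bit positions are copied from the shader rather than checked against hardware docs, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the shader's register accessors.
// regs[] is assumed to hold the tilegen's 0x80 bytes of registers as
// 32 words, so byte offset N lives in regs[N / 4].
constexpr int ScrollRegIndex(int layer) { return 0x60 / 4 + layer; }  // regs[24..27]
constexpr int ControlRegIndex() { return 0x20 / 4; }                   // regs[8]

struct LayerScroll {
    bool enabled;     // bit 31      (LayerEnabled)
    bool lineScroll;  // bit 15      (LineScrollMode)
    int  hScroll;     // bits 0-9    (GetHorizontalScroll)
    int  vScroll;     // bits 16-24  (GetVerticalScroll)
};

constexpr LayerScroll DecodeScrollWord(std::uint32_t r) {
    return LayerScroll{(r & 0x80000000u) != 0u,
                       (r & 0x8000u) != 0u,
                       static_cast<int>(r & 0x3FFu),
                       static_cast<int>((r >> 16) & 0x1FFu)};
}

// LayerIs4Bit: bit (12 + layer) of the control word at byte offset 0x20.
constexpr bool LayerIs4Bit(std::uint32_t controlWord, int layer) {
    return (controlWord & (1u << (12 + layer))) != 0u;
}
```

On the host side the whole array could then be pushed once per draw with `glUniform1uiv(location, 32, regs)`, using a location from `glGetUniformLocation(program, "regs")`.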

Here is what it looks like. This is drawing layer 2 directly to the frame buffer in a 496x384 window as a test.

This is the shader code

```cpp
// Vertex shader
static const char s_vertexShaderSourceNew[] = R"glsl(

#version 410 core

void main(void)
{
    const vec4 vertices[] = vec4[](vec4(-1.0, -1.0, 0.0, 1.0),
                                   vec4(-1.0,  1.0, 0.0, 1.0),
                                   vec4( 1.0, -1.0, 0.0, 1.0),
                                   vec4( 1.0,  1.0, 0.0, 1.0));

    gl_Position = vertices[gl_VertexID % 4];
}

)glsl";

// Fragment shader
static const char s_fragmentShaderSourceNew[] = R"glsl(

#version 410 core

//layout(origin_upper_left) in vec4 gl_FragCoord;

// inputs
uniform usampler2D vram;        // texture 512x512
uniform usampler2D palette;     // texture 128x256 - actual dimensions don't matter too much but we have to stay in the limits of max tex width/height, so can't have 1 giant 1d array
uniform uint regs[32];
uniform int layerNumber;

// outputs
out vec4 fragColor;

ivec2 GetVRamCoords(int offset)
{
    return ivec2(offset % 512, offset / 512);
}

ivec2 GetPaletteCoords(int offset)
{
    return ivec2(offset % 128, offset / 128);
}

uint GetLineMask(int yCoord)
{
    uint shift = (layerNumber<2) ? 16u : 0u;    // need to check this, we could be endian swapped so could be wrong
    uint maskPolarity = ((layerNumber & 1) > 0) ? 0xFFFFu : 0x0000u;
    int index = (0xF7000 / 4) + yCoord;

    ivec2 coords = GetVRamCoords(index);
    uint mask = ((texelFetch(vram,coords,0).r >> shift) & 0xFFFFu) ^ maskPolarity;

    return mask;
}

bool GetPixelMask(int xCoord, int yCoord)
{
    uint lineMask = GetLineMask(yCoord);
    uint maskTest = 1u << (15-(xCoord/32));

    return (lineMask & maskTest) != 0u;
}

int GetLineScrollValue(int layerNum, int yCoord)
{
    int index = ((0xF6000 + layerNum * 0x400) / 4) + (yCoord / 2);
    int shift = (yCoord % 2) * 16;              // double check this

    ivec2 coords = GetVRamCoords(index);
    return int((texelFetch(vram,coords,0).r >> shift) & 0xFFFFu);
}

int GetTileNumber(int xCoord, int yCoord, int xScroll, int yScroll)
{
    int xIndex = ((xCoord + xScroll) / 8) & 0x3F;
    int yIndex = ((yCoord + yScroll) / 8) & 0x3F;

    return (yIndex*64) + xIndex;
}

int GetTileData(int layerNum, int tileNumber)
{
    int addressBase = (0xF8000 + (layerNum * 0x2000)) / 4;
    int offset = tileNumber / 2;                // two tiles per 32bit word
    int shift = (1 - (tileNumber % 2)) * 16;    // triple check this

    ivec2 coords = GetVRamCoords(addressBase+offset);
    uint data = (texelFetch(vram,coords,0).r >> shift) & 0xFFFFu;

    return int(data);
}

int GetVFine(int yCoord, int yScroll)
{
    return (yCoord + yScroll) & 7;
}

int GetHFine(int xCoord, int xScroll)
{
    return (xCoord + xScroll) & 7;
}

// register data
bool LineScrollMode(int layerNum)       { return (regs[0x60/4 + layerNum] & 0x8000u) != 0u; }
int  GetHorizontalScroll(int layerNum)  { return int(regs[0x60/4 + layerNum] & 0x3FFu); }
int  GetVerticalScroll(int layerNum)    { return int((regs[0x60/4 + layerNum] >> 16) & 0x1FFu); }
int  LayerPriority()                    { return int((regs[0x20/4] >> 8) & 0xFu); }
bool LayerIs4Bit(int layerNum)          { return (regs[0x20/4] & (1u << (12 + layerNum))) != 0u; }
bool LayerEnabled(int layerNum)         { return (regs[0x60/4 + layerNum] & 0x80000000u) != 0u; }
bool LayerSelected(int layerNum)        { return (LayerPriority() & (1 << layerNum)) == 0; }

float Int8ToFloat(uint c)
{
    if((c & 0x80u) > 0u) {
        // Top bit means negative number: set bits 8-31 to sign-extend to 32 bits
        return float(int(c | 0xFFFFFF00u)) / 128.0;
    }
    else {
        return float(c) / 127.0;
    }
}

vec4 GetColourOffset(int layerNum, vec4 colour)
{
    uint offsetReg = regs[(0x40/4) + layerNum/2];

    vec4 c;
    c.b = Int8ToFloat((offsetReg >> 16) & 0xFFu);
    c.g = Int8ToFloat((offsetReg >>  8) & 0xFFu);
    c.r = Int8ToFloat((offsetReg >>  0) & 0xFFu);
    c.a = 0.0;

    colour += c;
    return clamp(colour,0.0,1.0);   // clamp is probably not needed
}

vec4 Int16ColourToVec3(uint colour)
{
    uint alpha = (colour>>15);      // top bit is alpha. 1 means clear, 0 opaque
    alpha = ~alpha;                 // invert
    alpha = alpha & 0x1u;           // mask bit

    vec4 c;
    c.r = float((colour >> 0 ) & 0x1Fu) / 31.0;
    c.g = float((colour >> 5 ) & 0x1Fu) / 31.0;
    c.b = float((colour >> 10) & 0x1Fu) / 31.0;
    c.a = float(alpha) / 1.0;

    c.rgb *= c.a;   // multiply by alpha value, this will push transparent to black, no branch needed

    return c;
}

vec4 GetColour(int paletteOffset)
{
    ivec2 coords = GetPaletteCoords(paletteOffset);
    uint colour = texelFetch(palette,coords,0).r;
    return Int16ColourToVec3(colour);   // each colour is only 16bits, but occupies 32bits
}

vec4 Draw4Bit(int tileData, int hFine, int vFine)
{
    // Tile pattern offset: each tile occupies 32 bytes when using 4-bit pixels (offset of tile pattern within VRAM)
    int patternOffset = ((tileData & 0x3FFF) << 1) | ((tileData >> 15) & 1);
    patternOffset *= 32;
    patternOffset /= 4;

    // Upper color bits; the lower 4 bits come from the tile pattern
    int paletteIndex = tileData & 0x7FF0;

    ivec2 coords = GetVRamCoords(patternOffset+vFine);
    uint pattern = texelFetch(vram,coords,0).r;
    pattern = (pattern >> ((7-hFine)*4)) & 0xFu;    // get the pattern for our horizontal value

    return GetColour(paletteIndex | int(pattern));
}

vec4 Draw8Bit(int tileData, int hFine, int vFine)
{
    // Tile pattern offset: each tile occupies 64 bytes when using 8-bit pixels
    int patternOffset = tileData & 0x3FFF;
    patternOffset *= 64;
    patternOffset /= 4;

    // Upper color bits
    int paletteIndex = tileData & 0x7F00;

    // each read is 4 pixels
    int offset = hFine / 4;

    ivec2 coords = GetVRamCoords(patternOffset+(vFine*2)+offset);  // 8-bit pixels, each line is two words
    uint pattern = texelFetch(vram,coords,0).r;
    pattern = (pattern >> ((3-(hFine%4))*8)) & 0xFFu;  // shift out the bits we want for this pixel

    return GetColour(paletteIndex | int(pattern));
}

void main()
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    int tileNumber = GetTileNumber(pos.x,pos.y,0,0);
    int hFine = GetHFine(pos.x,0);
    int vFine = GetVFine(pos.y,0);
    int tileData = GetTileData(layerNumber,tileNumber);

    if(LayerIs4Bit(layerNumber)) {
        fragColor = Draw4Bit(tileData,hFine,vFine);
    }
    else {
        fragColor = Draw8Bit(tileData,hFine,vFine);
    }
}

)glsl";
```
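Since fragment shaders are much harder to debug than CPU code, one option is a small CPU reference for the per-pixel address maths that can be compared against what the shader draws. The C++ sketch below splits `GetTileNumber`/`GetTileData` and `Draw4Bit`/`Draw8Bit` into "which VRAM word?" and "which bits of that word?" and returns a palette index instead of a colour. It copies the shader's own assumptions (including the ones marked "triple check this"), so it only checks consistency, not hardware accuracy; all function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical CPU-side mirrors of the shader's address maths.
// Word indices are into VRAM viewed as 32-bit words.

// GetTileNumber with zero scroll: 64x64 tiles of 8x8 pixels.
constexpr int TileNumber(int x, int y) {
    return ((y / 8) & 0x3F) * 64 + ((x / 8) & 0x3F);
}

// GetTileData: the word holding the name-table entry, and the shift to apply.
constexpr int NameTableWord(int layer, int tileNumber) {
    return (0xF8000 + layer * 0x2000) / 4 + tileNumber / 2;
}
constexpr int NameTableShift(int tileNumber) {
    return (1 - (tileNumber % 2)) * 16;  // even tiles in the upper 16 bits
}

// Draw4Bit: 32 bytes (8 words) per tile, one word per row, 8 pixels per word.
constexpr int PatternWord4Bit(int tileData, int vFine) {
    return ((((tileData & 0x3FFF) << 1) | ((tileData >> 15) & 1)) * 32) / 4 + vFine;
}
constexpr int PaletteIndex4Bit(int tileData, std::uint32_t word, int hFine) {
    return (tileData & 0x7FF0) | static_cast<int>((word >> ((7 - hFine) * 4)) & 0xFu);
}

// Draw8Bit: 64 bytes (16 words) per tile, two words per row, 4 pixels per word.
constexpr int PatternWord8Bit(int tileData, int vFine, int hFine) {
    return ((tileData & 0x3FFF) * 64) / 4 + vFine * 2 + hFine / 4;
}
constexpr int PaletteIndex8Bit(int tileData, std::uint32_t word, int hFine) {
    return (tileData & 0x7F00) | static_cast<int>((word >> ((3 - (hFine % 4)) * 8)) & 0xFFu);
}
```

For example, a 4-bit tile entry of 0x0012 reads its row 3 from word 291, and if that word is 0x89ABCDEF, pixel 0 gives palette index 0x18 and pixel 7 gives 0x1F.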

What's left? Well, most of the code is pretty much there. I just need to plug in the scroll values and masking. Maybe I can generate one layer of A and A' in a single draw pass?
The colour offsets can also be applied directly in the shader, so there's no need to pre-calculate them. This will simplify the logic in the tilegen class a lot.
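Here is a C++ sketch of the per-channel maths the shader's `Int8ToFloat`/`GetColourOffset` do: each 8-bit field of the offset register is treated as two's complement, scaled to roughly [-1, 1], added to the channel and clamped. It follows the shader's choice of dividing negatives by 128 and positives by 127 and its channel layout (red in bits 0-7, green 8-15, blue 16-23); whether the real hardware scales exactly this way is an assumption. Layers 0/1 share the register at 0x40 and layers 2/3 the one at 0x44, since the shader indexes `regs[(0x40/4) + layerNum/2]`. The function names are hypothetical.

```cpp
#include <cstdint>

// Mirror of the shader's Int8ToFloat. For 0x80..0xFF, (c - 256) is the same
// value the shader gets from int(c | 0xFFFFFF00u) (sign extension).
constexpr float Int8ToFloat(std::uint32_t c) {
    return (c & 0x80u)
        ? static_cast<float>(static_cast<int>(c & 0xFFu) - 256) / 128.0f
        : static_cast<float>(c & 0xFFu) / 127.0f;
}

constexpr float Clamp01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// One channel of GetColourOffset: shift is 0 (red), 8 (green) or 16 (blue).
constexpr float ApplyOffset(float channel, std::uint32_t offsetReg, int shift) {
    return Clamp01(channel + Int8ToFloat((offsetReg >> shift) & 0xFFu));
}
```

An offset byte of 0x80 pulls a channel fully down to 0, 0x7F pushes it fully up to 1, and 0x00 leaves it untouched.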

Upsides/Downsides?

The downside is that fragment shaders are much harder to debug than CPU-side code. Emulating possible quirks of the tilegen, such as the fact that the shift register (?) is only reloaded every 8 pixels, might be harder in a fragment shader and might require some ugly logic, as every pixel in the fragment shader is basically independent. Also, if we emulated drawing one line at a time we would start to lose the speed benefits of doing it on the GPU, as each draw call has a cost.

Upside?
It's so fast it might as well be free, at least on a non-ghetto GPU from this decade :p

by **Bart** » Wed Sep 06, 2023 12:29 pm

This is really cool! I'm hoping it works with both the legacy and new engines and can perhaps be left configurable. It's always a bit dicey having two implementations of something lying around because, like the old engine, they will naturally start to diverge, but the tilegen is mostly solved except for the window mask thing. For those ghetto GPUs, having the CPU version around as a fallback might be useful.

by **Ian** » Wed Sep 06, 2023 2:06 pm

Yes, it'll work with the legacy engine.

Currently it's just a proof of concept. I could also probably lower the required GLSL version to maybe 330. I'd have to check.

But with any luck most people would get a speed-up, even on lower-end systems.

by **Ian** » Fri Sep 08, 2023 3:10 pm

Not at my PC currently, but I saw MAME doing this with the line scroll values:

```cpp
int rowscroll = BYTE_REVERSE16(rowscroll_ram[((layer * 0x200) + y) ^ NATIVE_ENDIAN_VALUE_LE_BE(3,0)]) & 0x7fff;
if (rowscroll & 0x100)
    rowscroll |= ~0x1ff;
```

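Basically, if bit 8 (0x100) of the line scroll value is set, i.e. the 9-bit value is 256 or more, it sets all the bits above bit 8, which makes the value negative. So it's treating the low 9 bits as a signed value. I wonder if that makes any difference to the maths.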
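Whether signed vs unsigned matters comes down to the wraparound. Assuming the layer is 512 pixels wide (64 tiles of 8 pixels, as the `& 0x3F` in `GetTileNumber` implies), a sign-extended scroll `s - 512` and the raw 9-bit value `s` land on the same column once the position is masked with 0x1FF, because masking a two's-complement int with 0x1FF is "mod 512". One caveat: C++ integer `/` rounds toward zero, so if the shader divides `(x + scroll) / 8` before masking and the sum goes negative, the tile index comes out wrong; masking first avoids that. `SignExtend9`, `PlaneColumn` and `TileColumn` are hypothetical names for this sketch.

```cpp
#include <cstdint>

// MAME-style sign extension of a 9-bit line-scroll value:
// if bit 8 is set, set every bit above it so the int becomes negative.
constexpr int SignExtend9(int v) {
    return (v & 0x100) ? (v | ~0x1FF) : v;
}

// Horizontal position within a 512-pixel-wide layer. Masking with 0x1FF
// wraps negative sums too, so scroll s and s - 512 give the same column.
constexpr int PlaneColumn(int x, int scroll) {
    return (x + scroll) & 0x1FF;
}

// Tile column: mask first, then divide, so negative scrolls still wrap.
constexpr int TileColumn(int x, int scroll) {
    return PlaneColumn(x, scroll) / 8;
}
```

For example, a raw value of 0x1F8 sign-extends to -8, and both forms put screen pixel 3 in layer column 507 (tile 63), whereas the unmasked `(3 - 8) / 8` truncates to tile 0.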

by **Bart** » Sat Sep 09, 2023 7:38 pm

I think it would, but I don't know why they do this. Is this code also present in the System 24 and Model 2 tile generator logic? Model 2 and 3 use the System 24 tile generator, which was a 2D system that would have made heavy use of scrolling.

by **Ian** » Sat Sep 09, 2023 8:29 pm

MAME is doing this for System 24:

```cpp
uint16_t *hscrtb = tile_ram.get() + 0x4000 + 0x200*layer;

hscr = (-hscrtb[y]) & 0x1ff;
```

It makes them negative. I'm not sure what that even does to an unsigned number. Maybe it makes them negative because scrolling happens in the left direction.
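On what negation does to an unsigned value: in C++ a `uint16_t` is promoted to `int` before the unary minus, so `-hscrtb[y]` is an ordinary negative int, and `& 0x1ff` keeps its low 9 bits, i.e. (512 - v) mod 512. Even with a 32-bit unsigned value the low 9 bits come out the same, because 2^32 is a multiple of 512. So in a 512-pixel-wide layer, a stored value of 8 becomes 504, which is the same as moving 8 pixels the other way once the position wraps. This only explains the arithmetic; which direction the hardware actually scrolls is MAME's convention, not something this shows. The function names are hypothetical.

```cpp
#include <cstdint>

// MAME's System 24 line: hscr = (-hscrtb[y]) & 0x1ff;
// raw is promoted to int before the minus, then the mask keeps 9 bits.
constexpr int NegateScroll9(std::uint16_t raw) {
    return (-raw) & 0x1FF;
}

// Same thing done in 32-bit unsigned arithmetic (wraps mod 2^32):
// the low 9 bits are identical because 2^32 is a multiple of 512.
constexpr std::uint32_t NegateScroll9Unsigned(std::uint32_t raw) {
    return (0u - raw) & 0x1FFu;
}
```

With a stored scroll of 8, `(100 + NegateScroll9(8)) & 0x1FF` gives column 92, i.e. 8 pixels back from 100.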

by **Ian** » Sat Sep 09, 2023 8:56 pm

Rolling start apparently does work correctly in MAME:

https://youtu.be/F1ttfo3KUYw

Go to 6:41. The game eventually hangs, but no scroll issues are present.

by **gm_matthew** » Sun Sep 10, 2023 3:43 am

> **Ian wrote:** Rolling start apparently does work correctly on mame
>
> https://youtu.be/F1ttfo3KUYw
>
> Go to 6mins 41 seconds. The game eventually hangs but no scroll issues present

The line scrolling code in MAME is broken: it scrolls the "ROLLING START" banner twice as much as it should, 16 pixels per frame when it should only be scrolling 8. This is not a timing issue. I've tested it myself in MAME 0.220, and if I use the debugger to edit the layer A' scrolling table and scroll certain lines left by an extra 8 pixels, they instead scroll an extra 16.

I'd test the scroll registers as well but unfortunately I can't get them to work properly in the debugger since MAME isn't set up to actually read them.

by **Ian** » Sun Sep 10, 2023 4:15 am

Hm, well that sucks.

I tested with signed maths as well and it didn't make any difference, so the error is definitely not there.

by **Bart** » Sun Sep 10, 2023 8:25 pm

I need to take a look at this again at some point. Matthew: I'll look around the area of your patch, but do you happen to have the code that updates the window layer (I call it 'stencil' in TileGen.cpp, but it's actually 'window' in Sega terminology, similar to the window layer on the Sega Genesis) disassembled for reference?

Tilegen inside a fragment shader