Fast affine texture mapping II (fatmap2.txt) -------------------------------------------- by Mats Byggmastar MRI 3D programmer in Doomsday mri@penti.sit.fi 17 April 1997, Jakobstad, Finland Feel free to upload this document anywhere you find appropriate, as long as you keep it in it's original, unmodified form. This is free information, you may not charge anything for it. Companies with selfrespect are encouraged to contact me, if any of this code is to be used as part of a commercial product. Table of contents ----------------- 1. About this document 2. Disclaimer 3. About the source code 4. Convex polygons 5. Drawing convex polygons 6. Constant texture gradients and polygons 7. 16:16 bit fixed point idiv and imul 8. Fixed point ceil() function 9. Subpixel accuracy 10. Subtexel accuracy 11. Avoiding divide overflow in slope calculations 12. Loop counters in inner loops 13. 8:16 bit inner loop 14. Arbitrary size bitmap inner loop 15. 8:15 bit tiled inner loop 16. Polygons per second 17. Greetings 1. About this document ----------------------- This document is the follow-up to fatmap.txt released 19 Jun. 1996. The aim of this second document is to make a more accurate affine texture mapper. I have also skipped triangles and concentrated on convex polygons to make clipping more efficient. The tiled bitmap inner loop scheme described in fatmap.txt has now also been realized in a 6 clock tick inner loop that is put into action here with good results. As in the previous document the inner loops in assembly are developed for the Intel Pentium processor. People looking for descriptions on perspective correct texture mappers should visit Game Developer Magazine at http://www.gdmag.com and backorder the issues with Chris Hecker's articles on this subject! I, the author, am a 25 year old computer and telecom engineer (B.Sc.) currently working as a teacher in a vocational school for students in the age of 16-19 years studying computers, electronic, automation and power electric. I do real-time 3D graphics programming mostly as a hobby and am an active member of the Finnish demogroup Doomsday. My dream is to one day work full-time with 3D graphics. Special thanks to Harriet Mattfolk for proofreading this document and helping me with the english syntax. 2. Disclaimer -------------- The author takes no responsibility, if something in this document or the accompanying source or executable causes any kind of data loss or damage to your hardware. 3. About the source code ------------------------- Except for the inner loops and some misc. functions the source is written in C++ very close to C. To make the most of the code I've made the inner loops as inline assembler. I personally have the whole mapping functions in fine-tuned assembly, but it would be pointless to include such a source in this type of document. The source code that should accompany this document is divided into 8 files: misc.h - Misc. declarations of structures and functions clip.cpp - Sutherland-Hodgman 2D clipping draw.cpp - Calculates deltas, clips and calls the mappers main.cpp - Simple test program flat.cpp - Flat filler gouraud.cpp - Gouraud filler txtmap.cpp - Texture mapper txtarb.cpp - Arbitrary size texture mapper txttil.cpp - Tiled texture mapper The last five files are the ones of interest. The other files were just included so I would be able to make the test program. Clip.cpp is my very own implementation of the Sutherland-Hodgman clipping algorithm and might be of some interest. You can find the theory behind the Sutherland-Hodgman algorithm in any graphics textbook. From time to time I see people looking for a fast way to fill polygons with a solid color. I have therefore included the flat filler. You might want to replace the call to memset() with some other inner loop. I recently also participated in a discussion on the newsgroup comp.graphics.algorithms concerning floating point vs. fixed point in a gouraud filler so I decided to include my version of a gouraud filler. The source should be compiled with Watcom C/C++ as the inline assembly is Watcom specific. It has only been tested with Watcom C/C++ version 10.0. The compiled version of the test program (main.exe) is included. You will need the DOS4GW protected mode stub to run it. 4. Convex polygons ------------------- First, let us define what a convex polygon is. A convex polygon must not have any of it's corners pointing inwards to the center of the polygon. In other words, the angle between two edges joining in a vertex must not be greater than 180 degree. This means that triangles are always convex polygons. The opposite of convex is concave and they can have corners pointing inwards. ______ ______________ ____ ____ /\ / \ \ / | \ / | / \ / \ \ / | \/ | / \ \ / / / | | /_________\ \ ______ / /__________ / |________________| convex convex concave concave The mappers presented in this document can only draw convex polygons with one exception. The third polygon above will be handled correctly even if it is concave. This is because it is concave in such a way that no scanlines need to be split while rendering it. The fourth polygon will not be handled properly because scanlines must be split. 5. Drawing convex polygons --------------------------- Drawing convex polygons is just as simple and fast as drawing triangles. The only difference is that the vertex scanning code needs to be in a loop for polygons. If we only render triangles we know that we always have 3 vertices which makes the code a bit more straightforward. First step in a polygon function is to search for the top and bottom vertex. So we scan the vertex array and search for min and max y. Let us take the following 4 vertex polygon as example: v1 / \ / \ / \ / \ v2 / / v0 \ / \ / \ / \ / \ / v3 Note that the vertices must be in an anti clockwise order. We find that v1 is the top vertex and v3 is the bottom vertex. We now know that when scanning the left side of the polygon, we should start at v1 and move forward through the array until we come to v3. In general when moving forward, we should wrap to the start of the array if we move past the array. I.e. if we come to the last vertex, our next vertex will be v0. v1 / / / / v2 / \ \ \ \ \ v3 For the right side we start at v1 and move backwards through the array until we come to v3. In general when moving backwards, we should wrap to the end of the array if we move outside the array. I.e. when we come to v0, our next vertex will be the last vertex in the array. v1 \ \ \ \ / v0 / / / / / v3 In our example we have 2 sections on the left side: v1 - v2 and v2 - v3 and 2 sections on the right side: v1 - v0 and v0 - v3 The second step in a polygon function is to calculate the x coordinate slope for the first section on the right side. If the section had zero height, try the next until a non-zero section is found. The third step is to calculate x coordinate slope and any other slope (e.g. texture u and v) for the first section on the left side. If the section had zero height, try the next until a non-zero section is found. If all sections on a side has zero height, the whole polygon has zero height and should not be plotted. Now we can start to interpolate the values along the left and the right side as we draw each scanline using an inner loop. v1 /-\ /------\ /-----------\ / \ v2 / v0 When we come to the end of a section, we search for the next non-zero height section and calculate the new slope. If all sections on one side are done, i.e. if we have come to the bottom vertex, the polygon is done. 6. Constant texture gradients and polygons ------------------------------------------- With triangles the texture gradients (texture u,v slopes over the triangle surface) are guaranteed to be constant. Therefore we can calculate the gradients (dudx, dvdx) once and use the same values for all scanlines of the triangle. If we start out with a triangle and clip it into a polygon, the texture gradients for the polygon are still constant over the whole surface. So if we calculate the gradients before the triangle gets clipped, we can still use the constant texture gradient method for polygons. Calculating the texture gradients for a triangle can be done in the following way: v0 /\ / \ / \ /_________\ v1 v2 double d = (v0.x - v2.x) * (v1.y - v2.y) - (v1.x - v2.x) * (v0.y - v2.y); if(d == 0.0) return; double id = 1.0/d * 65536.0; long dudx = ((v0.u - v2.u) * (v1.y - v2.y) - (v1.u - v2.u) * (v0.y - v2.y)) * id; long dvdx = ((v0.v - v2.v) * (v1.y - v2.y) - (v1.v - v2.v) * (v0.y - v2.y)) * id; The vertex data x,y,u,v is assumed to be in floating point and the resulting dudx,dvdx is in 16:16 bit fixed point. 7. 16:16 bit fixed point idiv and imul --------------------------------------- All mappers work with 16:16 bit fixed point internally. I will not explain how fixed point works here, you can look that up elsewhere. I'll just point out that we need to perform the fixed point divides and multiplies in assembler rather than in C. So from here on you can assume that code like: long dxdy = (width << 16) / height; long x = v1.x + ((prestep * dxdy) >> 16); need to be performed by some inline assembly functions: long dxdy = idiv16(width, height); long x = v1.x + imul16(prestep, dxdy); 8. Fixed point ceil() function ------------------------------- We need a ceil() function for the subpixel and subtexel accuracy. Definition of ceil() is: ceil(x) returns the smallest integer not < x e.g. ceil(1.0) returns 1 ceil(1.5) returns 2 ceil(-2.0) returns -2 ceil(-1.5) returns -1 If we limit x to only positive numbers we can make a 16:16 bit fixed point version of ceil() very easily. We add 0xffff to x and right-shift by 16. inline long ceil(long x) { x += 0xffff; return x >> 16; } Note that this function do not give the correct result if x is negative. This is no problem as x will never be negative the way ceil() is used in the mappers. 9. Subpixel accuracy --------------------- The first thing to note when aiming for subpixel accurate polygon drawing is that we must never truncate screen coordinates to integers. Today it's common to use floating point all the way trough the 3D engine and convert the projected screen coordinates to integers just before entering the polygon function. This conversion should of course be a float to fixed point conversion so that we do not lose the fractional part. The fractional part is in fact the subpixel position that will affect how the slope of a polygon edge will be rendered. Without subpixel accuracy the polygons will jump one pixel at a time when the object is moving slowly over the screen. With subpixel accuracy the pixels that makes up the edges between the polygons will slowly "float", making the edges seem to move slowly over the screen. Calculating the slope of a subpixel accurate section: long scanlines = ceil(v2.y) - ceil(v1.y); if(scanlines <= 0) return; // Section had zero height long height = v2.y - v1.y; long dxdy = ((v2.x - v1.x) << 16) / height; long prestep = (ceil(v1.y) << 16) - v1.y; long left_x = v1.x + ((prestep * dxdy) >> 16); The height of the section, or the number of scanlines, will be an integer. When calculating the slope (dxdy) we will not use this height, rather the real height that includes the fractional part. We apply the subpixel accuracy to the slope by adjusting the initial x coordinate by the amount that was truncated when selecting the top y coordinate using ceil(). Interpolating along subpixel accurate left and right sections: for( ) { long x1 = ceil(left_x); // Scanline start x long width = ceil(right_x) - x1; // Scanline width if(width > 0) { Now draw the scanline from x1,y, width pixels wide. } left_x += left_dxdy; right_x += right_dxdy; y++; } 10. Subtexel accuracy --------------------- Calculate the u,v (dudy,dvdy) slopes along a section the same way as the x coordinate slope and prestep the initial values the same way: long height = v2.y - v1.y; long dudy = ((v2.u - v1.u) << 16) / height; long dvdy = ((v2.v - v1.v) << 16) / height; long prestep = (ceil(v1.y) << 16) - v1.y; long left_u = v1.u + ((prestep * dudy) >> 16); long left_v = v1.v + ((prestep * dvdy) >> 16); When we interpolate with subtexel accuracy we must prestep u,v before drawing each scanline. This is because we round the start of the scanline (x1) upward to the nearest pixel using ceil(). for( ) { long x1 = ceil(left_x); // Scanline start x long width = ceil(right_x) - x1; // Scanline width if(width > 0) { long prestep = (x1 << 16) - left_x; long u = left_u + ((prestep * dudx) >> 16); long v = left_v + ((prestep * dvdx) >> 16); Now draw the scanline from x1,y,u,v, width pixels wide. } left_x += left_dxdy; left_u += left_dudy; left_v += left_dvdy; right_x += right_dxdy; y++; } 11. Avoiding divide overflow in slope calculations -------------------------------------------------- For some sections that are very thin, the slope calculation might overflow. Look at the following case: v1.x = 0000:0000 v2.x = 0002:0000 v1.y = 0000:0000 v2.y = 0000:0001 ceil(v2.y) - ceil(v1.y) will report that this is one scanline even if the actual height is just a fraction of a pixel. height = v2.y - v1.y = 0000:00001 width = v2.x - v1.y = 0002:00000 so performing (width << 16) / height will cause divide overflow. We are aiming for nearly perfect affine mapping so we can't just take some default value for the slope. One way to avoid the overflow is to multiply width by the inverse of height using only 18:14 bit precision. long height = v2.y - v1.y; long inv_height = (0x10000 << 14) / height; long dxdy = ((v2.x - v1.x) * inv_height) >> 14; Note that this method can only be used for this special case where the section height is less than one pixel. Other sections that are more than one pixel should of course be calculated as usual. 12. Loop counters in inner loops -------------------------------- There is a neat way to combine movement of destination pointer and loop counter in inner loops. It might not save much (half a clock tick on a Pentium) but you might find other ways of using it. Assume that we want to draw a horizontal line on the screen with the following data: al = color ecx = width of line edi = destination pointer to first pixel on the left The standard way would be: inner: mov [edi], al ; plot inc edi ; advance destination pointer dec ecx ; decrease loop counter jnz inner Another way is to combine destination pointer and width and plot the line from right to left: inner: mov [edi+ecx-1], al dec ecx jnz inner In the above loop we got rid of one instruction. On the other hand, we are writing to memory from higher to lower addresses. This is bad for the write buffers. We should write to increasing addresses instead. That can be done in this way: lea edi, [edi+ecx] neg ecx ; loop counter goes from -width to 0 inner: mov [edi+ecx], al inc ecx jnz inner The destination is first moved to the end of the line and the loop counter is negated. The first pixel will therefore be plotted at start+width-width, i.e. at the start of the line. neg ecx can be replaced with: xor ecx, -1 inc ecx which most of the time can be paired with some other instructions in the setup code. neg is a non pairable, 1 tick instruction. We will then end up with: lea edi, [edi+ecx] xor ecx, -1 inc ecx inner: mov [edi+ecx], al inc ecx jnz inner 13. 8:16 bit inner loop ----------------------- After fatmap.txt was released I was sent a very nice 8:16 bit 4 clock tick inner loop made by Russel Simmons (Armitage/Beyond). In it's original form it plotted the scanlines from right to left. I have in a later stage changed this and present the new version here which plots the scanlines from left to right. ; bitmap (256x256) must be aligned on 64k. (low 16 bits zero) ; eax = u frac : - ; ebx = bitmap ptr : v int : u int ; ecx = scanline width ; edx = v frac : v int step : u int step ; esi = u frac step : 0 : 0 ; edi = destination ptr ; ebp = v frac step : 0 : 0 lea edi, [edi+ecx] xor ecx, -1 inc ecx inner: mov al, [ebx] ; get color add edx, ebp ; v frac += v frac step adc bh, dh ; v int += v int step (+carry from v frac) add eax, esi ; u frac += u frac step adc bl, dl ; u int += u int step (+carry from u frac) mov [edi+ecx], al ; plot pixel inc ecx jnz inner This is a good inner loop with enough precision, which also can handle texture wrapping. Note that the texture must be aligned on 64k. One way of aligning texture maps on 64k is to first make a buffer which is 64k byte larger than the texture map and then align a pointer inside that buffer: char source[256*256]; // source bitmap char bigbuffer[256*256*2]; // 2*64k byte buffer char * aligned = (char *) (((int)(bigbuffer + 0xffff)) & ~0xffff); memcpy(aligned, source, 256*256); Now access the bitmap using the aligned pointer. 14. Arbitrary size bitmap inner loop ------------------------------------ The idea for the following 5 clock tick inner loop was taken from Chris Hecker's scanline subdivision texture mapper. This loop doesn't rely on the fact that textures are always 256x256 pixels. We use a 2 dword lookup table instead to perform the v * width calculation. Therefore it can handle bitmaps of any size. The fractional part in this loop can be up to 32 bits but we only use 16 bits here. Note that the bitmap does not need to be aligned. The trick in this loop is to convert the carry flag from the fractional v addition into an index in the lookup table. Then to get the texture increment from the lookup table. ; bitmap can be any size, setup duvdxstep table according to width ; dvdxfrac = v frac step : 0 ; eax = u frac : 0 ; ebx = v frac : 0 ; ecx = scanline width ; edx = u frac step : 0 ; esi = bitmap ptr ; edi = destination ptr lea edi, [edi+ecx] xor ecx, -1 add ebx, [dvdxfrac] ; v frac += v frac step inc ecx sbb ebp, ebp ; -1 if carry from add inner: add eax, edx ; u frac += u frac step mov bl, [esi] ; get color adc esi, [duvdxstep+4+ebp*4] ; move texture pointer add ebx, [dvdxfrac] ; v frac += v frac step sbb ebp, ebp ; -1 if carry from add mov [edi+ecx], bl ; plot pixel inc ecx jnz inner The duvdxstep lookup table is set up in the following way: long duvdxstep[2]; duvdxstep[0] = (dudx >> 16) + (dvdx >> 16) * mapwidth + mapwidth; duvdxstep[1] = (dudx >> 16) + (dvdx >> 16) * mapwidth; This means that when we get that carry from the v addition and ebp becomes -1, we will address duvdxstep[0] and add one extra line in the bitmap. A drawback of this inner loop is that it can not handle texture wrapping. I.e. you must never let the u,v coordinates be outside the texture map or you will read random data outside the texture map or even cause a protection fault. 15. 8:15 bit tiled inner loop ----------------------------- In fatmap.txt I described a method of arranging the texture map as 8x8 tiles for more efficient cache usage. Back then I did not know any way of doing the bit-swapping inner loop in less than 11 clock ticks. But one day it just came to me and I could immediately write a 8 clock tick 8:16 bit version. That version did not even work in theory, but it was a starting point. The inner loop below has evolved from the original 8 clock tick version over a two month period. It was developed on paper only and the first time I actually tested any of the loops, was this version and it worked perfectly. The version below runs at 6 clock ticks and uses 8:15 bit interpolation. Here 8x8 tiles are used but any type of scheme can be used, just modify the bit-masks. The 256x256 bitmap need not be aligned, but for the tiling scheme to be effective the bitmap should be aligned on 32 byte. Texture wrapping is allowed. ; bitmap (256x256) must be stored as 8x8 tiles ; tildudx = wwwww11111111www1fffffffffffffffb (w=whole, f=frac) ; tildvdx = 11111wwwwwwww1111fffffffffffffffb ; eax = u wwwww00000000www0fffffffffffffffb ; ebx = v 00000wwwwwwww0000fffffffffffffffb ; ecx = scanline width ; edi = destination ptr ; esi = bitmap ptr lea edi, [edi+ecx-1] xor ecx, -1 lea ebp, [eax+ebx] ; u+v inc ecx inner: add eax, [tildudx] ; u += tildudx add ebx, [tildvdx] ; v += tildvdx shr ebp, 16 ; (u+v) >> 16 and eax, 11111000000001110111111111111111b ; clear bitgaps and ebx, 00000111111110000111111111111111b inc ecx mov dl, [esi+ebp] ; get color lea ebp, [eax+ebx] ; u+v mov [edi+ecx], dl ; plot pixel jnz inner Converting 16:16 bit fixed point dudx,dvdx to the tiled format can be done the following way: tildudx = (((dudx << 8) & 0xf8000000) + ((dudx >> 1) & 0x00007fff) + (dudx & 0x00070000)) | 0x07f88000; tildvdx = (((dvdx << 3) & 0x07f80000) + ((dvdx >> 1) & 0x00007fff)) | 0xf8078000; Note that we fill out the bit-gaps with 1's. This is because we must force the bits to jump over the gaps when we add in the inner loop. These 1's must be cleared out after the add. We do this with the 'and' instructions in the inner loop. Before entering the inner loop we also convert u,v to the tiled format: u = ((u << 8) & 0xf8000000) + ((u >> 1) & 0x00007fff) + (u & 0x00070000); v = ((v << 3) & 0x07f80000) + ((v >> 1) & 0x00007fff); It should be noticed that 15 bits are only used for the fractional part and that it is right-shifted 1 bit. This is done because we must not let the fractional part overflow and clutter the texture pointer when we add u+v using the lea instruction. We have placed a bit-gap between the fractional part and the whole part to avoid this. The bitmap itself can be tiled the following way: char source[256*256]; // linear source bitmap char tiled[256*256]; for(int v=0; v<256; v++) { for(int u=0; u<256; u++) { int dst = ((u<<8) & 0xf800)+(u & 0x0007)+((v<<3) & 0x07f8); tiled[dst] = source[u+v*256]; } } 16. Polygons per second ----------------------- From time to time I see coders talking about the speed of their 3D engine by specifying how many polygons per second can be plotted. In my opinion, this is completely irrelevant as they do not specify under what condition they have made the test. The speed will be _very_ dependent on the size and orientation of the polygons as well as how the texture is mapped onto the polygons. Many other factors will contribute to the speed, even the type of memory manager or DOS extender. With the simple test program accompanying this document I try to compare the speed between the different mappers. In this case, a comparison is valid as they work under the exact same conditions. I get the following results on my Pentium 120: Flat polys/sec 3494 Gouraud polys/sec 1849 Texture polys/sec 1214 Arbitrary texture polys/sec 1196 Tiled texture polys/sec 1520 These figures do not seem very impressive, but keep in mind that the triangles can be very large as I allow x,y to be in the range [0...320],[0...200]. In any case you should notice that the flat filler (using standard memset() as inner loop) is about twice as fast as the others. The arbitrary size mapper and the ordinary mapper should run at about the same speed. The tiled mapper will come out as a winner, because we allow the u,v values to vary a lot over the triangle surface, i.e. the cache issues will become more important than a quick setup and a fast inner loop. 17. Greetings ------------- Members of Doomsday Members of SoCS Members of Esteem Phil Carmody Nix / The Black Lotus Submissive / Cubic Team & $eeN Armitage / Beyond Jare / Iguana Wog / Orange #coders. No one mentioned, no one forgotten. All my students in YiJ 1996-1997; Data3, AD2, Elm2, El1A, El1B ...my heart in Oslo .eof.