Passing the Starling Image count limit


Flash Player 11 limits both vertex buffers and index buffers to 4096 instances. Since in the Starling framework every Image and Quad has its own vertex and index buffer, their total number is also limited to 4096. At a 30 fps frame rate a good GPU could handle a whole lot more images than that, so it would be nice to get past this limit. Luckily that’s pretty easy.

The solution is to use shared vertex and index buffers for the Images and Quads. In addition to the instance limit there is also a limit on the size of a single vertex or index buffer, so in this example I am limiting the number of images to N = 8192, which is twice the original limit. To support even more images, simply add several shared buffers.

First change the vertex buffer and index buffer variables in the Quad class to be static so that all instances of Quad and Image will use the same buffers. Also add a static vector of integers for storing available quad indices and a normal integer variable for storing the index this Quad is using.

Create a function for initializing the buffers and the vector for indices. Push the integers from 0 to N-1 into the new static vector used for the available Quad indices. Create another, local vector of unsigned integers for the index buffer’s data. Push a total of 6*N uints into this vector with the same logic that is used in QuadGroup’s addQuadData function (each group of six consecutive numbers holds the indices for one Quad’s corners, the first six numbers being 0, 1, 2, 1, 3, 2). The third vector needed is a local vector of Numbers for the initial vertex data. Push a total of VertexData.ELEMENTS_PER_VERTEX*4*N numbers (for example 0) into this vector. Finally create the static vertex buffer with a size of 4*N and the static index buffer with a size of 6*N, and upload the vertices vector into the vertex buffer and the indices vector into the index buffer (check Quad’s createVertexBuffer and createIndexBuffer functions to see how the buffers are created).
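The initialization described above could look roughly like the following. This is only a sketch: the SharedQuadBuffers class, its member names and the elementsPerVertex parameter are made up for illustration (in the post these members live directly in Quad, and the value would come from VertexData.ELEMENTS_PER_VERTEX). Only the Context3D, VertexBuffer3D and IndexBuffer3D calls are real Flash Player API.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.IndexBuffer3D;
    import flash.display3D.VertexBuffer3D;

    /** Hypothetical holder for the shared buffers; not part of Starling. */
    public class SharedQuadBuffers
    {
        public static const MAX_QUADS:int = 8192; // N in the text

        public static var vertexBuffer:VertexBuffer3D;
        public static var indexBuffer:IndexBuffer3D;
        public static var freeIndices:Vector.<int>;

        public static function init(context:Context3D, elementsPerVertex:int):void
        {
            if (vertexBuffer) return; // allocate the shared buffers only once

            freeIndices = new Vector.<int>();
            var indexData:Vector.<uint> = new Vector.<uint>();

            for (var i:int = 0; i < MAX_QUADS; ++i)
            {
                freeIndices.push(i);

                // two triangles per quad: 0, 1, 2 and 1, 3, 2, shifted by 4 per quad
                var v:int = i * 4;
                indexData.push(v, v + 1, v + 2, v + 1, v + 3, v + 2);
            }

            // a Vector.<Number> created with a length is filled with zeros
            var vertexData:Vector.<Number> =
                new Vector.<Number>(elementsPerVertex * 4 * MAX_QUADS);

            vertexBuffer = context.createVertexBuffer(4 * MAX_QUADS, elementsPerVertex);
            vertexBuffer.uploadFromVector(vertexData, 0, 4 * MAX_QUADS);

            indexBuffer = context.createIndexBuffer(6 * MAX_QUADS);
            indexBuffer.uploadFromVector(indexData, 0, 6 * MAX_QUADS);
        }
    }
}
```

With N = 8192 the vertex buffer holds 32768 vertices, so every index still fits into the 16-bit values an index buffer stores.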

To initialize the buffers, call the function created in the previous step from Quad’s constructor and add one more static boolean for making sure the buffers are allocated only once. At the end of Quad’s constructor also pop one integer from the static quad index vector. Store this index in the Quad instance since it is needed to specify the offset of the Quad’s data in the buffers. Also modify Quad’s dispose function so that it does not dispose the buffers but simply pushes this index back to the vector from which it was popped.
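The bookkeeping for the quad indices is a plain free list. The class below is a hypothetical stand-alone version of it (QuadSlotPool and its function names do not exist in Starling); in the post the acquire part goes to the end of Quad’s constructor and the release part to Quad’s dispose function.

```actionscript
package
{
    /** Hypothetical free list of quad slots in the shared buffers. */
    public class QuadSlotPool
    {
        private static var sFreeIndices:Vector.<int>;

        /** Fills the pool with the indices 0 .. capacity-1, only on the first call. */
        public static function init(capacity:int):void
        {
            if (sFreeIndices) return;

            sFreeIndices = new Vector.<int>();
            for (var i:int = 0; i < capacity; ++i)
                sFreeIndices.push(i);
        }

        /** Called when a quad is created: reserves one slot. */
        public static function acquire():int
        {
            if (sFreeIndices.length == 0)
                throw new Error("All shared quad slots are in use");

            return sFreeIndices.pop();
        }

        /** Called when a quad is disposed: makes its slot available again. */
        public static function release(index:int):void
        {
            sFreeIndices.push(index);
        }
    }
}
```

Throwing an error when the pool is empty is my own choice for the sketch; the text itself suggests adding more shared buffers when N quads are not enough.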

The last thing to do is to start using these shared vertex and index buffers properly. Add one boolean to the Quad class for indicating the need to upload the vertex data to the vertex buffer. The initial value for this boolean is naturally true. Replace all the “if (mVertexBuffer) createVertexBuffer();” lines in both Quad and Image with a line setting this boolean to true. Then modify the createVertexBuffer function. It must not create any vertex buffers but only upload the data if the boolean you just started using is true. After uploading the vector set the boolean to false. Here the second parameter to VertexBuffer3D’s uploadFromVector function is no longer 0 but this Quad’s index*4. You can remove the createIndexBuffer function since there is no need to update the index buffer after it has been created. Finally modify the render function in both Quad and Image. Remove the null checks for the vertex and index buffers and always call the createVertexBuffer function you just modified. Also change the second parameter of Context3D’s drawTriangles function here. It is no longer 0 but this Quad’s index*6.

After making these modifications try the Starling demo application again with the frame rate set to 30 fps. On my laptop, which I also used for the optimization tests in the previous post, I got almost 8000 images with GPU rendering, so there also seems to be a small 20-25% performance improvement [with the projector. With a plugin player the improvement can be over 100%, as discussed in the next post]. In any case we can now clearly pass the original limit of 4096 Images.
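The two offsets mentioned above are the only places where the quad’s index is needed. The helper below is a hypothetical sketch (the SharedQuadRenderer class and its function names are not Starling code); the uploadFromVector and drawTriangles calls and their parameter order are the standard Flash Player 11 API.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.IndexBuffer3D;
    import flash.display3D.VertexBuffer3D;

    /** Hypothetical helper showing the offsets used with the shared buffers. */
    public class SharedQuadRenderer
    {
        /** Uploads the four vertices of one quad into its slot of the shared buffer. */
        public static function uploadQuad(buffer:VertexBuffer3D,
                                          quadVertices:Vector.<Number>,
                                          quadIndex:int):void
        {
            // startVertex is quadIndex * 4 instead of 0, numVertices stays 4
            buffer.uploadFromVector(quadVertices, quadIndex * 4, 4);
        }

        /** Draws the two triangles of one quad from the shared index buffer. */
        public static function drawQuad(context:Context3D,
                                        indexBuffer:IndexBuffer3D,
                                        quadIndex:int):void
        {
            // firstIndex is quadIndex * 6 instead of 0, numTriangles stays 2
            context.drawTriangles(indexBuffer, quadIndex * 6, 2);
        }
    }
}
```

Here quadVertices is assumed to hold exactly the data of the quad’s four vertices, in the same layout that was used when the shared vertex buffer was created.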

Twitter account


I just created a Twitter account: @villekoskelaorg

The other account names similar to my name are not mine.

Optimizing Starling framework


Some of you have probably tried the Starling demo application that comes with the Starling framework package. If you have a fairly new computer with a separate GPU and you run the benchmark, you probably get well over 1000 images moving and rotating smoothly at 60 fps on the screen. But what if your computer doesn’t have a GPU and all the rendering is done in software on the CPU? Then the number of images the benchmark is able to render at 60 fps can be under 100! This number depends of course on your computer’s CPU, but in any case it is less than what you would expect to get with traditional display objects. Next I’ll go through a few tips and tricks which should enable you to render 3-4 times as many images as before when using software rendering.

To test the software rendering performance simply set the renderMode parameter passed to the Starling class constructor to “software”. To detect that Flash is using software rendering, check if the current rendering context’s driverInfo starts with “Software”.

The following tests were run [with Flash Player 11 projector] on my home laptop, which has an Intel Core i5-480M dual core 2.66 GHz CPU and an nVidia GeForce GT 415M GPU. It is not super powerful but still a fairly new and decent laptop. I also dropped the benchmark’s frame rate to 30 fps since I consider that still an acceptable frame rate.

The original image counts on my laptop at 30 fps were approximately 6200 with GPU rendering and 520 with CPU rendering. Now let’s go through the code changes and their results.
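The driverInfo check can be wrapped into a small helper. This is a sketch: the RenderModeCheck class and the isSoftware function are names I made up, while Context3D.driverInfo is the real Flash Player property.

```actionscript
package
{
    import flash.display3D.Context3D;

    /** Hypothetical helper for detecting the software fallback. */
    public class RenderModeCheck
    {
        /** Returns true when the context reports a software driver. */
        public static function isSoftware(context:Context3D):Boolean
        {
            // driverInfo starts with "Software" when rendering happens on the CPU
            return context.driverInfo.toLowerCase().indexOf("software") == 0;
        }
    }
}
```

The comparison is made case-insensitive only to be on the safe side. The string “software” used for the renderMode parameter is also available as the constant Context3DRenderMode.SOFTWARE.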

 

Code change 1: Call flatten for the Sprite containing the images on every frame update.

Starling Sprite has a function called flatten. It precalculates the vertex and index buffers that are needed to render all the Images and Quads the Sprite contains and it divides these buffers into as few QuadGroups as possible. Quad and Image have vertex and index buffers for just two triangles but QuadGroup has vertex and index buffer for all the triangles it will draw making it a lot more optimal for 3D rendering. Images compiled into one QuadGroup need to use SubTextures from the same Texture meaning that the graphics the images use must internally be on the same Stage3D texture. TextureAtlas provides functionality to load this kind of sprite sheet textures and to create the SubTextures.
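A minimal way to try this change is a container that flattens itself on every frame. The FlattenedLayer class is hypothetical; flatten, Event.ENTER_FRAME and EnterFrameEvent are Starling’s own API. I am assuming here that the children have already been moved for the current frame when this listener runs, so you may need to call flatten from your own update loop instead.

```actionscript
package
{
    import starling.display.Sprite;
    import starling.events.EnterFrameEvent;
    import starling.events.Event;

    /** Hypothetical container that re-flattens its children on every frame. */
    public class FlattenedLayer extends Sprite
    {
        public function FlattenedLayer()
        {
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        private function onEnterFrame(event:EnterFrameEvent):void
        {
            // recompiles all child Images and Quads into as few QuadGroups as possible
            flatten();
        }
    }
}
```

Add the Images to an instance of this class instead of adding them to a plain Sprite, and create their textures from the same TextureAtlas so that they end up in one QuadGroup.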

Result: Since the Images use SubTextures from the same Texture they can all be compiled into just one QuadGroup. The software rendering can now handle 840 images (162% compared to original).

 

Code change 2: Simplify the shader code.

The shader programs created in the registerPrograms function in the Image class might be more complicated than you need. The fragment program is run for every pixel rendered, so removing any operations from there improves the performance of CPU rendering a lot. So if you don’t need to be able to adjust the color and alpha values of the different corners of the image, replace the original vertexProgramCode and fragmentProgramCode with these:

var vertexProgramCodeSimple:String =
    "m44 op, va0, vc0  \n" +  // 4x4 matrix transform to output clipspace
    "mov v0, va1       \n";   // pass texture coordinates to fragment program

var fragmentProgramCode:String =
    "tex ft1, v0, fs1 <???> \n" +  // sample texture 1
    "mov oc, ft1 \n";              // write the sampled color to the output

Notice that you must not pass the color data to rendering context’s setVertexBufferAt function (originally index 1) any more so remove the line that did that. Also start passing the texture coordinates to vertex buffer at index 1 instead of the original 2.
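The changed vertex buffer bindings could look like this. It is a sketch under assumptions: the SimpleProgramBinding class is hypothetical, the offsets are passed in as parameters because they depend on how VertexData lays out its elements, and I am assuming that positions are stored as three floats and texture coordinates as two. Clearing slot 2 with null is my own addition for the case where the slot was bound earlier.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.Context3DVertexBufferFormat;
    import flash.display3D.VertexBuffer3D;

    /** Hypothetical helper binding the vertex data for the simplified programs. */
    public class SimpleProgramBinding
    {
        public static function bind(context:Context3D, buffer:VertexBuffer3D,
                                    positionOffset:int, texCoordOffset:int):void
        {
            // va0: vertex position, unchanged from the original code
            context.setVertexBufferAt(0, buffer, positionOffset,
                                      Context3DVertexBufferFormat.FLOAT_3);

            // va1: texture coordinates, which used to be at index 2
            context.setVertexBufferAt(1, buffer, texCoordOffset,
                                      Context3DVertexBufferFormat.FLOAT_2);

            // index 2 is not read by the simplified vertex program any more
            context.setVertexBufferAt(2, null);
        }
    }
}
```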

Result: The software rendering handles 940 images (180% compared to original) after changes 1 and 2.

 

Code change 3: Render with lower quality.

At some point it might be necessary to sacrifice rendering quality to gain some performance. The easiest way to do this is to simply change the smoothing of the Image to TextureSmoothing.NONE.

Result: The software rendering handles 1130 images (217% compared to original) after changes 1, 2 and 3.

So now we have the software rendering running with over two times as many images as before (or 1.8 times as many if you wanted to keep the quality). But when we try these changes with GPU rendering, the number of images is only about 2300, or a little under 40% of the original performance. This is because the implementation of the flatten function is quite far from optimal. Let’s next concentrate on tweaking it.

 

Code change 4: Optimize the flatten function in Sprite.

The flatten function in Sprite was clearly not designed to be called on every frame update, but using it that way still improves the software rendering performance. With the changes I’ll briefly go through, the flatten function will become almost three times as fast as before.

So let’s see what flatten actually does. First it calls unflatten which disposes all the QuadGroups generated with the previous call to flatten. Instead of disposing we should try to simply update the QuadGroups we already have.

Step 1: Add resetting support to QuadGroup.

Add reset and initialize functions to QuadGroup. To support resetting, add variables for storing both the current size and the allocated size of the index buffer, the vertex buffer and the indices vector. The reset function simply sets all three of the current sizes to 0 but doesn’t touch the buffers or the vector. The initialize function takes the same parameters as the constructor and sets the values again. Modify the finish function to allocate new buffers only if the current ones are not big enough and to upload the index buffer only when its size changes.
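The idea of separating the current size from the allocated size can be sketched like this. ReusableBuffers is a hypothetical stand-in for the relevant part of QuadGroup and none of its names come from Starling. I am also assuming that the index pattern for quad number i is always the same, which is why the index data is uploaded only when the index buffer has to grow.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.IndexBuffer3D;
    import flash.display3D.VertexBuffer3D;

    /** Hypothetical sketch of buffers that survive a reset. */
    public class ReusableBuffers
    {
        private var mVertexBuffer:VertexBuffer3D;
        private var mIndexBuffer:IndexBuffer3D;
        private var mAllocatedVertices:int = 0;
        private var mAllocatedIndices:int = 0;

        public var numVertices:int = 0; // current size, grows while quads are added
        public var numIndices:int = 0;

        /** Forgets the current content but keeps the allocated buffers. */
        public function reset():void
        {
            numVertices = 0;
            numIndices = 0;
        }

        /** Uploads the collected data, reallocating only when the buffers are too small. */
        public function finish(context:Context3D, vertexData:Vector.<Number>,
                               indexData:Vector.<uint>, elementsPerVertex:int):void
        {
            if (numVertices == 0) return;

            if (numVertices > mAllocatedVertices)
            {
                if (mVertexBuffer) mVertexBuffer.dispose();
                mVertexBuffer = context.createVertexBuffer(numVertices, elementsPerVertex);
                mAllocatedVertices = numVertices;
            }
            mVertexBuffer.uploadFromVector(vertexData, 0, numVertices);

            if (numIndices > mAllocatedIndices)
            {
                if (mIndexBuffer) mIndexBuffer.dispose();
                mIndexBuffer = context.createIndexBuffer(numIndices);
                mIndexBuffer.uploadFromVector(indexData, 0, numIndices);
                mAllocatedIndices = numIndices;
            }
        }
    }
}
```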

Step 2: Start reusing the current QuadGroup instances in Sprite.

Don’t dispose the current QuadGroups in Sprite’s unflatten function but only reset them, and pass the current QuadGroup vector as an additional parameter to QuadGroup’s compile function. Also pass the index of the currently active QuadGroup as another parameter to QuadGroup’s compileObject function.

Since we are now using a cached vector of QuadGroups we are not necessarily adding quads to the QuadGroup that is last in the QuadGroup vector. The index is used for specifying which instance from the vector is currently used. When we need to use a new QuadGroup we first check if one is available in the next index in the vector and initialize it or if not then create a new one and add it to the vector.
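The lookup described above can be sketched as follows. Both BatchCache and Batch are hypothetical and heavily simplified: Batch stands in for QuadGroup, and textureId stands in for whatever parameters the QuadGroup constructor takes.

```actionscript
package
{
    /** Hypothetical cache that hands out reusable batches by index. */
    public class BatchCache
    {
        private var mBatches:Vector.<Batch> = new Vector.<Batch>();

        /** Called instead of disposing: keeps the instances, drops their content. */
        public function resetAll():void
        {
            for each (var batch:Batch in mBatches)
                batch.reset();
        }

        /** Returns the batch at the given index, which may be at most one past the end. */
        public function getBatch(index:int, textureId:int):Batch
        {
            var batch:Batch;

            if (index < mBatches.length)
            {
                batch = mBatches[index];      // reuse a cached instance
                batch.initialize(textureId);
            }
            else
            {
                batch = new Batch(textureId); // none left, create a new one
                mBatches.push(batch);
            }
            return batch;
        }
    }
}

/** Simplified stand-in for QuadGroup. */
class Batch
{
    public var textureId:int;
    public var numQuads:int = 0;

    public function Batch(textureId:int) { initialize(textureId); }

    public function initialize(textureId:int):void
    {
        this.textureId = textureId;
        numQuads = 0;
    }

    public function reset():void { numQuads = 0; }
}
```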

Step 3: Start reusing the VertexData instances.

Add a reset function and the related functionality to VertexData as well. Modify the append function to write to the current index in mData instead of pushing new values into the vector. Start using this new reset and append combination instead of the clone function to avoid creating new instances. The vertexData getter in Image also calls Texture’s adjustVertexData function every time even though a cached result could be used. The cached value should be reset only when the Image’s texture changes.
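The reset and append combination can be sketched with a write position that is separate from the vector’s length. ReusableVertexData is a hypothetical, stripped-down stand-in for VertexData; only the name mData is taken from the text.

```actionscript
package
{
    /** Hypothetical sketch of vertex data that can be refilled without reallocating. */
    public class ReusableVertexData
    {
        private var mData:Vector.<Number> = new Vector.<Number>();
        private var mLength:int = 0; // number of values currently in use

        /** Starts writing from the beginning again, keeping the allocated vector. */
        public function reset():void
        {
            mLength = 0;
        }

        /** Copies count values to the current write position. */
        public function append(values:Vector.<Number>, count:int):void
        {
            for (var i:int = 0; i < count; ++i)
            {
                // writing to index == length grows a non-fixed Vector by one element
                mData[mLength++] = values[i];
            }
        }

        public function get length():int { return mLength; }
        public function get data():Vector.<Number> { return mData; }
    }
}
```

After the first few frames the vector has reached its final size and append only overwrites existing values, so no new objects are created during flattening.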

Step 4: Optimize QuadGroup’s compileObject function.

Don’t pass matrixStack and alphaStack but only the child matrix and child alpha to QuadGroup’s compileObject function. Don’t create a clone of the current matrix for the child matrix but instead allocate the child matrix outside the loop and simply call copyFrom to assign current matrix to the child matrix.
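The matrix reuse can be sketched like this. The ChildMatrixExample class, its visitChildren function and the visit callback are hypothetical; Matrix3D’s copyFrom and prepend are real Flash Player 11 functions. I am assuming that each child’s own transformation is available as a Matrix3D.

```actionscript
package
{
    import flash.geom.Matrix3D;

    /** Hypothetical sketch of reusing one child matrix for all children. */
    public class ChildMatrixExample
    {
        public static function visitChildren(currentMatrix:Matrix3D,
                                             childTransforms:Vector.<Matrix3D>,
                                             visit:Function):void
        {
            // allocated once outside the loop instead of cloning for every child
            var childMatrix:Matrix3D = new Matrix3D();

            for each (var childTransform:Matrix3D in childTransforms)
            {
                childMatrix.copyFrom(currentMatrix); // replaces currentMatrix.clone()
                childMatrix.prepend(childTransform); // apply the child's own transformation
                visit(childMatrix);
            }
        }
    }
}
```

The callback must not store the matrix it receives, since the same instance is overwritten for the next child.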

Step 5: Divide the vertex data to position data and other data.

Originally all the vertex data (position, texture position and color) is in the same vector in VertexData class. When the data is divided into two vectors so that first one contains only the position values and the second one the texture position and color values it is possible to use Matrix3D function transformVectors to transform the positions of all the vertices with single call when handling the quad compiling in QuadGroup’s compileObject function. If the data is divided into two vectors you also need to have two separate vertex buffers in Quad and QuadGroup.

One thing to notice here is that the maximum amount of vertex (and index) buffers is 4096 so after this modification you should make sure that when Images are flattened their own vertex and index buffers are disposed.
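For Step 5, once the positions are in a vector of their own, one call transforms all of them. The PositionTransform class below is a hypothetical wrapper; Matrix3D.transformVectors is the real Flash Player function and expects the positions as consecutive x, y, z triples.

```actionscript
package
{
    import flash.geom.Matrix3D;

    /** Hypothetical wrapper around Matrix3D.transformVectors. */
    public class PositionTransform
    {
        /** Transforms all x, y, z triples in positions and returns the result. */
        public static function transform(matrix:Matrix3D,
                                         positions:Vector.<Number>):Vector.<Number>
        {
            var result:Vector.<Number> = new Vector.<Number>(positions.length);
            matrix.transformVectors(positions, result);
            return result;
        }

        /** Usage example: moves the four corners of a 32x32 quad by (10, 20, 0). */
        public static function example():Vector.<Number>
        {
            var matrix:Matrix3D = new Matrix3D();
            matrix.appendTranslation(10, 20, 0);

            var corners:Vector.<Number> = Vector.<Number>([
                 0,  0, 0,
                32,  0, 0,
                 0, 32, 0,
                32, 32, 0]);

            return transform(matrix, corners);
        }
    }
}
```

In real code the result vector would of course be reused between calls instead of being allocated every time.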

Result: Combining all the changes that do not affect rendering quality (1, 2 and 4), the software rendering now handles 1250 images (240% compared to original). By using lower rendering quality and all four changes the software rendering can handle 1670 images (320% compared to original)! The GPU rendering still can’t handle more than 5000 images (80% compared to original), meaning that you probably shouldn’t use the flatten function this way with GPU rendering.

 

To sum this all up: when all or most of the sprites are moving, you probably can’t achieve much better GPU rendering performance with any small modifications to the Starling framework. On the other hand, with CPU rendering it is really easy to get two to three times better results than what Starling gives out of the box. Even simply calling the flatten function on the root or close to root level sprites on every frame update boosts the CPU rendering performance considerably. [The actual results gained from the optimization depend heavily on the processor’s speed but are always noticeable]. By using intelligent sprite flattening that is not performed on every frame update [and uploading vertex data only when there are changes] it is possible to get better frame rates also with GPU rendering. Just remember that in order to gain any performance from sprite flattening, the flattened images should be using textures from the same main texture. Since the maximum width and height of a texture are 2048 pixels, you should be able to combine quite a lot of graphics on any single texture.

Post about Starling optimizations


I’ve been working this weekend on a post about optimizing the Starling framework’s software rendering performance. Now it seems that I’ll be able to finish it tomorrow so stay tuned for the next update.

Flash 11 – using Stage3D for 2D graphics


Graphics rendering can be a bottleneck in many games. Especially with Flash, only a limited number of animating graphics can be on the screen at once, and there are also limitations on the screen size for running games smoothly. The just released Flash 11 comes with the Stage3D API, a low-level GPU accelerated API for rendering graphics. Using GPU acceleration brings the graphics performance to a whole new level, but to properly use Stage3D one really needs to know 3D programming. Luckily there are already libraries that implement the functionality needed for drawing 2D graphics and do all the Stage3D stuff behind the scenes. The Starling framework is one of these libraries.

The Starling framework is an open source library that provides the conventional Flash display list architecture while using Stage3D for the actual rendering. Starling is a pure AS3 library, and it’s free and open source, so modifying and tuning it is really simple. The three most fundamental classes in the Starling framework are Texture, Image and Sprite.

Texture can be considered equivalent to BitmapData in the conventional display object world. It wraps the Stage3D texture and can be constructed from BitmapData or Bitmap. To make rendering more efficient one should have the different source images grouped properly on a single BitmapData instance, create a Texture from it and then create SubTextures with clipping rectangles from the Texture.

Image is used for drawing Texture on the screen so it can be considered to be equivalent to Bitmap. Images can be moved, rotated, scaled and their transparency can be adjusted.

Sprite is equivalent to the normal Sprite: it is a display object container that can contain any number of child display objects. A Sprite can also be flattened, which means that changes to its children won’t be visible until the Sprite is flattened again or unflattened. When a Sprite is flattened its content is optimized for rendering, so flattened Sprites should be used whenever possible. One thing to notice is that the Starling Sprite has its rotation in radians, not in degrees like the normal Sprite.

Initializing Starling is really simple. Just create a class that extends the Starling Sprite and pass it and a few other parameters when creating an instance of the Starling main class, adjust the rendering parameters if you want and call the start function of the Starling main class. When everything is ready, the Sprite-extending class you gave as a parameter to Starling will be instantiated and you can start adding more display objects to it.
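A minimal startup could look like the sketch below. Main and Game are hypothetical class names, and the antiAliasing line is only an example of adjusting a rendering parameter; the Starling constructor, the antiAliasing property and the start function are Starling’s own API. Game would normally live in its own file.

```actionscript
package
{
    import flash.display.Sprite;
    import starling.core.Starling;

    /** Hypothetical document class that starts Starling. */
    public class Main extends Sprite
    {
        private var mStarling:Starling;

        public function Main()
        {
            // Game is instantiated by Starling once the rendering context is ready
            mStarling = new Starling(Game, stage);
            mStarling.antiAliasing = 1; // optional rendering parameter
            mStarling.start();
        }
    }
}

import starling.display.Quad;

/** Hypothetical root class; note that it extends the Starling Sprite. */
class Game extends starling.display.Sprite
{
    public function Game()
    {
        // add display objects here, for example a red 100x100 quad
        addChild(new Quad(100, 100, 0xff0000));
    }
}
```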

That’s all this time. Next post will contain some tips and tricks for tuning the Starling performance.

Back from Adobe MAX


OK, the first post to this blog.

I just returned from Adobe MAX, where an early build of Rovio’s Flash version of Angry Birds was demonstrated in the keynotes and where I also gave a presentation about the project and how we used Flash 11 for bringing the game to life.

My presentation had this really cryptic name “Unleashing the power of Stage3D for in-browser, socially aware on-line games” to keep Rovio’s and Angry Birds’ appearance at MAX a secret. The session was mainly about how we at Rovio used the Starling framework to utilize Flash 11’s GPU accelerated Stage3D API for drawing 2D graphics. Unlike most of the other sessions this one was not recorded, but I’ll try to go through some of the key points about using and tweaking the Starling framework here in my blog. So stay tuned for the next posts!

The MAX event itself was just huge. I met lots of Adobe people over there and really enjoyed the sessions I attended. If you are using any Adobe products and have not been there yet, I really recommend attending next year.