Monthly Archives: October 2011

Starling performance revisited

After running some more performance tests with the Starling demo’s benchmark, it’s time to write once more about the results of the optimizations and about performance in general. Something I didn’t mention in the earlier posts is that I ran all the tests with the Flash Player 11 projector and not with the web browser plugin. This has a huge effect on performance, as this article will show. Also, in the tests for my previous posts my laptop was not running at full speed since some energy-saving features were still on. Some of the original numbers were affected a little, and I’ll correct those in the previous posts too.

So let’s now go to the latest test results. Five different versions of the benchmark were run on four different computers: first the original benchmark using CPU rendering, then the original benchmark with Sprite flattening using CPU rendering, then the optimized Sprite flatten benchmark using CPU rendering (see “Optimizing Starling framework”), then the original benchmark using GPU rendering, and finally the optimized benchmark using GPU rendering (see “Passing the Starling Image count limit”). On one computer I also tested the effect of using different editions of the Flash 11 player. The Starling demo’s benchmark was modified to run at 30 fps like in my previous posts.


This chart shows the number of moving Images at 30 fps on four different computers. Both the CPU and GPU optimizations improve the performance considerably.


The comparison between different computers and Flash player editions shows a few interesting things.

The first interesting issue shows up in the CPU rendering results for the slowest computer (Athlon 64 3000+), marked in blue. There the CPU rendering optimization reaches only 1.4 times the original performance. This is because of the big image in the background of the benchmark scene. Even though it stays still, it needs to be rendered on every frame, and with slower processors and CPU rendering that image alone takes quite some time to draw.

The second interesting issue is that the GPU really does make a difference with Flash 11. When the five-year-old desktop computer with the slowest CPU starts using its GPU (which in fact is not that good) for rendering, it beats the laptop that has no GPU by a huge margin (2500 vs. 570 images) and even gets better results than the fastest computer with CPU rendering (2500 vs. 1970 images).

The third interesting issue is that there is quite a big performance difference between the Flash player editions. The release projector player is naturally the fastest in all the benchmarks, but the order of the release plugin player and the debug projector player depends on the benchmark. I have marked the two interesting cases with red text. The first one is “CPU flatten” with the debug projector player, where the debug player performs really poorly. This is most likely because the original implementation of Sprite’s flatten function creates lots of new instances of different classes and the debug player needs to keep track of them. The second interesting case is the original “GPU” rendering benchmark with the plugin player. It handles less than half as many images as the optimized GPU rendering benchmark, which indicates that vertex buffer access has some serious overhead under plugin players. All in all, performance under the plugin player seems to be within 60-80% of the performance of the projector player. I am hoping that with new player versions the difference will not be this big.

To wrap this all up again, one thing to understand is that with GPU rendering support the variation in performance between different systems can be just massive. In my tests the fastest computer was able to handle around 20 times as many moving images as the slowest one. This means that any Flash 11 game should really offer a way to adjust the graphical detail, both to keep the game running smoothly on slower machines with possibly no GPU and to give some extra visual effects to users who have fast machines with state-of-the-art GPUs. Also worth noticing is that when targeting web browsers the tuning done in “Passing the Starling Image count limit” more than doubles the GPU rendering performance, and when using the CPU for rendering it’s really crucial to have the optimization from “Optimizing Starling framework” in place.
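One possible way to adjust graphical detail at run time is sketched below. It is written in TypeScript rather than ActionScript so that it can run outside Flash Player, and everything in it is an assumption for illustration: the three detail levels, the adjustDetail name and the 10% / 30% thresholds do not come from the benchmark or from Starling. The input is the average time spent updating and rendering one frame, not the interval between frames, since the latter never drops below the budget when the frame rate is capped.

```typescript
// Hypothetical detail levels: 0 = low, 1 = medium, 2 = high.
type DetailLevel = 0 | 1 | 2;

// Picks the next detail level from the average time (in ms) spent on one
// frame's update and render work, compared to the frame budget.
function adjustDetail(
  current: DetailLevel,
  avgRenderMs: number,
  targetFps: number = 30
): DetailLevel {
  const budgetMs = 1000 / targetFps;
  if (avgRenderMs > budgetMs * 1.1 && current > 0) {
    return (current - 1) as DetailLevel; // over budget: drop one level
  }
  if (avgRenderMs < budgetMs * 0.7 && current < 2) {
    return (current + 1) as DetailLevel; // clear headroom: add one level
  }
  return current; // close to the budget: keep the current level
}
```

Calling something like this about once per second, rather than on every frame, should keep the detail level from flickering between two settings.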

That’s all this time. Next post will probably be about handling device loss in Starling.

Passing the Starling Image count limit

Flash player 11 has a limit of 4096 instances for vertex and index buffers. Since in the Starling framework every Image and Quad has its own vertex and index buffer, their total number is limited to 4096 as well. At a 30 fps frame rate a good GPU could handle a whole lot more images than that, so it would be nice to get past this limit. Luckily that’s pretty easy.

The solution is to use shared vertex and index buffers for the Images and Quads. In addition to the instance limit there is also a limit on the size of a single vertex or index buffer, so in this example I am limiting the number of images to N = 8192, which is twice the original limit. To support even more images, simply add several shared buffers.

First change the vertex buffer and index buffer variables in the Quad class to be static so that all instances of Quad and Image will use the same buffers. Also add a static vector of integers for storing available quad indices and a normal integer variable for storing the index this Quad is using.

Create a function for initializing the buffers and the vector for indices. Push integers from 0 to N-1 to this new static vector used for the available Quad indices. Create another local vector of unsigned integers for the index buffer’s data. Push a total of 6*N uints into this vector with the same logic that is used in QuadGroup’s addQuadData function (each six consecutive numbers are the indices for one Quad’s corners, with the first six numbers being 0, 1, 2, 1, 3, 2). The third vector needed is a local vector of Numbers for the initial vertex data. Push a total of VertexData.ELEMENTS_PER_VERTEX*4*N numbers (for example 0) into this vector. Finally create the static vertex buffer with a size of 4*N and the static index buffer with a size of 6*N, and upload the vertices vector into the vertex buffer and the indices vector into the index buffer (check Quad’s createVertexBuffer and createIndexBuffer functions to see how the buffers are created).
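The three vectors described above can be sketched as follows. This is TypeScript rather than ActionScript so that it can run outside Flash Player; plain arrays stand in for Vector.<int>, Vector.<uint> and Vector.<Number>, and the function names are made up for illustration. ELEMENTS_PER_VERTEX = 9 is an assumption standing in for VertexData.ELEMENTS_PER_VERTEX, so check the real constant in your Starling version.

```typescript
const QUAD_COUNT = 8192;       // N from the text
const ELEMENTS_PER_VERTEX = 9; // assumed stand-in for VertexData.ELEMENTS_PER_VERTEX

// Available quad indices 0 .. n-1 (the static vector of integers).
function buildFreeQuadIndices(n: number): number[] {
  const free: number[] = [];
  for (let i = 0; i < n; i++) {
    free.push(i);
  }
  return free;
}

// Index buffer data: six indices per quad, following 0, 1, 2, 1, 3, 2
// and shifted by four vertices for every further quad.
function buildIndexData(n: number): number[] {
  const indices: number[] = [];
  for (let q = 0; q < n; q++) {
    const base = q * 4;
    indices.push(base, base + 1, base + 2, base + 1, base + 3, base + 2);
  }
  return indices;
}

// Initial vertex data: zeros for four vertices per quad.
function buildInitialVertexData(n: number, elementsPerVertex: number): number[] {
  const vertices: number[] = [];
  const total = n * 4 * elementsPerVertex;
  for (let i = 0; i < total; i++) {
    vertices.push(0);
  }
  return vertices;
}
```

With N = 8192 the largest index value is 4*N-1 = 32767, which stays well inside the 16-bit range.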

To initialize the buffers, call the function created in the previous step from Quad’s constructor and add one more static boolean for checking that the buffers are allocated only once. At the end of Quad’s constructor also pop one integer from the static quad index vector. Store this index in the Quad instance since it is needed to specify the offset of the Quad’s data in the buffers. Also modify Quad’s dispose function so that it does not dispose the buffers but simply pushes this index integer back to the vector from which it was popped.
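The pop-on-construct and push-on-dispose bookkeeping can be sketched like this, in TypeScript so that it can run outside Flash Player. The QuadSlotPool class and its method names are hypothetical; in the actual change this logic lives in static members of Quad. Throwing when the pool is empty is also an assumption, since the text does not say what should happen when more than N quads are created.

```typescript
// Hands out slots in the shared buffers and takes them back on dispose.
class QuadSlotPool {
  private free: number[] = [];

  constructor(capacity: number) {
    for (let i = 0; i < capacity; i++) {
      this.free.push(i);
    }
  }

  // Called at the end of the Quad constructor.
  acquire(): number {
    const slot = this.free.pop();
    if (slot === undefined) {
      throw new Error("Shared buffers are full"); // assumed behaviour
    }
    return slot;
  }

  // Called from dispose instead of disposing the buffers.
  release(slot: number): void {
    this.free.push(slot);
  }

  available(): number {
    return this.free.length;
  }
}
```

Because the vector is used as a stack, a slot released by a disposed Quad is the first one handed to the next new Quad.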

The last thing to do is to start using these shared vertex and index buffers properly. Add one boolean to the Quad class for indicating the need to upload the vertex data to the vertex buffer. The initial value for this boolean is naturally true. Replace all the “if (mVertexBuffer) createVertexBuffer();” lines in both Quad and Image with a line setting this boolean to true. Then modify the createVertexBuffer function. It must not create any vertex buffers but only upload the data if the boolean you just started using is true. After uploading the vector, set the boolean to false. Here the second parameter to VertexBuffer3D’s uploadFromVector function is no longer 0 but this Quad’s index*4. You can remove the createIndexBuffer function since there is no need to update the index buffer after it has been created. Finally modify the render function in both Quad and Image. Remove the null checks for the vertex and index buffers and always call the createVertexBuffer function you just modified. Also change the second parameter for Context3D’s drawTriangles function here. It is no longer 0 but this Quad’s index*6.
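The upload flag and the two offsets can be sketched as below, in TypeScript so that it can run outside Flash Player. SharedQuad, invalidate, syncVertexData and firstIndex are made-up names, and instead of calling uploadFromVector the sketch only records the arguments such a call would receive.

```typescript
// Arguments an upload to the shared vertex buffer would receive.
interface UploadCall {
  startVertex: number;
  numVertices: number;
}

class SharedQuad {
  private dirty = true;                 // vertex data not uploaded yet
  readonly uploads: UploadCall[] = [];  // recorded instead of touching a GPU

  constructor(readonly slot: number) {}

  // Replaces the old "if (mVertexBuffer) createVertexBuffer();" lines.
  invalidate(): void {
    this.dirty = true;
  }

  // Stand-in for the modified createVertexBuffer: upload only when needed.
  syncVertexData(): void {
    if (!this.dirty) {
      return;
    }
    this.uploads.push({ startVertex: this.slot * 4, numVertices: 4 });
    this.dirty = false;
  }

  // First index to pass to drawTriangles for this quad.
  firstIndex(): number {
    return this.slot * 6;
  }
}
```

A quad in slot 5 therefore uploads its four vertices starting at vertex 20 and draws its two triangles starting at index 30.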

After doing these modifications try the Starling demo application again with the frame rate set to 30 fps. On my laptop, which I also used for the optimization tests in the previous post, I got almost 8000 images with GPU rendering, so there also seems to be a small 20-25% performance improvement with the projector. With a plugin player the improvement can be over 100%, as discussed in the next post. In any case we can now clearly pass the original limit of 4096 Images.

Twitter account

I just created a Twitter account: @villekoskelaorg

The other account names similar to my name are not mine.

Optimizing Starling framework

Some of you have probably tried the Starling demo application that comes with the Starling framework package. If you have a fairly new computer with a separate GPU and you run the benchmark, you probably get well over 1000 images moving and rotating smoothly at 60 fps on the screen. But what if your computer doesn’t have a GPU and all the rendering is done in software on the CPU? Then the number of images the benchmark is able to render at 60 fps can be under 100! This number of course depends on your computer’s CPU, but in any case it is less than what you would expect to get with traditional display objects. Next I’ll go through a few tips and tricks which should enable you to render 3-4 times as many images as before when using software rendering.

To test the software rendering performance simply set the renderMode parameter given to the Starling class constructor to “software”. To detect that Flash is using software rendering, check whether the current rendering context’s driverInfo starts with “Software”.
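The driverInfo check is a plain string prefix test. A minimal sketch in TypeScript (so that it can run outside Flash Player) is shown below; the function name is made up, and in ActionScript the same comparison would be applied to the driverInfo string of the rendering context.

```typescript
// True when the driver description reports software rendering.
function isSoftwareRendering(driverInfo: string): boolean {
  return driverInfo.indexOf("Software") === 0;
}
```

The check is case sensitive and only looks at the start of the string, so a driver description that merely mentions software somewhere in the middle does not count.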

The following tests were run with the Flash Player 11 projector on my home laptop, which has an Intel Core i5-480M dual-core 2.66 GHz CPU and an nVidia GeForce GT 415M GPU. It is not super powerful but still a fairly new and decent laptop. I also dropped the benchmark’s frame rate to 30 fps since I think that is still an acceptable frame rate.

The original image counts on my laptop at 30 fps were approximately 6200 with GPU rendering and 520 with CPU rendering. Now let’s go to the code changes and their results.


Code change 1: Call flatten for the Sprite containing the images on every frame update.

Starling Sprite has a function called flatten. It precalculates the vertex and index buffers that are needed to render all the Images and Quads the Sprite contains and it divides these buffers into as few QuadGroups as possible. Quad and Image have vertex and index buffers for just two triangles but QuadGroup has vertex and index buffer for all the triangles it will draw making it a lot more optimal for 3D rendering. Images compiled into one QuadGroup need to use SubTextures from the same Texture meaning that the graphics the images use must internally be on the same Stage3D texture. TextureAtlas provides functionality to load this kind of sprite sheet textures and to create the SubTextures.

Result: Since the Images use SubTextures from the same Texture they can all be compiled into just one QuadGroup. The software rendering can now handle 840 (162% compared to original) images.


Code change 2: Simplify the shader code.

The shader programs created in the registerPrograms function in Image class might be a bit too complicated for you. The fragment program is run for every pixel rendered so removing any operations from there improves the performance of CPU rendering a lot. So if you don’t need to be able to adjust the color and alpha values of the different corners of the image replace the original vertexProgramCode and fragmentProgramCode with these:

var vertexProgramCodeSimple:String =
    "m44 op, va0, vc0  \n" +  // 4x4 matrix transform to output clipspace
    "mov v0, va1       \n";   // pass texture coordinates to fragment program

var fragmentProgramCode:String =
    "tex ft1, v0, fs1 <???> \n" +  // sample texture 1
    "mov oc, ft1 \n";              // write the sampled color to the output

Notice that you must not pass the color data to rendering context’s setVertexBufferAt function (originally index 1) any more so remove the line that did that. Also start passing the texture coordinates to vertex buffer at index 1 instead of the original 2.

Result: The software rendering handles 940 (180% compared to original) images after changes 1 and 2.


Code change 3: Render with lower quality.

At some point it might be necessary to sacrifice rendering quality to gain some performance. The easiest way to do this is simply to change the smoothing for Image to TextureSmoothing.NONE.

Result: The software rendering handles 1130 (217% compared to original) images after changes 1, 2 and 3.

So now we have the software rendering running with over two times as many images as before (or 1.8 times as many if you want to keep the quality). But when we try these changes with GPU rendering the number of images is only about 2300, or a little under 40% of the original performance. This is because the implementation of the flatten function is quite far from optimal. Let’s next concentrate on tweaking it.


Code change 4: Optimize the flatten function in Sprite.

The flatten function in Sprite was clearly not designed to be called on every frame update, but using it that way still improves the software rendering performance. With the changes I’ll briefly go through below, the flatten function becomes almost three times as fast as before.

So let’s see what flatten actually does. First it calls unflatten which disposes all the QuadGroups generated with the previous call to flatten. Instead of disposing we should try to simply update the QuadGroups we already have.

Step 1: Add resetting support to QuadGroup.

Add reset and initialize functions to QuadGroup. To support resetting, add variables for storing both the current size and the allocated size for the index and vertex buffers and the indices vector. The reset function simply sets all three of the current sizes to 0 but doesn’t touch the buffers or the vector. The initialize function takes the same parameters as the constructor and sets the values again. Modify the finish function to allocate new buffers only if the current ones are not big enough and to upload the index buffer only when its size changes.
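The current size versus allocated size bookkeeping can be sketched as follows, in TypeScript so that it can run outside Flash Player. The class and field names are made up, sizes are counted in quads for brevity, and the two counters only exist to make the effect visible; a real QuadGroup would create and upload Stage3D buffers at the marked places.

```typescript
class QuadGroupBuffers {
  usedQuads = 0;           // current size, cleared by reset
  allocatedQuads = 0;      // capacity of the existing buffers
  uploadedIndexQuads = 0;  // size the index buffer was last uploaded for
  allocations = 0;         // how many times buffers were (re)created
  indexUploads = 0;        // how many times index data was uploaded

  // Forget the content but keep the buffers.
  reset(): void {
    this.usedQuads = 0;
  }

  addQuad(): void {
    this.usedQuads++;
  }

  finish(): void {
    if (this.usedQuads > this.allocatedQuads) {
      this.allocatedQuads = this.usedQuads; // real code: create bigger buffers
      this.allocations++;
    }
    if (this.usedQuads !== this.uploadedIndexQuads) {
      this.uploadedIndexQuads = this.usedQuads; // real code: upload index data
      this.indexUploads++;
    }
    // Vertex data changes every frame, so it is uploaded on every finish.
  }
}
```

When the same number of quads is flattened on consecutive frames, neither an allocation nor an index upload happens after the first frame.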

Step 2: Start reusing the current QuadGroup instances in Sprite.

Don’t dispose the current QuadGroups in Sprite’s unflatten function but only reset them and pass the current QuadGroup vector as an additional parameter to QuadGroup’s compile function.  Keep passing the index of the currently active QuadGroup as another parameter to QuadGroup’s compileObject function.

Since we are now using a cached vector of QuadGroups we are not necessarily adding quads to the QuadGroup that is last in the QuadGroup vector. The index is used for specifying which instance from the vector is currently used. When we need to use a new QuadGroup we first check if one is available in the next index in the vector and initialize it or if not then create a new one and add it to the vector.
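The lookup described above can be sketched like this, in TypeScript so that it can run outside Flash Player. CachedGroup and nextGroup are hypothetical names, and the single textureId parameter stands in for whatever parameters QuadGroup’s constructor and initialize function really take.

```typescript
class CachedGroup {
  textureId = "";
  quadCount = 0;

  // Same role as the constructor parameters: set the values again.
  initialize(textureId: string): void {
    this.textureId = textureId;
    this.quadCount = 0;
  }
}

// Moves to the next group in the cached vector, reusing an existing
// instance when there is one and creating a new one otherwise.
// Returns the index of the now active group.
function nextGroup(groups: CachedGroup[], currentIndex: number, textureId: string): number {
  const next = currentIndex + 1;
  if (next < groups.length) {
    groups[next].initialize(textureId);
  } else {
    const created = new CachedGroup();
    created.initialize(textureId);
    groups.push(created);
  }
  return next;
}
```

Each flatten call would start from index -1, so that on the second and later calls the cached instances are walked through in the same order and nothing new is allocated unless more groups are needed than before.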

Step 3: Start reusing the VertexData instances.

Add a reset function and the related functionality to VertexData as well. Modify the append function to write into the current index in mData instead of pushing new values into the vector. Start using this new reset and append combination instead of the clone function to avoid creating new instances. The vertexData getter in Image also calls Texture’s adjustVertexData function every time, even though a cached result could be used. The cached value should be reset only when the Image’s texture changes.
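The reset and append combination can be sketched as below, in TypeScript so that it can run outside Flash Player. ReusableVertexData and its members are made-up names and the sketch stores plain numbers only; the point is that append overwrites existing elements at the current index and the backing array only grows when more data than ever before is appended.

```typescript
class ReusableVertexData {
  private data: number[] = []; // stand-in for mData
  private length = 0;          // current write index

  // Start over without releasing the backing array.
  reset(): void {
    this.length = 0;
  }

  // Write at the current index instead of pushing new elements.
  append(values: number[]): void {
    for (let i = 0; i < values.length; i++) {
      this.data[this.length++] = values[i];
    }
  }

  used(): number {
    return this.length;
  }

  capacity(): number {
    return this.data.length;
  }

  valueAt(index: number): number {
    return this.data[index];
  }
}
```

Only the first used() elements are valid after a reset; anything beyond that is left over from an earlier frame and must not be uploaded.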

Step 4: Optimize QuadGroup’s compileObject function.

Don’t pass matrixStack and alphaStack but only the child matrix and child alpha to QuadGroup’s compileObject function. Don’t create a clone of the current matrix for the child matrix but instead allocate the child matrix outside the loop and simply call copyFrom to assign current matrix to the child matrix.

Step 5: Divide the vertex data to position data and other data.

Originally all the vertex data (position, texture position and color) is in the same vector in the VertexData class. When the data is divided into two vectors, so that the first one contains only the position values and the second one the texture position and color values, it is possible to use Matrix3D’s transformVectors function to transform the positions of all the vertices with a single call when handling the quad compiling in QuadGroup’s compileObject function. If the data is divided into two vectors you also need to have two separate vertex buffers in Quad and QuadGroup.
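The benefit of a position-only vector is that it has exactly the layout a batch transform expects: consecutive x, y, z triples and nothing else. The TypeScript sketch below (TypeScript so that it can run outside Flash Player) is a hand-written stand-in for that kind of call, not Flash’s own implementation. It assumes a 16-element matrix in column-major order with the translation in elements 12 to 14.

```typescript
// Transforms every x, y, z triple in vin by matrix m and writes to vout.
// m is a 4x4 matrix as 16 numbers in column-major order (assumed layout).
function transformPositions(m: number[], vin: number[], vout: number[]): void {
  for (let i = 0; i + 2 < vin.length; i += 3) {
    const x = vin[i];
    const y = vin[i + 1];
    const z = vin[i + 2];
    vout[i] = m[0] * x + m[4] * y + m[8] * z + m[12];
    vout[i + 1] = m[1] * x + m[5] * y + m[9] * z + m[13];
    vout[i + 2] = m[2] * x + m[6] * y + m[10] * z + m[14];
  }
}

// One 32x32 quad moved by (10, 20): all four corners in a single call.
const quadPositions = [0, 0, 0, 32, 0, 0, 0, 32, 0, 32, 32, 0];
const translate = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 10, 20, 0, 1];
const movedPositions: number[] = [];
transformPositions(translate, quadPositions, movedPositions);
```

If texture coordinates and colors were interleaved with the positions, the triples would not line up and each vertex would have to be transformed separately.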

One thing to notice here is that the maximum number of vertex (and index) buffers is 4096, so after this modification you should make sure that when Images are flattened their own vertex and index buffers are disposed.

Result: Combining all the changes that do not affect rendering quality (1, 2, and 4) the software rendering now handles 1250 (240% compared to original) images. By using lower rendering quality and all four changes the software rendering can handle 1670 (320% compared to original) images! The GPU rendering still can’t handle more than 5000 images (80% compared to original), meaning that you probably shouldn’t use the flatten function this way with GPU rendering.


To sum this all up, when all or most of the sprites are moving you probably can’t achieve much better GPU rendering performance with any small modifications to the Starling framework. On the other hand, with CPU rendering it is really easy to get two to three times better results than what Starling gives out of the box. Even simply calling the flatten function on the root or close-to-root level sprites on every frame update boosts the CPU rendering performance considerably. The actual results gained from the optimization depend heavily on the processor’s speed but are always noticeable. By using intelligent sprite flattening that is not performed on every frame update, and by uploading vertex data only when there are changes, it is possible to get better frame rates with GPU rendering as well. Just remember that in order to get any performance gain from sprite flattening, the flattened images should be using textures from the same main texture. Since the maximum width and height of a texture are 2048 pixels, you should be able to combine quite a lot of graphics on any single texture.

Post about Starling optimizations

I’ve been working this weekend on a post about optimizing the Starling framework’s software rendering performance. Now it seems that I’ll be able to finish it tomorrow so stay tuned for the next update.

Flash 11 – using Stage3D for 2D graphics

Graphics rendering can be a bottleneck in many games. Especially with Flash, only a limited number of animated graphics can be on the screen at once, and there are also limits on the screen size at which games run smoothly. The just-released Flash 11 comes with the Stage3D API, a low-level GPU-accelerated API for rendering graphics. GPU acceleration takes graphics performance to a whole new level, but to use Stage3D properly one really needs to know 3D programming. Luckily there are already libraries that implement the functionality needed for drawing 2D graphics and do all the Stage3D work behind the scenes. The Starling framework is one of these libraries.

The Starling framework is an open source library that provides the conventional Flash display list architecture while using Stage3D for the actual rendering. Starling is a pure AS3 library, and since it’s free and open source, modifying and tuning it is really simple. The three most fundamental classes in the Starling framework are Texture, Image and Sprite.

Texture can be considered equivalent to BitmapData in the conventional display object world. It wraps the Stage3D texture and can be constructed from BitmapData or Bitmap. To make rendering more efficient one should group the different source images properly on a single BitmapData instance, create a Texture from it and then create SubTextures with clipping rectangles from the Texture.

Image is used for drawing Texture on the screen so it can be considered to be equivalent to Bitmap. Images can be moved, rotated, scaled and their transparency can be adjusted.

Sprite is equivalent to the normal Sprite: it is a display object container that can contain any number of child display objects. A Sprite can also be flattened, which means that changes to its children won’t be visible until the Sprite is flattened again or unflattened. When a Sprite is flattened its content is optimized for rendering, so flattened Sprites should be used whenever possible. One thing to notice is that the Starling Sprite has its rotation in radians, not in degrees like the normal Sprite.

Initializing Starling is really simple. Just create a class that extends the Starling Sprite and pass it along with a few other parameters when creating an instance of the Starling main class, adjust the rendering parameters if you want and call the start function of the Starling main class. When everything is ready, the Sprite-extending class you gave as a parameter to Starling will be instantiated and you can start adding more display objects to it.

That’s all this time. Next post will contain some tips and tricks for tuning the Starling performance.