Optimizing Starling framework

Some of you have probably tried the demo application that comes with the Starling framework package. If you have a fairly new computer with a discrete GPU and you run the benchmark, you probably get well over 1000 images moving and rotating smoothly at 60 fps. But what if your computer doesn’t have a GPU and all the rendering is done in software on the CPU? Then the number of images the benchmark can render at 60 fps can be under 100! This number of course depends on your computer’s CPU, but in any case it is less than what you would expect to get with traditional display objects. Next I’ll go through a few tips and tricks which should enable you to render two to three times as many images as before when using software rendering.

To test software rendering performance, simply set the renderMode parameter passed to the Starling class constructor to “software”. To detect whether Flash is using software rendering, check if the current rendering context’s driverInfo starts with “Software”.
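As a rough illustration of both steps, the sketch below forces software rendering and then inspects driverInfo. Game is a hypothetical root class (a starling.display.Sprite subclass defined elsewhere), the five-argument constructor is assumed to match the Starling version discussed here, and the helper name isSoftwareDriver is made up for this example. The context is created asynchronously, so the check waits until Starling.context is available.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.display3D.Context3D;
    import flash.events.Event;

    import starling.core.Starling;

    public class SoftwareRenderingTest extends Sprite
    {
        private var mStarling:Starling;

        public function SoftwareRenderingTest()
        {
            // Game is a hypothetical root class; "software" equals
            // flash.display3D.Context3DRenderMode.SOFTWARE.
            mStarling = new Starling(Game, stage, null, null, "software");
            mStarling.start();
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        // True when the driver description reports software rendering.
        public static function isSoftwareDriver(driverInfo:String):Boolean
        {
            return driverInfo != null && driverInfo.indexOf("Software") == 0;
        }

        private function onEnterFrame(event:Event):void
        {
            var context:Context3D = Starling.context;
            if (context == null) return; // Stage3D has not created the context yet

            removeEventListener(Event.ENTER_FRAME, onEnterFrame);
            trace("Software rendering: " + isSoftwareDriver(context.driverInfo));
        }
    }
}
```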

The following tests were run with the Flash Player 11 projector on my home laptop, which has an Intel Core i5-480M dual-core 2.66 GHz CPU and an NVIDIA GeForce GT 415M GPU. It is not super powerful, but it is still a fairly new and decent laptop. I also dropped the benchmark’s frame rate to 30 fps, since I consider that still an acceptable frame rate.

The original image counts on my laptop at 30 fps were approximately 6200 with GPU rendering and 520 with CPU rendering. Now let’s go to the code changes and their results.

Code change 1: Call flatten for the Sprite containing the images on every frame update.

Starling’s Sprite has a function called flatten. It precalculates the vertex and index buffers needed to render all the Images and Quads the Sprite contains, and it divides these buffers into as few QuadGroups as possible. A Quad or an Image has vertex and index buffers for just two triangles, whereas a QuadGroup has a single vertex buffer and index buffer for all the triangles it draws, which makes it a lot more efficient for 3D rendering. Images compiled into one QuadGroup need to use SubTextures of the same Texture, meaning that the graphics the images use must internally be on the same Stage3D texture. TextureAtlas provides the functionality to load this kind of sprite sheet texture and to create the SubTextures.
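A minimal sketch of change 1 could look like the following. The atlas texture and XML are assumed to be loaded elsewhere, and the region name “ship” and the class name FlattenedScene are hypothetical. Because all the images share one atlas texture, they are candidates for a single QuadGroup.

```actionscript
package
{
    import starling.display.Image;
    import starling.display.Sprite;
    import starling.events.Event;
    import starling.textures.Texture;
    import starling.textures.TextureAtlas;

    public class FlattenedScene extends Sprite
    {
        private var mContainer:Sprite = new Sprite();

        // atlasTexture and atlasXml are assumed to be loaded by the caller.
        public function FlattenedScene(atlasTexture:Texture, atlasXml:XML, count:int)
        {
            var atlas:TextureAtlas = new TextureAtlas(atlasTexture, atlasXml);
            var texture:Texture = atlas.getTexture("ship"); // hypothetical region name

            for (var i:int = 0; i < count; ++i)
                mContainer.addChild(new Image(texture));

            addChild(mContainer);
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        private function onEnterFrame(event:Event):void
        {
            // Animate the children first ...
            for (var i:int = 0; i < mContainer.numChildren; ++i)
                mContainer.getChildAt(i).rotation += 0.05;

            // ... then rebuild the flattened buffers so this frame's changes are drawn.
            mContainer.flatten();
        }
    }
}
```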

Result: Since the Images use SubTextures of the same Texture, they can all be compiled into just one QuadGroup. Software rendering can now handle 840 images (162% compared to the original).

Code change 2: Simplify the shader code.

The shader programs created in the registerPrograms function of the Image class may be more complicated than you need. The fragment program is run for every pixel rendered, so removing any operations from it improves CPU rendering performance a lot. So if you don’t need to adjust the color and alpha values of the individual corners of the image, replace the original vertexProgramCode and fragmentProgramCode with these:

var vertexProgramCode:String =
    "m44 op, va0, vc0 \n" + // 4×4 matrix transform to output clipspace
    "mov v0, va1 \n"; // pass texture coordinates to fragment program

var fragmentProgramCode:String =
    "tex ft1, v0, fs1 <???> \n" + // sample texture 1
    "mov oc, ft1 \n"; // output the sampled color unchanged

Notice that you must no longer pass the color data to the rendering context’s setVertexBufferAt function (originally at index 1), so remove the line that did that. Also pass the texture coordinates at vertex buffer index 1 instead of the original index 2, because the simplified vertex program reads them from va1.
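The binding change can be sketched as follows. This assumes an interleaved vertex layout of position (3 floats), color (4 floats) and texture coordinates (2 floats), which is why the texture coordinates start at offset 7; check the offsets against the VertexData class in your copy of Starling. The class and constant names are hypothetical, and clearing index 2 is an extra precaution that the text above does not mention.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.Context3DVertexBufferFormat;
    import flash.display3D.VertexBuffer3D;

    public class SimpleImageBindings
    {
        // Assumed interleaved layout per vertex: x, y, z, r, g, b, a, u, v.
        public static const POSITION_OFFSET:int = 0;
        public static const TEXCOORD_OFFSET:int = 7;

        public static function bind(context:Context3D, buffer:VertexBuffer3D):void
        {
            // va0: vertex position
            context.setVertexBufferAt(0, buffer, POSITION_OFFSET,
                                      Context3DVertexBufferFormat.FLOAT_3);

            // va1: texture coordinates (previously bound at index 2);
            // the color data is simply not bound any more.
            context.setVertexBufferAt(1, buffer, TEXCOORD_OFFSET,
                                      Context3DVertexBufferFormat.FLOAT_2);

            // Clear index 2 in case an earlier draw call left a buffer bound there.
            context.setVertexBufferAt(2, null);
        }
    }
}
```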

Result: Software rendering handles 940 images (180% compared to the original) after changes 1 and 2.

Code change 3: Render with lower quality.

At some point it might be necessary to sacrifice rendering quality to gain some performance. The easiest way to do this is simply to change the smoothing of each Image to TextureSmoothing.NONE.
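For example, a small factory like the one below could apply the setting to every image it creates. The class name FastImageFactory is hypothetical; only the smoothing property and the TextureSmoothing constant come from Starling.

```actionscript
package
{
    import starling.display.Image;
    import starling.textures.Texture;
    import starling.textures.TextureSmoothing;

    public class FastImageFactory
    {
        public static function create(texture:Texture):Image
        {
            var image:Image = new Image(texture);
            // Nearest-neighbour sampling: cheaper per pixel, but blockier
            // when the image is scaled or rotated.
            image.smoothing = TextureSmoothing.NONE;
            return image;
        }
    }
}
```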

Result: Software rendering handles 1130 images (217% compared to the original) after changes 1, 2 and 3.

So now software rendering handles over twice as many images as before (or 1.8 times as many if you want to keep the quality). But when we try these changes with GPU rendering, the number of images is only about 2300, or a little under 40% of the original performance. This is because the implementation of the flatten function is quite far from optimal. Let’s next concentrate on tweaking it.

Code change 4: Optimize the flatten function in Sprite.

The flatten function in Sprite was clearly not designed to be called on every frame update, but using it that way still improves software rendering performance. With the changes I’ll briefly go through, the flatten function becomes almost three times as fast as before.

So let’s see what flatten actually does. First it calls unflatten, which disposes all the QuadGroups generated by the previous call to flatten. Instead of disposing them, we should try to simply update the QuadGroups we already have.

Step 1: Add resetting support to QuadGroup.

Add reset and initialize functions to QuadGroup. To support resetting, add variables for storing both the current size and the allocated size of the index buffer, the vertex buffer and the indices vector. The reset function simply sets all three current sizes to 0 but doesn’t touch the buffers or the vector. The initialize function takes the same parameters as the constructor and sets the values again. Modify the finish function to allocate new buffers only if the current ones are not big enough, and to upload the index buffer only when its size changes.
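The idea can be sketched with a small stand-in class. ReusableQuadBuffers is hypothetical and is not Starling’s QuadGroup; it tracks sizes in quads rather than keeping three separate counters, and it assumes 9 floats per vertex. It only shows how reset keeps the allocations alive and how finish reallocates and re-uploads indices only when the buffers have to grow.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.IndexBuffer3D;
    import flash.display3D.VertexBuffer3D;

    // Hypothetical stand-in for the buffer handling inside QuadGroup.
    public class ReusableQuadBuffers
    {
        private static const FLOATS_PER_VERTEX:int = 9; // assumed vertex layout

        private var mIndexData:Vector.<uint> = new <uint>[];
        private var mVertexBuffer:VertexBuffer3D;
        private var mIndexBuffer:IndexBuffer3D;
        private var mNumQuads:int = 0;       // current size
        private var mAllocatedQuads:int = 0; // size the GPU buffers were created for

        // Forget the current contents but keep the buffers and the index vector.
        public function reset():void { mNumQuads = 0; }

        public function addQuads(count:int):void { mNumQuads += count; }

        public function get numQuads():int { return mNumQuads; }

        public function get needsNewBuffers():Boolean { return mNumQuads > mAllocatedQuads; }

        // vertexData must hold 4 * FLOATS_PER_VERTEX numbers per added quad.
        public function finish(context:Context3D, vertexData:Vector.<Number>):void
        {
            if (mNumQuads == 0) return;

            if (needsNewBuffers)
            {
                if (mVertexBuffer) mVertexBuffer.dispose();
                if (mIndexBuffer) mIndexBuffer.dispose();
                mVertexBuffer = context.createVertexBuffer(mNumQuads * 4, FLOATS_PER_VERTEX);
                mIndexBuffer = context.createIndexBuffer(mNumQuads * 6);

                // Every quad uses the same index pattern, so the vector only grows.
                for (var q:int = mIndexData.length / 6; q < mNumQuads; ++q)
                    mIndexData.push(q * 4, q * 4 + 1, q * 4 + 2,
                                    q * 4 + 1, q * 4 + 3, q * 4 + 2);

                mIndexBuffer.uploadFromVector(mIndexData, 0, mNumQuads * 6);
                mAllocatedQuads = mNumQuads;
            }

            // Vertex positions change every frame, so they are always uploaded.
            mVertexBuffer.uploadFromVector(vertexData, 0, mNumQuads * 4);
        }
    }
}
```

When a later frame needs fewer quads than were allocated, the existing buffers are reused as they are and only the first mNumQuads * 4 vertices are refreshed.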

Step 2: Start reusing the current QuadGroup instances in Sprite.

Don’t dispose the current QuadGroups in Sprite’s unflatten function; only reset them, and pass the current QuadGroup vector as an additional parameter to QuadGroup’s compile function. Keep passing the index of the currently active QuadGroup along as another parameter to QuadGroup’s compileObject function.

Since we are now using a cached vector of QuadGroups, we are not necessarily adding quads to the last QuadGroup in the vector. The index specifies which instance in the vector is currently in use. When we need a new QuadGroup, we first check whether one is available at the next index in the vector and, if so, initialize it; if not, we create a new one and add it to the vector.
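The reuse-by-index pattern described above can be sketched in isolation. GroupCache and CachedGroup are hypothetical names, and an integer texture id stands in for the real constructor parameters of QuadGroup.

```actionscript
package
{
    // Hypothetical helper showing how cached groups are reused by index.
    public class GroupCache
    {
        private var mGroups:Vector.<CachedGroup> = new <CachedGroup>[];
        private var mActiveIndex:int = -1;

        // Called instead of disposing: every group is emptied but kept.
        public function resetAll():void
        {
            for each (var group:CachedGroup in mGroups) group.reset();
            mActiveIndex = -1;
        }

        // Moves to the next group, reusing a cached instance when one exists.
        public function startGroup(textureId:int):void
        {
            ++mActiveIndex;
            if (mActiveIndex < mGroups.length)
                mGroups[mActiveIndex].initialize(textureId);
            else
                mGroups.push(new CachedGroup(textureId));
        }

        // Quads always go to the active group, not to the last one in the vector.
        public function addQuad():void { mGroups[mActiveIndex].numQuads++; }

        public function get allocatedGroups():int { return mGroups.length; }
        public function get activeGroups():int { return mActiveIndex + 1; }
    }
}

class CachedGroup
{
    public var textureId:int;
    public var numQuads:int = 0;

    public function CachedGroup(textureId:int) { initialize(textureId); }

    public function initialize(textureId:int):void
    {
        this.textureId = textureId;
        numQuads = 0;
    }

    public function reset():void { numQuads = 0; }
}
```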

Step 3: Start reusing the VertexData instances.

Add a reset function to VertexData as well. Modify the append function to write to the current index in mData instead of pushing new values onto the vector. Use this reset and append combination instead of the clone function to avoid creating new instances. In addition, the vertexData getter in Image calls Texture’s adjustVertexData function every time, even though a cached result could be used. The cached value needs to be reset only when the Image’s texture changes.
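A sketch of the reset and append combination, using a hypothetical ReusableVertexData class rather than Starling’s own VertexData. It assumes 9 numbers per vertex and takes raw number vectors as input to keep the example short.

```actionscript
package
{
    // Hypothetical stand-in for VertexData with reset support.
    public class ReusableVertexData
    {
        public static const ELEMENTS_PER_VERTEX:int = 9; // assumed layout

        private var mData:Vector.<Number> = new <Number>[];
        private var mLength:int = 0; // number of valid elements in mData

        // Keeps the allocated vector and only rewinds the write position.
        public function reset():void { mLength = 0; }

        public function append(source:Vector.<Number>):void
        {
            var count:int = source.length;
            for (var i:int = 0; i < count; ++i)
                mData[mLength + i] = source[i]; // overwrites old values, grows only at the end
            mLength += count;
        }

        public function get numVertices():int { return mLength / ELEMENTS_PER_VERTEX; }
        public function get allocatedElements():int { return mData.length; }
        public function get data():Vector.<Number> { return mData; }
    }
}
```

After a reset, values beyond the current length are stale, so consumers should read only the first numVertices vertices.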

Step 4: Optimize QuadGroup’s compileObject function.

Don’t pass matrixStack and alphaStack to QuadGroup’s compileObject function; pass only the child matrix and child alpha. Don’t create a clone of the current matrix for the child matrix; instead allocate the child matrix outside the loop and simply call copyFrom to assign the current matrix to it.
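The copyFrom pattern can be sketched on its own, assuming the matrices are Matrix3D instances. The class ChildMatrixReuse and the simplified child transform (an x translation only) are hypothetical; the point is that the scratch matrix is allocated once and refilled for every child.

```actionscript
package
{
    import flash.geom.Matrix3D;

    public class ChildMatrixReuse
    {
        // Scratch objects allocated once, outside the per-child loop.
        private static var sChildMatrix:Matrix3D = new Matrix3D();
        private static var sRawData:Vector.<Number> = new Vector.<Number>(16, true);

        // Writes the x position of each child's origin, in the coordinate space
        // of 'current', into 'result'. childXs holds hypothetical x offsets.
        public static function originXs(current:Matrix3D, childXs:Vector.<Number>,
                                        result:Vector.<Number>):void
        {
            result.length = 0;
            for each (var childX:Number in childXs)
            {
                sChildMatrix.copyFrom(current);                // instead of current.clone()
                sChildMatrix.prependTranslation(childX, 0, 0); // the child's own transform
                sChildMatrix.copyRawDataTo(sRawData);          // column-major, tx at index 12
                result.push(sRawData[12]);
            }
        }
    }
}
```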

Step 5: Divide the vertex data to position data and other data.

Originally all the vertex data (position, texture position and color) is in the same vector in the VertexData class. When the data is divided into two vectors, so that the first contains only the position values and the second the texture position and color values, it is possible to use the Matrix3D function transformVectors to transform the positions of all the vertices with a single call when compiling quads in QuadGroup’s compileObject function. If the data is divided into two vectors, you also need two separate vertex buffers in Quad and QuadGroup.
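As a small illustration of why the split helps, the sketch below keeps the positions of one quad as plain x, y, z triples, which is the input format transformVectors expects. The class PositionBatch and its function names are hypothetical.

```actionscript
package
{
    import flash.geom.Matrix3D;

    public class PositionBatch
    {
        // Positions only: x, y, z per corner, with no texture or color values mixed in.
        public static function quadPositions(width:Number, height:Number):Vector.<Number>
        {
            return new <Number>[0,     0,      0,
                                width, 0,      0,
                                0,     height, 0,
                                width, height, 0];
        }

        // One native call transforms all four corners; 'out' receives the result.
        public static function transformQuad(matrix:Matrix3D,
                                             positions:Vector.<Number>,
                                             out:Vector.<Number>):void
        {
            matrix.transformVectors(positions, out);
        }
    }
}
```

With interleaved data the same transform would need a loop that picks the three position values out of every vertex by hand.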

One thing to note here is that the maximum number of vertex buffers (and of index buffers) is 4096, so after this modification you should make sure that when Images are flattened their own vertex and index buffers are disposed.

Result: Combining all the changes that do not affect rendering quality (1, 2 and 4), software rendering now handles 1250 images (240% compared to the original). With lower rendering quality and all four changes, software rendering can handle 1670 images (320% compared to the original)! GPU rendering still can’t handle more than 5000 images (80% compared to the original), meaning that you probably shouldn’t use the flatten function this way with GPU rendering.

To sum this all up: with CPU rendering it is really easy to get two to three times better results than what Starling gives out of the box. Even simply calling the flatten function on the root or close-to-root level sprites on every frame update boosts CPU rendering performance considerably. The actual gains from the optimization depend heavily on the processor’s speed but are always noticeable. By using intelligent sprite flattening that is not performed on every frame update, and by uploading vertex data only when there are changes, it is possible to get better frame rates with GPU rendering as well. Just remember that in order to get any performance gain from sprite flattening, the flattened images should use textures from the same main texture. Since the maximum width and height of a texture are 2048 pixels, you should be able to combine quite a lot of graphics on a single texture.

Comments

Peter Macinkovic (@inkovic) On October 19, 2011 at 1:11 pm

Great info Ville. Very clever optimization information, thank you very much. Incredible CPU rendering performance manipulating the Starling framework like that.
Thibault Imbert On October 22, 2011 at 12:01 am

Agreed! Those changes will be applied to Starling pretty soon. Sweet!
- villekoskela On October 22, 2011 at 11:28 am
  
  That’s excellent news Thibault!
Rahul Tandel On December 19, 2012 at 8:52 am

I have faced the issue to my game load on tab that will crash after to play 3 state.
I check it will take 20 draw calls that are executed per frame.
Any one please help regarding to me to reduced the call and optimize the code and reduced the call also i will get error #3691: Resource Limit for this resource type exceeded.
please help me.
- villekoskela On February 22, 2013 at 11:16 am
  
  The algorithm works so that it needs to have initial limits where it starts packing the rectangles. This means you will have to give these dimensions.
