Category Archives: ActionScript 3

Boosting Starling 1.3 performance by 50%

It’s been a while since my last Starling blog post. The Starling version I was tuning back then was 0.9.1, while the current one is already 1.3. Since then lots of nice features have been added to the framework, including some performance optimizations, but I thought now would be a good time to check whether there is still something to improve on the performance side. So let’s start optimizing Starling 1.3.

Optimizing DisplayObject

The first place to start tuning is the DisplayObject class and its get transformationMatrix function. As its first step the function always sets the transformation matrix to the identity matrix. Then a series of if clauses checks whether the matrix should be scaled, skewed, rotated, translated or modified because of the pivot point. These operations will most likely overwrite the initial values that were set by the identity call.

So how could we improve the performance here? Let’s first divide the function into two cases: one where the display object is skewed (most likely not a very common case) and one where it’s not. For the skewed case we can use the original implementation, but for the more common case we do things a bit differently. First we stop setting the matrix to identity altogether. Instead we check if the display object is rotated, and if not we simply set the matrix to do the proper scaling and translation like this:

mTransformationMatrix.a = mScaleX;

mTransformationMatrix.b = 0.0;

mTransformationMatrix.c = 0.0;

mTransformationMatrix.d = mScaleY;

mTransformationMatrix.tx = mX - mPivotX * mScaleX;

mTransformationMatrix.ty = mY - mPivotY * mScaleY;

With this implementation there is no need to compare any of the values against zero or one, and there is no overhead from function calls. This is about as fast as the single call to the identity function that we removed, and we already have the final matrix. For the translation there is really no point in checking whether the values are zero, since in most cases they are not and the if check just adds one extra step to the execution. It’s also good to keep in mind that adding zero or multiplying by one doesn’t change the result.

For the case where the display object is rotated we calculate the cosine and sine of the angle and then do the matrix math ourselves by setting the transformation matrix’s a to mScaleX * cosA, b to mScaleX * sinA, c to -mScaleY * sinA and d to mScaleY * cosA, which is what scaling followed by rotation produces. tx and ty start from mX and mY. Again there is no need to check whether values differ from one or zero to avoid the extra steps. For the pivot handling we divide the original if clause into two parts, first checking if pivotX is not zero and then pivotY, and subtract the pivot offset using the rotated a, b, c and d. Here the number of operations possibly avoided by the check justifies its cost.

With these simple changes the Starling benchmark scene should be able to handle 5-10% more images.
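Here is a minimal sketch of the two non-skewed cases. FastTransform and its public fields are hypothetical stand-ins, not Starling’s actual DisplayObject, and the skewed case is left out. The rotated entries are simply what you get by applying scale and then rotate to a flash.geom.Matrix.

```actionscript
package
{
    import flash.geom.Matrix;

    // Hypothetical stand-in for the matrix-related fields of a display object.
    public class FastTransform
    {
        public var x:Number = 0.0;
        public var y:Number = 0.0;
        public var pivotX:Number = 0.0;
        public var pivotY:Number = 0.0;
        public var scaleX:Number = 1.0;
        public var scaleY:Number = 1.0;
        public var rotation:Number = 0.0; // radians

        private var mMatrix:Matrix = new Matrix();

        // Builds the matrix for the non-skewed case without calling identity().
        public function get transformationMatrix():Matrix
        {
            var m:Matrix = mMatrix;

            if (rotation == 0.0)
            {
                // Scale and translate only; every field is overwritten.
                m.a = scaleX;
                m.b = 0.0;
                m.c = 0.0;
                m.d = scaleY;
                m.tx = x - pivotX * scaleX;
                m.ty = y - pivotY * scaleY;
            }
            else
            {
                var cosA:Number = Math.cos(rotation);
                var sinA:Number = Math.sin(rotation);

                // Same result as scale(scaleX, scaleY) followed by rotate(rotation).
                m.a = scaleX * cosA;
                m.b = scaleX * sinA;
                m.c = -scaleY * sinA;
                m.d = scaleY * cosA;
                m.tx = x;
                m.ty = y;

                // Pivot offset, split in two so that a zero pivot costs only a comparison.
                if (pivotX != 0.0)
                {
                    m.tx -= m.a * pivotX;
                    m.ty -= m.b * pivotX;
                }
                if (pivotY != 0.0)
                {
                    m.tx -= m.c * pivotY;
                    m.ty -= m.d * pivotY;
                }
            }
            return m;
        }
    }
}
```

The getter returns the same Matrix instance on every call, so copy it if you need to keep the values around.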

Optimizing VertexData

The next class to start tuning is VertexData. It has copyTo and transformVertex functions that get called whenever a quad is added to a quad batch. Here we apply a similar idea as in the previous step: instead of first copying and then doing the matrix transformation on the values, we pass the transformation matrix as a parameter to the copyTo function and assign the transformed values directly to the target data.

This change should again slightly improve the performance.
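The copy-and-transform step could look roughly like this. PositionCopy and copyTransformed are hypothetical names; Starling’s own copyTo has a different signature and works on interleaved data, so treat this only as an illustration of writing the transformed values straight into the target.

```actionscript
package
{
    import flash.geom.Matrix;

    // Hypothetical helper, not part of Starling.
    public class PositionCopy
    {
        // Copies x/y pairs from source into target and applies the matrix on the way.
        // Assumes targetOffset is not greater than target.length.
        public static function copyTransformed(source:Vector.<Number>,
                                               target:Vector.<Number>,
                                               targetOffset:int,
                                               matrix:Matrix):void
        {
            // Read the matrix fields once instead of on every vertex.
            var a:Number = matrix.a;
            var b:Number = matrix.b;
            var c:Number = matrix.c;
            var d:Number = matrix.d;
            var tx:Number = matrix.tx;
            var ty:Number = matrix.ty;
            var count:int = source.length;

            for (var i:int = 0; i < count; i += 2)
            {
                var px:Number = source[i];
                var py:Number = source[i + 1];
                target[int(targetOffset + i)]     = a * px + c * py + tx;
                target[int(targetOffset + i + 1)] = b * px + d * py + ty;
            }
        }
    }
}
```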

The next change is the biggest one and will also have the biggest effect on performance. Starling now has a tinted parameter telling whether the quad or image uses coloring or is partially transparent, but it’s not really used for anything other than selecting the correct fragment shader. Since most images will probably neither use coloring nor be partially transparent, their color data does not need to be copied between vertex data instances or sent to the GPU when updating the vertex buffers. To achieve this optimization we divide the mRawData vector in the VertexData class into three separate vectors: one for position data, another for color data and a third for texture data. After the changes in the VertexData class it’s also necessary to have three vertex buffers in QuadBatch to match the three vectors.

When we start passing the tinting parameter to the VertexData class copyTo function, we can copy the color data only if tinting is in use. The same logic can be used in the QuadBatch class syncBuffers function so that the color vertex buffer is updated only if tinting is used. Each vertex holds two position values, four color values and two texture coordinates, so when tinting is not used this halves the amount of data first copied between VertexData instances and later uploaded to the vertex buffers.
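To make the split concrete, here is a small sketch with one vector per attribute. SplitVertexData is a hypothetical class and the sizes (two position values, four color values and two texture coordinates per vertex) are my assumption about the layout; the matrix transformation is left out to keep the focus on the tinted flag.

```actionscript
package
{
    // Hypothetical split layout: one vector per attribute instead of one interleaved vector.
    public class SplitVertexData
    {
        public var positions:Vector.<Number>; // x, y per vertex
        public var colors:Vector.<Number>;    // r, g, b, a per vertex
        public var texCoords:Vector.<Number>; // u, v per vertex

        public function SplitVertexData(numVertices:int)
        {
            positions = new Vector.<Number>(numVertices * 2);
            colors    = new Vector.<Number>(numVertices * 4);
            texCoords = new Vector.<Number>(numVertices * 2);
        }

        // Copies this data into target starting at targetVertexID.
        // The color values are touched only when tinted is true.
        // Assumes the target was created with enough vertices.
        public function copyTo(target:SplitVertexData, targetVertexID:int, tinted:Boolean):void
        {
            copyRange(positions, target.positions, targetVertexID * 2);
            copyRange(texCoords, target.texCoords, targetVertexID * 2);
            if (tinted) copyRange(colors, target.colors, targetVertexID * 4);
        }

        private static function copyRange(source:Vector.<Number>,
                                          target:Vector.<Number>, offset:int):void
        {
            var count:int = source.length;
            for (var i:int = 0; i < count; ++i)
                target[int(offset + i)] = source[i];
        }
    }
}
```

A batch built this way would upload the colors vector to its own vertex buffer, and only when the batch is tinted.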

After these changes you can expect around 40% improvement to the image count the Starling benchmark scene can handle. The improvement should happen both on desktops and mobile devices.

Optimizing event handling

If you still want to improve the performance, the next place to start tuning is the event dispatching. For example, on each frame update when the Stage class advanceTime function is called, it iterates through all the display objects to collect a list of those that are actually listening to the enter frame event. Since most likely only a couple of your thousands of display objects are listening to this event, a more efficient way is to keep a list of these display objects in the Stage class. This can be achieved by making the display objects report the addition and removal of enter frame listeners through their parents all the way up to the Stage. You will also need to modify the function for setting the display object’s parent a bit so that the changes are properly reported to the Stage.

With this additional change you can expect a 50% improvement in the image count handled by the Starling benchmark scene. In my opinion this is quite a remarkable achievement.
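The list kept by the Stage can be as simple as the sketch below. EnterFrameRegistry is a hypothetical class that stores plain callback functions; in Starling itself you would store the listening display objects and dispatch a real event to them, and the reporting through the parent chain is not shown here.

```actionscript
package
{
    // Hypothetical registry a Stage could own; the names are illustrative, not Starling API.
    public class EnterFrameRegistry
    {
        private var mListeners:Vector.<Function> = new <Function>[];

        // Called when something starts listening to the enter frame event.
        public function add(listener:Function):void
        {
            if (mListeners.indexOf(listener) == -1) mListeners.push(listener);
        }

        // Called when the listener is removed or its owner leaves the stage.
        public function remove(listener:Function):void
        {
            var index:int = mListeners.indexOf(listener);
            if (index != -1) mListeners.splice(index, 1);
        }

        // Calls only the registered listeners instead of walking the whole display tree.
        public function dispatch(passedTime:Number):void
        {
            // Iterate over a copy because listeners may unregister while running.
            var snapshot:Vector.<Function> = mListeners.concat();
            for each (var listener:Function in snapshot)
                listener(passedTime);
        }

        public function get numListeners():int { return mListeners.length; }
    }
}
```

With this in place the cost of the enter frame dispatch depends on the number of listeners, not on the number of display objects.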

Rectangle packing

Something that the Starling framework, for example, is still missing is a rectangle packing utility with which you could generate a texture atlas at run time. After some Googling I didn’t come across any really good examples, so I spent a couple of hours writing one of my own. Since this is a free-time project of mine, this time I am also releasing the source code.

Rectangle packing

The idea in rectangle packing is to place smaller rectangles inside a bigger container rectangle as tightly as possible. This is especially useful when generating big textures containing many subtextures. My implementation uses the concept of “free rectangles” within the main rectangle. The packed rectangles are always placed in the top left corner of some free rectangle that they completely fit into. To get very close to optimal packing, the topmost of the leftmost free rectangles the packed rectangle fits into is selected for placing.

The algorithm

Initially there is naturally only one “free rectangle”, the main rectangle itself. After packing the first rectangle, the original free rectangle is removed and there are from zero to two new free rectangles: if the packed rectangle is as big as the container there are no more free rectangles; if the packed rectangle is as wide or as tall as the container there is one free rectangle either below or on the right side of it; and if the packed rectangle is smaller there is one free rectangle below it and one on its right side. Packing the next rectangle happens the same way, except that every other free rectangle the packed rectangle intersects is also cut into new smaller free rectangles around the packed rectangle. During the process all the free rectangles that are fully contained by another free rectangle are removed.
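The cutting and cleanup steps can be sketched as follows. This is not the code from my released utility, just a minimal illustration; FreeRectangles, split and prune are hypothetical names, and the new free rectangles are allowed to overlap each other.

```actionscript
package
{
    import flash.geom.Rectangle;

    // Hypothetical helper illustrating the free rectangle bookkeeping.
    public class FreeRectangles
    {
        // Returns the parts of free that packed leaves uncovered, as up to four
        // (possibly overlapping) rectangles: above, below, left and right.
        public static function split(free:Rectangle, packed:Rectangle):Vector.<Rectangle>
        {
            var result:Vector.<Rectangle> = new <Rectangle>[];

            if (!free.intersects(packed))
            {
                result.push(free.clone()); // nothing to cut
                return result;
            }
            if (packed.top > free.top)
                result.push(new Rectangle(free.x, free.y, free.width, packed.top - free.top));
            if (packed.bottom < free.bottom)
                result.push(new Rectangle(free.x, packed.bottom, free.width, free.bottom - packed.bottom));
            if (packed.left > free.left)
                result.push(new Rectangle(free.x, free.y, packed.left - free.left, free.height));
            if (packed.right < free.right)
                result.push(new Rectangle(packed.right, free.y, free.right - packed.right, free.height));

            return result;
        }

        // Removes every rectangle that is fully contained in another one.
        public static function prune(rects:Vector.<Rectangle>):void
        {
            for (var i:int = rects.length - 1; i >= 0; --i)
            {
                for (var j:int = 0; j < rects.length; ++j)
                {
                    if (i != j && rects[j].containsRect(rects[i]))
                    {
                        rects.splice(i, 1);
                        break;
                    }
                }
            }
        }
    }
}
```

Packing one rectangle then means calling split for every free rectangle it intersects, collecting the results into the free list and running prune on that list.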


The image below shows how the free rectangles are combined. The image on the left side shows that there are two free rectangles after placing the first rectangle. Placing the second rectangle would divide the free rectangle “2” into two new free rectangles (on the right side and below) and also the free rectangle “1” into three new free rectangles (above, on the right side and below) but here two of these new free rectangles are completely contained in the bigger free rectangles so the total amount of free rectangles after packing the second rectangle is three.

To see this rectangle packing in action click the image below. Drag the orange circle at the bottom right corner of the container rectangle with your mouse (keeping the left mouse button down) to see how packed rectangles move around the area. With a decent computer the packing of 500 rectangles takes about 1 millisecond.

You can download the full source code for the demo from GitHub.

As the copyright notice in the source files says, you may use and/or modify the source code freely, but do not remove the copyright notice or move the files into other packages. If you find the utility especially useful you can mention me in the credits too.

Update: The version available since 22nd of August 2012 runs at almost 10 times the speed of the original one.

Tuning Starling based games

All the optimizations in your rendering library won’t help your game’s performance unless you use the library properly. I’ll now go through a few basic ideas we used with the Angry Birds Facebook version to get the most out of the Starling framework.

Texture usage

Now that Starling automatically generates the QuadBatches when rendering images, it is important that display objects share the same base texture whenever possible. To achieve this, use a big base texture and then use subtextures of it for the images. When all your graphics won’t fit into one 2048×2048 texture, group them smartly so that, for example, background graphics use the same base texture, game objects use another shared texture and so on. Consider writing your own logic for combining smaller bitmaps into a big one for the texture generation so that you can modify these textures, for example when moving from one game level to another. Also note that even if Flash 11 should guarantee 128MB of texture memory, that means only 8 of these 2048×2048 textures, since each one takes 16MB at 4 bytes per pixel. On some setups even this might be too much, so don’t create textures unless you really need them, and dispose of a texture when you don’t need it any more.

Display object usage

No matter how active your game’s visuals are, probably not all of the graphics are moving or changing. If you have static elements like this, for example a background that contains several sprites and images, place the element’s display objects within one sprite and then flatten the sprite after its content is set. You will still be able to move, scale and rotate the background, but it will render a lot faster than without flattening. If possible, also add some quick checks for whether the parent display object is within the visible display area. Set the visibility of the objects accordingly to speed up the rendering.

Background coloring

The whole Stage3D rendering context needs to be cleared on every frame update no matter how much you render on it. By default Starling uses the initial stage color for the clearing, but if your game happens to have a solid color sky, ground or whatever that is drawn behind all the other graphics, add a setter for the clearing color and set it to that color. This way painting the whole canvas with the color is basically a free operation since the clearing has to be done anyway. If you have several of these single color areas, use Quads for the smaller ones since they are a lot faster than Images with single color textures.

Detail adjustments

Computers that use software rendering will most likely have problems running your game as smoothly as those that use hardware rendering. Detect if the game is running using software rendering and then drop some details that are not necessary for the game. Consider using lower rendering quality for the graphics and limit the amount of particles if you are using the particle emitter extension.

Overlay graphics

It’s possible to have conventional Flash display objects on top of Starling’s Stage3D graphics. Using them for UI elements, for example, might make sense, but remember not to touch the attributes of the display objects if they have not changed. If these sprites are not updated on every frame they should not have too big an effect on the frame rate.

That’s all this time. With these simple tricks you should get your Starling based game running smoothly on most computers.

Naming conventions and obfuscation

Just fixed a nasty bug that was caused by the fact that we had added a new XML element and an ActionScript class that had exactly the same name. This in itself wouldn’t cause any problems, but when the code was run through an obfuscator the problems started. The part of the code that referenced this child element in the XML, not the class with the same name, also got obfuscated, and naturally it then no longer matched the XML files read in.

The solution for the problem was as easy as to simply follow common naming conventions – start the names of the classes with upper case letters and the names of XML elements with lower case letters. After converting the names of the new XML elements to start with lower case also the obfuscated code started to work fine again.

Starling gets wings

The new version 0.9.1 of the Starling framework is now available for download, and I have to say that Daniel has done an excellent job rewriting some of the crucial parts of the rendering code. Now the CPU rendering is running a little faster than the old Starling with all the optimizations from my previous posts, but the GPU rendering has improved even more. On my laptop the new Starling framework can handle 35% more images than the old version with my optimizations. The exact performance depends on the GPU, but I would say that the new approach is definitely the way to go with GPU rendering. But even if version 0.9.1 brings in serious performance enhancements, it’s still possible to make it a tiny bit faster with relatively small changes.

The biggest change in the new Starling framework was that all the Image and Quad rendering is now done in batches. This means that during the rendering up to 8192 consecutive Images using same base texture and rendering options are combined into same vertex buffer and then rendered with single drawTriangles call. The idea is the same used in the flatten function earlier but now it’s done automatically behind the scenes. This means that on every frame there are some pretty heavy matrix calculations required to generate the vertex buffer data. So the first place we start tweaking is the VertexData class.

For every Image and Quad their VertexData is first copied in the addQuad function of QuadBatch with VertexData’s copyTo function and then matrix transformed with transformQuad function. Within the transformQuad function the position values are again copied back and forth between Vectors. To make these two operations more optimal the mRawData Vector in VertexData needs to be divided into two – one Vector containing the position values and another containing the texture and color values. The transformation matrix is added as a parameter to copyTo function and the whole transformQuad function is dropped. In the copyTo function the now separate position value Vector can be given as an input to Matrix3D’s transformVectors function. Then the transformed values are copied from the sPositions Vector into the target VertexData instance’s position value Vector. With this change we get rid of two unnecessary Vector copying rounds.

After dividing the data in VertexData into position data and texture and color data, the latter should be stored in a ByteArray instead of a Vector. Uploading ByteArray data into a vertex buffer is a lot faster, and since the color and texture data is probably not changing too often, the time it takes to set the values is not that important. Switching from Vector to ByteArray requires changes in so many functions that I won’t go through them, but a few things to remember are to set the ByteArray’s endian to Endian.LITTLE_ENDIAN, to always set the length and position correctly, and to use the readFloat and writeFloat functions to read and write the ByteArray.
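The ByteArray bookkeeping could look something like this. ColorTexBytes is a hypothetical class, and the layout of four color floats followed by two texture coordinate floats per vertex is an assumption made for the sketch.

```actionscript
package
{
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    // Hypothetical container for the color and texture data of a batch.
    public class ColorTexBytes
    {
        public static const FLOATS_PER_VERTEX:int = 6; // r, g, b, a, u, v (assumed layout)
        public static const BYTES_PER_VERTEX:int = FLOATS_PER_VERTEX * 4;
        private static const TEXCOORD_BYTE_OFFSET:int = 16; // skips r, g, b, a

        private var mData:ByteArray = new ByteArray();

        public function ColorTexBytes(numVertices:int)
        {
            mData.endian = Endian.LITTLE_ENDIAN; // required for vertex buffer uploads
            mData.length = numVertices * BYTES_PER_VERTEX;
        }

        public function setTexCoords(vertexID:int, u:Number, v:Number):void
        {
            // Always set the position explicitly before writing.
            mData.position = vertexID * BYTES_PER_VERTEX + TEXCOORD_BYTE_OFFSET;
            mData.writeFloat(u);
            mData.writeFloat(v);
        }

        public function getU(vertexID:int):Number
        {
            mData.position = vertexID * BYTES_PER_VERTEX + TEXCOORD_BYTE_OFFSET;
            return mData.readFloat();
        }

        public function get data():ByteArray { return mData; }
    }
}
```

The data getter is what you would hand to VertexBuffer3D’s uploadFromByteArray function together with the byte offset, the start vertex and the vertex count.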

The changes in VertexData class need to be handled also in QuadBatch, Image and Quad classes. In QuadBatch two vertex buffers are now required – one for the position data and another for the texture and color data. Image and Quad need to have the transformation matrix as a parameter in their copyVertexDataTo functions.

Other places for small optimizations include the isStateChange function in QuadBatch, since checking first whether there are no quads is definitely not the optimal order. RenderSupport, on the other hand, uses the get currentQuadBatch function a lot, so a minor tweak here is to store the current QuadBatch in a separate variable and use that instead.

After these changes you can expect about 5-10% better results with the Starling demo when using GPU rendering. With CPU rendering the improvement is a lot smaller.

Since with this version of Starling the number of Images on screen is getting really high, the event handling also needs to be well optimized. One place to gain performance is the Stage’s advanceTime function. On every frame update it calls the dispatchEventOnChildren function with the enter frame event. dispatchEventOnChildren goes through all the possibly over ten thousand child display objects and checks if they are listening to this event. The chances are that only one display object in your application is actually interested in this event, so this is really inefficient. A quick fix is to require that only display objects that are direct children of the Stage can receive the enter frame event. This way you can limit the check to the root display object in the advanceTime function. This change should improve the Starling demo performance by another 5-10% when using GPU rendering. Another thing to notice is that while the number of rendered Images may rise in total by about 10-15% with GPU rendering, the CPU load will also drop significantly. This means that there is more time per frame for your actual application logic.

The speed comparisons between original Starling, original Starling with my optimizations from the previous posts and the new Starling are shown in the chart below. The results with the new Starling framework are on the bottom four rows. [Edit: The tests were run at 30 fps]

[Chart: Starling new speed]

One interesting thing is that at least on Windows 7 the 64bit Internet Explorer 9 Flash plugin plays the Stage3D content even faster than the Flash projector. Firefox on the other hand performs really poorly achieving only about half of Internet Explorer’s performance. This means that at the moment if you are developing a top notch Flash 11 application for web you probably should recommend the users to avoid using Firefox.

To wrap things up Starling 0.9.1 is definitely an update everyone has been waiting for. It gives really good performance boost compared to the previous version and if the device loss handling is also added hopefully in the next version then there’s basically no reason not to use Starling for 2D rendering when you are developing a Flash 11 application.

Handling rendering “device loss” with Starling

After Flash 11 brought the GPU accelerated 3D rendering with Stage3D it became necessary to handle the possible error conditions Direct3D and OpenGL developers are familiar with. You might expect Starling framework to take care of errors like “device loss” for you but unfortunately this is not the case. I’ll now briefly explain how to properly detect and handle the device loss so that your Flash application won’t crash.

Rendering “device loss” happens when the GPU hardware becomes unavailable to the application – for example when Windows operating system goes to lock screen [this doesn’t seem to happen with all the different Windows versions / hardware]. When a device loss happens, the current Context3D instance and all the buffers, textures and shader programs created with it are disposed. This means that calling any functions on them will throw an exception and crash the application if the exception is not caught.

Luckily the device loss is really easy to detect. Simply check the value of your Context3D instance’s driverInfo and if it is “Disposed” you’ll know the device loss has occurred and the context has been disposed. This check should be added in the beginning of Starling class’s render function and if you detect the context has been disposed simply return from the function immediately. Remember to check this condition also in your application’s enterFrame handler and return from there also if the context has been disposed. Trying to create any new Textures or calling functions that would modify the buffers for example in Image will also throw an exception.
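The check itself is tiny. ContextCheck and isLost are hypothetical names; the only runtime API used is the driverInfo property of Context3D.

```actionscript
package
{
    import flash.display3D.Context3D;

    // Hypothetical helper for guarding render and enter frame code.
    public class ContextCheck
    {
        // True when there is no context or it has been disposed by a device loss.
        public static function isLost(context:Context3D):Boolean
        {
            return context == null || context.driverInfo == "Disposed";
        }
    }
}
```

In Starling’s render function and in your own enterFrame handler the first line would then be something like if (ContextCheck.isLost(context)) return;.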

After we have survived the device loss it’s time to prepare for the creation of the new device. When the rendering device becomes available again, the Flash runtime generates a new Context3D instance automatically and dispatches a CONTEXT3D_CREATE event. Starling is listening to this event, but the handler function onContextCreated was not designed to handle more than one context creation, so we need to modify it a little. In the beginning of the function set mContext to null and create a new Dictionary for mPrograms. Now Starling is running again with a valid context and shader programs, but all the textures and vertex/index buffers are still invalid.

To handle the texture updating you should listen to the CONTEXT3D_CREATE event also outside Starling. If the event occurs a second time you need to reinitialize all the textures. The easiest way to achieve this is to add a function to the Starling Texture classes for creating a new base texture and to have all the Texture instances managed by a single class. This manager should have the original bitmap / byte array data required for reinitializing the textures available either in memory or on disk. By using centralized management like this you don’t need to create new Texture instances but can continue using the existing ones no matter where they are, and you also avoid any changes to the actual application rendering logic.

The last step is to create new vertex and index buffers with the new context. One way to do this is to add a running id number for the contexts created in Starling. Whenever a new context is created this number is increased by one. Then in all the places where vertex and index buffers are used store this context id when the buffers are created and when you use the buffers check that the id the buffers were created for is the same as the current id. If they are not the same just create new buffers for the current context and store the new id.
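One possible shape for the context id bookkeeping is shown below. TrackedVertexBuffer and the static currentContextID counter are hypothetical; in a modified Starling the counter would live in the Starling class and be increased in the context created handler.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.VertexBuffer3D;

    // Hypothetical holder that recreates its buffer when the context has changed.
    public class TrackedVertexBuffer
    {
        // Increase by one every time a new Context3D is created.
        public static var currentContextID:int = 0;

        private var mBuffer:VertexBuffer3D = null;
        private var mContextID:int = -1; // id of the context the buffer was created for
        private var mNumVertices:int;
        private var mData32PerVertex:int;

        public function TrackedVertexBuffer(numVertices:int, data32PerVertex:int)
        {
            mNumVertices = numVertices;
            mData32PerVertex = data32PerVertex;
        }

        // True when the buffer is missing or belongs to an older context.
        public function get isStale():Boolean
        {
            return mBuffer == null || mContextID != currentContextID;
        }

        // Returns a buffer that is valid for the given (current) context.
        // A freshly created buffer is empty, so the caller has to upload the data again.
        public function getBuffer(context:Context3D):VertexBuffer3D
        {
            if (isStale)
            {
                mBuffer = context.createVertexBuffer(mNumVertices, mData32PerVertex);
                mContextID = currentContextID;
            }
            return mBuffer;
        }
    }
}
```

The same pattern works for index buffers with createIndexBuffer.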

After these modifications your Flash application should survive the device loss. Newer versions of operating systems and the Flash player might get better at not losing the device, but it’s always better to be safe than sorry.