Angry Birds goes Flash 11

Yesterday was a remarkable day for our pretty new Flash team at Rovio, since the very first Angry Birds game using Flash 11 went live. As mentioned in my Adobe MAX presentation, we are using Stage3D and the Starling framework for the graphics rendering, and so far we’ve been very happy with the results. Later, with the data and feedback collected from this project, we’ll hopefully be able to find solutions to all the remaining issues, no matter how small they are. If the findings are interesting, you can expect to read about them here on my blog.

To try the game yourself just visit and register to play. The game is in Spanish since it was made for PepsiCo Mexico, so as a tip: the fields in the registration form are “nick name”, “password” (twice), “email” and “age”, with the rest being optional.

Starling gets wings

The new version 0.9.1 of the Starling framework is now available for download, and I have to say that Daniel has done an excellent job rewriting some of the crucial parts of the rendering code. CPU rendering now runs a little faster than the old Starling with all the optimizations from my previous posts, but GPU rendering has improved even more. On my laptop the new Starling framework can handle 35% more images than the old version with my optimizations. The exact performance depends on the GPU, but I would say that the new approach is definitely the way to go with GPU rendering. Even though version 0.9.1 brings serious performance enhancements, it’s still possible to make it a tiny bit faster with relatively small changes.

The biggest change in the new Starling framework is that all Image and Quad rendering is now done in batches. During rendering, up to 8192 consecutive Images using the same base texture and rendering options are combined into the same vertex buffer and then rendered with a single drawTriangles call. The idea is the same as in the flatten function earlier, but now it’s done automatically behind the scenes. The cost is that on every frame some pretty heavy matrix calculations are required to generate the vertex buffer data. So the first place to start tweaking is the VertexData class.
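To make the batching rule concrete, here is a minimal sketch of the grouping logic. It is written in Python rather than ActionScript and is not Starling's code: the Quad record, the build_batches function and the dictionary layout are made up for illustration, and only the 8192 limit and the "same texture and options, consecutive only" rule come from the description above.

```python
from dataclasses import dataclass

MAX_QUADS_PER_BATCH = 8192  # batch size limit quoted in the text


@dataclass(frozen=True)
class Quad:
    texture: str    # hypothetical stand-in for the base texture
    smoothing: str  # hypothetical stand-in for the rendering options


def build_batches(quads, max_quads=MAX_QUADS_PER_BATCH):
    """Group consecutive quads sharing texture and options into batches.

    Every returned batch corresponds to one draw call.
    """
    batches = []
    for quad in quads:
        state = (quad.texture, quad.smoothing)
        current = batches[-1] if batches else None
        if (current is None
                or current["state"] != state
                or len(current["quads"]) >= max_quads):
            # State change or full batch: start a new batch.
            current = {"state": state, "quads": []}
            batches.append(current)
        current["quads"].append(quad)
    return batches


# Six quads, but the background quad in the middle splits the atlas run.
scene = ([Quad("atlas", "bilinear")] * 3
         + [Quad("background", "bilinear")]
         + [Quad("atlas", "bilinear")] * 2)
draw_calls = len(build_batches(scene))
```

Because only consecutive quads are merged, the single background quad in the middle of the example forces three draw calls instead of two. Keeping display objects that share a texture next to each other in the display list keeps the batch count low.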

For every Image and Quad, the VertexData is first copied in QuadBatch’s addQuad function with VertexData’s copyTo function and then matrix transformed with the transformQuad function. Within transformQuad the position values are again copied back and forth between Vectors. To make these two operations more efficient, the mRawData Vector in VertexData needs to be divided into two: one Vector containing the position values and another containing the texture and color values. The transformation matrix is added as a parameter to the copyTo function and the whole transformQuad function is dropped. In copyTo the now separate position Vector can be given as input to Matrix3D’s transformVectors function, which writes its output to the sPositions Vector. The transformed values are then copied from sPositions into the target VertexData instance’s position Vector. With this change we get rid of two unnecessary rounds of Vector copying.
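The idea of transforming while copying can be sketched as follows. This is a language-neutral illustration in Python, not the modified Starling code: the function name and the flat (x, y) list layout are assumptions, and a plain loop over a 2D affine matrix stands in for Matrix3D’s transformVectors.

```python
def copy_transformed(positions, matrix, target, target_offset=0):
    """Write transformed (x, y) pairs from `positions` straight into `target`.

    `matrix` is a 2D affine transform (a, b, c, d, tx, ty), applied as
    x' = a*x + c*y + tx and y' = b*x + d*y + ty.
    """
    a, b, c, d, tx, ty = matrix
    for i in range(0, len(positions), 2):
        x, y = positions[i], positions[i + 1]
        target[target_offset + i] = a * x + c * y + tx
        target[target_offset + i + 1] = b * x + d * y + ty


# A unit quad scaled by 2 and moved by (10, 20), written into the second
# quad slot of a batch-wide position list (8 floats per quad).
quad_positions = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
batch_positions = [0.0] * 16
copy_transformed(quad_positions, (2.0, 0.0, 0.0, 2.0, 10.0, 20.0),
                 batch_positions, target_offset=8)
```

The point of the sketch is that the source positions are read once and the results land directly in the batch data, with no separate copy step followed by an in-place transform.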

After dividing the data in VertexData into position data and texture and color data, the latter should be stored in a ByteArray instead of a Vector. Uploading ByteArray data into a vertex buffer is a lot faster, and since the color and texture data probably doesn’t change too often, the time it takes to set the values is not that important. Switching from Vector to ByteArray requires changes in so many functions that I won’t go through them all, but a few things to remember: set the ByteArray’s endian to Endian.LITTLE_ENDIAN, always set the length and position correctly, and use the writeFloat and readFloat functions to write and read the ByteArray.
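As a rough picture of what the byte-level bookkeeping involves, here is a Python sketch using a bytearray and the struct module in place of ByteArray’s writeFloat and readFloat. The class name and the six-floats-per-vertex layout (r, g, b, a, u, v) are assumptions made for the example; the original post does not spell out the layout.

```python
import struct

FLOATS_PER_VERTEX = 6  # assumed layout: r, g, b, a, u, v
BYTES_PER_FLOAT = 4
TEX_COORD_COMPONENT = 4  # u starts after the four color floats


class ColorTexData:
    """Color and texture data packed as little-endian 32-bit floats."""

    def __init__(self, num_vertices):
        # The length is fixed up front, like setting ByteArray.length.
        self.raw = bytearray(num_vertices * FLOATS_PER_VERTEX * BYTES_PER_FLOAT)

    def _offset(self, vertex_id, component):
        # Byte position of one float, like setting ByteArray.position.
        return (vertex_id * FLOATS_PER_VERTEX + component) * BYTES_PER_FLOAT

    def set_tex_coords(self, vertex_id, u, v):
        # "<" selects little-endian byte order.
        offset = self._offset(vertex_id, TEX_COORD_COMPONENT)
        struct.pack_into("<2f", self.raw, offset, u, v)

    def get_tex_coords(self, vertex_id):
        offset = self._offset(vertex_id, TEX_COORD_COMPONENT)
        return struct.unpack_from("<2f", self.raw, offset)


data = ColorTexData(4)
data.set_tex_coords(2, 0.5, 0.25)
```

Every read and write has to compute a byte offset, which is why the setters become more awkward than with a Vector. The payoff is that the whole buffer can be handed to the GPU upload in one piece.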

The changes in the VertexData class also need to be handled in the QuadBatch, Image and Quad classes. QuadBatch now requires two vertex buffers: one for the position data and another for the texture and color data. Image and Quad need to take the transformation matrix as a parameter in their copyVertexDataTo functions.

Other places for small optimizations: the isStateChange function in QuadBatch can also be optimized, since checking first whether there are no quads is definitely not the optimal order. RenderSupport, on the other hand, uses the currentQuadBatch getter a lot, so a minor tweak there is to store the current QuadBatch in a separate variable and use that instead.

After these changes you can expect about 5-10% better results with the Starling demo when using GPU rendering. With CPU rendering the improvement is a lot smaller.

Since with this version of Starling the number of Images on screen is getting really high, the event handling also needs to be well optimized. One place to gain performance is the Stage’s advanceTime function. On every frame update it calls the dispatchEventOnChildren function with the enter frame event. dispatchEventOnChildren goes through all the child display objects, possibly over ten thousand of them, and checks whether they are listening to this event. The chances are that only one display object in your application is actually interested in this event, so this is really inefficient. A quick fix is to require that only display objects that are direct children of Stage can receive the enter frame event. This way you can limit the check to the root display object in the advanceTime function. This change should improve the Starling demo performance by another 5-10% when using GPU rendering. Another thing to notice is that while the number of rendered Images may rise in total by about 10-15% with GPU rendering, the CPU load will also drop significantly. This means that there is more time per frame for your actual application logic.
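The difference between the two dispatch strategies can be shown with a toy display tree. This is a simplified Python sketch with made-up class and function names, not Starling’s event system; it only counts how many objects have to be checked per frame.

```python
class DisplayObject:
    """Toy display object with children and per-event listeners."""

    def __init__(self, children=()):
        self.children = list(children)
        self.listeners = {}

    def add_event_listener(self, event_type, handler):
        self.listeners.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type):
        for handler in self.listeners.get(event_type, ()):
            handler()


def dispatch_on_all(node, event_type):
    """Walk the whole tree; return how many objects were checked."""
    checked = 1
    node.dispatch(event_type)
    for child in node.children:
        checked += dispatch_on_all(child, event_type)
    return checked


def dispatch_on_direct_children(stage, event_type):
    """Check only the stage's direct children; return how many were checked."""
    for child in stage.children:
        child.dispatch(event_type)
    return len(stage.children)


# One root object listening for enter frame, with 10000 images under it.
ticks = []
root = DisplayObject(DisplayObject() for _ in range(10000))
root.add_event_listener("enterFrame", lambda: ticks.append(1))
stage = DisplayObject([root])

checked_full = dispatch_on_all(stage, "enterFrame")
checked_limited = dispatch_on_direct_children(stage, "enterFrame")
```

Both strategies reach the one listener that cares, but the full walk checks every object in the tree while the restricted version checks a single one. The price is the restriction itself: objects deeper in the tree can no longer listen for the enter frame event directly.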

The speed comparisons between the original Starling, the original Starling with my optimizations from the previous posts and the new Starling are shown in the chart below. The results with the new Starling framework are on the bottom four rows. [Edit: The tests were run at 30 fps]

[Chart: Starling new speed]

One interesting thing is that, at least on Windows 7, the 64-bit Internet Explorer 9 Flash plugin plays Stage3D content even faster than the Flash projector. Firefox, on the other hand, performs really poorly, achieving only about half of Internet Explorer’s performance. This means that at the moment, if you are developing a top-notch Flash 11 application for the web, you should probably recommend that users avoid Firefox.

To wrap things up, Starling 0.9.1 is definitely an update everyone has been waiting for. It gives a really good performance boost compared to the previous version, and if device loss handling is also added, hopefully in the next version, then there’s basically no reason not to use Starling for 2D rendering when you are developing a Flash 11 application.

Poll – what should I write about next?

So far I have been writing about optimizing and tweaking the Starling Framework. In my opinion that has offered a fresh topic for several pretty interesting posts, but now that I have got those out, I am asking you: what would you like me to write about next? Should I still write about some aspect of the Starling Framework or go with something totally different? Suggest your topic here and I may write about that next!

Handling rendering “device loss” with Starling

After Flash 11 brought GPU accelerated 3D rendering with Stage3D, it became necessary to handle the error conditions that Direct3D and OpenGL developers are familiar with. You might expect the Starling framework to take care of errors like “device loss” for you, but unfortunately this is not the case. I’ll now briefly explain how to properly detect and handle the device loss so that your Flash application won’t crash.

Rendering “device loss” happens when the GPU hardware becomes unavailable to the application, for example when the Windows operating system goes to the lock screen [this doesn’t seem to happen with all the different Windows versions / hardware]. When a device loss happens, the current Context3D instance and all the buffers, textures and shader programs created with it are disposed. This means that calling any functions on them will throw an exception and crash the application if the exception is not caught.

Luckily the device loss is really easy to detect. Simply check the value of your Context3D instance’s driverInfo: if it is “Disposed”, you know the device loss has occurred and the context has been disposed. This check should be added at the beginning of the Starling class’s render function, and if you detect that the context has been disposed, simply return from the function immediately. Remember to check this condition also in your application’s enterFrame handler and return from there too if the context has been disposed. Trying to create new Textures or calling functions that modify the buffers, for example in Image, will also throw an exception.
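The guard described above boils down to an early return. Here is a minimal Python sketch of the pattern; FakeContext, context_is_lost and render are hypothetical names, and only the comparison against the string “Disposed” reflects the check described in this post.

```python
class FakeContext:
    """Hypothetical stand-in for Context3D; only driver_info matters here."""

    def __init__(self, driver_info="OpenGL"):
        self.driver_info = driver_info


def context_is_lost(context):
    # A missing context is treated the same way as a disposed one.
    return context is None or context.driver_info == "Disposed"


def render(context, draw):
    """Skip the frame when the context is gone; return True if it was drawn."""
    if context_is_lost(context):
        return False
    draw(context)
    return True


frames = []
drawn_ok = render(FakeContext(), frames.append)
drawn_lost = render(FakeContext("Disposed"), frames.append)
```

The same helper can be called at the top of the application’s own enter frame handler, so that no texture or buffer code runs while the context is gone.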

After we have survived the device loss it’s time to prepare for the creation of the new device. When the rendering device becomes available again, the Flash runtime generates a new Context3D instance automatically and dispatches a CONTEXT3D_CREATE event. Starling listens to this event, but the handler function onContextCreated was not designed to handle more than one context creation, so we need to modify it a little. At the beginning of the function set mContext to null and create a new Dictionary for mPrograms. Now Starling is running again with a valid context and shader programs, but all the textures and vertex/index buffers are still invalid.

To handle the texture updating you should also listen to the CONTEXT3D_CREATE event outside Starling. When the event occurs for the second time, you need to reinitialize all the textures. The easiest way to achieve this is to add a function to the Starling Texture classes for creating a new base texture and to have all the Texture instances managed by a single class. This manager should have the original bitmap / byte array data required for reinitializing the textures available either in memory or on disk. With centralized management like this you don’t need to create new Texture instances: you can continue using the existing ones no matter where they are, and you also avoid any changes to the actual application rendering logic.
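The shape of such a manager might look roughly like this. It is a Python sketch under stated assumptions, not Starling code: ManagedTexture, TextureManager, FakeUploadContext and its upload method are all invented for the example, and the "base" handle stands in for the context-specific base texture.

```python
class ManagedTexture:
    """Keeps its source data so the base texture can be recreated."""

    def __init__(self, source_bytes):
        self.source_bytes = source_bytes  # original data kept for re-upload
        self.base = None                  # context-specific handle

    def create_base(self, context):
        self.base = context.upload(self.source_bytes)


class TextureManager:
    """Single place that knows about every texture in the application."""

    def __init__(self):
        self._textures = []

    def create(self, context, source_bytes):
        texture = ManagedTexture(source_bytes)
        texture.create_base(context)
        self._textures.append(texture)
        return texture

    def on_context_created(self, context):
        # Called when a new context arrives after a device loss.
        for texture in self._textures:
            texture.create_base(context)


class FakeUploadContext:
    """Hypothetical context whose upload returns a handle tied to it."""

    def __init__(self, name):
        self.name = name

    def upload(self, data):
        return (self.name, len(data))


manager = TextureManager()
bird = manager.create(FakeUploadContext("context-1"), b"\x00" * 16)
handle_before = bird.base
manager.on_context_created(FakeUploadContext("context-2"))
handle_after = bird.base
```

The object the application holds (bird in the example) stays the same across the device loss. Only the handle inside it is swapped, which is what lets the rendering logic stay untouched.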

The last step is to create new vertex and index buffers with the new context. One way to do this is to add a running id number for the contexts created in Starling. Whenever a new context is created this number is increased by one. Then, wherever vertex and index buffers are used, store this context id when the buffers are created, and whenever you use the buffers check that the id they were created for is the same as the current id. If they are not the same, just create new buffers for the current context and store the new id.
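The context id scheme can be sketched like this. Again this is a Python illustration with hypothetical names (Renderer, QuadBuffers, get_buffer), and the dictionary returned by create_buffer merely stands in for a real vertex or index buffer.

```python
class Renderer:
    """Hypothetical stand-in for the Starling instance; counts contexts."""

    def __init__(self):
        self.context_id = 0

    def on_context_created(self):
        # Every newly created context gets the next running id.
        self.context_id += 1

    def create_buffer(self):
        # Stand-in for creating a buffer on the current context.
        return {"created_for": self.context_id}


class QuadBuffers:
    """Owner of a buffer that recreates it lazily after a context change."""

    def __init__(self, renderer):
        self.renderer = renderer
        self.buffer = None
        self.buffer_context_id = -1
        self.rebuild_count = 0

    def get_buffer(self):
        current_id = self.renderer.context_id
        if self.buffer is None or self.buffer_context_id != current_id:
            # Buffer is missing or belongs to an old context: recreate it.
            self.buffer = self.renderer.create_buffer()
            self.buffer_context_id = current_id
            self.rebuild_count += 1
        return self.buffer


renderer = Renderer()
renderer.on_context_created()           # first context
quad_buffers = QuadBuffers(renderer)
first = quad_buffers.get_buffer()
same = quad_buffers.get_buffer()        # same context, buffer reused
renderer.on_context_created()           # new context after a device loss
second = quad_buffers.get_buffer()      # stale id detected, buffer recreated
```

The nice property of the lazy check is that nothing has to keep a list of all buffer owners. Each owner notices the stale id on its own the next time it is about to use its buffers.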

After these modifications your Flash application should survive the device loss. Newer versions of operating systems and Flash player might get better at not losing the device, but it’s always better to be safe than sorry.

“One more” Starling post coming soon

I’ve been pretty busy the last two weeks so I haven’t had time to update this blog, but things will hopefully change during the next weekend. So far I’ve been mainly focusing on the performance issues of the Starling framework in Flash 11, but the next post will be about handling the Stage3D “device loss”, which results in an application crash if not handled properly. More about that before the end of this week!

Starling performance revisited

After running some more performance tests with the Starling demo’s benchmark it’s time to write once more about the results from the optimizations and the performance in general. Something that I didn’t mention in the earlier posts is that I ran all the tests with the Flash player 11 projector and not in the web browser plugin. This has a huge effect on the performance, as this article will show. Also, in the tests for my previous posts my laptop was not running at full speed since some energy saving features were still on. Some of the original numbers were affected a little, and I’ll correct those in the previous posts too.

So let’s now go to the latest test results. Five different versions of the benchmark were run on four different computers. First the original benchmark using CPU rendering, then the original benchmark with Sprite flattening using CPU rendering, then the optimized Sprite flatten benchmark using CPU rendering (see “Optimizing Starling framework”), then the original benchmark using GPU rendering and finally the optimized benchmark using GPU rendering (see “Passing the Starling Image count limit”). On one computer I also tested the effect of using different editions of the Flash 11 player. The Starling demo’s benchmark was modified to run at 30 fps like in my previous posts.


This chart shows the number of moving Images at 30 fps with four different computers. Both the CPU and GPU optimizations improve the performance considerably.


The comparison between different computers and Flash player editions shows a few interesting things.

The first interesting issue shows up in the results for CPU rendering with the slowest computer (Athlon 64 3000+), marked in blue. There the optimization for CPU rendering reaches only 1.4 times the original performance. This is because of the big image in the background of the benchmark scene. Even though it stays still, it needs to be rendered on every frame, and with slower processors and CPU rendering that image alone takes quite some time to draw.

The second interesting issue is that the GPU really does make a difference with Flash 11. When the five year old desktop computer with the slowest CPU starts using its GPU (which in fact is not that good) for the rendering, it beats the laptop that has no GPU by a huge margin (2500 vs. 570 images) and even gets better results than the fastest computer with CPU rendering (2500 vs. 1970 images).

The third interesting issue is that there is quite a big performance difference between different Flash player editions. The release projector player is naturally the fastest in all the benchmarks, but the order of the release plugin player and the debug projector player depends on the benchmark run. I have marked the two interesting cases with red text. The first one is “CPU flatten” with the debug projector player. Here the debug player performs really poorly. This is most likely because the original implementation of Sprite’s flatten function creates lots of new instances of different classes and the debug player needs to keep track of them. The second interesting case is the original “GPU” rendering benchmark with the plugin player. It handles less than half as many images as the optimized GPU rendering benchmark, indicating that vertex buffer access has some serious overhead under plugin players. All in all, the performance under the plugin player seems to be within 60-80% of the performance of the projector player. I am hoping that with new player versions the difference will not be this big.

To wrap this all up again, one thing to understand is that now with GPU rendering support the variation in performance between different systems can be just massive. In my tests the fastest computer was able to handle around 20 times as many moving images as the slowest one. This means that when you are implementing any Flash 11 game there really should be a way to adjust the graphical detail, both to keep the game running smoothly on slower machines with possibly no GPU and to give some extra visual effects to users who have fast machines with state of the art GPUs. Also worth noticing is that when targeting web browsers, the tuning done in “Passing the Starling Image count limit” more than doubles the GPU rendering performance, and when using the CPU for rendering it’s really crucial to have the optimization done in “Optimizing Starling framework” in place.

That’s all this time. Next post will probably be about handling device loss in Starling.

