The new version 0.9.1 of the Starling framework is now available for download, and I have to say that Daniel has done an excellent job rewriting some of the crucial parts of the rendering code. CPU rendering now runs a little faster than the old Starling with all the optimizations from my previous posts, but GPU rendering has improved even more. On my laptop the new Starling framework can handle 35% more images than the old version with my optimizations. The exact performance depends on the GPU, but I would say the new approach is definitely the way to go for GPU rendering. Even though version 0.9.1 brings serious performance enhancements, it's still possible to make it a tiny bit faster with relatively small changes.
The biggest change in the new Starling framework is that all Image and Quad rendering is now done in batches. This means that during rendering, up to 8192 consecutive Images using the same base texture and rendering options are combined into the same vertex buffer and then rendered with a single drawTriangles call. The idea is the same as the flatten function used earlier, but now it happens automatically behind the scenes. The downside is that on every frame some fairly heavy matrix calculations are required to generate the vertex buffer data, so the first place to start tweaking is the VertexData class.
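To make the batching idea concrete, here's a rough sketch of the render loop; this is not Starling's actual code, and the isCompatibleWith and renderAndReset names are just illustrative:

```actionscript
// Sketch only: consecutive quads that share a texture and render state
// are collected into one batch and drawn with a single drawTriangles call.
private function renderChildren(children:Vector.<DisplayObject>):void
{
    for each (var child:DisplayObject in children)
    {
        // a state change or a full batch forces a draw call
        if (!mCurrentBatch.isCompatibleWith(child) || mCurrentBatch.numQuads == 8192)
            mCurrentBatch.renderAndReset();

        mCurrentBatch.addQuad(child); // copies transformed vertex data into the batch
    }
    mCurrentBatch.renderAndReset();   // flush whatever is left
}
```

The fewer state changes there are between consecutive children, the fewer draw calls are issued, which is exactly why sorting your display list by texture pays off.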
For every Image and Quad, the VertexData is first copied in QuadBatch's addQuad function with VertexData's copyTo function and then matrix-transformed with the transformQuad function. Within transformQuad the position values are again copied back and forth between Vectors. To make these two operations more efficient, the mRawData Vector in VertexData needs to be split in two: one Vector containing the position values and another containing the texture and color values. The transformation matrix is added as a parameter to the copyTo function, and the whole transformQuad function is dropped. In copyTo, the now separate position Vector can be given directly as input to Matrix3D's transformVectors function, and the transformed values are then copied from the sPositions Vector into the target VertexData instance's position Vector. With this change we get rid of two unnecessary rounds of Vector copying.
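The modified copyTo could look roughly like this; Matrix3D.transformVectors is the real Flash API, but the mPositions and sPositions member names are assumptions based on the description above:

```actionscript
// Sketch of copyTo after the split, assuming mPositions holds the
// x, y, z values of this instance and sPositions is a static scratch Vector.
public function copyTo(targetData:VertexData, matrix:Matrix3D):void
{
    // transform all positions in one native call instead of per-vertex math
    matrix.transformVectors(mPositions, sPositions);

    // copy the transformed positions straight into the target's position Vector
    var numValues:int = mNumVertices * 3;
    for (var i:int = 0; i < numValues; ++i)
        targetData.mPositions[i] = sPositions[i];
}
```

Since transformVectors expects flat x, y, z triples, keeping the positions in their own Vector means no repacking is needed before or after the call.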
After dividing the data in VertexData into position data and texture/color data, the latter should be converted from a Vector into a ByteArray. Uploading ByteArray data into a vertex buffer is a lot faster, and since the color and texture data probably doesn't change very often, the time it takes to set the values matters less. Switching from Vector to ByteArray requires changes in so many functions that I won't go through them all, but a few things to remember: set the ByteArray's endian to Endian.LITTLE_ENDIAN, always keep the length and position correct, and use the writeFloat and readFloat functions to write and read the data.
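The basic pattern looks like this; writeFloat and the endian property are the real ByteArray API, while the mColorTexData name and the r, g, b, a, u, v layout are just how I'd sketch it:

```actionscript
// Sketch: storing color and texture coordinates as little-endian floats.
var mColorTexData:ByteArray = new ByteArray();
mColorTexData.endian = Endian.LITTLE_ENDIAN;  // Stage3D expects little-endian data
mColorTexData.length = numVertices * 6 * 4;   // r,g,b,a,u,v as 32-bit floats

// always set position explicitly before writing a vertex
mColorTexData.position = vertexID * 6 * 4;
mColorTexData.writeFloat(r);
mColorTexData.writeFloat(g);
mColorTexData.writeFloat(b);
mColorTexData.writeFloat(a);
mColorTexData.writeFloat(u);
mColorTexData.writeFloat(v);
```

Forgetting the endian setting is the classic mistake here; the upload will succeed but the rendered values will be garbage.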
The changes in the VertexData class also need to be handled in the QuadBatch, Image and Quad classes. QuadBatch now requires two vertex buffers: one for the position data and another for the texture and color data. Image and Quad need to take the transformation matrix as a parameter in their copyVertexDataTo functions.
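In QuadBatch the split upload could then look something like this; createVertexBuffer, uploadFromVector and uploadFromByteArray are the real Context3D/VertexBuffer3D API, while the buffer and property names are illustrative:

```actionscript
// Sketch: two separate vertex buffers in QuadBatch.
mPositionBuffer = context.createVertexBuffer(numVertices, 3); // x, y, z
mColorTexBuffer = context.createVertexBuffer(numVertices, 6); // r, g, b, a, u, v

// positions change every frame and stay in a Vector for transformVectors
mPositionBuffer.uploadFromVector(mVertexData.positions, 0, numVertices);

// colors and texture coordinates rarely change, so the faster
// ByteArray upload is used for them
mColorTexBuffer.uploadFromByteArray(mVertexData.colorTexData, 0, 0, numVertices);
```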
There are other places for small optimizations as well. The isStageChange function in QuadBatch can be improved, since checking first whether there are no quads is definitely not the optimal order for the tests. RenderSupport, on the other hand, uses the currentQuadBatch getter a lot, so a minor tweak there is to store the current QuadBatch in a separate variable and use that instead.
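The getter tweak is the old caching trick; a minimal sketch of the idea, with made-up loop code around it:

```actionscript
// Before: the getter is evaluated on every iteration.
for (var i:int = 0; i < numQuads; ++i)
    currentQuadBatch.addQuad(quads[i]);

// After: cache the result once and reuse the local reference.
var batch:QuadBatch = currentQuadBatch;
for (var j:int = 0; j < numQuads; ++j)
    batch.addQuad(quads[j]);
```

Getter calls are cheap but not free in AS3, and inside a loop that runs thousands of times per frame the difference becomes measurable.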
After these changes you can expect about 5-10% better results with the Starling demo when using GPU rendering. With CPU rendering the improvement is a lot smaller.
Since with this version of Starling the number of Images on screen gets really high, event handling also needs to be well optimized. One place to gain performance is the Stage's advanceTime function. On every frame update it calls the dispatchEventOnChildren function with an enter frame event. dispatchEventOnChildren then goes through all of the possibly more than ten thousand child display objects and checks whether each one is listening to this event. The chances are that only one display object in your application is actually interested in the event, so this is really inefficient. A quick fix is to require that only display objects that are direct children of the Stage can receive the enter frame event; this way the check in advanceTime can be limited to the root display objects. This change should improve the Starling demo performance by another 5-10% when using GPU rendering. Another thing to notice is that even though the number of rendered Images rises in total by only about 10-15% with GPU rendering, the CPU load also drops significantly, which means there is more time per frame for your actual application logic.
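A sketch of the restricted dispatch in advanceTime; this is my simplification, not Starling's actual code, and the EnterFrameEvent constructor arguments are assumptions:

```actionscript
// Sketch: dispatch the enter frame event only to direct children of the
// Stage instead of walking the entire display tree.
public function advanceTime(passedTime:Number):void
{
    var event:EnterFrameEvent =
        new EnterFrameEvent(Event.ENTER_FRAME, passedTime);

    // only the handful of top-level objects are checked, not ten thousand leaves
    for (var i:int = 0; i < numChildren; ++i)
        getChildAt(i).dispatchEvent(event);
}
```

The trade-off is an API restriction: any object that wants enter frame events must sit directly on the Stage, or its parent has to forward the event manually.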
The speed comparisons between the original Starling, the original Starling with my optimizations from the previous posts, and the new Starling are shown in the chart below. The results for the new Starling framework are on the bottom four rows. [Edit: The tests were run at 30 fps]
One interesting thing is that at least on Windows 7, the 64-bit Internet Explorer 9 Flash plugin plays Stage3D content even faster than the Flash projector. Firefox, on the other hand, performs really poorly, achieving only about half of Internet Explorer's performance. This means that at the moment, if you are developing a top-notch Flash 11 application for the web, you should probably recommend that users avoid Firefox.
To wrap things up, Starling 0.9.1 is definitely the update everyone has been waiting for. It gives a really good performance boost compared to the previous version, and if device loss handling is also added, hopefully in the next version, then there's basically no reason not to use Starling for 2D rendering when developing a Flash 11 application.