Starling gets wings


Version 0.9.1 of the Starling framework is now available for download, and I have to say that Daniel has done an excellent job rewriting some of the crucial parts of the rendering code. CPU rendering now runs a little faster than the old Starling with all the optimizations from my previous posts, but GPU rendering has improved even more. On my laptop the new Starling framework can handle 35% more images than the old version with my optimizations. The exact performance depends on the GPU, but I would say that the new approach is definitely the way to go with GPU rendering. Even though version 0.9.1 brings serious performance enhancements, it is still possible to make it a tiny bit faster with relatively small changes.

The biggest change in the new Starling framework is that all Image and Quad rendering is now done in batches. During rendering, up to 8192 consecutive Images that use the same base texture and rendering options are combined into the same vertex buffer and then rendered with a single drawTriangles call. The idea is the same as in the flatten function earlier, but now it happens automatically behind the scenes. It also means that every frame requires some pretty heavy matrix calculations to generate the vertex buffer data, so the first place to start tweaking is the VertexData class.
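
To make the batching rule concrete, here is a rough sketch written in TypeScript rather than Starling's ActionScript. It is not the framework's actual code: the names Sprite, Batch and buildBatches are made up for illustration, and "rendering options" is reduced to a single smoothing string. Only the 8192 limit and the "consecutive, same texture and options" rule come from the description above.

```typescript
// Hypothetical stand-ins for Starling's Image/Quad and QuadBatch.
interface Sprite {
  texture: string;   // identifies the base texture
  smoothing: string; // stands in for "rendering options" (assumption)
}

interface Batch {
  texture: string;
  smoothing: string;
  sprites: Sprite[];
}

const MAX_QUADS_PER_BATCH = 8192; // limit mentioned in the post

// Groups consecutive sprites that share texture and options into batches.
// Each batch would map to one vertex buffer and one draw call.
function buildBatches(sprites: Sprite[], maxQuads: number = MAX_QUADS_PER_BATCH): Batch[] {
  const batches: Batch[] = [];
  let current: Batch | undefined;
  for (const sprite of sprites) {
    if (
      current === undefined ||
      current.sprites.length >= maxQuads ||
      current.texture !== sprite.texture ||
      current.smoothing !== sprite.smoothing
    ) {
      // State changed or the batch is full: start a new batch.
      current = { texture: sprite.texture, smoothing: sprite.smoothing, sprites: [] };
      batches.push(current);
    }
    current.sprites.push(sprite);
  }
  return batches;
}
```

Note that only consecutive sprites are merged: a sprite with another texture in the middle of the display list splits the run into separate batches, which is why the order of the display objects matters.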

For every Image and Quad, the VertexData is first copied in QuadBatch's addQuad function with VertexData's copyTo function and then transformed with the transformQuad function. Within transformQuad the position values are again copied back and forth between Vectors. To make these two operations more efficient, the mRawData Vector in VertexData needs to be split in two: one Vector containing the position values and another containing the texture and color values. The transformation matrix is added as a parameter to copyTo and the whole transformQuad function is dropped. In copyTo the now separate position Vector can be given as input to Matrix3D's transformVectors function, with a scratch Vector (sPositions) as the output. The transformed values are then copied from sPositions into the target VertexData instance's position Vector. This change gets rid of two unnecessary rounds of Vector copying.
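
The split described above could look roughly like the following TypeScript sketch. It is an illustration under assumptions, not the modified Starling source: SplitVertexData, Affine and the six-value attribute layout are hypothetical, and the small transformVectors helper only stands in for Matrix3D's transformVectors by applying a 2D affine matrix to x, y, z triples.

```typescript
// 2D affine matrix, a simplified stand-in for the Matrix3D used by Starling.
interface Affine { a: number; b: number; c: number; d: number; tx: number; ty: number; }

const POSITION_SIZE = 3;  // x, y, z per vertex
const ATTRIBUTE_SIZE = 6; // r, g, b, a, u, v per vertex (assumed layout)

// Stand-in for Matrix3D.transformVectors: reads x, y, z triples from
// `input` and writes the transformed triples to `output` (z is passed through).
function transformVectors(m: Affine, input: number[], output: number[]): void {
  for (let i = 0; i + 2 < input.length; i += 3) {
    const x = input[i];
    const y = input[i + 1];
    output[i] = m.a * x + m.c * y + m.tx;
    output[i + 1] = m.b * x + m.d * y + m.ty;
    output[i + 2] = input[i + 2];
  }
}

function zeros(count: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < count; i++) out.push(0);
  return out;
}

// Shared scratch buffer for transformed positions, like sPositions in the post.
const sPositions: number[] = [];

class SplitVertexData {
  readonly positions: number[];  // position values only
  readonly attributes: number[]; // texture and color values only

  constructor(numVertices: number) {
    this.positions = zeros(numVertices * POSITION_SIZE);
    this.attributes = zeros(numVertices * ATTRIBUTE_SIZE);
  }

  // Copies this data into `target` starting at `targetVertex`,
  // transforming the positions on the way (no separate transformQuad step).
  copyTo(target: SplitVertexData, targetVertex: number, matrix: Affine): void {
    transformVectors(matrix, this.positions, sPositions);
    const posOffset = targetVertex * POSITION_SIZE;
    for (let i = 0; i < this.positions.length; i++) {
      target.positions[posOffset + i] = sPositions[i];
    }
    const attrOffset = targetVertex * ATTRIBUTE_SIZE;
    for (let i = 0; i < this.attributes.length; i++) {
      target.attributes[attrOffset + i] = this.attributes[i];
    }
  }
}
```

The source data is never modified, so the same quad can be copied into a batch every frame with a different matrix.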

After splitting the data in VertexData into position data and texture and color data, the latter should be stored in a ByteArray instead of a Vector. Uploading ByteArray data into a vertex buffer is a lot faster, and since the color and texture data probably does not change very often, the time it takes to set the values is not that important. Switching from Vector to ByteArray requires changes in so many functions that I won't go through them all, but a few things to remember: set the ByteArray's endian property to Endian.LITTLE_ENDIAN, always keep the length and position correct, and use the writeFloat and readFloat functions to write and read the values.
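
The three reminders (little endian, correct length and position, float reads and writes) can be illustrated with a small TypeScript analog. This is not Flash's ByteArray: the class LittleEndianFloats is hypothetical and is built on the standard ArrayBuffer and DataView types, which happen to offer the same kind of positioned 32-bit float access.

```typescript
// Hypothetical ByteArray-like wrapper that always stores 32-bit floats
// in little-endian byte order.
class LittleEndianFloats {
  private view: DataView;
  position = 0; // byte offset of the next read or write

  constructor(numFloats: number) {
    // The length is fixed up front: 4 bytes per float.
    this.view = new DataView(new ArrayBuffer(numFloats * 4));
  }

  byteLength(): number {
    return this.view.byteLength;
  }

  writeFloat(value: number): void {
    this.view.setFloat32(this.position, value, true); // true = little endian
    this.position += 4;
  }

  readFloat(): number {
    const value = this.view.getFloat32(this.position, true);
    this.position += 4;
    return value;
  }

  byteAt(index: number): number {
    return this.view.getUint8(index);
  }
}
```

Forgetting to reset the position before reading back, or reading with a different byte order than was used for writing, gives garbage values without any error, which is why these details are worth a reminder.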

The changes in the VertexData class also need to be handled in the QuadBatch, Image and Quad classes. QuadBatch now requires two vertex buffers: one for the position data and another for the texture and color data. Image and Quad need to take the transformation matrix as a parameter in their copyVertexDataTo functions.

There are other places for small optimizations too. The isStateChange function in QuadBatch can be optimized, since checking first whether there are no quads is definitely not the most efficient order. RenderSupport, on the other hand, calls its currentQuadBatch getter a lot, so a minor tweak there is to store the current QuadBatch in a separate variable and use that instead.
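
As an illustration of reordering the state check in QuadBatch, the TypeScript sketch below compares an "empty test first" version with one possible alternative that tests the state first. Both functions and the BatchState type are hypothetical simplifications (a real batch tracks more state than a texture name and a smoothing mode), and the alternative is only my guess at a cheaper order, not the change made in the post.

```typescript
const MAX_QUADS = 8192;

// Simplified, hypothetical view of what a batch remembers.
interface BatchState {
  numQuads: number;
  texture: string | null;
  smoothing: string;
}

// The order the post describes: the "no quads yet" test comes first.
function isStateChangeEmptyFirst(batch: BatchState, texture: string | null, smoothing: string): boolean {
  if (batch.numQuads === 0) return false;
  if (batch.numQuads === MAX_QUADS) return true;
  return batch.texture !== texture || batch.smoothing !== smoothing;
}

// Alternative order (assumption): compare the state first, then look at
// the quad counter only once in either branch.
function isStateChangeStateFirst(batch: BatchState, texture: string | null, smoothing: string): boolean {
  if (batch.texture === texture && batch.smoothing === smoothing) {
    return batch.numQuads === MAX_QUADS; // same state: only a full batch forces a change
  }
  return batch.numQuads !== 0; // different state: an empty batch can simply adopt it
}
```

Both versions return the same result for every input; whether the second one is actually faster depends on how often consecutive quads share their state, so it should be measured rather than assumed.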

After these changes you can expect about 5-10% better results with the Starling demo when using GPU rendering. With CPU rendering the improvement is a lot smaller.

Since the number of Images on screen gets really high with this version of Starling, the event handling also needs to be well optimized. One place to gain performance is the Stage's advanceTime function. On every frame update it calls the dispatchEventOnChildren function with the enter frame event. dispatchEventOnChildren goes through all the child display objects, possibly over ten thousand of them, and checks whether they are listening to this event. Chances are that only one display object in your application is actually interested in this event, so this is really inefficient. A quick fix is to require that only display objects that are direct children of Stage can receive the enter frame event. This way you can limit the check to the root display object in the advanceTime function. This change should improve the Starling demo performance by another 5-10% when using GPU rendering. Another thing to notice is that while the number of rendered Images may rise in total by about 10-15% with GPU rendering, the CPU load also drops significantly. This means there is more time per frame for your actual application logic.
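
The difference between the two dispatch strategies can be sketched in TypeScript as follows. The DisplayNode class and both dispatch functions are hypothetical and much simpler than Starling's display list and event system; the sketch only counts how many objects have to be checked per frame in each case.

```typescript
type EnterFrameListener = (passedTime: number) => void;

// Hypothetical, minimal display object with children and enter frame listeners.
class DisplayNode {
  children: DisplayNode[] = [];
  private listeners: EnterFrameListener[] = [];

  addChild(child: DisplayNode): DisplayNode {
    this.children.push(child);
    return child;
  }

  onEnterFrame(listener: EnterFrameListener): void {
    this.listeners.push(listener);
  }

  dispatchEnterFrame(passedTime: number): void {
    for (const listener of this.listeners) listener(passedTime);
  }
}

// Walks the whole tree below the stage (traversal order is not Starling's).
// Returns the number of display objects that had to be checked.
function dispatchToAllDescendants(stage: DisplayNode, passedTime: number): number {
  let checked = 0;
  const pending: DisplayNode[] = stage.children.slice();
  while (pending.length > 0) {
    const node = pending.pop() as DisplayNode;
    checked++;
    node.dispatchEnterFrame(passedTime);
    for (const child of node.children) pending.push(child);
  }
  return checked;
}

// Quick fix from the post: only direct children of the stage get the event.
function dispatchToDirectChildren(stage: DisplayNode, passedTime: number): number {
  for (const child of stage.children) child.dispatchEnterFrame(passedTime);
  return stage.children.length;
}
```

With a single root object that holds, say, ten thousand images, the first function checks every one of them each frame while the second checks only the root. The price is that objects deeper in the tree can no longer listen to the enter frame event directly and have to be driven from the root.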

The speed comparisons between the original Starling, the original Starling with my optimizations from the previous posts, and the new Starling are shown in the chart below. The results with the new Starling framework are on the bottom four rows. [Edit: The tests were run at 30 fps.]

[Chart: Starling new speed]

One interesting thing is that, at least on Windows 7, the 64-bit Internet Explorer 9 Flash plugin plays Stage3D content even faster than the Flash projector. Firefox, on the other hand, performs really poorly, achieving only about half of Internet Explorer's performance. This means that at the moment, if you are developing a top-notch Flash 11 application for the web, you should probably recommend that users avoid Firefox.

 
To wrap things up, Starling 0.9.1 is definitely the update everyone has been waiting for. It gives a really good performance boost compared to the previous version, and if device loss handling is added, hopefully in the next version, then there is basically no reason not to use Starling for 2D rendering when you are developing a Flash 11 application.


Comments

  • Mike - Lime Rocket  On December 14, 2011 at 9:51 am

    Awesome writeup mate, I'm interested to see the effect of these new changes for AIR mobile builds when Stage3D is turned on.

  • Redoc  On December 20, 2011 at 5:59 pm

    Nice, the new Starling update is definitely a performance boost, but there is still a lot of room for improvement compared to other engines. I was able to render 7000 sprites using the latest Starling, 14000 using ND2D and 51000 using Genome2D at 60 FPS.

    • villekoskela  On December 20, 2011 at 7:24 pm

      I just tried Genome2D and at least on my laptop the sample application didn't handle that many more sprites than the Starling demo. I'll need to check it more carefully to be able to post some performance comparisons.

      • Redoc  On December 20, 2011 at 8:25 pm

        Aren't you using a Mac by any chance? Because benchmarks on Macs are all over the place for some reason, at least for me. It also seems that on Macs the reported framerate is not what I see on screen. It says 60 FPS in some demos but it's nowhere near 60 FPS, being jerky and all; this never happens on PC.

        It also depends heavily on the exact Flash Player version, since it seems that they are optimizing the GPU performance more and more. And where Starling has a heavily CPU-based pipeline, ND2D and Genome2D seem to have a GPU bottleneck and therefore gain more from these updates. Extrapolating from this, I would also say that they will perform better on mobile, but who knows.

        All of the frameworks are great and have their pros and cons, and at the end we the users are the real winners because we have options to choose from 😉

      • Redoc  On December 20, 2011 at 9:17 pm

        Ok, I reran the tests to be absolutely sure and here are the approximate numbers. All of them are the latest versions from GitHub and all of the sprites are moving each frame.

        Starling: 7000 sprites
        ND2D: 13500 sprites
        Genome2D: 34000 sprites

        If you disable the movement it jumps way higher.

        The 51000 number I got before was probably using the Genome2D blit() method; I am able to render 53000 sprites at 60 FPS that way now.

      • villekoskela  On December 20, 2011 at 9:56 pm

        Thanks for your numbers. My laptop is a PC with i5-480M CPU, GeForce GT 415M GPU and Windows 7. Like you said it’s good to have competing frameworks available.

      • Redoc  On December 20, 2011 at 10:11 pm

        Oh my bad, I am on a PC as well: i5-650 CPU and GeForce GTX470, which is a way better GPU, but the difference is not there for some reason. Daniel mentioned something on the forum about PC performance not being up there yet for some reason. The strange thing is that you have a PC as well and, according to your Starling benchmark, you were able to pull better numbers.

      • villekoskela  On December 20, 2011 at 10:19 pm

        I run my tests at only 30 fps. I added that as a comment to the post since even if I had mentioned that in the earlier posts it was missing from this one.

      • Redoc  On December 20, 2011 at 10:32 pm

        Oh, makes sense then. It seems to point to the CPU bottleneck I mentioned, since the numbers scale with CPU speed, not GPU.
