Boosting Starling 1.3 performance by 50%

It’s been a while since my last Starling blog post. The version I was tuning back then was 0.9.1, while the current one is already 1.3. Lots of nice features have been added to the framework in the meantime, including some performance optimizations, but I thought now would be a good time to check whether there is still something to improve on the performance side. So let’s start optimizing Starling 1.3.

Optimizing DisplayObject

The first place to start tuning is the DisplayObject class and its get transformationMatrix function. As its first step, the function always sets the transformation matrix to the identity matrix. Then a series of if clauses decides whether the matrix should be scaled, skewed, rotated, translated or modified because of the pivot point. These operations will most likely overwrite the values that setting the matrix to identity just wrote, so that initial work is mostly wasted.

So how could we improve the performance here? Let’s first split the function into two cases: one where the display object is skewed (most likely not a very common case) and one where it is not. For the skewed case we can keep the original implementation, but for the more common case we do things a bit differently. First, we stop setting the matrix to identity altogether. Instead we check whether the display object is rotated, and if it is not we simply set the matrix to do the proper scaling and translation like this:

mTransformationMatrix.a = mScaleX;
mTransformationMatrix.b = 0.0;
mTransformationMatrix.c = 0.0;
mTransformationMatrix.d = mScaleY;
mTransformationMatrix.tx = mX - mPivotX * mScaleX;
mTransformationMatrix.ty = mY - mPivotY * mScaleY;

With this implementation there is no need to compare any of the values against zero or one, and there is no overhead from function calls either. This is about as fast as the single call to the identity function that we removed, and it already gives us the final matrix. For the translation there is really no point in checking whether the values are zero, since in most cases they are not, and then the if check just adds one extra step to the execution. It’s also good to keep in mind that adding zero or multiplying by one doesn’t change the result.

For the case where the display object is rotated, we calculate the cosine and sine of the angle and do the matrix math ourselves, setting the transformation matrix’s a to mScaleX * cosA, b to mScaleX * sinA, c to -mScaleY * sinA and d to mScaleY * cosA. Again there is no need to check whether values differ from one or zero to avoid the extra steps. tx and ty start from mX and mY as in the non-rotated case, but with rotation the pivot offset has to go through the full matrix, so for the pivot handling we split the original if clause into two parts, first checking whether pivotX is non-zero and then pivotY. Here the number of operations possibly avoided by each check justifies its cost.
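The two non-skewed paths can be put together roughly as follows. This is only a sketch under assumptions, not Starling’s actual code: FastTransform and its public fields are hypothetical stand-ins for the relevant part of DisplayObject, skew is left out (a skewed object would take the original code path), and the scale is assumed to be applied before the rotation, which makes b depend on scaleX and both c and d depend on scaleY.

```actionscript
package
{
    import flash.geom.Matrix;

    /** Hypothetical stand-in for the relevant part of Starling's DisplayObject. */
    public class FastTransform
    {
        public var x:Number = 0.0;
        public var y:Number = 0.0;
        public var pivotX:Number = 0.0;
        public var pivotY:Number = 0.0;
        public var scaleX:Number = 1.0;
        public var scaleY:Number = 1.0;
        public var rotation:Number = 0.0; // in radians

        private var mMatrix:Matrix = new Matrix();

        /** Skew is not handled here; see the assumptions above. */
        public function get transformationMatrix():Matrix
        {
            var m:Matrix = mMatrix;

            if (rotation == 0.0)
            {
                // No rotation: write the final values directly.
                m.a = scaleX;
                m.b = 0.0;
                m.c = 0.0;
                m.d = scaleY;
                m.tx = x - pivotX * scaleX;
                m.ty = y - pivotY * scaleY;
            }
            else
            {
                var cosA:Number = Math.cos(rotation);
                var sinA:Number = Math.sin(rotation);

                // Same values as scale(scaleX, scaleY) followed by rotate(rotation).
                m.a =  scaleX * cosA;
                m.b =  scaleX * sinA;
                m.c = -scaleY * sinA;
                m.d =  scaleY * cosA;
                m.tx = x;
                m.ty = y;

                // Move the pivot point through the scaled and rotated axes.
                if (pivotX != 0.0)
                {
                    m.tx -= m.a * pivotX;
                    m.ty -= m.b * pivotX;
                }
                if (pivotY != 0.0)
                {
                    m.tx -= m.c * pivotY;
                    m.ty -= m.d * pivotY;
                }
            }

            return m;
        }
    }
}
```

With rotation set to zero the else branch gives the same matrix as the first branch, since cosA is then one and sinA is zero, so the split only saves work and does not change the result.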

With these simple changes the Starling benchmark scene should be able to handle 5-10% more images.

Optimizing VertexData

The next class to tune is VertexData. Its copyTo and transformVertex functions get called whenever a quad is added to a quad batch. Here we apply a similar idea as in the previous step: instead of first copying the values and then applying the matrix transformation to them, we pass the transformation matrix as a parameter to the copyTo function and assign the already transformed values to the target data.

This change should again slightly improve the performance.
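A rough sketch of copying and transforming in one pass might look like the following. SimpleVertexData is a hypothetical, cut-down class, the method name copyTransformedTo is made up for illustration, and the interleaved layout of eight numbers per vertex (x, y, r, g, b, a, u, v) is an assumption rather than something taken from the Starling sources.

```actionscript
package
{
    import flash.geom.Matrix;

    /** Hypothetical, cut-down VertexData; the 8-number layout is an assumption. */
    public class SimpleVertexData
    {
        public static const ELEMENTS_PER_VERTEX:int = 8; // x, y, r, g, b, a, u, v

        public var rawData:Vector.<Number>;

        public function SimpleVertexData(numVertices:int)
        {
            rawData = new Vector.<Number>(numVertices * ELEMENTS_PER_VERTEX);
        }

        /** Copies all vertices into 'target', transforming the positions on the way. */
        public function copyTransformedTo(target:SimpleVertexData,
                                          targetVertexID:int, matrix:Matrix):void
        {
            var source:Vector.<Number> = rawData;
            var dest:Vector.<Number> = target.rawData;
            var destPos:int = targetVertexID * ELEMENTS_PER_VERTEX;
            var end:int = source.length;

            for (var i:int = 0; i < end; i += ELEMENTS_PER_VERTEX)
            {
                var x:Number = source[i];
                var y:Number = source[i + 1];

                // Write the transformed position straight into the target.
                dest[destPos]     = matrix.a * x + matrix.c * y + matrix.tx;
                dest[destPos + 1] = matrix.b * x + matrix.d * y + matrix.ty;

                // Color and texture coordinates are copied unchanged.
                for (var j:int = 2; j < ELEMENTS_PER_VERTEX; ++j)
                    dest[destPos + j] = source[i + j];

                destPos += ELEMENTS_PER_VERTEX;
            }
        }
    }
}
```

The gain comes from touching each position only once: the separate copy followed by a transform reads and writes the same target values twice.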

The next change is the biggest one and also has the biggest effect on performance. Starling now has a tinted parameter that tells whether the quad or image uses coloring or is partially transparent, but it is not really used for anything other than selecting the correct fragment shader. Since most images will probably use neither coloring nor partial transparency, their color data does not need to be copied between vertex data instances or sent to the GPU when the vertex buffers are updated. To achieve this we divide the mRawData vector in the VertexData class into three separate vectors: one for position data, one for color data and one for texture data. After these changes to VertexData, QuadBatch also needs three vertex buffers to match the three vectors.

Once we pass the tinted parameter to the copyTo function of VertexData, we can copy the color data only when tinting is in use. The same logic applies in the syncBuffers function of QuadBatch, so that the color vertex buffer is updated only if tinting is used. When tinting is not used, this halves the amount of data first copied between VertexData instances and later uploaded to the vertex buffers.
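The split into three vectors could be sketched like this. SplitVertexData and its field names are hypothetical, and the sizes of two position values, four color values and two texture coordinates per vertex are assumptions chosen for illustration.

```actionscript
package
{
    /** Hypothetical split-vector VertexData; names and sizes are assumptions. */
    public class SplitVertexData
    {
        public var positions:Vector.<Number>; // x, y per vertex
        public var colors:Vector.<Number>;    // r, g, b, a per vertex
        public var texCoords:Vector.<Number>; // u, v per vertex

        public function SplitVertexData(numVertices:int)
        {
            positions = new Vector.<Number>(numVertices * 2);
            colors    = new Vector.<Number>(numVertices * 4);
            texCoords = new Vector.<Number>(numVertices * 2);
        }

        /** Copies all vertices into 'target'; colors are copied only when tinted. */
        public function copyTo(target:SplitVertexData, targetVertexID:int,
                               tinted:Boolean):void
        {
            copyRange(positions, target.positions, targetVertexID * 2);
            copyRange(texCoords, target.texCoords, targetVertexID * 2);

            // Untinted quads skip the color data entirely.
            if (tinted)
                copyRange(colors, target.colors, targetVertexID * 4);
        }

        private static function copyRange(source:Vector.<Number>,
                                          dest:Vector.<Number>, destPos:int):void
        {
            var count:int = source.length;
            for (var i:int = 0; i < count; ++i)
                dest[destPos + i] = source[i];
        }
    }
}
```

Under these assumed sizes an untinted copy moves four numbers per vertex instead of eight. A batch using this layout would apply the same condition when uploading, sending the color vector to its own vertex buffer only when at least one quad in the batch is tinted.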

After these changes you can expect around a 40% improvement in the number of images the Starling benchmark scene can handle. The improvement should show up on both desktops and mobile devices.

Optimizing event handling

If you still want more performance, the next place to tune is event dispatching. For example, on each frame update, when the advanceTime function of the Stage class is called, it iterates through all the display objects to collect a list of those that are actually listening to the enter frame event. Since most likely only a couple of your thousands of display objects listen to this event, a more efficient approach is to keep a list of these display objects in the Stage class. This can be achieved by making display objects report the addition and removal of listeners for the event through their parents all the way up to the Stage. You will also need to modify the function that sets a display object’s parent a bit, so that the changes are properly reported to the Stage.
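The list kept by the stage could be as simple as the registry sketched below. EnterFrameRegistry is a hypothetical helper and not part of Starling’s API. The reporting through the parent chain and the changes to the parent setter are not shown, and the callback parameter is only a placeholder for however the event would actually be dispatched.

```actionscript
package
{
    /** Hypothetical registry the stage could own; not part of Starling's API. */
    public class EnterFrameRegistry
    {
        private var mListeners:Vector.<Object> = new Vector.<Object>();

        /** Called when an object on the stage starts listening to enter frame. */
        public function add(object:Object):void
        {
            if (mListeners.indexOf(object) == -1) mListeners.push(object);
        }

        /** Called when the object stops listening or leaves the stage. */
        public function remove(object:Object):void
        {
            var index:int = mListeners.indexOf(object);
            if (index != -1) mListeners.splice(index, 1);
        }

        public function get numListeners():int
        {
            return mListeners.length;
        }

        /** Calls 'callback(listener, passedTime)' for every registered object. */
        public function advanceTime(passedTime:Number, callback:Function):void
        {
            // Iterate over a copy so listeners may unregister during dispatch.
            var snapshot:Vector.<Object> = mListeners.concat();

            for each (var listener:Object in snapshot)
                callback(listener, passedTime);
        }
    }
}
```

The per-frame cost now depends on the number of listeners rather than on the size of the display tree. The snapshot copy is a simplification; a tuned version would probably avoid allocating a new vector on every frame.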

With this additional change you can expect a 50% improvement in total in the number of images handled by the Starling benchmark scene. In my opinion this is quite a remarkable achievement.

Comments

PrimaryFeather On February 22, 2013 at 11:31 am

Thanks a lot for testing the effect of those changes, Ville! Definitely something I’ll look into for Starling 1.4! =)
Vic C. (@puppetMaster3) On February 22, 2013 at 11:00 pm

You could also listen to an enterFrame signal as needed.
- villekoskela On February 23, 2013 at 10:32 am
  
  That’s what’s happening here. Since Starling display objects are not traditional Flash display objects, they can’t listen to the normal enter frame event, but the same kind of functionality has been implemented in Starling (which is what I was optimizing).
tsangwailam On February 25, 2013 at 5:43 am

Any chance to test with the code for performance?
- villekoskela On February 25, 2013 at 9:37 am
  
  With these descriptions of the optimizations you should be able to do them all in a couple of hours by yourself.
  - Anako On February 27, 2013 at 11:07 pm
    
    Uh, has anyone managed to implement this and would be so nice as to share the modified classes? I’m not lazy, I just don’t understand exactly how to do it :P
  - Shawn Blais Skinner On March 5, 2013 at 2:13 am
    
    Yes so we should all re-implement the same code, and validate and debug the results, instead of you just posting it?? Come on man… don’t be such a **** tease.
    - villekoskela On March 5, 2013 at 7:26 am
      
      This way there’s a good chance that whoever implements these changes will actually spend a few extra hours thinking them through and, after understanding how the improved performance was achieved, will write more optimized code in the future.
    - Shawn Blais Skinner On March 5, 2013 at 5:59 pm
      
      Or, we could be writing the game, have 3000 things to do, and just need a speed boost to the rendering performance 🙂
    - Shawn Blais Skinner On March 5, 2013 at 6:01 pm
      
      If I do end up doing this, I’ll host the code anyways, since it would be silly not to at least share it with people, and create one solid branch of code that could be easily merged into the primary Starling trunk.
      - villekoskela On March 5, 2013 at 6:38 pm
        
        That’s a good option. Since the modifications I described in the blog post were not done for a hobby or open source project, publishing them is not possible. Anyway, as Daniel already mentioned in the first reply, something similar (and probably a bit improved) will come with Starling 1.4.
      - Shawn Blais Skinner On March 5, 2013 at 6:59 pm
        
        I started doing it, but man, it’s starting to get pretty involved within the VertexData class, splitting into multiple vectors and changing function signatures in QuadBatch… think I’ll hold off on this for now. I still think it would be helpful to post your code so at least we can verify the gains within our own benchmarks and devices.
        
        I have a large bank of devices, and just ran a large set of “before” results. Just need the modified Starling codebase and I’ll do all the “afters” 🙂
PrimaryFeather On March 7, 2013 at 3:43 pm

As a first step, I’ve just integrated the transformation matrix optimizations — that doesn’t have any side effects, so it was a safe thing to do. The VertexData changes are more challenging; I don’t want all custom objects of people out there to break …
- Anko On March 7, 2013 at 5:19 pm
  
  Cheers Daniel! 🙂

villekoskela