It’s been a while since my latest Starling blog post. The Starling version I was tuning back then was 0.9.1 while the current one is already 1.3. During that time lots of nice features have been added to the framework including some performance optimizations but anyways I thought now would be a good time to check if there is still something to improve on the performance side. So let’s now start optimizing Starling 1.3.
The first place to start tuning is DisplayObject class and its get transformationMatrix function. There as the first step the transformation matrix is always set to identity matrix. Then there are lots of if clauses whether the matrix should be scaled, scewed, rotated, translated or modified because of the pivot point. It is pretty clear that these operations most likely override the initial values that were achieved by setting the matrix to identity matrix.
So how could we improve the performance here? Let’s first divide the function into two cases – first a case where the display object is skewed (this is most likely not too common case) and then to a case where it’s not. For the skewed case we can use the original implementation but for the other more common case we do things a bit differently. First we stop setting the matrix to identity altogether. Instead we check if the display object is rotated and if not we simply set the matrix to do proper scaling and transformation like this:
mTransformationMatrix.a = mScaleX;
mTransformationMatrix.b = 0.0;
mTransformationMatrix.c = 0.0;
mTransformationMatrix.d = mScaleY;
mTransformationMatrix.tx = mX – mPivotX * mScaleX;
mTransformationMatrix.ty = mY – mPivotY * mScaleY;
With this implementation there is no need to compare any of the values against zero or one and there is also no overhead from using function calls. Basically this is about as fast as just the call to the identity function that we removed and we have the final matrix now. For the transformation there is really no point checking if the values are zero or not since in most of the cases they are not and then the if check will just add one extra step to the execution. It’s also a good thing to keep in mind that adding zero or multiplying by one doesn’t change the result.
For the case where the display object is rotated we calculate cosine and sine for the angle and then do the matrix math by ourselves by setting transformation matrix’s a to mScaleX * cosA, b to mScaleY * sinA, c to –mScaleX * sinA, d to mScaleX * cosA and tx and ty the same way as with when the display object was not rotated. Again there is no need to check if values are different from one or zero to avoid the extra steps. For the pivot handling we divide the original if clause into two parts first checking if pivotX is not zero and then pivotY. Here the amount of operations possibly avoided by the check justifies its cost.
With these simple changes the Starling benchmark scene should be able to handle 5-10% more images.
The next class to start tuning is VertexData. It has copyTo and transformVertex functions that get called whenever a quad is added to a quad batch. Here we apply similar idea as in the previous step so instead of first copying and then doing the matrix transformation for the values we pass the transformation matrix as a parameter to copyTo function and assign translated values to the target data.
This change should again slightly improve the performance.
Next change is the biggest one and will also have the biggest effect on performance. Starling now has this tinted parameter telling if the quad or image will use coloring or is partially transparent but it’s not really used for anything else than selecting correct fragment shader. Since once again probably most of the images will not use coloring nor be partially transparent the color data should not be copied between vertex data instances or sent to the GPU when updating the vertex buffers. To achieve this optimization we divide the mRawData vector in VertexData class into three separate vectors – one for position data, another one for color data and third one for texture data. After all the changes in VertexData class it’s also necessary to have three vertex buffers in QuadBatch to match the three vectors.
When we start passing tinting parameter to VertexData class copyTo function we can copy the color data only if the tinting is in use. The same logic can be used in QuadBatch class syncBuffers function so that the color vertex buffer is updated only if tinting is used. For the cases where tinting is not used this halves the amount of data first copied between VertexData instances and later uploaded to the vertex buffers.
After these changes you can expect around 40% improvement to the image count the Starling benchmark scene can handle. The improvement should happen both on desktops and mobile devices.
Optimizing event handling
If you still want to improve the performance the next place to start tuning is the event dispatching. For example on each frame update when Stage class advanceTime function is called it will iterate through all the display objects to collect a list of those that are actually listening to the enter frame event. Since most likely only couple of your thousands of display objects are actually listening to this event a more efficient way to do this is to have list of these display objects in Stage class. This can be achieved by making the display objects report through their parent all to way up to the Stage addition and removal of listened event. You will also need to modify the function for setting the display object’s parent a bit so that the changes are properly reported to Stage.
With this additional change you can expect 50% improvement to the image count handled by Starling benchmark scene. In my opinion this is quite a remarkable achievement.