Passing the Starling Image count limit

Flash Player 11 has a limit of 4096 instances for vertex buffers and for index buffers. Since every Image and Quad in the Starling framework has its own vertex and index buffer, their total number is limited to 4096 as well. At a 30 fps frame rate a good GPU could handle a lot more images than that, so it would be nice to get past this limit. Luckily that’s pretty easy.

The solution is to use shared vertex and index buffers for the Images and Quads. In addition to the instance limit there is also a limit on the size of a single vertex or index buffer, so in this example I am limiting the number of images to N = 8192, which is twice the original limit. To support even more images, simply add several shared buffers.

First change the vertex buffer and index buffer variables in the Quad class to be static so that all instances of Quad and Image use the same buffers. Also add a static vector of integers for storing the available quad indices and a regular (non-static) integer variable for storing the index this Quad is using.

Create a function for initializing the buffers and the vector of indices. Push the integers from 0 to N-1 into the new static vector of available Quad indices. Create a second, local vector of unsigned integers for the index buffer’s data. Push a total of 6*N uints into this vector with the same logic that is used in QuadGroup’s addQuadData function: each group of six consecutive numbers holds the indices for one Quad’s corners, the first six being 0, 1, 2, 1, 3, 2, and each following group is offset by 4 from the previous one. The third vector needed is a local vector of Numbers for the initial vertex data. Push a total of VertexData.ELEMENTS_PER_VERTEX*4*N numbers (for example zeros) into this vector. Finally create the static vertex buffer with a size of 4*N vertices and the static index buffer with a size of 6*N indices, then upload the vertex vector into the vertex buffer and the index vector into the index buffer (check Quad’s createVertexBuffer and createIndexBuffer functions to see how the buffers are created).
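The three vectors described above can be sketched as follows. This is a minimal illustration written in TypeScript rather than ActionScript 3 so that it can run outside Flash; the function names are hypothetical, and the value 9 for ELEMENTS_PER_VERTEX is an assumption (in real code use VertexData.ELEMENTS_PER_VERTEX). Creating the actual Stage3D buffers and uploading the vectors is left out.

```typescript
// Assumed stand-ins for values that come from Starling in the real code.
const ELEMENTS_PER_VERTEX = 9; // assumption; use VertexData.ELEMENTS_PER_VERTEX
const N = 8192;                // maximum number of quads sharing the buffers

// Available quad indices: 0 .. n-1.
function buildFreeIndices(n: number): number[] {
  const free: number[] = [];
  for (let i = 0; i < n; i++) {
    free.push(i);
  }
  return free;
}

// Index buffer data: six indices (two triangles) per quad.
// Quad i uses vertices 4*i .. 4*i+3, so its pattern is 0,1,2,1,3,2 shifted by 4*i.
function buildIndexData(n: number): number[] {
  const indices: number[] = [];
  for (let i = 0; i < n; i++) {
    const base = i * 4;
    indices.push(base, base + 1, base + 2, base + 1, base + 3, base + 2);
  }
  return indices;
}

// Initial vertex data: four vertices per quad, all elements zeroed.
function buildInitialVertexData(n: number, elementsPerVertex: number): number[] {
  const data: number[] = [];
  const total = elementsPerVertex * 4 * n;
  for (let i = 0; i < total; i++) {
    data.push(0);
  }
  return data;
}

const freeIndices = buildFreeIndices(N);
const indexData = buildIndexData(N);
const vertexData = buildInitialVertexData(N, ELEMENTS_PER_VERTEX);
```

With N = 8192 the index vector holds 6*N = 49152 entries and the largest index value in it is 4*N - 1 = 32767.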

To initialize the buffers, call the function created in the previous step from Quad’s constructor, and add one more static boolean to make sure the buffers are allocated only once. At the end of Quad’s constructor also pop one integer from the static vector of quad indices. Store this index in the Quad instance, since it is needed to specify the offset of this Quad’s data in the buffers. Also modify Quad’s dispose function so that it does not dispose the buffers but simply pushes this index back to the vector from which it was popped.
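The pop-on-construct, push-on-dispose bookkeeping can be sketched like this. It is a TypeScript illustration rather than ActionScript 3, and the class name QuadIndexPool is hypothetical; in the modified Quad class the same state would be static members and the constructor and dispose function would call the equivalent of acquire and release.

```typescript
// Hypothetical helper that hands out slots in the shared buffers.
class QuadIndexPool {
  private free: number[] = [];

  constructor(capacity: number) {
    // Filled once, like the static vector guarded by the "allocated" boolean.
    for (let i = 0; i < capacity; i++) {
      this.free.push(i);
    }
  }

  // Called at the end of the Quad constructor.
  acquire(): number {
    const index = this.free.pop();
    if (index === undefined) {
      throw new Error("No free quad index left in the shared buffers");
    }
    return index;
  }

  // Called from dispose instead of disposing the buffers.
  release(index: number): void {
    this.free.push(index);
  }

  available(): number {
    return this.free.length;
  }
}

const pool = new QuadIndexPool(8192);
const quadIndex = pool.acquire(); // this quad owns vertices quadIndex*4 .. quadIndex*4+3
pool.release(quadIndex);          // the slot can now be reused by another quad
```

The explicit error when the pool runs empty is an addition of this sketch; the post itself does not say what should happen when more than N quads are created.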

The last thing to do is to start using these shared vertex and index buffers properly. Add one boolean to the Quad class to indicate that the vertex data needs to be uploaded to the vertex buffer. The initial value of this boolean is naturally true. Replace all the “if (mVertexBuffer) createVertexBuffer();” lines in both Quad and Image with a line setting this boolean to true. Then modify the createVertexBuffer function. It must no longer create any vertex buffers; it should only upload the data, and only if the boolean you just added is true. After uploading the vector, set the boolean to false. The second parameter to VertexBuffer3D’s uploadFromVector function is no longer 0 but this Quad’s index*4. You can remove the createIndexBuffer function, since there is no need to update the index buffer after it has been created. Finally modify the render function in both Quad and Image. Remove the null checks for the vertex and index buffers and always call the createVertexBuffer function you just modified. Also change the second parameter to Context3D’s drawTriangles function here: it is no longer 0 but this Quad’s index*6.
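The dirty flag and the two offsets can be sketched as below. This is a TypeScript illustration, not Starling code: RecordingVertexBuffer is a hypothetical stand-in that only records calls, its uploadFromVector mirrors the (data, startVertex, numVertices) parameter order of VertexBuffer3D, and SharedQuad with its method names is invented for the example. The default of 9 elements per vertex and the triangle count of 2 are assumptions.

```typescript
// Hypothetical stand-in for the shared VertexBuffer3D: records uploads only.
class RecordingVertexBuffer {
  uploads: { startVertex: number; numVertices: number }[] = [];

  uploadFromVector(data: number[], startVertex: number, numVertices: number): void {
    this.uploads.push({ startVertex: startVertex, numVertices: numVertices });
  }
}

class SharedQuad {
  private needsUpload = true; // initially true: nothing has been uploaded yet
  private vertexData: number[] = [];

  constructor(
    private quadIndex: number,
    private sharedBuffer: RecordingVertexBuffer,
    elementsPerVertex: number = 9 // assumed value
  ) {
    for (let i = 0; i < elementsPerVertex * 4; i++) {
      this.vertexData.push(0);
    }
  }

  // Replaces the old "recreate the vertex buffer" calls in setters.
  markChanged(): void {
    this.needsUpload = true;
  }

  // Modified createVertexBuffer: upload only, and only when needed.
  private uploadIfNeeded(): void {
    if (!this.needsUpload) {
      return;
    }
    this.sharedBuffer.uploadFromVector(this.vertexData, this.quadIndex * 4, 4);
    this.needsUpload = false;
  }

  // Returns the values render would pass on to drawTriangles.
  render(): { firstIndex: number; numTriangles: number } {
    this.uploadIfNeeded();
    return { firstIndex: this.quadIndex * 6, numTriangles: 2 };
  }
}

const buffer = new RecordingVertexBuffer();
const quad = new SharedQuad(5, buffer);
const firstDraw = quad.render();  // uploads 4 vertices starting at vertex 20
const secondDraw = quad.render(); // no upload: the data has not changed
```

An unchanged quad costs no upload on later frames, while a quad that is moved or recolored every frame still uploads its four vertices each time it is rendered.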

After making these modifications, try the Starling demo application again with the frame rate set to 30 fps. On my laptop, which I also used for the optimization tests in the previous post, I got almost 8000 images with GPU rendering. There also seems to be a 20-25% performance improvement with the projector; with a plugin player the improvement can be over 100%, as discussed in the next post. In any case, we can now clearly pass the original limit of 4096 Images.

By villekoskela, on October 20, 2011 at 6:46 am, under ActionScript 3. 15 Comments

Comments

sasmaster On October 20, 2011 at 1:23 pm

Please tell me what graphics card and CPU you run? Because I can get no more than 10 fps when spawning around 2000 images in Starling using an Intel dual core + Nvidia 8600 GT. I think most average machines would get the same results. Another question: how much improvement did you get after doing this hack? In fact I think this is a minor problem of Starling, because it is utterly unwise to load buffers on each frame regardless of the object’s state. Just for comparison, I created a basic scene graph in my mod of Starling where I upload the buffers only once for each image, and then do it again only if some of the object’s properties get “dirty”. Guess what? I got a 35-40% performance gain. And that is even without implementing the batching techniques that can be seen in ND2D, which can run more than 10000 particles at a pretty decent frame rate. Anyway, nice article and cool idea you have presented here!
villekoskela On October 20, 2011 at 1:52 pm

I have a laptop with Intel Core i5-480M dual core 2,66GHz CPU and nVidia Geforce GT 415M GPU. I checked some 3DMark06 scores and your Geforce 8600 GT should be a bit faster than my laptop’s GPU if you have the desktop version (the laptop version is a bit slower than mine). For the performance improvement gained from these modifications I do not have any accurate numbers but I can do the measurements later.
- sasmaster On October 20, 2011 at 2:01 pm
  
  No way is an “Intel Core i5-480M dual core 2,66GHz CPU and nVidia Geforce GT 415M GPU” machine slower than, or the same as, my old 8600GT and Pentium 4 dual core CPU. (Your card is far more advanced.)
  http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_8_.288xxx.29_series
villekoskela On October 20, 2011 at 2:12 pm

The Core i5-480M CPU is faster than a Pentium 4 dual core, but according to the 3DMark06 scores available in many charts the GT 415M is slower than the 8600 GT. The chart you posted shows this as well (check the fill rates and GFLOPs). If you are getting lower performance, make sure your display drivers are up to date so that Flash is not using software rendering, and that you are running a release build of the benchmark application.
- sasmaster On October 20, 2011 at 2:15 pm
  
  Come on, man. I am not such a noob that I can’t figure out when I am running in software mode. And yes, I am talking about a release build. But never mind, nice article.
villekoskela On October 20, 2011 at 10:06 pm

Did some performance tests and it seems that after these changes the benchmark can handle 20-40% more Images (actual results depend on GPU and CPU used) with GPU rendering.
sasmaster On October 23, 2011 at 7:49 pm

One question. Why do you pack your vertex and index buffers with the data for all the images? You can just keep using the same two static buffers with the data for one image only, because basically it does not change over time. Your approach would be essential only if you deform the mesh of the objects at runtime. In that case there is a need to update the vertex data in the vertex buffer per object. Or maybe I am completely missing your point?
Michael.
- villekoskela On October 23, 2011 at 8:08 pm
  
  Here I am making just “minimal modifications” to the original Starling framework to get more than 4096 Images at the same time (8192 with a single vertex and index buffer in this example). Like you said, this approach also allows modifying the objects (images) at runtime, just like the original Starling framework. This functionality is needed for the Starling demo’s benchmark, since all the images rotate independently of each other and new images are added at every frame update. Good question! Hope I managed to answer it.
  - sasmaster On November 2, 2011 at 4:49 pm
    
    What I am trying to say is that you are doing too much work to achieve your task. You don’t really have to push predefined vertices and indices for all your 8000 objects into the vertex array and work with a position index to target the right mesh in the buffer stream. The first half of your approach is cool. Then, on init of each image, you can just upload its vertices and indices into the one and only pair of vertex and index buffers, and that is it. NO fancy pops and pushes with all that vertex array traversing. And yes, you can rotate and move your stuff as you wish. If you want to see what I mean, look at my mod (https://github.com/sasmaster/FLINTMolehill) of Flint Particles, where I have been working (with Richard Lord) on the integration of a Stage3D-based 2D particle system. I took the base design from Starling and then did some major optimizations, some of which are the same ones you talk about in your article. Yes, you are right: using only one buffer both improves performance and frees you from the initial vertex limit. Still, if you create your objects anew all the time you are going to take a CPU hit because of the constant uploads, so just as you did, I also use a particle pool containing pre-initialized particles which are reused. So yes, you hit all the important points on Starling optimization, but I really can’t understand your idea behind filling the vertex data array with the data for all your particles and then picking out the chunks for rendering. The only possible gain I can see here (and I am not sure how much it improves performance) is that instead of many small vector objects you have only one big one. But even if this way is more efficient (something I am not sure of, because of the way data is aligned and stored in low-level memory, and it is usually suggested to break the data into small chunks), I still think you lose some cycles on the lookup for the data in your vector when you need to use it.
villekoskela On November 2, 2011 at 7:03 pm

The point of using this big buffer is that even when the images don’t use the same texture or colors (don’t have identical vertex data) there is no need to upload the data to vertex buffer before each image is rendered. For particle animations where each particle has the same vertex data it’s possible to use the same shared buffer of four vertices.
- sasmaster On November 2, 2011 at 7:15 pm
  
  Well, you have got the point, right. In my case I really do work with a homogeneous particle stock. Thanks again. Great article 🙂
