I suspect that wouldn't help. If you plug a corpus of 3D things in (assuming you could provide the equivalent quantity and quality of training material), blindly discovering the important commonalities and distinctions will result in a flow of 3D things coming out. First, you'd have to supply a pipeline into 2Dness (and sensibly so; perhaps separately train another aspect of the AI to properly place and orientate such output items so that they compose well) and, second, there's a whole new dimension (or more, once you multiply elements) of inadvertent error for the engine to stray into.
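To make that "pipeline into 2Dness" concrete: something, somewhere, has to flatten the 3D output back onto a 2D canvas before it can be judged as an image. A minimal sketch, assuming a plain pinhole projection; the function name and the toy cube are purely illustrative, not any particular engine's API.

```python
import numpy as np

def project_points(points_3d, focal_length=1.0):
    """Project 3D points (N, 3) onto a 2D image plane with a simple
    pinhole model: x' = f*x/z, y' = f*y/z.  Points at or behind the
    camera (z <= 0) are discarded."""
    pts = np.asarray(points_3d, dtype=float)
    pts = pts[pts[:, 2] > 0]                 # keep only points in front
    return focal_length * pts[:, :2] / pts[:, 2:3]

# Toy example: the eight corners of a box sitting in front of the camera.
cube = np.array([[x, y, z] for x in (-1.0, 1.0)
                           for y in (-1.0, 1.0)
                           for z in (3.0, 5.0)])
print(project_points(cube))    # eight (x, y) positions on the canvas
```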
The chopsticks/noodles problem seemingly has little to do with the lack of 'depth perception' in the "understanding" algorithm; rather, straight or bendy distinct filaments define "chopsticks" and "noodles" better in multiples than anything else does (including a singular stick or strand), and so it happily overfulfills the count in the fine tradition of the Paperclip Maximiser. Somehow[1]...
Perhaps the weird twist of fingers might be 'cured' if you go to the trouble of demonstrating the normal 3D composition of a hand, but I'm not sure even that's guaranteed. Current "flat drawn" fingers might gain depth distortions to rival the oblique ones we already see, since the recomposition/reassembly of features would include far more differing orientations to merge into a generative output. The method already seems lax about the consistent length and extent of digits (and where the swathe of pixels terminates), so I'm not sure it can learn much more from the variety of related 3D data (and however it generates its voxels).
Which is not to say that you can't expect things to work in 3D as well as they do in 2D, all else being equal (except data volume, which needs to be exponentially greater[2]). But it's not necessarily going to solve the 2D problems.
[1] I don't know how much we've poked into the 'trained database' to try to distill out the learnt rules that seem to go a bit wild in these cases. I suspect it would be forensically obscure without spending a disproportionate amount of time digging about in data that is effectively "machine only"-readable. Perhaps we'd end up tweaking key (un-named) values manually and re-running to find how we can guarantee just two (nicely held) chopsticks, or a reasonable representation of pasta and mouth, without realising that this makes the next request for an image of (say) a horse jumping a gate go all wrong, because our value was doing double duty in this other manner of subject, for various mostly (and certainly trivially) ineffable reasons...
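For what that "tweak and re-run" forensics might look like in its crudest form, a hedged sketch follows. Everything in it is hypothetical: `model.weights`, the `generate()` callable and the probe prompts are stand-ins for whatever the real system exposes, which is rather the point of the problem.

```python
import copy

# Hypothetical probe prompts: the case we want to fix, plus an unrelated
# "control" prompt to catch a parameter doing double duty elsewhere.
PROBE_PROMPTS = [
    "two chopsticks, nicely held",
    "a horse jumping a gate",
]

def probe_weight(model, layer, index, delta, generate):
    """Nudge one (un-named) parameter, regenerate every probe prompt,
    and hand the results back for a human (or a metric) to judge the
    side effects.  `model`, `generate` and the weight layout are all
    assumptions, not any real library's API."""
    tweaked = copy.deepcopy(model)
    tweaked.weights[layer][index] += delta
    return {prompt: generate(tweaked, prompt) for prompt in PROBE_PROMPTS}
```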
[2] Or simplified, making it a different animal from the 2D Raster->[Black Box]->2D Raster training-to-result sequence, which is clearly based upon a simple array 'canvas' that can be kept as a valid 2D array (no matter how messy the contents). Do you tell it how to understand 3D Vector data, with its necessary tree structure? Or just plug in raw VRML (or whatever) and make part of its training problem learning to output intrinsically valid VRML markup, even before populating such a valid descriptor with hopefully aesthetically valid content? More decisions, more development, more internal complication to the project (one way or another).
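To illustrate the difference in what "valid output" even means between the two cases, here's a toy sketch. Nothing in it is a real format; the `Node` class is only a stand-in for a VRML-like scene graph, set against a flat canvas array.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np

# The 2D case: any array of the right shape is a structurally valid
# canvas, no matter how messy the pixel values inside it are.
canvas = np.zeros((256, 256, 3), dtype=np.uint8)

# The 3D case: a VRML-ish scene graph.  Here "valid" means the tree
# itself is well-formed (nodes nested sensibly, transforms attached to
# the right children) before anyone asks whether the content looks good.
@dataclass
class Node:
    name: str
    translation: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    children: List["Node"] = field(default_factory=list)

scene = Node("root", children=[
    Node("table", children=[
        Node("bowl", translation=(0.0, 0.8, 0.0), children=[
            Node("noodles"),
        ]),
        Node("chopstick_left",  translation=(-0.1, 0.9, 0.0)),
        Node("chopstick_right", translation=( 0.1, 0.9, 0.0)),
    ]),
])
```

A generator targeting the first representation only has to fill in numbers; one targeting the second has to emit a well-formed tree before its contents can even be rendered, which is the extra layer of decisions and development the footnote is gesturing at.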