Think of it like this: every frame, a very basic (non-optimized) game runs a rendering function along these lines (greatly simplified):
Clear the screen image
For each layer
    Clear the layer image
    For each tilemap
        Clear the tilemap image
        For each tile
            Draw the tile onto the tilemap image
        End
        Apply color effects to the tilemap image
        Copy the part of the tilemap image visible on-screen onto the layer image
    End
    For each sprite
        If the sprite is on-screen
            Clear the sprite image
            Draw the sprite frame onto the sprite image
            Apply color effects to the sprite image
            Copy the sprite image onto the layer image
        End
    End
    Apply color effects to the layer image
    Draw the part of the layer image visible on-screen onto the screen image
End
Draw the screen image to the real screen
So every layer, tilemap, and sprite actually stores its own image. (In practice, to save memory, if an image doesn't need to be kept for more than a single operation, I just use a temporary canvas that is repeatedly blanked out and drawn over.)
(Actually, most simple games would just draw the images directly onto the screen. A lot of the draw actions here exist for the sake of applying color effects to individual objects, and in fact when there are no colorization effects a lot of draw actions are skipped, although not quite as many as there would be if the game were truly optimized. Part of the problem is that since I can't control how the world builder will apply certain effects, skipping steps can sometimes end up adding extra draw operations in the long run.)
Every draw or copy action takes time; the bigger the image being drawn, the more time it takes (the total size of all draw actions matters more than the total number of draw actions; after all, further behind the scenes the code is already drawing every pixel). Improving framerate is basically about streamlining this process so the game needs as few, and as small, draw actions as possible. Much of this process is already optimized: the whole point of tilesets is that they don't have to be completely redrawn every frame. Instead of clearing the whole tilemap image every frame, they only need to clear and redraw a single tile when that tile changes.
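As a rough illustration of the "only redraw the tile that changed" idea, here is a minimal Python sketch. It is not the engine's actual code: the `Tilemap` class, the `TILE_SIZE` value, and the `pixels_drawn` counter are all hypothetical, and the counter simply stands in for the time real draw calls would take.

```python
TILE_SIZE = 16  # assumed tile edge length in pixels (hypothetical value)


class Tilemap:
    """Hypothetical tilemap that keeps its image and redraws only dirty tiles."""

    def __init__(self, cols, rows):
        self.cols = cols
        self.rows = rows
        self.tiles = [[0] * cols for _ in range(rows)]
        # Every tile starts dirty so the first refresh builds the full image.
        self.dirty = {(c, r) for r in range(rows) for c in range(cols)}
        self.pixels_drawn = 0  # stand-in for time spent drawing

    def set_tile(self, col, row, tile_id):
        # Only mark the tile dirty if its content actually changes.
        if self.tiles[row][col] != tile_id:
            self.tiles[row][col] = tile_id
            self.dirty.add((col, row))

    def refresh(self):
        """Redraw only the tiles that changed and return how many there were."""
        count = len(self.dirty)
        # A real version would clear and draw each dirty tile's rectangle here.
        self.pixels_drawn += count * TILE_SIZE * TILE_SIZE
        self.dirty.clear()
        return count


tilemap = Tilemap(cols=40, rows=30)
tilemap.refresh()            # first frame: every tile is drawn once
tilemap.set_tile(3, 4, 7)    # a single tile changes
redrawn = tilemap.refresh()  # next frame: only that tile is redrawn
```

In this sketch the second frame costs one tile's worth of pixels instead of the whole tilemap's, which is the saving described above.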
If I add an extra image representing all tilemaps in the layer, then each individual tilemap doesn't need to be drawn to the layer every frame. The "all tilemaps" image just needs to be updated when one of the tilemaps in a layer is changed in a way that makes it need redrawing.
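A minimal Python sketch of how that extra "all tilemaps" image could be tracked, under assumptions of my own: the `TilemapLayerCache` class, its method names, and the tilemap names are hypothetical, and the counters stand in for real draw operations.

```python
class TilemapLayerCache:
    """Hypothetical cache of the combined 'all tilemaps' image for one layer."""

    def __init__(self, tilemap_names):
        self.tilemap_names = list(tilemap_names)
        self.needs_rebuild = True  # nothing has been combined yet
        self.rebuild_count = 0     # times the combined image was rebuilt
        self.copies_to_layer = 0   # draw operations onto the layer image

    def mark_changed(self, name):
        # Called when a tilemap changes in a way that makes it need redrawing.
        if name in self.tilemap_names:
            self.needs_rebuild = True

    def draw_to_layer(self):
        if self.needs_rebuild:
            # A real version would redraw every tilemap onto the combined image.
            self.rebuild_count += 1
            self.needs_rebuild = False
        # One copy per frame, no matter how many tilemaps the layer contains.
        self.copies_to_layer += 1


cache = TilemapLayerCache(["ground", "walls", "decor"])
for frame in range(60):
    if frame == 30:
        cache.mark_changed("walls")  # one tilemap changes halfway through
    cache.draw_to_layer()
```

Over these 60 frames the combined image is rebuilt only on the first frame and on the frame where a tilemap changed; every other frame does a single copy to the layer. The sketch assumes the tilemaps in a layer move together; if they scrolled independently, a scroll would presumably also have to count as a change.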