A friend of mine once suggested that ‘a good way’ to compress a (2D) video stream would be to compute the per-pixel difference of each frame with respect to the previous frame, and then to JPEG-compress the result. As it turns out, this is not quite how MJPEG works, since MJPEG simply JPEG-compresses every frame independently; but inter-frame differencing of this kind is close to what the MPEG family of codecs actually does. However, the up-to-date, state-of-the-art compression schemes go further than that, in order to achieve smaller file-sizes, and are often based on Macroblocks.
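My friend’s idea can be sketched in a few lines. This is a toy illustration only, with frames as plain 2D lists of grayscale values; the function names are mine, not from any real codec.

```python
# Sketch of the frame-differencing idea: encode each frame as its
# per-pixel difference from the previous frame. Function names are
# illustrative, not from any real codec.

def frame_difference(prev, curr):
    """Per-pixel difference of two equally sized grayscale frames."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def apply_difference(prev, diff):
    """Reconstruct the current frame from the previous frame + diff."""
    return [[p + d for p, d in zip(prow, drow)]
            for prow, drow in zip(prev, diff)]

prev = [[10, 10], [10, 10]]
curr = [[10, 12], [9, 10]]
diff = frame_difference(prev, curr)   # mostly zeroes, which compress well
assert apply_difference(prev, diff) == curr
```

The appeal is that for typical video, most of the difference frame is zero or near zero, and that is exactly what JPEG-style entropy coding rewards.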
My friend also failed to notice that at some point within 2D video compression, ‘reference frames’ are needed, which are sometimes also referred to as key-frames. These key-frames should not be confused, however, with the key-frames used in 2D and 3D video-editing software to control animations. Reference frames are needed, if for no other reason, because the ‘comparison frames’ are decompressed with small amounts of error. Without a periodic reset, the decoded frame’s contents would drift further and further from the intended, original content, beyond what is acceptable even for a compressed stream.
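The drift argument can be made concrete with a toy calculation. The per-frame error value here is made up purely for the demonstration:

```python
# Illustration of drift: if every decoded difference frame carries a
# small error, the reconstruction wanders further from the source with
# each frame, until a fresh reference frame resets it.

ERROR_PER_FRAME = 0.5   # hypothetical decoding error per difference frame

def drift_after(n_frames, ref_interval):
    """Worst-case accumulated error, resetting at each reference frame."""
    drift = 0.0
    worst = 0.0
    for i in range(n_frames):
        if i % ref_interval == 0:
            drift = 0.0          # reference frame: coded exactly, drift resets
        else:
            drift += ERROR_PER_FRAME
        worst = max(worst, drift)
    return worst

# With only one reference frame in 100 frames the error keeps growing;
# with one every 10 frames it stays bounded.
print(drift_after(100, 100), drift_after(100, 10))
```

The numbers are arbitrary, but the shape of the problem is not: the error budget grows linearly with the distance from the last reference frame.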
The concept behind Macroblocks can be stated quite easily. Any frame of a video stream can be subdivided into so-called “Transform Blocks”, typically 8×8 pixel-groups, of which the Discrete Cosine Transform can be computed, in what would amount to the simple compression of each frame. The DCT coefficients are then quantized, as is familiar from JPEG. Because the video is also encoded in a Y’UV colour scheme, the DCT is computed at two resolutions: full resolution for the Luminance values, and a lower resolution for the Chroma values, each chroma sample spanning several luma pixels (a 2×2 group, in the common 4:2:0 subsampling). However, it is in the comparison of each frame with the previous frames that ‘good’ 2D video compression has an added aspect of complexity, which my friend did not foresee.
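The per-block transform-and-quantize step can be sketched as follows. This is the textbook O(N⁴) form of the orthonormal 2D DCT-II with a single uniform quantizer step; real codecs use fast transforms and per-frequency quantization tables.

```python
import math

N = 8  # transform blocks are 8x8

def dct_2d(block):
    """Orthonormal 2D DCT-II of an NxN block (straightforward O(N^4) form)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def quantize(coeffs, step):
    """Uniform quantization; real codecs use a per-frequency table."""
    return [[round(c / step) for c in row] for row in coeffs]

flat = [[100] * N for _ in range(N)]   # a perfectly flat 8x8 block
q = quantize(dct_2d(flat), 16)
# Only the DC coefficient survives; every AC coefficient quantizes to zero.
```

A flat block is the extreme case, but it shows why the DCT is so useful here: all of the block’s energy collapses into a handful of coefficients, and quantization zeroes out the rest.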
The preceding frame is first translated in 2D by a motion vector that is encoded with each Macroblock, as an estimate of motion on the screen. Only after this translation of the subdivided image, by an integer number of pixels in X and in Y, does a sub-result form, against which the per-pixel difference of the present frame is computed. The resulting per-pixel values may or may not be non-zero, which opens the possibility that an entire Transform Block has DCT coefficients which are all zeroes.
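Motion-compensated differencing can be sketched like so. Pixels shifted in from outside the frame are simply treated as zero here; real codecs handle the borders more carefully.

```python
# Sketch of motion-compensated prediction: translate the previous frame
# by an integer motion vector, then take the per-pixel residual.

def translate(frame, dx, dy):
    """Shift a frame by (dx, dy) pixels, filling exposed edges with 0."""
    h, w = len(frame), len(frame[0])
    return [[frame[y - dy][x - dx]
             if 0 <= y - dy < h and 0 <= x - dx < w else 0
             for x in range(w)]
            for y in range(h)]

def residual(prev, curr, dx, dy):
    """Per-pixel difference of the current frame against the shifted previous one."""
    pred = translate(prev, dx, dy)
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(pred, curr)]

# An object at the top-left moves one pixel to the right: with the right
# motion vector the residual is all zeros, so nothing needs to be coded.
prev = [[9, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
curr = [[0, 9, 0],
        [0, 0, 0],
        [0, 0, 0]]
assert residual(prev, curr, 1, 0) == [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Without the motion vector, the same pair of frames would produce a residual with two non-zero pixels; with it, the residual vanishes entirely.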
Each Macroblock possesses a set of 6 bits, four of which indicate whether non-zero coefficients have been encoded in its 4 Luminance Transform Blocks, plus one bit each to state the same for its single (U) and its single (V) Transform Block. And it is in the possibility that an entire DCT resulted in zeroes, such that the corresponding Macroblock bit can be cleared, that the greatest compression is achieved: the corresponding Transform Block can then simply be left out of the stream!
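Those six flags amount to a coded-block-pattern. A minimal sketch, with tiny 2×2 stand-in blocks instead of real 8×8 ones:

```python
# Sketch of a coded-block-pattern: one flag per transform block of a
# macroblock (4 luma + 1 U + 1 V in 4:2:0), set only when the block has
# at least one non-zero quantized coefficient. All-zero blocks are omitted.

def has_nonzero(block):
    return any(c != 0 for row in block for c in row)

def coded_block_pattern(luma_blocks, u_block, v_block):
    """Return the six flags; only flagged blocks would enter the stream."""
    return [has_nonzero(b) for b in luma_blocks + [u_block, v_block]]

zero = [[0, 0], [0, 0]]   # toy 2x2 "blocks", for brevity
live = [[0, 3], [0, 0]]
cbp = coded_block_pattern([live, zero, zero, zero], zero, live)
# Four of the six blocks can be skipped entirely in this macroblock.
```

In a well-predicted region, most macroblocks end up with most flags cleared, and that is where the bulk of the savings comes from.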
There are just two further details through which high-performing 2D video compression achieves its good results:
- Instead of there being one type of interpolated frame between reference frames, there can actually be two types of interpolated frames, the first of which can act as a local reference frame, from which the second type of interpolated frame is derived. Somewhat inconsistently, these interpolated frames are sometimes referred to as Prediction Frames, as opposed, perhaps, to Intra-Frames…
- The second level of interpolated frames can sometimes benefit from bidirectional prediction. This means that they are derived not only from the preceding prediction frame, but simultaneously from the following prediction frame, such that the two results can be blended together into the suggested ‘sub-result’.
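The bidirectional blending in the second bullet can be sketched as a plain average of the two predictions; real codecs can also weight the two sides unequally.

```python
# Sketch of bidirectional prediction: blend the prediction taken from the
# preceding reference with the one taken from the following reference.
# A plain average is used here; real codecs support weighted blends.

def blend(forward_pred, backward_pred):
    return [[(f + b) / 2 for f, b in zip(frow, brow)]
            for frow, brow in zip(forward_pred, backward_pred)]

before = [[10, 20], [30, 40]]   # prediction from the preceding frame
after  = [[14, 20], [30, 44]]   # prediction from the following frame
mid = blend(before, after)      # candidate 'sub-result' for the B-frame
```

For content that changes smoothly, this midpoint tends to land closer to the true frame than either one-sided prediction, shrinking the residual further.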
Usually, if two ranks of interpolated frames are being computed, the designations are I-Frames, P-Frames and B-Frames, and the resulting sequence is called a Group Of Pictures.
What the exact pattern of sub- and sub-sub-frames is can be made to vary, as part of the meta-data of how a stream is encoded, even when the same actual CODEC is being used.
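One simple way such a pattern can be laid out is the following sketch: an I-frame opens the group, and runs of B-frames sit between the anchors. This layout rule is my simplification; real encoders negotiate more elaborate structures.

```python
# Sketch of a GOP frame-type pattern: given the GOP length and the number
# of B-frames between anchor frames, lay out I/P/B frame types.

def gop_pattern(gop_size, b_frames):
    types = []
    for i in range(gop_size):
        if i == 0:
            types.append('I')                  # the group's reference frame
        elif i % (b_frames + 1) == 0:
            types.append('P')                  # local reference, forward-predicted
        else:
            types.append('B')                  # bidirectionally predicted
    return ''.join(types)

print(gop_pattern(12, 2))   # a 12-frame group with 2 B-frames per anchor
```

A classic MPEG-style layout such as IBBPBBPBBPBB falls straight out of this rule.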
Using popular software to export videos into sophisticated compressed streams, such as the GIMP-GAP extension of GIMP, can have the side effect that the application specifies such a fancy combination of interpolated frames that, given a short number of frames to export, either some trailing frames are simply dropped, or no output results at all because the video was too short. I tend to shrug off such results as being due to only complete Groups Of Pictures being exported. Recently, a significant 24-frame animation did not export to H.264 format at all, while a trivial 48-frame animation exported just fine, to a 32-frame video-clip.
After digging a bit further, I discovered that the version of GIMP-GAP which I had just compiled sets an unreasonable ‘GOP’ parameter of 250. For the significant animation that had initially failed to produce any video-clip, I added a 25th frame, and changed this parameter to a multiple of (B-Frames + 1) that was also an exact divisor of (the Total Number of Frames - 1), that parameter now being set to 12. The result was that my entire animation was now exported to the resulting MP4-File.
Just as easily, I was able to create a trivial 49-frame animation, set the ‘GOP’ parameter to 24 frames, and set the ‘B-Frames’ parameter to 5. The result was, again, that my entire animation was exported…
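The rule these two experiments suggest can be written down as a small check. To be clear, this is my reading of the observed behaviour, not a documented GIMP-GAP contract:

```python
# Hypothetical check of the export rule inferred above: the GOP length
# should be a multiple of (B-frames + 1) and should exactly divide
# (total frames - 1).

def gop_fits(total_frames, gop_size, b_frames):
    return (gop_size % (b_frames + 1) == 0
            and (total_frames - 1) % gop_size == 0)

# The 49-frame animation with GOP 24 and 5 B-frames satisfies both rules,
# whereas the original animation could never fit the default GOP of 250.
print(gop_fits(49, 24, 5), gop_fits(24, 250, 5))
```

The same check also passes for the repaired 25-frame animation with its GOP of 12, which is consistent with both exports succeeding.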