MPEG-1 VIDEO

How does MPEG-1 VIDEO work ?

First off, it starts with a relatively low resolution video sequence (possibly decimated from the original) of about 352 by 240 pixels by 30 frames/s (US; different numbers for Europe), but original high (CD) quality audio. The images are in color, but converted to YUV space, and the two chrominance channels (U and V) are decimated further to 176 by 120 pixels. It turns out that you can get away with a lot less resolution in those channels and not notice it, at least in "natural" (not computer generated) images.

The basic scheme is to predict motion from frame to frame in the temporal direction, and then to use DCT's (discrete cosine transforms) to organize the redundancy in the spatial directions. The DCT's are done on 8x8 blocks, and the motion prediction is done in the luminance (Y) channel on 16x16 blocks. In other words, given the 16x16 block in the current frame that you are trying to code, you look for a close match to that block in a previous or future frame (there are backward prediction modes where later frames are sent first to allow interpolating between frames). The DCT coefficients (of either the actual data, or the difference between this block and the close match) are quantized, which means that you divide them by some value to drop bits off the bottom end. Hopefully, many of the coefficients will then end up being zero. The quantization can change for every "macroblock" (a macroblock is 16x16 of Y and the corresponding 8x8's in both U and V). The results of all of this, which include the DCT coefficients, the motion vectors, and the quantization parameters (and other stuff), are Huffman coded using fixed tables. The DCT coefficients have a special Huffman table that is two-dimensional in that one code specifies a run-length of zeros and the non-zero value that ended the run. Also, the motion vectors and the DC DCT components are DPCM coded (each one is sent as the difference from the last one).
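To make the quantize, run-length and DPCM steps concrete, here is a minimal sketch in Python. It is not the normative MPEG procedure: the function names, the single step size, the rounding toward zero, the starting predictor of 0 and the sample coefficient values are all assumptions made for illustration.

```python
def quantize(coeffs, step):
    """Divide each coefficient by the step size, rounding toward zero (assumed rule)."""
    return [int(c / step) for c in coeffs]


def run_level_pairs(quantized):
    """Turn scanned coefficients into (run_of_zeros, nonzero_value) pairs."""
    pairs = []
    run = 0
    for level in quantized:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs  # trailing zeros are dropped; a real stream marks end-of-block


def dpcm(values, start=0):
    """Code each value as the difference from the previous one (start is assumed)."""
    out = []
    prev = start
    for v in values:
        out.append(v - prev)
        prev = v
    return out


# Hypothetical AC coefficients of one 8x8 block, already in scan order.
ac = [45, -18, 7, 0, 0, 12, -3, 0, 0, 0, 5]
q = quantize(ac, step=8)        # [5, -2, 0, 0, 0, 1, 0, 0, 0, 0, 0]
pairs = run_level_pairs(q)      # [(0, 5), (0, -2), (3, 1)]
dc_diffs = dpcm([100, 104, 101])  # [100, 4, -3]
```

Each (run, level) pair is what one entry of the two-dimensional Huffman table would stand for. Note how the coarse step turns most of the small coefficients into zeros, which is where the savings come from.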

There are three types of coded frames. There are I or intra frames. They are simply a frame coded as a still image, not using any past history. You have to start somewhere. Then there are P or predicted frames. They are predicted from the most recently reconstructed I or P frame. (I'm describing this from the point of view of the decompressor.) Each macroblock in a P frame can either come with a vector and difference DCT coefficients for a close match in the last I or P, or it can just be "intra" coded (like in the I frames) if there was no good match.
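The "close match" search on the compressor side can be sketched as an exhaustive block match. The standard does not tell the encoder how to search, so everything here (sum of absolute differences as the cost, the search range, the function names, the small synthetic frames) is an assumption for illustration only.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between an n x n block in cur and one in ref."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def best_match(cur, ref, cx, cy, n=16, search=7):
    """Try every offset within +/- search pixels; return (dx, dy, cost) of the best."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if best is None or cost < best[2]:
                    best = (dx, dy, cost)
    return best


# Hypothetical 12x12 test frames: content found at (x + 2, y + 1) in the
# reference frame appears at (x, y) in the current frame.
ref = [[x * x + 3 * y * y for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]
vector = best_match(cur, ref, cx=4, cy=4, n=4, search=3)  # (2, 1, 0)
```

A cost of zero means the match is exact, so only the vector would need to be sent. With a non-zero cost the difference block goes through the DCT, and with a very large cost the encoder would fall back to intra coding.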

Lastly, there are B or bidirectional frames. They are predicted from the closest two I or P frames, one in the past and one in the future. You search for matching blocks in those frames, and try three different things to see which works best. (Now I have the point of view of the compressor, just to confuse you.) You try using the forward vector, the backward vector, and you try averaging the two blocks from the future and past frames, and subtracting that from the block being coded. If none of those work well, you can intra-code the block.
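The three-way choice for a B-frame block might look like the sketch below. The error measure, the rounding in the average and the intra threshold are assumptions, as are the names; a real encoder is free to decide differently. Blocks are flattened to plain lists to keep the example short.

```python
def block_error(block, prediction):
    """Sum of absolute differences between a block and a candidate prediction."""
    return sum(abs(b - p) for b, p in zip(block, prediction))


def choose_b_mode(block, past_block, future_block, intra_threshold):
    """Pick forward, backward or interpolated prediction, or fall back to intra."""
    average = [(p + f + 1) // 2 for p, f in zip(past_block, future_block)]
    candidates = {
        "forward": past_block,        # match found in the past I or P frame
        "backward": future_block,     # match found in the future I or P frame
        "interpolated": average,      # average of the two matches
    }
    errors = {name: block_error(block, pred) for name, pred in candidates.items()}
    mode = min(errors, key=errors.get)
    if errors[mode] > intra_threshold:   # hypothetical cut-off for "no good match"
        return "intra", None
    residual = [b - p for b, p in zip(block, candidates[mode])]
    return mode, residual


# Hypothetical 4-sample "blocks" standing in for 16x16 macroblocks.
smooth = choose_b_mode([10, 20, 30, 40], [8, 18, 28, 38], [12, 22, 32, 42], 10)
busy = choose_b_mode([100, 0, 100, 0], [8, 18, 28, 38], [12, 22, 32, 42], 10)
```

In the first call the average of the two matches reproduces the block exactly, so the interpolated mode wins with an all-zero residual. In the second nothing matches and the block is intra coded.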

The sequence of decoded frames usually goes like:

IBBPBBPBBPBBIBBPBBPB... Where there are 12 frames from I to I (for US and Japan anyway.) This is based on a random access requirement that you need a starting point at least once every 0.4 seconds or so. The ratio of P's to B's is based on experience. Of course, for the decoder to work, you have to send that first P *before* the first two B's, so the compressed data stream ends up looking like: 0xx312645... where those are frame numbers. xx might be nothing (if this is the true starting point), or it might be the B's of frames -2 and -1 if we're in the middle of the stream somewhere.

You have to decode the I, then decode the P, keep both of those in memory, and then decode the two B's. You probably display the I while you're decoding the P, and display the B's as you're decoding them, and then display the P as you're decoding the next P, and so on.
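The reordering from display order to transmission order can be written down in a few lines. This is only a sketch of the rule described above (each I or P goes out ahead of the B's that precede it in display order); the function name is made up.

```python
def coded_order(display_types):
    """Return display-order frame numbers in the order they are transmitted."""
    order = []
    pending_b = []
    for number, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(number)   # has to wait for the next I or P
        else:
            order.append(number)       # the reference frame goes out first
            order.extend(pending_b)    # then the B's that depend on it
            pending_b = []
    return order  # trailing B's with no following reference are left out


stream = coded_order("IBBPBBP")  # [0, 3, 1, 2, 6, 4, 5]
```

The result matches the 0 3 1 2 6 4 5 pattern given above, for the case where the stream starts at the first I frame.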

What do B-frames buy you ?

Since bi-directional macroblock predictions are an average of two macroblocks, noise is reduced at low bit rates. At nominal MPEG-1 video (352 x 240 x 30, 1.15 Mbit/sec) rates, it is said that B-frames improve SNR by as much as 2 dB. (0.5 dB gain is usually considered worth-while in MPEG). However, at higher bit rates, B-frames become less useful since they inherently do not contribute to the progressive refinement of an image sequence (i.e. they are not used as prediction by subsequent coded frames). Regardless, B-frames are still politically controversial.

Why do some people hate B-frames ?

Computational complexity, bandwidth, delay, and picture buffer size are the four B-frame Pet Peeves. Computational complexity is increased since some macroblock modes require averaging between two macroblocks. Worst case, memory bandwidth is increased an extra 16 MByte/s (601 rate) for this extra prediction. An extra picture buffer is needed to store the future prediction reference (bi-directionality). Finally, extra delay is introduced in encoding since the frame used for backwards prediction needs to be transmitted to the decoder before the intermediate B-pictures can be decoded and displayed.

Cable television companies (e.g. General Instruments) have been particularly averse to B-frames since the extra picture buffer pushes the decoder DRAM memory requirements past the magic 8-Mbit (1 Mbyte) threshold into the realm of 16 Mbits (2 MByte) for CCIR-601 frames (704 x 480), yet not for lowly 352 x 480. However, cable does not realize that DRAM does not come in convenient high-volume (low cost) 8-Mbit packages as 16-Mbit does. In a few years, the cost differences between 16 Mbit and 8 Mbit will become insignificant compared to the gain in compression. For the time being, cable boxes will start with 8-Mbit and allow future drop-in upgrades to 16-Mbit. The early market success of B-frames seems to have been determined by a fire at a Japanese chemical plant.

Can motion vectors be used to measure object velocity ?

Motion vector information cannot be reliably used as a means of determining object velocity unless the encoder model specifically set out to do so. First, encoder models that optimize picture quality form vectors that typically minimize prediction error and, consequently, the vectors often do not represent true object translation. Standards converters that re-sample one frame rate to another (as in NTSC to PAL) use different methods (field coding, edge detection, et al) that are not concerned with optimizing SNR vs bitrate. Secondly, motion vectors are not transmitted for all macroblocks anyway.

How do you code interlaced video with MPEG-1 syntax ?

Two methods can be applied to interlaced video that maintain syntactic compatibility with MPEG-1 (which was originally designed for progressive frames only). In the field concatenation method, the encoder model can carefully construct predictions and prediction errors that realize good compression but maintain field integrity (distinction between adjacent fields of opposite parity). Some pre-processing techniques can also be applied to the interlaced source video that would, e.g., lessen sharp vertical frequencies. This technique is not efficient of course. On the other hand, if the original source was progressive (e.g. film), then it is more trivial to convert the interlaced source to a progressive format before encoding. (MPEG-2 would then only offer superior performance through greater DC block precision, non-linear mquant, intra VLC, etc.) Reconstructed frames are re-interlaced in the decoder Display process.

The second syntactically compatible method codes fields separately. Picture types are keyed to motion activity to aid efficiency of prediction.

Where did they get 352x240 ?

That derives from the CCIR-601 digital television standard which is used by professional digital video equipment. It is (in the US) 720 by 243 by 60 fields (not frames) per second, where the fields are interlaced when displayed. (It is important to note though that fields are actually acquired and displayed a 60th of a second apart.) The chrominance channels are 360 by 243 by 60 fields a second, again interlaced. This degree of chrominance decimation (2:1 in the horizontal direction) is called 4:2:2. The source input format for MPEG I, called SIF, is CCIR-601 decimated by 2:1 in the horizontal direction, 2:1 in the time direction, and an additional 2:1 in the chrominance vertical direction. And some lines are cut off to make sure things divide by 8 or 16 where needed. For 50 Hz display standards (PAL, SECAM) change the number of lines in a field from 243 or 240 to 288, and change the display rate to 50 fields/s or 25 frames/s. Similarly, change the 120 lines in the decimated chrominance channels to 144 lines. Since 288*50 is exactly equal to 240*60, the two formats have the same source data rate.
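The arithmetic behind those numbers can be checked with a short sketch. The function name is made up, and cropping to a multiple of 16 is an assumption standing in for "some lines are cut off"; it happens to reproduce the figures quoted above.

```python
def sif_from_601(width, lines_per_field, fields_per_second):
    """Derive SIF luminance size, chrominance size and frame rate from a 601 source."""
    luma_w = width // 2                    # 2:1 in the horizontal direction
    luma_h = lines_per_field               # one field's worth of lines per frame
    frame_rate = fields_per_second // 2    # 2:1 in the time direction
    luma_w -= luma_w % 16                  # assumed crop to whole macroblocks
    luma_h -= luma_h % 16
    chroma = (luma_w // 2, luma_h // 2)    # extra 2:1 each way for U and V
    return (luma_w, luma_h), chroma, frame_rate


ntsc = sif_from_601(720, 243, 60)   # ((352, 240), (176, 120), 30)
pal = sif_from_601(720, 288, 50)    # ((352, 288), (176, 144), 25)
same_line_rate = 288 * 50 == 240 * 60
```

The last line confirms the remark about equal source data rates: both formats carry 14400 lines of 352 samples per second.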

Can MPEG-1 encode higher sample rates than 352x240x30 ?

Yes. The MPEG-1 syntax permits sampling dimensions as high as 4095 x 4095 x 60 frames per second. The MPEG most people think of as "MPEG-1" is actually a kind of subset known as Constrained Parameters Bitstream (CPB).

What are Constrained Parameters Bitstreams ?

CPB are a limited set of sampling and bitrate parameters designed to normalize computational complexity, buffer size, and memory bandwidth while still addressing the widest possible range of applications. CPB limits video to 396 macroblocks (101,376 pixels) per frame if the frame rate is less than or equal to 25 fps (frames per second), and 330 macroblocks (84,480 pixels) per frame if the frame rate is less than or equal to 30 fps. Therefore, MPEG video is typically coded at SIF dimensions (352 x 240 x 30fps or 352 x 288 x 25 fps).

The total maximum sampling rate is 3.8 Ms/s (million samples/sec) including chroma. The coded video rate is limited to 1.862 Mbit/sec. In industrial practice, the bitrate is the most often waived parameter of CPB, with rates as high as 6 Mbit/sec in use.
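As a rough illustration, the macroblock limits quoted above can be turned into a small checker. It only encodes the two limits stated in this answer, not the complete list of constraints, and the function names are made up.

```python
def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover a frame."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def within_cpb(width, height, fps):
    """Check only the per-frame macroblock limits described in the text."""
    count = macroblocks(width, height)
    if fps <= 25:
        return count <= 396
    if fps <= 30:
        return count <= 330
    return False


def samples_per_second(width, height, fps):
    """Luminance plus two quarter-size chrominance channels."""
    return width * height * 3 // 2 * fps


ntsc_ok = within_cpb(352, 240, 30)          # 330 macroblocks
pal_ok = within_cpb(352, 288, 25)           # 396 macroblocks
too_big = within_cpb(352, 288, 30)          # 396 macroblocks at 30 fps
rate = samples_per_second(352, 240, 30)     # 3801600, i.e. about 3.8 Ms/s
```

Both SIF variants land exactly on their limit, which is why the two thresholds look so oddly specific.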

Why are Constrained Parameters Bitstreams so important?

It is an optimum point that allows (just barely) cost effective VLSI implementations in 1992 technology (0.8 microns). It also implies a nominal guarantee of interoperability for decoders and encoders. MPEG devices which are not capable of meeting SIF rates are not canonically considered to be true MPEG.

Are there ways of getting around Constrained Parameters Bitstreams for SIF class applications and decoders ?

Yes, some. Remember that CPB limits frames to 396 macroblocks (as in 352 x 288 SIF frames). 416 x 240 x 24 Hz sampling rates are still within the constraints, but this only aids NTSC (240 lines/field) displays. Deviating from 352 samples/line could throw off many decoder implementations that have limited horizontal sample rate conversion modes. Due to chip die size constraints (most chips barely pack in the necessary features), many decoders use simple doubling, e.g. 352 to 704 samples/line via binary taps which are simple shift-and-add operations. Future MPEG decoders will have arbitrary sample rate converters on-chip. Also remember that the 1.86 Mbit/sec limit is often ignored in real life.

How much does it compress ?

As mentioned before, audio CD data rates are about 1.5 Mbits/s. You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality. (So they say. For the most part it's true, but every once in a while a weird thing might happen that you'll notice. However the effect is very small, and it takes a listener trained to notice these particular types of effects.) That's about 6:1 compression. So, a CD MPEG I stream would have about 1.25 MBits/s left for video. The number I usually see though is 1.15 MBits/s (maybe you need the rest for the system data stream). You can then calculate the video compression ratio from the numbers here to be about 26:1. If you step back and think about that, it's little short of a miracle. Of course, it's lossy compression, but it can be pretty hard sometimes to see the loss, if you're comparing the SIF original to the SIF decompressed. There is, however, a very noticeable loss if you're coming from CCIR-601 and have to decimate to SIF, but that's another matter. I'm not counting that in the 26:1.
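The 26:1 figure can be reproduced as follows. The 8 bits per sample is an assumption not stated in the text above, and the function name is made up.

```python
def raw_sif_bitrate(width=352, height=240, fps=30, bits_per_sample=8):
    """Uncompressed SIF data rate in bits per second (8-bit samples assumed)."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)   # U and V at half size each way
    return (luma + chroma) * fps * bits_per_sample


raw = raw_sif_bitrate()     # 30412800 bits/s, about 30.4 Mbits/s
ratio = raw / 1.15e6        # a little over 26
```

So about 30.4 Mbits/s of SIF video is squeezed into 1.15 Mbits/s.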

The standard also provides for other bit rates ranging from 32Kbits/s for a single channel, up to 448 Kbits/s for stereo.

MPEG-1 AUDIO

Is the same video compression applied to audio ?

Definitely no. The eye and the ear, even if they are only a few centimeters apart, work very differently. The ear has a much higher dynamic range and resolution. It can pick out more details but it is slower than the eye.

The MPEG committee chose to recommend 3 compression methods and named them Audio Layer I, II and III. Layer I is the simplest, a sub-band coder with a psychoacoustic model (you'll get the details of this stuff further on). Layer II adds more advanced bit allocation techniques and greater accuracy. Layer III adds a hybrid filterbank and non-uniform quantization. Layers I, II and III give increasing quality/compression ratios with increasing complexity and demands on processing power.

The reasons for recommending 3 methods were partly that the testers felt that none of the coders was 100% transparent to all material and partly that the best coder (Layer III) was so computing heavy that it would seriously impact the acceptance of the standard.

The specs say that a valid Layer III decoder shall be able to decode any Layer I, II or III MPEG Audio stream. A Layer II decoder shall be able to decode Layer I and Layer II streams. I would not worry too much about Layer III. Layer II is where it's happening and the info in this FAQ is mainly about this coder.

How does MPEG-1 AUDIO work ?

Well, first you need to know how sound is stored in a computer. Sound is pressure differences in air. When picked up by a microphone and fed through an amplifier this becomes voltage levels. The voltage is sampled by the computer a number of times per second. For CD-audio quality you need to sample 44100 times per second and each sample has a resolution of 16 bits. In stereo this gives you 1.4Mbit per second and you can probably see the need for compression.

To compress audio MPEG tries to remove the irrelevant parts of the signal and the redundant parts of the signal. Parts of the sound that we do not hear can be thrown away. To do this MPEG Audio uses psychoacoustic principles.

How good is MPEG-1 AUDIO compression ?

MPEG can compress to a bitstream of 32kbit/s to 384kbit/s (Layer II). A raw PCM audio bitstream is about 705kbit/s per channel so this gives a max. compression ratio of about 22:1. Normal compression ratio is more like 6:1 or 7:1. If you think that this is not much please remember that unlike video we are talking about no perceivable quality loss here. 96kbit/s is considered transparent for most practical purposes. This means that you will not notice any difference between the original and the compressed signal for rock'n roll or popular music. For more demanding stuff like piano concerts and such you will need to go up to 128kbit/s.
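The raw data rates and the ratios quoted here follow directly from the CD sampling parameters. A quick check (the function name is made up):

```python
def pcm_bitrate(sample_rate=44100, bits=16, channels=1):
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate * bits * channels


mono = pcm_bitrate()                  # 705600 bits/s
stereo = pcm_bitrate(channels=2)      # 1411200 bits/s, the 1.4Mbit figure
max_ratio = mono / 32000              # about 22
ratio_at_96k = mono / 96000           # about 7.4
ratio_at_128k = mono / 128000         # about 5.5
```

The 22:1 maximum therefore compares one raw channel against the lowest 32kbit/s rate, while the "normal" ratios correspond to the 96 to 128kbit/s range.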

How does MPEG-1 AUDIO achieve this compression ratio ?

Well, with audio you basically have two alternatives. Either you sample less often or you sample with less resolution (less than 16 bit per sample). If you want quality you can't do much with the sample frequency. Humans can hear sounds with frequencies from about 20Hz to 20kHz. According to the Nyquist theorem you must sample at a rate of at least twice the highest frequency you want to reproduce. Allowing for imperfect filters, a 44.1kHz sampling rate is a fair minimum. So you either set out to prove the Nyquist theorem is wrong or go to work on reducing the resolution. The MPEG committee chose the latter.

Now, the real reason for using 16 bits is to get a good signal-to-noise (s/n) ratio. The noise we're talking about here is quantization noise from the digitizing process. For each bit you add, you get 6dB better s/n. (6dB corresponds to a doubling of the signal amplitude.) CD-audio achieves about 90dB s/n. This matches the dynamic range of the ear fairly well. That is, you will not hear any noise coming from the system itself (well, there are still some people arguing about that, but let's not worry about them for the moment). So what happens when you sample to 8 bit resolution ? You get a very noticeable noise floor in your recording. You can easily hear this in silent moments in the music or between words or sentences if your recording is a human voice. Waitaminnit. You don't notice any noise in loud passages, right? This is the masking effect and is the key to MPEG Audio coding. Stuff like the masking effect belongs to a science called psychoacoustics that deals with the way the human brain perceives sound. And MPEG uses psychoacoustic principles when it does its thing.
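The 6dB-per-bit rule is easy to check numerically. The sketch below quantizes a test tone to a given number of bits and measures the resulting s/n. The tone frequency, its amplitude and the simple rounding quantizer are all arbitrary choices for illustration, not anything taken from the MPEG specs.

```python
import math


def quantize(x, bits):
    """Round x (in the range -1..1) to the nearest of 2**bits uniform levels."""
    step = 2.0 / (2 ** bits)
    return round(x / step) * step


def snr_db(bits, n=10000):
    """Measured signal-to-quantization-noise ratio of a sine tone, in dB."""
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n):
        x = 0.99 * math.sin(2 * math.pi * 0.01234567 * k)  # hypothetical test tone
        error = quantize(x, bits) - x
        signal_power += x * x
        noise_power += error * error
    return 10 * math.log10(signal_power / noise_power)


snr_8 = snr_db(8)
snr_16 = snr_db(16)
db_per_bit = (snr_16 - snr_8) / 8   # should come out close to 6
```

Going from 16 bits down to 8 costs roughly 48dB of s/n, which is the very noticeable noise floor described above.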

Explain the masking effect

Say you have a strong tone with a frequency of 1000Hz. You also have a tone nearby of say 1100Hz. This second tone is 18 dB lower. You are not going to hear this second tone. It is completely masked by the first 1000Hz tone. As a matter of fact, any relatively weak sound near a strong sound is masked. If you introduce another tone at 2000Hz also 18 dB below the first 1000Hz tone, you will hear this. You will have to turn down the 2000Hz tone to something like 45 dB below the 1000Hz tone before it will be masked by the first tone. So the further you get from a sound the less masking effect it has. The masking effect means that you can raise the noise floor around a strong sound because the noise will be masked anyway. And raising the noise floor is the same as using fewer bits and using fewer bits is the same as compression.

Let's now try to explain how the MPEG Audio coder goes about its thing. It divides the frequency spectrum (20Hz to 20kHz) into 32 sub-bands. Each sub-band holds a little slice of the audio spectrum. Say, in the upper region of sub-band 8, a 1000Hz tone with a level of 60dB is present. OK, the coder calculates the masking effect of this sound and finds that there is a masking threshold for the entire 8th sub-band (all sounds with a frequency...) 35dB below this tone. The acceptable s/n ratio is thus 60 - 35 = 25 dB. This equals 4 bit resolution. In addition there are masking effects on bands 9-13 and on bands 5-7, the effect decreasing with the distance from band 8. In a real-life situation you have sounds in most bands and the masking effects are additive. In addition the coder considers the sensitivity of the ear for various frequencies. The ear is a lot less sensitive in the high and low frequencies. Peak sensitivity is around 2-4kHz, the same region that the human voice occupies.
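The step from an acceptable s/n ratio to a bit count can be sketched like this. It takes the subtraction in the example at face value (60 minus 35) and uses the 6dB-per-bit rule of thumb. Rounding to the nearest whole bit is an assumption chosen because it reproduces the 4 bits in the text; a cautious coder might round up instead. The function name and arguments are made up.

```python
def bits_needed(tone_db, mask_db, db_per_bit=6.0):
    """Bits of resolution for a sub-band, from the s/n ratio the masking allows."""
    acceptable_snr = tone_db - mask_db           # 60 - 35 = 25 dB in the example
    return max(0, round(acceptable_snr / db_per_bit))


band_8_bits = bits_needed(60, 35)   # 4
silent_band = bits_needed(20, 35)   # 0: everything in the band is masked
```

Compared with the 16 bits of the original samples, this one band gets away with a quarter of the resolution, and a band whose content lies entirely under the threshold needs no bits at all.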

The sub-bands should match the ear, that is each sub-band should consist of frequencies that have the same psychoacoustic properties. In MPEG layer II, each subband is 625Hz wide. It would have been better if the sub-bands were narrower in the low frequency range and wider in the high frequency range. To do this you need complex filters. To keep the filters simple they chose to add an FFT in parallel with the filtering and use the spectral components from the FFT as additional information to the coder. This way you get higher resolution in the low frequencies where the ear is more sensitive.

But there is more to it. We have explained concurrent masking, but the masking effect also occurs before and after a strong sound (pre- and postmasking) if there is a significant (30 - 40dB) shift in level. The reason is believed to be that the brain needs some processing time. Premasking is only about 2 to 5 ms. The postmasking can be up to 100ms. Other bit-reduction techniques involve considering tonal and non-tonal components of the sound. For a stereo signal you have a lot of redundancy between channels. The last step before formatting is Huffman coding.

The coder calculates masking effects by an iterative process until it runs out of time. It is up to the implementor to spend bits in the least obtrusive fashion. For layer II the coder works on 23 ms of sound (1152 samples) at a time. For some material the 23 ms time-window can be a problem. This is normally in a situation with transients where there are large differences in sound level over the 23 ms. The masking is calculated on the strongest sound and the weak parts will drown in quantization noise. This is perceived as a noise-echo by the ear. Layer III addresses this problem specifically.

What is the hardware demand ?

According to my information Layer III needs about 20MIPS per channel for real-time coding. This means a real fast DSP. Layer II on the other hand needs only a simple DSP such as the AD2015 that can be had for a few dollars. The process is asymmetrical: much more processing is needed on the coding side. A decoder could be made to work without hardware assistance on a decent computer.

Who is using MPEG-1 AUDIO?

Philips uses MPEG for their new digital video CD's. They say they will start shipping movies and music videos on CD's for their CD-I player by the end of this year. MPEG is accepted by Eureka-147. That means that when digital radio broadcasts start in Europe a couple of years from now, you will receive MPEG coded audio.

Which sampling frequencies are used ?

You can have 48kHz (used in professional sound equipment), 44.1kHz (used in consumer equipment like CD-audio) or 32kHz (used in some communications equipment).

How many audio channels?

MPEG I allows for two audio channels. These can be either single (mono), dual (two mono channels), stereo or joint stereo (intensity stereo or m/s-stereo). In normal (l/r) stereo one channel carries the left audio signal and one channel carries the right audio signal. In m/s stereo one channel carries the sum signal (l+r) and the other the difference (l-r) signal. In intensity stereo the high frequency part of the signal (above 2kHz) is combined. The stereo image is preserved but only the temporal envelope is transmitted. In addition MPEG allows for pre-emphasis, copyright marks and original/copy marks. MPEG II allows for several channels in the same stream.
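The sum/difference idea behind m/s stereo is easy to show on plain sample lists. This sketch uses exactly the l+r and l-r signals described above with no scaling; a real coder may scale the two signals differently, and the function names and sample values are made up.

```python
def ms_encode(left, right):
    """Turn left/right samples into sum (mid) and difference (side) signals."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left/right: (m + s) / 2 = l and (m - s) / 2 = r."""
    left = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right


# Hypothetical samples from two nearly identical channels.
left = [100, -50, 7]
right = [98, -52, 7]
mid, side = ms_encode(left, right)   # mid = [198, -102, 14], side = [2, 2, 0]
restored = ms_decode(mid, side)
```

When the two channels are similar the difference signal is small, so it can be coded with far fewer bits than a full second channel.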

Where can I get more details about MPEG audio ?

There is no description of the coder in the specs. The specs describe in great detail the bitstream and suggest psychoacoustic models.

MPEG-1 SYSTEMS

What about MPEG-1 SYSTEMS ?

The MPEG system committee completed and approved for release the technical specification for combining a plurality of coded audio and video streams into a single data stream. The specification provides fully synchronised audio and video and facilitates the storage and the possible further transmission of the combined information through a variety of digital media.

This systems coding includes necessary and sufficient information in the bit stream to provide the system-level functions of synchronization of decoded audio and video, initial and continuous management of coded data buffers to prevent overflow and underflow, random access start-up, and absolute time identification. The coding layer specifies a multiplex data format that allows multiplexing of multiple simultaneous audio and video streams as well as privately defined data streams.

The basic principle of MPEG System coding is the use of time stamps which specify the decoding and display time of audio and video and the time of reception of the multiplexed coded data at the decoder, all in terms of a single 90kHz system clock. This method allows a great deal of flexibility in such areas as decoder design, the number of streams, multiplex packet lengths, video picture rates, audio sample rates, coded data rates, digital storage medium or network performance. It also provides flexibility in selecting which entity is the master time base, while guaranteeing that synchronization and buffer management are maintained. Variable data rate operation is supported. A reference model of a decoder system is specified which provides limits for the ranges of parameters available to encoders and provides requirements for decoders.
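To give a feel for the 90kHz clock, here is a small sketch that expresses picture and audio frame times in clock ticks. The function names are made up, and the rounding to whole ticks is an assumption; it is not the time stamp syntax of the spec itself.

```python
SYSTEM_CLOCK_HZ = 90000


def video_timestamps(frame_count, frames_per_second, start=0):
    """Display time of each picture, in ticks of the 90kHz system clock."""
    return [
        start + round(i * SYSTEM_CLOCK_HZ / frames_per_second)
        for i in range(frame_count)
    ]


def audio_frame_ticks(samples_per_frame, sample_rate):
    """Duration of one audio frame in ticks (may be fractional)."""
    return samples_per_frame * SYSTEM_CLOCK_HZ / sample_rate


ntsc_times = video_timestamps(3, 30)          # [0, 3000, 6000]
pal_times = video_timestamps(3, 25)           # [0, 3600, 7200]
layer2_ticks = audio_frame_ticks(1152, 48000)  # 2160.0
```

Because audio and video times are both counted on the same clock, a decoder can line the two streams up without knowing anything about each other's frame or sample rates.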

Some optional sets of constraints provide a framework for common industry acceptance of certain key parameters for use by decoder designs and information providers. While the MPEG Systems specification is included in the current work item of MPEG, it is designed for compatibility with future extensions to audio, video and hypermedia coding, and a wide variety of bitrates.