Berkeley Multimedia Research Center
Published: September 1994
Berkeley, CA
USA
http://www.bmrc.berkeley.edu 

Parallel MPEG-1 Video Encoding

Abstract
1. Introduction
2. MPEG-1 Video
2.1 Data Rate
2.2 Color Space
2.3 Frame Coding
2.4 Bitstream Representation
2.5 Challenges

3. Sequential MPEG-1 Encoder
3.1 Implementation
3.2 Motion Vector Search Techniques
3.3 Optimizations
3.4 Quality and Performance
3.4.1 Comparison of I/P.B-frame Encoding Times
3.4.2 Performance on Different Image Sequences
3.4.3 Performance on Different Machines
3.4.4 Performance of Different Search Techniques
3.5 Comparison of Original vs. Decoded Reference Frames

4. Parallel MPEG-1 Encoder
4.1 Communication Required
4.2 Network Model
4.3 I/O
4.4 Scheduling
4.5 Parallel Performance
4.6 Comparison to Philips Encoder

5. Conclusion
6. Acknowledgements
References

Parallel MPEG-1 Video Encoding*

Kevin L. Gong and Lawrence A. Rowe
Computer Science Division - EECS
University of California
Berkeley, California 94720
(Keving@CS.Berkeley.EDU, Rowe@CS.Berkeley.EDU)

*This research was supported by the National Science Foundation (Grant MIP 90-14940), Digital Equipment Corporation, Fujitsu Network Transmission Systems, Inc., and Hewlett-Packard Corp.

Abstract

MPEG-1 is an ISO/IEC (International Organisation for Standardisation/International Electrotechnical Commission) standard for medium-quality and medium-bit rate video and audio encoding. We have developed a parallel MPEG-1 encoder that runs on a network of workstations. We have implemented a number of different motion vector search techniques. The sequential code can encode over 0.6 frames per second on an HP 9000/720 machine, and the parallel code can encode slightly less than 4 frames per second on a network of 9 of these machines connected via Ethernet.

Keywords

MPEG, MPEG-1, video compression, parallel programming 


1. Introduction

2. MPEG-1 Video
Back to Top
The Motion Pictures Expert Group (MPEG) has produced a standard for video and audio encoding for medium- quality, medium-bit-rate compression, called MPEG-1 [3].
Our goal in developing an MPEG-1 encoder was to support the creation of the Berkeley Distributed Video on Demand System (VODS) [9, 10]. The VODS system is designed to handle hundreds of hours of video. Due to the large size and amount of video, that video must be compressed. MPEG-1 allows video to be compressed by ratios in the range of 50:1 to 100:1, depending on image sequence type and desired quality.
Commercial encoders use custom hardware and parallel encoding, which is fast and expensive. For example, the C-Cube real-time MPEG-1 video encoder uses 8 custom-designed video processor chips; a complete system is sold for $120,000. Optibase and Optivision sell PC-based, real-time encoders based on the C-Cube video processor chips for $20,000. Software-only encoders are slower. For example, the Stanford encoder, running on a DEC 5000, encodes at just 0.2 frames per second [2]. The Philips parallel MPEG encoder runs on a POOMA machine [12]. POOMA is an experimental parallel computer system consisting of 100 nodes. Each node consists of a Motorola MC68020 CPU, MC68881 FPU, and 16 MB of memory. Later, we will compare the performance of the Philips encoder with our encoder.
We developed a parallel encoder that runs on a network of workstations to reduce the time needed for encoding. We wanted our encoder to be used by as many people as possible, including those already using the Berkeley MPEG-1 player [1]. To address this issue, we designed the encoder to be as flexible and easy-to-use as possible, allowing different methods and quality of encoding for users with different needs. Thus, our ultimate goal was to create a usable, flexible, portable, parallel encoder with reasonable quality and performance.
This paper describes the design, implementation, and performance of the Berkeley MPEG-1 video encoder. The paper is organized as follows. Section 2 describes the MPEG-1 video standard. Section 3 describes the encoder and discusses the quality and performance of the sequential encoder. Section 4 describes the parallel encoder and its performance. Section 5 presents our conclusions. 

2. MPEG-1 Video

1. Introduction
3. Sequential MPEG-1 Encoder
Back to Top

The MPEG-1 video standard is designed to encode color images at 1.5 Mb/s and to provide VHS-quality playback. The basic idea behind MPEG encoding is to exploit temporal locality. Excepting certain types of video such as music videos, the images do not change much within small time intervals. MPEG-1 video takes advantage of this by encoding an image relative to other images temporally close to it. 

2.1 Data Rate

2. MPEG-1 Video
A source video is a sequence of numbered frames, F1...Fn. Each frame is a still image. A video player displays one frame after another, usually at a rate close to 30 frames per second (the standard television rate).
Frames are digitized in a standard RGB format, 24 bits per pixel (8 each for Red, Green, and Blue). This generates the data rates shown in Table 1. MPEG-1 is designed to produce bit rates of 1.5 Mb/s or less, and is intended to be used with images of size 352x288 (at 24-30 frames per second). Larger images are typically scaled before encoding.
Table 1: Uncompressed Data Rates
Image Size  Frame Rate 

(frames/sec) 

Data Rate 

(Mbits/sec)

160x120  12  5.3
160x120  20  8.8
352x288  24  55.7
352x288  30  69.6
720x576  24  227.8
720x576  30  284.8

2.2 Color Space

2. MPEG-1 Video
The MPEG-1 algorithm operates on images represented in the YUV color space. Thus, if an image is stored in RGB format it must first be converted to YUV format.
The YUV format is subsampled. All luminance information (Y) is retained; there are 8 bits per pixel (bpp) of luminance information. However, chrominance information (U and V) is subsampled 2:1 in both the horizontal and vertical directions, which reduces the information by a 4:1 ratio. Thus, there are 2 bits per pixel of U information and 2 bits per pixel of V information. This subsampling does not drastically affect quality because the eye is more sensitive to luminance than to chrominance information.
Subsampling is a lossy step. The 24 bits per pixel of RGB information is reduced to 12 bits of YUV information, which automatically gives 2:1 compression. 

2.3 Frame Coding

2. MPEG-1 Video
Frames are logically divided into 16x16 pixel macroblocks. Each macroblock, in turn, is divided into six 8x8 blocks - 4 luminance and 2 chrominance (4 Y blocks, 1 U block, and 1 V block).
There are three types of frames: intra-frames (I), forward predicted frames (P), and bi-directional predicted frames (B). P- and B-frames are encoded relative to other frames as shown in Figure 1.
An I-frame is encoded as a single image, with no reference to any past or future frames. The encoding scheme used is similar to JPEG compression [13]. Each block is encoded independently with one exception explained below. The block is first transformed from the spatial domain into a frequency domain using the Discrete Cosine Transform (DCT), which separates the signal into independent frequency bands. Most frequency information is in the upper left corner of the resulting block. At this point, the data is quantized. Quantization can be thought of as ignoring lower-order bits (though this process is slightly more complicated). Quantization is the only lossy part of the compression scheme other than subsampling. The resulting data is then run-length encoded in a zig-zag ordering to optimize compression. This zig-zag ordering produces longer runs of 0's by taking advantage of the fact that there should be little high-frequency information. The afore-mentioned exception to independence is that the coefficient in the upper left corner of the block, called the DC coefficient, is encoded relative to the DC coefficient of the previous block.

A P-frame is encoded relative to the past reference frame. A reference frame is a P- or I-frame. The past reference frame is the closest preceding reference frame. More formally, the past reference frame for a frame Fi is the reference frame Fj such that Fj comes before Fi in display order, and there are no other reference frames between Fj and Fi in display order.

Each macroblock in a P-frame can be encoded either as an I-macroblock or as a P-macroblock. An I-macroblock is coded just like a macroblock in an I-frame. A P-macroblock is some 16x16 area of the past reference frame, plus an error term. To specify the 16x16 area of the reference frame, a motion-vector is transmitted. A motion vector of (0,0) means the 16x16 area is in the same position as the macroblock we are encoding. Other vectors are relative to that position. Motion vectors may include half-pixel values, in which case pixels are averaged. For example, the 16x16 area resulting from a motion vector of (1.5, 3.0) is the average of the two 16x16 areas resulting from the motion vectors (1.0, 3.0) and (2.0, 3.0) (see Figure 2). The error term is transmitted using the DCT, quantization, and run-length encoding. A special case is a skipped macroblock, which refers to a macroblock with a (0,0) motion vector and no error term. It takes just 0 or 1 bits to encode a skipped macroblock1.
The search for the best motion vector (the one which gives the smallest error term possible) is the heart of any MPEG-1 video encoder. The motion vector search is what makes encoders inherently slower than decoders.
A B-frame is encoded relative to the past reference frame, the future reference frame, or both frames. The future reference frame is the closest proceeding reference frame. The future reference frame for a frame Fi is the reference frame Fk such that Fk comes after Fi in display order, and there are no other reference frames between Fi and Fk in display order.
The encoding for B-frames is similar to P-frames, except that motion vectors may refer to areas in the future reference frame. For macroblocks that use both past and future reference frames, the two 16x16 areas are averaged. Skipped macroblocks are also defined for B-frames. In B- frames, a skipped macroblock has the same motion vector as the previous macroblock, and no error term.
Frames not need follow a static IPB pattern. Each individual frame can be any one of the three types. Often, however, the IPB sequence, called the coding pattern, is fixed for an entire video. This pattern is just one of several encoding parameters. The parameters having to do with the motion vector search are specific to an encoder implementation, so they will be discussed in Section 3. All encoders, however, must handle the parameters having to do with quantization. The quantization matrix (q-matrix) determines the quantization in the I-macroblocks. This matrix can be mod- ified to optimize compression and quality for a particular image sequence. The quantization scale (q-scale) controls the granularity of quantization. In essence, the quantization matrix is multiplied by the q-scale. The q-scale also affects quantization for the DCT error terms in P- and B-macroblocks. Encoders can vary the q-scale and q-matrix to change the bit rate and image quality in a sequence. 

2.4 Bitstream Representation

2. MPEG-1 Video
An MPEG-1 video sequence is an ordered stream of bits, with special bit patterns marking the beginning and ending of a logical section.
Each video sequence is composed of a series of Groups of Pictures (GOP's). A GOP is composed of a sequence of pictures (frames). A frame is composed of a series of slices. A slice is composed of a series of macroblocks, and a macroblock is composed of 6 blocks (4 for luminance and 2 for chrominance). (see Figure 3)
The GOP structure is intended to assist random access into a sequence. A GOP is an independently decodable unit that can be of any size as long as it begins with an I-frame.
Each slice is an independently decodable unit too. There can be 1 slice per frame, 1 slice per macroblock, or anywhere in between. The slice structure is intended to allow decoding in the presence of errors. Serendipitously, it also allows parallel encoding/decoding at the slice level.
The MPEG-1 standard allows various properties of the sequence to be modified in mid-stream. For example, the q-matrix and q-scale can both be changed. The entire q- matrix can be changed between video sequences. The q- scale can be changed between slices or even between macroblocks. 

2.5 Challenges

2. MPEG-1 Video
The primary challenges to implementing an MPEG-1 encoder are: 1) creating fast, efficient motion vector search techniques and 2) varying the encoding parameters to provide the best balance of encoding speed, compression, and quality.
Considerable research has been done on motion vector search [8]. Research has also been done on dynamically varying encoding parameters to produce the highest quality image at a specified bit rate (see, for example, the paper by Puri and Aravind [4]). However, much of this work has not been published because it is the basis for competition between different commercial products and services. 

3. Sequential MPEG-1 Encoder

2. MPEG-1 Video
4. Parallel MPEG-1 Encoder
Back to Top

 

We first implemented a sequential MPEG-1 encoder running on a single workstation before creating the parallel version. This section describes the implementation and performance of the sequential encoder. 

3.1 Implementation

3. Sequential MPEG-1 Encoder
The sequential encoder compresses one frame at a time, remembering information about various reference frames as required. The encoder keeps information on at most 3 frames at a time: the current frame, the past reference frame, and the future reference frame. We assume random access to frames. If, instead, we only had sequential access to the frames in display order, we would have to remember up to M+1 frames, where M is the greatest distance between reference frames. In other words, M-1 is the maximum number of consecutive B-frames.
Prospective users may want to use different encoding processes depending on the required bit rate, quality, and encoding speed. These criteria also depend on the source material. To accommodate this variability, the encoder reads a parameter file that specifies the encoding parameters.
All options are set once per execution of the encoder and remain constant throughout the execution of the program. We currently do not support dynamically varying parameters. The following parameters can be set:
Q-scale. The q-scale can be set separately for I-, P-, and B-frames.
Pattern. The IPB pattern can be set.
Search techniques. A variety of different motion vector search techniques and ranges can be selected. These will be discussed later.
Original or decoded frame as reference frame (the basis for the motion vector search will be discussed in section 3.6). 

3.2 Motion Vector Search Techniques

3. Sequential MPEG-1 Encoder

 

Different search techniques produce dramatic differences in speed and compression rates. The search techniques implemented for P-frames are: 1) exhaustive, 2) subsample, 3) two-level, and 4) logarithmic.
The search window, which specifies the motion vectors to consider, can be specified for each technique. The search window is a square area centered around the (0,0) vector.
Exhaustive search examines all motion vectors in the search window. The vector that gives the smallest error is considered the best vector. This error, which compares two 16x16 areas, can be computed in any one of several ways. A standard error measure, called the mean squared error (MSE), is the average of the squares of the 256 luminance differences. Generally speaking, chrominance is ignored when computing the error because, as stated earlier, the eye is more sensitive to changes in luminance.
The multiplication in the MSE makes it slow. Consequently, we use the mean absolute difference (MAD) measure, which is the average of the absolute values of the 256 luminance differences. This method gives results similar to the MSE, but at a much lower computational cost.
Subsample search computes the error based on fewer pixels. Only 64 of the 256 pixels are used to compute the error [5].
The two-level search is a simple hierarchical search. We first search, exhaustively, for the best full-pixel motion vector. We then pick the best vector from that vector and the surrounding 8 half-pixel vectors.
Logarithmic search significantly reduces the number of motion vectors to test. It first tests nine vectors in the search window. It then tests eight vectors centered on the best vector from the first nine. This process is repeated until the search ends on a single vector. Figure 4 illustrates this approach.
To compare the different search techniques quantitatively, we will compute the number of pixel-to-pixel comparisons for each method. Optimization reduces these numbers, so the comparison is not precise, but it will give us a rough idea of what to expect. Table 2 shows the differences, assuming an n x m pixel rectangular search window, where n > m.
Table 2: Theoretical Comparison of P-Frame Search Techniques
Search Technique  Pixel-to-Pixel Comparisons
Exhaustive  256(2n+1)(2m+1)
Subsample  64(2n+1)(2m+1)
Two-Level  256((n+1)(m+1)+8)
Logarithmic  256(6log2 2m + 2log2 2n + 1)
Search techniques for B-frames must examine both the past and future reference frames. The techniques implemented are: 1) exhaustive, 2) simple, and 3) cross2.
In all three methods, the best forward and backward vectors are found independently (using the P-frame search method). Call these vectors Vf and Vb, respectively. If using one of these vectors is better than using a combination of a forward and backward vector, then the single vector is used instead. Thus, the only difference in techniques is how they search for the best combination of vectors.
Exhaustive search does an O(n2) search. For each forward vector, it tries to find the best matching backward vector. This method is obviously extremely slow.
In the simple search, the only combination considered is Vf and Vb. There is no additional search, so this method takes roughly twice as long as a P-frame search.
The cross2 technique searches for the best backward vector that matches Vf and the best forward vector that matches Vb. The best of the resulting 4 alternatives (Vf, Vb, Vf+match, or Vb+match) is taken. This method takes roughly twice as long as the simple search. 

3.3 Optimizations

3. Sequential MPEG-1 Encoder

 

MPEG-1 encoding is inherently slower than decoding because of the time required to search for motion vectors. One goal we had in developing the encoder was to write efficient code that would be easy to modify and debug. Our approach was to write simple, modular code, and then use profiling to discover where time was being spent. It turned out, not surprisingly, that most of the time spent was in a very few procedures. Using a variety of algorithmic and coding tricks, we achieved a speed-up of almost 8 times compared to the non-optimized version.
The first optimization was to combine the computation of motion-compensated blocks and the corresponding error term. Originally, the block was computed and placed in memory, and then the error was calculated. Part of the reason for this separation was for modularity. Computing the motion block required interpolation for half-pixel motion, so it was difficult to combine the computations. We combined the computation and at the same time made a related optimization. In the motion vector search, a half-pixel value may be used more than once. Therefore, we compute all half-pixel values for the entire frame once. These two optimizations together sped up the encoder by over a factor of two.
The second optimization was unrolling the main loop that computed the error. It was two nested loops of 16 iterations each. The inner loop was unrolled to speed up the encoder by about 25%.
The third optimization was to keep track of the smallest error found, and stop the computation as soon as the error being computed was larger. The few lines it took to make this change sped up the encoder by another 50%.
After making this change, it made sense to change the order in which the motion vector search was done. Instead of doing a raster-scan order search, it makes more sense to do a spiral search centered around (0,0) because smaller motion vectors are more common. This change sped up the encoder by another 30%.
In a similar vein, it made sense to try the motion vector from the previous macroblock before doing the spiral search. This strategy is especially useful when the motion is caused by the camera panning a scene, in which case many motion vectors will be the same. This sped up the encoder by an additional 10%.
Finally, values were originally stored as 16-bit integers. However, most compilers convert values to 32-bit integers before doing any arithmetic. Therefore, we changed the code so that values are stored as 32-bit integers to eliminate the need for conversions. This sped up the encoder by an additional 25-30%.
Even with the search optimizations, a current profile of the encoder still reveals that motion vector search dominates running time. If there are no P- or B-frames, the DCT and quantization computations account for over 50% of the time. In a standard IBBPBBPBBPBBPBB pattern, the motion vector search takes over 90% of the time. 

3.4 Quality and Performance

3. Sequential MPEG-1 Encoder

 

We conducted a series of experiments to test the relative performance of different encoding parameters and the different motion vector search techniques. The experiments are designed to compare the compression factor, encoding time, and image quality.
Compression factors are given in terms of bits per pixel. Since the input is of size 24 bits per pixel, an output of 1 bit per pixel corresponds to 24:1 compression. Smaller bits per pixel are clearly better.
Encoding times are given in terms of macroblocks processed per second. For comparison, a 352x288 pixel image contains 396 macroblocks.
There are many ways to define image quality, including MSE, MAD, and signal-to-noise ratio (SNR). More complex measures have been developed [6], but we have not yet used them.
None of the traditional metrics are really satisfactory, because you can easily construct two images in which the image with lower quality according to an error metric is noticeably better to the human eye. Also, no mention has been made of moving image quality. One might argue that the logical thing to do is to look at the errors of each individual frame, but in reality the human eye may be more susceptible to different kinds of errors in moving pictures than it is in still pictures.
Since no good, simple metric exists, we used the peak signal-to-noise ratio (PSNR) in our experiments. PSNR is defined as:

where MSE is the mean squared error. Using this metric, larger numbers are better quality. For example, if each pixel in the encoded image is different than the original image by just 1, the PSNR is 20log10 255 = 48.13. If, instead, each is different by 2, the PSNR is 20log10 255/2 = 42.11.

All tests, except where noted, were done with the following encoding parameters: IBBPBBPBBPBBPBB pattern, q-scale of 8/10/25 for I/P/B frames, logarithmic P- search, cross2 B-search, decoded frames as reference frames, and a search window of +/- 10 pixels. Times reported in the sequential experiments, except where noted, are for encoding time only (no I/O time is included). The tests were done on a DEC 3000/400.
Performance and image quality are affected by the type of image being compressed. To avoid bias in our experiments, we used 4 different image sequences.
The flower garden sequence is a standard test of video compression algorithms. It has high-frequency information located in the flowers at the bottom of the images, which makes compression hard. It has low-frequency information in the form of clear sky in the upper left, which should give good compression. And, it has motion because the camera is panning from left to right. This panning causes a tree in the foreground to move relative to the background.
The tennis sequence shows two people playing ping- pong. The first scene is a ball bouncing and there is very little information in the image. The second scene shows one player bouncing the ball across the net. This sequence has less high- frequency information than the flower garden sequence, and thus gets better compression.
The walk sequence is a computer-generated walk- through of Soda Hall, the new UC Berkeley computer science building [11]. The motion in this sequence comes only from movement of the camera (except for a few frames with moving elevator doors and a moving sculpture). Since it is computer-generated, many of the surfaces are uniformly colored with little detail, making it compress well.
Lastly, the luxo sequence is a Pixar animation [13]. It includes a lot of dark, unchanging areas, and minimal motion.
All sequential results, except those explicitly comparing the different image sequences, are averaged from the 4 different sequences. Normalized results were computed to avoid skewing based on one sequence performing particularly well or poor on a given test. However, the normalized results did not differ significantly from the averaged results, so are not shown. The normalized results are only mentioned here to show that the skewing possibility was taken into account. 

3.4.1 Comparison of I/P/B-frame Encoding Times

3. Sequential MPEG-1 Encoder
3.4 Quality and Performance

 

The pattern of I/P/B frames has a large effect on the overall time for encoding. Table 3 compares time and compression for I, P and B frames:
Table 3: Comparison of I/P/B-Frames
Frame Type  Bits Per Pixel  Macroblocks Per Second  PSNR
I-frames  1.344  761  34.3
P-frames  0.607  458  33.2
B-frames  0.081  198  31.5 
Note the significance of B-frames. B-frames are a factor of 7.5 times smaller than P-frames, but at only 2.3 times the computation cost and a 5% reduction in quality. This data illustrates a fundamental behavior of MPEG (and many other compression schemes): computation time is traded for improved compression

3.4.2 Performance on Different Image Sequences

3. Sequential MPEG-1 Encoder
3.4 Quality and Performance

 

Table 4 compares quality, compression, and encoding speed for the different image sequences. Notice that the computer-generated sequences (walk and luxo) yield higher compression ratios and quality in less time than the other sequences. This result is important because many users of our encoder are generating movies from scientific visualization experiments. One future research direction might be to explore coding heuristics based on the type of source material.
Table 4: Comparison of Images
Sequence  Bits Per Pixel  MPS  PSNR 
(I/P/B)
flowers  0.625  229  30.7/30.3/27.6
tennis  0.255  221  32.6/32.0/31.0
walk  0.186  244  36.0/34.5/33.7
luxo  0.169  294  37.9/35.9/33.7
As a general rule, a higher compression ratio requires less encoding time, and yields higher quality. There are two reasons for this observation. First, speed comes from a larger number of skipped macroblocks, which also improves compression. Skipped macroblocks also speed up encoding because they are found first. Second, the improved quality results from the absence of high-frequency information, which also improves compression because there are more zero values in the DCT. The following table supports the skipped macroblock theory by showing how many macroblocks of each type are found in B-frames.
Table 5: Macroblock Types in B-Frames
Sequence  Skipped
flowers  0.02%  92%  8%
tennis  0.14%  94%  6%
walk  0.00%  84%  16%
luxo  0.09%  68%  31%

3.4.3 Performance on Different Machines

3. Sequential MPEG-1 Encoder
3.4 Quality and Performance

 

Table 6 compares the performance of the sequential encoder on different machines. Relative performance is roughly in agreement with the SPECint numbers.
Table 6: Machine Comparison
Machine  Macroblocks Per Second  SPECint92
HP 9000/755  280  80.0
DEC 3000/400  247  74.7
Sparc 10/30  104  40.0

3.4.4 Performance of Different Search Techniques

3. Sequential MPEG-1 Encoder
3.4 Quality and Performance

 

This section compares the performance of different P- and B-frame motion vector search techniques. Table 7 compares the P-frame motion vector search techniques.
Table 7: P-frame Motion Vector Search
Technique  Bits Per Pixel  Macroblocks Per Second  PSNR
Exhaustive  0.575  56  33.3
Subsample  0.578  138  33.3
TwoLevel  0.579  181  33.3
Logarithmic  0.607  458  33.2
There is almost no difference in quality or compression between the first three techniques, which suggests that two- level search should be used because it is significantly faster than the other two methods. Logarithmic search gives almost identical quality and slightly worse compression, but at a much faster rate.
Table 8 compares the B-frame motion vector search techniques.
Table 8: B-frame Motion Vector Search
Technique  Bits Per Pixel  Macroblocks Per Second  PSNR
Exhaustive  0.081  Less than 1  31.1
Cross2  0.079  197  30.9
Simple  0.077  346  30.8
Exhaustive search gives minimal quality improvement over cross2, but actually gives worse compression, and cross2 in turn gives worse compression than simple. These results are unexpected; if our error metric (MAD) were directly correlated with compression ratio, it would be impossible. We have not been able to explain this result. Further research on this subject is needed.
Table 9 shows how the choice of P-frame search technique affects the performance of B-frame search (in this case, cross2 search).
Table 9: Effect of P-search on B-search
Technique  Bits Per Pixel  Macroblocks Per Second  PSNR
Exhaustive  0.079  14  31.7
Subsample  not implemented  --  --
TwoLevel  0.079  53  31.7
Logarithmic  0.081  198  31.5
The differences in compression and quality are very small for the three techniques, which makes a strong argument for using logarithmic search when doing B-frame searches. 

3.5 Comparison of Original vs. Decoded Reference Frames

3. Sequential MPEG-1 Encoder

 

An interesting question is whether to use the original, unencoded frames as reference frames, or to use the decoded frames as reference frames. Because any MPEG-1 playback system will not have access to the original frames, the best reference frame is the decoded frame. The problem this causes is that during the encoding process, the reference frames must all be decoded, adding to the time needed.
Oddly enough, there are also concerns about compression ratio. While image quality will be sacrificed by using the original frames as reference frames, it turns out that the compression ratio is improved
Table 10 compares the use of original and decoded frames as reference frames.
Table 10: Original or Decoded?
Reference  Bits Per 
Pixel 
Macroblocks 
Per Second 
PSNR 
I/P/BM
Decoded  0.309  247  34.3/33.2/31.5
Original  0.276  297  34.3/31.3/30.3
The MPEG-1 standard suggests that there is usually little difference in quality. The numbers here tend to support this because quality differs by about 5%. Of course, this observation depends on your definition of "little difference," particularly in light of the "little differences" that caused many small changes to be added to the standard.
There is a noticeable drop in compression (12%) and encoding speed (17%) when using decoded frames as reference frames. dd> One useful experiment might be to lower the q-scale when using the original frames until the quality reaches that achieved by using the decoded frames, and then comparing the compression rates.
The experiments in this section have shown that the logarithmic and two-level search methods are good choices for P-frame motion vector search, and the simple search method is good for the B-frame motion vector search. 

4. Parallel MPEG-1 Encoder

3. Sequential MPEG-1 Encoder
5. Conclusion
Back to Top

 

MPEG-1 video compression is CPU-intensive and slow. On the other hand, many people have access to workstations that sit idle much of the time. We developed a portable parallel encoder to take advantage of these otherwise-idle machines to speed up encoding.
MPEG-1 video compression lends itself very well to parallelism. There are several levels at which parallelism can be implemented. The finest-grained parallelism can be obtained at the macroblock level. The problem with this approach is that the encoding of one macroblock depends on the encoding of the previous macroblock. In an I-block (i.e., a block in an I-macroblock), the DC value is represented as an offset from the DC value of the previous I-block. In B- frames, a skipped macroblock is based on the motion vector of the previous macroblock. Thus, fine-grained communication is required between processors coding adjacent macroblocks which makes this approach impractical on a network of workstations.
Medium-grained parallelism can be obtained at the slice level. MPEG-1 devised the concept of a slice such that each slice could be independently decoded. Consequently, they can also be independently encoded. Suppose we have one slice per row of macroblocks. The challenge now is to provide random access to slices within a frame, and to concatenate the resulting encoded bitstreams. This approach is good for reducing latency in real-time applications. But, if the encoding is done off-line, coarse-grained parallelism is more manageable.
At the coarsest level, we can encode GOP's in parallel. Unfortunately, even encoding one GOP at a time does not completely eliminate the dependencies between I/ P/B frames. In general, some frames in one GOP will depend on a reference frame in the previous or next GOP.
Encoding GOP's in parallel is also somewhat inflexible. A more flexible method is to encode a number of contiguous frames, but not necessarily a GOP. This approach has similar properties to encoding a GOP at a time, but allows more scheduling flexibility which improves load- balancing. We use this approach in our parallel encoder. 

4.1 Communication Required

4. Parallel MPEG-1 Encoder

 

Several different kinds of communication are required in the parallel version of the MPEG-1 encoder. The most obvious communication needed is that required for scheduling. Each encoding process must know which frames it should encode. Such scheduling requires very limited communication, on the order of a few bytes per frame.
A second kind of communication is required if we use decoded frames as reference frames. Consider the sequence IBBPBBI. Suppose processor P0 encodes frames 1-3 (IBB), and processor P1 encodes frames 4-7 (PBBI). Processor P1 must get encoded frame 1 from processor P0 to encode frame 4. Thus, we need a method to transfer encoded frames between processors.
A third kind of communication is required to transfer raw frames. This communication introduces the greatest amount of traffic because uncompressed image data is very large. Because the images come from disk in the first place, a simple solution is to use NFS. To simplify matters, we assume that each input frame resides in a separate file2.
A fourth kind of communication is required to transfer the encoded frames to disk or a processor that combines them to form the output bitstream. This communication is less intensive than the transfer of raw frames, because encoded frames are much smaller (50-100 times smaller). 

4.2 Network Model

4. Parallel MPEG-1 Encoder

 

Figure 5 shows the process architecture for the parallel encoder. The Master Server schedules the slave processes that do the encoding. The Decode Server directs the transfer of encoded frames between slaves if decoded frames are used as reference frames. It does not actually make the transfer. It only keeps track of the reference frames that are ready and the processes which are waiting for frames, and it notifies them when appropriate. The slave processes read/write the encoded frames using NFS. The Combine Server concatenates the encoded frames to create the output bitstream.
Currently, the Master, Decode, and Combine Servers all run on the same physical machine. The Slave processes run on different machines. There may or may not be a slave process on the "server" machine.
NFS is used to transmit frames to the individual slave processes, as well as to/from the Combine Server. Elsewhere, communication is done using UNIX TCP sockets. 

4.3 I/O

4. Parallel MPEG-1 Encoder

 

An often overlooked, but important aspect of any parallel algorithm is I/O bandwidth. Regardless of how the parallel encoder works, a limiting factor is the disk on which the original frames sit. If the frames cannot be accessed faster than a certain rate, the encoder cannot work faster than that rate.
Video is digitized into a compressed format (most commonly JPEG). Thus, we can reduce some of the I/O requirements by keeping the video in JPEG format, sending the compressed data to the slave processes, and only then having them uncompress the frames once they get there. We will present results for experiments that use compressed images and experiments that use uncompressed images. 

4.4 Scheduling

4. Parallel MPEG-1 Encoder

 

An important question is how to schedule frames to processors. Here, there is a trade-off between load balancing and scheduling overhead. For example, scheduling one frame per processor ensures the best possible load balancing, but increases overhead to an unbearable amount. Scheduling overhead includes both the communication required for scheduling, as well as the extra frames that must be transmitted. For example, if a process encodes IBB, then it requires the future reference frame; even if original frames are used as reference frames, the future reference frame is read at least twice - once by this process, and once by the process that actually encodes it. A frame that is read by a process that does not encode it is a penalty frame. The scheduling method with the least overhead is one in which all processors run at about the same speed, so each processor is given N/P contiguous frames, where N is the number of frames and P is the number of processors. In this case, there are no more than 2P penalty frames (2 per processor). In the best case (but still allowing B-frames), there are P penalty frames.
At the opposite extreme, there is the one frame per processor solution. In this scheduling algorithm, there are 2Nb+Np penalty frames, where Nb is the number of B-frames and Np is the number of P-frames. The large number of penalty frames makes this solution impractical.
Unfortunately, there is also a trade-off in the amount of parallelism gained. Generally speaking, sequences of frames can be encoded completely in parallel, guaranteeing perfect parallelism in any scheduling method. There are two reasons why this is not the case. First, the encoder produces one output file per frame, and these files must be combined into one file. A separate process, the Combine Server, is in charge of this combining. It puts the frames in order into a single file. It only proceeds when the next frame (in decode order) has been encoded. Thus, the scheduling method described above will cause the combining process to wait until all slave processes are finished before it begins its task. The combine time for our experiments was about 15 seconds. For some experiments, this is almost 40% of the total time to encode!
The second problem is using decoded frames as reference frames. Here, the above scheduling method is probably the worst method. Consider an IPB sequence in which the second processor must wait for the first to finish, the third must wait for the second, and so on, so that almost no parallelism at all occurs! For example, consider processor P0 encoding IBBPBBPBBPBB, and processor P1 encoding the rest: PBBIBBPBBP. Process P1 can only encode its first P frame by waiting for processor P1 to finish encoding its last P frame, which means that processor P1 does not start until processor P0 is almost finished.
The only way to ensure parallelism is to have each run of frames start with an I-frame. If it does not, the slave process must wait for the previous reference frame to be encoded. The encoder supports this restriction on scheduling by accepting a FORCE_I_ALIGN parameter that enforces this restriction. To simplify our parallel results, we use the original frame as the reference frame in our experiments. Results with decoded frames as reference frames would be similar if we used a larger number of frames in our experiments.
We originally used a method in between the two previous methods. First, each processor encodes S frames to determine the approximate encoding speeds of the different processors. From then on, the scheduler assigns each slave process T seconds of work. The scheduler assigns a number of frames to be encoded based on previous performance. This dynamic scheduling method is useful for heterogeneous systems with processors that operate at different speeds, as well as homogeneous systems in which network traffic and multi-user variance causes machine speeds to vary. This method ensures load balancing to within T seconds.
The problem with this method is that load balancing is affected by the value of T. Small values of T result in better load balancing, but more penalty frames. To solve this problem, we used a tapered scheduling algorithm, in which the number of penalty frames is kept low by having a T that starts out large, but grows smaller and smaller as we near the finish [7]. This method is implemented by estimating the time remaining, and scheduling a processor enough frames to keep it busy for that amount of time. Tapering T in this manner improves load balancing, while keeping the number of penalty frames relatively low.
Table 11: Scheduling

(frames) 

(seconds) 
Penalty 
Frames 
FPS
238  2.11
120  3.19
10  55  3.57
15  39  3.49
20  30  3.13
taper  68  3.75
10  46  3.66
taper  62  3.95
65  3.57
taper  62  3.75
12  10  30  3.33
12  15  22  3.41
12  20  21  3.19
12  taper  54  3.33 
15  10  16  2.94
15  taper  40  2.88
N/P  12  2.68
Table 11 shows the number of penalty frames and encoding speed for different values of S and T. The experiments to produce the table, as well as all parallel experiments, were run on a 9-machine cluster of HP 9000/ 720 computers connected by an Ethernet. All parallel experiments used the 150-frame flower-garden sequence. The scheduling experiments used the following encoding parameters: IBBPBBPBBPBBPBB pattern, q-scale of 8/10/ 25 for I/P/B frames, logarithmic P-search, simple B-frame search, original frames as reference frames, and a search window of +/- 10 pixels.
The results in Table 11 are not too surprising. For values of T smaller than 10, the number of penalty frames rises and the FPS falls. For values of T greater than 10, load balancing becomes worse and the FPS falls. Tapering performs well compared to the fixed T values. The extreme methods (S = 1 and S = N/P) give the worst performance. 

4.5 Parallel Performance

4. Parallel MPEG-1 Encoder

 

The tapered scheduling algorithm was used with S = 6. The technique gives good load balancing, which is important. Even though the machines used in our experiments were the same, CPU load and network activity caused performance to vary greatly.
The experiments used four different sets of encoding parameters. I-frame encoding used a pattern of I-frames only and a q-scale of 8. Fast encoding used the following encoding parameters: IBBPBBPBBPBBPBB pattern, q- scale of 8/10/25, logarithmic P-search, simple B-search, original frames as reference frames, and a search window of +/- 10 pixels. Medium encoding used the same encoding parameters as fast encoding, except with two-level P-frame search. Slow encoding used the same encoding parameters as medium encoding, but with cross2 B-frame search. In the graphs below, some numbers may not make sense due to rounding errors and variations in speed caused by other running processes and network traffic.
Speedup and efficiency are two standard measures used to evaluate parallel performance.

Speedup is some number between 0 and p, and efficiency is some number between 0 and 1. In both cases, larger numbers are better.

The following two graphs compare the performance of the different encoding schemes. Speedup and efficiency are better for the slower encoding schemes. This is expected because computation time is larger in relation to the communication and I/O time.
The best speedup is about 7.5 for slow encoding. Speedup for I-frame encoding is much worse. The is partly because I-frame encoding produces much larger output files than the other encoding schemes, increasing the I/O requirements, and partly because I-frame encoding is faster (which means a smaller speedup will push the I/O requirements as much as a larger speedup for the other encoding schemes). Each raw frame is roughly 1 Megabit. Since we use NFS for I/O, and the Ethernet has a capacity of 10 Mb/s, performance is limited to less than 10 frames per second. The experiments achieved much less than this because additional I/O is required for output and combining, network performance is typically less than 10 Mb/s, and some time must be used for computation.
We can alleviate some of the I/O problems by reducing the size of the input. If the input files are originally stored in JPEG format on disk, then the input will be much smaller. This is not unreasonable because hardware exists to JPEG compress input as it is digitized.
The following two graphs show the results when the images were originally compressed using JPEG to roughly 25% the size of the original raw YUV images. dd> This method does not work as well as could be expected. There are several possible reasons why performance is poor: uneven load balancing, the sequential combine time, processor waiting (a slave waiting for the scheduler), I/O bandwidth limitations, and network bandwidth limitations. We will examine each of these possibilities for these four experiments.
For a given number of processors, we can easily calculate the time required for perfect speedup by dividing the sequential time by the number of processors. The difference between this time and the actual running time is the slowdown time. We will measure each potential slowdown factor by the percentage of the slowdown time it contributes. For example, if sequential encoding takes 180 seconds and parallel encoding with 9 processors takes 35 seconds, the slowdown time is 15 seconds (35 - 180/9). If uneven load balancing added 3 seconds to the time, uneven load balancing accounts for 20% of the slowdown time.
Load balancing was very good. It was within 1, 2 or 3 seconds in all cases. The following is a graph of its slowdown time percentage. It explains some of the difference, but in general is not a large factor. It never contributed more than 25% of the slowdown time.
The sequential time is the time spent starting up the slaves and the time for the Combine Server to finish once all the slaves are finished. This time is dependent on which processors finish last and in what order.

The combine time is a large factor, especially in I-frame encoding. This is to be expected, since the combine time is a function of the size of the output and I-frame encoding results in larger output. More research can be done on trying to create a scheduling algorithm which reduces the combine time, but that is only half the problem.

The remaining slowdown time must come from processor waiting, and I/O and network bandwidth limitations. The following graph shows this contribution. Processor waiting is very small, so most of the slowdown is from the I/O and network limitations.
The fast JPEG encoding is probably more affected by network limitations because of penalty frames. From our scheduling experiments we expect about 60 penalty frames (I-frame encoding incurs no penalty frames). Penalty frames in the JPEG version are more costly because each frame must be decompressed. This decompression time is included in the above graph since it is not needed for the sequential version (and is therefore part of slowdown).
Let us examine more carefully what we should have expected. The following table lists an estimate on I/O and CPU requirements. It assumes 60 penalty frames (0 penalty frames when encoding I-frames only). CPU time includes the time required to uncompress JPEG frames where applicable. To compute I/O requirements, we add the input bits, the output bits, the output bits twice more (one read, one write when combining), and more input bits for the 60 penalty frames (where applicable).
Table 12: I/O and CPU times
Method  Mbits I/O  CPU seconds
I-Frames  250,872,456  91
Fast  233,366,193  240
Medium  234,198,144  560
Slow  233,033,760  1050
I JPEG  138,154,632  180
Fast JPEG  75,561,189  336
Based on this table, we can estimate the worst case FPS. The table that follows assumes that the I/O and CPU time are completely independent. If there is overlap, then the encoding will be done faster. The table is for 9 processors.
Table 13: Worst Case FPS Estimates
Method  10 Mb/s I/O  5 Mb/s I/O
I-Frames  4.41 fps  2.59 fps
Fast  3.07  2.11
Medium  1.77  1.40
Slow  1.08  0.93
I JPEG  4.52  3.27
Fast JPEG  3.37  2.90
This table, along with the tables for the JPEG algorithm, suggest that I/O performance is worse for the JPEG algorithm. This may occur because the data is in smaller packages than with the non-JPEG version. This may cause the overhead of NFS access to degrade I/O performance.
Even if this is true, however, the JPEG version is the only way to overcome the network bottleneck. To see this, the following is a graph of minimum FPS for different numbers of processors, given 10 Mb/s I/O. It shows that using JPEG should be better for 7 or more processors, and get much better for larger numbers of processors.

4.6 Comparison to Philips Encoder

4. Parallel MPEG-1 Encoder

 

Earlier, we mentioned the Philips MPEG encoder. It runs on an experimental parallel computer system consisting of 100 nodes. Each node consists of a Motorola MC68020 CPU, MC68881 FPU, and 16 MB of memory.
The Philips encoder achieved 89% efficiency with 100 processors. The overall speed, however, was just 0.6 frames per second, which means the encoder did not reach I/ O limitations. The authors of [12] admit that I/O problems could occur with faster encoding.
In comparison, our encoder achieved 69% efficiency for fast encoding on 9 processors. The overall speed was just under 4 frames per second. The network limitations cause this reduction in efficiency. It should be noted that even if a faster network is used, problems may still occur with I/O. If the network has a capacity of 10 Mb/s, and I/O bandwidth is 2 MB/s (16 Mb/s), then any bottleneck caused by the network masks any problems with I/O performance. If I/O becomes a bottleneck, one possible solution (other than using compressed input frames) is to use multiple disks. Processors could use their local disks to temporarily store encoded frames. If the input frames were stored on local disks, that would help even more. 

5. Conclusions

4. Parallel MPEG-1 Encoder
6. Acknowledgements
Back to Top
We have achieved our goal of creating a portable, parallel MPEG-1 encoder. The code was first distributed to the Internet in July 1993. Over 2,500 copies have been distributed to date (May 1994).
The key to creating an efficient encoder is to optimize the motion vector search - especially the error calculation. Other operations are less important in time significance. In addition to this optimization, we implemented a variety of different search techniques to account for varying encoding requirements.
A coarse-grained parallel version of the encoder was implemented. Close to linear speed-up can be gained, up to some limit dependent on I/O and network performance. We used a simple adaptive scheduling algorithm in which the scheduler estimates how many frames each processor can encode in a fixed time interval.
Experiments run on nine HP 9000/720 workstations on an ethernet network showed that the major bottleneck is network bandwidth, which is directly related to I/O requirements. Reading compressed images from disk to alleviate this bottleneck did not improve performance, but should do so for larger numbers of processors. 

6. Acknowledgments

5. Conclusion
References
Back to Top
Thanks to Brian Smith for many helpful suggestions and to Ketan Patel and Dan Wallach for their work on early versions of the MPEG-1 encoder. 

References

6. Acknowledgements
Back to Top

[1] Ketan Patel, Brian C. Smith, and Lawrence A. Rowe, "Performance of a Software MPEG Video Decoder," UCB/ERL M93/2, 7 January 1993

 [2] PVRG-MPEG Codec 1.0, Andy C. Hung, Feb. 25, 1993

 [3] MPEG-1 Standard (ISO/IEC International Standard 11172-2)

 [4] Atul Puri and R. Aravind. "Motion-Compensated Video Coding with Adaptive Perceptual Quantization," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 1 No. 4, December 1991.

 [5] Bede Liu and Andre Zaccarin, "New Fast Algorithms for the Estimation of Block Motion Vectors," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 3, No. 2, April 1993.

 [6] John A. Saghri, Patrick S. Cheatham, and Ali Habibi. "Image quality measure based on a human visual system model," Optical Engineering, Vol. 28 No. 7, July 1989.

 [7] Lucco, Steven. "A Dynamic Scheduling Method for Irregular Parallel Programs," Proceedings of the Conference on Programming Language Design and Implementation, ACM Sigplan, 1992.

 [8] Sohail Zafar, Ya-Qin Zhang, and John S. Baras. "Predictive Block-Matching Motion Estimation for TV Coding - Part I: Inter-Block Prediction," IEEE Transactions on Broadcasting, Vol. 37, No. 3, Sept. 1991.

 [9] L.A. Rowe, J. Boreczky, and C. Eads. "Indexes for User Access to Large Video Databases," Proceedings of IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, February 1994.

 [10] C. Federighi and L.A. Rowe. "The Design and Implementation of the UCB Distributed Video on Demand System," Proceedings of IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, February 1994.

[11] S.J. Teller and C.H. Sequin. "Visibility preprocessing for interactive walkthroughs," Computer Graphics, July 1991, Vol. 25, No. 4: 61-69.

 [12] Frans Sijstermans and Jan van der Meer. "CD-I Full- Motion Video Encoding on a Parallel Computer," Communications of the ACM, Vol. 34 No. 4, April 1991.

 [13] ISO 10918-1 JPEG Draft International Standard

 [14] Pixar, Inc., "Luxo Jr." (film), 1986 


Footnotes

1. Each encoded macroblock is numbered (with an offset from the previous encoded macroblock). If the increment is greater than 1, then there are skipped macroblocks. The number of bits required to encode the increment increases for larger increments.

 2. We provide a mechanism for dealing with the situation in which the input disk is not mounted using NFS. However, we will assume for simplicity of discussion that all disks are mounted using NFS.