Berkeley Multimedia Research
Center
Published: September 1994
Berkeley, CA
USA
http://www.bmrc.berkeley.edu
Parallel MPEG-1 Video Encoding
Abstract
1. Introduction
2. MPEG-1 Video
2.1 Data Rate
2.2 Color Space
2.3 Frame Coding
2.4 Bitstream Representation
2.5 Challenges
3. Sequential MPEG-1 Encoder
3.1 Implementation
3.2 Motion Vector Search Techniques
3.3 Optimizations
3.4 Quality and Performance
3.4.1 Comparison of I/P.B-frame Encoding Times
3.4.2 Performance on Different Image Sequences
3.4.3 Performance on Different Machines
3.4.4 Performance of Different Search Techniques
3.5 Comparison of Original vs. Decoded Reference Frames
4. Parallel MPEG-1 Encoder
4.1 Communication Required
4.2 Network Model
4.3 I/O
4.4 Scheduling
4.5 Parallel Performance
4.6 Comparison to Philips Encoder
5. Conclusion
6. Acknowledgements
References
Parallel MPEG-1 Video Encoding*
Kevin L. Gong and Lawrence A. Rowe
Computer Science Division - EECS
University of California
Berkeley, California 94720
(Keving@CS.Berkeley.EDU,
Rowe@CS.Berkeley.EDU)
*This research was supported by the National Science Foundation
(Grant MIP 90-14940), Digital Equipment Corporation, Fujitsu Network Transmission
Systems, Inc., and Hewlett-Packard Corp.
Abstract
MPEG-1 is an ISO/IEC (International Organisation for Standardisation/International
Electrotechnical Commission) standard for medium-quality and medium-bit
rate video and audio encoding. We have developed a parallel MPEG-1 encoder
that runs on a network of workstations. We have implemented a number of
different motion vector search techniques. The sequential code can encode
over 0.6 frames per second on an HP 9000/720 machine, and the parallel
code can encode slightly less than 4 frames per second on a network of
9 of these machines connected via Ethernet.
Keywords
MPEG, MPEG-1, video compression, parallel programming
1. Introduction
2. MPEG-1 Video
Back to Top
The Motion Pictures Expert Group (MPEG) has produced a standard for video
and audio encoding for medium- quality, medium-bit-rate compression, called
MPEG-1 [3].
Our goal in developing an MPEG-1 encoder was to support the creation of
the Berkeley Distributed Video on Demand System (VODS) [9,
10]. The VODS system is designed to handle hundreds of hours of video.
Due to the large size and amount of video, that video must be compressed.
MPEG-1 allows video to be compressed by ratios in the range of 50:1 to
100:1, depending on image sequence type and desired quality.
Commercial encoders use custom hardware and parallel encoding, which is
fast and expensive. For example, the C-Cube real-time MPEG-1 video encoder
uses 8 custom-designed video processor chips; a complete system is sold
for $120,000. Optibase and Optivision sell PC-based, real-time encoders
based on the C-Cube video processor chips for $20,000. Software-only encoders
are slower. For example, the Stanford encoder, running on a DEC 5000, encodes
at just 0.2 frames per second [2]. The Philips parallel
MPEG encoder runs on a POOMA machine [12]. POOMA is
an experimental parallel computer system consisting of 100 nodes. Each
node consists of a Motorola MC68020 CPU, MC68881 FPU, and 16 MB of memory.
Later, we will compare the performance of the Philips encoder with our
encoder.
We developed a parallel encoder that runs on a network of workstations
to reduce the time needed for encoding. We wanted our encoder to be used
by as many people as possible, including those already using the Berkeley
MPEG-1 player [1]. To address this issue, we designed
the encoder to be as flexible and easy-to-use as possible, allowing different
methods and quality of encoding for users with different needs. Thus, our
ultimate goal was to create a usable, flexible, portable, parallel encoder
with reasonable quality and performance.
This paper describes the design, implementation, and performance of the
Berkeley MPEG-1 video encoder. The paper is organized as follows. Section
2 describes the MPEG-1 video standard. Section 3 describes the encoder
and discusses the quality and performance of the sequential encoder. Section
4 describes the parallel encoder and its performance. Section 5 presents
our conclusions.
2. MPEG-1 Video
1. Introduction
3. Sequential MPEG-1 Encoder
Back to Top
The MPEG-1 video standard is designed to encode color images at 1.5
Mb/s and to provide VHS-quality playback. The basic idea behind MPEG encoding
is to exploit temporal locality. Excepting certain types of video such
as music videos, the images do not change much within small time intervals.
MPEG-1 video takes advantage of this by encoding an image relative to other
images temporally close to it.
2.1 Data Rate
2. MPEG-1 Video
A source video is a sequence of numbered frames, F1...Fn. Each frame is
a still image. A video player displays one frame after another, usually
at a rate close to 30 frames per second (the standard television rate).
Frames are digitized in a standard RGB format, 24 bits per pixel (8 each
for Red, Green, and Blue). This generates the data rates shown in Table
1. MPEG-1 is designed to produce bit rates of 1.5 Mb/s or less, and is
intended to be used with images of size 352x288 (at 24-30 frames per second).
Larger images are typically scaled before encoding.
Table 1: Uncompressed Data Rates
| Image Size |
Frame Rate
(frames/sec) |
Data Rate
(Mbits/sec) |
| 160x120 |
12 |
5.3 |
| 160x120 |
20 |
8.8 |
| 352x288 |
24 |
55.7 |
| 352x288 |
30 |
69.6 |
| 720x576 |
24 |
227.8 |
| 720x576 |
30 |
284.8 |
2.2 Color Space
2. MPEG-1 Video
The MPEG-1 algorithm operates on images represented in the YUV color space.
Thus, if an image is stored in RGB format it must first be converted to
YUV format.
The YUV format is subsampled. All luminance information (Y) is retained;
there are 8 bits per pixel (bpp) of luminance information. However, chrominance
information (U and V) is subsampled 2:1 in both the horizontal and vertical
directions, which reduces the information by a 4:1 ratio. Thus, there are
2 bits per pixel of U information and 2 bits per pixel of V information.
This subsampling does not drastically affect quality because the eye is
more sensitive to luminance than to chrominance information.
Subsampling is a lossy step. The 24 bits per pixel of RGB information is
reduced to 12 bits of YUV information, which automatically gives 2:1 compression.
2.3 Frame Coding
2. MPEG-1 Video
Frames are logically divided into 16x16 pixel macroblocks. Each macroblock,
in turn, is divided into six 8x8 blocks - 4 luminance and 2 chrominance
(4 Y blocks, 1 U block, and 1 V block).
There are three types of frames: intra-frames (I), forward predicted frames
(P), and bi-directional predicted frames (B). P- and B-frames are encoded
relative to other frames as shown in Figure 1.
An I-frame is encoded as a single image, with no reference to any past
or future frames. The encoding scheme used is similar to JPEG compression
[13]. Each block is encoded independently with one
exception explained below. The block is first transformed from the spatial
domain into a frequency domain using the Discrete Cosine Transform (DCT),
which separates the signal into independent frequency bands. Most frequency
information is in the upper left corner of the resulting block. At this
point, the data is quantized. Quantization can be thought of as ignoring
lower-order bits (though this process is slightly more complicated). Quantization
is the only lossy part of the compression scheme other than subsampling.
The resulting data is then run-length encoded in a zig-zag ordering to
optimize compression. This zig-zag ordering produces longer runs of 0's
by taking advantage of the fact that there should be little high-frequency
information. The afore-mentioned exception to independence is that the
coefficient in the upper left corner of the block, called the DC coefficient,
is encoded relative to the DC coefficient of the previous block.
A P-frame is encoded relative to the past reference frame. A
reference frame is a P- or I-frame. The past reference frame is the closest
preceding reference frame. More formally, the past reference frame for
a frame Fi is the reference frame Fj such that Fj comes before Fi in display
order, and there are no other reference frames between Fj and Fi in display
order.
Each macroblock in a P-frame can be encoded either as an I-macroblock or
as a P-macroblock. An I-macroblock is coded just like a macroblock in an
I-frame. A P-macroblock is some 16x16 area of the past reference frame,
plus an error term. To specify the 16x16 area of the reference frame, a
motion-vector is transmitted. A motion vector of (0,0) means the
16x16 area is in the same position as the macroblock we are encoding. Other
vectors are relative to that position. Motion vectors may include half-pixel
values, in which case pixels are averaged. For example, the 16x16 area
resulting from a motion vector of (1.5, 3.0) is the average of the two
16x16 areas resulting from the motion vectors (1.0, 3.0) and (2.0, 3.0)
(see Figure 2). The error term is transmitted using the DCT, quantization,
and run-length encoding. A special case is a skipped macroblock, which
refers to a macroblock with a (0,0) motion vector and no error term. It
takes just 0 or 1 bits to encode a skipped macroblock1.
The search for the best motion vector (the one which gives the smallest
error term possible) is the heart of any MPEG-1 video encoder. The motion
vector search is what makes encoders inherently slower than decoders.
A B-frame is encoded relative to the past reference frame, the future
reference frame, or both frames. The future reference frame is the
closest proceeding reference frame. The future reference frame for a frame
Fi is the reference frame Fk such that Fk comes after Fi in display order,
and there are no other reference frames between Fi and Fk in display order.
The encoding for B-frames is similar to P-frames, except that motion vectors
may refer to areas in the future reference frame. For macroblocks that
use both past and future reference frames, the two 16x16 areas are averaged.
Skipped macroblocks are also defined for B-frames. In B- frames, a skipped
macroblock has the same motion vector as the previous macroblock, and no
error term.
Frames not need follow a static IPB pattern. Each individual frame can
be any one of the three types. Often, however, the IPB sequence, called
the coding pattern, is fixed for an entire video. This pattern is
just one of several encoding parameters. The parameters having to do with
the motion vector search are specific to an encoder implementation, so
they will be discussed in Section 3. All encoders, however, must handle
the parameters having to do with quantization. The quantization matrix
(q-matrix) determines the quantization in the I-macroblocks. This
matrix can be mod- ified to optimize compression and quality for a particular
image sequence. The quantization scale (q-scale) controls the granularity
of quantization. In essence, the quantization matrix is multiplied by the
q-scale. The q-scale also affects quantization for the DCT error terms
in P- and B-macroblocks. Encoders can vary the q-scale and q-matrix to
change the bit rate and image quality in a sequence.
2.4 Bitstream Representation
2. MPEG-1 Video
An MPEG-1 video sequence is an ordered stream of bits, with special bit
patterns marking the beginning and ending of a logical section.
Each video sequence is composed of a series of Groups of Pictures
(GOP's). A GOP is composed of a sequence of pictures (frames). A frame
is composed of a series of slices. A slice is composed of a series of macroblocks,
and a macroblock is composed of 6 blocks (4 for luminance and 2 for chrominance).
(see Figure 3)
The GOP structure is intended to assist random access into a sequence.
A GOP is an independently decodable unit that can be of any size as long
as it begins with an I-frame.
Each slice is an independently decodable unit too. There can be 1 slice
per frame, 1 slice per macroblock, or anywhere in between. The slice structure
is intended to allow decoding in the presence of errors. Serendipitously,
it also allows parallel encoding/decoding at the slice level.
The MPEG-1 standard allows various properties of the sequence to be modified
in mid-stream. For example, the q-matrix and q-scale can both be changed.
The entire q- matrix can be changed between video sequences. The q- scale
can be changed between slices or even between macroblocks.
2.5 Challenges
2. MPEG-1 Video
The primary challenges to implementing an MPEG-1 encoder are: 1) creating
fast, efficient motion vector search techniques and 2) varying the encoding
parameters to provide the best balance of encoding speed, compression,
and quality.
Considerable research has been done on motion vector search [8].
Research has also been done on dynamically varying encoding parameters
to produce the highest quality image at a specified bit rate (see, for
example, the paper by Puri and Aravind [4]). However,
much of this work has not been published because it is the basis for competition
between different commercial products and services.
3. Sequential MPEG-1 Encoder
2. MPEG-1 Video
4. Parallel MPEG-1 Encoder
Back to Top
We first implemented a sequential MPEG-1 encoder running on a single workstation
before creating the parallel version. This section describes the implementation
and performance of the sequential encoder.
3.1 Implementation
3. Sequential MPEG-1 Encoder
The sequential encoder compresses one frame at a time, remembering information
about various reference frames as required. The encoder keeps information
on at most 3 frames at a time: the current frame, the past reference frame,
and the future reference frame. We assume random access to frames. If,
instead, we only had sequential access to the frames in display order,
we would have to remember up to M+1 frames, where M is the
greatest distance between reference frames. In other words, M-1
is the maximum number of consecutive B-frames.
Prospective users may want to use different encoding processes depending
on the required bit rate, quality, and encoding speed. These criteria also
depend on the source material. To accommodate this variability, the encoder
reads a parameter file that specifies the encoding parameters.
All options are set once per execution of the encoder and remain constant
throughout the execution of the program. We currently do not support dynamically
varying parameters. The following parameters can be set:
Q-scale. The q-scale can be set separately for I-, P-, and B-frames.
Pattern. The IPB pattern can be set.
Search techniques. A variety of different motion vector search techniques
and ranges can be selected. These will be discussed later.
Original or decoded frame as reference frame (the basis for the
motion vector search will be discussed in section 3.6).
3.2 Motion Vector Search Techniques
3. Sequential MPEG-1 Encoder
Different search techniques produce dramatic differences in speed and compression
rates. The search techniques implemented for P-frames are: 1) exhaustive,
2) subsample, 3) two-level, and 4) logarithmic.
The search window, which specifies the motion vectors to consider, can
be specified for each technique. The search window is a square area centered
around the (0,0) vector.
Exhaustive search examines all motion vectors in the search window. The
vector that gives the smallest error is considered the best vector. This
error, which compares two 16x16 areas, can be computed in any one of several
ways. A standard error measure, called the mean squared error (MSE),
is the average of the squares of the 256 luminance differences. Generally
speaking, chrominance is ignored when computing the error because, as stated
earlier, the eye is more sensitive to changes in luminance.
The multiplication in the MSE makes it slow. Consequently, we use the mean
absolute difference (MAD) measure, which is the average of the absolute
values of the 256 luminance differences. This method gives results similar
to the MSE, but at a much lower computational cost.
Subsample search computes the error based on fewer pixels. Only 64 of the
256 pixels are used to compute the error [5].
The two-level search is a simple hierarchical search. We first search,
exhaustively, for the best full-pixel motion vector. We then pick the best
vector from that vector and the surrounding 8 half-pixel vectors.
Logarithmic search significantly reduces the number of motion vectors to
test. It first tests nine vectors in the search window. It then tests eight
vectors centered on the best vector from the first nine. This process is
repeated until the search ends on a single vector. Figure 4 illustrates
this approach.
To compare the different search techniques quantitatively, we will compute
the number of pixel-to-pixel comparisons for each method. Optimization
reduces these numbers, so the comparison is not precise, but it will give
us a rough idea of what to expect. Table 2 shows the differences, assuming
an n x m pixel rectangular search window, where n > m.
Table 2: Theoretical Comparison of P-Frame Search Techniques
| Search Technique |
Pixel-to-Pixel Comparisons |
| Exhaustive |
256(2n+1)(2m+1) |
| Subsample |
64(2n+1)(2m+1) |
| Two-Level |
256((n+1)(m+1)+8) |
| Logarithmic |
256(6log2 2m + 2log2 2n + 1) |
Search techniques for B-frames must examine both the past and future reference
frames. The techniques implemented are: 1) exhaustive, 2) simple, and 3)
cross2.
In all three methods, the best forward and backward vectors are found independently
(using the P-frame search method). Call these vectors Vf and Vb, respectively.
If using one of these vectors is better than using a combination of a forward
and backward vector, then the single vector is used instead. Thus, the
only difference in techniques is how they search for the best combination
of vectors.
Exhaustive search does an O(n2) search. For each forward vector, it tries
to find the best matching backward vector. This method is obviously extremely
slow.
In the simple search, the only combination considered is Vf and Vb. There
is no additional search, so this method takes roughly twice as long as
a P-frame search.
The cross2 technique searches for the best backward vector that matches
Vf and the best forward vector that matches Vb. The best of the resulting
4 alternatives (Vf, Vb, Vf+match, or Vb+match) is taken. This method takes
roughly twice as long as the simple search.
3.3 Optimizations
3. Sequential MPEG-1 Encoder
MPEG-1 encoding is inherently slower than decoding because of the time
required to search for motion vectors. One goal we had in developing the
encoder was to write efficient code that would be easy to modify and debug.
Our approach was to write simple, modular code, and then use profiling
to discover where time was being spent. It turned out, not surprisingly,
that most of the time spent was in a very few procedures. Using a variety
of algorithmic and coding tricks, we achieved a speed-up of almost 8 times
compared to the non-optimized version.
The first optimization was to combine the computation of motion-compensated
blocks and the corresponding error term. Originally, the block was computed
and placed in memory, and then the error was calculated. Part of the reason
for this separation was for modularity. Computing the motion block required
interpolation for half-pixel motion, so it was difficult to combine the
computations. We combined the computation and at the same time made a related
optimization. In the motion vector search, a half-pixel value may be used
more than once. Therefore, we compute all half-pixel values for the entire
frame once. These two optimizations together sped up the encoder by over
a factor of two.
The second optimization was unrolling the main loop that computed the error.
It was two nested loops of 16 iterations each. The inner loop was unrolled
to speed up the encoder by about 25%.
The third optimization was to keep track of the smallest error found, and
stop the computation as soon as the error being computed was larger. The
few lines it took to make this change sped up the encoder by another 50%.
After making this change, it made sense to change the order in which the
motion vector search was done. Instead of doing a raster-scan order search,
it makes more sense to do a spiral search centered around (0,0) because
smaller motion vectors are more common. This change sped up the encoder
by another 30%.
In a similar vein, it made sense to try the motion vector from the previous
macroblock before doing the spiral search. This strategy is especially
useful when the motion is caused by the camera panning a scene, in which
case many motion vectors will be the same. This sped up the encoder by
an additional 10%.
Finally, values were originally stored as 16-bit integers. However, most
compilers convert values to 32-bit integers before doing any arithmetic.
Therefore, we changed the code so that values are stored as 32-bit integers
to eliminate the need for conversions. This sped up the encoder by an additional
25-30%.
Even with the search optimizations, a current profile of the encoder still
reveals that motion vector search dominates running time. If there are
no P- or B-frames, the DCT and quantization computations account for over
50% of the time. In a standard IBBPBBPBBPBBPBB pattern, the motion vector
search takes over 90% of the time.
3.4 Quality and Performance
3. Sequential MPEG-1 Encoder
We conducted a series of experiments to test the relative performance of
different encoding parameters and the different motion vector search techniques.
The experiments are designed to compare the compression factor, encoding
time, and image quality.
Compression factors are given in terms of bits per pixel. Since the input
is of size 24 bits per pixel, an output of 1 bit per pixel corresponds
to 24:1 compression. Smaller bits per pixel are clearly better.
Encoding times are given in terms of macroblocks processed per second.
For comparison, a 352x288 pixel image contains 396 macroblocks.
There are many ways to define image quality, including MSE, MAD, and signal-to-noise
ratio (SNR). More complex measures have been developed [6],
but we have not yet used them.
None of the traditional metrics are really satisfactory, because you can
easily construct two images in which the image with lower quality according
to an error metric is noticeably better to the human eye. Also, no mention
has been made of moving image quality. One might argue that the logical
thing to do is to look at the errors of each individual frame, but in reality
the human eye may be more susceptible to different kinds of errors in moving
pictures than it is in still pictures.
Since no good, simple metric exists, we used the peak signal-to-noise ratio
(PSNR) in our experiments. PSNR is defined as:
where MSE is the mean squared error. Using this metric, larger numbers
are better quality. For example, if each pixel in the encoded image is
different than the original image by just 1, the PSNR is 20log10
255 = 48.13. If, instead, each is different by 2, the PSNR is 20log10
255/2 = 42.11.
All tests, except where noted, were done with the following encoding parameters:
IBBPBBPBBPBBPBB pattern, q-scale of 8/10/25 for I/P/B frames, logarithmic
P- search, cross2 B-search, decoded frames as reference frames, and a search
window of +/- 10 pixels. Times reported in the sequential experiments,
except where noted, are for encoding time only (no I/O time is included).
The tests were done on a DEC 3000/400.
Performance and image quality are affected by the type of image being compressed.
To avoid bias in our experiments, we used 4 different image sequences.
The flower garden sequence is a standard test of video compression
algorithms. It has high-frequency information located in the flowers at
the bottom of the images, which makes compression hard. It has low-frequency
information in the form of clear sky in the upper left, which should give
good compression. And, it has motion because the camera is panning from
left to right. This panning causes a tree in the foreground to move relative
to the background.
The tennis sequence shows two people playing ping- pong. The first
scene is a ball bouncing and there is very little information in the image.
The second scene shows one player bouncing the ball across the net. This
sequence has less high- frequency information than the flower garden sequence,
and thus gets better compression.
The walk sequence is a computer-generated walk- through of Soda
Hall, the new UC Berkeley computer science building [11].
The motion in this sequence comes only from movement of the camera (except
for a few frames with moving elevator doors and a moving sculpture). Since
it is computer-generated, many of the surfaces are uniformly colored with
little detail, making it compress well.
Lastly, the luxo sequence is a Pixar animation [13].
It includes a lot of dark, unchanging areas, and minimal motion.
All sequential results, except those explicitly comparing the different
image sequences, are averaged from the 4 different sequences. Normalized
results were computed to avoid skewing based on one sequence performing
particularly well or poor on a given test. However, the normalized results
did not differ significantly from the averaged results, so are not shown.
The normalized results are only mentioned here to show that the skewing
possibility was taken into account.
3.4.1 Comparison of I/P/B-frame Encoding Times
3. Sequential MPEG-1 Encoder
3.4 Quality and Performance
The pattern of I/P/B frames has a large effect on the overall time for
encoding. Table 3 compares time and compression for I, P and B frames:
Table 3: Comparison of I/P/B-Frames
| Frame Type |
Bits Per Pixel |
Macroblocks Per Second |
PSNR |
| I-frames |
1.344 |
761 |
34.3 |
| P-frames |
0.607 |
458 |
33.2 |
| B-frames |
0.081 |
198 |
31.5 |
Note the significance of B-frames. B-frames are a factor of 7.5 times smaller
than P-frames, but at only 2.3 times the computation cost and a 5% reduction
in quality. This data illustrates a fundamental behavior of MPEG (and many
other compression schemes): computation time is traded for improved
compression.
3.4.2 Performance on Different Image Sequences
3. Sequential MPEG-1 Encoder
3.4 Quality and Performance
Table 4 compares quality, compression, and encoding speed for the different
image sequences. Notice that the computer-generated sequences (walk
and luxo) yield higher compression ratios and quality in less time
than the other sequences. This result is important because many users of
our encoder are generating movies from scientific visualization experiments.
One future research direction might be to explore coding heuristics based
on the type of source material.
Table 4: Comparison of Images
| Sequence |
Bits Per Pixel |
MPS |
PSNR
(I/P/B) |
| flowers |
0.625 |
229 |
30.7/30.3/27.6 |
| tennis |
0.255 |
221 |
32.6/32.0/31.0 |
| walk |
0.186 |
244 |
36.0/34.5/33.7 |
| luxo |
0.169 |
294 |
37.9/35.9/33.7 |
As a general rule, a higher compression ratio requires less encoding time,
and yields higher quality. There are two reasons for this observation.
First, speed comes from a larger number of skipped macroblocks, which also
improves compression. Skipped macroblocks also speed up encoding because
they are found first. Second, the improved quality results from the absence
of high-frequency information, which also improves compression because
there are more zero values in the DCT. The following table supports the
skipped macroblock theory by showing how many macroblocks of each type
are found in B-frames.
Table 5: Macroblock Types in B-Frames
| Sequence |
I |
B |
Skipped |
| flowers |
0.02% |
92% |
8% |
| tennis |
0.14% |
94% |
6% |
| walk |
0.00% |
84% |
16% |
| luxo |
0.09% |
68% |
31% |
3.4.3 Performance on Different Machines
3. Sequential MPEG-1 Encoder
3.4 Quality and Performance
Table 6 compares the performance of the sequential encoder on different
machines. Relative performance is roughly in agreement with the SPECint
numbers.
Table 6: Machine Comparison
| Machine |
Macroblocks Per Second |
SPECint92 |
| HP 9000/755 |
280 |
80.0 |
| DEC 3000/400 |
247 |
74.7 |
| Sparc 10/30 |
104 |
40.0 |
3.4.4 Performance of Different Search Techniques
3. Sequential MPEG-1 Encoder
3.4 Quality and Performance
This section compares the performance of different P- and B-frame motion
vector search techniques. Table 7 compares the P-frame motion vector search
techniques.
Table 7: P-frame Motion Vector Search
| Technique |
Bits Per Pixel |
Macroblocks Per Second |
PSNR |
| Exhaustive |
0.575 |
56 |
33.3 |
| Subsample |
0.578 |
138 |
33.3 |
| TwoLevel |
0.579 |
181 |
33.3 |
| Logarithmic |
0.607 |
458 |
33.2 |
There is almost no difference in quality or compression between the first
three techniques, which suggests that two- level search should be used
because it is significantly faster than the other two methods. Logarithmic
search gives almost identical quality and slightly worse compression, but
at a much faster rate.
Table 8 compares the B-frame motion vector search techniques.
Table 8: B-frame Motion Vector Search
| Technique |
Bits Per Pixel |
Macroblocks Per Second |
PSNR |
| Exhaustive |
0.081 |
Less than 1 |
31.1 |
| Cross2 |
0.079 |
197 |
30.9 |
| Simple |
0.077 |
346 |
30.8 |
Exhaustive search gives minimal quality improvement over cross2, but actually
gives worse compression, and cross2 in turn gives worse compression than
simple. These results are unexpected; if our error metric (MAD) were directly
correlated with compression ratio, it would be impossible. We have not
been able to explain this result. Further research on this subject is needed.
Table 9 shows how the choice of P-frame search technique affects the performance
of B-frame search (in this case, cross2 search).
Table 9: Effect of P-search on B-search
| Technique |
Bits Per Pixel |
Macroblocks Per Second |
PSNR |
| Exhaustive |
0.079 |
14 |
31.7 |
| Subsample |
not implemented |
-- |
-- |
| TwoLevel |
0.079 |
53 |
31.7 |
| Logarithmic |
0.081 |
198 |
31.5 |
The differences in compression and quality are very small for the three
techniques, which makes a strong argument for using logarithmic search
when doing B-frame searches.
3.5 Comparison of Original vs. Decoded Reference Frames
3. Sequential MPEG-1 Encoder
An interesting question is whether to use the original, unencoded frames
as reference frames, or to use the decoded frames as reference frames.
Because any MPEG-1 playback system will not have access to the original
frames, the best reference frame is the decoded frame. The problem this
causes is that during the encoding process, the reference frames must all
be decoded, adding to the time needed.
Oddly enough, there are also concerns about compression ratio. While image
quality will be sacrificed by using the original frames as reference frames,
it turns out that the compression ratio is improved
Table 10 compares the use of original and decoded frames as reference frames.
Table 10: Original or Decoded?
| Reference |
Bits Per
Pixel |
Macroblocks
Per Second |
PSNR
I/P/BM |
| Decoded |
0.309 |
247 |
34.3/33.2/31.5 |
| Original |
0.276 |
297 |
34.3/31.3/30.3 |
The MPEG-1 standard suggests that there is usually little difference in
quality. The numbers here tend to support this because quality differs
by about 5%. Of course, this observation depends on your definition of
"little difference," particularly in light of the "little differences"
that caused many small changes to be added to the standard.
There is a noticeable drop in compression (12%) and encoding speed (17%)
when using decoded frames as reference frames. dd> One useful experiment
might be to lower the q-scale when using the original frames until the
quality reaches that achieved by using the decoded frames, and then comparing
the compression rates.
The experiments in this section have shown that the logarithmic and two-level
search methods are good choices for P-frame motion vector search, and the
simple search method is good for the B-frame motion vector search.
4. Parallel MPEG-1 Encoder
3. Sequential MPEG-1 Encoder
5. Conclusion
Back to Top
MPEG-1 video compression is CPU-intensive and slow. On the other hand,
many people have access to workstations that sit idle much of the time.
We developed a portable parallel encoder to take advantage of these otherwise-idle
machines to speed up encoding.
MPEG-1 video compression lends itself very well to parallelism. There are
several levels at which parallelism can be implemented. The finest-grained
parallelism can be obtained at the macroblock level. The problem with this
approach is that the encoding of one macroblock depends on the encoding
of the previous macroblock. In an I-block (i.e., a block in an I-macroblock),
the DC value is represented as an offset from the DC value of the previous
I-block. In B- frames, a skipped macroblock is based on the motion vector
of the previous macroblock. Thus, fine-grained communication is required
between processors coding adjacent macroblocks which makes this approach
impractical on a network of workstations.
Medium-grained parallelism can be obtained at the slice level. MPEG-1 devised
the concept of a slice such that each slice could be independently decoded.
Consequently, they can also be independently encoded. Suppose we have one
slice per row of macroblocks. The challenge now is to provide random access
to slices within a frame, and to concatenate the resulting encoded bitstreams.
This approach is good for reducing latency in real-time applications. But,
if the encoding is done off-line, coarse-grained parallelism is more manageable.
At the coarsest level, we can encode GOP's in parallel. Unfortunately,
even encoding one GOP at a time does not completely eliminate the dependencies
between I/ P/B frames. In general, some frames in one GOP will depend on
a reference frame in the previous or next GOP.
Encoding GOP's in parallel is also somewhat inflexible. A more flexible
method is to encode a number of contiguous frames, but not necessarily
a GOP. This approach has similar properties to encoding a GOP at a time,
but allows more scheduling flexibility which improves load- balancing.
We use this approach in our parallel encoder.
4.1 Communication Required
4. Parallel MPEG-1 Encoder
Several different kinds of communication are required in the parallel version
of the MPEG-1 encoder. The most obvious communication needed is that required
for scheduling. Each encoding process must know which frames it should
encode. Such scheduling requires very limited communication, on the order
of a few bytes per frame.
A second kind of communication is required if we use decoded frames as
reference frames. Consider the sequence IBBPBBI. Suppose processor P0
encodes frames 1-3 (IBB), and processor P1 encodes frames 4-7
(PBBI). Processor P1 must get encoded frame 1 from processor
P0 to encode frame 4. Thus, we need a method to transfer encoded
frames between processors.
A third kind of communication is required to transfer raw frames. This
communication introduces the greatest amount of traffic because uncompressed
image data is very large. Because the images come from disk in the first
place, a simple solution is to use NFS. To simplify matters, we assume
that each input frame resides in a separate file2.
A fourth kind of communication is required to transfer the encoded frames
to disk or a processor that combines them to form the output bitstream.
This communication is less intensive than the transfer of raw frames, because
encoded frames are much smaller (50-100 times smaller).
4.2 Network Model
4. Parallel MPEG-1 Encoder
Figure 5 shows the process architecture for the parallel encoder. The Master
Server schedules the slave processes that do the encoding. The Decode Server
directs the transfer of encoded frames between slaves if decoded frames
are used as reference frames. It does not actually make the transfer. It
only keeps track of the reference frames that are ready and the processes
which are waiting for frames, and it notifies them when appropriate. The
slave processes read/write the encoded frames using NFS. The Combine Server
concatenates the encoded frames to create the output bitstream.
Currently, the Master, Decode, and Combine Servers all run on the same
physical machine. The Slave processes run on different machines. There
may or may not be a slave process on the "server" machine.
NFS is used to transmit frames to the individual slave processes, as well
as to/from the Combine Server. Elsewhere, communication is done using UNIX
TCP sockets.
4.3 I/O
4. Parallel MPEG-1 Encoder
An often overlooked, but important aspect of any parallel algorithm is
I/O bandwidth. Regardless of how the parallel encoder works, a limiting
factor is the disk on which the original frames sit. If the frames cannot
be accessed faster than a certain rate, the encoder cannot work faster
than that rate.
Video is digitized into a compressed format (most commonly JPEG). Thus,
we can reduce some of the I/O requirements by keeping the video in JPEG
format, sending the compressed data to the slave processes, and only then
having them uncompress the frames once they get there. We will present
results for experiments that use compressed images and experiments that
use uncompressed images.
4.4 Scheduling
4. Parallel MPEG-1 Encoder
An important question is how to schedule frames to processors. Here, there
is a trade-off between load balancing and scheduling overhead. For example,
scheduling one frame per processor ensures the best possible load balancing,
but increases overhead to an unbearable amount. Scheduling overhead includes
both the communication required for scheduling, as well as the extra frames
that must be transmitted. For example, if a process encodes IBB, then it
requires the future reference frame; even if original frames are used as
reference frames, the future reference frame is read at least twice - once
by this process, and once by the process that actually encodes it. A frame
that is read by a process that does not encode it is a penalty frame.
The scheduling method with the least overhead is one in which all processors
run at about the same speed, so each processor is given N/P contiguous
frames, where N is the number of frames and P is the number
of processors. In this case, there are no more than 2P penalty frames
(2 per processor). In the best case (but still allowing B-frames), there
are P penalty frames.
At the opposite extreme, there is the one frame per processor solution.
In this scheduling algorithm, there are 2Nb+Np
penalty frames, where Nb is the number of B-frames and
Np is the number of P-frames. The large number of penalty
frames makes this solution impractical.
Unfortunately, there is also a trade-off in the amount of parallelism gained.
Generally speaking, sequences of frames can be encoded completely in parallel,
guaranteeing perfect parallelism in any scheduling method. There are two
reasons why this is not the case. First, the encoder produces one output
file per frame, and these files must be combined into one file. A separate
process, the Combine Server, is in charge of this combining. It puts the
frames in order into a single file. It only proceeds when the next frame
(in decode order) has been encoded. Thus, the scheduling method described
above will cause the combining process to wait until all slave processes
are finished before it begins its task. The combine time for our experiments
was about 15 seconds. For some experiments, this is almost 40% of the total
time to encode!
The second problem is using decoded frames as reference frames. Here, the
above scheduling method is probably the worst method. Consider an IPB sequence
in which the second processor must wait for the first to finish, the third
must wait for the second, and so on, so that almost no parallelism at all
occurs! For example, consider processor P0 encoding IBBPBBPBBPBB, and processor
P1 encoding the rest: PBBIBBPBBP. Process P1 can only encode its first
P frame by waiting for processor P1 to finish encoding its last P frame,
which means that processor P1 does not start until processor P0 is almost
finished.
The only way to ensure parallelism is to have each run of frames start
with an I-frame. If it does not, the slave process must wait for the previous
reference frame to be encoded. The encoder supports this restriction on
scheduling by accepting a FORCE_I_ALIGN parameter that enforces this restriction.
To simplify our parallel results, we use the original frame as the reference
frame in our experiments. Results with decoded frames as reference frames
would be similar if we used a larger number of frames in our experiments.
We originally used a method in between the two previous methods. First,
each processor encodes S frames to determine the approximate encoding speeds
of the different processors. From then on, the scheduler assigns each slave
process T seconds of work. The scheduler assigns a number of frames to
be encoded based on previous performance. This dynamic scheduling method
is useful for heterogeneous systems with processors that operate at different
speeds, as well as homogeneous systems in which network traffic and multi-user
variance causes machine speeds to vary. This method ensures load balancing
to within T seconds.
The problem with this method is that load balancing is affected by the
value of T. Small values of T result in better load balancing, but more
penalty frames. To solve this problem, we used a tapered scheduling algorithm,
in which the number of penalty frames is kept low by having a T that starts
out large, but grows smaller and smaller as we near the finish [7].
This method is implemented by estimating the time remaining, and scheduling
a processor enough frames to keep it busy for that amount of time. Tapering
T in this manner improves load balancing, while keeping the number of penalty
frames relatively low.
Table 11: Scheduling
S
(frames) |
T
(seconds) |
Penalty
Frames |
FPS |
| 1 |
0 |
238 |
2.11 |
| 3 |
5 |
120 |
3.19 |
| 3 |
10 |
55 |
3.57 |
| 3 |
15 |
39 |
3.49 |
| 3 |
20 |
30 |
3.13 |
| 3 |
taper |
68 |
3.75 |
| 6 |
10 |
46 |
3.66 |
| 6 |
taper |
62 |
3.95 |
| 9 |
5 |
65 |
3.57 |
| 9 |
taper |
62 |
3.75 |
| 12 |
10 |
30 |
3.33 |
| 12 |
15 |
22 |
3.41 |
| 12 |
20 |
21 |
3.19 |
| 12 |
taper |
54 |
3.33 |
| 15 |
10 |
16 |
2.94 |
| 15 |
taper |
40 |
2.88 |
| N/P |
- |
12 |
2.68 |
Table 11 shows the number of penalty frames and encoding speed for different
values of S and T. The experiments to produce the table, as well as all
parallel experiments, were run on a 9-machine cluster of HP 9000/ 720 computers
connected by an Ethernet. All parallel experiments used the 150-frame flower-garden
sequence. The scheduling experiments used the following encoding parameters:
IBBPBBPBBPBBPBB pattern, q-scale of 8/10/ 25 for I/P/B frames, logarithmic
P-search, simple B-frame search, original frames as reference frames, and
a search window of +/- 10 pixels.
The results in Table 11 are not too surprising. For values of T smaller
than 10, the number of penalty frames rises and the FPS falls. For values
of T greater than 10, load balancing becomes worse and the FPS falls. Tapering
performs well compared to the fixed T values. The extreme methods (S =
1 and S = N/P) give the worst performance.
4.5 Parallel Performance
4. Parallel MPEG-1 Encoder
The tapered scheduling algorithm was used with S = 6. The technique gives
good load balancing, which is important. Even though the machines used
in our experiments were the same, CPU load and network activity caused
performance to vary greatly.
The experiments used four different sets of encoding parameters. I-frame
encoding used a pattern of I-frames only and a q-scale of 8. Fast encoding
used the following encoding parameters: IBBPBBPBBPBBPBB pattern, q- scale
of 8/10/25, logarithmic P-search, simple B-search, original frames as reference
frames, and a search window of +/- 10 pixels. Medium encoding used the
same encoding parameters as fast encoding, except with two-level P-frame
search. Slow encoding used the same encoding parameters as medium encoding,
but with cross2 B-frame search. In the graphs below, some numbers may not
make sense due to rounding errors and variations in speed caused by other
running processes and network traffic.
Speedup and efficiency are two standard measures used to evaluate parallel
performance.
Speedup is some number between 0 and p, and efficiency is some number
between 0 and 1. In both cases, larger numbers are better.
The following two graphs compare the performance of the different encoding
schemes. Speedup and efficiency are better for the slower encoding schemes.
This is expected because computation time is larger in relation to the
communication and I/O time.
The best speedup is about 7.5 for slow encoding. Speedup for I-frame encoding
is much worse. The is partly because I-frame encoding produces much larger
output files than the other encoding schemes, increasing the I/O requirements,
and partly because I-frame encoding is faster (which means a smaller speedup
will push the I/O requirements as much as a larger speedup for the other
encoding schemes). Each raw frame is roughly 1 Megabit. Since we use NFS
for I/O, and the Ethernet has a capacity of 10 Mb/s, performance is limited
to less than 10 frames per second. The experiments achieved much less than
this because additional I/O is required for output and combining, network
performance is typically less than 10 Mb/s, and some time must be used
for computation.
We can alleviate some of the I/O problems by reducing the size of the input.
If the input files are originally stored in JPEG format on disk, then the
input will be much smaller. This is not unreasonable because hardware exists
to JPEG compress input as it is digitized.
The following two graphs show the results when the images were originally
compressed using JPEG to roughly 25% the size of the original raw YUV images.
dd> This method does not work as well as could be expected. There are several
possible reasons why performance is poor: uneven load balancing, the sequential
combine time, processor waiting (a slave waiting for the scheduler), I/O
bandwidth limitations, and network bandwidth limitations. We will examine
each of these possibilities for these four experiments.
For a given number of processors, we can easily calculate the time required
for perfect speedup by dividing the sequential time by the number of processors.
The difference between this time and the actual running time is the slowdown
time. We will measure each potential slowdown factor by the percentage
of the slowdown time it contributes. For example, if sequential encoding
takes 180 seconds and parallel encoding with 9 processors takes 35 seconds,
the slowdown time is 15 seconds (35 - 180/9). If uneven load balancing
added 3 seconds to the time, uneven load balancing accounts for 20% of
the slowdown time.
Load balancing was very good. It was within 1, 2 or 3 seconds in all cases.
The following is a graph of its slowdown time percentage. It explains some
of the difference, but in general is not a large factor. It never contributed
more than 25% of the slowdown time.
The sequential time is the time spent starting up the slaves and the time
for the Combine Server to finish once all the slaves are finished. This
time is dependent on which processors finish last and in what order.
The combine time is a large factor, especially in I-frame encoding.
This is to be expected, since the combine time is a function of the size
of the output and I-frame encoding results in larger output. More research
can be done on trying to create a scheduling algorithm which reduces the
combine time, but that is only half the problem.
The remaining slowdown time must come from processor waiting, and I/O and
network bandwidth limitations. The following graph shows this contribution.
Processor waiting is very small, so most of the slowdown is from the I/O
and network limitations.
The fast JPEG encoding is probably more affected by network limitations
because of penalty frames. From our scheduling experiments we expect about
60 penalty frames (I-frame encoding incurs no penalty frames). Penalty
frames in the JPEG version are more costly because each frame must be decompressed.
This decompression time is included in the above graph since it is not
needed for the sequential version (and is therefore part of slowdown).
Let us examine more carefully what we should have expected. The following
table lists an estimate on I/O and CPU requirements. It assumes 60 penalty
frames (0 penalty frames when encoding I-frames only). CPU time includes
the time required to uncompress JPEG frames where applicable. To compute
I/O requirements, we add the input bits, the output bits, the output bits
twice more (one read, one write when combining), and more input bits for
the 60 penalty frames (where applicable).
Table 12: I/O and CPU times
| Method |
Mbits I/O |
CPU seconds |
| I-Frames |
250,872,456 |
91 |
| Fast |
233,366,193 |
240 |
| Medium |
234,198,144 |
560 |
| Slow |
233,033,760 |
1050 |
| I JPEG |
138,154,632 |
180 |
| Fast JPEG |
75,561,189 |
336 |
Based on this table, we can estimate the worst case FPS. The table that
follows assumes that the I/O and CPU time are completely independent. If
there is overlap, then the encoding will be done faster. The table is for
9 processors.
Table 13: Worst Case FPS Estimates
| Method |
10 Mb/s I/O |
5 Mb/s I/O |
| I-Frames |
4.41 fps |
2.59 fps |
| Fast |
3.07 |
2.11 |
| Medium |
1.77 |
1.40 |
| Slow |
1.08 |
0.93 |
| I JPEG |
4.52 |
3.27 |
| Fast JPEG |
3.37 |
2.90 |
This table, along with the tables for the JPEG algorithm, suggest that
I/O performance is worse for the JPEG algorithm. This may occur because
the data is in smaller packages than with the non-JPEG version. This may
cause the overhead of NFS access to degrade I/O performance.
Even if this is true, however, the JPEG version is the only way to overcome
the network bottleneck. To see this, the following is a graph of minimum
FPS for different numbers of processors, given 10 Mb/s I/O. It shows that
using JPEG should be better for 7 or more processors, and get much better
for larger numbers of processors.
4.6 Comparison to Philips Encoder
4. Parallel MPEG-1 Encoder
Earlier, we mentioned the Philips MPEG encoder. It runs on an experimental
parallel computer system consisting of 100 nodes. Each node consists of
a Motorola MC68020 CPU, MC68881 FPU, and 16 MB of memory.
The Philips encoder achieved 89% efficiency with 100 processors. The overall
speed, however, was just 0.6 frames per second, which means the encoder
did not reach I/ O limitations. The authors of [12]
admit that I/O problems could occur with faster encoding.
In comparison, our encoder achieved 69% efficiency for fast encoding on
9 processors. The overall speed was just under 4 frames per second. The
network limitations cause this reduction in efficiency. It should be noted
that even if a faster network is used, problems may still occur with I/O.
If the network has a capacity of 10 Mb/s, and I/O bandwidth is 2 MB/s (16
Mb/s), then any bottleneck caused by the network masks any problems with
I/O performance. If I/O becomes a bottleneck, one possible solution (other
than using compressed input frames) is to use multiple disks. Processors
could use their local disks to temporarily store encoded frames. If the
input frames were stored on local disks, that would help even more.
5. Conclusions
4. Parallel MPEG-1 Encoder
6. Acknowledgements
Back to Top
We have achieved our goal of creating a portable, parallel MPEG-1 encoder.
The code was first distributed to the Internet in July 1993. Over 2,500
copies have been distributed to date (May 1994).
The key to creating an efficient encoder is to optimize the motion vector
search - especially the error calculation. Other operations are less important
in time significance. In addition to this optimization, we implemented
a variety of different search techniques to account for varying encoding
requirements.
A coarse-grained parallel version of the encoder was implemented. Close
to linear speed-up can be gained, up to some limit dependent on I/O and
network performance. We used a simple adaptive scheduling algorithm in
which the scheduler estimates how many frames each processor can encode
in a fixed time interval.
Experiments run on nine HP 9000/720 workstations on an ethernet network
showed that the major bottleneck is network bandwidth, which is directly
related to I/O requirements. Reading compressed images from disk to alleviate
this bottleneck did not improve performance, but should do so for larger
numbers of processors.
6. Acknowledgments
5. Conclusion
References
Back to Top
Thanks to Brian Smith for many helpful suggestions and to Ketan Patel and
Dan Wallach for their work on early versions of the MPEG-1 encoder.
References
6. Acknowledgements
Back to Top
[1] Ketan Patel, Brian C. Smith, and Lawrence A. Rowe, "Performance
of a Software MPEG Video Decoder," UCB/ERL M93/2, 7 January 1993
[2] PVRG-MPEG Codec 1.0, Andy C. Hung, Feb. 25, 1993
[3] MPEG-1 Standard (ISO/IEC International Standard 11172-2)
[4] Atul Puri and R. Aravind. "Motion-Compensated Video Coding
with Adaptive Perceptual Quantization," IEEE Transactions on Circuits and
Systems for Video Technology, Vol. 1 No. 4, December 1991.
[5] Bede Liu and Andre Zaccarin, "New Fast Algorithms for the
Estimation of Block Motion Vectors," IEEE Transactions on Circuits and
Systems for Video Technology, Vol. 3, No. 2, April 1993.
[6] John A. Saghri, Patrick S. Cheatham, and Ali Habibi. "Image
quality measure based on a human visual system model," Optical Engineering,
Vol. 28 No. 7, July 1989.
[7] Lucco, Steven. "A Dynamic Scheduling Method for Irregular
Parallel Programs," Proceedings of the Conference on Programming Language
Design and Implementation, ACM Sigplan, 1992.
[8] Sohail Zafar, Ya-Qin Zhang, and John S. Baras. "Predictive
Block-Matching Motion Estimation for TV Coding - Part I: Inter-Block Prediction,"
IEEE Transactions on Broadcasting, Vol. 37, No. 3, Sept. 1991.
[9] L.A. Rowe, J. Boreczky, and C. Eads. "Indexes for User Access
to Large Video Databases," Proceedings of IS&T/SPIE 1994 International
Symposium on Electronic Imaging: Science and Technology, San Jose, CA,
February 1994.
[10] C. Federighi and L.A. Rowe. "The Design and Implementation
of the UCB Distributed Video on Demand System," Proceedings of IS&T/SPIE
1994 International Symposium on Electronic Imaging: Science and Technology,
San Jose, CA, February 1994.
[11] S.J. Teller and C.H. Sequin. "Visibility preprocessing for interactive
walkthroughs," Computer Graphics, July 1991, Vol. 25, No. 4: 61-69.
[12] Frans Sijstermans and Jan van der Meer. "CD-I Full- Motion
Video Encoding on a Parallel Computer," Communications of the ACM, Vol.
34 No. 4, April 1991.
[13] ISO 10918-1 JPEG Draft International Standard
[14] Pixar, Inc., "Luxo Jr." (film), 1986
Footnotes
1. Each encoded macroblock is numbered (with
an offset from the previous encoded macroblock). If the increment is greater
than 1, then there are skipped macroblocks. The number of bits required
to encode the increment increases for larger increments.
2. We provide a mechanism for dealing with the situation
in which the input disk is not mounted using NFS. However, we will assume
for simplicity of discussion that all disks are mounted using NFS.