(Version 1.4, February 7, 2006)
Lawrence A. Rowe and Vince Casalaina
University of California, Berkeley
A revised version of this paper was published in IEEE Multimedia (October-December 2006). A PDF version of the paper is available at http://www.computer.org/portal/cms_docs_multimedia/multimedia/promo3.pdf. The published paper has more recent usage statistics and includes some discussion not included here. However, this web paper and the accompanying slide show include more details about the applications used to produce the webcast.
Many people in the ACM SIG Multimedia (SIGMM) community have requested that conference and workshop presentations be captured for on-demand replay. The goal is to publish this material in the ACM Digital Library alongside the papers published in the conference proceedings so researchers and practitioners can watch the presentations and discussions at the conference.
Many research groups and commercial organizations have developed technology to capture and stream presentations [Bianchi 1998, Mukhopadhyay 1999, Rowe 2001, Rui 2004, Steinmetz 2001]. However, presentation capture has been impractical for professional meetings, such as ACM conferences and workshops, because capturing and publishing the material has been too expensive. The typical cost of audio/video capture using conventional technology is $5,000-$20,000 per day, depending on the complexity of the event capture (e.g., multiple cameras, audience audio, etc.) and how the final product is produced (e.g., "live-to-tape" recording versus off-line editing).
This cost is impractical for many ACM conferences because the total budget for the event is between $50,000 and $200,000, of which approximately $5,000 is allocated to produce the proceedings. Moreover, much of the work required to produce the proceedings is donated by volunteers (e.g., authors, conference publication chairs, students, etc.). These conferences typically do not make a large profit, so it is impractical to add the cost of capturing presentations to the attendee registration fee.
The economics change for large conferences or special events (e.g., the annual ACM awards dinner). At a conference like SIGGRAPH, which attracts upwards of 10,000 attendees, it is possible to make a profit by charging a high price (e.g., $500) for a DVD that includes tutorials and conference presentations.
The only way a professional organization can capture all conference presentations is by using low-cost capture and production technologies. This paper describes a low-cost approach to presentation capture that produces high-quality material. The approach is similar to "live-to-tape" recording used in the broadcast industry, except rather than record to videotape, material is compressed and recorded to a computer disk. These media files do not require off-line editing or postproduction so the cost of authoring and publishing the material is significantly reduced. We used this new approach to capture presentations at NOSSDAV 2005 [NOSSDAV 2005]. The results of this experiment, including actual costs and what we learned during the process, are discussed in this paper.
The objective of the NOSSDAV experiment was to test the technology and procedures for presentation capture using low-cost equipment and production techniques. The total cost of the equipment used in the experiment, including audio, video, and computer equipment, was approximately $12,000. The production team included two people: a producer/director (Casalaina) and a production assistant (Rowe).
The cost of this capture was $3,100 for a two-day conference. However, this figure did not include the full equipment rental and personnel costs. Using fully-charged labor and equipment, we estimate that similar captures can be done for approximately $3,000 per day plus expenses (e.g., travel, room, and board). This estimate includes equipment rental, so organizations interested in capturing an event do not have to invest in expensive equipment or pay to configure it for presentation capture.
The presentations are encoded using industry standard codecs and file formats. They can be viewed using the QuickTime plug-in embedded in a webpage either by downloading or streaming the media file. The NOSSDAV 2005 presentations are available at the following webpage:
  http://bmrc.berkeley.edu/research/nossdav05/
More information on configuring your system to play the material is available at the webpage.
This paper describes the technology and process used to capture and publish these presentations. Section 2 describes the hardware and software. Section 3 describes the process, including acquiring performance releases, preparing for and setting up at the conference site, and capturing the event. Section 4 describes the post-capture production process, publication of the material, and viewing statistics. Section 5 discusses lessons learned from this experiment and changes we plan to make next time. Finally, Section 6 draws conclusions from this experiment and suggests future research. A slide show illustrating the entire process is also available.
This section describes the equipment configuration and strategy for capturing the presentations.
The basic idea is to capture audio, video, and graphics (i.e., RGB output from a presentation computer) and encode them into a QuickTime file that can be streamed or downloaded for on-demand replay. The challenge is to capture high-quality images of the projected presentation material inexpensively, no matter what technology the presenter uses. For example, speakers at NOSSDAV used laptops with different operating systems (e.g., Linux, Mac OS X, or Windows) and different presentation software (e.g., PowerPoint, Keynote, a Web browser, etc.). Most presentations used relatively static slides with various transition effects (e.g., slide-to-slide transitions and builds) and occasional animations to illustrate dynamic behavior. Some presenters used sound and video in their presentations, and one speaker included a live demonstration. Dynamic slides and live demonstrations are difficult to capture.
The conventional approach to lecture capture is to use one or two cameras focused on the speaker (e.g., a close-up of the speaker and a wide-angle shot of the stage) and a wireless microphone to capture the audio. Some productions use a third camera to capture audience members as they ask questions. We decided not to use a dedicated audience camera so that we did not have to ask audience members to sign releases, although the available cameras did capture many people asking questions. Audience sound was captured using the podium microphone.
A problem arises when you try to capture the graphics projected to the audience. Typically, the RGB signal from a computer is converted to a video signal either by scan converting the RGB signal or by pointing a camera at the projection screen. The video signal is then digitized and compressed using the same technology that captures video of the speaker. Both approaches have limitations because an RGB frame (e.g., SVGA at 800x600 or XGA at 1024x768) contains far more pixels than a video frame (e.g., 720x480 or 352x288). In addition, the RGB signal is progressive and updated 60 times/second, whereas video is interlaced and updated 30 times/second. Digitizing and compressing these images discards 50%-70% of the information in the image, which often makes the presentation material unreadable when played. Readability is not a problem if the speaker uses large fonts and limits the amount of text or image detail on each slide. In practice, however, most presenters do not follow these rules.
Another way to capture the presentation material is to acquire the presentation source files and create images that can be synchronized with the audio and video. This approach produces high-quality slide images, but it raises the cost of production unless speakers are constrained to a limited set of presentation packages and do not use dynamic material in their presentations (e.g., animations and live demonstrations). It is also difficult to get copies of all presentations because speakers are reluctant to give them to other people, and speakers from companies are often required to get approval before giving copies to others. A prior experiment capturing presentations at ACM Multimedia 2001 collected copies of the speakers' presentation material and authored the content off-line [SOMA 2002]. Only 30% of the speakers provided copies of their presentation material, so the final content was less useful. Consequently, it is easier to capture the slides during the event, where a normal performance release can cover the intellectual property. Copyright issues are discussed in more detail below.
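As a rough illustration of the resolution gap, the sketch below computes what fraction of an RGB frame's pixels can survive spatial resampling into a video frame, using the example sizes from the paragraph above. It deliberately ignores interlacing and compression, both of which discard further information, so it gives a lower bound on the loss rather than a reproduction of the 50%-70% figure.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height) in pixels

# Resolutions mentioned in the text.
SVGA: Size = (800, 600)
XGA: Size = (1024, 768)
VIDEO_720X480: Size = (720, 480)
VIDEO_352X288: Size = (352, 288)


def pixel_count(size: Size) -> int:
    width, height = size
    return width * height


def retained_fraction(source: Size, target: Size) -> float:
    """Upper bound on the share of source pixels a target frame can represent.

    Considers spatial resolution only; interlacing and lossy compression
    discard additional detail on top of this.
    """
    return min(1.0, pixel_count(target) / pixel_count(source))


if __name__ == "__main__":
    for name, src in (("SVGA", SVGA), ("XGA", XGA)):
        for dst in (VIDEO_720X480, VIDEO_352X288):
            kept = retained_fraction(src, dst)
            print(f"{name} -> {dst[0]}x{dst[1]}: keeps at most {kept:.0%} of the pixels")
```

For example, resampling XGA to 720x480 keeps at most 225/512 (about 44%) of the pixels, so more than half of the spatial detail is gone before compression even starts.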
The approach we used to capture presentation material is direct capture of the RGB signal. Image quality is substantially better and dynamic material is captured. An RGB capture board is relatively inexpensive ($1.5K - $2.5K), although integrating it with software to capture video and audio and to synchronize the results leads to higher costs. The device we used to capture the presentations is produced by NCast Corporation [NCast 2004].
Disclosure: The first author (Rowe) is a co-founder and investor in the company.
The NCast G2 contains an embedded computer that runs software to digitize audio and RGB signals and compress them using the MP3 audio and MPEG-4 video codecs, respectively. The G2 can webcast live media streams using either unicast or multicast Internet transmission and archive the material in an MP4 file, which can be played using the Apple QuickTime Player. Audio and video packets are multiplexed in the file with appropriate presentation timestamps so they are played back correctly. The G2 can be controlled using either an embedded web interface or a remote control interface over a TCP or RS-232 connection. Various parameters can be set to control image scaling, frame rate, and pre-compression image filtering, which influence media quality and file size. The retail cost of the G2 is $5,500.
Because the G2 captures RGB images, the NTSC camera video had to be converted to RGB and then combined with the presenter's RGB signal. We used the G2 with a Kramer VP-720DS seamless switcher, which accepts up to four video inputs (i.e., two S-Video and two composite signals) and one RGB input. The VP-720DS produces an RGB output signal selected from one of the video or RGB inputs. It scales the selected input to the specified output format and switches between input sources frame-accurately using a fast "fade through black" transition. The VP-720DS also provides a picture-in-picture (PIP) function that shows the RGB signal composed with one of the video signals, or one of the video signals composed with the RGB signal. That is, a small window with the second signal is overlaid on the first signal. The director can set the size and location of the small window. The retail cost of the VP-720DS is $1,595, although it is widely available for approximately $1,000.
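The G2 does its multiplexing internally, and the paper does not describe how. The following sketch only illustrates the general idea behind presentation timestamps: two streams, each already in timestamp order, are merged so that a player encounters audio and video packets in presentation order. The `Packet` type and the packet durations in the example are assumptions for illustration; a real MP4 writer records timing in the container's sample tables rather than in a simple merged list.

```python
import heapq
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(order=True)
class Packet:
    """Hypothetical media packet; ordering compares only the timestamp."""
    pts: float                                   # presentation timestamp, seconds
    stream: str = field(compare=False)           # "audio" or "video"
    payload: bytes = field(compare=False, default=b"")


def interleave(audio: Iterable[Packet], video: Iterable[Packet]) -> List[Packet]:
    """Merge two PTS-ordered streams into a single PTS-ordered sequence.

    heapq.merge is stable, so on equal timestamps the audio packet
    (first argument) is emitted before the video packet.
    """
    return list(heapq.merge(audio, video))


if __name__ == "__main__":
    audio = [Packet(i * 0.025, "audio") for i in range(4)]  # 25 ms audio frames (assumed)
    video = [Packet(i / 30, "video") for i in range(3)]     # 30 fps video
    for p in interleave(audio, video):
        print(f"{p.pts:.4f}s {p.stream}")
```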
The presentation captured is a single video stream that shows the speaker, the presentation material, or the presentation material with the speaker in a PIP window as shown in the following examples:
Figure: Speaker close-up from pan/tilt camera (left); graphics signal capture of presentation (center); PIP with speaker video embedded in graphics image (right).
The equipment configuration used during the capture is shown in the image to the right. The producer/director operated the wide-angle camera, an audio mixer to control sound levels, and two GUI applications running on a laptop that controlled the pan/tilt camera (Canon VCC4) and the capture and switching hardware (i.e., G2 and VP-720DS). Both cameras were routed through a preview monitor so the producer/director could set up the next camera shot before switching to it.
The house audio system provided a single audio signal that combined a wireless microphone used by the speaker, a wired podium microphone, and output from the presentation computer. The room was live enough that we were able to capture audience questions and speaker introductions from the podium microphone. We brought additional microphones to capture audience questions because we were uncertain about the venue audio system.
The control software was designed to be easy to use and to provide only the functions required for lecture capture. Our goal was to automate as much of the production process as possible. Two applications were used: 1) an application to control the Canon VCC4 pan/tilt/zoom (PTZ) camera, and 2) an application to control the NCast G2 and Kramer VP-720DS. The camera control application was a client/server system (vcc3client and vcc3d) developed at U.C. Berkeley for a Canon VCC3 for lecture webcasting and distributed collaboration (i.e., an Access Grid room). It provides buttons to smoothly pan, tilt, or zoom the camera at a user-configured speed and to set and recall up to six preset positions. The application has a simple interface with home, pan, tilt, and zoom buttons and a complex interface with the preset functions and access to other functions (e.g., setting movement speed). The user can switch back and forth between the two interfaces, which are shown in the two images below. The image on the left shows the simple interface and the image on the right shows the complex interface. During the NOSSDAV capture, the director used slow movement speeds to improve positioning accuracy and to allow camera moves to be captured live (e.g., to follow a speaker who paces). The six presets were used to frame frequently used shots (e.g., wide-angle view of the panel, the speaker, etc.).
Vcc3client and vcc3d are written in Open Mash, a toolkit built on Tcl/Tk to support the development of distributed streaming media applications. Vcc3d communicates with the camera by sending text commands over an RS-232 serial connection; it is 2,500 lines of Tcl. Vcc3client communicates with the server using Tcl-DP, a lightweight text-oriented RPC package similar to SOAP. The client application is 800 lines of Tcl/Tk.
Figure: Simple VCC3 control interface (left) and complex VCC3 control interface (right).
The conference control application (confcap) provided functions to control capture (e.g., start/resume, stop, and pause), select the video source (e.g., wide-angle camera, close-up camera, or RGB signal), control the PIP, and configure selected hardware properties such as the capture format size (e.g., VGA, SVGA, or XGA) and the display mode, which controlled the gamma setting used by the Kramer. The confcap interface is shown on the right. The application kept track of the talk number and recording time for each presentation and wrote a log entry whenever an event occurred (e.g., start recording, begin speaker introduction, begin presentation, etc.). We planned to remove the introductions from the final presentations so that viewers could jump directly to the talks, and we thought we could automate this by recording the timecode when the speaker began talking. This idea did not work in practice because the director was too busy framing shots and setting audio levels to mark the beginning of each presentation.
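Confcap's log format is not described, so the following is only a hypothetical sketch, in Python rather than the original Tcl/Tk, of a per-talk event log and of how a "begin presentation" mark could be turned into a trim offset for skipping the introduction. It falls back to an offset of zero when the mark is missing, which is what happened in practice when the director was too busy to record it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TalkLog:
    """Hypothetical per-talk event log (not confcap's actual format)."""
    talk_number: int
    # (seconds since recording started, event name), in the order marked
    events: List[Tuple[float, str]] = field(default_factory=list)

    def mark(self, seconds_since_start: float, event: str) -> None:
        self.events.append((seconds_since_start, event))

    def first(self, event: str) -> Optional[float]:
        """Time of the first occurrence of `event`, or None if never marked."""
        for t, name in self.events:
            if name == event:
                return t
        return None

    def trim_offset(self) -> float:
        """Seconds to skip so playback starts at the talk itself.

        Returns 0.0 (keep the introduction) if the start was never marked.
        """
        start = self.first("begin presentation")
        return start if start is not None else 0.0


if __name__ == "__main__":
    log = TalkLog(talk_number=4)
    log.mark(0.0, "start recording")
    log.mark(8.0, "begin speaker introduction")
    log.mark(52.5, "begin presentation")
    print(log.trim_offset())  # 52.5
```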
Confcap is coded in pure Tcl/Tk. It communicates with the G2 using text commands transmitted over a TCP connection and communicates with the VP-720DS using text commands transmitted over a serial connection. The program is approximately 1,100 lines of code.
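The G2 command set is not given here, so the sketch below shows only the general request/reply pattern confcap used with the G2: write a line of text over a TCP connection and read a one-line reply. It is written in Python rather than Tcl, and the CR/LF terminator, the single-line reply, and the command string in the example are assumptions, not NCast's documented protocol. The serial link to the VP-720DS would need a serial-port library instead of a socket.

```python
import socket


def send_command(sock: socket.socket, command: str, terminator: bytes = b"\r\n") -> str:
    """Send one text command and return the device's one-line reply.

    Assumes the device answers each command with a single line ending in LF.
    """
    sock.sendall(command.encode("ascii") + terminator)
    reply = bytearray()
    while not reply.endswith(b"\n"):
        chunk = sock.recv(1)  # one byte at a time: simple, never reads past the reply
        if not chunk:
            raise ConnectionError("device closed the connection before replying")
        reply += chunk
    return reply.decode("ascii").strip()


if __name__ == "__main__":
    # Stand-in for a real device: a connected socket pair whose "device"
    # side has already queued its reply.
    host_side, device_side = socket.socketpair()
    device_side.sendall(b"OK\r\n")
    print(send_command(host_side, "STATUS"))  # hypothetical command
    print(device_side.recv(64))               # what the device received
    host_side.close()
    device_side.close()
```

Against real hardware, the socket would come from `socket.create_connection((host, port), timeout=5.0)`, with the host and port taken from the device configuration.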
Presenters used laptop computers running different operating systems and different presentation software. Most presenters used SVGA or XGA display settings. We captured the material at the native image size and 30 frames per second, with the compressed bit rate capped at 1.5 Mbps. The conference was divided into 90-minute sessions that included three 15-minute presentations and a 30-minute discussion period. This format worked exceptionally well, as some of the most interesting results from the conference arose during these discussions. We captured the discussion periods as separate talks so they are included in the conference archive and easy to access.
This section describes preparation before the event and the logistics of setting up and tearing down the equipment to capture the presentations.
Good preparation for any live, remote event requires a complete system check beforehand. We assembled all the equipment, connected it with the required cables and switches, and tested the control software. This test was also used to train the director (Casalaina) to use the control software.
After the test, cables and equipment were labeled and packed into two hardshell travel cases. We planned to check the cases as baggage, so we researched the carrier's size and weight limits and weighed everything to make sure the cases would be accepted. To limit the number of cases, we chose to pack both cases over the free-baggage weight limit and paid excess baggage charges ($75 each way).
The conference was scheduled for Monday and Tuesday, so we arrived early Sunday afternoon to set up and test the equipment. The conference venue (Dolce Skamania Lodge outside Portland, OR) was very nice, and the A/V technical support (Bob Bottomley) was superb. Setup took approximately three hours.
During the event, the director operated the equipment and monitored the capture. We ran the audio signal through a small mixer so we could easily control sound levels, and the director used headphones to monitor the sound before capture. The display connected to the G2's RGB output shows the captured video, which allows the director to monitor the content, and it also shows the captured audio level as a sound meter. Hence, we were able to verify that the sound was intelligible and the captured audio signal was acceptable. The production assistant (Rowe) solicited performance releases from each speaker so we could publish the material, helped speakers with their RGB output settings, and tweaked the control software.
The performance release was approved by ACM before the event because we plan to publish the material in the ACM Digital Library. In our past experience, about 67% of presenters signed the release. The others declined either because they were uncertain whether they had the rights to material used in the talk (e.g., images from newspapers, video material, etc.) or because they worked for organizations that required releases to be signed by corporate lawyers. Surprisingly, everyone who gave a presentation at this conference signed the release, so we were able to publish all the material. Looking back at the program, we notice that nearly all speakers were from universities, which may explain their willingness to allow publication. Speakers from companies, whether or not from research laboratories, are less likely to authorize publication.
The G2 stores the captured media files on an internal disk. The material is encoded using MPEG-4 audio and video. We captured the opening introduction, presentations for the 33 accepted papers, the keynote address, and nine question-and-answer sessions, which produced 44 media files requiring 8.7 GB of disk space. Each 15-minute presentation is approximately 170 MB. After the event, we copied the files to a USB disk as a backup in case there was a problem on the trip home.
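The 1.5 Mbps cap on the compressed bit rate and the file sizes above can be cross-checked with simple arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the text does not state.

```python
def max_file_size_mb(bitrate_bps: float, minutes: float) -> float:
    """Largest file (decimal MB) a capture capped at `bitrate_bps` can produce."""
    return bitrate_bps * minutes * 60 / 8 / 1e6


def min_duration_hours(total_bytes: float, bitrate_bps: float) -> float:
    """Shortest total duration that could fill `total_bytes` at the capped rate."""
    return total_bytes * 8 / bitrate_bps / 3600


if __name__ == "__main__":
    cap = 1.5e6                                      # 1.5 Mbps cap on the captured stream
    print(max_file_size_mb(cap, 15))                 # 168.75, i.e. about 170 MB per talk
    print(round(min_duration_hours(8.7e9, cap), 1))  # 12.9 hours for 8.7 GB
```

A 15-minute talk at the cap comes to about 169 MB, matching the roughly 170 MB observed. Because the cap is a maximum, 8.7 GB implies at least about 12.9 hours of material; the 14 hours cited in the transcoding discussion corresponds to an average rate of about 1.38 Mbps, somewhat below the cap.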
It took approximately one hour to tear down the equipment and repack it in the hardshell cases for transportation home.
This section describes the postproduction process and the system used to publish the material.
The G2 produces files that can be played by the QuickTime Player. We set up a Darwin Streaming Server (DSS) on a FreeBSD PC at the University of California at Berkeley and loaded the captured files onto it. We then played the material on various Windows and Macintosh PCs from different locations, including high-speed connections at the university and broadband connections at home. The captured material did not play well for two reasons:
Consequently, we decided to transcode the material so that more people could play it. We had trouble finding an inexpensive software package that could transcode the files and run on the Windows 2000 PC available for postprocessing. We found several packages that appeared to work, but they cost between $400 and $1,000 (e.g., Sorenson Squeeze, Adobe Premiere Pro, Apple Final Cut Pro, and Adobe After Effects Pro). While we were searching for the best alternative, Apple released QuickTime V7 Pro, which included the required transcoding functionality. It ran on a PC and cost only $30.
After experimenting with various settings, we decided to publish two versions of each presentation: a lower-quality version that can be played anywhere and a higher-quality version for people with fast network connections and computers. The lower-quality version uses 384x256 images at 15 fps and 600 Kbps, and the higher-quality version uses 512x384 images at 15 fps and 1,200 Kbps. We used H.264 video for the published material because it was supported by the transcoding software and produced better results than the MPEG-4 video codec.
Transcoding all the material was time consuming: producing the 600 Kbps and 1,200 Kbps versions required 3X and 9X real time, respectively. The H.264 codec in QuickTime V7 Pro has one- and two-pass encoders. We used the one-pass encoder, even though its results were not as good as those of the two-pass encoder, because the two-pass encoder required 40X real time to transcode a file. We had 14 hours of material, so transcoding took approximately 170 hours.
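One way to compare the two published profiles is bits per pixel per frame, i.e., bit rate divided by (width x height x frame rate). This metric is our addition, not something the paper reports, and it ignores codec behavior, but it shows that both profiles give each pixel the same budget: the 1,200 Kbps version spends its extra bandwidth on resolution rather than on per-pixel fidelity.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    name: str
    width: int
    height: int
    fps: int
    bitrate_bps: int


# The two published encodings described in the text.
PROFILES = [
    Profile("low", 384, 256, 15, 600_000),
    Profile("high", 512, 384, 15, 1_200_000),
]


def bits_per_pixel(p: Profile) -> float:
    """Average compressed bits available per pixel per frame."""
    return p.bitrate_bps / (p.width * p.height * p.fps)


if __name__ == "__main__":
    for p in PROFILES:
        print(f"{p.name}: {p.width}x{p.height} @ {p.fps} fps, {bits_per_pixel(p):.3f} bits/pixel")
```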
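The 170-hour figure follows from the real-time factors above: each hour of source material costs 3 hours for the 600 Kbps version plus 9 hours for the 1,200 Kbps version. The small helper below makes the estimate explicit; treating the 40X two-pass figure as applying to a single version is our assumption, since the text does not say which output it refers to.

```python
from typing import Iterable


def transcode_hours(material_hours: float, realtime_factors: Iterable[float]) -> float:
    """Total encoder time when every output is transcoded from the full material.

    A factor of 3 means one hour of material takes three hours to encode.
    """
    return material_hours * sum(realtime_factors)


if __name__ == "__main__":
    print(transcode_hours(14, [3, 9]))  # one-pass, both versions: 168 hours (about 170)
    print(transcode_hours(14, [40]))    # two-pass at 40X, one version only: 560 hours
```

Even a single two-pass version would have taken more than three weeks of continuous encoding on one machine, which explains the choice of the one-pass encoder.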
We produced webpages to play the material including a listing of all talks and popup windows to play each talk. It took some effort, but eventually we were able to get the HTML to work correctly on all web browsers using the embedded QuickTime Player.
The conference was held June 13-14, 2005, and we published the material on September 1, 2005. Its availability was announced on the SIGMM website and by email to all conference participants. In the first two months, the material was successfully played 120 times, roughly 80% of all requests. The 30 failures were logged as server timeout errors, which we believe were caused by users trying to play the material on a computer behind a firewall or NAT router using RTSP rather than HTTP transport. This problem is discussed in the next section. These counts exclude plays by the site producer during development and testing, so they reflect only other users interested in the material or the technology.
Of the successful plays, 33% used the high-speed version (i.e., 1,200 Kbps) and 65% used the low-speed version (i.e., 600 Kbps). The remaining 2% played the audio-only talk. We were surprised that more people did not play the high-speed version, since we expected most viewers to be at universities, which typically have high-speed connections to other universities.
Each talk and Q&A session was played between 0 and 13 times, with a mean of 2.8 plays (standard deviation 2.8). Surprisingly, eight talks and one Q&A session were never played. The following table shows the most popular talks.
| Presentation | Speaker | Number of Plays |
|---|---|---|
| "Supporting P2P Gaming When Players Have Heterogeneous Resources" | Brian Neil Levine (University of Massachusetts) | 13 |
| "MOPAR: A Mobile Peer-to-Peer Overlay Architecture for Interest Management of Massively Multiplayer Online Games" | Son Vuong (University of British Columbia) | 9 |
| "Weather Forecasting - Predicting Performance for Streaming Video over Wireless LANs" | Robert Kinicki (Worcester Polytechnic Institute) | 7 |
| "Meeting CPU Constraints by Delaying Playout of Multimedia Tasks: An Analytical Framework" | Balaji Raman (National University of Singapore) | 7 |
Table: Most frequently played talks
The most popular talk was the first talk at the conference. It therefore appears first on the program webpage, which probably leads to more plays. The most popular Q&A session was the one following the "Network Gaming" session, which was played five times.
The material has not been played as many times as we had hoped. The transport problem with the embedded QuickTime player, discussed below, is part of the explanation, but further analysis is needed to determine why more people are not playing the material.
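As a starting point for that analysis, the following Python sketch computes the kinds of statistics reported above from a list of play requests. The `(talk_id, variant, succeeded)` record format is hypothetical; real Darwin Streaming Server access logs would first have to be parsed into it. Talks with no plays must be listed explicitly, or the mean would be overstated; the paper does not say whether its standard deviation is the population or sample form, so the sketch assumes the population form.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

# (talk_id, variant, succeeded): a hypothetical, pre-parsed log record
Request = Tuple[str, str, bool]


def summarize(requests: Sequence[Request], all_talks: Sequence[str]) -> Dict[str, object]:
    """Success rate, variant shares, and per-talk play counts.

    `all_talks` must list every published item (non-empty) so that
    never-played talks count as zero plays.
    """
    plays = [(talk, variant) for talk, variant, ok in requests if ok]
    per_talk = Counter(talk for talk, _ in plays)
    per_variant = Counter(variant for _, variant in plays)
    counts: List[int] = [per_talk.get(t, 0) for t in all_talks]
    n = len(plays)
    return {
        "success_rate": n / len(requests) if requests else 0.0,
        "variant_share": {v: c / n for v, c in per_variant.items()} if n else {},
        "mean_plays": mean(counts),
        "stdev_plays": pstdev(counts),
        "never_played": [t for t, c in zip(all_talks, counts) if c == 0],
    }


if __name__ == "__main__":
    # Synthetic example data, not the NOSSDAV logs.
    reqs = [("t1", "600", True)] * 3 + [("t2", "1200", True), ("t1", "600", False)]
    print(summarize(reqs, ["t1", "t2", "t3"]))
```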
This section discusses lessons we learned while doing this experiment. We begin with a discussion of what worked well and follow that with a discussion of the changes we would make next time.
Several things worked exceptionally well. First, the hardware, including the Kramer VP-720DS and the NCast G2, performed well, and the low-cost model for capture and publication worked. As mentioned above, we believe a single-track conference can be captured and published for approximately $3,000 a day plus expenses. This price will be higher if additional equipment is used as discussed below, but it should still be well under $5,000 per day. Second, we were able to capture all presentations regardless of the slide and computer technology the speakers used. We believe the published material is of reasonable quality given the constraints on playback (i.e., network bandwidth and computer processing power). Moreover, a definite advantage of the NCast approach to RGB capture is that dynamic material (e.g., animations and demonstrations) was captured along with traditional static slides.
Nevertheless, as in any production, we could have done better. The discussion about improvements is organized around four ideas, namely, improving the quality of the captured material, improving the process used to capture and publish the material, improving the software for controlling event capture, and improving the usability of the published material.
Improving Quality. Generally speaking, the captured material is of good quality, but it can be improved. First, we captured the material at 30 FPS using the native resolution of the presenter's projected material if it was XGA or smaller, and at XGA resolution if it was larger. Although it reduces visual quality, a lower-resolution capture (e.g., SVGA) at 15 FPS is good enough given the constraints of current playback technology (e.g., broadband networks and typical computers).
Scaling higher-resolution images to SVGA and applying typical video coding algorithms produced some "ringing" around text on the slides (i.e., ghost edges around the characters). Because modern computers are exceptionally good at displaying material at different resolutions, we should, where possible, encourage presenters to project their material at a lower resolution. The need to scale images down stems from the bandwidth available for transmitting the material during playback and the decoding performance of the playback computer. Over time, these constraints will be relaxed and larger images can be captured at higher frame rates.
Second, audio capture can always be improved. Some speakers did not use the wireless microphone. Audio capture was good if they stayed at the podium, but sometimes they strayed from the podium or turned toward the screen, which degraded the audio. The obvious remedy is to require speakers to use the wireless microphone. Audio capture of audience questions must also be improved because they were sometimes difficult to hear. We expected the podium microphone to pick up most audience questions, which it did, but some audience members did not speak loudly, and it was difficult for the director to adjust the sound level quickly during exchanges between the audience and the speaker. We should have used several microphones pointed at the audience and controlled separately at the mixer. We also had only one wireless microphone; it would be better to have several so the session moderator always has one and the next speaker can get ready before it is time to talk.
Third, we overlooked a small detail in the placement of the PIP window showing the speaker that subtly affected the final image. The two images below show examples from the captured material. The image on the left shows the PIP window in the lower right corner. The speaker stands to the left of the projection screen as you face the stage, so when he or she points to the screen, the speaker appears to be looking elsewhere on your screen rather than at the presentation slide. The image on the right shows the result after we moved the PIP window to the lower left corner. Now, when the speaker looks or gestures toward the screen, the captured image shows the speaker looking or gesturing toward the slide.
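The capture-size rule described above, and the lower-resolution, lower-frame-rate alternative, can be written as a small policy function. This is a hypothetical Python sketch; it only picks the target size and frame rate and leaves the actual scaling (and any aspect-ratio handling, which the text does not describe) to the capture hardware.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height) in pixels

XGA: Size = (1024, 768)
SVGA: Size = (800, 600)


def capture_format(native: Size, max_size: Size = XGA, fps: int = 30) -> Tuple[Size, int]:
    """Capture at the presenter's native size unless it exceeds max_size.

    Defaults mirror the NOSSDAV setup (XGA cap, 30 fps); passing SVGA and 15
    gives the lower-resolution alternative suggested in the text.
    """
    width, height = native
    if width <= max_size[0] and height <= max_size[1]:
        return native, fps
    return max_size, fps


if __name__ == "__main__":
    print(capture_format((800, 600)))                          # ((800, 600), 30)
    print(capture_format((1280, 1024)))                        # ((1024, 768), 30)
    print(capture_format((1024, 768), max_size=SVGA, fps=15))  # ((800, 600), 15)
```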
Figure: PIP window on right (left image) and PIP window on left (right image).
Fourth, we had a minor problem positioning the RGB image on the projection screen and in the capture. The projector in the room had a remote control to move the image left/right or up/down, but we did not notice the misalignment during testing. As a result, the RGB images in the first couple of talks were shifted up and to the right when captured, which led to video noise across the bottom of the captured images. Both the Kramer VP-720DS and the NCast G2 have controls to move the image, but our control software did not provide access to them. This problem can be easily fixed.
A fifth way to improve the quality of the captured material is to use more cameras and give the director more control. We did not include an audience camera at the front of the room because we did not have an extra camera. We will add one in the future, as long as audience members do not object. We will also use PTZ cameras for all sources rather than a manual camera because they will simplify operation for the director.
Lastly, we needed a spotlight on the speaker. Wide-angle views of the stage were unusable when slides were being projected because the bright light reflected off the screen caused the camera's automatic exposure to close the aperture, producing a dark image in which the speaker was hard to see. A good spotlight on the speaker would fix this problem.
Improving Process. Several changes can be made to improve the event capture and material publication process. First, a preconfigured custom hardshell case for the production equipment (e.g., NCast G2, Kramer VP-720DS, audio mixer, control computer, etc.) would greatly simplify preparation before an event and setup at the remote location. The case can essentially be a small rack for the equipment that can dampen sound from various fans and incorporate cooling and access to wiring. These cases are relatively inexpensive and many companies will custom-design them for a specific application.
Second, we need a smaller preview monitor rather than the 8" color monitor we used for this production. Several small LCD monitors can be rackmounted so that the video cameras can be previewed simultaneously and the captured program can be monitored. Rack mounting them in the travel case will also simplify setup and teardown.
Lastly, the postproduction and publication process can be substantially improved. Because this was the first time the production package was used in a conference setting, it took almost two months to publish the captured material. Some of the delay was caused by determining the best playback encoding and transcoding the material. We also had to set up the media server and author webpages for the conference program and individual presentations. Most of this work can be automated.
During the event, we spent a lot of energy keeping track of the speakers and which recorded file corresponded to each talk. The NCast G2 identifies each talk by encoding the date and time the capture began into the filename. We had to copy the captured material off the NCast G2 by hand and then used scripts to produce the webpages automatically from the files and information about the talks (e.g., title, authors, affiliation, speaker, talk duration, start time, etc.). This step can easily be automated by entering the conference program ahead of time and relating it to the capture files. Moreover, the NCast G2 interface to the FTP server could be opened up so that the entire postproduction process can be automated.
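Relating capture files to a program entered ahead of time could be as simple as parsing the capture start time from each filename and matching it to the nearest scheduled start. The filename pattern below is a made-up placeholder, not the G2's actual naming scheme, and the 10-minute tolerance is arbitrary; because sessions often run late, unmatched files are returned as None for a person to resolve rather than guessed.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Hypothetical pattern, e.g. "2005-06-13_09-15-02.mp4"; adjust to the real one.
FILENAME_FORMAT = "%Y-%m-%d_%H-%M-%S.mp4"


def capture_start(filename: str) -> datetime:
    return datetime.strptime(filename, FILENAME_FORMAT)


def match_captures(
    filenames: List[str],
    program: List[Tuple[datetime, str]],  # (scheduled start, title); must be non-empty
    tolerance: timedelta = timedelta(minutes=10),
) -> Dict[str, Optional[str]]:
    """Map each capture file to the program item whose start is nearest in time."""
    matches: Dict[str, Optional[str]] = {}
    for name in filenames:
        started = capture_start(name)
        scheduled, title = min(program, key=lambda item: abs(item[0] - started))
        matches[name] = title if abs(scheduled - started) <= tolerance else None
    return matches


if __name__ == "__main__":
    program = [
        (datetime(2005, 6, 13, 9, 0), "Opening"),
        (datetime(2005, 6, 13, 9, 15), "Talk 1"),
    ]
    files = ["2005-06-13_09-01-30.mp4", "2005-06-13_09-16-05.mp4", "2005-06-13_14-00-00.mp4"]
    for name, title in match_captures(files, program).items():
        print(name, "->", title)
```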
Improving Usability. Numerous changes can be made to the control software we used to capture the event. Some changes relate to issues discussed in the preceding paragraphs (e.g., entering the conference program before the event and using different equipment). Other changes are discussed in the following paragraphs.
First, the PIP interface needs to be improved. The control software needs a simple configuration interface that lets the director change the PIP location (i.e., bottom left or bottom right) to suit the spatial layout of the conference venue. This feature will not be difficult to add since the Kramer remote control interface provides the function. One problem with the Kramer, however, is that its serial control interface lacks a command to swap the PIP and main window sources. The VP-720DS provides this function through its on-screen control interface, but we could not figure out how to execute it remotely, even when we tried to mimic the on-screen control operations. This limitation caused a real problem in the production because several times the director wanted to swap the PIP and main window images. To do so, he had to turn off the PIP window, switch to the alternative source, and turn the PIP window back on. This sequence of actions was distracting and time consuming.
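Until a native swap command is available, the control software could at least package the manual workaround as a single button. The sketch below is hypothetical throughout: the command strings are placeholders, not the Kramer VP-720DS serial protocol (which is documented in the device manual), and the send callable stands in for whatever writes to the serial port. It also reassigns the PIP source explicitly, which may or may not be necessary on the real device.

```python
from typing import Callable

# Placeholder command strings; NOT the actual Kramer VP-720DS serial protocol.
PIP_OFF = "PIP_OFF"
PIP_ON = "PIP_ON"


def select_main(source: int) -> str:
    return f"MAIN_SOURCE {source}"


def select_pip(source: int) -> str:
    return f"PIP_SOURCE {source}"


class Switcher:
    """Tracks which input feeds the main window and which feeds the PIP window."""

    def __init__(self, send: Callable[[str], None], main: int, pip: int):
        self.send = send  # hypothetical: writes one command to the switcher
        self.main = main
        self.pip = pip

    def swap(self) -> None:
        """Swap main and PIP sources in one action, replaying the manual steps."""
        self.send(PIP_OFF)
        self.send(select_main(self.pip))
        self.send(select_pip(self.main))
        self.send(PIP_ON)
        self.main, self.pip = self.pip, self.main
```

Keeping the current source assignments in the controller lets the director swap with one keystroke, without remembering which camera is in which window.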
Second, the PTZ camera control software should be rewritten. As mentioned above, the software was written for a Canon VCC3. The Canon VCC4 has more functions (e.g., variable-speed moves) that can be exploited to improve the captured images. The VCC3 has manual iris and focus controls, but we could not get them to work in emulation mode on the VCC4. Presumably these controls work through the native VCC4 interface.
And lastly, the camera control software had six presets that allowed the director to pick standard positions for common shots (e.g., wide-angle stage view, speaker close-up, panel, etc.). These presets fixed the camera in three dimensions (i.e., pan, tilt, and zoom). Sometimes the director wanted a preset that was a defined movement from the current camera position (e.g., pan right a fixed amount and tilt up). This feature was needed when shooting the panel because several people were sitting at a table in the front of the room, and there were not enough presets to define a shot for each speaker. It would be easy to define one fixed position plus two relative movements, left and right, that step from one panelist to the next. And, while six presets were enough for one talk or panel, it would be nice to have several groups of six defined so that the director can select a group rather than having to redefine the presets between talks and panel sessions.
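The two preset ideas above can be captured in a small data model: an absolute preset always returns the same pan/tilt/zoom, while a relative preset adds an offset to the current position, and presets are organized into selectable groups of six. This is a minimal sketch; the class names and units are hypothetical and do not describe the existing control software or the Canon camera protocol.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Position:
    pan: float   # hypothetical units, e.g. degrees
    tilt: float
    zoom: float


@dataclass(frozen=True)
class AbsolutePreset:
    """Fixes the camera at one pan/tilt/zoom position."""
    target: Position

    def apply(self, current: Position) -> Position:
        return self.target


@dataclass(frozen=True)
class RelativePreset:
    """Moves the camera by a fixed offset from wherever it currently points."""
    d_pan: float = 0.0
    d_tilt: float = 0.0
    d_zoom: float = 0.0

    def apply(self, current: Position) -> Position:
        return Position(current.pan + self.d_pan,
                        current.tilt + self.d_tilt,
                        current.zoom + self.d_zoom)


Preset = Union[AbsolutePreset, RelativePreset]


class PresetBank:
    """Named groups of up to six presets; the director selects one group per session."""
    SLOTS = 6

    def __init__(self, groups: Dict[str, List[Preset]]):
        if not groups:
            raise ValueError("at least one preset group is required")
        for name, presets in groups.items():
            if len(presets) > self.SLOTS:
                raise ValueError(f"group {name!r} has more than {self.SLOTS} presets")
        self.groups = groups
        self.active = next(iter(groups))  # first group is active initially

    def select_group(self, name: str) -> None:
        if name not in self.groups:
            raise KeyError(name)
        self.active = name

    def recall(self, slot: int, current: Position) -> Position:
        """Return the position the camera should move to for this slot."""
        return self.groups[self.active][slot].apply(current)
```

For a panel group, slot 0 could frame the center of the table and slots 1 and 2 could pan one seat width left or right, so the director can step along the table with repeated presses.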
Improving Playback. We used the QuickTime Player embedded in a web page to play the recorded material on demand. As mentioned above, the Apple Darwin Streaming Server (DSS) running on a FreeBSD computer served the material. Numerous users had problems playing the material. Generally speaking, it worked well on Macintoshes running Mac OS X and the Apple Safari web browser. While users were able to play the material on Windows computers and with other browsers (e.g., Firefox and Microsoft IE), most had problems with the streaming transport. Users have no patience for configuring software. This material must work like television: turn it on (i.e., go to a web page) and it works.
The embedded QuickTime Player can transport content using either RTSP or HTTP streaming. Given the state of the Internet today, nearly everyone must use HTTP streaming because of firewalls and NAT routers. However, the player uses RTSP streaming by default, so the user has to change the transport setting manually. Most users, including experienced computer scientists, are confused by this requirement even though the web pages we produced describe the problem and its symptoms and explain how to change the setting. Unfortunately, the embedded QuickTime Player does not allow this setting to be specified in the HTML code. Consequently, nearly everyone who tries to play the material has problems.
Moreover, a recent release of the QuickTime software for Windows (version 7.0.3) exacerbated this problem. Before this release, the user could set the transport to use HTTP streaming on port 8000. This release, however, does not allow the user to change the port (i.e., it must use the default, port 80). This restriction, or more likely bug, causes problems because our DSS server runs on the same machine as a web server, which already uses port 80. We fixed the problem by explicitly including the port number in the RTSP URL used to launch playback. This solution appears to work with the default transport setting because port 8000 is, by default, used for HTTP streaming. However, we did not notice the problem for some time because no one sent email reporting difficulty playing the material. The server logs show that people simply stopped trying to play the material.
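The fix amounts to rewriting every playback URL so that its network location names port 8000 explicitly. A minimal sketch, assuming the streaming server listens on that port as described above; the function name and the example host media.example.org are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_explicit_port(url: str, port: int = 8000) -> str:
    """Return an rtsp:// URL whose network location names `port` explicitly.

    Any port already present is replaced; any user:password part is dropped.
    """
    parts = urlsplit(url)
    if parts.scheme != "rtsp":
        raise ValueError("expected an rtsp:// URL")
    host = parts.hostname
    if host is None:
        raise ValueError("URL has no host")
    if ":" in host:          # IPv6 literals need brackets in the netloc
        host = f"[{host}]"
    netloc = f"{host}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, rtsp://media.example.org/talk1.mov becomes rtsp://media.example.org:8000/talk1.mov. Applying the rewrite in the page-generation scripts, rather than by hand, keeps every published link consistent.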
An experiment was conducted to test RGB capture and a low-cost capture-to-disk model for capturing conference presentations. All talks and panel discussions at a two-day conference were captured and published for on-demand playback on the web. While the time for postproduction was longer than estimated because the wrong representation was captured during the event, it appears future conferences can be captured and published for less than $5,000 a day. And, this cost will be further reduced with more experience.
Many lessons were learned about the process and the material being captured. If this technique is used again, continuous refinement should reduce the effort required to capture and publish the material and improve the quality of the result.
[Bianchi 1998] M. Bianchi, "Autoauditorium: a fully automatic, multi-camera system to televise auditorium presentations," Proc. Joint DARPA/NIST Smart Spaces Technology Workshop, Gaithersburg, Maryland, July 1998.
[Mukhopadhyay 1999] Sugata Mukhopadhyay and Brian C. Smith, "Passive Capture and Structuring of Lectures," Proceedings of the Seventh ACM Multimedia Conference, Orlando, FL, November 1999.
[NOSSDAV 2005] Wu-chi Feng and Ketan Mayer-Patel (Eds.), Proceedings of the 15th International Workshop on Network and Operating Systems Support for Digital Audio and Video, http://portal.acm.org/toc.cfm?id=1065983, Stevenson, Washington, June 2005. Conference presentations, http://bmrc.berkeley.edu/research/nossdav05/.
[Rowe 2001] L.A. Rowe et al., "BIBS: A Lecture Webcasting System," BMRC Technical Report, http://bmrc.berkeley.edu/research/publications/bibs-report.html, June 2001.
[Rui 2004] Yong Rui et al., "Automating Lecture Capture and Broadcast: Technology and Videography," Multimedia Systems Journal, Springer-Verlag, 2004.
[SOMA 2002] SOMA Media, "ACM Multimedia 2001: Conference Presentations," DVD, November 2002.
[Steinmetz 2001] Arnd Steinmetz and Martin Kienzle, "The e-Seminar Lecture Recording and Distribution System," Proceedings of SPIE Vol. 4312 (Multimedia Computing and Networking 2001 (MMCN)), San Jose, CA, January 2001.