QuickTime VR currently includes two different types of
movies: panoramic and object.
4.1.1 The Panoramic Movie
Conventional QuickTime movies are one-dimensional
compressed sequences indexed by time. Each QuickTime movie
may have multiple tracks. Each track can store a type of linear
media, such as audio, video, text, etc. Each track type may have
its own player to decode the information in the track. The
tracks, which usually run parallel in time, are played
synchronously with a common time scale. QuickTime allows
new types of tracks and players to be added to extend its
capabilities. Refer to [24] and [25] for a detailed description of
the QuickTime architecture.
Panoramic movies are multi-dimensional event-driven
spatially-oriented movies. A panoramic movie permits a user to
pan, zoom and move in a space interactively. In order to retrofit
panoramic movies into the existing linear movie framework, a
new panoramic track type was added. The panoramic track
stores all the linking and additional information associated
with a panoramic movie. The actual panoramic images are
stored in a regular QuickTime video track to take advantage of
the existing video processing capabilities.
An example of a panoramic movie file is shown in figure 3.
The panoramic track is divided into three nodes. Each node
corresponds to a point in a space. A node contains information
about itself and links to other nodes. The linking of the nodes
forms a directed graph, as shown in the figure. In this example,
Node 2 is connected to Node 1 and Node 3, which has a link to
an external event. The external event allows custom actions to
be attached to a node.
Figure 3. A panoramic movie layout and its corresponding node graph.
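The node graph described above can be modeled minimally as follows. This is an illustrative sketch only, not the actual panoramic track format; the class shape and field names are assumptions.

```python
# Minimal sketch of a panoramic movie's node graph (illustrative names,
# not the actual QuickTime VR track layout).

class Node:
    def __init__(self, node_id, start_time):
        self.node_id = node_id        # identifies the node
        self.start_time = start_time  # locates the node's images in the video tracks
        self.links = {}               # hot spot id -> another Node or an event name

    def link_to(self, hot_spot_id, target):
        self.links[hot_spot_id] = target

# Build the example graph of figure 3: Node 2 is connected to
# Node 1 and Node 3; Node 3 has a link to an external event.
n1, n2, n3 = Node(1, 0), Node(2, 10), Node(3, 20)
n2.link_to(1, n1)
n2.link_to(2, n3)
n3.link_to(1, "external_event")
```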
The nodes are stored in three tracks: one panoramic track
and two video tracks. The panoramic track holds the graph
information and pointers to the other two tracks. The first
video track holds the panoramic images for the nodes. The
second video track holds the hot spot images and is optional.
The hot spots are used to identify regions of the panoramic
image for activating appropriate links. All three tracks have
the same length and the same time scale. The player uses the
starting time value of each node to find the node's
corresponding panoramic and hot spot images in the other two
tracks.
The hot spot track is similar to the hit test track in the
Virtual Museum [3]. The hot spots are used to activate events or
navigation. The hot spot image encodes the hot spot id
numbers as colors. However, unlike the Virtual Museum where a
hot spot needs to exist for every view of the same object, the
hot spot image is stored in panoramic form and is thereby
orientation-independent. The hot spot image goes through the
same image warping process as the panoramic image.
Therefore, the hot spots will stay with the objects they attach
to no matter how the camera pans or zooms.
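Because the hot spot image is stored in panoramic (cylindrical) form, picking reduces to mapping the current view direction to a pixel of that image. A minimal sketch under assumed conventions (the vertical field-of-view parameter and the cylindrical mapping details are illustrative; the real player picks through the same warp used for display):

```python
import math

def hot_spot_at(hot_spot_image, pan_deg, tilt_deg, vfov_deg=100.0):
    """Sample the hot spot id under a view direction (illustrative sketch).
    hot_spot_image: 2D list of 8-bit ids in cylindrical (panoramic) form."""
    height = len(hot_spot_image)
    width = len(hot_spot_image[0])
    # Horizontal position wraps around the full 360-degree cylinder.
    x = int((pan_deg % 360.0) / 360.0 * width) % width
    # Vertical position on a cylinder is proportional to tan(tilt),
    # normalized by the half field-of-view (an assumed convention).
    half = math.tan(math.radians(vfov_deg / 2.0))
    v = math.tan(math.radians(tilt_deg)) / half   # -1 .. 1 within the FOV
    y = int((1.0 - v) / 2.0 * (height - 1))
    y = max(0, min(height - 1, y))
    return hot_spot_image[y][x]
```

Because the lookup is done in panoramic coordinates, the same id is returned for an object no matter how the view has been panned or zoomed.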
The panoramic and the hot spot images are typically diced
into smaller frames when stored in the video tracks for more
efficient memory usage (see 4.2.1 for detail). The frames are
usually compressed without inter-frame compression (e.g.,
frame differencing). Unlike linear video, the panoramic movie
does not have an a priori order for accessing the frames. The
image and hot spot video tracks are disabled so that a regular
QuickTime movie would not attempt to display them as linear
videos. Because the panoramic track is the only one enabled,
the panoramic player is called upon to traverse the contents of
the movie at playback time.
The track layout does not need to be the same as the
physical layout of the data on a storage medium. Typically, the
tracks should be interleaved when written to a slow medium,
such as a CD-ROM, to minimize the seek time.
4.1.2 The Object Movie
An object movie typically contains a two-dimensional
array of frames. Each frame corresponds to a viewing direction.
The movie has more than two dimensions if multiple frames are
stored for each direction. The additional frames allow the object
to have time-varying behavior (see 4.2.2). Currently, each
direction is assumed to have the same number of frames.
The object frames are stored in a regular video track.
Additional information, such as the number of frames per
direction and the numbers of rows and columns, is stored with
the movie header. The frames are organized to minimize the
seek time when rotating the object horizontally. As in the
panoramic movies, there is no inter-frame compression for the
frames since the order of rotation is not known in advance.
However, inter-frame compression may be used for the multiple
frames within each viewing direction.
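The layout just described (the frames of one horizontal orbit stored contiguously, with the animation frames of each direction adjacent) can be sketched as a simple index computation. The function name and argument order are illustrative assumptions:

```python
def object_frame_index(row, col, t, num_cols, frames_per_direction):
    """Index of a frame in an object movie's video track (a sketch).
    Frames are organized to minimize seeks when rotating horizontally:
    all directions of a row are contiguous, and the frames_per_direction
    animation frames of each direction are stored adjacently."""
    return (row * num_cols + col) * frames_per_direction + t
```

With 10-degree increments (36 columns per row), stepping `col` walks through adjacent frames, which is why horizontal rotation needs no long seeks.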
4.2 The Interactive Environment
The interactive environment currently consists of two types
of players: the panoramic player and the object player.
4.2.1 The Panoramic Player
The panoramic player allows the user to perform continuous
panning in the vertical and the horizontal directions. Because
the panoramic image has less than 180 degrees vertical
field-of-view, the player does not permit looking all the way up or
down. Rotating about the viewing direction is not currently
supported. The player performs continuous zooming through
image magnification and reduction as mentioned previously. If
multiple levels of resolution are available, the player may
choose the right level based on the current memory usage, CPU
performance, disk speed and other factors. Multiple level
zooming is not currently implemented in QuickTime VR.
Figure 4. Panoramic display process.
The panoramic player allows the user to control the view
orientation and displays a perspectively correct view by
warping a panoramic image. Figure 4 shows the panoramic
display process. The panoramic images are usually compressed
and stored on a hard disk or a CD-ROM. The compressed image
needs to be decompressed to an offscreen buffer first. The
offscreen buffer is generally smaller than the full panorama
because only a fraction of the panorama is visible at any time.
As mentioned previously, the panoramic image is diced into
tiles. Only the tiles overlapping the current view orientation
are decompressed to the offscreen buffer. The visible region on
the offscreen buffer is then warped to display a correct
perspective view. As long as the region moves inside the
offscreen buffer, no additional decompression is necessary. To
minimize the disk access, the most recent tiles may be cached
in the main memory once they are read. The player also
performs pre-paging to read in adjacent tiles while it is idle to
minimize the delay in interactive panning.
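The caching and pre-paging strategy can be sketched as a small least-recently-used tile cache. The class shape, the capacity, and the neighbor-only pre-paging policy are illustrative assumptions, not the player's actual implementation:

```python
from collections import OrderedDict

class TileCache:
    """LRU cache of decompressed tiles with simple pre-paging (a sketch)."""
    def __init__(self, load_tile, capacity=8):
        self.load_tile = load_tile   # callback: tile id -> decompressed tile
        self.capacity = capacity
        self.tiles = OrderedDict()   # tile id -> tile, most recent last

    def get(self, tile_id):
        if tile_id in self.tiles:
            self.tiles.move_to_end(tile_id)      # cache hit: refresh recency
        else:
            self.tiles[tile_id] = self.load_tile(tile_id)
            if len(self.tiles) > self.capacity:
                self.tiles.popitem(last=False)   # evict least recently used
        return self.tiles[tile_id]

    def prepage(self, tile_id, num_tiles):
        # While idle, read in the tiles adjacent to the current one so that
        # panning into them causes no load delay (ids wrap around the cylinder).
        for neighbor in ((tile_id - 1) % num_tiles, (tile_id + 1) % num_tiles):
            self.get(neighbor)
```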
The image warp, which reprojects sections of the
cylindrical image onto a planar view, is computed in real-time
using a software-based two-pass algorithm [26]. An example of
the warp is shown in figure 5, where the region enclosed by the
yellow box in the panoramic image is warped to create a
perspective view.
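The geometry of the warp can be sketched as a direct per-pixel mapping from the planar view to cylindrical panorama coordinates. Note this is not the two-pass algorithm of [26], only the underlying reprojection; the parameter names and the vertical normalization are assumptions:

```python
import math

def planar_to_cylindrical(u, v, view_w, view_h, pan_deg, focal, pano_w, pano_h):
    """Map a pixel (u, v) of the planar view to cylindrical panorama
    coordinates (a per-pixel sketch of the reprojection; the actual
    player uses a faster two-pass algorithm)."""
    # Cast a ray through the view pixel, image plane at distance `focal`.
    x = u - view_w / 2.0
    y = v - view_h / 2.0
    theta = math.atan2(x, focal) + math.radians(pan_deg)  # longitude of the ray
    # Height on the unit cylinder: y scaled by the distance to the cylinder.
    h = y / math.hypot(x, focal)
    px = (theta % (2.0 * math.pi)) / (2.0 * math.pi) * pano_w
    py = pano_h / 2.0 + h * pano_h / 2.0  # assumes the panorama spans h in [-1, 1]
    return px, py
```

Sampling the panorama at `(px, py)` for every view pixel yields a perspectively correct view for any pan angle.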
The performance of the player varies depending on many
factors such as the platform, the color mode, the panning mode
and the window sizes. The player is currently optimized for
display in 16-bit color mode. Some performance figures for
different processors are given below. These figures indicate the
number of updates per second in a 640x400-pixel window in
16-bit color mode. Because the warping is performed with a
two-pass algorithm, panning in 1D is faster than full 2D
panning. Note that the Windows version has a different
implementation for writing to the display, which may affect the
performance.
Table: updates per second by processor (1D panning vs. 2D panning).
The player can perform image warping at different levels of
quality. The lower quality settings perform less filtering and the
images are more jagged but are faster. To achieve the best
balance between quality and performance, the player
automatically adjusts the quality level to maintain a constant
update rate. When the user is panning, the player switches to
lower quality to keep up with the user. When the user stops, the
player updates the image in higher quality.
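That adaptive behavior amounts to a small feedback rule on the quality level. The number of levels, the thresholds, and the hysteresis below are illustrative assumptions, not the player's actual constants:

```python
def adjust_quality(quality, frame_time, target_time, panning,
                   min_q=0, max_q=3):
    """Pick the warp quality level to hold a constant update rate (a sketch).
    Lower levels filter less and look more jagged but draw faster."""
    if not panning:
        return max_q                # user stopped: redraw once at best quality
    if frame_time > target_time and quality > min_q:
        return quality - 1          # falling behind: drop quality
    if frame_time < target_time * 0.5 and quality < max_q:
        return quality + 1          # plenty of headroom: raise quality
    return quality
```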
Moving in space is currently accomplished by jumping to
points where panoramic images are attached. In order to
preserve continuity of motion, the view direction needs to be
maintained when jumping to an adjacent location. The
panoramas are linked together by matching their orientation
manually in the authoring stage (see 4.3.1.4). Figure 6 shows a
sequence of images generated from panoramas spaced 5 feet
apart.
The default user interface for navigation uses a combination
of a 2D mouse and a keyboard. When the cursor moves over a
window, its shape changes to reflect the permissible action at
the current cursor location. The permissible actions include:
continuous panning in 2D; continuous zooming in and out
(controlled by a keyboard); moving to a different node; and
activating a hot spot. Clicking on the mouse initiates the
corresponding actions. Holding down and dragging the mouse
performs continuous panning. The panning speed is controlled
by the distance relative to the mouse click position.
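The drag-to-speed mapping can be sketched as follows; the dead zone, gain, and speed cap are illustrative constants, not the player's:

```python
import math

def pan_velocity(cursor, click, max_speed=60.0, dead_zone=4.0, gain=0.5):
    """Panning speed (degrees per second, a sketch) from the drag distance
    of the cursor relative to the mouse click position."""
    dx = cursor[0] - click[0]
    dy = cursor[1] - click[1]
    def axis(d):
        if abs(d) <= dead_zone:      # very small drags do not pan
            return 0.0
        speed = (abs(d) - dead_zone) * gain
        return math.copysign(min(speed, max_speed), d)
    return axis(dx), axis(dy)
```

Dragging farther from the click position pans faster, up to the cap, in both axes independently.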
In addition to interactive control, navigation can be placed
under the control of a script. A HyperCard external command
and a Windows DLL have been written to drive the player. Any
application compatible with the external command or DLL can
control the playback with a script. A C run-time library
interface will be available for direct control from a program.
4.2.2 The Object Player
While the panoramic player is designed to look around a
space from the inside, the object player is used to view an
object from the outside. The object player is based on the
navigable movie approach. It uses a two-dimensional array of
frames to accommodate object rotation. The object frames are
created with a constant color background to facilitate
compositing onto other backgrounds. The object player allows
the user to grab the object using a mouse and rotate it with a
virtual sphere-like interface [27]. The object can be rotated in
two directions corresponding to orbiting the camera in the
longitude and the latitude directions.
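Selecting a new frame from a drag can be sketched as below. This is a simplification for illustration, not the virtual sphere interface of [27]; the degrees-per-pixel gain and the 10-degree increment are assumptions:

```python
def drag_to_direction(row, col, dx, dy, num_rows, num_cols,
                      deg_per_pixel=0.5, inc_deg=10.0):
    """Map a mouse drag (dx, dy in pixels) to a new (row, col) viewing
    direction in the object movie's frame array (a sketch).
    The horizontal orbit wraps around; the vertical orbit is clamped."""
    dcol = int(round(dx * deg_per_pixel / inc_deg))
    drow = int(round(dy * deg_per_pixel / inc_deg))
    new_col = (col + dcol) % num_cols                 # longitude wraps
    new_row = max(0, min(num_rows - 1, row + drow))   # latitude clamps
    return new_row, new_col
```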
If there is more than one frame stored for each direction, the
multiple frames are looped continuously while the object is
being rotated. The looping enables the object to have cyclic
time-varying behavior (e.g., a flickering candle or streaming
water).
4.3 The Authoring Environment
The authoring environment includes tools to make
panoramic movies and object movies.
Figure 7. The panoramic movie authoring process.
4.3.1 Panoramic Movie Making
A panoramic movie is created in five steps. First, nodes are
selected in a space to generate panoramas. Second, the
panoramas are created with computer rendering, panoramic
photography or “stitching” a mosaic of overlapping
photographs. Third, if there are any hot spots on the panorama,
a hot spot image is constructed by marking regions of the
panorama with pseudo colors corresponding to the hot spot
identifiers. Alternatively, the hot spots can be generated with
computer rendering [28], [3]. Fourth, if more than one
panoramic node is needed, the panoramas are linked together by
manually registering their viewing directions. Finally, the
panoramic images and the hot spot images are diced and
compressed to create a panoramic movie. The authoring process
is illustrated in figure 7.
4.3.1.1 Node Selection
The nodes should be selected to maintain visual consistency
when moving from one to another. The distance between two
adjacent nodes is related to the size of the virtual environment
and the distance to the nearby objects. Empirically we have
found a 5-10 foot spacing to be adequate with most interior
spaces. The spacing can be significantly increased with
outdoor scenes.
4.3.1.2 Stitching
The purpose of stitching is to create a seamless panoramic
image from a set of overlapping pictures. The pictures are taken
with a camera as it rotates about its vertical axis in one
direction only. The camera pans at roughly equal, but not exact,
increments. The camera is mounted on a tripod and centered at
its nodal point with minimal tilting and rolling. The camera is
usually mounted sideways to obtain the maximum vertical
field-of-view. The setup of the camera is illustrated in figure 8. The
scene is assumed to be static although some distant object
motion may be acceptable.
Figure 8. Camera setup for taking overlapping pictures.
The stitcher uses a correlation-based image registration
algorithm to match and blend adjacent pictures. The adjacent
pictures need to have some overlap for the stitcher to work
properly. The amount of overlap may vary depending on the
image features in the overlapping regions. In practice, a 50%
overlap seems to work best because the adjacent pictures may
have very different brightness levels. Having a large overlap
allows the stitcher to more easily smooth out the intensity
variation.
The success rate of the automatic stitching depends on the
input pictures. For a typical stitching session, about 8 out of
10 panoramas can be stitched automatically, assuming each
panorama is made from 12 pictures. The remaining 2 panoramas
require some manual intervention. The factors which
contribute to automatic stitching failure include, but are not
limited to, missing pictures, extreme intensity change,
insufficient image features, improper camera mounting,
significant object motion and film scanning errors.
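The correlation-based matching step can be sketched as a search over candidate overlap widths for the one with the highest normalized correlation between the adjoining strips. A real stitcher also searches vertically, handles rotation, and blends the seam; the pure-Python scan below is only illustrative:

```python
def register_offset(left, right, overlap_range):
    """Find the horizontal overlap (in pixels) aligning two adjacent
    pictures by maximizing a normalized correlation score (a sketch).
    Images are 2D lists of gray levels with the same height."""
    def score(overlap):
        a = [row[-overlap:] for row in left]   # right edge of left picture
        b = [row[:overlap] for row in right]   # left edge of right picture
        av = [x for r in a for x in r]
        bv = [x for r in b for x in r]
        ma = sum(av) / len(av)
        mb = sum(bv) / len(bv)
        num = sum((x - ma) * (y - mb) for x, y in zip(av, bv))
        den = (sum((x - ma) ** 2 for x in av)
               * sum((y - mb) ** 2 for y in bv)) ** 0.5
        return num / den if den else 0.0       # constant strips: no evidence
    return max(overlap_range, key=score)
```

Normalizing by mean and variance is what lets the match survive the very different brightness levels of adjacent pictures mentioned above.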
In addition to being able to use a regular 35 mm camera, the
ability to use multiple pictures, and hence different exposure
settings, to compose a panorama has another advantage. It
enables one to capture a scene with a very wide intensity range,
such as during a sunset. A normal panoramic camera captures
the entire 360 degrees with a constant exposure setting. Since
film usually has a narrower dynamic range than the real world
does, the resultant panorama may have areas under or over
exposed. The stitcher allows the exposure setting to be
specifically tailored for each direction. Therefore, it may create
a more balanced panorama in extreme lighting conditions.
Although one can use other devices, such as video or digital
cameras for capturing, using still film results in high resolution
images even when displayed at full screen on a monitor. The
film can be digitized and stored on Kodak's PhotoCD. Each
PhotoCD contains around 100 pictures with 5 resolutions each.
A typical panorama is stitched with the middle resolution
pictures (i.e., 768 x 512 pixels) and the resulting panorama is
around 2500 x 768 pixels for pictures taken with a 15 mm lens.
This resolution is enough for a full screen display with a
moderate zoom angle. The stitcher takes around 5 minutes to
automatically stitch a 12-picture panorama on a PowerPC
601/80 MHz processor, including reading the pictures from the
PhotoCD and some post processing. An example of a
panoramic image stitched automatically is shown in figure 9.
4.3.1.3 Hot Spot Marking
Hot spots identify regions of a panoramic image for
interactions, such as navigation or activating actions.
Currently, the hot spots are stored in 8-bit images, which limit
the number of unique hot spots to 256 per image. One way of
creating a hot spot image is by painting pseudo colors over the
top of a panoramic image. Computer renderers may generate the
hot spot image directly.
The hot spot image does not need to have the same
resolution as the panoramic image. The resolution of the hot
spot image is related to the precision of picking. A very low
resolution hot spot image may be used if high accuracy of
picking is not required.
4.3.1.4 Linking
The linking process connects and registers view orientation
between adjacent panoramic nodes. The links are directional
and each node may have any number of links. Each link may be
attached to a hot spot so that the user may activate the link by
clicking on the hot spot.
Currently, the linking is performed by manually registering
the source and destination view orientations using a graphical
linker. The main goal of the registration is to maintain visual
consistency when moving from one node to another.
4.3.1.5 Dicing and Compression
The panoramic and hot spot images are diced before being
compressed and stored in a movie. The tile size should be
optimized for both data loading and offscreen buffer size. A
large number of tiles increases the overhead associated with
loading and decompressing the tiles. A small number of tiles
requires a large offscreen buffer and reduces tile paging
efficiency. We have found that dicing a panoramic image of
2500x768 pixels into 24 vertical stripes provides an optimal
balance between data loading and tile paging. Dicing the
panorama into vertical stripes also minimizes the seek time
involved when loading the tiles from a CD-ROM during panning.
A panorama of the above resolution can be compressed to
around 500 KB with a modest 10 to 1 compression ratio using
the Cinepak compressor, which is based on vector quantization
and provides a good quality vs. speed balance. Other
compressors may be used as well for different quality and speed
tradeoffs. The small disk footprint for each panorama means
that a CD-ROM with over 600 MB capacity can hold more than
1,000 panoramas. The capacity will only increase as higher
density CD-ROMs and better compression methods become
available.
The hot spot image is compressed with a lossless 8-bit
compressor. The lossless compression is necessary to ensure
the correctness of the hot spot id numbers. Since the hot spots
usually occupy large contiguous regions, the compressed size is
typically only a few kilobytes per image.
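The paper does not specify which lossless compressor is used, but the observation above (large contiguous regions of equal id) is exactly what simple run-length coding exploits; a hedged sketch:

```python
def rle_encode(ids):
    """Lossless run-length coding of a row of 8-bit hot spot ids
    (an illustrative sketch, not the actual compressor).
    Runs are [count, id] pairs with count capped at 255."""
    runs = []
    for v in ids:
        if runs and runs[-1][1] == v and runs[-1][0] < 255:
            runs[-1][0] += 1     # extend the current run
        else:
            runs.append([1, v])  # start a new run
    return runs

def rle_decode(runs):
    out = []
    for count, v in runs:
        out.extend([v] * count)
    return out
```

Decoding reproduces the ids exactly, which is the correctness property the hot spot track requires.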
4.3.2 Object Movie Making
Making an object movie requires photographing the object
from different viewing directions. To provide a smooth object
rotation, the camera needs to point at the object's center while
orbiting around it at constant increments. While this
requirement can be easily met in computer generated objects,
photographing a physical object in this way is very
challenging unless a special device is built.
Currently, we use a device, called the "object maker," to
accomplish this task. The object maker uses a computer to
control two stepper motors. The computer-controlled motors
orbit a video camera in two directions by fixing its view
direction at the center of the object. The video camera is
connected to a frame digitizer inside the computer, which
synchronizes frame grabbing with camera rotation. The object
is supported by a nearly invisible base and surrounded by a
black curtain to provide a uniform background. The camera can
rotate close to 180 degrees vertically and 360 degrees
horizontally. The camera typically moves at 10-degree
increments in each direction. The entire process may run
automatically and takes around 1 hour to capture an object.
If multiple frames are needed for each direction, the object
may be captured in several passes, with each pass capturing a
full rotation of the object in a fixed state. The multi-pass
capture requires that the camera rotation be repeatable and the
object motion be controllable. In the case of candle light
flickering, the multiple frames may need to be captured
successively before the camera moves on to the next direction.