Traditionally, virtual reality systems use 3D computer
graphics to model and render virtual environments in real-time.
This approach usually requires laborious modeling and
expensive special purpose rendering hardware. The rendering
quality and scene complexity are often limited because of the
real-time constraint. This paper presents a new approach which
uses 360-degree cylindrical panoramic images to compose a
virtual environment. The panoramic image is digitally warped
on-the-fly to simulate camera panning and zooming. The
panoramic images can be created with computer rendering, with
specialized panoramic cameras, or by "stitching" together
overlapping photographs taken with a regular camera. Walking
in a space is currently accomplished by "hopping" to different
panoramic points. The image-based approach has been used in
the commercial product QuickTime VR, a virtual reality
extension to Apple Computer's QuickTime digital multimedia
framework. The paper describes the architecture, the file format,
the authoring process and the interactive players of the VR
system. In addition to panoramic viewing, the system includes
viewing of an object from different directions and hit-testing
through orientation-independent hot spots.
CR Categories and Subject Descriptors: I.3.3
[Computer Graphics]: Picture/Image Generation – Viewing
algorithms; I.4.3 [Image Processing]: Enhancement –
Geometric correction, Registration.
Additional Keywords: image warping, image registration,
virtual reality, real-time display, view interpolation,
environment maps, panoramic images.
1. Introduction
A key component in most virtual reality systems is the
ability to perform a walkthrough of a virtual environment from
different viewing positions and orientations. The walkthrough
requires the synthesis of the virtual environment and the
simulation of a virtual camera moving in the environment with
up to six degrees of freedom. The synthesis and navigation are
usually accomplished with one of the following two methods.
1.1 3D Modeling and Rendering
Traditionally, a virtual environment is synthesized as a
collection of 3D geometrical entities. The geometrical entities
are rendered in real-time, often with the help of special purpose
3D rendering engines, to provide an interactive walkthrough.
The 3D modeling and rendering approach has three main
problems. First, creating the geometrical entities is a laborious
manual process. Second, because the walkthrough needs to be
performed in real-time, the rendering engine usually places a
limit on scene complexity and rendering quality. Third, the
need for a special purpose rendering engine has limited the
availability of virtual reality for most people since the
necessary hardware is not widely available.
Despite the rapid advance of computer graphics software and
hardware in the past, most virtual reality systems still face the
above problems. The 3D modeling process will continue to be a
very human-intensive operation in the near future. The
real-time rendering problem will remain since there is really no
upper bound on rendering quality or scene complexity. Special
purpose 3D rendering accelerators are still not ubiquitous and
are by no means standard equipment among personal computers.
1.2 Branching Movies
Another approach to synthesize and navigate in virtual
environments, which has been used extensively in the video
game industry, is branching movies. Multiple movie segments
depicting spatial navigation paths are connected together at
selected branch points. The user is allowed to move on to a
different path only at these branching points. This approach
usually uses photography or computer rendering to create the
movies. A computer-driven analog or digital video player is
used for interactive playback. An early example of this
approach is the movie-map [1], in which the streets of the city
of Aspen were filmed at 10-foot intervals. At playback time,
two videodisc players were used to retrieve corresponding views
to simulate the effects of walking on the streets. The use of
digital videos for exploration was introduced with the Digital
Video Interactive technology [2]. The DVI demonstration
allowed a user to wander around the Mayan ruins of Palenque
using digital video playback from an optical disk. A "Virtual
Museum" based on computer rendered images and CD-ROM was
described in [3]. In this example, at selected points in the
museum, a 360-degree panning movie was rendered to let the
user look around. Walking from one of the points to another
was simulated with a bi-directional transition movie, which
contained a frame for each step in both directions along the
path connecting the two points.
An obvious problem with the branching movie approach is
its limited navigability and interaction. It also requires a large
amount of storage space for all the possible movies. However,
this method solves the problems of the 3D approach mentioned
above. The movie approach does not require 3D modeling
and rendering for existing scenes; it can use photographs or
movies instead. Even for computer synthesized scenes, the
movie-based approach decouples rendering from interactive
playback. The movie-based approach allows rendering to be
performed at the highest quality with the greatest complexity
without affecting the playback performance. It can also use
inexpensive and common video devices for playback.
©1995 ACM-0-89791-701-4/95/008…$3.50
1.3 Objectives
Because of the inadequacy of the existing methods, we
decided to explore a new approach for the creation and
navigation of virtual environments. Specifically, we wanted to
develop a new system which met the following objectives:
First, the system should play back at interactive speed on
most personal computers available today without hardware
acceleration. We did not want the system to rely on special
input or output devices, such as data gloves or head-mounted
displays, although we did not preclude their use.
Second, the system should accommodate both real and
synthetic scenes. Real-world scenes contain enormously rich
details often difficult to model and render with a computer. We
wanted the system to be able to use real-world scenery directly
without going through computer modeling and rendering.
Third, the system should be able to display high quality
images independent of scene complexity. Many virtual reality
systems often compromise by displaying low quality images
and/or simplified environments in order to meet the real-time
display constraint. We wanted our system's display speed to be
independent of the rendering quality and scene complexity.
This paper presents an image-based system for virtual
environment navigation based on the above objectives. The
system uses real-time image processing to generate 3D
perspective viewing effects. The approach presented is similar
to the movie-based approach and shares the same advantages. It
differs in that the movies are replaced with “orientation
independent” images and the movie player is replaced with a
real-time image processor. The images that we currently use are
cylindrical panoramas. The panoramas are orientation
independent because each of the images contains all the
information needed to look around in 360 degrees. A number of
these images can be connected to form a walkthrough sequence.
The use of orientation-independent images allows a greater
degree of freedom in interactive viewing and navigation. These
images are also more concise and easier to create than movies.
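As a sketch of the idea (not the product's actual code), the mapping from a cylindrical panorama to a planar perspective view can be written as follows. The function name is ours, and we assume one panorama row per unit of height on a cylinder whose radius equals the viewing focal length; panning changes the azimuth and zooming changes the field of view, with no 3D rendering involved.

```python
import numpy as np

def view_from_cylinder(pano, pan, fov, out_w, out_h):
    """Resample a planar perspective view from a cylindrical panorama.

    pano: H x W image whose columns span 360 degrees of azimuth.
    pan:  viewing azimuth in radians; fov: horizontal field of view.
    """
    H, W = pano.shape[:2]
    f = (out_w / 2) / np.tan(fov / 2)        # focal length in pixels
    x = np.arange(out_w) - out_w / 2         # image-plane coordinates
    y = np.arange(out_h) - out_h / 2
    xx, yy = np.meshgrid(x, y)
    theta = pan + np.arctan2(xx, f)          # azimuth of each viewing ray
    h = yy * f / np.sqrt(xx * xx + f * f)    # height where the ray meets the cylinder
    col = (np.mod(theta / (2 * np.pi), 1.0) * W).astype(int) % W
    row = np.clip(H / 2 + h, 0, H - 1).astype(int)
    return pano[row, col]                    # nearest-neighbor resampling
```

A production warper would use filtered (rather than nearest-neighbor) sampling and exploit the fact that the warp is separable, but the geometry is as above.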
We discuss work related to our approach in Section 2.
Section 3 presents the simulation of camera motions with the
image-based approach. In Section 4, we describe QuickTime
VR, the first commercial product using the image-based
method. Section 5 briefly outlines some applications of the
image-based approach and is followed by conclusions and future
work.
2. Related Work
The movie-based approach requires every displayable view
to be created and stored in the authoring stage. In the movie
map [1] [4], four cameras are used to shoot the views at every
point, thereby giving the user the ability to pan to the left and
right at every point. The Virtual Museum stores 45 views for
each 360-degree pan movie [3]. This results in smooth panning
motion but at the cost of more storage space and frame creation
overhead.
The navigable movie [5] is another example of the movie-based
approach. Unlike the movie-map or the Virtual Museum,
which only have the panning motion in one direction, the
navigable movie offers two-dimensional rotation. An object is
photographed with a camera pointing at the object's center and
orbiting in both the longitude and the latitude directions at
roughly 10-degree increments. This process results in hundreds
of frames corresponding to all the available viewing directions.
The frames are stored in a two-dimensional array which is
indexed by two rotational parameters during interactive playback.
When displaying the object against a static background, the
effect is the same as rotating the object. Panning to look at a
scene is accomplished in the same way. The frames in this case
represent views of the scene in different view orientations.
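The lookup just described can be illustrated with a small sketch; the class name and step sizes are ours, not from the cited system. Playback reduces to quantizing the requested orientation and indexing the array, with no rendering at all.

```python
# Hypothetical sketch: frames shot at roughly 10-degree increments are
# kept in a 2D array indexed by quantized longitude and latitude, so
# interactive playback is a table lookup rather than a rendering step.
class NavigableMovie:
    def __init__(self, frames, lon_step=10, lat_step=10):
        self.frames = frames          # frames[lat_index][lon_index]
        self.lon_step = lon_step      # degrees between adjacent columns
        self.lat_step = lat_step      # degrees between adjacent rows

    def frame_at(self, lon_deg, lat_deg):
        """Return the stored frame nearest the requested orientation."""
        lon_i = int(round(lon_deg / self.lon_step)) % len(self.frames[0])
        lat_i = int(round(lat_deg / self.lat_step)) % len(self.frames)
        return self.frames[lat_i][lon_i]
```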
If only the view direction is changing and the viewpoint is
stationary, as in the case of pivoting a camera about its nodal
point (i.e. the optical center of projection), all the frames from
the pan motion can be mapped to a canonical projection. This
projection is termed an environment map, which can be
regarded as an orientation-independent view of the scene. Once
an environment map is generated, any arbitrary view of the
scene, as long as the viewpoint does not move, can be
computed by a reprojection of the environment map to the new
viewing direction.
The environment map was initially used in computer
graphics to simplify the computations of specular reflections
on a shiny object from a distant scene [6], [7], [8]. The scene is
first projected onto an environment map centered at the object.
The map is indexed by the specular reflection directions to
compute the reflection on the object. Since the scene is far
away, the location difference between the object center and the
surface reflection point can be ignored.
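A minimal sketch of this reflection-mapping lookup, assuming a latitude-longitude map layout (the function names and layout are our choice; other layouts such as cube maps work the same way): mirror the view direction about the surface normal, convert the resulting direction to map coordinates, and read the map.

```python
import numpy as np

def reflect(view_dir, normal):
    """Mirror-reflect a unit view direction about a unit surface normal."""
    v = np.asarray(view_dir, dtype=float)
    n = np.asarray(normal, dtype=float)
    return v - 2.0 * np.dot(v, n) * n

def lookup_latlong(env, direction):
    """Index a latitude-longitude environment map by a unit direction.

    env: H x W map; rows span latitude [-pi/2, pi/2],
    columns span longitude [-pi, pi].
    """
    x, y, z = direction
    lon = np.arctan2(x, z)
    lat = np.arcsin(np.clip(y, -1.0, 1.0))
    H, W = env.shape[:2]
    col = int((lon + np.pi) / (2 * np.pi) * (W - 1))
    row = int((lat + np.pi / 2) / np.pi * (H - 1))
    return env[row, col]
```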
Various types of environment maps have been used for
interactive visualization of virtual environments. In the
movie-map, anamorphic images were optically or electronically
processed to obtain 360-degree viewing [1], [9]. A project
called "Navigation" used a grid of panoramas for sailing
simulation [10]. Real-time reprojection of environment maps
was used to visualize surrounding scenes and to create
interactive walkthrough [11], [12]. A hardware method for
environment map look-up was implemented for a virtual reality
system [13].
While rendering an environment map is trivial with a
computer, creating it from photographic images requires extra
work. Greene and Heckbert described a technique of
compositing multiple image streams with known camera
positions into a fish-eye view [14]. Automatic registration can
be used to composite multiple source images into an image with
enhanced field of view [15], [16], [17].
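As a toy illustration of the registration step (a deliberately simplified stand-in for the cited methods, with a function name of our own choosing), the horizontal overlap between two adjacent photographs can be estimated by exhaustively testing candidate overlaps and keeping the one with the smallest squared difference:

```python
import numpy as np

def estimate_overlap(left, right, max_shift):
    """Estimate how many columns two horizontally overlapping images
    share, by testing candidate overlaps and keeping the one with the
    smallest mean squared difference over the shared strip."""
    best_shift, best_err = 0, np.inf
    for s in range(1, max_shift + 1):
        a = left[:, -s:].astype(float)   # rightmost s columns of left image
        b = right[:, :s].astype(float)   # leftmost s columns of right image
        err = np.mean((a - b) ** 2)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift
```

Real stitchers must additionally handle rotation, lens distortion, and exposure differences, which is what makes the cited automatic-registration work nontrivial.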
When the viewpoint starts moving and some objects are
nearby, as in the case of orbiting a camera around an object, the
frames can no longer be mapped to a canonical projection. The
movement of the viewpoint causes "disparity" between
different views of the same object. The disparity is a result of
depth change in the image space when the viewpoint moves
(pivoting a camera about its nodal point does not cause depth
change). Because of the disparity, a single environment map is
insufficient to accommodate all the views. The movie-based
approach simply stores all the frames. The view interpolation
method presented by Chen and Williams [18] stores only a few
key frames and synthesizes the missing frames on-the-fly by
interpolation. However, this method requires additional
information, such as a depth buffer and camera parameters, for
each of the key frames. Automatic or semi-automatic methods
have been developed for registering and interpolating images
with unknown depth and camera information [16], [19],[20].
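The interpolation idea can be sketched as follows, assuming the per-pixel correspondence (flow) field between two key frames is already known. This simplified version, with names of our own choosing, forward-maps each source pixel along its scaled flow vector; it ignores the hole-filling and visibility handling that a real implementation such as [18] needs.

```python
import numpy as np

def interpolate_view(src, flow, t):
    """Approximate an in-between view by forward-mapping key-frame
    pixels along a precomputed correspondence field, scaled by t.

    src:  H x W image; flow: H x W x 2 array of (dy, dx) offsets taking
    each source pixel to its position in the next key frame; t in [0, 1].
    """
    H, W = src.shape[:2]
    out = np.zeros_like(src)
    ys, xs = np.mgrid[0:H, 0:W]
    ny = np.clip(np.round(ys + t * flow[..., 0]).astype(int), 0, H - 1)
    nx = np.clip(np.round(xs + t * flow[..., 1]).astype(int), 0, W - 1)
    out[ny, nx] = src   # later writes win where pixels collide
    return out
```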