Traditionally, virtual reality systems use 3D computer
 
 graphics to model and render virtual environments in real-time.
 
 This approach usually requires laborious modeling and
 
 expensive special purpose rendering hardware. The rendering
 
 quality and scene complexity are often limited because of the
 
 real-time constraint. This paper presents a new approach which
 
 uses 360-degree cylindrical panoramic images to compose a
 
 virtual environment. The panoramic image is digitally warped
 
 on-the-fly to simulate camera panning and zooming. The
 
panoramic images can be created with computer rendering, with specialized panoramic cameras, or by "stitching" together
 
 overlapping photographs taken with a regular camera. Walking
 
 in a space is currently accomplished by "hopping" to different
 
 panoramic points. The image-based approach has been used in
 
 the commercial product QuickTime VR, a virtual reality
 
 extension to Apple Computer's QuickTime digital multimedia
 
 framework. The paper describes the architecture, the file format,
 
 the authoring process and the interactive players of the VR
 
 system. In addition to panoramic viewing, the system includes
 
 viewing of an object from different directions and hit-testing
 
 through orientation-independent hot spots.
 
CR Categories and Subject Descriptors: I.3.3 [Computer Graphics]: Picture/Image Generation – Viewing algorithms; I.4.3 [Image Processing]: Enhancement – Geometric correction, Registration.
 
 Additional Keywords: image warping, image registration,
 
 virtual reality, real-time display, view interpolation,
 
 environment maps, panoramic images.
 
1. Introduction

A key component in most virtual reality systems is the
 
 ability to perform a walkthrough of a virtual environment from
 
 different viewing positions and orientations. The walkthrough
 
 requires the synthesis of the virtual environment and the
 
 simulation of a virtual camera moving in the environment with
 
 up to six degrees of freedom. The synthesis and navigation are
 
 usually accomplished with one of the following two methods.
 
 1.1 3D Modeling and Rendering
 
 Traditionally, a virtual environment is synthesized as a
 
 collection of 3D geometrical entities. The geometrical entities
 
 are rendered in real-time, often with the help of special purpose
 
3D rendering engines, to provide an interactive walkthrough.
 
 The 3D modeling and rendering approach has three main
 
 problems. First, creating the geometrical entities is a laborious
 
 manual process. Second, because the walkthrough needs to be
 
 performed in real-time, the rendering engine usually places a
 
 limit on scene complexity and rendering quality. Third, the
 
 need for a special purpose rendering engine has limited the
 
 availability of virtual reality for most people since the
 
 necessary hardware is not widely available.
 
 Despite the rapid advance of computer graphics software and
 
 hardware in the past, most virtual reality systems still face the
 
 above problems. The 3D modeling process will continue to be a
 
very human-intensive operation in the near future. The real-time rendering problem will remain since there is really no
 
 upper bound on rendering quality or scene complexity. Special
 
 purpose 3D rendering accelerators are still not ubiquitous and
 
are by no means standard equipment among personal computer users.
 
1.2 Branching Movies

Another approach to synthesize and navigate in virtual
 
 environments, which has been used extensively in the video
 
 game industry, is branching movies. Multiple movie segments
 
 depicting spatial navigation paths are connected together at
 
 selected branch points. The user is allowed to move on to a
 
 different path only at these branching points. This approach
 
 usually uses photography or computer rendering to create the
 
 movies. A computer-driven analog or digital video player is
 
 used for interactive playback. An early example of this
 
 approach is the movie-map [1], in which the streets of the city
 
 of Aspen were filmed at 10-foot intervals. At playback time,
 
 two videodisc players were used to retrieve corresponding views
 
 to simulate the effects of walking on the streets. The use of
 
 digital videos for exploration was introduced with the Digital
 
 Video Interactive technology [2]. The DVI demonstration
 
 allowed a user to wander around the Mayan ruins of Palenque
 
 using digital video playback from an optical disk. A "Virtual
 
 Museum" based on computer rendered images and CD-ROM was
 
 described in [3]. In this example, at selected points in the
 
 museum, a 360-degree panning movie was rendered to let the
 
 user look around. Walking from one of the points to another
 
 was simulated with a bi-directional transition movie, which
 
 contained a frame for each step in both directions along the
 
 path connecting the two points.
 
 An obvious problem with the branching movie approach is
 
 its limited navigability and interaction. It also requires a large
 
 amount of storage space for all the possible movies. However,
 
this method solves the problems mentioned above for the 3D approach. The movie-based approach does not require 3D modeling
 
 and rendering for existing scenes; it can use photographs or
 
 movies instead. Even for computer synthesized scenes, the
 
 movie-based approach decouples rendering from interactive
 
 playback. The movie-based approach allows rendering to be
 
 performed at the highest quality with the greatest complexity
 
 without affecting the playback performance. It can also use
 
 inexpensive and common video devices for playback.
 
 
1.3 Objectives

Because of the inadequacy of the existing methods, we
 
 decided to explore a new approach for the creation and
 
 navigation of virtual environments. Specifically, we wanted to
 
 develop a new system which met the following objectives:
 
First, the system should play back at interactive speed on
 
 most personal computers available today without hardware
 
 acceleration. We did not want the system to rely on special
 
 input or output devices, such as data gloves or head-mount
 
 displays, although we did not preclude their use.
 
 Second, the system should accommodate both real and
 
 synthetic scenes. Real-world scenes contain enormously rich
 
 details often difficult to model and render with a computer. We
 
 wanted the system to be able to use real-world scenery directly
 
 without going through computer modeling and rendering.
 
 Third, the system should be able to display high quality
 
 images independent of scene complexity. Many virtual reality
 
 systems often compromise by displaying low quality images
 
 and/or simplified environments in order to meet the real-time
 
 display constraint. We wanted our system's display speed to be
 
 independent of the rendering quality and scene complexity.
 
1.4 Overview

This paper presents an image-based system for virtual
 
 environment navigation based on the above objectives. The
 
 system uses real-time image processing to generate 3D
 
 perspective viewing effects. The approach presented is similar
 
 to the movie-based approach and shares the same advantages. It
 
 differs in that the movies are replaced with “orientation
 
 independent” images and the movie player is replaced with a
 
 real-time image processor. The images that we currently use are
 
 cylindrical panoramas. The panoramas are orientation
 
 independent because each of the images contains all the
 
 information needed to look around in 360 degrees. A number of
 
 these images can be connected to form a walkthrough sequence.
 
 The use of orientation-independent images allows a greater
 
 degree of freedom in interactive viewing and navigation. These
 
 images are also more concise and easier to create than movies.
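How such connected panoramas might be organized can be sketched with a small Python structure; the node names, hot-spot labels and file names below are purely hypothetical and are not the QuickTime VR format described later in the paper:

# A hypothetical graph of panoramic nodes. Each node stores its panoramic
# image and named hot spots that link to neighboring nodes, so the viewer
# "hops" from one panoramic point to another.
nodes = {
    "lobby":   {"pano": "lobby.pano",   "links": {"door": "hallway"}},
    "hallway": {"pano": "hallway.pano", "links": {"back": "lobby", "stairs": "gallery"}},
    "gallery": {"pano": "gallery.pano", "links": {"down": "hallway"}},
}

def hop(current, hotspot):
    """Follow a hot spot from the current panoramic node to the linked node."""
    return nodes[current]["links"].get(hotspot, current)

print(hop("lobby", "door"))   # -> "hallway"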
 
 We discuss work related to our approach in Section 2.
 
 Section 3 presents the simulation of camera motions with the
 
 image-based approach. In Section 4, we describe QuickTime
 
 VR, the first commercial product using the image-based
 
 method. Section 5 briefly outlines some applications of the
 
image-based approach and is followed by conclusions and future directions.
 
2. Related Work

The movie-based approach requires every displayable view to be created and stored in the authoring stage. In the movie-map [1] [4], four cameras are used to shoot the views at every point, thereby giving the user the ability to pan to the left and
 
 right at every point. The Virtual Museum stores 45 views for
 
 each 360-degree pan movie [3]. This results in smooth panning
 
motion but at the cost of more storage space and frame creation overhead.
 
The navigable movie [5] is another example of the movie-based approach. Unlike the movie-map or the Virtual Museum,
 
 which only have the panning motion in one direction, the
 
 navigable movie offers two-dimensional rotation. An object is
 
 photographed with a camera pointing at the object's center and
 
 orbiting in both the longitude and the latitude directions at
 
 roughly 10-degree increments. This process results in hundreds
 
 of frames corresponding to all the available viewing directions.
 
The frames are stored in a two-dimensional array which is indexed by two rotational parameters during interactive playback.
 
 When displaying the object against a static background, the
 
 effect is the same as rotating the object. Panning to look at a
 
 scene is accomplished in the same way. The frames in this case
 
 represent views of the scene in different view orientations.
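The two-parameter indexing can be made concrete with a short Python sketch; the array layout, angle conventions and 10-degree increments below are illustrative assumptions rather than the format used in [5]:

import numpy as np

def nearest_frame(frames, lon_deg, lat_deg, lon_step=10.0, lat_step=10.0):
    # frames: 2D array of stored views, shape (n_lat, n_lon), captured at
    # lat_step / lon_step degree increments; pick the view whose capture
    # direction is closest to the requested (longitude, latitude).
    n_lat, n_lon = frames.shape[:2]
    i_lon = int(round((lon_deg % 360.0) / lon_step)) % n_lon          # longitude wraps around
    i_lat = min(max(int(round(lat_deg / lat_step)), 0), n_lat - 1)    # latitude is clamped
    return frames[i_lat, i_lon]

# Example: 19 latitude rows (0..180 degrees) by 36 longitude columns (0..350 degrees),
# i.e. several hundred frames, as in the text above.
frames = np.arange(19 * 36).reshape(19, 36)
print(nearest_frame(frames, lon_deg=47.0, lat_deg=83.0))   # frame captured near (50, 80)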
 
 If only the view direction is changing and the viewpoint is
 
 stationary, as in the case of pivoting a camera about its nodal
 
 point (i.e. the optical center of projection), all the frames from
 
 the pan motion can be mapped to a canonical projection. This
 
 projection is termed an environment map, which can be
 
 regarded as an orientation-independent view of the scene. Once
 
 an environment map is generated, any arbitrary view of the
 
 scene, as long as the viewpoint does not move, can be
 
computed by a reprojection of the environment map to the new viewing plane.
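Such a reprojection can be sketched in a few lines of Python for the cylindrical case; the sketch assumes a 360-degree cylindrical panorama with square pixels and uses nearest-neighbor sampling, and is only an illustration, not the warping algorithm of the system described in this paper:

import numpy as np

def reproject_cylinder(pano, pan_deg, fov_deg, out_w, out_h):
    # Render a planar perspective view from a cylindrical panorama `pano`
    # (H x W x channels, covering 360 degrees horizontally) for a camera
    # panned by `pan_deg` degrees with a horizontal field of view `fov_deg`.
    ph, pw = pano.shape[:2]
    f = (out_w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)   # view focal length in pixels
    f_pano = pw / (2.0 * np.pi)                             # cylinder radius in panorama pixels

    # Output pixel grid, centered on the optical axis.
    xx, yy = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                         np.arange(out_h) - out_h / 2.0)

    # Azimuth and cylinder height of the viewing ray through each output pixel.
    theta = np.arctan2(xx, f) + np.radians(pan_deg)
    height = yy / np.hypot(xx, f)

    # Map to panorama coordinates and sample (nearest neighbor for brevity).
    u = np.round((theta / (2.0 * np.pi)) % 1.0 * pw).astype(int) % pw
    v = np.clip(np.round(ph / 2.0 + height * f_pano), 0, ph - 1).astype(int)
    return pano[v, u]

# Example: a 90-degree view looking 30 degrees to the right of the panorama's origin.
pano = np.random.rand(512, 2048, 3)
view = reproject_cylinder(pano, pan_deg=30.0, fov_deg=90.0, out_w=640, out_h=480)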
 
 The environment map was initially used in computer
 
 graphics to simplify the computations of specular reflections
 
 on a shiny object from a distant scene [6], [7], [8]. The scene is
 
 first projected onto an environment map centered at the object.
 
 The map is indexed by the specular reflection directions to
 
 compute the reflection on the object. Since the scene is far
 
 away, the location difference between the object center and the
 
 surface reflection point can be ignored.
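For concreteness, this lookup can be sketched in Python; the latitude-longitude map layout and nearest-neighbor sampling below are illustrative choices, not the formulations of [6], [7], [8]:

import numpy as np

def reflect(view_dir, normal):
    # Mirror the incident viewing direction about the surface normal.
    v = view_dir / np.linalg.norm(view_dir)
    n = normal / np.linalg.norm(normal)
    return v - 2.0 * np.dot(v, n) * n

def lookup_latlong(env_map, direction):
    # Sample a latitude-longitude environment map in the given direction.
    h, w = env_map.shape[:2]
    x, y, z = direction / np.linalg.norm(direction)
    theta = np.arctan2(x, z)                       # azimuth in [-pi, pi]
    phi = np.arcsin(np.clip(y, -1.0, 1.0))         # elevation in [-pi/2, pi/2]
    u = int((theta / (2.0 * np.pi) + 0.5) * (w - 1))
    v = int((0.5 - phi / np.pi) * (h - 1))
    return env_map[v, u]

# Reflect a ray off a surface tilted 45 degrees and look up the reflected color.
env = np.random.rand(256, 512, 3)
r = reflect(np.array([0.0, 0.0, -1.0]), np.array([0.0, 1.0, 1.0]))
print(lookup_latlong(env, r))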
 
 Various types of environment maps have been used for
 
interactive visualization of virtual environments. In the movie-map, anamorphic images were optically or electronically
 
 processed to obtain 360-degree viewing [1], [9]. A project
 
 called "Navigation" used a grid of panoramas for sailing
 
 simulation [10]. Real-time reprojection of environment maps
 
 was used to visualize surrounding scenes and to create
 
interactive walkthroughs [11], [12]. A hardware method for
 
environment map look-up was implemented for a virtual reality system [13].
 
 While rendering an environment map is trivial with a
 
 computer, creating it from photographic images requires extra
 
 work. Greene and Heckbert described a technique of
 
 compositing multiple image streams with known camera
 
 positions into a fish-eye view [14]. Automatic registration can
 
 be used to composite multiple source images into an image with
 
 enhanced field of view [15], [16], [17].
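The simplest case of such registration, a pure horizontal translation between two overlapping photographs, can be sketched in Python; the exhaustive search and linear cross-fade below are illustrative only and are not the methods of [15], [16], [17]:

import numpy as np

def estimate_overlap(left, right, min_ov=16, max_ov=256):
    # Find the column overlap that best aligns two horizontally adjacent
    # grayscale photographs (float arrays of equal height), assuming a pure
    # horizontal translation between them.
    h, w = left.shape
    best_ov, best_err = min_ov, np.inf
    for ov in range(min_ov, min(max_ov, w, right.shape[1]) + 1):
        err = np.mean((left[:, w - ov:] - right[:, :ov]) ** 2)
        if err < best_err:
            best_ov, best_err = ov, err
    return best_ov

def stitch(left, right, ov):
    # Composite the two images, cross-fading over the shared columns.
    h, w = left.shape
    out = np.zeros((h, w + right.shape[1] - ov))
    out[:, :w - ov] = left[:, :w - ov]
    out[:, w:] = right[:, ov:]
    alpha = np.linspace(0.0, 1.0, ov)               # blend weights across the overlap
    out[:, w - ov:w] = (1.0 - alpha) * left[:, w - ov:] + alpha * right[:, :ov]
    return out

# Example: two synthetic strips that share 100 columns.
a = np.random.rand(200, 400)
b = np.concatenate([a[:, -100:], np.random.rand(200, 300)], axis=1)
wide = stitch(a, b, estimate_overlap(a, b))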
 
 When the viewpoint starts moving and some objects are
 
 nearby, as in the case of orbiting a camera around an object, the
 
 frames can no longer be mapped to a canonical projection. The
 
 movement of the viewpoint causes "disparity" between
 
 different views of the same object. The disparity is a result of
 
 depth change in the image space when the viewpoint moves
 
 (pivoting a camera about its nodal point does not cause depth
 
 change). Because of the disparity, a single environment map is
 
 insufficient to accommodate all the views. The movie-based
 
 approach simply stores all the frames. The view interpolation
 
 method presented by Chen and Williams [18] stores only a few
 
 key frames and synthesizes the missing frames on-the-fly by
 
 interpolation. However, this method requires additional
 
 information, such as a depth buffer and camera parameters, for
 
 each of the key frames. Automatic or semi-automatic methods
 
 have been developed for registering and interpolating images
 
with unknown depth and camera information [16], [19], [20].
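A minimal Python sketch of forward-mapping a key frame using its depth buffer and camera parameters is given below; it only illustrates why this extra information is needed and is not the interpolation method of [18] (no hole filling or splatting, and the per-pixel loop is kept simple for clarity):

import numpy as np

def forward_warp(image, depth, K, R, t):
    # image: H x W x C key-frame colors; depth: H x W depth along the optical axis.
    # K: 3x3 camera intrinsics; R, t: rotation and translation of the new view
    # relative to the key frame.
    h, w = depth.shape
    out = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)
    colors = image.reshape(h * w, -1)

    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])   # 3 x N homogeneous pixels
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()             # back-project to 3D points
    proj = K @ (R @ pts + t.reshape(3, 1))                     # project into the new view

    z = proj[2]
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for src, uu, vv, zz in zip(np.flatnonzero(ok), u[ok], v[ok], z[ok]):
        if zz < zbuf[vv, uu]:                                  # closest surface wins
            zbuf[vv, uu] = zz
            out[vv, uu] = colors[src]
    return out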