Intuitive Surgical, Sunnyvale, CA, USA
20.1 Introduction
A clinician performing head and neck robotic surgery currently relies on volumetric preoperative and diagnostic images (e.g., computed tomography (CT) and/or magnetic resonance imaging (MRI)) to develop a surgical plan. However, intraoperative imaging provides in situ, real-time information about the presence of pathology and its anatomic relationship to vital structures, potentially allowing for more targeted, safer, and less morbid surgery. The literature on intraoperative imaging for robotics in otolaryngology, spanning an irradiative modality (cone beam computed tomography (CBCT)) and non-irradiative modalities (ultrasound (US), narrow band imaging (NBI), and near-infrared fluorescence), shows researchers adapting traditionally diagnostic imaging techniques to intraoperative needs. Surgical objectives include planning approaches for target resection, margin delineation, and reconstruction while controlling or preserving critical functional structures. Intraoperative imaging has been explored not only to visualize this workspace but also to provide an anchoring modality for registering higher-resolution preoperative images and plans. This requires establishing the correspondence of image coordinates from preoperative to intraoperative space and to the surgical scene using registration algorithms. Rigid registration is arguably a solved problem and a mainstay of commercially available systems used in standard surgical procedures. However, nonrigid registration in real time, which requires reliably modeling intraoperative deformations from setup and intervention, remains a challenge. Beyond high-fidelity registration, effective navigation and visualization in robotic surgery in otolaryngology are also very active areas of research. In this chapter, we survey how research and development in various modalities of intraoperative imaging have addressed these technical challenges in state-of-the-art navigation for robotic surgery in otolaryngology.
20.2 Background
20.2.1 Intraoperative Imaging Modalities
20.2.1.1 C-arm and Flat-Panel Cone Beam Computed Tomography (CBCT)
For 2D intraoperative imaging, X-ray has long been established as a cost-effective, real-time modality. Fluoroscopy was used by Goding et al. to observe hypoglossal nerve stimulation and to evaluate airway changes in otolaryngology [5]. However, 3D imaging better informs the surgeon about the precise extent of dissection and can be used to update intraoperative stereotactic navigation. The emergence of intraoperative CBCT has proved useful in complex skull base and endoscopic surgery, especially in cases in which the extent of bony resection is critical to the successful outcome of the operation [6]. Unlike anterior skull base surgery, the use of intraoperative image guidance has only recently gained popularity in the field of lateral skull base surgery. While conventional multi-detector computed tomography (MDCT) is better able to resolve soft tissue, flat-panel CBCT scanners generally deliver less radiation to the patient and produce fewer metal artifacts. In a retrospective case review of 12 patients, Conley et al. [6] compared a conventional CT system (NeuroLogica CereTom® (NeuroLogica Corporation, Danvers, MA)) to two flat-panel CBCT systems (O-ARM® (Medtronic Inc., Minneapolis, MN) and Xoran xCAT® (Xoran Technologies Inc., Ann Arbor, MI)), evaluating ease of use, image characteristics, and integration with image guidance for skull base and endoscopic sinus surgery. In their study, all three scanners provided good quality images, but more significantly, their results showed that intraoperative CBCT was not only technically feasible but also useful for surgical decision-making in three out of four of their cases. For example, the intraoperative scan was used to facilitate a novel (retrolabyrinthine) approach to acoustic tumor removal by precisely delineating the extent of bony removal relative to the needed exposure; this would have been less certain by standard means of caliper estimation. In one of these cases, the use of the O-arm was impossible secondary to morbid obesity, which precluded safe positioning of the patient for image acquisition. Other limitations noted include restrictions on the type of head reference array that can be used in lateral approaches and the lack of integration between the navigation systems and the operating microscope.
20.2.1.2 Ultrasound Imaging
The advent of minimally invasive approaches has further integrated ultrasound devices into otolaryngology for both diagnostics and intervention. In diagnostics, ultrasound has become an extended component of the physical examination in head and neck patients, particularly those with diseases of the thyroid, salivary glands, lymph nodes, and tongue [7–9]. For surgery in otolaryngology, the risk of inadvertent tissue injury requires in situ imaging techniques that can visualize the operative field dynamically and beyond the visible surface. In Doppler mode, US provides temporal (4D) data to assess vascular flow in the oral and maxillofacial regions [10], and it is an efficient modality for imaging soft tissue morphology in 3D, allowing, for example, differential pressure to be applied to retropharyngeal metastases to determine their spatial mobility relative to the carotid artery. Transoral ultrasound has been shown to be a cost-effective modality for evaluating the retropharyngeal space [11–13]. In base of tongue cancer, clinicians have successfully used ultrasound to guide core biopsies [12, 14, 15], interstitial photodiagnosis, and photodynamic therapy [16]. Furthermore, registration of ultrasonography to CT has demonstrated advantages in staging and surgical planning for papillary thyroid carcinoma (PTC). In a tertiary-center prospective study, Lesnik et al. [17] measured the sensitivity, specificity, and positive/negative predictive value of nodal diagnostics in the central/lateral cervical compartments of 162 PTC patients undergoing preoperative lymph node evaluation by physical examination (PE), US, and CT, with surgical pathology as the gold standard for diagnostic accuracy. In patients undergoing primary (Group I) or revision (Group II) surgical treatment for PTC, the cases that used US registered to CT yielded significantly higher sensitivity for macroscopic lymph node detection in both the lateral and central neck, most markedly in the Group I central compartment.
20.2.1.3 Optical Imaging
In endoscopic surgery, the unique advantage of optical imaging techniques, such as autofluorescence and narrow band imaging, over all other modalities discussed in this chapter is the direct coregistration of white light and augmented information in the primary visual field. If the photons required for optical imaging can be gathered from the same endoscope/laryngoscope used to guide the robotic surgery, the nontrivial issue of organ deformation is intrinsically addressed [18]. In fact, near-infrared visualization of fluorescence tracers (e.g., indocyanine green (ICG)) has shown promise in identifying and guiding tumor resection because of favorable characteristics such as minimal scattering, enhanced tissue penetration depth, and high-quality contrast [18]. Currently in robotics, the da Vinci® Surgical System (Intuitive Surgical Inc., Sunnyvale, CA) supports integrated near-infrared imaging to visualize fluorescence. Fluorescence in robotic surgery has been used successfully in urologic and general laparoscopic surgery [19–21]. The potential of its application in head and neck oncology can be seen in the study by Rosenthal et al. [22], which assessed the safety and tumor specificity of a fluorescently labeled epidermal growth factor receptor (EGFR)-targeted agent. A 30-day dose escalation study of the agent was performed with 12 patients undergoing surgical resection of squamous cell carcinoma. Multi-instrument fluorescence imaging was performed in the operating room and in surgical pathology. Fluorescence levels positively correlated with EGFR levels, and results showed that fluorescence imaging with an intraoperative, wide-field device can successfully differentiate tumor from normal tissue during resection (average tumor-to-background ratio of 5.2 in the highest dose range). This study was the first to demonstrate that commercially available antibodies can be fluorescently labeled and safely administered to humans to identify cancer with submillimeter resolution, which has the potential to improve outcomes in clinical oncology.
Narrow band imaging is another optical imaging technology, typically integrated into an endoscopy system, that can display the mucosal surface layer in high contrast, especially hemoglobin-rich areas such as blood vessels and microvascular patterns. Magnifying endoscopy (ME) enhances the capabilities of standard video endoscopy with higher resolution and higher contrast compared to the nonmagnifying NBI endoscopes used in otolaryngology, such as the ENF-VQ and ENF-VH (Olympus Medical Systems, Tokyo, Japan) [23]. Researchers have shown that ME-NBI enables detection of early superficial laryngopharyngeal cancers, which are difficult to detect by standard endoscopy or nonmagnifying endoscopy with NBI [24]. For transoral robotic surgery (TORS), Tateya et al. report two advantages of using ME-NBI [25]. First, the combination facilitates early diagnosis of pharyngeal cancers, and the detection of superficial lesions is expected to increase with the advent of ME-NBI. Second, ME-NBI improves resection of invasive cancer lesions: better visualization of lesion boundaries helps avoid excessive resection, especially of the superficial part of the invasive cancer, resulting in better functional outcomes such as swallowing and voice function. The limitation of ME-NBI is that it cannot examine deeper tissue beneath the epithelium; pathological diagnosis via biopsy is therefore still necessary for checking the vertical margin.
20.2.2 Intraoperative Navigation Through Registration and Visualization
20.2.2.1 Registration
A common form of navigation in head and neck surgery registers preoperative image data (i.e., diagnostic CT, MRI) using either optical or electromagnetic (EM) tracking technology. In this context, registration is the spatial alignment of a medical image data set to the coordinate system of the patient and/or operating room. Commercially available optical and EM systems have had the most success in workspaces with rigid structures, such as skull base surgery, craniomaxillofacial surgery, and neurosurgery [26, 27]. Intraoperatively, these guidance systems provide real-time tracking of a pointer or other tools with respect to the registered image data. The accuracy attainable with optical systems in the clinic has been reported to be approximately 2 mm in target registration error [28]; however, studies using EM tracking in clinical settings have noted higher errors [29]. Optical systems generally provide better spatial uniformity over a larger field of view than EM solutions, which are subject to interference from magnetic objects and stray electromagnetic fields [30]. The major disadvantage of an optical system is the requirement of line of sight between the camera and the tracked markers. Although each of these conventional platforms has individual trade-offs and potential deficiencies, they have been readily applied in image-guided surgical interventions [31].
For example, in 2008, Desai et al. [32] presented a series of three case studies using EM-based tracking to guide transoral resection of oropharyngeal and parapharyngeal space lesions. Using the Brainlab EM-tracking system (Munich, Germany), preoperative CT was registered to the patient through identification of bony landmarks, allowing surgeons to localize an intraoperative pointer with respect to the preoperative CT throughout the procedure. The study showed that the provided guidance was especially helpful in assessing the anatomy during dissection, particularly in the deep lateral parapharyngeal space close to the carotid artery and lateral pharyngeal wall. The limitations, beyond the technical disadvantages of EM tracking discussed above, are the reliance on preoperative data, which does not account for intraoperative deformations, and the fact that the guidance is viewed separately from the primary visual field and must be mentally correlated.
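The commercial navigation platforms cited above do not disclose their internals, but landmark-based alignment of the kind described, matching bony landmarks identified in the CT to the same points localized on the patient, is commonly implemented as a least-squares paired-point rigid registration. The following minimal Python sketch illustrates that approach and the computation of a target registration error for a point not used in the fit; all coordinates, function names, and values are illustrative assumptions rather than data from the cited studies.

import numpy as np

def rigid_landmark_registration(ct_points, patient_points):
    """Least-squares rigid transform (R, t) mapping CT landmarks to
    tracker-space landmarks via the SVD (Arun/Horn) method."""
    ct = np.asarray(ct_points, dtype=float)
    pt = np.asarray(patient_points, dtype=float)
    ct_c, pt_c = ct.mean(axis=0), pt.mean(axis=0)                 # centroids
    H = (ct - ct_c).T @ (pt - pt_c)                               # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = pt_c - R @ ct_c
    return R, t

def target_registration_error(R, t, ct_target, patient_target):
    """Distance (mm) between a mapped CT target and its true patient-space position."""
    return np.linalg.norm(R @ np.asarray(ct_target) + t - np.asarray(patient_target))

# Illustrative (made-up) landmark coordinates in millimetres.
ct_landmarks = [[0, 0, 0], [40, 0, 0], [0, 55, 0], [0, 0, 30]]
tracker_landmarks = [[10, 2, 1], [50, 3, 2], [9, 57, 1], [11, 1, 31]]
R, t = rigid_landmark_registration(ct_landmarks, tracker_landmarks)
print(target_registration_error(R, t, [20, 20, 10], [30, 22, 11]))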
While rigid registration is arguably solved with optical and EM systems, deformable workspaces, where nonrigid changes necessitate an update to the preoperative surgical plan, continue to challenge researchers. For example, in transoral base of tongue surgery, deformations begin with setup: the patient’s neck is flexed, the mouth opened, and the tongue retracted. To capture these setup deformations for TORS, Liu et al. [33, 34] experimented with the alignment of preoperative CT to presurgical CBCT. Reaungamornrat et al. [35] developed a four-step nonrigid transformation in which a volume of interest (e.g., tongue and hyoid bone) is segmented in both the moving image (i.e., CT) and the fixed image (i.e., CBCT). These segmentation “masks” provide surface meshes from which two point clouds are defined. First, a Gaussian mixture (GM) [36] registration is used to compute a rigid initial global alignment of the two point clouds. Second, a GM nonrigid registration uses a thin-plate spline approach to perform deformable alignment of the point clouds. Third, a distance transform (DT) [37], consisting of the distance of each voxel to the surface mesh, is computed for both the moving and fixed masks. Fourth, a fast symmetric-forces variant of the Demons algorithm [38] is applied to register the two DTs. Operating on distance transforms makes the combined registration module intensity-invariant and thereby supports registration of surgical CAD/CAM models derived from other modalities, such as MRI, in addition to CT. Aside from this hybrid approach, many other deformable registration methods exist and warrant further investigation for otolaryngology, but they are beyond the scope of this chapter.
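The cited publications do not include code, but the third and fourth steps of this pipeline can be sketched with standard open-source tools. The following minimal Python sketch, assuming binary CT and CBCT segmentation masks that have already been initialized by the GM rigid and thin-plate spline steps, computes a signed distance map for each mask and then runs a fast symmetric-forces Demons registration on the two distance maps using SimpleITK; the file names and parameter settings are illustrative assumptions, not those of the original work.

import SimpleITK as sitk

def distance_map(mask):
    """Signed distance (mm) of each voxel to the mask surface (step three)."""
    return sitk.SignedMaurerDistanceMap(mask, squaredDistance=False,
                                        useImageSpacing=True)

def demons_on_distance_maps(fixed_mask, moving_mask, iterations=100):
    """Fast symmetric-forces Demons applied to the two distance transforms
    (step four); the registration is intensity-invariant because it sees
    only distances, not CT/CBCT gray values."""
    fixed_dt = distance_map(fixed_mask)
    moving_dt = distance_map(moving_mask)
    demons = sitk.FastSymmetricForcesDemonsRegistrationFilter()
    demons.SetNumberOfIterations(iterations)   # illustrative setting
    demons.SetStandardDeviations(2.0)          # smoothing of the displacement field
    displacement = demons.Execute(fixed_dt, moving_dt)
    return sitk.DisplacementFieldTransform(displacement)

# Binary masks assumed to come from the CT (moving) and CBCT (fixed) segmentations,
# already coarsely aligned by the GM rigid + thin-plate spline steps.
fixed = sitk.ReadImage("cbct_tongue_mask.nii.gz", sitk.sitkUInt8)
moving = sitk.ReadImage("ct_tongue_mask.nii.gz", sitk.sitkUInt8)
transform = demons_on_distance_maps(fixed, moving)
warped = sitk.Resample(moving, fixed, transform, sitk.sitkNearestNeighbor)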
20.2.2.2 Visualization
Viewing navigational data in interventional suites can be accomplished through a variety of media. Traditional imaging systems present their images on a 2D computer monitor, which requires the clinician to mentally register the given information with the operative scene. Navigational information can, however, be conveyed through audio, visual, and haptic means, with clear advantages gained by fusing multiple sources of information; for example, live fluoroscopy can be overlaid onto 3D volumes from CBCT angiographies. In endoscopic head and neck interventions, direct overlay of navigational information onto video images [26] provides a more natural integration with the primary visual displays. Video augmentation has been shown to be advantageous in monocular endoscopic skull base procedures [39], while stereoscopic augmented reality has been realized in operating microscopes and robotic surgical case studies [40]. With the advent of 3D visualization in consumer products (e.g., Google Glass, Microsoft HoloLens, Magic Leap), the millennial generation of surgeons and patients can expect visualization to advance in these directions.
For robotic surgery in otolaryngology, similar methods of navigation through video augmentation have been explored for cochlear implant [41], TORS studies using ex vivo animal and cadaveric models [33, 34], and clinical TORS with retrospective analysis [40]. In 2012, in a single clinical case study of TORS, Pratt et al. [40] augmented the da Vinci stereoscopic view by manually aligning models of segmented anatomy derived from preoperative plans. Their retrospective analysis of procedure footage noted beneficial opportunities for guidance in TORS, along with the observation that the degree of tongue muscle deformation induced by gag placement is significant. The need to bridge the gap between preoperative images and intraoperative setup was further emphasized by the ex vivo studies of Liu et al. [33, 34]. Using porcine tongue phantoms, they tasked a TORS surgeon with placing pins into embedded targets in order to evaluate target localization under varied methods of image guidance: (1) simulated current practice with preoperative images on a computer monitor, (2) intraoperative images on a computer monitor, and (3) video augmentation with intraoperative images. The experiments not only showed a statistically significant improvement in target localization error from (1) to (3) (4.9 ± 4.6 mm vs. 1.7 ± 1.8 mm, measured from the edge of the target), but the improvement from (1) to (2) also demonstrated the value of navigating with intraoperative imaging as compared to relying on preoperative data.
The experiments from Liu et al. [34] further highlight one of the main challenges in augmented reality, namely, stereopsis, or depth perception. Incorrect stereopsis has been a topic of discussion since the 1990s, when researchers noted natural spatial errors affecting virtual reality systems that portrayed 3D space on a 2D display [42]. Poor calibration or registration amplifies these errors in stereopsis, and the user typically observes virtual anatomy that appears detached and floating in front of the real scene. To counter such effects, Bichlmeier et al. [43] adjusted the transparency according to the position and line of sight of the observer, creating a significantly improved fusion of virtual objects into a realistic viewpoint in the scene. To directly address ambiguity in stereopsis for TORS, Liu et al. [44] extended their video augmentation guidance system with tool localization. Using the joint values of the robotic arm, their system tracked the primary surgical tool and communicated explicit depth information, relative to the tracked tool, through dynamic color changes of the virtual anatomical models. Further details of their experiments are discussed below in 20.3.
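Liu et al. do not report the implementation details of this color-based depth cue, but its principle can be illustrated with a short sketch. The hypothetical Python function below maps the signed depth gap between a kinematically tracked tool tip and a virtual target, both expressed in the camera frame, to an overlay color; the function name, thresholds, and color scheme are assumptions for illustration only, not the scheme used in the original system.

import numpy as np

def depth_cue_color(tool_tip_cam, target_cam, near_mm=2.0, far_mm=15.0):
    """Return an RGB color encoding how far the tool tip is from the virtual
    target along the camera's viewing (z) axis. Thresholds are illustrative."""
    depth_gap = target_cam[2] - tool_tip_cam[2]   # positive: target lies deeper than the tool
    if depth_gap <= near_mm:
        return (1.0, 0.0, 0.0)                    # red: tool at or beyond the target depth
    if depth_gap >= far_mm:
        return (0.0, 1.0, 0.0)                    # green: tool still well above the target
    s = (depth_gap - near_mm) / (far_mm - near_mm)
    return (1.0 - s, s, 0.0)                      # blend between the two extremes

# Example: tool tip and target expressed in camera coordinates (mm).
print(depth_cue_color(np.array([1.0, 2.0, 40.0]), np.array([0.0, 0.0, 48.0])))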
20.3 Preclinical Studies of Intraoperative Imaging and Navigation in Robotic Surgery
20.3.1 C-arm and Flat-Panel Cone Beam Computed Tomography (CBCT)
Continued advancements in imaging and robotics have led to the development of hybrid operating rooms that integrate intraoperative imaging systems (e.g., ultrasound, fluoroscopy, and cone beam computed tomography) with minimally invasive surgical systems (e.g., robotics, laparoscopy, and endoscopy) [18]. The design of these advanced ORs must consider functional needs in order to accommodate different perioperative setups and workflows [45]. C-arm platforms range from mobile bases to ceiling- or floor-mounted systems. Image quality and resolution also differ widely, from older technology using image intensifiers to newer technology using motor-actuated flat-panel detectors synchronized with an X-ray source. Thus, the imaging capabilities of intraoperative C-arms range from single planar 2D X-ray images to 3D reconstructed volumes (e.g., CBCT [33, 34, 41, 46]).
Room setup and surgical workflow are even more complex in cases that involve both CBCT and robotic assistance. In designing a hybrid operating room, the “free” workspace available after surgical setup, shared between an imaging system and a surgical robotic system, must be evaluated. For otolaryngology, workspace ergonomics were explored in a 2015 preclinical study of the Artis zeego (Siemens AG, Berlin, Germany) and a da Vinci Si (Intuitive Surgical Inc., Sunnyvale, CA) as an extension of the experiments conducted by Liu et al. [44]. Their experimental setup (Fig. 20.1a) showed that a full CBCT can be acquired with the base of the da Vinci® patient-side cart (PSC) positioned for intervention, provided all robotic arms are retracted. The ability to acquire a full CBCT image without repositioning the PSC (with robotic arms retracted) was also confirmed in an investigation of intraoperative CBCT guidance for a cadaveric cochlear implant [41] (Fig. 20.1d). Their work therefore showed that the free workspace in these two procedures supports intraoperative CBCT acquisition.
Fig. 20.1
Photographs of the da Vinci Si-zeego workspace. (a) Configuration for transoral robotic surgery using an in vivo porcine phantom with (b) fluoroscopy and (c) video augmentation. (d) Configuration for cochlear implant using a cadaveric phantom with (e) segmented critical structures from CBCT and (f) video augmentation
In the TORS experiments, Liu et al. [44] injected two synthetic tumors into the base of tongue of in vivo porcine models and ex vivo porcine tongue models. Using the da Vinci® and zeego setup above, they obtained a presurgical CBCT angiography, from which they derived models of segmented critical anatomy after surgical positioning. Segmentations included the lingual arteries, synthetic tongue resection targets (centroid and boundary), registration fiducials, the oral tongue, and the tongue base volume. A head and neck surgeon proficient in TORS performed mock tumor resections (with and without video augmentation as guidance) with the goal of achieving a 10 mm margin while controlling the lingual artery. As in Pratt et al. [40], in experiments with video guidance the models of critical anatomy were displayed directly as a transparent overlay in the stereoscopic viewport. However, in contrast to the preoperative data and manual updates used by Pratt et al., these images captured the surgical setup, and the overlay was automatically updated using the joint values of the robotic camera arm.
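The mechanism by which such an overlay follows camera motion is not spelled out in the text, but a standard formulation is to recompute the endoscopic camera pose from the camera arm’s forward kinematics each frame and project the registered anatomical models through a calibrated pinhole camera. The Python sketch below illustrates only this projection step; the pose, intrinsics, and model coordinates are illustrative assumptions, not values from the study.

import numpy as np

def project_points(model_pts_world, T_world_to_cam, K):
    """Project 3D model vertices (world/CBCT frame) into image pixels using a
    camera pose derived from the arm's forward kinematics and intrinsics K."""
    pts_h = np.c_[model_pts_world, np.ones(len(model_pts_world))]   # homogeneous coordinates
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]                   # transform into camera frame
    uvw = (K @ pts_cam.T).T                                         # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]                                 # pixel coordinates

# Illustrative values only: a 4x4 camera pose (as would be recomputed from the
# camera-arm joint values each frame) and intrinsics from endoscope calibration.
T_world_to_cam = np.eye(4)
T_world_to_cam[2, 3] = 60.0                       # camera 60 mm above the scene
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
tumor_boundary = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 2.0], [0.0, 5.0, 2.0]])
print(project_points(tumor_boundary, T_world_to_cam, K))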