Laryngeal High-Speed Videoendoscopy

Dimitar Deliyski

The purpose of this chapter is to lay the foundation for broader understanding of one of the most promising laryngeal imaging techniques, high-speed videoendoscopy (HSV). This technique may have significant impact in helping us uncover new phenomena in the mechanism of voice production and to better understand laryngeal pathology along with its impact on voice quality. HSV is the most powerful tool for the examination of vocal fold vibration to date. It will provide further insights into the biomechanics of laryngeal sound production, as well as enable more accurate functional assessment of the pathophysiology of voice disorders leading to refinements in the diagnosis and management of vocal fold pathology. For the clinic, HSV is capable of providing unmatched functional and structural information about the larynx, which will ultimately improve clinical practice in speech-language pathology and otolaryngology.

Without claims of completeness, this chapter describes the origins and principles of the HSV technique, the technical considerations important for making HSV clinically useful, the advantages of HSV over videostroboscopy, the unsolved challenges to HSV delaying its wide clinical implementation, and the directions in which HSV is expected to improve our research and clinical abilities. Several clinical applications of HSV are reviewed.

Origins and Principles of High-Speed Videoendoscopy

During phonation, the vocal folds usually open and close over 100 times per second and vibrate at velocities approaching 1 meter per second, making it impossible to view this activity with the unaided eye.1 For centuries, scientists and clinicians have been trying to build instruments allowing visualization of this fast vibration. To present such fast vibration to the human eye, one has to “slow it down.” There are three methods for slowing down fast motion.

The most obvious method to “slow down” vocal fold vibration is by optically photographing the fast-vibrating vocal folds at speeds several times faster than the frequency of vibration, then presenting those images to the human eye at significantly slower rates. This is the principle of high-speed imaging (Fig. 28.1). Until the late 19th century, little was known about the limits of visual perception, and building a high-speed imaging machine required technologies not available at the time. Therefore, scientists and engineers had to search for alternative methods.

Fig. 28.1 Example of a sequence of color HSV images containing two glottal cycles. The frequency of vibration of the vocal folds (male subject) is 126 Hz and the HSV frame rate is 4000 fps, producing a sequence of ∼32 images per glottal cycle (Video Clip 53).

Another “indirect” approach to the evaluation of vocal fold vibration is by recording signals that result from the vibration and presenting them in a graphic format to the human eye. One can then infer conclusions about the characteristics of vocal fold vibration by analyzing the graphic images. The invention of the phonograph and the gramophone allowed for obtaining “visible graphic recordings” (ie, acoustic waveforms).² Consequently, the advances in acoustic voice analysis, in electroglottography (EGG) and photoglottography, and the ability to record signals for transglottal airflow and intraoral and subglottal pressure provided invaluable indirect information about the vibration of the vocal folds. The knowledge learned via these technologies helped to refine the models and theory of voice production and stimulated the building of instrumentation that improved clinical voice assessment.

The third approach for “slowing down” the vibration of the vocal folds is by taking advantage of the stroboscopic effect, which is possible due to the quasi-periodic nature of the vocal fold vibration. In the late 19th century, Oertel published the earliest application of stroboscopic principles for observing vocal fold vibration.3 Later, combining indirect voice signals, acoustic or EGG, with a film or video camera, led to the invention of the most widely used instrument for laryngeal imaging today, the videostroboscopic system.⁴ Chapter 11 provides full details about the principles and clinical application of videostroboscopy.

The first high-speed motion picture machine was built in the 1930s, leading immediately to several studies of vocal fold vibration.^5,6 These and other later studies have become some of the most important works in understanding laryngeal physiology.^5–8 But the technology for high-speed imaging was too impractical until the mid-1990s, when two types of high-speed imaging systems became commercially available, the videokymography (VKG) and the HSV systems.^9,10 VKG could scan a single line across the vocal folds at a speed of 7800 lines per second, and the HSV systems could scan a full image at speeds up to 2000 frames per second (fps). These first systems provided monochromatic images with poor resolution and image quality. The VKG was faster and less expensive than HSV but could scan only one section on the anterior-posterior plane of the vocal folds and lacked a mechanism for feedback about which line was being scanned. A new-generation VKG system has resolved many of these concerns.¹¹ In the meantime, leveraging the technological gains in machine vision helped to improve HSV technology tremendously.¹² Today, high-speed cameras can record at frame rates up to 1,000,000 fps. They can record in color (Fig. 28.1), with high spatial resolution and excellent image quality, for longer durations.¹³ Before overwhelming the reader by presenting technical parameters, it is important to explain: What makes HSV superior to videostroboscopy?

Advantages of High-Speed Videoendoscopy over Videostroboscopy

Videostroboscopy

To elicit an effect of “slow motion,” videostroboscopy relies on the assumption of the near-periodic nature of vocal fold vibration. Figure 28.2A reiterates an illustration of the principle of videostroboscopy from Chapter 11. In videostroboscopy, the resulting “slow-motion” glottal cycle is artificially assembled from images sampled from consecutive phases taken from different glottal cycles. The strobe light flashes are short (10 to 20 μsec), delivered only during the cycle phases for which images are taken.

It is important to realize that in the case of an aperiodic signal, the near-periodic assumptions do not hold. When the acoustic or EGG signal is aperiodic, the timing of strobe flashes does not correspond with the phases of the glottal cycle in the desired sequence. Even subtle variations in periodicity can produce completely distorted or unrealistic videostroboscopic sequences. Depending on the type of aperiodicity, the distortions may produce random-appearing vibrations, may change the balance between the timing of the opening and closing phases of the glottal cycle, may produce a reverse-appearing motion during a portion of the cycle or through the entire cycle, or may “lock” out of the closed phase, making it appear that the glottis never closes completely. All these effects can occur even in vocal fold vibration with very good overall periodicity, where the irregularity is so slight that it is not visually perceivable. At the same time, relatively pronounced aperiodic patterns may be able to synchronize well with the strobe light, producing an illusory “regular” cycle. There are at least three reasons for these effects:

Videostroboscopy is a hybrid between an acoustic analysis system and an imaging system. The videostroboscopic effect relies on analysis of the acoustic or EGG waveform. We typically classify videostroboscopy under the category of laryngeal imaging systems. That is partially true, because the end product of a videostroboscopic exam is a series of images. However, which images are being presented depends on acoustic analysis. Therefore, from the point of view of the analysis of vocal fold vibration, videostroboscopy is an acoustic analysis technique, not a true imaging technique. The videostroboscopic vocal fold vibratory patterns are determined by the acoustic waveform, not by the actual biomechanical vibration. All limitations to acoustic voice analysis are present in videostroboscopy.¹⁴

Fig. 28.2 Illustration of the principle of sampling in videostroboscopy (A) and in HSV (B, C). (A) In videostroboscopy, the resulting “slow-motion” glottal cycle (below) is artificially assembled from images sampled from consecutive phases in different glottal cycles (above); whereas (B) in HSV, each cycle (below) is represented by images sampled within that very same cycle (above); that is, there is a “true” intracycle slow-motion viewing achieved by zooming, or warping the timescale. (C) When increasing the frame rate, HSV represents more accurately the details of the vibration within the glottal cycle.
For each video frame, videostroboscopy relies on pitch tracking through a laryngeal contact microphone or EGG to predict the phase of the upcoming glottal cycles. Thus, the acoustic or EGG signal during one 30-msec video frame is used to predict the frequency of the next phase of vibration that will be recorded in the upcoming video frame, assuming that the period will not change within the next 30 msec. Obviously, in the case of aperiodic vibration, or aperiodic acoustic or EGG waveforms, the videostroboscopic images will present in random or chaotic order and will not be representative of the actual vibration pattern.

It is also important to note that high aperiodicity of the acoustic or EGG signal does not always mean that the period of vibration is highly irregular. A visible irregular pattern of vocal fold vibration would always cause severe irregularities in the acoustic waveform, translating into perceptual effects of severe dysphonia or aphonia. However, period perturbations of the acoustic waveform do not necessarily mean that the period of vibration is visibly aperiodic. Most of the increased acoustic perturbation cases are not related to “visible” period irregularities of the vibration.15 The variations of the acoustic period are not necessarily caused by variations in the period of the glottal cycle. The glottal cycle period may be stable overall, but the cycle-to-cycle variations of local (intracycle) vibratory features, such as glottal width, symmetry, open quotient, mucosal wave, mucus bridges, and/or loss of contact, may be producing acoustic period perturbations. Videostroboscopy interprets these acoustic period perturbations as vibratory period instability, leading to an overdiagnosis of aperiodicity.^15,16

In summary, aperiodic vibration or aperiodic acoustic waveforms cause the strobe light to lose synchronization with the actual phase of vocal fold movements, preventing visualization in “slow motion.” As a result, videostroboscopy cannot be used on persons whose voice disorder has caused their vocal fold movement, or acoustic waveform, to become aperiodic. Thus, many patients, mainly those exhibiting dysphonia, cannot benefit from the technology of videostroboscopy even though it is considered the current gold standard for laryngeal imaging.
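The loss of synchronization described above can be illustrated with a toy model. The sketch below is not how any particular videostroboscopic system is implemented; it simply schedules each flash as if the glottal period were constant and then reports which phase of the actual cycle each flash illuminates. The function name, the 125 Hz vibration, the four cycles per video frame, and the 1% period jitter are all assumptions chosen for illustration.

```python
import numpy as np

def strobe_phases(periods, n_frames, cycles_per_frame, phase_step, assumed_period):
    """Return the true glottal phase (0 to 1) illuminated by each strobe flash.

    periods: actual duration of each glottal cycle, in seconds.
    Flash times are scheduled as if every cycle lasted assumed_period.
    """
    periods = np.asarray(periods, dtype=float)
    starts = np.concatenate(([0.0], np.cumsum(periods)))  # actual cycle onsets
    phases = []
    for n in range(n_frames):
        # Flash n is aimed at phase n * phase_step, several whole cycles later.
        t = n * (cycles_per_frame + phase_step) * assumed_period
        i = np.searchsorted(starts, t, side="right") - 1  # cycle containing t
        phases.append((t - starts[i]) / periods[i])
    return np.array(phases)

T = 1.0 / 125.0                       # assumed 125 Hz vibration
rng = np.random.default_rng(0)
regular = np.full(40, T)              # perfectly periodic cycles
jittered = T * (1.0 + 0.01 * rng.standard_normal(40))  # assumed 1% period jitter

ideal = strobe_phases(regular, 8, 4, 0.1, T)
distorted = strobe_phases(jittered, 8, 4, 0.1, T)
print(np.round(ideal, 3))      # phases advance in even steps of 0.1
print(np.round(distorted, 3))  # phases drift away from the intended sequence
```

With periodic input the flashes step evenly through the cycle. With jittered input the timing error accumulates over the cycles between flashes, so the assembled "cycle" no longer follows the intended phase sequence.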

Videostroboscopy is a technique that revolutionized clinical management of voice disorders and laryngeal pathology. However, it is applicable only to sustained phonation tasks for individuals with stable phonatory characteristics. Accurate and reliable assessment cannot be achieved for individuals with pronounced dysphonia. Videostroboscopy is not applicable for evaluating transient vocal fold vibratory behaviors, such as phonatory breaks, laryngeal spasms, and the onset and offset of phonation. It cannot be used for tasks involving vocal attack, coughing, throat clearing, laughing, and other activities including rapid laryngeal maneuvers.

High-Speed Videoendoscopy

In contrast with videostroboscopy, HSV is the only technique that captures the true intracycle vibratory behavior through a true series of full-frame images of the vocal folds. Therefore, HSV, by default, overcomes the above limitations of videostroboscopy, providing for the possibility of a more reliable and accurate objective quantification of the vocal fold vibratory behavior regardless of whether this behavior is periodic or aperiodic.

Figure 28.2B illustrates the principle of HSV sampling in comparison with stroboscopic sampling (Fig. 28.2A). In HSV, each resulting glottal cycle is represented by several images sampled within that very same cycle (ie, there is a “true” intracycle slow-motion viewing achieved by zooming the timescale). The lighting is constant, not intermittent as in stroboscopy. HSV is recording constantly, and no information can be missing between the frames. HSV data contains all frames, not just selected ones. Therefore, HSV supersedes videostroboscopy. We have demonstrated that videostroboscopy can be produced from HSV using simulated stroboscopy with audio (SSA).¹³ The advantage of SSA is that it uses the actual vibration, not an indirect acoustic signal, to establish the glottal cycle phases in producing the stroboscopic effect, thus it does not suffer the tracking errors typical for videostroboscopy.

In addition, HSV is a superset of VKG, because it contains all the VKG lines from the anterior to the posterior, whereas VKG contains only one line.13 Therefore, kymography can be produced from HSV by selecting a particular line across the anterior-posterior axis, a process termed digital kymography (DKG).
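As a minimal sketch of the idea behind DKG, the example below treats an HSV recording as a three-dimensional array and extracts one line across the glottis from every frame. The array layout (frames, then rows along the anterior-posterior axis, then columns left to right), the function name, and the synthetic data are assumptions for illustration, not a description of any published DKG software.

```python
import numpy as np

def digital_kymogram(hsv, line):
    """Extract one kymographic line from an HSV recording.

    hsv: array of shape (frames, rows, cols) holding pixel intensities,
         with rows along the anterior-posterior axis (assumed layout).
    Returns an image of shape (cols, frames): left-right position by time.
    """
    hsv = np.asarray(hsv)
    return hsv[:, line, :].T

# Synthetic stand-in for a recording: 200 frames of 64 x 48 pixels.
hsv = np.random.default_rng(0).integers(0, 256, size=(200, 64, 48), dtype=np.uint8)
dkg = digital_kymogram(hsv, line=32)
print(dkg.shape)  # (48, 200)
```

Because the full recording is retained, the same call can be repeated for any other line, which is what distinguishes DKG from a single-line VKG recording.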

These properties make HSV uniquely suitable for presenting the same content either dynamically (ie, as a movie) or kymographically. Not only does HSV record the true glottal cycle; it also records a series of many glottal cycles, allowing for the study of cycle-to-cycle variation in the local (intracycle) vibratory features over time.

Assessment of Vibratory Features of Sustained Phonation

The purpose of videostroboscopy is the assessment of vocal fold vibratory features in sustained phonation. In a stroboscopic exam, the voice assessment protocol includes vibratory features such as periodicity, symmetry, mucosal wave, open quotient, glottal closure, and mucus aggregation. HSV can be used to elicit all features of the stroboscopic protocol. However, due to the higher temporal resolution and tracking reliability of the HSV technique, some of the features appear differently, and new important aspects of these features can be observed.

Symmetry

Vibratory symmetry of the glottal cycle can be regarded in several ways. In a videostroboscopic evaluation, asymmetry is judged in the left-right dimension evidenced by amplitude and phase differences between the left and right vocal folds. A recent systematic categorization of asymmetry differentiated four aspects of left-right asymmetry: amplitude, phase and frequency differences, and axis shifts.¹⁷ Another important aspect of asymmetry is the anterior-posterior phase asymmetry, which is often manifested through the hourglass or zipper effects during vocal fold closure.¹⁸ Anterior-posterior phase asymmetry is defined as the anterior and posterior portion of one vocal fold reaching maximal glottal opening at different times within the glottal cycle. Left-right phase asymmetry is defined as the two vocal folds reaching maximal glottal opening at different times within the glottal cycle. Left-right amplitude asymmetry is defined as the two vocal folds having different maximal amplitudes of glottal opening within the glottal cycle. Left-right frequency asymmetry is defined as the two vocal folds vibrating at different frequencies. Axis shifts are defined as the spatial location of the opening of the vocal folds within the glottal cycle shifting to the left or to the right from the location of last contact. HSV allows for the objective visualization of all five aspects of asymmetry, whereas videostroboscopy can be used only for two, left-right amplitude and phase asymmetry.^18,19 In videostroboscopy, these two features are usually judged together because it is difficult to perceptually separate them, and their visualization is limited only to periodic vibration and acoustic waveforms.

Period and Glottal Width Irregularity

Regularity, or periodicity, of vocal fold vibration can be defined as the exact repetition of a spatial-temporal pattern. Thus, irregularity and aperiodicity refer to any change of this pattern over time. The most common visually judged features of vocal fold vibratory regularity are glottal period regularity and glottal width regularity, which reflect the two aspects of the spatial-temporal pattern.^15,19 Both HSV and videostroboscopy can be used for assessing period and glottal width regularity. However, the reliability of videostroboscopy suffers significantly in the presence of irregularity due to tracking problems, whereas HSV can visualize any irregular vibratory pattern. In videostroboscopy, the determination of irregularity is essentially based on the acoustic or EGG signal’s properties, not on the actual vibration properties. In addition to the ability to precisely record irregular patterns, HSV allows for presenting these patterns in a spatial-temporal domain using DKG, making them more comprehensible.

Mucosal Wave

Mucosal wave is one feature that is generally thought to be a good global indicator of vibratory behavior. Mucosal wave is the propagation of the epithelium and superficial layer of the lamina propria from the inferior to the superior surface of the vocal folds during phonation. The presence, magnitude, and symmetry of the mucosal wave are indicators of tension and pliability of the underlying vocal fold tissue and are essential to the production of good voice quality.20 Due to the anatomic configuration of the vocal folds and the superficial viewpoint of rigid endoscopy relative to them, the mucosal wave is viewed through two different aspects. The first viewing aspect is the lateral propagation of the mucosal wave between the vocal folds, where the mucosal wave is seen as the differential between the lower and the upper margins of the vocal folds during closing. This view begins with the closing phase, from the moment of adduction of the lower margins of the vocal folds through the end of the adduction of the entire folds. The second viewing aspect is the propagation of the mucosal wave on the upper surface of the vocal folds. This view begins during the closing phase from the upper margins of the vocal folds. While the vocal folds are adducting to close the glottis, the mucosal wave is traveling in the opposite direction, toward the exterior margins of the vocal folds. Both HSV and videostroboscopy allow visualization of the two mucosal wave aspects. However, HSV provides more objective visualization, especially through the use of DKG. Due to its high velocity of propagation, the mucosal wave is the feature most sensitive to the frame rate of the HSV system.²⁰ Our investigations show that for achieving full viewing of the mucosal wave features, the frame rate has to be at least 16 times higher than the frequency of vibration. That is, to track the detail of mucosal wave propagation, the frame rate has to be at least 2000 fps for a man with a fundamental frequency (F₀) of 125 Hz; at least 4800 fps for a woman with F₀ = 300 Hz; and at least 16,000 fps for a woman producing falsetto with F₀ = 1000 Hz.
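The arithmetic behind these frame-rate requirements can be written out as a short sketch. The function names are hypothetical; the factor of 16 frames per glottal cycle and the example frequencies come from the text, and the last line uses the recording in Fig. 28.1 (126 Hz at 4000 fps).

```python
def min_frame_rate(f0_hz, frames_per_cycle=16):
    """Minimum frame rate (fps) needed to view mucosal wave detail at a given F0."""
    return frames_per_cycle * f0_hz

def images_per_cycle(frame_rate_fps, f0_hz):
    """Average number of HSV images captured per glottal cycle."""
    return frame_rate_fps / f0_hz

for f0 in (125, 300, 1000):
    print(f0, "Hz ->", min_frame_rate(f0), "fps")

print(round(images_per_cycle(4000, 126), 1))  # 31.7 images per cycle
```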

Open Quotient

Open quotient is the ratio of the time the vocal folds spend in the opening and closing phases to the duration of the entire vibratory cycle.^19,21 HSV allows for measuring open quotient because it provides the true intracycle information for each glottal cycle.
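One way such a measurement might be sketched, assuming a glottal area value has already been extracted for every frame of a single cycle, is to count the fraction of frames in which the glottis is open. The function name, the threshold parameter, and the sample values are hypothetical; the segmentation step that produces the area values is not shown.

```python
import numpy as np

def open_quotient(area, closed_threshold=0.0):
    """Fraction of frames in one glottal cycle during which the glottis is open.

    area: glottal area per frame for exactly one cycle (assumed already segmented).
    closed_threshold: area at or below this value counts as closed (assumption).
    """
    area = np.asarray(area, dtype=float)
    return float(np.mean(area > closed_threshold))

cycle = [0, 0, 0, 2, 5, 7, 5, 2]  # hypothetical glottal area per frame, in pixels
print(open_quotient(cycle))  # 0.625
```

Applying the same function to each successive cycle would give the cycle-to-cycle variation of open quotient, which videostroboscopy cannot provide.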

Contact and Loss of Contact

Glottal closure is the pattern of vocal fold contact at the closed phase of vibration. It is generally categorized as closed, hourglass, anterior gap, posterior gap, or irregular. This feature can be viewed through both HSV and videostroboscopy. However, it is very important to report whether the realization of contact and loss of contact are changing from one cycle to the next. Only HSV can provide this information due to its inherent true cycle-to-cycle visualization.

Mucus and Mucus Bridges

Vocal fold mucus aggregation is common in persons with voice disorders. It is known that an increase in vocal fold mass, from mucus, will change vocal fold vibratory behavior. Mucus has been noted as the causal factor of rough vocal quality. The presence, type, thickness, location, and pooling of mucus aggregation are important indicators of how mucus is impacting vocal quality.²² Mucus can be evaluated with both HSV and videostroboscopic techniques, and videostroboscopy is generally more sensitive due to its better spatial resolution and image quality. However, another feature important for voice quality, the cycle-to-cycle variation of mucus bridges forming between the vocal folds during loss of contact, can be studied only through HSV.

As indicated, many vibratory features can be studied by either videostroboscopy or HSV. However, most of the features appear different from videostroboscopy when viewed using HSV.15,16,18,20,22,23 Voice clinicians, including speech-language pathologists and laryngologists, have been highly trained to use videostroboscopy. When using HSV, they may attempt interpreting vibratory features relative to the norms used in the clinic with videostroboscopy. Thus, there is a risk that a new and very different technique may not be found useful, unless a smooth transition is realized. An important first step in such transition is to generate HSV-specific clinical norms. This topic is covered later in the section “High-Speed Videoendoscopy in the Clinical Speech-Language Pathology Practice.”

Analysis of Aperiodic Phenomena

HSV is uniquely suited for studying aperiodic vibration and other fast movements. This is an area in which videostroboscopy has no utility. Videostroboscopy cannot be used on persons whose voice disorder has caused vibration with perturbed periodicity. Not only can HSV visualize such vibration, but also it allows measuring the degree of perturbation. HSV is applicable for evaluating most transient vocal fold vibratory behaviors.

Phonatory Breaks, Laryngeal Spasms, Onset and Offset of Phonation

HSV is the only imaging technique that can effectively record and visualize transient phonatory events. A better understanding of the nature and occurrence of these events is a very important area of voice research with strong implications for clinical practice, from the functional evaluation and diagnosis of various voice disorders through treatment planning and intervention.

Phonatory breaks are transient instabilities or short interruptions of the phonatory process. They are a typical phenomenon associated with several voice disorders. We have also sometimes seen them in vocally normal populations, especially within the first 100 msec after phonatory onset. HSV allows for precisely tracking the phonatory breaks, visualizing them, and assessing their temporal pattern and duration.

Laryngeal spasms can result in vocal fold abduction or adduction. They are thought to occur in neurologically based voice disorders and are most typical in spasmodic dysphonia. HSV allows for precisely tracking, categorizing, and measuring laryngeal spasms.

The characteristics of the onset and offset of phonation may be indicative of a specific type of voice disorder. Little is known in this area. The evaluation of vocal offset can provide invaluable information about vocal fold pliability by judging how quickly and orderly the end of phonation occurs. The evaluation of vocal onset can provide objective information about the maneuvers the patient performs in reaching the optimal phonatory threshold to begin phonation or about asymmetries due to left-right differences in mass and tension, which may not be visible during stable sustained phonation. HSV places this information at our fingertips, and it is only a matter of conducting sufficient research to create better protocols for objective voice evaluation.

Vocal Attack Time

The speed with which the vocal folds adduct to the midline is considered an important variable in the etiology of some voice disorders and may also be a meaningful indicator of central or peripheral neural dysfunction. Measuring vocal attack time has been addressed by Moore and by Werner-Kukuk and von Leden.5,8 HSV allows for precisely recording the voice onset for different types of glottal attacks and measuring useful physiologic characteristics. Recently, we have used HSV to successfully validate a vocal attack time measurement, which is discussed in more detail later in the section “High-Speed Videoendoscopy as a Research Tool for Voice Science.”²⁴

Coughing, Throat Clearing, Laughing, and Other Activities Involving Rapid Laryngeal Maneuvers

Coughing and throat clearing are considered to be potentially harmful to the vocal folds from the point of view of vocal hygiene. Clinicians typically recommend “safer” mucus clearing behaviors, such as “soft” cough and clear. But little is known about the biomechanics of these processes and how they are actually harmful to the tissue. HSV allows for precise visualization, registration, and measurement of the physical attributes of these behaviors. Our ongoing research effort in this area may provide clinically useful data. On a separate note, laughing, clearing, and other similar quickly varying laryngeal tasks are clinically useful as media for eliciting phonation in aphonic patients. HSV can be used for visualizing the vocal fold vibration during the short phonatory segments elicited through such clinical techniques.

Alaryngeal Speech

Developing instrumental or perceptual techniques for the evaluation of alaryngeal voice has always been a significant challenge. These voices do not qualify for acoustic voice analysis, present difficulties in using perceptual scales, and cannot be documented via EGG or videostroboscopy.14 HSV has been successfully used for visualization of the vibratory characteristics of the substitute voice generator and for automatic image segmentation of the neoglottis.²⁵ The ability of HSV to visualize and measure vibration after laryngectomy is important for evaluating the success of voice restoration.

Objective Automated Analysis

After everything said about HSV, it is important to make a clarification. HSV is a lot more than a slow-motion movie. Videostroboscopy is a technique designed primarily for slow-motion visualization of the fast vocal fold vibration during sustained phonation in real time, which is very limited in terms of objective measurement of the vibration. HSV is fundamentally different in that respect. The visualization is very accurate, presented in warped (delayed) time. Every characteristic of the visualized vibration is potentially measurable, because it is inherently accurately represented in the recording. HSV can be described as a data “cube,” which has two spatial coordinates, x (left-right) and y (posterior-anterior), and one temporal coordinate, t (time). Each point in the cube holds the intensity of one pixel. It is a solid “cube”: there is no missing information along any of the dimensions. Therefore, it is essential to demonstrate what is clinically relevant to develop the appropriate analytic technique for measuring it.

Several automatic and semiautomatic HSV-derived measurements have been reported in the literature.²⁶ They have been classified as follows:

Measures related to frequency of vibration: fundamental frequency; period perturbation quotient; coefficient of variation of F₀; voice breaks; vocal tremor (F₀-modulation) frequency and magnitude.
Measures associated with glottal symmetry: left-to-right phase, amplitude, and frequency symmetry quotients; axis shifts; posterior-to-anterior symmetry concurrence (showing whether some parts along the vocal fold have different symmetry parameters than others).
Measures related to glottal width and area characteristics: open and closed quotient; glottal area perturbation quotient; coefficient of variation of glottal area; soft phonation index; vocal tremor (glottal area modulation) frequency and magnitude.
Measures reflecting unilateral dynamic characteristics: activity/displacement of the left or the right vocal fold, and ratio of left versus right vocal fold.
Measures related to mucosal wave properties: mucosal wave presence; symmetry quotient; relative area to glottal open area; sharpness pattern.
Measures related to vertical movement during phonation: left-to-right vertical symmetry, computed through the image intensity.
Measures assessing modal types: vibration-based voice typing (similar to types 1, 2, and 3 per Titze); automatic classification of bifurcation patterns (eg, periodic, biphonia, diplophonia, vocal fry, aphonia, vocal onset, vocal offset, etc.); subharmonic level (ie, which is the most active subharmonic of F₀ [first, second, etc.]).¹⁴
Semiautomatic measures reported objectively: manually placed posterior and anterior commissure markers; manually tagged transient events; visually classified patterns of vibration.¹⁷
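To make the first class of measures concrete, the sketch below computes the mean F₀ and the coefficient of variation of F₀ from a list of glottal cycle durations. It assumes the cycle durations have already been extracted from the HSV recording, and it uses the population standard deviation; published implementations may differ on both points. The function name and sample values are hypothetical.

```python
import numpy as np

def f0_statistics(periods_s):
    """Mean F0 (Hz) and coefficient of variation of F0 (%) from cycle durations.

    periods_s: duration of each glottal cycle in seconds (assumed already measured).
    """
    f0 = 1.0 / np.asarray(periods_s, dtype=float)
    mean_f0 = float(np.mean(f0))
    cv_percent = float(100.0 * np.std(f0) / mean_f0)  # population standard deviation
    return mean_f0, cv_percent

mean_f0, cv = f0_statistics([0.0080, 0.0081, 0.0079, 0.0080])
print(round(mean_f0, 1), "Hz; CV =", round(cv, 2), "%")
```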

Some of these objective measures have been compared with visual perceptual ratings: period and glottal width irregularity, left-right phase and amplitude asymmetry, axis shifts during closure, and open quotient.^15,18,21 The findings suggest that objective measures usually differ from visual subjective ratings, underscoring the limits of human perception and the importance of developing robust automated measurement techniques. New objective HSV measures are being developed through the phonovibrography method.²⁷ Objective measures that have been reported using VKG (amplitude symmetry, speed quotient, and phase symmetry index) are highly applicable to HSV.²⁸ Whereas several research teams are actively developing HSV-based objective measures, the clinical efficacy of these measures is still under investigation. There are no established standards or commercially available software products at this time.

Relationships between Vibration and Acoustics

Due to the high temporal resolution of HSV, it is possible for the first time to precisely align the HSV images with acoustics and other voice signals (Fig. 28.3), such as EGG, transglottal airflow, intraoral and subglottal pressure, and accelerometry. This is exciting for two reasons. First, voice science can better understand the relationships between vocal fold vibration and the resulting voice, leading to important refinements of the models of voice production. Additionally, combining HSV measures of vocal fold vibration with concurrent acoustic and EGG measures may provide complementary, high-precision measures that can improve the clinical practice. Scientific investigations of these relationships are under way. Several examples are presented later in the section “High-Speed Videoendoscopy as a Research Tool for Voice Science.”

Technical and Methodological Considerations Using High-Speed Videoendoscopy

This section is intended to provide a practical understanding of the HSV technology. There are two aspects, technical and methodological. The technical part is concerned with acquiring high-quality HSV data. That is, making sure that all vibratory information of interest was recorded correctly, with sufficient spatial and temporal image quality. The methodological aspect is concerned with the efficacy of presenting the relevant information to the clinician or researcher. That is, finding ways of complementing the playback of the HSV movie by other, more intuitive facilitative playbacks and objective measures that can reveal the relevant content, which is often hidden to the human eye through the HSV movie playback.

Important Technical Characteristics of High-Speed Videoendoscopy

An HSV system typically consists of the following elements: a digital high-speed camera (monochrome or color); a 70-degree or 90-degree rigid laryngeal endoscope; an endoscopic lens adapter; a powerful light source (usually 300 W constant xenon); a trigger button; a computer controlling the camera via specialized software for image acquisition and real-time video feedback; a computer monitor; and a wheeled equipment cart. The camera may be connected to the computer either via a specialized hardware card or via a standard Ethernet or FireWire interface. In some cameras, the digital processing circuitry is in a separate box installed on the cart, which allows for a lighter camera head attached to the endoscope. Heavier cameras, 2 lb and above, may be weight-balanced using a camera crane. The synchronous recording of additional signals is available with more advanced configurations. Such systems include additional hardware and software. Our HSV system, designed at the Voice and Speech Laboratory, University of South Carolina (Columbia, SC) (Fig. 28.3), includes the following additional elements: an 8-channel data acquisition card; a head-mount condenser microphone; a microphone preamplifier; an EGG device; a frequency divider; data acquisition software; and a second monitor to separate the HSV image from the channel data feedback.

Sensitivity

The digital high-speed cameras are photon-integrating devices. The complementary metal-oxide semiconductor (CMOS) photo sensor of the camera is divided into pixels (individual photo cells), each usually 10 μm × 10 μm in size, or larger, up to 22 μm × 22 μm. For the duration of one frame of the recording, each pixel “counts” the number of photons reflected from the surface of the anatomic structures that “fall” on the surface of that pixel. The stronger the intensity of light reflection and/or the longer the integration time, the higher is the number recorded for that pixel (ie, the brighter that pixel will be in the recorded movie). The amount of light that the tissue can absorb safely is limited. Thus, the sensitivity of the sensor’s pixels and the duration of each frame’s integration time are important parameters for eliciting an image. The most sensitive high-speed cameras today have monochrome sensitivity of 6400 ISO per 1280 × 800 pixels sensor (Vision Research, Inc., Wayne, NJ) and 6400 ISO per 1000 × 1000 pixels sensor (Photron Inc., San Diego, CA), which provides similar sensitivity per pixel. The sensitivity of the color versions of these cameras is 1600 ISO.

Fig. 28.3 A block-circuit of the HSV system designed at the Voice and Speech Laboratory, University of South Carolina, which allows aligning precisely the HSV images with acoustic, EGG, and other voice signals. A Phantom v7.3 high-speed camera (Vision Research, Inc.) is clocked by the sampling rate of an 8-channel M-Audio Delta 1010LT data acquisition card (Avid Technology, Inc.) after 1:6 frequency division. The camera “Ready” signal makes it possible to achieve accuracy of synchronization of 11 μsec. This architecture permits exactly attributing the six acoustic or EGG samples corresponding with each frame.
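The frame-to-sample attribution described in the caption can be illustrated with a minimal sketch. It uses the 1:6 frequency division and the 96,000 Hz per-channel sampling rate reported in this chapter. The function names, and the assumption that frame 0 begins exactly at sample 0, are hypothetical and serve illustration only.

```python
AUDIO_RATE_HZ = 96_000  # sampling rate per data channel (value reported in this chapter)
DIVIDER = 6             # 1:6 frequency division between audio clock and camera clock


def frame_rate(audio_rate_hz=AUDIO_RATE_HZ, divider=DIVIDER):
    """Camera frame rate obtained by dividing the audio sampling clock."""
    return audio_rate_hz / divider


def samples_for_frame(frame_index, divider=DIVIDER):
    """Indices of the audio/EGG samples that fall within one HSV frame.

    Assumption (not from the source): frame 0 starts at sample 0.
    """
    start = frame_index * divider
    return range(start, start + divider)


def frame_for_sample(sample_index, divider=DIVIDER):
    """Index of the HSV frame during which a given sample was acquired."""
    return sample_index // divider
```

Under these assumptions the clock division gives 16,000 fps, and every frame owns exactly six consecutive samples, so no interpolation is needed when aligning the signals.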

Integration Time

The high-speed camera integrates the light reflected from the tissue surface for a given time corresponding with one frame. Each recorded sample is an image, termed frame, constructed of many pixels. For example, if the HSV frame rate is 2000 fps, each second of time is divided into 2000 recording sessions following every 500 μsec. The integration of each frame takes most of the 500-μsec period, given that the only time the integration is not active is during the “reset” time for each frame, which is negligible (∼2 μsec). This is the most fundamental difference between stroboscopy and HSV. In videostroboscopy, the strobe flashes are intermittent. If they cannot be precisely timed, the resulting reconstructed glottal cycle image sequence is incorrect. HSV is recording everything and no information is missing between the frames (Fig. 28.2).
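The arithmetic in this paragraph can be written out as a short sketch. The reset time of about 2 μsec is the value quoted above; the function names are illustrative assumptions.

```python
def frame_period_us(fps):
    """Time between the starts of consecutive frames, in microseconds."""
    return 1_000_000 / fps


def integration_time_us(fps, reset_us=2.0):
    """Approximate light-integration time per frame.

    reset_us is the per-frame sensor reset time (about 2 microseconds in the text).
    """
    return frame_period_us(fps) - reset_us


def integration_fraction(fps, reset_us=2.0):
    """Fraction of the frame period during which the sensor collects light."""
    return integration_time_us(fps, reset_us) / frame_period_us(fps)
```

At 2000 fps this gives a 500-μsec period with roughly 498 μsec of integration, that is, the sensor is collecting light for about 99.6% of the time.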

Frame Rate

Although HSV technology provides true sampling of vibration, the selection of an appropriate frame rate is very important for the accurate recording of some of the relevant vibratory features. Figure 28.2C illustrates the effect of increasing the HSV frame rate on the accuracy of representing the vibration details within the glottal cycle compared with lower frame rates (Fig. 28.2B). The frame rate determines the integration time (ie, the time between frames). If the integration time is too long and the velocity of the features being filmed is too high, the fast-moving features are averaged through the integration period, and they appear blurred, out of focus, or may even become invisible. The faster the motion, the shorter the integration time has to be, thus the frame rate has to be higher. Based on our data, the fastest vocal fold vibratory features are the mucosal wave propagation and the movement of the vocal fold edges during the closing phase. Based on visual testing, we established a rule of thumb that the frame rate has to be at least 16 times higher than the frequency of the glottal cycle (in periodic sustained phonation same as F₀). That is, each cycle has to be presented by at least 16 images (Fig. 28.1). Therefore, in clinical settings, the optimal frame rate is ∼8000 fps, allowing for the evaluation of voicing tasks not exceeding F₀ of 500 Hz. That frame rate would cover most clinical tasks for men and women, such as habitual pitch and loudness phonation, onset and offset, high and low pitch in modal register, and breathy and pressed phonation. In some special tasks, such as falsetto register or pitch glides, even the rate of 8000 fps may be insufficient and some features may be underrepresented. 
The commonly used frame rate of 2000 fps is therefore inadequate: by this rule it would misrepresent the vibratory features of persons with an F₀ above 125 Hz (ie, most women would appear to lack a mucosal wave).²⁰ Several older studies have suggested that increased pitch relates to a reduced mucosal wave, a conclusion that probably resulted in part from the inferior technology of the time.
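The rule of thumb of at least 16 frames per glottal cycle can be checked numerically with the following sketch. The constant comes from the text; the function names are illustrative assumptions.

```python
FRAMES_PER_CYCLE = 16  # rule of thumb stated in the text


def min_frame_rate(f0_hz, frames_per_cycle=FRAMES_PER_CYCLE):
    """Lowest frame rate (fps) that still gives the required frames per cycle."""
    return f0_hz * frames_per_cycle


def max_f0(fps, frames_per_cycle=FRAMES_PER_CYCLE):
    """Highest fundamental frequency (Hz) adequately covered by a frame rate."""
    return fps / frames_per_cycle
```

With these definitions, 8000 fps covers phonation up to 500 Hz, whereas 2000 fps covers only up to 125 Hz, which matches the limits quoted above.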

Color

Traditionally, HSV systems have been monochromatic (black and white). The Voice and Speech Laboratory integrated the first known color HSV system back in 2003.¹³ Since then, we have learned about the advantages and caveats of color. Color is clinically important for correctly identifying anatomic structures and especially for identifying lesions and structural tissue changes. However, to achieve color, the sensitivity of the camera is reduced ∼4 times, because the light has to be channeled into three color filters (for red, green, and blue), and additional light loss is caused by filtering out the infrared and ultraviolet components, light absorption, and reflection. That translates into a fourfold reduction of the maximum frame rate of the HSV system for the same image quality relative to monochrome. Additionally, color significantly reduces the effective spatial resolution of the camera due to the Bayer mosaic color filtering used in single-chip color sensors. Consequently, for the same camera model at the same pixel resolution, the color version has a significantly lower effective resolution than the monochrome version, because the color image is obtained through interpolation. Thus, the edges of the vocal folds represented using the monochrome camera are more accurate. Color HSV systems have advantages when viewing the vocal folds, but monochrome images allow for more accurate measurement of the vibratory characteristics.

Lighting

HSV technology requires a lot of light due to the CMOS photon integration principles. Thus, increasing the amount of light can improve HSV image quality and frame rates. The type of light source used with most HSV systems today is 300 W constant xenon light. There is, however, a safety concern that further increasing the amount of light used with HSV can cause tissue damage. Additionally, it is considered possible that long exposures to a 300 W constant xenon light can cause tissue damage. No reports of such damage have been filed to date, but as a precaution it is recommended that the amount of time the vocal folds are exposed to light during an HSV exam be reduced to less than 20 seconds.

Spatial (Pixel) Resolution

Our experience shows that spatial resolution above 300 × 300 pixels is adequate for quality images and for automated analysis. Modern high-speed cameras offer much higher resolution, including high-definition resolution of 1920 × 1080 pixels, and up to 2048 × 2048 pixels (Vision Research, Inc.).

Effective Dynamic Range

Dynamic range is a measure of the brightness resolution of the sensor (ie, the ratio of the largest to the smallest light intensity that the camera can register). That ratio is related to the sensitivity of the camera in combination with the quantization levels (bits per pixel). For example, 8 bits per pixel provide for a maximum dynamic range of 256 (48 dB), whereas 12-bit quantization supports a maximum dynamic range of 4096 (72 dB) per pixel. Whether this is an effective dynamic range depends on the amount of noise in the lower bits (the smaller intensity values). If noise is present, the effective dynamic range is reduced relative to the maximum dynamic range. A high effective dynamic range allows one to “brighten” a dark image or to “darken” a very bright image without causing a distortion or loss of information. That is very important for achieving high image quality and for accurate image analysis.
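The decibel figures quoted above follow from the quantization depth, as the sketch below shows. The last function is a deliberately crude model, assumed here for illustration only, in which noise completely occupies a given number of the lowest bits.

```python
import math


def max_dynamic_range(bits):
    """Number of distinguishable intensity levels for a given bit depth."""
    return 2 ** bits


def dynamic_range_db(bits):
    """Maximum dynamic range in decibels (20 * log10 of the level ratio)."""
    return 20 * math.log10(max_dynamic_range(bits))


def effective_dynamic_range_db(bits, noisy_low_bits):
    """Effective range when the lowest bits carry only noise.

    Simplifying assumption (not from the source): noise fully occupies
    the lowest `noisy_low_bits` bits and nothing above them.
    """
    return dynamic_range_db(bits - noisy_low_bits)
```

Each bit contributes about 6 dB, so under this simplified model a 12-bit sensor with two noisy low bits behaves roughly like a clean 10-bit sensor.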

Weight

The achievement of ultrahigh speeds requires that all hardware, including the memory, be physically located inside the body of the camera. Thus, the fastest high-speed cameras are too heavy to be held by hand during the exam. The camera used in the Voice and Speech Laboratory system, Phantom v7.3 (Vision Research Inc.), weighs 7 lb. To compensate for the weight, we attached the device to a camera crane (model CamCrane 200; Glidecam Industries, Inc, Kingston, MA). The camera weight is balanced, to appear weightless to the operator, while allowing the most degrees of freedom for motion by using a ballhead. Other than creating weightlessness, the crane was found to reduce significantly the endoscopic motion and tilt, thus introducing a comfortable system to operate. Based on that experience, we recommend using a camera crane regardless of the weight of the camera.

Color and spatial resolution are very important factors when identifying lesions, vascularities, and tissue changes and for accurately representing the glottal edges. Spatial resolution is also important as it allows for the wide view angle necessary to examine the full anterior-posterior view of the vocal folds and their surrounding anatomic structures. The frame rate is essential for accurately displaying the mucosal wave and providing sharp glottal edges, especially when viewing high-pitched samples. Long recording duration is necessary to register multiple phonatory tasks in a continuous recording (ie, comfortable, high and low pitch, glides, loudness levels, repetitive phonation, and forced inhalation), adduction and abduction of vocal folds, and phonatory onset and offset. An increased dynamic range allows for improved viewing quality and increased accuracy of the automated image analyses. Due to insufficient clinical experiments with HSV, the necessary requirements for these factors have not yet been standardized.

Of all factors that influence HSV, the color, temporal resolution, and dynamic range are the ones limited by the sensitivity of the camera sensor. The spatial resolution, temporal resolution, and dynamic range are in a reciprocal relationship as they share the same hardware bandwidth and memory resources. The sample duration and spatial resolution depend on the memory available, while the view angle depends on the spatial resolution and the optics installed. Therefore, improvements in HSV technology depend on three factors: sensitivity of the camera sensor, hardware speed, and memory size. The most important and the most challenging factor is the sensitivity, due to the limitations of CMOS technology and the lack of demand for high sensitivity from the traditional market sectors using high-speed cameras.13
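The reciprocal relationship among spatial resolution, frame rate, bit depth, and recording duration can be made concrete with a rough estimate. The sketch assumes tightly packed pixels with no header or padding overhead, and the function names are hypothetical. Real cameras may store data differently.

```python
def data_rate_bytes_per_s(width, height, bits_per_pixel, fps):
    """Raw data rate of the image stream, assuming tightly packed pixels."""
    return width * height * bits_per_pixel / 8 * fps


def max_duration_s(memory_bytes, width, height, bits_per_pixel, fps):
    """Longest recording that fits in the given amount of camera memory."""
    return memory_bytes / data_rate_bytes_per_s(width, height, bits_per_pixel, fps)
```

For example, 12-bit monochrome recording at 320 × 320 pixels and 16,000 fps amounts to about 2.46 GB of raw data per second under these assumptions. For a fixed memory size, halving the frame rate or the pixel count doubles the available recording time.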

An ongoing collaborative effort between the Voice and Speech Laboratory, University of South Carolina, and the Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital (Boston, MA), led recently to the following breakthrough advances in HSV technology:

Color HSV allowing for high-quality 42-bit rigid videoendoscopy at the speed of 6000 fps at a spatial resolution of 400 × 480 pixels and for 24-bit rigid color HSV at 10,000 fps and 320 × 320 pixels resolution.
Ultrahigh-speed monochrome HSV allowing for high-quality 12-bit rigid videoendoscopy at the speed of 16,000 fps at a spatial resolution of 320 × 320 pixels and for 8-bit rigid HSV at 48,000 fps and 128 × 200 pixels resolution.
High-precision temporal synchronization of monochrome HSV at 16,000 fps with multiple channels of other data allowing for accuracy of synchronization around 11 μsec at a sampling rate of 96,000 Hz per data channel.
High-definition HSV allowing for optically zoomed, high-quality, 12-bit monochrome imaging of the vibrating vocal fold tissues at the speed of 4000 fps and spatial resolution of 600 × 800 pixels.
Flexible HSV allowing for the use of a regular nasal fiberscope at the speed of up to 6000 fps and spatial resolution of 320 × 320 pixels.

These examples are presented to the reader to provide a notion of what is considered to be the state of the art in year 2009. Although these HSV system integrations are experimental and not currently commercially packaged for clinical use, it is likely that in another 5 years from now, HSV systems with similar and better parameters will be available to the clinic at a nonprohibitive cost. More importantly, and in the meantime, the research on the clinical efficacy of HSV needs to accelerate to “catch up” with the advance of technology.

Methodology: High-Speed Videoendoscopy Offers Much More than a Slow-Motion Movie

The improvement of the HSV camera technology is essential for the accurate recording of the biomechanical information of vocal fold movement. The biggest advantage of HSV is the true visual presentation of movement of an anatomic structure that humans, especially the skilled clinicians, understand best.13 Advanced image processing techniques will complement the visual data with automated analyses and measurements.

That is, the presentation of the HSV content to the clinician can be made either visually or through measurements. Thus, the methodology for voice evaluation via HSV can be achieved via visual perceptual ratings and via automatic or manual objective measures. These are two mutually complementary approaches. In the rich HSV content, some of the vibratory information is difficult for the human eye to perceive but can be measured automatically, whereas other features are difficult to formalize as an algorithm but are intuitive to the human brain. Therefore, it is most likely that the HSV clinical voice evaluation protocol of the future will be a combination of visual ratings and objective measures.

Facilitative Playbacks

As noted earlier in the section “Advantages of High-Speed Videoendoscopy over Videostroboscopy,” HSV is a lot more than a slow-motion movie. There are many creative ways of presenting the HSV content in an intuitive form by preserving some of the spatial information so the clinician can follow the anatomy while the features of interest are emphasized for easy comprehension. This approach facilitates visual perception, improves the accuracy of quantification, and increases the reliability of visual rating. Special tools for enhanced visualization have been created, termed facilitative playbacks.¹³,²⁷ The following are some of the facilitative playbacks that have been successfully used thus far for research and clinical purposes: digital kymography playback, mucosal wave playback, mucosal wave kymography playback, and phonovibrogram.

Digital Kymography Playback. The normal sequence of viewing HSV recordings, termed HSV playback, is by sequentially presenting image frames with spatial coordinates x and y along the time axis t. Digital kymography (DKG) playback corresponds to viewing DKG image frames with coordinates x and t in a sequence presented along the posterior-anterior axis y. In DKG playback, the DKG frames are viewed as a movie sequence that plays from the posterior toward the anterior.¹³ The DKG playback can be regarded as a step up from multiplane kymography.²⁹ Figure 28.4 provides three snapshots from a DKG playback of sustained phonation taken in the posterior, medial, and anterior areas along the posterior-anterior axis. Figure 28.5 shows two snapshots from a DKG playback of a phonatory offset, taken in the posterior and medial areas. The DKG playback was found useful for demonstrating the change of the dynamic characteristics while viewing damaged tissues, such as lesions, scars, and discolored areas. Dynamic changes due to stiffness of the tissue, shown as a movie, may also help to reveal the nature of lesions (ie, cysts vs polyps). It is essential to point out the importance of endoscopic motion compensation in ascertaining time alignment of the anatomic structures for valid DKG representations.³⁰
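The exchange of the time axis and the posterior-anterior axis that defines DKG playback can be sketched as follows. The sketch assumes a motion-compensated recording stored as nested lists indexed video[t][y][x], with image rows running along the posterior-anterior axis and row 0 at the posterior end. These storage conventions and the function names are assumptions made for illustration.

```python
def kymogram(video, y):
    """Kymographic image for scan line y.

    video is indexed as video[t][y][x]; the result is indexed [t][x],
    i.e. one image row per recorded frame.
    """
    return [frame[y] for frame in video]


def dkg_playback(video):
    """Sequence of kymograms, one per scan line, ordered along the y axis.

    Playing this sequence as a movie sweeps the scan line across the glottis
    (posterior to anterior under the assumption that row 0 is posterior).
    """
    height = len(video[0])
    return [kymogram(video, y) for y in range(height)]
```

A recording of T frames of H × W pixels thus becomes H playback frames of T × W pixels. No pixel data are altered; only the order in which they are presented changes.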

Fig. 28.4 Example of a DKG playback of sustained phonation. DKG playback is a movie playing from posterior to anterior. The figure provides three snapshots taken in the posterior, medial, and anterior areas, respectively. The image on the left shows an average image of the vocal folds with the line being scanned across the glottis. On the right, the corresponding kymographic image is shown. The actual DKG playback movie used in this example is provided as Video Clip 54 in the DVD accompanying this book.

Mucosal Wave Playback. The mucosal wave (MW) playback is produced by modifying the HSV image sequence into a series of frames, in which the pixel intensity encodes the motion of the upper and lower margins of the vocal folds and the mucosal edges. As illustrated in Fig. 28.6, in the MW playback, the color indicates the direction of motion (ie, the opening edges are encoded in green and the closing edges in red).¹³ The frame selected for the example in Fig. 28.6 demonstrates the effectiveness of the MW playback in emphasizing the fact that during the beginning of the closing phase, when the lower margins are closing, the upper margins of the vocal folds may still be opening. The width of the differential between the upper and lower margins is a measure of the extent of the lateral phase of the mucosal wave.

Fig. 28.5 Example of a DKG playback of a phonatory offset. The figure provides two snapshots taken in the posterior and medial areas, respectively. The image on the left shows an average image of the vocal folds with the line being scanned across the glottis. On the right, the corresponding kymographic image is shown. The DKG playback movie used in this example is provided as Video Clip 55 in the DVD accompanying this book.

Mucosal Wave Kymography Playback. Mucosal wave kymography (MKG) playback corresponds to viewing kymographic frames of mucosal wave along the posterior-anterior axis y (Fig. 28.7). Thus, MKG playback is a kymographic playback of the mucosal wave movie content. The MW and MKG playbacks have a substantial clinical potential as facilitative visual techniques as they allow for the assessment of the fine detail of the mucosal wave including the propagation of the mucosal edges during the opening and closing glottal phases.¹³,¹⁸,²⁰

Phonovibrogram. Phonovibrogram (PVG) is a two-dimensional diagram of the vocal fold vibration. It is obtained by segmenting the edges of the vibrating vocal folds and transforming the obtained contour data into a two-dimensional image without loss of information.²⁷ Within a PVG image, the segmented contours of the moving vocal folds are unambiguously transformed into a set of geometric objects. PVG images can be regarded as fingerprints of vocal fold vibration and enable a direct and intuitive assessment of vocal fold vibration. The interpretational power and the quantitative analysis of PVG have been demonstrated on persons with voice disorders and vocally normal individuals.²⁷

Objective Measures

Human perception is imperfect when evaluating visual content, especially content such as that of HSV where a vibration pattern is repeated over and over and variations may be insignificant relative to the overall pattern. For example, it is impossible for the human brain to compare the pattern of a glottal cycle with a pattern 10 cycles later after seeing the cycles in between. Objective, unbiased measures are important for overcoming the limitations of human perception, as well as for documenting and quantitatively comparing the HSV content. The earlier section “Advantages of High-Speed Videoendoscopy over Videostroboscopy” already covered in detail the possible HSV measures that have relevance to voice assessment. To make these measurements automatically, researchers first have to build valid, reliable, and accurate segmentation and registration algorithms. Segmentation is a process of detecting and extracting the features of interest that are later subjected to measurement. In the context of HSV, there are three types of segmentation: temporal, image (spatial), and kymographic (spatial-temporal) segmentation.

Fig. 28.6 Example frame of an MW playback taken at the beginning of the closing phase during sustained phonation. In the MW playback, the color indicates the direction of motion (ie, the opening edges are encoded in green and the closing edges in red). This example demonstrates the effectiveness of the MW playback in emphasizing the fact that in the beginning of the closing phase, when the lower margins are closing, the upper margins of the vocal folds may still be opening. The width of the differential between the upper and lower margins is a measure of the extent of the lateral phase of the mucosal wave. The MW playback movie used in this example is provided as Video Clip 56 in the DVD accompanying this book.

Fig. 28.7 A medial position frame of an MKG playback of sustained phonation. This type of display allows for the temporal representation of the dynamics of the mucosal edges during glottal opening and closing in consecutive glottal cycles. The color shows the phase of motion (opening is green, closing is red). The mucosal wave extent appears as a double-edged or thicker red curve during the closing phase. The MKG playback movie used in this example is provided as Video Clip 57 in the DVD accompanying this book.

Temporal Segmentation

Because HSV contains rich temporal and spatial information, it takes a very long time to view the HSV playback, and it is impractical to use DKG and other facilitative playbacks on lengthy HSV samples. Luckily, more than 95% of the vibratory information available through HSV is dynamically redundant due to the repetition of a pattern. Therefore, during sustained phonation, viewing shorter segments inclusive of just a couple of glottal cycles can be representative for the relevant information in the HSV sample. The process of automatically analyzing the temporal redundancy and selecting short segments representative of the whole HSV sample is termed temporal segmentation.13,31 Temporal segmentation allows for navigating through the enormously large HSV content by mapping the specific areas of interest. After the segments are selected, they are either subjected to motion compensation to produce facilitative playbacks or subjected to image or kymographic segmentation to produce objective measurements.³⁰

Image and Kymographic Segmentation

Image segmentation is a process of partitioning a digital image into multiple segments (sets of pixels). The goal of image segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. HSV image segmentation is used to locate the vocal fold edges and to approximate them with lines and curves, which then can be numerically analyzed. Typically, HSV image segmentation is applied on each HSV frame (ie, in spatial domain). There is a variety of HSV segmentation algorithms in the published literature, and the algorithms are based on different methods: thresholding, region growing, active contours (snakes), level sets, and so forth. Segmentation has been applied also on VKG using snakes.²⁸ This type of spatial-temporal segmentation is termed kymographic segmentation. It is an approach applicable to HSV with significant potential. Based on our research, kymographic segmentation offers faster and more reliable and accurate HSV segmentation than the spatial image segmentation.³² The added benefit of using kymographic segmentation is the inherent temporal registration (ie, the process of tracking the movement of the segmented vocal fold edges over time), a feature not available through spatial-domain image segmentation.
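The inherent temporal registration of kymographic segmentation can be illustrated with a deliberately simple stand-in that uses plain thresholding rather than the snakes-based approach cited above. The sketch assumes that the open glottis appears darker than the surrounding tissue and forms a single gap on each kymogram row. The threshold value and the function names are hypothetical.

```python
def glottal_edges(kymogram, threshold):
    """Left and right glottal edge positions for every time row of a kymogram.

    kymogram is indexed [t][x]. Pixels darker than `threshold` are treated
    as open glottis (assumption). Returns None for rows with no dark pixels,
    i.e. when the glottis is closed at that scan line.
    """
    edges = []
    for row in kymogram:
        dark = [x for x, value in enumerate(row) if value < threshold]
        edges.append((dark[0], dark[-1]) if dark else None)
    return edges


def glottal_width(edges):
    """Glottal width in pixels over time, derived from the edge positions."""
    return [0 if e is None else e[1] - e[0] + 1 for e in edges]
```

Because each kymogram row corresponds to one instant, the output is already a pair of edge trajectories ordered in time, so no separate tracking step is needed to follow an edge from frame to frame.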

Unsolved Problems with Laryngeal High-Speed Videoendoscopy

Despite its obvious advantages, HSV has not yet gained widespread clinical adoption because of remaining technical, methodological, and practical issues and an associated lack of information regarding the validity and clinical relevance of HSV. These limitations are highly interrelated.¹³ Improvements of technology and methodology are usually driven by the clinical demand. However, the research needed to demonstrate the clinical relevance of HSV, which is necessary for guiding the development of the appropriate methodology, is still in an early stage. Given that there are no established requirements, standardized clinical protocols, or clearly demonstrated benefits, clinicians have insufficient incentive to implement this new, costly technology.

Clinicians and researchers usually consider the cost of HSV systems as being the most prohibitive factor for the clinical implementation of HSV. But this is only partially true. The cost of HSV technology is comparable with that of stroboscopy. Clinicians are accustomed to using and interpreting videostroboscopy, and they view HSV as a technique supplemental to videostroboscopy; a technique offering further detail about the glottal cycle or about irregular patterns. As we already pointed out, the same vibratory features may appear different on HSV than they do on videostroboscopy. Thus, the new information gained by the clinician from HSV may not be complementary, because it is not comparable. It is understandable why the clinician may feel discouraged. Adding to the cost of the equipment, the time spent for training, performing the procedure, and maintenance—without a health insurance billing/coding procedure in place for financial reimbursement—is indeed discouraging. Thus, the most prohibitive factor is not necessarily the cost of the HSV equipment, but rather the fact that there are more factors contributing to a greater cost, which is not yet justified through clinical evidence. As soon as the clinical value of HSV and its superiority over stroboscopy is demonstrated through research, HSV will supplant videostroboscopy in the clinic.

Many technical and methodological challenges still need to be addressed and studied. Effective techniques for the visualization and measurement of the features of the mucosal wave, vibratory regularity and symmetry, the vocal fold edge, glottal closure, vibratory amplitude, and the open quotient should become available. New methods for assessment of nonstationary laryngeal dynamics, such as onsets, offsets, and breaks, should be developed. Although camera technology capable of meeting the technical characteristics for clinical application is available, a lot of practical issues have yet to be addressed, such as the weight of the cameras and the creation of special optical lenses better serving HSV, and further increase in sensitivity, memory size, and storage would be very beneficial. More unanswered questions affect the practicality of HSV: Does the bright constant xenon light pose any risks to the patients? How to store the huge amount of data? Appropriate image compression techniques are essential. How to quickly view the lengthy HSV recordings? For example, 10 seconds of HSV data recorded at a speed of 10,000 fps would require 2 hours 46 minutes to view the whole recording at a speed of 10 fps. That is impractical, and the need for effective automatic temporal and image segmentation and facilitative playbacks and objective measurements is obvious. Currently, there are no commercial software tools for such analyses, and the commercial HSV integrated systems are still a step away from the necessary speed and image quality. More clinical research is necessary. Until the practicality, validity, reliability, and clinical relevance of HSV are formally studied, voice specialists will not be willing to change their methods for evaluating the vibratory behaviors of the vocal folds.
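The viewing-time example in this paragraph can be reproduced with a few lines. The function names are illustrative assumptions.

```python
def playback_time_s(recording_s, capture_fps, playback_fps):
    """Time needed to view every recorded frame at the given playback rate."""
    return recording_s * capture_fps / playback_fps


def as_hms(seconds):
    """Convert a duration in seconds to (hours, minutes, seconds)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs
```

Ten seconds recorded at 10,000 fps contain 100,000 frames. At 10 fps these take 10,000 seconds to view, or 2 hours 46 minutes 40 seconds, in agreement with the figure quoted above.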

High-Speed Videoendoscopy as a Research Tool for Voice Science

Historically, high-speed imaging has been used mainly as a research tool. Since the 1930s, different high-speed techniques helped in building our current understanding of voice production.5–8 A summary of the state of the art in high-speed digital imaging as of the end of the 20th century has been written by Kiritani.³³

The most recent and significant advancement of the HSV technology allowed for capturing unprecedented details about the biomechanics of laryngeal sound production. The HSV technology is already helping to guide innovations in surgical voice restoration and to develop the next generation of clinical tools for improving the functional evaluation of voice disorders.¹³,¹⁹,³⁴ Beyond its potential as a clinical tool, HSV remains above all a powerful technique for refining our understanding of voice production.

The purpose of this section is to update the reader about currently ongoing voice research using HSV including examples. There are several laboratories worldwide conducting HSV-based research to answer basic voice science questions. Our purpose is not to conduct a systematic review but rather provide a few examples stemming from our own experience, which will demonstrate the power of HSV in addressing current basic questions. The following is a summary of five different basic voice science studies with significant potential to impact clinical practice. These studies are not directly related and are currently at various stages of their completion. All five studies, however, have one thing in common: HSV is used as an intermediate technique, or as a “physiologic marker,” for validating concepts using other, indirect measurement techniques.

Improved HSV technology became essential in investigating the interrelations between the biomechanical aspects of vocal fold vibration (ie, the vocal fold physiology) and other biofeedback signals that have traditionally been used for the evaluation of vocal function. Such traditional biofeedback signals include the acoustic voice waveform (sound pressure), EGG, transglottal airflow, intraoral/subglottal pressure, and accelerometry. These indirect signals have been the basis for objective measures of laryngeal function. The relationship between these measures and vocal fold physiology has been largely assumed, depending on the current theoretical paradigm. However, little direct evidence about the correlation of these signals and measures to the actual biomechanical vibration of the vocal folds has been reported to date.

Measure of Vocal Attack Time

In this study, we hypothesized that the time lag between the rise of the sound pressure (SP) and EGG signals, measured at the onset of phonation, provides a useful index of vocal attack time, which is an important variable in the etiology of some voice disorders and may also be a meaningful indicator of central or peripheral neural dysfunction.24 HSV was used for the experimental validation of this measure, whereby the SP and EGG signals were recorded synchronously with HSV. DKG images were subsequently generated from the HSV and used to manually measure the time from the beginning of vocal fold oscillation to the first vocal fold contact. The study demonstrated that, after appropriate signal processing, the intersignal time delay provides a potentially useful measure that varies with vocal attack characteristics. HSV-based techniques were essential for providing a physiologic understanding and quantitative validation of the proposed measure.

Relationship between the Glottal Flow and Glottal Area Waveforms

A very important but little studied aspect of human voice production is the relationship between the vocal fold vibration and the transglottal airflow. To analyze this relationship, in this study we combined HSV of the glottis for determining the glottal area waveform (GAW) with inverse filtering of the acoustic signal for estimating the glottal flow waveform (GFW).³¹ The HSV system, recording at 16,000 fps, and the audio recording hardware were synchronized with an accuracy of 11 μsec. We developed an image segmentation algorithm for automatic extraction of the GAW from the HSV images based on region growing. The HSV samples and the corresponding acoustic signals were obtained from 14 vocally normal individuals for different voicing conditions (ie, various registers, adductory adjustment, longitudinal tension, or nonstationary phonatory behavior). The revealed waveform shapes and slopes of the vocal fold vibration and the transglottal flow were in agreement and changed correspondingly with the different phonatory conditions. However, the delay between the glottal area and flow signals varied in a very wide range, sometimes exceeding the glottal cycle length. This contradicts current models estimating delay based on vocal-tract length and speed of sound. Thus, further research is warranted.
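To give a notion of how a glottal area waveform can be obtained by region growing, the following is a minimal sketch and not the algorithm used in the study. It assumes that the open glottis is darker than the surrounding tissue, that a seed pixel inside the glottis is known, and that a fixed intensity threshold separates the two. The seed, the threshold, and the function names are hypothetical.

```python
from collections import deque


def grow_region(frame, seed, threshold):
    """4-connected region growing over pixels darker than `threshold`.

    frame is indexed [row][col]; seed is a (row, col) tuple assumed to lie
    inside the glottis. Returns the set of pixels in the grown region,
    which is empty if the seed itself is not dark (glottis closed).
    """
    rows, cols = len(frame), len(frame[0])
    if frame[seed[0]][seed[1]] >= threshold:
        return set()
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and frame[nr][nc] < threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


def glottal_area_waveform(frames, seed, threshold):
    """Glottal area in pixels for each frame of the sequence."""
    return [len(grow_region(frame, seed, threshold)) for frame in frames]
```

Unlike global thresholding, region growing counts only the dark pixels connected to the seed, so dark areas elsewhere in the image do not inflate the area estimate.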

Signature-Based Measurement of the Delay between Voice Signals

Relating intracycle landmarks of vocal fold vibration to corresponding features of indirect voice signals is a question of high importance for basic science and clinical practice, as shown in the previous study. The recent refinement of HSV has allowed for accurate synchronization of HSV image data with acoustic and other signals. Synchronization is a necessary prerequisite for aligning these signals with high temporal precision. However, a remaining problem is the inherent delay between the signals measured at the laryngeal level and those measured inside or outside the vocal tract. Traditionally, these delays have been estimated based on distance and speed of sound. Recent findings caution that actual delays may differ significantly from these estimates and are usually greater. Therefore, delays should be measured rather than estimated.
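As a rough illustration of the traditional distance-based estimate, the sketch below divides an assumed source-to-sensor distance by an assumed speed of sound. The 0.17 m distance and the 350 m/s speed are illustrative values chosen for this example, not figures from the studies described here, and the function name is hypothetical.

```python
def acoustic_delay_ms(distance_m, speed_of_sound_m_s=350.0):
    """Propagation delay in milliseconds over `distance_m` meters.

    The default speed of sound is an assumed, illustrative value.
    """
    return 1000.0 * distance_m / speed_of_sound_m_s


# Assumed 0.17 m path from the glottis to a sensor at the lips.
estimated_ms = acoustic_delay_ms(0.17)

# For comparison: the glottal period at an assumed 200 Hz fundamental.
period_ms = 1000.0 / 200.0

print(f"estimated delay: {estimated_ms:.3f} ms, glottal period: {period_ms:.1f} ms")
```

Under these assumptions the estimate is a small fraction of one glottal cycle, which is why observed delays that exceed the cycle length cannot be explained by propagation distance alone.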

The purpose of this study was to develop a technique for accurate and reliable measurement of the delay between different types of voice signals.35 The technique relies on the premise that although different types of signals have inherently dissimilar waveforms due to differences in phase and spectral characteristics, they carry a common signature encoded in the fluctuations of their fundamental frequency. The method comprises a 3-stage high-precision autocorrelation-based frequency signature decoder applied to two synchronously recorded voice signals, followed by statistical procedures for eliciting the actual delay. The validity, accuracy, and reliability of the technique were tested on a data set of 720 two-channel samples with various delays between the channels. The results demonstrate that the proposed technique provides the accuracy and reliability required for intracycle landmark alignment. The new technique is currently being applied for signature-based delay measurement on HSV images versus other signals.
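The underlying premise, that two dissimilar signals share the same fundamental frequency fluctuations, can be illustrated with a much simpler procedure than the one described. The sketch below is not the 3-stage decoder or its statistical stage: it assumes the two fundamental frequency contours have already been extracted at a common, uniform contour rate, and it simply searches for the lag that maximizes their normalized correlation. All names and the synthetic data are hypothetical.

```python
import numpy as np


def delay_from_f0_contours(f0_a, f0_b, contour_rate_hz, max_lag):
    """Delay in seconds by which contour b trails contour a.

    Both contours must have the same length and sampling rate.
    The lag is found by exhaustive search over [-max_lag, max_lag]
    samples, scoring each lag by normalized correlation of the
    mean-removed contours over their overlapping part.
    """
    a = np.asarray(f0_a, dtype=float) - np.mean(f0_a)
    b = np.asarray(f0_b, dtype=float) - np.mean(f0_b)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        score = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / contour_rate_hz


# Synthetic check: one fluctuating contour observed through two channels,
# the second trailing the first by 7 contour samples.
rng = np.random.default_rng(0)
true_lag = 7
base = 120.0 + rng.normal(0.0, 1.0, 500 + true_lag)
f0_a = base[true_lag:]   # leading channel
f0_b = base[:500]        # trailing channel: f0_b[n] == f0_a[n - true_lag]

estimated = delay_from_f0_contours(f0_a, f0_b, contour_rate_hz=1000.0, max_lag=20)
print(estimated)
```

The resolution of this simplified search is limited to one contour sample. Reaching the precision needed for intracycle alignment is what the high-precision decoder and statistical procedures of the actual method are for.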

Complementing High-Speed Videoendoscopy with Electroglottography

The purpose of this study was to assess the utility of combining HSV with EGG indices of glottal and vocal fold function.36 Because EGG is sensitive to changes in vocal fold contact area during phonation, it can be a valuable tool for both voice researchers and clinicians. Clinical observation and the application of various physical and mathematical models have been used to identify important EGG signal landmarks and to relate changes in signal morphology to specific aspects of laryngeal physiology. The continued refinement and applicability of HSV allows the synchronization of the EGG signal with endoscopic images of the vocal folds. Specifically, the study investigates variations in particular EGG features and relates them to HSV-observed changes in vibratory behavior.

In this ongoing study, 14 vocally normal speakers were recorded using HSV (16,000 fps) synchronized with EGG (96,000 Hz) to an accuracy of 11 μsec as they produced the vowel /i/ sustained in eight different modes of phonation: habitual, high and low pitch, breathy and pressed phonation, glottal fry, tremor, and falsetto. Based on current EGG models, 10 signal features were identified and grouped into four categories: cycle-phase-related features, temporal features, noise-related features, and configuration-related features. The resulting data were compiled to compare differences related to the phonatory mode and the sex of the subject. Using custom-designed software with a specialized graphical user interface, the EGG signals were precisely aligned with multislice digital kymography derived from HSV (Fig. 28.8), and color-encoded marks were manually placed at five characteristic EGG landmarks. The classified EGG features were then related to specific vocal fold vibratory characteristics using the custom-built software. The software allowed vocal fold contact to be located dynamically on the HSV images and DKG for any lateral line, and the HSV contact features to be matched precisely to the EGG landmarks.
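The general principle of deriving multislice kymograms from HSV and indexing them against an EGG recording can be sketched as below. This is an assumption-laden illustration, not the custom software used in the study. It assumes the frames are stored as an array of shape (time, height, width), that the glottal axis runs vertically in the image so each image row is one lateral line, and that both recordings start at the same instant with an integer ratio between the EGG sampling rate and the frame rate. All function names are hypothetical.

```python
import numpy as np


def kymogram(frames, row):
    """Digital kymogram for one lateral line.

    Takes the same image row from every frame and stacks the rows
    side by side, giving an array of shape (width, time).
    """
    return frames[:, row, :].T


def multislice_kymograms(frames, posterior_row, anterior_row, n_slices=5):
    """Kymograms at n_slices evenly spaced rows along the glottal axis."""
    rows = np.linspace(posterior_row, anterior_row, n_slices).round().astype(int)
    return {int(r): kymogram(frames, int(r)) for r in rows}


def egg_span_for_frame(frame_index, fps=16000, egg_rate=96000):
    """(start, stop) EGG sample indices covering one HSV frame.

    Assumes a common start time and an integer rate ratio
    (96,000 / 16,000 = 6 EGG samples per frame).
    """
    per_frame = egg_rate // fps
    return frame_index * per_frame, (frame_index + 1) * per_frame


# Placeholder data: 100 frames of 64 x 48 pixels.
frames = np.zeros((100, 64, 48), dtype=np.uint8)
slices = multislice_kymograms(frames, posterior_row=10, anterior_row=50)
print(sorted(slices), slices[10].shape, egg_span_for_frame(10))
```

With time on the horizontal axis of each kymogram, a landmark marked at a given EGG sample can be mapped to a frame column and compared across the five slices.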

Relations among High-Speed Videoendoscopy Image Analyses and Aerodynamic, Electroglottography, and Acoustic Measures

The purpose of this ongoing study is to create a cross-classification scheme between noninvasive integrative measures (glottal flow, oral pressure, sound pressure, accelerometer, and EGG) and specific physiologic moments of glottal dynamics (derived from HSV), as well as to use computer phonatory models to provide similarities to the cross-classification scheme created with the human data. A modified Rothenberg mask is used to record flexible nasal HSV at 4000 fps with the other 5 data channels. The goal is to use HSV as a tool for the experimental validation and refinement of current theoretical models of voice production.

Fig. 28.8 Example of the use of custom-designed software with a specialized graphical user interface for precisely aligning EGG signals with multislice digital kymography derived from HSV. Five color-encoded marks are manually placed at the corresponding intracycle EGG landmarks for each glottal cycle. The position of each landmark is then compared with the phase of vocal fold vibration at each of the five posterior-to-anterior positions.

High-Speed Videoendoscopy in the Clinical Speech-Language Pathology Practice

The speech-language pathology practice can benefit tremendously from HSV by refining the current protocol for voice evaluation. The vibratory features of sustained phonation currently rated using videostroboscopy can become more accurate and reliable if measured using HSV-based technology, and new features of aperiodic vibration will become clinically applicable, such as voice breaks, laryngeal spasms, vocal attack time, and onset and offset of phonation.

Several clinical research teams have addressed the clinical comparison of videostroboscopy and HSV.13,15,16,18,20,22,23 The results of all these experimental studies share one finding: most of the analyzed vocal fold vibratory features (ie, periodicity, mucosal wave, symmetry, open quotient, glottal closure, and mucus aggregation) appeared different on HSV than on videostroboscopy. For several features, there was a difference in sensitivity and specificity between the two imaging modalities. Videostroboscopy often had lower intrarater and interrater reliability.

The earlier section “Advantages of High-Speed Videoendoscopy over Videostroboscopy” already elucidated the inherent reasons for differences between videostroboscopy and HSV, based on the principles of these techniques. Because voice clinicians have been trained to use videostroboscopy, they may attempt to interpret vibratory features relative to the norms used in the clinic with videostroboscopy. Thus, there is a risk of overdiagnosing or misdiagnosing when HSV features are compared with the norms developed for videostroboscopy.

Normative Data for High-Speed Videoendoscopy

One main limitation to the clinical use of HSV is the lack of norms. An important first step in establishing the clinical utility of HSV is to generate HSV-specific clinical norms. The ability of the clinician to differentiate normal physiology from pathology using HSV images depends on valid norms. The following is a summary of findings achieved through recent research comparing videostroboscopy with HSV for commonly used clinical vibratory features.13

Phase Symmetry

Investigations of left-to-right phase symmetry revealed more instances of asymmetry than symmetry across a variety of HSV facilitative playbacks.18 In comparison, there was an even larger prevalence of asymmetrical ratings for posterior-to-anterior phase symmetry. An increased percentage of cases were rated as symmetrical during pressed versus comfortable phonation at habitual pitch and loudness, which could reveal physiologic differences between the two types of phonation.