The improvement of internal consistency of the Acoustic Voice Quality Index




Abstract


Purpose


This investigation aims to explore an improvement of the relatively new hoarseness severity quantification method called the Acoustic Voice Quality Index (AVQI), which measures a concatenation of continuous speech (CS) and sustained phonation (SP) segments. Earlier investigations indicated that SP influences the final AVQI result more strongly than CS.


Method


Sixty voice samples were selected with different voice pathologies and an equal distribution of hoarseness severity, ranging from normal to severe. Every voice sample was prepared in three different durations: voice duration-one (VD-1) with a seventeen-syllable text plus three seconds of SP, voice duration-two (VD-2) with a customized length of CS plus three seconds of SP, and voice duration-three (VD-3) with the whole text plus three seconds of SP. All voice samples were perceptually judged on overall voice quality by five experienced voice clinicians. AVQI's precision and concurrent validity were assessed in all three VDs. Finally, the internal consistency across all three VDs was analyzed.


Results


No significant differences were found in the perceptual evaluation of overall voice quality across the three VDs, with acceptable rater reliability. Concurrent validity showed a marked degree of correlation in all three VDs (i.e., ranging from rs = 0.891 to rs = 0.929), with no significant differences between them. The best precision was found in VD-2. Finally, the analysis of internal consistency showed that in VD-2 the impact of the two speech tasks on the final AVQI score was balanced, with no significant difference between them.


Conclusion


Although AVQI currently uses the speech material of VD-1, the present study demonstrated the best results (i.e., precision and internal consistency) in VD-2. These features of VD-2 facilitate higher representativeness and improve the validity of this objective diagnostic instrument.



Introduction


Voice quality is a feature of voice production that describes a perceptual phenomenon in the voice sound. Generally, voice quality is not a clearly defined term in the literature. However, overall voice quality is mostly compatible with the term hoarseness. Hoarseness is a voice symptom that perceptually deviates from normal voice quality, recognized by oneself or others. Major subtypes of abnormal overall voice quality that have received wide acceptance are breathiness, roughness, and strain.


Furthermore, variations in voice quality are the most frequent voice complaints in clinical practice. Two broad approaches enable the measurement of voice quality. First, a subjective method (i.e., auditory-perceptual judgment) is used, in which a clinician listens to a patient's voice and assigns a score that reflects his/her judgment of the voice sound. Second, objective methods are used that apply specific algorithms to quantify certain aspects of a correlate of vocal production, such as the acoustic voice signal, the inverse-filtered oral airflow signal, or its derivatives. All these tools may evaluate the presence, the degree, and the progression of abnormal voice quality in a sufficiently valid and reliable way. Traditionally, auditory-perceptual judgment has been commonly used to determine all three components in the evaluation of voice quality because of the simplicity and efficiency of this method. However, the use of auditory-perceptual judgment is not undisputed in the literature: many factors affect its reliability and accuracy. These factors are difficult to monitor in clinical practice, and thus objective tools may support examiners in their rating of voice quality. Acoustic analysis of the voice signal is one possibility for the evaluation of voice quality and is the most used diagnostic instrument to identify voice disorders in research.


Recently, new methods in acoustic analysis were used to analyze continuous speech and sustained phonation with sufficient accuracy and reliability, e.g., the Acoustic Voice Quality Index (AVQI) proposed by Maryn et al. This feature of acoustic analysis facilitates higher ecological validity in the evaluation of voice quality.


AVQI is a six-factor acoustic model based on linear regression analysis, used to measure overall voice quality in concatenated continuous speech and sustained phonation segments. In order to simplify clinical interpretation, the regression model was linearly rescaled in such a way that the outcome of the equation results in a score between 0 and 10; this final model is called AVQI. It is one of the first objective-acoustic models to judge continuous speech. To our knowledge, the Cepstral Spectral Index of Dysphonia by Awan et al. may also successfully evaluate overall voice quality in continuous speech and sustained phonation.


AVQI is an acoustic correlate of auditory-perceptual judgment because perceptual evaluation is considered the ‘gold standard’ in research and clinical practice.


The AVQI model uses a detection algorithm from Parsa and Jamieson to separate voiced and voiceless segments in the recording of continuous speech. This procedure allows acoustic measurement of continuous speech with many more meaningful acoustic markers, based on the frequency, time, and amplitude domains, in the evaluation of overall voice quality, as shown in the meta-analysis by Maryn et al.


Although AVQI was originally developed for Dutch speakers, this model has been validated and found reliable in different languages, including native German, native English in a pediatric population, and multilingual persons speaking Dutch, English, German, and French. Based on the results of these studies, AVQI seems to be cross-linguistically robust in Germanic languages; its performance is relatively insulated from inter-language phonetic differences.


The relevance of acoustic measurements in clinical management lies in objectively monitoring voice quality throughout the voice therapy process. In this respect, AVQI has thus far proven highly sensitive to voice changes through voice therapy (i.e., r = 0.80).


Recent research on the internal consistency and test-retest measurement of AVQI has shown a low level of AVQI score variability (i.e., a variability of 0.54 AVQI points), but the AVQI score was most strongly influenced by sustained phonation. Furthermore, sustained phonation was shown to have a significantly greater influence on the AVQI score than continuous speech. These findings suggest that more research is required to achieve greater representativeness and ecological validity in AVQI by balancing the internal consistency through an equal proportion of these two speech tasks.


This investigation aims to explore an equal proportion of the two speech tasks in AVQI by expanding the duration of continuous speech. Although the continuous speech part in the current AVQI model covers 17 to 22 syllables, the duration of the analyzed continuous speech material, after the detection algorithm separates voiced from voiceless segments, is significantly shorter than the constant three seconds of sustained phonation.


Furthermore, the judgment validity of overall voice quality was verified anew between auditory-perceptual ratings and AVQI scores for the different durations of the analyzed segments.


Our research questions address the following:



  • 1.

    Are there significant differences between different auditory-perceptual overall voice quality ratings (i.e., the judgment of concatenated continuous speech and sustained phonation) by varying the duration of continuous speech?


  • 2.

    What is the impact of varying the voiced continuous speech duration on the correlation and perceptual diagnostic accuracy between perceptual ratings and AVQI values?


  • 3.

    Does the internal consistency of AVQI improve when the proportions of sustained phonation and voiced continuous speech are adapted to reach higher ecological validity?






Methods



Subjects


The voice-disordered subjects were recruited retrospectively from the ENT caseload of the Sint-Jan General Hospital in Bruges, Belgium. Concatenated voice samples of continuous speech and sustained phonation were obtained from a database of 350 patients with various organic and non-organic etiologies. These voice samples were chosen by selecting four groups with various degrees of hoarseness (i.e., absence of hoarseness/clear voice, slight, moderate, and severe). Firstly, the selection was based upon prior modal agreement across five judges (i.e., a minimum of three judges gave the same rating), who judged the hoarseness on an ordinal four-point equal-appearing interval scale corresponding exactly to the four hoarseness degrees. Thus, these samples were considered highly representative of a specific level of hoarseness. Secondly, voice samples with complete aphonia were excluded from this selection. Application of the two criteria led to the selection of 15 voice samples in each hoarseness group, for a total of sixty participants. Table 1 summarizes further subject details, i.e., gender, age, and voice disorder of the relevant subjects.



Table 1

Descriptive data of the 60 subjects.

Variable                                     Results
Gender
  Male                                       23 (38%)
  Female                                     37 (62%)
Age in years (mean ± standard deviation)     40.8 ± 19.7
Voice disorder
  Functional dysphonia                       22
  Paralysis/paresis                          11
  Nodules                                    8
  Polypoid mucosa (edema)                    4
  Cyst                                       3
  Tumor                                      3
  Ventricular hypertrophy                    2
  Granuloma                                  1
  Hemorrhage                                 1
  Presbylarynx                               1
  Reflux laryngitis                          1
  Spasmodic dysphonia                        1
  Vocal fold atrophy                         1
  Web                                        1


This study consisted of a retrospective, non-interventional re-analysis of earlier recordings; therefore, no approval of our Ethics Committee was required.



Voice samples


Every voice sample from each participant contained a sustained phonation of three seconds of the mid-vowel portion of the vowel [a:]. The samples also included a Dutch phonetically balanced text ("Papa en Marloes") read aloud, using comfortable pitch and loudness for both speech types. All recordings were conducted with an AKG C420 head-mounted condenser microphone (AKG Acoustics, Munich, Germany), digitized at a sampling rate of 44.1 kHz with 16 bits of resolution using the Computerized Speech Lab model 4500 (Kay Pentax, Lincoln Park, NJ), and were made in a soundproof booth.


To verify post hoc the level of environmental noise in the voice recordings, the signal-to-noise ratio (SNR) method by Deliyski et al. was used. All voice samples met the recommended SNR norm for acceptable conditions of acoustic recording and analysis, with a mean SNR of 39.41 dB (SD = 4.00 dB) across all sixty voice samples.
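The SNR idea can be sketched generically: given a speech portion and a noise-floor portion of the recording, the ratio of their RMS amplitudes in dB approximates the recording quality. This is a simplified stand-in; the exact Deliyski et al. procedure is not reproduced here, and the toy signals are illustrative only.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Generic signal-to-noise ratio in dB: 20*log10(RMS_speech / RMS_noise).
    A simplified stand-in for the Deliyski et al. method."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# toy example: a unit-amplitude sinusoid vs. a 100-times-quieter one
speech = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
noise = [0.01 * math.sin(2 * math.pi * 50 * t / 8000) for t in range(8000)]
print(round(snr_db(speech, noise), 1))  # 40.0 dB
```

A recording passing the norm used here would thus need its noise floor roughly two orders of magnitude below the speech level.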


In the following investigations, every voice sample was prepared in three different durations:



  • (1)

    voice duration-one (VD-1) consisted of the first two sentences containing seventeen syllables plus three seconds of sustained phonation according to the standard AVQI procedure of Maryn et al. ,


  • (2)

    voice duration-two (VD-2) contained continuous speech with customized length (i.e., to correspond with three seconds of voiced continuous speech after extraction) plus three seconds of sustained phonation, and


  • (3)

    voice duration-three (VD-3) covered the whole text containing ninety-three syllables plus three seconds of sustained phonation.



The customized length of continuous speech in VD-2 was acquired as follows:


Firstly, all voiceless segments were removed from the continuous speech in the software Praat using the extraction Praat-script from Maryn et al. The first three seconds of the resulting voiced-only signal served as the reference.


Secondly, the customized cut-off point was hand-marked in the original, unextracted recording at the position corresponding to the end of the extracted first three seconds from step one. To define the hand-marked cut-off point, we used the oscillogram and narrowband spectrogram views, the pitch contour, and auditory feedback.


Thirdly, the duration of each hand-marked segment was additionally verified: the extraction Praat-script of Maryn et al. was run on the customized hand-marked segment, and a tolerance margin of ± 0.1 s was accepted between the extracted duration of the hand-marked segment and the three-second target.
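The logic of the customized cut-off can be sketched as a small search: given hypothetical voiced/voiceless intervals in original recording time (e.g., as exported from a Praat TextGrid), find the time point at which three seconds of voiced material have accumulated. The interval format below is an assumption for illustration.

```python
def cutoff_for_voiced_target(intervals, target=3.0):
    """Find the time point in the original recording at which the cumulative
    duration of voiced material reaches `target` seconds.
    `intervals` is a list of (start, end, voiced) tuples in original time
    (hypothetical input format)."""
    accumulated = 0.0
    for start, end, voiced in intervals:
        if not voiced:
            continue
        if accumulated + (end - start) >= target:
            return start + (target - accumulated)  # cut inside this interval
        accumulated += end - start
    return None  # recording contains less than `target` s of voiced speech

# toy example: alternating 0.5 s voiced / 0.25 s voiceless stretches
intervals = []
t = 0.0
for _ in range(10):
    intervals.append((t, t + 0.5, True)); t += 0.5
    intervals.append((t, t + 0.25, False)); t += 0.25
print(cutoff_for_voiced_target(intervals))  # 4.25
```

The returned point plays the role of the hand-marked cut-off; in the study this was located manually and then verified against the ± 0.1 s tolerance.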


Table 2 presents the different durations of continuous speech.



Table 2

Different durations of the continuous speech part in the three VDs (60 subjects).

                                                              Mean (s)    SD (s)
Without extraction of voiceless segments
  First two sentences of the phonetically balanced text
  (VD-1)                                                       4.6338     1.0543
  Customized length corresponding to three seconds of
  voiced continuous speech after extraction (VD-2)            10.1112     2.8600
  Whole phonetically balanced text (VD-3)                     27.6778     5.9010
With only voiced segments
  First two sentences of the phonetically balanced text
  (VD-1)                                                       1.5320     0.3366
  Customized length corresponding to three seconds of
  voiced continuous speech after extraction (VD-2)             3.0110     0.0557
  Whole phonetically balanced text (VD-3)                      8.3040     1.7795

Abbreviation: SD = standard deviation.



Auditory-perceptual judgment


For the auditory-perceptual judgment of overall voice quality, an expert panel of five native Dutch speech-language therapists was assembled. The panel consisted of four females and one male who specialized in voice disorders and had professional experience in auditory-perceptual judgment ranging from five to forty years (mean = 25.6 years, SD = 12.7 years). Each listener rated the overall voice quality of each concatenated voice sample (i.e., one single sound wave consisting of one of the three continuous speech segments, a pause of 1 s, and the sustained phonation of the central vowel segment) with one hoarseness degree for the whole concatenated sample. The listening procedure was comparable to those described in previous studies. The judges used the Grade (G) from the GRBAS scale; the G level represents the degree of hoarseness or voice abnormality. As recommended by Wuyts et al., the judges used the ordinal four-point equal-appearing interval scale (0 = absence of hoarseness/clear voice, 1 = slightly hoarse, 2 = moderately hoarse, 3 = severely hoarse). All voice samples were presented in a quiet room with an ambient noise level below 40 dB(A), measured with a calibrated PCE-322A sound level meter (PCE Inst., Meschede, Germany). They were presented to each listener individually at a comfortable loudness level through an external Creative Soundblaster X-Fi 5.1 USB soundcard and Beyerdynamic DT 770 PRO 80 Ω headphones. Every listener was allowed to repeat each voice sample as often as necessary to reach a final judgment.


All voice samples and all VDs were judged in random order in one session to minimize learning effects. Furthermore, all judges were blinded regarding the identity, diagnosis, and disposition of the voice samples. To assess intra-rater reliability, 15 voice samples of each VD, approximately 25% of the 60 voice samples, were selected randomly. These voice samples were repeated a second time at the end of the perceptual judgment without informing the listeners that stimuli were repeated.


Internal factors such as fatigue, reduced attention, and low concentration, as described by Kreiman et al., were controlled by a short break after every twenty-fifth rating. Furthermore, as recommended by Chan and Yiu, anchor voices were used to putatively increase the reliability of listener ratings. Thus, six samples of concatenated continuous speech and sustained phonation were selected from the database of previously judged investigations. The selection criteria for the anchor voices were based upon prior unanimous agreement across judges on the three hoarseness degrees slight, moderate, and severe. Thus, these samples were considered highly representative of a specific level of G. In total, two sets with continuously increasing hoarseness level (i.e., three samples per set) were presented to the listeners as anchors. The two sets distinguished between two chief subtypes of hoarseness (i.e., breathiness and roughness) recognized in various scientific papers. Each listener heard these two sets at the beginning and after the break following every twenty-fifth rating.



Acoustic measures


Acoustic analyses were applied to the segmentation and concatenation of the voiced segments of relevant continuous speech parts using the extraction Praat-script by Maryn et al. . The three-second mid-vowel segment was appended to this chain of voiced text segments.


Each of these single concatenated sound files was used to calculate AVQI from six acoustic parameters in the software Praat: smoothed cepstral peak prominence (CPPs), harmonics-to-noise ratio (HNR), shimmer local (Shim), shimmer local dB (ShdB), general slope of the spectrum (Slope), and tilt of the regression line through the spectrum (Tilt). The CPPs is the distance between the first rahmonic's peak and the point with equal quefrency on the regression line through the smoothed cepstrum. The HNR is the base-10 logarithm of the ratio between the periodic energy and the noise energy, multiplied by 10. The Shim is the absolute mean difference between the amplitudes of successive periods, divided by the average amplitude. The ShdB is the base-10 logarithm of the difference between the amplitudes of successive periods, multiplied by 20. The Slope is the difference between the energy in 0–1000 Hz and the energy in 1000–10,000 Hz of the long-term average spectrum. The Tilt is the difference between the energy in 0–1000 Hz and the energy in 1000–10,000 Hz of the trendline through the long-term average spectrum.
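The two amplitude-perturbation measures above can be illustrated directly from a list of per-period peak amplitudes. The sketch below follows Praat's local-shimmer definitions as we read them (the ShdB comment expresses the log of successive amplitude ratios, which equals the difference of their logarithms); `amps` is a hypothetical amplitude series.

```python
import math

def shimmer_local(amps):
    """Local shimmer: mean absolute difference between amplitudes of
    successive glottal periods, divided by the mean amplitude
    (expressed here as a fraction, not a percentage)."""
    diffs = [abs(a - b) for a, b in zip(amps, amps[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amps) / len(amps))

def shimmer_local_db(amps):
    """Shimmer in dB: mean of 20*|log10(A_i / A_{i+1})| over successive
    periods (our reading of Praat's definition)."""
    ratios = [20.0 * abs(math.log10(a / b)) for a, b in zip(amps, amps[1:])]
    return sum(ratios) / len(ratios)

# toy amplitudes of eight successive periods, alternating by 10%
amps = [1.0, 1.1, 1.0, 1.1, 1.0, 1.1, 1.0, 1.1]
```

For this alternating series, the local shimmer works out to about 0.095 (9.5%) and the dB shimmer to about 0.83 dB; real cycle amplitudes would come from pitch-synchronous analysis in Praat.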


The initial AVQI procedure by Maryn et al. was adapted by Maryn and Weenink to analyze and calculate AVQI using only the software Praat. Thus, all AVQI analyses in the present study were carried out using this customized Praat-script. The AVQI scores from the regression formula presented by Maryn and Weenink correspond to the AVQI scores of VD-1 and were calculated according to the following equation:


AVQI = (3.295 − 0.111 × CPPs − 0.073 × HNR − 0.213 × Shim + 2.789 × ShdB − 0.032 × Slope + 0.077 × Tilt) × 2.208 + 1.797
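A minimal sketch of this calculation, assuming the published rescaling (the six-term regression is evaluated first, then multiplied by 2.208 and shifted by 1.797); the input values in the example are purely illustrative, not taken from the study's data.

```python
def avqi(cpps, hnr, shim, shdb, slope, tilt):
    """AVQI with the linear 0-10 rescaling of Maryn and Weenink:
    the six-term regression is evaluated first, then rescaled
    by x 2.208 + 1.797."""
    raw = (3.295 - 0.111 * cpps - 0.073 * hnr - 0.213 * shim
           + 2.789 * shdb - 0.032 * slope + 0.077 * tilt)
    return raw * 2.208 + 1.797

# illustrative (made-up) parameter values for one speaker
score = avqi(cpps=14.0, hnr=20.0, shim=3.0, shdb=0.3, slope=-25.0, tilt=-10.0)
print(round(score, 2))  # 2.92
```

Note the direction of the weights: a higher CPPs or HNR (cleaner signal) lowers the score, so lower AVQI values indicate better voice quality.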



Perceptual diagnostic accuracy and concurrent validity of AVQI across the three VDs


Before investigating the internal consistency of AVQI, the perceptual diagnostic accuracy of the three different VDs (i.e., how well AVQI discriminates between absence of hoarseness and hoarse voices from VD-1 to VD-3) and the concurrent validity (i.e., the correlation between AVQI values and the auditory-perceptual evaluation of overall voice quality) were validated against the auditory-perceptual evaluation. An adaptation of the initial regression formula described previously was required for VD-2 and VD-3, because the weighting of the six acoustic measures in the AVQI model was derived for seventeen syllables of continuous speech plus three seconds of sustained phonation. VD-2 and VD-3 contain more voiced segments of continuous speech, so AVQI scores based on the regression formula of VD-1 are inadequate for VD-2 and VD-3.



Internal consistency of AVQI


The internal consistency of AVQI was determined by analyzing three different kinds of AVQI scores in VD-1 (i.e., using the equation from Maryn and Weenink) and in VD-2 and VD-3, which use the newly weighted equations mentioned earlier.


Firstly, an AVQI analysis was performed on only the voice-extracted first two sentences of continuous speech with the regression formula for VD-1 (AVQI-CS-VD-1), on only the voice-extracted customized length of continuous speech with the equation for VD-2 (AVQI-CS-VD-2), and on only the voice-extracted whole phonetically balanced text with the equation for VD-3 (AVQI-CS-VD-3).


Secondly, an AVQI analysis was conducted on only the three seconds of sustained phonation by using the equation of VD-1 (AVQI-SP-VD-1), VD-2 (AVQI-SP-VD-2), and VD-3 (AVQI-SP-VD-3), respectively.


Thirdly, an AVQI analysis was carried out on the concatenation from both speech tasks with the equation of VD-1 (AVQI-T-VD-1), VD-2 (AVQI-T-VD-2), and VD-3 (AVQI-T-VD-3), respectively.


The differences in all three VDs between the AVQI scores from continuous speech and the AVQI total scores were compared with the differences between the AVQI scores from sustained phonation and the AVQI total scores.


The smaller the difference from the total AVQI score, the greater the contribution of the respective speech task to the final AVQI result.


The more equal the differences of AVQI-CS and AVQI-SP from AVQI-T, the more balanced the proportions of the two speech types in the determination of the final AVQI result.
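The comparison above can be sketched with hypothetical AVQI scores for five speakers; all numbers below are invented for illustration, not study data.

```python
def mean_abs_diff(scores, totals):
    """Mean absolute difference between per-task AVQI scores and the
    AVQI total scores of the same speakers."""
    return sum(abs(s - t) for s, t in zip(scores, totals)) / len(scores)

# hypothetical scores for five speakers
avqi_total = [2.1, 4.0, 5.5, 7.2, 8.8]
avqi_cs    = [2.0, 3.6, 5.9, 6.8, 8.1]   # continuous-speech-only AVQI
avqi_sp    = [2.3, 4.5, 5.1, 7.9, 9.6]   # sustained-phonation-only AVQI

d_cs = mean_abs_diff(avqi_cs, avqi_total)   # 0.40
d_sp = mean_abs_diff(avqi_sp, avqi_total)   # 0.52
# the smaller mean difference marks the task that dominates the total;
# near-equal differences indicate a balanced contribution of both tasks
```

In the study, these per-speaker absolute differences were then compared with a paired t-test within each VD.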



Statistical analysis


All statistical analyses were completed using SPSS for Windows version 19.0 (IBM Corp., Armonk, NY), except when stated otherwise. First, the intra-rater reliability of the five raters was assessed for each VD using Cohen's kappa coefficient (Ck), analyzed with R v. 3.0.1 (R Core Team, Vienna, Austria). This statistic is a chance-corrected index of agreement between the ratings of two judges, yielding Ck = 1 for perfect agreement and Ck = 0 when agreement is no better than chance. To assess inter-rater reliability across all VDs among the five judges, we computed the kappa coefficient according to Fleiss, who extended Ck to more than two judges. The Fleiss kappa (Fk) was also determined in R v. 3.0.1. Guidelines for the interpretation of the kappa statistics were provided by Landis and Koch. Significant changes in all kappa values were tested using bootstrapping with 10,000 replications, based on a script by Van Belle.
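Fleiss' extension of kappa to more than two raters can be sketched directly from its definition. Below, `ratings` is a hypothetical count matrix: one row per subject, one column per hoarseness grade, each cell counting how many of the five raters chose that grade.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for agreement among more than two raters.
    `ratings` is a list of per-subject category counts, e.g. [3, 2, 0, 0]
    means 3 of 5 raters chose grade 0 and 2 chose grade 1."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # observed agreement: mean per-subject proportion of agreeing rater pairs
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_subjects
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)

# toy data: 4 subjects, 5 raters, 4 hoarseness grades (0-3)
ratings = [
    [5, 0, 0, 0],   # unanimous grade 0
    [0, 4, 1, 0],
    [0, 1, 4, 0],
    [0, 0, 0, 5],   # unanimous grade 3
]
print(round(fleiss_kappa(ratings), 3))  # 0.733
```

By the Landis and Koch guidelines cited in the text, a value of 0.733 would fall in the "substantial agreement" band.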


Second, the auditory-perceptual severity judgments across the three VDs were analyzed to investigate whether progressively longer durations of continuous speech cause significant differences in severity rating. A paired t-test was used for this purpose.


Third, to obtain comparable AVQI values across the three VDs, a stepwise multiple linear regression was executed for VD-2 and VD-3. This was done to construct a statistical model representing the best weighting of the six acoustic predictors in AVQI for the overall degree of disordered voice. The individual Gmean (i.e., average G-score over the five raters) results from VD-2 and VD-3 were used as dependent variables and the six acoustic measures as independent variables in each VD model. A multiple regression equation was constructed from the unstandardized coefficients of the statistical model. Finally, the two models were also linearly rescaled, like the equation of VD-1, to yield a score between 0 and 10.
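A minimal sketch of the model-building and rescaling steps, reduced to a single hypothetical predictor (CPPs) fitted by ordinary least squares rather than stepwise selection, and rescaled by min-max onto 0-10 purely for illustration (the study rescaled onto the VD-1 scale and used all six predictors in SPSS).

```python
def fit_line(x, y):
    """Closed-form simple linear regression (one predictor); the study
    fitted six predictors with stepwise selection, which is not
    reproduced here."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope  # intercept, slope

# hypothetical data: CPPs values and mean perceptual grades of six speakers
cpps = [16.0, 14.5, 13.0, 11.5, 10.0, 8.5]
g_mean = [0.0, 0.4, 1.0, 1.6, 2.4, 3.0]

b0, b1 = fit_line(cpps, g_mean)        # b1 < 0: lower CPPs, more hoarseness
raw = [b0 + b1 * c for c in cpps]      # unstandardized model output

# linear rescaling of the model output onto a 0-10 range
lo, hi = min(raw), max(raw)
scores = [(r - lo) / (hi - lo) * 10.0 for r in raw]
```

The key idea carried over from the study is that the regression output (predicted Gmean) and the final 0-10 score differ only by a linear transformation, so correlations with perception are preserved by the rescaling.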


Fourth, to investigate the criterion-related concurrent validity of AVQI in the three VDs, the correlation between Gmean and AVQI was calculated with the Spearman rank-order correlation coefficient (rs) and the coefficient of determination (rs²). Interpretation guidelines for rs were provided by Frey et al. Furthermore, to assess equality between the three rs values (viz., is AVQI equally valid across the three different VDs?), significance was determined with the VassarStats software. With Fisher's zr transformation, a z value was computed to estimate the significance of the difference (i.e., the two-tailed p value) between two correlation coefficients: two rs values are considered statistically significantly different when p ≤ 0.05, and vice versa.
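Both statistics can be sketched without specialized software: Spearman's rs as the Pearson correlation of ranks, and the Fisher zr comparison of two independent correlations. The sample call below plugs in the paper's extreme rs values with an assumed n = 60 per group.

```python
import math

def ranks(values):
    """Average ranks (ties share the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed p for the difference of two independent correlations
    via Fisher's z transformation."""
    z = (math.atanh(r1) - math.atanh(r2)) / \
        math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# the paper's extreme rs values, with an assumed n = 60 each
z, p = compare_correlations(0.929, 60, 0.891, 60)
```

For these inputs p is well above 0.05, matching the reported finding that the three rs values did not differ significantly.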


Fifth, to evaluate the perceptual diagnostic accuracy of AVQI across the three different VDs, several estimates were calculated. The diagnostic precision of a measure is commonly evaluated by its sensitivity (i.e., hoarse voices correctly testing positive on AVQI) and specificity (i.e., clear voices correctly testing negative on AVQI). However, depending on the AVQI threshold chosen to define a positive result, sensitivity and specificity vary. This trade-off can be visualized with the receiver operating characteristic (ROC) curve: for each AVQI cut-off score, a point is plotted with the true positive rate (i.e., sensitivity) on the ordinate and the false positive rate (i.e., 1 − specificity) on the abscissa. As mentioned by Barsties and Maryn, absence of hoarseness in a voice sound was assumed when there was modal agreement among all five judges on G = 0 (i.e., Gmean < 0.5). A voice was considered hoarse as soon as the mean rating rounded to Gmean ≥ 0.50. Thus, hoarseness ratings of the different VDs ranged from Gmean ≥ 0.50 to ≤ 3.


The ability of AVQI to discriminate between clear and hoarse voices was represented by the area under the ROC curve (AROC). AROC = 1.0 is found for measures that perfectly distinguish between clear and hoarse voices, whereas AROC = 0.5 corresponds with chance-level perceptual diagnostic accuracy. To provide additional evidence regarding the value of a diagnostic measure, and to help reduce problems with sensitivity/specificity related to base-rate differences in the samples (i.e., the uneven proportions between one severity degree of clear voices and three severity degrees of hoarse voices in 180 voice samples), likelihood ratios should also be calculated. The likelihood ratio for a positive result (LR+) indicates how the odds of the disorder increase when the test is positive, calculated as Sensitivity / (1 − Specificity); it expresses the likelihood that an individual is hoarse when testing positive. The likelihood ratio for a negative result (LR−) helps to determine whether an individual does not have a particular abnormal degree of hoarseness when testing negative, calculated as (1 − Sensitivity) / Specificity; it expresses the likelihood that an individual has normal voice quality when testing negative. As a general guideline, the diagnostic value of a measure is considered high when LR+ ≥ 10 and LR− ≤ 0.1. Because the LR statistics consider sensitivity and specificity simultaneously, they are less vulnerable to sample-size characteristics and base-rate differences between subjects with a clear voice and a hoarse voice.
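The ROC area and the likelihood ratios can be sketched from first principles: the AUC equals the probability that a randomly chosen hoarse voice scores higher on AVQI than a randomly chosen clear voice, and LR+/LR− follow directly from sensitivity and specificity at a cutoff. The AVQI scores and the cutoff of 3.0 below are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve, computed as the probability that a
    hoarse voice scores higher on AVQI than a clear voice (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

def likelihood_ratios(scores_pos, scores_neg, cutoff):
    """Sensitivity, specificity, LR+ and LR- at a given AVQI cutoff
    (score >= cutoff counts as a positive, i.e. hoarse, result)."""
    sens = sum(p >= cutoff for p in scores_pos) / len(scores_pos)
    spec = sum(n < cutoff for n in scores_neg) / len(scores_neg)
    lr_plus = sens / (1 - spec) if spec < 1 else float("inf")
    lr_minus = (1 - sens) / spec if spec > 0 else float("inf")
    return sens, spec, lr_plus, lr_minus

# hypothetical AVQI scores: hoarse (Gmean >= 0.5) vs. clear voices
hoarse = [4.1, 5.0, 6.2, 7.4, 8.8, 3.6]
clear = [1.2, 2.0, 2.8, 3.3]

sens, spec, lr_plus, lr_minus = likelihood_ratios(hoarse, clear, cutoff=3.0)
```

With these toy scores the groups separate perfectly (AUC = 1.0); at the cutoff of 3.0, sensitivity is 1.0, specificity 0.75, so LR+ = 4.0 and LR− = 0.0, illustrating how one misclassified clear voice drags LR+ well below the ≥ 10 guideline.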


Sixth, to analyze the internal consistency of AVQI, paired-sample t-tests were used to determine whether the absolute differences between the single-speech-task AVQI results and the AVQI total score differed significantly.


All results were considered statistically significant at p ≤ 0.05.






Aug 23, 2017 | Posted in OTOLARYNGOLOGY
