Medical Informatics and Databases

John C. Sok

Richard K. McHugh

Our ability to obtain medical information has increased exponentially with public access to the Internet (1). As more hospitals and health care systems transition to the use of electronic health records (EHRs), the potential for machine-searchable clinical data significantly increases. EHRs store various forms of data including clinical diagnosis, drug prescriptions, treatment interventions, testing, and other associated medical records. We have more clinical data available to us now than at any other time in history, and the volume only continues to increase. This chapter describes the emerging field of medical informatics in otolaryngology and serves as an introduction to data mining and searchable clinical databases.

Collections of clinical data comprise various forms of databases. Smaller studies are amenable to researcher-performed manual searches through primary data. However, given the increasing volume and complexity of the data, medical informatics will become even more essential when performing clinical studies. Data mining of clinical databases also has the potential to find associations that would not otherwise be uncovered by traditional manual search and analysis. Furthermore, genomics and other translational forms of data are adding more complexity to clinical databases. The volume and variety of data are likely to continue to increase and become more complex as we shift from analysis of static targets (e.g., DNA, RNA, protein sequences) to dynamic targets (e.g., transcription, expression, metabolism, and genomics). Although daunting, such data present us with unparalleled opportunities. The medical informatics required to perform successful analyses of large clinical databases entails the combination of biostatistics and computational biology, among many other disciplines. Here we provide a framework to understand and apply medical informatics to clinical and translational studies in otolaryngology.


A database represents any organized collection of information, usually in digital form. There are several types of databases utilized in clinical medicine, as reviewed by Harrison and Aller in 2008 (2). Larger regional and national data may be obtained from insurance or care provider claims data, single or cooperative provider repositories, or public health and government databases. Unfortunately, insurance databases and most government databases are not designed to record complete clinical records. Furthermore, such databases may be greatly impacted by selection, analysis, and interpretation biases (3). For instance, billing codes may be used to identify a study group. However, such codes may have been submitted as “rule out” diagnoses and may not represent the actual diagnosis. Moreover, the choice of codes utilized may be influenced by financial incentives. Further, only billed interventions may be listed, so the study lacks a complete clinical record of observations because nonbillable interventions are excluded. Therefore, databases composed of insurance claims data or similarly compiled data may be used to draw broad population-based associations but are not suitable for higher complexity clinical questions.

An individual clinician or clinical service often creates clinical databases composed of a series of patients with a common disease process, symptom, or treatment paradigm. Although this type of data is more precise and less prone to selection errors than insurance claims data, the study population and the geographic area covered are generally small. Such databases are usually created under a single Institutional Review Board (IRB) approval and designed to study a focused set of questions. These databases are usually small enough to facilitate manual analyses performed by one or more researchers. Unfortunately, databases created in this format are not broadly applicable to other clinical data mining queries.

Although some physicians still resist adopting EHR systems, EHRs are becoming more prevalent (4). In February 2009, the US Congress enacted into law the Health Information Technology for Economic and Clinical Health (HITECH) Act, which implemented new policies to induce adoption and “meaningful use” of EHRs by hospitals and physicians. In keeping with this, the American Academy of Otolaryngology—Head and Neck Surgery Medical Informatics Committee recently published formal recommendations and guidelines to encourage the adoption of EHRs (5). EHRs yield a wealth of clinical data. When planning a study that uses EHRs, the specific types of data and analyses should be determined before the study begins. For instance, the population under study should be definable such that conclusions have wide applicability. Other factors to consider include age, gender, interventions, and follow-up, among others. These factors produce many types of data. All of the temporal data and some laboratory results may be represented by numerical data. Binary coding may be used for gender, qualitative laboratory results, and status changes such as recurrence of a disease. Staging systems, ICD-9, and CPT numbers yield commonly accepted codes. However, a significant proportion of clinical data is textual and in analog form, derived from various health practitioner notes including clinic and operative notes, or descriptive reports from imaging studies.
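The mix of data types described above can be sketched as a single structured record. This is only an illustration: the class name, field names, and the choice of laboratory value are assumptions made for the example and do not correspond to any particular EHR schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PatientRecord:
    """Hypothetical study record; every field name is an assumption for illustration."""
    patient_id: str
    birth_date: date                   # temporal data, stored as dates
    diagnosis_date: date
    sex_female: bool                   # binary coding
    recurrence: bool                   # binary status change
    icd9_code: str                     # commonly accepted code, e.g. "161.0"
    stage: str                         # staging system value, e.g. "T2N0M0"
    hemoglobin_g_dl: Optional[float]   # numerical laboratory result, may be missing
    operative_note: str                # free text, not directly searchable as coded data

    def age_at_diagnosis(self) -> int:
        """Whole years between birth and diagnosis."""
        b, d = self.birth_date, self.diagnosis_date
        # Subtract one year if the birthday has not yet occurred in the diagnosis year.
        return d.year - b.year - ((d.month, d.day) < (b.month, b.day))
```

Deriving values such as age from stored dates, rather than entering them by hand, keeps the temporal data numerical and reproducible.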

Outcome measures, narrative text, and other textual data are the most cumbersome to evaluate. They require a researcher to manually evaluate each case individually in order to ascertain the data, and they often present additional challenges related to the analysis of such data. Furthermore, text-based data sets are generally not conducive to data mining analyses for several reasons (6). First, EHRs record a significant portion of data in nonsearchable textual form. Nonsearchable text-based data either need to be transferred to a data warehouse as coded and searchable data, or a text-based search engine capable of utilizing the EHR directly must be in place. Second, there are a multitude of incompatible EHR systems. The differences in these systems limit one’s ability to merge the data for collaborative efforts. Natural language processing presents a possible solution for recording and analyzing textual data, and it has shown at least limited success for data mining in the field of Allergy (7).
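To make the difficulty with free text concrete, the following is a minimal sketch of a keyword search over clinical notes with a naive negation filter. It is far simpler than real natural language processing and is not the method used in the cited study; the function name, the list of negation cues, and the sample notes are all assumptions for illustration.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately short list of negation cues.
NEGATION_CUES = ("no", "denies", "without", "negative for")


def find_mentions(notes: Dict[str, str], term: str) -> Dict[str, List[str]]:
    """Return, per note id, sentences that mention `term` with no negation cue before it."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits: Dict[str, List[str]] = {}
    for note_id, text in notes.items():
        # Split into sentences at whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            match = pattern.search(sentence)
            if not match:
                continue
            preceding = sentence[:match.start()].lower()
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                for cue in NEGATION_CUES
            )
            if not negated:
                hits.setdefault(note_id, []).append(sentence.strip())
    return hits


notes = {
    "n1": "Patient reports hoarseness for 3 months. No dysphagia.",
    "n2": "Denies hoarseness. Reports otalgia.",
}
mentions = find_mentions(notes, "hoarseness")
```

Even in this toy case, a plain keyword match would wrongly count the second note, which is one reason text-based records usually still need manual review or dedicated language processing.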

The Veterans Administration (VA) manages a nationwide system of EHRs. This is perhaps the best documented of the EHR models, and it has been noted to have significant advantages and disadvantages (8). The VA EHR system is known as the Veterans Health Information Systems and Technology Architecture, or VistA. The greatest advantage of the VA EHR database is the immense volume of available clinical data. However, at this time, its data searches are imperfect and can produce duplicative results, which require manual review and create inefficiencies. There are plans to update VistA to improve its data mining capabilities and to support evidence-based medicine studies.
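When search results have already been exported to a structured form, exact duplicates can be removed programmatically before manual review. The sketch below assumes each result is a dictionary and that the hypothetical fields `patient_id`, `encounter_date`, and `code` together identify one event; these names are not taken from VistA.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(
    rows: Iterable[Dict[str, str]],
    key_fields: Tuple[str, ...] = ("patient_id", "encounter_date", "code"),
) -> List[Dict[str, str]]:
    """Keep the first row for each distinct combination of key fields, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


results = [
    {"patient_id": "p1", "encounter_date": "2010-03-01", "code": "161.0"},
    {"patient_id": "p1", "encounter_date": "2010-03-01", "code": "161.0"},
    {"patient_id": "p2", "encounter_date": "2011-07-15", "code": "161.0"},
]
unique_results = deduplicate(results)
```

This only catches rows that match exactly on the chosen fields; near-duplicates with differing dates or codes would still need a researcher's judgment.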

Because EHR databases are difficult to use, owing to textual data and search strategies that were not designed to facilitate clinical studies, researchers have created data warehouses (6). Data warehouses are composed of EHR data that are coded and structured, with all personal patient information deidentified for compliance with the Health Insurance Portability and Accountability Act (HIPAA) of 1996. The software for such a warehouse is designed to support clinical database studies and data mining. As such, data warehouses represent a powerful source for associative clinical studies. However, the cost to create and maintain a data warehouse is substantial. The use of such a clinical data framework for research on a population at a national level has been examined for the United States (9).
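As a rough sketch of one deidentification step, the example below drops a few direct identifiers and replaces the patient identifier with a keyed pseudonym so that records from the same patient can still be linked. It is an illustration only and is not sufficient on its own for HIPAA compliance; the field names, the identifier list, and the key handling are assumptions.

```python
import hashlib
import hmac
from typing import Any, Dict

# Hypothetical subset of direct identifiers; a real list would be longer.
DIRECT_IDENTIFIERS = {"name", "ssn", "address", "phone"}


def deidentify(record: Dict[str, Any], secret_key: bytes) -> Dict[str, Any]:
    """Return a copy of `record` without direct identifiers and with a keyed pseudonym."""
    digest = hmac.new(secret_key, record["patient_id"].encode("utf-8"), hashlib.sha256)
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    # Truncated hex digest used as a stable, non-reversible linkage key.
    cleaned["pseudonym"] = digest.hexdigest()[:16]
    return cleaned


raw = {"patient_id": "12345", "name": "Example Patient", "icd9_code": "161.0"}
safe = deidentify(raw, secret_key=b"example-key-kept-outside-the-warehouse")
```

Using a keyed hash rather than a plain hash means the pseudonym cannot be recomputed from the identifier alone by someone who lacks the key.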

Regardless of the source, data must be accurate and standardized in order to be useful in any study. Therefore, the first step in answering a research question that proposes to query a database should be to ensure that the data are of acceptable quality for such analysis. Likewise, the first step in creating a database should be to set universal standards for how data are recorded to maintain or increase relevance for studies.
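A data quality check of this kind can be sketched as a small validation function run before any analysis. The required fields, the simplified ICD-9 format pattern, and the plausibility rule below are assumptions chosen for illustration, not an established standard.

```python
import re
from datetime import date
from typing import Any, Dict, List

# Simplified shape check for ICD-9 style codes (e.g. "161.0", "V10.21", "E880.1").
# It tests format only, not whether the code exists.
ICD9_PATTERN = re.compile(r"^(\d{3}|V\d{2})(\.\d{1,2})?$|^E\d{3}(\.\d)?$")

REQUIRED_FIELDS = ("patient_id", "icd9_code", "diagnosis_date")


def quality_issues(record: Dict[str, Any], today: date) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    code = record.get("icd9_code")
    if code and not ICD9_PATTERN.match(code):
        issues.append(f"malformed ICD-9 code: {code}")
    diagnosed = record.get("diagnosis_date")
    if isinstance(diagnosed, date) and diagnosed > today:
        issues.append("diagnosis_date is in the future")
    return issues
```

Records that return an empty list pass these particular checks; anything else can be routed to manual review before the record enters a study data set.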


As clinical databases grow to provide increased statistical power and become representative of the overall population, the methods to perform clinical analysis of these databases grow beyond the limit of manual search strategies. Medical informatics seeks to create and utilize robust computational methods in database studies. The basic principles of clinical trial design and statistical analyses are the foundation for medical informatics. Although the details of these principles are beyond the scope of this chapter, many reviews on these topics have been published (3,10,11,12). It should be noted that nonrandomized or noncontrolled data are not able to establish causality. However, population-based correlative studies may be extremely powerful at revealing associations and limiting bias and may be applicable to situations in which a randomized, controlled trial could not otherwise be performed.

Unfortunately, studies that are not randomized and controlled are more likely to be predisposed to bias. Bias entails random, systematic, or intentional disagreement between the results and the true occurrence. Bias may influence any part of a clinical study including population accrual, data collection, analysis, or interpretation. Formulation of any clinical study should include efforts to understand and limit potential biases (3).

Data mining involves “extraction of implicit, previously unknown, and potentially useful information” through mathematical analysis (13). Therefore, through pattern discovery, clustering, and other methods, it may reveal clinically significant associations. By definition, data mining is different from the standard statistical analyses used in clinical studies. Brown and Harrison provide encompassing reviews of data mining and the complex analyses possible (14,15). As with any database study, data mining requires complete, accurate, structured, and coded data to perform adequate calculations. Unfortunately, medical data are perhaps the most difficult on which to perform data mining due to multiple factors including high dimensionality, heterogeneity, imprecision, and temporal patterns (15). A number of premade tools for data mining exist. “Open-source” tools created in a collaborative, nonprofit, and expandable manner have distinct advantages over commercial data mining software packages (16). Specific usage of data mining tools is beyond the scope of this introduction, and collaboration with a statistician familiar with these methods is recommended.
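As a toy illustration of pattern discovery, the sketch below counts how often pairs of coded features occur together across patients and reports support and lift, two measures commonly used in association mining. It is not one of the tools referenced above; the function name, the threshold, and the placeholder feature labels are assumptions, and the data are invented solely to show the arithmetic.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence


def pair_associations(transactions: Sequence[Sequence[str]], min_support: float = 0.2) -> List[Dict]:
    """Return feature pairs meeting `min_support`, sorted by descending lift."""
    n = len(transactions)
    item_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for items in transactions:
        unique = sorted(set(items))          # ignore repeats within one patient
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    results = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of patients with both features
        if support < min_support:
            continue
        # Lift > 1 means the pair co-occurs more often than independence would predict.
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        results.append({"pair": (a, b), "support": support, "lift": lift})
    return sorted(results, key=lambda r: r["lift"], reverse=True)


# Invented toy data: each inner list holds the coded features of one patient.
toy_patients = [
    ["feature_A", "feature_B"],
    ["feature_A", "feature_B"],
    ["feature_A"],
    ["feature_C"],
]
associations = pair_associations(toy_patients)
```

A high lift indicates only co-occurrence, not causation, which is consistent with the caution above about nonrandomized data.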


Clinical databases may take many forms based on the type of data, structure, and other features (see Table 204.1). Here we describe the steps to complete three types of clinical study based on specific databases. First, we describe the meta-analysis study using PubMed and other literature databases, a commonly employed study type in clinical research. Second, we describe the Surveillance, Epidemiology, and End Results (SEER) database for cancer population studies. Lastly, we describe the utility of the VA EHR, VistA. A published study relating to laryngeal cancer is used as the example for each database. Each example study may be obtained for in-depth information and to assist extrapolation to similar types of studies and databases. Although software-based search and analysis has become more commonplace, these studies required manual review by a researcher to some degree.


May 24, 2016 | Posted by in OTOLARYNGOLOGY | Comments Off on Medical Informatics and Databases
