To draw meaning from the exponentially increasing quantity of healthcare data, researchers must approach it from a big data perspective, using technologies capable of processing massive amounts of data efficiently and securely. The pharmaceutical industry faces the big data challenge through all phases of the drug development lifecycle. Genomics, clinical monitoring and pharmacovigilance illustrate the value of a big data approach. Whether focusing on genetic and environmental disease risks, pattern detection through real-time biosensors for patients or post-market monitoring of drug effectiveness, each area involves the collection and analysis of numerous variables and requires extreme computing power to reveal the details of the interplay between the variables. To make big data work for pharmaceutical information, attention must be paid to data collection on a vast scale, from multiple sites and over long time periods. Big data support must be incorporated into interoperable electronic medical records and presented intuitively through visual analytics.
very large databases
Bulletin, June/July 2013
Turning Healthcare Challenges into Big Data Opportunities: A Use-Case Review Across the Pharmaceutical Development Lifecycle
by Timothy Schultz
The quest for efficient data acquisition, processing and consumption methodologies has been a topic of critical interest for decades across enterprise and academic institutions alike. For organizations to remain agile, novel insights must be derived and decisions made in a relatively short amount of time, often in light of limited observations and sparse data. With computational power and data storage now relatively inexpensive and with a society more interconnected through powerful mobile devices, increased access to invaluable data is beginning to be realized. However, an unintended consequence of these developments is the need to manage and process massive amounts of data securely and efficiently. These challenges are computationally and storage intensive in nature and are further complicated by an increasing emphasis on fault tolerance, redundancy and scalability. As a popular example, Facebook must continuously deal with the demand of hosting a massive number of social interactions per day, resulting in 500 TB of data. In order for Facebook to suggest relevant content to its projected 1.11 billion members in a timely manner based on past activity, computational resources must be seamlessly incorporated into its infrastructure to address increasing usage demands and a growing user base.
Big Data Overview
The term big data has become a buzzword in the field of information technology. It epitomizes the challenges just described in the broadest of terms. In the past, solutions to such challenges have focused on massive dedicated mainframe computers, distributed computational grids and, more recently, so-called cloud services. The aim of these solutions has been to distribute a computationally intensive workload across a series of dedicated resources, such that the overall execution time is reduced as a function of available compute nodes. Popularized by third-party services such as Amazon’s Elastic Compute Cloud (EC2) and Rackspace Cloud, the cloud-based offerings allow organizations to scale technology infrastructure in an agile manner by completely outsourcing and abstracting away hardware layers. As a result, these solutions in principle allow users to add computing resources to their infrastructure in an ad hoc manner. More specifically, however, big data has also become synonymous with Apache’s software stack built around Hadoop, including MapReduce, HDFS, HBase and Hive, along with related Apache projects such as Cassandra. Inspired by a series of research papers published by Google to solve the distributed computing challenges facing contemporary search engines, Hadoop was the first to gain widespread recognition. This attention is due to its distributed nature and built-in fault-tolerance features. The software stack allows users to divide complex processes across a number of computational devices whose results are then merged by a central server. Services such as the ones provided by Cloudera allow organizations to easily administer the many working components of a Hadoop cluster through user-friendly, web-based administration tools. While Hadoop is not the only stack that offers distributed computational services, it has gained a great deal of attention for its approach to big data problems.
Figure 1. The Hadoop open-source technology stack as provided by Cloudera (www.cloudera.com/content/cloudera/en/products/cdh.html)
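The divide-and-merge pattern described above can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (map_partition, merge_counts, run_job) and the toy partitions are assumptions made for illustration, and on a real cluster each map call would run on a separate node.

```python
from collections import Counter
from functools import reduce


def map_partition(records):
    """Map step: one worker counts the keys found in its own partition."""
    return Counter(records)


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def run_job(partitions):
    """Apply the map step to every partition, then merge the partial results."""
    # In this sketch the partitions are processed one after another;
    # a cluster would process them in parallel on different machines.
    partials = [map_partition(part) for part in partitions]
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    # Hypothetical data: three partitions of event labels.
    toy_partitions = [["a", "b", "a"], ["b", "c"], ["a"]]
    print(run_job(toy_partitions))
```

Because the merge step is associative, partial results can be combined in any order, which is what makes this kind of workload easy to spread across nodes.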
Current Applications of Big Data in Health Care
The most prevalent examples of big data currently reside with social media because user-contributed content provides a robust source of data for analysis. However, the issues discussed above are certainly applicable to the healthcare industry as well (Dai et al., 2012). Biological and health-based data are naturally much more complicated and difficult to collect than social-media data. Modeling biological phenomena is typically very complex and has always been understood to be a computationally intensive process. In addition, the Affordable Care Act will undoubtedly continue to transform the healthcare environment in the United States. The ability of pharmaceutical companies to continue bringing new life-saving/life-enhancing medicines to patients in a timely, yet cost-effective, manner will be dependent on their ability to skillfully manage big data generated during all phases of pharmaceutical development. Experience shows that the complexity of such a challenge will be directly related to the complexity of the targeted disease state.
From early research to clinical application and ultimately to pharmacovigilance, the process of delivering therapeutic interventions to patients is an inherently complex, interdisciplinary and interconnected workflow. The pharmaceutical development process consists of many functional research areas. Each contributes a specialized set of insights and knowledge to the underlying mechanisms of a disease or disorder. Not surprisingly, each area also has its own unique set of data challenges. These challenges are further complicated by the difficulty of sharing and translating insights across functions. The big data healthcare problem is based on identifying novel ways to acquire, process and disseminate biomedical data more efficiently. The concept of achieving computational efficiency is certainly not new. It has been the focus of intense research for decades. Nevertheless, the recent focus on challenges associated with big data beyond the scope of healthcare research and development has also generated opportunities for closer partnerships between researchers and supporting business functions. Brief examples of big data use-cases sampled across the pharmaceutical development lifecycle include those found in genomics, clinical monitoring and pharmacovigilance.
Figure 2. A high-level overview of the pharmaceutical development lifecycle, with emphasis on the sharing of translational insights between functional areas
Genomics. One particular area of interest is identifying the relationships between a disease and its genetic, environmental and/or health-based risk factors. Each not only gives a unique view into the underlying mechanisms of diseases and disorders, but also reveals the interplay between different types of risk factors. Identifying risk-based genes is a means to uncover biological pathways for direct therapeutic interventions, while personal risk factors establish corrective interventions, which patients can implement to reduce their risk of developing particular diseases. An example is discussed in Su et al. (2013), where researchers use Bayesian network (BN) learning to untangle the complex relationship between the intrinsic and extrinsic factors that drive bladder cancer. In brief, a Bayesian network is a probabilistic graph model used to expose the conditional dependencies between random variables. To construct the BN structure in this example, variables themselves are operationalized as a set of binary values. Categorical variables can be expressed simply as binary True/False, while continuous variables (such as the level of a specific biomarker) can be similarly expressed by whether the value has exceeded some operational threshold (biomarker positive/negative). Each variable (gene, environmental condition, health condition, existence of biomarker and so forth) is expressed as a node, where directed edges connect those variables that exhibit a directional dependency. Each node is associated with a function that captures the probability of all occurrence combinations with its direct parents. If Node A is connected to Node B, we say that B depends on A or A is a parent of B. Constructing a BN is complex, and adding variables to the analysis exponentially increases the quantity of possible comparisons for consideration.
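As a rough illustration of the node function described above, the following sketch binarizes a continuous variable against a threshold and estimates a conditional probability table for one node by counting. The variable names, the threshold and the eight subject records are invented for illustration and are not taken from the bladder cancer study; the helper names binarize and estimate_cpt are likewise hypothetical.

```python
from collections import defaultdict


def binarize(value, threshold):
    """Express a continuous measurement as positive (True) or negative (False)."""
    return value > threshold


def estimate_cpt(rows, child, parents):
    """Estimate P(child = True | parent values) by simple counting.

    rows is a list of dicts mapping variable names to True/False.
    The result maps each observed combination of parent values to a probability.
    """
    counts = defaultdict(lambda: [0, 0])  # parent values -> [child true, total]
    for row in rows:
        key = tuple(row[p] for p in parents)
        counts[key][0] += int(row[child])
        counts[key][1] += 1
    return {key: true / total for key, (true, total) in counts.items()}


if __name__ == "__main__":
    # Invented toy records: (smoker, biomarker level, disease observed).
    raw = [
        (True, 5.1, True), (True, 6.3, True), (True, 2.0, True), (True, 1.5, False),
        (False, 4.8, False), (False, 7.2, True), (False, 0.9, False), (False, 2.2, False),
    ]
    rows = [
        {"smoker": s, "biomarker": binarize(level, threshold=4.0), "disease": d}
        for s, level, d in raw
    ]
    for parents, prob in estimate_cpt(rows, "disease", ["smoker", "biomarker"]).items():
        print(parents, prob)
```

With binary variables, a node with k parents needs a table of 2^k parent combinations, which is one reason the analysis grows quickly as variables are added.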
To give an example, researchers applied BN analysis to a set of 1447 single nucleotide polymorphisms (SNP) belonging to genes known to be associated with cancer, smoking history, environmental risk factors such as arsenic exposure and demographic information. Three different approaches were used to learn the structure of the bladder cancer BN, and their results were compared. Overall, it was determined that the three approaches differed slightly in their results, but revealed similar underlying patterns. Interplay between three SNPs in linkage disequilibrium was identified; however, the directionality of these relationships did not always agree between the learning methods. No genetic risk factors were identified to have any relationship with other health factors, but, as the researchers note, this finding is not very surprising since only cancer-risk genes were selected for the analysis. In theory, genetic polymorphisms could show relationships with one’s affinity to smoke, should addiction risk-based genes be selected.
While many conventional BN analyses can be executed on a single central processing unit (CPU), as was done in this case, BN learning can require scalable processing in the conventional big data sense, depending on the size of the data and the size of the directed graph. To that end, Bayesian network parameter learning has been ported to run in a distributed fashion over Hadoop (Basak et al., 2012) to accommodate significantly larger graphs than the one constructed for this analysis.
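To give a sense of why structure learning becomes expensive as graphs grow, the sketch below counts the possible directed acyclic graphs on n labeled nodes using Robinson's recurrence. This calculation is offered as background and is not part of the study discussed above; the function name count_dags is an assumption.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, count_dags(n))
```

The count goes from 3 structures for two variables to 25 for three and 543 for four, so exhaustive search is out of reach long before an analysis reaches hundreds of variables, and learning methods must rely on heuristics or distributed computation.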
Clinical Monitoring. The use of biosensors, devices that can detect and measure some physiological activity on or within the body, is quickly becoming an area of intense research. New technology start-up companies such as Ginger.io are using the popularity of feature-rich smartphones to monitor the well-being of patients by assessing various metrics of biofeedback, including blood glucose, heart rate and accelerometer monitoring. Understanding how to detect reproducible patterns in signals acquired from biosensors suggests a non-invasive way of learning about underlying physiological processes. Ginger.io allows for real-time capture of biometrics to track health-based outcomes for patients by securely monitoring biometrics through a smartphone. In recent years, many types of research-grade instrumentation, such as EEG headsets and stress sensors, that were once available only to organizations with large budgets have become more commercialized and obtainable by research enthusiasts. Open source APIs allow researchers to directly access the devices on both desktops and mobile devices, such that novel applications can be devised. As the form factors of these sensors continue to shrink and become less cumbersome when placed on the body, as well as when integrated directly into clothing fabrics, access to large amounts of continuous streaming biometric data for analysis increases.
The most attractive area of biosensor analysis can be found in pattern detection, by linking the behavior of biosignals to known phenomena that occur within the body. For example, researchers use data acquired from accelerometers to capture the movement of subjects along three spatial axes. By finding patterns in the accelerometer data, researchers can predict what activities the subjects are currently engaged in, such as running, walking or typing, and develop metrics for daily energy expenditure (Stikic et al., 2009), although the accuracy of such predictions greatly diminishes outside a controlled laboratory. The possibility of such prediction suggests using accelerometer data from a mobile device to detect early and subtle signs of Parkinson’s in at-risk patients. As another example, researchers have used EEG headsets to monitor the brain activity of autistic children when carrying out certain tasks, such as facial and emotion recognition (Cooper et al., 2013). These examples require a very large amount of both healthy control data and targeted patient data to establish a “ground truth.” From these pools of subjects, some data must be set aside to train a semi-supervised approach; these training data are explicitly annotated with the actions the person was actually taking at that time. The uniqueness of each patient, coupled with noise from the sensor itself, can mean that the amount of training data required can be quite substantial.
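The role of annotated training data can be sketched with a deliberately simple classifier. The example below uses a fully supervised nearest-centroid rule rather than the semi-supervised methods mentioned above, and the feature vectors (mean and variability of acceleration magnitude per window) and activity labels are invented for illustration.

```python
from math import dist
from statistics import mean


def train_centroids(labeled_features):
    """Average the feature vectors of each annotated activity into one centroid."""
    grouped = {}
    for label, vector in labeled_features:
        grouped.setdefault(label, []).append(vector)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Assign an unlabeled feature vector to the activity with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


if __name__ == "__main__":
    # Invented annotated windows: (activity, (mean magnitude, variability)).
    training = [
        ("walking", (1.1, 0.2)), ("walking", (1.2, 0.3)),
        ("running", (2.0, 0.9)), ("running", (2.2, 1.1)),
        ("typing", (1.0, 0.02)), ("typing", (1.0, 0.04)),
    ]
    centroids = train_centroids(training)
    print(predict(centroids, (2.1, 1.0)))
```

Even in this toy form, each activity needs several annotated examples per subject before its centroid is meaningful, which hints at why the volume of training data grows quickly with patient variability and sensor noise.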
The capture, indexing and processing of continuously streaming (and possibly annotated) fine-grained temporal data is another big data challenge of increasing interest. The processing of biosignals can include a feature selection step that is executed across metrics calculated within windows of varying and overlapping lengths of time, with resolution ranging from a millisecond to seconds. Understanding the variations within population samples and between comparative populations can be a computationally intensive process, considering the number of subjects required and time-slices under scrutiny. Patient monitoring, assessing health outcomes and understanding the physiological impact a disease has on a patient are just a few examples in which big data analysis has a potential for major impact in this area. Accurate and non-invasive means of inquiring about the inner workings of diseases and disorders can allow researchers to develop innovative ways of applying these insights as future clinical trial endpoints or of remotely alerting caretakers to an impending medical issue. The implications are seemingly limitless.
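The windowing step described above can be sketched as follows. The window size, step and the two summary metrics (mean and population standard deviation) are assumptions chosen for illustration; real biosignal pipelines typically compute many more metrics over several window lengths.

```python
from statistics import mean, pstdev


def sliding_windows(samples, size, step):
    """Yield windows of `size` samples, advancing by `step` (overlap when step < size)."""
    for start in range(0, len(samples) - size + 1, step):
        yield samples[start:start + size]


def window_features(samples, size, step):
    """Compute one (mean, standard deviation) pair per window."""
    return [(mean(w), pstdev(w)) for w in sliding_windows(samples, size, step)]


if __name__ == "__main__":
    # Hypothetical signal sampled at a fixed rate.
    signal = [0.9, 1.0, 1.1, 1.0, 2.4, 2.6, 2.5, 1.0, 0.9, 1.1]
    for size in (2, 4):  # two window lengths, each with 50% overlap
        print(size, window_features(signal, size=size, step=size // 2))
```

Repeating this for every subject, every sensor channel and every window length is what turns a modest recording into a large feature table.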
Pharmacovigilance. The World Health Organization defines pharmacovigilance (PV) as “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem.” When drugs are evaluated for efficacy and safety in large Phase 3 clinical trials, studies are conducted under controlled experimental conditions in which inclusion and exclusion criteria are enforced. However, once drugs are approved and available on the market, a great deal of monitoring by the pharmacovigilance function must be done to ensure that drugs are performing as expected outside the clinic and in the commercial environment. All reports of adverse events (AE) experienced by patients are taken very seriously. They are reported to the FDA and captured in the FDA Adverse Event Reporting System (FAERS) database. The FDA publicly releases reports of these incidents on a quarterly basis with identifying personal information removed. Understanding AEs in the commercial setting is often complicated by the fact that patients may take drugs in combinations beyond those included in well-controlled clinical trials. Since clinical trials may exclude patients who are taking specific medications concomitantly, drug-drug interactions may not be entirely understood at the time of commercialization. One solution to this challenge is to analyze the millions of adverse event reports each year to determine whether a particular set of drugs, taken together in a regimen, may be responsible. Expanding drug regimens into all possible two-way, three-way and four-way combinations suggests a dataset of potential interaction combinations on the order of trillions of possibilities. The goal is to identify drug combinations of interest that result in adverse events occurring more often than would be expected.
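The scale of the combination space can be checked with a short calculation. In the sketch below, the figure of 3,000 distinct drugs is an assumption chosen only to show the order of magnitude, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def expand_regimen(drugs, max_size=4):
    """List every two-way up to max_size-way combination within one patient's regimen."""
    unique = sorted(set(drugs))
    return [
        combo
        for k in range(2, max_size + 1)
        for combo in combinations(unique, k)
    ]


def candidate_space(n_drugs, max_size=4):
    """Count all possible combinations of 2..max_size drugs drawn from n_drugs."""
    return sum(comb(n_drugs, k) for k in range(2, max_size + 1))


if __name__ == "__main__":
    print(expand_regimen(["drug_a", "drug_b", "drug_c"]))
    # Assumed vocabulary of 3,000 distinct drugs.
    print(f"{candidate_space(3000):,}")
```

Under that assumption the four-way combinations alone number more than three trillion, which is consistent with the scale described above.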
One example depicting a solution to this problem is explained in Wills (2011), where an analysis of FDA data similar to that described above is processed over Hadoop. The provided source code contains data manipulation procedures to first merge plain-text CSV data together on a per-subject basis, expand drug-reaction possibilities and express results in the form of a contingency table. When this logic is executed on a Hadoop cluster, the computation of the final contingency table can be distributed across computational nodes using MapReduce and reassembled on a central server. Overall, researchers demonstrated how results could be expressed and visualized as a graph, where nodes represented a particular drug, and a weighted, undirected connection between them represented the strength of a possible interaction. Clusters formed in the resulting graph showed tightly interconnected groups of similar drugs such as those used to combat HIV and various types of cancers. However, they also note the need to ultimately include qualified clinicians in the analysis process to help interpret results.
Lastly, while the monitoring of dangerous drug-drug interactions in clinical practice is an important activity, another interesting area takes the opposite approach in identifying drugs with synergistic efficacy. In other words, identifying comorbid conditions common in a particular disease can suggest other therapeutic interventions that further alleviate secondary symptoms. Disentangling symptoms directly attributed to the mechanism of a disease state from those caused as a consequence is an incredible challenge. A recent investigatory trial has since suggested that depression medication not be used in dementia patients (Banerjee et al., 2011).
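The general shape of such an analysis can be sketched in a few lines. This is not the published source code: the report structure, the function names and the observed-to-expected ratio used here are assumptions chosen for illustration, and the four reports are invented.

```python
from collections import Counter
from itertools import combinations


def contingency(reports):
    """Count reports per drug pair, per reaction, and per (pair, reaction)."""
    pair_reaction, pair_total, reaction_total = Counter(), Counter(), Counter()
    for report in reports:
        pairs = list(combinations(sorted(set(report["drugs"])), 2))
        reactions = set(report["reactions"])
        pair_total.update(pairs)
        reaction_total.update(reactions)
        pair_reaction.update((p, r) for p in pairs for r in reactions)
    return pair_reaction, pair_total, reaction_total, len(reports)


def reporting_ratio(pair, reaction, tables):
    """Observed count divided by the count expected if pair and reaction were independent."""
    pair_reaction, pair_total, reaction_total, n_reports = tables
    expected = pair_total[pair] * reaction_total[reaction] / n_reports
    return pair_reaction[(pair, reaction)] / expected if expected else 0.0


if __name__ == "__main__":
    # Invented adverse event reports.
    reports = [
        {"drugs": ["drug_a", "drug_b"], "reactions": ["rash"]},
        {"drugs": ["drug_a", "drug_b"], "reactions": ["rash"]},
        {"drugs": ["drug_a", "drug_c"], "reactions": ["nausea"]},
        {"drugs": ["drug_b", "drug_c"], "reactions": ["nausea"]},
    ]
    tables = contingency(reports)
    print(reporting_ratio(("drug_a", "drug_b"), "rash", tables))
```

Because every count is a sum over reports, the counting loop can be split across nodes and the partial counters added together afterward, which is what makes the problem a natural fit for MapReduce. A ratio well above 1 only flags a combination for review; as noted above, clinicians are still needed to interpret it.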
Barriers to Entry and Areas of Opportunity
The daunting task of managing big data will frequently require pharmaceutical information scientists to develop innovative approaches to address complex data computation processes. However, a number of barriers currently impeding implementation of such strategies must be overcome before these data can be effectively and routinely managed. The time-consuming process of data collection commensurate with the scope of the task (for example, Phase 3 clinical trial management) and its secure internal and external dissemination must first be addressed. The quantity of longitudinal and/or cross-sectional observations needed to be deemed “big data worthy” often results in complications from both a temporal and a logistical perspective. Comprehensive, multi-site research collaborations continue to expand in the era of translational medicine. One such collaboration is the Alzheimer’s Disease Neuroimaging Initiative (ADNI), which has yielded many great insights into the etiology of Alzheimer’s disease over the past eight years. Yet during this time very few big data applications have been generated. But this paucity may not be due to a lack of effort. Relying on a conventional brute force big data approach to understanding an insidious disease such as Alzheimer’s by following patients in the clinic over the course of the disease in its entirety may simply not be feasible. However, this challenge is certainly not unique to Alzheimer’s disease. Researchers exploring cures for other poorly understood medical conditions such as autism face a similar dilemma. When, due to the nature of a particular disease or condition, larger quantities of data are required to improve the statistical power of a clinical study in a highly regulated environment, researchers must continue to develop novel ways of challenging old data analysis and modeling paradigms in light of these limitations (Yang et al., 2011). In this way, current challenges may be turned into future opportunities.
While some of the most full-featured electronic medical record (EMR) systems may have some big data applications in the form of detailed plain-text clinician notes and medication information, these setups are in the minority (Kokkonen et al., 2013). Along those lines, obtaining insights from disparate data sources and incorporating them into the same analysis workflow is vital. However, such integration is still an issue when lack of data-standard adherence is prevalent. In addition, legacy clinical data trapped within flat files must be seamlessly integrated into research workflows. Data curation and interoperability are two areas that will assist in the timely discovery and dissemination of data for analysis. By enabling researchers to locate useful data, quickly assess applicable endpoints and understand how studies were constructed, more robust and integrated datasets can be created. Using domain-specific ontologies such as the Neuroscience Information Framework (NIF) (Gardner et al., 2008) and a general descriptive-study ontology as discussed in Russ et al. (2011) to annotate independent variables can facilitate this process, making big data use-cases more viable from a data standpoint.
Lastly, while the content thus far has focused on the quantitative aspects of big data, one must also note qualitative considerations. The area of visual analytics presents an area of synergistic research with big data by conceptualizing the output of complex processes through intuitive graphical means. Metrics dashboarding, real-time interactive visualization and giga-node graph exploration are some examples that would serve as appropriate visualization solutions to the big data examples discussed above. By enabling researchers to scrutinize visual representations of solutions, latent patterns in the data can be identified through quantitative means. Since the application of big data solutions proposes comprehensive workflows that convert unstructured data into analysis-ready datasets, consideration of the structure of the end-data models is vital for the visualization process. An example would be a large weighted adjacency matrix to express the structure of a graph network. Generalizable data models simplify the visualization workflow process as well. By allowing data to be seamlessly piped across specialized visualization applications, researchers can gain insights of varying dimensions into the data.
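As a small illustration of the end-data model mentioned above, the sketch below converts a list of weighted, undirected edges into a symmetric adjacency matrix. The edge list and the function name are invented for illustration; at the scale discussed here a sparse representation would be needed in place of nested lists.

```python
def adjacency_matrix(edges):
    """Build a symmetric weighted adjacency matrix from (node, node, weight) edges.

    Returns the sorted node names and a matrix whose rows and columns follow that order.
    """
    nodes = sorted({name for a, b, _ in edges for name in (a, b)})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for a, b, weight in edges:
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight  # undirected: mirror the entry
    return nodes, matrix


if __name__ == "__main__":
    # Hypothetical interaction strengths between three drugs.
    edges = [("drug_a", "drug_b", 2.0), ("drug_b", "drug_c", 0.5)]
    names, matrix = adjacency_matrix(edges)
    print(names)
    for row in matrix:
        print(row)
```

Keeping the node order alongside the matrix is what lets the same structure be handed to different visualization tools without re-deriving which row corresponds to which entity.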
The domain of healthcare research poses an incredibly challenging set of problems that require the synergistic insights of clinical research and information technology. By understanding the complex nature of acquiring, processing and maintaining health-based data along the entirety of the pharmaceutical development lifecycle, novel technologies and statistical applications can be devised. In light of current legislation such as the Affordable Care Act and mounting medical challenges such as the Alzheimer’s epidemic, the solutions discussed above are more pertinent now than ever before. However, while big data solutions are an attractive area for researchers, many data integration and interoperability issues still impede their widespread implementation. Regardless, the concepts of big data, as has been shown here, have very practical applications for aiding in the development of future life-saving therapeutic interventions.
Resources Mentioned in the Article
Dai, L., Gao, X., Guo, Y., Xiao, J. & Zhang, Z. (2012). Bioinformatics clouds for big data manipulation. Biology Direct, 7(1), 43; Discussion, 43. Retrieved June 28, 2013, from www.biology-direct.com/content/7/1/43
Su, C., Andrew, A., Karagas, M. R. & Borsuk, M. E. (2013). Using Bayesian Networks to discover relations between genes, environment, and disease. BioData Mining, 6(1), 6. Retrieved May 28, 2013, from www.biodatamining.org/content/6/1/6
Basak, A., Brinster, I., Ma, X. & Mengshoel, O. (2012). Accelerating Bayesian Network parameter learning using Hadoop and MapReduce. In BigMine ’12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining (pp. 101-108). New York: ACM.
Stikic, M., Larlus, D. & Schiele, B. (2009). Multi-graph based semi-supervised learning for activity recognition. Proceedings of the 13th IEEE International Symposium on Wearable Computers (ISWC'09) (pp. 85-92). New York: IEEE.
Cooper, N. R., Simpson, A., Till, A., Simmons, K. & Puzzo, I. (2013). Beta event-related desynchronization as an index of individual differences in processing human facial expression: Further investigations of autistic traits in typically developing adults. Frontiers in Human Neuroscience, 7, 159.
Wills, J. (2011). Using Apache Hadoop to find signal in the noise: Analyzing adverse drug events. Retrieved May 28, 2013, from http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
Banerjee, S., Katona, C., Knapp, M., Lawton, C., Lindesay, J., Livingston, G., et al. (2011). Sertraline or mirtazapine for depression in dementia (HTA-SADD): A randomised, multicentre, double-blind, placebo-controlled trial. The Lancet, 378(9789), 403-411.
Yang, E., DiBernardo, A., Farnum, M., Lobanov, V., Schultz, T., Verbeeck, R., et al. (2011). Quantifying the pathophysiological timeline of Alzheimer's disease. Journal of Alzheimer's Disease: JAD, 26(4), 745.
Kokkonen, E. W. J., Davis, S. A., Lin, H., Dabade, T. S., Feldman, S. R. & Fleischer, A. B. (2013). Journal of the American Medical Informatics Association, 20(e1), e33-e38. Retrieved May 28, 2013, from http://jamia.bmj.com/content/20/e1/e33
Gardner, D., Gupta, A., Halavi, M., Kennedy, D. N., Marenco, L., Martone, M. E., et al. (2008). The neuroscience information framework: A data and knowledge environment for neuroscience. Neuroinformatics, 6(3), 149-160.
Russ, T. A., Ramakrishnan, C., Hovy, E. H., Bota, M. & Burns, G. A. P. C. (2011). Knowledge engineering tools for reasoning with scientific observations and interpretations: A neural connectivity use case. BMC Bioinformatics, 12(1), 351-351.
Timothy Schultz is a doctoral student at the College of Information Science & Technology (The iSchool) at Drexel University. His research interests are in developing novel approaches for computationally modeling the progression of diseases, such as Alzheimer's disease. He focuses specifically on pattern detection of biological signals from consumer-based biosensors for modeling the body's physiological response to internal and external stimuli. He can be reached at tjs72<at>drexel.edu.