Data Scientists Must Also Be Research Methodology Scientists

William Hersh, MD, Professor and Chair, OHSU
Blog: Informatics Professor

I recently had the chance to attend a conference in Singapore, Big Data and Analytics in Health Care. It was an interesting blend of academics, operational health information technology professionals, and data scientists from companies in the emerging analytics market. I was also in Singapore for the concluding in-person session of the 10×10 (“ten by ten”) introductory informatics course we offer there.

The talks were all interesting, but I was struck by the difference in content and tone between the academic and clinical operations speakers and those from analytics companies who called themselves “data scientists.” Whereas the academic and clinical operational types were cautious about their methods and results, the data scientists implied their techniques would revolutionize healthcare and threw around terms like “big data” and “analytics” at every turn. One of the latter showed a “model” of the pathways leading to good (conservative) and bad (surgery) outcomes in back pain, with the intermediate nodes representing actions along the path, such as medication use, physical therapy, and chiropractic care. It was not clear to me how this model could be used to improve care, and I am not sure the speaker really understood that correlations do not prove causality. A second such speaker showed some interesting correlations between words and phrases that occur in clinical narratives of patients with diabetes and aspects of their care. I understand machine learning and how it might be used to “learn” things about patients with diabetes, but I did not see any evidence that this work would lead to any kind of improved patient outcomes.

Another concern I have about proponents of clinical data analytics is their presumption that their algorithms can somehow take the growing volume of operational electronic health record (EHR) data and automatically turn it into medical knowledge, as if they could turn a crank with data going in and knowledge emerging. I do have great enthusiasm for some of what can be done with these data, but I also have concerns about their quality and completeness, as well as the causality issues that arise when observations are not controlled experimentally.

I had the opportunity to speak at the conference as well, and gave a talk pulling together my cautious enthusiasm for using operational clinical data for research and other analytical purposes. This was the first public talk I have given on this topic since publication of a paper with ten other colleagues on caveats for the use of operational electronic health record data in comparative effectiveness research in the journal Medical Care [1]. The paper was commissioned by AcademyHealth and is part of a special supplement of the journal devoted to electronic data methods.

Our paper notes that while there are many opportunities for using clinical data for research and analytics, we also must remember the limitations of such data. In particular, EHR and other clinical data may be:

  • Inaccurate – data entry is not always a top priority for clinicians, and they may take shortcuts, such as copy-and-paste
  • Incomplete – patients do not get all of their care in one setting
  • Transformed in ways that undermine meaning – coding for billing is the best known example of this
  • Unrecoverable for research – data may be in clinical narratives or other less accessible places
  • Of unknown provenance – we need to know where data comes from and how likely it is to be accurate
  • Of inappropriate granularity – data too coarse for research purposes
  • Incompatible with research protocols – patients are not always diagnosed and treated consistently with best practices

Despite these caveats, I am optimistic that there will be uses for this data, especially if we can generate it in a standards-based way and otherwise improve its quality. Hopefully clinicians, researchers, patients, public health authorities, quality improvement leaders, and others who might benefit from the data will have an incentive to improve it through more meticulous entry as well as the use of standards, such as those prescribed by Stage 2 of the meaningful use program [2]. For many clinicians, especially these days, the EHR can be a data sinkhole into which they enter data, spending a great deal of time but getting little in return.

The bottom line is that while data scientists may be able to generate interesting and important results with their methods, they must also understand basic principles of research science, such as inferential statistics, clinical significance, and cause and effect. In addition, they must demonstrate their methods lead to improvements in health and/or healthcare, and are not just generating interesting associations. In other words, they must show evidence that their methods add value, just as medical care and informatics are required to do.


  1. Hersh, WR, Weiner, MG, et al. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 51(Suppl 3): S30-S37.
  2. Metzger, J and Rhoads, J (2012). Summary of Key Provisions in Final Rule for Stage 2 HITECH Meaningful Use. Falls Church, VA, Computer Sciences Corp.

This post first appeared on The Informatics Professor. Dr. Hersh is a frequent contributing expert to HITECH Answers.