Is Data Science a Scam?
The pharmaceutical industry (Pharma) is highly regulated and its workers highly credentialed. Analyses of clinical trial data are constrained by regulatory guidelines and pre-specified in an analysis plan, including conventions that penalise the company's drug in the primary analysis, the analysis which determines its fate. This approach is unintuitive but reflects a strong desire to form conclusions around a conservative estimate of efficacy.
Sheiner (1991) regarding the Intent-to-Treat principle: “[W]hy would anybody in their right mind advocate acting as though what they meant to happen did happen when they knew for sure that it did not?”
Despite these standards, academic clinicians flaunt their disapproval of industry with relentless crowd-pleasing books. To maintain legitimacy, the critics observe Pharma from the outskirts and thus betray a superficial understanding of the system they seek to reform. As clinicians, they also miss the subtlety of statistical arguments, which is where scrutiny would be best applied.
The point is: there is something seedy going on in Pharma and these critics miss it. It is behind the scenes and too esoteric for them to detect, no matter how eager they are to define themselves by their anti-Pharma posturing. And there's an added irony: the corruption is introduced by the academics themselves. I.e., it is the academics who are sanctifying and ushering in a new mindset that offers Pharma what it has always craved: “the liberation of analyses” (an actual phrase heard in Pharma), and new ambiguity in trial results.
Some background is necessary: an industry guideline stipulates that a clinical trial must have “an appropriately qualified” statistician assigned to it. Industry has always interpreted this to mean an MSc, but there are too many trials and not enough MSc-level statisticians, which led industry to fund MSc courses in the late 90s. This seemed a smart move. Courses grew in number and attracted more students, all of whom entered industry upon graduation, as they had agreed to. But many did not stay in their employment; the value of the degrees was tarnished as the courses were softened; and supply still did not meet demand.
Some 20 years later, Data Science developed and made itself known to a wider audience. The term was swiftly adopted by those who wished to appear current. Talking as if Data Science were Statistics 2.0 was a simple error made by journalists that enhanced its apparent relevance for young scientists; combined, of course, with the glamour of AI and machine learning (ML) and the sudden declaration that the Fathers of modern statistics were racists.
Online magazines sprang up, such as Towards Data Science, where influencers teach statistics to data scientists minus the history and difficult theory (statistics rebranded). Master's degrees in Statistics & Data Science began to appear; statistical societies added these degrees to their lists of accredited courses that lead to certification, e.g. Chartered Statistician; hence satisfying the regulatory requirement of “appropriately qualified” and expanding the recruitment pool. Bear in mind, there is no traditional route to accreditation for data scientists, who are an eclectic and ill-defined group.
But how to import this new crop of analysts into drug development? I.e., how to relax requirements and change habits comprehensively?
Data Science has clout, and this is where the status-chasers come in, i.e. the academics described above who celebrate their intolerance of Bad Pharma with long-winded best-sellers. In fact, that was the title of academic clinician Ben Goldacre's book: Bad Pharma. With the style of a British tabloid newspaper, it provides an out-of-date and grotesque description of Pharma that won over sideliners who wallow in cynicism. Heather Heying held it aloft on the Dark Horse podcast and declared it a “terrifying book”.
For a long time Goldacre was in evidence-based medicine (EBM). Motivated by a distrust of Pharma, EBM proponents place meta-analysis atop their hierarchy of evidence, rather than the costly, industry-run clinical trial. The EBM enterprise lingers but could be declared a failure, given that its leaders confess they now think “systematic reviewing [is like] searching through rubbish”, and Goldacre's own Open Trials initiative backfired when it found that academia is much more likely than industry to fail to publish trial results, and that Big Pharma is better than smaller Pharma. The Open Trials X account stopped posting 5 years ago, and the EBM proponents were silent before and after the Cass review landed.
“Trials with a commercial sponsor were substantially more likely to post results than those with a non-commercial sponsor (68.1% v 11.0% ...); as were trials by a sponsor who conducted a large number of trials (77.9% v 18.4% ...).”
Goldacre pivoted to Data Science, where he became the inaugural Director of the Institute for Applied Data Science at Oxford, which promised to “generate new data and evidence, but also make it more impactful in the world”. It is important to notice the language used by data science advocates; it is often self-promotional marketing-speak like impactful, innovation and extracting insights from big data. Also worth noting is that Pharma are pouring obscene sums into these newly established data science centres, and, needless to say, the cash-strapped academics are clamouring to inform the public about the promise of “big data”. (It turns out one can draw parallels with the greedy Pharma execs the academics decry, only the academics are not selling drugs; they are selling themselves.)
The key issue to understand is the intentional blurring of the boundaries between these distinct fields. Each nurtures a different mindset about how one thinks about, and handles, data. Statisticians treat data as sacrosanct and handle code the same way, whereas data scientists talk about cleaning data and are happy to make use of a script found online; a careless attitude that is an affront to the statistician's sensibilities. There is a departure from strict pre-specification too: the model learns, it is dynamic and ad hoc, and its black-box nature provides contentment for those lacking clinical understanding.
Ultimately there is a claim (a misunderstanding, in fact) that the clinical trial is limited in scope and that we ought to exploit uncontrolled, real-world data. In the absence of clinical trials the data scientists “overclaim the usefulness and applicability of [their ML tools] to solve clinical problems”. If those who lament Pharma were sincere, they would have plenty to say about promoting products on the back of cobbled-together data sources. After all, a bad algorithm that informs patient triage can do just as much harm as, if not more than, a bad drug. Instead we witness an EBM person move to data science and upend their own hierarchy of evidence, without flinching.
In every case these examples imply an opening up of methods and a simultaneous relaxing of standards. And being open-minded in this way, i.e. showing one is unconstrained by old habits, has become the thing to espouse.
Within industry, new departments of data science are formed and statisticians are rebranded. Heads of the new groups talk about data analytics and insights and the need for statisticians to be open-minded and collaborative. Some state that those who are not on board will be left behind and are, until then, an obstacle. These higher-ups are not statisticians or data scientists or computer scientists. They are VPs, i.e. MDs and careerists transfixed by the hype and promise of data science. They do not understand any of it in detail; they only know they are excited about it. They stand under lights on stage and declare “We can do it!” Do what? Who knows.
To these leaders the statistician's presence was always a necessary burden; the dogged adherence to guidelines and strict validation is considered stifling, slow and unhelpful. But the statistician knows how to speak with, and appease, the regulator, whose requests are ultimately statistical in nature. Thus the leader will smile at the statistician, but counter their suggestions by promoting a “higher-level understanding” which usually amounts to something reprehensible (deviating from the analysis plan to diminish a safety signal, etc.).
This is not to downplay the work of the data scientists. It would be superfluous to describe the value of this burgeoning field where it is properly defined, e.g. in understanding what is happening at the protein level. It is always interesting and inspiring to hear what they are up to. But those seeking reform are thwarted by an obvious inherent problem: you cannot have big data until the drug is on the market; thus big data cannot inform approval.
We wish to affirm only the following: Statistics and Data Science should be treated as distinct fields that interact; Pharma's instinct will always be to create a grey area in which their marketing teams can manoeuvre; the academics and their journals abet Pharma by not noticing that what is sold as cutting-edge often is not, and hence is indistinguishable from marketing; and when it is cutting-edge, it is likely tentative and awaiting verification.
Academia is leaking public trust. Industry, with its high-quality randomised controlled trials and diligence, avoided the reproducibility crisis that still afflicts academic research today. Industry has the opportunity to recover the trust in science spent by the academics.