# Philosophy of Medicine

First published Mon Jun 6, 2016

Philosophy of medicine is a field that seeks to explore fundamental issues in theory, research, and practice within the health sciences, particularly metaphysical and epistemological topics. Its historic roots arguably date back to ancient times, to the Hippocratic corpus among other sources, and there have been extended scholarly discussions on key concepts in the philosophy of medicine since at least the 1800s. Debates have occurred in the past over whether there is a distinct field rightly termed “philosophy of medicine” (e.g., Caplan 1992) but as there are now dedicated journals and professional organizations, a relatively well-established canon of scholarly literature, and distinctive questions and problems, it is defensible to claim that philosophy of medicine has now established itself. Although ethics and values are part of many problems addressed within the philosophy of medicine, bioethics is generally considered to be a distinct field, and hence is not explored in this entry (but see the entry on theory and bioethics). That being said, philosophy of medicine serves as a foundation for many debates within bioethics, given that it analyzes fundamental components of the practice of medicine that frequently arise in bioethics such as concepts of disease. The philosophy of medicine also has made important contributions to general philosophy of science, and particularly to understandings of explanation, causation, and experimentation as well as debates over applications of scientific knowledge. Finally, the philosophy of medicine has contributed to discussions on methods and goals within both research and practice in the medical and health sciences. This entry focuses primarily on philosophy of medicine in the Western tradition, although there are growing literatures on philosophy of non-Western and alternative medical practices. It emphasizes philosophical literature while utilizing relevant scholarly publications from other disciplinary perspectives.

## 1. Introduction: How Should We Define Health and Disease?

One of the fundamental and most long-standing debates in the philosophy of medicine relates to the basic concepts of health and disease (see concepts of health and disease). It may seem obvious what we mean by such terms: people seek treatment from medical professionals when they are feeling unwell, and clinicians treat patients in order to help them restore or maintain their health. But people seek advice and assistance from medical professionals for other reasons, such as pregnancy, which cannot be construed as a disease state, and high blood pressure, which is asymptomatic. Thus the dividing line between disease and health is notoriously vague, due in part to the wide range of variations present in the human population and to debates over whether many concepts of disease are socially constructed. One of the further complicating factors is that the concepts of health and disease each typically involve both descriptive and evaluative aspects (Engelhardt 1975), in common usage among lay persons as well as among members of the medical profession.

Exploring these distinctions remains epistemologically and morally important as these definitions influence when and where people seek medical treatment, and whether society regards them as “ill”, including in some health systems whether they are permitted to receive treatment. As Tristram Engelhardt has argued,

the concept of disease acts not only to describe and explain, but also to enjoin to action. It indicates a state of affairs as undesirable and to be overcome. (1975: 127)

Hence how we define disease, health, and related concepts is not a matter of mere philosophical or theoretical interest, but critical for ethical reasons, particularly to make certain that medicine contributes to people’s well-being, and for social reasons, as one’s well-being is critically related to whether one can live a good life.

The terms “disease” and “illness” often are used interchangeably, particularly by the general public but also by medical professionals. “Disease” is generally held to refer to any condition that literally causes “dis-ease” or “lack of ease” in an area of the body or the body as a whole. Such a condition can be caused by internal dysfunctions such as autoimmune diseases, by external factors such as infectious or environmentally-induced diseases, or by a combination of these factors as is the case with many so-called “genetic” diseases (on the idea of genetic disease and associated problems, see for instance Hesslow 1984, Ankeny 2002, Juengst 2004). It has been argued that there is no philosophically or scientifically compelling distinction between diseases and other types of complaints that many would not consider to be diseases such as small stature, obesity, or migraine headaches (Reznek 1987). The notion of “disease” is common among most cultures, and may even be a universal concept (Fabrega 1979). It is a useful concept as it allows a clear focus on problems that afflict particular human beings and suggests that medicine can help to control or ameliorate such problems. In contrast, “illness” is usually used to describe the more non-objective features of a condition, such as subjective feelings of pain and discomfort. It often refers to behavioral changes which are judged as undesirable and unwanted within a particular culture, and hence lead members of that culture to seek help, often from professionals identified as health providers of some type within that culture (on some of the complexities relating to the triad of concepts “disease, illness, sickness”, see Hofmann 2002).

The term “sickness” emphasizes the more social aspects of ill health, and typically highlights the lack of value placed on a particular condition by society. Disease conditions are investigated not only to be understood scientifically, but in hopes of correcting, preventing, or caring for the states that are disvalued, or that make people sick. The classic work of the sociologist Talcott Parsons (1951) showed how the “sick role” relieves one of certain social responsibilities (for example, allows one to take time off work or to avoid family responsibilities) and also relieves blame for being ill (though not necessarily from having become ill in the first place). Although there are exceptions and counterexamples to this model (for example, some chronic diseases), it does fit our generally accepted societal notions of what it means to be sick (and healthy), and the moral duties and responsibilities that accompany the designation of someone as sick.

The dominant approach in much of the recent philosophical scholarship on the philosophy of medicine views disease concepts as involving empirical judgments about human physiology (Boorse 1975, 1977, 1997; Scadding 1990; Wachbroit 1994; Thagard 1999; Ereshefsky 2009). These so-called “naturalists” (sometimes called “objectivists”, for example see Kitcher 1997, or “descriptivists”) focus on what is biologically natural and normal functioning for all human beings (or more precisely human beings who are members of relevant classes such as those within a particular age group or of the same sex). They argue that medicine should aim to discover and describe the underlying biological criteria which allow us to define various diseases. Christopher Boorse’s revised account has been the most influential in the literature, claiming that health is the absence of disease, where a disease is an internal state which is either an impairment of normal functional ability or else a limitation on functional ability caused by the environment (Boorse 1997). “Normal functioning” is defined in terms of a reference class which is a natural class of organisms of uniform functional design (i.e., within a specific age group and sex), so that when a process or a part (such as an organ) functions in a normal way, it makes a contribution that is statistically typical to the survival and reproduction of the individual whose body contains that process or part. His definition includes specific reference to the environment so as not to rule out environmentally-induced conditions which are so common as to be statistically normal such as dental caries.

Many have criticized these approaches (to name just a few, Goosens 1980; Reznek 1987; Wakefield 1992; Amundson 2000; Cooper 2002), as well as naturalistic accounts of disease more generally. As they have noted, naturalistic accounts do not reflect our typical usage of the terms “disease” and “health” because they neglect to take into account any values which shape judgments about whether or not someone is healthy. The usual counterexamples proposed to naturalism are masturbation, which was widely believed to be a serious disease entity in the 18th and 19th centuries (Engelhardt 1974), and homosexuality, which for most of the 20th century was classified as a disease in the Diagnostic and Statistical Manual (DSM) of the American Psychiatric Association. These are counterexamples as their redefinitions as non-disease conditions were due not to new biological information about these states of being but changes in society’s moral values. Naturalists respond to such arguments by pointing out that homosexuality and masturbation were never diseases in the first place but erroneous classifications, and thus these examples do not affect the validity of the definition of disease favored by them when it is applied rigorously.

A more telling criticism of naturalism is that although its advocates claim to rely exclusively on biological science to generate their definitions of health and disease, these rely implicitly on an equation of statistical and theoretical normality (or the “natural state” of the organism), at least in Boorse’s formulation (Ereshefsky 2009). But biology does not give us these norms directly, nor is there anything absolutely standard in “species design” (as many philosophers of biology have argued) despite Boorse’s claims. No particular genes are the “natural” ones for a given population, even if we take a subgroup according to age or gender (Sober 1980). Nor does standard physiology provide these norms (Ereshefsky 2009), not least because physiological accounts typically provide idealized and simplified descriptions of organs and their functions, but not of their natural states (Wachbroit 1994). Rachel Cooper (2002) compellingly argues that coming up with an acceptable conception of normal function (and in turn dysfunction) is the major problem with Boorsian-style accounts, arguing that his analysis should focus on disposition to malfunction instead. This argument utilizes counterexamples such as activities that interfere with normal functioning but are not diseases (for example, taking contraceptive pills), as well as examples of persons with chronic diseases controlled by drugs who function normally as a result. Elselijn Kingma (2007, 2010) has critiqued Boorse’s appeal to reference classes as objectively discoverable, arguing that these cannot be established without reference to normative judgments.
A further issue often noted with regard to naturalistic accounts of disease (for example, that of Lennox 1995) is the underlying assumption that biological fitness (survival and reproduction) is the goal of human life, and along with this that medicine is only considered to be interested in biological fitness, rather than other human goals and values, some of which might indeed run contrary to or make no difference in terms of the goal of biological fitness, such as relief of pain.

An alternative approach in the philosophical literature to naturalist/descriptivist/objectivist definitions of disease and health can roughly be termed “normative” or “constructivist”. Most proponents agree that we must define the terms “disease” and “health” explicitly and that our definitions are a function of our values (Margolis 1976; Goosens 1980; Sedgewick 1982; Engelhardt 1986). Hence defining various disease conditions is not merely a matter of discovering patterns in nature, but requires a series of normative value judgments and invention of appropriate terms to describe such conditions. Conversely, health involves shared judgments about what we value and what we want to be able to do; disease is a divergence from these social norms. Normativists believe that their definitions are valid not only philosophically but also reflect actual usage of the terminology associated with disease and health both in common language and among medical professionals. They also claim that this approach more adequately explains how certain conditions can come to be viewed in different ways over the course of history as our values changed despite relatively few changes in our underlying biological theories about the condition, for example homosexuality. Further, they are able to accommodate examples of so-called folk illnesses or culture-bound syndromes such as ghost sickness among some Native American tribes, the evil eye in many Mediterranean cultures, or susto in Latin and South American cultures, as their theories explicitly allow for cross-cultural differences in understandings of disease and health.

However, normativism also generates a series of typical criticisms: it cannot cope adequately with cases where there is general agreement that a state is undesirable (such as alcoholism or morbid obesity) but no similar general agreement that the state is actually a disease condition (Ereshefsky 2009). Another classic objection is that normative accounts do not allow us to make retrospective judgments about the validity of disease categories such as “drapetomania” (a disease which was commonly diagnosed among American slaves in the 19th century, with the main symptom being the tendency to run away) (Cartwright 1851). The normativist can point to changes in values to explain the abandonment of belief in this disease condition, but would not be able to claim that the doctors were in any sense “wrong” to consider drapetomania to be a disease. Hence there is more involved in our everyday usage of the terms “disease” and “health” than just value or normative conditions.

Hybrid theories of health and disease attempt to overcome the gaps in both the naturalistic and normative approaches, by hybridizing aspects of both theories (Reznek 1987; Wakefield 1992; Caplan 1992). For instance Jerome Wakefield (1992, 1996, 2007), writing about psychiatric conditions in particular, notes that a condition should be considered a disease if it both causes harm to the person or otherwise contributes diminished value, and the condition results from some internal mechanism failing to perform its natural function (hence for instance much of what is diagnosed as “depression” would fail to count as a disease condition). Whereas the normativist is committed to calling any undesirable state a disease condition, these hybrid criteria rule out calling conditions “diseases” which are non-biological. Then various marginal cases might be considered to be healthy rather than potentially described as diseased, and hence might not be eligible for treatment within conventional medicine. For example, organs or structures that no longer have a function due to evolutionary processes cannot malfunction and so cannot be diseased. Many hybrid approaches also retain too many assumptions about their naturalistic components, and hence are criticized for relying on a notion of natural function which cannot be supported by biology.

The concept of health has been relatively undertheorized in comparison to those of disease and illness, perhaps in part because it raises even more complicated issues than these concepts describing its absence. One could be a straightforward naturalist about health, and define it as being a product of a functional biology; however this argument would run afoul of the same criticisms of naturalism recounted above (see Hare 1986). The classic definition of health comes from the Constitution of the World Health Organization (WHO), which defines health as

a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity. (WHO 1948: preamble)

Notice that according to this formulation, health is not just the absence of disease but a positive state of well-being and flourishing (notoriously ambiguous concepts in themselves). Although quality of life is often cited as critical to definitions and theories of health, many commentators are wary of the expansiveness of a definition similar to the WHO’s terminology, as it seems to encompass many things beyond the health of the individual which could contribute to (or diminish) his or her “well-being”.

A more narrow definition of health takes its rightful domain as being the state which medicine aims to restore, and its opposite to be “unhealth” or falling short of being healthy, rather than disease as such (Kass 1975). Under such a definition, medicine should not engage in aesthetic surgery or elective terminations of pregnancy or similar procedures which do not (strictly speaking) seek to restore health. Caroline Whitbeck (1981) has defined health in terms of the psychological and physiological capacities of an individual that allow him or her to pursue a wide range of goals and projects. Hence her account is a type of hybrid approach, since she places biological capacities at the core of her definition of health but only in so far as they help individuals to flourish and live their lives as they wish to do. The concept of health here is much more than the absence of disease; for instance, one could have a high level of health while still suffering from a particular disease condition.

One much discussed philosophical approach to defining health is that of Georges Canguilhem (1991, based on work in the early 1940s), who argued against equating it with normality. He noted that the concept of a norm could not be defined objectively in a manner that could be determined using scientific methods. Physiology deals with the science of norms, but even scientifically-based medical approaches should not focus solely on norms, contrary to for instance the ideal vision of medicine according to Claude Bernard (1865). The history of how the distinction between the normal and pathological became so entrenched is explored in detail in Michel Foucault’s now classic work (1963). Both Foucault and Canguilhem sought to reveal how values have been built into the epistemological framework underlying modern medicine.

One of the key points in Canguilhem’s argument is that our usage of the term “normal” often conflates two distinct meanings: the usual or typical, and that which is as it ought to be. Consequently, he argues that there can be no purely scientific or objective definition of the normal that allows us to take the theories of physiology and apply them in medical practice, and accordingly we cannot define health as normality either. Instead, according to him, health is that which confers a survival value, particularly adaptability within a set of environmental conditions: “to be in good health is being able to fall sick and recover; it is a biological luxury” (1991: 199). Disease, then, is reduction in the levels of tolerance for the vagaries of the environment. As Mary Tiles (1993) has noted, this emphasis on health rather than normality is a particularly useful tool for enriching contemporary debates over preventative medicine and more generally the trend toward the development of a positive conception of health. Havi Carel (2007, 2008) has contributed to this strand of thought, developing a phenomenological notion of health which emphasizes that health should be understood as the lived experience of one's own body rather than as simply statistically normal bodily functioning in abstract biological terms. Hence she develops an expressly revisionist project, emphasizing that a phenomenological perspective accommodates cases where someone is ill (in biological terms) but healthy, such as in chronic illness.

A number of authors have made even more extreme claims, arguing that seeking concepts of disease is bound to be a failed effort. For instance, Peter Schwartz (2007) claims that there is not an underlying general concept of disease within the biomedical sciences that is coherent enough to be analyzed, and that different concepts of disease might be useful within different contexts. Some philosophers have argued that to seek correct definitions for “disease” and “health” is distracting and irrelevant when it comes to clinical decisions: as Germund Hesslow puts it, “the health/disease distinction is irrelevant for most decisions and represents a conceptual straightjacket [sic]” (1993: 1). The key is whether or not a particular state is desirable to its bearer, and not whether the person actually has a disease or defect. For instance, the term “malady” has been proposed as a more appropriate alternative to “disease” (Clouser, Culver, and Gert 1981), one which would be extended to include all illnesses, injuries, handicaps, dysfunctions, and even asymptomatic conditions. A malady is present when there is something wrong with a person; regardless of the cause (mental or physical), to be a malady, the condition must be part of its bearer and not distinct or external to him or her. The clear advantage of this approach is that it unifies a range of phenomena and descriptions that seem intuitively to be related. The disadvantages include that it relies in part on an objectivist approach to disease, and hence suffers from some of the difficulties detailed above that plague some versions of naturalism (for a provocative reaction to this debate, see Worrall and Worrall 2001).

An alternative approach to defining disease and health has been described by Marc Ereshefsky (2009) in terms of distinguishing state descriptions (descriptions of physiological or psychological states while avoiding any claims about naturalness, functionality, or normality) from normative claims (explicit judgments about whether we value or disvalue a particular physiological or psychological state). This approach has the advantage of allowing more clarity about controversial “disease” conditions as it avoids the need to apply the term explicitly. It also forces us to pinpoint the key issues that matter to understanding and treating someone suffering from ill health. But perhaps most persuasively, he argues that this approach allows us to distinguish the current state of a human from those we wish to promote or diminish, whereas the terms “disease” and “health” do not adequately highlight this critical distinction.

In short, philosophers of medicine continue to debate a range of accounts: in broad outline, the most vigorous disagreement centers on whether more objective, biologically-based, and generalizable accounts are preferable to those that incorporate social and experiential perspectives. It is clear that none satisfy all of the desiderata of a complete and robust philosophical account that also can be useful for practitioners; although some would dispute whether the latter should be a requirement, many believe that philosophy of medicine should be responsive to and helpful for actual clinical practices.

## 2. Contested and Controversial Disease Categories

Some disease categories are far from straightforward in terms of being recognized, named, classified, and made legitimate both within medicine itself and for the wider society. In recent times there have been long-standing debates over a range of conditions including Lyme disease, fibromyalgia, and chronic fatigue syndrome (CFS), to name just a few (for extended historical discussions of these and related conditions, see Aronowitz 1998, 2001; Shorter 2008). Take CFS as an example: its main symptoms are fatigue after exertion over a period lasting at least six months, but sufferers can have a wide array of complaints in diverse systems of the body; the range of severity is as wide as the range of symptoms. The condition has been associated with several other controversial syndromes and sometimes equated with them, most notably myalgic encephalomyelitis and fibromyalgia, as well as other illnesses of inexact definition such as multiple chemical sensitivity and irritable bowel syndrome; more popular (and derogatory) labels also have been attached to it such as yuppie flu. Definitive evidence as to the cause or basis of CFS has remained elusive, and in the absence of causal explanations, accurate diagnoses and effective treatments often have been difficult to obtain. Thus the illness has been perceived by many as being illegitimate because of difficulties in proving the existence of a discrete disease condition, given the lack of traditional forms of clinical evidence for it, and it has had different statuses in different locales (see Ankeny and Mackenzie 2016). These issues severely impact on the lives of those affected by this condition, and on the care that is thought to be appropriate to be made available to them.

Mental illnesses (and the term “mental health” itself) also have traditionally posed considerable problems for categorization and conceptualization for both medical practitioners and philosophers of medicine. Many authors advocate the case that it is critical to make a distinction between mental and physical illness (Macklin 1972), particularly because of the moral implications associated with labeling a condition as mental or psychological. Psychiatry is a field which has historically been loaded with value judgments, many of which were quite dubious. There is a long history of using mental illness as a way to categorize behaviors which are socially deviant as well as those conditions of ill health with no apparent organic cause and which do not otherwise fit into our dominant biomedical model. Many scholars (e.g., Ritchie 1989; Gaines 1992; Mezzich et al. 1996; Horwitz and Wakefield 2007; Demazeux and Singy 2015) have critiqued approaches and the underlying assumptions of the various editions of the Diagnostic and Statistical Manual published by the American Psychiatric Association, which is a “bible” for psychiatric conditions for many practitioners and also has considerable public influence for instance on who can seek care. Key examples of contested issues within the DSM include the highly politicized nature of the processes of revision across various editions, various cultural, sexist, and gender biases inherent in specific diagnostic categories, and the relatively weak reliability and validity of the classification system.

One key question is whether the biomedical model is the most appropriate approach to psychological or mental conditions and their treatment. Some theorists have argued in favor of naturalistic accounts of disease, notably Thomas Szasz (1961, 1973, 1987). As a result, he famously claimed that “mental diseases” are a myth and do not exist because they do not result from tissue damage; in his view, all diseases must be correlated with this sort of physical damage. He thus argues that the concept of mental illness is a prescriptive concept used as though it were a merely descriptive one, and also a justificatory concept masquerading as an explanatory one. These conclusions lead him to a highly critical analysis of psychiatric practices, and to reclassifying such forms of suffering as “problems in living” rather than diseases. However it is not always clear in his account what his evidence for these claims is, and in particular whether he is making an in principle objection or one that is grounded in the history of the mistreatment of people with mental illnesses, and the disservice done to them in part because of the adoption of the medical model. In addition, some have noted that some psychiatric conditions do in fact correlate with physiologically detectable and other types of biological abnormalities. For instance twin studies have demonstrated that genetics is a major factor in the etiology of schizophrenia among other conditions typically considered to be psychiatric, although clearly not all conditions that are diagnosable according to contemporary psychiatric standards fit this model.

A prominent functionalist approach to mental disorders more recently has been that of Wakefield (1992, 1996, 2007), as discussed above, who argues that mental disorders are best understood as “harmful” dysfunctions, which permits a supposedly value-free foundation (in terms of biological function gauged in evolutionary terms), with value judgments entering only in determining whether certain dysfunctions are harmful to their bearers. Such accounts have been criticized along lines similar to analyses of Boorsian accounts by emphasizing that function and dysfunction cannot in fact be defined independently of value terms, but Wakefield’s account also has been questioned in terms of its practical implications (e.g., Sadler and Agich 1995) and whether malfunction is a necessary component of mental disorder (Murphy and Woolfolk 2000).

Other authors, notably George Engel (1977), have argued for the need to unify our understandings of mental and physical illness under a broader, biopsychosocial model. Such a model would lead clinicians to take account of the physical, psychological, and social factors that contribute to ill health, in contrast to the traditional biomedical model which is faulted for being overly reductionistic rather than holistic. Such an account, it is claimed, would be more effective in dealing with borderline cases including people who are told they are in need of treatment due to abnormal lab results or similar but who are feeling well, as well as those who appear to have no underlying somatic disease condition but are feeling unwell. Hence this type of account does not draw any sharp distinction between the physical and the mental (or even the social), leaving the question of appropriate therapies or approaches as a matter to be decided by the doctor and his or her patient. Engel compellingly defended this type of account as more appropriate not only for clinical work but for research and teaching in medicine. It is arguable that it has implicitly (and often explicitly) been adopted in much of current-day medical practice and teaching, although it is less clear whether it has had much influence in biomedical research, much of which tends to remain more reductionistic in its nature.

## 3. Theories, Causes, and Explanations in Medicine

There is no widely accepted notion of what a scientific theory is. Theories have been understood as sets of propositions, formalizable in first-order logic, at one point (a view associated with the logical positivists), and as classes of set-theoretic models at another. For our purposes here one can distinguish two senses of theory, a narrower and a broader sense. In the narrower sense, a theory comprises a set of symbols and concepts used to represent the entities in a domain of discourse as well as a set of simple general-purpose principles that describe the behavior of these entities in abstract terms. In the broader sense, theory refers to any statement or set of statements used to explain the phenomena of a given domain.

In medicine one can find theories in both the narrower and the broader sense. Humorism, for instance, holds that the human body is filled with four basic substances or “humors”: black bile, yellow bile, phlegm, and blood. The humors are in balance in a healthy person; diseases are explained by excesses or deficiencies in one or more humors. Humorism has ancient origins and influenced Western medicine well into the 18th century. Eastern medicine has analogous systems of thought. Indian Ayurvedic medicine, for example, is based on a theory of three primary humors (wind, bile, and phlegm), and diseases are similarly understood as imbalances in humors (Magner 2002).

In contemporary Western medicine, such highly unifying and general theories play a limited role, however. Evolutionary and Darwinian medicine may well constitute exceptions but these are at best emergent fields at present (see Méthot 2011). Contemporary Western medical researchers and practitioners instead seek to explain medical outcomes using mechanistic hypotheses about their causes—symptoms by hypotheses about diseases, diseases by hypotheses about antecedents, epidemics by hypotheses about changes in environmental or behavioral conditions (Thagard 2006). What distinguishes these contemporary medical theories from the ancient approaches is that the causes of symptoms, diseases, and epidemics can in principle be as multifarious as the outcomes themselves; in the ancient approaches, lack of humoral balance was the only possible cause. In contemporary Western medicine, there is no presupposition concerning number, form, or mode of action of the causes that explain the outcome other than there being some cause or set of causes responsible.

Not every cause is equally explanatory. A given person’s death can be described as one by cardiac arrest, pulmonary embolism or lung cancer, for instance. The lung cancer may have had a genetic mutation, the deposition of carcinogens in lung tissue and smoking in its causal history. The smoking, in turn, was caused by the smoker’s proneness to addictive behavior, peer pressure and socio-economic environment, let us suppose. Which of the many candidate hypotheses of the form “X causes (or caused) Y”, where Y refers to the patient’s death, best explains the outcome? There is no absolute answer to this question. The goodness of a medical explanation depends in part on the context in which it is given (see entry on scientific explanation). When asked “Why did Y happen?” a coroner might refer to the pulmonary embolism, the patient’s physician to the lung cancer and an epidemiologist to the patient’s tobacco consumption. The adequacy of a medical explanation is also related to our ability to intervene on the factor in question. A pulmonary embolism can be prevented by screening the patient for blood clots. The accumulation of carcinogens in lung tissue can be prevented by stopping smoking. By contrast, even though certain kinds of genetic mutations are in the causal history of any cancer, the mutation is not at present of much explanatory interest to most clinicians, as this is not a factor on which they can easily intervene. There is considerable current medical research that aims to identify mutations associated with various subtypes of cancer and to use these to develop targeted therapies and interventions, as well as to provide more accurate prognostic information. Medical explanation, thus, is closely related to our instrumental interests in preventing and controlling outcomes (Whitbeck 1977).

One issue that is currently debated in the philosophy of medicine is the desirability (or lack thereof) of citing information about the mechanisms responsible for a medical outcome to explain this outcome. While mechanisms are usually characterized in causal terms (e.g., Glennan 2002; Woodward 2002; Steel 2008), it is not the case that every cause acts through or is a part of some mechanism, which is understood as a more or less complex arrangement of causal factors that are productive of change (e.g., Machamer et al. 2000). Absences, such as lack of sunlight, can cause medical outcomes but are not related to them through continuous mechanisms from cause to effect (Reiss 2012). Neuroscientific explanations are often acceptable despite the lack of knowledge or false assumptions about mechanisms (Weber 2008). However, we may ask whether mechanistic explanations are generally preferable to non-mechanistic causal explanations.

Many medical researchers and philosophers of medicine subscribe to a reductionist paradigm, according to which bottom-up explanations that focus on the generative physiological mechanisms for medical outcomes are the only acceptable ones or at least always preferable. Indeed, macro-level claims such as “Smoking causes lung cancer” seem to raise more questions than they answer: Why does smoking have adverse health consequences? To prevent these consequences, is it necessary to stop smoking? Is it possible to produce cigarettes the smoking of which has fewer or no adverse consequences? What is the best policy to reduce morbidity and mortality from lung cancer? Knowing that it is specific carcinogens in tobacco smoke and genetic susceptibility that are jointly responsible for the onset of the disease helps to address many of these questions.

Nevertheless it would be wrong to assume that we cannot explain outcomes without full knowledge of the mechanisms responsible. When, in the mid-1950s, smoking was established as a cause of lung cancer, it was certainly possible to explain lung cancer epidemics in many countries where people had exchanged pipe smoking for cigarette smoking half a century earlier—even though the mechanism of action was not understood at the time. Differences in lung cancer incidence between men and women or between different countries can be explained with reference to different smoking behaviors. Policy interventions, in this case the addition of warning labels to cigarette packets, could not wait until sufficient mechanistic knowledge was available, nor did they have to wait.

For reasons such as these, a number of philosophers of medicine have proposed to adopt an “explanatory pluralism” for medicine (De Vreese et al. 2010; Campaner 2012). If nothing else, this is certainly a position that is consistent with the explanatory practices in the field.

## 4. Reductionism and Holism in Medicine

As in many fields, debates over reductionism versus holism are rife in medicine, with reference to both medical research and practice, and the terms often are used rather loosely to mean a range of things (for a related discussion see entry on reductionism in biology). In the broadest terms, reductionistic approaches to disease look for fundamental mechanisms or processes that are the underlying causes of that disease. In recent years, in light of large-scale genomic sequencing initiatives, notably the Human Genome Project, there has been considerable emphasis on reducing diseases to the genetic or molecular level. Those who advocate more holistic approaches note that reductionism leaves out important information based on the patient’s experiences of the disease at the phenotypic level, and such information is critical to the pursuit of effective treatments. Many diseases typically viewed as “genetic” have proven to be extremely difficult in practice to reduce to unified disease entities with singular (or simple) genetic causes, including mental illnesses (Harris and Schaffner 1992), cystic fibrosis (Ankeny 2002), and Alzheimer’s disease (Dekkers and Rikkert 2008). As Catherine Dekeuwer (2015) notes, given that there probably is genetic variation in susceptibility to virtually all diseases, there is no clear demarcation between genetic diseases and diseases for which there are genetic risk factors; hence she argues that our tendency to focus on genetic determinants of disease may reinforce folk notions of the geneticization of both people and human behavior.

With regard to research, critics of reductionism point out that there has been an overemphasis on the pursuit of genetic or molecular level explanations of disease to the neglect of alternative levels of explanation. Further, such limitations are highly detrimental to patients, especially because there are not likely to be short-term cures or treatments for most genetic diseases, perhaps beyond avoiding having children carrying particular genes in the first instance (see for instance Hubbard and Wald 1999), although this domain of medicine is rapidly changing as new treatments are developed and the understandings of the effects of genomic mutations improve. Focusing overly or solely on the genetic level results in a process which the sociologist Abby Lippman (1991) terms “geneticization”, namely reducing the differences between individuals to their DNA, and in turn viewing genetics as the most promising approach to curing disease, rather than viewing people and the illnesses that they suffer at a phenotypic and much more environmentally-situated level. In addition, as Elisabeth Lloyd (2002) argues, higher levels of social organization that are culturally sanctioned have unrecognized causal effects on health, and hence medical research should not be restricted solely to the molecular level.

Fred Gifford (1990) claims that although all phenotypic traits are the result of an interaction between genes and the environment within which they are expressed, nonetheless it makes sense to distinguish certain traits as “genetic”; he argues in terms of populations that if it is genetic differences that make the differences in that trait variable in a given population, and if genetic traits can be individuated in a way that matches what some genetic factors cause specifically, then a trait (including a disease trait) can be understood as genetic. Kelly Smith (1992) disputes this, noting that the second condition depends on an extremely problematic distinction between causes (in this case genes) and mere conditions (e.g., epigenetic factors). Lisa Gannett (1999) argues for a “pragmatic” account of genetic explanation, claiming that when a disease is classed as “genetic”, the reasons for singling out genes as causes over other conditions necessarily include pragmatic dimensions inasmuch as they are relative to a given causal background (which includes both genetic and nongenetic factors), relative to a population, and relative to our present state of knowledge. More recently it has been argued that although explanatory reduction cannot be defended on metaphysical grounds, reductive explanations might be indispensable ways to address certain questions in the most accurate, adequate, and efficient ways (van Bouwel et al. 2011).

## 5. Randomized Controlled Trials and Evidence-Based Medicine

“Evidence-based medicine” (EBM) describes a movement that was started (under that name) in the early 1990s by a group of epidemiologists at McMaster University in Hamilton, Canada, as a reaction against what was perceived as an over-reliance on clinical judgment and experience in making treatment decisions for patients. According to a widely cited definition:

Evidence based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. (Sackett et al. 1996: 312)

Such a definition has bite only when the concept of evidence used is relatively narrow. In particular, it should not allow clinical judgment and experience to count as “best evidence”.

To this effect, proponents of EBM have developed so-called “hierarchies of evidence” that categorize different research methods with respect to their supposed quality. While there is no universally accepted hierarchy, the different proposed hierarchies all agree in the priority they give to randomized controlled trials (RCTs) and reviews thereof. A typical hierarchy looks as follows (Weightman et al. 2005):

| Level of evidence | Type of evidence |
| --- | --- |
| 1++ | High quality meta-analyses, systematic reviews of RCTs (including cluster RCTs), or RCTs with a very low risk of bias |
| 1+ | Well conducted meta-analyses, systematic reviews of RCTs, or RCTs with a low risk of bias |
| 1–* | Meta-analyses, systematic reviews of RCTs, or RCTs with a high risk of bias |
| 2++ | High quality systematic reviews of, or individual high quality non-randomised intervention studies (controlled non-randomised trial, controlled before-and-after, interrupted time series), comparative cohort and correlation studies with a very low risk of confounding, bias or chance |
| 2+ | Well conducted, non-randomised intervention studies (controlled non-randomised trial, controlled before-and-after, interrupted time series), comparative cohort and correlation studies with a low risk of confounding, bias or chance |
| 2–* | Non-randomised intervention studies (controlled non-randomised trial, controlled before-and-after, interrupted time series), comparative cohort and correlation studies with a high risk of confounding, bias or chance |
| 3 | Non-analytical studies (e.g., case reports, case series) |
| 4 | Expert opinion, formal consensus |

Evidence produced by RCTs has thus been called the “gold standard” of evidence in EBM (e.g., by Timmermans and Berg 2003).

In an RCT, a population of individuals who might benefit from a new medical treatment is divided into a treatment group (the group whose members receive the new treatment) and one or several control groups (groups whose members receive either an alternative or “standard” treatment or a placebo). Individual patients are assigned to a group by means of a random process such as the flip of a coin. A placebo is an intervention that resembles the new treatment in all respects except that it has no known ingredients active for the condition under investigation (i.e., it is some kind of “sugar pill”). Ideally, patients, researchers, nurses, and analysts are all blinded with respect to the treatment status of all patients until after the analysis. After a period of time, a pre-determined outcome variable is observed and the values of the variable are compared between the groups. If the value of the outcome variable differs between the treatment groups at the desired level of statistical significance, the treatment is judged to be effective.
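
The procedure just described can be illustrated with a minimal Python sketch. It is a toy simulation, not a description of how any actual trial is analyzed: the function name `simulate_rct`, the outcome model (normally distributed noise plus a constant benefit for treated patients), and the choice of a permutation test as the significance test are all assumptions introduced here for illustration.

```python
import random
from statistics import mean


def simulate_rct(n_patients=200, true_effect=1.0, n_permutations=2000, seed=0):
    """Toy RCT: coin-flip allocation followed by a permutation test."""
    rng = random.Random(seed)

    # Random allocation: each patient is assigned by a simulated coin flip.
    treated = [rng.random() < 0.5 for _ in range(n_patients)]

    # Hypothetical outcome model (an assumption, not from the text):
    # baseline noise plus a constant benefit for treated patients.
    outcomes = [rng.gauss(0.0, 1.0) + (true_effect if t else 0.0) for t in treated]

    def mean_difference(labels):
        treated_vals = [y for y, t in zip(outcomes, labels) if t]
        control_vals = [y for y, t in zip(outcomes, labels) if not t]
        return mean(treated_vals) - mean(control_vals)

    observed = mean_difference(treated)

    # Permutation test: reshuffle the group labels to estimate how often
    # chance alone gives a difference at least as large as the observed one.
    labels = treated[:]
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if abs(mean_difference(labels)) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    difference, p = simulate_rct()
    print(f"difference in means: {difference:.2f}, p-value: {p:.4f}")
```

In this sketch the treatment would be "judged effective" when the returned p-value falls below the chosen significance level. Blinding and placebo control have no counterpart in the code, since they concern how the trial is conducted rather than how the data are compared.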

Proponents of EBM regard RCTs as reliable means to judge treatment efficacy because they can help to control for a variety of (though not all) biases and confounders. If, for instance, the symptoms of a patient or group of patients improve after an intervention, this may be due to spontaneous remission rather than the treatment. An experimental design that compares a treatment group with one or several control groups is therefore better able to control for this confounder than a simple “before-and-after” design. Similarly, in a design in which the allocation to treatment and control groups is done by a non-random process, it is possible that healthier patients end up in the treatment group and less healthy patients in the control group. If so, the measured improvement may be due to the health status of the patients rather than the intervention. Especially if the allocation is done by a medical researcher who has a stake in the matter (for instance because she has developed the new treatment), allocation decisions may consciously or subconsciously be influenced by expectations about who will profit from the intervention and thus create unbalanced groups. Allocation by a random process helps to control this source of bias.

No one denies that RCTs are powerful experimental designs, and that their power stems from the ability to control numerous sources of bias and confounding. However, to refer to RCTs as the “gold standard” of evidence suggests that they are more. Specifically, one may be led to assume that RCTs are necessary for reliable causal inference or that RCTs are guaranteed to deliver reliable results. A number of philosophers of medicine have in the past decade or so argued that these stronger claims do not hold up to scrutiny.

In particular, the following claims have been criticized:

• The logic of statistical significance tests requires randomization (Fisher 1935). Ronald Fisher invoked his famous tea lady thought experiment in order to make plausible that significance testing works only with randomized allocation. Suppose an English lady claims that she is able to tell whether tea or milk was poured into the cup first and we would like to test this assertion. If she gets it right each time in a series of eight cups (four “milk first” and four “tea first” cups), this result may be due to her unusually sharp sense of taste. But it may also be for indefinitely many other reasons: she may know that milk was poured first in the first four cups and correctly identify the first four as “milk first” cups; the “milk first” cups may differ in color or shape from the “tea first” cups or have other visually identifiable features; a confederate may have recorded which cups were “milk first” and signal to her; and so on. Fisher argues that only if the allocation of tea to cups was done at random is the probability of the lady getting all eight cups right correctly identified as the probability of her getting it right if she were to guess, having no discriminatory ability (which in this case is 1/70). Therefore, we can judge that either she really does have an unusual discriminatory ability or something very unlikely must have happened (i.e., an event with the probability 1/70). But this is incorrect. In fact, there is still an indefinite number of ways in which she could have got the result even though she does not have a good sense of taste. If a confederate signals the correct answer to her, the probability of her getting it right is very close to 1 independently of her discriminatory ability (Worrall 2007a). A good experiment would prevent this, but this has to do with other aspects of the experimental design, not randomization.
• Randomization controls for all confounders, known and unknown (Fisher 1935; Giere 1984). Many variables affect a patient’s probability of recovery: her gender, age, co-morbidities, genetic factors, compliance with the treatment regime, psychological factors and many more. If we want to judge that an observed difference in recovery rates between treatment groups is due to the intervention rather than these other factors, we have to make sure that the probability distribution of causal factors is the same between the different groups. Randomization is supposed to ensure this. However, for any finite test population size (and many RCTs do indeed have relatively small numbers of patients), it remains possible that treatment groups are unbalanced: older patients ending up in one group, younger in the other, etc. While it is the case that the larger the number of patients in the RCT, the less likely it is that the groups are unbalanced with respect to any given factor, if there are many possible factors affecting the outcome it is actually very likely that some of them are unbalanced. Thus, in practice, if it is noticed after randomization that the two groups are unbalanced with respect to a variable that is thought to affect the outcome, then the groups are re-randomized or adjusted (Worrall 2002).
• It is possible to “prove” the results of an RCT to be correct (Cartwright 1989; cf. Worrall 2007b). Every scientist, at some point in his career, learns that one cannot judge X to be a cause of Y just because X and Y are correlated. According to a prominent theory of causation, viz. the probabilistic theory, causation is a form of correlation after all. Very roughly, the probabilistic theory holds that X causes Y just in case X and Y are correlated and all sources of confounding have been controlled (Reiss 2007). It can now be shown that under the probabilistic theory and a host of other assumptions (including the assumption that randomization has been successful in that the treatment groups are balanced with respect to prognostic factors), if the treatment status variable is correlated with the outcome variable, then the treatment must cause the outcome (Cartwright 2007). To give RCTs a special status in EBM on the basis of this reasoning would be to commit a logical mistake, however. The argument can only show that if all the assumptions behind an RCT are satisfied, the RCT will give a causally correct result. It does not show that RCTs are the only way to generate provably correct results. Indeed, it can relatively easily be shown that observational studies that identify so-called instrumental variables are similarly provably correct under a certain set of assumptions (Reiss 2005).
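
Two of the numerical points in the list above can be checked with a short Python sketch. The first function reproduces the 1/70 figure from the tea lady example by counting the ways of picking four “milk first” cups out of eight. The second illustrates why imbalance on some factor becomes likely when many factors matter; it rests on two simplifying assumptions that are not in the text, namely that the factors are independent and that each has the same small chance (here 5%) of ending up unbalanced. The function names are hypothetical.

```python
from math import comb


def tea_lady_guess_probability(n_cups=8, n_milk_first=4):
    """Probability of labelling every cup correctly by pure guessing.

    There are comb(n_cups, n_milk_first) equally likely ways to choose
    which cups to call "milk first", and only one of them is correct.
    """
    return 1 / comb(n_cups, n_milk_first)


def prob_some_factor_unbalanced(n_factors, per_factor_prob=0.05):
    """Chance that at least one prognostic factor is unbalanced.

    Assumes (for illustration only) independent factors, each unbalanced
    with probability per_factor_prob after randomization.
    """
    return 1 - (1 - per_factor_prob) ** n_factors


if __name__ == "__main__":
    print(tea_lady_guess_probability())       # 1/70, about 0.0143
    print(prob_some_factor_unbalanced(1))     # about 0.05 for a single factor
    print(prob_some_factor_unbalanced(20))    # about 0.64 for twenty factors
```

Under these assumptions, a 5% chance of imbalance per factor grows to roughly a 64% chance that at least one of twenty factors is unbalanced, which is the sense in which imbalance on some factor is "very likely" even when it is unlikely for any given one.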

A final but very important issue is that of the external validity of the RCT results. Even under ideal conditions (i.e., when medical researchers have very strong reasons to presume the assumptions under which an RCT works to be satisfied), the RCT can only establish that the treatment is effective in the test population. Typical test populations differ from the target populations (i.e., those populations for whom the treatment has been developed and who will eventually receive the treatment) in more or less systematic ways. For example, many RCTs will exclude elderly patients or patients with co-morbidities but the treatment will be marketed to these patients. For financial reasons, many RCTs are nowadays conducted in developing countries whereas the treatments are mainly or exclusively marketed to patients in developed countries. Whereas the protocols for conducting an RCT are very strict and detailed, there are no good guidelines for how to make treatment decisions when the patient at hand belongs to a population that differs from the population in which the RCT was conducted (e.g., Cartwright 2011).

There are in fact two problems of external validity in the application of RCT results. On the one hand there is the population-level problem of making an inference from test to target population. On the other hand, there is the problem of making an inference from population to individual. The RCT provides evidence for a population-level claim: “In population p (the test population) intervention X is effective in the treatment of condition Y”. For this claim to be true, the treatment must be on average effective, which allows the effectiveness to vary among the individuals in the population. Indeed, it is possible that the intervention is effective (and beneficial) on average but ineffective or positively harmful in some individuals (i.e., members of some subpopulations). Proponents of EBM to some extent oversell their case when they write that EBM

de-emphasizes intuition, unsystematic clinical experience, and pathophysiologic rationale… and [instead] stresses the examination of evidence from clinical research. (Evidence-Based Medicine Working Group 1992)

because inferences from test to target population and from any population to the individual receiving the treatment are necessarily based on clinical judgment.
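
The gap between the population-level claim and the individual patient can be made concrete with a small Python sketch. The subgroup shares and effect sizes below are invented purely for illustration, and `average_effect` is a hypothetical helper; the point is only that a positive average is compatible with harm in a subpopulation.

```python
def average_effect(subgroups):
    """Population-average treatment effect.

    subgroups: list of (share_of_population, effect_in_that_subgroup).
    """
    total_share = sum(share for share, _ in subgroups)
    return sum(share * effect for share, effect in subgroups) / total_share


# Hypothetical test population: 80% of patients benefit (+2.0 on some
# outcome scale) while 20% are harmed (-1.0) by the same intervention.
population = [(0.8, 2.0), (0.2, -1.0)]

if __name__ == "__main__":
    print(f"average effect: {average_effect(population):.2f}")
    harmed = [share for share, effect in population if effect < 0]
    print(f"share of patients harmed: {sum(harmed):.0%}")
```

With these made-up numbers the average effect is positive (about 1.4), so the intervention counts as effective in the population, yet one patient in five is harmed. Nothing in the average alone tells a clinician to which subgroup the patient at hand belongs.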

John Worrall argues that, at the end of the day, RCTs are a powerful means to control selection bias, but no more than that (Worrall 2002, 2007a,b). As he uses the term, selection bias occurs when treatment and control groups are unbalanced with respect to some prognostic factors because a medical researcher has selected which patients will receive which treatment. Selection bias in this sense obviously cannot occur in an RCT because in an RCT the allocation is made by a random process. But it is also clear that randomization is at best sufficient but not necessary to achieve the result. A large number of alternative designs may be used to the same effect: allocation can be made by a strict, albeit non-random, protocol; allocation can be made by non-experts who are unrelated to the treatment development and therefore have no expectations concerning outcomes; treatment and control groups can be deliberately matched (again by persons who have nothing at stake or according to some protocol); and so on.

A controversial issue is the role that mechanistic knowledge, that is, knowledge about the biological and physiological mechanisms responsible for medical outcomes (and thus treatment efficacy), should play in EBM. As mentioned above, the RCT provides evidence for black-box causal claims of the form “In population p, intervention X is effective in the treatment of condition Y”. As we have seen, proponents of EBM also believe EBM to de-emphasize pathophysiologic rationale (a different term for “mechanistic knowledge”). Nevertheless, a number of philosophers of medicine have pointed out that mechanistic knowledge is in fact important in EBM or that it should receive more attention. Federica Russo and Jon Williamson have, for instance, argued that causal claims need both statistical evidence and evidence about the mechanisms that connect an intervention with the outcome variable in order to be established (Russo and Williamson 2007). Others disagree (Reiss 2012) or qualify the claim (Gillies 2011; Howick 2011a; Illari 2011). Further, it has been pointed out that mechanistic knowledge plays an important role in the design and preparation of an RCT, as well as in the interpretation and application of RCT results (La Caze 2011; Solomon 2015). Especially when it comes to extrapolating research results from a test population to another population, mechanistic knowledge is supposed to be vital (Steel 2008; see also next section). On the other hand, knowledge about mechanisms is often highly problematic and should not be relied on too heavily in applications (Andersen 2012).

## 6. Animal Models

New therapies are often trialed using animal models before they are tested on humans in a randomized trial. Animal models also play important roles in establishing whether or not a substance is toxic for humans. The International Agency for Research on Cancer (IARC), for example, classifies substances with respect to the quality of the evidence for their carcinogenicity into five groups. Evidence from animal models is referred to in the characterization of each group (IARC 2006). This raises questions about how such extrapolations from animal models to humans work, and how reliable they are.

Animal models are widely used in biomedical research because experimental interventions on animals are easier to conduct and cheaper than experiments on humans. Both kinds of experiments involve ethical dilemmas, but animal experimentation is usually regarded as less problematic from an ethical point of view than experimentation with humans. At any rate, the number of animals killed, maimed, or made sick in biomedical research is much higher than the number of humans adversely affected in this research.

A fundamental inferential problem in transferring what has been learned in any model (whether human, animal, or otherwise) to some target population of interest has been described as the “extrapolator’s circle” (Steel 2008). The problem is essentially this. What is true of a model can be presumed to be true of the target only to the extent that the model is similar to the target in relevant respects. The reason we experiment on models in the first place is, however, that the model differs in important respects from the target (if animals were just like humans, we would not find experiments on the former to be ethically less problematic than experiments on the latter). Extrapolation, the inference from model to target, is therefore only worthwhile to the extent that there are significant limitations in our ability to study the target directly. If so, there can be no good grounds to decide whether a model is a good one for the target. To do so, we would have to investigate whether the target is relevantly similar to the model; but if we could do so, there would be no reason to study the model in the first place.

This inferential problem has led some commentators to maintain highly skeptical views concerning our ability to use animals as models for humans in biomedical research. Hugh LaFollette and Niall Shanks argue that animal models cannot be reliably used for extrapolation at all, but at best only heuristically, as sources of hypotheses that have to be tested on humans (LaFollette and Shanks 1997). They introduce two terms to make their argument: causal analogue model (CAM) and hypothetical analogue model (HAM). The former can be used to make reliable predictions about target populations of interest; the latter only heuristically. The main premise in their argument that animal models in biomedical research are at best HAMs but not CAMs is that for a model to be a CAM there cannot be causally relevant disanalogies between model and target—a condition which is rarely if ever met by animal models (again, this is why we study animals in the laboratory in the first place).

Daniel Steel (2008: ch. 5) argues that LaFollette and Shanks’ condition for reliable extrapolation is too stringent. Whether a claim about a model can be extrapolated depends, he argues, also on the strength of the claim to be exported. It is one thing, say, to reason from

x% of the members of population p will show symptoms of poisoning after ingesting substance S

to

x% of the members of population $$q \ne p$$ will show symptoms of poisoning after ingesting substance S,

quite another to reason from the quantitative claim to a qualitative claim such as “Substance S is poisonous for the members of q”.

Steel’s own reconstruction of how extrapolation works in the biomedical sciences is called comparative process tracing. He assumes that causes C (such as medical interventions or the ingestion of toxic substances) bring about their effects E (such as the appearance of symptoms or improvements or deteriorations of symptoms) through a series of steps or stages. To trace a causal process means to investigate through what set of stages C brings about E. Process tracing is comparative when the set of stages through which C brings about E in one species or population is compared to the set through which it does so (if it does so indeed) in another.

Comparative process tracing would be futile if, in order to know that C causes E in the target species or population, we would have to compare all the stages of the process between model and target. This is because in order to do so, we would have to know all stages of the process through which C causes E, but if we did, we would already know that C causes E. This brings us back to the extrapolator’s circle. Steel now argues that comparative process tracing avoids the extrapolator’s circle by demanding processes to be compared only at stages where they are likely to differ and assuming that differences between model and target matter only to stages that are downstream from where they obtain. Thus, if we compare an intermediate stage of the process which obtains in the model with that stage in the target and find them to be relevantly similar, then the only differences that may still obtain will be downstream from this stage. We therefore do not require knowledge of the entire process from C to E in the target, and the extrapolator’s circle is successfully avoided.

How useful comparative process tracing is as a method for extrapolation in the biomedical sciences depends on the reliability of the assumption that only downstream differences matter to extrapolation, the reliability with which stages where there might be differences between model and target can be identified, and the reliability of our mechanistic knowledge more generally. If, say, our reasons for supposing that C causes E through a series of stages X, Y, Z in the model, or that X and Z are the stages where model and target are likely to differ, are not very strong, then the method does not get off the ground. This is an issue that depends on the quality of the existing knowledge about a given case and cannot be addressed for the biomedical sciences as a whole. There are certainly some examples of well-established causal claims where it is known only that C causes E but the details of the causal process are entirely beyond our current grasp (Reiss forthcoming-a).

An alternative to comparative process tracing that has been proposed is extrapolation by knowledge of causal capacities. If C has a causal capacity to bring about E, then C causes E in a somewhat stable or invariant manner. Specifically, C will then continue to contribute to the production of E even when disturbing factors are present (Cartwright 1989). To establish that C has the causal capacity to cause E therefore means to show that C’s causing E is independent of the background in which C and E occur to some extent. And therefore, if C causes E in a model species or population and C has the causal capacity to bring about E, then there is some reason to believe that C causes E also in the target species or population (for a defense, see Cartwright 2011).

The usefulness of the method of extrapolation by causal capacities depends, among other things, on the extent to which biomedical factors can be characterized as having capacities. Many biomedical causes do indeed have some degree of stability. The sickle cell trait is 50% protective against mild clinical malaria, 75% protective against admission to the hospital for malaria, and almost 90% protective against severe or complicated malaria (Williams et al. 2005). These figures suggest a reading along the lines of,

in the presence of the sickle cell trait (a preventer of/disturbing factor for malaria), infection with Plasmodium malaria continues to affect outcomes consistently. (Reiss 2015b: 19)

But there is a high degree of interaction with other factors as well. Whether or not a substance is toxic for an organism depends on minute details of its metabolic system, and unless the conditions are just right, the organism may not be affected by the substance at all. The extent to which this method will be successful is therefore just as case-dependent as it is for comparative process tracing.

As we can see, there is no general answer to the question whether or not animal studies are valuable from a purely epistemic (as opposed to ethical, economic, or combined) view. Other authors have developed a practice-based taxonomy of animal models to allow more accurate assessment of the epistemic merits and shortcomings, and predictive capacities of specific modeling practices (Degeling and Johnson 2013). There is much evidence that species differ enormously with respect to their susceptibility to toxic reactions to substances. Thus, while it is very likely that for any one toxin, there is some species that is predictive of the human response, it is often hard to tell which one is most appropriate for any particular toxin. A species that predicts the human response well for one substance may be a bad model for another. However, some authors suggest that extrapolations from animal models have been made successfully in at least some cases (Steel 2008 discusses the extrapolation of claims concerning the carcinogenicity of aflatoxin from Fisher rats to humans; see Reiss 2010a for a critical appraisal and Steel 2013 for a response).

## 7. Observational Studies and Case Reports

Frequently, in the biomedical sciences, reliable animal or other non-human models are not available and RCTs on humans are infeasible for ethical or practical reasons. In these and other cases, biomedical hypotheses can be established using observational methods. As we have seen in Section 5, evidence-based medicine regards observational methods as generally less reliable than RCTs and other experimental methods. This is because observational studies are subject to a host of confounders and biases that can be controlled when the hypothesis is tested by a well-designed and well-conducted RCT. But it is not the case that observational methods cannot deliver reliable results. In fact, it may well be that the medical knowledge that has been established observationally by far exceeds the knowledge that comes from RCTs. Here are some examples of medical interventions that are widely accepted as effective but whose effectiveness has not been tested using RCTs: penicillin in the treatment of pneumonia, aspirin for mild headache, diuretics for heart failure, appendectomy for acute appendicitis and cholecystectomy for gallstone disease (Worrall 2007a: 986); automatic external defibrillation to start a stopped heart, tracheostomy to open a blocked air passage, the Heimlich maneuver to dislodge an obstruction in the breathing passages, rabies vaccines and epinephrine in the treatment of anaphylactic shock (Howick 2011b: 40).

Observational studies often begin by reporting a recorded correlation between a medical outcome of interest and one or a set of independent variables: lung cancer rates are higher in groups of smokers than in groups of non-smokers, liver cancer rates are higher in populations that tend to consume food that has been contaminated with aflatoxin than in populations whose food is uncontaminated, to give a few examples. That smoking causes lung cancer, or aflatoxin cancer of the liver, would indeed account for the observed correlations. But so would a variety of other hypotheses. Generally, if two variables X and Y are correlated, it may be the case that X causes Y, Y causes X or a common factor Z causes both X and Y (or a combination of these). In the smoking/lung cancer case, all three hypotheses were invoked as possible accounts of the data. Ronald Fisher famously proposed that it may be the case that early stages of bronchial carcinoma cause an individual to crave cigarettes, and he provided some evidence that both smoking behavior and susceptibility to lung cancer have a common genetic basis (Fisher 1958). Moreover, it is possible that the correlation itself is spurious—that the data are correlated as per some measure of correlation such as Pearson’s coefficient, but that the underlying variables are not in fact correlated in the population of interest. Selection bias is normally understood as the bias that obtains when individuals self-select into the observed population and the reasons for which they do so are correlated with the outcome variable. If an observational study examines only hospitalized patients and smokers are more likely to be in hospital for reasons that have nothing to do with lung cancer, then smoking and lung cancer can be correlated in the data even if the variables are independent in the general population. Mismeasurement and diagnostic error provide another account of spurious correlation. 
Suppose tuberculosis was on the rise a generation or so after many people traded pipe smoking for cigarette smoking. Then, if it was difficult to distinguish a death from tuberculosis from a death from lung cancer because necropsy techniques were not sufficiently well developed, the data might again show a correlation even though the population variables are uncorrelated.
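The hospital example of selection bias can be made concrete with a small calculation. The following is a minimal sketch, not drawn from any of the studies cited: all prevalences and hospitalization probabilities are invented, and the assumption that smoking and lung cancer raise the chance of hospitalization independently of one another is a modeling choice made only for illustration. Under these assumptions the two variables are independent in the population, yet they are associated among hospitalized patients. With this particular mechanism the association is negative; other selection mechanisms can produce a positive one.

```python
# All numbers are hypothetical and chosen only to illustrate selection bias.
P_CANCER = 0.02        # prevalence of lung cancer, assumed independent of smoking
P_HOSP_BASE = 0.05     # chance of hospitalization for unrelated reasons
P_HOSP_SMOKING = 0.20  # extra chance of hospitalization due to smoking (not via cancer)
P_HOSP_CANCER = 0.80   # extra chance of hospitalization due to lung cancer


def p_hospital(smoker: bool, cancer: bool) -> float:
    """Probability of being hospitalized when each cause acts independently."""
    p_stay_out = 1 - P_HOSP_BASE
    if smoker:
        p_stay_out *= 1 - P_HOSP_SMOKING
    if cancer:
        p_stay_out *= 1 - P_HOSP_CANCER
    return 1 - p_stay_out


def cancer_rate(smoker: bool, hospitalized_only: bool) -> float:
    """P(cancer | smoking status), optionally restricted to hospital patients."""
    weight_cancer = P_CANCER
    weight_no_cancer = 1 - P_CANCER
    if hospitalized_only:
        # Condition on hospitalization by reweighting each group (Bayes' rule).
        weight_cancer *= p_hospital(smoker, True)
        weight_no_cancer *= p_hospital(smoker, False)
    return weight_cancer / (weight_cancer + weight_no_cancer)


for hospitalized_only in (False, True):
    label = "hospital patients" if hospitalized_only else "whole population"
    print(
        label,
        "| smokers:", round(cancer_rate(True, hospitalized_only), 4),
        "| non-smokers:", round(cancer_rate(False, hospitalized_only), 4),
    )
```

In the whole population the two cancer rates coincide by construction; restricting attention to hospital patients alone is what makes them diverge.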

Retrospective observational studies work by ruling out alternative hypotheses such as these ex post rather than controlling for them ex ante as RCTs do (Reiss 2015a). In an RCT, mismeasurement should not obtain because the protocol specifies measurement procedures for the outcome variables in great detail in advance. Selection bias should not obtain because patients are randomized into treatment groups. Once allocated to a group, they are prevented from obtaining another treatment elsewhere, and researchers make sure that patients comply with the treatment regime. But there are equivalent means to rule out these possibilities in observational settings. While it may well be the case that early stages of cancer cause a craving for cigarettes, this hypothesis cannot explain the protective effect that stopping smoking has. At the time of the smoking/lung cancer controversy in the mid-1950s, misdiagnosis was indeed a problem. However, it could be shown that in order to account for the observed rise in lung cancer incidence, the diagnostic error at autopsy among older people would have to have been an order of magnitude higher than the diagnostic error among younger people (Gilliam 1955). Mismeasurement could therefore also be ruled out. Similar considerations helped to rule out other alternative hypotheses (Cornfield et al. 1959).

Even if one were to believe, with the proponents of EBM, that observational studies are generally less reliable than RCTs, medicine could, fairly obviously, not do without them. There are large numbers of pressing questions that could not be addressed by an RCT for ethical, financial and other practical reasons. No one would seriously consider testing a proposition such as “Aflatoxin causes cancer of the liver (in humans)” by an RCT. This is not merely because of the straightforward ethical issues involved in deliberately exposing humans to a potential carcinogen for the sake of medical progress. It is also because exposure to low levels of aflatoxin can take many years or even decades to produce symptoms. The ability of researchers to control food intake in a large group of experimental subjects for a very long time has evident practical and financial limitations. RCTs also cannot be used when researchers or patients or both cannot be blinded, and many medical interventions do require the doctor’s or the patient’s knowledge of details about the intervention.

Moreover, it is not clear that RCTs are always more reliable than observational studies to answer questions both methods are able to address. Whether or not a study is reliable depends on whether or not confounders and biases have in fact been eliminated, not by which method they have been eliminated. Issues concerning the reliability of a method can be entangled with issues concerning its ability to address the research question the biomedical scientist seeks to answer. Both RCTs and observational studies in the biomedical sciences are typically employed to test rather complex hypotheses about the safety and efficacy of medical interventions. It may well be that some of the issues are more reliably treated by one method and others by the other.

A famous controversy in which the results from observational studies and those from RCTs conflicted was that over the benefits and safety of hormone replacement therapy (HRT) in the early 2000s (Vandenbroucke 2009). HRT seemed protective for coronary heart disease in observational studies, whereas RCTs indicated an increase in the first years of use. For breast cancer, combined hormone preparations showed a smaller risk in an RCT than in observational studies. In the end it turned out that the timescale of the effects was responsible, and that because of the way they are typically run, observational studies got some issues right and RCTs others:

The observational studies had picked up a true signal for the women closer to menopause. In the randomised trial, that signal was diluted because fewer women close to menopause were enrolled… The randomised trials had it right for coronary heart disease but failed to sufficiently focus on women close to menopause for breast cancer. The main reasons for the discrepancies were changes of the effects of HRT over different times… (Vandenbroucke 2009: 1234)

Case reports remain extremely popular in medicine both as publications to communicate within the field and for pedagogical purposes. In short, a case report describes a medical problem experienced by one or more patients, usually involving the presentation of an illness or similar that is in some way difficult to explain or categorize based on existing understandings of disease or understandings of physiology and pathology. Cases in medicine take highly standardized forms of presentation which are inculcated in health care professionals during their education, and many have commented on their highly standardized narrative structure and its epistemic and other implications (Hunter 1991; Hurwitz 2006). Cases typically provide details on the presentation of the disease, diagnosis, treatment, and outcomes for the patient, with a focus on practice-based observations and clinical care (rather than the results of randomized controlled trials or other experimental methodologies). One of the purposes of cases is to gather detailed information including facts that may not be immediately relevant, but that could prove to be (Ankeny 2011). Thus the information contained in the case and the case itself can be useful over the long term, particularly if it can be systematically combined with other cases into larger datasets.

Single cases are seen by some as problematic as a form of evidence particularly in the era of EBM, because they often focus on highly unusual manifestations of illness and disease, rather than typical or repeatedly observed conditions that might support generalizable rules. This feature has led some to describe medicine as a “science of particulars” (Gorovitz and MacIntyre 1976), or as an art rather than a science (Pellegrino 1979), particularly in processes of diagnosis (see Section 9). However standard accounts of EBM include the case series as a type of evidence, which involves the aggregation of individual cases of patients with similar attributes (e.g., who received the same treatment or therapy) who are tracked over time using descriptive data and without utilizing particular hypotheses to look for evidence of cause and effect. EBM does place the case series quite low in its hierarchy of evidence but nonetheless it is acknowledged that cases have potential usefulness especially where forms of evidence that rate more highly are not available, as may often be the case where human patients are concerned due to practical or ethical reasons, or where the available evidence at higher levels has been produced in a manner that is methodologically or otherwise flawed.

Cases can serve other purposes: for instance, analyses of cases can provide working hypotheses about causal attribution that can ground further tests of causal relations (Ankeny 2014), which in turn allows use of more traditional methodologies such as RCTs, cohort studies, and so on to explore these causal hypotheses. In the context of clinical care, cases can allow health care providers to identify a cause that can be manipulated to cure (or prevent) the condition in question, in order to treat ill patients, even in the absence of more rigorous forms of evidence.

## 8. Diagnosis

Diagnosis is the process through which a clinician determines what is wrong with a patient who is ill or ailing in some way. Although a critical part of the practice of medicine, it has been relatively neglected in the literature of philosophy of medicine particularly in comparison to more statistically-based methods for evaluating evidence in other fields (Stanley and Campos 2013). The key philosophical issues that arise in this context relate to how such determinations can be made in a manner that is accurate given the high amount of uncertainty and complexity often associated with the human condition, and hence involve logical, epistemological, and ontological issues. The usual way of proceeding in a clinical setting is to ask the patient to articulate what is ailing him or her, and thus to use a standardized reporting format to detail various symptoms which represent subjective manifestations of the illness or disease. In addition, clinicians perform various tests and examinations that allow more objective manifestations or signs to be recorded, such as heart rate, blood pressure and count, reflexes, and so on. A perennial debate in the philosophy of medicine is what constitutes symptoms and signs and whether they are in fact distinct, which relates to deeper issues about the realism of disease conditions as discussed above (Section 2).

The tricky part of the process is to find a means for mapping these symptoms and signs onto a particular disease condition. Some would advocate that this process is no different than usual methods in philosophy of science for hypothesis generation and testing based on evidence, and this type of model fits with what is termed differential diagnosis. Differential diagnosis involves a set of hypothetical explanations for a particular condition which come to be ruled in (or out) based on the evidence together with additional data that is collected, hence relying on a form of reasoning via decision nodes or algorithmic pathways (Stanley and Campos 2013). However, the details of the rules of reasoning that underlie this sort of process remain largely unarticulated, as does the amount of “tacit” knowledge that may contribute to diagnostic reasoning.

There are various ways in which diagnosis is taught and operationalized in clinical settings: in some subspecialties in particular, “pattern” recognition, often using pictorial representations, seems to be common, and hence diagnosis is a form of recognizing repeated patterns. However, this approach can be dangerous, particularly among novices, given the large number of similar patterns among common diseases. Some have claimed that the making of a diagnosis is both a deontic act and computable, and that diagnoses are relative only inasmuch as they occur in a complex context which in turn makes them a social practice (Sadegh-Zadeh 2011). Computer-assisted diagnostic techniques have improved and are used increasingly in clinical settings; Kenneth Schaffner (1981) provided an early analysis of the criteria which an ideal diagnostic logic would need to satisfy (for updated discussions see Schaffner 1993, 2010, and for arguments about the limitations of such types of diagnosis see Wartofsky 1986). In recent years there has been a relative consensus among medical professionals and those involved in medical informatics that medical diagnosis almost certainly relies on some form of “fuzzy logic” (e.g., Sadegh-Zadeh 2000; Barro and Marin 2002).

## 9. Clinician Judgement and the Role of Expertise

As we have seen in Section 5, hierarchies of evidence in evidence-based medicine rank study results from “systematic” clinical research such as RCTs and observational studies higher than “unsystematic” expert opinion. The epidemiologists who initiated the formal EBM movement in the early 1990s had good reason to be skeptical about expert opinion. When therapies are subjected to systematic tests, tradition and expert opinion are sometimes shown to be flawed. John Worrall discusses three examples: grommets for glue ear, substances that repress ventricular ectopic beats (such as encainide or flecainide) for cardiac arrest, and routine fetal heart rate monitoring to prevent infant death (Worrall 2007a: 985). In each case we have a procedure the effectiveness of which is indicated by common sense and knowledge about the patho-physiological pathways (glue ear, for instance, is a condition produced by a build-up of fluid in the middle ear that is unable to drain away because of pressure differentials, and grommets act by letting air into the middle ear and thereby equalizing pressure) but which, when tested by a randomized trial, turns out to be ineffective at best and positively harmful in the worst case.

Misjudgments concerning the efficacy of therapies for purely epistemic reasons are not the only worry that one might have about expert opinion. Medical experts and patients are in what economists call a principal-agent relationship. The principal—in this case, the patient—desires the delivery of a certain good or service—in this case, his health. He instructs an agent—in this case, the doctor—with it, because he lacks the expertise to produce the good himself. The good can only be produced with uncertainty: no therapy is 100% effective. Moreover, the success at delivering the good depends in part on the agent’s effort. The doctor may not always choose the optimal therapy for a patient (we can suppose that it takes some effort to select the optimal therapy for a patient), and any therapy can be implemented sloppily. Moreover, lacking expertise, the patient cannot observe the level of effort a doctor puts in. He therefore cannot design a contract that makes payments dependent on level of effort (much less on success, as success is in part influenced by factors outside of either party’s control). Agents therefore have an incentive to cheat: not to put in the level of effort required to select and deliver the optimal therapy from a patient’s point of view.

If patients and doctors were perfectly rational and motivated only by their own material welfare, and in the absence of regulation, there simply wouldn’t be a market for health services. Doctors would choose therapies that are best for them and not for patients, and patients would anticipate this behavior and stop seeking doctors’ services in the first place. In our world, neither patients nor doctors are particularly rational, nor are they motivated purely by self-interest; there are ethical codes such as the modern form of the Hippocratic Oath; and the health sector is one of the most regulated industries of all. All this does not, however, change the incentive structure in which doctors and other providers of health services operate. Because they and not the patients are experts, they have incentives to choose therapies that are in their best interests and not in the patients’ interests.

There is a further complication. Many, probably most, doctors have connections to the pharmaceutical industry in one form or another. According to one study, 94% of U.S. physicians receive financial benefits from the pharmaceutical industry (Bekelman et al. 2003). Even if we suppose that doctors do not prescribe a therapy because they are paid to do so, marketing efforts directed at them will influence treatment recommendations, if only because they know certain pills better than others, or because some treatments are foremost in their minds.

For all these reasons, the EBM principle that treatment decisions should be based on the best available evidence from systematic research does not come out of nowhere. If, say, there is an RCT or an observational study that reports that treatment X is more effective at relieving symptoms S than treatment Y, it would seem bad to recommend that a patient who suffers from S take Y because his GP doesn’t know about X, doesn’t know the study result, personally profits from prescribing Y or is inattentive. However, while these are all bad reasons to recommend Y over X in the light of the study result, there may be a variety of good reasons.

As discussed in Section 5, RCTs and many observational studies are population-level studies, which produce average results that are not straightforwardly applicable to individuals. If, say, treatment X reduces the risk of suffering from some adverse event over a period of time by 50% in population p, that is, the risk ratio (RR) for this treatment is 50%, then there may be no individual in p for whom the treatment halves the risk. Instead, the RR may vary dramatically among the subpopulations of p, and it may well be the case that Y is more effective than X for some subpopulations.
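A small numerical sketch may help to show how an average risk ratio can hide variation between subpopulations. The two subpopulations, their shares, baseline risks and subgroup risk ratios below are all hypothetical; they are chosen so that treatment X has a population-level RR of 50% although it halves the risk in neither subpopulation, and so that treatment Y, with a worse population-level RR, is nonetheless the better choice for one of them.

```python
# Hypothetical subpopulations of p; every number here is an illustrative assumption.
SUBPOPULATIONS = {
    "A": {"share": 0.5, "baseline_risk": 0.20, "rr_x": 0.2, "rr_y": 0.6},
    "B": {"share": 0.5, "baseline_risk": 0.20, "rr_x": 0.8, "rr_y": 0.6},
}


def population_rr(rr_key: str) -> float:
    """Risk ratio in the whole population: average treated risk / average untreated risk."""
    untreated = sum(s["share"] * s["baseline_risk"] for s in SUBPOPULATIONS.values())
    treated = sum(
        s["share"] * s["baseline_risk"] * s[rr_key] for s in SUBPOPULATIONS.values()
    )
    return treated / untreated


def better_treatment(name: str) -> str:
    """Treatment with the lower risk ratio within one subpopulation."""
    s = SUBPOPULATIONS[name]
    return "X" if s["rr_x"] < s["rr_y"] else "Y"


print("Population RR for X:", round(population_rr("rr_x"), 3))
print("Population RR for Y:", round(population_rr("rr_y"), 3))
for name in SUBPOPULATIONS:
    print("Better treatment in subpopulation", name, "is", better_treatment(name))
```

A trial that reports only the population-level figures would favor X across the board, which is the gap that clinical judgment about the individual patient is meant to fill.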

The same is true of side effects. Tonelli (2006) discusses a case where a patient who suffers from multiple sclerosis receives a treatment that does seem to alleviate her symptoms, but since she has started taking it, she has been plagued by severe episodes of depression. Clinical trial results indicate that the drug is effective in treating multiple sclerosis, and no adverse psychiatric effects have been reported. Her GP and her psychiatrist now debate whether to continue the treatment. There are various reasons why the clinical studies do not show evidence of mental health effects: the trial subjects weren’t properly screened for depression; adverse effects were found but not reported; the adverse effects were not statistically significant—but they may have been clinically significant for some subpopulations; the side effects only obtain in populations that differ from the trial populations.

This case shows that a treatment’s effectiveness in relieving the symptoms of the disease for which it was prescribed is not the only consideration when making a treatment decision. The goal of the treatment is to improve the patient’s wellbeing, which is well recognized by the proponents of EBM. A patient’s wellbeing has many components, of course, and the symptoms of any given disease are at best one element in its determination. This is another reason why clinical judgment must be exercised in the derivation of a treatment recommendation.

Unfortunately, experts—like all humans—are notoriously bad decision makers. Cognitive psychologists have established a large number of cognitive biases to which human experts are subject: they suffer from overconfidence (e.g., Dawes and Mulford 1996) and hindsight bias (e.g., Fischhoff 1975; Hugh and Dekker 2009); are regularly outperformed by simple mechanical algorithms (e.g., Grove and Meehl 1996); commit the conjunction fallacy (Tversky and Kahneman 1983; Rao 2009), and many others.

To give an example of a simple mechanical algorithm outperforming experts, consider the Goldberg Rule, according to which a patient is to be classified as neurotic if $$x = (\textrm{L} + \textrm{Pa}+ \textrm{Sc}) - (\textrm{Hy} + \textrm{Pt}) < 45$$ (where L is a validity scale and Pa, Sc, Hy, and Pt are clinical scales of a Minnesota Multiphasic Personality Inventory or MMPI test) and as psychotic otherwise. Lewis Goldberg tested the rule on a set of MMPI profiles from 861 patients who had been diagnosed by the psychiatric staff in their hospital or clinic and found it to be 70% accurate; clinicians’ accuracy ranged from 55% to 67% (Goldberg 1968; for a discussion, see Bishop and Trout 2005).
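How little machinery such a rule needs can be seen from a sketch in code. The function names and the two patient profiles below are invented for illustration; only the index itself and the cutoff of 45 come from the rule as described. The classification direction follows the rule as discussed by Bishop and Trout (2005): higher scores on the Pa and Sc scales push the index up, towards the psychotic side of the cutoff.

```python
def goldberg_index(L: float, Pa: float, Sc: float, Hy: float, Pt: float) -> float:
    """Goldberg index computed from five MMPI scale scores."""
    return (L + Pa + Sc) - (Hy + Pt)


def goldberg_rule(L: float, Pa: float, Sc: float, Hy: float, Pt: float,
                  cutoff: float = 45) -> str:
    """Classify a profile as 'neurotic' below the cutoff and 'psychotic' otherwise."""
    return "neurotic" if goldberg_index(L, Pa, Sc, Hy, Pt) < cutoff else "psychotic"


# Two hypothetical profiles (made-up scale scores, not real patient data).
profile_1 = {"L": 50, "Pa": 70, "Sc": 75, "Hy": 55, "Pt": 60}
profile_2 = {"L": 50, "Pa": 50, "Sc": 52, "Hy": 70, "Pt": 68}

for profile in (profile_1, profile_2):
    print(goldberg_index(**profile), goldberg_rule(**profile))
```

The rule uses no information beyond the five scale scores, which is what makes its performance relative to clinicians notable.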

There is no one strategy to deal with the various biases and interests that affect clinicians’ judgments. Better numeracy and statistical training at universities can help to eliminate some cognitive biases (Gigerenzer 2014). Computer-aided medical diagnosis and decision making may alleviate others. No training or computer program can make normative judgments, however, and neither will help with adverse incentive structures and financial interests. These difficulties also beset committees of medical experts, to which we turn next.

## 10. How Are Collective Expert Judgments Made in Medicine?

One way to help overcome expert bias is to make medical decisions depend not on individual expert judgments but on groups of experts coming to some form of aggregate judgment. The U.S. National Institutes of Health, for instance, used to organize so-called consensus conferences designed to resolve scientific controversy. Panel members are chosen from clinicians, researchers, methodologists and the general public. Federal employees are not eligible, nor are researchers who have published on the subject at hand or have financial conflicts of interest (Solomon 2007). These exclusions are intended to contribute to controlling government influences as well as any biases due to financial or intellectual interests.

Consensus conferences and other mechanisms for reaching group judgments are clearly no panacea. Miriam Solomon (2015), for instance, argues that consensus conferences tend to “miss the window of epistemic opportunity” in that they often take place after the medical community has already settled an issue. More important in the present context is the observation that while these conferences possibly help to control some forms of partiality, they are ineffective in reducing others and may be responsible for the introduction of new biases. One concern is that panel members may read the existing evidence selectively, for instance, because they weigh salient studies or studies that are available to them more heavily. Another is that phenomena such as groupthink (Janis 1982) and peer pressure may influence results. In an NIH consensus conference panel members have to come to a verdict after only two days of hearings and deliberations. Under these conditions it is certainly possible that more outspoken panel members or those who perform well under extreme pressure have an undue influence on results. Moreover, it is not clear that excluding clinicians who have published on the issue at hand is always such a good idea. After all, it is not implausible to maintain that those scientists who actively work on a research topic are those who best understand it and therefore can make the best informed judgments. For these and other reasons, Solomon (2007, 2015) explores the consequences of judgment aggregation. In this process group members typically do not deliberate but instead cast their opinions which are then aggregated using some pre-determined procedure. The majority rule would be a simple example of such a procedure.

Coming to a group judgment using a mechanical procedure such as majority vote has a number of advantages. First, there are epistemic advantages that can be illustrated by Condorcet’s Jury Theorem. This theorem shows that if (a) the judgment concerns a proposition that can be either true or false, (b) jury members have an independent probability $$>.5$$ that they get the judgment right, (c) the individual judgments are aggregated using majority vote, then the larger the jury, the more likely it is to reach the correct group judgment. Under these conditions, then, a committee of experts is likely to make a better judgment than a single expert. Moreover, in the absence of deliberation and pressure to come to a unanimous result, and when voting is secret, the influence of groupthink, peer pressure etc. is attenuated or eliminated.
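The theorem’s claim can be checked numerically. The following is a minimal sketch assuming conditions (a) to (c) hold exactly: every expert has the same probability of being right, judgments are independent, and the panel size is odd so that ties cannot occur. The function name and the particular panel sizes and probabilities are illustrative choices, not part of the theorem.

```python
from math import comb


def majority_correct_probability(n_experts: int, p_correct: float) -> float:
    """Probability that a simple majority of independent experts is right.

    Assumes an odd panel size (no ties) and the same individual
    probability p_correct for every expert.
    """
    if n_experts % 2 == 0:
        raise ValueError("use an odd number of experts to rule out ties")
    votes_needed = n_experts // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(votes_needed, n_experts + 1)
    )


for p in (0.6, 0.4):
    for n in (1, 3, 11, 51):
        print(f"p = {p}, n = {n}: {majority_correct_probability(n, p):.3f}")
```

With an individual probability of 0.6 the group’s reliability rises as the panel grows; with 0.4 it falls, so that a single expert does better than any larger panel.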

When conditions (a)–(c) do not hold, results are more ambiguous or even negative. When experts are not reliable, i.e., the individual probability of getting the judgment right is $$<.5$$, the larger the group, the less likely it is to reach a correct group judgment, and the optimal group size is a single expert. When the outcome can have more than two values, inconsistent results can obtain. This can easily be demonstrated with an example in which there are three possible outcomes and three experts. Suppose, for instance, that a panel has to decide which of three treatments A, B and C is the most effective in treating some disease. The individual panel members have the following individual rankings:

Expert I: $$A > B > C$$

Expert II: $$B > C > A$$

Expert III: $$C > A > B$$,

where “>” means “more effective”. There is now a majority that holds that A is more effective than B (I&III), a majority that holds that B is more effective than C (I&II) and a majority that holds that C is more effective than A (II&III). More generally, whenever there are logical relations among the propositions to be decided (in this case: $$A > B$$ and $$B > C$$ together imply that $$A > C$$), there are at least three panel members, and votes are aggregated by the majority rule, inconsistencies can arise at the group level (Pettit 2001).
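The cycle in this example can be verified mechanically. The sketch below encodes the three rankings exactly as given and counts, for each pair of treatments, which one a majority of the panel ranks higher; the function names are illustrative.

```python
from itertools import combinations

# The three experts' rankings from the example, most effective first.
RANKINGS = {
    "I": ["A", "B", "C"],
    "II": ["B", "C", "A"],
    "III": ["C", "A", "B"],
}


def majority_prefers(x: str, y: str, rankings: dict) -> bool:
    """True if more than half of the experts rank x above y."""
    votes = sum(ranking.index(x) < ranking.index(y) for ranking in rankings.values())
    return votes > len(rankings) / 2


def pairwise_majorities(rankings: dict) -> list:
    """For every pair of options, return (winner, loser) under majority rule."""
    options = sorted(next(iter(rankings.values())))
    results = []
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, rankings):
            results.append((x, y))
        elif majority_prefers(y, x, rankings):
            results.append((y, x))
    return results


for winner, loser in pairwise_majorities(RANKINGS):
    print(f"A majority holds that {winner} is more effective than {loser}")
```

Each individual ranking is transitive, but the three majority verdicts taken together are not, so no treatment can be declared the most effective by majority rule alone.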

The majority rule is of course only one way to aggregate judgments. The Delphi method (e.g., Dalkey and Helmer 1963; for an application to medicine, see Jones and Hunter 1995) applies to cases where the task is to provide a numerical estimate of some variable of interest (say, the risk difference a new treatment makes). Experts answer questionnaires in several rounds. After each round, a facilitator provides an anonymous summary of the experts’ estimates from the previous round and the reasons given for their judgments. Experts are thus supposed to be encouraged to revise their earlier answers in light of other experts’ estimates and justifications. During this process the range of the estimates will often decrease, and it is hoped that the group will converge towards the correct answer. The process is stopped once a pre-determined stopping criterion is met, such as a fixed number of rounds, achievement of consensus, or stability of results, and an average of the estimates of the final round is used as the result.
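The round structure of the procedure can be sketched as a loop. The revision behavior modeled here is purely an assumption made for illustration: each expert is taken to move a fixed fraction of the way towards the group median reported by the facilitator, whereas real experts revise (or do not revise) for reasons no such rule captures. The initial estimates, the `pull` parameter and the tolerance are likewise invented.

```python
from statistics import mean, median


def delphi(estimates, pull=0.5, tolerance=0.01, max_rounds=10):
    """Run simplified Delphi rounds and return (result, rounds_used, final_estimates).

    Assumed revision rule (hypothetical): after each round every expert moves
    a fraction `pull` of the way towards the anonymously reported group median.
    Stops when the range of estimates is within `tolerance` (consensus) or
    after `max_rounds` rounds.
    """
    current = list(estimates)
    rounds_used = 0
    while rounds_used < max_rounds and max(current) - min(current) > tolerance:
        feedback = median(current)  # the facilitator's anonymous summary
        current = [e + pull * (feedback - e) for e in current]
        rounds_used += 1
    return mean(current), rounds_used, current


# Hypothetical first-round estimates of a risk difference from four experts.
first_round = [0.05, 0.10, 0.12, 0.20]
result, rounds_used, final_round = delphi(first_round)
print("rounds used:", rounds_used)
print("final range:", round(max(final_round) - min(final_round), 4))
print("group estimate:", round(result, 4))
```

Note that the loop guarantees only that the estimates draw closer together; whether they converge on the correct value depends entirely on how experts actually revise, which is why the text speaks of a hope rather than a guarantee.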

Solomon (2011, 2015) raises a fundamental issue concerning group judgments that is entirely independent of the specific method used. She argues that we do not often find group judgment methods to determine the truth of scientific hypotheses or estimates of variables in the natural sciences (though see Staley 2004). If there is uncertainty about, say, which of two alternative hypotheses is true or what value a natural constant has, scientists go out and test, experiment, measure. Controversies, in other words, are settled on the basis of evidence, not (individual or group) opinion. Shouldn’t we, with the advancement of evidence-based medicine, expect the same to happen in medicine? Consequently, she recommends more widespread use of mechanical techniques for amalgamating evidence such as meta-analysis in lieu of consensus conferences and the like.

The frequency of NIH consensus conferences has indeed markedly declined in recent years (Solomon 2011, 2015). But this is of course no reason to maintain that group judgments are no longer needed. Consensus conferences may be the wrong tool for the purposes of the NIH, or the NIH may have a mistaken view about the ability of evidence to settle disputes adequately. Indeed, there are at least two reasons to believe that group judgment procedures are here to stay.

The first reason is that, as we have seen above, medical decisions are always in part decisions about normative matters. No treatment is entirely without side effects and so if judgments about efficacy are to be of practical guidance, they must include a weighing of benefit (alleviation of disease symptoms) against cost (suffering from side effects)—even if economic costs and benefits are not to be taken into consideration. Second, government agencies such as the U.S. Food and Drug Administration (FDA) have to decide whether new treatments should be licensed to be marketed. These decisions often have significant consequences, and democracies tend to prefer to be able to hold someone accountable for making them. Drug approval therefore cannot be determined on the basis of evidence according to some mechanical algorithm.

Biddle (2007) discusses epistemological and moral issues of drug approval in the context of a case study on Vioxx, an analgesic. Vioxx was approved by the FDA in 1999 but five years later pulled from the market by its manufacturer Merck due to safety concerns. It is estimated that some 55,000 people died from taking the drug (Harris 2005). Biddle observes that the FDA is not sufficiently independent of the pharmaceutical industry to make unbiased decisions likely. Many of the members of the FDA’s drug approval committees have financial conflicts of interest (often in the form of receiving benefits from the pharmaceutical company whose drug is to be approved), and a large number of FDA employees depend on the “user fees” that industry pays to help cover the cost of drug approval. To solve these problems of conflicts of interest, Biddle proposes to institute an adversarial system in which two groups of advocates, a group of representatives of the manufacturer and a group of independent scientists, would argue before a panel of judges over whether a drug should be allowed on the market. The panel of judges in this model also consists of independent FDA or university scientists. He argues that the adversarial system would better acknowledge the fact that an increasing number of medical researchers have financial ties to the pharmaceutical industry by treating them as advocates rather than disinterested experts. (See also Reiss and Wieten 2015, Reiss forthcoming-b.)

## 11. Values in Medical Research

There is no doubt that medical research is shaped by various external values, in ways similar to the value-ladenness that is well recognized in other areas of science (see entry on scientific objectivity). Many of these values create ethical dilemmas relating to equity of access to health care and similar matters. Even though medical research has become more inclusive in recent years, this trend has itself introduced a host of additional philosophical and ethical issues (Epstein 2007). For our purposes, we will focus on how the systematic exclusion of certain types of individuals, groups, or diseases from research affects future research and clinical practice, in terms of the validity of the evidence produced and of the decisions based on that evidence.

In traditional medical research, it was generally assumed that white male participants could be used as the basis of generalizations that in turn could be extrapolated to all other populations, including minorities and females (Dresser 1992). Reviews of the literature indicate that women in particular have been excluded (especially older women), and that research on women has usually been related to reproductive function and capacity (Inborn and Whittle 2001). Such research has been argued to fall short of the ideals of quality medical research as well as evidence-based health care (Dodds 2008). Although some improvements have been made in recent years, certain blanket exclusions remain in many types of medical research, for instance of women of childbearing age or pregnant women. These types of systematic exclusion are highly problematic, especially because there is clear evidence of critical differences between men and women, for both biological and social reasons, in a range of factors relating to receptivity to therapies.

In the case of minorities such as African-Americans in the United States, even when research trials seek to recruit them, a range of factors may contribute to their not being involved in medical and other types of research studies. These include distrust due to historical and institutional racism, including research performed without consent; lack of understanding about research and consent; social stigma; financial considerations; and lack of culturally sensitive recruitment methods by researchers (e.g., Huang and Coker 2010). Such gaps in medical research potentially lead to the use of treatments or therapies that may in fact be harmful for particular groups, and may result in the withholding of therapies that might be beneficial.

Medical research also is affected by which conditions or diseases are selected for investigation (Reiss and Kitcher 2008). Perhaps most notoriously, “orphan” diseases, which are rare, common only in minority populations, or present only in certain developing-world or other lower socioeconomic settings, are often neglected in drug and other therapy development. Given the at-risk or affected population, it is perceived that there will not be a viable commercial market for any products that research might yield (hence such potential products are often termed “orphan drugs”). In some cases patients may pursue “off-label” use of drugs that are approved for a condition other than the one they have, because approval for an “orphan” disease is unlikely due to cost and demand. However, such off-label uses, even when overseen by a physician, typically mean that evidence is not consistently collected and that the risk-benefit considerations regulators normally apply when approving a drug for a particular purpose are absent.

A final way in which our knowledge in medicine generated by research is potentially adversely affected by values is through the funding patterns connected with research. As implied above, pharmaceutical companies sponsor a considerable portion of drug trials and have a variety of interests at stake in these investments well beyond the gathering of evidence for the effectiveness (or lack thereof) of a particular product. There is consistent evidence that negative results of industry-sponsored research typically are suppressed (Lexchin 2012a), leading to a bias in what is reported and thus in what evidence is available on which to make prescribing and treatment decisions. Bias also has been found in a number of other areas. Within the study itself, it arises in the choice of research question or topic of investigation, in the choice of doses or drugs against which the drug under study is to be compared, in the control over trial design and changes in protocols, in decisions to terminate clinical trials early, and in the reinterpretation of data. In publication, it arises through restrictions on publication rights, use of fake journals, favoring journal supplements and symposia rather than peer-reviewed venues, the use of ghostwriting, and the details of the reporting of results and outcomes (Sismondo 2008; Reiss 2010b; Lexchin 2012b). All of these issues weaken the evidence base on which clinical care judgments are made, and also lead to potentially adverse effects for patients.

## 12. Measuring Medical Outcomes

In order to evaluate medical outcomes quantitatively, they have to be measured. There are numerous reasons for aiming to quantify medical outcomes. We may want to compare two or more treatments with respect to their efficacy at relieving certain symptoms or their ability to prevent deaths due to a certain disease. When resources are scarce, we may want to invest in treatments that are not only efficacious (that is, they do improve patient morbidity, mortality or both) but also efficient (that is, they are more efficacious than other treatments relative to the cost of procuring them). For matters of international comparison, development and international justice, we also want to have measures of disease burden: Which of a number of tropical diseases has the highest cost in terms of increased morbidity and mortality? For each research dollar spent on treatments for disease X, how much can we expect to reduce the morbidity and mortality it causes?

Clinical trials now often report so-called patient-reported outcome measures or PROMs. A PROM is a questionnaire given to patients to evaluate certain aspects of their quality of life, functioning or health status after a medical intervention without interpretation of the patient’s response by a clinician or other people. It might ask, for example, how difficult patients find it to climb up a flight of stairs after hip surgery or whether or not a cancer treatment helps them to pursue their hobbies. The main goal of a PROM is the assessment of treatment benefit or risk in cases where the medical outcome is best known by the patient or best measured from the patient perspective.

PROMs can vary considerably in length and complexity depending on the concept that is being measured. In simple, straightforward cases (e.g., intensity of a certain kind of pain), a single question might suffice. In others, it may be necessary to address several aspects of a more complex functioning with a number of questions each. Either way, the design of the questionnaire should make sure that the instrument reliably measures the concept of interest. The FDA distinguishes the following six measurement properties or “tests” (FDA 2009: 11):

• Test-retest or intra-interviewer reliability (“Are the scores stable over time when no change is expected in the concept of interest?”)
• Internal consistency (“Is there a high correlation between the responses that purport to measure the same concept?”)
• Inter-interviewer reliability (“Is there agreement among responses when the PROM is administered by two or more different interviewers?”)
• Content validity (“Is there evidence that the instrument measures the concept of interest?”)
• Construct validity (“Is there evidence that the relationships between responses conform to expectations?”)
• Ability to detect change (“Is there evidence that the instrument can identify differences in scores over time in individuals or groups who have changed with respect to the concept of interest?”).
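Two of the properties listed above have simple statistical counterparts, sketched here in Python. The scores are hypothetical. Test-retest reliability is illustrated with a Pearson correlation between two administrations of the same item. Internal consistency is illustrated with Cronbach's alpha, a commonly used statistic that is chosen here as an assumption, since the passage quoted above does not prescribe a particular statistic.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of patient scores."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-patient total score
    item_variance = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))

# Hypothetical 0-10 scores from five patients on three questions
# that purport to measure the same concept.
item1 = [2, 4, 6, 8, 9]
item2 = [3, 4, 7, 7, 9]
item3 = [2, 5, 6, 9, 8]
# Hypothetical second administration of item1 when no change is expected.
item1_retest = [2, 5, 6, 8, 9]

test_retest = pearson(item1, item1_retest)
alpha = cronbach_alpha([item1, item2, item3])
print(round(test_retest, 3), round(alpha, 3))
```

High values on both statistics show only that responses are stable and mutually correlated. They do not show that the questions mean the same thing to every patient.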

Despite their plausibility, these tests are not methodologically innocuous. Content validity, for instance, is assessed on the basis of qualitative research in the form of patient interviews, focus groups, and qualitative cognitive interviewing (the latter refers to a method that asks respondents to think aloud and describe their thought processes as they answer the instrument questions and involves follow-up questions in a field test interview to gain a better understanding of how patients interpret questions). This qualitative research aims to develop questions with standardized meanings that are shared between patients and clinicians. Arguably, however, there will always be differences in interpreting phrases such as “bodily pain” or “difficulty in lifting one’s arm” because they refer to a patient’s experiences, and these will differ from patient to patient and, in a given patient, from time to time (Rapkin and Schwartz 2004). Moreover, there may be good philosophical reasons to allow for the expression of a sufficient array of legitimate perspectives on health and quality of life instead of insisting on a standardization of meaning across patients and contexts (McClimans 2010). Similarly, internal consistency can be desirable only to the extent that the concept is a relatively simple one and different questions really do address the same concept. It is of less relevance when the disorder is heterogeneous (McClimans and Browne 2011). These kinds of worries can be raised with respect to each of the measurement tests. Finally, there is an issue when several PROMs that address a given disorder or treatment exist. Different PROMs will score differently with respect to the different tests, and there is no universally valid schema to weigh their relative importance (ibid.).

Disability-adjusted life years or DALYs aim to measure burden of disease. The measure was originally developed by Harvard University for the World Bank and World Health Organization (WHO) in 1990 and is now widely used by health policy researchers for comparisons between countries and over time and as a tool for policy making. It can also be used to measure the effectiveness of interventions, though these are usually health policy rather than medical interventions narrowly construed. The WHO makes regular disease burden estimates in terms of DALYs at regional and global level for more than 135 causes of disease and injury (Mathers et al. 2002).

The principal idea behind DALYs is simple. If a woman in Guatemala dies of Chagas disease at age 63, this adds 20 DALYs to the global disease burden because her death is 20 years “premature” as compared to Japanese life expectancy (which is taken as the standard because it is the highest worldwide). If a man in Hamburg has an accident that confines him to a wheelchair for the rest of his life, this contributes 0.57 DALYs for each of his remaining life years because the weight for paraplegia is 0.57. Every kind of disease or impairment is thus given a number between 0 and 1 (where 0 = full health and 1 = death) that makes it comparable to other conditions. For example, blindness has a weight of 0.43. As blindness contributes less to the burden of disease than paraplegia, this means that blindness is regarded as the less severe of the two in terms of its reduction of functional capacity (Prüss-Üstün et al. 2003).
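The arithmetic behind these examples can be written out in a few lines. The Python sketch below uses only the numbers given in the text and ignores any further adjustments. The function names are illustrative.

```python
def dalys_from_premature_death(years_lost):
    """Each year of life lost to premature death counts as one DALY."""
    return 1.0 * years_lost

def dalys_from_disability(weight, years_lived_with_condition):
    """Years lived with a condition, scaled by its weight (0 = full health, 1 = death)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("disability weight must lie between 0 and 1")
    return weight * years_lived_with_condition

PARAPLEGIA = 0.57  # weights as given in the text
BLINDNESS = 0.43

print(dalys_from_premature_death(20))        # the Chagas disease example
print(dalys_from_disability(PARAPLEGIA, 1))  # one year lived with paraplegia
print(dalys_from_disability(BLINDNESS, 1))   # one year lived with blindness
```

Because death carries a weight of 1, a year lost to premature death and a year lived with a weighted condition are expressed on the same scale, which is what makes the two kinds of burden additive.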

The simple idea is complicated by two adjustments, however. Typical burden of disease studies weigh an impairment differently depending on the age of the person whose functional capacity is impaired by the disease or disability. Blindness, say, has a greater impact on the burden of disease if it occurs at age 20 than if it occurs at very young or older ages (Prüss-Üstün et al. 2003). Moreover, if the man who has the accident now can be expected to live with the disease for 30 years, future years of disability are discounted by a factor. The further into the future a disability occurs, the less it contributes to the disease burden (ibid.).
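The effect of time discounting can be illustrated with the paraplegia example. The text gives neither a discount rate nor a functional form, so the sketch below rests on assumptions: an annual rate of 3% and simple year-by-year discounting, both chosen for illustration only. Age weighting is left out.

```python
def discounted_disability_dalys(weight, years, rate):
    """Sum yearly disability contributions, discounting year t by (1 + rate) ** t.

    rate is an assumed annual discount rate; rate = 0 gives the unadjusted total.
    """
    return sum(weight / (1.0 + rate) ** t for t in range(years))

# Paraplegia (weight 0.57) lived with for 30 years, as in the example above.
undiscounted = discounted_disability_dalys(0.57, 30, 0.0)
discounted = discounted_disability_dalys(0.57, 30, 0.03)  # 3% is an assumption
print(round(undiscounted, 2), round(discounted, 2))
```

Under any positive rate the later years of disability count for less than the earlier ones, so the discounted total is smaller than the simple product of weight and duration.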

The adequacy of any socio-economic indicator has to be evaluated in the light of the purpose that it is meant to serve (Reiss 2008). If DALYs are supposed to measure the everyday concept “burden of disease”, we may object, for instance, that the indicator fails to take account of the societal, cultural, climatic and other conditions within which the disease or disability occurs. Being paraplegic, say, is less burdensome in societies that spend more resources on making public buildings and transport wheelchair accessible or that display more tolerance towards people with disabilities, and in flatter rather than hillier regions. Arguably, therefore, DALYs measure ill-health rather than disease burden (Anand and Hanson 1997). Similarly, because ill-health is measured as a percentage, a disease occurring in a person who already has a disability contributes less to the measure than the same disease occurring in an otherwise comparable person without a disability. If DALYs are used to make public health decisions, however, it might be better to prioritize those individuals who are least well off instead of those who are relatively better off (ibid.).

The WHO is very explicit that numerous choices made in the construction of the DALY measure are value-based (Murray 1994; Prüss-Üstün et al. 2003). Clearly, there is no matter of fact whether paraplegia constitutes a more severe impairment of someone’s functional capacity than blindness, much less the precise extent to which it contributes to the burden of disease. The same is true of the duration of the time lost due to premature death, age weights and time preference. While any given choice will, due to its value-laden nature, be controversial, the WHO makes some efforts to represent societal preferences instead of, say, a priori philosophical arguments. For example, the disability weights used in the 2003 World Health Survey were based on health state valuations from large representative population samples in over 70 countries (Prüss-Üstün et al. 2003: Ch. 3). Similarly, age weights are based on empirical studies that have indicated there is a broad social preference to value a year lived by a young adult more highly than a year lived by a young child, or lived at older ages (Murray 1996).

## Bibliography

• Amundson, R., 2000, “Against Normal Function”, Studies in History and Philosophy of Biological and Biomedical Sciences, 31: 33–53.
• Anand, S. and K. Hanson, 1997, “Disability-Adjusted Life Years: A Critical Review”, Journal of Health Economics, 16: 685–702.
• Andersen, H., 2012, “Mechanisms: What Are They Evidence for in Evidence-based Medicine?” Journal of Evaluation in Clinical Practice, 18(5): 992–999.
• Ankeny, R.A., 2002, “Reduction Reconceptualized: Cystic Fibrosis as a Paradigm Case for Molecular Medicine”, in L.S. Parker and R.A. Ankeny (eds.), Mutating Concepts and Evolving Disciplines: Genetics, Medicine and Society, Dordrecht: Kluwer, 127–141.
• –––, 2011, “Using Cases to Establish Novel Diagnoses: Creating Generic Facts by Making Particular Facts Travel Together”, in P. Howlett and M.S. Morgan (eds.), How Well Do Facts Travel? The Dissemination of Reliable Knowledge, Cambridge: Cambridge University Press, 252–272.
• –––, 2014, “The Overlooked Role of Cases in Causal Attribution in Medicine”, Philosophy of Science, 81: 999–1011.
• Ankeny, R.A. and F. Mackenzie, 2016, “Three Approaches to Chronic Fatigue Syndrome in the United Kingdom, Australia, and Canada: Lessons for Democratic Policy”, in S.M. Dodds and R.A. Ankeny (eds.), Big Picture Bioethics: Democratic Policy Making in Contested Domains, Dordrecht: Springer, forthcoming.
• Aronowitz, R.A., 1998, Making Sense of Illness: Science, Society and Disease, Cambridge: Cambridge University Press.
• –––, 2001, “When Do Symptoms Become a Disease?” Annals of Internal Medicine, 134: 803–808.
• Barro, S. and R. Marin (eds.), 2002, Fuzzy Logic in Medicine, Heidelberg: Physica-Verlag.
• Bekelman, J., Y. Li and C. Gross, 2003, “Scope and Impact of Financial Conflicts of Interest in Biomedical Research”, Journal of the American Medical Association, 289: 454–465.
• Bernard, C., 1865 [1957], An Introduction to the Study of Experimental Medicine, New York: Dover.
• Biddle, J., 2007, “Lessons from the Vioxx Debacle: What the Privatization of Science Can Teach Us About Social Epistemology”, Social Epistemology, 21: 21–39.
• Bishop, M. and J.D. Trout, 2005, Epistemology and the Psychology of Human Judgment, Oxford: Oxford University Press.
• Boorse, C., 1975, “On The Distinction Between Disease and Illness”, Philosophy and Public Affairs, 5: 49–68.
• –––, 1977, “Health as a Theoretical Concept”, Philosophy of Science, 44: 542–573.
• –––, 1997, “A Rebuttal on Health”, in J.M. Humber and R.F. Almeder (eds.), What is Disease?, Totowa, NJ: Humana Press, 3–143.
• Campaner, R., 2012, Philosophy of Medicine: Causality, Evidence and Explanation, Bologna: Archetipo Libri.
• Canguilhem, G., 1991, The Normal and the Pathological, trans. C.R. Fawcett, New York: Zone Books.
• Caplan, A.L., 1992, “Does the Philosophy of Medicine Exist?” Theoretical Medicine, 13: 67–77.
• Carel, H., 2007, “Can I Be Ill and Happy?” Philosophia, 35: 95–110.
• –––, 2008, Illness: The Cry of the Flesh, Dublin: Acumen.
• Cartwright, N., 1989, Nature’s Capacities and Their Measurement, Oxford: Clarendon.
• –––, 2007, “Are RCTs the Gold Standard?” BioSocieties, 2: 11–20.
• –––, 2011, “A Philosopher’s View of the Long Road from RCTs to Effectiveness”, The Lancet, 377: 1400–1401.
• Cartwright, S., 1851 [2004], “Report on the Diseases and Physical Peculiarities of the Negro Race”, reprinted in A.L. Caplan, J.J. McCartney, and D.A. Sisti (eds.), Health, Disease, and Illness, Washington, DC: Georgetown University Press, 28–39.
• Clouser, K.D., C.M. Culver, and B. Gert, 1981, “Malady: A New Treatment of Disease”, Hastings Center Report, 11(3): 29–37.
• Collingwood, R., 1940, An Essay on Metaphysics, Oxford: Clarendon Press.
• Cooper, R., 2002, “Disease”, Studies in History and Philosophy of Biological and Biomedical Sciences, 33: 263–282.
• Cornfield, J., W. Haenszel, E.C. Hammond, A.M. Lilienfeld, M.B. Shimkin, and E.L. Wynder, 1959, “Smoking and Lung Cancer: Recent Evidence and A Discussion of Some Questions”, Journal of the National Cancer Institute, 22: 173–203.
• Dalkey, N. and O. Helmer, 1963, “An Experimental Application of the Delphi Method to the Use of Experts”, Management Science, 9: 458–467.
• Dawes, R. and M. Mulford, 1996, “The False Consensus Effect and Overconfidence: Flaws in Judgment or Flaws in How We Study Judgment?” Organizational Behavior and Human Decision Processes, 65: 201–211.
• Degeling, C. and J. Johnson, 2013, “Evaluating Animal Models: Some Taxonomic Worries”, Journal of Medicine and Philosophy, 38: 91–106.
• Dekeuwer, C., 2015, “Defining Genetic Disease”, in P. Huneman, G. Lambert, and M. Silberstein (eds.), Classification, Disease and Evidence: New Essays in the Philosophy of Medicine, Dordrecht: Springer, 147–164.
• Dekkers, W. and M.O. Rikkert, 2006, “What is a Genetic Cause? The Example of Alzheimer’s Disease”, Medicine, Health Care and Philosophy, 9: 273–284.
• Demazeux, S. and P. Singy (eds.), 2015, The DSM-5 in Perspective: Philosophical Reflections on the Psychiatric Babel, Dordrecht: Springer.
• De Vreese, L., E. Weber and J. Van Bouwel, 2010, “Explanatory Pluralism in the Medical Sciences: Theory and Practice”, Theoretical Medicine and Bioethics, 31: 371–390.
• Dodds, S.M., 2008, “Inclusion and Exclusion in Women’s Access to Health and Medicine”, International Journal of Feminist Approaches to Bioethics, 1: 58–79.
• Dresser, R., 1992, “Wanted: Single, White Male for Medical Research”, Hastings Center Report, 22: 24–29.
• Engel, G.L., 1977, “The Need for a New Medical Model: A Challenge for Biomedicine”, Science, 196: 129–136.
• Engelhardt, H.T., 1974, “The Disease of Masturbation: Values and the Concept of Disease”, Bulletin of the History of Medicine, 48: 234–48.
• –––, 1975, “The Concepts of Health and Disease”, in H.T. Engelhardt Jr. and S.F. Spicker (eds.), Evaluation and Explanation in the Biomedical Sciences, Dordrecht: Reidel, 125–141.
• –––, 1986, “Clinical Complaints and the Ens Morbi”, Journal of Medicine and Philosophy, 11: 207–214.
• Epstein, S., 2007, Inclusion: The Politics of Difference in Medical Research, Chicago: University of Chicago Press.
• Ereshefsky, M., 2009, “Defining ‘Health’ and ‘Disease’”, Studies in History and Philosophy of Biological and Biomedical Sciences, 40: 221–227.
• Evidence-Based Medicine Working Group, 1992, “Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine”, Journal of the American Medical Association, 268(17): 2420–2425.
• Fabrega, H., 1979, “The Scientific Usefulness of the Idea of Illness”, Perspectives in Biology and Medicine, 22: 545–558.
• FDA (U.S. Food and Drug Administration), 2009, Guidance for Industry Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims, Washington, DC: U.S. Department of Health and Human Services.
• Fischhoff, B., 1975, “Hindsight is Not Equal to Foresight: The Effect of Outcome Knowledge on Judgement under Uncertainty”, Journal of Experimental Psychology, Human Perception & Performance, 1: 288–299.
• Fisher, R.A., 1935, The Design of Experiments, Oxford: Oliver & Boyd.
• –––, 1958, “Cancer and Smoking”, Nature, 182: 596.
• Foucault, M., 1963 [1973], The Birth of the Clinic: An Archaeology of Medical Perception, New York: Pantheon.
• Gaines, A.D., 1992, “From DSM-I to III-R, Voices of Self, Mastery and the Other: A Cultural Constructivist Reading of U.S. Psychiatric Classification”, Social Science and Medicine, 35: 3–24.
• Gannett, L., 1999, “What’s in a Cause? The Pragmatic Dimensions of Genetic Explanations”, Biology and Philosophy, 14: 349–373.
• Gasking, D., 1955, “Causation and Recipes”, Mind, 64: 479–87.
• Giere, R., 1984, Understanding Scientific Reasoning, New York: Holt, Rinehart, and Winston.
• Gifford, F., 1990, “Genetic Traits”, Biology and Philosophy, 5: 327–47.
• Gigerenzer, G., 2014, Risk Savvy: How to Make Good Decisions, New York: Viking Penguin.
• Gilliam, A., 1955, “Trends of Mortality Attributed to Carcinoma of the Lung: Possible Effects of Faulty Certification of Deaths Due to Other Respiratory Diseases”, Cancer, 8: 1130–1136.
• Gillies, D., 2011, “The Russo–Williamson Thesis and the Question of Whether Smoking Causes Heart Disease”, in P. McKay Illari, F. Russo, and J. Williamson (eds.), Causality in the Sciences, Oxford: Oxford University Press, 110–125.
• Glennan, S., 2002, “Rethinking Mechanistic Explanation”, Philosophy of Science, 69: S342–353.
• Goldberg, L., 1968, “Simple Models of Simple Processes? Some Research on Clinical Judgments”, American Psychologist, 23: 483–496.
• Goosens, W., 1980, “Values, Health and Medicine”, Philosophy of Science, 47: 100–115.
• Gorovitz, S. and A. MacIntyre, 1976, “Toward a Theory of Medical Fallibility”, Journal of Medicine and Philosophy, 1: 51–71.
• Grove, W. and P. Meehl, 1996, “Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical-Statistical Controversy”, Psychology, Public Policy, and Law, 2: 293–323.
• Hare, R.M., 1986, “Health”, Journal of Medical Ethics, 12: 174–181.
• Harris, G., 2005, “F.D.A. Official Admits ‘Lapses’ on Vioxx”, New York Times, 2 March.
• Harris, H.W. and K.F. Schaffner, 1992, “Molecular Genetics, Reductionism, and Disease Concepts in Psychiatry”, Journal of Medicine and Philosophy, 17: 127–153.
• Hesslow, G., 1984, “What Is A Genetic Disease? On the Relative Importance of Causes”, in L. Nordenfelt and B.I.B. Lindahl (eds.), Health, Disease and Causal Explanation in Medicine, Dordrecht: Reidel, 183–193.
• –––, 1993, “Do We Need A Concept of Disease?” Theoretical Medicine and Bioethics, 14: 1–14.
• Hofmann, B., 2002, “On the Triad Disease, Illness, and Sickness”, Journal of Medicine and Philosophy, 27: 651–673.
• Horwitz, A.V. and J.C. Wakefield, 2007, The Loss of Sadness, New York: Oxford University Press.
• Howick, J., 2011a, “Exposing the Vanities—and a Qualified Defense—of Mechanistic Reasoning in Health Care Decision Making”, Philosophy of Science, 78: 926–940.
• –––, 2011b, The Philosophy of Evidence-Based Medicine, Chichester: Wiley-Blackwell.
• Huang, H. and A.D. Coker, 2010, “Examining Issues Affecting African-American Participation in Research Studies”, Journal of Black Studies, 40: 619–636.
• Hubbard, R. and E. Wald, 1999, Exploding the Gene Myth, New York: Beacon.
• Hugh, T. and S. Dekker, 2009, “Hindsight Bias and Outcome Bias in the Social Construction of Medical Negligence: A Review”, Journal of Law and Medicine, 16: 846–857.
• Hunter, K.M., 1991, Doctors’ Stories: The Narrative Structure of Medical Knowledge, Princeton: Princeton University Press.
• Hurwitz, B., 2006, “Form and Representation in Clinical Case Reports”, Literature and Medicine, 25: 216–240.
• IARC [International Agency for Research on Cancer], 2006, IARC Monographs on the Evaluation of Carcinogenic Risks to Humans: Preamble, Lyon: International Agency for Research on Cancer.
• Illari, P., 2011, “Mechanistic Evidence: Disambiguating the Russo-Williamson Thesis”, International Studies in the Philosophy of Science, 25(2): 139–157.
• Inborn, M. and K. Whittle, 2001, “Feminism Meets the ‘New’ Epidemiologies: Towards an Appraisal of Antifeminist Biases in Epidemiological Research on Women’s Health”, Social Science and Medicine, 53: 553–567.
• Janis, I., 1982, Groupthink: Psychological Studies of Policy Decisions and Fiascoes, Boston: Houghton Mifflin.
• Jones, J. and D. Hunter, 1995, “Consensus Methods for Medical and Health Services Research”, British Medical Journal, 311: 376–380.
• Juengst, E., 2004, “FACE Facts: Why Human Genetics Will Always Provoke Bioethics”, Journal of Law, Medicine and Ethics, 32: 267–275.
• Kass, L.R., 1975, “Regarding the End of Medicine and the Pursuit of Health”, Public Interest, 40: 11–42.
• Kingma, E., 2007, “What Is It To Be Healthy?” Analysis, 67: 128–133.
• –––, 2010, “Paracetamol, Poison and Polio: Why Boorse’s Account of Function Fails to Distinguish Health and Disease”, The British Journal for the Philosophy of Science, 61: 241–264.
• Kitcher, P., 1997, The Lives To Come: The Genetic Revolution and Human Possibilities, New York: Simon & Schuster.
• La Caze, A., 2011, “The Role of Basic Science in Evidence-based Medicine”, Biology and Philosophy, 26(1): 81–98.
• LaFollette, H. and N. Shanks, 1997, Brute Science: Dilemmas of Animal Experimentation, London: Routledge.
• Lennox, J.G., 1995, “Health as an Objective Value”, Journal of Medicine and Philosophy, 20: 499–511.
• Lexchin, J., 2012a, “Those Who Have the Gold Make the Evidence: How the Pharmaceutical Industry Biases the Outcomes of Clinical Trials of Medications”, Science and Engineering Ethics, 18: 247–261.
• –––, 2012b, “Sponsorship Bias in Clinical Research”, International Journal of Risk & Safety in Medicine, 24: 233–242.
• Lippman, A., 1991, “Prenatal Genetic Testing and Screening: Constructing Needs and Reinforcing Inequities”, American Journal of Law and Medicine, 17: 15–50.
• Lloyd, E.A., 2002, “Reductionism in Medicine: Social Aspects of Health”, in M.H.V. Van Regenmortel and D.L. Hull (eds.), Promises and Limits of Reductionism in the Biomedical Sciences, New York: John Wiley & Sons, 67–82.
• Machamer, P., L. Darden and C. Craver, 2000, “Thinking About Mechanisms”, Philosophy of Science, 67: 1–25.
• Macklin, R., 1972, “Mental Health and Mental Illness: Some Problems of Definition and Concept Formation”, Philosophy of Science, 39: 341–365.
• Magner, L., 2002, A History of the Life Sciences, New York: Marcel Dekker.
• Margolis, J., 1976, “The Concept of Disease”, The Journal of Medicine and Philosophy, 1: 238–255.
• Mathers, C., C. Stein, D. Ma Fat, C. Rao, M. Inoue, N. Tomijima, C. Bernard, A.D. Lopez, and C.J.L. Murray, 2002, Global Burden of Disease 2000: Version 2 Methods and Results, Geneva: World Health Organization.
• McClimans, L., 2010, “Towards Self-Determination in Quality of Life Research”, Medicine, Health Care and Philosophy, 13: 67–76.
• McClimans, L. and J. Browne, 2011, “Choosing a Patient-Reported Outcome Measure”, Theoretical Medicine and Bioethics, 32: 47–60.
• Méthot, P.-O., 2011, “Research Traditions and Evolutionary Explanations in Medicine”, Theoretical Medicine and Bioethics, 32: 75–90.
• Mezzich, J.E., A. Kleinman, H. Fabrega Jr., and D.L. Parron (eds.), 1996, Culture and Psychiatric Diagnosis: A DSM-IV Perspective, Washington, DC: American Psychiatric Press.
• Murphy, D. and R.L. Woolfolk, 2000, “The Harmful Dysfunction Analysis of Mental Disorder”, Philosophy, Psychiatry and Psychology, 7: 241–252.
• Murray, C., 1994, “Quantifying the Burden of Disease: The Technical Basis for Disability-Adjusted Life Years”, Bulletin of the World Health Organization, 72: 429–445.
• –––, 1996, “Rethinking DALYs”, in C. Murray and A. Lopez (eds.), The Global Burden of Disease: A Comprehensive Assessment of Mortality and Disability from Diseases, Injuries, and Risk Factors in 1990 and Projected to 2020, Boston: Harvard University Press, 1–98.
• Parsons, T., 1951, The Social System, Glencoe, IL: The Free Press.
• Pellegrino, E.D., 1979, Humanism and the Physician, Knoxville: University of Tennessee Press.
• Pettit, P., 2001, “Deliberative Democracy and the Discursive Dilemma”, Philosophical Issues, 11: 268–299.
• Prüss-Üstün, A., C. Mathers, C. Corvalán, and A. Woodward, 2003, Introduction and Methods: Assessing the Environmental Burden of Disease at National and Local Levels, Geneva: World Health Organization.
• Rao, G., 2009, “Probability Error in Diagnosis: The Conjunction Fallacy among Beginning Medical Students”, Family Medicine, 41: 262–265.
• Rapkin, B. and C. Schwartz, 2004, “Toward a Theoretical Model of Quality-of-Life Appraisal: Implications of Findings from Studies of Response Shift”, Health and Quality of Life Outcomes, 2: 14–25.
• Reiss, J., 2005, “Causal Instrumental Variables and Interventions”, Philosophy of Science, 72: 962–976.
• –––, 2007, “Time Series, Nonsense Correlations and the Principle of the Common Cause”, in F. Russo and J. Williamson (eds.), Causality and Probability in the Sciences, London: College Publications, 179–196.
• –––, 2008, Error in Economics: Towards a More Evidence-Based Methodology, London: Routledge.
• –––, 2010a, “Across the Boundaries: Extrapolation in Biology and Social Science, Daniel P. Steel. Oxford University Press, 2007. xi + 241 pages”, Economics and Philosophy, 26: 382–390.
• –––, 2010b, “In Favour of a Millian Proposal to Reform Biomedical Research”, Synthese, 177: 427–447.
• –––, 2012, “Third Time’s a Charm: Wittgensteinian Pluralisms and Causation”, in P. McKay Illari, F. Russo and J. Williamson (eds.), Causality in the Sciences, Oxford: Oxford University Press, 907–927.
• –––, 2015a, “A Pragmatist Theory of Evidence”, Philosophy of Science, 82: 341–362.
• –––, 2015b, Causation, Evidence, and Inference, New York: Routledge.
• –––, forthcoming-a, “On the Causal Wars”, in H.-K. Chao, J. Reiss and S.-T. Chen (eds.), Philosophy of Science in Practice, Dordrecht: Springer.
• –––, forthcoming-b, “Meanwhile, Why Not Biomedical Capitalism?”, in K. Elliott and D. Steel (eds.), Current Controversies in Science and Values, New York: Routledge.
• Reiss, J. and P. Kitcher, 2009, “Biomedical Research, Neglected Diseases, and Well-Ordered Science”, Theoria, 24: 263–282.
• Reiss, J. and S. Wieten, 2015, “On Justin Biddle’s ‘Lessons from the Vioxx Debacle’”, Social Epistemology Review and Reply Collective, 4(5): 20–22.
• Reznek, L., 1987, The Nature of Disease, New York: Routledge.
• Ritchie, K., 1989, “The Little Woman Meets Son of DSM-III”, Journal of Medicine and Philosophy, 14: 695–708.
• Russo, F. and J. Williamson, 2007, “Interpreting Causality in the Health Sciences”, International Studies in the Philosophy of Science, 21: 157–170.
• Sackett, D.L., W.M. Rosenberg, J.A. Gray, R.B. Haynes, and W.S. Richardson, 1996, “Evidence-Based Medicine: What it Is and What it Isn’t”, British Medical Journal, 312: 71–72. pmcid:PMC2349778
• Sadegh-Zadeh, K., 2000, “Fuzzy Health, Illness, and Disease”, Journal of Medicine and Philosophy, 25: 605–638.
• –––, 2011, “The Logic of Diagnosis”, in F. Gifford (ed.), Handbook of the Philosophy of Science, Volume 16: Philosophy of Medicine, Amsterdam: Elsevier, 357–424.
• Sadler, J.Z. and G.J. Agich, 1995, “Diseases, Functions, Values, and Psychiatric Classification”, Philosophy, Psychiatry, and Psychology, 2: 219–231.
• Scadding, J., 1990, “The Semantic Problem of Psychiatry”, Psychological Medicine, 20: 243–248.
• Schaffner, K.F., 1981, “Modeling Medical Diagnosis: Logical and Computer Approaches”, Synthese, 47: 163–199.
• –––, 1993, Discovery and Explanation in Biology and Medicine, Chicago: University of Chicago Press.
• –––, 2010, “Interpretive Practices in Medicine”, in P. Machamer and G. Wolters (eds.), Interpretation: Ways of Thinking about the Sciences and the Arts, Pittsburgh: University of Pittsburgh Press, 158–178.
• Schwartz, P., 2007, “Decision and Discovery in Defining ‘Disease’”, in H. Kincaid and J. McKitrick (eds.), Establishing Medical Reality, Amsterdam: Springer, 47–63.
• Sedgwick, P., 1982, PsychoPolitics, New York: Harper and Row.
• Shorter, E., 2008, From Paralysis to Fatigue: A History of Psychosomatic Illness in the Modern Era, New York: Simon & Schuster.
• Sismondo, S., 2008, “Ghost Management: How Much of the Medical Literature is Shaped Behind the Scenes by the Pharmaceutical Industry?” PLoS Medicine, 4(9).
• Smith, K.C., 1992, “The New Problem of Genetics: A Response to Gifford”, Biology and Philosophy, 7: 331–348.
• Sober, E., 1980, “Evolution, Population Thinking, and Essentialism”, Philosophy of Science, 47: 350–383.
• Solomon, M., 2007, “The Social Epistemology of NIH Consensus Conferences”, in H. Kincaid and J. McKitrick (eds.), Establishing Medical Reality, New York: Springer, 167–177.
• –––, 2011, “Group Judgment and the Medical Consensus Conference”, in F. Gifford (ed.), Handbook of the Philosophy of Science, Volume 16: Philosophy of Medicine, Amsterdam: Elsevier, 239–254.
• –––, 2015, Making Medical Knowledge, Oxford: Oxford University Press.
• Staley, K., 2004, The Evidence for the Top Quark, Cambridge: Cambridge University Press.
• Stanley, D.E. and D.G. Campos, 2013, “The Logic of Medical Diagnosis”, Perspectives in Biology and Medicine, 56: 300–315.
• Steel, D., 2008, Across the Boundaries: Extrapolation in Biology and Social Science, Oxford: Oxford University Press.
• –––, 2013, “Mechanisms and Extrapolation in the Abortion-Crime Controversy”, in H.-K. Chao, S.-T. Chen and R. Millstein (eds.), Mechanism and Causality in Biology and Economics, Dordrecht: Springer, 185–206.
• Szasz, T., 1961, The Myth of Mental Illness, New York: Harper & Row.
• –––, 1973, The Second Sin, New York: Doubleday.
• –––, 1987, Insanity, New York: Wiley.
• Thagard, P., 1999, How Scientists Explain Disease, Princeton: Princeton University Press.
• –––, 2006, “What is a Medical Theory?” in R. Payton and L. McNamara (eds.), Multidisciplinary Approaches to Theory in Medicine, vol. 3, Amsterdam: Elsevier, 47–62.
• Tiles, M., 1993, “The Normal and Pathological: The Concept of a Scientific Medicine”, British Journal for the Philosophy of Science, 44: 729–742.
• Timmermans, S. and M. Berg, 2003, The Gold Standard: The Challenge of Evidence-Based Medicine and Standardization in Health Care, Philadelphia: Temple University Press.
• Tonelli, M., 2006, “Evidence-Based Medicine and Clinical Expertise”, Virtual Mentor, 8: 71–74.
• Tversky, A. and D. Kahneman, 1983, “Extensional vs. Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment”, Psychological Review, 90: 293–315.
• van Bouwel, J., E. Weber, and L. de Vreese, 2011, “Indispensability Arguments in Favour of Reductive Explanations”, Journal for General Philosophy of Science, 42: 33–46.
• Vandenbroucke, J.P., 2009, “The HRT Controversy: Observational Studies and RCTs Fall in Line”, The Lancet, 373: 1233–1235.
• Wachbroit, R., 1994, “Normality as a Biological Concept”, Philosophy of Science, 61: 579–591.
• Wakefield, J.C., 1992, “The Concept of Mental Disorder: On the Boundary between Biological and Social Values”, American Psychologist, 47: 373–388.
• –––, 1996, “Dysfunction as a Value-Free Concept”, Philosophy, Psychiatry and Psychology, 2: 233–246.
• –––, 2007, “What Makes a Mental Disorder Mental?” Philosophy, Psychiatry and Psychology, 13: 123–131.
• Wartofsky, M., 1986, “Clinical Judgment, Expert Programs, and Cognitive Style: A Counter-Essay in the Logic of Diagnosis”, Journal of Medicine and Philosophy, 11: 81–92.
• Weber, M., 2008, “Causes without Mechanisms: Experimental Regularities, Physical Laws, and Neuroscientific Explanation”, Philosophy of Science, 75: 995–1007.
• Weightman, A., S. Ellis, A. Cullum, L. Sander, and R.L. Turley (eds.), 2005, Grading Evidence and Recommendations for Public Health Interventions: Developing and Piloting a Framework, London: Health Development Agency.
• Whitbeck, C., 1977, “Causation in Medicine: The Disease Entity Model”, Philosophy of Science, 44: 619–637.
• –––, 1981, “A Theory of Health”, in A.L. Caplan and H.T. Engelhardt, Jr. (eds.), Concepts of Health and Disease: Interdisciplinary Perspectives, Reading, MA: Addison-Wesley, 611–626.
• WHO [World Health Organization], 1948, Preamble to the Constitution of the World Health Organization as adopted by the International Health Conference, New York, 19 June–22 July 1946; signed on 22 July 1946 by the representatives of 61 States (Official Records of the World Health Organization, no. 2, p. 100) and entered into force on 7 April 1948.
• Williams, T.N., T.W. Mwangi, S. Wambua, N.D. Alexander, M. Kortok, R.W. Snow, and K. Marsh, 2005, “Sickle Cell Trait and the Risk of Plasmodium falciparum Malaria and Other Childhood Diseases”, Journal of Infectious Diseases, 192: 178–186.
• Woodward, J., 2002, “What Is a Mechanism?” Philosophy of Science, 69: S366–377.
• –––, 2003, Making Things Happen, Oxford: Oxford University Press.
• Worrall, J., 2002, “What Evidence in Evidence-Based Medicine?” Philosophy of Science, 69: S316–330.
• –––, 2007a, “Evidence in Medicine and Evidence-Based Medicine”, Philosophy Compass, 2: 981–1022.
• –––, 2007b, “Why There’s No Cause to Randomize”, British Journal for the Philosophy of Science, 58: 451–488.
• Worrall, J. and J. Worrall, 2001, “Defining Disease: Much Ado about Nothing”, in A. Tymieniecka and E. Agazzi (eds.), Life Interpretation and the Sense of Illness Within the Human Condition, Dordrecht: Kluwer Academic Publishers, 33–55.