Last week, a colleague on Twitter (Hi, Brendan!) asked – possibly rhetorically – whether it was possible to measure quality in higher education. I took the bait and thought I would formulate my response here.
Not everyone agrees, but I think quality can be evaluated in almost every aspect of higher education – not always in strictly quantitative ways, but certainly in ways that allow general comparison across similar units or organizations. The important thing is that quality needs to be measured simultaneously at multiple levels – programs/departments, institutions and systems. The fashion for measuring it merely at the level of the institution is short-sighted. And one needs to measure slightly different things at different levels; it’s not all the same.
Let’s start with teaching evaluation. At the program level, you want to know whether graduates have learned what they were supposed to. In professional programs, this could be taken care of by licensing exams – though unfortunately in Canada licensing exam pass rates by institution are secret. In countries like Brazil and Jordan, there are subject-level exit exams which can test whether subject matter has been absorbed. Of course, for that approach to work you’d need some rough agreement across institutions about what kind of knowledge students in each program are meant to master, and whatever you think of that proposition (personally, I am not sure why anything deserves to be called a “discipline” if it can’t agree on something that simple), that probably isn’t happening any time soon.
So, what are the alternatives? Well, any method of assessment theoretically works as long as it is being held to some kind of external standard. So, grade point averages, or individual portfolios, or whatever – they all work provided the rubrics according to which they are assessed are themselves subject to some external check, to make sure that what counts as an A in one institution counts as an A in another. The UK and Denmark are two countries which have systems aiming to work in this way; here in Canada, every institution basically has its own grading curve, which makes comparability impossible. Of course, whether you measure disciplinary-level outcomes this way or via exams, you’d want a control that accounts for the preparedness-level of entrants into that program, so that you aren’t just measuring how good an institution’s admissions office is at attracting top students.
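To make that last point concrete, here is a minimal illustrative sketch – entirely my own construction, with made-up numbers and program names – of what controlling for entrant preparedness might look like: predict exit scores from entry grades, then compare programs on how far their graduates beat or miss that prediction.

```python
import numpy as np
import pandas as pd

# Made-up data: one row per graduate, with entry average, exit score,
# and the program they graduated from.
df = pd.DataFrame({
    "entry_avg":  [78, 85, 91, 74, 88, 82, 95, 70],
    "exit_score": [62, 71, 80, 60, 74, 77, 88, 58],
    "program":    ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Simple linear model: expected exit score given entry average.
X = np.column_stack([np.ones(len(df)), df["entry_avg"]])
coef, *_ = np.linalg.lstsq(X, df["exit_score"].to_numpy(), rcond=None)
df["expected"] = X @ coef

# "Value added" here is the average gap between actual and expected scores:
# how much better (or worse) a program's graduates do than their entry
# grades would predict.
value_added = (df["exit_score"] - df["expected"]).groupby(df["program"]).mean()
print(value_added.sort_values(ascending=False))
```

A real system would obviously need far richer controls and some honest error bars, but the logic is the same: compare outcomes to expectations, not raw outcomes.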
At the institutional level, I think you want to know something slightly different about teaching quality. You want to aggregate those program-level results, yes. But it’s the institution, not the program, which tends to take responsibility for the student’s overall personal development. The student services, the extra-curricular opportunities, all those life-altering things that happen to students and shape their later course of life – that’s worth measuring at the institutional level. Maybe that includes measures like employment and salary, but personally, I’d be interested in finding out, say 10 years after graduation, how graduates feel their life is going, and the extent to which their institution played a positive role in their personal development. Seems like the kind of thing worth knowing.
At the system level, one purpose of public education is to ensure that employers (both public and private) are getting a steady stream of graduates with the right kinds of skills. One could try to measure this at the level of the institution or the program, but employers don’t necessarily have a great sense of quality at that level of detail (they have lots of employees, and don’t necessarily know where they all come from). What they do have is a pretty good sense of how new graduates are doing compared to previous cohorts. Understanding where new graduates are meeting standards and where they are falling short would be an important source of information about quality, and also about how all schools could improve, given constantly changing technology and labour market demands.
Then there’s the question of research or, more broadly, knowledge production. I think there is a difference in how one chooses to evaluate Mode 1 vs Mode 2 research (or, very roughly, “pure” and “applied” research). It probably makes most sense to measure Mode 1 at the level of the subject – which, given the way that universities are organized, usually (but not always) means the level of the department. In most fields – humanities and fine arts excepted – one can use bibliometrics relatively unproblematically (they are much maligned, but when used at the departmental level to compare like departments, they work pretty well – much better than they do at either the individual or institutional level). One could also use some kind of more nuanced and contextualized analysis like the Research Assessment Exercises in the UK, but these are a lot more expensive than bibliometrics without necessarily being a whole lot better. The main benefit – in theory at least – is that the expert panels doing the reviewing might be able to pick out some research gems that bear significant promise for the future without having yet had a major impact on the literature, but that seems likely to be hit-and-miss.
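For what it’s worth, here is a purely illustrative sketch – made-up data and field baselines, not any agency’s actual method – of the kind of department-level bibliometric comparison I have in mind: normalize each paper’s citations against the average for its field and year, then average across a department’s output.

```python
import pandas as pd

# Hypothetical publication records for two like departments.
pubs = pd.DataFrame({
    "department": ["Chem (U1)", "Chem (U1)", "Chem (U2)", "Chem (U2)", "Chem (U2)"],
    "field":      ["chemistry"] * 5,
    "year":       [2018, 2019, 2018, 2019, 2019],
    "citations":  [12, 4, 20, 7, 3],
})

# Made-up world-average citations per paper, by field and year.
baseline = {("chemistry", 2018): 10.0, ("chemistry", 2019): 5.0}

# Field-normalized score: each paper's citations relative to the average
# paper in the same field and year (1.0 = world average).
pubs["normalized"] = [
    c / baseline[(f, y)]
    for c, f, y in zip(pubs["citations"], pubs["field"], pubs["year"])
]

# Compare like departments on mean normalized citation impact.
print(pubs.groupby("department")["normalized"].mean())
```

The point of the normalization is simply that comparing departments on raw citation counts would mostly measure field and size, not quality.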
Mode 2 research – that is, research meant to be put into immediate action – is, it seems to me, somewhat different. Subjects/departments are not usually set up to do the connection work with business and community required to make Mode 2 happen. But nor, pace the folks at THE, is this primarily an institutional-level responsibility. Mode 2 varies enormously by broad field of study, not just as a function of the kind of knowledge each field produces, but also as a function of the industries they feed into. Engineers tend to work with profit-seeking companies which are lean and eagerly looking for sources of profit both short- and long-term; nursing and public health units are dealing with big health bureaucracies which can be resistant to change, etc. So it seems to me that this meso-level of field of study is probably a better choice than either subjects/departments or institutions as the level of responsibility for Mode 2. As for how you measure it, let me refer you back to last week’s blog on The Impact of Impact.
(There is a counter-argument here, which is that an awful lot of Mode 2 is interdisciplinary. In fact, at least in Canada, the argument for greater interdisciplinarity and the argument for Mode 2 research seem to be almost identical – see this piece from Daniel Woolf a few years ago in University Affairs. If you buy this argument, then the institution is probably the right level.)
It is also worth periodically doing some system-level investigation of Mode 2 research: asking whether institutions are producing enough of this kind of work to power public and private innovation. I think this can be done through some combination of quantitative surveys and targeted qualitative interviews – the kind of thing the Council of Canadian Academies does reasonably well.
So that’s it. But why don’t we do it? Well, it’s exhausting, for one thing. And if people think that assessment is some unnecessary “extra”, it’ll never get the funding required to do it properly. But I think there’s also something deeper at work. Governments have come to think of quality measurement as something which should be used only for summative purposes – that is, to judge institutions and in some cases to base funding on these judgements. That’s not always a terrible idea, but it tends to put people’s backs up and in some cases makes them more resistant not just to change but to thinking about quality in a productive manner.
But there’s another possibility: why not treat more of these measurements as formative – that is, designed to help guide improvement at all levels (including the system level) – rather than to pass judgement? That wouldn’t necessarily make institutions hotbeds of improvement and innovation, but it would take a lot of the pressure and heat out of the quality measurement game.
In other words, if we made real attempts to measure quality as if quality mattered, rather than (or at least in addition to) treating measurement as a cheap accountability system, we might actually get somewhere. Worth thinking about, anyway.
This starts by conflating the quality of education with its outcomes, but of course quality is much more than just outcomes (Stufflebeam, 2002). It then redesignates quality as value added, which is a very different concept, and very hard to measure (Banta, 2007), as the results from the Collegiate Learning Assessment show yet again (Douglass, Thomson & Zhao, 2012).
Mode 2 is not just applied research, but ‘knowledge . . . worked out in a context of application. . . . interdisciplinary. . . . heterogeneity of skills. . . . more heterarchical and transient. Each employs a different type of quality control. . . . mode 2 is more socially accountable and reflexive. It includes a wider, more temporary and heterogeneous set of practitioners, collaborating on a problem defined in a specific and localised context.’ (Gibbons, 1997: 3)
Banta, T. W. (2007, January 26). A warning on measuring learning outcomes. Inside Higher Ed. http://www.insidehighered.com/layout/set/print/views/2007/01/26/banta
Douglass, J. A., Thomson, G., & Zhao, C. (2012). The learning outcomes race: the value of self-reported gains in large research universities. Higher Education, 64, 317-335.
Gibbons, M. (1997). What kind of University? Research and teaching in the 21st century. 1997 Beanland Lecture, Victoria University of Technology. https://www.westernsydney.edu.au/__data/assets/pdf_file/0017/405251/Gibbons_What_Kind_of_University.pdf
Stufflebeam, D. L. (2002). The CIPP model for evaluation. In D. L. Stufflebeam, G. F. Madaus, & T. Kellaghan (Eds.), Evaluation models (2nd ed., Chapter 16). Boston: Kluwer Academic.
So you want bibliometrics of departments, not faculty members, which would be formative, not summative, and used to improve, not discipline. Good luck with that.
In all likelihood, some set of metrics will be brought in with these sorts of promises, then all the department measures will be downstreamed onto faculty, who’ll be annually poked, prodded and asked to explain how they’re contributing to the department’s bibliometric score. What gets measured, gets micromanaged.
Even if not obviously used to punish, any quality measure cannot help but be taken as a judgement, and often a judgement on maddeningly perverse grounds. Nobody likes to be judged, especially not people amongst whom imposter syndrome runs rampant. One might as well command existential crises.
Sometimes the past is prologue. A historical note about the definition and measurement of quality can be found in the 1993 report of the Task Force on University Accountability in Ontario (the Broadhurst Report), and in the report of the Task Force’s Committee on Accountability, Performance Indicators, and Outcomes Assessment. The Task Force recommended that each university adopt indicators for inclusion in its accountability framework, and in turn recommended the indicators identified by the Committee. Twenty-four indicators were identified, divided into five categories: Responsiveness, Quality, Performance, Resources, and Mission.
There were eight indicators of quality: distribution of entering grade average; acceptance or yield rate; research grants per professor; research yield; library resources (volumes held, and total spending); per cent of faculty holding scholarly awards; and per cent of students holding scholarly awards. The report included a metric for each indicator. After the report was adopted by the Council of Ontario Universities, each metric was tested to determine whether it could be reliably calculated by each university. The test concluded that each indicator was feasible in practice.
Today a different set of quality indicators might be constructed, and the metrics of some of the original indicators might no longer be feasible. Nevertheless, it would be mistaken to conclude that the task of defining and measuring quality is impossible, or even as difficult as it might now appear. It was done once, and was proven workable. The Task Force, by the way, came to the conclusion that the focal point for measuring quality should be programs, not institutions, except as an aggregate compendium. That followed the Task Force’s logic of breaking indicators into five categories, some of which (other than quality) could function at the institutional level.