Scaling Down Inequality

Gender bias in student evaluations of college professors diminishes considerably when the rating scale changes.

James Joyner · Thursday, January 31, 2019 · 16 comments

Kevin Drum points to a study by Lauren Rivera and András Tilcsik with a curious finding.

Here’s the paper abstract:

Quantitative performance ratings are ubiquitous in modern organizations—from businesses to universities—yet there is substantial evidence of bias against women in such ratings. This study examines how gender inequalities in evaluations depend on the design of the tools used to judge merit. Exploiting a quasi-natural experiment at a large North American university, we found that the number of scale points used in faculty teaching evaluations—whether instructors were rated on a scale of 6 versus a scale of 10—significantly affected the size of the gender gap in evaluations. A survey experiment, which presented all participants with an identical lecture transcript but randomly varied instructor gender and the number of scale points, replicated this finding and suggested that the number of scale points affects the extent to which gender stereotypes of brilliance are expressed in quantitative ratings. These results highlight how seemingly minor technical aspects of performance ratings can have a major effect on the evaluation of men and women. Our findings thus contribute to a growing body of work on organizational practices that reduce workplace inequalities and the sociological literature on how rating systems—rather than being neutral instruments—shape the distribution of rewards in organizations. [emphases mine]

Drum highlights this explanation from later in the paper:

Drawing from a complementary survey experiment, we show that this effect is not due to gender differences in instructor quality. Rather, it is driven by differences in the cultural meanings and stereotypes raters attach to specific numeric scales. Whereas the top score on a 10-point scale elicited images of exceptional or perfect performance—and, as a result, activated gender stereotypes of brilliance manifest in raters’ hesitation to assign women top scores—the top score on the 6-point scale did not carry such strong performance expectations. Under the 6-point system, evaluators recognized a wider variety of performances—and, critically, performers—as meriting top marks. Consequently, our results show that the structure of rating systems can shape the evaluation of women’s and men’s relative performance and alter the magnitude of gender inequalities in organizations. [emphases presumably Drum’s]

and observes:

In other words, students viewed a 9 or 10 on a scale of 1-10 as implying true brilliance, and they were reluctant to evaluate female instructors as brilliant. However, a 6 on a scale of 1-6 doesn’t carry the same connotations. Students interpret it as really good, but not necessarily brilliant. Because of that, they were perfectly happy to evaluate the top female instructors with the top evaluation.

Do you believe this? Do I believe it? Beats me. The sample size in the study is large, so that’s not a problem. The switch to a 6-point scale was unrelated to gender concerns, so that’s not an issue. The modeling appears to be reasonable. And the change in results is large. The effect sure seems real, but it’s still anyone’s guess about why the effect is real and why it’s so large. Given my respect for cognitive biases like framing effects, the authors’ explanation seems OK to me, but it’s still a bit of a guess. I’d sure like to hear a few other people weigh in.

The authors both have PhDs in sociology from Harvard and are tenured at top-drawer universities, and the article is forthcoming in what I believe to be the top journal in their field, so I’ll defer to their expert judgment, especially since I’ve only skimmed the article.

One thing that occurs to me about the specific scale—6 vice 10—is that the latter number has a particular connotation that might bring out gender bias in a way the former does not. That is, it has long been a custom for boys and men to rate girls and women on their physical attractiveness using a scale of 1 to 10. I first became aware of this phenomenon in 1979, following publicity for the movie “10” starring Bo Derek, although the practice may well long predate that film. So, it’s quite possible that male students asked to rate a female professor on a scale of 1 to 10 will subconsciously factor in her sexual desirability in a way that they wouldn’t with a male professor. And, on a 6-point scale, that connotation simply wouldn’t be introduced.

That’s pure conjecture, of course, and I’m not sure offhand how one would even go about testing it.

Comments

mattbernius says:

Thursday, 31 January 2019 at 08:10

Thanks for highlighting this article, James. The social connotations of a 10 point scale are really interesting. I could see the “Bo Derek” factor coming into play. After all, if we buy into the authors’ argument that the presence of “10” is invoking one cultural framework of evaluation (i.e. that “10” = Perfect), then that opens the door for other cultural frameworks too.

Drum, and the authors he’s summarizing, make a convincing argument about the effect of the scale change:

However, a 6 on a scale of 1-6 doesn’t carry the same connotations. Students interpret it as really good, but not necessarily brilliant.

Having less choices makes the range of each “bucket” wider. So, as the authors suggest, that eliminates the nuances that could let gender and other biases rate one person lower than another.

I also wonder if the unexpected nature of the scale actually caused participants to stop and think more to try and fit an instructor’s performance into this new model. If memory serves, one of the best tools for fighting bias is reflection. I just scanned the article and it looks like the authors raise this point too.

James Joyner says:

Thursday, 31 January 2019 at 08:44

@mattbernius:

Having less choices makes the range of each “bucket” wider.

My fiance and her kids have this thing where they rate things on a scale of 1-2. It tends to lead to less severe judgments, as they tend to think of “1” as “terrible” and “2” as “not so bad.”

mattbernius says:

Thursday, 31 January 2019 at 09:13

@James Joyner:
One other thing I like about low integer, even scales (which I’ve been experimenting with when I occasionally do surveys) is that not only do the buckets lead to clearer outcomes, but they also eliminate the neutral options (which most people gravitate towards). In my experience, “forcing” respondents to ultimately have to choose to score on the positive or the negative side of the scale leads to far more useful and better considered results.

That said, I always get anecdotal feedback that many folks hate not having a “meh” option.

John Peabody says:

Thursday, 31 January 2019 at 09:33

Perhaps I am short-sighted, but I think this is spot on. With a 1-10 scale, I would expect my service (at a hotel, auto repair, shopping experience) to be PERFECT to receive a ten. Well, perfection is darn rare, and so I give very rare tens when asked to rate. But I sometimes (and frequently quite irritatingly) run into situations where a store or hotel DEMANDS to know what they can do to allow me to rate a ten. And it’s worse when drilling down to individual employees, for anything less than a ten puts them into a situation where they worry about dismissal. Yes, I’m grumpy, but I chafe at the pressure to rate “10s” for something I feel is properly a 7-8 situation. A 1-6 scale would do wonders.

Tony W says:

Thursday, 31 January 2019 at 10:11

@John Peabody:

but I chafe at the pressure to rate “10s” for something I feel is properly a 7-8 situation.

I would love to learn which MBA program persuaded today’s business leaders that high-stakes surveys are the way to learn what is happening within their companies. It is idiotic to believe that any survey, the outcome of which determines the employee’s bonus/continued employment, is fair and accurate.

Teve says:

Thursday, 31 January 2019 at 10:23

Robert Parker’s wine scale sucks. What’s the practical difference between a wine that scores an 88, and one that scores a 91?

I’ve always liked the Michelin stars scale for its simplicity.

1 star==really good
2 stars==excellent
3 stars==Holy Shit

OzarkHillbilly says:

Thursday, 31 January 2019 at 11:24

@John Peabody:

But I sometimes (and frequently quite irritatingly) run into situations where a store or hotel DEMANDS to know what they can do to allow me to rate a ten.

“STFU with your stupid questions?” I have gotten to the point where I refuse to take part. If they don’t like it they can GFT.

Kathy says:

Thursday, 31 January 2019 at 11:33

I’ve come to the conclusion that of all the foibles of human cognition, the most important one is this: the impression that we accurately perceive and reason about everything, despite the huge evidence to the contrary.

just nutha says:

Thursday, 31 January 2019 at 11:38

I would love to learn which MBA program persuaded today’s business leaders that high-stakes surveys are the way to learn what is happening within their companies.

Exactly! In some cases, it may be a way of “fishing for compliments.” Case in point: The Chevy dealer in my town is reported as being a guy who fires people when the dealership doesn’t get “perfect” scores. As a consequence, the sales person called me to ask if I would give him a chance to make any complaints right before I filled out my evaluation poll.

Normally, I don’t give 10s. As many have noted here, nobody is that good. But I do on the performance evals from the Chevy dealership, even if I’m not completely satisfied, because I don’t wish to risk someone losing their job over a minor complaint of mine. And if the dealer is so insecure that he needs the reinforcement that he’s a tough guy, I don’t really care. (Sort of like I feel about Trump, only benign instead of malicious.)

mattbernius says:

Thursday, 31 January 2019 at 11:44

@Tony W:

I would love to learn which MBA program persuaded today’s business leaders that high-stakes surveys are the way to learn what is happening within their companies.

The absolute worst example of this is the Net Promoter Score. It’s, you guessed it, a 0-10 scale: anyone who scores you 0-6 is considered a detractor, 7-8s are “passives” (i.e. they supposedly think you’re ok), and 9-10s are promoters.

The entire scale has been shown to be, at best, good for a very narrow measurement, and more likely pretty much bunk. But that hasn’t stopped everyone from using it for 15+ years.

https://en.wikipedia.org/wiki/Net_Promoter

Every time I see anyone advocating for its use (typically misapplying it) I just shake my head.
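
For anyone who hasn’t seen how the number gets computed, here is a minimal sketch in Python. It assumes the standard definition (percentage of promoters minus percentage of detractors, on a 0-10 scale), which comes from the usual NPS description rather than from anything in this thread. The function names `nps_bucket` and `net_promoter_score` are made up for illustration.

```python
from collections import Counter


def nps_bucket(score: int) -> str:
    """Classify a single 0-10 rating into its NPS bucket."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score <= 6:
        return "detractor"   # 0-6
    if score <= 8:
        return "passive"     # 7-8
    return "promoter"        # 9-10


def net_promoter_score(scores: list[int]) -> float:
    """Return percent promoters minus percent detractors (range -100 to 100)."""
    if not scores:
        raise ValueError("need at least one score")
    counts = Counter(nps_bucket(s) for s in scores)
    return 100.0 * (counts["promoter"] - counts["detractor"]) / len(scores)


# Illustrative, made-up responses: 3 promoters, 3 passives, 2 detractors.
sample = [10, 9, 8, 7, 6, 3, 10, 8]
print(net_promoter_score(sample))  # 12.5
```

Note how much gets thrown away: a 6 counts exactly as much against you as a 0, and the 7s and 8s only matter by diluting the total.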

just nutha says:

Thursday, 31 January 2019 at 11:47

@Teve: My biggest problem with that kind of a scale is that I usually can’t tell the difference between a “60” and an “80.” How am I ever going to tell the difference between 88 and 91?

(For the record, I grew up in a relatively typical (I suspect anyway) Italian family, where my grandparents drank “table wine” out of water tumblers. My parents didn’t drink wine at all (Grandpa was a drunk and my dad never forgot it) and I’m more of a hard liquor guy. I can’t even tell cabernet from merlot.)

Mikey says:

Thursday, 31 January 2019 at 15:45

@James Joyner:

My fiance

Congratulations, James. I can’t imagine having had to deal with what you did and still being able to find hope in life, and another partner with whom to share it. I’m very happy for you that you did.

Gustopher says:

Thursday, 31 January 2019 at 16:01

So, it’s quite possible that male students asked to rate a female professor on a scale of 1 to 10 will subconsciously factor in her sexual desirability in a way that they wouldn’t with a male professor. And, on a 6-point scale, that connotation simply wouldn’t be introduced.

This is just going to mess with boys. They’ll start rating girls on a scale of 1 to 6, one will be overheard in math class talking about Janelle Monáe as a 6 after she gets a haircut, and she’ll have to fight back tears and then decades later she writes a song about it.

Barry says:

Thursday, 31 January 2019 at 17:25

As somebody who has worked and will work with satisfaction data, this is amazing!

DrDaveT says:

Thursday, 31 January 2019 at 21:02

@mattbernius:

One other thing I like about low integer, even scales […] is that not only do the buckets lead to clearer outcomes, but they also eliminate the neutral options (which most people gravitate towards).

I’ve been doing a lot of technical job interviews lately. The system we’ve settled on has 4 possible ratings for the 2 rounds that are inputs to management:
double minus: if you are thinking of making an offer to this person, I want a chance to talk you out of it
minus: I do not recommend making an offer to this person
plus: I recommend making an offer to this person
double plus: if you are not planning to make an offer to this person, I want a chance to change your mind

It works vastly better than all of our previous attempts at scoring.

MarkedMan says:

Thursday, 31 January 2019 at 21:53

We all are sure we are completely unbiased, but I often think about the transition of orchestras to blind tryouts. Initially this was done to combat the tendency of people to pick their own students. But a side effect was that women’s chances of being selected improved by a third. That’s a huge effect!
