It's Possible to Reduce Polarization without Reducing Engagement
The results of the Prosocial Ranking Challenge
On behalf of the dozens of people who worked on this project in one way or another, I am very happy to announce that we have posted an initial paper to arXiv. I’ll let the abstract explain what we found:
We report the first direct comparisons of multiple alternative social media algorithms on multiple platforms on outcomes of societal interest. We used a browser extension to modify which posts were shown to desktop social media users, randomly assigning 9,386 users to a control group or one of five alternative ranking algorithms which simultaneously altered content across three platforms for six months during the US 2024 presidential election. This reduced our preregistered index of affective polarization by an average of 0.03 standard deviations (p < 0.05), including a 1.5 degree decrease in differences between the 100 point inparty and outparty feeling thermometers. We saw reductions in active use time for Facebook (-0.37 min/day) and Reddit (-0.2 min/day), but an increase of 0.32 min/day (p < 0.01) for X/Twitter. We saw an increase in reports of negative social media experiences but found no effects on well-being, news knowledge, outgroup empathy, or perceptions of and support for partisan violence. This implies that bridging content can improve some societal outcomes without necessarily conflicting with the engagement-driven business model of social media.
Results
Here are the results on our polarization index, pooled across all rankers (as preregistered):

And here are the results on platform use time, also pooled:

Here are the results for individual rankers:
Only two rankers showed a statistically significant reduction in polarization, though all five had negative point estimates. That is good news, because it suggests that a variety of algorithms might work. The most depolarizing algorithms were Add News (-0.044 SD) and Uprank Bridging, Downrank toxic (-0.042 SD), though we don’t actually have enough statistical power to distinguish between the rankers.
Add News inserts personalized (topic-matched) news posts from 95 credible and ideologically diverse news sources to increase the user’s exposure to factual public affairs information. Uprank Bridging, Downrank toxic is built on Google Jigsaw’s experimental Perspective bridging API, and reorders posts and comments by the average of their bridging attributes minus the average of their negative attributes (e.g., insult, identity attack, moral outrage, alienation).
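As a rough illustration, that bridging-minus-toxicity ordering can be sketched in a few lines. This is a hypothetical sketch, not the actual PRC or Perspective API code: the attribute names below are stand-ins, and real Perspective scores would come from API calls rather than a local dict.

```python
# Illustrative sketch only: attribute names and scores are hypothetical,
# not the actual Perspective API fields or the PRC implementation.

BRIDGING = ["affinity", "compassion", "curiosity"]                      # assumed bridging attributes
NEGATIVE = ["insult", "identity_attack", "moral_outrage", "alienation"]  # negative attributes named in the post

def rank_score(attrs: dict) -> float:
    """Average of bridging attributes minus average of negative attributes."""
    bridging = sum(attrs[a] for a in BRIDGING) / len(BRIDGING)
    negative = sum(attrs[a] for a in NEGATIVE) / len(NEGATIVE)
    return bridging - negative

def rerank(posts: list) -> list:
    """Order posts most-bridging / least-toxic first."""
    return sorted(posts, key=lambda p: rank_score(p["attrs"]), reverse=True)
```

Under a scheme like this, a warm, curious post outranks an insulting one even if the insulting one would have won on predicted engagement.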
The various algorithms also had fairly consistent effects on time on platform, typically negative for Facebook and Reddit, positive for X/Twitter. (No, we don’t know why we see this particular pattern.)
The polarization reduction includes a 1.5 point decrease in affective polarization on the 100-point feeling thermometer scale (ingroup minus outgroup). Since affective polarization in the U.S. has been rising by about 0.6 points per year over the last four decades, this corresponds to reversing roughly two and a half years of the average polarization increase.
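The conversion behind that comparison is just the observed decrease divided by the long-run trend:

```python
# Back-of-the-envelope: how many years of the average US polarization
# trend does a 1.5-point thermometer decrease undo?
decrease = 1.5  # points, from this study
trend = 0.6     # points/year, the long-run average cited above
years_reversed = decrease / trend
print(round(years_reversed, 1))  # 2.5
```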
We also saw a similarly sized negative change in reported user experience on platform, −0.038 SD measured using the Neely Index. It seems our algorithmic changes made the experience slightly worse for people. This could be because we didn’t really have a user testing cycle for our rankers, or it could be because (for example) seeing more news about the outgroup is just kind of a downer, even if it reduces polarization.
We saw no statistically significant change in our other preregistered outcomes: well-being, support for partisan violence, meta-perceptions of support for violence, and news knowledge. We did see a small increase in social trust, which is consistent with the higher regard for the outgroup that we saw.
As always, there is a ton more in the supplementary materials, including analyses of the sample demographics, heterogeneity, attrition effects, in-feed survey results (all null, I’m afraid), robustness checks, long descriptions of each ranking algorithm, and neat graphs of how many posts each algorithm added/removed/reranked.
Do these results matter?
0.03 standard deviations is a small change. We still think this is an important result, for several reasons.
First, it compares well to the most similar previous experiments. Piccardi et al.’s browser extension experiment injecting content into X/Twitter found a 2.1 point decrease, while Levy et al.’s experiment, in which people manually subscribed to outgroup news sources on Facebook, found a 0.96 point reduction (all on the 100-point feeling thermometer scale). Taken together, these results suggest the beginning of a replicable paradigm. This is heartening news after the Meta 2020 election studies showed no depolarization effects. Why the difference? Most likely because that work was testing different algorithms: chronological feeds, limited reshares, and more outgroup content rather than more bridging outgroup content.
We also believe that we would see bigger effects if a platform actually implemented these algorithms, for three reasons. First, we were not able to change people’s feeds on mobile, where most social media consumption happens. Second, we know there are polarization network effects: people who are less polarized also share less polarizing material. Third, changing ranking algorithms changes incentives for content creators, producing second-order effects. Journalists and politicians, for example, might adapt their content to succeed under more prosocial algorithms.
Moreover, these results are sustainable. We ran the experiment for six months, which should be enough time for any novelty effect to wear off. “One shot” or “few shot” polarization reduction interventions (like watching an ad, talking to someone, or taking a quiz) reduce polarization by an average of 5.4 points, but the effects fade. The effects of an algorithmic change persist for as long as the change does, and potentially compound.
Just as importantly, we have shown that reduced polarization doesn’t necessarily come at the cost of decreased engagement. This is very relevant to product design and policy discussions. It’s also a good reminder that we can’t assume that every platform is already on the Pareto frontier of good for business and good for society.
Next stop: a production implementation
I’ve described this experiment many times over the past few years, and the most common questions I got were “do you think a platform would actually do this?” and “can I try it?”
Well, we’re going to make sure at least one platform actually does it, so that you can try it. Several members of the Ranking Challenge team are already busy at work on GreenEarth, a project to create open-source AI-driven recommender infrastructure for ATProto. This will allow BlueSky users — and users of many other ATProto apps like Graze.social and Skylight — to access, and even create, a wide variety of customized prosocial feeds.

Future work
There’s still a ton of data analysis left to do! We collected all the content that participating users saw on algorithmic feeds over six months (196 million items) and all the posts and comments they contributed (1.2 million items), as well as every engagement action on those items, such as likes and shares (84 million events) — but we haven’t analyzed the text data at all yet.
We expect to write at least two more papers from this dataset. And then we’ll anonymize it carefully, and release it publicly for future researchers.
Thank you for being a part of the Prosocial Ranking Challenge.
