Subscribe for updates on the API, contest rules, events, deadlines, etc.
This FAQ is divided into three sections: contest questions, scientific questions, and technical questions.
Contest FAQ
How do I submit?
Use the submission form.
What are the deadlines?
April 15, 2024: Submissions due
Each submission will include documentation describing how the algorithm works, what outcome(s) it is expected to change, and why it is desirable to test, plus a prototype implementation (running as a live API endpoint that you host).
May 1, 2024: Finalists announced
We will select up to ten finalists. Each gets a $1,000 prize and one month to create a production-ready version of their algorithm. Our judges will try to give useful suggestions on how to strengthen your approach.
June 1, 2024: Finalist submissions due
This time your ranker will need to be delivered in a Docker container and meet certain performance and security requirements. $50,000 in prize money will be split among all submissions that meet the technical requirements (see below).
June 15, 2024: Winners announced
While we will award prize money to up to ten finalists, it is expensive to run a ranker, so unfortunately we can't test all of them. We will select up to five rankers to test based on scientific merit and technical readiness (including performance and stability).
What do you mean by “ranking algorithm”?
A program that takes a set of posts and re-orders them. It can also remove posts or add entirely new posts. We think adding new posts may be the most powerful strategy! See our API docs.
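To make that shape concrete, here is a minimal sketch of a ranker's core logic. The field names ("id", "text") and the return structure are placeholders we made up for illustration, not the real contract; the API docs define the actual request and response formats.

```python
# A minimal sketch, not the real API: field names and the return structure are
# placeholders for illustration only -- see the API docs for the real contract.

def rank(posts: list[dict]) -> dict:
    # Re-order: a toy scoring rule standing in for a real model.
    kept = sorted(posts, key=lambda p: len(p["text"]))
    # Remove: drop anything your ranker decides the user shouldn't see.
    kept = [p for p in kept if not p["text"].isupper()]
    # Add: insert entirely new posts of your choosing.
    new_posts = [{"text": "A post your ranker decided to add to the feed."}]
    return {"ranked_ids": [p["id"] for p in kept], "new_posts": new_posts}
```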
Who is eligible to enter?
Any person or team of people, with a few exceptions (conflicts of interest, residence in a country we can't legally send prize money to, etc.).
What are you looking for when judging rankers?
Scientific merit, meaning that 1) the ranker is based on a plausible theory of how exposure to different content on Facebook, X, and Reddit can change one of our informational, well-being, or conflict-related outcomes, and 2) there's a reason to believe the effect will be large enough to detect with the sample size we can afford.
Desirability, meaning that the change you want to make seems like a good idea, something that would improve current designs.
Technical requirements, meaning that you meet the technical specification at each stage of the competition. These concern API conformance, maximum latency, and security.
One tricky issue here, really the limiting factor in what you can do, is that your content changes can't make people want to stop participating in the experiment. The experience should be mostly unobtrusive, so that we don't get differential attrition between arms, which could cause selection bias. We'll be checking for differential attrition in our data analysis (a simple example of the kind of check we mean is below).
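A hedged sketch of such a check, with entirely made-up numbers: a contingency-table test of whether dropout rates differ across arms.

```python
# Illustrative only: the completion/dropout counts are made up, and this is not
# a statement of our actual analysis plan.
from scipy.stats import chi2_contingency

#          completed, dropped out
counts = [[1380, 120],   # control arm
          [1395, 105],   # ranker A
          [1310, 190]]   # ranker B

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, p = {p:.4f}")  # a small p suggests attrition differs by arm
```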
Does my first-round submission have to work for all three platforms?
Yes, it must work for Facebook, X, and Reddit. There are some differences between platforms. For example, Facebook and Reddit have comment threads while X does not. You can re-order both posts and comments on those platforms, but you can only insert posts, not comments. See the API docs.
I want to ask you something else
That’s not a question, but join us on Discord.
Science FAQ
What outcomes will you test in the experiment?
A battery of both survey and behavioral outcomes related to conflict (including polarization), well-being (including addiction), and information (including misinformation). Here is our draft list of survey items. Do you have a specific recommendation? Let us know.
There will be about 1,500 people in each arm, which will allow us to detect effect sizes of 0.1 standard deviations with 80% power. Each user will use the extension for four months, with surveys at 0, 2, and 4 months.
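As a rough sanity check on that number (assuming a simple two-sided, two-sample t-test at α = 0.05; the actual pre-registered analysis will differ):

```python
# Back-of-the-envelope power check for 1,500 participants per arm at d = 0.1.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.1, nobs1=1500, ratio=1.0, alpha=0.05)
print(f"power ~ {power:.2f}")  # about 0.78 under these simple assumptions
```

Covariate adjustment and repeated measurements should push effective power somewhat above what this simple calculation suggests.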
Can I add outcomes to test?
Yes, up to three survey questions. If they’re good questions.
Will the results be good science?
We are making the method as strong as we can manage, including pre-registration, multiple control arms, manipulation and robustness checks, and appropriate statistical analysis.
The major limitations are that we can only sort the top few hundred posts (the most we can retrieve at once from what has already been selected for the user), that it’s desktop only, that there will be a slight delay when loading posts (we hope to keep this to less than a second), and that we will only have 1,500 participants per arm.
However, this experiment is multi-platform, long term, and ecologically valid – we will be testing real users on real platforms.
Are there any ethical considerations?
Yes, because we are experimenting on people and collecting their data. We believe this research readily meets the standards of autonomy, beneficence, non-maleficence, and justice. That is, we think this research will produce valuable information that can make things better for others in the future without exposing anyone to unreasonable risk or harm.
We are following standard research practices including appropriate institutional approvals. We will include your algorithm in our IRB submission, and may decline to test your ranker if we cannot get ethics approval for it. We have received approval for a similarly structured study before. Participants will be debriefed and paid for their time. We are addressing relevant privacy and security concerns with appropriate procedural and technical approaches.
Questions or ideas? Please talk to us!
Who will be authors on the paper?
The project team, the judges (if they wish), and the members of the winning teams.
Who is running this experiment?
The experiment is being run out of the UC Berkeley Center for Human-Compatible AI, but it’s an interdisciplinary team effort.
Jonathan Stray – UC Berkeley (computer science)
Julia Kamin – Civic Health Project (political science)
Kylan Rutherford – Columbia University (political science)
Ceren Budak – University of Michigan (computational social science)
George Beknazar-Yuzbashev – Columbia University (economics)
Mateusz Stalinski – University of Warwick (economics)
Ian Baker – UC Berkeley (engineering)
Who will judge the entries?
The following astounding individuals have volunteered their time and expertise both to advise us on this project, and judge the entries:
Mark Brandt, Michigan State
Amy Bruckman, Georgia Tech
Andy Guess, Princeton
Dylan Hadfield-Mennell, MIT
Michael Inzlicht, U Toronto
Alex Landry, Stanford
Yph Lelkes, U Penn
Paul Resnick, U Michigan
Lisa Schirch, U Notre Dame
Joshua Tucker, NYU
Robb Willer, Stanford
Magdalena Wojcieszak, UC Davis
Submissions will be anonymized, and any judges with conflicts of interest will be recused from judging those entries.
Who is funding the challenge?
UC Berkeley CHAI
Jonathan Stray is receiving financial support for this project as a Project Liberty Fellow, powered by Common Era.
Technical FAQ
How do I write a ranker?
Check out the API docs.
What data can the ranker use?
Your ranker gets to see the text content of all posts the user sees. You can store whatever you want in a private database.
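For example (purely hypothetical, and not part of the provided infrastructure), a ranker might keep its own SQLite store of scores it has already computed:

```python
# Hypothetical private store: persist whatever your ranker wants to remember
# between calls, e.g. scores it has already computed for posts it has seen.
import sqlite3

db = sqlite3.connect("ranker_private.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS post_scores (
        post_id   TEXT PRIMARY KEY,
        score     REAL,
        scored_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def remember_score(post_id: str, score: float) -> None:
    # Upsert, so re-ranking the same post just refreshes its stored score.
    db.execute(
        "INSERT INTO post_scores (post_id, score) VALUES (?, ?) "
        "ON CONFLICT(post_id) DO UPDATE SET score = excluded.score",
        (post_id, score),
    )
    db.commit()
```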
What data will the ranker get about each participant?
Some basic demographics from the intake survey, including political party self-identification and intensity of social media use. Let us know if you’d want age, race, gender, or socioeconomic status (SES) — these are sensitive, so we’d want to review your plans.
A history of what each participant has previously seen and what they have engaged with will be available in the database.
We are still discussing whether we can provide any data about the user’s social network, e.g. a list of who they follow on X. But even if we can, we still won’t be able to provide any information about any of those users, because there’s no time to do that much scraping within the 500ms window. Let us know if you think you could do something interesting with this limited information.
What data will the ranker get about each post?
Text and basic metadata, including comment threading, session ID, and an indication of whether there is an image, with the alt text if so. See the API docs.
We are discussing including classifier output for each post that estimates a) whether the post is political or civic content and b) its political ideology on a left-right scale. Of course you could compute this yourself, but let us know if you’d like us to provide it.
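To make that concrete, a post record might look roughly like the sketch below. Every field name here is an illustrative placeholder (the API docs define the real schema), and the classifier fields are commented out because they are still under discussion.

```python
# Illustrative placeholder, not the real API schema -- see the API docs.
example_post = {
    "id": "abc123",
    "session_id": "sess-42",
    "platform": "reddit",
    "text": "Has anyone tried the new bike lanes downtown?",
    "parent_id": None,         # comment threading: None means a top-level post
    "has_image": True,
    "image_alt_text": "A painted green bike lane on a city street",
    # Possible classifier output, if we end up providing it:
    # "is_political": False,
    # "ideology_score": 0.0,   # position on a left-right scale
}
```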
How many posts will be included in one call to the ranker?
Up to a couple hundred, depending on what is retrievable from the platform within the 500ms window. These will all be posts that the platform has already selected for the user, so depending on your goal there may or may not be value in reordering them, but of course you can remove and add posts too.
Ads in the feed will not be sent to your ranker, and all advertising on the page will be preserved.
Can I scrape data from within my ranker?
For security reasons your ranker cannot call external APIs or services, but you can run a background process that imports public data or scrapes public social media data. You could ingest all of Wikipedia or monitor Google News, if you want. You cannot scrape the user’s social network. See this post and the repo.
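As a sketch of what such a background process could look like (the feed URL, refresh interval, and table layout are all arbitrary choices for illustration):

```python
# A hypothetical background ingester: not ranking-time code, just a separate
# process that periodically pulls public headlines into your private database.
import sqlite3
import time

import feedparser  # pip install feedparser

db = sqlite3.connect("ranker_private.db")
db.execute("CREATE TABLE IF NOT EXISTS headlines (title TEXT, link TEXT UNIQUE)")

def ingest_once(feed_url: str = "https://news.google.com/rss") -> None:
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        db.execute(
            "INSERT OR IGNORE INTO headlines (title, link) VALUES (?, ?)",
            (entry.title, entry.link),
        )
    db.commit()

if __name__ == "__main__":
    while True:              # refresh every 15 minutes (interval is arbitrary)
        ingest_once()
        time.sleep(15 * 60)
```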
Can we ask users for information and use that to personalize ranking?
Not currently planned. We love the idea of greater user control over ranking algorithms, but we also want to figure out good defaults because most people won’t use controls. And this simplifies the experiment.
However, if you wanted to add up to three questions to the intake survey, you could get that data as part of the user demographics.
Can we use LLMs in the ranker?
Definitely, though for privacy reasons you cannot rely on any external services in the final submission. This means, for example, that you can submit a first-round prototype built using GPT-4 or Claude, but you cannot use these in your production code. We will provide sample code for a Mistral-based production ranker. See the repo.
What hardware will the ranker run on?
For the prototype, you are hosting your ranker so you can run it on anything you like.
For the production ranker, assume you can run a 7B-parameter LLM on all items to rank (~300) within the 500ms window, 95% of the time. We will provide support for parallelizing LLM calls to make this possible (using scorer processes). This is not a large model by modern standards, and you cannot call external models due to privacy and security constraints. However, in our experience Mistral 7B can match the accuracy of GPT-4 on classification and scoring tasks, if you distill the GPT-4 output and fine-tune Mistral on it.
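The scorer-process support we will provide is not shown here, but as a rough sketch of the pattern, assuming a per-post `score_post` call into your local model (a placeholder name), you might fan out calls under a hard deadline like this:

```python
# Sketch of fanning out per-post scoring with a hard deadline. `score_post` is a
# placeholder for whatever your local-model call looks like; the organizers'
# scorer-process support is not shown here.
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

DEADLINE_S = 0.45  # leave some headroom inside the 500ms window

def score_post(post: dict) -> float:
    raise NotImplementedError  # placeholder: call your local model here

def score_all(posts: list[dict]) -> dict[str, float]:
    scores = {p["id"]: 0.0 for p in posts}  # fallback if a call misses the deadline
    pool = ThreadPoolExecutor(max_workers=32)
    futures = {pool.submit(score_post, p): p["id"] for p in posts}
    try:
        for fut in as_completed(futures, timeout=DEADLINE_S):
            scores[futures[fut]] = fut.result()
    except FuturesTimeout:
        pass  # keep whatever finished in time; the rest keep the fallback score
    finally:
        pool.shutdown(wait=False, cancel_futures=True)  # don't block past the deadline
    return scores
```

The point of the fallback scores is that your response can always go out within the latency budget, even if a few per-post calls run long.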