Background Processes: Databases, Scrapers, Workers and Scorers
How to get outside data into your ranker, and use parallelism for speed
A number of ranking algorithms depend on collecting and remembering some sort of information (like the history of posts a user has engaged with), continuously importing external data (like the most recent posts from a set of users), or running some processing in the background (like updating a model). Additionally, the 500ms (p95) latency requirement may demand post-level parallelism. This post describes the architecture we are implementing to support all of these cases.
None of the information here applies to your first-round submission (due April 15th). For your first-round submission, which you host yourself, you can run whatever processes you want and access any external services or APIs. But it must be possible to later implement your design with the API below to be eligible to win.
This post describes the API and security model that production rankers must conform to, for those finalists who are invited to submit a production ranker in the second round (due May 15th). All finalist teams who successfully submit a ranker meeting our technical requirements will share in the $50,000 prize pot. These requirements include the constraints on network access and background processes described in this post.
We do not yet have detailed documentation on this API — it’s coming. Now would be a good time to subscribe.
No network access, but a database
We absolutely cannot leak any information that we collect from users. This is a basic privacy and security requirement, and is also required by our institutional ethics review. To ensure this, your ranker will be run inside a sandbox with no network access (beyond being able to respond to incoming requests for ranking).
However, some algorithms need to store data between ranking sessions. For example, many ranking algorithms depend on the history of what the user has seen and clicked on. This involves storing sensitive data, but only data the ranker already sees, and again, there is no network access that could lead to intentional or unintentional leaks.
To facilitate such stateful algorithms, each ranker will be provided with read/write access to a private key-value store. This will likely be Redis, but we have not made a definitive decision yet.
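As a rough sketch of what this might look like in practice, here is how a ranker could persist per-user engagement history if the store turns out to be Redis-compatible. The connection URL, environment variable, and key names below are invented for illustration and are not part of the official API.

```python
import os

import redis

# The store and connection method are not final; this assumes a Redis-compatible
# database whose URL is supplied by the challenge infrastructure (the variable
# name RANKER_DB_URL is made up here).
db = redis.Redis.from_url(os.environ.get("RANKER_DB_URL", "redis://localhost:6379/0"))


def record_engagement(user_id: str, post_id: str) -> None:
    """Append a post the user engaged with to that user's history list."""
    db.rpush(f"engaged:{user_id}", post_id)


def recent_engagements(user_id: str, n: int = 100) -> list[str]:
    """Return the user's n most recent engagements, oldest first."""
    return [p.decode() for p in db.lrange(f"engaged:{user_id}", -n, -1)]
```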
First-round submissions are self-hosted, so if you need a database you can use whatever you like. For second-round submissions, the blue “ranker” process must be submitted inside a Docker container that exposes a REST endpoint implementing the ranking API; we will provide the database and enforce the sandbox.
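To make the container requirement concrete, here is a minimal sketch of a ranking endpoint using FastAPI. Since the ranking API is not yet documented, the request and response schemas below are placeholders; only the general shape (a single HTTP endpoint that accepts candidate posts and returns an ordering) reflects what this post describes.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


# Placeholder schemas: the real ranking API has not been published yet, so the
# field names below are guesses used only to illustrate the container's shape.
class Post(BaseModel):
    id: str
    text: str


class RankingRequest(BaseModel):
    session_user_id: str
    posts: list[Post]


class RankingResponse(BaseModel):
    ranked_ids: list[str]


@app.post("/rank", response_model=RankingResponse)
def rank(request: RankingRequest) -> RankingResponse:
    # Trivial placeholder behavior: return posts in the order received.
    return RankingResponse(ranked_ids=[p.id for p in request.posts])
```

The container would then simply start this app (for example with uvicorn) and expose its port to the platform.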
External scraping
Not being able to access the outside world is a severe limitation in many ways. For example, pro-social ranking algorithms might involve adding items that the user normally wouldn’t see. You might, for example, want to inject content of the type that succeeded in the Strengthening Democracy Challenge, or select bridging content from elsewhere on the platform. These items must already exist on the platform, either posted by you or by someone else. Without any network access, how would your ranker learn about the existence of new items it can add?
The answer is an external scraping process. Your ranker container can optionally include a scraper process that will run in the background outside the sandbox, with full network access. However, it can only communicate with your ranker by writing to the database; privacy is protected because the scraper can never read from it.
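A scraper might then look something like the following sketch: it periodically fetches candidate content from the public web and pushes it into the shared store, never reading anything back. The feed URL, key names, and post fields are all hypothetical.

```python
import json
import os
import time

import redis
import requests

# Same invented connection variable as the database sketch above; the scraper
# shares the store with the ranker but only ever writes to it.
db = redis.Redis.from_url(os.environ.get("RANKER_DB_URL", "redis://localhost:6379/0"))

# Hypothetical public endpoint serving candidate bridging content.
FEED_URL = "https://example.org/bridging-feed.json"

while True:
    resp = requests.get(FEED_URL, timeout=30)
    resp.raise_for_status()
    for item in resp.json():
        # Write-only channel: push candidate posts (keyed by platform post ID)
        # into the shared store for the ranker to pick up later.
        db.hset("candidate_posts", item["post_id"], json.dumps(item))
    time.sleep(600)  # re-scrape every ten minutes
```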
Again, only second-round submissions need to be containerized. For your first-round submission, you are hosting the ranker so you can arrange your scraping however you want. However, your scraper cannot receive any data from the ranker; if it does, your design cannot run as a production ranker and thus will not be eligible to win.
Note that your ranker cannot tell the scraper what to scrape. For example, it cannot reveal the usernames of participants to the scraper, which would be needed to, for example, scrape a participant’s social network. We are exploring ways to make limited portions of the participants’ social network available to rankers; let us know if this is a priority for you.
Workers and scorers
Your ranker is only invoked when there are posts to rank. You may want to run some background processing to prepare certain data in advance, e.g. computing a matrix factorization. For this you may provide an optional “worker” process that runs inside the sandbox and can access the database.
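For example, a worker might periodically rebuild per-user profiles from the engagement history stored in the database. The key layout below is the same invented one used in the earlier database sketch; a real worker might run something heavier, such as matrix factorization or a model update, in place of the simple aggregation shown here.

```python
import json
import os
import time

import redis

db = redis.Redis.from_url(os.environ.get("RANKER_DB_URL", "redis://localhost:6379/0"))


def refresh_user_profiles() -> None:
    """Rebuild a simple profile for every user with stored engagement history.

    The key layout ("engaged:<user>", "profile:<user>") matches the invented
    layout in the database sketch above.
    """
    for key in db.scan_iter("engaged:*"):
        user_id = key.decode().split(":", 1)[1]
        history = [p.decode() for p in db.lrange(key, -500, -1)]
        profile = {"n_engagements": len(history), "recent": history[-10:]}
        db.set(f"profile:{user_id}", json.dumps(profile))


if __name__ == "__main__":
    while True:
        refresh_user_profiles()
        time.sleep(3600)  # refresh hourly
```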
Performance requirements are also very tight, as you must complete ranking all posts within 500ms, 95% of the time. Many ranking algorithms are parallelizable at the post level, so we will provide a simple API for distributing posts over a set of “scorer” processes. The “scorer” is another REST endpoint that your container may optionally support, which takes a set of posts and returns a JSON result for each one. All such post-level results are then returned to the calling ranker, where the final ordering occurs.
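Since the scorer API is also not yet documented, the following is only a sketch of the general shape described above: a batch of posts in, one JSON result per post out, written in the same FastAPI style as the ranker sketch. The stand-in “classifier” just scores posts by text length.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


# Placeholder schemas again: the real scorer API has not been published.
class ScoreRequest(BaseModel):
    posts: list[dict]


class ScoreResponse(BaseModel):
    scores: list[dict]  # one JSON result per input post


@app.post("/score", response_model=ScoreResponse)
def score(request: ScoreRequest) -> ScoreResponse:
    # Stand-in "classifier": score each post by its text length. A real scorer
    # would run a model (toxicity, bridging-ness, relevance, ...) on each post.
    return ScoreResponse(
        scores=[
            {"id": p.get("id"), "score": float(len(p.get("text", "")))}
            for p in request.posts
        ]
    )
```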
The typical case would be to use a classifier within the “scorer” to produce a numeric score for each post individually, then sort by this score within the “ranker” (and probably do some diversity re-ranking there too as most real rankers do). This is a kind of map-reduce architecture.
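On the ranker side, the reduce step might then look something like this, assuming the per-post results arrive as a mapping from post ID to score (the delivery format is not yet specified) and that posts carry “id” and “author” fields. The greedy diversity pass is purely illustrative.

```python
def rank_posts(posts: list[dict], scores: dict[str, float]) -> list[str]:
    """Reduce step: sort candidates by scorer output, then greedily re-rank so
    that consecutive posts come from different authors where possible.

    The post fields ("id", "author") and the scores mapping are illustrative;
    the real payload format has not been published.
    """
    remaining = sorted(posts, key=lambda p: scores.get(p["id"], 0.0), reverse=True)
    ordered: list[dict] = []
    while remaining:
        # Prefer the highest-scored post whose author differs from the last pick.
        pick = next(
            (p for p in remaining
             if not ordered or p.get("author") != ordered[-1].get("author")),
            remaining[0],  # fall back if only same-author posts remain
        )
        remaining.remove(pick)
        ordered.append(pick)
    return [p["id"] for p in ordered]
```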
This is the full configuration of processes that a production ranker may include. You may supply a process within your container for each of the blue boxes above, although only the “ranker” is required. All of these processes have access to the same shared database, though as described above the scraper can only write to it.
Again, this only applies to second-round submissions. For first-round submissions, which you host yourself, you can use any technology you like as long as you can put the ranking endpoint on the public internet. Also, first-round submissions do not have to meet the 500ms latency requirement, so parallelization is probably not necessary. However, it must be possible to implement your ranker with the architecture above, or your entry cannot win.