Optimizing our weekly digest mailer

Every Sunday we send an email with the jobs published during the last week to every professional who has subscribed to the newsletter. This email is highly personalized: it only shows jobs that match each professional's preferences, such as categories, seniority, modality, technologies, expected salary, and so on.

For a couple of years, this task was well handled by the worker server, but in the last year or so the number of professionals subscribed to the newsletter has grown considerably. Right now we're sending about 650k emails every week, and sending that many personalized emails naturally takes a few hours to process. A couple of months ago we started to notice that during those hours the site was getting many timeout errors, so I jumped in to try to optimize the task.

(Note: the task that processes this mailer does a bunch of stuff and is quite complex, but for the sake of readability I'll simplify it and focus only on the behavior that concerned the refactor itself.)

Originally, this task looped over each of the 650k professionals and retrieved, in a single database query, all the jobs published last week that matched the professional's preferences. That meant at least 1.3 million queries to the database during the lifetime of the process (one to load the professional object used in the email rendering, and one to get the professional's matching jobs: 650k × 2 = 1.3 million). Each of these queries was quite heavy because of the number of JOINs it had. It took around FOUR AND A HALF hours to process the mailer, and during those hours the site was quite buggy.
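
To make the shape of the problem concrete, here is a minimal sketch of that loop in Python (our actual code differs; the helper names, fields, and job data below are hypothetical stand-ins for the real ORM calls):

```python
from dataclasses import dataclass, field

@dataclass
class Professional:
    id: int
    categories: set[str] = field(default_factory=set)
    seniority: str = ""
    min_salary: int = 0

def load_professional(pid: int) -> Professional:
    # Query #1: SELECT ... FROM professionals WHERE id = %s
    return Professional(id=pid, categories={"backend"}, seniority="senior")

def find_matching_jobs(p: Professional) -> list[dict]:
    # Query #2: one heavy, JOIN-filled SELECT per professional
    return [{"title": "Backend Developer", "category": "backend"}]

# The original shape: ~650k iterations x 2 queries = ~1.3M queries total.
for pid in range(3):  # imagine ~650k subscriber ids here
    professional = load_professional(pid)
    jobs = find_matching_jobs(professional)
    if jobs:
        print(f"sending digest to professional {pid} with {len(jobs)} jobs")
```

The key issue is the loop body: two database round trips per professional, multiplied by roughly 650k iterations.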

My first approach was to change the jobs matcher engine to query against ElasticSearch (ES from now on) instead of Postgres. The hypothesis was that ES would free up Postgres's connection slots, so the site traffic could be better served. To do this, we indexed all our published jobs on ES and made all the information required for the matching available right in the schema, so we wouldn't need to make additional requests to Postgres during the task execution. After trying this ES engine in production, it took way longer to process than the Postgres engine, and eventually the execution just died. Almost 90% of the emails were sent, but it took more than TEN HOURS to process (actually, I don't have the exact number because the execution never finished, but the last time I checked it was around 10 hours and counting). It turns out that ES runs on the JVM and serves queries from its heap, and hammering it with hundreds of thousands of individual queries from a script is a bad idea: each request ends up performing much slower than against a regular database. ES is great at answering a moderate number of complex queries fast, but not at absorbing a ton of queries fired at computer speed.
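
For illustration, this is roughly the shape of the per-professional ES query, sketched with the elasticsearch-py client (the index name, field names, and filter structure are assumptions, not our real schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def matching_jobs_from_es(preferences: dict) -> list[dict]:
    """One ES search per professional: the pattern that turned out to be
    far slower than Postgres at this volume."""
    resp = es.search(
        index="jobs",  # hypothetical index of published jobs
        query={
            "bool": {
                "filter": [
                    {"terms": {"category": preferences["categories"]}},
                    {"term": {"seniority": preferences["seniority"]}},
                    {"range": {"max_salary": {"gte": preferences["min_salary"]}}},
                ]
            }
        },
        size=50,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```

Even if each individual search is fast in isolation, issuing ~650k of them sequentially from one script adds enormous overhead.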

Obviously, this approach wasn't good either, so we had to think of another one. Having already indexed all the data required for job matching on ES, we came up with a different solution. We realized that when you have a very heavy task running asynchronously, there is already a lot of memory allocated for the process, so instead of hitting a database engine for every matching, we could just do it in memory. Now, at the start of the task, we fetch all the jobs published last week from ES and cache that result. Then, for each professional, we match their preferences against that raw ES data in memory, so no extra query is required. This cut the number of queries made by the process in half, so the impact on the database was reduced by 50% too. After trying this approach in production, the results were much better: the site is no longer that buggy during the execution of the process, and the task's execution time dropped from four and a half hours to just TWO AND A HALF hours. That's about 44% faster!
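
Here is a minimal sketch of the new approach, again with hypothetical index and field names: one bulk fetch up front, then pure in-memory filtering per professional:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Fetch last week's published jobs ONCE and keep them in memory.
# `scan` pages through every hit; index and field names are hypothetical.
ALL_JOBS = [
    hit["_source"]
    for hit in scan(
        es,
        index="jobs",
        query={"query": {"range": {"published_at": {"gte": "now-7d/d"}}}},
    )
]

def matching_jobs_in_memory(preferences: dict) -> list[dict]:
    """Pure in-memory filter: no query per professional."""
    return [
        job for job in ALL_JOBS
        if job["category"] in preferences["categories"]
        and job["seniority"] == preferences["seniority"]
        and job["max_salary"] >= preferences["min_salary"]
    ]
```

The trade-off is holding a week's worth of jobs in the worker's memory, but that's cheap compared to ~650k round trips to a database engine.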

In summary, this is a good example of why programming for a worker process should be approached differently than programming for a web server. We often assume that having the database do the heavy lifting and return pre-digested data is better for performance, and it usually is when processing a web request, but when it comes to a heavy asynchronous task, it can be better to load the raw data once and process it in memory.
