A deep dive into Rspamd

Sep 30, 2024

Recently, a large university told us that a customer's emails would be delayed (but not blocked) by its automated systems because of greylisting. The emails were temporarily rejected to check that they were legitimate, but they would still be delivered after a retry.

We knew our customer met our own sending requirements, which exist in part to protect the reputation of our platform, so we took this as an opportunity to optimize our system for legacy firewalls.

The resulting effort involved digging into spam filtering source code, our own MJML generation (the email markup language we use), and the MIME encoding package we relied on.

The university’s email infrastructure uses an open-source spam filtering system called Rspamd, which was built as an alternative to another open-source system called SpamAssassin and can reuse its rules. Rspamd sorts emails into three levels of spamminess, based on a score derived from a set of code-driven rules.

  1. Greylisted (Score >= 4): Our customer's email was classified at the lowest warning level, 'greylisted'. This means the email is temporarily rejected to verify its legitimacy by waiting for an automatic resending attempt. While this ensures the sender is genuine, it also causes delays in delivery.

  2. Add Header (Score >= 6): The second level adds a header to the email, marking it as probable spam. This allows email clients like Gmail or Outlook to automatically move it to the spam folder, reducing the likelihood of the recipient seeing it.

  3. Reject (Score >= 15): The third and most severe level is rejecting the email entirely, preventing it from reaching the recipient. This ensures that highly suspicious emails do not enter the recipient's mailbox at all.
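
The three thresholds above can be sketched as a small lookup. This is only an illustration of the scoring levels described in this post, not Rspamd's own code; the function name `classify` and the "no action" label for scores below 4 are our own assumptions.

```python
# Thresholds as listed above, checked from most to least severe.
THRESHOLDS = [
    (15.0, "reject"),
    (6.0, "add header"),
    (4.0, "greylist"),
]


def classify(score: float) -> str:
    """Return the action for a total score (hypothetical helper)."""
    for threshold, action in THRESHOLDS:
        if score >= threshold:
            return action
    # Assumed label for anything below the greylisting threshold.
    return "no action"


print(classify(5.24))  # our customer's starting score
print(classify(1.24))  # the score after our fixes
```

With these thresholds, 5.24 lands in the greylisting band and 1.24 falls below every threshold.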

Our customer's email was greylisted at a score of 5.24. We investigated fixes for every issue in the scoring report and ran our tests through an Rspamd Docker container to validate the results, eventually reducing the score from 5.24 to 1.24. Yay!

So what issues showed up in our customer's scorecard? 

Each entry in an Rspamd score report contains three pieces of information: the problem, the original score, and the weighted score based on a customizable weight per rule. You can read the specific implementation in Lua and C in the Rspamd GitHub repo.

The code is interesting to peruse, but let's break down our total score and look at the individual parts.

The Total Score

The first number of each row is a weighted value that is added to the total score, and the second number is the original unweighted value.

["MANY_INVISIBLE_PARTS",0.100000,1.0,"2"]
["HAS_LIST_UNSUB",-0.010000,-0.010000]
["URI_COUNT_ODD",1.0,1.0,"45"]
["FROM_EXCESS_BASE64",1.500000,1.500000]
["SUBJ_EXCESS_BASE64",1.500000,1.500000]
["ONCE_RECEIVED",0.100000,0.100000]
["R_BAD_CTE_7BIT",1.050000,3.500000,"7bit","utf8"]
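
As a quick sanity check, the weighted values in the rows above add up to the 5.24 we were greylisted at. The snippet below is just that arithmetic; the `REPORT` list is our own transcription of the rows, not an Rspamd data structure.

```python
# (rule name, weighted score, original score), transcribed from the report above.
REPORT = [
    ("MANY_INVISIBLE_PARTS", 0.10, 1.0),
    ("HAS_LIST_UNSUB", -0.01, -0.01),
    ("URI_COUNT_ODD", 1.0, 1.0),
    ("FROM_EXCESS_BASE64", 1.5, 1.5),
    ("SUBJ_EXCESS_BASE64", 1.5, 1.5),
    ("ONCE_RECEIVED", 0.1, 0.1),
    ("R_BAD_CTE_7BIT", 1.05, 3.5),
]

# Only the weighted value (the first number) counts toward the total.
total = round(sum(weighted for _, weighted, _ in REPORT), 2)
print(total)  # 5.24
```
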
Many Invisible Parts: Email content that is visually hidden from the recipient but still present in the email code
["MANY_INVISIBLE_PARTS",0.100000,1.0,"2"]

There are tricks to hide spam content from the human eye or computer vision, like choosing the same text color as the background color (just like early SEO hacks), using a zero font size, zero opacity, CSS transparency, etc.

We don't do any of these, but when you use more sophisticated MJML with inline CSS like we do, you get some false positives, and not all of them can be avoided.

We started off at a pretty low score already so let's move on.

URI Count Odd: A mismatch in the number of URLs between the HTML and plain text versions
["URI_COUNT_ODD",1.0,1.0,"45"]

When you send an email with HTML content, you also need to send a plain text version with the payload. Spam emails can contain invisible images for tracking, which do not show up in the plain text version.

This rule looks for hidden image links by comparing the number of URIs in the HTML version, with image URIs excluded, against the number of URIs in the plain text version. We got this score down to zero by disabling image-to-text conversion in our MJML generation, without affecting our actual content.
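
To make the mismatch concrete, here is a rough sketch of the comparison as we describe it above. It is not Rspamd's implementation (that lives in the Lua rules), and the helper names, the URL regex, and the example.com links are all our own assumptions. It shows how an image URL that leaks into the plain text version throws the two counts out of sync.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects <a href> links and <img src> URLs separately."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Deliberately simple URL pattern, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def link_counts(html: str, text: str) -> tuple:
    """Return (non-image links in the HTML, URLs in the plain text)."""
    collector = LinkCollector()
    collector.feed(html)
    return len(collector.links), len(URL_RE.findall(text))


html = '<a href="https://example.com/a">A</a><img src="https://example.com/pixel.png">'
print(link_counts(html, "A: https://example.com/a"))
print(link_counts(html, "A: https://example.com/a [image: https://example.com/pixel.png]"))
```

In the second call, the plain text carries an image URL that the HTML count excludes, which is the kind of mismatch that disabling image-to-text conversion removed for us.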

Has List Unsub: Has an unsubscribe link
["HAS_LIST_UNSUB",-0.010000,-0.010000]

Our emails already come with automated links to unsubscribe, so we banked some goodwill with a teeny tiny negative score here (negative means less spammy, positive is more spammy).

FROM_EXCESS_BASE64: Excessive Base64 encoding in headers
["FROM_EXCESS_BASE64",1.500000,1.500000] ["SUBJ_EXCESS_BASE64",1.500000,1.500000]

Email was originally based on ASCII, so the MIME standard evolved to support Unicode characters such as emojis and additional languages, as well as binary file attachments, by encoding them in formats that use only ASCII characters. The most efficient but least human-legible of these is the Base64 encoding format.

Spammers use Base64 encoding to hide the contents of headers such as “From”, “Reply-To”, and “Subject”. Our interface at Loops supports editing sender names and subjects with UTF-8 characters, such as emojis.

If you’re sending an email from the plain name “Alice [email protected]”, there’s no need to encode it as =?utf-8?B?QWxpY2U=?= [email protected]. It’s an automatic yellow card when the spam filtering system decodes the header and realizes the encoding was unnecessary.

We use the handy and concise MIMEText npm package, and its latest version automatically encodes everything as base64.

We resolved this with a fork of the latest package version, so we could pass a custom filter function to determine whether a MIME header needs encoding. We proposed a PR with unit tests for review, and imported our forked package in the meantime.

Originally, we only used Base64 encoding when we detected non-ASCII characters in a header, but we realized that even some ASCII characters, such as the colon ":" and angle brackets "<>", needed encoding to pass server checks, hence the custom filter.
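
The idea behind the custom filter can be sketched in a few lines. This is a Python illustration rather than our actual fork of the npm package: the function names are hypothetical, the set of special ASCII characters contains only the ones mentioned above, and the sketch ignores the length limit that the MIME standard places on encoded words.

```python
import base64

# ASCII characters that we found still needed encoding (from the text above).
SPECIAL_ASCII = set(":<>")


def needs_encoding(value: str) -> bool:
    """True if the header value has non-ASCII or special ASCII characters."""
    return (not value.isascii()) or any(ch in SPECIAL_ASCII for ch in value)


def encode_if_needed(value: str) -> str:
    """Leave plain values alone; wrap everything else as a Base64 encoded word."""
    if not needs_encoding(value):
        return value
    payload = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"=?utf-8?B?{payload}?="


print(encode_if_needed("Alice"))          # stays plain
print(encode_if_needed("Rocket Man 🚀"))  # gets encoded
```

A plain name like "Alice" passes through untouched, so a filter like FROM_EXCESS_BASE64 has nothing to complain about, while names and subjects with emojis are still encoded.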

ONCE_RECEIVED: Email only passed through one server
["ONCE_RECEIVED",0.100000,0.100000]

A legitimate email goes through many servers: the outgoing server, the incoming server, internal network servers, etc. Each server in this relay adds a “Received” header to the email payload.

An email with only one receipt might be coming from a suspicious origin. Since our infrastructure doesn’t go through many layers, we’ll take the hit on this minor score unless we decide to add some layers later.
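
Counting the hops yourself is easy if you have the raw message. The sketch below uses Python's standard `email` module; the sample message and the hostnames in it are made up for illustration, and the helper name is our own.

```python
from email import message_from_string


def count_received(raw_message: str) -> int:
    """Count the "Received" headers, one per server that relayed the email."""
    message = message_from_string(raw_message)
    return len(message.get_all("Received", []))


# Hypothetical raw message that passed through a single server.
SAMPLE = (
    "Received: from mail.example.com by mx.example.org; Mon, 30 Sep 2024 10:00:00 +0000\n"
    "From: sender@example.com\n"
    "To: recipient@example.org\n"
    "Subject: Hello\n"
    "\n"
    "Body\n"
)

print(count_received(SAMPLE))  # 1
```
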

R_BAD_CTE_7BIT: Outdated encoding method in headers
["R_BAD_CTE_7BIT",1.050000,3.500000,"7bit","utf8"]

The previous version of MIME encoding we used sometimes applied an encoding format called “7bit”, which isn’t compatible with Unicode characters. When we upgraded our MIME encoding package to use only base64 for headers, this went away.

We only had to comply with a few rules to reduce our spamminess score, but reading the source code for all the other Rspamd rules is like an archeological dig through decades of spam wars. The regular expressions alone seem to cover everything that any spammer has ever tried in the history of the internet, including the kitchen sink (https://github.com/rspamd/rspamd/blob/master/rules/regexp/headers.lua). You’ll see mentions of ancient email clients like “The Bat” and long-gone ISPs like “sympatico”.

There are also some curious rules, such as penalizing an email if the subject contains or ends with a “?” question mark or “!” exclamation mark. There’s also a rule penalizing subject lines that contain a currency symbol such as $, €, or ¥. This might not be obvious, but our rigorous technical analysis suggests you might get penalized for subject lines like, “Help, I am Prince! 💸 Can I transfer ¥30,000 to your account?”. So don’t do that maybe.

Gmail lets you see the raw MIME headers of an email by choosing “Show original” from an option menu. Sending an email from:

"Rocket Man <🚀 [email protected]>

should automatically encode the “From” header as:

When we looked at the raw headers in Gmail, it did not appear to be encoded:

"Rocket Man <🚀 [email protected]>

So why were we being penalized?

It turns out that Gmail decodes Base64-encoded headers to make the raw output human-readable, so we verified our conditional encoding on other email platforms like Proton Mail and iCloud Mail. Since manual verification played tricks on us, we decided to pursue more automated verification.

The best way to figure out how our email would be scored by Rspamd was to run Rspamd ourselves, so we added an Rspamd Docker container to our development environments. We automated an internal task to generate raw MIME text from a new set of sample emails that covered different combinations of ASCII and Unicode characters in various headers.

Storing these samples as part of our Git repo lets us diff the rendered emails as we add features.
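
A sample generator along these lines might look like the sketch below. It uses Python's standard `email` package as a stand-in for our own MIME pipeline, and the function names, addresses, and sample values are all hypothetical.

```python
from email import message_from_bytes, policy
from email.headerregistry import Address
from email.message import EmailMessage
from itertools import product


def build_sample(display_name: str, subject: str) -> bytes:
    """Render one sample email (plain text plus HTML) as raw MIME bytes."""
    message = EmailMessage()
    message["From"] = Address(display_name, "sender", "example.com")
    message["To"] = "recipient@example.org"
    message["Subject"] = subject
    message.set_content("Plain text body")
    message.add_alternative("<p>HTML body</p>", subtype="html")
    return message.as_bytes()


def sample_matrix() -> dict:
    """Cover ASCII and Unicode combinations in the From name and the Subject."""
    names = ["Alice", "Rocket Man 🚀"]
    subjects = ["Welcome aboard", "Launch day 🚀"]
    return {(n, s): build_sample(n, s) for n, s in product(names, subjects)}


for (name, subject), raw in sample_matrix().items():
    print(name, "|", subject, "|", len(raw), "bytes")
```

Writing each rendered sample to its own file in the repo is what makes the diffs readable when a header or template changes.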

Then, we wrote tests to pass these MIME text payloads to our Rspamd Docker container, and captured the resulting scores as part of our Git repo. This historical record lets us do regression comparisons as our platform continues to evolve.
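
Here is a minimal sketch of that loop, assuming the container exposes Rspamd's HTTP scan endpoint at `http://localhost:11333/checkv2` and replies with JSON containing a `symbols` map. The URL is an assumption about a local setup, the helper names are our own, and this is not our actual test suite.

```python
import json
import urllib.request

# Assumed address of a local Rspamd container; adjust for your setup.
RSPAMD_URL = "http://localhost:11333/checkv2"


def build_scan_request(raw_mime: bytes, url: str = RSPAMD_URL) -> urllib.request.Request:
    """Build the POST request that carries the raw MIME message as its body."""
    return urllib.request.Request(url, data=raw_mime, method="POST")


def scan(raw_mime: bytes, url: str = RSPAMD_URL) -> dict:
    """Send the message to Rspamd and return the parsed JSON reply."""
    with urllib.request.urlopen(build_scan_request(raw_mime, url)) as response:
        return json.loads(response.read())


def symbol_scores(result: dict) -> dict:
    """Reduce a scan result to {rule name: weighted score}."""
    return {name: info.get("score", 0.0) for name, info in result.get("symbols", {}).items()}


def regressions(baseline: dict, current: dict, tolerance: float = 0.001) -> dict:
    """Return rules whose score got worse, as {rule: (before, after)}."""
    worse = {}
    for name, score in current.items():
        before = baseline.get(name, 0.0)
        if score - before > tolerance:
            worse[name] = (before, score)
    return worse
```

A test can then load the stored baseline for a sample, call `scan` on the freshly rendered MIME text, and fail if `regressions` returns anything.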

That's another email deliverability milestone untangled with this deep dive, but there's lots more to do. It's a cool space to be in: you can find an .html URL from twenty years ago and it will still have relevant, topical information, since in most cases email doesn't actually change.

Thanks for reading.
