Bug 7844 - U+1D5B5 MATHEMATICAL SANS-SERIF...
Summary: U+1D5B5 MATHEMATICAL SANS-SERIF...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
Importance: P2 trivial
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-30 03:49 UTC by jidanni
Modified: 2020-08-12 19:54 UTC

Attachments: Crafted example of the issue (text/plain), submitted by Bill Cole [HasCLA]

Description jidanni 2020-07-30 03:49:49 UTC
It seems this slips through spamassassin:

From: "𝖡𝗅𝗎𝗈𝗑𝗒𝗇 𝖤𝖣" <newsletter@express.be>
Subject: 𝖭𝖾𝗏𝖾𝗋 𝖭𝖾𝖾𝖽 𝖵𝗂𝖺𝗀𝗋𝖺 𝖠𝗀𝖺𝗂𝗇

$ unicode 𝖵𝗂𝖺𝗀𝗋𝖺 | grep ^U+
U+1D5B5 MATHEMATICAL SANS-SERIF CAPITAL V
U+1D5C2 MATHEMATICAL SANS-SERIF SMALL I
U+1D5BA MATHEMATICAL SANS-SERIF SMALL A
U+1D5C0 MATHEMATICAL SANS-SERIF SMALL G
U+1D5CB MATHEMATICAL SANS-SERIF SMALL R
U+1D5BA MATHEMATICAL SANS-SERIF SMALL A
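
The listing above can be reproduced without the `unicode` tool. The following is a minimal Python sketch, not anything SpamAssassin itself runs: the helper names `describe` and `fold` are illustrative, and it assumes that NFKC compatibility normalization is an acceptable way to fold these glyphs back to plain letters.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs, similar to `unicode ... | grep ^U+`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def fold(text):
    """Fold compatibility characters (including math alphanumerics) to base letters."""
    return unicodedata.normalize("NFKC", text)

# The six code points listed in the report, written as escapes to avoid copy/paste damage.
word = "\U0001D5B5\U0001D5C2\U0001D5BA\U0001D5C0\U0001D5CB\U0001D5BA"

for code, name in describe(word):
    print(code, name)
print(fold(word))
```

A filter that matches on the folded text would see the plain ASCII word, whatever mathematical style the sender picked.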
Comment 1 Kevin A. McGrail 2020-08-02 11:43:19 UTC
Do you have a spample of this you can put on pastebin?
Comment 2 jidanni 2020-08-02 18:54:07 UTC
Just copy and paste the raw Subject above into a new test mail.
SA Bugzilla did not mangle the Unicode. The raw characters you need are right there embedded into this bug report web page.
Comment 3 John Hardin 2020-08-02 22:15:07 UTC
Has anybody else started working on this already? If not, I'll get started.
Comment 4 Kevin A. McGrail 2020-08-02 23:39:35 UTC
John, I was looking for a spample and was going to run it through my tests to see whether the replace_tags in KAM.cf and the replace_tags in stock hit. They might just need a few small adjustments. However, I don't recommend working on a snippet alone. We need to see what the entire message scored as and whether this would move the needle. Without a real-world spample, I think this should be paused.
Comment 5 jidanni 2020-08-03 00:03:38 UTC
Try
$ echo test | mail -s '𝖭𝖾𝗏𝖾𝗋 𝖭𝖾𝖾𝖽 𝖵𝗂𝖺𝗀𝗋𝖺 𝖠𝗀𝖺𝗂𝗇' $USER
That is all you need.
I am certain.
Comment 6 Kevin A. McGrail 2020-08-03 00:14:12 UTC
(In reply to jidanni from comment #2)
> Just copy and paste the raw Subject above into a new test mail.
> SA Bugzilla did not mangle the Unicode. The raw characters you need are
> right there embedded into this bug report web page.

The exact formatting of the subject is important and synthesizing it is not good.

For example, when I process just what you post, it looks like the V is unicode uD835uDDB5 but you posted u1D5B5 so I can't get a hit with replacetags that works.

The subject will have to be encoded to work because I think subjects have to be in ascii.  What's the exact subject code from a source view of the email?  This is why spamples are important.
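
The two notations in this comment name the same character: uD835uDDB5 is the UTF-16 surrogate pair for U+1D5B5, while a byte-oriented rule has to match its UTF-8 form. A minimal Python sketch of the arithmetic (the function name `to_surrogates` is illustrative):

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogates(0x1D5B5)
print(f"UTF-16: {high:04X} {low:04X}")
print("UTF-8: ", "\U0001D5B5".encode("utf-8").hex())
```

Which form a rule must match depends on how the text reaches it, so a pattern written against one form will not hit the other.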
Comment 7 jidanni 2020-08-03 00:23:35 UTC
$ echo ... | mutt -s ... made
Subject: =?utf-8?B?8J2WrfCdlr7wnZeP8J2WvvCdl4sg?=
 =?utf-8?B?8J2WrfCdlr7wnZa+8J2WvSDwnZa18J2XgvCdlrrwnZeA8J2Xi/Cdlrog?=
 =?utf-8?B?8J2WoPCdl4DwnZa68J2XgvCdl4c=?=
Comment 8 jidanni 2020-08-03 00:27:37 UTC
$ echo 𝖵𝗂𝖺𝗀𝗋𝖺|base64
8J2WtfCdl4LwnZa68J2XgPCdl4vwnZa6Cg==
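
Both encoded forms posted here can be checked with Python's standard library. This is a sketch for inspection only: it decodes the RFC 2047 Subject from comment 7 and the base64 string above, then folds the result with NFKC normalization, which is an assumption about how one might compare it to plain text.

```python
import base64
import unicodedata
from email.header import decode_header, make_header

# Folded Subject header value exactly as posted in comment 7.
raw_subject = (
    "=?utf-8?B?8J2WrfCdlr7wnZeP8J2WvvCdl4sg?=\n"
    " =?utf-8?B?8J2WrfCdlr7wnZa+8J2WvSDwnZa18J2XgvCdlrrwnZeA8J2Xi/Cdlrog?=\n"
    " =?utf-8?B?8J2WoPCdl4DwnZa68J2XgvCdl4c=?="
)
subject = str(make_header(decode_header(raw_subject)))
print(unicodedata.normalize("NFKC", subject))

# Output of `echo ... | base64`; echo appends a newline, hence the trailing "Cg==".
word = base64.b64decode("8J2WtfCdl4LwnZa68J2XgPCdl4vwnZa6Cg==").decode("utf-8")
print(unicodedata.normalize("NFKC", word.strip()))
```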
Comment 9 Kevin A. McGrail 2020-08-03 00:33:33 UTC
Why can't you provide a real-world spample?
Comment 10 jidanni 2020-08-03 01:03:01 UTC
The original message is "too dangerous to ever let anyone see" OK?

Just make sure spamassassin can catch the word Viagra in a subject, no matter what charset. Thanks.
Comment 11 Kevin A. McGrail 2020-08-03 02:12:43 UTC
Without a spample, this is going nowhere. At a minimum, I recommend you provide unadulterated Subject and From headers.

John, I recommend closing as worksforme otherwise.  You decide.
Comment 12 Kevin A. McGrail 2020-08-03 02:14:19 UTC
If the danger is you feel it's a security issue, reclassify the bug to security.  That makes the information non-public.
Comment 13 Bill Cole 2020-08-03 02:46:40 UTC
Created attachment 5713
Crafted example of the issue

I have constructed an example of the issue.
Comment 14 John Hardin 2020-08-03 02:50:00 UTC
I'll see what the spample does.
Comment 15 Kevin A. McGrail 2020-08-03 03:25:51 UTC
(In reply to John Hardin from comment #14)
> I'll see what the spample does.

Assuming that spample is indicative of a real world spam, it requires two additions to the replace tag rules in KAM.cf.  Look at __KAM_VIAGRA2 and the replace tag for G1 and R1.
Comment 16 jidanni 2020-08-04 17:00:35 UTC
(The original message was full of
https://en.wikipedia.org/wiki/Personal_data
https://en.wikipedia.org/wiki/Web_beacon
Even asking the user to retrieve it via his
Goofy Inc. Pro Mail Browser etc.
to send it to you guys would have triggered them.
So you will have to just do with the Subject,
(which by the way was just raw, not base64, QP, etc.
it turns out.)
Anyway the point is: just catch the V word in the subject.
Thanks.)
Comment 17 Bill Cole 2020-08-04 17:48:09 UTC
(In reply to jidanni from comment #16)
> (The original message was full of
> https://en.wikipedia.org/wiki/Personal_data
> https://en.wikipedia.org/wiki/Web_beacon
> Even asking the user to retrieve it via his
> Goofy Inc. Pro Mail Browser etc.
> to send it to you guys would have triggered them.
> So you will have to just do with the Subject,

Which the attached pseudo-message uses. 

> (which by the way was just raw, not base64, QP, etc.
> it turns out.)

Really? That's not what you said earlier, when you provided a base64-encoded version of the header. It's hard to know what to believe, absent an actual message. 

Real raw non-ASCII characters in headers are non-compliant with relevant RFCs and may actually cause concrete problems with some software, so they are rare.
Comment 18 Kevin A. McGrail 2020-08-04 19:43:56 UTC
Without a spample, this is a waste of too many people's time and energy.
Comment 19 RW 2020-08-04 21:14:02 UTC
I think this is a bit harsh. He may have stated it badly, but we know what the problem is. It's not as if it's new: the use of mathematical sans-serif for obfuscation was discussed in the user list thread "base64 encoded sextorsion".

I suspect that the use of any of the mathematical typesetting characters in a Subject header is worth scoring in its own right.

As regards encoding I would hope it makes no difference to header rules (without :raw). Non-encoded 8-bit header text should be left as it is.
Comment 20 John Hardin 2020-08-05 00:56:52 UTC
(In reply to jidanni from comment #16)
> (The original message was full of
> https://en.wikipedia.org/wiki/Personal_data
> https://en.wikipedia.org/wiki/Web_beacon
> Even asking the user to retrieve it via his
> Goofy Inc. Pro Mail Browser etc.

Sanitizing the PII would probably not corrupt the analysis of the message, but to avoid even that risk you could provide the spample privately to *one* of us as a gzipped email attachment. I have long experience dealing with confidential information, I'm sure Kevin does too.

Are you willing to send an intact copy to me privately so that I can test changes against it?

Absent that, I can ensure all of the mathematical glyphs are in the replace list and *hope* that works.

> to send it to you guys would have triggered them.

SA doesn't follow links when it scans and I doubt any of us will view a spample in an HTML-enabled MUA.

> So you will have to just do with the Subject,
> (which by the way was just raw, not base64, QP, etc.
> it turns out.)

Which means that the spample Bill ginned up for the bug is not an accurate representation of your real-life spam.


> Anyway the point is: just catch the V word in the subject.
> Thanks.)
Comment 21 John Hardin 2020-08-05 02:20:01 UTC
Committed revision 1880592.

This should fix "viagra". I'm working on adding the remaining letters for full coverage.
Comment 22 jidanni 2020-08-06 05:48:59 UTC
(Well, in general, telling Grandma/Grandpa "That's terrible that you got a spam. Send me a copy so I can tell the SpamAssassin team." will end up in them certainly "opening the spam message" with all the dangers involved. Indeed, opening it several times, before figuring out how to forward it.)
Comment 23 RW 2020-08-06 15:03:10 UTC
The changes to the tags brings in

 0.8 FUZZY_VPILL

I think this is more of a hammer shaped problem. If anyone is interested, I'm going with this:

header  SUBJ_UCMATH             Subject  =~ /\xf0\x9d[\x90-\x9f][\x80-\xbf]/

meta    SUBJ_UNENC_UCMATH       SUBJ_UCMATH && __SUBJECT_NEEDS_MIME


I doubt these characters are particularly common in email, and they are probably much less common in subjects. Even if someone pastes them into a subject field, they will normally be MIME encoded.

I've received a few with this kind of subject obfuscation through gmail. Like the OP's they had no encoding (technically RFC compliant at gmail) or broken encoding (see bug 6352 for an example).
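
The byte pattern in the header rule above covers the UTF-8 encodings of U+1D400 through U+1D7FF, which is the Mathematical Alphanumeric Symbols block. A minimal Python sketch of the same test follows. The function names are illustrative, `has_raw_8bit` is only an assumed stand-in for `__SUBJECT_NEEDS_MIME` (whose real definition is not shown in this thread), and both checks are applied here to the raw header bytes only.

```python
import re

# Same byte pattern as the header rule: lead bytes F0 9D, then 90..9F, then 80..BF.
UCMATH = re.compile(rb"\xf0\x9d[\x90-\x9f][\x80-\xbf]")

def subj_ucmath(raw_subject: bytes) -> bool:
    """True if the bytes contain a UTF-8 encoded math alphanumeric symbol."""
    return UCMATH.search(raw_subject) is not None

def has_raw_8bit(raw_subject: bytes) -> bool:
    """Assumed approximation of __SUBJECT_NEEDS_MIME: any byte >= 0x80."""
    return any(b >= 0x80 for b in raw_subject)

def subj_unenc_ucmath(raw_subject: bytes) -> bool:
    """Both conditions together, mirroring the meta rule."""
    return subj_ucmath(raw_subject) and has_raw_8bit(raw_subject)

print(subj_unenc_ucmath(b"\xf0\x9d\x96\xb5iagra"))           # raw UTF-8 math letter
print(subj_unenc_ucmath(b"=?utf-8?B?8J2WtfCdl4LwnZa6?="))    # MIME-encoded, pure ASCII bytes
print(subj_unenc_ucmath("caf\u00e9".encode("utf-8")))        # 8-bit, but not a math symbol
```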
Comment 24 jidanni 2020-08-06 17:29:04 UTC
Yeah, except for some math professors, nobody should be using MATHEMATICAL SANS-SERIF CAPITALs etc. so nab 'em!
Comment 25 John Hardin 2020-08-06 18:04:14 UTC
(In reply to jidanni from comment #22)
> (Well in general telling Grandma/Grandpa "That's terrible that you got a
> spam. Send me a copy so I can tell the SpamAssassin team." will end up in them
> certainly "opening the spam message" with all the dangers involved. Indeed,
> opening it several times, before figuring out how to forward it.)

Ah, ok. I was assuming this was something you had administrative access to.
Comment 26 John Hardin 2020-08-12 19:54:16 UTC
Committed revision 1880815

This adds the remaining letters from the mathematical glyphs ranges.
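
For anyone who wants to enumerate the glyphs involved, here is a minimal Python sketch. It is not the content of the committed revisions: it assumes NFKC normalization identifies which code points in U+1D400..U+1D7FF stand for an ASCII letter, and the helper names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def math_variants():
    """Map each ASCII letter to the math-alphanumeric characters that fold to it."""
    table = defaultdict(list)
    for cp in range(0x1D400, 0x1D800):
        ch = chr(cp)
        folded = unicodedata.normalize("NFKC", ch)
        # Keep only characters whose fold is a single ASCII letter
        # (this skips digits, Greek letters and unassigned code points).
        if len(folded) == 1 and folded.isascii() and folded.isalpha():
            table[folded].append(ch)
    return table

def byte_alternation(letter, table):
    """Build a regex fragment matching the letter or any UTF-8 encoded variant."""
    variants = [
        "".join(f"\\x{b:02x}" for b in ch.encode("utf-8")) for ch in table[letter]
    ]
    return "(?:" + "|".join([letter] + variants) + ")"

table = math_variants()
fragment = byte_alternation("V", table)
print(len(table["V"]), "variants of V")
print(re.compile(fragment.encode("ascii")).fullmatch(b"\xf0\x9d\x96\xb5") is not None)
```

A fragment like this could be pasted into a byte-oriented pattern for one letter; whether that matches how the replace list is organized is not something this thread shows.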