Bug 7519 - BODY_SINGLE_WORD triggers on base64 encoded text with more than one word.
Status: RESOLVED DUPLICATE of bug 7219
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.4.0
Hardware: Other Linux
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2017-12-10 19:27 UTC by Mark London
Modified: 2017-12-11 21:40 UTC



Attachment: Email shows problem. (type: text/plain, status: None, submitter/CLA status: Mark London [NoCLA])

Description Mark London 2017-12-10 19:27:46 UTC
Created attachment 5493
Email shows problem.

See attachment. There is a paragraph of text in the following MIME attachment, but it is triggering the "one word text message" rule.

Content-Type: text/plain;Name="text_0.txt";Charset="utf-8"
Content-Disposition: Attachment;Filename="text_0.txt";Charset="utf-8"
Content-Location: text_0.txt
Content-Transfer-Encoding: base64
Comment 1 Bill Cole 2017-12-11 01:05:00 UTC
That message is badly malformed. The Content-Type header is invalid (missing spaces), there is no MIME-Version header, the Message-ID header is invalid (missing angle brackets), and some of the putative MIME parts are improperly encoded into lines an order of magnitude longer than MIME allows.

As a result, there is no formally correct way to parse this message. That any software can make any sense of it is a tribute to how lenient mail software is. It is unclear to me why it is hitting BODY_SINGLE_WORD, but it is also hitting HTML_IMAGE_ONLY_20 and BODY_URI_ONLY incorrectly, and I expect that all of these are due to SA being confused by the compound pathology of the message. Note that the rules it correctly hits (BASE64_LENGTH_79_INF, BAYES_50, MIME_HEADER_CTYPE_ONLY, MISSING_SUBJECT, and INVALID_MSGID) add up to 5.3, so even if we figured out precisely how the 3 bogus hits happened and fixed that, SA would (by default) still call it spam.

The "garbage in, garbage out" principle applies here. It is not a bug for SpamAssassin to misparse a message that technically has no correct parsing.
Comment 2 RW 2017-12-11 13:48:03 UTC
Actually, it is a bug that I pointed out some time ago; I don't recall the bug number.

The problem is in 

body   __BODY_TEXT_LINE     /^\s*\S/
tflags __BODY_TEXT_LINE     multiple maxhits=3

The count usually includes the Subject line, but only if the header is present and contains a non-space character.

In the attached email the multi-word paragraph is counted as if it were the subject. 
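
The off-by-one described above can be modelled with a small Python sketch. This is not SpamAssassin code: the function name `count_text_lines` and the way the Subject is prepended are assumptions made only to illustrate the counting, and the `maxhits` cap mirrors the rule shown above.

```python
import re

# Same pattern as __BODY_TEXT_LINE: a paragraph with at least one non-space character.
TEXT_LINE = re.compile(r"^\s*\S")

def count_text_lines(subject, body_paragraphs, maxhits=3):
    """Count non-blank paragraphs, capped at maxhits.

    Assumption (hypothetical model): the Subject is treated as the first
    paragraph only when the header is present and contains a non-space
    character.
    """
    paragraphs = list(body_paragraphs)
    if subject is not None and re.search(r"\S", subject):
        paragraphs.insert(0, subject)
    hits = sum(1 for p in paragraphs if TEXT_LINE.search(p))
    return min(hits, maxhits)

body = ["one paragraph with several words"]
with_subject = count_text_lines("Hello", body)    # Subject + paragraph
without_subject = count_text_lines(None, body)    # paragraph only
```

In this model the same one-paragraph body yields a count of 2 when a Subject is present and 1 when it is missing, so any meta rule that assumes the first hit is always the Subject is off by one for messages without a usable Subject.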

IMO it should be 

body   __BODY_TEXT_LINE_FULL    /^\s*\S/
tflags __BODY_TEXT_LINE_FULL    multiple maxhits=3

header __SUBJECT_HAS_NON_SPACE  Subject =~ /\S/

meta   __BODY_TEXT_LINE         __BODY_TEXT_LINE_FULL - __SUBJECT_HAS_NON_SPACE


The arithmetic for __BODY_SINGLE_WORD, __BODY_URI_ONLY, and __EMPTY_BODY then needs to be adjusted for __BODY_TEXT_LINE being one smaller.
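
A rough Python model of the proposed meta rule, for illustration only. The function names below are hypothetical stand-ins for the rules of the same name, and the assumption that the Subject is counted as a leading paragraph comes from the description in this comment, not from the SpamAssassin source.

```python
import re

def body_text_line_full(subject, body_paragraphs, maxhits=3):
    """Model of __BODY_TEXT_LINE_FULL: non-blank paragraphs, capped at maxhits."""
    paragraphs = list(body_paragraphs)
    # Assumption: a present, non-blank Subject is counted as the first paragraph.
    if subject is not None and re.search(r"\S", subject):
        paragraphs.insert(0, subject)
    hits = sum(1 for p in paragraphs if re.search(r"^\s*\S", p))
    return min(hits, maxhits)

def subject_has_non_space(subject):
    """Model of __SUBJECT_HAS_NON_SPACE: 1 if the header exists and is not blank."""
    return 1 if subject is not None and re.search(r"\S", subject) else 0

def body_text_line(subject, body_paragraphs):
    """Model of the proposed meta: full count minus the Subject's contribution."""
    return body_text_line_full(subject, body_paragraphs) - subject_has_non_space(subject)

one_paragraph = ["a paragraph with several words"]
with_subject = body_text_line("Hello", one_paragraph)
without_subject = body_text_line(None, one_paragraph)
```

Under these assumptions a one-paragraph body gives 1 whether or not a Subject is present, and an empty body gives 0 in both cases. One caveat of the model: because of the `maxhits=3` cap, a message with a Subject can reach at most 2 while one without can reach 3, so the adjusted rules would presumably need to compare against values of 2 or less.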
Comment 3 Bill Cole 2017-12-11 21:40:36 UTC
Yes, it's Bug #7219

*** This bug has been marked as a duplicate of bug 7219 ***