SA Bugzilla – Bug 4773
Minor suggestion for spam rule regarding Pharmacies
Last modified: 2006-02-09 16:32:43 UTC
First, thank you for an OUTSTANDING program. Second, I saw on the website that anyone with a suggestion for new rules can submit it here (sorry if this is incorrect).

I run an email server for a 30-account domain, and I see Pharmacy and Pharmaceutical spam with the words jumbled a couple hundred times a day. I have written a few rules that find most of these, and I wish to suggest them to SpamAssassin. Note that these rules only look for common ways of misspelling the words Pharmacy and Pharmaceutical; the correctly spelled words will not be marked as spam.

CONTAINS_JUMBLED_PHARAMACY
\bP([\s\w]?)h([\s\w]?)a([\s\w]?)r([\s\w]?)a([\s\w]?)m([\s\w]?)a([\s\w]?)c([\s\w]?)y\b

CONTAINS_JUMBLED_PHRMACEUTICAL
\bPhrm([\s\w]?)a([\s\w]?)c([\s\w]?)e([\s\w]?)u([\s\w]?)t([\s\w]?)i([\s\w]?)c([\s\w]?)a([\s\w]?)l\b

CONTAINS_JUMBLED_PHARAMACEUTICAL
\bP([\s\w]?)h([\s\w]?)a([\s\w]?)r([\s\w]?)a([\s\w]?)m([\s\w]?)a([\s\w]?)c([\s\w]?)e([\s\w]?)u([\s\w]?)t([\s\w]?)i([\s\w]?)c([\s\w]?)a([\s\w]?)l\b

The following is the set of examples I've used to test against. The above rules catch most of these.

Pharmacy P armacy P harmacy Phad ramacy Phae ramacy Pharrmacy pharamacy Pharamacy P haramacy P harcamacy Pharamga cy Phara maecy Pharam acy Pharamac y Pharama cy Ph armamacy Ph aramacy Pha ramacy Phar amacy Pharfama cy Pghara macy Ptharamacy

Phamaceutical Phharmyaceutical pfhr maceutical Pharaamaceutical Pharamaceuetical phrmaceuteic al Phrmaceutica ul Phrmace uticjal Phrmaceuticma l Phrmaceuticm al Phrmaceuti cyal Phrmacteutic al Phrmac teutical Phrm adceutical Phrmeac eutical Phrmlaceut ical Phadramaceutical Pharmacceutoical

I understand that if these rules (or something similar) were published in the SpamAssassin distribution, the spammers would just use them to come up with new ways of misspelling. Until that time, these rules do help considerably, on my domain at least.
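For anyone who wants to try the three patterns outside SpamAssassin, here is a minimal sketch in Python. It assumes the patterns are compiled exactly as written in the report, with no flags (so matching is case-sensitive), and that Python's `re` treats `\b`, `\s` and `\w` closely enough to Perl for these ASCII samples. The helper name `hits` is made up for this illustration.

```python
import re

# The three patterns from the report, copied verbatim.
# Assumption: no /i flag, so matching is case-sensitive.
RULES = {
    "CONTAINS_JUMBLED_PHARAMACY":
        r"\bP([\s\w]?)h([\s\w]?)a([\s\w]?)r([\s\w]?)a([\s\w]?)m([\s\w]?)a([\s\w]?)c([\s\w]?)y\b",
    "CONTAINS_JUMBLED_PHRMACEUTICAL":
        r"\bPhrm([\s\w]?)a([\s\w]?)c([\s\w]?)e([\s\w]?)u([\s\w]?)t([\s\w]?)i([\s\w]?)c([\s\w]?)a([\s\w]?)l\b",
    "CONTAINS_JUMBLED_PHARAMACEUTICAL":
        r"\bP([\s\w]?)h([\s\w]?)a([\s\w]?)r([\s\w]?)a([\s\w]?)m([\s\w]?)a([\s\w]?)c([\s\w]?)e([\s\w]?)u([\s\w]?)t([\s\w]?)i([\s\w]?)c([\s\w]?)a([\s\w]?)l\b",
}
COMPILED = {name: re.compile(pattern) for name, pattern in RULES.items()}


def hits(text):
    """Return the sorted names of the rules whose pattern is found in text."""
    return sorted(name for name, rx in COMPILED.items() if rx.search(text))


if __name__ == "__main__":
    # A few samples taken from the test list in the report.
    for sample in ["Pharmacy", "Pharamacy", "Pha ramacy",
                   "Phad ramacy", "Phrm adceutical", "Pharamaceutical"]:
        print(f"{sample!r}: {hits(sample)}")
```

Each `([\s\w]?)` group allows at most one extra character between two letters, so a sample such as "Phad ramacy", which has two extra characters in one gap, is missed. Under the case-sensitive assumption the lowercase samples in the list are missed as well.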
Hi, Thank you for your suggestion. You may want to take a look at the FUZZY_* rules in 3.1 which utilize the ReplaceTags plugin to do this kind of thing on a generic level. There's a FUZZY_PHARMACY rule already, but I don't know if we tried pharmaceutical.
(In reply to comment #1)
> Hi,
>
> Thank you for your suggestion. You may want to take a look at the FUZZY_* rules in 3.1 which utilize the
> ReplaceTags plugin to do this kind of thing on a generic level. There's a FUZZY_PHARMACY rule already,
> but I don't know if we tried pharmaceutical.

I'm sorry, I was working from the SA ver. 3.04 rules. I took a look at the new ReplaceTags plugin and it looks like an outstanding addition. Unfortunately, I won't have time to test these new capabilities any time soon. I will have to upgrade to the new version ASAP and see how things go.

As you mentioned, I don't see any rules for 'Pharmaceutical', so it could probably be added to the FUZZY_* rules. I think something like:

body FUZZY_PHARMACEUTICAL /<inter W2><post P2>(?!pharmaceutical)<P><H><A><R><M><A><C><E><U><T><I><C><A><L>/i
describe FUZZY_PHARMACEUTICAL Attempt to obfuscate words in spam
replace_rules FUZZY_PHARMACEUTICAL

added to the 25_replace.cf file would do it.

I admit, I don't completely understand how the ReplaceTags plugin works, but it looks like it is still trying to find all the letters of the word in the correct order. The rules I suggest actually look for common misspellings of the original words:

Pharmacy -> Pharamacy
Pharmaceutical -> Phrmaceutical, Pharamaceutical

So in conclusion, I will do my best to upgrade to the current version of SA and evaluate how the new rules catch these words. If I find that the current version of SA does not do a good job of finding these, I will re-post to this report (or start a new report).
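To make the "all the letters in the correct order" point concrete, here is a rough Python model of tag expansion. It is only a sketch, not the plugin's actual code. The tag, `inter` and `post` definitions below are assumptions invented for the illustration (the real ones live in 25_replace.cf), and the `expand` helper is hypothetical. The model assumes `post` is appended after every tag and `inter` is placed between consecutive tags.

```python
import re

# ASSUMED definitions for illustration only; not the shipped 25_replace.cf values.
TAGS = {ch: ch.lower() for ch in "PHRMCEUTIL"}
TAGS["A"] = "[a@]"            # one look-alike character for the letter a
INTER = {"W2": r"\W{0,2}"}    # up to two non-word characters between tags
POST = {"P2": "{1,2}"}        # each tag may occur once or twice

# Rule body from the comment above, without the /.../i delimiters.
RULE = ("<inter W2><post P2>(?!pharmaceutical)"
        "<P><H><A><R><M><A><C><E><U><T><I><C><A><L>")


def expand(template):
    """Expand <inter X>, <post Y> and <TAG> markers into a plain regex string."""
    inter = post = ""
    for kind, name in re.findall(r"<(inter|post) (\w+)>", template):
        if kind == "inter":
            inter = INTER[name]
        else:
            post = POST[name]
    body = re.sub(r"<(?:inter|post) \w+>", "", template)
    run = re.search(r"(?:<\w+>)+", body)          # the consecutive run of tags
    names = re.findall(r"<(\w+)>", run.group(0))
    expanded = inter.join(TAGS[n] + post for n in names)
    return body[:run.start()] + expanded + body[run.end():]


FUZZY = re.compile(expand(RULE), re.I)

if __name__ == "__main__":
    for sample in ["pharmaceutical", "ph@rmaceutical",
                   "p.h.a.r.m.a.c.e.u.t.i.c.a.l",
                   "Phrmaceutical", "Pharamaceutical"]:
        print(f"{sample!r}: {bool(FUZZY.search(sample))}")
```

Under these assumed definitions the correctly spelled word is excluded by the negative lookahead, and punctuation or look-alike obfuscations are caught. A dropped or inserted letter ("Phrmaceutical", "Pharamaceutical") is not caught, which is the gap the misspelling-specific rules in the original report aim at.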
OK, I put a version into the sandbox for testing. It works well for my corpus:

0.748 0.8599 0.0000 1.000 0.77 0.01 TVD_FUZZY_PHARMACEUTICAL

Thanks for the suggestion! :)