Bug 6973 - google translate redirector_pattern is incomplete
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.3.2
Hardware: PC Linux
Importance: P2 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2013-09-06 01:38 UTC by Chris Myers
Modified: 2013-09-06 15:33 UTC

Description Chris Myers 2013-09-06 01:38:31 UTC
The Google Translate redirector pattern doesn't cover all of the possible URLs supported by Google's translation API.  Valid URLs include https?://translate.google.com/translate_[ct], but the redirector_pattern provided with SpamAssassin only matches http://translate.google.com/translate

A more complete pattern is:

redirector_pattern m'^http:/*(?:\w+\.)?google(?:\.\w{2,3}){1,2}/translate(_[ct])?\?.*?(?<=[?&])u=(.*?)(?:$|[&\#])'i
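A minimal sketch of the difference, using Python's `re` module as a stand-in for Perl. The "stock" expression is an assumption: it is the proposed pattern with the `(_[ct])?` group taken out. The test URLs and the `coverage` helper are hypothetical and exist only for illustration.

```python
import re

# Assumed stock pattern: same as the proposal, minus the (_[ct])? group.
STOCK = re.compile(
    r"^http:/*(?:\w+\.)?google(?:\.\w{2,3}){1,2}/translate\?"
    r".*?(?<=[?&])u=(.*?)(?:$|[&\#])",
    re.IGNORECASE,
)

# Proposed pattern from this report: also accepts /translate_c and /translate_t.
PROPOSED = re.compile(
    r"^http:/*(?:\w+\.)?google(?:\.\w{2,3}){1,2}/translate(_[ct])?\?"
    r".*?(?<=[?&])u=(.*?)(?:$|[&\#])",
    re.IGNORECASE,
)

# Hypothetical URLs, one per path variant.
URLS = [
    "http://translate.google.com/translate?hl=en&u=http://example.com/",
    "http://translate.google.com/translate_c?hl=en&u=http://example.com/",
    "http://translate.google.com/translate_t?hl=en&u=http://example.com/",
]


def coverage(pattern):
    """Return one boolean per URL in URLS: does the pattern match it?"""
    return [pattern.search(url) is not None for url in URLS]


if __name__ == "__main__":
    print("stock:   ", coverage(STOCK))
    print("proposed:", coverage(PROPOSED))
```

Under these assumptions the stock expression matches only the plain `/translate?` form, while the proposed one matches all three.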
Comment 1 John Hardin 2013-09-06 01:51:52 UTC
(In reply to Chris Myers from comment #0)

> redirector_pattern m'^http:/*(?:\w+\.)?google(?:\.\w{2,3}){1,2}/translate(_[ct])?\?.*?(?<=[?&])u=(.*?)(?:$|[&\#])'i

ITYM:

    m'^https?:/*
-----------^^
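To illustrate the scheme fix, here is a small sketch in Python's `re` (standing in for Perl) that compares only the two prefixes. The example URLs and the `matches` helper are hypothetical.

```python
import re

# Only the scheme prefix of the pattern is compared here.
HTTP_ONLY = re.compile(r"^http:/*", re.IGNORECASE)
HTTP_OR_HTTPS = re.compile(r"^https?:/*", re.IGNORECASE)


def matches(pattern, url):
    """True if the pattern matches at the start of the URL."""
    return pattern.match(url) is not None


if __name__ == "__main__":
    for url in ("http://translate.google.com/", "https://translate.google.com/"):
        print(url, matches(HTTP_ONLY, url), matches(HTTP_OR_HTTPS, url))
```

The optional `s` is the whole change: the `http:` prefix fails on an `https://` URL because the character after `http` is `s`, not `:`.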
Comment 2 Chris Myers 2013-09-06 14:58:37 UTC
The redirector_pattern in the report began life as a cut-and-paste from my updates_spamassassin_org/72_active.cf file.  It really says just http:// rather than https?:// (which I agree is an improvement).  My change to the pattern is actually changing .../translate\? to /translate(_[ct])?.
Comment 3 Chris Myers 2013-09-06 14:59:32 UTC
errr actually I meant to say "/translate(_[ct])\?" with the backslash. :-(
Comment 4 John Hardin 2013-09-06 15:16:37 UTC
(In reply to Chris Myers from comment #2)
> The redirector_pattern in the report began life as a cut-and-paste from my
> updates_spamassassin_org/72_active.cf file.  It really says just http://
> rather than https?:// (which I agree is an improvement).

Indeed? I didn't actually check the current sources - if so, that's a hole.

> My change to the
> pattern is actually changing .../translate\? to /translate(_[ct])?.

...or /translate(?:_[ct])?\?   :)

Can you provide a pointer to a spec from Google that documents the possible formats? Or was this just from observation?
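The non-capturing form may matter for more than tidiness. The sketch below, in Python's `re` as a stand-in for Perl, shows how a capturing `(_[ct])?` shifts the `u=` target from group 1 to group 2. Whether SpamAssassin reads the redirect target from the first capture group is an assumption here and should be checked against the `redirector_pattern` documentation. The URL and the `first_group` helper are hypothetical.

```python
import re

HEAD = r"^https?:/*(?:\w+\.)?google(?:\.\w{2,3}){1,2}/translate"
TAIL = r"\?.*?(?<=[?&])u=(.*?)(?:$|[&\#])"

# Capturing variant: (_[ct])? becomes group 1, the u= target becomes group 2.
CAPTURING = re.compile(HEAD + r"(_[ct])?" + TAIL, re.IGNORECASE)
# Non-capturing variant: the u= target stays in group 1.
NON_CAPTURING = re.compile(HEAD + r"(?:_[ct])?" + TAIL, re.IGNORECASE)

# Hypothetical redirector URL.
URL = "http://translate.google.com/translate_c?hl=en&u=http://example.com/"


def first_group(pattern, url):
    """Return the text of capture group 1, or None if there is no match."""
    m = pattern.search(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    print("capturing:    ", first_group(CAPTURING, URL))
    print("non-capturing:", first_group(NON_CAPTURING, URL))
```

With the capturing variant, anything that reads group 1 would see the path suffix instead of the target URL.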
Comment 5 Chris Myers 2013-09-06 15:33:04 UTC
> Indeed? I didn't actually check the current sources - if so, that's a hole.

Yup.

Agreed that getting rid of the unneeded capturing group is probably a beneficial thing.  I don't live and breathe Perl REs.

I've seen /translate_c and /translate_t referred to by users on the Internet (such as http://googlesystem.blogspot.com/2008/03/useful-google-translate-addresses.html) but didn't find any actual Google doc -- it may be an internal thing rather than part of the public API.  This particular bug report is driven by an actual spam message that referenced a URL beginning with:

 http://translate.google.co.ke/tran%73%6C%61te_c?hl=<omitted>
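The path in that URL is percent-encoded. A short sketch, using Python's standard `urllib.parse.unquote`, shows that it decodes to `/translate_c`. Whether SpamAssassin decodes such escapes before applying `redirector_pattern` is not established in this report, so treat that step as an assumption. The `decode_path` helper is hypothetical.

```python
from urllib.parse import unquote

# Path portion of the spam URL quoted above (the query string is omitted there).
RAW_PATH = "/tran%73%6C%61te_c"


def decode_path(path):
    """Decode %XX escapes in a URL path."""
    return unquote(path)


if __name__ == "__main__":
    print(decode_path(RAW_PATH))
```

A pattern applied to the raw string would not see the literal text `translate_c`, so a match depends on the URL being decoded first.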