Issue 64368 - Inconsistent regex-support with named classes
Summary: Inconsistent regex-support with named classes
Status: CLOSED FIXED
Alias: None
Product: Calc
Classification: Application
Component: ui
Version: OOo 2.0
Hardware: All All
Importance: P3 Trivial with 4 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Duplicates: 66720 77248
Depends on:
Blocks:
 
Reported: 2006-04-13 11:12 UTC by villeroy
Modified: 2021-01-07 18:03 UTC
CC List: 8 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Evaluation sheet (contains basic-code) (17.78 KB, application/vnd.oasis.opendocument.spreadsheet)
2006-04-13 17:55 UTC, villeroy

Description villeroy 2006-04-13 11:12:44 UTC
The help on regular expressions states:
?  Matches one or none of the preceding item
[:digit:]? Matches one single digit

In fact [:digit:]?, like the other named classes, matches exactly one digit,
while [:digit:] alone does not match at all.
1. This is contrary to the meaning of "?".
2. ^[:digit:]$ *does* match lines with a single digit, as well as a Calc cell
containing a single digit.
3. [0-9] and [0-9]? behave consistently, according to POSIX (as far as I know it
from Emacs).
Comment 1 villeroy 2006-04-13 17:55:17 UTC
Created attachment 35687
Evaluation sheet (contains basic-code)
Comment 2 ooo 2006-04-13 18:53:40 UTC
Grabbing issue.
Comment 3 ooo 2006-04-13 18:57:05 UTC
Accepted. Targeting to same milestone as previous issue 63849.
Comment 4 mike_hall 2006-06-24 20:27:06 UTC
Added cc
Comment 5 grsingleton 2006-06-27 13:51:27 UTC
.
Comment 6 Joe Smith 2006-07-24 03:57:05 UTC
Sorry, I'm really not sure where to put this, but as this issue seems to be
active and closely related to what I want to report, I'm going to put it here.
If it should be filed separately, let me know.

First, I want to confirm that a regexp search for [:digit:] does not work--it
never matches anything--in Writer as well (2.0.2 on FC5 & XP, 2.0.3 on FC5). I
assume Calc and Writer share the same RE library, so this doesn't need to be
reported separately.

Second, there is a useful workaround: a regexp search for ([:digit:]) works "as
expected" (again, tested with Calc and Writer):

  e([:digit:])?    -- finds 'e' followed by zero or one digit
  ^([:digit:])$    -- finds lines or cells with exactly one digit

All(?) of the named classes seem to work this way (I've actually only tested
[:space:] and [:digit:]).

Third, there seems to be some confusion (possibly on my part) as to the use of
the POSIX named character classes: as far as I can determine, the named classes
are only special _within a regular character class_, and a bare '[:digit:]'
doesn't match a digit at all, but any of the characters 'd', 'i', 'g', 't' or
':'. In order to search for a digit, you have to write [[:digit:]]. This form
_never_ matches anything in OOo, even with the extra ()s.
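
The difference between the bare and the bracketed form can be checked outside
OOo with a minimal sketch like the one below. It assumes a POSIX-conforming
expr (which applies a basic regular expression anchored at the start of the
string); the helper name starts_with is made up for illustration and is not
part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: succeeds if the POSIX basic regex $2 matches
# at the start of string $1. The leading "x" on both operands keeps
# expr from mistaking a value such as ":" for one of its operators.
starts_with() {
    expr "x$1" : "x$2" >/dev/null
}

# Try each character of ":digit" plus one real digit against both forms.
for ch in d i g t : 5; do
    if starts_with "$ch" '[:digit:]'; then bare=match; else bare=no-match; fi
    if starts_with "$ch" '[[:digit:]]'; then posix=match; else posix=no-match; fi
    echo "$ch  bare=$bare  bracketed=$posix"
done
```

With a conforming expr, the bare form should match 'd', 'i', 'g', 't' and ':'
but not '5', and the bracketed form should match only '5'.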
Comment 7 mike_hall 2006-07-24 10:11:06 UTC
Issue 66720 is a duplicate of this issue. See that issue for the additional
information that a search for any string ending with a named class does not work
(tested in Writer but presumably generic). The brackets workaround resolves the
problem, providing the closing bracket is last.

I'm guessing that the fix for this is probably very easy. Any chance of changing
milestone to 2.0.5?

To jes, named classes are regex terms, only expected to work if the regex box is
checked.
Comment 8 ooo 2006-07-24 13:54:17 UTC
*** Issue 66720 has been marked as a duplicate of this issue. ***
Comment 9 ooo 2006-07-24 14:18:14 UTC
There are several misconceptions about named character classes in regular
expressions. A named character class [:name:] may appear only inside a bracket
expression, so [[:digit:]] is a synonym for [0-9] (using ASCII digits).
[:digit:] on its own, without the surrounding brackets, is not a valid
expression in any context.

I took this issue because a single [[:digit:]] does not match one digit, as it
should; the same goes for [[:alnum:]] and other character classes. A digit is
matched only if the class is used as [[:digit:]]? or [[:digit:]]*.

The use of constructs like [:digit:]? or ([:digit:]) is undefined. In fact a
better implementation would not match anything there.

For the definition of POSIX regexps see also
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
Comment 10 Joe Smith 2006-07-24 17:07:26 UTC
> The use of constructs like [:digit:]? or ([:digit:]) is undefined.
> In fact a better implementation would not match anything there.

At least from what I've checked, Perl, GNU sed, GNU grep and bash all treat
these like a normal character class. Why wouldn't they be?

I don't know if the misconception is in the implementation or in the
documentation. The online help doesn't explain that the named classes require
any context; but more importantly, the implementation seems to neither require
nor support the standard usage.

I'm hoping that someone familiar with the implementation and the roadmap would
triage this so that we're all on the same page. Is POSIX the accepted standard
for RE's in OOo? If so, are the implementation and the documentation correct?
Comment 11 mike_hall 2006-07-24 17:41:45 UTC
I was going to make a similar point. Looking at the documentation both in the
link given (thanks - that's useful) and in OOo Help, it looks to me as if
[:name:] is defined to work anywhere in a regex. Where is the mention of
additional brackets and what purpose would they serve? The fix suggested will
correct a different issue but leave the inconsistent behaviour that a search eg
for "[:alnum:]text" works but "text[:alnum:]" does not. This is bound to confuse
people and lead to the issue being raised again in future.
Comment 12 ooo 2006-09-14 15:31:52 UTC
Jes,

> > The use of constructs like [:digit:]? or ([:digit:]) is undefined.
> > In fact a better implementation would not match anything there.
> 
> At least from what I've checked, Perl, GNU sed, GNU grep and bash all treat
> these like a normal character class. Why wouldn't they be?

Please

1. recheck the programs you mentioned with what I wrote. None of them
   accepts a plain [:digit:] as a valid regular expression matching
   a digit. Just try
   sed -e 's/[:digit:]/foo/'
   It will not replace a digit with foo, but will replace any of the
   characters 'digt:' with foo. You have to use a named character class
   within a bracketed expression.
   sed -e 's/[[:digit:]]/foo/'

2. Don't mix bash in. Bash doesn't know regular expressions, you're
   confusing it with file name pattern matching. And yes, there bash
   knows named character classes, but they also have to be used within
   brackets. See
   http://www.tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_03.html
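
   The shell pattern-matching case from point 2 can be tried with a small
   sketch like this one. It uses a plain case statement, so no regular
   expression engine is involved; the function names is_digit and
   is_bare_digit are hypothetical and exist only for this illustration.

```sh
#!/bin/sh
# Hypothetical helper: the named class sits inside a bracket expression,
# so the pattern matches exactly one digit.
is_digit() {
    case $1 in
        [[:digit:]]) echo yes ;;
        *)           echo no ;;
    esac
}

# Hypothetical helper: without the outer brackets the pattern is an
# ordinary bracket expression listing the characters : d i g t.
is_bare_digit() {
    case $1 in
        [:digit:]) echo yes ;;
        *)         echo no ;;
    esac
}

is_digit 7        # a digit, tested against [[:digit:]]
is_digit d        # a letter, tested against [[:digit:]]
is_bare_digit 7   # a digit, tested against the bare [:digit:]
is_bare_digit d   # a letter from the word "digit", tested against the bare form
```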


Mike,

> Where is the mention of additional brackets and what purpose would they serve?

Please note that the entire definition of character classes in topic 6 of
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
is a subsection of "9.3.5 RE Bracket Expression". A named character
class represents a set within a bracketed expression, and the "[:" and
":]" delimiters are part of the character class expression. Simplified
for an "ASCII locale", [:digit:] is a synonym for 0-9, so a complete
bracketed expression of [ab[:digit:]] would be the same as [ab0-9] and
would match a,b,0,1,2,3,4,5,6,7,8,9
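
As a rough check of that equivalence, the sketch below runs both bracket
expressions through sed and compares the results. The sample text is made up,
and LC_ALL=C is set only to approximate the "ASCII locale" assumed above.

```sh
#!/bin/sh
# Made-up sample text for illustration.
sample='cab 2006 xyz'

# Replace every character matched by each bracket expression with "_".
with_class=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[ab[:digit:]]/_/g')
with_range=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[ab0-9]/_/g')

echo "class: $with_class"
echo "range: $with_range"

# Both expressions should select the same characters.
if [ "$with_class" = "$with_range" ]; then
    echo "same result"
else
    echo "different result"
fi
```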

  Eike
Comment 13 Joe Smith 2006-09-14 20:05:32 UTC
JES->ER: I'm glad we are in vehement agreement ;-)

I only wished to take issue with your statement:
  > The use of constructs like [:digit:]? or ([:digit:]) is undefined.
  > In fact a better implementation would not match anything there.

If I understand what you are saying, I don't understand what evidence you have
to support this statement. As far as I can tell, these are both perfectly
reasonable constructs that match simple character classes.

$ echo "Digits: 1234" | perl -ne '@m = m/[:digit:]?/g; print("m:(@m)\n")'
m:( i g i t  :       )

$ echo "Digits: 1234" | perl -ne '@m = m/([:digit:])/g; print("m:(@m)\n")'
m:(i g i t :)

The important point here is that they DON'T match digits, which you clearly
understand as well.

What is the value in fixing the problem originally reported in this issue, when
OOo's entire implementation of named classes is not in line with the POSIX
standard or with common practice?
Comment 14 mike_hall 2006-09-28 07:41:48 UTC
OK, now I understand, and perl does require the additional brackets (though it
still seems non-intuitive because [::] means the same as [:] and the second :
could reasonably be used to recognise that the user intends a character class).
Anyway, all is now clear and apologies for my denseness.
The more serious point is that in 2.0.4RC2 on Win XP I now can't get character
classes to work at all, eg with RE flag set searching for [0-9] works fine, but
[[:digit:]] finds nothing. Similarly with other named classes, with or without
other characters before or after the brackets. 
Also, searching for [[:dig]] (for example) with the RE flag set sends Writer
into a loop and the application has to be killed and restarted.
Should I open new issue(s)?
Comment 15 mike_hall 2006-09-28 16:29:24 UTC
...same in 2.0.4RC3
Comment 16 frank 2007-07-06 13:59:36 UTC
*** Issue 77248 has been marked as a duplicate of this issue. ***
Comment 17 villeroy 2008-01-11 22:13:44 UTC
Well, it is quite a sophisticated discussion about [:name:] or [[:name:]] being
the right token for named classes in different applications. I think [:name:] is
just fine with OOo.
The reason why I filed this issue was: there is something wrong with [:name:].
As a stand-alone token it does not match at all. Meanwhile someone found out
that [:name:] fails if it is the very last token in a regex. [:name:]* [:name:]+
[:name:]{1} [:name:]$ work as expected.
Another problem is that the same regex produces different matches when used in
Find/Replace, filters and formulas.
Comment 18 Joe Smith 2008-01-15 16:57:39 UTC
Point taken.

I have filed Issue 85269 to receive any further discussion of the regexp syntax.
Comment 19 ooo 2008-05-30 16:03:33 UTC
Probably not doable in the time frame for 3.0, retargeting to 3.x
Comment 20 Joe Smith 2013-09-12 20:17:30 UTC
I believe this issue can be closed; the issue is obsolete with the new regex engine introduced in AOO 3.4 which fully supports standard pattern syntax.
Comment 21 Marcus 2017-05-20 10:45:31 UTC
Reset the assignee to the default "issues@openoffice.apache.org".
Comment 22 Dick Groskamp 2021-01-07 15:11:06 UTC
Fixed in AOO 4.1, probably since the new regexp engine was introduced in 3.4.