Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing |
Summary: | Inconsistent regex-support with named classes | ||||||
---|---|---|---|---|---|---|---|
Product: | Calc | Reporter: | villeroy <villeroy> | ||||
Component: | ui | Assignee: | AOO issues mailing list <issues> | ||||
Status: | CLOSED FIXED | QA Contact: | |||||
Severity: | Trivial | ||||||
Priority: | P3 | CC: | digro, gerry, gudmundpublic, issues, jes, khirano, kozodaevroman, mike.hall | ||||
Version: | OOo 2.0 | ||||||
Target Milestone: | --- | ||||||
Hardware: | All | ||||||
OS: | All | ||||||
Issue Type: | DEFECT | Latest Confirmation in: | --- | ||||
Developer Difficulty: | --- | ||||||
Attachments: |
|
Description
villeroy
2006-04-13 11:12:44 UTC
Created attachment 35687 [details]
Evaluation sheet (contains basic-code)
Grabbing issue. Accepted. Targeting to same milestone as previous issue 63849. Added cc.

Sorry, I'm really not sure where to put this, but as this issue seems to be active and closely related to what I want to report, I'm going to put it here. If it should be filed separately, let me know.

First, I want to confirm that a regexp search for [:digit:] does not work--it never matches anything--in Writer as well (2.0.2 on FC5 & XP, 2.0.3 on FC5). I assume Calc and Writer share the same RE library, so this doesn't need to be reported separately.

Second, there is a useful workaround: a regexp search for ([:digit:]) works "as expected" (again, tested with Calc and Writer):

    e([:digit:])?   -- finds 'e' followed by zero or one digit
    ^([:digit:])$   -- finds lines or cells with exactly one digit

All(?) of the named classes seem to work this way (I've actually only tested [:space:] and [:digit:]).

Third, there seems to be some confusion (possibly on my part) as to the use of the POSIX named character classes: as far as I can determine, the named classes are only special _within a regular character class_, and a bare '[:digit:]' doesn't match a digit at all, but any of the characters 'd', 'i', 'g', 't' or ':'. In order to search for a digit, you have to write [[:digit:]]. This form _never_ matches anything in OOo, even with the extra ()s.

Issue 66720 is a duplicate of this issue. See that issue for the additional information that a search for any string ending with a named class does not work (tested in Writer but presumably generic). The brackets workaround resolves the problem, provided the closing bracket is last.

I'm guessing that the fix for this is probably very easy. Any chance of changing milestone to 2.0.5?

To jes, named classes are regex terms, only expected to work if the regex box is checked.

*** Issue 66720 has been marked as a duplicate of this issue. ***

There are several misconceptions about named character classes in regular expressions.
A named character class [:name:] may appear in a bracket expression, so [[:digit:]] is a synonym for [0-9] (using ASCII digits). [:digit:] on its own, without the surrounding brackets, is not a valid expression in any context.

I took this issue because a single [[:digit:]] does not match one digit, as it should; the same goes for [[:alnum:]] and other character classes. A digit is matched only if the class is used as [[:digit:]]? or [[:digit:]]*.

The use of constructs like [:digit:]? or ([:digit:]) is undefined. In fact a better implementation would not match anything there.

For the definition of POSIX regexps see also http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

> The use of constructs like [:digit:]? or ([:digit:]) is undefined.
> In fact a better implementation would not match anything there.
At least from what I've checked, Perl, GNU sed, GNU grep and bash all treat
these like a normal character class. Why wouldn't they be?
I don't know if the misconception is in the implementation or in the
documentation. The online help doesn't explain that the named classes require
any context; but more importantly, the implementation seems to neither require
nor support the standard usage.
I'm hoping that someone familiar with the implementation and the roadmap would
triage this so that we're all on the same page. Is POSIX the accepted standard
for RE's in OOo? If so, are the implementation and the documentation correct?
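
As an aside for readers following the argument: the claim that a bare [:digit:] is just a set of literal characters can be checked without OOo. The sketch below uses Python's re module, which is an assumption of this note and is not used anywhere in the thread. Python's re has no POSIX named classes, so it reads [:digit:] as an ordinary bracket expression containing ':', 'd', 'i', 'g' and 't'. The sample string is an arbitrary choice.

```python
import re

# Python's `re` has no POSIX named classes, so [:digit:] is parsed as a
# plain set of the literal characters ':', 'd', 'i', 'g', 't'.
BARE = re.compile(r"[:digit:]")

sample = "Digits: 1234"

# Each match is one of the literal characters above; none is a digit.
bare_matches = BARE.findall(sample)

# An explicit range is what [[:digit:]] stands for in an ASCII locale.
digit_matches = re.findall(r"[0-9]", sample)

print(bare_matches)   # ['i', 'g', 'i', 't', ':']
print(digit_matches)  # ['1', '2', '3', '4']
```

This says nothing about what the OOo engine of the time did internally; it only illustrates how a bare bracket expression is read when no named-class handling is involved.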
I was going to make a similar point. Looking at the documentation both in the link given (thanks - that's useful) and in OOo Help, it looks to me as if [:name:] is defined to work anywhere in a regex. Where is the mention of additional brackets and what purpose would they serve?

The fix suggested will correct a different issue but leave the inconsistent behaviour that a search e.g. for "[:alnum:]text" works but "text[:alnum:]" does not. This is bound to confuse people and lead to the issue being raised again in future.

Jes,

> > The use of constructs like [:digit:]? or ([:digit:]) is undefined.
> > In fact a better implementation would not match anything there.
>
> At least from what I've checked, Perl, GNU sed, GNU grep and bash all treat
> these like a normal character class. Why wouldn't they be?

Please

1. recheck the programs you mentioned against what I wrote. None of them accepts a plain [:digit:] as a valid regular expression matching a digit. Just try

    sed -e 's/[:digit:]/foo/'

It will not replace a digit with foo, but will replace any of the characters 'digt:' with foo. You have to use a named character class within a bracketed expression.

    sed -e 's/[[:digit:]]/foo/'

2. Don't mix bash in. Bash doesn't know regular expressions; you're confusing it with file name pattern matching. And yes, there bash knows named character classes, but they also have to be used within brackets. See http://www.tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_03.html

Mike,

> Where is the mention of additional brackets and what purpose would they serve?

Please note that the entire definition of character classes in topic 6 of http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html is a subsection of "9.3.5 RE Bracket Expression". A named character class represents a set within a bracketed expression, and the "[:" and ":]" delimiters are part of the character class expression.
Simplified for an "ASCII locale", [:digit:] is a synonym for 0-9, so a complete bracketed expression of [ab[:digit:]] would be the same as [ab0-9] and would match a,b,0,1,2,3,4,5,6,7,8,9

  Eike

JES->ER: I'm glad we are in vehement agreement ;-) I only wished to take issue with your statement:

> The use of constructs like [:digit:]? or ([:digit:]) is undefined.
> In fact a better implementation would not match anything there.

If I understand what you are saying, I don't understand what evidence you have to support this statement. As far as I can tell, these are both perfectly reasonable constructs that match simple character classes.

    $ echo "Digits: 1234" | perl -ne '@m = m/[:digit:]?/g; print("m:(@m)\n")'
    m:( i g i t : )
    $ echo "Digits: 1234" | perl -ne '@m = m/([:digit:])/g; print("m:(@m)\n")'
    m:(i g i t :)

The important point here is that they DON'T match digits, which you clearly understand as well. What is the value in fixing the problem originally reported in this issue, when OOo's entire implementation of named classes is not in line with the POSIX standard or with common practice?

OK, now I understand, and perl does require the additional brackets (though it still seems non-intuitive because [::] means the same as [:] and the second : could reasonably be used to recognise that the user intends a character class). Anyway, all is now clear and apologies for my denseness.

The more serious point is that in 2.0.4RC2 on Win XP I now can't get character classes to work at all, e.g. with the RE flag set, searching for [0-9] works fine, but [[:digit:]] finds nothing. Similarly with other named classes, with or without other characters before or after the brackets. Also, searching for [[:dig]] (for example) with the RE flag set sends Writer into a loop and the application has to be killed and restarted. Should I open new issue(s)?

...same in 2.0.4RC3

*** Issue 77248 has been marked as a duplicate of this issue.
***

Well, it is quite a sophisticated discussion about whether [:name:] or [[:name:]] is the right token for named classes in different applications. I think [:name:] is just fine with OOo. The reason why I filed this issue was: there is something wrong with [:name:]. As a stand-alone token it does not match at all. Meanwhile someone found out that [:name:] fails if it is the very last token in a regex.

    [:name:]*
    [:name:]+
    [:name:]{1}
    [:name:]$

work as expected. Another matter is that the same regex produces different matches when used in Find/Replace, filters and formulas.

Point taken. I have filed Issue 85269 to receive any further discussion of the regexp syntax.

Probably not doable in the time frame for 3.0, retargeting to 3.x

I believe this issue can be closed; the issue is obsolete with the new regex engine introduced in AOO 3.4, which fully supports standard pattern syntax.

Reset the assignee to the default "issues@openoffice.apache.org".

Fixed in AOO 4.1. Probably since the introduction of the new regexp engine in 3.4
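
For reference, the equivalence stated earlier in this thread (in an ASCII locale, [ab[:digit:]] behaves like [ab0-9]) can be sketched as a small pattern rewriter. This is only an illustration in Python, not how OOo or AOO implements anything: the helper name `expand_posix_classes` and the table `ASCII_CLASSES` are hypothetical, the table covers just a few classes, and edge cases such as a literal ']' at the start of a bracket expression are not handled.

```python
import re

# ASCII-locale equivalents for a few POSIX named classes (illustrative subset).
ASCII_CLASSES = {
    "digit": "0-9",
    "alpha": "A-Za-z",
    "alnum": "0-9A-Za-z",
    "space": r" \t\n\r\f\v",
}

# A named class token such as [:digit:], delimiters included.
NAMED = re.compile(r"\[:([a-z]+):\]")


def expand_posix_classes(pattern: str) -> str:
    """Rewrite [:name:] inside bracket expressions as explicit ASCII ranges."""
    out = []
    i = 0
    in_bracket = False
    while i < len(pattern):
        ch = pattern[i]
        if not in_bracket:
            if ch == "\\" and i + 1 < len(pattern):
                out.append(pattern[i:i + 2])  # keep escaped pairs intact
                i += 2
                continue
            in_bracket = ch == "["  # an unescaped '[' opens a bracket expression
            out.append(ch)
            i += 1
            continue
        m = NAMED.match(pattern, i)
        if m:
            # Names missing from the table raise KeyError instead of being
            # passed through silently.
            out.append(ASCII_CLASSES[m.group(1)])
            i = m.end()
            continue
        if ch == "]":
            in_bracket = False  # closes the bracket expression
        out.append(ch)
        i += 1
    return "".join(out)


print(expand_posix_classes("[[:digit:]]"))    # [0-9]
print(expand_posix_classes("[ab[:digit:]]"))  # [ab0-9]
print(expand_posix_classes("[:digit:]"))      # [:digit:] (bare form is left alone)
```

The bare form is left untouched on purpose: as argued above, without the outer brackets it is just a set of literal characters, so there is no named class to expand.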