Apache OpenOffice (AOO) Bugzilla – Issue 107619
Bug in Find & replace with regular expression
Last modified: 2016-08-31 22:25:27 UTC
Find & replace with regular expression find ([0-9]) replace $1 In cell with: 987 after replace we get 987 (ok) after replace all we get 999 another case: find: ([^ ]*)[ ]*([^ ]*) replace: $1$2 {replace space between something} replace all in cell with: "a b" we get "ab" (ok) in cell with: "a" we get "aa" (??)
[Commenting on the second regular expression only, I didn't check the first] Actually I would have used another regular expression for the second purpose you state, i.e., ([^ ]*)[ ]+([^ ]*) which doesn't show the bug. But indeed there is something strange with the behavior of ([^ ]*)[ ]*([^ ]*) It seems that, in cells with no spaces, cells with an even number of characters (aa, aaaa, ...) are correctly left untouched, while cells with an odd number of characters (a, aaa, ...) are duplicated, thus resulting in "aa", "aaaaaa" and so on.
([0-9]) finds any single digit, in the number 987 it is the first digit 9, with replacement $1 and ReplaceAll you told it to replace all with what was found, that's what it did. ([^ ]*)[ ]*([^ ]*) Let's analyze the expression: First group is any number of characters except space, including 0 characters. Followed by any number of spaces, including 0 spaces. Followed by the second group of any number of characters except space, including 0 characters. Actually that expression matches almost everything, it depends on greediness what exactly is matched for the groups. The first group may already match everything except a space, replacing with $1$2 replaces the match with itself. Also leading and trailing spaces with one word would be matched by the expression. If you want to match space between non-space characters then say so: ([^ ]+)[ ]+([^ ]+) That an odd number of characters is replaced with itself twice using both, $1 and $2, as back references apparently is a bug.
This is bug report. So I give simple sample of bugs. You can try another sample: I'd like convert some variable name to readable form: "VarNameA" -> "Var Name A" Using regular expression it is simple task: "s/([A-Z])/ $1/g" But using regular expression in Calc: find: "([A-Z])" replace " $1" I got: " V " (many spaces) Do you really mean, that this is OK?
Why should it be ok? Thanks for the test case.
I can confirm both bugs on OOO320m12. I try to replace all the occurences of apostrophed numbers (formatted like, say, "'123,456") to the number itself (to include it into the calculation). By the way, it would be nice to be able to do it directly. But however, I found that it's impossible to search for the apostrophe (the search doesn't find anything). But if I search for the text without the apostrophe and replace it with the same text the apostrophe disappears - so that's what I need. I use the regex: ^([:digit:]*)(,?)([:digit:]*)$ . If I replace it with $1$2$3, than if the number consist of odd number of digits without decimal comma, it's replaced with itself triple times repeated (so '123 becomes 123123123). If the number contains the decimal comma, or if it has even number of digits, everything is ok. Seems like if the found text consists of odd number of characters, and some references are empty, they contain the last non-empty reference. I disagree that the first reported case is not a bug. The replacement should be done letter-by-letter here. So the program must find the character "9" and replace _it_ with the found text (here it's "9"), then it must find "8" and replace it with the found text (that is by now must have updated to "8"), and so on. The behaviour of the "Replace" and "Replace All" buttons must differ only in the range of cells they operate on, and must be identical if there's only one cell with matching text and it's selected.
@mikekaganski: use ^([0-9,]+)$ to search and replace with $1
2er: Thanks, but this is not the same thing. The mask includes, for example, values like "0,1,2,3,4,5,6,7,8,9" or even ",,,,,". However, it's not a problem: since I can replace anything with "$0" and this action simply eliminates the leading apostrophe, in my case it's enough to search for ".+" . Though it's really hard to teach users to do such things regularly, so a simple solution to convert apostrophed numbers to simple numbers would be very handy (combined with an easy way to see that some cells a formula depends on are formatted in such a way and are excluded from the calculation: sometims it takes a while to find out the cause of a strange result, and sometimes (even worse) we don't notice it at all!). :) Excuse for offtopic, I think I should make it a separate feature request...
*** Issue 113275 has been marked as a duplicate of this issue. ***
*** Issue 124521 has been marked as a duplicate of this issue. ***
In ScTable::SearchCell method, instance of css::util::SearchResult is not updated while repeat search. Therefore, the group reference is always the same with the first match.
"hanya" committed SVN revision 1653973 into trunk: #i107619# update search result while repeating search
Fixed on trunk.
I think this is fix should be included in 4.1.2.
Approved for 4.1.2; this could really benefit from a unit test though.
"kschenk" committed SVN revision 1705489 into branches/AOO410: #i107619# Merged from r1653973.
I still see the problem with 4.1.2 and one of the original example cases. I have 3 cells containing 987 a b a With Find: ([^ ]*)[ ]*([^ ]*) Replace: $1$2 I get the following after clicking Replace All 987987 ab aa If I use Find: ([0-9]) Replace: $1 I do not see the originally reported problem that 987 becomes 999 with Replace All. The 987 is left unchanged. I'm using Windows 7.
(In reply to Francis C. Costero from comment #17) > I still see the problem with 4.1.2 and one of the original example cases. I > have 3 cells containing > 987 > a b > a > > With > Find: ([^ ]*)[ ]*([^ ]*) Match zero or more occurences of a space following the beginning (grouping 1), followed by zero or more occurrences of "space", followed by match zero or more occurrences of a space following the beginning of the line (grouping 2). > Replace: $1$2 > I get the following after clicking Replace All > 987987 This is correct. In this case $1 and $2 are identical because there is zero or no occurrence of the "intervening" space. > ab This is correct. There was an intervening space. > aa This is correct and the same as the "987" string case. > > > If I use > Find: ([0-9]) > Replace: $1 > I do not see the originally reported problem that 987 becomes 999 with > Replace All. The 987 is left unchanged. Each number essentially matches itself and no changes. > I'm using Windows 7.
(In reply to Kay from comment #18) > (In reply to Francis C. Costero from comment #17) > > Find: ([^ ]*)[ ]*([^ ]*) > > Match zero or more occurences of a space following the beginning (grouping > 1), followed by zero or more occurrences of "space", followed by match > zero or more occurrences of a space following the beginning of the line > (grouping 2). Incorrect. [^ ] doesn't mean "a space following the beginning of the line", it's "Any character except space" syntax. Here "space" means character 0x20, not "any whitespace character" like [:space:], of course. If it were "a space following the beginning of the line", then in any of the following examples both groups would be empty (there's no lines starting with space). Thus, this means (taking greedy processing of "*" into account): put as much non-space characters as possible into the first grouping (possibly empty if the first char is space), THEN skip any spaces (if present), THEN take as much as possible of non-space characters LEFT into the second grouping (maybe zero if no other characters after those spaces). Anyway, first and last groupings represent DIFFERENT groups of characters, they may have identical value only if those different groups have identical characters in them. > > Replace: $1$2 > > I get the following after clicking Replace All > > 987987 > > This is correct. In this case $1 and $2 are identical because there is zero > or no occurrence of the "intervening" space. Incorrect. In this case $1 MUST be "987", and $2 MUST be "". > > ab > > This is correct. There was an intervening space. OK. > > aa > > This is correct and the same as the "987" string case. No, this is INCORRECT and the same as the "987" string case.
(In reply to mikekaganski from comment #19) > (In reply to Kay from comment #18) > > (In reply to Francis C. Costero from comment #17) > > > Find: ([^ ]*)[ ]*([^ ]*) > > > > Match zero or more occurences of a space following the beginning (grouping > > 1), followed by zero or more occurrences of "space", followed by match > > zero or more occurrences of a space following the beginning of the line > > (grouping 2). > > Incorrect. [^ ] doesn't mean "a space following the beginning of the line", > it's "Any character except space" syntax. Here "space" means character 0x20, > not "any whitespace character" like [:space:], of course. > > If it were "a space following the beginning of the line", then in any of the > following examples both groups would be empty (there's no lines starting > with space). > > Thus, this means (taking greedy processing of "*" into account): put as much > non-space characters as possible into the first grouping (possibly empty if > the first char is space), THEN skip any spaces (if present), THEN take as > much as possible of non-space characters LEFT into the second grouping > (maybe zero if no other characters after those spaces). Anyway, first and > last groupings represent DIFFERENT groups of characters, they may have > identical value only if those different groups have identical characters in > them. > > > > Replace: $1$2 > > > I get the following after clicking Replace All > > > 987987 > > > > This is correct. In this case $1 and $2 are identical because there is zero > > or no occurrence of the "intervening" space. > > Incorrect. In this case $1 MUST be "987", and $2 MUST be "". > > > > ab > > > > This is correct. There was an intervening space. > > OK. > > > > aa > > > > This is correct and the same as the "987" string case. > > No, this is INCORRECT and the same as the "987" string case. OMG! I stand corrected! you are right, mikekaganski!
Confirmed fixed with AOO412m1(Build:9780) - Rev. 1705625 2015-09-28 12:45:04 (Mo, 28 Sep 2015) under Windows XP Pro x86 SP3
(In reply to Francis C. Costero from comment #17) > I still see the problem with 4.1.2 and one of the original example cases. I > have 3 cells containing > 987 > a b > a > > With > Find: ([^ ]*)[ ]*([^ ]*) > Replace: $1$2 > I get the following after clicking Replace All > 987987 > ab > aa I observed strange result with: A1: 987 Find and replace: written in the above other cells: leave them empty Click "Replace All" button. Sometimes strange result is not shown but if you push the button sometimes, the strange result can be observed. It seems this problem is happen because of index out of bounds of sequence in TextSearch::searchForward method. In the above case, offset sequence has [0, 1, 2] elements. startOffset of the match: [0, 0, 3] endOffset of the match: [3, 3, 3] In the TextSearch::searchForward method: for ( int k = 0; k < sres.startOffset.getLength(); k++ ) { if (sres.startOffset[k]) sres.startOffset[k] = offset[sres.startOffset[k]]; When k == 2, sres.startOffset[k] gives 3 to refer offset[3] which gives illegal value from the out of the sequence. If offset[3] gave large number, the result in the cell was correct. But offset[3] was 0, it gave 0-3 sub string for $2. This behavior is simply broken. I have no idea how to fix the index miss match now.
The illegal index for the offset is happen when the start of the match result is started at the end of the string. In such case, next position from the last element of the offset can be used.
Fixed on the trunk, revision 1707844.
hanya has committed a revision to this bug. Thanks again, hanya! From my current testing using entries from Comment #17, I now get: 987 ab a after either a Replace or Replace All, which I believe is what we want for the result, yes? Please test from trunk nightly builds if you can. The previous fix has already been committed to the AOO410 branch. If we get some positive feedback over the next day or so, I will re-request this as a 4.1.2 release blocker, and, hopefully get this new update included. Setting this back to reopened for the time being.
This problem can be happen on Writer also. The same result can be seen in the paragraphs.
The new part of the patch was committed to AOO410 in revision 1709405 for OpenOffice 4.1.2-RC3.
The described problems no longer occurs. Verified fixed with AOO420m1(Build:9800) - Rev. 1757080 under Windows 7 Pro x64 SP1
Closing.