107619 – Bug in Find & replace with regular expression

Issue 107619 - Bug in Find & replace with regular expression

Summary: Bug in Find & replace with regular expression

Status:	CLOSED FIXED

Alias:	None

Product:	Calc
Classification:	Application
Component:	editing (show other issues)
Version:	OOO310m19
Hardware:	All All

Importance:	P3 Trivial with 2 votes (vote)
Target Milestone:	4.1.2
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Duplicates (2):	113275 124521 (view as issue list)
Depends on:
Blocks:	124521
	Show dependency tree

Reported:	2009-12-11 17:03 UTC by lp01
Modified:	2016-08-31 22:25 UTC (History)
CC List:	9 users (show)

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Flags:	pescetti: 4.1.2_release_blocker+

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this issue.

Description lp01 2009-12-11 17:03:28 UTC

Find & replace with regular expression
find ([0-9])
replace $1

In cell with: 987

after replace      we get 987 (ok)
after replace all  we get 999

another case:
find: ([^ ]*)[ ]*([^ ]*)       
replace: $1$2      {replace space between something}

replace all

in cell with: "a b" we get "ab"    (ok)
in cell with: "a"   we get "aa"    (??)

Comment 1 Andrea Pescetti 2009-12-12 13:16:07 UTC

[Commenting on the second regular expression only, I didn't check the first]

Actually I would have used another regular expression for the second purpose you
state, i.e., ([^ ]*)[ ]+([^ ]*) which doesn't show the bug.

But indeed there is something strange with the behavior of
([^ ]*)[ ]*([^ ]*)
It seems that, in cells with no spaces, cells with an even number of characters
(aa, aaaa, ...) are correctly left untouched, while cells with an odd number of
characters (a, aaa, ...) are duplicated, thus resulting in "aa", "aaaaaa" and so on.

Comment 2 ooo 2009-12-14 14:18:57 UTC

([0-9]) finds any single digit, in the number 987 it is the first digit 9, with
replacement $1 and ReplaceAll you told it to replace all with what was found,
that's what it did.

([^ ]*)[ ]*([^ ]*)
Let's analyze the expression:
First group is any number of characters except space, including 0 characters.
Followed by any number of spaces, including 0 spaces.
Followed by the second group of any number of characters except space, including
0 characters.

Actually that expression matches almost everything, it depends on greediness
what exactly is matched for the groups. The first group may already match
everything except a space, replacing with $1$2 replaces the match with itself.
Also leading and trailing spaces with one word would be matched by the
expression. If you want to match space between non-space characters then say so:
([^ ]+)[ ]+([^ ]+)

That an odd number of characters is replaced with itself twice using both, $1
and $2, as back references apparently is a bug.

Comment 3 lp01 2009-12-14 16:42:12 UTC

This is bug report. So I give simple sample of bugs. 

You can try another sample:
I'd like convert some variable name to readable form:

"VarNameA" -> "Var Name A"

Using regular expression it is simple task: "s/([A-Z])/ $1/g"

But using regular expression in Calc:
find: "([A-Z])"
replace " $1"

I got: " V              "   (many spaces)

Do you really mean, that this is OK?

Comment 4 ooo 2009-12-15 12:34:25 UTC

Why should it be ok? Thanks for the test case.

Comment 5 mikekaganski 2010-05-04 02:07:42 UTC

I can confirm both bugs on OOO320m12.
I try to replace all the occurences of apostrophed numbers (formatted like, 
say, "'123,456") to the number itself (to include it into the calculation). By 
the way, it would be nice to be able to do it directly. But however, I found 
that it's impossible to search for the apostrophe (the search doesn't find 
anything). But if I search for the text without the apostrophe and replace it 
with the same text the apostrophe disappears - so that's what I need.
I use the regex: ^([:digit:]*)(,?)([:digit:]*)$ .
If I replace it with $1$2$3, than if the number consist of odd number of digits 
without decimal comma, it's replaced with itself triple times repeated (so '123 
becomes 123123123). If the number contains the decimal comma, or if it has even 
number of digits, everything is ok. Seems like if the found text consists of 
odd number of characters, and some references are empty, they contain the last 
non-empty reference.
I disagree that the first reported case is not a bug. The replacement should be 
done letter-by-letter here. So the program must find the character "9" and 
replace _it_ with the found text (here it's "9"), then it must find "8" and 
replace it with the found text (that is by now must have updated to "8"), and 
so on. The behaviour of the "Replace" and "Replace All" buttons must differ 
only in the range of cells they operate on, and must be identical if there's 
only one cell with matching text and it's selected.

Comment 6 ooo 2010-05-04 12:28:59 UTC

@mikekaganski: use
^([0-9,]+)$
to search and replace with $1

Comment 7 mikekaganski 2010-05-04 12:59:27 UTC

2er: Thanks, but this is not the same thing. The mask includes, for example, 
values like "0,1,2,3,4,5,6,7,8,9" or even ",,,,,".
However, it's not a problem: since I can replace anything with "$0" and this 
action simply eliminates the leading apostrophe, in my case it's enough to 
search for ".+" . Though it's really hard to teach users to do such things 
regularly, so a simple solution to convert apostrophed numbers to simple 
numbers would be very handy (combined with an easy way to see that some cells a 
formula depends on are formatted in such a way and are excluded from the 
calculation: sometims it takes a while to find out the cause of a strange 
result, and sometimes (even worse) we don't notice it at all!). :) Excuse for 
offtopic, I think I should make it a separate feature request...

Comment 8 j.nitschke 2014-03-26 16:02:00 UTC

*** Issue 113275 has been marked as a duplicate of this issue. ***

Comment 9 j.nitschke 2014-03-26 17:47:11 UTC

*** Issue 124521 has been marked as a duplicate of this issue. ***

Comment 10 hanya 2015-01-22 17:50:05 UTC

In ScTable::SearchCell method, instance of css::util::SearchResult is not updated 
while repeat search. Therefore, the group reference is always the same with the first match.

Comment 11 SVN Robot 2015-01-22 17:53:09 UTC

"hanya" committed SVN revision 1653973 into trunk:
#i107619# update search result while repeating search

Comment 12 hanya 2015-01-22 17:58:19 UTC

*** Issue 124521 has been marked as a duplicate of this issue. ***

Comment 13 hanya 2015-01-22 17:59:22 UTC

Fixed on trunk.

Comment 14 Kay 2015-09-07 20:46:32 UTC

I think this is fix should be included in 4.1.2.

Comment 15 Andrea Pescetti 2015-09-24 07:03:38 UTC

Approved for 4.1.2; this could really benefit from a unit test though.

Comment 16 SVN Robot 2015-09-26 21:51:30 UTC

"kschenk" committed SVN revision 1705489 into branches/AOO410:
#i107619# Merged from r1653973.

Comment 17 Francis C. Costero 2015-10-04 02:36:14 UTC

I still see the problem with 4.1.2 and one of the original example cases. I have 3 cells containing
987
a b
a

With 
Find: ([^ ]*)[ ]*([^ ]*) 
Replace: $1$2
I get the following after clicking Replace All
987987
ab
aa


If I use
Find: ([0-9])
Replace: $1
I do not see the originally reported problem that 987 becomes 999 with Replace All. The 987 is left unchanged. 
I'm using Windows 7.

Comment 18 Kay 2015-10-05 21:37:53 UTC

(In reply to Francis C. Costero from comment #17)
> I still see the problem with 4.1.2 and one of the original example cases. I
> have 3 cells containing
> 987
> a b
> a
> 
> With 
> Find: ([^ ]*)[ ]*([^ ]*) 

Match zero or more occurences of a space following the beginning (grouping 1),  followed by  zero or more occurrences of "space", followed by match zero or more occurrences of a space following the beginning of the line (grouping 2).

> Replace: $1$2
> I get the following after clicking Replace All
> 987987

This is correct. In this case $1 and $2 are identical because there is zero or no occurrence of the "intervening" space.

> ab

This is correct. There was an intervening space.

> aa

This is correct and the same as the "987" string case.

> 
> 
> If I use
> Find: ([0-9])
> Replace: $1
> I do not see the originally reported problem that 987 becomes 999 with
> Replace All. The 987 is left unchanged. 

Each number essentially matches itself and no changes.

> I'm using Windows 7.

Comment 19 mikekaganski 2015-10-05 22:48:48 UTC

(In reply to Kay from comment #18)
> (In reply to Francis C. Costero from comment #17)
> > Find: ([^ ]*)[ ]*([^ ]*) 
> 
> Match zero or more occurences of a space following the beginning (grouping
> 1),  followed by  zero or more occurrences of "space", followed by match
> zero or more occurrences of a space following the beginning of the line
> (grouping 2).

Incorrect. [^ ] doesn't mean "a space following the beginning of the line", it's "Any character except space" syntax. Here "space" means character 0x20, not "any whitespace character" like [:space:], of course.

If it were "a space following the beginning of the line", then in any of the following examples both groups would be empty (there's no lines starting with space).

Thus, this means (taking greedy processing of "*" into account): put as much non-space characters as possible into the first grouping (possibly empty if the first char is space), THEN skip any spaces (if present), THEN take as much as possible of non-space characters LEFT into the second grouping (maybe zero if no other characters after those spaces). Anyway, first and last groupings represent DIFFERENT groups of characters, they may have identical value only if those different groups have identical characters in them.

> > Replace: $1$2
> > I get the following after clicking Replace All
> > 987987
> 
> This is correct. In this case $1 and $2 are identical because there is zero
> or no occurrence of the "intervening" space.

Incorrect. In this case $1 MUST be "987", and $2 MUST be "".

> > ab
> 
> This is correct. There was an intervening space.

OK.

> > aa
> 
> This is correct and the same as the "987" string case.

No, this is INCORRECT and the same as the "987" string case.

Comment 20 Kay 2015-10-05 23:00:57 UTC

(In reply to mikekaganski from comment #19)
> (In reply to Kay from comment #18)
> > (In reply to Francis C. Costero from comment #17)
> > > Find: ([^ ]*)[ ]*([^ ]*) 
> > 
> > Match zero or more occurences of a space following the beginning (grouping
> > 1),  followed by  zero or more occurrences of "space", followed by match
> > zero or more occurrences of a space following the beginning of the line
> > (grouping 2).
> 
> Incorrect. [^ ] doesn't mean "a space following the beginning of the line",
> it's "Any character except space" syntax. Here "space" means character 0x20,
> not "any whitespace character" like [:space:], of course.
> 
> If it were "a space following the beginning of the line", then in any of the
> following examples both groups would be empty (there's no lines starting
> with space).
> 
> Thus, this means (taking greedy processing of "*" into account): put as much
> non-space characters as possible into the first grouping (possibly empty if
> the first char is space), THEN skip any spaces (if present), THEN take as
> much as possible of non-space characters LEFT into the second grouping
> (maybe zero if no other characters after those spaces). Anyway, first and
> last groupings represent DIFFERENT groups of characters, they may have
> identical value only if those different groups have identical characters in
> them.
> 
> > > Replace: $1$2
> > > I get the following after clicking Replace All
> > > 987987
> > 
> > This is correct. In this case $1 and $2 are identical because there is zero
> > or no occurrence of the "intervening" space.
> 
> Incorrect. In this case $1 MUST be "987", and $2 MUST be "".
> 
> > > ab
> > 
> > This is correct. There was an intervening space.
> 
> OK.
> 
> > > aa
> > 
> > This is correct and the same as the "987" string case.
> 
> No, this is INCORRECT and the same as the "987" string case.

OMG! I stand corrected! you are right, mikekaganski!

Comment 21 Pedro 2015-10-09 11:34:28 UTC

Confirmed fixed with

AOO412m1(Build:9780)  -  Rev. 1705625
2015-09-28 12:45:04 (Mo, 28 Sep 2015)

under Windows XP Pro x86 SP3

Comment 22 hanya 2015-10-09 17:00:44 UTC

(In reply to Francis C. Costero from comment #17)
> I still see the problem with 4.1.2 and one of the original example cases. I
> have 3 cells containing
> 987
> a b
> a
> 
> With 
> Find: ([^ ]*)[ ]*([^ ]*) 
> Replace: $1$2
> I get the following after clicking Replace All
> 987987
> ab
> aa
I observed strange result with: 
A1: 987
Find and replace: written in the above
other cells: leave them empty
Click "Replace All" button. Sometimes strange result is not shown but 
if you push the button sometimes, the strange result can be observed.

It seems this problem is happen because of index out of bounds of sequence 
in TextSearch::searchForward method.
In the above case, offset sequence has [0, 1, 2] elements.
startOffset of the match: [0, 0, 3]
endOffset of the match: [3, 3, 3]

In the TextSearch::searchForward method: 
        for ( int k = 0; k < sres.startOffset.getLength(); k++ )
        {
            if (sres.startOffset[k])
	      sres.startOffset[k] = offset[sres.startOffset[k]];
When k == 2, sres.startOffset[k] gives 3 to refer offset[3] which gives 
illegal value from the out of the sequence. 
If offset[3] gave large number, the result in the cell was correct. 
But offset[3] was 0, it gave 0-3 sub string for $2.
This behavior is simply broken.

I have no idea how to fix the index miss match now.

Comment 23 hanya 2015-10-10 08:32:45 UTC

The illegal index for the offset is happen when the start of the match result 
is started at the end of the string.
In such case, next position from the last element of the offset can be used.

Comment 24 hanya 2015-10-10 08:57:05 UTC

Fixed on the trunk, revision 1707844.

Comment 25 Kay 2015-10-14 21:16:52 UTC

hanya has committed a revision to this bug. Thanks again, hanya! From my current testing using entries from Comment #17, I now get:

987
ab
a

after either a Replace or Replace All, which I believe is what we want for the result, yes?

Please test from trunk nightly builds if you can. The previous fix has already been committed to the AOO410 branch. If we get some positive feedback over the next day or so, I will re-request this as a 4.1.2 release blocker, and,  hopefully get this new update included. 

Setting this back to reopened for the time being.

Comment 26 hanya 2015-10-15 06:14:06 UTC

This problem can be happen on Writer also. The same result can be seen in the paragraphs.

Comment 27 Andrea Pescetti 2015-10-19 12:49:31 UTC

The new part of the patch was committed to AOO410 in revision 1709405 for OpenOffice 4.1.2-RC3.

Comment 28 Pedro 2016-08-31 10:09:43 UTC

The described problems no longer occurs.

Verified fixed with AOO420m1(Build:9800)  -  Rev. 1757080 under Windows 7 Pro x64 SP1

Comment 29 Kay 2016-08-31 22:25:27 UTC

Closing.