Issue 82473

Summary: Regular Expressions that replace-all can match the result of the replace
Product: Writer Reporter: caolanm
Component: code Assignee: AOO issues mailing list <issues>
Status: CONFIRMED --- QA Contact:
Severity: Trivial    
Priority: P3 CC: issues, openoffice
Version: OOo 2.3   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---

Description caolanm 2007-10-10 13:50:28 UTC
i.e. consider a writer document with just ABCZDEFZGHI in it. If we use a regexp of
^[^Z]+Z
then we want to match text that starts at the beginning of the paragraph, has
one or more characters that are not Z, and is followed by a final Z. Taking the
above example and just using "find all", we simply match the ABCZ above. The
problem arises when we use "replace all" and set the replace string to nothing:
then we get GHI. We're replacing the ABCZ with nothing, but then we're
apparently re-running the regexp on DEFZGHI, which is now the new beginning of
the paragraph.

i.e. for replace all, instead of 
echo ABCZDEFZGHI | sed -r -e 's/^[^Z]+Z//'
we have effectively 
echo ABCZDEFZGHI | sed -r -e 's/^[^Z]+Z//' | sed -r -e 's/^[^Z]+Z//'
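
The difference between the two sed pipelines above can be sketched in Python. This is only a model of the reported behaviour, not OOo's actual code: replace_all_rescan is a hypothetical helper, Python's re module stands in for the Writer regex engine, and the replacement is treated as literal text. It relies on the fact that, in Python, "^" matches only at the real start of the string even when a search is resumed at a later position.

```python
import re

def replace_all_rescan(pattern, repl, text):
    """Hypothetical model of the reported loop: search, replace, then
    search the modified text again from the end of the replacement."""
    rx = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        # Splice the literal replacement into the text.
        text = text[:m.start()] + repl + text[m.end():]
        # Resume at the end of the replaced string.
        pos = m.start() + len(repl)
        if m.start() == m.end():
            pos += 1  # empty match: step forward so the loop always terminates
    return text

single_pass = re.sub(r"^[^Z]+Z", "", "ABCZDEFZGHI")
rescanned = replace_all_rescan(r"^[^Z]+Z", "", "ABCZDEFZGHI")
print(single_pass)  # DEFZGHI
print(rescanned)    # GHI
```

With an empty replacement the resume position is still 0, so the anchor matches again on the shortened text; with a non-empty replacement the resume position moves past the start and the anchor can no longer match.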
Comment 1 caolanm 2007-10-10 13:54:43 UTC
Here's a perhaps more concise example, i.e. search string of 
^.{3}
and no replace string, to remove the first 3 characters. With the above example,
"replace all" keeps removing a block of 3 characters until only 2 are left,
while a replacement string of FOO will cause it to happen only once.
Comment 2 michael.ruess 2007-10-10 14:36:01 UTC
Reassigned to SBA.
Comment 3 drking 2007-10-29 20:06:35 UTC
May I suggest that this is a duplicate of issue 77376? It may also be related to
issue 81096.
Comment 4 stefan.baltzer 2007-11-01 17:29:55 UTC
SBA->AMA: Please proceed.
Comment 5 andreas.martens 2007-11-01 17:53:07 UTC
It's not a duplicate of issue 77376 nor of issue 81096.
It is simply our designed "replace all" behavior:
1. Look for the first occurrence of the search string
2. Replace it
3. Continue with 1. from the end position of the replaced string.

In the bug case, unfortunately, the replacement affects the following search
(because we replace the characters with nothing, the following character
becomes the first of the paragraph).
To solve this issue we need to search for all occurrences, save their positions,
and do the replacement at these positions without a new search. This could be
done, but it is a big effort for a small effect!
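
The "search all occurrences, save their positions, then replace" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a proposal for the Writer code: replace_all_snapshot is a hypothetical name, Python's re module stands in for the search engine, and the replacement is literal text (no backreferences).

```python
import re

def replace_all_snapshot(pattern, repl, text):
    """Find every match in the original text first, then replace
    at the saved positions without searching again."""
    spans = [m.span() for m in re.finditer(pattern, text)]
    # Work from right to left so the saved positions to the left stay valid.
    for start, end in reversed(spans):
        text = text[:start] + repl + text[end:]
    return text

print(replace_all_snapshot(r"^[^Z]+Z", "", "ABCZDEFZGHI"))  # DEFZGHI
print(replace_all_snapshot(r"^.{3}", "", "ABCZDEFZGHI"))    # ZDEFZGHI
```

Because all matches are located before any text changes, a replacement can never create a new paragraph start for the anchor to match.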

So I'd like to present a workaround:
1. "Find All" instead of "Replace All" => all occurrences will be selected
2. "Replace" or "Selection only"+"Replace All" to get the right characters replaced.

Comment 6 drking 2007-11-01 19:17:19 UTC
I understand and thank you. My view is that it *is* a bug - we have a button 
that will 'Find All' in the current text, and the 'Replace All' button should 
therefore also operate on the current text. Certainly that's what a user would 
expect.

However I guess we have to live with it. There are a whole range of similar 
situations, eg in 'this is a sentence' Find \<. (or ^.) Replace with nothing; 
in 'ababab  abababab abab' Find \<ab Replace with a<space> - but they are 
perhaps a bit contrived. I hope the \1 \2 ... backreferences in 'Replace with'  
proposed for OOo2.4 don't throw up more - I guess probably not.

I'll edit the Regex HowTo to include your sensible workaround. The trouble is 
the HowTo will soon be spending longer talking about all the bugs than it is 
about the functionality ;)

Comment 7 Joe Smith 2007-11-06 19:25:05 UTC
> In the bug case unfortunately the replacement has an effect to the
> following search (because we replace the characters with nothing the
> following character becomes the first of the paragraph).

Well, I'm no regex lawyer, so I can't say that it's "wrong" or if it's a bug or
not, but as a long-time regex user, I can say that no other regex function that
I've ever seen works this way. And it seems pretty clear to me that the way OOo
handles this currently is not useful, if not wrong outright.

E.g.:

$ echo ABC | sed -e 's/^.//g'
BC
$ echo ABC | perl -pe 's/^.//g'
BC

So, neither sed nor perl consumes the entire string, as OOo does.

By the above logic, the pattern "^" would match an infinite number of times on
the first line! Instead, it matches only once for each line:

$ ls /tmp | wc -l
93
$ ls /tmp | grep '^' | wc -l
93
$ ls /tmp | sed -ne '/^/p' | wc -l
93
$ ls /tmp | perl -ne 'print if /^/' | wc -l
93

In OOo, the pattern "^" never matches anywhere at all.

I know almost nothing about coding a regex engine, but this sounds more like
part of the engine logic than part of the text positioning: If a match begins at
an anchor, then the next iteration must logically start *after* (beyond) the
anchor, replacement or not.

If the user re-runs the Find/Replace, then yes, the same positions should match
again, but not during the same "Replace All".
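
For comparison, the same two checks can be run against Python's re module. This is only another data point alongside sed and perl above (choosing Python here is an assumption of this sketch, and it says nothing about OOo's engine): a single global substitution removes one character, and "^" matches exactly once per line.

```python
import re

# One global substitution pass: "^" matches only at the start of the string,
# so only the first character is removed.
first_removed = re.sub(r"^.", "", "ABC")
print(first_removed)  # BC

# "^" matches once per line, not repeatedly at the same position.
lines = ["alpha", "beta", "gamma"]
matching = [line for line in lines if re.search(r"^", line)]
anchor_hits = len(re.findall(r"^", "alpha"))
print(len(matching), anchor_hits)  # 3 1
```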
Comment 8 drking 2007-11-07 05:22:59 UTC
What I actually wrote in the regex HowTo 
http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer
was:

"Please be careful when using the Replace All button. There are a few rare 
occasions when this will give unexpected results. For example to remove the 
first character of every paragraph you might 'Search for' ^. and 'Replace with' 
nothing; clicking 'Replace All' now will wipe out *all* your text, instead of 
just the first character of each paragraph. Issue 82473 discusses this. The 
workaround is to 'Find All', then 'Replace'; perhaps the safest way is not to 
use the 'Replace All' button at all with regular expressions."

I'm afraid it does make OOo regex look rather silly...  but the point of the 
HowTo is to describe what regexes actually do - because people keep 
re-discovering all the woes for themselves.

At some point someone will have to bite the bullet and tackle the Great Regex 
Rethink - if that is likely to be soon then work on the existing code might be 
wasted - so that must influence any decision to work on this issue now.


Comment 9 akostadinov 2009-09-03 13:33:31 UTC
Hello, to me it seems pretty simple.

Just don't process whatever has already been processed, matched and replaced.
Also, don't track positions by character number; use some other marker instead.

One of the main advantages I usually find in OSS is a good support for regular
expressions and flexibility. In fact I sometimes switch to kwrite for reg exp
processing...

And btw, some improvements that let one easily find and replace text, style, or
both would be awesome, and an unbeatable advantage over proprietary
crap software.

I'm using OO a lot so just want to thank you for your work.