Apache OpenOffice (AOO) Bugzilla – Issue 82473
Regular Expressions that replace-all can match the result of the replace
Last modified: 2013-08-07 14:44:35 UTC
Consider a Writer document with just ABCZDEFZGHI in it. If we use a regexp of ^[^Z]+Z, we want to match text that starts at the beginning of the paragraph, has one or more characters that are not Z, and ends with a final Z. Taking the above example and just "Find All", we simply match the ABCZ. The problem arises when we use "Replace All" and set the replace string to nothing: then we get GHI. We're replacing the ABCZ with nothing, but then we're apparently re-running the regexp on DEFZGHI, which is now the new beginning of the paragraph. I.e. for "Replace All", instead of

    echo ABCZDEFZGHI | sed -r -e 's/^[^Z]+Z//'

we have effectively

    echo ABCZDEFZGHI | sed -r -e 's/^[^Z]+Z//' | sed -r -e 's/^[^Z]+Z//'
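For readers without sed at hand, the same comparison can be sketched in Python. This is only an illustration of the two sed pipelines above, not of Writer's actual code; the helper names `replace_once` and `replace_rerun` are made up for this example.

```python
import re

# Same pattern as in the report: from the start of the text, one or more
# non-Z characters followed by a Z.
PATTERN = re.compile(r"^[^Z]+Z")

def replace_once(text):
    """One substitution pass over the original text (the single sed call)."""
    return PATTERN.sub("", text)

def replace_rerun(text):
    """Run the pattern again on the already-replaced text (sed piped into sed)."""
    return replace_once(replace_once(text))

print(replace_once("ABCZDEFZGHI"))   # DEFZGHI
print(replace_rerun("ABCZDEFZGHI"))  # GHI
```

The second result is what "Replace All" is reported to produce in Writer.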
Here's a perhaps more concise example: a search string of ^.{3} and no replace string, to remove the first 3 characters. With the above example and "Replace All", it keeps removing a block of 3 characters until only 2 are left, while a replacement string of FOO will cause it to happen only once.
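A rough Python model of a loop that would behave this way is sketched below. It assumes the search resumes at the end of the replaced text and that ^ only matches at the true start of the paragraph; that is a guess at the mechanism, not Writer's actual implementation. The function name `replace_all_rescan` is hypothetical, and the replacement is treated as a literal string.

```python
import re

def replace_all_rescan(pattern, repl, text):
    """Search, replace, then search again from the end of the replacement."""
    rx = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        # With a compiled pattern, search(text, pos) still treats '^' as the
        # real start of the string, not as position 'pos'.
        m = rx.search(text, pos)
        if m is None:
            break
        text = text[:m.start()] + repl + text[m.end():]
        pos = m.start() + len(repl)
        if m.start() == m.end():
            pos += 1  # guard added for this sketch: avoid looping on empty matches
    return text

print(replace_all_rescan(r"^.{3}", "", "ABCZDEFZGHI"))     # HI
print(replace_all_rescan(r"^.{3}", "FOO", "ABCZDEFZGHI"))  # FOOZDEFZGHI
```

With an empty replacement the search resumes at position 0, so the anchor matches again; with FOO it resumes at position 3, where the anchor can no longer match.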
Reassigned to SBA.
May I suggest that this is a duplicate of issue 77376? It may also be related to issue 81096.
SBA->AMA: Please proceed.
It's not a duplicate of issue 77376 nor of issue 81096. It is simply our designed "Replace All" behavior:
1. Look for the first occurrence of the search string.
2. Replace it.
3. Continue with 1. at the end position of the replaced string.
In the bug case the replacement unfortunately has an effect on the following search (because we replace the characters with nothing, the following character becomes the first of the paragraph). To solve this issue we would need to search for all occurrences, save their positions, and do the replacement at these positions without a new search. This could be done, but it is a big effort for a small effect! So I'd like to present a work-around:
1. "Find All" instead of "Replace All" => all occurrences will be selected.
2. "Replace", or "Selection only" + "Replace All", to get the right characters replaced.
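The fix outlined above (find all occurrences first, save their positions, then replace without searching again) could look roughly like this in Python. It is a sketch only: paragraphs are modelled as lines of a single string, the replacement is a literal string, and the name `replace_all_two_phase` is made up for this example.

```python
import re

def replace_all_two_phase(pattern, repl, text):
    """Collect every match on the unmodified text, then replace by position."""
    # Phase 1: the equivalent of "Find All". MULTILINE makes '^' match at the
    # start of each line, standing in for the start of each paragraph.
    spans = [m.span() for m in re.finditer(pattern, text, flags=re.MULTILINE)]
    # Phase 2: replace from the last match to the first, so the saved
    # positions of earlier matches stay valid.
    for start, end in reversed(spans):
        text = text[:start] + repl + text[end:]
    return text

print(replace_all_two_phase(r"^[^Z]+Z", "", "ABCZDEFZGHI"))    # DEFZGHI
print(replace_all_two_phase(r"^.", "", "this is\na sentence"))  # his is / sentence
```

Replacing back to front is one way to avoid recomputing offsets; keeping a running offset while going front to back would work as well.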
I understand and thank you. My view is that it *is* a bug - we have a button that will 'Find All' in the current text, and the 'Replace All' button should therefore also operate on the current text. Certainly that's what a user would expect. However I guess we have to live with it. There are a whole range of similar situations, eg in 'this is a sentence' Find \<. (or ^.) Replace with nothing; in 'ababab abababab abab' Find \<ab Replace with a<space> - but they are perhaps a bit contrived. I hope the \1 \2 ... backreferences in 'Replace with' proposed for OOo2.4 don't throw up more - I guess probably not. I'll edit the Regex HowTo to include your sensible workaround. The trouble is the HowTo will soon be spending longer talking about all the bugs than it is about the functionality ;)
> In the bug case unfortunately the replacement has an effect to the
> following search (because we replace the characters with nothing the
> following character becomes the first of the paragraph).

Well, I'm no regex lawyer, so I can't say that it's "wrong" or if it's a bug or not, but as a long-time regex user, I can say that no other regex function that I've ever seen works this way. And it seems pretty clear to me that the way OOo handles this currently is not useful, if not wrong outright. E.g.:

    $ echo ABC | sed -e 's/^.//g'
    BC
    $ echo ABC | perl -pe 's/^.//g'
    BC

So, neither sed nor perl consumes the entire string, as OOo does. By the above logic, the pattern "^" would match an infinite number of times on the first line! Instead, it matches only once for each line:

    $ ls /tmp | wc -l
    93
    $ ls /tmp | grep '^' | wc -l
    93
    $ ls /tmp | sed -ne '/^/p' | wc -l
    93
    $ ls /tmp | perl -ne 'print if /^/' | wc -l
    93

In OOo, the pattern "^" never matches anywhere at all.

I know almost nothing about coding a regex engine, but this sounds more like part of the engine logic than part of the text positioning: if a match begins at an anchor, then the next iteration must logically start *after* (beyond) the anchor, replacement or not. If the user re-runs the Find/Replace, then yes, the same positions should match again, but not during the same "Replace All".
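As a further data point alongside sed and perl, Python's `re` module behaves the same way. The sketch below is illustrative only; the sample lines and the helper names `strip_first_char` and `count_anchor_matches` are invented for this example.

```python
import re

def strip_first_char(line):
    """Global substitution anchored at the line start, like s/^.//g."""
    return re.sub(r"^.", "", line)

def count_anchor_matches(text):
    """Count how often a bare '^' matches, with one match expected per line."""
    return len(re.findall(r"^", text, flags=re.MULTILINE))

print(strip_first_char("ABC"))  # BC

lines = ["alpha", "beta", "", "gamma"]
print(count_anchor_matches("\n".join(lines)) == len(lines))  # True
```

The anchor matches exactly once per line, and the anchored substitution removes a single character even though it is applied globally.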
What I actually wrote in the regex HowTo http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer was: "Please be careful when using the Replace All button. There are a few rare occasions when this will give unexpected results. For example to remove the first character of every paragraph you might 'Search for' ^. and 'Replace with' nothing; clicking 'Replace All' now will wipe out *all* your text, instead of just the first character of each paragraph. Issue 82473 discusses this. The workaround is to 'Find All', then 'Replace'; perhaps the safest way is not to use the 'Replace All' button at all with regular expressions." I'm afraid it does make OOo regex look rather silly... but the point of the HowTo is to describe what regex actually do - because people keep rediscovering all the woes for themselves. At some point someone will have to bite the bullet and tackle the Great Regex Rethink - if that is likely to be soon then work on the existing code might be wasted - so that must influence any decision to work on this issue now.
Hello, to me it seems pretty simple: just don't process whatever has already been processed, matched and replaced. Also, don't use the character number but some other marker. One of the main advantages I usually find in OSS is good support for regular expressions and flexibility. In fact I sometimes switch to kwrite for regexp processing... and btw, some improvements so that one can easily find and replace text vs. style, and replace both, would be an awesome and unbeatable advantage over other proprietary crap software. I'm using OO a lot, so I just want to thank you for your work.