Apache OpenOffice (AOO) Bugzilla – Issue 15666
Search and Replace - can't substitute regular expression subexpression in replace
Last modified: 2013-08-07 14:42:49 UTC
Expected to be able to search for "si(ll)y" with RE option checked, and replace occurrences using subexpression replacement: "\1ogical" -> "llogical". Also, '&' in replacement string seems to get substituted with search string...
& in replacement string behaves as expected and specified in the help file. I know it's not what 'normal' REs do. The inability to use matched substrings in replacement expressions is confirmed. It's not stated in the help that it should be possible: again, coming from experience of 'real' REs, that looks silly. I'm not sure, though, whether it counts as a defect or an enhancement.
Reassigned to SBA
It's true that a "&" in the replace box is designed to replace the found string with itself. Have a look at the help to see the RegEx options (as they were designed for this product). I don't get the first problem you report. Please be more precise with the strings you enter, and the strings you hope to replace, and with WHAT. I think that a small sample (i.e. in an attached document that has the respective strings "ready for copy") could be helpful here. Please comment. Thank you.
Created attachment 7904 [details] Sample expected behavior
Also in attachment: Here are a list of people: LN=Blister FN=Mister LN=Blow FN=Joe LN=Sally FN=Silly But I really wanted a friendlier format, so I turn on regexp find and replace, search for: ^\s*LN=(.*)\s+FN=(.*)\s*$ with a replace string of: Hello, \2. How are you and the rest of the \1's? But alas, it does not work that way in OpenOffice.org. I can only substitute the whole string (where is that in the docs, anyway?), and \s doesn't seem to match [:space:]... Hope this helps...
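For comparison, here is a minimal sketch in Python (not OpenOffice.org code) of the behaviour the attachment asks for. It assumes the sample has one person per line, and uses Python's `re` module, where `\1` and `\2` in the replacement refer to the first and second parenthesised groups.

```python
import re

# Sample lines modelled on the attachment (assumed to be one person per line).
lines = [
    "LN=Blister FN=Mister",
    "LN=Blow FN=Joe",
    "LN=Sally FN=Silly",
]

# Same search expression as in the comment above.
pattern = re.compile(r"^\s*LN=(.*)\s+FN=(.*)\s*$")

# \2 is the first name, \1 is the last name.
template = r"Hello, \2. How are you and the rest of the \1's?"

greetings = [pattern.sub(template, line) for line in lines]
for greeting in greetings:
    print(greeting)
```

Note that Python understands `\s`, which the search dialog discussed here does not.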
I also think that the possibility to substitute regular expressions in the replace string is important. For example, I often have to search for digit ranges delimited by hyphens (like 121-122), and replace them with the same ranges, but delimited with en dashes. Currently I use regexps to find matching strings, but I have to edit them manually :(. And there are other requests for the same functionality: see 11037 and 15515, which probably should be marked as duplicates of this issue (or this issue as a duplicate of one of those two).
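The digit-range case can be sketched as follows, again in Python rather than in the search dialog. The function name `hyphen_ranges_to_en_dash` is made up for this illustration.

```python
import re

EN_DASH = "\u2013"


def hyphen_ranges_to_en_dash(text: str) -> str:
    # \1 and \2 carry the two numbers over unchanged; only the separator changes.
    return re.sub(r"(\d+)-(\d+)", r"\1" + EN_DASH + r"\2", text)


converted = hyphen_ranges_to_en_dash("See pages 121-122 and 130-135.")
print(converted)
```

Without group references in the replacement, the numbers themselves cannot be preserved, which is why the ranges currently have to be edited by hand.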
\# should be replaced by the contents of the #th pair of round brackets (). This is stated in the OnlineHelp: ( ) Defines the characters inside the brackets as a reference. You can then refer to the first reference in the current expression with "\1", to the second reference with "\2", and so on. For example, if your text contains the number 13487889 and you search using the regular expression (8)7\1\1, "8788" is found. #### It works in the search field (example: "(d)i\1" finds "did") but you cannot use this in the replace-field (where it would be much more useful). As mentioned before, OnlineHelp doesn't state that it is possible to use this in the replace-field (so the documentation is not wrong) - but it would be nice if it worked. (I also disagree that & is unusual to be replaced by the matched string - sed, for example, uses the same thing.) > "and \s doesn't seem to match [:space:]" \s is not listed as a short form of [:space:], why do you expect this to work? The index-entry is: "regular expressions; list of"
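The two search-side examples from the help text can be reproduced with any engine that supports backreferences in the pattern. A small Python check, shown only to illustrate what the help describes:

```python
import re

# (8)7\1\1 : an 8, a 7, then the captured 8 twice more.
digits_match = re.search(r"(8)7\1\1", "13487889")
found_digits = digits_match.group(0)

# (d)i\1 : a d, an i, then the captured d again.
word_match = re.search(r"(d)i\1", "he did it")
found_word = word_match.group(0)

print(found_digits, found_word)
```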
*** Issue 11037 has been marked as a duplicate of this issue. ***
*** Issue 15515 has been marked as a duplicate of this issue. ***
Real-life example from issue 15515: "The date is 2003-06-11." should be changed to "The date is 11-06-2003." using the regex-search: Search for: ([:digit:]{4})-([:digit:]{2})-([:digit:]{2}) (the 1st, 2nd and 3rd pair of brackets) Replace with: \3-\2-\1 This is expected to do the job, but instead 2003-06-11 gets replaced by "\3-\2-\1"
I'm new here so if I make a mistake, please excuse me. I don't know if what I'm going to describe should be a different issue... When you find a regular expression that includes [:cntrl:] and replace with &, it should leave the text unchanged, but instead the control characters disappear. For example, I have: text[illustration] Search for text[:cntrl:] Replace with & When I hit replace only text remains, the illustration field is gone. Thanks for your help
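The difference between the expected and the reported behaviour of the date example can be shown side by side. This is a Python sketch, with `\d` standing in for the `[:digit:]` class. Passing a function as the replacement makes Python insert the text literally, which mimics what the report says the dialog does.

```python
import re

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
sentence = "The date is 2003-06-11."

# Expected: group references in the replacement are expanded.
expected = DATE.sub(r"\3-\2-\1", sentence)

# Reported: the replacement text is inserted literally, without expansion.
reported = DATE.sub(lambda match: r"\3-\2-\1", sentence)

print(expected)
print(reported)
```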
I have changed this to an enhancement, though a very important one. It's not, strictly speaking, a defect that you can't replace portions of a regex: the help doesn't say that it should be possible. But it is a useful and powerful technique, found in almost every other implementation of regexes, and we ought to have it. So it's an enhancement
*** Issue 22592 has been marked as a duplicate of this issue. ***
I also think backreferences in the replace string are an important feature and should be implemented (even msword has them). A small suggestion about improving Help on Regular Expression: An example of using backreferences in the find string is pointless (explanation of "()" in the List of Regular Expressions). Who needs "(8)7\1\1" to find "8788"? Users who never used regular expressions before might not appreciate the value of backreferences. A more informative example would be something like finding palindrome words 3 chars long (did, dad, bob): "\<(.).\1\>"
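The suggested palindrome example, adapted to Python as a rough illustration: Python has no `\<` and `\>`, so `\b` is used for the word boundaries, and `\w` replaces `.` so that spaces cannot take part in a match. Both substitutions are choices made for this sketch, not part of the suggestion.

```python
import re

text = "bob said dad did not see the cat"

# One word character, any word character, then the first one again,
# with a word boundary on both sides.
palindromes = [m.group(0) for m in re.finditer(r"\b(\w)\w\1\b", text)]
print(palindromes)
```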
SBA: Reassigned to AMA.
We should rework our support of RE, but we will not be able to do this in OOo2.0 :-(
keywords & component set according to new RFE-eval process... OS, Platform=ALL
*** Issue 43397 has been marked as a duplicate of this issue. ***
I can't believe OpenOffice made it to version 2 and still doesn't have a proper regular expression replacement feature. No offence intended mr andrewb, but this IS a DEFECT not an ENHANCEMENT. Who came up with the pointless backreferences "FEATURE"? What were they thinking?
I don't think this is a Writer specific issue. It's a general OOo issue. So please change the component accordingly (IMHO to framework). Thanks. The current behaviour is really suboptimal. RegExes would be MUCH more useful, if one could address matched subexpressions in the replacement string. When this feature is implemented, issue 46015 (Support for less greedy RegExes) would make more sense, too.
*** Issue 50043 has been marked as a duplicate of this issue. ***
Maybe it's time to change the issue type to "bug" from "enhancement" ? And set a target milestone. There are two reasons: 1. the common, standard feature is not available. 2. this may lead developers not to forget it :-).
The developers are perfectly capable of postponing issues marked as bugs, so I don't think that changing this designation will help. I agree that it's an urgently needed improvement. I agree that the specification is pretty much useless. But as a QA volunteer I know I can do nothing to persuade developers to take up issues just because I think they are urgent. Classifying things correctly by the rules does help a little. If MS Office does this the right and sensible way, I will add a keyword for office interoperability, which does seem to get management attention.
The following excerpt from http://office.microsoft.com/en-us/assistance/HA010873051033.aspx shows that Office XP supports backreferences: 5. Click the Replace tab, and then enter the following characters in the Find what box. Make sure you include the space between the two sets of parentheses: (<*>) (<*>) 6. In the Replace with box, enter the following characters. Make sure you include the space between the comma and the second slash: \2, \1
I agree that this is a serious defect (if not quite a bug), and I'm quite disheartened that this has been an issue for over 2 years.
*** Issue 53775 has been marked as a duplicate of this issue. ***
Hi! This issue has been open for more than two and a half years now. Does anybody care to solve it?
I just sent a letter to the governing council explaining that this is very basic functionality that is missing from the product suite and it's embarrassing that they allow teams to explore lofty new projects without getting the basics completed first. I don't know about you, but this continues to be a deployment showstopper for me.
>> "and \s doesn't seem to match [:space:]" >\s is not listed as a short form of [:space:] why do you expect this to work? It's pretty standard: http://www.regular-expressions.info/charclass.html#shorthand
*** Issue 60029 has been marked as a duplicate of this issue. ***
MS advertises that its products are used by knowledge professionals. As a financial analyst, I fall into that group. I use spreadsheets extensively to handle data imported from a variety of sources. Often the data needs to be massaged into a form that can then be manipulated by standard spreadsheet functions. The lack of a RegEx replace functionality is a critical defect. If I were using Windows, I would have to revert to Excel for this single functional absence. If OO meets its specification as given by Help and does not have a RegEx replace, then the DEFECT is in the Help specification. This issue is given as part of Write, but it is a part of the whole suite.
Created attachment 34070 [details] proposed patch
I've attached my proposal to this problem (only for writer for now). Since this is my first patch, I don't know that much about hacking OOo. So while the code does what I want (using \n in replace string inserts the content of the nth bracket), it might not please everyone. And in some places I'm not even sure how to handle one situation or another (How should '\x' in the replace string be treated, since it hasn't any special meaning? Or how should '\4' in the replace string be handled if there's no capture group 4?). Besides adding the ability to use \1, \2 a.s.o. in the replace string, I also removed the special meaning of & (in text replace, not in attribute or format replace) since this is non standard and unexpected (although documented behaviour). Instead \0 now does the same trick and is more straightforward. Also something like '\\\t' now works as I'd expect: '\'0x09. Formerly it resulted in the string '\\t'. I didn't try implementing \s as placeholder for whitespace. This would have required hacking the regex lib (OOo uses a modified version of the GNU regular expression library 0.12 which also misses some advanced features like non greedy quantifiers) quite deeply. IMHO this would be better handled in a separate bug.
At last someone is looking at this issue!!!! Well done cheyrich. Whilst hacking the S&R code is needed, I wondered if a macro could be written in Python (which has reasonable regex behaviour), and then attached to a tool bar. Since I am a Perl person (and no time to learn python now), and there is not yet an OO/perl interface, I have not investigated this possibility.
cheyrich, This is a good first step for the RE functionality. It gets us past the current limitations. I am not sure what you are saying about the & substitution versus using \0. & in the replace field should return the entire matched string from the search field. If OO is using a broken version of GNU Regular expressions, then should this also be fixed? I haven't hacked OO either, but I am willing to give some of this a go. Gregg
> This is a good first step for the RE functionality. It gets us past the > current limitations. Thanks. I just realized that in this code, registers usable in the replace string are limited to 1-digit, so 9 submatches. This could be extended to 2-or-more-digit registers, but would complicate the code. I guess 9 is plenty. > I am not sure what you are saying about the & substitution versus using \0. > & in the replace field should return the entire matched string from the > search field. Of course it's possible to reimplement the & as special character. Users that are used to it this way wouldn't have to change over, but I don't see a reason for having & match the whole string. \0 for all and \1-\9 is more straightforward (to code and to learn). > If OO is using a broken version of GNU Regular expressions Er no, I didn't write broken - it's modified (mostly to make it use classes). What I meant with "misses some advanced features" is the lib misses them - also in the FSF's original version. If some more PCRE functionality is required it will be harder to implement since registers have already been supported and used internally. More than that, in my patch actually only *one* line of code was added to the regex code, the rest is in OOo code.
I thought '&' means the entire match. For example, Text: This is a search in the Open Office suite. Search: Op(.*)ice Doing a replace with '\1' (or '\0') would return "en Off" (which would become "...in the en Off suite." while '&' would return "Open Office" (which would then become "...in the Open Office suite."). Without '&' the search would need to be "(Op)(.*)(ice)" instead. Say you wanted to bold all instances of "Open Office" in HTML; you could just search for "open office" (case insensitive) and replace with "<b>&</b>" and not change the case everywhere.
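The distinction drawn above can be tried out in Python, with one caveat: Python's `re` has no `&`, so `\g<0>` (the entire match) plays that role in this sketch.

```python
import re

text = "This is a search in the Open Office suite."

# \1 inserts only what the parentheses captured ("en Off").
only_group = re.sub(r"Op(.*)ice", r"\1", text)

# \g<0> inserts the entire match, so the text comes back unchanged.
whole_match = re.sub(r"Op(.*)ice", r"\g<0>", text)

# Wrap every case-insensitive hit in <b>...</b> without altering its case.
bolded = re.sub(r"open office", r"<b>\g<0></b>",
                "Open Office and open office", flags=re.IGNORECASE)

print(only_group)
print(whole_match)
print(bolded)
```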
If it is possible, I would keep the '&' as the whole pattern substitution since it is standard for regular expressions. Even microsoft uses '&' for whole pattern substitution. I would be happy with 9 sub-expression registers. I agree that for the time being basic RE functionality similar to the basic substitution capability of 'sed' would be fine. In the future, it might be worthwhile to update the RE code to include the newer GNU extensions. Does the current RE library in OO support posix extensions such as [:space:] etc? I think that would be more important than '/s'.
Some comments from me... > Besides adding the ability to use \1, \2 a.s.o. in the replace string, I also > removed the special meaning of & (in text replace, not in attribute or format > replace) since this is non standard and unexpected No, this is a standard. e.g. sed and a couple of other utilities use it. Furthermore: Having it work in one situation and not in another is a nightmare both documentation-wise and with regard to usability. I'd suggest to keep it since it has been there for ages (has been there long before OOo was born). As mentioned in another comment \0 could be "match all groups" > (although documented behaviour). Instead \0 now does the same trick and is > more straightforward. I'd call that more exotic and far from being straightforward. > Also something like '\\\t' now works as I'd expect: Insert a backslash followed by a tabulator? > '\'0x09. Formerly it resulted in the string '\\t'. I guess you just misplaced the quote. Is it possible to add the other escape-sequences as well? (like for newline (as opposed to paragraph-break that unfortunately already is \n in the replace-box)) glebovitz wrote: > Does the current RE library in OO support posix extensions such as [:space:] > etc? I think that would be more important than '/s'. Sure. Just have a look at the help for regular-expression search.
Created attachment 34181 [details] another try
> I thought '&' means the entire match. For example, It does and \0 also does. But I must admit that I was wrong in thinking perl knows \0, it's only available in PHP's preg_* and ereg_*. & is new to me as a regex operator, but I mainly know regex from Perl and PHP, not from sed and so on. >> Also something like '\\\t' now works as I'd expect: > > Insert a backslash followed by a tabulator? Yes. But the current code only removes *one* backslash, regardless of how many exist in the sequence. > Is it possible to add the other escape-sequences as well? (like for > newline (as opposed to paragraph-break that unfortunately already is > \n in the replace-box) Should be possible, if I know what sequences to use. Currently I can't see what difference Return vs. Shift+Return makes in practice. Frankly, I don't like & (as well as $1...$9) because it complicates (and slows down) the replace code. If you've looked at the patch, you know it loops over the replace string, searching for a backslash. If it encounters one, it looks at what character is next. If the code has to be able to handle \ as well as & (and maybe $) as first char of a special sequence, I currently think I'll need to loop over the string several times in several loops. For now I've modified ActualStrReplace() and prefaced the main Search loop with another in which unescaped & are replaced by \0. So & and \0 are synonyms now. I'd be happy if someone would say anything about the actual code. Maybe there's some even simpler way to do the replace. And there might exist some better method to use for the main loop than Search (like it would be strcspn for plain C-strings), but I'm not that familiar with all the string handling in OOo.
I looked at the code and it seems reasonable. It looks like you aren't handling \n in the replacement string. You could add functionality where \n inserts a paragraph mark and \r (return) inserts a line break. Handling the & looks a little complicated. It seems like you need a search function that can look for multiple strings at once. That way you wouldn't need to go through the contortion of replacing all the unescaped & with \0. Gregg
> It seems like you need a search > function that can look for multiple strings at once. The Boyer-Moore string search algorithm* seems to handle those cases pretty well. * http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
I'd like \r and \n , too (like glebovitz has suggested)! It's currently a bit silly that OpenOffice lets you substitute line breaks with paragraph breaks, but not vice versa. And it's silly that one does that by substituting \n with \n. The current behaviour is counter-intuitive, one would expect that substituting \n with \n will result in no change. Cheyrich, it would be nice to have \r and \n, so we can change line breaks to paragraph breaks (\r -> \n) and paragraph breaks to line breaks (\n -> \r). What do you think about it?
> I'd like \r and \n , too (like glebovitz has suggested)! While that looks like a reasonable request and I'd try to solve that issue, I guess it's better to file a separate bug on this. I had some closer looks at the code and the return, paragraph, newline handling involves different code, mainly because it doesn't manipulate only a plain string but also copes with creating, separating and joining nodes. I also think \n in search and \n in replace should mean the same, but I fear this will also raise resistance since it breaks with the current design and thus will confuse long time users. So I request you to please file another bug on this and tell us/me the No.
Created attachment 34241 [details] version 3 of my proposal, now with real & support
@glebovitz > Handling the & looks a little complicated. It seems like you need a search > function that can look for multiple strings at once. That way you wouldn't need > to go through the contortion of replacing all the unescaped & with \0. It not only looks a little complicated, it really is. Indeed, I needed a search function that can look for multiple strings, or rather multiple characters in this case. That's what I meant with "strcspn" in my comment. "Needed" because, as you might have already noticed from my latest attachment, I found it: SearchChar(). So now the &-handling only takes a few lines of additional code in the main loop. So I'm quite happy now, hopefully any real OOo hacker will also be.
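The single-loop approach described here can be illustrated roughly as follows. This is a Python sketch and not the attached patch: `find_any` is a hypothetical stand-in for a strcspn-style search, and treating an unknown escape as the character itself and a missing group as empty text are assumptions made for this sketch, since the thread leaves both questions open.

```python
import re

SPECIAL = "\\&"  # characters that start a special sequence in the replace string


def find_any(s: str, chars: str, start: int) -> int:
    """Index of the first character from `chars` in `s` at or after `start`, else -1."""
    for i in range(start, len(s)):
        if s[i] in chars:
            return i
    return -1


def expand_replacement(template: str, match: re.Match) -> str:
    """Expand &, \\0-\\9 and backslash escapes in one pass over `template`."""
    out = []
    pos = 0
    while True:
        hit = find_any(template, SPECIAL, pos)
        if hit == -1:
            out.append(template[pos:])       # no more special characters
            break
        out.append(template[pos:hit])        # literal text before the hit
        if template[hit] == "&":
            out.append(match.group(0))       # & is the entire match
            pos = hit + 1
        elif hit + 1 < len(template):
            nxt = template[hit + 1]
            if nxt in "0123456789":
                n = int(nxt)
                # Assumption: a group that does not exist expands to nothing.
                value = match.group(n) if n <= match.re.groups else None
                out.append(value or "")
            else:
                out.append(nxt)              # assumption: \c yields c, so \\ and \& are literals
            pos = hit + 2
        else:
            out.append("\\")                 # lone trailing backslash is kept
            pos = hit + 1
    return "".join(out)


m = re.search(r"(\d+)-(\d+)", "121-122")
print(expand_replacement(r"\2..\1 [&]", m))
```

Because one scan looks for both `\` and `&`, no preliminary pass that rewrites `&` into `\0` is needed.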
cheyrich, I looked at all the code and I think I can take the search method from the String class and write a function that will take an array of chars and search simultaneously for all of them. Would you like me to take a crack at this? Gregg
Gregg, > I looked at all the code and I think I can take the search method from the > String class and write a function that will take an array of chars and search > simultaneously for all of them. Sorry, I must have missed something. As I mentioned, I already found SearchChar() which does exactly this. Shouldn't I use this method, because of the warning in string.hxx: "THIS CODE IS DEPRECATED. DO NOT USE IT IN ANY NEW CODE. Use the string classes in rtl/ustring.hxx and rtl/ustrbuf.hxx (and rtl/string.hxx and rtl/strbuf.hxx for byte-sized strings) instead." I just discovered this while looking around because of your comment. So if this SearchChar() method should be reimplemented in the String class, you can of course do this. I don't get the difference between ByteString/UniString and String at the moment (besides that String looks like it misses many functions and isn't Unicode-capable). Christian
christian, I missed your comments about the SearchChar function. If you found a function that does what you need then by all means use that. I spent some time looking at the various string classes and it looks to me like byte_string and unistring are part of the old String class. I think byte_string and unistring are currently '#defined' as String. The new replacement, I think, are OUstring and Ustring in the rtl libraries. By the way, I looked at prisonerofpain's suggestion for the boyer-moore search function and it is much too complex for what you (we) need 'cuz we are only searching for single characters ('\' and '&'). Gregg
Gregg, > I missed your comments about the SearchChar function. If you found a function > that does what you need then of by all means use that. Ah, ok - that's an explanation. So I'll stick with that. > The new replacement, I think, are OUstring and Ustring in the rtl libraries. From the comment I quoted I'd say yes. At that time I hadn't looked at these classes. And as it looks to me now, they're not equivalent since read only String and no useable search method. > I looked at prisonerofpain's suggestion for the boyer-moore search Yep, BM is overkill for short search strings and it's also new to me that you can search for multiple independent chars with it.
So, will this patch get accepted to the OO code tree?
Andreas, Christian, I took a short glance at this patch, it looks viable in general (apart from German comments in new code, hey, we should stop that ;-) However, as regexp search&replace is also used by Calc, we should offer the replacement functionality of ActualStrReplace() at a more common place, i.e. the utl::TextSearch wrapper. The method (name? RegexReplace?) should take parameters util::SearchResult, original string and replacement string. It should return a value indicating whether all replacements were done or errors occurred, e.g. more backreferences than search groups, or result string overflows. Maybe a sequence of the length of the number of backreferences to be able to point to the place of error. Just a quick thought. Eike
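One way to read the error-reporting idea is a check that returns the positions of backreferences which have no matching search group. The Python sketch below illustrates only that idea; the function name `invalid_backreferences` and its return shape are invented here and do not describe the utl::TextSearch interface.

```python
import re
from typing import List


def invalid_backreferences(replacement: str, group_count: int) -> List[int]:
    """Offsets in `replacement` of references \\1-\\9 that exceed `group_count`."""
    bad = []
    i = 0
    while i < len(replacement):
        if replacement[i] == "\\" and i + 1 < len(replacement):
            nxt = replacement[i + 1]
            if nxt in "123456789" and int(nxt) > group_count:
                bad.append(i)
            i += 2   # skip the escaped character so an escaped backslash is not misread
        else:
            i += 1
    return bad


# Two groups in the search expression, but the replacement refers to a third.
search = re.compile(r"(\d{4})-(\d{2})")
errors = invalid_backreferences(r"\3-\2-\1", search.groups)
print(errors)
```

An empty list would mean every reference can be resolved. A caller could use the offsets to point the user at the offending place in the replace box.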
It would be good to have a summary of the RegEx behaviour, viz., what is replaced by what.
Eike > I took a short glance at this patch, it looks viable in general (apart from > German comments in new code, hey, we should stop that ;-) First it's easier for me to write in German and second I used it because comments in that function already were in this language. But yes, though most of my comments aren't really necessary (I always prefer commenting too much rather than too little) I can change that if you want. > However, as regexp search&replace is also used by Calc, we should offer the > replacement functionality of ActualStrReplace() at a more common place, i.e. > the utl::TextSearch wrapper. [...] I'll try addressing this ASAP. But because I'm still an OOo-coding-newbie I can't guarantee *if*, and because I'm in a transition to another OS on my computer I can't tell *when*.
Sorry, my previous request (below) was too short to be meaningful. Since there are a number of RegEx behaviours, it would be useful to have a summary of the way the current Search and Replace patch is supposed to work. (viz., the & replacement). >It would be good to have a summary of the RegEx behaviour, viz., what is replaced by what.
@cheyrich: heavy documentation in the code is not bad! er just wanted to say that new comments should be written in english, not german (so that all devs can understand them)
@all: IMHO the following is how it *should* be:
..in replacement box    does this..
&         inserts complete string that matched (as it does now)
\1        inserts group number one - the match enclosed in the first pair of matching (round) parentheses
\2 \#     inserts the second and #th matching group
\0        inserts all matching groups
\n        inserts a newline (linebreak - as inserted with <shift>+<enter> in normal text)
\r        inserts a paragraph break
\t        inserts a tabulator
\xFFFF    inserts the character matching the hex-code FFFF
\c        where c is not one of [0-9nrtx] inserts the character c (*)
\\        inserts a backslash
(*) the list may not be exclusive, depending on what other escape-sequences are added - maybe to insert a non-breaking hyphen/space
> er just wanted to say that new comments should be written in english, When continuing work, I'll translate all comments. > \0 inserts all matching groups I guess there are different opinions on how it *should* work. As I wrote earlier, I know \0 to work like &, i.e. contain the complete string. That's how PHP's ereg* and preg* functions as well as - and that's the main reason it works as it does - FSF's GNU regular expression library, which OO uses, work.
Where is the functionality for \0 => all match groups specified? POSIX? Gnu? If there is a conflict over expected function, then we should probably follow the standards. Gregg
AFAIK there is no standardized specification for \0. However, quite a few implementations use that extension, e.g. GNU sed and, as I've read, the .NET framework. Btw, IEEE Std 1003.1, 2004 Edition, also does not define the '&' ampersand for backreference. See http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
I would be happy to get rid of the "&". I'm not used to treat it as a special character. For many characters (like backslash and all kinds of brackets eg) i have the intuitive feeling of "oh, this might be a special character, so i better escape it...", but OOo is the only place i know where "&" has a special meaning so i would never have the idea to escape it...
I've been using "vi" for 21 years and ALAICR "&" has had a special meaning. I use it all the time in regular expressions and would feel the functionality is missing if it were removed.
I may be missing something, but the IEEE document that er posted specifies the regular expression for regexec and regcomp only, and does not specify the syntax for replace. Back references in this document appear to discuss references in the match pattern only. IMHO, We've already had this discussion and decided that implementing a replace without an '&' does not follow convention. Since there is no dispute over the definition of '&' in the replace string, I believe we should not be trying to change this behavior. Here is the behavior of Perl, PHP, and sed using the '&' and '\0' substitution patterns for the string "123 ABC DEF GHI":
perl -> 's/123 (AB). (DE). (GH)./$&' -> '123 ABC DEF GHI'
perl -> 's/123 (AB). (DE). (GH)./$0' -> '_'
sed -> 's/123 \(AB\). \(DE\). \(GH\)./&' -> '123 ABC DEF GHI'
sed -> 's/123 \(AB\). \(DE\). \(GH\)./\0' -> '123 ABC DEF GHI'
php -> 's/123 (AB). (DE). (GH)./&/' -> '&'
php -> 's/123 (AB). (DE). (GH)./\0/' -> '123 ABC DEF GHI'
Note: while php and sed support the '\0' behavior, perl does not support '$0'. Only Microsoft replaces '\0' with all the matched subgroups. This creates a dilemma since Microsoft users expect one behavior and the open source world expects another. The best of all worlds would support two syntaxes and allow the user to select between them. I can see having a replace check box option labeled "use Microsoft replace" that would change the behavior to use the complete Microsoft syntax. For the time being, I suggest that we try to stick with GNU sed, Perl, and PHP and support both '&' and '\0' for whole string replace and revisit Microsoft compatibility at a later date.
Well - \0 for all matching groups just seemed to be more logical to me. (again remember that this is for the replace-box, not for the search). We already have a common character to match the whole string (&) - so having one for all groups seemed logical (simply \0 instead of writing \1\2\3\4) Also note that many of the other regex implementations cannot use \0 since that is often used to specify a character by its octal code - but OOo uses hex values instead (\x) - so this wouldn't collide. All in all I don't have a strong opinion about it. If you decide to make \0 behave like &, I won't complain.... But keep the &. Even if it would not be common in regex (it is very common), it still would be an expression that used to be available in OOo (and its predecessor) for years. Removing it would be a regression.
@cloph Good points. An important issue in changing the behavior of \0 to all matching groups is that Christian implemented & by converting it to a \0 on the input buffer and then expanding it to the entire match string in the output buffer. Therefore providing for different behaviors between & and \0 is a large coding change. I suggest we keep the behavior as proposed by Christian.
> An important issue in changing the behavior of \0 to all matching groups is that > Christian implemented & by converting it to a \0 on the input buffer Er, did I? Oh, yes, but that was in version 2, in version 3 it's different since I had found SearchChar() in the meantime. But I nevertheless prefer keeping the behaviour of version 3 (& == \0). Mainly because the regex lib delivers \0 that way.
Christian, It helps to click on the correct link before making comments *blush*. I WAS looking at the version 2 document. Looks really good. Gregg
I'm currently reorienting myself in the code. I hope I can implement the proposed changes soon. > we should offer the replacement functionality of > ActualStrReplace() at a more common place, i.e. the > utl::TextSearch wrapper. Making it globally available is ok for me. But I don't know if a class named TextSearch should contain a method that replaces. Not that I want to create another class TextReplace or so, but maybe there's a more fitting one already in existence. However, if someone who's deeper in the system than me says I should put it in TextSearch, I'll do that.
Christian, Shouldn't the functionality of textsearch (and textreplace) be integrated into the base string class? Seems to me that strings, in an editor, should be self searching and self replacing? The QString class in Qt 4 supports search and replace. This makes the class very useful. Gregg
> Shouldn't the functionality of textsearch (and textreplace) be integrated into > the base string class? Seems to me that strings Having methods for search and various manipulations in string classes seems reasonable to me. OOo's UniString and ByteString already look quite richly equipped with methods. And maybe having a regex version of their SearchAndReplace() method would be good. But that's neither what I want to do nor what I can do. Searching through a whole document and replacing matches is different from just searching through a string only. Either that or those who implemented the current S&R were insane. At least for me it's just a mess in which I was finally able to find the right point to insert my code. That's just to inform you what you can expect from me (resp. what not).
*** Issue 25177 has been marked as a duplicate of this issue. ***
What is happening with that issue? When will it be integrated? 73 votes... Shouldn't a target be set? 2.2?
Getting this in would be great. I understand the requests to make the functionality available to all parts of OOo, but that's too deep for me. So unless someone can at least point me in the right direction, I fear this will stay open for another three years.
Could you please integrate the patch into the product? You can make improvements at any time, but meanwhile it would be great to have the functionality in the product, even if it is only available in the Writer module. 80 votes so far! Best regards, Gerhard
I will set up a CWS for this patch and we will see how far we get. Stay tuned!
Hi! Any news?
*** Issue 76188 has been marked as a duplicate of this issue. ***
I programmed a replace function in Delphi (I'm sorry to say that I'm absolutely no good at C++) with a different set of regular expressions or wildcards but with the possibility to use \1 - \9 in the replace by string to access variable text in the search string. My very simple system worked like this: The search and replace methods call a match function that returns true if a match is found. The match function has a lot of parameters, like the start and end position of the found text, the search and replace string, and of course a reference to the string that holds the text you're trying to find the search string in. The match function assembles text matching expressions in () in an array of strings; when a match is found the switches in the replace string are simply replaced by the corresponding strings in the array. I didn't include & and \0 - not having & was an oversight but I'm not sure that \0 is used a lot in word processors (it's definitely not done in MS Word) and I feel that comparing OO.o Writer with a (in my humble view) low-level editor like Sed isn't quite correct. I hope the support of switches in the replace by string will be implemented soon.
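A minimal sketch of the scheme described in the comment above, written in Python rather than Delphi: the match step collects the text of each parenthesised group into a list, and the replace step swaps \1 to \9 in the replacement string for the corresponding entries. The names `expand_backrefs` and `search_and_replace` are hypothetical, and Python's `re` module stands in for the commenter's own matcher, so this is only an illustration of the idea, not the Delphi code or the OOo patch.

```python
import re

EN_DASH = "\u2013"

def expand_backrefs(replacement, groups):
    """Replace \\1..\\9 in `replacement` with the captured group texts."""
    def lookup(ref):
        index = int(ref.group(1))
        # Unknown or unmatched groups expand to the empty string (an assumption).
        if index <= len(groups) and groups[index - 1] is not None:
            return groups[index - 1]
        return ""
    return re.sub(r"\\([1-9])", lookup, replacement)

def search_and_replace(pattern, replacement, text):
    """Replace every match, expanding back-references per match."""
    def on_match(m):
        # m.groups() plays the role of the "array of strings" from the comment.
        return expand_backrefs(replacement, list(m.groups()))
    return re.sub(pattern, on_match, text)

# Digit ranges delimited by hyphens become ranges delimited by en dashes.
print(search_and_replace(r"(\d+)-(\d+)", r"\1" + EN_DASH + r"\2", "pages 121-122"))
```

Because the replacement is expanded by a callback, the expanded text is inserted literally and is not reinterpreted as a template.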
It looks like the target should be reconsidered; setting type to patch.
Yes, I already created a CWS regexp01 for this, but did not find the time for OOo2.3. My plan is to improve our regular expression support in one of the next versions, hopefully 2.4.
Andreas, please consider my comments in #desc55 from Mon Aug 14 10:43:40 +0000 2006 and rework the patch accordingly. Thanks Eike
I will integrate some improvements for regular expressions into OOo2.4. CWS regexp02 is on its way.
Any volunteers for doing the specification? We have a first draft at http://specs.openoffice.org/appwide/find_and_replace/Regular_Expressions.odt In CWS regexp02 the back-references are already implemented (with $0 - $9) for Writer and Calc.
Fixed in CWS regexp02. Only the specification needs a little bit of improvement ;-)
Ready for QA.
> Any volunteers for doing the specification? It looks as if this needs someone who has done one before, and knows what is required. btw note that the 3rd example on page 4 (Detail Spec) should be ([1-9]+) not ([1-9]). I'll volunteer to update the wiki regex HowTo, unless someone beats me to it. But I am very puzzled why $1 - $9 has been chosen, rather than \1 - \9 as in the Search For box. In the HowTo this is going to look silly - along the lines of, well when you want a backref in the Search For you use \1 but in the Replace with box .... Could someone enlighten me if there's a good reason? Or will $1 - $9 now work in the Search For box as well? Not knocking the effort - it's a good step forward. Thank you.
@drking: $n was chosen because later at some point we will switch to the ICU regex engine that also knows this syntax, see http://www.icu-project.org/userguide/regexp.html for a complete reference. The $n is also what perl users are acquainted with. And no, $n in search is not supported, that would conflict with $ being the end-of-text anchor.
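To illustrate the two points in the comment above, here is a small sketch in Python. Python's `re` module is neither the OOo engine nor ICU, and it writes replacement references as `\g<n>`, so the hypothetical helper `dollar_to_python` translates a `$n` style replacement string into that form. The last line shows why `$` cannot double as a reference in the search pattern: there it is already the end anchor.

```python
import re

def dollar_to_python(replacement):
    """Translate $0..$9 into Python's \\g<n> replacement syntax."""
    # Escape literal backslashes first so Python does not treat them as escapes.
    escaped = replacement.replace("\\", "\\\\")
    return re.sub(r"\$([0-9])", r"\\g<\1>", escaped)

template = dollar_to_python("Hello, $2. How are you and the rest of the $1's?")
print(template)
print(re.sub(r"LN=(\w+)\s+FN=(\w+)", template, "LN=Blow FN=Joe"))

# In the search pattern, $ anchors the match at the end of the text.
print(re.sub(r"s$", "S", "steps"))
```

Note that Python raises an error if the template refers to a group the pattern does not define; how OOo handles that case is not stated in this thread.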
Please pardon my ignorance as a layman. Does the "Resolved" and "FIXED" in this issue mean that the issues in bugs http://www.openoffice.org/issues/show_bug.cgi?id=46165 and http://www.openoffice.org/issues/show_bug.cgi?id=70554 are also covered and fixed? In short: will an ordinary user be able to - search and find line breaks, any kind - search and find paragraph breaks, any kind - substitute any of the above, be it one or many, with one or many of any combination of the above? There are a whole lot of translators and other users waiting for this good news, since these bugs make it impossible for us to use OpenOffice as anything much more than a (resource-heavy) auxiliary for petty tasks.
Question to AMA, When Christian implemented the regex substitution code, he supported both the $n and \n syntax. Did this change in the final version? I was looking at the regex specification document above and it only mentions the $n syntax. It doesn't make that much difference to me, but outside of perl, most regex packages seem to support the \n syntax, including MS office. Gregg
@er Thank you - that will be useful when explaining the rationale. @gudmund There are close to 40 issues about regex, and they're all treated separately, so no - I'm afraid the other issues you mention are not fixed. The good news is that if OOo migrates to the ICU regex engine, many of the existing issues may be resolved at a stroke. Although (looking at the ICU regex spec) probably not all of them.
ama->glebovitz: The current implementation will support $n, not \n. See comment from er (Nov 10) about the reason for choosing $n.
I have read this thread several times now, and am ecstatic to see that it will be possible to use back-references as described above. For the moment however, I can use back-references in the search box (the palindrome "algorithm" described above works perfectly), but the most recent version 2.3.0 does not seem to have the $n feature incorporated. Is this going to be integrated in a future release, or is it just that I have missed some crucial part of the syntax? Thank you for adding this feature! If someday it becomes possible to add style info to the search and replace boxes, I may be able to stop using MS Word entirely!
@sashiman: Please see the issue's target that reads OOo2.4, so the change most certainly is not available in OOo2.3 ... > If someday it becomes possible to add style info to the search and replace boxes If you used styles it was always possible to search for and replace with styles, see the "More Options" button and "Search for Styles". If you used hard formatting attributes instead then see "More Options" and the "Attributes..." and "Format..." buttons.
Ok, thanks for the info on backreferences. With regard to styles, I know I can replace one style with another, but what I would like to be able to do is replace a character style with an XML tag, i.e. find all that is marked with a user-created style, e.g. author, with <person>^&</person>. I'll doublecheck to be sure I'm not mistaken, but this feature does not seem to be available, contrary to in MS Word. (That said your HTML is readable, contrary to MS Word, such that I could do the replacement with a basic text editor (or with Oo) once I've converted to HTML.) I don't mean to be hijacking this thread with a separate issue, so my apologies, I just wanted you to be sure you understood the issue that I was raising.
In fact, just to make it even clearer that this is related to the regexp issue: imagine that one has different heading 1 level titles that one wishes to convert to XML in a TEI-type format. In the other product I would search for (Chapter) ([0-9]{1;3})(*)^13 having the attribute style="heading 1" and replace with <div type="\1" n="\2">\3</div>, potentially having the default style (of little relevance, as in exporting to encoded text all styles will be lost, whereas obviously the tags will not). Working with texts that have different labels for identical level items (or captions with different labels, for example) is certainly possible when working with digitized books. Using a high-level word processor allows non-specialists (or those who would rather not see all the tags) to work on markup, leaving the XML conversion to a macro... In any case, I'm glad to see that these concerns are being taken seriously, the plans for a major overhaul mentioned elsewhere (in the CRLF discussion) are superb, and from what I saw from the link the ICU project looks like a fantastic target. Thank you!
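For readers unfamiliar with the other product's wildcard syntax, the heading-to-TEI replacement described above can be approximated in plain regex terms. This is a hedged Python sketch on plain text only: the `{1;3}` repeat becomes `{1,3}`, `(*)` becomes `(.*)`, the `^13` paragraph mark becomes an end-of-line anchor, and the style="heading 1" condition is not modelled at all. The sample heading is invented for illustration.

```python
import re

# Approximation of: (Chapter) ([0-9]{1;3})(*)^13
HEADING = re.compile(r"^(Chapter) ([0-9]{1,3})(.*)$", re.MULTILINE)

def headings_to_tei(text):
    """Wrap each matching heading line in a TEI-style <div> element."""
    return HEADING.sub(r'<div type="\1" n="\2">\3</div>', text)

print(headings_to_tei("Chapter 12 The Return\nSome body text."))
```

As in the original search expression, the third group keeps whatever follows the number, including the leading space.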
Created attachment 49915 [details] Test Case
SBA: Verified in CWS regexp02.
Hey, this is great! Fantastic job. This will be a big enhancement for OOo's Find & Replace--thanks for tackling this! I just ran through a few example tasks that people have asked about. I only found one glitch: Capitalize words beginning with h: s/\<h([a-z]+)/ r/H$1/ Match case = Yes Starting text: He heard quiet steps behind him. Expected result: He Heard quiet steps behind Him. Actual result: He H$1 quiet steps behind H$1 OOo-Dev SRC680_m239 on Fedora Linux 8
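For comparison, the expected result of the test case above can be reproduced with Python's `re` module. This is only a sketch of the intended behaviour, not of the OOo engine: Python has no `\<` anchor, so `\b` is used for the word start, and the reference is written `\1` instead of `$1`.

```python
import re

text = "He heard quiet steps behind him."

# Case-sensitive by default, which corresponds to "Match case = Yes".
result = re.sub(r"\bh([a-z]+)", r"H\1", text)
print(result)  # He Heard quiet steps behind Him.
```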
OK in OOo 2.4. Closed. SBA->jes: To capitalize all words beginning with "h", simply replace the "h" with "H" :-) Search for: "\<h" replace with: "H" Match Case and RegEx checked, Click "Replace all", works. But you are right, in this case the sub-expression does not work correctly. Please file another issue for that one because this issue here was about sub-expressions GENERALLY working. "Mutating" issues cannot be handled with feasible effort.
From the 2.4 help file (node "regular expressions;list of"): "& or $0 Adds the string that was found by the search criteria in the Search for box to the term in the Replace with box when you make a replacement. For example, if you enter "window" in the Search for box and "&frame" in the Replace with box, the word "window" is replaced with "windowframe". You can also enter an "&" in the Replace with box to modify the Attributes or the Format of the string found by the search criteria." "^$ Finds an empty paragraph." But: & or $0 in the Replace box do not insert an empty paragraph mark but instead the characters & or $0. Nicely enough, & works for inserting \n (with e. g. &&& inserting \n\n\n if only one \n and one \n alone was in the Search box), which indicates that correct handling of \n (for inserting line breaks (newline)) and \r (for inserting paragraph breaks) might not be impossible after all. @SBA: Should this too be viewed as a special case, with an issue of its own? I thought issue 46165 http://www.openoffice.org/issues/show_bug.cgi?id=46165 was supposed to be some sort of collector issue for regular expressions in general. Or should I reopen issue 70554 http://www.openoffice.org/issues/show_bug.cgi?id=70554 as being a special case/"mutating" issue? I can't find a "spec template" that you've mentioned in issue 46165. Is this it?: http://specs.openoffice.org/ http://specs.openoffice.org/collaterals/template/2.0/OpenOffice-org-Specification-Template.ott http://specs.openoffice.org/collaterals/OpenOffice_org_Specification_guide.sxw http://eis.services.openoffice.org/EIS2/guide.CheckSpecification If this is the kind of thing it takes, I guess I should file an RFE to the OOo Bugzilla that such pointers be included in every issue page, "Write a specification template".
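The "window" / "&frame" example quoted from the help file can be mimicked in a few lines of Python. This is a sketch under the assumption that every `&` in the replacement stands for the whole matched string; the function name `replace_with_ampersand` is hypothetical, and escaping a literal `&` is not handled.

```python
import re

def replace_with_ampersand(search, replacement, text):
    """Replace matches of `search`, expanding & to the matched text."""
    def on_match(m):
        # m.group(0) is the whole match, i.e. what the help calls & or $0.
        return replacement.replace("&", m.group(0))
    return re.sub(search, on_match, text)

print(replace_with_ampersand("window", "&frame", "open the window"))
```

With this reading, a replacement of `&&&` simply repeats the matched string three times, which matches the behaviour reported above for a lone \n in the Search box.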
@gudmund >But: & or $0 in the Replace box do not insert an empty paragraph mark I think that's how the thing works - you found an empty *paragraph* but tried to insert a *paragraph mark*. The Application Help is rather sparse on this topic; you might like to read the Wiki: http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer ?