When processing text files, the copy, move, concat (etc.) tasks can already use filter readers, and the current set of filter readers provides very useful functionality. The enhancement I would like is a filter that provides the following functionality.

The SubSet Filter is a filter that can be used within a filterchain to extract data from a text file. It lets you designate two regular expression patterns. The first is the pattern the filter uses to start the extraction, and the second is the pattern it uses to stop the extraction.

Rules:
1) If the beginning pattern is not specified, extraction starts at the beginning of the file.
2) If the beginning pattern is never found, no lines are returned.
3) If the end pattern is never found or not specified, all lines through the end of the file are returned.
4) If the skipstart attribute is set, the filter skips N matches of the beginning pattern before it starts.
5) If the skipend attribute is set, the filter skips N matches of the end pattern before it ends.

After the lines are determined, each line can then be limited by column index. Truncating a line keeps its line-ending character.

Rules:
1) If columnstart is specified, the line is returned starting from that 0-based index.
2) If columnstart is greater than the line length, nothing is returned.
3) If columnend is specified, only text up to that index is returned.
4) If columnend is greater than the line length, everything up to the line length is returned.

I have already implemented this, and I think it is a useful filter that would help complete the existing set of great filters. I recently had a set of tasks for which I chose Ant to do the text file processing, and this filter proved very useful. I realize you already have a great pool of talent, but I would look forward to contributing the code I have. It is fully unit tested using the anttest util.

Anyway, hopefully I will see it in a future release.

Thanks,
Donovan
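To make the rules in the request above concrete, here is a minimal Python sketch of the described behaviour. It is not the submitted Ant code. The function names `subset_lines` and `cut_columns` are hypothetical, and several points the request leaves open are assumptions: the matching start and end lines are both included in the output, the end pattern is only tested on lines after the start match, and columnend is treated as an exclusive index.

```python
import re


def subset_lines(lines, begin=None, end=None, skipstart=0, skipend=0):
    """Select the lines between a begin match and an end match.

    Assumptions (not stated in the request): the matching start and end
    lines are included, and the end pattern is tested only after the start.
    """
    begin_re = re.compile(begin) if begin else None
    end_re = re.compile(end) if end else None
    started = begin_re is None  # line rule 1: no begin pattern, start at the top
    begin_hits = 0
    end_hits = 0
    selected = []
    for line in lines:
        if not started:
            if begin_re.search(line):
                begin_hits += 1
                if begin_hits > skipstart:  # line rule 4: skip N begin matches
                    started = True
                    selected.append(line)
            continue
        selected.append(line)
        if end_re is not None and end_re.search(line):
            end_hits += 1
            if end_hits > skipend:  # line rule 5: skip N end matches
                break
    # Line rule 2 (begin never found) yields an empty list; line rule 3
    # (end never found or not given) runs to the last line.
    return selected


def cut_columns(line, columnstart=None, columnend=None):
    """Limit one line by 0-based column index, keeping its line ending.

    Assumption: columnend is exclusive.
    """
    body = line.rstrip("\r\n")
    eol = line[len(body):]
    start = columnstart or 0
    stop = len(body) if columnend is None else columnend
    # Slicing already returns "" when start is past the end of the line
    # and stops at the line length when columnend is too large.
    return body[start:stop] + eol
```

For example, `subset_lines(lines, "^BEGIN", "^END", skipstart=1)` would start at the second line matching `^BEGIN`, and `cut_columns("hello world\n", 0, 5)` would give `"hello\n"`.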
This sounds useful. I personally would rather NOT see functionality duplicated; i.e. you can select by column using existing regex stuff chained after the basic functionality you have outlined here...
Rather than introducing new filters, I think the existing <head> and <tail> filters could be extended to take an additional regex. The use case you describe matches head/tail IMHO, and simply extends the concept to use a regex rather than a simple line count to determine the start/end.

I wouldn't be against adding a new filter to 'cut' lines by column numbers. Although it's certainly possible to achieve using regexps, it would likely be easier to use a cut-like column-specific filter ;-) --DD
okay, if we want to go down this road... ;) I would say that a cut filter should implement cut fully, including fields and delims... but either way, modularization is good.
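As a rough illustration of what "fields and delims" would add beyond column ranges, here is a small Python sketch modelled on the Unix cut utility's `-f` and `-d` options. The name `cut_fields` is hypothetical, and nothing here reflects an existing or planned Ant filter. Like cut, it uses 1-based field numbers, emits the selected fields in input order, and passes lines without the delimiter through unchanged.

```python
def cut_fields(line, fields, delim="\t"):
    """Keep only the given 1-based fields of a delimited line."""
    body = line.rstrip("\r\n")
    eol = line[len(body):]
    if delim not in body:
        # cut prints lines that contain no delimiter as-is (unless -s is given)
        return line
    parts = body.split(delim)
    # Fields come out in input order, whatever order they were requested in;
    # field numbers beyond the end of the line are ignored.
    picked = [parts[i - 1] for i in sorted(set(fields)) if 1 <= i <= len(parts)]
    return delim.join(picked) + eol
```

A call such as `cut_fields("a:b:c\n", [3, 1], ":")` would keep the first and third fields and the line ending.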