Bug 64418 - Finding text in textfields is very slow
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-10 19:49 UTC by j-lawyer.org
Modified: 2020-06-03 19:28 UTC

Description j-lawyer.org 2020-05-10 19:49:10 UTC
I am scanning docx documents for occurrences of specific words / search terms.

The code I am using is seen below.

The search terms can literally be anywhere: in header, footer, paragraphs, tables, text fields, ...

Even with a complex document, parsing takes only a few seconds as long as it uses no or very few text fields. As soon as multiple text fields are involved, parsing takes a considerable amount of time, e.g. 30 seconds or even more than a minute.


Is there anything I am doing wrong in how I use the API, or is there an issue with XWPF?

Thanks,
Jens



    private static void findInBodyElements(String key, List<IBodyElement> bodyElements, ArrayList<String> resultList) {
        if (resultList.contains(key)) {
            return;
        }

        for (IBodyElement bodyElement : bodyElements) {
            if (bodyElement.getElementType().compareTo(BodyElementType.PARAGRAPH) == 0) {
                findInParagraph(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }
                findInTextfield(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }
                
            }
            if (bodyElement.getElementType().compareTo(BodyElementType.TABLE) == 0) {
                findInTable(key, (XWPFTable) bodyElement, resultList);
                
            }
        }
    }

    private static void findInParagraph(String key, XWPFParagraph xwpfParagraph, ArrayList<String> resultList) {

        if (resultList.contains(key)) {
            return;
        }

        //for (XWPFParagraph paragraph : xwpfParagraphs) {
        List<XWPFRun> runs = xwpfParagraph.getRuns();

        String find = key;
        TextSegment found = xwpfParagraph.searchText(find, new PositionInParagraph());
        if (found != null) {
            if (!resultList.contains(key)) {
                resultList.add(key);
                return;
            }
        }

    }

    private static void findInTextfield(String key, XWPFParagraph xwpfParagraph, ArrayList<String> resultList) {

        if (resultList.contains(key)) {
            return;
        }

        XmlCursor cursor = xwpfParagraph.getCTP().newCursor();
        cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p/w:r");

        List<XmlObject> ctrsintxtbx = new ArrayList<XmlObject>();

        while (cursor.hasNextSelection()) {
            cursor.toNextSelection();
            XmlObject obj = cursor.getObject();
            ctrsintxtbx.add(obj);
        }
        for (XmlObject obj : ctrsintxtbx) {
            try {
                CTR ctr = CTR.Factory.parse(obj.xmlText());
                XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
                String text = bufferrun.getText(0);
                if (text != null && text.contains(key)) {
                    if (!resultList.contains(key)) {
                        resultList.add(key);
                        return;
                    }
                }
            } catch (Exception ex) {
                log.error("Unable to iterate text fields", ex);
            }
        }

    }
Comment 1 Dominik Stadler 2020-05-20 05:35:20 UTC
Can you provide a sample file which shows the slowdown? Would make it much easier to try to analyze/reproduce it.
Comment 2 j-lawyer.org 2020-05-20 20:23:15 UTC
Thank you Dominik for the reply. 

I just created a fully runnable example:
https://www.j-lawyer.org/temp/DocXShowCase.zip

It is a NetBeans project that includes a runnable test case as well as example documents. Both docx documents are comparable in complexity; one has no text fields, the other has 10 text fields.

When running the code, these are the performance numbers:

without textfields, search: 676
with textfields, search: 15678

So when text fields are involved, execution time increases by a factor of about 23.

Let me know if I can provide anything else and I will be on top of it in no time.

Thanks!
Jens / j-lawyer.org
Comment 3 Dominik Stadler 2020-05-24 21:27:36 UTC
Thanks, but unfortunately there is a lot of code that is not related to the problem, which makes reproducing and analyzing this very hard. The app does not seem to finish for a very long time for me. It also looks a bit like you are iterating over the contents of the document many times, with all the placeholders and some of the loops in your application.

Can you reduce the code in the sample project as much as possible so that it still shows the problem, but does not do all the things that are only needed for your application?
Comment 4 j-lawyer.org 2020-05-25 20:05:44 UTC
Thanks Dominik for looking into this. I have stripped down the test case; the URL is still the same: https://www.j-lawyer.org/temp/DocXShowCase.zip

- has a list of 50 strings to be searched in documents
- has two documents, both just 1 page - (a) has no text fields and (b) has 10 text fields
- each of the 50 strings is searched for using a loop, so I am iterating over each document fifty times

Basically I just want to know which of the 50 strings are contained in the documents.

Thanks,
Jens
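
One way to avoid iterating over each document fifty times is to extract the text once and then check all search terms against it. The following is only a sketch of that idea using plain JDK collections. The class name TermScanner, the method findContainedTerms, and the placeholder strings in the usage note are hypothetical, and it assumes the document text (paragraphs, table cells, text boxes) has already been collected into a list of strings by some other code.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of POI: checks many search terms
// against text that was extracted from the document once.
final class TermScanner {

    /**
     * Returns the terms that occur in at least one text chunk,
     * in the order in which they were first found.
     */
    static Set<String> findContainedTerms(List<String> textChunks, List<String> terms) {
        Set<String> remaining = new LinkedHashSet<>(terms);
        Set<String> found = new LinkedHashSet<>();
        for (String chunk : textChunks) {
            if (chunk == null || chunk.isEmpty()) {
                continue; // nothing to search in this chunk
            }
            for (Iterator<String> it = remaining.iterator(); it.hasNext(); ) {
                String term = it.next();
                if (chunk.contains(term)) {
                    found.add(term);
                    it.remove(); // a term only needs to be found once
                }
            }
            if (remaining.isEmpty()) {
                break; // every term has been found, stop early
            }
        }
        return found;
    }
}
```

Called as TermScanner.findContainedTerms(chunks, terms) with chunks such as "Dear {{NAME}}," and terms such as "{{NAME}}" and "{{DATE}}", it returns only the terms that occur. Like the per-run check in findInTextfield, it does not find a term that is split across two chunks.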
Comment 5 Dominik Stadler 2020-05-26 20:24:16 UTC
The following line is taking most of the CPU by far, so you likely need to rework your code to not have to produce XML and then parse it in again afterwards. 

CTR.Factory.parse(obj.xmlText())
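
A possible way to avoid the serialize-and-reparse step is to read the text straight from the node tree. The sketch below walks an org.w3c.dom tree and concatenates the content of every w:t element found inside a w:txbxContent. It uses only the JDK DOM API, not XWPF classes. The class TextBoxText and its methods are hypothetical, the parse helper exists only so the sketch can be tried on an XML string, and the idea of passing in the node returned by getDomNode() on the paragraph's CTP is an assumption that has not been tested against POI.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or XMLBeans.
final class TextBoxText {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the concatenated text of all w:t elements inside any w:txbxContent below root. */
    static String textBoxText(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, false, out);
        return out.toString();
    }

    private static void collect(Node node, boolean insideTextBox, StringBuilder out) {
        boolean inWordNs = W_NS.equals(node.getNamespaceURI());
        if (inWordNs && "txbxContent".equals(node.getLocalName())) {
            insideTextBox = true; // everything below this node belongs to a text box
        }
        if (insideTextBox && inWordNs && "t".equals(node.getLocalName())) {
            // Read the text children directly; only DOM Level 1 calls are used here.
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.TEXT_NODE
                        || c.getNodeType() == Node.CDATA_SECTION_NODE) {
                    out.append(c.getNodeValue());
                }
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, insideTextBox, out);
        }
    }

    /** Demo helper: parses an XML string into a namespace-aware DOM. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }
}
```

Because the text of consecutive runs is concatenated, a search term that is split across several runs inside a text box can still be matched with String.contains. No separator is inserted between paragraphs, so a real implementation might want to add one.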
Comment 6 j-lawyer.org 2020-05-26 21:34:26 UTC
Well, I would love to get rid of the expensive XML handling. However, I do not see how I could avoid it given POI's API.

Is there an alternative approach for "getting all text content of text fields / text boxes"?

Even Apache Tika seems to use the exact same approach in their XWPFWordExtractorDecorator.java:

    // Also extract any paragraphs embedded in text boxes
    //Note "w:txbxContent//"...must look for all descendant paragraphs
    //not just the immediate children of txbxContent -- TIKA-2807
    if (config.getIncludeShapeBasedContent()) {
        for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent//w:p")) {
            extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()), paragraph.getBody()), listManager, xhtml);
        }
    }


Am I missing something?

Thanks,
Jens
Comment 7 PJ Fanning 2020-05-26 21:57:51 UTC
Instead of `CTP.Factory.parse(embeddedParagraph.xmlText())` could you try `CTP.Factory.parse(embeddedParagraph.getDomNode())`

This might lower the overhead of the parse call.
Comment 8 j-lawyer.org 2020-06-03 19:28:56 UTC
Thanks for the suggestion PJ!

I am not too familiar with the lower-level APIs of POI.

In the code I initially posted (findInTextfield method), I am using an XWPFRun, which cannot be fed with a CTP:


                CTR ctr = CTR.Factory.parse(obj.xmlText());
                XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
                String text = bufferrun.getText(0);
                if (text != null && text.contains(key)) {
                    if (!resultList.contains(key)) {
                        resultList.add(key);
                        return;
                    }
                }


When replacing 

CTR ctr = CTR.Factory.parse(obj.xmlText());

with

CTR ctr = CTR.Factory.parse(obj.getDomNode());

my code no longer works: the text retrieved no longer contains my search strings. Using the first line, however (which involves re-parsing XML), works as expected.
I have trouble finding proper Javadocs for CTP and CTR; I assume they represent disjoint sets of XML complex types.

Do you have any hints on why the two variations above have different behaviour?

Thanks,
Jens