Bug 53243 - Extract Tables from word document
Summary: Extract Tables from word document
Status: RESOLVED WORKSFORME
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-16 15:43 UTC by ahmed
Modified: 2012-11-06 16:54 UTC



Attachments
word document file (221.00 KB, application/msword)
2012-05-16 15:43 UTC, ahmed

Description ahmed 2012-05-16 15:43:34 UTC
Created attachment 28793
word document file

I used POI 3.8 to extract tables from a Word document,
but I can't get all the tables in the document.
I wrote this code to do it:

    public static void main(String[] args) {
        String fileName = "C:\\fjn3312r.doc";
        try {
            InputStream fis = new FileInputStream(fileName);
            POIFSFileSystem fs = new POIFSFileSystem(fis);
            HWPFDocument doc = new HWPFDocument(fs);

            Range range = doc.getRange();

            int tblNameIdx = 0;
            for (int i = 0; i < range.numParagraphs(); i++) {


                Paragraph tablePar = range.getParagraph(i);

                String parText = tablePar.text();

                try {
                    Pattern pattern = Pattern.compile("[\\s]*", Pattern.CASE_INSENSITIVE);
                    Matcher matcher = pattern.matcher(parText);

                    // Skip paragraphs that contain only whitespace
                    if (matcher.matches()) {
                        continue;
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }

                if (tablePar.isInTable()) {
                    Paragraph tableName = range.getParagraph(tblNameIdx);
                    System.out.println("Table name=====>>" + tableName.text());
                    Table table = range.getTable(tablePar);
                    for (int rowIdx = 0; rowIdx < table.numRows(); rowIdx++) {
                        TableRow row = table.getRow(rowIdx);
                        BorderCode bc = row.getVerticalBorder();
                        i = i + 1;
                        row.text();

                        String rowText = "";
                        for (int colIdx = 0; colIdx < row.numCells(); colIdx++) {
                            TableCell cell = row.getCell(colIdx);
                            rowText = rowText + "\t" + cell.getParagraph(0).text();


                            i = i + 1;
                        }
                        System.out.println("Row----" + rowIdx + " ===>>" + rowText);

                    }
                    i = i - 1;
                } else {
                    tblNameIdx = i;
                }

            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Comment 1 Sergey Vladimirov 2012-11-06 16:54:50 UTC
Ahmed,

The first table is placed inside a textbox, not in the "main" text. If you need its content, you have to navigate into the textbox part of the document and extract the data from there.

The second and last tables are extracted correctly.

Sergey
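
For reference, a minimal sketch of what "navigating into the textbox document part" could look like with the HWPF API. It assumes that HWPFDocument.getMainTextboxRange() returns the textbox story and that TableIterator can walk tables in that range the same way it does for the main text; this has not been checked against the attached file. The class name TextboxTableDump and the helpers clean() and dumpTables() are hypothetical names used only for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Table;
import org.apache.poi.hwpf.usermodel.TableCell;
import org.apache.poi.hwpf.usermodel.TableIterator;
import org.apache.poi.hwpf.usermodel.TableRow;

// Hypothetical class name, for illustration only.
public class TextboxTableDump {

    // Removes the cell mark (\u0007) and turns paragraph marks (\r) into spaces.
    static String clean(String raw) {
        return raw.replace("\u0007", "").replace("\r", " ").trim();
    }

    // Prints every table found in the given range, one row per line.
    static void dumpTables(String label, Range range) {
        TableIterator tables = new TableIterator(range);
        int tableIdx = 0;
        while (tables.hasNext()) {
            Table table = tables.next();
            System.out.println(label + " table " + tableIdx++);
            for (int r = 0; r < table.numRows(); r++) {
                TableRow row = table.getRow(r);
                StringBuilder line = new StringBuilder();
                for (int c = 0; c < row.numCells(); c++) {
                    TableCell cell = row.getCell(c);
                    if (c > 0) {
                        line.append('\t');
                    }
                    line.append(clean(cell.text()));
                }
                System.out.println("  row " + r + ": " + line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path is the one from the report; pass another path as args[0].
        String fileName = args.length > 0 ? args[0] : "C:\\fjn3312r.doc";
        try (InputStream in = new FileInputStream(fileName)) {
            HWPFDocument doc = new HWPFDocument(in);
            dumpTables("main", doc.getRange());                // tables in the main text
            dumpTables("textbox", doc.getMainTextboxRange());  // tables inside textboxes
        }
    }
}
```

Running both ranges through the same helper keeps the main-text and textbox output comparable, and TableIterator avoids the manual paragraph-index bookkeeping used in the original loop.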