Bug 64165 - Can't read in rows without 'r' attribute
Summary: Can't read in rows without 'r' attribute
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 4.1.x-dev
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-20 17:33 UTC by John Claxton
Modified: 2020-02-20 20:30 UTC (History)
1 user (show)



Attachments
Proposed patch.tar.gz (9.99 KB, patch)
2020-02-20 17:44 UTC, John Claxton
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description John Claxton 2020-02-20 17:33:36 UTC
*I have a proposed fix for this (at least for XSSF). I'll attach a patch file shortly, I just wanted to secure a Bugzilla number first. 

When creating a workbook from an existing file, I noticed that rows without an 'r' XML attribute (used for a row number in this case) are not indexed correctly in the sheet object. See XSSFSheet.initRows() below. getRowNum() returns -1 for all rows and all but the last row is lost. 

My version of Excel can handle these rows, it just reads them in consecutively. When saving a workbook, it will add the 'r' attribute. I ran into this issue when trying to read in a file that was exported from a JavaScript spreadsheet tool. The authors of that tool claim their XML meets Microsoft's standards. 

Current code:
    private void initRows(CTWorksheet worksheetParam) {
        _rows.clear();
        tables = new TreeMap<>();
        sharedFormulas = new HashMap<>();
        arrayFormulas = new ArrayList<>();
        for (CTRow row : worksheetParam.getSheetData().getRowArray()) {
            XSSFRow r = new XSSFRow(row, this);
            // Performance optimization: explicit boxing is slightly faster 
            than auto-unboxing, though may use more memory
            //noinspection UnnecessaryBoxing
            final Integer rownumI = Integer.valueOf(r.getRowNum()); // 
            NOSONAR
            _rows.put(rownumI, r);
        }
    }

Proposed change: 
    private void initRows(CTWorksheet worksheetParam) {
        _rows.clear();
        tables = new TreeMap<>();
        sharedFormulas = new HashMap<>();
        arrayFormulas = new ArrayList<>();
        int rowIndex = 0;
        for (CTRow row : worksheetParam.getSheetData().getRowArray()) {
            XSSFRow r = new XSSFRow(row, this);
            if(r.getRowNum() == -1) {
                r.setRowNum(rowIndex);
                rowIndex++;
            }
            rowIndex = r.getRowNum();
            // Performance optimization: explicit boxing is slightly faster 
            than auto-unboxing, though may use more memory
            //noinspection UnnecessaryBoxing
            final Integer rownumI = Integer.valueOf(rowIndex); // NOSONAR
            _rows.put(rownumI, r);
        }
    }
Comment 1 John Claxton 2020-02-20 17:44:48 UTC
Created attachment 37030 [details]
Proposed patch.tar.gz
Comment 2 Axel Howind 2020-02-20 20:19:10 UTC
IMHO this is a good idea. But consider a file where the row numbers are missing in between. Say the last row before the rows with missing numbers had the number n. Then the next row will be assigned the same number. That doesn’t look right

Another thing to consider is that we might have to re-adjust row numbers of the following rows. Consider rows 1,2,3 all with correct row numbers. Then 5 rows without row numbers, followed by a row with number 4. How should row numbers be handled then? What does Excel do in that case?

Do you have (or can you create) excel sheets for all three case?
 1. row numbers missing from the start of the file
 2. row numbers missing in between
 3. row numbers missing in between, but would use the same row numbers as the following rows

We should really run this on these test sheets and also create a unit test that covers all three cases.

Thanks for contributing!
Axel
Comment 3 John Claxton 2020-02-20 20:30:45 UTC
Good point, I didn't really consider sheets with partial row numbers, my examples were all or nothing. I'll send a new patch file that covers all three cases. 

Thanks,
John Claxton