Issue 80139

Summary: Avoid ERRORS in duplicate Spreadsheets and Data
Product: Calc Reporter: discoleo <discoleo>
Component: codeAssignee: AOO issues mailing list <issues>
Status: CLOSED WONT_FIX QA Contact:
Severity: Trivial    
Priority: P3 CC: elish, issues, pagalmes.lists
Version: OOo 2.2.1Keywords: needmoreinfo
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: FEATURE Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 3740, 34213, 66817, 80325, 8811    
Issue Blocks:    

Description discoleo 2007-07-28 18:24:38 UTC
This issue makes it in my TOP 5 of major design flaws of existing spreadsheet
applications.


TOC
===

1. INTRO / REAL CASES
2. PROBLEM
3. SOLUTIONS
  3.1 COLLABORATIONS
  3.2 DUPLICATE DATA / DATA LINKS
  3.3 TRACK CHANGES
  3.4 VERSIONING
4. OTHER RELEVANT ISSUES


1. INTRO
========

*Duplicate Data* perpetuates ERRORS!

One encounters in current practice dozens of copies of the same spreadsheet
because different people need to work with it, or because one needs only a
subset of the original data (which is copied/saved into a new table/
spreadsheet). [This is NOT a duplicate of issue 8811.]

One of my first tasks - as I started work in a big department - was to oversee
the allocation of revenues to various contractors based on some specific
characteristics. The first issue I stumbled upon, was the *lack of a central
repository* where those characteristics were stored.

Instead, various departments/employees had their own copy of the spreadsheet and
everybody was updating his own copy.

BUT no one had the spreadsheet up to date. There was NO way to get a fully *up
to date* spreadsheet, and there was NO way to *track the changes*.

As a second example, I will describe now a scenario that happened while I was
doing some research. I was the only person involved in analysing the
spreadsheet. However, due to the complexity of the spreadsheet, I created
various (sub-) tables containing only relevant portions of the data.

In the case I detected one error in the data, there was NO method to correct the
error *only ONCE*, and have the correction applied in all sheets/data. The
errors were doomed to persist in the various duplicate data.


2. THE PROBLEM
==============
Existing spreadsheets offer very little mechanisms to avoid problems of
duplicate data and the errors that arise because of this.

While many will argue that in such cases a database should be used, common
practice shows that most work is still done with spreadsheets, because of the
following reasons:
 a.) easy to create (actually NO effort at all)
 b.) needs NOT a rigid structure (unlike a DB), therefore
     great flexibility in the beginning, when the data may be largely unknown
 c.) you see what you do, and you already see the results
     (unlike for DBs - you need a dedicated software to compute the results)
 d.) NO special coding skills needed


3. SOLUTIONS
============

As mentioned, a DB might be tentative, BUT NOT in real practice. Some
spreadsheets offer some (incomplete) solutions (like Excel for issue 8811), BUT
they will all fail on a global scale.

  3.1 COLLABORATIONS
  ==================
  Spreadsheets should allow collaboration between people. This is described in
greater detail in issue 8811. Features to be implemented:
 - allow simultaneous editing by different users
 - allow various locking mechanisms

  3.2 DATA LINKS
  ==============
  Often, one needs only subsets of the original data to further process. One may
filter this data out and then copy/paste it to a new spreadsheet.

However, what is lacking is to paste NOT the actual data, but a link to the
original data, like a *HARD-LINK* on Unix-like OSs. This is described in greater
detail in issue 66817 (see http://www.openoffice.org/issues/show_bug.cgi?id=66817).

A *HARD-LINK* would implement a mechanism through which a correction done in a
copied cell is propagated back in the original cell.

Another issue in this series is issue 34213 that involves pasting references to
the original cells (see http://www.openoffice.org/issues/show_bug.cgi?id=34213).
[Though the hard-links would be more powerful.]

  3.3 TRACK CHANGES
  =================
  A very useful feature for such collaborative efforts is to track the changes.
This is even more important, as numerous simultaneous changes may slow down the
computer due to the re-calculations.

Therefore, a mechanism should be in place to disable automatic recalculations
BUT to show which cells have potentially changed (and need be updated IF one
needs those values).

Some more brainstorming is here really indicated.

  3.4 VERSIONING
  ==============
  Another problem of these spreadsheets arises directly due to the frequent
changes. Previous spreadsheets might well be needed (because e.g. of legal
reasons), so one has to save the various versions of the spreadsheet.

I like in this respect especially the versioning scheme described in FORTRESS
(see the Fortress specification,
http://research.sun.com/projects/plrg/Publications/index.html), where new
versions of the same program are saved alongside the old version.

This Versioning-mechanism should be able to store snapshots of the spreadsheet
for future use. [This is definitely more difficult than the case in FORTRESS, so
some more brainstorming is warranted.]


4. RELEVANT ISSUES
==================
I have already mentioned the relevant issues. There are surely more relevant
issues (to come). In brief:
 - issue 8811: Allow multiple users to edit the same spreadsheet...
 - issue 66817: Paste as HARD-LINK (Unix style)
 - issue 34213: Paste as reference
 - still others to come ...
Comment 1 discoleo 2007-08-16 20:46:49 UTC
  3.3 TRACK CHANGES
  =================

I have filed issue 80325 which deals with more advanced features in the
track-changes category (see also issues blocking that one). Please note, this is
not the simple track-changes that is lacking.

The problem with a simple track changes is:
 1.) nobody can track all the changes in a big-spreadsheet
 (consider a 100x1,000 = 100,000 cell spreadsheet, one needs years to review all
changes in a multi-user environment)
 2.) one wants to detect functional areas - track only some specific cells/ranges
 3.) currently, IF a formula-calculated results changes, it is marked as changed
(NOT the changing of the formula itself, BUT the computed value)
 4.) when one detects a suspicious value, one wants to track the changes in
upstream cells that resulted in this suspicious value
 5.) a better history of changes - organized depending on the functional area
Comment 2 discoleo 2007-08-16 20:52:24 UTC
Sorry for a slight ERROR in my previous post:
>  3.) currently, IF a formula-calculated results changes,
>      it is >>marked<< as changed (NOT the changing of the formula itself,
>      BUT the computed value)

CORRECT
=======
3.) currently, IF a formula-calculated result changes,
    it is *NOT marked* as changed (NOT IF changing the formula itself,
    BUT when only the computed value changes)
Comment 3 discoleo 2007-09-16 15:17:59 UTC
Another way to avoid duplicate data is to reference external data using the
existing mechanisms:
 a.) directly referencing data in a different spreadsheet
     [though hard-links would be more powerful]
 b.) using a DDE-Link
     [I will discuss this in a new post]

Unfortunately, both methods show substantial weaknesses.

As described in issue 3740:
Named ranges in one spreadsheet cannot be accessed from another spreadsheet.
(see http://www.openoffice.org/issues/show_bug.cgi?id=3740)

Why are *named ranges* such important?

A.) CORRECT RANGE
=================
Well, IF the user wrote the master-spreadsheet he may be able to recall in which
rows/columns he did store the relevant data. BUT how well will he remember these
design details in 2 weeks? And in 2 month?

And what are the chances, that a different user will reference the correct range?

B.) CHANGING RANGE
==================
Another even worse problem arises when the initial data range changes: say, one
has to add another data row, so *ALL END USERS* will have to correct their
spreadsheets that link to this master-spreadsheet.

This is a formidable task and one that generates numerous *hard-to-trace errors*.

These are the main reasons, why it is desirable to link to a named range!
Comment 4 discoleo 2008-02-05 00:07:02 UTC
An interesting project dealing with more advanced 'Track Changes' and
'Collaborative Editing' for spreadsheets (specifically for Calc) can be found on
sourceforge.net:
http://telltable-s.sourceforge.net/

see also the following presentation and article on this topic:
http://www.site.uottawa.ca/~adler/talks/2003/SSScan-eusprig-jul2003.pdf
http://www.site.uottawa.ca/~adler/publications/2004/adler-nash-noel-2004-challenges-in-collaborative.pdf
Comment 5 oc 2008-07-15 10:43:07 UTC
reassigning features and enhancements to user requirements@openoffice.org which
will be the default owner for those tasks (was introduced some time ago)
Comment 6 Edwin Sharp 2014-01-12 12:33:48 UTC
Please attach real life example.
Comment 7 Edwin Sharp 2014-01-26 07:18:01 UTC
No info from author.