Spreadsheets Considered Harmful insights

You shouldn't use a spreadsheet for important work (I mean it)

Daniel Lemire, computer science professor, Université du Québec, 2014
Introduction

While spreadsheets make simple tasks very easy, they make moderately-complex tasks much harder than they should be, leading to a high probability of undetected errors. Because spreadsheets are so common, these errors can sometimes have severe and wide-ranging consequences. So, excepting for the simplest of tasks, the spreadsheet is the wrong tool for every job.

Spreadsheets are good for a few things though:

  • Xmas card lists - simple lists of low-criticality information.
  • Quick and dirty models - one-off temporary things, not intended to be shared.
  • Crashing the economy - UK austerity was based on a spreadsheet error.
Explanation
The problem

These issues are fundamental to the concept and architecture of the spreadsheet, common to all spreadsheet programmes: Excel, LibreOffice Calc, GoogleSheets etc.

1. At the cell-level, too many things are co-mingled into a single cell, that should all be represented separately: data-entry, validation, constants, formulae, comments, data-storage, formatting, and data-display.

2. At the file-level, it tries to do too many different things: data-storage, data-interchange, data-analysis. Spreadsheets include, within a single file, the raw-data, the analysis framework, and the presentation (summary and graphs). This is very convenient and portable, but becomes problematic as spreadsheets are copied around and evolve.

Here are the consequences:

  • Data entry: spreadsheets are often 'too clever', interpreting what they think you mean, not what you typed.The data is irreversibly transformed at the time it is inserted (typed or imported), not just formatted for presentation.
  • Date-mangling. spreadsheets are particularly hungry for dates: entering "3/4" means the string three-slash-four, which you might chose to format as a literal, a fraction, a decimal, or a date. However, the spreadsheet will immediately convert it to '3rd April', or maybe '4th March'. This is particularly bad given the possibility to misinterpret dates between US/UK, and even worse if leads to data-errors, via: 0.753/43rd AprilApril 3rd4/31.33333!
  • No data-typing. A spreadsheet is not a database, but frequently used in situations where a database would be better. Strong typing, constraint validation, and foreign-keys would frequently improve reliability and accuracy.
  • No error-bars: spreadsheet models rarely track uncertainties, leading to failure to propagate uncertainties. As a result, the spreadsheet model suffers from spurious precision, and overconfidence.

    For example, if a spreadsheet's total forecasts, say, "3.7893452% growth", you might believe that "3.7%" is justified. But with correct tracking of the uncertainties, the evidence might only actually support "3.7% ±2.1%", i.e. there's a 1 in 6 chance that the true figure actually lies below 1.6%.
  • Nulls are mishandled: spreadsheet arithmetic treats all of 0 (zero), null (wrongly-missing-data), '' (empty string), and [ ] (cell not used) as the same. When calculating the sum of a set of figures, if any value is unknown, then the true total is (obviously) unknown. But spreadsheets cast null to zero and will just compute the sum of the column, ignoring the fact that it is invalid because a value is missing.
  • Obscure formulae: as they are condensed into a single cell, and normally hidden, the result is often extremely hard to understand, let alone debug. For example.
    Source: example formula from a single cell of an actuarial spreadsheet. [ref ]

    =IF(F6="050",tiers!$D$21+IF(AND(F6="050",OR(E6="W",E6="X")),tiers!$D$29-tiers!$D$26+VLOOKUP("N"&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&"N/A"&M6/12&"/"&N6&"0-"&R6/1000&C6*1,tiers!$L$2:$W$20969,12,0),IF(OR(E6="W",E6="X"),VLOOKUP("N"&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&"N/A"&M6/12&"/"&N6&"0-"&R6/1000&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="P",M6=36,N6=60),tiers!$D$29-tiers!$D$26+VLOOKUP("E"&100&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(E6="P",M6=36,N6=60),VLOOKUP("E"&F6*1&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="P"),tiers!$D$29-tiers!$D$26+VLOOKUP("E"&100&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6/12&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(E6="P",VLOOKUP("E"&F6*1&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6/12&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="*",M6=6),tiers!$D$29-tiers!$D$26+VLOOKUP(B6&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(Q6=48,"0-4","6-10")&M6&"/"&N6&"0-100"&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(E6="*",M6=6),VLOOKUP(B6&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(R6=125000,"100-125","0-100")&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="*"),tiers!$D$29-tiers!$D$26+VLOOKUP(B6&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(AND(OR(L6="PPH08",L6="PTH08"),R6=100000),"0-100",VLOOKUP(R6,tiers!$B$64:$C$70,2,0))&C6*1,tiers!$L$2:$W$20969,12,0),IF(E6="*",VLOOKUP(B6&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(AND(OR(L6="PPH08",L6="PTH08"),R6=100000),"0-100",VLOOKUP(R6,tiers!$B$64:$C$70,2,0))&C6*1,tiers!$L$2:$W$20969,12,0),"ERROR")))))))))),IF(AND(F6="050",OR(E6="W",E6="X")),tiers!$D$29-tiers!$D$26+VLOOKUP("N"&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&"N/A"&M6/12&"/"&N6&"0-"&R6/1000&C6*1,tiers!$L$2:$W$20969,12,0),IF(OR(E6="W",E6="X"),VLOOKUP("N"&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&"N/A"&M6/12&"/"&N6&"0-"&R6/1000&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="P",M6=36,N6=60),tiers!$D$29-tiers!$D$26+VLOOKUP("E"&100&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(E6="P",M6=36,N6=60),VLOOKUP("E"&F6*1&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="P"),tiers!$D$29-tiers!$D$26+VLOOKUP("E"&100&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6/12&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(E6="P",VLOOKUP("E"&F6*1&VLOOKUP(L6,tiers!$B$52:$C$55,2,0)&"N/A"&M6/12&"/"&N6&VLOOKUP(R6,tiers!$B$57:$C$59,2,0)&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="*",M6=6),tiers!$D$29-tiers!$D$26+VLOOKUP(B6&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(Q6=48,"0-4","6-10")&M6&"/"&N6&"0-100"&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(E6="*",M6=6),VLOOKUP(B6&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(R6=125000,"100-125","0-100")&C6*1,tiers!$L$2:$W$20969,12,0),IF(AND(F6="050",E6="*"),tiers!$D$29-tiers!$D$26+VLOOKUP(B6&100&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(AND(OR(L6="PPH08",L6="PTH08"),R6=100000),"0-100",VLOOKUP(R6,tiers!$B$64:$C$70,2,0))&C6*1,tiers!$L$2:$W$20969,12,0),IF(E6="*",VLOOKUP(B6&F6*1&VLOOKUP(L6,tiers!$B$38:$C$49,2,0)&IF(OR(L6="PPH08",L6="PTH08"),"0-9",IF(Q6=48,"0-4","6-10"))&M6&"/"&N6&IF(AND(OR(L6="PPH08",L6="PTH08"),R6=100000),"0-100",VLOOKUP(R6,tiers!$B$64:$C$70,2,0))&C6*1,tiers!$L$2:$W$20969,12,0),"ERROR")))))))))))

  • Disconnected constants. In algebra or programming, you only define a constant once, name it, and then always refer back to it from multiple places, e.g., conversion_efficiency = 0.42. Should you later change an assumption, every part of the calculation changes in lockstep.

    While you can do this in a spreadsheet (e.g. with named-cells, or fixed references such as \$B\$3), usually, these values simply get copied around. Later, when someone else edits the original assumption, other formulae still say 0.42, and do not reflect the changes. This is an anti-poka-yoke.
  • Harder to read and write. Because spreadsheets co-mingle data, formatting, code, constants, and output, all in the same place, and prevent the practice of code-commenting, it's very hard to 'read', understand or reverse-engineer a spreadsheet, resulting in a high-probability of errors occurring, and a low-probability of spotting them. In comparison, python models are easier to read and write, and less error-prone.
  • Lower productivity. The formula-language of spreadsheets is far more verbose, and less expressive than a real programming language. This means spreadsheets cannot easily make use of advanced analysis methods, external libraries, or machine-learning tools.
  • No source-control: the logical/mathematical code in a spreadsheet is awkward to read, or code-review, and rarely contains comments (in the sense that programmers use comments, to explain what it does and why). There's no way to do source-control of algorithms (i.e. extract just the analysis model from a spreadsheet and check it into git), which means limited re-use of models and libraries, and everything gets re-invented each time.
  • Hard to automate: because spreadsheets merge data, model, and display, it's much harder to automate with them. Manually-generated/modified spreadsheets frequently contain inconsistencies, which cause production systems to fail to ingest them correctly.

    The .xlsx format is not a reliable data-exchange format, because it can contain more than just data, and the data tables it does contain have no guarantees on their structure, types, or internal/external consistency.
  • Awkward collaboration: if multiple people want to take it in turns to append / edit / review, without 'stepping on' each others' changes. This is partly mitigated by online-spreadsheets, but, as per Keegan's Law, "Every time you email a spreadsheet, you should feel guilty".
  • Poor performance: we showed that python is ~ 1000× faster than spreadsheets, and can easily process data-sets 10 million times larger. The low data-size limit (1m rows) of Excel caused a significant loss of public-health data at the worst time.
  • No nuance: spreadsheet models don't facilitate accounting for nuance, documenting uncertainty, or evaluating intangibles. Thus, a project with a large intangible (non-financial) benefit, or whose value is diffuse (shared across an organisation, but not 'owned' by any single cost-centre), long-term, or enables value realised elsewhere, will be undervalued. Spreadsheets cause floccinaucinihilipilification.
  • Bad decisions: finally, and perhaps controversially, there is the view that spreadsheets have led businesses and investors to focus on short-term decisions, based primarily on financial-metrics (omitting uncertainties), excluding intangibles, and overlooking the longer-term purpose and values of their business and their stakeholders.

Note that some of the problems above can be mitigated by the really careful use of advanced features, by 'spreadsheet-gurus'. However, we're concerned with how spreadsheets are commonly (mis)-used. To the extent that an expert has to constantly work-around these pitfalls, they would be better-served, and far more productive, with a different toolchain.

Spreadsheet Damage

Here are some real-world examples of the severe consequences of the errors. These aren't just mistakes in data-analysis; these are errors that happened (and were not detected) specifically because the tool of choice for the analysis was a spreadsheet.

Reinhart-Rogoff and Austerity

The basis for the UK's decade of austerity is a spreadsheet error. The paper, Growth in a Time of Debt, suffered from a wrong analysis in Excel, which was not spotted because the authors didn't publish their working. Their conclusions were that that if public debt exceeded a specific threshold of GDP, which they calculated to be 90%, it would impede future growth. There is actually no evidence that this is true.

As the UK was near this threshold, this provided the justification for savage austerity, the long-term effects of which are still hurting the UK over a decade later:

Again and again, Osborne trotted out one particular academic paper to justify his assault on the welfare state – a 2010 study by Carmen Reinhart and Kenneth Rogoff, which compared GDP growth rates in various countries against their public debt and seemed to show that higher debt slowed growth.

But in 2013, that paper fell apart. A postgraduate economics student in the US looked through their spreadsheet and discovered they had made some serious copy/paste errors. Fix the typos, and their findings – the only justification for austerity – disappeared.

In reality, the cuts did what most economists said they would: they prolonged the recession, drove down wages and did a decade of damage to the UK economy. With tax receipts failing to rise – which is what happens when wages fail to rise – government debt ballooned as a percentage of GDP. Millions of people were driven into poverty. Homelessness soared, and the NHS was damaged. A decade in which serious investment was needed to switch the UK to a zero-carbon economy was wasted.
Genetic Data loss

Because Excel automatically tries to interpret anything that might possibly be a date as a date, it corrupted a large amount of genetic data. For example, the gene symbol "SEPT2" [Septin 2] was converted to "2-Sep"; similarly "MARCH1" [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] was converted to '1-Mar'.

This automatic conversion is silent, and irreversible - and led to widespread gene-name errors in the scientific literature, affecting about one in five published papers at the time. Ultimately, a number of genes were renamed as a workaround.

Piketty's Capital

The sequel to the Reinhart-Rogoff bug is the the Piketty error. His work, "Capital in the Twenty-First Century", claims that there is rising wealth-inequality due to the greater yield on capital than wage-growth. His analysis is wrong, again, because of spreadsheet errors, which completely undermine the basis for his conclusion.

Near Miss

In our own experience, we narrowly saved a customer from committing to a £1m investment on a device that would not have worked. They based their plan on a supplier's spreadsheet, which made the "disconnected constants" mistake described above. Fortunately, we also analysed the device, but with symbolic algebra (using pencil and paper), had identified a discrepancy, and this led to a re-evaluation.

Other

Research suggests that more than 90 per cent of spreadsheets have errors, and half of spreadsheet models used in large businesses have "material defects". EuSpRIG, the European Spreadsheet Risks Interest Group, discuss this further, and list additional horror stories of spreadsheet-created failures.

Research on spreadsheet errors is substantial, compelling, and unanimous. It has three simple conclusions.

The first is that spreadsheet errors are rare on a per-cell basis, but in large programs, at least one incorrect bottom-line value is very likely to be present.

The second is that errors are extremely difficult to detect and correct.

The third is that spreadsheet developers and corporations are highly overconfident in the accuracy of their spreadsheets.

The disconnect between the first two conclusions and the third appears to be due to the way human cognition works.
Alternatives

For data analysis, use Python + NumPy, + SciPy, or GNU Octave, MATLAB, Julia, or even a browser-based tool e.g. statistics.js + Plotly. For quick prototyping, use Jupyter. This approach gives you more powerful tools, which are easier to use, and easier to validate; a much better editor including syntax-highlighting; and source-control.

For data storage, use a database such as PostgreSQL, or SQLite for simpler tasks. This gives better performance, strict-typing of data, the ability to use indexes and joins, and correctly handles nulls.

For data interchange, use a format designed purely for holding data, such as CSV, TSV, or JSON, or a dedicated API between systems if you have one. This also allows for strict validation of the data.

Insight
Spreadsheets should never be used for serious or important work: they make errors both inevitable and invisible, and don't track uncertainties and context, often resulting in highly-confident, but wrong business-decisions.

They are an inefficient way of doing data-modelling, can corrupt data when it is stored and exchanged, and have poor performance on large data-sets. Furthermore, they are actually harder to write, except in the simplest cases.

While it is possible to build really complex models in spreadsheets, there are better tools for data analysis, data storage, and data interchange; many of which are open-source and zero-cost. Please let us know how we can help you with your own modelling and processes.
more insights