In a recent comments thread, Dave Ricardo notes my observation

Unfortunately the data is only published annually

and asks

Don’t you mean, the data are published only annually?

This is a common error, reflecting a confusion between Latin and English.

In Latin, data is the plural of datum (‘something given’). The word ‘datum’ is used in English, but is an archaism, except for a specialised use in surveying. On standard principles of modern English usage, the correct plural of datum is ‘datums’ and a Google search reveals 158 000 occurrences of this term. (For comparison, ‘data’ occurs over 100 million times).

In English ‘data’ is a mass noun like water or wheat. Hence it can be used in compounds like “data base”, “data processing” and so on, which would be ungrammatical for a plural like ‘datums’. This simply reflects our everyday experience with data, which is that it is a quantity of information, not a collection of facts. For example, it would be natural to refer to “500Kb of data”, but wrong to refer to “500 data”.

Data is normally dealt with as a mass, but it is sometimes important to refer to discrete units, in which case it is appropriate to use the count nouns ‘data point’ or ‘observation’ (drops of water and grains of wheat provide an analogy for other mass nouns). A collection of data points can be referred to as a ‘data set’.

Although lots of people imagine that ‘data’ is a plural count noun, and some try to treat it that way, hardly any do so consistently. To give just one example, I looked for occurrences of the phrases ‘not much data’ (correct for a mass noun) and ‘not many data’ (correct for a count noun) using Google. There were 10 times as many occurrences of ‘not much data’. Moreover, a large proportion of the ‘not many data’ observations were either written by non-native speakers of English or formed part of grammatically correct phrases such as ‘not many data sets’.

Update: There is nothing new under the sun. Kevin Drum at Calpundit blogged on the same topic a few months ago, reaching the same conclusion.

  1. The classic example is from the RAF shortly after WWII. A statistician was asked for a progress report on some work and replied that his data was not yet all in. His CO came back with the thing about data being the plural of datum, so next time he was asked, he wrote:

    “Sir, we have finished compiling our data, and we are now sitting on our ba, doing our sa.”

  2. For related reasons, the plural “referenda” is incorrect. (It more likely applies to a referendum on several topics.)

    Referendums is the right one.

    Too much of a liberal education can be a bad thing.

  3. When I posted on this a few months ago I was told that it’s common to treat “data” as a plural in the hard sciences, but perhaps not much of anywhere else.

    Just thought I’d pass that along.

  4. I offer this partial defence of Dave.

    As John says, ‘data’ in the social sciences has come to be used as a mass noun like fruit and equipment (water is not really the best analogy). But it is actually a rather inconvenient development, because in lots of situations we do in fact want to refer to particular individual units. Then we resort to inelegant constructions like ‘piece of data’ or, in economics, the ubiquitous and pretentious ‘observation’. I for one wish we had stuck to a singular-plural distinction. Whether the plural is data or datums is neither here nor there. (There doesn’t seem to be any accepted rule for dealing with these ‘um’ words. ‘Stratums’ sounds illiterate. But one can say maximums, minimums, optimums, memorandums and symposiums without raising an eyebrow. Whether it’s mediums or media seems to depend on whether you’re talking about séances or television. On the other hand ‘these football stadia’ sounds ridiculous.)

  5. James, on what basis do you say that the English ‘observation’ is pretentious, and the Latin ‘data’ is not?

    A data set is typically a two-dimensional (sometimes higher-dimensional) array with values for a number of variables for each observation. So correcting what I said in the post, a data point (a value for a particular variable for a particular observation) is not the same thing as an observation. You could use ‘datum’ instead of ‘data point’, but I don’t think you gain anything.

    The old use of ‘data’ as a plural related to logic/maths problems which might be stated in terms of a handful of given facts: “James is 11. John is 17. How old was James when he was half as old as John?” The data here are the ages.

  6. Bravo this discussion!

    It looks like entry 2 under ‘datum’ in my 1987 edition of the “Australian Concise Oxford Dictionary”, which I had never been patient enough to read carefully before, is still current. It encapsulates the discussion, from beginning to end (including noting the disputed use of the term) beautifully.

  7. John,

    Is datum that much more Latin than observation?

    In any case, pretentious was entirely the wrong word: I just had in mind that ‘observation’ seems a bit grandiose when we are just talking about a bloody number. And at that stage I thought, contrary to your last comment, that the observation was indeed one number while the data point was the array. I was thinking of a doctor recording observations of pulse, blood pressure etc. at a moment in time. A (data) point in n-dimensional space would represent this array of observations. But I was doubtless incorrect.

    On a lighter note: my father, who in semi-retirement devotes much of his time to pedantry, recently emailed an academic at Edith Cowan and admonished her for using the word ‘dilemmata’ in a Campus Review article. Yes, it was intended as the plural of dilemma.

  8. Well, I beg to differ. According to the Macquarie Dictionary, data is the plural of datum. According to the Australian Concise Oxford Dictionary, data is a plural noun, but it notes “also treated as singular … although the singular form is strictly datum”.

    Just because data has fallen into common use as a singular noun doesn’t make it correct. If that was the standard, youse would be plural of you.

  9. Correctness is determined by usage, not by dictionaries. To the extent that there are differences between educated and uneducated speakers, the usage of educated speakers is ‘correct’ – this reflects relative status and the fact that ‘correctness’ is a concern of the educated and not of the uneducated.

    So, if “youse” were normally used by educated speakers of English as the plural of “you” it would be the correct plural, and failure to use it would be an error. Note that “youse” is in fact more regular, so that in the hypothetical situation, someone who used “you” as a plural would be told that they had failed to make verb and subject agree in number.

  10. Well whatever any of you’s think, I’m just happy that, read carefully, my 1987 (16 years old) Aust Concise Oxford appears still to be a fairly reliable guide to both usage and correctness and I don’t know where I’d be, in either respect, without it. I’m also happy that it doesn’t look like I’ll be needing to invest in a replacement anytime soon.

    By the way, does anyone know where the new pronunciation of Colin Powell’s name came from (ie, Coal-in, Colon, etc)? Is it a case of journalists herding together to use their intermediary power to convince us of a new common-usage pronunciation of the name? Everyone appears to just assume that it’s the way Colin himself pronounces it, but I have this from that doyen of our mother tongue himself, Phillip Adams: “… it seems timely to point out that Powell isn’t Colon Powell. The erstwhile general objects to his name being colonised. As with sitting on the fence, he finds it a pain in the arse. Colin complains that nobody listens so he’s given up trying” (The Australian, 3 November 2001)

