About a week ago, I attended a discussion session and social event on ‘data journalism’. To a large degree, it was about converting datasets, many of them collected from governments, into news stories of interest to the general public. You can take crime data, for instance, and process it into a form with a lot of general appeal. The same goes for education, transport, and other topics.
One general point that the discussion reminded me of is the importance of the distinction between aggregated and disaggregated data. For example, saying that the average income in Happytown is $75,000 is quite different from providing the individual data points for every person in the town. If you give someone the first piece of data, all they can really do is report it and compare it with similar statistics. If you give them the disaggregated data, they can do all sorts of their own analysis. What do the top and bottom 10% of the population earn? Are there any high or low outliers?
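To make the difference concrete, here is a small Python sketch. The ten incomes are invented for illustration (chosen so that their average comes out to the $75,000 figure above), and the 1.5 × IQR rule used to flag outliers is just one common convention I am assuming here, not the only way to define one.

```python
import statistics

# Hypothetical incomes for ten Happytown residents (invented for illustration).
incomes = [18_000, 24_000, 31_000, 38_000, 42_000,
           47_000, 55_000, 63_000, 82_000, 350_000]

# The aggregate view: a single number you can report and compare.
mean_income = statistics.mean(incomes)

# The disaggregated view lets you ask your own questions.
median_income = statistics.median(incomes)

ranked = sorted(incomes)
tenth = max(1, len(ranked) // 10)   # how many people make up 10% of the town
bottom_10 = ranked[:tenth]          # lowest-earning 10%
top_10 = ranked[-tenth:]            # highest-earning 10%

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(ranked, n=4)
iqr = q3 - q1
outliers = [x for x in ranked if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(f"mean:       ${mean_income:,.0f}")
print(f"median:     ${median_income:,.0f}")
print(f"bottom 10%: {bottom_10}")
print(f"top 10%:    {top_10}")
print(f"outliers:   {outliers}")
```

With these invented numbers the mean is $75,000 but the median is only $44,500, because a single very high earner pulls the average up. That is exactly the sort of thing you cannot see if all you are handed is the average.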
If the data is embedded in a database with other types of information, you can do even fancier things. Which are the richest neighbourhoods in town? What level of education does the average person earning more than $100,000 possess? If you can link databases together, you can go further still. What kinds of crime are committed in the city’s poorest neighbourhoods? How about in the richest?
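Here is a rough sketch of what that linking might look like in Python. Both tables, the neighbourhood names, and the field names are hypothetical; a real analysis would more likely use a database or a data-analysis library, but plain dictionaries are enough to show the idea.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical income records (invented for illustration).
residents = [
    {"neighbourhood": "Northside", "income": 120_000},
    {"neighbourhood": "Northside", "income": 95_000},
    {"neighbourhood": "Midtown", "income": 60_000},
    {"neighbourhood": "Midtown", "income": 70_000},
    {"neighbourhood": "Riverside", "income": 30_000},
    {"neighbourhood": "Riverside", "income": 42_000},
]

# A second, separate hypothetical dataset: reported crimes.
crimes = [
    {"neighbourhood": "Northside", "type": "fraud"},
    {"neighbourhood": "Northside", "type": "fraud"},
    {"neighbourhood": "Northside", "type": "theft"},
    {"neighbourhood": "Midtown", "type": "theft"},
    {"neighbourhood": "Riverside", "type": "theft"},
    {"neighbourhood": "Riverside", "type": "theft"},
    {"neighbourhood": "Riverside", "type": "vandalism"},
]

# Group incomes by neighbourhood and average them.
incomes_by_area = defaultdict(list)
for person in residents:
    incomes_by_area[person["neighbourhood"]].append(person["income"])
avg_income = {area: mean(values) for area, values in incomes_by_area.items()}

richest = max(avg_income, key=avg_income.get)
poorest = min(avg_income, key=avg_income.get)

# Link the two datasets on the shared neighbourhood field.
crime_types = defaultdict(Counter)
for crime in crimes:
    crime_types[crime["neighbourhood"]][crime["type"]] += 1

for label, area in (("richest", richest), ("poorest", poorest)):
    top_crime, count = crime_types[area].most_common(1)[0]
    print(f"{label}: {area} (avg ${avg_income[area]:,.0f}), "
          f"most common crime: {top_crime} ({count})")
```

The join works only because both datasets share a neighbourhood field. That shared key is what makes linked databases so much more useful than either one alone.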
All this creates privacy risks, particularly given how data from different databases can be combined and used to identify individuals. There is also the risk of errors, if data from different sources is incorrectly integrated, or if the methodology of analysis is not sound. That is all the more reason why basic statistical literacy is an increasingly important piece of education for those trying to make sense of the world. Without it, you may fall victim to deeply faulty claims. The average income of a Happytown resident who owns a monocle may be $500,000, but that doesn’t mean that buying a monocle will make you rich.
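The monocle fallacy is easy to reproduce with a toy dataset. Everything below is invented: the residents, the incomes, and in particular the assumption that every monocle was bought within the past year, so that last year's income predates the purchase.

```python
from statistics import mean

# Hypothetical residents (invented for illustration). Assumption: every
# monocle was bought within the past year, so income_last_year predates it.
residents = [
    {"income_last_year": 440_000, "income_now": 450_000, "owns_monocle": True},
    {"income_last_year": 540_000, "income_now": 550_000, "owns_monocle": True},
    {"income_last_year": 39_000, "income_now": 40_000, "owns_monocle": False},
    {"income_last_year": 58_000, "income_now": 60_000, "owns_monocle": False},
    {"income_last_year": 77_000, "income_now": 80_000, "owns_monocle": False},
]

owners = [r for r in residents if r["owns_monocle"]]
others = [r for r in residents if not r["owns_monocle"]]

# The tempting comparison: monocle owners earn far more than everyone else.
owner_avg_now = mean(r["income_now"] for r in owners)
other_avg_now = mean(r["income_now"] for r in others)

# The check that undoes it: owners were already rich before they bought one.
owner_avg_before = mean(r["income_last_year"] for r in owners)

print(f"owners now:        ${owner_avg_now:,.0f}")
print(f"non-owners now:    ${other_avg_now:,.0f}")
print(f"owners, last year: ${owner_avg_before:,.0f}")
```

In this made-up town, monocle owners average $500,000 against $60,000 for everyone else, but they were already averaging $490,000 before any monocle was purchased. The wealth came first.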