In the last few days, I have been reading and thinking a lot about audio compression.
Lossy v. lossless compression
As most of you will know, there are two major types of compression: lossless and lossy. In the first case, we take a string of digital information and reduce the amount of space it takes to store without actually destroying any information at all. For example, we could take a string like:
1-2-1-7-3-5-5-5-5-5-5-5-5-5-5-5-5-5-2-2-2-3-4
And convert it into:
1-2-1-7-3-5(13)-2(3)-3-4
Depending on the character of the data and the kinds of rules we use to compress it, this will result in a greater or lesser amount of compression. The upshot is that we can always return the data to its original state. If the file in question is an executable (a computer program), this is obviously required. A file that closely resembles Doom, as a string of bits, will nonetheless probably not run like Doom (or at all).
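The scheme in the example above is usually called run-length encoding. Here is a minimal Python sketch of it, mostly to show that the round trip really is lossless. The function names (`rle_encode`, `rle_decode`, `format_runs`) are my own inventions for illustration, and real lossless compressors use far more sophisticated rules than this.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

def format_runs(pairs):
    """Render pairs in the notation used above, e.g. 5(13) for thirteen fives."""
    return "-".join(str(v) if n == 1 else f"{v}({n})" for v, n in pairs)

# The example string: five single values, thirteen 5s, three 2s, then 3 and 4.
original = [1, 2, 1, 7, 3] + [5] * 13 + [2] * 3 + [3, 4]

encoded = rle_encode(original)
print(format_runs(encoded))            # 1-2-1-7-3-5(13)-2(3)-3-4

# Lossless means decoding gives back exactly what we started with.
assert rle_decode(encoded) == original
```

Note that data with no repeated runs would not shrink at all under this rule, which is what "a greater or lesser amount of compression" means in practice.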
Lossless compression is great. It allows us, for instance, to go back to the original data and then manipulate it with as much freedom as we had to begin with. The cost of that flexibility is that losslessly compressed files are larger than those treated with lossy compression. For data that is exposed to human senses (especially photos, music, and video), it is generally worthwhile to employ ‘lossy’ compression. A compact disc stores somewhere in the realm of 700MB of data. Uncompressed, that would take up an equivalent amount of space on an iPod or computer hard drive. There is almost certainly some level of lossy compression at which it would be impossible for a human being with good ears and the best audio equipment to tell whether they were hearing the compressed or the uncompressed version. This is especially true when the data source is a CD, a format that has considerable limitations of its own when it comes to storing audio information.
Lossy compression, therefore, discards the parts of the information that are least noticeable in order to save space. Two patches of sky that are almost-but-not-quite the same colour of blue in an uncompressed image file might become exactly the same colour of blue in a compressed image file. This happens to a greater and greater degree as the level of compression increases. As with music, there is some point where it is basically impossible to distinguish the original uncompressed data from a compressed file of high quality. With music, it might be that a tenth of a second of near silence followed by a tenth of a second of the slightest noise becomes a fifth of a second of near silence.
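To make the sky example concrete, here is a toy Python sketch in which two nearly identical blues collapse into one colour once each channel is rounded to a coarser grid. This is only an illustration of the idea of discarding fine detail: the `quantize` helper, the step size of 8, and the two colour values are all assumptions of mine, and real image codecs work quite differently.

```python
def quantize(channel, step=8):
    """Round an 8-bit channel value down to the nearest multiple of `step`."""
    return (channel // step) * step

def quantize_pixel(rgb, step=8):
    """Apply the same coarse rounding to the red, green and blue channels."""
    return tuple(quantize(c, step) for c in rgb)

# Two hypothetical, almost-but-not-quite identical sky blues.
sky_a = (100, 149, 237)
sky_b = (102, 151, 238)

print(sky_a == sky_b)                                   # False
print(quantize_pixel(sky_a) == quantize_pixel(sky_b))   # True

# A larger step throws away more detail, so more colours merge.
print(quantize_pixel(sky_a, step=8), quantize_pixel(sky_a, step=64))
```

Once the two blues have been merged, nothing in the compressed file records that they were ever different, which is why the process cannot be undone.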
MP3 and AAC are both very common kinds of music compression. Each can be done at different bit rates, which determine how much data is used to represent a given length of time. Higher bit rates contain more data (which one may or may not be able to hear), while lower bit rates contain less. The iTunes standard is to use 128 kbps AAC, that is, 128 kilobits of data for each second of music. I have seen experts do everything from utterly condemning this as far too low to claiming that at this level the sound is ‘transparent’, meaning that it is impossible to tell that it was compressed.
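The arithmetic behind bit rates is simple enough to sketch in Python. I am assuming here that bit rates are expressed in kilobits per second (kbps), as is conventional, and that a track runs four minutes; the `size_megabytes` helper is my own and ignores any container or tag overhead. The CD figure follows from the format itself: 44,100 samples per second, 16 bits per sample, two channels.

```python
def size_megabytes(bitrate_kbps, seconds):
    """Approximate size in megabytes (10**6 bytes), ignoring file overhead."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8 / 1_000_000

# Uncompressed CD audio: 44,100 samples/s x 16 bits x 2 channels.
CD_KBPS = 44_100 * 16 * 2 / 1000   # 1411.2 kbps

TRACK_SECONDS = 4 * 60             # assumed track length

for label, kbps in [("128 kbps", 128), ("192 kbps", 192),
                    ("256 kbps", 256), ("CD audio", CD_KBPS)]:
    print(f"{label:>9}: {size_megabytes(kbps, TRACK_SECONDS):6.2f} MB")
```

On these assumptions a 128 kbps file comes out at roughly an eleventh of the size of the same track stored as plain CD audio, which is the whole appeal.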
But what sort to use, exactly?
Websites on which form of compression to use generally take the form of: “I have made twenty-five different versions of the same three songs. I then listened to each using my superior audio equipment and finely tuned ear and have decided that X is the best sort of compression. Anyone who thinks you should use something more compressed than X obviously doesn’t have my fine ability to discern detail. Anyone who wants you to use something less compressed than X is an audiophile snob who is more concerned about equipment than music.”
This is not a very useful kind of judgment. Most problematically, the person doing the listening is also the experimenter, and so knows which track is which. It has been well established that taking an audio expert and telling them that they are listening to a $50,000 audiophile quality stereo will lead to a good review of the sound, even if they are really listening to a $2,000 system. (There are famous pranks where people have put a $100 portable CD player inside the case for absurdly expensive audio gear and passed the former off as the latter to experts.) The trouble is both that those being asked to make the judgment feel pressured to demonstrate their expertise and that people genuinely do perceive things which they expect to be superior as being so.
Notoriously, people who are given Coke and Pepsi to taste are more likely to express a preference for the latter if they do not know which is which, but for the former when they do. Their pre-existing expectations affect the way they taste the drinks.
What is really necessary is a double-blind study. We would make a large number of versions of a collection of tracks with different musical qualities. The files would then be assigned randomized names by a group that does not communicate with either the experimenters or the subjects. The subjects would then listen to two different versions of the same track and choose which they prefer. Each of these trials would produce what statisticians call a dyad. Once we have hundreds of dyads through which to compare versions, presented in differing orders, we can start to generate statistically valid conclusions about whether the two versions can be distinguished, and which one is preferred on average.
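The blinding step could be sketched along these lines in Python. Everything here is hypothetical: the file names, the choice of eight-character random labels, and the helper names `make_blind_key` and `make_trials` are assumptions made for illustration, not a description of any existing test protocol.

```python
import random
import secrets
from itertools import combinations

def make_blind_key(versions):
    """Give every version an opaque random label; the dict is the secret key."""
    key = {}
    for version in versions:
        label = secrets.token_hex(4)
        while label in key:              # regenerate on the unlikely collision
            label = secrets.token_hex(4)
        key[label] = version
    return key

def make_trials(labels, repeats, rng):
    """Pair up all labels, repeat each pair, and shuffle both kinds of order."""
    trials = []
    for a, b in combinations(labels, 2):
        for _ in range(repeats):
            # Randomize which version is played first within the pair.
            trials.append((a, b) if rng.random() < 0.5 else (b, a))
    rng.shuffle(trials)                  # randomize the order of the trials
    return trials

versions = ["aac_128.m4a", "aac_192.m4a", "aac_256.m4a"]  # hypothetical files
key = make_blind_key(versions)           # kept by the third party only
trials = make_trials(sorted(key), repeats=10, rng=random.Random())

print(len(trials), "trials; first pair:", trials[0])
```

The experimenters and subjects only ever see the labels in `trials`. The `key` dictionary stays with the third party until all of the choices have been recorded.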
We would then analyze those preference counts to determine whether the difference between one version (say, 128 kbps AAC) and another (say, 192 kbps AAC) is statistically significant. I would posit that we will eventually find a point where people pick one or the other at random, because the two are essentially the same (640 kbps AAC v. 1024 kbps AAC, for instance). We would therefore take the highest quality setting that is still distinguishable from the one below it at, say, a 95% confidence level, and use that to encode our music; anything higher costs space without an audible benefit.
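One way to run that significance check is an exact binomial test: if listeners truly cannot tell two versions apart, each choice is a coin flip, and we ask how surprising the observed split would be under that assumption. The choice of test is mine, and the counts below are made up purely for illustration. A minimal Python sketch:

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value under H0: every choice is a fair coin flip."""
    # With p = 0.5 the distribution is symmetric, so double the upper tail
    # starting from whichever side of the split is larger.
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

def distinguishable(successes, trials, alpha=0.05):
    """True if the split is significant at the 1 - alpha confidence level."""
    return binomial_two_sided_p(successes, trials) < alpha

# Hypothetical results: how many of 100 dyads favoured the higher bit rate.
made_up_results = {"128 v. 192": 65, "192 v. 256": 60, "256 v. 320": 50}

for comparison, preferred_higher in made_up_results.items():
    p = binomial_two_sided_p(preferred_higher, 100)
    verdict = "distinguishable" if distinguishable(preferred_higher, 100) else "not shown"
    print(f"{comparison}: p = {p:.4f} ({verdict})")
```

With these invented numbers only the first comparison clears the 95% bar; a 60 to 40 split over 100 trials falls just short, which shows why hundreds of dyads are needed before small differences become detectable.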
This methodology isn’t perfect, but it would be dramatically more rigorous than the expert-driven approach described above.