The Aural Perception of Organ Tones
by Colin Pykett
Posted: January 2007 | Last revised: 22 December 2009 | Copyright © C E Pykett
Abstract

Two articles elsewhere on this website describe the tonal structures of flute and principal stops in terms of certain characteristics of their sounds, such as their frequency spectra. Because the dimensions of open metal flute pipes and principal pipes are so similar, it is remarkable that our aural perception mechanisms assign a quite distinct perceptual character to the two classes of tone. This article examines how these mechanisms might be operating on the information contained in the sound waves impinging on the ear, using techniques borrowed from computer pattern recognition and artificial intelligence, and it is shown that a computer is also capable of discriminating between these types of tone.
These results might not be of mere academic interest. On the contrary, their implications could be profound. Because of the rate of advance in artificial intelligence research, the article goes on to suggest that machines will progressively encroach on many currently sacrosanct aspects of professional life over the next few decades. For example, in the musical field, the results here can be extrapolated to show that a machine would have little difficulty recognising not only various organ stops but other types of instrument as well. Such abilities would be necessary for a machine which would demonstrate greater powers, such as a critical musical analysis based on a live performance. Once such capabilities have been demonstrated, it would then be legitimate to ask questions such as whether machines could replace teaching staff in universities and music colleges as they already have done in banks etc. Because machines do not require salaries or pensions, it would be surprising if these institutions did not begin to ask such questions themselves in the relatively near future. At present artificial intelligence is a largely invisible and under-discussed topic, pushed beyond the public's event horizon by other media issues such as climate change, yet its implications will be profound in decades to come.
Two articles elsewhere on this website describe the tonal structures of flute and principal stops in terms of certain characteristics of their sounds, particularly their frequency spectra or harmonic strengths [1], [2]. Because the dimensions of open metal flute pipes and principal pipes are so similar, it is perhaps remarkable that our aural perception mechanisms assign a quite distinct perceptual character to the two classes of tone. The articles just mentioned showed that one reason for this is the strength of the second harmonic, sounding the octave above the fundamental, relative to the others. In principals the second harmonic is usually much stronger than for flutes. Another reason is that flutes generally have odd-numbered harmonics which are stronger than the even-numbered ones, whereas principals do not. These differences were shown to arise from certain adjustments made when voicing the pipe, particularly the cut-up (mouth height) and the position of the upper lip relative to the lower.
Therefore it is probable that our ears and brains somehow latch onto these differences in the frequency spectra emitted by pipes, and it is of interest to ponder how the perception mechanism might be working. We know quite a lot about how our inner ear performs a spectrum analysis on the sound waves impinging on the ear drum, and we also know more or less how the spectrum information is encoded and passed to the brain through the auditory nerve. But beyond that our knowledge of what happens in the brain itself is currently sparse. Given an acceptance that the brain is but a machine, albeit an awesomely complex one, we can nevertheless proceed to do two quite different things: firstly we can try to construct models of how it might work, and secondly we can try to construct other types of machine which imitate the brain's behaviour as a "black box" whose internal workings are irrelevant. This sort of thinking spawned the field of artificial intelligence, which arose around the middle of the twentieth century as the first electronic computers appeared [3]. Without making any assumptions about how the brain works, this article goes some way towards demonstrating that a computer can recognise the differences between organ tones. Such a machine would therefore possess some basic properties of aural perception necessary for the appreciation of music, as does the brain.
Given the rate of advance of artificial intelligence research, which is increasing in breadth and capability exponentially with time as does the computer technology on which it relies, the question is not whether a machine capable of the most sophisticated musical analysis and appreciation could be made but when it will appear. Not only will it have these capabilities but it will interact verbally with humans in their own language and listen to their responses. The stage will then be set for the appearance of these machines in institutions such as universities and music colleges to assist or replace teaching staff. That such machines would not require salaries or pensions will not be lost on administrators.
This article shows that a basic requirement of such a machine, the ability to recognise and classify organ stops (and therefore musical instruments more generally) just by listening to their sounds, is already possible. It describes some elementary concepts of data analysis which the machines might use when performing these functions.
The Frequency Spectra of Principals and Flutes
We return now to the statement near the beginning of this article which said that the dimensions of principal pipes and many types of open metal flute are similar. The only major structural differences occur at the mouth, those of flutes having a cut-up (mouth height) maybe around 30% greater than those of principals, together with even more minute variations introduced by the voicer. Yet on the whole the two types of tone sound completely different. If you had just paid for a new organ in which you had specified a Wald Flute on the great, you would doubtless be pretty annoyed if it sounded like the Open Diapason. In fact it would not, even though the two ranks inside the instrument would probably not look much different.
How does this quite unmistakable perceptual difference come about, given the similarities in the pipes? Some reasons have been touched on above in terms of differences in the harmonic structure, particularly that the second harmonic is stronger in principals than in flutes. Also, principals do not exhibit the systematic differences between the odd and even numbered harmonics that flutes do. When we listen to the sounds of organ pipes we automatically classify them as flutes or principals without conscious effort. Without making any assumptions about how the brain does this, it is of interest to speculate whether a computer could operate on the acoustic spectrum in the relatively simple ways suggested by the differences just mentioned, and come up with its own "answer" as to whether the sound it was presented with was a flute or a principal.
Having worked for some years on topics such as computer-based pattern recognition, I am interested in such questions. There are of course dangers of over-simplification to be heeded. As an example, the steady state acoustic spectrum of a pipe, i.e. its harmonic structure, is by no means the only feature which characterises its sound despite the proper importance attached to it since the work of von Helmholtz in the 19th century. It has been known ever since the necessary electronic equipment became available around the 1930's that altering the attack and decay characteristics of a sound can completely change the way it is perceived by the ear and brain, even though the steady state spectrum remains the same. Merely by doctoring sounds in this way one can "change a violin into a piano" for example, as the demonstrations of the late Professor Charles Taylor often used to show. But it would be interesting if one could discover something relatively simple which relates to aural perception by a computer. Therefore I took some harmonic spectra of various organ pipes and put them into a computer to see if it could distinguish between them as readily as the ear does. The rationale of this work was two-fold: firstly, if a computer can do it, then there is a possibility, however remote, that our ears and brains might be doing it in a similar manner. Secondly, if a man-made machine can do it, this is of interest in itself because it might then be possible to go further and ultimately make the computer appreciate music at an arbitrary level of sophistication.
In this study I took only the harmonic structures of various types of organ tone, i.e. their frequency spectra in the steady state phase of pipe speech, as the starting point and ignored everything else such as attack transients. Some interesting results emerged which will now be described. Five types of flute tone and five types of principal were considered, as in Table 1 below:
Table 1. Flute and Principal stops used for the computer perception study
The flutes were those described in one of the earlier articles referred to already [1], in which the harmonic spectrum of each can be seen. The principals were those discussed in the other article mentioned [2], save Arp Schnitger's principals because the data relating to them were not available at the time of drafting this present article.
Each of the pipes whose sound was analysed was of 8 foot pitch, lying within an octave or so centred on middle C. Features relating to the frequency spectra (harmonic structures) were as follows:
1. Each frequency spectrum was represented as an ordered string of numbers which indicated the strengths of the harmonics, starting with the fundamental.
2. All numbers were in decibel (dB) form, a representation explained in more detail in [1]. Decibels express numbers logarithmically, and much of the data processing in the brain and its external sensors such as our ears seems to use a logarithmic description of the incoming information. For example, the phenomenon of subjective loudness is logarithmic, as discussed in [1]. So is pitch, where the musical importance of notes an octave apart (with a frequency ratio of 2) suggests strongly that logarithmic processing to the base 2 is used in some way by the brain. The intervals within the octave are related logarithmically. So is time, where most of the common time symbols used in music (semiquaver, quaver, crotchet, minim, semibreve, etc) are related to each other by the same binary, or factor-of-two, relationship.
3. The spectra were arbitrarily normalised, that is, for each spectrum the largest number was adjusted to 60 dB, with all the others being less. This removed the effect of the different volumes or loudnesses of the various samples.
4. Each number string contained 19 numbers, because this was the maximum number of harmonics encountered in this sample of 10 pipe spectra (for the Geigen Diapason as it happened). Those spectra with fewer than 19 harmonics were padded out with zeros so that they were all 19 numbers in length.
Therefore, as an example, the Harmonic Flute was represented by the number string:
(60, 29, 32, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
This spectrum can be seen plotted graphically in Figure 7 of [1], as can those of the other flute tones used here.
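As an illustration of steps 1 to 4, the short Python sketch below converts a set of raw harmonic amplitudes into one of these 19-number decibel strings. The amplitude values in the example are hypothetical, chosen only to show the processing; the measured spectra themselves are those plotted in [1] and [2].

    # A minimal sketch, in Python, of the representation described in steps 1 to 4.
    # The raw amplitude values in the example are hypothetical; only the processing
    # (conversion to decibels, normalisation of the strongest harmonic to 60 dB,
    # zero-padding to 19 numbers) follows the description above.

    import numpy as np

    N_HARMONICS = 19   # the longest spectrum in the sample (the Geigen Diapason)
    TOP_DB = 60.0      # arbitrary level to which the strongest harmonic is normalised

    def spectrum_vector(amplitudes):
        """Turn a list of linear harmonic amplitudes into a 19-number dB string."""
        amps = np.asarray(amplitudes, dtype=float)
        db = 20.0 * np.log10(amps / amps.max())   # strongest harmonic becomes 0 dB
        db = db + TOP_DB                          # shift so the strongest is 60 dB
        db = np.clip(db, 0.0, None)               # very weak harmonics count as absent
        padded = np.zeros(N_HARMONICS)
        padded[:len(db)] = db                     # pad out with zeros to 19 numbers
        return padded

    # Example with made-up amplitudes for a flute-like pipe:
    print(spectrum_vector([1.0, 0.03, 0.04, 0.0006, 0.0035]))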
To understand how the set of 10 spectra, each with 19 numbers, was manipulated by the computer it is first necessary to discuss some aspects of what is called data cluster analysis. This is a technique which lies at the heart of pattern recognition.
If a machine is to be able to recognise the differences between various occurrences of sensory input, each occurrence has to be represented within the machine in such a way that it lies closer in some manner to other occurrences of the same class than to occurrences of other classes. This might be a difficult statement to comprehend, and it might be helpful to take an example.
Take the problem of deciding whether something is too cold or too hot - a drink, maybe. This can be done purely on the basis of temperature, and assuming the machine has some means of measuring temperature such as the hot/cold sensors in our mouth, then we can represent the drink we have just tasted as a point on the "temperature" line in the diagram below (Figure 1):
Figure 1. A 1-dimensional Decision Space
It is pretty obvious that a temperature measurement of a drink can be used to classify it as too hot, too cold or just right, but this trivial diagram is actually a useful example of a decision space. As each trial of a drink results in only a single value, a number representing temperature, it is a diagram of a 1-dimensional decision space.
By including a second measurement performed on each drink, this time of sweetness as well, a 2-dimensional decision space is necessary to represent the results, as in Figure 2:
Figure 2. A 2-dimensional Decision Space
This time we have also included a red dot to represent the measurements of temperature and sweetness corresponding to the latest drink tested. These are shown with the coordinate values t and s respectively, lying along the two axes of the decision space. Because the dot happens to lie inside the region labelled "just right", we can classify this new arrival as OK also. But if this dot had fallen inside one of the other circles we should have said the drink had the properties of that particular region. If the dot had fallen in the region of the decision space outside all of the circles, we would have been uncertain as to the taste of this particular drink and we would therefore have had to classify it as "not sure whether I really like this one but I don't positively dislike it either". A more succinct description of this region of the space, frequently used in artificial intelligence research, is simply the "don't know" category.
Note that we have been implying a measurement of distance in this discussion when talking about whether a dot lies inside or outside the circles. The distance between any two points in this space can be calculated quite simply using Pythagoras's Theorem for right angled triangles, and the result of this calculation enables a machine to decide how close any point we choose lies to any of the circles, whether the point is inside a particular circle, etc. The machine cannot see or visualise the decision space as a picture on a page as we can here, but it does not need to - all it has to do is a simple Pythagoras-type calculation to figure out where one point lies relative to another.
The circles in Figure 2 are said to represent data clusters - every drink tested has its own values of temperature and sweetness, and they will all fall into one or other of the circles (though in practice the shapes of the clusters might not be circular), or into the don't-know region of uncertainty lying outside all of the circles. It is not a giant step beyond this to say OK, now let's stop talking about drinks and go back to the sounds of organ stops. Each cluster in the decision space could just as easily have a name such as "principal", "flute", etc, though in this case the axes of the diagram would obviously have to measure attributes relevant to organ tones rather than the temperature and sweetness of a drink.
In the previous section we derived some measurements of organ tones which required 19 numbers (measurements) to represent each one, and it is impossible to represent a point defined by 19 numbers in a 2-dimensional space. You can only plot 2-dimensional points in a 2-dimensional space. Even a 3-dimensional space would be no use. But there is nothing to stop us using a 19-dimensional space, that is, a space with 19 axes all at right angles to each other. If we do this, each of the 10 organ tones discussed previously can be represented as a single point in a 19-dimensional space. Hey, come on, I hear you say. Talking of 19 dimensions makes this article seem to be rapidly getting like Star Trek! Well, maybe it is, but it is a plain fact that any number of dimensions can be used with hardly any more difficulty than using just two or three. Even the distances between two points in a 19-dimensional space can be calculated using Pythagoras's Theorem, extended quite simply (and it is simple) to 19 dimensions rather than just 2.
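To show how little changes when the number of dimensions grows, here is a minimal sketch of that distance calculation. The same few lines work unaltered in 2 dimensions and in 19; the second 19-number point below is invented purely for illustration.

    # Pythagoras's Theorem extends to any number of dimensions simply by summing
    # the squared differences along every axis.

    import math

    def distance(p, q):
        """Euclidean distance between two points with the same number of coordinates."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print(distance((0, 0), (3, 4)))            # the familiar 2-dimensional case: 5.0

    flute = [60, 29, 32, 0, 11] + [0] * 14     # the Harmonic Flute string from above
    other = [60, 45, 30, 20, 15] + [5] * 14    # an invented 19-number spectrum
    print(distance(flute, other))              # exactly the same calculation in 19 dimensions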
There is, of course, a major snag in using more than 3 dimensions. We cannot begin to imagine what such high-dimensional spaces would look like - these are hyperspaces which can only be explored mathematically, they cannot be visualised by our brains. I leave you to ponder on whether this means that the humblest computer has an inbuilt intelligence greater than our own, as it is a philosophical question of no relevance here. Keeping our feet firmly on the ground, we do need nevertheless to solve the visualisation problem in some way, otherwise we will have no way of knowing what our data clusters "look like" in their hyperspace.
This problem has exercised the artificial intelligence community since the subject began, and there is no single means of solving it. Instead, there is a whole toolkit of different techniques for exploring hyperspaces, and the tools have to be selected depending on the problem and the preferences of individual workers. One early technique is still widely used however, as it displays an attractive and often extremely useful representation of a hyperspace in just two dimensions on a computer screen, and we can of course view such a display whereas we cannot view the hyperspace directly. This method was invented by J W Sammon [4] in the 1960's, and I published a variant of it some years later which was more efficient for some problems [5]. Rather to my surprise, this paper still apparently finds applications in fields as far apart as stock market analysis and the examination of medical imaging pictures, but we shall not use it here. When the amount of data is small, as it is here with only 10 organ tones being considered, Sammon's original technique is the more appropriate.
I mentioned above that Pythagoras's Theorem can be used to measure distances in any space regardless of the number of dimensions it represents. Sammon's visualisation technique uses Pythagoras to calculate all the inter-point distances between all data points in the hyperspace, and then it attempts to plot the same points in only 2 dimensions while keeping the inter-point distances the same. Therefore, if there was a cluster of data points in the hyperspace in which the inter-point distances were small, then this cluster should be observable in 2 dimensions. If there were several such clusters, again their existence should be visible. In practice it is usually not possible to achieve a perfect match between the inter-point distances in the hyperspace and the 2-dimensional space, but quite often a good correspondence can be achieved.
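For readers who like to see the idea in code, the following Python sketch is a much simplified version of Sammon's mapping: it starts from a random 2-dimensional layout and repeatedly nudges the points so that their mutual distances match the distances in the original hyperspace as closely as possible. It uses plain gradient descent on the Sammon "stress" rather than the second-order update of the original paper [4], so it should be read as an illustration of the principle, not a faithful reimplementation.

    # A simplified sketch of Sammon's nonlinear mapping: minimise, by gradient
    # descent, the mismatch between inter-point distances in the original
    # hyperspace and those in a 2-dimensional layout.

    import numpy as np

    def sammon_map(X, n_iter=500, lr=0.3, seed=0):
        """Map the rows of X (one data point per row, any dimensionality) into 2-D."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # hyperspace distances
        D = np.maximum(D, 1e-9)                                      # avoid division by zero
        c = D[np.triu_indices(n, 1)].sum()
        Y = rng.normal(scale=1e-2, size=(n, 2))                      # random starting layout
        for _ in range(n_iter):
            d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # current 2-D distances
            d = np.maximum(d, 1e-9)
            coeff = (D - d) / (D * d)            # weight of each pairwise distance mismatch
            np.fill_diagonal(coeff, 0.0)
            grad = (-2.0 / c) * (coeff[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
            Y -= lr * grad                       # move each point to reduce the mismatch
        return Y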
We shall now use Sammon's method to examine what our organ tones look like in their 19-dimensional space.
To recapitulate where we have got to:
We took 10 examples of different organ tones, 5 flutes and 5 principals, and represented the harmonic spectrum of each one as a string of 19 numbers. We plotted each 19-number string as a point in a 19-dimensional hyperspace, and then used Sammon's visualisation technique to re-plot the points in 2 dimensions so we could view them. This plot is shown in Figure 3.
Figure 3. Harmonic structures of organ sounds represented as points in a 2-dimensional space
Each point corresponds to one of the organ pipe sounds listed in Table 1 and they have been labelled for convenience. The most interesting feature is that the five flute sounds and the five principals form two reasonably distinct clusters, identified by the black lines surrounding them. Another way of saying the same thing is that the data points for the two types of tone are not mixed up in the diagram. Note that I drew in the lines surrounding the clusters afterwards; they were not drawn automatically as part of the visualisation process itself. Another interesting feature is that the relative positions of the points seem to have meaning. For example, the "flutiest" diapason, the Diapason Phonon, lies closest to the flute cluster whereas the "stringiest" one, the Geigen, is furthest away in the top right hand corner. And for the flutes, the one with the most harmonic development is the Stopped Diapason and this lies closest to the principal cluster, which represents stops with more harmonics than flutes. The remaining flutes have less harmonic development than the Stopped Diapason and they all lie to the left of it, further away from the principals, in the picture.
These features indicate that the harmonic spectrum of an organ pipe, at least in the case of flutes and principals, contains sufficient information for a machine to be able to distinguish between them using the simplest of pattern recognition techniques developed well over 30 years ago [6]. If the harmonic structure of an unknown pipe was presented to the machine, there is a good chance it would be able to say whether it was a flute or a principal and maybe go further to say what type of stop it was, by measuring the distances between the unknown data point and those making up the existing clusters in the decision hyperspace. More data would need to be analysed before we could confirm this statement, but the results of the analysis in Figure 3 are encouraging. Incidentally, it would be easy to program the computer so that it listened to the sounds for itself - there would be no need for human intervention, such as in deriving the harmonic spectra as a separate exercise. Such a machine, although having a rather rudimentary perception capability, would probably provide food for thought for at least some musicians new to the subject of artificial intelligence.
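A sketch of that classification step is given below: the unknown spectrum is assigned to whichever cluster's centre (mean point) it lies nearest to in the 19-dimensional space, or rejected into the "don't know" category of Figure 2 if it is far from all of them. The rejection threshold is an arbitrary placeholder, not a value derived from the present data.

    # A sketch of nearest-cluster classification in the 19-dimensional decision
    # space. The reject_distance threshold is an arbitrary placeholder.

    import numpy as np

    def classify(unknown, clusters, reject_distance=30.0):
        """clusters maps a class name ("flute", "principal", ...) to an array of
        known 19-number spectra; unknown is a single 19-number spectrum."""
        unknown = np.asarray(unknown, dtype=float)
        best_name, best_dist = None, np.inf
        for name, points in clusters.items():
            centroid = np.asarray(points, dtype=float).mean(axis=0)   # centre of the cluster
            dist = np.linalg.norm(unknown - centroid)                 # Pythagoras in 19 dimensions
            if dist < best_dist:
                best_name, best_dist = name, dist
        # the equivalent of the "don't know" region lying outside all the clusters
        return best_name if best_dist <= reject_distance else "don't know"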
It has been shown that the frequency spectra (harmonic structures) of flute and principal organ pipes contain sufficient information for a computer to be able to discriminate between them, and therefore to classify them automatically. This is not really surprising because the ear and brain can also do the same job, though no suggestion is made that aural perception in humans uses similar mechanisms to those programmed into the computer for the purposes of this study. Nevertheless, the fact that a basic requirement of musical perception - the recognition of the type of instrument being played - can be done readily by a machine suggests that more sophisticated levels of musical analysis might also be amenable to mechanistic simulation.
This is likely to become the case in the relatively near future. Although the concepts described in this article might seem abstruse and complicated, it is necessary to be aware that they illustrate the simplest of methods in widespread use 35 or more years ago, and today they are old hat in terms of artificial intelligence techniques. Given the advances made since that time, it is but a small step beyond current capabilities to envisage a computer which could listen to an organ performance and discover for itself the stops which were used. (At the same time it would recognise the piece played and thus be able to pick up mistakes). The machine could then make suggestions for improving the clarity or musicality of future renditions using different registrations. Interaction with the system would not be via a computer keyboard, mouse and monitor - it would make its suggestions by speaking to you in your own language and it would listen to your verbal responses. A more elaborate system could establish for itself the full score of a new orchestral work, say, merely by listening in real time to its first performance. It would then be able to ask you questions such as whether you thought the sequences starting at bar 46 in the second movement were essential to the work or whether they merely indicated poverty of invention, or whether the use of both oboes at once was compatible with the desired quiet texture elsewhere. If you did not have the score on your lap you would be at a considerable disadvantage against that machine. If such a machine was available, using it to reduce the subjectivity of adjudicating things like the quality of compositions for a D Mus degree could follow.
There is little doubt that machines with this level of capability will be possible well before the end of this century, and probably within the next 20 to 30 years. The more valid question is whether a company will think it worthwhile to develop and market them, which is a commercial judgement rather than a technical one. Like other employing institutions, universities and music colleges might at some point begin to consider whether buying these machines would be more cost-effective than paying salaries to as many teaching staff as they currently employ and making provision for their pensions. Machines do not need salaries or pensions. We have seen it happen with banks and similar institutions, now we can expect to see machines replace many of today's professionally qualified staff in many disciplines over forthcoming decades.
It is interesting to speculate whether machines with this level of capability could be said to be intelligent or conscious, though perhaps the question is irrelevant if you are about to lose your job as a music lecturer to one of them. The more important point is whether they will be able to do your job more cheaply than you can. Whatever you might think about these ideas, it is inadvisable to dismiss too lightly the impact that artificial intelligence will have on the lives of those still in the cradle today.
1. "The Tonal Structure of Organ Flutes", Colin Pykett 2003, currently on this website (read).
2. "The Tonal Structure of Organ Principals", Colin Pykett 2006, currently on this website (read).
3. A small minority of scientists and a larger proportion of non-scientists do not accept that the workings of the brain will ever be explained in terms of machine-like mechanistic principles, though they have yet to propose a credible alternative view which can be used as a basis for research. But if it is a machine, it should not be assumed that it works like a computer nor even that a computer can necessarily be used to model it. Because of such unresolved problems, the field of artificial intelligence (AI) research is associated with rather more than its fair share of academic disreputability, and even today the artificial intelligentsia are unable to define the meanings of basic words which they use frequently, such as consciousness and intelligence. However this problem affects most fields - for example, what musician or physicist could explain what time is, a concept essential to both disciplines? Therefore these philosophical difficulties have not inhibited progress, and areas such as computer-based pattern recognition (which fall within the broad ambit of AI) have delivered useful results. For example, commercially viable systems are now available to read car license plates or recognise speech (sometimes rather badly). No implication is made that these systems necessarily mimic the brain or illuminate how it works; they are merely examples of how these functions can be implemented mechanistically by another type of machine - a computer. It is this narrower aspect and interpretation of AI which is adopted in this article, together with an educated belief that its impact on 21st century life will be dramatic and largely unexpected over the next two or three decades whether we like it or not. Unlike subjects such as global warming, the growth and potential of AI remains at present a largely invisible and under-discussed issue to the general population.
4. "A nonlinear mapping for data structure analysis", J W Sammon, IEEE Transactions on Computers, vol. C-18, 1969, p. 401.
5. "Improving the efficiency of Sammon's nonlinear mapping by using clustering archetypes", C E Pykett, Electronics Letters, vol. 14(25), 1978, p. 799.
6. When the clusters in the 2-dimensional space are as well defined as they are here, this usually means that much of the information contained in the original hyperspace is redundant. In other words, we probably do not need to use as many as 19 dimensions to represent the original data points if they cluster well in only 2. Given the earlier discussions in this article relating to the physics of organ pipes, it would probably be possible to use fewer numbers to represent each spectrum, such as one number defining the strength of the second harmonic relative to the fundamental, another which measures the ratio of the acoustic power contained in the odd-numbered harmonics to that contained in the even-numbered ones, a third indicating the total number of harmonics, etc. Using only this reduced set of numbers, it is possible or even likely that the separation between the clusters would increase because the "noise" introduced by the excess of useless dimensions in the hyperspace would have been removed. Analysing a data set with the intention of reducing its dimensionality in this way is called dimensionality reduction, and one widely used mathematical technique for doing it automatically, principal components analysis (PCA for short), can be carried out by the computer itself.
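As a small illustration, the three physically motivated features suggested in this note can be computed directly from one of the 19-number dB strings described earlier, as in the sketch below. The conversion from decibels back to relative acoustic power (a power ratio of 10^(dB/10)) is standard; the function name and layout are my own.

    # A sketch of the reduced feature set suggested above, computed from one of
    # the 19-number dB strings. Zero entries are treated as absent harmonics,
    # consistent with the zero-padding used earlier.

    import numpy as np

    def reduced_features(spectrum_db):
        s = np.asarray(spectrum_db, dtype=float)
        power = np.where(s > 0, 10.0 ** (s / 10.0), 0.0)   # dB -> relative acoustic power
        second_vs_fundamental = s[1] - s[0]                # second harmonic re fundamental, in dB
        odd = power[0::2].sum()                            # harmonics 1, 3, 5, ...
        even = power[1::2].sum()                           # harmonics 2, 4, 6, ...
        odd_even_ratio = odd / even if even > 0 else float("inf")
        n_harmonics = int((s > 0).sum())                   # total number of harmonics present
        return second_vs_fundamental, odd_even_ratio, n_harmonics

    # Example using the Harmonic Flute string given earlier:
    print(reduced_features([60, 29, 32, 0, 11] + [0] * 14))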