Sampling based synthesis and wavetable lookup

Sampling and wavetable synthesis, although originally slightly different in scope, have come to stand for the same thing: playing back stored sound through some processing. This processing includes mixing multiple sounds, pitch shifting and interpolation thereof plus some additional filtering and possibly effects. The idea behind sampling is to use existing sound material (which is often very difficult to synthesize exactly) as the starting point, and so to create very convincing simulations of acoustical instruments. Of course, when crude processing is applied, instead of subtle coloration, entirely new sounds emerge. A good example is the way modern dance music abuses its drum tracks and beats; musique concrete and industrial are further examples: instead of sampling real instruments, they sample almost any sound in existence. Because of this great generality, sampling is one of the most versatile synthesis methods in existence and is widely deployed in commercial synthesizers.

Under the hood, even such a naive technique as sample based synthesis can be quite tricky. When implementing wavetable playback, one needs to start with a sampled sound and play it at a different speed, yet still produce a data stream at a constant rate. (Only in special purpose synthesis solutions can one expect to find multiple truly variable rate DACs. In home multimedia products, one needs to do the mixing part in software and deliver the resultant data stream through a single, fixed rate DAC.) The standard way to accomplish this is to keep a counter which points to the sample now being played, and an increment which tells how much to add to the counter to reach the next sample. Both the counter and the increment need to include a fractional part, because otherwise only a very limited set of pitches could be used. From the fractional parts the next issue arises: what to do when you have to actually output a sample from a fractional wavetable offset? One solution would be to just neglect the fractional part (truncate the address to a whole number) and use the previous true sample. This is bad - what it amounts to is severe (by musical standards) nonlinear distortion; in effect, noise. A better way would be to round the offset, but even this isn't suitable for high quality synthesis. A better solution still is to interpolate. That means you take a couple of samples on both sides of the fractional location and use them and the counter value to create a suitable sample in between. The most common method encountered is linear interpolation: here one conceptually draws a line between the sample values - basically taking a weighted average of the two nearest samples where the weighting coefficient is the fractional part of the address. This is already quite good. As long as the signal is properly bandlimited (no components approaching the Nyquist limit), no problem arises.
However, linear interpolation doesn't behave very well in the presence of very high frequencies. To correct for this, one might raise the order of interpolation - fitting splines, for example. But even this isn't optimal. The reason is that straightforward lookup and interpolation leads to aliasing if large enough increments are employed and, in addition to that, polynomial interpolation is not, theoretically, the right way to go. (When the order of the polynomial increases, so does the wiggle between sample values - the result is poorly behaved and leads to high frequency distortion.) Also, because one is in effect down-sampling the signal, sooner or later some of the higher frequency components will fold and sound bad. The optimal solution to the problem would be to do true band-limited interpolation. (Meaning reconstruction by a sufficiently accurate representation of the perfect low-pass filter and sampling the result at specified intervals.) However, this is not suitable for real-time operation and usually wouldn't even be worthwhile. (Some oversampling headroom and linear interpolation will do the trick in most musical applications.)
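
The counter-and-increment scheme with linear interpolation can be sketched roughly as follows. This is only a minimal illustration, not how any particular sampler does it: the function name `play_wavetable`, the 64-sample sine table and the use of floating point for the counter (real implementations usually use fixed point) are all assumptions made for the example.

```python
import math

def play_wavetable(table, increment, num_samples):
    """Read `table` cyclically at a fractional rate, interpolating linearly."""
    out = []
    n = len(table)
    phase = 0.0                      # the counter, with a fractional part
    for _ in range(num_samples):
        i = int(phase)               # whole part: the previous true sample
        frac = phase - i             # fractional part: the weighting coefficient
        a = table[i]
        b = table[(i + 1) % n]       # the next true sample, wrapping at the end
        out.append(a + frac * (b - a))
        phase += increment
        while phase >= n:            # wrap the counter back into the table
            phase -= n
    return out

# One sine cycle stored in 64 samples (hypothetical test material).
# An increment of 1.5 plays it back a perfect fifth above the original pitch.
table = [math.sin(2.0 * math.pi * k / 64) for k in range(64)]
samples = play_wavetable(table, 1.5, 8)
```

With an increment of exactly 1.0 the table is reproduced unchanged; any other value resamples it, and large increments will alias just as described above.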

In addition to the basic resampling process, some other features are also needed to make a workable synthesis method. The first is looping. When using a sampled sound, a lot of space is required to hold the wavetable. Often this requirement creates a need to somehow compress the sound. The most obvious way is to make it loop. This means wrapping the counter over some specific point in the wavetable back to some earlier address. This is useful because most instrumental sounds start with an attack transient and decay to a quasi-periodic waveform that is often amenable to looping. When the sound reaches the loop, it also sticks - depending on the length of the loop, slow variations in the sound disappear or become cyclic. This can be good or bad, depending on the goals of the musician. On top of the looping resampler, volume enveloping and low-order low-pass filtering is usually applied. The need for the former is obvious: to create discernible notes. Filtering, on the other hand, can be used in different ways. The original reason (and the reason why most such filters are low-pass) is that when one plays acoustical instruments louder, the resulting sound is richer in high partials than when playing soft. This is an effect easily emulated by a properly controlled time variable low-pass filter. But since the time of the first samplers, filtering has become an expressive tool as well (as in the analog synthesizers) and thus more complex filter designs have proliferated in samplers as well.
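
Looping amounts to nothing more than a conditional wrap of the counter. A minimal sketch, assuming a hypothetical helper `looped_positions` that only tracks the read position (the actual sample lookup and interpolation are left out) and loop points given as table offsets:

```python
def looped_positions(increment, loop_start, loop_end, num_samples):
    """Return the read positions of a counter that plays the attack once
    and then cycles forever between loop_start and loop_end."""
    loop_len = loop_end - loop_start
    pos = 0.0
    positions = []
    for _ in range(num_samples):
        positions.append(pos)
        pos += increment
        while pos >= loop_end:       # passed the loop end: jump back
            pos -= loop_len
    return positions

# Attack occupies offsets 0..1, the loop covers offsets 2..3.
print(looped_positions(1.0, 2, 4, 7))
```

Subtracting the loop length, instead of resetting to `loop_start`, keeps the fractional part of the counter intact, so the pitch does not drift at the loop point.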

What are the pros and cons of sampling synthesis, then? On the plus side, sampling is extremely versatile: you can sample almost anything with a good Akai. It is also quite easy and efficient to implement, leading to low-cost designs. Further, when beefed up by some additional processing, very convincing acoustical instrument simulations can be created (witness the K2500). Drums in particular are almost made for sampling. And, because of the inherent quality of sampling, a true industry of sound distribution and reuse has developed around sampling - instruments and samples are easy to obtain and to create. On the minus side, sampling isn't very good at creating time variable timbres, per se. Original synthesis is also very difficult on the basic sampler. A lot of memory is required and since layering (using multiple separately controlled sampled sounds to create one instrument) is widely employed, processing requirements are not as low as one might think. Sampling doesn't have many perceptually significant parameters when thought of as a synthesis method, and it also has a tendency to distort sounds in a non-natural way when significant pitch shifts are used. (Playing a sample faster means shortening it - not something that happens in natural instruments or that the ear is used to. This shortening leads to severe distortion of the oh-so-important attack transient in acoustical sounds. Further, pitch shifting stretches the harmonics of the sound proportionally. This is not something that would occur in natural instruments, either.)
Beefing the idea up with more sophisticated filtering, layering, modulation (on amplitude, pan position, pitch, filter parameters etc.), multisampling (multiple samples of the same instrument for different parts of the scale), per timbre effects, wavetable interpolation (fading multiple wavetables in and out in series) and so on makes the method very usable but sampling is still quite limited if truly original synthesis or complex modulation and performance control are needed.

Subtractive synthesis

Subtractive synthesis is the prominent method in analog synthesizers and, of course, has its digital counterpart. The idea is to start with a sound rich in partials and subtract from it to create a desired timbre. In effect, this means placing time-variable filters on the signal path. In analog implementations, the filter is usually of the 12dB or 24dB per octave resonant low-pass or multimode type. In digital implementations (e.g. Z-plane synthesis, by E-mu), the filter order may be much greater and in some academic ones even approach the orders used in linear prediction, making the method suitable for spectral analysis based modelling. The basic method is quite limited, though, because of the low number of controllable parameters. It is thus often added to by including more than one oscillator (useful for detuning - a method where more than one oscillator in almost identical configurations are used in parallel, creating complex beating of the partials, something which contributes to long scale evolution of the sound and that has really become a cliche in analog synthesis), usual parameter modulation, signal modulation (amplitude and frequency modulation at audio rates), nonlinear distortion, feedback, hybrids with sampling and arpeggiators (simple, fast, looping sequencers). In digital implementations, subtractive synthesis is almost invariably crossbred with sample playback.

In implementations (especially analog ones), the starting point is usually a simple-to-produce, harmonically rich waveform, such as a triangle, sawtooth, rectangular, pulse or noise signal. Often these have modulatable parameters, such as pulse width and often interlocking of multiple oscillators is also possible (i.e. hard sync). The filters are fairly standard, although implementation details vary. (As good examples one could give the Moog ladder filter, the Prophet's distinctive filter sound and, of course, the crappy-but-so-wonderful 303 filter section.) In the analog synthesis community, many filter designs have really become institutions and so are the single most sought-after feature of some synths. The digital counterparts of these filters are often much more accurate, which leads to a certain lack of depth in their sound. (This is why analog filters sound so much warmer. Most of the analog sound actually comes from design flaws, component inaccuracy, thermal drift and non-linear distortions in circuit components.) Also, it is surprisingly difficult to produce harmonically correct basis waveforms digitally. The reason is the same one that was encountered in sampling synthesis: straight wavetable methods do not suffice here because of aliasing and interpolation artifacts.
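
A bare-bones digital rendition of the idea might look like the sketch below: a naive sawtooth fed through a resonant 12dB per octave low-pass. The filter used here is the Chamberlin state variable form, chosen only because it is short; it is not meant to imitate any of the analog designs named above, and the function names are made up for the example. Note that the naive sawtooth aliases, which is exactly the problem the paragraph ends on.

```python
import math

def naive_saw(freq, sample_rate, num_samples):
    """Non-bandlimited sawtooth in [-1, 1); aliases at high pitches."""
    phase, out = 0.0, []
    for _ in range(num_samples):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq / sample_rate) % 1.0
    return out

def svf_lowpass(signal, cutoff, q, sample_rate):
    """Chamberlin state variable filter, low-pass output (12dB/octave).
    Only stable for cutoffs well below the sample rate."""
    f = 2.0 * math.sin(math.pi * cutoff / sample_rate)  # tuning coefficient
    damp = 1.0 / q                                      # less damping = more resonance
    low = band = 0.0
    out = []
    for x in signal:
        low += f * band
        high = x - low - damp * band
        band += f * high
        out.append(low)
    return out

# A 110 Hz sawtooth darkened by a 800 Hz low-pass with mild resonance.
dry = naive_saw(110.0, 44100.0, 4410)
wet = svf_lowpass(dry, 800.0, 2.0, 44100.0)
```

Sweeping `cutoff` from sample to sample (with an envelope or LFO) is what turns this static filter into the time-variable one the method relies on.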

Subtractive synthesis is a very workable method. Because low-order filtering is very intuitive, subtractive synthesis is easy and rewarding to use. Most of its parameters also have proper psychoacoustical meaning - timbre is created by taking a proper starting waveform and shaping its spectrum with filters. Modulation is then used to shape the sound into an instrument and a little twitching makes the sound more lively and organic (this is mostly about detuning, feedback, sync etc.). Also, the method is quite easy to implement in analog circuitry. On the negative side, accurate instrument simulations are surprisingly difficult to create because of the simplicity of the synthesis engine. Digital implementations are often quite problematic and do not sound very good without extensive modification and addition of features.

Additive synthesis

Additive synthesis reflects the mental opposite of subtractive synthesis. Whereas subtractive synthesis takes a top-down simplify-from-complex attitude, additive synthesis works from bottom up, combining simple sounds to form more complex ones. The basic prototype has its roots in Fourier theory: any sound can be created by combining multiple sine waves at different frequencies, phase angles and amplitudes. Additive synthesis, then, strives to create instruments by decomposition and reconstruction.

Implementation of additive synthesis is quite straightforward - one only needs a way to create a lot of sine waves. Why so many? Because most instrumental sounds include rapidly varying and stochastic components that arise from nonlinear interactions in the instrument. This is something that leads at best to hundreds of partials, all time variant and often multiply interconnected (e.g. very small changes to the original instrument sound require very large-scale modification in additive synthesis parameters to achieve accurate reproduction). Thus, although theoretically perfect, additive synthesis is not very well matched to the actual production of sound in physical instruments. The great number of partials makes true additive synthesis less than efficient to implement. Some simplifications can make the method more usable though. The first one is not to allow arbitrary sine waves but to group them into bundles of mutually harmonic partials. This allows the use of the Fast Fourier Transform to generate each group efficiently. If the further simplification of disallowing separate envelopes inside such a harmonic group is made, group additive synthesis results - here each group can be recreated by wavetable lookup and amplitude scaling, which is very efficient indeed. And the resulting synthesis quality is still excellent. A completely different - but in certain situations even more powerful - optimization is based on discrete summation formulae (DSF). These are mathematical equivalences based on trigonometric identities. They make it possible to calculate values for some special classes of functions (most often trigonometric polynomials) very efficiently by simplifying them through trigonometric manipulation. For instance, there is an efficient closed form equivalent for a trigonometric polynomial composed of the first n odd harmonics, assuming each sinusoid is present at an amplitude of some constant times the amplitude of the preceding one (i.e. the spectrum decays exponentially). This particular DSF can be used to implement bandlimited square waves without any oversampling or filtering - a major speedup on general purpose hardware.
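
To make the DSF idea concrete, here is a sketch of one such closed form: the finite sum of n sinusoids at phases theta, theta + beta, theta + 2*beta and so on, each scaled by a constant `a` relative to the previous one. It follows from summing a geometric series of complex exponentials. The function names are hypothetical, and the comparison against the brute-force sum is there only to show the two agree.

```python
import math

def dsf_closed_form(theta, beta, a, n):
    """Closed form of sum_{k=0}^{n-1} a**k * sin(theta + k*beta), for a != 1."""
    num = (math.sin(theta)
           - a * math.sin(theta - beta)
           - a ** n * math.sin(theta + n * beta)
           + a ** (n + 1) * math.sin(theta + (n - 1) * beta))
    den = 1.0 + a * a - 2.0 * a * math.cos(beta)
    return num / den

def dsf_direct(theta, beta, a, n):
    """The same sum, evaluated term by term (n sine calls)."""
    return sum(a ** k * math.sin(theta + k * beta) for k in range(n))

# Setting beta = 2*theta places the partials on the odd harmonics
# 1, 3, 5, ... of the fundamental phase theta.
theta = 0.3
print(dsf_closed_form(theta, 2.0 * theta, 0.8, 12))
print(dsf_direct(theta, 2.0 * theta, 0.8, 12))
```

The closed form costs a handful of sine evaluations regardless of n, and because n is finite the highest partial can be kept below the Nyquist limit by construction. With geometrically decaying amplitudes the result is square-like rather than an exact square wave, whose partials fall off as 1/k.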

Additive synthesis is probably the most versatile synthesis method in existence. Any sound can be represented accurately by it. It is also capable of creating new timbres from scratch, in addition to being susceptible to analysis-resynthesis techniques. It is also one of the few synthesis methods for which automatic transcription of instrumental sounds is fairly well developed. As a general synthesis method, it is also unusually accurate - even the slightest nuances can be captured by it. Additive synthesis is, however, computationally expensive and near impossible to implement in analog form. (Due to the high number of partials, noise levels shoot through the roof.) It also requires immense amounts of control data, even in reduced form. (Amplitude and frequency envelopes for each of the partials in nonreduced form.) Tweaking is possible, but due to the frequency envelope sensitivity of the human hearing, large scale modifications to synthesis parameters are necessary to produce a natural sounding modification to a timbre. Thus the psychoacoustical significance of a single parameter is quite limited. This is the reason additive synthesis easily leads to thin sounding instruments if operated manually - the complex harmonic structure of a sound is easy to destroy. Additive synthesis also behaves rather badly in the presence of stochastic components and highly transient signals. (These do not gracefully decompose into neat, slowly varying sinusoidal partials. The result: a huge number of partials with rapidly varying parameters - difficult to implement efficiently and quite storage and control rate hungry.) Additive synthesis also takes a lot of programming time and is difficult to master; consequently it is not widely used. As a recent example, the Kawai K5000 employs additive synthesis with six parts of 64 harmonic partials (either the lowest or highest 64 of 128), almost certainly implemented by FFT.

Phase modulation synthesis - frequency modulation (FM), phase modulation and phase distortion

Phase modulation is a synthesis technique with a long history. The first forms of phase modulation can be found in, where else, radio technology. There it is commonly employed in FM radio. Phase modulation was also used with analog synthesizers, but the limited accuracy of analog oscillators and the difficulty of building oscillators with negative frequency support hindered the analog implementations. The true breakthrough of FM technology came with John Chowning and the subsequent patenting of the method for sound synthesis by Yamaha. The result was the DX7, probably the most successful single synthesizer in existence. More recent derivatives include the OPL2-4 synth chips (ADLIB etc.) and Yamaha's more mature version of the DX7 synthesis principles, the SY series.

The idea behind FM synthesis is that quite rich and deliciously time-variable timbres can be created by modulating the frequency of a carrier sine oscillator by another, the modulator. When the modulator frequency stays below 20Hz or so, only more or less rapid vibrato results. But when the modulation frequency rises to the audio band, the characteristic sidebands resulting from the modulation process can be heard. The sidebands are (generally) not harmonic, except in special cases. These come about when the frequencies of the carrier and the modulator form a simple ratio. The method has few variable parameters, these including the volumes of the two oscillators (the modulator volume affects timbre, not volume) and modulator frequencies. The basic configuration is two oscillators cascaded, as described above, plus envelope generators to control the amplitudes. Often more oscillators are used as well, since interesting (and complex) inharmonic spectra are thus easily produced, allowing for quite realistic bell and brass sounds to be generated. The most characteristic FM sound is a slow sweep of the modulator volume, while keeping the carrier-modulator frequency ratio constant. This produces the well-known ADLIB timbre. Common modifications include several two-oscillator complexes in parallel (allowing for a form of additive synthesis), multiple oscillators in series (allowing for extremely inharmonic and noise-like spectrum formation), non-sine components (they produce a richer sound and, when added in parallel, result in modified group additive synthesis that complements the capabilities of the base FM system), layering, feedback (for noise and weird sounds and adding long term development to the sound), addition of filters (since FM can produce most of the basis waveforms for subtractive synthesis, this also complements the capabilities of the synthesizer) and several combinations of the preceding. 
Some specific modifications include a limited form of FM, called formant FM, which is capable of producing voice like timbres and formant peaks and has an associated analysis procedure which makes instrument design considerably easier, and a couple of other academic projects, with no presence in the commercial music business.
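
The basic two-oscillator configuration fits in a few lines. The sketch below is written in the phase modulation form (the modulator is added to the carrier's phase), which is an assumption of this example; the function name and parameters are likewise made up. The modulation index plays the role of the modulator volume, and sweeping it linearly while holding the frequency ratio constant gives the characteristic sweep described above.

```python
import math

def fm_tone(carrier_hz, ratio, index_start, index_end, sample_rate, num_samples):
    """Sine carrier phase-modulated by a sine at carrier_hz * ratio.
    The modulation index sweeps linearly from index_start to index_end."""
    out = []
    last = max(1, num_samples - 1)
    for n in range(num_samples):
        t = n / sample_rate
        index = index_start + (index_end - index_start) * n / last
        modulator = math.sin(2.0 * math.pi * carrier_hz * ratio * t)
        out.append(math.sin(2.0 * math.pi * carrier_hz * t + index * modulator))
    return out

# Harmonic spectrum (ratio 2:1) that brightens over one second.
tone = fm_tone(220.0, 2.0, 0.0, 5.0, 44100.0, 44100)
```

A simple integer ratio such as 2.0 keeps the sidebands harmonic; something like 1.41 produces the inharmonic, bell-like spectra. An index of zero leaves only the pure carrier.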

The implementation of FM synthesis is very easy, the only problem being aliasing which results from high modulator frequencies. (In theory, the bandwidth of the modulated sinewave is infinite, with sidebands falling off quite rapidly. When one increases the modulator frequency, the more dominant sidebands begin approaching the Nyquist frequency, and eventually wrap over. Usually, this is not taken to be a problem.) Specifically, the computational cost is very low due to the simplicity of the algorithm - nothing more than a couple of table lookups are needed to produce a sample by FM. The cost increases when more oscillators, options and enhancements are added, but if implemented in hardware (which is also quite easy and has resulted in the commercial synthesizers of the DX and SY series), FM can generally achieve very high polyphony with minimal control data, reasonable sound quality and very low cost.

The other two forms of phase modulation, general phase modulation and phase distortion modulation, are less used. Phase modulation, in general, means modulating the instantaneous phase of a carrier. This is very similar to FM, except for the fact that arbitrary phase curves are allowed. The advantage is minimal. Phase distortion modulation was originally used by Casio in its CZ-series synthesizers as a way to circumvent Yamaha patents. The idea, here, is to vary the reading speed of the carrier wavetable during a single cycle of sound production. The modulator function is essentially a saw-tooth wave, with the form and frequency depending on the carrier frequency. It's sort of like hard synced phase modulation. The CZ series synths are very nice for beeps and buzzes, something that is quite hot in the techno scene, nowadays. For real instrument simulation, phase distortion is practically useless. (Although the CZ's do a remarkable job, considering what's under the hood.)

So the pros and the cons. FM synthesis is cost-efficient and easy to implement. Additionally, the parameters are, in a sense, acoustically significant and quite easy to predict because a firm mathematical theory exists (in terms of Bessel functions of the first kind) for the formation of the sidebands. Also, since the prime function of the modulation process is to spread the carrier frequency into multiple, symmetric sidebands around the original carrier, the method can be used to create rough estimations of formants. Because most FM realisations include many options and enhancements, they are well suited for original synthesis - many unheard of sounds can easily be produced. However, since the synthesis procedure bears absolutely no resemblance to the formation of sound in nature, the method is poorly suited to general simulation of acoustical instruments. The method has its own distinctive sound which can be extremely annoying in the long run. Further, the method has been intellectual property of the Yamaha corporation for so long, it has not gained long enduring acceptance outside the academic community.
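
For reference, the Bessel function result mentioned above can be written out as follows, assuming a sine carrier at angular frequency \omega_c, a sine modulator at \omega_m and modulation index I (the symbols are chosen here, not taken from any particular text):

```latex
\sin\bigl(\omega_c t + I \sin \omega_m t\bigr)
  = \sum_{n=-\infty}^{\infty} J_n(I)\,\sin\bigl((\omega_c + n\,\omega_m)\,t\bigr),
\qquad J_{-n}(I) = (-1)^n J_n(I)
```

Each sideband at \omega_c + n\omega_m thus has amplitude J_n(I), and the lower sidebands mirror the upper ones up to sign. Since J_n(I) becomes small once n exceeds roughly I + 1, the significant bandwidth is about 2(I + 1) times the modulator frequency (Carson's rule), which is why a larger index sounds brighter.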

Nonlinear waveshaping

Waveshaping is just what the name tells you - it takes a simpler wave and shapes it until it sounds right. The most simple form takes a sine wave (only one frequency) and passes it through a carefully crafted function (usually implemented by a lookup table) that adds sidebands to it, based on the amplitude of the original signal. The theory behind this method is that if the function is a suitable Chebyshev polynomial, any combination of upper harmonic partials can be produced from a steady state sinewave. When one, then, varies the sine volume, the larger the volume is, the more harmonic content there is in the resulting waveshaped signal. Thus descending volume produces descending harmonic content - something that is characteristic of most instrumental sounds. Thus we have hope that this might help produce some realistic timbres. And it does. The problem is that although the theory is well-developed, one waveshaper almost never suffices. This is because most instrumental sounds include elements (transients, stochastic components, inharmonic partials and partials with different volume envelopes) that make it impossible to synthesize the sound with a single waveshaper. More sophisticated versions exist, including waveshaping of non-sine input signals (harmonic or inharmonic, of which the latter is more complex to analyse), combinations with filtering and cascading multiple waveshapers, either in series or in parallel. Some research into using multiparameter functions as waveshapers has also been done, the result being called wave terrain synthesis. (In this case we have multiple input signals which are combined by the waveshaping function. This allows the different input signals to beat against each other and, thus, to produce long term evolution. Similar effects arise with waveshaping of nonharmonic signals, but the total energy of the output signal is more difficult to control.) 
The problem is, none of these really has sufficient theory behind them to make them easily applicable, let alone to allow instrument design to be automated.

As said, implementation of waveshaping is easy: all you need is a lookup table (with interpolation, probably) and a simple oscillator with amplitude control. You use the output of the oscillator to look up values from the table. But in reality, nothing is this simple. The problem is, once again, that of aliasing. As the process is nonlinear, it adds to the frequency content of the input signal. In particular, it widens the signal bandwidth. (The higher the degree of the shaping polynomial, the more marked the effect. For example, raising to the second power (squaring) doubles the bandwidth.) The result is that with high input frequencies and/or insufficiently smooth shaping functions, significant aliasing may occur. But the low computational cost of the algorithm sometimes makes it worthwhile, and also makes it useful as a building block for more sophisticated hybrid synthesis methods.
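
The Chebyshev property that the theory rests on, namely that the k-th polynomial turns a full-amplitude cosine into its k-th harmonic, is easy to check numerically. In the sketch below the shaping function is evaluated directly instead of through a lookup table, purely for brevity; the function names and the example harmonic weights are assumptions.

```python
import math

def chebyshev(order, x):
    """T_order(x) via the recurrence T_{k+1}(x) = 2*x*T_k(x) - T_{k-1}(x)."""
    t_prev, t_cur = 1.0, x           # T_0 and T_1
    if order == 0:
        return t_prev
    for _ in range(order - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def waveshape(x, harmonic_amps):
    """Shaping function built as a weighted sum of T_1, T_2, ...;
    harmonic_amps[k-1] is the desired amplitude of harmonic k."""
    return sum(a * chebyshev(k, x) for k, a in enumerate(harmonic_amps, start=1))

# A unit-amplitude cosine in, fundamental plus a quieter third harmonic out.
theta = 0.9
shaped = waveshape(math.cos(theta), [0.5, 0.0, 0.25])
expected = 0.5 * math.cos(theta) + 0.25 * math.cos(3.0 * theta)
```

The exact harmonic mix only holds when the input has amplitude one. Scaling the input down changes the mix, which is the amplitude-dependent timbre the method exploits. A shaper containing T_k at most multiplies the bandwidth by k, so the input pitch has to stay below the Nyquist limit divided by k to avoid aliasing.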

The better side of this algorithm is its simplicity and the solid theoretical foundation on which it is built. Also, some instruments are fairly well modelled with variants of the waveshaping algorithm. When combined with filters, some quite usable synthesis methods can be built. However, since the algorithm has few parameters (aside from the input waveform and lookup table contents, which are difficult to modify systematically on the fly), it allows little in the way of modulation effects and long-term development in the sound. By itself, the method also produces difficult to control volume envelopes, since the nonlinearity inserts uncontrollable extra energy into the sound. The situation is even worse with the more complex variants.

Granular synthesis, FOF and VOSIM

Granular synthesis has its roots in the area of quantum physics and wavelet analysis. The basic premise here is that signals can be decomposed in bases different from the classical Fourier one. (Well, to tell the truth, wavelet decompositions do not generally form bases, only generalized frames, which do not meet proper orthogonality requirements.) Especially, we might wish for a decomposition in which local changes to the signal being analysed only result in local changes to the analysis result. This means we want time localization. Classical Fourier integral transforms have no such thing: add a local bump and the whole frequency spectrum changes, add a jump discontinuity and you get nonuniform convergence/Gibbs' phenomenon. The problem is in the result the quantum mechanics people call the uncertainty principle (it was formulated in quantum mechanics by Werner Heisenberg). What it says is, basically, that no matter what decomposition you have, you always have strict bounds on time-resolution of the analysis in terms of the frequency resolution and vice versa. What does this mean? It means that since Fourier analysis has infinite frequency resolution (Fourier integral transform gives you the exact frequencies required to synthesize the signal), it necessarily has no time localization. On the other end of the scale we have analyses that have no frequency resolution (decomposition into an integral transform of delta-distributions) but have perfect time resolution (they give indefinitely accurate times of occurrence for all the deltas). All this seems a bit odd, since our ears can certainly pinpoint sounds in both frequency (or sounds wouldn't have a pitch) and time (or we wouldn't need the concept of notes). So there should exist a form in-between that behaves similarly to our hearing organ. Such a form could also be very useful. 
(Strictly speaking, our ears do not decompose sounds losslessly or even uniquely - you could not create anything like an Ear Transform and its inverse.)

Such forms do exist. They are standard material in wavelet analysis. The basic idea is to trade frequency resolution for time localization and the other way around, depending on your needs. What results are transforms which have both good time and frequency resolution. (But, with the restriction, that not both the analysis wavelet and its Fourier transform can have compact support - in English, one cannot have analyses with both perfect time and frequency localization. If the analysis wavelet spans only a limited portion of the real axis, it will, to some extent, span the whole spectrum and vice versa.) These kinds of analyses permit decomposition of sound signals in ways that slightly resemble the way our ears decompose sound. The inverse of this procedure leads to/resembles granular synthesis, which has its theory rooted in the writings of Dennis Gabor and, later, in the musical applications end, Iannis Xenakis.

The basic premise is that we can create rich sound textures by superimposing large numbers of small sound grains - little pieces of sound that have little distinctive flavor on their own, but when used in large numbers, coalesce into a coherent sonic matte. Usually these grains are windowed sine waves (often using a truncated Gaussian or raised-cosine window, since these yield good frequency localization without sacrificing compact support, i.e. finite length, of the grain) or something very close to them. These little sound bites are then combined stochastically, with parameters such as density (grains per second), mean frequency, mean length, variance and envelopes of the previous controlling the overall sonic experience. What results is an extremely powerful, general and easily adapted technique of sound generation that yields very rich timbres and lends itself as well to automated design as to creation of entirely new instruments. Also, by using more sophisticated forms of control (such as statistical distributions and/or grain by grain control) and by substituting richer grain material (non-sine waves, different windows, chopped natural sound etc.), the method scales almost indefinitely.

This all may sound like a lot of semi-scientific mumbo-jumbo, but in the end it is very easy to see what is going on, if some thought is put into it. Think about a sine wave. Its frequency content is simple: it's just a delta-spike at the frequency of the wave. (Let's not burden ourselves with the fact that in the normal sense of the word, these are not functions and the Fourier integral doesn't converge.) So we have only one frequency. Now let's take a Gaussian. We know that a Fourier integral transform of a Gaussian is another Gaussian. So we have a clear peak in the spectrum. Now, sample by sample, multiply these two together. What results is something like a windowed sine, except it doesn't vanish anywhere, but, instead, only decays rapidly towards zero. What is the spectrum of this new signal? It is the convolution of the spectra of the original two signals, i.e. a Gaussian with a higher center frequency. And the time domain representation decays quickly, so we can take it to be time-localized around the peak amplitude at the center of the original Gaussian. So we have a signal that is both time- and frequency localized. (See the illustration below to get a sense of what is going on.) Now we can add these together to add specific frequencies at specified times (approximately), something we certainly cannot do with inverse Fourier transformations unless we use the discrete version and window the results - something that is really just a naive version of the grain approach. Using stochastic control and great enough grain densities produces sounds that have no recognizable structure aside from the desired timbre.
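
The Gaussian-times-sinusoid grain and its stochastic superposition can be sketched as follows. Everything specific here is an assumption made for illustration: the function names, the truncation of the Gaussian to a finite length, the `width` parameter, uniformly random grain onsets and normally distributed grain frequencies. A cosine is used so that the grain peaks at the center of the window.

```python
import math
import random

def gaussian_grain(freq, length, sample_rate, width=0.15):
    """Cosine at `freq` under a Gaussian window truncated to `length` samples.
    `width` is the window's standard deviation as a fraction of the length."""
    center = (length - 1) / 2.0
    sigma = width * length
    grain = []
    for n in range(length):
        window = math.exp(-0.5 * ((n - center) / sigma) ** 2)
        grain.append(window * math.cos(2.0 * math.pi * freq * (n - center) / sample_rate))
    return grain

def grain_cloud(num_grains, total_len, mean_freq, freq_spread,
                grain_len, sample_rate, seed=0):
    """Superimpose grains at random onsets with random frequencies."""
    rng = random.Random(seed)
    out = [0.0] * total_len
    for _ in range(num_grains):
        start = rng.randrange(0, total_len - grain_len + 1)
        freq = rng.gauss(mean_freq, freq_spread)
        for i, s in enumerate(gaussian_grain(freq, grain_len, sample_rate)):
            out[start + i] += s
    return out

# Two hundred 10 ms grains scattered over half a second around 800 Hz.
cloud = grain_cloud(200, 22050, 800.0, 60.0, 441, 44100.0)
```

Raising `num_grains` for a fixed duration increases the density, and `freq_spread` controls how wide the resulting spectral band is.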

VOSIM (VOice SIMulation) is actually a method that is completely independent of the grain based synthesis principles. But it shares some common ground with them, nevertheless. The idea of VOSIM is based around the source-excitation model of speech production. Here, speech is viewed as being produced by a linear filter (the vocal tract) driven by a series of pulses with a wide, relatively constant spectrum (glottal pulses). In VOSIM (which was originally developed as a side product of research into speech), one first looks at the spectral response of this conceptual filter. More often than not one can find the distinctive formant peaks characterizing the instantaneous quality of the sound. One then models the waveform by adding together carefully crafted signals composed of periodic decreasing trains of raised cosine pulses. The point behind the procedure is that the pulse trains form controllable formants: the decay factor controls the width of the formant lobe, the rate of repeat of the base raised cosine wave tells the center frequency of the formant and the repeat rate of the whole pulse train is the frequency of the glottal excitation function. (To see what is going on, see the picture to the left, where the different parts of the basic VOSIM waveform can clearly be seen.) All in all, a second-order all-pole filter with the aforementioned properties (i.e. resonance frequency and Q-value), driven with a periodic pulse train, is (rather crudely) approximated. If we make the assumption that speech can be modelled as an all-pole, pulse-excited filter, we can decompose it into parallel second order filter sections which, in turn, can be modelled by VOSIM generators. Very rich timbral envelopes can be modelled as a combination of additive VOSIM elements.
Advantage over a pulse excited filter bank: VOSIM is computationally cheap - one generator requires only a single multiplication per raised cosine cycle, a table lookup for the waveform and a counter to count to the length of the bigger cycle.
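To make the structure of the waveform concrete, here is a minimal sketch of one period of a VOSIM-style pulse train. The function name and all parameters are hypothetical, lengths are given in samples, and the sketch scales every sample explicitly for readability instead of being optimized the way a real generator would be.

```python
import math

def vosim_period(pulse_len, num_pulses, decay, period_len):
    """Return one period of a VOSIM-style waveform as a list of samples.

    pulse_len  -- samples per raised cosine pulse (sets the formant frequency)
    num_pulses -- number of pulses in the decaying train
    decay      -- amplitude ratio between successive pulses (sets formant width)
    period_len -- samples in the whole period (sets the fundamental frequency)
    """
    if period_len < pulse_len * num_pulses:
        raise ValueError("the pulse train does not fit into the period")
    # The raised cosine pulse is computed once and then read like a lookup table.
    table = [0.5 * (1.0 - math.cos(2 * math.pi * i / pulse_len))
             for i in range(pulse_len)]
    out = []
    amplitude = 1.0
    for _ in range(num_pulses):
        out.extend(amplitude * s for s in table)
        amplitude *= decay  # the amplitude is updated once per raised cosine cycle
    # Silence until the counter reaches the length of the bigger cycle.
    out.extend([0.0] * (period_len - len(out)))
    return out
```

At a sample rate `fs`, the formant sits near `fs / pulse_len` and the fundamental at `fs / period_len`; for example, `vosim_period(20, 3, 0.5, 100)` gives three pulses with peaks 1, 0.5 and 0.25 followed by 40 samples of silence.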

FOF, the brain-child of Xavier Rodet of IRCAM, is very similar in spirit to the VOSIM method, but is designed more for music and singing than for speech sounds. It's in fact closer to the granular methods, since it employs a bank of what resembles grain oscillators to produce unconventionally windowed sine waves. But the ideology behind the algorithm is closer to VOSIM - construction of speech and/or chant by methods derived from the source-excitation paradigm. FOF has been included in the influential MUSIC V and CSOUND synthesis languages, which makes for its widespread use inside the academic community.

Physical modelling - waveguides and controlled nonlinearity

Physical modelling is a bundle of methods which all aim at a common goal - the modelling of some of the relevant parts of sound production in real, physical instruments. There are many different ways to do what is called physical modelling, including waveguides, filterbanks, the finite element method, Karplus-Strong type algorithms, and then some. What is common to all of these is that they implement different large scale theories of sound production in different types of instruments.

Waveguides are the prominent technology at the moment. They are based on an abstraction of sound transmission in instrument bodies and cavities as linear transmission of waves in a one-dimensional tube. The argument for woodwinds goes like this: since the inner tube of these instruments is rather thin compared to the wavelength of the sounds they emit, they can be abstracted with high precision as one-dimensional transmission lines with linear loss over the tube length. So the tube is modelled as a bidirectional delay line with occasional points of reflection (implemented as taps from one direction of the delay line to the other with a filter in between to model reflection losses) and a driving reed at one end. The reed is implemented as a pulse producing oscillator with some controlled amount of nonlinear response to the pressure in the near end of the tube. (This is included to model the reed reacting to the varying air pressure and giving rise to nonlinear effects on the amplitude and shape of the driving pulses it emits into the delay line.) Then the delay lines are tapped in appropriate places (mainly at the end of the tube, sometimes in the middle to model directional radiation and valve leakage) for sound transmission out of the system. All this is computationally heavy (a lot of delay memory and processing power for the filters are needed), but extremely high realism can be achieved. However, some problems arise when the one-dimensional abstraction isn't as valid as in this case. Good examples are such instruments as the violin (where we can, however, model the strings as being one-dimensional) and drums (where the assumption collapses completely). In these cases more accurate simulations can be achieved by creating a two or three-dimensional mesh of delay lines, but now the expense starts growing immensely and good realism is much more difficult to achieve.
(In the case of string instruments, the resonant cavity can sometimes be modelled sufficiently accurately as a linear filter, possibly by linear prediction techniques, but the strings still produce complications, since they have multimode behavior with nonlinear coupling between the modes: longitudinal waves and twisting couple with the usual modes, especially at high playing volumes and when using the bow. And even in the case of multidimensional meshes, nonlinear coupling between modes in different directions complicates matters appreciably.)

As said, there are other methods of physical modelling, so brief descriptions are in order. The finite element method (FEM) is based on completely different principles from the other methods and is only mentioned for completeness. It is heavy enough to be totally unusable as a real synthesis method. Basically it is a generic method used to solve partial differential equations numerically. Cannondale uses it to design their bikes to withstand stress, for instance. But as wave transmission is a phenomenon which is mathematically described by partial differential equations, such numerical solutions actually are a way to synthesize sound. FEM is only used in theoretical studies, though, since it hogs mind-boggling amounts of computing power. More in line with the application oriented note of this text, the Karplus-Strong algorithm can be thought of as a greatly simplified version of the waveguide model, in effect one with only a very simple (often first degree) filter, a one-way delay line with feedback and a single random driving waveform. The basic method works by filling the delay line with random numbers and then iteratively feeding back the average of the last two samples of the output end to the input. This creates surprisingly convincing string sounds. Modifications include inversion of certain samples in the delay line (AM, if you wish), higher order filters, fractional delay line lengths (with various kinds of interpolation to achieve the desired effect) and signals added to the delay line at specific points during the cycle. All in all, this is a very well known synthesis method and a predecessor of most of the waveguide methods. Finally, filter based methods of physical modelling rely on a more classical analysis of sound and attempt to model the response of approximately linear resonators by certain kinds of filters.
One approach, appropriately named modal synthesis, handles the problem by subdividing it: the instrument is divided into parts whose characteristics are known and, as vibration analysis data is readily available in engineering literature, the differential equations describing these parts are just looked up. After that all that needs to be done is to glue the parts together and numerically solve the resulting equations - this is often done just by creating difference equations to estimate the original ones and running the resulting algorithms against our known excitation functions. Of course, finding efficient and sufficiently accurate ways to estimate the original equations can be quite tricky indeed.
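The basic Karplus-Strong procedure described above (fill a delay line with random numbers, then keep feeding the average of the two samples at the output end back to the input) is short enough to sketch directly. The function names are made up for this example, and none of the listed modifications (fractional delays, higher order filters and so on) are included.

```python
import random
from collections import deque

def noise_burst(length, seed=0):
    """Random excitation used to fill the delay line (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]

def karplus_strong(excitation, num_samples):
    """Basic Karplus-Strong loop: a delay line with a two-point averaging filter."""
    if len(excitation) < 2:
        raise ValueError("the delay line needs at least two samples")
    line = deque(excitation)
    out = []
    for _ in range(num_samples):
        oldest = line.popleft()      # sample leaving the output end
        out.append(oldest)
        # Average it with the next sample in line and feed that back to the input.
        line.append(0.5 * (oldest + line[0]))
    return out

# A 50-sample delay line at, say, a 44.1 kHz sample rate gives a pitch
# somewhere near 44100 / 50 Hz; the averaging filter slightly lengthens the loop.
pluck = karplus_strong(noise_burst(50), 2000)
```

The averaging acts as a gentle lowpass filter inside the loop, so the high harmonics of the initial noise die away faster than the low ones, which is what gives the plucked string character.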

As classical instruments are quite complicated in the mathematical sense, it is an enormously time consuming task to create accurate, efficient models of them. This means that automatic analysis or at least some good analytical tools to aid in the process would be nice. However, the facts that originally made the instruments difficult to analyze (nonlinearities and complex physical properties) also make completely automated analysis impossible. Tools are available, of course, but most of these are more in the line of classical spectral and statistical analysis rather than being especially suited for the task at hand. Currently this means that each instrument has to be modelled separately, from first principles, but some recent discoveries have eased the burden a little. The most important is called higher order spectral analysis (HOS). It was conceived to help in the analysis of general, nonlinear differential equations and systems and is thus quite a handy tool for the synthesist as well. The idea, here, is to track the complex dependencies between different vibratory motions appearing in a signal so that nonlinear interactions can be tracked down and isolated. This helps greatly in designing excitation sources and their coupling to the other parts of the instrument being modelled.

All in all, physical modelling is an extremely good choice for synthesis of many classical instruments, especially those of the wind and brass families. Its parameters directly reflect the ones of the real instrument and excellent emulations can be produced. Original synthesis is fairly easy on PM platforms. The downside is that serious processing power is needed, something that limits the polyphony of current PM implementations. In addition, instrument design can be very time consuming. Some types of instruments are more difficult to model, as well, especially instruments with significant effects in two or more dimensions. These include e.g. drums and plates, and, to some extent, string instruments. Sometimes these problems can be solved by using modelling alongside other synthesis methods or expanding our models to include samples as excitation or by allowing traditional sound processing methods (effects, filtering etc.) to be applied within our instrument. Sometimes not. Progress is fast, techniques are developing constantly and the field will certainly get even more attention as time goes by and serious commercial applications continue to appear.

Time-domain ad hoc methods

As computers have pervaded the music industry and academia, direct manipulation and trial-and-error methods (as opposed to careful top-down classical planning/composition and batch synthesis) have gained a foothold as a way of composing. With waveforms as the basic building blocks, sampling and digital processing have had a huge impact on how we see sound. This is the basis on which many a strange synthesis method has been built.

Common to the methods discussed here is that they are all influenced by the view of sound as a stream of numbers, a discrete signal. Since such signals are the natural representation of audio on computers, one might ask whether this view suggests original synthesis methods. And indeed, there are some synthesis methods that are based on boolean and other purely numeric manipulation of discretized signals. Examples include SAWDUST and others. The basic premise here is that since the sounds are byte streams, one should treat them as such and apply methods designed for number streams to them. Other influences include serial composition, deconstructionist ideology and granular synthesis methods, which suggest that it might be beneficial to adopt a truly bottom-up view of composition. Namely, one starts from individual samples, builds series, mutates by bit-wise, logical and numerical operations, splices and glues, mixes and transforms and iterates, reiterates and rereiterates... What results is something that truly is different (and pretty horrible sounding ;) - the result can be quite indistinguishable from digital noise, something you'd get if you converted a program image to sound. On the other hand, the result can even be melodic (which, however, is not usually the goal and doesn't happen by accident).
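To give a flavour of what treating sound as a plain byte stream can mean, here is a small sketch that applies bit-wise operations to unsigned 8-bit samples. It is not a reconstruction of SAWDUST or of any particular system; the function names and the choice of operations are assumptions made purely for illustration.

```python
def xor_streams(a, b):
    """Combine two 8-bit sample streams with a bit-wise exclusive or."""
    return bytes(x ^ y for x, y in zip(a, b))

def rotate_bits_left(stream, shift):
    """Rotate the bits of every 8-bit sample to the left by `shift` places."""
    shift %= 8
    return bytes(((x << shift) | (x >> (8 - shift))) & 0xFF for x in stream)

def splice(a, b, chunk):
    """Glue alternating chunks of two streams together."""
    out = bytearray()
    for start in range(0, min(len(a), len(b)), chunk):
        source = a if (start // chunk) % 2 == 0 else b
        out.extend(source[start:start + chunk])
    return bytes(out)

# Two crude source waveforms: a ramp and a square wave, 256 samples each.
ramp = bytes(range(256))
square = bytes(255 if (i // 32) % 2 else 0 for i in range(256))

# Mutate, splice and iterate; the result has little perceptual relation
# to either source, which is exactly the point made in the text.
mutated = rotate_bits_left(xor_streams(ramp, square), 3)
result = splice(mutated, ramp, 16)
```

None of these operations has a simple perceptual meaning, so the outcome is best judged by listening.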

On the positive side, such methods are accurate to the maximum - one couldn't get more flexibility than full samplerate, sample level control. And the resulting sounds are new. But the utter lack of perceptual significance of the operations and the truly ad hoc nature of the algorithms make them primarily of academic interest. Only meager results can be expected from such methods alone. However, in combination with other synthesis algorithms, such innovations can be useful. For example, many of the resulting digital timbres are excellent raw material for carefully crafted grains or attack transients for more conventional sounds.

Analysis-resynthesis

Analysis-resynthesis techniques are different from the other methods described here in that they are not stand-alone algorithms for sound synthesis - they always require some starting material for sound construction. Here we first take a sound, analyse it, modify the analysis data and then resynthesize it to create more or less similar sounds. The technique was already hinted at in the additive synthesis paragraph. This is because additive synthesis is the most straight-forward synthesis end for most analysis algorithms. Also, the amount of control data required by additive synthesis can realistically be produced only by automated analysis of existing instrumental sounds, followed, perhaps, by some hand-tuning to make for specific impressions. Good examples include such sound processing methods as vocoding (more on that in the effects section), linear prediction based synthesis of vocal/instrument hybrids and generation of instrument families by automatic transformation from a single member of the family (used by the additive synthesis community).

Analysis-resynthesis is good in that it is often quite an intuitive method. It also results in drastic savings of time when used in combination with additive synthesis, in comparison with raw additive. Furthermore, its different forms may allow for extensive modification and intuitive control of existing sound parameters, making it suitable for original synthesis, transformation, mutation and automated conversion alike. The downside is that original material is required, the analysis quality is often far from perfect and great amounts of analysis data can result from processing rather simple sounds. Further, as the amount of data increases, the perceptual significance of a single parameter decreases - this results in the need for complex processing environments and extensive know-how to manage the resulting intermediate data. Analysis of sounds from instruments with stochastic and/or nonlinear interactions often presents the greatest challenge for additive analysis-resynthesis techniques, because an immense number of low amplitude sine waves are needed to account for the highly irregular and time-variable spectra involved. Problems of this kind are alleviated by combination with other modelling techniques, notably subtractive synthesis. A good example of this approach is spectral modelling synthesis (SMS), in which dominant partials are taken care of by decomposition into sinusoids and the residual signal is modelled as an additive, filtered noise source. Hybrids of this kind are often more viable than pure additive methods since they slice off difficult to model parts of the signal and leave harmonic analysis with more coherent data to work on. Result: less intermediate data with more significant parameters - an obvious win-win situation.
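The split into dominant sinusoids plus a residual can be illustrated with a toy example. This is only a sketch of the idea and not the actual SMS algorithm: it analyses a single short frame with a naive DFT, assumes the partials fall exactly on DFT bins, and simply computes the residual by subtraction instead of modelling it as filtered noise. All names and test values are made up.

```python
import cmath
import math

def dft(signal):
    """Naive DFT of a real signal, bins 0 .. n/2."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(signal))
            for k in range(n // 2 + 1)]

def split_sines_and_residual(signal, num_partials):
    """Analyse a frame, resynthesize its strongest partials, return the rest."""
    n = len(signal)
    spectrum = dft(signal)
    # Rank bins by magnitude (DC and Nyquist are skipped for simplicity).
    strongest = sorted(range(1, n // 2),
                       key=lambda k: abs(spectrum[k]), reverse=True)[:num_partials]
    # Analysis data: (bin, amplitude, phase) for each dominant partial.
    partials = [(k, 2 * abs(spectrum[k]) / n, cmath.phase(spectrum[k]))
                for k in strongest]
    # Additive resynthesis from the analysis data.
    sines = [sum(a * math.cos(2 * math.pi * k * i / n + ph)
                 for k, a, ph in partials)
             for i in range(n)]
    residual = [x - s for x, s in zip(signal, sines)]
    return partials, sines, residual

# Test frame: two strong partials plus a weak component left for the residual.
n = 128
signal = [math.cos(2 * math.pi * 5 * i / n + 0.3)
          + 0.5 * math.cos(2 * math.pi * 12 * i / n - 1.0)
          + 0.01 * math.cos(2 * math.pi * 20 * i / n)
          for i in range(n)]
partials, sines, residual = split_sines_and_residual(signal, num_partials=2)
```

Here two (bin, amplitude, phase) triples describe almost all of the signal's energy, and whatever they miss ends up in `residual`. Modifying the triples before resynthesis (scaling amplitudes, moving bins) is the kind of transformation the analysis data makes possible.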

All in all, analysis-resynthesis really resides somewhere between a synthesis method, a generic sound transformation paradigm and an effects algorithm. Considering that, it is a great addition to our bag'o'tricks.

Hybrid methods and derivatives, modelling in general

As indicated above on many occasions, most synthesis methods do not perform well alone. Many of the basic algorithms do one thing well but may fail miserably when something else is desired. An excellent example is FM synthesis: certain inharmonic sounds such as bell and tube sounds are amazingly well reproduced, as well as completely new synthetic sounds. But when string or woodwind sounds are needed, the method reveals its limits. Physical modelling can take care of these, but the tubes and bells do not reproduce well because of the limitations of the one-dimensional waveguide abstraction. That is why most commercial implementations of the different algorithms are hybrids: most samplers have filters, most subtractive synths have multiple waveforms and often some kind of waveform playback, many physical modelling synthesizers include at the very least a sample-based drum kit, and greatly modified FM algorithms are favored over pure FM. Furthermore, most electronic sound generation methods of today are enhanced by the addition of a selected assortment of digital effects.

There are also many less benign reasons for this trend towards greater complexity. One of them is the nature of the commerce - one has to have a distinctive product to make it to the stores. Another is the need to achieve cost-efficiency. Although some synthesis methods are capable of unbelievable generality (e.g. additive and physical modelling synthesis), their cost is so great that they cannot be incorporated into a mass produced synthesizer. It is cheaper to pack a few tens of megabytes of sample memory or a dozen different, lower computational cost algorithms into a module than to design a custom ASIC to do the job of handling a sufficient number of physical modelling voices. Then there are patent and intellectual property issues - one often needs to circumvent these by adding to one's repertoire of algorithms. Also, people want to have more power at their fingertips each day, especially since timbre has only now begun to play a truly important part in the fabric of modern popular music.

One final reason for the conception of highly hybrid synthesizer designs is the need to model existing instruments - in a sense, to guard the heritage. This is because, unlike in the early days of the synthesizer industry, replication of acoustic instrumental sounds is no longer necessarily the main goal of synth design. Now one also has to be able to model the electronic instruments of the past. To this end, a multitude of analog emulation synthesizers have come to the fore. They employ a number of different techniques to achieve their goal, among them physical modelling on a small scale, digitized (sampled) versions of analog oscillators and filters (which are often extremely difficult to faithfully reproduce in discrete form; witness the 303 and Moog ladder, the latter of which includes a zero-delay loop in the naively discretized version), samples of actual analog instrumental sounds and from-the-ground-up rebuilds of analog instruments as digital-analog hybrids. The success of this breed of synthesizers in their task depends heavily on the original sound they attempt to replicate. The weirder the original instruments, the harder the job of the architect. Analog instruments often get their distinctive feel from design flaws, component weaknesses and the generally weaker stability of analog designs - all things that are difficult to spot when analysing an analog design and even more difficult to model effectively and efficiently.

Polyphony. Multitimbrality.

In describing synthesis algorithms, not much thought is usually given to their actual use or implementation details. One of the aspects usually neglected in brief treatments (such as this one), is polyphony and, with that, multitimbrality. Knowing how synthesis works is fine, but one cannot make any music before multiple voices and separate timbres can be combined. Polyphonic (as opposed to monophonic) is the word used to describe instruments which can generate multiple instrumental sounds at once and multitimbral the one used for an instrument capable of generating multiple separate timbres at once (i.e. in which separate voices can use separate parameters and/or algorithms).

Today's high-end low-cost synthesizers and computer sound cards are usually both polyphonic and multitimbral, which makes many people think this is the only way of doing things. However, in the past there have been many instruments which were either monophonic or monotimbral, or which had only limited multitimbral capabilities. (Nowadays, most sampling based instruments allow completely different parameters to be used on each synthesized voice. We will call this full multitimbrality from now on. This is in stark contrast to many analog synthesizers which, were they multitimbral at all, usually limited the number of simultaneous timbres to two, four or some similarly restricted number.) The usual reason for limiting these capabilities is implementation complexity: it is surprisingly difficult to build cost-efficient hardware that allows for such complex setups and signal routings as required by full multitimbrality. Sometimes one can also drastically optimize one's implementation if timbrality is restricted: wavetables can be reused, fewer translation tables need to be kept for interpretation of modulator data, effects routings can be simpler and so on. And as for polyphony, some synthesis algorithms are so complex that available/affordable hardware cannot support more than monophonic operation. Full-blown physical modelling comes very close: Yamaha's original VL-1 was duophonic (i.e. had only two voices) and two part multitimbral. For the same reason some of the more intensive algorithms often limit available polyphony.

How is polyphonic/multitimbral operation implemented, then? One of the more common ways is to use a signal processor (possibly augmented with some special purpose hardware to do some of the routine calculations involved), divide the computing capacity in equal parts, use these time slots to implement voices and use a separate microprocessor for control (e.g. enveloping, modulation, MIDI, user interface,...). Almost all commercially available synthesizers use this approach, varying only in the type of processors, software and auxiliary chips employed. When one uses this approach, it is obvious that full multitimbrality kind of comes for free - since each voice is separate from the others, it can use its own local copy of synthesis parameters and so can produce any timbre desired. This way, one gets a bank of voices onto which a musical performance can then be mapped. And if full multitimbral operation is available, this mapping can be quite complicated indeed - a single logical note can map into multiple simultaneous notes (layering) with more than one individually controlled subpart (vector synthesis) each with many component events (wave sequencing), sometimes with quite a bit of control data and parameters and even multiple separate synthesis algorithms flying around. Such mapping is the second reason for limited polyphony in current instruments - the resources are there but patches tend to use more of them. It is no wonder, then, that to avoid missing musically significant events, we need to systematically weed out resource allocations which are of little sonic importance. This is the subject of the next section on voice allocation.

Voice allocation

There are, basically, two ways to map logical instruments into physical voices. The first is to use fixed mapping: instrument x always uses voice y. This is the approach used by monophonic instruments and older tracker type composition software. Instrument equals channel equals instrument. The second way is to use dynamic allocation: a new musical event is mapped to a free physical output voice at the time of its creation. Most instruments use the latter approach. This is because it brings with it a sort of useful abstraction - from the user's perspective, the instrument is constructed to behave as if it had practically infinite polyphony. From the implementation's point of view, the hardware only realizes a fixed number of physical voices and tries to allocate these voices to capture the most significant logical events. This might seem quite abstract, but has the feature that the logical and physical sides are decoupled - one can implement the same instrument with varying degrees of polyphony. The implementation approximates the perfect ∞-phonic instrument to an implementation dependent degree. The same songs will play, albeit with varying levels of sonic accuracy, on all synthesizers of a series.

This is good, but embodies a problem - how is the logical-physical mapping done best? And no slight problem that is. It is extremely difficult to determine algorithmically which of a multitude of competing events should be realized and which - if any - can be discarded. Two crude heuristics are commonly employed to solve the dilemma. The first is to discard the oldest note still sounding, the other throws away the quietest. Both give similar results, since in Western music, notes tend to die away rather quickly. (I.e., Western music tends to have notes...) Sometimes, to aid in simulating ensembles of independent instruments, voices can be divided in banks (say, a minimum of 6 voices for a guitar) and the allocation algorithm executed within a bank, only.
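The two heuristics mentioned above (steal the oldest voice, or steal the quietest one) can be sketched as a small dynamic allocator. The class, its method names and the idea of passing in a fixed level per note are assumptions for this example only; in a real instrument the level would follow the envelope of the sounding voice.

```python
class VoiceAllocator:
    """Maps logical notes onto a fixed number of physical voices."""

    def __init__(self, num_voices, policy="oldest"):
        if policy not in ("oldest", "quietest"):
            raise ValueError("policy must be 'oldest' or 'quietest'")
        self.policy = policy
        self.voices = [None] * num_voices  # None marks a free voice
        self.clock = 0                     # counts note-on events

    def note_on(self, note, level=1.0):
        """Start a note and return the index of the voice that plays it."""
        self.clock += 1
        if None in self.voices:
            index = self.voices.index(None)  # a free voice is available
        elif self.policy == "oldest":
            # Steal the voice whose note started first.
            index = min(range(len(self.voices)),
                        key=lambda i: self.voices[i]["start"])
        else:
            # Steal the voice that is currently quietest.
            index = min(range(len(self.voices)),
                        key=lambda i: self.voices[i]["level"])
        self.voices[index] = {"note": note, "start": self.clock, "level": level}
        return index

    def note_off(self, note):
        """Free every voice that is playing the given note."""
        for i, voice in enumerate(self.voices):
            if voice is not None and voice["note"] == note:
                self.voices[i] = None
```

With two voices and the "oldest" policy, a third note takes over the voice of the first one. Restricting the search to a subset of voice indices would give the bank-wise allocation mentioned above.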

Vocabulary

Absorption
In absorption, wave motion is attenuated - usually by conversion into heat. Usually the unabsorbed part of the original wave is reflected.
AES
Audio Engineering Society; an American organization of audio engineers which standardizes audio related technology and forms a common forum for experts in the field.
AES10 or MADI
Multichannel Audio Digital Interface; a unidirectional multichannel digital audio transmission standard originated by the AES. MADI is based on the FDDI (Fibre Distributed Data Interface) transmission format, but usually uses coaxial cable instead of optical fibre. Accommodates up to 56 channels and 24 bits per sample. Used for point-to-point multichannel digital audio connections in studio and broadcasting environments.
AES/EBU digital audio bus
A digital sound transmission standard which is based on a synchronous, self-clocking RS-422A compatible physical layer on top of which stereo digital audio and associated data (called sub-channel data) is transmitted. The standard has been strongly influenced by CD technology, and is mainly used between digital studio equipment. The standard specifies multiple sample rates (32kHz to 48kHz) and sample bit depths (up to 24 bits per sample). Originally developed by AES, later adopted by EBU; hence the name.
Beating
A phenomenon produced by the interference of two sinusoidal sounds at close enough frequencies; the audible sensation is that of a single sound, periodically modulated in amplitude by a second one. The reason for beating is the inability of human hearing to separate two frequencies that lie close to each other, together with the mathematical equivalence of such a combination to a sinusoidally amplitude modulated sine wave.
Clipping/overdrive/saturation
When circuits or transmission media are driven past the point of their maximum input amplitude, they tend to limit the signal to its maximum value. This can happen sharply (digital full scale) or softly (the sigmoid type limiting action of analog tape) and results in the effect of hard or soft limiting, respectively. Limiting produces heavy sidebanding and, consequently, harsh and nonconsonant distortion. Synonymous terms are overdrive (especially when speaking of amplifiers) and saturation (taken from tube amplifier terminology).
CODEC (Abbr.)
Coder/decoder. When talking about data transmission, a coder/decoder is a device or algorithm which works on a bidirectional data link, coding transmitted and decoding received data. Audio codecs usually use computer files, multimedia data streams or TV broadcast channels for their data.
Composite
A loose term used in this tutorial to mean sound signals that originate not from a single, but multiple sound sources, notably instruments. Examples include most recordings and natural sounds.
EBU
European Broadcasting Union; an organization formed originally by national radio stations in Europe. Specializes in broadcast audio distribution technology. Current standardization efforts include terrestrial digital radio, both for audio and various kinds of data.
Harmonic (overtone)
Given a signal, we can decompose it with the Fourier transform. Then, a harmonic (of some particular frequency present in the transform) is any frequency (also present in the analysis) which is at a whole number ratio to our base tone. If the signal is periodic, every partial present in the analysis is harmonic. The term implies an underlying base tone to which the harmonic in question is related. (Thus, one doesn't say that components present in a composite signal are necessarily harmonics, even though they may appear in integer frequency ratios.)
HRTF
Head related transfer function; the transfer function of the system resulting from the linear filtering action of placing the human body (especially the head) in a sound field. The main components arise from shoulder and ear lobe reflections and from diffraction effects on sound travelling around the head. Also used to denote the impulse response of such a system or any processing method simulating such a system (the usage is quite fuzzy indeed).
IMA
International MIDI Association; a consortium formed as a place for users of MIDI and related software to discuss their problems and propositions. IMA keeps close contacts with MMA to relay user input and suggestions to manufacturers.
Interpolation
Interpolation is a technique used to reconstruct waveforms from discrete samples taken from them. Many different such techniques exist, differing by their underlying mathematical structure. The most common ones are based on fitting a polynomial of some degree to the sample data. From this come the terms linear, quadratic, cubic and so on. These refer to the degree of the polynomial that is fitted. Interpolation methods are also named after the families of polynomials used (Chebyshev, Legendre etc.) and their construction (NURBS: nonuniform rational B-spline). Common to all these methods is that they strive for optimality in some sense - most try to achieve smooth approximations with the resulting curve passing through the data points. When used to reconstruct acoustic waveforms from evenly spaced samples, polynomial interpolation is never the optimal way. Instead, reconstruction by approximations to perfect lowpass reconstructing filters should be used.
Medium
A scientific term used to denote the underlying substance or space where waves travel. In the context of audio systems, air is usually the medium, although compressible liquids and solids can also transmit sound.
MMA
MIDI Manufacturers Association; a consortium formed to promote and refine the MIDI specification and to guide in the implementation of the standard. MMA extensions to the original MIDI specification include MIDI time signaling and SDS.
Modulation
The variation of some characteristic of a signal or a parameter of an algorithm producing the signal to achieve some specific goal. Examples include amplitude modulation (time-variant scaling of a signal (AM, tremolo)) and frequency modulation (variation of the repetition rate of a (quasi-) periodic signal (FM, vibrato)).
MP3
MPEG1 or 2 (sic!), layer 3 audio coding; a lossy, perceptual audio coding format widely used for the transmission of stereophonic sound, both in commercial and non-commercial environments. Layer 3 is the most sophisticated of the 3 layers specified for MPEG1 and MPEG2 (They share the same audio bitstream formats, only the allowed bitrates differ. Funnily enough, MPEG2 allows only three of the lower bitrates.). The standard does not specify the codec, per se, only the bitstream. However, implementation issues have stabilized fairly well by now. MP3 offers excellent audio quality for music and similar sound encountered on soundtracks at relatively low bit-rates (in the range from 48kbps to 192kbps). It isn't suitable for very low bitrate speech coding, for which different methods exist. The acronym comes from the common filename extension used for files of this content. (FYI: MPEG1 layer 1 audio coding is used in DCC under a different name.)
MPEG
Moving Picture Experts Group; a joint consortium of motion picture engineers. Standardizes movie related material. Commonly known for its MPEG1, MPEG2 and MPEG4 standards, which pertain to the digital coding and transmission of moving picture and associated sound (MPEG1-2), and multimedia (MPEG4, in draft stage).
Partial
Given a signal, we can decompose it with the Fourier transform. Then, a partial (of some particular frequency present in the transform) is any frequency present in the analysis. The term implies an underlying base tone to which the partial in question is related. (Thus, one doesn't say that components present in a composite signal are necessarily partials.)
PQ code(d)
PQ refers to the first two (of the eight, named from P to W) subchannel bits on CDs. These are used to carry auxiliary data, such as track information, the table of contents (TOC), catalog numbers, ISRC (International Standard Recording Code) information, de-emphasis status, SCMS copy propagation control and so on. The majority of this information is carried over the Q channel, accumulated in 98 bit frames, whereas the P channel carries a simplistic code denoting the starts and ends of CD tracks, lead-in and lead-out areas. Most current CD players are sophisticated enough not to use the P channel code at all, since all relevant information is also available through the more sophisticated Q coding scheme. The addition of PQ code is a major portion of the CD mastering process - often manufacturing plant bound masters are simply referred to as being PQ coded or PQed.
SCMS
Serial Copy Management System; a protocol used for restricting digital copying of audio material in consumer applications. Based on sub-channel coding of generation identifiers and copy protection bits on digital audio media, such as DATs and CDs. Only implemented in consumer mode applications; pro mode applications ignore SCMS. AES/EBU in pro mode cannot even convey SCMS information.
SMDI
SCSI Musical Data Interchange; a data interchange standard originated in 1991 by Peavey Electronics. In the late 1980s and early 1990s, samplers were coming into fashion and a standardized way to exchange sample data was needed. As MIDI was quite old and extremely slow (MIDI choke was a problem even then), it was seen that a new bus was needed. As the SCSI (Small Computer System Interface) bus already existed and had proven to be interoperable, SMDI leveraged the existing technology. Nowadays SMDI can be used to convey all kinds of information besides pure sample data and is invaluable whenever samplers need to be integrated with the rest of the studio. As an added bonus, computer connectivity and use of existing SCSI hard drives became possible.
SDS
Sample Dump Standard; standardized by the MIDI Manufacturers Association, this protocol allows unified downloading of sample data to synthesizers and samplers through the MIDI bus. Utilizes SysEx messages and offers two separate modes: open loop and closed loop. Open loop corresponds to the usual MIDI connection topology, whereas in closed loop configuration a separate return cable is used to provide feedback. SDS is extraordinarily slow, even in the context of the MIDI physical layer. In addition, operating SDS reliably is quite difficult (to use SDS in closed loop mode, the physical cabling has to be changed, for instance) and so the standard is not currently widely deployed in studio environments.
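A back-of-the-envelope calculation shows how slow this is. The sketch below assumes the usual MIDI serial rate of 31250 bits per second with ten bits on the wire per byte, and a data packet of 127 bytes carrying 120 payload bytes of 7 usable bits each. The packet figures are stated here as assumptions, not taken from the text above, and the dump header and any closed loop handshaking are ignored.

```python
import math

MIDI_BITS_PER_SECOND = 31250  # MIDI serial rate
BITS_PER_MIDI_BYTE = 10       # 1 start bit + 8 data bits + 1 stop bit
PACKET_BYTES = 127            # assumed: 5 header + 120 data + checksum + end byte
DATA_BYTES_PER_PACKET = 120   # assumed payload, 7 usable bits per byte

def bytes_per_sample(bits_per_sample):
    """Each MIDI data byte holds 7 bits, so a sample needs ceil(bits / 7) bytes."""
    return math.ceil(bits_per_sample / 7)

def sds_transfer_seconds(num_samples, bits_per_sample=16):
    """Estimate the time spent on data packets alone, without handshaking."""
    samples_per_packet = DATA_BYTES_PER_PACKET // bytes_per_sample(bits_per_sample)
    packets = math.ceil(num_samples / samples_per_packet)
    total_bits = packets * PACKET_BYTES * BITS_PER_MIDI_BYTE
    return total_bits / MIDI_BITS_PER_SECOND

# One second of 16 bit mono sound at 44.1 kHz.
seconds = sds_transfer_seconds(44100)
print(f"{seconds:.1f} s to transfer 1 s of sound")
```

Under these assumptions a single second of CD quality mono sound takes roughly three quarters of a minute to move, before any handshaking delays are added.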
Sidebands
Frequency components added to a signal when it is put through a suitable modulation process. AM and FM, especially, produce sidebands. The name implies roughly symmetrical placing of the added components relative to the original unmodulated signal.
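For plain amplitude modulation the sidebands can be seen directly from a trigonometric identity: multiplying a carrier by (1 + m cos) gives the carrier itself plus two components of amplitude m/2, placed symmetrically at the carrier frequency minus and plus the modulator frequency. The sketch below checks this numerically; the function names and the frequencies are chosen only for the example.

```python
import math

def am_signal(t, carrier_hz, mod_hz, depth):
    """Carrier multiplied by (1 + depth * modulator): classic amplitude modulation."""
    modulator = math.cos(2 * math.pi * mod_hz * t)
    return (1 + depth * modulator) * math.cos(2 * math.pi * carrier_hz * t)

def carrier_plus_sidebands(t, carrier_hz, mod_hz, depth):
    """The same signal written out as three plain sinusoids."""
    carrier = math.cos(2 * math.pi * carrier_hz * t)
    lower = 0.5 * depth * math.cos(2 * math.pi * (carrier_hz - mod_hz) * t)
    upper = 0.5 * depth * math.cos(2 * math.pi * (carrier_hz + mod_hz) * t)
    return carrier + lower + upper

CARRIER_HZ, MOD_HZ, DEPTH = 1000.0, 100.0, 0.8  # example values
SAMPLE_RATE = 8000

# Compare both forms over 1000 sample instants.
max_error = max(
    abs(am_signal(i / SAMPLE_RATE, CARRIER_HZ, MOD_HZ, DEPTH)
        - carrier_plus_sidebands(i / SAMPLE_RATE, CARRIER_HZ, MOD_HZ, DEPTH))
    for i in range(1000)
)
print(max_error)  # only floating point rounding error remains
```

FM behaves similarly in that the added components sit on both sides of the carrier, but it generally produces many sideband pairs instead of just one.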
SMPTE
Society of Motion Picture and Television Engineers; an organization of motion picture and television technology experts that standardizes technical aspects of moving picture and related data (such as audio) transmission and coding, such as frame rates, time codes and modulation techniques. Responsible for the time code format of the same name, which is commonly used in broadcasting, film production and professional audio applications as a common synchronization standard to tie the pieces of an audiovisual presentation together.
S/PDIF or IEC-958
Sony/Philips Digital Interface; a consumer derivative of the AES/EBU bus. Standardized by the International Electrotechnical Commission under the name IEC-958, but marketed as S/PDIF for consumer applications. (Technically, these are two different standards but in practice, they are almost identical. They interoperate perfectly.) Uses simplified AES/EBU (consumer mode) and includes provisions for copy management through SCMS. Used primarily for digital audio transmission in consumer applications, such as CD players, DATs, MiniDisc players, and DCC recorders. Applied on top of both electrical and optical interfaces.
Subchannel coding
The transmission of auxiliary data on CD data frame subchannel bits. Includes PQ coding of track and SCMS data, as well as the additional data oriented applications standardized as CD+G and CD+MIDI. Later, the same coding was transferred to AES/EBU frames and DAT tape.
Superposition
1. The addition together of multiple signals.
2. Mathematically, the superposition principle characterizes linear systems. What it says is that, first, if we input two signals to the system and add the respective outputs, we get the same result as we would get by inputting the sum of the original signals and observing the output (additivity). Second, if we scale the amplitude of the signal by a constant and observe the system output, the result is equal to inputting the unscaled signal and only after that scaling with the constant (homogeneity). All this put into a single formula gives the superposition principle. It is usually applied backwards: when we already know the system is linear, we can split a complicated input into simpler parts, find the output for each part separately and add the results.
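Written as a single formula, with S standing for the system, x1 and x2 for the input signals and a and b for constants, the principle reads S(a*x1 + b*x2) = a*S(x1) + b*S(x2). The following sketch tests that equation on two toy systems. The two-point moving average and the hard clipper are examples picked for this illustration, as are all the names.

```python
def moving_average(signal):
    """Two-point moving average: a simple linear system (input assumed 0 before start)."""
    previous = 0.0
    out = []
    for x in signal:
        out.append(0.5 * (x + previous))
        previous = x
    return out

def hard_clip(signal, limit=1.0):
    """Clip every sample to the range [-limit, limit]: a nonlinear system."""
    return [max(-limit, min(limit, x)) for x in signal]

def obeys_superposition(system, x1, x2, a=2.0, b=-3.0, tol=1e-12):
    """Check S(a*x1 + b*x2) == a*S(x1) + b*S(x2) for this one pair of inputs."""
    combined_input = [a * u + b * v for u, v in zip(x1, x2)]
    left = system(combined_input)
    right = [a * u + b * v for u, v in zip(system(x1), system(x2))]
    return all(abs(l - r) <= tol for l, r in zip(left, right))

x1 = [0.0, 0.5, 1.0, -0.5]
x2 = [0.25, -0.75, 0.5, 1.0]

print(obeys_superposition(moving_average, x1, x2))  # True
print(obeys_superposition(hard_clip, x1, x2))       # False
```

Note that a single passing check does not prove a system linear, since the equation has to hold for all inputs and constants. A single failing check, on the other hand, is enough to show that it is not.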
Vocoding
The superimposition of the estimated varying short-term spectral envelope of a signal on another. Used as an effect to create illusions of singing instruments and other spectral hybrids of separate sound sources.

Further reading

References

[Ben01] Benade, Arthur: Fundamentals of Musical Acoustics, Dover edition
A book covering most aspects of physical sound production and instrumental and room acoustics. Mathematics is kept to an absolute minimum here.
[Gol01] Goldstein, E. Bruce: Sensation & Perception, fourth edition
Some quite enjoyable reading on the psychophysics of sensation. In the fourth edition hearing is much more thoroughly discussed. This is an entry level book which puts more weight on the cognitive side of perception than does the one by Kandel et al. It is fun to read and helps shed light on the higher functions involved in perceptual tasks.
[Kan01] Kandel, Eric R.; Schwartz, James H.; Jessell, Thomas M.: Principles of Neural Science, third edition
An excellent book on general neural science. The chapter on hearing is the best I've ever read, but it's not for the faint hearted - it takes some effort to get a grip on the lingo. After that, it is a hefty bunch of distilled knowledge.
[Opp01] Oppenheim, Alan V.; Schafer, Ronald W.: Digital Signal Processing, second edition
After grasping the basics of DSP, this is the place to continue the quest. The level of mathematical sophistication is, partly, quite high. However, the more basic concepts and formulae are explained and illustrated very clearly. Highly recommended.
[Pen01] Penfold, R.A.: Advanced MIDI User's Guide, second edition
A fairly readable exposition of MIDI related techniques and technologies. Serves best as a reference, not a tutorial.
[Poh01] Pohlmann, Ken C.: Principles of Digital Audio, second edition
Excellent coverage of the technology aspects of digital audio. Places a strong emphasis on technical issues, such as coding, error correction, transmission formats and manufacturing technology. Serves well as a reference to most popular digital audio systems. Recommended.
[Roa01] Roads, Curtis: The Computer Music Tutorial
A quite all-encompassing reference on all aspects of computer music, digital audio and related areas of study. The emphasis here is on practical application instead of strict mathematical rigor. Easy, entertaining and comprehensive - a true bible.