real instruments, they sample almost any sound in existence. Because of this great generality, sampling is one of the most versatile synthesis methods in existence and is widely deployed in commercial synthesizers.
Under the hood, even a technique as naive as sample-based synthesis can
be quite tricky. When implementing wavetable playback, one needs to start
with a sampled sound and play it at a different speed, yet still produce
a data stream at a constant rate. (Only in special purpose synthesis
solutions can one expect to find multiple truly variable rate DACs. In
home multimedia products, one needs to do the mixing part in software and
deliver the resultant data stream through a single, fixed rate DAC.) The
standard way to accomplish this is to keep a counter which points to the
single sound sample now being played, and an increment which tells how much
to add to the counter to reach the next sample. Both the counter and the
increment need to include a fractional part, because otherwise one would
have only a very limited set of pitches that could be used. From the fractional
parts the next issue arises: what to do when you have to actually output
a sample from a fractional wavetable offset? One solution would be to just
neglect the fractional part (truncate the address to a whole number) and
use the previous true sample. This is bad - what it amounts to is severe
(by musical standards) nonlinear distortion; in effect, noise. A better
way would be to round the offset, but even this isn't suitable for high
quality synthesis. A better solution is to interpolate. That means you
take a couple of samples on both sides of the fractional location and
use them and the counter value to create a suitable sample in between.
The most common method encountered is linear interpolation: here one
conceptually draws a line between the sample values - basically
taking a weighted average of the two nearest samples where the weighting
coefficient is the fractional part of the address. This is already quite
good. As long as the signal is properly bandlimited (no components
approaching the Nyquist limit), no problem arises. However, linear
interpolation doesn't behave very well in the presence of very high
frequencies. To correct for this, one might add to the order of
interpolation - fitting splines, for example. But even this isn't optimal.
The reason is that straightforward lookup and interpolation lead to
aliasing, if large enough increments are employed, and, in addition to
that, polynomial interpolation is not, theoretically, the right way to
go. (When the order of the polynomial increases, so does the wiggle
between sample values - the result is poorly behaved and leads to high
frequency distortion.) Also, because one is in effect down-sampling the
signal, sooner or later some of the higher frequency components will
fold and sound bad. The optimal solution to the problem would be to do
true band-limited interpolation. (Meaning reconstruction by a sufficiently
accurate representation of the perfect low-pass filter and sampling the
result at specified intervals.) However, this is not suitable for
real-time operation and usually wouldn't even be worthwhile. (Some
oversampling headroom and linear interpolation will do the trick in most
musical applications.)
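The counter-and-increment scheme described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production resampler: the function name `play_wavetable`, the 64-point `SINE_TABLE` and the `interpolate` flag are hypothetical, the table is read cyclically, and no band-limiting is attempted, so large increments will alias exactly as the text warns.

```python
import math

def play_wavetable(table, increment, n_out, interpolate=True):
    """Read `table` cyclically with a fractional counter.

    `increment` is how far the counter advances per output sample
    (1.0 = original pitch, 2.0 = one octave up, 0.5 = one octave down).
    """
    size = len(table)
    phase = 0.0                       # counter, including the fractional part
    out = []
    for _ in range(n_out):
        index = int(phase)            # whole part: the previous true sample
        frac = phase - index          # fractional part: interpolation weight
        current = table[index]
        if interpolate:
            following = table[(index + 1) % size]
            # Weighted average of the two nearest samples.
            out.append(current + frac * (following - current))
        else:
            # Truncation: neglect the fractional part entirely.
            out.append(current)
        phase = (phase + increment) % size
    return out

# Hypothetical test table: one cycle of a sine wave in 64 points.
SINE_TABLE = [math.sin(2.0 * math.pi * k / 64) for k in range(64)]
```

With this small table, truncation gives errors on the order of the step between neighbouring table entries, while linear interpolation reduces the error by roughly two orders of magnitude for a smooth, well bandlimited table such as this one.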
In addition to the basic resampling process, some other features are also
needed to make a workable synthesis method. The first is looping. When
using a sampled sound, a lot of space is required to hold the wavetable.
Often this requirement creates a need to somehow compress the sound. The
most obvious way is to make it loop. This means wrapping the counter
over some specific point in the wavetable back to some earlier address.
This is useful because most instrumental sounds start with an attack
transient and decay to a quasi-periodic waveform that is often amenable
to looping. When the sound reaches the loop, it also sticks - depending
on the length of the loop, slow variations in the sound disappear or become
cyclic. This can be good or bad, depending on the goals of the musician.
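Looping amounts to one extra wrap test on the playback counter. The sketch below assumes a hypothetical `play_looped` function in which `loop_start` and `loop_end` are sample indices (the end is exclusive) and linear interpolation is used across the loop seam; real samplers typically add crossfading around the seam, which is omitted here.

```python
def play_looped(sample, increment, loop_start, loop_end, n_out):
    """Play `sample` from the beginning, then cycle in [loop_start, loop_end)."""
    loop_len = loop_end - loop_start
    phase = 0.0
    out = []
    for _ in range(n_out):
        index = int(phase)
        frac = phase - index
        following = index + 1
        if following >= loop_end:     # the neighbour across the seam
            following = loop_start
        out.append(sample[index] + frac * (sample[following] - sample[index]))
        phase += increment
        while phase >= loop_end:      # wrap the counter back into the loop
            phase -= loop_len
    return out
```

For example, with a ten-sample ramp and a loop over indices 6 to 9, the attack portion (indices 0 to 5) is played once and the last four samples then repeat indefinitely.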
On top of the looping resampler, volume enveloping and low-order low-pass
filtering are usually applied. The need for the former is obvious: to create
discernible notes. Filtering, on the other hand, can be used in different
ways. The original reason (and the reason why most such filters are low-pass)
is that when one plays acoustical instruments louder, the resulting sound
is richer in high partials than when playing soft. This is an effect easily
emulated by a properly controlled time variable low-pass filter. But since
the time of the first samplers, filtering has become an expressive tool
as well (as in the analog synthesizers) and thus more complex filter
designs have proliferated in samplers as well.
What are the pros and cons of sampling synthesis, then? On the plus side, sampling is extremely versatile: you can sample almost anything with a good Akai. It is also quite easy and efficient to implement, leading to low-cost designs. Further, when beefed up by some additional processing, very convincing acoustical instrument simulations can be created (witness the K2500). Drums in particular are almost made for sampling. And, because of the inherent quality of sampling, a true industry of sound distribution and reuse has developed around sampling - instruments and samples are easy to obtain and to create. On the minus side, sampling isn't very good at creating time variable timbres, per se. Original synthesis is also very difficult on the basic sampler. A lot of memory is required and since layering (using multiple separately controlled sampled sounds to create one instrument) is widely employed, processing requirements are not as low as one might think. Sampling doesn't have many perceptually significant parameters when thought of as a synthesis method, and it also has a tendency to distort sounds in a non-natural way when significant pitch shifts are used. (Playing a sample faster means shortening it - not something that happens in natural instruments or that the ear is used to. This shortening leads to severe distortion of the oh-so-important attack transient in acoustical sounds. Further, pitch shifting stretches the harmonics of the sound proportionally. This is not something that would occur in natural instruments, either.)
Beefing the idea up with more sophisticated filtering, layering, modulation (on amplitude, pan position, pitch, filter parameters etc.), multisampling (multiple samples of the same instrument for different parts of the scale), per timbre effects, wavetable interpolation (fading multiple wavetables in and out in series) and so on makes the method very usable but sampling is still quite limited if truly original synthesis or complex modulation and performance control are needed.
Subtractive synthesis is the prominent method in analog synthesizers and,
of course, has its digital counterpart. The idea is to start with a sound
rich in partials and subtract from it to create a desired timbre. In
effect, this means placing time-variable filters on the signal path. In
analog implementations, the filter is usually of the 12dB or 24dB per
octave resonant low-pass or multimode type. In digital implementations
(e.g. Z-plane synthesis, by E-mu), the filter order may be much
greater and in some academic ones even approach the orders used in
linear prediction, making the method suitable for spectral analysis
based modelling. The basic method is quite limited, though, because of
the low number of controllable parameters. It is thus often added to by
including more than one oscillator (useful for detuning - a
method where more than one oscillator in almost identical
configurations are used in parallel, creating complex beating of the
partials, something which contributes to long scale evolution of the
sound and that has really become a cliche in analog synthesis), usual
parameter modulation, signal modulation (amplitude and frequency
modulation at audio rates), nonlinear distortion, feedback, hybrids with
sampling and arpeggiators (simple, fast, looping sequencers). In digital
implementations, subtractive synthesis is almost invariably crossbred
with sample playback.
In implementations (especially analog ones), the starting point is usually
a simple-to-produce, harmonically rich waveform, such as a triangle,
sawtooth, rectangular, pulse or noise signal. Often these have modulatable
parameters, such as pulse width and often interlocking of multiple
oscillators is also possible (i.e. hard sync). The filters
are fairly standard, although implementation details vary. (As good
examples one could give the Moog ladder filter, the Prophet's distinctive
filter sound and, of course, the crappy-but-so-wonderful 303 filter section.)
In the analog synthesis community, many filter designs have really become
institutions and so are the single most sought-after feature of some
synths. The digital counterparts of these filters are often much more
accurate, which leads to a certain lack of depth in their sound. (This
is why analog filters sound so much warmer. Most of the analog
sound actually comes from design flaws, component inaccuracy, thermal
drift and non-linear distortions in circuit components.) Also, it is
surprisingly difficult to produce harmonically correct basis waveforms
digitally. The reason is the same one that was encountered in sampling
synthesis: straight wavetable methods do not suffice here because of
aliasing and interpolation artifacts.
Subtractive synthesis is a very workable method. Because low-order filtering
is very intuitive, subtractive synthesis is easy and rewarding to use.
Most of its parameters also have proper psychoacoustical meaning - timbre
is created by taking a proper starting waveform and shaping its spectrum
with filters. Modulation is then used to shape the sound into an
instrument and a little twitching makes the sound more lively and organic
(this is mostly about detuning, feedback, sync etc.).
Also, the method is quite easy to implement in analog circuitry. On the
negative side, accurate instrument simulations are surprisingly difficult
to create because of the simplicity of the synthesis engine. Digital
implementations are often quite problematic and do not sound very good
without extensive modification and addition of features.
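As a rough illustration of the basic signal path (a harmonically rich oscillator followed by a time-variable low-pass filter), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the oscillator is a deliberately naive sawtooth (so it aliases, exactly the problem mentioned above), and the filter is the well-known Chamberlin state-variable structure with a 12dB per octave low-pass output, chosen here only because it is short. It is not a model of any particular analog filter.

```python
import math

def naive_saw(freq, sample_rate, n):
    """Non-bandlimited sawtooth in [-1, 1); aliases at high pitches."""
    phase, out = 0.0, []
    for _ in range(n):
        out.append(2.0 * phase - 1.0)
        phase = (phase + freq / sample_rate) % 1.0
    return out

def svf_lowpass(signal, cutoffs, damping, sample_rate):
    """Chamberlin state-variable filter, low-pass output.

    `cutoffs` holds one cutoff frequency in Hz per input sample, so the
    filter can be swept over time. `damping` is 1/Q: smaller values give
    more resonance. The structure is only stable for cutoffs well below
    the Nyquist frequency (roughly sample_rate / 6).
    """
    low = band = 0.0
    out = []
    for x, cutoff in zip(signal, cutoffs):
        f = 2.0 * math.sin(math.pi * cutoff / sample_rate)
        low += f * band
        high = x - low - damping * band
        band += f * high
        out.append(low)
    return out
```

A classic filter sweep would then be something like `svf_lowpass(naive_saw(110.0, 44100.0, n), sweep, 0.3, 44100.0)`, where `sweep` is a list of `n` cutoff values falling from a few kilohertz to a few hundred hertz.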
Additive synthesis reflects the mental opposite of subtractive synthesis. Whereas subtractive synthesis takes a top-down simplify-from-complex attitude, additive synthesis works from bottom up, combining simple sounds to form more complex ones. The basic prototype has its roots in Fourier theory: any sound can be created by combining multiple sine waves at different frequencies, phase angles and amplitudes. Additive synthesis, then, strives to create instruments by decomposition and reconstruction.
Implementation of additive synthesis is quite straightforward - one only needs a way to create a lot of sine waves. Why so many? Because most instrumental sounds include rapidly varying and stochastic components that arise from nonlinear interactions in the instrument. This is something that leads at best to hundreds of partials, all time variant and often multiply interconnected (e.g. very small changes to the original instrument sound require very large-scale modification in additive synthesis parameters to achieve accurate reproduction). Thus, although theoretically perfect, additive synthesis is not very well matched to the actual production of sound in physical instruments. The great number of partials makes true additive synthesis less than efficient to implement. Some simplifications can make the method more usable though. The first one is not to allow arbitrary sine waves but to group them into bundles of mutually harmonic partials. This allows the use of the Fast Fourier Transform to generate each group efficiently. If the further simplification of disallowing separate envelopes inside such a harmonic group is made, group additive synthesis results - here each group can be recreated by wavetable lookup and amplitude scaling, which is very efficient indeed. And the resulting synthesis quality is still excellent. A completely different - but in certain situations even more powerful - optimization is based on discrete summation formulae (DSF). These are mathematical equivalences derived from trigonometric identities. They make it possible to calculate values for some special classes of functions (most often polynomials, whence the name) very efficiently by simplifying them through trigonometric manipulation. For instance, there is an efficient closed form equivalent for a trigonometric polynomial composed of the first n even harmonics, assuming each sinusoid is present at an amplitude of some constant times the amplitude of the preceding one (i.e. the spectrum decays exponentially). This particular DSF can be used to implement bandlimited square waves without any oversampling or filtering - a major speedup on general purpose hardware.
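To make the idea of a discrete summation formula concrete, the sketch below evaluates one textbook identity: the finite sum of sinusoids whose amplitudes form a geometric series, computed both term by term and in closed form. The closed form follows from summing a complex geometric series. This is offered as an assumed, representative example of the DSF family, not necessarily the exact formula the text has in mind; the function names and the parameters `theta`, `beta`, `a` and `n` are hypothetical.

```python
import math

def dsf_direct(theta, beta, a, n):
    """Sum of a**k * sin(theta + k*beta) for k = 0..n, term by term."""
    return sum(a ** k * math.sin(theta + k * beta) for k in range(n + 1))

def dsf_closed_form(theta, beta, a, n):
    """The same sum at a fixed cost, independent of n.

    Assumes the denominator is non-zero, which always holds for |a| < 1.
    """
    numerator = (math.sin(theta) - a * math.sin(theta - beta)
                 - a ** (n + 1) * (math.sin(theta + (n + 1) * beta)
                                   - a * math.sin(theta + n * beta)))
    denominator = 1.0 + a * a - 2.0 * a * math.cos(beta)
    return numerator / denominator
```

Setting `beta = theta` gives a harmonic series on the fundamental `theta`, and `beta = 2 * theta` gives only every other harmonic. Choosing `n` so that the highest partial stays below the Nyquist frequency yields a bandlimited waveform at the cost of a handful of sine evaluations per sample, however many partials are present.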
Additive synthesis is probably the most versatile synthesis method in existence. Any sound can be represented accurately by it. It is also capable of creating new timbres from scratch, in addition to being susceptible to analysis-resynthesis techniques. It is also one of the few synthesis methods for which automatic transcription of instrumental sounds is fairly well developed. As a general synthesis method, it is also unusually accurate - even the slightest nuances can be captured by it. Additive synthesis is, however, computationally expensive and near impossible to implement in analog form. (Due to the high number of partials, noise levels shoot through the roof.) It also requires immense amounts of control data, even in reduced form. (Amplitude and frequency envelopes for each of the partials in nonreduced form.) Tweaking is possible, but due to the frequency envelope sensitivity of the human hearing, large scale modifications to synthesis parameters are necessary to produce a natural sounding modification to a timbre. Thus the psychoacoustical significance of a single parameter is quite limited. This is the reason additive synthesis easily leads to thin sounding instruments if operated manually - the complex harmonic structure of a sound is easy to destroy. Additive synthesis also behaves rather badly in the presence of stochastic components and highly transient signals. (These do not gracefully decompose into neat, slowly varying sinusoidal partials. The result: a huge number of partials with rapidly varying parameters - difficult to implement efficiently and quite storage and control rate hungry.) Additive synthesis also takes a lot of programming time and is difficult to master; consequently it is not widely used. As a recent example, the Kawai K5000 employs additive synthesis with six parts of 64 harmonic partials (either the lowest or highest 64 of 128), almost certainly implemented by FFT.
Phase modulation is a synthesis technique with a long history. The first forms of phase modulation can be found in, where else, radio technology. There it is commonly employed in FM radio. Phase modulation was also used with analog synthesizers, but the limited accuracy of analog oscillators and the difficulty of building oscillators with negative frequency support hindered the analog implementations. The true breakthrough of FM technology came with John Chowning and the subsequent patenting of the method for sound synthesis by Yamaha. The result was the DX7, probably the most successful single synthesizer in existence. More recent derivatives include the OPL2-4 synth chips (ADLIB etc.) and Yamaha's more mature version of the DX7 synthesis principles, the SY series.
The idea behind FM synthesis is that quite rich and deliciously time-variable timbres can be created by modulating the frequency of a carrier sine oscillator by another, the modulator. When the modulator frequency stays below 20Hz or so, only more or less rapid vibrato results. But when the modulation frequency rises to the audio band, the characteristic sidebands resulting from the modulation process can be heard. The sidebands are (generally) not harmonic, except in special cases. These come about when the frequencies of the carrier and the modulator form a simple ratio. The method has few variable parameters, these including the volumes of the two oscillators (the modulator volume affects timbre, not volume) and modulator frequencies. The basic configuration is two oscillators cascaded, as described above, plus envelope generators to control the amplitudes. Often more oscillators are used as well, since interesting (and complex) inharmonic spectra are thus easily produced, allowing for quite realistic bell and brass sounds to be generated. The most characteristic FM sound is a slow sweep of the modulator volume, while keeping the carrier-modulator frequency ratio constant. This produces the well-known ADLIB timbre. Common modifications include several two-oscillator complexes in parallel (allowing for a form of additive synthesis), multiple oscillators in series (allowing for extremely inharmonic and noise-like spectrum formation), non-sine components (they produce a richer sound and, when added in parallel, result in modified group additive synthesis that complements the capabilities of the base FM system), layering, feedback (for noise and weird sounds and adding long term development to the sound), addition of filters (since FM can produce most of the basis waveforms for subtractive synthesis, this also complements the capabilities of the synthesizer) and several combinations of the preceding. 
Some specific modifications include a limited form of FM, called formant FM, which is capable of producing voice like timbres and formant peaks and has an associated analysis procedure which makes instrument design considerably easier, and a couple of other academic projects, with no presence in the commercial music business.
The implementation of FM synthesis is very easy, the only problem being aliasing which results from high modulator frequencies. (In theory, the bandwidth of the modulated sinewave is infinite, with sidebands falling off quite rapidly. When one increases the modulator frequency, the more dominant sidebands begin approaching the Nyquist frequency, and eventually wrap over. Usually, this is not taken to be a problem.) Specifically, the computational cost of the algorithm is very low due to the high simplicity of the algorithm - nothing more than a couple of table lookups are needed to produce a sample by FM. The cost increases when more oscillators, options and enhancements are added, but if implemented in hardware (which is also quite easy and has resulted in the commercial synthesizers of the DX and SY series), FM can generally achieve very high polyphony with minimal control data, reasonable sound quality and very low cost.
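The basic two-oscillator configuration can be sketched as follows. This is a minimal illustration, assuming the common formulation in which the modulator is added to the phase of the carrier; the function name `fm_tone` and the parameter `index` (the modulation index, i.e. the "modulator volume" of the text) are hypothetical, and `math.sin` stands in for the table lookups a real implementation would use. Envelopes are left out.

```python
import math

def fm_tone(carrier_hz, modulator_hz, index, sample_rate, n):
    """Simple two-operator FM: a sine carrier whose phase is driven
    by a sine modulator scaled by `index`."""
    out = []
    for i in range(n):
        t = i / sample_rate
        modulator = math.sin(2.0 * math.pi * modulator_hz * t)
        out.append(math.sin(2.0 * math.pi * carrier_hz * t + index * modulator))
    return out
```

With `index` at zero the output is a plain sine at the carrier frequency. As `index` grows, energy moves from the carrier into sidebands spaced at multiples of the modulator frequency on both sides of the carrier, so sweeping `index` with an envelope while holding the frequency ratio fixed gives the characteristic FM timbre sweep.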
The other two forms of phase modulation, general phase modulation and
phase distortion modulation, are less used. Phase modulation, in general,
means modulating the instantaneous phase of a carrier. This is very similar
to FM, except for the fact that arbitrary phase curves are allowed. The
advantage is minimal. Phase distortion modulation was originally used by
Casio in its CZ-series synthesizers as a way to circumvent Yamaha patents.
The idea, here, is to vary the reading speed of the carrier wavetable
during a single cycle of sound production. The modulator function is
essentially a saw-tooth wave, with the form and frequency depending on
the carrier frequency. It's sort of like hard synced phase modulation.
The CZ series synths are very nice for beeps and buzzes, something that
is quite hot in the techno scene, nowadays. For real instrument simulation,
phase distortion is practically useless. (Although the CZ's do a remarkable
job, considering what's under the hood.)
So, the pros and the cons. FM synthesis is cost-efficient and easy to implement. Additionally, the parameters are, in a sense, acoustically significant and quite easy to predict because a firm mathematical theory exists (in terms of Bessel functions of the first kind) for the formation of the sidebands. Also, since the prime function of the modulation process is to spread the carrier frequency into multiple, symmetric sidebands around the original carrier, the method can be used to create rough estimations of formants. Because most FM realisations include many options and enhancements, they are well suited for original synthesis - many unheard of sounds can easily be produced. However, since the synthesis procedure bears absolutely no resemblance to the formation of sound in nature, the method is poorly suited to general simulation of acoustical instruments. The method has its own distinctive sound which can be extremely annoying in the long run. Further, the method has been the intellectual property of the Yamaha corporation for so long that it has not gained long enduring acceptance outside the academic community.
Waveshaping is just what the name tells you - it takes a simpler wave and shapes it until it sounds right. The most simple form takes a sine wave (only one frequency) and passes it through a carefully crafted function (usually implemented by a lookup table) that adds sidebands to it, based on the amplitude of the original signal. The theory behind this method is that if the function is a suitable Chebyshev polynomial, any combination of upper harmonic partials can be produced from a steady state sinewave. When one, then, varies the sine volume, the larger the volume is, the more harmonic content there is in the resulting waveshaped signal. Thus descending volume produces descending harmonic content - something that is characteristic of most instrumental sounds. Thus we have hope that this might help produce some realistic timbres. And it does. The problem is that although the theory is well-developed, one waveshaper almost never suffices. This is because most instrumental sounds include elements (transients, stochastic components, inharmonic partials and partials with different volume envelopes) that make it impossible to synthesize the sound with a single waveshaper. More sophisticated versions exist, including waveshaping of non-sine input signals (harmonic or inharmonic, of which the latter is more complex to analyse), combinations with filtering and cascading multiple waveshapers, either in series or in parallel. Some research into using multiparameter functions as waveshapers has also been done, the result being called wave terrain synthesis. (In this case we have multiple input signals which are combined by the waveshaping function. This allows the different input signals to beat against each other and, thus, to produce long term evolution. Similar effects arise with waveshaping of nonharmonic signals, but the total energy of the output signal is more difficult to control.) 
The problem is, none of these really has sufficient theory behind them to make them easily applicable, let alone to allow instrument design to be automated.
As said, implementation of waveshaping is easy: all you need is a lookup table (with interpolation, probably) and a simple oscillator with amplitude control. You use the output of the oscillator to lookup from the table. But in reality, nothing is this simple. The problem is, once again, the one of aliasing. As the process is nonlinear, it adds to the frequency content of the input signal. Especially, it widens the signal bandwidth. (The higher the degree of the shaping polynomial, the more marked the effect. For example, raising to the second power (squaring) doubles the bandwidth.) The result is that with high input frequencies and/or insufficiently smooth shaping functions, significant aliasing may occur. But the low computational cost of the algorithm sometimes makes it worthwhile as well as makes it useful as a building block for more sophisticated hybrid synthesis methods.
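The Chebyshev property the theory rests on is easy to check numerically: the n-th Chebyshev polynomial of the first kind maps cos(x) to cos(n*x), so a weighted sum of such polynomials turns a full-amplitude cosine into a chosen mix of harmonics. The sketch below is only an illustration; the names `chebyshev` and `make_shaper` are hypothetical, and the polynomial is evaluated directly instead of being stored in the lookup table a real implementation would probably use.

```python
import math

def chebyshev(n, x):
    """T_n(x) by the recurrence T_0 = 1, T_1 = x, T_{k+1} = 2x*T_k - T_{k-1}."""
    previous, current = 1.0, x
    if n == 0:
        return previous
    for _ in range(n - 1):
        previous, current = current, 2.0 * x * current - previous
    return current

def make_shaper(harmonic_amps):
    """Build a shaping function; harmonic_amps[k-1] is the target
    amplitude of harmonic k for a unit-amplitude cosine input."""
    def shape(x):
        return sum(amp * chebyshev(k, x)
                   for k, amp in enumerate(harmonic_amps, start=1))
    return shape
```

Feeding `shape` a cosine of amplitude 1 gives exactly the requested harmonic amplitudes. At lower input amplitudes the higher-order terms shrink fastest, which is the "descending volume produces descending harmonic content" behaviour described in the text. Note that a shaper with harmonics up to order k also multiplies the bandwidth by k, hence the aliasing concern.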
The better side of this algorithm is its simplicity and the rugged theoretical foundation on which it is built. Also, some instruments are fairly well modelled with variants of the waveshaping algorithm. When combined with filters, some quite usable synthesis methods can be built. However, since the algorithm has few parameters (aside from the input waveform and lookup table contents, which are difficult to modify systematically on the fly), it allows little in the way of modulation effects and long-term development in the sound. By itself, the method also produces difficult to control volume envelopes, since the nonlinearity inserts uncontrollable extra energy into the sound. The situation is even worse with the more complex variants.
Granular synthesis has its roots in the area of quantum physics and wavelet
analysis. The basic premise here is that signals can be decomposed in
bases different from the classical Fourier one. (Well, to tell the truth,
wavelet decompositions do not generally form bases, only generalized
frames, which do not meet proper orthogonality requirements.) Especially,
we might wish for a decomposition in which local changes to the signal
being analysed only result in local changes to the analysis result. This
means we want time localization. Classical Fourier integral transforms
have no such thing: add a local bump and the whole frequency spectrum
changes, add a jump discontinuity and you get nonuniform convergence/Gibbs'
phenomenon. The problem is in the result the quantum mechanics people
call the uncertainty principle (it was formulated in quantum mechanics
by Werner Heisenberg). What it says is, basically, that no matter what
decomposition you have, you always have strict bounds on time-resolution
of the analysis in terms of the frequency resolution and vice versa. What
does this mean? It means that since Fourier analysis has infinite frequency
resolution (Fourier integral transform gives you the exact frequencies
required to synthesize the signal), it necessarily has no time
localization. On the other end of the scale we have analyses that have
no frequency resolution (decomposition into an integral transform of
delta-distributions) but have perfect time resolution (they give indefinitely
accurate times of occurrence for all the deltas). All this seems a bit
odd, since our ears can certainly pinpoint sounds in both frequency (or
sounds wouldn't have a pitch) and time (or we wouldn't need the concept
of notes). So there should exist a form in-between that behaves similarly
to our hearing organ. Such a form could also be very useful. (Strictly
speaking, our ears do not decompose sounds losslessly or even uniquely -
you could not create anything like an Ear Transform and its inverse.)
Such forms do exist. They are standard material in wavelet analysis. The basic idea is to trade frequency resolution for time localization and the other way around, depending on your needs. What results are transforms which have both good time and frequency resolution. (But, with the restriction, that not both the analysis wavelet and its Fourier transform can have compact support - in English, one cannot have analyses with both perfect time and frequency localization. If the analysis wavelet spans only a limited portion of the real axis, it will, to some extent, span the whole spectrum and vice versa.) These kinds of analyses permit decomposition of sound signals in ways that slightly resemble the way our ears decompose sound. The inverse of this procedure leads to/resembles granular synthesis, which has its theory rooted in the writings of Dennis Gabor and, later, in the musical applications end, Iannis Xenakis.
The basic premise is that we can create rich sound textures by superimposing large numbers of small sound grains - little pieces of sound that have little distinctive flavor on their own, but when used in large numbers, coalesce into a coherent sonic matte. Usually these grains are windowed sine waves (often using a truncated Gaussian or raised-cosine window, since these yield good frequency localization without sacrificing compact support, i.e. finite length, of the grain) or something very close to them. These little sound bites are then combined stochastically, with parameters such as density (grains per second), mean frequency, mean length, variance and envelopes of the previous controlling the overall sonic experience. What results is an extremely powerful, general and easily adapted technique of sound generation that yields very rich timbres and lends itself as well to automated design as to creation of entirely new instruments. Also, by using more sophisticated forms of control (such as statistical distributions and/or grain by grain control) and by substituting richer grain material (non-sine waves, different windows, chopped natural sound etc.), the method scales almost indefinitely.
This all may sound like a lot of semi-scientific mumbo-jumbo, but in the end it is very easy to see what is going on, if some thought is put into it. Think about a sine wave. Its frequency content is simple: it's just a delta-spike at the frequency of the wave. (Let's not burden ourselves with the fact that in the normal sense of the word, these are not functions and the Fourier integral doesn't converge.) So we have only one frequency. Now let's take a Gaussian. We know that a Fourier integral transform of a Gaussian is another Gaussian. So we have a clear peak in the spectrum. Now, sample by sample, multiply these two together. What results is something like a windowed sine, except it doesn't vanish anywhere, but, instead, only decays rapidly towards zero. What is the spectrum of this new signal? It is the convolution of the spectra of the original two signals, i.e. a Gaussian with a higher center frequency. And the time domain representation decays quickly, so we can take it to be time-localized around the peak amplitude at the center of the original Gaussian. So we have a signal that is both time- and frequency localized. (See the illustration below to get a sense of what is going on.) Now we can add these together to add specific frequencies at specified times (approximately), something we certainly cannot do with inverse Fourier transformations unless we use the discrete version and window the results - something that is really just a naive version of the grain approach. Using stochastic control and great enough grain densities produces sounds that have no recognizable structure aside from the desired timbre.
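The grain construction just described (a sine multiplied by a Gaussian, then many such grains scattered in time) can be sketched as below. This is a rough illustration under assumptions of its own: the function names, the `width` parameter (the Gaussian's standard deviation as a fraction of the grain length), the uniform random placement of grains and the Gaussian spread of grain frequencies are all hypothetical choices, and the Gaussian is truncated to a finite grain length.

```python
import math
import random

def gaussian_grain(freq, length, sample_rate, width=0.15):
    """A sine at `freq` under a truncated Gaussian window of `length` samples."""
    centre = (length - 1) / 2.0
    sigma = width * length
    return [math.exp(-0.5 * ((i - centre) / sigma) ** 2)
            * math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(length)]

def granular_cloud(n_out, sample_rate, density, mean_freq, freq_spread,
                   grain_len, seed=0):
    """Superimpose randomly placed grains; `density` is grains per second.

    Requires n_out > grain_len. No output scaling is applied, so dense
    clouds can exceed the range [-1, 1].
    """
    rng = random.Random(seed)
    out = [0.0] * n_out
    n_grains = int(density * n_out / sample_rate)
    for _ in range(n_grains):
        start = rng.randrange(0, n_out - grain_len)
        freq = rng.gauss(mean_freq, freq_spread)
        for i, value in enumerate(gaussian_grain(freq, grain_len, sample_rate)):
            out[start + i] += value
    return out
```

Shorter grains are better localized in time but smear over a wider band of frequencies, and longer grains do the opposite, which is the uncertainty trade-off discussed earlier showing up as a plain synthesis parameter.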
VOSIM (VOice SIMulation) is actually a method that is completely independent of the grain based synthesis principles. But it shares some common ground with them, nevertheless. The idea of VOSIM is based around the source-excitation model of speech production. Here, speech is viewed as being produced by a linear filter (the vocal tract) driven by a series of pulses with a wide, relatively constant spectrum (glottal pulses). In VOSIM (which was originally developed as a side product of research into speech), one first looks at the spectral response of this conceptual filter. More often than not one can find the distinctive formant peaks characterizing the instantaneous quality of the sound. One then models the waveform by adding together carefully crafted signals composed of periodic decreasing trains of raised cosine pulses. The point behind the procedure is that the pulse trains form controllable formants: the decay factor controls the width of the formant lobe, the rate of repeat of the base raised cosine wave tells the center frequency of the formant and the repeat rate of the whole pulse train is the frequency of the glottal excitation function. (To see what is going on, see the picture to the left, the different parts of the basic VOSIM waveform can clearly be seen.) All in all, a second-order all-pole filter with the aforementioned properties (i.e. resonance frequency and Q-value), driven with a periodic pulse train is (rather crudely) approximated. If we make the assumption that speech can be modelled as an all-pole, pulse-excited filter, we can decompose it into parallel second order filter sections which, in turn, can be modelled by VOSIM generators. Very rich timbral envelopes can be modelled as a combination of additive VOSIM elements.
Advantage over a pulse excited filter bank: VOSIM is computationally cheap - one generator requires only a single multiplication per raised cosine cycle, a table lookup for the waveform and a counter to count to the length of the bigger cycle.
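To make the structure of the waveform concrete, here is a minimal sketch of one VOSIM period: a table of one raised cosine (squared sine) pulse, repeated a few times with a decaying amplitude and followed by silence. The function name and all parameter values are assumptions chosen for illustration.

```python
import math

def vosim_period(pulse_len, n_pulses, decay, silence_len):
    """One period: n_pulses raised cosine pulses, each `decay` times the
    amplitude of the previous one, followed by silence_len zero samples."""
    # Waveform table; sin^2 is the same shape as a raised cosine.
    table = [math.sin(math.pi * i / pulse_len) ** 2 for i in range(pulse_len)]
    period = []
    amplitude = 1.0
    for _ in range(n_pulses):
        period.extend(amplitude * s for s in table)
        amplitude *= decay  # the decay costs one multiplication per pulse
    period.extend([0.0] * silence_len)
    return period

# Hypothetical settings at an assumed 44.1 kHz sample rate:
# 44-sample pulses put the formant near 44100 / 44, about 1 kHz, and the
# 441-sample period gives a 100 Hz fundamental.
period = vosim_period(pulse_len=44, n_pulses=3, decay=0.7, silence_len=309)
```

Repeating `period` end to end gives the pitched tone; changing `pulse_len` moves the formant while changing `silence_len` moves the pitch, more or less independently.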
FOF, the brain-child of Xavier Rodet of IRCAM, is very similar in spirit to the VOSIM method, but is designed more for music and singing than for speech sounds. It's in fact closer to the granular methods, since it employs a bank of what resembles grain oscillators to produce unconventionally windowed sine waves. But the ideology behind the algorithm is closer to VOSIM - construction of speech and/or chant by methods derived from the source-excitation paradigm. FOF has been included in the influential MUSIC V and CSOUND synthesis languages, which makes for its widespread use inside the academic community.
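As a rough illustration of what an "unconventionally windowed sine wave" can look like, the sketch below follows the commonly published form of a FOF grain: a sine at the formant frequency with an exponential decay and a short raised cosine attack, retriggered once per fundamental period. The names, parameter values and the choice of decay rate as pi times the bandwidth are assumptions of this sketch, not details taken from the text.

```python
import math

def fof_grain(formant_hz, bandwidth_hz, attack_s, length_s, sample_rate=44100):
    """Exponentially decaying sine with a smooth raised cosine attack."""
    alpha = math.pi * bandwidth_hz   # decay rate; sets the formant bandwidth
    beta = math.pi / attack_s        # attack speed
    n = round(length_s * sample_rate)
    grain = []
    for i in range(n):
        t = i / sample_rate
        rise = 0.5 * (1.0 - math.cos(beta * t)) if t < attack_s else 1.0
        grain.append(rise * math.exp(-alpha * t)
                     * math.sin(2.0 * math.pi * formant_hz * t))
    return grain

def fof_voice(fundamental_hz, grain, total_s, sample_rate=44100):
    """Start one grain per fundamental period and sum the overlaps."""
    n = round(total_s * sample_rate)
    out = [0.0] * n
    period = sample_rate / fundamental_hz
    start = 0.0
    while start < n:
        offset = int(start)
        for i, s in enumerate(grain):
            if offset + i >= n:
                break
            out[offset + i] += s
        start += period
    return out

grain = fof_grain(formant_hz=600.0, bandwidth_hz=80.0, attack_s=0.003, length_s=0.02)
voice = fof_voice(fundamental_hz=110.0, grain=grain, total_s=0.1)
```

One such generator produces a single formant; a vowel-like sound would use a handful of them in parallel with different formant frequencies and bandwidths.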
Physical modelling is a bundle of methods which all aim at a common goal - the modelling of some of the relevant parts of sound production in real, physical instruments. There are many different ways to do what is called physical modelling, including waveguides, filterbanks, the finite element method, Karplus-Strong type algorithms, and then some. What is common to all of these, is that they implement different large scale theories of sound production in different types of instruments.
Waveguides are the prominent technology at the moment. They are based on an abstraction of sound transmission in instrument bodies and cavities as linear transmission of waves in a one-dimensional tube. The argument for woodwinds goes like this: since the inner tube of these instruments is rather thin compared to the wavelength of the sounds they emit, they can be abstracted with high precision as one-dimensional transmission lines with linear loss over the tube length. So the tube is modelled as a bidirectional delay line with occasional points of reflection (implemented as taps from one direction of the delay line to the other with a filter in between to model reflection losses) and a driving reed on the other end. The reed is implemented as a pulse producing oscillator with some controlled amount of non-linear response to the pressure in the near end of the tube. (This is included to model the reed reacting to the varying air pressure and giving rise to nonlinear effects on the amplitude and shape of the driving pulses it emits into the delay line.) Then the delay lines are tapped in appropriate places (mainly in the end of the tube, sometimes in the midst to model directional radiation and valve leakage) for sound transmission out of the system. All this is computationally heavy (a lot of delay memory and processing power for the filters are needed), but extremely high realism can be achieved. However, some problems arise, when the one-dimensional abstraction isn't as valid as in this case. Good examples are such instruments as the violin (where we can, however, model the strings as being one-dimensional) and drums (where the assumption collapses completely). In these cases more accurate simulations can be achieved by creating a two or three-dimensional mesh of delay lines, but now the expense starts growing immensely and good realism is much more difficult to achieve. 
(In the case of string instruments, the resonant cavity can sometimes be modelled sufficiently accurately as a linear filter, possibly by linear prediction techniques, but the strings still produce complications, since they have multimode behavior with nonlinear coupling between the modes. Longitudinal waves and twisting couple with the usual modes, especially at high playing volumes and when using the bow. And even in the case of multidimensional meshes, nonlinear coupling between modes in different directions complicates matters appreciably.)
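The bidirectional delay line described for tubes can be sketched as two delay lines, one per direction, joined at the far end by a lossy reflection. This is a deliberately stripped-down illustration: the nonlinear reed is replaced by an arbitrary excitation list, there is a single reflection point, and the function name, the two-point averaging filter and the reflection coefficient are all assumptions.

```python
from collections import deque

def waveguide_tube(excitation, length, n_samples, reflection=-0.95):
    """Minimal one-dimensional waveguide with a driven near end."""
    outgoing = deque([0.0] * length)    # wave travelling towards the far end
    returning = deque([0.0] * length)   # wave travelling back to the driven end
    previous_arrival = 0.0
    output = []
    for n in range(n_samples):
        drive = excitation[n] if n < len(excitation) else 0.0
        at_far_end = outgoing.pop()
        at_near_end = returning.pop()
        # Far end: tap from one direction to the other through a filter that
        # models reflection loss (two-point average times a coefficient).
        reflected = reflection * 0.5 * (at_far_end + previous_arrival)
        previous_arrival = at_far_end
        returning.appendleft(reflected)
        # Near end: the returning wave re-enters the tube along with the drive.
        outgoing.appendleft(drive + at_near_end)
        output.append(at_far_end)       # output tapped at the end of the tube
    return output

# A single impulse takes `length` samples to reach the far end, then
# bounces back and forth, losing energy on every reflection.
out = waveguide_tube([1.0], length=50, n_samples=400)
```

A more realistic model would replace `excitation` with a reed function that depends on `at_near_end`, which is where the nonlinearity discussed above enters.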
As said, there are other methods of physical modelling, so brief descriptions
are in order. The finite element method is based on completely different
principles from the other methods and is only mentioned for completeness.
The finite element method is heavy enough to be totally unusable as a
real synthesis method. Basically it is a generic method used to solve
partial differential equations numerically. Cannondale uses it to design
their bikes to withstand stress, for instance. But as wave transmission is
a phenomenon which is mathematically described by partial differential
equations, such numerical solutions actually are a way to synthesize
sound. FEM is only used in theoretical studies, though, since it hogs
mind-boggling amounts of computing power. More in line with the application
oriented note of this text, the Karplus-Strong algorithm can be thought of
as a greatly simplified version of the waveguide model, in effect one with
only a very simple (often first order) filter, a one way delay line with
feedback and a single random driving waveform. The basic method works by
filling the delay line with random numbers and then iteratively feeding
back the average of the last two samples of the output end to the input.
This creates surprisingly convincing string sounds. Modifications include
inversion of certain samples in the delay line (AM, if you wish), higher
order filters, fractional delay line lengths (with various kinds of
interpolation to achieve the desired effect) and signals added to the
delay line at specific points during the cycle. All in all, this is a
very well known synthesis method and a predecessor of most of the waveguide
methods. Finally, filter based methods of physical modelling rely on a
more classical analysis of sound and attempt to model the response of
approximately linear resonators by certain kinds of filters. One approach,
appropriately named modal synthesis, handles the problem by
subdividing it: the instrument is divided into parts whose characteristics
are known and, as vibration analysis data is readily available in
engineering literature, the differential equations describing these parts
are just looked up. After that all that needs to be done is to glue
the parts together and numerically solve the resulting equations - this
is often done just by creating difference equations to estimate the
original ones and running the resulting algorithms against our known
excitation functions. Of course, finding efficient and sufficiently
accurate ways to estimate the original equations can be quite tricky
indeed.
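The basic Karplus-Strong method described above is short enough to write out in full. The sketch below is one common way of realizing it; the function name, the seeded random generator and the rounding of the delay length to a whole number of samples (so no fractional delay interpolation) are assumptions made for brevity.

```python
import random
from collections import deque

def karplus_strong(freq_hz, n_samples, sample_rate=44100, seed=0):
    """Plucked string: a delay line of noise, fed back through a two-point average."""
    rng = random.Random(seed)
    delay_len = max(2, round(sample_rate / freq_hz))
    # Fill the delay line with random numbers (the "pluck").
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(delay_len))
    out = []
    for _ in range(n_samples):
        first = line.popleft()
        out.append(first)
        # Feed back the average of the two samples at the output end.
        line.append(0.5 * (first + line[0]))
    return out

# One second of a string tuned near 440 Hz.
samples = karplus_strong(440.0, 44100)
```

The averaging acts as a gentle lowpass inside the loop, so high harmonics die away faster than low ones, which is much of what makes the result sound string-like.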
As classical instruments are quite complicated in the mathematical sense, it is an enormously time consuming task to create accurate, efficient models of them. This means that automatic analysis or at least some good analytical tools to aid in the process would be nice. However, the very things that originally made the instruments difficult to analyze (nonlinearities and complex physical properties) also make completely automated analysis impossible. Tools are available, of course, but most of these are more in the line of classical spectral and statistical analysis rather than being especially suited for the task at hand. Currently this means that each instrument has to be modelled separately, from first principles, but some recent discoveries have eased the burden a little. The most important is called higher order spectral analysis (HOS). It was conceived to help in the analysis of general, nonlinear differential equations and systems and is thus quite a handy tool for the synthesist as well. The idea, here, is to track the complex dependencies between different vibratory motions appearing in a signal so that nonlinear interactions can be tracked down and isolated. This helps greatly in designing excitation sources and their coupling to the other parts of the instrument being modelled.
All in all, physical modelling is an extremely good choice for synthesis of many classical instruments, especially those of the wind and brass families. Its parameters directly reflect the ones of the real instrument and excellent emulations can be produced. Original synthesis is fairly easy on PM platforms. The downside is that serious processing power is needed, something that limits the polyphony of current PM implementations. In addition, instrument design can be very time consuming. Some types of instruments are more difficult to model, as well, especially instruments with significant two plus dimensional effects. These include e.g. drums and plates, and, to some extent, string instruments. Sometimes these problems can be solved by using modelling alongside other synthesis methods or expanding our models to include samples as excitation or by allowing traditional sound processing methods (effects, filtering etc.) to be applied within our instrument. Sometimes not. Progress is fast, techniques are developing constantly and the field will certainly get even more attention as time goes by and serious commercial applications continue to appear.
As computers have pervaded the music industry and academia, direct manipulation and trial-and-error methods (as opposed to careful top-down classical planning/composition and batch synthesis) have taken hold as a way of composing. With waveforms as the basic building blocks, sampling and digital processing have had a huge impact on how we see sound. This is the basis on which many a strange synthesis method has been built.
Common to the methods discussed here is that they are all influenced by the view of sound as a stream of numbers, a discrete signal. Since such signals are the natural representation of audio on computers, one might ask whether this view suggests original synthesis methods. And indeed, there are some synthesis methods that are based on boolean and other purely numeric manipulation of discretized signals. Examples include SAWDUST and others. The basic premise here is that since the sounds are byte streams, one should treat them as such and apply methods designed for number streams to them. Other influences include serial composition, deconstructionist ideology and granular synthesis methods, which suggest that it might be beneficial to adopt a truly bottom-up view of composition. Namely, one starts from individual samples, builds series, mutates by bit-wise, logical and numerical operations, splices and glues, mixes and transforms and iterates, reiterates and rereiterates... What results is something that truly is different (and pretty horrible sounding ;) - the result can be quite indistinguishable from digital noise, something you'd get if you converted a program image to sound. On the other hand, the result can be even melodic (which, however, is not usually the goal and doesn't happen by accident).
On the positive side, such methods are accurate to the maximum - one couldn't get more flexibility than full samplerate, sample level control. And the resulting sounds are new. But the utter lack of perceptual significance of the operations and the truly ad hoc nature of the algorithms pave way for their primarily academic interest. Meager results can be expected to result from such methods alone. However, in combination with other synthesis algorithms, such innovations can be useful. For example, many of the resulting digital timbres are excellent raw material for carefully crafted grains or attack transients for more conventional sounds.
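To give a feel for what treating sound as a byte stream means in practice, here are three toy operations on unsigned 8-bit sample streams: a logical combination, a bit-wise mutation and a splice. They are generic illustrations invented for this sketch and do not reproduce the operations of SAWDUST or any other particular system.

```python
def xor_streams(a, b):
    """Combine two 8-bit sample streams with a bit-wise exclusive or."""
    return bytes(x ^ y for x, y in zip(a, b))

def rotate_bits(stream, shift):
    """Rotate the bits of every 8-bit sample left by `shift` places."""
    shift %= 8
    return bytes(((x << shift) | (x >> (8 - shift))) & 0xFF for x in stream)

def splice(a, b, chunk):
    """Glue together alternating chunks of `chunk` samples from a and b."""
    out = bytearray()
    for start in range(0, min(len(a), len(b)), chunk):
        source = a if (start // chunk) % 2 == 0 else b
        out.extend(source[start:start + chunk])
    return bytes(out)

# Hypothetical raw material: a ramp and a square wave, 256 samples each.
ramp = bytes(range(256))
square = bytes([0, 255] * 128)
mutated = splice(rotate_bits(xor_streams(ramp, square), 3), ramp, 16)
```

None of these operations has any obvious perceptual meaning, which is exactly the point made above: the control is total, but what comes out is hard to predict.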
Analysis-resynthesis techniques are different from the other methods described here in that they are not stand-alone algorithms for sound synthesis - they always require some starting material for sound construction. Here we first take a sound, analyse it, modify the analysis data and then resynthesize it to create more or less similar sounds. The technique was already hinted at in the additive synthesis paragraph. This is because additive synthesis is the most straight-forward synthesis end for most analysis algorithms. Also, the amount of control data required by additive synthesis can realistically be produced only by automated analysis of existing instrumental sounds, followed, perhaps, by some hand-tuning to make for specific impressions. Good examples include such sound processing methods as vocoding (more on that in the effects section), linear prediction based synthesis of vocal/instrument hybrids and generation of instrument families by automatic transformation from a single member of the family (used by the additive synthesis community).
Analysis-resynthesis is good in that it is often quite an intuitive method. It also results in drastic savings of time when used in combination with additive synthesis, in comparison with raw additive. Furthermore, its different forms may allow for extensive modification and intuitive control of existing sound parameters, making it suitable for both original synthesis, transformation, mutation and automated conversion. The downside is that original material is required, the analysis quality is often far from perfect and great amounts of analysis data can result from processing rather simple sounds. Further, as the amount of data increases, the perceptual significance of a single parameter decreases - this results in the need for complex processing environments and extensive know-how to manage the resulting intermediate data. Analysis of sounds from instruments with stochastic and/or nonlinear interactions often presents the greatest challenge for additive analysis-resynthesis techniques, because an immense number of low amplitude sine waves are needed to account for the highly irregular and time-variable spectra involved. Problems of this kind are alleviated by combination with other modelling techniques, notably subtractive synthesis. A good example of this approach is spectral modelling synthesis (SMS), in which dominant partials are taken care of by decomposition into sinusoids and the residual signal is modelled as an additive, filtered noise source. Hybrids of this kind are often more viable than pure additive methods since they slice off difficult to model parts of the signal and leave harmonic analysis with more coherent data to work on. Result: less intermediate data with more significant parameters - an obvious win-win situation.
All in all, analysis-resynthesis really resides somewhere between a synthesis method, a generic sound transformation paradigm and an effects algorithm. Considering that, it is a great addition to our bag'o'tricks.
As indicated above on many occasions, most synthesis methods do not perform well alone. Many of the basic algorithms do one thing well but may fail miserably when something else is desired. An excellent example is FM synthesis: certain inharmonic sounds such as bell and tube sounds are amazingly well reproduced, as well as completely new synthetic sounds. But when string or woodwind sounds are needed, the method reveals its limits. Physical modelling can take care of these, but the tubes and bells do not reproduce well because of the limitations of the one-dimensional waveguide abstraction. That is why most commercial implementations of the different algorithms are hybrids: most samplers have filters, most subtractive synths have multiple waveforms and often some kind of waveform playback, many physical modelling synthesizers include a sample-based drum kit, at the very least and greatly modified FM algorithms are favored over pure FM. Furthermore, most electronic sound generation methods of today are enhanced by the addition of a selected assortment of digital effects.
There are also many less benign reasons for this trend towards greater complexity. One of them is the nature of the commerce - one has to have a distinctive product to make it to the stores. Another is the need to achieve cost-efficiency. Although some synthesis methods are capable of unbelievable generality (e.g. additive and physical modelling synthesis), their cost is so great that they cannot be incorporated into a mass produced synthesizer. It is cheaper to pack a few tens of megabytes of sample memory or a dozen different, lower computational cost algorithms into a module than to design a custom ASIC to do the job of handling a sufficient number of physical modelling voices. Then there are patent and intellectual property rights issues - one often needs to circumvent these by adding to one's repertoire of algorithms. Also, people want to have more power at their fingertips each day, especially since timbre has only now begun to play a truly important part in the fabric of modern popular music.
One final reason for the conception of highly hybrid synthesizer designs
is the need to model existing instruments - in a sense, to guard the
heritage. This is because unlike in the early days of the synthesizer
industry, replication of instrumental sounds is not necessarily the main
goal of synth design, anymore. Now one also has to be able to model the
electronic instruments of the past. To this end, a multitude
of analog emulation synthesizers have come to the fore. They employ
a number of different techniques to achieve their goal, some of which
are physical modelling techniques in a small scale, digitized (sampled)
versions of analog oscillators and filters (which are often extremely
difficult to faithfully reproduce in discrete form; witness the 303 and
Moog ladder, the latter of which includes a zero-delay loop in the
naively discretized version), samples of actual analog instrumental
sounds and ground-up rebuilds of analog instruments as digital-analog
hybrids. The success of this breed of synthesizers in their task depends
heavily on the original sound they attempt to replicate. The weirder the
original instruments, the harder the job of the architect. Analog instruments
often get their distinctive feel from design flaws, component weaknesses
and the generally weaker stability of analog designs - all things that are
difficult to spot when analysing an analog design and even more difficult
to model effectively and efficiently.
In describing synthesis algorithms, not much thought is usually given to their actual use or implementation details. One of the aspects usually neglected in brief treatments (such as this one), is polyphony and, with that, multitimbrality. Knowing how synthesis works is fine, but one cannot make any music before multiple voices and separate timbres can be combined. Polyphonic (as opposed to monophonic) is the word used to describe instruments which can generate multiple instrumental sounds at once and multitimbral the one used for an instrument capable of generating multiple separate timbres at once (i.e. in which separate voices can use separate parameters and/or algorithms).
Today's high-end low-cost synthesizers and computer sound cards are usually both polyphonic and multitimbral, which makes many people think this is the only way of doing things. However, in the past there have been many instruments which were either monophonic or monotimbral, or which had only limited multitimbral capabilities. (Nowadays, most sampling based instruments allow completely different parameters to be used on each synthesized voice. We will call this full multitimbrality from now on. This is in strict contrast to many analog synthesizers which, were they multitimbral at all, usually limited the number of simultaneous timbres to two, four or some similarly restricted number.) The usual reason for limiting these capabilities is implementation complexity: it is surprisingly difficult to build cost-efficient hardware that allows for such complex setups and signal routings as required by full multitimbrality. Sometimes one can also drastically optimize one's implementation if timbrality is restricted: wavetables can be reused, fewer translation tables need to be kept for interpretation of modulator data, effects routings can be simpler and so on. And as for polyphony, some synthesis algorithms are so complex that available/affordable hardware cannot support more than monophonic operation. Full-blown physical modelling comes very close: Yamaha's original VL-1 was duophonic (i.e. had only two voices) and two part multitimbral. For the same reason some of the more intensive algorithms often limit available polyphony.
How is polyphonic/multitimbral operation implemented, then? One of the
more common ways is to use a signal processor (possibly augmented with
some special purpose hardware to do some of the routine calculations
involved), divide the computing capacity in equal parts, use these
time slots to implement voices and use a separate
microprocessor for control (e.g. enveloping, modulation, MIDI, user
interface,...). Almost all commercially available synthesizers use this
approach, varying only in the type of processors, software and auxiliary
chips employed. When one uses this approach, it is obvious that full
multitimbrality kind of comes for free - since each voice is
separate from the others, it can use its own local copy of synthesis
parameters and so can produce any timbre desired. This way, one gets a
bank of voices onto which a musical performance can then be mapped. And
if full multitimbral operation is available, this mapping can be quite
complicated indeed - a single logical note can map into multiple
simultaneous notes (layering) with more than one individually controlled
subpart (vector synthesis) each with many component events (wave sequencing),
sometimes with quite a bit of control data and parameters and even
multiple separate synthesis algorithms flying around. Such mapping is the
second reason for limited polyphony in current instruments - the resources
are there but patches tend to use more of them. It is no wonder, then,
that to avoid missing musically significant events, we need to
systematically weed out resource allocations which are of little sonic
importance. This is the subject of the next section on voice allocation.
There are, basically, two ways to map logical instruments into physical voices. The first is to use fixed mapping: instrument x always uses voice y. This is the approach used by monophonic instruments and older tracker type composition software. Instrument equals channel equals instrument. The second way is to use dynamic allocation: a new musical event is mapped to a free physical output voice at the time of its creation. Most instruments use the latter approach. This is because it brings with it a sort of useful abstraction - from the user's perspective, the instrument is constructed to behave as if it had practically infinite polyphony. From the implementation's point of view, the hardware only realizes a fixed number of physical voices and tries to allocate these voices to capture the most significant logical events. This might seem quite abstract, but has the feature that the logical and physical sides are decoupled - one can implement the same instrument with varying degrees of polyphony. The implementation approximates the perfect ∞-phonic instrument to an implementation dependent degree. The same songs will play, albeit with varying levels of sonic accuracy, on all synthesizers of a series.
This is good, but embodies a problem - how is the logical-physical mapping
done best? And no slight problem that is. It is extremely difficult to
determine algorithmically which of a multitude of competing events should
be realized and which - if any - can be discarded. Two crude heuristics
are commonly employed to solve the dilemma. The first is to discard the
oldest note still sounding, the other throws away the quietest. Both
give similar results, since in Western music, notes tend to die away
rather quickly. (I.e., Western music tends to have notes...)
Sometimes, to aid in simulating ensembles of independent instruments,
voices can be divided in banks (say, a minimum of 6 voices for a guitar)
and the allocation algorithm executed within a bank, only.
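The two heuristics mentioned above (discard the oldest note, or discard the quietest) can be sketched as a small dynamic voice allocator. The class and attribute names are hypothetical, the per-voice level is simply supplied by the caller rather than tracked from an envelope, and banks are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Voice:
    note: Optional[int] = None  # None marks a free physical voice
    started: int = 0            # value of the event counter at note-on
    level: float = 0.0          # current loudness of the voice

class VoiceAllocator:
    """Maps logical notes onto a fixed number of physical voices."""

    def __init__(self, n_voices, steal="oldest"):
        self.voices = [Voice() for _ in range(n_voices)]
        self.steal = steal      # "oldest" or "quietest"
        self.clock = 0

    def note_on(self, note, level=1.0):
        """Return the index of the physical voice chosen for this note."""
        self.clock += 1
        index = next((i for i, v in enumerate(self.voices) if v.note is None), None)
        if index is None:
            # No free voice: discard a sounding note according to the heuristic.
            if self.steal == "oldest":
                index = min(range(len(self.voices)),
                            key=lambda i: self.voices[i].started)
            else:
                index = min(range(len(self.voices)),
                            key=lambda i: self.voices[i].level)
        self.voices[index] = Voice(note, self.clock, level)
        return index

    def note_off(self, note):
        """Free every voice currently playing the given note."""
        for v in self.voices:
            if v.note == note:
                v.note = None
```

With two voices and the default policy, a third note takes over the voice that was started first:

```python
alloc = VoiceAllocator(2)
alloc.note_on(60)   # uses voice 0
alloc.note_on(64)   # uses voice 1
alloc.note_on(67)   # steals voice 0, the oldest
```

Bank-restricted allocation could be approximated by giving each bank its own allocator instance.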
| [Ben01] | Benade, Arthur: Fundamentals of Musical Acoustics, Dover edition |
| A book covering most aspects of physical sound production and instrumental and room acoustics. Mathematics is kept to an absolute minimum, here. | |
| [Gol01] | Goldstein, E. Bruce: Sensation & Perception, fourth edition |
| Some quite enjoyable reading on psychophysics of sensation. In the fourth edition hearing is much more thoroughly discussed. This is an entry level book which puts more weight on the cognitive side of perception than does the one by Kandel et al. This is fun to read and helps shed light on the higher functions involved in perceptual tasks. | |
| [Kan01] | Kandel, Eric R.; Schwartz, James H.; Jessell, Thomas M.: Principles of Neural Science, third edition |
| An excellent book on general neural science. The chapter on hearing is the best I've ever read, but it's not for the faint hearted - it takes some effort to get a grip on the lingo. After that, it is a hefty bunch of distilled knowledge. | |
| [Opp01] | Oppenheim, Alan V.; Schafer, Ronald W.: Digital Signal Processing, second edition |
| After grasping the basics of DSP, this is the place to continue the quest. The level of mathematical sophistication is, partly, quite high. However, the more basic concepts and formulae are explained and illustrated very clearly. Highly recommended. | |
| [Pen01] | Penfold, R.A.: Advanced MIDI User's Guide, second edition |
| A fairly readable exposition of MIDI related techniques and technologies. Serves best as a reference, not a tutorial. | |
| [Poh01] | Pohlmann, Ken C.: Principles of Digital Audio, second edition |
| An excellent covering of the technology aspects of digital audio. Presents a strong emphasis on technical issues, such as coding, error correction, transmission formats and manufacturing technology. Serves well as a reference to most popular digital audio systems. Recommended. | |
| [Roa01] | Roads, Curtis: The Computer Music Tutorial |
| A quite all-encompassing reference on all aspects of computer music, digital audio and related areas of study. The emphasis, here, is on practical application instead of strict mathematical rigor. Easy, entertaining and comprehensive - a true bible. |