A continuous time model for Karnatic flute music synthesis

Article: 2251755 | Received 07 Dec 2022, Accepted 13 Aug 2023, Published online: 03 Sep 2023

Abstract

Gamakas, the essential embellishments, are an integral part of Karnatic music. Synthesizing any form of Karnatic music therefore requires proper modelling and synthesis of the different gamakas associated with each note. We propose a spectral model that efficiently synthesizes gamakas for Karnatic bamboo flute music from the note, duration and gamaka information. We model three components of the flute sound, namely the pitch contour, the harmonic weights and the time-domain amplitude envelope, and represent each of them parametrically using cubic splines. Subjective evaluation shows that the proposed method outperforms existing spectral methods in terms of both the tonal quality and the aesthetic quality of gamaka rendition. Hypothesis tests show that the observed improvements over the other methods are statistically significant at the 95% confidence level.

1. Introduction

The flute is one of the earliest musical instruments used by humans (Conard et al., Citation2009). In the South Indian music tradition, known as Karnatic music, side-blown bamboo flutes are used for solo performances and also as an accompaniment in vocal music concerts and in related dance performances.

Karnatic flutes are made from bamboo and contain eight finger holes (tone holes) and one embouchure hole (Ramamurthy & Raghavan, Citation2013). Sound is produced when the air jet from the player’s mouth hits the edges of the embouchure hole and excites the air column inside the cylindrical body of the flute (Helmholtz, Citation1954). The effective length of the air column can be adjusted by means of opening and closing the tone holes. This changes the resonance frequency, thereby generating different notes in the octave (Benade, Citation1990). Higher or lower octaves of the same note can be generated by increasing or decreasing the blowing pressure, respectively (Helmholtz, Citation1954).

1.1. Motivation

Synthesizing Karnatic music on the bamboo flute is a challenging problem, mainly for two reasons. The first is the complex nature of the flute tone, in which the relative strengths of the different harmonics vary over time. The second is the continuously varying pitch contour of Karnatic music. Gamakas, the pitch bends used as essential ornamentations, are one of the defining characteristics of Karnatic music, and modelling and synthesis become even more challenging in the presence of such continuously varying pitch bends.

1.2. Related works

Two major approaches towards the synthesis of flute tones are discussed in the literature—physical models and spectral models. While the physical models are based on the sound production mechanism of the flute, the spectral models rely on the perception of sound by the human ear (Smith, Citation1991).

1.2.1. Physical models

Physical models for flute date back to the early 1980s, where the model comprised an energy source, an energetically active non-linear element and an energetically passive linear element. The non-linear element was used to model the air jet, and the linear element represented the bore of the flute. A delay is incorporated in the non-linear element to account for the dependence of the frequency of sound on the blowing pressure (McIntyre et al., Citation1983). Later on, this model was improved by adding the dispersive effect of finger holes on the air column inside the flute bore.

A real-time implementation of the flute model was proposed, in which filters were used to model the reflection, dissipation and losses inside the flute body. The effects of over-blowing and vibrato were also modelled (Välimäki et al., Citation1992). A transmission line model for the transverse flute was developed, which modelled six finger holes. However, the partial closing of the holes was not modelled (Keefe, Citation1990).

Later on, a digital waveguide model was proposed, which modelled only the first two or three open finger holes (Välimäki et al., Citation1993). Later, the system was extended by modelling 15 finger holes (Välimäki et al., Citation1996).

A distributed tone hole model using the digital wave guides was also proposed, which modelled the open, partially open and closed tone holes in real time (Scavone & Cook, Citation1998). The model was again improved by modelling dispersion and dissipation inside the bore, keypad noise, vibrato and tremolo (Ystad & Voinier, Citation2001).

A filter design-based approach was proposed to synthesize Indian bamboo flute tones (Ramamurthy & Raghavan, Citation2013). For each group of notes, the spectra of individual notes were combined to generate a composite spectrum, and the coefficients were found out. Attack and decay portions were not modelled.

1.2.2. Spectral models

Spectral models date back to the late 1960s. A 700-ms-long flute tone (along with other wind instrument tones) was generated by means of the spectral analysis method (Strong & Clark, Citation1967) using the weighted sum of 30 sinusoids. Later on, another additive synthesis method called Spectral Model Synthesis (SMS) based on overlap-add method was developed, which modelled the spectrum as the sum of deterministic and stochastic components (Serra & Smith, Citation1990; Serra et al., Citation1997). Time domain envelope was not explicitly modelled.

The SMS model was later improved by adding a provision to model transients in addition to sinusoids and noise (Verma & Meng, Citation2000). The basic sinusoidal model was improved by modifying the pitch and harmonic magnitudes (Suyun & Yibiao, Citation2016): instead of using the spectral peaks directly, the amplitude values in the gaps between the peaks were estimated by fitting a cubic spline through all the peaks. Another improved version of SMS incorporated the noise part into the sinusoidal part itself (Kreutzer et al., Citation2008), with the harmonic amplitude envelope for the entire note modelled by a sixth-degree polynomial.

A contiguous group synthesis approach for synthesizing Chinese flutes was developed based on the grouping of harmonics (Horner & Ayers, Citation1998). Amplitude envelope for group 1 was designed using line segment approximation, and the powers of this envelope were used for the other groups. Later on, this method was modified for synthesizing trills (Ayers, Citation2003) and tremolos (Ayers, Citation2004). Frequency of the trill was modelled as a line segment approximation of average frequency contour shape of trills. An amplitude envelope using line segments was also proposed. This was again improved by creating a timbre database and adding provisions for changing the attack and decay rates (Rocamora et al., Citation2009).

Different Chinese flutes were modelled using additive synthesis, making use of around 30 harmonic components (Ayers, Citation2005). In addition to trills and tremolos, vibrato was also modelled. A harmonic band wavelet transform-based method was developed to model the breath noise of a flute sound as pseudo-periodic 1/f-like noise (Polotti & Evangelista, Citation2001, Citation2007). Synthesis of Andean quena tones was also performed based on this method (Dïaz & Mendes, Citation2015).

1.2.3. Neural audio synthesis

Neural audio synthesis is the latest addition to instrument music synthesis, in which neural networks are used to generate music. A neural-network-based auto-regressive generative model called WaveNet was proposed for raw audio generation (van den Oord et al., Citation2016). Conditional parameters were used to modify different characteristics of the generated audio. A data set containing 200 h of recordings was used to train the system, which performed well for generating musical notes with a duration of around 1 s. The WaveNet model was later refined by modifying the auto-encoder to learn the temporal structure of audio without external conditioning; a data set of roughly 300,000 samples was used to train the model (Engel et al., Citation2015). WaveNet was again improved by adding a language model and a transcription model (Hawthorne et al., Citation2018), with training data consisting of 172 h of recordings.

A raw audio synthesis method called WaveGAN and a spectrogram-based synthesis method called SpecGAN were developed by introducing Generative Adversarial Networks (GANs) into the audio synthesis arena for the first time (Donahue et al., Citation2018). Also, 25 min of piano recordings and 40 min of drum recordings were used to train the models. Later on, a log magnitude spectrogram synthesis using GAN was proposed, which used more than 300,000 piano notes for training (Engel et al., Citation2019). An instantaneous frequency spectrum was modelled additionally to increase the coherence of generated audio. TiFGAN-M and TiFGAN-MTF were the two models developed by incorporating time-frequency parameters into GAN-based synthesis (Marafioti et al., Citation2019). Time-frequency direction derivatives of unwrapped phase were modelled to improve the performance. MelGAN, an autoregressive fully convolutional framework, was introduced to generate raw audio at a higher processing speed (Kumar et al., Citation2019). The model was capable of handling the phase artefacts by properly selecting the kernel size and stride without needing any additional models.

1.2.4. Related works on Karnatic flute music synthesis

A method for modelling and synthesizing gamaka was proposed for Karnatic music (Karaikudi Subramanian). Automatic addition of gamakas for some popular songs was implemented. For other songs, the user needed to manually specify the constituent notes and their time durations involved in each gamaka. Synthesis was done using contiguous group synthesis.

Another work synthesized gamakas in Karnatic flute music using a modified harmonics plus noise model (Ashtamoorthy et al., Citation2018). Gamakas were approximated using combinations of decaying and increasing exponential functions. Only three types of gamakas were synthesized. A common amplitude envelope having predefined attack, sustain and decay characteristics was designed for all the notes.

Most of the methods discussed in the literature were developed for synthesizing isolated notes; synthesizing complete songs with automatic addition of ornamentations was rarely achieved. Moreover, very few works focused on the special features of Karnatic music. Most of the methods approximate the pitch contour as a discrete set of constant-pitch segments, which violates the essential characteristic of Karnatic music: the continuously varying pitch contour. Automatic modelling and synthesis of gamakas are discussed in only one paper (Ashtamoorthy et al., Citation2018), which addresses only three gamakas, and only as a coarse approximation of their actual shapes. Such an approximation is not sufficient to model the other gamakas used in Karnatic music.

1.2.5. Main contribution

In our work, we model the spectral parameters of the flute sound as continuous functions of time. These parameters are the pitch contour, the weights of the harmonics and the time-domain amplitude envelope. The rest of the paper is organized as follows. We give a brief introduction to Karnatic music and the bamboo flute tone in Section 2. The spectral model used for synthesis is detailed in Section 3. The experiments and the analysis of the spectral properties of the results are discussed in Section 4. Details of the subjective quality evaluation are given in Section 5, and the paper is concluded in Section 6.

2. Karnatic music on South Indian bamboo flute

Most of the energy present in a bamboo flute tone is contributed by the fundamental frequency and its harmonics. Figure 1 shows the spectrogram of a single note played on a South Indian bamboo flute. The dominance of the harmonic components is evident from the spectrogram, in which a noise-like energy is also present. Thus, a bamboo flute tone can be decomposed into several harmonically related sinusoids plus coloured noise.

Figure 1. Spectrogram of a single note played on South Indian bamboo flute. The image is magnified to show the individual harmonics in the spectrogram.

2.1. Pitch contour

When multiple notes are played on the bamboo flute in a single blow, the resulting tones have a continuous pitch contour. To demonstrate this, we compare the spectrograms of note sequences produced on a bamboo flute and on a piano. Figure 2 shows the spectrograms of Karnatic music played on a South Indian bamboo flute and of western musical notes played on a piano. From the spectrogram of the piano sample, it can be clearly seen that the frequency contours of individual notes overlap in the transition region: the sinusoidal components of the previous note fade out slowly after the beginning of the next note. In the flute spectrogram, the note transitions are continuous, and this smooth transition can be seen across the different harmonics.

Figure 2. Spectrograms for Karnatic music played on bamboo flute and western music played on piano.

Karnatic music is different from western music in many ways. One of the major differences lies in the use of gamaka, which can be thought of as a bend or inflection in the pitch contour of a note. These can occur in the transition region between two notes or during the course of a single note itself. Due to the extensive use of gamakas, the pitch contour of a note in Karnatic flute music may fluctuate most of the time. The term Swara is used in Karnatic music to describe a note and its inherent fluctuations together. Such a continuous pitch contour is approximated using discrete segments in frame-based synthesis methods. Even though the perceived difference can be made very low by the use of smaller windows, this method still deviates from the fundamental concept of continuous pitch contour.

Figure 3 shows the pitch contours of two different gamakas, with the regions corresponding to different notes labelled. Each of these pitch contours corresponds to a single Swara, even though the pitch transitions clearly show the presence of more than one note in each of them. Traditionally, these additional, but essential, notes are omitted when writing the notation of the song. For example, Figure 3 shows the pitch contour shape of the gamaka called Sphuritham: the pitch contour starts from one note, goes down to the lower note, then jumps back up to the starting note and settles there. The entire pitch bend is traditionally treated as one single Swara. Another gamaka, named Vali, is also shown in Figure 3. Here, while going from one note to another (“Note #1” to “Note #3”), the pitch contour touches an intermediate note (“Note #2”). This intermediate note and the need to connect the two adjacent notes are omitted from the musical notation.

Figure 3. Pitch contours for two different gamakas played on flute.

2.2. Spectral weights of harmonics

The spectral weights of the different harmonics are not the same for all notes played on the flute. The relative weights of the harmonics with respect to the first harmonic also differ from note to note. For example, Figure 4 shows the variation of the relative weights of the second and third harmonics for five different notes of the Karnatic rāga Mohanam.

Figure 4. Variation of relative weights for the first two harmonics over five different notes. Weight of first harmonic is normalized at 0 dB.

When moving from one note to another, not only do the pitch and harmonic frequencies change but also their respective weights. For a signal with continuously changing frequencies, the weights in the note transition regions as well as in the gamaka regions need to be interpolated for an accurate representation.

2.3. Time-domain envelope

Another important characteristic of a flute tone is its time-domain amplitude envelope, which can be split into three regions, namely attack, sustain and decay. The way in which the waveform evolves into its final shape differs for different pitch contours. Figure 5 shows the amplitude envelopes for different types of flute tones. It can be observed that the amplitude envelope of a plain note played alone is different from the amplitude envelope of the same note when it is played as part of a sequence or a gamaka.

Figure 5. Amplitude envelopes for four different note sequences.

2.4. Wind noise

Wind noise is another important component of the flute sound. While the harmonic part of the signal is generated by the sustained oscillations inside the bore, the wind noise is produced by the turbulent streaming of the air as it passes through a narrow opening (Serra et al., Citation1997). In addition to the harmonic components, noise-like energy can also be seen in the spectrogram shown in Figure 1. This noise-like energy also needs to be modelled to add naturalness to the flute tone. For analysis, the wind noise for a note is recorded by blowing into the flute without creating resonance while maintaining the same finger positions used to generate the actual tone of the note. Figure 6 shows the spectrograms of the wind noise recorded for the notes dha and ga in the middle octave.

Figure 6. Spectrograms of the wind noise for two notes in the same octave.

From the spectrograms, it can be seen that the noise spectra contain dominant peaks located very close to the fundamental frequency of the actual note. A similar trend is observed for the other notes, indicating that the noise is different for each note. This requires the noise to be modelled separately for each note for an accurate reconstruction of the note’s tone. At the same time, the spectrograms of the noise signals from different octaves are almost the same, with the spectral peaks appearing at almost the same locations. From this, it can be deduced that the noise signal is different for each note but independent of the octave position (similar properties of the breath noise have been reported for Chinese flutes (Ayers, Citation2005)). Hence, we consider modelling the noise for every note from any one octave to be sufficient to represent the wind noise.

3. Methodology

Our goal is to generate flute music from the song notations. The inputs to our system are note labels, durations and gamakas associated with each note present in the song. We model the pitch contour, harmonic amplitudes, time-domain envelope and wind noise for each of the notes for this task. Synthesis is based on the modified sinusoids plus noise model (Serra et al., Citation1997).

3.1. Sinusoids plus noise model

This is a frame-by-frame, analysis-by-synthesis method for modelling the sound produced by a physical system. The spectrum of the sound is approximated as a sum of sinusoids plus filtered white noise. In the analysis phase, parameters such as the number of sinusoids and the time-varying phase and spectral weight of each sinusoid are estimated for every frame from the spectrum of the original signal. The weighted sum of these sinusoids is subtracted from the original signal to obtain the residual signal. By spectral fitting of this residual, the impulse response of the noise filter is obtained, and white noise is filtered using this filter to obtain the noise part of the signal. Adding the sinusoidal part and the noise part together gives the final synthesized signal for the corresponding frame. For every frame, the synthesized signal is expressed as follows:

(1) \( \hat{s}(t) = \sum_{k=0}^{\hat{K}} \hat{H}_k(t)\cos\left(2\pi \hat{f}_k t\right) + \hat{n}(t), \)

where \(\hat{K}\) is the estimated number of sinusoids, \(\hat{H}_k(t)\) and \(\hat{f}_k\) are the amplitude and frequency of the kth sinusoid, and \(\hat{n}(t)\) is the noise part of the frame. By repeating this process for all the frames and performing overlap-add on them, the final synthesized signal is obtained.
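As an illustration of Equation (1), the following Python sketch synthesizes one frame as a weighted sum of sinusoids plus a noise part. The function name and the example values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def synthesize_frame(harmonic_amps, harmonic_freqs, noise, fs=32000):
    """Per-frame sinusoids-plus-noise synthesis, cf. Equation (1).

    harmonic_amps  : (K, N) array, time-varying amplitude of each sinusoid
    harmonic_freqs : (K,) array, frequency in Hz of each sinusoid in this frame
    noise          : (N,) array, residual/noise part of the frame
    """
    K, N = harmonic_amps.shape
    t = np.arange(N) / fs
    frame = np.zeros(N)
    for k in range(K):
        frame += harmonic_amps[k] * np.cos(2 * np.pi * harmonic_freqs[k] * t)
    return frame + noise

# Illustrative 20 ms frame: three harmonics of 440 Hz plus weak white noise.
fs = 32000
N = int(0.02 * fs)
amps = np.vstack([0.8 * np.ones(N), 0.3 * np.ones(N), 0.1 * np.ones(N)])
freqs = np.array([440.0, 880.0, 1320.0])
frame = synthesize_frame(amps, freqs, 0.01 * np.random.randn(N), fs)
```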

3.2. Shortcomings of sinusoids plus noise model

Karnatic music is characterized by the continuous nature of its pitch contour. If we directly implement the sinusoids plus noise model in a frame-based approach, this continuity cannot be achieved: frame-based synthesis and overlap-addition would provide only a discrete approximation of the actual pitch contour. Hence, instead of frame-based synthesis, we synthesize the entire frequency contour of each note at a stretch. We also parameterize the pitch contour, \(\hat{f}_0(t)\), using cubic splines, which makes time and frequency scaling much easier. The frequency contours of the other sinusoidal components are generated as integer multiples of \(\hat{f}_0(t)\), as given by

(2) \( \hat{f}_k(t) = k\,\hat{f}_0(t). \)

The term \(\hat{H}_k(t)\) in the SNM (Sinusoids plus Noise Model) consists of two components. One component accounts for the different spectral weights of the harmonics of different notes. The second component accounts for the time-domain amplitude of the signal \(\hat{s}(t)\). Hence, \(\hat{H}_k(t)\) depends not only on the frequency-domain weights of the different notes but also on the attack, sustain and decay of the time-domain waveform. As depicted in Figure 5, the time-domain amplitude envelopes of the same notes vary under different conditions: the waveform of the same note evolves differently in the time domain depending on whether that note is played as a plain note or as part of a transition/gamaka. Since it is complicated to model these two components of \(\hat{H}_k(t)\) together, we split it into two components and model them separately.

We represent the first component as \(\hat{a}_k(t)\), which expresses only the spectral weights of the different sinusoidal components without considering the time-domain amplitude envelope. The second component, \(\hat{E}(t)\), is the time-domain amplitude envelope of the signal, which takes the attack, sustain and decay into consideration. This component plays an important role in defining the timbre of different notes. Here also, we use a parametric representation based on cubic splines to make the time scaling of these contours easier. The synthesized signal for each note is given as follows:

(3) \( \hat{s}(t) = \hat{E}(t)\sum_{k=0}^{\hat{K}} \hat{a}_k(t)\cos\left(\hat{\phi}_k(t)\right) + \hat{n}(t), \)

where \(\hat{\phi}_k(t)\) is the time-varying phase contour of the kth sinusoid, obtained by integrating \(\hat{f}_k(t)\). A simplified block diagram of the whole process is given in Figure 7.
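The following Python sketch illustrates how a note could be synthesized according to Equation (3) with spline-parameterized contours, obtaining the phase of each harmonic by numerically integrating \(k\hat{f}_0(t)\). All names, spline knots and values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def synthesize_note(f0_spline, weight_splines, env_spline, duration, fs=32000, noise=None):
    """Equation (3)-style synthesis: E(t) * sum_k a_k(t) cos(phi_k(t)) + n(t),
    with phi_k(t) obtained by integrating k * f0(t)."""
    t = np.arange(int(duration * fs)) / fs
    f0 = f0_spline(t)                           # continuous pitch contour in Hz
    phase0 = 2 * np.pi * np.cumsum(f0) / fs     # numerical integral of f0
    env = env_spline(t)
    harmonic = np.zeros_like(t)
    for k, a_spline in enumerate(weight_splines, start=1):
        harmonic += a_spline(t) * np.cos(k * phase0)
    out = env * harmonic
    if noise is not None:
        out += noise[:len(t)]
    return out

# Illustrative usage: a 0.5 s glide from 440 Hz to 495 Hz with two harmonics.
dur = 0.5
f0_spline = CubicSpline([0.0, 0.25, 0.5], [440.0, 465.0, 495.0], bc_type='clamped')
env_spline = CubicSpline([0.0, 0.05, 0.45, 0.5], [0.0, 1.0, 1.0, 0.0], bc_type='clamped')
weights = [CubicSpline([0.0, 0.25, 0.5], [1.0, 1.0, 1.0]),
           CubicSpline([0.0, 0.25, 0.5], [0.4, 0.35, 0.3])]
y = synthesize_note(f0_spline, weights, env_spline, dur)
```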

Figure 7. Block diagram of the proposed system.

3.3. Model parameters

The model parameters are the spectral weights of the different notes, the pitch-contour and time-domain amplitude-envelope shapes for the different types of gamakas/transitions, and the noise waveforms for the different notes. We consider 10 different notes, 8 different gamakas and 2 types of non-gamaka note transitions in this work; the list of these notes, gamakas and transitions is given in Section 4. These data are divided into 11 subclasses for computing the pitch and amplitude envelopes: all the plain notes fall into one subclass, while the 8 gamakas and 2 transitions form the other 10 subclasses.

For each subclass, we estimate the pitch contours and amplitude envelopes of all the elements and find a representative pitch contour and amplitude envelope for that subclass. The representative shapes are then parameterized using cubic splines, and the parametric forms are stored in the database. The weights of the different harmonics of all 10 notes are also determined by spectral analysis and stored in the database.

3.4. Synthesis of plain notes

From the input, information such as the note label and the duration of the note is extracted. We use sinusoids with frequencies equal to the pitch and its harmonics to synthesize the sinusoidal part. For a plain note, the pitch contour and the spectral weights of the harmonics are constant over the entire duration. Based on the note label, the corresponding pitch value, \(\hat{f}_0\), and the weights of the different harmonics, \(\hat{a}_k\), are retrieved from the database. The phase of the kth harmonic between the time instants \(t_1\) and \(t_2\) is given by

(4) \( \hat{\phi}_k(t) = \int_{t_1}^{t} \hat{\omega}_k \, d\tau + \hat{\phi}_k(t_1) = 2\pi k \hat{f}_0 (t - t_1) + \hat{\phi}_k(t_1), \)

where \(\hat{\phi}_k(t_1)\) is the initial phase of the kth harmonic. The initial phase at time \(t_1\) is added to ensure that the phase is continuous at the note boundaries.
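A minimal sketch of this phase-continuity step, assuming a constant pitch for the plain note; the helper name and arguments are hypothetical.

```python
import numpy as np

def constant_pitch_phase(f0, k, t, phase_at_start):
    """Phase contour of harmonic k for a constant-pitch segment, cf. Equation (4):
    phi_k(t) = 2*pi*k*f0*(t - t1) + phi_k(t1), so the segment starts exactly at the
    phase left behind by the previous segment."""
    return 2 * np.pi * k * f0 * (t - t[0]) + phase_at_start
```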

To generate the time-domain amplitude envelope \(\hat{E}(t)\), the parameterized representative shape corresponding to the envelope of a plain note is selected from the database and time-scaled to match the desired note duration. Abrupt jumps at the note boundaries are avoided by making the envelope and its first derivative continuous at the end points. The sinusoidal part is synthesized as given by

(5) \( \hat{s}_h(t) = \hat{E}(t)\sum_{k=0}^{\hat{K}} \hat{a}_k(t)\cos\left(\hat{\phi}_k(t)\right). \)

For the noise part, we use the pre-recorded noise signal corresponding to the input note from the database. The noise signal is lengthened or shortened to match the desired input duration. The final synthesis is performed using Equation (3).

3.5. Synthesis of note transitions and Gamaka

In the case of gamakas and other note transitions, the pitch contour is not constant. During a non-gamaka transition, two notes are present in the pitch contour. When an input note containing such a transition is encountered, information such as the starting and ending notes and the duration and type of the transition is extracted. Based on the starting and ending notes, the corresponding note frequencies are retrieved from the database.

The parameterized representative pitch contour shape is also selected from the database, which corresponds to the type of note transition. This pitch contour is then scaled in frequency and time, based on the frequencies of starting and ending notes, and the desired duration of the transition.
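A possible sketch of this scaling step, assuming the stored template is a spline defined on [0, 1] with values in [0, 1]; the function name and sampling rate are illustrative, and the \(f_1\), \(f_2\), \(t_1\), \(t_2\) notation matches the worked example in the next paragraph.

```python
import numpy as np

def scale_contour(norm_contour_spline, f1, f2, t1, t2, fs=32000):
    """Map a normalized pitch-contour template (time and value both in [0, 1])
    onto a transition from f1 Hz to f2 Hz lasting from t1 s to t2 s."""
    n = int(round((t2 - t1) * fs))
    u = np.linspace(0.0, 1.0, n)          # normalized time axis of the template
    shape = norm_contour_spline(u)        # template values in [0, 1]
    return f1 + (f2 - f1) * shape         # scaled pitch contour in Hz
```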

For example, if the starting and ending notes’ frequencies are f1 and f2, and the duration extends from time instants t1 to t2, then the representative pitch contour is scaled such that its frequency varies from f1 to f2 and the time duration spans from t1 to t2. The endpoint slopes of the parametric form are set to zero to enable smooth concatenation with adjacent pitch contours. Frequency contours of other harmonics are obtained by integer multiplication of this scaled pitch contour. Phase contour of kth harmonic between the time instants t1 to t2 is expressed as follows:

(6) \( \hat{\phi}_k(t) = 2\pi k \int_{t_1}^{t} \hat{f}_0(\tau)\, d\tau + \hat{\phi}_k(t_1), \)

where \(\hat{f}_0(t)\) is the pitch contour and \(\hat{\phi}_k(t_1)\) is the initial phase of the kth harmonic at the starting point of the segment being synthesized. The phase correction obtained by adding the term \(\hat{\phi}_k(t_1)\) ensures the continuity of the phase contour at the segment boundaries.

As opposed to the plain-note case, the spectral weights of the different harmonics are not constant during note transitions. To generate the continuously varying spectral weights, we first extract the weights of the different harmonics of all the constituent notes from the database. The regions where each note appears are identified with the help of the pitch contour. For the regions where a note is active, its spectral weights are assigned directly; for the region where the note transition occurs, the weights are obtained by interpolation. We use cubic-spline interpolation to obtain smooth and continuously varying spectral weights, \(\hat{a}_k(t)\). The time-domain amplitude envelope, \(\hat{E}(t)\), is obtained using the same procedure as in the plain-note case. The sinusoidal part of the synthesized signal is obtained by multiplying the weighted sum of the harmonically related sinusoids with the amplitude envelope, as given by Equation (5).
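The sketch below interpolates the per-harmonic weights across a transition region. For brevity it uses a single clamped cubic segment (a smoothstep), which is one simple stand-in for the cubic-spline interpolation described above; the function name and arguments are assumptions.

```python
import numpy as np

def transition_weights(w_start, w_end, t, t_start, t_end):
    """Per-harmonic spectral weights across a transition.

    w_start, w_end : weights of the K harmonics for the outgoing/incoming note
    t              : time axis in seconds
    Returns an array of shape (K, len(t)); zero slope at both ends keeps the
    weights flat where each note is fully active.
    """
    w_start = np.asarray(w_start, dtype=float)
    w_end = np.asarray(w_end, dtype=float)
    u = np.clip((t - t_start) / (t_end - t_start), 0.0, 1.0)
    s = 3 * u**2 - 2 * u**3          # cubic smoothstep: 0 -> 1 with zero end slopes
    return w_start[:, None] + np.outer(w_end - w_start, s)
```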

Since there are two notes present in the pitch contour of a transition, noise signals corresponding to both the notes are to be added at the respective time instants. To do this, we perform a windowed addition of the duration-modified noise waveforms.

The synthesis of a gamaka is performed in a similar fashion. The only difference between a gamaka and a note transition lies in the number of notes involved: a gamaka may contain more than two notes, whereas a note transition is a pitch change between only two notes. Adding the sinusoidal part and the noise part gives the final synthesized signal, as given by Equation (3).

4. Experiment

In our experiments, we consider three pitch classes, namely plain notes, gamakas and non-gamaka transitions. Eight gamakas and two non-gamaka transitions are considered in this work, and each of them is treated as a subclass. We synthesize songs belonging to a popular Karnatic pentatonic rāga called Mōhanam, which contains the notes Sa, Ri, Ga, Pa and Dha in an octave.

4.1. Database creation

We use Karnatic bamboo flutes in three different scales (F-scale, C-scale and D-scale) to create the sound files used in our experiments. We select 10 popular songs from the Karnatic rāga Mōhanam played by a professional flautist. Eight different types of gamakas, 2 types of non-gamaka transitions and 10 plain notes belonging to three octaves are present in the recorded samples. We manually segment each of these to create the database. The database contains around 3,000 notes, with 1,700 plain notes, 600 gamakas and around 1,300 non-gamaka transitions. The number of samples per gamaka varies from 28 to 82.

The pitch contour of every sound file belonging to the class “plain notes” is extracted using PRAAT (Boersma & Weenink, Citation2001). The median pitch value is found for each note and stored as the pitch value, \(\hat{f}_0\), of that note. The pitch values for the notes played on the F-scale bamboo flute used in our experiments are listed in Table 1. The letter denotes the note's name, and a dot below or above the letter denotes that the note is in the lower or higher octave, respectively.
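PRAAT can also be scripted from Python through the parselmouth package; the snippet below is one possible way to reproduce this step (the file name and the default analysis settings are assumptions).

```python
import numpy as np
import parselmouth  # Python interface to PRAAT

def median_pitch(wav_path):
    """Extract the pitch contour with PRAAT and return the median over voiced frames."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()                      # default PRAAT autocorrelation settings
    f0 = pitch.selected_array['frequency']      # 0 Hz marks unvoiced frames
    return float(np.median(f0[f0 > 0]))

# print(median_pitch("sa_middle_octave.wav"))   # hypothetical file name
```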

Table 1. Different notes and their corresponding pitch values (F—scale bamboo flute)

The pitch contours of all the sound samples belonging to each gamaka/transition are also extracted using PRAAT. For each subclass, the pitch contours of the individual sounds are re-sampled to a standard size to compensate for the differences in their durations. We then select the pitch contour that best represents the subclass, namely the one with the minimum distance to all the other pitch contours belonging to that subclass. Figure 8 shows the frequency- and duration-normalized pitch contours belonging to the gamaka named Ētra Jāru. The median pitch contour and its spline approximation are also shown in Figure 8.
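A rough sketch of the representative-contour selection, assuming resampling by linear interpolation and a Euclidean distance; the resample length and function name are illustrative.

```python
import numpy as np

def representative_contour(contours, n_points=200):
    """Resample every pitch contour of a subclass to a common length, then return
    the one whose total distance to all the others is smallest."""
    resampled = []
    for c in contours:
        x_old = np.linspace(0.0, 1.0, len(c))
        x_new = np.linspace(0.0, 1.0, n_points)
        resampled.append(np.interp(x_new, x_old, c))
    R = np.vstack(resampled)
    total_dist = np.array([np.sum(np.linalg.norm(R - r, axis=1)) for r in R])
    return R[np.argmin(total_dist)]
```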

Figure 8. Time- and frequency-normalized pitch contours, their median pitch contour and its spline approximation for the gamaka named Ētra Jāru.

The representative pitch contours of all the subclasses are normalized in time and frequency such that the time and frequency variations are limited to the range 0 to 1. Such a generalized shape can be scaled in time and frequency to match the desired duration and pitch. These normalized pitch contours are then parameterized using cubic-spline modelling; we use 50 cubic splines to parameterize each pitch contour. The resulting coefficients are stored in the database as the representative pitch-contour shape of each gamaka/transition. Thus, 500 coefficients are stored in the database, corresponding to the 10 different subclasses. The normalized pitch-contour shapes of the different gamaka and non-gamaka transitions used in this experiment are shown in Figures 9 and 10, respectively.
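The exact spline parameterization is not spelled out here, so the following sketch simply stores 50 uniformly spaced control values per contour and rebuilds a cubic spline from them; treat the choice of control points as an assumption.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def spline_parameterize(norm_contour, n_ctrl=50):
    """Reduce a normalized contour to n_ctrl stored values and return a cubic
    spline that reconstructs it on the [0, 1] time axis."""
    x_old = np.linspace(0.0, 1.0, len(norm_contour))
    x_ctrl = np.linspace(0.0, 1.0, n_ctrl)
    ctrl = np.interp(x_ctrl, x_old, norm_contour)   # the 50 coefficients to store
    return ctrl, CubicSpline(x_ctrl, ctrl, bc_type='clamped')
```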

Figure 9. Normalized pitch contours of different gamakas used in this work.

Figure 10. Normalized pitch contours of different non-gamaka transitions used in this work.

We repeat the same procedure to create a database of representative shapes for the time-domain amplitude envelopes. For this, we consider all 10 subclasses along with the plain notes. As explained before, for each subclass and for the plain notes, the time-domain amplitude envelopes of all the sound files are extracted. The representative time-domain amplitude envelope is found by re-sampling the envelopes of each subclass and selecting the one with the minimum distance to all the others, as in the case of the pitch contour. The time- and amplitude-normalized representative amplitude envelopes are parameterized using cubic splines, and the coefficients are stored in the database. A total of 550 coefficients are stored in the database to represent the amplitude envelopes of the 10 subclasses and the plain notes.

For calculating the spectral weights of the different notes, we consider only the plain notes. In the first step, the effect of the amplitude envelope is nullified by dividing the note waveform by the corresponding amplitude envelope. Spectral analysis is then performed on the resultant signal to find the weight, \(a_k\), of each harmonic. We consider only the first 10 harmonics in our experiments, since the magnitudes of the higher harmonics are very small for the samples in our database. Thus, 10 spectral weights are extracted for each of the 10 plain notes and stored in the database.
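One possible way to carry out this step is sketched below: the envelope is divided out, and the magnitude spectrum is read near each harmonic of the note's pitch. The window choice, peak picking and normalization are assumptions.

```python
import numpy as np

def harmonic_weights(note, envelope, f0, fs=32000, n_harmonics=10):
    """Estimate the spectral weights of the first n_harmonics of a plain note."""
    flat = note / np.maximum(envelope, 1e-8)               # nullify the amplitude envelope
    spectrum = np.abs(np.fft.rfft(flat * np.hanning(len(flat))))
    freqs = np.fft.rfftfreq(len(flat), 1.0 / fs)
    weights = np.array([spectrum[np.argmin(np.abs(freqs - k * f0))]
                        for k in range(1, n_harmonics + 1)])
    return weights / weights[0]                             # relative to the first harmonic
```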

We record the wind noise using the same F-scale flute, by blowing into the flute without creating resonance. As explained in Section 2.4, the octave difference does not affect the wind noise characteristics significantly. Hence, we use recorded wind noise corresponding to the notes in the middle octave only. Since there are five notes in one octave for the rāga used in our experiments, we use the noise waveforms corresponding to these five notes. Each of the noise waveforms has a duration of 8 s.

4.2. Synthesis of plain notes

The input to our system is a text file containing information such as the note label, the note duration in terms of the number of beat cycles and the gamaka/transition information. If there is no gamaka associated with a note, that note is assumed to be plain. If there is no gamaka between two notes, a non-gamaka transition is inserted between them at the time of synthesis. An example of the input text file is shown in Table 2.

Table 2. An excerpt from input text file

After parsing the input file, the frequency of every note is retrieved from the database. If the note does not contain a gamaka/transition, a constant pitch contour at this frequency is generated for the duration specified in the input. The frequency contours of the different harmonics are generated by integer multiplication of this pitch contour. Since all the frequency contours are constant in time, the corresponding phase contours can be obtained by multiplication with time.

The harmonic weights corresponding to the input note are fetched from the database, and the weighted addition of sinusoids is performed to synthesize the composite signal. Harmonic weights for a note are assumed to be constant for the entire duration of that note.

The amplitude envelope corresponding to the plain note is selected from the database and time-modified to match the duration of the note. We discretize all these parameters by evaluating the cubic splines at each sample point; in our experiments, we use a sampling frequency of 32 kHz. The final synthesis is performed as given by Equation (5). The waveforms of the synthesized note with and without the amplitude envelope are shown in Figure 11.

Figure 11. Waveform of synthesized note Sa with and without time-domain amplitude envelope.

The noise waveform corresponding to the input note is chosen from the database and time-modified to match the input duration. This noise is then modulated with the amplitude envelope and added to the harmonic part to obtain the final synthesized signal, as given by Equation (3). Figure 12 shows the spectrogram of the synthesized plain note.

Figure 12. Spectrogram of synthesized plain note Sa.

4.3. Synthesis of non-gamaka transitions

If two adjacent notes are different and there is no gamaka associated with the second note, a non-gamaka transition is inserted between the two notes. After parsing the input, every pair of notes is checked for the presence of a gamaka between them. If no gamaka is present in the second note, their pitch frequencies are compared to decide on the transition to be added: if the second note is higher in frequency than the first, an upward transition is added; if the second note's frequency is lower, a downward transition is added.

Once the type of transition is finalized, its parametric form is selected from the database. The parametric form is stored in the database such that its time and frequency vary between 0 and 1. It is scaled in time to match the duration specified at the input and also scaled in frequency to generate a transition pitch contour between the first note and the second note. Integer multiples of this pitch contour give the frequency contours of other harmonics.

The frequency contours of the individual harmonics are integrated with respect to time to obtain the phase contours. Since cubic splines are used for the parametric representation of the pitch contour, the result of the integration can be obtained in closed form. The constant of integration is selected such that the phase varies smoothly across the note boundaries. The phase contours of the transition are appended smoothly to the phase contours of the previous note to generate a continuous phase contour for the entire input sequence. Spectrograms of the note boundary with and without phase correction are shown in Figure 13.
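Because each piece of a cubic spline is a polynomial, its integral is available in closed form; with SciPy this can be expressed through the spline's antiderivative, as in the sketch below (the helper name and arguments are illustrative).

```python
import numpy as np
from scipy.interpolate import CubicSpline

def harmonic_phase(f0_spline, t, k, phase_at_start):
    """Closed-form phase of harmonic k from a spline pitch contour, cf. Equation (6):
    phi_k(t) = 2*pi*k * integral_{t1}^{t} f0(tau) dtau + phi_k(t1)."""
    F0 = f0_spline.antiderivative()             # exact piecewise-polynomial integral
    return 2 * np.pi * k * (F0(t) - F0(t[0])) + phase_at_start
```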

Figure 13. Spectrograms of synthesized transition with and without imposing phase continuity.

The spectral weights of the two notes on either side of the transition are selected from the database. In the transition region, the spectral weights are found by interpolating the spectral weights of the two notes, with the help of the pitch contour of the transition region: the exact locations of the beginning and end of the transition are found from the pitch contour, and a spline interpolation is performed to find the spectral weights during the transition. The pitch contour for the transition from the note Sa to the note Pa and the corresponding spectral weights of the first three harmonics are shown in Figure 14.

Figure 14. Pitch contour and corresponding spectral weights of first three harmonics for an upward transition from the note Sa to the note Pa.

As in the case of plain note synthesis, parametric form of the time-domain amplitude envelope for the desired transition is selected from the database. It is scaled in time to match the duration of the actual transition. Smooth concatenation with the previous note’s amplitude envelope is also performed before using the envelope to modulate the weighted sum of the sinusoids.

To generate the noise part, we introduce an activation function for each of the notes. The activation function of a note takes the value 1 for the entire duration during which that note is active, and 0 whenever the note is not present. The time instants at which a note is present or absent are located with the help of the pitch contour.

For example, the activation functions of the different notes for the transition from the note Sa to the note Pa are shown in Figure 15. As can be seen from the pitch contour shown in Figure 14, when only Sa is present in the pitch contour, only the activation function corresponding to Sa is at one for that entire duration; the same holds for the note Pa. In the transition region, the activation functions of Sa and Pa are obtained by spline interpolation. The activation functions of all the other eight notes are zero for the entire duration.

Figure 15. Activation functions for the notes corresponding to the upward transition from Sa to Pa.

The noise waveforms for all the notes are selected from the database and are time scaled to match the input duration. They are multiplied with the corresponding activation functions to generate the active noise waveforms for a particular duration. All the active noise waveforms are added together to generate the noise part of the synthesized signal, as given by

(7) \( \hat{n}(t) = \sum_{i=1}^{10} A_i(t)\, n_i(t), \)

where \(A_i(t)\) is the activation function and \(n_i(t)\) is the duration-modified noise waveform corresponding to the ith note in the database. Adding the sinusoidal part and the envelope-modulated noise part together gives the final synthesized signal, as given in Equation (3).
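A small sketch of Equation (7) together with one way to build the activation functions for a single transition; the smoothstep interpolation over the transition region and the function names are assumptions.

```python
import numpy as np

def transition_activations(n_samples, prev_idx, next_idx, trans_start, trans_end, n_notes=10):
    """Activation functions A_i(t) for one transition: the outgoing note fades out,
    the incoming note fades in, and all other notes stay at zero."""
    idx = np.arange(n_samples)
    u = np.clip((idx - trans_start) / max(trans_end - trans_start, 1), 0.0, 1.0)
    s = 3 * u**2 - 2 * u**3                       # smooth fade over the transition region
    A = np.zeros((n_notes, n_samples))
    A[prev_idx] = 1.0 - s
    A[next_idx] = s
    return A

def mix_noise(A, noise_bank):
    """Equation (7): n_hat(t) = sum_i A_i(t) * n_i(t); noise_bank has shape (n_notes, n_samples)."""
    return np.sum(A * noise_bank, axis=0)
```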

4.4. Synthesis of Gamaka

Gamakas are synthesized in the same manner as the transitions. Some gamakas may contain more than two constituent notes in their pitch contour; in such cases, the spectral weights of all the constituent notes are interpolated to find the spectral weights of the gamaka region, and the noise activation functions of more than two notes take non-zero values during the course of the gamaka. The pitch contour, interpolated spectral weights, noise activation functions, synthesized signal and spectrum for such a gamaka are shown in Figure 16.

Figure 16. Pitch contour, synthesised waveform, spectrogram, interpolated harmonic weights and noise activation function for a three note gamaka.

5. Subjective quality evaluation

We conduct listening tests to evaluate the quality of the synthesized music using the Mean Opinion Score (Series, Citation2019). Two different subjective quality assessments are conducted. The first analyses the impact of the different steps involved in the proposed method on the tonal quality of the synthesized flute tone. In the second experiment, we compare our synthesis method with existing spectral synthesis methods and analyse two parameters of the synthesized music: the tonal quality and the aesthetic quality of gamaka rendition. For both experiments, we select 23 subjects who are either trained Karnatic flautists or Karnatic vocalists.

For the first listening experiment, we choose four different excerpts from different songs and generate five different stimuli from each of these excerpts. The first stimulus is the tone synthesized using the proposed method without the amplitude envelope and the wind noise. The second stimulus is the synthesized flute music with the amplitude envelope but without the wind noise. The third stimulus is the flute music synthesized using the complete model, including the harmonics, amplitude envelope and noise. The fourth stimulus is the original recording from a real flute, and the fifth is generated by passing the original recording through a high-pass filter (HPF) with a cut-off frequency of 3.5 kHz.

Each listener is asked to listen to the five different versions of all four audio files and to rate the tonal quality of each. Tonal quality is a measure of how well the tone resembles an actual flute tone. A five-point scale is used, with 1 corresponding to the worst quality and 5 to the best. Only those responses that rated the original flute recording as the best and the HPF version as the worst are considered for the evaluation. The acceptable responses for all the audio files are compiled, and the mean ratings are computed. The results are summarized in Table 3.

Table 3. Mean Opinion Score for assessment I

The results show that modelling the amplitude envelope and adding the noise help to improve the quality of the synthesized tone. To test the statistical significance of these improvements, we perform Student's paired t-test; the results show that the improvements are significant at the 95% confidence level.
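For reference, a paired t-test of this kind can be run in a few lines with SciPy; the listener scores below are invented for illustration and are not the paper's data.

```python
from scipy.stats import ttest_rel

# Hypothetical per-listener MOS values for two conditions of the same excerpts,
# e.g. "with amplitude envelope" vs. "with envelope and wind noise".
scores_without_noise = [3.8, 4.0, 3.5, 4.2, 3.9, 3.6]
scores_with_noise    = [4.2, 4.4, 3.9, 4.5, 4.1, 4.0]

t_stat, p_value = ttest_rel(scores_with_noise, scores_without_noise)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # significant at the 95% level if p < 0.05
```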

For the second listening experiment, we use excerpts from five different songs, each 15 s long. Each of these five excerpts is synthesized using Gāyaka (Karaikudi Subramanian), basic SMS (Serra et al., Citation1997) and the proposed method. In addition, we record the original flute music for each of these five excerpts. We also include a low-pass filtered version of these five files as the fifth stimulus.

Each listener is asked to listen to the five different versions of all five audio files and to rate two parameters of each: the tonal quality and the aesthetic quality of gamaka rendition, both on a 5-point scale. Tonal quality is a measure of how well the tone resembles an actual flute tone, and the aesthetic quality of gamaka rendition is a measure of the correctness of the gamaka synthesis from the viewpoint of the listener. The acceptable responses for all the audio files are compiled, and the mean ratings are computed. The results are summarized in Table 4.

Table 4. Mean Opinion Score for assessment II

Both the tonal quality and the correctness of the rendered gamakas improve with the proposed method when compared with the other two methods. In this experiment too, the statistical test results show that the improvements are significant at the 95% confidence level.

6. Conclusion and future work

In this work, we propose a method to synthesize plain notes and gamakas for Karnatic flute music by extending the sinusoidal model. Towards this, we model three important aspects of the flute sound: the frequency contours, the weights of the different harmonics and the time-domain amplitude envelope. All of these are modelled as continuous functions of time without using the overlap-add method.

We analyse different recordings to find a representative shape for the pitch contour and time-domain amplitude envelope of each plain note, transition and gamaka. We represent them in parametric form by means of cubic splines, so as to facilitate the time and frequency scaling needed to match the input pitch and durations. The progression of the harmonic weights with respect to time is also modelled using cubic splines. Synthesis is performed by a weighted addition of harmonically related sinusoids, which is then modulated by a time-domain envelope. We use pre-recorded wind noise corresponding to each of the notes and modulate it using a noise activation function before adding it to the weighted sum of harmonics.

The Mean Opinion Scores obtained from the subjective evaluation suggest that modelling the time-domain amplitude and adding the wind noise improve the tonal quality. Another evaluation is conducted to compare the performance of the proposed approach with existing Karnatic flute synthesis methods. The results suggest that the proposed method is better in terms of tonal quality as well as the correctness of the rendered gamakas. Hypothesis tests performed on the subjective evaluation results show that the observed improvements are statistically significant at the 95% confidence level.

In this work, the authors analyse gamakas in a popular pentatonic Karnatic rāga Mōhanam. We hope that analysing some major rāgas that are rich in gamakas will yield better modelling of gamakas and will improve the quality of synthesised music. Another problem that needs to be addressed is the change of gamaka shapes with respect to the change in tempo. In this work, the authors have considered only three tempi. But when the speed is doubled or halved, gamakas are not scaled uniformly (Karaikudi Subramanian et al., Citation2011; Viraraghavan et al., Citation2019). For a more robust gamaka synthesis system, this factor also needs to be taken into consideration.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

The data used in this study were generated with the help of a professional flautist and can be made available by the corresponding author on reasonable request.

References

  • Ashtamoorthy, A., Prasad, P., Dhar, S., & Vijayasenan, D. (2018). Frequency contour modeling to synthesize natural flute renditions for Carnatic music. Proceedings of the 2018 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India (pp. 172–20). IEEE.
  • Ayers, L. (2003). Synthesizing trills for the Chinese dizi. Proceedings of the International Computer Music Conference, ICMC, Bangalore, India.
  • Ayers, L. (2004). Synthesizing timbre tremolos and flutter tonguing on wind instruments. Proceedings of the International Computer Music Conference, ICMC, Miami, Florida, USA. (pp. 390–393).
  • Ayers, L. (2005). Synthesising Chinese flutes using Csound. Organised Sound, 10(1), 37–49. https://doi.org/10.1017/S1355771805000658
  • Benade, A. H. (1990). Fundamentals of musical acoustics. Dover Publications.
  • Boersma, P., & Weenink, D. (2001). PRAAT, a system for doing phonetics by computer. Glot International, 5(9), 341–345. https://cir.nii.ac.jp/crid/1572261550900588928
  • Conard, N., Malina, M., & Münzel, S. (2009, July). New flutes document the earliest musical tradition in Southwestern Germany. Nature, 460(7256), 737–740. https://doi.org/10.1038/nature08169
  • Dïaz, A., & Mendes, R. (2015). Analysis synthesis of the Andean Quena via harmonic Band Wavelet Transform. Proceedings of the Digital Audio Effects (DAFx-15), Trondheim, Norway (pp. 433–437).
  • Donahue, C., McAuley, J., & Puckette, M. (2018). Adversarial audio synthesis. arXiv Preprint arXiv: 1802.04208.
  • Engel, J. et al. (2015). Neural audio synthesis of musical notes with WaveNet autoencoders. Proceedings of the International Conference on Machine Learning, Sydney, Australia (pp. 1068–1077).
  • Engel, J., et al. (2019). Gansynth: Adversarial neural audio synthesis. arXiv Preprint arXiv: 1902.08710.
  • Hawthorne, C., et al. (2018). Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv Preprint arXiv: 1810.12247.
  • Helmholtz, H. L. (1954). On the Sensations of tone, translated by AJ Ellis. Dover Publications.
  • Horner, A., & Ayers, L. (1998). Modeling acoustic wind instruments with contiguous group synthesis. Journal of the Audio Engineering Society, 46(10), 868–879.
  • Karaikudi Subramanian, S. GAAYAKA. https://carnatic2000.tripod.com/gaayaka6.html
  • Karaikudi Subramanian, S., Wyse, L., & McGee, K. (2011). Modeling speed doubling in carnatic music. ICMC. https://lonce.org/Publications/publications/modeling-speed-doubling-in-carnatic-music.pdf
  • Keefe, D. H. (1990). Woodwind air column models. The Journal of the Acoustical Society of America, 88(1), 35–51. https://doi.org/10.1121/1.399911
  • Kreutzer, C., Walker, J., & O’Neill, M. (2008). A parametric model for spectral sound synthesis of musical sounds. Proceedings of the 2008 International Conference on Audio, Language and Image Processing, Shanghai, China (pp. 633–637). IEEE.
  • Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., & Courville, A. C. (2019). Melgan: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32.
  • Marafioti, A. et al. (2019). Adversarial generation of time-frequency features with application in audio synthesis. Proceedings of the International Conference on Machine Learning, Long Beach, California (pp. 4352–4362). PMLR.
  • McIntyre, M. E., Schumacher, R. T., & Woodhouse, J. (1983). On the oscillations of musical instruments. The Journal of the Acoustical Society of America, 74(5), 1325–1345. https://doi.org/10.1121/1.390157
  • Polotti, P., & Evangelista, G. (2001). Analysis and synthesis of pseudo-periodic 1/f-like noise by means of wavelets with Applications to digital audio. EURASIP Journal on Advances in Signal Processing, 2001(1), 584201. https://doi.org/10.1155/S1110865701000129
  • Polotti, P., & Evangelista, G. (2007). Fractal additive synthesis. IEEE Signal Processing Magazine, 24(2), 105–115. https://doi.org/10.1109/MSP.2007.323275
  • Ramamurthy, S., & Raghavan, M. (2013). Filter design for synthesis of musical notes: A multidimensional feature-based approach. Proceedings of the 2013 IEEE International Conference on Signal and Image Processing Applications, Melaka, Malaysia (pp. 106–111).
  • Rocamora, M., Lopez, E., & Jure, L. (2009). Wind instruments synthesis toolbox for generation of music audio signals with labeled partials. Proceedings of the 2009 Brazilian Symposium on Computer Music SBCM09, Recife, Brazil (pp. 2–4).
  • Scavone, G. P., & Cook, P. R. (1998). Real-time computer modeling of woodwind instruments. Proceedings of the International Symposium on Musical Acoustics (ISMA-98), Leavenworth, WA (pp. 197–202).
  • Series, B. S. (2019). Recommendation ITU-R BS. 1284-2, General methods for the subjective assessment of sound quality. ITU-R Recommendation BS.
  • Serra, X., et al. (1997). Musical sound modeling with sinusoids plus noise. Musical Signal Processing, 91–122.
  • Serra, X., & Smith, J. (1990). Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4), 12–24. https://doi.org/10.2307/3680788
  • Smith, J. O., III. (1991). Viewpoints on the history of digital synthesis. Proceedings of the International Computer Music Conference, ICMC, Quebec, Canada (pp. 1–10).
  • Strong, W., & Clark, M. (1967). Synthesis of wind-instrument tones. The Journal of the Acoustical Society of America, 41(1), 39–52. https://doi.org/10.1121/1.1910327
  • Suyun, F., & Yibiao, Y. (2016). Improve music synthesis quality by particular harmonic spectrum interpolation based on sinusoidal model. Proceedings of the 2016 IEEE 13th International Conference on Signal Processing (ICSP), Chengdu, China (pp. 183–186). IEEE.
  • Välimäki, V., Hänninen, R., and Karjalainen, M. (1996). An improved digital waveguide model of flute-implementation Issues. Proceedings of the International Computer Music Conference, Hong Kong (pp. 1–4). Citeseer.
  • Välimäki, V., Karjalainen, M., Jánosy, Z., & Laine, U. K. (1992). A real-time DSP implementation of a flute model. Proceedings of the International Conference Acoustics, Speech Signal Processing (Vol. 2, pp. 249–252). IEEE Computer Society.
  • Välimäki, V., Karjalainen, M., & Laakso, T. I. (1993). Modeling of woodwind bores with finger holes. Proceedings of the International Computer Music Conference, ICMC 1993, Tokyo, Japan (pp. 32–39).
  • van den Oord, A., et al. (2016). Wavenet: A generative model for raw audio. arXiv Preprint arXiv: 1609.03499.
  • Verma, T. S., & Meng, T. H. (2000). Extending spectral modeling synthesis with transient modeling synthesis. Computer Music Journal, 24(2), 47–59. https://doi.org/10.1162/014892600559317
  • Viraraghavan, V., Gavas, R., Murthy, H., & Aravind, R. (2019). Visualizing carnatic music as projectile motion in a uniform gravitational field. Proceedings of the Workshop on Speech, Music and Mind 2019, Quebec, Canada (pp. 31–35).
  • Ystad, S., & Voinier, T. (2001). A virtually real flute. Computer Music Journal, 25(2), 13–24. https://doi.org/10.1162/014892601750302552