
Randomly generated lyrics using mixture models: the poem that Frank Sinatra never sang

Pages 391-404 | Received 12 Aug 2022, Accepted 28 Feb 2023, Published online: 10 Mar 2023

Abstract

In this article, we use a simple method to generate words from a pool of several songs. The approach is based on fitting a finite mixture of geometric probability mass functions to the distribution of the frequencies of the words as they occur in the pool. Classifying the lyrics using the estimated probabilities of belonging to each of the classes yields a way of generating a given number of words. This statistical procedure can be complemented by the human composition of a poem from the generated lyrics. An application with 63 songs of Frank Sinatra illustrates how the methodology was used to compose the new poem Young.

1. Introduction

With the increasing number of algorithms capable of generating meaningful texts on a given topic, the world has already entered an era in which prose and poetry can be composed by some intelligent algorithm. This new dimension brings with it a lot of excitement but imposes new challenges. The so-called Generative Pre-Training algorithms, described in Radford et al. (2018) and Zhang et al. (2019), are nowadays able to compose a text in a desired style in response to a command written in a prompt window. The latest generations of such algorithms, GPT-2 and GPT-3, released by OpenAI in 2019 and 2020, have an impressive ability to understand and generate texts. The generated texts can be mistaken for texts written by a human. In the context of poetry, see the nice article of Köbis and Mossink (2021) and the references therein. Staying within the context of poetry, the focus of this article, the following prompt was given in the playground of the OpenAI website, which allows a user (after creating and verifying an account) to experiment with the GPT-3 Davinci model: 'write a poem on she left me with another man'. Using the default parameters of the model, we obtained the following result:

When the prompt was changed to ‘write a poem on she left me with another man in the style of Frank Sinatra’, we obtained

The generated poems are certainly coherent and their content is consistent with the given commands. However, it might be debatable whether the second poem is actually in the style of Frank Sinatra. To test this assumption, it is necessary to have a plausible statistical model which enables us to discriminate between a poem (or portion thereof) whose style is consistent with that of the songs interpreted by Frank Sinatra and those whose style is not.

In this work, we take a much simpler approach to generating lyrics in the style of a given artist. The simplicity of our approach could be seen as a limitation if we compare the result to those obtained with the GPT-3 Davinci model. In fact, we do not try to generate semantically coherent verses. Instead, we generate words without binding syntax. These isolated words then serve as raw material for a new poem. Thus the same limitation can be seen as room left for human creativity.

The statistical methodology, which we describe in detail in the next section, relies on fitting a suitable model for the frequencies of the word counts. While this approach is not new, we develop our own tools to justify it, at least partially. Modelling the frequencies of the word counts is connected to a well-known estimation problem: the species richness problem. Species richness refers to the (unknown) total number of species in a given population, and the species richness problem is that of estimating this number on the basis of the observed abundances, i.e. the numbers of individuals belonging to the observed species. This problem has been studied extensively in a variety of applications, including biology, ecology, entomology and linguistics; see, e.g. Bunge and Fitzpatrick (1993), Balabdaoui and Kulagina (2020), Chao and Bunge (2002), Chee and Wang (2016), Durot et al. (2015), Good (1953), Good and Toulmin (1956), Norris and Pollock (1998), Orlitsky et al. (2016), and Wang and Lindsay (2005). An example from linguistics is the estimation of the number of words that Shakespeare knew, see Balabdaoui and Kulagina (2020), Chee and Wang (2016), Efron and Thisted (1976), and Spevack (1970). In this problem, words which occurred in Shakespeare's works are considered as species, and a model for the frequency of their counts is fitted to infer the number of words which never occurred. In the current work, we are not preoccupied with estimating vocabulary richness. Instead, we will focus here on classifying lyrics according to the frequency of their occurrences.

The paper is organized as follows. In the following section, we introduce finite mixture models and give their statistical interpretation. In the same section, we focus on multivariate geometric distributions with a conditional independence structure and recall the well-known link, originating in analysis, with complete monotonicity. Based on 63 songs of Frank Sinatra that we gathered from the web, we illustrate informally and formally how such a mixture model can be considered suitable for the lyric counts. In Section 3, we describe the procedure for generating a given number of random words from the pool of lyrics. Then, we present the new poem that we composed based on 175 randomly generated words. The poem was interpreted by two singers (jazz and soprano) to the music of New York, New York. In Section 4, we finish with conclusions and some future research directions.

2. Lyrics counts: finite mixtures of geometric distributions

2.1. Mixture models in a nutshell

Let $\Theta$ be the parameter space, a subset of $\mathbb{R}^d$ for some integer $d \ge 1$. Also, let $\mathcal{X} \subseteq \mathbb{R}^p$, for some integer $p \ge 1$, be a measurable sample space, that is, the space to which the random variable or vector of interest, $X$, belongs. Consider a parametric family $\mathcal{F} = \{f_\theta, \theta \in \Theta\}$, where for $\theta \in \Theta$, $f_\theta$ is a density with respect to a $\sigma$-finite dominating measure defined on $\mathcal{X}$. A mixture density $f$ with components in $\mathcal{F}$ is any density which can be written as $f(x) = \int_\Theta f_\theta(x)\, dG(\theta)$, where $G$ is a distribution function defined on $\Theta$ (assuming that $\Theta$ is also measurable). If $G$ is discrete with finitely many points in its support, then $f$ is said to be the density of a finite mixture and is given by (1) $$f(x) = \sum_{j=1}^{m} \pi_j f_{\theta_j}(x)$$ where $\pi_j \in [0,1]$ for $j \in \{1, \ldots, m\}$ and $\sum_{j=1}^{m} \pi_j = 1$. The integer m is referred to as the mixture order, or complexity, in case the representation in (1) is the most economical, i.e. when the mixing probabilities $\pi_j$, $j = 1, \ldots, m$, are all strictly positive. Excellent reviews on mixture models include Bouveyron et al. (2019); Lindsay (1983a, 1983b, 1989, 1995); McLachlan and Peel (2000); Teicher (1961, 1963); Titterington et al. (1985).

One of the advantages of finite mixture models is that they are easy to interpret. Such a model says that the sample space $\mathcal{X}$ (representing the population of interest) is composed of m classes. To draw an individual or item from the population, a class j is first drawn with probability $\pi_j$, and then an individual is drawn from that class according to the density $f_{\theta_j}$. The only problem is that the information on the exact class from which the individual was sampled is not accessible. Thus the first layer in the sampling scheme is hidden, and conditioning on this first step results in averaging over all the classes. In the next section, we will discuss some of the well-known methods used to learn m, the mixing probabilities $\pi_j$ and the parameters $\theta_j$, and also to assign an observed realization of the random variable or vector $X$ to its (most probable) class.
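To make this two-stage sampling scheme concrete, here is a minimal R sketch (R being the language of the article's supplementary material) that first draws the hidden class and then the observation, and finally evaluates the mixture pmf of (1) for geometric components. All numerical values are illustrative placeholders, not fitted quantities.

```r
## Two-stage sampling from a finite mixture of geometric pmfs.
## The mixing probabilities and component parameters are illustrative only.
set.seed(1)
pi_hat    <- c(0.2, 0.5, 0.3)      # mixing probabilities pi_j (sum to 1)
alpha_hat <- c(0.10, 0.40, 0.85)   # geometric parameters alpha_j of the components

n <- 1000
## Step 1 (hidden layer): draw the class label j with probability pi_j.
class <- sample(seq_along(pi_hat), size = n, replace = TRUE, prob = pi_hat)
## Step 2: draw the observation from the selected geometric component;
## rgeom() is parameterized by the success probability 1 - alpha_j.
x <- rgeom(n, prob = 1 - alpha_hat[class])

## Averaging over the hidden class gives the mixture pmf f of Equation (1).
f_mix <- function(x) colSums(pi_hat * outer(alpha_hat, x, function(a, k) (1 - a) * a^k))
round(f_mix(0:5), 3)
```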

In the following sections, we will focus on the case where $p \ge 2$ and the components of the random vector $X$ are independent given that $X$ belongs to a particular class. This particular structure is commonly known under the name of conditional independence. It means that the mixture density in (1) can be written at $x = (x_1, \ldots, x_p)$ as (2) $$f(x) = \sum_{j=1}^{m} \pi_j \prod_{k=1}^{p} f_{\theta_{j,k}}(x_k)$$ where $\theta_{j,k}$ is a possibly multi-dimensional parameter and $f_{\theta_{j,k}}$ is a (marginal) density with respect to a $\sigma$-finite dominating measure on $\mathbb{R}$. In the context of modelling the frequencies of lyric counts, we will focus next on the case where the marginal densities are those of geometric distributions.

2.2. Mixtures of geometric distributions and complete monotonicity

The probability mass function (pmf) of a random variable $X$ with a geometric distribution of parameter $\alpha \in [0,1)$ is given by $p_\alpha(x) = (1-\alpha)\alpha^x$, $x \in \mathbb{N} = \{0, 1, 2, \ldots\}$. Note that $p_\alpha$ is also the density of the distribution of $X$ with respect to the counting measure on the set of non-negative integers $\mathbb{N}$. The number $1-\alpha$ can be viewed as the success probability if we interpret $\alpha^i$ as the probability of failing $i$ times in a row before obtaining the first success. Consider the class $\mathcal{M}$ of all mixtures of geometrics, that is, the set of all pmf's of the form $p(x) = \int_{[0,1)} p_\alpha(x)\, dG(\alpha)$, where $G$ is a mixing distribution supported on $[0,1]$ such that $G(1) = G(1-) = 1$; that is, $G$ puts no mass at 1. It turns out that the class $\mathcal{M}$ coincides with the class of all completely monotone probability mass functions defined on $\mathbb{N}$, that is, the set of all probability mass functions $p$ on $\mathbb{N}$ satisfying $(-1)^r [\Delta^{(r)} p](i) \ge 0$ for all $r \in \mathbb{N}$ and $i \in \mathbb{N}$, with $[\Delta^{(0)} p](i) = p(i)$, $[\Delta^{(1)} p](i) = p(i+1) - p(i)$ and $[\Delta^{(r+1)} p](i) = [\Delta^{(r)} p](i+1) - [\Delta^{(r)} p](i)$. In other words, $p$ is completely monotone on $\mathbb{N}$ if and only if $(-1)^r \Delta^{(r)} p$ is nonnegative for every order $r \in \mathbb{N}$. This equivalence is a remarkable result from analysis which is also known as the Hausdorff theorem; see Balabdaoui and de Fournas-Labrosse (2020) and Feller (1971). In the work of Balabdaoui and de Fournas-Labrosse (2020), estimation of a completely monotone pmf on $\mathbb{N}$ was considered for the first time. In that article, the (nonparametric) complete monotone least squares estimator (LSE) was defined by orthogonally projecting the empirical estimator $\bar{p}_n(x) = n^{-1} \sum_{j=1}^{n} \mathbb{I}\{X_j = x\}$, $x \in \mathbb{N}$, onto the class of completely monotone integrable sequences. Above, $(X_1, \ldots, X_n)$ is a random sample of size $n$ from the unknown completely monotone pmf. An efficient support reduction algorithm was developed to compute the complete monotone LSE.
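As a quick numerical illustration of the definition above, the following R sketch evaluates the pmf of a small hypothetical mixture of geometrics and checks the sign of its first few finite differences; it is only meant to make the alternating-sign condition tangible, not to reproduce any estimator used in the article.

```r
## Check of complete monotonicity on a (hypothetical) mixture of geometric pmfs:
## the forward differences should satisfy (-1)^r * Delta^r p >= 0 for every order r.
pi_hat    <- c(0.6, 0.4)   # hypothetical mixing weights
alpha_hat <- c(0.3, 0.8)   # hypothetical geometric parameters

x <- 0:30
p <- colSums(pi_hat * outer(alpha_hat, x, function(a, k) (1 - a) * a^k))

for (r in 1:5) {
  d <- p
  for (s in 1:r) d <- diff(d)   # r-th order forward difference on the truncated range
  cat("order", r, ": min of (-1)^r * Delta^r p =", min((-1)^r * d), "\n")
}
## All printed minima should be nonnegative (up to floating-point error).
```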

Now, consider a multivariate integer-valued vector $X = (X_1, \ldots, X_p)$ such that its pmf is given by the following finite mixture of geometric distributions with conditional independence: (3) $$f(x_1, \ldots, x_p) = \sum_{j=1}^{m} \pi_j \prod_{k=1}^{p} (1-\alpha_{j,k})\, \alpha_{j,k}^{x_k}, \qquad (x_1, \ldots, x_p) \in \mathbb{N}^p,$$ for some unknown failure probabilities $\alpha_{j,k}$, $(j,k) \in \{1, \ldots, m\} \times \{1, \ldots, p\}$.

To connect the model in (3) with the frequencies of the counts of words occurring in a large pool of lyrics, we will first assume that a word belongs to one of four categories or sub-species: (1) verbs, (2) nouns, (3) adjectives or adverbs (descriptive words), and (4) other. Thus p = 4. Now, since these sub-species do not overlap, the vector of counts is of one of the following forms: $(X_1, 0, 0, 0)$, $(0, X_2, 0, 0)$, $(0, 0, X_3, 0)$ or $(0, 0, 0, X_4)$, where $X_1$, $X_2$, $X_3$ and $X_4$ are strictly positive integers representing counts of verbs, nouns, adjectives/adverbs and other words, respectively. Thus, when a component is not equal to 0 it has to start at 1, which means that the marginal pmf is that of a truncated geometric distribution. Here, the term truncation describes the fact that the value 0 was cut off.

Thus the pmf to be fitted has the following form: (4) $$f(x_1, x_2, x_3, x_4) = \sum_{j=1}^{m} \pi_j \prod_{k=1}^{4} (1-\alpha_{j,k})\, \alpha_{j,k}^{x_k - 1}\, \mathbb{I}_{\{x_k \ge 1\}}.$$ It is not difficult to see that for $k \in \{1, \ldots, 4\}$ the marginal pmf of $X_k$ is given by the mixture pmf (5) $$p(x_k) = \sum_{j=1}^{m} \pi_j (1-\alpha_{j,k})\, \alpha_{j,k}^{x_k - 1}, \qquad x_k \in \mathbb{N} \setminus \{0\}.$$ This means that, under the assumption of our model, the respective frequencies of the counts in each of the four categories are completely monotone.
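To fix ideas, here is a short R sketch of the truncated marginal pmf in (5) for one fixed category, with hypothetical parameters; it also shows how a single truncated count can be simulated by shifting a geometric draw by one.

```r
## Truncated geometric mixture pmf of (5) for one category k; parameters are hypothetical.
pi_hat    <- c(0.5, 0.3, 0.2)
alpha_hat <- c(0.2, 0.6, 0.9)   # alpha_{j,k} for the fixed category k

p_trunc <- function(x) {
  ## x >= 1; each component is (1 - alpha) * alpha^(x - 1)
  colSums(pi_hat * outer(alpha_hat, x, function(a, xx) (1 - a) * a^(xx - 1)))
}
round(p_trunc(1:8), 3)
sum(p_trunc(1:2000))   # numerically close to 1

## Drawing one count from this truncated mixture: pick a component, then shift by one.
set.seed(2)
j <- sample(length(pi_hat), 1, prob = pi_hat)
1 + rgeom(1, prob = 1 - alpha_hat[j])
```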

In Table 1, we give the observed frequencies of word counts up to 6, broken down by category. For instance, there are exactly 177 distinct verbs which appear exactly once, 138 distinct nouns which appear exactly twice, 9 distinct descriptive words which appear exactly 6 times, etc.

Table 1. Frequency of word counts in Frank Sinatra's lyrics.

2.3. Application to Frank Sinatra's songs

2.3.1. Validating the model

63 songs of Frank Sinatra were gathered from the web with no particular sampling criterion. For each of the roughly 8500 gathered words, we determined the corresponding number of occurrences. This yielded a vector of counts for each of the word categories. It is possible to validate the assumption of complete monotonicity of their distribution using a formal statistical test. As we obtain similar results for all four categories, we use the category 'verbs' to illustrate our methodology. On the left of Figure 1, we show the plot of the frequencies of the verb counts, i.e. the empirical (or sample) estimator of their true pmf, along with the complete monotone estimator. The estimators are quite close, suggesting that the complete monotonicity assumption is valid. In Table 2, we give the support points and corresponding weights of the complete monotone LSE fitted to the verb counts. In the notation of the table, this LSE is given by the function $x \mapsto \sum_{j=1}^{6} \hat{\pi}_j (1-\hat{\alpha}_j)\, \hat{\alpha}_j^{x-1}$, where $x \ge 1$ is the verb count. Here, the number of components in the fitted mixture (six) does not have to be the same as the number one finds for the complete monotone LSE fitted to the counts of the other three categories, nor does it have to be an accurate estimate of the number m in the multidimensional version of the model given in (4). In fact, for nouns and descriptive words we find 4 and 5 components, respectively, while we find again 6 components for the 'other' category.

Figure 1. Left: The solid dots depict the values of the empirical estimator of the counts of verbs occurring in the gathered lyrics of Frank Sinatra. Right: Histogram of the bootstrapped Euclidean distances between the empirical and complete monotone estimators based on the verb counts. The dashed line corresponds to the 95%-quantile of the bootstrapped distances $D_n^{(b)}$, $b = 1, \ldots, 100$, while the solid line corresponds to the observed distance $D_n$. See text for more details.


Table 2. The support points and weights of the complete monotone LSE fitted to the verb counts.

To formalize the conclusion that complete monotonicity is a valid model, we can use the bootstrap test of complete monotonicity constructed by Balabdaoui and de Fournas-Labrosse (2020). The idea is quite simple. Consider, for example, the Euclidean distance between the sample and complete monotone estimators. Call this distance $D_n$ (here n is the number of observed verb counts). Now, consider B random samples from the fitted complete monotone estimator, all of size n. Based on each of the B samples, we can again compute the empirical and complete monotone estimators as well as their Euclidean distance $D_n^{(b)}$, $b = 1, \ldots, B$. If the complete monotonicity assumption is true, then the observed distance $D_n$ should have the same distribution as the random sample $(D_n^{(1)}, \ldots, D_n^{(B)})$. A very common way to test whether this is the case is to compare $D_n$ with $q_{0.95}^{B}$, the empirical 95%-quantile of $(D_n^{(1)}, \ldots, D_n^{(B)})$. The assumption is accepted if $D_n \le q_{0.95}^{B}$, and is rejected otherwise. For the verb counts, the sample size is n = 429. Using the sampling scheme described above with B = 100, we computed the Euclidean distances between the empirical and complete monotone estimators based on the bootstrapped samples obtained from the original data of verb counts. The resulting histogram is shown on the right of Figure 1, along with the value of the observed distance $D_n$ (solid line) and that of $q_{0.95}^{B}$ (dashed line). The inequality $q_{0.95}^{B} > D_n$ allows us to declare the complete monotonicity assumption valid. For reproducibility, the histogram of Figure 1 was obtained by fixing the random seed to 1987. The R code used to obtain the plots of Figure 1 is provided in the supplementary material.
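For readers who wish to experiment, the skeleton below mirrors the bootstrap test just described. It is not the authors' supplementary code: the routine fit_cm_lse() is a hypothetical placeholder for the support reduction algorithm of Balabdaoui and de Fournas-Labrosse (2020), assumed to return the support points and weights of the complete monotone LSE, and counts are taken to live on {0, 1, 2, ...} (counts starting at 1, as for the verbs, can be shifted down by one first). The default seed mirrors the one reported above.

```r
## Mixture-of-geometrics pmf evaluated at x, given support points (alpha) and weights (pi).
cm_pmf <- function(x, fit) {
  colSums(fit$pi * outer(fit$alpha, x, function(a, xx) (1 - a) * a^xx))
}

## Parametric bootstrap test of complete monotonicity.
## fit_cm_lse is a placeholder: a function taking a vector of counts and returning
## a list with components $alpha (support points) and $pi (weights) of the LSE.
bootstrap_cm_test <- function(counts, fit_cm_lse, B = 100, level = 0.95, seed = 1987) {
  set.seed(seed)
  n    <- length(counts)
  xmax <- max(counts)
  emp  <- tabulate(counts + 1, nbins = xmax + 1) / n          # empirical pmf on 0..xmax
  fit0 <- fit_cm_lse(counts)
  D_n  <- sqrt(sum((emp - cm_pmf(0:xmax, fit0))^2))           # observed distance

  D_boot <- replicate(B, {
    ## draw n counts from the fitted complete monotone estimator
    j  <- sample(length(fit0$pi), n, replace = TRUE, prob = fit0$pi)
    xb <- rgeom(n, prob = 1 - fit0$alpha[j])
    mb <- max(xb)
    eb <- tabulate(xb + 1, nbins = mb + 1) / n
    fb <- fit_cm_lse(xb)
    sqrt(sum((eb - cm_pmf(0:mb, fb))^2))
  })
  q <- quantile(D_boot, level)
  list(D_n = D_n, q = q, reject = D_n > q, D_boot = D_boot)
}
```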

It is clear that validating complete monotonicity of the frequency of counts is necessary but not sufficient for the mixture model in (4) to hold. Verifying the validity of the latter would require devising a multivariate version of the complete monotone estimator used above. Still, we consider the validation of the marginal complete monotonicity of the counts a very satisfactory finding and proceed to estimating the number of components m using some of the classical methods based on penalizing the maximum likelihood estimator (MLE). This will also allow us to classify the lyrics using the maximum a posteriori probability (MAP) approach.

2.3.2. Clustering and classification

To estimate the number of components in the mixture model given in (4), one can use one of the classical methods devised for this task, for example the BIC and the ICL. Both methods are based on computing the MLE for each stipulated m. This means that the MLE is the vector whose components are $\hat{\alpha}_{j,k}$, $1 \le j \le m$, $1 \le k \le 4$, and $\hat{\pi}_j$, $1 \le j \le m-1$ (it is enough to estimate the first $m-1$ mixing probabilities since $\pi_m = 1 - \sum_{j=1}^{m-1} \pi_j$), which yields the maximum of the log-likelihood $$\sum_{k=1}^{4} \sum_{x=0}^{n_k - 1} N^k_x \log\Big(\sum_{j=1}^{m} \pi_j \prod_{l=1}^{4} (1-\alpha_{j,l})\, \alpha_{j,k}^{x}\Big)$$ over the relevant space of parameters, where the outer sum runs over the four word categories. Above, $n_k$ is the maximal count of words in the $k$th category, and $N^k_x$ is the observed frequency of count $x$ in this category. We denote by $\hat{\ell}_m$ the maximal value of the log-likelihood. The BIC method looks for the maximum of the function $m \mapsto 2\hat{\ell}_m - (5m-1)\log n =: \mathrm{BIC}_m$ with $n = n_1 + n_2 + n_3 + n_4$. The term $5m-1$ represents the number of parameters in the model, since there are $4m$ failure probabilities $\alpha_{j,k}$ and $m-1$ free mixing probabilities $\pi_j$. The ICL method aims at maximizing $m \mapsto \mathrm{BIC}_m + \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{z}_{ij} \log(\hat{z}_{ij})$ with (6) $$\hat{z}_{ij} = \frac{\hat{\pi}_j \prod_{k=1}^{4} (1-\hat{\alpha}_{j,k})\, \hat{\alpha}_{j,\cdot}^{\,x_i}}{\sum_{r=1}^{m} \hat{\pi}_r \prod_{k=1}^{4} (1-\hat{\alpha}_{r,k})\, \hat{\alpha}_{r,\cdot}^{\,x_i}}$$ where $x_i$ is a count $\ge 1$ and $\hat{\alpha}_{j,\cdot} = \hat{\alpha}_{j,k}$ if the count is that of a word in the $k$th category. The number $\hat{z}_{ij}$ is known as the estimated posterior probability that the datum $x_i$ belongs to the $j$th class. For references on the BIC and ICL (and other clustering methods), we refer for example to Bouveyron et al. (2019), Biernacki, Celeux, and Govaert (2000), McLachlan and Rathnayake (2014), and Vrieze (2012). Both the BIC and ICL methods are shown to yield, under some regularity conditions, consistent estimators of the number of components m. If m is seen as the number of subclasses/clusters in the data, then this step can be referred to as data clustering.
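Assuming that, for each candidate m, the maximized log-likelihood and the matrix of estimated posterior probabilities in (6) have already been obtained (for instance with an EM algorithm, which is not shown here), the selection step itself reduces to a few lines of R. The objects loglik and zhat below are placeholders for those fitted quantities.

```r
## Model selection by BIC and ICL for the geometric mixture model (4).
## loglik: vector with loglik[m] = maximized log-likelihood for m components.
## zhat:   list with zhat[[m]] = n x m matrix of posterior probabilities (Equation (6)).
## n:      total number of observed counts (n1 + n2 + n3 + n4).
select_m <- function(loglik, zhat, n) {
  m_vals <- seq_along(loglik)
  bic <- 2 * loglik - (5 * m_vals - 1) * log(n)                   # 5m - 1 free parameters
  ent <- sapply(zhat, function(z) sum(z * log(pmax(z, 1e-300))))  # sum of z log z (<= 0)
  icl <- bic + ent
  list(m_bic = which.max(bic), m_icl = which.max(icl), bic = bic, icl = icl)
}
```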

Applying the BIC and ICL methods to the lyrics of Frank Sinatra gives different estimates of m: 9 with the BIC and 5 with the ICL. Since some of the classes found by the BIC approach are assigned very small probabilities, we go for the more parsimonious representation obtained by the ICL.

Classification of the count data, which means assigning each datum to one of the five classes, can now be done easily by computing $\max_{j \in \{1, \ldots, 5\}} \hat{z}_{ij}$, where $\hat{z}_{ij}$ was defined in (6), and assigning $x_i$ to the $\hat{j}$th class, where $\hat{j}$ is the index yielding the maximum. It is important to note that classification of the counts gives a way of classifying the words themselves; a short code sketch of this mapping is given after the toy example below. To give a toy example, imagine that we have the following pool of 20 words: bee, I, always, can, be, flower, always, bee, bee, fly, fly, fly, fly, great, it, it, always, it, flower, flower. In this example, we have (the counts are in parentheses)

  • Verbs: be (1), can (1), fly (4)

  • Nouns: bee (3), flower (3)

  • Adjective/Adverb (descriptive): great (1), always (3)

  • Other: I (1), it (3).

Suppose some classifier of the counts yields 3 classes and that upon classification of the counts we find:

  • Verbs and descriptive words with 1 count belong to class #1

  • Nouns, descriptive and ‘other’ words with 3 counts belong to class #2

  • Verbs with 4 counts and ‘other’ words with 1 count belong to class #3.

Then, this classification implies that

  • Class #1 ={be,can,great}

  • Class #2 ={bee,flower,always,it}

  • Class #3 ={fly,I}.
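A minimal R sketch of this count-to-word mapping on the toy example reads as follows; the matrix of posterior probabilities is invented solely to reproduce the toy assignment and is not the output of any fitted model.

```r
## MAP classification of the toy counts; words inherit the class of their own count.
words <- data.frame(
  word     = c("be", "can", "fly", "bee", "flower", "great", "always", "I", "it"),
  category = c("verb", "verb", "verb", "noun", "noun", "descr", "descr", "other", "other"),
  count    = c(1, 1, 4, 3, 3, 1, 3, 1, 3),
  stringsAsFactors = FALSE
)

## One row of (invented) posterior probabilities zhat_ij per observed count.
zhat <- rbind(
  c(0.7, 0.2, 0.1),  # be,     verb,  count 1 -> class 1
  c(0.7, 0.2, 0.1),  # can,    verb,  count 1 -> class 1
  c(0.1, 0.2, 0.7),  # fly,    verb,  count 4 -> class 3
  c(0.2, 0.6, 0.2),  # bee,    noun,  count 3 -> class 2
  c(0.2, 0.6, 0.2),  # flower, noun,  count 3 -> class 2
  c(0.8, 0.1, 0.1),  # great,  descr, count 1 -> class 1
  c(0.1, 0.8, 0.1),  # always, descr, count 3 -> class 2
  c(0.2, 0.3, 0.5),  # I,      other, count 1 -> class 3
  c(0.1, 0.7, 0.2)   # it,     other, count 3 -> class 2
)

words$class <- max.col(zhat)    # MAP assignment: class with the largest zhat_ij
split(words$word, words$class)  # Class #1: be, can, great; #2: bee, flower, always, it; #3: fly, I
```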

When classifying the counts of the lyrics of Frank Sinatra, it is very interesting to see that one is able to identify a common theme in each of the five classes found by the ICL method. In the following, we present our attempt to give labels to each one of the classes.

  1. Love and romance: e.g. love, heart, heaven, troubles

  2. You and me and the world around: e.g. us, well, yours, Valentines, New York

  3. Life of everyday: e.g. birds, sky, stars, dreams, girl, rainbow, rain, movie, museum, music, honeymoon

  4. Things in process: e.g. forget, might, remind, thinking, tumble, waiting, chase

  5. Time and place: e.g. suddenly, never, always, lately, high, there, everywhere.

The estimated contributions of the five classes to the overall population of lyrics, i.e. $\hat{\pi}_j$, $j = 1, \ldots, 5$, are found to be equal to 1.8%, 9.1%, 50.9%, 26.2% and 12%, respectively.

Now that all the words are classified, one can randomly draw a given number of them from the big pool. We settled on sampling 175 words, as this is roughly the total number of words in the lyrics of the song Theme from New York, New York, also known simply as New York, New York. The lyrics were written by Fred Ebb and the music was composed by John Kander. The original version of the song, performed by Liza Minnelli in 1977 in the movie New York, New York, was subject to a few modifications when first sung by Frank Sinatra in 1978. It is commonly accepted that Sinatra's several interpretations of the song contributed to its great fame.

In the next section, we describe how the sampling of the 175 (new) random words is done. We also give details about our protocol for writing the poem, entitled Young, based on the generated words. In the pool of the 63 Frank Sinatra songs that we gathered from the web, the length of the lyrics varies between 66 and 253 words. Our choice to generate 175 words was mainly driven by the number of words in the song New York, New York.

3. The poem Young

3.1. Sampling from the classes

Recall that $\hat{\pi}_j$, $j = 1, \ldots, 5$, denote the estimated contributions (mixing proportions) of the five classes found above. First, we need to sample from each class a number of words such that these numbers add up exactly to the desired total N = 175. This can be done by drawing a 5-dimensional vector $(N_1, \ldots, N_5)$ from a multinomial distribution with parameters N and vector of probabilities $(\hat{\pi}_1, \ldots, \hat{\pi}_5)$. The next step is to randomly sample $N_k$ words from the kth class. Recall that the $N_k$ sampled words may come with repetitions, for each class $k \in \{1, \ldots, 5\}$. To sample the $N_k$ words, we first compute the empirical frequencies of the words composing the kth class. Suppose that the kth class contains $M_k$ distinct words word$_1$, ..., word$_{M_k}$, and let $\hat{\theta}_1, \ldots, \hat{\theta}_{M_k}$ be their empirical frequencies. Then, we draw a random vector $(B_1, \ldots, B_{M_k})$ from a multinomial distribution with parameters $N_k$ and vector of probabilities $(\hat{\theta}_1, \ldots, \hat{\theta}_{M_k})$. This means that we obtain word$_1$ $B_1$ times, ..., word$_{M_k}$ $B_{M_k}$ times. We proceed in this way for all $k \in \{1, \ldots, 5\}$.
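The following R sketch implements this two-stage multinomial sampling. The class proportions are those reported at the end of Section 2, but the class contents (the words and their empirical frequencies) are small hypothetical stand-ins for the real classes.

```r
## Two-stage sampling of N = 175 words from five classes.
set.seed(3)
N      <- 175
pi_hat <- c(0.018, 0.091, 0.509, 0.262, 0.120)   # estimated class proportions

## Hypothetical classes: named vectors of empirical word frequencies within each class.
classes <- list(
  c(love = 0.5, heart = 0.3, heaven = 0.2),
  c(us = 0.4, yours = 0.3, well = 0.3),
  c(stars = 0.25, dreams = 0.25, rain = 0.25, music = 0.25),
  c(forget = 0.4, waiting = 0.3, chase = 0.3),
  c(never = 0.5, always = 0.3, suddenly = 0.2)
)

## Step 1: split N across the five classes, (N_1, ..., N_5) ~ Multinomial(N, pi_hat).
Nk <- as.vector(rmultinom(1, size = N, prob = pi_hat))

## Step 2: within class k, draw N_k words according to the empirical word frequencies.
drawn <- unlist(lapply(seq_along(classes), function(k) {
  Bk <- as.vector(rmultinom(1, size = Nk[k], prob = classes[[k]]))
  rep(names(classes[[k]]), times = Bk)
}))

length(drawn)                                 # 175 words in total
head(sort(table(drawn), decreasing = TRUE))   # the most repeated sampled words
```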

At the end of the sampling procedure, we obtain N = 175 words, with repetitions that are more or less faithful to the intrinsic structure of the songs interpreted by Frank Sinatra. As mentioned above, the generated words are not (yet) in the form of structured lyrics. Human creativity is now needed to assemble them into a meaningful poem. In a sense, the randomly generated words can be regarded as raw material for a new song. Here is a subset of the generated words:

3.2. Protocol

Before composing a poem based on the generated words, the following rules were set:

  • The structure of the poem should be close to that of the lyrics of New York, New York (as sung by Frank Sinatra). This implies in particular that the syllabic beats and structures of the stanzas should be close.

  • The original number of words (175) should not be augmented by more than 10%.

  • At least 90% of the generated words should be used. In other words, we are allowed to suppress only 17 of the 175 generated words.

  • Words can be replaced by their synonyms in case a synonym fits better the rhythm. For instance, ‘start’ is allowed to be replaced by ‘begin’.

  • Verbs in the infinitive should be conjugated. They can also be given in the gerund form.

It is clear that there are many ways to write a poem based on the generated words while respecting the above protocol. It is also clear that, given our generating mechanism, it is very hard to obtain words that can serve as a refrain. In our case, the main starting point was to notice the presence of the antonyms young and old in the set of generated words. The storyline became immediately clear: she left me for a younger man. The first version of the poem, which we entitled Young, was subject to several iterations and modifications consisting of shuffling some of the words or giving up some of the generated words to allow the inclusion of others not in the original set. It might seem unfortunate that we have forced our poem to be singable to the music of the New York, New York song. However, one can say that this proves that it is possible to compose lyrics that are meaningful and able to reflect the style of a given singer. A sophomore jazz student, Maxine Vulliet, and a soprano singer, Hélène Walter, each agreed to interpret the new poem. Each of the singers was accompanied by a pianist: Maxine by Leandro Irarragorri and Hélène by Professor Hans Adolfsen. Although the interpretations were absolutely stunning, they cannot be made public, as we failed to obtain the copyright for the New York, New York song. Currently, a jazz band at ETH is working on composing a new melody. In the following section, we present the text of the poem.

3.3. The poem

4. Some conclusions

In this work, we have presented a method which can be implemented to generate random lyrics from a large pool of songs. Compared to other existing approaches which are able to compose a coherent and meaningful text based on a simple prompt, our method is certainly much less impressive. In fact, it only enables us to generate random words according to the 'right distribution' of the frequencies of words as they occur in the texts sung by some artist. However, one could see this limitation as a great opportunity to use one's creativity in putting together the necessary syntax. The additional degree of freedom given to the poet in composing a new text based on the generated words may even be desired in certain situations and preferred to a very quick and perhaps less deep outcome. Our method also rests on statistically justifying the model chosen, among many, for the distribution of the frequencies of the word counts. There is certainly much room for improvement, and we plan to explore some meaningful ways to combine existing techniques with our simple approach with the aim of creating new poems in a desired style.

Acknowledgments

Both authors owe special thanks to Maxine Vuillet, Leandro Irarragorri, Hélène Walter and Professor Hans Adolfsen for having believed in our project and for accepting to embark with us on this adventure.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Balabdaoui, F., & de Fournas-Labrosse, G. (2020). Least squares estimation of a completely monotone PMF: From analysis to statistics. Journal of Statistical Planning and Inference, 204, 55–71. https://doi.org/10.1016/j.jspi.2019.04.006
  • Balabdaoui, F., & Kulagina, Y. (2020). Completely monotone distributions: Mixing, approximation and estimation of number of species. Computational Statistics and Data Analysis, 150, 107014–107026. https://doi.org/10.1016/j.csda.2020.107014
  • Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725. https://doi.org/10.1109/34.865189
  • Bouveyron, C., Celeux, G., Murphy, T. B., & Raftery, A. E. (2019). Model-based clustering and classification for data science: with applications in R. Cambridge University Press.
  • Bunge, J., & Fitzpatrick, M. (1993). Estimating the number of species: A review. Journal of the American Statistical Association, 88(421), 364–371. https://doi.org/10.2307/2290733
  • Chao, A., & Bunge, J. (2002). Estimating the number of species in a stochastic abundance model. Biometrics, 58(3), 531–539. https://doi.org/10.1111/j.0006-341X.2002.00531.x
  • Chee, C.-S., & Wang, Y. (2016). Nonparametric estimation of species richness using discrete k-Monotone distributions. Computational Statistics and Data Analysis, 93, 107–118. https://doi.org/10.1016/j.csda.2014.10.021
  • Durot, C., Huet, S., Koladjo, F., & Robin, S. (2015). Nonparametric species richness estimation under convexity constraint. Environmetrics, 26(7), 502–513. https://doi.org/10.1002/env.v26.7
  • Efron, B., & Thisted, R. A. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447. https://doi.org/10.2307/2335721
  • Feller, W. (1971). An introduction to probability theory and its applications (Vol. II, 2nd ed.). John Wiley & Sons.
  • Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4), 237–264. https://doi.org/10.1093/biomet/40.3-4.237
  • Good, I. J., & Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1-2), 45–63. https://doi.org/10.1093/biomet/43.1-2.45
  • Köbis, N., & Mossink, L. D. (2021). Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior, 114, 106553. https://doi.org/10.1016/j.chb.2020.106553
  • Lindsay, B. G. (1983a). The geometry of mixture likelihoods: A general theory. The Annals of Statistics, 11(1), 86–94. https://doi.org/10.1214/aos/1176346059
  • Lindsay, B. G. (1983b). The geometry of mixture likelihoods: The exponential family. The Annals of Statistics, 11(3), 783–792. https://doi.org/10.1214/aos/1176346245
  • Lindsay, B. G. (1989). Moment matrices: Applications in mixtures. The Annals of Statistics, 17(2), 722–740. https://doi.org/10.1214/aos/1176347138
  • Lindsay, B. G. (1995). Mixture models: theory, geometry, and applications. Institute of Mathematical Statistics.
  • McLachlan, G., & Peel, D. (2000). Finite mixture models, Wiley Series in Probability and Statistics: Applied Probability and Statistics. John Wiley & Sons.
  • McLachlan, G. J., & Rathnayake, S. (2014). On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355. https://doi.org/10.1002/widm.1135
  • Norris, J. L., & Pollock, K. H. (1998). Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environmental and Ecological Statistics, 5(4), 391–402. https://doi.org/10.1023/A:1009659922745
  • Orlitsky, A., Suresh, A. T., & Wu, Y. (2016). Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47), 13283–13288. https://doi.org/10.1073/pnas.1607774113
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training, OpenAI.
  • Spevack, M. (1970). A complete and systematic concordance to the works of Shakespeare. Journal of Aesthetics and Art Criticism, 29(2), 279–280. https://doi.org/10.2307/428622
  • Teicher, H. (1961). Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1), 244–248. https://doi.org/10.1214/aoms/1177705155
  • Teicher, H. (1963). Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34(4), 1265–1269. https://doi.org/10.1214/aoms/1177703862
  • Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons.
  • Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127
  • Wang, J. Z., & Lindsay, B. G. (2005). A penalized nonparametric maximum likelihood approach to species richness estimation. Journal of the American Statistical Association, 100(471), 942–959. https://doi.org/10.1198/016214504000002005
  • Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Gao, J., Liu, J., & Dolan, B. (2019). DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.