Notes:
DeepMind, Google subsidiary focused on Machine Learning -> https://deepmind.com/
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Synthesizing speech = making babble that sounds like/mimics speaking. "It has a certain drama". Rawness.
"No pre-conceived idea of speech" - (raw) frequency synthesis
hiss, pop. Sending the body moving. Cues about context of the speaker, situating the body. Ernst. Warm-bloodedness of relational rhythms, relations to the body. Training, dressage in machine learning (Lefebvre).
Material entanglement of sender and receiver
Questions, comments:
Given the media archaeological framework, other historical examples come to mind, such as von Kempelen's speaking machine (or others, like Faber's) that tried to model the human speech apparatus, including modelling the larynx or using bellows for lungs, etc., and in parallel how artistic practices have worked with these ideas (Kurt Schwitters's Ursonate). What is new here? Or, is it mainly a critique of the cold gaze/ear of Ernst? What do we know differently about acoustic knowledge?
I'd also like to know more about how this is a mode of listening, and how ears emit sounds (cf. the work of Jacob Kirkegaard) and the deep listening (ecology) of Pauline Oliveros.
Listening to something cannot just be a matter of source + receiver—it is a material entanglement of these two together. - Not just these two, also time in terms of temporality (or history) plays a role. Maybe there are even more 'actors' we could think of.
How do you define time? (Rhythm, Chronology, ...)
"Time is materially independent" - Maybe I misheard this, or am taking this out of context, but I dont think that is completely true, time definitely has an affect on material. As a trace, in meaning, as a context.
and in relation to this, the temporality that Ernst unfolds, ahistoricity of media, and so on.
Does WaveNet cause us to ask: can the speaker themselves - or the amalgam of speakers - be broken down into parts, "genetic" code of their voice, and recomposed? A new intricate alphabet of the body's voice. Or is 'the body' what is lost anyway, despite its granular reproduction? Is WaveNet an example of the "physical quality" being separated from the physical itself?
WaveNet is tied to a historical moment in which we have very clean recordings. What about if it was trained on 'noisy recordings'... *should it be tied to noisier recordings? Emotional "noise" in particular?
I think there is something questionable in the idea of 'raw' sound that WaveNet works on. Again, the way that the speech is collected depends, as you point out, on the shape of the room, but also on the sample rate of the recording equipment and so on, which makes this 'rawness' actually more like an averaging of the capture of the sounds spread over the hours of recording.
Machines not needing to make sense, but sense can be made from nonsense – non-sensuous perception as the pre-cognitive, affective signal before semantic meaning
But if there is no logic for the human, there might still be a machinic logic that we can't 'sense'
*situated body / situated listening / situated acoustic knowledges (not only materially situated, but also politically/life-story-based situated epistemic "advantages" for knowledge processing?) ((beyond "standpoint theory", actually))
*sensorial aspects: phenomenology could perhaps be diffracted from here? (like alien and/or queer phenomenologists do?)
+Erin Manning
*what interests me so much is how this is brought to class/geographical accents and the subjectivities that might emerge from them / time/space/somatic affections // e.g.: uttering, slang...
Building on the idea of perceptual encoding: does the rhythm not overlap with the semantic spectrum you initially separate it from? And if these two are entangled, how would we go on to engage with this entanglement? These questions have to do with the listener's engagement with the performance of WaveNet.
How is WaveNet scaled up as entangled within practice? Again, on the listener/machine interaction: is there an axis that makes explicit the situation in which they relate to each other, in literal conversation maybe?
The 'character' of the voice that is learned or synthesised from multiple real voices (or rather real moments of speaking)
A machine learning Alvin Lucier I am sitting in a room experiment might be fun to try +1!
"Unpassionate listening" as methodology, is also somehow replicated inside this machine listening.. in this sense, WaveNet would work as a critiique itself to a proposed human methodology... there's an interesting loop here <- would this be something like "uncreative listening"? <- the creative/synthetic split is interesting here, what are the distinctions
Ernst's concept is too narrow, needs a new one that opens up to new materialisms
concept of time is not satisfactory
Geoff: You are not mentioning this reference (http://computationalculture.net/article/algorhythmics-understanding-micro-temporality-in-computational-cultures)... Algorhythmics. + *rhythmic events* (Eleni Ikoniadou?)
Brian: there is an abstractness; enjoying Lefebvre's warm-bloodedness.
Geoff: And what would Ernst say to you? His position being tactical?
Kristoffer: Irony of Ernst: he is a passionate listener himself. Cold gaze and colonial connotations of that gaze. What is the relation of WaveNet to this cold gaze? It remains an ambiguous object of study in the article but seems to be based on a new level of rendering acoustic knowledge transparent, now also including situated spatial and bodily parameters. However, what new aspects does it hide in this process (labour, for example)?
Going beyond acoustic knowledge.
Brian: What is an acoustics of knowledge? Questioning the status of the blank slate.
WaveNet poses a methodological challenge. Training set of 109 individuals: who are they? What was the process with them? It is already entangled.
Political process of outsourced machine labour.
Ethnographic use of phonographs recording indigenous voices, how would WaveNet operate there? -> There is a scale issue here. If phonographic records deal somehow with individual death, this algorithmic archive postprocessing seems to operate on the scale of extinction, species loss, genocide, etc.
Generate new examples from that/disappearing languages? Holographic Tupac or MJ
What is a temporal accent? Time & Language. (i love that question!)
Nicolas: Wondering how much the mathematical model needs to be understood to study this algorithm? How much do you need to entangle with the math?
Soren: machinic unconscious in bodies & machines. Bubbling of something to get out ... machinic unconscious (Felix Guattari). How is sense/nonsense (Alfred North Whitehead/Brian Massumi) made? What we heard was nonsense.
Jara: Finds an interesting potential in combining acoustic knowledge and situatedness of the body. Bring in Donna Haraway, Situated Knowledges? The political situation of accents (and how these reflect in the code itself? It is also written by human 'bodies').
xxxx: How is the granular break-up of the voice a micro-alphabet? A zoomed-in alphabet, genetic sequencing.
Brian: Get into neural networks and look at what features were generated.
Did he really sound Flemish? :-) He did!
There are many relations to how Google works more generally – as a language-based industry. They do not seem to have an interest in what is said, but in reading *how* you speak/write – in the control of the paradigmatic network of text. WaveNet and your analysis is a really important and interesting addition to this.
Machine Listening, to DeepMind, to Deep Listening, like the work of Pauline Oliveros?
The paragraph where you describe the recording of your grandmother's voice is intriguing: how the "sonographic resonance" is touching - but how her stories are equally touching for you (a sort of breakdown of an argument?). It makes me wonder about the relations between the resonances/signals and signs?
Unlike the alphabetic or phonetic notation of speech, this micro-alphabet (as Nathan said) will be vectoral, though I can't quite think why that would be interesting/important
------------------------------------
title: Brian House – Machine Listening
slug: brian-house
id: 84
link: https://machineresearch.wordpress.com/2016/09/26/brian-house/
guid: https://machineresearch.wordpress.com/2016/09/26/brian-house/
status: publish
terms: Uncategorized
---
MACHINE LISTENING: WaveNet and approaching media materialism through rhythmanalysis
"The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser" [1]. With this clever reference lifted from the Internet Movie Database we are introduced to the voice of WaveNet. A "generative model of raw audio waveforms," the WaveNet algorithm is outlined in a paper published just this September by DeepMind, a machine learning subsidiary of Google (van den Oord). It is a significant step forward in the synthesis of human-sounding voices by computers, an endeavor which is both paradigmatic of artificial intelligence research and a mainstay in popular culture, from Hal in the film 2001: A Space Odyssey to current voiced consumer products like Apple's Siri. According to DeepMind's own testing [2], WaveNet outperforms current state of the art text-to-speech systems in subjective quality tests by over 50% when compared to actual human speech—it sounds very good, and no doubt we will be hearing much more of it.
My purpose in this text, however, is not to explore a genealogy of computer speech. Rather, it's about "machine listening." That term comprises both a philosophical question—can machines listen? (and its corollary, what is listening?)—as well as the sub-field of computer science concerned with the extraction of meaningful information from audio data. The timely emergence of WaveNet is compelling on both fronts, and I am going to proceed with the hypothesis that WaveNet is, perhaps more than anything else, a listening machine.
To this end, the second set of examples of synthesized speech provided by DeepMind is the more intriguing. Having been trained to speak, WaveNet nonetheless must be told what to say (hence the IMDb quote, etc). If it isn't told, however, it still generates "speech" that is "a kind of babbling, where real words are interspersed with made-up word-like sounds" (van den Oord) [3]. Listening to these, I'm struck first by the idea that this is the perfect answer to the classic campfire philosophy question, "what is the sound of my native language?" When we understand the words, the sub-semiotic character of a language is, perhaps, obscured. This babbling seems like a familiar tongue, or at least one somewhat related to English—maybe Icelandic? Secondly, to my ear, this set of examples sounds more realistic than the first. I'm hearing ennui in these voices, a measured cadence punctuated by breaths just as expressive as the "words," a performance with the unmistakable hallmarks of an overwrought poetry reading. The Turing test [4] has been mis-designed—it's not the semantics that make this voice a "who" rather than an "it".
The inclusion of aspirations and a more musical sense of timbre, rhythm, and inflection in WaveNet is a function of the acoustic level at which it operates. Previous techniques of text-to-speech, as DeepMind explains, are parametric or concatenative. The former is purely synthetic, attempting to explicitly model the physical characteristics of human voices with oscillators; the latter relies on a database of sound snippets recorded by human speakers that are pieced together to form the desired sentences. Both strategies proceed from assumptions about how speech is organized—for example, they take the phoneme as speech's basic unit rather than sound itself. Where WaveNet differs is that it begins with so-called "raw" audio—that is, unprocessed digital recordings of human speech, to the tune of 44 hours' worth from 109 different speakers (van den Oord). This data is fed into a convolutional, "deep" neural network, an algorithm designed to infer its own higher-order structures from elementary inputs. Subsequently, WaveNet generates speech one audio sample at a time, 22 thousand of which add up to a single second of sound in the form of the digital waveform. An intriguing aspect of the result is that WaveNet models not only the incidental aspects of speech in the training examples, but the very acoustics of the rooms in which they were recorded.
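To make that mechanism concrete, here is a minimal sketch in Python (my own, with random weights standing in for DeepMind's trained network) of the sample-by-sample logic: a stack of dilated causal filters predicts each new sample from the ones before it, and each prediction is fed back in as input.

```python
# A toy WaveNet-style generator, not DeepMind's code: the weights are random
# and untrained, so the output is noise rather than speech. The structure is
# the point: dilated causal filtering, then autoregressive feedback.
import numpy as np

SAMPLE_RATE = 22_050          # samples per second, as in the essay
DILATIONS = [1, 2, 4, 8, 16]  # each layer listens farther back in time

rng = np.random.default_rng(0)
# One random 2-tap filter per dilated layer (a real model learns these).
filters = [rng.normal(0, 0.1, size=2) for _ in DILATIONS]

def predict_next(history: np.ndarray) -> float:
    """Predict the next audio sample from past samples, causally."""
    x = history
    for w, d in zip(filters, DILATIONS):
        # Causal: each output mixes the current value with one d steps back,
        # never anything from the future.
        past = np.concatenate([np.zeros(d), x[:-d]]) if d < len(x) else np.zeros_like(x)
        x = np.tanh(w[0] * x + w[1] * past)
    return float(x[-1])

# Generate a tenth of a second, one sample at a time; each new sample is
# appended and becomes input for the next prediction (autoregression).
audio = rng.normal(0, 0.01, size=32)          # a faint noise "seed"
for _ in range(SAMPLE_RATE // 10):
    audio = np.append(audio, predict_next(audio[-256:]))

print(f"generated {len(audio) - 32} samples; the layer stack spans "
      f"{1 + sum(DILATIONS)} past samples")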
WaveNet's use of raw audio invokes what media theorist Wolfgang Ernst dubs "acoustic knowledge" (Ernst 179). For him, such knowledge is a matter of media rather than cultural interpretation, embodied in the material processes by which sound is recorded on a phonographic disc. As he puts it, "these are physically real (in the sense of indexical) traces of past articulation, sonic signals that differ from the indirect, arbitrary evidence symbolically expressed in literature and musical notation" (Ernst 173). It is the "physically real frequency" (Ernst 173) that matters, the signal over semantics. Ernst makes clear the implications for listening: "Cultural tradition, or the so-called collective memory, does not lead to a reconstruction of the actual sonic eventality; we have to switch our attention to the laws of technological media in order to be in a position to reenact past sound experience" (Ernst 176). Ernst's "media archaeology" is thus concerned with the "event" as a confluence of dynamical processes, albeit one inscribed in material artifacts.
To provide my own example, in a tape recording from the late 1940s of my grandmother speaking, she has a distinct Pennsylvania Dutch accent. This was somewhat of a revelation when I first heard it some 60 years later, having known her as an elderly woman with no such inflection. Her description of those years to me was to some extent limited by its telling—it required machine temporality, rather than human, to reveal the dialect that was inevitably missing from her own narrative. The sonographic resonance was something different than the hermeneutic empathy of her stories. To me, they are equally touching—Ernst would privilege the former.
And yet analog recording media are not without their own acoustic inflections—the hiss and pops of tape or record are an added valence to the sonic events they reproduce. There is a "style" to media, a dialect in this addition. For Ernst, this indicates how the medium is inseparable from the recording. It also mitigates the insinuation that a technical signal, in its physical realness, is somehow objective or unmediated. Rather, material contingencies comprise the character of such listening machines. Further, that a phonograph is an imperfect listener grants it some affective agency; its status as a listener is in fact predicated on having experienced in recording a change that is expressed in playback.
Such is the nature of sound. As Brandon Labelle puts it, "Sound is intrinsically and unignorably relational: it emanates, propagates, communicates, vibrates, and agitates; it leaves a body and enters others; it binds and unhinges, harmonizes and traumatizes; it send the body moving" (Labelle ix). Sound leaves an impression. How we experience it and how we respond to it with our own particular bodies is conditioned by both physiology and past experience that marks us as listeners, whether non-biological or of a race, class, culture, species. Listening to something cannot just be a matter of source + receiver—it is a material entanglement of these two together.
From this perspective, Ernst's fascination with technical apparatuses is unnecessarily circumscribed. In the effort to assert acoustic knowledge over symbolic meaning, he sidesteps the material nature of human listening. It's revealing when he writes that "Instead of applying musicological hermeneutics, the media archaeologist suppresses the passion to hallucinate 'life' when he listens to recorded voices" (Ernst 60). Such a call for "unpassioned listening" (Ernst 25) might be an attempt at empathizing with the machines, but it is at odds with the interrelationality of listening and oddly replays the detached ocularity—the cold gaze—of colonial naturalism. Can we ask instead whether there are physical processes of which that "life" so characteristic of human listening is comprised?
This leads us to the history of research on human perception of which artificial intelligence like WaveNet is progeny. Jonathan Sterne recounts how, beginning in the 1920s, institutes like Bell Labs realized that "Sound was not simply something out there in the world that human and animal ears happened to encounter and faithfully reproduce; nor were human ears solipsistically creating sound through the simple fact of hearing" (Sterne 98). Instead, "Hearing was itself a medium and could therefore be understood in terms analogous to the media that were being built to address it" (Sterne 99). This demonstrates the perspective of Ernst, and Friedrich Kittler before him, that the invention of that media—the phonograph—predetermined such a revelation. Regardless, the cochlea of the human ear and its psychoacoustic properties made possible what Sterne calls "perceptual coding" (Sterne 2) that capitalizes on the difference between human and machine listening. If, depending on conditions, the human can perceive only a fraction of the frequencies audible to the machine, and if the machine is able to digitally encode only those frequencies, there is a surplus of bandwidth that remains. Multiple streams of acoustic data can therefore be processed simultaneously, or, in particular, sent down a telephone line at the same time (all current telephony infrastructure does this). The difference in our listening capacities thus produces a poly-temporal relation.
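The logic is easy to demonstrate. The following toy sketch (my own illustration, not drawn from Sterne) keeps only the classic telephone voice band of a signal's spectrum and counts how much machine bandwidth that frees up:

```python
# A toy illustration of perceptual coding's "surplus bandwidth": if human
# listeners only need a narrow band, the machine can discard the rest and
# carry several voices on one channel. The numbers here are illustrative.
import numpy as np

FS = 96_000                    # the machine samples at 96 kHz
LOW, HIGH = 300.0, 3_400.0     # classic telephone voice band, in Hz

rng = np.random.default_rng(1)
signal = rng.normal(size=FS)   # one second of broadband noise as a stand-in

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(FS, d=1 / FS)

keep = (freqs >= LOW) & (freqs <= HIGH)
coded = np.where(keep, spectrum, 0)    # encode only what the ear will use

fraction = keep.mean()
print(f"the voice band occupies {fraction:.1%} of the machine's bandwidth,")
print(f"leaving room for roughly {int(1 / fraction)} simultaneous voices")
```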
This complicates a simplistic notion of acoustic knowledge as a direct signal. The machine, here, is no less comprised of processes that are physically real, but there exists a material semiotics in the digital encoding performed by its internal processor. Ernst excludes this from cultural symbolism as it operates on a machinic level "below the sensual thresholds of sight and sound—a level that is not directly accessible to human sense because of its sheer electronic and calculating speed" (Ernst 60). But digital logic contains within it an adaptation to human sense that mediates between our differing temporalities. Computers typically sample audio at 44.1kHz—a number chosen to match the standard threshold of human hearing [5], but far below the capacity of contemporary digital processors (such as the 3GHz computer I am typing this on). From the perspective of Sterne's perceptual researchers, that threshold is a sensible choice if one wants to treat hearing as a medium. Already, then, the human body reverberates in the digital acoustic impression.
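The arithmetic here is the textbook Nyquist-Shannon criterion, nothing particular to WaveNet or to Ernst:

```latex
% Sampling at rate f_s captures frequencies up to f_s / 2:
f_s \geq 2 f_{\max}
\quad\Longrightarrow\quad
f_s = 44.1\,\mathrm{kHz} \;\text{captures}\; f_{\max} \approx 22\,\mathrm{kHz}.
% The processor's headroom over that rate is enormous:
\frac{3\,\mathrm{GHz}}{44.1\,\mathrm{kHz}} \approx 68{,}000 \;\text{cycles per audio sample}.
```

Tens of thousands of machine cycles elapse, in other words, within each single "moment" of digital hearing.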
However, we're not much closer to our hallucinations. Sterne dubs the perceptual model "hypodermic" (Sterne 74) in that it assumes hearing is akin to the transmission of a message straight to the cochlea that might as well bypass the body—the audio signal is presumably "decoded" by some cognitive function thereafter. Ernst's divide between technicity and cultural knowledge is, perhaps, similar, stuck within an idea of source + receiver. Consider, though, a problem I've recently come up against—the frame rate of virtual reality systems. For decades, film was made and shown at 24 frames per second. Though much slower than the ear, this rate was similarly determined by a perceptual limit beyond which a sequence of images appears convincingly continuous. But an audience sitting still in a theater looking at a stationary projection is a different story than one moving around with screens glued to their faces. As it turns out, anything less than 60fps in VR is stomach-churning—not only does the gastro-intestinal system then make it into machine rhythms, but it shows how the temporality of human senses is not so easily isolated from its embodied material-cultural situation.
Recent cognitive science research has shed further light on how that might work. "Neural resonance theory," championed by Edward Large, observes (via fMRI) that electrical oscillations between neurons in the brain entrain to the rhythmic stimulation of the body by music or other behaviors. Once adapted, these endogenous oscillations can be maintained independently. Are these not our hallucinations? If Large is correct, the brain's primary purpose might be that of a complex oscillator constantly adapting to its environment, not via some internally coded representation, but as a physical coupling of brain to world via the body. The song that pops into your head, the voice that you recognize, the familiar acoustic quality of a habitual space—these experiences are acoustic knowledge that is not limited to technical inscription by the machine, but which is no less material as it resonates within your own physiology.
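A generic adaptive oscillator (a common abstraction, and emphatically not Large's actual model) shows the behavior in a few lines: driven by an external beat, it re-tunes toward that tempo, and when the beat stops it keeps oscillating at the learned rate, like the song that pops into your head.

```python
# A minimal sketch in the spirit of neural resonance theory: a unit with its
# own preferred tempo entrains to an external beat, and once the beat stops,
# it holds the learned tempo. An illustrative model, not Large's.
import numpy as np

DT = 0.001        # integration step, seconds
F_OWN = 1.8       # oscillator's innate tempo, Hz
F_BEAT = 2.0      # tempo of the external rhythm, Hz
COUPLING = 4.0    # strength of the pull toward the beat
ADAPT = 1.0       # how quickly the innate tempo itself re-tunes

phase, freq = 0.0, F_OWN
for step in range(10_000):                     # 10 simulated seconds
    t = step * DT
    if t < 5.0:                                # the beat plays for 5 seconds
        error = 2 * np.pi * F_BEAT * t - phase
        pull = COUPLING * np.sin(error)
        freq += DT * ADAPT * np.sin(error)     # entrainment re-tunes the tempo
    else:
        pull = 0.0                             # silence: no external input
    phase += DT * (2 * np.pi * freq + pull)

print(f"innate tempo was {F_OWN} Hz; after entrainment it holds {freq:.2f} Hz")
```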
This would not be news to Henri Lefebvre. Ernst's dispassion is contrasted by Lefebvre's warm-bloodedness, in which "the living body has (in general) always been present: a constant reference. The theory of rhythms is founded on the experience and knowledge of the body; the concepts derive from this consciousness and this knowledge, simultaneously banal and full of surprises" (Lefebvre 67). Rhythm, here, might be compared to acoustic knowledge as it is a form of material memory, but it encompasses a greater sense of both contingency and potentiality. Lefebvre's "rhythmanalysis" is also concerned, like Ernst's media archaeology, with the event: "Everywhere there is interaction between a place, a time and an expenditure of energy, there is rhythm" (Lefebvre xv). However, for Lefebvre, "We know that a rhythm is slow or lively only in relation to other rhythms (often our own: those of our walking, our breathing, our heart)" (Lefebvre 10). Furthermore, these rhythms are not spontaneous or self-contained but are the result of a process of external influences. This he labels "dressage," or training, the acculturation of an individual to a socially produced articulation of time (Lefebvre 39). This could be described as inscription, but it realizes the necessity of its own continual reperformance.
We know by now that the meaning of speech is not just a matter of semantics. As Deleuze and Guattari put it, "Because style is not an individual psychological creation but an assemblage of enunciation, it unavoidably produces a language within a language" (Deleuze 97). This second-order language, this style, this rhythm, is what is important to the rhythmanalyst, and what she can offer to the media archaeologist. For it brings an enunciation into play with the listening that conditions it. Ernst's strict division of the semantic versus the technical requires us to repress the very reverberations that make acoustic knowledge significant, the chain of embodied entrainments in which both we and the machine are co-implicated. And yet, conversely, the pulse of the machine is absent in Lefebvre's thinking, and can only be supplied by a close attention to technical means. To my ear, something like WaveNet requires their interanimation.
WaveNet is a listening machine. Like a phonograph, it processes raw audio, and reproduces raw audio in return. It operates beneath a human conception of what speech "is" and captures instead the acoustic knowledge that actually composes it. That we recognize the quality of that audio as important to a "realistic" voice shows that humans, too, possess a means of acoustic knowledge beyond the semantic—a sense of rhythm. WaveNet also functions as an algorithm for perceptual coding concerned with these very features: what's retained from those 44 hours in the 10-second snippet is a sense of the embodied human enunciation. The mechanism through which WaveNet "learns"—training a deep convolutional neural network (van den Oord)—is in fact an entrainment to these rhythms. Starting as a blank slate (like children shipwrecked alone in a lagoon, natch), with the introduction of each human recording it learns how to predict the sequence of audio samples relative to a given text. With each recording it hears, it changes. This is what makes it a listener, and a better one than a phonograph that can only receive a single sonic impression.
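Formally, this is how the paper itself states it: the probability of a waveform factorizes into one prediction per sample, each conditioned on everything heard so far and on the text to be spoken (van den Oord):

```latex
% WaveNet's autoregressive factorization (van den Oord et al.):
p(\mathbf{x} \mid \mathbf{h}) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})
% x = the audio samples, h = the conditioning input (e.g., the text);
% training adjusts the network so each factor assigns high probability
% to the sample that actually came next in the recordings.
```

Listening, under this formalization, is nothing other than the continual re-estimation of these conditional probabilities.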
We know from Large that the quality of internal oscillation in human physiology is conditioned by the environment—rhythmanalysis demonstrates that how you listen and how you walk, have sex, or use a computer are not materially separable. Likewise, WaveNet introduces its own inflections that are intrinsic to its material situation—algorithm, hardware, Google engineers. Its speech is a negotiation between human resonance and this embodied machine temporality. Lefebvre muses how "If one could 'know' from outside the beatings of the heart of ... a person ..., one would learn much about the exact meaning of his words" (Lefebvre 4). Beating at nonhuman rates, WaveNet both listens and speaks differently. What is it that we hear, then, in the melodrama of its babblings? Though its phonetic poetry is at first hearing benign, it raises the question of what qualities of enunciation it might normalize—who are the voices it listens to? To which listeners does it appeal? And how will interacting with WaveNet voices shape human ears, as they inevitably will?
Notes
[1] https://storage.googleapis.com/deepmind-media/pixie/us-english/wavenet-1.wav
[2] This testing was conducted via online crowdsourcing. The anonymous, underpaid, typically non-US human labor involved in training contemporary AI systems is an intriguingly problematic method that is beyond the scope of this text.
[3] https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-2.wav
[4] Alan Turing proposed a test that predicated a machine's ability to think on its ability to imitate a human. This was to be done via teletype—only written language is ever exchanged.
[5] An adult human can typically hear up to 22kHz—a sampling rate of twice this frequency is required to accurately reproduce the waveform (CD-quality audio is 44.1kHz). WaveNet operates at 22kHz, meaning it's limited to frequencies below 11kHz—it's not hi-fi from an audiophile perspective, but that's still pretty good.
References
Deleuze, Gilles and Felix Guattari. A Thousand Plateaus: Capitalism and Schizophrenia, trans. Brian Massumi. Minneapolis: University of Minnesota Press, 1987.
Ernst, Wolfgang. Digital Memory and the Archive. Minneapolis: University of Minnesota Press, 2013.
Labelle, Brandon. Background Noise: Perspectives on Sound Art. London: Continuum, 2006.
Large, Edward, et al. "Neural networks for beat perception in musical rhythm" in Frontiers in Systems Neuroscience, 2015; 9: 159. <http://dx.doi.org/10.3389/fnsys.2015.00159>
Lefebvre, Henri. Rhythmanalysis: Space, Time, and Everyday Life. London: Continuum, 2004.
Sterne, Jonathan. MP3: The Meaning of a Format. Durham: Duke University Press, 2012.
van den Oord, Aäron, et al., "WaveNet: A Generative Model for Raw Audio," presented at the 9th ISCA Speech Synthesis Workshop, published September 19, 2016, blog post <https://deepmind.com/blog/wavenet-generative-model-raw-audio/> accessed September 25, 2016.