By Edward Shen
Since the dawn of digitised media, people have been using computing power to process encoded media content. Looking back at how computers have assisted with media processing:
- 20th Century Fox first used unique IDs to make its production content identifiable.
- OpenCL (Open Computing Language) was introduced in 2009 and allows heterogeneous platforms (CPU, GPU, DSP, FPGA, etc.) to execute tasks in parallel, increasing a given system’s processing power (Khronos Group, 2018).
- The latest version of Adobe Premiere Pro enables automatic shot comparison and colour matching (Adobe, 2018).
The above examples highlight the main task we rely on computers for: enhancing both quality and efficiency across broadcast and post-production workflows.
How far away are artificial intelligence and machine learning?
- Artificial Intelligence – See previous Digistor blog article by Patrick Trivuncevic: Artificial Intelligence in Media.
- Machine Learning – Apple embedded a dedicated ASIC (the Neural Engine) in its A11 SoC (System on Chip) to execute convolution calculations. Apple also introduced the Core ML machine learning framework to handle tasks such as on-device voice synthesis for Siri (Apple, 2017), face detection (Apple, 2017) and synthetic image processing (Apple, 2017).
So, the answer is: these technologies are everywhere, even inside your mobile phone. This blog article explains the key mechanisms behind a fundamental machine learning model example.
How can machines have ‘cognition’ and use that to deliver outputs?
Let’s take voice recognition as an example. We give a computer a meeting recording in which different people were speaking, and we want the computer to compile a transcript. Finishing this task involves sub-tasks such as spectral analysis and speaker diarisation. Speaker diarisation is the task of determining “who spoke when” in an audio track, even in the presence of environmental noise or more than one speaker.
Human speech has unique acoustic characteristics such as syllables, phonemes (vowels, consonants) and formants (Ladefoged, 2005). Early research showed that mathematical modelling can represent human speech. The most famous example is the formant model (Paliwal et al., 1982), in which the human vocal tract is treated as a cavity resonator: when a person speaks, his or her formant frequencies can be represented as a mathematical function. Based on this theory, we can illustrate the meeting recording as:
Figure 1 Microphone captured sound
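To make the formant idea concrete, here is a toy source-filter sketch in Python: a pitch tone shaped by components at rough formant frequencies for the vowel /a/. The frequency values and weighting scheme are illustrative assumptions, not taken from the article or from any cited model.

```python
import math

# Ballpark formant frequencies for the vowel /a/ (illustrative textbook
# values; real formants vary by speaker).
FORMANTS_HZ = [730.0, 1090.0, 2440.0]
SAMPLE_RATE = 16_000

def vowel_sample(t, f0=120.0, formants=FORMANTS_HZ):
    """Toy source-filter model: a pitch tone at f0 (the 'source')
    shaped by sinusoidal components at the formant frequencies."""
    source = math.sin(2 * math.pi * f0 * t)
    # Higher formants contribute with decreasing weight.
    resonance = sum(math.sin(2 * math.pi * f * t) / (i + 1)
                    for i, f in enumerate(formants))
    return source * resonance

# 10 ms of the synthetic vowel at 16 kHz.
signal = [vowel_sample(n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 100)]
```

Plotting such a signal over time and frequency gives a picture like figure 1: the spectral energy concentrates around the formant frequencies, which is exactly the structure diarisation systems exploit.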
Miro et al. (2012) introduced a speaker diarisation architecture (figure 2). It trains many cluster models at the beginning, then re-trains and merges them; the goal of this approach is to reduce the number of clusters to one per speaker.
Figure 2 Speaker diarisation architecture
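The merging step of this architecture can be sketched as a simple bottom-up loop. This is a deliberate simplification: real systems compare full statistical models using criteria such as the Bayesian Information Criterion, whereas here each cluster is just a list of 1-D feature values and the stopping threshold is invented for illustration.

```python
def merge_clusters(clusters, threshold):
    """Repeatedly merge the two closest clusters until all remaining
    clusters are further apart than `threshold` (toy 1-D version)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Find the pair of clusters whose means are closest.
        means = [sum(c) / len(c) for c in clusters]
        pairs = [(abs(means[i] - means[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:  # remaining clusters are distinct speakers
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

# Six initial segments drawn from two well-separated "speakers".
segments = [[1.0, 1.2], [0.9], [1.1], [5.0, 5.2], [4.9], [5.1]]
print(len(merge_clusters(segments, threshold=1.0)))  # prints: 2
```

Starting from six over-segmented clusters, the loop converges to one cluster per speaker, which is the goal of the architecture in figure 2.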
A Gaussian mixture model (GMM) is an ideal way of training these clusters. Figures 3 and 4 show examples of using GMMs to classify data sets (Reynolds, 2015). In figure 3, there are three data sets containing 150, 250 and 100 data points respectively, each following a normal distribution. After the algorithm converges, it accurately represents the characteristics of each data set. Figure 4 shows the Spirite data set, which is distributed in three-dimensional space, is very complex, and has no observable category boundaries. With the number of mixture components set to 10, the GMM can still approximately represent the distribution of this data set after convergence.
Figure 3 GMM with three sets of normal distribution data sets
Figure 4 GMM with the Spirite data set
According to the formant model, the audio track can also be presented as a three-dimensional model. Now we process the given audio recording (24 bits × 4,395 frames), in which two people talk one after the other, followed at the end by some noise from a phone speaker playing a YouTube video. A GMM and a UBM (Universal Background Model, used to overcome the GMM’s limitations) are used to separate the different voice sources (Reynolds, 2015).
Figure 5 Diarisation result
From the MATLAB result, we can tell that there are three voice sources. The first is Alice, who spoke from the beginning to 23.58 seconds; the second is Bob, who spoke from 23.58 to 48.24 seconds; and the last is environmental noise, from 48.24 seconds to the end.
Figure 6 Audio waveform with diarisation markers
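Assuming the speaker and background models have already been trained, the labelling step that produces segments like the ones above can be sketched as scoring each frame against every model and picking the best. The 1-D model parameters below are invented for illustration; a real system scores multi-dimensional spectral features against full GMMs and a UBM.

```python
import math

def log_gauss(x, mu, var):
    """Log-density of a 1-D Gaussian, used as a per-frame score."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

# Toy pre-trained models: (mean, variance) of a 1-D feature per source.
# The broad 'noise' component stands in for the UBM-style background model.
MODELS = {
    "Alice": (1.0, 0.2),
    "Bob":   (4.0, 0.2),
    "noise": (2.5, 4.0),
}

def label_frames(frames):
    """Assign each frame to the model that scores it highest."""
    return [max(MODELS, key=lambda name: log_gauss(x, *MODELS[name]))
            for x in frames]

print(label_frames([0.9, 1.1, 4.2, 2.4]))
# prints: ['Alice', 'Alice', 'Bob', 'noise']
```

Runs of identical labels then become the time segments plotted in figures 5 and 6.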
That’s the whole process of model training and cluster matching. This algorithm can be applied to any recording, and it will use its ‘cognition’ function to work out who spoke when.
With the processes above, the computer now has its own ‘understanding’ of a process through model training and cluster matching. The next step is to choose optimal actions in partially observable stochastic domains (Kaelbling et al., 1998). With a Markov Decision Process (MDP), the agent (the computer) shown in figure 7 can make its decisions based on the previous state of the world.
Figure 7 Markov decision process model
This is very similar to how humans make decisions: a computer can also react to an object based on its own understanding.
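A minimal value-iteration sketch over a two-state MDP shows how such an agent picks actions. The states, actions, transition probabilities and rewards below are all invented for illustration; they are not from the cited paper.

```python
# transitions[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "idle": {"wait":   [(1.0, "idle", 0.0)],
             "render": [(0.9, "done", 10.0), (0.1, "idle", -1.0)]},
    "done": {"wait":   [(1.0, "done", 0.0)],
             "render": [(1.0, "done", 0.0)]},
}
GAMMA = 0.9  # discount factor for future rewards

def value_iteration(transitions, gamma, iters=100):
    """Iteratively estimate the best achievable value of each state."""
    values = {s: 0.0 for s in transitions}
    for _ in range(iters):
        values = {
            s: max(sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                   for outcomes in actions.values())
            for s, actions in transitions.items()
        }
    return values

def best_action(state, transitions, gamma, values):
    """Pick the action with the highest expected long-term value."""
    return max(transitions[state],
               key=lambda a: sum(p * (r + gamma * values[s2])
                                 for p, s2, r in transitions[state][a]))

values = value_iteration(TRANSITIONS, GAMMA)
print(best_action("idle", TRANSITIONS, GAMMA, values))  # prints: render
```

The agent prefers ‘render’ from the idle state because its expected discounted reward outweighs the small risk of failure, which is the kind of state-based decision figure 7 depicts.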
Why can machine learning help with media production?
The biggest topic at NAB 2018 was artificial intelligence and machine learning. These technologies will analyse your content, learn your operations and allocate the proper resources before you begin the next step…
Cloud computing is also a boost for AI and ML: it provides more computing power for model training and cluster matching, further converging the decision model to provide more accurate and more optimal reactions.
Years ago, you needed to tell a computer the geometric structure of a chair for it to recognise chairs. With machine learning, you only need to give your computer Internet access: it will train its clusters by itself on millions of photos of chairs and non-chairs, forming its own ‘cognition’ to ‘understand’ what a chair looks like.
If you’d like to know more about how machine learning could help your business with enhancements in both quality and efficiency, please contact us at Digistor.
References:
Adobe Systems Incorporated 2018, Premiere Pro CC New Features Apr-03-2018, viewed 23 May 2018, <https://www.adobe.com/au/products/premiere/features.html>
Apple Inc. 2017, An On-device Deep Neural Network for Face Detection, viewed 23 May 2018, <https://machinelearning.apple.com/2017/11/16/face-detection.html>
Apple Inc. 2017, Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis, viewed 23 May 2018, <https://machinelearning.apple.com/2017/08/06/siri-voices.html>
Apple Inc. 2017, Improving the Realism of Synthetic Images, viewed 23 May 2018, <https://machinelearning.apple.com/2017/07/07/GAN.html>
Digistor 2018, Artificial Intelligence in Media, viewed 23 May 2018, <https://www.digistor.com.au/the-latest/cat/digistor-blog/post/artificial-intelligence-in-media/>
Kaelbling, L.P., Littman, M.L. & Cassandra, A.R. 1998, 'Planning and acting in partially observable stochastic domains', Artificial Intelligence, vol. 101, pp. 99-134.
Khronos Group 2018, The OpenCL™ Specification, viewed 23 May 2018, <https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html>
Ladefoged, P. 2005, A Course in Phonetics, 5th edn, Cengage Learning.
Miro, A.X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G. & Vinyals, O. 2012, 'Speaker diarization: a review of recent research', IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356-370.
Paliwal, K.K., Ainsworth, A.W. & Lindsay, D. 1982, 'A study of two-formant models for vowel identification', Speech Communication, vol. 2, pp. 295-303.
Reynolds, D. 2015, 'Gaussian mixture models', Encyclopedia of Biometrics, pp. 827-832.
Reynolds, D. 2015, 'Universal background models', Encyclopedia of Biometrics, pp. 1547-1550.