
Voice and Staff Separation in Symbolic Piano Music with GNNs | by Emmanouil Karystinaios | Oct, 2024


The big question is: how can we make automatic transcription models better?

To develop a more effective system for separating musical notes into voices and staves, particularly for complex piano music, we need to rethink the problem from a different perspective. We aim to improve the readability of transcribed music starting from a quantized MIDI, which is important for creating good score engravings and better performance by musicians.

For good score readability, two elements are probably the most important:

  • the separation of staves, which organizes the notes between the top and bottom staff;
  • and the separation of voices, highlighted in this picture with lines of different colors.
Voice streams in a piano score

In piano scores, as mentioned before, voices are not strictly monophonic but homophonic, which means a single voice can contain one or more notes playing at the same time. From now on, we call these chords. You can see some examples of chords highlighted in purple in the bottom staff of the picture above.

From a machine-learning perspective, we have two tasks to solve:

  • The first is staff separation, which is straightforward: for piano scores, we just need to predict a binary label for each note, indicating the top or bottom staff.
  • The voice separation task may seem similar; after all, if we could predict the voice number for each note with a multiclass classifier, the problem would be solved!

However, directly predicting voice labels is problematic. We would need to fix the maximum number of voices the system can accept, but this creates a trade-off between our system’s flexibility and the class imbalance within the data.

For example, if we set the maximum number of voices to 8, to account for 4 in each staff as is commonly done in music notation software, we can expect to have very few occurrences of labels 8 and 4 in our dataset.

Voice Separation with absolute labels

Looking specifically at the score excerpt here, voices 3, 4, and 8 are completely missing. Highly imbalanced data will degrade the performance of a multiclass classifier, and if we set a lower maximum number of voices, we would lose system flexibility.

The solution to these problems is to let the system transfer the knowledge it learned on some voices to other voices. For this, we abandon the idea of the multiclass classifier and frame voice prediction as a link prediction problem: we want to link two notes if they are consecutive in the same voice. This has the advantage of breaking a complex problem into a set of very simple problems, where for each pair of notes we again predict a binary label telling us whether the two notes are linked or not. This approach is also valid for chords, as you can see in the lower voice of this picture.

This process creates a graph which we call the output graph. To find the voices, we can simply compute the connected components of the output graph!
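As a minimal illustration (the note indices and predicted links below are made up for the example), the connected-component step can be done with SciPy:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical output of the link classifier: each row (i, j) means
# "note i and note j are consecutive in the same voice".
predicted_links = np.array([[0, 2], [2, 5], [1, 3], [3, 4]])
num_notes = 6

# Build the output graph as a sparse adjacency matrix.
rows, cols = predicted_links[:, 0], predicted_links[:, 1]
adjacency = coo_matrix(
    (np.ones(len(predicted_links)), (rows, cols)),
    shape=(num_notes, num_notes),
)

# Each connected component of the output graph is one voice.
_, voice_id = connected_components(adjacency, directed=False)
print(voice_id)  # [0 1 0 1 1 0] -> notes 0, 2, 5 form one voice; notes 1, 3, 4 another
```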

To reiterate, we formulate the problem of voice and staff separation as two binary prediction tasks.

  • For staff separation, we predict the staff number for each note,
  • and to separate voices we predict links between each pair of notes.

While not strictly necessary, we found it useful for the performance of our system to add an extra task:

  • Chord prediction, where, similar to voice separation, we link two notes if they belong to the same chord.

Let’s recap what our system looks like so far: we have three binary classifiers, one that takes single notes as input, and two that take pairs of notes. What we need now are good input features, so our classifiers can use contextual information in their predictions. Using deep learning vocabulary, we need a good note encoder!
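As a rough sketch of what these three classifiers could look like in PyTorch (the names, layer sizes, and architecture below are illustrative assumptions, not the exact model from the paper):

```python
import torch
import torch.nn as nn

class StaffClassifier(nn.Module):
    """Binary staff prediction (top vs. bottom) from a single note embedding."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, note_emb):                 # note_emb: (num_notes, dim)
        return self.mlp(note_emb).squeeze(-1)    # one logit per note

class LinkPredictor(nn.Module):
    """Binary link prediction for a pair of notes (used for voice and chord links)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, note_emb, pairs):          # pairs: (num_pairs, 2) note indices
        src, dst = note_emb[pairs[:, 0]], note_emb[pairs[:, 1]]
        return self.mlp(torch.cat([src, dst], dim=-1)).squeeze(-1)  # one logit per pair

dim = 64                         # hypothetical embedding size
staff_head = StaffClassifier(dim)
voice_head = LinkPredictor(dim)  # links notes that are consecutive in the same voice
chord_head = LinkPredictor(dim)  # links notes that belong to the same chord
```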

We choose to use a Graph Neural Network (GNN) as a note encoder as it often excels in symbolic music processing. Therefore we need to create a graph from the musical input.

For this, we deterministically build a new graph from the quantized MIDI, which we call the input graph.

Creating these input graphs can be done easily with tools such as GraphMuse.
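GraphMuse takes care of this step; purely to illustrate the idea (the note fields and the set of edge types below are simplified assumptions, so refer to the GraphMuse documentation for the actual API), a deterministic graph-building routine could look like this:

```python
import numpy as np

# A toy quantized-MIDI note list: (onset, duration, pitch) in ticks.
# The field names are illustrative; a real pipeline would take them
# from a parsed MIDI file or a note array.
notes = np.array(
    [(0, 4, 60), (0, 4, 64), (4, 2, 67), (6, 2, 65), (4, 4, 48)],
    dtype=[("onset", int), ("duration", int), ("pitch", int)],
)

def build_input_graph(notes):
    """Deterministically connect notes with simple temporal relations:
    same onset, consecutive (one starts where another ends), and overlapping."""
    edges = []
    end = notes["onset"] + notes["duration"]
    for i in range(len(notes)):
        for j in range(len(notes)):
            if i == j:
                continue
            if notes["onset"][i] == notes["onset"][j]:
                edges.append((i, j, "onset"))
            elif end[i] == notes["onset"][j]:
                edges.append((i, j, "consecutive"))
            elif notes["onset"][i] < notes["onset"][j] < end[i]:
                edges.append((i, j, "during"))
    return edges

print(build_input_graph(notes))
```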

Now, putting everything together, our model looks something like this:
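Since the architecture figure is not reproduced here, a rough code sketch of the overall pipeline is shown below, using PyTorch Geometric’s SAGEConv as a stand-in GNN note encoder (the layer choice and sizes are assumptions, not the authors’ exact architecture) together with the heads sketched earlier:

```python
import torch
from torch_geometric.nn import SAGEConv

class NoteEncoder(torch.nn.Module):
    """GNN note encoder: contextualizes each note over the input graph."""
    def __init__(self, in_dim, dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, dim)
        self.conv2 = SAGEConv(dim, dim)

    def forward(self, x, edge_index):            # x: note features, edge_index: input graph
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)         # (num_notes, dim) note embeddings

# Putting it together with the heads from the earlier sketch (hypothetical sizes):
in_dim, dim = 16, 64
encoder = NoteEncoder(in_dim, dim)
# note_emb = encoder(note_features, input_edge_index)
# staff_logits = staff_head(note_emb)
# voice_logits = voice_head(note_emb, candidate_pairs)  # then connected components -> voices
# chord_logits = chord_head(note_emb, candidate_pairs)
```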


