Research Article

Music Signal Recognition Based on the Mathematical and Physical Equation Inversion Method

Wei Jiang¹ and Dong Sun²

¹Department of Music, Shandong University of Science and Technology, Qingdao 266590, China
²School of Electronic and Information Engineering, Henan Institute of Technology, Xinxiang 453003, China

Correspondence should be addressed to Wei Jiang; [email protected]

Received 24 August 2021; Accepted 20 September 2021; Published 1 October 2021

Academic Editor: Miaochao Chen

Copyright © 2021 Wei Jiang and Dong Sun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Digitization and analysis processing technology of music signals is the core of digital music technology. The paper studies music signal feature recognition technology based on the mathematical equation inversion method, aiming to design a method that can help music learners in music learning and music composition. The paper first studies the modeling of the music signal and its analysis and processing algorithm, combining the four elements of musical sound, analyzing and extracting the characteristic parameters of notes, and establishing the mathematical models of the single note signal and the music score signal. The single note recognition algorithm is studied to extract the Mel frequency cepstrum coefficients of the signal, and the DTW algorithm is improved to achieve single note recognition. Building on the single note algorithm, we combine it with a note temporal segmentation method based on the energy-entropy ratio to segment the music score into single note sequences, realizing music score recognition. The paper then studies the music synthesis algorithm and performs simulations. The benchmark model demonstrates through comparative experiments the positive contribution of pitch features to recognition and explores the number of harmonics that should be attended to when recognizing different instruments. The attention network-based classification model draws on the properties of human auditory attention to improve the recognition scores of the main playing instruments and the overall recognition accuracy of all instruments. The two-stage classification model is divided into a first-stage classification model and a second-stage classification model; the second-stage model consists of three residual networks, trained separately to specifically identify strings, winds, and percussion. This method achieves the highest recognition score and overall accuracy.

1. Introduction

The emergence of computer technology and the Internet has facilitated the birth and development of a series of interdisciplinary disciplines that combine science and art. In the field of music research, music, as an artistic discipline closely connected with daily life and learning, is gradually going digital and technological. In recent years, modern music technology, especially electronic music technology, has developed rapidly, and issues such as music recognition, retrieval, and synthesis based on computer technology have received more attention from researchers [1]. Traditional music teaching requires professional teachers to tutor students, and teaching is characterized by repetitive practice. This repetitive work not only greatly reduces the effective utilization of teachers but also incurs expensive fees for one-on-one teaching and tutoring, making systematic music learning impossible for low-income families. In addition, in the teaching process, musicians judge pitch based on their rich teaching experience and on what the human ear hears, which is too subjective and prone to error [2]. If computer technology is applied to music teaching, on the one hand, it can assist musicians in music teaching and reduce labor intensity; on the other hand, music learners can, to a certain extent, learn independently of teachers and reduce learning costs. In addition to playing a significant role in music teaching, digital music technology can also promote the development of intelligent music

Hindawi
Advances in Mathematical Physics
Volume 2021, Article ID 3148747, 12 pages
https://doi.org/10.1155/2021/3148747

composition [3]. The realization of music synthesis technology makes automatic music composition possible, and for people who are not very proficient in music theory, music synthesis technology lowers the threshold of music composition, so that more music lovers can create their own works. In addition, music synthesis technology also contributes to the development of electronic instruments and the improvement of the sound of traditional instruments [4].

With the development of artificial intelligence technologies, music information retrieval has received renewed attention in the field of computer science. Content-based music information retrieval includes several research directions: music recognition, melody extraction, pitch estimation, sentiment classification, rhythm detection, and genre and style classification [5]. Among them, the identification of multiple instruments in a music song and the prediction of their activity levels is an important research topic in the MIR task. Music recognition techniques can be applied in many contexts, such as searching for songs with specific instruments or identifying the starting and ending positions of a certain instrument played in the audio. Music recommendation systems can also be improved by modeling user preferences for certain instruments. These techniques can further be used for automatic music transcription in polyphonic music, playing-technique detection, and source separation tasks, where preconditioning models on the specific instruments present may improve source separation performance [6]. Multimusic recognition is essentially a timbre perception task. Timbre is a subjective property that is difficult to quantify. A person with a good musical sense and professional training can easily identify instruments in audio. However, the vast amount of music cannot rely on humans to identify and then provide labeled information for retrieval. With the development of artificial intelligence and computing power, we can extract the corresponding features of musical instruments in audio files and train efficient deep convolutional networks to achieve automatic recognition of musical instruments.

Music signals, as a type of audio signal, are widely distributed through the convenient Internet. With the permission of the copyright holder, people can download various kinds of music on the Internet. Therefore, the data volume of music audio is getting larger and larger, and the requirements for the retrieval task are getting higher and higher. However, many mainstream music search engines are still based on simple text retrieval over manually labeled song titles, artists, or years. It would be significant for retrieval efficiency and user experience if retrieval could be based on the content information of the music signal itself and these features could be automatically identified.

Chapter 1: Introduction. Firstly, the background and significance of the paper are explained in the context of the current social situation and social needs, and the main research content and the arrangement of each chapter are given. Chapter 2: Related Work. A research analysis of current research methods is conducted, and some knowledge of music theory and the basic elements of digital audio are introduced, which is conducive to an in-depth understanding of the essential characteristics of musical instruments and the key features for constructing identification. Chapter 3: Research on Music Signal Recognition Based on Mathematical Equation Inversion Methods. In terms of recognition, the paper chooses to characterize the original signal using Mel inversion coefficients. Then, the single note recognition algorithm is introduced, and based on it, the note cutting algorithm is studied to achieve multinote recognition. In terms of synthesis, mathematical modeling of the music signal is studied, and additive synthesis techniques are used to achieve piano tone reproduction based on the music score as well as note time value information. Chapter 4: Analysis of Results. Chapter 5: Conclusion. It mainly summarizes the final research results of the paper, analyzes the shortcomings of the paper in the research process, and also provides an outlook for future work based on these shortcomings.

2. Related Work

Douglas Nunn proposed a music recognition system based on the inverse signal processing method of mathematical equations. The maximum number of articulations that the system can recognize is increased to 8, but the accuracy of the system is not very high because it is more concerned with the consistency of the recognition results with auditory perception [7]. Since the inverse mathematical equation approach network uses a distributed collaborative approach to eliminate the global control module, researchers began to apply it to music recognition systems [8]. The successful application of Bayesian networks in music recognition systems has been proven to yield better a priori knowledge of the system. In recent years, researchers have started to apply fuzzy neural networks to music recognition. It has been verified that this method is closest to the human cognitive process of music and can effectively extract music information; thus, it has been more widely used. Ambrosanio et al. proposed an automatic music emotion recognition method based on a mathematical equation inversion model and a gene expression programming algorithm, which has a high recognition rate for single-emotion music but poor recognition for complex music with multiple emotions [9]. Yatabe et al. applied the tonal level contour feature in a chord recognition algorithm and achieved satisfactory recognition results [10].

In monophonic music, it has been possible to perform musical recognition of audio fragments at the note level or of continuous audio signals played by solo instruments [11]. He et al. proposed a linear spectral feature that was used together with a Gaussian mixture model to evaluate the classification of instrument families and to classify instruments into 14 instrument families. In addition to analyzing predefined features for classification, a classifier can be used to learn the features to accomplish the classification task [12]. Long et al. used sparse spectral coding and support vector machines to classify single- and multisource audio [13]. Jiang et al. proposed to extract the Mel spectrogram from a dataset of single note clips of 24 musical instruments, use sparse coding to learn the features in the spectrogram, and then train a support vector machine on the learned features to classify the instruments, with an accuracy of about 0.95 for the 24 instrument categories [14]. The deep architecture allows for end-to-end training of the feature extraction and classification modules to “learn” features, resulting in higher accuracy than traditional methods [15]. The successful application of deep learning in these two scenarios, monophonic music recognition and main-instrument recognition in polyphonic music, inspires us to further apply it to polyphonic music recognition [16].

To perform multimusic recognition in polyphonic music, general time-frequency features may not achieve good recognition, so we selected pitch features and mathematical equation inversion identification as the input features of the model. The pitch features reflect the range of the instrument and the fundamental frequency of the notes, and we use the idea of multibasis frequency estimation to extract the pitch features, using a filter set with custom parameters to extract the initial features of the audio fed to the convolutional network. Mathematical equation inverse identification is a special wavelet transform that has been improved to facilitate music analysis and can reflect the energy distribution of each pitch. We use an improved fast computational method to extract the mathematical equation inverse identification of the audio. These two features combined can effectively capture the harmonic structure of the music signal, which is reflected in the music as the timbre of the instrument [17, 18]. We are currently not aware of any work correlating timbre with pitch in music recognition. Finally, we processed the extracted features and constructed three classification models, namely, a baseline model, an attention network-based classification model, and a two-level classification model. The baseline model demonstrates the effectiveness of pitch features in music recognition; the attention mechanism has been widely used in computer vision, and we apply it to the “auditory” attention of music signals. The two-level classification model first performs coarse classification of instrument families and then performs subclassification of the specific instruments of the corresponding instrument families. A series of comparative experiments on the three classification models also explores the validity of various known experiences in multimusic recognition, as well as the possibility of unknown methods.

3. Research on Music Signal Identification Based on Inversion Method of Mathematical Equations

3.1. Music Signal Feature Parameter Extraction. A complete musical work is composed of different parts: locally, it is composed of different motives, sections, and phrases; as a whole, it is composed of complete sections, parts, or movements. The musical signal thus has both overall and local characteristics, and it is the interaction and connection between them that constitutes musical integrity. The overall characteristic is expressed by the main theme of the music, and the local characteristics are developed around the overall characteristic. That is, it is a relationship between commonality and individuality, where the commonality determines the individuality and the individuality reflects the commonality [19]. The study of the overall and local characteristics of the musical signal can reveal its essential characteristics. An individual section is the smallest unit into which the musical signal can be divided, as it already clearly expresses musical ideas and shapes musical images. Therefore, in this project, we take a single section as the smallest unit of music analysis. Figure 1 shows the characteristic diagram of the music signal. The rhythm of slow music changes slowly, and the music signal is soft. Through mathematical equations, musical features such as motives, sections, phrases, passages, and movements are extracted.

The expression for the spectral energy is shown in Equation (1), which is a statistical quantity. The elemental representation is based on the study of the fundamental frequency cycles of human perception, a method also commonly referred to as the chromaticity vector method. In the vector, each element corresponds to one of the 12 traditional pitch classes [20]. The value obtained by taking the root mean square of the spectral energy is a physical quantity related to the intensity of the sound. In note modeling and single note recognition, only the note segment needs to be detected from the speech mixed with blank and noise segments. Therefore, the paper selects the short-term average energy, which requires less calculation and has better real-time performance, for endpoint detection.

$$E(x) = \operatorname{avg}\left(\lim_{N\to\infty}\sum_{i=1}^{N}\left|e_i - 1\right|^2\right). \tag{1}$$
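As a rough illustration of the short-term average energy endpoint detection that Equation (1) supports, the sketch below frames a signal and thresholds the per-frame energy. The frame length, hop size, threshold ratio, and function names are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Average energy per frame, in the spirit of Equation (1)'s |e_i|^2 average."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.mean(f ** 2) for f in frames])

def detect_note_segment(signal, threshold_ratio=0.1, frame_len=256, hop=128):
    """Return (start, end) frame indices where energy exceeds a relative threshold,
    i.e. the note segment; return None if no frame is active."""
    e = short_time_energy(signal, frame_len, hop)
    thresh = threshold_ratio * e.max()
    active = np.where(e > thresh)[0]
    if active.size == 0:
        return None
    return int(active[0]), int(active[-1])
```

For a tone embedded in silence, the detected frame range brackets the tone; a relative threshold makes the detector insensitive to overall signal level.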

Each critical spectrum expansion function has a 10 dB and a 25 dB expansion toward high and low frequencies, respectively. The masking effect of the low-frequency band on the high-frequency band is strong. The effect of the critical band x_i on x_j satisfies Equation (2), where f(x) = x_i − x_j. The music signal also differs from general audio signals in that it has not only genre divisions but also song style divisions. From the point of view of music theory, the beat usually occurs at the point of note onset, and the selection of the frame has a direct impact on the characteristics of the signal, as the instrument is articulated and played, and the singer sings and ends, according to the beat in an orderly manner [21, 22]. The speed of the beat usually represents the style of the music signal; generally speaking, the more intense the spectral changes, the faster the beat and the more active the music signal. Softer music has slower beat changes and a softer music signal.

$$Z_{\mathrm{db}}(f(x)) = \alpha + \beta\left(\nabla f(x) + \chi\right). \tag{2}$$

The all-pole model obtained by linear predictive analysis has the system function of Equation (3).

$$F(x) = 1 - \lim_{M\to\infty}\sum_{i=1}^{M}(\beta + 1)\,f(x_i)^{-2}. \tag{3}$$

In Equation (3), M is the order of the linear predictor. If the impulse response is assumed to be f(x), we have Equation (4). However, since the LPC cepstral coefficients are based only on the prediction of linear relationships, the robustness of the parameters is not very good and the noise immunity is low.

$$F(x) = \lim_{M\to\infty}\sum_{i=1}^{M} f(x)\,x^{-i}. \tag{4}$$

When a speech signal is transmitted as a traveling wave across the cochlear basilar membrane, the transmission distance of a low-frequency signal is greater than that of a high-frequency signal because of its low frequency and long wavelength; thus, the high-frequency signal is masked by the low-frequency signal, the masking ability of higher-frequency sounds varies with frequency, and the higher the audio frequency, the greater the masking ability [23]. Therefore, the human auditory system is equivalent to a filtering system that filters the treble. In terms of design implementation, a set of band-pass filters can be designed, arranged from dense to sparse according to the masking ability at each frequency point, based on the hearing characteristics of the human ear. The conversion relationship between linear frequency and H frequency is shown in Equation (5).

$$H(x) \triangleq 4096 \ln\left(1 + \frac{x}{900}\right). \tag{5}$$

The logarithmic energy output of each triangular filter bank is calculated as shown in Equation (6).

$$E(n) = \lg\left(\lim_{N\to\infty}\sum_{i=1}^{N}(e_i - 1)\,H(n)\,f(x_i)\right), \quad n \in [0, N]. \tag{6}$$

The MFCC is obtained by applying a discrete sine transform to E(n); the transformation is given in Equation (7), where K is the dimension of the characteristic parameter. Since the Mel frequency cepstrum coefficient not only reflects the human auditory effect but also makes no assumptions or restrictions on the input signal, it has better robustness.

$$M(n) = \lim_{N\to\infty}\sum_{i=1}^{N} E(n)\sin\left(\frac{n(i - 0.25)\pi}{K}\right), \quad n \in [0, K]. \tag{7}$$

3.2. Mathematical Equation Inversion Identification Algorithm. In this study, we propose a procedure to calculate the adaptive crossover rate and variation rate using the population concentration, by adding an extra step that computes the population concentration between the selection operation and the crossover operation. The population

Figure 1: Characteristic diagram of the music signal (basic characteristics: pitch, length, timbre, speed, strength, range; complex features: rhythm, melody, harmony, musical structure, music style, emotional connotation).


concentration J used in this study is calculated as in Equation (8), where N is the number of evolutionary generations.

$$J(N) = \frac{f(X_1(N))}{f(X_n(N))}. \tag{8}$$

Because of the great randomness of articulator vibration, the length of articulation time cannot be well controlled. If a linear uniform expansion method is used to align the frame lengths of the test file and the template file, it ignores how the duration of each small segment of the audio file varies under different circumstances, leading to a low recognition rate [24]. The population concentration J is used to regulate the crossover rate and variation rate: when the population is dispersed, a strategy of more crossover and less variation is adopted to increase exploitation; when the population is concentrated, a strategy of more variation and less crossover is used to increase exploration. The specific settings of the adaptive crossover rate and variation rate are shown in Equation (9). The m_n can be adjusted as needed so that the crossover rate and variation rate fluctuate within a specified range.

$$f(j) = m_1 + m_2(J - 1), \qquad f(k) = m_3 + m_4 J. \tag{9}$$
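A minimal sketch of Equations (8) and (9) follows. The m_n values and the sign of m_2 are illustrative assumptions of ours (the paper does not give concrete values); they are chosen so that a dispersed population (small J) gets more crossover and a concentrated one (J near 1) gets more variation, matching the strategy described above:

```python
def population_concentration(fitnesses):
    """Equation (8): best-to-worst fitness ratio J for a minimisation task.
    J is near 1 when the population has converged (concentrated) and
    near 0 when fitness values are spread out (dispersed)."""
    f = sorted(fitnesses)
    return f[0] / f[-1]

def adaptive_rates(J, m1=0.6, m2=-0.3, m3=0.02, m4=0.2):
    """Equation (9): f(j) = m1 + m2*(J - 1), f(k) = m3 + m4*J.
    m2 < 0 raises crossover when dispersed; m4 > 0 raises variation
    when concentrated. All four constants are illustrative."""
    crossover_rate = m1 + m2 * (J - 1.0)
    variation_rate = m3 + m4 * J
    return crossover_rate, variation_rate
```

In a GA loop, `population_concentration` would be evaluated right after selection, and the returned rates applied to the subsequent crossover and variation operators.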

The inversion algorithm is initialized by randomly generating a model K with the structure of Equation (10), saving it in the population, and setting the evolutionary generation J to 0. The initialization is run only once, at the start of the genetic algorithm.

$$K = \{\beta_i;\, p_i\}, \quad i \in [1, N]. \tag{10}$$

In Equation (10), N denotes the number of layers fitted by the inversion, β_i is the dielectric constant of layer i, and p_i is the thickness of layer i. The population is expressed as Equation (11), where J denotes the evolutionary generation.

$$P(J) = \left\| K_i(J) \right\|, \quad i \in [1, N]. \tag{11}$$

The music signal recognition record f_s(t) of each model is compared with the measured data f_r(t), and the adaptation value θ(t) of each model is calculated. The calculation is determined by the objective function, which in this study is set as Equation (12).

$$\theta(t) = \max\sqrt{\left| f_r(t) - f_s(t) \right|}. \tag{12}$$

Here, f_r(t) is the measured waveform data and f_s(t) is the inverse-fitted waveform data; this objective function minimizes the error between the measured and synthesized waveform data. At the same time, the error is also set as the adaptation value of the model, and the smaller the adaptation value, the better the model. The forward and backward processes continuously interact so that models close to the subsurface medium are retained and similar children are reproduced, eliminating the poorly fitted models. After several generations of evolution, the population model will

Figure 2: Additive synthesis schematic (audio conversion, cutting, merging, and noise reduction; music signal fragments 1 to n are mapped to music signal characteristics and combined into the music signal synthesis result).


gradually approximate the measured stratigraphic model, and the optimal model of the population will be output after the evolution is completed, giving the inversion results of the music signal.

$$f(x) = \lim_{N\to\infty}\sum_{i=1}^{N}\left| f_r(t) - f_s(t) \right|^{1/i}. \tag{13}$$

Since the search process for the optimal path is constrained by the slope, some frames cannot be matched in the actual process of solving for the optimal solution. Therefore, the improved DTW algorithm takes these constraints fully into consideration and reduces the matching computation between unnecessary information frames. The effective computation range of the dynamic time warping algorithm can be divided into three parts: (1, f(a)), (f(a) + 1, f(b)), and (f(b) + 1, Y), where f(a) and f(b) take the values of the two closest integers.

$$f(a) = \frac{2X - Y}{3}, \qquad f(b) = 1 - \frac{2Y - X}{3}. \tag{14}$$

When template matching is performed, each frame on the x-axis of the parameters to be identified only needs to be compared with the frames in the interval [min(y), max(y)] on the y-axis, where min(y) and max(y) are calculated as in Equation (15).

$$\min(y) = 2f(a) + X - 2Y, \qquad \max(y) = \frac{f(b) + 2X - Y}{2}. \tag{15}$$

Analytically, max(y) increases by two frames for each frame on the x-axis until max(y) = X, and min(y) does the opposite, decreasing by two frames for each frame on the x-axis until min(y) = 1. Therefore, in the actual encoding, min(y) and max(y) over the computational interval are obtained using Equation (16).

$$\min(y) = \max\big(1,\; X - 2(Y - x)\big), \qquad \max(y) = \min\big(X,\; 2x - 1\big), \qquad x \in [1, Y]. \tag{16}$$
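Assuming the reconstruction of Equation (16) above, the band of template frames worth evaluating per input frame can be computed as a small helper (1-indexed frames, as in the text; `band_limits` is our name):

```python
def band_limits(x, X, Y):
    """Equation (16): the range [min(y), max(y)] of y-axis (template) frames
    to compare against x-axis frame x, for template length X and input
    length Y. Frames outside this band are skipped by the improved DTW."""
    y_min = max(1, X - 2 * (Y - x))
    y_max = min(X, 2 * x - 1)
    return y_min, y_max
```

Note the band pins the warping path to its endpoints: at x = 1 both limits are 1, and at x = Y both equal X, so only the interior of the cost matrix is pruned.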

If the energy does not fluctuate much in the distribution across each frequency band, then the signal corresponding to that band of the spectrum contains more information, and the entropy value for that band is also larger. Therefore, information entropy can be used to detect the instability of the signal and find the correct note segmentation point in continuous notes. However, when the entropy value is used for segmentation directly, there is a problem: the audio energy may be large while the spectral entropy value is small. To solve this problem, the energy-entropy ratio is introduced. The energy-entropy ratio is the ratio of the short-time energy of each frame to its entropy value, and the spectrum of each frame is obtained by a Fourier transform of the preprocessed discrete audio signal x(i), which is given

Figure 3: Results of different music signal divisions (music section versus music duration in seconds, by music number).


by Equation (17).

$$g(x) = \lim_{N\to\infty}\sum_{i=1}^{N} f(i)\ln\left(\frac{1}{x(i)}\right). \tag{17}$$
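A per-frame energy-entropy ratio, as described above, can be sketched as follows; the FFT size and smoothing constants are illustrative, and the spectral-entropy form follows the log term of Equation (17):

```python
import numpy as np

def energy_entropy_ratio(frame, n_fft=512):
    """Short-time energy of a frame divided by its spectral entropy.
    Stable tonal regions give large values; valleys in the ratio over
    successive frames suggest note segmentation points."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    p = spec / (spec.sum() + 1e-12)            # normalised spectral distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # Equation (17)'s entropy term
    energy = float(np.sum(frame ** 2))
    return energy / (entropy + 1e-12)
```

A tonal frame concentrates its spectrum in a few bins (low entropy), so its ratio is high; a noisy frame of equal energy spreads its spectrum (high entropy) and scores low, which is exactly the discrimination the ratio is introduced for.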

The vibration of a piano string is a set of standing-wave vibrations with many overtone components, and each overtone's energy is strongest during the very short period when the key is pressed and then slowly decays to nothing over time. The High-Frequency Content (HFC) based note segmentation method uses this property of piano notes to weight the high-frequency energy in the frequency domain, thus improving the frequency-domain analysis of the high-frequency band of the signal. H(x) is defined in Equation (18), where w(i) is the frequency-domain weighting window; Masri proposes linear weighting of the high-frequency energy with w(i) = |i|.

$$H(x) = \lim_{N\to\infty}\sum_{i=1}^{N} w(i)\left| f(i) \right|^2. \tag{18}$$
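Equation (18) with Masri's linear weighting reduces to a one-line spectral sum per frame; the sketch below uses an assumed FFT size and function name:

```python
import numpy as np

def hfc(frame, n_fft=1024):
    """Equation (18) with w(i) = |i|: high-frequency content of one frame.
    Weighting each power-spectrum bin by its index emphasises the burst of
    high-frequency energy at a note onset, so HFC peaks mark candidate
    segmentation points."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return float(np.sum(np.arange(len(spec)) * spec))
```

Computed frame by frame, a sharp rise in `hfc` relative to neighbouring frames is taken as a note onset candidate.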

3.3. Music Signal Recognition Modeling. Music synthesis is based on the analysis of musical signals, and the paper uses additive synthesis techniques in spectral synthesis to simulate musical tones generated by piano notes. The additive synthesis technique was developed from Fourier's theorem that any periodic signal can be decomposed into many sinusoidal signals with different frequencies, amplitudes, and phases. Figure 2 shows the schematic diagram of the principle of additive synthesis. Defining the frequencies and amplitudes of different harmonics and mixing them together forms a new sound. However, to use the 1st to 9th harmonics to form a sawtooth-like waveform, one needs oscillators, an amplifier, a mixer, and thresholds to control the switching of the amplifier. The mathematical equation inversion method is used to make the synthesizer more efficient.
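A minimal additive-synthesis sketch of the idea above sums the first nine harmonics with decaying amplitudes; the 1/k harmonic weights and the exponential decay constant are illustrative choices, not the paper's fitted parameters:

```python
import numpy as np

def additive_tone(f0, duration, sr=44100, n_harmonics=9):
    """Additive synthesis: sum of the 1st..n_harmonics partials of f0 with
    1/k amplitudes (a sawtooth-like spectrum), shaped by an exponential
    decay envelope so each overtone is strongest just after the 'key press'
    and then fades, as described for piano strings."""
    t = np.arange(int(sr * duration)) / sr
    tone = sum(np.sin(2 * np.pi * k * f0 * t) / k
               for k in range(1, n_harmonics + 1))
    return tone * np.exp(-3.0 * t)   # illustrative decay constant
```

In software each `sin` term plays the role of one oscillator, the 1/k factor the per-partial amplifier gain, and the final sum the mixer of the hardware description.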

The attention network-based classification model has a shortcoming: although the overall accuracy and the recognition scores of instruments with a higher frequency of occurrence are improved, the recognition of harmonic instruments is not satisfactory when the main playing instruments appear simultaneously with harmonic instruments of other instrument families. This is essentially a category-imbalance problem; differences in the proportions of different categories can interfere with the learning of model parameters. When the probability of a category occurring is only 0.01, even if the model misidentifies all such categories, the error rate only increases by 0.01. This makes the model tend toward parameters that favor the recognition of larger-proportion categories during training, while smaller-proportion categories tend to be ignored. Some classification scenarios address this problem fundamentally by increasing the number of samples in the smaller categories, but for the multi-instrument recognition problem in music signals, category imbalance is unavoidable. This is because, in the creation of various musical genres, certain instruments suit melodic roles and others suit harmonic roles due to their timbral characteristics and range width, and melodic instruments always appear much more frequently than harmonic instruments. We often hear various piano pieces, but rarely do we hear "trumpet pieces" or "snare drum pieces."
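One standard remedy for the imbalance effect just described (a common technique, not a method the paper claims to use) is to weight the loss inversely to class frequency, so that rare harmonic instruments contribute as much to training as frequent melodic ones:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class loss weights inversely proportional to class frequency
    (the sklearn-style "balanced" scheme): w_c = n / (k * count_c), where
    n is the sample count and k the number of classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 99 piano frames vs. 1 snare frame: the snare class gets a ~100x weight.
w = balanced_class_weights(["piano"] * 99 + ["snare"] * 1)
```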

Figure 4: Lyapunov index line chart. (Axes: music section number versus Lyapunov index, 0.0-1.0.)

Figure 5: Average adaptation value curve of optimal solution per generation. (Axes: number of inversions versus inversion result; curves: standard genetic algorithm for mathematical equations, binary standard genetic algorithm, binary adaptive genetic algorithm, adaptive genetic algorithm for mathematical equations.)

Figure 6: Algorithmic computing costs. (Axes: testing frequency versus operation time in seconds; same four algorithms as Figure 5.)

The two-stage classification model consists of a first-stage classification model and a second-stage classification model, which are two convolutional network models. The first-level classification model uses the inverse identification of the mathematical equation as the input feature and first coarsely classifies the instrument families in the audio signal; that is, only three coarse classification labels are available: strings, winds, and percussion. These three instrument families have distinct energy characteristics. For strings, the peaks at the lower-order harmonic frequency points are distinct and sharp, and the high-frequency harmonic amplitudes are attenuated. For wind instruments, the peaks at lower-order harmonic frequency points are sharper than those of strings, and abundant harmonic spectral peaks with higher amplitude remain in the high-frequency region. For percussion, the spectral peaks are not obvious and noninteger harmonics are present; synthesizers often have to add white noise when synthesizing certain percussion. The inverse identification of the mathematical equation reflects the time-frequency energy distribution of the audio signal, which we believe can be used as an effective feature for coarse classification. The second-level classification model consists of three residual network models with the same architecture, each specifically trained to identify the various instruments within one instrument family, so there is a dedicated network model for each of the three families. Based on the coarse family classification produced by the first-level model, the corresponding network model in the second level is selected, and the subclassification results of the second-level network models are aggregated as the final classification result of the audio signal.
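The two-level routing just described can be sketched structurally as follows (the stub classifiers below are placeholders standing in for the trained convolutional and residual network models):

```python
# Coarse family classification first, then a family-specific fine model.
COARSE_LABELS = ("strings", "winds", "percussion")

def two_stage_classify(features, coarse_model, fine_models):
    """coarse_model maps features to a family label; fine_models maps each
    family to its dedicated instrument classifier. The fine result is the
    final label for the signal."""
    family = coarse_model(features)
    assert family in COARSE_LABELS
    instrument = fine_models[family](features)
    return family, instrument

# Illustrative stubs only: real models are the trained networks.
coarse = lambda f: "strings" if f["sharp_low_harmonics"] else "percussion"
fine = {
    "strings": lambda f: "violin",
    "winds": lambda f: "flute",
    "percussion": lambda f: "snare drum",
}
family, inst = two_stage_classify({"sharp_low_harmonics": True}, coarse, fine)
```

The design mirrors the text: only the selected family's network runs, and its fine label is reported.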

4. Analysis of Results

4.1. Music Signal Acquisition Analysis. According to the musical pattern, we divide the musical signal according to the bars in the pattern; the length of the bars is determined by the length of the musical signal and the number of bars. The bars have a clear termination in the spectrum of the music signal. A total of ten different types of music signals were selected for processing in this experiment, and the results of the division of the music signals are shown in Figure 3.
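The bar division rule above (bar length fixed by signal length and bar count) can be sketched in a few lines (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def split_into_bars(signal, n_bars):
    """Divide a music signal into n_bars near-equal bars; the bar length
    follows from the signal length and the number of bars (any remainder
    is spread across bars by array_split)."""
    return np.array_split(np.asarray(signal), n_bars)

bars = split_into_bars(np.arange(1000), 8)
```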

Figure 4 corresponds to the statistical line graph of the Lyapunov exponents of the musical signal. The Lyapunov exponent of each bar is greater than 0, indicating that each bar (locally) of the piece has a chaotic character. The bar with the largest Lyapunov exponent is bar 92, which has the strongest chaotic feature; the bar with the smallest Lyapunov exponent is bar 37, which has the weakest chaotic feature. The spread between the maximum and minimum Lyapunov exponents indicates that the piece covers a wide range of chaotic behavior, with strong nonlinear features. At the same time, the Lyapunov exponents themselves are not particularly large, indicating that the musical work is a weakly chaotic system with controllable nonlinear characteristics. This matches the nature of musical works: the overall trend of a musical work is controllable, but the length and intensity of a particular note at a certain moment are random and not precisely controllable.
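The paper does not spell out its Lyapunov estimation procedure; a common sketch, in the spirit of Rosenstein's method, delay-embeds each bar, tracks how fast nearest-neighbor trajectories diverge, and fits the slope of the average log divergence (all parameters here are illustrative):

```python
import numpy as np

def largest_lyapunov(x, dim=3, tau=1, fit_len=12, exclude=5):
    """Rough largest-Lyapunov-exponent estimate from a scalar series:
    delay-embed, pair each point with its nearest (temporally distant)
    neighbor, average the log divergence of pairs over k steps, and
    return the fitted slope."""
    n = len(x) - (dim - 1) * tau
    emb = np.array([x[i:i + dim * tau:tau] for i in range(n)])
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    for i in range(n):  # forbid self and temporal neighbors as "nearest"
        dist[i, max(0, i - exclude):i + exclude + 1] = np.inf
    nn = np.argmin(dist, axis=1)
    avg_log = []
    for k in range(fit_len):
        pairs = [(i, nn[i]) for i in range(n) if i + k < n and nn[i] + k < n]
        d = np.array([np.linalg.norm(emb[i + k] - emb[j + k]) for i, j in pairs])
        avg_log.append(np.mean(np.log(d[d > 0])))
    return np.polyfit(np.arange(fit_len), avg_log, 1)[0]

# The logistic map at r = 4 is chaotic (true exponent ln 2), so the
# estimated slope should come out positive, as for the chaotic bars above.
x = np.empty(400)
x[0] = 0.3
for i in range(399):
    x[i + 1] = 4.0 * x[i] * (1.0 - x[i])
lam = largest_lyapunov(x)
```

A positive slope indicates local chaos; applied per bar, this yields the per-bar exponents plotted in Figure 4.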

4.2. Music Signal Recognition Analysis. The evolutionary efficiency of genetic algorithms can be discussed in terms of evolutionary speed and the adaptation value of the final evolutionary result. Figure 5 shows the average adaptation value curve of the optimal solution for each generation over 1000 inversions. In Figure 5, it can be seen that at the beginning of evolution the genetic algorithm with the mathematical equation coding system evolves significantly faster than the genetic algorithm with the binary coding system, but the standard genetic algorithm for mathematical equations almost stagnates after the 10th generation, and the adaptation value of its evolutionary result is lower than that of the binary genetic algorithms. The results of the binary standard genetic algorithm and the binary adaptive genetic algorithm are similar, with the binary adaptive genetic algorithm the faster of the two in evolutionary speed. The adaptive genetic algorithm for mathematical equations is the fastest and the best in terms of both evolutionary speed and evolutionary results.

Figure 7: Comparison of recognition scores. (Instruments: piano, violin, viola, guitar, bassoon, timpani, snare drum, bass drum, xylophone, saxophone; recognition score 70-100; models: B13, B15, MA, MT.)
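A toy illustration of the adaptive-rate idea (rates that self-adjust with evolutionary status) on a one-dimensional problem; this is a sketch with an adaptive mutation rate only (crossover omitted for brevity), and all names and parameters are illustrative, not the paper's algorithm:

```python
import random

def adaptive_ga(fitness, bounds=(0.0, 10.0), pop_size=30, gens=60, seed=0):
    """Toy real-coded GA: individuals fitter than average get a lower
    mutation rate (to preserve them), below-average ones a higher rate,
    echoing the adaptive scheme discussed above. Elitism keeps the best."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(gens):
        fits = [fitness(x) for x in pop]
        f_avg, f_max = sum(fits) / pop_size, max(fits)
        new = [pop[fits.index(f_max)]]                  # elitism
        while len(new) < pop_size:
            a, b = rng.sample(range(pop_size), 2)       # binary tournament
            parent = pop[a] if fits[a] >= fits[b] else pop[b]
            f = max(fits[a], fits[b])
            if f_max > f_avg and f >= f_avg:            # adaptive rate
                pm = 0.5 * (f_max - f) / (f_max - f_avg)
            else:
                pm = 0.5
            child = parent + (rng.uniform(-1, 1) if rng.random() < pm else 0.0)
            new.append(min(hi, max(lo, child)))
        pop = new
    return max(pop, key=fitness)

# Maximize a simple peaked fitness; the GA should settle near x = 3.
best = adaptive_ga(lambda x: -(x - 3.0) ** 2)
```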

The computational cost of the algorithm was analyzed by recording the computation time for each of ten inversions, with the results shown in Figure 6; the computing platform was a personal computer. Since the mathematical equation coding system avoids conversions between binary and decimal, the average computing time of the genetic algorithm using the mathematical equation coding system is reduced from 4.76-4.89 s to 0.92-0.93 s, saving 81.23%-86.16% of the computing cost. The mathematical equation coding system can therefore significantly improve the computational efficiency of wave impedance inversion.

Through the above analysis, the adaptive genetic algorithm for mathematical equations has superior performance in both evolutionary efficiency and operational efficiency. In the experiment on the music signal inversion model, the adaptive genetic algorithm for mathematical equations relies on a continuous-space coding system and crossover and mutation rates that self-adjust with the evolutionary status, which effectively avoids the poor stability and slow evolution of the traditional genetic algorithm. The adaptive genetic algorithm is used for the inversion of the measured data. It is shown that the adaptive genetic algorithm for mathematical equations has high stability and operational efficiency, and it is therefore selected as the method to invert the measured data.

4.3. Music Signal Simulation Analysis. The experimental environment and data-set threshold settings are the same as in the previous section. The inversion of the mathematical equations is used as input to the first-level classification model, which outputs the music signal-time series matrix. We use the momentum algorithm with a momentum of 0.93, a minibatch size of 60, an initial learning rate of 0.05, and a weight decay factor of 2 × 10^-2. In the second-level classification model, we use the third-order harmonic mapping matrix I3, the fifth-order harmonic mapping matrix I5, and the sixth-order harmonic mapping matrix I6 as the input features of the string, wind, and percussion classification networks, respectively. The outputs of the three networks are then aggregated to obtain the final music signal-time series matrix. The recognition scores and overall accuracy of the various instruments under the two-level classification model are shown in Figure 7. The recognition scores of most instruments are improved, especially for the xylophone, which indicates that the two-level classification model alleviates the category-imbalance problem; thus, the overall accuracy is also improved.
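One parameter update under the quoted hyperparameters (momentum 0.93, learning rate 0.05, weight decay 2 × 10^-2) can be written out explicitly; this is classical momentum with L2 decay folded into the gradient, a sketch rather than the paper's training code:

```python
def sgd_momentum_step(w, grad, velocity,
                      lr=0.05, momentum=0.93, weight_decay=2e-2):
    """One SGD step: L2 weight decay is added to the gradient, the
    velocity accumulates a 0.93-discounted history of past updates,
    and the parameter moves by the velocity."""
    g = grad + weight_decay * w
    v = momentum * velocity - lr * g
    return w + v, v

# First step from rest: g = 0.5 + 0.02*1 = 0.52, v = -0.026, w = 0.974.
w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, grad=0.5, velocity=v)
```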

Using the real pitch labels and the extracted pitch features to construct harmonic mapping matrices separately for input to the benchmark model, a comparison experiment was conducted to demonstrate that the pitch features have a positive effect on multi-instrument recognition; in addition, the comparison of harmonic mapping matrices of different orders led to the conclusion that the recognition of different instruments should focus on different numbers of harmonics. The attention network-based classification model, which draws on the idea of visual attention, improves the recognition scores of the main playing instruments. The two-level classification model constructs a specialized classification network for each instrument family, with coarse classification of instrument families followed by fine classification of specific instruments, which conforms to basic cognitive logic and alleviates the category-imbalance problem. In terms of performance, the two-level classification model has the best recognition results, and the attention network-based classification model is the most cost-effective.

Figure 8: Accuracy and mean squared error results. ((a) Classification experiment results: accuracy (%) on test data sets 1-8. (b) Regression experiment results: mean square error on test data sets 1-8; curves: L1, L2, L3, and self-adaptive mathematical equation inversion identification models.)
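The paper does not give the construction of its order-k harmonic mapping matrices (I3, I5, I6); one plausible sketch, labeled as an assumption, marks the spectral bins at the first k harmonics of each frame's pitch:

```python
import numpy as np

def harmonic_mapping(pitches, n_bins=256, sr=8000, order=3):
    """Hypothetical order-k harmonic mapping: one row per frame, with
    ones at the FFT bins of the first `order` harmonics of that frame's
    pitch f0. The paper's exact construction may differ; this only
    illustrates encoding pitch-harmonic structure as a matrix."""
    hz_per_bin = (sr / 2) / n_bins
    m = np.zeros((len(pitches), n_bins))
    for t, f0 in enumerate(pitches):
        for n in range(1, order + 1):
            b = int(round(n * f0 / hz_per_bin))
            if 0 <= b < n_bins:
                m[t, b] = 1.0
    return m

# Two frames at 220 Hz and 440 Hz, third order: three marked bins each.
I3 = harmonic_mapping([220.0, 440.0], order=3)
```

Raising `order` (to 5 or 6) lets a family-specific network attend to more harmonics, matching the observation that different instruments call for different harmonic counts.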

In this experiment, the main models compared are the L1, L2, and L3 mathematical equation inversion recognition models and the adaptive mathematical equation inversion recognition model. Figure 8(a) shows the final results of all models in the classification experiments, and Figure 8(b) shows the final results of all models in the regression experiments. The adaptive mathematical equation inversion recognition model achieves excellent results in terms of accuracy and mean square error, obtaining higher accuracy and lower mean square error loss than the other three models. The general development trend of a piece of music can be inferred from the score chart, but performances of the same piece by different people cannot be identical; there is much uncertainty in the process of performance, which nevertheless does not change the overall development trend of the music. The music signal thus has chaotic characteristics. By calculating the correlation dimension, we also find that the chaotic character of the music signal exists and remains stable at a single value despite repeated differencing, which also shows the stability of the chaos in the music signal.

5. Conclusion

The system studied in the paper focuses on the application of computer science in the field of music, so to process music signals digitally it is necessary to understand the four elements of music signals. Of these four elements, pitch and timbre are the more important characteristic parameters. From the system point of view, the music signal is a time-lagged nonlinear dynamical system, and time-lagged systems often have multiple degrees of freedom and high-dimensional characteristics; the bifurcation process is accompanied by the generation of weak chaotic phenomena. The paper first compares the feature parameters commonly used in audio recognition and, based on the comparison results, selects the MFCC parameters as the feature parameters for note recognition. Then, the paper introduces the note recognition algorithm based on the inversion method of mathematical equations and presents the improved DTW algorithm. In the graded recognition, we use two levels of grading, which can be increased in the future according to the number and characteristics of the music to be recognized. The noise in the music signal is not necessarily the AC noise that we set up: in field acquisition of music signal data, for example, the temperature-dependent zero drift of the amplifier, interference around the microphone, and its instability may introduce considerable noise, so a variety of practical situations must also be studied. Considering the many different styles of music signals, we will deepen the research not only vertically but also horizontally in future studies to analyze many types of signals.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work was supported by the Social Science Planning Project of Qingdao: The Research on the Role and Development of Music Technology in the Inheritance of Qingdao Traditional Culture and Red Culture (No. QDSKL2001161s).


