
Progress Report

(As of Nov 20, 2016)


To recap our Deliverables Follow-up memo (PDF available here for reference): we have collected several Japanese and English sound files, concatenated the recordings from each speaker, and run an FFT on each of the concatenated files. Following this methodology, we produced the plots below (a subset of what we have):
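For concreteness, the concatenation step looks roughly like the following MATLAB sketch (the file names are placeholders, and we assume mono recordings at a shared sample rate, not our actual script):

```matlab
% Concatenate all recordings by one speaker, then take the FFT.
% File names are placeholders for our actual sound files.
files = {'speaker1_clip1.wav', 'speaker1_clip2.wav', 'speaker1_clip3.wav'};

signal = [];
for k = 1:numel(files)
    [x, fs] = audioread(files{k});    % assumes every clip shares one sample rate
    signal  = [signal; x(:, 1)];      % keep channel 1, append end-to-end
end

X = abs(fft(signal));                       % magnitude spectrum of the concatenation
f = (0:numel(X) - 1) * fs / numel(X);       % frequency axis in Hz
plot(f(1:floor(end/2)), X(1:floor(end/2))); % plot up to the Nyquist frequency
xlabel('Frequency (Hz)'); ylabel('|FFT|');
```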

One change we are considering is running an N-point FFT of each file instead of concatenating files to a common length. Our current understanding is that the N-point transform automatically zero-pads shorter files, so every transform is sampled on the same, finer frequency grid. With the increased number of coefficients, we think this might help us better classify the two languages. We plan to discuss this during the next project group meeting with Professor Balzano to confirm and correct our understanding.
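A minimal sketch of the idea, relying on the fact that MATLAB's fft zero-pads automatically when the requested length exceeds the signal (file names and the transform length are again placeholders):

```matlab
% fft(x, N) zero-pads x out to N samples when N > length(x), so two
% files of different lengths yield the same number of coefficients.
[x1, ~] = audioread('english_sample.wav');   % placeholder file names
[x2, ~] = audioread('japanese_sample.wav');
N  = 2^18;                                   % transform length (illustrative)
X1 = fft(x1(:, 1), N);                       % both spectra now have N bins,
X2 = fft(x2(:, 1), N);                       % regardless of file length
```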

 

Since then, we have worked on cross-correlating each sound file with vowel sounds to exploit an aspect of Japanese phonology. In contrast to English, which is often characterized as having about 15 distinct vowel sounds, Japanese has only five vowels, each pronounced the same way in every context, and nearly every syllable is either a vowel alone or a consonant followed by a vowel (e.g., 'ka', 'mi', 'fu', 'te', 'ro'), the lone exception being the moraic nasal 'n'. In this analysis, we succeeded in programmatically capturing and marking the most important peaks, where a peak represents a portion of the file that matches the given vowel sound well.

Fig. 1 FFT of all sample English voice files
Fig. 2 FFT of all sample Japanese voice files
Fig. 3 Samples of some voice files cross-correlated with Japanese vowel sounds. The top row shows Japanese results; the bottom row shows English results, with the peaks of each result marked. Each graph contains 3 plots because we used 3 different voice samples of the 'a' sound (by different speakers).
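The peak marking in Fig. 3 can be sketched along the following lines (xcorr and findpeaks come from the Signal Processing Toolbox; the file names and thresholds here are illustrative, not our tuned values):

```matlab
% Cross-correlate a voice file with a recorded vowel template and
% mark the prominent peaks, i.e. stretches that match the vowel well.
[voice, fs] = audioread('voice_sample.wav');   % placeholder file names
[vowel, ~]  = audioread('vowel_a.wav');

[r, lags] = xcorr(voice(:, 1), vowel(:, 1));
r = abs(r);

% Threshold and spacing are illustrative; in practice we tune per vowel.
[pks, locs] = findpeaks(r, ...
    'MinPeakHeight',   0.5 * max(r), ...
    'MinPeakDistance', round(0.05 * fs));      % peaks at least 50 ms apart

plot(lags, r); hold on;
plot(lags(locs), pks, 'rv');                   % mark the detected peaks
xlabel('Lag (samples)'); ylabel('|xcorr|');
```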

While more analysis is still required, a cursory glance tells us that cross-correlating vowels with Japanese files yields more sharp, prominent peaks than with English files. Because some vowels reveal starker differences between Japanese and English sound files than others, we plan to focus on only certain vowels in the future. Moving forward, we also want to replicate this analysis with Japanese consonant-vowel combinations that are pronounced differently from their English counterparts (e.g., 'ra', 'fu', 'tsu'), as well as with English consonant sounds that do not occur in Japanese, namely the voiced and voiceless fricatives 'th' (as in 'that' and 'thing') and the fricative 'v'. (Though a new Japanese syllable, 'vu', was introduced recently to help speakers approximate this consonant, it is very rarely used in practice.)

 

A third approach we have taken is machine learning with a Naive Bayes classifier. With this method, we build a simple probabilistic classifier from the FFTs of all of our voice files (not concatenated). In MATLAB, we produce an N×M matrix, where N is the number of files used to train the classifier and M is the number of points in the M-point FFT of each file. So far, this method has produced no laudable results. For now, we assume a multivariate multinomial (MVMN) distribution for our predictors, since we treat them as categorical.
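In outline, the training step looks like the sketch below (fitcnb is from the Statistics and Machine Learning Toolbox; the file list, labels, and FFT length are placeholders rather than our actual setup):

```matlab
% Train a Naive Bayes classifier on M-point FFT magnitudes.
% File names and labels are placeholders for our training set.
files  = {'eng1.wav', 'eng2.wav', 'jap1.wav', 'jap2.wav'};
labels = {'English'; 'English'; 'Japanese'; 'Japanese'};

M = 4096;                            % M-point FFT (illustrative length)
X = zeros(numel(files), M);          % N-by-M predictor matrix
for k = 1:numel(files)
    [x, ~]  = audioread(files{k});
    X(k, :) = abs(fft(x(:, 1), M)).';
end

% 'mvmn' models each predictor as multivariate multinomial, so MATLAB
% treats every distinct value in a column as its own discrete level.
model     = fitcnb(X, labels, 'DistributionNames', 'mvmn');
predicted = predict(model, X);       % sanity check on the training set
```

One caveat worth flagging: if the magnitudes are fed in as raw continuous values, the MVMN model may effectively memorize each training file's unique levels, which would be consistent with the behavior in Fig. 4a; this is a guess we still need to verify.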

 

This simple model has produced some questionable results. When we run the same files we trained the model on back through the classifier, we get the correct labels. That seems like it should be automatic, but it was only the case when we used the MVMN distribution.

Fig. 4a Using Naive Bayes model on the same files used to train the model (MVMN distribution).
Fig. 4b Using Naive Bayes model on the same files used to train the model (Normal distribution). This obviously does not work.

Through this, we can see that we should probably use the MVMN distribution. The real problem, however, arises when we test the model on new files.

Fig. 5 Using the previously defined model on some new files.

We are not sure why everything is classified as English (Fig. 5). Perhaps the DFT simply does not provide much distinction between the two languages, perhaps we made a mistake when we set up the model, or perhaps Naive Bayes is not a good classifier for this data set. As a corollary to Figure 5, we also found that the number of files used to train the model can make a large difference: the model can flip from classifying everything as English to classifying everything as Japanese with just a few minor changes in the training files. Possible fixes include correcting the implementation of the model (if we did in fact set it up incorrectly) or switching to a different classification model such as k-nearest neighbors, sketched below.
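If we do swap models, a k-nearest-neighbors baseline would be a small change in MATLAB; a sketch, reusing the hypothetical X and labels from the Naive Bayes example above:

```matlab
% k-NN baseline on the same N-by-M FFT feature matrix.
% X and labels are the (placeholder) variables built above.
knnModel  = fitcknn(X, labels, 'NumNeighbors', 3);   % k = 3 is illustrative
predicted = predict(knnModel, X);   % swap in held-out files for a real test
```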


To summarize, we plan to continue working on the following for the next 3 weeks:

1) Decide on the methodology we want to use in generating a model (concatenate files, truncate files, or zero-pad to achieve constant-length voice files?).

2) Generalize cross-correlation program by including support for more sounds; also, implement an algorithm to classify files based on results.

3) Fix our Naive Bayes classifier or implement some new form of classifier.


So far, the power (and coding simplicity) of cross-correlation has really stood out to us. In our opinion, it has produced the clearest results, compared with inspection of the FFTs and the machine-learning strategies. Though each technique still requires a lot of development, we envision our final project combining all three.
