Speech Recognition And Speech Synthesis on Angular

by Mehmet (@msarica)

msarica.com

I was writing a chatbot in which the user talks to a machine-learning-powered bot, and I decided to turn it into a general example application that anybody can use. This application has no intelligence of its own: the bot simply repeats what it heard, so that you can plug in your own logic.

I used Angular with a reactive approach, and most of the logic lives in a single service. Demo can be found here.

One small catch: for now this application only works in Chrome, as it is currently the only browser that supports the Web Speech API.
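Because support depends on the browser, it can help to check for the API before creating the service. The following is only a sketch: `SpeechWindowLike`, `getRecognitionCtor` and `supportsSpeech` are hypothetical names that are not part of the service in this article, and the window is passed in as a parameter so the helper can be tried without a browser.

```typescript
// Constructor type for whatever recognition class the browser exposes.
type RecognitionCtor = new () => unknown;

// Only the window properties this sketch cares about (all optional).
interface SpeechWindowLike {
  SpeechRecognition?: RecognitionCtor;
  webkitSpeechRecognition?: RecognitionCtor;
  speechSynthesis?: unknown;
}

// Returns the recognition constructor, preferring the unprefixed name
// and falling back to the webkit-prefixed one; null if neither exists.
function getRecognitionCtor(win: SpeechWindowLike): RecognitionCtor | null {
  return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

// True only when both recognition and synthesis appear to be available.
function supportsSpeech(win: SpeechWindowLike): boolean {
  return getRecognitionCtor(win) !== null && win.speechSynthesis != null;
}
```

In a browser you would call it as `supportsSpeech(window as unknown as SpeechWindowLike)` and show a fallback message when it returns false.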

Since I am using a reactive approach, I created an interface named Action with an optional payload property. All our actions implement this interface. Components subscribe to these actions and react to them.

export interface Action {
	payload?: any;
}
export class SpeakingStarted implements Action {}
export class SpeakingEnded implements Action {}

export class ListeningStarted implements Action {}
export class ListeningEnded implements Action {}

export class RecognizedTextAction implements Action {
	constructor(public payload: string) {}
}
export class SpeakAction implements Action {
	constructor(public payload: string) {}
}

In the constructor we inject NgZone. Because the Web Speech API lives outside of Angular, we need NgZone to bring its callbacks into the Angular realm, so that change detection runs when they fire.

	constructor(private zone: NgZone) {
		this.window = (window as unknown) as IWindow;

		this.createSpeaker();
		this.createListener();

		this.subscriptions();
	}

One challenge is that the synthesizer and the recognizer are separate objects. When the synthesizer speaks, the recognizer picks up its output and the bot ends up in a loop, answering itself. So we need to pause one while the other is active.

As you can see in the snippet above, we create the speaker and the listener separately, and then set up subscriptions to keep them from interfering with each other.

For example, we react to the SpeakingStarted action (as shown below): when it is received, we stop listening, and on SpeakingEnded or ListeningEnded we start listening again, as long as the bot is not speaking.

subscriptions() {
  this.getType(SpeakingStarted)
    .pipe(
      // tap(()=> console.log('will stop recognition')),
      tap(() => this.stopListening()),
      takeUntil(this.destroy$)
    )
    .subscribe();

  merge(this.getType(SpeakingEnded), this.getType(ListeningEnded))
    .pipe(
      filter(() => !this.isSpeaking),
      // tap(()=> console.log('will start recognition')),
      tap(() => this.startListening()),
      takeUntil(this.destroy$)
    )
    .subscribe();

  this.getType(SpeakAction)
    .pipe(
      tap((text) => this._speak(text)),
      takeUntil(this.destroy$)
    )
    .subscribe();
}
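To make the interplay easier to follow, here is a dependency-free sketch of the same rules written as a small state machine. It is not the service's code: `FakeListener`, `Coordinator` and the string event names are hypothetical stand-ins for the recognizer, the subscriptions and the action classes, and no RxJS is involved.

```typescript
// Stand-in for the recognizer: it only records whether it is running.
class FakeListener {
  running = false;
  start(): void {
    this.running = true;
  }
  stop(): void {
    this.running = false;
  }
}

type SpeechEvent = 'SpeakingStarted' | 'SpeakingEnded' | 'ListeningEnded';

class Coordinator {
  isSpeaking = false;
  private listener: FakeListener;

  constructor(listener: FakeListener) {
    this.listener = listener;
  }

  handle(event: SpeechEvent): void {
    if (event === 'SpeakingStarted') {
      // The bot is talking: stop listening so it does not hear itself.
      this.isSpeaking = true;
      this.listener.stop();
    } else if (event === 'SpeakingEnded') {
      // The bot is done talking: resume listening.
      this.isSpeaking = false;
      this.listener.start();
    } else if (!this.isSpeaking) {
      // ListeningEnded while silent: restart the listener.
      this.listener.start();
    }
    // ListeningEnded while speaking is ignored on purpose.
  }
}
```

The important case is the last one: stopping the recognizer produces a ListeningEnded of its own, and without the isSpeaking check that event would restart recognition while the bot is still talking.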

We need a couple of observables that the application can subscribe to. The speaker and the listener use the action$ stream to dispatch their actions.

getType is a utility method: the consumer passes the type of action it wants and receives only the payloads of those actions, for example getType(SpeakingStarted).

private _voices$ = new BehaviorSubject(null);
voices$ = this._voices$.asObservable();

private _activeVoice$ = new BehaviorSubject(null);
activeVoice$ = this._activeVoice$.asObservable();

private _action$ = new Subject<Action>();
action$ = this._action$.asObservable();

getType(action: Action | any) {
  return this.action$.pipe(
    filter((i) => i instanceof action),
    map((i) => i.payload),
    takeUntil(this.destroy$)
  );
}
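The filtering in getType relies on instanceof, which is why the actions are classes rather than plain objects. The sketch below illustrates that idea without RxJS, using a tiny hand-rolled bus. `ActionBus`, `onType` and `ActionClass` are hypothetical names invented for this illustration, and the action classes are redeclared so the snippet stands on its own.

```typescript
interface Action {
  payload?: unknown;
}

class ListeningStarted implements Action {}

class RecognizedTextAction implements Action {
  payload: string;
  constructor(payload: string) {
    this.payload = payload;
  }
}

// Any class whose instances are actions.
type ActionClass = new (...args: any[]) => Action;

class ActionBus {
  private handlers: Array<(action: Action) => void> = [];

  // Deliver an action to every registered handler.
  dispatch(action: Action): void {
    for (const handler of this.handlers) {
      handler(action);
    }
  }

  // Rough counterpart of getType: `next` receives the payload of
  // actions that are instances of `type`, and nothing else.
  onType(type: ActionClass, next: (payload: unknown) => void): void {
    this.handlers.push((action) => {
      if (action instanceof type) {
        next(action.payload);
      }
    });
  }
}
```

Usage: after `bus.onType(RecognizedTextAction, (text) => console.log(text))`, dispatching a ListeningStarted is ignored, while dispatching `new RecognizedTextAction('hello')` calls the handler with the text.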

Following the SpeechSynthesisUtterance API, we create an instance, set its parameters, and attach the event handlers we are interested in. In our case those are onstart and onend.

Each handler dispatches an action whenever it is invoked. We wrap the handler bodies in zone.run so that the actions are handled inside Angular's zone. Lastly, we load the voices.

private createSpeaker() {
  this.speaker = new SpeechSynthesisUtterance();
  this.speaker.lang = this.language;

  this.speaker.onstart = () => {
    this.zone.run(() => {
      this.isSpeaking = true;
      this._action$.next(new SpeakingStarted());
    });
  };

  this.speaker.onend = () => {
    this.zone.run(() => {
      this.isSpeaking = false;
      this._action$.next(new SpeakingEnded());
    });
  };

  this.loadVoices();
}

To load voices, we assign an onvoiceschanged handler to the speechSynthesis object on the window (this is not the speaker object). As before, we emit an event once the voices are received, and we remove the handler after its first run, because onvoiceschanged may be invoked more than once during the lifetime of our service. (We could also check whether the voices have changed, but we don't need that for now.)

private loadVoices() {
  this.window.speechSynthesis.onvoiceschanged = () => {
    this.zone.run(() => {
      const voices = this.window.speechSynthesis.getVoices();

      this.voices = voices;
      this._voices$.next(voices);

      const voice_us = voices.find((i) => {
        // console.log(i.name);
        return i.name.indexOf(this.defaultVoiceName) > -1;
      });

      this.onVoiceSelected(voice_us, false);
    });

    // We remove the handler after it has been called,
    // as we do not need it to be called again.
    this.window.speechSynthesis.onvoiceschanged = null;
  };
}
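The default voice is looked up by name, so the lookup can come back empty on a system that does not ship that voice. One possible way to handle this is a small fallback chain, sketched below. This is an assumption on my part rather than something the service does: `pickVoice` and `VoiceLike` are hypothetical, and `VoiceLike` models only the name and lang fields of a voice.

```typescript
// Minimal shape of a voice: only the fields this sketch reads.
interface VoiceLike {
  name: string;
  lang: string;
}

function pickVoice(
  voices: VoiceLike[],
  preferredName: string,
  lang: string
): VoiceLike | null {
  // 1. Same substring match on the name that loadVoices uses.
  for (const voice of voices) {
    if (voice.name.indexOf(preferredName) > -1) {
      return voice;
    }
  }
  // 2. Otherwise, any voice for the requested language.
  for (const voice of voices) {
    if (voice.lang === lang) {
      return voice;
    }
  }
  // 3. Otherwise the first voice, or null when the list is empty.
  return voices.length > 0 ? voices[0] : null;
}
```

With this in place, the result of `pickVoice(voices, 'Google US English', 'en-US')` could be handed to onVoiceSelected instead of the raw name lookup.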

As with the speaker, we instantiate our listener and set up its parameters and actions. Documentation can be found here for the parameters.

When the listener produces a result, it invokes the onresult handler with the possible results. We call the extractText method to get the text out of it and dispatch a RecognizedTextAction with the recognized text. Any component subscribed to this action gets the value without dealing with the details.

When the listener ends, we restart it so it can start over.
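To make the result handling concrete, here is a standalone sketch of the text extraction that can be run with plain objects instead of a real recognition event. The `...Like` interfaces are hypothetical and model only the fields that are read; the 0.3 and 0.6 confidence thresholds are the ones used by extractText in the full service.

```typescript
// One recognition alternative: the text and how sure the engine is.
interface AlternativeLike {
  transcript: string;
  confidence: number;
}

// A result is array-like; index 0 holds the best alternative.
interface ResultLike {
  isFinal: boolean;
  0: AlternativeLike;
}

interface ResultEventLike {
  resultIndex: number;
  results: ResultLike[];
}

interface RecognizedText {
  term: string;
  confidence: number;
  isFinal: boolean;
}

function extractText(speech: ResultEventLike): RecognizedText {
  const result = speech.results[speech.resultIndex];
  const best = result[0];
  // Final results are accepted from 0.3 confidence upwards,
  // interim results only above 0.6.
  const accepted = result.isFinal
    ? best.confidence >= 0.3
    : best.confidence > 0.6;
  return {
    term: accepted ? best.transcript.trim() : '',
    confidence: best.confidence,
    isFinal: result.isFinal,
  };
}
```

Note that a rejected result still comes back as an object, only with an empty term, so a caller may want to skip empty terms before dispatching an action.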

Full Service:

import { Observable, merge, Subject, BehaviorSubject } from 'rxjs';
import { Injectable, NgZone } from '@angular/core';
import { map, filter, tap, takeUntil } from 'rxjs/operators';

interface IWindow extends Window {
	webkitSpeechRecognition: any;
	SpeechRecognition: any;
	SpeechSynthesisUtterance: any;
}

export interface RecognizedText {
	term: string;
	confidence: number;
	isFinal: boolean;
}

export interface Action {
	payload?: any;
}
export class SpeakingStarted implements Action {}
export class SpeakingEnded implements Action {}

export class ListeningStarted implements Action {}
export class ListeningEnded implements Action {}

export class RecognizedTextAction implements Action {
	constructor(public payload: string) {}
}
export class SpeakAction implements Action {
	constructor(public payload: string) {}
}

@Injectable({
	providedIn: 'root',
})
export class SenseService {
	private defaultVoiceName = 'Google US English';
	private language = 'en-US';

	destroy$ = new Subject();

	window: IWindow;
	listener: any;
	speaker: any;

	isAllowed = true;

	voices: any[] = null;
	private _voices$ = new BehaviorSubject(null);
	voices$ = this._voices$.asObservable();

	private _activeVoice$ = new BehaviorSubject(null);
	activeVoice$ = this._activeVoice$.asObservable();

	private _action$ = new Subject<Action>();
	action$ = this._action$.asObservable();

	set isSpeaking(val: boolean) {
		this.speaker._isSpeaking = val;
	}

	get isSpeaking(): boolean {
		return !!this.speaker._isSpeaking;
	}

	set isListening(val: boolean) {
		this.listener._isListening = val;
	}

	get isListening(): boolean {
		return !!this.listener._isListening;
	}

	constructor(private zone: NgZone) {
		this.window = (window as unknown) as IWindow;

		this.createSpeaker();
		this.createListener();

		this.subscriptions();
	}

	subscriptions() {
		this.getType(SpeakingStarted)
			.pipe(
				// tap(()=> console.log('will stop recognition')),
				tap(() => this.stopListening()),
				takeUntil(this.destroy$)
			)
			.subscribe();

		merge(this.getType(SpeakingEnded), this.getType(ListeningEnded))
			.pipe(
				filter(() => !this.isSpeaking),
				// tap(()=> console.log('will start recognition')),
				tap(() => this.startListening()),
				takeUntil(this.destroy$)
			)
			.subscribe();

		this.getType(SpeakAction)
			.pipe(
				tap((text) => this._speak(text)),
				takeUntil(this.destroy$)
			)
			.subscribe();
	}

	getType(action: Action | any): Observable<any> {
		return this.action$.pipe(
			filter((i) => i instanceof action),
			map((i) => i.payload),
			takeUntil(this.destroy$)
		);
	}

	private createSpeaker() {
		const key = '_ms_Speaker';
		if (this.window[key]) {
			console.log('speaker found');
			this.speaker = this.window[key];
			return;
		}

		this.speaker = new SpeechSynthesisUtterance();
		this.window[key] = this.speaker;
		// this.speaker.voiceURI = 'native';
		// this.speaker.volume = 1; // 0 to 1
		// this.speaker.rate = 1; // 0.1 to 10
		// this.speaker.pitch = 0; //0 to 2
		// this.speaker.text = 'Hello World';
		this.speaker.lang = this.language;

		this.speaker.onstart = () => {
			this.zone.run(() => {
				this.isSpeaking = true;
				this._action$.next(new SpeakingStarted());
			});
		};

		this.speaker.onend = () => {
			this.zone.run(() => {
				this.isSpeaking = false;
				this._action$.next(new SpeakingEnded());
			});
		};

		this.loadVoices();
	}

	private loadVoices() {
		this.window.speechSynthesis.onvoiceschanged = () => {
			this.zone.run(() => {
				const voices = this.window.speechSynthesis.getVoices();

				this.voices = voices;
				this._voices$.next(voices);

				const voice_us = voices.find((i) => {
					// console.log(i.name);
					return i.name.indexOf(this.defaultVoiceName) > -1;
				});

				this.onVoiceSelected(voice_us, false);
			});

			// We remove the handler after it has been called,
			// as we do not need it to be called again.
			this.window.speechSynthesis.onvoiceschanged = null;
		};
	}

	private createListener() {
		const key = '_ms_Listener';
		if (this.window[key]) {
			console.log('recognition found');
			this.listener = this.window[key];
			this.startListening();
			return;
		}

		const webkitSpeechRecognition = this.window.webkitSpeechRecognition;
		this.listener = new webkitSpeechRecognition();
		this.window[key] = this.listener;
		this.listener.continuous = true;
		this.listener.interimResults = true;
		this.listener.lang = this.language;
		this.listener.maxAlternatives = 1;
		this.listener.maxResults = 25;

		this.listener.onstart = () => {
			this.zone.run(() => {
				this.isListening = true;
				this._action$.next(new ListeningStarted());
			});
		};

		this.listener.onresult = (speech) => {
			if (speech.results) {
				let term: RecognizedText;
				term = this.extractText(speech);
				// console.log(term)

				if (term.isFinal) {
					this.zone.run(() => {
						this._action$.next(new RecognizedTextAction(term.term));
					});
				}
			}
		};

		this.listener.onerror = (error) => {
			if (error.error === 'no-speech') {
				// Silence is expected; nothing to do.
			} else if (error.error === 'not-allowed') {
				this.isAllowed = false;
			} else {
				console.error(error.error);
			}
		};

		this.listener.onend = () => {
			this.zone.run(() => {
				// console.log('recognition onend');
				this.isListening = false;
				this._action$.next(new ListeningEnded());
			});
		};

		this.startListening();
	}

	private stopListening() {
		this.listener.stop();
	}

	private startListening() {
		// Nothing to do if recognition is already running.
		if (this.isListening) {
			return;
		}
		try {
			console.log('recognition started');

			if (!this.isAllowed) {
				return;
			}

			this.listener.start();
		} catch {}
	}

	onVoiceSelected(voice: any, speak = true) {
		this.speaker.voice = voice;
		this._activeVoice$.next(voice);

		if (speak) this._speak('Hello');
	}

	extractText(speech: any): RecognizedText {
		let term = '';
		let result = speech.results[speech.resultIndex];
		let transcript = result[0].transcript;
		let confidence = result[0].confidence;
		if (result.isFinal) {
			if (result[0].confidence < 0.3) {
				// console.log('Not recognized');
			} else {
				term = transcript.trim();
				// console.log(term);
			}
		} else {
			if (result[0].confidence > 0.6) {
				term = transcript.trim();
			}
		}
		// return term;
		return <RecognizedText>{
			term,
			confidence,
			isFinal: result.isFinal,
		};
	}

	speak(text: string) {
		this._action$.next(new SpeakAction(text));
	}

	private _speak(text: string): void {
		console.log('speaking...');
		this.speaker.text = text;
		this.window.speechSynthesis.speak(this.speaker);
	}
}

Usage on any component/service:

// Requires: import { debounceTime, tap, takeUntil } from 'rxjs/operators';
this.senseService.getType(RecognizedTextAction).subscribe((text) => console.log(text));
this.senseService.speak('test speak');
// or
this.senseService
  .getType(RecognizedTextAction)
  .pipe(
    debounceTime(200),
    tap((msg) => {
      // process ...
      this.senseService.speak(`response....`);
    }),
    takeUntil(this.destroy$)
  )
  .subscribe();

Demo can be found here.