voice-engine / make-a-smart-speaker
A collection of resources to make a smart speaker
How can I put in the wake word? I have tried many times.
The code:
import speech_recognition as sr  # recognise speech
import playsound  # to play an audio file
from gtts import gTTS  # google text to speech
import random
from time import ctime  # get time details
import webbrowser  # open browser
import yfinance as yf  # to fetch financial data
import ssl
import certifi
import time
import os  # to remove created audio files

class person:
    name = ''
    def setName(self, name):
        self.name = name

def there_exists(terms):
    for term in terms:
        if term in voice_data:
            return True

r = sr.Recognizer()  # initialise a recogniser

def record_audio(ask=False):
    with sr.Microphone() as source:  # microphone as source
        if ask:
            speak(ask)
        audio = r.listen(source)  # listen for the audio via source
        voice_data = ''
        try:
            voice_data = r.recognize_google(audio)  # convert audio to text
        except sr.UnknownValueError:  # error: recognizer does not understand
            speak('Can you repeat')
        except sr.RequestError:
            speak('Sorry, the service is down')  # error: recognizer is not connected
        print(f">> {voice_data.lower()}")  # print what user said
        return voice_data.lower()

def speak(audio_string):
    tts = gTTS(text=audio_string, lang='en')  # text to speech (voice)
    rand = random.randint(1, 20000000)  # random name to avoid overwriting files
    audio_file = 'audio' + str(rand) + '.mp3'
    tts.save(audio_file)  # save as mp3
    playsound.playsound(audio_file)  # play the audio file
    print(f"kiri: {audio_string}")  # print what app said
    os.remove(audio_file)  # remove audio file

wake = ["welcome", "hey sara"]

def respond(voice_data):
    # 1: greeting
    if there_exists(['hey', 'hi', 'hello']):
        greetings = [f"hey, how can I help you {person_obj.name}", f"hey, what's up? {person_obj.name}", f"I'm listening {person_obj.name}", f"how can I help you? {person_obj.name}", f"hello {person_obj.name}"]
        greet = greetings[random.randint(0, len(greetings) - 1)]
        speak(greet)
    # 2: name
    if there_exists(["what is your name", "what's your name", "tell me your name"]):
        if person_obj.name:
            speak("my name is sara")
        else:
            speak("my name is sara. what's your name?")
    if there_exists(["my name is"]):
        person_name = voice_data.split("is")[-1].strip()
        speak(f"okay, i will remember that {person_name}")
        person_obj.setName(person_name)  # remember name in person object
    # 3: greeting
    if there_exists(["how are you", "how are you doing"]):
        speak(f"I'm very well, thanks for asking {person_obj.name}")
    # 4: time
    if there_exists(["what's the time", "tell me the time", "what time is it"]):
        current_time = ctime().split(" ")[3].split(":")[0:2]  # renamed so the time module is not shadowed
        if current_time[0] == "00":
            hours = '12'
        else:
            hours = current_time[0]
        minutes = current_time[1]
        speak(f'{hours} {minutes}')
    # 5: search google
    if there_exists(["search for"]) and 'youtube' not in voice_data:
        search_term = voice_data.split("for")[-1]
        url = f"https://google.com/search?q={search_term}"
        webbrowser.get().open(url)
        speak(f'Here is what I found for {search_term} on google')
    # 6: search youtube
    if there_exists(["youtube"]):
        search_term = voice_data.split("for")[-1]
        url = f"https://www.youtube.com/results?search_query={search_term}"
        webbrowser.get().open(url)
        speak(f'Here is what I found for {search_term} on youtube')
    # 7: get stock price
    if there_exists(["price of"]):
        search_term = voice_data.lower().split(" of ")[-1].strip()  # strip removes whitespace before/after the term
        stocks = {
            "apple": "AAPL",
            "microsoft": "MSFT",
            "facebook": "FB",
            "tesla": "TSLA",
            "bitcoin": "BTC-USD"
        }
        try:
            stock = yf.Ticker(stocks[search_term])
            price = stock.info["regularMarketPrice"]
            speak(f'price of {search_term} is {price} {stock.info["currency"]} {person_obj.name}')
        except Exception:
            speak('oops, something went wrong')
    # 8: to stop
    if there_exists(["exit", "quit", "goodbye"]):
        speak("going offline")
        exit()
    # 9: thank-you response
    if there_exists(["thank you", "thanks", "you are the best"]):
        speak(" welcome sir ")
    time.sleep(0.5)

person_obj = person()

while True:
    voice_data = record_audio()  # get the voice input
    respond(voice_data)  # respond
Sorry, I am new to GitHub.
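One way to use the `wake` list from the code above is to gate `respond` behind a wake-phrase check. Here is a minimal sketch; the `heard_wake_word` helper is hypothetical (not from the original code), and the commented loop assumes the `record_audio` and `respond` functions defined above:

```python
wake = ["welcome", "hey sara"]

def heard_wake_word(voice_data):
    """Return True if any wake phrase appears in the recognised text."""
    return any(phrase in voice_data for phrase in wake)

# hypothetical main loop: only respond after a wake phrase is heard
# while True:
#     if heard_wake_word(record_audio()):
#         respond(record_audio('yes?'))
```

Note that this still runs full speech recognition on every utterance; a real wake-word engine (Snips, Porcupine, etc.) listens locally and only triggers the recogniser afterwards.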
Hi, this is a very nice resource. Perhaps you could also add the Snips wake-word engine and ASR engines to the list; we are open-sourcing them over time.
Take a look at https://snips.ai: you can build your own assistant with the console, download the platform, use it with any audio frontend (beamforming, AEC) that you want, and code any skill that you like.
Do you know Rhasspy? It's an open source, fully offline voice assistant toolkit for many languages that works well with Home Assistant, Hass.io, and Node-RED. Especially since Snips has been bought by Sonos, there are a lot of new users and developers joining the project and development is going on rapidly.
part of Smart Speaker from Scratch
To make a smart speaker, the first thing to do in terms of software is to learn how to record and play audio.
Most smart speakers, except Apple's, run Linux, so let's get started with audio on Linux.
We mainly focus on ALSA (Advanced Linux Sound Architecture) here. ALSA has two parts:
one is the sound card drivers in the Linux kernel, which drive the audio hardware and provide a unified interface;
the other is the user-space ALSA library, which has a user-friendly API (compared to the ALSA kernel API), some utilities and a plugin system. The plugin system has plugins such as `plug`, `dmix`, `dsnoop` and `pulse`, which provide functions like resampling, mixing and channel mapping.
ALSA alone is not powerful enough. For example, we can use the ALSA `dmix` plugin to let multiple applications play through one sound card at the same time, but it is very limited: it is hard to adjust the volume of each application individually.
So we have PulseAudio, which is a sound server that manages all the audio sources and sinks. It can handle not just ALSA audio, but also Bluetooth audio and network audio. When a USB sound card is plugged in, PulseAudio will detect it and redirect playback to it (based on settings). Nowadays, PulseAudio is installed by default on most Linux desktop distributions. Similar to PulseAudio, we also have JACK and PipeWire. JACK mainly focuses on realtime, low-latency professional audio applications. PipeWire is an ongoing project which aims to support the use cases currently handled by both PulseAudio and JACK, and at the same time provide the same level of powerful handling of video input and output.
Since ALSA, PulseAudio and JACK all have their own APIs, which one should we choose?
The short answer is the ALSA API for most recording/playback applications, as PulseAudio and JACK both provide ALSA plugins to support applications that use the ALSA API. It's like APP3 in the above graph. The author of PulseAudio, Lennart Poettering, also recommended a safe ALSA subset API for basic PCM audio playback/capturing.
So let's start to use ALSA to record & play audio.
For raw audio or wav audio, we can just use `aplay` and `arecord`. Actually, they are the same program with different names; you can run `ls -la /usr/bin/arecord` to find that out. `aplay` and `arecord` have a bunch of parameters, but it is worth learning them (we will use these parameters in the code in an extremely simple way later). The best way to learn how to use `aplay` and `arecord` is to read the help of `aplay -h` or `man arecord` and practice:
$ aplay -h
Usage: aplay [OPTION]... [FILE]...
-h, --help help
--version print current version
-l, --list-devices list all soundcards and digital audio devices
-L, --list-pcms list device names
-D, --device=NAME select PCM by name
-q, --quiet quiet mode
-t, --file-type TYPE file type (voc, wav, raw or au)
-c, --channels=# channels
-f, --format=FORMAT sample format (case insensitive)
-r, --rate=# sample rate
-d, --duration=# interrupt after # seconds
-M, --mmap mmap stream
-N, --nonblock nonblocking mode
-F, --period-time=# distance between interrupts is # microseconds
-B, --buffer-time=# buffer duration is # microseconds
--period-size=# distance between interrupts is # frames
--buffer-size=# buffer duration is # frames
-A, --avail-min=# min available space for wakeup is # microseconds
-R, --start-delay=# delay for automatic PCM start is # microseconds
(relative to buffer size if <= 0)
-T, --stop-delay=# delay for automatic PCM stop is # microseconds from xrun
-v, --verbose show PCM structure and setup (accumulative)
-V, --vumeter=TYPE enable VU meter (TYPE: mono or stereo)
-I, --separate-channels one file for each channel
-i, --interactive allow interactive operation from stdin
-m, --chmap=ch1,ch2,.. Give the channel map to override or follow
--disable-resample disable automatic rate resample
--disable-channels disable automatic channel conversions
--disable-format disable automatic format conversions
--disable-softvol disable software volume control (softvol)
--test-position test ring buffer position
--test-coef=# test coefficient for ring buffer position (default 8)
expression for validation is: coef * (buffer_size / 2)
--test-nowait do not wait for ring buffer - eats whole CPU
--max-file-time=# start another output file when the old file has recorded
for this many seconds
--process-id-file write the process ID here
--use-strftime apply the strftime facility to the output file name
--dump-hw-params dump hw_params of the device
--fatal-errors treat all errors as fatal
Recognized sample formats are: S8 U8 S16_LE S16_BE U16_LE U16_BE S24_LE S24_BE U24_LE U24_BE S32_LE S32_BE U32_LE U32_BE FLOAT_LE FLOAT_BE FLOAT64_LE FLOAT64_BE IEC958_SUBFRAME_LE IEC958_SUBFRAME_BE MU_LAW A_LAW IMA_ADPCM MPEG GSM SPECIAL S24_3LE S24_3BE U24_3LE U24_3BE S20_3LE S20_3BE U20_3LE U20_3BE S18_3LE S18_3BE U18_3LE U18_3BE G723_24 G723_24_1B G723_40 G723_40_1B DSD_U8 DSD_U16_LE DSD_U32_LE DSD_U16_BE DSD_U32_BE
Some of these may not be available on selected hardware
The available format shortcuts are:
-f cd (16 bit little endian, 44100, stereo)
-f cdr (16 bit big endian, 44100, stereo)
-f dat (16 bit little endian, 48000, stereo)
To record 10 seconds of 48000 Hz, 2-channel, S16_LE format audio:
arecord -d 10 -r 48000 -c 2 -f S16_LE audio.wav # or arecord -d 10 -f dat audio.wav
When debugging a sound card, we want to check whether there is any sound; we can use `-V mono` or `-V stereo` to show a VU meter. To know the details of the plugins and settings in use, `-v` can be used. Or to loop back:
arecord -v -f dat - | aplay -v -Vstereo -
To choose a PCM device, use `-D {name}`; use `aplay -L` to get a list of available PCM devices. PCM devices are different from sound cards: PCM devices can be created by ALSA plugins. When PulseAudio is installed, the ALSA `pulse` plugin will create a PCM device named `pulse`, which will be set as the default PCM device. ALSA plugins are stackable. For example, we can use `plug` and `dsnoop` together to run two recorders in two terminals:
arecord -v -D plug:dsnoop:0 -f dat -V stereo /dev/null
Or use `plug` and `dmix` to let one sound card play multiple audio streams:
aplay -v -D plug:dmix:0 music.wav
Next, we will use `aplay` and `arecord` in code. To record, we can run `arecord` writing raw audio to a stdout pipe, and then read the audio from that pipe. In C code, it is:
#include <stdio.h>

int main(int argc, char **argv) {
    char *cmd = "arecord -D plughw:0 -f S16_LE -c 1 -r 16000 -t raw -q -";
    char buf[256];
    FILE *fp = popen(cmd, "r");
    for (int i = 0; i < 16; i++) {
        int result = fread(buf, 1, sizeof(buf), fp);
        printf("read %d bytes\n", result);
    }
    pclose(fp);
    return 0;
}
To play sound:
#include <stdio.h>

int main(int argc, char **argv) {
    char *cmd = "aplay -D plughw:0 -f S16_LE -c 1 -r 16000 -t raw -q -";
    char buf[256] = {0,};
    FILE *fp = popen(cmd, "w");  /* "w": we write audio into aplay's stdin */
    for (int i = 0; i < 16; i++) {
        int result = fwrite(buf, 1, sizeof(buf), fp);
        printf("write %d bytes\n", result);
    }
    pclose(fp);
    return 0;
}
In Python code, it is quite simple too:
import subprocess
import audioop
cmd = ['arecord', '-D', 'plughw:0', '-f', 'S16_LE', '-c', '1', '-r', '16000', '-t', 'raw', '-q', '-']
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
for i in range(16):
    audio = process.stdout.read(160 * 2)  # 160 frames of 2-byte samples (10 ms at 16 kHz)
    print(audioop.rms(audio, 2))
process.stdout.close()
process.terminate()
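The Python example above only records; for symmetry with the C pair, a playback counterpart can pipe raw samples into `aplay`'s stdin. This is a minimal sketch: the `sine_pcm` and `play_raw` helpers are hypothetical names, and the device `plughw:0` is assumed to exist on your machine.

```python
import math
import struct
import subprocess

def sine_pcm(freq=440, rate=16000, seconds=1, amp=10000):
    """Generate signed 16-bit little-endian mono PCM: a sine tone."""
    return b''.join(
        struct.pack('<h', int(amp * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(rate * seconds)
    )

def play_raw(data, device='plughw:0', rate=16000):
    """Pipe raw S16_LE mono samples into aplay's stdin."""
    cmd = ['aplay', '-D', device, '-f', 'S16_LE', '-c', '1',
           '-r', str(rate), '-t', 'raw', '-q', '-']
    process = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    process.stdin.write(data)
    process.stdin.close()
    process.wait()

# play_raw(sine_pcm())  # uncomment on a machine with ALSA and a sound card
```

Like the `arecord` example, no audio library is needed; the pipe to `aplay` does all the device handling.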
We can also use the ALSA API directly, but then we have to learn a variety of APIs just to set parameters, which is a little hard to use. See pcm.c to learn how to play a sine signal in different ways such as sync, async and poll.
If you want cross-platform support, you may use PortAudio or libsoundio. There is a comparison between libsoundio and PortAudio.
For Python programs, it seems libsoundio doesn't have a mature Python binding, while PortAudio has pyaudio and python-sounddevice.
When using PortAudio (pyaudio or sounddevice), PortAudio will iterate over all PCM devices and try to open them, which produces some error logs that are a little annoying.
Using pyaudio to record audio is simple; here is an example from https://people.csail.mit.edu/hubert/pyaudio/:
"""PyAudio example: Record a few seconds of audio and save to a WAVE file."""
import pyaudio
import wave
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "output.wav"
p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)
print("* recording")
frames = []
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("* done recording")
stream.stop_stream()
stream.close()
p.terminate()
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
IMHO, for a simple audio recording/playback program, the best way is just to use `aplay` and `arecord`. No extra dependencies are required, the stdin/stdout pipe provides a buffering mechanism, and it's stable and multi-process (A++ for multi-core ARM processors).
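To illustrate that multi-process point, the record and play pipes can be chained in one script, the Python equivalent of `arecord ... | aplay ...`. This is a sketch under assumptions: the `alsa_cmd` and `loopback` helpers are hypothetical names, and the device `plughw:0` is assumed to exist.

```python
import subprocess

def alsa_cmd(tool, device='plughw:0', rate=16000):
    """Build an arecord/aplay command line for raw S16_LE mono audio."""
    return [tool, '-D', device, '-f', 'S16_LE', '-c', '1',
            '-r', str(rate), '-t', 'raw', '-q', '-']

def loopback(seconds=5, rate=16000):
    """Pipe arecord straight into aplay: a software monitor of the mic."""
    rec = subprocess.Popen(alsa_cmd('arecord', rate=rate),
                           stdout=subprocess.PIPE)
    play = subprocess.Popen(alsa_cmd('aplay', rate=rate),
                            stdin=subprocess.PIPE)
    # copy 320-byte (10 ms) chunks; each process runs on its own core
    for _ in range(int(seconds * rate * 2 / 320)):
        play.stdin.write(rec.stdout.read(320))
    rec.terminate()
    play.stdin.close()
    play.wait()

# loopback()  # uncomment on a machine with a microphone and speaker
```

The two subprocesses buffer independently, so a slow Python loop in the middle (e.g. running a wake-word detector on each chunk) does not immediately cause an xrun.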