Very early experiments with conversational AI in python. Pipeline: Speech-to-text -> text sent to LLM -> LLM text response sent to text-to-speech.
The code is ugly and barely functional and subject to large breaking changes. I've only started making it generic enough to make it easier to plugin new services.
- Set up virtual environment
python -m venv .venv
source .venv/bin/activate
- Install packages:
- For Assembly AI:
1.Make sure to install
apt install portaudio19-dev
(Debian/Ubuntu) orbrew install portaudio
(MacOS) before installing - For online services only: (assembly, openai, elevenlabs)
pip -r requirements.txt
- For Assembly AI:
1.Make sure to install
- For whisper local:
pip -r requirements_whisper.txt
- Make a copy of
.env.sample
and rename in.env
- Add your API keys to
.env
Without very good (hardware) echo cancellation on your mic it will hear the text-to-speech output, making client.py
pretty useless (for now). I've made a conversational.py
that uses Assembly, OpenAI, and Elevenlabs and only listens when it is not thinking/speaking.
This supports various options for input and presence detection (webcam). Some of these may require passing the index of the microphone or webcam if you have more than one. You can see the enumerations using --list-devices
python conversational.py --list-devices
There is also support for playing a video (using mpv) during the conversation and displaying the transcription text on the video.
For example the full arguments I used with the Seeed USB speaker and a webcam at a recent prototype installation was:
python3 conversational.py --video experimance.mp4 --presence-idle 20 --presence-threshold 1e9 --input-index 1 --input-channels 6 --sample-rate 16000 --webcam 2 --video-volume 0.9 --onscreen-display
The client.py
file is an older version that using async but handles turn taking poorly. Don't use it.
Run client with online services: (defaults: assembly.ai for speech-to-text (STT), OpenAI for LLM, elevenlabs for text-to-speech (TTS))
python client.py --stt assembly --tts elevenlabs
Run local whisper server (with NVIDIA card):
python server.py
Run client with local whisper running on another machine:
python client.py --host 192.168.2.196 --model "large-v3"
- detect voice while LLM speaking to stop and listen
- litellm / langchain for LLM management
https://github.com/deepgram/deepgram-python-sdk
pip install deepgram-sdk==3.*
https://github.com/AssemblyAI/assemblyai-python-sdk
pip install -U assemblyai "assemblyai[extras]"
https://github.com/openai/openai-python
pip install openai
https://github.com/elevenlabs/elevenlabs-python
pip install elevenlabs
-
applied pull:
-
for custom WhisperLive (that updates when changing files):
cd WhisperLife python -m pip install --editable .
For sentence detection. https://github.com/bminixhofer/wtpsplit
wtpsplit
- prevent output from interfering with speech-2-text
- Acoustic Echo Cancellation
- https://github.com/fjiang9/NKF-AEC
- https://github.com/gkamradt/QuickAgent
- Deepgram + groq
- https://github.com/deepgram-devs/deepgram-conversational-demo