I bought a new desktop PC this year with the goal of getting back into GPU-related programming and learning about AI. I wasn’t comfortable with all the sign-up requirements for products like ChatGPT and AI image generation sites, and thought running open source equivalents locally would be better. Up until recently I’d been playing with Stable Diffusion via various UIs locally, but still needed to try running a text generation LLM. I’d previously tried running Alpaca-electron, but experienced a lot of instability issues.
After a few minutes of searching, this blog led me to text-generation-webui, which proclaims a goal of being the Automatic1111 of text generation. This sounded like just what I wanted, and it was.
Requirements / Environment
- I’m running Ubuntu, but the same process probably works on Windoze. To follow along exactly you need a terminal and `git`.
- An Nvidia GPU, although the app apparently works CPU-only too (probably very slow/limited).
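If you want to confirm the prerequisites up front, a quick check like this works (a sketch; `nvidia-smi` will simply be absent on non-Nvidia machines, which is fine for CPU-only):

```shell
# Quick check that the tools this walkthrough relies on are available.
# (nvidia-smi is Nvidia's driver utility; missing it just means CPU-only.)
missing=""
for cmd in git nvidia-smi; do
  command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
done
echo "missing tools:${missing:- none}"
```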
Setting up text-generation-webui
text-generation-webui is a webapp you can run locally that supports a whole bunch of LLM stuff that I don’t understand. But it seems to use a convention-over-configuration approach, meaning most of the defaults just work, and you can configure things once you know what you’re doing.
The steps to get up and running are simple:

- `cd` to the location you want to install it: `cd /path/to/install`
- Clone the Git repo and enter the created directory: `git clone git@github.com:oobabooga/text-generation-webui.git && cd text-generation-webui`
- Run the script `./start_linux.sh`, which automatically installs dependencies and then runs the app.

When prompted, I chose `NVIDIA` for GPU, and `N` to go with the latest CUDA versions. Once everything is downloaded, you will be up and running.
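The steps above, collected into one shell function for reference (the install path is a placeholder, and I’ve used the https clone URL here since it doesn’t require GitHub SSH keys to be set up):

```shell
# The setup steps from above as a single function.
# /path/to/install is a placeholder; replace with your own location.
setup_webui() {
  cd /path/to/install || return 1
  # https URL avoids needing GitHub SSH keys configured
  git clone https://github.com/oobabooga/text-generation-webui.git || return 1
  cd text-generation-webui || return 1
  ./start_linux.sh   # installs dependencies on first run, then starts the app
}
```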
Installing a model
The app is capable of a lot, but it can’t do anything without an LLM model to work with.
Once the app has started, navigate to http://127.0.0.1:7860/ (where the app is listening) and switch to the Model tab at the top. You now need to pick a model. I really don’t know much about these, but HuggingFace is the “GitHub of models”. You’ll find models of all different types, differing by quantization method and other stuff I don’t really understand (YET!).
I read about a recently released model that had been getting a lot of praise, OpenHermes2.5. So I looked on HuggingFace and found TheBloke/OpenHermes-2.5-neural-chat-v3-3-Slerp-GPTQ, which is a quantized version of Yağız Çalık’s merged model. The notes explain that different quantization parameters are available on different branches. I chose the 4bit-32g version solely because it had “highest inference quality”.
The model variant you choose will depend on the conditions of your computing environment (eg. the VRAM of your graphics card).
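As a rough rule of thumb (weights only, ignoring context cache and other overhead), a quantized model needs about parameters × bits ÷ 8 bytes of VRAM:

```shell
# Back-of-envelope VRAM estimate for model weights:
#   bytes ≈ parameter_count * bits_per_weight / 8
# Real usage is higher: context cache, activations, overhead.
params=7000000000   # a 7B-parameter model
bits=4              # 4-bit quantization
bytes=$(( params * bits / 8 ))
echo "approx weight size: $(( bytes / 1024 / 1024 / 1024 )) GiB"
```

So a 4-bit 7B model fits comfortably on an 8 GB card, while the same model at 8 bits would be tighter.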
Whatever model you choose, enter the identifier in the Download text field in the format `username/model:branch`. So in our case it’s:

`TheBloke/OpenHermes-2.5-neural-chat-v3-3-Slerp-GPTQ:gptq-4bit-32g-actorder_True`
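Just to make the format concrete, the identifier splits apart like this (plain shell parameter expansion, nothing app-specific):

```shell
# Split a username/model:branch identifier into its parts.
id="TheBloke/OpenHermes-2.5-neural-chat-v3-3-Slerp-GPTQ:gptq-4bit-32g-actorder_True"
user="${id%%/*}"      # everything before the first slash
rest="${id#*/}"       # everything after the first slash
model="${rest%%:*}"   # everything before the colon
branch="${rest#*:}"   # everything after the colon
echo "user=$user model=$model branch=$branch"
```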
Clicking `Get file list` will confirm that it’s working. Then click `Download` to start pulling it to your machine. You can watch the progress in the terminal.
Once the model has downloaded, you’re ready to start chatting away. Select the `Chat` tab in the top navigation bar.
Remember that each time you restart the app, you’ll need to “Load” (not Download) the Model using the list at the top left under the Model tab.
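If you’d rather not click Load every time, the project’s README documents a `--model` flag that loads a model at startup. The argument is the directory name the download step created under `models/`; the name below is my guess at the pattern, so verify it with `ls models/` first:

```shell
# Launch with a model preloaded instead of using the Load dropdown.
# MODEL_DIR must match the directory under models/; the name below is a
# guess at the naming pattern, verify with: ls models/
MODEL_DIR="TheBloke_OpenHermes-2.5-neural-chat-v3-3-Slerp-GPTQ_gptq-4bit-32g-actorder_True"
echo "./start_linux.sh --model $MODEL_DIR"
# (echoed rather than executed here so you can sanity-check the command first)
```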
I actually tried a Llama-2 model before OpenHermes2.5, but the difference in quality and speed when I switched to OpenHermes was so insane that I skipped mentioning it.
Launcher
Now that you’re able to chat, you might find it convenient to create an Ubuntu launcher so that you don’t have to run the script from the terminal each time you want to start it up.
See my blog post on how to do that: naddr1qq…zvaf