Why Nostr? What is Njump?
2024-05-04 01:09:32

Learning from Nostr - Take 2

Cyborg Emo

In the quest to teach an LLM the wisdom on Nostr, things are progressing and getting more real. I wanted to spend more time on filtering notes better, i.e. choosing the “meaningful” ones: separating chat from encyclopedia material while still including opinions. Individual opinions matter a lot. We can’t all get together and write books, but we can argue about things that are happening around us relatively easily and fast, and those arguments matter too. In the future, LLMs could start learning in real time; I think they will become more relevant at that point.

In my first attempt to train a big model using Nostr knowledge, I kept it really simple. I just used a web of trust scoring that I developed earlier. Each pubkey is assigned a score, and about 320k notes from the high web of trust set were included in my initial training. This included all kinds of content from those people, including daily chatter and short one-word replies like “GM”. The result of that work is here: https://huggingface.co/some1nostr/Ostrich-70B (Version: 3295) This model will be upgraded later, but currently it holds the result of that initial experiment.

This still taught the model a lot of things. Link: naddr1qv…35j6 I think it is because of the high number of notes: even though they had little filtering, it appears that LLMs don’t come undone easily when you feed them very unstructured data. They keep their integrity when you push chatter at them. But if you overfit them, they lose abilities, for sure. A couple of times they forgot how to do paragraphs, because I was feeding them a lot of space characters where a paragraph would have been more appropriate. I try to keep it simple. I should switch to JSON at some point; right now the training material is in TXT files.

Now I want to curate more, because training is costly and Llama 3 405B may arrive soon. Things will be a lot slower when you want to train a 405-billion-parameter model, so I want to curate carefully to cut training costs. The curation currently comprises a few steps.

1. Storing kind 0’s

This will be used in step 3. An LLM won’t understand public key strings (npub1……….); it will just see a sequence of characters that doesn’t make sense. In the future this may be different. Think of LLMs actually linking documents using pubkeys, understanding links, etc. They do a bad job when generating links, which tells me they don’t actually learn the link. For links to work, the exact string has to be memorized, but LLMs are probabilistic. One may generate nostr.com as well as nostr.co or nostr.mom in the same context, but each of these means something completely different even though only one letter changes. (LLMs actually work with sequences of tokens rather than letters, but this is just to give an example.)
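A minimal sketch of what storing kind 0s could look like. The event shape is standard NIP-01 (kind-0 `content` is a JSON-encoded profile), but the helper itself and its preference for `display_name` over `name` are just my illustration:

```python
import json

def build_name_map(kind0_events):
    """Map hex pubkeys to display names from kind-0 (metadata) events.

    kind0_events: iterable of Nostr events as dicts, e.g.
    {"kind": 0, "pubkey": "ab12...", "content": '{"name": "alice"}'}
    """
    names = {}
    for ev in kind0_events:
        if ev.get("kind") != 0:
            continue
        try:
            meta = json.loads(ev.get("content", "") or "{}")
        except json.JSONDecodeError:
            continue  # malformed metadata, skip
        name = meta.get("display_name") or meta.get("name")
        if name:
            names[ev["pubkey"]] = name.strip()
    return names

events = [
    {"kind": 0, "pubkey": "ab12", "content": '{"name": "alice"}'},
    {"kind": 1, "pubkey": "cd34", "content": "hello"},
]
print(build_name_map(events))  # {'ab12': 'alice'}
```

The resulting map can then be used in step 3 to replace opaque pubkey strings with something the model can actually learn from.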

2. Filtering based on simple algo

In this step I apply some simple and quick algorithms.

The web of trust score allows much of the spam to be effectively disregarded. Nostr is super open, to everyone and every bot, so there has to be some kind of filtering in everything we do.

Short content is removed (I am only interested in notes of 100+ characters).

Notes with too many tags (10+) are removed.

Notes that contain long unbroken strings of characters are removed (these are probably base64 encodings of something).

Notes with too low a letter ratio are removed (these consist mostly of numbers or symbols).

The result of this step is that most notes are gone, and I end up with 1.6 million notes to carry into the next steps.
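The filters above could be sketched like this. The web of trust cutoff and the 0.6 letter-ratio threshold are illustrative values, not the exact ones used:

```python
import re

def passes_filters(note, wot_scores, min_wot=0.5):
    """Quick pre-filter for notes, following the rules above.

    note: dict with "pubkey", "content", "tags" (NIP-01 style).
    wot_scores / min_wot stand in for the web-of-trust scoring.
    """
    if wot_scores.get(note["pubkey"], 0.0) < min_wot:
        return False                      # below web-of-trust cutoff
    text = note.get("content", "")
    if len(text) < 100:
        return False                      # too short to be "meaningful"
    if len(note.get("tags", [])) >= 10:
        return False                      # too many tags
    if re.search(r"\S{100,}", text):
        return False                      # long unbroken string, likely base64
    letters = sum(c.isalpha() for c in text)
    if letters / len(text) < 0.6:         # 0.6 is an assumed threshold
        return False                      # mostly numbers/symbols
    return True
```

Each check is cheap string work, so this pass can churn through millions of notes before any LLM is involved.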

3. Editing notes to make more sense for LLM

LLMs do not understand links, and they don’t follow them. So ‘http://…’ has to be converted to something like ‘[link]’.

nostr:naddr1….. are converted to [pubkey].

nostr:note1…… are converted to [note].

etc.

I am sure this is not the best way to do it. If we skipped this step, things could still work, but I think this speeds up learning: instead of the LLM going through all those characters and spending precious tokens on them, I make them shorter.
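A rough sketch of these substitutions with regexes. The exact set of identifier prefixes handled and the placeholder strings are assumptions:

```python
import re

def simplify_note(text):
    """Replace links and Nostr identifiers with short placeholders,
    as described above, so the model doesn't waste tokens on them."""
    text = re.sub(r"https?://\S+", "[link]", text)
    text = re.sub(r"nostr:npub1\w+", "[pubkey]", text)
    text = re.sub(r"nostr:naddr1\w+", "[pubkey]", text)
    text = re.sub(r"nostr:note1\w+", "[note]", text)
    text = re.sub(r"nostr:nevent1\w+", "[note]", text)
    return text

print(simplify_note("see nostr:note1abc and https://nostr.com"))
# see [note] and [link]
```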

4. Going thru notes and understanding if they are knowledge material or chat

This is the most time-consuming step: using another LLM to read each note and decide whether or not to include it in the training set.

This is what I used in the system message:

You are a machine that filters tweets.
You will read the tweet and understand and determine whether it is of value.
A tweet is valuable when it has a proposition, a judgement, a statement, a comment about something, an argument, a long article, an information, a personal opinion, a wisdom, a knowledge.
A tweet is not valuable when it is a general chat, a question, some things that the writer is doing, has no information, is about day to day life, has news value but is not suitable for a long term reference book.
Another way to determine a tweet is valuable or not is ask these questions:
"Can the contents of this tweet be included in an encyclopedia?".
"Can the contents of this tweet be included in a reference book?".
"Can the contents of this tweet be used as an argument?".
If the answer to any of those questions is yes, then the tweet is valuable.
A longer tweet is usually more valuable.
In the first line you will get a web of trust score (wot) as part of the input. This shows how trustable the writer of the tweet is. You can use this data to help judge.
In the second line and the following lines you will get the tweet.
If the tweet has [link] and it talks about the link, it has not much value.
If you don't understand what the tweet is about, it has no value.

Then I gave it a few examples (few-shot). My other model did well here because it is based on Llama 3 and already knows a bit about Nostr: https://huggingface.co/some1nostr/Emu-70B-Llama3 This model spends about 1.5 seconds per note.

I also used Llama 3 8B to speed things up in the low web of trust areas. It is much faster but sometimes disagrees with the 70B version (it should disagree sometimes, since it is dumber). So what I do is use the 8B first; if the 8B accepts a note, I check it again with the 70B. I have to make sure things are of value with the 70B. This effectively gives a fast initial screening followed by a final decision.
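The cascade could be sketched like this, with `fast_model` and `slow_model` as stand-ins for the 8B and 70B judges (each a callable returning True if a note is “valuable”):

```python
def classify_note(note, fast_model, slow_model):
    """Two-stage screening cascade described above: the cheap model
    rejects quickly; only its accepts are re-checked by the big model."""
    if not fast_model(note):
        return False          # cheap rejection, no 70B call needed
    return slow_model(note)   # final decision always from the big model

# Toy judges standing in for the real LLM calls:
fast = lambda n: len(n) > 5
slow = lambda n: "wisdom" in n
print(classify_note("some wisdom here", fast, slow))  # True
```

Since most notes are rejected early, the expensive 70B model only runs on the small fraction the 8B lets through.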

5. Elimination of bots, LLMs and news

I realized there are lots of bots already, lots of news submitters, and people copying LLM outputs into their notes. Additional filtering is needed here. Later I may build an LLM to detect whether a note was written by an LLM. I am mostly interested in notes generated by humans.

Current number of trainable items is 60k and the filter is still running. I am expecting the number to be around 80k.

6. Dividing the notes to pieces to apply different learning rates

This basically gives more weight to high web of trust sources. High-wot notes could use a 5e-5 learning rate with a cosine scheduler, while low-wot ones could use 1e-5 with linear decay. This pushes the narrative towards the more accepted notes on Nostr: if a person has a huge following, they are more accepted by Nostr, and we reflect that in training by boosting the high-wot notes to be learned from more.
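As an illustration of the two schedules (the 0.7 wot cutoff and the per-step formulas are assumptions; in practice each bucket would be a separate training pass with its own scheduler):

```python
import math

def lr_at(step, total_steps, wot, high_wot=0.7):
    """Per-bucket learning rate, sketching the scheme above:
    high-wot notes get 5e-5 with cosine decay, low-wot notes
    get 1e-5 with linear decay. The 0.7 cutoff is an assumption."""
    progress = step / total_steps
    if wot >= high_wot:
        base = 5e-5
        return base * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
    base = 1e-5
    return base * (1 - progress)                                # linear decay

print(lr_at(0, 100, wot=0.9))   # 5e-05 at the start for high-wot
print(lr_at(50, 100, wot=0.1))  # 5e-06 halfway through for low-wot
```

Both schedules decay to zero by the end of training; the high-wot bucket simply starts 5x higher, so those notes pull harder on the weights.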

CyberEmu

Conclusion

It looks like it will take several days to train on the resulting notes. In my setup it takes 1.5 hours to train on 1 MB of data, which is probably very slow because I am using fsdp_qlora. There is also a new technique, fsdp_qdora: https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html I will try that next time. It looks like it is even better than full training, while using far fewer resources!

Author Public Key
npub1nlk894teh248w2heuu0x8z6jjg2hyxkwdc8cxgrjtm9lnamlskcsghjm9c