Session 2
LLMs 101
Creating a Basic LLM Using Xojo and Ngrams
Large Language Models (LLMs) have become the cornerstone of modern artificial intelligence, revolutionizing the way machines understand and generate human language. These models, trained on massive datasets, can perform a wide range of tasks such as translation, summarization, and conversation. In this session, we will build a basic but fully working LLM using Xojo and Ngrams, demonstrating how, even without neural networks or traditional machine learning techniques, we can generate coherent text by leveraging statistical patterns in the training data.
Understanding Ngrams
What are Ngrams?
Ngrams are contiguous sequences of words or tokens in a given text. An Ngram model predicts the next word in a sequence based on the previous N words. For example, in a trigram model (3-gram), the words “I am” can be used to predict “going” based on how often the trigram “I am going” appears in the training data.
Role of Ngrams in Modern LLMs
Ngrams are the conceptual foundation on which modern language modeling was built. While contemporary models like GPT-4 use sophisticated neural networks and vast computational power, they still rest on the same core idea established by Ngram models: predicting the next token from the statistics of the tokens that precede it. Ngram models capture local dependencies and patterns in text, enabling the generation of plausible and contextually relevant sequences.
How Our Ngram-Based LLM Works
Sliding Context Window
Like the big-name LLMs, our model uses a context window (a sliding window, to be exact) to generate text: it looks at a fixed number of neighboring words to predict the next word. In our implementation the window is ContextCount = 3 tokens wide, so the two preceding words form the context used to predict the word that follows, and the window slides forward one word at a time as text is generated. By analyzing these local patterns, the model can produce coherent text sequences.
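To make the sliding window concrete, here is a minimal standalone sketch (not part of the project code; the sample sentence is only an illustration) that prints every three-word window in a short sentence:
// Standalone illustration: slide a 3-word window across a short sentence.
Var words() As String = "the quick brown fox jumps over the lazy dog".Split(" ")
Var windowSize As Integer = 3
For i As Integer = 0 To words.LastIndex - (windowSize - 1)
// Each window is the local context the model sees at position i
Print(words(i) + " " + words(i + 1) + " " + words(i + 2))
Next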
Model Training and Text Generation
- Loading and Tokenizing Text: The model begins by loading the training text (freely available, public-domain books from Project Gutenberg) and breaking it down into tokens (words). This tokenized data forms the basis for building Ngram counts. Large models use byte-pair encoding for tokenization, but today we'll keep it simple and tokenize on whole words. Tokenization is also how the LLM's vocabulary is built; because our LLM works with complete words, it will NOT be able to handle words it never encountered in its training dataset (the books it consumes).
- Building Ngram Counts: The model counts the occurrences of each Ngram (in our case, sets of 3 words, i.e. trigrams) in the tokenized text. These counts are used to calculate the probabilities of different words following a given sequence (a small standalone sketch of this counting step follows this list).
- Generating Text: Using the trained Ngram counts, the model generates new text by predicting the next word from the previous context. The sliding context window keeps the generated text locally coherent and contextually relevant. Since the training data will be quite small (a million to a few million tokens), don't expect GPT-4 or Claude-level intelligence. You'll still see AI in action, and some generations may even feel surprisingly sophisticated for the size of the model we're creating.
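Here is that counting sketch: a minimal standalone example (the sample text and variable names are illustrative only, not part of the project code) that builds trigram counts from a tiny piece of text:
// Standalone sketch: count trigrams in a tiny sample text.
Var sample As String = "the cat sat on the mat and the cat slept on the mat"
Var words() As String = sample.Split(" ")
Var counts As New Dictionary
For i As Integer = 0 To words.LastIndex - 2
Var key As String = words(i) + " " + words(i + 1) + " " + words(i + 2)
If counts.HasKey(key) Then
counts.Value(key) = counts.Value(key) + 1
Else
counts.Value(key) = 1
End If
Next
// "on the mat" now has a count of 2, and the prefix "the cat" has two
// possible continuations ("sat" and "slept"), which is exactly the
// information the generator uses to pick the next word.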
Achieving High Variability
One of the remarkable aspects of our Ngram-based LLM is its ability to generate text that differs from its original training data by 88% or more. This is called "generalization": the ability to formulate entirely new "ideas" from a given context. This high variability comes from leveraging the statistical patterns in the training data, allowing the model to generalize text creation without relying on actual machine learning techniques. We're going to leave out the transformer architecture, but we will utilize an attention-like weighting mechanism we call weighted bias to steer the text generation: the counts of the candidate continuations are turned into probabilities, and the next word is drawn at random in proportion to those probabilities, so a continuation seen five times is chosen more often than one seen twice. To read more about attention mechanisms, see the original "Attention Is All You Need" paper (arXiv:1706.03762, https://arxiv.org/abs/1706.03762).
Building the Ngram LLM in Xojo
Overview of Ngram-Based LLM
An Ngram-based LLM works by analyzing sequences of words (or tokens) in a given text to predict the next word in a sequence. Our model will learn from a provided text corpus, count the occurrences of sequences, and use these counts to generate new text based on learned patterns.
Setting Up the Project
- Creating the Xojo Project: Open Xojo and create a new Console Application project.
- Defining Properties: Add the following properties to your project:
Public Property NumEpochs As Integer = 1 // number of passes over the training tokens
Public Property NgramCounts As Dictionary // ngram -> occurrence count
Public Property ContextCount As Integer = 3 // size of the Ngram (3 = trigram)
Private Property ContextIndex As Integer = 0 // reserved; not used in this example
Private Property PrefixMap As Dictionary // (ContextCount - 1)-word prefix -> possible next words
Public Property StartingToken As String = "He" // word the generator starts from
Handling Unhandled Exceptions
We need to handle any unexpected runtime errors that occur during the execution of our program. Printing the error message and returning True marks the exception as handled.
Function UnhandledException(error As RuntimeException) Handles UnhandledException as Boolean
Print(error.Message)
Return True
End Function
Main Function: Run
The Run function is the entry point of our program. It loads or creates tokens, trains the Ngram model, and generates text based on the trained model.
Function Run(args() as String) Handles Run as Integer
Var tokens() As String
Var ngramFilePath As String = "ngrams.dat"
Var tokenFilePath As String = "tokens.dat"
Var vocabFilePath As String = "vocab.dat" // reserved for a saved vocabulary file (not used below)
// Load text and create tokens
Var path As String = "training" // folder containing the training text files
If FileExists(tokenFilePath) Then
tokens = LoadTokens(tokenFilePath)
Else
Var text As String = LoadTextFromDirectory(path)
tokens = TokenizeText(text)
// SaveTokens(tokens, tokenFilePath)
End If
// Check if the ngram file exists
If FileExists(ngramFilePath) Then
NgramCounts = LoadNgramCounts(ngramFilePath)
InitializePrefixMap(tokens)
Else
InitializeNgramCounts(tokens)
For epoch As Integer = 1 To NumEpochs
Print("Training Model. Epoch " + epoch.ToString)
UpdateNgramCounts(tokens)
Next
// SaveNgramCounts(NgramCounts, ngramFilePath)
End If
// Generate text using the trained model
Var generatedText As String = GenerateTextFromNgrams(tokens, 256)
Print(generatedText.TrimSentence + EndOfLine + EndOfLine + "DONE. Press the Enter key to quit.")
Var y as String = Input()
Return 0
End Function
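The Run function calls a TrimSentence helper on the generated text that is not reproduced in this listing. A minimal sketch, assuming it simply trims the output back to the last complete sentence (add it as a method in a module so the Extends syntax works), might look like this:
Public Function TrimSentence(Extends s As String) As String
// Walk backwards from the end and cut the text after the last
// sentence-ending punctuation mark, so the output ends on a full sentence.
For i As Integer = s.Length - 1 DownTo 0
Var ch As String = s.Middle(i, 1)
If ch = "." Or ch = "!" Or ch = "?" Then
Return s.Left(i + 1)
End If
Next
Return s // no sentence terminator found; return the text unchanged
End Function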
Loading and Saving Functions
These functions handle the loading and saving of data such as tokens, Ngram counts, and text files.
Loading Text from File
Private Function LoadTextFromFile(path As String) As String
Print("Loading Training File at: " + path)
Var f As FolderItem = GetFolderItem(path)
If f <> Nil And f.Exists Then
Var tis As TextInputStream
tis = TextInputStream.Open(f)
tis.Encoding = Encodings.UTF8
Var xtext As String = tis.ReadAll(Encodings.UTF8)
tis.Close
Return xtext.ReplaceLineEndings(EndOfLine)
Else
Return ""
End If
End Function
Checking if File Exists
Private Function FileExists(path As String) As Boolean
Var f As FolderItem = GetFolderItem(path)
Return f <> Nil And f.Exists
End Function
Loading Text from Directory
Private Function LoadTextFromDirectory(path As String) As String
Print("Loading Training Files at: " + path)
Var f As FolderItem = GetFolderItem(path)
var ff as FolderItem
var data() as String
for i as integer = 0 to f.Count-1
ff = f.ChildAt(i)
if not ff.IsFolder then
Var tis As TextInputStream
tis = TextInputStream.Open(ff)
tis.Encoding = Encodings.UTF8
Var txt As String = tis.ReadAll
tis.Close
data.Add txt.ReplaceLineEndings(EndOfLine)
end if
next
Return String.FromArray(data, EndOfLine)
End Function
Ngram Model Functions
These functions initialize, update, and use the Ngram model for text generation.
Initializing Ngram Counts
Private Sub InitializeNgramCounts(tokens() As String)
NgramCounts = ParseJSON( "{}" )
PrefixMap = ParseJSON( "{}" )
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var ngramKey As String = String.FromArray(tokens.Slice(i, ContextCount), " ")
NgramCounts.Value(ngramKey) = 1 // Start with Laplace Smoothing
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
If Not PrefixMap.HasKey(prefix) Then
var s() as String
PrefixMap.Value(prefix) = s
End If
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
Next
End Sub
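For example, given the tokens "the cat sat on the mat", this pass creates NgramCounts keys such as "the cat sat" and "cat sat on", and PrefixMap entries such as "the cat" → ("sat") and "cat sat" → ("on").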
Updating Ngram Counts
Private Sub UpdateNgramCounts(tokens() As String)
Print("Updating Model Graph...")
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var ngramKey As String = String.FromArray(tokens.Slice(i, ContextCount), " ")
If NgramCounts.HasKey(ngramKey) Then
NgramCounts.Value(ngramKey) = NgramCounts.Value(ngramKey) + 1
Else
NgramCounts.Value(ngramKey) = 1 // Initialize unseen ngram due to additional data
End If
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
next
End Sub
Generating Text from Ngrams
Private Function GenerateTextFromNgrams(tokens() As String, MaximumTokens As Integer) As String
If tokens.LastIndex < ContextCount Then
Return "Not enough tokens to generate text."
End If
Print(EndOfLine + "Running Inference, please wait...")
Print("Generating " + MaximumTokens.ToString + " tokens." + EndOfLine)
Var random As New Random
Var currentSequence() As String
Var output() As String
var starters() as Integer
// Find every position where the starting token appears
for e as integer = 0 to tokens.LastIndex
if tokens(e).Compare(StartingToken, ComparisonOptions.CaseSensitive) = 0 then starters.Add(e)
next
var rnd as new Random
If starters.LastIndex = -1 Then
Return "No valid starting token '" + StartingToken + "' found in tokens."
End If
// Pick a random occurrence of the starting token as the seed context
starters.Shuffle()
var startIndex = starters(rnd.InRange(0, starters.LastIndex))
currentSequence = tokens.Slice(startIndex, ContextCount - 1)
output.Add EndOfLine
output.Add String.FromArray(currentSequence, " ")
print("Initial attention sequence: " + String.FromArray(currentSequence, " "))
var possiblePastWords() as String
var reRunCount as Integer = 0
var tokenAdded as Boolean = False
var randomint() as Integer
For i As Integer = 1 To MaximumTokens
Var currentSeqStr As String = String.FromArray(currentSequence.Slice(0, ContextCount - 1), " ")
randomint.RemoveAll()
ReRun:
// Look up candidate continuations, falling back to looser matches if none are found
Var possibleNextWords() As String = GetNextWords(currentSeqStr)
if possibleNextWords.LastIndex = -1 then possibleNextWords = GetNextWords(RemovePunctuation(currentSeqStr.Lowercase))
if possibleNextWords.LastIndex = -1 then possibleNextWords = GetNextWords(currentSeqStr.Lowercase.Trim)
Var nextWord As String
print("[" + currentSeqStr + "]: [" + possibleNextWords.Count.ToString + "] ")
If possibleNextWords.LastIndex = -1 Then
// No continuation found: backtrack and retry, or fall back to a random token
reRunCount = reRunCount + 1
if reRunCount > possiblePastWords.LastIndex then
reRunCount = 0
nextWord = tokens(random.InRange(0, tokens.LastIndex))
print("Model Not Adequately Trained - No next words found for sequence: " + currentSeqStr)
else
output.RemoveAt(output.LastIndex)
if currentSequence.Count > 0 then
currentSequence.RemoveAt(currentSequence.LastIndex)
end if
var rint as Integer
if possiblePastWords.LastIndex > -1 then
while randomint.IndexOf(rint) > -1
rint = rnd.InRange(0,possiblePastWords.LastIndex)
wend
randomint.Add rint
var possibility as string = possiblePastWords(rint)
output.Add possibility
currentSequence.Add possibility
currentSeqStr = String.FromArray(currentSequence.Slice(0,ContextCount-1), " ")
else
currentSeqstr = String.FromArray(output.Slice(output.LastIndex-(ContextCount-1), ContextCount-1), " ")
end if
goto ReRun
end if
Else
Var scores() As Double
Var totalScore As Double = 0.0
For Each word As String In possibleNextWords
Dim fullNgram As String = currentSeqStr + " " + word
if NgramCounts.HasKey(fullNgram) then
scores.Append(NgramCounts.Value(fullNgram))
totalScore = totalScore + NgramCounts.Value(fullNgram)
end if
Next
Var probabilities() As Double
For Each score As Double In scores
probabilities.Append(score / totalScore)
Next
var wRand as Integer = WeightedRandom(probabilities)
if wRand = -1 then wRand = 0
if possibleNextWords.Count = 1 then
nextWord = possibleNextWords(0)
else
nextWord = possibleNextWords(wRand)
end if
End If
output.Add(nextWord)
currentSequence.Append(nextWord)
If currentSequence.Count > ContextCount - 1 Then
currentSequence.Remove(0)
End If
possiblePastWords = possibleNextWords
reRunCount = 0
print(nextWord + " ")
System.DebugLog("Chosen word: " + nextWord)
Next
Return String.FromArray(output, " ")
End Function
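GenerateTextFromNgrams relies on a WeightedRandom helper that is not reproduced above. A minimal sketch, assuming it performs a weighted (roulette-wheel) draw over the probability array and returns -1 when the array is empty (which is why the caller falls back to index 0), could look like this:
Private Function WeightedRandom(probabilities() As Double) As Integer
// Weighted (roulette-wheel) selection: returns an index chosen in
// proportion to the supplied probabilities, or -1 for an empty array.
If probabilities.LastIndex = -1 Then Return -1
Var rnd As New Random
Var r As Double = rnd.InRange(0, 1000000) / 1000000.0 // uniform value in [0, 1]
Var cumulative As Double = 0.0
For i As Integer = 0 To probabilities.LastIndex
cumulative = cumulative + probabilities(i)
If r <= cumulative Then Return i
Next
Return probabilities.LastIndex // guard against floating-point rounding drift
End Function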
Additional Helper Functions
Tokenizing Text
Private Function TokenizeText(text As String) As String()
Print("Tokenizing text...")
// Collapse runs of spaces so the split does not produce empty tokens
var t as String = text
while t.IndexOf("  ") > -1
t = t.ReplaceAll("  ", " ")
wend
Var tokens() As String = t.Split(" ")
print("Token Count: " + tokens.Count.ToString)
// The vocabulary is the set of unique (lowercased) words
var vocab as new Dictionary
for each tok as String in tokens
vocab.Value(tok.Lowercase) = True
next
var tokCount as Integer = tokens.Count
print "Vocabulary Size: " + Str(vocab.Count)
print "Parameter Count: " + Str(tokCount * ContextCount)
Return tokens
End Function
Getting Next Words
Private Function GetNextWords(currentSequence As String) As String()
Var nextWords() As String
For Each key As String In NgramCounts.Keys
If key.Left(currentSequence.Len + 1) = currentSequence + " " Then
Var nextPart() As String = key.Split(" ")
nextWords.Append(nextPart(nextPart.LastIndex))
End If
Next
System.DebugLog("Possible Next Words: [" + String.FromArray(nextWords, ", ") + "]")
Return nextWords
End Function
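Note that GetNextWords scans every key in NgramCounts on each call, which is simple but slow for large models. The PrefixMap built during training holds exactly this prefix-to-next-word mapping, so it could be used for a much faster lookup if you want to experiment.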
Removing Punctuation
Public Function RemovePunctuation(input As String) As String
Var punctuationMarks() As String = Split(".,!?;:'""()[]{}<>-—_@#$/\&*^%`~|=+", "")
var output as String = input
For Each mark as String in punctuationMarks
output = output.ReplaceAll(mark, "")
next
Return output
End Function
Loading and Saving Tokens and Ngram Counts
Private Function LoadTokens(path As String) As String()
Print("Loading Tokens from " + path)
Var f As FolderItem = GetFolderItem(path)
Var tokens() As String
If f <> Nil And f.Exists Then
Var tis As TextInputStream = TextInputStream.Open(f)
tokens = GunZip(tis.ReadAll()).Split(EndOfLine.UNIX)
tis.Close
End If
Return tokens
End Function
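Note: GZip and GunZip are compression helpers that are not shown in this listing; they are assumed to compress and decompress a string. If you don't have equivalents available, you can simply write and read the token and model files uncompressed.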
Load Ngram Probabilities (Model)
Private Function LoadNgramCounts(path As String) As Dictionary
Print("Loading Model Graph")
PrefixMap = ParseJSON( "{}" )
Var ngramCounts As Dictionary = ParseJSON( "{}" )
Var f As FolderItem = GetFolderItem(path)
If f <> Nil And f.Exists Then
Var tis As TextInputStream = TextInputStream.Open(f)
var vTis() as String = GunZip(tis.ReadAll()).Split(EndOfLine.UNIX)
for i as Integer = 0 to vTis.LastIndex
Var line As String = vTis(i)
Var parts() As String = line.Split(":::")
If parts.LastIndex = 1 Then
ngramCounts.Value(parts(0)) = parts(1).ToInteger
End If
Next
tis.Close
End If
Return ngramCounts
End Function
Save Token Dictionary for Reuse
Private Sub SaveTokens(tokens() As String, path As String)
Var f As FolderItem = GetFolderItem(path)
var tok as String = String.FromArray(tokens, EndOfLine.UNIX)
var sok as string = GZip(tok)
If f <> Nil Then
Var tos As TextOutputStream = TextOutputStream.Create(f)
tos.Write(sok)
tos.Close
End If
End Sub
Save Ngram Probabilities Model
Private Sub SaveNgramCounts(ngramCounts As Dictionary, path As String)
Print("Saving Ngram Graph")
Var f As FolderItem = GetFolderItem(path)
var sok() as String
If f <> Nil Then
Var tos As TextOutputStream = TextOutputStream.Create(f)
For Each key As String In ngramCounts.Keys
sok.Add key + ":::" + ngramCounts.Value(key).StringValue
Next
tos.Write GZip(String.FromArray(sok, EndOfLine.UNIX))
tos.Close
End If
End Sub
Initializing Prefix Map
Private Sub InitializePrefixMap(tokens() As String)
PrefixMap = ParseJSON( "{}" )
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
If Not PrefixMap.HasKey(prefix) Then
var s() as String
PrefixMap.Value(prefix) = s
End If
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
Next
End Sub
Congratulations!
You have successfully created a basic, fully functional LLM using Xojo and Ngrams. This model can now generate text based on the patterns it has learned from the provided training data. While this implementation is simple, it lays the foundation for more complex and powerful language models. Keep experimenting and improving your model to achieve even better results. For those interested, I've included a C++ version along with the Xojo code; the C++ version runs approximately 30-50x faster than the Xojo implementation. The included source has some additional code and settings you can play around with.
Happy coding!