Session 2
LLMs 101
Creating a Basic LLM Using Xojo and Ngrams
Large Language Models (LLMs) have become the cornerstone of modern artificial intelligence, revolutionizing the way machines understand and generate human language. These models, trained on massive datasets, can perform a wide range of tasks such as translation, summarization, and conversation. In this session, we will build a basic but fully working LLM using Xojo and Ngrams, demonstrating how, even without neural networks or traditional machine learning techniques, we can generate coherent text by leveraging statistical patterns in the training data.
Understanding Ngrams
What are Ngrams?
Ngrams are contiguous sequences of words or tokens in a given text. An Ngram model predicts the next word in a sequence based on the previous N words. For example, in a trigram model (3-gram), the words “I am” can be used to predict “going” based on how often the trigram “I am going” appears in the training data.
Role of Ngrams in Modern LLMs
Ngrams are the conceptual foundation on which modern language modeling was built. While contemporary models like GPT-4 use sophisticated neural networks and vast computational power, they still rest on the same core idea established by Ngram models: predicting the next token from the statistics of the tokens that precede it. Ngram models capture local dependencies and patterns in text, enabling the generation of plausible and contextually relevant sequences.
How Our Ngram-Based LLM Works
Sliding Context Window
Like the big-name LLMs, our model uses a context window (a sliding window, to be exact) to generate text: it looks at a fixed number of neighboring words to predict the next word. In our implementation the window is ContextCount = 3 tokens wide, so the two preceding words form the context used to predict the word that follows, and the window slides forward one word at a time as text is generated. By analyzing these local patterns, the model can produce coherent text sequences.
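To make the sliding window concrete, here is a minimal standalone sketch (not part of the project code; the sample sentence is only an illustration) that prints every three-word window in a short sentence:
// Standalone illustration: slide a 3-word window across a short sentence.
Var words() As String = "the quick brown fox jumps over the lazy dog".Split(" ")
Var windowSize As Integer = 3
For i As Integer = 0 To words.LastIndex - (windowSize - 1)
// Each window is the local context the model sees at position i
Print(words(i) + " " + words(i + 1) + " " + words(i + 2))
Next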
Model Training and Text Generation
- Loading and Tokenizing Text: The model begins by loading the training text (freely available, public-domain books from Project Gutenberg) and breaking it down into tokens (words). This tokenized data forms the basis for building Ngram counts. Large models use byte-pair encoding for tokenization, but today we'll keep it simple and tokenize on whole words. Tokenization is also how the LLM's vocabulary is built; because our LLM works with complete words, it will NOT be able to handle words it never encountered in its training dataset (the books it consumes).
- Building Ngram Counts: The model counts the occurrences of each Ngram (in our case, sets of 3 words, i.e. trigrams) in the tokenized text. These counts are used to calculate the probabilities of different words following a given sequence (a small standalone sketch of this counting step follows this list).
- Generating Text: Using the trained Ngram counts, the model generates new text by predicting the next word from the previous context. The sliding context window keeps the generated text locally coherent and contextually relevant. Since the training data will be quite small (a million to a few million tokens), don't expect GPT-4 or Claude-level intelligence. You'll still see AI in action, and some generations may even feel surprisingly sophisticated for the size of the model we're creating.
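Here is that counting sketch: a minimal standalone example (the sample text and variable names are illustrative only, not part of the project code) that builds trigram counts from a tiny piece of text:
// Standalone sketch: count trigrams in a tiny sample text.
Var sample As String = "the cat sat on the mat and the cat slept on the mat"
Var words() As String = sample.Split(" ")
Var counts As New Dictionary
For i As Integer = 0 To words.LastIndex - 2
Var key As String = words(i) + " " + words(i + 1) + " " + words(i + 2)
If counts.HasKey(key) Then
counts.Value(key) = counts.Value(key) + 1
Else
counts.Value(key) = 1
End If
Next
// "on the mat" now has a count of 2, and the prefix "the cat" has two
// possible continuations ("sat" and "slept"), which is exactly the
// information the generator uses to pick the next word.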
Achieving High Variability
One of the remarkable aspects of our Ngram-based LLM is its ability to generate text that differs from its original training data by 88% or more. This is called "generalization": the ability to formulate entirely new "ideas" from a given context. This high variability comes from leveraging the statistical patterns in the training data, allowing the model to generalize text creation without relying on actual machine learning techniques. We're going to leave out the transformer architecture, but we will utilize an attention-like weighting mechanism we call weighted bias to steer the text generation: the counts of the candidate continuations are turned into probabilities, and the next word is drawn at random in proportion to those probabilities, so a continuation seen five times is chosen more often than one seen twice. To read more about attention mechanisms, see the original "Attention Is All You Need" paper (arXiv:1706.03762, https://arxiv.org/abs/1706.03762).
Building the Ngram LLM in Xojo
Overview of Ngram-Based LLM
An Ngram-based LLM works by analyzing sequences of words (or tokens) in a given text to predict the next word in a sequence. Our model will learn from a provided text corpus, count the occurrences of sequences, and use these counts to generate new text based on learned patterns.
Setting Up the Project
- Creating the Xojo Project: Open Xojo and create a new Console Application project.
- Defining Properties: Add the following properties to your project:
Public Property NumEpochs As Integer = 1 // number of passes over the training tokens
Public Property NgramCounts As Dictionary // ngram -> occurrence count
Public Property ContextCount As Integer = 3 // size of the Ngram (3 = trigram)
Private Property ContextIndex As Integer = 0 // reserved; not used in this example
Private Property PrefixMap As Dictionary // (ContextCount - 1)-word prefix -> possible next words
Public Property StartingToken As String = "He" // word the generator starts from
Handling Unhandled Exceptions
We need to handle any unexpected runtime errors that occur during the execution of our program. Printing the error message and returning True marks the exception as handled.
Function UnhandledException(error As RuntimeException) Handles UnhandledException as Boolean
Print(error.Message)
Return True
End Function
Main Function: Run
The Run function is the entry point of our program. It loads or creates tokens, trains the Ngram model, and generates text based on the trained model.
Function Run(args() as String) Handles Run as Integer
Var tokens() As String
Var ngramFilePath As String = "ngrams.dat"
Var tokenFilePath As String = "tokens.dat"
Var vocabFilePath As String = "vocab.dat" // reserved for a saved vocabulary file (not used below)
// Load text and create tokens
Var path As String = "training" // folder containing the training text files
If FileExists(tokenFilePath) Then
tokens = LoadTokens(tokenFilePath)
Else
Var text As String = LoadTextFromDirectory(path)
tokens = TokenizeText(text)
// SaveTokens(tokens, tokenFilePath)
End If
// Check if the ngram file exists
If FileExists(ngramFilePath) Then
NgramCounts = LoadNgramCounts(ngramFilePath)
InitializePrefixMap(tokens)
Else
InitializeNgramCounts(tokens)
For epoch As Integer = 1 To NumEpochs
Print("Training Model. Epoch " + epoch.ToString)
UpdateNgramCounts(tokens)
Next
// SaveNgramCounts(NgramCounts, ngramFilePath)
End If
// Generate text using the trained model
Var generatedText As String = GenerateTextFromNgrams(tokens, 256)
Print(generatedText.TrimSentence + EndOfLine + EndOfLine + "DONE. Press the Enter key to quit.")
Var y as String = Input()
Return 0
End Function
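The Run function calls a TrimSentence helper on the generated text that is not reproduced in this listing. A minimal sketch, assuming it simply trims the output back to the last complete sentence (add it as a method in a module so the Extends syntax works), might look like this:
Public Function TrimSentence(Extends s As String) As String
// Walk backwards from the end and cut the text after the last
// sentence-ending punctuation mark, so the output ends on a full sentence.
For i As Integer = s.Length - 1 DownTo 0
Var ch As String = s.Middle(i, 1)
If ch = "." Or ch = "!" Or ch = "?" Then
Return s.Left(i + 1)
End If
Next
Return s // no sentence terminator found; return the text unchanged
End Function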
Loading and Saving Functions
These functions handle the loading and saving of data such as tokens, Ngram counts, and text files.
Loading Text from File
Private Function LoadTextFromFile(path As String) As String
Print("Loading Training File at: " + path)
Var f As FolderItem = GetFolderItem(path)
If f <> Nil And f.Exists Then
Var tis As TextInputStream
tis = TextInputStream.Open(f)
tis.Encoding = Encodings.UTF8
Var xtext As String = tis.ReadAll(Encodings.UTF8)
tis.Close
Return xtext.ReplaceLineEndings(EndOfLine)
Else
Return ""
End If
End Function
Checking if File Exists
Private Function FileExists(path As String) As Boolean
Var f As FolderItem = GetFolderItem(path)
Return f <> Nil And f.Exists
End Function
Loading Text from Directory
Private Function LoadTextFromDirectory(path As String) As String
Print("Loading Training Files at: " + path)
Var f As FolderItem = GetFolderItem(path)
var ff as FolderItem
var data() as String
for i as integer = 0 to f.Count-1
ff = f.ChildAt(i)
if not ff.IsFolder then
Var tis As TextInputStream
tis = TextInputStream.Open(ff)
tis.Encoding = Encodings.UTF8
Var txt As String = tis.ReadAll
tis.Close
data.Add txt.ReplaceLineEndings(EndOfLine)
end if
next
Return String.FromArray(data, EndOfLine)
End Function
Ngram Model Functions
These functions initialize, update, and use the Ngram model for text generation.
Initializing Ngram Counts
Private Sub InitializeNgramCounts(tokens() As String)
NgramCounts = ParseJSON( "{}" )
PrefixMap = ParseJSON( "{}" )
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var ngramKey As String = String.FromArray(tokens.Slice(i, ContextCount), " ")
NgramCounts.Value(ngramKey) = 1 // Start with Laplace Smoothing
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
If Not PrefixMap.HasKey(prefix) Then
var s() as String
PrefixMap.Value(prefix) = s
End If
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
Next
End Sub
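For example, given the tokens "the cat sat on the mat", this pass creates NgramCounts keys such as "the cat sat" and "cat sat on", and PrefixMap entries such as "the cat" → ("sat") and "cat sat" → ("on").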
Updating Ngram Counts
Private Sub UpdateNgramCounts(tokens() As String)
Print("Updating Model Graph...")
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var ngramKey As String = String.FromArray(tokens.Slice(i, ContextCount), " ")
If NgramCounts.HasKey(ngramKey) Then
NgramCounts.Value(ngramKey) = NgramCounts.Value(ngramKey) + 1
Else
NgramCounts.Value(ngramKey) = 1 // Initialize unseen ngram due to additional data
End If
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
next
End Sub
Generating Text from Ngrams
Private Function GenerateTextFromNgrams(tokens() As String, MaximumTokens As Integer) As String
If tokens.LastIndex < ContextCount Then
Return "Not enough tokens to generate text."
End If
Print(EndOfLine + "Running Inference, please wait...")
Print("Generating " + MaximumTokens.ToString + " tokens." + EndOfLine)
Var random As New Random
Var currentSequence() As String
Var output() As String
var starters() as Integer
// Find every position where the starting token appears
for e as integer = 0 to tokens.LastIndex
if tokens(e).Compare(StartingToken, ComparisonOptions.CaseSensitive) = 0 then starters.Add(e)
next
var rnd as new Random
If starters.LastIndex = -1 Then
Return "No valid starting token '" + StartingToken + "' found in tokens."
End If
// Pick a random occurrence of the starting token as the seed context
starters.Shuffle()
var startIndex = starters(rnd.InRange(0, starters.LastIndex))
currentSequence = tokens.Slice(startIndex, ContextCount - 1)
output.Add EndOfLine
output.Add String.FromArray(currentSequence, " ")
print("Initial attention sequence: " + String.FromArray(currentSequence, " "))
var possiblePastWords() as String
var reRunCount as Integer = 0
var tokenAdded as Boolean = False
var randomint() as Integer
For i As Integer = 1 To MaximumTokens
Var currentSeqStr As String = String.FromArray(currentSequence.Slice(0, ContextCount - 1), " ")
randomint.RemoveAll()
ReRun:
// Look up candidate continuations, falling back to looser matches if none are found
Var possibleNextWords() As String = GetNextWords(currentSeqStr)
if possibleNextWords.LastIndex = -1 then possibleNextWords = GetNextWords(RemovePunctuation(currentSeqStr.Lowercase))
if possibleNextWords.LastIndex = -1 then possibleNextWords = GetNextWords(currentSeqStr.Lowercase.Trim)
Var nextWord As String
print("[" + currentSeqStr + "]: [" + possibleNextWords.Count.ToString + "] ")
If possibleNextWords.LastIndex = -1 Then
// No continuation found: backtrack and retry, or fall back to a random token
reRunCount = reRunCount + 1
if reRunCount > possiblePastWords.LastIndex then
reRunCount = 0
nextWord = tokens(random.InRange(0, tokens.LastIndex))
print("Model Not Adequately Trained - No next words found for sequence: " + currentSeqStr)
else
output.RemoveAt(output.LastIndex)
if currentSequence.Count > 0 then
currentSequence.RemoveAt(currentSequence.LastIndex)
end if
var rint as Integer
if possiblePastWords.LastIndex > -1 then
while randomint.IndexOf(rint) > -1
rint = rnd.InRange(0,possiblePastWords.LastIndex)
wend
randomint.Add rint
var possibility as string = possiblePastWords(rint)
output.Add possibility
currentSequence.Add possibility
currentSeqStr = String.FromArray(currentSequence.Slice(0,ContextCount-1), " ")
else
currentSeqstr = String.FromArray(output.Slice(output.LastIndex-(ContextCount-1), ContextCount-1), " ")
end if
goto ReRun
end if
Else
Var scores() As Double
Var totalScore As Double = 0.0
For Each word As String In possibleNextWords
Dim fullNgram As String = currentSeqStr + " " + word
if NgramCounts.HasKey(fullNgram) then
scores.Append(NgramCounts.Value(fullNgram))
totalScore = totalScore + NgramCounts.Value(fullNgram)
end if
Next
Var probabilities() As Double
For Each score As Double In scores
probabilities.Append(score / totalScore)
Next
var wRand as Integer = WeightedRandom(probabilities)
if wRand = -1 then wRand = 0
if possibleNextWords.Count = 1 then
nextWord = possibleNextWords(0)
else
nextWord = possibleNextWords(wRand)
end if
End If
output.Add(nextWord)
currentSequence.Append(nextWord)
If currentSequence.Count > ContextCount - 1 Then
currentSequence.Remove(0)
End If
possiblePastWords = possibleNextWords
reRunCount = 0
print(nextWord + " ")
System.DebugLog("Chosen word: " + nextWord)
Next
Return String.FromArray(output, " ")
End Function
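GenerateTextFromNgrams relies on a WeightedRandom helper that is not reproduced above. A minimal sketch, assuming it performs a weighted (roulette-wheel) draw over the probability array and returns -1 when the array is empty (which is why the caller falls back to index 0), could look like this:
Private Function WeightedRandom(probabilities() As Double) As Integer
// Weighted (roulette-wheel) selection: returns an index chosen in
// proportion to the supplied probabilities, or -1 for an empty array.
If probabilities.LastIndex = -1 Then Return -1
Var rnd As New Random
Var r As Double = rnd.InRange(0, 1000000) / 1000000.0 // uniform value in [0, 1]
Var cumulative As Double = 0.0
For i As Integer = 0 To probabilities.LastIndex
cumulative = cumulative + probabilities(i)
If r <= cumulative Then Return i
Next
Return probabilities.LastIndex // guard against floating-point rounding drift
End Function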
Additional Helper Functions
Tokenizing Text
Private Function TokenizeText(text As String) As String()
Print("Tokenizing text...")
// Collapse runs of spaces so the split does not produce empty tokens
var t as String = text
while t.IndexOf("  ") > -1
t = t.ReplaceAll("  ", " ")
wend
Var tokens() As String = t.Split(" ")
print("Token Count: " + tokens.Count.ToString)
// The vocabulary is the set of unique (lowercased) words
var vocab as new Dictionary
for each tok as String in tokens
vocab.Value(tok.Lowercase) = True
next
var tokCount as Integer = tokens.Count
print "Vocabulary Size: " + Str(vocab.Count)
print "Parameter Count: " + Str(tokCount * ContextCount)
Return tokens
End Function
Getting Next Words
Private Function GetNextWords(currentSequence As String) As String()
Var nextWords() As String
For Each key As String In NgramCounts.Keys
If key.Left(currentSequence.Len + 1) = currentSequence + " " Then
Var nextPart() As String = key.Split(" ")
nextWords.Append(nextPart(nextPart.LastIndex))
End If
Next
System.DebugLog("Possible Next Words: [" + String.FromArray(nextWords, ", ") + "]")
Return nextWords
End Function
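Note that GetNextWords scans every key in NgramCounts on each call, which is simple but slow for large models. The PrefixMap built during training holds exactly this prefix-to-next-word mapping, so it could be used for a much faster lookup if you want to experiment.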
Removing Punctuation
Public Function RemovePunctuation(input As String) As String
Var punctuationMarks() As String = Split(".,!?;:'""()[]{}<>-—_@#$/\&*^%`~|=+", "")
var output as String = input
For Each mark as String in punctuationMarks
output = output.ReplaceAll(mark, "")
next
Return output
End Function
Loading and Saving Tokens and Ngram Counts
Private Function LoadTokens(path As String) As String()
Print("Loading Tokens from " + path)
Var f As FolderItem = GetFolderItem(path)
Var tokens() As String
If f <> Nil And f.Exists Then
Var tis As TextInputStream = TextInputStream.Open(f)
tokens = GunZip(tis.ReadAll()).Split(EndOfLine.UNIX)
tis.Close
End If
Return tokens
End Function
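Note: GZip and GunZip are compression helpers that are not shown in this listing; they are assumed to compress and decompress a string. If you don't have equivalents available, you can simply write and read the token and model files uncompressed.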
Load Ngram Probabilities (Model)
Private Function LoadNgramCounts(path As String) As Dictionary
Print("Loading Model Graph")
PrefixMap = ParseJSON( "{}" )
Var ngramCounts As Dictionary = ParseJSON( "{}" )
Var f As FolderItem = GetFolderItem(path)
If f <> Nil And f.Exists Then
Var tis As TextInputStream = TextInputStream.Open(f)
var vTis() as String = GunZip(tis.ReadAll()).Split(EndOfLine.UNIX)
for i as Integer = 0 to vTis.LastIndex
Var line As String = vTis(i)
Var parts() As String = line.Split(":::")
If parts.LastIndex = 1 Then
ngramCounts.Value(parts(0)) = parts(1).ToInteger
End If
Next
tis.Close
End If
Return ngramCounts
End Function
Save Token Dictionary for Reuse
Private Sub SaveTokens(tokens() As String, path As String)
Var f As FolderItem = GetFolderItem(path)
var tok as String = String.FromArray(tokens, EndOfLine.UNIX)
var sok as string = GZip(tok)
If f <> Nil Then
Var tos As TextOutputStream = TextOutputStream.Create(f)
tos.Write(sok)
tos.Close
End If
End Sub
Save Ngram Probabilities Model
Private Sub SaveNgramCounts(ngramCounts As Dictionary, path As String)
Print("Saving Ngram Graph")
Var f As FolderItem = GetFolderItem(path)
var sok() as String
If f <> Nil Then
Var tos As TextOutputStream = TextOutputStream.Create(f)
For Each key As String In ngramCounts.Keys
sok.Add key + ":::" + ngramCounts.Value(key).StringValue
Next
tos.Write GZip(String.FromArray(sok, EndOfLine.UNIX))
tos.Close
End If
End Sub
Initializing Prefix Map
Private Sub InitializePrefixMap(tokens() As String)
PrefixMap = ParseJSON( "{}" )
For i As Integer = 0 To tokens.LastIndex - ContextCount
Var prefix As String = String.FromArray(tokens.Slice(i, ContextCount - 1), " ")
If Not PrefixMap.HasKey(prefix) Then
var s() as String
PrefixMap.Value(prefix) = s
End If
var x() as String = PrefixMap.Value(prefix)
x.Add(tokens(i + ContextCount - 1))
PrefixMap.Value(prefix) = x
Next
End Sub
Congratulations!
You have successfully created a basic, fully functional LLM using Xojo and Ngrams. This model can now generate text based on the patterns it has learned from the provided training data. While this implementation is simple, it lays the foundation for more complex and powerful language models. Keep experimenting and improving your model to achieve even better results. For those interested, I've included a C++ version along with the Xojo code; the C++ version runs approximately 30-50x faster than the Xojo implementation. The included source has some additional code and settings you can play around with.
Happy coding!