Session 1
Vectors 101
An introductory toe-dip into the magical world of vector embeddings.
Vector Embeddings
Vectors are everywhere in the world of AI and machine learning, from word embeddings to tensors. They are the magical arrays of floating-point numbers that allow us to recognize patterns in speech, vision, text, and many other fields. Under the hood, vectors give language models the ability to ascertain deep context from limited or vast input.
Welcome, AI/ML Apprentice!
Welcome to Session 1 of our AI/Machine Learning for Xojo series: Vectors 101.
Today, we will explore the fascinating world of vector embeddings, a powerful tool in natural language processing (NLP) that helps capture the contextual meaning of words and phrases. Understanding vector embeddings is essential for building intelligent applications that can analyze and understand human language.
Understanding Vector Embeddings
Vector embeddings are numerical representations of words in a high-dimensional space. Each word is transformed into a vector (a list of numbers) that represents its position in that space. The closer two words sit in this space, the more similar their meanings. This lets us measure the similarity between words and phrases based on their contextual usage.
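To make “closeness” concrete: the standard yardstick is cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch in plain Xojo (no plugin required) of how it can be computed; we will reuse this helper in the sketches that follow.

Function CosineSimilarity(a() As Double, b() As Double) As Double
  ' Cosine of the angle between two equal-length vectors:
  ' 1 = same direction, 0 = orthogonal, -1 = opposite direction.
  Var dot, magA, magB As Double
  For i As Integer = 0 To a.LastIndex
    dot = dot + a(i) * b(i)
    magA = magA + a(i) * a(i)
    magB = magB + b(i) * b(i)
  Next
  If magA = 0 Or magB = 0 Then Return 0
  Return dot / (Sqrt(magA) * Sqrt(magB))
End Function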
Similarity in Phrases
A common question arises: why do phrases like “I like cars” and “I hate cars” have high similarity? Despite their opposite sentiments, both phrases share significant contextual elements:
- Both express an emotion (like or dislike).
- Both have the same subject (“I”) and object (“cars”).
Vector embeddings capture this broader contextual meaning, which is why these phrases are similar in the vector space. The sentiment of the phrases is determined by the surrounding words, but the core context remains similar.
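One simple, widely used way to turn word vectors into a phrase vector is to average them. The following is a hypothetical sketch (the GloveVector plugin may compute phrase embeddings differently):

Function PhraseVector(phrase As String, vectors As Dictionary) As Double()
  ' Averages the word vectors of every known word in the phrase.
  ' vectors maps word -> Double array (see the loader sketch further below).
  Var sum() As Double
  Var count As Integer
  For Each word As String In phrase.Lowercase.Split(" ")
    If Not vectors.HasKey(word) Then Continue
    Var v() As Double = vectors.Value(word)
    If sum.Count = 0 Then sum.ResizeTo(v.LastIndex)
    For i As Integer = 0 To v.LastIndex
      sum(i) = sum(i) + v(i)
    Next
    count = count + 1
  Next
  If count > 0 Then
    For i As Integer = 0 To sum.LastIndex
      sum(i) = sum(i) / count
    Next
  End If
  Return sum
End Function

With this helper, CosineSimilarity(PhraseVector("I like cars", vectors), PhraseVector("I hate cars", vectors)) comes out high, because most of each averaged vector is contributed by the shared “I” and “cars” context.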
Determining Opposition
To determine if a statement expresses an opposite sentiment, we can strip out the similar words and focus on the remaining ones. For instance, comparing “like” and “hate” can help determine the sentiment difference between the phrases. This technique helps in identifying the true meaning and sentiment behind the words.
When comparing words like “white/black,” “night/day,” “man/woman,” and “king/queen” this way, the similarity values remain low, signaling opposing or divergent meanings. For reference: a cosine similarity value of 1 implies a perfect match, 0 implies orthogonal (essentially unrelated) vectors, and -1 implies vectors pointing in exactly opposite directions.
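You can reproduce this experiment yourself once a model file has been loaded into a vectors Dictionary (a hypothetical loader sketch appears after the model note below):

' Compare an isolated word pair after the shared context is stripped away.
Var vLike() As Double = vectors.Value("like")
Var vHate() As Double = vectors.Value("hate")
System.DebugLog("like vs. hate: " + CosineSimilarity(vLike, vHate).ToString)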
Capabilities of Vector Embeddings
Vector embeddings are incredibly versatile. They can capture:
- Sentiment and emotion: Understanding the sentiment behind a phrase.
- Implied meanings: Grasping the underlying meaning even when not explicitly stated.
- Contextual similarity: Finding content that is contextually similar or dissimilar using entirely different wording.
In this session, we will use the English GloVe (Global Vectors for Word Representation) models to demonstrate these capabilities.
NOTE: In this session we will use “crawl-300d-2M.vec” from the first “Common Crawl” link below. If you choose any other vector embedding model, please update it in the Xojo Thread control’s Open event for “thInitializeVectors”. Set the downloaded vector file’s location there as well. The model is included with the session test application for convenience.
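For a rough idea of what such an initialization might look like, here is a hypothetical sketch of loading a GloVe-style text file into a Dictionary. The actual thInitializeVectors code and the GloveVector plugin’s own loader may well differ, and the file path is a placeholder:

' Each line of the file is "word v1 v2 ... v300".
Var vectors As New Dictionary
Var f As New FolderItem("/path/to/your/model.vec", FolderItem.PathModes.Native)
Var stream As TextInputStream = TextInputStream.Open(f)
While Not stream.EndOfFile
  Var parts() As String = stream.ReadLine.Split(" ")
  If parts.Count <= 2 Then Continue  ' skip blanks and a possible "count dims" header
  Var vec() As Double
  For i As Integer = 1 To parts.LastIndex
    vec.Add(Double.FromString(parts(i)))
  Next
  vectors.Value(parts(0)) = vec
Wend
stream.Close

Loading a 2M-word, 300-dimension model takes a while and a good amount of RAM, which is exactly why the demo does this work on a Thread rather than blocking the UI.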
Download pre-trained word vectors
The links below contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors are made available under the Public Domain Dedication and License.
- Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
- Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download): glove.6B.zip
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d vectors, 1.42 GB download): glove.twitter.27B.zip
Practical Example with Xojo Application
Let’s dive into our Vectors_101 demo Xojo application. Look in the session archive file for the GloveVector plugin (macOS/Linux users will need to build the plugin from source to proceed). After building the plugin, drop a copy of the DLL/DYLIB/SO file into your Xojo Plugins installation folder. Be sure to download one of the pre-trained, open-source GloVe vector embedding models listed above. Once you’ve downloaded a model, extract the text/vec file and set its location in the test application as described above. Then run the application, wait a moment for the vector model to load, and begin by entering the query “I need a list of scribes and their manuscripts.” against the Trivia database.
The vector embeddings understand that “scribes” are similar to “authors” and “manuscripts” are related to “books.”
I trust you were presented with a list of authors and their books, yes?
Let’s try another query: “Who writes books about rodents?” This should return “Of Mice and Men” by John Steinbeck as a top result, showcasing the power of contextual understanding. Vector embeddings can link related concepts and return meaningful results even when the wording differs entirely; full-text SQL searches alone cannot perform these feats.
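Conceptually, that search boils down to embedding the query, embedding each candidate row, and ranking by cosine similarity. A hypothetical sketch, assuming rowTexts() holds the searchable text of the Trivia rows (the demo’s actual plugin-backed search will differ):

Var qVec() As Double = PhraseVector("Who writes books about rodents?", vectors)
Var bestRow As String
Var bestScore As Double = -2.0  ' start below the cosine minimum of -1
For Each rowText As String In rowTexts
  Var score As Double = CosineSimilarity(qVec, PhraseVector(rowText, vectors))
  If score > bestScore Then
    bestScore = score
    bestRow = rowText
  End If
Next
System.DebugLog("Top match: " + bestRow)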
Vector Math Demonstration
Vector embeddings allow us to perform fascinating mathematical operations. For example, if we take the vector for “King,” subtract the vector for “Man,” and add the vector for “Woman,” we get a vector that lands almost exactly on “Queen.” This demonstrates how embeddings capture complex relationships and analogies.
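In code, the analogy is plain element-wise arithmetic followed by a nearest-neighbor lookup. A minimal sketch, reusing CosineSimilarity and the vectors Dictionary from the earlier sketches:

' Build the target vector: king - man + woman.
Var vKing() As Double = vectors.Value("king")
Var vMan() As Double = vectors.Value("man")
Var vWoman() As Double = vectors.Value("woman")
Var target() As Double
For i As Integer = 0 To vKing.LastIndex
  target.Add(vKing(i) - vMan(i) + vWoman(i))
Next

' Scan the vocabulary for the closest word (excluding the inputs).
Var bestWord As String
Var bestScore As Double = -2.0
For Each entry As DictionaryEntry In vectors
  Var word As String = entry.Key
  If word = "king" Or word = "man" Or word = "woman" Then Continue
  Var v() As Double = entry.Value
  Var score As Double = CosineSimilarity(target, v)
  If score > bestScore Then
    bestScore = score
    bestWord = word
  End If
Next
System.DebugLog("Nearest word: " + bestWord)  ' "queen", if all goes well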
Custom Training with GloVe
We can also train vector embedding models on custom data to improve accuracy for specific applications. Custom training helps tailor the model to better understand the jargon or context specific to your data. For more information on training your own models, check out the GloVe project on GitHub. If you’d like a session on training a vector model from Xojo, stay tuned for a later module.
Ultimately…
Vector embeddings are a cornerstone of modern AI and machine learning, enabling powerful, contextually aware applications. They allow for nuanced understanding and processing of natural language, making them invaluable in many applications. I encourage you to explore and experiment with vector embeddings to unlock their full potential. Using embeddings to reconcile data with SQL or any other type of database is easily achieved. In one of the next sessions, we’ll try out a REST API-based vector database and add a graph database to it. Both databases will be built entirely in Xojo, leveraging our GloveVector plugin. As we progress, you’ll end up with a highly capable, sophisticated, real-world application that you can adapt to any use case in any field. Imagine ChatGPT/Claude/Gemini, but with limitless capabilities and the ability to interact with the REAL world!
Stay tuned for our next session, where we’ll delve deeper into the applications of AI in Xojo development.
By mastering vector embeddings, you’ll be equipped to build smarter, more intuitive applications that understand and process human language in remarkable ways. Dive in, experiment, and see how these powerful tools can transform your development projects.