Embeddings Visualizer: Understanding How LLMs Represent Meaning
What are Embeddings?
Embeddings are dense numerical representations of text in a high-dimensional vector space. Think of it as mapping words and sentences to points in space, where similar meanings cluster together.
Why Does This Matter?
Semantic understanding - LLMs don’t compare words as strings; they compare them as vectors in space
Similarity search - Embeddings power recommendation systems, search engines, and retrieval-augmented generation (RAG)
Dimensionality - Each embedding is a list of hundreds of numbers (e.g. 384 or 1536 dimensions)
Key Concepts:
An embedding ≈ a fixed-length array of floating-point numbers representing meaning
Cosine similarity measures how “close” two embeddings are (1.0 = same direction, i.e. near-identical meaning; 0.0 = unrelated)
Vector arithmetic works on meaning:
king − man + woman ≈ queen
Embeddings can represent words, sentences, paragraphs, or even images
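For reference, cosine similarity is conventionally defined as below. The symbols a and b (two embedding vectors) and n (their dimension) are notation introduced here, not names taken from the notebook.

```latex
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

Because both vectors are divided by their lengths, the score depends only on the angle between them and always lies between -1 and 1.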
Setup: Installing Required Libraries
Local setup
This notebook is excluded from the website build and is not executed during normal site generation. To run it locally from the course repository, install the optional LLM dependency group first:
poetry install --with llm
If you are using a separate Jupyter or Colab environment, install the notebook packages there instead:
%pip install sentence-transformers plotly scikit-learn
import numpy as np
import plotly.graph_objects as go
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
Load Embedding Model
We’ll use all-MiniLM-L6-v2, a compact sentence-embedding model that:
Generates 384-dimensional embeddings
Is small and fast enough to run on a CPU
Works well for semantic similarity tasks
# Load the model (this may take a moment on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")
# Recent sentence-transformers releases rename this method to get_embedding_dimension()
print(f" Embedding dimension: {model.get_sentence_embedding_dimension()}")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
✅ Model loaded!
Embedding dimension: 384
/tmp/ipykernel_68153/1080270063.py:5: FutureWarning: The `get_sentence_embedding_dimension` method has been renamed to `get_embedding_dimension`.
print(f" Embedding dimension: {model.get_sentence_embedding_dimension()}")
Part 1: Word Embeddings & Similarity
Run the cells below to encode a set of words and visualise how similar they are to one another.
# Define some words
words = ['king', 'queen', 'man', 'woman', 'prince', 'princess',
         'dog', 'cat', 'puppy', 'kitten',
         'car', 'vehicle', 'bicycle', 'motorcycle']
# Generate embeddings
word_embeddings = model.encode(words)
print(f"Generated {len(word_embeddings)} embeddings")
print(f"Each embedding has {len(word_embeddings[0])} dimensions")
print(f"\nExample — 'king' embedding (first 10 values):")
print(word_embeddings[0][:10])
Generated 14 embeddings
Each embedding has 384 dimensions
Example — 'king' embedding (first 10 values):
[-0.05959932 0.05051232 -0.06951012 0.07968019 -0.04674765 0.00098894
0.07904322 -0.01273938 0.05839579 -0.03140246]
Exercise 1: Compare Two Words
Use the code below to compare the cosine similarity of any two words.
Change word_a and word_b to any words you like and re-run — they don’t need to be in the list above.
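Cosine similarity is the cosine of the angle between two vectors, so it depends on their direction and not their length:
Score of 1.0 = same direction (identical meaning)
Score of 0.0 = orthogonal (unrelated)
Score of -1.0 = opposite direction (rare in practice)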
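The three reference scores can be checked by hand on tiny vectors. The following is a minimal sketch in plain Python, so it runs without loading the model; the `cosine` helper and the 2-D vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors, chosen only to hit the three reference cases
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # at right angles
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # pointing opposite ways

print(f"{same_direction:.2f} {orthogonal:.2f} {opposite:.2f}")
```

The first pair has different lengths but still scores (up to floating-point rounding) 1.0, which is why cosine similarity is often preferred over raw distance when only direction carries meaning.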
word_a = "king"
word_b = "queen"
emb_a = model.encode(word_a).reshape(1, -1)
emb_b = model.encode(word_b).reshape(1, -1)
print(f"'{word_a}' vs '{word_b}':")
print(f"Cosine Similarity: {float(cosine_similarity(emb_a, emb_b)[0, 0])}")
'king' vs 'queen':
Cosine Similarity: 0.6807128190994263
Example: Compare All Words
The code below generates a heatmap that shows all pairwise similarities at once. Use the Exercise 1 cell above to zoom in on any specific pair.
# Compute pairwise cosine similarity and plot as a heatmap
similarity_matrix = cosine_similarity(word_embeddings)
fig = go.Figure(data=go.Heatmap(
    z=similarity_matrix,
    x=words,
    y=words,
    colorscale='RdYlGn',
    texttemplate='%{z:.2f}',
    textfont={"size": 10},
    hovertemplate='Word A: %{y}<br>Word B: %{x}<br>Similarity: %{z:.2f}<extra></extra>',
    hoverongaps=False,
    colorbar=dict(title="Similarity")
))
fig.update_layout(
    title='Word Similarity Heatmap (Cosine Similarity)',
    xaxis_title='Words',
    yaxis_title='Words',
    width=800,
    height=700
)
fig.show()
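A heatmap is easy to scan by eye, but it can also help to list the strongest pairs explicitly. The sketch below shows one way to do that; `top_pairs` is a hypothetical helper and the 3×3 matrix holds made-up values so the example runs on its own.

```python
def top_pairs(labels, matrix, k=3):
    """Return the k highest-scoring off-diagonal pairs from a symmetric matrix."""
    pairs = []
    for i in range(len(labels)):
        # Upper triangle only: skips self-pairs and mirrored duplicates
        for j in range(i + 1, len(labels)):
            pairs.append((labels[i], labels[j], matrix[i][j]))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:k]

# Made-up similarity values, for illustration only
toy_words = ["dog", "puppy", "car"]
toy_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

for a, b, score in top_pairs(toy_words, toy_matrix, k=2):
    print(f"{a} / {b}: {score:.2f}")
```

In the notebook, `top_pairs(words, similarity_matrix)` should work unchanged, since NumPy arrays accept the same `matrix[i][j]` indexing.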
Part 2: Vector Arithmetic
Embeddings support approximate arithmetic on meaning. A variant of the classic example:
king − male + female ≈ ?
The cells below test this idea and extend it to other analogies.
def find_most_similar(target_embedding, candidate_embeddings, candidates, exclude_words=None, top_k=5):
    """Find the most similar words to a target embedding."""
    similarities = cosine_similarity([target_embedding], candidate_embeddings)[0]
    word_similarities = list(zip(candidates, similarities))
    if exclude_words:
        word_similarities = [(w, s) for w, s in word_similarities if w not in exclude_words]
    word_similarities.sort(key=lambda x: x[1], reverse=True)
    return word_similarities[:top_k]
# Vector arithmetic: king - male + female = ?
king_emb = model.encode('king')
male_emb = model.encode('male')
female_emb = model.encode('female')
result_emb = king_emb - male_emb + female_emb
candidates = ['queen', 'princess', 'man', 'prince', 'lady', 'monarch', 'duchess', 'empress']
candidate_embeddings = model.encode(candidates)
similar_words = find_most_similar(result_emb, candidate_embeddings, candidates, top_k=5)
print("🎯 king - male + female = ?\n")
for i, (word, score) in enumerate(similar_words, 1):
    bar = '█' * int(score * 50)
    print(f"{i}. {word:15s} {bar} {score:.4f}")
🎯 king - male + female = ?
1. queen ███████████████████████████████████ 0.7048
2. monarch █████████████████████████████████ 0.6639
3. lady █████████████████████████ 0.5015
4. princess █████████████████████████ 0.5008
5. empress ██████████████████████ 0.4518
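To build intuition for why this arithmetic can land near "queen", here is a toy sketch with hand-made 2-D vectors. The axes ("royalty" and "gender") and all values are hypothetical; real embeddings have hundreds of dimensions with no such clean labels.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical hand-made vectors: axis 0 = "royalty", axis 1 = "gender"
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman, computed component by component
result = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every toy word by cosine similarity to the result vector
ranked = sorted(
    ((word, cosine(result, vec)) for word, vec in toy.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(result)
for word, score in ranked:
    print(f"{word:6s} {score:.3f}")
```

In this idealised space the result coincides with the "queen" vector. Learned embeddings are far less tidy, which is consistent with the notebook's top score of about 0.70 rather than 1.0.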
Exercise 2: Explore Relationships in Embedded Space
Try encoding more words, applying vector arithmetic to them, and checking which analogies hold.
def semantic_analogy(word1, word2, word3, candidates):
    """Solve: word1 - word2 + word3 = ?"""
    emb1 = model.encode(word1)
    emb2 = model.encode(word2)
    emb3 = model.encode(word3)
    result = emb1 - emb2 + emb3
    candidate_embeddings = model.encode(candidates)
    similar = find_most_similar(result, candidate_embeddings, candidates,
                                exclude_words=[word1, word2, word3], top_k=3)
    print(f"\n{word1} - {word2} + {word3} = ?")
    for word, score in similar:
        print(f" → {word} ({score:.3f})")
print("🧪 Semantic Analogies\n" + "="*40)
semantic_analogy('France', 'Paris', 'London',
                 ['England', 'Britain', 'UK', 'Germany', 'Italy', 'Spain', 'France', 'Wales'])
semantic_analogy('doctor', 'hospital', 'school',
                 ['caretaker', 'nurse', 'teacher', 'office', 'scientist', 'laboratory'])
semantic_analogy('puppy', 'dog', 'cat',
                 ['cat', 'feline', 'tiger', 'lion', 'pet', 'animal', 'kitten'])
🧪 Semantic Analogies
========================================
France - Paris + London = ?
→ Britain (0.782)
→ England (0.769)
→ UK (0.685)
doctor - hospital + school = ?
→ teacher (0.613)
→ scientist (0.550)
→ office (0.338)
puppy - dog + cat = ?
→ kitten (0.828)
→ feline (0.613)
→ pet (0.590)
Exercise 3: Find More Semantic Analogies
Part 3: Semantic Search
Embeddings make it possible to search by meaning rather than exact keywords.
Below we index a small set of research paper titles, then query it with plain-English phrases.
Run the cells, then try changing the queries to search for something relevant to your own research.
# Research paper titles covering a range of topics
papers = [
    "Deep Learning for Image Classification Using Convolutional Neural Networks",
    "Neural Networks for Computer Vision: A Comprehensive Review",
    "Attention Mechanisms in Natural Language Processing",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Quantum Computing: Algorithms and Applications",
    "Quantum Error Correction and Fault-Tolerant Computing",
    "Climate Change Impact on Marine Ecosystems",
    "Ocean Acidification and Coral Reef Degradation",
    "Machine Learning in Healthcare: Predictive Analytics",
    "AI-Based Disease Diagnosis Using Medical Imaging",
    "Blockchain Technology for Secure Financial Transactions",
    "Cryptocurrency Mining and Environmental Sustainability",
    "Gene Editing with CRISPR-Cas9: Ethical Implications",
    "Synthetic Biology and Genetic Engineering Advances",
    "Renewable Energy Storage Solutions for Smart Grids",
    "Solar Panel Efficiency and Energy Conversion Optimization"
]
paper_embeddings = model.encode(papers)
print(f"✅ Indexed {len(papers)} research papers")
def semantic_search(query, papers, paper_embeddings, top_k=3):
    """Return the top_k most semantically similar papers for a query."""
    query_embedding = model.encode(query)
    similarities = cosine_similarity([query_embedding], paper_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [{'paper': papers[i], 'similarity': similarities[i]} for i in top_indices]
# Try changing these queries
queries = [
    "transformers for text understanding",
    "environmental impact of oceans",
    "medical AI applications",
]
for query in queries:
    print(f"\n🔍 '{query}'")
    print("-" * 60)
    for i, result in enumerate(semantic_search(query, papers, paper_embeddings), 1):
        bar = '█' * int(result['similarity'] * 40)
        print(f"{i}. [{result['similarity']:.3f}] {bar}")
        print(f" {result['paper']}")
🔍 'transformers for text understanding'
------------------------------------------------------------
1. [0.466] ██████████████████
BERT: Pre-training of Deep Bidirectional Transformers
2. [0.316] ████████████
Attention Mechanisms in Natural Language Processing
3. [0.146] █████
Deep Learning for Image Classification Using Convolutional Neural Networks
🔍 'environmental impact of oceans'
------------------------------------------------------------
1. [0.668] ██████████████████████████
Climate Change Impact on Marine Ecosystems
2. [0.504] ████████████████████
Ocean Acidification and Coral Reef Degradation
3. [0.240] █████████
Cryptocurrency Mining and Environmental Sustainability
🔍 'medical AI applications'
------------------------------------------------------------
1. [0.637] █████████████████████████
AI-Based Disease Diagnosis Using Medical Imaging
2. [0.458] ██████████████████
Machine Learning in Healthcare: Predictive Analytics
3. [0.235] █████████
Deep Learning for Image Classification Using Convolutional Neural Networks
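In the output above, the third hit for each query scores much lower than the first two, because top-k search always returns k results even when few are relevant. One common remedy is a minimum-score cut-off. The sketch below assumes result dictionaries shaped like those returned by semantic_search; `filter_results` and the 0.3 threshold are hypothetical choices for illustration, not tuned values.

```python
def filter_results(results, min_score=0.3):
    """Keep only results whose similarity is at least min_score."""
    return [r for r in results if r["similarity"] >= min_score]

# Scores copied from the 'medical AI applications' query output
results = [
    {"paper": "AI-Based Disease Diagnosis Using Medical Imaging", "similarity": 0.637},
    {"paper": "Machine Learning in Healthcare: Predictive Analytics", "similarity": 0.458},
    {"paper": "Deep Learning for Image Classification Using Convolutional Neural Networks", "similarity": 0.235},
]

kept = filter_results(results, min_score=0.3)
for r in kept:
    print(f"[{r['similarity']:.3f}] {r['paper']}")
```

A sensible threshold depends on the model and the corpus, so it is usually chosen by inspecting scores for known relevant and irrelevant pairs.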
Key Takeaways
By now, you should understand:
✅ What embeddings are and why LLMs use them to represent meaning
✅ Cosine similarity as a measure of semantic closeness
✅ Vector arithmetic works on meaning — relationships are encoded geometrically
✅ Semantic search finds relevant content by meaning, not just keywords
Practical Applications:
Semantic Search: Find documents by meaning rather than exact keyword match
Recommendation Systems: Suggest similar items based on embedding proximity
Retrieval-Augmented Generation (RAG): Ground LLM responses in relevant retrieved context