From Text to Embeddings
Published on May 5, 2026
This article provides a guide to understanding embeddings in natural language processing. My goal is to explain what embeddings are, how they work, and why they are important for machine learning models that process text data.
Let's start with the basics, show the concepts with examples, and then explore a simple implementation of embeddings in Python. The source code for this article is available here.
Neural networks do not understand text directly. They do not know what the word "king" means. They do not know that "queen" is related to "king". They do not know that "cat" and "dog" are more similar than "cat" and "airplane".
A neural network only works with numbers. So before we can use text in a model, we need to answer one basic question:
How do we turn words into numbers?
But not just any numbers. We want useful numbers. Ideally, we want a representation where words with similar meanings have similar numerical representations.
That is the idea behind word embeddings.
Imagine that every word is a point in space.
For example, take the words king, queen, man, and woman.
If we could place these words in a 2D space, we would like related words to be close together.
Maybe king and queen are close.
Maybe man and woman are close.
Maybe the direction from man to woman is similar to the direction from king to queen.

This is the essence of word embeddings: each word is mapped to a vector, and words with similar meanings end up close together in that vector space.
Once words are represented as vectors, we can compare them using mathematical operations such as distance, angle, and dot product.
So instead of asking:
Are these two words similar?
We can ask:
Are these two vectors close together? Do they point in the same direction? Is their dot product high?
This is why embeddings are useful. They transform words into a space where meaning can be handled mathematically.
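To make this concrete, here is a small Python sketch using made-up 3-dimensional vectors (not real learned embeddings) to show how such comparisons look in code:

import numpy as np

# Toy vectors, invented for illustration; real embeddings are learned from data.
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.15])
apple = np.array([0.10, 0.20, 0.90])

def cosine_similarity(a, b):
    # Close to 1.0 when the vectors point in the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, queen))  # high: similar meaning
print(cosine_similarity(king, apple))  # lower: unrelated meaning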
But now we have to build up to that. How do we get from raw text to such vectors?
As mentioned earlier, a neural network cannot directly process this:
Attention is all you need.
It needs numbers. So the first step is to split the text into smaller pieces, called tokens. This process is called tokenization.
A token can be a word, a subword, or even a character. In modern language models, tokens are often subwords, which allows the model to handle rare words and misspellings more effectively.
But for simplicity and educational purposes, let's consider character-level tokenization. In this case, the word Attention would be split into its individual characters, which would then be mapped to numbers called token IDs. This essentially means creating a vocabulary of characters and assigning each character a unique ID.
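As a rough sketch of what this could look like in Python (the vocabulary and IDs below are illustrative, not the ones used in the article's source code):

text = "Attention is all you need."

# Build the vocabulary: every unique character gets its own ID.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s):
    return [char_to_id[ch] for ch in s]

def decode(ids):
    return "".join(id_to_char[i] for i in ids)

token_ids = encode("Attention")
print(token_ids)          # the exact IDs depend on the vocabulary ordering
print(decode(token_ids))  # "Attention"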

Now each token has a corresponding token ID, which is a number. This is already a number representation of the text, but it is not a good representation of meaning. The token IDs are just arbitrary numbers assigned to each token, and they do not capture any semantic relationships between the tokens.
The number 6 does not inherently mean anything about the character e. It is just a label. So we need a better representation that uses these token IDs.
The simplest way to represent tokens as vectors is to use one-hot encoding. In this approach, we create a vector for each token that has a length equal to the size of the vocabulary. The vector is filled with zeros except for a single position that corresponds to the token ID, which is set to 1.
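A minimal sketch of one-hot encoding, built on top of the same kind of character vocabulary as above:

import numpy as np

text = "Attention is all you need."
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

def one_hot(token_id, vocab_size):
    # A vector of zeros with a single 1 at the position given by the token ID.
    vec = np.zeros(vocab_size)
    vec[token_id] = 1.0
    return vec

print(one_hot(char_to_id["e"], len(vocab)))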
Now the neural network can process these vectors, but one-hot encoding has a major drawback...

One-hot vectors are very inefficient. If we have a vocabulary of 10,000 tokens, then every token is represented as a vector of length 10,000, where only one element is 1 and the rest are 0. This means that most of the elements in the vector are zeros, which is not an efficient use of memory.
But the bigger problem is meaning. One-hot vectors do not capture any semantic relationships between tokens. For example, the vectors for "king" and "queen" are just as different as the vectors for "king" and "apple".
king -> [0, 0, 0, 1, 0, ...]
queen -> [0, 0, 0, 0, 1, ...]
apple -> [0, 1, 0, 0, 0, ...]

All tokens are equally distant from each other in one-hot encoding, which means that the model has no way to learn that "king" and "queen" are related concepts, while "king" and "apple" are not.
This is not what we want. We want a representation where the model can learn that "king" and "queen" are related, while "king" and "apple" are not.
So we need dense vectors that can capture these relationships. Let's see how we can create such embedding vectors.
An embedding is a vector assigned to a token. Instead of representing "t" as a long one-hot vector:
t = [0, 0, 0, 0, 0, 0, 1]

We can represent it as a dense vector of, say, 3 dimensions:

t = [0.21, 0.54, 0.10]

This vector is shorter and contains real numbers. The important thing is that these numbers are learned during training.
We store all token vectors in an embedding matrix C. Each row of this matrix corresponds to a token ID, and the values in that row are the embedding vector for that token. If the vocabulary size is l and the embedding dimension is d, then the embedding matrix has a shape of (l, d).
When we input a token ID into the model, it looks up the corresponding row in the embedding matrix to get the embedding vector for that token. This vector can then be used as input to the rest of the neural network.

This is called an embedding lookup. But before we jump to lookup tables, it is useful to see where this comes from.
Suppose that the letter "A" has a token ID of i and is represented as a one-hot vector x_A of length l (the size of the vocabulary). The embedding matrix C has a shape of (l, d), where d is the embedding dimension.
Schematically, we can represent this as an input layer with one-hot vectors, followed by an embedding layer with the embedding matrix C:

When we multiply the one-hot vector x_A by the embedding matrix C, we get
x_A @ C = [0, 0, ..., 1, ..., 0] @ C = C[i]

This means that the result of the multiplication is simply the i-th row of the embedding matrix C, which is the embedding vector for the token "A".
This is the central bridge between one-hot vectors and embeddings. A one-hot vector multiplied by an embedding matrix gives the embedding vector.
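We can verify this identity numerically with a small random embedding matrix (the shapes and the token ID below are chosen purely for illustration):

import numpy as np

l, d = 7, 3                 # vocabulary size and embedding dimension (arbitrary here)
C = np.random.randn(l, d)   # embedding matrix

i = 4                       # pretend this is the token ID of "A"
x_A = np.zeros(l)
x_A[i] = 1.0                # one-hot vector for "A"

print(np.allclose(x_A @ C, C[i]))  # True: the product is exactly the i-th row of C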
But there is an efficiency problem with this approach.
In real models, we usually do not actually create the huge one-hot vector. For example, if we have a vocabulary of 100,000 tokens, the one-hot vector has 100,000 entries, but 99,999 of them are zero. Multiplying this huge vector by the embedding matrix C would involve a lot of unnecessary computations, since most of the entries are zero and do not contribute to the final result.
Instead, we directly select the correct row:

This is done using a lookup table. The token ID i is used to directly access the i-th row of the embedding matrix C, which gives us the embedding vector for that token without needing to create the one-hot vector or perform the multiplication.
That is why embedding layers are often described as lookup tables.
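In code, the lookup is just indexing. Here is a small sketch, first with a plain NumPy array and then with PyTorch's nn.Embedding layer, which implements exactly this kind of lookup (the sizes are arbitrary):

import numpy as np
import torch
import torch.nn as nn

# Plain NumPy: the embedding matrix is just a table we index into.
C = np.random.randn(7, 3)
token_ids = [4, 1, 6]
print(C[token_ids])                        # one row of C per token ID

# PyTorch: nn.Embedding stores the same kind of matrix as a trainable parameter.
emb = nn.Embedding(num_embeddings=7, embedding_dim=3)
print(emb(torch.tensor(token_ids)).shape)  # torch.Size([3, 3])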
At the beginning, the embedding matrix C contains random numbers.
For example:
C = [
[0.12, 0.34, 0.56], # embedding for token 0
[0.78, 0.90, 0.12], # embedding for token 1
...
]

These numbers do not have any meaning. They are just random values. The question is: how do these random numbers become meaningful representations of the tokens, and why should we expect that similar words will have similar embeddings?
The answer lies in the training process. When we train a neural network on a language task, the model learns to adjust the values in the embedding matrix C to minimize the loss function. During this process, the model discovers patterns in the data and adjusts the embeddings so that words that appear in similar contexts have similar vectors.
So the embedding matrix is not manually designed, it is learned from data. Let's think what could be a simple training objective that would encourage the model to learn meaningful embeddings...
A very simple training objective is to predict the next character given the current character. This is called bigram language modeling. For example, using the word "Attention", we can create training examples of consecutive pairs of characters and store their token IDs in two lists X and Y, where each pair means
given the character in X, predict the character in Y.
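Here is a small sketch of how such pairs could be built for the word "Attention" (the printed token IDs depend on the vocabulary ordering):

word = "Attention"
vocab = sorted(set(word))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

X, Y = [], []
for ch_in, ch_out in zip(word, word[1:]):
    # Input: the current character, target: the character that follows it.
    X.append(char_to_id[ch_in])
    Y.append(char_to_id[ch_out])

print(X)  # token IDs of "A", "t", "t", "e", "n", "t", "i", "o"
print(Y)  # token IDs of "t", "t", "e", "n", "t", "i", "o", "n"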

This is enough to train a tiny neural network with an embedding layer. The model will learn to predict the next character based on the current character. As it learns, it will adjust the embedding vectors in C so that characters that appear in similar contexts have similar embeddings.
The architecture of the bigram neural network is very simple. It consists of an embedding layer followed by a linear layer that maps the embedding vector to a vector of logits, which represent the unnormalized probabilities of the next character.
We then apply a softmax function to the logits to get the probabilities of each possible next character. The model is trained using a loss function that measures how well the predicted probabilities match the actual next character in the training data.
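Here is a minimal PyTorch sketch of such a bigram model; the layer sizes, toy data, and training loop below are illustrative and not taken from the article's source code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # the matrix C
        self.linear = nn.Linear(embedding_dim, vocab_size)        # maps embeddings to logits

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, embedding_dim)
        logits = self.linear(emb)         # (batch, vocab_size)
        return logits

vocab_size, embedding_dim = 10, 3
model = BigramModel(vocab_size, embedding_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# Toy training pairs (X: current token ID, Y: next token ID).
X = torch.tensor([0, 3, 3, 2])
Y = torch.tensor([3, 3, 2, 5])

for step in range(100):
    logits = model(X)
    # cross_entropy applies softmax internally and compares with the targets.
    loss = F.cross_entropy(logits, Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model.embedding.weight)  # the learned embedding matrix C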

Training this will cause the embedding vectors to form clusters, for example grouping digits, consonants, and vowels together. This happens because characters that appear in similar contexts end up with similar embeddings, which allows the model to make better predictions about the next character. An example of the learned embedding matrix after training might look like this:

So the embedding vectors have become meaningful representations of the characters, and all of this happened as a by-product of training the model to predict the next character.
This idea is extremely important. We could come up with more complex training objectives, such as predicting surrounding characters or predicting masked characters, and the same principle would apply. The model would learn to adjust the embedding vectors in a way that captures the relationships between characters based on the training objective. This is really how many other types of embeddings were developed, such as word2vec, GloVe, and BERT embeddings. They all rely on the idea that by training a model on a specific task, the embedding vectors learn to capture meaningful relationships between tokens.
Ok, that is the end of the article. I hope you found it useful and that it helped you understand the basics of embeddings in natural language processing.
And for those who like thinking in images, here is a visual representation of the basics we covered in this article. Check this out and see if you can understand the concepts just by looking at the image.
If you have any questions or want to see more examples, feel free to reach out or check out the source code here.
Thanks for reading!