
word2vec

WORD2VEC

A Word2Vec model implementation for generating word embeddings.

This class provides methods for training word embeddings and managing model persistence. It's particularly useful for recommendation systems that need to understand word-level semantics.

Attributes:
    model (Word2Vec): The underlying Gensim Word2Vec model instance

Methods:
    train: Trains the Word2Vec model on a corpus of sentences
    get_embedding: Retrieves the embedding vector for a specific word
    save_model: Persists the trained model to disk
    load_model: Loads a previously trained model from disk

Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
# Module-level imports required by this excerpt.
from typing import List

from gensim.models import Word2Vec

class WORD2VEC:
    """
    A Word2Vec model implementation for generating word embeddings.

    This class provides methods for training word embeddings and managing model
    persistence. It's particularly useful for recommendation systems that need
    to understand word-level semantics.

    Attributes:
        model (Word2Vec): The underlying Gensim Word2Vec model instance

    Methods:
        train: Trains the Word2Vec model on a corpus of sentences
        get_embedding: Retrieves the embedding vector for a specific word
        save_model: Persists the trained model to disk
        load_model: Loads a previously trained model from disk
    """

    def __init__(self, vector_size: int = 100, window: int = 5, min_count: int = 1, workers: int = 4):
        """
        Initialize a new Word2Vec model with specified parameters.

        Args:
            vector_size (int): Dimensionality of the word vectors. Higher dimensions can capture
                             more complex semantic relationships but require more data.
            window (int): Maximum distance between the current and predicted word within a sentence.
                         Larger windows consider broader context but may be noisier.
            min_count (int): Ignores all words with total frequency lower than this value.
                           Helps reduce noise from rare words.
            workers (int): Number of worker threads for training parallelization.
                         More workers can speed up training on multicore systems.

        Note:
            The model is not trained upon initialization. Call train() with your corpus
            to begin training.
        """
        self.model = Word2Vec(vector_size=vector_size, window=window, min_count=min_count, workers=workers)

    def train(self, sentences: List[List[str]], epochs: int = 10):
        """
        Train the Word2Vec model on a corpus of sentences.

        Args:
            sentences (List[List[str]]): A list of tokenized sentences where each sentence
                                       is represented as a list of strings (tokens).
            epochs (int): Number of iterations over the corpus during training.
                         More epochs can improve quality but increase training time.

        Note:
            - Sentences should be preprocessed (tokenized, cleaned) before training
            - Training time scales with corpus size and vector_size
            - Progress can be monitored through Gensim's logging
        """
        self.model.build_vocab(sentences)
        self.model.train(sentences, total_examples=self.model.corpus_count, epochs=epochs)

    def get_embedding(self, word: str) -> List[float]:
        """
        Get the embedding vector for a given word.

        Args:
            word (str): The word to retrieve the embedding for.

        Returns:
            List[float]: The embedding vector, or a zero vector of length
                vector_size if the word is not in the vocabulary.
        """
        if word in self.model.wv:
            return self.model.wv[word].tolist()
        else:
            return [0.0] * self.model.vector_size

    def save_model(self, path: str):
        """
        Save the trained Word2Vec model.

        Args:
            path (str): File path to save the model.
        """
        self.model.save(path)

    def load_model(self, path: str):
        """
        Load a pre-trained Word2Vec model.

        Args:
            path (str): File path of the saved model.
        """
        self.model = Word2Vec.load(path)
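
A minimal usage sketch. The import path follows the source location shown above and may need adjusting to your package root; the toy corpus is purely illustrative.

from engines.contentFilterEngine.embedding_representation_learning.word2vec import WORD2VEC

# Toy corpus: each sentence is a list of pre-tokenized strings.
corpus = [
    ["user", "likes", "action", "movies"],
    ["user", "watches", "action", "films"],
    ["another", "user", "prefers", "documentaries"],
]

model = WORD2VEC(vector_size=50, window=3, min_count=1, workers=2)
model.train(corpus, epochs=20)

vector = model.get_embedding("action")
print(len(vector))  # 50, i.e. vector_size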

__init__(vector_size=100, window=5, min_count=1, workers=4)

Initialize a new Word2Vec model with specified parameters.

Parameters:
    vector_size (int, default 100): Dimensionality of the word vectors. Higher dimensions can capture more complex semantic relationships but require more data.
    window (int, default 5): Maximum distance between the current and predicted word within a sentence. Larger windows consider broader context but may be noisier.
    min_count (int, default 1): Ignores all words with total frequency lower than this value. Helps reduce noise from rare words.
    workers (int, default 4): Number of worker threads for training parallelization. More workers can speed up training on multicore systems.

Note:
    The model is not trained upon initialization. Call train() with your corpus to begin training.

Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
def __init__(self, vector_size: int = 100, window: int = 5, min_count: int = 1, workers: int = 4):
    """
    Initialize a new Word2Vec model with specified parameters.

    Args:
        vector_size (int): Dimensionality of the word vectors. Higher dimensions can capture
                         more complex semantic relationships but require more data.
        window (int): Maximum distance between the current and predicted word within a sentence.
                     Larger windows consider broader context but may be noisier.
        min_count (int): Ignores all words with total frequency lower than this value.
                       Helps reduce noise from rare words.
        workers (int): Number of worker threads for training parallelization.
                     More workers can speed up training on multicore systems.

    Note:
        The model is not trained upon initialization. Call train() with your corpus
        to begin training.
    """
    self.model = Word2Vec(vector_size=vector_size, window=window, min_count=min_count, workers=workers)
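
As a quick illustration, the constructor simply instantiates the wrapped Gensim model, so the chosen hyperparameters are visible on the inner instance (vector_size and window are standard Gensim attributes):

wrapper = WORD2VEC(vector_size=200, window=8, min_count=2, workers=4)

# The wrapped Gensim instance reflects the chosen hyperparameters.
print(wrapper.model.vector_size)  # 200
print(wrapper.model.window)       # 8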

get_embedding(word)

Get the embedding vector for a given word.

Parameters:
    word (str): The word to retrieve the embedding for.

Returns:
    List[float]: The embedding vector, or a zero vector of length vector_size if the word is not in the vocabulary.

Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
def get_embedding(self, word: str) -> List[float]:
    """
    Get the embedding vector for a given word.

    Args:
        word (str): The word to retrieve the embedding for.

    Returns:
        List[float]: The embedding vector, or a zero vector of length
            vector_size if the word is not in the vocabulary.
    """
    if word in self.model.wv:
        return self.model.wv[word].tolist()
    else:
        return [0.0] * self.model.vector_size
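
Note the fallback behavior: an out-of-vocabulary word does not raise a KeyError but yields a zero vector of length vector_size. A short sketch, reusing the model trained in the first example:

known = model.get_embedding("action")        # learned, generally nonzero
unknown = model.get_embedding("zzz_unseen")  # not in vocabulary

print(all(v == 0.0 for v in unknown))  # True: zero-vector fallback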

load_model(path)

Load a pre-trained Word2Vec model.

Parameters:
    path (str): File path of the saved model.

Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
def load_model(self, path: str):
    """
    Load a pre-trained Word2Vec model.

    Args:
        path (str): File path of the saved model.
    """
    self.model = Word2Vec.load(path)

save_model(path)

Save the trained Word2Vec model.

Parameters:
    path (str): File path to save the model.

Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
def save_model(self, path: str):
    """
    Save the trained Word2Vec model.

    Args:
        path (str): File path to save the model.
    """
    self.model.save(path)
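
save_model and load_model are thin wrappers around Gensim's native persistence, so a save/load round trip preserves the learned vectors. A sketch with an illustrative file path, reusing the trained model from the first example:

model.save_model("word2vec.model")

restored = WORD2VEC()                  # fresh wrapper, default parameters
restored.load_model("word2vec.model")  # replaces the untrained inner model

assert restored.get_embedding("action") == model.get_embedding("action")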

train(sentences, epochs=10)

Train the Word2Vec model on a corpus of sentences.

Parameters:
    sentences (List[List[str]], required): A list of tokenized sentences where each sentence is represented as a list of strings (tokens).
    epochs (int, default 10): Number of iterations over the corpus during training. More epochs can improve quality but increase training time.

Note:
    - Sentences should be preprocessed (tokenized, cleaned) before training
    - Training time scales with corpus size and vector_size
    - Progress can be monitored through Gensim's logging
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
def train(self, sentences: List[List[str]], epochs: int = 10):
    """
    Train the Word2Vec model on a corpus of sentences.

    Args:
        sentences (List[List[str]]): A list of tokenized sentences where each sentence
                                   is represented as a list of strings (tokens).
        epochs (int): Number of iterations over the corpus during training.
                     More epochs can improve quality but increase training time.

    Note:
        - Sentences should be preprocessed (tokenized, cleaned) before training
        - Training time scales with corpus size and vector_size
        - Progress can be monitored through Gensim's logging
    """
    self.model.build_vocab(sentences)
    self.model.train(sentences, total_examples=self.model.corpus_count, epochs=epochs)
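
Because training progress is reported through Gensim's logging, enabling INFO-level logging before calling train() makes vocabulary building and per-epoch progress visible. A sketch, reusing the toy corpus from the first example:

import logging

# Gensim emits INFO-level progress messages during build_vocab/train.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

model = WORD2VEC(vector_size=50)
model.train(corpus, epochs=10)  # progress now appears on stderr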