# word2vec

## Word2Vec

A Word2Vec model implementation for generating word embeddings.

This class provides methods for training word embeddings and managing model persistence. It's particularly useful for recommendation systems that need to understand word-level semantics.
Attributes:

Name | Type | Description |
---|---|---|
`model` | `Word2Vec` | The underlying Gensim Word2Vec model instance |
Methods:

Name | Description |
---|---|
`train` | Trains the Word2Vec model on a corpus of sentences |
`get_embedding` | Retrieves the embedding vector for a specific word |
`save_model` | Persists the trained model to disk |
`load_model` | Loads a previously trained model from disk |
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
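A minimal usage sketch, assuming the wrapper class is exported as `Word2Vec` from the module above; the corpus and file path here are purely illustrative:

```python
from engines.contentFilterEngine.embedding_representation_learning.word2vec import Word2Vec

# Illustrative toy corpus: pre-tokenized, cleaned sentences.
corpus = [
    ["users", "who", "liked", "this", "movie"],
    ["movie", "recommendations", "based", "on", "genre"],
]

w2v = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)
w2v.train(corpus, epochs=10)            # train on the tokenized corpus
vector = w2v.get_embedding("movie")     # List[float] of length vector_size
w2v.save_model("word2vec.model")        # persist the trained model to disk
```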
### `__init__(vector_size=100, window=5, min_count=1, workers=4)`

Initialize a new Word2Vec model with specified parameters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`vector_size` | `int` | Dimensionality of the word vectors. Higher dimensions can capture more complex semantic relationships but require more data. | `100` |
`window` | `int` | Maximum distance between the current and predicted word within a sentence. Larger windows consider broader context but may be noisier. | `5` |
`min_count` | `int` | Ignores all words with total frequency lower than this value. Helps reduce noise from rare words. | `1` |
`workers` | `int` | Number of worker threads for training parallelization. More workers can speed up training on multicore systems. | `4` |
Note: The model is not trained upon initialization. Call `train()` with your corpus to begin training.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
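The source listing is not reproduced here. A plausible sketch of the constructor, assuming it wraps Gensim's `Word2Vec` (Gensim ≥ 4.0 keyword names) and defers training:

```python
from gensim.models import Word2Vec as GensimWord2Vec

def __init__(self, vector_size=100, window=5, min_count=1, workers=4):
    # No corpus is passed, so Gensim builds no vocabulary and runs
    # no training until train() is called explicitly.
    self.model = GensimWord2Vec(
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        workers=workers,
    )
```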
### `get_embedding(word)`

Get the embedding vector for a given word.

Parameters:

- `word` (str): The word to retrieve the embedding for.

Returns:

- List[float]: The embedding vector.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
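A sketch of the likely implementation, assuming the Gensim-backed `model` attribute; note that Gensim raises `KeyError` for words absent from the vocabulary:

```python
from typing import List

def get_embedding(self, word: str) -> List[float]:
    # model.wv holds the trained word vectors; indexing returns a
    # numpy array, converted here to a plain list of floats.
    return self.model.wv[word].tolist()
```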
### `load_model(path)`

Load a pre-trained Word2Vec model.

Parameters:

- `path` (str): File path of the saved model.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
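A sketch under the same assumption, using Gensim's native deserialization:

```python
from gensim.models import Word2Vec as GensimWord2Vec

def load_model(self, path: str) -> None:
    # Replace the current model with the one restored from disk.
    self.model = GensimWord2Vec.load(path)
```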
### `save_model(path)`

Save the trained Word2Vec model.

Parameters:

- `path` (str): File path to save the model.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
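A sketch under the same assumption, using Gensim's native serialization, which preserves the vocabulary and trained vectors:

```python
def save_model(self, path: str) -> None:
    # Gensim's save() writes the full model state to the given path.
    self.model.save(path)
```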
### `train(sentences, epochs=10)`

Train the Word2Vec model on a corpus of sentences.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sentences` | `List[List[str]]` | A list of tokenized sentences where each sentence is represented as a list of strings (tokens). | *required* |
`epochs` | `int` | Number of iterations over the corpus during training. More epochs can improve quality but increase training time. | `10` |
Note:

- Sentences should be preprocessed (tokenized, cleaned) before training
- Training time scales with corpus size and `vector_size`
- Progress can be monitored through Gensim's logging
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
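A sketch of how training likely proceeds, assuming the deferred-training constructor above: Gensim requires the vocabulary to be built before `train()` runs, and `total_examples` must be supplied when training an already-constructed model:

```python
from typing import List

def train(self, sentences: List[List[str]], epochs: int = 10) -> None:
    # Build the vocabulary from the corpus, then run the training loop.
    self.model.build_vocab(sentences)
    self.model.train(
        sentences,
        total_examples=self.model.corpus_count,  # set by build_vocab()
        epochs=epochs,
    )
```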