DOC2VEC

A Doc2Vec model implementation for generating document embeddings.

This class provides methods for training document embeddings, retrieving vectors, and managing model persistence. It's particularly useful for recommendation systems that need to understand document-level semantics.

Attributes:

Name   Type     Description
model  Doc2Vec  The underlying Gensim Doc2Vec model instance

Methods:

Name           Description
train          Trains the Doc2Vec model on a corpus of documents
get_embedding  Retrieves the embedding vector for a specific document
save_model     Persists the trained model to disk
load_model     Loads a previously trained model from disk
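
For orientation, here is a minimal end-to-end sketch of typical usage. It is illustrative only: the import path is inferred from the source file location shown below, and the corpus is a toy placeholder.

# Hypothetical import path, inferred from the file location below.
from engines.contentFilterEngine.embedding_representation_learning.doc2vec import DOC2VEC

docs = [
    ["user", "likes", "science", "fiction"],
    ["user", "reads", "fantasy", "novels"],
]

d2v = DOC2VEC(vector_size=50, epochs=20)
d2v.train(docs)                  # builds the vocabulary and trains
vec = d2v.get_embedding(0)       # 50-dimensional vector for docs[0]
d2v.save_model("doc2vec.model")  # persist to disk
d2v.load_model("doc2vec.model")  # restore later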

Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class DOC2VEC:
    """
    A Doc2Vec model implementation for generating document embeddings.

    This class provides methods for training document embeddings, retrieving vectors,
    and managing model persistence. It's particularly useful for recommendation systems
    that need to understand document-level semantics.

    Attributes:
        model (Doc2Vec): The underlying Gensim Doc2Vec model instance

    Methods:
        train: Trains the Doc2Vec model on a corpus of documents
        get_embedding: Retrieves the embedding vector for a specific document
        save_model: Persists the trained model to disk
        load_model: Loads a previously trained model from disk
    """

    def __init__(self, vector_size: int = 100, window: int = 5, min_count: int = 1, 
                 workers: int = 4, epochs: int = 10):
        """
        Initialize a new Doc2Vec model with specified parameters.

        Args:
            vector_size (int): Dimensionality of the feature vectors. Higher dimensions can capture
                             more complex patterns but require more data and computation.
            window (int): Maximum distance between the current and predicted word within a sentence.
                         Larger windows capture broader context but may introduce noise.
            min_count (int): Ignores all words with total frequency lower than this value.
                           Helps reduce noise from rare words.
            workers (int): Number of worker threads for training parallelization.
                         More workers can speed up training on multicore systems.
            epochs (int): Number of iterations over the corpus during training.
                         More epochs can improve quality but increase training time.

        Note:
            The model is not trained upon initialization. Call train() with your corpus
            to begin training.
        """
        self.model = Doc2Vec(vector_size=vector_size, window=window, min_count=min_count, workers=workers, epochs=epochs)

    def train(self, documents: List[List[str]]):
        """
        Train the Doc2Vec model on a corpus of documents.

        This method processes the input documents, builds a vocabulary, and trains
        the model using the specified parameters from initialization.

        Args:
            documents (List[List[str]]): A list of tokenized documents where each document
                                       is represented as a list of strings (tokens).

        Example:
            >>> doc2vec = DOC2VEC()
            >>> docs = [['this', 'is', 'doc1'], ['this', 'is', 'doc2']]
            >>> doc2vec.train(docs)

        Note:
            - Documents should be preprocessed (tokenized, cleaned) before training
            - Training time scales with corpus size and vector_size
            - Progress can be monitored through Gensim's logging
        """
        tagged_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(documents)]
        self.model.build_vocab(tagged_data)
        self.model.train(tagged_data, total_examples=self.model.corpus_count, epochs=self.model.epochs)

    def get_embedding(self, doc_id: int) -> List[float]:
        """
        Retrieve the embedding vector for a specific document.

        Args:
            doc_id (int): The unique identifier of the document to embed.
                         Must be within range of trained documents.

        Returns:
            List[float]: A dense vector representation of the document with
                        dimensionality specified by vector_size.

        Raises:
            KeyError: If doc_id is not found in the trained model
            RuntimeError: If called before training the model

        Note:
            The returned vector captures semantic properties of the document
            and can be used for similarity calculations or as features for
            downstream tasks.
        """
        return self.model.dv[str(doc_id)].tolist()

    def save_model(self, path: str):
        """
        Save the trained Doc2Vec model.

        Args:
            path (str): File path to save the model.
        """
        self.model.save(path)

    def load_model(self, path: str):
        """
        Load a pre-trained Doc2Vec model.

        Args:
            path (str): File path of the saved model.
        """
        self.model = Doc2Vec.load(path)

__init__(vector_size=100, window=5, min_count=1, workers=4, epochs=10)

Initialize a new Doc2Vec model with specified parameters.

Parameters:

Name         Type  Default  Description
vector_size  int   100      Dimensionality of the feature vectors. Higher dimensions can capture more complex patterns but require more data and computation.
window       int   5        Maximum distance between the current and predicted word within a sentence. Larger windows capture broader context but may introduce noise.
min_count    int   1        Ignores all words with total frequency lower than this value. Helps reduce noise from rare words.
workers      int   4        Number of worker threads for training parallelization. More workers can speed up training on multicore systems.
epochs       int   10       Number of iterations over the corpus during training. More epochs can improve quality but increase training time.
Note

The model is not trained upon initialization. Call train() with your corpus to begin training.
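
The following configuration sketch is purely illustrative (the values are arbitrary, not recommendations) and shows how the hyperparameters above map onto the constructor:

# Arbitrary example values; tune to your corpus size and hardware.
d2v = DOC2VEC(
    vector_size=200,  # richer vectors, but needs more data and compute
    window=8,         # broader context around each target word
    min_count=5,      # drop rare tokens to reduce noise
    workers=8,        # parallel training threads
    epochs=40,        # more passes over the corpus
)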

get_embedding(doc_id)

Retrieve the embedding vector for a specific document.

Parameters:

Name    Type  Default   Description
doc_id  int   required  The unique identifier of the document to embed. Must be within range of trained documents.

Returns:

Type         Description
List[float]  A dense vector representation of the document with dimensionality specified by vector_size.

Raises:

Type          Description
KeyError      If doc_id is not found in the trained model
RuntimeError  If called before training the model

Note

The returned vector captures semantic properties of the document and can be used for similarity calculations or as features for downstream tasks.
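
For instance, a minimal similarity check between two trained documents might look like this sketch (assuming numpy is installed, d2v is a DOC2VEC instance, and the model has been trained on at least two documents):

import numpy as np

# Cosine similarity between two document vectors.
a = np.array(d2v.get_embedding(0))
b = np.array(d2v.get_embedding(1))
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))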

load_model(path)

Load a pre-trained Doc2Vec model.

Parameters:

Name  Type  Description
path  str   File path of the saved model.

save_model(path)

Save the trained Doc2Vec model.

Parameters:

Name  Type  Description
path  str   File path to save the model.
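
A minimal persistence roundtrip, as a sketch (the path is illustrative and d2v is assumed to be a trained DOC2VEC instance):

# Save the trained model and restore it into a fresh wrapper.
d2v.save_model("models/doc2vec.model")

fresh = DOC2VEC()
fresh.load_model("models/doc2vec.model")
assert fresh.get_embedding(0) == d2v.get_embedding(0)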

train(documents)

Train the Doc2Vec model on a corpus of documents.

This method processes the input documents, builds a vocabulary, and trains the model using the specified parameters from initialization.

Parameters:

Name       Type             Default   Description
documents  List[List[str]]  required  A list of tokenized documents where each document is represented as a list of strings (tokens).
Example:

>>> doc2vec = DOC2VEC()
>>> docs = [['this', 'is', 'doc1'], ['this', 'is', 'doc2']]
>>> doc2vec.train(docs)

Note
  • Documents should be preprocessed (tokenized, cleaned) before training
  • Training time scales with corpus size and vector_size
  • Progress can be monitored through Gensim's logging (see the sketch below)
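
Per the last note above, Gensim emits training progress through Python's standard logging module; a minimal sketch to enable it:

import logging

# Enable INFO-level logs so Gensim reports vocabulary building and epoch progress.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)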