DOC2VEC

A Doc2Vec model implementation for generating document embeddings.

This class provides methods for training document embeddings, retrieving vectors, and managing model persistence. It's particularly useful for recommendation systems that need to understand document-level semantics.

Attributes:

Name   Type     Description
model  Doc2Vec  The underlying Gensim Doc2Vec model instance

Methods:

Name           Description
train          Trains the Doc2Vec model on a corpus of documents
get_embedding  Retrieves the embedding vector for a specific document
save_model     Persists the trained model to disk
load_model     Loads a previously trained model from disk
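
For orientation, here is a minimal end-to-end sketch of typical usage. It is illustrative only: the import path is inferred from the source file location shown below, and the corpus is a toy placeholder.

# Hypothetical import path, inferred from the file location below.
from engines.contentFilterEngine.embedding_representation_learning.doc2vec import DOC2VEC

docs = [
    ["user", "likes", "science", "fiction"],
    ["user", "reads", "fantasy", "novels"],
]

d2v = DOC2VEC(vector_size=50, epochs=20)
d2v.train(docs)                  # builds the vocabulary and trains
vec = d2v.get_embedding(0)       # 50-dimensional vector for docs[0]
d2v.save_model("doc2vec.model")  # persist to disk
d2v.load_model("doc2vec.model")  # restore later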

Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class DOC2VEC:
    """
    A Doc2Vec model implementation for generating document embeddings.

    This class provides methods for training document embeddings, retrieving vectors,
    and managing model persistence. It's particularly useful for recommendation systems
    that need to understand document-level semantics.

    Attributes:
        model (Doc2Vec): The underlying Gensim Doc2Vec model instance

    Methods:
        train: Trains the Doc2Vec model on a corpus of documents
        get_embedding: Retrieves the embedding vector for a specific document
        save_model: Persists the trained model to disk
        load_model: Loads a previously trained model from disk
    """

    def __init__(self, vector_size: int = 100, window: int = 5, min_count: int = 1, 
                 workers: int = 4, epochs: int = 10):
        """
        Initialize a new Doc2Vec model with specified parameters.

        Args:
            vector_size (int): Dimensionality of the feature vectors. Higher dimensions can capture
                             more complex patterns but require more data and computation.
            window (int): Maximum distance between the current and predicted word within a sentence.
                         Larger windows capture broader context but may introduce noise.
            min_count (int): Ignores all words with total frequency lower than this value.
                           Helps reduce noise from rare words.
            workers (int): Number of worker threads for training parallelization.
                         More workers can speed up training on multicore systems.
            epochs (int): Number of iterations over the corpus during training.
                         More epochs can improve quality but increase training time.

        Note:
            The model is not trained upon initialization. Call train() with your corpus
            to begin training.
        """
        self.model = Doc2Vec(vector_size=vector_size, window=window, min_count=min_count, workers=workers, epochs=epochs)

    def train(self, documents: List[List[str]]):
        """
        Train the Doc2Vec model on a corpus of documents.

        This method processes the input documents, builds a vocabulary, and trains
        the model using the specified parameters from initialization.

        Args:
            documents (List[List[str]]): A list of tokenized documents where each document
                                       is represented as a list of strings (tokens).

        Example:
            >>> doc2vec = DOC2VEC()
            >>> docs = [['this', 'is', 'doc1'], ['this', 'is', 'doc2']]
            >>> doc2vec.train(docs)

        Note:
            - Documents should be preprocessed (tokenized, cleaned) before training
            - Training time scales with corpus size and vector_size
            - Progress can be monitored through Gensim's logging
        """
        tagged_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(documents)]
        self.model.build_vocab(tagged_data)
        self.model.train(tagged_data, total_examples=self.model.corpus_count, epochs=self.model.epochs)

    def get_embedding(self, doc_id: int) -> List[float]:
        """
        Retrieve the embedding vector for a specific document.

        Args:
            doc_id (int): The unique identifier of the document to embed.
                         Must be within range of trained documents.

        Returns:
            List[float]: A dense vector representation of the document with
                        dimensionality specified by vector_size.

        Raises:
            KeyError: If doc_id is not found in the trained model
            RuntimeError: If called before training the model

        Note:
            The returned vector captures semantic properties of the document
            and can be used for similarity calculations or as features for
            downstream tasks.
        """
        return self.model.dv[str(doc_id)].tolist()

    def save_model(self, path: str):
        """
        Save the trained Doc2Vec model.

        Args:
            path (str): File path to save the model.
        """
        self.model.save(path)

    def load_model(self, path: str):
        """
        Load a pre-trained Doc2Vec model.

        Args:
            path (str): File path of the saved model.
        """
        self.model = Doc2Vec.load(path)

__init__(vector_size=100, window=5, min_count=1, workers=4, epochs=10)

Initialize a new Doc2Vec model with specified parameters.

Parameters:

Name         Type  Default  Description
vector_size  int   100      Dimensionality of the feature vectors. Higher dimensions can capture more complex patterns but require more data and computation.
window       int   5        Maximum distance between the current and predicted word within a sentence. Larger windows capture broader context but may introduce noise.
min_count    int   1        Ignores all words with total frequency lower than this value. Helps reduce noise from rare words.
workers      int   4        Number of worker threads for training parallelization. More workers can speed up training on multicore systems.
epochs       int   10       Number of iterations over the corpus during training. More epochs can improve quality but increase training time.
Note

The model is not trained upon initialization. Call train() with your corpus to begin training.
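
The following configuration sketch is purely illustrative (the values are arbitrary, not recommendations) and shows how the hyperparameters above map onto the constructor:

# Arbitrary example values; tune to your corpus size and hardware.
d2v = DOC2VEC(
    vector_size=200,  # richer vectors, but needs more data and compute
    window=8,         # broader context around each target word
    min_count=5,      # drop rare tokens to reduce noise
    workers=8,        # parallel training threads
    epochs=40,        # more passes over the corpus
)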

get_embedding(doc_id)

Retrieve the embedding vector for a specific document.

Parameters:

Name    Type  Default   Description
doc_id  int   required  The unique identifier of the document to embed. Must be within range of trained documents.

Returns:

Type         Description
List[float]  A dense vector representation of the document with dimensionality specified by vector_size.

Raises:

Type          Description
KeyError      If doc_id is not found in the trained model
RuntimeError  If called before training the model

Note

The returned vector captures semantic properties of the document and can be used for similarity calculations or as features for downstream tasks.
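
For instance, a minimal similarity check between two trained documents might look like this sketch (assuming numpy is installed, d2v is a DOC2VEC instance, and the model has been trained on at least two documents):

import numpy as np

# Cosine similarity between two document vectors.
a = np.array(d2v.get_embedding(0))
b = np.array(d2v.get_embedding(1))
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))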

load_model(path)

Load a pre-trained Doc2Vec model.

Parameters:

Name  Type  Description
path  str   File path of the saved model.

save_model(path)

Save the trained Doc2Vec model.

Parameters:

Name  Type  Description
path  str   File path to save the model.
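
A minimal persistence roundtrip, as a sketch (the path is illustrative and d2v is assumed to be a trained DOC2VEC instance):

# Save the trained model and restore it into a fresh wrapper.
d2v.save_model("models/doc2vec.model")

fresh = DOC2VEC()
fresh.load_model("models/doc2vec.model")
assert fresh.get_embedding(0) == d2v.get_embedding(0)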

train(documents)

Train the Doc2Vec model on a corpus of documents.

This method processes the input documents, builds a vocabulary, and trains the model using the specified parameters from initialization.

Parameters:

Name       Type             Default   Description
documents  List[List[str]]  required  A list of tokenized documents where each document is represented as a list of strings (tokens).
Example:

>>> doc2vec = DOC2VEC()
>>> docs = [['this', 'is', 'doc1'], ['this', 'is', 'doc2']]
>>> doc2vec.train(docs)

Note
  • Documents should be preprocessed (tokenized, cleaned) before training
  • Training time scales with corpus size and vector_size
  • Progress can be monitored through Gensim's logging (see the sketch below)
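
Per the last note above, Gensim emits training progress through Python's standard logging module; a minimal sketch to enable it:

import logging

# Enable INFO-level logs so Gensim reports vocabulary building and epoch progress.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)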