# doc2vec

## DOC2VEC
A Doc2Vec model implementation for generating document embeddings.
This class provides methods for training document embeddings, retrieving vectors, and managing model persistence. It's particularly useful for recommendation systems that need to understand document-level semantics.
Attributes:

| Name | Type | Description |
|---|---|---|
| model | Doc2Vec | The underlying Gensim Doc2Vec model instance |
Methods:

| Name | Description |
|---|---|
| train | Trains the Doc2Vec model on a corpus of documents |
| get_embedding | Retrieves the embedding vector for a specific document |
| save_model | Persists the trained model to disk |
| load_model | Loads a previously trained model from disk |
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
### `__init__(vector_size=100, window=5, min_count=1, workers=4, epochs=10)`
Initialize a new Doc2Vec model with specified parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vector_size | int | Dimensionality of the feature vectors. Higher dimensions can capture more complex patterns but require more data and computation. | 100 |
| window | int | Maximum distance between the current and predicted word within a sentence. Larger windows capture broader context but may introduce noise. | 5 |
| min_count | int | Ignores all words with total frequency lower than this value. Helps reduce noise from rare words. | 1 |
| workers | int | Number of worker threads for training parallelization. More workers can speed up training on multicore systems. | 4 |
| epochs | int | Number of iterations over the corpus during training. More epochs can improve quality but increase training time. | 10 |
Note
The model is not trained upon initialization. Call train() with your corpus to begin training.
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
### `get_embedding(doc_id)`
Retrieve the embedding vector for a specific document.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doc_id | int | The unique identifier of the document to embed. Must be within range of trained documents. | required |

Returns:

| Type | Description |
|---|---|
| List[float] | A dense vector representation of the document with dimensionality specified by vector_size. |

Raises:

| Type | Description |
|---|---|
| KeyError | If doc_id is not found in the trained model |
| RuntimeError | If called before training the model |
Note
The returned vector captures semantic properties of the document and can be used for similarity calculations or as features for downstream tasks.
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
### `load_model(path)`
Load a pre-trained Doc2Vec model.
Parameters:

- path (str): File path of the saved model.
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
### `save_model(path)`
Save the trained Doc2Vec model.
Parameters:

- path (str): File path to save the model.
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py
### `train(documents)`
Train the Doc2Vec model on a corpus of documents.
This method processes the input documents, builds a vocabulary, and trains the model using the specified parameters from initialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| documents | List[List[str]] | A list of tokenized documents where each document is represented as a list of strings (tokens). | required |
Example

```python
doc2vec = DOC2VEC()
docs = [['this', 'is', 'doc1'], ['this', 'is', 'doc2']]
doc2vec.train(docs)
```
Note
- Documents should be preprocessed (tokenized, cleaned) before training
- Training time scales with corpus size and vector_size
- Progress can be monitored through Gensim's logging
Source code in engines/contentFilterEngine/embedding_representation_learning/doc2vec.py