# word2vec

## Word2Vec

A Word2Vec model implementation for generating word embeddings.

This class provides methods for training word embeddings and managing model persistence. It's particularly useful for recommendation systems that need to understand word-level semantics.
Attributes:

Name | Type | Description |
---|---|---|
`model` | `Word2Vec` | The underlying Gensim Word2Vec model instance |
Methods:

Name | Description |
---|---|
`train` | Trains the Word2Vec model on a corpus of sentences |
`get_embedding` | Retrieves the embedding vector for a specific word |
`save_model` | Persists the trained model to disk |
`load_model` | Loads a previously trained model from disk |
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
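A minimal usage sketch, assuming the wrapper class is exported as `Word2Vec` from the module above; the corpus and file path here are purely illustrative:

```python
from engines.contentFilterEngine.embedding_representation_learning.word2vec import Word2Vec

# Illustrative toy corpus: pre-tokenized, cleaned sentences.
corpus = [
    ["users", "who", "liked", "this", "movie"],
    ["movie", "recommendations", "based", "on", "genre"],
]

w2v = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)
w2v.train(corpus, epochs=10)            # train on the tokenized corpus
vector = w2v.get_embedding("movie")     # List[float] of length vector_size
w2v.save_model("word2vec.model")        # persist the trained model to disk
```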
### `__init__(vector_size=100, window=5, min_count=1, workers=4)`

Initialize a new Word2Vec model with specified parameters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`vector_size` | `int` | Dimensionality of the word vectors. Higher dimensions can capture more complex semantic relationships but require more data. | `100` |
`window` | `int` | Maximum distance between the current and predicted word within a sentence. Larger windows consider broader context but may be noisier. | `5` |
`min_count` | `int` | Ignores all words with total frequency lower than this value. Helps reduce noise from rare words. | `1` |
`workers` | `int` | Number of worker threads for training parallelization. More workers can speed up training on multicore systems. | `4` |
Note: The model is not trained upon initialization. Call `train()` with your corpus to begin training.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
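The source listing is not reproduced here. A plausible sketch of the constructor, assuming it wraps Gensim's `Word2Vec` (Gensim ≥ 4.0 keyword names) and defers training:

```python
from gensim.models import Word2Vec as GensimWord2Vec

def __init__(self, vector_size=100, window=5, min_count=1, workers=4):
    # No corpus is passed, so Gensim builds no vocabulary and runs
    # no training until train() is called explicitly.
    self.model = GensimWord2Vec(
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        workers=workers,
    )
```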
### `get_embedding(word)`

Get the embedding vector for a given word.

Parameters:

- `word` (str): The word to retrieve the embedding for.

Returns:

- List[float]: The embedding vector.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
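A sketch of the likely implementation, assuming the Gensim-backed `model` attribute; note that Gensim raises `KeyError` for words absent from the vocabulary:

```python
from typing import List

def get_embedding(self, word: str) -> List[float]:
    # model.wv holds the trained word vectors; indexing returns a
    # numpy array, converted here to a plain list of floats.
    return self.model.wv[word].tolist()
```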
### `load_model(path)`

Load a pre-trained Word2Vec model.

Parameters:

- `path` (str): File path of the saved model.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
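A sketch under the same assumption, using Gensim's native deserialization:

```python
from gensim.models import Word2Vec as GensimWord2Vec

def load_model(self, path: str) -> None:
    # Replace the current model with the one restored from disk.
    self.model = GensimWord2Vec.load(path)
```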
### `save_model(path)`

Save the trained Word2Vec model.

Parameters:

- `path` (str): File path to save the model.
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
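A sketch under the same assumption, using Gensim's native serialization, which preserves the vocabulary and trained vectors:

```python
def save_model(self, path: str) -> None:
    # Gensim's save() writes the full model state to the given path.
    self.model.save(path)
```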
### `train(sentences, epochs=10)`

Train the Word2Vec model on a corpus of sentences.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sentences` | `List[List[str]]` | A list of tokenized sentences where each sentence is represented as a list of strings (tokens). | *required* |
`epochs` | `int` | Number of iterations over the corpus during training. More epochs can improve quality but increase training time. | `10` |
Note:

- Sentences should be preprocessed (tokenized, cleaned) before training
- Training time scales with corpus size and `vector_size`
- Progress can be monitored through Gensim's logging
Source code in engines/contentFilterEngine/embedding_representation_learning/word2vec.py
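A sketch of how training likely proceeds, assuming the deferred-training constructor above: Gensim requires the vocabulary to be built before `train()` runs, and `total_examples` must be supplied when training an already-constructed model:

```python
from typing import List

def train(self, sentences: List[List[str]], epochs: int = 10) -> None:
    # Build the vocabulary from the corpus, then run the training loop.
    self.model.build_vocab(sentences)
    self.model.train(
        sentences,
        total_examples=self.model.corpus_count,  # set by build_vocab()
        epochs=epochs,
    )
```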