AnaText code documentation
ETA module
eta
- cluster_documents_with_keywords(filename: str | pathlib.Path, verbose: bool = False)
Document clustering with keywords for each cluster …
- Parameters:
- Returns:
df (pd.DataFrame) – dataframe with columns {text_col}, {emb_col} and {label_col}
top_word_dict (dict) – dictionary with keywords for each cluster
data (pd.DataFrame) – dataframe for internal function operation
cluster_centers (np.array) – cluster centers in the original dimension
radiuses (np.array) – relative radii of clusters for two-dimensional representation
cluster_model – KMeans with the number of clusters determined at the approximation stage
cluster_centers_2d (np.array) – cluster centers for 2D visualization
reduce_model – dimension reduction model for visualization
embeddings (np.array) – embeddings
tokenizer
model – SCCLBert
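The function returns eleven values, which is easy to get wrong when unpacking by position. The sketch below is a minimal illustration, assuming the values come back as a single tuple in the order listed under Returns (the documentation does not confirm this). `ClusteringResult` and `name_result` are hypothetical helpers, not part of AnaText.

```python
from typing import Any, NamedTuple


class ClusteringResult(NamedTuple):
    """Hypothetical container; field order mirrors the Returns list above."""

    df: Any                  # dataframe with text, embedding and label columns
    top_word_dict: Any       # keywords for each cluster
    data: Any                # dataframe used internally
    cluster_centers: Any     # centers in the original dimension
    radiuses: Any            # relative radii for the 2D representation
    cluster_model: Any       # fitted KMeans
    cluster_centers_2d: Any  # centers for 2D visualization
    reduce_model: Any        # dimension reduction model
    embeddings: Any
    tokenizer: Any
    model: Any               # SCCLBert


def name_result(raw: tuple) -> ClusteringResult:
    """Wrap a raw 11-item return value so fields can be read by name."""
    expected = len(ClusteringResult._fields)
    if len(raw) != expected:
        raise ValueError(f"expected {expected} values, got {len(raw)}")
    return ClusteringResult(*raw)
```

Under that assumption, the output of cluster_documents_with_keywords could be passed straight to `name_result`, after which `result.top_word_dict` or `result.cluster_centers_2d` can be read without counting positions.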
- split_cluster(cluster_num, divisor, data, reduce_model, embeddings)
Splits cluster number {cluster_num} into {divisor} new clusters. …
- Parameters:
- Returns:
data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”
cluster_centers_2d (list(list)) – cluster centers for 2D visualization
radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
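The idea of splitting one cluster into {divisor} parts can be sketched as re-clustering only the members of that cluster. The example below is a minimal illustration using plain k-means iterations on lists of coordinates. It is not the AnaText implementation: the function name `split_cluster_sketch`, its arguments, and the scheme for numbering new clusters are assumptions made for this example.

```python
import math


def split_cluster_sketch(labels, points, cluster_num, divisor, iters=20):
    """Re-cluster only the members of `cluster_num` into `divisor` groups.

    The first sub-cluster keeps the label `cluster_num`; the remaining ones
    receive fresh labels above the current maximum (an assumed convention).
    """
    members = [i for i, lab in enumerate(labels) if lab == cluster_num]
    if divisor < 1 or len(members) < divisor:
        raise ValueError("cluster has too few members for this divisor")

    # Deterministic start: evenly spaced members serve as initial centers.
    step = len(members) / divisor
    centers = [list(points[members[int(k * step)]]) for k in range(divisor)]
    assign = [0] * len(members)

    for _ in range(iters):
        # Assignment step: each member goes to its nearest center.
        for j, i in enumerate(members):
            assign[j] = min(
                range(divisor), key=lambda k: math.dist(points[i], centers[k])
            )
        # Update step: each center moves to the mean of its members.
        for k in range(divisor):
            group = [points[i] for j, i in enumerate(members) if assign[j] == k]
            if group:
                centers[k] = [sum(c) / len(group) for c in zip(*group)]

    next_label = max(labels) + 1
    new_ids = [cluster_num] + [next_label + k for k in range(divisor - 1)]
    new_labels = list(labels)
    for j, i in enumerate(members):
        new_labels[i] = new_ids[assign[j]]
    return new_labels
```

Points outside the chosen cluster keep their labels, so only the split cluster's center and radius would need to be recomputed afterwards.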
- union_clusters(cl_list, data, reduce_model, embeddings)
Merges the clusters listed in {cl_list} into a single cluster. …
- Parameters:
- Returns:
data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”
cluster_centers_2d (list(list)) – cluster centers for 2D visualization
radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
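Merging clusters amounts to relabeling their members with one shared label and recomputing the center of the result. The sketch below illustrates that bookkeeping on plain lists. It is not the AnaText implementation: `union_clusters_sketch` is a hypothetical name, and keeping the smallest label from the list as the merged label is an assumption.

```python
def union_clusters_sketch(labels, points, cl_list):
    """Merge every cluster in `cl_list` into one and return its new center.

    The merged cluster keeps the smallest label in `cl_list`
    (an assumed convention).
    """
    if not cl_list:
        raise ValueError("cl_list must contain at least one cluster label")

    target = min(cl_list)
    merged = set(cl_list)

    # Relabel members of the listed clusters; all other points are untouched.
    new_labels = [target if lab in merged else lab for lab in labels]

    # Recompute the center of the merged cluster as the mean of its members.
    members = [p for p, lab in zip(points, new_labels) if lab == target]
    center = [sum(c) / len(members) for c in zip(*members)] if members else None
    return new_labels, center
```

For example, merging clusters 0 and 1 in `labels = [0, 1, 2, 1]` leaves cluster 2 unchanged and gives every other point the label 0.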