AnaText code documentation

ETA module

eta

cluster_documents_with_keywords(filename: str | pathlib.Path, verbose: bool = False)

Clusters documents and extracts keywords for each cluster. …

Parameters:

filename (str | pathlib.Path) –
verbose (bool) – defaults to False

Returns:

df (pd.DataFrame) – dataframe with columns {text_col}, {emb_col} and {label_col}
top_word_dict (dict) – dictionary with keywords for each cluster
data (pd.DataFrame) – dataframe for internal function operation
cluster_centers (np.array) – cluster centers in the original dimension
radiuses (np.array) – relative radii of clusters for two-dimensional representation
cluster_model – KMeans with the number of clusters determined at the approximation stage
cluster_centers_2d (np.array) – cluster centers for 2D visualization
reduce_model – dimension reduction model for visualization
embeddings (np.array) – embeddings
tokenizer
model – SCCLBert
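
The documentation does not define how cluster centers and "relative radii" are computed. The following is a minimal sketch under stated assumptions, not the library's implementation: a center is taken as the mean of a cluster's 2D points, and a relative radius as the largest member-to-center distance divided by the largest such distance over all clusters. The function name centers_and_relative_radii is hypothetical.

```python
from math import dist
from statistics import fmean


def centers_and_relative_radii(points_2d, labels):
    """Return ({label: center}, {label: relative radius}) for 2D points.

    Assumption: radius = max distance from the center to a member,
    scaled so that the largest cluster has radius 1.0.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points_2d, labels):
        clusters.setdefault(label, []).append(point)

    # Center of a cluster: component-wise mean of its points.
    centers = {
        label: [fmean(column) for column in zip(*pts)]
        for label, pts in clusters.items()
    }

    # Raw radius: distance from the center to the farthest member.
    raw = {
        label: max(dist(p, centers[label]) for p in pts)
        for label, pts in clusters.items()
    }

    # Normalise by the largest radius (fall back to 1.0 if all are zero).
    largest = max(raw.values()) or 1.0
    radii = {label: r / largest for label, r in raw.items()}
    return centers, radii


points_2d = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]]
labels = [0, 0, 1, 1]
centers, radii = centers_and_relative_radii(points_2d, labels)
```

In this toy input, cluster 1 spans twice the distance of cluster 0, so its relative radius is twice as large.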

split_cluster(cluster_num, divisor, data, reduce_model, embeddings)

Splits cluster number {cluster_num} into {divisor} new clusters. …

Parameters:

cluster_num (int) – number of the cluster
divisor (int) – number of new clusters to split the cluster into
data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”
embeddings (list(list)) –
reduce_model – model for reducing embedding dimensions to 2 for visualization

Returns:

data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”
cluster_centers_2d (list(list)) – cluster centers for 2D visualization
radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
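
The idea of splitting one cluster into {divisor} parts can be illustrated with a small, dependency-free sketch. It is an illustration under assumptions, not the code of split_cluster: it runs a plain k-means (Lloyd's algorithm) on the embeddings of the chosen cluster only, keeps the old number for the first sub-cluster, and gives the others fresh numbers. The helper names kmeans and split_labels, the initialisation, and the numbering scheme are all hypothetical.

```python
from statistics import fmean


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns a sub-cluster index per point."""
    # Deterministic init (assumption): k points spread through the list.
    centers = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        assign = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a sub-cluster is empty
                centers[j] = [fmean(col) for col in zip(*members)]
    return assign


def split_labels(labels, embeddings, cluster_num, divisor):
    """Return new labels with cluster_num split into divisor clusters."""
    idx = [i for i, lab in enumerate(labels) if lab == cluster_num]
    sub = kmeans([embeddings[i] for i in idx], divisor)
    next_label = max(labels) + 1
    new_labels = list(labels)
    for i, s in zip(idx, sub):
        # Sub-cluster 0 keeps the old number; the rest get fresh numbers.
        new_labels[i] = cluster_num if s == 0 else next_label + s - 1
    return new_labels


labels = [0, 0, 0, 0, 1, 1]
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0], [9.1, 0.0],
]
new_labels = split_labels(labels, embeddings, cluster_num=0, divisor=2)
```

Only the rows of the chosen cluster are relabelled; rows of other clusters keep their numbers.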

union_clusters(cl_list, data, reduce_model, embeddings)

Merges the clusters whose numbers are listed in {cl_list} into a single cluster. …

Parameters:

cl_list (list) – list of numbers of clusters to merge
data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”
embeddings (list(list)) –
reduce_model – model for reducing embedding dimensions to 2 for visualization

Returns:

data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”
cluster_centers_2d (list(list)) – cluster centers for 2D visualization
radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
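
Merging clusters amounts to relabelling and recomputing a center. The sketch below is a minimal illustration, not the code of union_clusters. It assumes that the merged cluster keeps the smallest number in cl_list and that its center is the mean of the member embeddings; the documentation states neither, and the helper names merge_labels and centroid are hypothetical.

```python
from statistics import fmean


def merge_labels(labels, cl_list):
    """Relabel every cluster in cl_list to the smallest number in cl_list."""
    target = min(cl_list)  # assumption: the merged cluster keeps this number
    members = set(cl_list)
    return [target if lab in members else lab for lab in labels]


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*points)]


labels = [0, 1, 2, 1, 3]
embeddings = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [3.0, 1.0], [9.0, 9.0]]

# Merge clusters 1 and 2; clusters 0 and 3 are left untouched.
merged = merge_labels(labels, cl_list=[1, 2])

# Recompute the center of the merged cluster from its member embeddings.
merged_points = [e for e, lab in zip(embeddings, merged) if lab == 1]
center = centroid(merged_points)
```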