AnaText code documentation

ETA module

eta

cluster_documents_with_keywords(filename: str | pathlib.Path, verbose: bool = False)

Clusters documents and extracts keywords for each cluster. …

Parameters:
  • filename (str | pathlib.Path) – path to the file

  • verbose (bool) – whether to print output during execution (default: False)

Returns:

  • df (pd.DataFrame) – dataframe with columns {text_col}, {emb_col} and {label_col}

  • top_word_dict (dict) – dictionary with keywords for each cluster

  • data (pd.DataFrame) – dataframe for internal function operation

  • cluster_centers (np.array) – cluster centers in the original (unreduced) embedding dimension

  • radiuses (np.array) – relative radii of clusters for two-dimensional representation

  • cluster_model – KMeans with the number of clusters determined at the approximation stage

  • cluster_centers_2d (np.array) – cluster centers for 2D visualization

  • reduce_model – dimensionality reduction model for visualization

  • embeddings (np.array) – embeddings of the documents

  • tokenizer

  • model – SCCLBert
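
With eleven return values, positional unpacking is easy to get wrong. The sketch below shows one way to give them names. It assumes the function returns all values as a single tuple in the order listed above; ClusteringResult and wrap_result are hypothetical helpers, not part of AnaText, and placeholder strings stand in for a real call so the example runs without the library.

```python
from collections import namedtuple

# Field names follow the order of the "Returns" list above.
# Assumption: the function returns these 11 values as one tuple in that order.
ClusteringResult = namedtuple(
    "ClusteringResult",
    [
        "df", "top_word_dict", "data", "cluster_centers", "radiuses",
        "cluster_model", "cluster_centers_2d", "reduce_model",
        "embeddings", "tokenizer", "model",
    ],
)


def wrap_result(raw):
    """Give names to the values returned by cluster_documents_with_keywords."""
    return ClusteringResult(*raw)


# Stand-in for a real call such as
#   raw = cluster_documents_with_keywords(filename, verbose=True)
# Placeholder strings replace the real objects so the sketch runs anywhere.
raw = tuple(f"<{name}>" for name in ClusteringResult._fields)
result = wrap_result(raw)
print(result.top_word_dict)  # prints "<top_word_dict>"
```

Named access such as result.reduce_model or result.embeddings also makes it harder to pass the wrong object to split_cluster or union_clusters.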

split_cluster(cluster_num, divisor, data, reduce_model, embeddings)

Splits cluster number {cluster_num} into {divisor} new clusters. …

Parameters:
  • cluster_num (int) – number of the cluster to split

  • divisor (int) – number of new clusters to create

  • data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”

  • embeddings (list(list)) – embeddings of the documents

  • reduce_model – model for reducing embedding dimensions to 2 for visualization

Returns:

  • data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”

  • cluster_centers_2d (list(list)) – cluster centers for 2D visualization

  • radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
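
To illustrate the idea of splitting one cluster, here is a minimal, dependency-free sketch. It is not AnaText's implementation: it assumes the split is done by running k-means on the embeddings of the chosen cluster only, and the numbering of the new clusters (the first sub-cluster keeps the old number, the rest get fresh numbers) is likewise an assumption. The helpers kmeans and split_labels are hypothetical names, and plain lists replace the dataframe.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny Lloyd's k-means; returns one label in range(k) per point."""
    # Deterministic initialisation: spread the initial centers over the input.
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_labels(labels, embeddings, cluster_num, divisor):
    """Relabel the members of cluster_num into divisor sub-clusters."""
    idx = [i for i, lab in enumerate(labels) if lab == cluster_num]
    sub = kmeans([embeddings[i] for i in idx], divisor)
    next_label = max(labels) + 1
    new_labels = list(labels)
    for i, s in zip(idx, sub):
        # Assumption: sub-cluster 0 keeps the old number, others get new ones.
        new_labels[i] = cluster_num if s == 0 else next_label + s - 1
    return new_labels


embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
labels = [0, 0, 0, 0, 1]
new_labels = split_labels(labels, embeddings, cluster_num=0, divisor=2)
print(new_labels)  # [0, 0, 2, 2, 1]
```

Only the members of the selected cluster are re-clustered; all other labels are left unchanged, which matches the intent of operating on a single cluster.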

union_clusters(cl_list, data, reduce_model, embeddings)

Merges the clusters whose numbers are listed in {cl_list} into one cluster. …

Parameters:
  • cl_list (list) – list of numbers of clusters to merge

  • data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”

  • embeddings (list(list)) – embeddings of the documents

  • reduce_model – model for reducing embedding dimensions to 2 for visualization

Returns:

  • data (pd.DataFrame) – dataframe with columns “current_class”, “embeddings”, “class”, “new_class”, “old_class”

  • cluster_centers_2d (list(list)) – cluster centers for 2D visualization

  • radiuses (dict (cluster_num: float)) – relative radii of clusters for two-dimensional representation
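
The merge operation can be pictured as relabeling followed by recomputing the center of the merged cluster. The sketch below is an illustration under stated assumptions, not AnaText's code: merge_labels and centroid are hypothetical helpers, the merged cluster is assumed to keep the smallest number in cl_list, and plain lists replace the dataframe.

```python
def merge_labels(labels, cl_list):
    """Relabel every cluster in cl_list to a single cluster number."""
    target = min(cl_list)  # assumption: merged cluster keeps the smallest number
    merge_set = set(cl_list)
    return [target if lab in merge_set else lab for lab in labels]


def centroid(points):
    """Mean of the points, coordinate by coordinate."""
    return [sum(col) / len(points) for col in zip(*points)]


labels = [0, 1, 2, 1, 3]
embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 3.0], [3.0, 3.0], [9.0, 9.0]]

merged = merge_labels(labels, cl_list=[1, 2])
# Center of the merged cluster in the original embedding space.
center = centroid([e for e, lab in zip(embeddings, merged) if lab == 1])

print(merged)  # [0, 1, 1, 1, 3]
print(center)  # [3.0, 2.0]
```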