cuery.tools.dedupe#
Tool for semantically de-duplicating entities using LLM-based clustering.
This module provides tools for grouping semantically equivalent entities (phrases, categories, aspect terms, etc.) into clusters and selecting canonical representatives. This is useful for post-processing outputs from other LLM-based extraction tools like AspectSentimentExtractor, where near-duplicate entities are common.
The approach uses large context windows efficiently: up to thousands of entities are processed in a single LLM call, which avoids expensive recursive merging.
Example usage:
>>> from cuery.tools.dedupe import EntityClusterer
>>> entities = [
...     "food too expensive", "overpriced food", "food prices high",
...     "long lines", "queues too long", "long wait times",
...     "friendly staff", "staff was nice",
... ]
>>> clusterer = EntityClusterer(entities=entities)
>>> # Top-level await requires an async context (e.g. a notebook or asyncio REPL)
>>> results = await clusterer()
>>> # Returns ClusteredEntities with clusters and canonical names
Attributes#
Classes#
EntityCluster | A cluster of semantically equivalent entities.
ClusteredEntities | Result of clustering entities into semantic groups.
MergeGroup | A group of clusters that should be merged together.
MergeInstructions | Instructions for which clusters to merge.
EntityClusterer | Cluster semantically similar entities using LLM.
ClusterMerger | Merge semantically equivalent clusters using LLM-guided instructions.
Functions#
_normalize(s) | Normalize string for pre-deduplication.
_pre_deduplicate(entities) | Remove exact duplicates (case-insensitive) and return unique list + mapping.
deduplicate_entities(entities, results) | Map a list of entities to their canonical forms using clustering results.
Module Contents#
- cuery.tools.dedupe.CLUSTER_PROMPT_SYSTEM = ''#
- cuery.tools.dedupe.CLUSTER_PROMPT_USER = ''#
- cuery.tools.dedupe.MERGE_PROMPT_SYSTEM = ''#
- cuery.tools.dedupe.MERGE_PROMPT_USER = ''#
- class cuery.tools.dedupe.EntityCluster(/, **data)#
Bases: cuery.Response
A cluster of semantically equivalent entities.
- Parameters:
data (Any)
- canonical: str#
The canonical/representative name for this cluster.
- members: list[str]#
All entities that belong to this cluster.
- class cuery.tools.dedupe.ClusteredEntities(/, **data)#
Bases: cuery.Response
Result of clustering entities into semantic groups.
- Parameters:
data (Any)
- clusters: list[EntityCluster]#
List of entity clusters.
- _max_cluster_size: ClassVar[int | None] = None#
- _total_entities: ClassVar[int | None] = None#
- validate_no_degenerate_clusters()#
Reject catch-all clusters and other degenerate patterns.
- Return type:
Self
- classmethod with_validation_limits(max_cluster_size=None, total_entities=None)#
Create a subclass with validation limits baked in.
- Parameters:
max_cluster_size (int | None)
total_entities (int | None)
- Return type:
type[ClusteredEntities]
- property canonicals: list[str]#
Get all canonical names.
- Return type:
list[str]
- property mapping: dict[str, str]#
Get a mapping from each member entity to its canonical name.
Keys are normalized (lowercase, whitespace-collapsed) for robust matching.
- Return type:
dict[str, str]
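A minimal sketch of how such a member-to-canonical mapping could be built. It is not the library's implementation: the `normalize` and `build_mapping` helpers are hypothetical, clusters are represented as a plain dict, and the normalization (lowercasing plus whitespace collapsing) is assumed from the description above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip()).lower()


def build_mapping(clusters: dict[str, list[str]]) -> dict[str, str]:
    """Map each normalized member to the canonical name of its cluster."""
    mapping: dict[str, str] = {}
    for canonical, members in clusters.items():
        for member in members:
            # Assumption: if a member appears in two clusters, the first one wins.
            mapping.setdefault(normalize(member), canonical)
    return mapping


# Hypothetical clusters: canonical name -> members
clusters = {
    "expensive food": ["food too expensive", "Overpriced  Food"],
    "long queues": ["long lines", "queues too long"],
}
mapping = build_mapping(clusters)
```

Because the keys are normalized, a lookup should normalize the query string in the same way before indexing into the mapping.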
- property all_members: set[str]#
Get all member entities across all clusters (normalized).
- Return type:
set[str]
- property member_count: int#
Get the total number of member entities across all clusters.
- Return type:
int
- coverage(entities)#
Calculate what fraction of entities are covered by clusters.
- Parameters:
entities (collections.abc.Iterable[str])
- Return type:
float
- missing(entities)#
Get entities that are not in any cluster.
- Parameters:
entities (collections.abc.Iterable[str])
- Return type:
list[str]
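The two checks above can be illustrated with a small stand-alone sketch. The function names mirror the methods, but the bodies are assumptions: membership is tested against a set of normalized members, and an empty input is treated as fully covered, which the real method may handle differently.

```python
import re
from collections.abc import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip()).lower()


def coverage(members: set[str], entities: Iterable[str]) -> float:
    """Fraction of entities whose normalized form is a known cluster member."""
    entities = list(entities)
    if not entities:
        return 1.0  # assumption: nothing to cover counts as full coverage
    covered = sum(normalize(e) in members for e in entities)
    return covered / len(entities)


def missing(members: set[str], entities: Iterable[str]) -> list[str]:
    """Entities (in input order) that do not belong to any cluster."""
    return [e for e in entities if normalize(e) not in members]


# Hypothetical normalized members, as all_members might return them
members = {"food too expensive", "overpriced food", "long lines"}
entities = ["Food too expensive", "long lines", "friendly staff", "overpriced food"]

share = coverage(members, entities)
uncovered = missing(members, entities)
```

A coverage below 1.0 indicates that the LLM dropped some inputs; the list returned by `missing` shows which ones would need a second pass.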
- to_dict()#
Convert to a dictionary mapping canonical names to members.
- Return type:
dict[str, list[str]]
- class cuery.tools.dedupe.MergeGroup(/, **data)#
Bases: cuery.Response
A group of clusters that should be merged together.
- Parameters:
data (Any)
- canonical: str#
The canonical name to keep (best representative for the merged cluster).
- merge: list[str]#
Other canonical names that should be merged into this cluster.
- validate_no_self_reference()#
Ensure canonical name is not in its own merge list.
- Return type:
Self
- class cuery.tools.dedupe.MergeInstructions(/, **data)#
Bases: cuery.Response
Instructions for which clusters to merge.
- Parameters:
data (Any)
- groups: list[MergeGroup]#
Groups of clusters to merge. Each group specifies a canonical to keep and others to merge into it.
- _valid_canonicals: ClassVar[set[str] | None] = None#
- validate_merge_instructions()#
Validate merge instructions for consistency and against valid canonicals.
- Return type:
Self
- classmethod with_valid_canonicals(valid_canonicals)#
Create a subclass with valid canonicals baked in for validation.
This allows validation to happen during Pydantic parsing, triggering LLM retries on invalid responses.
- Parameters:
valid_canonicals (set[str]) – Set of valid canonical names from the original clusters.
- Returns:
A dynamically created MergeInstructions subclass with validation.
- Return type:
type[MergeInstructions]
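The "baked-in" subclass pattern can be sketched in plain Python, without Pydantic. `MergeInstructionsSketch` is a hypothetical stand-in, not the real class: it stores groups as a dict and validates in `__init__`, whereas the actual class validates during Pydantic parsing.

```python
from typing import ClassVar, Optional


class MergeInstructionsSketch:
    """Hypothetical stand-in: groups map a canonical name to names merged into it."""

    _valid_canonicals: ClassVar[Optional[set[str]]] = None

    def __init__(self, groups: dict[str, list[str]]):
        self.groups = groups
        self.validate()

    def validate(self) -> None:
        valid = self._valid_canonicals
        if valid is None:
            return  # the base class carries no limits, so anything passes
        for canonical, merge in self.groups.items():
            unknown = ({canonical} | set(merge)) - valid
            if unknown:
                raise ValueError(f"Unknown canonical names: {sorted(unknown)}")

    @classmethod
    def with_valid_canonicals(cls, valid_canonicals: set[str]) -> "type[MergeInstructionsSketch]":
        # Create a subclass whose class variable holds the allowed names.
        return type(
            f"{cls.__name__}Validated",
            (cls,),
            {"_valid_canonicals": set(valid_canonicals)},
        )


Validated = MergeInstructionsSketch.with_valid_canonicals(
    {"expensive food", "pricey food", "long queues"}
)
ok = Validated({"expensive food": ["pricey food"]})
```

Storing the limits on a generated subclass keeps the base class free of per-call state, so validators can read them without any extra arguments being passed in.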
- cuery.tools.dedupe._normalize(s)#
Normalize string for pre-deduplication.
- Parameters:
s (str)
- Return type:
str
- cuery.tools.dedupe._pre_deduplicate(entities)#
Remove exact duplicates (case-insensitive) and return unique list + mapping.
- Returns:
unique_entities – List of unique entities (first occurrence kept).
reverse_map – Maps normalized form to all original variants.
- Return type:
tuple[list[str], dict[str, list[str]]]
- Parameters:
entities (list[str])
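A rough sketch of this pre-deduplication step, under stated assumptions: `normalize` and `pre_deduplicate` are hypothetical re-implementations, and the normalization also collapses whitespace, which the private helpers may or may not do.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip()).lower()


def pre_deduplicate(entities: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Return first occurrences plus a map from normalized form to original variants."""
    unique: list[str] = []
    reverse_map: dict[str, list[str]] = {}
    for entity in entities:
        key = normalize(entity)
        if key not in reverse_map:
            unique.append(entity)  # keep the first spelling that was seen
            reverse_map[key] = []
        if entity not in reverse_map[key]:
            reverse_map[key].append(entity)  # record each distinct variant once
    return unique, reverse_map


unique, reverse_map = pre_deduplicate(
    ["Long lines", "long  lines", "friendly staff", "Long lines"]
)
```

Only the entries in `unique` would need to be sent to the LLM; the reverse map allows the original spellings to be restored afterwards.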
- class cuery.tools.dedupe.EntityClusterer(/, **data)#
Bases: cuery.Tool
Cluster semantically similar entities using LLM.
This tool groups a list of entities into semantic clusters, where each cluster contains entities that express the same concept. It uses large context windows efficiently, processing up to thousands of entities per LLM call.
The tool first removes exact duplicates (case-insensitive), then sends unique entities to the LLM for semantic clustering. If multiple batches are needed, an optional merge step can consolidate similar clusters across batches.
- Parameters:
entities – List of entity strings to cluster
instructions – Additional domain-specific instructions for clustering
batch_size – Max entities per LLM call (default: 2000 - handles most cases in one call)
merge_clusters – If True and multiple batches, merge similar clusters across batches (one LLM call)
data (Any)
Example
>>> clusterer = EntityClusterer(
...     entities=["food too expensive", "overpriced food", "long lines", "queues too long"],
... )
>>> results = await clusterer()
>>> print(results.mapping)
{'food too expensive': 'expensive food', 'overpriced food': 'expensive food', ...}
- entities: collections.abc.Iterable[str]#
Entities to cluster.
- instructions: str = ''#
Additional domain-specific instructions for the clustering task.
- batch_size: int = 2000#
Max entities per LLM call. Default handles most use cases in a single call.
- merge_clusters: bool = True#
If True, merge similar clusters (across batches or within single batch for consolidation).
- consolidate: bool = True#
If True, always run a merge pass even on single-batch results to consolidate similar clusters.
- max_cluster_size: int = 100#
Maximum allowed members per cluster. Larger clusters trigger a validation error and a retry.
- _unique_entities: list[str] | None = None#
- _reverse_map: dict[str, list[str]] | None = None#
- model_post_init(__context)#
Pre-deduplicate entities after initialization.
- Return type:
None
- property response_model: cuery.ResponseClass#
Create response model with validation limits for cluster size.
- Return type:
cuery.ResponseClass
- property prompt: cuery.Prompt#
Defines the prompt for this tool (ClassVar or property).
- Return type:
cuery.Prompt
- property context: cuery.AnyContext#
Create batched contexts - typically just one for most use cases.
- Return type:
cuery.AnyContext
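The batching behind these contexts can be sketched as simple list chunking. The `batched` helper and the `{"entities": ...}` context shape are assumptions for illustration; the real property may build its contexts differently.

```python
def batched(items: list[str], batch_size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


entities = [f"entity {i}" for i in range(4500)]
batches = batched(entities, batch_size=2000)

# Hypothetical context shape: one dict per LLM call
contexts = [{"entities": batch} for batch in batches]
```

With the default `batch_size` of 2000, any input of up to 2000 unique entities yields a single context, so no cross-batch merge is needed for it.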
- _expand_clusters(clusters)#
Expand clusters to include all original variants from pre-deduplication.
- Parameters:
clusters (list[EntityCluster])
- Return type:
list[EntityCluster]
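A sketch of what this expansion could look like, assuming clusters are plain dicts and that the reverse map has the shape described for `_pre_deduplicate` (normalized form mapped to original variants). Both helper names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip()).lower()


def expand_clusters(
    clusters: dict[str, list[str]],
    reverse_map: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Replace each member with all original variants recorded for it."""
    expanded: dict[str, list[str]] = {}
    for canonical, members in clusters.items():
        variants: list[str] = []
        for member in members:
            # Fall back to the member itself if no variants were recorded.
            for variant in reverse_map.get(normalize(member), [member]):
                if variant not in variants:
                    variants.append(variant)
        expanded[canonical] = variants
    return expanded


reverse_map = {
    "long lines": ["Long lines", "long  lines"],
    "queues too long": ["queues too long"],
}
clusters = {"long queues": ["Long lines", "queues too long"]}
expanded = expand_clusters(clusters, reverse_map)
```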
- _concat_batch_results(results)#
Concatenate results from multiple batches without LLM merge.
- Parameters:
results (list[ClusteredEntities])
- Return type:
- async __call__(**kwargs)#
Run the clustering tool.
- Return type:
ClusteredEntities
- class cuery.tools.dedupe.ClusterMerger(/, **data)#
Bases: cuery.Tool
Merge semantically equivalent clusters using LLM-guided instructions.
This tool asks the LLM to identify which clusters should be merged (by canonical name), then applies the merges programmatically. This approach:
- Never loses entities (merging is done in code, not by LLM)
- Requires much smaller LLM output (just canonical names, not all entities)
- Is more reliable than asking LLM to output all entities again
- Parameters:
clusters – List of EntityCluster objects to merge
instructions – Additional instructions for the merge task
data (Any)
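The programmatic merge described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: clusters and merge groups are plain dicts, and `apply_merges` is a hypothetical helper that skips self-references and unknown names instead of raising.

```python
def apply_merges(
    clusters: dict[str, list[str]],
    groups: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Fold the members of each merged cluster into its target canonical."""
    merged = {name: list(members) for name, members in clusters.items()}
    for canonical, others in groups.items():
        for other in others:
            if other == canonical or other not in merged or canonical not in merged:
                continue  # skip self-references and names that are not clusters
            for member in merged.pop(other):
                if member not in merged[canonical]:
                    merged[canonical].append(member)
    return merged


clusters = {
    "expensive food": ["food too expensive", "overpriced food"],
    "high prices": ["food prices high"],
    "long queues": ["long lines"],
}
# Hypothetical LLM output: keep "expensive food", merge "high prices" into it
groups = {"expensive food": ["high prices"]}
merged = apply_merges(clusters, groups)
```

Since members are only moved between lists in code, the total number of entities is the same before and after the merge, whatever the LLM returns.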
- clusters: list[EntityCluster]#
Clusters to potentially merge.
- instructions: str = ''#
Additional domain-specific instructions.
- property response_model: cuery.ResponseClass#
Create response model with valid canonicals baked in for validation.
- Return type:
cuery.ResponseClass
- property prompt: cuery.Prompt#
Defines the prompt for this tool (ClassVar or property).
- Return type:
cuery.Prompt
- property context: cuery.AnyContext#
- Return type:
cuery.AnyContext
- _apply_merge_instructions(instructions)#
Apply merge instructions to clusters programmatically.
- Parameters:
instructions (MergeInstructions)
- Return type:
- async __call__(**kwargs)#
Get merge instructions from LLM and apply them programmatically.
- Return type:
- cuery.tools.dedupe.deduplicate_entities(entities, results)#
Map a list of entities to their canonical forms using clustering results.
- Parameters:
entities (collections.abc.Iterable[str]) – Original list of entities (may contain duplicates)
results (ClusteredEntities) – ClusteredEntities result from EntityClusterer
- Returns:
List of canonical entity names in the same order as input. Entities not found in the mapping are returned as-is.
- Return type:
list[str]
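The order-preserving lookup with fallback can be sketched like this. The `deduplicate` helper is hypothetical and takes a plain mapping rather than a ClusteredEntities object; normalizing each entity before the lookup is an assumption based on the normalized keys of the `mapping` property.

```python
import re
from collections.abc import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip()).lower()


def deduplicate(entities: Iterable[str], mapping: dict[str, str]) -> list[str]:
    """Replace each entity with its canonical name, keeping unknown ones as-is."""
    return [mapping.get(normalize(entity), entity) for entity in entities]


# Hypothetical mapping, shaped like ClusteredEntities.mapping
mapping = {
    "food too expensive": "expensive food",
    "overpriced food": "expensive food",
}
canonical = deduplicate(
    ["Overpriced food", "friendly staff", "food too expensive", "Overpriced food"],
    mapping,
)
```

The output has the same length and order as the input, so it can be assigned back to a column or list aligned with the original records.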