cuery.tools.dedupe#

Tool for semantically de-duplicating entities using LLM-based clustering.

This module provides tools for grouping semantically equivalent entities (phrases, categories, aspect terms, etc.) into clusters and selecting canonical representatives. This is useful for post-processing outputs from other LLM-based extraction tools like AspectSentimentExtractor, where near-duplicate entities are common.

The approach takes advantage of large context windows: up to thousands of entities are processed in a single LLM call, which avoids expensive recursive merging.

Example usage:

>>> from cuery.tools.dedupe import EntityClusterer
>>> entities = [
...     "food too expensive", "overpriced food", "food prices high",
...     "long lines", "queues too long", "long wait times",
...     "friendly staff", "staff was nice",
... ]
>>> clusterer = EntityClusterer(entities=entities)
>>> # `await` needs an async context (e.g. an asyncio REPL or a coroutine)
>>> results = await clusterer()
>>> # Returns ClusteredEntities with clusters and canonical names

Attributes#

Classes#

EntityCluster

A cluster of semantically equivalent entities.

ClusteredEntities

Result of clustering entities into semantic groups.

MergeGroup

A group of clusters that should be merged together.

MergeInstructions

Instructions for which clusters to merge.

EntityClusterer

Cluster semantically similar entities using LLM.

ClusterMerger

Merge semantically equivalent clusters using LLM-guided instructions.

Functions#

_normalize(s)

Normalize string for pre-deduplication.

_pre_deduplicate(entities)

Remove exact duplicates (case-insensitive) and return unique list + mapping.

deduplicate_entities(entities, results)

Map a list of entities to their canonical forms using clustering results.

Module Contents#

cuery.tools.dedupe.CLUSTER_PROMPT_SYSTEM = ''#
cuery.tools.dedupe.CLUSTER_PROMPT_USER = ''#
cuery.tools.dedupe.MERGE_PROMPT_SYSTEM = ''#
cuery.tools.dedupe.MERGE_PROMPT_USER = ''#
class cuery.tools.dedupe.EntityCluster(/, **data)#

Bases: cuery.Response

A cluster of semantically equivalent entities.

Parameters:

data (Any)

canonical: str#

The canonical/representative name for this cluster.

members: list[str]#

All entities that belong to this cluster.

class cuery.tools.dedupe.ClusteredEntities(/, **data)#

Bases: cuery.Response

Result of clustering entities into semantic groups.

Parameters:

data (Any)

clusters: list[EntityCluster]#

List of entity clusters.

_max_cluster_size: ClassVar[int | None] = None#
_total_entities: ClassVar[int | None] = None#
validate_no_degenerate_clusters()#

Reject catch-all clusters and other degenerate patterns.

Return type:

Self

classmethod with_validation_limits(max_cluster_size=None, total_entities=None)#

Create a subclass with validation limits baked in.

Parameters:
  • max_cluster_size (int | None)

  • total_entities (int | None)

Return type:

type[ClusteredEntities]
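
The "limits baked into a subclass" idea can be sketched with plain Python classes. This is only an illustration of the pattern, not cuery's implementation: the `Clusters` class is a hypothetical stand-in for the Pydantic response model, and the only check shown is a maximum cluster size (the actual degenerate-cluster rules are not described here).

```python
from __future__ import annotations

from typing import ClassVar


class Clusters:
    """Hypothetical stand-in for a response model; clusters are lists of members."""

    _max_cluster_size: ClassVar[int | None] = None
    _total_entities: ClassVar[int | None] = None  # stored only; unused in this sketch

    def __init__(self, clusters: list[list[str]]) -> None:
        self.clusters = clusters
        self.validate()

    def validate(self) -> None:
        # Reject any cluster larger than the class-level limit, if one is set.
        limit = self._max_cluster_size
        for members in self.clusters:
            if limit is not None and len(members) > limit:
                raise ValueError(f"cluster of size {len(members)} exceeds limit {limit}")

    @classmethod
    def with_validation_limits(cls, max_cluster_size=None, total_entities=None):
        # Build a subclass whose class-level limits are fixed at creation time.
        return type(
            f"{cls.__name__}WithLimits",
            (cls,),
            {"_max_cluster_size": max_cluster_size, "_total_entities": total_entities},
        )


Limited = Clusters.with_validation_limits(max_cluster_size=2)
ok = Limited([["a", "b"], ["c"]])  # passes validation
try:
    Limited([["a", "b", "c"]])  # one cluster is too large
    rejected = False
except ValueError:
    rejected = True
```

Because the limits live on the subclass rather than on an instance, validation can run while the response is being parsed, without passing extra arguments at construction time.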

property canonicals: list[str]#

Get all canonical names.

Return type:

list[str]

property mapping: dict[str, str]#

Get a mapping from each member entity to its canonical name.

Keys are normalized (lowercase, whitespace-collapsed) for robust matching.

Return type:

dict[str, str]
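
As a rough illustration of how such a mapping could be derived, the sketch below uses plain dictionaries in place of EntityCluster objects. The helpers `normalize` and `build_mapping` are hypothetical names, not part of cuery; the normalization follows the description above (lowercase, whitespace collapsed).

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", s.strip().lower())


def build_mapping(clusters: list[dict]) -> dict[str, str]:
    """Map each normalized member to the canonical name of its cluster."""
    mapping: dict[str, str] = {}
    for cluster in clusters:
        for member in cluster["members"]:
            mapping[normalize(member)] = cluster["canonical"]
    return mapping


clusters = [
    {"canonical": "expensive food", "members": ["Food too  expensive", "overpriced food"]},
    {"canonical": "long lines", "members": ["long lines", "Queues too long"]},
]
mapping = build_mapping(clusters)
```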

property all_members: set[str]#

Get all member entities across all clusters (normalized).

Return type:

set[str]

property member_count: int#

Get the total number of member entities across all clusters.

Return type:

int

coverage(entities)#

Calculate what fraction of entities are covered by clusters.

Parameters:

entities (collections.abc.Iterable[str])

Return type:

float

missing(entities)#

Get entities that are not in any cluster.

Parameters:

entities (collections.abc.Iterable[str])

Return type:

list[str]
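
The two checks above can be approximated as follows. This is a sketch under assumptions: the function names mirror the methods but are standalone helpers, membership is tested on normalized strings, and treating empty input as fully covered is a choice made here, not documented behavior.

```python
import re
from collections.abc import Iterable


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", s.strip().lower())


def coverage(entities: Iterable[str], members: set[str]) -> float:
    """Fraction of entities whose normalized form is a cluster member."""
    items = list(entities)
    if not items:
        return 1.0  # assumption: empty input counts as fully covered
    covered = sum(1 for e in items if normalize(e) in members)
    return covered / len(items)


def missing(entities: Iterable[str], members: set[str]) -> list[str]:
    """Entities (in input order) that no cluster contains."""
    return [e for e in entities if normalize(e) not in members]


members = {"food too expensive", "overpriced food", "long lines"}
entities = ["Food too expensive", "long lines", "friendly staff", "rude staff"]
cov = coverage(entities, members)
gaps = missing(entities, members)
```

A coverage below 1.0 together with a non-empty `missing` list is a quick signal that the LLM dropped some inputs.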

to_dict()#

Convert to a dictionary mapping canonical names to members.

Return type:

dict[str, list[str]]

class cuery.tools.dedupe.MergeGroup(/, **data)#

Bases: cuery.Response

A group of clusters that should be merged together.

Parameters:

data (Any)

canonical: str#

The canonical name to keep (best representative for the merged cluster).

merge: list[str]#

Other canonical names that should be merged into this cluster.

validate_no_self_reference()#

Ensure canonical name is not in its own merge list.

Return type:

Self

class cuery.tools.dedupe.MergeInstructions(/, **data)#

Bases: cuery.Response

Instructions for which clusters to merge.

Parameters:

data (Any)

groups: list[MergeGroup]#

Groups of clusters to merge. Each group specifies a canonical to keep and others to merge into it.

_valid_canonicals: ClassVar[set[str] | None] = None#
validate_merge_instructions()#

Validate merge instructions for consistency and against valid canonicals.

Return type:

Self

classmethod with_valid_canonicals(valid_canonicals)#

Create a subclass with valid canonicals baked in for validation.

This allows validation to happen during Pydantic parsing, triggering LLM retries on invalid responses.

Parameters:

valid_canonicals (set[str]) – Set of valid canonical names from the original clusters.

Returns:

A dynamically created MergeInstructions subclass with validation.

Return type:

type[MergeInstructions]
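
The kind of consistency checks involved might look like the sketch below. The exact rules cuery enforces are not listed here, so the three checks shown (no self-reference, only known canonicals, no name used twice) are assumptions, and `check_merge_groups` is a hypothetical helper operating on plain dictionaries.

```python
def check_merge_groups(groups: list[dict], valid_canonicals: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the groups look consistent."""
    problems: list[str] = []
    seen: set[str] = set()
    for group in groups:
        canonical, merge = group["canonical"], group["merge"]
        if canonical in merge:
            problems.append(f"{canonical!r} appears in its own merge list")
        for name in [canonical, *merge]:
            if name not in valid_canonicals:
                problems.append(f"{name!r} is not a known canonical")
            if name in seen:
                problems.append(f"{name!r} is used more than once")
            seen.add(name)
    return problems


valid = {"expensive food", "pricey food", "long lines"}
good = [{"canonical": "expensive food", "merge": ["pricey food"]}]
bad = [{"canonical": "long lines", "merge": ["long lines", "slow queue"]}]

good_problems = check_merge_groups(good, valid)
bad_problems = check_merge_groups(bad, valid)
```

In a Pydantic model, raising an error for any such problem during parsing is what would let the caller retry the LLM request.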

cuery.tools.dedupe._normalize(s)#

Normalize string for pre-deduplication.

Parameters:

s (str)

Return type:

str

cuery.tools.dedupe._pre_deduplicate(entities)#

Remove exact duplicates (case-insensitive) and return unique list + mapping.

Parameters:

entities (list[str])

Returns:

A tuple (unique_entities, reverse_map):
  • unique_entities – List of unique entities (first occurrence kept)

  • reverse_map – Maps normalized form to all original variants

Return type:

tuple
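
A minimal sketch of this pre-deduplication step, assuming normalization means lowercasing and collapsing whitespace. The function names are illustrative and the real private helpers may differ in detail.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", s.strip().lower())


def pre_deduplicate(entities: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Return first occurrences plus a map from normalized form to all variants."""
    unique: list[str] = []
    reverse_map: dict[str, list[str]] = {}
    for entity in entities:
        key = normalize(entity)
        if key not in reverse_map:
            unique.append(entity)  # keep the first occurrence
            reverse_map[key] = []
        if entity not in reverse_map[key]:
            reverse_map[key].append(entity)  # record each distinct spelling once
    return unique, reverse_map


unique, reverse_map = pre_deduplicate(
    ["Long lines", "long  lines", "friendly staff", "Long lines"]
)
```

Only `unique` needs to be sent to the LLM; `reverse_map` allows the original spellings to be restored afterwards.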

class cuery.tools.dedupe.EntityClusterer(/, **data)#

Bases: cuery.Tool

Cluster semantically similar entities using LLM.

This tool groups a list of entities into semantic clusters, where each cluster contains entities that express the same concept. It takes advantage of large context windows, processing up to thousands of entities per LLM call.

The tool first removes exact duplicates (case-insensitive), then sends unique entities to the LLM for semantic clustering. If multiple batches are needed, an optional merge step can consolidate similar clusters across batches.

Parameters:
  • entities – List of entity strings to cluster

  • instructions – Additional domain-specific instructions for clustering

  • batch_size – Max entities per LLM call (default: 2000 - handles most cases in one call)

  • merge_clusters – If True and multiple batches are needed, merge similar clusters across batches (one LLM call)

  • data (Any)

Example

>>> clusterer = EntityClusterer(
...     entities=["food too expensive", "overpriced food", "long lines", "queues too long"],
... )
>>> results = await clusterer()
>>> print(results.mapping)
{'food too expensive': 'expensive food', 'overpriced food': 'expensive food', ...}

entities: collections.abc.Iterable[str]#

Entities to cluster.

instructions: str = ''#

Additional domain-specific instructions for the clustering task.

batch_size: int = 2000#

Max entities per LLM call. Default handles most use cases in a single call.

merge_clusters: bool = True#

If True, merge similar clusters (across batches, or within a single batch for consolidation).

consolidate: bool = True#

If True, always run a merge pass even on single-batch results to consolidate similar clusters.

max_cluster_size: int = 100#

Maximum allowed members per cluster. Larger clusters trigger a validation error and a retry.

_unique_entities: list[str] | None = None#
_reverse_map: dict[str, list[str]] | None = None#
model_post_init(__context)#

Pre-deduplicate entities after initialization.

Return type:

None

property response_model: cuery.ResponseClass#

Create response model with validation limits for cluster size.

Return type:

cuery.ResponseClass

property prompt: cuery.Prompt#

Defines the prompt for this tool (ClassVar or property).

Return type:

cuery.Prompt

property context: cuery.AnyContext#

Create batched contexts - typically just one for most use cases.

Return type:

cuery.AnyContext
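
Batching by `batch_size` can be pictured as simple list slicing. The helper below is a hypothetical illustration; how cuery actually builds its contexts is not shown here.

```python
def batched(items: list[str], batch_size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


entities = [f"entity {i}" for i in range(4500)]
batches = batched(entities, batch_size=2000)
sizes = [len(b) for b in batches]
```

With the default of 2000, any input of up to 2000 unique entities yields a single batch, which is why one LLM call is the typical case.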

_expand_clusters(clusters)#

Expand clusters to include all original variants from pre-deduplication.

Parameters:

clusters (list[EntityCluster])

Return type:

list[EntityCluster]
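
Expansion can be sketched as a lookup of each member in the reverse map produced by pre-deduplication. The names below are illustrative, and falling back to the member itself when no variants are recorded is an assumption.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", s.strip().lower())


def expand_members(members: list[str], reverse_map: dict[str, list[str]]) -> list[str]:
    """Replace each member with all of its original variants, preserving order."""
    expanded: list[str] = []
    for member in members:
        # Fall back to the member itself if no variants were recorded.
        for variant in reverse_map.get(normalize(member), [member]):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


reverse_map = {
    "long lines": ["Long lines", "long  lines"],
    "queues too long": ["queues too long"],
}
expanded = expand_members(["Long lines", "queues too long"], reverse_map)
```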

_concat_batch_results(results)#

Concatenate results from multiple batches without LLM merge.

Parameters:

results (list[ClusteredEntities])

Return type:

ClusteredEntities

async __call__(**kwargs)#

Run the clustering tool.

Return type:

ClusteredEntities

class cuery.tools.dedupe.ClusterMerger(/, **data)#

Bases: cuery.Tool

Merge semantically equivalent clusters using LLM-guided instructions.

This tool asks the LLM to identify which clusters should be merged (by canonical name), then applies the merges programmatically. This approach:
  • Never loses entities (merging is done in code, not by the LLM)

  • Requires much smaller LLM output (just canonical names, not all entities)

  • Is more reliable than asking the LLM to output all entities again

Parameters:
  • clusters – List of EntityCluster objects to merge

  • instructions – Additional instructions for the merge task

  • data (Any)

clusters: list[EntityCluster]#

Clusters to potentially merge.

instructions: str = ''#

Additional domain-specific instructions.

property response_model: cuery.ResponseClass#

Create response model with valid canonicals baked in for validation.

Return type:

cuery.ResponseClass

property prompt: cuery.Prompt#

Defines the prompt for this tool (ClassVar or property).

Return type:

cuery.Prompt

property context: cuery.AnyContext#

Return type:

cuery.AnyContext

_apply_merge_instructions(instructions)#

Apply merge instructions to clusters programmatically.

Parameters:

instructions (MergeInstructions)

Return type:

ClusteredEntities
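
Applying merges in code could look roughly like this. The sketch represents clusters as a dictionary from canonical name to members and assumes the instructions have already been validated. `apply_merges` is a hypothetical helper, not the method itself.

```python
def apply_merges(
    clusters: dict[str, list[str]], groups: list[dict]
) -> dict[str, list[str]]:
    """Fold the members of each merged cluster into the cluster that is kept."""
    merged = {name: list(members) for name, members in clusters.items()}
    for group in groups:
        target = merged[group["canonical"]]
        for other in group["merge"]:
            if other == group["canonical"]:
                continue  # defensive: never merge a cluster into itself
            for member in merged.pop(other, []):
                if member not in target:
                    target.append(member)
    return merged


clusters = {
    "expensive food": ["food too expensive", "overpriced food"],
    "high prices": ["food prices high"],
    "long lines": ["long lines", "queues too long"],
}
groups = [{"canonical": "expensive food", "merge": ["high prices"]}]
result = apply_merges(clusters, groups)
```

Since members are only moved between lists, the total number of entities is the same before and after the merge.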

async __call__(**kwargs)#

Get merge instructions from LLM and apply them programmatically.

Return type:

ClusteredEntities

cuery.tools.dedupe.deduplicate_entities(entities, results)#

Map a list of entities to their canonical forms using clustering results.

Parameters:
  • entities (collections.abc.Iterable[str]) – Original list of entities (may contain duplicates)

  • results (ClusteredEntities) – ClusteredEntities result from EntityClusterer

Returns:

List of canonical entity names in the same order as input. Entities not found in the mapping are returned as-is.

Return type:

list[str]
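
The documented behavior (same order as input, unknown entities returned as-is) can be sketched with a plain dictionary standing in for `results.mapping`. Normalizing each entity before the lookup is an assumption based on the mapping's normalized keys, and `deduplicate` is an illustrative name.

```python
import re
from collections.abc import Iterable


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", s.strip().lower())


def deduplicate(entities: Iterable[str], mapping: dict[str, str]) -> list[str]:
    """Replace each entity with its canonical form, keeping order and unknowns."""
    return [mapping.get(normalize(e), e) for e in entities]


mapping = {
    "food too expensive": "expensive food",
    "overpriced food": "expensive food",
}
canonical = deduplicate(
    ["Overpriced food", "food too expensive", "great music", "overpriced food"],
    mapping,
)
```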