cuery.tools.dedupe#

Tool for semantically de-duplicating entities using LLM-based clustering.

This module provides tools for grouping semantically equivalent entities (phrases, categories, aspect terms, etc.) into clusters and selecting canonical representatives. This is useful for post-processing outputs from other LLM-based extraction tools like AspectSentimentExtractor, where near-duplicate entities are common.

The approach makes efficient use of large context windows: up to thousands of entities are processed in a single LLM call, which avoids expensive recursive merging.

Example usage:

>>> entities = [
...     "food too expensive", "overpriced food", "food prices high",
...     "long lines", "queues too long", "long wait times",
...     "friendly staff", "staff was nice",
... ]
>>> clusterer = EntityClusterer(entities=entities)
>>> results = await clusterer()
>>> # Returns ClusteredEntities with clusters and canonical names

Attributes#

`CLUSTER_PROMPT_SYSTEM`
`CLUSTER_PROMPT_USER`
`MERGE_PROMPT_SYSTEM`
`MERGE_PROMPT_USER`

Classes#

`EntityCluster`	A cluster of semantically equivalent entities.
`ClusteredEntities`	Result of clustering entities into semantic groups.
`MergeGroup`	A group of clusters that should be merged together.
`MergeInstructions`	Instructions for which clusters to merge.
`EntityClusterer`	Cluster semantically similar entities using LLM.
`ClusterMerger`	Merge semantically equivalent clusters using LLM-guided instructions.

Functions#

`_normalize`(s)	Normalize string for pre-deduplication.
`_pre_deduplicate`(entities)	Remove exact duplicates (case-insensitive) and return unique list + mapping.
`deduplicate_entities`(entities, results)	Map a list of entities to their canonical forms using clustering results.

Module Contents#

cuery.tools.dedupe.CLUSTER_PROMPT_SYSTEM = ''#

cuery.tools.dedupe.CLUSTER_PROMPT_USER = ''#

cuery.tools.dedupe.MERGE_PROMPT_SYSTEM = ''#

cuery.tools.dedupe.MERGE_PROMPT_USER = ''#

class cuery.tools.dedupe.EntityCluster(/, **data)#

Bases: cuery.Response

A cluster of semantically equivalent entities.

Parameters:: data (Any)

canonical: str#: The canonical/representative name for this cluster.

members: list[str]#: All entities that belong to this cluster.

class cuery.tools.dedupe.ClusteredEntities(/, **data)#

Bases: cuery.Response

Result of clustering entities into semantic groups.

Parameters:: data (Any)

clusters: list[EntityCluster]#: List of entity clusters.

_max_cluster_size: ClassVar[int | None] = None#

_total_entities: ClassVar[int | None] = None#

validate_no_degenerate_clusters()#

Reject catch-all clusters and other degenerate patterns.

Return type:: Self

classmethod with_validation_limits(max_cluster_size=None, total_entities=None)#

Create a subclass with validation limits baked in.

Parameters:

max_cluster_size (int | None)
total_entities (int | None)

Return type:

type[ClusteredEntities]
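The "limits baked in" pattern can be illustrated with a dynamically created subclass whose class-level attributes carry the limits. This is a minimal sketch using a plain class, not the library's Pydantic model: `Clusters`, its `_validate` method, and the catch-all rule are hypothetical stand-ins, and the real checks behind `validate_no_degenerate_clusters` may differ.

```python
from typing import ClassVar, Optional


class Clusters:
    """Hypothetical stand-in for ClusteredEntities (plain class, no Pydantic)."""

    _max_cluster_size: ClassVar[Optional[int]] = None
    _total_entities: ClassVar[Optional[int]] = None

    def __init__(self, clusters: dict) -> None:
        # clusters maps each canonical name to its list of members
        self.clusters = clusters
        self._validate()

    def _validate(self) -> None:
        for canonical, members in self.clusters.items():
            limit = self._max_cluster_size
            if limit is not None and len(members) > limit:
                raise ValueError(f"cluster {canonical!r} has {len(members)} members")
            # Assumed catch-all rule: a single cluster holding every input entity.
            total = self._total_entities
            if total is not None and total > 1 and len(members) >= total:
                raise ValueError(f"cluster {canonical!r} is a catch-all")

    @classmethod
    def with_validation_limits(
        cls,
        max_cluster_size: Optional[int] = None,
        total_entities: Optional[int] = None,
    ) -> "type[Clusters]":
        # Build a subclass so the limits live on the class, not on instances.
        return type(
            f"{cls.__name__}WithLimits",
            (cls,),
            {"_max_cluster_size": max_cluster_size, "_total_entities": total_entities},
        )
```

Because the limits sit on a subclass, the base class stays unrestricted and each tool instance can request its own limits.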

property canonicals: list[str]#

Get all canonical names.

Return type:: list[str]

property mapping: dict[str, str]#

Get a mapping from each member entity to its canonical name.

Keys are normalized (lowercase, whitespace-collapsed) for robust matching.

Return type:: dict[str, str]
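A mapping with normalized keys could be built along these lines. This is a sketch under assumptions: `Cluster` is a hypothetical stand-in for `EntityCluster`, and `normalize` assumes that normalization means lowercasing, trimming, and collapsing whitespace, as the description above suggests.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for EntityCluster."""

    canonical: str
    members: list = field(default_factory=list)


def normalize(s: str) -> str:
    # Assumed normalization: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", s.strip().lower())


def build_mapping(clusters: list) -> dict:
    # Map every normalized member to the canonical name of its cluster.
    return {normalize(m): c.canonical for c in clusters for m in c.members}
```

Looking up an entity then means normalizing it first, for example `build_mapping(clusters).get(normalize(entity))`.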

property all_members: set[str]#

Get all member entities across all clusters (normalized).

Return type:: set[str]

property member_count: int#

Get the total number of member entities across all clusters.

Return type:: int

coverage(entities)#

Calculate what fraction of entities are covered by clusters.

Parameters:: entities (collections.abc.Iterable[str])
Return type:: float

missing(entities)#

Get entities that are not in any cluster.

Parameters:: entities (collections.abc.Iterable[str])
Return type:: list[str]
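The two checks above can be sketched as plain functions over a set of normalized members. The helper names are hypothetical, and returning 1.0 for an empty input is an assumption rather than documented behavior.

```python
import re
from collections.abc import Iterable


def normalize(s: str) -> str:
    # Assumed normalization: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", s.strip().lower())


def coverage(members: set, entities: Iterable) -> float:
    entities = list(entities)
    if not entities:
        return 1.0  # assumption: nothing to cover counts as full coverage
    covered = sum(normalize(e) in members for e in entities)
    return covered / len(entities)


def missing(members: set, entities: Iterable) -> list:
    # Keep the original spelling of every entity that no cluster contains.
    return [e for e in entities if normalize(e) not in members]
```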

to_dict()#

Convert to a dictionary mapping canonical names to members.

Return type:: dict[str, list[str]]

class cuery.tools.dedupe.MergeGroup(/, **data)#

Bases: cuery.Response

A group of clusters that should be merged together.

Parameters:: data (Any)

canonical: str#: The canonical name to keep (best representative for the merged cluster).

merge: list[str]#: Other canonical names that should be merged into this cluster.

validate_no_self_reference()#

Ensure canonical name is not in its own merge list.

Return type:: Self

class cuery.tools.dedupe.MergeInstructions(/, **data)#

Bases: cuery.Response

Instructions for which clusters to merge.

Parameters:: data (Any)

groups: list[MergeGroup]#: Groups of clusters to merge. Each group specifies a canonical to keep and others to merge into it.

_valid_canonicals: ClassVar[set[str] | None] = None#

validate_merge_instructions()#

Validate merge instructions for consistency and against valid canonicals.

Return type:: Self

classmethod with_valid_canonicals(valid_canonicals)#

Create a subclass with valid canonicals baked in for validation.

This allows validation to happen during Pydantic parsing, triggering LLM retries on invalid responses.

Parameters:: valid_canonicals (set[str]) – Set of valid canonical names from the original clusters.
Returns:: A dynamically created MergeInstructions subclass with validation.
Return type:: type[MergeInstructions]
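The kind of consistency checks such a validator might run can be sketched as follows. The specific rules (no self-reference, no unknown canonical names, no name used twice) are assumptions about what "consistency" covers, and `check_merge_groups` is a hypothetical helper that represents each group as a `(canonical, merge)` tuple.

```python
def check_merge_groups(groups: list, valid_canonicals: set) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen = set()
    for canonical, merge in groups:
        if canonical in merge:
            problems.append(f"{canonical!r} lists itself in its merge list")
        for name in [canonical, *merge]:
            if name not in valid_canonicals:
                problems.append(f"unknown canonical {name!r}")
            elif name in seen:
                problems.append(f"{name!r} appears in more than one place")
            seen.add(name)
    return problems
```

In a Pydantic model, raising an error when this list is non-empty would surface the problems during parsing, which is what allows a retry to be triggered.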

cuery.tools.dedupe._normalize(s)#

Normalize string for pre-deduplication.

Parameters:: s (str)
Return type:: str
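A minimal sketch of such a normalization, assuming it lowercases, trims, and collapses whitespace (consistent with how `mapping` keys are described); the actual implementation may differ.

```python
import re


def normalize(s: str) -> str:
    # Lowercase, strip the ends, and collapse internal whitespace runs.
    return re.sub(r"\s+", " ", s.strip().lower())
```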

cuery.tools.dedupe._pre_deduplicate(entities)#

Remove exact duplicates (case-insensitive) and return unique list + mapping.

Parameters:: entities (list[str])
Returns:: A tuple (unique_entities, reverse_map). unique_entities is the list of unique entities (first occurrence kept); reverse_map maps each normalized form to all of its original variants.
Return type:: tuple[list[str], dict[str, list[str]]]
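The pre-deduplication step can be sketched as a single pass that keeps the first occurrence of each normalized form and records every original variant. The function names and the normalization rule are assumptions for illustration.

```python
import re


def normalize(s: str) -> str:
    # Assumed normalization: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", s.strip().lower())


def pre_deduplicate(entities: list) -> tuple:
    unique = []        # first occurrence of each normalized form
    reverse_map = {}   # normalized form -> all original variants
    for entity in entities:
        key = normalize(entity)
        if key not in reverse_map:
            unique.append(entity)
            reverse_map[key] = []
        if entity not in reverse_map[key]:
            reverse_map[key].append(entity)
    return unique, reverse_map
```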

class cuery.tools.dedupe.EntityClusterer(/, **data)#

Bases: cuery.Tool

Cluster semantically similar entities using LLM.

This tool groups a list of entities into semantic clusters, where each cluster contains entities that express the same concept. It makes efficient use of large context windows, processing up to thousands of entities per LLM call.

The tool first removes exact duplicates (case-insensitive), then sends unique entities to the LLM for semantic clustering. If multiple batches are needed, an optional merge step can consolidate similar clusters across batches.

Parameters:

entities – List of entity strings to cluster
instructions – Additional domain-specific instructions for clustering
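batch_size – Max entities per LLM call (default: 2000, which handles most cases in one call)
merge_clusters – If True and multiple batches are needed, merge similar clusters across batches (one LLM call)
consolidate – If True, always run a merge pass, even on single-batch results (default: True)
max_cluster_size – Maximum allowed members per cluster (default: 100); larger clusters trigger a validation error and a retry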
data (Any)

Example

>>> clusterer = EntityClusterer(
...     entities=["food too expensive", "overpriced food", "long lines", "queues too long"],
... )
>>> results = await clusterer()
>>> print(results.mapping)
{'food too expensive': 'expensive food', 'overpriced food': 'expensive food', ...}

entities: collections.abc.Iterable[str]#: Entities to cluster.

instructions: str = ''#: Additional domain-specific instructions for the clustering task.

batch_size: int = 2000#: Max entities per LLM call. Default handles most use cases in a single call.

merge_clusters: bool = True#: If True, merge similar clusters (across batches or within single batch for consolidation).

consolidate: bool = True#: If True, always run a merge pass even on single-batch results to consolidate similar clusters.

max_cluster_size: int = 100#: Maximum allowed members per cluster. Larger clusters trigger a validation error and a retry.

_unique_entities: list[str] | None = None#

_reverse_map: dict[str, list[str]] | None = None#

model_post_init(__context)#

Pre-deduplicate entities after initialization.

Return type:: None

property response_model: cuery.ResponseClass#

Create response model with validation limits for cluster size.

Return type:: cuery.ResponseClass

property prompt: cuery.Prompt#

Defines the prompt for this tool (ClassVar or property).

Return type:: cuery.Prompt

property context: cuery.AnyContext#

Create batched contexts - typically just one for most use cases.

Return type:: cuery.AnyContext
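Batching by `batch_size` can be pictured as simple slicing of the unique entity list. This sketch only shows the splitting; `make_batches` is a hypothetical helper, and how each batch is turned into a prompt context is not covered here.

```python
def make_batches(entities: list, batch_size: int = 2000) -> list:
    # Consecutive slices of at most batch_size items; one slice for small inputs.
    return [entities[i:i + batch_size] for i in range(0, len(entities), batch_size)]
```

With the default of 2000, any input of up to 2000 unique entities yields a single batch.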

_expand_clusters(clusters)#

Expand clusters to include all original variants from pre-deduplication.

Parameters:: clusters (list[EntityCluster])
Return type:: list[EntityCluster]
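Expansion can be sketched as replacing each cluster member with every original variant recorded for its normalized form. The helper below is hypothetical and works on a plain list of members plus a reverse map of the shape described for `_pre_deduplicate`.

```python
import re


def normalize(s: str) -> str:
    # Assumed normalization: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", s.strip().lower())


def expand_members(members: list, reverse_map: dict) -> list:
    expanded = []
    for member in members:
        # Fall back to the member itself if no variants were recorded.
        for variant in reverse_map.get(normalize(member), [member]):
            if variant not in expanded:
                expanded.append(variant)
    return expanded
```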

_concat_batch_results(results)#

Concatenate results from multiple batches without LLM merge.

Parameters:: results (list[ClusteredEntities])
Return type:: ClusteredEntities

async __call__(**kwargs)#

Run the clustering tool.

Return type:: ClusteredEntities

class cuery.tools.dedupe.ClusterMerger(/, **data)#

Bases: cuery.Tool

Merge semantically equivalent clusters using LLM-guided instructions.

This tool asks the LLM to identify which clusters should be merged (by canonical name), then applies the merges programmatically. This approach:

- Never loses entities (merging is done in code, not by LLM)
- Requires much smaller LLM output (just canonical names, not all entities)
- Is more reliable than asking LLM to output all entities again

Parameters:

clusters – List of EntityCluster objects to merge
instructions – Additional instructions for the merge task
data (Any)

clusters: list[EntityCluster]#: Clusters to potentially merge.

instructions: str = ''#: Additional domain-specific instructions.

property response_model: cuery.ResponseClass#

Create response model with valid canonicals baked in for validation.

Return type:: cuery.ResponseClass

property prompt: cuery.Prompt#

Defines the prompt for this tool (ClassVar or property).

Return type:: cuery.Prompt

property context: cuery.AnyContext#

Return type:: cuery.AnyContext

_apply_merge_instructions(instructions)#

Apply merge instructions to clusters programmatically.

Parameters:: instructions (MergeInstructions)
Return type:: ClusteredEntities
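Applying merges in code might look like the sketch below, which represents clusters as a dict from canonical name to members and each merge group as a `(canonical, merge)` tuple. Both representations and the function name are assumptions for illustration; the point is that members are moved, never regenerated, so none can be lost.

```python
def apply_merges(clusters: dict, groups: list) -> dict:
    # Copy so the input clusters are left untouched.
    merged = {canonical: list(members) for canonical, members in clusters.items()}
    for canonical, others in groups:
        if canonical not in merged:
            continue  # skip groups whose target no longer exists
        for other in others:
            if other == canonical:
                continue
            # Move members of the absorbed cluster into the kept one.
            for member in merged.pop(other, []):
                if member not in merged[canonical]:
                    merged[canonical].append(member)
    return merged
```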

async __call__(**kwargs)#

Get merge instructions from LLM and apply them programmatically.

Return type:: ClusteredEntities

cuery.tools.dedupe.deduplicate_entities(entities, results)#

Map a list of entities to their canonical forms using clustering results.

Parameters:

entities (collections.abc.Iterable[str]) – Original list of entities (may contain duplicates)
results (ClusteredEntities) – ClusteredEntities result from EntityClusterer

Returns:

List of canonical entity names in the same order as input. Entities not found in the mapping are returned as-is.

Return type:

list[str]
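The behavior described above amounts to a lookup with a fallback. The sketch below assumes a mapping keyed by normalized entity, as the `mapping` property describes; `map_to_canonical` is a hypothetical name, not the library function itself.

```python
import re


def normalize(s: str) -> str:
    # Assumed normalization: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", s.strip().lower())


def map_to_canonical(entities: list, mapping: dict) -> list:
    # Preserve input order; unknown entities pass through unchanged.
    return [mapping.get(normalize(e), e) for e in entities]
```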
