cuery.tools.dedupe
==================

.. py:module:: cuery.tools.dedupe

.. autoapi-nested-parse::

   Tool for semantically de-duplicating entities using LLM-based clustering.

   This module provides tools for grouping semantically equivalent entities (phrases,
   categories, aspect terms, etc.) into clusters and selecting canonical representatives.
   This is useful for post-processing outputs from other LLM-based extraction tools
   like AspectSentimentExtractor, where near-duplicate entities are common.

   The approach uses large context windows efficiently - processing up to thousands
   of entities in a single LLM call, avoiding expensive recursive merging.

   Example usage:

   >>> entities = [
   ...     "food too expensive", "overpriced food", "food prices high",
   ...     "long lines", "queues too long", "long wait times",
   ...     "friendly staff", "staff was nice",
   ... ]
   >>> clusterer = EntityClusterer(entities=entities)
   >>> results = await clusterer()
   >>> # Returns ClusteredEntities with clusters and canonical names


Attributes
----------

.. autoapisummary::

   cuery.tools.dedupe.CLUSTER_PROMPT_SYSTEM
   cuery.tools.dedupe.CLUSTER_PROMPT_USER
   cuery.tools.dedupe.MERGE_PROMPT_SYSTEM
   cuery.tools.dedupe.MERGE_PROMPT_USER


Classes
-------

.. autoapisummary::

   cuery.tools.dedupe.EntityCluster
   cuery.tools.dedupe.ClusteredEntities
   cuery.tools.dedupe.MergeGroup
   cuery.tools.dedupe.MergeInstructions
   cuery.tools.dedupe.EntityClusterer
   cuery.tools.dedupe.ClusterMerger


Functions
---------

.. autoapisummary::

   cuery.tools.dedupe._normalize
   cuery.tools.dedupe._pre_deduplicate
   cuery.tools.dedupe.deduplicate_entities


Module Contents
---------------

.. py:data:: CLUSTER_PROMPT_SYSTEM
   :value: ''


.. py:data:: CLUSTER_PROMPT_USER
   :value: ''


.. py:data:: MERGE_PROMPT_SYSTEM
   :value: ''


.. py:data:: MERGE_PROMPT_USER
   :value: ''


.. py:class:: EntityCluster(/, **data)

   Bases: :py:obj:`cuery.Response`

   A cluster of semantically equivalent entities.


   .. py:attribute:: canonical
      :type: str

      The canonical/representative name for this cluster.


   .. py:attribute:: members
      :type: list[str]

      All entities that belong to this cluster.


.. py:class:: ClusteredEntities(/, **data)

   Bases: :py:obj:`cuery.Response`

   Result of clustering entities into semantic groups.


   .. py:attribute:: clusters
      :type: list[EntityCluster]

      List of entity clusters.


   .. py:attribute:: _max_cluster_size
      :type: ClassVar[int | None]
      :value: None


   .. py:attribute:: _total_entities
      :type: ClassVar[int | None]
      :value: None


   .. py:method:: validate_no_degenerate_clusters()

      Reject catch-all clusters and other degenerate patterns.


   .. py:method:: with_validation_limits(max_cluster_size = None, total_entities = None)
      :classmethod:

      Create a subclass with validation limits baked in.


   .. py:property:: canonicals
      :type: list[str]

      Get all canonical names.


   .. py:property:: mapping
      :type: dict[str, str]

      Get a mapping from each member entity to its canonical name.

      Keys are normalized (lowercase, whitespace-collapsed) for robust matching.


   .. py:property:: all_members
      :type: set[str]

      Get all member entities across all clusters (normalized).


   .. py:property:: member_count
      :type: int

      Get the total number of member entities across all clusters.


   .. py:method:: coverage(entities)

      Calculate what fraction of entities are covered by clusters.


   .. py:method:: missing(entities)

      Get entities that are not in any cluster.


   .. py:method:: to_dict()

      Convert to a dictionary mapping canonical names to members.


.. py:class:: MergeGroup(/, **data)

   Bases: :py:obj:`cuery.Response`

   A group of clusters that should be merged together.


   .. py:attribute:: canonical
      :type: str

      The canonical name to keep (best representative for the merged cluster).


   .. py:attribute:: merge
      :type: list[str]

      Other canonical names that should be merged into this cluster.


   .. py:method:: validate_no_self_reference()

      Ensure the canonical name is not in its own merge list.


.. py:class:: MergeInstructions(/, **data)

   Bases: :py:obj:`cuery.Response`

   Instructions for which clusters to merge.


   .. py:attribute:: groups
      :type: list[MergeGroup]

      Groups of clusters to merge. Each group specifies a canonical to keep and others to merge into it.


   .. py:attribute:: _valid_canonicals
      :type: ClassVar[set[str] | None]
      :value: None


   .. py:method:: validate_merge_instructions()

      Validate merge instructions for internal consistency and against the set of valid canonicals.


   .. py:method:: with_valid_canonicals(valid_canonicals)
      :classmethod:

      Create a subclass with valid canonicals baked in for validation.

      This allows validation to happen during Pydantic parsing, triggering
      LLM retries on invalid responses.

      :param valid_canonicals: Set of valid canonical names from the original clusters.

      :returns: A dynamically created MergeInstructions subclass with validation.


.. py:function:: _normalize(s)

   Normalize a string for pre-deduplication.


.. py:function:: _pre_deduplicate(entities)

   Remove exact duplicates (case-insensitive) and return the unique list plus a mapping.

   :returns: Two values, ``unique_entities`` and ``reverse_map``.
             ``unique_entities`` is the list of unique entities (first occurrence kept);
             ``reverse_map`` maps each normalized form to all of its original variants.


.. py:class:: EntityClusterer(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Cluster semantically similar entities using an LLM.

   This tool groups a list of entities into semantic clusters, where each cluster
   contains entities that express the same concept. Uses large context windows
   efficiently - processes up to thousands of entities per LLM call.

   The tool first removes exact duplicates (case-insensitive), then sends unique
   entities to the LLM for semantic clustering. If multiple batches are needed,
   an optional merge step can consolidate similar clusters across batches.

   :param entities: List of entity strings to cluster
   :param instructions: Additional domain-specific instructions for clustering
   :param batch_size: Max entities per LLM call (default: 2000 - handles most cases in one call)
   :param merge_clusters: If True and multiple batches, merge similar clusters across batches (one LLM call)

   .. rubric:: Example

   >>> clusterer = EntityClusterer(
   ...     entities=["food too expensive", "overpriced food", "long lines", "queues too long"],
   ... )
   >>> results = await clusterer()
   >>> print(results.mapping)
   {'food too expensive': 'expensive food', 'overpriced food': 'expensive food', ...}


   .. py:attribute:: entities
      :type: collections.abc.Iterable[str]

      Entities to cluster.


   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional domain-specific instructions for the clustering task.


   .. py:attribute:: batch_size
      :type: int
      :value: 2000

      Max entities per LLM call. The default handles most use cases in a single call.


   .. py:attribute:: merge_clusters
      :type: bool
      :value: True

      If True, merge similar clusters (across batches, or within a single batch for consolidation).


   .. py:attribute:: consolidate
      :type: bool
      :value: True

      If True, always run a merge pass, even on single-batch results, to consolidate similar clusters.


   .. py:attribute:: max_cluster_size
      :type: int
      :value: 100

      Maximum allowed members per cluster. Larger clusters trigger a validation error and a retry.


   .. py:attribute:: _unique_entities
      :type: list[str] | None
      :value: None


   .. py:attribute:: _reverse_map
      :type: dict[str, list[str]] | None
      :value: None


   .. py:method:: model_post_init(__context)

      Pre-deduplicate entities after initialization.


   .. py:property:: response_model
      :type: cuery.ResponseClass

      Create response model with validation limits for cluster size.


   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).


   .. py:property:: context
      :type: cuery.AnyContext

      Create batched contexts - typically just one for most use cases.


   .. py:method:: _expand_clusters(clusters)

      Expand clusters to include all original variants from pre-deduplication.


   .. py:method:: _concat_batch_results(results)

      Concatenate results from multiple batches without an LLM merge.


   .. py:method:: __call__(**kwargs)
      :async:

      Run the clustering tool.


.. py:class:: ClusterMerger(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Merge semantically equivalent clusters using LLM-guided instructions.

   This tool asks the LLM to identify which clusters should be merged (by canonical name),
   then applies the merges programmatically. This approach:

   - Never loses entities (merging is done in code, not by the LLM)
   - Requires much smaller LLM output (just canonical names, not all entities)
   - Is more reliable than asking the LLM to output all entities again

   :param clusters: List of EntityCluster objects to merge
   :param instructions: Additional instructions for the merge task


   .. py:attribute:: clusters
      :type: list[EntityCluster]

      Clusters to potentially merge.


   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional domain-specific instructions.


   .. py:property:: response_model
      :type: cuery.ResponseClass

      Create response model with valid canonicals baked in for validation.


   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).


   .. py:property:: context
      :type: cuery.AnyContext


   .. py:method:: _apply_merge_instructions(instructions)

      Apply merge instructions to clusters programmatically.


   .. py:method:: __call__(**kwargs)
      :async:

      Get merge instructions from the LLM and apply them programmatically.


.. py:function:: deduplicate_entities(entities, results)

   Map a list of entities to their canonical forms using clustering results.

   :param entities: Original list of entities (may contain duplicates)
   :param results: ClusteredEntities result from EntityClusterer

   :returns: List of canonical entity names in the same order as the input.
             Entities not found in the mapping are returned as-is.
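
.. rubric:: Illustrative sketch

The pre-deduplication and canonical-mapping steps documented above can be
approximated in plain Python. The sketch below does not import ``cuery`` and
makes no LLM call. The helper names ``normalize``, ``pre_deduplicate`` and
``map_to_canonical`` are hypothetical stand-ins, and their behavior is an
assumption inferred from the docstrings (lowercasing plus whitespace collapsing,
first occurrence kept, unmapped entities returned as-is). It should not be read
as the library's actual implementation.

.. code-block:: python

   import re


   def normalize(s: str) -> str:
       """Lowercase and collapse runs of whitespace (assumed normalization)."""
       return re.sub(r"\s+", " ", s.strip()).lower()


   def pre_deduplicate(entities):
       """Keep the first occurrence of each normalized form.

       Returns (unique_entities, reverse_map), where reverse_map maps each
       normalized form to every distinct original variant seen for it.
       """
       unique = []
       reverse_map = {}
       for entity in entities:
           key = normalize(entity)
           if key not in reverse_map:
               reverse_map[key] = []
               unique.append(entity)
           if entity not in reverse_map[key]:
               reverse_map[key].append(entity)
       return unique, reverse_map


   def map_to_canonical(entities, mapping):
       """Replace each entity by its canonical name, preserving input order.

       ``mapping`` uses normalized member strings as keys; entities without
       a match are returned unchanged.
       """
       return [mapping.get(normalize(e), e) for e in entities]


   # Hypothetical clustering result, written by hand instead of calling an LLM.
   clusters = {
       "expensive food": ["food too expensive", "overpriced food"],
       "long lines": ["long lines", "queues too long"],
   }
   # Member -> canonical lookup with normalized keys.
   mapping = {normalize(m): c for c, members in clusters.items() for m in members}

   raw = ["Food too  expensive", "food too expensive", "queues too long", "free wifi"]
   unique, reverse_map = pre_deduplicate(raw)
   canonical = map_to_canonical(raw, mapping)
   print(unique)
   print(canonical)

Here only three of the four raw strings would need to be sent for clustering,
because the first two differ only in case and spacing. ``"free wifi"`` belongs
to no cluster, so it passes through unchanged.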