cuery.tools
===========

.. py:module:: cuery.tools


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cuery/tools/abs/index
   /autoapi/cuery/tools/dedupe/index
   /autoapi/cuery/tools/flex/index
   /autoapi/cuery/tools/schema/index
   /autoapi/cuery/tools/topics/index


Classes
-------

.. autoapisummary::

   cuery.tools.AspectEntities
   cuery.tools.AspectSentimentExtractor
   cuery.tools.ClusteredEntities
   cuery.tools.ClusterMerger
   cuery.tools.EntityCluster
   cuery.tools.EntityClusterer
   cuery.tools.Classifier
   cuery.tools.EntityExtractor
   cuery.tools.Auto
   cuery.tools.Generic
   cuery.tools.Scorer
   cuery.tools.MultiTopicAssigner
   cuery.tools.TopicAssigner
   cuery.tools.TopicExtractor
   cuery.tools.SchemaGenerator
   cuery.tools.SchemaResponse


Functions
---------

.. autoapisummary::

   cuery.tools.deduplicate_entities


Package Contents
----------------

.. py:class:: AspectEntities(/, **data)

   Bases: :py:obj:`cuery.Response`

   Represents a collection of entities with their sentiments and reasons for assignment.

   .. py:attribute:: entities
      :type: list[AspectEntity]

      A list of entities with their sentiments and reasons.


.. py:class:: AspectSentimentExtractor(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Extract entities with sentiments from texts.

   .. py:attribute:: texts
      :type: collections.abc.Iterable[str | float | None]

      The texts to extract entities from.

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Further instructions from the user for the entity extraction task.

   .. py:attribute:: aspect_categories
      :type: list[str] | None
      :value: None

      Optional list of aspect categories to map entities to (e.g., ['food', 'service', 'pricing']).

   .. py:attribute:: response_model
      :type: ClassVar[cuery.ResponseClass]

      Defines the response model for this tool (ClassVar or property).

   .. py:method:: _coerce_na(v)
      :classmethod:

      Convert pandas NA/NaN values to None so Pydantic accepts them.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: context
      :type: cuery.AnyContext


.. py:class:: ClusteredEntities(/, **data)

   Bases: :py:obj:`cuery.Response`

   Result of clustering entities into semantic groups.

   .. py:attribute:: clusters
      :type: list[EntityCluster]

      List of entity clusters.

   .. py:attribute:: _max_cluster_size
      :type: ClassVar[int | None]
      :value: None

   .. py:attribute:: _total_entities
      :type: ClassVar[int | None]
      :value: None

   .. py:method:: validate_no_degenerate_clusters()

      Reject catch-all clusters and other degenerate patterns.

   .. py:method:: with_validation_limits(max_cluster_size = None, total_entities = None)
      :classmethod:

      Create a subclass with validation limits baked in.

   .. py:property:: canonicals
      :type: list[str]

      Get all canonical names.

   .. py:property:: mapping
      :type: dict[str, str]

      Get a mapping from each member entity to its canonical name.

      Keys are normalized (lowercase, whitespace-collapsed) for robust matching.

   .. py:property:: all_members
      :type: set[str]

      Get all member entities across all clusters (normalized).

   .. py:property:: member_count
      :type: int

      Get the total number of member entities across all clusters.

   .. py:method:: coverage(entities)

      Calculate what fraction of entities are covered by clusters.

   .. py:method:: missing(entities)

      Get entities that are not in any cluster.

   .. py:method:: to_dict()

      Convert to a dictionary mapping canonical names to members.


.. py:class:: ClusterMerger(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Merge semantically equivalent clusters using LLM-guided instructions.

   This tool asks the LLM to identify which clusters should be merged (by canonical name),
   then applies the merges programmatically. This approach:

   - Never loses entities (merging is done in code, not by the LLM)
   - Requires much smaller LLM output (just canonical names, not all entities)
   - Is more reliable than asking the LLM to output all entities again

   :param clusters: List of EntityCluster objects to merge
   :param instructions: Additional instructions for the merge task

   .. py:attribute:: clusters
      :type: list[EntityCluster]

      Clusters to potentially merge.

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional domain-specific instructions.

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Create response model with valid canonicals baked in for validation.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: context
      :type: cuery.AnyContext

   .. py:method:: _apply_merge_instructions(instructions)

      Apply merge instructions to clusters programmatically.

   .. py:method:: __call__(**kwargs)
      :async:

      Get merge instructions from LLM and apply them programmatically.


.. py:class:: EntityCluster(/, **data)

   Bases: :py:obj:`cuery.Response`

   A cluster of semantically equivalent entities.

   .. py:attribute:: canonical
      :type: str

      The canonical/representative name for this cluster.

   .. py:attribute:: members
      :type: list[str]

      All entities that belong to this cluster.


.. py:class:: EntityClusterer(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Cluster semantically similar entities using an LLM.

   This tool groups a list of entities into semantic clusters, where each cluster
   contains entities that express the same concept. It uses large context windows
   efficiently, processing up to thousands of entities per LLM call.

   The tool first removes exact duplicates (case-insensitive), then sends unique entities
   to the LLM for semantic clustering. If multiple batches are needed, an optional
   merge step can consolidate similar clusters across batches.

   :param entities: List of entity strings to cluster
   :param instructions: Additional domain-specific instructions for clustering
   :param batch_size: Max entities per LLM call (default: 2000, which handles most cases in one call)
   :param merge_clusters: If True and multiple batches are needed, merge similar clusters across batches (one LLM call)

   .. rubric:: Example

   >>> clusterer = EntityClusterer(
   ...     entities=["food too expensive", "overpriced food", "long lines", "queues too long"],
   ... )
   >>> results = await clusterer()
   >>> print(results.mapping)
   {'food too expensive': 'expensive food', 'overpriced food': 'expensive food', ...}

   .. py:attribute:: entities
      :type: collections.abc.Iterable[str]

      Entities to cluster.

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional domain-specific instructions for the clustering task.

   .. py:attribute:: batch_size
      :type: int
      :value: 2000

      Max entities per LLM call. Default handles most use cases in a single call.

   .. py:attribute:: merge_clusters
      :type: bool
      :value: True

      If True, merge similar clusters (across batches or within single batch for consolidation).

   .. py:attribute:: consolidate
      :type: bool
      :value: True

      If True, always run a merge pass even on single-batch results to consolidate similar clusters.

   .. py:attribute:: max_cluster_size
      :type: int
      :value: 100

      Maximum allowed members per cluster. Larger clusters trigger a validation error and a retry.

   .. py:attribute:: _unique_entities
      :type: list[str] | None
      :value: None

   .. py:attribute:: _reverse_map
      :type: dict[str, list[str]] | None
      :value: None

   .. py:method:: model_post_init(__context)

      Pre-deduplicate entities after initialization.

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Create response model with validation limits for cluster size.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: context
      :type: cuery.AnyContext

      Create batched contexts, typically just one for most use cases.

   .. py:method:: _expand_clusters(clusters)

      Expand clusters to include all original variants from pre-deduplication.

   .. py:method:: _concat_batch_results(results)

      Concatenate results from multiple batches without LLM merge.

   .. py:method:: __call__(**kwargs)
      :async:

      Run the clustering tool.


.. py:function:: deduplicate_entities(entities, results)

   Map a list of entities to their canonical forms using clustering results.

   :param entities: Original list of entities (may contain duplicates)
   :param results: ClusteredEntities result from EntityClusterer

   :returns: List of canonical entity names in the same order as input.
             Entities not found in the mapping are returned as-is.


.. py:class:: Classifier(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Zero-shot classify a data record with arbitrary attributes.

   .. py:attribute:: categories
      :type: dict[str, str]

      Dictionary of category labels and their descriptions.

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional instructions (context) for the classification task.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).


.. py:class:: EntityExtractor(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Extract SEO-relevant entities from Google SERP AI Overview data.

   .. py:attribute:: entities
      :type: dict[str, str]

      Dictionary of entity names/categories and their descriptions.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).

   .. py:method:: __call__(**kwargs)
      :async:

      Normalize the nested input records back into individual columns in output.


.. py:class:: Auto(/, **data)

   Bases: :py:obj:`Generic`

   Fully automatic, general-purpose tool for processing data records.

   First auto-generates a response model from the response model instructions, then
   iterates over the records using that model and the provided tool instructions.

   .. py:attribute:: response_schema
      :type: str | dict | None
      :value: None

      Instructions to generate a JSON schema used as response model.

   .. py:attribute:: schema_model
      :type: str
      :value: None

      Specific model to use to generate the JSON schema.

   .. py:attribute:: _response
      :type: cuery.ResponseSet | None
      :value: None

   .. py:property:: prompt
      :type: cuery.Prompt

      Generate a prompt string based on the instructions and current schema.

   .. py:method:: response_model()
      :async:

      Defines the response model for this tool (ClassVar or property).

   .. py:method:: task()
      :async:

      Create a Task instance for this tool.

   .. py:method:: __call__(**kwargs)
      :async:

      Normalize the nested input records back into individual columns in output.


.. py:class:: Generic(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Tool that iterates over records with a JSON-schema response model.

   .. py:attribute:: response_schema
      :type: dict

      JSON schema used as response model.

   .. py:attribute:: instructions
      :type: str

      Instructions for the tool, describing its purpose and how to use it.

   .. py:property:: prompt
      :type: cuery.Prompt

      Generate a prompt string based on the instructions and current schema.

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).


.. py:class:: Scorer(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Classify intent for keywords based on their SERP results.

   .. py:attribute:: name
      :type: str

      Name of the score to assign.

   .. py:attribute:: type
      :type: Literal['integer', 'float']
      :value: 'float'

      Whether to return the score as integer or float.

   .. py:attribute:: min
      :type: float

      Minimum value of the score.

   .. py:attribute:: max
      :type: float

      Maximum value of the score.

   .. py:attribute:: description
      :type: str

      Description of the score to assign.

   .. py:method:: validate_name(name)
      :classmethod:

      Ensure the name is a valid Python identifier.

   .. py:property:: scorer_params
      :type: dict

      Get the parameters for the score model.

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).


.. py:class:: MultiTopicAssigner(/, **data)

   Bases: :py:obj:`TopicAssigner`

   Enforce correct multi-topic-subtopic assignment via a Pydantic model.

   .. py:attribute:: SYSTEM_PROMPT
      :type: ClassVar[str]
      :value: ''

   .. py:attribute:: USER_PROMPT
      :type: ClassVar[str]
      :value: ''

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).


.. py:class:: TopicAssigner(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Assign topics to records with arbitrary attributes.

   .. py:attribute:: topics
      :type: cuery.tools.topics.Topics

      Topics and subtopics to use for assignment, either as a Topics object or a dict.

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional use-case specific instructions or context for the topic extraction.

   .. py:attribute:: SYSTEM_PROMPT
      :type: ClassVar[str]
      :value: ''

   .. py:attribute:: USER_PROMPT
      :type: ClassVar[str]
      :value: ''

   .. py:method:: validate_topics(topics)
      :classmethod:

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).


.. py:class:: TopicExtractor(/, **data)

   Bases: :py:obj:`cuery.tools.flex.base.FlexTool`

   Extract topics from records with arbitrary attributes.

   .. py:attribute:: n_topics
      :type: int
      :value: None

      Approximate number of top-level topics to extract (maximum 20).

   .. py:attribute:: n_subtopics
      :type: int
      :value: None

      Approximate number of subtopics per top-level topic (at least 2, maximum 10).

   .. py:attribute:: instructions
      :type: str
      :value: ''

      Additional use-case specific instructions or context for the topic extraction.

   .. py:attribute:: min_ldist
      :type: int
      :value: None

      Minimum Levenshtein distance between topic labels.

   .. py:attribute:: max_samples
      :type: int
      :value: 500

      Maximum number of samples to use for topic extraction.

   .. py:attribute:: record_format
      :type: Literal['attr_wise', 'rec_wise']
      :value: 'attr_wise'

      Format of the records in the prompt.

   .. py:property:: response_model
      :type: cuery.ResponseClass

      Defines the response model for this tool (ClassVar or property).

   .. py:property:: prompt
      :type: cuery.Prompt

      Defines the prompt for this tool (ClassVar or property).

   .. py:property:: context
      :type: dict

      Override FlexTool base implementation.

      This tool is different because it doesn't iterate over records, but rather
      processes them all at once to extract topics.

   .. py:method:: __call__(**kwargs)
      :async:

      Normalize the nested input records back into individual columns in output.


.. py:class:: SchemaGenerator(/, **data)

   Bases: :py:obj:`cuery.Tool`

   Create or modify a JSON schema given a prompt and optionally an existing schema.

   .. py:attribute:: instructions
      :type: str

      Prompt instructions with details of the schema to generate.

   .. py:attribute:: current_schema
      :type: dict | None
      :value: None

      Optional existing schema to modify or extend.

   .. py:attribute:: response_model
      :type: ClassVar[cuery.ResponseClass]

      All instances of this tool will use the SchemaResponse model.

   .. py:property:: prompt
      :type: cuery.Prompt

      Add system and assistant messages to user's prompt.

   .. py:method:: __call__(**kwds)
      :async:

      Extracts a two-level topic hierarchy from a list of texts.


.. py:class:: SchemaResponse(/, **data)

   Bases: :py:obj:`cuery.Response`

   Response from the AI that includes both conversation and schema update.

   .. py:attribute:: reasoning
      :type: str

      Brief explanation of schema design choices.

   .. py:attribute:: json_schema
      :type: dict[str, Any]

      Valid JSON schema as a dictionary defining a structured output.

   .. py:method:: validate_json_schema(json_schema)
      :classmethod:

      Validate that the schema is a proper JSON schema.
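
.. rubric:: Example: mapping entities to canonical names

The behaviour documented for ``ClusteredEntities.mapping``, ``coverage``, ``missing`` and ``deduplicate_entities`` can be illustrated without calling an LLM. The sketch below does not use cuery's implementation: ``Cluster``, ``normalize``, ``build_mapping``, ``dedupe``, ``coverage`` and ``missing`` are hypothetical stand-ins written for this example. It assumes that normalization means lowercasing and collapsing whitespace, and that coverage is computed over the input list as given; the library may differ on both points.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization rule)."""
    return " ".join(text.lower().split())


@dataclass
class Cluster:
    """Hypothetical stand-in for EntityCluster: a canonical name plus members."""

    canonical: str
    members: list


def build_mapping(clusters: list) -> dict:
    """Map every normalized member to the canonical name of its cluster."""
    return {normalize(m): c.canonical for c in clusters for m in c.members}


def dedupe(entities: list, mapping: dict) -> list:
    """Replace each entity by its canonical name, preserving input order.

    Entities without a mapping entry are returned unchanged.
    """
    return [mapping.get(normalize(e), e) for e in entities]


def coverage(entities: list, mapping: dict) -> float:
    """Fraction of input entities that belong to some cluster."""
    if not entities:
        return 0.0
    return sum(normalize(e) in mapping for e in entities) / len(entities)


def missing(entities: list, mapping: dict) -> list:
    """Entities that do not belong to any cluster."""
    return [e for e in entities if normalize(e) not in mapping]


clusters = [
    Cluster("expensive food", ["food too expensive", "overpriced food"]),
    Cluster("long queues", ["long lines", "queues too long"]),
]
mapping = build_mapping(clusters)

# Mixed case and extra whitespace still match thanks to normalization.
entities = ["Overpriced  Food", "long lines", "rude staff", "overpriced food"]
canonical = dedupe(entities, mapping)
print(canonical)  # ['expensive food', 'long queues', 'rude staff', 'expensive food']
print(coverage(entities, mapping))  # 0.75
print(missing(entities, mapping))  # ['rude staff']
```

With the real tools, the equivalent flow would presumably be to await an ``EntityClusterer`` and pass its result to ``deduplicate_entities(entities, results)``, as the signatures above suggest.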