Untitled document
Created: 2025-02-12
Techniques for cleaning and preprocessing data, using LLMs to backfill missing values, fix inconsistent formatting, remove duplicates and consolidate
Large Language Models (LLMs) can fix multiple problems in one sweep:
- Missing Values: Gaps in datasets from incomplete data entry, technical errors, or system limitations.
- Inconsistencies: Different representations of the same information (e.g., “New York” vs “NY”, “123 Main St.” vs “123 Main Street”) that complicate aggregation and analysis.
- Duplicate Records: Multiple entries of the same data that can skew analysis results and waste resources.
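
A minimal sketch of what "one sweep" could look like in practice: a single prompt asks the model to fill gaps, normalize formatting, and merge duplicates, and the JSON reply is parsed back into records. The prompt wording, the sample records, and the `complete` callable (a stand-in for whatever function actually calls the model) are all assumptions for illustration, not part of the original notes.

```python
import json
from typing import Callable

# Hypothetical prompt wording; adjust it to the dataset at hand.
INSTRUCTIONS = (
    "Clean the records below. Fill in missing values only when they can be "
    "inferred from the other fields, normalize inconsistent formatting "
    "(for example 'NY' -> 'New York'), and merge duplicate records. "
    "Return only a JSON array of cleaned records."
)


def build_prompt(records: list[dict]) -> str:
    """Combine the instructions and the raw records into one prompt."""
    return INSTRUCTIONS + "\n\n" + json.dumps(records, indent=2)


def clean_records(records: list[dict], complete: Callable[[str], str]) -> list[dict]:
    """Send all records in one request and parse the JSON reply.

    `complete` is a placeholder: it takes a prompt string and returns
    the model's text response.
    """
    reply = complete(build_prompt(records))
    cleaned = json.loads(reply)
    if not isinstance(cleaned, list):
        raise ValueError("expected a JSON array of records")
    return cleaned


# Illustrative input showing all three problems: a duplicate pair with
# inconsistent formatting, and a record with a missing city.
raw = [
    {"name": "Ann Lee", "city": "NY", "address": "123 Main St."},
    {"name": "Ann Lee", "city": "New York", "address": "123 Main Street"},
    {"name": "Bo Chan", "city": None, "address": "9 Elm Street"},
]
```

Passing the model call in as a function keeps the sketch independent of any particular provider and makes it easy to test with a canned reply.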
When we use Pydantic models for structured outputs, the returned data automatically conforms to our schema. We don’t need to provide additional formatting instructions or parse the response.
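
To make the schema idea concrete, here is a small sketch assuming Pydantic v2. The `CleanedRecord` and `CleanedBatch` models and their fields are hypothetical, and the model request itself is left out because it depends on the client library in use. A hard-coded string stands in for the model's reply.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class CleanedRecord(BaseModel):
    """Hypothetical shape of one cleaned row."""
    name: str
    city: str
    state: Optional[str] = None  # may stay empty if it cannot be inferred


class CleanedBatch(BaseModel):
    """Wrapper so the reply is a single JSON object holding all rows."""
    records: list[CleanedRecord]


# JSON Schema derived from the model; a structured-output API would
# typically receive this (or the model class itself) with the request.
schema = CleanedBatch.model_json_schema()

# Stand-in for a model reply that follows the schema.
sample_reply = '{"records": [{"name": "Ann Lee", "city": "New York", "state": "NY"}]}'
batch = CleanedBatch.model_validate_json(sample_reply)


def is_valid(reply: str) -> bool:
    """Return True if the reply parses into a CleanedBatch."""
    try:
        CleanedBatch.model_validate_json(reply)
        return True
    except ValidationError:
        return False
```

The result is a typed object (`batch.records[0].city`) rather than a raw string, and a reply that omits a required field fails validation instead of passing silently.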
Useful for automated data labeling. We can either let the model choose appropriate categories based on context, or define a specific set of categories ourselves.
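
The two labeling options can be sketched as two schemas, again assuming Pydantic v2. The class names and the category list (`billing`, `technical`, `other`) are made up for illustration. A plain `str` field leaves the choice of category to the model, while a `Literal` field restricts it to a fixed set.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class FreeLabel(BaseModel):
    """The model chooses whatever category fits the context."""
    category: str


class FixedLabel(BaseModel):
    """Only the listed (hypothetical) categories are accepted."""
    category: Literal["billing", "technical", "other"]


# Stand-ins for model replies.
free = FreeLabel.model_validate_json('{"category": "shipping delay"}')
fixed = FixedLabel.model_validate_json('{"category": "billing"}')


def accepts_fixed(reply: str) -> bool:
    """Return True if the reply uses one of the predefined categories."""
    try:
        FixedLabel.model_validate_json(reply)
        return True
    except ValidationError:
        return False
```

A fixed set keeps labels consistent across a dataset. The free-form variant is handy for exploring which categories exist before committing to a list.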