A Python library for intelligent schema inference and validation across various data formats.
- Infer schema from CSV, JSON, Parquet, and relational database tables.
- Generate human-readable schema definitions (e.g., YAML, JSON, Python dict).
- Validate new data rows or files against a loaded schema, reporting type mismatches, missing fields, or constraint violations.
- Suggest data type coercions and transformations based on inferred patterns.
- Support custom validation rules and data-cleaning hooks.
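
To make the validation bullet concrete, here is a minimal sketch of checking one row against a schema and reporting missing fields and type mismatches. It is plain Python and does not use datasense's API: `SCHEMA`, `PY_TYPES`, and `validate_row` are hypothetical names chosen for illustration, and the schema shape (a `fields` list of name/type pairs) is an assumption.

```python
from typing import Any

# Hypothetical schema: a "fields" list of name/type pairs (assumed shape).
SCHEMA = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "age", "type": "integer"},
        {"name": "is_active", "type": "boolean"},
    ]
}

# Maps schema type names to the Python types used for the check.
PY_TYPES = {"integer": int, "string": str, "boolean": bool}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row is valid."""
    errors = []
    for field in schema["fields"]:
        name, expected = field["name"], field["type"]
        if name not in row:
            errors.append(f"missing field: {name}")
            continue
        value: Any = row[name]
        py_type = PY_TYPES[expected]
        # bool is a subclass of int, so reject booleans where an integer is expected.
        bool_as_int = isinstance(value, bool) and py_type is not bool
        if bool_as_int or not isinstance(value, py_type):
            errors.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return errors


good = {"id": 1, "name": "Alice", "age": 30, "is_active": True}
bad = {"id": "x", "name": "Bob", "is_active": False}
print(validate_row(good, SCHEMA))
print(validate_row(bad, SCHEMA))
```

The sketch collects every problem instead of stopping at the first, which is usually more useful when reporting on a whole file. The library's own validation interface may look quite different.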
First, install the library using pip:
pip install datasense

Let's infer a schema from a simple CSV file. Create a file named data.csv:
id,name,age,is_active
1,Alice,30,true
2,Bob,24,false
3,Charlie,35,true
Now, use datasense to infer its schema:
import json
from datasense.core import infer_schema
# For demonstration, let's assume 'infer_schema' can read directly from a path
# In a real scenario, you might pass a file object or specific format handler.
csv_file_path = "data.csv"
# Infer schema
inferred_schema = infer_schema(csv_file_path, format='csv')
# Print the inferred schema (e.g., as a pretty JSON string)
print(json.dumps(inferred_schema, indent=2))
# Expected output (simplified example):
# {
#   "fields": [
#     {"name": "id", "type": "integer"},
#     {"name": "name", "type": "string"},
#     {"name": "age", "type": "integer"},
#     {"name": "is_active", "type": "boolean"}
#   ]
# }

This example demonstrates basic usage of datasense to quickly infer a data schema. Refer to the documentation for more advanced features such as custom rules, validation, and other data formats.
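
For intuition about where types like `integer` and `boolean` come from, here is a rough sketch of column type inference over CSV text using only the standard library. It is not datasense's implementation: `infer_type` and `infer_csv_schema` are hypothetical helpers, and the rule used here (try the narrowest type first, fall back to `string`) is an assumption about how such inference might work.

```python
import csv
import io

SAMPLE = """id,name,age,is_active
1,Alice,30,true
2,Bob,24,false
3,Charlie,35,true
"""


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v]  # skip empty strings and missing cells
    if not non_empty:
        return "string"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "boolean"
    # Try integer before the wider float type.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_csv_schema(text):
    """Build a {'fields': [...]} dict with one entry per CSV column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "fields": [
            {"name": col, "type": infer_type([row[col] for row in rows])}
            for col in columns
        ]
    }


schema = infer_csv_schema(SAMPLE)
for field in schema["fields"]:
    print(field["name"], field["type"])
```

A real inference engine would also need to handle dates, nulls, sampling of large files, and ambiguous columns. This sketch only covers the four types that appear in the example above.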