Add client.dataframe namespace for pandas DataFrame CRUD operations #98
zhaodongwang-msft wants to merge 24 commits into main from users/zhaodongwang/dataFrameExtensionClaude
Conversation
Pull request overview
Adds pandas DataFrame/Series wrappers to the Dataverse Python SDK so callers can perform CRUD operations using DataFrame-native inputs/outputs, plus accompanying docs, examples, and tests.
Changes:
- Added DataverseClient DataFrame CRUD wrapper methods: get_dataframe, create_dataframe, update_dataframe, delete_dataframe.
- Added unit tests and an end-to-end example demonstrating DataFrame CRUD workflows.
- Updated docs/README and added pandas as a dependency.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/PowerPlatform/Dataverse/client.py | Implements DataFrame CRUD wrapper methods on DataverseClient. |
| tests/unit/test_client_dataframe.py | Adds unit coverage for DataFrame CRUD wrappers. |
| examples/advanced/dataframe_operations.py | Adds a walkthrough script showing DataFrame CRUD usage. |
| pyproject.toml | Adds pandas to project dependencies. |
| README.md | Documents DataFrame CRUD usage examples. |
| src/PowerPlatform/Dataverse/claude_skill/dataverse-sdk-use/SKILL.md | Documents DataFrame CRUD usage in the packaged skill doc. |
| .claude/skills/dataverse-sdk-use/SKILL.md | Documents DataFrame CRUD usage in the repo-local skill doc. |
| .gitignore | Ignores additional Claude local markdown files. |
@zhaodongwang-msft I've opened a new pull request, #99, to work on those changes. Once the pull request is ready, I'll request review from you.

@zhaodongwang-msft I've opened a new pull request, #100, to work on those changes. Once the pull request is ready, I'll request review from you.
We'll want to track 2 followups from this that are dependent on other refactor PRs, so we can't address them quite yet:
…lete input validation, export DataFrameOperations (#145)

Addresses four unresolved review comments from PR #98 against the `client.dataframe` namespace: a crash on array-valued cells, silent NumPy serialization failures, missing ID validation in `update()` and `delete()`, and missing exports/tests.

## `utils/_pandas.py`

- **Fix `pd.notna()` crash on array-like cells**: Guard with `pd.api.types.is_scalar(v)` before calling `pd.notna()`; non-scalar values (lists, dicts, numpy arrays) pass through directly. Previously raised `ValueError: The truth value of an array is ambiguous`.
- **Normalize NumPy scalar types**: New `_normalize_scalar(v)` helper converts `np.integer` → `int`, `np.floating` → `float`, `np.bool_` → `bool`, `pd.Timestamp` → ISO string. DataFrames with integer columns produce `np.int64` by default, which `json.dumps()` cannot serialize.

```python
# Before: would crash or produce non-serializable values
df = pd.DataFrame([{"tags": ["a", "b"]}, {"count": np.int64(5)}])
dataframe_to_records(df)  # ValueError / TypeError at serialization time

# After: safe
[{"tags": ["a", "b"]}, {"count": 5}]
```

## `operations/dataframe.py`

- **`update()` — validate `id_column` values**: After extracting IDs, raises `ValueError` listing offending row indices if any value is not a non-empty string (catches `NaN`, `None`, numeric IDs).
- **`update()` — validate non-empty change columns**: Raises `ValueError` if the DataFrame contains only the `id_column` and no fields to update.
- **`delete()` — validate `ids` Series**: Returns `None` immediately for an empty Series; raises `ValueError` listing offending indices for any non-string or blank value.

## `operations/__init__.py`

- Exports `DataFrameOperations` so consumers can use it for type annotations.

## Tests

- `tests/unit/test_pandas_helpers.py` — 11 isolated tests for `dataframe_to_records()` covering NaN handling, NumPy type normalization, Timestamp conversion, list/dict passthrough, and empty input.
- `tests/unit/test_dataframe_operations.py` — 35 tests covering the full `DataFrameOperations` namespace, including all new validation paths.
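As a runnable illustration, here is a standalone sketch that mirrors the fixed helper's behavior as described above (it is not the SDK module itself, just the same logic reproduced for demonstration):

```python
import numpy as np
import pandas as pd


def dataframe_to_records(df: pd.DataFrame, na_as_null: bool = False) -> list:
    """Sketch: guard pd.notna() with is_scalar(), normalize NumPy scalars for JSON."""
    records = []
    for row in df.to_dict(orient="records"):
        clean = {}
        for k, v in row.items():
            if pd.api.types.is_scalar(v):
                if pd.notna(v):
                    # Normalize values json.dumps() cannot handle
                    if isinstance(v, pd.Timestamp):
                        v = v.isoformat()
                    elif isinstance(v, np.integer):
                        v = int(v)
                    elif isinstance(v, np.floating):
                        v = float(v)
                    elif isinstance(v, np.bool_):
                        v = bool(v)
                    clean[k] = v
                elif na_as_null:
                    clean[k] = None
            else:
                clean[k] = v  # lists, dicts, arrays pass through untouched
        records.append(clean)
    return records


df = pd.DataFrame({"count": [np.int64(5)], "tags": [["a", "b"]]})
print(dataframe_to_records(df))  # [{'count': 5, 'tags': ['a', 'b']}]

df2 = pd.DataFrame({"name": ["Ada"], "phone": [None]})
print(dataframe_to_records(df2))                   # [{'name': 'Ada'}]
print(dataframe_to_records(df2, na_as_null=True))  # [{'name': 'Ada', 'phone': None}]
```

Note the guard order: `is_scalar()` runs first, so list-valued cells never reach `pd.notna()` and the ambiguous-truth-value crash cannot occur.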
…_scalar` (#146)

Adds test coverage gaps identified in the PR #98 review: direct tests for `_normalize_scalar()` and an end-to-end mocked CRUD flow for `DataFrameOperations`.

## `tests/unit/test_pandas_helpers.py`

- New `TestNormalizeScalar` class (9 tests) directly exercising `_normalize_scalar()`:
  - NumPy types (`np.integer`, `np.floating`, `np.bool_`) → Python natives
  - `pd.Timestamp` → ISO 8601 string
  - Native Python types and `None` pass through unchanged

## `tests/unit/test_dataframe_operations.py`

- New `TestDataFrameEndToEnd` class (2 tests):
  - Full mocked CRUD cycle: `create → get → update → delete`
  - Verifies NumPy types are normalized to Python-native values before reaching the API layer

## Notes

- `filter` parameter kept as-is (consistent with `records.get()` API; repo convention prohibits `# noqa` suppression)
- `DataFrameOperations` not re-exported from top-level `__init__.py` (repo convention: package `__init__.py` files use `__all__ = []`)
@tpellissier-msft update on the two followups:
Summary
Adds a client.dataframe namespace with pandas DataFrame/Series wrappers for all CRUD operations. Users can now query, create, update, and delete Dataverse records using DataFrame-native inputs and outputs -- no manual dict conversion required.

Quick Example
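The original example snippet did not survive the page capture; the sketch below reconstructs the intended call pattern from the API Design table in this description. `unittest.mock.MagicMock` stands in for a real, authenticated DataverseClient, and the table/column names ("account", "accountid") are illustrative:

```python
from unittest.mock import MagicMock

import pandas as pd

# Stand-in for an authenticated DataverseClient from the SDK
client = MagicMock()
client.dataframe.get.return_value = pd.DataFrame([{"accountid": "guid-1", "name": "Contoso"}])
client.dataframe.create.return_value = pd.Series(["guid-2"])

# Query: returns one consolidated DataFrame (all pages)
accounts = client.dataframe.get("account", select=["name"], filter="statecode eq 0")

# Create: a DataFrame of rows in, a Series of new GUIDs out
new_ids = client.dataframe.create("account", pd.DataFrame([{"name": "Fabrikam"}]))

# Update: the DataFrame must carry the ID column
client.dataframe.update(
    "account",
    pd.DataFrame([{"accountid": new_ids[0], "name": "Fabrikam Ltd"}]),
    id_column="accountid",
)

# Delete: pass a Series of GUIDs
client.dataframe.delete("account", new_ids)
```

Against a live environment the same four calls run unchanged; only the mocked client and canned return values are scaffolding.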
What's Included
New Files
- src/.../operations/dataframe.py -- DataFrameOperations class with get(), create(), update(), delete()
- src/.../utils/_pandas.py -- dataframe_to_records() helper; normalizes NumPy scalars, handles NaN/None, converts Timestamps to ISO strings
- examples/advanced/dataframe_operations.py
- tests/unit/test_dataframe_operations.py -- DataFrameOperations tests
- tests/unit/test_client_dataframe.py
- tests/unit/test_pandas_helpers.py -- tests for dataframe_to_records() and _normalize_scalar()

Modified Files

- client.py -- adds the self.dataframe = DataFrameOperations(self) namespace
- pyproject.toml -- adds pandas>=2.0.0 as a required dependency
- README.md
- operations/__init__.py

API Design
All methods live under client.dataframe and delegate to the existing client.records.* methods:

| Method | Input | Returns | Delegates to |
|---|---|---|---|
| get(table, ...) | | pd.DataFrame (all pages consolidated) | records.get() |
| get(table, record_id=...) | | pd.DataFrame | records.get() |
| create(table, df) | pd.DataFrame of records | pd.Series of GUIDs | records.create() -> CreateMultiple |
| update(table, df, id_column) | pd.DataFrame with ID column | None | records.update() -> UpdateMultiple |
| delete(table, ids) | pd.Series of GUIDs | Optional[str] (job ID) | records.delete() -> BulkDelete |

Key Design Decisions
- clear_nulls parameter on update(): by default (False), NaN/None values are skipped (field unchanged on server). Set to True to explicitly send null and clear fields.
- NumPy scalar normalization: np.int64 -> int, np.float64 -> float, pd.Timestamp -> ISO string. Prevents JSON serialization failures.
- Bulk paths delegate to CreateMultiple/UpdateMultiple. Docstrings recommend splitting very large DataFrames into smaller batches.

Validation