Fix data loss when converting pandas Timedelta from replace() to Table #49238
Open
shashbha14 wants to merge 9 commits into apache:main from shashbha14:gh-49222-timedelta-from-replace
+487
−51
…able function
- Add errors parameter to cast() function with 'raise' (default) and 'coerce' options
- errors='coerce' converts invalid values to null instead of raising errors
- Add errors parameter to Array.cast(), Scalar.cast(), and ChunkedArray.cast() instance methods
- Verify is_castable() function is properly exposed and working
- Add comprehensive tests including the exact example from issue apache#48972
- Update documentation with examples showing errors='coerce' usage
This addresses issue apache#48972 by providing pandas.to_numeric(errors='coerce') equivalent functionality in PyArrow.

…ma is provided
When reading JSON with explicit schema, the parser now attempts to convert values to match the schema type before erroring. This allows JSON files with inconsistent types (e.g., number and string for the same field) to be read successfully when an explicit schema is provided.
Changes:
- Store explicit_schema in HandlerBase for access during parsing
- Modified AppendScalar to check for conversion before erroring
- Added TryConvertAndAppend helper function to handle conversions
- Updated Bool handler to also support conversion
- Added tests for number->string and string->number conversions
Supported conversions:
- Number <-> String (when numeric)
- Boolean <-> String
- Boolean <-> Number
- Number -> Boolean (0=false, non-zero=true)
Fixes apache#49158

…v for Apple Clang 14.0.0 compatibility
This fixes the CRAN build failure on macOS 13.3 with Apple Clang 14.0.0, which doesn't fully support the C++20 std::floating_point concept. The change replaces std::floating_point<T> with std::is_floating_point_v<T> in the CFloatingPointConcept definition, maintaining the same functionality while ensuring compatibility with older compilers.
Fixes apache#49176

… from replace
Adds a regression test for issue apache#49222 and adjusts _from_pydict to box lists of pandas Timedelta/Timestamp into a pandas Series so that pa.array uses the pandas-aware conversion path.
Thanks for opening a pull request!
If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format? or See also:
Fixes #49222

When you create a Table from a dict that contains pandas `Timedelta` objects made with `Timestamp.replace()`, the values were lost and showed up as 0:00:00 instead of the actual duration.

The problem was that `from_pydict` treated lists of pandas `Timedelta`/`Timestamp` objects as plain Python lists, so it did not use the pandas-aware conversion path that handles these types correctly.

I fixed it by detecting when a list contains pandas temporals and wrapping it in a pandas Series before conversion. That way `pa.array()` uses the pandas conversion logic and preserves the values.

Added a test that reproduces the exact issue from the bug report. CI should verify everything works.

Note: I couldn't run the tests locally because my C++ build is currently broken from unrelated work, but this is a small Python-only change, so CI should cover it.
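
For reviewers, here is a minimal sketch of the boxing step described above. It is an illustration under stated assumptions and not the code in this diff: the helper name `box_pandas_temporals` is hypothetical, and the real change lives inside `_from_pydict`. The sketch only shows the detection-and-wrap idea; the pandas import is guarded so the snippet still loads where pandas is not installed.

```python
from typing import Any

try:
    import pandas as pd  # optional for this sketch
except ImportError:
    pd = None


def box_pandas_temporals(values: Any) -> Any:
    """Hypothetical helper (name is an assumption, not from the diff).

    If `values` is a list or tuple that contains at least one pandas
    Timedelta or Timestamp, return it wrapped in a pandas Series so that
    a later array conversion can take the pandas-aware path. Anything
    else is returned unchanged.
    """
    if pd is None or not isinstance(values, (list, tuple)):
        return values
    if any(isinstance(v, (pd.Timedelta, pd.Timestamp)) for v in values):
        return pd.Series(values)
    return values
```

A caller would then pass the result to the array constructor, for example `pa.array(box_pandas_temporals(column))`. Plain Python lists skip the wrapping entirely, so the sketch leaves columns without pandas temporals untouched.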