Feature Request / Improvement
Hi team,
I recently found that table.upsert can crash with unexpected low-level errors, such as a bus error or an illegal hardware instruction error. I tried to isolate the problem in the attached files.
How to recreate
- Run first_run.py
- Run second_run.py with the commented-out upsert:
# table.upsert(
#     df=data,
#     join_cols=['block_number', 'transaction_index', 'log_index'],
#     when_matched_update_all=True,
#     when_not_matched_insert_all=True,
#     case_sensitive=True,
# )
Note that the following works:
for rb in data.to_batches(max_chunksize=1_000):
    batch_tbl = pa.Table.from_batches([rb])
    table.upsert(
        df=batch_tbl,
        join_cols=['block_number', 'transaction_index', 'log_index'],
        when_matched_update_all=True,
        when_not_matched_insert_all=True,
        case_sensitive=True,
    )
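For anyone who needs the workaround in more than one place, the batched loop above could be wrapped in a small helper along these lines. This is only a sketch: the name upsert_in_chunks and the chunk_rows parameter are hypothetical and not part of PyIceberg, and it cuts the input with pa.Table.slice rather than to_batches, so whether this variant also avoids the crash is an assumption I have not verified.

```python
def upsert_in_chunks(table, data, join_cols, chunk_rows=1_000):
    """Upsert `data` into `table` in slices of at most `chunk_rows` rows.

    Hypothetical helper, not part of PyIceberg. `table` is expected to be a
    pyiceberg Table and `data` a pyarrow.Table; only `num_rows`, `slice` and
    `upsert` are used, so no imports are needed here.
    """
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")

    updated = 0
    inserted = 0
    for offset in range(0, data.num_rows, chunk_rows):
        # Table.slice is zero-copy and clamps the length at the end of the table.
        chunk = data.slice(offset, chunk_rows)
        result = table.upsert(
            df=chunk,
            join_cols=join_cols,
            when_matched_update_all=True,
            when_not_matched_insert_all=True,
            case_sensitive=True,
        )
        # Assumes upsert returns an UpsertResult with these two counters.
        updated += result.rows_updated
        inserted += result.rows_inserted
    return updated, inserted
```

It would be called as upsert_in_chunks(table, data, ['block_number', 'transaction_index', 'log_index']). Note that each table.upsert call commits on its own, so a failure partway through leaves the earlier chunks already written.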
Versions
Pyiceberg version: 0.9.1
Pyarrow: 20.0.0 (Also tried with 18.0.0, 17.0.0)
Hardware: Apple M2
Additional context
The same issue seems to have been mentioned here.
Thank you in advance! 😊
first.zip
second.zip
scripts.zip