streaming create_final_documents #2243

Merged
dayesouza merged 3 commits into main from create_final_documents
Feb 23, 2026

@dayesouza
Contributor

Streaming Performance Improvement (with .txt):

run time: -60.70%
peak memory: -31.38%
memory delta: -11.58%

This pull request refactors the create_final_documents workflow to use streaming table reads and writes instead of loading entire dataframes into memory. The main goal is to make the workflow more efficient and scalable by processing data in a streaming fashion. The changes also simplify the logic by removing the dependency on pandas and the DataReader class.

Refactor to streaming table processing:

  • Rewrote the run_workflow function in create_final_documents.py to use asynchronous context managers for opening tables and removed pandas DataFrame operations in favor of streaming row-by-row processing.
  • Implemented a new create_final_documents async function that builds a mapping from text units to documents and enriches each document row with its associated text unit IDs as it streams through the data, writing results directly to the output table.
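The two-pass streaming approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `stream_rows`, `enrich_documents`, and `run_demo` are hypothetical names, the in-memory lists stand in for the real table abstractions, and the column names (`id`, `document_id`, `text_unit_ids`) are assumptions. The mapping is keyed by document ID here for convenience; the actual implementation may organize it differently.

```python
import asyncio
from collections import defaultdict
from typing import Any, AsyncIterator, Awaitable, Callable

Row = dict[str, Any]


async def stream_rows(rows: list[Row]) -> AsyncIterator[Row]:
    """Hypothetical stand-in for a streaming table read: yields one row at a time."""
    for row in rows:
        await asyncio.sleep(0)  # yield control, as real async I/O would
        yield row


async def enrich_documents(
    documents: AsyncIterator[Row],
    text_units: AsyncIterator[Row],
    write_row: Callable[[Row], Awaitable[None]],
) -> int:
    """Attach text unit IDs to each document row and write it out immediately."""
    # Pass 1: build a lookup of document id -> text unit ids.
    # Only IDs are held in memory, not full rows (assumed column names).
    text_unit_ids_by_document: dict[str, list[str]] = defaultdict(list)
    async for unit in text_units:
        text_unit_ids_by_document[unit["document_id"]].append(unit["id"])

    # Pass 2: stream documents, enrich each row, and write it straight away
    # instead of collecting everything into a dataframe first.
    written = 0
    async for doc in documents:
        ids = list(text_unit_ids_by_document.get(doc["id"], []))
        await write_row({**doc, "text_unit_ids": ids})
        written += 1
    return written


async def run_demo() -> list[Row]:
    """Run the sketch against small in-memory sample data."""
    documents = [{"id": "d1", "title": "a.txt"}, {"id": "d2", "title": "b.txt"}]
    text_units = [
        {"id": "t1", "document_id": "d1"},
        {"id": "t2", "document_id": "d1"},
        {"id": "t3", "document_id": "d2"},
    ]
    output: list[Row] = []

    async def write_row(row: Row) -> None:
        # Stand-in for a streaming table write.
        output.append(row)

    await enrich_documents(stream_rows(documents), stream_rows(text_units), write_row)
    return output


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In this sketch the only state kept across rows is the ID lookup, which is the kind of design that would plausibly account for a lower peak memory than loading both tables as full dataframes.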

Dependency and import changes:

  • Removed the import and usage of pandas and DataReader, and added imports for the new row transformer and table abstractions.

@dayesouza dayesouza requested a review from a team as a code owner February 23, 2026 13:36
@dayesouza dayesouza merged commit 1cedb79 into main Feb 23, 2026
18 checks passed
@dayesouza dayesouza deleted the create_final_documents branch February 23, 2026 17:16
