
Conversation

@davidsbatista (Contributor) commented Feb 6, 2026

Related Issues

Proposed Changes:

The LLMDocumentContentExtractor now extracts both content and metadata using the same prompt.

Response handling:
- If the LLM returns a plain string it is written to the document's content.
- If the LLM returns a JSON object with only the key document_content, that value is written to content.
- If the LLM returns a JSON object with multiple keys, the value of document_content (if present) is
written to content, and all other keys are merged into the document's metadata.

The ChatGenerator should be configured to return JSON (e.g. response_format={"type": "json_object"} in generation_kwargs).
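A minimal usage sketch of that setup (the import path, model name, and the file_path meta field are assumptions based on current Haystack conventions, not part of this diff):

```python
from haystack import Document
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

# Ask the generator for raw JSON so that metadata keys can be separated
# from the reserved "document_content" key.
chat_generator = OpenAIChatGenerator(
    model="gpt-4o-mini",
    generation_kwargs={"response_format": {"type": "json_object"}},
)
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)

# Each input document points at the image (or PDF page) to extract from via its
# metadata; the extracted text and any extra metadata are written back to it.
docs = [Document(content="", meta={"file_path": "invoice_scan.pdf"})]
result = extractor.run(documents=docs)
```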

How did you test it?

  • new tests for metadata mode and runtime override
  • new live test for metadata extraction

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

vercel bot commented Feb 6, 2026

The latest updates on your projects.

1 Skipped Deployment
Project         Deployment    Actions    Updated (UTC)
haystack-docs   Ignored       Preview    Feb 11, 2026 2:33pm


github-actions bot added the topic:tests and type:documentation labels Feb 6, 2026
@coveralls (Collaborator) commented Feb 6, 2026

Pull Request Test Coverage Report for Build 21909308680

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 52 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.09%) to 92.649%

Files with Coverage Reduction                     New Missed Lines        %
core/component/component.py                                      1   99.46%
utils/jinja2_chat_extension.py                                   1   99.07%
dataclasses/chat_message.py                                      4   98.83%
components/agents/agent.py                                      10   96.69%
human_in_the_loop/strategies.py                                 13   90.51%
components/generators/chat/openai_responses.py                  23   87.66%
Totals
Change from base Build 21820332340: 0.09%
Covered Lines: 15213
Relevant Lines: 16420

💛 - Coveralls

davidsbatista changed the title from "extending original LLMDocumentContentExtractor to extract metadata" to "feat: extending LLMDocumentContentExtractor allowing metadata extraction" Feb 9, 2026
davidsbatista marked this pull request as ready for review February 9, 2026 15:09
davidsbatista requested a review from a team as a code owner February 9, 2026 15:10
davidsbatista requested review from julian-risch and sjrl and removed request for a team and julian-risch February 9, 2026 15:10
@sjrl (Contributor) commented Feb 10, 2026

@davidsbatista thanks for working on this!

A few high-level comments:

  • We would like to enable content and metadata extraction at the same time instead of having to choose one. I think we could do this by expanding extraction_mode to include a both option as well. And we can leave the default to content as you have it.
  • For the metadata extraction I think we should follow the same behavior as the LLMMetadataExtractor and not store the extracted dict in a metadata_field, but put it directly into the Document's metadata.

Also I noticed something that is potentially undesirable. I could foresee that users would want to specify a response format of json_schema for the metadata extraction, like we recommend in the LLMMetadataExtractor. However, if we did that and had extraction_mode='both' enabled, this wouldn't work, since then we'd want to call the chat_generator without the response format. Any ideas on how we could get around this?
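For reference, the json_schema variant under discussion would be passed through generation_kwargs roughly like this; the schema fields below are purely illustrative, not something from this PR:

```python
# Passed to the chat generator at init or run time; the schema is made up for illustration.
generation_kwargs = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "metadata_extraction",
            "schema": {
                "type": "object",
                "properties": {
                    "document_type": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["document_type", "date"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    }
}
```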

@sjrl (Contributor) commented Feb 10, 2026

> Also I noticed something that is potentially undesirable. I could foresee that users would want to specify a response format of json_schema for the metadata extraction, like we recommend in the LLMMetadataExtractor. However, if we did that and had extraction_mode='both' enabled, this wouldn't work, since then we'd want to call the chat_generator without the response format. Any ideas on how we could get around this?

@davidsbatista maybe an idea to tackle this would be: if the output of the LLM is a dictionary, we check whether one of the keys is called content and put that into the content field of the doc and the rest into metadata. Then if the output is only a string, we could put it directly into content. WDYT?

@davidsbatista (Contributor Author)

> Also I noticed something that is potentially undesirable. I could foresee that users would want to specify a response format of json_schema for the metadata extraction, like we recommend in the LLMMetadataExtractor. However, if we did that and had extraction_mode='both' enabled, this wouldn't work, since then we'd want to call the chat_generator without the response format. Any ideas on how we could get around this?

> @davidsbatista maybe an idea to tackle this would be: if the output of the LLM is a dictionary, we check whether one of the keys is called content and put that into the content field of the doc and the rest into metadata. Then if the output is only a string, we could put it directly into content. WDYT?

I think we either assume the LLM in this component will always reply in JSON (with a 'content' key and all other keys for metadata), or we allow changing the config of the LLM at runtime. The latter might make things a bit more complicated, and not all LLMs support that as far as I know.

I would go for your suggestion.

@sjrl (Contributor) commented Feb 11, 2026

@davidsbatista thanks for making the changes we discussed! I think we can make a simplification now that we expect a JSON response and fall back to content-only extraction if a dict is not returned.

The idea would be to only accept one prompt now and completely drop the extraction mode. If the object returned by the chat generator has multiple keys, then we can auto-populate the metadata and content. If it only has the document_content key or returns a plain string, then we put it just in the content of the Document.

So I would:

  • remove extraction_mode, metadata_prompt
  • just run the prompt on each Document
  • check if the returned response can be loaded with json.loads and then process it differently depending on whether it's a dict or a plain string (see the sketch below)
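A minimal sketch of that handling (the function name and Document usage are illustrative, not lifted from the PR):

```python
import json

from haystack import Document

DOCUMENT_CONTENT_KEY = "document_content"  # reserved key for the main text


def apply_llm_reply(reply: str, doc: Document) -> None:
    """Write an LLM reply into a Document: content plus any extra metadata keys."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, dict):
        # JSON object: "document_content" goes into content, all other keys into meta.
        doc.content = parsed.pop(DOCUMENT_CONTENT_KEY, doc.content)
        doc.meta.update(parsed)
    else:
        # Plain string (or non-dict JSON): treat the whole reply as the content.
        doc.content = reply
```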

davidsbatista requested a review from sjrl February 11, 2026 12:40
Comment on lines +52 to +62
Return a single JSON object. It must contain the key "document_content" with the extracted text as value.

Include all other extracted information as keys for metadata. All metadata should be returned as separate keys in the
JSON object. For example, if you extract the document type and date, you should return:

{"title": "Example Document", "author": "John Doe", "date": "2024-01-15", "document_type": "invoice"}

Don't include any metadata in the "document_content" field. The "document_content" field should only contain the
image description and any possible text extracted from the image.

No markdown, no code fence, only raw JSON.
Contributor

I think we should probably leave the default prompt to just return a string and not extract any metadata.

Simply because it's hard to guess what kind of metadata users would like to extract, and we increase the risk of something going wrong with the JSON parsing, since the user may not realize they should pass in a chat generator with structured output set up. WDYT?

Contributor

I think we could instead just add an example in the docstring below or in the docs page for this component how we'd recommend calling it to also extract metadata.

Contributor Author

I will then shorten this prompt and instead use the example for the docs and maybe the tests.
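Something along these lines, maybe (the prompt wording and the prompt parameter name are only a sketch of the recommended call, not the final docs example):

```python
extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,  # configured to return JSON, e.g. response_format json_object
    prompt=(
        "Describe the image and extract any text it contains. Return a single JSON object. "
        'Put the extracted text under the key "document_content" and add any other '
        "fields you can identify (e.g. title, author, date) as separate keys. "
        "Return raw JSON only, no markdown, no code fences."
    ),
)
```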



# Reserved key in the LLM JSON response that holds the main document text.
DOCUMENT_CONTENT_KEY = "document_content"
Contributor

I think it would make more sense if the key we expect from the returned dict were just content, right?

Contributor Author

I named it like this, deliberately less straightforward, so it doesn't conflict with any potential metadata field name.

---
enhancements:
- |
The ``LLMDocumentContentExtractor`` now also supports metadata extraction. It can run in three modes: "content", "metadata", or "both". Extracting only content, metadata, or both for a given Document.
Contributor

Let's update the reno to reflect your recent changes, like no longer having a mode.

davidsbatista and others added 2 commits February 11, 2026 15:33
…actor.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…actor.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
