feat: extending LLMDocumentContentExtractor allowing metadata extraction
#10523
Conversation
@davidsbatista thanks for working on this! A few high-level comments:
Also I noticed something that is potentially undesirable. I could foresee that users would want to specify a response format of
@davidsbatista maybe an idea to tackle this would be: if the output of the LLM is a dictionary then we check if one is called
I think we either assume the LLM in this component will always reply in JSON (with a 'content' key and all other keys for metadata), or we allow changing the config of the LLM at runtime - this might make things a bit more complicated; not all LLMs support that as far as I know. I would go for your suggestion.
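The first option discussed above can be sketched as a small parsing helper. This is a hypothetical illustration of the idea only: the function name `parse_llm_reply` and the `'content'` key convention are assumptions from this thread, not the component's actual code.

```python
import json


def parse_llm_reply(reply: str) -> tuple[str, dict]:
    """Split an LLM reply into (content, metadata).

    Hypothetical sketch: if the reply is a JSON object, the 'content'
    key becomes the document content and all other keys become metadata;
    otherwise the raw string is treated as content with no metadata.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Not JSON at all: fall back to content-only extraction.
        return reply, {}
    if not isinstance(data, dict):
        # JSON, but not an object (e.g. a list or number): treat as plain text.
        return reply, {}
    meta = dict(data)  # copy so the parsed dict is not mutated
    content = meta.pop("content", "")
    return content, meta
```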
@davidsbatista thanks for making the changes we discussed! I think we can make a simplification now that we expect a JSON response and fall back to content-only extraction if a dict is not returned. The idea would be to only accept one prompt now and completely drop the extraction mode. If the object returned by the chat generator has multiple keys then we can auto-populate the metadata and content. If it only has the
So I would:
From the suggested default prompt:

    Return a single JSON object. It must contain the key "document_content" with the extracted text as value.
    Include all other extracted information as keys for metadata. All metadata should be returned as separate keys in the
    JSON object. For example, if you extract the document type and date, you should return:
    {"title": "Example Document", "author": "John Doe", "date": "2024-01-15", "document_type": "invoice"}
    Don't include any metadata in the "document_content" field. The "document_content" field should only contain the
    image description and any possible text extracted from the image.
    No markdown, no code fence, only raw JSON.
I think we should probably leave the default prompt to just return a string and not extract any metadata.
Simply because it's hard to guess/assume what kind of metadata users would like to extract, and we increase the risk of something going wrong with the JSON parsing, since the user may not realize they should pass in a chat generator with structured output set up. WDYT?
I think we could instead just add an example in the docstring below, or in the docs page for this component, showing how we'd recommend calling it to also extract metadata.
I will then shorten this prompt and instead use the example in the docs and maybe in tests.
From the diff:

    # Reserved key in the LLM JSON response that holds the main document text.
    DOCUMENT_CONTENT_KEY = "document_content"
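The reserved key would then be popped out when splitting the JSON response, roughly like the hypothetical helper below (`split_response` is an assumed name for illustration, not the component's actual method):

```python
# Reserved key in the LLM JSON response that holds the main document text.
DOCUMENT_CONTENT_KEY = "document_content"


def split_response(data: dict) -> tuple[str, dict]:
    """Separate the reserved content key from the metadata keys.

    Hypothetical sketch: everything except DOCUMENT_CONTENT_KEY is
    treated as metadata to merge into the document's meta.
    """
    meta = dict(data)  # copy so the caller's dict is not mutated
    content = meta.pop(DOCUMENT_CONTENT_KEY, "")
    return content, meta
```

Keeping the key deliberately verbose, as discussed below, reduces the chance that a real metadata field extracted by the LLM (e.g. a generic `content` field) collides with it.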
I think it would make more sense if the key we expect from the returned dict would just be content, right?
I named it like this - less straightforward - so it doesn't conflict with any potential metadata field name
From the release note (reno) file:

    ---
    enhancements:
      - |
        The ``LLMDocumentContentExtractor`` now also supports metadata extraction. It can run in three modes: "content", "metadata", or "both", extracting only the content, only the metadata, or both for a given Document.
Let's update the reno to reflect your recent changes, like no longer having a mode.
…actor.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…actor.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Related Issues
Proposed Changes:
The LLMDocumentContentExtractor now extracts both content and metadata using the same prompt.

Response handling:
- If the LLM returns a plain string, it is written to the document's content.
- If the LLM returns a JSON object with only the key document_content, that value is written to content.
- If the LLM returns a JSON object with multiple keys, the value of document_content (if present) is written to content, and all other keys are merged into the document's metadata.
The ChatGenerator should be configured to return JSON (e.g. response_format={"type": "json_object"} in generation_kwargs).

How did you test it?
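As a sketch of that configuration (assumptions: an OpenAI-style chat generator is used and the chosen model supports the `response_format` option; the commented-out class name is illustrative):

```python
# generation_kwargs that ask an OpenAI-style model for a raw JSON object
# (assumption: the chat generator / model in use supports response_format).
generation_kwargs = {"response_format": {"type": "json_object"}}

# Hypothetical wiring; with haystack-ai installed this would be along the lines of:
# from haystack.components.generators.chat import OpenAIChatGenerator
# chat_generator = OpenAIChatGenerator(generation_kwargs=generation_kwargs)
```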
Checklist
I have used one of the conventional commit prefixes for the PR title (fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:) and added ! in case the PR includes breaking changes.