feat: extending LLMDocumentContentExtractor allowing metadata extraction
#10523
Conversation
@davidsbatista thanks for working on this! A few high-level comments:
Also I noticed something that is potentially undesirable. I could foresee that users would want to specify a response format of
@davidsbatista maybe an idea to tackle this would be: if the output of the LLM is a dictionary then we check if one is called
I think we either assume the LLM in this component will always reply in JSON (with a 'content' key and all other keys for metadata), or we allow changing the config of the LLM at runtime - this might make things a bit more complicated; not all LLMs support that as far as I know. I would go for your suggestion.
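The first option discussed above can be sketched as a small parsing helper. This is a hypothetical illustration of the idea only: the function name `parse_llm_reply` and the `'content'` key convention are assumptions from this thread, not the component's actual code.

```python
import json


def parse_llm_reply(reply: str) -> tuple[str, dict]:
    """Split an LLM reply into (content, metadata).

    Hypothetical sketch: if the reply is a JSON object, the 'content'
    key becomes the document content and all other keys become metadata;
    otherwise the raw string is treated as content with no metadata.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Not JSON at all: fall back to content-only extraction.
        return reply, {}
    if not isinstance(data, dict):
        # JSON, but not an object (e.g. a list or number): treat as plain text.
        return reply, {}
    meta = dict(data)  # copy so the parsed dict is not mutated
    content = meta.pop("content", "")
    return content, meta
```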
@davidsbatista thanks for making the changes we discussed! I think we can make a simplification now that we expect a JSON response and fall back to content-only extraction if a dict is not returned. The idea would be to only accept one prompt now and completely drop the extraction mode. If the object returned by the chat generator has multiple keys then we can auto-populate the metadata and content. If it only has the
So I would:
From the suggested default prompt:

    Return a single JSON object. It must contain the key "document_content" with the extracted text as value.
    Include all other extracted information as keys for metadata. All metadata should be returned as separate keys in the
    JSON object. For example, if you extract the document type and date, you should return:
    {"title": "Example Document", "author": "John Doe", "date": "2024-01-15", "document_type": "invoice"}
    Don't include any metadata in the "document_content" field. The "document_content" field should only contain the
    image description and any possible text extracted from the image.
    No markdown, no code fence, only raw JSON.
I think we should probably leave the default prompt to just return a string and not extract any metadata.
Simply because it's hard to guess/assume what kind of metadata users would like to extract, and we increase the risk of something going wrong with the JSON parsing, since the user may not realize they should pass in a chat generator with structured output set up. WDYT?
I think we could instead just add an example in the docstring below, or in the docs page for this component, showing how we'd recommend calling it to also extract metadata.
I will then shorten this prompt and instead use the example in the docs and maybe in tests.
From the diff:

    # Reserved key in the LLM JSON response that holds the main document text.
    DOCUMENT_CONTENT_KEY = "document_content"
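The reserved key would then be popped out when splitting the JSON response, roughly like the hypothetical helper below (`split_response` is an assumed name for illustration, not the component's actual method):

```python
# Reserved key in the LLM JSON response that holds the main document text.
DOCUMENT_CONTENT_KEY = "document_content"


def split_response(data: dict) -> tuple[str, dict]:
    """Separate the reserved content key from the metadata keys.

    Hypothetical sketch: everything except DOCUMENT_CONTENT_KEY is
    treated as metadata to merge into the document's meta.
    """
    meta = dict(data)  # copy so the caller's dict is not mutated
    content = meta.pop(DOCUMENT_CONTENT_KEY, "")
    return content, meta
```

Keeping the key deliberately verbose, as discussed below, reduces the chance that a real metadata field extracted by the LLM (e.g. a generic `content` field) collides with it.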
I think it would make more sense if the key we expect from the returned dict would just be content, right?
I named it like this - less straightforward - so it doesn't conflict with any potential metadata field name
From the release note (reno) file:

    ---
    enhancements:
      - |
        The ``LLMDocumentContentExtractor`` now also supports metadata extraction. It can run in three modes: "content", "metadata", or "both", extracting only the content, only the metadata, or both for a given Document.
Let's update the reno to reflect your recent changes, like no longer having a mode.
…actor.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…actor.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Related Issues
Proposed Changes:
The LLMDocumentContentExtractor now extracts both content and metadata using the same prompt.

Response handling:
- If the LLM returns a plain string, it is written to the document's content.
- If the LLM returns a JSON object with only the key document_content, that value is written to content.
- If the LLM returns a JSON object with multiple keys, the value of document_content (if present) is written to content, and all other keys are merged into the document's metadata.
The ChatGenerator should be configured to return JSON (e.g. response_format={"type": "json_object"} in generation_kwargs).

How did you test it?
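As a sketch of that configuration (assumptions: an OpenAI-style chat generator is used and the chosen model supports the `response_format` option; the commented-out class name is illustrative):

```python
# generation_kwargs that ask an OpenAI-style model for a raw JSON object
# (assumption: the chat generator / model in use supports response_format).
generation_kwargs = {"response_format": {"type": "json_object"}}

# Hypothetical wiring; with haystack-ai installed this would be along the lines of:
# from haystack.components.generators.chat import OpenAIChatGenerator
# chat_generator = OpenAIChatGenerator(generation_kwargs=generation_kwargs)
```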
Checklist
I have used one of the conventional commit prefixes for the PR title (fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:) and added ! in case the PR includes breaking changes.