
Suggestion: get raw OCR text for non-table content #216


Open
gonarguello opened this issue Sep 12, 2024 · 0 comments

Comments

@gonarguello

It is often really useful to have all the text in the document that does not belong to a table, so that it can be processed further.
Maybe, in the same way that the lib extracts 'title', it could also extract 'footer'.
Or it could just put all the OCR text that is not part of a table in another attribute, accessible through the 'table' object.

Example:
When processing an invoice, the 'invoice items' would come in a 'table', and everything else would come in 'title' and 'footer' objects, allowing further (manual) processing of important fields such as the date, invoice number, account numbers, etc.
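
To make the request more concrete, here is a rough sketch of the filtering I have in mind. It is not based on the library's internals: `Box`, `Word` and `non_table_text` are hypothetical names made up for this illustration, and it assumes the OCR step yields words with pixel bounding boxes and that each detected table has a bounding box too. A word counts as table content when its center falls inside a table box.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

    def contains_center_of(self, other: "Box") -> bool:
        # Treat `other` as inside this box when its center point lies within it.
        cx = (other.x1 + other.x2) / 2
        cy = (other.y1 + other.y2) / 2
        return self.x1 <= cx <= self.x2 and self.y1 <= cy <= self.y2


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR word: recognized text plus its bounding box."""
    text: str
    box: Box


def non_table_text(words: List[Word], table_boxes: List[Box]) -> str:
    """Join, in rough reading order, every OCR word that lies outside all tables."""
    outside = [
        w for w in words
        if not any(t.contains_center_of(w.box) for t in table_boxes)
    ]
    # Approximate reading order: top to bottom, then left to right.
    outside.sort(key=lambda w: (w.box.y1, w.box.x1))
    return " ".join(w.text for w in outside)


# Made-up invoice layout: a header line, one table row, and a footer word.
words = [
    Word("Invoice", Box(10, 10, 80, 30)),
    Word("#1234", Box(90, 10, 150, 30)),
    Word("Widget", Box(20, 110, 90, 130)),
    Word("9.99", Box(200, 110, 240, 130)),
    Word("IBAN", Box(10, 310, 60, 330)),
]
tables = [Box(0, 100, 300, 300)]

print(non_table_text(words, tables))  # Invoice #1234 IBAN
```

Splitting the leftover words into 'title' and 'footer' could then be done by comparing each word's position with the top and bottom edges of the table, but even the flat string would already cover my use case.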
