Give users the ability to download the text of a document as a txt file #2312

Open
Rosencrantz opened this issue Jun 7, 2022 · 5 comments
Labels
feature-request: Requests for new features or enhancements of existing features
good first issue: Issues that are well suited for first-time contributors
ui: Issues related to Aleph’s frontend

Comments

@Rosencrantz
Contributor

When reviewing a document in Aleph, we provide the ability to download a PDF of the document. It would be good to also be able to download the textual content of that file as a text document.

This document would only include text, no images or formatting. Just the text.

Rosencrantz added the enhancement and ui labels on Jun 7, 2022
@tillprochaska
Contributor

It would be possible to retrieve the text content for all pages via the entities API endpoint and assemble a text file that’s downloaded on the client side.

https://data.occrp.org/api/2/entities?filter:properties.document={{DOCUMENT_ID}}&filter:schema=Page

@sunu I am not sure, though, how well that endpoint would perform for documents with lots of pages, or whether this might need to be done on the server side.

Also, there might be some value in exposing this feature via the API as well?
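
For illustration, the assembly step described above could be sketched roughly like this in Python. This is only a sketch under assumptions, not how Aleph implements anything: it assumes the endpoint returns JSON with a `results` list and an optional `next` URL for pagination, that each Page entity carries its page number in `properties.index` and its text in `properties.bodyText`, and that an API key is sent as an `ApiKey` authorization header. The function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Host taken from the URL in the comment above.
API_URL = "https://data.occrp.org/api/2/entities"


def fetch_pages(document_id, api_key=None, limit=50):
    """Yield Page entities for one document, following pagination.

    Assumption: the response is JSON with a "results" list and an
    optional "next" URL. Adjust if the real response differs.
    """
    query = urllib.parse.urlencode({
        "filter:properties.document": document_id,
        "filter:schema": "Page",
        "limit": limit,
    })
    url = f"{API_URL}?{query}"
    while url:
        request = urllib.request.Request(url)
        if api_key:
            # Assumption: API keys are passed as "ApiKey <key>".
            request.add_header("Authorization", f"ApiKey {api_key}")
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        yield from payload.get("results", [])
        url = payload.get("next")


def page_index(page):
    """Return the page number of a Page entity, or 0 if it is missing."""
    values = page.get("properties", {}).get("index", [])
    try:
        return int(values[0])
    except (IndexError, ValueError, TypeError):
        return 0


def assemble_text(pages):
    """Concatenate the body text of all pages in page order."""
    chunks = []
    for page in sorted(pages, key=page_index):
        # Assumption: property values are lists of strings.
        body = page.get("properties", {}).get("bodyText", [])
        chunks.append("\n".join(body))
    return "\n\n".join(chunks)
```

Calling `assemble_text(fetch_pages(document_id))` would then give the plain text to offer as a download. For documents with many pages this issues one request per batch of `limit` pages, which is the performance concern raised above.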

@brrttwrks

@tillprochaska I have a Python script that I use for this very purpose. However, we get asked this a lot. Maybe it makes sense to also generate a single text file when ingesting the document that can then be called. Though that would add a lot of overhead up front and we don't get asked this enough to justify blowing up our storage. Doing this on demand would be more sane. Maybe a job like exports?

@sunu
Contributor

sunu commented Jun 9, 2022

Yes, I would prefer doing it on the backend as an export as well. Ideally, we should cache the combined text for reuse. But I would be ok with skipping that in the first iteration.

What should the url endpoint look like for this? Something like documents/<id>/textexport?
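
As a rough sketch of the "generate on demand, cache for reuse" idea, something like the following could sit behind whatever endpoint is chosen. It is not based on Aleph's code: `export_text`, the `load_page_texts` callable and the file-based cache directory are all hypothetical names introduced for this example, and a real implementation would presumably use Aleph's own storage and export job machinery.

```python
import hashlib
from pathlib import Path


def export_text(document_id, load_page_texts, cache_dir):
    """Return the combined text for a document, building it at most once.

    load_page_texts is a hypothetical callable that takes a document ID
    and returns the page texts in page order. cache_dir is a hypothetical
    directory used as a simple file cache.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the ID so that arbitrary IDs map to safe file names.
    key = hashlib.sha1(document_id.encode("utf-8")).hexdigest()
    cached = cache_dir / f"{key}.txt"

    if cached.exists():
        # Cache hit: reuse the text assembled by an earlier request.
        return cached.read_text(encoding="utf-8")

    # Cache miss: assemble the text on demand and store it for reuse.
    text = "\n\n".join(load_page_texts(document_id))
    cached.write_text(text, encoding="utf-8")
    return text
```

Skipping the cache in a first iteration would simply mean returning the joined text directly. Note that this sketch never invalidates the cache, so it would serve stale text if a document were re-ingested.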

Rosencrantz added the good first issue label on Jun 10, 2022
@tillprochaska
Contributor

tillprochaska commented Jun 16, 2022

> What should the url endpoint look like for this? Something like documents/<id>/textexport?

@sunu Not sure if that question was directed at me, but from an API consumer perspective, I’d have expected it to be part of the archive endpoint, as getting the original document vs. getting the text content for that document is similar.

Although I can see that that doesn’t make a lot of sense from a technical/implementation perspective, as loading and concatenating text for a document is different from simply returning a file from storage.

@tillprochaska
Contributor

tillprochaska commented Jun 16, 2022

> @tillprochaska I have a Python script that I use for this very purpose. However, we get asked this a lot.

@brrttwrks Sorry, when I said "client side" I was referring to implementing it in the front end/browser (without changing the backend) and not to the Aleph CLI -- didn’t mean to say that you should keep doing the current workarounds! :)

tillprochaska added the feature-request label and removed the enhancement label on Oct 18, 2022
Rosencrantz moved this to 🏷️ Triage in Aleph on Nov 16, 2022
4 participants