-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract Amazon image URLs from the page's embedded JavaScript #86
Conversation
d38b85e
to
0b10180
Compare
0b10180
to
98699b6
Compare
Codecov Report
@@ Coverage Diff @@
## main #86 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 27 28 +1
Lines 510 554 +44
Branches 91 104 +13
=========================================
+ Hits 510 554 +44
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good, just some minor nitpick.
However, I've noticed a bunch of try { JSON.parse(...) as ... } catch (err) { ... }
patterns dotted around the codebase, it might make sense to put a safeJsonParse
helper somewhere in the utils instead, which wraps the try..catch
and returns undefined
on error, and also uses a type parameter instead of needing to cast all the time. We could then use optional chaining or simple if (typeof ... === 'undefined')
checks instead. I'd be happy to amend that refactoring to this PR myself, IIRC there's also similar parsing going on elsewhere.
I'm also not 100% sure what we should do with the coverage, since there's a couple error handlers that aren't covered. I think a safeJsonParse
could eliminate some of them, for others we can probably just make some very small test cases to exercise those (perhaps feed in a garbage string).
Go for it, I won't be able to work on this before tomorrow (ok... today) afternoon. |
LGTM. Feel free to merge if you're okay with my changes. Thanks for your work on this! |
Thanks for the refactoring and the tests. |
Take your time, I'm in favour of postponing the merge until after #115 to check whether the new workflows work properly 😄 |
This allows greater flexibility than an already parsed DOM document.
Using the thumbnails which are shown on the page has two disadvantages: - Pages show only a limited number of thumbnails, not all of them. - Maximised versions of these might still not be the largest available. Unfortunately the most reliable way (so far) to get all images in their highest resolution, is to extract an object from the page's embedded JavaScript. The advantage of this approach is that we can also extract artwork types from there now (at least 'Front' and 'Back', maybe others).
Move extraction of embedded JS into its own method. Catch and log JSON parsing errors and fallback to use the thumbnails in the sidebar (DOM).
Move conversion from Amazon variant into MB type into its own function. Also try to extract the variants of thumbnails from a JSON data attribute. Now it is no longer necessary to assume that the first image is the front. In addition to MAIN and BACK, FRNT and SIDE likely also match MB types.
The example release has 5 images, but only 4 thumbnails are shown on amazon.com by default. Using the embedded JavaScript we can extract larger images whose filenames contain different identifiers. Additionally we can also detect the type of the front and the back. Still testing the thumbnail grabbing (which is now only a fallback) by mocking the failed attempt of extracting images from the embedded JS.
This edge case was already accounted for previously, but I accidentally removed it during refactoring.
Generalise the streaming product extractor to a front cover extractor. It can also handle the front cover of Audible audiobooks (on Amazon) now. For physical audiobooks (and books) we need yet another embedded JS extractor to get all images in the highest available resolution.
8e45cac
to
ebf6925
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any other variant of Amazon pages pops up, we might need to look into splitting it into multiple classes 😅
Extract Amazon image URLs from the page's embedded JavaScript (#86)
🚀 Released 1 new userscript version(s):
|
Using the thumbnails which are shown on the page has two disadvantages:
Unfortunately the most reliable way (so far) to get all images in their highest resolution, is to extract an object from the page's embedded JavaScript. The advantage of this approach is that we can also extract artwork types from there now (at least 'Front' and 'Back', maybe others).
Closes #85.
Depends on #82 (
and needs to be rebased).P.S. This is a non-issue for digital media releases, so I have not touched that code. And I kept the old thumbnail extraction code as a fallback (in case the regex does not find a match in the page's source).