Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Amazon image URLs from the page's embedded JavaScript #86

Merged
merged 15 commits into from
Oct 19, 2021

Conversation

kellnerd
Copy link
Collaborator

@kellnerd kellnerd commented Oct 17, 2021

Using the thumbnails which are shown on the page has two disadvantages:

  • Pages show only a limited number of thumbnails, not all of them.
  • Maximised versions of these might still not be the largest available.

Unfortunately the most reliable way (so far) to get all images in their highest resolution, is to extract an object from the page's embedded JavaScript. The advantage of this approach is that we can also extract artwork types from there now (at least 'Front' and 'Back', maybe others).

Closes #85.
Depends on #82 (and needs to be rebased).

P.S. This is a non-issue for digital media releases, so I have not touched that code. And I kept the old thumbnail extraction code as a fallback (in case the regex does not find a match in the page's source).

@codecov
Copy link

codecov bot commented Oct 18, 2021

Codecov Report

Merging #86 (ffd7d2c) into main (0c94513) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main       #86   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines          510       554   +44     
  Branches        91       104   +13     
=========================================
+ Hits           510       554   +44     
Impacted Files Coverage Δ
src/lib/util/json.ts 100.00% <100.00%> (ø)
.../mb_enhanced_cover_art_uploads/providers/amazon.ts 100.00% <100.00%> (ø)
...rc/mb_enhanced_cover_art_uploads/providers/base.ts 100.00% <100.00%> (ø)
...mb_enhanced_cover_art_uploads/providers/discogs.ts 100.00% <100.00%> (ø)
...c/mb_enhanced_cover_art_uploads/providers/qobuz.ts 100.00% <100.00%> (ø)
...c/mb_enhanced_cover_art_uploads/providers/tidal.ts 100.00% <100.00%> (ø)
...c/mb_enhanced_cover_art_uploads/providers/vgmdb.ts 100.00% <100.00%> (ø)
...b_enhanced_cover_art_uploads/seeding/parameters.ts 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c94513...ffd7d2c. Read the comment docs.

Copy link
Owner

@ROpdebee ROpdebee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good, just some minor nitpick.

However, I've noticed a bunch of try { JSON.parse(...) as ... } catch (err) { ... } patterns dotted around the codebase, it might make sense to put a safeJsonParse helper somewhere in the utils instead, which wraps the try..catch and returns undefined on error, and also uses a type parameter instead of needing to cast all the time. We could then use optional chaining or simple if (typeof ... === 'undefined') checks instead. I'd be happy to amend that refactoring to this PR myself, IIRC there's also similar parsing going on elsewhere.

I'm also not 100% sure what we should do with the coverage, since there's a couple error handlers that aren't covered. I think a safeJsonParse could eliminate some of them, for others we can probably just make some very small test cases to exercise those (perhaps feed in a garbage string).

@kellnerd
Copy link
Collaborator Author

I'd be happy to amend that refactoring to this PR myself, IIRC there's also similar parsing going on elsewhere.

Go for it, I won't be able to work on this before tomorrow (ok... today) afternoon.

@ROpdebee
Copy link
Owner

LGTM. Feel free to merge if you're okay with my changes.

Thanks for your work on this!

@kellnerd
Copy link
Collaborator Author

Thanks for the refactoring and the tests.
I'm sorry to say it, but I have just ruined your 100% coverage again 😂
I have just implemented the missing support for audiobook images, but have not written tests for them yet, will do that later today.

@ROpdebee
Copy link
Owner

Take your time, I'm in favour of postponing the merge until after #115 to check whether the new workflows work properly 😄

kellnerd and others added 14 commits October 19, 2021 21:33
This allows greater flexibility than an already parsed DOM document.
Using the thumbnails which are shown on the page has two disadvantages:
- Pages show only a limited number of thumbnails, not all of them.
- Maximised versions of these might still not be the largest available.

Unfortunately the most reliable way (so far) to get all images in their
highest resolution, is to extract an object from the page's embedded
JavaScript. The advantage of this approach is that we can also extract
artwork types from there now (at least 'Front' and 'Back', maybe others).
Move extraction of embedded JS into its own method. Catch and log JSON
parsing errors and fallback to use the thumbnails in the sidebar (DOM).
Move conversion from Amazon variant into MB type into its own function.
Also try to extract the variants of thumbnails from a JSON data attribute.
Now it is no longer necessary to assume that the first image is the front.
In addition to MAIN and BACK, FRNT and SIDE likely also match MB types.
The example release has 5 images, but only 4 thumbnails are shown on
amazon.com by default. Using the embedded JavaScript we can extract
larger images whose filenames contain different identifiers.
Additionally we can also detect the type of the front and the back.
Still testing the thumbnail grabbing (which is now only a fallback) by
mocking the failed attempt of extracting images from the embedded JS.
This edge case was already accounted for previously, but I accidentally
removed it during refactoring.
Generalise the streaming product extractor to a front cover extractor.
It can also handle the front cover of Audible audiobooks (on Amazon) now.
For physical audiobooks (and books) we need yet another embedded JS
extractor to get all images in the highest available resolution.
Copy link
Owner

@ROpdebee ROpdebee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any other variant of Amazon pages pops up, we might need to look into splitting it into multiple classes 😅

@ROpdebee ROpdebee merged commit 8e3f91d into main Oct 19, 2021
github-actions bot added a commit that referenced this pull request Oct 19, 2021
Extract Amazon image URLs from the page's embedded JavaScript (#86)
@github-actions
Copy link

🚀 Released 1 new userscript version(s):

  • mb_enhanced_cover_art_uploads 2021.10.19.3 in f7dcb47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Amazon provider does not return the highest resolution
2 participants