Extract Amazon image URLs from the page's embedded JavaScript #86

kellnerd · 2021-10-17T21:52:20Z

Using the thumbnails which are shown on the page has two disadvantages:

- Pages show only a limited number of thumbnails, not all of them.
- Maximised versions of these might still not be the largest available.

Unfortunately, the most reliable way (so far) to get all images in their highest resolution is to extract an object from the page's embedded JavaScript. The advantage of this approach is that we can now also extract artwork types from there (at least 'Front' and 'Back', maybe others).

Closes #85.
Depends on #82 (~~and needs to be rebased~~).

P.S. This is a non-issue for digital media releases, so I have not touched that code. And I kept the old thumbnail extraction code as a fallback (in case the regex does not find a match in the page's source).
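As a rough illustration of the approach described above, the following sketch pulls a JSON array out of the page source with a regex and falls back to a thumbnail-based extractor when that fails. The regex, the `colorImages`/`initial` keys, the `hiRes`/`large`/`variant` field names and all function names are assumptions made for this example; they are not taken from the diff of this PR.

```typescript
// Hypothetical shape of one entry of the embedded image array (assumed field names).
interface EmbeddedImage {
    hiRes: string | null;
    large: string;
    variant: string;
}

// Assumed pattern: the image data sits on a single line of an inline script.
const EMBEDDED_IMAGES_REGEX = /'colorImages':\s*\{\s*'initial':\s*(.+)\}/;

function extractEmbeddedImages(pageSource: string): EmbeddedImage[] | undefined {
    const jsonText = EMBEDDED_IMAGES_REGEX.exec(pageSource)?.[1];
    if (jsonText === undefined) return undefined;

    try {
        const parsed: unknown = JSON.parse(jsonText);
        // Only accept an array, anything else counts as a failed extraction.
        return Array.isArray(parsed) ? (parsed as EmbeddedImage[]) : undefined;
    } catch (err) {
        console.error('Failed to parse embedded images', err);
        return undefined;
    }
}

// Prefer the embedded data, use the thumbnail extractor only as a fallback.
function findImageUrls(pageSource: string, thumbnailFallback: () => string[]): string[] {
    const embedded = extractEmbeddedImages(pageSource);
    if (embedded === undefined) return thumbnailFallback();
    return embedded.map((img) => img.hiRes ?? img.large);
}
```

Passing the fallback as a callback keeps the sketch independent of any DOM handling; the real provider may structure this differently.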

src/mb_enhanced_cover_art_uploads/providers/amazon.ts

tests/mb_enhanced_cover_art_uploads/providers/amazon.test.ts

codecov · 2021-10-18T21:56:33Z

Codecov Report

Merging #86 (ffd7d2c) into main (0c94513) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #86   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines          510       554   +44     
  Branches        91       104   +13     
=========================================
+ Hits           510       554   +44

| Impacted Files | Coverage Δ |
| --- | --- |
| src/lib/util/json.ts | `100.00% <100.00%> (ø)` |
| .../mb_enhanced_cover_art_uploads/providers/amazon.ts | `100.00% <100.00%> (ø)` |
| ...rc/mb_enhanced_cover_art_uploads/providers/base.ts | `100.00% <100.00%> (ø)` |
| ...mb_enhanced_cover_art_uploads/providers/discogs.ts | `100.00% <100.00%> (ø)` |
| ...c/mb_enhanced_cover_art_uploads/providers/qobuz.ts | `100.00% <100.00%> (ø)` |
| ...c/mb_enhanced_cover_art_uploads/providers/tidal.ts | `100.00% <100.00%> (ø)` |
| ...c/mb_enhanced_cover_art_uploads/providers/vgmdb.ts | `100.00% <100.00%> (ø)` |
| ...b_enhanced_cover_art_uploads/seeding/parameters.ts | `100.00% <100.00%> (ø)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0c94513...ffd7d2c.

ROpdebee

Looking pretty good, just some minor nitpicks.

However, I've noticed a bunch of `try { JSON.parse(...) as ... } catch (err) { ... }` patterns dotted around the codebase. It might make sense to put a `safeJsonParse` helper somewhere in the utils instead, which wraps the `try..catch`, returns `undefined` on error, and uses a type parameter so we don't need to cast all the time. We could then use optional chaining or simple `if (typeof ... === 'undefined')` checks instead. I'd be happy to amend that refactoring to this PR myself, IIRC there's also similar parsing going on elsewhere.

I'm also not 100% sure what we should do with the coverage, since there are a couple of error handlers that aren't covered. I think a `safeJsonParse` could eliminate some of them; for the others we can probably just write some very small test cases to exercise them (perhaps feed in a garbage string).
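A minimal sketch of what such a helper could look like, based only on the description in this comment; the version that ended up in `src/lib/util/json.ts` may differ, and the `ImageInfo` interface is a hypothetical example type.

```typescript
// Wraps JSON.parse so that callers get `undefined` instead of an exception.
// The type parameter replaces the `as ...` cast at every call site.
function safeJsonParse<T>(jsonText: string): T | undefined {
    try {
        return JSON.parse(jsonText) as T;
    } catch {
        return undefined;
    }
}

// Hypothetical example type, for illustration only.
interface ImageInfo {
    url: string;
}

const good = safeJsonParse<ImageInfo>('{"url": "https://example.com/cover.jpg"}');
const bad = safeJsonParse<ImageInfo>('this is not JSON');

// Optional chaining handles the failure case without a try..catch.
console.log(good?.url);
console.log(bad?.url);
```

Note that the type parameter is still an unchecked assertion: the helper guards against syntax errors, not against JSON that has an unexpected shape. Feeding it a garbage string, as suggested above, exercises the error branch in a single small test.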

src/mb_enhanced_cover_art_uploads/providers/amazon.ts

tests/mb_enhanced_cover_art_uploads/providers/amazon.test.ts

kellnerd · 2021-10-18T22:47:52Z

> I'd be happy to amend that refactoring to this PR myself, IIRC there's also similar parsing going on elsewhere.

Go for it, I won't be able to work on this before tomorrow (ok... today) afternoon.

ROpdebee · 2021-10-19T00:11:23Z

LGTM. Feel free to merge if you're okay with my changes.

Thanks for your work on this!

kellnerd · 2021-10-19T17:30:56Z

Thanks for the refactoring and the tests.
I'm sorry to say it, but I have just ruined your 100% coverage again 😂
I have just implemented the missing support for audiobook images, but have not written tests for them yet, will do that later today.

ROpdebee · 2021-10-19T19:15:14Z

Take your time, I'm in favour of postponing the merge until after #115 to check whether the new workflows work properly 😄

Move conversion from Amazon variant to MB type into its own function. Also try to extract the variants of thumbnails from a JSON data attribute. Now it is no longer necessary to assume that the first image is the front. In addition to MAIN and BACK, FRNT and SIDE likely also match MB types.
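The variant conversion mentioned here could be sketched roughly as follows. The names `ArtworkType`, `VARIANT_TYPE_MAPPING` and `convertVariant` are hypothetical, and the targets chosen for FRNT and SIDE are assumptions, since the commit message only says they "likely" match MB types.

```typescript
// Subset of MusicBrainz artwork types used in this sketch.
type ArtworkType = 'Front' | 'Back' | 'Spine';

// A Map avoids accidental hits on inherited object properties such as "toString".
const VARIANT_TYPE_MAPPING = new Map<string, ArtworkType>([
    ['MAIN', 'Front'],
    ['BACK', 'Back'],
    ['FRNT', 'Front'], // assumption
    ['SIDE', 'Spine'], // assumption
]);

// Returns the matching artwork types, or an empty list for unknown variants.
function convertVariant(variant: string): ArtworkType[] {
    const type = VARIANT_TYPE_MAPPING.get(variant);
    return type === undefined ? [] : [type];
}
```

Returning an empty list for unknown variants means an image is still usable without a type, instead of being guessed as the front cover by position.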

The example release has 5 images, but only 4 thumbnails are shown on amazon.com by default. Using the embedded JavaScript we can extract larger images whose filenames contain different identifiers. Additionally we can also detect the type of the front and the back. Still testing the thumbnail grabbing (which is now only a fallback) by mocking the failed attempt of extracting images from the embedded JS.

Generalise the streaming product extractor to a front cover extractor. It can also handle the front cover of Audible audiobooks (on Amazon) now. For physical audiobooks (and books) we need yet another embedded JS extractor to get all images in the highest available resolution.

src/mb_enhanced_cover_art_uploads/providers/amazon.ts

ROpdebee

If any other variant of Amazon pages pops up, we might need to look into splitting it into multiple classes 😅

Extract Amazon image URLs from the page's embedded JavaScript (#86)

github-actions · 2021-10-19T22:33:39Z

🚀 Released 1 new userscript version(s):

mb_enhanced_cover_art_uploads 2021.10.19.3 in f7dcb47

kellnerd force-pushed the amazon-hires branch from d38b85e to 0b10180 on October 17, 2021 22:02

ROpdebee reviewed Oct 17, 2021

kellnerd force-pushed the amazon-hires branch from 0b10180 to 98699b6 on October 18, 2021 21:53

ROpdebee requested changes Oct 18, 2021

src/mb_enhanced_cover_art_uploads/providers/amazon.ts (outdated, resolved)

tests/mb_enhanced_cover_art_uploads/providers/amazon.test.ts (outdated, resolved)

ROpdebee approved these changes Oct 19, 2021

kellnerd and others added 14 commits October 19, 2021 21:33

refactor(caa upload): let providers deal with the page's source code

34d06d8

This allows greater flexibility than an already parsed DOM document.

refactor(caa upload): catch and log extraction errors (Amazon)

24cfccf

Move extraction of embedded JS into its own method. Catch and log JSON parsing errors and fall back to the thumbnails in the sidebar (DOM).

refactor(caa upload): use safer JSON parsing

f7a4557

test(caa upload): remove unnecessary mock restore

ac32a23

test(caa upload): test edge cases for Amazon JS extraction

5e478fd

fix(caa upload): make sure Amazon JS gives an array

67d5cd7

This edge case was already accounted for previously, but I accidentally removed it during refactoring.

style: allow use of double quotes to avoid escapes

b63d3b5

style: use semicolons to delimit interface members, part 2

bd06d68

fix(caa upload): accept dirty Amazon URL without trailing slash

565abe6

test(caa upload): add tests for Amazon audiobooks

ebf6925

kellnerd force-pushed the amazon-hires branch from 8e45cac to ebf6925 on October 19, 2021 21:13

kellnerd requested a review from ROpdebee October 19, 2021 21:17

ROpdebee reviewed Oct 19, 2021

src/mb_enhanced_cover_art_uploads/providers/amazon.ts (resolved)

ROpdebee approved these changes Oct 19, 2021

Merge branch 'main' into amazon-hires

ffd7d2c

ROpdebee merged commit 8e3f91d into main Oct 19, 2021

github-actions bot added a commit that referenced this pull request Oct 19, 2021

🤖 mb_enhanced_cover_art_uploads 2021.10.19.3

f7dcb47

Extract Amazon image URLs from the page's embedded JavaScript (#86)

github-actions bot added the deploy:success label Oct 19, 2021

kellnerd mentioned this pull request Oct 19, 2021

refactor(caa upload): match against clean URLs to simplify regex #122

Merged

kellnerd deleted the amazon-hires branch October 19, 2021 23:10

ROpdebee added a commit that referenced this pull request Oct 20, 2021

docs(caa upload): update Amazon docs after #86

6b3c6c3

ROpdebee mentioned this pull request Oct 20, 2021

docs(caa upload): update Amazon docs after #86 #144

Merged

ROpdebee added a commit that referenced this pull request Oct 20, 2021

docs(caa upload): update Amazon docs after #86

36a1552

kellnerd added the mb_enhanced_cover_art_uploads label Oct 21, 2021
