Expand embedded metadata detection #77

avram · 2011-12-15T04:31:31Z

The <link rel="alternate" /> syntax for providing alternate representations should be used when we look for embedded metadata. A recent discussion notes a site providing dissertations that we don't import correctly. In addition to the Google/Highwire metadata, which we already parse, it includes <link rel="alternate" /> references to structured descriptions:

<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=librismarcxml&pids=diva2:459013"
  rel="alternate" title="MARC-XML Representation" type="text/xml" />
<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=swepubmods&pids=diva2:459013"
  rel="alternate" title="MODS Representation" type="text/xml" />

I don't think we can expect to read these as-is, since the text/xml type is too vague, but we should look for known types for formats we do read, just like we do when intercepting RIS/BibTeX downloads. That means application/mods+xml for MODS, etc.
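A minimal Python sketch of that lookup, for illustration only (it is not the translator code). The KNOWN_TYPES table and the function names are hypothetical; only application/mods+xml is named above, and the other MIME types are assumptions about what sites commonly send.

```python
from html.parser import HTMLParser

# Hypothetical table of MIME types for formats we can already read.
# Only application/mods+xml is taken from the comment above; the
# remaining entries are assumptions.
KNOWN_TYPES = {
    "application/mods+xml": "MODS",
    "application/marcxml+xml": "MARCXML",
    "application/x-bibtex": "BibTeX",
    "application/x-research-info-systems": "RIS",
}


class AlternateLinkParser(HTMLParser):
    """Collect the attributes of every <link rel="alternate"> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.links.append(attr)


def find_known_alternates(html):
    """Return (format, href) pairs for alternates whose type we recognise."""
    parser = AlternateLinkParser()
    parser.feed(html)
    found = []
    for link in parser.links:
        # Drop any parameters such as "; charset=utf-8" before matching.
        mime = link.get("type", "").split(";")[0].strip().lower()
        if mime in KNOWN_TYPES:
            found.append((KNOWN_TYPES[mime], link["href"]))
    return found
```

With this table, the two text/xml links quoted above would be skipped, which matches the point that text/xml alone is too vague to act on.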

aurimasv · 2012-03-31T00:35:00Z

With recent changes to Embedded Metadata (EM), EM performs quite well on the linked page. It properly detects the type as thesis, which neither MODS nor MARCXML does.

In this particular case, MARCXML performs quite poorly, but it does supply numPages and seriesNumber, while the others do not.

MODS supplies the full abstract and picks up additional authors, which are probably more appropriately classified as contributors (e.g. professor, university, etc.; is this even desirable?). It also contains the ISBN and publication title, but these are not preserved in a thesis itemType.

So the advantage here is that EM detects it as a thesis, but it would be nice to get the full abstract. If we decide to supplement EM data with MARC or MODS, it may become difficult to determine which data to prefer. Would we include all the authors? Add all the notes? (MODS adds 7 additional notes, which are fairly redundant and don't seem to enhance the metadata.)

One feasible solution, I think, would be to parse linked MODS and MARC pages and only supplement fields that are completely missing in the EM translator. This would leave out the abstract. Perhaps we can say that MODS/MARC abstracts will always be more complete?
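As a rough illustration of the "only supplement missing fields" idea, here is a small Python sketch. The function name and the sample items are hypothetical, and the field names are used only as examples; it assumes each translator result can be treated as a flat dictionary.

```python
def supplement_missing(primary, secondary):
    """Return a copy of primary with absent or empty fields filled from secondary.

    Fields that primary already populates are never overwritten.
    """
    merged = dict(primary)
    for field, value in secondary.items():
        if value in (None, "", []):
            continue  # nothing useful to add
        if merged.get(field) in (None, "", []):
            merged[field] = value
    return merged


# Hypothetical sample data: an EM result and a MODS result for one page.
em_item = {
    "itemType": "thesis",
    "title": "Example thesis",
    "abstractNote": "Truncated abstract",
}
mods_item = {
    "itemType": "book",
    "abstractNote": "Full abstract text",
    "numPages": "120",
    "notes": [],
}
merged = supplement_missing(em_item, mods_item)
```

In this sketch merged keeps the thesis item type and gains numPages, but it also keeps the truncated abstract, which is exactly the limitation raised above: a field-level override (for example, always preferring the linked abstract) would have to be a separate rule.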

dstillman · 2016-03-08T21:21:21Z

I was just wondering to myself why we didn't do this. Seems like it would be absolutely trivial to implement, for us and publishers. We're also trying to improve embedded metadata support by adding support for JSON-LD, but this would be a lower-tech solution when sites already have BibTeX, RIS, etc. — basically, an easier, more standards-compliant, non-abandoned unAPI. The HTML 5.1 draft also allows <link> in body content with itemprop, which would allow the use of this for multiple items in a page.

Any reason we shouldn't do this? I guess the biggest downside is that, as with unAPI, we'd need to make a separate request for each link and run detection on the result in order to show a proper icon.
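To make that cost concrete, here is a hedged Python sketch of the per-link fetch-and-detect step. The sniffing rules are simplified assumptions for illustration, not the actual translator detection, and fetch is a hypothetical callable supplied by the caller so the sketch itself makes no network requests.

```python
def sniff_format(text):
    """Guess a bibliographic format from the start of a response body.

    These checks are deliberately crude and are assumptions, not the
    real detection logic.
    """
    head = text.lstrip()[:2000]
    if head.startswith("TY  - "):
        return "RIS"
    if head.startswith("@"):
        return "BibTeX"
    if "<mods" in head or "http://www.loc.gov/mods/v3" in head:
        return "MODS"
    if "http://www.loc.gov/MARC21/slim" in head:
        return "MARCXML"
    return None


def detect_linked_formats(hrefs, fetch):
    """Fetch each link once and record the formats we could recognise.

    fetch(href) must return the response body as a string and may
    raise OSError on failure; failed links are skipped.
    """
    results = {}
    for href in hrefs:
        try:
            body = fetch(href)
        except OSError:
            continue
        fmt = sniff_format(body)
        if fmt is not None:
            results[href] = fmt
    return results
```

Every candidate link costs one request before an icon can be chosen, so a real implementation would probably want to cap the number of links it follows.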

adam3smith · 2016-03-08T21:23:30Z

no reason from my side. @zuphilip has brought this up, too (he'd know where, but in some related EM discussion), and it seems like a very good idea to me.

zuphilip · 2016-03-08T21:35:24Z

Do you mean this discussion about the Blacklight discovery system: #893 (comment)?

adam3smith · 2016-03-08T21:59:42Z

yup, thanks.

zuphilip · 2017-09-01T14:52:33Z

Here are some examples from Blacklight catalogs:

dstillman · 2018-12-07T00:22:40Z

Another example, where MODS is available:

https://purl.stanford.edu/fv751yt5934

<link rel="alternate" title="MODS XML" type="application/xml" href="https://purl.stanford.edu/fv751yt5934.mods" />

For a case like this I think we'd just want to look for 'mods' in the title and href when the type is application/xml or text/xml.
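That heuristic could be sketched in Python roughly as follows. The function name is hypothetical, and the link is assumed to be a plain dictionary of the <link> element's attributes.

```python
GENERIC_XML_TYPES = {"application/xml", "text/xml"}


def looks_like_mods(link):
    """Return True if a <link rel="alternate"> probably points at MODS.

    A specific MODS media type is accepted outright; for generic XML
    types we fall back to looking for "mods" in the title and href.
    """
    mime = link.get("type", "").split(";")[0].strip().lower()
    if mime == "application/mods+xml":
        return True
    if mime not in GENERIC_XML_TYPES:
        return False
    hint = (link.get("title", "") + " " + link.get("href", "")).lower()
    return "mods" in hint


stanford = {
    "rel": "alternate",
    "title": "MODS XML",
    "type": "application/xml",
    "href": "https://purl.stanford.edu/fv751yt5934.mods",
}
print(looks_like_mods(stanford))  # prints True
```

A plain substring match can give false positives (any URL that happens to contain "mods"), so the fetched document would still need to be checked before import.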

ghost assigned avram Dec 15, 2011

aurimasv added the Difficulty: Medium label Nov 3, 2014

dstillman added the Difficulty: Easy label and removed the Difficulty: Medium and New Translator labels Mar 8, 2016

zuphilip mentioned this issue Aug 28, 2017

University of Wisconsin-Madison Libraries Catalog.js #1397

Merged

zotero deleted a comment Dec 18, 2017

dstillman mentioned this issue Mar 7, 2018

Support <link> rel="alternate" and/or rel="meta" zotero/zotero-connectors#223

Closed

mrtcode mentioned this issue Dec 19, 2018

Generic webpage translator #1092

Open

socheres pushed a commit to socheres/translators that referenced this issue Apr 7, 2020

Merge pull request zotero#77 from ubtue/mkannan

8f62fac
