Expand embedded metadata detection #77

Open
avram opened this issue Dec 15, 2011 · 7 comments

Comments

@avram
Contributor

avram commented Dec 15, 2011

The <link rel="alternate" /> syntax for providing alternate representations should be used when we look for embedded metadata. A recent discussion notes a site providing dissertations that we don't import correctly. In addition to Google/Highwire metadata which we're parsing, it includes such <link rel="alternate" /> references to structured descriptions:

<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=librismarcxml&pids=diva2:459013"
  rel="alternate" title="MARC-XML Representation" type="text/xml" />
<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=swepubmods&pids=diva2:459013"
  rel="alternate" title="MODS Representation" type="text/xml" />

I don't think we can expect to read these as-is, since the text/xml type is too vague, but we should look for known types for formats we do read, just like we do for intercepting RIS/BibTeX download. That means application/mods+xml for MODS, etc.
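A minimal sketch of that lookup, in Python for illustration rather than translator code. Only `application/mods+xml` comes from this issue; the other entries in `KNOWN_TYPES`, and the names `AlternateLinkFinder` and `find_alternates`, are assumptions made up for the example.

```python
from html.parser import HTMLParser

# Assumed mapping from MIME type to a format we can read. Only
# application/mods+xml is named in this issue; the rest are illustrative.
KNOWN_TYPES = {
    "application/mods+xml": "MODS",
    "application/marcxml+xml": "MARCXML",
    "application/x-bibtex": "BibTeX",
    "application/x-research-info-systems": "RIS",
}


class AlternateLinkFinder(HTMLParser):
    """Collects <link rel="alternate"> elements whose type is a known format."""

    def __init__(self):
        super().__init__()
        self.matches = []  # list of (format name, href) tuples

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here as well.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        if "alternate" in rels and mime in KNOWN_TYPES and attr.get("href"):
            self.matches.append((KNOWN_TYPES[mime], attr["href"]))


def find_alternates(html):
    """Return (format, href) pairs for alternates with a recognized type."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.matches
```

With this approach the two `text/xml` links above would be skipped, since that type says nothing about the format behind it.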

@ghost ghost assigned avram Dec 15, 2011
@aurimasv
Contributor

With recent changes to Embedded Metadata, EM performs quite well on the linked page. It properly detects the type as thesis, which is not detected with either MODS or MARCXML.

In this particular case, MARCXML performs quite poorly, but it does supply numPages and seriesNumber, while the others do not.

MODS supplies the full abstract and picks up additional authors, which are probably more appropriately classified as contributors (e.g. professor, university). Is this even desirable? It also contains the ISBN and publication title, but these are not preserved in a thesis itemType.

So the advantage here is that EM detects it as a thesis, but it would be nice to get the full abstract. If we decide to supplement EM data with MARC or MODS, it may become difficult to determine which data we would prefer. Would we include all the authors? Add all the notes? (MODS adds 7 additional notes, which are fairly redundant and don't seem to enhance the metadata.)

One feasible solution, I think, would be to parse linked MODS and MARC pages and only supplement fields that are completely missing in the EM translator. This would leave out the abstract. Perhaps we can say that MODS/MARC abstracts will always be more complete?
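A rough sketch of that "only fill what's missing" rule, in Python for illustration. The function name `supplement` and the sample values are hypothetical, and items are modeled as plain dicts rather than real translator items.

```python
def is_empty(value):
    """Treat None, empty strings and empty lists as 'missing'."""
    return value is None or value == "" or value == []


def supplement(primary, secondary):
    """Return a copy of primary with only its missing fields filled from secondary."""
    merged = dict(primary)
    for field, value in secondary.items():
        if is_empty(value):
            continue  # nothing useful to add
        if is_empty(merged.get(field)):
            merged[field] = value  # field is absent or blank in primary
    return merged


# Hypothetical values: EM is the primary source, MARCXML the supplement.
em_item = {"itemType": "thesis", "title": "Example thesis", "abstractNote": "Short abstract"}
marc_item = {"itemType": "book", "numPages": "120", "seriesNumber": "5",
             "abstractNote": "Full abstract"}

merged = supplement(em_item, marc_item)
# merged keeps itemType "thesis" and the short abstract, and gains
# numPages and seriesNumber from the MARC record.
```

As the example shows, the EM abstract wins even when the linked record has a fuller one, so the abstract would need a special case if we want the MODS/MARC version.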

@dstillman dstillman added Difficulty: Easy and removed Difficulty: Medium New Translator Pull requests for new translators labels Mar 8, 2016
@dstillman
Member

I was just wondering to myself why we didn't do this. Seems like it would be absolutely trivial to implement, for us and publishers. We're also trying to improve embedded metadata support by adding support for JSON-LD, but this would be a lower-tech solution when sites already have BibTeX, RIS, etc. — basically, an easier, more standards-compliant, non-abandoned unAPI. The HTML 5.1 draft also allows <link> in body content with itemprop, which would allow the use of this for multiple items in a page.

Any reason we shouldn't do this? I guess the biggest downside is that, as with unAPI, we'd need to make a separate request for each link and run detection on the result in order to show a proper icon.

@adam3smith
Collaborator

no reason from my side. @zuphilip has brought this up, too (he'd know where, but in some related EM discussion), and it seems like a very good idea to me.

@zuphilip
Contributor

zuphilip commented Mar 8, 2016

Do you mean this discussion about the Blacklight discovery system: #893 (comment)?

@adam3smith
Collaborator

yup, thanks.

@zuphilip
Contributor

zuphilip commented Sep 1, 2017

@dstillman
Member

dstillman commented Dec 7, 2018

Another example, where MODS is available:

https://purl.stanford.edu/fv751yt5934

<link rel="alternate" title="MODS XML" type="application/xml" href="https://purl.stanford.edu/fv751yt5934.mods" />

For a case like this I think we'd just want to look for 'mods' in the title and href when the type is application/xml or text/xml.
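Something like the following, sketched in Python for illustration. The function name `looks_like_mods` is hypothetical, and treating a match in either the title or the href as sufficient is an assumption about how strict the check should be.

```python
# Generic XML types that say nothing about the format on their own.
GENERIC_XML_TYPES = {"application/xml", "text/xml"}


def looks_like_mods(link_type, title, href):
    """Guess whether a generically typed XML alternate link points at MODS.

    Assumption: 'mods' appearing in either the title or the href is enough.
    """
    if link_type.strip().lower() not in GENERIC_XML_TYPES:
        return False
    return "mods" in title.lower() or "mods" in href.lower()


# The Stanford link above matches on both title and href.
stanford = looks_like_mods(
    "application/xml",
    "MODS XML",
    "https://purl.stanford.edu/fv751yt5934.mods",
)
```

The same check would also accept the DiVA MODS link from the original report (its href contains `swepubmods`) and reject the MARC-XML one.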

socheres pushed a commit to socheres/translators that referenced this issue Apr 7, 2020
Translator additions and fixes