Fix incomplete abstract and title issue
In some cases the title and/or abstract obtained was incomplete
(issue gijswobben#23).

This happens when the text contains html markup tags.
The most frequent ones seem to be (in descending order):
<i>, <sub>, <sup>, <b>, <mml:*>, and possibly others.

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '
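
A minimal sketch of why the text gets cut off, assuming the value is read through an element's `.text` attribute (an assumption about pymed's internals, which this commit does not show). In `xml.etree.ElementTree`, `.text` holds only the characters before the first child element; the rest lives in the child's `.text` and `.tail`. The title below is a shortened stand-in for the real record.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative stand-in for the PMID 31689885 title.
snippet = (
    "<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> "
    "Extract</ArticleTitle>"
)
title = ET.fromstring(snippet)

# .text stops at the first child element, here <i>.
truncated = title.text

# The remaining text is attached to the child element.
child = title[0]
inner = child.text   # text inside <i>...</i>
rest = child.tail    # text after </i>, up to the next tag

print(repr(truncated))
print(repr(inner))
print(repr(rest))
```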

Quickest fix found: strip these tags from the response before parsing.
It seems to fix gijswobben#23 correctly, at least for non-mml tags.
NB: stripping nested <mml:*> tags can leave multiple blank lines behind.
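
The behaviour described above can be checked on small fragments. This sketch applies the same pattern as the commit to two hand-written strings (both hypothetical, not taken from PubMed) and shows the leftover newlines in the nested mml case.

```python
import re

# Same pattern as in the commit.
TAG_PATTERN = "<[/ ]*[a-z]{1,3}>|</?mml:.+?>"

# Short inline tags are removed; the surrounding text is kept.
fragment = "Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract"
cleaned_fragment = re.sub(TAG_PATTERN, "", fragment)

# Nested mml tags on separate lines: the tags go away, the newlines stay.
mml_fragment = "a\n<mml:math>\n<mml:mi>x</mml:mi>\n</mml:math>\nb"
cleaned_mml = re.sub(TAG_PATTERN, "", mml_fragment)

print(repr(cleaned_fragment))
print(repr(cleaned_mml))
```

The first branch of the pattern only matches lowercase tag names of one to three letters with no attributes, so longer or capitalised tags such as `<ArticleTitle>` are left untouched.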
iacopy committed Apr 12, 2020
1 parent ced77e5 commit 31c9096
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions pymed/api.py
@@ -1,4 +1,5 @@
import datetime
import re
import requests
import itertools

@@ -170,6 +171,10 @@ def _getArticles(self: object, article_ids: list) -> list:
url="/entrez/eutils/efetch.fcgi", parameters=parameters, output="xml"
)

# Remove html markup tags (<i>, <sub>, <sup>, <b>, etc.) to prevent
# title and abstract truncation
response = re.sub("<[/ ]*[a-z]{1,3}>|</?mml:.+?>", "", response)

# Parse as XML
root = xml.fromstring(response)
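
An end-to-end sketch of the hunk above: clean the response, parse it, then read the fields. The response string is a heavily shortened, hypothetical efetch-style document, and the `xml` alias for `xml.etree.ElementTree` is assumed from the `xml.fromstring` call in the diff.

```python
import re
import xml.etree.ElementTree as xml  # alias assumed from the diff

# Hypothetical, shortened response; real efetch output has many more fields.
response = (
    "<PubmedArticleSet><PubmedArticle>"
    "<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> "
    "Extract</ArticleTitle>"
    "<AbstractText>The effect of <i>Rhodiola sachalinensis</i> "
    "Boriss extract</AbstractText>"
    "</PubmedArticle></PubmedArticleSet>"
)

# Strip inline markup first, as the commit does, then parse.
cleaned = re.sub("<[/ ]*[a-z]{1,3}>|</?mml:.+?>", "", response)
root = xml.fromstring(cleaned)

# With the inline tags gone, .text now covers the whole field.
title = root.find(".//ArticleTitle").text
abstract = root.find(".//AbstractText").text

print(title)
print(abstract)
```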
