Fix incomplete abstract and title issue

In some cases the title and/or abstract returned was incomplete (issue gijswobben#23). This happens when the text contains HTML markup tags. The most frequent ones seem to be, in descending order: <i>, <sub>, <sup>, <b>, <mml:*>, and possibly others.

Example: PMID 31689885

<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>

Before the fix, the returned title was just: 'Gamma Irradiated '

<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>

Before the fix, the returned abstract was just: 'The effect of '

Fastest solution found: strip the tags before parsing. It seems to fix gijswobben#23 correctly, at least for non-mml tags.

NB: cleaning nested <mml:*> tags can leave multiple blank lines.
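The truncation can be reproduced outside pymed with a short sketch. It assumes the parser reads each element's `.text` through `xml.etree.ElementTree`, which returns only the text before the first child element. The `response` string is a shortened, made-up stand-in for an efetch reply, not the real record.

```python
import re
import xml.etree.ElementTree as ET

# Shortened stand-in for an efetch XML response (illustrative, not the real record).
response = (
    "<PubmedArticle>"
    "<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract</ArticleTitle>"
    "</PubmedArticle>"
)

# Element.text stops at the first child element, here <i>.
before = ET.fromstring(response).find("ArticleTitle").text

# Same pattern as in this commit: drop short lowercase tags and mml tags.
cleaned = re.sub("<[/ ]*[a-z]{1,3}>|</?mml:.+?>", "", response)
after = ET.fromstring(cleaned).find("ArticleTitle").text

print(repr(before))  # 'Gamma Irradiated '
print(repr(after))   # 'Gamma Irradiated Rhodiola sachalinensis Extract'
```

The CamelCase structural tags such as `<ArticleTitle>` are left alone because the pattern only matches one to three lowercase letters.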
iacopy · Apr 12, 2020 · 31c9096
1 parent ced77e5
Showing 1 changed file with 5 additions and 0 deletions.
diff --git a/pymed/api.py b/pymed/api.py
@@ -1,4 +1,5 @@
 import datetime
+import re
 import requests
 import itertools
 
@@ -170,6 +171,10 @@ def _getArticles(self: object, article_ids: list) -> list:
             url="/entrez/eutils/efetch.fcgi", parameters=parameters, output="xml"
         )
 
+        # Remove html markup tags (<i>, <sub>, <sup>, <b>, etc.) to prevent
+        # title and abstract truncation
+        response = re.sub("<[/ ]*[a-z]{1,3}>|</?mml:.+?>", "", response)
+
         # Parse as XML
         root = xml.fromstring(response)
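The note about blank lines with nested `<mml:*>` tags can be illustrated with a small sketch. The `snippet` below is a made-up abstract fragment, and the final collapsing step is a hypothetical follow-up that is not part of this commit.

```python
import re

# Pattern copied from the commit.
PATTERN = "<[/ ]*[a-z]{1,3}>|</?mml:.+?>"

# Made-up fragment with nested MathML tags, one tag per line.
snippet = (
    "<AbstractText>Energy of\n"
    "<mml:math>\n"
    "<mml:mrow>\n"
    "<mml:mi>E</mml:mi>\n"
    "</mml:mrow>\n"
    "</mml:math>\n"
    "was measured.</AbstractText>"
)

# Each removed tag leaves its line break behind, so blank lines pile up.
cleaned = re.sub(PATTERN, "", snippet)

# Hypothetical extra step (not in the commit): collapse runs of newlines.
collapsed = re.sub(r"\n{2,}", "\n", cleaned)

# Tags carrying attributes are not matched by the first alternative.
untouched = re.sub(PATTERN, "", '<sup title="x">2</sup>')

print(repr(cleaned))
print(repr(collapsed))
print(repr(untouched))
```

Whether collapsing newlines is desirable depends on how the abstract text is used downstream, so it is shown here only as an option.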