UnsupportedOperation 'seek' when loading excel files from url #20434

Closed
mcrot opened this issue Mar 21, 2018 · 4 comments · Fixed by #20437
Labels
IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@mcrot
Contributor

mcrot commented Mar 21, 2018

Code Sample, a copy-pastable example if possible

import pandas

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/test1.xls'
pandas.read_excel(url)

Problem description

In my version 0.23.0.dev0+657.g01882ba I get an UnsupportedOperation:

UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-5-71715bfd4345> in <module>()
----> 1 pandas.read_excel(url)

~/prj/pandas-mcrot/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    170                 else:
    171                     kwargs[new_arg_name] = new_arg_value
--> 172             return func(*args, **kwargs)
    173         return wrapper
    174     return _deprecate_kwarg

~/prj/pandas-mcrot/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    170                 else:
    171                     kwargs[new_arg_name] = new_arg_value
--> 172             return func(*args, **kwargs)
    173         return wrapper
    174     return _deprecate_kwarg

~/prj/pandas-mcrot/pandas/io/excel.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, **kwds)
    313 
    314     if not isinstance(io, ExcelFile):
--> 315         io = ExcelFile(io, engine=engine)
    316 
    317     return io._parse_excel(

~/prj/pandas-mcrot/pandas/io/excel.py in __init__(self, io, **kwds)
    390             if hasattr(io, 'seek'):
    391                 # GH 19779
--> 392                 io.seek(0)
    393 
    394             data = io.read()

UnsupportedOperation: seek

PR #19926 was made to fix #19779. It introduced a call to seek(), made only for objects that have that method.
When a URL is passed to pandas.read_excel(), seek() is called on an HTTPResponse instance,
which does not appear to support seeking even though the seek() method is available.
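
The mismatch described above can be reproduced without a network connection. The following is a minimal sketch, not pandas code: `NonSeekableStream` is a hypothetical stand-in for `http.client.HTTPResponse`, which also derives from `io.BufferedIOBase` and therefore inherits a `seek()` method that raises `io.UnsupportedOperation`.

```python
import io


class NonSeekableStream(io.BufferedIOBase):
    """Hypothetical stand-in for a network response: readable, not seekable."""

    def __init__(self, payload):
        self._buffer = io.BytesIO(payload)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._buffer.read(size)


stream = NonSeekableStream(b"spreadsheet bytes")

# The attribute exists because io.IOBase defines seek() for every stream,
# so a hasattr() check cannot tell seekable and non-seekable streams apart.
has_seek_attr = hasattr(stream, "seek")

# seekable() reports whether seek() can actually be used.
is_seekable = stream.seekable()

# Calling the inherited seek() raises io.UnsupportedOperation.
try:
    stream.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True

print(has_seek_attr, is_seekable, seek_failed)
```

This suggests that `hasattr(io, 'seek')` alone is not a reliable capability check for file-like objects built on the `io` base classes.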

This issue is already covered by a test. When running

pytest pandas/tests/io/test_excel.py

the test method TestXlrdReader.test_read_from_http_url fails for the same reason.

Expected Output

In version 0.22 this code returns the data as expected:

Out[4]: 
                   A         B         C         D
2000-01-03  0.980269  3.685731 -0.364217 -1.159738
2000-01-04  1.047916 -0.041232 -0.161812  0.212549
2000-01-05  0.498581  0.731168 -0.537677  1.346270
2000-01-06  1.120202  1.567621  0.003641  0.675253
2000-01-07 -0.487094  0.571455 -1.611639  0.103469
2000-01-10  0.836649  0.246462  0.588543  1.062782
2000-01-11 -0.157161  1.340307  1.195778 -1.097007

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 01882ba
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-116-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.23.0.dev0+657.g01882ba
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@mcrot
Contributor Author

mcrot commented Mar 21, 2018

This patch fixes the failing tests on my machine:

@@ -10,6 +10,7 @@ import os
import abc
import warnings
import numpy as np
+from http.client import HTTPResponse

from pandas.core.dtypes.common import (
    is_integer, is_float,
@@ -387,7 +388,9 @@ class ExcelFile(object):
            self.book = io
        elif not isinstance(io, xlrd.Book) and hasattr(io, "read"):
            # N.B. xlrd.Book has a read attribute too
-            if hasattr(io, 'seek'):
+            #
+            # http.client.HTTPResponse.seek() -> UnsupportedOperation exception
+            if not isinstance(io, HTTPResponse) and hasattr(io, 'seek'):
                # GH 19779
                io.seek(0)

Do you think it's sufficient for resolving this issue? I would like to provide my first pull request.

Or is this issue more related to my setup? I'm completely new to pandas development and I've just prepared a working environment following the guide Contributing to pandas.

Thanks,

mcrot added a commit to mcrot/pandas that referenced this issue Mar 21, 2018
@TomAugspurger TomAugspurger added IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version labels Mar 21, 2018
mcrot added a commit to mcrot/pandas that referenced this issue Mar 22, 2018
Closes pandas-dev#20434.

Back in pandas-dev#19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
@jreback jreback added this to the 0.23.0 milestone Mar 22, 2018
@mcrot
Contributor Author

mcrot commented Mar 23, 2018

I can reproduce this issue including the failure of the existing tests on another Linux machine:

git clone https://github.com/mcrot/pandas.git pandas-mcrot
cd pandas-mcrot
git remote add upstream https://github.com/pandas-dev/pandas.git
conda update conda
conda env create -f ci/environment-dev.yaml
source activate pandas-dev
python setup.py build_ext --inplace -j 4
python -m pip install -e .
conda install -c defaults -c conda-forge --file=ci/requirements-optional-conda.txt
pytest pandas/tests/io/test_excel.py

Result:

============================================================================================= test session starts ==============================================================================================
platform linux -- Python 3.6.4, pytest-3.4.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/roettm/prj/pandas-mcrot, inifile: setup.cfg
plugins: xdist-1.22.1, forked-0.2, cov-2.5.1
collected 0 items                                                                                                                                                                                              

========================================================================================= no tests ran in 0.01 seconds =========================================================================================
(pandas-dev) roettm@mail:~/prj/pandas-mcrot$ ^C
(pandas-dev) roettm@mail:~/prj/pandas-mcrot$ python
Python 3.6.4 |Anaconda, Inc.| (default, Mar 13 2018, 01:15:57) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.23.0.dev0+657.g01882ba'
>>> 
(pandas-dev) roettm@mail:~/prj/pandas-mcrot$ pytest pandas/tests/io/test_excel.py 
============================================================================================= test session starts ==============================================================================================
platform linux -- Python 3.6.4, pytest-3.4.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/roettm/prj/pandas-mcrot, inifile: setup.cfg
plugins: xdist-1.22.1, forked-0.2, cov-2.5.1
collected 549 items                                                                                                                                                                                            

pandas/tests/io/test_excel.py ........................................................................FFF............................................................................................... [ 30%]
........................................................................................................................................................................................................ [ 67%]
..................................s.s.s.s.s.s.s.s.............................................................................................................................x....                      [100%]

=================================================================================================== FAILURES ===================================================================================================
_________________________________________________________________________________ TestXlrdReader.test_read_from_http_url[.xls] _________________________________________________________________________________

self = <pandas.tests.io.test_excel.TestXlrdReader object at 0x7f77f80c4eb8>, ext = '.xls'

    @tm.network
    def test_read_from_http_url(self, ext):
        url = ('https://raw.github.com/pandas-dev/pandas/master/'
               'pandas/tests/io/data/test1' + ext)
>       url_table = read_excel(url)

pandas/tests/io/test_excel.py:562: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/util/_decorators.py:172: in wrapper
    return func(*args, **kwargs)
pandas/util/_decorators.py:172: in wrapper
    return func(*args, **kwargs)
pandas/io/excel.py:315: in read_excel
    io = ExcelFile(io, engine=engine)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.excel.ExcelFile object at 0x7f77f823fe10>, io = <http.client.HTTPResponse object at 0x7f77f80d52e8>, kwds = {}, err_msg = 'Install xlrd >= 0.9.0 for Excel support'
xlrd = <module 'xlrd' from '/home/roettm/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xlrd/__init__.py'>, ver = (1, 1), engine = None

    def __init__(self, io, **kwds):
    
        err_msg = "Install xlrd >= 0.9.0 for Excel support"
    
        try:
            import xlrd
        except ImportError:
            raise ImportError(err_msg)
        else:
            ver = tuple(map(int, xlrd.__VERSION__.split(".")[:2]))
            if ver < (0, 9):  # pragma: no cover
                raise ImportError(err_msg +
                                  ". Current version " + xlrd.__VERSION__)
    
        # could be a str, ExcelFile, Book, etc.
        self.io = io
        # Always a string
        self._io = _stringify_path(io)
    
        engine = kwds.pop('engine', None)
    
        if engine is not None and engine != 'xlrd':
            raise ValueError("Unknown engine: {engine}".format(engine=engine))
    
        # If io is a url, want to keep the data as bytes so can't pass
        # to get_filepath_or_buffer()
        if _is_url(self._io):
            io = _urlopen(self._io)
        elif not isinstance(self.io, (ExcelFile, xlrd.Book)):
            io, _, _, _ = get_filepath_or_buffer(self._io)
    
        if engine == 'xlrd' and isinstance(io, xlrd.Book):
            self.book = io
        elif not isinstance(io, xlrd.Book) and hasattr(io, "read"):
            # N.B. xlrd.Book has a read attribute too
            if hasattr(io, 'seek'):
                # GH 19779
>               io.seek(0)
E               io.UnsupportedOperation: seek

pandas/io/excel.py:392: UnsupportedOperation
[...similar output for test parameters 'xlsx' and 'xlsm']

Output of pd.show_versions() on that machine:

INSTALLED VERSIONS

commit: 01882ba
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-042stab127.2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.23.0.dev0+657.g01882ba
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@mcrot
Contributor Author

mcrot commented Mar 26, 2018

Can anyone confirm this issue? Probably not, since it would cause a lot of test failures. Alternatively, does anyone have an idea what could be wrong with my setup, e.g. do you see a significant difference in the output of pd.show_versions? Thanks for any pointers.

@TomAugspurger
Contributor

TomAugspurger commented Mar 26, 2018

I can confirm that I get a failure locally.

mcrot added a commit to mcrot/pandas that referenced this issue Apr 3, 2018
Closes pandas-dev#20434.

Back in pandas-dev#19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
TomAugspurger pushed a commit that referenced this issue Apr 3, 2018
Closes #20434.

Back in #19779 a call of a seek() method was added. This call fails
on HTTPResponse instances with an UnsupportedOperation exception,
so for this case a try..except wrapper was added here.
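
The commit message describes wrapping the seek() call in try..except. A minimal sketch of that idea follows, assuming the goal is simply to rewind when possible and carry on otherwise. This is not the merged pandas code: `rewind_if_possible`, `read_all`, and `NonSeekableStream` are hypothetical names introduced here for illustration.

```python
import io


def rewind_if_possible(handle):
    """Hypothetical helper: rewind the stream, return True if it worked."""
    if not hasattr(handle, "seek"):
        return False
    try:
        handle.seek(0)
    except io.UnsupportedOperation:
        # Streams such as http.client.HTTPResponse expose seek()
        # but do not support it; read from the current position instead.
        return False
    return True


def read_all(handle):
    """Hypothetical helper: read the full payload, rewinding first if allowed."""
    rewind_if_possible(handle)
    return handle.read()


class NonSeekableStream(io.BufferedIOBase):
    """Hypothetical stand-in for a network response: readable, not seekable."""

    def __init__(self, payload):
        self._buffer = io.BytesIO(payload)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._buffer.read(size)


# A seekable buffer that was already consumed is rewound (the GH 19779 case).
consumed = io.BytesIO(b"excel bytes")
consumed.read()
from_buffer = read_all(consumed)

# A non-seekable stream is read as-is instead of raising.
from_stream = read_all(NonSeekableStream(b"excel bytes"))
```

Compared with the isinstance(io, HTTPResponse) check from the earlier patch, catching the exception also covers other non-seekable streams and needs no import of http.client.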