Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GAP: Apparent problem with workspace initialization #27681

Closed
embray opened this issue Apr 16, 2019 · 51 comments
Closed

GAP: Apparent problem with workspace initialization #27681

embray opened this issue Apr 16, 2019 · 51 comments

Comments

@embray
Copy link
Contributor

embray commented Apr 16, 2019

I don't fully understand the problem yet but there seems to be an issue in 8.8.beta2 (if not earlier) related to GAP workspaces.

The only way I've been able to reproduce it reliably is by running

$ rm -rf ~/.sage/gap/*

and then

./sage -t --long src/sage/combinat/tableau.py

at some point during this test (possibly others but this is the only where I've reliably reproduced the issue) things start to go horribly wrong:

sage -t --long src/sage/combinat/tableau.py
**********************************************************************
File "src/sage/combinat/tableau.py", line 7751, in sage.combinat.tableau.StandardTableaux_size.cardinality
Failed example:
    StandardTableaux(50).cardinality()  # long time
Exception raised:
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 671, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 1095, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.combinat.tableau.StandardTableaux_size.cardinality[4]>", line 1, in <module>
        StandardTableaux(Integer(50)).cardinality()  # long time
      File "sage/misc/lazy_import.pyx", line 354, in sage.misc.lazy_import.LazyImport.__call__ (build/cythonized/sage/misc/lazy_import.c:3691)
        return self.get_object()(*args, **kwds)
      File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735)
        return cls.classcall(cls, *args, **kwds)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/combinat/tableau.py", line 7584, in __classcall_private__
        return StandardTableaux_size(n)
      File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735)
        return cls.classcall(cls, *args, **kwds)
      File "sage/misc/cachefunc.pyx", line 1003, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6011)
        w = self.f(*args, **kwds)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/unique_representation.py", line 1027, in __classcall__
        instance = typecall(cls, *args, **options)
      File "sage/misc/classcall_metaclass.pyx", line 497, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2185)
        return (<PyTypeObject*>type).tp_call(cls, args, kwds)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/combinat/tableau.py", line 7693, in __init__
        facade=True, keepkey=False)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/sets/disjoint_union_enumerated_sets.py", line 288, in __init__
        self._facade_for = tuple(family)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/sets/family.py", line 1078, in __getitem__
        return self.function(i)
      File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735)
        return cls.classcall(cls, *args, **kwds)
      File "sage/misc/cachefunc.pyx", line 1003, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6011)
        w = self.f(*args, **kwds)
      File "sage/structure/parent.pyx", line 303, in sage.structure.parent.Parent.__init__ (build/cythonized/sage/structure/parent.c:4756)
        for cls in self.__class__.mro():
    MemoryError
**********************************************************************
File "src/sage/combinat/tableau.py", line 8572, in sage.combinat.tableau.IncreasingTableaux.__init__
Failed example:
    TestSuite(S).run()  # long time
Exception raised:
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 671, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 1108, in compile_and_execute
        self.update_digests(example)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 999, in update_digests
        digest.update(str_to_bytes(reduce_hex(gen), 'ascii'))
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/parsing.py", line 442, in reduce_hex
        from operator import xor
    MemoryError
Process DocTestWorker-1:
Traceback (most recent call last):
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 2139, in run
    task(self.options, self.outtmpfile, msgpipe, self.result_queue)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 2498, in __call__
    result_queue.put(result, False)
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/queues.py", line 107, in put
    self._start_thread()
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/queues.py", line 195, in _start_thread
    self._thread.start()
  File "/home/embray/src/sagemath/sage/local/lib/python2.7/threading.py", line 736, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread
    Bad exit: 1
**********************************************************************

ISTM like maybe some memory corruption occurs as a result of trying to load a corrupt or partial GAP workspace. It's really bad, whatever it is.

I should add, if I subsequently run the test it passes. But then I can reproduce the problem again by deleting workspaces and re-rerunning the test.

Depends on #27680

CC: @jdemeyer

Component: packages: standard

Keywords: gap libgap

Author: Erik Bray

Branch: ecacac8

Reviewer: Dima Pasechnik

Issue created by migration from https://trac.sagemath.org/ticket/27681

@embray embray added this to the sage-8.8 milestone Apr 16, 2019
@embray

This comment has been minimized.

@embray
Copy link
Contributor Author

embray commented Apr 16, 2019

comment:2

Some additional observations:

  • I can also reproduce the problem by doing just rm -rf ~/.sage/gap/gap-workspace-*; i.e. deleting just the pexpect gap workspace and not the libgap workspace.

  • If I instead do rm -rf ~/.sage/gap/libgap-workspace-* the test passes every time, but no new libgap workspace is created (as if the test does not use libgap at all; I haven't confirmed this but it seems to be the case).

  • However, if I delete both workspaces the test fails and both a gap-workspace- and a libgap-workspace- are created.

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

Branch: u/embray/ticket-27681

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

comment:3

This seems to fit it, with the aid of #27680.

Ironically, it was while testing #27680 that I discovered this issue, and at first I thought it was caused somehow by my changes in that ticket. But I can confirm that it's not related at all.

The problem stems (for reasons I still do not understand and are disturbing) from the fact that libgap was actually being used in the process of initializing a new workspace by pexpect GAP. Somehow this was causing problems; perhaps related to package loading.

It's still disturbing that this could cause such a problem, and I can't seem to reproduce the issue in other examples of using both GAP interfaces simultaneously. It must come down to some subtle resource contention issue within GAP itself, and we can make some progress toward avoiding it by taking more care not to use both GAPs simultaneously during workspace caching.


New commits:

de4332bChange libgap initialization to work lazily. Don't initialize() in Gap.__init__ but only just before libgap actually needs to be initialized.
e687ee6This test no longer makes sense in light of lazy initialization of libgap
0b6d94eTrac #27681: Update all_installed_packages to allow passing a specific GAP

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

Commit: 0b6d94e

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

Dependencies: #27680

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

Author: Erik Bray

@embray
Copy link
Contributor Author

embray commented Apr 17, 2019

comment:4

The attached fix worked around the issue superficially for one use case, but it still seems to be a problem after all as I am still able to reproduce it in other cases even with this fix.

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:5

I am not able to reproduce the problem in the ticket description. Is it platform-specific?

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:6

OK, I see it on Fedora 26 now (but not on Gentoo or Debian!).

I don't even need to remove the workspaces.
During the run GAP (but not libgap) workspace is changed (well, it's creation time is changed). So this seems to be unrelated to libgap, it's a GAP interface problem.

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:7

sage/combinat/tableau.py # Bad exit: 1 is mentioned here
https://groups.google.com/d/msg/sage-devel/UJW7bDFdrhQ/EoC_WdOOFgAJ

already in 8.7beta - I think I saw this earlier. I'm sure it's a doctest-related race condition - I have no idea why GAP workspace is updated mid-test, this alone is fishy.

@embray
Copy link
Contributor Author

embray commented May 6, 2019

comment:10

I'm not sure it has nothing to do with libgap. In 0b6d94ee I fixed a problem where libgap was being used during pexpect-gap initialization in the all_installed_packages function.

This seemed to alleviate the problem somewhat, but it seems to be more of a bandaid.

What's really weird is that the way the problem manifests itself is in the form of some pretty severe memory corruption affecting just random things in Sage. And I would think that only the libgap interface would be capable of breaking things that badly.

I haven't looked at the issue in a few weeks though so I don't have any good ideas at the moment. I need to start looking at it again...

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:11

As usual with race conditions, it's extremely fluid, changes somewhere unrelated might "fix" it. First of all, there is no reason for GAP workspace to get updated during this test, yet is updated (not only on systems where these errors are seen, it seems it happens everywhere).

@embray
Copy link
Contributor Author

embray commented May 6, 2019

comment:12

After rebasing this branch on 8.8.beta4 I can't reproduce the issue anymore, at least not in the same way. I'll keep trying things though.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented May 6, 2019

Changed commit from 0b6d94e to ecacac8

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented May 6, 2019

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

19d5733Change libgap initialization to work lazily. Don't initialize() in Gap.__init__ but only just before libgap actually needs to be initialized.
1312024This test no longer makes sense in light of lazy initialization of libgap
ecacac8Trac #27681: Update all_installed_packages to allow passing a specific GAP

@embray
Copy link
Contributor Author

embray commented May 6, 2019

comment:14

Rebased the branch on 8.8.beta4. Did a full make ptestlong and it passed, no problem. This is on my Ubuntu machine; which IIRC is where I originally noted the problem.

Can you try this updated version of the branch and see if it's fixed on Fedora as well? Anyways, try as I might and I can't get this to come back, at least not in the way I originally reproduced it.

I'll try again on Cygwin too...

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:15

The error is still there, slightly different:

$ ./sage -t --long src/sage/combinat/tableau.py
Running doctests with ID 2019-05-06-16-43-27-d70e975c.
Git branch: develop
Using --optional=build,dochtml,memlimit,python2,sage
Doctesting 1 file.
sage -t --long --warn-long 54.8 src/sage/combinat/tableau.py
Process DocTestWorker-1:
Traceback (most recent call last):
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 2139, in run
    task(self.options, self.outtmpfile, msgpipe, self.result_queue)
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 2498, in __call__
    result_queue.put(result, False)
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/queues.py", line 107, in put
    self._start_thread()
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/queues.py", line 195, in _start_thread
    self._thread.start()
  File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/threading.py", line 736, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread
    Bad exit: 1
**********************************************************************
Tests run before process (pid=23005) failed:
sage: t = Tableau([[1,2,3],[4,5]]); t ## line 144 ##

...
sage: IT = IncreasingTableaux(4, (1,0,1,1)) ## line 9389 ##
sage: IT[0].parent() is IT ## line 9390 ##
True
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 9392 ##
0
sage: IT = IncreasingTableaux(4, (1,0,1,1)) ## line 9402 ##
sage: all(it in IT for it in IT) ## line 9403 ##
True
sage: all(it in IT for it in IncreasingTableaux([2,2], (1,0,1,1))) ## line 9405 ##
True
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 9407 ##
0

**********************************************************************
----------------------------------------------------------------------
sage -t --long --warn-long 54.8 src/sage/combinat/tableau.py  # Bad exit: 1

GAP workspace still changes, as before.

@dimpase
Copy link
Member

dimpase commented May 6, 2019

comment:16

Note that if I wipe out ~/.sage/ then after that the 1st run the test passes, but on the 2nd, etc. runs it does not.

@embray
Copy link
Contributor Author

embray commented May 6, 2019

comment:17

Right, that's the behavior I was getting before too, but now I'm not getting it anymore, even on the exact same system as before :(

@dimpase
Copy link
Member

dimpase commented May 7, 2019

comment:18

How does one check the caller stack while running tests? I would like to see where gap_reset_workspace() is triggered during this test run, but the usual inspect thing does not print anything:

--- a/src/sage/interfaces/gap.py
+++ b/src/sage/interfaces/gap.py
@@ -1575,6 +1575,11 @@ def gap_reset_workspace(max_workspace_size=None, verbose=False):
         sage: sage.interfaces.gap.WORKSPACE = ORIGINAL_WORKSPACE
     """
     # Create new workspace with filename WORKSPACE
+    import inspect
+    curframe = inspect.currentframe()
+    calframe = inspect.getouterframes(curframe, 2)
+    print('caller name:', calframe[1][3])
+
     g = Gap(use_workspace_cache=False, max_workspace_size=None)
     g.eval('SetUserPreference("HistoryMaxLines", 30)')
     from sage.tests.gap_packages import all_installed_packages

yes the GAP workspace is changed during the test run!

@embray
Copy link
Contributor Author

embray commented May 9, 2019

comment:28

Replying to @dimpase:

If there is a problem here, it's with doctest framework. Otherwise, looks good.

I'm not sure how you concluded that, but it is a possibility I suppose. I'd be more concerned that it's a problem with forking, which does happen when running the doctests, but doesn't normally happen when running interactively. So it can make things difficult to reproduce. Nevertheless, I can't reliably reproduce the issue anymore either, and the one thing I did fix here seems to help a bit...

@vbraun
Copy link
Member

vbraun commented May 14, 2019

Changed branch from u/embray/ticket-27681 to ecacac8

@kiwifb
Copy link
Member

kiwifb commented May 14, 2019

comment:30

I have just noticed this ticket. And it gave me a little bit of fun chasing it (I don't know how Volker merged it, it doesn't show where I expect).

So I got that interesting failure in sage-on-gentoo

File "/usr/lib64/python2.7/site-packages/sage/tests/gap_packages.py", line 137, in sage.tests.gap_packages.all_installed_packages
Failed example:
    all_installed_packages() == all_installed_packages(gap=gap)
Expected:
    True
Got:
    False

After some inspection

sage: from sage.tests.gap_packages import all_installed_packages
sage: all_installed_packages()
('AtlasRep',
 'AutoDoc',
 'FactInt',
 'GAPDoc',
 'Hap',
 'MapClass',
 'SmallGrp',
 'Toric',
 'aclib',
 'alnuth',
 'atlasrep',
 'autpgrp',
 'cohomolo',
 'corelg',
 'crime',
 'crisp',
 'crypting',
 'cryst',
 'crystcat',
 'ctbllib',
 'design',
 'digraphs',
 'fga',
 'gbnp',
 'genss',
 'grape',
 'guava',
 'hecke',
 'io',
 'irredsol',
 'laguna',
 'liealgdb',
 'liepring',
 'liering',
 'loops',
 'nq',
 'orb',
 'polenta',
 'polycyclic',
 'primgrp',
 'qpa',
 'quagroup',
 'radiroot',
 'resclasses',
 'semigroups',
 'sla',
 'sonata',
 'sophus',
 'tomlib',
 'transgrp',
 'utils')
sage: all_installed_packages(gap=gap)
('AutoDoc',
 'FactInt',
 'GAPDoc',
 'Hap',
 'MapClass',
 'SmallGrp',
 'Toric',
 'aclib',
 'alnuth',
 'atlasrep',
 'autpgrp',
 'cohomolo',
 'corelg',
 'crime',
 'crisp',
 'crypting',
 'cryst',
 'crystcat',
 'ctbllib',
 'design',
 'digraphs',
 'fga',
 'gbnp',
 'genss',
 'grape',
 'guava',
 'hecke',
 'io',
 'irredsol',
 'laguna',
 'liealgdb',
 'liepring',
 'liering',
 'loops',
 'nq',
 'orb',
 'polenta',
 'polycyclic',
 'primgrp',
 'qpa',
 'quagroup',
 'radiroot',
 'resclasses',
 'semigroups',
 'sla',
 'sonata',
 'sophus',
 'tomlib',
 'transgrp',
 'utils')

The leading AtlasRep is in fact something I had forgotten in my ~/.gap and that will soon be cleaned. But it points to the fact that the way we start gap's pexpect interface makes it ignore .gap. You could say we start gap in a mode that also happens to be a "safe mode" (all your personal settings ignored) and that really should be a different option upstream.

In the meantime, should we amend the test so that ignore_dot_gap=True in both case in a follow up ticket.

@kiwifb
Copy link
Member

kiwifb commented May 14, 2019

Changed commit from ecacac8 to none

@kiwifb
Copy link
Member

kiwifb commented May 15, 2019

comment:31

Just to document so I can point to it next time. In gap-4.10.1 you have the file src/system.c which controls many OS dependent things including how gap starts. Currently line 1297 onwards

#ifdef HAVE_DOTGAPRC
    /* the users home directory                                            */
    if ( getenv("HOME") != 0 ) {
        strxcpy(DotGapPath, getenv("HOME"), sizeof(DotGapPath));
# if defined(SYS_IS_DARWIN) && SYS_IS_DARWIN
        /* On Darwin, add .gap to the sys roots, but leave */
        /* DotGapPath at $HOME/Library/Preferences/GAP     */
        strxcat(DotGapPath, "/.gap;", sizeof(DotGapPath));
        if (!IgnoreGapRC) {
          SySetGapRootPath(DotGapPath);
        }
		
        strxcpy(DotGapPath, getenv("HOME"), sizeof(DotGapPath));
        strxcat(DotGapPath, "/Library/Preferences/GAP;", sizeof(DotGapPath));
# elif defined(__CYGWIN__)
        strxcat(DotGapPath, "/_gap;", sizeof(DotGapPath));
# else
        strxcat(DotGapPath, "/.gap;", sizeof(DotGapPath));
# endif

        if (!IgnoreGapRC) {
          SySetGapRootPath(DotGapPath);
        }

So if IgnoreGapRC is "false" you get .gap. Where is that variable coming from? It is documented in the same file at line 150

/****************************************************************************
**
*V  IgnoreGapRC . . . . . . . . . . . . . . . . . . . -r option for kernel
**
*/
Int IgnoreGapRC;

and finally more interestingly we have this at line 1077

  { 'r', "", toggle, &IgnoreGapRC, 0 }, /* kernel */

and in sage we have in sage/interfaces/gap.py at line 205

gap_cmd = "gap -r"

At this stage I don't know if that option is a remnant of gap-4.8 and lower or if it is really needed.

@kiwifb
Copy link
Member

kiwifb commented May 28, 2019

comment:32

Follow up at #27878

@mwageringel
Copy link
Contributor

comment:33

Using 8.8.rc2 with Python 2, I still consistently get the same error as in comment:15. This is on CentOS Linux release 7.6.1810 (Core). Using --serial makes the test pass.

@embray
Copy link
Contributor Author

embray commented Jun 25, 2019

comment:34

I was able to reproduce this again in #18267 specifically.

@embray
Copy link
Contributor Author

embray commented Jul 3, 2019

comment:35

Replying to @mwageringel:

Using 8.8.rc2 with Python 2, I still consistently get the same error as in comment:15. This is on CentOS Linux release 7.6.1810 (Core). Using --serial makes the test pass.

Thanks for the info; that's interesting (and strange). I'm still struggling to find a way to reliably reproduce this, though it's good to know that --serial made it pass for you. It makes me wonder if there is some subtle bug with libgap post-fork.

@embray
Copy link
Contributor Author

embray commented Jul 3, 2019

comment:36

I've opened #28106 to track this issue further.

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed branch from ecacac8 to u/embray/ticket-27681

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Commit: ecacac8

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

comment:37

Reopening as it breaks the buildbot

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed dependencies from #27680 to #27680, #28106

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

comment:39

Setting this back to positive review assuming that it'll be fixed in #28106

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed dependencies from #27680, #28106 to 27680, 28106

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed branch from u/embray/ticket-27681 to ecacac8

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed commit from ecacac8 to none

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

Changed dependencies from 27680, 28106 to #27680

@vbraun
Copy link
Member

vbraun commented Jul 13, 2019

comment:41

Wrong ticket, sorry for the noise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants