Error using Series.subset() with NumPy 1.9 #41

Closed
freeman-lab opened this issue Oct 19, 2014 · 0 comments

Thunder's Series.subset() method relies on PySpark's rdd.takeSample(). Following a recent patch to NumPy (numpy/numpy@6b1a120), takeSample is broken on NumPy 1.9 installations: the random seeds it generates frequently exceed the maximum allowed value of 2 ** 32 - 1, which NumPy now rejects.

As a result, the following example code:

data = tsc.makeExample('pca')
data.subset(10)

will almost always produce:

ValueError: Seed must be between 0 and 4294967295

The underlying issue needs to be fixed in PySpark, but for now we can avoid the problem by explicitly specifying a seed in the correct range.
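
A minimal sketch of that workaround, not the actual Thunder change: draw the seed ourselves from the range the error message names, then pass it to takeSample explicitly. The names MAX_SEED, bounded_seed, and sample_with_bounded_seed are hypothetical and used here only for illustration; the only PySpark call assumed is rdd.takeSample(withReplacement, num, seed).

```python
import random

# Upper bound taken from the error message: 4294967295 == 2 ** 32 - 1.
MAX_SEED = 2 ** 32 - 1


def bounded_seed(rng=random):
    """Return an integer seed in [0, MAX_SEED], both ends inclusive."""
    return rng.randint(0, MAX_SEED)


def sample_with_bounded_seed(rdd, n, with_replacement=False):
    """Call rdd.takeSample with an explicit, in-range seed.

    Hypothetical helper: `rdd` is assumed to be any object exposing
    PySpark's takeSample(withReplacement, num, seed) method.
    """
    return rdd.takeSample(with_replacement, n, bounded_seed())
```

Because the seed is passed explicitly, takeSample is not left to pick its own initial seed, which is the step the issue identifies as failing under NumPy 1.9.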

freeman-lab added a commit that referenced this issue Oct 19, 2014
Explicitly specify a random seed in the range 0 to 2 ** 32 - 1, which
will always yield a valid random number using numpy’s sampler.
freeman-lab self-assigned this Oct 19, 2014