Error using Series.subset() with NumPy 1.9 #41

freeman-lab · 2014-10-19T03:17:56Z

Thunder's Series.subset() method relies on PySpark's rdd.takeSample(). Due to a recent patch to NumPy (numpy/numpy@6b1a120), takeSample is broken on NumPy 1.9 installations: it generates random seeds that frequently exceed the largest value NumPy now accepts, 2 ** 32 - 1.

As a result, the following example code:

data = tsc.makeExample('pca')
data.subset(10)

will almost always produce:

ValueError: Seed must be between 0 and 4294967295

The underlying issue needs to be fixed in PySpark, but for now we can avoid the problem by explicitly specifying a seed in the correct range.
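As a rough illustration of that workaround, the sketch below draws a seed within the bounds given in the error message and passes it to takeSample explicitly. The helper names safe_seed and take_sample_safely are hypothetical and are not taken from Thunder or from the fix commit; the only library call assumed is PySpark's rdd.takeSample(withReplacement, num, seed).

```python
import random

# Largest seed NumPy 1.9 accepts, per the error message (4294967295).
MAX_SEED = 2 ** 32 - 1


def safe_seed():
    """Return a random integer seed in the closed range [0, 2**32 - 1]."""
    return random.randint(0, MAX_SEED)


def take_sample_safely(rdd, num, with_replacement=False):
    """Sample `num` records from `rdd`, always passing an in-range seed.

    `rdd` is assumed to be a PySpark RDD (or any object exposing a
    compatible takeSample method). Supplying the seed ourselves means
    the sampler never has to generate one on its own.
    """
    return rdd.takeSample(with_replacement, num, seed=safe_seed())
```

With a live SparkContext, something like take_sample_safely(sc.parallelize(range(100)), 10) would exercise it; whether Thunder's own fix is structured this way is not confirmed here.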



freeman-lab added the bug label Oct 19, 2014

freeman-lab added a commit that referenced this issue Oct 19, 2014

#41 Fix for random seed error during sampling

9ba0fb6

Explicitly specify a random seed in the range 0 to 2 ** 32 - 1, which will always yield a valid random number using numpy’s sampler.

freeman-lab self-assigned this Oct 19, 2014

industrial-sloth mentioned this issue Nov 6, 2014

error when run subset: Py4JJavaError: An error occurred while calling o102.collect. #45

Closed

freeman-lab closed this as completed Nov 15, 2014

JoshRosen mentioned this issue Jan 23, 2015

Python mllib tests failing databricks/spark-perf#46

Closed
