Python - Exceptions for numpy arrays
I'm looking to remove values within a constant range around the values held in a second array. That is, I have one large np array and I want to remove values within +-3 of a set of specific values, say [20,50,90,210]. So if my large array is [14,21,48,54,92,215], I would want [14,54,215] returned. The values are double precision, so I'm trying to avoid creating a large mask array to remove specific values and would like to use a range instead.
You mentioned that you wanted to avoid a large mask array. Unless both your "large array" and your "specific values" array are very large, I wouldn't try to avoid this. Often, with numpy it's best to allow relatively large temporary arrays to be created.
However, if you do need to control memory usage more tightly, you have several options. A typical trick is to vectorize only one part of the operation and iterate over the shorter input (this is shown in the second example below). It saves having nested loops in Python, and can significantly decrease the memory usage involved.
I'll show three different approaches. There are several others (including dropping down to C or Cython if you really need tight control and performance), but hopefully this gives you some ideas.
On a side note, for these small inputs, the overhead of array creation will overwhelm the differences. The speed and memory usage I'm referring to only apply to large (>~1e6 elements) arrays.
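To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. The helper name `temp_bytes` is hypothetical and only counts the largest distance temporary in each approach, so treat the numbers as a lower bound rather than a measurement.

```python
import numpy as np

def temp_bytes(n_vals, n_other, dtype=np.float64):
    """Estimate the size in bytes of the main temporaries (hypothetical helper)."""
    itemsize = np.dtype(dtype).itemsize
    # Fully vectorized: one (n_vals, n_other) array of distances
    full = n_vals * n_other * itemsize
    # Partially vectorized: one (n_vals,) array of distances per iteration
    partial = n_vals * itemsize
    # Boolean mask with one byte per element of the large array
    mask = n_vals * np.dtype(bool).itemsize
    return full, partial, mask

full, partial, mask = temp_bytes(10_000_000, 4)
print(full / 1e6, partial / 1e6, mask / 1e6)  # sizes in MB
```

With 10,000,000 doubles and 4 specific values, the full distance array is about 320 MB, while a single 1D distance array is about 80 MB and the mask about 10 MB. The gap grows linearly with the length of the "specific values" array.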
Fully vectorized, most memory usage
The easiest way is to calculate all of the distances at once and then reduce the result to a mask with the same shape as the initial array. For example:
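    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    # Pairwise distances, shape (len(vals), len(other))
    dist = np.abs(vals[:, None] - other[None, :])
    # Keep a value only if it is more than 3 away from every entry in other
    mask = np.all(dist > 3, axis=1)
    result = vals[mask]

Partially vectorized, intermediate memory usage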
Another option is to build up the mask iteratively, one element of the "specific values" array at a time. This iterates over all elements of the shorter "specific values" array (a.k.a. other in this case):
    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    # Start by keeping everything, then knock out values near each entry
    mask = np.ones(len(vals), dtype=bool)
    for num in other:
        dist = np.abs(vals - num)
        mask &= dist > 3
    result = vals[mask]

Slowest, lowest memory usage
Finally, if you really want to reduce memory usage, you could iterate over every item in your large array:
    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    result = []
    for num in vals:
        # Only a tiny (len(other),) temporary is created per item
        if np.all(np.abs(num - other) > 3):
            result.append(num)

The temporary list in that case is likely to take up more memory than the mask in the previous version. However, you can avoid the temporary list by using np.fromiter if you want. The timing comparison below shows an example of this.
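If item-by-item iteration is too slow but the full distance array is too large, one possible middle ground is to process the large array in fixed-size blocks. This is a sketch that is not part of the original answer. The function name `remove_near_chunked` and the `chunk_size` parameter are made up for illustration, and the default block size is an arbitrary assumption you would tune for your machine.

```python
import numpy as np

def remove_near_chunked(vals, other, tolerance, chunk_size=100_000):
    """Drop entries of vals within tolerance of any entry of other, block by block."""
    keep = np.empty(len(vals), dtype=bool)
    for start in range(0, len(vals), chunk_size):
        block = vals[start:start + chunk_size]
        # The temporary is at most (chunk_size, len(other)) instead of (len(vals), len(other))
        dist = np.abs(block[:, None] - other[None, :])
        keep[start:start + chunk_size] = np.all(dist > tolerance, axis=1)
    return vals[keep]

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])
# A tiny chunk_size here just to exercise the loop over more than one block
result = remove_near_chunked(vals, other, 3, chunk_size=4)
```

The peak temporary size is then controlled by `chunk_size` rather than by the length of the large array, at the cost of one Python-level loop iteration per block.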
Timing comparisons
Let's compare the speed of these functions. We'll use 10,000,000 elements in the "large array" and 4 values in the "specific values" array. The relative speed and memory usage of these functions depend strongly on the sizes of the two arrays, so you should only consider this a vague guideline.
    import numpy as np

    vals = np.random.random(int(1e7))  # the size must be an integer
    other = np.array([0.1, 0.5, 0.8, 0.95])
    tolerance = 0.05

    def basic(vals, other, tolerance):
        # Fully vectorized: builds the whole (len(vals), len(other)) array
        dist = np.abs(vals[:, None] - other[None, :])
        mask = np.all(dist > tolerance, axis=1)
        return vals[mask]

    def intermediate(vals, other, tolerance):
        # Partially vectorized: loops over the shorter array only
        mask = np.ones(len(vals), dtype=bool)
        for num in other:
            dist = np.abs(vals - num)
            mask &= dist > tolerance
        return vals[mask]

    def slow(vals, other, tolerance):
        # Item by item: the generator avoids building a temporary list
        def func(vals, other, tolerance):
            for num in vals:
                if np.all(np.abs(num - other) > tolerance):
                    yield num
        return np.fromiter(func(vals, other, tolerance), dtype=vals.dtype)

As the timings below show, the partially vectorized version wins out in this case. That's to be expected in most cases where vals is significantly longer than other. However, the first example (basic) is almost as fast, and is arguably simpler.
    In [7]: %timeit basic(vals, other, tolerance)
    1 loops, best of 3: 1.45 s per loop

    In [8]: %timeit intermediate(vals, other, tolerance)
    1 loops, best of 3: 917 ms per loop

    In [9]: %timeit slow(vals, other, tolerance)
    1 loops, best of 3: 2min 30s per loop

Whichever way you choose to implement things, these are common vectorization "tricks" that show up in many problems. In high-level languages like Python, Matlab, R, etc. it's often useful to try fully vectorizing first, then mix vectorization and explicit loops if memory usage is an issue. Which one is best usually depends on the relative sizes of your inputs, but this is a common pattern to try when optimizing speed vs memory usage in high-level scientific programming.
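As one example of the other approaches alluded to earlier, here is a sketch that swaps in a different technique from the three above: sort the "specific values" array and use np.searchsorted to find each value's nearest neighbours. This is not from the original answer and has not been timed against the functions above. The function name `remove_near_sorted` is hypothetical, and the sketch assumes other is non-empty.

```python
import numpy as np

def remove_near_sorted(vals, other, tolerance):
    """Drop entries of vals within tolerance of any entry of other (assumes other is non-empty)."""
    other = np.sort(other)
    # Insertion index of each value in the sorted array, in [0, len(other)]
    idx = np.searchsorted(other, vals)
    last = len(other) - 1
    # Nearest specific value on each side, clipped at the ends of the array
    left = other[np.clip(idx - 1, 0, last)]
    right = other[np.clip(idx, 0, last)]
    nearest = np.minimum(np.abs(vals - left), np.abs(vals - right))
    return vals[nearest > tolerance]

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])
result = remove_near_sorted(vals, other, 3)
```

Every temporary here has the same length as vals, regardless of how long other is, and there is no Python-level loop. That may make it worth trying when both arrays are large, which is the case where the fully vectorized version becomes impractical.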