Python - exceptions for numpy arrays

April 15, 2015

I'm looking to remove values that fall within a constant range around the values held in a second array. That is, I have one large np array and want to remove every value within +-3 of the entries in an array of specific values, e.g. [20,50,90,210]. If the large array is [14,21,48,54,92,215], I want [14,54,215] returned. The values are double precision, so I'm trying to avoid creating a large mask array to remove specific values, and to use a range instead.

You mentioned that you wanted to avoid a large mask array. Unless both the "large array" and the "specific values" array are very large, I wouldn't try to avoid this. Often, with numpy it's best to allow relatively large temporary arrays to be created.

However, if you do need to control memory usage more tightly, you have several options. A typical trick is to vectorize only one part of the operation and iterate over the shorter input (this is shown in the second example below). It saves having nested loops in Python, and it can decrease the memory usage involved.

I'll show three different approaches. There are several others (including dropping down to C or Cython if you need tight control and performance), but hopefully this gives you some ideas.

On a side note, for these small inputs, the overhead of array creation will overwhelm any differences. The speed and memory usage I'm referring to apply only to large (>~1e6 elements) arrays.

Fully vectorized, highest memory usage

The easiest way is to calculate all distances at once and then reduce the mask to the same shape as the initial array. For example:

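    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    # Pairwise distances: one row per entry of vals, one column per entry of other
    dist = np.abs(vals[:, None] - other[None, :])
    # Keep a value only if it is more than 3 away from every entry of other
    mask = np.all(dist > 3, axis=1)
    result = vals[mask]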
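To see where the extra memory goes, it can help to look at the shapes that broadcasting produces. The following is a small sketch using the same example inputs; it only adds a few `print` calls for illustration.

```python
import numpy as np

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])

# vals[:, None] has shape (6, 1) and other[None, :] has shape (1, 4),
# so their difference broadcasts to a (6, 4) array of pairwise distances.
dist = np.abs(vals[:, None] - other[None, :])
print(dist.shape)  # (6, 4)

# Reducing along axis 1 collapses the 4 columns into a single boolean
# per entry of vals, giving a mask with the same length as vals.
mask = np.all(dist > 3, axis=1)
print(mask.shape)  # (6,)

result = vals[mask]
print(result.tolist())  # [14, 54, 215]
```

The temporary `dist` array grows as `len(vals) * len(other)`, which is why this version needs the most memory.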

Partially vectorized, intermediate memory usage

Another option is to build up the mask iteratively, once for each element in the "specific values" array. This iterates over the elements of the shorter "specific values" array (a.k.a. other in this case):

    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    # Start with everything kept, then knock out values near each entry of other
    mask = np.ones(len(vals), dtype=bool)
    for num in other:
        dist = np.abs(vals - num)
        mask &= dist > 3
    result = vals[mask]

Slowest, lowest memory usage

Finally, if you really want to reduce memory usage, you could iterate over every item in your large array:

    import numpy as np

    vals = np.array([14, 21, 48, 54, 92, 215])
    other = np.array([20, 50, 90, 210])

    # Only a small temporary of len(other) elements exists at any one time
    result = []
    for num in vals:
        if np.all(np.abs(num - other) > 3):
            result.append(num)

The temporary list in that case is likely to take up more memory than the mask in the previous version. However, you could avoid the temporary list by using np.fromiter if you wanted. The timing comparison below shows an example of this.

Timing comparisons

Let's compare the speed of these functions. We'll use 10,000,000 elements in the "large array" and 4 values in the "specific values" array. The relative speed and memory usage of these functions depend strongly on the sizes of the two arrays, so you should only consider this a vague guideline.
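For a sense of scale, the size of the floating-point distance temporaries can be estimated without allocating anything. This is only a rough sketch: it assumes double-precision inputs and ignores the boolean masks and the final result, and the variable names are just for illustration.

```python
import numpy as np

n_vals = 10_000_000   # length of the "large array"
n_other = 4           # length of the "specific values" array
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per double

# Fully vectorized: one distance array of shape (n_vals, n_other)
full_dist_bytes = n_vals * n_other * itemsize

# Partially vectorized: only one column of distances exists at a time
partial_dist_bytes = n_vals * itemsize

print(full_dist_bytes / 1e6, "MB")     # 320.0 MB
print(partial_dist_bytes / 1e6, "MB")  # 80.0 MB
```

With only 4 "specific values" the difference is modest, but the fully vectorized temporary grows linearly with the length of the shorter array as well.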

    import numpy as np

    vals = np.random.random(int(1e7))  # the size argument must be an integer
    other = np.array([0.1, 0.5, 0.8, 0.95])
    tolerance = 0.05

    def basic(vals, other, tolerance):
        # Fully vectorized: builds a (len(vals), len(other)) distance array
        dist = np.abs(vals[:, None] - other[None, :])
        mask = np.all(dist > tolerance, axis=1)
        return vals[mask]

    def intermediate(vals, other, tolerance):
        # Partially vectorized: loops over the shorter array only
        mask = np.ones(len(vals), dtype=bool)
        for num in other:
            dist = np.abs(vals - num)
            mask &= dist > tolerance
        return vals[mask]

    def slow(vals, other, tolerance):
        # Pure Python loop over vals; np.fromiter avoids a temporary list
        def func(vals, other, tolerance):
            for num in vals:
                if np.all(np.abs(num - other) > tolerance):
                    yield num
        return np.fromiter(func(vals, other, tolerance), dtype=vals.dtype)

In this case, the partially vectorized version wins out. That's to be expected in most cases where vals is significantly longer than other. However, the first example (basic) is still fast, and is arguably simpler.

    In [7]: %timeit basic(vals, other, tolerance)
    1 loops, best of 3: 1.45 s per loop

    In [8]: %timeit intermediate(vals, other, tolerance)
    1 loops, best of 3: 917 ms per loop

    In [9]: %timeit slow(vals, other, tolerance)
    1 loops, best of 3: 2min 30s per loop

Whichever way you choose to implement things, these are common vectorization "tricks" that show up in many problems. In high-level languages like Python, Matlab, R, etc. it's often useful to try fully vectorizing first, then mix vectorization and explicit loops if memory usage is an issue. Which one is best usually depends on the relative sizes of your inputs, but this is a common pattern to try when optimizing speed vs memory usage in high-level scientific programming.
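One more way to mix vectorization with an explicit loop, not among the three approaches above, is to process the large array in fixed-size blocks so that the temporary distance array never exceeds a chosen size. The sketch below is only an illustration under that assumption: `filter_in_chunks` and `chunk_size` are hypothetical names, and the chunk size would need tuning for real data.

```python
import numpy as np

def filter_in_chunks(vals, other, tolerance, chunk_size=100_000):
    """Hypothetical helper: keep entries of vals that are farther than
    tolerance from every entry of other, one block of vals at a time."""
    kept = []
    for start in range(0, len(vals), chunk_size):
        chunk = vals[start:start + chunk_size]
        # Temporary is at most (chunk_size, len(other)) elements
        dist = np.abs(chunk[:, None] - other[None, :])
        kept.append(chunk[np.all(dist > tolerance, axis=1)])
    if not kept:
        # Empty input: return an empty array of the same dtype
        return vals[:0]
    return np.concatenate(kept)

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])
result = filter_in_chunks(vals, other, 3, chunk_size=4)
print(result.tolist())  # [14, 54, 215]
```

Each block is still handled with vectorized operations, so the Python-level loop runs only `len(vals) / chunk_size` times.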
