Incredibly Fast Random Sampling in Python - Review

2019. 6. 14. 01:00 | Analysis Python/Numpy Tip

How can we make sampling fast?

 

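Sampling can be done with the numpy or random packages, but the author of the original article says they recently ran into a random sampling problem that neither package solved easily.

Concretely, the requirements were as follows.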

 

  • A specified sample size
  • A specified number of samples
  • Sampling without replacement
  • A specified probability of each element's inclusion in a given sample

import random
import numpy as np
# constants
num_elements = 20
num_samples = 1000
sample_size = 5
elements = np.arange(num_elements)
# probabilities should sum to 1
probabilities = np.random.random(num_elements)
probabilities /= np.sum(probabilities)
probs = probabilities

 

Method 1 — Native Python loops

Using the `random` package.

def native_loop(num_samples, sample_size, elements, probabilities):
    elements = list(elements) # because we later use .remove() method
    samples = []
    for _ in range(num_samples):
        sample = []
        candidate_elements = elements.copy()
        while len(sample) < sample_size:
            # choose an index as a candidate to include
            candidate_element = random.choice(candidate_elements)
            # accept the candidate with probability equal to its weight;
            # elements are 0..n-1 here, so each one doubles as its own index
            if probabilities[candidate_element] >= random.random():
                sample.append(candidate_element)
                candidate_elements.remove(candidate_element)
        samples.append(sample)
    return samples
    
    
    
%timeit native_loop(num_samples, sample_size, elements, probabilities)
## 146 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
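
The `%timeit` lines in this post are IPython magics, so they only work inside IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. `draw_once` below is a hypothetical stand-in for whichever function is being timed, not something from the article, and the sketch reports the fastest repeat rather than the mean ± std. dev. that `%timeit` prints.

```python
import random
import timeit


def draw_once(population, k):
    # Hypothetical stand-in for the function under test.
    return random.sample(population, k)


population = list(range(20))

# Each measurement runs the call 1000 times; take 7 measurements and keep
# the fastest one, which is least disturbed by other processes.
timings = timeit.repeat(lambda: draw_once(population, 5), number=1000, repeat=7)
best_per_call = min(timings) / 1000
print(f"best: {best_per_call * 1e6:.2f} µs per call")
```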

Method 2 — NumPy’s random choice method

Using NumPy's `np.random.choice`.

 

The running time goes down, but np.random.choice only generates one sample per call, so a Python loop over num_samples is still needed.

def numpy_choice(num_samples, sample_size, elements, probabilities):
    return np.asarray([
        np.random.choice(
            elements, sample_size, p=probabilities, replace=False
        ) for _ in range(num_samples)
    ])
    
    
%timeit numpy_choice(num_samples, sample_size, elements, probabilities)
## 92.3 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
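
To see what "one sample per call" means in practice, here is a minimal sketch. The uniform probabilities are an assumption made only to keep the example small. Passing a 2-D size does not produce independent rows: the whole block is a single draw without replacement.

```python
import numpy as np

elements = np.arange(20)
probabilities = np.full(20, 1 / 20)  # uniform, for illustration only

# One call yields one sample of 5 distinct elements.
one_sample = np.random.choice(elements, 5, p=probabilities, replace=False)

# A 2-D size is still a single draw without replacement: all 10 values
# are distinct across both rows, so the rows are not independent samples.
block = np.random.choice(elements, (2, 5), p=probabilities, replace=False)

# Asking for more values than the population holds raises ValueError.
try:
    np.random.choice(elements, (1000, 5), p=probabilities, replace=False)
    too_large_ok = True
except ValueError:
    too_large_ok = False
```

This is why Method 2 wraps the call in a list comprehension over num_samples.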

Method 3 — Multidimensional Shifting using NumPy

 

  • Replicates the probabilities as many times as the specified num_samples
  • Generates random numbers with the same shape and magnitude as probabilities
  • Shifts the probabilities of each row according to the randomly generated numbers
  • Identifies the largest sample_size number of elements in each row (each row is a sample)
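
The last step does not need a full sort. A minimal sketch of how `np.argpartition` picks the k smallest entries per row, using a small hand-written array (the values and the name `scores` are made up for illustration):

```python
import numpy as np

# Made-up scores: two rows (samples), six candidate elements each.
scores = np.array([
    [0.9, 0.1, 0.5, 0.3, 0.7, 0.2],
    [0.4, 0.8, 0.0, 0.6, 0.1, 0.9],
])
k = 3

# After partitioning, the first k columns of each row hold the indices
# of the k smallest scores, in no particular order.
smallest_idx = np.argpartition(scores, k, axis=1)[:, :k]

# Sorting the indices only makes the result easier to compare by eye.
smallest_sorted = np.sort(smallest_idx, axis=1)
print(smallest_sorted)
```

Finding the smallest values of the negated quantity is the same as finding the largest values of the original, which is how the function below uses it.
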
def multidimensional_shifting(num_samples, sample_size, elements, probabilities):
    # replicate probabilities as many times as `num_samples`
    ## result has shape (num_samples, num_elements)
    replicated_probabilities = np.tile(probabilities, (num_samples, 1))
    # get random shifting numbers & scale them correctly
    ## draw random numbers of the same shape and normalize each row to sum to 1
    random_shifts = np.random.random(replicated_probabilities.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    # shift by numbers & find largest (by finding the smallest of the negative)
    ## subtract the probabilities from the random shifts, then take the
    ## sample_size smallest values in each row
    shifted_probabilities = random_shifts - replicated_probabilities
    ## returns element indices, which equal the elements because elements = np.arange(...)
    return np.argpartition(shifted_probabilities, sample_size, axis=1)[:, :sample_size]
    
    
%timeit multidimensional_shifting(num_samples, sample_size, elements, probabilities)
## 767 µs ± 60 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
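
The timing alone says nothing about how often each element actually lands in a sample. One way to look at that is to count inclusion frequencies over many samples and print them next to the input probabilities. The sketch below restates the same steps so that it runs on its own. The seed, the sample count, and the use of `np.random.default_rng` are my own choices, not part of the article, and the printout is only an eyeball check, not a statistical test.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

num_elements, num_samples, sample_size = 20, 50_000, 5
probabilities = rng.random(num_elements)
probabilities /= probabilities.sum()

# Same steps as multidimensional_shifting, restated so the block runs alone.
random_shifts = rng.random((num_samples, num_elements))
random_shifts /= random_shifts.sum(axis=1, keepdims=True)
shifted = random_shifts - probabilities  # broadcasting stands in for np.tile
samples = np.argpartition(shifted, sample_size, axis=1)[:, :sample_size]

# Fraction of samples in which each element appears.
counts = np.bincount(samples.ravel(), minlength=num_elements)
inclusion_freq = counts / num_samples

# Print elements from least to most probable, with their observed frequency.
for i in np.argsort(probabilities):
    print(f"element {i:2d}  p={probabilities[i]:.3f}  freq={inclusion_freq[i]:.3f}")
```

Each sample holds sample_size elements, so the frequencies sum to sample_size rather than to 1.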

 

Hmm, this kind of random sampling will probably come in handy someday.

 

https://medium.com/ibm-watson/incredibly-fast-random-sampling-in-python-baf154bd836a

 
