Incredibly Fast Random Sampling in Python - Review
2019. 6. 14. 01:00 · Analysis Python/Numpy Tip
How can we make sampling fast?
It can be done with the numpy or random packages, but
the author says they recently ran into a random sampling problem that could not be solved easily.
Specifically, the requirements are as follows.
- A specified sample size
- A specified number of samples
- Sampling without replacement
- A specified inclusion probability for each element in a given sample
import random
import numpy as np
# constants
num_elements = 20
num_samples = 1000
sample_size = 5
elements = np.arange(num_elements)
# probabilities should sum to 1
probabilities = np.random.random(num_elements)
probabilities /= np.sum(probabilities)
probs = probabilities
Method 1 — Native Python loops
Using the `random` package
def native_loop(num_samples, sample_size, elements, probabilities):
    elements = list(elements)  # because we later use .remove() method
    samples = []
    for _ in range(num_samples):
        sample = []
        candidate_elements = elements.copy()
        while len(sample) < sample_size:
            # choose an index as a candidate to include
            candidate_element = random.choice(candidate_elements)
            # decide whether to include in the sample or not
            # (use the `probabilities` argument, not the global `probs`)
            if probabilities[candidate_element] >= random.random():
                sample.append(candidate_element)
                candidate_elements.remove(candidate_element)
        samples.append(sample)
    return samples
%timeit native_loop(num_samples, sample_size, elements, probabilities)
## 146 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
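`%timeit` is an IPython/Jupyter magic, so the timing lines in this post do not run in a plain Python script. A minimal sketch of a rough equivalent with the standard `timeit` module follows; `time_call` is a hypothetical helper name, and `np.random.permutation` is only a stand-in for whichever sampling function you want to time.

```python
import timeit

import numpy as np


def time_call(func, *args, repeat=3, number=5):
    """Return the best average seconds per call over `repeat` runs.

    Hypothetical helper: `repeat` and `number` are illustrative defaults.
    """
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


elements = np.arange(20)
# stand-in workload; replace with e.g. native_loop and its arguments
seconds = time_call(np.random.permutation, elements)
print(f"{seconds * 1e6:.1f} µs per call")
```

The numbers will differ from the `%timeit` output, which picks the loop count automatically and reports mean and standard deviation.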
Method 2 — NumPy’s random choice method
Using NumPy's random choice
This takes less time, but np.random.choice draws only one sample per call, so a loop over num_samples is still needed.
def numpy_choice(num_samples, sample_size, elements, probabilities):
    return np.asarray([
        np.random.choice(
            elements, sample_size, p=probabilities, replace=False
        ) for _ in range(num_samples)
    ])
%timeit numpy_choice(num_samples, sample_size, elements, probabilities)
## 92.3 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
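For comparison, the same loop can be written against NumPy's newer `Generator` API (`np.random.default_rng`). This is only a sketch and not part of the original article: `generator_choice` and its `seed` parameter are hypothetical names, and no timing is claimed for it here. It still draws one sample per call.

```python
import numpy as np


def generator_choice(num_samples, sample_size, elements, probabilities, seed=None):
    # `seed` is an assumed convenience parameter for reproducible runs
    rng = np.random.default_rng(seed)
    return np.asarray([
        # one weighted draw without replacement per iteration
        rng.choice(elements, size=sample_size, replace=False, p=probabilities)
        for _ in range(num_samples)
    ])


elements = np.arange(20)
probabilities = np.full(20, 1 / 20)  # uniform weights, just for the demo
samples = generator_choice(1000, 5, elements, probabilities, seed=0)
print(samples.shape)  # (1000, 5)
```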
Method 3 — Multidimensional Shifting using NumPy
- Replicates the probabilities as many times as the specified num_samples
- Generates random numbers with the same shape and magnitude as the replicated probabilities (each row is normalized to sum to 1)
- Shifts the probabilities of each row according to the randomly generated numbers
- Identifies the sample_size largest elements in each row (each row is a sample)
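The last step relies on `np.argpartition`, which can pick the k smallest (or, after negating, the k largest) entries of every row at once. A small sketch with made-up `scores` values, purely for illustration:

```python
import numpy as np

# hypothetical values, one row per sample
scores = np.array([
    [0.9, 0.1, 0.5, 0.3, 0.7],
    [0.2, 0.8, 0.4, 0.6, 0.0],
])
k = 2

# the first k columns hold the indices of the k smallest values per row;
# the order among those k indices is not guaranteed
smallest_k = np.argpartition(scores, k, axis=1)[:, :k]

# negating the array turns "k smallest" into "k largest"
largest_k = np.argpartition(-scores, k, axis=1)[:, :k]

# sort only to make the printed result stable
print(np.sort(smallest_k, axis=1))
# [[1 3]
#  [0 4]]
print(np.sort(largest_k, axis=1))
# [[0 4]
#  [1 3]]
```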
def multidimensional_shifting(num_samples, sample_size, elements, probabilities):
    # replicate probabilities as many times as `num_samples`
    ## tile with reps (num_samples, 1) -> shape (num_samples, num_elements)
    replicated_probabilities = np.tile(probabilities, (num_samples, 1))
    # get random shifting numbers & scale them correctly
    ## draw random numbers of that shape and normalize each row to sum to 1
    random_shifts = np.random.random(replicated_probabilities.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    # shift by numbers & find largest (by finding the smallest of the negative)
    ## subtract the probabilities from the random shifts, then take the sample_size smallest
    shifted_probabilities = random_shifts - replicated_probabilities
    return np.argpartition(shifted_probabilities, sample_size, axis=1)[:, :sample_size]
%timeit multidimensional_shifting(num_samples, sample_size, elements, probabilities)
## 767 µs ± 60 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
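The speed-up is large, so it seems worth checking what the function above actually returns. The sketch below is not from the original article: it repeats the function so it runs on its own, then checks that no row contains a duplicate and measures how often each element is included. It does not show that the result follows the same distribution as `np.random.choice(..., replace=False)`; it only gives numbers to look at.

```python
import numpy as np


def multidimensional_shifting(num_samples, sample_size, elements, probabilities):
    # same logic as the function above, repeated so this sketch is self-contained
    replicated = np.tile(probabilities, (num_samples, 1))
    random_shifts = np.random.random(replicated.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    shifted = random_shifts - replicated
    return np.argpartition(shifted, sample_size, axis=1)[:, :sample_size]


num_elements, num_samples, sample_size = 20, 10000, 5
elements = np.arange(num_elements)
probabilities = np.random.random(num_elements)
probabilities /= probabilities.sum()

samples = multidimensional_shifting(num_samples, sample_size, elements, probabilities)

# sampling without replacement: each row must hold distinct indices
sorted_rows = np.sort(samples, axis=1)
no_duplicates = bool(np.all(sorted_rows[:, 1:] != sorted_rows[:, :-1]))

# fraction of samples in which each element appears
inclusion_freq = np.bincount(samples.ravel(), minlength=num_elements) / num_samples

print("no duplicates:", no_duplicates)
print("correlation with probabilities:",
      np.corrcoef(probabilities, inclusion_freq)[0, 1])
```

Note that the function returns column indices rather than values from `elements`. The two coincide here only because `elements = np.arange(num_elements)`; for any other array, index it with the result, as in `elements[samples]`.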
Hmm, I suppose this kind of random sampling will come in handy someday?
https://medium.com/ibm-watson/incredibly-fast-random-sampling-in-python-baf154bd836a