🌍

ba2.4.3.3. title: Monitoring(‣) the difference between the data used for training and newly arriving data(‣) makes it possible to monitor model drift(‣).

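Model drift (from2) ultimately means that the data used for training and the newly incoming data have diverged (ref. 3, 4). Therefore, monitoring (from1) the difference between the data used for training, which is called the baseline dataset (‣ Baseline dataset) (ref. 6), and the newly arriving data, which is called the target dataset (‣ Target dataset) (ref. 5), amounts to monitoring model drift (ref. 2).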

Code (ref. 1): The code below is a simple example of model drift monitoring with scikit-multiflow, using its ADWIN (adaptive windowing) detector on a data stream.

import numpy as np
from skmultiflow.drift_detection.adwin import ADWIN

adwin = ADWIN()

# Simulate a data stream of uniformly random 0s and 1s
data_stream = np.random.randint(2, size=2000)

# Artificially shift the data from index 999 onward
# by replacing each value with a larger one (an integer from 5 to 9)
for i in range(999, 2000):
    data_stream[i] = np.random.randint(5, high=10)

previous_variance = 0
# Feed the stream to ADWIN one element at a time and report detected drift
for i in range(2000):
    adwin.add_element(data_stream[i])
    if adwin.detected_change():
        print(f"Change detected in value {data_stream[i]}, at index {i}")
        print(f"Current variance: {adwin.variance}. Previous variance: {previous_variance}")
    previous_variance = adwin.variance
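The ADWIN example above watches a single stream. The baseline-versus-target comparison described earlier can also be sketched as a batch check. The following is a minimal sketch, not the method used by the references: it swaps in a two-sample Kolmogorov-Smirnov statistic with the usual large-sample critical value, and the function names `ks_statistic` and `has_drifted`, the `alpha` parameter, and the generated data are all hypothetical.

```python
import bisect
import math
import random


def ks_statistic(baseline, target):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    b_sorted, t_sorted = sorted(baseline), sorted(target)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in b_sorted + t_sorted:
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        cdf_t = bisect.bisect_right(t_sorted, x) / len(t_sorted)
        gap = max(gap, abs(cdf_b - cdf_t))
    return gap


def has_drifted(baseline, target, alpha=0.01):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(baseline), len(target)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, target) > critical


rng = random.Random(0)
# Hypothetical values of one feature: training-time baseline and a shifted target batch.
baseline = [rng.gauss(0.0, 1.0) for _ in range(1000)]
target = [rng.gauss(2.0, 1.0) for _ in range(1000)]

print(has_drifted(baseline, target))    # target batch with a shifted mean
print(has_drifted(baseline, baseline))  # identical data, so the statistic is 0
```

A check like this covers one numeric feature at a time. A real pipeline would presumably run it per feature and per time window, which is where the timestamp column mentioned in ref. 5 comes in.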

parse me : Material that might be useful in this note someday.

None

from : Which past ideas produced this idea?

supplementary : Which new ideas support the idea written in this document?

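ba2.4.3.3.1. title:
A baseline can mean the minimum level of performance a model must achieve, but in the context of data drift it also means the threshold for how much the statistical properties of the data may acceptably differ from the conditions under which the model was trained. The baseline is built from the dataset(‣) used for training.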
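As a rough illustration of a baseline in this second sense, the sketch below builds a profile from training values and accepts an incoming batch only while its mean stays within a tolerated band. The `Baseline` class, the `build_baseline` helper, the `tolerance` of 0.5 standard deviations, and the sample values are all hypothetical choices made for this sketch. The references do not prescribe a specific statistic or threshold.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Baseline:
    mean: float
    stdev: float
    tolerance: float  # allowed shift of the batch mean, in baseline standard deviations

    def accepts(self, batch):
        """True when the batch mean stays within the tolerated band."""
        shift = abs(statistics.fmean(batch) - self.mean)
        return shift <= self.tolerance * self.stdev


def build_baseline(training_values, tolerance=0.5):
    """Derive the baseline profile from the training data only."""
    return Baseline(
        mean=statistics.fmean(training_values),
        stdev=statistics.stdev(training_values),
        tolerance=tolerance,
    )


# Hypothetical values of one feature seen during training.
training_values = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0, 11.0]
baseline = build_baseline(training_values)

print(baseline.accepts([10.5, 11.5, 11.0]))  # batch mean close to the training mean
print(baseline.accepts([15.0, 16.0, 14.5]))  # batch mean far from the training mean
```

The point of the design is that the threshold is fixed at training time, so every later batch is judged against the same reference rather than against the previous batch.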

opposite : Which new ideas contrast with the idea written in this document?

None

to : Which ideas does the idea written in this document develop into and lead to?

None

ref : References

The scikit-multiflow package can detect data drift using an algorithm known as adaptive windowing (ADWIN) that detects data drift over a stream of data.

Since both drifts involve a statistical change in the data, the best approach to detect them is by monitoring its statistical properties, the model’s predictions, and their correlation with other factors.

Data quality monitoring establishes a profile of the input data during model training, and then continuously compares incoming data with the profile.

chapter6, The first is the target dataset. This can be the same dataset you used to train a model, although special care needs to be put into ensuring that the number (and order) of features doesn’t change. The second one is the baseline. The baseline determines what differences may (or may not) be acceptable when a model gets trained.

Target dataset - usually model input data - is compared over time to your baseline dataset. This comparison means that your target dataset must have a timestamp column specified.

Baseline dataset - usually the training dataset for a model.