[Python] Data Analysis Basics Assignment

리베린 2024. 1. 26. 18:49

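### Assignment Goal

- The given data is the number of pushes to GitHub public repositories (developers' code repositories) over one year

- Records for GitHub's public repositories are open to everyone and are also stored as a database in BigQuery

- The data below covers February 1, 2019 to January 14, 2020: the number of pushes (code updates) per day for about one year

- The data is aggregated, so individual users are hard to identify, and because it covers the whole world there are hundreds of thousands of pushes per day

- The goal of this assignment is to **check whether the push count differs meaningfully by day of week**

- **Basic preprocessing of the data** is the assignment itself; the actual statistical analysis is provided as an example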

import time

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import statsmodels.api as sm

import scipy.stats as stats

from statsmodels.stats.stattools import durbin_watson

Next, load the data
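The loading step itself is not shown in this post. A minimal sketch of what it could look like, assuming a CSV with `log_date` and `push_count` columns (the column names used in the rest of the post); the four rows below are made up for illustration and are not the assignment data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the assignment file; the real data is not shown in the post.
csv_text = """log_date,push_count
2019-02-01,100
2019-02-02,60
2019-02-03,55
2019-02-04,110
"""

# pd.read_csv accepts a file path or any file-like object;
# StringIO keeps this sketch self-contained.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # log_date is still text here, not a datetime column
print(df.head())
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.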

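Q1. Date preprocessing

Q1.1. Changing the date format

Check the column types with df.dtypes

df['log_date'] = pd.to_datetime(df['log_date'], format = '%Y-%m-%d')  # %Y = four-digit year; %y would expect a two-digit year

This converts the date column to datetime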
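The format string has to match how the year is written in the file. A small sketch of the difference between the two year directives, using made-up sample values (the exact date format of the assignment file is an assumption here):

```python
import pandas as pd

# Made-up sample values; the real file's date format is an assumption.
raw = pd.Series(["2019-02-01", "2020-01-14"])

# %Y parses a four-digit year such as "2019".
parsed = pd.to_datetime(raw, format="%Y-%m-%d")
print(parsed.dtype)

# %y parses a two-digit year such as "19".
short = pd.to_datetime(pd.Series(["19-02-01"]), format="%y-%m-%d")
print(short.iloc[0])
```

Both calls end up with the same kind of datetime column; only the expected input text differs.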

Q1.2. Add 'day_of_week' as a number from the converted date column

# Extract the day of week (day_of_week) as a number from the converted date column (Monday=0, Sunday=6)

df = df.assign(day_of_week = df['log_date'].dt.dayofweek)
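To double-check the Monday=0 mapping, here is a small sketch with three made-up dates. The `day_name` column is an extra added only for this illustration.

```python
import pandas as pd

# Made-up dates: 2019-02-01 was a Friday, 2019-02-03 a Sunday, 2019-02-04 a Monday.
df = pd.DataFrame(
    {"log_date": pd.to_datetime(["2019-02-01", "2019-02-03", "2019-02-04"])}
)

# .dt.dayofweek returns integers with Monday=0 ... Sunday=6.
df = df.assign(day_of_week=df["log_date"].dt.dayofweek)

# .dt.day_name() gives a readable label, handy for checking the mapping.
df = df.assign(day_name=df["log_date"].dt.day_name())

print(df)
```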

Q2. Aggregating statistics with GROUP BY

## Extract the mean and median push count per day of week

push_count_by_dow = df.groupby('day_of_week')['push_count'].agg(['mean','median'])

push_count_by_dow = push_count_by_dow.sort_index()

display(push_count_by_dow)
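What the aggregation returns is easier to see on a tiny table. A sketch with made-up numbers (not the assignment data):

```python
import pandas as pd

# Made-up values for two weekdays: Monday (0) and Saturday (5).
df = pd.DataFrame({
    "day_of_week": [0, 0, 0, 5, 5, 5],
    "push_count": [100, 120, 200, 40, 50, 90],
})

# One row per weekday, one column per statistic.
summary = df.groupby("day_of_week")["push_count"].agg(["mean", "median"])

print(summary)
```

`groupby` sorts the group keys by default, so the extra `sort_index()` is harmless but not strictly needed. Also, `display` is only available in Jupyter/IPython; in a plain script `print` does the same job.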

Q3. Bar chart visualization

https://giveme-happyending.tistory.com/168 for when Korean text shows up garbled!

plt.figure(figsize=(10,6))

plt.bar(push_count_by_dow.index, push_count_by_dow['mean'], color = 'green')

plt.xlabel('요일')

plt.ylabel('평균 푸시 횟수')

plt.title('요일별 평균 푸시 횟수')

plt.xticks(np.arange(7), ['월','화','수','목','금','토','일'])

plt.legend(['평균 푸시 횟수'])

plt.show()
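The linked post is not reproduced here. One common way to get Korean labels to render is to point matplotlib at an installed Korean font before plotting. The sketch below is an assumption about how that could be done, not necessarily what the linked post describes; the font names in `CANDIDATES` and the helper `pick_korean_font` are hypothetical, and which fonts exist depends on the machine.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; drop in a notebook

import matplotlib.pyplot as plt
from matplotlib import font_manager

# Candidate font names are assumptions; availability depends on the OS.
CANDIDATES = ["Malgun Gothic", "AppleGothic", "NanumGothic", "Noto Sans CJK KR"]


def pick_korean_font(candidates=CANDIDATES):
    """Return the first candidate that matplotlib has registered, or None."""
    installed = {entry.name for entry in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None


font_name = pick_korean_font()
if font_name is not None:
    plt.rcParams["font.family"] = font_name

# Some CJK fonts lack the Unicode minus glyph; use the ASCII hyphen on axes instead.
plt.rcParams["axes.unicode_minus"] = False

print("Using font:", font_name)
```

If none of the candidates is installed, `font_name` stays `None` and the default font is left untouched, so the Korean labels would still be garbled until a suitable font is installed.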

Q4. Removing outliers

## Function that detects and removes outliers based on the z-score

def z_score_outlier_remover(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    total_outlier_count = 0
    if threshold <= 0:
        raise ValueError("Threshold must be larger than zero")
    while True:
        # Compute the mean and standard deviation
        m = df['push_count'].mean()
        s = df['push_count'].std()
        # z-score = (value - mean) / standard deviation
        z_scores = (df['push_count'] - m) / s
        ## Boolean Series marking whether each push_count value is an outlier
        ## (abs takes the absolute value)
        ser_outlier_bool = abs(z_scores) > threshold
        # Count the outliers found in this pass
        outlier_count = ser_outlier_bool.sum()
        # If outliers exist, add them to the running total and remove them
        # (same as total_outlier_count = total_outlier_count + outlier_count)
        if outlier_count > 0:
            total_outlier_count += outlier_count
            df = df[~ser_outlier_bool].reset_index(drop=True)
        ## If no outliers remain
        else:
            ## Print the total only if at least one outlier was removed, then stop
            if total_outlier_count > 0:
                print(f"The number of outliers(z-score > {threshold}): {total_outlier_count}")
            break
    return df
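As far as I can tell, the reason for the `while True` loop is that removing an extreme value shrinks the mean and standard deviation, which can make a milder value stand out on the next pass. A small sketch with made-up numbers that shows this happening; `zscore_mask` is a hypothetical helper written only for this example.

```python
import pandas as pd


def zscore_mask(values: pd.Series, threshold: float) -> pd.Series:
    """True where |z-score| exceeds the threshold (hypothetical helper)."""
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold


# Made-up data: twenty ordinary values, one mild outlier (30), one extreme outlier (1000).
values = pd.Series([9, 11] * 10 + [30, 1000])

removed_per_pass = []
while True:
    mask = zscore_mask(values, threshold=3)
    if not mask.any():
        break
    removed_per_pass.append(values[mask].tolist())
    values = values[~mask].reset_index(drop=True)

print(removed_per_pass)  # [[1000], [30]]
print(len(values))       # 20
```

On the first pass the value 1000 inflates the standard deviation so much that 30 looks normal; only after 1000 is gone does 30 exceed the threshold.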

df = df[~ser_outlier_bool].reset_index(drop=True)

I don't understand this part

Looks like I really need to study statistics harder ㅠㅠㅠㅠ
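For reference, that line seems to be more about pandas than statistics: it filters rows with a boolean mask and then renumbers the index. A step-by-step sketch on a tiny made-up table, where a simple `> 100` rule stands in for the z-score test:

```python
import pandas as pd

df = pd.DataFrame({"push_count": [10, 500, 12, 11]})

# Boolean Series: True marks the rows to drop
# (a made-up rule standing in for the z-score test).
ser_outlier_bool = df["push_count"] > 100
print(ser_outlier_bool.tolist())    # [False, True, False, False]

# ~ flips every True/False, so the mask now marks the rows to keep.
kept = df[~ser_outlier_bool]
print(kept.index.tolist())          # [0, 2, 3]  (row 1 is gone, the labels keep the gap)

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
kept = kept.reset_index(drop=True)
print(kept.index.tolist())          # [0, 1, 2]
print(kept["push_count"].tolist())  # [10, 12, 11]
```

Without `drop=True`, the old index would be kept as an extra column instead of being thrown away.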
