# Ch. 2 - End-to-End Machine Learning Project
From here on, I am a data scientist newly hired by a real-estate company.
My task is to use the California census data to build a model that predicts the median housing price for each district in California.
Information contained in the data:
- population of each block group (also called a district)
- median income
- median housing price
- etc.
The project will roughly proceed as follows:
- Look at the big picture
- Get the data
- Explore and visualize the data to discover patterns
- Prepare the data for machine learning algorithms
- Select a model and train it
- Fine-tune the model
- Present the solution
- Launch, monitor, and maintain the system
## Framing the Problem
Before framing the problem, we need some information to guide the design.
### 'What exactly is the business objective?'
- How does the company expect to use the model and benefit from it?
- This question matters because it determines how to frame the problem, which algorithm to select, which performance measure to use for evaluation, and how much effort to spend tuning the model.
### 'What does the current solution look like?'
- How is the problem being solved today, without a machine learning model? (useful as a reference)
- Example: district median housing prices are currently estimated manually by experts, and their estimates are off by around 10% of the actual price, which is a problem.
- This suggests that training a model to predict a district's median housing price from the district's data would be useful.
With that information in hand, let's frame the problem and design the system.
What kind of learning will our system use?
- Supervised, unsupervised, or reinforcement learning?
- Classification, regression, or something else?
- Batch learning or online learning?
In our example the training samples are labeled: each sample comes with the expected output (the district's median housing price).
-> So supervised learning is appropriate.
Also, we are not classifying new samples into a fixed set of labels; the system must predict a numeric target value for each new sample.
-> So regression is appropriate, and since the prediction uses multiple features, this is a multiple regression problem.
There is no continuous stream of data coming into the system, there is no need to adapt rapidly to changing data, and the data is small enough to fit in memory.
-> So batch learning is appropriate.
## Selecting a Performance Measure
This is the question of which function, commonly called a loss function, to use; there are many candidates, but the detailed mathematics is skipped for now.
A loss function compares the predictions produced during training with the actual values (the labels); the smaller its value, the smaller the error.
For regression problems the root mean square error (RMSE) is the usual choice, so that is the function we will use.
$\mathrm{RMSE}(\mathbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)})-y^{(i)}\right)^2}$
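To make the formula concrete, here is a minimal NumPy sketch of RMSE (not from the original notebook; the arrays `y_true` and `y_pred` are made-up placeholders):

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error: sqrt of the mean of squared prediction errors
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

# toy example with made-up numbers
y_true = np.array([200000., 350000., 120000.])
y_pred = np.array([210000., 330000., 125000.])
print(rmse(y_true, y_pred))
```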
Now let's fetch the data.
```python
# Python 2 and Python 3 support
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# Seed the pseudo-random number generator for reproducible output
np.random.seed(42)

# Matplotlib setup
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Korean font for plot labels
matplotlib.rc('font', family='NanumBarunGothic')
plt.rcParams['axes.unicode_minus'] = False

# Folder where figures are saved
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    os.makedirs(IMAGES_PATH, exist_ok=True)  # make sure the folder exists
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
```
```python
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
```
fetch_housing_data()
```python
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
```
Now that the data is loaded, let's take a quick look at its structure.
```python
housing = load_housing_data()
housing.head()
```
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
```
housing["ocean_proximity"].value_counts()
<1H OCEAN 9136 INLAND 6551 NEAR OCEAN 2658 NEAR BAY 2290 ISLAND 5 Name: ocean_proximity, dtype: int64
housing.describe()
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
```python
%matplotlib inline
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show()
```
attribute_histogram_plots
```python
# Seed the pseudo-random number generator for reproducible output
np.random.seed(42)
```
## Creating a Test Set
As we saw earlier, a dataset is typically split into a training set and a test set.
The samples used for training keep their answer (the median house value), while the samples used for testing have the answer removed before they are fed to the model.
The removed answers are not thrown away: they are kept aside and later compared against the predictions on the test set to measure the error.
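As a rough sketch of that idea (not part of the original notebook; it uses a deliberately naive "predict the training median" baseline purely for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

train, test = train_test_split(housing, test_size=0.2, random_state=42)

# labels stay with the training set ...
y_train = train["median_house_value"]
# ... and are held out from the test features
X_test = test.drop("median_house_value", axis=1)
y_test = test["median_house_value"]  # kept aside for evaluation

# naive "model": always predict the training median
baseline_pred = np.full(len(X_test), y_train.median())
print(np.sqrt(mean_squared_error(y_test, baseline_pred)))  # RMSE against the held-out labels
```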
```python
import numpy as np

# For illustration only: Scikit-Learn provides train_test_split() for this.
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
```
```python
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
```
16512 train + 4128 test
```python
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
```
```python
import hashlib

# Hash-based variant of test_set_check(); this version also works on Python 2.
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
```
```python
housing_with_id = housing.reset_index()   # adds an `index` column to the DataFrame
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
```
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
test_set.head()
ㅤ | index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | id |
8 | 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY | -122222.16 |
10 | 10 | -122.26 | 37.85 | 52.0 | 2202.0 | 434.0 | 910.0 | 402.0 | 3.2031 | 281500.0 | NEAR BAY | -122222.15 |
11 | 11 | -122.26 | 37.85 | 52.0 | 3503.0 | 752.0 | 1504.0 | 734.0 | 3.2705 | 241800.0 | NEAR BAY | -122222.15 |
12 | 12 | -122.26 | 37.85 | 52.0 | 2491.0 | 474.0 | 1098.0 | 468.0 | 3.0750 | 213500.0 | NEAR BAY | -122222.15 |
13 | 13 | -122.26 | 37.84 | 52.0 | 696.0 | 191.0 | 345.0 | 174.0 | 2.6736 | 191300.0 | NEAR BAY | -122222.16 |
housing["median_income"].hist()
<AxesSubplot:>
median_income
```python
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label everything above 5 as 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
```
housing["income_cat"].value_counts()
3.0 7236 2.0 6581 4.0 3639 5.0 2362 1.0 822 Name: income_cat, dtype: int64
housing["income_cat"].hist()save_fig('income_category_hist')
income_category_hist
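As an aside (not in the original notebook), the same five income categories can also be expressed with `pd.cut`; the bin edges below are my assumption of the equivalent cut points, and the result is stored in a separate variable so it does not interfere with the `income_cat` column used later:

```python
# same bins as above: (0, 1.5] -> 1, (1.5, 3] -> 2, (3, 4.5] -> 3, (4.5, 6] -> 4, (6, inf) -> 5
income_cat_alt = pd.cut(housing["median_income"],
                        bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                        labels=[1, 2, 3, 4, 5])
income_cat_alt.value_counts()
```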
Scikit-Learn provides several ways to split a dataset into training and test sets. Below, StratifiedShuffleSplit is used to perform stratified sampling on the income categories.
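Side note (my addition, not from the original notebook): the same stratified split can also be obtained in a single call via the `stratify` argument of `train_test_split`:

```python
from sklearn.model_selection import train_test_split

# stratified split on income_cat in one call (alternative to StratifiedShuffleSplit)
strat_train_alt, strat_test_alt = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```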
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3.0 0.350533 2.0 0.318798 4.0 0.176357 5.0 0.114341 1.0 0.039971 Name: income_cat, dtype: float64
housing["income_cat"].value_counts() / len(housing)
3.0 0.350581 2.0 0.318847 4.0 0.176308 5.0 0.114438 1.0 0.039826 Name: income_cat, dtype: float64
```python
from sklearn.model_selection import train_test_split

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
```
compare_props
ㅤ | Overall | Stratified | Random | Rand. %error | Strat. %error |
1.0 | 0.039826 | 0.039971 | 0.040213 | 0.973236 | 0.364964 |
2.0 | 0.318847 | 0.318798 | 0.324370 | 1.732260 | -0.015195 |
3.0 | 0.350581 | 0.350533 | 0.358527 | 2.266446 | -0.013820 |
4.0 | 0.176308 | 0.176357 | 0.167393 | -5.056334 | 0.027480 |
5.0 | 0.114438 | 0.114341 | 0.109496 | -4.318374 | -0.084674 |
```python
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```
## Exploring and Visualizing the Data to Gain Insights
Let's visualize the data and look for rough patterns.
housing = strat_train_set.copy()
ax = housing.plot(kind="scatter", x="longitude", y="latitude")ax.set(xlabel='경도', ylabel='위도')save_fig("bad_visualization_plot")
bad_visualization_plot
```python
# A scatterplot that shows dense areas more clearly
ax = housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
ax.set(xlabel='경도', ylabel='위도')
save_fig("better_visualization_plot")
```
better_visualization_plot
```python
# Low prices in blue -> high prices in red
ax = housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                  s=housing["population"]/100, label="인구", figsize=(10,7),
                  c="median_house_value", cmap=plt.get_cmap("jet"),
                  colorbar=True, sharex=False)
ax.set(xlabel='경도', ylabel='위도')
plt.legend()
save_fig("housing_prices_scatterplot")
```
housing_prices_scatterplot
### Looking for Correlations
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value 1.000000 median_income 0.687151 total_rooms 0.135140 housing_median_age 0.114146 households 0.064590 total_bedrooms 0.047781 population -0.026882 longitude -0.047466 latitude -0.142673 Name: median_house_value, dtype: float64
```python
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
```
scatter_matrix_plot
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)plt.axis([0, 16, 0, 550000])save_fig("income_vs_house_value_scatterplot")
income_vs_house_value_scatterplot
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]housing["population_per_household"]=housing["population"]/housing["households"]
```python
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
```
median_house_value 1.000000 median_income 0.687151 rooms_per_household 0.146255 total_rooms 0.135140 housing_median_age 0.114146 households 0.064590 total_bedrooms 0.047781 population_per_household -0.021991 population -0.026882 longitude -0.047466 latitude -0.142673 bedrooms_per_room -0.259952 Name: median_house_value, dtype: float64
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value", alpha=0.2)plt.axis([0, 5, 0, 520000])plt.show()
scatter_x_rooms_per_household_y_median_house_value
housing.describe()
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household | bedrooms_per_room | population_per_household |
count | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16354.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16354.000000 | 16512.000000 |
mean | -119.575635 | 35.639314 | 28.653404 | 2622.539789 | 534.914639 | 1419.687379 | 497.011810 | 3.875884 | 207005.322372 | 5.440406 | 0.212873 | 3.096469 |
std | 2.001828 | 2.137963 | 12.574819 | 2138.417080 | 412.665649 | 1115.663036 | 375.696156 | 1.904931 | 115701.297250 | 2.611696 | 0.057378 | 11.584825 |
min | -124.350000 | 32.540000 | 1.000000 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 0.499900 | 14999.000000 | 1.130435 | 0.100000 | 0.692308 |
25% | -121.800000 | 33.940000 | 18.000000 | 1443.000000 | 295.000000 | 784.000000 | 279.000000 | 2.566950 | 119800.000000 | 4.442168 | 0.175304 | 2.431352 |
50% | -118.510000 | 34.260000 | 29.000000 | 2119.000000 | 433.000000 | 1164.000000 | 408.000000 | 3.541550 | 179500.000000 | 5.232342 | 0.203027 | 2.817661 |
75% | -118.010000 | 37.720000 | 37.000000 | 3141.000000 | 644.000000 | 1719.000000 | 602.000000 | 4.745325 | 263900.000000 | 6.056361 | 0.239816 | 3.281420 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6210.000000 | 35682.000000 | 5358.000000 | 15.000100 | 500001.000000 | 141.909091 | 1.000000 | 1243.333333 |
## Preparing the Data for Machine Learning Algorithms
This stage is commonly called data preprocessing.
We used the "garbage in, garbage out" analogy earlier; this is the step where garbage values are turned into useful ones.
### Data Cleaning
Some rows are missing values for certain features. In that case we can take one of the following actions, chosen case by case:
- drop the affected rows
- drop the whole feature (column) that has missing values
- fill the missing values with some value (zero, the mean, the median, and so on)
```python
housing = strat_train_set.drop("median_house_value", axis=1)  # drop labels for the training set
housing_labels = strat_train_set["median_house_value"].copy()
```
```python
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
```
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
1606 | -122.08 | 37.88 | 26.0 | 2947.0 | NaN | 825.0 | 626.0 | 2.9330 | NEAR BAY |
10915 | -117.87 | 33.73 | 45.0 | 2264.0 | NaN | 1970.0 | 499.0 | 3.4193 | <1H OCEAN |
19150 | -122.70 | 38.35 | 14.0 | 2313.0 | NaN | 954.0 | 397.0 | 3.7813 | <1H OCEAN |
4186 | -118.23 | 34.13 | 48.0 | 1308.0 | NaN | 835.0 | 294.0 | 4.2891 | <1H OCEAN |
16885 | -122.40 | 37.58 | 26.0 | 3281.0 | NaN | 1145.0 | 480.0 | 6.3580 | NEAR OCEAN |
sample_incomplete_rows.dropna(subset=["total_bedrooms"]) # option 1
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
sample_incomplete_rows.drop("total_bedrooms", axis=1) # 옵션 2
ㅤ | longitude | latitude | housing_median_age | total_rooms | population | households | median_income | ocean_proximity |
1606 | -122.08 | 37.88 | 26.0 | 2947.0 | 825.0 | 626.0 | 2.9330 | NEAR BAY |
10915 | -117.87 | 33.73 | 45.0 | 2264.0 | 1970.0 | 499.0 | 3.4193 | <1H OCEAN |
19150 | -122.70 | 38.35 | 14.0 | 2313.0 | 954.0 | 397.0 | 3.7813 | <1H OCEAN |
4186 | -118.23 | 34.13 | 48.0 | 1308.0 | 835.0 | 294.0 | 4.2891 | <1H OCEAN |
16885 | -122.40 | 37.58 | 26.0 | 3281.0 | 1145.0 | 480.0 | 6.3580 | NEAR OCEAN |
median = housing["total_bedrooms"].median()sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True) # 옵션 3sample_incomplete_rows
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
1606 | -122.08 | 37.88 | 26.0 | 2947.0 | 433.0 | 825.0 | 626.0 | 2.9330 | NEAR BAY |
10915 | -117.87 | 33.73 | 45.0 | 2264.0 | 433.0 | 1970.0 | 499.0 | 3.4193 | <1H OCEAN |
19150 | -122.70 | 38.35 | 14.0 | 2313.0 | 433.0 | 954.0 | 397.0 | 3.7813 | <1H OCEAN |
4186 | -118.23 | 34.13 | 48.0 | 1308.0 | 433.0 | 835.0 | 294.0 | 4.2891 | <1H OCEAN |
16885 | -122.40 | 37.58 | 26.0 | 3281.0 | 433.0 | 1145.0 | 480.0 | 6.3580 | NEAR OCEAN |
```python
# from sklearn.preprocessing import Imputer   # deprecated; replaced by SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
```
```python
housing_num = housing.drop('ocean_proximity', axis=1)
# Alternative: housing_num = housing.select_dtypes(include=[np.number])
```
imputer.fit(housing_num)
SimpleImputer(strategy=‘median’)
imputer.statistics_
array([-118.51 , 34.26 , 29. , 2119. , 433. , 1164. , 408. , 3.54155])
housing_num.median().values
array([-118.51 , 34.26 , 29. , 2119. , 433. , 1164. , 408. , 3.54155])
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index = list(housing.index.values))
housing_tr.loc[sample_incomplete_rows.index.values]
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
1606 | -122.08 | 37.88 | 26.0 | 2947.0 | 433.0 | 825.0 | 626.0 | 2.9330 |
10915 | -117.87 | 33.73 | 45.0 | 2264.0 | 433.0 | 1970.0 | 499.0 | 3.4193 |
19150 | -122.70 | 38.35 | 14.0 | 2313.0 | 433.0 | 954.0 | 397.0 | 3.7813 |
4186 | -118.23 | 34.13 | 48.0 | 1308.0 | 433.0 | 835.0 | 294.0 | 4.2891 |
16885 | -122.40 | 37.58 | 26.0 | 3281.0 | 433.0 | 1145.0 | 480.0 | 6.3580 |
imputer.strategy
‘median’
```python
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.head()
```
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
0 | -121.46 | 38.52 | 29.0 | 3873.0 | 797.0 | 2237.0 | 706.0 | 2.1736 |
1 | -117.23 | 33.09 | 7.0 | 5320.0 | 855.0 | 2015.0 | 768.0 | 6.3373 |
2 | -119.04 | 35.37 | 44.0 | 1618.0 | 310.0 | 667.0 | 300.0 | 2.8750 |
3 | -117.13 | 32.75 | 24.0 | 1877.0 | 519.0 | 898.0 | 483.0 | 2.2264 |
4 | -118.70 | 34.28 | 27.0 | 3536.0 | 646.0 | 1837.0 | 580.0 | 4.4964 |
### Handling Text and Categorical Attributes
The categorical feature ocean_proximity that we skipped earlier is text, so we could not compute its median.
When a feature's values are text, the usual approach is to map them from text to numbers.
```python
housing_cat = housing['ocean_proximity']
housing_cat.head(10)
```
12655 INLAND 15502 NEAR OCEAN 2908 INLAND 14053 NEAR OCEAN 20496 <1H OCEAN 1481 NEAR BAY 18125 <1H OCEAN 5830 <1H OCEAN 17989 <1H OCEAN 4861 <1H OCEAN Name: ocean_proximity, dtype: object
```python
housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
```
array([0, 1, 0, 1, 2, 3, 2, 2, 2, 2])
housing_categories
Index([‘INLAND’, ‘NEAR OCEAN’, ‘<1H OCEAN’, ‘NEAR BAY’, ‘ISLAND’], dtype=‘object’)
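A side note (my addition, not from the original notebook): in Scikit-Learn 0.20 and later, the same text-to-integer mapping is available as a transformer, `OrdinalEncoder`, which fits into pipelines more naturally than `factorize()`:

```python
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
# expects 2-D input, hence the double brackets
housing_cat_ordinal = ordinal_encoder.fit_transform(housing[['ocean_proximity']])
print(housing_cat_ordinal[:10].ravel())
print(ordinal_encoder.categories_)   # the learned category list per column
```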
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(categories='auto')
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot
```
<16512x5 sparse matrix of type ‘<class ’numpy.float64’>’ with 16512 stored elements in Compressed Sparse Row format>
housing_cat_1hot.toarray()
array([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [1., 0., 0., 0., 0.], …, [0., 0., 1., 0., 0.], [0., 0., 1., 0., 0.], [1., 0., 0., 0., 0.]])
```python
# Definition of the CategoricalEncoder class, taken from
# [PR #9151](https://github.com/scikit-learn/scikit-learn/pull/9151).
# This class was expected to be included in Scikit-Learn 0.20.
# (The long docstrings of the original are abridged here; see the PR for details.)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array, either one-hot
    (``encoding='onehot'`` or ``'onehot-dense'``) or as ordinal integers
    (``encoding='ordinal'``)."""

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X."""
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.handle_unknown)
        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)
        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape
        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]
        return self

    def transform(self, X):
        """Transform X using one-hot (or ordinal) encoding."""
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])
            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Mark the problematic rows in X_mask and set them to an
                    # acceptable value so that transform can continue.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]
        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
```
```python
# from sklearn.preprocessing import CategoricalEncoder  # was expected in Scikit-Learn 0.20
cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
```
(On NumPy 1.20+ this cell also prints DeprecationWarnings, because the copied class uses the old np.object, np.int and np.bool aliases.)

<16512x5 sparse matrix of type '<class 'numpy.float64'>' with 16512 stored elements in Compressed Sparse Row format>
```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(categories='auto')
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
```
<16512x5 sparse matrix of type ‘<class ’numpy.float64’>’ with 16512 stored elements in Compressed Sparse Row format>
housing_cat_1hot.toarray()
array([[0., 1., 0., 0., 0.], [0., 0., 0., 0., 1.], [0., 1., 0., 0., 0.], …, [1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]])
```python
# cat_encoder = CategoricalEncoder(encoding="onehot-dense")
cat_encoder = OneHotEncoder(categories='auto', sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
```
array([[0., 1., 0., 0., 0.], [0., 0., 0., 0., 1.], [0., 1., 0., 0., 0.], …, [1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]])
cat_encoder.categories_
[array([‘<1H OCEAN’, ‘INLAND’, ‘ISLAND’, ‘NEAR BAY’, ‘NEAR OCEAN’], dtype=object)]
### Custom Transformers
Scikit-Learn ships with many useful preprocessing transformers, but sometimes you need to write your own transformer (preprocessor) tailored to your project.
```python
from sklearn.base import BaseEstimator, TransformerMixin

# column indices
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
```
```python
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
```
ㅤ | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | rooms_per_household | population_per_household |
0 | -121.46 | 38.52 | 29.0 | 3873.0 | 797.0 | 2237.0 | 706.0 | 2.1736 | INLAND | 5.485836 | 3.168555 |
1 | -117.23 | 33.09 | 7.0 | 5320.0 | 855.0 | 2015.0 | 768.0 | 6.3373 | NEAR OCEAN | 6.927083 | 2.623698 |
2 | -119.04 | 35.37 | 44.0 | 1618.0 | 310.0 | 667.0 | 300.0 | 2.875 | INLAND | 5.393333 | 2.223333 |
3 | -117.13 | 32.75 | 24.0 | 1877.0 | 519.0 | 898.0 | 483.0 | 2.2264 | NEAR OCEAN | 3.886128 | 1.859213 |
4 | -118.7 | 34.28 | 27.0 | 3536.0 | 646.0 | 1837.0 | 580.0 | 4.4964 | <1H OCEAN | 6.096552 | 3.167241 |
### Transformation Pipelines
When there are many transformation steps that must run in a specific order, Scikit-Learn's Pipeline class helps execute the sequence of transformations in order.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)
```
housing_num_tr
array([[-0.94135046, 1.34743822, 0.02756357, …, 0.01739526, 0.00622264, -0.12112176], [ 1.17178212, -1.19243966, -1.72201763, …, 0.56925554, -0.04081077, -0.81086696], [ 0.26758118, -0.1259716 , 1.22045984, …, -0.01802432, -0.07537122, -0.33827252], …, [-1.5707942 , 1.31001828, 1.53856552, …, -0.5092404 , -0.03743619, 0.32286937], [-1.56080303, 1.2492109 , -1.1653327 , …, 0.32814891, -0.05915604, -0.45702273], [-1.28105026, 2.02567448, -0.13148926, …, 0.01407228, 0.00657083, -0.12169672]])
```python
# from future_encoders import ColumnTransformer
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(categories='auto'), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
```
array([[-0.94135046, 1.34743822, 0.02756357, …, 0. , 0. , 0. ], [ 1.17178212, -1.19243966, -1.72201763, …, 0. , 0. , 1. ], [ 0.26758118, -0.1259716 , 1.22045984, …, 0. , 0. , 0. ], …, [-1.5707942 , 1.31001828, 1.53856552, …, 0. , 0. , 0. ], [-1.56080303, 1.2492109 , -1.1653327 , …, 0. , 0. , 0. ], [-1.28105026, 2.02567448, -0.13148926, …, 0. , 0. , 0. ]])
```python
from sklearn.base import BaseEstimator, TransformerMixin

# Scikit-Learn cannot handle DataFrames directly here, so we create a class
# that selects the numerical or categorical columns.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
```
```python
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
    ])
```
```python
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(categories='auto', sparse=False)),
    ])
```
```python
# full_pipeline = FeatureUnion(transformer_list=[
#     ("num_pipeline", num_pipeline),
#     ("cat_pipeline", cat_pipeline),
# ])
full_pipeline = ColumnTransformer([
        ("num_pipeline", num_pipeline, num_attribs),
        ("cat_encoder", OneHotEncoder(categories='auto'), cat_attribs),
    ])
```
```python
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
```
array([[-0.94135046, 1.34743822, 0.02756357, …, 0. , 0. , 0. ], [ 1.17178212, -1.19243966, -1.72201763, …, 0. , 0. , 1. ], [ 0.26758118, -0.1259716 , 1.22045984, …, 0. , 0. , 0. ], …, [-1.5707942 , 1.31001828, 1.53856552, …, 0. , 0. , 0. ], [-1.56080303, 1.2492109 , -1.1653327 , …, 0. , 0. , 0. ], [-1.28105026, 2.02567448, -0.13148926, …, 0. , 0. , 0. ]])
## Selecting and Training a Model
So far we have built a transformation pipeline that automatically cleans and prepares the data to feed to a machine learning algorithm.
Now it is time to select a model and train it.
### Training and Evaluating on the Training Set
```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
```
LinearRegression()
```python
# Apply the full pipeline to a few training samples
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("예측:", lin_reg.predict(some_data_prepared))
```
예측: [ 85657.90192014 305492.60737488 152056.46122456 186095.70946094 244550.67966089]
print("레이블:", list(some_labels))
레이블: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]
some_data_prepared
array([[-0.94135046, 1.34743822, 0.02756357, 0.58477745, 0.64037127, 0.73260236, 0.55628602, -0.8936472 , 0.01739526, 0.00622264, -0.12112176, 0. , 1. , 0. , 0. , 0. ], [ 1.17178212, -1.19243966, -1.72201763, 1.26146668, 0.78156132, 0.53361152, 0.72131799, 1.292168 , 0.56925554, -0.04081077, -0.81086696, 0. , 0. , 0. , 0. , 1. ], [ 0.26758118, -0.1259716 , 1.22045984, -0.46977281, -0.54513828, -0.67467519, -0.52440722, -0.52543365, -0.01802432, -0.07537122, -0.33827252, 0. , 1. , 0. , 0. , 0. ], [ 1.22173797, -1.35147437, -0.37006852, -0.34865152, -0.03636724, -0.46761716, -0.03729672, -0.86592882, -0.59513997, -0.10680295, 0.96120521, 0. , 0. , 0. , 0. , 1. ], [ 0.43743108, -0.63581817, -0.13148926, 0.42717947, 0.27279028, 0.37406031, 0.22089846, 0.32575178, 0.2512412 , 0.00610923, -0.47451338, 1. , 0. , 0. , 0. , 0. ]])
```python
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
```
68627.87390018745
```python
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
```
49438.66860915802
```python
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
```
DecisionTreeRegressor(random_state=42)
```python
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
```
0.0
### Better Evaluation Using Cross-Validation
The error above is 0. Did we just build a perfect model? Of course not: the model has almost certainly badly overfit the training data.
Let's evaluate the model we used (DecisionTreeRegressor, a decision tree) properly.
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
```
```python
def display_scores(scores):
    print("점수:", scores)
    print("평균:", scores.mean())
    print("표준편차:", scores.std())

display_scores(tree_rmse_scores)
```
점수: [72831.45749112 69973.18438322 69528.56551415 72517.78229792 69145.50006909 79094.74123727 68960.045444 73344.50225684 69826.02473916 71077.09753998] 평균: 71629.89009727491 표준편차: 2914.035468468928
```python
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
```
점수: [71762.76364394 64114.99166359 67771.17124356 68635.19072082 66846.14089488 72528.03725385 73997.08050233 68802.33629334 66443.28836884 70139.79923956] 평균: 69104.07998247063 표준편차: 2880.328209818068
```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
```
RandomForestRegressor(n_estimators=10, random_state=42)
```python
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
```
22413.454658589766
```python
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
```
점수: [53519.05518628 50467.33817051 48924.16513902 53771.72056856 50810.90996358 54876.09682033 56012.79985518 52256.88927227 51527.73185039 55762.56008531] 평균: 52792.92669114079 표준편차: 2262.8151900582
```python
scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()
```
count 10.000000 mean 69104.079982 std 3036.132517 min 64114.991664 25% 67077.398482 50% 68718.763507 75% 71357.022543 max 73997.080502 dtype: float64
## Fine-Tuning the Model
Assuming we have shortlisted a few promising models, we now need to fine-tune them.
### Grid Search
Fiddling with the hyperparameters by hand until you find a good combination is tedious.
Instead, we use Scikit-Learn's GridSearchCV to try the combinations for us.
```python
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (= 3 x 4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (= 2 x 3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# training across 5 folds means (12 + 6) * 5 = 90 rounds of training in total
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True, n_jobs=-1)
grid_search.fit(housing_prepared, housing_labels)
```
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1, param_grid=[{‘max_features’: [2, 4, 6, 8], ‘n_estimators’: [3, 10, 30]}, {‘bootstrap’: [False], ‘max_features’: [2, 3, 4], ‘n_estimators’: [3, 10]}], return_train_score=True, scoring=‘neg_mean_squared_error’)
grid_search.best_params_
{‘max_features’: 8, ‘n_estimators’: 30}
grid_search.best_estimator_
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
```python
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```
63895.161577951665 {‘max_features’: 2, ‘n_estimators’: 3} 54916.32386349543 {‘max_features’: 2, ‘n_estimators’: 10} 52885.86715332332 {‘max_features’: 2, ‘n_estimators’: 30} 60075.3680329983 {‘max_features’: 4, ‘n_estimators’: 3} 52495.01284985185 {‘max_features’: 4, ‘n_estimators’: 10} 50187.24324926565 {‘max_features’: 4, ‘n_estimators’: 30} 58064.73529982314 {‘max_features’: 6, ‘n_estimators’: 3} 51519.32062366315 {‘max_features’: 6, ‘n_estimators’: 10} 49969.80441627874 {‘max_features’: 6, ‘n_estimators’: 30} 58895.824998155826 {‘max_features’: 8, ‘n_estimators’: 3} 52459.79624724529 {‘max_features’: 8, ‘n_estimators’: 10} 49898.98913455217 {‘max_features’: 8, ‘n_estimators’: 30} 62381.765106921855 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 3} 54476.57050944266 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 10} 59974.60028085155 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 3} 52754.5632813202 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 10} 57831.136061214274 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 3} 51278.37877140253 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 10}
pd.DataFrame(grid_search.cv_results_)
ㅤ | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_features | param_n_estimators | param_bootstrap | params | split0_test_score | split1_test_score | … | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
0 | 0.087806 | 0.013061 | 0.005892 | 0.002079 | 2 | 3 | NaN | {‘max_features’: 2, ‘n_estimators’: 3} | -4.119912e+09 | -3.723465e+09 | … | -4.082592e+09 | 1.867375e+08 | 18 | -1.155630e+09 | -1.089726e+09 | -1.153843e+09 | -1.118149e+09 | -1.093446e+09 | -1.122159e+09 | 2.834288e+07 |
1 | 0.259524 | 0.018060 | 0.015139 | 0.005057 | 2 | 10 | NaN | {‘max_features’: 2, ‘n_estimators’: 10} | -2.973521e+09 | -2.810319e+09 | … | -3.015803e+09 | 1.139808e+08 | 11 | -5.982947e+08 | -5.904781e+08 | -6.123850e+08 | -5.727681e+08 | -5.905210e+08 | -5.928894e+08 | 1.284978e+07 |
2 | 0.775507 | 0.054523 | 0.034792 | 0.005973 | 2 | 30 | NaN | {‘max_features’: 2, ‘n_estimators’: 30} | -2.801229e+09 | -2.671474e+09 | … | -2.796915e+09 | 7.980892e+07 | 9 | -4.412567e+08 | -4.326398e+08 | -4.553722e+08 | -4.320746e+08 | -4.311606e+08 | -4.385008e+08 | 9.184397e+06 |
3 | 0.135841 | 0.014487 | 0.003882 | 0.001241 | 4 | 3 | NaN | {‘max_features’: 4, ‘n_estimators’: 3} | -3.528743e+09 | -3.490303e+09 | … | -3.609050e+09 | 1.375683e+08 | 16 | -9.782368e+08 | -9.806455e+08 | -1.003780e+09 | -1.016515e+09 | -1.011270e+09 | -9.980896e+08 | 1.577372e+07 |
4 | 0.388972 | 0.014442 | 0.008515 | 0.001776 | 4 | 10 | NaN | {‘max_features’: 4, ‘n_estimators’: 10} | -2.742620e+09 | -2.609311e+09 | … | -2.755726e+09 | 1.182604e+08 | 7 | -5.063215e+08 | -5.257983e+08 | -5.081984e+08 | -5.174405e+08 | -5.282066e+08 | -5.171931e+08 | 8.882622e+06 |
5 | 1.080341 | 0.030499 | 0.028142 | 0.002480 | 4 | 30 | NaN | {‘max_features’: 4, ‘n_estimators’: 30} | -2.522176e+09 | -2.440241e+09 | … | -2.518759e+09 | 8.488084e+07 | 3 | -3.776568e+08 | -3.902106e+08 | -3.885042e+08 | -3.830866e+08 | -3.894779e+08 | -3.857872e+08 | 4.774229e+06 |
6 | 0.145464 | 0.005240 | 0.003709 | 0.001239 | 6 | 3 | NaN | {‘max_features’: 6, ‘n_estimators’: 3} | -3.362127e+09 | -3.311863e+09 | … | -3.371513e+09 | 1.378086e+08 | 13 | -8.909397e+08 | -9.583733e+08 | -9.000201e+08 | -8.964731e+08 | -9.151927e+08 | -9.121998e+08 | 2.444837e+07 |
7 | 0.490803 | 0.006879 | 0.009305 | 0.002367 | 6 | 10 | NaN | {‘max_features’: 6, ‘n_estimators’: 10} | -2.622099e+09 | -2.669655e+09 | … | -2.654240e+09 | 6.967978e+07 | 5 | -4.939906e+08 | -5.145996e+08 | -5.023512e+08 | -4.959467e+08 | -5.147087e+08 | -5.043194e+08 | 8.880106e+06 |
8 | 1.539921 | 0.030108 | 0.030670 | 0.003994 | 6 | 30 | NaN | {‘max_features’: 6, ‘n_estimators’: 30} | -2.446142e+09 | -2.446594e+09 | … | -2.496981e+09 | 7.357046e+07 | 2 | -3.760968e+08 | -3.876636e+08 | -3.875307e+08 | -3.760938e+08 | -3.861056e+08 | -3.826981e+08 | 5.418747e+06 |
9 | 0.197752 | 0.008595 | 0.004226 | 0.000931 | 8 | 3 | NaN | {‘max_features’: 8, ‘n_estimators’: 3} | -3.590333e+09 | -3.232664e+09 | … | -3.468718e+09 | 1.293758e+08 | 14 | -9.505012e+08 | -9.166119e+08 | -9.033910e+08 | -9.070642e+08 | -9.459386e+08 | -9.247014e+08 | 1.973471e+07 |
10 | 0.703009 | 0.035052 | 0.009964 | 0.002181 | 8 | 10 | NaN | {‘max_features’: 8, ‘n_estimators’: 10} | -2.721311e+09 | -2.675886e+09 | … | -2.752030e+09 | 6.258030e+07 | 6 | -4.998373e+08 | -4.997970e+08 | -5.099880e+08 | -5.047868e+08 | -5.348043e+08 | -5.098427e+08 | 1.303601e+07 |
11 | 2.152918 | 0.034081 | 0.028884 | 0.005224 | 8 | 30 | NaN | {‘max_features’: 8, ‘n_estimators’: 30} | -2.492636e+09 | -2.444818e+09 | … | -2.489909e+09 | 7.086483e+07 | 1 | -3.801679e+08 | -3.832972e+08 | -3.823818e+08 | -3.778452e+08 | -3.817589e+08 | -3.810902e+08 | 1.916605e+06 |
12 | 0.102501 | 0.009359 | 0.004284 | 0.001413 | 2 | 3 | False | {‘bootstrap’: False, ‘max_features’: 2, ’n_est… | -4.020842e+09 | -3.951861e+09 | … | -3.891485e+09 | 8.648595e+07 | 17 | -0.000000e+00 | -4.306828e+01 | -1.051392e+04 | -0.000000e+00 | -0.000000e+00 | -2.111398e+03 | 4.201294e+03 |
13 | 0.372035 | 0.007677 | 0.013123 | 0.001289 | 2 | 10 | False | {‘bootstrap’: False, ‘max_features’: 2, ’n_est… | -2.901352e+09 | -3.036875e+09 | … | -2.967697e+09 | 4.582448e+07 | 10 | -0.000000e+00 | -3.876145e+00 | -9.462528e+02 | -0.000000e+00 | -0.000000e+00 | -1.900258e+02 | 3.781165e+02 |
14 | 0.143098 | 0.007210 | 0.004097 | 0.001342 | 3 | 3 | False | {‘bootstrap’: False, ‘max_features’: 3, ’n_est… | -3.687132e+09 | -3.446245e+09 | … | -3.596953e+09 | 8.011960e+07 | 15 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
15 | 0.477478 | 0.040663 | 0.010420 | 0.002360 | 3 | 10 | False | {‘bootstrap’: False, ‘max_features’: 3, ’n_est… | -2.837028e+09 | -2.619558e+09 | … | -2.783044e+09 | 8.862580e+07 | 8 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
16 | 0.186153 | 0.009293 | 0.003124 | 0.000506 | 4 | 3 | False | {‘bootstrap’: False, ‘max_features’: 4, ’n_est… | -3.549428e+09 | -3.318176e+09 | … | -3.344440e+09 | 1.099355e+08 | 12 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
17 | 0.510566 | 0.019858 | 0.009109 | 0.001882 | 4 | 10 | False | {‘bootstrap’: False, ‘max_features’: 4, ’n_est… | -2.692499e+09 | -2.542704e+09 | … | -2.629472e+09 | 8.510266e+07 | 4 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | -0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
18 rows × 23 columns
### Randomized Search
When the hyperparameter search space is large, grid search becomes impractical.
Randomized search is used much like grid search, but instead of trying every possible combination it runs a fixed number of iterations, sampling a random value for each hyperparameter at every iteration.
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42, n_jobs=-1)
rnd_search.fit(housing_prepared, housing_labels)
```
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1, param_distributions={‘max_features’: <scipy.stats._distn_infrastructure.rv_frozen object at 0x7feb1268d7c0>, ‘n_estimators’: <scipy.stats._distn_infrastructure.rv_frozen object at 0x7feb01313ca0>}, random_state=42, scoring=‘neg_mean_squared_error’)
```python
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```
49117.55344336652 {‘max_features’: 7, ‘n_estimators’: 180} 51450.63202856348 {‘max_features’: 5, ‘n_estimators’: 15} 50692.53588182537 {‘max_features’: 3, ‘n_estimators’: 72} 50783.614493515 {‘max_features’: 5, ‘n_estimators’: 21} 49162.89877456354 {‘max_features’: 7, ‘n_estimators’: 122} 50655.798471042704 {‘max_features’: 3, ‘n_estimators’: 75} 50513.856319990606 {‘max_features’: 3, ‘n_estimators’: 88} 49521.17201976928 {‘max_features’: 5, ‘n_estimators’: 100} 50302.90440763418 {‘max_features’: 3, ‘n_estimators’: 150} 65167.02018649492 {‘max_features’: 5, ‘n_estimators’: 2}
### Ensemble Methods
A group of models (an ensemble) can often perform better than the best single model; this is covered in Chapter 7.
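As a quick illustration of the idea (my own toy example, not from the original notebook), we can simply average the predictions of the two models trained above and measure the training-set RMSE of the blend:

```python
# naive ensemble: average the linear regression and random forest predictions
blend_predictions = (lin_reg.predict(housing_prepared) +
                     forest_reg.predict(housing_prepared)) / 2
blend_rmse = np.sqrt(mean_squared_error(housing_labels, blend_predictions))
blend_rmse
```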
### Analyzing the Best Models and Their Errors
Inspecting the best model gives useful insight into the problem and into the current model's weaknesses.
```python
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
```
array([6.96542523e-02, 6.04213840e-02, 4.21882202e-02, 1.52450557e-02, 1.55545295e-02, 1.58491147e-02, 1.49346552e-02, 3.79009225e-01, 5.47789150e-02, 1.07031322e-01, 4.82031213e-02, 6.79266007e-03, 1.65706303e-01, 7.83480660e-05, 1.52473276e-03, 3.02816106e-03])
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]# cat_encoder = cat_pipeline.named_steps["cat_encoder"]cat_encoder = full_pipeline.named_transformers_["cat_encoder"]cat_one_hot_attribs = list(cat_encoder.categories_[0])attributes = num_attribs + extra_attribs + cat_one_hot_attribssorted(zip(feature_importances, attributes), reverse=True)
[(0.3790092248170967, ‘median_income’), (0.16570630316895876, ‘INLAND’), (0.10703132208204354, ‘pop_per_hhold’), (0.06965425227942929, ‘longitude’), (0.0604213840080722, ‘latitude’), (0.054778915018283726, ‘rooms_per_hhold’), (0.048203121338269206, ‘bedrooms_per_room’), (0.04218822024391753, ‘housing_median_age’), (0.015849114744428634, ‘population’), (0.015554529490469328, ‘total_bedrooms’), (0.01524505568840977, ‘total_rooms’), (0.014934655161887776, ‘households’), (0.006792660074259966, ‘<1H OCEAN’), (0.0030281610628962747, ‘NEAR OCEAN’), (0.0015247327555504937, ‘NEAR BAY’), (7.834806602687504e-05, ‘ISLAND’)]
### Evaluating the System on the Test Set
After tuning for a while, we eventually end up with a model that performs well enough.
Now it is time to evaluate the final model on the test set.
```python
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
```
```python
final_rmse
```
47873.26095812988
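As an optional follow-up (my addition, not part of the original notebook), we can compute a 95% confidence interval for the test RMSE instead of relying on the point estimate alone, assuming SciPy is available:

```python
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
# t-interval for the mean squared error, then take square roots to get an RMSE interval
rmse_interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                         loc=squared_errors.mean(),
                                         scale=stats.sem(squared_errors)))
rmse_interval
```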