[Python] Automobile Price Prediction (Ridge Regression)

Automobile Price Prediction

Loading the data

In [1]:
import pandas as pd
data = pd.read_csv('Automobile_data_.csv')
In [2]:
data.head()
Out[2]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 109 mpfi 3.19 3.4 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 136 mpfi 3.19 3.4 8.0 115 5500 18 22 17450

5 rows × 26 columns

  • Replace the missing-value marker "?" with NaN
In [65]:
import numpy as np
data = data.replace('?', np.nan)  # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
data.head()
Out[65]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.0
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.0
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.0
3 2 164 audi gas std four sedan fwd front 99.8 109 mpfi 3.19 3.4 10.0 102 5500 24 30 13950.0
4 2 164 audi gas std four sedan 4wd front 99.4 136 mpfi 3.19 3.4 8.0 115 5500 18 22 17450.0

5 rows × 26 columns
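
As an alternative to replacing "?" after loading, the marker can be declared as missing at read time. The sketch below uses a small in-memory CSV that I made up to stand in for Automobile_data_.csv (the three rows and the variable names `csv_text` and `sample` are assumptions, not the real file). With `na_values="?"`, pandas parses the affected columns as numbers straight away, which may make the later `pd.to_numeric` step unnecessary.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for Automobile_data_.csv.
csv_text = """make,horsepower,price
alfa-romero,111,13495
audi,102,?
porsche,?,?
"""

# na_values="?" tells the parser to treat "?" as missing while reading,
# so numeric columns come back as float64 instead of object.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.dtypes)
print(sample.isna().sum())
```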

In [53]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-ratio    205 non-null float64
horsepower           205 non-null object
peak-rpm             205 non-null object
city-mpg             205 non-null int64
highway-mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.7+ KB
In [54]:
data.describe()
Out[54]:
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 10.142537 25.219512 30.751220
std 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 3.972040 6.542142 6.886443
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 9.400000 30.000000 34.000000
max 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 23.000000 49.000000 54.000000
  • Convert the price variable to a numeric type
In [58]:
data['price'] = pd.to_numeric(data['price'])
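  • Identify the continuous (numeric) variables
In [59]:
cols = data.columns  # all column names
num_cols = data.select_dtypes(include='number').columns  # columns with a numeric dtype
num_cols = list(num_cols)  # numeric variables
num_cols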
num_cols
Out[59]:
['symboling',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-size',
 'compression-ratio',
 'city-mpg',
 'highway-mpg',
 'price']
  • Identify the categorical (non-numeric) variables
In [60]:
cate_cols = list(set(cols) - set(num_cols))  # categorical variables
cate_cols
Out[60]:
['engine-location',
 'fuel-system',
 'make',
 'stroke',
 'num-of-cylinders',
 'fuel-type',
 'aspiration',
 'peak-rpm',
 'bore',
 'normalized-losses',
 'engine-type',
 'horsepower',
 'body-style',
 'drive-wheels',
 'num-of-doors']
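
Note that several names in this list (bore, stroke, horsepower, peak-rpm, normalized-losses) hold numbers; they only ended up with dtype object because of the "?" entries. A minimal sketch of how they could be converted before splitting the columns, using a made-up three-row frame (the data and the `numeric_like` list are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: numeric values stored as strings, NaN where "?" used to be.
sample = pd.DataFrame({
    "make": ["alfa-romero", "audi", "bmw"],
    "bore": ["3.47", "3.19", np.nan],
    "horsepower": ["111", np.nan, "101"],
})

# Columns that hold numbers but were read as text.
numeric_like = ["bore", "horsepower"]

# errors="coerce" turns anything unparseable into NaN instead of raising.
sample[numeric_like] = sample[numeric_like].apply(pd.to_numeric, errors="coerce")

num_cols = sample.select_dtypes(include="number").columns.tolist()
cate_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(num_cols)
print(cate_cols)
```

With this conversion the numeric and categorical lists follow the meaning of each column rather than how it happened to be parsed.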
  • EDA of the continuous variables
In [61]:
data.hist(bins=50, figsize=(20,15))
Out[61]:
[Figure: grid of histograms, one per numeric column]
  • EDA of the categorical variables
In [62]:
cate_data = data[cate_cols]
cate_data.columns
Out[62]:
Index(['engine-location', 'fuel-system', 'make', 'stroke', 'num-of-cylinders',
       'fuel-type', 'aspiration', 'peak-rpm', 'bore', 'normalized-losses',
       'engine-type', 'horsepower', 'body-style', 'drive-wheels',
       'num-of-doors'],
      dtype='object')
In [181]:
cate_data['make'].value_counts().plot(kind="bar")  # select by name; the order of cate_cols comes from a set and is not stable
Out[181]:
[Figure: bar chart of the number of cars per make]

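Building the model

  • 1) Select a subset of columns (numeric columns only, since the regression needs numeric inputs)
  • 2) Handle missing values
  • 3) Select the variables used in the model
  • 4) Split into training and test sets
  • 5) Train the model
  • 6) Validate the model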
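
The six steps above can be sketched as one function. This is only an outline under my own assumptions: the helper name `fit_numeric_baseline`, its parameters, and the synthetic `demo` frame are all hypothetical and are not the automobile data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_numeric_baseline(df, target, test_size=0.2, seed=0):
    """Run steps 1-6 and return the fitted model and the test RMSE."""
    num = df.select_dtypes(include="number")        # 1) numeric columns only
    num = num.dropna(axis=0, how="any")             # 2) drop rows with missing values
    X, y = num.drop(columns=target), num[target]    # 3) features and target
    X_tr, X_te, y_tr, y_te = train_test_split(      # 4) train/test split
        X, y, test_size=test_size, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)      # 5) train
    mse = mean_squared_error(y_te, model.predict(X_te))  # 6) validate
    return model, np.sqrt(mse)


# Synthetic stand-in data: price is linear in two features plus noise.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "engine-size": rng.uniform(60, 330, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "make": ["demo"] * n,  # non-numeric column, dropped in step 1
})
demo["price"] = (100 * demo["engine-size"] + 3 * demo["curb-weight"]
                 + rng.normal(0, 500, n))
demo.loc[:3, "price"] = np.nan  # a few missing targets, as in the real data

model, rmse = fit_numeric_baseline(demo, "price")
print(round(rmse, 1))
```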

1) Select a subset of columns (numeric only)

In [106]:
num_data = data.select_dtypes(include='number')  # keep only the numeric columns
num_data.head()
Out[106]:
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg price
0 3 88.6 168.8 64.1 48.8 2548 130 9.0 21 27 13495.0
1 3 88.6 168.8 64.1 48.8 2548 130 9.0 21 27 16500.0
2 1 94.5 171.2 65.5 52.4 2823 152 9.0 19 26 16500.0
3 2 99.8 176.6 66.2 54.3 2337 109 10.0 24 30 13950.0
4 2 99.4 176.6 66.4 54.3 2824 136 8.0 18 22 17450.0

2) Handle missing values

In [107]:
def cnt_NA(df):
    """Print the count and ratio of missing values for each column that has any."""
    for col in df.columns:
        na = df[col].isnull().sum()  # number of missing values in this column
        if na != 0:
            print(col + ":" + str(na) + ", NA_ratio:" + str(na / len(df)))
    print("NA test end")
In [79]:
cnt_NA(num_data)
price:4, NA_ratio:0.019512195122
NA test end
  • Drop the rows with missing values
In [110]:
num_data = num_data.dropna(axis=0, how='any')
cnt_NA(num_data)
NA test end
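
The same check can also be written without a loop. A small sketch on a made-up frame (the four rows and the `summary` name are assumptions): `isna().sum()` gives the count per column and `isna().mean()` gives the ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of problem: one column with gaps.
sample = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "price": [13495.0, np.nan, 13950.0, np.nan],
})

# Count and ratio of missing values per column.
summary = pd.DataFrame({
    "na_count": sample.isna().sum(),
    "na_ratio": sample.isna().mean(),
})

# Keep only the columns that actually contain missing values.
summary = summary[summary["na_count"] > 0]
print(summary)
```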

3) Select the data used in the model

  • Simply use all of the numeric variables: num_data

4) Split into training and test sets

  • Use the helper function that scikit-learn provides
  • from sklearn.model_selection import train_test_split
In [133]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(num_data, test_size=0.2)
print(len(train), len(test))
160 41
In [134]:
test.head()
Out[134]:
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg price
196 -2 104.3 188.8 67.2 56.2 2935 141 9.5 24 28 15985.0
92 1 94.5 165.3 63.8 54.5 1938 97 9.4 31 37 6849.0
87 1 96.3 172.4 65.4 51.6 2403 110 7.5 23 30 9279.0
135 2 99.1 186.6 66.5 56.1 2758 121 9.3 21 28 15510.0
128 3 89.5 168.9 65.0 51.6 2800 194 9.5 17 25 37028.0
In [135]:
train_x = train.iloc[:,:-1]
train_y = train.iloc[:, -1]
In [136]:
train_x.head()
Out[136]:
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg
83 3 95.9 173.2 66.3 50.2 2921 156 7.0 19 24
70 -1 115.6 202.6 71.7 56.3 3770 183 21.5 22 25
185 2 97.3 171.7 65.5 55.7 2212 109 9.0 27 34
62 0 98.8 177.8 66.5 55.5 2410 122 8.6 26 32
198 -2 104.3 188.8 67.2 56.2 3045 130 7.5 17 22
In [138]:
train_y.head()
Out[138]:
83     14869.0
70     31600.0
185     8195.0
62     10245.0
198    18420.0
Name: price, dtype: float64
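
The split above is random, so the rows and every number that follows change on each run. A sketch of a reproducible variant, assuming a synthetic 201-row frame in place of num_data (the frame, its column names, and the seed 42 are my choices): passing `random_state` fixes the shuffle, and passing X and y together returns all four pieces in one call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for num_data: 201 rows, target in the "price" column.
rng = np.random.default_rng(0)
num_demo = pd.DataFrame(
    rng.normal(size=(201, 3)), columns=["engine-size", "width", "price"]
)

X = num_demo.drop(columns="price")
y = num_demo["price"]

# random_state fixes the shuffle, so rerunning the cell selects the same rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```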

5) Train the linear regression

In [158]:
train_x = np.asarray(train_x)
train_y = np.asarray(train_y)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_x, train_y)
Out[158]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [159]:
lr.coef_
Out[159]:
array([   88.64668674,   -51.52298832,  -128.1369366 ,  1099.27904606,
         147.23503498,     2.64572556,   114.26626391,   168.08428786,
        -450.23937409,   245.84972674])
In [160]:
lr.intercept_
Out[160]:
-58987.916226805588
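
The coefficient array is easier to read when each value is paired with its column name. A sketch on synthetic data (the two features, the true weights 1000 and -400, and the name `lr_demo` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights so the result is easy to check.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "width": rng.uniform(60, 72, 100),
    "city-mpg": rng.uniform(13, 49, 100),
})
y = 1000 * X["width"] - 400 * X["city-mpg"] + rng.normal(0, 100, 100)

lr_demo = LinearRegression().fit(X, y)

# Label each coefficient with its feature and sort by absolute size.
coef = pd.Series(lr_demo.coef_, index=X.columns)
coef = coef.sort_values(key=abs, ascending=False)
print(coef)
```

Keep in mind that raw coefficients depend on the units of each feature, so their sizes are not directly comparable unless the features are on a common scale.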

6) Testing

In [161]:
test_x = np.asarray(test.iloc[:, :-1])  # same array format as the training data
test_y = np.asarray(test.iloc[:, -1])
In [162]:
y_pred = lr.predict(test_x)
In [163]:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, y_pred)
Out[163]:
13039778.151573779
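
MSE is in squared price units, which makes it hard to judge on its own. Taking the square root (RMSE) brings it back to the unit of price, and R² shows the share of variance explained. A tiny worked sketch with four made-up prices and predictions (the numbers are assumptions, not results from the model above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true prices and predictions.
y_true = np.array([10000.0, 15000.0, 20000.0, 30000.0])
y_hat = np.array([11000.0, 14000.0, 22000.0, 27000.0])

mse = mean_squared_error(y_true, y_hat)  # mean of the squared errors
rmse = np.sqrt(mse)                      # back in the unit of price
r2 = r2_score(y_true, y_hat)             # 1 - SS_res / SS_tot

print(mse, rmse, r2)
```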

Ridge regression

In [164]:
from sklearn.linear_model import Ridge
clf = Ridge(alpha=1.0)
clf.fit(train_x, train_y) 
Out[164]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
In [167]:
y_pred = clf.predict(test_x)
In [174]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test_y, y_pred)
mse
Out[174]:
13010424.150200721
In [175]:
np.sqrt(mse)
Out[175]:
3606.9965553352999
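
The Ridge MSE is only about 0.2% lower than the plain linear regression MSE, so alpha=1.0 on unscaled features changes very little here. Because the ridge penalty acts on the size of the coefficients, it is usually applied to standardized features, and alpha is usually chosen by cross-validation. A sketch of that setup on synthetic data (the data, the alpha grid, and the variable names are assumptions, not a tuned result for the automobile data):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, like width and curb-weight.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(60, 72, n), rng.uniform(1500, 4000, n)])
y = 1000 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1000, n)

# Candidate penalty strengths; RidgeCV picks one by cross-validation.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Standardize first so the penalty treats every feature equally.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)

best_alpha = model.named_steps["ridgecv"].alpha_
print(best_alpha, model.score(X, y))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so the same object can later be used to predict on the test set.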
