machine learning - How to perform undersampling (the right way) with Python scikit-learn?
I am attempting to perform undersampling of the majority class using Python scikit-learn. My code finds the n of the minority class and then tries to undersample the exact same n from the majority class. As a result, both the test and training data end up with a 1:1 distribution. What I want instead is a 1:1 distribution only on the training data, while testing on the original distribution in the testing data.
I am not quite sure how to do the latter, since there is dict vectorization in between, which makes it confusing to me.
    import numpy as np
    import pandas as pd
    from sklearn import cross_validation

    # Perform undersampling of the majority group
    minorityN = len(df[df.ethnicity_scan == 1])  # total count of the low-frequency group
    minority_indices = df[df.ethnicity_scan == 1].index
    minority_sample = df.loc[minority_indices]

    majority_indices = df[df.ethnicity_scan == 0].index
    # Use the low-frequency group count to randomly sample from the high-frequency group
    random_indices = np.random.choice(majority_indices, minorityN, replace=False)
    majority_sample = df.loc[random_indices]

    # Merge the low-frequency group sample and the new (randomly selected) high-frequency sample
    merged_sample = pd.concat([minority_sample, majority_sample], ignore_index=True)
    df = merged_sample
    print 'Total n after undersampling:', len(df)

    # Declaring variables
    x = df.raw_f1.values
    x2 = df.f2.values
    x3 = df.f3.values
    x4 = df.f4.values
    y = df.outcome.values

    # Codes skipped ....
    def feature_noneighborloc(locstring):
        pass

    my_dict16 = [{'location': feature_noneighborloc(feature_full_name(i))} for i in x4]
    # Codes skipped ....

    # Dict vectorization
    all_dict = []
    for i in range(0, len(my_dict)):
        temp_dict = dict(
            my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items()
            + my_dict4[i].items() + my_dict5[i].items() + my_dict6[i].items()
            + my_dict7[i].items() + my_dict8[i].items() + my_dict9[i].items()
            + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
            + my_dict13[i].items() + my_dict14[i].items() + my_dict19[i].items()
            + my_dict16[i].items()  # location feature
        )
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

    # Fitting X and y into the model, using the training data
    classifierUsed2.fit(X_train, y_train)

    # Making predictions using the trained model
    y_train_predictions = classifierUsed2.predict(X_train)
    y_test_predictions = classifierUsed2.predict(X_test)
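One way to get the 1:1 distribution only in the training data is to vectorize and split first, and then undersample only the training rows, leaving the test set at its original distribution. A minimal sketch of that idea, reusing names from the code above (all_dict, dv, y, testTrainSplit, and classifierUsed2 are assumed to be defined as in the question):

    import numpy as np
    from sklearn import cross_validation

    # Vectorize the full data set first, then split; the test set keeps
    # the original class distribution because it is never resampled
    newX = dv.fit_transform(all_dict)
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        newX, y, test_size=testTrainSplit)

    # Undersample the majority class (label 0) in the training rows only
    minority_idx = np.where(y_train == 1)[0]
    majority_idx = np.where(y_train == 0)[0]
    keep_majority = np.random.choice(majority_idx, len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, keep_majority])
    np.random.shuffle(keep)

    X_train_bal = X_train[keep]  # row indexing works on the sparse matrix from DictVectorizer
    y_train_bal = y_train[keep]

    classifierUsed2.fit(X_train_bal, y_train_bal)
    y_test_predictions = classifierUsed2.predict(X_test)  # evaluated on the untouched test set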
You want to subsample the training samples of one of the categories because you want a classifier that treats both labels the same.
If you want, instead of subsampling you can change the value of the 'class_weight' parameter of your classifier to 'balanced' (or 'auto' for some classifiers), which does the job you want to do.
You can read the documentation of the LogisticRegression classifier as an example. Notice the description of the 'class_weight' parameter there.
By changing that parameter to 'balanced' you won't need to do the subsampling anymore.
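With LogisticRegression the change is a single argument. A minimal sketch, assuming newX, y, and testTrainSplit from the question (in scikit-learn versions before 0.17, use class_weight='auto' instead):

    from sklearn import cross_validation
    from sklearn.linear_model import LogisticRegression

    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        newX, y, test_size=testTrainSplit)

    # 'balanced' weights each sample inversely proportional to its class
    # frequency, so no majority-class rows have to be discarded
    classifier = LogisticRegression(class_weight='balanced')
    classifier.fit(X_train, y_train)
    y_test_predictions = classifier.predict(X_test)

The test set still has the original distribution, and the classifier compensates for the imbalance through the loss function rather than by throwing away data.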