machine learning - How to perform undersampling (the right way) with python scikit-learn?


I am attempting to perform undersampling of the majority class using python scikit-learn. My code takes the n of the minority class and then tries to undersample the exact same n from the majority class. As a result, both the test and the training data end up with a 1:1 distribution. But what I want is the 1:1 distribution only on the training data, while testing on the original distribution in the testing data.

I am not quite sure how to achieve the latter, as there is dict vectorization in between, which makes it confusing to me.

import numpy as np
import pandas as pd
from sklearn import cross_validation

# Perform undersampling of the majority group
minorityn = len(df[df.ethnicity_scan == 1])  # total count of the low-frequency group
minority_indices = df[df.ethnicity_scan == 1].index
minority_sample = df.loc[minority_indices]

majority_indices = df[df.ethnicity_scan == 0].index
random_indices = np.random.choice(majority_indices, minorityn, replace=False)  # use the low-frequency group count to randomly sample the high-frequency group
majority_sample = df.loc[random_indices]

merged_sample = pd.concat([minority_sample, majority_sample], ignore_index=True)  # merge the low-frequency group sample and the new (randomly selected) high-frequency sample
df = merged_sample
print 'Total n after undersampling:', len(df)

# Declaring variables
X = df.raw_f1.values
X2 = df.f2.values
X3 = df.f3.values
X4 = df.f4.values
y = df.outcome.values

# Codes skipped ....
def feature_noneighborloc(locstring):
    pass
my_dict16 = [{'location': feature_noneighborloc(feature_full_name(i))} for i in X4]
# Codes skipped ....

# Dict vectorization (dv is a DictVectorizer created in the skipped code)
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(
        my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
        + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
        + my_dict9[i].items() + my_dict10[i].items()
        + my_dict11[i].items() + my_dict12[i].items() + my_dict13[i].items() + my_dict14[i].items()
        + my_dict19[i].items()
        + my_dict16[i].items()  # location feature
        )
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

# Fitting X and y into the model, using training data
classifierUsed2.fit(X_train, y_train)

# Making predictions using the trained model
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)

You want to subsample the training samples of one of the categories because you want the classifier to treat both labels the same.
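If you want to keep the explicit undersampling, the trick is to split into training and test sets before undersampling, and then undersample only the training rows; the test set then keeps the original distribution. Here is a minimal sketch of that idea, reusing the df and ethnicity_scan names from your snippet (the split fraction and random seed are illustrative):

import numpy as np
import pandas as pd
from sklearn import cross_validation

# Split the dataframe rows first, so the test set keeps the original
# class distribution
train_idx, test_idx = cross_validation.train_test_split(
    df.index.values, test_size=0.3, random_state=0)
df_train, df_test = df.loc[train_idx], df.loc[test_idx]

# Undersample the majority class in the training set only
minorityn = len(df_train[df_train.ethnicity_scan == 1])
majority_indices = df_train[df_train.ethnicity_scan == 0].index
random_indices = np.random.choice(majority_indices, minorityn, replace=False)
df_train = pd.concat([df_train[df_train.ethnicity_scan == 1],
                      df_train.loc[random_indices]], ignore_index=True)

# Then build the feature dicts from df_train and df_test separately, and
# fit the DictVectorizer on the training dicts only, e.g.:
# newX_train = dv.fit_transform(train_dicts)
# newX_test = dv.transform(test_dicts)

This way the dict vectorization is no longer a problem: you build the dicts per split and let the vectorizer learn its vocabulary from the training rows alone.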

If you want, instead of subsampling you can change the value of the 'class_weight' parameter of your classifier to 'balanced' (or 'auto' for some classifiers), which does the job you want.

You can read the documentation of the LogisticRegression classifier as an example. Notice the description of the 'class_weight' parameter there.

By changing that parameter to 'balanced' you won't need to do the subsampling anymore.
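For example (a sketch assuming a plain LogisticRegression; substitute your own classifier and the X_train/y_train from your pipeline):

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights the classes inversely proportional to
# their frequencies during fitting, so no manual undersampling is needed
classifierUsed2 = LogisticRegression(class_weight='balanced')
classifierUsed2.fit(X_train, y_train)
y_test_predictions = classifierUsed2.predict(X_test)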

