05-multi-category logistic regre

Author: 西瓜三茶 | Published: 2017-07-17 00:05

    1. Read the data and get the unique values of a column

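    • pd.get_dummies()
    • Converts a single column into several binary indicator columns, one per distinct value in that column.
      For example, suppose cars["year"].unique() returns the four values [1980, 1981, 1982, 1983].
      Then pd.get_dummies(cars["year"], prefix="year") yields 4 columns named year_1980, year_1981, year_1982, year_1983 (with "year" added as the prefix). Each column holds only 0 or 1 (True or False in recent pandas versions).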
    import pandas as pd
    cars = pd.read_csv("auto.csv")
    unique_regions = cars["origin"].unique()
    print(unique_regions)
    
    dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
    cars = pd.concat([cars, dummy_cylinders], axis=1)
    dummy_years = pd.get_dummies(cars["year"], prefix="year")
    cars = pd.concat([cars, dummy_years], axis=1)
    cars = cars.drop("year", axis=1)
    cars = cars.drop("cylinders", axis=1)
    print(cars.head())
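
To make the get_dummies behavior described above concrete, here is a minimal sketch on a made-up year column. The `toy` frame and its values are assumptions for illustration and are not taken from auto.csv.

```python
import pandas as pd

# Made-up data for illustration only; not taken from auto.csv.
toy = pd.DataFrame({"year": [1980, 1981, 1982, 1983, 1980]})

# dtype=int forces 0/1 integers; recent pandas versions default to booleans.
dummies = pd.get_dummies(toy["year"], prefix="year", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator set to 1, the one matching that row's original year.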
    

    2. Shuffle the index randomly, then split into train and test sets

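    import numpy as np

    # cars has the default 0..n-1 index, so the shuffled labels are also valid positions for iloc
    shuffled_rows = np.random.permutation(cars.index)
    shuffled_cars = cars.iloc[shuffled_rows]
    # Use 70% of the rows as training data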
    highest_train_row = int(cars.shape[0] * .70)
    train = shuffled_cars.iloc[0:highest_train_row]
    test = shuffled_cars.iloc[highest_train_row:]
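
The shuffle-and-split step above can be sketched on a toy frame. This version is a variation, not the article's code: it assumes a seeded NumPy generator for repeatability and uses label-based `.loc`, which also works when the index is not the default 0..n-1. The names `toy`, `train_toy`, and `test_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, to show why label-based .loc is used here.
toy = pd.DataFrame({"value": range(10)}, index=range(100, 110))

rng = np.random.default_rng(0)                      # fixed seed so the shuffle is repeatable
shuffled_labels = rng.permutation(toy.index.to_numpy())
shuffled = toy.loc[shuffled_labels]                 # these are labels, so .loc rather than .iloc

cutoff = int(len(shuffled) * 0.70)                  # first 70% of the shuffled rows go to training
train_toy = shuffled.iloc[:cutoff]
test_toy = shuffled.iloc[cutoff:]

print(len(train_toy), len(test_toy))
```

The two pieces never share a row, because both are slices of the same shuffled frame.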
    

    3. For each origin value (1, 2, 3), train a separate one-vs-rest model: one for origin == 1, one for origin == 2, and one for origin == 3

    • Ways to get the column names of a DataFrame
      df.columns.tolist(), df.columns.values.tolist(), or list(df). To keep only the columns with a given prefix: [c for c in df.columns if c.startswith("prefix")]
    from sklearn.linear_model import LogisticRegression
    
    unique_origins = cars["origin"].unique()
    unique_origins.sort()
    
    models = {}
    features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]
    
    for origin in unique_origins:
        model = LogisticRegression()
        
        X_train = train[features]
        y_train = train["origin"] == origin
    
        model.fit(X_train, y_train)
        models[origin] = model
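
Two ideas from this step can be checked without fitting anything: selecting feature columns by prefix, and building the True/False target that each per-origin model is trained on. The sketch below uses a made-up frame whose column names only mimic the dummy columns created earlier.

```python
import pandas as pd

# Made-up frame for illustration; column names mimic the dummy columns above.
toy = pd.DataFrame({
    "origin": [1, 2, 3, 1],
    "cyl_4": [1, 0, 1, 0],
    "cyl_8": [0, 1, 0, 1],
    "year_70": [1, 1, 0, 0],
    "mpg": [18.0, 30.5, 27.0, 15.0],
})

# Three equivalent ways to list the column names.
names_a = toy.columns.tolist()
names_b = toy.columns.values.tolist()
names_c = list(toy)

# str.startswith accepts a tuple, so both prefixes fit in a single call.
feature_cols = [c for c in toy.columns if c.startswith(("cyl", "year"))]

# One-vs-rest target for origin == 2: True for that class, False otherwise.
is_origin_2 = toy["origin"] == 2

print(feature_cols)
print(is_origin_2.tolist())
```

Passing a tuple to `startswith` is a shorter equivalent of chaining two calls with `or`, as the loop above does.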
    

    Next, compute each model's predicted probabilities on the test set

    testing_probs = pd.DataFrame(columns=unique_origins)  
    
    for origin in unique_origins:
        # Select testing features.
        X_test = test[features]   
        # Compute probability of observation being in the origin.
        testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
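
The `[:, 1]` slice above relies on the column order of `predict_proba`, which follows the fitted model's `classes_` attribute. A small sketch on synthetic data (the arrays `X` and `y` are made up for illustration) shows that with a True/False target, column 1 is the probability of True.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, clearly separable data; purely illustrative.
X = np.array([[0.0], [0.5], [1.0], [3.0], [3.5], [4.0]])
y = np.array([False, False, False, True, True, True])

clf = LogisticRegression().fit(X, y)

# classes_ gives the column order of predict_proba: False first, then True.
print(clf.classes_)

proba = clf.predict_proba(X)
positive_proba = proba[:, 1]    # probability of the True class
print(positive_proba.round(3))
```

Each row of `proba` sums to 1 within a single model. The probabilities collected across the three separate origin models need not sum to 1, since each model is fit independently.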
    

    Across the three columns, take the one with the highest probability as the predicted origin

    • Method: df.idxmax(axis=1) returns, for each row, the name of the column that holds the row's maximum value; if several columns tie, the first one is returned
    predicted_origins = testing_probs.idxmax(axis=1)
    print(predicted_origins)
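
As a sketch of how idxmax picks the label, and of one possible follow-up the article does not cover (measuring accuracy), consider the toy example below. The probabilities and the "true" labels in `toy_actual` are made up for illustration.

```python
import pandas as pd

# Made-up probabilities for three observations; columns are the origin labels.
# Rows need not sum to 1 because each column comes from a separate model.
toy_probs = pd.DataFrame({
    1: [0.70, 0.20, 0.10],
    2: [0.25, 0.50, 0.30],
    3: [0.10, 0.30, 0.60],
})
toy_actual = pd.Series([1, 2, 2])   # hypothetical true labels

# For each row, return the column label with the largest probability.
toy_predicted = toy_probs.idxmax(axis=1)

# Compare raw values so the check does not depend on index alignment.
accuracy = (toy_predicted.to_numpy() == toy_actual.to_numpy()).mean()

print(toy_predicted.tolist())
print(accuracy)
```

In the code above, testing_probs gets a fresh 0..n-1 index while test keeps its shuffled row labels, so comparing predicted_origins with test["origin"] is likely to need the same value-based comparison.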
    
