Kaggle: Detect toxicity - Basic

Author: 不会停的蜗牛 | Published: 2019-05-30 23:29

    The Kaggle competition is:

    https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview


    The goal is:

    to classify toxic comments,
    and in particular to recognize and reduce unintended bias towards mentioned identities.

    A toxic comment is one that is rude, disrespectful, or otherwise likely to make someone leave a discussion.


    The challenge is:

    some neutral comments that merely mention an identity such as "gay" get classified as toxic, e.g. "I am a gay woman".

    The reason is:
    in the training data, toxic comments mentioning such an identity outnumber neutral comments mentioning the same identity, so a model learns to associate the identity term itself with toxicity.
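
    A minimal sketch of how this imbalance could be checked in the training data (assuming ../input/train.csv from the competition's data page, the 0.5 toxicity threshold used later in this post, and one example identity column):

    import pandas as pd

    train_df = pd.read_csv('../input/train.csv')

    # overall rate of toxic comments (target >= 0.5)
    overall_rate = (train_df['target'] >= 0.5).mean()
    # rate of toxic comments among those tagged with one example identity
    tagged = train_df[train_df['homosexual_gay_or_lesbian'] > 0]
    tagged_rate = (tagged['target'] >= 0.5).mean()

    print(f'overall toxic rate: {overall_rate:.3f}')
    print(f'toxic rate among comments tagged with this identity: {tagged_rate:.3f}')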


    Dataset

    The dataset is labeled with the identities mentioned in each comment:

    https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data

    These columns are subtypes of toxicity and do not need to be predicted:

    severe_toxicity
    obscene
    threat
    insult
    identity_attack
    sexual_explicit

    These columns correspond to identity attributes, representing the identities mentioned in the comment:

    male
    female
    transgender
    other_gender
    heterosexual
    homosexual_gay_or_lesbian
    bisexual
    other_sexual_orientation
    christian
    jewish
    muslim
    hindu
    buddhist
    atheist
    other_religion
    black
    white
    asian
    latino
    other_race_or_ethnicity
    physical_disability
    intellectual_or_learning_disability
    psychiatric_or_mental_illness
    other_disability

    Additional columns:

    toxicity_annotator_count and identity_annotator_count, plus metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, disagree. The rating label is the civility rating that Civil Comments users gave the comment.

    Example:

    Comment: Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.
    Toxicity Labels: All 0.0
    Identity Mention Labels: homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)
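
    A quick sketch to get familiar with these columns (again assuming ../input/train.csv as in the code sections below; the columns chosen here are just an arbitrary subset of those listed above):

    import pandas as pd

    train_df = pd.read_csv('../input/train.csv')
    cols = ['target', 'homosexual_gay_or_lesbian', 'bisexual', 'transgender',
            'toxicity_annotator_count', 'identity_annotator_count', 'rating', 'likes']
    # show the first few rows of the selected label, identity and metadata columns
    print(train_df[cols].head())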


    1. Libs and Data:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import os
    print(os.listdir("../input"))
    
    train_df = pd.read_csv('../input/train.csv')
    test_df = pd.read_csv('../input/test.csv')
    

    2. Shape of data

    train_len, test_len = len(train_df.index), len(test_df.index)
    print(f'train size: {train_len}, test size: {test_len}')
    

    train size: 1804874, test size: 97320

    train_df.head()
    

    3. Count the amount of missing values

    miss_val_train_df = train_df.isnull().sum(axis=0) / train_len
    miss_val_train_df = miss_val_train_df[miss_val_train_df > 0] * 100
    miss_val_train_df
    
    • a large portion of the data doesn't have the identity tags
    • the missing percentage is the same for every identity column, i.e. a comment either has all identity labels or none (a quick check follows below)
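
    A minimal sketch of that check, continuing from the code above: grouping the missing percentages shows how many columns share each value.

    # group the columns that have missing values by their missing percentage;
    # the identity columns all end up in one group, i.e. they are missing together
    print(miss_val_train_df.value_counts())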

    4. Visualization

    Q1: which identity appears the most in the dataset?

    According to the data description, we only care about the identities tagged in this dataset, so we make a list of them:

    identities = ['male','female','transgender','other_gender','heterosexual','homosexual_gay_or_lesbian',
                  'bisexual','other_sexual_orientation','christian','jewish','muslim','hindu','buddhist',
                  'atheist','other_religion','black','white','asian','latino','other_race_or_ethnicity',
                  'physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness',
                  'other_disability']
    

    From the diagram below we can also see the distribution of toxic and non-toxic comments for each identity:

    # getting the dataframe with identities tagged
    train_labeled_df = train_df.loc[:, ['target'] + identities].dropna()
    # let's define a toxic comment as one with a score greater than or equal to .5
    # in that case we can split the data into two dataframes and count toxic vs non-toxic comments per identity
    toxic_df = train_labeled_df[train_labeled_df['target'] >= .5][identities]
    non_toxic_df = train_labeled_df[train_labeled_df['target'] < .5][identities]
    
    # at first, we just consider the identity tags in binary format: if the tag has any value other than 0 we treat it as 1
    toxic_count = toxic_df.where(toxic_df == 0, other=1).sum()
    non_toxic_count = non_toxic_df.where(non_toxic_df == 0, other=1).sum()
    
    # now we can concat the two series to get a toxic count vs non-toxic count for each identity
    toxic_vs_non_toxic = pd.concat([toxic_count, non_toxic_count], axis=1)
    toxic_vs_non_toxic = toxic_vs_non_toxic.rename(columns={0: "toxic", 1: "non-toxic"})
    # here we plot the stacked graph, sorted by toxic count, to (perhaps) see something interesting
    toxic_vs_non_toxic.sort_values(by='toxic').plot(kind='bar', stacked=True, figsize=(30,10), fontsize=20).legend(prop={'size': 20})
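
    As a quick follow-up sketch: the raw counts above are dominated by how often each identity is tagged at all, so the toxic share per identity can be more telling, and it previews the weighted analysis in Q2.

    # toxic share per identity, computed from the counts built above
    toxic_share = toxic_vs_non_toxic['toxic'] / toxic_vs_non_toxic.sum(axis=1)
    print(toxic_share.sort_values(ascending=False))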
    

    Q2: which identities are more frequently related to toxic comments?

    • consider the toxicity score (target) of each comment
    • also take into account how strongly each identity is tagged in the comment
    # first we multiply each identity column by the target and sum over all comments
    weighted_toxic = train_labeled_df.iloc[:, 1:].multiply(train_labeled_df.iloc[:, 0], axis="index").sum()
    # change the identity values to 1 or 0 only to get the comment count per identity group
    identity_label_count = train_labeled_df[identities].where(train_labeled_df[identities] == 0, other=1).sum()
    # then we divide the target-weighted value by the number of times each identity appears
    weighted_toxic = weighted_toxic / identity_label_count
    weighted_toxic = weighted_toxic.sort_values(ascending=False)
    # plot the data using seaborn
    plt.figure(figsize=(30,20))
    sns.set(font_scale=3)
    ax = sns.barplot(x = weighted_toxic.values , y = weighted_toxic.index, alpha=0.8)
    plt.ylabel('Demographics')
    plt.xlabel('Weighted Toxicity')
    plt.title('Weighted Analysis of Most Frequent Identities')
    plt.show()
    

    Conclusion: race-based identities (White and Black) and religion-based identities (Muslim and Jewish) are the most heavily associated with toxic comments.
