About Data

data = data[['label_sexist', 'text', 'split']]

Here we only need three columns (a loading sketch follows the list):

  • split: indicates which set each row belongs to: training, validation, or test.

  • text: contains the raw text the model is trained on.

  • label_sexist: the label for each row, indicating whether the text is sexist or not.
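A minimal loading sketch, assuming a pandas DataFrame read from a CSV; the file name edos_labelled.csv is hypothetical, and any CSV with these three columns works:

import pandas as pd

# Hypothetical file name; any CSV containing the three columns works
data = pd.read_csv('edos_labelled.csv')
data = data[['label_sexist', 'text', 'split']]
print(data['split'].value_counts())  # e.g. row counts for train / dev / test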

About Data Preparation

Convert the string labels to integers with LabelEncoder, then one-hot encode them with to_categorical:

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()
data['label_sexist'] = label_encoder.fit_transform(data['label_sexist'])

# to_categorical returns a 2-D array, so keep it in its own variable
# rather than assigning it back to a single DataFrame column
num_classes = data['label_sexist'].nunique()
labels = to_categorical(data['label_sexist'], num_classes=num_classes)
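To sanity-check the encoding, you can inspect the fitted classes. The exact label strings are an assumption here; LabelEncoder sorts classes alphabetically:

print(list(label_encoder.classes_))  # e.g. ['not sexist', 'sexist'] -> 0, 1
print(labels[:2])                    # one-hot rows such as [1. 0.] and [0. 1.]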

Clean data

  • Remove emojis and other non-letter symbols:
import re

def clean_text(text):
    text = text.lower()                        # lowercase everything
    text = re.sub(r'[^a-zA-Z]', ' ', text)     # keep letters only (drops emojis, digits, symbols)
    text = re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace
    return text
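The X_train and X_val frames used below come from the split column. A sketch, assuming the split values are the strings 'train' and 'dev':

# Apply the cleaner, then separate rows by the 'split' column
data['text'] = data['text'].apply(clean_text)
X_train = data[data['split'] == 'train']
X_val = data[data['split'] == 'dev']

# Integer 0/1 labels, matching the sigmoid output used later
y_train = X_train['label_sexist'].values
y_val = X_val['label_sexist'].values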

Tokenize and create vocab

texts_train = X_train['text'].tolist()
texts_val = X_val['text'].tolist()

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the training texts only, so the vocabulary
# (and its word indices) are not leaked from the validation set
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts_train)
X_train_sequences = tokenizer.texts_to_sequences(texts_train)
X_val_sequences = tokenizer.texts_to_sequences(texts_val)

# Pad or truncate every sequence to a fixed length
max_sequence_length = 70
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_sequence_length)
X_val_padded = pad_sequences(X_val_sequences, maxlen=max_sequence_length)

The text data is split into training and validation sets, and the Tokenizer class converts the raw text into sequences of integer word indices; it is fitted on the training texts only, so no validation vocabulary leaks into the model.
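A toy example of what the tokenizer and padding produce (the exact indices depend on the fitted vocabulary):

toy = Tokenizer()
toy.fit_on_texts(['she codes well', 'she writes well'])
print(toy.word_index)                         # {'she': 1, 'well': 2, 'codes': 3, 'writes': 4}
print(toy.texts_to_sequences(['she codes']))  # [[1, 3]]
print(pad_sequences([[1, 3]], maxlen=5))      # [[0 0 0 1 3]] -> zero-padded on the left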

About Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.regularizers import l2

model = Sequential()

  • A Sequential model is a neural network architecture in which the layers are arranged in order, each layer feeding its output to the next.

model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50, input_length=max_sequence_length))

  • input_dim is the size of the encoded vocabulary (+1 for the padding index 0).

  • An Embedding layer represents categorical or discrete data (here, word indices) as continuous vectors.

model.add(Bidirectional(LSTM(128, return_sequences=True)))

Let's break this code down:

LSTM(128, return_sequences=True)
  • This creates an LSTM layer with 128 units (the size of its hidden state, loosely its "memory neurons").

What is an LSTM, and what is it used for?

  • LSTM (Long Short-Term Memory) is a recurrent layer used in neural networks to model sequences and learn long-range dependencies.

  • With return_sequences=True, the layer produces an output for every element of the input sequence rather than only the last one; the Bidirectional wrapper runs a second LSTM over the reversed sequence and concatenates the two outputs, as the shape check below illustrates.
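A quick shape check on dummy data shows what this layer emits; the forward and backward outputs are concatenated, giving 2 × 128 = 256 features per timestep:

import numpy as np
import tensorflow as tf

# (batch, timesteps, features) dummy input, e.g. 50-dimensional embeddings
dummy = np.zeros((1, 70, 50), dtype='float32')
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))
print(bilstm(dummy).shape)  # (1, 70, 256)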
model.add(Dropout(0.5))

To reduce overfitting, we now add a dropout layer that randomly sets 50% of its inputs to zero on each forward pass during training.
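A small illustration of that behavior: dropout is active only in training mode, and Keras uses inverted dropout, scaling the surviving units by 1 / (1 - rate) so the expected sum is unchanged:

import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = np.ones((1, 8), dtype='float32')
print(drop(x, training=False).numpy())  # unchanged: dropout is off at inference
print(drop(x, training=True).numpy())   # about half the entries zeroed, survivors scaled to 2.0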

model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))

Dense represents a fully connected layer.

  • relu is an activation function defined as f(x) = max(0, x).

  • L2 regularization is applied to the layer's weights and helps prevent overfitting by adding a penalty term, 0.01 times the sum of squared weights, to the loss function (both pieces are illustrated below).
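Both pieces are easy to check in isolation: relu on a few values, and the penalty l2(0.01) contributes for a given weight tensor:

import numpy as np
import tensorflow as tf
from tensorflow.keras.regularizers import l2

print(tf.keras.activations.relu(np.array([-2.0, -0.5, 0.0, 3.0])).numpy())
# [0. 0. 0. 3.] -> negatives are clipped to zero

reg = l2(0.01)
w = tf.constant([1.0, -2.0])
print(reg(w).numpy())  # 0.01 * (1^2 + (-2)^2) = 0.05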

model.add(Dense(1, activation='sigmoid'))
  • 1: the number of units in this Dense layer; one unit suffices because this is binary classification, so the model outputs a single value.

  • sigmoid: the sigmoid activation on this single output unit squashes the value into the range (0, 1), so it can be read as the probability of the positive class (a full assembled sketch follows).
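Putting the layers together, a sketch of the assembled model with a compile step. The optimizer, loss, and metrics are assumptions (they are not shown in these notes), and since return_sequences=True emits one vector per timestep, the sketch adds a GlobalMaxPooling1D to collapse the time axis before the dense head (alternatively, set return_sequences=False):

from tensorflow.keras.layers import GlobalMaxPooling1D

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50,
                    input_length=max_sequence_length))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(GlobalMaxPooling1D())  # collapse per-timestep outputs to one vector
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(1, activation='sigmoid'))

# Assumed compile settings for binary classification with a sigmoid output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])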

Using Early Stopping

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

Early stopping in the fit function is generally used to prevent overfitting. We must pass it a quantity to monitor; here it is the validation loss. If the validation loss does not improve for 5 consecutive epochs, training stops, and restore_best_weights=True restores the weights from the epoch with the lowest validation loss.
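A sketch of the corresponding fit call; epochs and batch_size are assumed values, and y_train / y_val are the integer 0/1 labels prepared earlier (matching the single sigmoid output):

history = model.fit(
    X_train_padded, y_train,
    validation_data=(X_val_padded, y_val),
    epochs=50,       # upper bound; early stopping usually halts sooner
    batch_size=32,   # assumed value
    callbacks=[early_stopping],
)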