DataFountain-虚假职位招聘预测

赛题链接：https://www.datafountain.cn/competitions/448

任务分析：这是一个分类任务，难点在于，第一，数据集难以处理，第二，类别不平衡，负类样本占比达到95%

运行环境、使用框架、语言

操作系统：Linux ubuntu 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

硬件环境： Intel® Xeon® Gold 6226R CPU @ 2.90GHz + NVIDIA GeForce RTX 3090 * 1 (CUDA Version: 12.4)

语言和运行环境：python 3.9.21 (main, Dec 11 2024, 16:24:11) + jupyter notebook 7.2.2

使用框架：pytorch 2.5.1+cu124

数据分析、特征设计、抽取、处理

数据处理部分的代码在data_processing.ipynb中，处理好的数据存储在 train_data.csv和test_data.csv。

特征一览，总共16个特征（不包括 job_id 和 fraudulent ）：

   			 Feature  Unique Values Count Value Type  NaN Values Count
              title                 6851     object                 0
           location                 2284     object               198
         department                  944     object              6388
       salary_range                  608     object              8393
    company_profile                 1397     object              1885
        description                 8663     object                 0
       requirements                 6965     object              1566
           benefits                 3861     object              4084
      telecommuting                    2      int64                 0
   has_company_logo                    2      int64                 0
      has_questions                    2      int64                 0
    employment_type                    5     object              1992
required_experience                    7     object              3960
 required_education                   13     object              4548
           industry                  125     object              2781
           function                   37     object              3654
         fraudulent                    2      int64                 0  (label)

将特征划分为4种：

categorical feature (种类特征) ：telecommuting 、has_company_logo 、has_questions、employment_type、required_experience 、required_education、required_education 、industry 、function
Continuous feature （连续特征）：无
text feature（文本特征）：title、company_profile 、description、requirements、 benefits
complex feature（复杂特征）： location、department、salary_range

其中前两种特征好处理，后两种特征难以处理，我的思路是：将文本特征和复杂特征抽取为种类特征和连续特征，然后统一处理数据。

将 department 和 location 转化为种类特征

我们看看department的最高的50项频次，如下图所示，可以看出sales出现的次数最多，其次是engineering。

将department feature中频次小于等于4的值都设置为others，即出现概率小于千分之一的值都设为others，这样处理以后department 中不同值的种类就变少了，就可以转化为种类特征。

department_counts = data['department'].value_counts()
low_frequency_departments = department_counts[department_counts <= 4].index
data['department'] = data['department'].replace(low_frequency_departments, 'others')
cat_features+=['department']

同理，对于location feature，我将频次小于等于10的值都设置为others，即出现概率小于千分之一的值都设置为others，然后就将location转化为种类特征。

location_counts = data['location'].value_counts()
low_frequency_locations = location_counts[location_counts <= 10].index
data['location'] = data['location'].replace(low_frequency_locations, 'others')
cat_features+=['location']

这样处理后，location中不同值的个数为144，department中不同值的个数为107

将salary_range转化为连续特征

连续特征一般是数值特征，我们知道工资是数字，它是连续特征，于是我将salary_range分解为’min_salary’和’max_salary’，分别对应最低工资和最高工资，两个连续特征。对于不合法数据则设置为NaN。

代码如下：

salary_range = data.salary_range.copy()
salary_range_seg = list(salary_range.str.split('-').values)
for ind, seg in enumerate(salary_range_seg):
    if isinstance(seg, list) and len(seg) == 2:  # 确保是一个两元素的列表
        sal_min, sal_max = seg
        # 检查最小值和最大值是否为数字
        if not sal_min.isdigit() or not sal_max.isdigit():
            print(f"Invalid range at index {ind}: {seg}")
            salary_range_seg[ind] = [np.nan, np.nan]  # 设置为 NaN
        elif sal_min=='0' and sal_max=='0':
            salary_range_seg[ind] = [np.nan, np.nan]
    else:
        salary_range_seg[ind] = [np.nan, np.nan]  # 设置为 NaN

salary_range_df = pd.DataFrame(salary_range_seg, columns=['min_salary', 'max_salary'])

# 确保无法转换的值变为 NaN
salary_range_df['min_salary'] = pd.to_numeric(salary_range_df['min_salary'], errors='coerce')
salary_range_df['max_salary'] = pd.to_numeric(salary_range_df['max_salary'], errors='coerce')

# 使用 Int64 类型（允许 NaN）
salary_range_df = salary_range_df.astype({'min_salary': 'Int64', 'max_salary': 'Int64'})

# 合并拆分后的数据到原 DataFrame，并删除原 salary_range 列
data = pd.concat([data, salary_range_df], axis=1)
data.drop('salary_range', axis=1, inplace=True)
Contin_feature+=['min_salary', 'max_salary']

将文本特征转化为连续特征

在处理文本特征前，先计算文本中单词数量，将其作为一个额外连续特征，加入数据集中。

data[text_features] = data[text_features].fillna('')
for feature in text_features: #将文本特征 计算字数 作为一个新特征
    feature_count_words=f'{feature}_count_words'
    data[feature_count_words]=data[feature].astype(str).str.split(' ').apply(len) 
    Contin_features +=[feature_count_words]

如何将文本特征转化为连续特征?

我的思路是，首先将title、company_profile 、description、requirements、 benefits中的文本合并，作为combined_text，然后使用TfidfVectorizer提取文本中的高频词汇，包括高频unigram,bigram,trigram，将这些高频项（gram）作为新的特征。每个特征都是一个连续特征，表示该项出现的次数。例如，一个样本的combine_text中出现了10次 machine learning，则该样本的特征 bigram :machine learning的值就是10.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

data['combined_text'] = data[text_features].apply(lambda row: ' '.join(row.astype(str)), axis=1) #合并文本,不同文本特征之间使用 ' '连接
data = data.drop(columns=text_features) #删除原来的文本特征
vectorizer_unigrams = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), max_features=200) #选择最高频的 200个 unigrams
vectorizer_bigrams = TfidfVectorizer(stop_words='english', ngram_range=(2, 2), max_features=100) # 选择最高频的 100个 bigrams
vectorizer_trigrams = TfidfVectorizer(stop_words='english', ngram_range=(3, 3), max_features=50) # 选择最高频的 50个 trigrams
X_unigrams = vectorizer_unigrams.fit_transform(data['combined_text'])
X_bigrams = vectorizer_bigrams.fit_transform(data['combined_text'])
X_trigrams = vectorizer_trigrams.fit_transform(data['combined_text'])
# 合并 unigrams ，bigrams和 trigrams 的特征
X_combined = hstack([X_unigrams, X_bigrams,X_trigrams])
# 生成新的列名
unigram_features = [f"text_freq_unigrams_{feat}" for feat in vectorizer_unigrams.get_feature_names_out()]
bigram_features = [f"text_freq_bigrams_{feat}" for feat in vectorizer_bigrams.get_feature_names_out()]
trigram_features = [f"text_freq_trigrams_{feat}" for feat in vectorizer_trigrams.get_feature_names_out()]
combined_features = unigram_features + bigram_features+trigram_features

Contin_features+=combined_features #将新的特征加入连续特征

tfidf_df = pd.DataFrame(X_combined.toarray(), columns=combined_features)
data = data.drop(columns=['combined_text'])
data = pd.concat([data.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis=1) #更新后的数据集

处理连续特征

连续特征都是数值特征，就做一下变换，变成均值为0方差为1的数值，就行了，nan值填入均值0

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# 处理连续特征

imputer = SimpleImputer(strategy='mean') #使用均值填充nan值
data[Contin_features] = imputer.fit_transform(data[Contin_features])
scaler = StandardScaler()
data[Contin_features] = scaler.fit_transform(data[Contin_features])

处理种类特征

使用one-hot编码即可

data= pd.get_dummies(data, columns=cat_features,dummy_na=True)

生成训练集和测试集

根据实验要求，划分10%的数据作为测试集合，这里采用分层采样，确保训练集和测试集中fraudulent的值比例一致。

from sklearn.model_selection import train_test_split

data = data[[col for col in data.columns if col != 'fraudulent'] + ['fraudulent']]
X = data.iloc[:, :-1]  # 特征
y = data.iloc[:, -1]   # 标签

# 划分训练集和测试集，比例为 9:1，分层采样
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

train_data = pd.concat([X_train.astype('float'), y_train], axis=1)
test_data = pd.concat([X_test.astype('float'),y_test],axis=1)

train_data.to_csv('train_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

最后生成的数据集信息如下

<class 'pandas.core.frame.DataFrame'>
Index: 8999 entries, 3990 to 5488
Columns: 812 entries, min_salary to fraudulent
dtypes: float64(811), int64(1)
memory usage: 55.8 MB
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 4359 to 3820
Columns: 812 entries, min_salary to fraudulent
dtypes: float64(811), int64(1)
memory usage: 6.2 MB

可以看到训练集和测试集的维度（列数）是812

应对：类别不平衡

这个任务中fraudulent标签中正例和负例是相当不平衡的，正例远小于负例，如下图所示

面对训练集的类型不平衡，我采用两种策略应对

过采样训练集中的正类样本
使用focal loss，增大正类样本被预测为负类的惩罚

过采样

train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')

X_test = torch.tensor(test_data.iloc[:,:-1].values, dtype=torch.float32)
y_test = torch.tensor(test_data.iloc[:,-1].values, dtype=torch.long)

X_train,y_train = train_data.iloc[:, :-1].values,train_data.iloc[:, -1].values
ros = RandomOverSampler(sampling_strategy={1: int(0.33*len(y_train[y_train == 0]))}) #过采样 训练集中 标签为1的样本，直到样本数量达到 标签为0的样本的1/3
X_train, y_train = ros.fit_resample(X_train, y_train) 
X_train = torch.tensor(test_data.iloc[:,:-1].values, dtype=torch.float32)
y_train = torch.tensor(test_data.iloc[:,-1].values, dtype=torch.long)

如上面代码中所示，将训练集中的标签为1的样本，多次采样，直到达到训练集中标签为0的样本数量的1/3。

focal loss

Focal Loss 是一种专门为解决类别不平衡问题设计的损失函数，最早由 Facebook 的研究人员在论文《Focal Loss for Dense Object Detection》中提出。它常用于目标检测任务（例如 RetinaNet），以缓解正负样本比例失衡对模型训练的影响。

Focal Loss 的数学公式如下：

$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$

$p_t$ : 样本的预测概率，表示模型预测的正确类别的概率。

$p_t = \begin{cases} p, & \text{如果标签为正样本} \\ 1-p, & \text{如果标签为负样本} \end{cases}$

其中 p 是正类别的预测概率。
$\alpha_t$ : 平衡因子，用于平衡正负样本的影响（可选）。
$\gamma$ : 调节因子（Focusing Parameter），控制易分类样本的损失衰减速度。当 $\gamma=0$ 时，Focal Loss 等价于交叉熵损失。
$(1-p_t)^\gamma$ : 难度加权项。对于易分类的样本（ $p_t$ 大），这一项接近 0，从而减少其对损失的贡献；对于难分类的样本（ $p_t$ 小），这一项较大，从而增加其对损失的贡献。

在我的设计中选择 $\alpha_0=1,\alpha_1=5,\gamma=2$ ，增加正例被判为负例的惩罚。

class FocalLoss(nn.Module):
    def __init__(self, alpha, gamma=2, reduction='mean'):
        """
        alpha: tensor数组 (num_label,1)
        gamma: 难易样本调节因子。
        reduction: 损失的聚合方式，支持 'mean', 'sum', or 'none'。
        """
        super(FocalLoss, self).__init__()
        self.alpha = alpha.to(device)
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, pred, target):
        '''
        pred: tensor数组 (batch_size,num_labels) 每个种类的预测概率 
        target: tensor数组 (batch_size,) 真实标签
        '''
        pred=pred.gather(1, target.unsqueeze(1)) #根据真实标签选择 概率 pred ,shape (batch_size,1) 
        pred=-(1 - pred) ** self.gamma * torch.log(pred) # -(1-p)^gamma log(p)
        selected_alpha=torch.matmul(F.one_hot(target,num_classes=2).float(), self.alpha) # (batch_size,1) 计算每个pred 对应的alpha
        loss=pred*selected_alpha
        if self.reduction == 'mean':
            return torch.mean(loss)
        elif self.reduction == 'sum':
            return torch.sum(loss)
        else:
            return loss
      
loss_fn =FocalLoss(alpha)

模型设计

这其实就是一个回归任务，使用几个全连接层就行了，为了防止过拟合，加入了dropout层。

class myNN(nn.Module):
    def __init__(self, input_dim, output_dim,dropout=0.2):
        super(myNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 2*input_dim)  # 第一层
        self.fc2 = nn.Linear(2*input_dim, int(0.5*input_dim))         # 第二层
        self.fc3 = nn.Linear(int(0.5*input_dim), output_dim)  # 输出层
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return self.softmax(x)
    
net=myNN(X_train.shape[1],2)

评价指标AUC

AUC(Area under Curve)：Roc曲线下的面积，介于0.1和1之间。Auc作为数值可以直观的评价分类器的好坏，值越大越好。详情见赛题链接：https://www.datafountain.cn/competitions/448。

def calculate_auc(preds,labels):
    '''
    - preds: 预测的概率 numpy数组 (num_samples,2) 
    - labels: 真实标签 numpy数组 (num_samples,) 
    '''
    from sklearn.metrics import roc_curve, auc
    y_pred_proba=preds[:,-1] # 正类标签的预测概率
    fpr, tpr, thresholds = roc_curve(labels, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    return roc_auc

实验结果分析

训练时采用单张NVIDIA GeForce RTX 3090 ，batch_size=200, num_epochs=15, 学习率为0.001，优化器使用adam，为了防止过拟合，设置权重衰减率为0.01.

训练代码如下：

def train(device, net, X_train, y_train, X_test, y_test, batch_size=200, num_epochs=15, lr=0.001, weight_decay=0.01):
    """
    - device: 设备
    - net: 神经网络模型
    - X_train: 训练集的特征 (tensor 数组) (num_train,dimension)
    - y_train: 训练集的标签 (tensor 数组) (num_train,)
    - X_test: 测试集的特征 (tensor 数组) (num_test,dimension)
    - y_test: 测试集的标签 (tensor 数组) (num_test,)
    - batch_size: 
    - num_epochs: 
    - lr: learning rate
    """
    # 创建训练集的 DataLoader
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    net = net.to(device)
    optimizer = optim.Adam(net.parameters(), lr=lr, weight_decay = weight_decay) #使用adam优化器

    for epoch in range(num_epochs):
        net.train() #训练模式
        ls=0.00
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device) 

            optimizer.zero_grad()
            outputs = net(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            ls+=loss
        
        net.eval() #评估模式
        with torch.no_grad():
            # 计算平均训练误差
            train_loss= ls/len(train_loader)
            # 计算测试误差，测试集精确度和测试集 auc
            X_test = X_test.to(device)
            y_test = y_test.to(device)
            test_outputs=net(X_test)
            test_loss=loss_fn(test_outputs,y_test).item() #测试误差
            _, test_predicted = torch.max(test_outputs,1)
            test_accuracy = (test_predicted == y_test).float().mean().item() #测试集 精确度
            test_auc=calculate_auc(test_outputs.cpu().numpy(),y_test.cpu().numpy()) #测试集 auc

            print(f"Epoch [{epoch+1}], train_loss: {train_loss:.4f}, \
            test_loss: {test_loss:.4f}, test_accuracy: {test_accuracy:.4f}, test_auc: {test_auc:.4f}")
    return net

训练结果如下图所示：

可以看到在训练3个epoch后就能达到很好的效果，在epoch 8 以后收敛。

还是有点震惊的：测试集的准确率为100%，auc=1。

声明：

训练集和测试集互斥
数据处理时，计算均值方差和频次时，使用了测试集的相关特征
数据处理过程中对于label是不可见的，不存在测试集标签泄露

参考文献

无