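Data Acquisition

Yang Li
Institute of Computing Technology, Chinese Academy of Sciences
This repository is for learning purposes only. Do not use it for any commercial or illegal activity; you bear the consequences if you do.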

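Zhihu Data

The page can be fetched directly with a GET request from the requests library, and the returned HTML parsed with lxml to obtain each question and its link: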
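
As a rough illustration of this step, the sketch below extracts question links from a hot-list page. It is not the repository's code: it uses the standard library's html.parser in place of lxml so that it runs without extra dependencies, the sample markup is hypothetical, and the rule "an href containing /question/" is an assumption about how question links look.

```python
from html.parser import HTMLParser


class QuestionLinkParser(HTMLParser):
    """Collect (title, href) pairs for anchors whose href contains '/question/'."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (title, href) pairs
        self._href = None    # href of the question anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if '/question/' in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def extract_question_links(html):
    parser = QuestionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical fragment; the real hot-list markup may differ.
sample_html = '''
<div><a href="https://www.zhihu.com/question/123"><h2>Example question</h2></a></div>
<div><a href="https://www.zhihu.com/people/abc">a user</a></div>
'''
links = extract_question_links(sample_html)
question_ids = [href.split('/')[-1] for _, href in links]
```

In a real crawl the HTML would come from the response body of the GET request, and the last path segment of each link gives the question id used by the API calls.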

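To crawl all answers to a question, use the Zhihu API, which returns the page content as JSON:

At the end of the response, is_end and next indicate whether there is another page and, if so, its URL, so collecting every answer to a question only requires visiting a sequence of URLs: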

global question_id
question_id = url.split('/')[-1]  # url is the question link

# Crawl through the Zhihu API
answer_dict = {}
url_api = f'https://www.zhihu.com/api/v4/questions/{question_id}/feeds?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Creaction_instruction%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&offset=0&limit=5&order=default&platform=desktop'
is_end = False  # there may be fewer than 20 answers; this flag (also returned by the Zhihu API) prevents an infinite loop
# print(url_api)

num_answer = len(answer_dict)
while not is_end and num_answer < top_k:
    is_end, next_url_api = spider_for_zhihu(url_api, answer_dict, num_answer, usr_info)
    # print(type(is_end))
    num_answer = len(answer_dict)
    url_api = next_url_api

# Append once, after the loop, so the same dict is not added again for every page
answer_list.append(answer_dict)
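
The paging logic can also be looked at in isolation. The following is a minimal sketch, not the repository's spider_for_zhihu: it assumes each decoded JSON page keeps its answers under data and the is_end and next fields under paging, and it takes the page-fetching function as a parameter (fetch_page, a hypothetical name) so the loop can be tried without network access.

```python
def collect_answers(fetch_page, first_url, top_k):
    """Follow paging.next until paging.is_end is true or top_k answers are collected."""
    answers = []
    url = first_url
    is_end = False
    while not is_end and len(answers) < top_k:
        page = fetch_page(url)              # one decoded JSON page (a dict)
        answers.extend(page['data'])
        is_end = page['paging']['is_end']   # True on the last page
        url = page['paging']['next']        # URL of the following page
    return answers[:top_k]


# Stand-in for the network: two fake pages keyed by URL.
fake_pages = {
    'page-1': {'data': ['a1', 'a2'], 'paging': {'is_end': False, 'next': 'page-2'}},
    'page-2': {'data': ['a3'], 'paging': {'is_end': True, 'next': ''}},
}
result = collect_answers(fake_pages.__getitem__, 'page-1', top_k=20)
first_two = collect_answers(fake_pages.__getitem__, 'page-1', top_k=2)
```

Checking is_end is what stops the loop when a question has fewer than top_k answers; without it the loop would keep requesting pages that contain nothing new.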

Organize the collected data in a simple structure and save it as JSON.
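
A small sketch of the saving step follows. The record layout is an assumption made for illustration (one dict per question, with answers stored as a dict, which matches how item['answers'].values() is read later in this document); the field names question and url and the file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical record layout: one dict per question.
records = [{
    'question': '示例问题',
    'url': 'https://www.zhihu.com/question/123',
    'answers': {'user_a': '示例回答'},
}]

path = os.path.join(tempfile.mkdtemp(), 'zhihu.json')
with open(path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
```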

Because the Zhihu hot list keeps changing and each day's crawl yields roughly 2 MB of data, growing the dataset is straightforward: just crawl every day.

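Quora Data

Some problems remain. For example, many Quora questions ask about an image, so the data from a given day may not be of high quality, although the statistical bias is small.

Quora loads its pages dynamically, so browser behavior has to be simulated with the selenium library.

Clicking "more" reveals the link to the question:

Like Zhihu, Quora's home page changes in real time, so growing the dataset is just as easy.

While scrolling, break out of the loop once the bottom of the page has been reached: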

# Scroll down so that enough answers are loaded
js = "window.scrollTo(0,document.body.scrollHeight)"
temp_height = 0
for _ in range(top_k // 2):
    driver.execute_script(js)
    time.sleep(3)
    # Current distance of the scroll bar from the top of the page
    check_height = driver.execute_script(
        "return document.documentElement.scrollTop || window.pageYOffset || document.body.scrollTop;")
    # No change since the previous scroll means the bottom has been reached
    if check_height == temp_height:
        break
    temp_height = check_height

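Quora's answer blocks also follow a regular pattern:

This calls for XPath partial matching to collect every div element whose class contains dom_annotate_question_answer_item_:

from selenium.webdriver.common.by import By  # Selenium 4 locator API; find_elements_by_xpath was removed

ans_block = driver.find_elements(By.XPATH, '//div[contains(@class, "dom_annotate_question_answer_item_")]')

Then scroll to each element in turn and click it so that the answer is expanded, and only then read its content: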

for block in ans_block:
    driver.execute_script("arguments[0].scrollIntoView();", block)  # scroll to this block
    time.sleep(1)
    block.click()  # expand the answer
    time.sleep(1)

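A few points to note:

Filter out empty answers: some questions are photos or links, so an answer does not necessarily contain text.
Quora allows anonymous answers, for which usr_url may not be scraped correctly, so check for this case.
Do not scrape the "Related" blocks: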

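from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    # Skip "Related" blocks
    related = block.find_element(By.XPATH, './/div[@class="q-text qu-dynamicFontSize--small qu-fontWeight--regular"]').text
    if related == 'Related':
        continue
except NoSuchElementException:  # no such label in this block, so it is a regular answer
    pass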
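
The first two points can be handled by a small check before an answer is stored. This is only a sketch under assumptions: normalize_answer is a hypothetical helper that is not part of the repository, and the placeholder 'anonymous' for a missing usr_url is an arbitrary choice.

```python
def normalize_answer(text, usr_url):
    """Return (text, usr_url) ready for storage, or None if the answer has no text."""
    text = (text or '').strip()
    if not text:
        return None                          # image-only or link-only answer: skip it
    return text, (usr_url or 'anonymous')    # anonymous answers have no profile link


empty = normalize_answer('   ', 'https://www.quora.com/profile/x')
anonymous = normalize_answer(' Some answer text ', None)
regular = normalize_answer('Another answer', 'https://www.quora.com/profile/x')
```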

Data Processing

Chinese Data

Data Cleaning

Useless symbols
Special symbols

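import re
import string

import zhon.hanzi


def clean_text(text):
    """Remove digits, punctuation, and special characters."""
    text = re.sub(r'\d+', '', text)  # remove digits
    for c in string.punctuation:  # remove ASCII punctuation
        text = text.replace(c, '')
    for c in zhon.hanzi.punctuation:  # remove Chinese punctuation
        text = text.replace(c, '')
    text = re.sub(' +', ' ', text)  # collapse consecutive spaces into one
    return text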


def process_zh(text):
    text = clean_text(text.strip())
    text = re.sub('[a-zA-Z]', '', text)  # remove English letters
    # Remove every character that is neither Chinese nor a space; in practice this line alone would be enough
    text = re.sub('([^\u4e00-\u9fa5 ])', '', text)
    return text
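
To see what the final substitution does on its own, here is a short worked example using only the standard library (keep_chinese is a hypothetical name for that single line). It also shows the one visible difference from the full pipeline: the single regex does not collapse the spaces left behind by removed characters, which clean_text does.

```python
import re


def keep_chinese(text):
    # Same character class as the last step of process_zh
    return re.sub('[^\u4e00-\u9fa5 ]', '', text)


# Letters, digits, and both ASCII and full-width punctuation are dropped
cleaned = keep_chinese('你好，world！2021年 NLP')

# Removing "123" leaves two adjacent spaces behind
with_gap = keep_chinese('数据 123 清洗')
collapsed = re.sub(' +', ' ', with_gap)
```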

Word Segmentation

Segmentation is done with the jieba library.

import jieba

text = ''
word_list = jieba.cut(text)

English Data

Data Cleaning

Useless symbols
Special symbols

def clean_text(text):
    """Remove digits, punctuation, and special characters."""
    text = re.sub(r'\d+', '', text)  # remove digits
    for c in string.punctuation:  # remove ASCII punctuation
        text = text.replace(c, '')
    for c in zhon.hanzi.punctuation:  # remove Chinese punctuation
        text = text.replace(c, '')
    text = re.sub(' +', ' ', text)  # collapse consecutive spaces into one
    return text


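def process_en(text):
    text = clean_text(text.strip().lower())
    text = re.sub('[\u4e00-\u9fa5]', '', text)  # remove Chinese characters
    # Remove every character that is neither an English letter nor a space; in practice this line alone would be enough
    text = re.sub(r'[^A-Za-z ]+', '', text)
    return text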

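Stop words need particular care: they must not be removed with the string replace function, which deletes every occurrence of the substring rather than only whole words. Compare each token for equality and drop it only on a match.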

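from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not stop words (whole-token comparison)
word_list = [word for word in word_list if not (mode == 'en' and word in stop_words)]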

Note: for consistency with the Chinese pipeline, stop words were ultimately not removed.
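
The difference between substring replacement and whole-token comparison is easy to reproduce. The sketch below uses a tiny hand-written stop-word set for illustration, not NLTK's list, and splits on whitespace instead of calling a tokenizer.

```python
stop_words = {'a', 'is', 'the'}   # tiny illustrative set, not NLTK's list
text = 'the cat is a theater fan'

# Wrong: substring replacement also damages words that merely contain a stop word
broken = text
for w in sorted(stop_words):
    broken = broken.replace(w, '')

# Right: compare whole tokens and drop only exact matches
kept = [w for w in text.split() if w not in stop_words]
```

With replace, "cat", "theater", and "fan" all lose letters because they contain "a" or "the"; the token-based filter leaves them intact.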

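Segmentation

English is much easier to segment than Chinese, since splitting on spaces already works. nltk.tokenize is used instead, both for accuracy and to get familiar with the library.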

from nltk.tokenize import word_tokenize

text = ''
word_list = word_tokenize(text)

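Normalization

Normalization is done with the nltk library and covers two aspects:

Stemming strips the affixes from a word and returns its stem. The result is not necessarily a real word, so stemming is skipped and only lemmatization is applied.

Lemmatization

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
word_list = [wnl.lemmatize(word) for word in word_list]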

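Computing Text Entropy

Starting from October 8, every three consecutive days of Q&A data form one unit, and each experiment adds one more unit of data.

Entropy is computed per character (Chinese character or letter) as follows:

$$ H(x)=-\sum_{i=1}^{n}p(x_i)\log_2 p(x_i) $$

The implementation is simple (note that freq_dict is already sorted by frequency):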

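from math import log2

import numpy as np


def calc_entropy(freq_dict: dict) -> float:
    entropy = 0.0
    char_sum = np.sum(list(freq_dict.values()))  # total number of characters
    for v in freq_dict.values():
        prob = v / char_sum  # relative frequency of this character
        entropy -= prob * log2(prob)
    return entropy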
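
As a sanity check on the formula, here is a standard-library sketch that computes the same quantity directly from a string (char_entropy is a hypothetical helper, not part of the repository). Four equally likely symbols should give exactly 2 bits, and a skewed distribution should give less. The sum does not depend on the order of the entries.

```python
from collections import Counter
from math import log2


def char_entropy(text):
    """Shannon entropy in bits per character of the characters in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


uniform = char_entropy('abcd')   # four symbols, each with probability 1/4
skewed = char_entropy('aaab')    # probabilities 3/4 and 1/4
```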

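Feng Zhiwei's 1989 statistics were:

The results below (entropy over characters or letters, not words) are consistent with the figure above.

It can also be seen that, for a comparable number of characters, the Chinese data contains about three times as many words as the English data, which we attribute to the higher information entropy of Chinese.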

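Chinese

October 8 to October 10
- Data size: 2285621 characters
- Text entropy: 9.5061
October 8 to October 13
- Data size: 4335765 characters
- Text entropy: 9.5258

English

October 8 to October 10
- Data size: 1902777 characters
- Text entropy: 4.1932
October 8 to October 13
- Data size: 2663740 characters
- Text entropy: 4.1932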

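Verifying Zipf's Law

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

We can therefore plot freq as the dependent variable against rank as the independent variable. Taking logarithms of the inverse proportion below gives log(freq) = log(k) - log(rank), so on log-log axes the law predicts a straight line with slope -1:

$$ \text{freq}=k\times\frac{1}{\text{rank}} $$

More data makes the check more reliable, so the verification is run twice, adding one batch of crawled data each time.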
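
Besides inspecting the plot, the relation can be checked numerically: taking logarithms of freq = k / rank gives log(freq) = log(k) - log(rank), so a least-squares fit of log(freq) against log(rank) should have a slope close to -1. The sketch below is an optional addition, not something the repository does, and it runs on synthetic frequencies rather than the crawled data; loglog_slope is a hypothetical helper.

```python
from math import log


def loglog_slope(freqs):
    """Least-squares slope of log(freq) against log(rank), with ranks starting at 1.

    freqs must be sorted in descending order and contain only positive values.
    """
    xs = [log(r) for r in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


ideal = [1000 / r for r in range(1, 101)]   # synthetic data that follows the law exactly
slope = loglog_slope(ideal)
```

On real word counts the slope is only approximately -1, and the fit is usually worst at the very top and the long tail of the ranking.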

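Word Frequency Statistics

A Process class receives the data and computes the statistics internally, so that the original data is not modified during function calls, which keeps the results safe and accurate: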

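from collections import defaultdict
from math import log2

import jieba
import matplotlib.pyplot as plt
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# process_zh and process_en come from the text-cleaning module shown earlier


class Process:
    def __init__(self, data_dict, mode):
        self.__data_dict = data_dict
        if mode not in ['zh', 'en']:
            raise KeyError(f'mode should be either zh or en, not {mode}')
        self.__mode = mode

        self.__stop_words = set(stopwords.words('english'))

        self.__pure_ans_list = []
        self.__word_freq_dict = defaultdict(int)
        self.__char_freq_dict = defaultdict(int)
        self._process()

    @classmethod
    def construct_and_process(cls, data_dict: list[dict], mode: str) -> 'Process':
        # The return type is quoted because Process is not yet defined while the class body runs
        return cls(data_dict, mode)

    def process(self, data_dict: list[dict]) -> None:
        self.__data_dict = data_dict
        self._process()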

    def _process(self):
        self.__pure_ans_list.clear()
        self.__word_freq_dict.clear()
        self.__char_freq_dict.clear()
        for item in self.__data_dict:
            if not item['answers']:  # no answers were retrieved for this question
                continue
            for ans_text in item['answers'].values():
                if self.__mode == 'zh':
                    pure_text = process_zh(ans_text)
                    word_list = jieba.cut(pure_text)
                else:
                    pure_text = process_en(ans_text)
                    wnl = WordNetLemmatizer()
                    word_list = [wnl.lemmatize(word) for word in word_tokenize(pure_text)]

                self._update_freq_dict(word_list)
                self.__pure_ans_list.append(pure_text)

    def _update_freq_dict(self, word_list):
        for word in word_list:
            if word == ' ':
                continue
            # if self.__mode == 'en' and word in self.__stop_words:  # remove stop words
            #     continue
            self.__word_freq_dict[word] += 1
            for c in word:
                self.__char_freq_dict[c] += 1

    @staticmethod
    def calc_entropy(freq_dict: dict) -> float:
        entropy = 0.0
        char_sum = np.sum(list(freq_dict.values()))
        for k, v in freq_dict.items():
            prob = v / char_sum
            entropy -= prob * log2(prob)
        return entropy

    @staticmethod
    def plot(freq_dict: dict, top_k: int = 10000, save_path=None) -> None:
        save_path = '.' if save_path is None else save_path

        plt.title('Zipf-Law', fontsize=18)  # title
        plt.xlabel('rank', fontsize=18)  # rank
        plt.ylabel('freq', fontsize=18)  # frequency

        plt.yticks([pow(10, i) for i in range(0, 4)])  # y-axis ticks
        plt.xticks([pow(10, i) for i in range(0, 4)])  # x-axis ticks

        plt.yscale('log')  # logarithmic y axis
        plt.xscale('log')  # logarithmic x axis

        y = list(freq_dict.values())[:top_k]
        x = list(range(1, len(y) + 1))  # ranks; same length as y even with fewer than top_k entries
        plt.plot(x, y, 'b')  # draw the curve
        plt.savefig(save_path)  # save the figure

        plt.show()

    @property
    def question_answer_dict(self) -> defaultdict:
        """Return the question-answer data that was passed in."""
        return self.__data_dict

    @property
    def pure_answer_list(self) -> list[str]:
        """Return all answers after cleaning."""
        return self.__pure_ans_list

    @property
    def answer_num(self) -> int:
        """Return the number of answers in the dataset."""
        return len(self.__pure_ans_list)

    @property
    def answer_average_len(self) -> float:
        """Return the average answer length in characters."""
        return self.character_sum / len(self.__pure_ans_list)

    @property
    def word_sum(self) -> int:
        """Return the total number of words in the dataset."""
        return np.sum(list(self.__word_freq_dict.values()))

    @property
    def word_freq_dict(self) -> dict:
        """Return the word-frequency dict sorted by descending frequency."""
        return dict(sorted(self.__word_freq_dict.items(), key=lambda x: x[-1], reverse=True))

    @property
    def character_sum(self) -> int:
        """Return the total number of characters in the dataset."""
        return np.sum(list(self.__char_freq_dict.values()))

    @property
    def character_freq_dict(self) -> dict:
        """Return the character-frequency dict sorted by descending frequency."""
        return dict(sorted(self.__char_freq_dict.items(), key=lambda x: x[-1], reverse=True))

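The helper functions clean_text, process_zh, and process_en used above are defined earlier in this document. These three functions take a text string as their argument and run independently of the JSON data passed in, so they are kept in a separate module rather than made class methods.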
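
The counting step at the core of the class can be tried without jieba or nltk. The sketch below is a simplified stand-in for _update_freq_dict and the two sorted properties, written with collections.Counter; freq_tables is a hypothetical name and the token list is made up.

```python
from collections import Counter


def freq_tables(word_list):
    """Return (word_freq, char_freq) dicts ordered by descending count."""
    word_freq = Counter()
    char_freq = Counter()
    for word in word_list:
        if word == ' ':
            continue                 # skip bare spaces, as the class does
        word_freq[word] += 1
        char_freq.update(word)       # counts every character of the word
    # most_common() returns the entries from most to least frequent
    return dict(word_freq.most_common()), dict(char_freq.most_common())


words, chars = freq_tables(['ab', 'c', 'ab', ' '])
```

Because the returned dicts are ordered by descending count, the position of a word in the dict is its rank, which is what the plotting function relies on.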

Plot Verification

def plot(freq_dict: dict, top_k: int = 10000, save_path=None) -> None:
    save_path = '.' if save_path is None else save_path

    plt.title('Zipf-Law', fontsize=18)  # title
    plt.xlabel('rank', fontsize=18)  # rank
    plt.ylabel('freq', fontsize=18)  # frequency

    plt.yticks([pow(10, i) for i in range(0, 4)])  # y-axis ticks
    plt.xticks([pow(10, i) for i in range(0, 4)])  # x-axis ticks

    plt.yscale('log')  # logarithmic y axis
    plt.xscale('log')  # logarithmic x axis

    y = list(freq_dict.values())[:top_k]
    x = list(range(1, len(y) + 1))  # ranks; same length as y even with fewer than top_k entries
    plt.plot(x, y, 'b')  # draw the curve
    plt.savefig(save_path)  # save the figure

    plt.show()

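As with the entropy computation, starting from October 8, every three consecutive days of Q&A data form one unit, and each experiment adds one more unit of data.

For ease of plotting, only the 10000 top-ranked words are used in this experiment.

Chinese

October 8 to October 10 (data size: 1314693 words)
October 8 to October 13 (data size: 2497128 words)

English

October 8 to October 10 (data size: 422141 words)
October 8 to October 13 (data size: 585633 words)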
