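The DataFrame must contain a created_at column in the format %Y-%m-%d %H:%M:%S. We first extract the hour from that column, then draw a stratified (per-group) sample by hour, and finally save the result to a CSV file. Without further ado, here is the code: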

# -*- coding: utf-8 -*-
# author: inspurer(月小水长)
# create_time: 2022/4/2 22:58
# Runtime: Python 3.6+
# github https://github.com/inspurer
# WeChat official account: 月小水长

import pandas as pd

input_file = 'RussiaUkraine1.csv'
output_file = 'RussiaUkraine2.csv'

df = pd.read_csv(input_file)
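# Add an 'hour' column (0-23) parsed from created_at
df['hour'] = pd.to_datetime(df['created_at'], format='%Y-%m-%d %H:%M:%S').dt.hour
# Sample 1% of the rows within each hour.
# DataFrameGroupBy.sample (pandas 1.1+) replaces groupby(...).apply(lambda x: x.sample(frac=0.01))
# and keeps a flat index. Hours with fewer than about 50 rows may contribute no rows at frac=0.01.
res_df = df.groupby('hour').sample(frac=0.01)
# utf-8-sig writes a BOM so that Excel detects the encoding correctly
res_df.to_csv(output_file, index=False, encoding='utf-8-sig')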
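
To check that sampling by hour really keeps each hour's share of the data, here is a minimal sketch on synthetic data. The helper name `sample_by_hour`, the `text` column, the row counts, and the 10% fraction are all made up for illustration and are not part of the original script. It assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd


def sample_by_hour(df, frac, seed=None):
    """Draw the same fraction of rows from every hour-of-day group.

    Hypothetical helper: expects a 'created_at' column formatted as
    %Y-%m-%d %H:%M:%S and returns the sampled rows plus an 'hour' column.
    """
    hours = pd.to_datetime(df['created_at'], format='%Y-%m-%d %H:%M:%S').dt.hour
    with_hour = df.assign(hour=hours)
    # One independent draw per hour group; random_state makes it repeatable
    return with_hour.groupby('hour').sample(frac=frac, random_state=seed)


# Synthetic input (illustrative only): three hours with different row counts
rows_per_hour = {0: 200, 9: 300, 21: 500}
stamps = [
    f"2022-04-02 {hour:02d}:{i % 60:02d}:00"
    for hour, n in rows_per_hour.items()
    for i in range(n)
]
demo = pd.DataFrame({'created_at': stamps, 'text': range(len(stamps))})

sampled = sample_by_hour(demo, frac=0.1, seed=42)
counts = sampled['hour'].value_counts().sort_index()
print(counts)  # 20, 30 and 50 rows for hours 0, 9 and 21
```

Every hour contributes exactly 10% of its rows, so the sample keeps the same hourly distribution as the full data set. A plain `df.sample(frac=0.1)` only matches that distribution on average.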