Python pandas: group by date, then filter the dates and strings into a new DataFrame

October 1, 2024, JeffreyZhao

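I'm struggling with this. I want to take the data below, group it by date, and then check the rows in each group to see whether the group has any location data associated with it. If it does, I want to extract that location.

A sample of my data: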

id,dates,text,place 
1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))" 
4,2017-01-23 01:38:29,text, 
5,2017-01-23 01:36:53,text, 

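I start by loading the CSV and grouping by date:

import pandas as pd 
import matplotlib.pyplot as plt 
import datetime 
 
fig = plt.figure(figsize=(5,5)) 
df1 = pd.read_csv('data.csv') 
# .copy() avoids a SettingWithCopyWarning when assigning to the column subset 
df = df1[['dates','place']].copy() 
# the timestamps include a time part, so the format has to cover it 
df['dates'] = pd.to_datetime(df['dates'], format='%Y-%m-%d %H:%M:%S') 
df.index = df['dates'] 
 
# the top-level pd.groupby() is not available in current pandas; use the DataFrame method 
grp = df.groupby([df.index.year, df.index.month, df.index.day]) 
for date, group in grp: 
    print(date) 
    print(group) 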

This prints each group, like the following:

(2017, 1, 26) 
                                  dates  \ 
dates 
2017-01-26 01:06:47 2017-01-26 01:06:47 
2017-01-26 01:05:51 2017-01-26 01:05:51 
 
                                                                 place 
dates 
2017-01-26 01:06:47  Place(country_code='US', full_name='Manhattan,... 
2017-01-26 01:05:51                                                NaN 

This is where I run into trouble with the filtering/conditional logic. My goal is a DataFrame that I can save to a CSV that looks like this:

date, item_count, has_location, location 
2017-01-26, 2, yes, Manhattan 
2017-01-23, 2, no, na 

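What is the best way to do this? Thanks.

Please refer to the following approach:

I think you can use:

First extract the name from place, then groupby by dt.date (if the dtype of the dates column is already datetime, the to_datetime call can be omitted), aggregating one column such as id by size and place by first. Finally, insert a new column created by numpy.where: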

import numpy as np 
 
print (df) 
   id                dates  text  \ 
0   1  2017-01-26 01:06:47  text    
1   2  2017-01-26 01:05:51  text    
2   4  2017-01-23 01:38:29  text    
3   5  2017-01-23 01:36:53  text    
 
                                               place   
0  Place(country_code='US', full_name='Manhattan,...   
1  Place(country_code='US', full_name='Manhattan,...   
2                                                NaN   
3                                                NaN   
 
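# capture the value of name='...' from the Place(...) string; rows without a place stay NaN 
df['place'] = df['place'].str.extract(r", name='(.*)', contained_within", expand=False) 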
print (df) 
   id                dates  text      place 
0   1  2017-01-26 01:06:47  text  Manhattan 
1   2  2017-01-26 01:05:51  text  Manhattan 
2   4  2017-01-23 01:38:29  text        NaN 
3   5  2017-01-23 01:36:53  text        NaN 
 
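# group on the calendar date; 'size' counts the rows, 'first' takes the first non-null place 
df1 = df.groupby(pd.to_datetime(df.dates).dt.date).agg({'id':'size', 'place':'first'}) 
df1.columns = ['item_count','location'] 
# flag the groups whose location is missing 
df1.insert(1, 'has_location', np.where(df1.location.isnull(), 'no', 'yes')) 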
print (df1) 
            item_count has_location   location 
dates                                          
2017-01-23           2           no        NaN 
2017-01-26           2          yes  Manhattan
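
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It makes a few assumptions that are not in the original answer: the sample rows are inlined as a string (CSV_TEXT is a hypothetical name) with the long Place(...) text shortened, named aggregation is used instead of the dict form (this needs pandas 0.25 or later), missing locations are filled with 'na', and the result is sorted newest first to resemble the layout asked for in the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline sample: same shape as the question's data, with the
# Place(...) text shortened to the fields that matter for the regex.
CSV_TEXT = """id,dates,text,place
1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[])"
2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[])"
4,2017-01-23 01:38:29,text,
5,2017-01-23 01:36:53,text,
"""

# parse_dates makes the dates column datetime64, so no to_datetime call is needed
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["dates"])

# Pull the value of name='...' out of the Place(...) string; empty places stay NaN.
# The leading ", " keeps the pattern from matching full_name='...'.
df["location"] = df["place"].str.extract(r", name='([^']*)'", expand=False)

# One row per calendar date: row count plus the first non-null location
summary = (
    df.groupby(df["dates"].dt.date)
      .agg(item_count=("id", "size"), location=("location", "first"))
      .sort_index(ascending=False)
)

# Add the yes/no flag, then replace missing locations with the literal 'na'
summary.insert(1, "has_location", np.where(summary["location"].isna(), "no", "yes"))
summary["location"] = summary["location"].fillna("na")
summary.index.name = "date"

# to_csv() with no path returns the CSV text; pass a filename to write a file instead
print(summary.to_csv())
```

Filling with 'na' and sorting in descending order are only there to mirror the target CSV from the question. Drop those two steps if NaN values and ascending dates are fine.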