Pandas' equivalent of resample for integer index

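I'm looking for a pandas equivalent of the resample method for a dataframe whose index is not a DatetimeIndex but an array of integers, or maybe even floats.

I know that in some cases (this one, for example) the resample method can easily be replaced by a reindex and an interpolation, but in some cases (I think) it can't.

For example, if I have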

df = pd.DataFrame(np.random.randn(10,2))
withdates = df.set_index(pd.date_range('2012-01-01', periods=10))

which gives me

withdates.resample('5D', np.std)
                   0         1
2012-01-01  1.184582  0.492113
2012-01-06  0.533134  0.982562

but I can't produce the same result using df with resample. So I'm looking for something that would work as

df.resample(5, np.std)

and that would give me

          0         1
0  1.184582  0.492113
5  0.533134  0.982562

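Does such a method exist? The only way I was able to do this was by manually splitting df into smaller dataframes, applying np.std to each and then concatenating everything back together, which I find pretty slow and not smart at all.

Cheers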


Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the labels to group by yourself. I would use:

(df.index.to_series() / 5).astype(int)

to get a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...], and then use it in a groupby.
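A minimal sketch of these grouping labels, assuming a 20-row frame with a default integer index as in the setup. Floor division (`//`) is used here as an alternative to dividing and casting; for a non-negative index the two give the same labels. The names `labels` and `grouped_std` are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

# Every block of 5 consecutive index values shares one label.
labels = df.index.to_series() // 5
print(labels.tolist())

# Using the labels as the grouping key gives one row per block.
grouped_std = df.groupby(labels).std()
print(grouped_std.shape)  # (4, 2)
```

The result is indexed by the block number (0 to 3), which is why a more meaningful index still has to be assigned afterwards.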

You'll also need to specify the index for the new dataframe. I would use:

df.index[4::5]

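to take the current index starting at the 5th position (hence the 4) and every 5th position after that. It looks like [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I went with the ending positions.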
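To illustrate the two labelling choices, here is a small sketch under the same assumptions (a 20-row frame with a default integer index; the names `end_labels`, `start_labels` and `by_start` are only illustrative). Labelling by the starting position gives 0, 5, 10, 15, which is the style of label the question asks for.

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

end_labels = df.index[4::5]    # last position of each block of 5
start_labels = df.index[::5]   # first position of each block of 5

print(list(end_labels))    # [4, 9, 14, 19]
print(list(start_labels))  # [0, 5, 10, 15]

# Same grouping as before, but labelled by the start of each block.
s = df.index.to_series() // 5
by_start = df.groupby(s).std().set_index(start_labels)
print(by_start)
```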

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

Looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390

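Other considerations

This is the equivalent of downsampling. We haven't addressed upsampling.

To go from what we produced back to the more frequent index of the original dataframe, we can use reindex like this: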

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

Looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390

We could also reindex with something else, such as range(0, 20, 2), to upsample to the even-numbered indices.
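One caveat worth sketching, assuming the downsampled frame is labelled at [4, 9, 14, 19]: reindexing to range(0, 20, 2) first and calling bfill() afterwards discards the rows labelled 9 and 19 before anything is filled, because odd labels are not in the new index. Passing method='bfill' to reindex itself fills from the original labels instead. The names `naive` and `df_even` are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])
s = df.index.to_series() // 5
df_down = df.groupby(s).std().set_index(df.index[4::5])

# Reindex, then fill: labels 9 and 19 are gone before bfill runs,
# so rows 6 and 8 take the value from 14 and rows 16 and 18 stay NaN.
naive = df_down.reindex(range(0, 20, 2)).bfill()

# Fill while reindexing: each new label looks up the next label
# in the original index (4, 9, 14 or 19).
df_even = df_down.reindex(range(0, 20, 2), method='bfill')

print(naive)
print(df_even)
```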


Alternatively, here is one thing that can be done:

def resample(df, rule, how=None, **kwargs):
  import pandas as pd
  if how is None:
    import numpy as np
    how = np.mean

  if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
    # recent pandas versions no longer accept `how` as an argument of resample
    return df.resample(rule, **kwargs).apply(how)
  else:
    # cut the integer index into half-open bins [start, start + rule)
    idx, bins = pd.cut(df.index, range(df.index[0], df.index[-1]+2, rule), right=False, retbins=True)
    # agg applies `how` column by column within each bin
    aux = df.groupby(idx).agg(how)
    # label each bin with its left edge
    aux = aux.set_index(bins[:-1])
    return aux
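As a hedged sketch of the same binning idea on current pandas, the helper below (the name `resample_by_step` and its parameters are illustrative, not part of pandas) labels each bin with its left edge and takes the aggregation as a string such as 'std'. It assumes an integer index.

```python
import numpy as np
import pandas as pd


def resample_by_step(df, step, how="mean"):
    """Aggregate rows in half-open index bins [edge, edge + step)."""
    start, stop = df.index.min(), df.index.max()
    # The last edge must lie beyond the largest index value,
    # otherwise that row would fall outside every bin.
    edges = np.arange(start, stop + step + 1, step)
    # Label every row with the left edge of the bin it falls into.
    bins = pd.cut(df.index, edges, right=False, labels=edges[:-1])
    out = df.groupby(bins, observed=True).agg(how)
    # Replace the categorical bin labels with a plain integer index.
    out.index = out.index.astype(int)
    return out


np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=["A", "B"])
result = resample_by_step(df, 5, how="std")
print(result.index.tolist())  # [0, 5, 10, 15]
```

Extending the edges one step past the maximum is a design choice so that an index ending exactly on a bin boundary is still assigned to a bin.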

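@piSquared's solution is very nice, but I don't like having to pick the index manually when reindexing.

This also works for every kind of downsampling (including a float index) and automatically picks the mean of the index within each range: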

import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])

df.index.name = 'crazy_index'



s = (df.index.to_series() / 10).astype(int)
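To make the grouping key `s` easier to inspect, here is a minimal sketch on a small hand-picked float index (the values and the names `idx`, `truncated` and `floored` are illustrative assumptions, not part of the original answer). It also shows one caveat: `astype(int)` truncates toward zero, so values just below zero land in the same bin as values just above it. Using `np.floor` first is one possible way to keep them apart.

```python
import numpy as np
import pandas as pd

# Hypothetical float index, chosen by hand so the bins are easy to read.
idx = pd.Index([-4.0, 0.5, 3.2, 9.9, 10.0, 17.5, 29.9], name="crazy_index")

# Key used in the answer: divide by the bin width, then truncate toward zero.
truncated = (idx.to_series() / 10).astype(int)

# Alternative key: floor first, so negative values get a bin of their own.
floored = np.floor(idx.to_series() / 10).astype(int)

print(truncated.tolist())  # [0, 0, 0, 0, 1, 1, 2]
print(floored.tolist())    # [-1, 0, 0, 0, 1, 1, 2]
```

For the random index used in this answer, which is always non-negative, both keys give the same bins.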

Now you can choose whichever function you want to compute within each subgroup:

# calculate std() in each group
df.groupby(s).std().set_index(s.groupby(s).apply(lambda x: np.mean(x.index)))

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index(s.groupby(s).apply(lambda x: np.mean(x.index)))

                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781

Edit: there were some errors in the s index; it is now correct.
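The steps of this answer can be wrapped into a single function. The following is only a sketch under stated assumptions: `resample_numeric` and its parameters `width` and `how` are hypothetical names, not a pandas API, and the bin key uses `np.floor` rather than the plain `astype(int)` from the answer.

```python
import numpy as np
import pandas as pd


def resample_numeric(df, width, how="mean"):
    """Group rows into bins of `width` along a numeric index and aggregate.

    Hypothetical helper, not a pandas API. Each output row is labelled with
    the mean of the original index values that fell into its bin.
    """
    positions = df.index.to_series()
    # Bin number for every row; floor keeps negative indices in separate bins.
    bins = np.floor(positions / width).astype(int)
    aggregated = df.groupby(bins.to_numpy()).agg(how)
    # Mean index value per bin, in the same (sorted) bin order as `aggregated`.
    centers = positions.groupby(bins.to_numpy()).mean()
    aggregated.index = pd.Index(centers.to_numpy(), name=df.index.name)
    return aggregated


df = pd.DataFrame(
    {"A": [1.0, 3.0, 5.0, 7.0], "B": [2.0, 4.0, 6.0, 8.0]},
    index=pd.Index([1.0, 3.0, 12.0, 16.0], name="crazy_index"),
)

print(resample_numeric(df, 10))            # mean of each bin
print(resample_numeric(df, 10, "median"))  # any groupby aggregation name works
```

Passing the aggregation as `how` keeps the choice of function (mean, median, std, and so on) separate from the binning logic, which is the point made in the answer.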

