
How do I use a combination or a string as an index into a (pandas) DataFrame?

Popularity: 66   Published: 2023-06-14 08:46:05.0

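I am trying to create new columns in a pandas DataFrame from combinations of its existing columns. Since I don't know how to use the generated combinations as an index, I tried converting a combination to a string, but that doesn't work either.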

import itertools as iter
def pset(lst):
    comb = (iter.combinations(lst, l) for l in range(2,3))
    return list(iter.chain.from_iterable(comb))

temp = pset(transactions)
t = str(temp[0]).strip(" ")
transactions[[t]]

This gives me an error:

KeyError: '["\'A\', \'B\'"] not in index'

Here A and B are columns in my DataFrame.

transaction dataset:
A,B,C,D,E,F,G
1,0,1,1,0,1,1
1,1,1,1,0,1,0
1,0,0,1,0,1,0
0,0,1,1,1,0,0
1,0,0,1,1,1,0
0,1,1,1,1,1,1

Expected output:
A,B  A,C  A,D
 1    2    4

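The code below produces the expected output, where df is the transaction dataset you posted. (This solution was written with Python 2.7, but I expect it to behave the same way in Python 3.)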

import itertools as iter
import pandas as pd

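# All 2-column combinations, e.g. ('A', 'B'), and their "A,B" labels
colComb = [a for a in iter.combinations(df.columns,2)]
newCols = [','.join(colComb[i]) for i in range(len(colComb))]

# One-row result: for each pair, count the rows where both columns equal 1
Out = pd.DataFrame(columns = newCols)
for i in range(len(colComb)):
    Out.loc[0,newCols[i]] = df[(df[colComb[i][0]] == 1) & (df[colComb[i][1]] == 1)][colComb[i][0]].count()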

Output:

  A,B A,C A,D A,E A,F A,G B,C B,D B,E B,F B,G C,D C,E C,F C,G D,E D,F D,G E,F  \
0   1   2   4   1   4   1   2   2   1   2   1   4   2   3   2   3   5   2   2   

  E,G F,G  
0   1   2  

If you need it transposed:

OT = Out.T
OT.columns = ["Count"]

Output:

    Count
A,B     1
A,C     2
A,D     4
A,E     1
A,F     4
A,G     1
B,C     2
B,D     2
B,E     1
B,F     2
B,G     1
C,D     4
C,E     2
C,F     3
C,G     2
D,E     3
D,F     5
D,G     2
E,F     2
E,G     1
F,G     2
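
As a side note on the KeyError in the question: a string built from a tuple is looked up as one single column label, so pandas reports it as missing. Passing the combination as a list of labels selects the columns themselves. A minimal sketch, assuming the dataset from the question (rebuilt inline here so the snippet runs on its own):

```python
import itertools
import pandas as pd

# Transaction dataset from the question, rebuilt inline.
df = pd.DataFrame(
    [[1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 1, 0],
     [1, 0, 0, 1, 0, 1, 0],
     [0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 1, 1, 0],
     [0, 1, 1, 1, 1, 1, 1]],
    columns=list("ABCDEFG"),
)

# First 2-column combination: ('A', 'B')
combo = next(itertools.combinations(df.columns, 2))

# A list of labels selects several columns at once; a string such as
# "'A', 'B'" would be treated as one column name and raise KeyError.
pair = df[list(combo)]

# Number of rows in which every selected column equals 1.
count = int((pair == 1).all(axis=1).sum())
```

For the pair A,B this gives the same count as the table above.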

Edit:

Improved code that also works for higher dimensions (combinations of more than two columns):

import itertools as iter
import pandas as pd
import numpy as np

dim = 2
colComb = [a for a in iter.combinations(df.columns,dim)]
newCols = [','.join(colComb[i]) for i in range(len(colComb))]

Out = pd.DataFrame(columns = newCols)
for i in range(len(colComb)):
    Out.loc[0,newCols[i]] = df[np.sum(df[list(colComb[i])],axis=1) == dim][colComb[i][0]].count()
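
The same counts can probably be collected without filling a DataFrame cell by cell. The following is only a sketch under the assumption that the data is strictly 0/1; the helper name `combo_counts` is hypothetical and not part of the answer above, and the dataset is rebuilt inline so the snippet is self-contained:

```python
import itertools
import pandas as pd

# Transaction dataset from the question, rebuilt inline.
df = pd.DataFrame(
    [[1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 1, 0],
     [1, 0, 0, 1, 0, 1, 0],
     [0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 1, 1, 0],
     [0, 1, 1, 1, 1, 1, 1]],
    columns=list("ABCDEFG"),
)

def combo_counts(frame, dim):
    """Hypothetical helper: for each dim-sized column combination,
    count the rows in which all of its columns equal 1."""
    counts = {
        ",".join(combo): int(frame[list(combo)].eq(1).all(axis=1).sum())
        for combo in itertools.combinations(frame.columns, dim)
    }
    # Keys keep the order in which itertools generated the combinations.
    return pd.Series(counts, name="Count")

pairs = combo_counts(df, 2)    # 21 entries, labelled "A,B", "A,C", ...
triples = combo_counts(df, 3)  # 35 entries, labelled "A,B,C", ...
```

Returning a Series gives the transposed layout directly; `pairs.to_frame().T` would give the one-row layout.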

Edit:

A much faster version for the two-dimensional case:

cols = []
vals = []
for i in range(len(df.columns)):
    for j in range(i+1,len(df.columns)):
        cols.append(df.columns[i]+','+df.columns[j])
        vals.append(np.multiply(df[df.columns[i]],df[df.columns[j]]).sum())

Out = pd.DataFrame(columns=cols)
Out.loc[0] = vals
Out = Out.astype(int)

Another edit: a faster solution that also works for higher dimensions:

colComb = [a for a in iter.combinations(df.columns,dim)]
cols = [','.join(colComb[i]) for i in range(len(colComb))]
vals = []
for C in colComb:
    v = df[C[0]]
    for i in range(1,len(C)):
        v = np.multiply(v,df[C[i]])
    vals.append(v.sum())
dd = pd.DataFrame(columns=cols)
dd.loc[0] = vals
dd = dd.astype(int)

It should run at least 3-4 times faster.
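
For the pairwise case only, and assuming the data is strictly 0/1, the co-occurrence counts are also the off-diagonal entries of the matrix product of the transposed frame with the frame itself. This is a sketch of that alternative, not something from the answer above, and it has not been timed against the loops:

```python
import numpy as np
import pandas as pd

# Transaction dataset from the question, rebuilt inline.
df = pd.DataFrame(
    [[1, 0, 1, 1, 0, 1, 1],
     [1, 1, 1, 1, 0, 1, 0],
     [1, 0, 0, 1, 0, 1, 0],
     [0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 1, 1, 0],
     [0, 1, 1, 1, 1, 1, 1]],
    columns=list("ABCDEFG"),
)

# co.loc[x, y] is the number of rows where columns x and y are both 1.
co = df.T.dot(df)

# Keep each unordered pair once: the entries strictly above the diagonal.
i, j = np.triu_indices(len(df.columns), k=1)
labels = [f"{df.columns[a]},{df.columns[b]}" for a, b in zip(i, j)]
pair_counts = pd.Series(co.to_numpy()[i, j], index=labels, name="Count")
```

The diagonal of `co` holds the per-column totals, which is why it is skipped with `k=1`.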
