
Python: how to add an id column to identify read_html() tables

February 15, 2025 | 7jpfss

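Consider the following websites (site1, site2, site3), each of which contains many different tables.

I am using read_html to stitch those tables together into a single table, like this: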

import multiprocessing

import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']

def process_url(url):
    # read_html() returns a list of DataFrames, one per table found on the page
    return pd.concat(pd.read_html(url), ignore_index=False)

pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)

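With the process above I end up with one table. That is what I expected, but it would help to add a flag or "table counter" so that I do not lose track of where each row came from (that is, which row belongs to which table). How can I add the table number to every row?

Something like this: the same table, but with a table_num column: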

    Bank Name   City    ST  CERT    Acquiring Institution   Closing Date    Updated Date        table_num 
1   Allied Bank     Mulberry    AR  91.0    Today's Bank    September 23, 2016  October 17, 2016        1 
2   The Woodbury Banking Company    Woodbury    GA  11297.0     United Bank     August 19, 2016     October 17, 2016    1 
3   First CornerStone Bank  King of Prussia     PA  35312.0     First-Citizens Bank & Trust Company     May 6, 2016     September 6, 2016   1 
4   Trust Company Bank  Memphis     TN  9956.0  The Bank of Fayette County  April 29, 2016  September 6, 2016   2 
5   North Milwaukee State Bank  Milwaukee   WI  20364.0     First-Citizens Bank & Trust Company     March 11, 2016  June 16, 2016   2 
6   Hometown National Bank  Longview    WA  35156.0     Twin City Bank  October 2, 2015     April 13, 2016  3 
7   The Bank of Georgia     Peachtree City  GA  35259.0     Fidelity Bank   October 2, 2015     October 24, 2016        3 
8   Premier Bank    Denver  CO  34112.0     United Fidelity Bank, fsb   July 10, 2015   August 17, 2016     3 
9   Edgebrook Bank  Chicago     IL  57772.0     Republic Bank of Chicago    May 8, 2015     July 12, 2016   3 
10  Doral Bank  NaN     NaN     NaN     NaN     NaN     NaN     4 
11  En Espanol  San Juan    PR  32102.0     Banco Popular de Puerto Rico    February 27, 2015   May 13, 2015        4 
12  Capitol City Bank & Trust Company   Atlanta     GA  33938.0     First-Citizens Bank & Trust Company     February 13, 2015   April 21, 2015  4 
13  Valley Bank     Fort Lauderdale     FL  21793.0     Landmark Bank, National Association     June 20, 2014   June 29, 2015   5 
14  Valley Bank     Moline  IL  10450.0     Great Southern Bank     June 20, 2014   June 26, 2015   5 
15  Slavie Federal Savings Bank     Bel Air     MD  32368.0     Bay Bank, FSB   May 3, 2014     June 15, 2015   5 
16  Columbia Savings Bank   Cincinnati  OH  32284.0     United Fidelity Bank, fsb   May 23, 2014    November 10, 2016   6 
17  AztecAmerica Bank   NaN     NaN     NaN     NaN     NaN     NaN 6 
18  En Espanol  Berwyn  IL  57866.0     Republic Bank of Chicago    May 16, 2014    October 20, 2016    6 

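For example, if site1 has two tables, the function must assign 0 to all rows of table1 and 1 to all rows of table2 of site1.

On the other hand, if site2 has two tables, the function must assign 3 to all rows of table1 and 4 to all rows of table2, and so on for every table in site2.

Also, is it possible to use assign() or some other method to get a reference for each row (for example, its source table)?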

Please refer to the following approach:

Try changing your process_url() function as follows:

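def process_url(url):
    # enumerate() numbers the tables on this page starting at 0;
    # assign() writes that number into every row of the table
    return pd.concat([x.assign(table_num=i)
                      for i, x in enumerate(pd.read_html(url))],
                     ignore_index=False)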
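
Note that enumerate() starts again at 0 on every call, so with the function above the numbering restarts for each URL. If the counter should keep running across sites, as the question describes, one possible variation is sketched below. The names tag_tables, combine_sites and the source_url column are hypothetical and not part of the original answer, and the small demo DataFrames only stand in for what pd.read_html(url) would return, so the sketch runs without network access.

```python
import pandas as pd


def tag_tables(tables, url, start=0):
    """Tag each DataFrame with its source URL and a table number.

    Hypothetical helper: numbering begins at `start` instead of 0, so the
    caller can continue counting across several sites.
    """
    return [
        df.assign(source_url=url, table_num=start + i)
        for i, df in enumerate(tables)
    ]


def combine_sites(tables_by_url):
    """Concatenate all tables from all sites with a running table_num.

    `tables_by_url` maps a URL to the list of DataFrames found on that page,
    for example the result of pd.read_html(url).
    """
    tagged = []
    for url, tables in tables_by_url.items():
        # len(tagged) is the number of tables seen so far, which is the
        # next free table number
        tagged.extend(tag_tables(tables, url, start=len(tagged)))
    return pd.concat(tagged, ignore_index=True)


# Stand-in data (assumption): two tables on site1, one table on site2
demo = {
    "site1.com": [
        pd.DataFrame({"Bank Name": ["Allied Bank", "Doral Bank"]}),
        pd.DataFrame({"Bank Name": ["Valley Bank"]}),
    ],
    "site2.com": [
        pd.DataFrame({"Bank Name": ["Premier Bank"]}),
    ],
}

combined = combine_sites(demo)
print(combined)
```

With real pages, the dictionary could be built as {url: pd.read_html(url) for url in links}. Since pool.map returns results in the same order as its input, the parsing step could probably still be run in parallel and the numbering applied afterwards in the parent process.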