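Consider the following websites (site1, site2, site3), each of which contains many different tables.
I am using read_html to stitch the tables together into a single table, as follows: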
import multiprocessing
import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']

def process_url(url):
    # Read every table on the page and stack them into one DataFrame
    return pd.concat(pd.read_html(url), ignore_index=False)

pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)
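With the above process I get a single table. It is what I expected, but it would be helpful to add a flag or "table counter" so that I don't lose track of the source table (i.e., which row belongs or corresponds to which table). So, how can I add the table number to each row?
Something like this: the same table, but with a table_num column: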
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date table_num
1 Allied Bank Mulberry AR 91.0 Today's Bank September 23, 2016 October 17, 2016 1
2 The Woodbury Banking Company Woodbury GA 11297.0 United Bank August 19, 2016 October 17, 2016 1
3 First CornerStone Bank King of Prussia PA 35312.0 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016 1
4 Trust Company Bank Memphis TN 9956.0 The Bank of Fayette County April 29, 2016 September 6, 2016 2
5 North Milwaukee State Bank Milwaukee WI 20364.0 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 2
6 Hometown National Bank Longview WA 35156.0 Twin City Bank October 2, 2015 April 13, 2016 3
7 The Bank of Georgia Peachtree City GA 35259.0 Fidelity Bank October 2, 2015 October 24, 2016 3
8 Premier Bank Denver CO 34112.0 United Fidelity Bank, fsb July 10, 2015 August 17, 2016 3
9 Edgebrook Bank Chicago IL 57772.0 Republic Bank of Chicago May 8, 2015 July 12, 2016 3
10 Doral Bank NaN NaN NaN NaN NaN NaN 4
11 En Espanol San Juan PR 32102.0 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 4
12 Capitol City Bank & Trust Company Atlanta GA 33938.0 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015 4
13 Valley Bank Fort Lauderdale FL 21793.0 Landmark Bank, National Association June 20, 2014 June 29, 2015 5
14 Valley Bank Moline IL 10450.0 Great Southern Bank June 20, 2014 June 26, 2015 5
15 Slavie Federal Savings Bank Bel Air MD 32368.0 Bay Bank, FSB May 3, 2014 June 15, 2015 5
16 Columbia Savings Bank Cincinnati OH 32284.0 United Fidelity Bank, fsb May 23, 2014 November 10, 2016 6
17 AztecAmerica Bank NaN NaN NaN NaN NaN NaN 6
18 En Espanol Berwyn IL 57866.0 Republic Bank of Chicago May 16, 2014 October 20, 2016 6
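For example, if site1 has two tables, the function must assign 0 to all rows of table1 and 1 to all rows of table2 in site1. On the other hand, if site2 has two tables, the function must assign 3 to all rows of table1 and 4 to all rows of table2 in site2.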
Also, is it possible to use assign() or some other method to get a reference for each row (e.g. its table of origin)?
Please refer to the following approach:
Try changing your process_url() function as follows:
def process_url(url):
    # enumerate() numbers the tables on this page from 0; assign() adds that number as a column
    return pd.concat([x.assign(table_num=i)
                      for i, x in enumerate(pd.read_html(url))],
                     ignore_index=False)
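
Note that enumerate() restarts at 0 for every URL, so with the function above table_num is unique only within one site. If you need a counter that keeps running across sites, one possible sketch is to record the site as well and number the (site, table_num) pairs afterwards. The names tag_tables, tables_by_site, site and global_table_num are made up for illustration, and the small DataFrames are invented stand-ins for what pd.read_html(url) would return, so the example runs without network access:

```python
import pandas as pd

def tag_tables(url, tables):
    """Stack one site's tables, recording the site and the per-site table index."""
    return pd.concat(
        [t.assign(site=url, table_num=i) for i, t in enumerate(tables)],
        ignore_index=True,
    )

# Hypothetical stand-in for {url: pd.read_html(url)}; the data is invented.
tables_by_site = {
    "site1.com": [pd.DataFrame({"ST": ["AR", "GA"]}), pd.DataFrame({"ST": ["PA"]})],
    "site2.com": [pd.DataFrame({"ST": ["TN"]}), pd.DataFrame({"ST": ["WI", "WA"]})],
}

df = pd.concat(
    [tag_tables(url, tables) for url, tables in tables_by_site.items()],
    ignore_index=True,
)

# One running counter across all sites: each distinct (site, table_num)
# pair gets its own integer.
df["global_table_num"] = df.groupby(["site", "table_num"], sort=False).ngroup()
print(df)
```

Keeping the site column next to table_num also answers the "table of origin" part of the question directly, since each row then carries both pieces of the reference. With multiprocessing, the same idea should carry over by having process_url(url) return the tagged frame and computing the global counter after the final pd.concat.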