
Stop asking how to extract PDF content with Python!

Date: 2023-03-26 18:38:50 | Category: Python

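Author: 陈熹. Source: EarlyPython.

Hello everyone. In earlier articles of the office automation series we covered in detail how to batch-process PDF files with Python, including merging, splitting, watermarking, and encrypting. Today we return to PDFs and explain how to use Python to extract specific information from them. We use an annual report PDF as the example; it contains a large amount of text, tables, and images.

Module installation

Two modules are needed. The first is pdfplumber, which can be installed with pip from the command line 👇

    pip install pdfplumber

The second is fitz, which is a module inside pymupdf. It is just as easy to install with pip.

    pip install pymupdf

Text extraction

To extract text from a PDF with Python, open the file, select a page, and call the page's .extract_text() method, which returns the text of that page. Let's try it on page 12 of the sample file 👇

    import pdfplumber

    file_path = r'C:xxxxpractice.PDF'  # placeholder path to the sample PDF

    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[11]  # pages are zero-indexed, so this is page 12
        print(page.extract_text())

This prints the text of the page. The text can then be saved to a Word document by importing python-docx and calling wordfile.add_paragraph(). We have explained that module many times, so we will not repeat it here.

Table extraction

Extracting a single table with Python is very similar to extracting the text of a single page: it uses .extract_table(). Note, however, that .extract_table() by default extracts only the first table on the given page. If the page has several tables and all of them are needed, use .extract_tables() instead. For example, page 13 of the sample file contains more than one table. Let's call both .extract_table() and .extract_tables() and compare the output.

    import pdfplumber

    file_path = r'C:xxxxpractice.PDF'  # placeholder path to the sample PDF

    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[12]  # page 13
        print(page.extract_table())

The result is a nested list: one inner list per row. Anyone familiar with this format will see that it can be written to an Excel file either with pandas, or by iterating over the nested list and calling openpyxl's sheet.append(list) for each row.

    import pdfplumber

    file_path = r'C:xxxxpractice.PDF'  # placeholder path to the sample PDF

    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[12]  # page 13
        print(page.extract_tables())

By contrast, .extract_tables() extracts every table on the current page and returns a three-level nested list. Each first-level element represents one table, which can then be written to Excel with other libraries.

Image extraction

No module can extract 100% of the images in a PDF. This article only presents code based on the fitz module. The basic idea is to search the PDF's objects for images with regular expressions and write out each match. To extract the images in the sample file, the code can be written like this 👇

    import fitz  # provided by the pymupdf package
    import os
    import re

    file_path = r'C:xxxpractice.PDF'  # placeholder path to the sample PDF
    dir_path = r'C:xxx'  # folder where the images are saved

    def pdf2pic(path, pic_path):
        # Patterns that identify an image XObject in a PDF object definition
        checkXO = r"/Type(?= */XObject)"
        checkIM = r"/Subtype(?= */Image)"
        pdf = fitz.open(path)
        # xref_length() and xref_object() are the current names; older
        # PyMuPDF releases used _getXrefLength() and _getXrefString()
        lenXREF = pdf.xref_length()
        imgcount = 0
        for i in range(1, lenXREF):
            text = pdf.xref_object(i)
            isXObject = re.search(checkXO, text)
            isImage = re.search(checkIM, text)
            if not isXObject or not isImage:
                continue  # not an image object
            imgcount += 1
            pix = fitz.Pixmap(pdf, i)
            new_name = f"img_{imgcount}.png"
            if pix.n < 5:
                # GRAY or RGB: save directly (save() replaces the old writePNG())
                pix.save(os.path.join(pic_path, new_name))
            else:
                # CMYK: convert to RGB before saving as PNG
                pix0 = fitz.Pixmap(fitz.csRGB, pix)
                pix0.save(os.path.join(pic_path, new_name))
                pix0 = None
            pix = None  # release the pixmap

    pdf2pic(file_path, dir_path)

Running this does extract some images successfully, but the PDF contains far more images than these. If you have other ideas or methods, feel free to share them with me in the comments.

Final words

One last note: in the previous article and this one we went through the code line by line. However, there are many PDF modules, some of them are still functionally incomplete, and the code is not as simple as working with the three main Office formats. So the goal is more to understand the code than to be able to write it from scratch. You can take it and adapt it! Still, I hope everyone sees that one of the core ideas of Python office automation is batch processing that frees your hands: combine it with your daily office work and automate the complicated tasks!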
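As a follow-up to the table extraction section: the article says the nested lists can be written to Excel with pandas or openpyxl but does not show that step. The sketch below is one possible way to do it. To run without extra dependencies it swaps in the standard-library csv module and writes one CSV file per table, which Excel can open. The helper names clean_cell and tables_to_csv, the output file names, and the sample data are all hypothetical. The sample only imitates the shape of what page.extract_tables() returns (a list of tables, each a list of rows, where empty cells may be None).

```python
import csv
import tempfile
from pathlib import Path


def clean_cell(cell):
    """Turn None into an empty string and collapse line breaks inside a cell."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def tables_to_csv(tables, out_dir, prefix="table"):
    """Write each table (a list of rows) to its own CSV file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        path = out_dir / f"{prefix}_{index}.csv"
        # utf-8-sig writes a BOM, which helps Excel detect the encoding
        with path.open("w", newline="", encoding="utf-8-sig") as handle:
            writer = csv.writer(handle)
            for row in table:
                writer.writerow([clean_cell(cell) for cell in row])
        paths.append(path)
    return paths


# Hypothetical data shaped like the three-level list from page.extract_tables()
sample_tables = [
    [["Item", "2019", "2018"], ["Revenue", "1,200", "1,000"], ["Net\nprofit", None, "150"]],
    [["Name", "Share"], ["A", "60%"], ["B", "40%"]],
]

if __name__ == "__main__":
    for csv_path in tables_to_csv(sample_tables, tempfile.mkdtemp()):
        print(csv_path.name)
```

With a real PDF, the list returned by page.extract_tables() would be passed in place of sample_tables. For a single table from page.extract_table(), wrap it in a list first, for example tables_to_csv([table], out_dir).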