How can I get the sheet names from an XLS file without loading the whole file?

javamemo 2023. 5. 10. 20:10


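I'm currently using pandas to read an Excel file and present its sheet names to the user, so they can choose which sheet to use. The problem is that the files are really big (70 columns x 65k rows) and take up to 14 seconds to load on a notebook (the same data in a CSV file takes 3 seconds).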

My pandas code looks like this:

xls = pandas.ExcelFile(path)
sheets = xls.sheet_names

I tried xlrd before, but got similar results. This was my code with xlrd:

xls = xlrd.open_workbook(path)
sheets = xls.sheet_names

So, can anybody suggest a faster way to retrieve the sheet names from an Excel file than reading the whole file?

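You can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets are not loaded automatically.

You can then retrieve the sheet names in much the same way as with pandas: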

import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print(xls.sheet_names())  # <- remember: xlrd sheet_names is a function, not a property

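From my research with the standard / popular libs, this hasn't been implemented as of 2020 for xlsx / xls, but you can do it for xlsb. Either way, these solutions should give you vast performance improvements for xls, xlsx and xlsb.

The benchmarks below were run on a ~10Mb xlsx / xlsb file.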

`xlsx, xls`

from openpyxl import load_workbook

def get_sheetnames_xlsx(filepath):
    wb = load_workbook(filepath, read_only=True, keep_links=False)
    return wb.sheetnames

Benchmarks: ~14x speed improvement

# get_sheetnames_xlsx vs pd.read_excel
225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`xlsb`

from pyxlsb import open_workbook

def get_sheetnames_xlsb(filepath):
    with open_workbook(filepath) as wb:
        return wb.sheets

Benchmarks: ~56x speed improvement

# get_sheetnames_xlsb vs pd.read_excel
96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
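
The timings above look like IPython %timeit output. Outside a notebook, a comparable measurement could be taken with the standard timeit module. This is only a sketch: best_time is a hypothetical helper, not part of the original answer, and it reports the best run rather than the mean ± std. dev. shown above.

```python
import timeit


def best_time(func, *args, repeat=7, number=1):
    """Return the best wall-clock time in seconds for one call of func(*args)."""
    # timeit.repeat returns one total per repetition; each total covers `number` calls
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number
```

For example, best_time(get_sheetnames_xlsb, path) could be compared against best_time(pd.read_excel, path) for the same file, where path is whatever workbook you are testing with.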

Notes:

This is a good resource - http://www.python-excel.org/
xlrd is no longer maintained as of 2020.

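I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponential time as the file size increases, because they read the entire file. The other solutions mentioned above that use 'on_demand' did not work for me. The following function works for xlsx files.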

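import os
import shutil
import zipfile

import xmltodict
from django.conf import settings  # MEDIA_ROOT is used as the scratch directory

def get_sheet_details(file_path):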
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['sheetId'], # can be @sheetId for some versions
                'name': sheet['name'] # can be @name
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

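Since every xlsx file is basically a zip archive, we extract the underlying xml data and read the sheet names directly from the workbook, which takes a fraction of a second compared to the library functions.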

Benchmarking (on a 6mb xlsx file with 4 sheets):
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds
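
To see why this is cheap, it can help to look inside the archive first. The sketch below is not from the original answer: list_xlsx_parts is a hypothetical helper that lists the members of an xlsx file with the standard zipfile module, without extracting anything to disk.

```python
import zipfile


def list_xlsx_parts(file_path):
    """Return (member name, uncompressed size in bytes) for each part in the archive."""
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        # infolist() only reads the zip directory, not the sheet data itself
        return [(info.filename, info.file_size) for info in zip_ref.infolist()]
```

On a typical workbook, xl/workbook.xml should be tiny compared to the parts under xl/worksheets/, which is what makes reading only that one member fast.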

Combining @Dhwanil shah's answer with the answer here, I wrote code that is also compatible with xlsx files that have only one sheet:

import zipfile

import xmltodict


def get_sheet_ids(file_path):
    sheet_names = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.open(r'xl/workbook.xml').read()
        dictionary = xmltodict.parse(xml)

        # With a single <sheet>, xmltodict returns a dict instead of a list
        if not isinstance(dictionary['workbook']['sheets']['sheet'], list):
            sheet_names.append(dictionary['workbook']['sheets']['sheet']['@name'])
        else:
            for sheet in dictionary['workbook']['sheets']['sheet']:
                sheet_names.append(sheet['@name'])
    return sheet_names

Based on dhwanil-shah's answer, I find this the most efficient:


import os
import re
import zipfile

def get_excel_sheet_names(file_path):
    sheets = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref: xml = zip_ref.read("xl/workbook.xml").decode("utf-8")
    for s_tag in  re.findall("<sheet [^>]*", xml) : sheets.append(  re.search('name="[^"]*', s_tag).group(0)[6:])
    return sheets

sheets  = get_excel_sheet_names("Book1.xlsx")
print(sheets)
# prints: "['Sheet1', 'my_sheet 2']"

A working alternative for xlsb:

import re
import zipfile

def get_xlsb_sheet_names(file_path):
    sheets = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read("docProps/app.xml").decode("utf-8")
    # Keep only the <TitlesOfParts> block, which lists the part titles
    xml = re.search("<TitlesOfParts>.*</TitlesOfParts>", xml, re.DOTALL).group(0)
    # Non-greedy match, so each <vt:lpstr> element is found separately
    for s_tag in re.findall("<vt:lpstr>.*?</vt:lpstr>", xml):
        sheets.append(re.search('>.*<', s_tag).group(0)[1:-1])
    return sheets

Benefits:

speed
simple code, easy to adapt
no temporary file or directory creation (everything stays in memory)
uses only core libs

To be improved:

regex parsing (I don't know how it would behave if a sheet name contained a double quote ["])
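
On the open question about quotes: in well-formed XML, a double quote inside a double-quoted attribute has to be written as the entity &quot; (and & as &amp;), so the regex should still match but would return the escaped text. A minimal sketch, assuming the workbook writes name="..." with double quotes as in the code above. get_excel_sheet_names_unescaped is a hypothetical name, not part of the original answer.

```python
import html
import re
import zipfile


def get_excel_sheet_names_unescaped(file_path):
    """Regex-based sheet name lookup that also decodes XML entities."""
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        xml = zip_ref.read("xl/workbook.xml").decode("utf-8")
    sheets = []
    for s_tag in re.findall(r"<sheet [^>]*", xml):
        match = re.search(r'name="([^"]*)"', s_tag)
        if match:
            # html.unescape turns &quot; and &amp; back into " and &
            sheets.append(html.unescape(match.group(1)))
    return sheets
```

Using a real XML parser avoids the problem altogether, at the cost of a little more code.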

I adapted the Python code to take a full pathlib path as the file name (e.g. 'c:\xml\file.xlsx'). It follows Dhwanil shah's answer, but without the Django method used to create the temporary directory.

import xmltodict
import shutil
import zipfile


def get_sheet_details(filename):
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = (filename.with_suffix(''))
    directory_to_extract_to.mkdir(parents=True, exist_ok=True)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],  # can be sheetId for some versions
                'name': sheet['@name']  # can be name
            }
            sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Using only the standard library:

import re
from pathlib import Path
import xml.etree.ElementTree as ET
from zipfile import Path as ZipPath


def sheet_names(path: Path) -> tuple[str, ...]:
    xml: bytes = ZipPath(path, at="xl/workbook.xml").read_bytes()
    root: ET.Element = ET.fromstring(xml)
    namespace = m.group(0) if (m := re.match(r"\{.*\}", root.tag)) else ""
    return tuple(x.attrib["name"] for x in root.findall(f"./{namespace}sheets/") if x.tag == f"{namespace}sheet")

A simple way to read Excel sheet names:

import openpyxl
wb = openpyxl.load_workbook(r'<path-to-filename>') 
print(wb.sheetnames)

How to read data from a specific Excel sheet using pandas:

import pandas as pd
df = pd.read_excel(io = '<path-to-file>', engine='openpyxl', sheet_name = 'Report', header=7, skipfooter=1).drop_duplicates()

XLSB & XLSM solution. Inspired by Cedric Bonjour.

import re
import zipfile

def get_sheet_names(file_path):
    with zipfile.ZipFile(file_path, 'r') as zip_ref: 
        xml = zip_ref.read("docProps/app.xml").decode("utf-8")
    xml = re.findall("<TitlesOfParts>.*</TitlesOfParts>", xml)[0]
    sheets = re.findall(">([^>]*)<", xml)
    sheets = list(filter(None,sheets))
    return sheets
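
The same lookup can be sketched with a real XML parser instead of regular expressions, so entities such as &amp; are decoded automatically. This is an illustrative variant, not the answer's code: get_part_titles is a hypothetical name, and tags are compared by local name so that no namespace URI has to be hard-coded. Note that the TitlesOfParts list may contain other titles besides worksheets (for example named ranges), so treat the result with some care.

```python
import xml.etree.ElementTree as ET
import zipfile


def get_part_titles(file_path):
    """Return the strings listed under <TitlesOfParts> in docProps/app.xml."""
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        root = ET.fromstring(zip_ref.read("docProps/app.xml"))
    titles = []
    for element in root.iter():
        # Strip the "{namespace}" prefix and compare the local tag name only
        if element.tag.rsplit("}", 1)[-1] == "TitlesOfParts":
            for child in element.iter():
                if child.tag.rsplit("}", 1)[-1] == "lpstr":
                    titles.append(child.text or "")
    return titles
```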

You can also use:

data=pd.read_excel('demanddata.xlsx',sheet_name='oil&gas')
print(data)

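Here demanddata is the name of your file and oil&gas is one of your sheet names. A workbook can contain any number of sheets. Just pass the name of the sheet you want to fetch as sheet_name="name of the sheet you need".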

Source URL: https://stackoverflow.com/questions/12250024/how-to-obtain-sheet-names-from-xls-files-without-loading-the-whole-file
