Python) How to convert a CSV file to a Parquet file

I needed to know how to convert a given file to Parquet, so I am writing down the steps here.

Library Load

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

Converting a TXT (CSV) file to a Parquet file

csv_file = "./ml-25m/movies.csv"
parquet_file = "./my.parquet"
chunksize = 500
csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False)
chunk = next(iter(csv_stream))

The schema can be guessed from a chunk, but if the schema is already known in advance, it is better to define it explicitly and write to that format.

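# Schema guessed from the first chunk
parquet_schema_old = pa.Table.from_pandas(df=chunk).schema
# Schema defined explicitly
parquet_schema_new = pa.schema([
    ('movieId', pa.int64()),
    ('title', pa.string()),
    ('genres', pa.string()),
])
# Check whether the guessed schema matches the explicit one
parquet_schema_old == parquet_schema_new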

With the code below, the CSV is read in chunks and each chunk is written to the Parquet file in turn.

chunksize = 500
csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False)
parquet_writer = None
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if parquet_writer is None:
        # Use the predefined schema instead of guessing it from the first chunk
        parquet_schema = parquet_schema_new
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write the CSV chunk to the Parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
# Close the writer after the last chunk so the Parquet file is finalized
if parquet_writer is not None:
    parquet_writer.close()

That covers how to convert a data file, whether large or small, to a Parquet file.

One thing left open: I was not able to confirm whether more data can be appended to an existing Parquet file.

The end.

Reference

https://stackoverflow.com/questions/26124417/how-to-convert-a-csv-file-to-parquet

How to convert a csv file to parquet

https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file

Using pyarrow how do you append to parquet file?
