Reading Apache Parquet Files#
This tutorial shows how to read data from an Apache Parquet file and how to process and transform that data with Grain.
grain.experimental.ParquetIterDataset is a thin wrapper around
pyarrow.parquet.ParquetFile.
Any extra keyword arguments passed to ParquetIterDataset(...) are forwarded to
ParquetFile, so consult the PyArrow documentation for the supported read
options.
Internally, Grain streams one row group at a time from each Parquet file. If you pass multiple paths, Grain lazily interleaves per-file iterators, which is useful for sharded datasets that do not support efficient random access.
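The interleaving is conceptually a lazy round-robin over per-file iterators. Here is a minimal stdlib sketch of that idea (not Grain's actual implementation; `file_a` and `file_b` stand in for per-file row streams):

```python
# Hypothetical stand-ins for per-file row iterators; in Grain these would
# stream rows from each Parquet file one row group at a time.
file_a = iter(['a0', 'a1', 'a2'])
file_b = iter(['b0', 'b1', 'b2'])

def interleave(*iterators):
    """Lazily round-robins over iterators, dropping each one once exhausted."""
    active = list(iterators)
    while active:
        for it in list(active):
            try:
                yield next(it)
            except StopIteration:
                active.remove(it)

print(list(interleave(file_a, file_b)))  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

Because nothing is materialized up front, each file is only read as fast as the consumer drains the combined stream.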
Generate a test Parquet file locally#
import pyarrow as pa
import pyarrow.parquet as pq
# Generate a sample PyArrow table containing email subjects and bodies.
table = pa.table({
    'email_subject': [
        "Meeting Reminder: Project X Update",
        "Important Announcement Regarding Company Policy",
        "FWD: Quick Question",
        "Your Order Confirmation #12345",
        "Invitation to Team Building Activity"
    ],
    'email_body': [
        "Hi team,\n\nJust a reminder about our Project X update meeting tomorrow at 10 AM PST. Please come prepared to discuss your progress and any roadblocks.\n\nSee you there,\n[Your Name]",
        "Dear employees,\n\nPlease be advised of a new company policy regarding remote work, effective May 1st, 2025. You can find the full details on the company intranet.\n\nRegards,\nManagement",
        "Hi [Name],\n\nForwarding you this email as you might have the answer to this quick question:\n\n[Original Email Content]",
        "Dear [Customer Name],\n\nThank you for your recent order! This email confirms your order #12345. You can view the details and track its shipment here: [Link]\n\nSincerely,\nThe [Company Name] Team",
        "Hello everyone,\n\nYou're invited to participate in our upcoming team building activity on Friday, April 28th. It will be a fun afternoon of [Activity]. Please RSVP by Wednesday.\n\nBest,\n[Organizer Name]"
    ]
})
# Write this table to a parquet file.
writer = pq.ParquetWriter('emails.parquet', table.schema)
writer.write_table(table)
writer.close()
Load Dataset#
# Install Grain
!pip install grain
import grain
import pprint
ds = grain.experimental.ParquetIterDataset('./emails.parquet')
list(ds)[0]
For example, keyword arguments such as memory_map are passed through to ParquetFile:
ds = grain.experimental.ParquetIterDataset(
'./emails.parquet',
memory_map=True,
)
list(ds)[0]
To read multiple Parquet files in parallel, provide a list of paths. Grain will automatically use InterleaveIterDataset internally to interleave the per-file iterators.
ds = grain.experimental.ParquetIterDataset(
['./emails.parquet', './emails.parquet'],
)
You can also use grain.experimental.ThreadPrefetchIterDataset to overlap
CPU-side reading with downstream work.
ds = grain.experimental.ThreadPrefetchIterDataset(
ds,
prefetch_buffer_size=10,
)
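Thread prefetching is the classic producer/consumer pattern: a background thread fills a bounded buffer while the main thread consumes it. A minimal stdlib sketch of the idea (not Grain's implementation):

```python
import queue
import threading

_SENTINEL = object()

def thread_prefetch(iterable, buffer_size):
    """Iterates `iterable` on a background thread, buffering up to `buffer_size` items."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buf.put(item)  # blocks when the buffer is full
        buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item

print(list(thread_prefetch(range(5), buffer_size=2)))  # [0, 1, 2, 3, 4]
```

The bounded queue is what prefetch_buffer_size controls: a larger buffer smooths out bursty readers at the cost of memory.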
Transform Dataset#
# Load a pre-trained tokenizer.
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-cased")
class TokenizeText(grain.transforms.Map):
  """Tokenizes the text values within each element using a provided tokenizer."""

  def __init__(self, tokenizer):
    self._tokenizer = tokenizer

  def map(self, element):
    return [self._tokenizer.encode(item).tokens for item in element.values()]
# Tokenize the data using the provided tokenizer.
ds = ds.map(TokenizeText(tokenizer))
# Create an iterator object of the dataset.
ds_iter = iter(ds)
# Print the first element in the dataset.
pprint.pprint(next(ds_iter))
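If the tokenizers package is unavailable, the map logic above can still be exercised with a stand-in tokenizer. FakeTokenizer below is hypothetical (not part of tokenizers or Grain); it splits on whitespace instead of using WordPiece:

```python
from types import SimpleNamespace

class FakeTokenizer:
    """Hypothetical stand-in: splits on whitespace instead of WordPiece."""

    def encode(self, text):
        # Mimic the tokenizers Encoding shape: an object with a .tokens list.
        return SimpleNamespace(tokens=text.split())

tokenizer = FakeTokenizer()
element = {
    'email_subject': 'Meeting Reminder: Project X Update',
    'email_body': 'Hi team, see you at 10 AM.',
}
# Mirrors TokenizeText.map: tokenize every value in the element dict.
tokens = [tokenizer.encode(item).tokens for item in element.values()]
print(tokens[0])  # ['Meeting', 'Reminder:', 'Project', 'X', 'Update']
```

Each element yields one token list per column, in the dict's insertion order.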