Parquet reader refactor: move to a Streamly-based streaming pipeline (bounded memory, clearer structure, optional concurrency) #188
Conversation
… (Each column in a stream is a chunk in the larger column)
Eiko-Tokura left a comment
Thanks for the great work!
```haskell
readRanges = mapM readBytes

readSuffix :: Int -> m ByteString
```
```haskell
newtype ReaderIO r a = ReaderIO {runReaderIO :: r -> IO a}
```
Thanks for the great work! Would it be better to use the mature `ReaderT r IO` from mtl or transformers instead of rolling our own instances, i.e. `type ReaderIO r = ReaderT r IO`? If the intent was to avoid extra dependencies then I think it's fine. (My guess is that we will eventually need `StateT` for a state-accumulating writer anyway.)
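For reference, the swap suggested here is tiny against transformers (which ships with GHC): the instances all come for free. A minimal sketch — the `Env`/`getChunkSize` environment is illustrative, standing in for the PR's file handle:

```haskell
import Control.Monad.Trans.Reader (ReaderT (..), asks)

-- Replacing the hand-rolled newtype with transformers' ReaderT; the
-- Functor/Applicative/Monad/MonadIO instances are inherited.
type ReaderIO r = ReaderT r IO

-- Hypothetical environment standing in for the file handle.
newtype Env = Env {chunkSize :: Int}

-- An example action reading a field from the environment.
getChunkSize :: ReaderIO Env Int
getChunkSize = asks chunkSize

-- The old runner is just runReaderT under the synonym.
runReaderIO :: ReaderIO r a -> r -> IO a
runReaderIO = runReaderT
```

Call sites like `runReaderIO parseParquet handle` would be unchanged by this swap.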
You are correct that `ReaderT` is the right thing to use here (that's really the pattern I was going for, to account for multiple ways of reading data). I think we should hold off on adding the dependency until we actually need the monad transformer stack (or a different effect system, should we decide that's something we want).
```haskell
data Range = Range {offset :: !Integer, length :: !Int} deriving (Eq, Show)

class (Monad m) => RandomAccess m where
```
(Non-critical, just a remark.) Maybe we can try to merge the `RandomAccess` abstraction with `DataFrame.IO.Parquet.Seeking` into one interface at a later stage.
Yes. I'll be doing that as I remove the Unstable module and move the code into the current `DataFrame.IO` module.
```haskell
import qualified Data.ByteString as BS
import Data.Functor ((<&>))
import Data.List (foldl', transpose)
import qualified Data.Map as Map
```
I think we can use `Data.Map.Strict` by default; there is no need to be lazy here.
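To illustrate the suggestion: `Data.Map.Strict` is a drop-in swap for `Data.Map` (same `Map` type, same API), but it forces values on insertion, so accumulators built with `insertWith` don't pile up thunks. A small sketch (the `countOccurrences` example is ours, not from the PR):

```haskell
import qualified Data.Map.Strict as Map

-- With the strict API, insertWith forces the combined value immediately,
-- so counters like this don't accumulate a chain of (+1) thunks per key.
countOccurrences :: (Ord k) => [k] -> Map.Map k Int
countOccurrences = foldr (\k m -> Map.insertWith (+) k 1 m) Map.empty
```

Since both modules share the same `Map` type, only the import needs to change.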
```haskell
GZIP -> pure (LB.toStrict (GZip.decompress (BS.fromStrict compressed)))
other -> error ("Unsupported compression type: " ++ show other)

readPage :: CompressionCodec -> BS.ByteString -> IO (Maybe Page, BS.ByteString)
```
Not related and out of context: this looks like an `Unfold` to me.
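To make the remark concrete without pulling in streamly: `readPage`'s shape, `state -> m (Maybe item, state)`, is one small adapter away from the `state -> Maybe (item, state)` step that `Data.List.unfoldr` (and streamly's `Unfold.unfoldrM`, in its monadic form) expects. A dependency-free toy version, splitting a buffer into fixed-size "pages" — the names `readToyPage`/`asStep`/`pages` are illustrative, not from the PR:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.List (unfoldr)

-- Toy stand-in for readPage: yield the next fixed-size "page" and the
-- remaining bytes, or Nothing when the buffer is exhausted.
readToyPage :: Int -> BS.ByteString -> (Maybe BS.ByteString, BS.ByteString)
readToyPage n bs
  | BS.null bs = (Nothing, bs)
  | otherwise = let (page, rest) = BS.splitAt n bs in (Just page, rest)

-- Adapter from readPage's (Maybe item, state) shape to the
-- Maybe (item, state) shape an unfold step wants.
asStep :: (s -> (Maybe a, s)) -> s -> Maybe (a, s)
asStep f s = case f s of
  (Nothing, _) -> Nothing
  (Just a, s') -> Just (a, s')

-- Stream of pages driven by the unfold step.
pages :: Int -> BS.ByteString -> [BS.ByteString]
pages n = unfoldr (asStep (readToyPage n))
```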
```haskell
_ -> False

decompressData :: CompressionCodec -> BS.ByteString -> IO BS.ByteString
decompressData codec compressed = case codec of
```
The result of `decompressData` is used to produce a stream of `Page`s (`readPage`). This decompression is strict in nature; I'm not sure if we can do a lazy, on-demand decompression.
```haskell
result <- next
drainZstd result BS.empty (chunk : acc)
drainZstd (Zstd.Done final) _ acc =
  pure $ BS.concat (reverse (final : acc))
```
bytestring might have something similar to `fromListRevN` or `fromChunksRev`. If not, it should be easy to write our own.
We can avoid a list traversal and pre-allocate the resulting array, avoiding any unnecessary copies.
```haskell
mmapFileForeignPtr,
)

uncurry_ :: (a -> b -> c -> d) -> (a, b, c) -> d
```
You can maybe call this `uncurry3` or something? A `_` suffix generally signifies a discarded result. There are no rules though :-)
```haskell
unsafeToByteString :: VS.Vector Word8 -> ByteString
unsafeToByteString v = PS (castForeignPtr ptr) offset len
  where
    (ptr, offset, len) = VS.unsafeToForeignPtr v
```
This will cause a maintenance burden.
The core datatype has changed across different versions of bytestring.
We either have to constrain bytestring to specific versions or support multiple implementations here using CPP macros.
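One way to sidestep both the version pin and the CPP: `Data.ByteString.Internal` also exports `fromForeignPtr`/`toForeignPtr`, which have kept the `(ForeignPtr Word8, offset, length)` interface across the 0.10 → 0.11 representation change, so the triple from `VS.unsafeToForeignPtr` could be fed to `fromForeignPtr` instead of the `PS` constructor. A dependency-free round-trip sketch of the stable API (the `rebuild` name is ours):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BSI

-- Take a ByteString apart and rebuild it via the version-stable
-- fromForeignPtr/toForeignPtr functions instead of pattern-matching on
-- the version-dependent PS constructor.
rebuild :: BS.ByteString -> BS.ByteString
rebuild bs =
  let (fp, off, len) = BSI.toForeignPtr bs
   in BSI.fromForeignPtr fp off len
```

With that, `unsafeToByteString` would be `BSI.fromForeignPtr (castForeignPtr ptr) offset len` and no constructor is touched.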
```haskell
sizes = map (fromIntegral . BS.index footer) [0 .. 3]
  in foldl' (.|.) 0 $ zipWith shiftL sizes [0, 8 .. 24]

parseColumns :: (RandomAccess r, MonadIO r) => FileMetadata -> [Stream r Column]
```
I don't like this: `[Stream r ColumnChunk]`. That said, I'm not in a position to suggest a better alternative.
Could you help me understand how this fits into the bigger picture? Does each element in this list correspond to a column?
Update: I think I see where this is used.
You can return a vector directly here. `Vector (Stream Column)` is easier to reason about than `[Stream Column]`.
FYI, `Data.Vector` == `Streamly.Data.Array` (boxed & unboxed).
```haskell
case Pinch.decode Pinch.compactProtocol rawMetadata of
  Left e -> error $ show e
  Right metadata -> return metadata
```
- You can use `maybe`.
- Use of `error` will make the control flow harder to reason with and manage later.
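A sketch of one `error`-free alternative for this spot: lift the decoder's `Either` into `IO` by throwing a typed exception, which callers can catch, instead of an uncatchable-by-type `error` call. The `DecodeError` wrapper and `decodeOrThrow` name are illustrative, not from the PR:

```haskell
import Control.Exception (Exception, throwIO, try)

-- A dedicated exception type so callers can catch decode failures
-- specifically, rather than pattern-matching on ErrorCall messages.
newtype DecodeError = DecodeError String
  deriving (Show)

instance Exception DecodeError

-- Turn a decoder's Either result into an IO action that throws a typed
-- exception on failure instead of calling error.
decodeOrThrow :: Either String a -> IO a
decodeOrThrow = either (throwIO . DecodeError) pure
```

At the call site this would read `metadata <- decodeOrThrow (first show (Pinch.decode ...))`, keeping the failure in the exception machinery rather than in partial code.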
```haskell
readParquetUnstable filepath = IO.withFile filepath IO.ReadMode $ \handle -> do
  runReaderIO parseParquet handle

parseParquet :: (RandomAccess r, MonadIO r) => r DataFrame
```
```diff
@@ -0,0 +1,672 @@
+{-# LANGUAGE DataKinds #-}
```
I've not reviewed this module. It mostly looks like necessary boilerplate.
```haskell
{- | Build a forest from a flat, depth-first schema list,
consuming elements and returning (tree, remaining).
-}
data SchemaTree = SchemaTree SchemaElement [SchemaTree]
```
This looks like a rose tree.
Is there already an existing library with the performance and representation ironed out?
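containers (bundled with GHC) already provides the rose tree as `Data.Tree.Tree`/`Forest`, so `SchemaTree` could be `Tree SchemaElement`. A toy sketch of the "consume a depth-first list, return (tree, remaining)" step against that type, with `(label, childCount)` pairs standing in for `SchemaElement`s (the `unflatten` names are ours):

```haskell
import Data.Tree (Tree (..))

-- Rebuild one rose tree from a depth-first list of (label, numChildren)
-- pairs, returning the tree and the unconsumed suffix.
unflatten :: [(a, Int)] -> (Tree a, [(a, Int)])
unflatten ((x, n) : rest) =
  let (kids, rest') = unflattenN n rest
   in (Node x kids, rest')
unflatten [] = error "unflatten: empty input"

-- Consume exactly k sibling subtrees from the front of the list.
unflattenN :: Int -> [(a, Int)] -> ([Tree a], [(a, Int)])
unflattenN 0 rest = ([], rest)
unflattenN k rest =
  let (t, rest') = unflatten rest
      (ts, rest'') = unflattenN (k - 1) rest'
   in (t : ts, rest'')
```

Using `Data.Tree` would also bring the usual `Functor`/`Foldable`/`Traversable` instances and utilities like `flatten` and `levels` for free.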
Solves #133 and #171.
- A `RandomAccess` class for abstracting random access on files (which can be extended to remote files as well).
- `decodePageData` is reused from the legacy parser.
- `Stream.unfoldEach`: we transpose our `RowGroup`s to get a `Stream` of `ColumnChunk`s, and we define an `Unfold` that yields the parsed `Column` given a `ColumnChunk` (just the part of the column that's relevant).
- We allocate with `newMutableColumn`, copy the `Column`s yielded by the stream into it using `copyIntoMutableColumn`, and then freeze the mutable column. So no growing is necessary.

Next steps
- `FIXED_LEN_BYTE_ARRAY`
- `DataFrame.IO.Parser.Types`
- `(a, ByteString)`. We might even try using a monad transformer stack along with that to clean up the code.
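The transpose step in the PR summary can be modelled with plain lists: each `RowGroup` contributes one chunk per column, so transposing the row-major groups yields, for every column, its chunks in order, which are then assembled into a single column. A dependency-free toy where `[Int]` stands in for a `ColumnChunk` (the `assembleColumns` name is ours):

```haskell
import Data.List (transpose)

-- Each inner list is a row group: one chunk (here, a list of values) per
-- column. Transposing groups the chunks by column; concatenating them
-- plays the role of copyIntoMutableColumn into a pre-sized column.
assembleColumns :: [[[Int]]] -> [[Int]]
assembleColumns rowGroups = map concat (transpose rowGroups)
```

In the real pipeline the per-column chunk list becomes a `Stream` of `ColumnChunk`s and the `concat` is the `Unfold` plus the mutable-column copy, but the data movement is the same shape.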