New library: PG.bin
PG.bin is a library to parse Postgres COPY dumps made in binary format.
Postgres has a great API for transferring data into and out of a database, called COPY. What is special about it is that it supports three different formats: CSV, text, and binary. Both CSV and text are trivial: values are passed using their text representation; only the quoting rules and separator characters differ.
The binary format is special in that values are not text: they're passed exactly as they're stored in Postgres. Thus, the binary format is more compact: it's about 30% smaller than CSV or text. The same applies to performance: COPY-ing binary data back and forth takes about 15-25% less time.
To parse a binary dump, one must know its structure. This is what the library does: it knows how to parse such dumps. It supports most of the built-in Postgres types, including JSON(b). The API is simple and extensible.
Installation
Add this to your project:
;; lein
[com.github.igrishaev/pg-bin "0.1.0"]
;; deps
com.github.igrishaev/pg-bin {:mvn/version "0.1.0"}
Usage
Let’s prepare a binary dump as follows:
create temp table test(
f_01 int2,
f_02 int4,
f_03 int8,
f_04 boolean,
f_05 float4,
f_06 float8,
f_07 text,
f_08 varchar(12),
f_09 time,
f_10 timetz,
f_11 date,
f_12 timestamp,
f_13 timestamptz,
f_14 bytea,
f_15 json,
f_16 jsonb,
f_17 uuid,
f_18 numeric(12,3),
f_19 text null,
f_20 decimal
);
insert into test values (
1,
2,
3,
true,
123.456,
654.321,
'hello',
'world',
'10:42:35',
'10:42:35+0030',
'2025-11-30',
'2025-11-30 10:42:35',
'2025-11-30 10:42:35.123567+0030',
'\xDEADBEEF',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'4bda6037-1c37-4051-9898-13b82f1bd712',
'123456.123456',
null,
'123999.999100500'
);
\copy test to '/Users/ivan/dump.bin' with (format binary);
Let’s peek at what’s inside:
xxd -d /Users/ivan/dump.bin
00000000: 5047 434f 5059 0aff 0d0a 0000 0000 0000 PGCOPY..........
00000016: 0000 0000 1400 0000 0200 0100 0000 0400 ................
00000032: 0000 0200 0000 0800 0000 0000 0000 0300 ................
00000048: 0000 0101 0000 0004 42f6 e979 0000 0008 ........B..y....
00000064: 4084 7291 6872 b021 0000 0005 6865 6c6c @.r.hr.!....hell
00000080: 6f00 0000 0577 6f72 6c64 0000 0008 0000 o....world......
00000096: 0008 fa0e 9cc0 0000 000c 0000 0008 fa0e ................
00000112: 9cc0 ffff f8f8 0000 0004 0000 24f9 0000 ............$...
00000128: 0008 0002 e7cc 4a0a fcc0 0000 0008 0002 ......J.........
00000144: e7cb dec3 0d6f 0000 0004 dead beef 0000 .....o..........
00000160: 0030 7b22 666f 6f22 3a20 5b31 2c20 322c .0{"foo": [1, 2,
00000176: 2033 2c20 7b22 6b65 6b22 3a20 5b74 7275 3, {"kek": [tru
00000192: 652c 2066 616c 7365 2c20 6e75 6c6c 5d7d e, false, null]}
00000208: 5d7d 0000 0031 017b 2266 6f6f 223a 205b ]}...1.{"foo": [
00000224: 312c 2032 2c20 332c 207b 226b 656b 223a 1, 2, 3, {"kek":
00000240: 205b 7472 7565 2c20 6661 6c73 652c 206e [true, false, n
00000256: 756c 6c5d 7d5d 7d00 0000 104b da60 371c ull]}]}....K.`7.
00000272: 3740 5198 9813 b82f 1bd7 1200 0000 0e00 7@Q..../........
00000288: 0300 0100 0000 0300 0c0d 8004 ceff ffff ................
00000304: ff00 0000 1000 0400 0100 0000 0900 0c0f ................
00000320: 9f27 0700 32ff ff .'..2..
Now the library comes into play:
(ns some.ns
(:require
[clojure.java.io :as io]
[pg-bin.core :as copy]
taggie.core))
(def FIELDS
[:int2
:int4
:int8
:boolean
:float4
:float8
:text
:varchar
:time
:timetz
:date
:timestamp
:timestamptz
:bytea
:json
:jsonb
:uuid
:numeric
:text
:decimal])
(copy/parse "/Users/ivan/dump.bin" FIELDS)
[[1
2
3
true
(float 123.456)
654.321
"hello"
"world"
#LocalTime "10:42:35"
#OffsetTime "10:42:35+00:30"
#LocalDate "2025-11-30"
#LocalDateTime "2025-11-30T10:42:35"
#OffsetDateTime "2025-11-30T10:12:35.123567Z"
#bytes [-34, -83, -66, -17]
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
#uuid "4bda6037-1c37-4051-9898-13b82f1bd712"
123456.123M
nil
123999.999100500M]]
Here and below: I use Taggie to render complex values like date & time, byte arrays and so on. Really useful!
This is what is going on here: we parse a source pointing to a dump using the parse function. A source might be a file, a byte array, an input stream and so on – anything that can be coerced to an input stream with the clojure.java.io/input-stream function.
Binary files produced by Postgres don't carry their structure: unfortunately, there is no information about the types, only the data. You have to help the library traverse a binary dump by specifying a vector of types. The FIELDS variable declares the structure of the file. See below for the supported types.
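As a quick sketch (assuming the same dump and FIELDS as above), here is parse fed with a java.io.File and with a byte array instead of a string path:

;; the same dump parsed from a java.io.File...
(copy/parse (io/file "/Users/ivan/dump.bin") FIELDS)

;; ...and from a byte array read into memory beforehand
(let [payload (with-open [in (io/input-stream "/Users/ivan/dump.bin")]
                (.readAllBytes in))]
  (copy/parse payload FIELDS))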
API
There are two functions to parse, namely:
- pg-bin.core/parse accepts any source and returns a vector of parsed lines. This function is eager, meaning it consumes the whole source and accumulates lines in a vector.
- pg-bin.core/parse-seq accepts an InputStream and returns a lazy sequence of parsed lines. It must be called under the with-open macro as follows:
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
(let [lines (copy/parse-seq in FIELDS)]
(doseq [line lines]
...)))
Both functions accept a list of fields as the second argument.
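One practical note: since the sequence is lazy, it must be consumed before the stream is closed. A minimal sketch, assuming the dump from above: realize the rows (for example with vec) while still inside with-open if you need them afterwards.

;; realize the lazy sequence before the stream gets closed
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
  (vec (copy/parse-seq in FIELDS)))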
Skipping fields
When parsing, it’s likely that you don’t need all the fields. You may keep only the leading ones:
(copy/parse DUMP_PATH [:int2 :int4 :int8])
[[1 2 3]]
To skip fields located in the middle, use either :skip or an underscore:
(copy/parse DUMP_PATH [:int2 :skip :_ :boolean])
[[1 true]]
Raw fields
If, for any reason, you have a type in your dump that the library is not aware of, or you’d like to examine its binary representation, specify :raw or :bytes. Each value will then be a byte array. It’s up to you how to deal with those bytes:
(copy/parse DUMP_PATH [:raw :raw :bytes])
[[#bytes [0, 1]
#bytes [0, 0, 0, 2]
#bytes [0, 0, 0, 0, 0, 0, 0, 3]]]
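As a sketch of what dealing with those bytes might look like: the first column of the dump above is an int2, and Postgres sends values in network byte order (big-endian), so the two bytes can be decoded by hand with java.nio.ByteBuffer:

(import 'java.nio.ByteBuffer)

;; parse only the leading column as raw bytes and decode it manually
(let [[[raw-int2]] (copy/parse DUMP_PATH [:raw])]
  (.getShort (ByteBuffer/wrap raw-int2)))
;; 1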
Handling JSON
Postgres is well known for its vast JSON capabilities, and sometimes the tables we dump have json(b) columns. Above, you saw that by default they’re parsed as plain strings. This is because there is no built-in JSON parser in Java, and I don’t want to tie this library to a certain JSON implementation.
But the library provides a number of macros to extend the underlying multimethods. With a single line of code, you can enable parsing of json(b) types with Cheshire, Jsonista, clojure.data.json, Charred, or JSam. This is how to do it:
(ns some.ns
(:require
[pg-bin.core :as copy]
[pg-bin.json :as json]))
(json/set-cheshire keyword) ;; overrides multimethods
(copy/parse DUMP_PATH FIELDS)
[[...
{:foo [1 2 3 {:kek [true false nil]}]}
{:foo [1 2 3 {:kek [true false nil]}]}
...]]
The set-cheshire macro extends the multimethods, assuming you have Cheshire installed. Now the parse function, when facing json(b) types, will decode them properly.
The pg-bin.json namespace provides the following macros:
- set-string: parse json(b) types as strings again;
- set-cheshire: parse using Cheshire;
- set-data-json: parse using clojure.data.json;
- set-jsonista: parse using Jsonista;
- set-charred: parse using Charred;
- set-jsam: parse using JSam.
All of them accept optional parameters that are passed into the underlying parsing function.
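For example, assuming the extra arguments are forwarded as-is to the underlying parser (here, clojure.data.json and its :key-fn option), keywordizing keys and later switching back might look like this sketch:

;; assumption: the optional parameters go straight to clojure.data.json
(json/set-data-json :key-fn keyword)

;; back to plain strings
(json/set-string)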
PG.Bin doesn’t introduce any JSON-related dependencies. Each macro assumes you have added the corresponding library to the classpath.
Metadata
Each parsed line tracks its length in bytes, offset from the beginning of a file (or a stream) and a unique index:
(-> (copy/parse DUMP_PATH FIELDS)
first
meta)
#:pg{:length 306, :index 0, :offset 19}
Knowing these values might help when reading a dump in chunks.
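For instance, here is a small sketch that uses the metadata keys shown above to print where each row sits in the dump:

;; report the position and size of every parsed row
(doseq [line (copy/parse DUMP_PATH FIELDS)]
  (let [{:pg/keys [index offset length]} (meta line)]
    (println "row" index "starts at byte" offset "and takes" length "bytes")))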
Supported types
- :raw :bytea :bytes - raw access and bytea
- :skip :_ nil - skip a certain field
- :uuid - parse UUIDs
- :int2 :short :smallint :smallserial - 2-byte integer (short)
- :int4 :int :integer :oid :serial - 4-byte integer (integer)
- :int8 :bigint :long :bigserial - 8-byte integer (long)
- :numeric :decimal - numeric type (becomes BigDecimal)
- :float4 :float :real - 4-byte float (float)
- :float8 :double :double-precision - 8-byte float (double)
- :boolean :bool - boolean
- :text :varchar :enum :name :string - text values
- :date - becomes java.time.LocalDate
- :time :time-without-time-zone - becomes java.time.LocalTime
- :timetz :time-with-time-zone - becomes java.time.OffsetTime
- :timestamp :timestamp-without-time-zone - becomes java.time.LocalDateTime
- :timestamptz :timestamp-with-time-zone - becomes java.time.OffsetDateTime
Ping me for more types, if needed.
On Writing
At the moment, the library only parses binary dumps. Writing them is possible but requires extra work. Ping me if you really need to write binary files.
Scenarios
Why use this library at all? Imagine you have to fetch a mas-s-s-ive chunk of rows from a database, say 2-3 million, to build a report. That might be an issue: you don’t want to saturate memory, nor do you want to paginate with LIMIT/OFFSET, as it’s slow. A simple solution is to dump the data you need into a file and then process it. You won’t keep the database constantly busy, as you’re working with a dump! Here is a small demo:
(ns some.ns
  (:require
   [clojure.java.io :as io]
   [next.jdbc :as jdbc] ;; assumption: next.jdbc is used to obtain connections
   [pg-bin.core :as copy]
   [pg-bin.json :as json])
  (:import
   (java.io PipedInputStream PipedOutputStream)
   (java.sql Connection)
   (org.postgresql.copy CopyManager)
   (org.postgresql.core BaseConnection)))
(defn make-copy-manager
"
Build an instance of CopyManager from a connection.
"
^CopyManager [^Connection conn]
(new CopyManager (.unwrap conn BaseConnection)))
(let [conn (jdbc/get-connection data-source)
mgr (make-copy-manager conn)
sql "copy table_name(col1, col2...) to stdout with (format binary)"
;; you can use a query without parameters as well
sql "copy (select... from... where...) to stdout with (format binary)"
]
(with-open [out (io/output-stream "/path/to/dump.bin")]
(.copyOut mgr sql out)))
(with-open [in (io/input-stream "/path/to/dump.bin")]
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines]
...)))
Above, we dump the data into a file and then process it. There is also a way to process lines on the fly using another thread. Here is the second demo:
(let [conn
(jdbc/get-connection data-source)
mgr
(make-copy-manager conn)
sql
"copy table_name(col1, col2...) to stdout with (format binary)"
in
(new PipedInputStream)
started? (promise)
fut ;; a future to process the output
(future
(with-open [_ in] ;; must close it afterward
(deliver started? true) ;; must report we have started
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines] ;; process on the fly
;; without touching the disk
...))))]
;; ensure the future has started
@started?
;; drain down to the piped output stream
(with-open [out (new PipedOutputStream in)]
(.copyOut mgr sql out))
@fut ;; wait for the future to complete
)