New library: PG.bin
PG.bin is a library to parse Postgres COPY dumps made in binary format.
Postgres has a great API for transferring data into and out of a database, called COPY. What is special about it is that it supports three different formats: CSV, text, and binary. Both CSV and text are trivial: values are passed using their text representation; only the quoting rules and separator characters differ.
The binary format is special in that values are not text: they're passed exactly as they're stored in Postgres. Thus, the binary format is more compact: it's about 30% smaller than CSV or text. The same applies to performance: COPY-ing binary data back and forth takes about 15-25% less time.
To parse a binary dump, one must know its structure. This is what the library does: it knows how to parse such dumps. It supports most of the built-in Postgres types, including JSON(b). The API is simple and extensible.
Installation
Add this to your project:
;; lein
[com.github.igrishaev/pg-bin "0.1.0"]
;; deps
com.github.igrishaev/pg-bin {:mvn/version "0.1.0"}
Usage
Let’s prepare a binary dump as follows:
create temp table test(
f_01 int2,
f_02 int4,
f_03 int8,
f_04 boolean,
f_05 float4,
f_06 float8,
f_07 text,
f_08 varchar(12),
f_09 time,
f_10 timetz,
f_11 date,
f_12 timestamp,
f_13 timestamptz,
f_14 bytea,
f_15 json,
f_16 jsonb,
f_17 uuid,
f_18 numeric(12,3),
f_19 text null,
f_20 decimal
);
insert into test values (
1,
2,
3,
true,
123.456,
654.321,
'hello',
'world',
'10:42:35',
'10:42:35+0030',
'2025-11-30',
'2025-11-30 10:42:35',
'2025-11-30 10:42:35.123567+0030',
'\xDEADBEEF',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'4bda6037-1c37-4051-9898-13b82f1bd712',
'123456.123456',
null,
'123999.999100500'
);
\copy test to '/Users/ivan/dump.bin' with (format binary);
Let’s peek at what’s inside:
xxd -d /Users/ivan/dump.bin
00000000: 5047 434f 5059 0aff 0d0a 0000 0000 0000 PGCOPY..........
00000016: 0000 0000 1400 0000 0200 0100 0000 0400 ................
00000032: 0000 0200 0000 0800 0000 0000 0000 0300 ................
00000048: 0000 0101 0000 0004 42f6 e979 0000 0008 ........B..y....
00000064: 4084 7291 6872 b021 0000 0005 6865 6c6c @.r.hr.!....hell
00000080: 6f00 0000 0577 6f72 6c64 0000 0008 0000 o....world......
00000096: 0008 fa0e 9cc0 0000 000c 0000 0008 fa0e ................
00000112: 9cc0 ffff f8f8 0000 0004 0000 24f9 0000 ............$...
00000128: 0008 0002 e7cc 4a0a fcc0 0000 0008 0002 ......J.........
00000144: e7cb dec3 0d6f 0000 0004 dead beef 0000 .....o..........
00000160: 0030 7b22 666f 6f22 3a20 5b31 2c20 322c .0{"foo": [1, 2,
00000176: 2033 2c20 7b22 6b65 6b22 3a20 5b74 7275 3, {"kek": [tru
00000192: 652c 2066 616c 7365 2c20 6e75 6c6c 5d7d e, false, null]}
00000208: 5d7d 0000 0031 017b 2266 6f6f 223a 205b ]}...1.{"foo": [
00000224: 312c 2032 2c20 332c 207b 226b 656b 223a 1, 2, 3, {"kek":
00000240: 205b 7472 7565 2c20 6661 6c73 652c 206e [true, false, n
00000256: 756c 6c5d 7d5d 7d00 0000 104b da60 371c ull]}]}....K.`7.
00000272: 3740 5198 9813 b82f 1bd7 1200 0000 0e00 7@Q..../........
00000288: 0300 0100 0000 0300 0c0d 8004 ceff ffff ................
00000304: ff00 0000 1000 0400 0100 0000 0900 0c0f ................
00000320: 9f27 0700 32ff ff .'..2..
Now the library comes into play:
(ns some.ns
(:require
[clojure.java.io :as io]
[pg-bin.core :as copy]
taggie.core))
(def FIELDS
[:int2
:int4
:int8
:boolean
:float4
:float8
:text
:varchar
:time
:timetz
:date
:timestamp
:timestamptz
:bytea
:json
:jsonb
:uuid
:numeric
:text
:decimal])
(copy/parse "/Users/ivan/dump.bin" FIELDS)
[[1
2
3
true
(float 123.456)
654.321
"hello"
"world"
#LocalTime "10:42:35"
#OffsetTime "10:42:35+00:30"
#LocalDate "2025-11-30"
#LocalDateTime "2025-11-30T10:42:35"
#OffsetDateTime "2025-11-30T10:12:35.123567Z"
#bytes [-34, -83, -66, -17]
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
#uuid "4bda6037-1c37-4051-9898-13b82f1bd712"
123456.123M
nil
123999.999100500M]]
Here and below: I use Taggie to render complex values like date & time, byte arrays and so on. Really useful!
This is what is going on here: we parse a source pointing to a dump using the parse function. A source might be a file, a byte array, an input stream and so on – anything that can be coerced to an input stream with the clojure.java.io/input-stream function.
Binary files produced by Postgres don't carry their structure: unfortunately, there is no information about the types, only the data. You have to help the library traverse a binary dump by specifying a vector of types. The FIELDS variable declares the structure of the file. See below for the supported types.
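As a quick sketch (assuming the same dump and FIELDS as above), here is parse fed with a java.io.File and with a byte array instead of a string path:

;; the same dump parsed from a java.io.File...
(copy/parse (io/file "/Users/ivan/dump.bin") FIELDS)

;; ...and from a byte array read into memory beforehand
(let [payload (with-open [in (io/input-stream "/Users/ivan/dump.bin")]
                (.readAllBytes in))]
  (copy/parse payload FIELDS))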
API
There are two functions to parse, namely:
- pg-bin.core/parse accepts any source and returns a vector of parsed lines. This function is eager, meaning it consumes the whole source and accumulates lines in a vector.
- pg-bin.core/parse-seq accepts an InputStream and returns a lazy sequence of parsed lines. It must be called under the with-open macro as follows:
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
(let [lines (copy/parse-seq in FIELDS)]
(doseq [line lines]
...)))
Both functions accept a list of fields as the second argument.
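One practical note: since the sequence is lazy, it must be consumed before the stream is closed. A minimal sketch, assuming the dump from above: realize the rows (for example with vec) while still inside with-open if you need them afterwards.

;; realize the lazy sequence before the stream gets closed
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
  (vec (copy/parse-seq in FIELDS)))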
Skipping fields
When parsing, it’s likely that you don’t need all the fields. You may keep only the leading ones:
(copy/parse DUMP_PATH [:int2 :int4 :int8])
[[1 2 3]]
To skip fields located in the middle, use either :skip or an underscore:
(copy/parse DUMP_PATH [:int2 :skip :_ :boolean])
[[1 true]]
Raw fields
If, for any reason, you have a type in your dump that the library is not aware of, or you’d like to examine its binary representation, specify :raw or :bytes. Each value will then be a byte array. It’s up to you how to deal with those bytes:
(copy/parse DUMP_PATH [:raw :raw :bytes])
[[#bytes [0, 1]
#bytes [0, 0, 0, 2]
#bytes [0, 0, 0, 0, 0, 0, 0, 3]]]
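As a sketch of what dealing with those bytes might look like: the first column of the dump above is an int2, and Postgres sends values in network byte order (big-endian), so the two bytes can be decoded by hand with java.nio.ByteBuffer:

(import 'java.nio.ByteBuffer)

;; parse only the leading column as raw bytes and decode it manually
(let [[[raw-int2]] (copy/parse DUMP_PATH [:raw])]
  (.getShort (ByteBuffer/wrap raw-int2)))
;; 1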
Handling JSON
Postgres is well known for its vast JSON capabilities, and sometimes the tables we dump have json(b) columns. Above, you saw that by default they’re parsed as plain strings. This is because there is no built-in JSON parser in Java, and I don’t want to tie this library to a certain JSON implementation.
But the library provides a number of macros to extend the underlying multimethods. With a single line of code, you can enable parsing of json(b) types with Cheshire, Jsonista, clojure.data.json, Charred, or JSam. This is how to do it:
(ns some.ns
(:require
[pg-bin.core :as copy]
[pg-bin.json :as json]))
(json/set-cheshire keyword) ;; overrides multimethods
(copy/parse DUMP_PATH FIELDS)
[[...
{:foo [1 2 3 {:kek [true false nil]}]}
{:foo [1 2 3 {:kek [true false nil]}]}
...]]
The set-cheshire macro extends the multimethods, assuming you have Cheshire installed. Now the parse function, when facing json(b) types, will decode them properly.
The pg-bin.json namespace provides the following macros:
- set-string: parse json(b) types as strings again;
- set-cheshire: parse using Cheshire;
- set-data-json: parse using clojure.data.json;
- set-jsonista: parse using Jsonista;
- set-charred: parse using Charred;
- set-jsam: parse using JSam.
All of them accept optional parameters that are passed into the underlying parsing function.
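For example, assuming the extra arguments are forwarded as-is to the underlying parser (here, clojure.data.json and its :key-fn option), keywordizing keys and later switching back might look like this sketch:

;; assumption: the optional parameters go straight to clojure.data.json
(json/set-data-json :key-fn keyword)

;; back to plain strings
(json/set-string)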
PG.Bin doesn’t introduce any JSON-related dependencies. Each macro assumes you have added the corresponding library to the classpath.
Metadata
Each parsed line tracks its length in bytes, offset from the beginning of a file (or a stream) and a unique index:
(-> (copy/parse DUMP_PATH FIELDS)
first
meta)
#:pg{:length 306, :index 0, :offset 19}
Knowing these values might help when reading a dump in chunks.
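For instance, here is a small sketch that uses the metadata keys shown above to print where each row sits in the dump:

;; report the position and size of every parsed row
(doseq [line (copy/parse DUMP_PATH FIELDS)]
  (let [{:pg/keys [index offset length]} (meta line)]
    (println "row" index "starts at byte" offset "and takes" length "bytes")))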
Supported types
- :raw :bytea :bytes - raw access and bytea
- :skip :_ nil - skip a certain field
- :uuid - parse UUIDs
- :int2 :short :smallint :smallserial - 2-byte integer (short)
- :int4 :int :integer :oid :serial - 4-byte integer (integer)
- :int8 :bigint :long :bigserial - 8-byte integer (long)
- :numeric :decimal - numeric type (becomes BigDecimal)
- :float4 :float :real - 4-byte float (float)
- :float8 :double :double-precision - 8-byte float (double)
- :boolean :bool - boolean
- :text :varchar :enum :name :string - text values
- :date - becomes java.time.LocalDate
- :time :time-without-time-zone - becomes java.time.LocalTime
- :timetz :time-with-time-zone - becomes java.time.OffsetTime
- :timestamp :timestamp-without-time-zone - becomes java.time.LocalDateTime
- :timestamptz :timestamp-with-time-zone - becomes java.time.OffsetDateTime
Ping me for more types, if needed.
On Writing
At the moment, the library only parses binary dumps. Writing them is possible but requires extra work. Ping me if you really need to write binary files.
Scenarios
Why use this library at all? Imagine you have to fetch a mas-s-s-ive chunk of rows from a database, say 2-3 million, to build a report. That might be an issue: you don’t want to saturate memory, nor do you want to paginate with LIMIT/OFFSET, as it’s slow. A simple solution is to dump the data you need into a file and then process it. You won’t keep the database constantly busy, as you’re working with a dump! Here is a small demo:
(ns some.ns
  (:require
   [clojure.java.io :as io]
   [next.jdbc :as jdbc] ;; assumption: next.jdbc is used to obtain connections
   [pg-bin.core :as copy]
   [pg-bin.json :as json])
  (:import
   (java.io PipedInputStream PipedOutputStream)
   (java.sql Connection)
   (org.postgresql.copy CopyManager)
   (org.postgresql.core BaseConnection)))
(defn make-copy-manager
"
Build an instance of CopyManager from a connection.
"
^CopyManager [^Connection conn]
(new CopyManager (.unwrap conn BaseConnection)))
(let [conn (jdbc/get-connection data-source)
mgr (make-copy-manager conn)
sql "copy table_name(col1, col2...) to stdout with (format binary)"
;; you can use a query without parameters as well
sql "copy (select... from... where...) to stdout with (format binary)"
]
(with-open [out (io/output-stream "/path/to/dump.bin")]
(.copyOut mgr sql out)))
(with-open [in (io/input-stream "/path/to/dump.bin")]
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines]
...)))
Above, we dump the data into a file and then process it. There is also a way to process lines on the fly using another thread. Here is the second demo:
(let [conn
(jdbc/get-connection data-source)
mgr
(make-copy-manager conn)
sql
"copy table_name(col1, col2...) to stdout with (format binary)"
in
(new PipedInputStream)
started? (promise)
fut ;; a future to process the output
(future
(with-open [_ in] ;; must close it afterward
(deliver started? true) ;; must report we have started
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines] ;; process on the fly
;; without touching the disk
...))))]
;; ensure the future has started
@started?
;; drain down to the piped output stream
(with-open [out (new PipedOutputStream in)]
(.copyOut mgr sql out))
@fut ;; wait for the future to complete
)