## TLDR
- Protobufs are a mainstay of microservice development.
- You can use them in lieu of JSON when interacting with a web service.
- They’re much faster than JSON when you’re deserializing or serializing them.
- They’re designed mostly for applications that talk to each other.
- They’d be an excellent choice for MCP-centric applications as well.
## Introduction
I first heard about protobufs from a friend working at Gojek in 2017. I didn’t
know what they were used for, and even when I looked them up, I didn’t understand
why I needed them. JSON was good enough, wasn’t it?
Honestly, I’ve noticed this is a pattern (sample size > 10) among developers
who have mostly coded in Python. Protobufs were something that came out of Java
(preconception: mine), and they continued to be used by people from that world,
who perhaps went on to become Go developers.
I was wrong, and I’m glad that I discovered them when I did.
For those of you who are reading about protobufs for the first time, here’s the
short story.

> Protocol Buffers (protobuf) are a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Programmers define their data in a `.proto` file; the protobuf compiler then
generates code that, together with a language-specific runtime library, is used
to read and write that data. For example:
```proto
edition = "2023";

message Book {
  string isbn = 1;
  string title = 2;
  string author = 3;
  int32 pagecount = 4;
}
```
## Using a Protobuf in Python
Let’s take the above proto example and save it to `book.proto`.
ℹ️ Note: The numbers you see above are not default values. They’re field tags,
and they have meaning: tags in the range 1-15 take one byte to encode, so use
them for frequently-used fields. Avoid reusing or overriding old tags by marking
retired ones with the `reserved` keyword, as in the sketch below.
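A minimal sketch of what that looks like, assuming a hypothetical field that once used tag 5 and has since been removed:

```proto
edition = "2023";

message Book {
  // Tag 5 once belonged to a now-removed field; reserving it ensures
  // nobody accidentally reuses the number with a different meaning.
  reserved 5;

  string isbn = 1;
  string title = 2;
  string author = 3;
  int32 pagecount = 4;
}
```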
Ensure that you’ve installed `grpcio-tools` using `uv add` or `pip install`.
Then generate the Python code for the protobuf by running
`python -m grpc_tools.protoc -I. --python_out=. book.proto`.
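Put together, the two steps look something like this (assuming a uv-managed project; plain `pip install grpcio-tools` works just as well):

```bash
# install the protoc wrapper and the protobuf runtime
uv add grpcio-tools

# compile book.proto into book_pb2.py in the current directory
python -m grpc_tools.protoc -I. --python_out=. book.proto
```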
This should have created the file `book_pb2.py` in the current directory. The
file should look like this:
```python
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# NO CHECKED-IN PROTOBUF GENCODE
# source: book.proto
# Protobuf Python Version: 5.29.0
"""Generated protocol buffer code."""
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import runtime_version as _runtime_version
from google.protobuf import symbol_database as _symbol_database
from google.protobuf.internal import builder as _builder
_runtime_version.ValidateProtobufRuntimeVersion(
    _runtime_version.Domain.PUBLIC,
    5,
    29,
    0,
    '',
    'book.proto'
)
# @@protoc_insertion_point(imports)

_sym_db = _symbol_database.Default()


DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\nbook.proto\"F\n\x04\x42ook\x12\x0c\n\x04isbn\x18\x01 \x01(\t\x12\r\n\x05title\x18\x02 \x01(\t\x12\x0e\n\x06\x61uthor\x18\x03 \x01(\t\x12\x11\n\tpagecount\x18\x04 \x01(\x05\x62\x08\x65\x64itionsp\xe8\x07')

_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'book_pb2', _globals)
if not _descriptor._USE_C_DESCRIPTORS:
  DESCRIPTOR._loaded_options = None
  _globals['_BOOK']._serialized_start=14
  _globals['_BOOK']._serialized_end=84
# @@protoc_insertion_point(module_scope)
```
`protoc`, the protobuf compiler, can generate this for any supported language
(the Python variant via `grpcio-tools` naturally emits Python syntax).
As the docstring at the top tells you, you should not edit this file.
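As a hedged aside, if you have the standalone `protoc` binary installed, the same `.proto` file can be compiled for other languages simply by swapping the output flag:

```bash
# assumes the standalone protoc compiler is on your PATH
protoc --java_out=. book.proto   # Java classes
protoc --cpp_out=. book.proto    # C++ header and source
```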
How do you use it?
```python
import book_pb2

book = book_pb2.Book(
    isbn="9781857232097",
    title="The Fires of Heaven",
    author="Robert Jordan",
    pagecount=912,
)

# Serialize into a compact bytes payload; you can write this to a file if you want.
serialized_book = book.SerializeToString()

# You can deserialize this back into a Book object if required.
book = book_pb2.Book()
book.ParseFromString(serialized_book)
print(f"{book.title} by {book.author}")
```
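Since the serialized payload is just bytes, persisting it to disk is straightforward. A small sketch (the `book.bin` filename is only an example):

```python
import book_pb2

book = book_pb2.Book(isbn="9781857232097", title="The Fires of Heaven",
                     author="Robert Jordan", pagecount=912)

# Write the serialized bytes to a file...
with open("book.bin", "wb") as f:
    f.write(book.SerializeToString())

# ...and read them back into a fresh Book object later.
with open("book.bin", "rb") as f:
    restored = book_pb2.Book()
    restored.ParseFromString(f.read())

print(f"{restored.title} by {restored.author}")
```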
I used to wonder why this was really all that useful, until I looked at the
performance numbers. To measure them, we’ll use the `timeit` module.
```python
import json
import timeit

import book_pb2


def serialize_to_json(book):
    return json.dumps(book)


def serialize_to_protobuf(book):
    book = book_pb2.Book(isbn=book["isbn"], title=book["title"], author=book["author"], pagecount=book["pagecount"])
    return book.SerializeToString()


def deserialize_from_json(book: str):
    return json.loads(book)


def deserialize_from_protobuf(book: bytes):
    book_obj = book_pb2.Book()
    book_obj.ParseFromString(book)
    return book_obj


if __name__ == "__main__":
    book = dict(
        isbn="9781857232097",
        title="The Fires of Heaven",
        author="Robert Jordan",
        pagecount=912,
    )

    trial_json = timeit.timeit("serialize_to_json(book)", number=10**6, globals=dict(serialize_to_json=serialize_to_json, book=book))
    trial_protobuf = timeit.timeit("serialize_to_protobuf(book)", number=10**6, globals=dict(serialize_to_protobuf=serialize_to_protobuf, book=book))
    print("Runtime results for serialization across 10^6 runs:")
    print(f"JSON={round(trial_json, 4)}")
    print(f"protobuf={round(trial_protobuf, 4)}")
    percentage_difference = round((trial_json - trial_protobuf) / trial_json * 100, 4)
    print(f"% difference={percentage_difference}")

    book_json = serialize_to_json(book)
    trial_json = timeit.timeit("deserialize_from_json(book_json)", number=10**6, globals=dict(deserialize_from_json=deserialize_from_json, book_json=book_json))
    book_protobuf = serialize_to_protobuf(book)
    trial_protobuf = timeit.timeit("deserialize_from_protobuf(book_protobuf)", number=10**6, globals=dict(deserialize_from_protobuf=deserialize_from_protobuf, book_protobuf=book_protobuf))
    percentage_difference = round((trial_json - trial_protobuf) / trial_json * 100, 4)
    print("Runtime results for deserialization across 10^6 runs:")
    print(f"JSON={round(trial_json, 4)}")
    print(f"protobuf={round(trial_protobuf, 4)}")
    print(f"% difference={percentage_difference}")
```
On my laptop, I get the following results:
```
Runtime results for serialization across 10^6 runs:
JSON=1.4081
protobuf=0.6609
% difference=53.0643
Runtime results for deserialization across 10^6 runs:
JSON=1.1308
protobuf=0.3009
% difference=73.389
```
For 10^6 (one million) runs of these simple serialization/deserialization
functions, the results are:

| Activity | JSON (s) | Protobuf (s) | % Difference |
|---|---|---|---|
| Serialization | 1.4081 | 0.6609 | 53.0643 |
| Deserialization | 1.1308 | 0.3009 | 73.389 |
For serialization, protobufs are 53% faster, while for deserialization they’re 73.4% faster. The timings above are in seconds, so across a million runs we saved roughly 1.6 seconds by using protobufs instead of JSON, and that’s with nothing more than serializing and deserializing.
In a large application where the payload grows complex, this speed-up compounds. Additionally, note the field tags: on the wire, protobuf encodes these tags in place of the field names, so you also get a smaller payload (see the sketch below).
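A quick, hedged illustration of the size difference, reusing the generated `book_pb2` module and the same record as above (exact byte counts will vary with your data):

```python
import json

import book_pb2

book_dict = dict(isbn="9781857232097", title="The Fires of Heaven",
                 author="Robert Jordan", pagecount=912)

# JSON spells out every field name in the payload...
json_payload = json.dumps(book_dict).encode("utf-8")

# ...while protobuf encodes each field as a 1-byte tag plus its value.
proto_payload = book_pb2.Book(**book_dict).SerializeToString()

print(f"JSON bytes:     {len(json_payload)}")
print(f"protobuf bytes: {len(proto_payload)}")
```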
Additionally, the `json` library has no idea whether a field should be an int or a string. Protobufs make the types explicit and leave no ambiguity when creating the payload or when serializing or deserializing it.
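Here’s a minimal sketch of what that type enforcement looks like in practice (the exact error message may differ between protobuf runtime versions):

```python
import book_pb2

book = book_pb2.Book()
book.pagecount = 912        # fine: pagecount is declared as int32

try:
    book.pagecount = "912"  # a string is rejected at assignment time
except TypeError as err:
    print(f"rejected: {err}")
```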
I’ll follow up with Part 2, where I’ll discuss using protobufs with the usual suspect: gRPC.