## TLDR
- Protobufs are a mainstay of microservice development.
- You can use them in lieu of JSON when interacting with a web service.
- They’re much faster than JSON when you’re deserializing or serializing them.
- They’re designed mostly for applications that talk to each other.
- They’d be an excellent choice for MCP-centric applications as well.
## Introduction
I first heard about protobufs from a friend working at Gojek in 2017. I didn’t
know what they were used for, and even when I looked them up, I didn’t understand
why I needed them. JSON was good enough, wasn’t it?
Honestly, I’ve noticed this is a pattern (sample size > 10) among developers
who have mostly coded in Python. Protobufs were something that came out of Java
(preconception: mine), and they continued to be used by people from that world,
who perhaps went on to become Go developers.
I was wrong, and I’m glad that I discovered them when I did.
For those of you who are reading about protobufs for the first time, here’s the
short story.

> Protocol Buffers (protobuf) are a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Programmers define their data in a `.proto` file; the protobuf compiler then
generates code that, together with a language-specific runtime library, is used
to read and write that data. For example:
```proto
edition = "2023";

message Book {
  string isbn = 1;
  string title = 2;
  string author = 3;
  int32 pagecount = 4;
}
```
## Using a Protobuf in Python
Let’s take the above proto example and save it to `book.proto`.
ℹ️ Note: The numbers you see above are not default values. They’re field tags,
and they have meaning: tags in the range 1-15 take one byte to encode, so use
them for frequently-used fields. Avoid reusing or overriding old tags by marking
retired ones with the `reserved` keyword, as in the sketch below.
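A minimal sketch of what that looks like, assuming a hypothetical field that once used tag 5 and has since been removed:

```proto
edition = "2023";

message Book {
  // Tag 5 once belonged to a now-removed field; reserving it ensures
  // nobody accidentally reuses the number with a different meaning.
  reserved 5;

  string isbn = 1;
  string title = 2;
  string author = 3;
  int32 pagecount = 4;
}
```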
Ensure that you’ve installed `grpcio-tools` using `uv add` or `pip install`.
Then generate the Python code for the protobuf by running
`python -m grpc_tools.protoc -I. --python_out=. book.proto`.
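Put together, the two steps look something like this (assuming a uv-managed project; plain `pip install grpcio-tools` works just as well):

```bash
# install the protoc wrapper and the protobuf runtime
uv add grpcio-tools

# compile book.proto into book_pb2.py in the current directory
python -m grpc_tools.protoc -I. --python_out=. book.proto
```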
This should have created the file `book_pb2.py` in the current directory. The
file should look like this:
```python
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# NO CHECKED-IN PROTOBUF GENCODE
# source: book.proto
# Protobuf Python Version: 5.29.0
"""Generated protocol buffer code."""
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import runtime_version as _runtime_version
from google.protobuf import symbol_database as _symbol_database
from google.protobuf.internal import builder as _builder
_runtime_version.ValidateProtobufRuntimeVersion(
    _runtime_version.Domain.PUBLIC,
    5,
    29,
    0,
    '',
    'book.proto'
)
# @@protoc_insertion_point(imports)

_sym_db = _symbol_database.Default()


DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\nbook.proto\"F\n\x04\x42ook\x12\x0c\n\x04isbn\x18\x01 \x01(\t\x12\r\n\x05title\x18\x02 \x01(\t\x12\x0e\n\x06\x61uthor\x18\x03 \x01(\t\x12\x11\n\tpagecount\x18\x04 \x01(\x05\x62\x08\x65\x64itionsp\xe8\x07')

_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'book_pb2', _globals)
if not _descriptor._USE_C_DESCRIPTORS:
  DESCRIPTOR._loaded_options = None
  _globals['_BOOK']._serialized_start=14
  _globals['_BOOK']._serialized_end=84
# @@protoc_insertion_point(module_scope)
```
`protoc`, the protobuf compiler, can generate this for any supported language
(the Python variant via `grpcio-tools` naturally emits Python syntax).
As the docstring at the top tells you, you should not edit this file.
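As a hedged aside, if you have the standalone `protoc` binary installed, the same `.proto` file can be compiled for other languages simply by swapping the output flag:

```bash
# assumes the standalone protoc compiler is on your PATH
protoc --java_out=. book.proto   # Java classes
protoc --cpp_out=. book.proto    # C++ header and source
```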
How do you use it?
```python
import book_pb2

book = book_pb2.Book(
    isbn="9781857232097",
    title="The Fires of Heaven",
    author="Robert Jordan",
    pagecount=912,
)

# Serialize into a compact bytes payload; you can write this to a file if you want.
serialized_book = book.SerializeToString()

# You can deserialize this back into a Book object if required.
book = book_pb2.Book()
book.ParseFromString(serialized_book)
print(f"{book.title} by {book.author}")
```
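Since the serialized payload is just bytes, persisting it to disk is straightforward. A small sketch (the `book.bin` filename is only an example):

```python
import book_pb2

book = book_pb2.Book(isbn="9781857232097", title="The Fires of Heaven",
                     author="Robert Jordan", pagecount=912)

# Write the serialized bytes to a file...
with open("book.bin", "wb") as f:
    f.write(book.SerializeToString())

# ...and read them back into a fresh Book object later.
with open("book.bin", "rb") as f:
    restored = book_pb2.Book()
    restored.ParseFromString(f.read())

print(f"{restored.title} by {restored.author}")
```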
I used to wonder why this was really all that useful, until I looked at the
performance numbers. To measure them, we’ll use the `timeit` module.
```python
import json
import timeit

import book_pb2


def serialize_to_json(book):
    return json.dumps(book)


def serialize_to_protobuf(book):
    book = book_pb2.Book(isbn=book["isbn"], title=book["title"], author=book["author"], pagecount=book["pagecount"])
    return book.SerializeToString()


def deserialize_from_json(book: str):
    return json.loads(book)


def deserialize_from_protobuf(book: bytes):
    book_obj = book_pb2.Book()
    book_obj.ParseFromString(book)
    return book_obj


if __name__ == "__main__":
    book = dict(
        isbn="9781857232097",
        title="The Fires of Heaven",
        author="Robert Jordan",
        pagecount=912,
    )

    trial_json = timeit.timeit("serialize_to_json(book)", number=10**6, globals=dict(serialize_to_json=serialize_to_json, book=book))
    trial_protobuf = timeit.timeit("serialize_to_protobuf(book)", number=10**6, globals=dict(serialize_to_protobuf=serialize_to_protobuf, book=book))
    print("Runtime results for serialization across 10^6 runs:")
    print(f"JSON={round(trial_json, 4)}")
    print(f"protobuf={round(trial_protobuf, 4)}")
    percentage_difference = round((trial_json - trial_protobuf) / trial_json * 100, 4)
    print(f"% difference={percentage_difference}")

    book_json = serialize_to_json(book)
    trial_json = timeit.timeit("deserialize_from_json(book_json)", number=10**6, globals=dict(deserialize_from_json=deserialize_from_json, book_json=book_json))
    book_protobuf = serialize_to_protobuf(book)
    trial_protobuf = timeit.timeit("deserialize_from_protobuf(book_protobuf)", number=10**6, globals=dict(deserialize_from_protobuf=deserialize_from_protobuf, book_protobuf=book_protobuf))
    percentage_difference = round((trial_json - trial_protobuf) / trial_json * 100, 4)
    print("Runtime results for deserialization across 10^6 runs:")
    print(f"JSON={round(trial_json, 4)}")
    print(f"protobuf={round(trial_protobuf, 4)}")
    print(f"% difference={percentage_difference}")
```
On my laptop, I get the following results:
```
Runtime results for serialization across 10^6 runs:
JSON=1.4081
protobuf=0.6609
% difference=53.0643
Runtime results for deserialization across 10^6 runs:
JSON=1.1308
protobuf=0.3009
% difference=73.389
```
For 10^6 (one million) runs of these simple serialization/deserialization
functions, the results are:

| Activity | JSON (s) | Protobuf (s) | % Difference |
|---|---|---|---|
| Serialization | 1.4081 | 0.6609 | 53.0643 |
| Deserialization | 1.1308 | 0.3009 | 73.389 |
For serialization, protobufs are 53% faster, while for deserialization they’re 73.4% faster. The timings above are in seconds, so across a million runs we saved roughly 1.6 seconds by using protobufs instead of JSON, and that’s with nothing more than serializing and deserializing.
In a large application where the payload grows complex, this speed-up compounds. Additionally, note the field tags: on the wire, protobuf encodes these tags in place of the field names, so you also get a smaller payload (see the sketch below).
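A quick, hedged illustration of the size difference, reusing the generated `book_pb2` module and the same record as above (exact byte counts will vary with your data):

```python
import json

import book_pb2

book_dict = dict(isbn="9781857232097", title="The Fires of Heaven",
                 author="Robert Jordan", pagecount=912)

# JSON spells out every field name in the payload...
json_payload = json.dumps(book_dict).encode("utf-8")

# ...while protobuf encodes each field as a 1-byte tag plus its value.
proto_payload = book_pb2.Book(**book_dict).SerializeToString()

print(f"JSON bytes:     {len(json_payload)}")
print(f"protobuf bytes: {len(proto_payload)}")
```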
Additionally, the `json` library has no idea whether a field should be an int or a string. Protobufs make the types explicit and leave no ambiguity when creating the payload or when serializing or deserializing it.
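Here’s a minimal sketch of what that type enforcement looks like in practice (the exact error message may differ between protobuf runtime versions):

```python
import book_pb2

book = book_pb2.Book()
book.pagecount = 912        # fine: pagecount is declared as int32

try:
    book.pagecount = "912"  # a string is rejected at assignment time
except TypeError as err:
    print(f"rejected: {err}")
```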
I’ll follow up with Part 2, where I’ll discuss using protobufs with the usual suspect: gRPC.