Incremental Text Splitting
Have you ever wondered if there’s a way to chunk text data incrementally?
Data chunking is a process with which every software developer is familiar. Sometimes the data at hand is too extensive to be processed all at once. In such cases, we reduce it by slicing it into smaller pieces called chunks. Consider the following Python code:
from typing import List

def heavy_processing(lines: List[str]) -> List[int]:
    # it crashes on len(lines) > 1024
    ...

def process_all(lines: List[str]) -> List[int]:
    result = []
    chunk_size = 1024
    for i in range(0, len(lines), chunk_size):
        chunk = lines[i:i + chunk_size]
        result.extend(heavy_processing(chunk))
    return result
The intention of this code is to apply a heavy computation to an entire dataset. The algorithm cannot handle more than 1024 strings at once, so the natural solution is to split the input into chunks and merge the outputs into a single result. The approach works as expected, crunching the numbers.
Now, imagine a scenario where we decide to use this approach to process a source file with ChatGPT, summarizing the code into a few sentences. Since the bot cannot process a very long code snippet, the code must be divided into chunks.
from typing import List

def derive_description(lines: List[str]) -> str:
    # it can summarize code of up to 500 lines
    ...

def reconcile_results(descriptions: List[str]) -> str:
    # it merges the already generated partial descriptions into one summary
    ...

def process_all(lines: List[str]) -> str:
    descriptions = []
    chunk_size = 400
    for i in range(0, len(lines), chunk_size):
        chunk = lines[i:i + chunk_size]
        descriptions.append(derive_description(chunk))
    return reconcile_results(descriptions)
You may have noticed that both the derive_description and reconcile_results functions are very resource-intensive, so we want to call them as rarely as possible. This pushes us towards incremental processing.
If a user commits a new version of a file, the code has likely changed in only a few places. With fixed-size chunks, however, a single change at the beginning of the file shifts every following line, so one edit could potentially trigger the re-summarization of all chunks. Fixed-size chunking is clearly not suitable here.
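To illustrate the shifting problem, here is a small, self-contained demonstration (the helper name fixed_chunks is mine, just for this example): adding a single line at the very top of a file changes the hash of every fixed-size chunk.

    import hashlib
    from typing import List

    def fixed_chunks(lines: List[str], size: int) -> List[str]:
        # hash every fixed-size slice of the file
        return [
            hashlib.sha1("\n".join(lines[i:i + size]).encode()).hexdigest()
            for i in range(0, len(lines), size)
        ]

    v1 = [f"line {i}" for i in range(1200)]
    v2 = ["# one new line at the very top"] + v1

    before = fixed_chunks(v1, 400)
    after = fixed_chunks(v2, 400)

    # all three original chunk hashes change, although only one line was added
    print(sum(b != a for b, a in zip(before, after)))  # -> 3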
What if we made each line a potential boundary instead? Consider line number 6: if it's designated as a chunk boundary, it can remain one unless it is altered or duplicated elsewhere. This strategy creates a more flexible, content-aware chunking process.
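The article does not spell out how boundary lines are picked, so the sketch below is only an assumption: a common technique is content-defined chunking, where a line becomes a boundary whenever the hash of its content satisfies a cheap condition, with a hard cap on chunk size. The function name and parameters are hypothetical.

    import hashlib
    from typing import List, Tuple

    def split_content_aware(lines: List[str], max_bytes: int = 10_240) -> List[Tuple[int, int]]:
        # a line becomes a boundary when its hash satisfies a cheap condition;
        # the modulus controls the average number of lines per chunk
        chunks: List[Tuple[int, int]] = []
        start, size = 0, 0
        for i, line in enumerate(lines):
            size += len(line.encode()) + 1
            digest = hashlib.sha1(line.encode()).digest()
            if digest[-1] % 64 == 0 or size >= max_bytes:
                chunks.append((start, i + 1))
                start, size = i + 1, 0
        if start < len(lines):
            chunks.append((start, len(lines)))
        return chunks

Because the decision depends only on the boundary line's own content, inserting or deleting lines elsewhere does not move the boundaries that survive the edit.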
Now, let's include the content hash. If we generate and preserve the hashes of all chunks from the initial version, we end up with two distinct entries (hashes A and B below). When the second version of the file comes in, it contributes only one additional entry for the first chunk (hash C). We end up with three distinct entries.
# version 1
line 1 to line 6 with hash A
line 6 to EOF with hash B
# version 2
line 1 to line 6 with hash C
line 6 to EOF with hash B
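A minimal sketch of such a registry follows, under my own assumptions (hypothetical names; boundaries come from a line-based splitter like the one sketched earlier; hashes truncated to 19 characters): a new version of the file only produces unmatched rows for chunks whose content hash has not been seen before.

    import hashlib
    from typing import Dict, List, Tuple

    Entry = Tuple[str, str, str]  # (start-boundary, end-boundary, content-hash)

    def register_version(lines: List[str],
                         boundaries: List[Tuple[int, int]],
                         known: Dict[Entry, int],
                         timestamp: int) -> List[Tuple[str, str, str, int, bool]]:
        rows = []
        for start, end in boundaries:
            start_hash = hashlib.sha1(lines[start].encode()).hexdigest()[:19] if start > 0 else "0" * 19
            end_hash = hashlib.sha1(lines[end].encode()).hexdigest()[:19] if end < len(lines) else "f" * 19
            content_hash = hashlib.sha1("\n".join(lines[start:end]).encode()).hexdigest()[:19]
            entry: Entry = (start_hash, end_hash, content_hash)
            unmatched = entry not in known           # new content, needs a fresh summary
            if unmatched:
                known[entry] = timestamp             # remember when this chunk first appeared
            rows.append((start_hash, end_hash, content_hash, known[entry], unmatched))
        return rows

Only the unmatched rows would need a fresh call to derive_description; everything else can be served from a cache of previously generated summaries.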
Let's try something practical. I picked an arbitrary info.py from the poetry git repository; it contains 600+ lines. I simulated how such splitting would have affected the chunks if applied historically, with a constraint limiting the chunk size to 10 kB. Here is what I got.
# commit-sha timestamp unix
# chunk-start chunk-end content-hash timestamp unmatched?
# .......................................................
da9680b0faee435a9d4fb52ac4b11b7a18c48069 2022-02-28T04:12:05Z 1646021525
0000000000000000000 e991bfc73b7a385d943 758e117aa79f81dc5ae 1646021525 U
e991bfc73b7a385d943 e3a425bcee335da6ca5 1c28422e2e0aeeb0306 1646021525 U
e3a425bcee335da6ca5 fabe27e0789f753401e 16405ecce69e353731b 1646021525 U
fabe27e0789f753401e fffffffffffffffffff 47f7fbd5b63f5e13275 1646021525 U
e05ad9b147fadfb1e51991e8abd6493bb243f91f 2022-02-28T12:35:28Z 1646051728
0000000000000000000 e991bfc73b7a385d943 f1c7f7e8f4c45a42a62 1646051728 U
e991bfc73b7a385d943 e3a425bcee335da6ca5 1c28422e2e0aeeb0306 1646021525
e3a425bcee335da6ca5 fabe27e0789f753401e 16405ecce69e353731b 1646021525
fabe27e0789f753401e fffffffffffffffffff 47f7fbd5b63f5e13275 1646021525
d2474b634989ee19d45a88f70de6301fa975e0f0 2022-05-07T13:44:44Z 1651931084
0000000000000000000 e991bfc73b7a385d943 f1c7f7e8f4c45a42a62 1646051728
e991bfc73b7a385d943 e3a425bcee335da6ca5 1c28422e2e0aeeb0306 1646021525
e3a425bcee335da6ca5 fabe27e0789f753401e 16405ecce69e353731b 1646021525
fabe27e0789f753401e fffffffffffffffffff eb84c1ccce08fc7bb31 1651931084 U
48171301ef032fec1e3ec1ae5d40f71e45c680ea 2022-05-08T21:03:59Z 1652043839
0000000000000000000 e991bfc73b7a385d943 3e427ca380701a8d65f 1652043839 U
e991bfc73b7a385d943 e3a425bcee335da6ca5 1c28422e2e0aeeb0306 1646021525
e3a425bcee335da6ca5 fabe27e0789f753401e 16405ecce69e353731b 1646021525
fabe27e0789f753401e fffffffffffffffffff eb84c1ccce08fc7bb31 1651931084
We have some well-defined boundaries, which are mostly carried over from change to change. Including the timestamp makes it simpler to track chunks and enables reconstructing their order at any moment in time without having the entire file content. Unmatched chunks do not appear often, although it may happen that the entire file is unmatched.
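To make the reconstruction claim concrete, here is a hedged sketch (my assumption about the record layout, not the article's code): because each chunk's end boundary is the next chunk's start boundary, the order can be rebuilt by chaining records from the all-zeros marker to the all-f marker, taking at every hop the freshest record not newer than the requested moment.

    from typing import Dict, List, Tuple

    # (start-boundary, end-boundary) -> list of (timestamp, content-hash) records
    Index = Dict[Tuple[str, str], List[Tuple[int, str]]]

    def reconstruct(index: Index, at: int, width: int = 19) -> List[str]:
        # walk the boundary chain from the beginning-of-file marker
        # to the end-of-file marker, taking the freshest record per hop
        order: List[str] = []
        cursor = "0" * width
        while cursor != "f" * width:
            candidates = [
                (timestamp, end, content)
                for (start, end), records in index.items()
                if start == cursor
                for timestamp, content in records
                if timestamp <= at
            ]
            if not candidates:
                break  # the registry does not cover this moment in time
            timestamp, cursor, content = max(candidates)
            order.append(content)
        return order

Picking the freshest record is only a heuristic here; a real implementation would also need to know which boundary pairs belong to which version of the file.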
I also processed a CHANGELOG.md file in the same way. Such files are often changed in only one place. Let's see how the algorithm performs.
# commit-sha timestamp unix
# chunk-start chunk-end content-hash timestamp unmatched?
# .......................................................
5daeb89b0800dbc282d2b9ef1105fb20e7e712c2 2020-10-23T21:22:09Z 1603488129
0000000000000000000 6aa8d41992f4da978ac b13ddc6f827f2c3668a 1603488129 U
6aa8d41992f4da978ac f1f294eb8bdc9b41a60 99a332aaf1a30602fdc 1603488129 U
f1f294eb8bdc9b41a60 099eaefb34d6d0cc402 dd77d6daebf2ad1c6ba 1603488129 U
099eaefb34d6d0cc402 1385ae79ed21148b663 5f7d869f8564c962b38 1603488129 U
1385ae79ed21148b663 59c807d228dcb7eb699 34ddf1f1dc1c46f5a1e 1603488129 U
59c807d228dcb7eb699 a2ba03ce92e4908231e f02ac1fa8374b7d2dc4 1603488129 U
a2ba03ce92e4908231e 9f6f10e5d359cb22204 d2e74401062fc835e0d 1603488129 U
9f6f10e5d359cb22204 fffffffffffffffffff 79dc3aca6c0d4f5066a 1603488129 U
25771b4d27e6ba4926f52aea876a7d8dd35f7574 2021-02-10T00:01:41Z 1612915301
0000000000000000000 6aa8d41992f4da978ac b13ddc6f827f2c3668a 1603488129
6aa8d41992f4da978ac f1f294eb8bdc9b41a60 99a332aaf1a30602fdc 1603488129
f1f294eb8bdc9b41a60 099eaefb34d6d0cc402 dd77d6daebf2ad1c6ba 1603488129
099eaefb34d6d0cc402 1385ae79ed21148b663 5f7d869f8564c962b38 1603488129
1385ae79ed21148b663 59c807d228dcb7eb699 34ddf1f1dc1c46f5a1e 1603488129
59c807d228dcb7eb699 a2ba03ce92e4908231e f02ac1fa8374b7d2dc4 1603488129
a2ba03ce92e4908231e 9f6f10e5d359cb22204 d2e74401062fc835e0d 1603488129
9f6f10e5d359cb22204 fffffffffffffffffff da15e27abfe82d0b802 1612915301 U
f1b1cf9345837b916e15837d0258f1cc6a0b3f91 2021-04-06T17:12:29Z 1617729149
0000000000000000000 6aa8d41992f4da978ac ae7bd3fd2ae54203f92 1617729149 U
6aa8d41992f4da978ac f1f294eb8bdc9b41a60 99a332aaf1a30602fdc 1603488129
f1f294eb8bdc9b41a60 099eaefb34d6d0cc402 dd77d6daebf2ad1c6ba 1603488129
099eaefb34d6d0cc402 1385ae79ed21148b663 5f7d869f8564c962b38 1603488129
1385ae79ed21148b663 59c807d228dcb7eb699 34ddf1f1dc1c46f5a1e 1603488129
59c807d228dcb7eb699 a2ba03ce92e4908231e f02ac1fa8374b7d2dc4 1603488129
a2ba03ce92e4908231e 9f6f10e5d359cb22204 d2e74401062fc835e0d 1603488129
9f6f10e5d359cb22204 fffffffffffffffffff da15e27abfe82d0b802 1612915301
ea92ec14d4f999b5f3cbf07202840e606e2226f1 2021-05-21T13:13:31Z 1621602811
0000000000000000000 6aa8d41992f4da978ac aff3979e1379ae1e846 1621602811 U
6aa8d41992f4da978ac f1f294eb8bdc9b41a60 99a332aaf1a30602fdc 1603488129
f1f294eb8bdc9b41a60 099eaefb34d6d0cc402 dd77d6daebf2ad1c6ba 1603488129
099eaefb34d6d0cc402 1385ae79ed21148b663 5f7d869f8564c962b38 1603488129
1385ae79ed21148b663 59c807d228dcb7eb699 34ddf1f1dc1c46f5a1e 1603488129
59c807d228dcb7eb699 a2ba03ce92e4908231e f02ac1fa8374b7d2dc4 1603488129
a2ba03ce92e4908231e 9f6f10e5d359cb22204 d2e74401062fc835e0d 1603488129
9f6f10e5d359cb22204 fffffffffffffffffff 13abacfdbd5a886abb0 1621602811 U
6659a22e2a03ad13bae142b66c7c31aea68dd114 2021-07-31T22:17:47Z 1627769867
0000000000000000000 a5e6271250114dea762 6f7be4a6efea2d137c7 1627769867 U
a5e6271250114dea762 6aa8d41992f4da978ac 043cf1f5a2e817ceb98 1627769867 U
6aa8d41992f4da978ac f1f294eb8bdc9b41a60 99a332aaf1a30602fdc 1603488129
f1f294eb8bdc9b41a60 099eaefb34d6d0cc402 dd77d6daebf2ad1c6ba 1603488129
099eaefb34d6d0cc402 1385ae79ed21148b663 5f7d869f8564c962b38 1603488129
1385ae79ed21148b663 59c807d228dcb7eb699 34ddf1f1dc1c46f5a1e 1603488129
59c807d228dcb7eb699 a2ba03ce92e4908231e f02ac1fa8374b7d2dc4 1603488129
a2ba03ce92e4908231e 9f6f10e5d359cb22204 d2e74401062fc835e0d 1603488129
9f6f10e5d359cb22204 fffffffffffffffffff 4d6eb50f83fd85f3557 1627769867 U
The algorithm performs very well here: it identifies only a few affected chunks. It's worth noting that if the file grows, some existing chunks are split further into smaller pieces.
In this article I shared a way of splitting text files into smaller chunks. The presented idea (though not a fully detailed algorithm) is capable of identifying previously generated chunks in the current version of the file, thus reusing them as much as possible. The designed chunking system is also able to reconstruct the absolute order of chunks without looking at the file contents.
Are you interested in the code behind my results? You can find the detailed implementation of the algorithm here: https://github.com/amacal/learning-python/tree/text-chunking