Merkle tree is a classic data structure for checking data integrity, this project aim to build tree and simulate the challenge-respond between client and server.
Let imagine a scenery, you upload a 2GB file A
to you web server and the file is stroed block by block (a1, a2, ...). Before you upload A
, you stroe a root node
of the merkle tree which is caculated by those blocks. After you upload A
, you delete file A
in your computer. After a period, you want to make sure if those blocks are integral. Then, you download one of the blocks from sever (to be simplified, suppose you download a1), at the sometime the server returns you a authentication path
. You get a value caculated by a1 and authentication path
, after that, you check in if root node
is equal to the caculated value, if ture, you get a answer 'Server is good, the data is integrity.', else you wonder 'On, no... the data is not integrity!'.
Suppose we split file A
into 8 equal pieces, this is block1, block2, ..., block8. We use sha256
to get the blcok hash tags corresponding to X1, X2, ..., X8 in pic1. We class nodes as follows:
- leaf node: X1, X2, ..., X8;
- middle node (a.k.a none-leaf node): A1, A2, ..., A14;
- roof node: A15.
In merkle_tree.py, the Node
class is implemented. Each node has five properties:
- node_id: to identify node;
- value: each node maintain a value which is bonded with its child nodes;
- parent_node: let's take A10 for example, its parent node is A13;
- left_node: Merkle Tree is a sequence structure, A9 is the left child of A13;
- right_node: similar to above, A10 is the right child of A13.
In Merkle Tree, the parent node's value is bonded with its child's values. To be simplified, I choose sha512:
parent_node.value = sha512(right_child.value + left_child.vlaue)
However, before we generate a Merkle Tree, how to get tags X1, X2, ..., X8? As we mentioned above, Xi is calculated by blocki using sha256 where i ∈ (1, 2, ..., 8).
Given a specified block size, we seperate a block into two segments. One records the length of valid bytes, the other stores the valid bytes. For the last block, its size maybe less than specified size, thus we append adequate $
as follows:
def pad_block(block, block_size):
size = len(block)
regular_block = size.to_bytes(4, byteorder='big')
regular_block += block
while len(regular_block) % block_size != 0:
regular_block += b'$' * (block_size - len(regular_block))
return regular_block
Typically, we declare the authentication as challenge-responde protocal. There are 3 phrases (let's take block1 for example):
- The client picks up an arbitrary block number 1 and downloads block1 from server. Then calculates tag X1 corresponding to block1 and send block number 1 to server.
- The server receives the block number 1, it generates a merkle tree over related file. Send authentication path (A2, A10, A14) to client.
- After The client receices the anthentication path, it calculates merkle tree root by tag X1 and authentication path. Next, it retrives local merkle tree root, then compares those root. If calculated merkle root is equal to local merkle tree, we could make sure the file in the server is integrity. Otherwise, our file in the server is compromised.