diff --git a/docs/concepts/faq.md b/docs/concepts/faq.md index 3e9429b1f..646f017d1 100644 --- a/docs/concepts/faq.md +++ b/docs/concepts/faq.md @@ -26,6 +26,9 @@ The quickest way to get IPFS up and running on your machine is by installing [IP For installing and initializing IPFS from the command line, check out the [command-line quick start](../how-to/command-line-quick-start.md) guide. +### Why doesn't my SHA hash match my CID? +When you add a file to IPFS, IPFS splits it into smaller blocks and hashes each block individually, so the hash inside the resulting CID differs from the SHA hash of the whole file. IPFS links these blocks in a Merkle DAG, which is self-verifying. See [Merkle Directed Acyclic Graphs (DAGs)](../concepts/merkle-dag.md). + ## Contributing to IPFS ### How do I start contributing to IPFS? diff --git a/docs/concepts/hashing.md b/docs/concepts/hashing.md index 9c53038b6..fc6f10c6f 100644 --- a/docs/concepts/hashing.md +++ b/docs/concepts/hashing.md @@ -6,10 +6,6 @@ description: Learn about cryptographic hashes and why they're critical to how IP # Hashing -::: tip -If you're interested in how cryptographic hashes fit into how IPFS works with files in general, check out this video from IPFS Camp 2019! [Core Course: How IPFS Deals With Files](https://www.youtube.com/watch?v=Z5zNPwMDYGg) -::: - Cryptographic hashes are functions that take some arbitrary input and return a fixed-length value. The particular value depends on the given hash algorithm in use, such as [SHA-1](https://en.wikipedia.org/wiki/SHA-1) (used by git), [SHA-256](https://en.wikipedia.org/wiki/SHA-2), or [BLAKE2](), but a given hash algorithm always returns the same value for a given input. Have a look at Wikipedia's [full list of hash functions](https://en.wikipedia.org/wiki/List_of_hash_functions) for more. 
As an example, the input: @@ -40,7 +36,11 @@ For example, the SHA-256 hash of "Hello world" from above can be represented as mtwirsqawjuoloq2gvtyug2tc3jbf5htm2zeo4rsknfiv3fdp46a ``` -## Hashes are important +::: tip +If you're interested in how cryptographic hashes fit into how IPFS works with files in general, check out this video from IPFS Camp 2019! [Core Course: How IPFS Deals With Files](https://www.youtube.com/watch?v=Z5zNPwMDYGg) +::: + +## Important hash characteristics Cryptographic hashes come with a couple of very important characteristics: @@ -49,15 +49,17 @@ Cryptographic hashes come with a couple of very important characteristics: - **unique** - it's infeasible to generate the same hash from two different messages - **one-way** - it's infeasible to guess or calculate the input message from its hash -These features also mean we can use a cryptographic hash to identify any piece of data: the hash is unique to the data we calculated it from and it's not too long so sending it around the network doesn't take up a lot of resource. A hash is a fixed length, so the SHA-256 hash of a one-gigabyte video file is still only 32 bytes. +These features also mean we can use a cryptographic hash to identify any piece of data: the hash is unique to the data we calculated it from and it's not too long so sending it around the network doesn't take up a lot of resource. A hash is a fixed length, so the SHA-256 hash of a one-gigabyte video file is still only 32 bytes. -That's critical for a distributed system like IPFS, where we want to be able to store and retrieve data from many places. A computer running IPFS can ask all the peers it's connected to whether they have a file with a particular hash and, if one of them does, they send back the whole file. Without a short, unique identifier like a cryptographic hash, that wouldn't be possible. 
This technique is called [content addressing](content-addressing.md) — because the content itself is used to form an address, rather than information about the computer and disk location it's stored at. +That's critical for a distributed system like IPFS, where we want to be able to store and retrieve data from many places. A computer running IPFS can ask all the peers it's connected to whether they have a file with a particular hash and, if one of them does, they send back the whole file. Without a short, unique identifier like a cryptographic hash, this kind of [content addressing](content-addressing.md) wouldn't be possible. -## Content identifiers are not file hashes +## Example: Content Identifiers are not file hashes -Hash functions are widely used as to check for file integrity. A download provider may publish the output of a hash function for a file, often called a _checksum_. The checksum enables users to verify that a file has not been altered since it was published. This check is done by performing the same hash function against the downloaded file that was used to generate the checksum. If that checksum that the user receives from the downloaded file exactly matches the checksum on the website, then the user knows that the file was not altered and can be trusted. +Hash functions are widely used to check for file integrity. Because IPFS splits content into blocks and verifies them through [directed acyclic graphs (DAGs)](../concepts/merkle-dag.md), SHA file hashes won't match CIDs. Here's an example of what will happen if you try to do that. -Let us look at a concrete example. When you download an image file for [Ubuntu Linux](https://ubuntu.com/) you might see the following `SHA-256` checksum on the Ubuntu website listed for verification purposes: +A download provider may publish the output of a hash function for a file, often called a _checksum_. The checksum enables users to verify that a file has not been altered since it was published. 
This check is done by performing the same hash function against the downloaded file that was used to generate the checksum. If that checksum that the user receives from the downloaded file exactly matches the checksum on the website, then the user knows that the file was not altered and can be trusted. + +Let's look at a concrete example. When you download an image file for [Ubuntu Linux](https://ubuntu.com/) you might see the following `SHA-256` checksum on the Ubuntu website listed for verification purposes: ``` 0xB45165ED3CD437B9FFAD02A2AAD22A4DDC69162470E2622982889CE5826F6E3D ubuntu-20.04.1-desktop-amd64.iso @@ -80,7 +82,7 @@ added QmPK1s3pNYLi9ERiq3BDxKa4XosgWwFRQUydHUtz4YgpqB ubuntu-20.04.1-desktop-amd6 2.59 GiB / 2.59 GiB [==========================================================================================] 100.00% ``` -The string `QmPK1s3pNYLi9ERiq3BDxKa4XosgWwFRQUydHUtz4YgpqB` returned by the `ipfs add` command is the content identifier (CID) of the file `ubuntu-20.04.1-desktop-amd64.iso`. We can utilize the [CID Inspector](https://cid.ipfs.io/) to see what the CID includes. The actual hash is listed under `DIGEST (HEX)`: +The string `QmPK1s3pNYLi9ERiq3BDxKa4XosgWwFRQUydHUtz4YgpqB` returned by the `ipfs add` command is the content identifier (CID) of the file `ubuntu-20.04.1-desktop-amd64.iso`. We can use the [CID Inspector](https://cid.ipfs.io/) to see what the CID includes. The actual hash is listed under `DIGEST (HEX)`: ``` NAME: sha2-256 @@ -101,4 +103,6 @@ ubuntu-20.04.1-desktop-amd64.iso: FAILED shasum: WARNING: 1 computed checksum did NOT match ``` -As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`. To understand what the hash contained in the CID is, we must understand how IPFS stores files. IPFS uses a [directed acyclic graph (DAG)](merkle-dag.md) to keep track of all the data stored in IPFS. A CID identifies one specific node in this graph. 
This identifier is the result of hashing the node's contents using a cryptographic hash function like `SHA256`. +As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`. + +To understand what the hash contained in the CID is, we must understand how IPFS stores files. IPFS uses a [directed acyclic graph (DAG)](merkle-dag.md) to keep track of all the data stored in IPFS. A CID identifies one specific node in this graph. This identifier is the result of hashing the node's contents using a cryptographic hash function like `SHA256`. diff --git a/docs/concepts/how-ipfs-works.md b/docs/concepts/how-ipfs-works.md index 02d370cb1..3326e2523 100644 --- a/docs/concepts/how-ipfs-works.md +++ b/docs/concepts/how-ipfs-works.md @@ -36,11 +36,13 @@ Many distributed systems make use of content addressing through hashes as a mean This is where the [Interplanetary Linked Data (IPLD) project](https://ipld.io/) comes in. IPLD translates between hash-linked data structures allowing for the unification of the data across distributed systems. IPLD provides libraries for combining pluggable modules (parsers for each possible type of IPLD node) to resolve a path, selector, or query across many linked nodes, allowing you to explore data regardless of the underlying protocol. IPLD provides a way to translate between content-addressable data structures: _"Oh, you use Git-style, no worries, I can follow those links. Oh, you use Ethereum, I got you, I can follow those links too!"_ -IPFS follows particular data-structure preferences and conventions. The IPFS protocol uses those conventions and IPLD to get from raw content to an IPFS address that uniquely identifies content on the IPFS network. The next section explores how links between content are embedded within that content address through a DAG data structure. 
+IPFS follows particular data-structure preferences and conventions. The IPFS protocol uses those conventions and IPLD to get from raw content to an IPFS address that uniquely identifies content on the IPFS network. + +The next section explores how links between content are embedded within that content address through a DAG data structure. ## Directed acyclic graphs (DAGs) -IPFS and many other distributed systems take advantage of a data structure called [directed acyclic graphs](https://en.wikipedia.org/wiki/Directed_acyclic_graph), or DAGs. Specifically, they use _Merkle DAGs_, which are DAGs where each node has a unique identifier that is a hash of the node's contents. Sound familiar? This refers back to the _CID_ concept that we covered in the previous section. Put another way: identifying a data object (like a Merkle DAG node) by the value of its hash _is content addressing_. Check out our [guide to Merkle DAGs](merkle-dag.md) for a more in-depth treatment of this topic. +IPFS and many other distributed systems take advantage of a data structure called [directed acyclic graphs](https://en.wikipedia.org/wiki/Directed_acyclic_graph), or DAGs. Specifically, they use _Merkle DAGs_, where each node has a unique identifier that is a hash of the node's contents. Sound familiar? This refers back to the _CID_ concept that we covered in the previous section. Put another way: identifying a data object (like a Merkle DAG node) by the value of its hash _is content addressing_. Check out our [guide to Merkle DAGs](merkle-dag.md) for a more in-depth treatment of this topic. IPFS uses a Merkle DAG that is optimized for representing directories and files, but you can structure a Merkle DAG in many different ways. For example, Git uses a Merkle DAG that has many versions of your repo inside of it. 
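To illustrate why hashing a file block by block gives a different result than hashing the whole file, here is a minimal Python sketch. It is a simplified two-level stand-in and not the real IPFS chunker, UnixFS encoding, or CID format: the 4-byte `CHUNK_SIZE`, the `toy_merkle_root` helper, and the sample content are all hypothetical and chosen only to keep the example small.

```python
import hashlib

# Hypothetical, tiny block size so that a short input splits into several blocks.
CHUNK_SIZE = 4


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def split_into_blocks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_merkle_root(data: bytes, size: int = CHUNK_SIZE) -> str:
    """Hash each block, then hash the concatenated block digests.

    This is an assumption-laden toy, NOT the real IPFS encoding: it only
    shows that the root identifier is derived from block hashes, not from
    the raw file bytes.
    """
    block_digests = [hashlib.sha256(block).digest()
                     for block in split_into_blocks(data, size)]
    return sha256_hex(b"".join(block_digests))


content = b"Hello world, this is a file"
whole_file_hash = sha256_hex(content)   # what a checksum tool would report
root_hash = toy_merkle_root(content)    # what a block-based scheme would report

print("whole file:", whole_file_hash)
print("toy root:  ", root_hash)
print("match:     ", whole_file_hash == root_hash)
```

Both values are deterministic for the same input and block size, but they are computed over different bytes, so comparing one against the other says nothing about whether the file was altered.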
@@ -68,9 +70,17 @@ You've discovered your content, and you've found the current location(s) of that There are [other content replication protocols under discussion](https://github.com/ipfs/camp/blob/master/DEEP_DIVES/24-replication-protocol.md) as well, the most developed of which is [_Graphsync_](https://github.com/ipld/specs/blob/master/block-layer/graphsync/graphsync.md). There's also a proposal under discussion to [extend the Bitswap protocol](https://github.com/ipfs/go-bitswap/issues/186) to add functionality around requests and responses. +## SHA file hashes won't match Content IDs + +You may be used to verifying the integrity of a file by matching SHA hashes, but SHA hashes won't match CIDs. Because IPFS splits a file into blocks, each block has its own CID, and any parent nodes that link the blocks together have CIDs of their own. + +The DAG keeps track of all the content stored in IPFS as blocks, not files, and Merkle DAGs are self-verifying structures. To learn more about DAGs, see [directed acyclic graph (DAG)](../concepts/merkle-dag.md). + +For a detailed example of what happens when you try to compare SHA hashes with CIDs, see [Content Identifiers are not file hashes](../concepts/hashing/#example-content-identifiers-are-not-file-hashes). + ### Libp2p -What makes libp2p especially useful for peer to peer connections is _connection multiplexing_. Traditionally, every service in a system opens a different connection to communicate with other services of the same kind remotely. +What makes libp2p especially useful for peer-to-peer connections is _connection multiplexing_. Traditionally, every service in a system opens a different connection to communicate with other services of the same kind remotely. 
Using IPFS, you open just one connection, and you multiplex everything on that. For everything your peers need to talk to each other about, you send a little bit of each thing, and the other end knows how to sort those chunks where they belong. This is useful because establishing connections is usually hard to set up and expensive to maintain. With multiplexing, once you have that connection, you can do whatever you need on it.
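The idea of sending "a little bit of each thing" over one connection can be sketched in Python as follows. This is only a conceptual illustration, not libp2p's actual multiplexer or wire format: the `multiplex` and `demultiplex` helpers, the 3-byte chunk size, and the stream labels `dht` and `bitswap` are hypothetical names picked for the example.

```python
from collections import defaultdict
from itertools import zip_longest


def chunk(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def multiplex(streams: dict[str, bytes], size: int = 3) -> list[tuple[str, bytes]]:
    """Interleave chunks from several logical streams into one frame sequence.

    Each frame is tagged with the ID of the stream it belongs to, so a single
    connection can carry all of them.
    """
    per_stream = [[(stream_id, piece) for piece in chunk(data, size)]
                  for stream_id, data in streams.items()]
    frames = []
    # Take one frame from each stream per round; shorter streams run out first.
    for round_frames in zip_longest(*per_stream):
        frames.extend(frame for frame in round_frames if frame is not None)
    return frames


def demultiplex(frames: list[tuple[str, bytes]]) -> dict[str, bytes]:
    """Sort frames back into their streams using the stream ID tag."""
    reassembled = defaultdict(bytes)
    for stream_id, payload in frames:
        reassembled[stream_id] += payload
    return dict(reassembled)


# Two hypothetical services sharing one connection.
streams = {"dht": b"find-peer", "bitswap": b"want-block-42"}
frames = multiplex(streams)
print(frames)
print(demultiplex(frames) == streams)
```

The receiving side needs only the stream tag on each frame to put the pieces back where they belong, which is why one established connection can serve every protocol the peers speak.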