Iterate on the new verification_jobs table #1826

Closed · Tracked by #1367

marcocastignoli opened this issue Jan 6, 2025 · 17 comments

Iterate on the new table definition, making sure that it contains all the needed information. In particular, focus on the foreign key verification_id in the case of a successful verification, and decide what information to include in the case of a failed verification:

  • jobId
  • endpoint: /v2/verify/etherscan/
  • chainId
  • address
  • status
  • server hardware
  • compilation time
  • errors (compiler failures)
  • createdAt
  • finishedAt (success or fail)
  • when successful: verification-id
  • Do we want to store information when the compilation succeeds but the match fails?
    • onchain bytecodes
    • recompiled bytecodes
manuelwedler (Contributor) commented Jan 30, 2025

I mostly agree with the above fields. Errors need multiple fields due to the API design in Stoplight. I propose using exactly these names for the columns:

Table: verification_jobs

Needed for the API response:

  • id: bigserial, primary key, incremental
  • status: varchar, not null
  • started_at: timestamptz, not null
  • completed_at: timestamptz
  • chain_id: bigint, not null
  • contract_address: bytea, not null
  • verified_contract_id: bigserial, foreign key to verified_contracts.id
  • error_code: varchar
  • error_message: varchar
  • error_id: uuid

These are only needed for stats:

  • verification_endpoint: varchar, not null
  • server_hardware: jsonb, not null
  • compilation_time: interval

Regarding server_hardware, I am not sure if we can get helpful information for it, since we run it in Cloud Run and CPU and memory allocations are just limits. We can try to use the Node.js functions for those (os.cpus(), os.totalmem(), ...). Let's see what they return under this environment. I think it makes sense to store this information for each verification job in the following json format:

{
  "cpu": string,
  "memory": string
}

Anything else that should be added to the server_hardware?

Additional information?

Do we want to store information when the compilation succeeds but the match fails?

  • onchain bytecodes
  • recompiled bytecodes

In my opinion, we should not store more information than necessary, because I don't want to blow up the database. The error_code together with the error_message should be enough to tell the user what went wrong. If we want to debug something, we should be able to find the corresponding logs using the error_id.

kuzdogan (Member) commented:

Small comment on chain_id, I think it should be bigInt instead of numeric, see verifier-alliance/database-specs#23

kuzdogan (Member) commented Jan 31, 2025

Additionally, this is what Claude recommended. I don't know how useful it is, but I see we can go beyond the minimal CPU + memory info:

const os = require('os');
const v8 = require('v8');

const getHardwareInfo = () => {
  const memoryUsage = process.memoryUsage();
  const heapStats = v8.getHeapStatistics();
  
  return {
    cpu: {
      model: os.cpus()[0].model,
      cores: os.cpus().length,
      speed: os.cpus()[0].speed,
      allocated_cores: process.env.CPU_LIMIT || os.cpus().length,
      cpu_quota: process.resourceUsage().maxCPU || null
    },
    memory: {
      total_allocated: process.env.MEMORY_LIMIT || os.totalmem(),
      available: os.freemem(),
      process_usage: {
        heapTotal: memoryUsage.heapTotal,
        heapUsed: memoryUsage.heapUsed,
        external: memoryUsage.external,
        rss: memoryUsage.rss
      }
    },
    environment: {
      type: process.env.K_SERVICE ? 'cloud-run' : 'local',
      node_version: process.version,
      container_id: process.env.HOSTNAME || null
    }
  };
};

Edit: When I asked for only the essential info, Claude gave me this:

const os = require('os');

const getEssentialHardwareInfo = () => {
  return {
    cpu: {
      allocated_cores: process.env.CPU_LIMIT || os.cpus().length,
      model: os.cpus()[0].model
    },
    memory: {
      allocated_mb: Math.floor((process.env.MEMORY_LIMIT || os.totalmem()) / (1024 * 1024))
    },
    environment: {
      type: process.env.K_SERVICE ? 'cloud-run' : 'local'
    }
  };
};

kuzdogan (Member) commented:

Regarding storing the bytecodes, I'd really like to show users a nice UI in the future for how these bytecodes differ for failed jobs. Like this:

[Image: screenshot of a bytecode diff view]

We often find ourselves debugging people's failed verifications, and it would be useful to have this information without reproducing everything locally; even better if we can convey it to the user so they don't have to contact us at all. IMO this is an essential part of building an overall better contract verification experience.

However, this is not a requirement from my side and can be added later by extending the table. We could even add another table like failed_verification_bytecodes that only retains the failed bytecodes of, e.g., the last 2 weeks, so that the database won't blow up. The bytecodes are mostly useful immediately after a failure to see what went wrong, so we can keep the verification_jobs table minimal and store it longer term.
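
As a rough illustration of that retention idea (a sketch only: the failed_verification_bytecodes table name comes from this comment, while its created_at column and the exact window are assumptions, not an agreed design), the periodic cleanup could be a single statement:

-- Sketch: drop ephemeral bytecodes older than two weeks.
-- created_at is a hypothetical column of the hypothetical failed_verification_bytecodes table.
DELETE FROM failed_verification_bytecodes
WHERE created_at < now() - interval '14 days';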

manuelwedler (Contributor) commented:

Small comment on chain_id, I think it should be bigInt instead of numeric, see verifier-alliance/database-specs#23

Updated my proposal regarding this.

Additionally, this is what Claude recommended. I don't know how useful it is, but I see we can go beyond the minimal CPU + memory info:

I would do a mixture of the extended and the minimal information. Not everything in the extended version seems useful to me. How about this?

const os = require('os');

const getHardwareInfo = () => {
  return {
    cpu: {
      allocated_cores: process.env.CPU_LIMIT || os.cpus().length,
      model: os.cpus()[0].model,
      speed: os.cpus()[0].speed
    },
    memory: {
      total_allocated: process.env.MEMORY_LIMIT || os.totalmem()
    },
    environment: {
      type: process.env.K_SERVICE ? 'cloud-run' : 'local',
      node_version: process.version,
      cloud_run_revision: process.env.K_REVISION
    }
  };
};

Where do these environment variables actually come from (CPU_LIMIT and MEMORY_LIMIT)? Not sure if they really exist.

The other question here is, do we really want to store this information for every job? If you exclude the current resource usage, this will be a lot of redundant information. We could store this once per revision in a separate table and just reference it by some identifier.

Regarding storing the bytecodes, I'd really like to show users a nice UI in the future for how these bytecodes differ for failed jobs. Like this:

Alright, makes sense to me. From my point of view, we can add it now. But it should reference entries in the code table. How about adding these to the verification_jobs table?

  • onchain_creation_code_hash: bytea, foreign key to code
  • onchain_runtime_code_hash: bytea, foreign key to code
  • recompiled_creation_code_hash: bytea, foreign key to code
  • recompiled_runtime_code_hash: bytea, foreign key to code

Now, I also wonder if we need to store the error_message in the database. The server could determine the message on-the-fly based on the error_code and the stored bytecodes.

kuzdogan (Member) commented Feb 3, 2025

Yes, we can remove the error_message.

I also agree about the redundancy of the server_hardware. Also, in the server_hardware I see the hardware info but no information about how many resources are actually used, right? Memory and CPU would be the same across all verification jobs?

And regarding storing the bytecodes in the code table, I'm not sure about that. If we do, wouldn't it be difficult when we delete the verification_jobs entry? When we delete verification_jobs that are e.g. 30 days old, we need to delete the code entry too, and we need to make sure we are not deleting a code row that is referenced by a verified contract. Also, categorically, the bytecodes in code are different from those coming from failed jobs: code holds verified contract bytecodes, so they are "useful", while the bytecodes from failed jobs are ephemeral and "not useful", so maybe it makes sense to separate them into different tables. Also keep in mind we'll only save bytecodes for failed jobs, so most verification_jobs rows will not have bytecodes.

manuelwedler (Contributor) commented:

Right, they should only be stored for failed jobs.

I expected the verification_jobs table never to be deleted. It's hard to determine a good time frame for deletion. Also consider this scenario: someone verifies a contract through the Remix plugin and closes the browser immediately after, then opens Remix again only after 2 months. The verification would show as failed in the IDE, because we deleted the information, even though the verification actually succeeded.

I also agree about the redundancy of the server_hardware. Also, in the server_hardware I see the hardware info but no information about how many resources are actually used, right? Memory and CPU would be the same across all verification jobs?

If we don't change the Cloud Run service, I would expect this, yes. In my proposal, I excluded the resource usage for this reason.

manuelwedler (Contributor) commented:

So maybe our ideas for the table were a bit different here. I tried to keep it minimal because I would like to keep the data forever.

What if we separate the table into two: One for the minimal information of the job that should be kept forever, and one with additional information that is used for debugging and statistics (maybe bytecodes and server_hardware) and can be deleted from time to time.

kuzdogan (Member) commented Feb 3, 2025

The verification would show as failed in the IDE, because we deleted the information, even though the verification actually succeeded.

This is rather an edge case and I don't think we need to cover it. It's reasonable that users can't find out why their job failed after 30-60 days. But if their verification went through, they can always check with chainId + address.

What if we separate the table into two: One for the minimal information of the job that should be kept forever, and one with additional information that is used for debugging and statistics (maybe bytecodes and server_hardware) and can be deleted from time to time.

Bytecodes yes, but server hardware was also meant for the longer term. Its goal is to provide context for the compilation; hardware + compilation_time gives relative resource-use information that can be useful when, e.g., someone is looking for resource-intensive contracts.

We can easily dedupe the server hardware with a separate table, so it can stay longer term. For the bytecodes, we can see in the future how much space they take and only then start deleting. We can also explore some compression options, as I don't expect the older bytecodes to be accessed frequently.
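
To illustrate the dedupe idea (a sketch only; this server_hardware table, its columns, and the server_hardware_id reference are hypothetical and were not adopted in the end), it could look roughly like:

-- Hypothetical table holding one row per distinct hardware configuration.
CREATE TABLE server_hardware (
    id bigserial PRIMARY KEY,
    cpu jsonb NOT NULL,
    memory jsonb NOT NULL,
    environment jsonb NOT NULL
);

-- verification_jobs would then reference it instead of repeating the JSON per job:
-- server_hardware_id bigint REFERENCES server_hardware (id)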

marcocastignoli (Member, Author) commented:

Regarding hardware, I think it's best to just use an identifier that we provide in the .env; we only need to store that identifier in the verification_jobs table. E.g. the current identifier could be GCP_CLOUD_RUN_0, and then we know which setup we are referring to.

Regarding deleting the rows, the Remix case @manuelwedler mentioned can be fixed on the client side, just by calling v2/contract/chain/address instead of v2/verificationJob. That said, I like the idea of having two tables: one for successful contracts (minimal, potentially with permanent data) and one for failed contracts (extended, with data removed every N days).

Regarding the verification_jobs bytecodes, I would not store them in code. Let's keep everything in the verification_jobs table. I think it's much safer for when we have to delete them, and storing them in code would also open a potential attack vector: someone could fill up our database with bad code entries.

manuelwedler (Contributor) commented:

I mostly agree with what Marco said. I think we can indeed just store an identifier for the server hardware. The identifier could be process.env.K_REVISION, which gives the Cloud Run revision. That would reduce the data a lot compared to a JSON object. I would also name the column cloud_run_revision in order to be specific.

Here is my new proposal combining the above ideas:

Table verification_jobs

  • id: bigserial, primary key, incremental
  • status: varchar, not null
  • started_at: timestamptz, not null
  • completed_at: timestamptz
  • chain_id: bigint, not null
  • contract_address: bytea, not null
  • verified_contract_id: bigserial, foreign key to verified_contracts.id
  • error_code: varchar
  • error_id: uuid
  • verification_endpoint: varchar, not null
  • cloud_run_revision: varchar (use process.env.K_REVISION)
  • compilation_time: interval

Table verification_jobs_ephemeral

  • id: bigserial, primary key, foreign key to verification_jobs.id
  • onchain_creation_code: bytea
  • onchain_runtime_code: bytea
  • recompiled_creation_code: bytea
  • recompiled_runtime_code: bytea

What do you think about this?
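
For readability, the proposal above rendered as PostgreSQL DDL (a sketch only: constraint details and NOT NULL placement are assumptions, and the two foreign-key columns are written as plain bigint because bigserial is only meaningful for the generated primary keys):

CREATE TABLE verification_jobs (
    id bigserial PRIMARY KEY,
    status varchar NOT NULL,
    started_at timestamptz NOT NULL,
    completed_at timestamptz,
    chain_id bigint NOT NULL,
    contract_address bytea NOT NULL,
    verified_contract_id bigint REFERENCES verified_contracts (id),
    error_code varchar,
    error_id uuid,
    verification_endpoint varchar NOT NULL,
    cloud_run_revision varchar, -- use process.env.K_REVISION
    compilation_time interval
);

CREATE TABLE verification_jobs_ephemeral (
    id bigint PRIMARY KEY REFERENCES verification_jobs (id),
    onchain_creation_code bytea,
    onchain_runtime_code bytea,
    recompiled_creation_code bytea,
    recompiled_runtime_code bytea
);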

manuelwedler (Contributor) commented Feb 4, 2025

Regarding deleting the rows, the Remix case @manuelwedler mentioned can be fixed on the client side, just by calling v2/contract/chain/address instead of v2/verificationJob.

Regarding that, I'm not sure if that would be the right thing to do. It could be that verification failed because a contract is already verified. In that case, it would be weird to show a successful receipt in Remix.

I would say we should just move forward with not deleting the verification_jobs table and see if it really grows too big over time.

marcocastignoli (Member, Author) commented:

Nice!

  1. Let's just keep a generic column name for the hardware: hw instead of cloud_run_revision. Then we store an identifier plus details, e.g. "cloud_run:${process.env.K_REVISION}", so we can reuse it if we change environments.
  2. Regarding "In that case, it would be weird to show a successful receipt in Remix": right.

kuzdogan (Member) commented Feb 4, 2025

  • cloud_run_revision: varchar (use process.env.K_REVISION)

This wouldn't give an outside observer an idea of what the server hardware was, like CPU and memory, which was the whole point. Should we keep that info somewhere, like in another table? We can still keep this for debugging purposes, though.

manuelwedler (Contributor) commented:

This wouldn't give an outside observer an idea of what the server hardware was, like CPU and memory, which was the whole point. Should we keep that info somewhere, like in another table? We can still keep this for debugging purposes, though.

I also agree with this point. You would then need access to GCP to gather this information. The other question is: who will actually have direct access to our database? This table should just be for internal use and not be included in the parquet export.

If we decide to store the hardware info, I think it should be in a separate table. The JSON properties from above could be its columns.

marcocastignoli (Member, Author) commented Feb 4, 2025

like in another table?

I think we can just list all the details of the server somewhere else, e.g. in our docs. I would not create a PostgreSQL table for this.

manuelwedler (Contributor) commented:

Implemented the last proposal in #1902
