Data collection
Currently, the data for the analysis of the different ledgers is collected through Google BigQuery .
Queries
One can retrieve the data directly from BigQuery using the queries below:
Bitcoin
WITH double_entry_book AS (
SELECT array_to_string(inputs.addresses, ",") as address, inputs.type, -inputs.value as value
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
WHERE block_timestamp < "{{timestamp}}"
UNION ALL
SELECT array_to_string(outputs.addresses, ",") as address, outputs.type, outputs.value as value
FROM `bigquery-public-data.crypto_bitcoin.outputs` as outputs
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, type, sum(value) as balance
FROM double_entry_book
GROUP BY 1,2
HAVING balance > 0
ORDER BY balance DESC
Bitcoin Cash
WITH double_entry_book AS (
SELECT array_to_string(inputs.addresses, ",") as address, inputs.type, -inputs.value as value
FROM `bigquery-public-data.crypto_bitcoin_cash.inputs` as inputs
WHERE block_timestamp < "{{timestamp}}"
UNION ALL
SELECT array_to_string(outputs.addresses, ",") as address, outputs.type, outputs.value as value
FROM `bigquery-public-data.crypto_bitcoin_cash.outputs` as outputs
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, type, sum(value) as balance
FROM double_entry_book
GROUP BY 1,2
HAVING balance > 0
ORDER BY balance DESC
Dogecoin
WITH double_entry_book AS (
SELECT array_to_string(inputs.addresses, ",") as address, inputs.type, -inputs.value as value
FROM `bigquery-public-data.crypto_dogecoin.inputs` as inputs
WHERE block_timestamp < "{{timestamp}}"
UNION ALL
SELECT array_to_string(outputs.addresses, ",") as address, outputs.type, outputs.value as value
FROM `bigquery-public-data.crypto_dogecoin.outputs` as outputs
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, type, sum(value) as balance
FROM double_entry_book
GROUP BY 1,2
HAVING balance > 0
ORDER BY balance DESC
Ethereum
WITH double_entry_book AS (
SELECT to_address as address, value AS value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE to_address IS NOT null
AND status = 1
AND (call_type NOT IN ('delegatecall', 'callcode', 'staticcall') OR call_type IS null)
AND block_timestamp < "{{timestamp}}"
UNION ALL
SELECT from_address as address, -value AS value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE from_address IS NOT null
AND status = 1
AND (call_type NOT IN ('delegatecall', 'callcode', 'staticcall') OR call_type IS null)
AND block_timestamp < "{{timestamp}}"
UNION ALL
SELECT miner AS address, sum(cast(receipt_gas_used as numeric) * cast(gas_price as numeric)) AS value
FROM `bigquery-public-data.crypto_ethereum.transactions` AS transactions
JOIN `bigquery-public-data.crypto_ethereum.blocks` AS blocks on blocks.number = transactions.block_number
WHERE transactions.block_timestamp < "{{timestamp}}"
GROUP BY blocks.miner
UNION ALL
SELECT from_address AS address, -(cast(receipt_gas_used as numeric) * cast(gas_price as numeric)) AS value
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, sum(value) AS balance
FROM double_entry_book
GROUP BY address
HAVING balance > 0
ORDER BY balance DESC
Litecoin
WITH double_entry_book AS (
SELECT array_to_string(inputs.addresses, ",") as address, inputs.type, -inputs.value as value
FROM `bigquery-public-data.crypto_litecoin.inputs` as inputs
WHERE block_timestamp < "{{timestamp}}"
UNION ALL
SELECT array_to_string(outputs.addresses, ",") as address, outputs.type, outputs.value as value
FROM `bigquery-public-data.crypto_litecoin.outputs` as outputs
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, type, sum(value) as balance
FROM double_entry_book
GROUP BY 1,2
HAVING balance > 0
ORDER BY balance DESC
Tezos
WITH double_entry_book as (
SELECT IF(kind = 'contract', contract, delegate) AS address, change AS value
FROM `public-data-finance.crypto_tezos.balance_updates`
WHERE (status IS NULL OR status = 'applied') AND (timestamp < "{{timestamp}}")
UNION ALL
SELECT address, balance_change
FROM `public-data-finance.crypto_tezos.migrations`
WHERE timestamp < "{{timestamp}}"
)
SELECT address, SUM(value) AS balance
FROM double_entry_book
GROUP BY address
HAVING balance > 0
ORDER BY balance DESC
Zcash
WITH double_entry_book AS (
SELECT array_to_string(inputs.addresses, ",") as address, inputs.type, -inputs.value as value
FROM `bigquery-public-data.crypto_zcash.inputs` as inputs
WHERE block_timestamp < "{{timestamp}}"
UNION ALL
SELECT array_to_string(outputs.addresses, ",") as address, outputs.type, outputs.value as value
FROM `bigquery-public-data.crypto_zcash.outputs` as outputs
WHERE block_timestamp < "{{timestamp}}"
)
SELECT address, type, sum(value) as balance
FROM double_entry_book
GROUP BY 1,2
HAVING balance > 0
ORDER BY balance DESC
Automating the data collection process
Instead of executing each of these queries separately on the BigQuery console and saving the results manually, it is also possible to automate the process using a script and collect all relevant data in one go. Executing this script will run queries from this file.
IMPORTANT: the script uses service account credentials for authentication, therefore before running it, you need to
generate the relevant credentials from Google, as described
here and save your key in the
data_collections_scripts/
directory of the project under the name 'google-service-account-key-0.json'. Any additional
keys should be named 'google-service-account-key-1.json', 'google-service-account-key-2.json', and so on.
There is a
sample file
that you can consult, which shows what your credentials are supposed to look like (but note that this is for
informational purposes only, this file is not used in the code).
Once you have set up the credentials, you can just run the following command from the root directory to retrieve data for all supported blockchains:
python -m data_collection_scripts.big_query_balance_data
There are also three command line arguments that can be used to customize the data collection process:
ledgers
accepts any number of the supported ledgers (case-insensitive). For example, adding--ledgers bitcoin
results in collecting data only for Bitcoin, while--ledgers Bitcoin Ethereum
would collect data for Bitcoin and Ethereum. If theledgers
argument is omitted, then the default value is used, which is taken from the configuration file and typically corresponds to all supported blockchains.snapshot_dates
accepts any number of dates formatted as YYYY-MM-DD, YYYY-MM, or YYYY. Then, data is collected for the specified date(s). Again, if this argument is omitted, the default value is taken from the configuration file.--force-query
forces the collection of all raw data files, even if some or all of the files already exist. By default, this flag is set to False and the script only fetches data for some blockchain if the corresponding file does not already exist.
Historical prices
The historical market prices for all coins (available under price_data
) were collected from
CoinGecko.