# **Prepare SVD for multiome**

In this Jupyter notebook, the train and test datasets are combined and a TruncatedSVD is computed on the result. This is done twice: first for the raw data, and then for the data normalized by the organizers. Only the SVD features made from the normalized data were used in the final submission.

In the Kaggle environment it is more convenient to do this in a separate notebook, since recomputing the TruncatedSVD before every model fit would waste both time and GPU quota.

## Imports and definitions

In [1]:
# Importing the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import gc, scipy.sparse
from humanize import naturalsize
from sklearn.decomposition import TruncatedSVD

In [2]:
# Need this library to read *.h5 files
!pip install --quiet tables

In [3]:
DATA_DIR = "/kaggle/input/open-problems-multimodal/"
FP_CELL_METADATA = os.path.join(DATA_DIR,"metadata.csv")

FP_CITE_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_cite_inputs.h5")
FP_CITE_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_cite_targets.h5")
FP_CITE_TEST_INPUTS = os.path.join(DATA_DIR,"test_cite_inputs.h5")

FP_MULTIOME_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_multi_inputs.h5")
FP_MULTIOME_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_multi_targets.h5")
FP_MULTIOME_TEST_INPUTS = os.path.join(DATA_DIR,"test_multi_inputs.h5")

FP_SUBMISSION = os.path.join(DATA_DIR,"sample_submission.csv")
FP_EVALUATION_IDS = os.path.join(DATA_DIR,"evaluation_ids.csv")

In [4]:
# The raw multiome train dataset is too large to be loaded into RAM at once, but it is sparse.
# So the dataset is read in chunks, and each chunk is converted to a sparse matrix.
# This function does exactly that.


def read_convert_hdf_in_chunks(link, chunk_size, sparse_matrice=None):
    i = 0
    while i < 1000000:
        df_chunk = pd.read_hdf(link, start=i, stop=i+chunk_size)
        sparse_chunk = scipy.sparse.csr_matrix(df_chunk.values)
        # `is None` is the reliable test for "no matrix passed in yet".
        if sparse_matrice is None:
            sparse_matrice = sparse_chunk
        else:
            sparse_matrice = scipy.sparse.vstack([sparse_matrice, sparse_chunk])
        print(i)
        i += chunk_size
        # A chunk shorter than chunk_size means the end of the file was reached.
        if sparse_chunk.shape[0] < chunk_size:
            return sparse_matrice
    # Safety net: return what was read if the row limit above is hit.
    return sparse_matrice
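
The function above calls `scipy.sparse.vstack` once per chunk, so the accumulated matrix is copied again on every iteration. A possible variation, shown here only as a sketch, collects the sparse chunks in a list and stacks them once at the end. The helper names `stack_chunks_sparse` and `iter_row_chunks` are hypothetical, and the toy dense array stands in for the chunks that `pd.read_hdf` would return.

```python
import numpy as np
import scipy.sparse


def stack_chunks_sparse(chunks):
    """Convert each dense chunk to CSR and stack all chunks with a single vstack."""
    sparse_chunks = [scipy.sparse.csr_matrix(chunk) for chunk in chunks]
    return scipy.sparse.vstack(sparse_chunks, format="csr")


def iter_row_chunks(dense, chunk_size):
    """Hypothetical chunk source: yields consecutive row blocks of a dense array."""
    for start in range(0, dense.shape[0], chunk_size):
        yield dense[start:start + chunk_size]


rng = np.random.default_rng(0)
# Toy data: mostly zeros, loosely imitating sparse count data.
dense = rng.poisson(0.1, size=(23, 7)).astype(np.float32)

stacked = stack_chunks_sparse(iter_row_chunks(dense, chunk_size=5))
print(stacked.shape)
```

Whether this is noticeably faster on the real files was not measured here; most of the time in the cells below may well be spent reading the HDF5 data rather than stacking.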
 

## Process the raw data

In [5]:
%%time
# Loading raw data inputs

sparse_X = read_convert_hdf_in_chunks('../input/open-problems-raw-counts/train_multi_inputs_raw.h5', 5000)
print(sparse_X.shape[0])
gc.collect()

0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
60000
65000
70000
75000
80000
85000
90000
95000
100000
105000
105933
CPU times: user 16min 52s, sys: 1min 13s, total: 18min 5s
Wall time: 18min 18s


34

In [6]:
%%time
# Same procedure for the test raw data.
sparse_X = read_convert_hdf_in_chunks('/kaggle/input/open-problems-raw-counts/test_multi_inputs_raw.h5', 5000, sparse_X)
print(sparse_X.shape[0])
gc.collect()

0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
161868
CPU times: user 8min 49s, sys: 46.6 s, total: 9min 36s
Wall time: 9min 43s


75

In [7]:
# Export the per-cell total counts. Maybe they will be useful as a feature.
# sum(axis=1) returns an (n_cells, 1) matrix, so flatten it to a 1-D array.
total_counts = np.asarray(sparse_X.sum(axis=1)).ravel()
df_total_counts = pd.DataFrame({'total_counts': total_counts})
df_total_counts.to_feather('total_counts_multiome.ftr')
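
To make the row-sum step concrete, here is a minimal sketch on a toy matrix (the values are made up for illustration). Summing a sparse matrix along `axis=1` gives one total per row, which then needs to be flattened before it can be used as a single DataFrame column.

```python
import numpy as np
import scipy.sparse

# Toy 3 x 3 count matrix; the middle row is an "empty" cell.
toy = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                        [0, 0, 0],
                                        [3, 4, 0]]))

row_sums = toy.sum(axis=1)                    # one total per row
total_counts = np.asarray(row_sums).ravel()   # flatten to a 1-D array
print(total_counts.tolist())
```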

In [8]:
%%time
# Apply the singular value decomposition.

print(f"Shape of both before SVD: {sparse_X.shape}")
svd = TruncatedSVD(n_components=64, random_state=1)
sparse_X = svd.fit_transform(sparse_X)
print(f"Shape of both after SVD: {sparse_X.shape}")

Shape of both before SVD: (161868, 228942)
Shape of both after SVD: (161868, 64)
CPU times: user 21min 46s, sys: 9.47 s, total: 21min 56s
Wall time: 21min 38s
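
The number of components (64 here) is a choice, not something the data dictates. One way to sanity-check such a choice, sketched below on a small toy matrix rather than on the real inputs, is to look at the cumulative `explained_variance_ratio_` of the fitted `TruncatedSVD`. The matrix size and `n_components=8` are assumptions made only to keep the example fast.

```python
import numpy as np
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
# Toy sparse matrix standing in for the combined train + test inputs.
dense = rng.poisson(0.05, size=(200, 50)).astype(np.float32)
X = scipy.sparse.csr_matrix(dense)

svd = TruncatedSVD(n_components=8, random_state=1)
X_reduced = svd.fit_transform(X)

# Cumulative share of the variance captured by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
print(X_reduced.shape)
print(cumulative)
```

If the cumulative curve is still rising steeply at the last component, adding components may help; if it has flattened, extra components mostly add compute time.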


In [9]:
# Save results in a file.
df_svd = pd.DataFrame(sparse_X)
df_svd.to_csv('svd_raw.csv')
print('Raw data SVD ready')

Raw data SVD ready


In [10]:
# Free the RAM.
del sparse_X, df_svd
gc.collect()

21

## Process the normalized data

In [11]:
%%time
# Generally the same operations for the normalized data using the same function.
# Load the train data in chunks and convert it to sparse matrix.

sparse_X = read_convert_hdf_in_chunks(FP_MULTIOME_TRAIN_INPUTS, 5000)
print(sparse_X.shape[0])
gc.collect()

0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
60000
65000
70000
75000
80000
85000
90000
95000
100000
105000
105942
CPU times: user 18min 25s, sys: 1min 25s, total: 19min 51s
Wall time: 21min 42s


150

In [12]:
%%time
# Same for normalized test dataset.

sparse_X = read_convert_hdf_in_chunks(FP_MULTIOME_TEST_INPUTS, 5000, sparse_X)
print(sparse_X.shape[0])
gc.collect()

0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
161877
CPU times: user 9min 8s, sys: 50.7 s, total: 9min 59s
Wall time: 11min 3s


0

In [13]:
%%time
# Apply the singular value decomposition.
# Normalized data is more important, so I will prepare more components.

print(f"Shape of both before SVD: {sparse_X.shape}")
svd = TruncatedSVD(n_components=256, random_state=1)
sparse_X = svd.fit_transform(sparse_X)
print(f"Shape of both after SVD: {sparse_X.shape}")

Shape of both before SVD: (161877, 228942)
Shape of both after SVD: (161877, 256)
CPU times: user 1h 9min 43s, sys: 29.3 s, total: 1h 10min 13s
Wall time: 1h 8min 59s


In [14]:
# Save results in a file.
df_svd = pd.DataFrame(sparse_X)
df_svd.to_csv('svd.csv')
print('All the SVD ready')

All the SVD ready
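
Because train and test rows were stacked before the decomposition, a downstream notebook has to split the saved features again by row position. The sketch below shows one way this might be done; `split_svd_features` is a hypothetical helper, and the toy DataFrame stands in for the contents of `svd.csv`.

```python
import numpy as np
import pandas as pd


def split_svd_features(df_svd, n_train):
    """Split stacked SVD features back into train and test parts by row position."""
    train = df_svd.iloc[:n_train].to_numpy()
    test = df_svd.iloc[n_train:].to_numpy()
    return train, test


# Toy stand-in for the saved features: 10 rows, 4 components.
df_demo = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4))
train_svd, test_svd = split_svd_features(df_demo, n_train=6)
print(train_svd.shape, test_svd.shape)

# In a downstream notebook this might look like the following
# (the file location is an assumption; 105942 is the train row count printed above):
# df_svd = pd.read_csv("svd.csv", index_col=0)
# train_svd, test_svd = split_svd_features(df_svd, n_train=105942)
```

`index_col=0` is needed because `to_csv` was called without `index=False`, so the first column of the file holds the row index rather than a component.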
