Rootstrap Blog

Initial Coronavirus outbreak in France and South Korea

We are currently experiencing the negative impacts of the novel Coronavirus. This virus has quickly changed our lives and has left many of us feeling confused and fearful of what’s to come. 

As engineers and data scientists – we want to help make sense of the overwhelming amount of data available. We believe that it is our duty to share our insights in order to collectively find solutions and prevent further outbreaks.

The focus of this article is to analyze the initial spread of the disease in France and South Korea. Our analysis covers all data collected until 12th March 2020.

Note: This problem requires a multidisciplinary approach. We are data-driven professionals that have an engineering foundation. The analysis also demands the work of epidemiologists and the results and conclusions we share below have to be taken as they are. We are providing a pure data analysis approach, with limited domain knowledge. We’ve taken a similar approach analyzing the impact of Covid-19 on the EdTech and online education industries.

Where’s the data coming from?

Kaggle provides different datasets for data scientists. They can be used to practice and solve machine learning problems.

The links below contain recent datasets from various countries around that world providing stats of patients infected with coronavirus:

Even more informative resources can be found here.


At this moment, the scientific community has way more questions than answers. It’s very easy to formulate questions but most answers will have to wait.

We can easily make up some of them, like: 

  • Is there a pattern in data and how the first infections occurred? Are there easily identifiable clusters? 
  • What is the ratio between positive cases and deaths?
  • Can we predict future positive cases?
  • What is the virus trajectory? How has the virus spread over time? 

As said, in this article, we are going to explore how the virus was initially spread in France and South Korea, and compare both of them through clustering analysis. 

Clustering Analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Step 1 – Our Datasets

There are two datasets that have similar information for France and South Korea, respectively. So, the idea is to merge those datasets and see if we can find meaningful clusters.
We can find the links to the data here:

Let’s start by doing some initial setup of the libraries we’ll be using here. 

import numpy as np
import pandas as pd
import random
from sklearn import preprocessing
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from kneed import KneeLocator
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

I recommend that you check out this article about Pandas library and data cleaning.

Step 2 – Check Data Quality

Before digging deeper, we have to explore our dataset to see what it looks like and how we’d like to start our analysis.

First of all, we’ll check empty values. We’ll load both datasets and see how many empty (NA) values we find.

NA values correspond to missing or empty information. 

Based on what we find, we might need to make some decisions for each particular column to prepare the data for analysis.

def load_data(data_path):
    df = pd.read_csv(data_path + '/patient.csv')
    df['released_date'] = pd.to_datetime(df['released_date'])
    df['confirmed_date'] = pd.to_datetime(df['confirmed_date'])
    df['month'] = df['confirmed_date'].dt.month
    df['day'] = df['confirmed_date']
    return df

df_france = load_data('coronavirusdataset_france')

id                  2067
sex                 1851
birth_year          1936
country                1
region                 1
department          195
city                1804
group               1905
infection_reason    1906
infection_order     2068
infected_by         2056
contact_number      2073
confirmed_date         4
released_date       2064
deceased_date       2048
status              1481
health              1849
source               199
comments            1637
month                  4
day                    4
dtype: int64

df_south_korea = load_data('coronavirusdataset_south_korea')

patient_id             0
sex                 7190
birth_year          7203
country                0
region              7432
disease             7841
group               7783
infection_reason    7715
infection_order     7833
infected_by         7799
contact_number      7816
confirmed_date         0
released_date       7813
deceased_date       7833
state                  0
month                  0
day                    0
dtype: int64

Step 3 – Resolve Missing Values

With the objective of keeping only the necessary data, some of the columns should be removed such as department, comments, and health as they aren’t important for this specific analysis.

We’ll fill the missing values for birth_year. This process is called data imputation.

By using the birth date, we can create the age variable, subtracting it to the actual date. The missing information will be filled with random numbers drawn from a distribution. The information related to the age distribution of the population can be filled considering the demographics of each country:

The ‘simulate_age’ function was created to simulate the population’s age based on available data. In this case, having the range per age and the percentage of the total population, we can use a uniform distribution to simulate the age distribution for each range.

df_france.drop(['departement','region','comments', 'id', 'infected_by','health','city','source'],axis=1,inplace=True)

df_south_korea.drop(['region','disease','patient_id','infected_by'], axis=1, inplace=True)

def simulate_age(ranges, percents, total_pop):
    simulated_pop = np.array(0)
    for (low, high), percent in zip(ranges, percents):
        simulated_pop = np.append(simulated_pop, 
                  np.random.randint(low=low, high=high, size=int(total_pop*percent/100)))
return simulated_pop

france_population = 67364357

0-14 years: 18.48% 
15-24 years: 11.8% 
25-54 years: 37.48% 
55-64 years: 12.42%
65 years and over: 19.82%
ranges = [(0,14),(15,24),(25,54),(55,64),(65,90)]
percents = [18.48,11.8,37.48,12.42,19.82]
france_simulated_pop = simulate_age(ranges, percents, france_population) f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
ax1.hist(france_simulated_pop,bins=20, color='mediumaquamarine', edgecolor='k', alpha=0.5)ax1.set_title('France - Simulated age distribution')

#South Korea
south_korea_population = 51418097
0-14 years: 13.03% 
15-24 years: 12.19%
25-54 years: 45.13%
55-64 years: 15.09% 
65 years and over: 14.55% 
percents = [13.03,12.19,45.13,15.09,14.55]
south_korea_simulated_pop = simulate_age(ranges, percents, south_korea_population)
ax2.hist(south_korea_simulated_pop,bins=20, color='mediumaquamarine', edgecolor='k', alpha=0.5)
ax2.set_title('South Korea - Simulated age distribution')

Now, we can create an age column in our data frame and fill the missing values with a random value chosen from the distributions that we just simulated.

import math
actual_year = pd.to_datetime('today').year

def calculate_age(x):
    if math.isnan(x):
        return x
    else:return int(actual_year - x)

df_france['age'] = df_france['birth_year'].apply(calculate_age)
df_france.fillna({'age':int(random.choice(france_simulated_pop))}, inplace=True)
df_france.drop(['birth_year'], axis=1, inplace=True)

#South Korea
df_south_korea['age'] = df_south_korea['birth_year'].apply(calculate_age)
df_south_korea.fillna({'age':int(random.choice(south_korea_simulated_pop))}, inplace=True)
df_south_korea.drop(['birth_year'], axis=1, inplace=True)

For missing sex values, we can draw a random number with a value of probability based on the sex ratio for each population.

Considering m as men and w as women. 
m/w=ratio -> m=ration*w
def calculate_values(ratio, total_pop):
    w = (france_population/(1+ratio))/total_pop
    m = 1 - w
return (w,m)

# 0 (woman) and 1 (man) with the calculated probabilities
# France
# total population: 0.96 male(s)/female (2018 est.)

w,m = calculate_values(0.96, france_population)
df_france['sex'] = df_france['sex'].str.lower()
df_france["sex"].replace({"male\xa0?": "male"}, inplace=True)
df_france.fillna({'sex': np.random.choice(['female','male'],p=[w,m])}, inplace=True)

# South Korea
# total population: 1 male(s)/female (2018 est.)

w,m = calculate_values(1, south_korea_population)
df_south_korea['sex'] = df_south_korea['sex'].str.lower()
df_south_korea["sex"].replace({"male\xa0?": "male"}, inplace=True)
df_south_korea.fillna({'sex': np.random.choice(['female','male'],p=[w,m])}, inplace=True)

# France
# total population: 0.96 male(s)/female (2018 est.)

w,m = calculate_values(0.96, france_population)
df_france['sex'] = df_france['sex'].str.lower()
df_france["sex"].replace({"male\xa0?": "male"}, inplace=True)
df_france.fillna({'sex': np.random.choice(['female','male'],p=[w,m])}, inplace=True)

# South Korea

# total population: 1 male(s)/female (2018 est.)

w,m = calculate_values(1, south_korea_population)
df_south_korea['sex'] = df_south_korea['sex'].str.lower()
df_south_korea['sex'].replace({"male\xa0?": "male"}, inplace=True)
df_south_korea.fillna({'sex': np.random.choice(['female','male'],p=[w,m])}, inplace=True)

Since the status column for France’s dataset and the state column for South Korea’s dataset have the same meaning, we can rename the column for one of the datasets, and update the values to be the same categories.

df_france.rename({'status':'state'}, axis=1, inplace=True)
df_france['state'] = df_france['state'].apply(lambda x: 'isolated' if (x=='hospital' or x=='home isolation') else x)


  • The empty values for the country variable will be filled with France or South Korea, respectively.
  • A new category ‘Unknown’ will be created for infection_reason, group, status variables
  • A new category for infection_order is added with code 0
  • The empty values for contact number will be filled with 0

df_france.fillna({'country':'France','infection_reason':'Unknown','group':'Unknown', 'state':'Unknown','infection_order':0, 'contact_number':0} , inplace=True)
df_south_korea.fillna({'infection_reason':'Unknown','group':'Unknown', 'infection_order':0, 'contact_number':0, 'state':'Unknown'} , inplace=True)

Now, let’s check if we still have missing values that need to be resolved.


sex                    0
country                0
group                  0
infection_reason       0
infection_order        0
contact_number         0
confirmed_date         4
released_date       2064
deceased_date       2048
state                  0
month                  4
day                    4
age                    0
dtype: int64


sex                    0
country                0
group                  0
infection_reason       0
infection_order        0
contact_number         0
confirmed_date         0
released_date       7813
deceased_date       7833
state                  0
month                  0
day                    0
age                    0
dtype: int64

Great job keeping up! We don’t have too much left. Now we need to resolve released_date and deceased_date empty values.

  • If released_date is empty, it means that the person still has the virus.
  • If deceased_date is empty, it means that the person didn’t die.

We can calculate the infection duration in days and remove the other 3 variables. Also, we’d like to transform deceased_date to a binary column, indicated if the person has died or not.

df_france['released_date'] = df_france[['released_date','deceased_date']].fillna(df_france['deceased_date'])
df_france['released_date'] = df_france[['released_date']].fillna(pd.to_datetime('today'))
df_france['infection_duration'] = pd.to_datetime(df_france['released_date']).sub(df_france['confirmed_date'], axis=0)

df_france = df_france[df_france['infection_duration'].dt.days>=0]
df_france['infection_duration'] = df_france['infection_duration'].dt.days
df_france.drop(['released_date','confirmed_date','deceased_date'], axis=1, inplace=True)
df_south_korea['released_date'] = df_south_korea[['released_date','deceased_date']].fillna(df_south_korea['deceased_date'])
df_south_korea['released_date'] = df_south_korea[['released_date']].fillna(pd.to_datetime('today'))

df_south_korea['infection_duration'] = pd.to_datetime(df_south_korea['released_date']).sub(df_south_korea['confirmed_date'], axis=0)
df_south_korea = df_south_korea[df_south_korea['infection_duration'].dt.days>=0]

df_south_korea['infection_duration'] = df_south_korea['infection_duration'].dt.days


Index(['sex', 'country', 'group', 'infection_reason',
'infection_order','contact_number', 'state', 'month', 
'day', 'age', 'infection_duration'], dtype='object')


Index(['sex', 'country', 'group', 'infection_reason',
'infection_order', 'contact_number', 'state', 'month', 
'day', 'age', 'infection_duration'], dtype='object')

Step 4 – Data Fusion

Finally, we are ready to put the two datasets together and start our analysis.

df = df_france.append(df_south_korea, sort=False)

sex                   0
country               0
group                 0
infection_reason      0
infection_order       0
contact_number        0
state                 0
month                 0
day                   0
age                   0
infection_duration    0
dtype: int64

Dummy Encoding

The input for the model needs to be numerical values. Thus, we would have to transform the categorical variables into numbers. Since there is no order into the categories, we will transform each category value to a binary column (0 or 1 values).  This technique is called dummy encoding.

In [19]:
df = pd.concat([df, pd.get_dummies(df['sex'])], axis=1)
df = pd.concat([df, pd.get_dummies(df['country'])], axis=1)
df = pd.concat([df, pd.get_dummies(df['state'], drop_first=True)], axis=1)
df = pd.concat([df, pd.get_dummies(df['infection_reason'], drop_first=True)], axis=1)
df = pd.concat([df, pd.get_dummies(df['group'], drop_first=True)], axis=1)

Dimension Reduction

When we apply dummy encoding we end up with more variables, because each of the categories was transformed into a column.

Since we have too many variables, it is difficult to find the pattern among the clusters. First, we can reduce the number of categorical variables by grouping similar categories. Second, we can apply a dimension reduction technique to reduce the number of input variables and make the model easier to interpret.

df = df_france.append(df_south_korea, sort=False)

Transform infection_reason: we’ll list all possible values for this variable, and group them. After that, we’ll group similar reasons and transform them into dummy variables. Finally, we’ll drop the original column.


array(['visit to Italy', 'contact with patient', 'visit to Mulhouse religious gathering', 'Unknown', 'contact with person who visited Italy', 'visit to Egypt', 'unknown', 'Visit to Venice, Italy', 'contact with patient in Auray', 'visit to Mulhouse', 'visit to Milan', 'Italian', 'visit to Lombardy', 'parishioner', 'Creil military base\xa0?', 'visit to Senegal', 'visit to Alsace', 'visit in Lombardy', 'visit to Bretagne', 'Visit in Italy', 'In contact with someone contamitaminated in Oise', 'Religious Meeting in Mulhouse', 'work in a medical environment ', 'Visit family in Oise', 'health professional', 'visit to Wuhan', 'contact with patient in Japan', 'residence in Wuhan', 'visit to Thailand', 'contact with patient in Singapore', 'visit to China', 'visit to Daegu', 'pilgrimage to Israel', 'contact with patient in Daegu', 'visit to Vietnam', 'visit to Japan', 'visit to ooo'], dtype=object)

def transform_reason(value):
    if ('religious' in value or 'parishioner' in value):
        return 'religious'
    elif ('visit' in value or 'residence' in value):
        return 'visit'
    elif ('contact' in value):
        return 'contact'
    elif ('medical' in value or 'health professional' in value):
        return 'medical'
    elif ('militar' in value):
        return 'militar'
    elif ('italian' in value):
        return 'italian'
    elif ('pilgrimage' in value):
        return 'pilgrimage'
        return 'unknown'

df['infection_reason'] = df['infection_reason'].str.lower()
df['infection_reason'] = df['infection_reason'].apply(transform_reason)  
df = pd.concat([df, pd.get_dummies(df['infection_reason'], prefix='infection_reason', prefix_sep='_')], axis=1)
df.drop(['infection_reason_unknown'], axis=1, inplace=True)

Also, the ‘group’ variable provides similar information to infection_reson. We can easily remove it.

df.drop(['group'], axis=1, inplace=True)

Now, we can transform other categorical variables into dummies: country, state, and sex.

df = pd.concat([df, pd.get_dummies(df['country'])], axis=1)
df = pd.concat([df, pd.get_dummies(df['state'], prefix='state', prefix_sep='_')], axis=1)
df = pd.concat([df, pd.get_dummies(df['sex'])], axis=1)

Principal Components Analysis (PCA)

Now, we are going to prepare our data for a very powerful technique: Principal Components Analysis (PCA).

This technique finds a linear combination of the original variables that explains the data. The main objective is to reduce the number of variables by finding new variables ‘components’. It is based on orthogonal vectors, which makes those components to be uncorrelated.

We need to define which variables are inputs and remove the variables from which we have created the dummy variables (they are redundant).

Also, it is necessary to normalize our data, in order to do that we use StandardScaler. Normalizing data means to put all the variables on the same scale to avoid accumulative numerical errors. Only then, we’ll be able to compare the distance between data points.

features = df.drop(['country','state','sex','infection_reason'], axis=1)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
x = StandardScaler().fit_transform(features.values)
pca = PCA(random_state=20)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=20, svd_solver='auto', tol=0.0, whiten=False)

To determine the number of components, we need to look at the explained variance for each of the components.
The components are calculated in a way that explains the largest variance. As an example, we would be adding components until we reach a defined threshold for the explained variance. Typically the threshold is between 0.7 and 0.9. It means that it explains between 70% and 90% of the variance.

In this case, we are going to choose 0.8 as the threshold.

#determine number of components with threshold=0.8


#explained variance
v = round(np.cumsum(pca.explained_variance_ratio_)[n_components-1]*100,1)
print(f'It is needed {n_components} components to explain {v}% variance of the data')

We need 12 components to explain 83.1% variance of the data

Now that we have a number of components, we can calculate the values for those new variables.

pca = PCA(n_components=n_components, random_state=20)
pcs =
components_name = list(range(0, n_components))
components_name = list(map(lambda x: 'PC' + str(x), components_name))
pd.DataFrame(data=pcs.components_, columns = features.columns, index=components_name)


We can visualize in a matrix the importance of each variable for each component.

components_range = np.arange(1, n_components+1, 1)
components_names = list(map(lambda x: 'PC' + str(x), components_range))
plt.yticks(range(0,n_components), components_names,fontsize=10)

Higher values for the variables means more influence in the principal component. Lower values mean a negative influence on the principal components.
Thus, from the heatmap, one possible interpretation for the principal components analysis is:

  • PC1: men not isolated and not from Korea
  • PC2: first months
  • PC3: state released
  • PC4: state deceased
  • PC5: infection reason religious
  • PC6: infection reason visit
  • PC7: infection reason Italian
  • PC8: infection reason Militar
  • PC9: infection reason medical
  • PC10: infection reason pilgrimage
  • PC11: high infection order
  • PC12: from Mongolia

Step 5 – K-Means Clustering

K-means tries to split the data into k groups, where the elements of a group are close to each other. This method is based on the distance between data points.
Therefore, the goal is to minimize the distance between the points to the centroids. The centroids are the “middle” points for each cluster/group.

The algorithm starts with randomly selected centroids and in each iteration, it recalculates the position of the centroids.

To determine k, the number of groups, we use a plot that shows the distortion of data against the number of clusters. This method is called elbow-test. Distortion is defined as the average distance to the cluster centers. The point at which the distortion starts to decrease in a linear fashion is the elbow, which indicates the optimal number of clusters. It means that adding another cluster won’t change too much the distortion.

Let’s create a data frame from the principal components scores and use that for the clustering analysis.

pca_df = pd.DataFrame(data = pca.fit_transform(x), columns = components_names)


Use the elbow test in order to define the optimal number of clusters.

def elbow_test(df, n_init, max_clusters, max_iter):
    distortions = []
    for i in range(1, max_clusters):
        km = KMeans(
            n_clusters=i, init='random',
            n_init=n_init, max_iter=max_iter,
            tol=1e-04, random_state=20
plt.plot(range(1, max_clusters), distortions, marker='o')
    plt.xlabel('Number of clusters')

kn = KneeLocator(
        range(1, max_clusters),
    return kn.knee

n_clusters = elbow_test(pca_df, 10, 20, 300)
print(f'the optimal number of clusters is

According to our analysis, the optimal number of clusters is 4

We’ll know run K-means algorithm with four clusters and see what we find!

km = KMeans(n_clusters=n_clusters, random_state=20)
y = km.fit_predict(pca_df)
idx = np.argsort(km.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(n_clusters)
pca_df['cluster'] = lut[km.labels_]
df['cluster'] = lut[km.labels_]

We can save/load our model with the following code:

import pickle   
pickle.dump(km, open('kmeans_model.sav', 'wb'))

# Load
km = pickle.load(open('kmeans_model.sav', 'rb'))



We can see that the PC7 value is high. It corresponds to the infection reason ‘italian’. We can corroborate this by looking at the actual data:



1 rows × 28 columns

The following functions will plot our data.

The first one is a scatter plot to compare two principal components and see how clusters are distributed among them. The second one creates a 3d plot to compare three principal components colored by cluster.

These graphs will help us to determine the meaning of the clusters.

def draw_scatter(df, col_1, col_2, cluster_column, num_clusters, title):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    labels = list(range(0,num_clusters))
    colors =, 1, num_clusters))
    axs = []
    for i in labels:
        axs.append(ax.scatter(df[df[cluster_column]==i][col_1], df[df[cluster_column]==i][col_2], cmap=colors[I]))

ax.legend(axs, labels, loc='center', bbox_to_anchor=(0.92, 0.84), ncol=1)

def create_3d_scatter(df, col_1, col_2, col_3, cluster_column, num_clusters, title):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.set_zlabel(col_3, rotation=90)
    labels = list(range(0,num_clusters))
    colors =, 1, num_clusters))
    axs = []
    for i in labels:
        d = df[df[cluster_column]==i]
        axs.append(ax.scatter(d[col_1], d[col_2], d[col_3], cmap=colors[i]))
    ax.legend(axs, labels, bbox_to_anchor=(0.2, 0.5), ncol=1)

create_3d_scatter(pca_df, 'PC1', 'PC2', 'PC3', 'cluster', n_clusters, '')

We can see that some of the clusters are distributed according to the first 2 principal components. Also, it seems that PC3 does not have much influence on the separation of clusters.

In this table, we can see how the value for each component influences the meaning of the clusters.

Let’s plot the cluster over PC1 and PC2, to validate more clearly this assumption.

draw_scatter(pca_df, 'PC1', 'PC2', 'cluster', n_clusters, 'Clusters - PC1/PC2')

draw_scatter(pca_df, 'PC1', 'PC3', 'cluster', n_clusters, 'Clusters - PC1/PC3')

Remember that the meaning for the principal components was defined as follows:

  • PC1: men who are not isolated and are not from Korea
  • PC2: first months
  • PC3: state released

So, looking at the graph we can conclude:


Clustering Analysis consists of putting together objects into independent groups (clusters), objects that are similar are placed within the same group. The similarities between the objects are based on the distance between them, objects that are close to each other have something in common. Therefore, close objects belong to the same cluster.  

In this case, the clusters found by K-means show that the cases are grouped by certain characteristics among the patients. These characteristics are focused on the variables of sex, infection reason, country, and month. Depending on the value of those variables the data points are distributed among the clusters.

Clusters are groups of patients with similar characteristics.

The clusters that we found are:

Cluster 0 grouped the patients that are women from Korea who contracted the virus in March. Cluster 1 contains the other group of women that are from Korea but contracted the virus during other months. Cluster 2 refers to men from France and cluster 3 grouped the patients that contracted the virus because they had contact with someone from Italy.

Doing this work provides us with a clearer understanding of how this virus is spreading. Even though  Coronavirus has quickly dispersed out of our control, we believe that the more data becomes available, the better we will be able to respond in the future.


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.