Open In Colab

# Mount Google Drive (where the CSV lives) and install Basemap's system dependencies.
from google.colab import drive
drive.mount('/content/drive')
!pip install netCDF4
!apt-get install libgeos-3.5.0
!apt-get install libgeos-dev
!pip install https://github.com/matplotlib/basemap/archive/master.zip
!pip install pyproj==1.9.6

Introduction

We use data from the OCO-2 (Orbiting Carbon Observatory-2) satellite, published by NASA.

TODO: Explanation

Getting and Exploring Data

Sample data can be accessed freely from the NASA database, along with other open data from several NASA satellites.

We will be using a CSV aggregated by Benoit Courty. The sample contains data for the entire month of December 2019 (237 MB).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("/content/drive/My Drive/Data For Good - S7: Ministère & O-CO2/oco2_2019_12.csv", sep=";")
data.head()
sounding_id latitude longitude xco2 xco2_uncert orbit
0 2019120100401475 -74.449608 -147.438370 406.644867 0.629539 28801
1 2019120100401773 -74.289062 -147.508148 406.293152 0.505218 28801
2 2019120100401806 -74.278152 -147.729370 408.028839 0.439435 28801
3 2019120100433774 -63.759300 -156.219269 407.812469 0.387007 28801
4 2019120100480106 -49.472168 -162.097275 408.023682 0.605203 28801
data.describe()
sounding_id latitude longitude xco2 xco2_uncert orbit
count 2.528809e+06 2.528809e+06 2.528809e+06 2.528809e+06 2.528809e+06 2.528809e+06
mean 2.019122e+15 -7.162205e+00 -5.379357e+00 4.091962e+02 4.867411e-01 2.902197e+04
std 8.984320e+08 2.559929e+01 1.046395e+02 1.249663e+00 1.282551e-01 1.308953e+02
min 2.019120e+15 -8.530780e+01 -1.800000e+02 3.879844e+02 6.670112e-03 2.880100e+04
25% 2.019121e+15 -2.785016e+01 -1.028904e+02 4.084726e+02 3.976190e-01 2.890800e+04
50% 2.019122e+15 -7.797689e+00 -7.425950e+00 4.091107e+02 4.571700e-01 2.902100e+04
75% 2.019122e+15 1.576880e+01 7.997420e+01 4.098619e+02 5.399239e-01 2.913300e+04
max 2.019123e+15 5.233300e+01 1.800000e+02 4.229699e+02 1.843246e+00 2.925100e+04
Cutting the month of data into 4 weeks (the leading digits of sounding_id encode the timestamp, so plain integer comparison works):
end_week_1 = 2019120800000000
end_week_2 = 2019121500000000
end_week_3 = 2019122200000000
end_week_4 = 2019122900000000

# Combining both bounds in a single boolean mask avoids the
# "Boolean Series key will be reindexed" warning of chained indexing.
data_1 = data[data['sounding_id'] < end_week_1]
data_2 = data[(data['sounding_id'] >= end_week_1) & (data['sounding_id'] < end_week_2)]
data_3 = data[(data['sounding_id'] >= end_week_2) & (data['sounding_id'] < end_week_3)]
data_4 = data[(data['sounding_id'] >= end_week_3) & (data['sounding_id'] < end_week_4)]

print("Number of observations per week:")
print("Week 1: ", data_1.shape[0])
print("Week 2: ", data_2.shape[0])
print("Week 3: ", data_3.shape[0])
print("Week 4: ", data_4.shape[0])
Number of observations per week:
Week 1:  590620
Week 2:  579613
Week 3:  564073
Week 4:  560096
The sounding_id encodes the acquisition timestamp, so we can convert it into a datetime column:
from datetime import datetime

def to_date(a):
    return datetime.strptime(str(a), '%Y%m%d%H%M%S%f')

data['date'] = data['sounding_id'].apply(to_date)
data.head()
sounding_id latitude longitude xco2 xco2_uncert orbit date
0 2019120100401475 -74.449608 -147.438370 406.644867 0.629539 28801 2019-12-01 00:40:14.750
1 2019120100401773 -74.289062 -147.508148 406.293152 0.505218 28801 2019-12-01 00:40:17.730
2 2019120100401806 -74.278152 -147.729370 408.028839 0.439435 28801 2019-12-01 00:40:18.060
3 2019120100433774 -63.759300 -156.219269 407.812469 0.387007 28801 2019-12-01 00:43:37.740
4 2019120100480106 -49.472168 -162.097275 408.023682 0.605203 28801 2019-12-01 00:48:01.060
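
A faster, vectorized alternative to .apply() (a sketch; pd.to_datetime parses the whole column at once, which matters with 2.5M rows):

data['date'] = pd.to_datetime(data['sounding_id'].astype(str), format='%Y%m%d%H%M%S%f')

# With a real datetime column, the weekly split above can also be written
# directly against dates instead of raw sounding_id bounds:
data_1 = data[data['date'] < '2019-12-08']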

Data on the map

draw_map: Function to draw the map and the observations (relief style)

Parameters:

  • (DataFrame) data: the dataset to map.
  • (int) lon_min : the minimum longitude. default: -180
  • (int) lon_max: the maximum longitude. default: 180
  • (int) lat_min: the minimum latitude. default: -90
  • (int) lat_max: the maximum latitude. default: 90
  • (int) size_point: size of the point to plot (useful if we zoom in). default: 1
  • (Bool) frontier: whether or not to draw country borders. default: False
from mpl_toolkits.basemap import Basemap

def draw_map(data, lon_min=-180, lon_max=180, lat_min=-90, lat_max=90, size_point=1, frontier=False):

    plt.figure(figsize=(15, 10), edgecolor='w')
    m = Basemap(llcrnrlat=lat_min, urcrnrlat=lat_max, llcrnrlon=lon_min, urcrnrlon=lon_max)
    
    m.shadedrelief()
    if (frontier):
      m.drawcountries(linewidth=1)
    
    parallels = np.arange(-80.,81,10.)
    m.drawparallels(parallels,labels=[False,True,True,False])

    meridians = np.arange(10.,351.,20.)
    m.drawmeridians(meridians,labels=[True,False,False,True])

    m.scatter(data['longitude'], data['latitude'], c=data['xco2'], cmap=plt.cm.jet, s=size_point)

    plt.show()
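
A possible refinement (a sketch, not in the original function): keeping a handle on the scatter makes it easy to attach a colour bar so the xco2 scale is readable. Inside draw_map, the scatter line would become:

sc = m.scatter(data['longitude'], data['latitude'], c=data['xco2'], cmap=plt.cm.jet, s=size_point)
plt.colorbar(sc, label='xco2 (ppm)', shrink=0.6)  # colour scale in ppm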

Visualisations of:

  1. One month of Data for the whole world
  2. 1st week of Data for the whole world
  3. 2nd week of Data for the whole world
draw_map(data)
draw_map(data_1)
draw_map(data_2)

Data for a Specific Zone

get_data_zone: retrieve the data for one specific zone.

Parameters:

  • (DataFrame) data: the origin dataset.
  • (int) lon_min : the minimum longitude. default: -180
  • (int) lon_max: the maximum longitude. default: 180
  • (int) lat_min: the minimum latitude. default: -90
  • (int) lat_max: the maximum latitude. default: 90
def get_data_zone(data, lon_min=-180, lon_max=180, lat_min=-90, lat_max=90):
    data = data[data['longitude'] < lon_max]
    data = data[data['longitude'] > lon_min]
    data = data[data['latitude'] < lat_max]
    data = data[data['latitude'] > lat_min]
    return data
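
The same filtering can be written as a single combined boolean mask; a sketch of an equivalent version:

def get_data_zone_mask(data, lon_min=-180, lon_max=180, lat_min=-90, lat_max=90):
    # Same behaviour as get_data_zone, with one combined mask.
    mask = ((data['longitude'] > lon_min) & (data['longitude'] < lon_max)
            & (data['latitude'] > lat_min) & (data['latitude'] < lat_max))
    return data[mask]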
data_10_20 = get_data_zone(data, lon_min=10, lon_max=20, lat_min=-90, lat_max=90)
draw_map(data_10_20)

It is interesting to study the points below latitude -80° visible on this map.

These points appear to be anomalies: the satellite measures CO2 concentration through spectroscopy, and the measurement therefore depends on sunlight. Since this region receives little sunlight, it is surprising to observe measurements there.

data_antarctica = get_data_zone(data, lon_min=-180, lon_max=180, lat_min=-90, lat_max=-75)

print("Number of observations below latitude -75°: ", data_antarctica.shape[0])

draw_map(data_antarctica, lat_max=-75)
Number of observations below latitude -75°:  811

Let's zoom in on a particular region:

data_antarctica = get_data_zone(data, lon_min=-70, lon_max=-50, lat_min=-90, lat_max=-75)

print("Number of observations below -75°, longitude between -70° and -50°: ", data_antarctica.shape[0])

draw_map(data_antarctica, lat_max=-75, lon_min=-70, lon_max=-50, size_point=5)
Number of observations below -75°, longitude between -70° and -50°:  121

The same thing can be done over a specific country. Here is the month of data over India:

data_india = get_data_zone(data, lon_min=65, lon_max=100, lat_min=5, lat_max=40)
draw_map(data_india, lon_min=65, lon_max=100, lat_min=5, lat_max=40, frontier=True)

Data Profiling and Anomaly Spotting

We can now explore the CO2 concentration values to get some insight into the distribution and spot outliers:

import matplotlib.pyplot as plt

data_test = data[10:2000]

plt.figure(figsize=(15, 10), edgecolor='w')
plt.scatter(data_test['date'], data_test['xco2'])
<matplotlib.collections.PathCollection at 0x7fa7a6c8af98>

We have 2.53M entries, ranging from 388 to 423 ppm, with a mean of 409 ppm.

data['date'].describe()
count                        2528809
unique                       2528809
top       2019-12-21 20:30:44.310000
freq                               1
first     2019-12-01 00:40:14.750000
last      2019-12-31 22:46:49.080000
Name: date, dtype: object
import seaborn as sns
plt.figure(figsize=(15, 10), edgecolor='w')
sns.distplot(data['xco2'], bins=100)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa7b179d198>

We can try to spot (and remove?) the extreme outliers that may be considered anomalies.

Below we can see one outlier above 420 and surprisingly low observations below 400:

plt.figure(figsize=(15, 10), edgecolor='w')
sns.boxplot(data['xco2'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fa7a6fca828>

We have:

  • 1 observation above 420: this seems to be an anomaly.
  • 13 observations below 400: these may not all be anomalies, since some of them are plausible under a Gaussian distribution.
print("Values above 420: \n", data[data['xco2'] > 420]['xco2'])
print("Values below 400: \n", data[data['xco2'] < 400]['xco2'])
Values above 420: 
 599013    422.96991
Name: xco2, dtype: float64
Values below 400: 
 367850     397.382385
507507     398.951752
650870     395.861664
955766     399.951538
967410     397.712036
1302053    397.126099
1312833    393.484985
1564802    397.753967
2239166    387.984375
2257075    397.789276
2257076    394.924622
2286446    399.652252
2498825    399.505096
Name: xco2, dtype: float64
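
If we decide to drop these extremes, one standard (if crude) option is an IQR filter; this is a sketch, not a definitive cleaning rule:

q1, q3 = data['xco2'].quantile([0.25, 0.75])
iqr = q3 - q1
# Keep values within 3 IQRs of the quartiles; the factor 3 is an assumption to tune.
mask = data['xco2'].between(q1 - 3 * iqr, q3 + 3 * iqr)
data_clean = data[mask]
print("Dropped", (~mask).sum(), "observations")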

Visualize track pattern

The idea here is to consider a single orbit of the satellite in order to plot a track and look for an anomaly pattern (cf. F. Chevallier's paper for China).

First we need to visualize the split data:

#Here we consider the 28904th orbit (6555 observations):
data_28904 = data[data['orbit'] == 28904]
#Here we consider the 28905th orbit:
data_28905 = data[data['orbit'] == 28905]
#Here we consider the 28906th orbit:
data_28906 = data[data['orbit'] == 28906]

plt.figure(figsize=(30, 7), edgecolor='w')
ax1 = plt.subplot(131)
plt.scatter(data_28904['date'], data_28904['xco2'])
ax2 = plt.subplot(132, sharey=ax1)
plt.scatter(data_28905['date'], data_28905['xco2'])
ax3 = plt.subplot(133, sharey=ax1)
plt.scatter(data_28906['date'], data_28906['xco2'])
<matplotlib.collections.PathCollection at 0x7fa798dc94e0>
get_orbit_data: retrieve all data points for a specific orbit

Parameters:

  • (DataFrame) data: the origin dataset.
  • (int) orbit: the id of the requested orbit. default: 28905

Return:

  • (DataFrame) the dataset containing all the data points of the specified orbit.
def get_orbit_data(data, orbit=28905):
    return data[data['orbit'] == orbit]

get_orbit_data(data, 28960)
sounding_id latitude longitude xco2 xco2_uncert orbit date
918479 2019121122384133 -84.843575 -63.591770 414.342773 0.601548 28960 2019-12-11 22:38:41.330
918480 2019121122391308 -84.106491 -80.107651 411.894287 0.625155 28960 2019-12-11 22:39:13.080
918481 2019121122391974 -83.839905 -82.203819 409.719269 0.653130 28960 2019-12-11 22:39:19.740
918482 2019121122392004 -83.828697 -82.337822 409.789215 0.618078 28960 2019-12-11 22:39:20.040
918483 2019121122392005 -83.840591 -82.497063 409.916901 0.596719 28960 2019-12-11 22:39:20.050
... ... ... ... ... ... ... ...
926056 2019121123135904 28.715212 -152.121628 410.655975 0.327967 28960 2019-12-11 23:13:59.040
926057 2019121123135906 28.689960 -152.103485 411.835114 0.348910 28960 2019-12-11 23:13:59.060
926058 2019121123135936 28.707779 -152.109482 410.516602 0.324554 28960 2019-12-11 23:13:59.360
926059 2019121123135937 28.694929 -152.100571 409.954681 0.333670 28960 2019-12-11 23:13:59.370
926060 2019121123135977 28.712656 -152.106644 410.323700 0.341458 28960 2019-12-11 23:13:59.770

7582 rows Ă— 7 columns

There seem to be:

  • some anomalies (isolated high observations)
  • some noise (within the same line of observations)

It would be interesting to visualise these as a line. We can reduce the noise by grouping data points in time and replacing each group by its mean.

CAUTION: This "noise reduction" is very basic and needs to be improved!

group_by_minute: reduce the time resolution of a dataframe to the minute

Parameters:

  • (DataFrame) data: the origin dataset.

Return:

  • (DataFrame) the transformed dataset.
def group_by_minute(data):
    data = data.copy()  # work on a copy to avoid the SettingWithCopyWarning
    data['date'] = data['date'].astype('datetime64[m]')  # truncate timestamps to the minute
    return data.groupby(['date']).mean()

data_28905_grouped = group_by_minute(data_28905)
data_28905_grouped.head()
sounding_id latitude longitude xco2 xco2_uncert orbit
date
2019-12-08 04:05:00 2.019121e+15 -69.237373 156.832840 412.351044 0.597179 28905.0
2019-12-08 04:07:00 2.019121e+15 -63.462181 152.878530 408.137201 0.422398 28905.0
2019-12-08 04:08:00 2.019121e+15 -62.065236 152.133859 408.361529 0.346995 28905.0
2019-12-08 04:14:00 2.019121e+15 -40.197575 144.382348 409.680232 0.364700 28905.0
2019-12-08 04:16:00 2.019121e+15 -34.071424 142.813871 408.516944 0.375947 28905.0
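As noted above, averaging into whole minutes is crude. A slightly more flexible sketch (assuming the date column is a datetime) is a time-based rolling median, which is more robust to isolated spikes:

s = data_28905.set_index('date')['xco2'].sort_index()
smoothed = s.rolling('60s').median()  # median over a sliding 60-second window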
Pair visualisation

We can now visualize the simplified data for the 28905th orbit. We notice that the overall shape is roughly kept.

plt.figure(figsize=(20, 7), edgecolor='w')
ax1 = plt.subplot(121)
plt.scatter(data_28905_grouped.index, data_28905_grouped['xco2'])
ax2 = plt.subplot(122, sharex=ax1, sharey=ax1)
plt.scatter(data_28905['date'], data_28905['xco2'])
<matplotlib.collections.PathCollection at 0x7fa798ea3ef0>
data_28950 = get_orbit_data(data, 28950)
data_28950_grouped = group_by_minute(data_28950)
data_28951 = get_orbit_data(data, 28951)
data_28951_grouped = group_by_minute(data_28951)
data_28952 = get_orbit_data(data, 28952)
data_28952_grouped = group_by_minute(data_28952)


plt.figure(figsize=(30, 10), edgecolor='w')

ax1 = plt.subplot(231)
plt.scatter(data_28950_grouped.index, data_28950_grouped['xco2'])
ax2 = plt.subplot(234, sharex=ax1, sharey=ax1)
plt.scatter(data_28950['date'], data_28950['xco2'])

ax3 = plt.subplot(232, sharey=ax1)
plt.scatter(data_28951_grouped.index, data_28951_grouped['xco2'])
ax4 = plt.subplot(235, sharex=ax3, sharey=ax3)
plt.scatter(data_28951['date'], data_28951['xco2'])

ax5 = plt.subplot(233, sharey=ax1)
plt.scatter(data_28952_grouped.index, data_28952_grouped['xco2'])
ax6 = plt.subplot(236, sharex=ax5, sharey=ax5)
plt.scatter(data_28952['date'], data_28952['xco2'])
<matplotlib.collections.PathCollection at 0x7fa7967ec0b8>

Apply the wind (reverse)

The CO2 concentration is not observed immediately after emission. Since the satellite revisits a given geographical point only every 16 days, the CO2 may have been emitted quite far from where it is measured.

One reason is that the wind moves the air, and with it the CO2. It is therefore interesting to estimate where the CO2 could have been before being moved by the wind.

Wind data are available from NASA and already compiled in the original OCO-2 dataset.
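
The CSV used here does not carry those wind columns, so the following is only a sketch under assumptions: suppose each sounding had eastward and northward wind components (hypothetical columns windspeed_u and windspeed_v, in m/s). A first-order "reverse" displacement over a time delta_t would be:

import numpy as np

def reverse_wind(df, delta_t=3600):
    # Hypothetical columns: windspeed_u (eastward, m/s), windspeed_v (northward, m/s).
    # First-order approximation: 1° of latitude ~ 111 km;
    # 1° of longitude shrinks with cos(latitude).
    df = df.copy()
    df['lat_origin'] = df['latitude'] - df['windspeed_v'] * delta_t / 111000
    df['lon_origin'] = df['longitude'] - df['windspeed_u'] * delta_t / (111000 * np.cos(np.radians(df['latitude'])))
    return df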


Finding Peaks (Measurement Time)

We can try to find spikes in the distribution. Looking directly at the difference between a point and its predecessor gives nothing useful, due to the noise.

The idea would instead be to look at the neighbouring distribution to spot a local peak.

TODO
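
One way to make "compare to the neighbouring distribution" concrete is a rolling-median z-score; a sketch, with window size and cut-off as assumptions to tune:

window = 51  # odd number of neighbouring soundings
med = data['xco2'].rolling(window, center=True).median()
mad = (data['xco2'] - med).abs().rolling(window, center=True).median()
score = (data['xco2'] - med) / (1.4826 * mad)  # MAD-based robust z-score
spikes = data[score > 4]  # empirical cut-off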

from scipy.signal import find_peaks

# Restrict to the first 500 soundings so the peak indices line up with the plot.
sub = data['xco2'][:500]
peaks, _ = find_peaks(sub, threshold=1)

plt.figure(figsize=(15, 10), edgecolor='w')
plt.plot(sub.values)
plt.scatter(peaks, sub.values[peaks], linewidth=0.3, s=50, c='r')

#print(data['xco2'][peaks-1], " -> ", data['xco2'][peaks])

If we consider that two nearby peaks are in fact a single phenomenon (due to the noise), we would rather look for global peaks.

The distance here is chosen empirically to visually match the observations.

# Compute the peaks on the same slice we plot, so the indices line up.
xco2 = data['xco2'][:500000]
peaks, _ = find_peaks(xco2, threshold=1, distance=3000)

plt.figure(figsize=(25, 10), edgecolor='w')
plt.plot(xco2)
plt.scatter(peaks, xco2.values[peaks], linewidth=0.3, s=50, c='r')
<matplotlib.collections.PathCollection at 0x7f7ed6185208>
peaks, _ = find_peaks(xco2, threshold=3, distance=5000)

plt.figure(figsize=(25, 10), edgecolor='w')
plt.plot(xco2)
plt.scatter(peaks, xco2.values[peaks], linewidth=0.3, s=50, c='r')
<matplotlib.collections.PathCollection at 0x7f7ed5d3e4e0>
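
Rather than tuning threshold and distance by eye, scipy's prominence criterion may be more robust for isolating genuine spikes; a sketch:

peaks, props = find_peaks(xco2, prominence=2, distance=5000)
print(len(peaks), "peaks at least 2 ppm above their surroundings")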

Finding peaks (Space)

The idea is to find peaks on a spatial basis (longitude & latitude).

If a value is significantly high compared to neighbouring values, we can assume it is either an emission peak or a measurement anomaly. In both cases it is interesting to dig deeper.
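
A sketch of that idea: bin the globe into 1°×1° cells and flag observations far above their cell's median (the bin size and the 2 ppm cut-off are assumptions to tune):

binned = data.copy()
binned['lat_bin'] = binned['latitude'].round()
binned['lon_bin'] = binned['longitude'].round()
cell_median = binned.groupby(['lat_bin', 'lon_bin'])['xco2'].transform('median')
space_peaks = binned[binned['xco2'] - cell_median > 2]  # > 2 ppm above the local median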

Finding peaks (Local Time)

The idea is to spot peaks in a region over a long period (e.g. a month or a year).

This can be done by comparing against previous yearly values (problem: few observations) or previous monthly values (problem: the seasonal carbon cycle).
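
With several months of data loaded (we only have December here, so this is purely a sketch), the per-region monthly baseline could look like:

# Hypothetical multi-month DataFrame data_all with the same columns as data.
data_all = data  # placeholder: a single month
monthly = (data_all
           .assign(month=data_all['date'].dt.to_period('M'),
                   lat_bin=data_all['latitude'].round(),
                   lon_bin=data_all['longitude'].round())
           .groupby(['lat_bin', 'lon_bin', 'month'])['xco2']
           .mean())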


Data Cleaning and Curation



Junk Code

#m.scatter(data_1['longitude'], data_1['latitude'], c=data_1['xco2'], cmap=plt.cm.jet, s=1)
lon = data['longitude'][:]
lat = data['latitude'][:]
CO2 = data['xco2'][:]
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

m = Basemap(llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180)

plt.figure(figsize=(15, 10), edgecolor='w')
m.shadedrelief()
m.scatter(lon, lat, s=1, c=CO2, cmap=plt.cm.jet, alpha=0.5)
for index,row in data[:25].iterrows():
    row['date'] = datetime.strptime(str(row['sounding_id']), '%Y%m%d%H%M%S%f')

data.head()
plt.plot(data['xco2'])
plt.scatter(c_max_index[0],data['xco2'][c_max_index[0]],linewidth=0.3, s=50, c='r')
# Reducing the datetime precision by the minute:
data_28905['date'] = data_28905['date'].astype('datetime64[m]')
data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(microsecond=0))
data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(second=x.second[:1]))
grouped = data_28905.groupby(['date']).mean()

grouped.head()
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-147-6ff0849e7973> in <module>()
      1 data_28905['date'] = data_28905['date'].astype('datetime64[m]')
      2 data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(microsecond=0))
----> 3 data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(second=x.second[:1]))
      4 grouped = data_28905.groupby(['date']).mean()
      5 

/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4043             else:
   4044                 values = self.astype(object).values
-> 4045                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4046 
   4047         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-147-6ff0849e7973> in <lambda>(x)
      1 data_28905['date'] = data_28905['date'].astype('datetime64[m]')
      2 data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(microsecond=0))
----> 3 data_28906['date'] = data_28906['date'].apply(lambda x: x.replace(second=x.second[:1]))
      4 grouped = data_28905.groupby(['date']).mean()
      5 

TypeError: 'int' object is not subscriptable