
External Data

All files presented in this documentation can be downloaded from the CREDO project internal page (redmine).

Direct link to the files.

Data source

All of the data extraction scripts shown below operate on data from:

Downloading script

Two scripts were created to retrieve the data:
import_oulu.py (and import_mosc.py) -> retrieves data from the NMDB database, using OULU as an example
import_eq.py -> downloads earthquake data

Downloading data from Pierre Auger is done via the official website; there is no download script. If you want to see an example of how to analyse the CR data, see the documentation: data analysis.

Import NMDB (oulu)

import_oulu.py script

Declare libraries and a list of fixed request elements to the database.

import os
from urllib import request

oulu_body = [
    "http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=OULU&tabchoice=revori&dtype=corr_for_pressure&tresolution=",
    "&force=1&yunits=",
    "&date_choice=bydate&start_day=1&start_month=1&start_year=",
    "&start_hour=0&start_min=0&end_day=31&end_month=12&end_year=",
    "&end_hour=23&end_min=59&output=ascii"
]

units = 0

Function to create an NMDB database query

def oulu_pack(year_begin, year_end, tresolution, only_year):
    if only_year:
        # one request (and one CSV file) per year
        for one_year in range(year_begin, year_end + 1):
            req = request.Request(oulu_body[0] + str(tresolution) + oulu_body[1] + str(units) +
                                  oulu_body[2] + str(one_year) + oulu_body[3] + str(one_year) + oulu_body[4])
            with request.urlopen(req) as response:
                the_page = response.read()
            download_save(the_page, one_year, one_year)
    else:
        # a single request covering the whole year_begin..year_end range
        req = request.Request(oulu_body[0] + str(tresolution) + oulu_body[1] + str(units) +
                              oulu_body[2] + str(year_begin) + oulu_body[3] + str(year_end) + oulu_body[4])
        with request.urlopen(req) as response:
            the_page = response.read()
        download_save(the_page, year_begin, year_end)

Convert the database response to CSV file format - the download_save function

To start with, let's look at what a sample request sent to the NMDB database (which generates an HTML page) looks like.

"http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=OULU&tabchoice=revori&dtype=corr_for_pressure&tresolution=5&force=1&yunits=1&date_choice= bydate&start_day=1&start_month=1&start_year=2006&start_hour=0&start_min=0&end_day=31& end_month=12&end_year=2007&end_hour=23&end_min=59&output=ascii"

The variable parts (tresolution, yunits, the start year and the end year) are the elements that we can change and set when calling the oulu_pack function.

Note: dtype can be "corr_for_pressure" (we use it in publication CR-EQ) or "pressure_mbar".
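
For instance, the sample request above (years 2006-2007, tresolution = 5, yunits = 1) is simply the pieces of oulu_body glued together with those values. A minimal sketch to check this yourself (the yunits variable here is local, so it can differ from the units = 0 set above):

# reassemble the sample request shown above from the oulu_body pieces
tresolution, yunits = 5, 1
year_begin, year_end = 2006, 2007
url = (oulu_body[0] + str(tresolution) + oulu_body[1] + str(yunits) +
       oulu_body[2] + str(year_begin) + oulu_body[3] + str(year_end) + oulu_body[4])
print(url)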

function oulu_pack

If we set the only_year option to True, the script runs the download loop one year at a time.
If we set only_year to False, it will immediately download everything as one file covering the range year_begin to year_end.
If we specify too large a range, an error will occur because there are too many events in a single request.
In this case, enable the option only_year = True and the data will be downloaded for each year separately (see examples 2 and 3).

In the current version we only change the year, but it is also possible to change the month and day of the start date (at the moment M=01, D=01) and of the end date (at the moment M=12, D=31).
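
A possible extension (a sketch only, not part of import_oulu.py) that exposes the start and end month/day as parameters; the fixed "1 January" / "31 December" parts of oulu_body are rebuilt here inside the function:

# hypothetical variant: build the query URL with configurable start/end month and day
def oulu_url(year_begin, year_end, tresolution,
             start_month=1, start_day=1, end_month=12, end_day=31):
    base = ("http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=OULU"
            "&tabchoice=revori&dtype=corr_for_pressure")
    return (base +
            "&tresolution=" + str(tresolution) +
            "&force=1&yunits=" + str(units) +
            "&date_choice=bydate" +
            "&start_day=" + str(start_day) + "&start_month=" + str(start_month) +
            "&start_year=" + str(year_begin) + "&start_hour=0&start_min=0" +
            "&end_day=" + str(end_day) + "&end_month=" + str(end_month) +
            "&end_year=" + str(year_end) + "&end_hour=23&end_min=59&output=ascii")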

Once the appropriate request has been created, the data is retrieved using the Request and urlopen functions from the urllib.request module, so that we finally have the body of the results page as HTML code (the_page = response.read()).
The result (the_page) must then be processed so that the data can be saved as a CSV file - the download_save function is used for this.
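
If you prefer, the response bytes can also be decoded explicitly (a minimal sketch, not the version used in the scripts here); the scripts keep str(the_page), which leaves the b'...' wrapper and literal "\n" markers in the text, and download_save deals with that:

# sketch: fetch a URL and return decoded text with real newlines
def fetch_text(url):
    req = request.Request(url)
    with request.urlopen(req) as response:
        return response.read().decode("utf-8", errors="replace")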

function download_save

def download_save(the_page, year_begin, year_end):
    page = str(the_page)
    # cut off the unnecessary parts: keep only the text between the first
    # timestamp of the requested range (c1) and the closing </code tag (c2)
    c1 = "%d-01-01 00:00:00;" % year_begin
    new = page[page.find(c1):]
    c2 = '</code'
    End = new.find(c2)
    text = new[0:End]
    text1 = text.replace(";", "; ")  # add a space after each semicolon
    # writing to the output: csv_data/<year_begin>_<year_end>.csv
    new_folder = "csv_data/"
    os.makedirs(new_folder, exist_ok=True)
    file_name = "%s_%s" % (year_begin, year_end)
    f = open(new_folder + file_name + ".csv", "w+")
    length = len(text1)
    part0 = text1
    # split on the literal "\n" markers left by str(the_page)
    # and write each record on its own line
    for count in range(0, length):
        if not part0 == "":
            part1 = part0[0:length].partition('\\n')[0]
            part2 = part0[0:length].partition('\\n')[2]
            part0 = part2
            print(part1, file=f)
    f.close()

The body of the page is the whole page, so we cut everything out of it except the part we are interested in.
That is, we look for the position of the first element (c1) - the start date (e.g. 2000-01-01) - and create a new variable from position c1 to the end (new = page[page.find(c1):]).
At the end there is also unnecessary code, which starts with "</code", so we look for its position (c2); what we are interested in lies between c1 and c2, and we write it into the variable "text".
We also add a space after each semicolon (";" -> "; ") and save the result to the file.
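
As a toy illustration of this trimming (hypothetical page content, not a real NMDB response):

# toy example of the trimming logic used in download_save
page = "...HTML...<code>2000-01-01 00:00:00;6543.0\\n2000-01-01 01:00:00;6550.0</code>..."
c1 = "2000-01-01 00:00:00;"
c2 = "</code"
new = page[page.find(c1):]          # drop everything before the first timestamp
text = new[0:new.find(c2)]          # drop the trailing HTML
print(text.replace(";", "; "))      # 2000-01-01 00:00:00; 6543.0\n2000-01-01 01:00:00; 6550.0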

Examples of downloading data from selected date ranges and different tresolution values

Example 1 (tresolution = 60 min)

def main():
    tresolution = 60
    year_begin, year_end = 1965, 1979
    oulu_pack(year_begin, year_end, tresolution, False)

Example 2 (tresolution = 15 min)

def main():
    tresolution = 15
    year_begin, year_end = 1980, 1999
    oulu_pack(year_begin, year_end, tresolution, True)

Example 3 (tresolution = 5 min)

def main():
    tresolution = 5
    year_begin, year_end = 2000, 2023
    oulu_pack(year_begin, year_end, tresolution, True)

Example CSV (click on the image to see it in full size)

nmdb csv file

Note: Import Mosc

mosc_body = [
    "http://nest.nmdb.eu/draw_graph.php?formchk=1&stations[]=MOSC&tabchoice=1h&dtype=corr_for_efficiency&tresolution=",
    "&force=1&yunits=",
    "&date_choice=bydate&start_day=1&start_month=1&start_year=",
    "&start_hour=0&start_min=0&end_day=31&end_month=12&end_year=",
    "&end_hour=23&end_min=59&output=ascii"]

For chart 1 you must set the value of "units" to 0:

units = 0

For other calculations you must set the value of "units" to 1:

units = 1

In a similar way, you retrieve data from the other stations of interest in the NMDB collection.

For an example, see also the file import_mosc.py, which retrieves data for MOSC; it is 99% the same code as shown here, so we will not describe it separately. (We only changed the dtype from corr_for_pressure to corr_for_efficiency.)

Important!

Different data are used for Figure 1 than for the 6 sigma count.

For chart 1 you must set the value of "units" to 0: units = 0

For other calculations you must set the value of "units" to 1: units = 1
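
For example (a sketch assuming the same request pattern as oulu_pack, with a hypothetical helper name), the only thing that changes between the two runs is the yunits value inserted into the query:

# hypothetical helper: build a MOSC query for one year with a given units value
def mosc_url(year, tresolution, units):
    return (mosc_body[0] + str(tresolution) + mosc_body[1] + str(units) +
            mosc_body[2] + str(year) + mosc_body[3] + str(year) + mosc_body[4])

print(mosc_url(2006, 60, 0))  # data for chart 1 (units = 0)
print(mosc_url(2006, 60, 1))  # data for the 6 sigma count (units = 1)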

mosc diff data

Import Earthquakes

import_eq.py script

Declare libraries and a list of fixed request elements to the database.

import os
import urllib.request

eq_body = [
    "https://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=",
    "%2000:00:00&endtime=", "%2023:59:59&minmagnitude=", "&orderby=time-asc"]

minmagnitude = 2

Declaration of date ranges within a year

year_date_parts = ["-01-01", "-06-30", "-07-01", "-12-31"]
year_date_parts3 = ["-01-01", "-04-30", "-05-01", "-08-31", "-09-01", "-12-31"]
year_date_parts4 = ["-01-01", "-03-31", "-04-01", "-06-30", "-07-01", "-09-30", "-10-01", "-12-31"]
year_date_parts6 = ["-01-01", "-04-30", "-05-01", "-05-31", "-06-01", "-06-30",
                    "-07-01", "-07-31", "-08-01", "-08-31", "-09-01", "-12-31"]

Function to create an EQ database query

def eq_pack(year_begin, year_end, only_year, parts=1):
    if only_year:
        for one_year in range(year_begin, year_end + 1):
            if parts == 1:
                # one request covering the whole year
                req = urllib.request.Request(eq_body[0] + str(one_year) + "-01-01" + eq_body[1] + str(one_year) +
                                             "-12-31" + eq_body[2] + str(minmagnitude) + eq_body[3])
                with urllib.request.urlopen(req) as response:
                    the_page = response.read()
                download_save(the_page, one_year, one_year)
            if parts > 1:
                # split the year into `parts` date ranges (2 by default)
                year_elements = year_date_parts
                if parts == 3:
                    year_elements = year_date_parts3
                if parts == 4:
                    year_elements = year_date_parts4
                if parts == 6:
                    year_elements = year_date_parts6
                for i in range(parts):
                    req = urllib.request.Request(eq_body[0] + str(one_year) + year_elements[i*2] + eq_body[1] + str(one_year) +
                                                 year_elements[i*2 + 1] + eq_body[2] + str(minmagnitude) + eq_body[3])
                    with urllib.request.urlopen(req) as response:
                        the_page = response.read()
                    download_save(the_page, one_year, one_year, i)
    else:
        # a single request covering the whole year_begin..year_end range
        req = urllib.request.Request(eq_body[0] + str(year_begin) + "-01-01" + eq_body[1] + str(year_end) +
                                     "-12-31" + eq_body[2] + str(minmagnitude) + eq_body[3])
        with urllib.request.urlopen(req) as response:
            the_page = response.read()
        download_save(the_page, year_begin, year_end)

Convert the database response to CSV file format - the download_save function

To start with, let's look at what a sample request sent to the earthquake (USGS) database looks like.

"https://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=1989-07-01%2000:00:00&endtime=1989-12-31%2023:59:59&minmagnitude=2&orderby=time-asc"

The variable parts (the start date, the end date and minmagnitude) are the elements that we can change and set when calling the eq_pack function.

function eq_pack

By editing minmagnitude you will get earthquakes with magnitude greater than or equal to the given value.

If we set the only_year option to True, the script runs the download loop one year at a time.
If we set only_year to False, it will immediately download everything as one file covering the range year_begin to year_end. But remember that we can receive a maximum of 20,000 events in a single query.
If we specify too large a range, an error will occur because there are too many events in a single request.
In this case, enable the option only_year = True and the data will be downloaded for each year separately.

Due to the small limit on the number of events (max. 20,000) in a single query, we create several queries/files for most years. The date ranges used for these partial files are specified in the variables year_date_parts and year_date_partsN (where N is 3, 4 or 6), as illustrated in the sketch below.
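
As an illustration (a sketch only, not part of import_eq.py), this is which date list eq_pack picks for a given parts value and the requests it generates for one year:

# illustration: date ranges requested for one year and a given `parts` value
parts_map = {2: year_date_parts, 3: year_date_parts3,
             4: year_date_parts4, 6: year_date_parts6}
year, parts = 2018, 6
year_elements = parts_map[parts]
for i in range(parts):
    start = str(year) + year_elements[i * 2]
    end = str(year) + year_elements[i * 2 + 1]
    print("request %d: %s -> %s" % (i, start, end))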

Once the appropriate request has been created, the data is retrieved using the Request and urlopen functions from the urllib.request module, so that we finally have the body of the response (CSV text) in the_page (the_page = response.read()).
The result (the_page) must then be processed so that the data can be saved as a CSV file - the download_save function is used for this.

download_save function

def download_save(the_page, year_begin, year_end, i=-1):
    page = str(the_page)
    # cut off the unnecessary parts: the leading b' and the trailing '
    # left by converting the bytes to a string
    new = page[2:-1]
    # writing to the output: csv_data_eq/<year_begin>_<year_end>[_<part>].csv
    text1 = new
    new_folder = "csv_data_eq/"
    os.makedirs(new_folder, exist_ok=True)
    file_name = "%s_%s" % (year_begin, year_end)
    if i >= 0:
        file_name += "_%d" % i
    f = open(new_folder + file_name + ".csv", "w+")
    length = len(text1)
    part0 = text1
    # split on the literal "\n" markers left by str(the_page)
    # and write each record on its own line
    for count in range(0, length):
        if not part0 == "":
            part1 = part0[0:length].partition("\\n")[0]
            part2 = part0[0:length].partition("\\n")[2]
            part0 = part2
            print(part1, file=f)
    f.close()

The body of the page is again the whole page, and we cut out everything except the part of interest (page[2:-1]).
For the EQ data this is easier: we only remove the b' prefix and the trailing quote and save the rest to the file.
After getting the data, the loop for count in range(0, length): writes each record on its own line. It works by using .partition("\\n") to find the nearest "\n" marker.
What came before that marker is stored in part1, and that is what we write to the file using print(part1, file=f). What came after the marker is stored in part2, and we then overwrite part0 with part2 before the next iteration.
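
Equivalently (a minimal sketch, not the version used in import_eq.py), the same file can be produced by decoding the bytes and writing real lines:

# sketch: decode the response and write one record per line
def download_save_decoded(the_page, year_begin, year_end, i=-1):
    text = the_page.decode("utf-8", errors="replace")  # the_page is the raw bytes from urlopen().read()
    new_folder = "csv_data_eq/"
    os.makedirs(new_folder, exist_ok=True)
    file_name = "%s_%s" % (year_begin, year_end)
    if i >= 0:
        file_name += "_%d" % i
    with open(new_folder + file_name + ".csv", "w") as f:
        for line in text.splitlines():
            print(line, file=f)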

Examples of downloading data from selected date ranges and different numbers of parts

Example 1 (1960-1969, 10 years = 1 file)

def main():
    year_begin, year_end = 1960, 1969
    eq_pack(year_begin, year_end, False)

Example 2 (1970-1985, 1 year = 1 file)

def main():
    year_begin, year_end = 1970, 1985
    eq_pack(year_begin, year_end, True)

Example 3 (1986-1991, 1 year = 2 files)

def main():
    year_begin, year_end = 1986, 1991
    eq_pack(year_begin, year_end, True, 2)

Example 3b (1992-2017, 1 year = 3 files)

def main():
    year_begin, year_end = 1992, 2017
    # year_begin, year_end = 2019, 2023
    eq_pack(year_begin, year_end, True, 3)

Example 3c (2018, a difficult year: 1 year = 6 files)

def main():
    year_begin, year_end = 2018, 2018
    eq_pack(year_begin, year_end, True, 6)

Directory preview after running the script and downloading the data (click on the image to see it in full size)

eq csv files

Example CSV (click on the image to see it in full size)

eq csv file

How to join the data

Below you will find examples of how data from one source, covering all years, can be combined into a single file. Combining the data in this way allows us to create chart 1 from the publication on the correlation of cosmic rays with earthquakes.

You can find out more about this in the Data Analysis documentation, under 'Examples', subsection "Ex-2 ... "

Note

Import library

from datetime import date, datetime, timedelta
import pandas as pd
import os
import glob

You can find the finished join_data.py script on the CREDO internal page redmine.credo.science/projects/external-data/files

For a better understanding, see the screenshots below

$HOME directory

raw_data directory

directory with EQ files

directory with OULU files

Auger data

def auger_data(date_path):
    # Defining the columns to read
    usecols = ["time", "rateCorr"]
    readed_data = pd.read_csv(date_path, usecols=usecols)
    # add a "datetime" column - if not needed you can comment out the line below
    readed_data['datetime'] = readed_data.apply(lambda x: datetime.utcfromtimestamp(x["time"]), axis=1)
    readed_data.to_csv('raw_data/auger_data.csv', index=False)

An example of how to run the function

auger_data("raw_data/scalers.csv")

In the case of data from the Pierre Auger Observatory we only have one file, so in this example we demonstrate how to read only selected columns from the file and then quickly convert the data from Unix timestamp format to UTC datetime format.

The final result is a CSV file named "auger_data.csv"; this file should be read when creating the chart and in further analyses.
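
The apply/lambda conversion works row by row; as a side note (a sketch, not part of the script), pandas can do the same conversion in a single vectorized call:

# sketch: vectorized alternative, assuming "time" holds unix timestamps in seconds
readed_data = pd.read_csv("raw_data/scalers.csv", usecols=["time", "rateCorr"])
readed_data["datetime"] = pd.to_datetime(readed_data["time"], unit="s")
readed_data.to_csv("raw_data/auger_data.csv", index=False)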

Data from scalers.csv

Data from auger_data.csv

Earthquakes data

def eq_data(date_path):
    usecols = ["time", "latitude", "longitude", "mag"]
    files = os.path.join(date_path, "*.csv")
    files = glob.glob(files)
    df = pd.concat(map(pd.read_csv, files), ignore_index=True)
    df = df[usecols]
    df = df[(df['mag'] >= 4)]  # keep only rows where mag >= 4
    df.to_csv('raw_data/eq_data.csv', index=False)
    # create a second file with the daily sum and the mean over N days (5 days)
    df["time"] = df.apply(lambda x: pd.to_datetime(x['time'].split(".")[0].replace("T", " "), format='%Y-%m-%d %H:%M:%S'), axis=1)
    df['date'] = df['time']
    df2 = df[["time", "date", 'mag']].copy()
    df2 = df2.groupby(pd.Grouper(key='date', freq='1D'), as_index=False)['mag'].agg(['sum', "count"])
    df2 = df2.groupby(pd.Grouper(key='date', freq='5D'), as_index=False)['sum'].agg(['mean'])
    df2["mean"] = df2.apply(lambda x: round(x['mean'], 1), axis=1)
    df2.to_csv('raw_data/eq_data_msum.csv', index=False)

An example of how to run the function

eq_data("raw_data/csv_data_eq")

The source file has a lot of columns, and for the moment we are not interested in most of them, so we load only 4 columns: ["time", "latitude", "longitude", "mag"].

As there are a lot of earthquake data files in the directory, we need to list them and get the full path of each CSV file. We do this using the glob library; then we use the pd.concat function from the pandas library, which maps pd.read_csv over the files and concatenates them into a single DataFrame. Our CSV files do not share a common index, so we pass ignore_index=True.

When we have one concatenated data set, we keep in the 'df' variable only those rows which have magnitude >= 4.

Here we create our first file "eq_data.csv".

We then convert the values in the df['time'] column to a datetime format (required for summing over time).
First we sum the data per day, and then we calculate the average value over 5 days. In both cases we use the capabilities of the groupby function. For greater readability, we round the values in the 'mean' column to 1 decimal place.

When this is ready we save the results to a second file, "eq_data_msum.csv" - it is from this file that the data for chart 1 in the publication on CR-EQ correlations are read.

Note: Pandas allows you to operate very quickly and smoothly on data coming from spreadsheets (Excel, CSV).
The calculation of sums or averages is very fast, but Pandas has its own rules, which require some knowledge for better "surfing" through the data. Do not hesitate to consult the Pandas library documentation.
A very good page with Pandas tutorials is sparkbyexamples.com/python-pandas-tutorial-for-beginners/

E.g.
1) If the data are not regular, you need to systematise them first - e.g. align all values to the minute, hour, day, week, etc.
2) When grouping, if we don't have an index, the grouping key becomes the index and columns that are not aggregated are dropped - so even though we copied the ['time', 'date', 'mag'] columns to df2, in the final file we will only have ['date', 'mag'].
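
A small illustration of point 2 (hypothetical data, not taken from the real files):

# after grouping, only the grouping key and the aggregated values survive
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2000-01-01 05:00", "2000-01-01 18:00", "2000-01-02 07:00"]),
    "date": pd.to_datetime(["2000-01-01 05:00", "2000-01-01 18:00", "2000-01-02 07:00"]),
    "mag": [4.1, 5.0, 4.3],
})
out = df2.groupby(pd.Grouper(key="date", freq="1D"), as_index=False)["mag"].agg(["sum", "count"])
print(out)  # the "time" column is gone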

Data from eq_data.csv

Data from eq_data_msum.csv

OULU data

def oulu_data(date_path):
    usecols = ["time", "value"]
    files = os.path.join(date_path, "*.csv")
    files = glob.glob(files)
    # we read elements like "1997-01-01 00:00:00; 1004.829"
    # the delimiter is ";" rather than ","
    df = pd.concat((pd.read_csv(f, header=None, index_col=False, delimiter=";") for f in files), ignore_index=True)
    df.columns = usecols
    df.to_csv('raw_data/oulu_data.csv', index=False, header=["datetime","value"])
    df["date"] = df.apply(lambda x: datetime.strptime(x["time"], "%Y-%m-%d %H:%M:%S"), axis=1)
    df = df.groupby(pd.Grouper(key='date', freq='5D'), as_index=False)['value'].agg(['mean'])
    df["mean"] = df.apply(lambda x: round(x['mean'], 1), axis=1)
    df.to_csv('raw_data/oulu_data_5Dm.csv', index=False)

An example of how to run the function

oulu_data("raw_data/csv_data_oulu")

This example is not very different from the earthquake function above.
The first difference is that here the character separating the columns (the delimiter) is a semicolon ";" and not a comma ",".
The second difference is that here we do not sum the values; we only calculate the 5-day average.
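
For completeness, a minimal sketch of an equivalent read that passes the column names straight to read_csv instead of assigning df.columns afterwards (files is the list produced by glob, as above):

# sketch: name the columns while reading instead of renaming afterwards
df = pd.concat(
    (pd.read_csv(f, header=None, index_col=False, delimiter=";", names=["time", "value"])
     for f in files),
    ignore_index=True,
)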

Raw data with a frequency of 1 hour

Raw data with a frequency of 15 minutes

Raw data with a frequency of 5 minutes

Combined data with an average value over 5 days (saved as oulu_data_5Dm.csv).

Mosc data

def mosc_data(date_path):
    usecols = ["datetime", "value"]
    df = pd.read_csv(date_path, delimiter=";")
    df.columns = usecols
    df.to_csv(date_path, index=False, header=usecols)
    df["date"] = df.apply(lambda x: datetime.strptime(x["datetime"], "%Y-%m-%d %H:%M:%S"), axis=1)
    df = df.groupby(pd.Grouper(key='date', freq='5D'), as_index=False)['value'].agg(['mean'])
    df = df[df['mean'] > 0]  # drop empty 5-day bins
    df["mean"] = df.apply(lambda x: round(x['mean'], 1), axis=1)
    df.to_csv('raw_data/mosc_data_5Dm.csv', index=False)

An example of how to run the function

mosc_data("raw_data/mosc_data.csv")

For Mosc data we do the same as for Oulu.

But remember to use only the file you downloaded with "units" = 0 set in the import_mosc.py script.

SILSO (Sun) data

def silso_data():
    # Filename: SN_d_tot_V2.0.csv - info https://www.sidc.be/SILSO/infosndtot
    usecols = ["Year", "Month", "Day", "sunspot"]
    df = pd.read_csv("raw_data/SN_d_tot_V2.0.csv", usecols=[0, 1, 2, 4], names=usecols, delimiter=";")
    df = df[(df['Year'] >= 1959) & (df['sunspot'] > -1.0)]
    df["date"] = df.apply(lambda x: datetime(x['Year'], x['Month'], x['Day']), axis=1)
    df2 = df[["date",'sunspot']].copy()
    df2["date2"] = df["date"]
    # calculate the mean value for each month
    df2 = df2.resample('M', on='date2').mean()
    # format the date as "YYYY-MM", e.g. "2023-01"
    df2["date"] = df2.apply(lambda x: x['date'].strftime("%Y-%m"), axis=1)
    df2["sunspot"] = df2.apply(lambda x: round(x['sunspot'],1), axis=1)
    df2.to_csv('raw_data/SN_d_tot.csv', index=False)
    #
    # READ SECOND FILE
    #  Filename: SN_ms_tot_V2.0.csv - info https://www.sidc.be/SILSO/infosnmstot
    usecols = ["Year", "Month", "sunspot"]
    df = pd.read_csv("raw_data/SN_ms_tot_V2.0.csv", usecols=[0, 1, 3], names=usecols, delimiter=";")
    df = df[(df['Year'] >= 1959) & (df['sunspot'] > -1.0)]
    df["date"] = df.apply(lambda x: datetime(int(x['Year']), int(x['Month']), 1), axis=1)
    df["date"] = df.apply(lambda x: x['date'].strftime("%Y-%m"), axis=1)
    df2 = df[["date",'sunspot']].copy()
    df2.to_csv('raw_data/SN_ms_tot.csv', index=False)

An example of how to run the function

silso_data()

First we need to take the first 3 columns (year, month, day) and turn them into a datetime object; we have to do this for each row, and the lambda function helps with this - along the way we create a "date" column.
For easier and faster analysis we create df2 and copy only 2 columns, "date" and "sunspot".

Next, we create a temporary column "date2", which serves as the key when grouping and calculating the monthly average.
Finally, we remove the day from the date ("2023-09-07" => "2023-09") and save. The day does not interest us because the value is the average for the month, not for a specific day of the month.
These operations create the "SN_d_tot.csv" file.
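
As a side note (a sketch, not the version used in the script, assuming df is the DataFrame built above with the "date" and "sunspot" columns), the monthly mean could also be computed without the temporary date2 column by grouping on a monthly period:

# sketch: monthly mean of sunspot numbers without the helper column
df2 = df[["date", "sunspot"]].copy()
df2["date"] = df2["date"].dt.to_period("M").astype(str)   # 2023-09-07 -> 2023-09
df2 = df2.groupby("date", as_index=False)["sunspot"].mean().round(1)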

Raw data from SN_d_tot_V2.0.csv

Data with an average value of 1 month (saved as SN_d_tot.csv).

For the second file we need to take the first 2 columns (year, month) and turn them into a datetime object; we do this for each row with a lambda function, creating a "date" column along the way.
For easier and faster analysis we create df2, copy only the 2 columns "date" and "sunspot", and save them as the "SN_ms_tot.csv" file.
The raw data already have a monthly frequency, so we do not have to do much with them.

Raw data from SN_ms_tot_V2.0.csv

Data with an average value of 1 month (saved as SN_ms_tot.csv).

Create plot

In this documentation we focus on obtaining the raw data and preparing it for analysis; we will not explain here how to make such a graph. That is described in other documentation.

Plot 1

You can find out more about this in the Data Analysis documentation, under 'Examples', subsection "Ex-2 ... "