This guide provides an example of how to load CSV and JSON data sets into the Siren platform. You should adapt it for use with your own data sets.

Note

The data sets used in the example contain millions of records. If you use these data sets, loading may take a long time to complete.

Prerequisites

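The example uses publicly available data from Companies House. If you want to try it for yourself, you can download the basic company data (CSV) and the persons with significant control snapshot (TXT) that the configuration files in this guide refer to.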

Extract the CSV and TXT files and edit the example scripts to match the path and file names. 

Create a configuration file for the company data

Create a plain text file with the following content:

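input {
  file {
    # Use forward slashes, even on Windows: the file input treats the path as a glob pattern.
    path => "C:/data/BasicCompanyDataAsOneFile-2018-01-01.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    # Take the column names from the header row. The csv filter needs a single pipeline worker for this.
    autodetect_column_names => true
    # Name any column without a header column1, column2, and so on.
    autogenerate_column_names => true
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9220"]
    index => "company"
  }
}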

Edit the path to match the location of the CSV file and save it as logstash_csv.conf in the same path as the data set.
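
To make the csv filter settings easier to reason about, the following Python sketch imitates what they do to the file: the first line supplies the field names, each later line becomes one record, and any value without a header gets a generated name. It is an illustration only, not how Logstash is implemented. The function name rows_to_events and the two-column sample are hypothetical and do not reflect the real Companies House layout.

```python
import csv
import io


def rows_to_events(lines, separator=","):
    """Imitate the csv filter with autodetect_column_names and
    autogenerate_column_names enabled (illustrative sketch only)."""
    reader = csv.reader(lines, delimiter=separator)
    header = None
    for row in reader:
        if not row:
            continue  # skip blank lines
        if header is None:
            header = row  # first non-blank line holds the column names
            continue
        event = {}
        for index, value in enumerate(row):
            # Values beyond the header get generated names: column1, column2, ...
            name = header[index] if index < len(header) else f"column{index + 1}"
            event[name] = value
        yield event


# Hypothetical sample; the real file has many more columns.
sample = io.StringIO('CompanyName,CompanyNumber\n"ACME, LTD",01234567\n')
events = list(rows_to_events(sample))
print(events)
```

Note that the quoted company name keeps its comma, which is why a plain split on the separator would not be enough for this data.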

Create a configuration file for the persons with significant control data

Create a plain text file with the following content:

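input {
  file {
    type => "json"
    # Use forward slashes, even on Windows: the file input treats the path as a glob pattern.
    path => "C:/data/persons-with-significant-control-snapshot-2018-01-01.txt"
    start_position => "beginning"
  }
}
filter {
  # Parse each line of the file as a JSON document.
  json {
    source => "message"
  }
  # Nested fields are referenced as [parent][child].
  mutate {
    uppercase => [ "[data][name]" ]
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9220"]
    index => "persons-control"
  }
}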

Edit the path to match the location of the TXT file and save it as logstash_json.conf in the same path as the data set.
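
As a rough picture of what the json and mutate filters do to each line, the Python sketch below parses one line as JSON and converts the nested name field to uppercase. It is a simplified illustration, not Logstash's implementation. The function name transform and the sample record are hypothetical; real snapshot lines carry many more fields.

```python
import json


def transform(line):
    """Parse one line as JSON and uppercase data.name, roughly what the
    json and mutate filters do to each event (illustrative sketch only)."""
    event = json.loads(line)
    data = event.get("data")
    # Only touch the field when it exists and holds a string.
    if isinstance(data, dict) and isinstance(data.get("name"), str):
        data["name"] = data["name"].upper()
    return event


# Hypothetical record for illustration.
sample = '{"company_number": "01234567", "data": {"name": "Jane Doe"}}'
print(transform(sample))
```

Records that lack the nested name field pass through unchanged.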

Load the data

From the command prompt, navigate to the logstash/bin folder and run Logstash with the configuration files you created earlier. For example:

logstash -f C:\data\logstash_csv.conf
logstash -f C:\data\logstash_json.conf
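
To check that documents are arriving, you can ask Elasticsearch how many documents each index holds. The Python sketch below uses the Elasticsearch _count endpoint. It assumes that the node at 127.0.0.1:9220 accepts plain HTTP requests without authentication, which may not be true for your installation. The helper names count_url and fetch_count are hypothetical.

```python
import json
import urllib.request


def count_url(index, host="127.0.0.1:9220"):
    """Build the URL of the _count endpoint for an index.
    Assumes plain HTTP; adjust the scheme if your node uses TLS."""
    return f"http://{host}/{index}/_count"


def fetch_count(index, host="127.0.0.1:9220"):
    """Return the number of documents currently in the index.
    Assumes no authentication is required."""
    with urllib.request.urlopen(count_url(index, host)) as response:
        return json.load(response)["count"]
```

For example, calling fetch_count("company") or fetch_count("persons-control") while Logstash is running should return a number that grows as the import progresses.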

Tip

You can speed up the import process by installing a second instance of Logstash and running the imports concurrently.