After publishing the “Loading data from AWS S3 bucket to Vertica made easy” post, I had people asking whether the S3 function could be used to load data in other formats. The most frequently asked question was about loading GZipped files from an S3 bucket.
The Vertica COPY command expects uncompressed CSV input by default. A GZIP option is available when the source data files are located on a node, but not when using the S3 function, since it is based on the User-Defined Load (UDL) feature.
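To see why the raw object cannot be loaded directly, here is a minimal local sketch (the file name is a stand-in for the sample data used later in this post):

```shell
# Create a tiny sample file and compress it the way the S3 object is stored.
printf 'First record\nSECOND RECORD\n' > sample_load.csv
gzip -c sample_load.csv > sample_load.csv.gz

# The compressed file is a binary stream, not CSV; loaded as-is, those
# bytes would land in the table as garbage.
file sample_load.csv.gz

# Decompressing restores the plain text stream that COPY expects.
zcat sample_load.csv.gz
```

This decompression step is exactly what a filter function can perform inside the database, on the data stream coming from S3.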
Recently I came across the UDx-Examples repo on GitHub with User-Defined Extension (UDx) examples. This is a public repo published by the Vertica team. In it we can find code for Filter Functions, which will let us expand the power of the Vertica Library for AWS.
In this tutorial we will run through use cases where we need to:
- load BZip- and GZip-compressed files;
- load data in a different character set;
- modify data while loading it.
My Environment
- Ubuntu Server 14.04 LTS (used ubuntu-14.04.4-server-amd64.iso)
- Vertica 8.0.0 (used vertica_8.0.0-0_amd64.deb)
Installing Filter Functions to Vertica Database
After doing some research I realized that the UDx-Examples repo contains sample code from the “HP Vertica User Defined Extensions Software Development Kit”. The irony is that this SDK is part of the standard Vertica installation, so it is already sitting on every node of every Vertica cluster. All I needed to do was start using it. 🙂
For the purposes of this tutorial, all I need is the Filter Functions portion of the examples.
root@ubuntu:~# ll /opt/vertica/sdk/examples/FilterFunctions
total 36
drwxr-xr-x  2 root root 4096 Oct 24  2016 ./
drwxr-xr-x 15 root root 4096 May 29 16:20 ../
-rw-r--r--  1 root root 3386 Aug 19  2016 BZip.cpp
-rw-r--r--  1 root root 4583 Aug 19  2016 GZip.cpp
-rw-r--r--  1 root root 5728 Aug 19  2016 Iconverter.cpp
-rw-r--r--  1 root root 4361 Aug 19  2016 SearchAndReplaceFilter.cpp
I will compile the C++ portion of the examples and ignore errors from the Java portion.
We will start by installing the required dependencies:
root@ubuntu:~# apt-get install g++ make zlib1g-dev libbz2-dev libcurl4-openssl-dev libboost-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
g++ is already the newest version.
libboost-dev is already the newest version.
libbz2-dev is already the newest version.
make is already the newest version.
zlib1g-dev is already the newest version.
libcurl4-openssl-dev is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 118 not upgraded.
Note: When functions are installed, they are copied to every node of the cluster, and SQL that references those functions executes on every node. Make sure all dependencies are installed on every node of your cluster. I strongly recommend keeping all your nodes identical to each other, including installed OS packages.
Next we will compile the examples:
root@ubuntu:/opt/vertica/sdk/examples# make
test -d /opt/vertica/sdk/examples/build || mkdir -p /opt/vertica/sdk/examples/build
touch /opt/vertica/sdk/examples/build/.exists
...
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -o /opt/vertica/sdk/examples/build/IconverterLib.so FilterFunctions/Iconverter.cpp /opt/vertica/sdk/include/Vertica.cpp
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -I /usr/include -o /opt/vertica/sdk/examples/build/GZipLib.so FilterFunctions/GZip.cpp /opt/vertica/sdk/include/Vertica.cpp -lz
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -I /usr/include -o /opt/vertica/sdk/examples/build/BZipLib.so FilterFunctions/BZip.cpp /opt/vertica/sdk/include/Vertica.cpp -lbz2
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -o /opt/vertica/sdk/examples/build/SearchAndReplaceFilter.so FilterFunctions/SearchAndReplaceFilter.cpp /opt/vertica/sdk/include/Vertica.cpp
...
/../third-party/jdk/jdk1.6.0_45/bin/javac -g -cp /opt/vertica//bin/VerticaSDK.jar /opt/vertica/sdk/BuildInfo.java -d /opt/vertica/sdk/examples/build/JavaScalarLib
make: /../third-party/jdk/jdk1.6.0_45/bin/javac: Command not found
make: *** [/opt/vertica/sdk/examples/build/JavaScalarLib.jar] Error 127
root@ubuntu:/opt/vertica/sdk/examples#
My output shows success for the Filter Functions, so I am going to ignore the “javac: Command not found” error. I know that Java is not installed on this system, and I do not need anything from the Java portion of the examples in this tutorial.
The file FilterFunctions.sql in /opt/vertica/sdk/examples contains:
- SQL statements to load compiled libraries to Vertica database
- SQL statements to create filter functions
- Commands to generate sample data
- COPY statements to use/test functions
- SQL statements to drop all created functions and libraries.
You can run FilterFunctions.sql, and if it completes without errors you can be sure you are on your way to success. However, when the script finishes, your Vertica database will not have the filter functions: FilterFunctions.sql drops them as its final step.
root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -f FilterFunctions.sql -U dbadmin
Password:
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE TABLE
… a lot of successful statements!!!
DROP TABLE
DROP LIBRARY
DROP LIBRARY
DROP LIBRARY
DROP LIBRARY
root@ubuntu:/opt/vertica/sdk/examples#
I am going to create a copy of FilterFunctions.sql that keeps the statements from Step 1 and Step 2 and removes everything for Step 3 and Step 4. This is easy to do because the file contains comments marking the beginning of each step.
root@ubuntu:/opt/vertica/sdk/examples# sed -n '/-- Step 1:/{p; :loop n; p; /-- Step 3:/q; b loop}' FilterFunctions.sql > create_FilterFunctions.sql
root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -f create_FilterFunctions.sql -U dbadmin
Password:
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -U dbadmin
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.

Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit

dbadmin-> \df *Zip
                        List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 BZip           |                       |
 GZip           |                       |
(2 rows)

dbadmin-> \df Iconverter
                        List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 Iconverter     |                       |
(1 row)

dbadmin-> \df SearchAndReplace
                         List of functions
 procedure_name   | procedure_return_type | procedure_argument_types
------------------+-----------------------+--------------------------
 SearchAndReplace |                       |
(1 row)

dbadmin->
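A simpler alternative to the loop-based sed program is a plain range address. This is a sketch; verify the marker comments against your copy of FilterFunctions.sql before relying on it:

```shell
# Print everything from the "-- Step 1:" marker through the "-- Step 3:"
# marker, then drop the trailing "-- Step 3:" line itself with head.
sed -n '/-- Step 1:/,/-- Step 3:/p' FilterFunctions.sql | head -n -1 > create_FilterFunctions.sql
```

The range address `/re1/,/re2/` is standard sed; `head -n -1` (print all but the last line) is a GNU extension, which is fine on Ubuntu.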
Now we can test our setup. To make it easier to follow, I will use the table and the data file from the “Loading data from AWS S3 bucket to Vertica made easy” post.
Loading GZip data from AWS S3 bucket to Vertica
Loading GZipped data from the S3 bucket without the filter function will result in corrupted data in the table.
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.gz');
 Rows Loaded
-------------
           3
(1 row)

dbadmin=> select * from test_schema.test_load;
             sample_data
-------------------------------------
 /�jXsample_load.csvs�,*.Q(JM�/J�
 vu��sRA.!�E)
 A)��� ���T��ĒD.�Sߠ7
(3 rows)
If we apply the GZip() filter function, our data loads correctly:
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.gz') FILTER GZip();
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)
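If you want to reproduce this test, the gzipped object can be prepared and uploaded like this. This is a sketch: it assumes the aws CLI is installed and configured with write access, and it reuses the bucket name from the earlier post.

```shell
# Compress the sample file from the original post.
gzip -c sample_load.csv > sample_load.csv.gz

# Upload the compressed object to the bucket (aws CLI assumed).
aws s3 cp sample_load.csv.gz s3://bucket-3g7vm6kd/sample_load.csv.gz
```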
Loading BZip data from AWS S3 bucket to Vertica
Loading BZ2 data from the S3 bucket without the filter function will result in corrupted data in the table.
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.bz2');
 Rows Loaded
-------------
           1
(1 row)

dbadmin=> select * from test_schema.test_load;
                             sample_data
---------------------------------------------------------------------
 BZh91AY&SYk~��׀@�.b� TP4�ɐJ��M4h�j=E�h'�*]�[��c3n>�QUc�&{)ʪ��w$S� ��
(1 row)
If we apply the BZip() filter function, our data will be correct:
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.bz2') FILTER BZip();
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)
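The BZip test file can be prepared the same way (again assuming the aws CLI and the bucket name from the earlier post):

```shell
# bzip2 analogue of the gzip preparation step.
bzip2 -c sample_load.csv > sample_load.csv.bz2

# Upload the compressed object to the bucket (aws CLI assumed).
aws s3 cp sample_load.csv.bz2 s3://bucket-3g7vm6kd/sample_load.csv.bz2
```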
Loading files saved in a non-standard character set from AWS S3 bucket to Vertica
Vertica expects to receive all input data in UTF-8; it stores character data in UTF-8 and outputs results in UTF-8. To illustrate the Iconverter() function, I added a Cyrillic translation next to each English string and saved the file as UTF-16. Loading this data from the S3 bucket without the filter function will result in corrupted data in the table.
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_UTF16-BE.csv');
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
                   sample_data
-------------------------------------------------
 ��First record - 5@20O 70?8AL
 SECOND RECORD - B>@0O 70?8AL
 Third Record - "@5BLO 70?8AL
 Much more data - I5 <=>3> 40==KE
(4 rows)
If we apply the Iconverter() filter function, our data will be correct and readable:
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_UTF16-BE.csv') FILTER Iconverter(from_encoding='UTF-16');
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First record - Первая запись
 SECOND RECORD - Вторая запись
 Third Record - Третья запись
 Much more data - Еще много данных
(4 rows)
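For reference, the UTF-16 test file itself can be produced from a UTF-8 original with iconv. This is a sketch; the file names mirror the ones above, and the exact UTF-16 variant iconv emits (byte order, BOM) should be checked against what Iconverter expects:

```shell
# Convert the UTF-8 sample to UTF-16; the BOM iconv writes is part of
# the leading garbage visible in the unfiltered load.
iconv -f UTF-8 -t UTF-16 sample_load_2_lang.csv > sample_load_UTF16-BE.csv

# Sanity check: converting back must reproduce the UTF-8 original.
iconv -f UTF-16 -t UTF-8 sample_load_UTF16-BE.csv
```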
Searching and replacing data while loading from AWS S3 bucket to Vertica
In this example we will search for a specific substring and replace it with something else. First, we load the original, unchanged data set.
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_2_lang.csv');
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First record - Первая запись
 SECOND RECORD - Вторая запись
 Third Record - Третья запись
 Much more data - Еще много данных
(4 rows)
Now let’s use the SearchAndReplace() function to replace ‘rd’ with ‘222’.
dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_2_lang.csv') FILTER SearchAndReplace(pattern='rd', replace_with='222');
 Rows Loaded
-------------
           4
(1 row)

dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First reco222 - Первая запись
 SECOND RECORD - Вторая запись
 Thi222 Reco222 - Третья запись
 Much more data - Еще много данных
(4 rows)
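The same substitution can be previewed locally with sed before running the load. sed is only a stand-in here; the SearchAndReplaceFilter's pattern handling may differ from sed's regular expressions:

```shell
# Replace every occurrence of 'rd' with '222', as in the COPY above.
printf 'First record\nThird Record\n' | sed 's/rd/222/g'
# First reco222
# Thi222 Reco222
```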
Conclusion
Filter functions expand the set of tasks we can accomplish with the help of the Vertica Library for AWS. The example functions that the Vertica team ships in the SDK fill in missing functionality and provide a very good template, encouraging Vertica customers to extend the library even further with functions for their own use cases.