254 reads

Unlocking the Power of Advanced Data Types in Big Data

by deepakjayabalanMay 2nd, 2024

Too Long; Didn't Read

The gigantic mountains of big data, which are getting bigger by the day, always pose a problem in terms of storage, analysis, and meaningful insight extraction. As one might suppose, this is where the advanced SQL data types save the day. Features of the specialized data types near integers and strings, which we use in every-day life, will allow us to store and operate complex data structures with great ease. Moreover, mastering them takes us straight on to solving the very core problems of the big data realm, in this way enabling more in-depth analysis and thus more valuable outcomes.

featured image - Unlocking the Power of Advanced Data Types in Big Data

Please note that the examples provided in the article are from Presto DB and the syntaxes might change according to your database.

Lets first discuss the different advanced SQL data types:

ARRAY

An array is a collection of values of the same type that are sorted by their index. It is very similar to a list from Python. For example, it can be an array of numbers: 1, 2, 3. It is possible to retrieve information from an array in SQL by using the “UNNEST” function and the CROSS JOIN operator . Such a combination will allow us to turn an array into individual rows and then join these rows with a table or subquery . Here is how this query might look:

SELECT
   numbers_array,
   number
FROM (
   VALUES
       (ARRAY[2, 5]),
       (ARRAY[7, 8, 9])
) AS x (numbers_array)
CROSS JOIN UNNEST(numbers_array) AS t (number);

MAP

A map is a collection of key-value pairs where all the keys and values should be of the same type. A map is known as a dictionary in Python. Map is a very useful data type that allows you to organize a huge amount of structured information in one container. If you have a lot of columns and you don’t know much about its specifics, do not worry you can organize everything in the forms of pairs and deal with a single container instead of dealing with a variety of subqueries. In order to query a map in SQL, you have to utilize the UNNEST function and CROSS JOIN operator, which joins the two selected rows of another table. Therefore, UNNEST use the opened pairs in the form of separate rows, whereas CROSS JOIN open rows and joins them with the selected ones. For example, a query is:

SELECT
   animals, a, n
FROM (
   VALUES
       (MAP(ARRAY['dog', 'cat', 'bird'], ARRAY[1, 2, 0])),
       (MAP(ARRAY['dog', 'cat'], ARRAY[4, 5]))
) AS x (animals)
CROSS JOIN UNNEST(animals) AS t (a, n);

ROW(STRUCT)

A row or struct is a group of fields in which each field has a unique name and data type. A struct is a tuple in python. For instance, a struct is a struct “x” with a value of 1 and a struct “y” with a value of 2.0. A struct is a group of fields in which each field has a name and data type. It lets us design columns that are significant but are not often used. Instead of having multiple columns, we can have a single struct column with all the required information. It is like a map but with predetermined columns and is more constant. You can get to fields utilizing a field reference operator. To query a struct in SQL, we can utilize the UNNEST function and the CROSS JOIN operator. The CROSS JOIN operator combines the generated rows with the other table or subquery, while the UNNEST function unrolls the struct into new columns. The Query is:

WITH struct_table AS (
   SELECT
       ARRAY[
           CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)),
           CAST(ROW(2, 3.0) AS ROW(x BIGINT, y DOUBLE))
       ] AS sample
)
SELECT
   t.*
FROM struct_table s
CROSS JOIN UNNEST(sample) AS t

Array vs Map vs Row

	Array	Maps	Row (Struct)
Analogy in Python	List	Dictionary	Tuple
Definition	Values of the same type	Flexible Keys and Values of the same type	Defined Keys and Values which can have varying types
Uses	Reduced Facts, Cumulative table design	Flexibility - Use when you have tons of columns and you don’t know what all of them are	More stricter version of map as it has predefined columns but if the fields have different data types, we can use this

Comparing Popular DBs

Given below the DBs and their support for these advanced data types with some workarounds if available especially through Json or Xml,

SQL Databases:

PostgreSQL 16:

Array: fully supported in all versions.
Struct: uses composite types pretending to be structs.
Map: no direct, uses hstore or JSONB for the key-value storage.

MySQL 8.0:

Array: does not natively support JSON arrays.
Struct: uses JSON for storage.
Map: uses JSON as a formatter.

Microsoft SQL Server 2022:

Array: no native, use JSON or XML.
Struct: JSON or XML for structured data.
Map: JSON or XML to store maps.

Oracle Database 21c

Array: supported, VARRAYs or nested tables.
Struct: supported, object types for the data structuring.
Map: no direct, associative arrays, and JSON.

NoSQL databases:

MongoDB 6.0:

Array: uses the native system.
Struct: condiment-oriented form, can support structs natively.
Map: the document works as a map.

Apache Cassandra 4.0:

Array: the list is the functional analog of an array.
Struct: supports UDTs.
Map: the data type is natively supported.

Redis 7.0:

Array: the list of Redis, the analog of an array.
Struct: uses hashes.
Map: has the key-value pattern.

Apache HBase 2.4:

Array: does not work directly, use serialization.
Struct: could be serialized.
Map: uses serialization or column qualifiers.

Distributed SQL Query Engine:

Presto Trino 380:

Array: supported as the data type.
Struct: uses the ROW data type, nesting available.
Map: supported, the data type.

Benefits and Use Cases:

Storage

Among the advantages of using advanced data types such as arrays, maps, and structs for rather extensive information is that you can save more data within the single column instead of dividing it into multiple rows or columns. In addition to a much more systematic and comfortable approach for querying, it also cuts storage substantially. The notion is that five separated rows, for example, will be merged into a single row, delivering all those multiple fragmented informational sectors via map or struct. This is especially relevant for large-scale datasets, where the number of users or records can range up to billions. Therefore, using advanced data types will decrease the overall storage and make the data much more efficient.

Table without advanced data type:

user_id	name	dob	device	login_date
101	John Smith	1/1/2000	android	1/23/2023
101	John Smith	1/1/2000	iOS	1/10/2023
101	John Smith	1/1/2000	android	1/07/2023
101	John Smith	1/1/2000	iOS	1/04/2023
101	John Smith	1/1/2000	android	1/02/2023

Table with advanced data type:

user_id	name	dob	last_30_days_login_by_device
101	John Smith	1/1/2000	{“iOS”: [“1/10/2023”,”1/04/2023”], “android”: [“1/23/2023”,”1/07/2023”,”1/02/2023”]}

Compute

As can be seen, this is essentially more than storage saving; for instance, instead of reading X number of rows to gather all the required information, using a map with arrays, one would need to read X/n number of rows. This varies from n operations to merely one; hence, the overall computing cost is diminished to process. As a result, the query would be performed faster, and the system would perform faster.

Modeling

Advanced data types , such as maps and structs, increase the flexibility available in the schema design. In particular, when you deal with large data volumes, creating a column for each such data point is often impossible or inadequate. However, with advanced data types, you can save less important but necessary information for eventual retrieval in one column. In the same way, unifying facts and dimensions into one large table decreases the amount of movement required between tables or partitions, which generally increases system performance and speed of query. In addition, this results in a less complicated system architecture, which is easier to manage.

Conclusion:

However, employing advanced SQL data types starting from the modeling stage suggests significant benefits in the capabilities of computational resources and the efficient use of storage. By enabling support to json formats most of the SQL databases in market today allows us to use these advance data types. By introducing this type of data into the databases design, the firms could optimize the functionality of their databases, query tools, and query types, minimizing the demands they place on resources. Additionally, advanced SQL data types enable the firms to work with various data structures more effectively, which fuels increased scalability and adaptability in the rapidly changing data environment. Altogether, the decision to include these data types do not only optimize the usage of computational resources but also builds the foundation of other data-intensive innovations and also helps the firms stay competitive in the rapidly digitizing world.