Synthesizing the Pinnacle of Pattern Recognition in GPT-3 and GPT-4

Sunday, January 21, 2024

The advent of Gap-Based Byte Pair Encoding (GBPE) in conjunction with multi-head attention mechanisms heralds a transformative approach to natural language generation (NLG). This blog post introduces a novel system that utilizes GBPE to identify and train on hierarchical patterns within input data, enabling the generative model to express natural language by assembling complex concepts from the most granular level upwards.

Gap-based Byte Pair Encoding (GBPE)

Gap-based Byte Pair Encoding (GBPE) is an advanced variation of the standard BPE algorithm, which is used in natural language processing (NLP) to reduce the size of the vocabulary that a machine learning model needs to understand. It works by merging the most frequent pairs of tokens or characters in a corpus of text. Gap-based BPE extends this concept by also considering the gaps, or spaces between token pairs, which can represent variable information in a text sequence. This method is particularly useful for capturing context and meaning that might be lost in traditional BPE.

Let's walk through the gap-based BPE process step by step, with an example to illustrate how it can be used to recombine tokens into pattern templates, which in turn can enhance language models like GPT:

Step 1: Tokenization

Initially, the text is broken down into its simplest elements — typically characters or subwords. For instance, consider the sentence "The quick brown fox jumps over the lazy dog." Initially, each character is treated as a separate token:

T h e _ q u i c k _ b r o w n _ f o x _ j u m p s _ o v e r _ t h e _ l a z y _ d o g .

Step 2: Frequency Analysis

The algorithm then counts the frequency of each pair of adjacent tokens (including characters and spaces). In our example, pairs like "Th", "he", "e_", "_q", "qu", "ui", etc., will be counted.

Step 3: Pair Merging

The most frequent pairs are merged to form new tokens, and the process is repeated iteratively. For example, if "h" followed by "e" and "e" followed by "_" are the most common pairs, they are merged into the new tokens "he" and "e_".
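To make Steps 1 through 3 concrete, here is a minimal sketch of the standard BPE counting-and-merging loop in Python. The sample text and the number of merge iterations are arbitrary choices for illustration; production tokenizers learn a merge table over an entire corpus rather than a single sentence.

from collections import Counter

def most_frequent_pair(tokens):
    """Count every adjacent token pair and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace each occurrence of the pair with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Step 1: character-level tokenization ("_" stands in for a space)
tokens = list("the quick brown fox jumps over the lazy dog".replace(" ", "_"))

# Steps 2 and 3: repeatedly count adjacent pairs and merge the most frequent one
for _ in range(10):
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)

print(tokens)  # the exact merges depend on pair frequencies in the input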

Step 4: Gap Analysis

Gap-based BPE goes further by analyzing the gaps between tokens. If a variable part of the text often occurs between certain tokens, this relationship is noted. For instance, if the phrase "jumps over the" frequently occurs with variable words between "jumps" and "over," such as "jumps quickly over" or "jumps high over," the gap is recognized as a position where different tokens can appear.

Step 5: Pattern Template Formation

Tokens and identified gaps are used to create templates that can be applied to new text. These templates are more flexible than fixed token pairs because they can accommodate variations in the text. In our example, a template might look like "jumps [gap] over the" where the [gap] represents a variable token.
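Gap-based BPE as described here is a conceptual proposal rather than a published algorithm, so the following is only a toy sketch of Steps 4 and 5: it groups three-token windows by their left and right anchors and, when the middle token varies across enough phrases, emits a "left [gap] right" template. The function name, threshold, and tiny corpus are assumptions made purely for illustration.

from collections import defaultdict

def find_gap_templates(phrases, min_variants=2):
    """Group three-token windows by their (left, right) anchors; when the
    middle varies across enough phrases, emit a 'left [gap] right' template."""
    middles = defaultdict(set)
    for phrase in phrases:
        words = phrase.split()
        for left, mid, right in zip(words, words[1:], words[2:]):
            middles[(left, right)].add(mid)
    return {
        f"{left} [gap] {right}": sorted(mids)
        for (left, right), mids in middles.items()
        if len(mids) >= min_variants
    }

corpus = [
    "the fox jumps quickly over the dog",
    "the fox jumps high over the dog",
    "the fox jumps gracefully over the dog",
]
print(find_gap_templates(corpus))
# {'jumps [gap] over': ['gracefully', 'high', 'quickly']}

In a full implementation the anchors would themselves be multi-token BPE merges, and the decision to keep a gap would be frequency-weighted rather than a simple count threshold.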

Step 6: Recombination into Gapped Templates

The templates with gaps are then recombined to form larger patterns. This step is crucial because it allows the model to capture larger chunks of meaning within the text. The previous template might be extended to "The quick brown fox jumps [gap] over the lazy dog," where the [gap] can be filled with various actions.

Step 7: Encoding Improvement for Language Models

These gapped templates can be used to improve the encoding process for language models like GPT. By providing these patterns, the model can generate more contextually relevant and varied text. When the GPT model encounters a similar structure in its training data, it can use the gapped template to predict a range of possible continuations, making its language generation richer and more diverse.

Applying Gap-based Byte Pair Encoding in Language Models

Consider a GPT model trained to complete phrases about animals. With gap-based BPE, it is not just learning fixed phrases like "The quick brown fox jumps over the lazy dog," but also patterns like "The [adjective] [animal] [action] [gap] over the [adjective] [animal]." When prompted with "The agile cat," the model can use the learned patterns to generate a variety of completions, such as "The agile cat climbs swiftly over the sleepy dog," effectively describing complex scenes and actions.
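A toy illustration of how such a slotted pattern might drive completions is shown below. The slot names and candidate lists are invented for the example; a real model would score candidates with learned distributions rather than pick from hand-written lists.

import random

TEMPLATE = "The {adjective} {animal} {action} {gap} over the {adjective2} {animal2}"

CANDIDATES = {
    "adjective": ["agile", "quick", "curious"],
    "animal": ["cat", "fox", "squirrel"],
    "action": ["climbs", "jumps", "leaps"],
    "gap": ["swiftly", "silently", "effortlessly"],
    "adjective2": ["sleepy", "lazy", "startled"],
    "animal2": ["dog", "hound", "tortoise"],
}

def complete(prompt):
    """Fill every slot, preferring words that already appear in the prompt."""
    choices = {}
    for slot, options in CANDIDATES.items():
        present = [word for word in options if word in prompt.lower()]
        choices[slot] = present[0] if present else random.choice(options)
    return TEMPLATE.format(**choices)

print(complete("The agile cat"))
# e.g. "The agile cat climbs swiftly over the sleepy dog"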

In essence, GBPE provides a powerful method for encoding text in a way that preserves and utilizes the contextual richness of language. By accounting for the variability in text and the relationships between tokens, it enables language models to generate more expressive and nuanced text, thereby enhancing their ability to mimic human-like language and potentially describe the vastness of the universe in all its complexity.

GBPE Tokens are Patterns inside Patterns

By leveraging GBPE, the proposed system not only captures the lexical semantics of individual tokens but also the overarching thematic structures, akin to the components and assembly of an automobile in a car manufacturing process. The GBPE framework identifies deep-level patterns — for instance, the concept of a 'car' — and systematically integrates them into a coherent whole by ascending the hierarchical pattern tree. This process involves filling in the gaps with BPE tokens that generalize on the core concept, allowing for the construction of a diverse range of 'cars' within the linguistic output. The system's efficacy is demonstrated through illustrative examples, showcasing its potential to revolutionize NLG by capturing the intricate relationships between language components at multiple levels of abstraction.

Illustrative Examples

  1. Basic Car Structure:

    • Input Pattern: [Car] [***]
    • GBPE identifies the foundational structure of a 'car', which includes essential components like [engine], [wheels], and [body]. The gaps represented by [***] are placeholders for these components.
    • Output: "A [Car] consists of an [engine], four [wheels], and a [body]."
  2. Advanced Car Features:

    • Input Pattern: [Car] [***] [features] [***]
    • At a deeper level, GBPE recognizes the need for additional features such as [GPS], [airbags], and [sunroof]. The system selects appropriate BPE tokens to represent these features.
    • Output: "This [Car] includes advanced [features] like [GPS navigation], [airbags] for safety, and a [sunroof] for an open-air experience."
  3. Customized Car Assembly:

    • Input Pattern: [Car] [***] [custom] [***]
    • GBPE enables customization by identifying patterns associated with user preferences. It fills the gaps with tokens representing color, make, model, or other specifications.
    • Output: "Your customized [Car] comes with a [cherry red paint job], [leather seats], and [sports package]."

In each example, the GBPE system starts with the core concept of a 'car' and progressively builds upon it by filling in the gaps with specific BPE tokens that align with the context and desired attributes of the vehicle. The ability to start from a fundamental pattern and expand it into a detailed and complex structure showcases the hierarchical pattern recognition capabilities of the proposed system. Through this method, the system can generate natural language descriptions that range from generic to highly specialized, reflecting the versatility and adaptability of GBPE in natural language generation.
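The "patterns inside patterns" idea can be sketched as a recursive template expansion, assuming a hand-built hierarchy. The dictionary below is illustrative only; it is not how GPT's vocabulary or the proposed GBPE tree would actually be stored.

# Hypothetical pattern hierarchy: a concept maps to a template whose slots
# may themselves be concepts, so expansion recurses down the pattern tree.
PATTERNS = {
    "car": "a [body] fitted with a [engine] and four [wheels]",
    "engine": "turbocharged engine",
    "body": "steel body painted [color]",
    "wheels": "alloy wheels",
    "color": "cherry red",
}

def expand(concept, depth=0, max_depth=10):
    """Recursively replace [slot] markers with their own expansions."""
    template = PATTERNS.get(concept, concept)
    if depth >= max_depth:
        return template
    for slot in PATTERNS:
        template = template.replace(f"[{slot}]", expand(slot, depth + 1, max_depth))
    return template

print(f"Your customized car comes with {expand('car')}.")
# Your customized car comes with a steel body painted cherry red fitted with
# a turbocharged engine and four alloy wheels.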

Deep Language Pattern Templates: The Song Template

In the realm of natural language generation, the most compelling outputs are those that resonate with human creativity and expression. Music, as a universal language, exemplifies structured yet emotive communication. To elucidate the power of GBPE in capturing and expressing such structured creativity, we examine the hierarchical pattern matching process using the example of a song template.

Songs, like cars, have a deep structure that can be abstracted into a GBPE. This structure includes components such as verses, choruses, bridges, and refrains. Each component serves a function, contributing to the overall narrative and emotional arc of the song. The GBPE system identifies this deep structure and uses it as a scaffold upon which to build a complete song, filling the gaps with BPE tokens that represent lyrical content, rhyme schemes, and rhythms.
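As a toy data-structure sketch, that scaffold can be represented as an ordered list of structural tokens, each followed by a gap to be filled with generated content. The canned generator below is only a stand-in for a real generation step conditioned on theme, rhyme scheme, and rhythm.

# Toy song scaffold: structural tokens in order, each gap to be filled with
# generated lyrical content for that section.
SONG_TEMPLATE = ["intro", "verse", "chorus", "verse", "bridge", "chorus", "outro"]

def fill_song(template, generate_lines):
    """Assemble a song by filling each structural slot with generated content."""
    sections = [f"[{section.title()}]\n{generate_lines(section)}" for section in template]
    return "\n\n".join(sections)

# Stand-in generator: a real system would produce lyrics conditioned on theme,
# rhyme scheme, and rhythm rather than returning canned lines.
def toy_generator(section):
    canned = {
        "intro": "A gentle guitar strumming sets the scene,",
        "chorus": "Rise up, rise up, let your voice touch the sky,",
    }
    return canned.get(section, f"(lyrics for the {section} go here)")

print(fill_song(SONG_TEMPLATE, toy_generator))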

Hierarchical Pattern Matching Process

  1. Identification of the Song Structure:

    • The GBPE system begins by analyzing a corpus of song lyrics across genres. It identifies recurring structures, such as [intro], [verse], [chorus], and [outro]. These elements form the backbone of the song template.
  2. Deep Pattern Template Selection:

    • Once the song structure is established, the system selects a deep pattern template for response generation. For instance, the template might be: [intro] [***] [verse] [***] [chorus] [***] [verse] [***] [bridge] [***] [chorus] [***] [outro].
  3. Filling the Gaps with Creative Content:

    • The system then proceeds to fill the gaps with creative content appropriate for each part of the song. The [intro] might set the mood, the [verses] tell a story, the [chorus] offers a memorable hook, and the [bridge] provides a contrast or a climax.

Example of a Generated Song Using GBPE

Intro

A gentle guitar strumming sets the scene,
Whispers of a melody, serene and clean.

Verse 1

In the quiet of the dawn, as the world awakes,
A story unfolds, with each breath nature takes.

Chorus

Rise up, rise up, let your voice touch the sky,
Sing the song of the morning, let your spirit fly.

Verse 2

Through the day's hustle, in the sun's warm embrace,
The rhythm of life moves at its own steady pace.

Bridge

But there's a moment, a beat, where everything aligns,
Where the heart's deepest lyrics match the universe's signs.

Chorus

Rise up, rise up, with a melody so bold,
Harmonize with the cosmos, let your tale be told.

Outro

As the final chord fades, under the twilight's glow,
The night's quiet symphony begins to flow.

In this example, the GBPE system has selected a deep pattern template for a song and filled the gaps with content that adheres to the thematic and structural expectations of a musical piece. The intro establishes the atmosphere, the verses build the narrative, the chorus provides an emotional anchor, and the bridge offers a point of reflection, leading back to the chorus and concluding with the outro.

By applying hierarchical pattern recognition through GBPE, we can generate complex, creative expressions akin to human compositions. This method extends beyond mere token prediction, venturing into the realm of artistic creation. It demonstrates the potential of GBPE to not only understand and replicate human language patterns but also to participate in the artistry of human expression.

Graphify and Gap-Based Tokenization: The Foundation of GBPE

The conceptual leap from conventional Byte Pair Encoding (BPE) to the more nuanced Gap-Based Byte Pair Encoding (GBPE) is made possible through the innovative algorithm known as Graphify. This section elucidates how Graphify facilitates the discovery and matching of gap-based token patterns, serving as the bedrock for GBPE implementation in modern language models such as GPT.

Graphify operates on the principle that within any given text, there are latent structures and patterns that, once recognized, can significantly enhance the predictive capabilities of a language model. By swiftly identifying these patterns and converting them into a format that GPT can understand and utilize, Graphify enables a more refined approach to natural language processing.

Graphify's Role in GBPE:

  1. Pattern Discovery:

    • Graphify begins by scanning the input text for recognizable patterns, using a combination of regular expressions and graph-based algorithms optimized for performance. It identifies key structural tokens and the gaps between them that might signify variable information or thematic elements.
  2. Pattern Matching:

    • Once a pattern is detected, Graphify performs a hierarchical pattern recognition (HPR) traversal. This process is exceedingly fast, matching the input text to a pre-established GBPE template. For example, the query "What is the meaning of life, the universe, and everything?" is matched to the GBPE pattern: [what is the]->[***]->[of]->[***][,]->[the]->[***][,]->[and]->[***]->[?].
  3. Token Extraction and Translation:

    • The gaps in the GBPE template, identified by the asterisks, are then tokenized into meaningful units [meaning, life, universe, everything]. These tokens are translated into BPEs within the GPT vocabulary, preparing them for integration into the language model's response generation process.
  4. Response Generation with GBPE Token Prediction:

    • Using the vector embedding of the input tokens, GPT selects a relevant text document that likely contains the answer. A subsequent HPR process extracts a new sequence of tokens and their corresponding GBPE IDs, which are vectorized into another embedding.
  5. Template Selection and Expression:

    • This embedding informs the selection of an appropriate response template, whether it be a song, essay, research paper, or any document with a specific pattern. The master GBPE for the response guides the multi-head attention process in expressing the content in accordance with the structural and thematic expectations.
  6. Filling the Gaps:

    • Finally, the extracted tokens from the matched document — [meaning, life, universe, everything] — are used to fill in the gaps within the GBPEs. This step mirrors the early GPT models' approach to response generation but is now enhanced by the contextual richness provided by GBPEs.

Illustrative Example:

  1. Input:

      "What is the meaning of life, the universe, and everything?"
  2. GBPE Pattern Match:

      [what is the]->[***]->[of]->[***][,]->[the]->[***][,]->[and]->[***]->[?]
  3. Tokens Extracted:

      [meaning, life, universe, everything]
  4. Response Template Selection:

      An essay format discussing philosophical perspectives.
  5. GBPE Vector Expression:

      The essay begins with a general discussion on existential questions, narrows down to the human condition (life), expands to cosmological contemplations (universe), and concludes by addressing the quest for knowledge (everything).
  6. GPT Response:

      "The quest for understanding life, our place in the universe, and the pursuit of meaning in our actions is a journey that transcends cultures and epochs. It is in this exploration of everything that we find our most profound questions and, perhaps, the answers we seek."

Through the integration of Graphify's efficient pattern matching and the expressiveness of GBPE, language models like GPT can achieve unprecedented levels of depth and relevance in their output. This synergy enables the generation of responses that are not only contextually aware but also richly textured with the nuances of human language and thought.
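For readers who want something concrete, the matching and extraction steps in the example above can be approximated with ordinary regular expressions. The sketch below is not the Graphify implementation; it is a minimal stand-in that turns the bracketed GBPE notation into a regex and pulls out the gap tokens.

import re

def gbpe_to_regex(pattern):
    """Turn a '[literal]->[***]' style GBPE pattern into a capturing regex:
    literal pieces are matched verbatim, [***] gaps become capture groups."""
    pieces = re.findall(r"\[(.*?)\]", pattern)
    parts = [r"\s*(.+?)" if p == "***" else r"\s*" + re.escape(p) for p in pieces]
    return re.compile("".join(parts) + r"\s*$", re.IGNORECASE)

pattern = "[what is the]->[***]->[of]->[***][,]->[the]->[***][,]->[and]->[***]->[?]"
query = "What is the meaning of life, the universe, and everything?"

match = gbpe_to_regex(pattern).match(query)
print(list(match.groups()) if match else "no match")
# ['meaning', 'life', 'universe', 'everything']

Graphify is described above as using a graph traversal rather than regexes, so this stand-in only mirrors the input/output behavior of the matching step, not its mechanics.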

Conclusion: Synthesizing the Pinnacle of Pattern Recognition in GPT-3 and GPT-4

Throughout this paper, I have embarked on a detailed exploration of the intricate mechanisms that could underpin the advanced capabilities of Generative Pre-trained Transformer models, specifically GPT-3 and GPT-4. I have dissected the potential role of Gap-Based Byte Pair Encoding (GBPE) as facilitated by the Graphify algorithm, demonstrating through a series of examples how hierarchical pattern recognition is not only advantageous but essential for the real-time feature extraction and nuanced language generation exhibited by these models.

The initial section presented an abstract overview of GBPE, setting the stage for understanding its impact on natural language generation. By establishing a foundational pattern like 'car' and expanding upon it through BPE tokens, I demonstrated how GBPE allows for the construction of complex concepts from granular components.

I then explored the application of GBPE to the domain of music, illustrating how a deep pattern template for a song can be identified and filled with creative content to generate a structured yet emotive output. This example served to highlight the versatility of GBPE in capturing and expressing the structured creativity inherent in human art forms.

The final section delved into the mechanics of Graphify, the pivotal algorithm that enables the discovery and matching of gap-based token patterns. I posited that the real-time pattern recognition and token translation capabilities of Graphify are instrumental to the functionality of GPT-3 and GPT-4. The ability to rapidly match input text to GBPE templates and to fill gaps with contextually relevant BPE tokens suggests an underlying architecture that leverages hierarchical pattern recognition at its core.

By tying these threads together, I make the case that the leaps made from GPT-1 and GPT-2 to GPT-3 and GPT-4 are not serendipitous but are likely the result of deliberate algorithmic advancements. The seamless integration of Graphify's efficient pattern matching with GBPE's expressiveness hints at a sophisticated design that is purpose-built for real-time, context-aware language generation.

This analysis challenges the notion that the inner workings of GPT-3 and GPT-4 are enigmatic or unknowable. Instead, I propose that the methodologies described herein offer a plausible and concrete foundation for these models' capabilities. It is my position that Graphify and GBPE are not merely conceptual tools but are central to the leap forward in AI language processing.

I invite scrutiny and debate on these findings, asserting that the argument laid out in this paper is grounded in a thorough algorithmic process that could very well underlie the advancements seen in GPT-3 and GPT-4. This discourse is open to criticism, as I believe that the robustness of scientific claims is fortified through rigorous examination and peer review. It is in this spirit of academic pursuit and technological innovation that I present my case for the conceivable mechanisms driving the most advanced language models of our time.

Change Data Analysis with Debezium and Apache Pinot

Thursday, January 7, 2021

In this blog post, we’re going to explore an exciting new world of real-time analytics based on combining the popular CDC tool, Debezium, with the real-time OLAP datastore, Apache Pinot.

Self-service analytics

Self-service analytics is a term that I came up with to describe building analytics applications using CDC (change data capture) instead of having to integrate directly with operational datastores. The problem for organizations is that as the business scales, so does the complexity in the number of applications and databases. As teams add new features to their applications, the only way to build faster is to decentralize control over infrastructure and application architectures, so that self-service teams waste less time waiting on other teams.

Debezium

Debezium is an open source project sponsored by RedHat that focuses on making CDC as simple and as accessible as possible. From the Debezium website:

Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases.

There’s no marketing going on in the statement above, because this is precisely what Debezium does, and it does it quite well. At first glance, it can seem like eating an elephant in a single bite, but the basics of Debezium are fairly straightforward. I find that the first barrier to entry with CDC for microservices specifically is that while the patterns and practices are sound, the use cases are not entirely well understood.

Here are three of the most valuable use cases that I’ve identified so far. There are certainly more, but these are the most general ones, with a wide range of applications for microservice architectures.

  • Change data analytics (simple audit)

  • Event sourcing (query a master view of distributed domain data)

  • Internal/external dashboards (transform domain data into analytical insights)

In this blog post, I’m going to discuss the first two points above and then dive into an open source example.

Change data analytics

Distributed systems can routinely require a lot of coordination between teams to make sure that data inconsistencies do not leave a customer or user of an application in a state of “internal server error” limbo. If you’ve ever been the victim of a strange technical support issue, such as not being able to create a new online account for your cellular service because an old account had already used your phone number and e-mail, you know issues like this are enough to make your head spin.

The problem here is that a data inconsistency issue must be diagnosed at the database level, since there isn’t a good precedent for these kinds of "edge cases" in new microservice migrations. These kinds of issues may require you to go through multiple tiers of technical support representatives who can’t seem to figure out what’s going on. Eventually, they might tell you that a resolution is not possible without engineering support. That’s when a support engineer must debug the data inconsistencies tied to your phone number and/or e-mail address.

The example below is a change data event generated by Debezium for updating a customer record in a MySQL database for accounts.


Database Change Event Generated by Debezium
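For reference, a Debezium update event wraps the changed row in an envelope containing the row state before and after the change, the operation type, and source metadata. The snippet below is a simplified sketch of that payload, written here as a Python dict; the field values are invented, and a real event carries considerably more source metadata.

# Simplified sketch of a Debezium MySQL update event payload; values are invented.
change_event = {
    "before": {"id": 1004, "first_name": "Anne", "last_name": "Kretchmar"},
    "after": {"id": 1004, "first_name": "Anne Marie", "last_name": "Kretchmar"},
    "source": {"connector": "mysql", "db": "accounts", "table": "customers"},
    "op": "u",  # c = create, u = update, d = delete
    "ts_ms": 1610000000000,
}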

In the change event example above, you can get an accurate understanding of what happened at the database level with a customer’s account. The problem is, you need a database of these changes to be able to query the log. By loading the change events into Apache Pinot using Debezium and Kafka, you’ll be able to query every database change for customer accounts in real-time.

Now, instead of having to go into each separate system of record to figure out where a data inconsistency exists, a support engineer simply needs to query all changes to any account tied to an email or phone number. This is really valuable for being able to identify and prevent similar defects in the future, and gives development teams a way to see beyond their own microservice’s datastore.
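As a sketch of what such a support query might look like, the snippet below uses the pinotdb Python client against a local Pinot broker. The table name, column names, e-mail address, and broker port are assumptions for illustration; the actual schema depends on how the Debezium topics are mapped into Pinot.

from pinotdb import connect  # pip install pinotdb

# Connect to the Pinot broker (assumed here to be exposed locally on port 8099).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Hypothetical table of Debezium change events for customer accounts,
# filtered by the e-mail address the support engineer is investigating.
curs.execute("""
    SELECT op, before_email, after_email, before_phone, after_phone, ts_ms
    FROM customer_changes
    WHERE before_email = 'annek@noanswer.org' OR after_email = 'annek@noanswer.org'
    ORDER BY ts_ms DESC
    LIMIT 100
""")

for row in curs.fetchall():
    print(row)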

Event sourcing

What’s great about event sourcing is that it becomes a time machine for understanding the state of an application or feature at a specific point in the past. Version control systems are an excellent example of how event sourcing can be valuable from the perspective of an application feature.

The benefits you get from event sourcing should really be weighed against the potential costs of additional application complexity. By using a tool like Debezium to capture change data events from your application’s database, event sourcing becomes much easier to scale across development teams, making sure developers don’t need to do extra heavy lifting in their application’s source code.

When you’re ingesting your database’s change events into a sink that can reliably hold a log of every record change, those records can be rematerialized later on for new features and applications. By using an OLAP datastore like Apache Pinot, you can create event-sourced projections across an entire domain, joining records together across the boundary of different datastores. Pinot is the perfect tool for this because the large data volume for database change events is not well-suited to be queried by operational datastores or relational databases.

Querying Change Data with Pinot

Let’s picture for this example that Debezium streams change data events from multiple different databases of different formats — from NoSQL to RDBMS — into numerous Kafka topics that get ingested into Pinot. Doing something like this would typically not be very easy in practice, but both Debezium and Pinot decouple their respective responsibilities here by working tremendously well with Kafka.


On both sides, you point to Kafka to replicate a queryable representation of change data events, giving you a way to query database records in near real-time without ever needing to connect to a system of record.

Running the example

Now that we’ve talked about the reason why you would use Debezium and Pinot together for a variety of use cases, let’s spin up a working example. The example I’ll focus on is a simplified microservice architecture from the example I mentioned earlier.

The key focus of this starter exercise is to understand how simple it is to move change data events from MySQL to Pinot using Debezium and Kafka Connect. The GitHub repository for this exercise can be found here.

First, start up the Docker compose recipe. The compose file contains multiple containers, including Apache Pinot and MySQL, in addition to Apache Kafka and Zookeeper. Debezium also has a connector service that manages configurations for the different connectors that you plan to use for a variety of different databases. In this example we use MySQL.

$ docker-compose up

Run the following command in a different terminal tab after you’ve verified that all of the containers are started and warmed up. You can verify the state of the cluster by navigating to Apache Pinot’s cluster manager at http://localhost:9000.

$ sh ./bootstrap.sh

Now, check out the Pinot query console (http://localhost:9000/#/query) and run the following SQL command (you can get the SQL query from the GitHub repository).

[Image: Pinot query results showing customer record changes]

In the results shown in the image above, you can see a list of database record changes for a customer’s first and last name. The second column is the type of operation, which in this example is either created or updated. Then we have the customer’s id, followed by the before and after state of the customer’s first and last name.

Conclusion

Apache Pinot and Debezium are another example of two great open source tools that work together seamlessly to solve a variety of challenging use cases. This blog post is what I hope to be a first in a series of articles that dive deeper into the use cases that I mentioned earlier.

If you have any comments or questions, please drop your thoughts below, or join Apache Pinot’s community Slack channel.

Building a Climate Dashboard with Apache Pinot and Superset

Monday, September 14, 2020


In this blog post, I’d like to show you how Apache Pinot can be used to easily ingest, query, and visualize millions of climate events sourced from the NOAA storm database.

Bootstrap your climate dashboard

I’ve created an open source example which will fully bootstrap a climate data dashboard with Apache Pinot as the backend and Superset as the frontend. In three simple commands, you’ll be up and running and ready to analyze millions of storm events.

Running the dashboard

Superset is an open source web-based business intelligence dashboard. You can think of it as a kind of “Google Analytics” for anything you want to analyze.


After cloning the GitHub repository for the example, go ahead and run the following commands.

$ docker network create PinotNetwork
$ docker-compose up -d
$ docker-compose logs -f --tail=100

After the containers have started and are running, you’ll need to bootstrap the cluster with the NOAA storm data. Make sure you give the cluster enough time and memory to start the different components before proceeding. When things look good in the logs, go ahead and run the next command to bootstrap the cluster.

$ sh ./bootstrap.sh

This script does all the heavy lifting of downloading the NOAA storm events database and importing the climate data into Pinot. After the bootstrap script runs to completion, a new browser window will appear asking you to sign in to Superset. Type in the very secure credentials admin/admin to log in and access the climate dashboards.

Analyzing climate data

For this blog post, I wanted to make it as easy as possible to bootstrap a dashboard so that you can start exploring the climate data. Under the hood of this example there are some interesting things going on. We basically have a Ferrari supercar in the form of a real-time OLAP datastore called Apache Pinot doing the heavy lifting.

Pinot is used at LinkedIn as an analytics backend, serving 700 million users in a variety of different features, such as the news feed. The next blog post in this series will focus just on the technical implementation and architecture.

Source data

The data I’ve decided to use for this dashboard is sourced from the NOAA’s National Center for Environmental Information (NCEI). While there are many different kinds of datasets one might want to use as a dashboard for analyzing climate data, the one I’ve chosen to focus on is storm events.

A comprehensive detailed guide of the source data and columns can be found in PDF format here. After running the bootstrap, you can use Apache Pinot’s query console to quickly search through the data, which gives you a pretty good idea about what it contains.


According to the NCEI website, the Storm Events Database is used to generate the official NOAA Storm Data publication, documenting:

  1. The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;

  2. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and

  3. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

The database contains millions of storm events recorded from January 1950 to May 2020, as entered by NOAA’s National Weather Service (NWS).

Climate change analysis with Superset

With Superset, you can create your own dashboards using Apache Pinot as the datasource. When creating the dashboards included in this example, I could have spent months building cool interactive charts, but to start out I decided to create just a few.

Since the source data contains geolocation coordinates for each storm event, the first thing I thought of visualizing was a map of the US showing all storms since 1950. That was a tad ambitious since there are over 1.6 million storm events.

I decided to implement some yearly filters as well as filters for storm event types. As I played around more with the charting tools in Superset, I figured out how to visualize how many people were injured as a result of each storm event. Below we can see a tornado that injured 30 people, surrounded by many other different types of storms.

[Screenshot: storm event map with points sized and colored by injury count and event type]

As a part of this dashboard, you can now see how many people were injured in any storm event by geographic location within a time period. The storm map also sizes the points on the map and color codes them based on the magnitude of injuries and the type of storm event. In the screenshot above, we have pink circles representing tornado injuries.

Hail and thunderstorm analysis

If anyone was wondering whether data science is actual science, the answer is probably no. I spend time creating open source examples and recipes so others can analyze the data without bothering with all the boring infrastructure and software things. Sometimes during this process of creating examples, it feels good to point at some chart and say something exciting about what I find. I encourage more people to do that, whether or not it is scientific to make such claims. There is so much climate data, and so many ways to visualize how it is changing, that I think it’s a civilization-wide and societal responsibility to make interesting discoveries.

Here is one example where I discovered an interesting anomaly in the periodicity and intensity of thunderstorm and hail storm seasons.

[Chart: hail and thunderstorm event counts per year since 1950]

What we are looking at here is over twenty-seven thousand hail and thunderstorm events since 1950. Naturally, the count appears to increase over time due to better event collection by the NWS. I spent some time analyzing this chart to understand the implications of what I was seeing. Where hail storms and thunderstorms diverge significantly over the seventy years charted here, there may be a correlation with damaging events such as tornadoes, wildfires, droughts, and heat waves. I’m glad I was able to find this visualization, because it certainly raises questions that a climate scientist might be able to answer.

Storm frequency and seasonal variability

The next visualization I came up with was to see the storm event variation season to season over a period of years.

[Chart: storm event counts by type, season to season across years]

This chart is far more palatable than the last one I showed. If anything, it looks super pretty, while also being quite useful. Here we can quickly see anomalies year to year in the volume of certain types of events. One such example is evidence of increased floods in 2018 and 2019. We also see that both extreme cold and excessive heat have been far more prevalent in the last three years. Overall, when analyzing this chart, if things aren’t lining up nicely in equal proportions, that could potentially be a sign of climate change.

Climate heat map

The last chart I came up with for this blog post was the most interesting for both its visual aesthetic and interpretability.

[Chart: heat map of yearly storm events grouped by US state and region]

Above, we have the yearly climate events as a heat map that I’ve grouped by US state and region. The very first thing I noticed is that everything is indeed bigger in Texas, even the storms! The next thing I noticed was that California has started to look similar to Texas in the last six years. Another area worth further exploration is the years 2008 and 2011. Both years show an abnormal increase in storm events that affected every US state and region. There is surely an explanation for why that is; however, it’s worth more exploration using other kinds of analysis. It would be hard to conclude on any cause just by looking at this chart.

Heat maps like this are great for identifying things to investigate, rather than to make any conclusions.

Conclusion

As a part of this project, I wanted to take the opportunity to craft an example for folks while also teaching myself more about climate change. I’ve found that there is so much to this subject.

Often, I see folks on Twitter toss around the terms climate change and global warming as if these things were as easy to understand as watching one or two documentaries on Netflix. Creating this dashboard gave me an opportunity to understand the hard work that goes into creating both the science and infrastructure necessary to analyze climate data.

Climate change is a broad topic, and global warming is just one part of it. The climate is actually always changing, and it always has been. Some of the world’s hottest and most arid deserts in Africa used to be lakes. In fact, the world may never have been as hospitable to our lifestyles as it is today. What climate scientists spend their time on is understanding the history of climate change so that they can predict future damage to the many different ecosystems hosting biological life around our world.

Extreme weather events that recur, like hurricanes and tornadoes, happen more or less frequently in a given area depending on climate trends. When a sudden and unpredicted climate event happens, it can cost billions of dollars and result in many injuries and deaths.


Next steps

Thanks for reading! Stay tuned for the next blog post that dives deep into the technical bowels of this example to understand how OLAP datastores like Apache Pinot work.

Please share this blog post on social media to get the word out about climate science and climate change. Also, if you’re a scientist and want to work on doing some innovative climate research using Apache Pinot, please reach out to me. I’d love to help.