
Data Analytics and Anomaly Detection with Jupyter for the GRAX Data Lakehouse on Heroku

From Data Signals to Action: Salesforce Data Anomaly Detection


With the rise in AI usage, businesses are constantly seeking new ways to extract valuable insights from their data. Organizations running Salesforce, in particular, can unlock powerful intelligence by applying advanced analytics to all of their historical data through platforms like the GRAX Data Lakehouse.

One of the best ways to harness Salesforce data is by connecting it to Jupyter Notebook for advanced data anomaly detection. But what really elevates this approach is deploying your GRAX Data Lake / Lakehouse on Heroku—allowing for a powerful, scalable environment that’s quick to set up and easy to manage. This video will walk you through connecting your GRAX Data Lake / Lakehouse to Jupyter Notebook with a fully deployed JupyterHub installation on Heroku, empowering you to run sophisticated anomaly detection with your Salesforce history.

What’s Covered:

  • How to easily deploy a GRAX Data Lake / Lakehouse on Heroku
  • Step-by-step process for querying your Salesforce data using Python in Jupyter Notebook
  • Techniques for detecting anomalies, including Z-scores, the interquartile range (IQR), and the Isolation Forest method
  • Tips for efficiently analyzing historical data 
  • Anomaly detection walkthrough using Python, pandas DataFrames, and scikit-learn

Key Highlights:

  • Quick Deployment with Heroku: Heroku makes it incredibly easy to deploy GRAX Data Lake / Lakehouse with JupyterHub. With just a few steps, you can have a fully functional and secure environment with all your historical Salesforce snapshots. 
  • Python-Powered Anomaly Detection: Use Python libraries like pandas and scikit-learn to perform advanced analytics and anomaly detection on your Salesforce data. 
  • Flexible Data Querying with Athena: Easily query your Salesforce history stored in your GRAX Data Lake / Lakehouse. You can perform advanced data operations all within Jupyter.
  • Visual Insights and Anomaly Detection: Visualize anomalies and outliers in your data using scatter plots. You can also create advanced visualizations that make it easier to identify trends or patterns.

GRAX Data Lake / Lakehouse on Heroku with Jupyter offers an incredibly flexible and powerful way to analyze Salesforce data at scale. Whether you’re looking to detect outliers or anomalies in sales opportunity stages or understand trends in historical performance, this setup is a game-changer for any organization looking to unearth actionable insights.

Watch the video to see how GRAX, Heroku, and Jupyter all work together to help you unleash your data’s potential. 

Transcript

GRAX here.

And today, we’re gonna take a look at connecting to your GRAX Data Lake with a Jupyter notebook, actually a full-blown JupyterHub installation.

And we’re going to do that in order to compute some anomaly detection against our historical Salesforce data.

So the question is which stages are taking a little too long to go through, and I’ll say on average, even though some of the statistics we’ll use are interquartile ranges or some more advanced stuff we can do in Python.

And it really showcases the power of the data analytics one can do when given a connection to the data lake and a fully fledged Python environment.

So we already have one of these deployed, but this is something you’re gonna be able to deploy yourself directly to Heroku.

The deployment handles data lake credentials and S3 persistence, and we’re using Heroku OAuth for sign-in on our installation.

So we can sign in here and start a new kernel.

Now what’s nice is, in our repo, we include a very thin wrapper around Athena that reads all the credentials and lets you get started querying right away.

So we’re gonna build up a query here that looks at the changes of the stage name based on time.

And so this is a common table expression to pull out all the fields we need and do a little window-function magic to get the previous value and the previous timestamp all in one row.

And so that’s gonna allow us to just select the object ID, the name, and the field.

This is here so we can generalize it and look at not just Opportunity and StageName; for right now, we’re just gonna be looking at the object name, the object field, and the previous value.

And so our analysis will be able to apply to any signal out of the gate.
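As a rough sketch of that query shape, here is what a common table expression with a LAG() window function might look like. The table and column names (opportunity, stagename, systemmodstamp) are illustrative assumptions, not the exact GRAX schema; adjust them to match your own data lake.

```python
# Illustrative sketch only: table and column names below are assumptions,
# not the exact GRAX schema.
query = """
WITH changes AS (
    SELECT
        id,
        name,
        stagename,
        systemmodstamp,
        LAG(stagename)      OVER (PARTITION BY id ORDER BY systemmodstamp) AS previous_value,
        LAG(systemmodstamp) OVER (PARTITION BY id ORDER BY systemmodstamp) AS previous_timestamp
    FROM opportunity
)
SELECT
    id             AS object_id,
    name           AS object_name,
    'Opportunity'  AS object_type,
    'StageName'    AS object_field,
    previous_value,
    stagename      AS field_value,
    systemmodstamp AS changed_at,
    date_diff('second', previous_timestamp, systemmodstamp) AS duration_seconds
FROM changes
WHERE previous_value IS NOT NULL
"""
```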

So we’re gonna just create this query as a string, and then there are two methods on the GRAX Athena client.

One of them allows us to return the data as a pandas DataFrame.

So we’re gonna run that query against our live data lake, and we already have a result.

Take a little peek at it just to do some quick validation.
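The exact helper methods live in the thin Athena wrapper that ships with the GRAX Jupyter repo; as a generic equivalent, the same round trip can be sketched with pyathena and pandas. The environment variable names below are assumptions, so use whatever your deployment exposes.

```python
# Generic stand-in for the repo's Athena wrapper, using pyathena.
# The environment variable names are assumptions; use whatever your deployment sets.
import os
import pandas as pd
from pyathena import connect

conn = connect(
    s3_staging_dir=os.environ["ATHENA_S3_STAGING_DIR"],   # e.g. an s3://bucket/prefix for query results
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
)

# Run the query against the live data lake and pull the result into a DataFrame.
df = pd.read_sql(query, conn)

# Quick validation: peek at the first few rows.
df.head()
```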

And here we see this is going from solutioning to validating, validating to proposal, proposal to validating, validating to proposal, and proposal to closed lost.

So that sounds like there were perhaps multiple proposals in play and this opportunity was finally lost.

So let’s take our dataset and use a little trick here and write it to a CSV.

That way we don’t have to repeat the network query, even though it was quick, each time we want to analyze the data; we can just analyze it locally.
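A small caching sketch, reusing the column names assumed in the query above; the file name is arbitrary.

```python
# Cache the query result locally so we don't re-run the Athena query on every pass.
df.to_csv("stage_changes.csv", index=False)

# Later cells (or a restarted kernel) can reload it from disk without touching the network.
df = pd.read_csv("stage_changes.csv", parse_dates=["changed_at"])
```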

So let’s compute a Z-score for the durations and look at some averages.

And this actually looks quite wrong, and full disclosure, it is.

So what I want to do here is actually show an example of exploring your data.

And the problem here is that we’re grouping by the object ID, so we’re not really computing an average across all the data.

And so this change is actually quite simple.

You just replace this line with the duration in seconds.

And boom, we’re looking at what we would expect: a couple of big outliers, and everything’s centered around zero.

This looks a lot more like the sort of statistical distribution we would expect—great!

So now we’re kind of convinced we’re computing these Z-scores correctly.
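In code, the corrected calculation computes the Z-score over the whole duration column rather than per object ID. A minimal sketch, assuming the duration_seconds column from the query above:

```python
# Z-score across the entire dataset, not grouped by object ID (which was the earlier mistake).
durations = df["duration_seconds"]
df["z_score"] = (durations - durations.mean()) / durations.std()

# Sanity check: mean near zero, standard deviation near one, a handful of large positive values.
df["z_score"].describe()
```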

And now that we know what we’re doing with the data, let’s do a slightly more sophisticated analysis.

And before we do that, we can actually look at a scatter plot of the duration versus date to sort of see how things are looking.

And this is actually, you know, one of the cool parts about using Python.

We can turn off the log scale and really see that the majority of our data isn’t taking forever, and these sort of look like our outliers visually.

So how can we actually find those with code?

Bring back the log scale to kind of spread out the values a little bit more, but it’s the same pattern.
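A sketch of that scatter plot with matplotlib, again assuming the column names from the query above; toggling the yscale line switches between the log and linear views described here.

```python
import matplotlib.pyplot as plt

# Duration of each stage change plotted over time.
plt.figure(figsize=(10, 5))
plt.scatter(df["changed_at"], df["duration_seconds"], s=10, alpha=0.5)
plt.yscale("log")   # comment this line out to see the linear view
plt.xlabel("Change date")
plt.ylabel("Time spent in stage (seconds)")
plt.title("Stage change duration over time")
plt.show()
```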

And so this notebook didn’t come with scikit-learn, but one of the methods we’re gonna use actually uses scikit-learn.

So this is cool.

You can just pip install it.

On Heroku, this will be gone when the dyno restarts.

But if you want it to persist, just add it to your requirements.txt and it’ll be there in your GRAX Jupyter installation.
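In a notebook cell, that looks something like this; the %pip magic installs into the running kernel’s environment.

```python
# Installs into the running kernel only; on Heroku this is lost when the dyno restarts.
%pip install scikit-learn

# For a permanent install, add a `scikit-learn` line to the repo's requirements.txt
# so it is built into the Heroku slug on the next deploy.
```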

So now let’s get to the actual anomaly detection and we’re gonna do it three different ways.

We’re gonna use a Z-score, which is pretty much the simplest.

It does require the data to be normally distributed.

The interquartile range (IQR) method, which basically flags anything that falls far enough below the bottom quartile or above the top quartile.

And then the Isolation Forest method, where we’re just gonna use scikit-learn to do the work for us.

Of all of these methods, this one’s the most robust and doesn’t rely on any underlying distribution for your data.
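Here is a sketch of all three methods side by side on the duration column, using pandas and scikit-learn. The 3-sigma cutoff, the 1.5 × IQR multiplier, and the contamination value are conventional defaults you can tune, not values taken from the video.

```python
from sklearn.ensemble import IsolationForest

x = df["duration_seconds"]

# 1. Z-score: flag anything more than 3 standard deviations from the mean
#    (assumes the data is roughly normally distributed).
z_scores = (x - x.mean()) / x.std()
df["z_anomaly"] = z_scores.abs() > 3

# 2. IQR: flag anything more than 1.5 * IQR below the first quartile or above the third.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
df["iqr_anomaly"] = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# 3. Isolation Forest: no distributional assumptions; contamination is a tunable guess
#    at the fraction of anomalies.
forest = IsolationForest(contamination=0.01, random_state=42)
df["forest_anomaly"] = forest.fit_predict(x.to_frame()) == -1

# Compare how many points each method flags.
print(df[["z_anomaly", "iqr_anomaly", "forest_anomaly"]].sum())
```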

But what’s cool is we can now, you know, sort of compare and contrast the different methods.

We can build up a plot here and we can graph it.

So the Z-score method found twenty-five anomalies, we found two hundred using the quartile method, and we can change those bounds.

But here we go, so this is our chart, and as we can see, there’s sort of an imaginary cutoff line for each of the different methods.
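A sketch of that comparison chart: plot every stage change, then overlay the points each method flagged, so the implied cutoff lines become visible. It reuses the anomaly columns defined in the sketch above.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.scatter(df["changed_at"], df["duration_seconds"], s=10, alpha=0.3, label="all changes")

# Overlay the points flagged by each method.
for column, color in [("z_anomaly", "red"), ("iqr_anomaly", "orange"), ("forest_anomaly", "purple")]:
    flagged = df[df[column]]
    plt.scatter(flagged["changed_at"], flagged["duration_seconds"], s=25, color=color, label=column)

plt.yscale("log")   # remove to see how far above the rest the flagged points sit
plt.xlabel("Change date")
plt.ylabel("Time spent in stage (seconds)")
plt.legend()
plt.show()
```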

And we can do that same trick where we remove the log scale and really see that this is working to find the outliers based on these cutoffs.

They’re not arbitrary cutoffs; they’re derived from the dataset.

But, you know, we have to set the line somewhere and decide what counts as an outlier.

This is a really powerful way to do industrial-scale data analysis with your Salesforce data.

And now that we actually have these, you can come in here and start looking at the actual objects, at what’s at the top of this list.

So now we can go into Salesforce, look at these different opportunity stages, and understand what made them an outlier.

But right now, we basically have a way to detect anomalies in your data that is quite straightforward, uses standard data analysis tools, and is extremely flexible and extremely powerful.

And this is an area where we can keep adding value as time goes on.

Unleash Your Data’s Potential

Speak with our product experts to see how you can transform your data strategy.
