GCP Machine Learning: From Cricket Video Analysis to Predicting Coal Mine Equipment Failures (Cloud Next ’19)


[ELECTRONIC MUSIC] ALEXANDER GORBACHEV:
So today I’m going to tell you a
couple of stories. I like telling
stories, and there are going to be details of a couple of machine learning projects that we've done with my team at Pythian for two customers who are OK with us talking publicly about this. One of the customers
is Teck Resources, one of the largest mining
companies in the world and the largest in Canada. These are the
machines that we’ve been working on, predicting when
they're actually going to fail. That's me and Paul down there. These machines are so big, and Paul, sitting right there, looks so small next to them. So you can see the proportions, right? And another project
I’m going to talk about is how we are analyzing
the cricket match videos to extract certain
information out of them. So these are completely different machine learning applications, but there are many common themes running through them. And I think you're going to
take quite a few insights out of those. And of course we
build those on GCP. And if something doesn’t
work today, I apologize. This presentation
style is a little new. We’ve done it before
once, so it kind of works. So I think it’ll be cool. So a little bit about Pythian. Pythian is a
20-year-old company. We help customers to innovate
with data and with the cloud. Specifically, our
machine learning practice is helping build
real-world AI solutions. And a couple of
those we’re going to look at today,
including some details. One of those projects
actually, the one for the ECB, was submitted for the Partner of the Year award, and it was selected. So yesterday we actually got a Partner of the Year award in Data Analytics. That's us on the stage receiving it. A little bit about me. Hi, I'm Alex. I'm a Cloud CTO today. I've been at Pythian
for 13 years, been doing a lot of different
things from running a team of database
engineers to creating a new business in a new region. And today I’m a
Cloud CTO as well as run our machine
learning practice. Pythian is a well-known Premier Partner of Google. We have a full set of specializations and do a lot of cloud migrations, and we help customers build data systems and AI solutions in Google Cloud. All right, so a
little bit about Teck. Teck is, as I mentioned
already, Canada’s largest diversified resource company. That’s a fashionable way
to say mining company. And they're very modern, and all of that equipment in the mines is really a set of big IoT devices today. So specifically, what we're going to focus on is these babies. This is the Komatsu 930E haul truck, and it's about a $1 million truck, so it's quite big. It's quite big, as you've seen in the picture. So basically, what we're going to look at today is how we can predict the failure of those machines. Because the cost of failure is actually quite large. A single electrical subsystem failure, and that's what we focused on today, actually costs anywhere from $100,000 to several million dollars to fix. And the cost comes
from two things. One of them is the cost of
repair, like a direct repair, and the earlier we can
spot the potential problem and initiate the repair the
less, usually, this repair costs. But the biggest
cost actually comes from the truck being idle. Because in a
resource-constrained environment, when the truck isn't hauling coal, isn't hauling stone and rock, we don't get any new revenue. That's where the biggest cost comes from. So predicting failure allows
us to do this maintenance and repair at a time when
we’re not resource-constrained. So we reduce or eliminate
its idling cost. So again, what do we need? We need an early warning, right? We need an early
warning and ideally we want to know how much time until
the truck is going to fail. That would be our nirvana. Or, for some other types of failure, we want to know the level of under-performance, how deteriorated a component is, so we can decide which ones to take in for proactive maintenance first. They estimated that if they could predict high voltage electrical subsystem failures, they could save up to $4 million a year with this failure type alone. And that's scoped to three mining sites with 108 trucks like this. So now we have a kind of
problem and we have a need. And we always try to be very
disciplined in establishing those first before we’re
moving towards machine learning solutions because it allows
us to have a full circle and estimate ROI of
our work as well. Now, what data are we working with? There are several dataset inputs that we're going to work with. But the first question is, hey, where do we get predictive data, data in which we can spot the patterns that indicate a pending failure? Those trucks are
equipped with sensors that send the readings
as time series data as well as alerts
and events which are generated on those
trucks from actuators from the smart systems
embedded in those as well. There’s some dispatch data
and some other data sources available as well. But ultimately, we’re
going to work only with the time series data here. And each truck generates about
two gigabytes of data per day and hundreds of thousands
of data points per truck per second approximately. So obviously this is where
our big data is coming from. We’ve been working with four
years of historical data available to date, three mining
sites, more than 100 trucks, millions of alerts,
billions of sensor readings. And it definitely must be
a machine learning problem, right? We have a large enough dataset. But, you know, what
do we do with this? Before I go further, I want
to take a step back and share a learning with you: over the years of building AI solutions, we've always tried to narrow our problem down to a supervised learning problem, because it allows us to measure how well we're doing historically. It allows us to give convincing insight to end users. It's a very actionable result, as opposed to things like anomaly detection, where you say, something seems wrong with this truck. So what? There are thousands of components in this truck, right? Or we could try some clustering and figure out certain patterns in the data. There are so many patterns, but most of them have nothing to do with the electrical failures that we're looking for. So as a result we always
try to narrow down the problem in some
way and state it as a supervised
learning problem when we have a known actionable
outcome that we can predict with the model, whether it's a classification or regression type of problem. Now, where do we get the labels? Well, there is historical maintenance data and a historical database of spare parts and so on. The trouble is that some of this data has not actually been digitized. So this is where their maintenance analysts had to go back over a few years, go through all the data, and actually extract some of the failure and preventive repair information and supply it to us. Now we had so much data. It's great, but what about
those failures themselves? Well this is where
small data comes in. And the small data
in this case is that we have really only a few dozen failure events. There were 31 catastrophic high voltage failures over those 108 trucks, at least the ones that they could find. And there were a few dozen events resulting in so-called suspension exchanges or recharges, when a component deteriorates and they pump it up so it works a little longer. And most of those went completely undocumented. So as a result, this is the most important data that we normally take into our projects, and this is the data that tends to have the lowest quality. So we couldn't be certain that no events were missing from it. So we have to be very
careful how we use that. Now does it smell like a
machine learning problem still? When you only work
with dozens of events it’s very hard to build a
model on 30 events or so. So what do we do? Well in general, like
as we move forward, as we have a business
need, as we understand our data at least a little bit,
we need to frame the problem. As I said, usually we
always try to narrow it down to supervised learning, which means either regression or classification. Then we'll need
to understand how to produce reliable labels
if they’re not available, modeling techniques to use,
and work on the feature engineering. So those are the general
kind of directions that we’re following. Let’s say what we
want to predict is– let's start with an obvious one. We just want to predict
time to failure. This would be great. We know exactly when,
according to the model, the resource is going to fail–
component is going to fail. But it’s really hard to
build that, because there are very few samples. And rather than just training the model to find out whether a failure is going to happen or not, we now need to distinguish whether it's going to happen one day from now, seven days from now, or 20 days from now. That's a significantly harder problem, and some failure types don't actually develop in a progressive fashion at all. So we won't even be able to say whether it's one day or seven days away. As a result we had to kind of
go back and look at the problem again. And what we normally
do is try to look at the problem through the eyes of the human who is already doing the work today, ideally if that's possible. In this case, the Assets Health Supervisors, what they do many times a day is look at their dashboards, apply some heuristic rules and so on, and make a guess: I think this thing is going to fail, I'll need to take it in for maintenance. Ultimately, they take observations and draw some kind of conclusions using whatever human insight and knowledge they have. So that teaches us two things. It's great that we can now have
those observations many times a day. That's much more than just the 30 failures we'd otherwise have to work with, but we don't have those observations historically. The good news is we can simulate them from the failure events that we have, and I'll show how we do that. Another thing that we learned is that it's actually not really important to know exactly when the failure is going to happen. It's enough to say that it's going to happen soon enough. And "soon enough" for them usually means a few days. A day in advance is maybe a little late, but four days, seven days, is awesome. So that's what we learned. Now how do you
reframe this problem? This is one of the maybe
most innovative approaches, the most creative thing we’ve
done during this project. We basically transform the events that the maintenance analysts provided us into certain observations. So how do we do that? Let me make sure– all right. This is a timeline for each truck. And there are some trucks that had no failures at all, so their timeline would be clean, without events. This is a failure event. And this is a maintenance event, when staff decided to take the truck out because they thought there would be a problem, and they fixed it. But all of those indicate to us that there is something wrong. So what we can do, and we worked with them on this, is split all the time in the lifetime of a truck into certain ranges. The range of a certain length before the failure we're going to call the failure pending range, or the failure range, or a bad range. This is where we're going to make observations a few times a day, and these observations will be labeled as one, which in our case means a failure is pending within this period of time, however long it is, seven days, ten. I think we ended up with
seven days in the end. There’s also some
period of time when there may be still a signal
because the failure may develop earlier but maybe not. That’s kind of an
area when there’s uncertainty in the data. We’re not sure whether there’s
a strong signal or not. So we decided to
mark those ranges and exclude them from
taking observations. So we have a much
cleaner delineation between there’s a
failure pending soon, and there’s a clean zone
when we don’t expect failure. For example, over
the next month or so. So as a result we basically
have those clean ranges marked as the green
and then the red ranges marked as failure pending. And some time right after the failure was fixed we also excluded from consideration, because of the way we generate features from the data available by each point in time, as you'll learn later. What's important to remember here is that we basically simulated human observations, the same way they try to spot the problem. At each of those points in time, we make an observation, take the data we know by that time from all the time series data, and generate some features that we're going to feed into the model. And the model will basically be trained on whether an observation is from a clean range or from the failure pending range. That's it. So what we have done
is that we’ve basically interpolated dozens
of rare events into hundreds of
thousands of observations. And you know, thousands
of observations is already something we can work
with and build the model on. Although still
not a huge amount. So you have to be quite careful. So what do we do next? An obvious way of working
So what do we do next? An obvious way of working with time series data, as Paul would tell me, is, like,
hey, we build an RNN model. It’s obviously designed to get
a sequence of either alerts and sensor readings or
some aggregate of those. And then make a
prediction based on those. But we end up with a
huge feature space, like tens of thousands of points
only for a few thousand labels. So what we’re going
to end up with is what's known as the curse of dimensionality, when we use a too-complex feature space or a too-complex model. We can obviously train
the model on those and it may show a great result
on training but on the testing it will not generalize and
the error will be huge. It will definitely learn certain patterns, but there are so many patterns in the data that we will most likely learn something other than electrical failures. And that's definitely
what we’ve seen. So how do we deal with it? How do we deal with
relatively few samples and a huge feature space? Well, to fix that
there’s a few things. One of them– we need to
control our model complexity and not go crazy with creating
complicated neural networks, for example. And we also should try to
reduce our feature space. We already tried to
simplify the problem by going away from regression
to binary classification. That’s as simple as we can get. So what do we do next? In order to reduce
the feature space, we should work more creatively on the feature engineering rather than just feeding in the full sequence. We could try to employ some of the automated dimensionality reduction techniques; in this case that didn't quite work, because the data is not normally distributed. But what we did is that we
worked with domain experts and understood how we can
embed a temporal effect into our features made
from sensors and alerts without actually having to work
with a full sequence of tens of thousands of data points. At each time that we make an observation, we have the historical data available up to that point. What we did is define several sliding ranges of hours or days, some of them even shifted back, and we created aggregates over those ranges from the sensors and alerts. Counts of alerts within those ranges actually worked the best in this case. And the sizes of those ranges, as well as the offsets, are what we've actually been tuning as part of the hyperparameter tuning. So we tune our feature engineering parameters as hyperparameters as well.
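As an illustration of those windowed alert-count features, here is a rough sketch. The specific window sizes, offsets, and column names are assumptions; in the real pipeline they were treated as tunable hyperparameters.

```python
# A rough sketch of the windowed alert-count features. The (offset, length) pairs
# and column names are assumptions; in the project they were tuned as hyperparameters.
import pandas as pd

# Each pair defines a look-back range relative to the observation time t:
# the window covers (t - offset - length, t - offset].
WINDOWS_HOURS = [(0, 6), (0, 24), (0, 72), (24, 72), (72, 168)]

def alert_count_features(alerts: pd.DataFrame, obs_time: pd.Timestamp) -> dict:
    """Count alerts of each type within several sliding, possibly offset ranges.
    `alerts` is assumed to have 'timestamp' and 'alert_code' columns."""
    feats = {}
    for offset_h, length_h in WINDOWS_HOURS:
        end = obs_time - pd.Timedelta(hours=offset_h)
        start = end - pd.Timedelta(hours=length_h)
        window = alerts[(alerts["timestamp"] > start) & (alerts["timestamp"] <= end)]
        for code, n in window["alert_code"].value_counts().items():
            feats[f"alerts_{code}_off{offset_h}h_len{length_h}h"] = int(n)
    return feats
```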
What we also did, remember when I mentioned how we created those label ranges, is that we made the sizes of those ranges tunable too. So our hyperparameter tuning was not only on the model,
it was on a complete feature engineering pipeline as well as
on a label production pipeline so that we can tune
that and understand what is going to work best, a four-day window, a seven-day window, and how far back we need to consider pushing the clean range as well. If you have more detailed questions, you'll find me at the booth for the rest of the day, or come up here and we can talk about it. Because we're taking our feature engineering parameters as well as labeling parameters into our hyperparameter tuning cycle, we need to be careful how we do the split between training and testing. What's important is that we
need to split in such a way that the failure event
and the windows resulting from this failure event will
either be only in a training set or in a testing set. Because otherwise we may cheat
by basically artificially tuning those labeling windows. And we may also
introduce a problem because nearby observations
from today and from yesterday will include data from the
last maybe 10 or 15 days, depending on those ranges. They may be very similar, so we need to make sure we don't use them across training and testing, because that would be kind of cheating. So we've been very careful that when we split the data, we actually split it by events, before we start doing any transformations and any tuning. It's very important not to pollute the data in this way.
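A minimal sketch of such an event-aware split, assuming each observation carries a group identifier tying it back to the failure event (or clean segment) it was derived from; the scikit-learn GroupShuffleSplit used here is one way to do it, not necessarily the exact mechanism used in the project.

```python
# A minimal sketch of an event-aware split: every observation derived from the
# same failure event (or the same clean segment) stays on one side of the split.
# The 'group_id' column is an assumption about how observations are tagged.
from sklearn.model_selection import GroupShuffleSplit

def split_by_event(observations, test_size=0.25, seed=42):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(
        splitter.split(observations, groups=observations["group_id"]))
    return observations.iloc[train_idx], observations.iloc[test_idx]
```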
Also, we have to take care not to learn individual trucks. Because there were some trucks– actually, most trucks did
not have any failures. We had 108 trucks and
31 electrical failures and some of the trucks
failed more than once. So as a result, we
had to be careful not to learn truck-specific patterns from the huge amount of alerts and sensor readings. So as a result, we had to include trucks that had no failures, obviously, and pick up some of the clean observations from them. We also need to include trucks that never failed in the training set but failed in the testing set, so we cover those cases as well. So those are the things
we need to think about. Of course, building models that generalize well also means that we're not just spotting a pattern we're not interested in, like actually learning the truck or learning the mining site. For example, if it
ends up that one site had more failures than another. Now, how do we control
model complexity? Well, if we stick
to a neural network then we’ll just use highly
regularized neural networks. But it’s quite a bit of
effort and the good news is that there are
algorithms that are very robust on these
relatively small datasets with a relatively high number of features: you know, boosted trees, random forests. We actually ended up with a random forest model in the end, and that worked really well for us. So that's basically how we approached the modeling technique.
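For illustration, a minimal sketch of a complexity-controlled random forest on the observation features; the hyperparameter values shown are placeholders, not the tuned ones.

```python
# Illustrative sketch of the model choice: a complexity-controlled random forest
# on the windowed observation features. Hyperparameter values are placeholders.
from sklearn.ensemble import RandomForestClassifier

def train_failure_model(X_train, y_train):
    """X_train: windowed alert-count features; y_train: 0/1 'failure pending' labels."""
    model = RandomForestClassifier(
        n_estimators=300,         # plenty of trees, each kept fairly shallow
        max_depth=8,              # limit complexity given only thousands of labels
        min_samples_leaf=20,
        class_weight="balanced",  # failure-pending observations are the minority
        random_state=42,
    )
    return model.fit(X_train, y_train)

# probabilities = train_failure_model(X_tr, y_tr).predict_proba(X_te)[:, 1]
```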
Now, something to share is that we also applied
additional smoothing on the output of the model. So in this case, this is
basically observations. This is time. This is the model output,
which, to simplify, we can treat as the probability of failure, although it's slightly different how you treat the output of a random forest. But for the end user that's
the easiest way to present it. And it’s a little spiky, right? As you see, this is where the
failure happened in the past. So as you can see, the
model actually, some days before, started to produce– the probability
of failure started to grow based on the model. But it was a little
spiky, so as a result we decided to smooth it. It's a better user experience, and it also allows us to remove some of the potential noise from the model. And as you can see, when the failure was dealt with, the model prediction started showing a lower probability as well.
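A simple way to picture the smoothing is a rolling average over each truck's recent scores; the sketch below assumes a hypothetical window length and column names.

```python
# A simple way to picture the smoothing: a rolling average over each truck's
# most recent scores (the window length and column names are assumptions).
import pandas as pd

def smooth_predictions(preds: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """preds: columns ['truck_id', 'timestamp', 'probability'], one row per scored
    observation. Adds a smoothed column and returns the frame."""
    preds = preds.sort_values(["truck_id", "timestamp"]).copy()
    preds["probability_smoothed"] = (
        preds.groupby("truck_id")["probability"]
             .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return preds
```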
Now, how do we choose the threshold at which we make a decision
that it’s likely pending failure or not? If we choose it very high
then we can miss the event or miss the failure or get
notified very late for it. If we choose it
very low then we’ll have too many false positives
and engineers will constantly need to make a decision to take
the truck out for maintenance. So knowing the average cost of a failure, and knowing what the maintenance, what a check-up, costs, we could quickly build a formula that allows us to calculate how much money we're going to save at the different thresholds here.
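The spirit of that formula can be sketched like this; the dollar figures are invented for illustration, and this simplified version counts alerted observations rather than distinct failure events.

```python
# Back-of-the-envelope threshold analysis. The dollar figures are invented for
# illustration, and this simplified version counts alerted observations rather
# than distinct failure events.
import numpy as np

COST_OF_FAILURE = 1_000_000   # assumed average cost of a catastrophic failure ($)
COST_OF_CHECKUP = 20_000      # assumed cost of pulling a truck in for inspection ($)

def expected_savings(y_true, y_proba, threshold):
    """Savings = failures caught early (avoided cost) minus the cost of chasing
    false positives at this alert threshold."""
    alerts = y_proba >= threshold
    true_positives = np.sum(alerts & (y_true == 1))
    false_positives = np.sum(alerts & (y_true == 0))
    return true_positives * (COST_OF_FAILURE - COST_OF_CHECKUP) \
        - false_positives * COST_OF_CHECKUP

# Sweep thresholds on a validation set and keep the most profitable one:
# best = max(np.linspace(0.1, 0.9, 33), key=lambda t: expected_savings(y_val, p_val, t))
```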
So it turned out that the optimal was around 0.5. When we put it in production, we took into account another aspect, a human
aspect of interpretation. And the human aspect is this. Those Assets Health
Supervisors spend their life or most of
their life doing this. And the model will ultimately make some false positives; it will indicate a failure when there won't be one happening. So what we want to make sure of at the very beginning is that we give them an experience that will make them more confident. So as a result, we'd rather miss some failures sometimes, but make sure that when they see the alert, when they see the critical condition, they do react. So as a result we choose
it a little bit higher for them to be more
confident in it. And that worked actually
really, really well although we did miss some failures. I’ll go through the
results a little bit later. Now of course, we’ve built
it all on the Google Cloud Platform. This is a somewhat simplified view of how we architected the production system in the end. But basically, whenever
we built a system like this, what we’re trying
to do is we’re trying to find what’s our integration
between our AI solution with the
rest of the world? How do we get the data and how
do we get predictive insights into the customer’s hand? Teck Resources has
been basically managing a lot of this stuff on prem and
we’ve built it in the cloud. So as a result, of course,
we've set up a data ingestion pipeline with them. And actually, later on, Dataflow was introduced, versus how we worked here with Cloud Composer and Airflow. But in the end we've
basically been landing data in the cloud storage, which is
a very common way of doing that. And then we had a regular job
ingesting the data in BigQuery as well as updating some of
the aggregates in Bigtable. And the reason we chose Bigtable is that it's very
good for sparse data, like tens of thousands
of columns of sparse data that we needed. And it was very useful
for us to pre-create a lot of different aggregates. There’s a lot of different
averaging windows and a lot of
different parameters. So we can then quickly sample
observations out of them based on whatever
labeling techniques we’re using and sampling
different aggregation windows. So we could iterate
through those very quickly without
recalculating those aggregates every time. So that’s why we used
Bigtable for this, which is kind of unusual in a way. And in the end we developed an API that basically sits in App Engine, behind Cloud Endpoints, as a REST endpoint for them to call. So after all of this is implemented, all they need to do is basically ask: at
this point in time what’s your prediction that
this truck will fail. Or give me predictions
for all 108 trucks today or as of five days ago. That’s basically the
API that we gave them. The API behind the scene
calculated several– made several
predictions, applied the smoothing that we need, and
returned the probability, as well as whether it's an alert condition or not, based on the selected threshold. The API basically picks up the features from Bigtable, which are updated as part of the data ingestion. It picks up the model from a Cloud Storage bucket, so that we can update the model without impacting and restarting our applications, and it presents the result as JSON in the response. So it's not that complicated.
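The shape of that endpoint might look roughly like the following sketch (not the actual service); the Bigtable lookup is stubbed out as a hypothetical fetch_features() helper, and the bucket and blob names are assumptions.

```python
# A sketch of the prediction endpoint's shape (illustrative, not the actual
# service). fetch_features() is a hypothetical stand-in for the Bigtable read,
# and the bucket/blob names are assumptions.
import joblib
from flask import Flask, jsonify, request
from google.cloud import storage

app = Flask(__name__)

# The model artifact lives in Cloud Storage so it can be swapped out independently
# of the application code (here it is simply loaded at startup).
BUCKET, BLOB = "example-models", "truck-failure/model.joblib"
storage.Client().bucket(BUCKET).blob(BLOB).download_to_filename("/tmp/model.joblib")
model = joblib.load("/tmp/model.joblib")

THRESHOLD = 0.6  # set a bit above the cost-optimal point, per the talk

def fetch_features(truck_id, as_of=None):
    """Stand-in for looking up the precomputed aggregates in Bigtable."""
    raise NotImplementedError

@app.route("/predict/<truck_id>")
def predict(truck_id):
    as_of = request.args.get("as_of")           # optional "as of" timestamp
    features = fetch_features(truck_id, as_of)  # one feature vector
    probability = float(model.predict_proba([features])[0, 1])
    return jsonify({
        "truck_id": truck_id,
        "probability": probability,
        "alert": probability >= THRESHOLD,
    })
```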
So all they need to do is call it regularly and get the
results in front of users or store them in
their own systems. And it’s very
important that we could embed it in their own
systems rather than give them something new. Because Assets
Health Supervisors already have a lot of things
and dashboards to look at. They don’t need
yet another thing. Although it may be
somewhat ugly-ish, you can argue, it’s really
in front of the users and what they’re using now
and that's very important. And when we build systems that expose an API, that gives us this ease of embedding, ease of integration, as opposed to having a very, very tight integration. And as we built it
on Google Cloud, we have all the operational
metrics in the Stackdriver and have a dashboard
that’s showing us exactly what’s going on,
how much resources used, how many predictions are being
made, and et cetera, et cetera. So very kind of sleek
and very easy to use. And it's been running basically on its own, with one failure, for about a year and a half, I think. Some disk got full somewhere, and that's why it failed; that wasn't monitored before. Let's look at a
specific example. One of the failures,
how it was predicted. So this is an actual alert. It triggered a warning
on Thursday morning. The Assets Health
Supervisors made a decision, in whatever way they make those, that since it looked legitimate, let's take it in for maintenance. They took it in for maintenance. They found this RP1 contactor that had been failing and replaced it. But the predictive
model still was reporting a pending failure. So they decided to take
it back for maintenance and then found another crack
in the critical component that normally results in a high
voltage system failure. Basically, the electrical cabinet blows up, and they blow up in a spectacular way, and it's expensive. So they were able to fix that, and then the model dropped its prediction back down to a normal level over time. And it's interesting: even though we trained it on a certain range, it turns out that the model was often predicting it a little bit before that range. So having this uncertainty window helped us a lot when we were kind of in the middle. It gave a much stronger, much
more differentiated signal for the model to train on
which is very important for us because we have
very few samples. Thousands of samples
is very, very little. So what do we focus on? We focus on the
quality of our labels. We also would like to have a
good quality of features, but to some degree the model can deal with that noise; that's the job of the machine learning model, to deal with that noise. Of course, we want to reduce that complexity, but our focus is always the quality of labels. This is something that's kind
of cool that they told us four months after what happened. And I read it because I think
it’s amazing, amazing results. So the work we did with the
Komatsu 930E haul truck fleet has saved us half
a million so far. We put it in place in April of last year, and that statement was from June of this year. And it has so far predicted six electrical cabinet blowups. Now, four of these
were actioned. Assets Health Supervisor
took a decision: yep, good idea. Two of them were not actioned; the Assets Health Supervisors thought that the model didn't know what it was doing, and they missed it, which caused a failure. One cabinet blowup was missed entirely; the model didn't predict it. Whether because we chose the
threshold a little too high, intentionally or
because there was just a failure that was so
rapidly developing that there was no pattern for it. That's possible too. They estimated it at $3 to $4 million per annum that they would save from this use case alone. So it's incredible ROI that
we were able to achieve. And Kal, the CIO,
yesterday evening told me, you know how much we saved
this year alone with this? On the 10 truck failures that they successfully predicted and avoided, they saved $4 million this year alone. So it actually turned out even better than they had thought. So that was the predictive
maintenance use case: again, working with structured data, very few interesting events, and kind of a creative way of framing this as a machine learning problem. So now I want to move
to a cricket game analysis, which is pretty cool. We worked with the English
and Wales Cricket Board. And it’s basically a
single national governing body for all cricket games
in England and Wales. That includes the top league
games as well as college games and everything in between. One of their tedious jobs
that their staff is doing is that the analysts
need to watch hours and hours of cricket
video in order to capture– extract information
from the games, what happens when the bowlers
deliver it, what type of shot it is. And so on and all
the characteristics. So it takes a very long time. It’s very manual
work so why don’t we try to create a system
that will either simplify that or basically just analyze
and extract all those events? Manual analysis of this
doesn’t quite scale. They usually analyze
only the top tier events and the rest they don’t
because they have just no time and it’s not affordable to
invest those efforts in this. And another problem is that
they have a group of people and they’re all kind
of inconsistent. They all kind of a
few seconds apart. Sometimes they make
a mistake and so on. So they’re actually not
very consistent in analyzing the games. Where do we start? Just like I skipped
that from the Teck story but normally when we
come in the beginning we’ll try to make an initial
discover and assessment. What are the realms of possible
within the domain we’re looking at? What are the
potential use cases? And we basically build a mini
kind of risk reward matrix. We look at the
feasibility and the risks of solving a specific use
case with machine learning. And then we look at
the potential value and we calculate
certain weighted number. And that gives us an
input to an executive and the senior leadership
which of those we need to focus on first. So not very difficult
to interpret. It’s very powerful. And it turned out that both
operation use cases basically was the most fundamental. And the reason it’s
the most fundamental, is the most important, is that
before basically everything else needs to be
done analysts need to look through those
hours and hours of footage. And a game over the five days
is about eight hours of game normally, right? Like this is our first class
test game– international game. And there’s hours of
video but most of the time is idle as people do nothing. There’s 10 seconds basically–
bowl deliver in the 10 seconds, few seconds play, that people
are really interested in. So extracting those
out of all those hours will remove a huge amount
of overhead right away. So that’s what we
wanted to focus on first and there was a number of
ways how we could do that. So let’s look at the data
that we’re working with. We kind of know the
need in some way. So here’s our big data
again as we look at this. It’s a petabyte of
unstructured video data. There’s tens of
thousands of hours of HD video or standard
video, depending on the game being captured. And five days of
footage up to 40 hours, usually kind of in 20s, is
about 100 to 200 gigabytes worth of data. So it’s quite a
bit to work with. And other than
the normal footage there’s also a hawk
eye footage available which is a special camera at
some of the stadiums equipped that’s tracking the ball. So it’s pretty
impressive what they do but we’ve been mostly
working with just the footage. The footage can be either
televised or non-televised. And non-televised
footage is done by cameras installed by analysts
somewhere in a static location. And that’s it. And the televised
footage is what you see on the TV, which is cool. It has commentary. It has overlays. So as a result, we have
some additional input other than what’s happening
in the game itself. We have data from the overlay
that we could possibly use and we have an audio data which
we could also possibly use. And you will see how later
it actually helped us a lot. But basically what happens
is that as the cricket game is played, he’s a
bowler that’s running. It’s called a run-up. So he runs with the ball then
he finishes and goes “poh!” I’m demonstrating really
bad for those of you who know cricket, right? I see your smile. But that was– basically, they
release the ball and then the [INAUDIBLE] here with a bat
like bang, hits or misses. But that’s really a very
unique moment in the game that we want to spot. It’s a very unique
position of players to make very unique movements. And pretty much all
televised footages just kind of very specific scene how they
film it from behind as bowler running like this. So, obviously,
there must be a way to just spot those patterns. Now, what we have looked
at in the very beginning, or we could look in
the very beginning, is use some of the
pre-trained model. As we’re working
with Google Cloud, there’s quite a few
pre-trained models available. We looked at the Cloud Video Intelligence API. It has the ability to detect scene changes, so we can use that as input in our model to see when the shot is changing, moving from one scene to another, which potentially could help us spot where the ball delivery happens, as well as some annotations. It turned out they're not very useful for us, because it was basically always detecting players, a cricket game or some other ball game, a stadium, and so on: lots of things, but not quite useful
for what we need. The Cloud Vision API
also can annotate what it sees on the frame
and it has an OCR component. And OCR component could allow
us to look at those overlays and see when this
overlay changes. Because overlay changes after
the ball delivery happens, right? So that’s an interesting
thing to know. And they could use
speech-to-text API in order to parse commentary
and apply maybe some NLP to it. We ended up not doing that. But that was an avenue
that was available to us. Now again, remember that we
want to present and state the problem as a supervised
learning problem. So it’s very important if you
always narrow down into this. So here’s the data that analysts
extract, like a subset of it. For each delivery
there is a line and it has information of who
is a batsman and a bowler, what type of shot it is, wicket
type, and the timestamp as well. Where the ball
landed, and so on. So that was manual,
tedious work. Unfortunately we could
not use this data reliably simply
because the timestamps that they had could
not be matched to offsets in the video
in a continuous video because this continuous video– first of all, it doesn’t have
a concept of a timestamp. And it’s also sometimes
fast forwarded and so on. So there was no
easy way in order to match those to the
locations in the video. Plus they were not very precise. It was always by
a few seconds off and we really need
to be focusing on those specific
seconds of the video. So there was a problem. Before going further I
want to kind of review what we need to solve
and how we can map it to a machine learning problem. So what we need is, we need
to take this long footage, cut it into small bits. Basically that means
we’ll need offsets in the video that
will indicate where the ball delivery happened
within a few seconds’ precision. And then from there
on we can just cut about even like
10 seconds frames which is 10 seconds fragment. This is usually long enough
for the game to be over. So we could also try
to train the model to find out when the ball– whether the play finished. But we didn’t really
need to do that. We needed to focus on
the very beginning. So the way that we
can kind of state it as a machine learning
problem is that we basically need to look at either
individual frames or a sequence of frames. That’s what the video is. And then predict
whether it looks like this time in the beginning
of a ball delivery or not. It’s quite specific,
again, compared to the rest of the recording,
the rest of the footage. So it must be doable, right? But that’s basically how
we stated this problem. So it’s a
classification problem. We can potentially do it
multi-level classification by predicting whether it looks
like the bowler is actually just doing a run-up
before the release, whether it’s actually
the ball was released and traveling in the
air towards the batsman. But we didn’t need it
to be that complicated. We just did it binary, whether
it’s within this range or not. So either frames or
a sequence as I said. The bonus things
available is that, hey, we can also look in some
heuristics from data structures from overlays as
well as use some text processing over the commentary
if we really have to. One of those things we
actually ended up using. So when we worked
with it last summer we first built a very quick
AutoML Vision prototype when AutoML Vision
was still in alpha. And it worked
really well for us. It told us that it’s very
feasible to actually implement a proper system around this. What we did is extract a few thousand frames and label them as either a ball event or a non-ball event, which is everything else that looks different. And from that alone, the AutoML model was able to achieve quite good performance; precision and recall were basically around 95%. So very quickly we could build a model even just on the frames themselves. Now, there's replays,
there’s some highlights, or sometimes they will
make a false positive. There are other
ways to handle it. But it gives us a
good confidence. OK, it’s possible to do. We did some just very
quick manual labeling. So what we’ve done in order
to make it much more robust is that we needed more labels. The good thing is that
these days there’s a lot of ways to actually
label images and label videos. There’s many services you
can use, like crowdsource, those specialists. Instead of doing
that we’ve basically built a very simple interface
in order to label the videos and use some of the
research students that we crowdsourced to
do that for us to help us. But what we’ve also done is
a very interesting thing, is that we reduced
labeling efforts by using a few techniques that
allows us to approximately say where in the video there’s
going to be a ball delivery. And we used two things. One of those was using the Cloud Vision API OCR, which detected where there are overlay changes, and we basically built some heuristics around this that say: this seems like a ball delivery.
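A rough sketch of that heuristic: OCR sampled frames with the Cloud Vision API and flag the moments when the detected overlay text changes. The frame sampling and any cropping of the scoreboard region are assumptions here.

```python
# A rough sketch of the overlay-change heuristic: OCR sampled frames with the
# Cloud Vision API and flag the times when the detected text changes (the overlay
# updates shortly after a delivery). Frame sampling and any cropping are assumptions.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_text(frame_jpeg: bytes) -> str:
    """Return whatever text Vision finds in one frame."""
    response = client.text_detection(image=vision.Image(content=frame_jpeg))
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""

def overlay_change_times(sampled_frames):
    """sampled_frames: iterable of (timestamp_seconds, jpeg_bytes) pairs.
    Yields timestamps at which the on-screen overlay text changed."""
    previous = None
    for ts, frame in sampled_frames:
        text = ocr_text(frame)
        if previous is not None and text != previous:
            yield ts  # a delivery likely happened some seconds before this point
        previous = text
```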
That was pretty good, but it was always late by some seconds, 10, 15, 20, because the overlay changes later. And depending on the broadcaster it's different: the overlay is different and the time shift is different. The other thing we
used is that we used a combination of unsupervised
learning, clustering, and basically transfer learning. So what we have done is
that– and it’s great, and we use those systems
available in our case ecosystem. But what we took is that
we took an Inception I3D model that was trained
on the kinetics data set. And it’s about 500,000
videos if I remember, with various human
actions labeled in them. And that model was
trained on that. It takes an input
as a frame sequence and it produces an
output which is basically a multilabel classifier. But what we’ve done
with it is that we’ve stripped the last
layers out of this model and then we ended
up with a fragment that’s basically producing a
600 numeric representation. Like a 600 values
vector for each sequence it's fed. So you can think of it this way: we took a video sequence, which is hundreds by hundreds of pixels per frame, and we looked at two-second sequences, so about 50 frames. That's tens of thousands of numbers, times three for the RGB channels. And we narrowed it down to 600 values in such a way that it came from a model that was trained to recognize certain human motions, as opposed to in any other random way or using some dimensionality reduction
technique or whatnot. So what we did after that is basically run simple K-means clustering. And it became quite obvious that within the video fragments close to a ball delivery, certain clusters were showing up very quickly. So we use this clustering technique and basically say, hey, if the video sequence falls in this cluster, it means it's close to a ball delivery. It actually worked reasonably well to give us a general idea of where we are.
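A sketch of that trick, assuming the publicly available I3D Kinetics-600 module on TensorFlow Hub and using its 600-dimensional output directly as the clip representation; the preprocessing, clip length, and number of clusters are assumptions.

```python
# A sketch of the clustering trick, assuming the public I3D Kinetics-600 module
# on TensorFlow Hub and using its 600-dimensional output as the clip representation.
# Preprocessing, clip length, and the number of clusters are assumptions.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.cluster import KMeans

i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-600/1").signatures["default"]

def embed_clip(frames_rgb: np.ndarray) -> np.ndarray:
    """frames_rgb: float32 array of shape [num_frames, 224, 224, 3], values in [0, 1]."""
    batch = tf.constant(frames_rgb[np.newaxis, ...], dtype=tf.float32)
    return i3d(batch)["default"].numpy()[0]   # one 600-dimensional vector per clip

# embeddings = np.stack([embed_clip(clip) for clip in two_second_clips])
# kmeans = KMeans(n_clusters=20, random_state=0).fit(embeddings)
# "Close to a ball delivery" then corresponds to membership in the cluster(s)
# that, on inspection, line up with deliveries in the labeled footage.
```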
So again, quite an interesting trick. And using this model
was super easy actually. It’s available on
the Tensorflow hub. So we basically just load it
and then modify it to our needs. Or it’s actually available
in a GitHub as well. And the model was done
by Google’s AI research arm, DeepMind. Or Alphabet, I guess,
owners of DeepMind. So a talented team. And they produce a lot
of things in open source. So what we’ve done
is that we’ve– how are we doing on time? Well, what we’ve done
with it is that we took those approximate locations
right here on the site, as you see. And we created a small HTML page
with basically a single page with some JavaScript
embedded in it in order to allow us to move second
by second, frame by frame. And very quickly mark– tag places in the video
as a certain event with the certain tags, right? We’re actually just
going to open source it as available page so
everybody could use it. But basically we end
up with those tags. So we went through
the dozens and dozens of those long-form
footage, produced few thousands of those labels. And that basically split
our time of the footage into places where certain
elements of a game happen. And specifically, the time
when there’s bowler running up. And that’s throwing the ball
and the ball hit by a batsman. That’s the thing that we
focused on and basically made a binary classifier
out of it again. And what we’ve done with this
is that we trained two models. One of them is that we basically took that Kinetics-based I3D model, which produces the 600-value vectors, and we used those 600 values as the input features for a simple classifier. I mean, we could have just added an additional layer on top of it and frozen training of the rest, but however you do it, we again used transfer learning, because the model was already trained on that data. And we turned it into a classification problem, as opposed to the semi-usable clustering, and we achieved very high precision and recall pretty quickly on this, in the very high 90s.
taken the same approach by possibly taking
the preexisting image classification algorithm,
pick important layers from it, and then retrain it
for what we need. Fortunately this
product already exists. And it’s called Google
Cloud AutoML Vision. Because that’s
exactly what it is. So it’s already pre-trained
on a lot of the photos and we just need to additionally
train it on the frames that we need. And we basically merged
those models together and made a decision and built
a small microservice API. Because it takes long time to
go through several hours of HD video and undo it and extract
frames and apply those models, it needs to be asynchronous. But again, we were
always thinking of it as an API that can be
embedded in other systems. So their API here is
that sending a message and it pub/sub
topic that gives us location of the video
in our Cloud Storage. There’s a Kubernetes
cluster that runs workers that auto-scales scales
based on a number of messages in a queue that
picks up that video, goes through that with
OpenCV, extracts the frames, and feed those frames into
the kinetics-based model. And as well as send
some of the frames into the Google AutoML
Vision and then stitches, quote unquote,
“those two together.” Basically build a simple
ensemble out of those and creates a JSON file that it
puts in another Cloud pub/sub topic. So how this is embedded is
that we send the message when the video is available. Or we can automatically send
the message in the pub/sub queue when the video is uploaded. For example there are
many ways we can do it in a Google Cloud Platform. And then we just wait in a queue
for the producing to finish and we do whatever
we need to do. In the beginning, we
just created a small UI that will basically, in a
similar fashion that I show you with the labeling, it would
kind of show us the label so that analysts
could go through it and evaluate how
well it’s working. But another not shown
here is that component that basically picks the
JSON from a pub/sub when the labels become available
and basically cuts the images into small bits. Very, very easy to do. So we have a full pipeline
by taking a long-form video and transforming
it in a short-form and it works pretty slick. When I was preparing
for it I was saying, it would be really cool that
Google introduced AutoML Video because it just made sense
because it would just do it in exactly the same way. Guess what? This morning it was announced
that AutoML Video Intelligence is basically the same
thing as Vision AutoML but works on a sequence of
video in exactly the same way. So what we could have done
if it was available earlier is that we could have used that. Now, maybe it will
work good enough on the videos that focused
on people’s movement. Maybe not. It still remains to be
seen because kinetics is one that’s been focused on
predicting the people movement. What people do
and their actions. So this domain transfer
worked really well for us. The transfer learning
worked really well for us but maybe it’s too generic. But we will definitely
try it and I’m pretty sure we’ll get
some good initial results. So it’s great that we have
availability of those, even if we don’t end up
using them 100% at the end, it will already give us
a very quick idea if it’s possible or not and we can
go through more complicated solutions as needed. [ELECTRONIC MUSIC]
