# Using data from baseball-reference.com to predict baseball game attendance

One of the important abilities

to do in sports analytics is prediction. In particular, prediction

of future events that affect your

team or your league. What we’re going to try

now is coming up with ways to predict the attendance

of future games. So the objective is to

look at three games that are coming up in the next

week, in particular, Thursday, June 8, Friday, June

9, and Sunday the 11 against these various teams. Now what I’ve done is selected

these games because they all are slightly different,

but they all take place at Chase Field in Phoenix. So let’s think about how we

would come up with predictions for these games,

and in particular, how we can learn about

the Diamondbacks’ schedule, and the unique

characteristics of Chase Field to try to come up with

better informed predictions. Instead of just guessing

what the attendance would be. So the way we’re going

to do this is we’re going to pull some data

from a website called baseballreference.com. And I’m going to show you

some very basic steps that can be done to sort this data

and be able to analyze it in a way to be able to answer

this question of attendance at future games. So a little bit

about Chase Field. It’s been around

for almost 20 years. Originally was called

Bank One Ballpark. And at some point,

Chase Bank came in as the corporate sponsor. The interesting thing is this

is one of the largest capacity stadiums in Major

League Baseball. Almost 50,000 seats

is the capacity. And you’ll see that in the

very high nosebleed seats on the top left of this

photo and on the top right of this photo, it’s

quite tough to get the people to sit up there. I think it’s an

oversized stadium. But that gives you an idea of

what the maximum capacity would theoretically be if this

was a sold out game. So let’s talk about

how we would make the forecast for attendance. There’s a couple of ways. One way to do it is to

estimate by analogy. In other words, to look at

very similar games that took place on a Thursday, a

Friday, and a Sunday, and try to see what the

attendance was for those games. And then try to

compare roughly what these future games, how

similar or different they are from those past games. So that’s one way. I don’t think that’s

very accurate, and it relies a lot on guessing. So we’re going to use method

two, which is actually looking at historical data, that

is, past attendance at Chase Field, and then use a little

bit of our expert judgment to try to figure out what

additional factors may influence the attendance

of these particular games that we’re interested in. These three games on

June 8, 8, and 11. So let’s talk about briefly

what factors could affect the attendance of a game. So I think weekday

games are less popular than weekend games. If you have a game that’s during

the day during work hours, that happens once

in a while, that’s not going to be as well attended

as a game that’s an evening. Of course, the weather– not in the particular example

of Chase Field, because– some of you may

have been there– but you should be aware that the

roof is actually retractable. So if there is rain

outside or extreme heat, they go ahead and

close this roof and make the inside a very

comfortable 70 or 75 degrees. But the weather could affect the

commute to getting somewhere. That’s harder to predict. Free giveaways are

something that I think would increase the attendance. Who the opponent is

and how the opponent is doing, how the

home team is doing, and then maybe even some

possible personal or team records that are

about to be broken. The harder thing to account

for is other competing sporting events. So in particular,

let me point out that of these three games

that I want you to consider, one of those– I won’t tell you which one,

you’re have to look this up– one of these games conflicts

with the NBA finals. So if people have

to make a choice between going to a regular

season game at Chase Field or watching the

NBA finals on TV, depending on what

sport you like better, you might choose

one over the other. So these are other

things to consider. Not just other sporting

events that may conflict, but also, other events

in and around Phoenix that may conflict. For instance, if there’s a

big conference at the Phoenix Convention Center, that

actually may help attendance at Chase Field, because when

folks go to this conference, the conference ends, and they

just walk to their hotel, they get together for

dinner, and talk about, hey, what are we

going to do tonight? Oh, there’s a home

game at Chase Field. It’s a couple blocks away. Let’s walk over there and do it. So think about all

these considerations that may affect the actual

attendance of the game. So first item for you is to be

able to look up the information on these three games. So we’re going to go to this

website, mlb.com/dbacks, and look at these particular

games that I’m referring to. So here we are on the

website, mlb.com/dbacks, and we’re going

to go to schedule. And when I hover– I haven’t clicked on anything– I’ve just hovered my

mouse over this menu, and allows me to select

the 2017 schedule. So because we’re already

in the month of June, it automatically pulls up

the calendar for the month. So the first game that

we are interested in is the Thursday, June 8

game against the Padres. So when I hover

my mouse over it, it tells me some

information about that game, not just that it’s at 12:40,

but also, that there’s a promotion that day

called Throwback Thursday. Now this is a difficult

one to compare because it’s a day game during the week. And you’re going to

have to look this up, but I think the Padres are

in last place in the National League West. So that’s probably not going to

be a very well attended game. You’re going to have to

make a judgment about that. The very next game

however, on Friday, June 9, is an evening game against

the Milwaukee Brewers. I don’t know how the

Brewers are doing this year, but Friday nights are usually

a little bit more popular. There are no promotions

that day, though. No promotions that game. And the third and final

game I want you to consider is also against the Brewers, but

this is a Sunday afternoon game or early afternoon game, 1

o’clock, 1:10 PM first pitch. But look at all these

promotions going on. Let me expand this link. And it looks like there

are multiple promotions. We’re looking at this one here. Not only do the first

10,000 kids get a free– this must be a plastic

Wiffle ball bat, but there’s also something

going on with Baxter’s birthday. And Baxter is the mascot

for the Diamondbacks. And then there’s a military

and first responder discount. OK, so a lot. Three promotions in one game. So probably, this is going

to influence attendance on the Sunday early

afternoon game. OK, so the second

thing we’re going to do is we’re going to look

at historical information on attendance at

Diamondbacks games. In order to do this,

we’ll go to this website called baseballreference.com. And first thing I

want to point out is that there are some

sibling websites up here– football, basketball,

hockey, soccer– that all contain

the same format. So all the processes that you’re

going to see me go through, you can replicate these things

in these other websites. So for now, we’re interested in

just the Arizona Diamondbacks. There are several ways

to get to this team. We can either click on

this link called teams, or here from the

home page, you’ll see that it’s

available, because they list the current standings. So here you go,

Padres are actually in fourth place in the

National League West. And the Brewers, who

are the other opponents, are in first place in the

National League Central. So two first place teams

playing each other. So that might be interesting. Let’s look at Arizona only. So Arizona currently

is tied for first place with the Colorado Rockies. So there is a lot of information

about this team on this page. And you can see that

the roster is available. The team’s statistics, the

pitching, all the fielding totals. But the only thing

we’re interested in are the schedule and results

available at the very, very top of this page. So by clicking on this

page, we get the information that we are interested

in is here, attendance. So when you hover over

any of these columns, you get a definition

of that particular data under that column. Now you’ll see in this

case, they actually define what attendance means. It’s the number of

tickets sold at that game. That’s very different than

the number of people attended. So I would say this is a little

bit of an inflated number because people who

actually attend the game is slightly lower than

the number of tickets sold. So this is the number

we’re going to use. Not a big deal. It’s still very helpful. They list all the attendance

for all the games so far. So you’ll notice a few things. Up until present time,

the most recent game was played Thursday, June 1. I’m recording this

video on Saturday. So you’ll see that the

Friday information has not been uploaded. So that already

tells you the data is already a day old or

a day behind in terms of updating this website. That’s pretty good. Pretty good. OK, so how do we take

all this information and actually use

it to understand the typical attendance

for Diamondbacks games? So what we’re going

to do is we’re going to first modify this table. So I’m going to click

on Share and More, and select the first option,

which is modify and share. So what it does is it

converts this table into an editable format. So because I’m only interested

in home games and attendance at those home games,

there are a lot of room The columns on here that I’m

going to actually eliminate. So the nice thing about

baseballreference.com is it has these little circle

buttons so that if I click on, it eliminates that column. So I’m going to

eliminate streak. I’m going to eliminate

the opponent, whether it was a win or a loss. All the other

information that is irrelevant to this particular

question that I want to answer. So I am leaving

this column, which is showing me whether the

home the game is home or away. I’m going to eliminate

the team column, because by default, all

the data in this table is for the Arizona Diamondbacks. I’m also going to

eliminate the box score. I don’t I don’t care about

the details of the game. I only care about whether

the game was at home and what the attendance

was at that game. Now the season is still

going on in the data here is displayed in

chronological order, meaning that we only have

information available up until the present time

or delayed by a day. So what I’m going to do is,

because all this information is blank, I’m actually going to

click on the little button that looks like a home plate. And it’s going to eliminate

all the rows underneath. So now I’m only left

with attendance for games up to the present time. Now I’m ready to extract the

data and put it into a program that I can do some

manipulation on. So I’m going to go back to this

Share and More pull down menu, and I’m going to get

the table as a CSV. Now CSV stands for

commerce separated value. I’m going to click on that. You’re going to see

how the data now changed from a very nice

graphical summary and a table to sort of very raw

looking piece of data which, I’m going to copy. I’m going to highlight it,

right click, and select Copy. So now I’ve got all this

data on home attendance. And actually, on

total attendance. I’m going to do some additional

sorting in just a moment. But I’ve got all this

data on attendance to date for this season. So in Microsoft Excel, I’m

going to paste this data. There it is. Now, you’ll see that

it’s sort of all crunched only into

column A. So we have to do some

additional steps in order to separate the data

into multiple columns. So I’m going to go

to the data column. And there’s an option there

called text to columns. And what it does is

it starts to separate the information that you have. As long as it’s

separated by commas– in this case, it’s

delimited by commas– I’m going to hit Next. And then I’m going

to say no, comma is the actual separator

for this information. Hit Next. Yeah, that’s great. Finish. Whoops, sorry. Let me do that again. I should have highlighted the

entire sheet before I did that. So I highlighted

the entire sheet. Or I can actually just

get away with highlighting column A, because all the

data is in column A. Again, go to Data menu. Text to columns. Delimited. Correct. Comma delimited. And you’ll see down

here it gives you a preview of what the data is

going to look like before we actually select Finish. Look at that. That’s great. That’s what I want. You notice that the data

in column B sort of creeps into column C. So

what I’m going to do is I’m going to get right

in between those two columns and I’m going to double click. And what it does is it

automatically readjusts the width of that column. Now this column C is

not labeled anything, because when we pulled the data

from baseballreference.com, it didn’t have a label

back then, anyway. So I do want to label it now. And I’m just going to call it

H or A, as in home or away. Now because the question

that I want you to answer has to do with whether

the attendance at home gives us any hints,

we want to get rid of all the games that are away. And those are marked

by this @ symbol. So you’ll see that

the Diamondbacks played a bunch of away games. Then came home, played a

bunch of games at home, and then played a bunch of

games away, and so forth. So what I’m going to do

is I can either manually come in and remove each

row, which would take me a few seconds to do. Or I can be a little

smarter and say OK let’s go to Data and Sort. So what I’m going to

do is now sort the data by whether the games

were home or away. Before we even do

this though, you have to select this button

that says my data has headers. And what this tells

Excel is that row number 1 are just labels. And everything

underneath is data. So we click that, and you’ll

see how it sort of left out row number 1. And then I’m going to

say Sort by H or A. So what it’s done is it’s put

all the away games on the top, and all the home

games on the bottom. So now I can just select

the first few rows now I’m going to right

click on this section, and I’m going to say, delete. Goodbye. Now I’m only left with

data for home attendance. So there’s only

been 29 home games. Now let’s do some

analysis on this data. This column is no

longer useful to me because I’ve already

sorted the data. So I could right click on column

C and say goodbye as well. Now I’m only left with the

information that I care about. I’m going to double click on

this spot between columns C and D to readjust the size. I’m going to do one more

thing, because in order to look at these numbers,

it’s much easier for me to appreciate how large

these values are when there’s a comma in between. So I’m going to go to Home

and click the comma button. And then it’s going

to convert the number to a different format. And I don’t really need

the decimal points, so I’m going to

click on this button and say, I’m not interested

in fractions or decimals. I just want the whole number. So here’s the attendance

for the home games thus far at Diamondbacks

stadium, at Chase Field. Now what I’m going to do is try

to get some summary statistics for this data only. So I’m going to add a

couple of rows underneath. One is going to

be called average. The other one is going

to be called minimum. The next one maximum. And then the last one

is standard deviation. And I’ll explain

each of these, how it’s going to be helpful to us. So let’s use formulas in Excel

to calculate these four things. So with our mouse here

on this particular cell, I’m going to go to Formulas. And there’s already some

frequently used formulas in this menu called AutoSum. So I want Excel to

calculate the average. And look how it automatically

highlighted the information above. Now if it didn’t

automatically do that, you can actually put the mouse

here and highlight it yourself. Or you could go into the formula

itself, say this was blank, and manually type in, I want you

to analyze the data from cell C2 all the way to C30. And then you get

the same result. So the outcome

you get is 22,649. So that is the average number

of tickets sold for the first 29 games this season. Now that gives me a good sense

of what the typical number is, but I also want to understand

the range of numbers. So what’s the minimum

number of tickets and what’s the maximum

number of tickets. Let’s use other formulas

for this as well. So I’m going to go

to Formulas again. Sorry. Formulas again. My functions are

already in here, but if they don’t show

up on here for you, I’m going to go

to more functions. Formulas. More functions. So here in this

function menu, it’s asking me to tell it something

about the type of formula that I want. So I’m going to type

in min for minimum. And it down here,

it’s going to give me some recommendations

of formulas that relate to this word, this term. So fortunately,

there’s one called min. And that gives me the smallest

number in a set of values. That’s exactly what I want. OK, so what’s the range. Well, I can either

type it in manually, C2 all the way to C30. Or I could have

highlighted it in order to achieve the same outcome. So this the least

attended game is 12,215. Now let’s go to max. There is a formula that I

can access through this menu. More functions called max. Enter. And here it is. Returns the largest

value in a set of values. Bingo, that’s what I want. Hit OK. So I want you to analyze from

cell C2 all the way to C30. Or– I’m going to remove that– I can also just

click on this button. And it’ll send me to

the data, and give me an opportunity to select. Then I’m going to hit Enter. C2 to C30, yes. Great, go. So the maximum number is 49,016. And I bet you that

was opening day. Game one, 49,000. Yes, opening day is

always very popular. So we know the average. We know the minimum. We know the. Maximum Another

descriptive statistic is standard deviation. So there is a formula

for that as well. I go to formula,

AutoSum, more functions. And then I’m going to type

in standard deviation. So there are a lot of

versions of this formula. Let’s just use a

very basic one called STDEV for standard deviation. And it estimates a standard

deviation based on a sample. Yeah, that’s what I want. Give me the standard

deviation of C2 to C30. Either I’m typing it in manually

or I go to the actual data, highlight it, and

then hit Enter. All right, beautiful. So what this

information tells me is that one standard deviation

from the mean, that is, 8,000 fans, either

above or below the mean, above the $22,000 number or

below the $22,000 number, the majority of the data

sits within that range. So to put it in different

terms, about 60% of the time, the attendance at Chase

Field is between 30,000– that is 8,000 plus 22– and 22,000 minus 8,

so that’s 14,000. So 14,000 to 30,000,

66% of the time, that is the attendance

at the Chase Field. OK, so now your

job is to replicate what I’ve done in

Microsoft Excel by using the data on

baseballreference.com. But notice that the only

thing that I’ve done is I’ve compared data from to 2017. What you could do is look

at data from the 2016 season as well. Or look at data for

an entire decade, and try to utilize

that information to figure out how the

various promotions, how the various opponents influence

this column right here, which is attendance sorted by this

column here, home games. OK, I hope that’s enough

information for you to replicate this. And then eventually,

your objective is to submit your

predictions on this survey. So let me go to this

survey and show you where I want you to

enter your predictions for this particular exercise. It’s a Google survey

that I’ve created. And what I want you to do

is enter your prediction– this is a numerical input

for the Thursday game. And then I want you

to justify how did you come up with this number. Just give me a few data

points on how you actually obtained that. Then you estimate the

attendance on the Friday game and justify that. And finally, the attendance

at the Sunday game and justify that. So that is the extent of

this particular exercise. There are more components

of homework nuber one in the first inning. But this just will get you

started with utilizing the data from baseballreference.com and

making some inferences and some descriptive statistics– in particular, average,

minimum, maximum, and standard deviation. All right, play ball.