Sunday, August 28, 2011

Colleges and Universities Account for a Disproportionate Number of Reported Data Breaches


As the depressingly steady march of breach notifications comes across my RSS feeds, I notice that US colleges and universities seem to be the victims of an awful lot of breaches.  At least, when I skim the list of breaches cataloged by resources like the DataLossDB and the Privacy Rights Clearinghouse, the names of colleges and universities stick out to me.  It sure looks like higher education makes up a disproportionate number of breach victims.  Several other infosec writers, both inside and outside academia, have made the same point (see below for more articles on the topic).

But maybe that’s just confirmation bias.  Maybe colleges and universities are not breached any more often than other organizations, and it just seems that way to my subjective memory.  So I decided to dig into the data a little and see whether institutions of higher education – or, for that matter, any particular type of organization or business – account for a disproportionately large percentage of breach reports.


I decided to look at publicly reported incidents of external breaches that were:  1) known or presumed to be malicious; and 2) not carried out or assisted by malicious insiders.  I was only interested in reports of “hacks”, not in reports of lost laptops, accidental exposures on FTP servers, accidental emailings, insider abuse, and so on.

I reviewed breach reports archived at the DataLossDB and the Privacy Rights Clearinghouse for the period January 1, 2009 through August 16, 2011.  I selected only breaches of US organizations, and only those reported as malicious external attacks.  That gave me a list of 315 reports.

To categorize each victim organization, I used the North American Industry Classification System (NAICS), a hierarchical, six-digit classification system used by the US Census, the Bureau of Labor Statistics, and other agencies.  It’s a little like a Dewey Decimal System for businesses, non-profits, and other organizations, collectively called “Firms”.  You can find a good overview on how the NAICS codes work here.

I used various online databases and my local library to find the NAICS code for each of the 315 breached organizations.  When an organization had more than one NAICS code, I used the primary code.  Ultimately, though, the number of organizations with multiple NAICS codes was relatively small and did not significantly affect the major findings of my research.

With the 315 breach reports and their associated NAICS codes on a spreadsheet, I could break the data up in various ways.  Grouping the organizations by the first three or four digits of their NAICS codes provided interesting insight into what types of organizations made up the bulk of the reports in the dataset.
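The prefix grouping described above can be sketched in a few lines of Python.  This is only an illustration of the idea, not the spreadsheet I actually used: the organization names are invented, and the NAICS codes are placeholders chosen only so that their prefixes match categories discussed in this post.

```python
from collections import Counter

# Hypothetical sample rows: (organization, six-digit NAICS code).
# Illustrative placeholders only, not rows from the real dataset.
sample_reports = [
    ("Example University", "611310"),
    ("Example College", "611310"),
    ("Example Diner", "722110"),
    ("Example Hotel", "721110"),
    ("Example Hospital", "622110"),
]


def count_by_prefix(reports, digits):
    """Count breach reports per NAICS prefix of the given length."""
    return Counter(code[:digits] for _, code in reports)


three_digit = count_by_prefix(sample_reports, 3)
four_digit = count_by_prefix(sample_reports, 4)

# List the three-digit categories, most frequently breached first.
for prefix, n in three_digit.most_common():
    print(prefix, n)
```

Because NAICS is hierarchical, truncating the code is all it takes to move between levels of detail, which is why the same function serves for both the three-digit and four-digit groupings.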

However, merely counting the number of breaches in each NAICS category answers only half the question.  What I really wanted to know was whether any NAICS category was disproportionately represented in the breach reports.  For that, I needed to know how many organizations in total fall into each NAICS category.  The US Census helpfully provides that information in publicly available datasets.  Armed with that information, I could see whether particular NAICS categories accounted for more or fewer breaches than would be expected if breaches were evenly distributed across the pool of US organizations.
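As a rough sketch of that "evenly distributed" baseline: if breaches were spread evenly, a category's expected number of reports would simply be its share of all firms multiplied by the total number of reports.  The firm counts in the example call are hypothetical round numbers chosen to make the arithmetic easy to follow; only the 315 comes from the dataset.

```python
TOTAL_REPORTS = 315  # breach reports in the dataset described above


def expected_breaches(category_firms, total_firms, total_reports=TOTAL_REPORTS):
    """Reports a category would account for if breaches were spread evenly across firms."""
    return total_reports * category_firms / total_firms


# Hypothetical round numbers, for illustration only:
# a category holding 1% of all firms.
expected = expected_breaches(category_firms=50_000, total_firms=5_000_000)
print(f"expected reports: {expected:.2f}")
```

A category that reports noticeably more breaches than this baseline is over-represented; one that reports fewer is under-represented.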

You can download my data from SkyDrive.


Let’s begin with an example.  NAICS codes beginning with 722 cover “Food Services and Drinking Places”.  We know that computer criminals like to attack bars and restaurants to try to get credit card numbers.  And indeed, in the set of 315 reported external breaches between January 2009 and August 2011, 32 breaches were reported by organizations with NAICS codes beginning with 722.  So, bars and restaurants accounted for a little over 10% of reported external breaches.  That sounds bad, but it sounds a little bit better when you realize that there are nearly 425,000 firms in that NAICS category, or about 7.2% of all US firms.  That is, bars and restaurants make up 7.2% of US businesses, and 10% of reported breach victims.  So, they are breached somewhat out of proportion to their numbers, but it’s nothing terribly striking.

Hotels and resorts fare worse.  NAICS codes beginning with 721 (“Accommodation”) account for less than 1% of US firms (n=52,274) but about 4.4% of external breaches (n=14).  That lines up with the conventional wisdom that carders like to go after hotels.  So, hotels do account for a disproportionate number of breaches, but we’re not in Crazy Town just yet.

Hospitals (NAICS codes beginning with 622) only make up about 0.06% of US firms (n=3,948) but a whopping 3.17% of reported breaches (n=10).  Clearly, hospitals are dramatically over-represented among breach victims.

But what of colleges and universities, which got me thinking about this in the first place?  NAICS codes starting with 611 apply to “Educational Services”, a fairly broad category that includes primary and secondary schools, colleges and universities, martial arts training, sports camps, testing services, and various support services.  Still, looking just at the three-digit category, these types of organizations make up 1.3% of US firms (n=78,620) and a full 20% of reported external breaches (n=63), more than any other three-digit NAICS category and almost double the nearest competitor, bars and restaurants.

If you drill down to the next NAICS level, you find that colleges and universities (excluding junior colleges and trade or technical schools) get NAICS codes beginning 6113.  This category includes 2,424 organizations, a mere 0.04% of US firms.  However, these organizations report 14.6% of breaches (n=46). 

To put this in perspective, look again at the 622 category, hospitals.  Hospitals (0.06% of firms, 3.17% of breaches) are over-represented in the breach data by a factor of about 48 (i.e., roughly 3.17/0.06, using the unrounded percentages).  In other words, hospitals show up as breach victims about 48 times as often as would be predicted by the number of hospitals if breaches were distributed evenly among all US firms.  That is not a good number, but it pales in comparison to colleges and universities, which turn up in breach reports about 357 times as often as they would if breaches were distributed evenly (~14.6/0.04).  That is a staggering number.
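For anyone who wants to check the arithmetic, here is a minimal sketch of the over-representation calculation using the counts quoted in this post.  One input is an assumption: the total number of US firms is not stated above, so `TOTAL_FIRMS` is back-calculated from the quoted percentages (roughly 425,000 firms being about 7.2% of the total) and should be treated as approximate.

```python
TOTAL_REPORTS = 315      # breach reports in the dataset
TOTAL_FIRMS = 5_930_000  # ASSUMPTION: approximate, back-calculated from the quoted percentages

# category -> (number of firms, number of reported breaches), as quoted in the post
categories = {
    "722 bars and restaurants": (425_000, 32),  # firm count given only as "nearly 425,000"
    "721 accommodation": (52_274, 14),
    "622 hospitals": (3_948, 10),
    "611 educational services": (78_620, 63),
    "6113 colleges and universities": (2_424, 46),
}


def overrepresentation(firms, breaches,
                       total_firms=TOTAL_FIRMS, total_reports=TOTAL_REPORTS):
    """Ratio of a category's share of breaches to its share of firms."""
    breach_share = breaches / total_reports
    firm_share = firms / total_firms
    return breach_share / firm_share


factors = {name: overrepresentation(f, b) for name, (f, b) in categories.items()}

# Print categories from most to least over-represented.
for name, factor in sorted(factors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {factor:.1f}x")
```

With that assumed total, hospitals and colleges land close to the factors of about 48 and about 357 given above.  A factor of 1.0 would mean a category is breached exactly in proportion to its numbers.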

Keep in mind, I excluded from the dataset attacks known to have been committed or assisted by insiders.  For the purposes of this research, I considered students to be insiders.  So, these numbers do not include reports of students breaking into their own colleges’ networks.  These are just breaches committed by unknown external intruders.


What accounts for this striking imbalance?  Honestly, I don’t know.  It could be that there is some bias in the data recorded by the sources I reviewed.  It is also probably the case that legal reporting requirements lead to some disparity in which breaches get publicly disclosed.  Hospitals, for one, are subject to the reporting requirements of HIPAA and the HITECH Act, which may help account for the large number of hospital breaches in the dataset.

Or perhaps colleges and universities tend to have greater awareness among IT staff or more mature security functions, resulting in a relatively higher rate of breach discovery.

Or the reason could be the opposite: colleges and universities may do a relatively poor job of protecting their data assets.  It may be the case that, perceiving a conflict between security and academic freedom, these institutions leave themselves poorly defended against external attacks, or that they fail for other reasons to address information security effectively.

Or, it could be the case that colleges and universities are targeted more often than other organizations.  Attackers might just like to attack colleges.  However, according to a study published last month by Imperva, modern attacks are heavily automated and targets are often discovered by scanning botnets that “operate with the same comprehensiveness and efficiency used by Google spiders to index websites.  [A]utomation … means that attacks are equal opportunity offenders; they do not discriminate between well-known and unknown sites or enterprise-level and non-profit organizations.”  If this is accurate, it suggests that attackers, looking indiscriminately across the internet for easy victims, all too often find that the easiest victim is a university.

Further Reading

“Are colleges and universities at greater risk of data breaches?”

“An Examination of Database Breaches at Higher Education Institutions”  (PDF)