WHY BIG DATA NOW


Summer 2016 | Stories by Ellis Booker

It’s more than a buzzword—Big Data is a big deal. And though we’ve been creating, collecting and analyzing data forever, the current explosion in digital technology gives us access to an ever-expanding treasure trove of information that’s changing the way we’re conducting research, making business decisions and much more. Of course, Georgia Tech stands right in the middle of the action. 

Humanity generates data at a dizzying pace. By 2020, the amount of data created worldwide is expected to hit 44 zettabytes—the equivalent of 44 trillion gigabytes, according to IDC Research.

Yet some computer science researchers wince at the now-popular term “Big Data.” They point out, correctly, that volumes have been getting bigger for decades, as the cost of storage has tumbled, and as the things we produce and consume—documents, media, business applications and even social interactions—have become digital.

And if you thought 300 to 500 million tweets per day or 300 hours of video uploaded to YouTube per minute are impressive numbers, hang onto your hat. The tsunami of human-created data will soon be outpaced by a constant stream of data flowing from devices: sensors in smartphones, cars, homes, medical devices and machinery, to name but a few pieces of the rapidly growing Internet of Things (IoT).

“One thing we’re seeing in several domains is data spiraling up faster than our ability to analyze it,” says Srinivas Aluru, professor of computational science and engineering, and co-director of the Institute for Data Engineering and Science at Georgia Tech.

In some domains, data volumes will demand entirely new approaches to storage, let alone analysis. Take the Square Kilometre Array (SKA), a large, multi-radio telescope project planned for construction starting in 2018 in Australia and South Africa. By one estimate, SKA will produce 62,000 petabytes (one petabyte equals 1 million gigabytes) of data annually. For comparison, worldwide annual Google searches generate only about 100 petabytes.

But Aluru and his colleagues emphasize it’s not just the profusion of data, it’s that much of it is inherently “noisy” and difficult to analyze. What’s more, the data is heterogeneous, flowing from an ever-expanding set of sources, increasingly in real time. This variety—along with the two other “Vs” of Big Data, “Volume” and “Velocity”—is key, and a big reason Big Data is impacting all disciplines, all businesses, all of us, right now.

While data accumulates within domains as diverse as healthcare, urban planning and materials science, researchers, governments and industry are increasingly interested in combining data sets, and using algorithms to search for interesting patterns and correlations. This has spurred unprecedented interdisciplinary collaborations.

This is a new phenomenon, says Renata Afi Rawlings-Goss, senior research scientist at Tech.

“The real good of Big Data is that it is crossing so many fields that people are seeing it is untenable to solve these problems one discipline at a time, one nation at a time,” she says. “In that sense, Big Data has been a unifying agent, an impetus.”

Rawlings-Goss would know. She serves as co-executive director of the South Big Data Regional Innovation Hub. The South BD Hub—serving 16 states and the District of Columbia—is jointly housed at Georgia Tech and the University of North Carolina and receives funding from the National Science Foundation (NSF). Other consortia members are Columbia University (Northeast Hub); the University of Illinois at Urbana-Champaign (Midwest Hub); and the University of California, San Diego, the University of California, Berkeley, and the University of Washington (West Hub).

Each innovation hub is set up in a hub-and-spoke arrangement, where the spokes are mission-driven around things like healthcare, coastal hazards, smart cities or manufacturing. “Coordination around Big Data is important because so many key players are in their own silos,” Rawlings-Goss explains.

The NSF regional hubs build upon the National Big Data Research and Development Initiative announced in 2012 by President Barack Obama’s administration. That initiative, which disburses funds through six federal agencies, including the NSF and National Institutes of Health (NIH), helped fuel collaborations and national attention on Big Data, Aluru says. Indeed, his team won one of the eight inaugural Big Data awards for its work on genomics. He also led the effort for Tech to be selected as an NSF Big Data hub host.

But the collaboration is international, too. In May, for instance, Aluru flew to Tokyo, where he took part in a two-day meeting between NSF-funded Big Data principal investigators from the U.S. and their Japan Science and Technology Agency counterparts.

Aluru says the goal of the meeting was “to take stock of what’s happening in researchers’ respective countries, and figure out how they might collaborate.”

Closer to home, Georgia Tech is involved in bringing the promise of Big Data analytics to sectors of the economy that—unlike financial services, online retail and marketing—have been slow to adopt data analytics.

Data Science for Social Good Atlanta, begun by Tech’s Ellen Zegura, places students on multidisciplinary teams working under the supervision of a professor on a problem posed by a partner in the city of Atlanta or a local nonprofit organization. Co-directors Bistra Dilkina and Chris LeDantec run this intensive, 10-week paid internship with Zegura.

Among DSSG’s recent projects:

  • Working with the Atlanta Fire Rescue Department (AFRD) to predict which buildings face the greatest risk of fire. The system uses fire permit data, as well as five years of actual building fire records, to create a predictive model for fire risk and a prioritized list of properties for AFRD to inspect. (A minimal sketch of this kind of risk-ranking model follows this list.)
  • Working with the city of Atlanta and Trees Atlanta to help maintain and improve Atlanta’s urban forest. The project will use multiple types of data, such as percentage of tree canopy cover, impervious surfaces and floodplain data, to develop a model that prioritizes planting sites by land parcel. This will help quantify the benefits of planting trees in a given location, assist arborists in finding potential planting sites and enable policymakers to make well-informed decisions about the future of Atlanta’s urban forest.
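
For the curious, here is a minimal sketch in Python of how a risk-ranking model like the fire project’s might be wired together, using pandas and scikit-learn. The file name and feature columns are hypothetical stand-ins, not the DSSG team’s actual data or code.

    # Hypothetical fire-risk ranking sketch, not DSSG's actual code.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Assumed input: one row per property, with permit and inspection
    # features plus a label for fires in the past five years.
    df = pd.read_csv("properties.csv")
    features = ["permit_count", "violation_count", "year_built", "floor_area"]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(df[features], df["had_fire"])

    # Rank every property by predicted fire probability to produce
    # a prioritized inspection list for the fire department.
    df["risk"] = model.predict_proba(df[features])[:, 1]
    print(df.sort_values("risk", ascending=False)[["address", "risk"]].head(20))

In practice such a model would be validated on held-out years of fire records before its rankings guided real inspections.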

Supplying the next generation of data scientists is the final push at Georgia Tech. The need is well documented: according to a new report from company review site Glassdoor, data scientist ranks as the best job in America, with a median salary of $116,000.

Take Georgia Tech’s effort to improve curriculum and training for computer scientists—specifically around Big Data and cybersecurity. Assistant professors Hadi Esmaeilzadeh and Taesoo Kim introduced new courses and labs in the School of Computer Science that take a data-driven approach to malware analysis. The work supports a separate NSF project to train students in security and Big Data analysis as the two fields converge. Today’s computing professionals need a deep understanding of both, the professors say, yet only a small number of students have taken courses in either area.

The professors’ new course modules will be tried first at Georgia Tech, and then released to a broader community in academia and industry.

Security and privacy, always major topics surrounding Big Data, have taken on new urgency thanks to the Internet of Things. Indeed, IoT was in the spotlight last October during Georgia Tech’s 13th Annual Cyber Security Summit. Not only did the ensuing “2016 Emerging Cyber Threats Report” dedicate an entire chapter to IoT, the summit drew Department of Homeland Security cybersecurity undersecretary Phyllis Schneck, PhD CS 99 (related story, page 52), as its headline speaker.

For Bo Rotoloni, co-director of the Institute for Information Security & Privacy at Georgia Tech, the Internet of Things has obvious security implications. He leads two large information security labs, which together encompass about 400 researchers at the Georgia Tech Research Institute (GTRI). Rotoloni’s teams are working on “trust,” both for machine-to-machine and machine-to-human communications. “When everything is connected, how do you assess the trust of the data you’re receiving?” he asks.

Another topic, one that will affect everything from shopping to smart cities to healthcare, is privacy. “Who decides what’s private and what isn’t?” Rotoloni asks. “Even anonymized data can yield personally identifiable information if combined with a few publicly available data sets.”
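
As a toy illustration of the linkage risk Rotoloni describes, consider joining an “anonymized” medical file with a public voter roll on a few quasi-identifiers. The file names and columns below are invented; the point is only that the join re-attaches names.

    # Toy linkage-attack sketch; file names and columns are hypothetical.
    import pandas as pd

    # "Anonymized" records: names removed, but ZIP code, birth date
    # and sex remain as quasi-identifiers.
    medical = pd.read_csv("anonymized_medical.csv")  # zip, birthdate, sex, diagnosis
    voters = pd.read_csv("public_voter_roll.csv")    # zip, birthdate, sex, name

    # Joining on the quasi-identifiers re-attaches names to diagnoses
    # whenever the combination is unique, which it very often is.
    reidentified = medical.merge(voters, on=["zip", "birthdate", "sex"])
    print(reidentified[["name", "diagnosis"]].head())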

Privacy regulations unfortunately can’t keep up with such analytics. Rotoloni and others say technology always outpaces the regulations. “Policies are always going to lag the technology,” he says. “Technology gets pushed out, something happens, and then you decide you need a policy.”

The application of advanced analytics to Big Data sources has helped create a number of commercial startups out of Georgia Tech, too.

Take Damballa, a maker of threat-detection systems for enterprise networks. The company harvests and trains its systems on the industry’s largest unfiltered data set, some 15 percent of the world’s Internet activity, and monitors three-quarters of a billion devices every day. From its analysis of this massive data set, which includes 1.2 trillion DNS (domain name system) queries every day, Damballa can predict and find malicious behavior. “While the amount of data we harvest is impressive, it’s what we can do with it,” says Damballa Chief Technology Officer Stephen Newman, MS EE 97. “Our researchers build different machine-learning systems that look at the raw data, and what compromised devices will do, so they don’t remain hidden in the network.”
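
Damballa has not published its models, but a crude flavor of DNS-based detection can be sketched: score each queried domain by character entropy, since the algorithmically generated domains that malware uses to find its command-and-control servers tend to look random. The log format and threshold below are illustrative assumptions, not Damballa’s methods.

    # Illustrative DNS anomaly scoring, not Damballa's actual system.
    import math
    from collections import Counter

    def entropy(label: str) -> float:
        """Shannon entropy of the characters in a domain label."""
        counts = Counter(label)
        n = len(label)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # Assumed log format: (device_id, queried_domain) pairs.
    queries = [
        ("dev-17", "x9k2qzr7vbnp.example.net"),  # DGA-like, high entropy
        ("dev-03", "mail.google.com"),
    ]

    for device, domain in queries:
        score = entropy(domain.split(".")[0])
        if score > 3.5:  # hypothetical threshold
            print(f"{device}: suspicious domain {domain} (entropy {score:.2f})")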

Damballa has two core product lines: one for communication service providers, ISPs (Internet service providers) and telecoms, including some of the largest companies in North America, and another for enterprise customers, ranging in size from 500 to 150,000 employees.

Damballa was spun out of research by Georgia Tech professors, including Merrick Furst, distinguished professor in the College of Computing and founder-director of Tech’s Flashpoint startup accelerator; computer science professor Wenke Lee; PhD student David Dagon; and Richard Lipton, Frederick G. Storey Chair professor in the School of Computer Science.

Damballa continues to enjoy a close relationship with Georgia Tech, employing Lee’s students as interns to do primary research, which is published and presented at conferences worldwide. Newman says the company hasn’t had trouble attracting top-notch data scientists, a pervasive complaint in business these days. “Because we have this very large, unbiased data set, it’s easy to attract the best,” Newman says.

At the end of the day, it isn’t surprising that Big Data intrigues Srinivas Aluru and other Tech researchers. Exploring massive quantities of information, or combining data sets in novel ways and then using algorithms to search for patterns, is an act of exploration, going where others haven’t set foot.

“The data is lying there, and there may be interesting things that we have yet to discover,” Aluru says.

BIG DATA'S IMPACT: THE PROMISE OF PRECISION MEDICINE

Sequencing the first human genome—mapping the DNA in a complete set of human genes—cost billions of dollars at the turn of the century and required an international consortium to complete. Today, sequencing runs under $1,000 a pop, and is on track to drop below $100 in the next three years or so.

“At that point, it would become fairly routine for every human to be sequenced and then that genomic information can be used as part of medical care,” says Srinivas Aluru, computational science professor and co-director of the Institute for Data Engineering and Science at Georgia Tech.

In 2012, Aluru and his team of researchers were among the eight inaugural winners of Big Data grants from the National Science Foundation and National Institutes of Health. The grants were awarded to bolster the development and use of high-performance computing techniques for studying large DNA sequencing datasets, with applications to plant and human genomics. While any two humans’ DNA is 99.9 percent identical, that still leaves approximately 3 million differences within the 3 billion nucleotides. And as more and more people have their DNA sequenced, the data volume will only grow. But that’s just the start.

“One of the grand challenges is taking tens of thousands of sequenced genomes and looking for variations from this large number of patients and their medical histories,” Aluru says. If genetic variations can be correlated to particular diseases, he adds, “you can start taking preventive measures, before the onset of disease.”
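
At its simplest, the variant-disease correlation Aluru describes is a statistical association test repeated across millions of genomic sites. A minimal sketch for a single variant, with made-up case/control counts:

    # Minimal association test for one genetic variant; numbers are invented.
    from scipy.stats import chi2_contingency

    # 2x2 table: rows = carries variant / does not; columns = disease / healthy.
    table = [[120, 380],   # variant carriers
             [60, 440]]    # non-carriers
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, p = {p:.2e}")

In a real genome-wide scan this test runs at millions of sites, so a result must clear a far stricter multiple-testing threshold before it counts as a correlation worth acting on.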

This is precision medicine, also known as personalized medicine, which seeks to identify and treat the exact form of disease in patients based on their genome. It also looks at other factors, such as the interaction of genes and environment, and sometimes even the microbial organisms living in our bodies. This personalized approach also allows doctors to tailor drugs to each individual and avoid ineffective or harmful drugs.

Outside of genomics, there are many other applications of Big Data in healthcare. Thanks to the electronic health record, or EHR, it is now possible to analyze millions of patient medical histories, treatments and outcomes to create computer models that predict the onset of disease and suggest the most effective drugs.

That’s the focus of Jimeng Sun, an associate professor in computational science and engineering. “All the people in medicine, researchers or practitioners, have known that variation exists,” Sun explains. “But in the past, for hundreds of years, all the treatments were designed to treat an average patient because they were based on a standard protocol.”

“Today we can create personalized models of how a disease progresses, and predict which drugs are likely to work for an individual based on their data,” he says. “Besides advances in computing, having the data in electronic form from many, many patients over a long period of time is really the key difference now.”

For Sun and his team—a group composed of 10 students and postdocs, including two MDs pursuing their PhDs in computer science—the hope is that by applying large-scale predictive modeling and so-called “similarity analytics,” medicine can be individualized for each patient.

For example, Sun has used machine learning to predict heart failure and the onset of hypertension, with remarkable accuracy. His most advanced model can predict the onset of heart failure with more than 80 percent accuracy, six to 12 months before a conventional diagnosis. This gives both patient and doctor “a lot more time to adjust patient behavior or start early intervention,” Sun says.
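
Sun’s models themselves aren’t public, but the general shape of an EHR-based early-warning model can be sketched: aggregate each patient’s diagnosis and lab history into features, then learn to predict onset in a later window. The file and feature names below are hypothetical.

    # Sketch of an EHR onset-prediction setup, not Sun's actual model.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Assumed input: one row per patient, with counts of diagnosis codes
    # and lab values observed before the prediction window, plus a label
    # for heart failure onset six to 12 months later.
    df = pd.read_csv("ehr_features.csv")
    X = df[["n_hypertension_codes", "n_diuretic_orders", "mean_bnp", "age"]]
    y = df["onset_within_12mo"]

    model = LogisticRegression(max_iter=1000)
    print("AUC:", cross_val_score(model, X, y, scoring="roc_auc").mean())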

This work, which began with an NIH-funded project when Sun was at IBM before joining Georgia Tech, is still in the research phase. Aside from gaining regulatory approval for clinical trials, one key challenge is technical in nature. Integrating sophisticated predictive models into existing EHR software is difficult, and some EHR vendors prevent this kind of add-on, Sun says. Happily, there is some movement on interoperability standards in EHR systems, he notes.

A second challenge is physician buy-in—that is, getting doctors to change decades-old treatment practices. “In the past, it could be quite tricky,” Sun admits. But as data-driven models become more common, in medicine and elsewhere, this will be less of an obstacle, he says.

To move the medical profession along, Sun works with clinical partners to publish their results in trusted medical journals and presents his computer models at conferences whenever possible. Such an approach can work well, Sun says, remembering an earlier project.

He helped a pharmaceutical company develop an algorithm to determine the best epilepsy drug for a particular patient. “Initially, the experts were very skeptical about the idea,” he recalls. After almost a year of development, and a face-to-face meeting with pharma experts, Sun presented his team’s findings, the details of their model, and the data they used. “Many of them were very impressed with the rigor of our model, and we actually recruited several of them to work with us on clinical publications,” he says.

Finally, there’s even an intersection between Sun’s work and genomic research. By mining EHR data, Sun can produce a fine-grained phenotype description of a patient, data that can then be used in the ongoing Big Data genomic research being conducted at Georgia Tech and elsewhere.

BIG DATA'S IMPACT: A FUTURE WHERE NEW MATERIALS CAN BE FORGED FASTER

Can better use of data reduce the time needed to discover, develop and deploy new materials? That’s the hope of researchers like Surya Kalidindi, Georgia Tech professor of mechanical engineering and computer science and engineering.

“Historically, it takes 15 to 25 years to bring a new material from the lab to the marketplace,” Kalidindi says. In highly regulated industries, such as aerospace, the lifecycle from lab to market can be even longer.

The protocols currently used to discover new materials present considerable challenges to researchers because they involve many sequential steps, Kalidindi says. As a result, researchers may be far down the path before encountering an unexpected problem that means they “have to go all the way back to the beginning and start again.”

One way to avoid that is to replace physical experiments with Big Data-driven computer models, which promise to speed up the discovery phase—theoretically allowing materials to be developed in as little as a few years. But there’s a problem: Current materials models aren’t sufficiently accurate when it comes to performance predictions. So manufacturers—forever wary of flawed or defective products—rely on old-school, time-consuming, real-world experiments. Better use of data, Kalidindi believes, will produce better models, which will compress the exploration phase, saving time and money.

“Currently, we’re heavily focused on a suitable data infrastructure to accelerate materials innovation,” he says. To address these infrastructure issues, Kalidindi and his team are busy with a pilot project, due to be operational by the end of the year, that will be made available to the materials innovation community at Georgia Tech.

Kalidindi and his team are also creating computational methods that will unleash machine-learning algorithms on new and legacy data to discover promising new materials. Using data mining techniques to account for variances and uncertainties, researchers can acquire much more rigorous, reliable and complete information.
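
A toy version of that idea, predicting a material property from composition features while reporting the model’s own uncertainty, might look like the following. The compositions and measurements are invented for illustration.

    # Toy materials-property model with uncertainty, illustrative only.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Assumed legacy data: alloy composition fractions -> measured strength (MPa).
    X = np.array([[0.70, 0.30], [0.50, 0.50], [0.90, 0.10], [0.60, 0.40]])
    y = np.array([310.0, 355.0, 280.0, 330.0])

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
    gp.fit(X, y)

    # For candidate compositions, the predicted uncertainty tells
    # researchers which physical experiments are still worth running.
    mean, std = gp.predict(np.array([[0.55, 0.45], [0.80, 0.20]]), return_std=True)
    for m, s in zip(mean, std):
        print(f"predicted strength {m:.0f} +/- {s:.0f} MPa")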

Such work falls under the Materials Genome Initiative, which since its launch in 2011 has sought innovative analysis of public data for the creation of materials, as well as new models that describe the processing-structure-property relationships in structural (load-bearing), functional (electrical, optical or magnetic) or multifunctional materials. Ultimately, Kalidindi and his colleagues envision machine learning at every step of materials science—discovery, development and deployment—in ways that stand to transform the field.

BIG DATA'S IMPACT: IMPROVING SECURITY FOR THE INTERNET OF THINGS AGE

Some 6.4 billion connected devices will be in use worldwide at the end of 2016, up 30 percent from last year and reaching 20.8 billion by 2020, according to research firm Gartner Inc. In fact, 5.5 million new things get connected every day.

At Georgia Tech, researchers have been deeply invested in working on the details of this connection—the Internet of Things (IoT)—which along with great promise brings a complex set of security and privacy concerns.

“Suddenly, you’re drastically increasing the number of devices connected to the Internet,” says Manos Antonakakis, PhD CS 12, assistant professor of electrical and computer engineering at the Institute. As objects as diverse as phones, refrigerators, cars and medical devices increasingly emit data on the public Internet, Antonakakis says, threats to networks increase, as do threats to private information security. “We’re primarily focusing on privacy-preserving data sets, identifying IoT devices, and finding the privacy and security risks of communicating with the external network,” he says.

With billions of IoT chips appearing in all manner of products, how are researchers like Antonakakis keeping pace with the security issues? “The reality is that security on the Internet has been, historically, an afterthought,” Antonakakis explains, adding that IoT is no different. “The first thing that everybody wants to achieve is connectivity and some level of service. When this service is widely adopted and used, everybody goes back to security.”

Last December, in fact, the International Telecommunication Union and Georgia Tech jointly agreed to monitor global IoT activities and collaborate on developing standards. The memorandum of understanding recognizes the importance of standards, and of effectively managing IoT applications, so that value is clearly identified and captured for this fast-growing industry.

Among the multiple projects at Georgia Tech aimed at protecting critical cyber-physical system processes is one called Trustworthy Autonomic Interface Guardian Architecture (TAIGA). The architecture establishes trust at the embedded-control level, creating a small root of trust that sits between physical processes and an embedded controller and maintains known good states. The code for the device is small—so it can be formally verified—and is implemented in hardware, which has additional performance and security benefits.
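
TAIGA itself is implemented in hardware, but the guardian idea can be illustrated in a few lines: a small, verifiable check that sits between the controller and the physical process and refuses commands outside a known-good envelope. The safe bounds below are hypothetical.

    # Conceptual guardian between a controller and an actuator.
    # TAIGA is hardware; this only illustrates the known-good-state idea.
    SAFE_MIN, SAFE_MAX = 10.0, 90.0  # hypothetical safe valve positions (percent)

    last_good = 50.0  # last command known to be inside the safe envelope

    def guard(command: float) -> float:
        """Pass safe commands through; fall back to the last good state."""
        global last_good
        if SAFE_MIN <= command <= SAFE_MAX:
            last_good = command
            return command
        return last_good  # reject the out-of-envelope command

    print(guard(75.0))   # accepted: 75.0
    print(guard(250.0))  # rejected: falls back to 75.0

Keeping the check this small is what makes formal verification tractable.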

For his part, Antonakakis has set up the Astrolavos Lab, where students from both Georgia Tech’s College of Computing and School of Electrical and Computer Engineering conduct research in network security, intrusion detection and data mining. One output of the lab’s work is an objective way of quantifying the risk on a network. Until now, Antonakakis says, there hasn’t been an objective, generalizable yardstick for security problems with high operational impact.

The metric, already widely used at Georgia Tech, will soon be used in one of the largest telecommunication companies in the United States.

This work would not be possible without what Antonakakis calls the “revolutionary developments” in computer storage and computational analytics, among other advances. “The thing with Big Data is, effectively, your ability as a researcher or a company to identify patterns or identify structures in your data that you didn’t know a priori,” he says. In the domain of computer and network security, for instance, “you can analyze and effectively conduct attack attribution on, say, half a decade worth of data around threats,” he explains.

Beyond understanding threats, Antonakakis says, other kinds of Big Data analytics will help us understand the impact of these threats on government, industry or society as a whole.

Big business is also very interested in what happens in the IoT security arena. The Center for the Development and Application of Internet-of-Things Technologies (CDAIT) inside the Georgia Tech Research Institute (GTRI) bridges industry with Georgia Tech faculty and GTRI researchers. Founding members include AirWatch by VMware, AT&T, Cisco, Flex, IBM, Samsung, Stanley Black & Decker and Wipro.