Defining Data Science Trends For 2016


The data science industry generated many headlines this year, from the U.S. Department of Commerce hiring its first chief data scientist to the National Science Foundation launching four regional data science brain trusts. But now that 2015 is winding down, it’s time to figure out where this buzzworthy industry is headed. Here are some projections of the primary trends of data science for the coming year:

Data science will spread to more industries and applications

Eighty-nine percent of corporations believe that not leveraging big data will result in lost market share, a study from Accenture and General Electric found. So as more industries hire data scientists, they’ll be moving away from IT-focused positions and into specialized roles making everyday objects smarter and fine-tuning cutting-edge technology. In fact, the well-known Google Self-Driving Car Project is powered by machine learning that allows autonomous cars to differentiate an exit from a ditch or a child from an adult. Similarly, data science applications will spread to fields like energy forecasting and geopolitics.

The number of data science education programs will increase

By one industry estimate, the number of data scientist jobs rose 57% in the first quarter of 2015 compared with the same period a year earlier. And with McKinsey & Company predicting a massive shortage of analytics talent by 2018, a boom in data science education opportunities is inevitable.

Data science graduate programs are popping up at institutions like Columbia University, and the number of bootcamp-style programs will increase too. However, bootcamps won’t necessarily turn out qualified candidates, as the skills needed — engineering, statistics, industry knowledge, and creativity — can’t be taught in a few months.

Creative agencies turn to data science to optimize campaigns

With the rise of ad blocking, the expected demise of the desktop banner ad and the growth in mobile video, creative — and specifically mobile creative — will be a big theme for 2016. Some say programmatic broke advertising, but programmatic data will allow for the use of current insights to inform creative execution. The use of data science to measure and optimize creative performance will grow, with current creative agencies looking to add data scientists to their roster, either in-house or through partnerships.

Deep learning techniques will become integral to data science

Deep learning makes it possible to teach systems to recognize images or understand spoken language. It also provides multiple representations of underlying data, generating new ways of predicting and informing behaviors. That’s why this subset of machine learning is a natural addition to data scientists’ toolkits.

Data scientists will use deep learning to automate the process of feature extraction and uncover patterns in data that might have gone unnoticed. Consequently, deep learning tools will become widely available as turnkey solutions. Case in point: In November, Google open sourced its artificial intelligence engine, TensorFlow, which features built-in deep learning support.
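The idea of automated feature extraction can be illustrated with a toy sketch in plain NumPy. This is only an illustration of the concept, not the TensorFlow API; the shapes and weights below are arbitrary stand-ins:

```python
import numpy as np

# Toy illustration: a single hidden layer acting as an automatic
# "feature extractor". Each hidden unit learns a combination of raw
# inputs, so no features need to be hand-engineered.

rng = np.random.default_rng(0)

def forward(x, w_hidden, w_out):
    """One hidden layer with ReLU activation, then a linear output."""
    features = np.maximum(0, x @ w_hidden)  # learned features, not hand-made
    return features @ w_out

# 4 raw inputs -> 3 learned features -> 1 prediction
x = rng.normal(size=(5, 4))        # 5 samples of raw data
w_hidden = rng.normal(size=(4, 3)) # in practice, learned by training
w_out = rng.normal(size=(3, 1))

y = forward(x, w_hidden, w_out)
print(y.shape)  # (5, 1): one prediction per sample
```

In a real deep network, many such layers are stacked and the weights are learned from data, which is what lets the model surface patterns a human analyst might miss.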

Datasets will be bigger, better and more widely available

The amount of data in the world is expected to reach 44 trillion gigabytes by 2020, IDC reports. That means a lot more data will be available across a wide array of disciplines.

This will create an “open data” mindset, in which researchers and agencies will share code and data publicly to accelerate learning. At DataScience, we analyze public data from social media to understand the markets we serve. But we also work with expanding open urban data sets to solve bigger problems, like reducing traffic-related fatalities in Los Angeles.

Data science will be adapted to the language of the web

Data scientists rely on the programming languages Python and R to create data visualizations, but that could change. That’s because more open-source projects use JavaScript, a programming language that is synonymous with the web.

Companies are now open sourcing their JavaScript-reliant components — for instance, Uber has open sourced its mapping component built for React-based applications — and JavaScript’s D3 library makes the creation of interactive data visualizations simpler. More importantly, JavaScript-based data visualizations are easily integrated with web applications, a place where Python and R fall short.
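One common division of labor is for the analysis side to hand plain JSON to a JavaScript chart. As a minimal sketch, a Python step can serialize its results for a D3 front end to consume; the field names and figures below are made up for illustration:

```python
import json

# Hypothetical example: a Python analysis emits its results as JSON,
# the natural interchange format between a data pipeline and the web.
monthly_revenue = [
    {"month": "2015-10", "revenue": 120000},
    {"month": "2015-11", "revenue": 135500},
    {"month": "2015-12", "revenue": 151200},
]

payload = json.dumps(monthly_revenue)

# A D3 chart on the page would load this payload (e.g. via d3.json)
# and bind each record to an SVG element; the Python side only needs
# to produce well-formed data.
print(payload)
```

Because the payload is ordinary JSON, the same output can feed any web-based visualization layer without changes to the analysis code.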

Rise of Data Virtualization

The areas of natural language queries, semantics, and data hubs are converging in the realm of data virtualization. At its simplest, data virtualization is the process of opening up company data silos and making them accessible to one another through the use of hybrid data systems capable of both storing and retrieving content in a wide variety of formats.

With data virtualization, data can come in from multiple channels and formats (traditional ETL, data feeds in XML and JSON, word processing, spreadsheet and slideshow documents, and so forth), be mined for semantic attachments, and then stored internally within a data system. Queries to this database can be done using natural language questions – “Who were our top five clients by net revenue?”, “Show me a graph of earnings by quarter starting in 2012”, and so forth. Beyond such questions, data virtualization is also able to present output in a variety of different forms for more sophisticated uses of the data, including providing it as a data stream for reporting purposes and visualization tools.
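The natural-language query idea can be sketched as a toy mapping from one canned question to a query over unified data. This is a hypothetical illustration, not a real product API; the client names and figures are invented:

```python
# Hypothetical unified view of client data, after virtualization has
# pulled records in from multiple silos and formats.
clients = [
    {"name": "Acme", "net_revenue": 9.1},
    {"name": "Globex", "net_revenue": 12.4},
    {"name": "Initech", "net_revenue": 7.7},
    {"name": "Umbrella", "net_revenue": 15.0},
    {"name": "Hooli", "net_revenue": 11.2},
    {"name": "Stark", "net_revenue": 6.3},
]

def answer(question):
    """Map a recognized natural-language question onto a data query."""
    if question == "Who were our top five clients by net revenue?":
        top = sorted(clients, key=lambda c: c["net_revenue"], reverse=True)[:5]
        return [c["name"] for c in top]
    raise ValueError("unrecognized question")
```

A real system would parse arbitrary phrasings rather than match exact strings, but the shape is the same: translate the question into a structured query, run it against the virtualized data, and return the result in the requested form.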

Hybrid Data Stores Become More Common

One thing that makes such systems feasible is the rise of hybrid data stores. Such stores are capable of storing information in different ways and transforming it internally, along with providing more sophisticated mid-tier logic. Such systems might include the ability to work with XML, JSON, RDF and relational data in a single system, provide deep query capability in multiple modes (JSON-query, XQuery, SPARQL, SQL), and take advantage of the fact that information doesn’t have to be serialized out to text and back to do processing, which can make such operations an order of magnitude faster.
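The core idea of keeping one internal record and serializing it on demand can be sketched with Python's standard library. This is a minimal illustration of the concept, not how MarkLogic or any particular store implements it:

```python
import json
import xml.etree.ElementTree as ET

# Sketch of the "hybrid store" idea: hold a single internal record and
# render it as JSON or XML on demand, rather than storing text blobs in
# one fixed serialization.
record = {"id": "c42", "name": "Acme", "revenue": 9.1}

def as_json(rec):
    """Serialize the internal record as JSON."""
    return json.dumps(rec)

def as_xml(rec):
    """Serialize the same internal record as XML."""
    root = ET.Element("client", id=rec["id"])
    ET.SubElement(root, "name").text = rec["name"]
    ET.SubElement(root, "revenue").text = str(rec["revenue"])
    return ET.tostring(root, encoding="unicode")
```

Because the record never has to round-trip through a text serialization internally, a real hybrid store can index and query it in multiple modes against the same underlying structure.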

This area is most readily seen in the XML database space, since those systems have generally been around long enough to have thoroughly worked out the indexing structures necessary to support such capabilities. MarkLogic and eXist-db are perhaps my two favorites in this regard, with MarkLogic in particular bridging the gap. One thing that differentiates these from other systems is the robustness of the mid-tier API, with many things that have traditionally been the province of application logic now moving into the data tier. However, if you look at other NoSQL systems such as Couchbase or MongoDB, you see this same philosophy making its way into these systems, where JavaScript within the servers becomes the glue for handling data orchestration, transformations, and rules logic.

Databases Become Working Memory

One of the more subtle shifts that has been happening for a while, but will accelerate through 2016, is the erosion of tiered development in favor of intelligent nodes with local “working storage”. This is in fact a natural consequence of the rise of lightweight (mainly RESTful) services – applications are no longer concentrated on any one tier. Instead, what seems to be emerging is a model whereby every node – whether a laptop, a mobile device, a server, or even a simple IoT sensor – now has enough processing power to make decisions, and has the ability to store relevant state data either locally or within an intermediate data node.
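The "intelligent node" pattern can be sketched as a small class: decisions happen locally against working storage, and state is handed upstream only in batches. The class and method names here are hypothetical, invented for illustration:

```python
# Hypothetical sketch of a node with local "working storage": each
# reading is evaluated on the node itself, with no round-trip to a
# central server, and accumulated state syncs upstream in batches.
class EdgeNode:
    def __init__(self, threshold):
        self.threshold = threshold
        self.local_state = []  # working storage kept on the node

    def observe(self, reading):
        """Record a reading and make a local decision immediately."""
        self.local_state.append(reading)
        return "alert" if reading > self.threshold else "ok"

    def flush(self):
        """Hand accumulated state to an intermediate data node (stubbed)."""
        batch, self.local_state = self.local_state, []
        return batch
```

The same shape fits a laptop, a mobile app, or a sensor: the node acts on data the moment it arrives, and the tiered back end becomes a sync target rather than the place where all logic lives.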

So there you have it: the data science trends for 2016. Only one thing’s for sure — it’s going to be a big year for data science.

