What is Big Data? According to Wikipedia, “The term ‘Big Data’ (abbreviated as B.D.) is a phrase referring to data processing and statistical analysis driven by technological growth or market evolution.” The term was introduced in a 1997 paper by Raja D. Gopalakrishnan and Youssef El Ghouli, “A Revolution in Information Technology and Its Impact on Business and Trade.”
Gopalakrishnan and El Ghouli first used the term ‘big data’ in a scientific paper published in Science Reviews in July 1997. That paper, “Spatio-Temporal Analysis and Its Application to Biostatistics,” described a new methodology for managing and analysing big data. The methodology combined two main ideas: sampling techniques that were new at the time, and velocity-estimation methods that had only recently been developed by applied behavioural scientists.
They showed that, by using both of these techniques together with a specialised software package called the Temporal Clustering Algorithm (TCA), they could analyse high-velocity, high-volume enterprise database records and produce reliable, accurate results quickly.
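The Temporal Clustering Algorithm itself is not publicly documented, so the following is only a hypothetical sketch of the general idea: grouping timestamped records into clusters wherever the gap between consecutive events exceeds a threshold. The function name and the gap threshold are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of gap-based temporal clustering; the actual TCA
# described in the paper is not publicly documented.
def temporal_clusters(timestamps, max_gap):
    """Group timestamps (in seconds) into clusters, starting a new
    cluster whenever the gap between consecutive events exceeds max_gap."""
    clusters = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(t)
    if current:
        clusters.append(current)
    return clusters

events = [0, 5, 7, 100, 103, 400]
print(temporal_clusters(events, max_gap=30))
# → [[0, 5, 7], [100, 103], [400]]
```

On a high-velocity event stream, each cluster would correspond to one burst of related activity that analysts can then examine as a unit.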
Gopalakrishnan and El Ghouli did not discuss applying their techniques to real-time systems, but they did discuss how the Temporal Clustering Algorithm and their specially designed software tool could be used to exploit existing big-data sources. For example, they suggested that the US Department of Defense should be able to obtain operational, action-planning data quickly and easily. How would this work? The software tool would analyse data sources, such as videos of military actions taken by troops in Afghanistan, captured at a variety of remote locations around the globe and at different times of day. It would allow analysts to identify patterns, link events to locations, evaluate them, and create dashboards that provide crucial insight into how soldiers operate in the field.
They also tackled another important problem faced by companies and organisations of all sizes: how to deal with huge amounts of unstructured, semi-structured or blended big data. The answer, they argued, lies in adopting a multi-tiered approach to data management. Some sources of big data need more manual tracking than others, which is why it is better to use a server-side processing model, where the data types are structured before being stored on users’ machines. This is not always possible, however, so companies must instead use an extractor and indexing tool to pull the needed data types out of the chosen media.
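As a rough illustration of the “extractor and indexing tool” idea, the sketch below normalises mixed records (already-structured dicts and free-text strings) into one schema and builds a keyword index over them. The field names, the email pattern and the schema itself are assumptions made for the example, not anything specified in the original paper.

```python
import re

# Hypothetical extractor/indexer sketch: pulls a few structured fields
# out of mixed records and builds a keyword index over their text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extract(record):
    """Normalise a structured dict or a free-text string into one schema."""
    if isinstance(record, dict):          # already structured
        return {"text": record.get("text", ""), "email": record.get("email")}
    match = EMAIL.search(record)          # semi-/unstructured text
    return {"text": record, "email": match.group(0) if match else None}

def build_index(records):
    """Map each lowercase word to the set of record ids containing it."""
    index = {}
    for i, rec in enumerate(map(extract, records)):
        for word in rec["text"].lower().split():
            index.setdefault(word, set()).add(i)
    return index

docs = [{"text": "quarterly sales report", "email": "ops@example.com"},
        "contact ops@example.com about sales"]
idx = build_index(docs)
print(sorted(idx["sales"]))  # → [0, 1]
```

The point of the sketch is the tiering: structured records pass straight through, while unstructured ones go through extraction first, and everything ends up queryable through the same index.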
Another issue that Gopalakrishnan and his team had to face was how to handle big data that comes from unstructured sources. If the sources are not structured, users will have a very hard time understanding the information, even when they can extract it in an acceptable form. For instance, say a data set comes from a smartphone and has been captured with the user’s consent. Because the phone has a GPS chip, the device will report coordinates for its location, along with whatever other data types the user has agreed to share. Without structure around those raw readings, however, users will struggle to find relevant information in them.
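To show how such raw readings can be made usable, here is a minimal sketch that parses latitude/longitude pairs out of smartphone log lines so the locations become queryable. The log format shown is an assumption for illustration only.

```python
import re

# Illustrative sketch only: the log-line format below is hypothetical.
COORD = re.compile(r"lat=(-?\d+\.\d+)\s+lon=(-?\d+\.\d+)")

def parse_locations(lines):
    """Return (lat, lon) float pairs for every line containing coordinates."""
    points = []
    for line in lines:
        m = COORD.search(line)
        if m:
            points.append((float(m.group(1)), float(m.group(2))))
    return points

log = ["12:01 lat=51.5074 lon=-0.1278 battery=80%",
       "12:02 no fix available",
       "12:03 lat=51.5080 lon=-0.1270 battery=79%"]
print(parse_locations(log))
# → [(51.5074, -0.1278), (51.508, -0.127)]
```

Once the coordinates are typed and tabular rather than buried in free text, they can be joined, clustered or mapped like any other structured data.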
Fortunately, the new age of big data is about machine learning. Machine learning is the process of feeding large amounts of unstructured or semi-structured data into a computer system, such as a neural network, and letting the machine learn to understand and interpret it. Such a system can then create new products, services or applications based on what it has learned from its interactions with humans. It is thanks to machine learning that Facebook, Google and many other companies are now able to compile huge troves of data about their customers and use it to build new products, personalised services and even weapons for combating enemies.
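As a toy illustration of the learn-from-unstructured-text pipeline described above, the sketch below “trains” a bag-of-words scorer on a few labelled sentences and classifies a new one. Real systems use neural networks and vastly more data; this stdlib-only version is an assumption-laden stand-in that only shows the pipeline’s shape.

```python
from collections import Counter

# Toy stand-in for "feeding unstructured text into a learning system":
# count words per label at training time, score overlap at prediction time.
def train(examples):
    """examples: list of (text, label) pairs. Returns per-label word counts."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def classify(model, text):
    """Pick the label whose training words overlap the new text the most."""
    words = text.lower().split()
    return max(model, key=lambda lbl: sum(model[lbl][w] for w in words))

model = train([("great product love it", "positive"),
               ("terrible waste of money", "negative")])
print(classify(model, "love this great phone"))  # → positive
```

The structure is the same as in a production system: raw text goes in, the model accumulates statistics, and new unstructured inputs come out with a usable, structured label attached.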
However, the concept behind big data has not yet been fully understood by the scientific community. The main reason is that not all machine learning systems work well, especially when it comes to identifying the right data types. Indeed, identifying these data types remains an unsolved problem even today: there are many possible types of data sources, and not all of them can be used effectively by machine learning systems. In other words, many ‘black holes’ still lie unexploited in the field of big-data science.
Fortunately, progress is being made on this problem. Researchers are increasingly working to distinguish structured from unstructured data, and to build systems that can efficiently sort through unstructured sources, yielding highly effective products and tools that can unearth hidden patterns in massive amounts of data. One obstacle they face, however, is the lack of a common vocabulary for describing both structured and unstructured data. In this respect, the field of big-data science is still very much in its infancy.
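As a crude sketch of the structured-versus-unstructured sorting problem, the heuristic below treats parseable JSON or comma-delimited rows as structured and everything else as free text. The rules are illustrative assumptions; real pipelines use far richer detectors.

```python
import json

# Crude heuristic, for illustration only: JSON or delimited rows count
# as structured, anything else is treated as free text.
def classify_record(raw):
    try:
        json.loads(raw)
        return "structured"
    except ValueError:
        pass
    # Rows with at least two delimiters look like delimited fields.
    if raw.count(",") >= 2:
        return "structured"
    return "unstructured"

samples = ['{"id": 1, "name": "a"}', "2024,sensor-7,19.5", "call me maybe"]
print([classify_record(s) for s in samples])
# → ['structured', 'structured', 'unstructured']
```

Even a heuristic this simple makes the vocabulary problem concrete: whether a comma-delimited row “counts” as structured is exactly the kind of definition the field has not yet agreed on.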