Data Mining

From New Media Business Blog

From nearly the moment humans are born, data is collected about them. Hospitals record a newborn's weight, birth date, and sex. Newborns are, of course, too young to realize it, but their data profiles have already begun to grow. As people age, they engage in a variety of online activities, such as surfing the internet and shopping. What they may not realize at first is that more and more data is being collected about them. The industries that collect this data in massive databases can then use it in a variety of ways. Before they can use the data, however, they must transform it into useful, meaningful, and understandable information. This is where data mining comes in.

What is Data Mining? [1]

Introduction

The concept of data mining has been around since the 1960s, and it has only progressed since then. Data mining is a process used by companies to turn raw data into useful information. Big data is a term often used when studying data mining, but it simply means a large data set. Data mining, by contrast, is a set of techniques used to analyze data to discover patterns, correlations, and anomalies which, once identified, can be applied to everything from helping companies increase their revenue to improving their customer relationships. [2] Organizations are storing most, if not all, of their data electronically. However, to benefit from these large batches of data, they need to analyze them to learn more about their customers in order to develop effective marketing strategies, increase sales, and decrease costs. [2] Overall, data mining provides users with information that would not be available otherwise.

What is Data Mining?

An Intertwined Discipline

Data mining disciplines [2]

Data mining combines three intertwined scientific disciplines: statistics, artificial intelligence, and machine learning. Statistics is the numeric study of data relationships, whereas artificial intelligence is human-like intelligence demonstrated by machines and software. Machine learning consists of algorithms that learn from large batches of data and are then used to make predictions. [2] Data mining also depends on effective data collection, warehousing, and computer processing. Warehousing is a key component of data mining; it enables organizations to store their data in a single database or program, allowing them to retrieve fragments of data, analyze them, and apply them. [3]

Data Mining in Practice

Based on the information that users and organizations provide, data mining programs break down patterns in that data. For example, grocery stores are well known for using data mining techniques in the services that they offer. Most grocery stores offer loyalty cards to customers at no cost, and customers can use these cards to gain exclusive access to reduced prices on goods. [3] Loyalty cards allow grocery stores as well as retail stores to track what consumers are purchasing, how frequently they are making purchases, and how much they are spending on each item. [3] Stores then analyze the retrieved data, discover patterns in customers’ purchase history, and determine their buying habits. [3] Stores then send coupons to customers accordingly, and decide when to discount items and offer promotions.

In addition to grocery stores, retailers, banks, manufacturers, and other industries are using data mining software to discover relationships and become more efficient. For example, the banking industry uses data mining to detect fraud and examine market risks. Organizations also use data mining in their pricing and promotion strategies. It also shows how organizations’ business models, costs, revenues, operations, and customer relationships are affected by various factors, such as social media and the economy. [4]

So why is data mining important? Data mining technology is constantly evolving in order to keep pace with the limitless potential of big data and affordable computing power. [2] Data mining is used both in organizations and in many aspects of our daily lives. A task as simple as searching Google for a restaurant relies on search engine algorithms and recommendation systems, which fall under the data mining umbrella. [2] Overall, data mining allows users to sort through data and use the information that is relevant to predict likely outcomes and make informed decisions.

The History of Data Mining

1960s: The Decade of Data Collection

Prior to the invention of computers and digital storage, businesses kept data in physical filing systems. However, once IBM invented the hard disk in 1956 and the floppy disk became more widespread and common, these traditional physical filing methods were no longer prominent in businesses. [5] Businesses quickly realized that there were better and more efficient ways to store their data, so they began using laser-discs and larger hard disks. However, these storage methods and devices were expensive, harder to manage, and more time consuming than anticipated when extracting data. [5] Furthermore, there was no centralized method or technology at this stage for gathering and bringing data together. [5] This is when Database Management Systems (DBMS) began to emerge.

1970s: The Decade of Data Management

During this decade, Database Management Systems (DBMS) emerged, which allowed businesses to gather their data in a centralized manner. A database management system is a software package that receives instructions from a database administrator and is designed to define, manipulate, retrieve, and manage data in a database. [6] It manipulates the stored data as well as the data format, field names, record structure, and file structures. The benefits of Database Management Systems are “less redundant data, data independence, security and integrity, which all lead to efficient searches”. [5] With sophisticated database management systems, it was possible to store large amounts of data and query terabytes and petabytes of data. [5] The relational model underlying relational database management systems (RDBMS) was introduced by E. F. Codd at IBM in 1970, and it revolutionized how businesses and users sort through their data. [7] Prior to relational database management systems, users had to manually sift through large amounts of data in a DBMS to find the data they needed. This was complex and time consuming for users. With the introduction of RDBMS, databases changed from a simple method of organization to a tool for querying data to find hidden relationships in that data. [8] Relational databases made it possible to delete and modify details, reduced data duplication and inconsistent records, and made it easier to maintain security. [5]

1980s: The Decade of Data Access

Data collection [9]

Relational database products such as Oracle and SQL Server, and later MySQL, were introduced as implementations of RDBMS. The development of Structured Query Language (SQL), a language used to communicate with an RDBMS, allowed users to ‘Create’ and ‘Drop’ tables and to ‘Insert’, ‘Update’, and ‘Delete’ table records. [10] The data in an RDBMS is stored in database objects called tables, which are collections of related data entries made up of columns and rows. [10] SQL allowed complicated queries to be written to extract data from many tables at once, which significantly helped companies access and store their data. Moreover, Oracle is a multi-model database management system that is commonly used for online transaction processing (OLTP), data warehousing, and mixed database workloads. [11]
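The SQL operations named above can be tried with a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table names, columns, and rows are hypothetical and chosen only for illustration; a production RDBMS such as Oracle or SQL Server would use its own driver but broadly similar SQL.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: two related tables (hypothetical schema).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "item TEXT, amount REAL)"
)

# INSERT: add records to both tables.
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben")],
)
cur.executemany(
    "INSERT INTO purchases (customer_id, item, amount) VALUES (?, ?, ?)",
    [(1, "milk", 3.50), (1, "bread", 2.25), (2, "milk", 3.50)],
)

# UPDATE and DELETE: modify and remove existing records.
cur.execute("UPDATE purchases SET amount = 3.00 WHERE item = 'milk'")
cur.execute("DELETE FROM purchases WHERE item = 'bread'")

# One query that pulls data from both tables at once via a join.
cur.execute(
    "SELECT c.name, SUM(p.amount) "
    "FROM customers AS c JOIN purchases AS p ON p.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
)
totals = cur.fetchall()
print(totals)

# DROP: remove a table entirely.
cur.execute("DROP TABLE purchases")
conn.close()
```

The join is the part that matters most for data mining: it is what lets a single query relate customers to their purchases instead of sifting through each table by hand.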

1990s: The Decade of Data Warehouses and Decision Support Systems

In the 1990s, the term “data mining” first appeared, and retail companies as well as the finance industry began using data mining to analyze their data and identify trends. [12] Recognizing trends enabled them to increase their customer base and predict sales and interest fluctuations. The use of data warehouses allowed for retrospective, dynamic data delivery at multiple levels. Enabling technologies during this decade included on-line analytic processing (OLAP), multidimensional databases, and data warehouses. [13] Data warehousing is a model for the transition of data from operational systems to decision support environments. [14] Prior to the introduction of data warehousing, completing requests was quite time consuming, as reporting tools were primarily designed to “execute” tasks in a business rather than “run” the business. [14] With data warehousing, data could be gathered in one place and accessed with one querying tool, allowing users to search the data efficiently and gain an understanding of its functions. OLAP tools allow users to analyze multidimensional data, which can provide trend analysis views. As such, users switched from a transaction-oriented mindset to a more analytical approach to viewing data.

2000s: The Decade of Data Mining

Enabling technologies during this decade included advanced algorithms, multiprocessor computers, and large databases. [12] An algorithm in data mining is a sequence of instructions, typically to solve problems or perform computations that create a model from data. [15] In order to create a model, the algorithm first analyzes the data users provide, and then looks for specific types of patterns or trends. This can be used to forecast sales, outcomes, and probabilities. Multiprocessing is carried out by two or more central processing units (CPUs) within a single computer system. [16] It is capable of running many programs simultaneously and multitasking. The popularity of data mining techniques continues to grow rapidly among businesses with the aim of predicting future outcomes. Data mining is also widespread in medicine and finance, where clinical trials, credit card transactions and stock market movements can be analyzed through data mining applications. [12]

How Does Data Mining Work?

Supervised and Unsupervised Machine Learning

Much of data mining is done through machine learning. “Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves”. [17]

Furthermore, machine learning uses algorithms to break down and analyze data sets and give you the inputs needed to create models. Machine learning is generally broken down into two types: supervised machine learning and unsupervised machine learning.

In supervised machine learning, algorithms apply what has been learned from past data to new data in order to predict future events. [17] It is referred to as “supervised learning” because the algorithm learns from a labeled training data set and then carries out tasks based on what it has learned. [18]

Unsupervised machine learning differs from supervised because algorithms are trained using data that is not classified or labeled. [17] Here algorithms have to discover and analyze what is in the data. [18]

An easier way to think about it is that in supervised learning you have input data (X) and output data (Y), and an algorithm can use new inputs (X) to predict new outputs (Y). [19] Think of the function Y=f(X). In contrast, unsupervised learning only has input data (X) and no output data (Y), so it works to decipher interesting structures in the data and learn more about it using just the input data. [18]
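To make the contrast concrete, here is a minimal sketch in plain Python. The supervised half learns a function f from labelled pairs (X, Y) using a one-nearest-neighbour rule; the unsupervised half receives only X and looks for structure by splitting the values at their largest gap. Both techniques and the toy data (hours studied, pass/fail labels) are illustrative assumptions, not methods taken from the cited sources.

```python
def fit_nearest_neighbour(xs, ys):
    """Supervised: learn f from labelled pairs so that Y = f(X)."""
    pairs = list(zip(xs, ys))

    def f(x_new):
        # Predict the label attached to the closest training input.
        _, label = min(pairs, key=lambda pair: abs(pair[0] - x_new))
        return label

    return f


def split_at_largest_gap(xs):
    """Unsupervised: no Y is given; find two groups using X alone."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Hypothetical data: hours studied (X) and exam outcome (Y).
hours_studied = [1, 2, 3, 8, 9, 10]
passed = ["no", "no", "no", "yes", "yes", "yes"]

f = fit_nearest_neighbour(hours_studied, passed)
prediction = f(7)                              # new input X, predicted Y
groups = split_at_largest_gap(hours_studied)   # structure found without Y

print(prediction)
print(groups)
```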

Data Mining Methodology

CRISP-DM [20]

When it comes to mining the data and using it for your business, it may be hard for a new company or beginner to understand how it is done. Luckily, there is a cross-industry standard process for data mining, which is also referred to as “CRISP-DM”. This method is the most widely used analytical model and can help organizations understand data better. This method has six steps and focuses on filtering data so you can end up with the data you need for your business or objective. [21]

There are six successive steps/phases in CRISP-DM, and they are as follows:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modelling
  5. Evaluation
  6. Deployment

What each of these steps means is the important part, as this is where companies and analysts have to make decisions about what they see and what they need.

Business understanding phase: First, you have to understand the business goals/objectives clearly and find out what the business needs are. From there, identify the important factors that need to be considered to assess the current situation; these can include assumptions, constraints, and resources. [21] Then you can create data mining goals to accomplish the business goals within the current business situation. [21] Lastly, have a detailed plan to achieve the business goals as well as your data mining goals. [21]

Data understanding phase: This phase starts with the initial data collection. Data is collected from available data sources, and it is helpful to get familiar with the data. [21] Here, activities such as data loading and data integration are performed to ensure data is collected successfully. [21] Once collected, the properties of the data are examined and reported. The properties that need to be examined are the surface or gross properties. [21] Next, by using querying, reporting, and visualization, you can explore the data. [21] When exploring the data, refer back to your data mining goals and questions so you can understand where to look. Finally, examine the data quality and, when doing so, see if it answers some important questions. [21] Important questions include “Are there things missing from the data?” and “Is the data we have complete?”. [21]

Data preparation phase: This phase can consume the most time out of the entire task, and it’s best to be patient when in this phase. Since you have now identified data sources, the data is ready to be selected, cleaned, constructed, and formatted into the desired business format. [21] The process of selecting, cleaning, constructing, and formatting the data is why this is the most time-consuming part of the task. Once completed, this data will be the final data set. [21]
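The four preparation activities can be sketched in a few lines of plain Python. The raw rows, field names, and cleaning rules below are hypothetical; real projects typically face far messier data and use dedicated tooling, so this is only meant to show the shape of the work.

```python
# Hypothetical raw export: every value arrives as an untrimmed string.
raw_rows = [
    {"customer": " Ana ", "item": "Milk", "qty": "2", "unit_price": "3.50"},
    {"customer": "Ben", "item": "bread", "qty": "", "unit_price": "2.25"},
    {"customer": "Ana", "item": "MILK", "qty": "1", "unit_price": "3.50"},
    {"customer": "", "item": "eggs", "qty": "12", "unit_price": "0.40"},
]


def prepare(rows):
    prepared = []
    for row in rows:
        # Select: keep only the fields the analysis needs.
        customer = row["customer"].strip()
        item = row["item"].strip().lower()

        # Clean: drop records with missing values.
        if not customer or not row["qty"] or not row["unit_price"]:
            continue
        qty = int(row["qty"])
        unit_price = float(row["unit_price"])

        # Construct: derive a new attribute from existing ones.
        total = round(qty * unit_price, 2)

        # Format: emit a consistent, typed record.
        prepared.append(
            {"customer": customer, "item": item, "qty": qty, "total": total}
        )
    return prepared


final_data_set = prepare(raw_rows)
print(final_data_set)
```

Two of the four hypothetical rows are discarded for missing values, which hints at why this phase tends to dominate the schedule.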

Modelling phase: The first thing to do in this phase is to select modeling techniques that will be used on the final data set. Using the selected technique, run a test model to check the quality and validity of the model. [21] If the model seems to be valid, create your model(s) on the final data set. From here, the models constructed should be examined carefully to see if they meet requirements and/or business initiatives. [21]
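One common way to check the quality of a test model is to hold back part of the final data set, fit on the rest, and measure the error on the held-back part. The sketch below assumes that approach, which the source does not prescribe; the weekly sales figures and the deliberately simple "predict the training average" baseline are hypothetical.

```python
def train_test_split(rows, test_fraction=0.25):
    """Keep the last portion of the rows aside for testing."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical final data set: units sold per week.
weekly_sales = [10, 12, 11, 13, 12, 14, 13, 15]

train, test = train_test_split(weekly_sales)

# Test model: always predict the average of the training data.
baseline = sum(train) / len(train)
error = mean_absolute_error(test, [baseline] * len(test))

print(baseline, error)
```

Any candidate model that cannot beat such a baseline on the held-back data is unlikely to meet the business requirements.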

Evaluation phase: In this phase you evaluate the model results. When evaluating, do so while considering the business goals/objectives that were developed in the business understanding phase. [21] In addition, in this phase, you may see new patterns that have been discovered from the results and thus new business requirements may arise. Lastly, here the decision is made whether or not the data will move onto the next step and is ready for deployment. [21]

Deployment phase: Here the results, information, and knowledge gained from the data mining process are shared and presented. It is vital for them to be presented in a way that makes them usable for individuals such as stakeholders. [21] The business should decide how the results will be deployed so it can successfully deploy, maintain, and monitor its completed task. Having this decided will also create a pathway for future data mining tasks. Finally, it is important to do a review to see where things can be improved and how processes can be changed if needed, and to record any lessons learned for the future. [21]

Using CRISP-DM, a business can turn a raw data set into a valuable data set it can then use for its desired purpose. However, CRISP-DM faces some criticism because it is quite old (created in 1996) and is therefore not the newest methodology. Businesses may prefer something newer to keep up with the massive amounts of data being collected, since the CRISP-DM framework has not been updated to address problems raised by new technologies. [22] One example of such a technology is big data, which is very relevant today. [22]

Data Mining Techniques and Models

Data mining techniques generally fall under two categories: predictive analysis and descriptive analysis. Each category can be broken down into four types, which can then be used for modelling.

Models & Techniques [5]


Predictive analysis mainly focuses on predicting what will happen in the future. [23] Under predictive analysis there are classification analysis, regression analysis, time series analysis, and prediction analysis, each of which differs from the others.

Classification analysis: This analysis fetches important and relevant information about data and then classifies the data into predefined categories/groups. [23] An example of this data categorization would be using an email provider to group data. [23] Once that is done, algorithms can be used to classify mail as legitimate or spam. This is done using supervised learning.
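As an illustration of classifying mail into the two predefined groups, the following is a minimal sketch of a naive Bayes text classifier with Laplace smoothing, written in plain Python. Naive Bayes is one common choice and is an assumption here, not the method the cited source describes; the training messages are made up.

```python
import math
from collections import Counter


def train(messages):
    """Count words per label from (text, label) training pairs."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    vocabulary = set(word_counts["spam"]) | set(word_counts["legitimate"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            numerator = word_counts[label][word] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled training set (supervised learning).
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday", "legitimate"),
]

model = train(training)
label_1 = classify("claim your free prize", *model)
label_2 = classify("monday meeting", *model)
print(label_1, label_2)
```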

Regression analysis: This type of analysis focuses on the dependency between variables and from there it tries to create a forecast for the future. [23] It does make the assumption that future data will fall into this forecast it predicts. [23]
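A minimal sketch of this idea is ordinary least-squares fitting of a straight line, shown here in plain Python. The variables (advertising spend and sales) and their values are hypothetical, and a straight line is only one of many possible forms of dependency.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: sales depend on advertising spend.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

# Forecast for a spend level not yet observed. This assumes the
# relationship seen in past data continues to hold.
forecast = slope * 5.0 + intercept
print(slope, intercept, forecast)
```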

Time series analysis: This analysis measures a sequence of defined data points which are then indexed in time order to predict future values. [24]

Prediction analysis: It is very similar to time series analysis but it is not bound by time. It is also used to predict future values using previous data. [24]
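One of the simplest forecasting techniques for time-ordered data is a moving average, sketched below in plain Python. The technique is an illustrative choice, and the monthly figures and date labels are hypothetical.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


# Hypothetical observations, deliberately out of order.
observations = [
    ("2024-03", 120), ("2024-01", 100), ("2024-02", 110),
    ("2024-06", 150), ("2024-04", 130), ("2024-05", 140),
]

# Index the data points in time order before analysing them.
observations.sort(key=lambda point: point[0])
monthly_sales = [value for _, value in observations]

next_month = moving_average_forecast(monthly_sales, window=3)
print(next_month)
```

Because a moving average only smooths recent values, it lags behind a steady upward trend like the one in this toy series; methods that model the trend explicitly would forecast higher.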


Descriptive analysis focuses on summarizing or turning data into useful information. [23] Under descriptive analysis there are clustering analysis, summarization analysis, association rule analysis, and sequence discovery analysis.

Clustering analysis: In this analysis data sets are identified by how similar they are to one another and clustered. [24] It sounds very similar to classification but this is mainly done using unsupervised learning and it does not have predefined groups. [23]
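A minimal sketch of clustering is k-means on one-dimensional values, written here in plain Python. K-means is one common clustering algorithm and is an assumption for illustration; the customer ages and starting centroids are hypothetical. Note that no labels are supplied: the groups emerge from similarity alone.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning values and updating centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer ages; initial centroids chosen by hand.
ages = [20, 22, 25, 80, 85, 90]
centroids, clusters = k_means_1d(ages, centroids=[20.0, 90.0])
print(centroids, clusters)
```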

Summarization analysis: This analysis uses techniques to find a description for a dataset. [23]

Association rule analysis: Works to identify relationships between different variables in datasets and is mainly used in the retail industry. [24]
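Two standard measures for a rule such as "customers who buy bread also buy butter" are support (how often the items appear together) and confidence (how often the rule holds when its left-hand side appears). The sketch below computes both in plain Python over a hypothetical set of shopping baskets.

```python
# Hypothetical retail transactions, one set of items per basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets with the antecedent, the share that also
    contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and butter appear together in two of four baskets, and butter appears in two of the three baskets that contain bread.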

Sequence discovery analysis: This is exactly what it sounds like: it focuses on discovering the sequence in which activities occur. [23]

These are all the tools that can be used to create data models for business needs. These can help assess future outcomes, profitability, buying patterns and much more. When creating these models it’s important to be educated in how to create a model and which model to create so you can get results faster.

Analysis

There are a variety of reasons as to why an organization would decide to implement data mining and incorporate it into its strategy. An organization, however, must also consider the challenges it would face if it were to do so. As the amount of information worldwide continues to grow and grow, opportunities to utilize data mining increase as well. It is growing increasingly important to use data mining techniques and processes because of the increasingly vast repositories of data organizations cultivate and have access to; they need to be able to harness this data to further increase their business’ capabilities. The following are the most notable benefits & challenges organizations encounter with data mining.

Benefits

  • Improved Decision Making
  • Insights into Consumers
  • Prediction & Forecasting Abilities
  • Reducing Costs

Improved Decision Making

The very purpose of data mining is to process raw data into useful information. With that in mind, the usefulness of such information will depend on where the data is coming from and the relevance of the data itself to the organization. Utilizing this information enables organizations to improve their decision-making abilities because the information is based on quantifiable data sets. Decision makers can clearly see how the analysis was done and why an analysis may have come to a specific conclusion. It is important to note, though, that significant care must be placed into the preparation of data itself. This will be covered under the challenges of data mining later on.

An additional aspect to data mining’s inherent nature is automated decision-making. The process of data mining allows for the ability for certain decisions to be made without human input. [25] It is up to an organization to determine what kinds of decisions should be automated. While more critical decisions can be independently acted upon by data mining algorithms, it may not necessarily be in an organization’s best interests to do so as there are still many factors that would require human judgement.

Insights into Data (photo by Stephen Dawson on Unsplash) [26]

Insights into Consumers

This is one of the main advantages to organizations. Essentially, the useful information that data mining gleans from raw data is the set of relationships and patterns detected in the data that is inputted. [27] By analyzing the data an organization has on its consumers, it can answer key questions regarding consumer preferences: Which products or services are being used the most? Which are used the most within each demographic? When are these products or services being used the most? For example, banks can make use of their consumer data in assessing the credit risk each individual poses. More specifically, credit-risk teams within banks could better assess whether “a customer whose bank balance falls into the red more than once a quarter could be at higher risk for defaulting on a mortgage.” [28]

While data mining can be of tremendous assistance in gaining insight into the consumer, caution must be taken as well, particularly if a data miner is seeking to prove whether or not a conjecture they have is correct. [29] A company should not selectively manipulate the data inputs for its data mining tools in an attempt to manufacture an artificial, erroneous analysis to support a hunch.

Improved Prediction & Forecasting Abilities

This is closely tied to the insights an organization would obtain into consumers. Once an organization better understands its consumers, it can leverage data mining tools to forecast future trends. These forecasts can be counted upon to be generally reliable and are based on trends that are already present in past data as well as the most up-to-date data. [30] Another aspect to this is the knowledge gained from being able to look at all these trends in one place. An organization can see where they have gone wrong in the past and can take the appropriate measures to prevent such mistakes from happening again. Data mining tools also facilitate easier comparisons between an organization’s past and current state.[31] When used properly, data mining becomes an invaluable tool to an organization’s ability to plan.

Reducing Costs

Cost reduction is a natural benefit of data mining primarily due to the efficiencies gained by implementing and integrating it into organizational processes. Data mining allows for the identification of possible areas to improve or streamline within these organizational processes. A better understanding of one’s consumers also allows an organization to focus on efforts that have a higher chance of yielding success based on forecasted trends, thus decreasing the chances of costly errors in planning. However, the type of cost reductions that can be obtained will ultimately vary from industry to industry and organization to organization, depending on the type of work conducted and the processes involved. In general though, any cost reductions will be related to the benefits described above.

Industry Examples

Construction

Caterpillar, the world’s largest construction equipment manufacturer, recognizes that data mining and big data overall have become vital to its business. Caterpillar utilizes Uptake, a data analytics company, to identify areas and processes in its equipment that can be optimized [32]. Caterpillar has a history of making use of the data generated by its equipment and making it available to its customers, and over the years it has invested further in leveraging this data. The company now incorporates machine learning tools and proprietary algorithms into some of its offered services and places extensive effort into outfitting its machinery, old and new, with sensors and analytics technology. This is all part of Caterpillar’s endeavours to offer greater, more beneficial insights to its customers [33].

Oil Pumpjack Sunset (photo by Zbynek Burival on Unsplash) [34]

Healthcare

The healthcare industry, in particular, stands to benefit a great deal from a deeper understanding of their consumers’ (i.e. patients) behaviours. Part of the reason why this is the case is because the amount of data already being collected by healthcare organizations is quite extensive. The following are only some examples of the kinds of data being collected: patient medical records, medication data, emergency service records, and insurance claims [35]. Government entities dealing with human health have realized the benefits of data mining and have started to set rules accordingly to govern proper use. In 2013, the US Department of Health and Human Services as well as The Centers for Medicare & Medicaid Services moved to issue rules facilitating data mining use[36].

Oil & Gas

Oil and gas companies have substantial amounts of assets, such as processing plants, refineries, and general equipment, that require some form of maintenance and/or upkeep. Data mining facilitates predictive maintenance, which helps avoid sizeable repair costs by addressing problems before something actually breaks and helps minimize upkeep costs. With these assets, data mining also enables more accurate predictions and forecasts of oil and gas demand as well as usage patterns [37].

Challenges

  • Preparation of Data
  • Cost of Scale
  • Organizational Adoption
  • Privacy & Security

Preparation of Data

The benefits that data mining can provide are contingent upon the four V’s of big data. The four V’s are the different dimensions of big data, and big data itself “is a term applied to datasets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency.” [38] The four V’s are as follows:

  • Variety
  • Velocity
  • Veracity
  • Volume

Each of these V’s presents a distinctive challenge to data mining and advanced analytics in general.

Source: IBM Big Data & Analytics Hub[39]

Variety

Variety is concerned with all the different forms of data out there: images, videos, text, health data, etc.[39] There is an enormous diversity in the kinds of data that we generate and have access to, and the most common types we encounter in our day-to-day lives only scratch the surface of this diverse range. Data mining tools have to be equipped to handle a broad range of data formats as being unable to do so would limit the abilities of the tool and devalue it.

Velocity

Velocity addresses the speed at which data is being generated as well as how it is being gathered. [39] More and more devices are coming online every day, creating further network connections per person. Many of these devices have additional sensors that monitor an assortment of elements depending on the device itself, which in turn generate even more data than they would have in the past. For this V, the computing power of data mining tools has to be up to par to keep up with the ever-increasing velocity of data.

Veracity

Veracity has to do with the accuracy of the data being collected and whether or not the information derived from that data can be trusted. This aspect of big data is particularly important as being unable to trust the information you are consuming, perhaps even relying upon, leads to inaccuracies. The ability of data mining to provide useful information to organizations fundamentally depends on the veracity of the data being inputted. There are serious economic impacts to this that need to be considered too. As displayed in the above infographic from IBM, “poor data quality costs the US economy around $3.1 trillion a year”[39].

Volume

Volume relates to the scale of data.[39] Specifically, how much data is being created and how much is being stored by an organization. Massive amounts of data are being generated and stored by companies and other organizations. Much of this data is harvested from devices, applications, and digital services and how consumers interact with them in their daily lives.

Scaling Costs

The scale at which data mining is harnessed throughout an organization incurs costs tied to some of the V’s of big data, namely volume and velocity. With forecasted growth in the amount of data worldwide as well as increasing data velocity, investments have to be continually made into an organization’s IT infrastructure to keep up with demand. [40]

While scaling costs are an important consideration, an organization does not necessarily have to manage its own IT infrastructure. There exists infrastructure as a service (IaaS) as an alternative. It is similar to software as a service (SaaS) but, as the name indicates, for infrastructure instead. Microsoft defines IaaS as “an instant computing infrastructure, provisioned and managed over the internet”. [41] The infrastructure itself would be handled by a third-party service provider, allowing an organization to circumvent the difficulties associated with having to manage it by itself. Although IaaS offers many benefits, particularly cost reductions in scale, there are certainly disadvantages depending on an organization’s affairs. An internet connection that can meet the demands of IaaS is necessary, and this may not be an option for organizations located away from large population centres where access to such types of internet connection is easier. In far-flung areas, internet connections are more likely to be dismal and sparse. One other important disadvantage that must be noted is privacy-related. [42] Since the infrastructure itself is being serviced through the internet and through a third-party service provider, the necessary precautions must be taken to ensure privacy concerns are addressed and alleviated.

Organizational Adoption

There are several barriers to the adoption of data mining amongst organizations. According to a 2019 survey of industry-leading firms, the biggest barriers to adopting big data analytics such as data mining are the following [43]:

  • Lack of organizational alignment/agility
  • Cultural resistance to change
  • Technology solutions
  • Understanding data as an asset
  • Executive leadership

Most notably, 40.3% of the respondents to this survey indicated that big data adoption was held up by a lack of organizational alignment or agility. The second- and third-most cited challenges were understanding data as an asset (30%) and cultural resistance to change (23.6%). What we can gather from this survey is that it is often not the technology itself that poses problems for organizations looking to adopt big data solutions such as data mining. Rather, it is internal factors within an organization that hinder implementation efforts.

Privacy & Security

Privacy is key [44]

Algorithms and data mining as a whole can help consumers decide which hotel to book and can recommend restaurants based on data mined from user preferences. However, there is a trade-off, as some degree of a person’s privacy is invaded. When the data collected involves people, questions of privacy and ethics arise. For instance, an employer with access to employee medical records could screen out individuals who have diabetes. Screening out employees with this condition would cut the employer’s insurance costs, but it raises ethical concerns.

Security and privacy are significant issues in data mining, especially as social media applications, mobile applications, and smart IoT devices are adopted for daily tasks. [45] As a result, large amounts of user data are generated, and without proper privacy protection in place, users can be at risk of privacy breaches. Data mining can also be an intrusive practice, as businesses can profile consumers by anything from age and ethnicity to political views and income.

Companies are constantly tracking users and collecting bits of user information in order to put these disparate pieces together into digital user profiles. [46] Companies are easily able to collect this data, as users blindly click “accept” on privacy policies and turn on location services in mobile applications to search for nearby restaurants for the sake of convenience. The number of users who use Facebook as a log-in for other mobile applications and tools is increasing, giving businesses a deeper look into people's personal lives. [46] Users are putting all of their data on the table for companies to see and use, and companies can analyze user likes, preferences, and habits.

The multi-billion dollar data mining industry is mostly unregulated, and “data companies can keep their virtual warehouses private, unlike other companies like credit reporting agencies, who are required to let people see what that they have created with the data they mine and organize”. [46] Most foreign countries, including those under the EU GDPR, a data protection regulation, allow citizens to access personal information held by third parties and, in fact, deem it a legal right. [46] However, Canadian and U.S. citizens do not have this legal right, so it is both unclear and concerning which personal information, and how much of it, businesses have collected about each user.

The Cambridge Analytica Scandal

Breakdown of where data was stolen from [47]

In 2018, news broke that the firm Cambridge Analytica had gained unauthorised access to the information of millions of users through a third-party quiz application on Facebook. [48] Even though only 270,000 users downloaded the application, over 80 million users had their information accessed. [49] The users had not explicitly given their consent for the firm to access this information. However, the firm said that nothing illegal occurred because all users technically gave consent by agreeing to the user conditions in the application. [49]

The user data accessed included millions of Facebook users’ identities, their friend networks, their groups, and their likes and interests. The quiz application was able to pass this information to the firm, which then used it to create detailed profiles of Facebook users. These profiles were then used to make micro-targeted political advertisements for the purpose of swaying users during the 2016 US Presidential election. [48] This breach of data and confidence shocked people, as this level of data compromise suggested that any business could gain access to personal information for the purpose of unethical targeted advertising.

The growing popularity of social media logins as a convenient way of accessing other applications puts users at risk of unauthorized Facebook data mining. [50] Facebook, along with other Web giants, tracks user transactions and preferences in order to accumulate personal user data over time. The more data there is in one place, the more value it has for data mining.

Data Mining for Fraud Detection

Fraud affects many users and industries such as finance and government agencies, and with a 46% increase in fraud from 2008 to 2019 in Canada, it is increasingly important to implement prevention strategies. [51] Data mining is used to find associations in data which can show patterns and trends, which are then analyzed to detect fraudulent activity. This is accomplished using machine learning, through which computer systems can use algorithms and rely on patterns and mathematical models rather than users having to manually sift through data.

Supervised and Unsupervised Learning

Both supervised and unsupervised learning are machine learning methods that play important roles in fraud detection. They are used to look for accounts, customers and other parties that behave unusually, in order to detect anomalies and suspicious activities. [52] Supervised learning is the most common approach: a random sample of all records is taken and classified as fraudulent or non-fraudulent, and these records are used to train a supervised machine learning algorithm. [52] Once trained, the algorithm can classify new records as fraudulent when appropriate, so that fraudulent transactions can be detected. [52]
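
As a rough illustration of the supervised approach, the sketch below trains a very simple classifier on a handful of labelled records and then classifies two new ones. The records, the two features (amount and hour of day) and the nearest-centroid method are all assumptions made for this example and do not come from the cited source; production fraud models are far more elaborate.

```python
from statistics import mean

# Hypothetical labelled sample: (amount in dollars, hour of day, is_fraud).
LABELLED = [
    (25.0, 14, False),
    (40.0, 10, False),
    (18.5, 19, False),
    (60.0, 12, False),
    (950.0, 3, True),
    (1200.0, 2, True),
    (875.0, 4, True),
]


def train_centroids(records):
    """Average the features of each class to get one centroid per label."""
    centroids = {}
    for label in (False, True):
        rows = [(amount, hour) for amount, hour, fraud in records if fraud is label]
        centroids[label] = (mean(r[0] for r in rows), mean(r[1] for r in rows))
    return centroids


def classify(centroids, amount, hour):
    """Return True (fraudulent) if the record is closer to the fraud centroid."""
    def squared_distance(centroid):
        return (amount - centroid[0]) ** 2 + (hour - centroid[1]) ** 2
    return squared_distance(centroids[True]) < squared_distance(centroids[False])


centroids = train_centroids(LABELLED)
print(classify(centroids, 30.0, 13))    # small daytime purchase
print(classify(centroids, 1100.0, 3))   # large purchase at 3 a.m.
```

The features are left unscaled to keep the sketch short, so the amount dominates the distance; a real model would normalize its inputs and be evaluated on records it has not seen.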

Unsupervised learning models, by contrast, are designed to detect outliers and are useful when users do not have a labelled training dataset. [52] In that case, users can use clustering techniques to categorize and label data, and identify transactions that do not conform to the majority. [52] These models allow users and businesses to promptly recognize patterns of fraud and protect their accounts.
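
The idea of flagging transactions that do not conform to the majority can be sketched without any labels. The example below is not a clustering algorithm; it substitutes a simpler median-based outlier score (the "modified z-score"), and the amounts, the 0.6745 scaling constant and the 3.5 cut-off are assumptions chosen for illustration.

```python
from statistics import median

# Hypothetical, unlabelled transaction amounts for one account.
AMOUNTS = [22.0, 35.5, 18.0, 41.0, 27.5, 30.0, 980.0, 25.0, 33.0]


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    centre = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no measurable spread, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


print(find_outliers(AMOUNTS))
```

With these sample amounts, only the 980.0 transaction is flagged. A flagged transaction is a candidate for review, not proof of fraud.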

The Future of Data Mining

Data continues to grow at astronomical rates, and with this comes the need to make all this data useful. Data mining has a bright future as more industries and organizations are beginning to use it and see its value. There are also many future applications of data mining and in-demand jobs as data continues to grow.

Forecasted Growth

Amount of Data Worldwide

It is important to establish by how much the amount of data worldwide is expected to grow in the near future.

Source: Statista [53]

Large increases in the amount of data worldwide are expected each year, and by 2025 there will be approximately 175 zettabytes of data across the globe. To put that into perspective, one zettabyte equals one million petabytes, which is equivalent to one trillion gigabytes. This forecast has staggering implications for data-driven organizations, as resources and capabilities will have to be invested to keep pace with ever-growing amounts of data and its increasing velocity. With this forecasted growth in mind, it is only logical that there are significant increases in the global big data market, under which data mining would be categorized, as well.
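
The unit conversion above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) prefixes, where each unit is a power of ten bytes; the variable names are illustrative only.

```python
# SI (decimal) prefixes expressed as powers of ten, in bytes.
GIGABYTE = 10 ** 9
PETABYTE = 10 ** 15
ZETTABYTE = 10 ** 21

forecast_zb = 175  # forecast for 2025 quoted in the text

print(ZETTABYTE // PETABYTE)                # petabytes in one zettabyte
print(ZETTABYTE // GIGABYTE)                # gigabytes in one zettabyte
print(forecast_zb * ZETTABYTE // GIGABYTE)  # the 2025 forecast in gigabytes
```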

The Big Data & Business Analytics Market

The big data & business analytics market is relevant to data mining because of how big data is defined and the interdependent relationship between it and the tools required to analyze it. Only advanced analytics processes and tools such as data mining have the requisite ability to obtain anything of value from big data.

Source: Statista [54]

By 2022, the size of the global big data & business analytics market is forecasted to grow to 274.3 billion USD. Unlike the amount of data worldwide, the big data market is not forecasted to increase exponentially from year to year, but one trend is still quite clear: it is increasing, and it will increase markedly. From this, it can be inferred that organizations are indeed seeing the value of data analytics in general and are investing time and money into tools such as data mining.

Future Applications

Big Data Adoption

Source: Statista [55]

The graph above displays information about big data adoption in various industries. From the graph it is easy to see that most of these industries are using big data in their business right now, and most of the rest plan to in the future. The graphic depicts usage of big data as of 2018 and expected industry-specific growth. The growth presented looks very attractive, as it may signify more job opportunities in data mining across various industries.

Future Uses of Data Mining

There are many future uses involving data mining and its applications. Multimedia data mining, improved computing, and blockchain technology are all interesting applications and concepts involving data mining which will be discussed.

Multimedia data mining: “Multimedia data mining is a research domain which helps to extract interesting knowledge from multimedia data sets such as audio, video, images, graphics, speech, text and combination of several types of data sets.” [56] The data here is usually unstructured or semi-structured, so data mining would be done using machine learning. [56] Machine learning would clean and collect the data, making it easier to find and use. This can be useful when, say, you are looking for audio on a specific topic and want access to all applicable audio without having to search too long for it. Another example is video mining, specifically of security footage. With the tools to mine security footage, it is easier to find what is needed instead of going through hours of footage.

Improved computing: Moore’s law states that the number of transistors that fit on a computer chip doubles about every 18-24 months, which in turn provides more computing power to analyze and mine data faster. Recently, however, this pace has not held: Moore’s law is becoming obsolete and is expected to end around 2025. [57] As of 2019, only two companies, Samsung and TSMC (Taiwan Semiconductor Manufacturing Company), are keeping pace with Moore’s law. [58] At the same time, quantum computing may become more widely available, which would also allow for faster machine learning and calculations. Traditional computers have “bits”, which encode either a zero or a one, whereas a quantum computer has “qubits”, which can encode a zero and a one at the same time. [59] In principle, a quantum computer can factor large numbers efficiently, and in 2019 Google put the capabilities of its quantum computer on display. According to Google, it solved a complex problem in about 200 seconds (three minutes and twenty seconds) that would take the most powerful supercomputer today 10,000 years to solve. [60]
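
To get a feel for what a doubling every 18-24 months implies, the short sketch below computes the growth factor over a decade for both ends of that range. The ten-year horizon and the function name are assumptions chosen for illustration.

```python
def growth_factor(years, doubling_months):
    """Return how many times a quantity multiplies if it doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


for months in (18, 24):
    print(months, round(growth_factor(10, months), 1))
```

Over ten years, a 24-month doubling period gives a 32-fold increase, while an 18-month period gives roughly a 100-fold increase, which shows how sensitive long-run projections are to the assumed doubling period.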

Elements of a Blockchain [61]

Blockchain technology: The word “block” in blockchain represents digital information, which is stored on a public database, the “chain”. Each of these blocks stores transaction data, a digital signature and a unique code referred to as a hash. [62] The hash works to tell blocks apart: every block has a different hash, which represents different transactions. Blockchain technology is public, decentralized, secure and practically unchangeable. [62] Because the information cannot be changed, it remains authentic later on, and this technology can be used in many industries such as the medical industry. It can allow hospitals to keep authentic records on the blockchain, without having to worry about them getting lost, and to add more records over time. Having the ability to mine the data in a blockchain can help gather information faster, cluster the information you are looking for and analyze strings of data stored in the past.
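
A minimal sketch can show why hash-linked blocks are hard to change after the fact. It is a toy, single-machine chain that assumes SHA-256 hashing and hypothetical hospital record strings; it leaves out digital signatures, decentralization and consensus, all of which a real blockchain relies on.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, previous_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain, data):
    """Append a new block that points at the hash of the current last block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })


def is_valid(chain):
    """Check every link and recompute every hash to detect tampering."""
    for i, block in enumerate(chain):
        expected_previous = chain[i - 1]["hash"] if i > 0 else GENESIS_HASH
        if block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["previous_hash"]):
            return False
    return True


chain = []
add_block(chain, "patient 001: record created")
add_block(chain, "patient 001: blood test added")
print(is_valid(chain))   # chain as originally written

chain[0]["data"] = "patient 001: record altered"
print(is_valid(chain))   # after editing an old block
```

Editing an old block changes what its hash should be, so the stored hash no longer matches and the check fails. This is the property the text describes as the information being practically unchangeable.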

Job Prospects

The future of data mining is loaded with potential. As seen from the graphs in the Forecasted Growth section, data gathering will not be slowing down anytime soon, and with more data comes the need for data mining. In addition, LinkedIn’s 2017 annual report on future jobs stated that three of the most in-demand jobs in the USA were related to big data. [63] IBM also forecast that the demand for data professionals would grow by 28% by 2020. [63] Furthermore, many industries look to use data mining in the future. As a result, more people will be needed to perform data mining tasks and make loads of data useful for stakeholders. Some entry level positions for those interested in a career in data mining include data analyst, business analyst, and test data analyst roles:

Data analysts handle the collecting, analyzing, and storing of data. They then use it to help clients or companies make better decisions. [64] At the entry level they may be working with a team or directly under a supervisor, where they can learn more from their coworkers in the process of making data useful. They can expect to earn a salary around $55,000 CAD at the entry level.[65]

Business analysts conduct more research on the business side of things for companies. This can include analyzing product lines, profitability, buying patterns, and market analysis. [66] Being at an entry level position can also entail working with a team or under someone. A salary of around $50,000 CAD is common for an entry level business analyst. [67]

Test data analysts are responsible for analyzing and clarifying test data requests, creating the right queries to mine data, producing test data results, and testing data result sets. [68] They may also work under supervision and help accomplish the goals set out by higher management to improve the business. Salaries for entry level positions in this role are around $52,000 CAD. [69]

What all these roles and many other data mining roles have in common is that they require an individual to have strong technological, analytical, and communication skills. [66] All these skills are must-haves in this field, and having an adequate level of ability in each may help one find jobs more easily and learn new processes on the job more quickly. Furthermore, most of these jobs generally require an education in business and information technology. [66] However, going to school may not teach you everything you need to know. Learning with programs on your own time would help improve your skills. The internet is filled with programs and guides that can help you learn how to work with databases, and having this experience may prove to be the determining factor when looking for a job in data mining.

Summary & Thoughts

Data mining is only going to continue to become further integrated into the workings of society’s organizations and will impact our daily lives as time goes on. With the rate of data generation increasing rapidly year by year, data mining is becoming more and more valuable to organizations seeking to utilize all the data they are generating and have access to. Arguably, data mining has already become integral to many organizations as they harness the insights they gain from the data mining process to better address the challenges they face. Our group was of the mind that the benefits presented by data mining definitely outweigh its challenges.

The data mining process & its applicability to organizations throughout the globe will only continue to evolve as the type of data and information available changes and grows. While we were able to find commonly agreed upon insights from various sources as to the future of data mining, there are many more possibilities out there that have not yet been fully explored.

Authors

Anna Bobrovskaya, Gavin Dhaliwal, Jose Honorio
Beedie School of Business
Simon Fraser University
Burnaby, BC, Canada

References

  1. https://blog.eduonix.com/internet-of-things/top-5-popular-data-mining-techniques/
  2. https://www.sas.com/en_ca/insights/analytics/data-mining.html
  3. https://www.investopedia.com/terms/d/datamining.asp
  4. https://histechup.com/data-mining-what-it-is-and-why-it-matters/
  5. https://dataconomy.com/2014/06/history-bi-1960s-70s/
  6. https://www.techopedia.com/definition/24361/database-management-systems-dbms/
  7. https://www.ibm.com/analytics/relational-database
  8. http://avant.org/project/history-of-databases/
  9. https://www.cleanpng.com/png-relational-database-management-system-data-managem-3529547/
  10. https://www.w3schools.com/sql/sql_intro.asp
  11. https://en.wikipedia.org/wiki/Oracle_Database
  12. https://dataconomy.com/2016/06/history-data-mining/
  13. https://www.exastax.com/big-data/the-history-of-data-mining/
  14. https://dataconomy.com/2014/07/the-history-of-bi-the-1980s-and-90s/
  15. https://docs.microsoft.com/en-us/analysis-services/data-mining/data-mining-algorithms-analysis-services-data-mining
  16. https://www.britannica.com/technology/multiprocessing
  17. https://expertsystem.com/machine-learning-definition/
  18. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
  19. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
  20. https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png
  21. https://barnraisersllc.com/2018/10/data-mining-process-essential-steps/
  22. https://jenstirrup.com/2017/07/01/whats-wrong-with-crisp-dm-and-is-there-an-alternative/
  23. https://electricalfundablog.com/data-mining-working-characteristics-types-applications-advantages/#Predictive_Data_Mining_Analysis
  24. https://bigdatanerd.wordpress.com/2011/06/25/introduction-to-data-mining-types-of-data-mining-techniques/
  25. https://www.microstrategy.com/us/resources/introductory-guides/data-mining-explained#database
  26. https://unsplash.com/photos/qwtCeJ5cLYs
  27. https://www.investopedia.com/terms/d/datamining.asp
  28. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/capturing-value-from-your-customer-data
  29. https://www.investopedia.com/terms/d/datamining.asp
  30. https://www.microstrategy.com/us/resources/introductory-guides/data-mining-explained#database
  31. https://www.sfgnetwork.com/blog/data-services/the-benefits-of-mining-customer-data/
  32. https://www.chicagotribune.com/business/ct-biz-caterpillar-uptake-20171115-story.html
  33. https://www.zuora.com/2017/06/13/wsj-caterpillar-ramping-data-services-business/
  34. https://unsplash.com/photos/GrmwVnVSSdU
  35. https://archer-soft.com/en/blog/electronic-data-interchange-healthcare
  36. https://archer-soft.com/en/blog/data-mining-healthcare
  37. https://www.accenture.com/_acnmedia/pdf-109/accenture-data-mining-reveals-savings-oil-gas-companies.pdf
  38. https://www.ibm.com/analytics/hadoop/big-data-analytics
  39. https://www.ibmbigdatahub.com/infographic/four-vs-big-data
  40. https://www.microstrategy.com/us/resources/introductory-guides/data-mining-explained#database
  41. https://azure.microsoft.com/en-ca/overview/what-is-iaas/
  42. https://www.w3schools.in/cloud-services/infrastructure-as-a-service/
  43. https://www-statista-com.proxy.lib.sfu.ca/statistics/742983/worldwide-survey-corporate-big-data-adoption-barriers/
  44. https://www.npr.org/sections/alltechconsidered/2018/05/28/614419275/do-not-sell-my-personal-information-california-eyes-data-privacy-measure
  45. https://www.hindawi.com/journals/scn/si/575723/cfp/
  46. http://business.time.com/2012/07/31/big-data-knows-what-youre-doing-right-now/
  47. https://www.dw.com/en/cambridge-analytica-causing-trouble-for-facebook-in-southeast-asia/a-43286109
  48. https://www.nytimes.com/2018/09/28/technology/facebook-hack-data-breach.html
  49. https://www.wired.com/story/facebook-exposed-87-million-users-to-cambridge-analytica/
  50. https://medium.com/@IAMEIdentity/the-facebook-data-mining-scandal-what-happened-82154855aeca
  51. https://www150.statcan.gc.ca/n1/pub/85-002-x/2019001/article/00013-eng.htm
  52. https://www.fico.com/blogs/5-keys-using-ai-and-machine-learning-fraud-detection
  53. https://www-statista-com.proxy.lib.sfu.ca/statistics/254266/global-big-data-market-forecast/
  54. https://www-statista-com.proxy.lib.sfu.ca/statistics/551501/worldwide-big-data-business-analytics-revenue/
  55. https://www-statista-com.proxy.lib.sfu.ca/statistics/919683/worldwide-big-data-adoption-expectations-by-vertical/
  56. http://airccse.org/journal/ijcga/papers/5115ijcga05.pdf
  57. https://arxiv.org/pdf/1511.05956.pdf
  58. https://www.hpcwire.com/2019/06/12/tsmc-and-samsung-moving-to-5nm-whither-moores-law/
  59. https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101#What-is-quantum-computing
  60. https://www.cnbc.com/2019/10/23/google-claims-successful-test-of-its-quantum-computer.html
  61. https://www.researchgate.net/figure/Key-Elements-of-Blockchain-Systems_fig1_327711685
  62. https://www.investopedia.com/terms/b/blockchain.asp
  63. https://www.iberdrola.com/innovation/data-mining-definition-examples-and-applications
  64. https://www.betterteam.com/data-analyst-job-description
  65. https://www.payscale.com/research/CA/Job=Data_Analyst/Salary?_ga=2.140834290.1516550751.1575146513-552329811.1575146513
  66. https://www.roberthalf.co.nz/our-services/finance-accounting/business-analyst-jobs
  67. https://www.payscale.com/research/CA/Job=Junior_Business_Analyst_(Unspecified_Type)/Salary?_ga=2.251395779.473598037.1575146619-1718049016.1575146619
  68. https://sceweb.sce.uhcl.edu/helm/ROLE-Tester/myfiles/IBMRUP/process/workers/wk_tstanl.htm
  69. https://www.payscale.com/research/CA/Job=Test_Analyst/Salary?_ga=2.220324376.184936868.1575146756-1002922405.1575146756