Big Data

Our approach to mobilizing big data without sacrificing privacy

Big data is generally defined as a data set that is so large that innovative forms of processing are required to capture, analyze, and manage it. Every time you go online, your digital footprints leave a trail of data: tweets, social media posts, emails, search engine queries, websites visited. When you shop in a mall, more data about your consumer choices is collected as well. That trail of information, along with everyone else’s, is gathered into a massive big data environment, where trends and patterns and profiles can be collated and correlated by sophisticated data analysis techniques. 

The three primary characteristics of big data are volume, velocity, and variety.  

Volume is the amount of data, measured by today’s benchmarks in petabytes or even exabytes. For context: the digital archives of the US Library of Congress were reported to contain around 74 terabytes of information in 2009; in 2011, the US National Security Agency boasted of collecting this much information every six hours.

Velocity is the speed at which data moves. Big data is often generated, processed, and analyzed in real time.

Variety is the range of data sources and formats. Big data can capture structured and unstructured data in all formats from a vast array of sources, both online and offline.

Two additional descriptors have since been added: veracity – the quality of the data – and variability – its consistency, and thus reliability.  

Big data is growing exponentially on all of these fronts and is now crucial to commercial success. The consumer data being leveraged for corporate benefit is far more detailed, and more individualized, than ever before. The savvy use of big data analytics can add another “V” factor – value – by giving companies a competitive edge. Businesses large and small are realizing that if they don’t develop big data strategies, they’ll get left behind.

Big data analytics has rapidly become so ubiquitous that all companies will need to engage with it in some shape or form. But if you’re new to big data, how best can you go about it? Data analysis is a complex and specialized field; how you initially choose to engage with it will depend on the current level of expertise in your IT department. As your capacity and experience grow, you may want to explore more advanced options.

Getting Started with Big Data 

1. Get industry insights through reports

A good place to start with data analytics is to subscribe to reports on data trends, or to a trends-based web service. These services can provide valuable research data relevant to your business operations. Big data allows researchers to collect and analyze various industry trends and patterns, draw conclusions, and publish or sell the results.  

2. Take advantage of web analytics

A next step is to access customizable web analytics services. This is a more hands-on approach to investigating your specific customer base and how it interacts with your business. The big advantage is that these services are highly customizable; however, this means that you need to know exactly what you’re looking for to get the most from what they offer. 

Analytics services allow you to monitor and analyze how long users spend on your website, which pages they read, where they came from, and other useful information. These services may appear more rudimentary than the analysis performed by professionals, but they are customized to your business, rather than to general industry trends.  
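As a rough illustration of the kind of summary an analytics service produces, the sketch below computes time spent per page and visitors per referrer from a small page-view log. The log format, the field names, and the `summarize` function are assumptions made for this example only; real analytics services expose their own reports and APIs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical page-view log: (visitor_id, page, referrer, ISO timestamp).
PAGE_VIEWS = [
    ("v1", "/home", "search", "2024-01-05T10:00:00"),
    ("v1", "/pricing", "search", "2024-01-05T10:02:30"),
    ("v1", "/contact", "search", "2024-01-05T10:03:00"),
    ("v2", "/home", "newsletter", "2024-01-05T11:00:00"),
    ("v2", "/pricing", "newsletter", "2024-01-05T11:01:00"),
]


def summarize(page_views):
    """Return (seconds spent per page, visitor count per referrer)."""
    by_visitor = defaultdict(list)
    for visitor, page, referrer, ts in page_views:
        by_visitor[visitor].append((datetime.fromisoformat(ts), page, referrer))

    seconds_per_page = defaultdict(float)
    visitors_per_referrer = defaultdict(int)
    for views in by_visitor.values():
        views.sort()  # chronological order within one visitor
        visitors_per_referrer[views[0][2]] += 1
        # Time on a page is approximated as the gap until the next view;
        # the last page of a visit has no following view and is skipped.
        for (start, page, _), (end, _, _) in zip(views, views[1:]):
            seconds_per_page[page] += (end - start).total_seconds()
    return dict(seconds_per_page), dict(visitors_per_referrer)


seconds_per_page, visitors_per_referrer = summarize(PAGE_VIEWS)
print(seconds_per_page)
print(visitors_per_referrer)
```

Note that even a pseudonymous visitor identifier like the one assumed here may count as personal information once it is linked to other data.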

3. Build your capacity for data analysis

Later, you may choose to build your own data analysis infrastructure, by integrating a software framework for the distributed storage and processing of massive data sets into your current large-scale data management system. Organizations that regularly deal with large amounts of data will benefit from this dynamic and interactive approach, which allows incoming information to be integrated into an existing database. This can help build customer profiles, improving marketing and sales.  

Building data analysis infrastructure is a complex process, but it can produce invaluable information. Depending on the data source and the consent provided, companies may be able to share data with, or sell it to, third parties, but even if they choose not to, the internal use of data analytics can be well worth the investment.

Big data analytics has tremendous potential. With it, you can weave together massive amounts of disparate information and discover unforeseen patterns that enable you to optimize marketing, distribution, staffing, resource allocation, and virtually any other aspect of your operations. This is the source of both its value and its considerable privacy risk. Big data initiatives often collect, use, and share highly detailed personal information without individuals’ awareness or understanding. Even when consent is sought, data is frequently used and linked in unanticipated ways to which individuals have not agreed.

Big Data and Privacy Law 

While privacy laws restrict personal data collection practices, big data initiatives are collecting, aggregating, and sharing larger and larger volumes of data. If your organization is using big data, it also needs big privacy to avoid a breach and remain compliant with privacy legislation. A strong privacy framework, with effective training, tools, and implementation processes, is mandatory to govern big data initiatives.

The ground rules of personal data protection are essentially no different in a big data environment than in more conventional contexts: Canadian organizations should only collect, store, use, and share personal information for specific purposes to which an individual has consented.  

The Office of the Privacy Commissioner of Canada has ruled that information collected through online tracking generally constitutes personal information, and that under PIPEDA, website owners have a legal responsibility to protect the personal information of the people using their website. Further, they are responsible for obtaining meaningful consent from individuals to the collection and use of their information.

Meaningful consent can use either an opt-in or opt-out model. The key is that individuals are notified of the purposes of data collection and all of the parties who will have access to the data, as well as other information about privacy and security practices. Website owners should avoid collecting sensitive information, such as medical information, and destroy or de-identify data as soon as possible. 

In addition to technological solutions, big privacy for a big data environment means a mature privacy program with effective data governance practices. Four principles of big privacy are: 

1. Data Lifecycle Management

The protective measures set out in your privacy policies, including all security and privacy requirements that protect datasets, must be tracked and maintained throughout the information life cycle, from data collection through use, retention, disclosure, and destruction. Individuals should be notified of these practices at the time of collection.
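One small piece of life cycle management, the destruction step, can be sketched as a routine check for records that have outlived their retention period. The retention periods, record fields, and function name below are hypothetical and chosen only for illustration; actual periods depend on your policies and legal obligations.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per purpose of collection.
RETENTION_DAYS = {"marketing": 365, "billing": 7 * 365}


def records_due_for_destruction(records, today):
    """Return ids of records whose retention period has elapsed."""
    due = []
    for record in records:
        limit = timedelta(days=RETENTION_DAYS[record["purpose"]])
        if today - record["collected_on"] > limit:
            due.append(record["id"])
    return due


records = [
    {"id": "r1", "purpose": "marketing", "collected_on": date(2022, 1, 10)},
    {"id": "r2", "purpose": "marketing", "collected_on": date(2023, 11, 1)},
    {"id": "r3", "purpose": "billing", "collected_on": date(2020, 6, 1)},
]
due = records_due_for_destruction(records, today=date(2024, 1, 5))
print(due)
```

Tying each record to the purpose for which it was collected keeps the retention rule aligned with the consent the individual gave.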

2. Anonymization of Secondary Use Data  

Where personal information (PI) is to be used for secondary purposes, including external (and sometimes internal) sharing, it must be anonymized or de-identified to protect individual privacy. Unfortunately, data analytics can reverse many older de-identification methods by re-linking correlated data. If data is not completely anonymized, the possibility and risk of linking different data sets must be evaluated.
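One common way to gauge re-linking risk is k-anonymity, a measure not named in the text above but widely used for this purpose: for a chosen set of indirectly identifying fields (quasi-identifiers), find the size of the smallest group of records that share the same values. A group of size 1 means a record is unique on those fields and could be singled out by linking with another data set. The sketch below is a minimal illustration; the field names and sample rows are hypothetical.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical de-identified rows: names removed, but postal prefix and
# birth year remain and could be matched against other sources.
rows = [
    {"postal_prefix": "M5V", "birth_year": 1980, "purchase": "A"},
    {"postal_prefix": "M5V", "birth_year": 1980, "purchase": "B"},
    {"postal_prefix": "K1A", "birth_year": 1975, "purchase": "A"},
]
k = smallest_group_size(rows, ["postal_prefix", "birth_year"])
print(k)
```

Here the third row is unique on postal prefix and birth year, so the data set would need further generalization or suppression before being shared. A high value of k reduces linking risk but does not by itself establish that data is anonymized.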

3. Evaluation of Data Recipients 

If your organization shares individual-level data, you’re obligated under PIPEDA to ensure that data recipients’ security and data privacy policies provide “a comparable level of protection” to your own. Be sure to carefully evaluate the privacy and security practices of third parties requesting access to data. Even if the data is de-identified, you will need to evaluate the risk that linking data sets from several sources could re-identify individuals. Ensure that any companies to which you sell data, or with which you share it, can demonstrate that their own privacy practices are in full compliance with PIPEDA, and that data in their care is protected in all stages of its life cycle.

4. Legislative Compliance  

Your organization must identify and understand all the privacy regulations that apply to the data you store, process, and transmit. Make sure that you are aware of the exact definitions of terms such as “personal information,” “personal data,” “identifying information,” “permissible purposes,” and “permissible disclosures” in any jurisdictions relevant to your organization.

For any new (or redesigned) project, we strongly recommend conducting a Privacy Impact Assessment (PIA), to identify and address potential privacy risks, and ensure that your personal information management practices are aligned with PIPEDA. A PIA acts as an early warning system, showing you where you need to build safeguards into your data management practices. 
