Carry out data manipulation
Overview
This standard is about carrying out data manipulation.
Data engineers prepare data for analytical or operational uses. They are typically responsible for designing and building data pipelines to bring together information from different source systems. They integrate, consolidate, cleanse and structure data to make it easily accessible and in a usable format. Processed data can then be used by business executives, data analysts and other end users to inform organisational processes and decision making.
Data Extraction, Transformation and Loading (ETL) involves identifying data sources, importing data, loading data, converting data, merging and consolidating data, and processing data. It also includes storing prepared data ready for analysis or other organisational processes.
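The ETL steps described above can be sketched as follows. This is a minimal illustration using only Python's standard library; the sample data, field names and table name are illustrative assumptions, not part of the standard.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source (an in-memory sample here;
# in practice this would be a file, database or API response).
raw_csv = "id,name,amount\n1,Alice,10.5\n2,Bob,3.2\n2,Bob,3.2\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cleanse (drop duplicate records) and convert types.
seen = set()
cleaned = []
for row in rows:
    if row["id"] in seen:
        continue  # skip duplicate source records
    seen.add(row["id"])
    cleaned.append((int(row["id"]), row["name"], float(row["amount"])))

# Load: store the prepared data ready for analysis or other uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In practice each stage would be a separate, testable step in a pipeline, with the staged output stored where end users and downstream processes can reach it.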
This standard is for those who need to carry out data manipulation as part of their duties.
Performance criteria
You must be able to:
- Identify the main data types used within an organisation to support data understanding and handling
- Review and agree dataset requirements with stakeholders to plan data preparation tasks
- Identify target data sources to determine accessibility constraints
- Implement security measures for data manipulation to maintain data resilience in line with organisational standards
- Apply available data pipelines to assist in providing data flows
- Extract, transform and load data for manipulation in line with organisational requirements
- Combine and manipulate data from various structured and unstructured sources to produce consolidated datasets in line with requirements
- Anonymise data in line with organisational and legal requirements for data access, handling and sharing
- Convert data to defined structures and file formats in line with organisational requirements
- Export and store datasets into staged data environments to make data available to end users
- Develop code to automate data extraction and manipulation
- Document source-to-target mappings to show data lineage
- Document data manipulation activities and dataset features in line with organisational procedures
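Several of the criteria above can be sketched together: combining sources on a common key, anonymising an identifier before sharing, and recording a simple source-to-target mapping for lineage. This is an illustrative sketch using Python's standard library; all dataset names, field names and the salt value are assumptions.

```python
import hashlib

# Two source datasets sharing a common key ("customer_id") -- sample data.
crm_records = [
    {"customer_id": "C001", "name": "Alice"},
    {"customer_id": "C002", "name": "Bob"},
]
orders = [
    {"customer_id": "C001", "order_total": 42.0},
    {"customer_id": "C002", "order_total": 17.5},
]

def pseudonymise(value, salt="org-secret-salt"):
    # Replace a direct identifier with a salted hash -- one common
    # anonymisation technique; the salt here is a placeholder.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# Combine the sources on the common key into a consolidated dataset.
names_by_id = {r["customer_id"]: r["name"] for r in crm_records}
consolidated = [
    {
        "customer_ref": pseudonymise(o["customer_id"]),  # anonymised key
        "order_total": o["order_total"],
    }
    for o in orders
    if o["customer_id"] in names_by_id
]

# A simple source-to-target mapping, recorded to show data lineage.
lineage = {
    "customer_ref": "crm_records.customer_id -> sha256(salted), truncated",
    "order_total": "orders.order_total (unchanged)",
}
```

Note that salted hashing is pseudonymisation rather than full anonymisation; whether it satisfies legal requirements depends on the organisational and regulatory context.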
Knowledge and understanding
You need to know and understand:
- How to access and extract data securely from organisational data sources
- The need to document data lineage when using and sharing data
- The role of data ownership and associated responsibilities in sourcing and accessing data
- The main file formats for storing and sharing data
- Industry standard tools used for handling, sharing and managing data
- Why data manipulation is important
- How data manipulation makes a dataset easier to understand and to break into manageable chunks
- Industry standard data processing languages and how to use them
- How to access and load the dataset to perform manipulation
- The different terms that refer to data manipulation, including preparing, transforming and wrangling data
- How to join and merge multiple datasets from various sources using common keys to combine them into a single dataset
- The industry standard processes that are used to manipulate data
- Organisational policies and national regulations associated with data management and data protection, storing and sharing data
- The requirement for effective, safe usage and security of data within organisations
- The difference between wide and long data formats and how to apply them for structuring datasets
- How to format datasets to produce the final structure required
- How to provide documentation associated with data manipulation activities
- How to design, write and iterate code from prototype to production-ready for data manipulation and staging solutions
- How to work with large or complex datasets
- The importance of ethics in relation to data engineering, including organisational codes of practice
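The difference between wide and long formats mentioned above can be shown with a short sketch: a wide table has one column per measurement period, while the long form has one row per observation. The region and month fields below are illustrative assumptions.

```python
# Wide format: one column per month, one row per region.
wide = [
    {"region": "North", "jan": 100, "feb": 120},
    {"region": "South", "jan": 80, "feb": 95},
]

# Long format: one row per region/month observation, which is often
# easier to filter, aggregate and load into analytical tools.
long_rows = [
    {"region": row["region"], "month": month, "sales": row[month]}
    for row in wide
    for month in ("jan", "feb")
]
```

Industry-standard tools provide this reshaping directly (for example, pivot and melt operations in data processing libraries); the sketch above shows only the underlying structure of the transformation.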