Data Flow Task in SSIS
We will begin with the tutorial for Data flow task in SSIS. SSIS is one my most favorite topic and simply like to work on this technology.
There is always a misconception between control flow and data flow. Hence we need to know the difference between both.
Control flow
A control flow consists of one or more tasks and containers that execute when the package runs. To control order or define the conditions for running the next task or container in the package control flow, we use precedence constraints to connect the tasks and containers in a package. A subset of tasks and containers can also be grouped and run repeatedly as a unit within the package control flow. SQL Server 2005 Integration Services (SSIS) provides three different types of control flow elements: Containers that provide structures in packages, Tasks that provide functionality, and Precedence Constraints that connect the executables, containers, and tasks into an ordered control flow.
Data flow
A data flow consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations The Data Flow task is the executable within the SSIS package that creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. Data Sources, Transformations, and Data Destinations are the three important categories in the Data Flow.
Data flow as name suggests is use to transfer data from source to destination. Data flow should always contain a source and a destination. If business demands us to modify our data while transferring data from source to destination we might need to use transformations to achieve our task.
Following are some commonly used transformation that are available:
AGGREGATE - It applies aggregate functions to Record Sets to produce new output records from aggregated values.
AUDIT - Adds Package and Task level Metadata - such as Machine Name, Execution Instance, Package Name, Package ID, etc..
CHARACTER MAP - Performs SQL Server level makes string data changes such as changing data from lower case to upper case.
CONDITIONAL SPLIT – Separates available input into separate output pipelines based on Boolean Expressions configured for each output.
COPY COLUMN - Add a copy of column to the output we can later transform the copy keeping the original for auditing.
DATA CONVERSION - Converts columns data types from one to another type. It stands for Explicit Column Conversion.
DATA MINING QUERY – Used to perform data mining query against analysis services and manage Predictions Graphs and Controls.
DERIVED COLUMN - Create a new (computed) column from given expressions.
EXPORT COLUMN – Used to export a Image specific column from the database to a flat file.
FUZZY GROUPING – Used for data cleansing by finding rows that are likely duplicates.
FUZZY LOOKUP - Used for Pattern Matching and Ranking based on fuzzy logic.
IMPORT COLUMN - Reads image specific column from database onto a flat file.
LOOKUP - Performs the lookup (searching) of a given reference object set against a data source. It is used for exact matches only.
MERGE - Merges two sorted data sets into a single data set into a single data flow.
MERGE JOIN - Merges two data sets into a single dataset using a join junction.
MULTI CAST - Sends a copy of supplied Data Source onto multiple Destinations.
ROW COUNT - Stores the resulting row count from the data flow / transformation into a variable.
ROW SAMPLING - Captures sample data by using a row count of the total rows in dataflow specified by rows or percentage.
UNION ALL - Merge multiple data sets into a single dataset.
PIVOT – Used for Normalization of data sources to reduce analomolies by converting rows into columns
UNPIVOT – Used for demoralizing the data structure by converts columns into rows incase of building Data Warehouses.