Saturday, 7 December 2013

Creating simple Bar chart in Tableau

                                             Creating simple Bar chart in Tableau

This is my first post on Tableau. Tableau is a fantastic data visualization tool. It help us to create some fantastic dashboards using this tool.

In this post we will learn how to create a simple bar chart in the dashboard.



As shown in the above screen shot, my data contains list of countries the regions and sub-region in which they falls.


We will now create a bar chart which will display the no. of countries according to the regions.

Step1: Drag and drop Region field in the dimension section to columns and Number of countries field in the row section.



A graph will be plotted automatically as shown in the above figure.

Step2: To add label to the graph, drag and drop the 'number of countries' field in the label section.




Now the labels is visible on the graph

Step3: Now we will color the graph according to the 'Number of countries'. The darker the color the larger the number of countries available in the region.



Step4: Sort the graph according to the number of countries. Go to column field and click on the 'Region' field and select the option sort -> ascending-> sort by field-> click OK.



You can also sort the chart in descending order by selecting the descending option.




Friday, 22 November 2013

Difference between Char,Varchar and Nvarchar in SQL

                Difference between Char,Varchar and Nvarchar in SQL


Char DataType:

  • Char Datatype is used to store fixed length characters. 
  • If declared as Char(100) then it will allocate space of memory for 100 characters, no matter how much characters is actually stored in the column
Varchar DataType: 
  • Abbreviation for Varchar is Variable length characters
  • It is used to store Non-Unicode characters
  • Since the name contains 'Variable' it specifies that the memory allocation varies as per the characters stored in the column. For e.g 'Nadir' is stored in the column then the memory allocation would be 5. It allocates 1 byte per character
  • Maximum length for this data type is 8000
  • Varchar supports collation which requires 1 byte memory storage per characters
  • No other language except the once requires 1 byte memory storage per characters can be stored for example: English language characters can be stored in varchar
Nvarchar DataType:
  • Abbreviation for Nvarchar is Uni-code variable length characters
  • It is used to store Unicode characters
  • The memory allocation varies similar to varchar datatype except it allocates 2 byte per characters
  • Maximum length for this data type is 4000
  • Nvarchar supports all types of collation
  • Multiple languages can be stored in nvarchar datatype by using prefix 'N'. Below example explains this point.
                 e.g: Declare @str nvarchar(4000) = N'最低点'

                                                               Highlights:
  1. As both varchar and nvarchar datatypes are variable in nature it acquires space as per the characters stored irrespective of the max limit specified
  2. It is always recommendable to use varchar compared to nvarchar when the data is in no other language then English
  3. Query execution for varchar is generally faster compared to nvarchar

Sunday, 17 November 2013

Data Flow Task in SSIS

                                    Data Flow Task in SSIS


We will begin with the tutorial for Data flow task in SSIS. SSIS is one my most favorite topic and simply like to work on this technology.

There is always a misconception between control flow and data flow. Hence we need to know the difference between both.

Control flow

                  A control flow consists of one or more tasks and containers that execute when the package runs. To control order or define the conditions for running the next task or container in the package control flow, we use precedence constraints to connect the tasks and containers in a package. A subset of tasks and containers can also be grouped and run repeatedly as a unit within the package control flow. SQL Server 2005 Integration Services (SSIS) provides three different types of control flow elements: Containers that provide structures in packages, Tasks that provide functionality, and Precedence Constraints that connect the executables, containers, and tasks into an ordered control flow.

Data flow
                A data flow consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations The Data Flow task is the executable within the SSIS package that creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. Data Sources, Transformations, and Data Destinations are the three important categories in the Data Flow.


Data flow as name suggests is use to transfer data from source to destination. Data flow should always contain a source and a destination. If  business demands us to modify our data while transferring data from source to destination we might need to use transformations to achieve our task.


Following are some commonly used transformation that are available:

AGGREGATE  - It applies aggregate functions to Record Sets to produce new output records from aggregated values.
AUDIT  - Adds Package and Task level Metadata - such as Machine Name, Execution Instance, Package Name, Package ID, etc.. 
CHARACTER MAP - Performs SQL Server level makes string data changes such as changing data from lower case to upper case.
CONDITIONAL SPLIT – Separates available input into separate output pipelines based on Boolean Expressions configured for each output.
COPY COLUMN - Add a copy of column to the output we can later transform the copy keeping the original for auditing.
DATA CONVERSION - Converts columns data types from one to another type. It stands for Explicit Column Conversion.
DATA MINING QUERY – Used to perform data mining query against analysis services and manage Predictions Graphs and Controls.
DERIVED COLUMN - Create a new (computed) column from given expressions.
EXPORT COLUMN – Used to export a Image specific column from the database to a flat file.
FUZZY GROUPING – Used for data cleansing by finding rows that are likely duplicates.
FUZZY LOOKUP -  Used for Pattern Matching and Ranking based on fuzzy logic.
IMPORT COLUMN - Reads image specific column from database onto a flat file.
LOOKUP - Performs the lookup (searching) of a given reference object set against a data source. It is used for exact matches only.
MERGE - Merges two sorted data sets into a single data set into a single data flow.
MERGE JOIN - Merges two data sets into a single dataset using a join junction.
MULTI CAST - Sends a copy of supplied Data Source onto multiple Destinations.
ROW COUNT - Stores the resulting row count from the data flow / transformation into a variable.
ROW SAMPLING - Captures sample data by using a row count of the total rows in dataflow specified by rows or percentage.
UNION ALL - Merge multiple data sets into a single dataset.
PIVOT – Used for Normalization of data sources to reduce analomolies by converting rows into columns

UNPIVOT – Used for demoralizing the data structure by converts columns into rows incase of building Data Warehouses. 

Saturday, 16 November 2013

Difference between UNION and UNION ALL

                                    Difference between UNION and UNION ALL


Both Union and Union all statements are use to merge data from multiple queries which has same number of columns. Also, the data types of the column should be same.

We will look into the difference between both operators using an example. We have the following two tables

Empdetails:











Custdetails:














Union: Union merge data from two or  more queries and returns a sorted unique result set. Since this operator provides distinct result set it takes more time than Union All.

Query:

               select * from EmpDetails union  select * from custdetails   
Output:















Union All: Union all merge data from two or more queries and returns non-unique result set. Since this operator does not perform distinct on the result set, it's performance is much faster than Union operator

Query:

            select * from EmpDetails union all select * from custdetails 

Output:




















Following are the constraints that are associated with UNION and UNION ALL operator:

  • The tables should contains the same number of columns
  • Columns in the tables must be of the same data type
  • Order by clause should be applied to the final result set
  • Column name for the result set would be of the first query mentioned



See Also:   Different Types of Joins in SQL   , Difference between CTE and View



Different Types of Joins in SQL

                          Different Types of Joins in SQL



Joins are classified into three types as below:

  1. Inner Join
  2. Outer Join
  3. Cross Join


We will look into this by considering an example. Suppose we have two table as follows:

Empdetails:













EmpSalary:













Inner and outer joins perform join on the basis of at least one field. in our example this common file is empid.
  • Inner Join: Inner join is the most commonly used join. This join returns rows when there is a match in both the tables. Now we will apply this join between the two sample tables
                     Query:
                                 select empdetails.empid,empdetails.empname,empsalary.empsalary
                                 from empdetails , empsalary where empdetails.empid = empsalary.empid

                 Now in this query we are trying to pull salary of employee from empsalary tables for the                                employees which are present in empdetails table The result for this query will be as follows:


  • Outer Join: Outer join is of three type: Left outer Join and Right Outer Join
Left outer Join: This join returns all the rows from the left table in conjunction with the matching rows from the right table. If there are no columns matching in the right table, it returns NULL values.

Example:

Query:

                     select empdetails.empid,empdetails.empname,empsalary.empsalary
                     from empdetails left outer join empsalary on empdetails.empid = empsalary.empid



Output:















Right outer Join:This join returns all the rows from the right table in conjunction with the matching rows from the left table. If there are no columns matching in the left table, it returns NULL values.

Example:

Query:


    select empdetails.empid,empdetails.empname,empsalary.empid,empsalary.empname,empsalary.empsalary

    from empdetails right outer join empsalary on empdetails.empid = empsalary.empid





Output:


























Full outer Join:This join combines left outer join and right outer join. It returns row from either table when the conditions are met and returns null value when there is no match.

Example:

Query:

select empdetails.empid,empdetails.empname,empsalary.empid,empsalary.empname,empsalary.empsalary

from empdetails full outer join empsalary on empdetails.empid = empsalary.empid





Output:


















  • Cross Join: It is the Cartesian product of the two tables involved. 

Example:

Query:


   select empdetails.empid,empdetails.empname,empsalary.empid,empsalary.empname,empsalary.empsalary
   from empdetails cross join empsalary 





Output:




























Join conditions can be specified in either the FROM or WHERE clauses; specifying them in the FROM clause is recommended. WHERE and HAVING clauses can also contain search conditions to further filter the rows selected by the join conditions.

Friday, 15 November 2013

Difference between truncate and drop command in SQL

                 Difference between truncate and drop command in SQL


Both truncate and delete are DDL statements and changes done by both can't be reverted. Following are the main differences between both:

Truncate Command:
  • Truncate removes all the records from the table/view
  • Truncate command is used to remove data from the table
  • The schema for the table remains as it is
  • It does not effect relationship between tables
Drop Command:

  • Drop command is used to remove and object from the database
  • The object(table/view/indexes,etc) is completely removed from the database
  • The schema is also removed
  • All the relationship for that object is removed and if we need that object to use again then we will have to develop it from scratch

Saturday, 9 November 2013

Difference between having and where clause in SQL

           Difference between having and where clause in SQL


It is very important to know the difference between Having and Where in order to get the desired results from the query

WHERE clause:
  • can be used with Select,Insert,Delete statements
  • can't use aggregate functions in where clause
  • Where clause filter results before the results are Grouped as Where clause can only be written before Group clause 
  • Where clause applies to rows
Example:

Select fname,country_name from person
where country_name in ('US','UK')

HAVING clause:
  • can be used with Select statement only
  • can use aggregate functions in Having clause
  • Having clause filters results after the results are grouped as having clause can be used after Group clause only
  • Having clause applies to groups
Example:

Select fname,max(sal) from person
group by fname having max(sal) > 10000


Similar Posts:  CTE vs VIEW, TEMP TABLE vs CTE

Synchronous and Asynchronous transformation in ssis

Synchronous and Asynchronous transformation in ssis


                     Synchronous transformation process the incoming rows and pass it to the output one at a time.Output is synchronized with the input. it does not create a separate buffer for output. same buffer is used both for input and output rows. All the non-blocking transformations are synchronous transformation. 

                   An Asynchronous transformation store the input data into the buffer and perform some operations before releasing it to the output buffer. There is two different buffer create for both input and output.Semi-Blocking and Full-Blocking transformations are asynchronous transformation.

Nonblocking, Partial Blocking and Blocking Transformations in SSIS

   Nonblocking, Partial Blocking and Blocking Transformations In SSIS



Transformation in SSIS lets you manipulate data during ETL(extract, transform and Load).  Operations such as aggregate, merge, split, modify,etc can be performed on data using transformations.

Transformations are generally divided into three categories as follows:

  1. Non-blocking transformation: The output of this transformation uses the same buffer as that of input. It works on row by row basis. The transformation works on a synchronous basis. This transformations are efficient and lightweight results in better and faster transformation compared to other types of transformation.
  2. Partial Blocking: This transformations works on a row-set basis and stores the data in buffer before releasing it to the output buffer.This transformation creates a new buffer for output rows. As this transformations generally stores some data in the buffer to perform  some operations this are generally heavy-weighted compare to Non-blocking transformations and hence performance is slower compared to Non-blocking transformation.
  3.  Blocking Transformations: This transformations work on the whole data set at a time and hence stores the entire data in buffer. This transformation works on asynchronous basis.This transformation creates new buffer for output rows. consider an aggregate transformation which calculate the sum of a field in the data set, for this the transformation will have to store the entire data into buffer and then perform sum on that data set and then the results will be sent to output buffer. Because transformation stores the entire data set into the buffer, this transformations are generally the most heavy-weighted compare to other transformations and hence results in low performance. 

Following are the list of Non blocking, Partial Blocking and Blocking Transformations in SSIS





Saturday, 12 October 2013

Decision Making statements in Visual Basic

                      Decision Making statements in Visual Basic


              Decision making statements are often used to validate certain condition and based on that execute some piece of code. As the name suggest, this statements are used  to check certain conditions. We can have multiple conditions in the statement and according to output, whether the check returns True or False, make some decision in your program.

             There are two types of decision making statements in visual basic, as follows:

  1. If Statements
  2. Select Case Statements



If statements can further be classified into three types:
  • If - Then Statements:  An If - Then statements has only one condition and if that condition gets satisfied that a particular piece of code gets executed
Example:

                         If  name = "XYZ" Then
                                       MsgBox "Welcome" & name
                         End if
  • If -then-else statements: This statements handles both the scenario, what should happen when a condition gets satisfied and what should happens if it doesn't.
Example:

                         If  name = "XYZ" Then
                                       MsgBox "Welcome" & name
                             Else
                                    MsgBox "You are not an authorized person"
                         End if
  • Nested IF-Then-Else Statements: This statements has two or more statements based on which our program acts.
Example:

                         If  name = "XYZ" Then
                                       MsgBox "Welcome" & name
                             Elseif name = "ABC"
                                      MsgBox "Welcome" & name
                             Else
                                      MsgBox "You are not an authorized person"
                         End if

Select Case statements are generally used when there are more conditions to check , which gets quite complicated to handle

Example:

                         Select Case name
Case "XYZ"
                                                MsgBox "Welcome" & name
Case "ABC"
                                                MsgBox "Welcome" & name
Case "IJK"
                                                MsgBox "Welcome" & name
Case "ELSE"
                                                MsgBox "You are not an authorized person"
END SELECT

Difference between temp table and CTE

                                            Difference between temp table and CTE

 CTE:                 

  • CTE is  un-materialized/ non-indexable (cannot create indexes on CTE)
  • CTE is logical/disposable View
  • CTE persists only till the very next query
  • CTE cannot have constraints
  • CTE is mostly used for recursion, as CTE can call itself
  • CTE resists in memory



 Temp Table:

  • Temp table gets stored in temp table
  • Temp table persists till the current connection ends
  • Temp table can be referred in sub procedure
  • Temp table can have constraints,indexes and primary defined
  • Indexes can be implemented in Temp Table
  • Data can be updated in Temp Table
  • Temp Tables are stored in disk

Similar Posts:   Difference between CTE and View  ,  CTE(Common Table Expressions) in SQL ,                                      T- SQL Interview Questions and Answer

Difference between CTE and View - SQL

             Difference between CTE and View


               A CTE is a temporary/logical View, it is not store physically. It is a named query, the result for which is only available to the very next query after the CTE is defined. CTE is defined using WITH clause.

                  A View is a physical object that is present in the database. View is as good as a Table but it doesn't store data physically as compared to a table, only the data schema is stored in View. View,when referred, pulls data by executing the query that is associated with it.


                 The biggest difference between a CTE and View, is that, View or derived table cannot call itself, whereas CTE can call itself and hence support recursion.


                In many databases, views have options, for instance to index them.

Monday, 9 September 2013

Sub query/Nested Query/Inner Query in SQL

            Sub Query/Nested Query/Inner Query in SQL

Sub query or Inner query or Nested query is a query in a query. A sub query is usually added in the WHERE Clause of the SQL statement.

A Subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with the operators like =, <, >, >=, <=, IN, BETWEEN etc.
There are a few rules that subqueries must follow:

  • Subqueries must be enclosed within parentheses.

  • A subquery can have only one column in the SELECT clause, unless multiple columns are in the main query for the subquery to compare its selected columns.

  • An ORDER BY cannot be used in a subquery, although the main query can use an ORDER BY. The GROUP BY can be used to perform the same function as the ORDER BY in a subquery.

  • Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator.

  • The SELECT list cannot include any references to values that evaluate to a BLOB, ARRAY, CLOB, or NCLOB.

  • A subquery cannot be immediately enclosed in a set function.

  • The BETWEEN operator cannot be used with a subquery; however, the BETWEEN operator can be used within the subquery.


CTE in SQL

                    CTE(Common Table Expressions) in SQL


CTE( Common Table Expressions) was first introduced with SQL 2005. It is the best replacement for Views.

There are times when we might require to work on small data sets that is a part of large database. For example, we might have a huge sized table that contains data for all transactions that a company has made, but we require current month's data only and apply some grouping or filters on that particular data of current month. Typically views are used to do the task. But, views sometimes over occupy our resources as views are stored at system level. CTE is the best substitute for Views. It is also helpful when we don't have access to create object in a database.


Common Table Expressions or CTE are more readable form of derived table. CTE can be declared once and can be used in multiple queries. CTE can be recursive and non-recursive (we will cover this topic in next post)

Common Table Expression Syntax
A Common Table Expression contains three core parts:
  • The CTE name (this is what follows the WITH keyword)
  • The column list (optional)
  • The query (appears within parentheses after the AS keyword)

EXAMPLE:

WITH ProductAndCategoryNamesOverTenDollars (ProductName, CategoryName, UnitPrice) AS
(
   SELECT
      p.ProductName,
      c.CategoryName,
      p.UnitPrice
   FROM Products p
      INNER JOIN Categories c ON
         c.CategoryID = p.CategoryID
   WHERE p.UnitPrice > 10.0
)


SELECT *
FROM ProductAndCategoryNamesOverTenDollars
ORDER BY CategoryName ASC, UnitPrice ASC, ProductName ASC


                            The CTE is part of the subsequent statement only.The subsequent statement can be a single SELECT/INSERT/UPDATE/DELETE, or a compound (with UNION, INTERSECT etc). So if we modify our above example and include a query between CTE creation query(WITH) and the query that refer the CTE.(like below)



WITH ProductAndCategoryNamesOverTenDollars (ProductName, CategoryName, UnitPrice) AS
(
   SELECT
      p.ProductName,
      c.CategoryName,
      p.UnitPrice
   FROM Products p
      INNER JOIN Categories c ON
         c.CategoryID = p.CategoryID
   WHERE p.UnitPrice > 10.0
)

SELECT * FROM Products 

SELECT *
FROM ProductAndCategoryNamesOverTenDollars
ORDER BY CategoryName ASC, UnitPrice ASC, ProductName ASC


Here, we will see that when the query that refers to CTE(ProductAndCategoryNamesOverTenDollars) is executed we will get an error "Invalid object name 'ProductAndCategoryNamesOverTenDollars'" 

Sunday, 8 September 2013

VBA- Enable Developer Tab - EXCEL

                     Enable Developer Tab

Following are the steps to enable developer tab in Excel 2007


  • Click on 'Office Button' to top left of the workbook and then click on 'Excel Options'

  • Check the option 'Show developer tab in the ribbon' and click on OK

You might now be able to see developer tab in the workbook

  • To start the Visual Basic Editor click on 'Developer' Tab and click on the 'Visual Studio' option (very first option from left hand side)




You will now see visual basic editor to begin coding



SQL Tuning/Query Optimization/SQL Optimization

           SQL Tuning/Query Optimization/SQL Optimization


Query Optimization is often misunderstood as a technique or some setting that we need to change in SQL. In-real query optimization is something related to best practice that we have to follow in your day-to-day life.

Query optimization is all about how we use clause and operators in our query.

I have listed below some examples that we might be using daily and best practice that needs to follow to optimize for better performance.

  • SQL query becomes faster if you specify column names in query instead of '*'
                  EXAMPLE:
                             Normal way:
                                 select * from emp_details

                             Optimized way:
                                 select fname,lname from emp_details   

                  

  • Always use HAVING clause instead of WHERE clause whenever you are using GROUP BY in your query. Using HAVING clause allows SQL to first group the data and then filter the grouped data using HAVING clause. 

                  EXAMPLE:
                             Normal way:
                                 select fname.lname from emp_details
                                 where fname like 'XYZ'
                                 group by fname

                             Optimized way:
                                select fname,lname from emp_details
                                group by fname
                                having fname like 'XYZ'


  • Create Index on the column which we will be using to filter data.Following priority order should be followed when any index is created a) WHERE clause, b) JOIN clause, c) ORDER BY clause, d) SELECT clause
  • Sometimes you may have more than one subqueries in your main query. Try to minimize the number of subquery block in your query. 


                 EXAMPLE:
                             Normal way:
                                SELECT name FROM employee
                  WHERE salary = (SELECT MAX(salary) FROM employee_details) 
                  AND age = (SELECT MAX(age) FROM employee_details) 
                  AND emp_dept = 'Electronics';


                             Optimized way:
                                SELECT name FROM employee 
                  WHERE (salary, age ) = (SELECT MAX (salary), MAX (age) 
                  FROM employee_details) 
                  AND dept = 'Electronics'; 
  • Always use EXISTS instead of IN operator, Generally on large data IN operator works really slow compare to EXISTS. IN is efficient when most of the filter criteria is in the sub-query. EXISTS is efficient when most of the filter criteria is in the main query.
  • Always use UNION ALL instead of UNION if you does not expect unique result. UNION checks for duplicate rows and eliminate it . Hence it takes more time for execution compare to UNION ALL
  • If you want to re-use the query, create a STORE PROCEDURE for the same.
  • Check if there is at-least 30% HHD is empty – it improves the performance a bit
  • Remove any unnecessary joins from table

IDENTITY in SQL

                          IDENTITY in SQL

I have often came across many questions on IDENTITY like 'What is IDENTITY?', 'What is the use of IDENTITY?'  , 'How does IDENTITY help us identify duplicate rows in a table?', etc...

This post will cover the most basic functionality of IDENTITY

What is IDENTITY?
  • The IDENTITY clause specifies that the column is an IDENTITY column
  • If the column is set as IDENTITY column than it will be filled with an auto-incremented number and no other data can be inserted  into that column
  • IDENTITY is also known as 'Auto-Increment'
  • IDENTITY is used with CREATE and ALTER statements
  • It supports only int, bigint, smallint, tinyint, or decimal / numeric with a scale of 0
Syntax:
IDENTITY [ ( seed , increment ) ]
Arguments
seed
Is the value that is used for the very first row loaded into the table.
increment
Is the incremental value that is added to the identity value of the previous row that was loaded.
You must specify both the seed and increment or neither. If neither is specified, the default is (1,1).
Example:

CREATE TABLE new_employees( 
id_num int IDENTITY(1,1),
fname varchar (20),
minit char(1),
lname varchar(30))


Alter Table new_employees
add id_num bigint identity(1,1)




What is the use of IDENTITY?
                       
                  The most common use of IDENTITY is to set PRIMARY Key column or UNIQUE key column as IDENTITY column which lets you provide a simple unique row identifier without the need to maintain it yourself. SQL Server does the work for you.

                   
How does IDENTITY help us identify duplicate rows in a table?

                               We can use IDENTITY to identify duplicate rows in a table. Now we will go through an example to know how it works. Suppose we have a table named 'new_employees' and the data in the table is as in the below screen shot.


Now if you observe there is multiple entry made for 'XYZ'  named employee in the table. Now you come to know that there is duplicate entry in the table and here you can identify the id for this row and type a simple DELETE Command to delete row based on 'id_num'. But suppose there is duplicate entry for more than 1,000 employees, in this case it will not be possible to to identify 'id_num' manually.

In this case, we will simply write a sub query to get the desired results.

Sub query:

select * from new_employees where id_num in (select MAX(id_num) from new_employees group by fname)

The output of the query is as below



If you see the latest entry for the employee 'XYZ'  is selected and the old entry is not considered.

To convert above SELECT statement into DELETE command to delete the old entry we simply need to do the below highlighted changes.

delete from new_employees where id_num not in (select MAX(id_num) from new_employees group by fname)