MSBI LIVE: SSIS

Showing posts with label SSIS. Show all posts

Thursday, 1 May 2014

SSIS Interview Question - Part 4

What is a Task?

A task is very much like a method of any programming language which represents or carries out an individual unit of work. There are broadly two categories of tasks in SSIS, Control Flow tasks and Database Maintenance tasks. All Control Flow tasks are operational in nature except Data Flow tasks. Although there are around 30 control flow tasks which you can use in your package you can also develop your own custom tasks with your choice of .NET programming language.

What is a Precedence Constraint and what types of Precedence Constraint are there?

SSIS allows you to place as many as tasks you want to be placed in control flow. You can connect all these tasks using connectors called Precedence Constraints. Precedence Constraints allow you to define the logical sequence of tasks in the order they should be executed. You can also specify a condition to be evaluated before the next task in the flow is executed.
These are the types of precedence constraints and the condition could be either a constraint, an expression or both

Success (next task will be executed only when the last task completed successfully) or
Failure (next task will be executed only when the last task failed) or
Complete (next task will be executed no matter the last task was completed or failed).

What is a container and how many types of containers are there?

A container is a logical grouping of tasks which allows you to manage the scope of the tasks together.
These are the types of containers in SSIS:

Sequence Container - Used for grouping logically related tasks together
For Loop Container - Used when you want to have repeating flow in package
For Each Loop Container - Used for enumerating each object in a collection; for example a record set or a list of files.

Apart from the above mentioned containers, there is one more container called the Task Host Container which is not visible from the IDE, but every task is contained in it (the default container for all the tasks).

What are variables and what is variable scope?

A variable is used to store values. There are basically two types of variables, System Variable (like ErrorCode, ErrorDescription, PackageName etc) whose values you can use but cannot change and User Variable which you create, assign values and read as needed. A variable can hold a value of the data type you have chosen when you defined the variable.
Variables can have a different scope depending on where it was defined. For example you can have package level variables which are accessible to all the tasks in the package and there could also be container level variables which are accessible only to those tasks that are within the container.

How many difference source and destinations have you used?

It is very common to get all kinds of sources so the more the person worked with the better for you. Common ones are SQL Server, CSV/TXT, Flat Files, Excel, Access, Oracle, MySQL but also Salesforce, web data scrapping.

What configuration options have you used?

This is an important one. Configuration should always be dynamic and usually is done using XML and/or Environment Variable and SQL Table with all configurations.

How do you apply business rules in SSIS (Transformations….Specific calculations but also cleansing)?

Some people use SSIS only to extract data and then go with stored procedures only….they are usually missing the point of the power of SSIS. Which allows to create "a flow" and on each step applies certain rules this greatly simplifies the ETL process and simplicity is very good.

How to quickly load data into sql server table?

Fast Load option. This option is not set by default so most developers know this answer as otherwise the load is very slow.

Give example of handling data quality issues?

Data Quality is almost always a problem and SSIS handles it very well. Examples include importing customers from different sources where customer name can be duplicates. For instance you can have as company name: SQL Server Business Intelligence but also SQL Server BI or SQL Server BI LTD or SQL Server BI Limited or intelligence (with one l). There are different ways to handle it. Robust and time consuming is to create a table with or possible scenarios and update it after each update. You can also use fuzzy grouping which is usually easy to implement and will make usually very good decisions but it is not 100% accurate so this approach has to be justified. Other typical quality issues are nulls (missing values), outliers (dates like 2999 or types like 50000 instead of 5000 especially important if someone is adjusting the value to get bigger bonus), incorrect addresses and these are either corrected during ETL, ignored, re-directed for further manual updates or it fails the packages which for big processes is usually not practised.

When to use Stored Procedures?

This was one of the requested question in comment (at the bottom of the page). This one is very important but also tricky. ALL SSIS developers have SQL Server background and that is sometime not very good if they use SQL not SSIS approach.

Let's start with when you typically use SPs. This is for preparing tables (truncate), audit tasks (usually part of SSIS framework), getting configuration values for loops and a few other general tasks.
During ETL extract you usually type simple SQL because it comes from other sources and usually over complication is not a good choice (make it dynamic) because any changes usually affect the package which has to be updated as well.
During Transformation phase (business rules, cleaning, core work) you should use Transformation tasks not Stored procedures! There are loads of tasks that make the package much easier to develop but also a very important reason is readability which is very important for other people who need to change the package and obviously it reduces risks of making errors. Performance is usually very good with SSIS as it is memory/flow based approach. So when to use Stored Procedures for transformations? If you don't have strong SSIS developers or you have performance reasons to do it. In some cases SPs can be much faster (usually it only applies to very large datasets). Most important is have reasons which approach is better for the situation.

What is your approach for ETL with data warehouses (how many packages you developer during typical load etc.)?

This is rather generic question. A typical approach (for me) when building ETL is to. Have a package to extract data per source with extract specific transformations (lookups, business rules, cleaning) and loads data into staging table. Then a package do a simple merge from staging to data warehouse (Stored Procedure) or a package that takes data from staging and performs extra work before loading to data warehouse. I prefer the first one and due to this approach I occasionally consider having extract stage (as well as stage phase) which gives me more flexibility with transformation (per source) and makes it simpler to follow (not everything in one go). So to summarize you usually have package per source and one package per data warehouse table destination. There are might be other approach valid as well so ask for reasons.

SSIS Interview Question - Part 3

What is SQL Server Integration Services (SSIS)?

SQL Server Integration Services (SSIS) is component of SQL Server 2005 and later versions. SSIS is an enterprise scale ETL (Extraction, Transformation and Load) tool which allows you to develop data integration and workflow solutions. Apart from data integration, SSIS can be used to define workflows to automate updating multi-dimensional cubes and automating maintenance tasks for SQL Server databases.
This is a platform for Integration and Workflow applications. It is known for a fast and flexible OLTP and OLAP extensions used for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases and multidimensional data sets.

How does SSIS differ from DTS?

SSIS is a successor to DTS (Data Transformation Services) and has been completely re-written from scratch to overcome the limitations of DTS which was available in SQL Server 2000 and earlier versions. A significant improvement is the segregation of the control/work flow from the data flow and the ability to use a buffer/memory oriented architecture for data flows and transformations which improve performance.

Data Transformation Services	SQL Server Integration Services
Limited Error Handling	Complex and powerful Error Handling
Message Boxes in ActiveX Scripts	Message Boxes in .NET Scripting
No Deployment Wizard	Interactive Deployment Wizard
Limited Set of Transformation	Good number of Transformations
NO BI functionality	Complete BI Integration

What is a workflow in SSIS ?

Workflow is a set of instructions on to specify the Program Executor on how to execute tasks and containers within SSIS Packages.

What is the Control Flow?

When you start working with SSIS, you first create a package which is nothing but a collection of tasks or package components. The control flow allows you to order the workflow, so you can ensure tasks/components get executed in the appropriate order.

What is the Data Flow Engine?

The Data Flow Engine, also called the SSIS pipeline engine, is responsible for managing the flow of data from the source to the destination and performing transformations (lookups, data cleansing etc.). Data flow uses memory oriented architecture, called buffers, during the data flow and transformations which allows it to execute extremely fast. This means the SSIS pipeline engine pulls data from the source, stores it in buffers (in-memory), does the requested transformations in the buffers and writes to the destination. The benefit is that it provides the fastest transformation as it happens in memory and we don't need to stage the data for transformations in most cases.

What is a Transformation?

A transformation simply means bringing in the data in a desired format. For example you are pulling data from the source and want to ensure only distinct records are written to the destination, so duplicates are removed. Anther example is if you have master/reference data and want to pull only related data from the source and hence you need some sort of lookup. There are around 30 transformation tasks available and this can be extended further with custom built tasks if needed.

SSAS (SQL Server Analysis Services) Interview Questions - Part 3

Q: How do you extract first tuple from the set?

Use could usefunction Set.Item(0)
Example:

SELECT {{[Date].[Calendar].[Calendar Year].Members
}.Item(0)}
ON 0
FROM [Adventure Works]

Q: How can I setup default dimension member in Calculation script?

You can use ALTER CUBE statement. Syntax:
ALTER CUBE CurrentCube | YourCubeName UPDATE DIMENSION , DEFAULT_MEMBER='';

SSAS (SQL Server Analysis Services) Interview Questions - Part 2

SSAS (SQL Server Analysis Services) Interview Questions - Part 2

Q: What is the difference between a derived measure and a calculated measure?
The difference between a derived measure and a calculated measure is when the calculation is performed. A derived measure is calculated before aggregations are created, and the values of the derived measure are stored in the cube. A calculated measure is calculated after aggregations are created, and the values of a calculated measure aren’t stored in the cube. The primary criterion for choosing between a derived measure and a calculated measure is not efficiency, but accuracy.

Q: What is a partition?
A partition in Analysis Services is the physical location of stored cube data. Every cube has at least one partition by default. Each time we create a measure group, another partition is created. Queries run faster against a partitioned cube because Analysis Services only needs to read data from the partitions that contain the answers to the queries. Queries run even faster when partition also stores aggregations, the pre calculated totals for additive measures. Partitions are a powerful and flexible means of managing cubes, especially large cubes.

Q: While creating a new calculated member in a cube what is the use of property
called non-empty behavior?
Nonempty behavior is important property for ratio calculations. If the denominator Is empty, an MDX expression will return an error just as it would if the denominator Were equal to zero. By selecting one or more measures for the Non-Empty Behavior property, we are establishing a requirement that each selected measure first be evaluated before the calculation expression is evaluated. If each selected measure is empty, then The expression is also treated as empty and no error is returned.

Q: What is a RAGGED hierarchy?
Under normal circumstances, each level in a hierarchy in Microsoft SQL Server 2005 Analysis Services (SSAS) has the same number of members above it as any other member at the same level. In a ragged hierarchy, the logical parent member of at least one member is not in the level immediately above the member. When this occurs, the hierarchy descends to different levels for different drilldown paths. Expanding through every level for every drilldown path is then unnecessarily complicated.

Q: What are the roles of an Analysis Services Information Worker?

The role of an Analysis Services information worker is the traditional "domain expert" role in business intelligence (BI) someone who understands the data employed by a solution and is able to translate the data into business information. The role of an Analysis Services information worker often has one of the following job titles: Business Analyst (Report Consumer), Manager (Report Consumer), Technical Trainer, Help Desk/Operation, or Network Administrator.

Q: What are the different ways of creating Aggregations?
We can create aggregations for faster MDX statements using Aggregation Wizard or thru UBO – Usage Based Optimizations. Always, prefer UBO method in realtime performance troubleshooting.

Q: What is WriteBack? What are the pre-conditions?
The Enable/Disable Writeback dialog box enables or disables writeback for a measure group in a cube. Enabling writeback on a measure group defines a writeback partition and creates a writeback table for that measure group. Disabling writeback on a measure group removes the writeback partition but does not delete the writeback table, to avoid unanticipated data loss.

Q: What is processing?
Processing is a critical and resource intensive operation in the data warehouse lifecycle and needs to be carefully optimized and executed. Analysis Services 2005 offers a high performance and scalable processing architecture with a comprehensive set of controls for database administrators.
We can process an OLAP database, individual cube, Dimension or a specific Partition in a cube.

Q: Name few Business Analysis Enhancements for SSAS?
The following table lists the business intelligence enhancements that are available in Microsoft SQL Server 2005 Analysis Services (SSAS). The table also shows the cube or dimension to which each business intelligence enhancement applies, and indicates whether an enhancement can be applied to an object that was created without using a data source and for which no schema has been generated.

*Enhancement*	*Type*	*Applied to*	No data source
Time Intelligence	Cube	Cube	No
Account Intelligence	Dimension	Dimension or cube	No
Dimension Intelligence	Dimension	Dimension or cube	Yes
Custom Aggregation	Dimension	Dimension (unary operator) or cube	No
Semiadditive Behavior	Cube	Cube	Yes>
Custom Member Formula	Dimension	Dimension or cube	No
Custom Sorting and Uniqueness Settings	Dimension	Dimension or cube	Yes
Dimension Writeback	Dimension	Dimension or cube	Yes

Q: What MDX functions do you most commonly use?
This is a great question because you only know this answer by experience. If you ask me this question, the answer practically rushes out of me. “CrossJoin, Descendants, and NonEmpty, in addition to Sum, Count, and Aggregate. My personal favorite is CrossJoin because it allows me identify non-contiguous slices of the cube and aggregate even though those cube cells don’t roll up to a natural ancestor.” Indeed, CrossJoin has easily been my bread and butter.

Q: Where do you put calculated members?
The reflexive answer is “in the Measures dimension” but this is the obvious answer. So I always follow up with another question. “If you want to create a calculated member that intersects all measures, where do you put it?” A high percentage of candidates can’t answer this question, and the answer is “In a dimension other than Measures.” If they can answer it, I immediately ask them why. The answer is “Because a member in a dimension cannot intersect its own relatives in that dimension.”

Q: How do I find the bottom 10 customers with the lowest sales in 2003 that were not null?

A: Simply using bottomcount will return customers with null sales. You will have to combine it with NONEMPTY or FILTER.

SELECT { [Measures].[Internet Sales Amount] } ON COLUMNS ,
BOTTOMCOUNT(
NONEMPTY(DESCENDANTS( [Customer].[Customer Geography].[All Customers]
, [Customer].[Customer Geography].[Customer] )
, ( [Measures].[Internet Sales Amount] ) )
, 10
, ( [Measures].[Internet Sales Amount] )
) ON ROWS
FROM [Adventure Works]
WHERE ( [Date].[Calendar].[Calendar Year].&[2003] ) ;

Q: How in MDX query can I get top 3 sales years based on order quantity?

By default Analysis Services returns members in an order specified during attribute design. Attribute properties that define ordering are "OrderBy" and "OrderByAttribute". Lets say we want to see order counts for each year. In Adventure Works MDX query would be:

SELECT {[Measures].[Reseller Order Quantity]} ON 0
, [Date].[Calendar].[Calendar Year].Members ON 1
FROM [Adventure Works];

Same query using TopCount:
SELECT
{[Measures].[Reseller Order Quantity]} ON 0,
TopCount([Date].[Calendar].[Calendar Year].Members,3, [Measures].[Reseller Order Quantity]) ON 1
FROM [Adventure Works];

SSAS (SQL Server Analysis Services) Interview Questions - Part 1

SSAS - SQL Server Analysis Services

Q: What is Analysis Services? List out the features?
Microsoft SQL Server 2005 Analysis Services (SSAS) delivers online analytical processing (OLAP) and data mining functionality for business intelligence applications. Analysis Services supports OLAP by letting we design, create, and manage multidimensional structures that contain data aggregated from other data sources, such as relational databases. For data mining applications, Analysis Services lets we design, create, and visualize data mining models that are constructed from other data sources by using a wide variety of industry-standard
data mining algorithms.
Analysis Services is a middle tier server for analytical processing, OLAP, and Data mining. It manages multidimensional cubes of data and provides access to heaps of information including aggregation of data. One can create data mining models from data sources and use it for Business Intelligence also including reporting features.
Analysis service provides a combined view of the data used in OLAP or Data mining. Services here refer to OLAP, Data mining. Analysis services assists in creating, designing and managing multidimensional structures containing data from varied sources. It provides a wide array of data mining algorithms for specific trends and needs.

Some of the key features are:

Ease of use with a lot of wizards and designers.
Flexible data model creation and management
Scalable architecture to handle OLAP
Provides integration of administration tools, data sources, security, caching, and reporting etc.
Provides extensive support for custom applications

Q: What is UDM? Its significance in SSAS?
The role of a Unified Dimensional Model (UDM) is to provide a bridge between the user and the data sources. A UDM is constructed over one or more physical data sources, and then the end user issues queries against the UDM using one of a variety of client tools, such as Microsoft Excel. At a minimum, when the UDM is constructed merely as a thin layer over the data source, the advantages to the end user are a simpler, more readily understood model of the data, isolation from heterogeneous backend data sources, and improved performance for summary type queries. In some scenarios a simple UDM like this is constructed totally automatically. With greater investment in the construction of the UDM, additional benefits accrue from the richness of metadata that the model can provide.

The UDM provides the following benefits:
• Allows the user model to be greatly enriched.
• Provides high performance queries supporting interactive analysis, even over huge data volumes.
• Allows business rules to be captured in the model to support richer analysis.

Q: What is the need for SSAS component?

Analysis Services is the only component in SQL Server using which we can perform Analysis and Forecast operations.
SSAS is very easy to use and interactive.
Faster Analysis and Troubleshooting.
Ability to create and manage Data warehouses.
Apply efficient Security Principles.

Q: Explain the TWO-Tier Architecture of SSAS?

SSAS uses both server and client components to supply OLAP and data mining functionality BI Applications.
The server component is implemented as a Microsoft Windows service. Each instance of Analysis Services implemented as a separate instance of the Windows service.
Clients communicate with Analysis Services using the standard the XMLA (XML For Analysis) , protocol for issuing commands and receiving responses, exposed as a web service.

Q: What are the components of SSAS?

An OLAP Engine is used for enabling fast ad hoc queries by end users. A user can interactively explore data by drilling, slicing or pivoting.
Drilling refers to the process of exploring details of the data.
Slicing refers to the process of placing data in rows and columns.
Pivoting refers to switching categories of data between rows and columns.
In OLAP, we will be using what are called as Dimensional Databases.

Q: What is FASMI ?
A database is called a OLAP Database if the database satisfies the FASMI rules :

Fast Analysis– is defined in the OLAP scenario in five seconds or less.
Shared – Must support access to data by many users in the factors of Sensitivity and Write Backs.
Multidimensional – The data inside the OLAP Database must be multidimensional in structure.
Information – The OLAP database Must support large volumes of data..

Q: What languages are used in SSAS ?

Structured Query Language (SQL)
Multidimensional Expressions (MDX) - an industry standard query language orientated towards analysis
Data Mining Extensions (DMX) - an industry standard query language oriented toward data mining.
Analysis Services Scripting Language (ASSL) - used to manage Analysis Services database objects.

Q: How Cubes are implemented in SSAS ?

Cubes are multidimensional models that store data from one or more sources.
Cubes can also store aggregations
SSAS Cubes are created using the Cube Wizard.
We also build Dimensions when creating Cubes.
Cubes can see only the DSV( logical View).