I completed my PhD in 2012. My research area is data warehousing (DW). Specifically, I study the problems of extract, transform, load (ETL) processes, whose development is one of the major tasks in building DW systems. ETL process modeling stresses the need to document the mapping from source attributes to target attributes in order to facilitate the DW transformation process, but it usually fails to offer methods for defining and identifying the corresponding requirements that refer to particular data sources. Moreover, it takes for granted that requirements have already been analyzed and documented according to the organization's business goals. If these requirements have not been analyzed and documented properly, how do we first identify them unambiguously and know when they are completely specified? How can those requirements be used to establish the ETL process specification and ultimately ensure that the transformation processes in ETL meet the real needs of the organization? The problems underlying these questions fall into two groups: i) defining and maintaining the ETL process specification, and ii) semantic heterogeneity among heterogeneous data sources.
Defining and maintaining data transformation specifications is one of the big challenges in enterprise application integration (EAI), enterprise information integration (EII), and DW system integration alike. This is because data transformation is by nature application-driven rather than data-driven. The problem with an application-driven approach is that whenever user requirements or data sources change, the transformation mechanisms, such as filtering, conversion, aggregation, and merging, need to be updated accordingly. Maintaining these transformation processes is error-prone and time-consuming. Because data transformation activities are central to ETL processes, an appropriate requirement model for ETL processes is needed, one that adequately relates the information requirements of the DW to the information actually supplied by the data sources, for successful DW system development. Therefore, a suitable requirement analysis method needs to be developed for DW contents, which are closely tied to the information required by the decision-making process.
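To make the maintenance problem concrete, here is a minimal sketch of what keeping the transformation specification as data, rather than hard-coding it in application logic, might look like. The names `Mapping`, `transform`, and `aggregate`, and the sample attribute names, are hypothetical and chosen only for illustration; they do not come from any particular ETL tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable


def identity(value: Any) -> Any:
    """Default conversion: pass the value through unchanged."""
    return value


@dataclass(frozen=True)
class Mapping:
    """One source-to-target attribute mapping (hypothetical structure)."""
    source: str                          # attribute name in the data source
    target: str                          # attribute name in the DW target
    convert: Callable[[Any], Any] = identity


def transform(rows, mappings, keep=lambda row: True):
    """Filter rows, then rename and convert attributes per the mapping list."""
    return [
        {m.target: m.convert(row[m.source]) for m in mappings}
        for row in rows
        if keep(row)
    ]


def aggregate(rows, key, measure):
    """Sum one measure per key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


# Illustrative source data (assumed, not from a real system).
source_rows = [
    {"cust_id": "C1", "amt": "10.50", "status": "ok"},
    {"cust_id": "C1", "amt": "4.50", "status": "ok"},
    {"cust_id": "C2", "amt": "7.00", "status": "void"},
]

# The specification itself: changing a requirement means editing this list.
SPEC = [
    Mapping("cust_id", "customer_key"),
    Mapping("amt", "sales_amount", float),
]

loaded = transform(source_rows, SPEC, keep=lambda r: r["status"] == "ok")
totals = aggregate(loaded, "customer_key", "sales_amount")
print(loaded)
print(totals)
```

In this sketch, a change in a source attribute name or a conversion rule touches only `SPEC`, not the transformation code, which is the kind of separation a requirement-level ETL specification aims for.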
Since requirement analysis activities define and document what users want as clearly and completely as possible, it is crucial that the reconciliation of data sources and information needs be tackled together. However, this task is complicated by the semantic heterogeneity inherent in data sources. Semantic heterogeneity comprises the problems that occur during data transformation (i.e., from data sources to data targets) and can be categorized as: i) conflicts at the data schema level; and ii) conflicts at the instance level (data values). More generally, three major kinds of heterogeneity have been identified in data integration: syntactic, structural, and semantic. Syntactic heterogeneity refers to differences in how an information item is presented in disparate data sources. Structural heterogeneity refers to the situation where identical information is stored in different structures in disparate data sources, whereas semantic heterogeneity refers to conflicts of meaning between information items, whether in attribute names or in values (instances).
Semantic heterogeneity typically appears as synonyms or homonyms. Synonyms are information items with the same meaning that are referred to by different names; homonyms are the opposite, items with different meanings that share the same name. A well-defined and well-understood semantics of the data sources is essential for semantic reconciliation and for avoiding terminology inconsistencies in information-sharing environments. It avoids confusion over names and definitions and preserves the autonomy of data owners. The semantics of information items can then be accepted wherever data is interchanged across the enterprise, which ultimately eases the execution of ETL processes. Therefore, the ETL process model is guided by the semantics of business requirements, and semantic heterogeneity problems are considered at the requirement modeling level. The conceptual design of ETL processes captures the agreed-upon semantics of business requirements, consistent with the available data sources and the goals of the DW system. Consequently, a requirement engineering model that focuses on the analysis method is explored and presented in order to overcome these problems.
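As an illustration only, the following sketch flags naming conflicts between attributes from different sources, using the usual convention that synonyms are different names for the same concept and homonyms are the same name for different concepts. It assumes each attribute has already been annotated with an agreed business concept; the `Attribute` class, the `classify_conflicts` function, and the sample source names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Attribute:
    source: str    # data source the attribute belongs to
    name: str      # attribute name as it appears in that source
    concept: str   # agreed business meaning (assumed to be annotated by hand)


def classify_conflicts(attributes):
    """Return (kind, a, b) for each cross-source synonym or homonym pair."""
    conflicts = []
    for a, b in combinations(attributes, 2):
        if a.source == b.source:
            continue  # only compare attributes from different sources
        same_name = a.name.lower() == b.name.lower()
        same_concept = a.concept == b.concept
        if same_concept and not same_name:
            conflicts.append(("synonym", a, b))   # different names, same meaning
        elif same_name and not same_concept:
            conflicts.append(("homonym", a, b))   # same name, different meanings
    return conflicts


# Illustrative attributes from two assumed sources, "crm" and "billing".
attrs = [
    Attribute("crm", "client_no", "customer identifier"),
    Attribute("billing", "cust_id", "customer identifier"),
    Attribute("crm", "balance", "loyalty points balance"),
    Attribute("billing", "balance", "outstanding amount owed"),
]

conflicts = classify_conflicts(attrs)
for kind, a, b in conflicts:
    print(f"{kind}: {a.source}.{a.name} vs {b.source}.{b.name}")
```

The hard part in practice is the concept annotation itself, which is exactly what the requirement analysis is meant to supply; the comparison step shown here is trivial once that agreement exists.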
My work was motivated by the research of Alkis Simitsis on modeling and designing ETL processes in DW environments and of Paolo Giorgini on requirements analysis for DW using a goal-oriented approach.