ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most commonly used in the context of data warehouses, but it is not limited to them.
ETL is an important part of building a data warehouse. Users extract the required data from the data source, clean it, and finally load it into the data warehouse according to the predefined data warehouse model.
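The three stages can be sketched as a minimal Python pipeline. This is only an illustration under stated assumptions: the CSV data, the `customers` table, and the function names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be a file, API, or operational database.
SOURCE_CSV = """id,name,signup_date
1, Alice ,2021-03-05
2,Bob,2021-04-17
3,,2021-05-01
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean rows before loading (trim names, drop rows without a name)."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if name:
            cleaned.append((int(row["id"]), name, row["signup_date"]))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a table that follows a predefined model."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order; the transformation happens outside the database."""
    load(conn, transform(extract(SOURCE_CSV)))


conn = sqlite3.connect(":memory:")  # stand-in for the target data warehouse
run_etl(conn)
print(conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall())
```

Note that the cleaning happens in the Python process, before any data reaches the target database.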
The transformation step of ETL mainly covers the following aspects:
Null value processing: null field values can be captured and then either loaded as-is or replaced with data that carries another meaning. Records can also be routed to different target databases depending on whether a field is null.
Standardized data format: format constraints can be defined per field, and the load format can be customized for time, numeric, character, and other data in the data source.
Split data: fields can be split according to business requirements. For example, the calling number 861082585313-8148 can be split into an area code and a telephone number.
Verify data correctness: lookup and split functions can be used for data verification. For example, once the calling number 861082585313-8148 has been split into area code and telephone number, a lookup can return the calling area recorded by the calling gateway or switch for verification.
Data replacement: invalid and missing data can be replaced according to business rules.
Lookup: find missing data. A lookup performs a subquery and returns missing fields obtained from other sources to ensure field integrity.
Establish the primary and foreign key constraints of the ETL process: illegal data that lacks its required dependency can be replaced or exported to an error data file, ensuring that only records with a unique primary key are loaded.
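Several of these transformations can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the number layout (country code 86, then area code, then subscriber number, then an extension after the hyphen), the `AREA_CODES` reference table, and all function names are hypothetical. A real pipeline would take its lookup data from the gateway or switch records.

```python
# Hypothetical reference table mapping area codes to calling areas.
# A real pipeline would look this up from gateway or switch records.
AREA_CODES = {"10": "Beijing", "21": "Shanghai"}


def split_calling_number(raw, country_code="86"):
    """Split data: break a calling number into area code, phone number, and extension.

    Assumes the layout <country code><area code><number>-<extension>.
    """
    number, _, extension = raw.partition("-")
    if not number.startswith(country_code):
        raise ValueError(f"unexpected country code in {raw!r}")
    national = number[len(country_code):]
    for length in (2, 3):  # assumed possible area-code lengths
        area = national[:length]
        if area in AREA_CODES:
            return {
                "area_code": area,
                "phone": national[length:],
                "extension": extension or None,
            }
    raise ValueError(f"unknown area code in {raw!r}")


def verify_calling_area(parts, recorded_area):
    """Verify data: compare the looked-up area with the area recorded upstream."""
    return AREA_CODES.get(parts["area_code"]) == recorded_area


def route_by_null(records, field, default=None):
    """Null handling: fill in a default if given, otherwise route null records aside."""
    complete, incomplete = [], []
    for record in records:
        if record.get(field) is not None:
            complete.append(record)
        elif default is not None:
            complete.append({**record, field: default})
        else:
            incomplete.append(record)
    return complete, incomplete


def enforce_keys(rows, pk, fk, parent_keys):
    """Key constraints: reject duplicate primary keys and foreign keys with no parent."""
    loaded, rejected, seen = [], [], set()
    for row in rows:
        if row[pk] in seen or row[fk] not in parent_keys:
            rejected.append(row)  # in practice, exported to an error data file
        else:
            seen.add(row[pk])
            loaded.append(row)
    return loaded, rejected


parts = split_calling_number("861082585313-8148")
print(parts)
print(verify_calling_area(parts, "Beijing"))
```

Each function returns the rejected or incomplete records instead of silently dropping them, so they can be routed to a separate target or an error file as described above.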
Advantages of ETL architecture:
ETL can take load off the database system, because it runs on a separate hardware server.
Compared with the ELT architecture, ETL can implement more complex data conversion logic.
ETL uses a separate hardware server.
ETL is independent of how the underlying database stores its data.
In the ELT architecture, the ELT tool is only responsible for providing a graphical interface in which to design business rules. All data processing flows between the source and target databases, and the tool coordinates the database systems involved to execute the relevant applications. The processing can be executed either on the source database side or on the target data warehouse side, depending mainly on the system's architecture design and the attributes of the data. When the efficiency of the ETL process needs to be improved, this can be achieved by tuning the relevant database or by changing the server that executes the processing. Database vendors generally promote this kind of architecture vigorously; Oracle and Teradata, for example, both advocate the ELT architecture.
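A minimal sketch of the ELT idea, in which raw data is loaded first and the transformation is then executed by the database engine itself. The table names `staging_customers` and `customers`, the sample rows, and the function names are hypothetical, and an in-memory SQLite database merely stands in for a real target data warehouse.

```python
import sqlite3

# Hypothetical raw rows, loaded unchanged into a staging table.
RAW_ROWS = [
    ("1", " Alice ", "2021-03-05"),
    ("2", "Bob", "2021-04-17"),
    ("3", None, "2021-05-01"),
]


def load_raw(conn, rows):
    """Load: copy the source rows into the target database without changing them."""
    conn.execute(
        "CREATE TABLE staging_customers (id TEXT, name TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO staging_customers VALUES (?, ?, ?)", rows)
    conn.commit()


def transform_in_database(conn):
    """Transform: the database engine does the cleaning in a single SQL statement."""
    conn.execute(
        "CREATE TABLE customers "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_date TEXT)"
    )
    conn.execute(
        """
        INSERT INTO customers (id, name, signup_date)
        SELECT CAST(id AS INTEGER), TRIM(name), signup_date
        FROM staging_customers
        WHERE name IS NOT NULL AND TRIM(name) <> ''
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the target data warehouse
load_raw(conn, RAW_ROWS)
transform_in_database(conn)
print(conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall())
```

Because the transformation is plain SQL running inside the database, speeding it up is a matter of tuning that database or moving it to a more capable server; the Python code only coordinates the steps.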
Advantages of ELT architecture:
ELT mainly relies on the database engine for system scalability (especially when data processing runs at night, it can make full use of the database engine's resources).
ELT keeps all data in the database at all times, avoiding data export and reload, which ensures efficiency and improves the monitorability of the system.
ELT can optimize parallel processing according to how the data is distributed, and can use the database's built-in functions to optimize disk I/O.
The scalability of ELT depends on the scalability of the database engine and its hardware server.
By tuning the performance of the relevant databases, it is generally not particularly difficult to achieve 3 to 4 times the efficiency of the ETL process.