Source data types of the data warehouse

July 20, 2010 by joegh

An integrated enterprise data warehouse can take in almost any data that is useful for analysis and decision support, including, of course, all the data I mentioned in the article on web analytics data sources. The data entering the warehouse falls into three types: structured, semi-structured, and unstructured. After conversion into a unified form it is stored in the warehouse, a process usually called ETL (Extract, Transform, Load). Below I discuss the differences between the three types, which source data each one covers, and the role each plays in site data analysis.

Structured data

This kind of data has a standardized format; the typical example is data in a relational database. It can be stored in two-dimensional tables with a fixed number of fields, each field has a fixed data type (numeric, character, date, and so on), and the byte length of each field is also relatively fixed. Such data is the easiest to manage and maintain, and it is also the most convenient format for querying, displaying, and analyzing.

For a website, structured data generally means the data in the site's internal databases, plus some data obtained through the open interfaces of external databases. This data can be imported into the data warehouse through ETL for integrated management, and then exported with SQL queries as site analysis and data analysis require.
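To make the fixed-field idea concrete, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for the warehouse. The table name, the columns, and the sample rows are assumptions made up for this illustration; a real warehouse would use its own schema and loader.

```python
import sqlite3

# In-memory database standing in for the warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        user_id       INTEGER PRIMARY KEY,  -- numeric field
        username      TEXT NOT NULL,        -- character field
        registered_on TEXT NOT NULL         -- date kept as ISO-8601 text
    )
    """
)

# "Load" step: every row carries the same fixed set of typed fields.
sample_rows = [
    (1, "abc", "2010-05-09"),
    (2, "def", "2010-05-10"),
    (3, "ghi", "2010-05-10"),
]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", sample_rows)

# Export by SQL query: number of registrations per day.
registrations = conn.execute(
    "SELECT registered_on, COUNT(*) FROM users "
    "GROUP BY registered_on ORDER BY registered_on"
).fetchall()
print(registrations)
```

Because every row has the same typed columns, the query needs no parsing step at all, which is what makes this kind of data the easiest to work with.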

Structured data holds a pivotal position in site data analysis. The data stored in the database is generally the site's operational data and the outcome data of user actions, such as the number of registered users or the number of blog posts and comments. For e-commerce sites, orders and sales data are stored directly in the database. KPIs calculated from this data, such as total profit, average profit per order, and profit generated per user, directly show whether the site's objectives are being achieved.
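The KPIs named above can be computed with plain SQL aggregates. The following is only a sketch: the orders table, its revenue and cost columns, and the figures are invented for the example, and profit is assumed to be revenue minus cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, user_id INTEGER, revenue REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, 100.0, 60.0),  # profit 40
        (2, 10, 50.0, 30.0),   # profit 20
        (3, 11, 80.0, 50.0),   # profit 30
        (4, 12, 40.0, 10.0),   # profit 30
    ],
)

# Total profit, average profit per order, and profit per distinct user.
total_profit, profit_per_order, profit_per_user = conn.execute(
    """
    SELECT SUM(revenue - cost),
           AVG(revenue - cost),
           SUM(revenue - cost) / COUNT(DISTINCT user_id)
    FROM orders
    """
).fetchone()
print(total_profit, profit_per_order, profit_per_user)
```

With four orders from three users, the profit per user differs from the profit per order, which is why the two are tracked as separate KPIs.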

Semi-structured data

Semi-structured data follows a format specification and is generally plain text, so each record can be parsed in some way. The most common examples are log data and data in XML or JSON format. Each record may follow a predefined specification, but the information it contains can vary: records may have a different number of fields, different field names or field types, or nested structures. Because such data is generally output as plain text, it is fairly convenient to manage and maintain, but to use it (to access, query, or analyze it) the data first has to be parsed according to its format.

On a website, semi-structured data is usually log data, or data output in XML or JSON format to meet some requirement. The most common is the Apache log, in which the values are written in the order of the predefined fields:

72.14.192.1 - [09/May/2010:03:35:02 +0800] "GET / HTTP/1.1" 200 13726 "-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US),gzip(gfe) (via translate.google.com)"
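As a rough illustration of how such a line can be split into fields, here is a minimal Python sketch. The regular expression and the field names (host, time, method, and so on) are my own assumptions for this example, not a description of any particular log configuration; the sample line is a tidied copy of the one above.

```python
import re
from datetime import datetime

# Assumed layout: host, optional identity fields, [timestamp], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) '
    r'.*?'                                   # skip identity fields such as "-"
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


SAMPLE = (
    '72.14.192.1 - [09/May/2010:03:35:02 +0800] "GET / HTTP/1.1" 200 13726 '
    '"-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US),gzip(gfe) '
    '(via translate.google.com)"'
)
record = parse_line(SAMPLE)
print(record["host"], record["time"].isoformat(), record["status"], record["size"])
```

Returning None for a line that does not match, rather than raising, lets the caller decide whether to skip or log the bad line.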

JSON, by contrast, outputs data as key/value pairs:

{"time": 1234567890, "action": "comment", "respond": true, "user": {"userid": 1, "username": "abc"}}
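A nested record like the one above, written here as strict JSON with quoted keys, can be turned into a flat row of columns by walking it layer by layer. This is only a sketch: the flatten helper and the dotted column names are assumptions chosen for the example.

```python
import json

raw = (
    '{"time": 1234567890, "action": "comment", "respond": true, '
    '"user": {"userid": 1, "username": "abc"}}'
)


def flatten(obj, prefix=""):
    """Recursively turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Descend one layer and prefix the child keys with the parent name.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw))
print(row)
```

The nested user object becomes the columns user.userid and user.username, which map directly onto fields of a two-dimensional table.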

For Apache log data, we can split each line as needed, separate out the useful values, and import them into the data warehouse. For XML and JSON data, we can use various string-parsing methods to obtain values by tag or key name, traversing nested structures layer by layer, and again select the data that is useful for analysis to load into the warehouse. In this process the Transform part of ETL becomes more complex because of the format parsing, and this step directly affects the stability and robustness of the ETL. Another troublesome problem is how to format and store the data: it may be necessary to create custom field types, or to choose a NoSQL database. Discussion of NoSQL databases is in full swing; from Google's Bigtable and Amazon's Dynamo to Facebook's Cassandra, NoSQL databases offer new solutions for scalability and mass data storage in web data management.
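One way to keep the Transform step from failing on a single bad record is to collect rejected lines instead of raising. The sketch below assumes one JSON object per input line; the REQUIRED_FIELDS tuple and the reject messages are hypothetical choices made for illustration.

```python
import json

# Hypothetical set of fields the warehouse table expects.
REQUIRED_FIELDS = ("time", "action")


def transform(lines):
    """Parse JSON lines into rows; return (rows, rejects) without raising."""
    rows, rejects = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejects.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            rejects.append((line_no, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            rejects.append((line_no, "missing fields: " + ", ".join(missing)))
            continue
        # Keep only the fields the warehouse needs.
        rows.append({name: record[name] for name in REQUIRED_FIELDS})
    return rows, rejects


raw_lines = [
    '{"time": 1234567890, "action": "comment", "respond": true}',
    '{"time": 1234567891}',                  # a required field is absent
    '{time: 1234567892, action: "view"}',    # unquoted keys: not valid JSON
]
rows, rejects = transform(raw_lines)
print(rows)
print(rejects)
```

Keeping the line number with each reject makes it possible to audit how much data a load dropped and why.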

Semi-structured data is also very important for site data analysis. A site's clickstream logs and user behavior data are generally output as semi-structured data, and this data is essential whenever we need to compute the various metrics used in website analysis or user behavior analysis.
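As an example of the kind of metric this enables, the sketch below counts daily page views and unique visitors from already-parsed log records. The record layout and the sample hits are invented, and treating each distinct host as one visitor is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical parsed clickstream records.
hits = [
    {"host": "72.14.192.1", "day": "2010-05-09", "path": "/", "status": 200},
    {"host": "72.14.192.1", "day": "2010-05-09", "path": "/about", "status": 200},
    {"host": "10.0.0.7", "day": "2010-05-09", "path": "/", "status": 200},
    {"host": "10.0.0.7", "day": "2010-05-09", "path": "/missing", "status": 404},
    {"host": "10.0.0.7", "day": "2010-05-10", "path": "/", "status": 200},
]


def daily_metrics(records):
    """Count successful page views and distinct hosts for each day."""
    page_views = defaultdict(int)
    visitors = defaultdict(set)
    for hit in records:
        if hit["status"] != 200:
            continue  # ignore failed requests
        page_views[hit["day"]] += 1
        visitors[hit["day"]].add(hit["host"])
    return {
        day: {
            "page_views": page_views[day],
            "unique_visitors": len(visitors[day]),
        }
        for day in sorted(page_views)
    }


metrics = daily_metrics(hits)
print(metrics)
```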

Unstructured data

Unstructured data is data that is not plain text, has no standard format, and cannot be parsed directly into values. Common examples are rich-text documents, web pages, and multimedia (images, sound, video, and so on). Such data is hard to collect and manage and cannot be queried or analyzed directly, so it calls for a different approach.

Unless advanced text mining or multimedia data mining is required, rich text, images, sound, and video have little analytical value in themselves for everyday statistics and analysis. Unstructured data is therefore generally not put into the data warehouse directly in binary form. Inmon, the father of the data warehouse, suggests storing only the metadata of unstructured data in the warehouse, that is, the data that describes the data. So we generally keep the unstructured data in a file system and record in the data warehouse the information needed to index and locate it quickly, such as the title, abstract, author, creation time, and last-modified time of a Word document; for pictures this may also include the pixel dimensions and resolution. These are like the items you see under the Details tab when you right-click a file and open its properties. Recording them in a standard form helps to search for and find the corresponding unstructured data quickly, and the records can also be used for statistics and analysis. In effect, each piece of unstructured data is given labels, and the label information is recorded in the data warehouse.
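The idea of keeping the file in the file system and only its metadata in the warehouse can be sketched as follows. This is an assumption-laden example: the file_metadata table, its columns, and the temporary report.doc file are all made up, and only the attributes available from os.stat are recorded (title, author, and similar fields would need a format-specific reader).

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Collect basic descriptive fields for one file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "path": os.path.abspath(path),
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
    }


# In-memory database standing in for the warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata ("
    "path TEXT PRIMARY KEY, name TEXT, size_bytes INTEGER, modified_utc TEXT)"
)

with tempfile.TemporaryDirectory() as folder:
    # Create a stand-in "document"; the binary content never enters the warehouse.
    sample = os.path.join(folder, "report.doc")
    with open(sample, "wb") as handle:
        handle.write(b"\x00" * 2048)
    conn.execute(
        "INSERT INTO file_metadata "
        "VALUES (:path, :name, :size_bytes, :modified_utc)",
        file_metadata(sample),
    )

# The metadata can now be searched without touching the file itself.
found = conn.execute(
    "SELECT name, size_bytes FROM file_metadata WHERE name LIKE ?",
    ("report%",),
).fetchall()
print(found)
```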

For most sites, unless it is used for advanced data mining, this kind of unstructured data contributes little to statistical analysis most of the time. For certain sites, though, such as picture and video sites, it is crucial. There the pictures and videos are the site's products, and the recorded metadata is the detailed information about those products; product analysis, product segmentation, and so on all depend on it. Similarly, for a company's internal document archive, recording file information uniformly in the data warehouse makes it possible to find the needed files quickly, which is very effective for unified, integrated information management.

As the Internet continues to develop and information of all kinds keeps expanding, new data types will keep emerging, and the data warehouse's role in integrating, processing, and managing all types of data will continue to be improved and optimized.


» This article is released under the BY-NC-SA license. When reproducing it, please credit the source: website data analysis » "Source data types of the data warehouse"

10 comments

  1. Aibei Fu said:

    Bump. Wow, a data warehouse...

  2. bookcold said:

    You mentioned NoSQL. In fact I am also curious about how workable it is to build a warehouse on a non-relational database, I

  3. joegh said:

    @bookcold: The advantage of NoSQL is that it breaks through the limitation of the two-dimensional table model of traditional databases, so it can store data of various structures; multi-node parallel processing also improves data computing capacity. Because I have not worked with NoSQL myself, I cannot say whether it can ultimately be applied to a data warehouse, but it can to some extent address the troublesome problem of differing underlying data structures.

  4. I wonder whether the blogger has looked into data mining, and into the relationship between site analysis and data mining. I recommend "click-stream data warehouse", which is great; I also found "web data mining - customer data into customer value" very good.
