Welcome to our book review site go-pdf.online!

Data Profiling
  • Language: en
  • Pages: 156

Data Profiling

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More c...
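
As a minimal, illustrative sketch of the simple metadata described above (not the book's tooling), the following Python snippet computes per-column row counts, null/empty counts, distinct counts, and a small value distribution; the input file name is hypothetical.

    import csv
    from collections import Counter

    def profile(rows):
        """Compute simple per-column metadata for a list of dicts."""
        stats = {}
        for col in (rows[0].keys() if rows else []):
            values = [r[col] for r in rows]
            non_null = [v for v in values if v not in (None, "")]
            stats[col] = {
                "rows": len(values),
                "nulls_or_empty": len(values) - len(non_null),
                "distinct": len(set(non_null)),
                "top_values": Counter(non_null).most_common(3),
            }
        return stats

    # Hypothetical input file; any CSV with a header row works.
    with open("customers.csv", newline="") as f:
        print(profile(list(csv.DictReader(f))))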

An Introduction to Duplicate Detection
  • Language: en
  • Pages: 77

An Introduction to Duplicate Detection

With the ever-increasing volume of data, data quality problems abound. Multiple yet different representations of the same real-world objects in the data, so-called duplicates, are one of the most intriguing data quality problems. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Automatically detecting duplicates is difficult: first, duplicate representations are usually not identical but differ slightly in their values; second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture...
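
As a toy illustration of the two difficulties mentioned above (the blocking key, similarity measure, and threshold are assumptions for this sketch, not the lecture's methods), the snippet below groups records by a blocking key to avoid comparing all pairs, then scores the remaining pairs with a string-similarity measure.

    from difflib import SequenceMatcher
    from itertools import combinations
    from collections import defaultdict

    records = [  # hypothetical customer records with slightly differing values
        {"id": 1, "name": "Jon Smith",  "city": "Berlin"},
        {"id": 2, "name": "John Smith", "city": "Berlin"},
        {"id": 3, "name": "Mary Major", "city": "Potsdam"},
    ]

    def blocking_key(r):
        # Assumed key: first letter of the name plus the city.
        return (r["name"][:1].lower(), r["city"].lower())

    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)

    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if score > 0.8:  # assumed threshold
                print("candidate duplicate:", a["id"], b["id"], round(score, 2))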

Efficient and Exact Computation of Inclusion Dependencies for Data Integration
  • Language: en
  • Pages: 46

Efficient and Exact Computation of Inclusion Dependencies for Data Integration

Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. Discovering such relationships is difficult because, in principle, every pair of data values must be compared for each pair of attributes in the database. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present with S...
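
As a naive illustration only (not the efficient algorithm the paper presents, whose name is truncated above), a unary inclusion dependency holds when every value of the dependent column also occurs in the referenced column; the columns below are hypothetical.

    def satisfies_ind(dependent_values, referenced_values):
        """Dependent values are a subset of referenced values: the precondition for a foreign key."""
        return set(dependent_values) <= set(referenced_values)

    orders_customer_id = [1, 2, 2, 3]   # hypothetical columns
    customers_id = [1, 2, 3, 4]

    print(satisfies_ind(orders_customer_id, customers_id))  # True: foreign-key candidate
    print(satisfies_ind(customers_id, orders_customer_id))  # False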

Advancing the Discovery of Unique Column Combinations
  • Language: en
  • Pages: 30

Advancing the Discovery of Unique Column Combinations

Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-GORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
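
A brute-force sketch of the underlying problem with an Apriori-style pruning idea (illustrative only; this is neither GORDIAN nor HCA): a column combination is unique if its projection contains no duplicate tuples, and any superset of a unique combination is itself unique and can be skipped.

    from itertools import combinations

    table = [  # hypothetical relation
        {"first": "Ann", "last": "Lee", "city": "Berlin"},
        {"first": "Ann", "last": "Kim", "city": "Berlin"},
        {"first": "Bob", "last": "Lee", "city": "Potsdam"},
    ]
    columns = list(table[0].keys())

    def is_unique(cols):
        projection = [tuple(row[c] for c in cols) for row in table]
        return len(set(projection)) == len(projection)

    minimal_uniques = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(u) <= set(combo) for u in minimal_uniques):
                continue  # pruned: superset of a known unique combination
            if is_unique(combo):
                minimal_uniques.append(combo)

    print(minimal_uniques)  # [('first', 'last'), ('last', 'city')]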

Extracting Structured Information from Wikipedia Articles to Populate Infoboxes
  • Language: en
  • Pages: 32

Extracting Structured Information from Wikipedia Articles to Populate Infoboxes

Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.
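
As a tiny, hand-written illustration of extracting attribute values from article text (fixed regular-expression patterns, not iPopulator's value-structure approach; the attribute names and patterns are hypothetical):

    import re

    article = ("Ada Lovelace (born 10 December 1815) was an English mathematician "
               "and writer, chiefly known for her work on the Analytical Engine.")

    patterns = {  # hypothetical infobox attributes and patterns
        "birth_date": r"born (\d{1,2} \w+ \d{4})",
        "known_for": r"known for her work on the ([\w ]+)\.",
    }

    infobox = {}
    for attr, pattern in patterns.items():
        match = re.search(pattern, article)
        if match:
            infobox[attr] = match.group(1)
    print(infobox)  # {'birth_date': '10 December 1815', 'known_for': 'Analytical Engine'}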

Understanding Cryptic Schemata in Large Extract-transform-load Systems
  • Language: en
  • Pages: 28

Understanding Cryptic Schemata in Large Extract-transform-load Systems

Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and time spent to devel...

Data Profiling
  • Language: en
  • Pages: 136

Data Profiling

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More...

Covering Or Complete?
  • Language: en
  • Pages: 40

Covering Or Complete?

Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope, defined by conditions over one or more attributes; only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs, showing their value for solving complex data quality tasks. Further, we define quality measures for conditions, inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
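
A small sketch of what evaluating a single condition might look like (illustrative assumptions; the paper's exact quality measures may be defined differently): restrict the dependent relation to the rows matching the condition and measure how well that subset aligns with the rows whose values are included in the referenced column.

    def condition_quality(rows, condition, dep_attr, referenced_values):
        """Precision-/recall-style scores for one candidate condition (illustrative)."""
        ref = set(referenced_values)
        selected = [r for r in rows if condition(r)]
        included = [r for r in rows if r[dep_attr] in ref]
        both = [r for r in selected if r[dep_attr] in ref]
        precision = len(both) / len(selected) if selected else 0.0
        recall = len(both) / len(included) if included else 0.0
        return precision, recall

    orders = [  # hypothetical dependent relation
        {"customer": 1, "type": "web"},
        {"customer": 2, "type": "web"},
        {"customer": 9, "type": "phone"},  # 9 has no matching referenced value
    ]
    customer_ids = [1, 2, 3]  # hypothetical referenced column

    print(condition_quality(orders, lambda r: r["type"] == "web",
                            "customer", customer_ids))  # (1.0, 1.0)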

Responsible Librarianship
  • Language: en
  • Pages: 193

Responsible Librarianship

These papers examine library policies and organizational structures in light of the literature of ergonomics, high reliability organizations, joint cognitive systems and integrational linguistics. Bade argues that many policies and structures have been designed and implemented on the basis of assumptions about technical possibilities, ignoring entirely the political dimensions of local determination of goals and purposes as well as the lessons from ergonomics, such as the recognition that people are the primary agents of reliability in all technical systems. Because libraries are understood to be loci of human interaction and communication rather than purely technical systems at the disposal of an abstract user, Bade insists on looking at problems of meaning and communication in the construction and use of the library catalog. Looking at various policies for metadata creation and the results of those policies forces the question: is there a responsible human being behind the library web site and catalog, or have we abandoned the responsibilities of thinking and judgment in favor of procedures, algorithms and machines?