Markov, Zdravko / Larose, Daniel T.
Data Mining the Web
Uncovering Patterns in Web Content, Structure, and Usage
1st edition, May 2007
2007. 218 pages, hardcover
- Practitioner's book -
ISBN 978-0-471-66655-4 - John Wiley & Sons
Learn How To Convert Web Data Into Web Knowledge
This text demonstrates how to extract knowledge by finding meaningful connections among data spread throughout the Web. Readers learn methods and algorithms from the fields of information retrieval, machine learning, and data mining which, when combined, provide a solid framework for mining the Web. The authors walk readers through the algorithms with the aid of examples and exercises.
This text is divided into three parts:
Part One, Web Structure Mining, presents basic concepts and techniques for extracting information from the Web. Readers learn how to collect and index Web documents as well as search and rank Web pages according to their textual content and hyperlink structure.
Part Two, Web Content Mining, offers two approaches, clustering and classification, for organizing Web content. For both approaches, the authors set forth specific algorithms that enable readers to convert Web data into knowledge.
Part Three, Web Usage Mining, demonstrates the application of data mining methods to uncover meaningful patterns of Internet usage.
Methods and algorithms are illustrated by simple examples. More than 100 exercises help readers assess their grasp of the material. Further, thirty-four hands-on analysis problems ask readers to use their new data mining expertise to solve real problems, working with large data sets. All the data sets needed for the examples, exercises, and analysis problems are available on the companion Web site.
The extensive use of examples, along with the opportunity to test and apply data mining skills, makes this text ideal for graduate students and upper-level undergraduates in computer science and engineering. Web designers and researchers will find that this text gives them a new set of tools to mine the Web for knowledge and move well beyond the capabilities of standard search engines.
From the Contents
PART I: WEB STRUCTURE MINING.
1 INFORMATION RETRIEVAL AND WEB SEARCH.
Web Search Engines.
Crawling the Web.
Indexing and Keyword Search.
Advanced Text Search.
Using the HTML Structure in Keyword Search.
Evaluating Search Quality.
2 HYPERLINK-BASED RANKING.
Social Networks Analysis.
Authorities and Hubs.
Link-Based Similarity Search.
Enhanced Techniques for Page Ranking.
PART II: WEB CONTENT MINING.
Hierarchical Agglomerative Clustering.
Finite Mixture Problem.
Collaborative Filtering (Recommender Systems).
4 EVALUATING CLUSTERING.
Approaches to Evaluating Clustering.
Similarity-Based Criterion Functions.
Probabilistic Criterion Functions.
MDL-Based Model and Feature Evaluation.
Minimum Description Length Principle.
MDL-Based Model Evaluation.
Precision, Recall, and F-Measure.
General Setting and Evaluation Techniques.
Naive Bayes Algorithm.
PART III: WEB USAGE MINING.
6 INTRODUCTION TO WEB USAGE MINING.
Definition of Web Usage Mining.
Cross-Industry Standard Process for Data Mining.
Web Server Log Files.
Remote Host Field.
HTTP Request Field.
Status Code Field.
Transfer Volume (Bytes) Field.
Common Log Format.
Extended Common Log Format.
User Agent Field.
Example of a Web Log Record.
Microsoft IIS Log Format.
7 PREPROCESSING FOR WEB USAGE MINING.
Need for Preprocessing the Data.
Data Cleaning and Filtering.
Page Extension Exploration and Filtering.
De-Spidering the Web Log File.
Directories and the Basket Transformation.
Further Data Preprocessing Steps.
8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING.
Number of Visit Actions.
Relationship between Visit Actions and Session Duration.
Average Time per Page.
Duration for Individual Pages.
9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION.
Definition of Clustering.
The BIRCH Clustering Algorithm.
Affinity Analysis and the A Priori Algorithm.
Discretizing the Numerical Variables: Binning.
Applying the A Priori Algorithm to the CCSU Web Log Data.
Classification and Regression Trees.
The C4.5 Algorithm.