4.5 webminig

Page 1: 4.5 webminig

DATA MINING

MINING THE WORLD WIDE WEB

Page 2: 4.5 webminig

Mining the Web’s Link Structures to Identify Authoritative Web Pages

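• Number the pages {1, 2, ..., n} and let their adjacency matrix A be the n×n matrix in which A(i, j) is 1 if page i links to page j, and 0 otherwise.

• Given the authority weight vector a = (a1, a2, ..., an) and the hub weight vector h = (h1, h2, ..., hn), we have

$$h = A \cdot a, \qquad a = A^{T} \cdot h$$

• Unfolding these two equations k times from initial vectors $h^{(0)}$ and $a^{(0)}$, we have

$$h^{(k)} = (A A^{T})^{k}\, h^{(0)}, \qquad a^{(k)} = (A^{T} A)^{k}\, a^{(0)}$$

• When normalized, these two iterations converge to the principal eigenvectors of $A A^{T}$ and $A^{T} A$, respectively.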
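
The hub and authority iteration can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the 4-page graph, the iteration count, and the Euclidean normalization are assumptions chosen for the example.

```python
import math

def hits(adj, iterations=50):
    """Iterate a = A^T h and h = A a, normalizing after every step."""
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hub weights of the pages linking to j.
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub weight of page i: sum of authority weights of the pages i links to.
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # Normalize to unit Euclidean length so the weights stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub)) or 1.0
        auth = [x / a_norm for x in auth]
        hub = [x / h_norm for x in hub]
    return auth, hub

# Hypothetical graph: pages 0 and 1 link to pages 2 and 3; page 2 links to page 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
```

In this toy graph, page 3 ends up with the largest authority weight, and pages 0 and 1 share the largest hub weight.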

Page 3: 4.5 webminig

• HITS sometimes drifts when hubs contain multiple topics. It may also cause “topic hijacking” when many pages from a single website point to the same single popular site, giving the site too large a share of the authority weight.

• Such problems can be overcome by replacing the sums in the hub and authority update equations with weighted sums, scaling down the weights of multiple links from within the same site, using anchor text to adjust the weight of the links along which authority is propagated, and breaking large hub pages into smaller units.
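
One way to scale down multiple links from the same site can be sketched as follows. The 1/k rule (k links from one host to one target each get weight 1/k) is an assumption made for illustration, and the URLs are hypothetical; the slides do not prescribe a specific weighting.

```python
from collections import Counter
from urllib.parse import urlparse

def site_scaled_weights(links):
    """links: list of (source_url, target_url) pairs; returns {(src, dst): weight}."""
    unique_links = set(links)
    # Count how many distinct pages on each host point to the same target.
    per_host = Counter((urlparse(src).netloc, dst) for src, dst in unique_links)
    # Each of the k links from one host to one target gets weight 1/k,
    # so a single site contributes at most one unit of endorsement per target.
    return {
        (src, dst): 1.0 / per_host[(urlparse(src).netloc, dst)]
        for src, dst in unique_links
    }

# Hypothetical URLs, for illustration only.
links = [
    ("http://a.example/p1", "http://popular.example/"),
    ("http://a.example/p2", "http://popular.example/"),
    ("http://a.example/p3", "http://popular.example/"),
    ("http://b.example/home", "http://popular.example/"),
]
weights = site_scaled_weights(links)
```

The resulting weights can replace the 0/1 entries of the adjacency matrix, so the three links from a.example together count as much as the single link from b.example.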

Page 4: 4.5 webminig

• The link analysis algorithms are based on two assumptions:

– Links convey human endorsement. (If there exists a link from page A to page B and these two pages are authored by different people, then the link implies that the author of page A found page B valuable.)

– Pages that are co-cited by a certain page are likely related to the same topic.

• Problems:

– The importance of a page may be miscalculated by PageRank.

– Topic drift may occur in HITS.

• The cause is that a single Web page often contains multiple semantics, and the different parts of the Web page have different importance in that page.

Page 5: 4.5 webminig

Page 6: 4.5 webminig

• Using VIPS (VIsion-based Page Segmentation), construct a page graph and a block graph.

• Using this graph model, the new link analysis algorithms discover the intrinsic semantic structure of the Web.

• The graph model in block-level link analysis is induced from two kinds of relationships: block-to-page (link structure) and page-to-block (page layout).

Page 7: 4.5 webminig

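• The block-to-page relationship (link structure): it is more reasonable to consider hyperlinks as going from block to page, rather than from page to page.

• Let Z denote the block-to-page matrix with dimension n×k (n blocks, k pages). Z can be defined as:

$$Z_{ij} = \begin{cases} 1/s_i & \text{if there is a link from block } i \text{ to page } j \\ 0 & \text{otherwise} \end{cases}$$

where $s_i$ is the number of pages to which block i links.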
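
As a small sketch of how such a matrix could be built, assume the definition in which entry (i, j) is 1/s_i when block i links to page j and 0 otherwise, with s_i the number of pages block i links to. The input format (`block_links`) and the example layout are hypothetical.

```python
def block_to_page_matrix(block_links, num_pages):
    """block_links[i] is the set of page indices that block i links to."""
    Z = []
    for targets in block_links:
        row = [0.0] * num_pages
        for j in targets:
            row[j] = 1.0 / len(targets)  # 1 / s_i
        Z.append(row)
    return Z

# Hypothetical layout: 3 blocks, 2 pages; block 2 contains no links.
block_links = [{0, 1}, {1}, set()]
Z = block_to_page_matrix(block_links, num_pages=2)
```

Each nonzero row sums to 1, so a row can be read as the probability of jumping from that block to each page.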

Page 8: 4.5 webminig

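• The page-to-block relationship (page layout): let X denote the page-to-block matrix with dimension k×n.

• Each Web page can be segmented into blocks. X is defined as

$$X_{ij} = \begin{cases} f_{p_i}(b_j) & \text{if } b_j \in p_i \\ 0 & \text{otherwise} \end{cases}$$

• where f is a function that assigns to every block b in page p an importance value. The bigger $f_p(b)$ is, the more important the block b is. Function f is empirically defined as

$$f_p(b) = \alpha \times \frac{\text{the size of block } b}{\text{the distance between the center of } b \text{ and the center of the screen}}$$

where $\alpha$ is a normalization factor that makes the values $f_p(b)$ sum to 1 over the blocks of page p.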
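
The importance function can be illustrated with a short sketch, assuming f is proportional to block size divided by the distance from the block center to the screen center and normalized to sum to 1 per page. The block sizes, coordinates, and the guard against a zero distance are assumptions of this example.

```python
import math

def block_importance(blocks, screen_center):
    """blocks: list of (size, (cx, cy)); returns normalized importance values for one page."""
    raw = []
    for size, (cx, cy) in blocks:
        dist = math.hypot(cx - screen_center[0], cy - screen_center[1])
        # Guard against a block centered exactly on the screen center
        # (a choice made for this sketch only).
        raw.append(size / max(dist, 1.0))
    total = sum(raw)
    # Dividing by the total plays the role of the normalization factor.
    return [r / total for r in raw]

# Hypothetical page: main content, sidebar, top banner (sizes in square pixels).
blocks = [
    (240000, (512, 400)),
    (120000, (100, 384)),
    (80000, (512, 40)),
]
f = block_importance(blocks, screen_center=(512, 384))
```

The returned list is one row of the page-to-block matrix restricted to the blocks of that page; the large, central block dominates.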

Page 9: 4.5 webminig

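• Based on the block-to-page and page-to-block relations, a new Web page graph that incorporates the block importance information is defined as

$$W_P = X Z$$

where X is the k×n page-to-block matrix and Z is the n×k block-to-page matrix, so $W_P$ is a k×k page-to-page matrix.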
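
A tiny numeric sketch of combining the two relations, assuming the page graph is the matrix product of the page-to-block matrix X and the block-to-page matrix Z. All matrix values below are made up for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical example: k = 2 pages, n = 3 blocks.
# X is k x n: row p holds the importance of each block within page p.
X = [
    [0.7, 0.3, 0.0],  # page 0 consists of blocks 0 and 1
    [0.0, 0.0, 1.0],  # page 1 consists of block 2
]
# Z is n x k: row b spreads block b's links evenly over the pages it points to.
Z = [
    [0.0, 1.0],  # block 0 links to page 1
    [0.5, 0.5],  # block 1 links to pages 0 and 1
    [1.0, 0.0],  # block 2 links to page 0
]
W_P = matmul(X, Z)  # k x k page graph
```

Page 0 passes most of its weight to page 1 because its important block (block 0) links there, while the link back to itself comes only from the less important block 1.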

Page 10: 4.5 webminig

Mining Multimedia Data on the Web

• Web-based multimedia data are embedded on the Web page and are associated with text and link information.

• Using some Web page layout mining techniques (like VIPS), a Web page can be partitioned into a set of semantic blocks.

• VIPS helps to identify the surrounding text for Web images. This text provides a textual description of Web images and can be used to build an image index.

• The Web image search problem can then be partially solved using traditional text search techniques.
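
The idea of indexing images by their surrounding text can be sketched with a simple inverted index. The block segmentation is assumed to be given already (for example, by a VIPS-like tool); the function names, block texts, and image paths are hypothetical.

```python
import re
from collections import defaultdict

def build_image_index(blocks):
    """blocks: list of (block_text, [image_url, ...]); returns {term: set of image URLs}."""
    index = defaultdict(set)
    for text, images in blocks:
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].update(images)
    return index

def search_images(index, query):
    """Return the images whose surrounding text contains every query term."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical semantic blocks, each with its text and the images it contains.
blocks = [
    ("Golden Gate Bridge at sunset, San Francisco", ["img/bridge.jpg"]),
    ("Site navigation: home, products, contact", ["img/logo.png"]),
    ("Sunset over the Pacific coast", ["img/coast.jpg"]),
]
index = build_image_index(blocks)
hits_for_query = search_images(index, "golden gate")
```

Because each image is tied only to the text of its own block, the navigation text does not pollute the description of the photographs.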

Page 11: 4.5 webminig

Page 12: 4.5 webminig

Page 13: 4.5 webminig

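• The block-level link analysis technique is used to organize Web images. Consider a new relation: the block-to-image relation.

• Let Y denote the block-to-image matrix with dimension n×m. For each image, at least one block contains this image.

• Y is defined as

$$Y_{ij} = \begin{cases} 1/s_i & \text{if } I_j \in b_i \\ 0 & \text{otherwise} \end{cases}$$

where $s_i$ is the number of images contained in block $b_i$.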

Page 14: 4.5 webminig

• We first construct the block graph, from which the image graph can be induced. The block graph is defined as:

$$W_B = (1 - t)\, Z X + t\, D^{-1} U$$

• where t is a suitable constant and D is a diagonal matrix with $D_{ii} = \sum_j U_{ij}$. $U_{ij}$ is 0 if block i and block j are contained in two different Web pages; otherwise, it is set to the DOC (degree of coherence) value of the smallest block containing both block i and block j. It is easy to check that each row of $W_B$ sums to 1.

• $W_B$ can be viewed as a probability transition matrix such that $W_B(a, b)$ is the probability of jumping from block a to block b.
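
The block graph can be sketched numerically, assuming the form W_B = (1 - t) Z X + t D^{-1} U, where D holds the row sums of U. The matrices, the DOC values in U, and t = 0.5 are hypothetical values chosen so the rows are easy to check by hand.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def block_graph(Z, X, U, t):
    """W_B = (1 - t) * Z X + t * D^{-1} U, with D_ii equal to the i-th row sum of U."""
    ZX = matmul(Z, X)
    n = len(U)
    W_B = []
    for i in range(n):
        d = sum(U[i])  # D_ii; positive because U_ii is the block's own DOC value
        W_B.append([(1 - t) * ZX[i][j] + t * U[i][j] / d for j in range(n)])
    return W_B

# Hypothetical data: 3 blocks, 2 pages. Blocks 0 and 1 are on page 0, block 2 on page 1.
Z = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]   # block-to-page (n x k)
X = [[0.7, 0.3, 0.0], [0.0, 0.0, 1.0]]     # page-to-block (k x n)
U = [
    [1.0, 0.4, 0.0],  # made-up DOC values; 0 for blocks on different pages
    [0.4, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
W_B = block_graph(Z, X, U, t=0.5)
row_sums = [sum(row) for row in W_B]
```

With these inputs every row of `W_B` sums to 1, which matches its reading as a transition matrix between blocks. That property relies on every block linking to at least one page, so that each row of Z sums to 1.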

Page 15: 4.5 webminig

• The image graph can be constructed by noticing that every image is contained in at least one block.

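• The weight matrix of the image graph is defined as:

$$W_I = Y^{T} W_B\, Y$$

• where $W_I$ is an m×m matrix. If two images i and j are in the same block, say b, then $W_I(i, j) = W_B(b, b) = 0$.

• However, the images in the same block are semantically related. Thus, we get

$$W_I = t\, D^{-1} Y^{T} Y + (1 - t)\, Y^{T} W_B\, Y$$

where t is a suitable constant and D is here a diagonal matrix with $D_{ii} = \sum_j (Y^{T} Y)_{ij}$.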
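
The effect of the same-block correction can be checked on a toy example. The sketch assumes the refined form W_I = t D^{-1} Y^T Y + (1 - t) Y^T W_B Y, with D holding the row sums of Y^T Y; the matrices and t = 0.3 are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def image_graph(Y, W_B, t):
    """W_I = t * D^{-1} Y^T Y + (1 - t) * Y^T W_B Y."""
    Yt = transpose(Y)
    YtY = matmul(Yt, Y)
    YtWY = matmul(matmul(Yt, W_B), Y)
    m = len(YtY)
    W_I = []
    for i in range(m):
        d = sum(YtY[i])  # D_ii; positive because every image lies in some block
        W_I.append([t * YtY[i][j] / d + (1 - t) * YtWY[i][j] for j in range(m)])
    return W_I

# Hypothetical data: block 0 holds images 0 and 1, block 1 holds none, block 2 holds image 2.
Y = [[0.5, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up block graph with zero self-transitions.
W_B = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

link_only = matmul(matmul(transpose(Y), W_B), Y)  # first definition, no same-block term
W_I = image_graph(Y, W_B, t=0.3)
```

Under the first definition images 0 and 1 get weight 0 even though they share a block; the added term gives them a positive weight.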

Page 16: 4.5 webminig

Page 17: 4.5 webminig

Automatic Classification of Web Documents

• Each document is assigned a class label from a set of predefined topic categories, based on a set of examples of preclassified documents.

• For example, Yahoo!’s taxonomy and its associated documents can be used as training and test sets in order to derive a Web document classification scheme.

• Because a Web page may contain multiple themes, advertisements, and navigation information, block-based page content analysis plays an important role in the construction of high-quality classification models.

• Block-based Web linkage analysis will reduce such noise and enhance the quality of Web document classification.

Page 18: 4.5 webminig

Web Usage Mining

• A Web server usually registers a (Web) log entry, or Weblog entry, for every access of a Web page. It includes the URL requested, the IP address from which the request originated, and a timestamp.

• Web usage mining mines Weblog records to discover user access patterns of Web pages.

• Analyzing and exploring Weblog records can identify the customers for electronic commerce, enhance the quality and delivery of Internet information services to the end user, and improve Web server system performance.

• E.g. Web-based e-commerce servers
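
To make the Weblog entry concrete, here is a minimal parsing sketch. It assumes the server writes entries in the widely used Common Log Format; the slides do not name a format, and the sample line is made up.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [timestamp] "METHOD url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict with ip, time, method, url, status, size; None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

# Hypothetical log line, for illustration only.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /products/index.html HTTP/1.0" 200 2326'
entry = parse_log_line(sample)
```

The three fields named above (URL, IP address, timestamp) come out as `entry["url"]`, `entry["ip"]`, and `entry["time"]`.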

Page 19: 4.5 webminig

• The techniques for developing Web usage mining:

– It is important to determine what and how much valid and reliable knowledge can be discovered from the large raw log data. The data need to be cleaned, condensed, and transformed in order to retrieve and analyze significant and useful information.

– A multidimensional view can be constructed on the Weblog database, and multidimensional OLAP analysis can be performed to find the top N users, the top N accessed Web pages, and so on, which helps to discover potential customers, users, markets, and others.

– Data mining can be performed on Weblog records to find association patterns, sequential patterns, and trends of Web accessing.
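
Two of these analyses can be sketched on already-cleaned log records: a top-N page count and a very simple form of sequential pattern (frequent page-to-page transitions). Treating each IP address as one user, the record layout `(ip, timestamp, url)`, and the support threshold are simplifying assumptions of this example.

```python
from collections import Counter, defaultdict

def top_n_pages(entries, n):
    """Return the n most requested URLs with their hit counts."""
    return Counter(url for _, _, url in entries).most_common(n)

def frequent_transitions(entries, min_support):
    """Count page-to-page transitions per IP address, ordered by timestamp."""
    by_user = defaultdict(list)
    for ip, _, url in sorted(entries, key=lambda e: e[1]):
        by_user[ip].append(url)
    counts = Counter()
    for urls in by_user.values():
        counts.update(zip(urls, urls[1:]))  # consecutive page pairs
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Hypothetical cleaned records: (ip, timestamp, url).
entries = [
    ("10.0.0.1", 1, "/home"), ("10.0.0.1", 2, "/products"), ("10.0.0.1", 3, "/cart"),
    ("10.0.0.2", 1, "/home"), ("10.0.0.2", 4, "/products"),
    ("10.0.0.3", 2, "/products"), ("10.0.0.3", 5, "/cart"),
]
top_pages = top_n_pages(entries, 1)
patterns = frequent_transitions(entries, min_support=2)
```

A production pipeline would first split each user's requests into sessions and use a proper sequential pattern mining algorithm; this sketch only counts adjacent pairs.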

Page 20: 4.5 webminig

• For example, some studies have proposed adaptive sites: websites that improve themselves by learning from user access patterns.

• Weblog analysis may also help build customized Web services for individual users.

• Weblog information can be integrated with Web content and Web linkage structure mining to help Web page ranking, Web document classification, and the construction of a multilayered Web information base.
