WDM - Chapter 12. Web Usage Mining (2)

Continuing from the previous post, this one walks through the preprocessing steps.

Sources and types of data
The usual sources are web server access logs and application server logs, which together cover several kinds of data:

Usage data

Visitor behavior

Content data
Structure data

The link structure between pages

User data

User information

Key elements of web usage data pre-processing
Nothing here is particularly unusual. Sessionization is not a special technique: it simply means that when the web server application provides no LoginID or SessionID, sessions are cut out heuristically. The chapter introduces approaches based on either time or Referer information.

Data fusion and cleaning
1. Data fusion
  1. 여러대의 서버로 서비스 하는 웹 서버의 로그를 하나로 통합
2. Cleaning
  1. 일부 파일포맷(css, graphics ...) 제외
  2. 사용하지 않는 필드 (file size ...) 제외
  3. 웹 크롤러, 로봇 또는 검색엔진 로그 제외
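The fusion and cleaning steps above can be sketched roughly as follows. This is a minimal sketch, not the book's procedure: the record layout (dicts with `time`, `ip`, `agent`, `url`, `referer`, `size`), the suffix list, and the robot markers are all assumptions made for illustration, and real robot detection usually needs more than a User-Agent check.

```python
import heapq
from datetime import datetime

# Assumed lists, for illustration only.
STATIC_SUFFIXES = (".css", ".js", ".gif", ".jpg", ".png", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def fuse_logs(*server_logs):
    """Merge per-server logs (each already sorted by time) into one stream."""
    return heapq.merge(*server_logs, key=lambda rec: rec["time"])


def is_noise(rec):
    """True for static-resource requests and self-identified robots."""
    path = rec["url"].split("?", 1)[0].lower()
    agent = rec["agent"].lower()
    return path.endswith(STATIC_SUFFIXES) or any(m in agent for m in ROBOT_MARKERS)


def clean(records, keep_fields=("time", "ip", "agent", "url", "referer")):
    """Drop noise records and keep only the fields used downstream."""
    for rec in records:
        if not is_noise(rec):
            yield {f: rec.get(f) for f in keep_fields}


def t(s):
    return datetime.strptime(s, "%H:%M:%S")


# Hypothetical logs from two servers.
server_a = [
    {"time": t("10:00:00"), "ip": "1.1.1.1", "agent": "Mozilla/5.0",
     "url": "/index.html", "referer": None, "size": 5120},
    {"time": t("10:00:02"), "ip": "1.1.1.1", "agent": "Mozilla/5.0",
     "url": "/style.css", "referer": "/index.html", "size": 300},
]
server_b = [
    {"time": t("10:00:01"), "ip": "2.2.2.2", "agent": "Googlebot/2.1",
     "url": "/index.html", "referer": None, "size": 5120},
    {"time": t("10:00:05"), "ip": "1.1.1.1", "agent": "Mozilla/5.0",
     "url": "/product?id=7", "referer": "/index.html", "size": 900},
]

cleaned = list(clean(fuse_logs(server_a, server_b)))
```

With this sample data, only the `/index.html` and `/product?id=7` requests from the browser survive; the stylesheet request, the robot request, and the `size` field are dropped.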
Pageview identification
1. Dependent on the intra-page structure of the site
  1. single frame site
  2. multi-framed sites
  3. dynamic sites
2. Attributes that must be recorded
  1. pageview id (URL)
  2. pageview type (search, maintitle, relatedlist ...)
  3. other metadata (keyword, product attributes ...)
User identification
1. Authentication
  1. IP + UserAgent + LoginID
2. Cookie and Session (if no authentication)
  1. IP + UserAgent + SessionID
3. Others
  1. dynamic IP (some ISP)
  2. IP conflict (some ISP)
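As a rough illustration of the identification keys listed above, the sketch below picks the strongest identifier available for each record. The field names (`login_id`, `session_id`) and the final fallback to IP + UserAgent alone are assumptions of this example, and that fallback is exactly where the dynamic-IP and shared-IP problems show up.

```python
from collections import defaultdict


def user_key(rec):
    """Return the strongest identification key available for a log record."""
    if rec.get("login_id"):
        return ("login", rec["ip"], rec["agent"], rec["login_id"])
    if rec.get("session_id"):
        return ("session", rec["ip"], rec["agent"], rec["session_id"])
    # Assumed fallback: IP + UserAgent only, which is unreliable when an ISP
    # hands out dynamic IPs or many users share one address.
    return ("anon", rec["ip"], rec["agent"])


def group_by_user(records):
    """Group log records by their identification key."""
    users = defaultdict(list)
    for rec in records:
        users[user_key(rec)].append(rec)
    return dict(users)


# Hypothetical records.
records = [
    {"ip": "1.1.1.1", "agent": "Firefox", "login_id": "alice", "url": "/a"},
    {"ip": "1.1.1.1", "agent": "Firefox", "login_id": "alice", "url": "/b"},
    {"ip": "1.1.1.1", "agent": "Firefox", "session_id": "s42", "url": "/a"},
    {"ip": "1.1.1.1", "agent": "Safari", "url": "/c"},
]
users = group_by_user(records)
```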
Sessionization
1. Two types of sessionization
  1. Time-oriented
    1. When the time gap between consecutive events exceeds a threshold, treat what follows as a separate session
  2. Structure-oriented
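    1. When the time gap between the Referer link and the current link exceeds the threshold, check the request against the Referer link information before deciding where it belongs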
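Both heuristics can be sketched in a few lines. This is a simplified reading rather than a definitive implementation: the 30-minute threshold is an assumed default, the record fields are hypothetical, and the Referer variant shown here uses only the Referer check (a request joins the current session if its Referer was already visited in it) without the time condition.

```python
from datetime import datetime, timedelta


def sessionize_by_time(requests, max_gap=timedelta(minutes=30)):
    """Time-oriented: open a new session when the gap exceeds max_gap.

    `requests` are the requests of ONE user, sorted by time.
    """
    sessions = []
    for req in requests:
        if sessions and req["time"] - sessions[-1][-1]["time"] <= max_gap:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions


def sessionize_by_referer(requests):
    """Structure-oriented: keep a request in the current session only if its
    Referer was already visited in that session; otherwise open a new one."""
    sessions = []
    for req in requests:
        if sessions and req["referer"] in {r["url"] for r in sessions[-1]}:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions


def t(s):
    return datetime.strptime(s, "%H:%M")


# Hypothetical requests from a single user.
requests = [
    {"time": t("10:00"), "url": "/a", "referer": None},
    {"time": t("10:05"), "url": "/b", "referer": "/a"},
    {"time": t("11:00"), "url": "/c", "referer": "/b"},
    {"time": t("11:02"), "url": "/d", "referer": None},
]

by_time = sessionize_by_time(requests)
by_ref = sessionize_by_referer(requests)
```

On this sample the two heuristics disagree: the time-based split puts `/c` in a new session because of the 55-minute gap, while the Referer-based split keeps `/c` with `/a` and `/b` and starts a new session only at `/d`, which has no Referer.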
Episode identification
1. Using domain ontology or concept hierarchy
2. Storing each shopping cart information
Path completion
1. Backtracking pages (cached pages)
2. Missing references
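To make the backtracking case concrete: when a request's Referer is not the page the user was last seen on, the pages in between were probably revisited from the browser cache and never reached the server log. The sketch below fills them in under one strong assumption, namely that the user went back one page at a time with the Back button. The function name and the `(url, referer)` tuple layout are hypothetical.

```python
def complete_path(requests):
    """Insert cached backtracking pages that are missing from the server log.

    `requests` is one session as (url, referer) pairs in time order.
    """
    path = []   # completed navigation path
    stack = []  # pages the user could still return to with the Back button
    for url, referer in requests:
        if referer in stack and stack[-1] != referer:
            # The user must have backed up to `referer` through cached pages.
            while stack[-1] != referer:
                stack.pop()
                path.append(stack[-1])
        stack.append(url)
        path.append(url)
    return path


# Hypothetical logged session: C is requested from B, but the last logged page is E.
logged = [("A", None), ("B", "A"), ("D", "B"), ("E", "D"), ("C", "B")]
completed = complete_path(logged)
```

Here the log contains only A, B, D, E, C, and the completed path becomes A, B, D, E, D, B, C, with the second D and B being the inferred cached pageviews.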

ref links:
Discovery of web robot sessions based on their navigational patterns

Data transformation

Data integration from multiple sources
1. operational database (user data, product attributes, categories)
2. metadata
3. domain knowledge
Data aggregation
Data generalization
Product-oriented events
1. Shopping cart changes
2. Order and shipping
3. Impression
4. Click-throughs
5. Some service-related events
Final transaction database

OLAP (on-line analytical processing)
Visualization ...

ref links:
Discovering internet marketing intelligence through online analytical web usage mining
