TU Graz – Institute of Interactive Systems and Data Science
8010 Graz, Inffeldgasse 16c, Austria, Tel.: +43 316 873-0000
Author: Maris Siljak, BSc
Knowledge Discovery and Data Mining 2 (VU 706.715)
Data Preprocessing and Cleaning
The dataset consists of crawled products categorised as "Motorbikes -> Naked Bikes". Due to time constraints, only ~3,500 products were crawled.
After successfully cleaning the data and converting all values to numeric form, we are ready to examine the pairwise relations between the target feature, "price", and all other features, including price itself.
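As an illustration, the following Python sketch shows such a pairwise inspection, assuming pandas and seaborn; the DataFrame contents and column names are placeholders standing in for the real cleaned dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative stand-in for the cleaned, all-numeric dataset (column names are placeholders).
df = pd.DataFrame({
    "price":   [2500, 4990, 7200, 12900, 3400],
    "mileage": [30000, 15000, 8000, 1200, 22000],
    "year":    [2012, 2015, 2018, 2021, 2014],
})

# Grid of pairwise plots; the diagonal shows each feature's distribution,
# including the distribution of the target "price".
sns.pairplot(df)
plt.show()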
The pairwise plot of price against itself (Figure 1) clearly shows that the price distribution is not normal. Since a linear regression model is planned for the predictions, an approximately normal target distribution is advantageous for the algorithm. Applying the natural logarithm to the target feature yields the distribution shown in Figure 2. A big advantage of such a simple transformation is that its result is easily reversed with the exponential function.
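The following minimal sketch shows the transformation and its inverse; the price values are illustrative only.

import numpy as np
import pandas as pd

# Illustrative prices; in the project this would be the "price" column of the cleaned data.
prices = pd.Series([2500, 4990, 7200, 12900, 3400], name="price")

# Natural logarithm of the target; this brings the skewed distribution closer to normal.
log_prices = np.log(prices)

# The transformation is easily reversed with the exponential function,
# e.g. to turn predicted log prices back into prices.
assert np.allclose(np.exp(log_prices), prices)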
Price Prediction
Before predictions are made, the model is fitted with the previously processed data; the linear regression predictions then yield the results shown in the Metrics table, reported for the flat and the cleaned data.
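A minimal sketch of this fit/predict/evaluate step, assuming scikit-learn; the synthetic data below stands in for the real feature matrix and log-transformed price target.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the preprocessed data; in the project, X holds the
# numeric product features and y the log-transformed prices.
rng = np.random.default_rng(0)
X = pd.DataFrame({"mileage": rng.uniform(0, 50_000, 200), "year": rng.integers(2005, 2022, 200)})
y = 10 - 0.00002 * X["mileage"] + 0.01 * (X["year"] - 2005) + rng.normal(0, 0.3, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)        # fit the model with the processed training data
y_pred = model.predict(X_test)     # predict (log) prices for unseen products

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}  MSE={mse:.3f}  R2={r2:.3f}")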
Willhaben Dataset Collector and Price Predictor
Introduction
This project consists of two major components:
Dataset Collector
Price Predictor
The Dataset Collector comprises a generic crawler and its storage management. The crawler can crawl any kind of Willhaben product based on a search query and collect the relevant data, which is afterwards stored in a NoSQL DBMS.
The Price Predictor covers dataset preparation and preprocessing, model training and, finally, prediction of the price feature. The previously acquired data is preprocessed and cleaned to bring it into a form usable by the algorithm and to remove outliers and erroneous values. Once the data is ready, a linear regression model is trained and used to predict the target feature.
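The kind of cleaning described above could look like the following sketch, assuming pandas; the column names, raw values and plausibility bounds are placeholders, not the project's actual rules.

import pandas as pd

# Illustrative raw crawl output; column names and values are placeholders.
raw = pd.DataFrame({
    "price":   ["4.990", "7200", "error", "12900"],
    "mileage": ["15000", "8000", "22000", None],
    "year":    ["2015", "2018", "2017", "2020"],
})

# Coerce every column to numeric; anything that cannot be parsed becomes NaN.
clean = raw.apply(lambda col: pd.to_numeric(col.str.replace(".", "", regex=False), errors="coerce"))

# Drop rows with missing or erroneous values.
clean = clean.dropna()

# Remove obvious outliers, here with an assumed plausibility range for the price.
clean = clean[clean["price"].between(500, 50_000)]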
Data Collecting and Storing
The software component used for collecting the relevant data is a web crawler built in Python on top of "Scrapy" [2], a web-crawling and scraping framework. To start the crawling process, one feeds it a valid Willhaben search link. This transfers the search-processing responsibility to Willhaben's own search engine, ensures the correctness of the whole process flow, and avoids redundant dependencies. During the crawling process, all products on the result pages are individually crawled and scraped until no products are left. The crawler is fairly advanced in that it can revisit products and record revisions of updated product details (a minimal spider sketch follows the questions below). This data is collected for advanced analytics such as:
How many times does a certain user change product details?
Was a change a good or a bad decision (increased views, accelerated sale)?
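To illustrate the crawler component, here is a minimal Scrapy spider sketch; the search URL, CSS selectors and item fields are assumptions and do not reflect Willhaben's actual markup.

import scrapy


class WillhabenSpider(scrapy.Spider):
    """Generic spider that is started from a Willhaben search-result link."""
    name = "willhaben"
    # Placeholder search URL; any valid Willhaben search link can be supplied instead.
    start_urls = ["https://www.willhaben.at/iad/gebrauchtwagen/motorraeder?keyword=naked+bike"]

    def parse(self, response):
        # Visit every product found on the result page (selector is an assumption).
        for href in response.css("a.search-result-entry::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Keep paging until there are no more result pages.
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        # Scrape the relevant product details (field names and selectors are placeholders).
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css("span.price::text").get(),
        }

Such a spider could be run, for example, with "scrapy runspider willhaben_spider.py -o products.json"; in this project the scraped items are stored in MongoDB instead, as described below.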
The product data (excluding price) used in the model looks like the example shown in the product data figure [1]. This data can have an arbitrary number of attributes as well as equipment entries, which was one of the main factors influencing the choice of DBMS. "MongoDB" [3] is used as the storage for this data and is accessed from the project through the "pymongo" [4] library. Since MongoDB is a NoSQL DBMS, one of its main advantages is that the schema of a collection is not fixed. That makes it a perfect fit for this case, because Willhaben does not require any product details to be filled in, which can result in missing and erroneous values.
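A minimal sketch of this storage step, assuming a local MongoDB instance; the connection string, database, collection and document fields are placeholders.

from pymongo import MongoClient

# Connect to a local MongoDB instance; host, database and collection names are assumptions.
client = MongoClient("mongodb://localhost:27017")
products = client["willhaben"]["products"]

# Because collections are schemaless, products with different attribute and
# equipment sets can be stored side by side without migrations.
products.insert_one({
    "title": "Yamaha MT-07",                # illustrative product
    "price": 5990,                           # illustrative value
    "attributes": {"Erstzulassung": "2017", "km-Stand": "14.500"},
    "equipment": ["ABS", "Heizgriffe"],
})
products.insert_one({
    "title": "Naked bike with sparse details",
    "price": 3200,                           # fields Willhaben left empty are simply absent
})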
Figure: Product data [1]
Figure 1: Pairwise plots (x-axis: price)
References
[1] https://www.willhaben.at/iad/gebrauchtwagen/d/auto/audi-q3-sportback-35-tdi-s-line-exterieur-345431452/
[2] https://scrapy.org
[3] https://www.mongodb.com
[4] https://api.mongodb.com/python/current/
Figure 2: Initial vs normal price distribution
Metrics
Root Mean Squared Error 0.33882925086198107
Mean Squared Error 0.11480526123969131
R2 Score 0.748883640846617
Figure 3: Actual and predicted price distribution
Figure 4: Scatter plot of test and predicted prices with best-fit line