Accepted Papers

  • Improving M5 model tree pruning by Automatic Programming
    Mr. Hieu Huynh, Ostfold University College, Norway
    In this paper, we investigate how the pruning procedure in the M5 model tree algorithm can be improved by automatic programming guided by an evolutionary algorithm. We design and implement experiments that use the Automatic Design of Algorithms Through Evolution (ADATE) system to synthesize new programs for altering the original program employed in the M5 pruning process. The results show that M5 trees pruned by the new program become significantly smaller in size while their accuracy is, in general, slightly improved. Through this project, we demonstrate the power of the ADATE system and suggest that it can be used to improve other Machine Learning models.
  • Key Business Object Based Definition of Data Migration Tranches
    Mr. Andreas Martens, Adesso AG, Dortmund, Germany
    Sooner or later, in almost every company, the maintenance and further development of large legacy enterprise applications reaches its limit. Both from a cost perspective and in terms of technical possibilities, the application must be replaced, and data migration is an inseparable part of this replacement. One of the most popular data migration strategies, owing to its low risk, is step-by-step data migration, in which the whole data volume is split into several data tranches and then migrated in several migration steps. In this work I present an approach that describes how the whole data volume of an enterprise core application can be split into data tranches effectively and with minimum integration overhead. Each data tranche represents a single data block of one migration step. The approach is based on the Key Business Objects of the application, i.e. its main Business Object Class, for example "Customer" or "Product". Some aspects of this approach have already been partially tested, with success, in a data migration project for a healthcare company.
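    The tranche-splitting idea described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: records are grouped by their key business object (e.g. a customer id) so that everything belonging to one business object migrates in the same step, and whole groups are then packed into size-bounded tranches.

    ```python
    from collections import defaultdict

    def split_into_tranches(records, key_object_id, tranche_size):
        """Group records by their key business object, then pack whole
        groups into tranches so that all records belonging to one
        business object migrate together in a single migration step."""
        groups = defaultdict(list)
        for record in records:
            groups[key_object_id(record)].append(record)

        tranches, current = [], []
        for _, group in sorted(groups.items()):
            if current and len(current) + len(group) > tranche_size:
                tranches.append(current)
                current = []
            current.extend(group)
        if current:
            tranches.append(current)
        return tranches

    # Toy example: each record is tagged with the customer it belongs to.
    records = [{"customer": c, "row": i} for i, c in enumerate("AABBBCCDD")]
    tranches = split_into_tranches(records, lambda r: r["customer"],
                                   tranche_size=4)
    ```

    Keeping each business object's records in one tranche is what minimizes the integration overhead between already-migrated and not-yet-migrated data.
    
    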
  • Exploring Peer-to-Peer Data Mining
    Andrea Marcozzi and Gianluca Mazzini, Lepida SpA, Bologna, Italy
    The emerging widespread use of Peer-to-Peer computing is making P2P Data Mining a natural choice when data sets are distributed over such systems. The huge amount of data stored within the nodes of P2P networks, and the ever-growing number of applications dealing with them, such as P2P file sharing, P2P chatting and P2P electronic commerce, is moving the spotlight onto this challenging field. In this paper we give an overview of two different approaches for implementing primitives for P2P Data Mining, and then show their differences and similarities. The first one is based on the definition of Local algorithms [1]; the second one relies on the Newscast model of computation [4].
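    As a rough illustration of the gossip-style computation underlying the Newscast model (a minimal sketch, not the paper's code): each node repeatedly averages its local estimate with that of a randomly chosen peer, and all nodes converge to the global mean of the initial values without any central coordination.

    ```python
    import random

    def gossip_average(values, rounds=200, seed=0):
        """Newscast-style gossip: in each round two random nodes exchange
        their current estimates and both keep the average.  The sum of all
        estimates is preserved, so every node converges to the global mean."""
        rng = random.Random(seed)
        est = list(values)
        n = len(est)
        for _ in range(rounds):
            i, j = rng.randrange(n), rng.randrange(n)
            est[i] = est[j] = (est[i] + est[j]) / 2
        return est

    values = [10.0, 0.0, 4.0, 2.0]      # global mean is 4.0
    estimates = gossip_average(values)
    ```

    The same averaging primitive is the building block for distributed statistics (counts, means, variances) that many P2P data-mining algorithms are assembled from.
    
    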
  • Random Projections for Non-Linear Dimensionality Reduction
    Long Cheng (Kiwii Power Technology Co., Ltd), Chenyu You (Rensselaer Polytechnic Institute) and Yani Guan (Kiwii Power Technology Co., Ltd and Rensselaer Polytechnic Institute), Troy, NY, USA
    The need to analyze high-dimensional data in various areas, such as image processing, human gene regulation and smart grids, raises the importance of dimensionality reduction. While classical linear dimensionality reduction methods are easily implementable and efficiently computable, they fail to discover the true structure of high-dimensional data lying on a non-linear subspace. To overcome this issue, many non-linear dimensionality reduction approaches, such as Locally Linear Embedding, Isometric Embedding and Semidefinite Embedding, have been proposed. Though these approaches can learn the global structure of non-linear manifolds, they are computationally expensive, potentially limiting their use in large-scale applications involving very high-dimensional data. In this paper we propose a method framework that combines random projections with non-linear dimensionality reduction methods to increase computational speed and reduce memory usage while preserving the non-linear data geometry. Illustrations with various combinations of random projections and non-linear dimensionality reduction methods, tested on a hand-written digits dataset, are also given to demonstrate that this method framework is both computationally efficient and globally optimal.
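    The random-projection half of the framework can be sketched as follows (an illustration under the usual Johnson-Lindenstrauss setting, not the paper's implementation): multiplying by a Gaussian random matrix maps d-dimensional points to k dimensions while approximately preserving norms and pairwise distances, so an expensive non-linear method can then run on far fewer coordinates.

    ```python
    import math
    import random

    def random_projection(x, k, seed=0):
        """Project a d-dimensional point to k dimensions with a Gaussian
        random matrix scaled by 1/sqrt(k); pairwise distances are then
        approximately preserved (Johnson-Lindenstrauss lemma)."""
        rng = random.Random(seed)
        return [sum(rng.gauss(0, 1) * xi for xi in x) / math.sqrt(k)
                for _ in range(k)]

    random.seed(1)
    x = [random.gauss(0, 1) for _ in range(1000)]   # d = 1000
    y = random_projection(x, k=200)                  # k = 200
    # The norm of the projected point stays close to the original norm.
    ratio = (math.sqrt(sum(v * v for v in y))
             / math.sqrt(sum(v * v for v in x)))
    ```

    In the combined pipeline, the cheap projection step shrinks the input before the non-linear method (e.g. Isomap or LLE) does its expensive neighborhood computations in the reduced space.
    
    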
  • Analysis of Rising Tuition Rates in the United States based on Clustering Analysis and Regression Models
    Long Cheng and Chenyu You, Rensselaer Polytechnic Institute, Troy, NY, USA
    Since higher education is one of the major driving forces of national development and social prosperity, and tuition plays a significant role in determining whether or not a person can afford higher education, rising tuition is a topic of great concern today. It is therefore essential to understand what factors affect tuition and how they increase or decrease it. Many existing studies on rising tuition either lack large amounts of real data and proper quantitative models to support their conclusions, or focus on only a few factors that might affect tuition, and thus fail to provide a comprehensive analysis. In this paper, we explore a wide variety of factors that might affect the tuition growth rate, using large amounts of authentic data and different quantitative methods such as clustering analysis and regression models.
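    A regression model of the kind mentioned above reduces, in its simplest form, to fitting a line to tuition figures over time. The sketch below uses closed-form ordinary least squares on hypothetical numbers (the data and magnitudes are illustrative, not the paper's):

    ```python
    def fit_line(xs, ys):
        """Ordinary least squares for y = a*x + b (closed form)."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        return a, my - a * mx

    # Hypothetical yearly tuition figures with roughly linear growth.
    years = [0, 1, 2, 3, 4]
    tuition = [10000, 10500, 11000, 11500, 12000]
    slope, intercept = fit_line(years, tuition)
    ```

    A clustering step (e.g. grouping institutions with similar cost structures) would precede such a fit, so that a separate growth-rate model can be estimated per cluster.
    
    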
  • Performance Evaluation of Trajectory Queries on Multiprocessor and Cluster
    Christine Niyizamwiyitira and Lars Lundberg, Blekinge Institute of Technology, Sweden
    In this study, we evaluate the performance of trajectory queries handled by Cassandra, MongoDB, and PostgreSQL. The evaluation is conducted on a multiprocessor and a cluster. Telecommunication companies collect a lot of data from their mobile users, and this data must be analysed in order to support business decisions such as infrastructure planning. The optimal choice of hardware platform and database can differ from one query to another. We use data collected from Telenor Sverige, a telecommunication company that operates in Sweden. These data were collected every five minutes for an entire week in a medium-sized city. The execution time results show that Cassandra performs much better than MongoDB and PostgreSQL for queries without spatial features. Stratio's Cassandra Lucene Index incorporates a geospatial index into Cassandra, making Cassandra perform similarly to MongoDB on spatial queries. Of the four use cases, namely distance query, k-nearest neighbor query, range query, and region query, Cassandra performs much better than MongoDB and PostgreSQL for two, range query and region query. Scalability is also good for these two use cases.
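    The k-nearest-neighbor use case mentioned above can be sketched in a few lines (an illustration of the query semantics, not the paper's benchmark code): given user positions, return the k records closest to a query location by great-circle distance. The coordinates below are approximate and purely illustrative.

    ```python
    import math

    def haversine_km(p, q):
        """Great-circle distance in km between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(a))

    def knn(points, query, k):
        """Brute-force k-nearest-neighbor query over (id, lat, lon) records."""
        return sorted(points,
                      key=lambda r: haversine_km((r[1], r[2]), query))[:k]

    # Toy user positions in Sweden (coordinates approximate).
    points = [("u1", 56.16, 15.59), ("u2", 56.20, 15.65),
              ("u3", 57.70, 11.97), ("u4", 59.33, 18.07)]
    nearest = knn(points, (56.17, 15.60), k=2)
    ```

    A geospatial index (such as the Lucene index used with Cassandra in the study) exists precisely to avoid this brute-force scan over every stored position.
    
    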