Intellectual Property Office

Non-Confidential Disclosures

"Single-pass Low-storage Arbitrary Quantile Estimation for Massive Datasets”

PSU Inv. Disc. No 2001-2444

Inventors:

J.C. Liechty, D.K. Lin, J.P. McDermott

Patent status:

Issued Patent No. 7,076,487

Background:

Quantile estimation for an unknown distribution is a commonly studied problem. While it may be desirable, using the sample quantile as an estimate of the population quantile becomes cumbersome and in many cases impractical to obtain when the size of the data set becomes large. Among the single pass, low storage estimation techniques in the Statistical Literature, the only estimators which are easily extended to accommodate tail quantile estimation are the sample quantile estimator, the S.A. estimator of Tierney, and the estimator covered by this invention. The sample quantile estimator is costly to employ when the size of the dataset becomes large, making it unattractive or infeasible for large datasets. The S.A. method's accuracy is dependent upon the size of the initial sample taken for the starting value. The larger the initial sample, the better the starting value and hence the more accurate the final estimate will be.

Invention description:

The proposed invention tracks a small fraction of the original data set and uses a scoring rule and an assigned weight for each data point to determine which data point to rack and which data point to ignore. The subject invention has at least three advantages when compared with the S.A. method. First, the subject invention will return an actual observation as an estimate, whereas there is nothing to prevent the S.A. method from returning an estimate that is outside the range of possible values for the distribution generating the data. Second, the invention performs better than the S.A. method, with regards to approximating the sample quantile, and it appears that the best way to overcome that difference in performance is to increase the size of the initial sample. A third advantage is that the proposed method can be easily extended to estimate a collection of quantiles simultaneously.

Advantages:

  • Query optimization for large databases and network routing problems
  • Network routing decisions improved

Contact:

Mr. Matthew Smith
Sr. Technology Licensing Officer
Intellectual Property Office
113 Technology Center
The Pennsylvania State Univ.
University Park, PA 16802-7000
Phone: (814) 863-1122
Fax: (814) 865-3591
E-mail: mds126@psu.edu