Audience expansion in the era of privacy regulations: Addressing shortened seed lists

Amit Kumar Gupta; Gaurav Garg

Authors

Amit Kumar Gupta
efpm11010@iiml.ac.in
Indian Institute of Management Lucknow, Lucknow - 226013, India
Gaurav Garg
Indian Institute of Management Lucknow, Lucknow - 226013, India

Keywords:

Audience Expansion, Look Alike models, Imbalanced dataset, Oversampling

Abstract

Audience expansion enables businesses to acquire new customers by digitally targeting individuals who resemble their existing customer base, making it a critical lever for business growth. Audience expansion models rely heavily on the diversity and quality of the available audience data. However, emerging privacy regulations worldwide are limiting both the volume and variety of data that can be collected, which degrades these models. Specifically, such restrictions reduce the size of the seed audience and weaken the signal in the feature space. A smaller seed list exacerbates class imbalance, which in turn degrades model performance. Synthetic oversampling techniques are commonly used to address class imbalance, but most overlook the challenges posed by high-dimensional binary covariate spaces. Existing methods that handle binary data often treat all features equally and do not selectively choose base samples for generating synthetic data, which introduces noise and borderline examples. We propose a novel oversampling algorithm, SMOTE-MSFB (SMOTE - Minority Focused Select Features for Binary data), that enhances synthetic sample quality by (a) prioritizing minority samples near the decision boundary; (b) defining neighborhoods using a mutual information-weighted Jaccard distance to manage high dimensionality; and (c) improving signal strength through union-based voting across minority neighbors to counteract data sparsity. Experiments on two publicly available audience expansion datasets show that SMOTE-MSFB outperforms existing resampling techniques for discrete features, and that the improvement is statistically significant. SMOTE-MSFB is also at least roughly 70% more computationally efficient than the standard algorithm on both datasets.
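
Components (b) and (c) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration and not the authors' implementation: the weighted Jaccard distance is taken as one minus the ratio of weighted intersection to weighted union, the weights are the mutual information between each binary feature and the class label, and "union-based voting" is read as setting a feature in the synthetic sample when at least min_votes of the base sample and its minority neighbors have it. The function names and the min_votes parameter are hypothetical. Component (a), boundary-focused selection of base samples, is omitted.

```python
import numpy as np

def mutual_information(X, y):
    """Mutual information (nats) between each binary column of X and binary y."""
    X = np.asarray(X, dtype=bool)
    y = np.asarray(y, dtype=bool)
    mi = np.zeros(X.shape[1])
    for xv in (False, True):
        for yv in (False, True):
            p_xy = ((X == xv) & (y == yv)[:, None]).mean(axis=0)
            p_x = (X == xv).mean(axis=0)
            p_y = (y == yv).mean()
            nz = p_xy > 0  # zero-probability cells contribute nothing
            mi[nz] += p_xy[nz] * np.log(p_xy[nz] / (p_x[nz] * p_y))
    return mi

def weighted_jaccard_distance(a, b, w):
    """Assumed form: 1 - (weighted intersection / weighted union)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.sum(w * (a | b))
    if union == 0:
        return 0.0  # no informative feature is set in either vector
    return 1.0 - np.sum(w * (a & b)) / union

def nearest_minority_neighbors(i, X_min, w, k):
    """Indices of the k minority samples closest to X_min[i]."""
    d = np.array([weighted_jaccard_distance(X_min[i], x, w) for x in X_min])
    d[i] = np.inf  # exclude the base sample itself
    return np.argsort(d)[:k]

def union_synthetic(base, neighbors, min_votes=1):
    """Set a feature when at least min_votes rows have it (1 = pure union)."""
    stack = np.vstack([base, neighbors]).astype(bool)
    return (stack.sum(axis=0) >= min_votes).astype(int)

# Toy data: features 0 and 3 determine the label, features 1 and 2 are noise.
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])
y = np.array([1, 1, 0, 0])

w = mutual_information(X, y)
X_min = X[y == 1]
idx = nearest_minority_neighbors(0, X_min, w, k=1)
synthetic = union_synthetic(X_min[0], X_min[idx])
print(w, idx, synthetic)
```

In the toy data the two minority rows differ only on the uninformative features, so their weighted distance is zero, whereas an unweighted Jaccard distance would treat them as fairly dissimilar. This is the sense in which weighting by mutual information can help in a high-dimensional, sparse binary space.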

