Use python to count a CSV column's values and find the most frequent ones¶

A column of a CSV file contains many possible values and I want to find the top 5 most frequent ones. A few python codes can help me get it done.

CSV file example¶

Suppose the CSV file is called "test_data.csv", and its content is:

"date","url","method","status_code","bytes"
"Jun 18, 2022","/posts/find_biggest_files",GET,200,"2.7KB"
"Jun 18, 2022","/posts/gunicorn_concurrency",GET,200,3KB
"Jun 18, 2022","/posts/gunicorn_concurrency",GET,301,169B
"Jun 18, 2022","/posts/auto_sqlalchemy_models",GET,200,"2.2KB"
"Jun 18, 2022","/posts/python_memcached",GET,200,"2.4KB"
"Jun 18, 2022","/posts/python_memcached",GET,200,"2.4KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/python_sys_path",GET,200,"2.3KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/gunicorn_concurrency",GET,200,3KB
"Jun 17, 2022","/posts/gunicorn_concurrency",GET,301,169B
"Jun 17, 2022","/posts/vscode_flake8_line_width",GET,200,"2.3KB"
"Jun 17, 2022","/posts/vscode_flake8_line_width",GET,301,169B
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/aws_content_type",GET,200,"2.2KB"
"Jun 17, 2022","/posts/curl_upload_flask",GET,200,"2.7KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"
"Jun 17, 2022","/posts/es_less_mem",GET,200,"2.2KB"
"Jun 17, 2022","/posts/dynamic_table",GET,200,"2.9KB"

In fact, it's a part of easydevguide.com's log file. I want to find the top 5 most frequent urls in this file. url is the second column of this CSV file.

Python code¶

import csv
from collections import defaultdict

reader = csv.reader(open('test_data.csv'))
counter = defaultdict(int)

for i, row in enumerate(reader):
    # ignore the first row (column headers)
    if i == 0:
        continue

    # suppose we want to handle the second column, row[1] is the second column's value.
    col_val = row[1]
    # everytime we meet a column value, we increase the counter of it.
    counter[col_val] += 1

# convert the counter from dict to list, so that we can sort it later.
flat_counter = [(col_val, cnt) for col_val, cnt in counter.items()]

# key argument tells "sorted" to sort on "cnt" of flat_counter's tuples
rs = sorted(flat_counter, key=lambda x: x[1], reverse=True)

# print the top 5
for col_val, cnt in rs[:5]:
    print(col_val, cnt)

Output:

/posts/dynamic_table 5
/posts/gunicorn_concurrency 4
/posts/python_memcached 2
/posts/vscode_flake8_line_width 2
/posts/find_biggest_files 1

Use python to count a CSV column's values and find the most frequent ones¶

CSV file example¶

Python code¶

Navigator

Related Articles

Related Tags

Share this article